Talk
Intermediate

Breaking Language Barriers: Transtokenization for Multilingual NLP in Open-Source Systems

Approved

Prerequisites:

This session is suited for developers, researchers, and AI enthusiasts with:

  • Basic understanding of NLP concepts such as tokenization and language models
  • Experience with machine learning frameworks like PyTorch or TensorFlow
  • Familiarity with Python programming
  • Interest in multilingual NLP and open-source language model development

  • Attendees should bring a laptop with a working Python environment. This is an intermediate-level session, but it will be presented at a beginner-friendly pace.


Step into a world where multilingual AI models can go beyond language boundaries and deliver high-quality results in low-resource languages. In this talk, I’ll show you how transtokenization – a novel technique for mapping tokens between languages – can revolutionize NLP, allowing large language models to be adapted to multilingual data effectively. By leveraging open-source tools like Unsloth and open models like Mistral, we’ll explore the power of transtokenization to fine-tune models on parallel datasets such as an English–Hindi Bible corpus.
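At its core, transtokenization initializes the embedding of each target-language token from the embeddings of the source-language tokens it aligns with in a parallel corpus. Here is a minimal sketch of that idea using toy two-dimensional embeddings and hypothetical alignment counts – an illustration of the principle, not the paper's actual implementation:

```python
# Toy source-language embeddings (2-dimensional for illustration).
src_embeddings = {
    "house": [1.0, 0.0],
    "water": [0.0, 1.0],
}

# Hypothetical alignment counts: target token -> {source token: count},
# as would be extracted from word alignments over a parallel corpus.
alignments = {
    "ghar":  {"house": 9, "water": 1},   # Hindi "ghar" mostly aligns to "house"
    "paani": {"water": 10},              # Hindi "paani" aligns to "water"
}

def init_target_embedding(target_token):
    """Initialize a target token's embedding as the weighted average
    of the source embeddings it aligns with."""
    counts = alignments[target_token]
    total = sum(counts.values())
    dim = len(next(iter(src_embeddings.values())))
    vec = [0.0] * dim
    for src_tok, count in counts.items():
        weight = count / total
        for i, x in enumerate(src_embeddings[src_tok]):
            vec[i] += weight * x
    return vec

print(init_target_embedding("ghar"))   # dominated by "house"
print(init_target_embedding("paani"))  # identical to "water"
```

The fine-tuning step then starts from these informed initializations rather than random vectors, which is what makes adaptation to a new language so much cheaper.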


This talk builds on a research paper in which the authors implement transtokenization from English to Dutch. I have applied the technique successfully from English to Hindi, with additional findings – especially because tokenization for non-Latin scripts required further work.

Talk Outline:

1. Introduction

  • Quick intro about myself and what this talk is all about.
  • Why current LLMs struggle with low-resource languages and how transtokenization can help.
  • A look at how Unsloth and Mistral Models fit into multilingual AI.

2. Transtokenization Setup

  • I’ll walk through the setup process, showing the tools and steps needed (like Unsloth) without getting into the weeds of coding.
  • I’ll explain how we use a parallel corpus (like English-Hindi) for model fine-tuning, focusing on the big picture.
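To make the data-preparation step concrete: parallel corpora are often distributed as tab-separated sentence pairs, one pair per line. A minimal loader sketch (the file format here is an assumption for illustration, not necessarily the exact format of the corpus used in the talk):

```python
def load_parallel_corpus(lines):
    """Return (english, hindi) sentence pairs, skipping malformed lines."""
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and all(parts):
            pairs.append((parts[0], parts[1]))
    return pairs

sample = [
    "In the beginning God created the heavens and the earth.\tआदि में परमेश्वर ने आकाश और पृथ्वी की सृष्टि की।",
    "malformed line without a tab",
]
pairs = load_parallel_corpus(sample)
print(len(pairs), pairs[0][0][:16])
```

The resulting pairs feed both the word-alignment step (for initializing embeddings) and the subsequent fine-tuning run.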

3. Demonstrating Transtokenization

  • I’ll show you how the transtokenization process works with a real model.
  • It’s not about coding details but more about understanding the process and what happens at each step.
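One thing the demonstration makes visible is why non-Latin scripts need extra care: when a tokenizer falls back to raw UTF-8 bytes for text it hasn't seen, Devanagari fragments far more than Latin script, because each Devanagari character is three bytes. A toy illustration of that worst case (an assumption used here to motivate the problem, not the talk's actual demo):

```python
def byte_fallback_token_count(text):
    """Worst-case token count if every character falls back to raw UTF-8 bytes."""
    return len(text.encode("utf-8"))

english = "water"
hindi = "पानी"   # "water" in Hindi, 4 Devanagari characters

print(len(english), byte_fallback_token_count(english))  # 5 chars -> 5 byte tokens
print(len(hindi), byte_fallback_token_count(hindi))      # 4 chars -> 12 byte tokens
```

Transtokenization sidesteps this by giving Hindi its own well-initialized vocabulary instead of forcing it through byte fallback.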

4. Wrap-Up and Q&A

  • Recap the key points: how transtokenization helps multilingual NLP.
  • Open for any questions you have, and I’ll share useful resources if you want to dig deeper.

Additional Resources:

  • Link to code repo - https://github.com/JaynouOliver/Mistral-7B-v0.3-transtokenized-Hindi
  • Link to Hugging Face models - https://huggingface.co/subhrokomol/Mistral-7B-Instruct-v0.3-transtokenized
  • Link to Research Paper - https://github.com/LAGoM-NLP/transtokenizer
  • Link to Blog by author - https://pieter.ai/trans-tokenization/
  • Link to Unsloth Colab notebook - https://colab.research.google.com/drive/1_yNCks4BTD5zOnjozppphh5GzMFaMKq_?usp=sharing


FOSS

Suvrakamal Das
Machine Learning Engineer, XRIGlobal AI

Approvability: 100% (1 Approval, 0 Rejections, 1 Not Sure)
Reviewer #1 – Approved
"This sounds pretty interesting, and isn't the typical LLM slop we get."

Reviewer #2 – Not Sure
"I don't think I can provide a detailed review as I do not have any in-depth knowledge about LLMs and NLP, but the proposal is written well enough. I'd leave the decision to other reviewers and the organizers."