Prerequisites:
This session is suited for developers, researchers, and AI enthusiasts with:
- Basic understanding of NLP concepts such as tokenization and language models
- Experience with machine learning frameworks like PyTorch or TensorFlow
- Familiarity with Python programming
- Interest in multilingual NLP and open-source language model development
- This is an intermediate-level workshop, presented at a beginner-friendly pace. Please bring a laptop with a working Python environment and an interest in multilingual AI development.
Step into a world where multilingual AI models can cross language boundaries and deliver high-quality results in low-resource languages. In this talk, I'll show how transtokenization – a novel technique for mapping tokens between languages – can help adapt large language models to handle multilingual data effectively. Using open-source tools like Unsloth and open models like Mistral, we'll explore how transtokenization enables fine-tuning with parallel datasets such as the English-Hindi Bible corpus.
This talk builds on a research paper in which the authors implement transtokenization from English to Dutch. I have applied the technique successfully to English-to-Hindi, with additional findings, especially because tokenization for non-Latin scripts required further work.
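At its core, transtokenization initializes embeddings for the new language's tokens from the original model's embedding space, using token-alignment statistics mined from a parallel corpus. The following is a minimal sketch of that idea; the embeddings, tokens, and alignment counts are toy values for illustration, not taken from the actual model or paper:

```python
from collections import defaultdict

# Toy source-token embeddings (in practice: rows of the model's
# embedding matrix). All values here are illustrative.
src_embeddings = {
    "water": [0.9, 0.1, 0.0],
    "house": [0.0, 0.8, 0.2],
}

# Token-alignment counts mined from a parallel corpus (e.g. English-Hindi):
# how often each target token aligns with each source token.
alignment_counts = {
    "पानी": {"water": 9, "house": 1},  # "paani" = water
    "घर": {"house": 7},                # "ghar" = house
}

def init_target_embedding(counts, src_emb):
    """Initialize a target-token embedding as the count-weighted
    average of the embeddings of its aligned source tokens."""
    total = sum(counts.values())
    dim = len(next(iter(src_emb.values())))
    vec = [0.0] * dim
    for tok, c in counts.items():
        weight = c / total
        for i, x in enumerate(src_emb[tok]):
            vec[i] += weight * x
    return vec

tgt_embeddings = {tok: init_target_embedding(counts, src_embeddings)
                  for tok, counts in alignment_counts.items()}
```

A target token aligned to a single source token simply inherits that token's embedding, while ambiguous tokens get a blend; the real implementation does this over the full vocabulary and embedding matrix before fine-tuning.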
Talk Outline:
1. Introduction
- Quick intro about myself and what this talk is all about.
- Why current LLMs struggle with low-resource languages and how transtokenization can help.
- A look at how Unsloth and the Mistral models fit into multilingual AI.
2. Transtokenization Setup
- I’ll walk through the setup process, showing the tools and steps needed (like Unsloth) without getting into the weeds of coding.
- I’ll explain how we use a parallel corpus (like English-Hindi) for model fine-tuning, focusing on the big picture.
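The parallel-corpus step above can be sketched in a few lines; the tab-separated format and the sample sentences below are illustrative assumptions, not the actual English-Hindi Bible corpus files:

```python
# Minimal sketch of loading a parallel corpus for fine-tuning.
# Assumes one "english<TAB>hindi" pair per line; the sample data
# is illustrative, not the real corpus.
import io

raw = (
    "In the beginning God created the heavens and the earth.\t"
    "आदि में परमेश्वर ने आकाश और पृथ्वी की सृष्टि की।\n"
    "And God said, Let there be light.\t"
    "और परमेश्वर ने कहा, उजियाला हो।\n"
)

def load_parallel_pairs(handle):
    """Yield (english, hindi) sentence pairs, skipping malformed lines."""
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and all(parts):
            yield tuple(parts)

pairs = list(load_parallel_pairs(io.StringIO(raw)))
# Each pair then becomes one training example for the fine-tuning loop.
```

In the talk this stays at the big-picture level: the key point is that every training example pairs a source sentence with its translation, which is what both the alignment step and the fine-tuning step consume.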
3. Demonstrating Transtokenization
- I’ll show you how the transtokenization process works with a real model.
- It’s not about coding details but more about understanding the process and what happens at each step.
4. Wrap-Up and Q&A
- Recap the key points: how transtokenization helps multilingual NLP.
- Open for any questions you have, and I’ll share useful resources if you want to dig deeper.
Additional Resources:
- Link to code repo - https://github.com/JaynouOliver/Mistral-7B-v0.3-transtokenized-Hindi
- Link to Hugging Face models - https://huggingface.co/subhrokomol/Mistral-7B-Instruct-v0.3-transtokenized
- Link to research paper repository - https://github.com/LAGoM-NLP/transtokenizer
- Link to Blog by author - https://pieter.ai/trans-tokenization/
- Link to Unsloth Colab notebook - https://colab.research.google.com/drive/1_yNCks4BTD5zOnjozppphh5GzMFaMKq_?usp=sharing