Talk Intermediate

Breaking Language Barriers: Transtokenization for Multilingual NLP in Open-Source Systems

Approved
Session Description

Prerequisites:

This session is suited for developers, researchers, and AI enthusiasts with:

  • Basic understanding of NLP concepts such as tokenization and language models
  • Experience with machine learning frameworks like PyTorch or TensorFlow
  • Familiarity with Python programming
  • Interest in multilingual NLP and open-source language model development

This is an intermediate-level talk, but it will be presented at a beginner-friendly pace. Please come with a laptop, access to Python tools, and an interest in multilingual AI development.


Step into a world where multilingual AI models cross language boundaries and deliver high-quality results for low-resource languages. In this talk, I'll show how transtokenization – a technique for transferring a trained model's vocabulary and token embeddings from one language to another – can help adapt large language models to new languages effectively. Using open-source tooling such as Unsloth and the Mistral models, we'll explore how transtokenization works together with fine-tuning on parallel datasets such as the English-Hindi Bible corpus.
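At its core, transtokenization initializes embeddings for target-language tokens as weighted combinations of the embeddings of aligned source-language tokens, so the adapted model starts from meaningful vectors instead of random ones. Here is a minimal, self-contained sketch of that idea; the vocabulary, vectors, and alignment weights below are made-up illustrations, not the paper's actual data:

```python
# Toy source-language (English) embeddings: token -> vector.
src_emb = {
    "house": [0.1, 0.2, 0.3],
    "water": [0.4, 0.5, 0.6],
    "sun":   [1.0, 1.1, 1.2],
}

# Hypothetical alignment weights: each target (Hindi) token maps to
# source tokens with normalized weights derived from a parallel corpus.
alignment = {
    "घर":   {"house": 1.0},               # ghar -> house
    "पानी": {"water": 0.9, "sun": 0.1},   # paani -> mostly water
}

def init_target_embeddings(alignment, src_emb):
    """Initialize each target token's embedding as the weighted
    average of its aligned source tokens' embeddings."""
    dim = len(next(iter(src_emb.values())))
    tgt_emb = {}
    for tok, weights in alignment.items():
        vec = [0.0] * dim
        for src_tok, w in weights.items():
            for i, x in enumerate(src_emb[src_tok]):
                vec[i] += w * x
        tgt_emb[tok] = vec
    return tgt_emb

tgt_emb = init_target_embeddings(alignment, src_emb)
# "घर" inherits "house"'s embedding exactly; "पानी" is a 90/10 mix.
```

The real method operates on a full embedding matrix and corpus-derived alignments, but the weighted-average initialization is the part that lets fine-tuning start from a sensible place.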


This talk builds on a research paper in which the authors implemented transtokenization from English to Dutch. I have applied the approach successfully to English-Hindi, with additional findings, especially because tokenization for non-Latin scripts required further work.
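One concrete reason non-Latin scripts need extra care: when a vocabulary falls back to byte-level pieces, Devanagari text fragments into roughly three times as many units as Latin text, because each Devanagari character occupies 3 bytes in UTF-8. A quick illustration using plain UTF-8 byte counts (not any specific model's tokenizer):

```python
english = "hello"
hindi = "नमस्ते"  # "namaste" in Devanagari

# Each ASCII character is 1 byte; each Devanagari character is 3 bytes,
# so a byte-level fallback sees roughly 3x more units for Hindi text.
print(len(english), len(english.encode("utf-8")))  # 5 characters, 5 bytes
print(len(hindi), len(hindi.encode("utf-8")))      # 6 characters, 18 bytes
```

Longer token sequences mean higher cost and worse modeling for the same text, which is part of what a dedicated Hindi vocabulary fixes.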

Talk Outline:

1. Introduction

  • Quick intro about myself and what this talk is all about.
  • Why current LLMs struggle with low-resource languages and how transtokenization can help.
  • A look at how Unsloth and Mistral Models fit into multilingual AI.

2. Transtokenization Setup

  • I’ll walk through the setup process, showing the tools and steps needed (like Unsloth) without getting into the weeds of coding.
  • I’ll explain how we use a parallel corpus (like English-Hindi) for model fine-tuning, focusing on the big picture.
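The setup step above boils down to turning verse-aligned English and Hindi text into clean sentence pairs for fine-tuning. A minimal sketch of that pairing, with illustrative inputs rather than the actual corpus files:

```python
# Sketch: pair verse-aligned English and Hindi lines into training pairs.
# In practice the lines would come from the parallel corpus files;
# the data below is illustrative.

def load_parallel_pairs(en_lines, hi_lines):
    """Zip aligned lines, dropping pairs where either side is empty."""
    pairs = []
    for en, hi in zip(en_lines, hi_lines):
        en, hi = en.strip(), hi.strip()
        if en and hi:
            pairs.append((en, hi))
    return pairs

en = ["In the beginning...", "", "And God said..."]
hi = ["आदि में...", "छोड़ें", "और परमेश्वर ने कहा..."]
pairs = load_parallel_pairs(en, hi)
# The pair with an empty English side is dropped, leaving two usable pairs.
```

Real pipelines add normalization and length filtering on top, but alignment plus filtering is the big picture.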

3. Demonstrating Transtokenization

  • I’ll show you how the transtokenization process works with a real model.
  • It’s not about coding details but more about understanding the process and what happens at each step.

4. Wrap-Up and Q&A

  • Recap the key points: how transtokenization helps multilingual NLP.
  • Open for any questions you have, and I’ll share useful resources if you want to dig deeper.

Additional Resources:

  • Link to code repo - https://github.com/JaynouOliver/Mistral-7B-v0.3-transtokenized-Hindi
  • Link to Hugging Face models - https://huggingface.co/subhrokomol/Mistral-7B-Instruct-v0.3-transtokenized
  • Link to Research Paper - https://github.com/LAGoM-NLP/transtokenizer
  • Link to Blog by author - https://pieter.ai/trans-tokenization/
  • Link to Unsloth Colab notebook - https://colab.research.google.com/drive/1_yNCks4BTD5zOnjozppphh5GzMFaMKq_?usp=sharing



Session Categories

FOSS

Speakers

Suvrakamal Das
Machine Learning Engineer | XRIGlobal AI

Suvrakamal is an ML Engineer at XRI Global, USA, and has published papers at the SciPy conference.


Reviews

This sounds pretty interesting, and isn't the typical LLM slop we get.
Reviewer #1 Approved

I don't think I can provide a detailed review as I do not have any in-depth knowledge about LLMs and NLP, but the proposal is written well enough. I'd leave the decision to other reviewers and the organizers.
Reviewer #2 Not Sure