Talk Intermediate

Breaking Language Barriers: Transtokenization for Multilingual NLP in Open-Source Systems

Approved
Session Description

Prerequisites:

This session is suited for developers, researchers, and AI enthusiasts with:

  • Basic understanding of NLP concepts such as tokenization and language models
  • Experience with machine learning frameworks like PyTorch or TensorFlow
  • Familiarity with Python programming
  • Interest in multilingual NLP and open-source language model development

This is an intermediate-level talk, but it will be presented at a beginner-friendly pace. Please come with a laptop, access to Python tools, and an interest in multilingual AI development.


Step into a world where multilingual AI models cross language boundaries and deliver high-quality results for low-resource languages. In this talk, I'll show how transtokenization – a technique for transferring a trained model's vocabulary and token embeddings from one language to another – can help adapt large language models to new languages effectively. Using open-source tooling such as Unsloth and the Mistral models, we'll explore how transtokenization works together with fine-tuning on parallel datasets such as the English-Hindi Bible corpus.
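At its core, transtokenization initializes embeddings for target-language tokens as weighted combinations of the embeddings of aligned source-language tokens, so the adapted model starts from meaningful vectors instead of random ones. Here is a minimal, self-contained sketch of that idea; the vocabulary, vectors, and alignment weights below are made-up illustrations, not the paper's actual data:

```python
# Toy source-language (English) embeddings: token -> vector.
src_emb = {
    "house": [0.1, 0.2, 0.3],
    "water": [0.4, 0.5, 0.6],
    "sun":   [1.0, 1.1, 1.2],
}

# Hypothetical alignment weights: each target (Hindi) token maps to
# source tokens with normalized weights derived from a parallel corpus.
alignment = {
    "घर":   {"house": 1.0},               # ghar -> house
    "पानी": {"water": 0.9, "sun": 0.1},   # paani -> mostly water
}

def init_target_embeddings(alignment, src_emb):
    """Initialize each target token's embedding as the weighted
    average of its aligned source tokens' embeddings."""
    dim = len(next(iter(src_emb.values())))
    tgt_emb = {}
    for tok, weights in alignment.items():
        vec = [0.0] * dim
        for src_tok, w in weights.items():
            for i, x in enumerate(src_emb[src_tok]):
                vec[i] += w * x
        tgt_emb[tok] = vec
    return tgt_emb

tgt_emb = init_target_embeddings(alignment, src_emb)
# "घर" inherits "house"'s embedding exactly; "पानी" is a 90/10 mix.
```

The real method operates on a full embedding matrix and corpus-derived alignments, but the weighted-average initialization is the part that lets fine-tuning start from a sensible place.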


This talk builds on a research paper in which the authors implemented transtokenization from English to Dutch. I have applied the approach successfully to English-Hindi, with additional findings, especially because tokenization for non-Latin scripts required further work.
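One concrete reason non-Latin scripts need extra care: when a vocabulary falls back to byte-level pieces, Devanagari text fragments into roughly three times as many units as Latin text, because each Devanagari character occupies 3 bytes in UTF-8. A quick illustration using plain UTF-8 byte counts (not any specific model's tokenizer):

```python
english = "hello"
hindi = "नमस्ते"  # "namaste" in Devanagari

# Each ASCII character is 1 byte; each Devanagari character is 3 bytes,
# so a byte-level fallback sees roughly 3x more units for Hindi text.
print(len(english), len(english.encode("utf-8")))  # 5 characters, 5 bytes
print(len(hindi), len(hindi.encode("utf-8")))      # 6 characters, 18 bytes
```

Longer token sequences mean higher cost and worse modeling for the same text, which is part of what a dedicated Hindi vocabulary fixes.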

Talk Outline:

1. Introduction

  • Quick intro about myself and what this talk is all about.
  • Why current LLMs struggle with low-resource languages and how transtokenization can help.
  • A look at how Unsloth and Mistral Models fit into multilingual AI.

2. Transtokenization Setup

  • I’ll walk through the setup process, showing the tools and steps needed (like Unsloth) without getting into the weeds of coding.
  • I’ll explain how we use a parallel corpus (like English-Hindi) for model fine-tuning, focusing on the big picture.
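The setup step above boils down to turning verse-aligned English and Hindi text into clean sentence pairs for fine-tuning. A minimal sketch of that pairing, with illustrative inputs rather than the actual corpus files:

```python
# Sketch: pair verse-aligned English and Hindi lines into training pairs.
# In practice the lines would come from the parallel corpus files;
# the data below is illustrative.

def load_parallel_pairs(en_lines, hi_lines):
    """Zip aligned lines, dropping pairs where either side is empty."""
    pairs = []
    for en, hi in zip(en_lines, hi_lines):
        en, hi = en.strip(), hi.strip()
        if en and hi:
            pairs.append((en, hi))
    return pairs

en = ["In the beginning...", "", "And God said..."]
hi = ["आदि में...", "छोड़ें", "और परमेश्वर ने कहा..."]
pairs = load_parallel_pairs(en, hi)
# The pair with an empty English side is dropped, leaving two usable pairs.
```

Real pipelines add normalization and length filtering on top, but alignment plus filtering is the big picture.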

3. Demonstrating Transtokenization

  • I’ll show you how the transtokenization process works with a real model.
  • It’s not about coding details but more about understanding the process and what happens at each step.

4. Wrap-Up and Q&A

  • Recap the key points: how transtokenization helps multilingual NLP.
  • Open for any questions you have, and I’ll share useful resources if you want to dig deeper.

Additional Resources:

  • Link to code repo - https://github.com/JaynouOliver/Mistral-7B-v0.3-transtokenized-Hindi
  • Link to Hugging Face models - https://huggingface.co/subhrokomol/Mistral-7B-Instruct-v0.3-transtokenized
  • Link to Research Paper - https://github.com/LAGoM-NLP/transtokenizer
  • Link to Blog by author - https://pieter.ai/trans-tokenization/
  • Link to Unsloth Colab notebook - https://colab.research.google.com/drive/1_yNCks4BTD5zOnjozppphh5GzMFaMKq_?usp=sharing



Session Categories

FOSS

Speakers

Suvrakamal Das
Machine Learning Engineer | XRIGlobal AI

Suvrakamal is an ML Engineer at XRI Global, USA, and has published papers at the SciPy conference.


Reviews

This sounds pretty interesting, and isn't the typical LLM slop we get.
Reviewer #1 Approved

I don't think I can provide a detailed review as I do not have any in-depth knowledge about LLMs and NLP, but the proposal is written well enough. I'd leave the decision to other reviewers and the organizers.
Reviewer #2 Not Sure