Sarvam is planning to release some open-weight models in June, and I want to use those to create high-quality synthetic datasets for my mother tongue.
For native languages like Malayalam, good open-source datasets are scarce. When I started training models, the available Malayalam ASR and TTS data amounted to only about 50-100 hours.
We all know that model accuracy improves as the amount of training data increases. You can also steer generation toward specific entities, such as locations, personal names, and organization names like FOSS United.
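As a minimal sketch of the entity-steering idea: one simple approach is to build generation prompts that explicitly ask an open-weight model to include a chosen entity. Everything below (the entity lists, the prompt wording, the `build_prompt` helper) is illustrative, not a fixed recipe.

```python
import random

# Hypothetical entity lists you want represented in the synthetic data.
ENTITIES = {
    "location": ["Kochi", "Thiruvananthapuram", "Kozhikode"],
    "organization": ["FOSS United"],
    "person": ["Anjali", "Ravi"],
}

def build_prompt(entity_type: str, entity: str) -> str:
    # Instruction text is a sketch; tune it for the model you use.
    return (
        f"Write one natural Malayalam sentence that mentions the "
        f"{entity_type} '{entity}'."
    )

# Sample a random entity and build a prompt for it.
random.seed(42)
etype = random.choice(list(ENTITIES))
entity = random.choice(ENTITIES[etype])
print(build_prompt(etype, entity))
```

The prompts produced this way would then be sent to the open-weight model; the entity lists make it easy to control coverage of names that rarely appear in scraped data.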
Learn how to create synthetic data with open-weight models
Why native language datasets matter for model performance and cultural relevance
Challenges in low-resource languages and how to overcome them with synthetic data from open-weight models
+1. Short lightning talk on synthetic datasets looks like a good fit.