Sarvam is planning to release some open-weight models in June, and I want to use those to create high-quality synthetic datasets for my mother tongue.
For native languages like Malayalam, good open-source datasets are scarce. When I started training models, the available Malayalam ASR and TTS data amounted to only about 50-100 hours.
We all know that model accuracy improves as the amount of training data increases. You can also steer generation toward specific entities, such as locations, personal names, and organization names like FOSS United.
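As a minimal sketch of the entity-steering idea: one simple approach is to build generation prompts that explicitly ask an open-weight model to include a chosen entity. Everything below (the entity lists, the prompt wording, the `build_prompt` helper) is illustrative, not a fixed recipe.

```python
import random

# Hypothetical entity lists you want represented in the synthetic data.
ENTITIES = {
    "location": ["Kochi", "Thiruvananthapuram", "Kozhikode"],
    "organization": ["FOSS United"],
    "person": ["Anjali", "Ravi"],
}

def build_prompt(entity_type: str, entity: str) -> str:
    # Instruction text is a sketch; tune it for the model you use.
    return (
        f"Write one natural Malayalam sentence that mentions the "
        f"{entity_type} '{entity}'."
    )

# Sample a random entity and build a prompt for it.
random.seed(42)
etype = random.choice(list(ENTITIES))
entity = random.choice(ENTITIES[etype])
print(build_prompt(etype, entity))
```

The prompts produced this way would then be sent to the open-weight model; the entity lists make it easy to control coverage of names that rarely appear in scraped data.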
Learn how to create synthetic data with open-weight models
Why native language datasets matter for model performance and cultural relevance
Challenges in low-resource languages and how to overcome them with synthetic data from open-weight models
+1. Short lightning talk on synthetic datasets looks like a good fit.