Lightning Talk
Intermediate

Building great datasets on your mother tongue with Open-Weight models

Review Pending
  • Sarvam is planning to release some open-weight models in June, and I want to use them to create high-quality synthetic datasets for my mother tongue.

  • For native languages like Malayalam, good open-source datasets are scarce. For example, the ASR and TTS data available for Malayalam was only about 50-100 hours when I started training models.

  • Model accuracy generally improves as the amount of training data increases. With synthetic generation you can also steer the output to include special entities such as locations, personal names, and organization names like FOSS United.

  • Learn how to create synthetic data with open-weight models

  • Why native language datasets matter for model performance and cultural relevance

  • Challenges in low-resource languages and how to overcome them with synthetic data from open-weight models
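
As a rough illustration of the entity-seeding idea above, here is a minimal Python sketch. It only builds prompts and collects de-duplicated outputs; the model call is left as a plain text-in, text-out function that you supply, since no specific model API is assumed here. The entity lists, the prompt wording, and the names `ENTITIES`, `build_prompts`, and `generate_dataset` are all hypothetical, chosen for illustration.

```python
# Hypothetical seed entities; replace with real lists for your language.
ENTITIES = {
    "location": ["Kochi", "Thrissur"],
    "person": ["Anu", "Rahul"],
    "organization": ["FOSS United"],
}

# Illustrative prompt wording, not a tested recipe.
PROMPT_TEMPLATE = (
    "Write one natural, conversational sentence in {language} that mentions "
    'the {kind} "{entity}". Reply with the sentence only.'
)


def build_prompts(entities, language="Malayalam"):
    """Return one prompt record per (kind, entity) pair."""
    return [
        {
            "kind": kind,
            "entity": entity,
            "prompt": PROMPT_TEMPLATE.format(
                language=language, kind=kind, entity=entity
            ),
        }
        for kind, names in entities.items()
        for entity in names
    ]


def generate_dataset(prompts, generate):
    """Run `generate` (any text-in, text-out callable) on each prompt.

    Empty outputs and exact duplicates are dropped so the dataset
    does not fill up with repeated sentences.
    """
    seen = set()
    rows = []
    for item in prompts:
        text = generate(item["prompt"]).strip()
        if not text or text in seen:
            continue
        seen.add(text)
        rows.append(
            {"text": text, "kind": item["kind"], "entity": item["entity"]}
        )
    return rows
```

To use it, pass any function that wraps your open-weight model as `generate`, for example `generate_dataset(build_prompts(ENTITIES), my_model_fn)`, where `my_model_fn` is your own wrapper. Generated text should still be reviewed by a native speaker before it goes into a training set.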

Knowledge Commons (Open Hardware, Open Science, Open Data etc.)
Which track are you applying for?
Open Data Devroom

Approvability: 100%
Approvals: 1
Rejections: 0
Not Sure: 0

+1. Short lightning talk on synthetic datasets looks like a good fit.

Reviewer #1
Approved