Talk
Intermediate

How I ended up maintaining a Python package with over 420,000 downloads

Rejected

Session Description

This talk is a sequel to Kurian's previous PyCon India 2023 talk, [OpenAI Whisper and it's amazing power to do fine-tuning in my mother tongue](https://in.pycon.org/cfp/pycon-india-2023/proposals/openai-whisper-and-its-amazing-power-to-do-fine-tuning-demonstrated-on-my-mother-tongue~eENWa/). Core parts of that talk were how to fine-tune and my Python library for benchmarking ASR for the Malayalam language, [malayalam-asr-benchmarking](https://github.com/kurianbenoy/malayalam_asr_benchmarking). While building that library, Kurian ended up building another Python library named [whisper_normalizer](https://github.com/kurianbenoy/whisper_normalizer).


I built [whisper_normalizer](https://github.com/kurianbenoy/whisper_normalizer) partly because I find the text normalization approach used by Whisper super useful. The details can be found in Appendix C (p. 21) of the paper Robust Speech Recognition via Large-Scale Weak Supervision by the OpenAI team. I had also written another internal library at my previous workplace, and I was using the same Whisper normalization algorithm again and again across a lot of projects.


Working on Malayalam speech-to-text benchmarking was the final trigger for me to stop this nonsense and build a Python package with the [nbdev framework](https://nbdev.fast.ai/). TBH, in my previous talk I didn't even have one slide about this Python package, as I felt I was just solving one trivial problem for myself. Turns out I was solving this problem not just for me, but for a lot more people. Looking back, it now looks like a good problem to have.


Fast forward to December 2023: I noticed that this GitHub project suddenly had 30+ stars. It was surprising to me, and before I could think too much about it I realized that in some months I was getting 50K+ downloads. The downloads kept increasing, and they are still increasing all the time. At the time of writing this proposal, the number of downloads for my package is as shown in the [tweet](https://x.com/kurianbenoy2/status/1794358397279809550). It's increasing very fast and we plan to hit 500K+ downloads by the time of presenting this talk.


Maybe the moral of the story is that Kurian doing some niche work in Malayalam, which literally no one cared about, ended up with me maintaining this nice Python package with a lot of downloads.


We have made incremental modifications to the project for text normalization in Indian languages, after we realized that Whisper's `BasicTextNormalizer` is a bad idea for Indian languages. Why this is such an important problem is documented by [Dr Kavya in her blog post on the issues with BasicTextNormalizer](https://kavyamanohar.com/post/indic-normalizer/). My colleague Abhigyan worked extensively on Indian text normalization during his Master's course, and we will be discussing that as well in this talk.
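
To give a feel for the problem, here is a minimal sketch using only the Python standard library. It is not the code of `whisper_normalizer` or of Whisper itself: `strip_marks_and_symbols` is a hypothetical helper, written on the assumption that the core issue is a basic normalizer treating Unicode combining marks the same way as punctuation. For English that is harmless, but Malayalam vowel signs are combining marks, so the word gets broken apart.

```python
import unicodedata


def strip_marks_and_symbols(text: str) -> str:
    """Simplified stand-in for a basic normalizer (hypothetical, for illustration).

    Lowercases the text, replaces every character whose Unicode category
    starts with M (mark), S (symbol) or P (punctuation) with a space,
    then collapses runs of whitespace.
    """
    text = unicodedata.normalize("NFKC", text.lower())
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch for ch in text
    )
    return " ".join(cleaned.split())


english = strip_marks_and_symbols("Hello, World!")
malayalam = strip_marks_and_symbols("മലയാളം")

print(english)    # punctuation removed, words intact
print(malayalam)  # vowel sign and anusvara are gone, the word is split
```

An Indic-aware normalizer would need to keep these marks, because in Malayalam they are part of the spelling and not decoration.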

Key Takeaways

None

References

Session Categories

FOSS

Speakers

Kurian Benoy
ML Engineer Sarvam.ai

Reviews

Approvability: 100 %
Approvals: 3
Rejections: 0
Not Sure: 0
Reviewer #1
Approved
Hell to the yes! Solving a real problem with real tangible impact.
Reviewer #2
Approved
A good guide for people wondering what their FOSS journeys might look like
Reviewer #3
Approved