Alethia: Deciphering the truth in noisy texts using open large language models

Rejected

Session Description

From historical archives to customer databases to health records, nearly every domain struggles with messy, inconsistent data. Traditional fuzzy matching tools often fail on real-world noisy data, particularly when it contains semantic variations, abbreviations, or creative misspellings. I will present our new package, alethia (from the Greek aletheia, meaning "truth"), an open-source Python and R package that uses LLM embeddings to decipher and correct such errors.

In my talk, I will:

Demonstrate how LLM embeddings outperform traditional fuzzy matching by understanding semantic meaning
Walk through alethia's architecture
Show real-world applications through live demos on messy public health datasets
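
The matching idea behind the points above can be sketched as nearest-neighbour search over embeddings: embed each noisy string and each canonical label, then pick the label with the highest cosine similarity. The sketch below is not alethia's API, which this proposal does not describe. All function names here (`toy_embed`, `match_by_embedding`, `match_by_edit_ratio`) are hypothetical, and `toy_embed` is a character n-gram stand-in so the example runs without downloading a model. In practice an open LLM embedding model would be passed in through the `embed` parameter, and only such a model would capture semantic variations. The edit-ratio baseline uses Python's standard `difflib`.

```python
import math
from collections import Counter
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str, n: int = 3) -> Vector:
    """Stand-in for an LLM embedding (assumption): a bag of character n-grams."""
    padded = f" {text.lower().strip()} "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return {gram: float(count) for gram, count in grams.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_by_embedding(
    noisy: List[str],
    canonical: List[str],
    embed: Callable[[str], Vector] = toy_embed,
) -> List[Tuple[str, str, float]]:
    """Map each noisy string to the canonical label with the most similar embedding."""
    canonical_vectors = [(label, embed(label)) for label in canonical]
    results = []
    for query in noisy:
        query_vector = embed(query)
        best_label, best_score = max(
            ((label, cosine(query_vector, vec)) for label, vec in canonical_vectors),
            key=lambda pair: pair[1],
        )
        results.append((query, best_label, best_score))
    return results


def match_by_edit_ratio(noisy: List[str], canonical: List[str]) -> List[Tuple[str, str, float]]:
    """Baseline: traditional fuzzy matching using difflib's similarity ratio."""
    results = []
    for query in noisy:
        best_label, best_score = max(
            ((label, SequenceMatcher(None, query.lower(), label.lower()).ratio())
             for label in canonical),
            key=lambda pair: pair[1],
        )
        results.append((query, best_label, best_score))
    return results


if __name__ == "__main__":
    # Illustrative inputs only; not taken from any dataset in the talk.
    canonical_labels = ["Maharashtra", "Uttar Pradesh", "Tamil Nadu"]
    noisy_entries = ["Maharashtra ", "Utar Pradesh", "tamilnadu"]
    for query, label, score in match_by_embedding(noisy_entries, canonical_labels):
        print(f"{query!r} -> {label!r} (cosine {score:.2f})")
```

Keeping the embedding function as a parameter is one possible design choice: the same matching loop works unchanged whether the vectors come from this toy function or from an actual embedding model, which makes it straightforward to compare the two approaches on the same data.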

Key Takeaways

Learn about semantic similarity techniques beyond traditional fuzzy matching
Learn how to extend alethia for your niche (genealogy? legal docs? scientific papers?)
Learn how to support a single package in both R and Python

References

http://saket-choudhary.me/

Session Categories

FOSS

Speakers

Saket Choudhary

Assistant Professor IIT Bombay

http://x.com/saketkc
