Talk
Beginner
First Talk

Alethia: Deciphering the truth in noisy texts using open large language models

Rejected

Session Description

From historical archives to customer databases to health records, nearly every domain struggles with messy, inconsistent data. Traditional fuzzy matching tools often fail on real-world noisy data, particularly when it contains semantic variations, abbreviations, or creative misspellings. I will present alethia (from the Greek aletheia, meaning "truth"), an open-source package for Python and R that uses LLM embeddings to decipher and correct such errors.

In my talk, I will:

  1. Demonstrate how LLM embeddings outperform traditional fuzzy matching by understanding semantic meaning
  2. Walk through alethia's architecture
  3. Show real-world applications through live demos on messy public health datasets
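
To make the comparison in point 1 concrete, here is a minimal Python sketch of embedding-based matching next to a classic edit-based matcher. It is not alethia's actual API: the function names (`toy_embed`, `best_match`, `fuzzy_best_match`) are hypothetical, and `toy_embed` is a hashed character-trigram stand-in used only so the example runs without downloading a model. A real LLM embedding function, which is what captures semantic similarity, would be passed in through the `embed` parameter instead.

```python
import difflib
import math
import zlib
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def toy_embed(text: str, n: int = 3, dim: int = 256) -> Vector:
    """Stand-in for an LLM embedding (assumption, not a real model):
    hashes character n-grams into a fixed-size count vector."""
    vec = [0.0] * dim
    padded = f" {text.lower().strip()} "
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(
    query: str,
    candidates: Sequence[str],
    embed: Callable[[str], Vector] = toy_embed,
) -> Tuple[str, float]:
    """Return the candidate whose embedding is closest to the query's."""
    query_vec = embed(query)
    scored = [(c, cosine(query_vec, embed(c))) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


def fuzzy_best_match(query: str, candidates: Sequence[str]) -> Tuple[str, float]:
    """Baseline: edit-based similarity from the standard library."""
    scored = [
        (c, difflib.SequenceMatcher(None, query.lower(), c.lower()).ratio())
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    diseases = ["Tuberculosis", "Malaria", "Dengue"]
    print(best_match("Tuberclosis", diseases))
    print(fuzzy_best_match("Tuberclosis", diseases))
```

On a simple misspelling like this one, both approaches agree. The difference would show up on abbreviations or synonyms, where only an embedding from a real language model can be expected to score the right candidate highly.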

Key Takeaways

  1. Learn about semantic similarity techniques beyond traditional fuzzy matching
  2. Learn how to extend alethia for your niche (genealogy? legal docs? scientific papers?)
  3. Learn how to support a package in both R and Python

Session Categories

FOSS

Speakers

Saket Choudhary
Assistant Professor IIT Bombay
http://x.com/saketkc
