From historical archives to customer databases to health records, nearly every domain struggles with messy, inconsistent data. Traditional fuzzy matching tools often fail on real-world noisy data, particularly when dealing with semantic variations, abbreviations, or creative misspellings. I will present our new package alethia (from Greek aletheia, meaning "truth"), an open-source Python and R package that leverages LLM embeddings to intelligently decipher and correct such errors.
In my talk, I will: