ArchiveML

A specialized OCR and NLP pipeline designed to digitize and translate archaic Malayalam land and government records. Our solution converts complex old script variations into modern, readable Malayalam and English, preserving historical data while making legal and ancestral records accessible to everyone.

Description

The Problem

Millions of historical documents in Kerala, specifically land deeds (Aadharam), tax registers, and government gazettes, remain locked in "Old Malayalam" script. These documents often use obsolete ligatures, regional dialects, and shorthand that modern OCR tools fail to recognize. This creates a major barrier for citizens trying to verify land ownership or research their heritage, who often have to hire expensive expert translators.

The Solution

Our platform provides an end-to-end pipeline for restoring, transcribing, and translating these documents:

  1. Custom OCR Engine: Unlike generic OCR models, ours is fine-tuned on datasets of the intricate "Old Script" (Pazhaya Lipi) typography and handwritten archival scripts to improve character recognition on these documents.

  2. Contextual Transliteration: Using a Large Language Model (LLM) fine-tuned on legal and administrative Malayalam, the system converts archaic phrasing into contemporary Malayalam.

  3. Dual-Language Output: The system presents the original text side by side with a modern Malayalam transcription and an English translation, so each rendering can be checked against the source and the legal context is preserved.

  4. Document Restoration: Integrated image preprocessing handles faded ink, weathered paper, and low-contrast scans typical of government archives.
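
The four steps above can be pictured as a single flow: clean the scan, recognize the text, modernize it, translate it, and show the three versions together. The following is a minimal sketch of that flow, not the project's actual code. Every name in it (`stretch_contrast`, `binarize`, `PageResult`, `process_page`, `render_comparison`) is hypothetical, the OCR engine and language models are passed in as plain callables because their real interfaces are not described here, and the preprocessing shown (linear contrast stretching followed by a fixed threshold) is only one simple way to handle faded ink.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a scan is a grayscale grid, 0 (black ink) to 255 (white paper).
Image = List[List[int]]


def stretch_contrast(image: Image) -> Image:
    """Linearly rescale pixels so the darkest becomes 0 and the brightest 255."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # uniform image: nothing to stretch
        return [row[:] for row in image]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in image]


def binarize(image: Image, threshold: int = 128) -> Image:
    """Map every pixel to pure black (ink) or pure white (paper)."""
    return [[0 if p < threshold else 255 for p in row] for row in image]


@dataclass
class PageResult:
    """The three versions of a page that are shown together."""
    original: str
    modern_malayalam: str
    english: str


def process_page(
    image: Image,
    ocr: Callable[[Image], str],        # hypothetical: stands in for the custom OCR engine
    modernize: Callable[[str], str],    # hypothetical: stands in for the fine-tuned LLM
    translate: Callable[[str], str],    # hypothetical: Malayalam-to-English step
) -> PageResult:
    """Run restoration, recognition, modernization, and translation in order."""
    cleaned = binarize(stretch_contrast(image))
    original = ocr(cleaned)
    modern = modernize(original)
    return PageResult(original, modern, translate(modern))


def render_comparison(result: PageResult) -> str:
    """Format the three versions as labelled rows for comparison."""
    rows = [
        ("Original", result.original),
        ("Modern Malayalam", result.modern_malayalam),
        ("English", result.english),
    ]
    return "\n".join(f"{label:<18}| {text}" for label, text in rows)
```

Stretching before thresholding matters for faded pages: a scan whose pixels all lie between 150 and 200 would turn entirely white under a fixed threshold of 128, while the stretched version keeps the ink. A production pipeline would more likely use an image library with adaptive thresholding and denoising; the pure-Python version here only illustrates the idea.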

The Impact

By democratizing access to these records, we aim to reduce legal disputes, streamline government digitization efforts, and preserve Kerala’s linguistic heritage for the digital age.
