This is a desktop application for performing Optical Character Recognition (OCR) on images and PDF files, with a focus on Tamil and English.
Cross-Platform: Built with PyQt6; it can also be compiled into a standalone executable for Linux.
Image and PDF Support: Open various image formats (PNG, JPG, etc.) and multi-page PDF documents.
Efficient PDF Processing: Converts PDFs to images in a separate thread to keep the UI responsive, with handling for large files.
Parallel OCR: Uses multiple CPU cores to process pages in parallel, which speeds up OCR on multi-page documents.
Tesseract Integration: Powered by the Tesseract OCR engine.
Custom Models: Comes bundled with a custom Tamil Tesseract model (tam_cus) and the standard English model.
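As a rough illustration of the parallel OCR idea, the sketch below runs one OCR call per page concurrently and returns the results in page order. It is not the application's actual code. It assumes OCR is done through the pytesseract wrapper, which launches the Tesseract binary as a separate process per call, so a thread pool is enough to keep several cores busy. The names ocr_page and ocr_pages_parallel are hypothetical, and the tam_cus model is assumed to be visible to Tesseract (for example through its tessdata directory).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def ocr_page(image_path, lang="tam_cus+eng"):
    """OCR one page image. Assumes pytesseract and Pillow are installed."""
    # Imported lazily so this module can be loaded without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def ocr_pages_parallel(page_paths, ocr_fn=ocr_page, max_workers=None):
    """Apply ocr_fn to every page concurrently and keep the page order."""
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order even if pages finish out of order.
        return list(pool.map(ocr_fn, page_paths))
```

Passing ocr_fn as a parameter keeps the pooling logic independent of Tesseract, so it can be exercised with any per-page function.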
Interactive Image Viewer:
View document pages with zoom and fit-to-screen controls.
Highlights recognized words with bounding boxes.
Toggle highlights on or off for better readability.
Advanced OCR Controls:
Confidence Threshold: Adjust the minimum confidence level (0-100%) to filter out uncertain results. Changes are reflected in real time.
Language Selection: Specify which Tesseract language models to use, joined with + (e.g., tam_cus+eng).
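The confidence threshold and the word highlights can be pictured as a filter over Tesseract's word-level output. The sketch below is an assumption about how this might work, not the application's implementation. It expects a dictionary shaped like the one returned by pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT), with parallel lists under the keys text, conf, left, top, width and height. The function name filter_words and the sample values are made up for illustration.

```python
def filter_words(data, min_conf):
    """Keep words whose confidence is at least min_conf (0-100).

    Returns a list of dicts holding the word text, its confidence and its
    bounding box as (left, top, width, height).
    """
    words = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        conf = float(conf)  # some versions report confidences as strings
        # Entries for non-word layout elements have empty text and conf -1.
        if text.strip() and conf >= min_conf:
            words.append({"text": text, "conf": conf,
                          "box": (left, top, width, height)})
    return words


# Illustrative data only, not real OCR output.
sample = {
    "text": ["", "Hello", "wrld"],
    "conf": ["-1", "96", "41"],
    "left": [0, 10, 80], "top": [0, 12, 12],
    "width": [200, 60, 50], "height": [40, 20, 20],
}
kept = filter_words(sample, min_conf=60)
```

Because the filter only reads stored results, moving the threshold can refresh the highlights without running OCR again, which is one plausible way to get real-time updates.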
Text Editor:
View and edit the extracted OCR text for proofreading and corrections.
The application tracks edited pages and allows you to reset the text back to the original OCR result.
Adjust the editor's font size for comfort.
Includes a custom Tamil font (marutham.ttf) for proper rendering.
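One simple way to track edited pages and support a reset is to keep the original OCR text next to an optional edited copy. The sketch below shows that idea only. The PageText class and its members are hypothetical names, and the application may store its page state differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageText:
    original: str                 # text as returned by OCR
    edited: Optional[str] = None  # None until the user changes the text

    @property
    def current(self) -> str:
        """Text to show in the editor: the edit if present, else the original."""
        return self.original if self.edited is None else self.edited

    @property
    def is_edited(self) -> bool:
        """True only when the stored edit differs from the OCR result."""
        return self.edited is not None and self.edited != self.original

    def set_text(self, text: str) -> None:
        self.edited = text

    def reset(self) -> None:
        """Discard the edit and fall back to the original OCR text."""
        self.edited = None
```

Keeping the original untouched makes the reset a matter of dropping the edited copy, with no need to rerun OCR on that page.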
Export Functionality: Save the final, proofread text from all pages into a single .txt file.
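The export step can be sketched as joining the per-page text in order and writing one UTF-8 file, so that Tamil characters are preserved. The function name export_text and the blank-line page separator are assumptions for illustration. The application's actual output layout is not described here.

```python
from pathlib import Path


def export_text(pages, out_path, separator="\n\n"):
    """Write the text of all pages, in order, to a single UTF-8 .txt file."""
    # Trim trailing whitespace per page so the separator spacing is uniform.
    content = separator.join(text.rstrip() for text in pages)
    path = Path(out_path)
    path.write_text(content + "\n", encoding="utf-8")
    return path
```

A call such as export_text(["page one", "page two"], "output.txt") would produce a file with the two pages separated by a blank line.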