Automated reproducibility auditor for ML repositories that combines AST analysis, sandboxed execution, and LLM checks to produce a 0–100 score with actionable insights and auto-fixes.
RepoAudit is an end-to-end automated system designed to evaluate the reproducibility of machine learning (ML) research repositories. It analyzes public GitHub repositories and generates a quantitative reproducibility score (0–100) backed by both static and dynamic analysis techniques.
The platform combines AST-based code inspection, multi-language parsing, and LLM-powered semantic auditing to assess whether a repository can be reliably reproduced by other researchers. RepoAudit supports Python, R, Julia, and Jupyter Notebooks, making it broadly applicable across modern ML research ecosystems.
The primary goal of RepoAudit is to address a critical issue in ML research: the gap between published results and actual reproducibility.
It helps:
Researchers validate their work before publication
Reviewers assess implementation quality
Engineers identify reproducibility risks in open-source ML code
RepoAudit evaluates repositories across six weighted categories, producing a final score from 0 to 100:
Environment (15%)
Checks dependency pinning, containerization (Docker), and long-term reproducibility risks such as dependency decay, yanked packages, and known vulnerabilities.
Determinism (20%)
Uses AST analysis to verify proper seeding, detect non-deterministic operations, and identify issues in notebook execution such as out-of-order cells and state mutations.
Datasets (15%)
Ensures data accessibility by detecting hardcoded paths, verifying dataset URLs, and identifying gated or unavailable data sources.
Semantic Alignment (20%)
Uses LLM-based auditing to compare README claims with actual repository structure and implementation.
Execution (20%)
Performs sandboxed execution tests to validate whether the repository can actually run and produce outputs.
Documentation (10%)
Evaluates the presence and quality of essential documentation such as installation steps, usage instructions, and dataset descriptions.
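The weighted combination above can be sketched as a small scoring function. This is a minimal illustration, not the actual implementation: the category keys and the assumption that each analyzer returns a sub-score in [0, 100] are illustrative.

```python
# Illustrative sketch of RepoAudit's weighted scoring. Weights mirror
# the category breakdown above; key names are assumptions.
WEIGHTS = {
    "environment": 0.15,
    "determinism": 0.20,
    "datasets": 0.15,
    "semantic_alignment": 0.20,
    "execution": 0.20,
    "documentation": 0.10,
}

def overall_score(sub_scores: dict) -> float:
    """Combine per-category sub-scores (each 0-100) into the final 0-100 score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 100%
    return round(sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS), 1)

# Example: a repo strong on execution but weak on documentation.
print(overall_score({
    "environment": 80, "determinism": 90, "datasets": 70,
    "semantic_alignment": 85, "execution": 95, "documentation": 40,
}))
```

Because documentation carries only 10% of the weight, a poorly documented but runnable repository can still score reasonably well.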
RepoAudit is built as a full-stack distributed system with the following components:
Frontend
Interactive UI for submitting repositories and visualizing results, including score breakdowns, radar charts, and historical trends.
Backend API
Handles audit requests, orchestrates analysis, and serves results via REST endpoints.
Task Queue
Asynchronous processing using Celery and Redis-compatible queues to handle long-running audits.
Analysis Engine
Core logic for static and dynamic analysis, including AST parsing, dependency inspection, and execution replay.
Database & Cache
Stores audit results and enables fast retrieval through multi-layer caching (Redis + PostgreSQL).
RepoAudit deeply inspects source code using Python AST, libcst, and Tree-sitter for multi-language support. This enables precise detection of:
Missing random seeds
Unsafe file paths
Dependency inconsistencies
Cross-file data flow issues
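The missing-seed check can be approximated with Python's standard `ast` module. A minimal sketch follows; the set of recognized seeding function names is an assumption, and the real tool's analysis (libcst, Tree-sitter, cross-file flow) is far more thorough.

```python
import ast

# Illustrative list of seeding calls to look for; the real checker
# likely recognizes more (numpy, tensorflow, framework-specific helpers).
SEED_CALLS = {"seed", "manual_seed", "set_seed"}

def has_seeding(source: str) -> bool:
    """Return True if the module calls a known seeding function anywhere."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # handle both `random.seed(...)` and bare `set_seed(...)`
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
            if name in SEED_CALLS:
                return True
    return False

seeded = "import torch\ntorch.manual_seed(42)\n"
unseeded = "import random\nprint(random.random())\n"
```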
The system performs multi-level execution checks (L0–L3) inside a sandboxed environment:
L0: Dependency installation
L1: Import validation
L2: Script execution
L3: Output generation
This ensures that repositories are not just theoretically reproducible, but actually runnable.
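The level ladder above can be sketched as a runner that stops at the first failing stage and reports the deepest level passed. The commands below are harmless stand-ins for the real dependency-install, import, run, and output stages, which would execute inside the sandbox.

```python
import subprocess
import sys

# Stand-in commands for each execution level; in the real system these
# would be pip installs, import probes, and the repo's own entry points.
LEVELS = [
    ("L0 install", [sys.executable, "-c", "import sys"]),
    ("L1 imports", [sys.executable, "-c", "import json"]),
    ("L2 run",     [sys.executable, "-c", "print('ok')"]),
    ("L3 outputs", [sys.executable, "-c", "open('out.txt', 'w').write('x')"]),
]

def deepest_passing_level(levels=LEVELS, timeout=60) -> int:
    """Run levels in order; return the index of the deepest level that
    succeeded (-1 if even L0 fails)."""
    passed = -1
    for i, (name, cmd) in enumerate(levels):
        try:
            subprocess.run(cmd, check=True, capture_output=True, timeout=timeout)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            break
        passed = i
    return passed
```

Stopping at the first failure mirrors how the levels build on each other: there is no point running scripts (L2) if the imports (L1) already fail.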
Specialized handling for Jupyter notebooks includes:
Detection of out-of-order execution
Identification of hidden state dependencies
Validation of “Restart & Run All” reproducibility
Flagging inline package installations
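The out-of-order check exploits a property of the `.ipynb` format: each code cell records its `execution_count`, and a clean top-to-bottom "Restart & Run All" leaves those counts strictly increasing. A minimal sketch of that single check:

```python
import json

def out_of_order_cells(nb_json: str) -> bool:
    """Flag notebooks whose recorded execution_count values are not
    strictly increasing top-to-bottom, i.e. cells were run out of order
    or re-run after later cells."""
    nb = json.loads(nb_json)
    counts = [c.get("execution_count")
              for c in nb.get("cells", [])
              if c.get("cell_type") == "code" and c.get("execution_count") is not None]
    return any(a >= b for a, b in zip(counts, counts[1:]))

# Minimal notebook stubs (real .ipynb files carry more metadata).
ordered = json.dumps({"cells": [
    {"cell_type": "code", "execution_count": 1},
    {"cell_type": "code", "execution_count": 2},
]})
shuffled = json.dumps({"cells": [
    {"cell_type": "code", "execution_count": 5},
    {"cell_type": "code", "execution_count": 2},
]})
```

Hidden-state detection is harder than this: it requires tracking which names each cell defines and reads, which is where the AST machinery comes back in.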
RepoAudit verifies:
Dataset availability (URL liveness)
Access restrictions (gated datasets)
Reproducibility of preprocessing pipelines
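The URL-liveness check splits naturally into two steps: extracting candidate dataset URLs from documentation, then probing each one. A sketch using only the standard library (the regex and the HEAD-probe strategy are assumptions about the approach, not the actual implementation):

```python
import re
import urllib.request

URL_RE = re.compile(r"https?://[^\s)\"'>]+")

def extract_urls(readme_text: str) -> list:
    """Pull candidate dataset URLs out of free-form README text."""
    return URL_RE.findall(readme_text)

def is_live(url: str, timeout: float = 5.0) -> bool:
    """Best-effort liveness probe with a HEAD request; any network
    error or 4xx/5xx status counts as dead."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        return False
```

A 403 response is itself a useful signal here: the URL resolves, but the dataset is gated behind authentication.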
A dedicated dependency-decay check models long-term reproducibility risk by:
Detecting deprecated or yanked dependencies
Identifying security vulnerabilities (CVEs)
Estimating repository “shelf-life”
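Yanked-release detection can lean on PyPI's public JSON API: the per-version endpoint (`https://pypi.org/pypi/<name>/<version>/json`) reports a `yanked` flag in its `info` section. A sketch of the fetch-and-interpret split (the fetch is shown but the parsing is kept separate so it works on cached responses too):

```python
import json
import urllib.request

def version_is_yanked(pypi_json: dict) -> bool:
    """Interpret a PyPI JSON API per-version response: True if the
    pinned release has been yanked (PEP 592)."""
    return bool(pypi_json.get("info", {}).get("yanked", False))

def fetch_release(name: str, version: str, timeout: float = 10.0) -> dict:
    """Fetch release metadata for one pinned dependency from PyPI."""
    url = f"https://pypi.org/pypi/{name}/{version}/json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Running this over every pin in a `requirements.txt` gives one concrete input to the "shelf-life" estimate; CVE lookups would come from a vulnerability database rather than PyPI itself.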
Semantic alignment auditing compares README claims with the actual implementation to detect:
Mismatched hyperparameters
Incorrect defaults
Inconsistent experiment settings
RepoAudit can automatically fix high-confidence issues using AST transformations:
Injects missing seeds
Pins dependencies
Rewrites unsafe paths
It generates a .patch file for direct application.
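The shape of an auto-fix can be illustrated with a deliberately naive line-level transform plus a unified diff. The real tool uses AST/CST transformations, which are robust to formatting; this sketch only shows the fix-then-patch pipeline, and the file name `train.py` is hypothetical.

```python
import difflib

def inject_seed(source: str, seed: int = 42) -> str:
    """Naive sketch: if the file imports random but never seeds it,
    insert a random.seed(...) call right after the import. (The real
    fixer uses AST transforms rather than string matching.)"""
    lines = source.splitlines(keepends=True)
    if any("random.seed(" in line for line in lines):
        return source
    out = []
    for line in lines:
        out.append(line)
        if line.strip() == "import random":
            out.append(f"random.seed({seed})\n")
    return "".join(out)

before = "import random\nprint(random.random())\n"
after = inject_seed(before)

# Emit the fix as a unified diff, ready to apply with `git apply`.
patch = "".join(difflib.unified_diff(
    before.splitlines(keepends=True), after.splitlines(keepends=True),
    fromfile="a/train.py", tofile="b/train.py"))
```

Emitting a `.patch` file instead of editing in place keeps the human in the loop: the maintainer reviews and applies the diff themselves.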
Users can compare multiple repositories to:
Identify the most reproducible implementation
Analyze strengths and weaknesses across projects
Visualize differences using radar charts
RepoAudit uses a multi-layer caching strategy:
L1: Redis (fast retrieval)
L2: PostgreSQL (persistent storage)
Repositories are keyed by commit hash, ensuring that repeated audits are instant unless the code changes.
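The cache behavior described above can be sketched with a commit-keyed two-level store. Redis and PostgreSQL are stood in for by plain dicts here; only the key derivation and the L2-to-L1 promotion logic are the point.

```python
import hashlib

def cache_key(repo_url: str, commit_sha: str) -> str:
    """Derive a stable key from repo + commit, so a re-audit of
    unchanged code is a cache hit and any new commit is a miss."""
    digest = hashlib.sha256(f"{repo_url}@{commit_sha}".encode()).hexdigest()
    return "audit:" + digest[:16]

class TwoLevelCache:
    """L1 = fast store (Redis stand-in), L2 = persistent store
    (PostgreSQL stand-in). Reads promote L2 hits into L1."""
    def __init__(self):
        self.l1, self.l2 = {}, {}

    def get(self, key):
        if key in self.l1:
            return self.l1[key]
        if key in self.l2:
            self.l1[key] = self.l2[key]  # promote to the fast layer
            return self.l2[key]
        return None

    def put(self, key, value):
        self.l1[key] = value
        self.l2[key] = value
```

In production the L1 layer would carry a TTL and eviction policy, while L2 keeps the full audit history.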
The system exposes REST APIs for:
Submitting repositories for audit
Fetching detailed reports
Tracking audit progress
Comparing multiple repositories
Viewing audit history
It also supports resolving research paper URLs into corresponding GitHub repositories automatically.
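A client interaction might look like the following. The base URL and the `/audits` endpoint name are hypothetical placeholders, not the documented API; the sketch only builds the request rather than sending it.

```python
import json
import urllib.request

BASE = "https://repoaudit.example.com/api"  # hypothetical base URL

def submit_audit_request(repo_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST to a hypothetical /audits
    endpoint submitting a repository for auditing."""
    body = json.dumps({"repo_url": repo_url}).encode()
    return urllib.request.Request(
        f"{BASE}/audits", data=body,
        headers={"Content-Type": "application/json"}, method="POST")

req = submit_audit_request("https://github.com/user/repo")
```

Because audits are long-running, a real client would then poll a progress endpoint (backed by the Celery task queue) rather than block on the initial response.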
RepoAudit integrates with GitHub Actions, enabling:
Automated reproducibility checks on pull requests
Threshold-based validation (e.g., fail if score < 70)
Continuous monitoring of repository quality
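The threshold gate reduces to a tiny exit-code check that a CI step can run after fetching the audit report. The report shape (`{"score": ...}`) is an assumption for illustration; a GitHub Actions job fails whenever the step exits nonzero.

```python
import sys

def gate(report: dict, threshold: float = 70.0) -> int:
    """CI gate: return a process exit code based on the audit score.
    0 passes the check; 1 fails the pull request."""
    score = report.get("score", 0.0)
    print(f"reproducibility score: {score} (threshold: {threshold})")
    return 0 if score >= threshold else 1

if __name__ == "__main__":
    # In CI, `report` would come from the RepoAudit API for the PR's commit.
    sys.exit(gate({"score": 82.5}))
```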
The platform is designed to run entirely on free-tier cloud services, making it accessible and cost-efficient:
Backend: Render
Frontend: Vercel
Cache: Upstash Redis
Database: Supabase
LLM: Hugging Face
RepoAudit is useful for:
Researchers: Validate reproducibility before publishing
Reviewers: Assess implementation credibility
Open-source maintainers: Improve code quality
ML engineers: Evaluate third-party repositories