RepoAudit

Automated reproducibility auditor for ML repositories that combines AST analysis, sandboxed execution, and LLM checks to produce a 0–100 score with actionable insights and auto-fixes.

Description

RepoAudit is an end-to-end automated system designed to evaluate the reproducibility of machine learning (ML) research repositories. It analyzes public GitHub repositories and generates a quantitative reproducibility score (0–100) backed by both static and dynamic analysis techniques.

The platform combines AST-based code inspection, multi-language parsing, and LLM-powered semantic auditing to assess whether a repository can be reliably reproduced by other researchers. RepoAudit supports Python, R, Julia, and Jupyter Notebooks, making it broadly applicable across modern ML research ecosystems.

1. Objective

The primary goal of RepoAudit is to address a critical issue in ML research:
the gap between published results and actual reproducibility.

It helps:

  • Researchers validate their work before publication

  • Reviewers assess implementation quality

  • Engineers identify reproducibility risks in open-source ML code

2. Reproducibility Scoring System

RepoAudit evaluates repositories across six weighted categories, producing a final score from 0 to 100:

  • Environment (15%)
    Checks dependency pinning, containerization (Docker), and long-term reproducibility risks such as dependency decay, yanked packages, and known vulnerabilities.

  • Determinism (20%)
    Uses AST analysis to verify proper seeding, detect non-deterministic operations, and identify issues in notebook execution such as out-of-order cells and state mutations.

  • Datasets (15%)
    Ensures data accessibility by detecting hardcoded paths, verifying dataset URLs, and identifying gated or unavailable data sources.

  • Semantic Alignment (20%)
    Uses LLM-based auditing to compare README claims with actual repository structure and implementation.

  • Execution (20%)
    Performs sandboxed execution tests to validate whether the repository can actually run and produce outputs.

  • Documentation (10%)
    Evaluates the presence and quality of essential documentation such as installation steps, usage instructions, and dataset descriptions.

3. System Architecture

RepoAudit is built as a full-stack distributed system with the following components:

  • Frontend
    Interactive UI for submitting repositories and visualizing results, including score breakdowns, radar charts, and historical trends.

  • Backend API
    Handles audit requests, orchestrates analysis, and serves results via REST endpoints.

  • Task Queue
    Asynchronous processing using Celery and Redis-compatible queues to handle long-running audits.

  • Analysis Engine
    Core logic for static and dynamic analysis, including AST parsing, dependency inspection, and execution replay.

  • Database & Cache
    Stores audit results and enables fast retrieval through multi-layer caching (Redis + PostgreSQL).

4. Key Features

4.1 AST-Based Static Analysis

RepoAudit deeply inspects source code using Python AST, libcst, and Tree-sitter for multi-language support. This enables precise detection of:

  • Missing random seeds

  • Unsafe file paths

  • Dependency inconsistencies

  • Cross-file data flow issues
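As an illustration of the missing-seed check, here is a minimal sketch using Python's built-in ast module. The set of recognized seeding calls is an assumption for the example, not RepoAudit's actual list:

```python
import ast

# Common seeding calls to look for; an illustrative set, not exhaustive.
SEED_CALLS = {"random.seed", "np.random.seed", "numpy.random.seed",
              "torch.manual_seed", "tf.random.set_seed"}

def _dotted_name(node: ast.AST) -> str:
    """Rebuild a dotted call target such as 'np.random.seed'."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
    return ".".join(reversed(parts))

def finds_seed(source: str) -> bool:
    """Return True if the source contains at least one known seeding call."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and _dotted_name(node.func) in SEED_CALLS:
            return True
    return False

print(finds_seed("import numpy as np\nnp.random.seed(42)\n"))   # True
print(finds_seed("import numpy as np\nx = np.random.rand(3)\n"))  # False
```

Because this walks the parsed syntax tree rather than grepping text, it is not fooled by comments or string literals that merely mention seeding.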

4.2 Execution Replay Verification

The system performs multi-level execution checks (L0–L3) inside a sandboxed environment:

  • L0: Dependency installation

  • L1: Import validation

  • L2: Script execution

  • L3: Output generation

This ensures that repositories are not just theoretically reproducible, but actually runnable.
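A simplified version of the L0–L3 ladder can be modeled as a sequence of checks that stops at the first failure; the lambdas below stand in for the real sandboxed steps (dependency install, import probe, script run, output diff):

```python
from typing import Callable

def run_levels(levels: list[tuple[str, Callable[[], bool]]]) -> str:
    """Run levels in order and return the name of the highest level passed."""
    passed = "none"
    for name, check in levels:
        if not check():
            break  # a failure at any level blocks the levels above it
        passed = name
    return passed

levels = [
    ("L0: dependency installation", lambda: True),
    ("L1: import validation", lambda: True),
    ("L2: script execution", lambda: False),  # simulate a runtime failure
    ("L3: output generation", lambda: True),
]
print(run_levels(levels))  # L1: import validation
```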

4.3 Notebook-Specific Analysis

Specialized handling for Jupyter notebooks includes:

  • Detection of out-of-order execution

  • Identification of hidden state dependencies

  • Validation of “Restart & Run All” reproducibility

  • Flagging inline package installations
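The out-of-order check can be sketched by comparing execution_count values in the notebook's JSON, a simplification of the fuller hidden-state analysis:

```python
import json

def out_of_order_cells(nb_json: str) -> list[int]:
    """Return indices of code cells whose execution_count is lower than an
    earlier cell's, indicating the notebook was not run top to bottom."""
    nb = json.loads(nb_json)
    flagged, highest = [], 0
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        count = cell.get("execution_count")
        if count is None:
            continue  # cell was never executed
        if count < highest:
            flagged.append(i)
        highest = max(highest, count)
    return flagged

nb = json.dumps({"cells": [
    {"cell_type": "code", "execution_count": 2},
    {"cell_type": "markdown"},
    {"cell_type": "code", "execution_count": 1},  # ran before the cell above
]})
print(out_of_order_cells(nb))  # [2]
```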

4.4 Data Provenance Tracking

RepoAudit verifies:

  • Dataset availability (URL liveness)

  • Access restrictions (gated datasets)

  • Reproducibility of preprocessing pipelines
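A minimal sketch of the URL liveness check, assuming 401/403 responses indicate a gated dataset; the function names are illustrative:

```python
from urllib import request, error

def classify_status(code: int) -> str:
    """Map an HTTP status code to a dataset availability label."""
    if code in (401, 403):
        return "gated"  # login or license wall
    return "live" if code < 400 else "dead"

def dataset_status(url: str, timeout: float = 10.0) -> str:
    """Probe a dataset URL with a HEAD request and classify the result."""
    req = request.Request(url, method="HEAD")
    try:
        with request.urlopen(req, timeout=timeout) as resp:
            return classify_status(resp.status)
    except error.HTTPError as e:
        return classify_status(e.code)
    except (error.URLError, TimeoutError):
        return "dead"  # unresolvable host, refused connection, or timeout

print(classify_status(200))  # live
print(classify_status(403))  # gated
```

A HEAD request avoids downloading the dataset itself; only headers are fetched.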

4.5 Reproducibility Decay Tracking

A unique feature that models long-term reproducibility risk by:

  • Detecting deprecated or yanked dependencies

  • Identifying security vulnerabilities (CVEs)

  • Estimating repository “shelf-life”
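Yanked-release detection can be sketched against PyPI's per-version JSON endpoint (https://pypi.org/pypi/&lt;name&gt;/&lt;version&gt;/json). The function below inspects an already-fetched response body and, as a simplifying assumption, treats a release as yanked when every one of its distribution files is:

```python
import json

def release_is_yanked(pypi_release_json: str) -> bool:
    """True if every distribution file for this release is marked yanked.
    Expects the body of PyPI's per-version JSON endpoint."""
    data = json.loads(pypi_release_json)
    files = data.get("urls", [])  # one entry per wheel/sdist
    return bool(files) and all(f.get("yanked", False) for f in files)

sample = json.dumps({"urls": [
    {"filename": "pkg-1.0-py3-none-any.whl", "yanked": True},
]})
print(release_is_yanked(sample))  # True
```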

4.6 Configuration Drift Detection

Compares README claims with actual implementation to detect:

  • Mismatched hyperparameters

  • Incorrect defaults

  • Inconsistent experiment settings
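Once claimed and actual settings have been extracted from the README and the code, the comparison itself reduces to a dict diff; the hyperparameter names below are illustrative:

```python
def config_drift(readme_claims: dict, code_defaults: dict) -> dict:
    """Report hyperparameters whose README-claimed value differs from the
    default actually found in the code."""
    return {
        k: {"readme": readme_claims[k], "code": code_defaults[k]}
        for k in readme_claims.keys() & code_defaults.keys()  # shared keys only
        if readme_claims[k] != code_defaults[k]
    }

print(config_drift(
    {"lr": 0.001, "batch_size": 64, "epochs": 100},  # claimed in README
    {"lr": 0.01,  "batch_size": 64, "epochs": 100},  # defaults in code
))
# {'lr': {'readme': 0.001, 'code': 0.01}}
```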

4.7 Deterministic Auto-Remediation

RepoAudit can automatically fix high-confidence issues using AST transformations:

  • Injects missing seeds

  • Pins dependencies

  • Rewrites unsafe paths

It generates a .patch file for direct application.
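A simplified version of the seed-injection fix, using line matching and difflib in place of RepoAudit's AST transformations; it returns unified-diff text suitable for a .patch file:

```python
import difflib

def inject_seed_patch(path: str, source: str, seed: int = 42) -> str:
    """If the file imports random but never seeds it, insert a seeding call
    after the last import and return the change as a unified diff."""
    if "random.seed(" in source or "import random" not in source:
        return ""  # already seeded, or nothing to seed
    lines = source.splitlines(keepends=True)
    # Insert after the last import line (a simplification of the real
    # AST-based transform, which handles arbitrary layouts).
    last_import = max(i for i, l in enumerate(lines)
                      if l.startswith(("import ", "from ")))
    fixed = lines[:last_import + 1] + [f"random.seed({seed})\n"] + lines[last_import + 1:]
    return "".join(difflib.unified_diff(lines, fixed,
                                        fromfile=f"a/{path}", tofile=f"b/{path}"))

patch = inject_seed_patch("train.py", "import random\nx = random.random()\n")
print(patch)  # unified diff adding random.seed(42) after the import
```

The resulting text can be written to a .patch file and applied with `git apply`.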

4.8 Comparative Analysis

Users can compare multiple repositories to:

  • Identify the most reproducible implementation

  • Analyze strengths and weaknesses across projects

  • Visualize differences using radar charts

5. Performance Optimization

RepoAudit uses a multi-layer caching strategy:

  • L1: Redis (fast retrieval)

  • L2: PostgreSQL (persistent storage)

Audit results are keyed by repository commit hash, so repeated audits of unchanged code are served from cache rather than recomputed.
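The lookup order can be sketched with plain dicts standing in for Redis (L1) and PostgreSQL (L2); cache_key and get_audit are illustrative names:

```python
def cache_key(repo_url: str, commit_sha: str) -> str:
    """Key audits by repository and commit, so new commits miss the cache."""
    return f"audit:{repo_url}@{commit_sha}"

def get_audit(l1: dict, l2: dict, repo_url: str, commit_sha: str):
    """Check L1 (Redis stand-in) first, then L2 (PostgreSQL stand-in)."""
    key = cache_key(repo_url, commit_sha)
    if key in l1:           # L1 hit: fastest path
        return l1[key]
    if key in l2:           # L2 hit: promote the result into L1
        l1[key] = l2[key]
        return l1[key]
    return None             # full miss: a fresh audit must run

l1, l2 = {}, {cache_key("github.com/a/b", "deadbeef"): {"score": 82}}
print(get_audit(l1, l2, "github.com/a/b", "deadbeef"))  # {'score': 82}
```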

6. API Capabilities

The system exposes REST APIs for:

  • Submitting repositories for audit

  • Fetching detailed reports

  • Tracking audit progress

  • Comparing multiple repositories

  • Viewing audit history

It also supports resolving research paper URLs into corresponding GitHub repositories automatically.

7. CI/CD Integration

RepoAudit integrates with GitHub Actions, enabling:

  • Automated reproducibility checks on pull requests

  • Threshold-based validation (e.g., fail if score < 70)

  • Continuous monitoring of repository quality

8. Deployment Model

The platform is designed to run entirely on free-tier cloud services, making it accessible and cost-efficient:

  • Backend: Render

  • Frontend: Vercel

  • Cache: Upstash Redis

  • Database: Supabase

  • LLM: Hugging Face

9. Use Cases

RepoAudit is useful for:

  • Researchers: Validate reproducibility before publishing

  • Reviewers: Assess implementation credibility

  • Open-source maintainers: Improve code quality

  • ML engineers: Evaluate third-party repositories
