hallx — Hallucination Risk Scoring for Production LLM Pipelines
LLMs don't fail loudly. When a model hallucinates, it returns a confident, well-formed response with nothing in the output signalling that something went wrong. Existing eval frameworks (RAGAS, TruLens) are offline, batch-oriented tools built for pre-deployment testing, not for checking individual responses at runtime. Using a second LLM as a judge doubles latency and cost, and still relies on a model to verify another model. hallx sits between the model and your downstream system and scores each response inline using three heuristic signals: no ground-truth labels, no secondary model calls.
How it scores:
Schema — validates structure and flags null-injection, where the model fills a required field with nothing meaningful
Consistency — re-runs generation 2–4 times and measures drift; an uncertain model doesn't produce stable outputs
Grounding — checks whether the response's claims have any textual anchor in the context documents provided
These combine into a confidence score (0.0–1.0) and a risk_level of high, medium, or low. Skipped checks are penalised, so partial analysis doesn't pass silently.
What you get back:
A proceed or retry action with a suggested temperature and prompt improvement hints
HallxHighRiskError in strict mode, for hard blocking on sensitive paths
An issues list for traceability and auditing
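The proceed/retry decision could look something like the sketch below. The field names, the temperature-lowering heuristic, and the error class body are illustrative assumptions, not hallx's actual return schema.

```python
# Hypothetical sketch of the action payload a scorer might return.
# Field names and the temperature heuristic are assumptions, not hallx's API.

class HallxHighRiskError(Exception):
    """Raised in strict mode to hard-block a high-risk response."""

def decide(confidence: float, risk_level: str, current_temperature: float,
           strict: bool = False) -> dict:
    if risk_level == "high" and strict:
        raise HallxHighRiskError(f"blocked: confidence={confidence:.2f}")
    if risk_level == "low":
        return {"action": "proceed", "issues": []}
    # On retry, lower the temperature to push the model toward more
    # deterministic, better-grounded output, and attach prompt hints.
    return {
        "action": "retry",
        "suggested_temperature": max(0.0, current_temperature - 0.3),
        "hints": ["cite the provided context explicitly",
                  "answer only from the given documents"],
        "issues": [f"risk_level={risk_level}"],
    }
```

The point of returning a structured action rather than a bare score is that the caller can wire it straight into a retry loop, while the issues list survives into logs for auditing.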
Other things it includes:
SQLite-backed feedback store to record reviewed outcomes and generate calibration reports over time
Safety profiles — fast, balanced, strict — controlling how many consistency runs are made and how harshly incomplete checks are penalised
Adapters for OpenAI, Anthropic, Gemini, Ollama, HuggingFace, and more; sync and async both supported
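One way to picture the safety profiles is as a small config table. The run counts and penalty weights below are illustrative assumptions, not hallx's documented defaults.

```python
from dataclasses import dataclass

# Hypothetical profile table; the numbers are illustrative assumptions,
# not hallx's documented defaults.

@dataclass(frozen=True)
class SafetyProfile:
    consistency_runs: int  # how many re-generations to compare for drift
    skip_penalty: float    # confidence deducted per skipped check

PROFILES = {
    "fast":     SafetyProfile(consistency_runs=2, skip_penalty=0.05),
    "balanced": SafetyProfile(consistency_runs=3, skip_penalty=0.15),
    "strict":   SafetyProfile(consistency_runs=4, skip_penalty=0.30),
}
```

The trade-off is explicit: "fast" spends the fewest extra generations and forgives incomplete analysis, while "strict" pays for more consistency runs and punishes any skipped check hard.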
Available on PyPI (pip install hallx), MIT licensed, pure Python.