Hallucination detection is one of the most active areas in LLM reliability research, but the landscape is fragmented. Teams building production agent systems need to choose between statistical approaches (perplexity, entropy, token-level probability analysis) and LLM-based approaches (using a separate model to judge outputs). Which one actually works better in practice? We built a benchmark to find out.
The Hallucination Detection Landscape
LLM hallucinations fall into several categories: factual inaccuracies (the model asserts something false), contextual inconsistencies (the model contradicts itself or the provided context), logical errors (the model's reasoning is internally inconsistent), and instruction misalignment (the model ignores or misinterprets instructions). Each type requires different detection strategies.
Two broad approaches have emerged. Statistical methods analyze the model's own output probabilities — low-confidence tokens, high-entropy predictions, and semantic uncertainty signals. LLM-based methods deploy a separate model (often a smaller, fine-tuned classifier or a general-purpose LLM with a carefully engineered prompt) to evaluate the output for signs of hallucination. Both approaches have passionate advocates and compelling theoretical foundations.
"The question isn't which approach is better in theory. It's which approach catches more hallucinations at an acceptable latency and cost in your specific production context. Our benchmark was designed to answer that question with data."
Benchmark Methodology
We constructed a benchmark dataset of 5,000 LLM outputs spanning three domains: financial analysis (earnings report summaries), medical information (treatment guideline explanations), and technical documentation (API usage descriptions). Each output was labeled by domain experts for the presence, type, and severity of hallucinations. The dataset includes 1,800 hallucinated outputs (36%) and 3,200 accurate outputs (64%), so both classes are well represented even though the split is not even.
We evaluated five detection approaches across three model families (GPT-4, Claude 3, and Llama 3-70B):
- Perplexity thresholding: Flag outputs where the average token-level perplexity exceeds a calibrated threshold.
- Semantic entropy: Measure the model's uncertainty about its own output by analyzing multiple generations and computing semantic-level entropy.
- Token-level probability analysis: Identify low-confidence spans using token log probabilities, flagging outputs with concentrated regions of low confidence.
- LLM-as-judge (small): Fine-tune a DeBERTa-v3 classifier on hallucination detection data.
- LLM-as-judge (large): Prompt GPT-4 and Claude 3 to evaluate outputs for hallucination using a structured evaluation prompt.
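To make the three statistical approaches concrete, here is a minimal Python sketch. It is an illustration, not the benchmark's code: it assumes the model API exposes per-token natural-log probabilities, and every function name and default value (`min_prob`, `min_len`, the `meaning_key` stand-in) is hypothetical.

```python
import math
from collections import Counter

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability.
    Assumes a non-empty list of natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_by_perplexity(token_logprobs, threshold):
    """Perplexity thresholding: flag when perplexity exceeds a calibrated threshold."""
    return perplexity(token_logprobs) > threshold

def low_confidence_spans(token_logprobs, min_prob=0.2, min_len=3):
    """Token-level analysis: return (start, end) pairs, end exclusive, for runs of
    at least min_len consecutive tokens whose probability is below min_prob."""
    spans, start = [], None
    for i, lp in enumerate(token_logprobs):
        if math.exp(lp) < min_prob:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and len(token_logprobs) - start >= min_len:
        spans.append((start, len(token_logprobs)))
    return spans

def semantic_entropy(generations, meaning_key=lambda s: s.strip().lower()):
    """Entropy in nats over groups of generations that share a meaning.
    meaning_key is a crude stand-in (normalized string equality); a real system
    would need a proper semantic-equivalence check to form the groups."""
    counts = Counter(meaning_key(g) for g in generations)
    total = len(generations)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

In this sketch, an output would be flagged when `flag_by_perplexity` returns true, when `low_confidence_spans` returns any span, or when `semantic_entropy` over several sampled generations exceeds a chosen cutoff. All three cutoffs are assumptions that would need calibration on labeled data.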
Evaluation Metrics
We measured precision, recall, F1 score, latency (p50 and p99), and cost per prediction. For production deployments, recall is typically the most important metric — missing a hallucination has higher cost than a false positive (which can be reviewed by a human). We also measured decision latency because agent systems need real-time verification to maintain interactive speeds.
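For readers who want to reproduce these metrics on their own labeled data, the following Python sketch shows one way to compute them. The function names are illustrative, and the nearest-rank percentile is one of several common definitions, chosen here as an assumption.

```python
import math

def precision_recall_f1(labels, predictions):
    """labels and predictions are parallel sequences of truthy/falsy values,
    where truthy means 'hallucinated' (label) or 'flagged' (prediction)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def percentile(values, q):
    """Nearest-rank percentile of a non-empty sequence, for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]
```

Calling `percentile(latencies_ms, 50)` and `percentile(latencies_ms, 99)` on a list of measured latencies gives the p50 and p99 figures.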
Benchmark Results
The results revealed a clear trade-off. Statistical methods delivered high throughput and low cost, but their overall recall topped out at 72% (token-level probability analysis). LLM-based methods achieved significantly higher recall, up to 94% for the large-model judge, but at a steep price: compared with the statistical methods, the large judge's median latency is roughly 50-300x higher and its cost per prediction roughly 15-60x higher.
Approach | Recall | Precision | F1 | p50 Latency | Cost/1K predictions
Perplexity thresholding | 58% | 81% | 0.68 | 8ms | $0.02
Semantic entropy | 67% | 84% | 0.74 | 45ms | $0.08
Token-level probability | 72% | 79% | 0.75 | 12ms | $0.03
LLM judge (small) | 83% | 87% | 0.85 | 120ms | $0.15
LLM judge (large) | 94% | 91% | 0.92 | 2.4s | $1.20
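The latency and cost multiples can be read directly off the table. The short Python sketch below does the arithmetic, using token-level probability analysis as the baseline; the baseline choice and the names are assumptions made for illustration.

```python
# p50 latency (ms) and cost per 1K predictions (USD), copied from the table above.
RESULTS = {
    "Perplexity thresholding": {"p50_ms": 8, "cost_per_1k": 0.02},
    "Semantic entropy": {"p50_ms": 45, "cost_per_1k": 0.08},
    "Token-level probability": {"p50_ms": 12, "cost_per_1k": 0.03},
    "LLM judge (small)": {"p50_ms": 120, "cost_per_1k": 0.15},
    "LLM judge (large)": {"p50_ms": 2400, "cost_per_1k": 1.20},
}

def multiples(approach, baseline="Token-level probability"):
    """Return (latency multiple, cost multiple) of approach relative to baseline."""
    a, b = RESULTS[approach], RESULTS[baseline]
    return a["p50_ms"] / b["p50_ms"], a["cost_per_1k"] / b["cost_per_1k"]

for name in ("LLM judge (small)", "LLM judge (large)"):
    latency_x, cost_x = multiples(name)
    print(f"{name}: {latency_x:.0f}x latency, {cost_x:.0f}x cost")
```

Against this baseline, the small judge works out to about 10x the latency and 5x the cost, and the large judge to about 200x the latency and 40x the cost.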
Domain-Specific Findings
Performance varied significantly by domain. Statistical methods performed best on technical documentation (78% recall for token-level probability), where hallucinated outputs tend to contain low-probability tokens that deviate from the high-probability patterns of correct technical writing. Medical information was the hardest domain for all approaches: medical language is precise and formulaic, so models produce confident-sounding but factually incorrect outputs with high token probabilities. LLM-as-judge (large) was the only approach to exceed 85% recall on medical hallucinations.
Financial analysis fell in the middle. Statistical methods caught obvious numerical inconsistencies (a company's reported revenue being wildly different from the source document) but missed subtle reasoning errors (misattributing a growth driver to the wrong business segment). LLM-based methods caught these reasoning errors but occasionally flagged legitimate analytical nuance as hallucination — a false positive pattern that required domain-specific prompt engineering to mitigate.
Cross-Model Generalization
An important finding was that detection approaches generalized reasonably well across model families. A detector trained or calibrated on GPT-4 outputs achieved 90% of its optimal performance when applied to Claude 3 outputs without recalibration. The one exception was Llama 3-70B: all detection approaches performed 5-8% worse on Llama outputs, likely because open-weight models exhibit different probability distributions and error patterns than their proprietary counterparts. Teams using open-weight models should expect to invest more in calibration and may need to train domain-specific detectors.
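Per-model recalibration can be as simple as re-deriving a score threshold on a small labeled set from the target model. The Python sketch below is one way to do that, assuming the detector emits a suspicion score where higher means more likely hallucinated; the function name and the target-recall framing are illustrative, not the procedure used in the benchmark.

```python
import math

def calibrate_threshold(scores, labels, target_recall):
    """Return the highest threshold such that flagging every output with
    score >= threshold reaches target_recall on the calibration set.
    labels are truthy for hallucinated outputs."""
    positives = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not positives:
        raise ValueError("calibration set contains no hallucinated outputs")
    # Number of hallucinated outputs that must be caught to hit the target.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]

# Hypothetical calibration data for one model family.
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2]
labels = [1, 1, 1, 1, 0, 0]
threshold = calibrate_threshold(scores, labels, target_recall=0.75)
```

Running the same function on labeled outputs from each model family yields one threshold per family, which is the kind of extra calibration work open-weight deployments should budget for.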
Detection Latency in Production Contexts
Latency is often the deciding factor in production. An agent responding to a user can tolerate 500ms-2s of verification latency before the delay becomes noticeable. A batch processing agent can tolerate 5-10s. Real-time trading agents can tolerate virtually no latency at all.
Our latency profiling revealed that statistical methods (8-45ms) can run synchronously without perceptible delay in most agent contexts, with ultra-low-latency systems such as real-time trading the possible exception. The small LLM judge (120ms) is acceptable for most interactive agents but may feel sluggish in real-time chat. The large LLM judge (2.4s median, 8s p99) requires a fundamentally different integration pattern: verification must run asynchronously, with agents either sending partial outputs for streaming verification or using the large model only for post-hoc auditing of critical decisions.
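One way to make this latency reasoning operational is to pick the highest-recall method whose median latency fits the agent's budget. The Python sketch below uses the recall and p50 figures from the results table; the selection rule itself is an assumption for illustration, and a stricter version would budget against p99 rather than p50.

```python
# (name, recall, p50 latency in ms), copied from the benchmark results table.
METHODS = [
    ("Perplexity thresholding", 0.58, 8),
    ("Semantic entropy", 0.67, 45),
    ("Token-level probability", 0.72, 12),
    ("LLM judge (small)", 0.83, 120),
    ("LLM judge (large)", 0.94, 2400),
]

def best_method_within(budget_ms):
    """Return the highest-recall method whose p50 latency fits the budget,
    or None if nothing fits."""
    eligible = [m for m in METHODS if m[2] <= budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda m: m[1])[0]
```

With a 50ms budget this selects token-level probability analysis, with 500ms the small judge, and with a 5s batch budget the large judge.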
The practical implication is that your detection strategy must be latency-aware. SignalStack's verification infrastructure supports this through configurable verification depth: a fast statistical pre-check runs synchronously and returns results in under 50ms, while deeper LLM-based verification runs asynchronously and delivers results via webhook callback. This tiered approach gives you the latency profile of statistical methods with the recall of LLM-based methods, as long as your agent architecture can handle asynchronous verification results.
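The general shape of a synchronous pre-check followed by asynchronous deep verification can be sketched with Python's asyncio. This is not SignalStack's API: `fast_check`, `deep_check`, `verify`, and the callback are hypothetical names, the deep check is a placeholder for a slow LLM-judge call, and an in-process callback stands in for a webhook.

```python
import asyncio

def fast_check(token_probs, min_prob=0.2):
    """Cheap synchronous screen: suspicious if any token probability is low."""
    return any(p < min_prob for p in token_probs)

async def deep_check(text):
    """Placeholder for a slow LLM-judge call; the verdict logic is a stand-in."""
    await asyncio.sleep(0.01)
    return "unsupported" in text

async def verify(text, token_probs, on_deep_result):
    """Return the fast verdict immediately and schedule the deep check.
    on_deep_result(text, verdict) is invoked when the deep check finishes."""
    suspicious = fast_check(token_probs)

    async def run_deep():
        on_deep_result(text, await deep_check(text))

    task = asyncio.create_task(run_deep())
    return suspicious, task

async def main():
    deep_verdicts = []
    suspicious, task = await verify(
        "an unsupported claim", [0.9, 0.1],
        lambda text, verdict: deep_verdicts.append(verdict),
    )
    # The agent can act on the fast verdict here while the deep check runs.
    await task
    return suspicious, deep_verdicts
```

The design point is that the caller gets the fast verdict without waiting on the judge, so the agent must be able to handle a deep verdict that arrives after it has already responded.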
When Each Approach Works Best
The benchmark makes one thing clear: there is no universal best approach. The optimal strategy depends on your specific requirements for latency, cost, and the cost of missed hallucinations.
Statistical Methods: High-Throughput Screening
Statistical methods excel as a first-pass filter. If you're processing thousands of agent outputs per minute and need to flag likely hallucinations for review, token-level probability analysis offers the highest recall of the statistical methods (72%) at close to the lowest cost ($0.03 per 1K predictions). We recommend deploying perplexity or probability analysis as a pre-filter that routes suspicious outputs to a more expensive LLM-based checker, creating a tiered verification pipeline.
LLM-Based Methods: High-Stakes Verification
When the cost of a missed hallucination is high, as in medical advice, financial decisions, legal analysis, or customer-facing contract terms, LLM-as-judge with a large model is the clear winner. The 94% recall achieved in our benchmark translates to catching nearly 19 out of every 20 hallucinations, the closest any approach we tested came to the reliability that regulated, high-stakes systems demand.
Hybrid Pipelines: The Production Standard
Every production system we've worked with ends up with a hybrid architecture. Statistical methods provide cheap, fast screening for the high-volume, low-stakes outputs. LLM-based methods provide deep verification for the outputs that matter most. SignalStack's claim verification system at /product/claim-verification supports this hybrid approach natively, allowing you to configure verification depth based on the risk level of each agent action.
Start with statistical screening as your default, then route high-stakes outputs to LLM-based verification. Configure your trust scoring thresholds at /docs/guides/trust-scoring to automate escalation: outputs below a confidence threshold trigger deeper verification, while outputs above the threshold can proceed with the lightweight check.
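The escalation rule described here fits in a few lines. The Python sketch below is an illustration only: the `Depth` enum, the `choose_depth` function, and the 0.8 default threshold are hypothetical and do not reflect SignalStack's actual configuration format.

```python
from enum import Enum

class Depth(Enum):
    LIGHT = "light"  # statistical screening only
    DEEP = "deep"    # escalate to LLM-based verification

def choose_depth(trust_score, high_stakes, threshold=0.8):
    """Escalate when the action is high-stakes or the trust score falls
    below the threshold; otherwise keep the lightweight check."""
    if high_stakes or trust_score < threshold:
        return Depth.DEEP
    return Depth.LIGHT
```

For example, `choose_depth(0.95, high_stakes=False)` stays on the lightweight path, while a low score or any high-stakes action escalates.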
Model-Specific Detection Patterns
An unexpected finding was that different model families exhibit characteristic hallucination patterns that inform detection strategy. GPT-4 hallucinations tend to be plausible-sounding fabrications with high token probabilities — the model is confidently wrong. Claude 3 hallucinations more often involve misattribution: correctly citing a source but applying it to the wrong context. Llama 3-70B hallucinations frequently involve instruction misalignment: generating the right information but in the wrong format or with the wrong level of detail.
These patterns have practical implications. For GPT-4 deployments, statistical methods underperform because the model's confident-sounding hallucinations have high token probabilities — you need LLM-based verification that evaluates semantic correctness rather than token-level confidence. For Claude 3 deployments, citation verification is especially important: independently check that cited sources actually support the claims they're attached to. For Llama 3 deployments, all detection methods require calibration and benefit from domain-specific fine-tuning of the detection classifier.
SignalStack's verification system adapts detection strategies based on the model provider: when an agent reports which model generated an output, the verification pipeline adjusts its detection parameters accordingly. This model-aware verification is documented in the claim verification product page at /product/claim-verification and is configured as part of the verification request metadata.
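As a rough picture of what model-aware selection could look like in your own code, the Python sketch below maps a reported model name to the detection emphasis suggested by the patterns above. The mapping keys, strategy labels, and default are all hypothetical and are not SignalStack's request metadata schema.

```python
# Emphasis per model family, following the patterns described above.
EMPHASIS_BY_FAMILY = {
    "gpt-4": "llm_judge",                # confident fabrications defeat token-level signals
    "claude-3": "citation_check",        # misattribution: verify cited sources
    "llama-3": "calibrated_classifier",  # needs calibration and fine-tuned detectors
}

def detection_emphasis(model_name, default="statistical_screen"):
    """Return the detection emphasis for a reported model name.
    The default for unknown families is an assumption."""
    name = model_name.lower().replace(" ", "-")
    for family, emphasis in EMPHASIS_BY_FAMILY.items():
        if name.startswith(family):
            return emphasis
    return default
```
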
Recommendations for Agent Builders
Based on these results, we recommend the following strategy for teams deploying LLM-based agents in production:
- Deploy token-level probability analysis as your first line of defense. It adds minimal latency, costs almost nothing, and catches the most obvious hallucinations before they reach users.
- Use LLM-based verification for any agent action that has financial, legal, or safety implications. The latency cost of the large judge (a median of 2.4 seconds, with a p99 of 8 seconds) is acceptable for high-stakes decisions and the recall improvement is dramatic.
- Benchmark your own domain. Our results show significant domain variation. Run our methodology on your specific use case and data to calibrate thresholds and choose the right approach.
- Monitor and iterate. Hallucination patterns change as models are updated. Continuously evaluate your detection pipeline and retrain classifiers as needed.
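The monitoring recommendation can be implemented as a rolling recall check over human-confirmed hallucinations. The Python sketch below is one possible shape; the class name, window size, and 0.9 recall floor are assumed values for illustration.

```python
from collections import deque

class RecallMonitor:
    """Tracks detector recall over the most recent confirmed hallucinations."""

    def __init__(self, window=200, min_recall=0.9):
        self.outcomes = deque(maxlen=window)  # True if the detector caught it
        self.min_recall = min_recall

    def record(self, was_caught):
        """Record one human-confirmed hallucination and whether it was flagged."""
        self.outcomes.append(bool(was_caught))

    def recall(self):
        """Rolling recall, or None before any outcome is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        """True when rolling recall has dropped below the configured floor."""
        r = self.recall()
        return r is not None and r < self.min_recall
```

A drop below the floor after a model update is a signal to recalibrate thresholds or retrain the classifier.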
Conclusion
Hallucination detection is not a solved problem, but the tools available today — from lightweight statistical methods to powerful LLM-based judges — are good enough to make production agent systems reliable when deployed thoughtfully. The key insight from our benchmark is that the best approach depends on your domain, latency requirements, and tolerance for missed hallucinations. Start with statistical screening, add LLM-based verification for high-stakes outputs, and benchmark continuously. SignalStack's verification infrastructure at /product/claim-verification and trust scoring system at /docs/guides/trust-scoring are designed to support this tiered approach out of the box.
Luke Swestun is the founder of SignalStack. He writes about trust infrastructure, hallucination detection, and building AI agents that can verify before they act.