
How We Measure Hallucination Risk Across 30+ Data Sources in Under 200ms

Technical deep dive into SignalStack's multi-source verification pipeline and how we achieve sub-200ms response times.

Luke Swestun

Sub-200 millisecond verification sounds impossible. 30+ independent data sources, real-time evidence aggregation, scoring computation — how do you do all of that in the time it takes to blink? At SignalStack, we've built a multi-source verification pipeline that achieves exactly this. Here's how it works under the hood.

The Architecture: Parallel Over Sequential

The fundamental insight is that data sources should be queried in parallel, not sequentially. Most naive verification implementations call sources one at a time: search the web, then query the knowledge graph, then check the vector database. This means total latency = sum of all source latencies, which quickly exceeds any practical timeout.

SignalStack queries all sources concurrently. The total latency of the verification pipeline is determined by the slowest reliable source, not the sum of all sources. With proper timeout management and fallback strategies, this keeps median (P50) response times under 180ms and P99 response times at roughly 400ms.
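The difference can be written down directly. The notation here is introduced for illustration and is not taken from SignalStack's documentation: $t_i$ is the latency of source $i$, $n$ is the number of sources, $\tau$ is the per-source timeout, and $t_{\text{overhead}}$ covers dispatch and aggregation.

```latex
T_{\text{seq}} = \sum_{i=1}^{n} t_i
\qquad\text{versus}\qquad
T_{\text{par}} \approx \min\!\Bigl(\max_{1 \le i \le n} t_i,\ \tau\Bigr) + t_{\text{overhead}}
```

As a made-up example, six sources that each take 50ms cost about 300ms when called one after another, but only about 50ms plus overhead when called concurrently.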

A simplified view of the pipeline:

text
Claim Input
    |
    v
[Query Dispatcher]  ---->  Web Search (parallel)
    |                        Knowledge Graph (parallel)
    |                        Vector DB (parallel)
    |                        Structured APIs (parallel)
    |                        Document Store (parallel)
    |                        Code Execution (parallel)
    |
    v
[Evidence Aggregator]  <---- All source responses
    |
    v
[Scoring Engine]  ---->  Trust Score + Evidence
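The fan-out step in the diagram could be sketched with `asyncio` as follows. This is a minimal illustration, not SignalStack's implementation: `query_source` is a hypothetical stand-in that sleeps instead of calling a real backend, and the source names and delays are invented.

```python
import asyncio
from typing import Optional


async def query_source(name: str, delay_s: float) -> dict:
    """Hypothetical source client; sleeps to simulate network latency."""
    await asyncio.sleep(delay_s)
    return {"source": name, "evidence": f"result from {name}"}


async def query_with_timeout(name: str, delay_s: float, timeout_s: float) -> Optional[dict]:
    """Return the source response, or None if it misses the deadline."""
    try:
        return await asyncio.wait_for(query_source(name, delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def dispatch(sources: dict, timeout_s: float) -> list:
    """Query every source concurrently and keep whatever arrives in time."""
    tasks = [query_with_timeout(name, delay, timeout_s) for name, delay in sources.items()]
    responses = await asyncio.gather(*tasks)
    return [r for r in responses if r is not None]


def run_demo() -> list:
    # Invented latencies: two fast sources and one that exceeds the timeout.
    simulated = {"web_search": 0.02, "knowledge_graph": 0.01, "slow_api": 0.5}
    return asyncio.run(dispatch(simulated, timeout_s=0.1))
```

In this sketch a slow source is simply dropped once it exceeds the timeout, so the overall wait is bounded by the timeout rather than by the slowest backend.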

The Data Source Layer

Each data source type fills a different role in the verification pipeline. No single source is authoritative on its own; the power comes from cross-referencing independent signals.

Web Search

General-purpose factual verification. Web search is the broadest source, capable of verifying claims across any domain. The challenge is relevance and authority: not all search results are equal. Our search pipeline ranks results by domain authority, freshness, and topical relevance before passing them to the scoring engine.
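One simple way to combine those three ranking signals is a weighted sum. The sketch below assumes each signal is already normalized to the range 0.0 to 1.0; the `SearchResult` type and the weights are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    domain_authority: float  # 0.0-1.0
    freshness: float         # 0.0-1.0, where 1.0 means very recent
    relevance: float         # 0.0-1.0, topical match to the claim


# Hypothetical weights; a real system would tune these.
WEIGHTS = {"domain_authority": 0.4, "freshness": 0.2, "relevance": 0.4}


def rank_score(result: SearchResult) -> float:
    """Weighted sum of the three ranking signals."""
    return (
        WEIGHTS["domain_authority"] * result.domain_authority
        + WEIGHTS["freshness"] * result.freshness
        + WEIGHTS["relevance"] * result.relevance
    )


def rank_results(results: list, top_k: int = 5) -> list:
    """Return the top_k results, best first."""
    return sorted(results, key=rank_score, reverse=True)[:top_k]
```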

Knowledge Graph

Structured factual data. Knowledge graphs excel at entity resolution — verifying that a claim about a person, organization, or place is consistent with established relationships. Our knowledge graph source aggregates data from multiple curated datasets with known provenance.

Vector Database

Customer-specific knowledge. This source searches the customer's own indexed documents — internal wikis, product documentation, knowledge bases. Vector search enables semantic matching even when the claim uses different terminology than the source documents.
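Semantic matching of this kind usually comes down to comparing embedding vectors, often with cosine similarity. The sketch below assumes the claim and documents have already been embedded by some model; the in-memory `index` dictionary is a hypothetical stand-in for a real vector database.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_documents(query_vec: list, index: dict, top_k: int = 3) -> list:
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

A production vector database would use an approximate nearest-neighbor index rather than this linear scan, but the similarity measure is the same idea.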

Structured APIs

Authoritative data via API. For claims about financial and market data, weather, sports scores, and other structured domains, direct API access to authoritative providers gives higher-quality evidence than web search.

Document Store

Full-text search over structured documents. Unlike vector search, which returns approximate semantic neighbors, full-text search matches the exact terms and phrases in the query. That is critical for verifying claims about specific numbers, dates, and quotes.
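To illustrate why exact matching matters for numbers, here is a minimal phrase-matching sketch. It assumes documents are small enough to scan in memory; the `documents` dictionary and its contents are invented, and a real document store would use an inverted index instead.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_phrase_hits(phrase: str, documents: dict) -> list:
    """Return IDs of documents that contain the phrase verbatim (after normalization)."""
    needle = normalize(phrase)
    return [doc_id for doc_id, body in documents.items() if needle in normalize(body)]
```

A semantic search would likely treat "revenue grew 14.2%" and "revenue grew 12.4%" as near neighbors; an exact match distinguishes them.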

Code Execution

Computational verification. Some claims can't be verified by looking them up; they need to be computed. Answering "what would $1000 invested in BTC in 2020 be worth today" requires actual computation from a purchase price and a current price. The code execution source runs sandboxed Python scripts to compute verifiable answers.
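The kind of script such a source might run can be very small. The sketch below is an assumption about what the computation looks like, not SignalStack's actual sandboxed code; the function names and the 1% tolerance are hypothetical, and prices must be supplied by a trusted data source.

```python
def investment_value(amount_usd: float, buy_price: float, current_price: float) -> float:
    """Value today of amount_usd invested at buy_price, given current_price."""
    if buy_price <= 0:
        raise ValueError("buy_price must be positive")
    units = amount_usd / buy_price
    return units * current_price


def claim_matches(claimed: float, computed: float, tolerance: float = 0.01) -> bool:
    """True if the claimed figure is within a relative tolerance of the computed one."""
    return abs(claimed - computed) <= tolerance * abs(computed)
```

With placeholder prices (not real market data) of 100 at purchase and 250 today, `investment_value(1000.0, 100.0, 250.0)` gives 2500.0, and a claimed value of 2510 would pass the 1% tolerance check.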

The Evidence Aggregation Strategy

Once all sources respond (or time out), the evidence aggregator processes the results. This is where raw source responses become structured evidence items. Each evidence item includes:

  • The source type and identifier
  • A relevance score (how closely the evidence matches the claim)
  • An authority score (how trustworthy the source is)
  • The timestamp of retrieval
  • The supporting or contradicting content

Evidence that doesn't meet minimum relevance thresholds is discarded. This prevents noisy or tangentially related results from affecting the score.
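The evidence item and the relevance filter described above could be represented like this. The field names mirror the list, but the `EvidenceItem` type, the `supports_claim` flag, and the 0.5 threshold are assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class EvidenceItem:
    source_type: str        # e.g. "web_search", "knowledge_graph"
    source_id: str          # identifier of the specific source
    relevance: float        # 0.0-1.0, how closely the evidence matches the claim
    authority: float        # 0.0-1.0, how trustworthy the source is
    retrieved_at: datetime  # timestamp of retrieval
    content: str            # supporting or contradicting text
    supports_claim: bool    # True if supporting, False if contradicting


MIN_RELEVANCE = 0.5  # hypothetical threshold


def filter_evidence(items: list, min_relevance: float = MIN_RELEVANCE) -> list:
    """Drop items whose relevance falls below the minimum threshold."""
    return [item for item in items if item.relevance >= min_relevance]
```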

The Scoring Computation

The scoring engine takes aggregated evidence and produces a final trust score. The algorithm considers:

  1. Evidence weight — more corroborating sources increase confidence
  2. Source diversity — evidence from multiple independent source types is stronger than multiple results from the same source type
  3. Contradiction detection — evidence that contradicts the claim reduces the score proportionally to its relevance and authority
  4. Confidence calibration — the final score is calibrated to reflect the probability that the claim is factually correct, based on training data from 500K+ human-verified claims

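Here's a simplified Python representation of the scoring computation. The Evidence and TrustScore classes are minimal stand-ins added so the example runs on its own; they are not SignalStack's production types.

python
from dataclasses import dataclass


# Illustrative stand-in types (assumptions, not the production definitions)
@dataclass
class Evidence:
    source_type: str   # e.g. "web_search", "knowledge_graph"
    relevance: float   # 0.0-1.0, how closely the evidence matches the claim
    authority: float   # 0.0-1.0, how trustworthy the source is
    supports: bool     # placeholder for a real claim-alignment check

    def aligns_with(self, claim: str) -> bool:
        return self.supports


@dataclass
class TrustScore:
    score: float
    verdict: str
    evidence_used: int
    source_types_used: int


def compute_trust_score(claim: str, evidence: list[Evidence]) -> TrustScore:
    # Diversity bonus: up to +0.1, reached once three distinct source types respond
    source_types = {e.source_type for e in evidence}
    diversity_bonus = min(len(source_types) / 3.0, 1.0) * 0.1

    # Split evidence into corroborating and contradicting items.
    # Simplification: anything that does not align counts as contradicting.
    corroborating = [e for e in evidence if e.aligns_with(claim)]
    contradicting = [e for e in evidence if not e.aligns_with(claim)]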

    corroboration_score = sum(
        e.relevance * e.authority for e in corroborating
    ) / max(len(corroborating), 1)

    contradiction_score = sum(
        e.relevance * e.authority for e in contradicting
    ) / max(len(contradicting), 1)

    # Combined score with diversity adjustment
    raw_score = (corroboration_score - contradiction_score + 1) / 2
    final_score = min(max(raw_score + diversity_bonus, 0.0), 1.0)

    return TrustScore(
        score=final_score,
        verdict=classify_verdict(final_score),
        evidence_used=len(corroborating) + len(contradicting),
        source_types_used=len(source_types),
    )


def classify_verdict(score: float) -> str:
    if score >= 0.85:
        return "pass"
    if score >= 0.70:
        return "warn"
    return "fail"
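As a worked example of the function above, take two corroborating items from two different source types, each with relevance 0.8 and authority 0.8, and no contradicting evidence. These input values are made up for illustration.

```latex
\begin{aligned}
\text{corroboration\_score} &= \frac{0.8 \times 0.8 + 0.8 \times 0.8}{2} = 0.64 \\
\text{contradiction\_score} &= 0 \\
\text{raw\_score} &= \frac{0.64 - 0 + 1}{2} = 0.82 \\
\text{diversity\_bonus} &= \min\!\left(\tfrac{2}{3},\, 1\right) \times 0.1 \approx 0.067 \\
\text{final\_score} &\approx 0.82 + 0.067 = 0.887 \;\Rightarrow\; \text{pass}
\end{aligned}
```

Without the diversity bonus, 0.82 would fall in the "warn" band (0.70 to 0.85), so in this simplified formula source diversity can tip a borderline claim over the pass threshold.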

Performance Optimizations

Achieving sub-200ms latency requires optimization at every layer:

  • Connection pooling — all source connections are pre-warmed and pooled
  • Content-addressable caching — identical or semantically similar claims hit a cache with 10ms response time
  • Speculative execution — if the fastest 3 sources all agree with high confidence, we can respond before waiting for slower sources
  • Streaming aggregation — evidence is processed incrementally as sources respond, not batched at the end
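The exact-match half of the caching idea can be sketched by keying results on a hash of the normalized claim text. This is an illustration under stated assumptions: the `VerificationCache` class is hypothetical, it uses an in-process dictionary rather than a shared cache, and it does not cover semantically similar claims, which would need an embedding-based lookup.

```python
import hashlib
import re
from typing import Optional


def cache_key(claim: str) -> str:
    """Content address: SHA-256 of the whitespace- and case-normalized claim."""
    normalized = re.sub(r"\s+", " ", claim).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class VerificationCache:
    """Hypothetical in-memory cache of verification results."""

    def __init__(self) -> None:
        self._store = {}

    def get(self, claim: str) -> Optional[dict]:
        return self._store.get(cache_key(claim))

    def put(self, claim: str, result: dict) -> None:
        self._store[cache_key(claim)] = result
```

Normalizing before hashing means trivially different spellings of the same claim, such as extra spaces or different capitalization, share one cache entry.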

Real-World Performance

Across production traffic in Q1 2026, SignalStack's verification pipeline achieved the following latency distribution:

  • P50: 168ms
  • P90: 245ms
  • P95: 310ms
  • P99: 410ms
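For readers who want to produce the same kind of breakdown from their own request logs, percentiles can be computed with Python's standard library. This is a generic sketch, not the tooling behind the figures above; the function name is hypothetical.

```python
import statistics


def latency_percentiles(samples_ms: list) -> dict:
    """Return P50/P90/P95/P99 from raw latency samples (needs at least two)."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points: P1..P99
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}
```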

For latency-sensitive applications (real-time chat, voice agents), enable speculative execution mode. This returns a preliminary score based on the fastest 3 sources within 100ms, then updates with a refined score when all sources complete. The tradeoff is a small reduction in accuracy on the preliminary score in exchange for a roughly 2x faster first response.
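The preliminary-then-refined pattern could be sketched as an async generator that yields an early verdict once the first few sources agree and a final verdict once all sources have responded. Everything here is an assumption for illustration: `query_source` is a hypothetical stand-in, verdicts are plain strings rather than scores, and the final verdict is a simple majority vote.

```python
import asyncio
from collections import Counter


async def query_source(name: str, delay_s: float, verdict: str) -> tuple:
    """Hypothetical source client that returns a verdict after a delay."""
    await asyncio.sleep(delay_s)
    return name, verdict


async def verify_speculatively(sources: list, quorum: int = 3):
    """Yield ("preliminary", verdict) once the first `quorum` responses agree,
    then ("final", verdict) after every source has responded."""
    tasks = [asyncio.create_task(query_source(*s)) for s in sources]
    verdicts = []
    for finished in asyncio.as_completed(tasks):
        _, verdict = await finished
        verdicts.append(verdict)
        if len(verdicts) == quorum and len(set(verdicts)) == 1:
            yield "preliminary", verdict
    yield "final", Counter(verdicts).most_common(1)[0][0]


async def collect(sources: list) -> list:
    return [update async for update in verify_speculatively(sources)]


def run_demo(sources: list) -> list:
    return asyncio.run(collect(sources))
```

If the first three responses disagree, no preliminary verdict is emitted and the caller only receives the final one, which matches the intent of responding early only when fast sources agree.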

Conclusion

Multi-source verification with a sub-200ms median is achievable through parallel architecture, intelligent caching, and optimized scoring algorithms. The key insight is that verification latency is a systems engineering problem, not a model capability problem. By treating each data source as an independent query path and aggregating results asynchronously, SignalStack delivers production-grade verification within the latency budget of real-time AI applications. For more details, visit /docs and /product.

Luke Swestun
Founder & CEO

Luke Swestun is the founder of SignalStack. He writes about trust infrastructure, hallucination detection, and building AI agents that can verify before they act.
