Confidence Scores Aren’t a Number: They’re a Policy
How to turn confidence scores into routing rules, sampling, and measurable workflow quality—without drowning reviewers in PDFs.
Confidence scores are only useful if they change what happens next.
CiteLLM returns a confidence score per extracted field and documents practical interpretation ranges (high confidence → often auto-approve; medium → quick verify; low → human review).
This post shows how to convert that into a real policy—one that improves speed and defensibility.
The mistake: using one global threshold for everything
A single threshold like “accept > 0.90” sounds clean.
It fails because:
- some fields are low-risk (e.g., document_title)
- some are high-risk (e.g., bank_account_number, liability_cap_text, income_amount)
- some are “workflow critical” even if low dollar value (e.g., renewal notice window)
Your policy should be field-aware and decision-aware.
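A two-line example makes the failure concrete. A quick sketch in Python, using the field names above and a hypothetical 0.90 global cutoff:

# With one global threshold, the same score drives opposite mistakes:
# a risky field squeaks past while a harmless one wastes reviewer time.
GLOBAL_THRESHOLD = 0.90

extracted = {
    "bank_account_number": 0.91,   # high-risk field, silently auto-approved
    "document_title": 0.89,        # low-risk field, sent to pointless review
}

for field, score in extracted.items():
    action = "auto_approve" if score > GLOBAL_THRESHOLD else "review"
    print(f"{field}: {score} -> {action}")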
A policy table you can implement today
CiteLLM suggests confidence ranges and provides confidence_threshold as an option.
Map that into workflow actions like this:
- 0.95–1.0 — auto-approve (with sampling) for low-risk, high-volume fields
- 0.85–0.94 — quick verify with click-to-highlight evidence
- 0.70–0.84 — required review with edit/flag
- < 0.70 — manual / block for edge cases and messy scans
(Adjust per field risk.)
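As a starting point, that table is just a short routing function. A minimal sketch in Python; the band edges and action names mirror the list above and are meant to be tuned per field:

def route_by_confidence(confidence: float) -> str:
    """Map a per-field confidence score to a workflow action."""
    if confidence >= 0.95:
        return "auto_approve"     # low-risk, high-volume fields; sampled later
    if confidence >= 0.85:
        return "quick_verify"     # click-to-highlight evidence
    if confidence >= 0.70:
        return "required_review"  # reviewer can edit or flag
    return "manual"               # edge cases, messy scans

Keep the bands in one place so they are easy to audit; per-field overrides come in the next sections.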
Implement confidence_threshold as a routing primitive
A common pattern:
- run extraction with a threshold (e.g., 0.85)
- any field below the threshold is either omitted or flagged for manual handling
- the UI focuses reviewers on what matters
CiteLLM exposes confidence_threshold in request options.
Example:
{
  "document_id": "doc_xyz789",
  "schema": {
    "invoice_number": { "type": "string" },
    "invoice_date": { "type": "date" },
    "total_amount": { "type": "number" }
  },
  "options": { "confidence_threshold": 0.85 }
}
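If you would rather keep low-confidence fields in the response and route them in your own code, the split is a few lines. A sketch that assumes a response shape of field name → value plus confidence, which is an illustration rather than CiteLLM's documented format:

REVIEW_THRESHOLD = 0.85

def split_for_review(fields: dict) -> tuple[dict, dict]:
    """Partition extracted fields into auto-path and reviewer-queue buckets."""
    accepted, needs_review = {}, {}
    for name, result in fields.items():
        # result is assumed to look like {"value": ..., "confidence": 0.91}
        if result.get("confidence", 0.0) >= REVIEW_THRESHOLD:
            accepted[name] = result
        else:
            needs_review[name] = result  # surface these in the reviewer UI
    return accepted, needs_review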
Add a second dimension: business impact
Confidence alone is not enough.
Create a field risk tier:
- Tier A (high risk): bank details, identity signals, legal caps, underwriting income
- Tier B (medium): totals, dates, key identifiers
- Tier C (low): titles, headers, non-decision metadata
Then:
- Tier A: require review unless extremely high confidence
- Tier B: review on medium confidence
- Tier C: auto-approve with sampling
This is how you reduce reviewer load without creating silent failure.
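Put together, the decision takes both inputs. A sketch with hypothetical field-to-tier assignments and illustrative thresholds:

FIELD_TIER = {
    "bank_account_number": "A",
    "income_amount": "A",
    "total_amount": "B",
    "invoice_date": "B",
    "document_title": "C",
}

AUTO_APPROVE_FLOOR = {"A": 0.99, "B": 0.95, "C": 0.90}  # Tier A: "extremely high" only
MANUAL_FLOOR = 0.70                                      # below this, block / manual handling

def decide(field: str, confidence: float) -> str:
    tier = FIELD_TIER.get(field, "A")  # unknown fields default to the highest-risk tier
    if confidence >= AUTO_APPROVE_FLOOR[tier]:
        return "auto_approve"          # still subject to sampling (next section)
    if confidence >= MANUAL_FLOOR:
        return "review"
    return "manual"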
Sampling is what keeps auto-approval honest
If you auto-approve anything, sample it.
A practical policy:
- randomly sample 1–5% of “auto-approved” fields weekly
- sample 100% of “Tier A auto-approved” until proven safe
- track override rate from samples and adjust thresholds
Sampling creates a learning loop without making everything human-reviewed.
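A sketch of that loop, with illustrative sampling rates and an assumed record shape:

import random

SAMPLE_RATE = {"A": 1.0, "B": 0.05, "C": 0.01}  # Tier A: 100% until proven safe

def pick_weekly_samples(auto_approved: list[dict]) -> list[dict]:
    """auto_approved records are assumed to carry at least "field", "tier", "value"."""
    return [r for r in auto_approved if random.random() < SAMPLE_RATE[r["tier"]]]

def override_rate(reviewed_samples: list[dict]) -> float:
    """Share of sampled fields the reviewer corrected; use it to tune thresholds."""
    if not reviewed_samples:
        return 0.0
    changed = sum(1 for r in reviewed_samples if r.get("reviewer_changed_value"))
    return changed / len(reviewed_samples)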
When something fails: make errors actionable, not mysterious
Even a perfect policy hits operational issues:
- invalid schema
- invalid document
- rate limiting
- quota exceeded
CiteLLM documents error codes and rate limits—use those to build clear retry/escalation behavior.
Examples to plan for:
- rate_limit_exceeded (429): backoff + queue
- quota_exceeded (402): stop-the-line + notify billing/admin
- invalid_schema (400): developer error; surface field path
- invalid_document (400): user-facing; “can’t parse PDF”
Also capture rate-limit headers so you can tune concurrency and avoid user-facing timeouts.
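A sketch of that retry/escalation mapping. Only the error codes come from the list above; the backoff numbers and return values are illustrative:

import time

def handle_extraction_error(error_code: str, attempt: int) -> str:
    if error_code == "rate_limit_exceeded":   # 429
        time.sleep(min(2 ** attempt, 60))     # exponential backoff, capped at 60s
        return "retry"
    if error_code == "quota_exceeded":        # 402
        return "stop_the_line"                # pause the queue, notify billing/admin
    if error_code == "invalid_schema":        # 400: developer error
        return "fix_schema"                   # log the failing field path
    if error_code == "invalid_document":      # 400: user-facing
        return "reject_document"              # e.g., "we can't parse this PDF"
    return "escalate"                         # anything unrecognized goes to a human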
The KPI that matters: “time-to-verify,” not “model accuracy”
In real operations, success looks like:
- reviewers spend seconds, not minutes, verifying exceptions
- high-risk fields are never “trust me”
- audits don’t trigger scavenger hunts
Track:
- median reviewer time per document
- % of documents that are “no touch”
- override rate by field
- escalation rate by template/vendor
These metrics tell you where to refine schemas and thresholds.
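They roll up straight from your review logs. A sketch, assuming one log record per document with hypothetical fields like seconds_spent, touched, and overrides:

from statistics import median

def workflow_kpis(review_log: list[dict]) -> dict:
    """review_log records look like
    {"seconds_spent": 42, "touched": True, "overrides": {"total_amount": 1}}."""
    if not review_log:
        return {}
    overrides_by_field: dict[str, int] = {}
    for record in review_log:
        for field, count in record.get("overrides", {}).items():
            overrides_by_field[field] = overrides_by_field.get(field, 0) + count
    return {
        "median_reviewer_seconds": median(r["seconds_spent"] for r in review_log),
        "no_touch_rate": sum(1 for r in review_log if not r["touched"]) / len(review_log),
        "overrides_by_field": overrides_by_field,
    }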
Takeaway
Confidence scores aren’t a model metric.
They’re a workflow control plane—routing, sampling, and defensibility.
If you treat confidence as policy (not a number), you’ll ship automation that gets used.