Confidence Scores Aren’t a Number: They’re a Policy

How to turn confidence scores into routing rules, sampling, and measurable workflow quality—without drowning reviewers in PDFs.

Confidence scores are only useful if they change what happens next.

CiteLLM returns a confidence score per extracted field and documents practical interpretation ranges (high confidence → often auto-approve; medium → quick verify; low → human review).

This post shows how to convert that into a real policy—one that improves speed and defensibility.

The mistake: using one global threshold for everything

A single threshold like “accept > 0.90” sounds clean.

It fails because:

  • some fields are low-risk (e.g., document_title)
  • some are high-risk (e.g., bank_account_number, liability_cap_text, income_amount)
  • some are “workflow critical” even if low dollar value (e.g., renewal notice window)

Your policy should be field-aware and decision-aware.

A policy table you can implement today

CiteLLM suggests confidence ranges and provides confidence_threshold as an option.

Map that into workflow actions like this:

  • 0.95–1.00: auto-approve (with sampling); best suited to low-risk, high-volume fields
  • 0.85 to below 0.95: quick verify with click-to-highlight evidence
  • 0.70 to below 0.85: required review with edit/flag
  • below 0.70: manual handling or block; typical of edge cases and messy scans

(Adjust per field risk.)
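
As a rough sketch, the bands can be expressed as a small lookup. The Action names and the route function are hypothetical (they are not part of CiteLLM), and each band's lower bound is treated as inclusive so that a score such as 0.945 still lands in a band:

```python
from enum import Enum


class Action(Enum):
    AUTO_APPROVE = "auto_approve"        # still subject to sampling
    QUICK_VERIFY = "quick_verify"
    REQUIRED_REVIEW = "required_review"
    MANUAL = "manual"


# (inclusive lower bound, action), checked from highest to lowest.
# The cutoffs follow the bands in this post; adjust them per field risk.
BANDS = [
    (0.95, Action.AUTO_APPROVE),
    (0.85, Action.QUICK_VERIFY),
    (0.70, Action.REQUIRED_REVIEW),
]


def route(confidence: float) -> Action:
    """Return the workflow action for one field's confidence score."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    for lower_bound, action in BANDS:
        if confidence >= lower_bound:
            return action
    return Action.MANUAL
```

Keeping the cutoffs in data rather than in nested if-statements makes it easier to swap in a different table per field later.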

Implement confidence_threshold as a routing primitive

A common pattern:

  • run extraction with a threshold (e.g., 0.85)
  • anything below is either omitted or flagged for manual handling
  • the UI focuses reviewers on what matters

CiteLLM exposes confidence_threshold in request options.

Example:

{
  "document_id": "doc_xyz789",
  "schema": {
    "invoice_number": { "type": "string" },
    "invoice_date": { "type": "date" },
    "total_amount": { "type": "number" }
  },
  "options": { "confidence_threshold": 0.85 }
}
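
On the consuming side, a minimal sketch of the "omitted or flagged" pattern might look like the following. The response shape (a mapping of field name to value and confidence) is an assumption made for illustration, not the documented CiteLLM response format, and split_by_threshold is a hypothetical helper:

```python
def split_by_threshold(requested, extracted, threshold=0.85):
    """Split requested field names into (accepted, needs_review).

    `extracted` maps field name -> {"value": ..., "confidence": float}.
    That shape is assumed for this sketch; adapt it to the real response.
    """
    accepted = {}
    needs_review = []
    for name in requested:
        field = extracted.get(name)
        # A field missing from the result is handled like a low-confidence one,
        # so nothing that was requested can silently disappear.
        if field is None or field.get("confidence", 0.0) < threshold:
            needs_review.append(name)
        else:
            accepted[name] = field["value"]
    return accepted, needs_review
```

Called with the three schema fields from the request above, a result that omits invoice_date and scores total_amount below 0.85 would put both of those names in needs_review and leave only invoice_number accepted.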

Add a second dimension: business impact

Confidence alone is not enough.

Create a field risk tier:

  • Tier A (high risk): bank details, identity signals, legal caps, underwriting income
  • Tier B (medium): totals, dates, key identifiers
  • Tier C (low): titles, headers, non-decision metadata

Then:

  • Tier A: require review unless extremely high confidence
  • Tier B: review on medium confidence
  • Tier C: auto-approve with sampling

This is how you reduce reviewer load without creating silent failure.
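
One way to combine the two dimensions is a per-tier cutoff table. This is a sketch only: the cutoff values, the field-to-tier mapping, and the decide function are illustrative assumptions, and unknown fields default to the strictest tier so that a new field cannot be auto-approved by accident:

```python
# Per-tier cutoffs: (auto_approve_at, quick_verify_at). Values are illustrative.
TIER_CUTOFFS = {
    "A": (0.99, None),   # high risk: no quick-verify band at all
    "B": (0.95, 0.85),   # medium risk: quick verify on medium confidence
    "C": (0.85, 0.70),   # low risk: auto-approve (and sample) most of the time
}

# Hypothetical mapping from schema field names to risk tiers.
FIELD_TIERS = {
    "bank_account_number": "A",
    "total_amount": "B",
    "document_title": "C",
}


def decide(field_name: str, confidence: float) -> str:
    """Return the workflow action for a field, given its tier and confidence."""
    tier = FIELD_TIERS.get(field_name, "A")  # unmapped fields get the strictest tier
    auto_at, verify_at = TIER_CUTOFFS[tier]
    if confidence >= auto_at:
        return "auto_approve"
    if verify_at is not None and confidence >= verify_at:
        return "quick_verify"
    return "required_review"
```

With these numbers, a 0.97 on bank_account_number still goes to review, while a 0.90 on document_title is auto-approved.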

Sampling is what keeps auto-approval honest

If you auto-approve anything, sample it.

A practical policy:

  • randomly sample 1–5% of “auto-approved” fields weekly
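  • sample 100% of “Tier A auto-approved” fields until their sampled override rate stays below a target you set in advance
  • track override rate from samples; tighten a threshold when overrides rise, and relax it only after the rate stays low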

Sampling creates a learning loop without making everything human-reviewed.
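
A minimal sketch of that loop, assuming each auto-approved item is a dict with a "tier" key and each audited item records an "overridden" flag (both record shapes and both function names are hypothetical):

```python
import random


def pick_samples(auto_approved, rate=0.02, rng=None):
    """Choose which auto-approved items to send to a human audit."""
    rng = rng or random.Random()
    picked = []
    for item in auto_approved:
        # Tier A is always audited; everything else is sampled at `rate`.
        if item["tier"] == "A" or rng.random() < rate:
            picked.append(item)
    return picked


def override_rate(audited):
    """Share of audited items where the reviewer changed the value.

    Returns None when nothing was audited, since "no data" is not the same as 0%.
    """
    if not audited:
        return None
    overridden = sum(1 for item in audited if item["overridden"])
    return overridden / len(audited)
```

Passing a seeded random.Random as rng makes a given week's sample reproducible, which helps when an auditor later asks why a particular field was or was not checked.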

When something fails: make errors actionable, not mysterious

Even a perfect policy hits operational issues:

  • invalid schema
  • invalid document
  • rate limiting
  • quota exceeded

CiteLLM documents error codes and rate limits—use those to build clear retry/escalation behavior.

Examples to plan for:

  • rate_limit_exceeded (429): backoff + queue
  • quota_exceeded (402): stop-the-line + notify billing/admin
  • invalid_schema (400): developer error; surface field path
  • invalid_document (400): user-facing; “can’t parse PDF”

Also capture rate-limit headers so you can tune concurrency and avoid user-facing timeouts.
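
The mapping above can be kept as data, with a generic backoff helper for the 429 case. In this sketch the error codes are the ones listed in this post, while the action names and both functions are hypothetical; rate-limit header names are not shown because they are not quoted here:

```python
import random

# Error code -> what the workflow should do next (action names are made up).
ERROR_ACTIONS = {
    "rate_limit_exceeded": "retry_with_backoff",   # 429
    "quota_exceeded": "halt_and_notify_admin",     # 402
    "invalid_schema": "report_to_developer",       # 400
    "invalid_document": "ask_user_for_new_file",   # 400
}


def action_for(error_code: str) -> str:
    """Look up the next step; anything unrecognized is escalated, not retried."""
    return ERROR_ACTIONS.get(error_code, "escalate_unknown")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Exponential backoff with full jitter.

    Returns a random delay in seconds between 0 and min(cap, base * 2**attempt).
    """
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))
```

Jitter spreads retries out so that queued workers do not all hit the API again at the same instant.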

The KPI that matters: “time-to-verify,” not “model accuracy”

In real operations, success looks like:

  • reviewers spend seconds, not minutes, verifying exceptions
  • high-risk fields are never “trust me”
  • audits don’t trigger scavenger hunts

Track:

  • median reviewer time per document
  • % documents that are “no touch”
  • override rate by field
  • escalation rate by template/vendor

These metrics tell you where to refine schemas and thresholds.
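
A sketch of how the first three metrics could be computed from review logs. The record shape (review_seconds, fields, overrides) and the workflow_kpis function are assumptions for illustration; the median here covers only documents a reviewer actually touched, and escalation rate by template/vendor would follow the same counting pattern:

```python
from collections import Counter
from statistics import median


def workflow_kpis(docs):
    """Compute workflow metrics from per-document review records.

    Each record is assumed to look like:
      {"review_seconds": 12, "fields": ["total_amount"], "overrides": ["total_amount"]}
    where review_seconds == 0 means no human touched the document.
    """
    if not docs:
        raise ValueError("no documents to summarize")
    reviewed_times = [d["review_seconds"] for d in docs if d["review_seconds"] > 0]
    no_touch = sum(1 for d in docs if d["review_seconds"] == 0)
    seen, overridden = Counter(), Counter()
    for d in docs:
        seen.update(d["fields"])          # how often each field was extracted
        overridden.update(d["overrides"]) # how often a reviewer changed it
    return {
        "median_review_seconds": median(reviewed_times) if reviewed_times else None,
        "no_touch_pct": 100.0 * no_touch / len(docs),
        "override_rate_by_field": {f: overridden[f] / n for f, n in seen.items()},
    }
```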

Takeaway

Confidence scores aren’t a model metric.

They’re a workflow control plane—routing, sampling, and defensibility.

If you treat confidence as policy (not a number), you’ll ship automation that gets used.
