Mismatches Are the Product: Cross-Document Reconciliation with Dual Citations
Catch contradictions early by extracting canonical fields across documents, comparing them, and showing two clickable citations for every mismatch.
The highest-value thing your system can do is not “extract more fields.”
It’s this:
Flag the mismatch, and show me the two pieces of proof that disagree.
CiteLLM’s use case examples repeatedly point to the same reality: in high-stakes workflows, it’s the discrepancies (not the misses) that cost you money and time. Citations make discrepancy review fast because every value can be traced back to its source region.
This post shows how to implement cross-document reconciliation in a way reviewers actually like.
Step 1: Normalize your document packs into canonical fields
Pick a canonical schema that works across related documents.
Example: AP 3-way match (invoice, PO, receipt)
Canonical keys:
- po_number
- vendor_name
- line_items (or a simplified representation)
- subtotal_amount
- tax_amount
- total_amount
- currency
- invoice_date / delivery_date
CiteLLM supports schemas for extraction and can run against PDFs provided as base64-encoded content, by URL, or as uploaded document IDs.
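As a concrete starting point, here is one way to express those canonical keys as a schema in Python. The shape below is illustrative only; CiteLLM's actual schema format may differ, so treat it as a sketch.

# Canonical field schema shared by the invoice, PO, and receipt extractions.
# Illustrative shape only, not CiteLLM's exact schema format.
CANONICAL_SCHEMA = {
    "po_number": {"type": "string"},
    "vendor_name": {"type": "string"},
    "line_items": {"type": "array", "items": {"type": "object"}},
    "subtotal_amount": {"type": "number"},
    "tax_amount": {"type": "number"},
    "total_amount": {"type": "number"},
    "currency": {"type": "string"},       # normalized to ISO 4217 in Step 3
    "invoice_date": {"type": "string"},   # parsed to ISO 8601 in Step 3
    "delivery_date": {"type": "string"},
}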
Step 2: Extract each doc with the same canonical keys
Don’t create an “invoice schema” and a totally different “PO schema” unless you must.
Even if a field is missing from one document, keeping the keys consistent reduces complexity.
Conceptually, the extraction requests all target the same field:
- Invoice PDF → total_amount + citations
- PO PDF → total_amount + citations
- Receipt PDF → total_amount + citations
Each extracted field includes a citation object (page, bbox, snippet, confidence).
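A minimal sketch of that fan-out, assuming a hypothetical extract_fields() wrapper around whichever CiteLLM endpoint you call. The URL, auth header, and request/response shapes below are placeholders, not the real API; only the idea matters: one schema, three documents. CANONICAL_SCHEMA is the dict from the Step 1 sketch.

import requests

API_KEY = "your-api-key"                                   # placeholder
INVOICE_URL = "https://example.com/docs/invoice_123.pdf"   # placeholder
PO_URL = "https://example.com/docs/po_456.pdf"             # placeholder
RECEIPT_URL = "https://example.com/docs/receipt_789.pdf"   # placeholder

def extract_fields(doc_url: str, schema: dict) -> dict:
    # Hypothetical wrapper: check the CiteLLM docs for the real endpoint
    # path, auth scheme, and request body.
    resp = requests.post(
        "https://api.citellm.example/v1/extract",           # placeholder URL
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"document_url": doc_url, "schema": schema},
        timeout=60,
    )
    resp.raise_for_status()
    # Expected shape: {field_name: {"value": ..., "citation": {...}}}
    return resp.json()

# Same canonical keys for every document in the pack.
pack = {
    "invoice": extract_fields(INVOICE_URL, CANONICAL_SCHEMA),
    "po": extract_fields(PO_URL, CANONICAL_SCHEMA),
    "receipt": extract_fields(RECEIPT_URL, CANONICAL_SCHEMA),
}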
Step 3: Normalize before comparing (so you don’t create fake mismatches)
Normalization steps that prevent “false exceptions”:
- currency normalization (symbols → ISO codes)
- decimal rounding policy
- date parsing into ISO
- whitespace/casing normalization for IDs
- unit normalization (“EA” vs “Each”)
Do this in your application layer after extraction.
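A few application-layer normalizers along these lines do most of the work. The symbol, format, and unit maps below are deliberately tiny; real ones grow with your document population.

from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}   # extend as needed
UNIT_ALIASES = {"ea": "EA", "each": "EA", "pc": "EA"}      # extend as needed

def normalize_currency(raw: str) -> str:
    raw = raw.strip()
    return CURRENCY_SYMBOLS.get(raw, raw.upper())

def normalize_amount(raw) -> Decimal:
    # Strip symbols and thousands separators, then apply one rounding policy everywhere.
    cleaned = str(raw).replace(",", "").strip("$€£ ")
    return Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def normalize_date(raw: str, fmts=("%m/%d/%Y", "%d-%b-%Y", "%Y-%m-%d")) -> str:
    for fmt in fmts:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

def normalize_id(raw: str) -> str:
    # Whitespace/casing normalization for IDs like PO numbers.
    return "".join(raw.split()).upper()

def normalize_unit(raw: str) -> str:
    return UNIT_ALIASES.get(raw.strip().lower(), raw.strip().upper())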
Step 4: Compare and classify outcomes explicitly
Your compare step should return one of:
- match (within tolerance)
- mismatch (requires human decision)
- insufficient evidence (low confidence / missing field)
- suspicious (fraud or policy trigger)
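One way to encode those four outcomes, with the tolerance, confidence threshold, and "suspicious" trigger all as placeholders you would tune per policy:

from decimal import Decimal

def classify(invoice_field: dict, po_field: dict,
             tolerance: Decimal = Decimal("25.00"),
             min_confidence: float = 0.80) -> str:
    # Missing value or low-confidence extraction on either side:
    # not enough evidence to call it a mismatch yet.
    for field in (invoice_field, po_field):
        if field is None or field.get("value") is None:
            return "insufficient_evidence"
        if field["citation"]["confidence"] < min_confidence:
            return "insufficient_evidence"

    delta = abs(Decimal(str(invoice_field["value"])) - Decimal(str(po_field["value"])))
    if delta <= tolerance:
        return "match"
    # Example policy trigger only: an outsized gap goes to a fraud/policy queue.
    if delta > tolerance * 20:
        return "suspicious"
    return "mismatch"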
Step 5: For every mismatch, show dual citations
This is the “aha” moment for reviewers.
Instead of:
“Mismatch detected”
Show:
- invoice value + citation (page/bbox/snippet)
- PO/receipt value + citation (page/bbox/snippet)
- the delta (what differs)
- a recommended action (accept invoice / accept PO / partial receipt / escalate)
A mismatch object you can store and display:
{
"mismatch_type": "total_amount",
"invoice": {
"value": 4250.00,
"citation": { "page": 1, "bbox": [300, 245, 420, 270], "snippet": "Total: $4,250.00", "confidence": 0.95 }
},
"po": {
"value": 4100.00,
"citation": { "page": 2, "bbox": [280, 510, 420, 535], "snippet": "Total Amount: $4,100.00", "confidence": 0.93 }
},
"delta": 150.00,
"policy": {
"tolerance": 25.00,
"decision_required": true
}
}
This is exactly what citations unlock: fast, defensible, side-by-side verification.
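If it helps, a small builder can assemble that record from the two extracted fields. The key names mirror the JSON above; the delta and decision logic are just an example.

def build_mismatch(field_name: str, invoice_field: dict, po_field: dict,
                   tolerance: float) -> dict:
    delta = round(invoice_field["value"] - po_field["value"], 2)
    return {
        "mismatch_type": field_name,
        "invoice": {"value": invoice_field["value"],
                    "citation": invoice_field["citation"]},
        "po": {"value": po_field["value"],
               "citation": po_field["citation"]},
        "delta": delta,
        "policy": {"tolerance": tolerance,
                   "decision_required": abs(delta) > tolerance},
    }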
Cross-document reconciliation patterns that deliver real ROI
You can apply the same playbook across industries:
- Underwriting: tax return vs bank deposits. Flag income inconsistencies and show both citations.
- Diligence: deck metrics vs audited financials. Catch inflated KPIs by comparing values and surfacing citations to both sources.
- Contracts: amendment language vs original agreement. Detect term drift and force a reviewer decision with both citations shown.
Use confidence to prioritize (not to hide problems)
Confidence helps you triage mismatches:
- if one side is low confidence, route to “insufficient evidence”
- if both sides are high confidence but disagree, route to “real mismatch”
CiteLLM provides confidence scores per field and documents how to interpret ranges.
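In the classify() sketch from Step 4, this triage is the confidence gate. Pulled out on its own, the routing rule is short (the 0.80 threshold is a placeholder, not CiteLLM guidance):

def route_by_confidence(invoice_field: dict, po_field: dict,
                        min_confidence: float = 0.80) -> str:
    lowest = min(invoice_field["citation"]["confidence"],
                 po_field["citation"]["confidence"])
    if lowest < min_confidence:
        return "insufficient_evidence"   # re-extract or ask a human to read the source
    return "compare"                     # both sides are trustworthy, so a disagreement is real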
What to measure
Cross-doc reconciliation success shows up as:
- % mismatches resolved within SLA
- median time-to-resolution per mismatch
- reviewer minutes saved vs baseline
- $ recovered / prevented (credits, avoided overpayments, prevented underwriting errors)
- repeat mismatch rate by vendor/template
Takeaway
The best document AI systems don’t brag about extraction coverage.
They surface contradictions early—and make resolving them faster than arguing about them.
Dual citations turn mismatches into a one-click decision instead of a 20-minute PDF scavenger hunt.