CiteLLM Documentation
CiteLLM extracts structured data from PDFs and provides precise citations for every extracted value. Each field comes with page numbers, bounding boxes, source text snippets, and confidence scores.
Base URL: https://api.citellm.com
Quick Start
Extract your first document with citations in three steps:
Get your API key
Sign up at citellm.com and create an API key from your dashboard.
Send a request
POST your PDF (base64 encoded) with a schema defining what to extract:
curl -X POST https://api.citellm.com/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document": "JVBERi0xLjQK...",
    "schema": {
      "company_name": { "type": "string" },
      "total_revenue": { "type": "number" }
    }
  }'
Get cited results
Receive extracted data with exact source locations:
{
  "data": {
    "company_name": "Acme Corporation",
    "total_revenue": 4250000
  },
  "citations": {
    "company_name": {
      "page": 1,
      "bbox": [72, 120, 280, 145],
      "snippet": "ACME CORPORATION Annual Report",
      "confidence": 0.98
    },
    "total_revenue": {
      "page": 8,
      "bbox": [300, 245, 420, 270],
      "snippet": "Total Revenue: $4,250,000",
      "confidence": 0.95
    }
  }
}
Authentication
All API requests require a Bearer token in the Authorization header:
Authorization: Bearer YOUR_API_KEY
API keys can be created and managed in your dashboard. Keep your keys secure and never expose them in client-side code.
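One way to keep the key out of source code is to read it from the environment when building headers. The sketch below is illustrative only: the variable name `CITELLM_API_KEY` and the helper `auth_headers` are assumptions made for this example, not part of the API.

```python
import os
from typing import Dict, Optional


def auth_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Build request headers carrying the Bearer token.

    CITELLM_API_KEY is a hypothetical environment variable name chosen
    for this sketch; the API itself does not define it.
    """
    key = api_key or os.environ.get("CITELLM_API_KEY")
    if not key:
        raise RuntimeError("No API key: pass one in or set CITELLM_API_KEY")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Calling `auth_headers()` on a server with the variable set returns a dictionary that can be passed to any HTTP client.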
Extract Data
The primary endpoint for extracting structured data with citations.
POST /v1/extract
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| document | string | Yes* | Base64-encoded PDF content |
| document_url | string | Yes* | URL to fetch PDF (alternative to document) |
| document_id | string | Yes* | ID of previously uploaded document |
| schema | object | Yes | Field definitions for extraction |
| options | object | No | Extraction options |
* One of document, document_url, or document_id is required.
Options
| Option | Type | Default | Description |
|---|---|---|---|
| confidence_threshold | number | 0.0 | Minimum confidence to include (0.0-1.0) |
| include_alternatives | boolean | false | Include alternative extractions |
| language | string | "auto" | Document language hint |
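The request body and options tables can be turned into a small client-side builder that catches mistakes before a request is sent. This is a sketch under one stated assumption: it treats the three document sources as mutually exclusive (exactly one per request), which the tables above suggest but do not spell out. The function name `build_extract_request` is hypothetical.

```python
from typing import Any, Dict, Optional


def build_extract_request(
    schema: Dict[str, Any],
    *,
    document: Optional[str] = None,
    document_url: Optional[str] = None,
    document_id: Optional[str] = None,
    confidence_threshold: float = 0.0,
    include_alternatives: bool = False,
    language: str = "auto",
) -> Dict[str, Any]:
    """Assemble a JSON-serializable body for POST /v1/extract."""
    sources = {
        "document": document,
        "document_url": document_url,
        "document_id": document_id,
    }
    given = {key: value for key, value in sources.items() if value is not None}
    # Assumption: exactly one document source may be supplied per request.
    if len(given) != 1:
        raise ValueError("Provide exactly one of document, document_url, document_id")
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")

    body: Dict[str, Any] = dict(given)
    body["schema"] = schema
    body["options"] = {
        "confidence_threshold": confidence_threshold,
        "include_alternatives": include_alternatives,
        "language": language,
    }
    return body
```

The defaults mirror the Options table, so omitting an argument should behave the same as omitting the option.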
Response
{
  "id": "ext_abc123xyz",
  "status": "completed",
  "data": {
    "company_name": "Acme Corporation",
    "total_revenue": 4250000,
    "fiscal_year_end": "2024-12-31"
  },
  "citations": {
    "company_name": {
      "page": 1,
      "bbox": [72, 120, 280, 145],
      "snippet": "ACME CORPORATION Annual Report",
      "confidence": 0.98
    },
    "total_revenue": {
      "page": 8,
      "bbox": [300, 245, 420, 270],
      "snippet": "Total Revenue: $4,250,000",
      "confidence": 0.95
    },
    "fiscal_year_end": {
      "page": 1,
      "bbox": [400, 120, 520, 145],
      "snippet": "Year Ended December 31, 2024",
      "confidence": 0.97
    }
  },
  "document": {
    "id": "doc_xyz789",
    "pages": 24
  },
  "created_at": "2024-01-15T10:30:00Z"
}
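Because `data` and `citations` are keyed by the same field names, a response like the one above can be walked field by field. The following sketch pairs each value with its citation; the helper names `cited_fields` and `describe` are illustrative, and it tolerates a missing citation rather than assuming one is always present.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

Citation = Optional[Dict[str, Any]]


def cited_fields(result: Dict[str, Any]) -> Iterator[Tuple[str, Any, Citation]]:
    """Yield (field name, extracted value, citation or None) for each field."""
    citations = result.get("citations", {})
    for name, value in result.get("data", {}).items():
        yield name, value, citations.get(name)


def describe(name: str, value: Any, citation: Citation) -> str:
    """Render one extracted field with its source location."""
    if citation is None:
        return f"{name} = {value!r} (no citation)"
    return (
        f"{name} = {value!r} "
        f"(page {citation['page']}, confidence {citation['confidence']:.2f})"
    )
```

A typical use is `for row in cited_fields(result): print(describe(*row))` after decoding the JSON response.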
Documents
Upload and manage documents for repeated extractions.
POST /v1/documents
Upload a document for later use. Send as multipart/form-data:
curl -X POST https://api.citellm.com/v1/documents \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf"
GET /v1/documents/:id
Retrieve document metadata.
DELETE /v1/documents/:id
Delete a document and its associated extractions.
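For repeated extractions, a stored document can be referenced by ID instead of re-sending the PDF. The sketch below builds a follow-up request body from an earlier extraction result. It assumes that the `document.id` value returned by /v1/extract is accepted as `document_id` in later requests, which this page implies but does not state outright; `followup_request` is a hypothetical helper name.

```python
from typing import Any, Dict


def followup_request(previous_result: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Build a new /v1/extract body that reuses the document from an earlier result.

    Assumption: the ID under previous_result["document"]["id"] is valid
    as the document_id request field.
    """
    document_id = previous_result["document"]["id"]
    return {"document_id": document_id, "schema": schema}
```

This keeps the second request small, since only the ID and the new schema are transmitted.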
Extractions
Retrieve past extraction results.
GET /v1/extractions/:id
Get a specific extraction by ID.
GET /v1/extractions
List all extractions. Supports pagination:
| Parameter | Default | Description |
|---|---|---|
| limit | 20 | Results per page (max 100) |
| offset | 0 | Pagination offset |
| status | all | Filter: pending, completed, failed |
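The `limit` and `offset` parameters support a simple paging loop. The sketch below builds the query URL and walks pages until a short page is returned. The shape of the list response is not documented here, so the network call is left to a caller-supplied `fetch_page` function that is assumed to return a plain list of extraction objects for a given URL; both helper names are hypothetical.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional
from urllib.parse import urlencode

BASE_URL = "https://api.citellm.com"


def list_url(limit: int = 20, offset: int = 0, status: Optional[str] = None) -> str:
    """Build the /v1/extractions URL for one page of results."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    params: Dict[str, Any] = {"limit": limit, "offset": offset}
    if status is not None:  # Omitting status leaves the default ("all")
        params["status"] = status
    return f"{BASE_URL}/v1/extractions?{urlencode(params)}"


def iter_extractions(
    fetch_page: Callable[[str], List[Dict[str, Any]]],
    limit: int = 20,
    status: Optional[str] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield extractions page by page until a page comes back short."""
    offset = 0
    while True:
        page = fetch_page(list_url(limit, offset, status))
        yield from page
        if len(page) < limit:  # A short (or empty) page means no more results
            return
        offset += limit
```

Stopping on a short page costs one extra request when the total is an exact multiple of `limit`, but avoids depending on a total-count field that may not exist.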
Schemas
Schemas define what data to extract from documents. Each field specifies a type and optional description.
{
  "schema": {
    "field_name": {
      "type": "string",
      "description": "Optional hint for the extractor"
    }
  }
}
Supported Types
| Type | Description | Example Output |
|---|---|---|
| string | Text values | "Acme Corp" |
| number | Integers or decimals | 4250000 |
| date | ISO 8601 dates | "2024-12-31" |
| boolean | True/false values | true |
| array | List of values | ["item1", "item2"] |
Example Schema
{
  "schema": {
    "company_name": {
      "type": "string",
      "description": "Legal name of the company"
    },
    "total_revenue": {
      "type": "number",
      "description": "Annual revenue in USD"
    },
    "fiscal_year_end": {
      "type": "date"
    },
    "is_audited": {
      "type": "boolean"
    },
    "board_members": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
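A schema can be checked locally against the supported types before it is sent, which may save a round trip that would end in `invalid_schema`. This is a minimal sketch, not the server's validation logic: it only checks the `type` of each field and, when present, the `items.type` of arrays. The function name `schema_errors` is hypothetical.

```python
from typing import Any, Dict, List

SUPPORTED_TYPES = {"string", "number", "date", "boolean", "array"}


def schema_errors(schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in a schema; empty if none."""
    errors: List[str] = []
    for name, spec in schema.items():
        path = f"schema.{name}"
        field_type = spec.get("type")
        if field_type not in SUPPORTED_TYPES:
            errors.append(f"{path}.type: unsupported type {field_type!r}")
            continue
        # Only check array items when they are declared; whether "items"
        # is mandatory is not stated in the documentation.
        if field_type == "array" and "items" in spec:
            item_type = spec["items"].get("type")
            if item_type not in SUPPORTED_TYPES:
                errors.append(f"{path}.items.type: unsupported type {item_type!r}")
    return errors
```

Passing the inner `schema` object from the example above returns an empty list.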
Citations
Every extracted field includes a citation object pointing to its exact source location.
| Field | Type | Description |
|---|---|---|
| page | integer | Page number (1-indexed) |
| bbox | array | Bounding box [x1, y1, x2, y2] in points |
| snippet | string | Source text containing the value |
| confidence | number | Confidence score (0.0 - 1.0) |
Bounding Box
The bbox array contains coordinates in PDF points (72 points = 1 inch), measured from the bottom-left corner of the page:
- x1, y1: Bottom-left corner
- x2, y2: Top-right corner
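Most image and canvas libraries place the origin at the top-left and work in pixels, so highlighting a citation on a rendered page usually needs a coordinate flip and a scale. The sketch below shows that conversion. The page height in points is not part of the citation object, so it has to come from the caller (for example, 792 for a US Letter page); `bbox_to_pixels` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(
    bbox: Sequence[float],
    page_height_pt: float,
    dpi: float = 72.0,
) -> Tuple[float, float, float, float]:
    """Convert [x1, y1, x2, y2] in PDF points (bottom-left origin)
    to (left, top, right, bottom) in pixels (top-left origin)."""
    x1, y1, x2, y2 = bbox
    scale = dpi / 72.0  # 72 points = 1 inch
    left = x1 * scale
    right = x2 * scale
    # Flip the vertical axis: the top edge is measured down from the page top.
    top = (page_height_pt - y2) * scale
    bottom = (page_height_pt - y1) * scale
    return (left, top, right, bottom)
```

For instance, `bbox_to_pixels([72, 120, 280, 145], page_height_pt=792, dpi=144)` gives `(144.0, 1294.0, 560.0, 1344.0)`.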
Confidence Scores
Each extraction includes a confidence score from 0.0 to 1.0 indicating extraction reliability.
| Range | Interpretation | Recommendation |
|---|---|---|
| 0.95 - 1.0 | High confidence | Auto-approve in most workflows |
| 0.85 - 0.94 | Medium confidence | Quick human verification |
| 0.70 - 0.84 | Low confidence | Requires human review |
| Below 0.70 | Uncertain | Manual extraction recommended |
Use the confidence_threshold option to filter out low-confidence extractions:
{
  "document": "...",
  "schema": { ... },
  "options": {
    "confidence_threshold": 0.85
  }
}
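The recommendation table above can also be applied client-side to route each field to a review queue. The sketch below maps a confidence score to one of four actions using the table's lower bounds as cutoffs; the action labels and function names are illustrative choices, not values returned by the API.

```python
from typing import Any, Dict


def review_action(confidence: float) -> str:
    """Map a confidence score to a review action using the documented ranges."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= 0.95:
        return "auto_approve"
    if confidence >= 0.85:
        return "quick_verification"
    if confidence >= 0.70:
        return "human_review"
    return "manual_extraction"


def triage(citations: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
    """Return the review action for every field in a citations object."""
    return {name: review_action(c["confidence"]) for name, c in citations.items()}
```

Using `>=` on the lower bounds means a score such as 0.945, which falls between two listed ranges, lands in the lower tier.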
Code Examples
Make HTTP requests to the CiteLLM API from any language.
Python
import base64

import requests

# Read and encode the PDF
with open("report.pdf", "rb") as f:
    document_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.citellm.com/v1/extract",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "document": document_b64,
        "schema": {
            "company_name": {"type": "string"},
            "total_revenue": {"type": "number"}
        }
    }
)
response.raise_for_status()  # Fail early on 4xx/5xx responses

result = response.json()
print(result["data"]["company_name"])
print(result["citations"]["company_name"]["snippet"])
JavaScript (Node.js)
import fs from 'fs';

// Read and encode the PDF
const document = fs.readFileSync('report.pdf');
const documentB64 = document.toString('base64');

const response = await fetch('https://api.citellm.com/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    document: documentB64,
    schema: {
      company_name: { type: 'string' },
      total_revenue: { type: 'number' }
    }
  })
});

const result = await response.json();
console.log(result.data.company_name);
console.log(result.citations.company_name.snippet);
cURL
# Encode the PDF to base64 on a single line (wrapped output would break the JSON string)
DOCUMENT=$(base64 < report.pdf | tr -d '\n')

curl -X POST https://api.citellm.com/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"document\": \"$DOCUMENT\",
    \"schema\": {
      \"company_name\": {\"type\": \"string\"},
      \"total_revenue\": {\"type\": \"number\"}
    }
  }"
Go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Read and encode the PDF
	pdfData, err := os.ReadFile("report.pdf")
	if err != nil {
		log.Fatalf("read PDF: %v", err)
	}
	documentB64 := base64.StdEncoding.EncodeToString(pdfData)

	payload := map[string]interface{}{
		"document": documentB64,
		"schema": map[string]interface{}{
			"company_name":  map[string]string{"type": "string"},
			"total_revenue": map[string]string{"type": "number"},
		},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		log.Fatalf("encode request: %v", err)
	}

	req, err := http.NewRequest("POST", "https://api.citellm.com/v1/extract", bytes.NewBuffer(body))
	if err != nil {
		log.Fatalf("build request: %v", err)
	}
	req.Header.Set("Authorization", "Bearer YOUR_API_KEY")
	req.Header.Set("Content-Type", "application/json")

	// Send the request and print the raw JSON response
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatalf("send request: %v", err)
	}
	defer resp.Body.Close()

	result, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("read response: %v", err)
	}
	fmt.Println(string(result))
}
Docker Deployment
Run CiteLLM in your own infrastructure for full data sovereignty.
Quick Start
docker pull citellm/server:latest
docker run -d -p 8080:8080 citellm/server:latest
The API is now available at http://localhost:8080.
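Client code can switch between the hosted API and a local container by making the base URL configurable. The sketch below assumes the self-hosted server exposes the same /v1 paths as the hosted API, which this page does not confirm. The environment variable `CITELLM_BASE_URL` and the helper `endpoint` are hypothetical names introduced for this example.

```python
import os
from typing import Optional

DEFAULT_BASE_URL = "https://api.citellm.com"


def endpoint(path: str, base_url: Optional[str] = None) -> str:
    """Join a base URL and an API path without doubling or dropping slashes.

    CITELLM_BASE_URL is a hypothetical variable name used only in this sketch.
    """
    base = base_url or os.environ.get("CITELLM_BASE_URL", DEFAULT_BASE_URL)
    return base.rstrip("/") + "/" + path.lstrip("/")
```

With the container from the commands above running, `endpoint("/v1/extract", "http://localhost:8080")` yields `http://localhost:8080/v1/extract`.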
With Docker Compose
version: '3.8'
services:
  citellm:
    image: citellm/server:latest
    ports:
      - "8080:8080"
    environment:
      - CITATIONLLM_LICENSE_KEY=${LICENSE_KEY}
      - CITATIONLLM_DB_HOST=postgres
    depends_on:
      - postgres
      - redis
  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=citellm
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres-data:/var/lib/postgresql/data
  redis:
    image: redis:7-alpine
volumes:
  postgres-data:
Requirements
- Docker 20.10+
- 8GB RAM minimum (16GB recommended)
- 4 CPU cores minimum
- NVIDIA GPU optional (for faster processing)
Configuration
Environment variables for self-hosted deployments:
| Variable | Default | Description |
|---|---|---|
| CITATIONLLM_PORT | 8080 | API server port |
| CITATIONLLM_LICENSE_KEY | - | Enterprise license key |
| CITATIONLLM_DB_HOST | localhost | PostgreSQL host |
| CITATIONLLM_REDIS_URL | - | Redis connection URL |
| CITATIONLLM_MAX_PAGES | 100 | Max pages per document |
| CITATIONLLM_USE_GPU | false | Enable GPU acceleration |
Error Codes
API errors return a JSON object with error details:
{
  "error": {
    "code": "invalid_schema",
    "message": "Schema field 'revenue' has invalid type 'money'",
    "param": "schema.revenue.type"
  }
}
| Code | HTTP | Description |
|---|---|---|
| invalid_api_key | 401 | Invalid or missing API key |
| rate_limit_exceeded | 429 | Too many requests |
| invalid_document | 400 | Cannot parse the document |
| invalid_schema | 400 | Schema validation failed |
| document_too_large | 413 | Document exceeds size limit |
| quota_exceeded | 402 | Monthly page quota exceeded |
| internal_error | 500 | Internal server error |
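The error object and code table can be wrapped in a small exception type so that callers branch on `code` rather than on message text. This is a sketch: the class name `CiteLLMError`, the `unknown_error` fallback code, and the choice of which codes count as retryable are assumptions made for illustration, not behavior defined by the API.

```python
import json
from typing import Optional

# Assumption: only these codes are worth retrying; the rest need a changed request.
RETRYABLE_CODES = {"rate_limit_exceeded", "internal_error"}


class CiteLLMError(Exception):
    """Client-side wrapper around the documented error object."""

    def __init__(self, status: int, code: str, message: str, param: Optional[str] = None):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message
        self.param = param

    @property
    def retryable(self) -> bool:
        return self.code in RETRYABLE_CODES


def parse_error(status: int, body: str) -> CiteLLMError:
    """Turn an HTTP status and response body into a CiteLLMError."""
    try:
        err = json.loads(body)["error"]
    except (ValueError, KeyError, TypeError):
        err = None
    if not isinstance(err, dict):
        # Body was not the documented JSON shape (e.g. a proxy error page)
        return CiteLLMError(status, "unknown_error", body)
    return CiteLLMError(
        status,
        err.get("code", "unknown_error"),
        err.get("message", ""),
        err.get("param"),
    )
```

A caller might raise the returned exception for any non-2xx response and retry only when `retryable` is true.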
Rate Limits
| Plan | Requests/min | Concurrent |
|---|---|---|
| Starter | 60 | 5 |
| Growth | 300 | 20 |
| Enterprise | Custom | Custom |
Rate limit headers are included in all responses:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1705312800
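These headers make it possible to pause before hitting a 429. The sketch below computes how long to wait, assuming `X-RateLimit-Reset` is a Unix timestamp in seconds (the sample value 1705312800 is consistent with that reading, but the page does not state the unit). The helper name `seconds_until_reset` is hypothetical, and the lookup expects header names spelled exactly as shown above.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before the next request; 0.0 if quota remains.

    Assumption: X-RateLimit-Reset is a Unix timestamp in seconds.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers["X-RateLimit-Reset"])
    current = time.time() if now is None else now
    # Never return a negative wait if the reset time has already passed
    return max(0.0, reset_at - current)
```

Passing `now` explicitly keeps the function deterministic in tests; in production code it can be omitted so the current time is used.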