CiteLLM Documentation

CiteLLM extracts structured data from PDFs and provides a precise citation for every extracted value. Each field comes with a page number, a bounding box, a source text snippet, and a confidence score.

Base URL
https://api.citellm.com

Quick Start

Extract your first document with citations in three steps:

1. Get your API key

Sign up at citellm.com and create an API key from your dashboard.

2. Send a request

POST your base64-encoded PDF together with a schema defining what to extract:

curl -X POST https://api.citellm.com/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document": "JVBERi0xLjQK...",
    "schema": {
      "company_name": { "type": "string" },
      "total_revenue": { "type": "number" }
    }
  }'

3. Get cited results

Receive extracted data with exact source locations:

{
  "data": {
    "company_name": "Acme Corporation",
    "total_revenue": 4250000
  },
  "citations": {
    "company_name": {
      "page": 1,
      "bbox": [72, 120, 280, 145],
      "snippet": "ACME CORPORATION Annual Report",
      "confidence": 0.98
    },
    "total_revenue": {
      "page": 8,
      "bbox": [300, 245, 420, 270],
      "snippet": "Total Revenue: $4,250,000",
      "confidence": 0.95
    }
  }
}

Authentication

All API requests require a Bearer token in the Authorization header:

Authorization: Bearer YOUR_API_KEY

API keys can be created and managed in your dashboard. Keep your keys secure and never expose them in client-side code.
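
One way to keep the key out of source code is to read it from an environment variable when building the request headers. The sketch below assumes a variable named CITELLM_API_KEY; that name and the auth_headers helper are illustrative and not part of the API.

```python
import os


def auth_headers(env=None):
    """Build request headers from an API key stored in the environment.

    CITELLM_API_KEY is an assumed variable name, not one defined by CiteLLM.
    """
    env = os.environ if env is None else env
    api_key = env.get("CITELLM_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("Set CITELLM_API_KEY before calling the API")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

The returned dict can be passed as the headers argument of whichever HTTP client you use.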

Extract Data

The primary endpoint for extracting structured data with citations.

POST /v1/extract

Request Body

Field         Type    Required  Description
document      string  Yes*      Base64-encoded PDF content
document_url  string  Yes*      URL to fetch PDF (alternative to document)
document_id   string  Yes*      ID of previously uploaded document
schema        object  Yes       Field definitions for extraction
options       object  No        Extraction options

* One of document, document_url, or document_id is required.
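
The one-of rule can be checked on the client before a request is sent. This is a minimal sketch: build_extract_body is a hypothetical helper, and rejecting more than one source is an assumption, since the note above only says that one is required.

```python
def build_extract_body(schema, document=None, document_url=None,
                       document_id=None, options=None):
    """Assemble a /v1/extract request body with exactly one document source."""
    sources = {
        "document": document,
        "document_url": document_url,
        "document_id": document_id,
    }
    provided = {name: value for name, value in sources.items() if value is not None}
    if len(provided) != 1:
        # Assumption: sending several sources at once is treated as a client error.
        raise ValueError(
            "Provide exactly one of document, document_url, or document_id"
        )
    body = dict(provided)
    body["schema"] = schema
    if options:
        body["options"] = options
    return body
```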

Options

Option                Type     Default  Description
confidence_threshold  number   0.0      Minimum confidence to include (0.0-1.0)
include_alternatives  boolean  false    Include alternative extractions
language              string   "auto"   Document language hint

Response

{
  "id": "ext_abc123xyz",
  "status": "completed",
  "data": {
    "company_name": "Acme Corporation",
    "total_revenue": 4250000,
    "fiscal_year_end": "2024-12-31"
  },
  "citations": {
    "company_name": {
      "page": 1,
      "bbox": [72, 120, 280, 145],
      "snippet": "ACME CORPORATION Annual Report",
      "confidence": 0.98
    },
    "total_revenue": {
      "page": 8,
      "bbox": [300, 245, 420, 270],
      "snippet": "Total Revenue: $4,250,000",
      "confidence": 0.95
    },
    "fiscal_year_end": {
      "page": 1,
      "bbox": [400, 120, 520, 145],
      "snippet": "Year Ended December 31, 2024",
      "confidence": 0.97
    }
  },
  "document": {
    "id": "doc_xyz789",
    "pages": 24
  },
  "created_at": "2024-01-15T10:30:00Z"
}
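
Because data and citations share the same field names, they can be joined into one record per field. The sketch below does that for an already-parsed response; cited_fields is a hypothetical helper, and the sample dict is a shortened copy of the response above.

```python
def cited_fields(result):
    """Yield (field, value, citation) for each extracted field.

    A field with no citation entry yields None for the citation. Whether the
    API can omit one is not stated here, so this is handled defensively.
    """
    citations = result.get("citations", {})
    for field, value in result.get("data", {}).items():
        yield field, value, citations.get(field)


sample = {
    "data": {"company_name": "Acme Corporation", "total_revenue": 4250000},
    "citations": {
        "company_name": {"page": 1, "bbox": [72, 120, 280, 145],
                         "snippet": "ACME CORPORATION Annual Report",
                         "confidence": 0.98},
        "total_revenue": {"page": 8, "bbox": [300, 245, 420, 270],
                          "snippet": "Total Revenue: $4,250,000",
                          "confidence": 0.95},
    },
}

for field, value, citation in cited_fields(sample):
    print(f"{field} = {value!r} (page {citation['page']}, "
          f"confidence {citation['confidence']})")
```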

Documents

Upload and manage documents for repeated extractions.

POST /v1/documents

Upload a document for later use. Send as multipart/form-data:

curl -X POST https://api.citellm.com/v1/documents \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf"

GET /v1/documents/:id

Retrieve document metadata.

DELETE /v1/documents/:id

Delete a document and its associated extractions.
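
For repeated extractions, a client can upload once and then pass the document's ID as document_id on each extract call. The sketch below only builds that follow-up request with the Python standard library and does not send it. It assumes the ID looks like the doc_xyz789 value shown in the extract response; extract_request_for is a hypothetical helper.

```python
import json
import urllib.request

BASE_URL = "https://api.citellm.com"


def extract_request_for(document_id, schema, api_key):
    """Build (but do not send) a POST /v1/extract request that reuses an upload."""
    body = json.dumps({"document_id": document_id, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/extract",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


req = extract_request_for(
    "doc_xyz789", {"total_revenue": {"type": "number"}}, "YOUR_API_KEY"
)
# To send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes the payload easy to inspect or log before any pages are processed.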

Extractions

Retrieve past extraction results.

GET /v1/extractions/:id

Get a specific extraction by ID.

GET /v1/extractions

List all extractions. Supports pagination and filtering by status:

Parameter  Default  Description
limit      20       Results per page (max 100)
offset     0        Pagination offset
status     all      Filter: pending, completed, failed
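
Walking the full list means advancing offset by limit until a short page comes back. The sketch below takes the page-fetching function as a parameter so it stays independent of any HTTP library. It assumes each call returns a plain list of extraction objects; the actual list response envelope is not shown here, so adapt fetch_page accordingly. The names iter_extractions, fetch_page, and fake_fetch are hypothetical.

```python
def iter_extractions(fetch_page, limit=20, status=None):
    """Yield every extraction by paging with limit/offset.

    fetch_page(params) must perform GET /v1/extractions with the given query
    parameters and return a list of items (an assumed response shape).
    """
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        params = {"limit": limit, "offset": offset}
        if status is not None:
            params["status"] = status
        page = fetch_page(params)
        yield from page
        if len(page) < limit:
            break  # a short (or empty) page means there is nothing left
        offset += limit


# Offline stand-in for the API: 45 fake extractions served in slices.
fake_store = [{"id": f"ext_{n}"} for n in range(45)]


def fake_fetch(params):
    start = params["offset"]
    return fake_store[start:start + params["limit"]]


all_items = list(iter_extractions(fake_fetch, limit=20))
```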

Schemas

Schemas define what data to extract from documents. Each field specifies a type and optional description.

{
  "schema": {
    "field_name": {
      "type": "string",
      "description": "Optional hint for the extractor"
    }
  }
}

Supported Types

Type     Description           Example Output
string   Text values           "Acme Corp"
number   Integers or decimals  4250000
date     ISO 8601 dates        "2024-12-31"
boolean  True/false values     true
array    List of values        ["item1", "item2"]

Example Schema

{
  "schema": {
    "company_name": {
      "type": "string",
      "description": "Legal name of the company"
    },
    "total_revenue": {
      "type": "number",
      "description": "Annual revenue in USD"
    },
    "fiscal_year_end": {
      "type": "date"
    },
    "is_audited": {
      "type": "boolean"
    },
    "board_members": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
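
A schema can be checked locally against the supported types before it is sent. This is a minimal sketch: validate_schema is a hypothetical helper, and requiring an items type on array fields is an assumption drawn from the example above.

```python
SUPPORTED_TYPES = {"string", "number", "date", "boolean", "array"}


def validate_schema(schema):
    """Return a list of problems found in a schema dict (empty if none)."""
    problems = []
    for name, spec in schema.items():
        field_type = spec.get("type") if isinstance(spec, dict) else None
        if field_type not in SUPPORTED_TYPES:
            problems.append(f"{name}: unsupported type {field_type!r}")
            continue
        if field_type == "array":
            # Assumption: array fields declare their element type via "items".
            items = spec.get("items")
            item_type = items.get("type") if isinstance(items, dict) else None
            if item_type not in SUPPORTED_TYPES:
                problems.append(f"{name}: unsupported items type {item_type!r}")
    return problems
```

An empty list means the schema only uses types from the table; the server may still apply checks that this sketch does not know about.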

Citations

Every extracted field includes a citation object pointing to its exact source location.

Field       Type     Description
page        integer  Page number (1-indexed)
bbox        array    Bounding box [x1, y1, x2, y2] in points
snippet     string   Source text containing the value
confidence  number   Confidence score (0.0 - 1.0)

Bounding Box

The bbox array contains coordinates in PDF points (72 points = 1 inch), measured from the bottom-left corner of the page:

  • x1, y1 - Bottom-left corner
  • x2, y2 - Top-right corner
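
Most image libraries place the origin at the top-left, so drawing a highlight over a rendered page means flipping the y axis and scaling from points to pixels. The sketch below shows that conversion. The page height (792 points for US Letter) and the render DPI are assumptions the caller must supply, since neither is part of the citation object; bbox_to_pixels is a hypothetical helper.

```python
def bbox_to_pixels(bbox, page_height_pt, dpi=144):
    """Convert [x1, y1, x2, y2] in PDF points (bottom-left origin) to
    (left, top, right, bottom) in pixels with a top-left origin."""
    x1, y1, x2, y2 = bbox
    scale = dpi / 72.0  # 72 points = 1 inch
    left = x1 * scale
    right = x2 * scale
    # Flip vertically: the top edge in image space comes from y2.
    top = (page_height_pt - y2) * scale
    bottom = (page_height_pt - y1) * scale
    return (round(left), round(top), round(right), round(bottom))


# US Letter page (612 x 792 pt) rendered at 144 DPI, i.e. 2 pixels per point.
print(bbox_to_pixels([72, 120, 280, 145], page_height_pt=792))
# prints (144, 1294, 560, 1344)
```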

Confidence Scores

Each extraction includes a confidence score from 0.0 to 1.0 indicating extraction reliability.

Range        Interpretation     Recommendation
0.95 - 1.0   High confidence    Auto-approve in most workflows
0.85 - 0.94  Medium confidence  Quick human verification
0.70 - 0.84  Low confidence     Requires human review
Below 0.70   Uncertain          Manual extraction recommended

Use the confidence_threshold option to filter out low-confidence extractions:

{
  "document": "...",
  "schema": { ... },
  "options": {
    "confidence_threshold": 0.85
  }
}
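
The confidence bands above translate directly into a routing function for review queues. This sketch treats the band boundaries as lower bounds (0.95, 0.85, 0.70) so that a score such as 0.945 still falls into a band; the queue names are made up for illustration.

```python
def review_action(confidence):
    """Map a confidence score to the handling suggested by the bands above."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= 0.95:
        return "auto_approve"
    if confidence >= 0.85:
        return "quick_check"
    if confidence >= 0.70:
        return "human_review"
    return "manual_extraction"


def triage(citations):
    """Group field names by review action, given a response's citations dict."""
    queues = {}
    for field, citation in citations.items():
        queues.setdefault(review_action(citation["confidence"]), []).append(field)
    return queues
```

Routing on the client keeps every field in the response, whereas confidence_threshold removes low-confidence fields before they reach you.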

Code Examples

Make HTTP requests to the CiteLLM API from any language.

Python

import base64

import requests

# Read and encode the PDF
with open("report.pdf", "rb") as f:
    document_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.citellm.com/v1/extract",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "document": document_b64,
        "schema": {
            "company_name": {"type": "string"},
            "total_revenue": {"type": "number"}
        }
    }
)
# Raise on 4xx/5xx so an error body is not mistaken for a result
response.raise_for_status()

result = response.json()
print(result["data"]["company_name"])
print(result["citations"]["company_name"]["snippet"])

JavaScript (Node.js)

import fs from 'fs';

const document = fs.readFileSync('report.pdf');
const documentB64 = document.toString('base64');

const response = await fetch('https://api.citellm.com/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    document: documentB64,
    schema: {
      company_name: { type: 'string' },
      total_revenue: { type: 'number' }
    }
  })
});

const result = await response.json();
console.log(result.data.company_name);
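console.log(result.citations.company_name.snippet);

cURL

# Encode PDF to base64 on a single line
# (GNU base64 wraps its output at 76 characters, which would break the JSON)
DOCUMENT=$(base64 < report.pdf | tr -d '\n')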

curl -X POST https://api.citellm.com/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"document\": \"$DOCUMENT\",
    \"schema\": {
      \"company_name\": {\"type\": \"string\"},
      \"total_revenue\": {\"type\": \"number\"}
    }
  }"

Go

package main

import (
    "bytes"
    "encoding/base64"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
)

func main() {
    // Read and encode PDF
    pdfData, err := os.ReadFile("report.pdf")
    if err != nil {
        log.Fatal(err)
    }
    documentB64 := base64.StdEncoding.EncodeToString(pdfData)

    payload := map[string]interface{}{
        "document": documentB64,
        "schema": map[string]interface{}{
            "company_name":  map[string]string{"type": "string"},
            "total_revenue": map[string]string{"type": "number"},
        },
    }
    body, err := json.Marshal(payload)
    if err != nil {
        log.Fatal(err)
    }

    req, err := http.NewRequest("POST", "https://api.citellm.com/v1/extract", bytes.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Authorization", "Bearer YOUR_API_KEY")
    req.Header.Set("Content-Type", "application/json")

    // resp is nil when the request fails, so check err before deferring Close
    client := &http.Client{}
    resp, err := client.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    result, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(result))
}

Docker Deployment

Run CiteLLM in your own infrastructure for full data sovereignty.

Quick Start

docker pull citellm/server:latest
docker run -d -p 8080:8080 citellm/server:latest

The API is now available at http://localhost:8080.

With Docker Compose

version: '3.8'
services:
  citellm:
    image: citellm/server:latest
    ports:
      - "8080:8080"
    environment:
      - CITATIONLLM_LICENSE_KEY=${LICENSE_KEY}
      - CITATIONLLM_DB_HOST=postgres
    depends_on:
      - postgres
      - redis

  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=citellm
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres-data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine

volumes:
  postgres-data:

Requirements

  • Docker 20.10+
  • 8GB RAM minimum (16GB recommended)
  • 4 CPU cores minimum
  • NVIDIA GPU optional (for faster processing)

Configuration

Environment variables for self-hosted deployments:

Variable                 Default    Description
CITATIONLLM_PORT         8080       API server port
CITATIONLLM_LICENSE_KEY  -          Enterprise license key
CITATIONLLM_DB_HOST      localhost  PostgreSQL host
CITATIONLLM_REDIS_URL    -          Redis connection URL
CITATIONLLM_MAX_PAGES    100        Max pages per document
CITATIONLLM_USE_GPU      false      Enable GPU acceleration
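
When scripting a deployment, it can help to resolve these variables with their documented defaults before starting the container. The sketch below does that in Python. load_config is a hypothetical helper, and the accepted spellings for CITATIONLLM_USE_GPU ("true" or "1") are an assumption; check what the server actually accepts.

```python
import os

# Defaults copied from the table above; variables without one resolve to None.
DEFAULTS = {
    "CITATIONLLM_PORT": "8080",
    "CITATIONLLM_DB_HOST": "localhost",
    "CITATIONLLM_MAX_PAGES": "100",
    "CITATIONLLM_USE_GPU": "false",
}


def load_config(env=None):
    """Resolve the documented variables, applying defaults where one exists."""
    env = os.environ if env is None else env

    def get(name):
        return env.get(name, DEFAULTS.get(name))

    return {
        "port": int(get("CITATIONLLM_PORT")),
        "license_key": get("CITATIONLLM_LICENSE_KEY"),
        "db_host": get("CITATIONLLM_DB_HOST"),
        "redis_url": get("CITATIONLLM_REDIS_URL"),
        "max_pages": int(get("CITATIONLLM_MAX_PAGES")),
        # Assumption: "true" or "1" enables the GPU; other spellings are untested.
        "use_gpu": get("CITATIONLLM_USE_GPU").strip().lower() in {"true", "1"},
    }
```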

Error Codes

API errors return a JSON object with error details:

{
  "error": {
    "code": "invalid_schema",
    "message": "Schema field 'revenue' has invalid type 'money'",
    "param": "schema.revenue.type"
  }
}

Code                 HTTP  Description
invalid_api_key      401   Invalid or missing API key
rate_limit_exceeded  429   Too many requests
invalid_document     400   Cannot parse the document
invalid_schema       400   Schema validation failed
document_too_large   413   Document exceeds size limit
quota_exceeded       402   Monthly page quota exceeded
internal_error       500   Internal server error
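
Client code usually needs to decide whether a failed request is worth retrying. The sketch below parses the error object and makes that call. Treating rate_limit_exceeded and internal_error as retryable is a common convention and an assumption here, not documented behavior; CiteLLMError and parse_error are hypothetical names.

```python
import json

# Assumption: only these codes are worth retrying automatically.
RETRYABLE_CODES = {"rate_limit_exceeded", "internal_error"}


class CiteLLMError(Exception):
    """Client-side exception built from an API error body (hypothetical)."""

    def __init__(self, status, code, message, param=None):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.param = param

    @property
    def retryable(self):
        return self.code in RETRYABLE_CODES


def parse_error(status, body_text):
    """Turn an HTTP status and response body into a CiteLLMError."""
    try:
        payload = json.loads(body_text)
    except ValueError:
        payload = None  # body was not JSON, e.g. a proxy error page
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        error = {}
    return CiteLLMError(
        status,
        error.get("code", "unknown_error"),
        error.get("message", body_text),
        error.get("param"),
    )
```

Keeping param on the exception lets a caller point at the offending schema field, such as schema.revenue.type in the example above.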

Rate Limits

Plan        Requests/min  Concurrent
Starter     60            5
Growth      300           20
Enterprise  Custom        Custom

Rate limit headers are included in all responses:

X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1705312800
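
A client can use these headers to pause before it hits the limit. The sketch below computes how long to wait. It assumes X-RateLimit-Reset is a Unix timestamp in seconds (the sample value 1705312800 is consistent with that, but the format is not stated above) and that header names are looked up with the exact casing shown. seconds_until_reset is a hypothetical helper.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to sleep before sending the next request.

    Returns 0.0 while requests remain in the current window.
    Assumption: X-RateLimit-Reset is a Unix timestamp in seconds.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)


headers = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1705312800",
}
wait = seconds_until_reset(headers, now=1705312770)  # 30 seconds before reset
```

Passing the result to time.sleep before the next call avoids most 429 responses without a retry loop.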