CiteLLM Documentation

CiteLLM extracts structured data from PDFs and provides precise citations for every extracted value. Each field comes with page numbers, bounding boxes, source text snippets, and confidence scores.

Base URL

https://api.citellm.com

Quick Start

Extract your first document with citations in three steps:

1. Get your API key

Sign up at citellm.com and create an API key from your dashboard.

2. Send a request

POST your base64-encoded PDF together with a schema that defines what to extract:

curl -X POST https://api.citellm.com/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document": "JVBERi0xLjQK...",
    "schema": {
      "company_name": { "type": "string" },
      "total_revenue": { "type": "number" }
    }
  }'

3. Get cited results

Receive extracted data with exact source locations:

{
  "data": {
    "company_name": "Acme Corporation",
    "total_revenue": 4250000
  },
  "citations": {
    "company_name": {
      "page": 1,
      "bbox": [72, 120, 280, 145],
      "snippet": "ACME CORPORATION Annual Report",
      "confidence": 0.98
    },
    "total_revenue": {
      "page": 8,
      "bbox": [300, 245, 420, 270],
      "snippet": "Total Revenue: $4,250,000",
      "confidence": 0.95
    }
  }
}

Authentication

All API requests require a Bearer token in the Authorization header:

Authorization: Bearer YOUR_API_KEY

API keys can be created and managed in your dashboard. Keep your keys secure and never expose them in client-side code.
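One way to keep the key out of source code is to read it from the environment at startup. The sketch below assumes a variable named CITELLM_API_KEY, which is a name chosen for this example and not something the API defines.

```python
import os
from typing import Dict, Mapping


def auth_headers(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build request headers from an API key stored outside the source code."""
    # CITELLM_API_KEY is a hypothetical variable name used only in this sketch.
    api_key = env.get("CITELLM_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("Set CITELLM_API_KEY before calling the API")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Passing the environment as a parameter keeps the function easy to test and makes it harder to leak a real key into test fixtures.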

Extract Data

The primary endpoint for extracting structured data with citations.

POST /v1/extract

Request Body

Field	Type	Required	Description
`document`	string	Yes*	Base64-encoded PDF content
`document_url`	string	Yes*	URL to fetch PDF (alternative to document)
`document_id`	string	Yes*	ID of previously uploaded document
`schema`	object	Yes	Field definitions for extraction
`options`	object	No	Extraction options

* One of document, document_url, or document_id is required.
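A small helper can enforce the document-source rule before a request is sent. This is a minimal sketch: the function name and keyword arguments are invented for illustration, and rejecting more than one source is a conservative assumption, since the table only states that one of the three is required.

```python
import base64
from typing import Optional


def build_extract_payload(
    schema: dict,
    *,
    pdf_bytes: Optional[bytes] = None,
    document_url: Optional[str] = None,
    document_id: Optional[str] = None,
    options: Optional[dict] = None,
) -> dict:
    """Return a JSON-serializable body for POST /v1/extract."""
    sources = {
        # Raw PDF bytes are sent as base64 text in the `document` field.
        "document": None if pdf_bytes is None
        else base64.b64encode(pdf_bytes).decode("ascii"),
        "document_url": document_url,
        "document_id": document_id,
    }
    provided = {key: value for key, value in sources.items() if value is not None}
    # Assumption: exactly one source per request.
    if len(provided) != 1:
        raise ValueError(
            "Provide exactly one of pdf_bytes, document_url, or document_id"
        )
    payload = {**provided, "schema": schema}
    if options:
        payload["options"] = options
    return payload
```

For example, `build_extract_payload({"company_name": {"type": "string"}}, document_id="doc_xyz789")` yields a body that reuses a previously uploaded document.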

Options

Option	Type	Default	Description
`confidence_threshold`	number	0.0	Minimum confidence to include (0.0-1.0)
`include_alternatives`	boolean	false	Include alternative extractions
`language`	string	"auto"	Document language hint

Response

{
  "id": "ext_abc123xyz",
  "status": "completed",
  "data": {
    "company_name": "Acme Corporation",
    "total_revenue": 4250000,
    "fiscal_year_end": "2024-12-31"
  },
  "citations": {
    "company_name": {
      "page": 1,
      "bbox": [72, 120, 280, 145],
      "snippet": "ACME CORPORATION Annual Report",
      "confidence": 0.98
    },
    "total_revenue": {
      "page": 8,
      "bbox": [300, 245, 420, 270],
      "snippet": "Total Revenue: $4,250,000",
      "confidence": 0.95
    },
    "fiscal_year_end": {
      "page": 1,
      "bbox": [400, 120, 520, 145],
      "snippet": "Year Ended December 31, 2024",
      "confidence": 0.97
    }
  },
  "document": {
    "id": "doc_xyz789",
    "pages": 24
  },
  "created_at": "2024-01-15T10:30:00Z"
}
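Because `data` and `citations` are keyed by the same field names, a client can join them into one record per field. The sketch below assumes the response shape shown above; the `CitedValue` class and `cited_values` function are illustrative names, and skipping fields that lack a citation is an assumption about how to handle incomplete responses.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class CitedValue:
    name: str
    value: Any
    page: int
    bbox: List[float]
    snippet: str
    confidence: float


def cited_values(response: dict) -> Dict[str, CitedValue]:
    """Pair each extracted value with its citation, keyed by field name."""
    citations = response.get("citations", {})
    result: Dict[str, CitedValue] = {}
    for name, value in response.get("data", {}).items():
        cite = citations.get(name)
        if cite is None:
            # Assumption: a field without a citation is skipped, not an error.
            continue
        result[name] = CitedValue(
            name=name,
            value=value,
            page=cite["page"],
            bbox=cite["bbox"],
            snippet=cite["snippet"],
            confidence=cite["confidence"],
        )
    return result
```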

Documents

Upload and manage documents for repeated extractions.

POST /v1/documents

Upload a document for later use. Send as multipart/form-data:

curl -X POST https://api.citellm.com/v1/documents \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf"

GET /v1/documents/:id

Retrieve document metadata.

DELETE /v1/documents/:id

Delete a document and its associated extractions.

Extractions

Retrieve past extraction results.

GET /v1/extractions/:id

Get a specific extraction by ID.

GET /v1/extractions

List all extractions. Supports pagination:

Parameter	Default	Description
`limit`	20	Results per page (max 100)
`offset`	0	Pagination offset
`status`	all	Filter: pending, completed, failed
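The `limit` and `offset` parameters can be wrapped in a generator that walks every page. The shape of the list response is not shown here, so this sketch takes a caller-supplied `fetch_page(limit, offset)` function (a hypothetical name) that performs the GET and returns that page's extractions as a list.

```python
from typing import Callable, Iterator, List


def iter_extractions(
    fetch_page: Callable[[int, int], List[dict]],
    limit: int = 20,
) -> Iterator[dict]:
    """Yield extractions page by page until a short page signals the end."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        # A page shorter than `limit` (including an empty one) is the last page.
        if len(page) < limit:
            return
        offset += limit
```

When the total is an exact multiple of `limit`, the loop makes one extra request that returns an empty page before stopping.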

Schemas

Schemas define what data to extract from documents. Each field specifies a type and an optional description; array fields also declare the type of their elements under items.

{
  "schema": {
    "field_name": {
      "type": "string",
      "description": "Optional hint for the extractor"
    }
  }
}

Supported Types

Type	Description	Example Output
`string`	Text values	`"Acme Corp"`
`number`	Integers or decimals	`4250000`
`date`	ISO 8601 dates	`"2024-12-31"`
`boolean`	True/false values	`true`
`array`	List of values	`["item1", "item2"]`

Example Schema

{
  "schema": {
    "company_name": {
      "type": "string",
      "description": "Legal name of the company"
    },
    "total_revenue": {
      "type": "number",
      "description": "Annual revenue in USD"
    },
    "fiscal_year_end": {
      "type": "date"
    },
    "is_audited": {
      "type": "boolean"
    },
    "board_members": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
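Checking a schema locally against the supported types can catch an invalid_schema error before a request is made. This is a client-side sketch only; the server's actual validation rules may differ, and the `schema_problems` name and its message format are invented for illustration.

```python
from typing import List

SUPPORTED_TYPES = {"string", "number", "date", "boolean", "array"}


def schema_problems(schema: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []
    for name, spec in schema.items():
        field_type = spec.get("type") if isinstance(spec, dict) else None
        if field_type not in SUPPORTED_TYPES:
            problems.append(f"schema.{name}.type: unsupported type {field_type!r}")
            continue
        if field_type == "array" and "items" in spec:
            # Assumption: `items` is optional, but must name a supported type if present.
            items = spec["items"]
            item_type = items.get("type") if isinstance(items, dict) else None
            if item_type not in SUPPORTED_TYPES:
                problems.append(
                    f"schema.{name}.items.type: unsupported type {item_type!r}"
                )
    return problems
```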

Citations

Every extracted field includes a citation object pointing to its exact source location.

Field	Type	Description
`page`	integer	Page number (1-indexed)
`bbox`	array	Bounding box [x1, y1, x2, y2] in points
`snippet`	string	Source text containing the value
`confidence`	number	Confidence score (0.0 - 1.0)

Bounding Box

The bbox array contains coordinates in PDF points (72 points = 1 inch), measured from the bottom-left corner of the page:

x1, y1 - Bottom-left corner
x2, y2 - Top-right corner
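Most image libraries place the origin at the top-left, so highlighting a citation on a rendered page requires flipping the y axis and scaling points to pixels. The sketch below assumes the caller knows the page height in points (792 for US Letter, for example), since the citation object does not include it; `bbox_to_pixels` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(
    bbox: Sequence[float],
    page_height_pt: float,
    dpi: float = 150,
) -> Tuple[int, int, int, int]:
    """Convert [x1, y1, x2, y2] in PDF points (bottom-left origin)
    to (left, top, right, bottom) in pixels (top-left origin)."""
    x1, y1, x2, y2 = bbox
    scale = dpi / 72.0  # 72 points = 1 inch
    left = x1 * scale
    right = x2 * scale
    # Flip the y axis: the box's top edge (y2) is closest to the top of the page.
    top = (page_height_pt - y2) * scale
    bottom = (page_height_pt - y1) * scale
    return (round(left), round(top), round(right), round(bottom))
```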

Confidence Scores

Each extraction includes a confidence score from 0.0 to 1.0 indicating extraction reliability.

Range	Interpretation	Recommendation
0.95 - 1.0	High confidence	Auto-approve in most workflows
0.85 - 0.94	Medium confidence	Quick human verification
0.70 - 0.84	Low confidence	Requires human review
Below 0.70	Uncertain	Manual extraction recommended

Use the confidence_threshold option to filter out low-confidence extractions:

{
  "document": "...",
  "schema": { ... },
  "options": {
    "confidence_threshold": 0.85
  }
}
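The same bands can drive client-side routing once results arrive. In this sketch the action labels are invented for illustration, and a score that falls between two listed ranges (0.945, for example) is assigned to the lower band, which is an assumption.

```python
from typing import Dict


def review_action(confidence: float) -> str:
    """Map a confidence score to the recommended handling."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= 0.95:
        return "auto_approve"
    if confidence >= 0.85:
        return "quick_verification"
    if confidence >= 0.70:
        return "human_review"
    return "manual_extraction"


def triage(citations: Dict[str, dict]) -> Dict[str, str]:
    """Return the recommended action for every cited field."""
    return {name: review_action(c["confidence"]) for name, c in citations.items()}
```

Routing on the client keeps every field in the response, which may be preferable to a high confidence_threshold when low-confidence values still need a human look.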

Code Examples

Make HTTP requests to the CiteLLM API from any language.

Python

import base64

import requests

# Read and base64-encode the PDF
with open("report.pdf", "rb") as f:
    document_b64 = base64.b64encode(f.read()).decode("ascii")

response = requests.post(
    "https://api.citellm.com/v1/extract",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "document": document_b64,
        "schema": {
            "company_name": {"type": "string"},
            "total_revenue": {"type": "number"}
        }
    },
    # Without a timeout, requests waits indefinitely for a response
    timeout=60
)
# Raise on 4xx/5xx so an error body is not read as a result
response.raise_for_status()

result = response.json()
print(result["data"]["company_name"])
print(result["citations"]["company_name"]["snippet"])

JavaScript (Node.js)

// Requires Node.js 18+ (built-in fetch), run as an ES module
import fs from 'fs';

// Read and base64-encode the PDF
const pdf = fs.readFileSync('report.pdf');
const documentB64 = pdf.toString('base64');

const response = await fetch('https://api.citellm.com/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    document: documentB64,
    schema: {
      company_name: { type: 'string' },
      total_revenue: { type: 'number' }
    }
  })
});

// Fail on 4xx/5xx so an error body is not read as a result
if (!response.ok) {
  throw new Error(`Request failed: ${response.status} ${await response.text()}`);
}

const result = await response.json();
console.log(result.data.company_name);
console.log(result.citations.company_name.snippet);

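cURL

# Encode the PDF to base64 on a single line. GNU base64 wraps its output at
# 76 columns, and a line break inside the string would make the JSON invalid.
DOCUMENT=$(base64 < report.pdf | tr -d '\n')

# Send the body on stdin (-d @-) so a large PDF does not exceed the
# command-line length limit
curl -X POST https://api.citellm.com/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "document": "$DOCUMENT",
  "schema": {
    "company_name": {"type": "string"},
    "total_revenue": {"type": "number"}
  }
}
EOF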

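Go

package main

import (
    "bytes"
    "encoding/base64"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
)

func main() {
    // Read and base64-encode the PDF
    pdfData, err := os.ReadFile("report.pdf")
    if err != nil {
        log.Fatalf("read PDF: %v", err)
    }
    documentB64 := base64.StdEncoding.EncodeToString(pdfData)

    payload := map[string]interface{}{
        "document": documentB64,
        "schema": map[string]interface{}{
            "company_name":  map[string]string{"type": "string"},
            "total_revenue": map[string]string{"type": "number"},
        },
    }
    body, err := json.Marshal(payload)
    if err != nil {
        log.Fatalf("encode request: %v", err)
    }

    req, err := http.NewRequest(http.MethodPost, "https://api.citellm.com/v1/extract", bytes.NewReader(body))
    if err != nil {
        log.Fatalf("build request: %v", err)
    }
    req.Header.Set("Authorization", "Bearer YOUR_API_KEY")
    req.Header.Set("Content-Type", "application/json")

    // Check the error before touching resp; resp is nil when the request fails
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatalf("send request: %v", err)
    }
    defer resp.Body.Close()

    result, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatalf("read response: %v", err)
    }
    fmt.Println(resp.Status)
    fmt.Println(string(result))
}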

Docker Deployment

Run CiteLLM in your own infrastructure for full data sovereignty.

Quick Start

docker pull citellm/server:latest
docker run -d -p 8080:8080 citellm/server:latest

The API is now available at http://localhost:8080.

With Docker Compose

version: '3.8'
services:
  citellm:
    image: citellm/server:latest
    ports:
      - "8080:8080"
    environment:
      - CITATIONLLM_LICENSE_KEY=${LICENSE_KEY}
      - CITATIONLLM_DB_HOST=postgres
    depends_on:
      - postgres
      - redis

  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=citellm
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres-data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine

volumes:
  postgres-data:

Requirements

Docker 20.10+
8GB RAM minimum (16GB recommended)
4 CPU cores minimum
NVIDIA GPU optional (for faster processing)

Configuration

Environment variables for self-hosted deployments:

Variable	Default	Description
`CITATIONLLM_PORT`	8080	API server port
`CITATIONLLM_LICENSE_KEY`	-	Enterprise license key
`CITATIONLLM_DB_HOST`	localhost	PostgreSQL host
`CITATIONLLM_REDIS_URL`	-	Redis connection URL
`CITATIONLLM_MAX_PAGES`	100	Max pages per document
`CITATIONLLM_USE_GPU`	false	Enable GPU acceleration

Error Codes

API errors return a JSON object with error details:

{
  "error": {
    "code": "invalid_schema",
    "message": "Schema field 'revenue' has invalid type 'money'",
    "param": "schema.revenue.type"
  }
}

Code	HTTP	Description
`invalid_api_key`	401	Invalid or missing API key
`rate_limit_exceeded`	429	Too many requests
`invalid_document`	400	Cannot parse the document
`invalid_schema`	400	Schema validation failed
`document_too_large`	413	Document exceeds size limit
`quota_exceeded`	402	Monthly page quota exceeded
`internal_error`	500	Internal server error
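A client can turn the error body into a typed exception and decide whether a retry makes sense. The sketch below follows the error object shown above; the `CiteLLMError` class is an illustrative name, and treating only rate_limit_exceeded and internal_error as retryable is an assumption, not documented behavior.

```python
from typing import Optional

# Assumption: only these codes are worth retrying automatically.
RETRYABLE_CODES = {"rate_limit_exceeded", "internal_error"}


class CiteLLMError(Exception):
    """An API error parsed from a non-2xx response."""

    def __init__(self, status: int, code: str, message: str,
                 param: Optional[str] = None) -> None:
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message
        self.param = param

    @property
    def retryable(self) -> bool:
        return self.code in RETRYABLE_CODES


def error_from_response(status: int, body: dict) -> CiteLLMError:
    """Build a CiteLLMError from the HTTP status and the decoded JSON body."""
    err = body.get("error") or {}
    return CiteLLMError(
        status=status,
        code=err.get("code", "unknown"),
        message=err.get("message", ""),
        param=err.get("param"),
    )
```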

Rate Limits

Plan	Requests/min	Concurrent
Starter	60	5
Growth	300	20
Enterprise	Custom	Custom

Rate limit headers are included in all responses:

X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1705312800
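These headers let a client pause before it receives a rate_limit_exceeded error. The sketch below assumes X-RateLimit-Reset is a Unix timestamp in seconds, which the example value suggests but the text does not state; `seconds_until_reset` is a hypothetical helper name.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str],
                        now: Optional[float] = None) -> float:
    """Return how long to wait before the next request; 0.0 if quota remains."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # Assumption: the reset value is seconds since the Unix epoch.
    reset_at = float(headers.get("X-RateLimit-Reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

A plain dict lookup is case-sensitive, so pass a case-insensitive mapping (such as the headers object of an HTTP client library) or normalize the header names first.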