agxp/docpulse
DocPulse — Document Intelligence API

Multi-tenant document extraction platform. Submit any document + a JSON schema describing what to extract → get back structured JSON with per-field confidence scores.

Quickstart

Single command (Docker)

OPENAI_API_KEY=sk-... docker compose up --build

Then open http://localhost:8081 — the web UI loads with a dev API key pre-filled. Upload a document, pick a schema preset, and extract.

The stack (api, worker, postgres, redis) starts automatically. Migrations run on boot. A dev tenant is seeded with the key di_devkey_changeme_in_production (override via DEV_API_KEY env var).

Local development (without Docker for the Go services)

# 1. Start infrastructure
docker compose up -d postgres redis

# 2. Set environment
cp .env.example .env
# Edit .env — add your OPENAI_API_KEY

set -a && source .env && set +a

# 3. Run migrations + create dev tenant
make migrate    # requires psql installed locally
make seed       # prints your API key — save it

# 4. Start API and worker (separate terminals)
make run-api
make run-worker

Usage

Web UI

The API server also hosts a web frontend at /. In dev mode the API key is auto-filled. Steps:

  1. Upload a PDF, DOCX, or image (max 50 MB)
  2. Define a JSON Schema — or pick a preset (Invoice, Resume, Contract, Receipt, ID)
  3. Click Extract — the UI polls for the result and displays each field with a confidence score

A sample document is included at testdata/sample-invoice.docx.

API

Submit an extraction job

curl -X POST http://localhost:8081/v1/extract \
  -H "Authorization: Bearer di_your_key_here" \
  -F "document=@invoice.pdf" \
  -F 'schema={
    "type": "object",
    "properties": {
      "vendor": {"type": "string"},
      "invoice_number": {"type": "string"},
      "total": {"type": "number"},
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "amount": {"type": "number"}
          }
        }
      }
    },
    "required": ["vendor", "total"]
  }'

# Response:
# {"job_id": "abc-123", "status": "pending", "poll_url": "/v1/jobs/abc-123"}

Poll for results

curl http://localhost:8081/v1/jobs/abc-123 \
  -H "Authorization: Bearer di_your_key_here"

List jobs

curl "http://localhost:8081/v1/jobs?limit=20&offset=0" \
  -H "Authorization: Bearer di_your_key_here"

Default limit is 20, max is 100.
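To drain the full job list, page with limit/offset until a short page comes back. A sketch; the `"jobs"` response key in the fetcher is a hypothetical assumption, not confirmed by this README:

```python
import json
import urllib.request

def list_all_jobs(fetch_page, limit: int = 20):
    """Collect every job; fetch_page(limit, offset) returns one page as a list."""
    limit = min(limit, 100)  # the server caps limit at 100
    jobs, offset = [], 0
    while True:
        page = fetch_page(limit, offset)
        jobs.extend(page)
        if len(page) < limit:  # a short page means we've reached the end
            return jobs
        offset += limit

def fetch_page(limit: int, offset: int) -> list:
    # Hypothetical fetcher: assumes the jobs array sits under a "jobs" key.
    req = urllib.request.Request(
        f"http://localhost:8081/v1/jobs?limit={limit}&offset={offset}",
        headers={"Authorization": "Bearer di_your_key_here"},
    )
    return json.load(urllib.request.urlopen(req))["jobs"]
```

Injecting the fetcher keeps the paging loop testable without a running server.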

Webhooks

Register a URL to receive a POST when a job completes. The secret is generated server-side and shown once — store it to verify signatures.

# Register
curl -X POST http://localhost:8081/v1/webhooks \
  -H "Authorization: Bearer di_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/webhook"}'

# Response includes the secret — save it:
# {"id": "...", "url": "...", "secret": "abc123...", "active": true}

# Delete
curl -X DELETE http://localhost:8081/v1/webhooks/{id} \
  -H "Authorization: Bearer di_your_key_here"

Each delivery is a POST with:

  • Content-Type: application/json — body is the full job object
  • X-DocPulse-Signature: sha256=<hmac> — HMAC-SHA256 of the body using your secret

Verify the signature on your server:

import hmac, hashlib

def verify(secret: str, body: bytes, header: str) -> bool:
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)

Failed deliveries are retried up to 5 times with exponential backoff.
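For a sense of the retry timing, an exponential schedule doubles the delay before each attempt. A sketch; the 30-second base is an illustrative assumption, as this README does not specify the actual delays:

```python
def backoff_schedule(attempts: int = 5, base: float = 30.0) -> list[float]:
    """Seconds to wait before each retry: base * 2^n for retry n (illustrative)."""
    return [base * (2 ** n) for n in range(attempts)]
```

With the assumed 30s base, the five retries would fire roughly 30s, 1m, 2m, 4m, and 8m after the failure.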

Architecture

Client → API (Go/chi) → PostgreSQL (job queue)
                              ↓
                         Worker Pool
                    ┌────────┼────────┐
                    │        │        │
                 Ingest   Chunk    Extract
                    │        │        │
               PDF/OCR   Semantic  LLM Router
               DOCX      Boundary  (fast/strong)
                    │        │        │
                    └────────┼────────┘
                              ↓
                     Result Assembly
                     + Confidence Scoring
                              ↓
                     Job Complete / Webhook

Key decisions:

  • Async-first: jobs never block HTTP connections
  • FOR UPDATE SKIP LOCKED: safe concurrent job claiming without a separate queue
  • Two-tier LLM routing: cheap model for simple schemas, strong model for complex ones + automatic escalation on validation failure
  • Content-hash cache: SHA-256(document + schema) catches exact duplicates at zero cost
  • Magic-byte format detection: more robust than trusting file extensions
  • HMAC-signed webhooks: recipients can verify payload integrity

Project Structure

cmd/api/          — HTTP server entry point
cmd/worker/       — Job processor entry point
internal/
  api/            — HTTP handlers, routing, embedded frontend
  api/middleware/ — Auth, logging, rate limiting
  auth/           — API key generation and hashing
  config/         — Environment-based configuration
  database/       — PostgreSQL stores (jobs, tenants, webhooks)
  domain/         — Core types shared across packages
  extraction/     — Chunking engine
  ingestion/      — Format detection, text extraction (PDF/OCR/DOCX)
  jobs/           — Worker loop and job processing pipeline
  llm/            — Model routing and structured extraction
  storage/        — Object storage interface (local filesystem only)
  webhook/        — Webhook delivery with HMAC signing + retries
migrations/       — SQL schema (auto-applied on API startup)
testdata/         — Sample documents for testing
scripts/          — Dev utilities (seed tenant)
Dockerfile        — Multi-stage build: api and worker targets

Stack

Go 1.24 · PostgreSQL 16 · Redis 7 · OpenAI API · Docker · Fly.io

System dependencies (for text extraction):

  • poppler-utils — pdftotext for native PDFs
  • tesseract-ocr — OCR for scanned PDFs and images
  • pandoc — DOCX to text conversion

Known limitations

  • Storage: only local filesystem (LocalStore) is implemented. S3 support is stubbed but not built.
  • Schema validation: validates structure (type=object, properties present, each property has a type), but does not implement the full JSON Schema specification.
  • Job list pagination: limit/offset work and the response includes a total count, but there is no cursor-based pagination.
  • Worker cache: Redis-backed with a configurable TTL (WORKER_CACHE_TTL, default 24h), but no LRU eviction beyond TTL.
  • make migrate: runs psql directly — requires psql installed on your machine. When using Docker (docker compose up), migrations run automatically on API startup instead.
