agxp/docpulse
DocPulse — Document Intelligence API

Multi-tenant document extraction platform. Submit any document + a JSON schema describing what to extract → get back structured JSON with per-field confidence scores.

Quickstart

Single command (Docker)

OPENAI_API_KEY=sk-... docker compose up --build

Then open http://localhost:8081 — the web UI loads with a dev API key pre-filled. Upload a document, pick a schema preset, and extract.

The stack (api, worker, postgres, redis) starts automatically. Migrations run on boot. A dev tenant is seeded with the key di_devkey_changeme_in_production (override via DEV_API_KEY env var).

Local development (without Docker for the Go services)

# 1. Start infrastructure
docker compose up -d postgres redis

# 2. Set environment
cp .env.example .env
# Edit .env — add your OPENAI_API_KEY

set -a && source .env && set +a

# 3. Run migrations + create dev tenant
make migrate    # requires psql installed locally
make seed       # prints your API key — save it

# 4. Start API and worker (separate terminals)
make run-api
make run-worker

Usage

Web UI

The API server also hosts a web frontend at /. In dev mode the API key is auto-filled. Steps:

  1. Upload a PDF, DOCX, or image (max 50 MB)
  2. Define a JSON Schema — or pick a preset (Invoice, Resume, Contract, Receipt, ID)
  3. Click Extract — the UI polls for the result and displays each field with a confidence score

A sample document is included at testdata/sample-invoice.docx.

API

Submit an extraction job

curl -X POST http://localhost:8081/v1/extract \
  -H "Authorization: Bearer di_your_key_here" \
  -F "document=@invoice.pdf" \
  -F 'schema={
    "type": "object",
    "properties": {
      "vendor": {"type": "string"},
      "invoice_number": {"type": "string"},
      "total": {"type": "number"},
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "amount": {"type": "number"}
          }
        }
      }
    },
    "required": ["vendor", "total"]
  }'

# Response:
# {"job_id": "abc-123", "status": "pending", "poll_url": "/v1/jobs/abc-123"}

Poll for results

curl http://localhost:8081/v1/jobs/abc-123 \
  -H "Authorization: Bearer di_your_key_here"

List jobs

curl "http://localhost:8081/v1/jobs?limit=20&offset=0" \
  -H "Authorization: Bearer di_your_key_here"

Default limit is 20, max is 100.
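To drain the full job list, page with limit/offset until a short page comes back. A sketch; the `"jobs"` response key in the fetcher is a hypothetical assumption, not confirmed by this README:

```python
import json
import urllib.request

def list_all_jobs(fetch_page, limit: int = 20):
    """Collect every job; fetch_page(limit, offset) returns one page as a list."""
    limit = min(limit, 100)  # the server caps limit at 100
    jobs, offset = [], 0
    while True:
        page = fetch_page(limit, offset)
        jobs.extend(page)
        if len(page) < limit:  # a short page means we've reached the end
            return jobs
        offset += limit

def fetch_page(limit: int, offset: int) -> list:
    # Hypothetical fetcher: assumes the jobs array sits under a "jobs" key.
    req = urllib.request.Request(
        f"http://localhost:8081/v1/jobs?limit={limit}&offset={offset}",
        headers={"Authorization": "Bearer di_your_key_here"},
    )
    return json.load(urllib.request.urlopen(req))["jobs"]
```

Injecting the fetcher keeps the paging loop testable without a running server.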

Webhooks

Register a URL to receive a POST when a job completes. The secret is generated server-side and shown once — store it to verify signatures.

# Register
curl -X POST http://localhost:8081/v1/webhooks \
  -H "Authorization: Bearer di_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/webhook"}'

# Response includes the secret — save it:
# {"id": "...", "url": "...", "secret": "abc123...", "active": true}

# Delete
curl -X DELETE http://localhost:8081/v1/webhooks/{id} \
  -H "Authorization: Bearer di_your_key_here"

Each delivery is a POST with:

  • Content-Type: application/json — body is the full job object
  • X-DocPulse-Signature: sha256=<hmac> — HMAC-SHA256 of the body using your secret

Verify the signature on your server:

import hmac, hashlib

def verify(secret: str, body: bytes, header: str) -> bool:
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)

Failed deliveries are retried up to 5 times with exponential backoff.
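For a sense of the retry timing, an exponential schedule doubles the delay before each attempt. A sketch; the 30-second base is an illustrative assumption, as this README does not specify the actual delays:

```python
def backoff_schedule(attempts: int = 5, base: float = 30.0) -> list[float]:
    """Seconds to wait before each retry: base * 2^n for retry n (illustrative)."""
    return [base * (2 ** n) for n in range(attempts)]
```

With the assumed 30s base, the five retries would fire roughly 30s, 1m, 2m, 4m, and 8m after the failure.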

Architecture

Client → API (Go/chi) → PostgreSQL (job queue)
                              ↓
                         Worker Pool
                    ┌────────┼────────┐
                    │        │        │
                 Ingest   Chunk    Extract
                    │        │        │
               PDF/OCR   Semantic  LLM Router
               DOCX      Boundary  (fast/strong)
                    │        │        │
                    └────────┼────────┘
                              ↓
                     Result Assembly
                     + Confidence Scoring
                              ↓
                     Job Complete / Webhook

Key decisions:

  • Async-first: jobs never block HTTP connections
  • FOR UPDATE SKIP LOCKED: safe concurrent job claiming without a separate queue
  • Two-tier LLM routing: cheap model for simple schemas, strong model for complex ones + automatic escalation on validation failure
  • Content-hash cache: SHA-256(document + schema) catches exact duplicates at zero cost
  • Magic-byte format detection: more robust than trusting file extensions
  • HMAC-signed webhooks: recipients can verify payload integrity

Project Structure

cmd/api/          — HTTP server entry point
cmd/worker/       — Job processor entry point
internal/
  api/            — HTTP handlers, routing, embedded frontend
  api/middleware/ — Auth, logging, rate limiting
  auth/           — API key generation and hashing
  config/         — Environment-based configuration
  database/       — PostgreSQL stores (jobs, tenants, webhooks)
  domain/         — Core types shared across packages
  extraction/     — Chunking engine
  ingestion/      — Format detection, text extraction (PDF/OCR/DOCX)
  jobs/           — Worker loop and job processing pipeline
  llm/            — Model routing and structured extraction
  storage/        — Object storage interface (local filesystem only)
  webhook/        — Webhook delivery with HMAC signing + retries
migrations/       — SQL schema (auto-applied on API startup)
testdata/         — Sample documents for testing
scripts/          — Dev utilities (seed tenant)
Dockerfile        — Multi-stage build: api and worker targets

Stack

Go 1.24 · PostgreSQL 16 · Redis 7 · OpenAI API · Docker · Fly.io

System dependencies (for text extraction):

  • poppler-utils — pdftotext for native PDFs
  • tesseract-ocr — OCR for scanned PDFs and images
  • pandoc — DOCX to text conversion

Known limitations

  • Storage: only local filesystem (LocalStore) is implemented. S3 support is stubbed but not built.
  • Schema validation: validates structure (type=object, properties present, each property has a type), but does not implement the full JSON Schema specification.
  • Job list pagination: limit/offset work and the response includes a total count, but there is no cursor-based pagination.
  • Worker cache: Redis-backed with a configurable TTL (WORKER_CACHE_TTL, default 24h), but no LRU eviction beyond TTL.
  • make migrate: runs psql directly — requires psql installed on your machine. When using Docker (docker compose up), migrations run automatically on API startup instead.
