Production-oriented document classification backend for PDF, DOCX, and image uploads. The current pipeline uses direct text extraction when possible, an adaptive local PaddleOCR backend for OCR, and a three-chunk Ollama vote to classify each document.
- FastAPI backend with versioned APIs
- Direct text extraction for digital PDFs, DOCX, markdown, and plain text
- One adaptive local PaddleOCR backend
  - GPU path when local Paddle CUDA is available
  - CPU fallback path otherwise
- Concurrent startup warmup for both OCR and classifier so the first request does not pay the full cold-start cost
- Ollama-based classification with:
  - persistent HTTP client
  - up to 3 representative chunks per document
  - concurrent chunk classification
  - majority-vote aggregation
  - heuristic fallback and override when the model returns a weak `other`
- Structured JSON logging, health reporting, and Prometheus metrics exposed at `/metrics`
At backend startup, the OCR layer checks local Paddle CUDA support and picks one model profile:

```
OCR_DEVICE=auto
OCR_GPU_DETECTION_MODEL=PP-OCRv5_server_det
OCR_GPU_RECOGNITION_MODEL=PP-OCRv4_server_rec_doc
OCR_CPU_DETECTION_MODEL=PP-OCRv4_mobile_det
OCR_CPU_RECOGNITION_MODEL=en_PP-OCRv4_mobile_rec
```

If you explicitly set `OCR_DEVICE=cpu`, the CPU path is forced even when CUDA is available.
`POST /api/v1/documents/classify` accepts the upload. `IngestionService` stores the file under `data/uploads`. `ExtractionService` chooses direct-text extraction or OCR.
```
Probe page 1 for embedded text
├── text length >= MIN_DIRECT_TEXT_LENGTH
│   └── extract remaining pages → return full digital text (no OCR)
└── text length < MIN_DIRECT_TEXT_LENGTH
    └── fall through to OCR on page 1 only (PaddleOCR, scale=1.5)
```
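The probe step above can be sketched in Python. This is a minimal sketch, not the app's actual code: `extract_pdf_text`, its `run_ocr` callable, and the threshold default are hypothetical stand-ins for the real extractor, the PaddleOCR call, and the configured `MIN_DIRECT_TEXT_LENGTH`.

```python
MIN_DIRECT_TEXT_LENGTH = 200  # assumed default; configured via .env in the real app

def extract_pdf_text(pages, run_ocr):
    """Probe page 1 for embedded text; OCR only when the probe is too short.

    `pages` is a list of per-page embedded-text strings (empty for scans);
    `run_ocr` is a callable OCR fallback for page 1.
    """
    probe = pages[0] if pages else ""
    if len(probe) >= MIN_DIRECT_TEXT_LENGTH:
        # Digital PDF: keep all embedded text, skip OCR entirely.
        return "\n".join(pages), "direct"
    # Scanned/short PDF: OCR page 1 only (classification needs no more).
    return run_ocr(page_index=0, scale=1.5), "ocr"
```

The key property is that OCR cost is only paid when the page-1 probe fails, so digital PDFs never touch PaddleOCR.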
- Images are resized/compressed before OCR.
- OCR runs through the selected PaddleOCR profile.
`ClassificationService` builds up to 3 vote chunks from extracted pages.
- If there are many page chunks, it keeps representative chunks from the beginning, middle, and end.
- If there are too few page chunks, it falls back to first-page paragraph chunks or splits the extracted text into 3 bounded chunks.
- Ollama classifies the selected chunks concurrently.
- The final class is chosen by majority vote.
- If the votes split, confidence is used as the tie-break.
- If the winning result is `other` but heuristics strongly match a known type such as `resume`, the heuristic result overrides it.
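The aggregation rules above can be sketched as a pure function. This is a minimal sketch under stated assumptions, not the app's actual code; `aggregate_votes` and its signature are hypothetical.

```python
from collections import Counter

def aggregate_votes(votes, heuristic_label=None):
    """Majority vote over per-chunk classifications.

    `votes` is a list of (label, confidence) pairs, one per chunk.
    `heuristic_label` is an optional strong heuristic match (e.g. "resume")
    used to override a weak "other" winner.
    """
    counts = Counter(label for label, _ in votes)
    top = counts.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        # Split vote: break the tie with the highest chunk confidence.
        tied = {label for label, n in top if n == top[0][1]}
        winner = max((v for v in votes if v[0] in tied), key=lambda v: v[1])[0]
    else:
        winner = top[0][0]
    if winner == "other" and heuristic_label:
        return heuristic_label  # heuristics override a weak "other"
    return winner
```

For example, two `invoice` votes beat one `other`; a 1–1 split between `invoice` and `resume` goes to whichever chunk was more confident.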
Both OCR and classifier warm up concurrently at startup:
```
[PaddleOCR init + warmup]   ← runs in thread executor
[Classifier init + warmup]  ← runs concurrently in event loop
            ↓
app ready — total time = max(ocr, classifier)
```
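The shape of that startup can be sketched with asyncio. The warmup bodies below are stand-ins (the real app initializes PaddleOCR and the Ollama client here); only the concurrency pattern is the point.

```python
import asyncio
import time

def warm_ocr():
    """Blocking PaddleOCR init stand-in; runs in a thread executor."""
    time.sleep(0.2)
    return "ocr ready"

async def warm_classifier():
    """Async classifier warmup stand-in (e.g. a first Ollama call)."""
    await asyncio.sleep(0.2)
    return "classifier ready"

async def startup():
    loop = asyncio.get_running_loop()
    # Both warmups run concurrently: total ~= max(ocr, classifier), not the sum.
    ocr, clf = await asyncio.gather(
        loop.run_in_executor(None, warm_ocr),
        warm_classifier(),
    )
    return ocr, clf
```

Running `asyncio.run(startup())` takes roughly 0.2s here, not 0.4s, because the blocking OCR init is pushed to a thread while the classifier warms up on the event loop.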
The app exposes Prometheus metrics at `/metrics`.

A separate monitoring stack is included under `deployment/monitoring/`.

That stack runs:
- Prometheus for scraping and storing metrics
- Grafana for dashboards
- blackbox-exporter for uptime and health probes
- `docintel_http_requests_total`: total HTTP requests by method, path, and status code
- `docintel_http_request_duration_seconds`: HTTP latency histogram by method and path
- `docintel_document_classifications_total`: completed classification requests by document type, predicted class, and success or failure
- `docintel_document_classification_duration_seconds`: end-to-end processing latency for document classification
- `docintel_document_stage_duration_seconds`: per-step timing for `ingestion`, `extraction`, and `classification`
- `docintel_document_upload_size_bytes`: uploaded file size distribution by extension
- `docintel_service_available`: whether OCR and classifier services are available after startup
- `docintel_startup_degraded`: whether the app started in degraded mode
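In Prometheus, each distinct label combination on a metric becomes its own time series. A toy stand-in (deliberately not the real prometheus_client API) showing how `docintel_http_requests_total` accumulates one series per (method, path, status):

```python
from collections import Counter

class LabeledCounter:
    """Toy model of a labeled Prometheus counter: one running total
    per unique label combination."""
    def __init__(self, name):
        self.name = name
        self.series = Counter()

    def inc(self, **labels):
        # The sorted label tuple identifies one time series.
        self.series[tuple(sorted(labels.items()))] += 1

http_requests = LabeledCounter("docintel_http_requests_total")
http_requests.inc(method="POST", path="/api/v1/documents/classify", status="200")
http_requests.inc(method="POST", path="/api/v1/documents/classify", status="200")
http_requests.inc(method="GET", path="/api/health", status="200")
```

After the three increments there are two series: the classify POST at 2 and the health GET at 1, which is exactly what a `sum by (path)` query would see.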
- `probe_success{job="fastapi-alive"}`: whether the app is reachable
- `probe_success{job="fastapi-health"}`: whether `/api/health` returned an `ok` health response
- `probe_success{job="ollama-alive"}`: whether the Ollama server is reachable
- `probe_duration_seconds`: blackbox probe latency
- `probe_http_status_code`: HTTP status codes returned by the probed endpoints
- FastAPI Health: app reachability, app health, Ollama reachability, probe latency, and status code
- FastAPI App Metrics:
  - total classified docs
  - average processing time
  - average processing time of each step over time
  - documents classified per minute
  - average processing time by document type
  - request rate and latency
  - classification counts and failed classifications
  - upload size distribution
  - service availability and degraded startup state
```
doc-intel-engine/
├── app/
│   ├── api/
│   ├── core/
│   ├── schemas/
│   ├── services/
│   └── main.py
├── data/
├── deployment/
│   └── docker-compose.yml
├── models/
├── scripts/
└── tests/
```
Use Python 3.11.
```
python -m venv .venv
.venv\Scripts\activate
pip install -e .[dev]
```

Paddle is environment-specific and is intentionally not pinned inside pyproject.toml.

- CPU local setup: install `paddlepaddle`
- GPU local setup: install the matching `paddlepaddle-gpu` wheel for your CUDA stack

Example verification:

```
uv run python -c "import paddle; print(paddle.__version__); print(paddle.device.is_compiled_with_cuda())"
```

Start Ollama and pull the classifier model:

```
ollama serve
ollama pull qwen2.5:1.5b
```

Run the backend:

```
uv run uvicorn app.main:app --host 0.0.0.0 --port 8000
```

Open:
- http://localhost:8000/
- http://localhost:8000/docs
- http://localhost:8000/api/health
- http://localhost:8000/metrics
If you want monitoring dashboards locally:
```
docker compose -f deployment/monitoring/docker-compose.yml up -d
```

Open:
- http://localhost:9090 for Prometheus
- http://localhost:3001 for Grafana
Use this path when you want PaddleOCR to run on an NVIDIA GPU instead of falling back to CPU.
Before installing Python dependencies, confirm the machine is GPU-ready:
- NVIDIA driver is installed correctly
- `nvidia-smi` works on the host
- your Paddle GPU wheel matches the host CUDA stack
- if you plan to use Docker, NVIDIA Container Toolkit is installed

Quick host check:

```
nvidia-smi
```

If `nvidia-smi` fails, fix the driver/runtime first. The app cannot use the GPU until the host GPU stack is healthy.
Create and activate the virtual environment, then install the repo:
```
python -m venv .venv
.venv\Scripts\activate
pip install -e .[dev]
```

Install a `paddlepaddle-gpu` build that matches your CUDA version. This repo does not pin one specific GPU wheel because the correct package depends on the machine.

Example pattern:

```
uv pip install paddlepaddle-gpu
```

If your environment needs a specific wheel index or version, use the Paddle install command that matches your CUDA runtime.
Run:

```
uv run python -c "import paddle; print(paddle.__version__); print(paddle.device.is_compiled_with_cuda())"
```

Expected result for a working GPU install: the Paddle version, followed by:

```
True
```
You can also confirm the project picks the GPU path by starting the API and checking /api/health.
These values match the repo's active GPU OCR profile:
```
OCR_DEVICE=auto
PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK=True
OCR_GPU_DETECTION_MODEL=PP-OCRv5_server_det
OCR_GPU_RECOGNITION_MODEL=PP-OCRv4_server_rec_doc
OCR_CPU_DETECTION_MODEL=PP-OCRv4_mobile_det
OCR_CPU_RECOGNITION_MODEL=en_PP-OCRv4_mobile_rec
```

Notes:
- `OCR_DEVICE=auto` lets the app select GPU when Paddle reports CUDA support
- `OCR_DEVICE=gpu:0` is reasonable if you want to express GPU intent explicitly
- if CUDA is unavailable at runtime, the app will fall back to CPU and `/api/health` will show that
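The selection rule described in these notes can be expressed as a small pure function. This is a sketch of the documented behavior, not the app's actual code; `cuda_available` stands in for what `paddle.device.is_compiled_with_cuda()` reports.

```python
def select_ocr_device(device_env, cuda_available):
    """Pick the OCR device from the OCR_DEVICE setting.

    device_env: value of OCR_DEVICE ("auto", "cpu", "gpu:0", ...).
    cuda_available: whether Paddle reports CUDA support at runtime.
    """
    if device_env == "cpu":
        return "cpu"  # explicit CPU is forced even when CUDA is available
    if device_env.startswith("gpu"):
        return device_env if cuda_available else "cpu"  # fall back without CUDA
    # "auto": GPU when Paddle reports CUDA support, else CPU
    return "gpu:0" if cuda_available else "cpu"
```

The GPU or CPU model profile from the block above then follows from whichever device this returns.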
For the default local classifier path:
```
ollama serve
ollama pull qwen2.5:1.5b
```

Start the backend:

```
uv run uvicorn app.main:app --host 0.0.0.0 --port 8000
```

Then verify:

```
curl http://127.0.0.1:8000/api/health
```

In the OCR model section, look for GPU signals such as:
- `status: "loaded"`
- `uses_gpu: true`
- `details.active_device: "gpu:0"`
- `details.selected_path: "gpu"`
This repo already includes a GPU-oriented container setup:
Current GPU-relevant settings already present there:
- GPU-capable PaddleX base image in the Dockerfile
- `gpus: all`
- `ipc: host`
- `shm_size: 8gb`
- `OCR_DEVICE=auto`

Run the API container:

```
docker compose -f deployment/docker-compose.yml up --build api
```

For Docker GPU execution to work, the host must expose the NVIDIA runtime into containers. If the container starts but /api/health reports CPU, the usual issue is host-side GPU runtime configuration rather than app code.
- `paddle.device.is_compiled_with_cuda()` returns `False`: wrong Paddle wheel or no CUDA-capable runtime
- `nvidia-smi` works but the app still uses CPU: the installed Paddle package does not match the machine's CUDA stack
- Docker sees no GPU: NVIDIA Container Toolkit or Docker GPU runtime is not configured
- `/api/health` reports `unavailable`: OCR startup failed; inspect startup logs for the exact Paddle/PaddleOCR error
Current active .env controls:

```
APP_ENV
HOST
PORT
LOG_LEVEL
GLOG_minloglevel
UPLOAD_DIR
MAX_UPLOAD_SIZE_MB
CATEGORIES
ENABLE_API_KEY_AUTH
API_KEYS
JWT_SECRET
RATE_LIMIT
MIN_DIRECT_TEXT_LENGTH
OCR_LANGUAGE
OCR_TARGET_LANGUAGES
OCR_DETECT_ORIENTATION
OCR_DEVICE
PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK
OCR_GPU_DETECTION_MODEL
OCR_GPU_RECOGNITION_MODEL
OCR_CPU_DETECTION_MODEL
OCR_CPU_RECOGNITION_MODEL
IMAGE_OCR_MAX_DIMENSION
IMAGE_OCR_JPEG_QUALITY
CLASSIFIER_PROVIDER
CLASSIFIER_MODEL
OLLAMA_BASE_URL
OLLAMA_KEEP_ALIVE
CLASSIFICATION_TIMEOUT_SECONDS
CLASSIFICATION_MAX_PARALLEL_CHUNKS
CLASSIFICATION_FIRST_PAGE_TARGET_CHARS
CLASSIFICATION_FIRST_PAGE_MIN_CHARS
CLASSIFICATION_FIRST_PAGE_MAX_CHUNKS
CLASSIFICATION_FIRST_PAGE_BATCH_SIZE
CLASSIFICATION_EARLY_EXIT_CONFIDENCE
CLASSIFICATION_CHUNK_MAX_TOKENS
CLASSIFICATION_FINAL_MAX_TOKENS
OLLAMA_MAX_CONNECTIONS
TEXT_SNIPPET_LIMIT
CLASSIFICATION_CHUNK_PAGES
```
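These controls are read from the environment (or a .env file). A stand-alone sketch of the pattern for a few of them follows; the default values here are illustrative only, and the app's real defaults may differ.

```python
import os

def load_settings(env=None):
    """Read a few of the controls above, with illustrative defaults."""
    env = os.environ if env is None else env
    return {
        "OCR_DEVICE": env.get("OCR_DEVICE", "auto"),
        "MIN_DIRECT_TEXT_LENGTH": int(env.get("MIN_DIRECT_TEXT_LENGTH", "200")),
        "MAX_UPLOAD_SIZE_MB": int(env.get("MAX_UPLOAD_SIZE_MB", "25")),
        "OLLAMA_BASE_URL": env.get("OLLAMA_BASE_URL", "http://127.0.0.1:11434"),
    }
```

Passing a dict instead of `os.environ` makes the settings loader easy to unit test.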
The current Compose file is for:
- `api`

The API container expects Ollama to run on the host machine.

```
docker compose -f deployment/docker-compose.yml up --build
```

- API exposed on http://localhost:8000
- host Ollama is used through http://host.docker.internal:11434
- API container is configured with `OCR_DEVICE=auto` and `PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK=True`
- mounted volumes: `../data -> /app/data` and `../models -> /app/models`
- The Dockerfile uses a Paddle GPU-capable base image
- Compose enables `gpus: all`, `ipc: host`, and `shm_size: 8gb`
- In practice, Docker deployment still depends on the target machine having a compatible GPU runtime and matching Paddle environment
The active production deployment runs without Docker:
- Python venv + `systemd` manages the FastAPI process
- Nginx sits in front on port 80 and handles rate limiting
- Ollama runs as a separate `systemd` service on `127.0.0.1:11434`
- ngrok provides optional public HTTPS exposure
Full step-by-step guide: deployment/UBUNTU_CPU_DEPLOYMENT.md
Ansible automation for the same Ubuntu CPU path is available under:
- `deployment/ansible/site.yml`
- `deployment/ansible/README.md`
The backend service unit is at deployment/doc-intel.service:
```
[Service]
User=imtiaz
WorkingDirectory=/home/imtiaz/Documents/Doc-Intelligence/
EnvironmentFile=/home/imtiaz/Documents/Doc-Intelligence/.env.production
ExecStart=/home/imtiaz/Documents/Doc-Intelligence/.venv/bin/uvicorn app.main:app --host 127.0.0.1 --port 8000
Restart=always
RestartSec=5
```

Manage it:

```
sudo systemctl daemon-reload
sudo systemctl enable doc-intel
sudo systemctl restart doc-intel
journalctl -u doc-intel -f
```

Rate limiting is handled entirely by Nginx, not by the app. The config is at deployment/nginx-doc-intel.conf:
```
limit_req_zone $binary_remote_addr zone=doc_intel:10m rate=10r/m;

server {
    listen 80;
    server_name _;
    client_max_body_size 25M;

    location / {
        limit_req zone=doc_intel burst=5 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:8000;
        ...
    }
}
```

Install and reload:

```
sudo cp deployment/nginx-doc-intel.conf /etc/nginx/sites-available/doc-intel
sudo ln -sf /etc/nginx/sites-available/doc-intel /etc/nginx/sites-enabled/doc-intel
sudo nginx -t && sudo systemctl reload nginx
```

ngrok tunnels the local Nginx port 80 to a public HTTPS URL:

```
ngrok http 80
```

ngrok prints a URL like https://<id>.ngrok-free.app. Use that as the base URL for external testing.

ngrok free tier limit: 40 req/min through the tunnel. Nginx rate limiting (10r/m, burst=5) fires before that.
deployment/deploy_ubuntu.sh pulls, syncs deps, and restarts all three services:

```
bash deployment/deploy_ubuntu.sh
```

With the current local PaddleOCR setup, model files are cached by PaddleX under your user profile, for example:

```
C:\Users\<username>\.paddlex\official_models\
```

Typical cached models for the current setup:
- `PP-OCRv5_server_det`
- `PP-OCRv4_server_rec_doc`
- `PP-OCRv4_mobile_det`
- `en_PP-OCRv4_mobile_rec`
- auxiliary orientation/unwarping models used by PaddleOCR
`GET /api/health`

Returns model status and active OCR/classifier configuration.

`POST /api/v1/documents/classify`

Current response shape:
- `doc_id`
- `filename`
- `doc_type`
- `classification`
- `confidence`
- `latency_ms`
- `ocr_text_preview`
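A client can consume a response like this. Only the field names come from the shape above; the values below are made up for illustration.

```python
import json

# Illustrative response body; field names match the shape above,
# values are invented.
raw = """{
  "doc_id": "abc123",
  "filename": "sample.pdf",
  "doc_type": "pdf",
  "classification": "invoice",
  "confidence": 0.87,
  "latency_ms": 1450,
  "ocr_text_preview": "INVOICE #1042 ..."
}"""

result = json.loads(raw)
label, score = result["classification"], result["confidence"]
```

`confidence` is a 0-1 score, so a caller can threshold on it before trusting `classification`.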
If API-key auth is enabled, clients must send:

```
X-API-Key: your-secret-key
```

The app currently suppresses only known startup noise:
- Paddle model hoster connectivity warnings
- `RequestsDependencyWarning`
- `No ccache found` startup warning
- low-signal GLOG warnings through `GLOG_minloglevel=2`
Real runtime errors, OCR failures, and application exceptions are still visible.
- OCR and classifier warm up concurrently at startup: total startup time is `max(ocr_warmup, classifier_warmup)`, not the sum
- PDF OCR fallback only processes page 1: sufficient for classification, not for full document extraction
- CPU OCR fallback is lighter but less accurate on stylized or noisy documents
- DB persistence is scaffolded but not wired into request handling
- document processing is still synchronous request-time work
Run the tests:

```
pytest
```

Health check:

```
curl http://127.0.0.1/api/health
```

Classify a document:

```
curl -X POST http://127.0.0.1/api/v1/documents/classify \
  -H "X-API-Key: <your-api-key>" \
  -F "file=@/path/to/sample.pdf"
```

Through ngrok:
```
curl -X POST https://<id>.ngrok-free.app/api/v1/documents/classify \
  -H "X-API-Key: <your-api-key>" \
  -F "file=@/path/to/sample.pdf"
```

Fire 15 requests with 5 in parallel; you should see a mix of 200 and 429:

```
seq 1 15 | xargs -P5 -I{} curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1/api/health
```

A 429 from Nginx has no X-Ngrok-Error-Code header. A 429 from the ngrok tunnel limit does.
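That distinction can be checked programmatically. A small sketch based on the header behavior noted above (`rate_limit_source` is a hypothetical helper):

```python
def rate_limit_source(status_code, headers):
    """Classify a 429 by origin: ngrok's tunnel limit sets
    X-Ngrok-Error-Code on its responses; Nginx's limit_req does not."""
    if status_code != 429:
        return None
    return "ngrok" if "X-Ngrok-Error-Code" in headers else "nginx"
```

Feed it the status code and headers of any response from the tunnel to see which layer throttled you.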
```
seq 1 10 | xargs -P3 -I{} curl -s -X POST \
  -H "X-API-Key: <your-api-key>" \
  -F "file=@tests/fixtures/sample.pdf" \
  http://127.0.0.1/api/v1/documents/classify
```

Watch logs live:

```
journalctl -u doc-intel -f
```