Paperless-NGX Cortex

Paperless-NGX Cortex is a separate intelligence layer for Paperless-ngx. It keeps Paperless as the source of truth, processes documents locally (sync, OCR layers, embeddings, suggestions), and supports explicit manual writeback only.

What this project is (and why it exists)

I built this because Paperless-ngx is excellent at storage and search, but I wanted a focused intelligence layer that can be audited, resumed, and controlled without ever auto-writing back. The goal is to make document understanding and metadata suggestions fast, local, and reviewable.

Friendly reminder

This started as a personal project and is heavily biased toward my own home setup. I am sharing it in case the code, prompts, or techniques are useful to someone trying to achieve something similar.

Benefits

Keeps Paperless-ngx as the source of truth and never auto-writes.
Adds local OCR quality checks and optional vision OCR without overwriting the baseline.
Produces embeddings, semantic search, suggestions, and summaries you can review before applying.
Handles large documents with resumable, observable pipeline steps.
Adds per-document chat with follow-up question suggestions.
Surfaces similar documents and potential duplicates from embeddings.
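
To make the last point concrete, here is a minimal sketch of how similar documents and likely duplicates can be ranked from embeddings with cosine similarity. It is not the project's implementation: the function names, the in-memory dictionary, and the 0.98 duplicate threshold are assumptions for illustration, and a real setup would query the vector store instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query_id, embeddings, duplicate_threshold=0.98):
    """Return (doc_id, score, is_probable_duplicate) tuples, most similar first.

    `embeddings` maps a document id to its vector. The threshold is a
    hypothetical value, not one taken from the project.
    """
    query = embeddings[query_id]
    scored = []
    for doc_id, vector in embeddings.items():
        if doc_id == query_id:
            continue  # a document is trivially identical to itself
        score = cosine_similarity(query, vector)
        scored.append((doc_id, score, score >= duplicate_threshold))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```

Results flagged as probable duplicates are still only candidates; in keeping with the review-first approach, a person decides what to do with them.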

Processing diagram

flowchart TD
  A[Paperless-ngx] --> B[Sync metadata + baseline text]
  B --> C{Need extra OCR?}
  C -- No --> E[Embeddings]
  C -- Yes --> D[Vision OCR optional]
  D --> E
  E --> F[Suggestions]
  F --> G{Large doc?}
  G -- Yes --> H[Page notes + hierarchical summary]
  G -- No --> I[Review]
  H --> I
  I --> J[Manual writeback]

Current status

Delivery phases

MVP (core intelligence layer): Done
- Sync from Paperless, local storage, embeddings, semantic search, suggestions, queue/worker, manual writeback.
Phase 1 (robustness + UX streamlining): Done
- Pipeline hardening + triage/log observability baseline delivered.
Phase 2 (advanced evidence locator / on-the-fly bbox resolution): Planned / partial design only
- A spec exists; the full implementation is not complete yet.

Practical interpretation

You can use the app end-to-end today.
Current engineering focus is quality and reliability, not greenfield features.

Product principles

No automatic writeback to Paperless.
All AI outputs are reviewed locally first.
Writeback is explicit and manual.
Local processing should be resumable, observable, and robust for large docs.
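
The "explicit and manual" principle can be illustrated with a small sketch: only fields a reviewer has approved are included in a writeback payload, and an empty approval list produces nothing to send. The function name and the example field names are hypothetical, and the HTTP call to Paperless is deliberately left out.

```python
def build_writeback_payload(suggestions, approved_fields):
    """Return only the suggested values the reviewer explicitly approved.

    `suggestions` maps field names to suggested values; `approved_fields`
    lists the fields selected during review. Both shapes are assumptions
    made for this sketch.
    """
    unknown = [field for field in approved_fields if field not in suggestions]
    if unknown:
        # Refuse to guess: an approved field without a suggestion is an error.
        raise KeyError(f"no suggestion available for: {', '.join(unknown)}")
    return {field: suggestions[field] for field in approved_fields}


def has_changes(payload):
    """True only when there is something to write back."""
    return bool(payload)
```

With this shape, a caller would send a request only when `has_changes(payload)` is true, so unreviewed suggestions never leave the local store.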

Core flow (current)

Sync metadata + text baseline from Paperless.
Optionally run vision OCR as an additional layer (never overwriting the baseline).
Generate embeddings from the Paperless text, the vision OCR text, or both, depending on the source strategy.
Generate suggestions from each source (Paperless and vision) and pick the best.
For large docs: page notes + hierarchical summary.
Review locally, then explicitly write back selected fields.
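
The large-document step (page notes followed by a hierarchical summary) can be sketched as a repeated reduce: summarize small groups of notes, then summarize the summaries until one remains. This is an illustration under assumptions, not the project's code; `summarize` stands in for an LLM call and `group_size` is a hypothetical parameter.

```python
from typing import Callable, List


def hierarchical_summary(
    page_notes: List[str],
    summarize: Callable[[List[str]], str],
    group_size: int = 4,
) -> str:
    """Collapse per-page notes into one summary, level by level.

    Each pass groups up to `group_size` texts and replaces every group with
    its summary, so no single call sees the whole document at once.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each pass shrinks the list")
    if not page_notes:
        return ""
    level = list(page_notes)
    while len(level) > 1:
        level = [
            summarize(level[i:i + group_size])
            for i in range(0, len(level), group_size)
        ]
    # A single note is returned as-is without calling `summarize`.
    return level[0]
```

Because every intermediate level is just a list of strings, each pass could be persisted and resumed, which matches the resumability goal stated above.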

Per-document operations also allow targeted manual re-runs for individual steps (for example similarity_index) without forcing a full reset/reprocess.
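
A minimal sketch of what resumable steps with targeted re-runs can look like is shown here. Only `similarity_index` is a step name taken from this README; the other step names, the state dictionary, and the `run_pipeline` function are assumptions for illustration.

```python
# Hypothetical step order; only "similarity_index" appears in the README.
STEPS = ["sync", "vision_ocr", "embeddings", "suggestions", "similarity_index"]


def run_pipeline(state, handlers, only=None):
    """Run pending steps for one document and return the steps that completed.

    `state` maps step name to "done" or "failed" and is updated in place.
    `handlers` maps step name to a zero-argument callable.
    `only` forces a re-run of a single step without resetting the others.
    """
    executed = []
    for step in STEPS:
        if only is not None:
            if step != only:
                continue  # targeted re-run: skip everything else
        elif state.get(step) == "done":
            continue  # resume: completed steps are not repeated
        try:
            handlers[step]()
        except Exception:
            state[step] = "failed"
            break  # later steps wait until this one succeeds
        state[step] = "done"
        executed.append(step)
    return executed
```

Calling `run_pipeline(state, handlers)` after a failure picks up at the failed step, while `run_pipeline(state, handlers, only="similarity_index")` repeats just that step.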

Requirements and installation

Prerequisites

Python >=3.13 for the backend.
Node.js >=18 for the frontend.
Paperless-ngx instance reachable by URL and API token.
Postgres, Redis, and a supported vector store (Qdrant or Weaviate) (local installs or Docker).
An OpenAI-compatible LLM endpoint (local or remote).

For user-facing operations and UI guidance, see docs/manual/README.md.

Backend (recommended: uv)

cd backend
uv sync
uv run alembic upgrade head
uv run uvicorn app.main:app --reload --port 8000

Backend (pip + requirements.txt)

A pinned dependency list is generated from pyproject.toml and kept at backend/requirements.txt.

cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
alembic upgrade head
uvicorn app.main:app --reload --port 8000

To refresh requirements.txt from pyproject.toml:

cd backend
uv export --format requirements.txt --no-dev --output-file requirements.txt

Worker (optional, queue mode)

cd backend
uv run python -m app.worker

Frontend

cd frontend
npm install
npm run dev

Local setup (database + migrations)

Copy .env.example to .env and fill values. Do not commit .env to GitHub.
Ensure Postgres, Redis, and your active vector store are running.
Create the database specified by DATABASE_URL.
Run migrations with Alembic.

Example Postgres setup:

createuser --pwprompt paperless
createdb --owner=paperless paperless_intelligence

Example migrations:

cd backend
uv run alembic upgrade head

Docker

App-only (backend + frontend + redis)

docker compose -f docker-compose.app.yml up --build

Full stack (app + postgres + redis + qdrant)

docker compose -f docker-compose.full.yml up --build

Important: LLM_BASE_URL must be set in your .env; docker-compose.full.yml does not set it. The Docker setup exposes the API on port 8000 and serves the frontend from the backend container unless you run the frontend dev server separately.

Worker-only container

docker compose -f docker-compose.worker.yml up --build

Configuration

Set values in .env.

Minimum for a real setup

PAPERLESS_BASE_URL
PAPERLESS_API_TOKEN
DATABASE_URL
VECTOR_STORE_PROVIDER
vector-store-specific settings for Qdrant or Weaviate
LLM_BASE_URL
TEXT_MODEL
EMBEDDING_MODEL
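
As a quick preflight, the required variable names listed above can be checked before starting the backend. This is a hedged sketch, not part of the project: `missing_settings` is a hypothetical helper, and it does not cover the vector-store-specific settings, whose names are documented in .env.example.

```python
import os

# Variable names copied from the list above.
REQUIRED_SETTINGS = [
    "PAPERLESS_BASE_URL",
    "PAPERLESS_API_TOKEN",
    "DATABASE_URL",
    "VECTOR_STORE_PROVIDER",
    "LLM_BASE_URL",
    "TEXT_MODEL",
    "EMBEDDING_MODEL",
]


def missing_settings(env):
    """Return the required names that are absent or blank in `env`."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        raise SystemExit("Missing settings: " + ", ".join(missing))
    print("All required settings are present.")
```

Passing a mapping instead of reading `os.environ` directly keeps the check easy to test and to reuse for a parsed .env file.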

Configuration docs

.env.example for concrete environment variables and example values
docs/config-reference.md for grouped runtime configuration guidance
docs/architecture-overview.md for the technical component overview

Documentation map

For users

MANUAL.md: documentation entry point
docs/manual/README.md: end-user manual
docs/manual/14-tages-checkliste.md: daily checklist
docs/manual/12-similar-workflow.md: similar-doc review workflow
docs/manual/13-team-policy.md: concise working rules

For admins and operators

docs/manual/15-admin-und-betrieb.md: admin and UI operations guide
docs/manual/16-settings-und-live-model-provider.md: live model-provider settings and API-key behavior
docs/architecture-overview.md: architecture overview
docs/config-reference.md: grouped configuration reference

For developers and contributors

CHANGELOG.md: granular change history
agents.md: compact project state and next actions
CONTRIBUTING.md: contribution notes
docs/execution-blueprint-large-doc-worker.md: large-document worker strategy

API/client generation

cd frontend
ORVAL_API_URL=http://localhost:8000/api/openapi.json npm run api:generate

Versioning (simple start, no CI)

The root VERSION file is the source of truth.

python scripts/sync_version.py

This synchronizes:

backend/pyproject.toml
frontend/package.json
frontend/src/generated/version.ts

GET /api/status exposes app_version, api_version, and frontend_version; the frontend footer renders them.
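
After syncing, the status payload can be used to spot a stale build. The sketch below compares the three documented fields with the value from the VERSION file. It assumes all three are expected to match that value, which this README does not confirm, and `version_mismatches` is a hypothetical helper name.

```python
VERSION_KEYS = ("app_version", "api_version", "frontend_version")


def version_mismatches(status_payload, expected):
    """Return the version fields whose value differs from `expected`.

    `status_payload` is the decoded JSON body of GET /api/status.
    A missing field is reported with the value None.
    """
    return {
        key: status_payload.get(key)
        for key in VERSION_KEYS
        if status_payload.get(key) != expected
    }
```

An empty result means every reported version equals the expected one; anything else names the component that was not rebuilt or resynced.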

License

MIT License. See LICENSE. Provided “as is”, without warranty of any kind.
