A searchable, OCR-indexed mirror of the PURSUE UAP document release that the U.S. Department of War published at https://www.war.gov/UFO/ on May 8, 2026.
Most of the 161 records are scanned PDFs with no searchable text. This project:
- Discovers the underlying CSV manifest the war.gov page renders from.
- Mirrors every PDF / image locally (war.gov is fronted by Akamai which blocks plain HTTP clients, so we drive Playwright).
- OCRs every page with Tesseract (born-digital pages use native text).
- Indexes the result for full-text + AI-assisted search.
- Serves it through a Next.js 15 / Tailwind v4 app with a catalog, document viewer, instant search, and an Anthropic Claude RAG chat with inline page citations.
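The native-text-vs-OCR decision in the extract step can be sketched in a few lines. This is a hypothetical helper, not the actual code in `03_extract.py`; the `min_chars` threshold and the return shape are assumptions:

```python
def needs_ocr(native_text: str, min_chars: int = 50) -> bool:
    """Heuristic: treat a page as a scan when pdfplumber recovers
    almost no selectable text. The threshold is illustrative, not the
    value the real pipeline uses."""
    return len(native_text.strip()) < min_chars

def extract_page(native_text: str, ocr_fallback) -> dict:
    """Return page text plus a flag recording which path produced it.
    `ocr_fallback` stands in for a Tesseract call on a rendered page
    image and is invoked only when the native text layer is empty."""
    if needs_ocr(native_text):
        return {"text": ocr_fallback(), "source": "ocr"}
    return {"text": native_text, "source": "native"}
```

Born-digital pages keep their (usually more accurate) embedded text; only true scans pay the OCR cost.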
alien-files/
├── pipeline/ Python: Playwright + PyMuPDF + Tesseract
│ ├── 01_discover.py capture the war.gov UFO record manifest
│ ├── 02_download.py mirror every PDF + thumbnail
│ ├── 03_extract.py pdfplumber → text; Tesseract on scans
│ ├── 05_publish.py copy artifacts into web/public/data
│ └── requirements.txt
├── data/
│ ├── pdfs/ mirrored PDFs (gitignored, ~5–10 GB)
│ ├── text/ per-document OCR JSON (gitignored)
│ └── json/ index.json + fulltext.json + raw csv
└── web/ Next.js 15 app
├── src/app/ routes: /, /search, /chat, /doc/[slug], /about
├── src/components/ catalog, document-viewer, search, chat
└── src/app/api/
├── chat/route.ts Claude RAG endpoint
└── pdf/[id]/route.ts local PDF proxy (Akamai-friendly)
# 0. Prereqs: Python 3.12, Node 22, Tesseract 5 (Windows: install at
# "C:\Program Files\Tesseract-OCR" — installer at
# https://github.com/UB-Mannheim/tesseract/releases ).
# 1. Pipeline (one-time corpus build, ~3 hours wall-clock with 4 OCR workers)
cd pipeline
python -m venv .venv
.\.venv\Scripts\pip install -r requirements.txt
.\.venv\Scripts\playwright install chromium
.\.venv\Scripts\python 01_discover.py # writes data/json/index.json
.\.venv\Scripts\python 02_download.py # writes data/pdfs/*.pdf
.\.venv\Scripts\python 03_extract.py --workers 4 # OCR + auto-publish
# (03 calls 05_publish on completion to refresh web/public/data)
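The `--workers 4` fan-out can be sketched as below. Because each Tesseract invocation runs as a separate OS process, a thread pool is enough to keep N copies busy; the function names and return values here are illustrative, not the real `03_extract.py` internals:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def ocr_document(pdf_path: Path) -> tuple[str, int]:
    """Stand-in for the real per-document work: render pages, run
    Tesseract, write data/text/<slug>.json. Here it just reports the
    file size so the sketch stays self-contained."""
    return pdf_path.name, pdf_path.stat().st_size

def run_pipeline(pdf_dir: Path, workers: int = 4) -> list[tuple[str, int]]:
    pdfs = sorted(pdf_dir.glob("*.pdf"))
    # Each worker thread blocks on an external tesseract process, so
    # `workers` documents are OCRed concurrently; map() preserves order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_document, pdfs))
```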
# 2. Web app
cd ..\web
npm install
copy .env.example .env.local # then edit and add ANTHROPIC_API_KEY
npm run dev
# → http://localhost:3000

`02_download.py` and `03_extract.py` are both idempotent: they skip files
that already exist. Pass `--force` to redo everything, or `--limit N` for a
quick smoke test.
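The skip-if-present behavior behind `--force` and `--limit` amounts to something like the following. Names are illustrative; see `02_download.py` for the real logic:

```python
import argparse
from pathlib import Path

def select_pending(ids, out_dir: Path, force: bool = False, limit=None):
    """Idempotency sketch: a record is skipped when its output PDF
    already exists, unless force=True; limit caps the batch so a
    smoke test only touches the first N records."""
    pending = [i for i in ids if force or not (out_dir / f"{i}.pdf").exists()]
    return pending[:limit]

def parse_args(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--force", action="store_true", help="redo everything")
    p.add_argument("--limit", type=int, default=None,
                   help="process at most N records")
    return p.parse_args(argv)
```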
When war.gov publishes new releases, just rerun the whole chain:
python 01_discover.py && python 02_download.py && python 03_extract.py --workers 4
03_extract.py rebuilds fulltext.json from scratch every run, so
removing a document also removes it from search.
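That rebuild-from-scratch property can be sketched as a pure function of whatever per-document JSON currently exists in `data/text/`. The record shape below is an assumption, not the exact `fulltext.json` schema:

```python
import json
from pathlib import Path

def rebuild_fulltext(text_dir: Path, out_path: Path) -> int:
    """Regenerate the search corpus from the per-document OCR JSON.
    Because nothing is merged with a previous run, deleting a
    document's JSON drops its pages from search on the next rebuild."""
    pages = []
    for doc_file in sorted(text_dir.glob("*.json")):
        doc = json.loads(doc_file.read_text(encoding="utf-8"))
        for page in doc.get("pages", []):
            pages.append({
                "id": f"{doc_file.stem}-p{page['number']}",
                "doc": doc_file.stem,
                "page": page["number"],
                "text": page["text"],
            })
    out_path.write_text(json.dumps(pages), encoding="utf-8")
    return len(pages)
```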
ANTHROPIC_API_KEY=sk-ant-...
Without the key the /chat page renders, but submitting a question returns a
friendly "missing key" message. Everything else (catalog, viewer, search) works
with no API keys at all.
| Mode | Where it lives | What it does |
|---|---|---|
| Catalog filter | `/` (`catalog-client.tsx`) | Instant client-side filtering by agency / type / free-text |
| Full-text search | `/search` (`search-client.tsx`) | MiniSearch over every page of OCR text + record metadata, BM25-ish |
| RAG chat | `/chat` (`chat-client.tsx`) | Claude Haiku answers grounded on the top-N matched pages, with citations |
The web app is a stock Next.js 15 project; deploy on Vercel in 30 seconds:
vercel deploy --prod
The PDFs themselves (~5–10 GB) are too large to ship in the Vercel build. Two recommended options:
- Cloudflare R2 (10 GB free, no egress): upload `data/pdfs/` to a bucket, then change `web/src/app/api/pdf/[id]/route.ts` to `redirect()` to the R2 URL instead of streaming locally.
- GitHub Releases: attach the PDFs as a single `.zip` asset and proxy through the API route. Cheapest if you don't need hot-linking.
The OCR JSON (fulltext.json, ~30–80 MB) ships fine inside the Next.js
build.
- Tesseract on heavily redacted carbon copies is best-effort. A page marked `[OCR]` should be treated as approximate text, not authoritative.
- Source documents are unmodified U.S. Government works. ALIEN.FILES never edits them; the OCR layer is independent.
- This site is unofficial and not affiliated with the Department of War, AARO, or any U.S. government entity.
MIT. Source documents remain in the public domain as works of the U.S. federal government.