A searchable, OCR-indexed mirror of the PURSUE UAP document release that the U.S. Department of War published at https://www.war.gov/UFO/ on May 8, 2026.
Most of the 161 records are scanned PDFs with no searchable text. This project:
- Discovers the underlying CSV manifest the war.gov page renders from.
- Mirrors every PDF / image locally (war.gov is fronted by Akamai which blocks plain HTTP clients, so we drive Playwright).
- OCRs every page with Tesseract (born-digital pages use native text).
- Indexes the result for full-text + AI-assisted search.
- Serves it through a Next.js 15 / Tailwind v4 app with a catalog, document viewer, instant search, and an Anthropic Claude RAG chat with inline page citations.
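The native-text-vs-OCR decision in the extract step can be sketched in a few lines. This is a hypothetical helper, not the actual code in `03_extract.py`; the `min_chars` threshold and the return shape are assumptions:

```python
def needs_ocr(native_text: str, min_chars: int = 50) -> bool:
    """Heuristic: treat a page as a scan when pdfplumber recovers
    almost no selectable text. The threshold is illustrative, not the
    value the real pipeline uses."""
    return len(native_text.strip()) < min_chars

def extract_page(native_text: str, ocr_fallback) -> dict:
    """Return page text plus a flag recording which path produced it.
    `ocr_fallback` stands in for a Tesseract call on a rendered page
    image and is invoked only when the native text layer is empty."""
    if needs_ocr(native_text):
        return {"text": ocr_fallback(), "source": "ocr"}
    return {"text": native_text, "source": "native"}
```

Born-digital pages keep their (usually more accurate) embedded text; only true scans pay the OCR cost.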
alien-files/
├── pipeline/ Python: Playwright + PyMuPDF + Tesseract
│ ├── 01_discover.py capture the war.gov UFO record manifest
│ ├── 02_download.py mirror every PDF + thumbnail
│ ├── 03_extract.py pdfplumber → text; Tesseract on scans
│ ├── 05_publish.py copy artifacts into web/public/data
│ └── requirements.txt
├── data/
│ ├── pdfs/ mirrored PDFs (gitignored, ~5–10 GB)
│ ├── text/ per-document OCR JSON (gitignored)
│ └── json/ index.json + fulltext.json + raw csv
└── web/ Next.js 15 app
├── src/app/ routes: /, /search, /chat, /doc/[slug], /about
├── src/components/ catalog, document-viewer, search, chat
└── src/app/api/
├── chat/route.ts Claude RAG endpoint
└── pdf/[id]/route.ts local PDF proxy (Akamai-friendly)
# 0. Prereqs: Python 3.12, Node 22, Tesseract 5 (Windows: install at
# "C:\Program Files\Tesseract-OCR" — installer at
# https://github.com/UB-Mannheim/tesseract/releases ).
# 1. Pipeline (one-time corpus build, ~3 hours wall-clock with 4 OCR workers)
cd pipeline
python -m venv .venv
.\.venv\Scripts\pip install -r requirements.txt
.\.venv\Scripts\playwright install chromium
.\.venv\Scripts\python 01_discover.py # writes data/json/index.json
.\.venv\Scripts\python 02_download.py # writes data/pdfs/*.pdf
.\.venv\Scripts\python 03_extract.py --workers 4 # OCR + auto-publish
# (03 calls 05_publish on completion to refresh web/public/data)
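The `--workers 4` fan-out can be sketched as below. Because each Tesseract invocation runs as a separate OS process, a thread pool is enough to keep N copies busy; the function names and return values here are illustrative, not the real `03_extract.py` internals:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def ocr_document(pdf_path: Path) -> tuple[str, int]:
    """Stand-in for the real per-document work: render pages, run
    Tesseract, write data/text/<slug>.json. Here it just reports the
    file size so the sketch stays self-contained."""
    return pdf_path.name, pdf_path.stat().st_size

def run_pipeline(pdf_dir: Path, workers: int = 4) -> list[tuple[str, int]]:
    pdfs = sorted(pdf_dir.glob("*.pdf"))
    # Each worker thread blocks on an external tesseract process, so
    # `workers` documents are OCRed concurrently; map() preserves order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_document, pdfs))
```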
# 2. Web app
cd ..\web
npm install
copy .env.example .env.local # then edit and add ANTHROPIC_API_KEY
npm run dev
# → http://localhost:3000

`02_download.py` and `03_extract.py` are both idempotent: they skip files
that already exist. Pass `--force` to redo everything, or `--limit N` for a
quick smoke test.
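The skip-if-present behavior behind `--force` and `--limit` amounts to something like the following. Names are illustrative; see `02_download.py` for the real logic:

```python
import argparse
from pathlib import Path

def select_pending(ids, out_dir: Path, force: bool = False, limit=None):
    """Idempotency sketch: a record is skipped when its output PDF
    already exists, unless force=True; limit caps the batch so a
    smoke test only touches the first N records."""
    pending = [i for i in ids if force or not (out_dir / f"{i}.pdf").exists()]
    return pending[:limit]

def parse_args(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--force", action="store_true", help="redo everything")
    p.add_argument("--limit", type=int, default=None,
                   help="process at most N records")
    return p.parse_args(argv)
```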
When war.gov publishes new releases, just rerun the whole chain:
python 01_discover.py && python 02_download.py && python 03_extract.py --workers 4
03_extract.py rebuilds fulltext.json from scratch every run, so
removing a document also removes it from search.
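That rebuild-from-scratch property can be sketched as a pure function of whatever per-document JSON currently exists in `data/text/`. The record shape below is an assumption, not the exact `fulltext.json` schema:

```python
import json
from pathlib import Path

def rebuild_fulltext(text_dir: Path, out_path: Path) -> int:
    """Regenerate the search corpus from the per-document OCR JSON.
    Because nothing is merged with a previous run, deleting a
    document's JSON drops its pages from search on the next rebuild."""
    pages = []
    for doc_file in sorted(text_dir.glob("*.json")):
        doc = json.loads(doc_file.read_text(encoding="utf-8"))
        for page in doc.get("pages", []):
            pages.append({
                "id": f"{doc_file.stem}-p{page['number']}",
                "doc": doc_file.stem,
                "page": page["number"],
                "text": page["text"],
            })
    out_path.write_text(json.dumps(pages), encoding="utf-8")
    return len(pages)
```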
ANTHROPIC_API_KEY=sk-ant-...
Without the key the /chat page renders, but submitting a question returns a
friendly "missing key" message. Everything else (catalog, viewer, search) works
with no API keys at all.
| Mode | Where it lives | What it does |
|---|---|---|
| Catalog filter | `/` (`catalog-client.tsx`) | Instant client-side filtering by agency / type / free-text |
| Full-text search | `/search` (`search-client.tsx`) | MiniSearch over every page of OCR text + record metadata, BM25-ish |
| RAG chat | `/chat` (`chat-client.tsx`) | Claude Haiku answers grounded on the top-N matched pages, with citations |
The web app is a stock Next.js 15 project; deploy on Vercel in 30 seconds:
vercel deploy --prod
The PDFs themselves (~5–10 GB) are too large to ship in the Vercel build. Two recommended options:
- Cloudflare R2 (10 GB free, no egress): upload `data/pdfs/` to a bucket, then change `web/src/app/api/pdf/[id]/route.ts` to `redirect()` to the R2 URL instead of streaming locally.
- GitHub Releases: attach the PDFs as a single `.zip` asset and proxy through the API route. Cheapest if you don't need hot-linking.
The OCR JSON (fulltext.json, ~30–80 MB) ships fine inside the Next.js
build.
- Tesseract on heavily redacted carbon copies is best-effort. A page marked `[OCR]` should be treated as approximate text, not authoritative.
- Source documents are unmodified U.S. Government works. ALIEN.FILES never edits them; the OCR layer is independent.
- This site is unofficial and not affiliated with the Department of War, AARO, or any U.S. government entity.
MIT. Source documents remain in the public domain as works of the U.S. federal government.