Pump-OS/alien-files

ALIEN.FILES

A searchable, OCR-indexed mirror of the PURSUE UAP document release that the U.S. Department of War published at https://www.war.gov/UFO/ on May 8, 2026.

Most of the 161 records are scanned PDFs with no searchable text. This project:

  1. Discovers the underlying CSV manifest the war.gov page renders from.
  2. Mirrors every PDF / image locally (war.gov is fronted by Akamai, which blocks plain HTTP clients, so we drive Playwright).
  3. Extracts text from every page: scanned pages are OCRed with Tesseract, while born-digital pages use their native text layer.
  4. Indexes the result for full-text + AI-assisted search.
  5. Serves it through a Next.js 15 / Tailwind v4 app with a catalog, document viewer, instant search, and an Anthropic Claude RAG chat with inline page citations.
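
As a rough illustration of step 3, the per-page choice between native text and OCR could be sketched as below. This is a minimal sketch, not the code in 03_extract.py: the `extract_page` function, the `PageText` record, and the `MIN_NATIVE_CHARS` threshold are hypothetical names, and the OCR call is passed in as a callable so the example does not depend on any particular PDF or OCR library.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical threshold: a page whose native text layer has fewer
# non-whitespace characters than this is treated as a scan.
MIN_NATIVE_CHARS = 40


@dataclass
class PageText:
    page: int    # 1-based page number
    text: str    # extracted text, stripped of surrounding whitespace
    source: str  # "native" or "ocr"


def extract_page(
    page_number: int,
    native_text: str,
    run_ocr: Callable[[], str],
    min_chars: int = MIN_NATIVE_CHARS,
) -> PageText:
    """Prefer the native text layer; fall back to OCR when it is too thin."""
    visible = "".join(native_text.split())  # drop all whitespace before counting
    if len(visible) >= min_chars:
        return PageText(page_number, native_text.strip(), "native")
    # run_ocr is only invoked for pages that need it, since OCR is the slow path.
    return PageText(page_number, run_ocr().strip(), "ocr")
```

Recording the `source` alongside the text is what would let a viewer flag OCR-derived pages as approximate.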
alien-files/
├── pipeline/                 Python: Playwright + PyMuPDF + Tesseract
│   ├── 01_discover.py        capture the war.gov UFO record manifest
│   ├── 02_download.py        mirror every PDF + thumbnail
│   ├── 03_extract.py         pdfplumber → text; Tesseract on scans
│   ├── 05_publish.py         copy artifacts into web/public/data
│   └── requirements.txt
├── data/
│   ├── pdfs/                 mirrored PDFs (gitignored, ~5–10 GB)
│   ├── text/                 per-document OCR JSON (gitignored)
│   └── json/                 index.json + fulltext.json + raw csv
└── web/                      Next.js 15 app
    ├── src/app/              routes: /, /search, /chat, /doc/[slug], /about
    ├── src/components/       catalog, document-viewer, search, chat
    └── src/app/api/
        ├── chat/route.ts     Claude RAG endpoint
        └── pdf/[id]/route.ts local PDF proxy (Akamai-friendly)

Quick start

# 0. Prereqs: Python 3.12, Node 22, Tesseract 5 (Windows: install at
#    "C:\Program Files\Tesseract-OCR" — installer at
#    https://github.com/UB-Mannheim/tesseract/releases ).

# 1. Pipeline (one-time corpus build, ~3 hours wall-clock with 4 OCR workers)
cd pipeline
python -m venv .venv
.\.venv\Scripts\pip install -r requirements.txt
.\.venv\Scripts\playwright install chromium

.\.venv\Scripts\python 01_discover.py    # writes data/json/index.json
.\.venv\Scripts\python 02_download.py    # writes data/pdfs/*.pdf
.\.venv\Scripts\python 03_extract.py --workers 4   # OCR + auto-publish
# (03 calls 05_publish on completion to refresh web/public/data)

# 2. Web app
cd ..\web
npm install
copy .env.example .env.local   # then edit and add ANTHROPIC_API_KEY
npm run dev
# → http://localhost:3000

Re-running

02_download.py and 03_extract.py are both idempotent — they skip files that already exist. Pass --force to redo everything, or --limit N for a quick smoke test.
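
The skip-if-exists behaviour can be pictured with the sketch below. It is an assumption-laden illustration rather than the scripts' actual logic: `plan_work` is a hypothetical helper, the one-file-per-record naming is assumed, and it is a guess that `--limit N` caps the number of records still to be processed rather than the number of records scanned.

```python
from pathlib import Path
from typing import Iterable, Optional


def plan_work(
    record_ids: Iterable[str],
    out_dir: Path,
    suffix: str = ".pdf",
    force: bool = False,
    limit: Optional[int] = None,
) -> list[str]:
    """Return the record ids that still need processing."""
    todo = []
    for record_id in record_ids:
        target = out_dir / f"{record_id}{suffix}"
        # An existing, non-empty output file counts as done unless --force is set.
        if not force and target.exists() and target.stat().st_size > 0:
            continue
        todo.append(record_id)
    if limit is not None:
        todo = todo[:limit]  # assumed meaning of --limit N: cap the pending work
    return todo
```

Treating zero-byte files as missing is a design choice in this sketch so that an interrupted download is retried on the next run.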

When war.gov publishes new releases, just rerun the whole chain:

python 01_discover.py && python 02_download.py && python 03_extract.py --workers 4

03_extract.py rebuilds fulltext.json from scratch every run, so removing a document also removes it from search.
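
A full rebuild of this kind might look like the following sketch. The JSON schema (a `pages` list with `page` and `text` fields) and the `rebuild_fulltext` name are assumptions made for illustration; the real per-document files in data/text may be shaped differently.

```python
import json
from pathlib import Path


def rebuild_fulltext(text_dir: Path, out_path: Path) -> int:
    """Regenerate the combined index from whatever per-document files exist."""
    entries = []
    for doc_path in sorted(text_dir.glob("*.json")):
        doc = json.loads(doc_path.read_text(encoding="utf-8"))
        for page in doc.get("pages", []):  # assumed schema: one dict per page
            entries.append({
                "id": f'{doc_path.stem}:{page["page"]}',
                "doc": doc_path.stem,
                "page": page["page"],
                "text": page["text"],
            })
    # Write to a temporary file and swap it in, so readers never see a
    # half-written index.
    tmp = out_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(entries, ensure_ascii=False), encoding="utf-8")
    tmp.replace(out_path)
    return len(entries)
```

Because nothing is carried over from the previous index, deleting a per-document file is enough to drop that document from search on the next run.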


Environment variables (web/.env.local)

ANTHROPIC_API_KEY=sk-ant-...

Without the key the /chat page renders, but submitting a question returns a friendly "missing key" message. Everything else (catalog, viewer, search) works with no API keys at all.


Search modes

| Mode | Where it lives | What it does |
| --- | --- | --- |
| Catalog filter | / (catalog-client.tsx) | Instant client-side filtering by agency / type / free-text |
| Full-text search | /search (search-client.tsx) | MiniSearch over every page of OCR text + record metadata, BM25-ish |
| RAG chat | /chat (chat-client.tsx) | Claude Haiku answers grounded on top-N matched pages, with citations |
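
To make the "grounded on top-N matched pages, with citations" idea concrete, here is a minimal sketch of how retrieved pages could be packed into a citable context block before being sent to the model. It is written in Python to match the pipeline sketches, not taken from the app's TypeScript route; `PageHit`, `build_context`, and the `[doc p.N]` citation tag format are all hypothetical.

```python
from typing import TypedDict


class PageHit(TypedDict):
    doc: str      # document slug
    page: int     # 1-based page number
    text: str     # page text from the OCR index
    score: float  # relevance score from the search step


def build_context(hits: list[PageHit], top_n: int = 5, max_chars: int = 1200) -> str:
    """Keep the top-N hits and label each excerpt with a citation tag."""
    best = sorted(hits, key=lambda h: h["score"], reverse=True)[:top_n]
    blocks = []
    for hit in best:
        excerpt = hit["text"][:max_chars]  # truncate long pages to bound prompt size
        blocks.append(f'[{hit["doc"]} p.{hit["page"]}]\n{excerpt}')
    return "\n\n".join(blocks)
```

Labelling every excerpt with its document and page is what would allow the model to be instructed to cite only those tags, and the UI to turn them into links back to the viewer.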

Deployment

The web app is a stock Next.js 15 project; deploy on Vercel in 30 seconds:

vercel deploy --prod

The PDFs themselves (~5–10 GB) are too large to ship through a Vercel build. Two recommended options:

  • Cloudflare R2 (10 GB free, no egress) — upload data/pdfs/ to a bucket, then change web/src/app/api/pdf/[id]/route.ts to redirect() to the R2 URL instead of streaming locally.
  • GitHub Releases — attach the PDFs as a single .zip asset and proxy through the API route. Cheapest if you don't need hot-linking.

The OCR JSON (fulltext.json, ~30–80 MB) ships fine inside the Next.js build.


Caveats

  • Tesseract output on heavily redacted carbon copies is best-effort. Treat any page marked [OCR] as approximate text, not an authoritative transcription.
  • Source documents are unmodified U.S. Government works. ALIEN.FILES never edits them — the OCR layer is independent.
  • This site is unofficial and not affiliated with the Department of War, AARO, or any U.S. government entity.

License

MIT. Source documents remain in the public domain as works of the U.S. federal government.
