feat(investigator): add five modules for citizen & investigative journalism #2

Merged
hyperpolymath merged 1 commit into main from claude/enhance-epstein-investigator-tool-NJnr9 on Apr 16, 2026

@hyperpolymath
Owner

Summary

Adds a toolkit of investigator-focused extraction modules aimed at citizen journalists and investigative reporters working with large document releases (Epstein filings, FinCEN Files, Panama/Paradise/Pandora Papers, etc.). Each module is a standalone Zig translation unit with stable C-ABI entry points, callable from Chapel, Rust, Julia, OCaml, or Python ctypes.

All new modules are pattern-based (no ML dependency), deterministic, tested, and wired into docudactyl_ffi.zig so exports are included in the shared library.
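To illustrate what "pattern-based and deterministic" means here, the following is a minimal Python sketch of a phrase-matching rate in the spirit of evasion_detect.zig (described in the table below). It is not the module's implementation: the category names, the phrase lists, and the whitespace tokenisation are assumptions made for illustration, and only three of the eight categories are sketched.

```python
import re

# Hypothetical categories and phrases; the module's real 8-category
# list is not shown in this PR.
EVASION_PATTERNS = {
    "no_recall": [r"\bi (?:do not|don't) recall\b",
                  r"\bi (?:do not|don't) remember\b"],
    "privilege": [r"\bfifth\s+amendment\b"],
    "objection": [r"\basked\s+and\s+answered\b"],
}


def evasion_rate(text):
    """Return (hits per category, total hits per 1000 tokens)."""
    # Normalise case and curly apostrophes so "I don’t recall" matches.
    lowered = text.lower().replace("\u2019", "'")
    tokens = lowered.split()  # assumption: whitespace tokenisation
    counts = {
        category: sum(len(re.findall(p, lowered)) for p in patterns)
        for category, patterns in EVASION_PATTERNS.items()
    }
    total = sum(counts.values())
    rate = 1000.0 * total / len(tokens) if tokens else 0.0
    return counts, rate
```

Because the output depends only on the input text and a fixed pattern table, the same transcript always yields the same counts, which is the property the summary calls deterministic.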

What's new

| Module | Purpose |
| --- | --- |
| flight_log.zig | Extracts aircraft tail numbers (N908JE, G-EJES), IATA/ICAO airport codes (whitelisted to reduce false positives on three-letter acronyms like CEO/FBI), phone numbers, street addresses, and passenger-manifest markers. |
| entity_graph.zig | Builds a cross-document co-occurrence graph of capitalised personal names. Exports to GraphML (Gephi, yEd, Cytoscape) and CSV (spreadsheets, Maltego, Neo4j). Edge weight accumulates across documents; a high weight flags recurring co-occurrence worth examining. |
| redaction_recovery.zig | Extends the base redaction-detection stage with per-page density maps and overlay-only text recovery for PDFs whose content stream was not scrubbed. Only extracts text already present in the stream; no encryption bypass. |
| evasion_detect.zig | Categorises non-answer patterns in deposition transcripts ("I don't recall", "Fifth Amendment", "asked and answered", ...) across 8 categories. Reports per-1000-token rate to flag evasive witness segments. |
| investigator_summary.zig | Emits investigator-friendly per-document JSON summaries with human-readable flags (has_redactions, has_recoverable_text, deposition, high_evasion, ...). |
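As a rough illustration of the whitelist idea in flight_log.zig, the Python sketch below matches tail numbers with regular expressions and keeps only those all-caps tokens that appear in an airport whitelist. The regexes, the tiny whitelist, and the function name are assumptions for illustration; the Zig module's actual patterns and code list are not shown in this PR.

```python
import re

# Illustrative whitelist only; a real list would hold thousands of codes.
AIRPORT_WHITELIST = {"TEB", "PBI", "JFK", "STT", "KTEB", "KPBI", "KJFK"}

# US N-numbers: up to five characters after "N", digits first, at most
# two trailing letters, never I or O. UK marks: "G-" plus four letters.
TAIL_NUMBER = re.compile(
    r"\b(?:N(?:[1-9]\d{0,4}"
    r"|[1-9]\d{0,3}[A-HJ-NP-Z]"
    r"|[1-9]\d{0,2}[A-HJ-NP-Z]{2})"
    r"|G-[A-Z]{4})\b"
)

# Any 3- or 4-letter all-caps token is only a candidate airport code.
CODE_CANDIDATE = re.compile(r"\b[A-Z]{3,4}\b")


def extract_flight_entities(text):
    """Return tail numbers and whitelisted airport codes in order of appearance."""
    tails = [m.group(0) for m in TAIL_NUMBER.finditer(text)]
    airports = [c for c in CODE_CANDIDATE.findall(text)
                if c in AIRPORT_WHITELIST]
    return {"tail_numbers": tails, "airport_codes": airports}
```

With this sketch, a sentence such as "N908JE departed TEB for PBI; the CEO met the FBI" would keep TEB and PBI and drop CEO and FBI, since neither acronym is in the whitelist.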

Why this matters for investigators

The base HPC pipeline answers "what's in this document?". Investigative journalism asks different questions, which these modules address:

  1. Who appears with whom, how often? entity_graph
  2. What was hidden under black bars? redaction_recovery (relevant because many Epstein-era court releases exhibit overlay-only redactions with intact content streams)
  3. Where did the jet actually go? flight_log
  4. Where did the witness stop answering? evasion_detect
  5. What does this document contain, at a glance? investigator_summary
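The first question, "who appears with whom, how often?", can be sketched as a co-occurrence count. The Python below is a simplified illustration, not the entity_graph.zig implementation: the name regex (two or more capitalised words), the one-count-per-document weighting, and the CSV column names are all assumptions.

```python
import csv
import io
import re
from collections import Counter
from itertools import combinations

# Assumption: a "personal name" is two or more capitalised words in a row.
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def cooccurrence_edges(documents):
    """Count, for each pair of names, how many documents mention both."""
    weights = Counter()
    for text in documents:
        names = sorted(set(NAME.findall(text)))
        for a, b in combinations(names, 2):
            weights[(a, b)] += 1  # weight accumulates across documents
    return weights


def edges_to_csv(weights):
    """Serialise the edge list as CSV with hypothetical column names."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    for (a, b), w in sorted(weights.items()):
        writer.writerow([a, b, w])
    return buf.getvalue()
```

Sorting the names before pairing keeps each undirected edge under a single key, so a pair seen in two documents ends up with weight 2 rather than two separate weight-1 edges.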

Files changed

  • ffi/zig/src/flight_log.zig (new)
  • ffi/zig/src/entity_graph.zig (new)
  • ffi/zig/src/redaction_recovery.zig (new)
  • ffi/zig/src/evasion_detect.zig (new)
  • ffi/zig/src/investigator_summary.zig (new)
  • ffi/zig/src/docudactyl_ffi.zig (import + comptime reference to new modules)
  • docs/INVESTIGATOR-TOOLKIT.md (new — workflow guide + C-ABI reference)

Design notes

  • No schema changes. New modules do not touch schema/stages.capnp, capnp.zig, the Cap'n Proto offset table, FFIBridge.chpl, or the Idris2 ABI proofs. They stand alongside the existing legal_ner / financial_extract / speaker_id pattern of standalone modules with their own C-ABI entry points.
  • No new C dependencies. redaction_recovery.zig reuses Poppler + GLib, which Docudactyl already links.
  • File I/O uses the proven std.io.fixedBufferStream + file.writeAll pattern that's already used elsewhere in the codebase (e.g., lith_adapter.zig, quality_stats.zig), rather than bufferedWriter.
  • Legal & ethical note on redaction_recovery.zig: it extracts text already present in the document's content stream (the same text cmd-A, cmd-C in Preview would reveal) — it does not break encryption, does not OCR under pixel-level redactions, and does not decode protected content.

Test plan

  • cd ffi/zig && zig build — builds shared + static libraries
  • cd ffi/zig && zig build test — runs all unit tests including the 20+ new ones across the five modules
  • Integration smoke test: invoke ddac_flight_log_process / ddac_evasion_detect / ddac_entity_graph_* from Chapel via the existing FFI harness with a 10-document mini-corpus
  • Verify GraphML output loads in Gephi without errors
  • Verify CSV edge list opens in a spreadsheet
  • Verify per-document JSON summary is valid JSON (jq . out.json)
  • Run redaction_recovery against a known-redacted PDF from the Epstein release set and confirm recoverable text extraction
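The GraphML and JSON checks in the test plan could also be scripted before opening Gephi or running jq. The sketch below uses only the Python standard library; it assumes the GraphML output uses the standard GraphML namespace and that the summary flags are top-level booleans, neither of which is confirmed by this PR, and both function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Standard GraphML namespace; assumed to be what the exporter emits.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def check_graphml(xml_text):
    """Parse GraphML and return (node count, edge count).

    Raises ValueError if an edge refers to a node id that is not declared.
    """
    root = ET.fromstring(xml_text)
    nodes = {n.get("id") for n in root.iter(GRAPHML_NS + "node")}
    edges = list(root.iter(GRAPHML_NS + "edge"))
    for edge in edges:
        for end in ("source", "target"):
            if edge.get(end) not in nodes:
                raise ValueError(f"edge {end} {edge.get(end)!r} is not a node")
    return len(nodes), len(edges)


def check_summary(json_text, expected_flags):
    """Return the expected flags that are missing or not booleans."""
    doc = json.loads(json_text)  # raises ValueError on invalid JSON
    if not isinstance(doc, dict):
        raise ValueError("summary is not a JSON object")
    return [f for f in expected_flags if not isinstance(doc.get(f), bool)]
```

A check like this only confirms that the files are well-formed and internally consistent; loading the graph in Gephi remains the real compatibility test.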

License

All new files are released under PMPL-1.0-or-later (with MPL-2.0 fallback), matching the rest of the project.

https://claude.ai/code/session_01Rnf2JHDP9gMcZW43qxsbSU

…nalism

Adds a toolkit of investigator-focused extraction modules aimed at citizen
journalists and investigative reporters working with large document
releases (Epstein filings, FinCEN Files, Panama/Paradise/Pandora Papers,
etc.). Each module is a standalone Zig translation unit with stable C-ABI
entry points, callable from Chapel, Rust, Julia, OCaml, or Python ctypes.

Modules:

- flight_log.zig: extracts aircraft tail numbers (N908JE, G-EJES),
  IATA/ICAO airport codes (whitelisted to reduce false positives),
  phone numbers, street addresses, and passenger-manifest markers.

- entity_graph.zig: builds a cross-document co-occurrence graph of
  capitalised personal names with GraphML + CSV export for use in
  Gephi / Cytoscape / Maltego / spreadsheets.

- redaction_recovery.zig: extends the base redaction-detection stage
  with per-page density maps and overlay-only text recovery for PDFs
  whose content stream was not scrubbed. Only extracts text already
  present in the stream — no encryption bypass.

- evasion_detect.zig: categorises non-answer patterns in deposition
  transcripts ("I don't recall", "Fifth Amendment", "asked and
  answered", etc.) with per-1000-token rate metric to flag evasive
  witness segments.

- investigator_summary.zig: emits investigator-friendly per-document
  JSON summaries with human-readable flags (has_redactions,
  has_recoverable_text, deposition, high_evasion, ...).

Also adds docs/INVESTIGATOR-TOOLKIT.md documenting the recommended
workflow and each module's C-ABI surface.

All modules are pattern-based (no ML dependency), deterministic, and
tested. Wired into docudactyl_ffi.zig so exports are included in the
shared library.
@hyperpolymath hyperpolymath merged commit 7b0b9b8 into main Apr 16, 2026
18 of 23 checks passed
@hyperpolymath hyperpolymath deleted the claude/enhance-epstein-investigator-tool-NJnr9 branch April 16, 2026 03:48