feat(investigator): add five modules for citizen & investigative journalism #2
Merged
hyperpolymath merged 1 commit on Apr 16, 2026
Adds a toolkit of investigator-focused extraction modules aimed at citizen
journalists and investigative reporters working with large document
releases (Epstein filings, FinCEN Files, Panama/Paradise/Pandora Papers,
etc.). Each module is a standalone Zig translation unit with stable C-ABI
entry points, callable from Chapel, Rust, Julia, OCaml, or Python ctypes.
Modules:
- flight_log.zig: extracts aircraft tail numbers (N908JE, G-EJES),
IATA/ICAO airport codes (whitelisted to reduce false positives),
phone numbers, street addresses, and passenger-manifest markers.
- entity_graph.zig: builds a cross-document co-occurrence graph of
capitalised personal names with GraphML + CSV export for use in
Gephi / Cytoscape / Maltego / spreadsheets.
- redaction_recovery.zig: extends the base redaction-detection stage
with per-page density maps and overlay-only text recovery for PDFs
whose content stream was not scrubbed. Only extracts text already
present in the stream — no encryption bypass.
- evasion_detect.zig: categorises non-answer patterns in deposition
transcripts ("I don't recall", "Fifth Amendment", "asked and
answered", etc.) and reports a per-1000-token rate to flag evasive
witness segments.
- investigator_summary.zig: emits investigator-friendly per-document
JSON summaries with human-readable flags (has_redactions,
has_recoverable_text, deposition, high_evasion, ...).
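As an illustration of the per-1000-token rate that evasion_detect.zig reports, here is a minimal Python sketch. It is not the module's implementation: the phrase list, the category names, and the whitespace tokenisation are all assumptions made for this example.

```python
import re

# Hypothetical phrase list and category names; the real module's
# patterns are not shown in this PR.
EVASION_PATTERNS = {
    "no_recall": re.compile(r"\bI (?:don't|do not) recall\b", re.IGNORECASE),
    "fifth_amendment": re.compile(r"\bFifth Amendment\b", re.IGNORECASE),
    "asked_and_answered": re.compile(r"\basked and answered\b", re.IGNORECASE),
}


def evasion_rate(text: str) -> dict:
    """Count non-answer phrases per category and normalise per 1000 tokens."""
    tokens = text.split()  # assumption: tokens are whitespace-separated words
    counts = {name: len(p.findall(text)) for name, p in EVASION_PATTERNS.items()}
    total = sum(counts.values())
    # Guard against empty input so the rate is always defined.
    rate = 1000.0 * total / len(tokens) if tokens else 0.0
    return {"counts": counts, "tokens": len(tokens), "rate_per_1000": rate}
```

Normalising by token count lets short and long transcript segments be compared on the same scale. The threshold at which a segment counts as high_evasion is not stated in the PR, so none is assumed here.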
Also adds docs/INVESTIGATOR-TOOLKIT.md documenting the recommended
workflow and each module's C-ABI surface.
All modules are pattern-based (no ML dependency), deterministic, and
tested. Wired into docudactyl_ffi.zig so exports are included in the
shared library.
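To make "pattern-based" and "deterministic" concrete, the following Python sketch shows how tail numbers and whitelisted airport codes might be pulled from text in the spirit of flight_log.zig. The regular expressions are simplified and the whitelist is a tiny illustrative sample; both are assumptions for this example, not the module's actual patterns.

```python
import re

# Simplified US N-number: N, a non-zero digit, then up to four more
# characters ending in at most two letters (I and O excluded).
US_TAIL = re.compile(
    r"\bN[1-9](?:\d{0,4}|\d{0,3}[A-HJ-NP-Z]|\d{0,2}[A-HJ-NP-Z]{2})\b"
)
# Simplified UK registration: G- followed by four letters.
UK_TAIL = re.compile(r"\bG-[A-Z]{4}\b")
# Any standalone three-letter uppercase token is only a candidate.
IATA_CANDIDATE = re.compile(r"\b[A-Z]{3}\b")

# Illustrative whitelist only; a real one would cover far more airports.
AIRPORT_WHITELIST = {"TEB", "PBI", "JFK", "STT"}


def extract_flight_entities(text: str) -> dict:
    """Return tail numbers and whitelisted airport codes found in text."""
    tails = US_TAIL.findall(text) + UK_TAIL.findall(text)
    airports = [c for c in IATA_CANDIDATE.findall(text) if c in AIRPORT_WHITELIST]
    return {"tail_numbers": tails, "airports": airports}
```

The whitelist is what keeps acronyms such as CEO or FBI out of the airport list, and because the output depends only on the input text and fixed patterns, repeated runs give identical results.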
Summary
Adds a toolkit of investigator-focused extraction modules aimed at citizen journalists and investigative reporters working with large document releases (Epstein filings, FinCEN Files, Panama/Paradise/Pandora Papers, etc.). Each module is a standalone Zig translation unit with stable C-ABI entry points, callable from Chapel, Rust, Julia, OCaml, or Python ctypes.
All new modules are pattern-based (no ML dependency), deterministic, tested, and wired into docudactyl_ffi.zig so exports are included in the shared library.
What's new
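- flight_log.zig: extracts aircraft tail numbers (N908JE, G-EJES), IATA/ICAO airport codes (whitelisted to reduce false positives on three-letter acronyms like CEO/FBI), phone numbers, street addresses, and passenger-manifest markers.
- entity_graph.zig: builds a cross-document co-occurrence graph of capitalised personal names, with GraphML + CSV export.
- redaction_recovery.zig: adds per-page redaction density maps and overlay-only text recovery for PDFs whose content stream was not scrubbed.
- evasion_detect.zig: categorises non-answer patterns in deposition transcripts and reports a per-1000-token rate.
- investigator_summary.zig: emits per-document JSON summaries with human-readable flags (has_redactions, has_recoverable_text, deposition, high_evasion, ...).
Why this matters for investigators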
The base HPC pipeline answers "what's in this document?". Investigative journalism needs different questions, which these modules address:
- entity_graph
- redaction_recovery (relevant because many Epstein-era court releases exhibit overlay-only redactions with intact content streams)
- flight_log
- evasion_detect
- investigator_summary
Files changed
- ffi/zig/src/flight_log.zig (new)
- ffi/zig/src/entity_graph.zig (new)
- ffi/zig/src/redaction_recovery.zig (new)
- ffi/zig/src/evasion_detect.zig (new)
- ffi/zig/src/investigator_summary.zig (new)
- ffi/zig/src/docudactyl_ffi.zig (import + comptime reference to new modules)
- docs/INVESTIGATOR-TOOLKIT.md (new: workflow guide + C-ABI reference)
Design notes
- The new modules do not touch schema/stages.capnp, capnp.zig, the Cap'n Proto offset table, FFIBridge.chpl, or the Idris2 ABI proofs. They stand alongside the existing legal_ner / financial_extract / speaker_id pattern of standalone modules with their own C-ABI entry points.
- redaction_recovery.zig reuses Poppler + GLib, which Docudactyl already links.
- The modules use the std.io.fixedBufferStream + file.writeAll pattern that is already used elsewhere in the codebase (e.g., lith_adapter.zig, quality_stats.zig), rather than bufferedWriter.
- Scope of redaction_recovery.zig: it extracts text already present in the document's content stream (the same text that cmd-A, cmd-C in Preview would reveal). It does not break encryption, does not OCR under pixel-level redactions, and does not decode protected content.
Test plan
- cd ffi/zig && zig build: builds shared + static libraries
- cd ffi/zig && zig build test: runs all unit tests, including the 20+ new ones across the five modules
- Call ddac_flight_log_process / ddac_evasion_detect / ddac_entity_graph_* from Chapel via the existing FFI harness with a 10-document mini-corpus
- Check that the JSON output parses (jq . out.json)
- Run redaction_recovery against a known-redacted PDF from the Epstein release set and confirm recoverable text extraction
License
All new files are released under PMPL-1.0-or-later (with MPL-2.0 fallback), matching the rest of the project.
https://claude.ai/code/session_01Rnf2JHDP9gMcZW43qxsbSU