Hybrid AI writing forensics pipeline for Mediaite.com: deterministic stages scrape → extract → analyze → report, combining statistical stylometry (change-points, time series, hypothesis tests) with embedding drift and optional token-probability and AI-baseline comparison.
Pull requests receive a CI report comment (pytest summary and line coverage vs main) from .github/workflows/ci-report.yml.
- What it is: A deterministic scrape → extract → analyze → report pipeline over a WordPress newsroom corpus, with stylometry, embedding drift, optional token-probability signals, and Quarto outputs.
- Who it is for: Forensic reviewers, operators reproducing a locked configuration, and engineers extending the pipeline.
- What it is not: Outputs are statistical and documentary signals, not legal findings or definitive attribution of authorship or tool use. See Responsible use.
- Contributors: Human workflow and PR expectations are in CONTRIBUTING.md. Automation and agent rules live in AGENTS.md.
```shell
git clone git@github.com:Abstract-Data/mediaite-ghostink.git
cd mediaite-ghostink
make peer-setup
uv run forensics preflight
```

Then configure real authors in `config.toml` (see Configuration) before any live scrape. For a full run and report, continue with Typical workflows.
- What this project does
- Models, measurements, and algorithms
- Local machine setup
- Forensic assurance and chain of custody
- Responsible use
- Architecture
- Requirements
- Installation
- Configuration
- CLI
- Typical workflows
- Repository layout
- Data layout
- Optional dependency extras
- Reports (Quarto)
- Notebooks and prompts
- Development
- Documentation
- Agent and contributor notes
The codebase implements two complementary lenses (see docs/ARCHITECTURE.md and docs/adr/ADR-001-hybrid-forensics-methodology.md):
```mermaid
flowchart TB
subgraph corpus[Corpus]
WP[WordPress REST + HTML]
DB[("SQLite articles.db")]
WP --> DB
end
subgraph features[Feature plane]
F["Lexical + structural + content + productivity"]
E[384-d sentence embeddings]
DB --> F
DB --> E
end
subgraph lensA["Pipeline A: stylometry"]
CP[Change-points + rolling stats + convergence]
HT[Hypothesis tests + effect sizes + FDR]
F --> CP --> HT
end
subgraph lensB["Pipeline B: embedding drift"]
DR[Centroid velocity + cosine decay + variance]
E --> DR
end
subgraph out[Outputs]
ART[Analysis JSON + custody + metadata]
REP[Quarto HTML / PDF]
HT --> ART
DR --> ART
ART --> REP
end
```
| Track | Role |
|---|---|
| Pipeline A — Statistical stylometry | Lexical, structural, content, and productivity features over time; change-point methods (PELT, BOCPD, and related tests), rolling statistics, convergence windows, classical tests, effect sizes, and multiple-comparison correction. |
| Pipeline B — Embedding drift | Sentence-transformer embeddings (default 384-dimensional sentence-transformers/all-MiniLM-L6-v2); centroid velocity, similarity decay, intra-period variance, optional UMAP views, optional comparison to synthetic “AI baseline” text. |
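Pipeline B's headline drift metrics reduce to simple vector arithmetic over per-article embeddings. A minimal sketch in plain Python (the function names are illustrative, not the project's API; real runs operate on 384-d sentence-transformer vectors):

```python
import math

def centroid(vectors: list[list[float]]) -> list[float]:
    """Mean vector of a period's article embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def centroid_velocity(monthly_centroids: list[list[float]]) -> list[float]:
    """Euclidean distance between consecutive monthly centroids."""
    return [
        math.sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))
        for c1, c2 in zip(monthly_centroids, monthly_centroids[1:])
    ]
```

Similarity decay is then just `cosine` evaluated between each month's centroid and a fixed early-baseline centroid.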
Optional tracks (extras and config):
| Track | Role |
|---|---|
| Phase 9 — Token-level probability | Reference language model (default Hugging Face GPT-2) for perplexity-style signals; optional Binoculars-style contrast using Falcon-7B base vs instruct checkpoints (uv sync --extra probability). |
| Phase 10 — AI baseline generation | Local Ollama models (configurable) generate synthetic articles for controlled comparison (uv sync --extra baseline). |
Outputs include SQLite + Parquet + DuckDB-friendly artifacts, JSONL exports, analysis JSON under data/analysis/, optional probability and baseline trees under data/probability/ and data/ai_baseline/, and Quarto-driven reports under data/reports/.
```mermaid
flowchart LR
S[Scrape] --> X[Extract]
X --> N[Analyze]
N --> P[Report]
S -. optional .-> P9[Phase 9 probability]
X -. optional .-> P9
N -. optional .-> P10[Phase 10 AI baseline]
```
The settings below take their defaults from `config.toml` and `src/forensics/config/settings.py`; override them with `FORENSICS_` environment variables. Nested analysis knobs use one `__` segment per model level (for example `FORENSICS_ANALYSIS__HYPOTHESIS__SIGNIFICANCE_THRESHOLD` or `FORENSICS_ANALYSIS__CONVERGENCE__CONVERGENCE_USE_PERMUTATION`). See ADR-016 in `docs/adr/016-analysis-config-nesting.md`.
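As a concrete illustration of the `__` nesting rule, the following stand-alone sketch folds `FORENSICS_`-prefixed variables into a nested mapping the way pydantic-settings does with `env_nested_delimiter="__"` (this mirrors the behavior for illustration; it is not the project's loader):

```python
def parse_nested_env(env: dict[str, str], prefix: str = "FORENSICS_") -> dict:
    """Fold FORENSICS_A__B__C=x style variables into nested dicts,
    one level per double-underscore segment."""
    tree: dict = {}
    for key, value in env.items():
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].lower().split("__")
        node = tree
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return tree
```

So `FORENSICS_ANALYSIS__HYPOTHESIS__SIGNIFICANCE_THRESHOLD=0.01` lands at `analysis.hypothesis.significance_threshold`.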
| Component | Default | Purpose |
|---|---|---|
| spaCy | `en_core_web_md` (TOML: `spacy_model`) | Tokenization, linguistic features, and preflight validation. The default pipeline ships as a wheel in `pyproject.toml`; `uv sync` installs it (use `spacy download` only if you change `spacy_model`). |
| Sentence Transformers | `sentence-transformers/all-MiniLM-L6-v2` (`[analysis] embedding_model`) | Dense 384-d article embeddings for drift, similarity decay, and monthly centroids. `embedding_model_version` is recorded for provenance. |
| scikit-learn | LDA, TF–IDF, etc. | Topic diversity, self-similarity, and related content features (see `src/forensics/features/`). |
Implemented feature families (see Feature families in docs/ARCHITECTURE.md):
- Lexical — type–token ratio (TTR), MATTR, hapax rates, Yule’s K, Simpson’s D, stylometric “AI marker” and function-word style signals.
- Structural — sentence length statistics, passive voice ratio, punctuation profile, parse-depth style measures via spaCy.
- Content — n-gram entropy, rolling LDA topic diversity, formulaic / hedging-style scores.
- Productivity — inter-article gaps, rolling counts, burst-style signals over the timeline.
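Several of the lexical measures have compact textbook definitions. A sketch of three of them (standard formulas; the project's implementations may tokenize and normalize differently):

```python
from collections import Counter

def type_token_ratio(tokens: list[str]) -> float:
    """TTR: distinct types over total tokens."""
    return len(set(tokens)) / len(tokens)

def hapax_rate(tokens: list[str]) -> float:
    """Fraction of tokens whose type occurs exactly once."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(tokens)

def yules_k(tokens: list[str]) -> float:
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2, where V_i is the
    number of types occurring i times. Lower K = richer vocabulary."""
    n = len(tokens)
    freq_of_freq = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freq.items())
    return 10_000 * (s2 - n) / (n * n)
```

MATTR differs from plain TTR only in averaging the ratio over a moving window, which removes TTR's sensitivity to document length.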
textstat contributes readability-style scalar signals where used in the feature pipeline.
| Method | Library / notes |
|---|---|
| PELT | ruptures, RBF cost; penalty from AnalysisConfig.pelt_penalty (default 3.0). |
| BOCPD | Bayesian online change-point detection (custom scipy-based implementation); hazard and threshold from settings. |
| Chow, CUSUM, Kleinberg bursts | Implemented under src/forensics/analysis/ (see architecture doc). |
| Convergence windows | Windows where a minimum fraction of features move together (convergence_window_days, convergence_min_feature_ratio); optional permutation null (convergence_use_permutation) logs empirical p-values without changing detected windows. |
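As an illustration of the flavor of these detectors, here is a minimal single-mean-shift CUSUM locator in pure Python (a textbook sketch, not the code under `src/forensics/analysis/`):

```python
def cusum_changepoint(series: list[float]) -> int:
    """Classical CUSUM estimate of a single mean-shift location:
    the index maximizing |cumulative deviation from the global mean|.
    Returns the first index of the post-change segment."""
    mean = sum(series) / len(series)
    best_idx, best_val, running = 0, 0.0, 0.0
    for i, x in enumerate(series):
        running += x - mean
        if abs(running) > best_val:
            best_val, best_idx = abs(running), i
    return best_idx + 1
```

PELT and BOCPD generalize this idea to multiple change-points with a penalty (PELT) or an online posterior over run lengths (BOCPD).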
- Tests: Welch’s t, Mann–Whitney U, Kolmogorov–Smirnov (as implemented in analysis modules).
- Effect sizes: Cohen’s d where applicable.
- Intervals: Bootstrap resamples (`bootstrap_iterations`, default 1000).
- Multiple comparisons: Benjamini–Hochberg or Bonferroni (`multiple_comparison_method`).
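The Benjamini–Hochberg step-up procedure is short enough to sketch directly (a textbook implementation, not the project's module):

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """BH step-up: find the largest p_(k) with p_(k) <= (k/m) * alpha,
    then reject every hypothesis whose p-value is at or below it."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = -1.0  # sentinel: no rejections
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = p_values[idx]
    return [p <= cutoff for p in p_values]
```

Note the step-up behavior: a late p-value just under its threshold rescues all smaller p-values, which is why BH is less conservative than Bonferroni.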
| Setting | Default | Role |
|---|---|---|
| `reference_model` | `gpt2` | Causal LM for perplexity-style features. |
| `reference_model_revision` | pinned revision id | Reproducible HF snapshot. |
| `binoculars_model_base` / `binoculars_model_instruct` | Falcon-7B pair | Optional contrastive signal (`binoculars_enabled` default false). |
| `max_sequence_length`, `sliding_window_stride`, `batch_size`, `device` | see `config.toml` | Windowing and compute. |
Local Ollama HTTP API (baseline.ollama_base_url); model tags from baseline.models and temperatures from baseline.temperatures. Generated artifacts and manifests live under data/ai_baseline/ (see docs/RUNBOOK.md).
[survey] thresholds (min_articles, min_span_days, min_words_per_article, yearly density, recent activity) gate which authors qualify for blind survey runs (forensics survey). See docs/RUNBOOK.md.
- Install uv and Python 3.13 (see `requires-python` in `pyproject.toml`).
- Clone and install dependencies:

  ```shell
  git clone git@github.com:Abstract-Data/mediaite-ghostink.git
  cd mediaite-ghostink
  make peer-setup
  ```

  `make peer-setup` runs `uv sync --extra dev --extra tui`, `uv run forensics validate`, and `uv run forensics peer-setup` (copy-paste `uv sync` tiers, spaCy / Quarto notes, and `ollama pull` lines from `[baseline] models` when present). For hints only after you have synced: `uv run forensics peer-setup`.
- spaCy pipeline — The default `en_core_web_md` is installed as a direct wheel dependency in `pyproject.toml` (`en-core-web-md @ https://github.com/explosion/...whl`); `uv sync` / `make peer-setup` brings it in. Use `uv run python -m spacy download <name>` only if you change `spacy_model` in `config.toml`.
- Edit `config.toml` (see also `config.toml.example` for the same template with setup notes) — Replace the template authors (`placeholder-target` / `placeholder-control`) with real author rows before any live scrape; the CLI rejects placeholders on discover/metadata/fetch paths.
- Optional: Quarto — Required for `forensics report` and the report step of `forensics all`. Install Quarto so `quarto` is on your `PATH`.
- Optional extras (also shown by `forensics peer-setup`):
  - `uv sync --extra probability` — Phase 9 token features (`forensics extract --probability`); pulls torch / transformers (large download).
  - `uv sync --extra baseline` — Phase 10 Ollama-driven baseline generation (`scripts/generate_baseline.py`, `forensics analyze --ai-baseline`, …).
  - `uv sync --extra tui` — Interactive `forensics setup` wizard (`uv run forensics setup`).
- Optional: Ollama — For baseline generation, install Ollama and pull the model tags listed in `[baseline] models` (see `docs/RUNBOOK.md`); `forensics peer-setup` prints one `ollama pull <tag>` per configured tag. Use `forensics peer-setup --check-ollama` to verify reachability (no auto-pull).
- Validate before a long run:

  ```shell
  uv run forensics validate
  uv run forensics preflight  # add --strict to fail on warnings
  ```

- Secrets / environment — Copy `.env.example` if your deployment uses external secrets or observability; the core pipeline is driven by `config.toml` and `FORENSICS_*`. Override the config file path with `FORENSICS_CONFIG_FILE`.
Default SQLite corpus path is data/articles.db under the project root (see DEFAULT_DB_RELATIVE in settings).
This project is structured for auditable, staged research: each stage reads and writes defined artifacts so an independent reviewer can trace what was collected, how it was transformed, and which parameters were active.
- Scrape — WordPress REST discovery, metadata, optional bulk or per-article body fetch, simhash near-duplicate control (`simhash_threshold`), persistence to SQLite (`content_hash` per article, scrape timestamps).
- Extract — Deterministic feature vectors to Parquet; embeddings to `data/embeddings/`; optional probability parquet.
- Analyze — JSON results under `data/analysis/`; analysis run rows in SQLite; corpus custody file written after analysis (see below).
- Report — Quarto render from `notebooks/` into `data/reports/`.
Canonical paths are summarized in docs/ARCHITECTURE.md.
```mermaid
flowchart TB
subgraph stages[Stages]
direction TB
T1[1 Scrape]
T2[2 Extract]
T3[3 Analyze]
T4[4 Report]
T1 --> T2 --> T3 --> T4
end
subgraph artifacts[Primary artifacts]
direction TB
M[authors_manifest.jsonl]
DB[("data/articles.db")]
FE["data/features/ Parquet tables"]
EM[data/embeddings/]
AN["data/analysis/ JSON + corpus_custody.json"]
RP[data/reports/]
end
T1 --> M
T1 --> DB
T2 --> FE
T2 --> EM
T3 --> AN
T4 --> RP
```
- Per-article `content_hash` — SHA-256 of normalized article text at ingest (see `forensics.utils.hashing.content_hash`); stored in SQLite for tamper-evident comparison of body text.
- `corpus_custody.json` — Written under `data/analysis/` after analysis (`write_corpus_custody`): records a corpus-level hash derived from ordered per-article `content_hash` values so later runs can detect corpus drift.
- `compute_config_hash` — Deterministic hash of the full resolved `ForensicsSettings` payload (excluding derived paths) for tying reports to a configuration snapshot (`get_run_metadata` / run metadata patterns).
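The custody chain above can be sketched end to end with stdlib hashing (function names mirror the docs, but the bodies are assumptions rather than the code in `forensics.utils` — in particular, the normalization step is a stand-in):

```python
import hashlib
import json

def content_hash(text: str) -> str:
    """SHA-256 of normalized article text. Normalization here is a
    placeholder (collapse whitespace); the real pipeline may differ."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def corpus_hash(article_hashes: list[str]) -> str:
    """Corpus-level hash over the ordered per-article hashes, so any
    reorder, edit, insertion, or deletion changes the result."""
    joined = "\n".join(article_hashes)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def config_hash(settings: dict) -> str:
    """Deterministic hash of a resolved settings payload via canonical JSON."""
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the corpus hash is order-sensitive, later runs can recompute it from live rows and compare against `corpus_custody.json` to detect drift.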
```mermaid
flowchart LR
subgraph ingest[Ingest integrity]
TXT[Normalized article text]
H1[SHA-256 content_hash]
TXT --> H1
H1 --> ROW[(articles row)]
end
subgraph freeze[Post-analysis freeze]
ROW --> CH[Ordered concat of content_hash]
CH --> CORP[corpus_custody.json]
CFG[ForensicsSettings] --> CFH[config hash]
CFH --> META[run_metadata.json]
CORP --> META
end
subgraph check[Verification]
LIVE[Recompute live corpus hash]
CORP --> CMP{Matches?}
LIVE --> CMP
CMP -->|yes| OK[Continue report or analyze]
CMP -->|no| BAD[Exit non-zero]
end
```
- `uv run forensics report --verify` — Before render, recomputes the live corpus hash and compares it to `data/analysis/corpus_custody.json`; fails if missing or mismatched.
- `uv run forensics analyze --verify-corpus` — Same hash check without rendering a report.
forensics lock-preregistration writes a hashed lock of analysis thresholds to data/preregistration/preregistration_lock.json. analyze always runs verify_preregistration and records status in data/analysis/run_metadata.json (ok / missing / mismatch). This supports pre-registered vs exploratory analysis discipline (see docs/RUNBOOK.md).
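A minimal lock-and-verify sketch of that discipline (the file layout and field names here are illustrative assumptions; the real lock format in `data/preregistration/` may differ):

```python
import hashlib
import json
from pathlib import Path

def lock_thresholds(thresholds: dict, lock_path: Path) -> str:
    """Write a hashed lock of analysis thresholds; return the hash."""
    canonical = json.dumps(thresholds, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    lock_path.write_text(json.dumps({"sha256": digest, "thresholds": thresholds}))
    return digest

def verify_preregistration(thresholds: dict, lock_path: Path) -> str:
    """Return 'ok', 'missing', or 'mismatch', mirroring the
    run-metadata statuses described above."""
    if not lock_path.exists():
        return "missing"
    lock = json.loads(lock_path.read_text())
    canonical = json.dumps(thresholds, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "ok" if digest == lock.get("sha256") else "mismatch"
```

The point of hashing rather than merely copying the thresholds is that a post-hoc edit to either side flips the status to `mismatch` in the run record.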
config.toml includes:
```toml
[chain_of_custody]
verify_corpus_hash = true
verify_raw_archives = true
log_all_generations = true
```

These flags document the intended custody posture. `verify_corpus_hash` is enforced when you pass `--verify` / `--verify-corpus` as above (the TOML flags are not yet auto-wired to skip those CLI switches). Raw-archive and generation-log toggles are reserved for stricter operational policies; baseline manifests still record generation metadata under `data/ai_baseline/` when you run Phase 10.
- `data/articles.jsonl` — Human-readable corpus export for review.
- `uv run forensics export` — Single-file DuckDB bundle over SQLite + optional Parquet + analysis JSON (see runbook).
- Frozen `config.toml` (or `FORENSICS_CONFIG_FILE` copy) and `FORENSICS_*` env used for the run.
- `data/analysis/run_metadata.json`, `corpus_custody.json`, per-author `*_result.json` and related analysis JSON.
- `analysis_runs` rows in `data/articles.db` (stage descriptions and timing where recorded).
- Quarto HTML/PDF outputs and the notebook sources under `notebooks/`.
- For probability / baseline: `data/probability/model_card.json`, `data/ai_baseline/generation_manifest.json`, and referenced model revisions.
- Scraping: Defaults respect `robots.txt`, use a declared `user_agent`, and apply rate limits (`[scraping]`). Adjust only in line with site policy and applicable law.
- Outcomes: Stylometry and drift metrics are statistical signals, not legal findings. Targets vs controls must be defined before confirmatory interpretation; use preregistration and documented baselines.
- Synthetic text: AI baseline generation is for controlled comparison, not for passing off as human journalism.
- Entrypoint: `forensics` console script → `src/forensics/cli/__init__.py` (Typer). Use `uv run forensics --help` and `uv run forensics <command> --help`.
- Stages: scraper (WordPress discovery + HTTP + dedup), feature extraction (`src/forensics/features/`), analysis (`src/forensics/analysis/`), reporting (`src/forensics/reporting/`). Full orchestration for `forensics all` lives in `src/forensics/pipeline.py`.
- Configuration: `src/forensics/config/settings.py` loads `config.toml` at the project root with `FORENSICS_` environment overrides (pydantic-settings). Override the TOML path with `FORENSICS_CONFIG_FILE`.
Storage and model contracts are summarized in docs/ARCHITECTURE.md and ADRs under docs/adr/.
```mermaid
flowchart TB
API[WordPress REST API]
API --> SQL[(SQLite write store)]
SQL --> PQ[Parquet feature store]
SQL --> J[articles.jsonl export]
SQL --> DD[DuckDB analytical layer]
PQ --> DD
DD --> NB[Notebooks + Quarto]
J --> NB
NB --> OUT[HTML / PDF reports]
```
- Python 3.13 (see `requires-python` in `pyproject.toml`).
- uv for environments and script execution (`uv run …`).
- Quarto on your `PATH` if you run `forensics report` or the report step of `forensics all` (download).
- spaCy English model for feature work aligned with CI (default `en_core_web_md`) — installed with the project via the pinned wheel in `pyproject.toml` when you run `uv sync` / `make peer-setup`. Run `uv run python -m spacy download <name>` only if you point `spacy_model` at another pipeline.
```shell
git clone git@github.com:Abstract-Data/mediaite-ghostink.git
cd mediaite-ghostink
make peer-setup
# or: uv sync --extra dev --extra tui && uv run forensics validate
```

Copy `.env.example` to `.env` when you need optional secrets or observability hooks. Core pipeline configuration remains `config.toml` + `FORENSICS_*`.
- Authors and scraping: Edit `config.toml`. Replace template rows whose slugs are `placeholder-target` / `placeholder-control` with real authors before any live scrape; the CLI rejects those placeholders for discover/metadata/fetch paths.
- Nested settings: Tables such as `[scraping]`, `[analysis]`, `[survey]`, `[probability]`, `[baseline]`, `[report]`, and `[chain_of_custody]` tune rate limits, analysis thresholds, survey eligibility, optional Phase 9/10 behavior, and report output.
- Environment: Nested keys via `FORENSICS_*` are described in `src/forensics/config/settings.py` and `.env.example`.
Global options (Typer app root):
```shell
uv run forensics --version
uv run forensics -v scrape --help  # example: DEBUG logs for scrape
```

| Command | Purpose |
|---|---|
| `scrape` | WordPress author discovery, article metadata, HTML fetch, simhash dedup, optional raw archive. Combine flags as documented in `uv run forensics scrape --help` (e.g. `--discover`, `--metadata`, `--fetch`, `--dedup`, `--archive`, `--dry-run` with `--fetch`, `--force-refresh` with discover). |
| `extract` | Feature extraction + embeddings from `data/articles.db`. Options include `--author`, `--skip-embeddings`, and `--probability` (requires `--extra probability`). |
| `analyze` | Modes via flags: `--changepoint`, `--timeseries`, `--drift`, `--convergence`, `--compare`, `--ai-baseline`, the corpus check `--verify-corpus`, optional `--author`. With no analysis flags, the default runs time-series plus the full convergence-oriented analysis path; add flags to narrow or extend. See `uv run forensics analyze --help`. |
| `report` | Quarto render (`--notebook`; `--format` with html or pdf). |
| `all` | End-to-end: full scrape (`dispatch_scrape` with all stage flags false → same path as bare `forensics scrape`) → `extract_all_features` → `run_analyze(AnalyzeRequest(timeseries=True, convergence=True))` (no `--changepoint` / `--drift` unless you change `pipeline.py`) → `run_report`. See `docs/ARCHITECTURE.md`. |
| `validate`, `preflight`, `survey`, `calibrate`, `export`, `lock-preregistration`, `setup` | Operational and quality workflows — see `docs/RUNBOOK.md`. |
Full pipeline (after configuring real authors and installing Quarto):

```shell
uv run forensics all
```

Incremental scrape (example):

```shell
uv run forensics scrape --discover
uv run forensics scrape --metadata
uv run forensics scrape --fetch --dry-run  # count only
uv run forensics scrape --fetch
```

Features then analysis:

```shell
uv run forensics extract
uv run forensics analyze --changepoint --timeseries
uv run forensics analyze --drift
uv run forensics report --format html
```

| Path | Role |
|---|---|
| `src/forensics/` | Application package (CLI, scraper, features, analysis, storage, config, models). |
| `tests/` | Pytest suite (`unit/`, `integration/`, `evals/`, fixtures, Hypothesis tests). |
| `docs/` | Architecture, testing policy, runbook, ADRs, deployment notes. |
| `_quarto.yml`, `index.qmd` | Quarto book project config and landing chapter (output under `data/reports/`). |
| `notebooks/` | Jupyter chapters consumed by Quarto. |
| `prompts/` | Versioned prompts for agents and pipeline phases. |
| `scripts/` | Maintenance and one-off utilities. |
| `evals/` | Eval scenarios referenced from tooling or docs. |
| Path | Role |
|---|---|
| `data/articles.db` | Primary SQLite store (articles, authors, run metadata). |
| `data/authors_manifest.jsonl` | Discovered author manifest from scrape. |
| `data/raw/` | Raw HTML / year archives (see `scrape --archive`). |
| `data/features/` | Per-author feature tables (Parquet). |
| `data/embeddings/` | Embedding batches used by drift and reports. |
| `data/analysis/` | Per-author JSON results, run metadata, `corpus_custody.json`. |
| `data/articles.jsonl` | JSONL export for auditing. |
| `data/reports/` | Quarto book output (see `_quarto.yml` `project.output-dir`). |
| `data/probability/` | Phase 9 outputs when enabled. |
| `data/ai_baseline/` | Phase 10 synthetic baseline artifacts. |
| `data/survey/`, `data/calibration/` | Survey and calibration run outputs (see runbook). |
Exact filenames evolve with the pipeline; treat docs/ARCHITECTURE.md as the conceptual map.
Defined in pyproject.toml:
| Extra | Install | Use |
|---|---|---|
| `dev` | `uv sync --extra dev` | pytest, pytest-cov, Hypothesis, Ruff, pre-commit. |
| `probability` | `uv sync --extra probability` | Phase 9 token-level features (torch, transformers); `forensics extract --probability`. |
| `baseline` | `uv sync --extra baseline` | pydantic-ai + evals for baseline workflows; local `[baseline]` config in `config.toml` (Ollama) for generation smoke tests. |
| `tui` | `uv sync --extra tui` | Textual wizard: `uv run forensics setup`. |
- Project config: `_quarto.yml` (book title, chapters under `notebooks/`, output to `data/reports/` for local runs).
- `forensics report` shells out to `quarto`; install separately if missing.
- `--verify` checks corpus hash material under `data/analysis/` (see `src/forensics/utils/provenance.py`).
- Hosted report: the built Quarto book is embedded under `https://abstract-data.github.io/mediaite-ghostink/report/` (entry `…/report/index.html` if the directory URL does not resolve), alongside the operator documentation site (Astro Starlight under `website/`, deployed by `.github/workflows/deploy-docs.yml`). For a local docs preview, run `make docs-quarto` first so `website/public/report/` exists. This supersedes the prior Cloudflare Pages deploy of `data/reports/`.
- `notebooks/` — Exploratory and chapter notebooks wired into the Quarto book.
- `prompts/` — Versioned agent / phase prompts with `current.md` pointers; see `prompts/README.md` for the release contract.
```shell
uv sync --extra dev
uv run ruff check .
uv run ruff format --check .
uv run pytest tests/ -v
uv run pytest tests/ -v --cov=src --cov-report=term-missing
```

- Default pytest options (markers, coverage on `forensics`) live in `pyproject.toml`. Slow tests are marked `@pytest.mark.slow`; default runs exclude them (`-m 'not slow'`). Run them with `uv run pytest tests/ -m slow` when needed.
- CI: `.github/workflows/ci.yml` runs Ruff lint/format and pytest with coverage JSON for PR comments.
- Pre-commit: `uv run pre-commit install` using `.pre-commit-config.yaml`.
Testing policy and coverage gates are documented in docs/TESTING.md.
The canonical operator markdown lives under docs/ and is also
published — alongside the auto-generated CLI reference, the Python API
reference, the ADRs, and the embedded Quarto report — at
https://abstract-data.github.io/mediaite-ghostink/.
The site is built from website/ and deployed by
.github/workflows/deploy-docs.yml.
Run make docs-dev for a local preview.
| Document | Contents |
|---|---|
| `docs/ARCHITECTURE.md` | Runtime flow, modules, storage, feature and analysis methods. |
| `docs/TESTING.md` | Test layout, commands, coverage rules. |
| `docs/RUNBOOK.md` | Operational runbook (survey, calibration, export, baseline, preflight, docs site). |
| `docs/DEPLOYMENTS.md` | Deployment notes. |
| `docs/GUARDRAILS.md` | Recurring failure patterns and mitigations. |
| `docs/adr/` | Architecture decision records. |
| `website/` | Astro Starlight documentation site (Bun + `@abstractdata/starlight-theme`). |
| `CONTRIBUTING.md` | Pull requests, checks, and handoff expectations. |
| `SECURITY.md` | How to report security issues responsibly. |
| `LICENSE` | MIT license (Abstract Data LLC). |
- `AGENTS.md` — Boundaries, commands, embedding pin, data directories, and conventions for automation and humans.
- `CONTRIBUTING.md` — Pull requests, local checks, formatting, and how to update `HANDOFF.md` after substantive work.
- Governance / hooks: `.github/workflows/agents-governance.yml` and `docs/adr/ADR-003-agent-governance-and-hooks.md`.