
mediaite-ghostink

Hybrid AI writing forensics pipeline for Mediaite.com: deterministic scrape → extract → analyze → report stages that combine statistical stylometry (change-points, time series, hypothesis tests) with embedding drift, plus optional token-probability and AI-baseline comparison.

Badges: CI · Docs · Python 3.13 · uv · Ruff · pytest · Hypothesis · Coverage · pre-commit · packaging

Pull requests receive a CI report comment (pytest summary and line coverage vs main) from .github/workflows/ci-report.yml.

At a glance

  • What it is: A deterministic scrape → extract → analyze → report pipeline over a WordPress newsroom corpus, with stylometry, embedding drift, optional token-probability signals, and Quarto outputs.
  • Who it is for: Forensic reviewers, operators reproducing a locked configuration, and engineers extending the pipeline.
  • What it is not: Outputs are statistical and documentary signals, not legal findings or definitive attribution of authorship or tool use. See Responsible use.
  • Contributors: Human workflow and PR expectations are in CONTRIBUTING.md. Automation and agent rules live in AGENTS.md.

Five-minute smoke test

git clone git@github.com:Abstract-Data/mediaite-ghostink.git
cd mediaite-ghostink
make peer-setup
uv run forensics preflight

Then configure real authors in config.toml (see Configuration) before any live scrape. For a full run and report, continue with Typical workflows.


What this project does

The codebase implements two complementary lenses (see docs/ARCHITECTURE.md and docs/adr/ADR-001-hybrid-forensics-methodology.md):

flowchart TB
  subgraph corpus[Corpus]
    WP[WordPress REST + HTML]
    DB[("SQLite articles.db")]
    WP --> DB
  end
  subgraph features[Feature plane]
    F["Lexical + structural + content + productivity"]
    E[384-d sentence embeddings]
    DB --> F
    DB --> E
  end
  subgraph lensA["Pipeline A: stylometry"]
    CP[Change-points + rolling stats + convergence]
    HT[Hypothesis tests + effect sizes + FDR]
    F --> CP --> HT
  end
  subgraph lensB["Pipeline B: embedding drift"]
    DR[Centroid velocity + cosine decay + variance]
    E --> DR
  end
  subgraph out[Outputs]
    ART[Analysis JSON + custody + metadata]
    REP[Quarto HTML / PDF]
    HT --> ART
    DR --> ART
    ART --> REP
  end
  • Pipeline A — Statistical stylometry: Lexical, structural, content, and productivity features over time; change-point methods (PELT, BOCPD, and related tests), rolling statistics, convergence windows, classical tests, effect sizes, and multiple-comparison correction.
  • Pipeline B — Embedding drift: Sentence-transformer embeddings (default 384-dimensional sentence-transformers/all-MiniLM-L6-v2); centroid velocity, similarity decay, intra-period variance, optional UMAP views, optional comparison to synthetic “AI baseline” text.
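
To make the Pipeline B measurements concrete, here is a minimal Python sketch of centroid velocity and similarity decay over a matrix of monthly centroid vectors. The function names and exact definitions are illustrative stand-ins, not the project's implementation (which lives under src/forensics/analysis/).

import numpy as np

def centroid_velocity(monthly_centroids: np.ndarray) -> np.ndarray:
    # Euclidean step between consecutive monthly centroids (rows).
    return np.linalg.norm(np.diff(monthly_centroids, axis=0), axis=1)

def similarity_decay(monthly_centroids: np.ndarray) -> np.ndarray:
    # Cosine similarity of each month's centroid to the first month's.
    unit = monthly_centroids / np.linalg.norm(monthly_centroids, axis=1, keepdims=True)
    return unit @ unit[0]

A rising velocity series or a steady fall in similarity to the earliest period is the kind of drift signal this lens is meant to surface.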

Optional tracks (extras and config):

  • Phase 9 — Token-level probability: Reference language model (default Hugging Face GPT-2) for perplexity-style signals; optional Binoculars-style contrast using Falcon-7B base vs instruct checkpoints (uv sync --extra probability).
  • Phase 10 — AI baseline generation: Local Ollama models (configurable) generate synthetic articles for controlled comparison (uv sync --extra baseline).

Outputs include SQLite + Parquet + DuckDB-friendly artifacts, JSONL exports, analysis JSON under data/analysis/, optional probability and baseline trees under data/probability/ and data/ai_baseline/, and Quarto-driven reports under data/reports/.

flowchart LR
  S[Scrape] --> X[Extract]
  X --> N[Analyze]
  N --> P[Report]
  S -. optional .-> P9[Phase 9 probability]
  X -. optional .-> P9
  N -. optional .-> P10[Phase 10 AI baseline]

Models, measurements, and algorithms

The defaults below come from config.toml and src/forensics/config/settings.py; override them with FORENSICS_ environment variables. Nested analysis knobs use one __ segment per model level (for example FORENSICS_ANALYSIS__HYPOTHESIS__SIGNIFICANCE_THRESHOLD or FORENSICS_ANALYSIS__CONVERGENCE__CONVERGENCE_USE_PERMUTATION). See ADR-016 in docs/adr/016-analysis-config-nesting.md.
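
To illustrate the nesting rule, here is a self-contained pydantic-settings sketch with simplified field names; the real models in src/forensics/config/settings.py are richer, so treat this as a toy under stated assumptions.

import os
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class HypothesisConfig(BaseModel):
    significance_threshold: float = 0.05

class AnalysisConfig(BaseModel):
    hypothesis: HypothesisConfig = HypothesisConfig()

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="FORENSICS_", env_nested_delimiter="__")
    analysis: AnalysisConfig = AnalysisConfig()

# One __ segment per model level: Settings -> analysis -> hypothesis -> field.
os.environ["FORENSICS_ANALYSIS__HYPOTHESIS__SIGNIFICANCE_THRESHOLD"] = "0.01"
print(Settings().analysis.hypothesis.significance_threshold)  # 0.01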

NLP and embeddings

  • spaCy (default en_core_web_md; TOML: spacy_model): Tokenization, linguistic features, and preflight validation. The default pipeline ships as a wheel in pyproject.toml; uv sync installs it (use spacy download only if you change spacy_model).
  • Sentence Transformers (default sentence-transformers/all-MiniLM-L6-v2; [analysis] embedding_model): Dense 384-d article embeddings for drift, similarity decay, and monthly centroids. embedding_model_version is recorded for provenance.
  • scikit-learn (LDA, TF–IDF, etc.): Topic diversity, self-similarity, and related content features (see src/forensics/features/).
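
A minimal sketch of the embedding step with the public sentence-transformers API (batching, caching, and persistence in the real pipeline differ):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = ["First article body ...", "Second article body ..."]
embeddings = model.encode(texts, normalize_embeddings=True)  # shape (n, 384)

centroid = embeddings.mean(axis=0)
centroid /= np.linalg.norm(centroid)  # unit-length period centroid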

Stylometry and readability (per article)

Implemented feature families (see Feature families in docs/ARCHITECTURE.md):

  1. Lexical — type–token ratio (TTR), MATTR, hapax rates, Yule’s K, Simpson’s D, stylometric “AI marker” and function-word style signals.
  2. Structural — sentence length statistics, passive voice ratio, punctuation profile, parse-depth style measures via spaCy.
  3. Content — n-gram entropy, rolling LDA topic diversity, formulaic / hedging-style scores.
  4. Productivity — inter-article gaps, rolling counts, burst-style signals over the timeline.

textstat contributes readability-style scalar signals where used in the feature pipeline.
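
For flavor, a minimal sketch of three lexical signals from the list above, using textbook definitions over raw token counts (the project's feature code may tokenize and normalize differently):

from collections import Counter

def lexical_profile(tokens: list[str]) -> dict[str, float]:
    counts = Counter(t.lower() for t in tokens)
    n = sum(counts.values())   # token count N
    v = len(counts)            # vocabulary size V
    hapaxes = sum(1 for c in counts.values() if c == 1)
    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2 over frequency classes V_i.
    freq_classes = Counter(counts.values())
    yule_k = 1e4 * (sum(i * i * vi for i, vi in freq_classes.items()) - n) / (n * n)
    return {"ttr": v / n, "hapax_rate": hapaxes / n, "yule_k": yule_k}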

Change-point and time-series analysis

  • PELT: ruptures with RBF cost; penalty from AnalysisConfig.pelt_penalty (default 3.0).
  • BOCPD: Bayesian online change-point detection (custom scipy-based implementation); hazard and threshold from settings.
  • Chow, CUSUM, Kleinberg bursts: Implemented under src/forensics/analysis/ (see the architecture doc).
  • Convergence windows: Windows where a minimum fraction of features move together (convergence_window_days, convergence_min_feature_ratio); an optional permutation null (convergence_use_permutation) logs empirical p-values without changing detected windows.
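
A minimal PELT sketch using the ruptures API with the RBF cost and default penalty named above, over a synthetic signal with one injected mean shift (the pipeline runs this over per-feature time series):

import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
signal = rng.normal(size=(300, 1))
signal[150:] += 1.5                        # injected mean shift at index 150

algo = rpt.Pelt(model="rbf").fit(signal)
breakpoints = algo.predict(pen=3.0)        # pelt_penalty default
print(breakpoints)                         # e.g. [150, 300]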

Statistical inference

  • Tests: Welch’s t, Mann–Whitney U, Kolmogorov–Smirnov (as implemented in analysis modules).
  • Effect sizes: Cohen’s d where applicable.
  • Intervals: Bootstrap resamples (bootstrap_iterations, default 1000).
  • Multiple comparisons: Benjamini–Hochberg or Bonferroni (multiple_comparison_method).
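
A minimal sketch of this recipe with SciPy: Welch's t, a hand-rolled Cohen's d, and Benjamini–Hochberg adjustment via scipy.stats.false_discovery_control (SciPy 1.11+); the project's exact wiring lives in the analysis modules.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = rng.normal(0.0, 1.0, 200)
after = rng.normal(0.3, 1.0, 200)

t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)  # Welch's t

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    nx, ny = len(x), len(y)
    pooled = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled

p_adjusted = stats.false_discovery_control([p_value, 0.04, 0.20])  # BH-adjusted p-values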

Optional: token probability (extra probability)

  • reference_model (default gpt2): Causal LM for perplexity-style features.
  • reference_model_revision (pinned revision id): Reproducible HF snapshot.
  • binoculars_model_base / binoculars_model_instruct (Falcon-7B pair): Optional contrastive signal (binoculars_enabled defaults to false).
  • max_sequence_length, sliding_window_stride, batch_size, device (see config.toml): Windowing and compute.
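
A minimal perplexity sketch with the default gpt2 reference model through Hugging Face transformers, scoring a whole text at once (the pipeline's sliding-window and batching behavior follows the settings above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("Sample article text to score.", return_tensors="pt")
with torch.no_grad():
    out = model(inputs.input_ids, labels=inputs.input_ids)
perplexity = torch.exp(out.loss).item()  # exp of mean token negative log-likelihood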

Optional: AI baseline text (extra baseline)

Local Ollama HTTP API (baseline.ollama_base_url); model tags from baseline.models and temperatures from baseline.temperatures. Generated artifacts and manifests live under data/ai_baseline/ (see docs/RUNBOOK.md).
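
A minimal generation sketch against the Ollama HTTP API; the base URL, model tag, and prompt here are illustrative placeholders, with real values coming from [baseline] in config.toml:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",   # baseline.ollama_base_url
    json={
        "model": "llama3.1",                 # illustrative tag; use baseline.models
        "prompt": "Write a short news brief about a city council vote.",
        "stream": False,
        "options": {"temperature": 0.7},     # one of baseline.temperatures
    },
    timeout=300,
)
synthetic_text = response.json()["response"]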

Survey mode (newsroom-wide)

[survey] thresholds (min_articles, min_span_days, min_words_per_article, yearly density, recent activity) gate which authors qualify for blind survey runs (forensics survey). See docs/RUNBOOK.md.
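
A toy sketch of the kind of gate these thresholds imply; field names are abbreviated and illustrative, and the real eligibility logic ships with forensics survey:

def survey_eligible(n_articles: int, span_days: int, words_per_article: int, cfg: dict) -> bool:
    # Every [survey] threshold must pass before an author enters a blind run.
    return (
        n_articles >= cfg["min_articles"]
        and span_days >= cfg["min_span_days"]
        and words_per_article >= cfg["min_words_per_article"]
    )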


Local machine setup

  1. Install uv and Python 3.13 (see requires-python in pyproject.toml).

  2. Clone and install dependencies

    git clone git@github.com:Abstract-Data/mediaite-ghostink.git
    cd mediaite-ghostink
    make peer-setup

    make peer-setup runs uv sync --extra dev --extra tui, uv run forensics validate, and uv run forensics peer-setup (which prints copy-paste uv sync tiers, spaCy / Quarto notes, and ollama pull lines for the [baseline] models when present). To see those hints alone after you have already synced, run uv run forensics peer-setup directly.

  3. spaCy pipeline — Default en_core_web_md is installed as a direct wheel dependency in pyproject.toml (en-core-web-md @ https://github.com/explosion/...whl); uv sync / make peer-setup brings it in. Use uv run python -m spacy download <name> only if you change spacy_model in config.toml.

  4. Edit config.toml (see also config.toml.example for the same template with setup notes) — Replace template authors (placeholder-target / placeholder-control) with real author rows before any live scrape; the CLI rejects placeholders on discover/metadata/fetch paths.

  5. Optional: Quarto — Required for forensics report and the report step of forensics all. Install Quarto so quarto is on your PATH.

  6. Optional extras (also shown by forensics peer-setup)

    • uv sync --extra probability — Phase 9 token features (forensics extract --probability); pulls torch / transformers (large download).
    • uv sync --extra baseline — Phase 10 Ollama-driven baseline generation (scripts/generate_baseline.py, forensics analyze --ai-baseline, …).
    • uv sync --extra tui — Interactive forensics setup wizard (uv run forensics setup).
  7. Optional: Ollama — For baseline generation, install Ollama and pull the model tags listed in [baseline] models (see docs/RUNBOOK.md); forensics peer-setup prints one ollama pull <tag> per configured tag. Use forensics peer-setup --check-ollama to verify reachability (no auto-pull).

  8. Validate before a long run

    uv run forensics validate
    uv run forensics preflight          # add --strict to fail on warnings
  9. Secrets / environment — Copy .env.example if your deployment uses external secrets or observability; the core pipeline is driven by config.toml and FORENSICS_*. Override the config file path with FORENSICS_CONFIG_FILE.

Default SQLite corpus path is data/articles.db under the project root (see DEFAULT_DB_RELATIVE in settings).


Forensic assurance and chain of custody

This project is structured for auditable, staged research: each stage reads and writes defined artifacts so an independent reviewer can trace what was collected, how it was transformed, and which parameters were active.

Stages and artifacts

  1. Scrape — WordPress REST discovery, metadata, optional bulk or per-article body fetch, simhash near-duplicate control (simhash_threshold; a minimal sketch follows below), persistence to SQLite (content_hash per article, scrape timestamps).
  2. Extract — Deterministic feature vectors to Parquet; embeddings to data/embeddings/; optional probability parquet.
  3. Analyze — JSON results under data/analysis/; analysis run rows in SQLite; corpus custody file written after analysis (see below).
  4. Report — Quarto render from notebooks/ into data/reports/.

Canonical paths are summarized in docs/ARCHITECTURE.md.
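
For stage 1's near-duplicate control, a minimal 64-bit simhash sketch; the hash choice and tokenization are illustrative, and the pipeline compares Hamming distance against simhash_threshold:

import hashlib

def simhash64(tokens: list[str]) -> int:
    weights = [0] * 64
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit, w in enumerate(weights) if w > 0)

def hamming_distance(a: int, b: int) -> int:
    return (a ^ b).bit_count()  # small distance => near-duplicate candidate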

flowchart TB
  subgraph stages[Stages]
    direction TB
    T1[1 Scrape]
    T2[2 Extract]
    T3[3 Analyze]
    T4[4 Report]
    T1 --> T2 --> T3 --> T4
  end
  subgraph artifacts[Primary artifacts]
    direction TB
    M[authors_manifest.jsonl]
    DB[("data/articles.db")]
    FE["data/features/ Parquet tables"]
    EM[data/embeddings/]
    AN["data/analysis/ JSON + corpus_custody.json"]
    RP[data/reports/]
  end
  T1 --> M
  T1 --> DB
  T2 --> FE
  T2 --> EM
  T3 --> AN
  T4 --> RP

Integrity and hashing

  • Per-article content_hash — SHA-256 of normalized article text at ingest (see forensics.utils.hashing.content_hash); stored in SQLite for tamper-evident comparison of body text.
  • corpus_custody.json — Written under data/analysis/ after analysis (write_corpus_custody): records a corpus-level hash derived from ordered per-article content_hash values so later runs can detect corpus drift.
  • compute_config_hash — Deterministic hash of the full resolved ForensicsSettings payload (excluding derived paths) for tying reports to a configuration snapshot (get_run_metadata / run metadata patterns).
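
A simplified sketch of the three integrity layers; the normalization and payload canonicalization here are stand-ins, so see forensics.utils.hashing and get_run_metadata for the real behavior.

import hashlib
import json

def content_hash(text: str) -> str:
    normalized = " ".join(text.split())  # placeholder normalization
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def corpus_hash(ordered_article_hashes: list[str]) -> str:
    digest = hashlib.sha256()
    for article_hash in ordered_article_hashes:
        digest.update(article_hash.encode("ascii"))
    return digest.hexdigest()

def config_hash(settings_payload: dict) -> str:
    canonical = json.dumps(settings_payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
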
flowchart LR
  subgraph ingest[Ingest integrity]
    TXT[Normalized article text]
    H1[SHA-256 content_hash]
    TXT --> H1
    H1 --> ROW[(articles row)]
  end
  subgraph freeze[Post-analysis freeze]
    ROW --> CH[Ordered concat of content_hash]
    CH --> CORP[corpus_custody.json]
    CFG[ForensicsSettings] --> CFH[config hash]
    CFH --> META[run_metadata.json]
    CORP --> META
  end
  subgraph check[Verification]
    LIVE[Recompute live corpus hash]
    CORP --> CMP{Matches?}
    LIVE --> CMP
    CMP -->|yes| OK[Continue report or analyze]
    CMP -->|no| BAD[Exit non-zero]
  end

Verification commands

  • uv run forensics report --verify — Before render, recomputes the live corpus hash and compares it to data/analysis/corpus_custody.json; fails if missing or mismatched.
  • uv run forensics analyze --verify-corpus — Same hash check without rendering a report.

Preregistration (confirmatory runs)

forensics lock-preregistration writes a hashed lock of analysis thresholds to data/preregistration/preregistration_lock.json. analyze always runs verify_preregistration and records status in data/analysis/run_metadata.json (ok / missing / mismatch). This supports pre-registered vs exploratory analysis discipline (see docs/RUNBOOK.md).
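
Conceptually, the lock is a digest over canonicalized thresholds that later runs recompute and compare; a minimal sketch (the real lock file records more metadata than this):

import hashlib
import json

def threshold_digest(thresholds: dict) -> str:
    canonical = json.dumps(thresholds, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

locked = threshold_digest({"significance_threshold": 0.05, "pelt_penalty": 3.0})
current = threshold_digest({"significance_threshold": 0.05, "pelt_penalty": 3.0})
status = "ok" if locked == current else "mismatch"  # recorded in run_metadata.json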

Configuration: [chain_of_custody]

config.toml includes:

[chain_of_custody]
verify_corpus_hash = true
verify_raw_archives = true
log_all_generations = true

These flags document the intended custody posture. verify_corpus_hash is enforced when you pass --verify / --verify-corpus as above (the TOML flags do not yet replace those CLI switches). Raw-archive and generation-log toggles are reserved for stricter operational policies; baseline manifests still record generation metadata under data/ai_baseline/ when you run Phase 10.

Exports and databases

  • data/articles.jsonl — Human-readable corpus export for review.
  • uv run forensics export — Single-file DuckDB bundle over SQLite + optional Parquet + analysis JSON (see runbook).

What reviewers should ask for

  • Frozen config.toml (or FORENSICS_CONFIG_FILE copy) and FORENSICS_* env used for the run.
  • data/analysis/run_metadata.json, corpus_custody.json, per-author *_result.json and related analysis JSON.
  • analysis_runs rows in data/articles.db (stage descriptions and timing where recorded).
  • Quarto HTML/PDF outputs and the notebook sources under notebooks/.
  • For probability / baseline: data/probability/model_card.json, data/ai_baseline/generation_manifest.json, and referenced model revisions.

Responsible use

  • Scraping: Defaults respect robots.txt, use a declared user_agent, and apply rate limits ([scraping]). Adjust only in line with site policy and applicable law.
  • Outcomes: Stylometry and drift metrics are statistical signals, not legal findings. Targets vs controls must be defined before confirmatory interpretation; use preregistration and documented baselines.
  • Synthetic text: AI baseline generation is for controlled comparison, not for passing off as human journalism.

Architecture

  • Entrypoint: forensics console script → src/forensics/cli/__init__.py (Typer). Use uv run forensics --help and uv run forensics <command> --help.
  • Stages: scraper (WordPress discovery + HTTP + dedup), feature extraction (src/forensics/features/), analysis (src/forensics/analysis/), reporting (src/forensics/reporting/). Full orchestration for forensics all lives in src/forensics/pipeline.py.
  • Configuration: src/forensics/config/settings.py loads config.toml at the project root with FORENSICS_ environment overrides (pydantic-settings). Override the TOML path with FORENSICS_CONFIG_FILE.

Storage and model contracts are summarized in docs/ARCHITECTURE.md and ADRs under docs/adr/.

flowchart TB
  API[WordPress REST API]
  API --> SQL[(SQLite write store)]
  SQL --> PQ[Parquet feature store]
  SQL --> J[articles.jsonl export]
  SQL --> DD[DuckDB analytical layer]
  PQ --> DD
  DD --> NB[Notebooks + Quarto]
  J --> NB
  NB --> OUT[HTML / PDF reports]

Requirements

  • Python 3.13 (see requires-python in pyproject.toml).
  • uv for environments and script execution (uv run …).
  • Quarto on your PATH if you run forensics report or the report step of forensics all.
  • spaCy English model for feature work aligned with CI (default en_core_web_md) — installed with the project via the pinned wheel in pyproject.toml when you run uv sync / make peer-setup. Run uv run python -m spacy download <name> only if you point spacy_model at another pipeline.

Installation

git clone git@github.com:Abstract-Data/mediaite-ghostink.git
cd mediaite-ghostink
make peer-setup
# or: uv sync --extra dev --extra tui && uv run forensics validate

Copy .env.example to .env when you need optional secrets or observability hooks. Core pipeline configuration remains config.toml + FORENSICS_*.


Configuration

  1. Authors and scraping: Edit config.toml. Replace template rows whose slugs are placeholder-target / placeholder-control with real authors before any live scrape; the CLI rejects those placeholders for discover/metadata/fetch paths.
  2. Nested settings: Tables such as [scraping], [analysis], [survey], [probability], [baseline], [report], and [chain_of_custody] tune rate limits, analysis thresholds, survey eligibility, optional Phase 9/10 behavior, and report output.
  3. Environment: Nested keys via FORENSICS_* are described in src/forensics/config/settings.py and .env.example.

CLI

Global options (Typer app root):

uv run forensics --version
uv run forensics -v scrape --help   # example: DEBUG logs for scrape

  • scrape: WordPress author discovery, article metadata, HTML fetch, simhash dedup, optional raw archive. Combine flags as documented in uv run forensics scrape --help (e.g. --discover, --metadata, --fetch, --dedup, --archive, --dry-run with --fetch, --force-refresh with discover).
  • extract: Feature extraction + embeddings from data/articles.db. Options include --author, --skip-embeddings, and --probability (requires --extra probability).
  • analyze: Modes via flags: --changepoint, --timeseries, --drift, --convergence, --compare, --ai-baseline, --verify-corpus (corpus hash check), optional --author. With no analysis flags, the default runs time-series plus the full convergence-oriented analysis path; add flags to narrow or extend. See uv run forensics analyze --help.
  • report: Quarto render (--notebook, --format html or pdf).
  • all: End-to-end: full scrape (dispatch_scrape with all stage flags false, the same path as bare forensics scrape) → extract_all_features → run_analyze(AnalyzeRequest(timeseries=True, convergence=True)) (no --changepoint / --drift unless you change pipeline.py) → run_report. See docs/ARCHITECTURE.md.
  • validate, preflight, survey, calibrate, export, lock-preregistration, setup: Operational and quality workflows; see docs/RUNBOOK.md.

Typical workflows

Full pipeline (after configuring real authors and installing Quarto):

uv run forensics all

Incremental scrape (example):

uv run forensics scrape --discover
uv run forensics scrape --metadata
uv run forensics scrape --fetch --dry-run   # count only
uv run forensics scrape --fetch

Features then analysis:

uv run forensics extract
uv run forensics analyze --changepoint --timeseries
uv run forensics analyze --drift
uv run forensics report --format html

Repository layout

  • src/forensics/: Application package (CLI, scraper, features, analysis, storage, config, models).
  • tests/: Pytest suite (unit/, integration/, evals/, fixtures, Hypothesis tests).
  • docs/: Architecture, testing policy, runbook, ADRs, deployment notes.
  • _quarto.yml, index.qmd: Quarto book project config and landing chapter (output under data/reports/).
  • notebooks/: Jupyter chapters consumed by Quarto.
  • prompts/: Versioned prompts for agents and pipeline phases.
  • scripts/: Maintenance and one-off utilities.
  • evals/: Eval scenarios referenced from tooling or docs.

Data layout

  • data/articles.db: Primary SQLite store (articles, authors, run metadata).
  • data/authors_manifest.jsonl: Discovered author manifest from scrape.
  • data/raw/: Raw HTML / year archives (see scrape --archive).
  • data/features/: Per-author feature tables (Parquet).
  • data/embeddings/: Embedding batches used by drift and reports.
  • data/analysis/: Per-author JSON results, run metadata, corpus_custody.json.
  • data/articles.jsonl: JSONL export for auditing.
  • data/reports/: Quarto book output (see _quarto.yml project.output-dir).
  • data/probability/: Phase 9 outputs when enabled.
  • data/ai_baseline/: Phase 10 synthetic baseline artifacts.
  • data/survey/, data/calibration/: Survey and calibration run outputs (see runbook).

Exact filenames evolve with the pipeline; treat docs/ARCHITECTURE.md as the conceptual map.


Optional dependency extras

Defined in pyproject.toml:

  • dev (uv sync --extra dev): pytest, pytest-cov, Hypothesis, Ruff, pre-commit.
  • probability (uv sync --extra probability): Phase 9 token-level features (torch, transformers); forensics extract --probability.
  • baseline (uv sync --extra baseline): pydantic-ai + evals for baseline workflows; local [baseline] config in config.toml (Ollama) for generation smoke tests.
  • tui (uv sync --extra tui): Textual setup wizard (uv run forensics setup).

Reports (Quarto)


Notebooks and prompts

  • notebooks/ — Exploratory and chapter notebooks wired into the Quarto book.
  • prompts/ — Versioned agent / phase prompts with current.md pointers; see prompts/README.md for the release contract.

Development

uv sync --extra dev
uv run ruff check .
uv run ruff format --check .
uv run pytest tests/ -v
uv run pytest tests/ -v --cov=src --cov-report=term-missing

Testing policy and coverage gates are documented in docs/TESTING.md.


Documentation

The canonical operator markdown lives under docs/ and is also published — alongside the auto-generated CLI reference, the Python API reference, the ADRs, and the embedded Quarto report — at https://abstract-data.github.io/mediaite-ghostink/. The site is built from website/ and deployed by .github/workflows/deploy-docs.yml. Run make docs-dev for a local preview.

  • docs/ARCHITECTURE.md: Runtime flow, modules, storage, feature and analysis methods.
  • docs/TESTING.md: Test layout, commands, coverage rules.
  • docs/RUNBOOK.md: Operational runbook (survey, calibration, export, baseline, preflight, docs site).
  • docs/DEPLOYMENTS.md: Deployment notes.
  • docs/GUARDRAILS.md: Recurring failure patterns and mitigations.
  • docs/adr/: Architecture decision records.
  • website/: Astro Starlight documentation site (Bun + @abstractdata/starlight-theme).
  • CONTRIBUTING.md: Pull requests, checks, and handoff expectations.
  • SECURITY.md: How to report security issues responsibly.
  • LICENSE: MIT license (Abstract Data LLC).

Agent and contributor notes