
mediaite-ghostink

Hybrid AI writing forensics pipeline for Mediaite.com: deterministic scrape → extract → analyze → report stages that combine statistical stylometry (change-points, time series, hypothesis tests) with embedding drift, plus optional token-probability and AI-baseline comparison.

Badges: CI · Docs · Python 3.13 · uv · Ruff · pytest · Hypothesis · Coverage · pre-commit · packaging

Pull requests receive a CI report comment (pytest summary and line coverage vs main) from .github/workflows/ci-report.yml.

At a glance

  • What it is: A deterministic scrape → extract → analyze → report pipeline over a WordPress newsroom corpus, with stylometry, embedding drift, optional token-probability signals, and Quarto outputs.
  • Who it is for: Forensic reviewers, operators reproducing a locked configuration, and engineers extending the pipeline.
  • What it is not: Outputs are statistical and documentary signals, not legal findings or definitive attribution of authorship or tool use. See Responsible use.
  • Contributors: Human workflow and PR expectations are in CONTRIBUTING.md. Automation and agent rules live in AGENTS.md.

Five-minute smoke test

git clone git@github.com:Abstract-Data/mediaite-ghostink.git
cd mediaite-ghostink
make peer-setup
uv run forensics preflight

Then configure real authors in config.toml (see Configuration) before any live scrape. For a full run and report, continue with Typical workflows.


What this project does

The codebase implements two complementary lenses (see docs/ARCHITECTURE.md and docs/adr/ADR-001-hybrid-forensics-methodology.md):

flowchart TB
  subgraph corpus[Corpus]
    WP[WordPress REST + HTML]
    DB[("SQLite articles.db")]
    WP --> DB
  end
  subgraph features[Feature plane]
    F["Lexical + structural + content + productivity"]
    E[384-d sentence embeddings]
    DB --> F
    DB --> E
  end
  subgraph lensA["Pipeline A: stylometry"]
    CP[Change-points + rolling stats + convergence]
    HT[Hypothesis tests + effect sizes + FDR]
    F --> CP --> HT
  end
  subgraph lensB["Pipeline B: embedding drift"]
    DR[Centroid velocity + cosine decay + variance]
    E --> DR
  end
  subgraph out[Outputs]
    ART[Analysis JSON + custody + metadata]
    REP[Quarto HTML / PDF]
    HT --> ART
    DR --> ART
    ART --> REP
  end
  • Pipeline A — Statistical stylometry: Lexical, structural, content, and productivity features over time; change-point methods (PELT, BOCPD, and related tests), rolling statistics, convergence windows, classical tests, effect sizes, and multiple-comparison correction.
  • Pipeline B — Embedding drift: Sentence-transformer embeddings (default 384-dimensional sentence-transformers/all-MiniLM-L6-v2); centroid velocity, similarity decay, intra-period variance, optional UMAP views, optional comparison to synthetic “AI baseline” text.
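
To make the Pipeline B measurements concrete, here is a minimal Python sketch of centroid velocity and similarity decay over a matrix of monthly centroid vectors. The function names and exact definitions are illustrative stand-ins, not the project's implementation (which lives under src/forensics/analysis/).

import numpy as np

def centroid_velocity(monthly_centroids: np.ndarray) -> np.ndarray:
    # Euclidean step between consecutive monthly centroids (rows).
    return np.linalg.norm(np.diff(monthly_centroids, axis=0), axis=1)

def similarity_decay(monthly_centroids: np.ndarray) -> np.ndarray:
    # Cosine similarity of each month's centroid to the first month's.
    unit = monthly_centroids / np.linalg.norm(monthly_centroids, axis=1, keepdims=True)
    return unit @ unit[0]

A rising velocity series or a steady fall in similarity to the earliest period is the kind of drift signal this lens is meant to surface.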

Optional tracks (extras and config):

  • Phase 9 — Token-level probability: Reference language model (default Hugging Face GPT-2) for perplexity-style signals; optional Binoculars-style contrast using Falcon-7B base vs instruct checkpoints (uv sync --extra probability).
  • Phase 10 — AI baseline generation: Local Ollama models (configurable) generate synthetic articles for controlled comparison (uv sync --extra baseline).

Outputs include SQLite + Parquet + DuckDB-friendly artifacts, JSONL exports, analysis JSON under data/analysis/, optional probability and baseline trees under data/probability/ and data/ai_baseline/, and Quarto-driven reports under data/reports/.

flowchart LR
  S[Scrape] --> X[Extract]
  X --> N[Analyze]
  N --> P[Report]
  S -. optional .-> P9[Phase 9 probability]
  X -. optional .-> P9
  N -. optional .-> P10[Phase 10 AI baseline]

Models, measurements, and algorithms

The defaults below come from config.toml and src/forensics/config/settings.py; override them with FORENSICS_ environment variables. Nested analysis knobs use one __ segment per model level (for example FORENSICS_ANALYSIS__HYPOTHESIS__SIGNIFICANCE_THRESHOLD or FORENSICS_ANALYSIS__CONVERGENCE__CONVERGENCE_USE_PERMUTATION). See ADR-016 in docs/adr/016-analysis-config-nesting.md.
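
To illustrate the nesting rule, here is a self-contained pydantic-settings sketch with simplified field names; the real models in src/forensics/config/settings.py are richer, so treat this as a toy under stated assumptions.

import os
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class HypothesisConfig(BaseModel):
    significance_threshold: float = 0.05

class AnalysisConfig(BaseModel):
    hypothesis: HypothesisConfig = HypothesisConfig()

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="FORENSICS_", env_nested_delimiter="__")
    analysis: AnalysisConfig = AnalysisConfig()

# One __ segment per model level: Settings -> analysis -> hypothesis -> field.
os.environ["FORENSICS_ANALYSIS__HYPOTHESIS__SIGNIFICANCE_THRESHOLD"] = "0.01"
print(Settings().analysis.hypothesis.significance_threshold)  # 0.01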

NLP and embeddings

  • spaCy (default en_core_web_md; TOML: spacy_model): Tokenization, linguistic features, and preflight validation. The default pipeline ships as a wheel in pyproject.toml; uv sync installs it (use spacy download only if you change spacy_model).
  • Sentence Transformers (default sentence-transformers/all-MiniLM-L6-v2; [analysis] embedding_model): Dense 384-d article embeddings for drift, similarity decay, and monthly centroids. embedding_model_version is recorded for provenance.
  • scikit-learn (LDA, TF–IDF, etc.): Topic diversity, self-similarity, and related content features (see src/forensics/features/).
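
A minimal sketch of the embedding step with the public sentence-transformers API (batching, caching, and persistence in the real pipeline differ):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = ["First article body ...", "Second article body ..."]
embeddings = model.encode(texts, normalize_embeddings=True)  # shape (n, 384)

centroid = embeddings.mean(axis=0)
centroid /= np.linalg.norm(centroid)  # unit-length period centroid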

Stylometry and readability (per article)

Implemented feature families (see Feature families in docs/ARCHITECTURE.md):

  1. Lexical — type–token ratio (TTR), MATTR, hapax rates, Yule’s K, Simpson’s D, stylometric “AI marker” and function-word style signals.
  2. Structural — sentence length statistics, passive voice ratio, punctuation profile, parse-depth style measures via spaCy.
  3. Content — n-gram entropy, rolling LDA topic diversity, formulaic / hedging-style scores.
  4. Productivity — inter-article gaps, rolling counts, burst-style signals over the timeline.

textstat contributes readability-style scalar signals where used in the feature pipeline.
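
For flavor, a minimal sketch of three lexical signals from the list above, using textbook definitions over raw token counts (the project's feature code may tokenize and normalize differently):

from collections import Counter

def lexical_profile(tokens: list[str]) -> dict[str, float]:
    counts = Counter(t.lower() for t in tokens)
    n = sum(counts.values())   # token count N
    v = len(counts)            # vocabulary size V
    hapaxes = sum(1 for c in counts.values() if c == 1)
    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2 over frequency classes V_i.
    freq_classes = Counter(counts.values())
    yule_k = 1e4 * (sum(i * i * vi for i, vi in freq_classes.items()) - n) / (n * n)
    return {"ttr": v / n, "hapax_rate": hapaxes / n, "yule_k": yule_k}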

Change-point and time-series analysis

  • PELT: ruptures with RBF cost; penalty from AnalysisConfig.pelt_penalty (default 3.0).
  • BOCPD: Bayesian online change-point detection (custom scipy-based implementation); hazard and threshold from settings.
  • Chow, CUSUM, Kleinberg bursts: Implemented under src/forensics/analysis/ (see the architecture doc).
  • Convergence windows: Windows where a minimum fraction of features move together (convergence_window_days, convergence_min_feature_ratio); an optional permutation null (convergence_use_permutation) logs empirical p-values without changing detected windows.
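
A minimal PELT sketch using the ruptures API with the RBF cost and default penalty named above, over a synthetic signal with one injected mean shift (the pipeline runs this over per-feature time series):

import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
signal = rng.normal(size=(300, 1))
signal[150:] += 1.5                        # injected mean shift at index 150

algo = rpt.Pelt(model="rbf").fit(signal)
breakpoints = algo.predict(pen=3.0)        # pelt_penalty default
print(breakpoints)                         # e.g. [150, 300]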

Statistical inference

  • Tests: Welch’s t, Mann–Whitney U, Kolmogorov–Smirnov (as implemented in analysis modules).
  • Effect sizes: Cohen’s d where applicable.
  • Intervals: Bootstrap resamples (bootstrap_iterations, default 1000).
  • Multiple comparisons: Benjamini–Hochberg or Bonferroni (multiple_comparison_method).
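
A minimal sketch of this recipe with SciPy: Welch's t, a hand-rolled Cohen's d, and Benjamini–Hochberg adjustment via scipy.stats.false_discovery_control (SciPy 1.11+); the project's exact wiring lives in the analysis modules.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = rng.normal(0.0, 1.0, 200)
after = rng.normal(0.3, 1.0, 200)

t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)  # Welch's t

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    nx, ny = len(x), len(y)
    pooled = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled

p_adjusted = stats.false_discovery_control([p_value, 0.04, 0.20])  # BH-adjusted p-values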

Optional: token probability (extra probability)

  • reference_model (default gpt2): Causal LM for perplexity-style features.
  • reference_model_revision (pinned revision id): Reproducible HF snapshot.
  • binoculars_model_base / binoculars_model_instruct (Falcon-7B pair): Optional contrastive signal (binoculars_enabled defaults to false).
  • max_sequence_length, sliding_window_stride, batch_size, device (see config.toml): Windowing and compute.
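
A minimal perplexity sketch with the default gpt2 reference model through Hugging Face transformers, scoring a whole text at once (the pipeline's sliding-window and batching behavior follows the settings above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("Sample article text to score.", return_tensors="pt")
with torch.no_grad():
    out = model(inputs.input_ids, labels=inputs.input_ids)
perplexity = torch.exp(out.loss).item()  # exp of mean token negative log-likelihood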

Optional: AI baseline text (extra baseline)

Local Ollama HTTP API (baseline.ollama_base_url); model tags from baseline.models and temperatures from baseline.temperatures. Generated artifacts and manifests live under data/ai_baseline/ (see docs/RUNBOOK.md).
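
A minimal generation sketch against the Ollama HTTP API; the base URL, model tag, and prompt here are illustrative placeholders, with real values coming from [baseline] in config.toml:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",   # baseline.ollama_base_url
    json={
        "model": "llama3.1",                 # illustrative tag; use baseline.models
        "prompt": "Write a short news brief about a city council vote.",
        "stream": False,
        "options": {"temperature": 0.7},     # one of baseline.temperatures
    },
    timeout=300,
)
synthetic_text = response.json()["response"]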

Survey mode (newsroom-wide)

[survey] thresholds (min_articles, min_span_days, min_words_per_article, yearly density, recent activity) gate which authors qualify for blind survey runs (forensics survey). See docs/RUNBOOK.md.
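
A toy sketch of the kind of gate these thresholds imply; field names are abbreviated and illustrative, and the real eligibility logic ships with forensics survey:

def survey_eligible(n_articles: int, span_days: int, words_per_article: int, cfg: dict) -> bool:
    # Every [survey] threshold must pass before an author enters a blind run.
    return (
        n_articles >= cfg["min_articles"]
        and span_days >= cfg["min_span_days"]
        and words_per_article >= cfg["min_words_per_article"]
    )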


Local machine setup

  1. Install uv and Python 3.13 (see requires-python in pyproject.toml).

  2. Clone and install dependencies

    git clone git@github.com:Abstract-Data/mediaite-ghostink.git
    cd mediaite-ghostink
    make peer-setup

    make peer-setup runs uv sync --extra dev --extra tui, uv run forensics validate, and uv run forensics peer-setup (which prints copy-paste uv sync tiers, spaCy / Quarto notes, and ollama pull lines for the [baseline] models when present). To see those hints alone after you have already synced, run uv run forensics peer-setup directly.

  3. spaCy pipeline — Default en_core_web_md is installed as a direct wheel dependency in pyproject.toml (en-core-web-md @ https://github.com/explosion/...whl); uv sync / make peer-setup brings it in. Use uv run python -m spacy download <name> only if you change spacy_model in config.toml.

  4. Edit config.toml (see also config.toml.example for the same template with setup notes) — Replace template authors (placeholder-target / placeholder-control) with real author rows before any live scrape; the CLI rejects placeholders on discover/metadata/fetch paths.

  5. Optional: Quarto — Required for forensics report and the report step of forensics all. Install Quarto so quarto is on your PATH.

  6. Optional extras (also shown by forensics peer-setup)

    • uv sync --extra probability — Phase 9 token features (forensics extract --probability); pulls torch / transformers (large download).
    • uv sync --extra baseline — Phase 10 Ollama-driven baseline generation (scripts/generate_baseline.py, forensics analyze --ai-baseline, …).
    • uv sync --extra tui — Interactive forensics setup wizard (uv run forensics setup).
  7. Optional: Ollama — For baseline generation, install Ollama and pull the model tags listed in [baseline] models (see docs/RUNBOOK.md); forensics peer-setup prints one ollama pull <tag> per configured tag. Use forensics peer-setup --check-ollama to verify reachability (no auto-pull).

  8. Validate before a long run

    uv run forensics validate
    uv run forensics preflight          # add --strict to fail on warnings
  9. Secrets / environment — Copy .env.example if your deployment uses external secrets or observability; the core pipeline is driven by config.toml and FORENSICS_*. Override the config file path with FORENSICS_CONFIG_FILE.

Default SQLite corpus path is data/articles.db under the project root (see DEFAULT_DB_RELATIVE in settings).


Forensic assurance and chain of custody

This project is structured for auditable, staged research: each stage reads and writes defined artifacts so an independent reviewer can trace what was collected, how it was transformed, and which parameters were active.

Stages and artifacts

  1. Scrape — WordPress REST discovery, metadata, optional bulk or per-article body fetch, simhash near-duplicate control (simhash_threshold; a minimal sketch follows below), persistence to SQLite (content_hash per article, scrape timestamps).
  2. Extract — Deterministic feature vectors to Parquet; embeddings to data/embeddings/; optional probability parquet.
  3. Analyze — JSON results under data/analysis/; analysis run rows in SQLite; corpus custody file written after analysis (see below).
  4. Report — Quarto render from notebooks/ into data/reports/.

Canonical paths are summarized in docs/ARCHITECTURE.md.
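
For stage 1's near-duplicate control, a minimal 64-bit simhash sketch; the hash choice and tokenization are illustrative, and the pipeline compares Hamming distance against simhash_threshold:

import hashlib

def simhash64(tokens: list[str]) -> int:
    weights = [0] * 64
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit, w in enumerate(weights) if w > 0)

def hamming_distance(a: int, b: int) -> int:
    return (a ^ b).bit_count()  # small distance => near-duplicate candidate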

flowchart TB
  subgraph stages[Stages]
    direction TB
    T1[1 Scrape]
    T2[2 Extract]
    T3[3 Analyze]
    T4[4 Report]
    T1 --> T2 --> T3 --> T4
  end
  subgraph artifacts[Primary artifacts]
    direction TB
    M[authors_manifest.jsonl]
    DB[("data/articles.db")]
    FE["data/features/ Parquet tables"]
    EM[data/embeddings/]
    AN["data/analysis/ JSON + corpus_custody.json"]
    RP[data/reports/]
  end
  T1 --> M
  T1 --> DB
  T2 --> FE
  T2 --> EM
  T3 --> AN
  T4 --> RP

Integrity and hashing

  • Per-article content_hash — SHA-256 of normalized article text at ingest (see forensics.utils.hashing.content_hash); stored in SQLite for tamper-evident comparison of body text.
  • corpus_custody.json — Written under data/analysis/ after analysis (write_corpus_custody): records a corpus-level hash derived from ordered per-article content_hash values so later runs can detect corpus drift.
  • compute_config_hash — Deterministic hash of the full resolved ForensicsSettings payload (excluding derived paths) for tying reports to a configuration snapshot (get_run_metadata / run metadata patterns).
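
A simplified sketch of the three integrity layers; the normalization and payload canonicalization here are stand-ins, so see forensics.utils.hashing and get_run_metadata for the real behavior.

import hashlib
import json

def content_hash(text: str) -> str:
    normalized = " ".join(text.split())  # placeholder normalization
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def corpus_hash(ordered_article_hashes: list[str]) -> str:
    digest = hashlib.sha256()
    for article_hash in ordered_article_hashes:
        digest.update(article_hash.encode("ascii"))
    return digest.hexdigest()

def config_hash(settings_payload: dict) -> str:
    canonical = json.dumps(settings_payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
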
flowchart LR
  subgraph ingest[Ingest integrity]
    TXT[Normalized article text]
    H1[SHA-256 content_hash]
    TXT --> H1
    H1 --> ROW[(articles row)]
  end
  subgraph freeze[Post-analysis freeze]
    ROW --> CH[Ordered concat of content_hash]
    CH --> CORP[corpus_custody.json]
    CFG[ForensicsSettings] --> CFH[config hash]
    CFH --> META[run_metadata.json]
    CORP --> META
  end
  subgraph check[Verification]
    LIVE[Recompute live corpus hash]
    CORP --> CMP{Matches?}
    LIVE --> CMP
    CMP -->|yes| OK[Continue report or analyze]
    CMP -->|no| BAD[Exit non-zero]
  end

Verification commands

  • uv run forensics report --verify — Before render, recomputes the live corpus hash and compares it to data/analysis/corpus_custody.json; fails if missing or mismatched.
  • uv run forensics analyze --verify-corpus — Same hash check without rendering a report.

Preregistration (confirmatory runs)

forensics lock-preregistration writes a hashed lock of analysis thresholds to data/preregistration/preregistration_lock.json. analyze always runs verify_preregistration and records status in data/analysis/run_metadata.json (ok / missing / mismatch). This supports pre-registered vs exploratory analysis discipline (see docs/RUNBOOK.md).
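
Conceptually, the lock is a digest over canonicalized thresholds that later runs recompute and compare; a minimal sketch (the real lock file records more metadata than this):

import hashlib
import json

def threshold_digest(thresholds: dict) -> str:
    canonical = json.dumps(thresholds, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

locked = threshold_digest({"significance_threshold": 0.05, "pelt_penalty": 3.0})
current = threshold_digest({"significance_threshold": 0.05, "pelt_penalty": 3.0})
status = "ok" if locked == current else "mismatch"  # recorded in run_metadata.json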

Configuration: [chain_of_custody]

config.toml includes:

[chain_of_custody]
verify_corpus_hash = true
verify_raw_archives = true
log_all_generations = true

These flags document the intended custody posture. verify_corpus_hash is enforced when you pass --verify / --verify-corpus as above (the TOML flags do not yet replace those CLI switches). Raw-archive and generation-log toggles are reserved for stricter operational policies; baseline manifests still record generation metadata under data/ai_baseline/ when you run Phase 10.

Exports and databases

  • data/articles.jsonl — Human-readable corpus export for review.
  • uv run forensics export — Single-file DuckDB bundle over SQLite + optional Parquet + analysis JSON (see runbook).

What reviewers should ask for

  • Frozen config.toml (or FORENSICS_CONFIG_FILE copy) and FORENSICS_* env used for the run.
  • data/analysis/run_metadata.json, corpus_custody.json, per-author *_result.json and related analysis JSON.
  • analysis_runs rows in data/articles.db (stage descriptions and timing where recorded).
  • Quarto HTML/PDF outputs and the notebook sources under notebooks/.
  • For probability / baseline: data/probability/model_card.json, data/ai_baseline/generation_manifest.json, and referenced model revisions.

Responsible use

  • Scraping: Defaults respect robots.txt, use a declared user_agent, and apply rate limits ([scraping]). Adjust only in line with site policy and applicable law.
  • Outcomes: Stylometry and drift metrics are statistical signals, not legal findings. Targets vs controls must be defined before confirmatory interpretation; use preregistration and documented baselines.
  • Synthetic text: AI baseline generation is for controlled comparison, not for passing off as human journalism.

Architecture

  • Entrypoint: forensics console script → src/forensics/cli/__init__.py (Typer). Use uv run forensics --help and uv run forensics <command> --help.
  • Stages: scraper (WordPress discovery + HTTP + dedup), feature extraction (src/forensics/features/), analysis (src/forensics/analysis/), reporting (src/forensics/reporting/). Full orchestration for forensics all lives in src/forensics/pipeline.py.
  • Configuration: src/forensics/config/settings.py loads config.toml at the project root with FORENSICS_ environment overrides (pydantic-settings). Override the TOML path with FORENSICS_CONFIG_FILE.

Storage and model contracts are summarized in docs/ARCHITECTURE.md and ADRs under docs/adr/.

flowchart TB
  API[WordPress REST API]
  API --> SQL[(SQLite write store)]
  SQL --> PQ[Parquet feature store]
  SQL --> J[articles.jsonl export]
  SQL --> DD[DuckDB analytical layer]
  PQ --> DD
  DD --> NB[Notebooks + Quarto]
  J --> NB
  NB --> OUT[HTML / PDF reports]

Requirements

  • Python 3.13 (see requires-python in pyproject.toml).
  • uv for environments and script execution (uv run …).
  • Quarto on your PATH if you run forensics report or the report step of forensics all.
  • spaCy English model for feature work aligned with CI (default en_core_web_md) — installed with the project via the pinned wheel in pyproject.toml when you run uv sync / make peer-setup. Run uv run python -m spacy download <name> only if you point spacy_model at another pipeline.

Installation

git clone git@github.com:Abstract-Data/mediaite-ghostink.git
cd mediaite-ghostink
make peer-setup
# or: uv sync --extra dev --extra tui && uv run forensics validate

Copy .env.example to .env when you need optional secrets or observability hooks. Core pipeline configuration remains config.toml + FORENSICS_*.


Configuration

  1. Authors and scraping: Edit config.toml. Replace template rows whose slugs are placeholder-target / placeholder-control with real authors before any live scrape; the CLI rejects those placeholders for discover/metadata/fetch paths.
  2. Nested settings: Tables such as [scraping], [analysis], [survey], [probability], [baseline], [report], and [chain_of_custody] tune rate limits, analysis thresholds, survey eligibility, optional Phase 9/10 behavior, and report output.
  3. Environment: Nested keys via FORENSICS_* are described in src/forensics/config/settings.py and .env.example.

CLI

Global options (Typer app root):

uv run forensics --version
uv run forensics -v scrape --help   # example: DEBUG logs for scrape

  • scrape: WordPress author discovery, article metadata, HTML fetch, simhash dedup, optional raw archive. Combine flags as documented in uv run forensics scrape --help (e.g. --discover, --metadata, --fetch, --dedup, --archive, --dry-run with --fetch, --force-refresh with discover).
  • extract: Feature extraction + embeddings from data/articles.db. Options include --author, --skip-embeddings, and --probability (requires --extra probability).
  • analyze: Modes via flags: --changepoint, --timeseries, --drift, --convergence, --compare, --ai-baseline, --verify-corpus (corpus hash check), optional --author. With no analysis flags, the default runs time-series plus the full convergence-oriented analysis path; add flags to narrow or extend. See uv run forensics analyze --help.
  • report: Quarto render (--notebook, --format html or pdf).
  • all: End-to-end: full scrape (dispatch_scrape with all stage flags false, the same path as bare forensics scrape) → extract_all_features → run_analyze(AnalyzeRequest(timeseries=True, convergence=True)) (no --changepoint / --drift unless you change pipeline.py) → run_report. See docs/ARCHITECTURE.md.
  • validate, preflight, survey, calibrate, export, lock-preregistration, setup: Operational and quality workflows; see docs/RUNBOOK.md.

Typical workflows

Full pipeline (after configuring real authors and installing Quarto):

uv run forensics all

Incremental scrape (example):

uv run forensics scrape --discover
uv run forensics scrape --metadata
uv run forensics scrape --fetch --dry-run   # count only
uv run forensics scrape --fetch

Features then analysis:

uv run forensics extract
uv run forensics analyze --changepoint --timeseries
uv run forensics analyze --drift
uv run forensics report --format html

Repository layout

  • src/forensics/: Application package (CLI, scraper, features, analysis, storage, config, models).
  • tests/: Pytest suite (unit/, integration/, evals/, fixtures, Hypothesis tests).
  • docs/: Architecture, testing policy, runbook, ADRs, deployment notes.
  • _quarto.yml, index.qmd: Quarto book project config and landing chapter (output under data/reports/).
  • notebooks/: Jupyter chapters consumed by Quarto.
  • prompts/: Versioned prompts for agents and pipeline phases.
  • scripts/: Maintenance and one-off utilities.
  • evals/: Eval scenarios referenced from tooling or docs.

Data layout

  • data/articles.db: Primary SQLite store (articles, authors, run metadata).
  • data/authors_manifest.jsonl: Discovered author manifest from scrape.
  • data/raw/: Raw HTML / year archives (see scrape --archive).
  • data/features/: Per-author feature tables (Parquet).
  • data/embeddings/: Embedding batches used by drift and reports.
  • data/analysis/: Per-author JSON results, run metadata, corpus_custody.json.
  • data/articles.jsonl: JSONL export for auditing.
  • data/reports/: Quarto book output (see _quarto.yml project.output-dir).
  • data/probability/: Phase 9 outputs when enabled.
  • data/ai_baseline/: Phase 10 synthetic baseline artifacts.
  • data/survey/, data/calibration/: Survey and calibration run outputs (see runbook).

Exact filenames evolve with the pipeline; treat docs/ARCHITECTURE.md as the conceptual map.


Optional dependency extras

Defined in pyproject.toml:

  • dev (uv sync --extra dev): pytest, pytest-cov, Hypothesis, Ruff, pre-commit.
  • probability (uv sync --extra probability): Phase 9 token-level features (torch, transformers); forensics extract --probability.
  • baseline (uv sync --extra baseline): pydantic-ai + evals for baseline workflows; local [baseline] config in config.toml (Ollama) for generation smoke tests.
  • tui (uv sync --extra tui): Textual setup wizard (uv run forensics setup).

Reports (Quarto)


Notebooks and prompts

  • notebooks/ — Exploratory and chapter notebooks wired into the Quarto book.
  • prompts/ — Versioned agent / phase prompts with current.md pointers; see prompts/README.md for the release contract.

Development

uv sync --extra dev
uv run ruff check .
uv run ruff format --check .
uv run pytest tests/ -v
uv run pytest tests/ -v --cov=src --cov-report=term-missing

Testing policy and coverage gates are documented in docs/TESTING.md.


Documentation

The canonical operator markdown lives under docs/ and is also published — alongside the auto-generated CLI reference, the Python API reference, the ADRs, and the embedded Quarto report — at https://abstract-data.github.io/mediaite-ghostink/. The site is built from website/ and deployed by .github/workflows/deploy-docs.yml. Run make docs-dev for a local preview.

  • docs/ARCHITECTURE.md: Runtime flow, modules, storage, feature and analysis methods.
  • docs/TESTING.md: Test layout, commands, coverage rules.
  • docs/RUNBOOK.md: Operational runbook (survey, calibration, export, baseline, preflight, docs site).
  • docs/DEPLOYMENTS.md: Deployment notes.
  • docs/GUARDRAILS.md: Recurring failure patterns and mitigations.
  • docs/adr/: Architecture decision records.
  • website/: Astro Starlight documentation site (Bun + @abstractdata/starlight-theme).
  • CONTRIBUTING.md: Pull requests, checks, and handoff expectations.
  • SECURITY.md: How to report security issues responsibly.
  • LICENSE: MIT license (Abstract Data LLC).

Agent and contributor notes