repfly/hawk
Hawk

The distribution database. Ingest rows, query distributions.

Hawk digests data into compact probability distributions, discards the raw rows, and lets you query the distributions directly -- compare, explain, track drift, and discover correlations through an information-theoretic lens.

  • 40,000x compression: 209,527 news articles --> ~6KB on disk
  • Microsecond queries: no row scanning, distribution math runs directly
  • SQL-like DSL: 15 commands including COMPARE, EXPLAIN, TRACK, MI, NEAREST
  • 10 built-in metrics: JSD, KL, PSI, Hellinger, Wasserstein, MI, NMI, Cramer's V, conditional MI, entropy
  • Distributions-only by default: raw-log retention is opt-in

hawk> COMPARE category BETWEEN time:2013 AND time:2022

Metric              Value
──────────────────  ──────────────────────────────────────────
JSD                 0.684139
PSI                 36.357643
Hellinger           0.782895
Entropy(A)          3.6248 bits
Entropy(B)          3.1460 bits
Samples             34583 vs 1398

--- Top Movers ---
POLITICS            +0.2854  (0.000 → 0.285)  contrib=0.1427
WELLNESS            -0.2150  (0.232 → 0.017)  contrib=0.0796
U.S. NEWS           +0.1724  (0.000 → 0.172)  contrib=0.0862

Why it's different

No existing system combines persistent distribution storage, a query language for distributions, and information-theoretic metrics. The closest tools each cover one piece:

                                    Hawk   WhyLogs         Evidently       Prometheus        Druid / ClickHouse
──────────────────────────────────  ─────  ──────────────  ──────────────  ────────────────  ──────────────────
Persists distributions, not rows    yes    yes (profiles)  no              yes (histograms)  no
SQL-like query language             yes    no              no              PromQL (limited)  SQL (over rows)
JSD / KL / PSI / MI as queries      yes    no              via Python API  no                no
Joint distributions as first-class  yes    no              no              no                no
Embeddable Rust library             yes    Python / Java   Python          no                no
Temporal drift tracking as query    TRACK  SaaS dashboard  Python code     time-range query  time query

Use cases

ML feature drift monitoring

Track how feature distributions shift over time. Detect drift before model performance degrades.

TRACK feature_x FROM time:2024-01 GRANULARITY daily

A/B test analysis

Compare distributions between control and treatment groups without pulling raw data.

COMPARE conversion_bucket BETWEEN variant:control AND variant:treatment

Data quality monitoring

Scan all variables for unexpected distribution shifts between ingestion batches.

COMPARE category ACROSS ingest_date

Model risk / regulatory PSI tracking

Decompose total divergence across all variables to satisfy regulatory model validation requirements (SR 11-7, Basel III).

EXPLAIN time:2023Q4 VS time:2024Q4

Privacy-preserving analytics sharing

Distribute the database file; recipients query distributions without seeing raw rows. Raw-log retention is opt-in, so by default no individual records are stored.

How it works

  1. Define variables (categorical or continuous) and dimensions (e.g., time)
  2. Ingest data from CSV, JSON, or Parquet -- Hawk builds histograms and contingency tables
  3. Query the distributions directly using a SQL-like language or web UI

The database stores only the distribution summaries, not the raw data. Everything is built on entropy and information theory: JSD for comparison, mutual information for association, KL divergence for directionality.
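As a concrete illustration of the comparison math, here is a minimal sketch of Shannon entropy and Jensen-Shannon divergence for two categorical distributions. This is standalone example code, not the hawk_engine API:

```rust
// Illustrative only (not the hawk_engine API): entropy and JSD
// for categorical distributions, in bits (log base 2).

fn entropy(p: &[f64]) -> f64 {
    // Shannon entropy; zero-probability bins contribute nothing.
    p.iter()
        .filter(|&&x| x > 0.0)
        .map(|&x| -x * x.log2())
        .sum()
}

fn jsd(p: &[f64], q: &[f64]) -> f64 {
    // Jensen-Shannon divergence: H(M) - (H(P) + H(Q)) / 2,
    // where M is the midpoint distribution (P + Q) / 2.
    let m: Vec<f64> = p.iter().zip(q).map(|(a, b)| (a + b) / 2.0).collect();
    entropy(&m) - (entropy(p) + entropy(q)) / 2.0
}

fn main() {
    let p = [0.7, 0.2, 0.1];
    let q = [0.1, 0.2, 0.7];
    // With log base 2, JSD is bounded in [0, 1].
    println!("JSD = {:.4}", jsd(&p, &q));
}
```

Because COMPARE never touches raw rows, this is the entire cost of a query: a few arithmetic passes over two small probability vectors.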

Quick start

# Build
cargo build --release

# Start the web UI
cargo run --release --bin hawk-server -- my_database.db 3000

# Or use the CLI
cargo run --release --bin hawk -- my_database.db

Query language

-- Compare two distribution slices
COMPARE category BETWEEN time:2013 AND time:2022

-- With dimension filters
COMPARE category BETWEEN time:2013 AND time:2022 WHERE region:US

-- Compare all pairs across a dimension
COMPARE category ACROSS time

-- What drives the divergence?
EXPLAIN time:2013 VS time:2022

-- Track drift over time
TRACK category FROM time:2012 GRANULARITY yearly

-- Show a distribution (top 5 categories)
SHOW category AT time:2022 TOP 5

-- Entropy ranking
RANK category BY ENTROPY OVER time

-- Mutual information between variables
MI author, category AT time:2016

-- Conditional MI (controlling for time)
CMI author, category GIVEN time

-- Find strongest associations
CORRELATIONS OVER time LIMIT 10

-- Pairwise distance matrix
PAIRWISE time ON category USING jsd

-- Nearest distributions
NEAREST time:2022 ON time LIMIT 3 USING hellinger

-- Export results
EXPORT STATS AS JSON
EXPORT COMPARE category ACROSS time AS CSV

-- Metadata
STATS
SCHEMA
DIMENSIONS time

Example outputs

Drift tracking:

hawk> TRACK category FROM time:2012 GRANULARITY yearly

Time  Entropy  Drift (JSD)
────  ───────  ────────────────
2012  3.6310   0.0314
2013  3.6248   0.3571 <- shift
2014  4.8237   0.1656 <- shift
2015  4.4118   0.0561 <- shift
2018  3.3050   0.1775 <- shift
2020  3.0430   0.0372
2021  2.9053   0.0286
2022  3.1460   0.0000

Explain divergence:

hawk> EXPLAIN time:2013 VS time:2022

Variable          JSD       Fraction
────────────────  ────────  ──────────────
TOTAL             0.830323  100.0%
category          0.684139  82.4%
  POLITICS        +0.2854   contrib=0.1427
  WELLNESS        -0.2150   contrib=0.0796
  U.S. NEWS       +0.1724   contrib=0.0862
author            0.146184  17.6%
  Mary Papenfuss  +0.0715   contrib=0.0358

Association strength:

hawk> MI author, category AT time:2016

Metric       Value
───────────  ───────────
MI           1.7794 bits
NMI          0.5537
Cramer's V   0.5186
Samples      5688
Strength     strong
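
MI figures like the ones above come straight from a joint contingency table via H(X) + H(Y) - H(X,Y). A minimal sketch of that identity (illustrative only, not the hawk_engine API):

```rust
// Illustrative only: mutual information in bits from a contingency
// table of counts, using MI = H(X) + H(Y) - H(X,Y).

fn entropy_bits(probs: impl Iterator<Item = f64>) -> f64 {
    probs.filter(|&p| p > 0.0).map(|p| -p * p.log2()).sum()
}

fn mutual_information(table: &[Vec<u64>]) -> f64 {
    let n: u64 = table.iter().flatten().sum();
    let n = n as f64;
    // Marginal distributions over rows (X) and columns (Y).
    let rows: Vec<f64> = table
        .iter()
        .map(|r| r.iter().sum::<u64>() as f64 / n)
        .collect();
    let cols: Vec<f64> = (0..table[0].len())
        .map(|j| table.iter().map(|r| r[j]).sum::<u64>() as f64 / n)
        .collect();
    // Joint distribution over all cells.
    let joint = table.iter().flatten().map(|&c| c as f64 / n);
    entropy_bits(rows.into_iter()) + entropy_bits(cols.into_iter()) - entropy_bits(joint)
}

fn main() {
    // Perfectly associated 2x2 table: knowing the row determines
    // the column, so MI equals the full 1 bit of column entropy.
    let table = vec![vec![50, 0], vec![0, 50]];
    println!("MI = {:.4} bits", mutual_information(&table));
}
```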

Metrics

All metrics are rooted in information theory:

Metric          Formula                            Range       What it measures
──────────────  ─────────────────────────────────  ──────────  ────────────────────────────────────────────────────
Entropy         H(X) = -Σ p_i log p_i              [0, log k]  Distribution uncertainty
JSD             H(M) - ½H(P) - ½H(Q)               [0, 1]      Symmetric divergence
KL divergence   Σ p_i log(p_i/q_i)                 [0, ∞)      Directional divergence
PSI             KL(P||Q) + KL(Q||P)                [0, ∞)      Population stability (<0.1 stable, >0.2 significant)
Hellinger       (1/√2)·√(Σ(√p_i - √q_i)²)          [0, 1]      Bounded symmetric distance
Wasserstein     Σ|CDF_P - CDF_Q|·Δx                [0, ∞)      Earth mover's distance (histograms only)
MI              H(X) + H(Y) - H(X,Y)               [0, ∞)      Shared information between variables
NMI             MI / min(H(X), H(Y))               [0, 1]      Normalized association strength
Cramer's V      √(χ² / (n·min(r-1, c-1)))          [0, 1]      Effect size for categorical association
Conditional MI  H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)  [0, ∞)      Association between two variables, controlling for a third
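
The PSI row above, with its conventional stability bands, can be sketched in a few lines. This is illustrative code under the table's definition (symmetrized KL, here in bits to match the entropy convention), not the hawk_engine API:

```rust
// Illustrative only: PSI as KL(P||Q) + KL(Q||P), with the
// conventional interpretation bands (<0.1 stable, >0.2 significant).

fn kl(p: &[f64], q: &[f64]) -> f64 {
    // KL divergence in bits; assumes q is nonzero wherever p > 0.
    p.iter()
        .zip(q)
        .filter(|(&a, _)| a > 0.0)
        .map(|(&a, &b)| a * (a / b).log2())
        .sum()
}

fn psi(p: &[f64], q: &[f64]) -> f64 {
    kl(p, q) + kl(q, p)
}

fn main() {
    let baseline = [0.5, 0.3, 0.2];
    let current = [0.4, 0.35, 0.25];
    let v = psi(&baseline, &current);
    let verdict = if v < 0.1 {
        "stable"
    } else if v < 0.2 {
        "moderate shift"
    } else {
        "significant shift"
    };
    println!("PSI = {v:.4} ({verdict})");
}
```

Unlike plain KL, PSI is symmetric by construction, which is why it appears with an unordered pair of slices in COMPARE output.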

Web UI

cargo run --release --bin hawk-server -- my_database.db 3000
# Open http://localhost:3000

Features:

  • Interactive query input with htmx (no page reloads)
  • SVG charts: diverging bar charts for COMPARE, entropy timelines for TRACK, distribution bars for SHOW, heatmaps for PAIRWISE
  • Clickable schema sidebar
  • Query history (persisted in localStorage)
  • Streaming ingestion endpoint: POST /ingest with JSON body

Streaming ingestion

The web server accepts live data via HTTP:

# Single record
curl -X POST http://localhost:3000/ingest \
  -H 'Content-Type: application/json' \
  -d '{"category": "TECH", "date": "2024-01-15"}'

# Batch
curl -X POST http://localhost:3000/ingest \
  -H 'Content-Type: application/json' \
  -d '[{"category": "TECH", "date": "2024-01-15"}, {"category": "SPORTS", "date": "2024-01-16"}]'

Storage format

Hawk uses a custom binary format with zstd compression:

[4 bytes] "HAWK" magic
[4 bytes] format version (u32 LE)
[rest]    zstd-compressed bincode payload

A database that digests 209K news articles (42 categories, 20 authors, 11 years) occupies ~6KB on disk.
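
The fixed 8-byte header can be validated without touching the compressed payload. A minimal sketch of the layout described above (decoding the payload itself would need the zstd and bincode crates and is omitted):

```rust
// Illustrative only: validate the "HAWK" magic and read the
// little-endian u32 format version from the 8-byte header.

fn parse_header(bytes: &[u8]) -> Result<u32, String> {
    if bytes.len() < 8 {
        return Err("file too short for HAWK header".into());
    }
    if &bytes[0..4] != b"HAWK" {
        return Err("bad magic; not a Hawk database".into());
    }
    let version = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    Ok(version)
}

fn main() {
    let mut file: Vec<u8> = Vec::new();
    file.extend_from_slice(b"HAWK");
    file.extend_from_slice(&1u32.to_le_bytes());
    // ...zstd-compressed bincode payload would follow here...
    println!("format version = {}", parse_header(&file).unwrap());
}
```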

File               Contents
─────────────────  ──────────────────────────────────────────────────────────
meta.edb           Schema, counters, config
distributions.edb  All marginal distributions + joint contingency tables
dist_index.edb     Lookup index for (variable, dimension_key) -> distribution
snapshots.edb      Historical distribution snapshots

Architecture

Single library crate (hawk-engine) with modules:

hawk_engine::core       Types: Distribution, Joint, Schema, DimensionKey
hawk_engine::math       Entropy, JSD, KL, PSI, Hellinger, MI, NMI, Cramer's V, Wasserstein
hawk_engine::storage    Binary file storage, zstd compression, mmap reads, locking
hawk_engine::ingest     CSV/JSON/Parquet ingestion, rayon parallelism, schema inference
hawk_engine::query      Query engine: compare, explain, track, pairwise, correlations
hawk_engine::sql        SQL-like DSL: tokenizer, recursive descent parser, executor

Plus a separate binary crate (hawk-server) for the web UI: axum + htmx, SVG charts, streaming ingestion endpoint.

Using as a library

[dependencies]
hawk-engine = "0.1"

use hawk_engine::storage::{Database, OpenMode};
use hawk_engine::query::QueryEngine;

let db = Database::open("my.db", OpenMode::ReadOnly).unwrap();
let engine = QueryEngine::default();
let result = engine.compare(&db, "time:2023", "time:2024", None).unwrap();
println!("JSD = {:.6}", result.jsd);
Published to crates.io.

Building

cargo build --release
cargo test

Requirements: Rust 1.75+

License

MIT

About

A distribution-native analytics engine.
