repfly/hawk
Hawk

The distribution database. Ingest rows, query distributions.

Hawk digests data into compact probability distributions, discards the raw rows, and lets you query the distributions directly -- compare, explain, track drift, and discover correlations through an information-theoretic lens.

  • 40,000x compression: 209,527 news articles --> ~6KB on disk
  • Microsecond queries: no row scanning, distribution math runs directly
  • SQL-like DSL: 15 commands including COMPARE, EXPLAIN, TRACK, MI, NEAREST
  • 10 built-in metrics: JSD, KL, PSI, Hellinger, Wasserstein, MI, NMI, Cramer's V, conditional MI, entropy
  • Distributions-only by default: raw-log retention is opt-in

hawk> COMPARE category BETWEEN time:2013 AND time:2022

Metric              Value
──────────────────  ──────────────────────────────────────────
JSD                 0.684139
PSI                 36.357643
Hellinger           0.782895
Entropy(A)          3.6248 bits
Entropy(B)          3.1460 bits
Samples             34583 vs 1398

--- Top Movers ---
POLITICS            +0.2854  (0.000 → 0.285)  contrib=0.1427
WELLNESS            -0.2150  (0.232 → 0.017)  contrib=0.0796
U.S. NEWS           +0.1724  (0.000 → 0.172)  contrib=0.0862

Why it's different

No existing system combines persistent distribution storage, a query language for distributions, and information-theoretic metrics. The closest tools each cover one piece:

                                    Hawk   WhyLogs         Evidently       Prometheus        Druid / ClickHouse
──────────────────────────────────  ─────  ──────────────  ──────────────  ────────────────  ──────────────────
Persists distributions, not rows    yes    yes (profiles)  no              yes (histograms)  no
SQL-like query language             yes    no              no              PromQL (limited)  SQL (over rows)
JSD / KL / PSI / MI as queries      yes    no              via Python API  no                no
Joint distributions as first-class  yes    no              no              no                no
Embeddable Rust library             yes    Python / Java   Python          no                no
Temporal drift tracking as query    TRACK  SaaS dashboard  Python code     time-range query  time query

Use cases

ML feature drift monitoring

Track how feature distributions shift over time. Detect drift before model performance degrades.

TRACK feature_x FROM time:2024-01 GRANULARITY daily

A/B test analysis

Compare distributions between control and treatment groups without pulling raw data.

COMPARE conversion_bucket BETWEEN variant:control AND variant:treatment

Data quality monitoring

Scan all variables for unexpected distribution shifts between ingestion batches.

COMPARE category ACROSS ingest_date

Model risk / regulatory PSI tracking

Decompose total divergence across all variables to satisfy regulatory model validation requirements (SR 11-7, Basel III).

EXPLAIN time:2023Q4 VS time:2024Q4

Privacy-preserving analytics sharing

Distribute the database file; recipients query distributions without seeing raw rows. Raw-log retention is opt-in, so by default no individual records are stored.

How it works

  1. Define variables (categorical or continuous) and dimensions (e.g., time)
  2. Ingest data from CSV, JSON, or Parquet -- Hawk builds histograms and contingency tables
  3. Query the distributions directly using a SQL-like language or web UI

The database stores only the distribution summaries, not the raw data. Everything is built on entropy and information theory: JSD for comparison, mutual information for association, KL divergence for directionality.
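As a concrete illustration of the comparison math, here is a minimal sketch of Shannon entropy and Jensen-Shannon divergence for two categorical distributions. This is standalone example code, not the hawk_engine API:

```rust
// Illustrative only (not the hawk_engine API): entropy and JSD
// for categorical distributions, in bits (log base 2).

fn entropy(p: &[f64]) -> f64 {
    // Shannon entropy; zero-probability bins contribute nothing.
    p.iter()
        .filter(|&&x| x > 0.0)
        .map(|&x| -x * x.log2())
        .sum()
}

fn jsd(p: &[f64], q: &[f64]) -> f64 {
    // Jensen-Shannon divergence: H(M) - (H(P) + H(Q)) / 2,
    // where M is the midpoint distribution (P + Q) / 2.
    let m: Vec<f64> = p.iter().zip(q).map(|(a, b)| (a + b) / 2.0).collect();
    entropy(&m) - (entropy(p) + entropy(q)) / 2.0
}

fn main() {
    let p = [0.7, 0.2, 0.1];
    let q = [0.1, 0.2, 0.7];
    // With log base 2, JSD is bounded in [0, 1].
    println!("JSD = {:.4}", jsd(&p, &q));
}
```

Because COMPARE never touches raw rows, this is the entire cost of a query: a few arithmetic passes over two small probability vectors.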

Quick start

# Build
cargo build --release

# Start the web UI
cargo run --release --bin hawk-server -- my_database.db 3000

# Or use the CLI
cargo run --release --bin hawk -- my_database.db

Query language

-- Compare two distribution slices
COMPARE category BETWEEN time:2013 AND time:2022

-- With dimension filters
COMPARE category BETWEEN time:2013 AND time:2022 WHERE region:US

-- Compare all pairs across a dimension
COMPARE category ACROSS time

-- What drives the divergence?
EXPLAIN time:2013 VS time:2022

-- Track drift over time
TRACK category FROM time:2012 GRANULARITY yearly

-- Show a distribution (top 5 categories)
SHOW category AT time:2022 TOP 5

-- Entropy ranking
RANK category BY ENTROPY OVER time

-- Mutual information between variables
MI author, category AT time:2016

-- Conditional MI (controlling for time)
CMI author, category GIVEN time

-- Find strongest associations
CORRELATIONS OVER time LIMIT 10

-- Pairwise distance matrix
PAIRWISE time ON category USING jsd

-- Nearest distributions
NEAREST time:2022 ON time LIMIT 3 USING hellinger

-- Export results
EXPORT STATS AS JSON
EXPORT COMPARE category ACROSS time AS CSV

-- Metadata
STATS
SCHEMA
DIMENSIONS time

Example outputs

Drift tracking:

hawk> TRACK category FROM time:2012 GRANULARITY yearly

Time  Entropy  Drift (JSD)
────  ───────  ────────────────
2012  3.6310   0.0314
2013  3.6248   0.3571 <- shift
2014  4.8237   0.1656 <- shift
2015  4.4118   0.0561 <- shift
2018  3.3050   0.1775 <- shift
2020  3.0430   0.0372
2021  2.9053   0.0286
2022  3.1460   0.0000

Explain divergence:

hawk> EXPLAIN time:2013 VS time:2022

Variable          JSD       Fraction
────────────────  ────────  ──────────────
TOTAL             0.830323  100.0%
category          0.684139  82.4%
  POLITICS        +0.2854   contrib=0.1427
  WELLNESS        -0.2150   contrib=0.0796
  U.S. NEWS       +0.1724   contrib=0.0862
author            0.146184  17.6%
  Mary Papenfuss  +0.0715   contrib=0.0358

Association strength:

hawk> MI author, category AT time:2016

Metric       Value
───────────  ───────────
MI           1.7794 bits
NMI          0.5537
Cramer's V   0.5186
Samples      5688
Strength     strong
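
MI figures like the ones above come straight from a joint contingency table via H(X) + H(Y) - H(X,Y). A minimal sketch of that identity (illustrative only, not the hawk_engine API):

```rust
// Illustrative only: mutual information in bits from a contingency
// table of counts, using MI = H(X) + H(Y) - H(X,Y).

fn entropy_bits(probs: impl Iterator<Item = f64>) -> f64 {
    probs.filter(|&p| p > 0.0).map(|p| -p * p.log2()).sum()
}

fn mutual_information(table: &[Vec<u64>]) -> f64 {
    let n: u64 = table.iter().flatten().sum();
    let n = n as f64;
    // Marginal distributions over rows (X) and columns (Y).
    let rows: Vec<f64> = table
        .iter()
        .map(|r| r.iter().sum::<u64>() as f64 / n)
        .collect();
    let cols: Vec<f64> = (0..table[0].len())
        .map(|j| table.iter().map(|r| r[j]).sum::<u64>() as f64 / n)
        .collect();
    // Joint distribution over all cells.
    let joint = table.iter().flatten().map(|&c| c as f64 / n);
    entropy_bits(rows.into_iter()) + entropy_bits(cols.into_iter()) - entropy_bits(joint)
}

fn main() {
    // Perfectly associated 2x2 table: knowing the row determines
    // the column, so MI equals the full 1 bit of column entropy.
    let table = vec![vec![50, 0], vec![0, 50]];
    println!("MI = {:.4} bits", mutual_information(&table));
}
```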

Metrics

All metrics are rooted in information theory:

Metric          Formula                            Range       What it measures
──────────────  ─────────────────────────────────  ──────────  ────────────────────────────────────────────────────
Entropy         H(X) = -Σ p_i log p_i              [0, log k]  Distribution uncertainty
JSD             H(M) - ½H(P) - ½H(Q)               [0, 1]      Symmetric divergence
KL divergence   Σ p_i log(p_i/q_i)                 [0, ∞)      Directional divergence
PSI             KL(P||Q) + KL(Q||P)                [0, ∞)      Population stability (<0.1 stable, >0.2 significant)
Hellinger       (1/√2)·√(Σ(√p_i - √q_i)²)          [0, 1]      Bounded symmetric distance
Wasserstein     Σ|CDF_P - CDF_Q|·Δx                [0, ∞)      Earth mover's distance (histograms only)
MI              H(X) + H(Y) - H(X,Y)               [0, ∞)      Shared information between variables
NMI             MI / min(H(X), H(Y))               [0, 1]      Normalized association strength
Cramer's V      √(χ² / (n·min(r-1, c-1)))          [0, 1]      Effect size for categorical association
Conditional MI  H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)  [0, ∞)      Association between two variables, controlling for a third
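
The PSI row above, with its conventional stability bands, can be sketched in a few lines. This is illustrative code under the table's definition (symmetrized KL, here in bits to match the entropy convention), not the hawk_engine API:

```rust
// Illustrative only: PSI as KL(P||Q) + KL(Q||P), with the
// conventional interpretation bands (<0.1 stable, >0.2 significant).

fn kl(p: &[f64], q: &[f64]) -> f64 {
    // KL divergence in bits; assumes q is nonzero wherever p > 0.
    p.iter()
        .zip(q)
        .filter(|(&a, _)| a > 0.0)
        .map(|(&a, &b)| a * (a / b).log2())
        .sum()
}

fn psi(p: &[f64], q: &[f64]) -> f64 {
    kl(p, q) + kl(q, p)
}

fn main() {
    let baseline = [0.5, 0.3, 0.2];
    let current = [0.4, 0.35, 0.25];
    let v = psi(&baseline, &current);
    let verdict = if v < 0.1 {
        "stable"
    } else if v < 0.2 {
        "moderate shift"
    } else {
        "significant shift"
    };
    println!("PSI = {v:.4} ({verdict})");
}
```

Unlike plain KL, PSI is symmetric by construction, which is why it appears with an unordered pair of slices in COMPARE output.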

Web UI

cargo run --release --bin hawk-server -- my_database.db 3000
# Open http://localhost:3000

Features:

  • Interactive query input with htmx (no page reloads)
  • SVG charts: diverging bar charts for COMPARE, entropy timelines for TRACK, distribution bars for SHOW, heatmaps for PAIRWISE
  • Clickable schema sidebar
  • Query history (persisted in localStorage)
  • Streaming ingestion endpoint: POST /ingest with JSON body

Streaming ingestion

The web server accepts live data via HTTP:

# Single record
curl -X POST http://localhost:3000/ingest \
  -H 'Content-Type: application/json' \
  -d '{"category": "TECH", "date": "2024-01-15"}'

# Batch
curl -X POST http://localhost:3000/ingest \
  -H 'Content-Type: application/json' \
  -d '[{"category": "TECH", "date": "2024-01-15"}, {"category": "SPORTS", "date": "2024-01-16"}]'

Storage format

Hawk uses a custom binary format with zstd compression:

[4 bytes] "HAWK" magic
[4 bytes] format version (u32 LE)
[rest]    zstd-compressed bincode payload

A database that digests 209K news articles (42 categories, 20 authors, 11 years) occupies ~6KB on disk.
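
The fixed 8-byte header can be validated without touching the compressed payload. A minimal sketch of the layout described above (decoding the payload itself would need the zstd and bincode crates and is omitted):

```rust
// Illustrative only: validate the "HAWK" magic and read the
// little-endian u32 format version from the 8-byte header.

fn parse_header(bytes: &[u8]) -> Result<u32, String> {
    if bytes.len() < 8 {
        return Err("file too short for HAWK header".into());
    }
    if &bytes[0..4] != b"HAWK" {
        return Err("bad magic; not a Hawk database".into());
    }
    let version = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    Ok(version)
}

fn main() {
    let mut file: Vec<u8> = Vec::new();
    file.extend_from_slice(b"HAWK");
    file.extend_from_slice(&1u32.to_le_bytes());
    // ...zstd-compressed bincode payload would follow here...
    println!("format version = {}", parse_header(&file).unwrap());
}
```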

File               Contents
─────────────────  ──────────────────────────────────────────────────────────
meta.edb           Schema, counters, config
distributions.edb  All marginal distributions + joint contingency tables
dist_index.edb     Lookup index for (variable, dimension_key) -> distribution
snapshots.edb      Historical distribution snapshots

Architecture

Single library crate (hawk-engine) with modules:

hawk_engine::core       Types: Distribution, Joint, Schema, DimensionKey
hawk_engine::math       Entropy, JSD, KL, PSI, Hellinger, MI, NMI, Cramer's V, Wasserstein
hawk_engine::storage    Binary file storage, zstd compression, mmap reads, locking
hawk_engine::ingest     CSV/JSON/Parquet ingestion, rayon parallelism, schema inference
hawk_engine::query      Query engine: compare, explain, track, pairwise, correlations
hawk_engine::sql        SQL-like DSL: tokenizer, recursive descent parser, executor

Plus a separate binary crate (hawk-server) for the web UI: axum + htmx, SVG charts, streaming ingestion endpoint.

Using as a library

[dependencies]
hawk-engine = "0.1"

use hawk_engine::storage::{Database, OpenMode};
use hawk_engine::query::QueryEngine;

let db = Database::open("my.db", OpenMode::ReadOnly).unwrap();
let engine = QueryEngine::default();
let result = engine.compare(&db, "time:2023", "time:2024", None).unwrap();
println!("JSD = {:.6}", result.jsd);
Published to crates.io.

Building

cargo build --release
cargo test

Requirements: Rust 1.75+

License

MIT

About

A distribution-native analytics engine.
