
EML Operator-Inspired Optimizations: Log Quantization, Unified Distance, EML Trees #351

@shaal

Description

📋 Status: Ready for maintainer review (2026-04-14)

Proposal fully implemented, benchmarked on real public ANN datasets, and committed to branch issue/eml on fork shaal/ruvector. PR #352 is open but this issue contains everything needed to evaluate the work end-to-end. All four proof reports live in bench_results/ on the branch.

One-line ask: merge the branch (or cherry-pick the relevant commits) if the SIFT1M real-data win and the honest failure-mode documentation meet the bar; reject with specific concerns if not.


TL;DR

Four optimizations inspired by Odrzywołek's "All elementary functions from a single operator" paper, validated through a rigorous four-stage proof chain that includes one self-disproof and one hypothesis-test failure — honestly reported.

Headline real-data result (the single most important number)

On SIFT1M (100K subset, the canonical ANN benchmark, 3 seeds): +14.0% QPS at ef=64, +0.75% Recall@1, −3.3% build time. Recall preserved or slightly improved across k=1/10/100. Full table below.

Honest caveat

On GloVe-100d: −10.4% QPS at ef=256 (recall preserved). The non-power-of-two 100D dimension causes SimSIMD's scalar tail-handling to execute per distance call. We tried zero-padding as a fix (v4) and it made the regression worse due to per-call padding overhead. The correct fix (pad-at-insert) is scoped as follow-up work; the API hook for it is already shipped.

Four-stage proof chain

| Stage | What was tested | Outcome | Report |
|-------|-----------------|---------|--------|
| v1 | Micro + end-to-end scalar unified kernel | ❌ Disproved — −21% QPS regression | eml_proof_2026-04-14.md |
| v2 | Port compute() to SimSIMD, re-run synthetic | ✅ Confirmed — +5–11% QPS, recall preserved | eml_proof_2026-04-14_v2.md |
| v3 | Real public ANN datasets (SIFT1M + GloVe) | ⚠️ Mixed — SIFT1M +14% win; GloVe −10% regression | eml_proof_2026-04-14_v3.md |
| v4 | Dimension-padding hypothesis test | ❌ Disproved — per-call padding makes it worse | eml_proof_2026-04-14_v4.md |

The Proposal

Four optimizations derived from the proof that eml(x, y) = exp(x) − ln(y) combined with the constant 1 is functionally complete for all elementary mathematical functions — analogous to NAND gates in Boolean logic. The paper's identities we use:

  • eml(x, 1) = exp(x) — exponential
  • eml(0, y) = 1 − ln(y) — shifted logarithm
  • eml(ln(a), exp(b)) = a − b — subtraction
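These identities are easy to sanity-check numerically. A minimal standalone sketch (the `eml` function here is a local definition for illustration, not the crate's `advanced::eml` API):

```rust
// eml(x, y) = exp(x) - ln(y), checked against the three identities above.
fn eml(x: f64, y: f64) -> f64 {
    x.exp() - y.ln()
}

fn main() {
    // eml(x, 1) = exp(x): ln(1) = 0 leaves only the exponential term.
    assert!((eml(1.5, 1.0) - 1.5f64.exp()).abs() < 1e-12);
    // eml(0, y) = 1 - ln(y): exp(0) = 1.
    assert!((eml(0.0, 2.0) - (1.0 - 2.0f64.ln())).abs() < 1e-12);
    // eml(ln(a), exp(b)) = a - b: both wrappers cancel, leaving subtraction.
    let (a, b) = (3.0f64, 1.25f64);
    assert!((eml(a.ln(), b.exp()) - (a - b)).abs() < 1e-9);
}
```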

Optimization 1 — LogQuantized (Logarithmic Scalar Quantization)

Problem. ScalarQuantized maps values linearly to [0, 255]. Real embedding distributions (transformer outputs, ReLU activations, frequency features) are non-uniform — values cluster near zero with heavy tails. Linear mapping wastes bits on the sparse tail.

Solution. Apply ln(x − min + 1) before uniform quantization and exp(q) − 1 + min to reconstruct. Same 4× compression as scalar, finer granularity where values are dense.

Implementation. src/quantization.rs — same u8 storage as ScalarQuantized, same SIMD-accelerated distance path.
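The encode/decode round trip can be sketched in a few lines. This is an illustrative reimplementation of the idea, not the branch's LogQuantized type; the function names are hypothetical:

```rust
// Forward: shift into [1, ∞), log-transform, then uniform u8 quantization.
// The log spends more codes where values cluster near the minimum.
fn log_quantize(values: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let logs: Vec<f32> = values.iter().map(|&v| (v - min + 1.0).ln()).collect();
    let max_log = logs.iter().cloned().fold(0.0_f32, f32::max);
    let scale = if max_log > 0.0 { 255.0 / max_log } else { 0.0 };
    let codes = logs.iter().map(|&t| (t * scale).round() as u8).collect();
    (codes, min, max_log)
}

// Inverse: exp undoes the log, then shift back by `min`.
fn log_dequantize(codes: &[u8], min: f32, max_log: f32) -> Vec<f32> {
    codes
        .iter()
        .map(|&q| (q as f32 * max_log / 255.0).exp() - 1.0 + min)
        .collect()
}

fn main() {
    // Values clustered near zero with one heavy-tail outlier.
    let vals = [0.0_f32, 0.1, 0.5, 10.0];
    let (codes, min, max_log) = log_quantize(&vals);
    let back = log_dequantize(&codes, min, max_log);
    for (v, b) in vals.iter().zip(&back) {
        assert!((v - b).abs() < 0.01); // dense region reconstructs tightly
    }
}
```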

Optimization 2 — UnifiedDistanceParams (Branch-free Distance Kernel)

Problem. distance() dispatches via match metric { … } on every call. Inside HNSW search, this match executes thousands of times per query.

Solution. Pre-resolve the metric once into boolean flags (normalize, use_abs_diff, use_sq_diff, apply_sqrt, negate). The inner loop then evaluates a predictable if/else if cascade on those flags, which branch-predicts perfectly for a given index.

Implementation (src/advanced/eml.rs). After v1's disproof, compute() now dispatches directly to simsimd::SpatialSimilarity::{cosine, sqeuclidean, dot} — same hand-tuned AVX2/NEON kernels as the baseline.
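The flag-resolution idea, shown in scalar form for clarity (this mirrors the v1-style kernel; per the disproof above, the shipped v2 code instead routes the resolved flags straight to SimSIMD's SIMD kernels). Names are illustrative, and the `normalize` flag is omitted for brevity:

```rust
#[derive(Clone, Copy)]
enum Metric { Euclidean, Manhattan, Dot }

struct Flags {
    use_abs_diff: bool,
    use_sq_diff: bool,
    apply_sqrt: bool,
    negate: bool,
}

impl Flags {
    // Resolve the metric match ONCE, at index-construction time...
    fn from_metric(m: Metric) -> Self {
        match m {
            Metric::Euclidean => Flags { use_abs_diff: false, use_sq_diff: true, apply_sqrt: true, negate: false },
            Metric::Manhattan => Flags { use_abs_diff: true, use_sq_diff: false, apply_sqrt: false, negate: false },
            Metric::Dot => Flags { use_abs_diff: false, use_sq_diff: false, apply_sqrt: false, negate: true },
        }
    }

    // ...so the hot loop only tests stable booleans that branch-predict well.
    fn compute(&self, a: &[f32], b: &[f32]) -> f32 {
        let mut acc = 0.0_f32;
        for (x, y) in a.iter().zip(b) {
            if self.use_sq_diff { acc += (x - y) * (x - y); }
            else if self.use_abs_diff { acc += (x - y).abs(); }
            else { acc += x * y; }
        }
        let acc = if self.apply_sqrt { acc.sqrt() } else { acc };
        if self.negate { -acc } else { acc }
    }
}

fn main() {
    let e = Flags::from_metric(Metric::Euclidean);
    assert!((e.compute(&[0.0, 3.0], &[4.0, 0.0]) - 5.0).abs() < 1e-6);
    let m = Flags::from_metric(Metric::Manhattan);
    assert!((m.compute(&[0.0, 3.0], &[4.0, 0.0]) - 7.0).abs() < 1e-6);
    let d = Flags::from_metric(Metric::Dot);
    assert!((d.compute(&[1.0, 2.0], &[3.0, 4.0]) + 11.0).abs() < 1e-6);
}
```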

Optimization 3 — EmlTree / EmlModel (Learned-Index CDF Approximator)

Problem. RecursiveModelIndex uses linear models for CDF approximation; real distributions are rarely linear.

Solution. Replace the linear model with a trainable binary tree of eml() nodes. The paper's functional-completeness proof guarantees that any elementary CDF can be represented.

Evidence. On y = exp(x), a depth-1 EML tree converges in 12 iterations to test-MSE 0.000000 vs linear model's 0.194 (~2 billion× improvement).
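The exact fit on y = exp(x) follows directly from the first identity: a node that applies eml to affine-transformed inputs can collapse to eml(x, 1) = exp(x). A hypothetical node shape illustrating this (the branch's actual EmlTree parameterization may differ; the weight names here are invented):

```rust
// Single EML node with affine children: eval(x) = exp(w1*x + b1) - ln(w2*x + b2).
// With w1 = 1, b1 = 0, w2 = 0, b2 = 1 it reduces to eml(x, 1) = exp(x),
// which is why a depth-1 tree can hit zero test-MSE on y = exp(x).
struct EmlNode { w1: f64, b1: f64, w2: f64, b2: f64 }

impl EmlNode {
    fn eval(&self, x: f64) -> f64 {
        (self.w1 * x + self.b1).exp() - (self.w2 * x + self.b2).ln()
    }
}

fn main() {
    let node = EmlNode { w1: 1.0, b1: 0.0, w2: 0.0, b2: 1.0 };
    let mse: f64 = (0..100)
        .map(|i| {
            let x = i as f64 / 100.0;
            let err = node.eval(x) - x.exp();
            err * err
        })
        .sum::<f64>() / 100.0;
    assert!(mse < 1e-20); // exact up to float rounding
}
```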

Optimization 4 — EmlScoreFusion (Non-linear Hybrid Search Scoring)

Problem. Linear fusion α·vector + (1−α)·bm25 can't capture non-linear interactions.

Solution. eml(a·vector + b, c·bm25 + d) — captures the real-world pattern that vector similarity has exponential impact while keyword matching has logarithmic impact. Measured 3.16× asymmetry in tests/eml_proof.rs (PROOF 5).

Overhead. 2.5 ns/pair absolute (4.3× slower than linear). Negligible for top-K hybrid scoring.
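The exp-vs-log asymmetry can be demonstrated with a toy fused score. The coefficients below are placeholders chosen only to expose the asymmetry, not the tuned values from the branch (in practice the sign and scale of c, d would be fitted):

```rust
// Illustrative fused score: exponential on the vector side, logarithmic on
// the keyword side, per eml(a·vector + b, c·bm25 + d).
fn eml_fuse(vector_sim: f64, bm25: f64) -> f64 {
    let (a, b, c, d) = (1.0, 0.0, 1.0, 1.0); // hypothetical coefficients
    (a * vector_sim + b).exp() - (c * bm25 + d).ln()
}

fn main() {
    // The same +0.1 bump moves the score far more on the vector side (exp)
    // than on the keyword side (log) — the kind of asymmetry PROOF 5 measures.
    let base = eml_fuse(0.8, 1.0);
    let d_vec = (eml_fuse(0.9, 1.0) - base).abs();
    let d_kw = (eml_fuse(0.8, 1.1) - base).abs();
    assert!(d_vec > 2.0 * d_kw);
}
```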


Evidence Detail

Micro-evidence (tests/eml_proof.rs — 8 integration tests, all passing)

| Claim | Measurement |
|-------|-------------|
| LogQuantized MSE (half-normal / ReLU) | 20.9% lower than Scalar |
| LogQuantized MSE (exponential) | 52.5% lower than Scalar |
| LogQuantized MSE (log-normal) | 5.1% lower than Scalar |
| Unified distance correctness | max diff < 3.34e-6 across 1,710 vector pairs |
| EML tree on exp(x) | ~2 billion× lower test-MSE than linear model |
| EML score fusion asymmetry | 3.16× vector/keyword ratio |
| Zero-padding semantic neutrality | proven for Euclidean / Cosine / DotProduct / Manhattan |

End-to-End real-data evidence (SIFT1M — the headline win)

Dataset: SIFT1M from Texmex corpus, 100K base × 500 queries × 128D, Euclidean.
HNSW: M=16, ef_construction=200. Three seeds [42, 1337, 2024].

| Metric | Baseline (Scalar+Dispatch) | LogQuantized (Log+Unified SIMD) | Delta |
|--------|----------------------------|---------------------------------|-------|
| QPS ef=64 | 2,946 ± 300 | 3,358 ± 262 | +14.0% |
| QPS ef=256 | 1,604 ± 63 | 1,653 ± 94 | +3.0% |
| Recall@1 (ef=256) | 0.9807 ± 0.0057 | 0.9880 ± 0.0016 | +0.75% |
| Recall@10 | 0.9843 | 0.9853 | +0.1% (~same) |
| Recall@100 | 0.9785 | 0.9785 | ~0% |
| Build time | 39.0 s | 37.7 s | −3.3% |
| p50 ef=64 | 337 µs | 298 µs | −11.5% (better) |
| p95 ef=64 | 480 µs | 417 µs | −13.1% (better) |
| p99 ef=64 | 538 µs | 504 µs | −6.3% (better) |
| p99.9 ef=64 | 658 µs ± 147 | 882 µs ± 551 | +34.0% (high stddev) |
| p50 ef=256 | 618 µs | 591 µs | −4.4% |
| p95/p99 ef=256 | ≈ baseline | ≈ baseline | within ±10% |
| p99.9 ef=256 | 1,190 µs | 1,628 µs | +36.9% (tail noise) |

End-to-End real-data evidence (GloVe-100d — the honest regression)

Dataset: GloVe-100d from Stanford NLP, 100K base × 500 queries × 100D, Cosine.
HNSW: identical config.

| Metric | Baseline | Log+Unified | Delta |
|--------|----------|-------------|-------|
| QPS ef=64 | 2,483 | 2,429 | −2.2% (~same) |
| QPS ef=256 | 1,198 | 1,073 | −10.4% WORSE ⚠️ |
| Recall@1 | 0.9740 | 0.9713 | −0.27% (noise) |
| Recall@10 | 0.9698 | 0.9705 | +0.1% |
| Recall@100 | 0.9374 | 0.9375 | ~0% |

Root cause: 100D is not a multiple of 8, so SimSIMD runs a scalar tail-loop on every distance call. Cosine normalization also dispatches to the same simsimd::cosine() function in both baseline and unified paths, so the theoretical branch-free advantage has no room to manifest.

v4 padding hypothesis test (honest disproof)

We added pad_to_power_of_two: bool to UnifiedDistanceParams and HnswIndex::new_unified_padded(), zero-padding 100D → 104D per-call via thread-local scratch. Re-ran the same GloVe benchmark:

| GloVe Δ vs baseline | v3 unpadded | v4 padded | Interpretation |
|---------------------|-------------|-----------|----------------|
| QPS ef=64 | −2.2% | −24.6% WORSE | padding hurt |
| QPS ef=256 | −10.4% | −23.4% WORSE | padding hurt |
| Build time | +3.3% | +31.7% WORSE | padding hurt |

Why: per-call padding path = RefCell::borrow_mut() × 2 + clear() × 2 + extend_from_slice(100) × 2 (400-byte memcpy each) + resize(104) × 2 + SimSIMD call. That's ~60–100 cycles overhead per distance call vs ~5–10 cycles saved on the SimSIMD tail. Net loss. HNSW does thousands of distance calls per query, so the loss amplifies.

Correct fix (scoped as follow-up, not in this PR): pad vectors once at add() / search() time, store padded in HNSW, let compute_raw() see pre-aligned slices with zero per-call overhead. The pad_to_power_of_two API hook is shipped so this follow-up lands without a public API change. A TODO(future) comment near UnifiedDistanceParams::with_padding() records this.
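For reference, the follow-up amounts to moving a one-time transform like this to add()/search() time, under the assumption that SimSIMD's f32 kernels want lane counts in multiples of 8 (100 → 104, matching the v4 numbers above). The helper name is hypothetical:

```rust
// Pad each vector ONCE when it enters the index, so the hot distance path
// always sees an aligned slice and pays zero per-call overhead.
fn pad_for_simd(v: &[f32]) -> Vec<f32> {
    let padded_len = (v.len() + 7) / 8 * 8; // next multiple of 8 (100 → 104)
    let mut out = Vec::with_capacity(padded_len);
    out.extend_from_slice(v);
    out.resize(padded_len, 0.0); // zero padding is distance-neutral (see proofs)
    out
}

fn main() {
    let v = vec![1.0_f32; 100];
    let p = pad_for_simd(&v);
    assert_eq!(p.len(), 104);
    assert!(p[..100].iter().all(|&x| x == 1.0));
    assert!(p[100..].iter().all(|&x| x == 0.0));
}
```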


API Surface (all opt-in; defaults unchanged)

```rust
// New variant in existing enum — backward compatible
pub enum QuantizationConfig {
    None, Scalar, Product { ... }, Binary,
    Log,                       // ← NEW
}

// New constructors alongside existing HnswIndex::new()
impl HnswIndex {
    pub fn new(...)                   // UNCHANGED — default
    pub fn new_unified(...)           // ← NEW: branch-free kernel
    pub fn new_unified_padded(...)    // ← NEW: + zero-padding (experimental, see v4)
}

// New type in advanced::eml
pub struct UnifiedDistanceParams {
    pub negate: bool,
    pub normalize: bool,
    pub use_abs_diff: bool,
    pub use_sq_diff: bool,
    pub apply_sqrt: bool,
    pub pad_to_power_of_two: bool,    // ← NEW (default false)
}
impl UnifiedDistanceParams {
    pub fn euclidean() -> Self { ... }
    pub fn cosine() -> Self { ... }
    pub fn dot_product() -> Self { ... }
    pub fn manhattan() -> Self { ... }
    pub fn from_metric(DistanceMetric) -> Self { ... }
    pub fn with_padding(self, bool) -> Self { ... }    // ← NEW
    pub fn compute(&self, &[f32], &[f32]) -> f32 { ... }
    pub fn batch_compute(...) -> Vec<f32> { ... }
    pub fn batch_compute_parallel(...) -> Vec<f32> { ... }  // feature = "parallel"
}

// Plus: EmlTree, EmlModel, EmlScoreFusion, TrainConfig, eml(), eml_safe()
// all under `ruvector_core::advanced::eml`, opt-in usage.
```

Zero breaking changes. Default VectorDB::new(DbOptions::default()) still produces ScalarQuantized + match-dispatched distance + standard HnswIndex::new(). Users opt into each EML feature explicitly.


Test & Validation Coverage

| Suite | Count | Result |
|-------|-------|--------|
| EML unit tests (advanced::eml::tests) | 27 | ✅ all pass (21 original + 6 new padding correctness) |
| Quantization tests (incl. LogQuantized) | 18 | ✅ all pass |
| HNSW tests | 7 | ✅ all pass |
| Integration proofs (tests/eml_proof.rs) | 8 | ✅ all pass |
| Total across all affected modules | 60+ | ✅ all pass |

Micro-benchmarks: cargo bench -p ruvector-core --bench eml_bench — 6 Criterion groups.

End-to-end: cargo bench -p ruvector-core --bench eml_end_to_end — fast Criterion subset runs in ~1 min; full real-dataset proof in ~15 min.


Reproducibility (maintainer can re-run from scratch)

```sh
# Check out the branch
git fetch https://github.com/shaal/ruvector.git issue/eml
git checkout -b review/eml-issue-351 FETCH_HEAD

# Pre-cache datasets (one-time, ~1GB; gitignored)
mkdir -p bench_data && cd bench_data
curl -fLO ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz && tar xzf sift.tar.gz
curl -fLO https://nlp.stanford.edu/data/glove.6B.zip && unzip -o glove.6B.zip glove.6B.100d.txt
cd ..

# Run correctness tests (< 1 min)
cargo test -p ruvector-core --lib --release -- advanced::eml:: quantization:: index::hnsw::
cargo test -p ruvector-core --test eml_proof --release -- --nocapture

# Micro-benchmarks (~5 min)
cargo bench -p ruvector-core --bench eml_bench

# End-to-end real-dataset proof (v3 table, ~15 min at 100K)
EML_FULL_PROOF=1 EML_REAL_DATASETS=1 EML_PROOF_N=100000 EML_PROOF_Q=500 \
  cargo bench -p ruvector-core --bench eml_end_to_end -- eml_e2e_full_proof

# v4 padding test on GloVe only (~5 min)
EML_FULL_PROOF=1 EML_REAL_DATASETS=1 EML_SYNTHETIC_DATASETS=0 EML_SKIP_SIFT1M=1 \
  EML_PAD_UNIFIED=1 EML_PROOF_N=100000 EML_PROOF_Q=500 \
  cargo bench -p ruvector-core --bench eml_end_to_end -- eml_e2e_full_proof
```

Each benchmark prints a Markdown table + inline CSV to stderr, identical format to the committed proof reports.


Files Changed (on branch issue/eml)

| File | Role |
|------|------|
| src/advanced/eml.rs | Core EML operator, EmlTree, training, SIMD UnifiedDistanceParams with padding, EmlScoreFusion |
| src/quantization.rs | LogQuantized type with SIMD distance |
| src/types.rs | QuantizationConfig::Log variant + doc comment with power-of-two caveat + all proof refs |
| src/index/hnsw.rs | HnswDistanceFn enum, DistanceStrategy, new_unified{_padded}() |
| benches/eml_bench.rs | 6 Criterion micro-benchmark groups |
| benches/eml_end_to_end.rs | End-to-end ANN proof: .fvecs + GloVe loaders, Dataset abstraction, env-gated proof runner |
| tests/eml_proof.rs | 8 integration proof tests |
| docs/adr/ADR-033-eml-operator-optimizations.md | Architecture decision record |
| docs/adr/DDD-eml-bounded-context.md | Domain-driven design context |
| docs/benchmarks/BENCHMARKING_GUIDE.md | New section + headline numbers + v1–v4 references |
| bench_results/eml_proof_2026-04-14{,_v2,_v3,_v4}.md | Four proof reports (evidence chain) |
| .gitignore | /bench_data/ excluded |

Commit history (4 commits on the branch, ready for cherry-pick or squash-merge)

9bc8d165 docs: align types.rs + BENCHMARKING_GUIDE.md with v4 finding + TODO
9cdc6907 feat: real-dataset ANN proof (SIFT1M + GloVe) + dimension padding option
ab18edf9 feat: end-to-end ANN proof for EML optimizations + SIMD port
5a4b5f8e feat: add EML operator-inspired optimizations for quantization, distance, and learned indexes

Known Follow-up Work (explicitly NOT in this PR)

  1. Pad-at-insert HNSW variant. Closes the GloVe regression if the hypothesis holds at the insert layer instead of the per-distance-call layer. The API hook (pad_to_power_of_two flag + new_unified_padded()) is already shipped so this lands without a public API change. Estimated scope: change HnswIndex::add_batch() and search_with_ef() to pad once; ~100 LOC.
  2. Full 1M SIFT1M run. Current validation uses the 100K subset for a ~15-min run; 1M would take ~3 hours but confirm the win scales.
  3. ANN-Benchmarks protocol on Deep1M / MS MARCO / Fashion-MNIST. Broader corpus coverage before declaring the wins universally production-ready.
  4. Faiss / pgvector comparative benchmarks. Cross-library validation.

None of these are blocking for the current PR — they're future work the proof chain identifies, not gaps in the current evidence.


What to Review

For the maintainer, the minimum viable review path is:

  1. Read the TL;DR above — is the SIFT1M headline win credible and valuable?
  2. Open v3 report — scan the Markdown table for SIFT1M; verify the reproducibility instructions make sense.
  3. Open v4 report — verify the honest regression disclosure for GloVe is adequate and the pad-at-insert follow-up plan is sound.
  4. Skim types.rs around line 105 — verify QuantizationConfig::Log doc carries the caveat clearly.
  5. Verify defaults unchanged: DbOptions::default() still produces ScalarQuantized; no existing user gets silently opted into new behavior.

If those five checks are satisfied, this can merge. If any concerns surface, happy to address them directly in comments or spin up additional benchmarks.

