Hyperdimensional Fingerprints (HDF)

Python 3.9+ | License: MIT | NumPy + optional Rust

Real-valued, fixed-size molecular fingerprints: no training, just NumPy (with optional Rust acceleration).

Hyper Fingerprints encodes molecules into continuous vector representations using Holographic Reduced Representations (HRR) with graph message passing. The result is a deterministic, real-valued fingerprint that works as a drop-in feature vector for similarity search, clustering, or any downstream ML task.
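For readers new to HRR, the core operation is binding two vectors by circular convolution, which can be approximately undone with the involution of one operand. The snippet below is a minimal NumPy sketch of that general technique (Plate, 1995), not the library's internal code; the helper names `bind`, `involution`, and `cosine` are hypothetical.

```python
import numpy as np

def bind(a, b):
    """Circular convolution of two 1-D vectors, computed via FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=a.shape[0])

def involution(a):
    """Approximate inverse under circular convolution: a*[i] = a[-i mod n]."""
    return np.roll(a[::-1], 1)

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
n = 1024
# Elements drawn from N(0, 1/n), so each vector has expected unit length.
role, filler, other = rng.normal(0.0, 1.0 / np.sqrt(n), size=(3, n))

bound = bind(role, filler)                 # resembles neither input
recovered = bind(involution(role), bound)  # noisy copy of `filler`

print(f"bound vs filler:     {cosine(bound, filler):+.3f}")
print(f"recovered vs filler: {cosine(recovered, filler):+.3f}")
print(f"recovered vs other:  {cosine(recovered, other):+.3f}")
```

The bound vector keeps the same size as its inputs, which is what allows structured information to be packed into a fixed-size fingerprint.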

🚀 Quick start

from hyper_fingerprints import Encoder, cosine_similarity

enc = Encoder(dimension=512, seed=42)

# Encode molecules (SMILES strings or RDKit Mol objects)
fps = enc.encode(["CCO", "CO", "c1ccccc1"])  # shape: (3, 512), dtype: float64

# Cosine similarity: similar molecules get similar vectors
sim = cosine_similarity(fps, fps)

print(f"ethanol vs methanol: {sim[0, 1]:.3f}")   # high similarity
print(f"ethanol vs benzene:  {sim[0, 2]:.3f}")    # low similarity
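For reference, pairwise cosine similarity over a batch of fingerprints can be written in a few lines of NumPy. This is a sketch of the standard formula only; whether the library's `cosine_similarity` handles zero vectors or dtypes the same way is an assumption, and `pairwise_cosine` is a hypothetical name.

```python
import numpy as np

def pairwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Clamp norms away from zero so all-zero rows do not divide by zero.
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return a_unit @ b_unit.T

# Stand-in vectors; real fingerprints would come from enc.encode(...).
fps = np.array([[1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.0, 0.0, 2.0]])
sim = pairwise_cosine(fps, fps)
print(sim.round(3))
```

Because the result depends only on direction, scaling a fingerprint by a positive constant leaves its similarities unchanged.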

To use a custom atom vocabulary, pass atom_types at init:

enc = Encoder(dimension=512, atom_types=["C", "N", "O", "H", "Si"])

See examples/00_quickstart.ipynb for a full walkthrough covering similarity search, joint fingerprints, custom atom types, save/load, and scikit-learn integration.

📖 API

Encoder

Encoder(
    dimension=256,      # hypervector size
    depth=3,            # message-passing layers (structural context radius)
    atom_types=None,    # atom vocabulary (default: Br, C, Cl, F, I, N, O, P, S)
    seed=None,          # random seed for reproducible codebook generation
    normalize=False,    # L2-normalize after each message-passing layer
    backend="auto",     # "auto" | "rust" | "numpy"
)

Molecules can be passed as SMILES strings, RDKit Mol objects, or lists of either.

Methods

encode(molecules) -> np.ndarray: Encode molecules into order-N hypervector fingerprints. Returns shape (batch_size, dimension).

encode_joint(molecules) -> np.ndarray: Concatenation of order-0 (atom identity only, no structural context) and order-N (full message-passing) embeddings. Returns shape (batch_size, 2 * dimension). Useful when you want both local atom-level and structural information in one vector.

Persistence

save(path) / Encoder.load(path): Persist and restore an encoder (config + codebook) as a single .npz file. Useful for sharing a fixed fingerprint scheme or deploying without needing to track the seed.

enc.save("encoder.npz")
loaded = Encoder.load("encoder.npz")
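To illustrate how a config and a codebook can share one .npz file, here is a minimal sketch using `np.savez` and `np.load`. The key names, the JSON-encoded config, and the helpers `save_state` / `load_state` are assumptions made for illustration; the actual file layout written by `Encoder.save` may differ.

```python
import json
import os
import tempfile

import numpy as np

def save_state(path, config, codebook):
    """Write a config dict and a codebook array into a single .npz file."""
    np.savez(path, config=np.array(json.dumps(config)), codebook=codebook)

def load_state(path):
    """Read back the (config, codebook) pair written by save_state."""
    with np.load(path) as data:
        config = json.loads(data["config"].item())
        codebook = data["codebook"]
    return config, codebook

# Toy values; a real codebook would be far larger.
config = {"dimension": 8, "depth": 3, "normalize": False}
codebook = np.random.default_rng(42).normal(size=(4, config["dimension"]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "encoder_state.npz")
    save_state(path, config, codebook)
    restored_config, restored_codebook = load_state(path)

print(restored_config == config, np.array_equal(restored_codebook, codebook))
```

Storing the codebook itself, rather than only the seed, keeps fingerprints reproducible even if random number generation changes between library versions.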

Parameter guidance

| Parameter | Guidance |
|---|---|
| dimension | 32-256 for Bayesian optimization. 1024-2048 as a starting point for property prediction. |
| depth | Controls structural context radius, analogous to Morgan radius. depth=3 captures up to 3-bond neighborhoods. Higher values capture more global structure but increase computation. |
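To make the "context radius" concrete: with `depth` rounds of message passing, information can travel at most `depth` bonds, so an atom's representation can only reflect atoms within that distance. The sketch below computes that neighborhood with a breadth-first search on a hand-written adjacency list; `atoms_within` is a hypothetical helper and does not use the library.

```python
from collections import deque

def atoms_within(adjacency, start, depth):
    """Return {atom_index: bond_distance} for atoms at most `depth` bonds from `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if dist[atom] == depth:
            continue  # do not expand beyond the requested radius
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    return dist

# Heavy-atom skeleton of n-hexane (SMILES "CCCCCC") as an adjacency list.
hexane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

for depth in (1, 2, 3):
    reach = atoms_within(hexane, start=0, depth=depth)
    print(f"depth={depth}: atoms {sorted(reach)}")
```

For the terminal carbon of hexane, depth=3 reaches four of the six carbons; the far end of the chain only enters its context at higher depths.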

Atom features

Each atom is described by 5 discrete features:

| Feature | Bins | Values |
|---|---|---|
| Atom type | len(atom_types) (varies with vocabulary) | Index into the atom vocabulary |
| Degree | 6 | 0-5 |
| Formal charge | 3 | neutral, positive, negative |
| Total Hs | 4 | 0-3 |
| Is aromatic | 2 | 0, 1 |
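One way to see how five discrete features identify a single codebook entry is to flatten them into one integer with a mixed-radix encoding. The sketch below does this for the default vocabulary. The feature order, the charge bin order (neutral=0, positive=1, negative=2), and the helper `flat_index` are assumptions for illustration, not the library's documented indexing scheme.

```python
from math import prod

DEFAULT_ATOM_TYPES = ["Br", "C", "Cl", "F", "I", "N", "O", "P", "S"]
# Bin counts for: atom type, degree, formal charge, total Hs, is aromatic.
BINS = (len(DEFAULT_ATOM_TYPES), 6, 3, 4, 2)

def flat_index(features, bins=BINS):
    """Map a tuple of per-feature bin indices to one integer (mixed radix)."""
    if len(features) != len(bins):
        raise ValueError("expected one value per feature")
    index = 0
    for value, size in zip(features, bins):
        if not 0 <= value < size:
            raise ValueError(f"feature value {value} outside 0..{size - 1}")
        index = index * size + value
    return index

# Hydroxyl oxygen of ethanol: type O, degree 1, neutral, one H, not aromatic.
oxygen = (DEFAULT_ATOM_TYPES.index("O"), 1, 0, 1, 0)
print(flat_index(oxygen), "out of", prod(BINS), "possible combinations")
```

Every distinct feature combination maps to a distinct integer, so the number of possible atom descriptions equals the product of the bin counts.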

⚠️ Limitations

  • No bond type features: bonds are treated as unweighted edges. Single, double, and aromatic bonds are not distinguished in the current feature scheme.
  • No stereochemistry: chirality and cis/trans isomerism are not encoded.
  • No GPU acceleration: encoding is CPU-only (NumPy or optional Rust extension).
  • Codebook scales with vocabulary: the codebook has product(feature_bins) entries (1296 for the default 9 atom types). Large custom atom type lists will increase memory usage.
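A rough estimate of how codebook memory grows with the vocabulary can be sketched as follows. It assumes one dense float64 vector of length `dimension` per codebook entry, which is an assumption about storage rather than a documented detail; both function names are hypothetical.

```python
from math import prod

def codebook_entries(n_atom_types):
    """Number of codebook entries: product of the five feature bin counts."""
    return prod((n_atom_types, 6, 3, 4, 2))

def codebook_megabytes(n_atom_types, dimension, bytes_per_value=8):
    """Dense size in MB, assuming one float64 vector per entry."""
    return codebook_entries(n_atom_types) * dimension * bytes_per_value / 1e6

for n_types in (9, 20, 50):
    entries = codebook_entries(n_types)
    size = codebook_megabytes(n_types, dimension=512)
    print(f"{n_types:>2} atom types: {entries:>5} entries, {size:.1f} MB at dimension=512")
```

Under these assumptions the size grows linearly in both the number of atom types and the dimension, so doubling either doubles the memory needed.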

📦 Installation

Requires Python 3.9+ and, at build time only, a Rust toolchain (1.83+).

From source (recommended for development)

git clone https://github.com/the16thpythonist/hyper-fingerprints.git
cd hyper-fingerprints

# Install maturin (builds the Rust extension)
pip install maturin

# Build and install in development mode (editable, release-optimized)
RUSTFLAGS="-C target-cpu=native" maturin develop --release

# Verify the Rust backend is available
python -c "from hyper_fingerprints._core import encode_batch_rs; print('Rust OK')"

From a pre-built wheel

# Build the wheel first
./build.sh

# Install the wheel
pip install target/wheels/hyper_fingerprints-*.whl

Dependencies

  • numpy >= 1.24
  • rdkit >= 2024.0.0
  • Rust toolchain >= 1.83 (build-time only)

🔧 Rust backend

The Rust extension accelerates both SMILES parsing/feature extraction (~23x) and the message-passing pipeline (~22x), for a combined ~22x end-to-end speedup. When installed, it is used automatically; pass backend to override the choice:

enc = Encoder(dimension=512, seed=42, backend="rust")   # require Rust
enc = Encoder(dimension=512, seed=42, backend="numpy")  # force the NumPy backend
enc = Encoder(dimension=512, seed=42, backend="auto")   # default: Rust if available
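The "auto" behavior follows a common optional-extension pattern: try to import the compiled module and fall back if that fails. The sketch below shows the pattern under stated assumptions. The module path `hyper_fingerprints._core` is taken from the install check in this README, while `rust_extension_available`, `resolve_backend`, and the error raised for an unavailable Rust backend are hypothetical.

```python
import importlib

def rust_extension_available(module_name="hyper_fingerprints._core"):
    """Return True if the compiled extension module can be imported."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True

def resolve_backend(requested="auto", rust_available=None):
    """Turn "auto" | "rust" | "numpy" into the backend that would actually run."""
    if requested not in ("auto", "rust", "numpy"):
        raise ValueError(f"unknown backend: {requested!r}")
    if rust_available is None:
        rust_available = rust_extension_available()
    if requested == "numpy":
        return "numpy"
    if rust_available:
        return "rust"
    if requested == "rust":
        # Assumed behavior: an explicit request fails loudly instead of falling back.
        raise RuntimeError("backend='rust' requested but the extension is not installed")
    return "numpy"

print(resolve_backend("auto"))
```

Failing loudly on an explicit "rust" request, rather than silently falling back, keeps benchmark results from being attributed to the wrong backend.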

🧪 Development

Install in dev mode with the Rust extension:

pip install maturin
RUSTFLAGS="-C target-cpu=native" maturin develop --release
pip install -e ".[dev]"

Run tests:

pytest

Build a wheel and test it in a clean environment:

nox -s build_test

Run tests across Python 3.9-3.13 with nox:

nox -s tests

Fingerprint outputs are regression-tested against recorded fixtures to ensure numerical stability across releases.

📚 References

This project builds on the theory of Holographic Reduced Representations and Vector Symbolic Architectures:

  • Plate, T. A. (1995). Holographic Reduced Representations. IEEE Transactions on Neural Networks, 6(3), 623-641. doi:10.1109/72.377968
  • Kanerva, P. (2009). Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors. Cognitive Computation, 1(2), 139-159. doi:10.1007/s12559-009-9009-8

📄 License

This project is licensed under the MIT License.
