Real-valued, fixed-size molecular fingerprints – no training, just NumPy (with optional Rust acceleration).
Hyper Fingerprints encodes molecules into continuous vector representations using Holographic Reduced Representations (HRR) with graph message passing. The result is a deterministic, real-valued fingerprint that works as a drop-in feature vector for similarity search, clustering, or any downstream ML task.
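At the core of HRR is binding by circular convolution: two hypervectors are combined into a third that resembles neither, yet either operand can be approximately recovered. A minimal NumPy sketch of this idea (an illustration of the HRR operator itself, not the library's internals):

```python
import numpy as np

d = 2048
rng = np.random.default_rng(0)
# Random hypervectors with elements ~ N(0, 1/d), as in Plate's HRR
a, b = rng.normal(0.0, 1.0 / np.sqrt(d), size=(2, d))

def bind(x, y):
    # Circular convolution via FFT: the HRR binding operator
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

def inverse(x):
    # Approximate inverse (involution): x_inv[j] = x[-j mod d]
    return np.roll(x[::-1], 1)

def cos(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

c = bind(a, b)               # bound pair; looks like noise on its own
a_hat = bind(c, inverse(b))  # noisy but recognizable recovery of a

print(round(cos(a, a_hat), 2))  # well above chance for large d
```

The recovered vector is noisy, but its similarity to the original is far above that of an unrelated random vector, which is what makes cleanup and similarity search workable.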
```python
from hyper_fingerprints import Encoder, cosine_similarity

enc = Encoder(dimension=512, seed=42)

# Encode molecules (SMILES strings or RDKit Mol objects)
fps = enc.encode(["CCO", "CO", "c1ccccc1"])  # shape: (3, 512), dtype: float64

# Cosine similarity – similar molecules get similar vectors
sim = cosine_similarity(fps, fps)
print(f"ethanol vs methanol: {sim[0, 1]:.3f}")  # high similarity
print(f"ethanol vs benzene: {sim[0, 2]:.3f}")   # low similarity
```

To use a custom atom vocabulary, pass `atom_types` at init:

```python
enc = Encoder(dimension=512, atom_types=["C", "N", "O", "H", "Si"])
```

See `examples/00_quickstart.ipynb` for a full walkthrough covering similarity search, joint fingerprints, custom atom types, save/load, and scikit-learn integration.
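For the scikit-learn integration mentioned above, a deterministic fingerprint encoder can be wrapped as a duck-typed transformer. A hypothetical sketch, with a stand-in `toy_encode` function in place of `Encoder.encode` so it runs without the package installed (with hyper_fingerprints available you would pass `Encoder(dimension=512, seed=42).encode` instead):

```python
import numpy as np

class FingerprintTransformer:
    """Duck-typed scikit-learn transformer: fit is a no-op, transform encodes."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn

    def fit(self, X, y=None):
        return self  # deterministic encoder: nothing to learn

    def transform(self, X):
        return np.asarray(self.encode_fn(X))

# Stand-in encoder: hash each SMILES string into a fixed random vector
def toy_encode(smiles_list, dim=64):
    out = []
    for s in smiles_list:
        rng = np.random.default_rng(abs(hash(s)) % (2**32))
        out.append(rng.normal(size=dim))
    return np.stack(out)

ft = FingerprintTransformer(toy_encode)
X = ft.fit(["CCO", "CO"]).transform(["CCO", "CO"])
print(X.shape)  # (2, 64)
```

Because the interface is just `fit`/`transform`, the wrapper drops into an sklearn `Pipeline` ahead of any estimator.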
```python
Encoder(
    dimension=256,     # hypervector size
    depth=3,           # message-passing layers (structural context radius)
    atom_types=None,   # atom vocabulary (default: Br, C, Cl, F, I, N, O, P, S)
    seed=None,         # random seed for reproducible codebook generation
    normalize=False,   # L2-normalize after each message-passing layer
    backend="auto",    # "auto" | "rust" | "numpy"
)
```

Molecules can be passed as SMILES strings, RDKit Mol objects, or lists of either.
- `encode(molecules) -> np.ndarray` – Encode molecules into order-N hypervector fingerprints. Returns shape `(batch_size, dimension)`.
- `encode_joint(molecules) -> np.ndarray` – Concatenation of order-0 (atom identity only, no structural context) and order-N (full message-passing) embeddings. Returns shape `(batch_size, 2 * dimension)`. Useful when you want both local atom-level and structural information in one vector.
- `save(path)` / `Encoder.load(path)` – Persist and restore an encoder (config + codebook) as a single `.npz` file. Useful for sharing a fixed fingerprint scheme or deploying without needing to track the seed.
```python
enc.save("encoder.npz")
loaded = Encoder.load("encoder.npz")
```

| Parameter | Guidance |
|---|---|
| `dimension` | 32-256 for Bayesian optimization. 1024-2048 as a starting point for property prediction. |
| `depth` | Controls structural context radius, analogous to Morgan radius. `depth=3` captures up to 3-bond neighborhoods. Higher values capture more global structure but increase computation. |
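The dimension guidance reflects a basic property of high-dimensional random vectors: unrelated hypervectors become closer to orthogonal as dimension grows, so larger `dimension` means less crosstalk between unrelated substructures. A quick NumPy check of this effect (illustrative only, not library code):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cos(d, trials=200):
    # Average |cosine| between pairs of independent random d-dim vectors
    a = rng.normal(size=(trials, d))
    b = rng.normal(size=(trials, d))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return np.abs(cos).mean()

for d in (32, 256, 2048):
    print(d, round(mean_abs_cos(d), 3))  # shrinks roughly like 1/sqrt(d)
```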
Each atom is described by 5 discrete features:
| Feature | Bins | Values |
|---|---|---|
| Atom type | `len(atom_types)` (varies with vocabulary) | Index into the atom vocabulary |
| Degree | 6 | 0-5 |
| Formal charge | 3 | neutral, positive, negative |
| Total Hs | 4 | 0-3 |
| Is aromatic | 2 | 0, 1 |
- No bond type features – bonds are treated as unweighted edges. Single, double, and aromatic bonds are not distinguished in the current feature scheme.
- No stereochemistry – chirality and cis/trans isomerism are not encoded.
- No GPU acceleration – encoding is CPU-only (NumPy or optional Rust extension).
- Codebook scales with vocabulary – the codebook has `product(feature_bins)` entries (1296 for the default 9 atom types). Large custom atom type lists will increase memory usage.
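The codebook-size arithmetic from the last point follows directly from the feature bins listed above (a quick check, not library code):

```python
import math

atom_types = ["Br", "C", "Cl", "F", "I", "N", "O", "P", "S"]  # default vocabulary
# Bins: atom type, degree, formal charge, total Hs, is aromatic
feature_bins = [len(atom_types), 6, 3, 4, 2]
print(math.prod(feature_bins))  # 1296 codebook entries for the defaults
```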
Requires Python 3.9+ and a Rust toolchain (1.83+).
```bash
git clone https://github.com/the16thpythonist/hyper-fingerprints.git
cd hyper-fingerprints

# Install maturin (builds the Rust extension)
pip install maturin

# Build and install in development mode (editable, release-optimized)
RUSTFLAGS="-C target-cpu=native" maturin develop --release

# Verify the Rust backend is available
python -c "from hyper_fingerprints._core import encode_batch_rs; print('Rust OK')"
```

```bash
# Build the wheel first
./build.sh

# Install the wheel
pip install target/wheels/hyper_fingerprints-*.whl
```

- numpy >= 1.24
- rdkit >= 2024.0.0
- Rust toolchain >= 1.83 (build-time only)
The Rust extension accelerates both SMILES parsing/feature extraction (~23x) and the message-passing pipeline (~22x), for a combined ~22x end-to-end speedup. When installed, it is used automatically:
```python
enc = Encoder(dimension=512, seed=42, backend="rust")   # require Rust
enc = Encoder(dimension=512, seed=42, backend="numpy")  # force pure Python
enc = Encoder(dimension=512, seed=42, backend="auto")   # default: Rust if available
```

Install in dev mode with the Rust extension:

```bash
pip install maturin
RUSTFLAGS="-C target-cpu=native" maturin develop --release
pip install -e ".[dev]"
```

Run tests:

```bash
pytest
```

Build a wheel and test it in a clean environment:

```bash
nox -s build_test
```

Run tests across Python 3.9-3.13 with nox:

```bash
nox -s tests
```

Fingerprint outputs are regression-tested against recorded fixtures to ensure numerical stability across releases.
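The fixture-based regression idea can be sketched generically: record fingerprints once, reload them later, and compare within a tight tolerance. The sketch below uses an in-memory `.npz` and random arrays as stand-ins for real fingerprints and an on-disk fixture file:

```python
import io
import numpy as np

rng = np.random.default_rng(42)
fps = rng.normal(size=(3, 512))   # stand-in for encoded fingerprints

buf = io.BytesIO()                # a real fixture would be a file on disk
np.savez(buf, fps=fps)            # record the fixture
buf.seek(0)
recorded = np.load(buf)["fps"]    # reload, as a later release's test would

assert np.allclose(fps, recorded, atol=1e-12)
print("fixture matches")
```

Storing fixtures as `.npz` keeps the exact float64 values, so any numerical drift between releases shows up as a tolerance failure rather than passing silently.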
This project builds on the theory of Holographic Reduced Representations and Vector Symbolic Architectures:
- Plate, T. A. (1995). Holographic Reduced Representations. IEEE Transactions on Neural Networks, 6(3), 623-641. doi:10.1109/72.377968
- Kanerva, P. (2009). Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors. Cognitive Computation, 1(2), 139-159. doi:10.1007/s12559-009-9009-8
This project is licensed under the MIT License.
