Implement a database abstraction layer that will enable the use of different DB backends.#282

Open
Davidyz wants to merge 96 commits into main from feat/db_layer

Davidyz (Owner) commented Sep 1, 2025

Part of #221.

This will most likely be incompatible with the existing configuration, in the sense that we'd need to follow similar patterns for embedding functions and rerankers. As a temporary solution, we could maybe add a function that transforms the old config to the new one internally.

I'm not committed to this implementation, but I need some hands-on experience to know what we'd need from the abstraction layer. If this works out, we could just go with this.
Having spent some time looking into langchain implementations, I found their approach a bit bloated for our simple RAG tool, which specialises in local files organised in directories (and makes extensive use of metadata). As such, I decided to follow this PR and implement my own database connector (mostly based on the chromadb API design), which we can then use to implement support for new databases.

Davidyz linked an issue Sep 1, 2025 that may be closed by this pull request
Davidyz force-pushed the feat/db_layer branch 4 times, most recently from 54787ef to c3b83f8, September 6, 2025 04:34
Davidyz force-pushed the feat/db_layer branch 4 times, most recently from 6f91093 to 7a432fc, September 16, 2025 09:43
Davidyz marked this pull request as ready for review September 16, 2025 10:05
Davidyz force-pushed the feat/db_layer branch 6 times, most recently from edd3382 to 21b820b, September 19, 2025 09:24
codecov bot commented Sep 19, 2025

Codecov Report

❌ Patch coverage is 99.80237% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.76%. Comparing base (171361e) to head (799e1fe).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/vectorcode/database/chroma.py | 99.09% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #282      +/-   ##
==========================================
+ Coverage   99.72%   99.76%   +0.03%     
==========================================
  Files          25       32       +7     
  Lines        1845     2099     +254     
==========================================
+ Hits         1840     2094     +254     
  Misses          5        5              

Davidyz (Owner, Author) commented Sep 27, 2025

To make it easy to configure database settings for all projects, I'm planning to modify the config file resolution so that project configs are merged with the global config. This means you only need to configure the db/embedding/reranker once, in the global config.
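
The merge described above could look roughly like the sketch below. It assumes configs are plain nested dicts and that project values take precedence over global ones; `merge_configs` and the keys in the example are hypothetical, not the actual VectorCode API.

```python
from copy import deepcopy
from typing import Any


def merge_configs(global_cfg: dict[str, Any], project_cfg: dict[str, Any]) -> dict[str, Any]:
    """Return a new config in which project values override global ones.

    Nested dicts are merged recursively; any other value is replaced.
    Neither input is modified.
    """
    merged = deepcopy(global_cfg)
    for key, value in project_cfg.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical keys, for illustration only.
global_cfg = {
    "db_type": "chroma",
    "db_params": {"url": "http://127.0.0.1:8000"},
    "chunk_size": 2500,
}
project_cfg = {"chunk_size": 1000}
merged = merge_configs(global_cfg, project_cfg)
```

With this shape, a project config only needs to list the settings that differ, and the database settings are inherited from the global file.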

Davidyz (Owner, Author) commented Oct 5, 2025

As a proof-of-concept, I'll try to get chromadb 1.x working as part of this PR. This is likely to introduce a packaging change. Specifically, the default chromadb version constraint will be <2.0.0, with an optional dependency group that pins it to ==0.6.3.

superbiche commented Feb 5, 2026

USearch Adapter Implementation - DBAL Validation & Benchmarks

I implemented a USearch + SQLite hybrid adapter using the DBAL interface from this PR to validate the abstraction layer design (with help from Claude). The implementation is available at superbiche/VectorCode@feat/usearch-adapter.

Benchmark Results

Large Codebase Benchmarks (vs ChromaDB)

| Codebase | Files | Chunks | Unfiltered | 10% Exclusion | 50% Exclusion |
|---|---|---|---|---|---|
| Linux kernel | 50k | 250k | 2.10x | 258.6x | 237.6x |
| VS Code | 7.9k | 39k | 2.71x | 69.5x | 50.9x |
| Kubernetes | 24k | 120k | 2.09x | 153.6x | 130.2x |

The dramatic speedup for filtered queries comes from a difference in approach: ChromaDB filters during HNSW traversal (expensive with large exclusion sets), while the USearch adapter over-fetches and then filters the results with a simple Python set lookup.
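
For readers unfamiliar with the over-fetch pattern, here is a minimal sketch. It is not the adapter's code: `nearest` is a brute-force stand-in for the real index search, and the function names and the `overfetch` parameter are assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest(index, query, count):
    """Stand-in for an ANN search: brute-force top-`count` (key, distance) pairs."""
    scored = [(key, cosine_distance(vec, query)) for key, vec in index.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:count]


def filtered_search(index, query, k, excluded, overfetch=2):
    """Over-fetch, then drop excluded keys with a set lookup.

    The fetch size doubles until k results survive the filter or the
    whole index has been fetched.
    """
    excluded = set(excluded)
    count = max(k * overfetch, 1)
    while True:
        hits = [(key, d) for key, d in nearest(index, query, count) if key not in excluded]
        if len(hits) >= k or count >= len(index):
            return hits[:k]
        count = min(count * 2, len(index))


index = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.7, 0.7]}
results = filtered_search(index, [1.0, 0.0], k=2, excluded={1})
result_keys = [key for key, _ in results]
```

The filter itself is a constant-time membership test per hit, so its cost depends on how many results are fetched rather than on the size of the exclusion set.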

DBAL Feedback

What works well:

  • Abstract base class design - clean separation, easy to extend
  • Config-driven initialization via `db_params`
  • Well-structured types (`QueryResult`, `CollectionInfo`, `VectoriseStats`)

Suggestions:

  1. Index deletion - USearch doesn't support removing individual vectors. Consider adding optional rebuild_index() or documenting this limitation for backends that share it.
  2. Score semantics - Document whether higher = better match (ChromaDB uses negative distances, USearch uses positive).
  3. Collection metadata - Consider documenting minimal required fields (path, embedding_function, created_by).
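
To make suggestions 1 and 2 concrete, here is one possible shape. The class and attribute names below are hypothetical and are not the DBAL interface from this PR; the toy backend simply tombstones deleted keys and drops them on rebuild.

```python
from abc import ABC, abstractmethod


class VectorBackend(ABC):
    """Hypothetical backend interface, for illustration only."""

    # Documented score convention: True means a larger score is a better match.
    higher_is_better: bool = False

    @abstractmethod
    def add(self, key: int, vector: list[float]) -> None: ...

    @abstractmethod
    def delete(self, key: int) -> None: ...

    def rebuild_index(self) -> None:
        """Optional hook; backends with native deletion can leave this as a no-op."""
        return None


class AppendOnlyBackend(VectorBackend):
    """Toy backend that cannot remove vectors in place, so it tombstones them."""

    def __init__(self) -> None:
        self._vectors: dict[int, list[float]] = {}
        self._deleted: set[int] = set()

    def add(self, key: int, vector: list[float]) -> None:
        self._vectors[key] = vector
        self._deleted.discard(key)

    def delete(self, key: int) -> None:
        self._deleted.add(key)  # Hidden from queries, but still stored.

    def live_keys(self) -> list[int]:
        return sorted(k for k in self._vectors if k not in self._deleted)

    def stored_count(self) -> int:
        return len(self._vectors)

    def rebuild_index(self) -> None:
        # Keep only the surviving vectors and clear the tombstones.
        self._vectors = {k: v for k, v in self._vectors.items() if k not in self._deleted}
        self._deleted.clear()


backend = AppendOnlyBackend()
backend.add(1, [0.1, 0.2])
backend.add(2, [0.3, 0.4])
backend.delete(1)
stored_before = backend.stored_count()
backend.rebuild_index()
stored_after = backend.stored_count()
```

Keeping `rebuild_index()` as a non-abstract method with a no-op default means existing connectors would not need to change.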

Architecture

USearch only stores vectors + integer keys, so I paired it with SQLite for metadata:

~/.local/share/vectorcode/usearch/<collection_id>/
├── index.usearch      # Vector index (HNSW)
└── metadata.db        # SQLite: chunks, paths, content_hash
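
As a rough illustration of the metadata side, the sketch below uses the standard-library `sqlite3` module. The table layout and function names are assumptions; only the stored fields (chunks, paths, content_hash) come from the description above.

```python
import sqlite3


def open_metadata(path=":memory:"):
    """Open (or create) the metadata database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS chunks (
            key INTEGER PRIMARY KEY,   -- same integer key as in the vector index
            path TEXT NOT NULL,
            content TEXT NOT NULL,
            content_hash TEXT NOT NULL
        )
        """
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_chunks_path ON chunks(path)")
    return conn


def add_chunk(conn, path, content, content_hash):
    """Insert a chunk and return the integer key to use in the vector index."""
    cur = conn.execute(
        "INSERT INTO chunks (path, content, content_hash) VALUES (?, ?, ?)",
        (path, content, content_hash),
    )
    conn.commit()
    return cur.lastrowid


def lookup(conn, keys):
    """Map vector-index keys back to (path, content) pairs."""
    keys = list(keys)
    if not keys:
        return {}
    placeholders = ",".join("?" for _ in keys)
    rows = conn.execute(
        f"SELECT key, path, content FROM chunks WHERE key IN ({placeholders})",
        keys,
    ).fetchall()
    return {key: (path, content) for key, path, content in rows}


conn = open_metadata()
key_a = add_chunk(conn, "src/a.py", "def a(): ...", "hash-a")
key_b = add_chunk(conn, "src/b.py", "def b(): ...", "hash-b")
found = lookup(conn, [key_b])
```

The vector index then only needs to return integer keys, and everything a query result displays is resolved through `lookup`.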

ChromaDB-Free Operation

To enable USearch to work completely independently of ChromaDB (avoiding version conflicts with Pydantic 2.x), I added:

  1. Lazy imports (f87da45) - ChromaDB modules are only imported when actually using ChromaDB connectors
  2. Standalone embedding functions (3f8fd3c) - Native implementations of OllamaEmbeddingFunction and SentenceTransformerEmbeddingFunction that don't require ChromaDB's embedding_functions module

This allows users to run USearch without ChromaDB installed at all, or with an incompatible ChromaDB version in their environment. The get_embedding_function() now:

  1. First tries standalone implementations (Ollama, SentenceTransformer)
  2. Falls back to ChromaDB's embedding_functions only if needed
  3. Gracefully handles ImportError when ChromaDB is unavailable/incompatible
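
The resolution order above can be sketched as follows. This is not the code from the linked commits: `resolve_embedding_function`, its parameters, and the module names in the example are hypothetical, and `math` only stands in for an installed fallback module.

```python
import importlib
import math


def _load_attr(module_name, attr_name):
    """Import lazily, so a missing optional dependency only fails here."""
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def resolve_embedding_function(name, standalone, fallback_modules):
    """Return the callable registered under `name`.

    `standalone` maps names to ready callables. `fallback_modules` lists
    module paths that are imported only if no standalone entry matches.
    """
    if name in standalone:
        return standalone[name]
    for module_name in fallback_modules:
        try:
            return _load_attr(module_name, name)
        except (ImportError, AttributeError):
            continue  # Module missing, or it does not provide this name.
    raise ValueError(f"No embedding function named {name!r} is available")


# Toy registry: a "standalone" embedder that needs no third-party package.
standalone = {"ToyEmbedding": lambda texts: [[float(len(t))] for t in texts]}

embed = resolve_embedding_function("ToyEmbedding", standalone, ["not_installed_backend"])
vectors = embed(["abc", "de"])

# The missing module is skipped and the next fallback is tried.
fallback = resolve_embedding_function("sqrt", {}, ["not_installed_backend", "math"])
```

Because the fallback import happens inside the function, importing the module that defines it never pulls in the optional dependency.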

Happy to submit the adapter as a follow-up PR once this merges.

Development

Successfully merging this pull request may close these issues.

[FEAT]: Enable the use of multiple DB types

2 participants