From a4d78ce6f489eedb39a1cf691ec4af2b0b2eb05d Mon Sep 17 00:00:00 2001 From: Dylan Couzon Date: Fri, 15 May 2026 22:07:12 -0400 Subject: [PATCH] add tutorial --- fusion-methods/Choosing_a_Fusion_Method.ipynb | 1028 +++++++++++++++++ 1 file changed, 1028 insertions(+) create mode 100644 fusion-methods/Choosing_a_Fusion_Method.ipynb diff --git a/fusion-methods/Choosing_a_Fusion_Method.ipynb b/fusion-methods/Choosing_a_Fusion_Method.ipynb new file mode 100644 index 0000000..f44b4d0 --- /dev/null +++ b/fusion-methods/Choosing_a_Fusion_Method.ipynb @@ -0,0 +1,1028 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cell-00-title", + "metadata": {}, + "source": [ + "# Choosing a Fusion Method: RRF, DBSF, and FormulaQuery for Hybrid Search\n", + "\n", + "| Time: 45 min | Level: Intermediate | Requires: Qdrant 1.17.x or later |\n", + "| --- | --- | --- |\n", + "\n", + "Hybrid search combines two or more retrievers (typically dense + sparse) into a single ranking. Qdrant ships **RRF** and **DBSF** for fusion, plus **`FormulaQuery`** for ranking logic on top (recency, boosts, geo). This notebook compares the three on [BEIR/SciFact](https://huggingface.co/datasets/BeIR/scifact) (5,183 documents, 300 queries with relevance labels) and walks through tuning weighted RRF when you have an eval set.\n", + "\n", + "Everything runs against a [Qdrant Cloud Free Tier Cluster](https://qdrant.tech/documentation/cloud/create-cluster/#free-clusters) with [Cloud Inference](https://qdrant.tech/documentation/cloud/inference/) for both dense and BM25 embeddings, so no local model downloads are required." 
+ ] + }, + { + "cell_type": "markdown", + "id": "cell-01-setup-head", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell-02-install", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -q \"qdrant-client>=1.17.0\" datasets ranx pandas numpy" + ] + }, + { + "cell_type": "markdown", + "id": "cell-03-cloud-aside", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell-04-imports", + "metadata": {}, + "outputs": [], + "source": "import random\nfrom collections import defaultdict\n\nimport numpy as np\nimport pandas as pd\nfrom datasets import load_dataset\nfrom google.colab import userdata\nfrom qdrant_client import QdrantClient, models\nfrom ranx import Qrels, Run, evaluate\n\n# Set QDRANT_URL and QDRANT_API_KEY as Colab secrets (left sidebar, key icon).\n# Get the values from https://cloud.qdrant.io\nQDRANT_URL = userdata.get(\"QDRANT_URL\")\nQDRANT_API_KEY = userdata.get(\"QDRANT_API_KEY\")\n\nclient = QdrantClient(\n url=QDRANT_URL,\n api_key=QDRANT_API_KEY,\n cloud_inference=True,\n)\n\nCOLLECTION = \"scifact_fusion\"\nDENSE_MODEL = \"sentence-transformers/all-minilm-l6-v2\"\nBM25_MODEL = \"qdrant/bm25\"\nDENSE_DIM = 384\nBM25_AVG_LEN = 220 # SciFact abstracts average roughly 220 words; tune to your corpus\n\nrandom.seed(42)\nnp.random.seed(42)" + }, + { + "cell_type": "markdown", + "id": "cell-05-load-head", + "metadata": {}, + "source": [ + "## Load BEIR/SciFact\n", + "\n", + "SciFact is a small fact-checking corpus. We use the BEIR `test` split, which provides 300 queries each paired with their known-relevant documents." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell-06-load", + "metadata": {}, + "outputs": [], + "source": [ + "corpus_ds = load_dataset(\"BeIR/scifact\", \"corpus\", split=\"corpus\")\n", + "queries_ds = load_dataset(\"BeIR/scifact\", \"queries\", split=\"queries\")\n", + "qrels_ds = load_dataset(\"BeIR/scifact-qrels\", split=\"test\")\n", + "\n", + "point_to_id = {i: doc[\"_id\"] for i, doc in enumerate(corpus_ds)}\n", + "\n", + "query_ids_with_qrels = {str(row[\"query-id\"]) for row in qrels_ds}\n", + "eval_queries = [q for q in queries_ds if q[\"_id\"] in query_ids_with_qrels]\n", + "\n", + "qrels_dict = defaultdict(dict)\n", + "for row in qrels_ds:\n", + " qrels_dict[str(row[\"query-id\"])][str(row[\"corpus-id\"])] = int(row[\"score\"])\n", + "\n", + "print(f\"Corpus: {len(corpus_ds)} documents\")\n", + "print(f\"Eval queries: {len(eval_queries)}\")\n", + "print(f\"Known-relevant (query, doc) pairs: {sum(len(v) for v in qrels_dict.values())}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell-07-schema-head", + "metadata": {}, + "source": "## Create the Collection\n\nTwo retrievers stored as named vectors on the same point:\n\n- **`dense`**: `all-minilm-l6-v2` (384 dimensions, cosine).\n- **`bm25`**: sparse, with `Modifier.IDF` so Qdrant computes inverse document frequency from the indexed corpus. Without this modifier, scores would be term-frequency-only rather than full BM25." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell-08-schema", + "metadata": {}, + "outputs": [], + "source": [ + "if client.collection_exists(COLLECTION):\n", + " client.delete_collection(COLLECTION)\n", + "\n", + "client.create_collection(\n", + " collection_name=COLLECTION,\n", + " vectors_config={\n", + " \"dense\": models.VectorParams(size=DENSE_DIM, distance=models.Distance.COSINE),\n", + " },\n", + " sparse_vectors_config={\n", + " \"bm25\": models.SparseVectorParams(modifier=models.Modifier.IDF),\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "cell-09-index-head", + "metadata": {}, + "source": [ + "## Index the Corpus\n", + "\n", + "We send raw text in both vector slots and let Cloud Inference produce the embeddings server-side." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cell-10-upload", + "metadata": {}, + "outputs": [], + "source": [ + "points = []\n", + "for i, doc in enumerate(corpus_ds):\n", + " text = ((doc.get(\"title\") or \"\") + \" \" + (doc.get(\"text\") or \"\")).strip()\n", + " points.append(\n", + " models.PointStruct(\n", + " id=i,\n", + " vector={\n", + " \"dense\": models.Document(text=text, model=DENSE_MODEL),\n", + " \"bm25\": models.Document(\n", + " text=text,\n", + " model=BM25_MODEL,\n", + " # avg_len defaults to 256 (document-length). SciFact abstracts are\n", + " # ~220 words, so we tune it to the field's actual average. 
Important\n", + " # for BM25 ranking quality on short fields.\n", + " options={\"avg_len\": BM25_AVG_LEN},\n", + " ),\n", + " },\n", + " payload={\"doc_id\": doc[\"_id\"], \"title\": doc.get(\"title\", \"\"), \"text\": doc.get(\"text\", \"\")},\n", + " )\n", + " )\n", + "\n", + "client.upload_points(\n", + " collection_name=COLLECTION,\n", + " points=points,\n", + " batch_size=256,\n", + ")\n", + "print(f\"Indexed {len(points)} points\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell-11-problem-head", + "metadata": {}, + "source": [ + "## 1. Combining Two Ranked Lists\n", + "\n", + "Hybrid search returns two ranked lists per query. The first instinct is to combine them with a tunable weight like `alpha * dense + (1 - alpha) * sparse`. Let's look at the raw scores for one query to see how that combination behaves in practice." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "cell-12-distributions", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: '0-dimensional biomaterials show inductive properties.'\n", + "\n", + "Dense top scores: [0.354, 0.331, 0.292, 0.291, 0.29, 0.282, 0.279, 0.271, 0.266, 0.26]\n", + "BM25 top scores: [10.949, 10.814, 10.734, 10.469, 10.233, 10.138, 9.878, 9.643, 9.233, 9.076]\n" + ] + } + ], + "source": [ + "demo_query = eval_queries[0][\"text\"]\n", + "print(f\"Query: {demo_query!r}\\n\")\n", + "\n", + "dense_hits = client.query_points(\n", + " collection_name=COLLECTION,\n", + " query=models.Document(text=demo_query, model=DENSE_MODEL),\n", + " using=\"dense\",\n", + " limit=20,\n", + " with_payload=False,\n", + ").points\n", + "\n", + "sparse_hits = client.query_points(\n", + " collection_name=COLLECTION,\n", + " query=models.Document(text=demo_query, model=BM25_MODEL, options={\"avg_len\": BM25_AVG_LEN}),\n", + " using=\"bm25\",\n", + " limit=20,\n", + " with_payload=False,\n", + ").points\n", + "\n", + "print(\"Dense top scores:\", [round(h.score, 3) 
for h in dense_hits[:10]])\n", + "print(\"BM25 top scores: \", [round(h.score, 3) for h in sparse_hits[:10]])" + ] + }, + { + "cell_type": "markdown", + "id": "cell-13-problem-explanation", + "metadata": {}, + "source": "The dense scores are bounded (cosine similarity, at most 1; roughly 0.26 to 0.35 for this query) and the BM25 scores are unbounded positives (roughly 9 to 11 here, and anywhere from 2 to 20+ depending on rare-term content). If you compute `0.5 * dense + 0.5 * sparse`, BM25 dominates by more than an order of magnitude. Push `alpha` all the way to 0 or 1 and you turn off one retriever entirely, which is selection, not fusion. The BM25 scale also shifts per query, so a fixed alpha that performs well on one query distribution may not transfer cleanly to another without first normalizing the scores.\n\nRRF and DBSF take different routes out of this. `FormulaQuery` sits in a different category (a ranking-logic layer on top of fusion); we'll cover it later in the FormulaQuery section." + }, + { + "cell_type": "markdown", + "id": "cell-15-rrf-head", + "metadata": {}, + "source": "## 2. Reciprocal Rank Fusion (RRF)\n\nRRF sidesteps the score-scale problem by discarding scores and using only ranks. Each document's score becomes:\n\n$$ \\text{rrf}(d) = \\sum_i \\frac{1}{k + r_i(d)} $$\n\nwhere $r_i(d)$ is the document's rank in retriever $i$ and $k$ is a smoothing constant (default 2 in Qdrant; classic literature uses 60). Ranks are on the same scale by construction, so combining them is well-defined.\n\n**RRF: things to know**\n\n- Ranks ignore score magnitudes: a doc that's 10x better than its neighbor by raw score ranks the same as one that barely edges it out. In practice the loss is small, which is why RRF works well as a default.\n- `k` defaults to 2 in Qdrant, but is tunable via `models.RrfQuery(rrf=models.Rrf(k=...))` (since v1.16).
Smaller `k` sharpens the rank-1 advantage; larger `k` (the classic literature default of 60) smooths it.\n- An eval set is optional with the default `k=2`, but worth having if you change `k`.\n\nBefore the first metric, a brief note on the eval setup: every method in this notebook is evaluated on the same 300 SciFact queries, so the side-by-side table is directly comparable. For notebook simplicity we tune and report on the same queries; in production you would hold out a val split. We flag the best-practice version again in the next section when we tune weighted-RRF weights." + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "cell-16-helpers", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Eval queries: 300\n" + ] + } + ], + "source": [ + "TOP_K = 10\n", + "PREFETCH_LIMIT = 100\n", + "\n", + "eval_qrels = Qrels({q[\"_id\"]: qrels_dict[q[\"_id\"]] for q in eval_queries if q[\"_id\"] in qrels_dict})\n", + "\n", + "print(f\"Eval queries: {len(eval_queries)}\")\n", + "\n", + "METRICS = [\"ndcg@10\", \"recall@100\", \"mrr@10\"]\n", + "\n", + "\n", + "def run_fusion(query_text, fusion_query, top_k=TOP_K):\n", + " return client.query_points(\n", + " collection_name=COLLECTION,\n", + " prefetch=[\n", + " models.Prefetch(\n", + " query=models.Document(text=query_text, model=DENSE_MODEL),\n", + " using=\"dense\",\n", + " limit=PREFETCH_LIMIT,\n", + " ),\n", + " models.Prefetch(\n", + " query=models.Document(text=query_text, model=BM25_MODEL, options={\"avg_len\": BM25_AVG_LEN}),\n", + " using=\"bm25\",\n", + " limit=PREFETCH_LIMIT,\n", + " ),\n", + " ],\n", + " query=fusion_query,\n", + " limit=top_k,\n", + " with_payload=[\"doc_id\"],\n", + " ).points\n", + "\n", + "\n", + "def run_single(query_text, using, top_k=TOP_K):\n", + " if using == \"dense\":\n", + " query = models.Document(text=query_text, model=DENSE_MODEL)\n", + " else:\n", + " query = models.Document(text=query_text, model=BM25_MODEL, 
options={\"avg_len\": BM25_AVG_LEN})\n", + " return client.query_points(\n", + " collection_name=COLLECTION,\n", + " query=query,\n", + " using=using,\n", + " limit=top_k,\n", + " with_payload=[\"doc_id\"],\n", + " ).points\n", + "\n", + "\n", + "def build_run(name, query_fn):\n", + " \"\"\"Build a ranx Run over the eval queries.\"\"\"\n", + " run_dict = {}\n", + " for q in eval_queries:\n", + " hits = query_fn(q[\"text\"])\n", + " run_dict[q[\"_id\"]] = {point_to_id[h.id]: float(h.score) for h in hits}\n", + " return Run(run_dict, name=name)\n", + "\n", + "\n", + "results_table = {}" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "cell-17-rrf-eval", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " ndcg@10 recall@100 mrr@10\n", + "dense 0.654084 0.931667 0.607063\n", + "bm25 0.683446 0.924667 0.647511\n", + "rrf 0.723467 0.958333 0.681292" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dense_only = build_run(\"dense\", lambda q: run_single(q, \"dense\", top_k=100))\n", + "sparse_only = build_run(\"bm25\", lambda q: run_single(q, \"bm25\", top_k=100))\n", + "\n", + "rrf_run = build_run(\n", + " \"rrf\",\n", + " lambda q: run_fusion(q, models.RrfQuery(rrf=models.Rrf()), top_k=100),\n", + ")\n", + "\n", + "for run in (dense_only, sparse_only, rrf_run):\n", + " results_table[run.name] = {m: evaluate(eval_qrels, run, m) for m in METRICS}\n", + "\n", + "pd.DataFrame(results_table).T" + ] + }, + { + "cell_type": "markdown", + "id": "cell-17b-rrf-metrics-explainer", + "metadata": {}, + "source": "**What these metrics measure**\n\n- **nDCG@10**: rewards relevant docs near the top with logarithmic position discount. The most informative single number for ranking quality.\n- **Recall@100**: did the prefetch find every relevant doc? Ranking doesn't matter, only inclusion. A reranker downstream can fix top-10 order, but it can't recover docs lost to the prefetch ceiling.\n- **MRR@10**: position of the first relevant doc. Best for navigational or single-answer tasks where the top result is what counts.\n\n**Reference numbers on SciFact:**\n\n| Method | nDCG@10 | Recall@100 | MRR@10 |\n| --- | --- | --- | --- |\n| dense | 0.654 | 0.932 | 0.607 |\n| bm25 | 0.683 | 0.925 | 0.648 |\n| rrf | 0.723 | 0.958 | 0.681 |\n\nRRF beats both single-retriever baselines on every metric: +0.04 nDCG@10 over BM25, +0.03 Recall@100, +0.03 MRR@10. If your row lands materially below either baseline, the usual suspects are a too-small `PREFETCH_LIMIT`, an `avg_len` mismatch on BM25, or low diversity between retrievers (both finding the same docs)." 
+ }, + { + "cell_type": "markdown", + "id": "cell-18-rrf-commentary", + "metadata": {}, + "source": "## 3. Weighted RRF and How to Pick Weights\n\nDefault RRF treats both retrievers equally. In real applications one retriever is usually stronger (often dense on natural language, BM25 on identifier-heavy queries). Qdrant's `RrfQuery` takes a `weights` array. The question is how to pick the numbers.\n\nA grid search is a simple and common approach. Here's a reusable helper that searches over a list of weight tuples and scores each on the eval set. For notebook simplicity we tune and report on the same queries; we'll flag the best-practice change for production data after the sweep.\n\n**Weighted RRF: when to consider it**\n\n- Helpful when you have an eval set (queries paired with known-relevant docs) and the retrievers differ in strength. With a small set (under ~50 queries) or matched retrievers, the default `(1.0, 1.0)` is usually fine.\n- Worth retuning when retrievers, corpus, or chunking change substantially." + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "cell-19-tune-helper", + "metadata": {}, + "outputs": [], + "source": [ + "def tune_rrf_weights(\n", + " eval_queries,\n", + " qrels_dict,\n", + " fusion_runner,\n", + " weight_grid,\n", + " metric=\"ndcg@10\",\n", + "):\n", + " \"\"\"Grid-search weighted RRF weights on the eval set.\n", + "\n", + " Note: this tunes and reports on the same queries for notebook simplicity.\n", + " In production, hold out a val split: pick weights on train, report metrics\n", + " on val. See the best-practice callout after the sweep table.\n", + "\n", + " fusion_runner: callable (query_text, weights) -> list of hits with .id and .score\n", + " weight_grid: list of weight tuples, e.g. 
[(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]\n", + " Returns: dict with best_weights, sweep_table (DataFrame), best_score.\n", + " \"\"\"\n", + " eval_qrels = Qrels({q[\"_id\"]: qrels_dict[q[\"_id\"]] for q in eval_queries if q[\"_id\"] in qrels_dict})\n", + "\n", + " rows = []\n", + " for weights in weight_grid:\n", + " run = {q[\"_id\"]: {point_to_id[h.id]: float(h.score) for h in fusion_runner(q[\"text\"], weights)} for q in eval_queries}\n", + " score = evaluate(eval_qrels, Run(run, name=f\"w={weights}\"), metric)\n", + " rows.append({\"weights\": weights, metric: score})\n", + "\n", + " sweep = pd.DataFrame(rows).sort_values(metric, ascending=False).reset_index(drop=True)\n", + " best = sweep.iloc[0]\n", + " return {\n", + " \"best_weights\": best[\"weights\"],\n", + " \"sweep_table\": sweep,\n", + " \"best_score\": best[metric],\n", + " }" + ] + }, + { + "cell_type": "markdown", + "id": "cell-19b-tuning-heads-up", + "metadata": {}, + "source": "### Running the grid search\n\nThe next cell calls `tune_rrf_weights` with 8 weight tuples and scores each on the full eval set, then reports the best weights and their nDCG@10.\n\n> **Heads up: this cell takes ~10-30 minutes on Cloud Inference free tier.** The grid search fires roughly 2,400 `query_points` calls (8 weight tuples × 300 queries, each generating two server-side embeddings).\n>\n> If you haven't changed anything earlier in the notebook, the saved outputs from a previous run are already populated. You can read along without rerunning." + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "cell-20-run-tuning", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Best weights (dense, bm25): (1.0, 2.0)\n", + "Best nDCG@10 on eval set: 0.7264\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " weights ndcg@10\n", + "0 (1.0, 2.0) 0.726411\n", + "1 (1.0, 1.0) 0.721844\n", + "2 (1.0, 3.0) 0.721009\n", + "3 (1.0, 1.5) 0.720400\n", + "4 (1.5, 1.0) 0.717578\n", + "5 (2.0, 1.0) 0.713063\n", + "6 (5.0, 1.0) 0.706649\n", + "7 (3.0, 1.0) 0.703982" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def run_weighted_rrf(query_text, weights, top_k=100):\n", + " return run_fusion(\n", + " query_text,\n", + " models.RrfQuery(rrf=models.Rrf(weights=list(weights))),\n", + " top_k=top_k,\n", + " )\n", + "\n", + "weight_grid = [\n", + " (1.0, 3.0), (1.0, 2.0), (1.0, 1.5), (1.0, 1.0),\n", + " (1.5, 1.0), (2.0, 1.0), (3.0, 1.0), (5.0, 1.0),\n", + "]\n", + "\n", + "tuning = tune_rrf_weights(\n", + " eval_queries=eval_queries,\n", + " qrels_dict=qrels_dict,\n", + " fusion_runner=run_weighted_rrf,\n", + " weight_grid=weight_grid,\n", + ")\n", + "\n", + "print(f\"Best weights (dense, bm25): {tuning['best_weights']}\")\n", + "print(f\"Best nDCG@10 on eval set: {tuning['best_score']:.4f}\")\n", + "tuning[\"sweep_table\"]" + ] + }, + { + "cell_type": "markdown", + "id": "cell-21-sweep-reading", + "metadata": {}, + "source": "On this SciFact run, the search picked `(1.0, 2.0)` as the best weights, with nDCG@10 = 0.7264. The default `(1.0, 1.0)` scored 0.7218, a 0.0046 gap. Every weight tuple in the grid landed within 0.025 nDCG of the top, so the metric surface is fairly flat on this dataset and retriever pair.\n\n**A note on noise.** When we lock in `(1.0, 2.0)` and re-evaluate it in the next cell, the reported score will likely land closer to 0.721 than the 0.726 the sweep saw. Qdrant's HNSW retrieval is approximate. The gap between the sweep result and the re-evaluation will be roughly 0.005 nDCG, the same scale as the tuning lift itself. 
Trust the loop to surface a reasonable weight pair; don't read too much into small differences in the best-score number.\n\n**Tuning typically helps more when retrievers differ in strength** (dense on natural language, BM25 on identifier-heavy queries). On SciFact with this retriever pair, dense and BM25 are well-matched, so tuning here is effectively a no-op.\n\nFor production, retune when your retrievers change (new model, new chunking), when your corpus drifts substantially, or roughly every quarter on a fresh eval sample.\n\n> **Best practice for real data: hold out a val split.** This notebook tunes and reports on the same set of queries for simplicity. In production:\n>\n> - Split your queries into train (~70%) and val (~30%) before tuning.\n> - Search the weight space on the **train** queries (these decide the winner).\n> - Report metrics from the **held-out val** queries (these are what you can claim).\n>\n> Reporting on the same queries you tune on inflates the metric, because you literally picked the weights that maximize that score on those queries.\n\nNow lock in the best weights and add the row to the comparison table:" + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "cell-22-weighted-rrf-eval", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " ndcg@10 recall@100 mrr@10\n", + "dense 0.654084 0.931667 0.607063\n", + "bm25 0.683446 0.924667 0.647511\n", + "rrf 0.723467 0.958333 0.681292\n", + "weighted_rrf(1.0, 2.0) 0.721388 0.950333 0.679183" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "best_w = tuning[\"best_weights\"]\n", + "weighted_rrf_run = build_run(\n", + " f\"weighted_rrf{best_w}\",\n", + " lambda q: run_weighted_rrf(q, best_w, top_k=100),\n", + ")\n", + "results_table[weighted_rrf_run.name] = {m: evaluate(eval_qrels, weighted_rrf_run, m) for m in METRICS}\n", + "pd.DataFrame(results_table).T" + ] + }, + { + "cell_type": "markdown", + "id": "cell-23-dbsf-head", + "metadata": {}, + "source": [ + "## 4. Distribution-Based Score Fusion (DBSF)\n", + "\n", + "DBSF keeps the raw scores but normalizes their distributions before combining. For each retriever's returned set, Qdrant computes the mean $\\mu$ and sample standard deviation $\\sigma$, then linearly remaps every score using the 3-sigma extremes as endpoints:\n", + "\n", + "$$ \\hat{s} = \\frac{s - (\\mu - 3\\sigma)}{6\\sigma} $$\n", + "\n", + "Normalized scores are summed across retrievers. Different score magnitudes no longer drive the combination because each retriever contributes on the same comparable range.\n", + "\n", + "**Edge cases:** scores are **not** clipped to [0, 1]; values outside the 3-sigma range remain outside it. If all returned scores are identical (or only one point is returned), DBSF emits `0.5` rather than dividing by zero.\n", + "\n", + "**DBSF: when to consider it**\n", + "\n", + "- Works well when you trust your retrievers' raw score magnitudes to carry signal (well-calibrated dense + BM25 with corpus IDF is typical). 
Less reliable on retrievers with heavy-tailed score distributions.\n", + "- Normalization uses the prefetch top-k as its sample, so a small `PREFETCH_LIMIT` or a query with a dominant outlier can produce unstable rankings.\n", + "- No hyperparameters to tune; an eval set helps confirm whether DBSF outperforms RRF on your corpus.\n", + "\n", + "Run DBSF on the eval set and add the row to the comparison table:" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "cell-24-dbsf-eval", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " ndcg@10 recall@100 mrr@10\n", + "dense 0.654084 0.931667 0.607063\n", + "bm25 0.683446 0.924667 0.647511\n", + "rrf 0.723467 0.958333 0.681292\n", + "weighted_rrf(1.0, 2.0) 0.721388 0.950333 0.679183\n", + "dbsf 0.736025 0.958333 0.702729" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dbsf_run = build_run(\n", + " \"dbsf\",\n", + " lambda q: run_fusion(q, models.FusionQuery(fusion=models.Fusion.DBSF), top_k=100),\n", + ")\n", + "results_table[\"dbsf\"] = {m: evaluate(eval_qrels, dbsf_run, m) for m in METRICS}\n", + "pd.DataFrame(results_table).T" + ] + }, + { + "cell_type": "markdown", + "id": "cell-25-side-by-side-head", + "metadata": {}, + "source": "DBSF beat RRF on this run by +0.013 nDCG@10 (0.736 vs 0.723) and +0.021 MRR@10 (0.7027 vs 0.6813). Recall@100 ties at 0.958 for DBSF and plain RRF; weighted RRF trails marginally at 0.950.\n\n## 5. Side-by-Side\n\nAll methods on the same metrics, sorted by nDCG@10:" + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "cell-26-summary", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " ndcg@10 recall@100 mrr@10\n", + "dbsf 0.736025 0.958333 0.702729\n", + "rrf 0.723467 0.958333 0.681292\n", + "weighted_rrf(1.0, 2.0) 0.721388 0.950333 0.679183\n", + "bm25 0.683446 0.924667 0.647511\n", + "dense 0.654084 0.931667 0.607063" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "summary = pd.DataFrame(results_table).T.sort_values(\"ndcg@10\", ascending=False)\n", + "summary" + ] + }, + { + "cell_type": "markdown", + "id": "cell-27-side-by-side-reading", + "metadata": {}, + "source": "On this SciFact run, DBSF leads at 0.736 nDCG@10, RRF follows at 0.723, weighted RRF with the tuned weights `(1.0, 2.0)` lands at 0.721, and BM25 (0.683) and dense (0.654) bring up the rear. Recall@100 ties at 0.958 for DBSF and RRF, with weighted RRF marginally behind at 0.950. DBSF was highest on this run for both ranking metrics (nDCG and MRR).\n\nThe right method for your application depends on whether your retrievers' score magnitudes carry signal (DBSF), whether you have an eval set to tune on (weighted RRF), or whether you want a strong default (RRF). On SciFact with these retrievers, DBSF came out ahead. On a different corpus or retriever pair the order can flip, so re-run this notebook on your own data before committing.\n\nFor a deeper breakdown of when to prefer each, see the [FAQ on RRF vs. DBSF](/documentation/faq/qdrant-fundamentals/#when-should-i-use-reciprocal-rank-fusion-rrf-vs-distribution-based-score-fusion-dbsf-for-hybrid-search).\n\n## 6. `FormulaQuery`: Fusion Plus Business Logic\n\n`FormulaQuery` layers ranking logic on top of a fused result. The pattern is:\n\n- Inner prefetch fuses dense + sparse with RRF (or DBSF).\n- Outer `FormulaQuery` applies recency decay, popularity boosts, or category multipliers using the fused `$score` and payload fields.\n\nIt's not a way to write tuned-alpha fusion.
Writing `0.7 * $score[0] + 0.3 * $score[1]` over raw retriever scores reintroduces the same scale problem that breaks naive linear fusion. If the prefetches are themselves `rrf` or `dbsf`, the scores are already on comparable scales and a weighted formula sum works.\n\n**FormulaQuery: when to consider it**\n\n- Decay coefficients need to be calibrated against the fused score scale. Small RRF scores plus an unweighted decay term that returns `[0, 1]` will let recency crowd out relevance.\n- Multi-knob tuning is brittle without an eval set; single well-understood adjustments are easier to defend.\n- Higher per-query latency than plain fusion because the engine evaluates the expression for each candidate.\n\nLet's demonstrate recency decay. SciFact has no timestamps, so we'll inject a synthetic `published_at` per document. This section teaches the mechanics, not a metric." + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "cell-28-synthetic-dates", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "UpdateResult(operation_id=20761, status=)" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from datetime import datetime, timedelta\n", + "\n", + "rng = random.Random(0)\n", + "now = datetime(2026, 5, 14)\n", + "\n", + "operations = []\n", + "for i in range(len(corpus_ds)):\n", + " days_ago = rng.randint(0, 365 * 3)\n", + " operations.append(\n", + " models.SetPayloadOperation(\n", + " set_payload=models.SetPayload(\n", + " points=[i],\n", + " payload={\"published_at\": (now - timedelta(days=days_ago)).isoformat() + \"Z\"},\n", + " )\n", + " )\n", + " )\n", + "\n", + "for start in range(0, len(operations), 500):\n", + " client.batch_update_points(collection_name=COLLECTION, update_operations=operations[start:start + 500])\n", + "\n", + "client.create_payload_index(\n", + " collection_name=COLLECTION,\n", + " field_name=\"published_at\",\n", + " 
field_schema=models.PayloadSchemaType.DATETIME,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "cell-29-formula-demo", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--- RRF only ---\n", + " 4132 score=0.5000 \n", + " 4399 score=0.5000 \n", + " 766 score=0.3333 \n", + " 2891 score=0.3333 \n", + " 4241 score=0.2500 \n", + "\n", + "--- RRF + 0.1 * exp_decay on published_at (scale=180d, midpoint=0.5) ---\n", + " 4399 score=0.5209 2025-04-02T00:00:00Z The first gene of the Bacillus subtilis clpC opero\n", + " 4132 score=0.5153 2025-01-12T00:00:00Z Complex Tissue and Disease Modeling using hiPSCs.\n", + " 766 score=0.3395 2024-05-21T00:00:00Z Nonlinear Elasticity in Biological Gels\n", + " 2891 score=0.3374 2024-02-05T00:00:00Z The Epithelial-Mesenchymal Transition Generates Ce\n", + " 4241 score=0.2976 2025-11-02T00:00:00Z New opportunities: the use of nanotechnologies to \n" + ] + } + ], + "source": [ + "demo_query = eval_queries[0][\"text\"]\n", + "\n", + "fused_only = run_fusion(demo_query, models.RrfQuery(rrf=models.Rrf()), top_k=5)\n", + "\n", + "fused_then_decay = client.query_points(\n", + " collection_name=COLLECTION,\n", + " prefetch=models.Prefetch(\n", + " prefetch=[\n", + " models.Prefetch(\n", + " query=models.Document(text=demo_query, model=DENSE_MODEL),\n", + " using=\"dense\",\n", + " limit=PREFETCH_LIMIT,\n", + " ),\n", + " models.Prefetch(\n", + " query=models.Document(text=demo_query, model=BM25_MODEL, options={\"avg_len\": BM25_AVG_LEN}),\n", + " using=\"bm25\",\n", + " limit=PREFETCH_LIMIT,\n", + " ),\n", + " ],\n", + " query=models.RrfQuery(rrf=models.Rrf()),\n", + " limit=PREFETCH_LIMIT,\n", + " ),\n", + " query=models.FormulaQuery(\n", + " formula=models.SumExpression(\n", + " sum=[\n", + " models.MultExpression(mult=[1.0, \"$score\"]),\n", + " # Decay weight 0.1: RRF scores in the top-5 are roughly 0.2 to 0.5, while\n", + " # un-weighted decay returns [0, 1]. 
Without this coefficient the decay\n", + " # dominates the sum and recency replaces relevance instead of nudging it.\n", + " models.MultExpression(\n", + " mult=[\n", + " 0.1,\n", + " models.ExpDecayExpression(\n", + " exp_decay=models.DecayParamsExpression(\n", + " x=models.DatetimeKeyExpression(datetime_key=\"published_at\"),\n", + " target=models.DatetimeExpression(datetime=now.isoformat() + \"Z\"),\n", + " scale=86400 * 180, # 180 days, expressed in seconds\n", + " midpoint=0.5,\n", + " )\n", + " ),\n", + " ]\n", + " ),\n", + " ]\n", + " )\n", + " ),\n", + " limit=5,\n", + " with_payload=[\"doc_id\", \"published_at\", \"title\"],\n", + ").points\n", + "\n", + "print(\"--- RRF only ---\")\n", + "for h in fused_only:\n", + " print(f\" {h.id} score={h.score:.4f} {h.payload.get('title', '')[:70]}\")\n", + "\n", + "print(\"\\n--- RRF + 0.1 * exp_decay on published_at (scale=180d, midpoint=0.5) ---\")\n", + "for h in fused_then_decay:\n", + " print(f\" {h.id} score={h.score:.4f} {h.payload.get('published_at')} {h.payload.get('title', '')[:50]}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell-30-formula-reading", + "metadata": {}, + "source": "The decay term tilts the ranking toward more recent documents. The `scale` and `midpoint` knobs control how aggressive the decay is: `scale=180 days, midpoint=0.5` means a document published 180 days before the target date receives a decay value of 0.5, versus 1.0 for one published on the target date.\n\nThe decay is wrapped in `MultExpression(mult=[0.1, ...])` because RRF scores are small (typically 0.2 to 0.5 in the top-5) while unweighted decay returns `[0, 1]`. The `0.1` coefficient caps the decay's contribution at 0.1, so it nudges the ranking rather than replacing it.
See [Search Relevance](/documentation/search/search-relevance/) for the full decay function reference.\n\nOther patterns that fit `FormulaQuery` naturally:\n\n- **Category boost:** multiply `$score` by 1.3 when `payload.category == \"featured\"`.\n- **Popularity prior:** sum `$score` with `log(view_count + 1) / 10`.\n- **Geo decay:** [`GaussDecayExpression`](/documentation/search/search-relevance/) on distance from a user's location." + }, + { + "cell_type": "markdown", + "id": "cell-31-decision-conclusion", + "metadata": {}, + "source": "## Picking a Method: At a Glance\n\n| Factor | RRF | Weighted RRF | DBSF | FormulaQuery |\n| --- | --- | --- | --- | --- |\n| Eval set required | No | Yes | Recommended | Yes for multi-knob tuning |\n| Hyperparameters to tune | `k` (optional, since v1.16) | Per-retriever weights, `k` (optional) | None | Decay scale, midpoint, term coefficients |\n| Preserves score magnitudes | No | No | Yes | Depends on prefetch |\n| Layers business logic | No | No | No | Yes |\n| Fit for well-calibrated retrievers | OK | OK | **Strong** | OK (over fused prefetch) |\n| Fit for retrievers with very different score scales | OK | OK with tuning | Limited | Limited (without normalized prefetches) |\n| Latency overhead | None | None | Small | Medium |\n\n## Wrap-up\n\n- **RRF** is a strong default; ranks sidestep the score-scale problem.\n- **Weighted RRF** is worth trying if you have an eval set. Tuned lifts are often within noise on well-matched retrievers (as on SciFact here).\n- **DBSF** can outperform tuned RRF on well-calibrated retrievers; neither dominates the other in general.\n- **`FormulaQuery`** layers business logic (recency, boosts, geo) on top of fused results.\n\nWant help wiring this into a production retrieval stack? [Join us on Discord](https://qdrant.to/discord)." 
+ } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file