
fix: drop legacy unique content_hash index #245

Merged
EtanHey merged 7 commits into main from fix/enrichment-content-hash on Apr 16, 2026

Conversation


EtanHey (Owner) commented Apr 16, 2026

Summary

  • migrate legacy databases by dropping any unique index on chunks.content_hash and recreating the intended non-unique lookup index
  • keep _ensure_content_hash_column() as the single place that repairs this schema drift before enrichment/backfill writes run
  • add regression coverage for the exact failure mode where duplicate content backfill aborts on a legacy unique index

Verification

  • pytest tests/test_enrichment_controller.py
  • pytest tests/test_concurrent_enrichment.py
  • production DB schema repair verified: idx_content_hash_unique removed, idx_content_hash present
  • focused live enrichment verification on production DB: attempted 3, enriched 3, skipped 0, failed 0

Context

  • this fixes the flatlined enrichment runs caused by UNIQUE constraint failed: chunks.content_hash during UPDATE chunks SET content_hash = ? WHERE id = ?
  • indexing still has a separate operational problem: com.brainlayer.watch is not currently loaded, and recent watch logs show database is locked plus No space left on device; that is not changed in this PR
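The constraint failure is easy to reproduce in isolation. A minimal sketch with an in-memory SQLite database (the table, column, and index names follow the PR description; everything else is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, content_hash TEXT)")
# Legacy schema drift: a UNIQUE index where a plain lookup index was intended.
conn.execute("CREATE UNIQUE INDEX idx_content_hash_unique ON chunks(content_hash)")
conn.execute("INSERT INTO chunks (id) VALUES (1), (2)")

conn.execute("UPDATE chunks SET content_hash = ? WHERE id = ?", ("abc", 1))
try:
    # Second chunk with identical content aborts the backfill.
    conn.execute("UPDATE chunks SET content_hash = ? WHERE id = ?", ("abc", 2))
except sqlite3.IntegrityError as exc:
    print(exc)  # UNIQUE constraint failed: chunks.content_hash

# The repair: drop the unique index, recreate the intended non-unique one.
conn.execute("DROP INDEX IF EXISTS idx_content_hash_unique")
conn.execute("CREATE INDEX IF NOT EXISTS idx_content_hash ON chunks(content_hash)")
conn.execute("UPDATE chunks SET content_hash = ? WHERE id = ?", ("abc", 2))  # now succeeds
```

Note that a SQLite unique index allows multiple NULLs, so the drift only surfaces once enrichment starts backfilling real hashes for duplicate content.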

Note

Medium Risk
Touches enrichment-time SQLite schema/index repair and search reranking logic, so failures could impact enrichment runs or result ordering; changes are localized and covered by targeted tests.

Overview
Fixes enrichment schema drift by enhancing _ensure_content_hash_column() to detect and drop any unique index that includes chunks.content_hash, then ensure a non-unique lookup index exists so backfill/realtime enrichment can write duplicate hashes without aborting.

Centralizes DEFAULT_REALTIME_ENRICH_SINCE_HOURS env parsing into new config.get_int_env() (with malformed-value fallback) and updates CLI/MCP entrypoints to import the shared default; adds regression tests for the env fallback and for legacy-DB unique-index scenarios.
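A sketch of what the shared helper can look like (the function name and env var come from this PR; the body is an illustrative assumption, not the repo's actual implementation):

```python
import logging
import os

logger = logging.getLogger(__name__)


def get_int_env(name: str, default: int) -> int:
    """Read an integer env var; fall back to `default` on missing or malformed values."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return int(raw.strip())
    except ValueError:
        # Warn instead of raising at import time, so one bad env value
        # cannot prevent the CLI/MCP modules from loading.
        logger.warning("Invalid %s=%r; using default %s", name, raw, default)
        return default


os.environ["BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS"] = "24h"  # malformed value
print(get_int_env("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", 8760))  # → 8760
```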

Adjusts MMR reranking in SearchMixin._mmr_rerank_scored_results() to cap vector-based selection to n_results and to keep lexical-only (non-vector) hits in their original score slots; adds coverage for this ordering behavior.
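The recombination rule can be sketched as follows (illustrative only; `recombine` is a hypothetical name, and the real _mmr_rerank_scored_results() derives the MMR pick order from vector similarity, which is stubbed out here as a precomputed list):

```python
def recombine(scored, mmr_order, n_results):
    """Fill the top-n slots: lexical-only hits keep their original score
    slots; vector-backed slots are filled by MMR picks in selection order."""
    picks = iter(mmr_order)
    out = []
    for slot in scored[:n_results]:
        if slot["has_vector"]:
            pick = next(picks, None)
            if pick is None:
                break
            out.append(pick)  # vector slot: MMR-diversified pick
        else:
            out.append(slot)  # lexical-only hit stays anchored to its slot
    return out


scored = [
    {"id": "a", "has_vector": False},  # top FTS-only hit, no embedding yet
    {"id": "b", "has_vector": True},
    {"id": "c", "has_vector": True},
]
# MMR picked c first for diversity; lexical-only 'a' still keeps rank 0.
print([r["id"] for r in recombine(scored, [scored[2], scored[1]], 2)])  # → ['a', 'c']
```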

Reviewed by Cursor Bugbot for commit e37754d.

Note

Drop legacy unique content_hash index and replace with non-unique index in chunks table

  • Updates _ensure_content_hash_column in enrichment_controller.py to detect and drop any unique index on content_hash, then creates a non-unique idx_content_hash index, fixing failures caused by duplicate content hashes.
  • Consolidates DEFAULT_REALTIME_ENRICH_SINCE_HOURS parsing into a new get_int_env helper in config.py, which falls back to the default and logs a warning on malformed input instead of raising at import time.
  • Fixes MMR reranking in SearchMixin._mmr_rerank_scored_results to preserve original position slots for non-vector (lexical-only) hits and cap selection at n_results.

Macroscope summarized e37754d.

Summary by CodeRabbit

  • Improvements
    • Enhanced search result reranking for more relevant and deduplicated results.
    • Improved database index management for content lookups, optimizing performance.
    • Strengthened configuration validation with graceful fallback handling for invalid settings.



coderabbitai bot commented Apr 16, 2026

📝 Walkthrough

This PR consolidates environment-based configuration into a shared config.py module, refactors _ensure_content_hash_column() to properly manage SQLite indexes, and improves MMR reranking logic to better preserve non-vector search results in their original positions while applying deduplication.

Changes

  • Configuration Centralization (src/brainlayer/config.py, src/brainlayer/cli/__init__.py, src/brainlayer/mcp/__init__.py, src/brainlayer/mcp/enrich_handler.py): Introduced config.py with get_int_env() utility and DEFAULT_REALTIME_ENRICH_SINCE_HOURS constant. Removed direct environment parsing from cli and mcp modules; all now import from shared config.
  • SQLite Index Management (src/brainlayer/enrichment_controller.py, tests/test_enrichment_controller.py): Enhanced _ensure_content_hash_column() to inspect indexes via SQLite PRAGMA queries and create the non-unique index idx_content_hash on the chunks table. Added migration tests for legacy unique index removal and backfill verification.
  • MMR Reranking Algorithm (src/brainlayer/search_repo.py, tests/test_hybrid_search.py): Modified _mmr_rerank_scored_results() to limit selection to a target count, include all non-selected MMR candidates, and preserve non-vector hits in their original score positions. Added a test validating that lexical-only entries remain in the top results.
  • Configuration Testing (tests/test_enrich_defaults.py): Added a test for invalid environment variable fallback behavior, verifying all modules correctly default to 8760 when BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS is malformed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Poem

🐰 The config spreads its wings so wide,
From scattered env vars, now unified with pride,
Indexes hop through SQLite's humble store,
While MMR keeps non-vectors, forevermore! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 26.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly and accurately describes the main change: removing a legacy unique index on content_hash and replacing it with a non-unique one, which is the primary structural fix addressed throughout the changeset.



EtanHey commented Apr 16, 2026

@cursor @BugBot review


EtanHey commented Apr 16, 2026

@codex review


EtanHey commented Apr 16, 2026

@coderabbitai review


cursor bot commented Apr 16, 2026

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor


coderabbitai bot commented Apr 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed697058b9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread: src/brainlayer/search_repo.py (outdated)
Comment on lines +189 to +190:
reranked = [mmr_candidates[idx] for idx in selected]
return reranked + fallback_candidates + tail_candidates

P1: Keep non-vector hits in score order after MMR rerank

Chunks can legitimately be missing chunk_vectors rows (for example, newly stored manual/MCP memories before embedding backfill), so they enter fallback_candidates even when they have the highest hybrid score. Returning reranked + fallback_candidates + tail_candidates always pushes those lexical-only hits behind all vector-backed MMR candidates, so hybrid_search(..., n_results=...) can drop the most relevant FTS matches from the top results. This is a retrieval-correctness regression for partial-embedding databases.


Comment thread: src/brainlayer/mcp/__init__.py (outdated)
from .store_handler import _brain_update as _brain_update
from .tags_handler import _brain_tags_mcp as _brain_tags_mcp

DEFAULT_REALTIME_ENRICH_SINCE_HOURS = int(os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "8760"))

P2: Guard enrich default env parsing at module import

DEFAULT_REALTIME_ENRICH_SINCE_HOURS is parsed with int(...) at import time, so any malformed BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS value (e.g. empty string or 24h) raises ValueError and prevents the MCP module from loading at all. Because this happens before request handling, one bad environment value can take the whole tool server down instead of falling back to a safe default.



@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Reviewed by Cursor Bugbot for commit ed69705.

Comment threads (outdated): src/brainlayer/search_repo.py, src/brainlayer/mcp/enrich_handler.py

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e37754d7f6


Comment on lines +419 to +422:
except Exception:
    pass

return True

P1: Propagate content_hash index migration failures

Do not swallow exceptions here: if DROP INDEX/CREATE INDEX fails (for example due to a transient "database is locked" error during startup), _ensure_content_hash_column still returns success and _ensure_enrichment_columns marks the store as schema-ready, so the process will not retry the migration and legacy unique content_hash constraints can continue breaking duplicate-hash backfill/write paths. This makes the new repair logic fail silently in exactly the lock-contention environment called out in this repo.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/config.py`:
- Around line 21-25: The current int parsing returns int(value) even for 0 or
negatives; update the parse so that after converting value = int(value) you
reject non-positive numbers by checking if value <= 0, then call
logger.warning("Invalid %s=%r; using default %s", name, raw, default) and return
default; otherwise return the parsed positive integer. Locate the try/except
block that uses name, raw, default and replace the bare return int(value) with
this validation (keep the same warning message and exception handling).

In `@src/brainlayer/enrichment_controller.py`:
- Around line 402-422: The current try/except around the index scan/drop/create
swallows all exceptions and returns True even when DDL fails; update the logic
in the function containing this block (the code using cursor.execute,
has_content_hash_index, DROP INDEX and CREATE INDEX) to remove the bare
except/pass, implement retry-on-SQLITE_BUSY for each DDL operation (PRAGMA
index_info, DROP INDEX, CREATE INDEX) with a small backoff and limited attempts,
and if retries exhaust or any non-BUSY error occurs, propagate or return a
failure (do not return True); ensure the final return reflects actual success
only after successful DDL or confirmed existing index.
- Around line 405-414: The migration loop currently drops any unique index that
contains "content_hash" even if it's part of a composite index; change the logic
in the loop that iterates over indexes (using variables index_name, is_unique,
quoted_name, columns and cursor.execute/PRAGMA index_info) to only DROP INDEX
when the index is unique and its columns exactly equal ["content_hash"]
(preferably also verify the index_name matches the known legacy name if
available) — otherwise skip it so composite indexes like (project, content_hash)
or (conversation_id, content_hash) are preserved.

In `@tests/test_enrich_defaults.py`:
- Around line 24-28: The test currently calls
monkeypatch.delenv("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", raising=False) then
importlib.reload(config), importlib.reload(cli), importlib.reload(mcp),
importlib.reload(enrich_handler), which can leak state if the env var was set
before the test; update the test cleanup in tests/test_enrich_defaults.py to
capture the original value (e.g., orig =
os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS")) before deleting, then
restore that original value (use monkeypatch.setenv if orig is not None,
otherwise monkeypatch.delenv) and only then call importlib.reload(config),
importlib.reload(cli), importlib.reload(mcp), importlib.reload(enrich_handler)
so module constants reload with the original environment baseline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: dc8e29d2-3ca5-453f-ba28-ecefa273d309

📥 Commits

Reviewing files that changed from the base of the PR and between fd04ecd and e37754d.

📒 Files selected for processing (9)
  • src/brainlayer/cli/__init__.py
  • src/brainlayer/config.py
  • src/brainlayer/enrichment_controller.py
  • src/brainlayer/mcp/__init__.py
  • src/brainlayer/mcp/enrich_handler.py
  • src/brainlayer/search_repo.py
  • tests/test_enrich_defaults.py
  • tests/test_enrichment_controller.py
  • tests/test_hybrid_search.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.11)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

**/*.py: Use paths.py:get_db_path() for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches

Files:

  • tests/test_enrichment_controller.py
  • src/brainlayer/cli/__init__.py
  • src/brainlayer/mcp/enrich_handler.py
  • tests/test_enrich_defaults.py
  • src/brainlayer/enrichment_controller.py
  • tests/test_hybrid_search.py
  • src/brainlayer/search_repo.py
  • src/brainlayer/config.py
  • src/brainlayer/mcp/__init__.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use retry logic on SQLITE_BUSY errors; each worker must use its own database connection to handle concurrency safely
Classification must preserve ai_code, stack_trace, and user_message verbatim; skip noise entries entirely and summarize build_log and dir_listing entries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via enrichment_controller.py, and Ollama as offline last-resort; allow override via BRAINLAYER_ENRICH_BACKEND env var
Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns: superseded_by, aggregated_into, archived_at on chunks table; exclude lifecycle-managed chunks from default search; allow include_archived=True to show history
Implement brain_supersede with safety gate for personal data (journals, notes, health/finance); use soft-delete for brain_archive with timestamp
Add supersedes parameter to brain_store for atomic store-and-replace operations
Run linting and formatting with: ruff check src/ && ruff format src/
Run tests with pytest
Use PRAGMA wal_checkpoint(FULL) before and after bulk database operations to prevent WAL bloat

Files:

  • src/brainlayer/cli/__init__.py
  • src/brainlayer/mcp/enrich_handler.py
  • src/brainlayer/enrichment_controller.py
  • src/brainlayer/search_repo.py
  • src/brainlayer/config.py
  • src/brainlayer/mcp/__init__.py
🧠 Learnings (13)
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Applies to src/brainlayer/**/*.py : Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)

Applied to files:

  • src/brainlayer/cli/__init__.py
  • src/brainlayer/mcp/enrich_handler.py
  • tests/test_enrich_defaults.py
  • src/brainlayer/config.py
  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-02T23:32:14.543Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T23:32:14.543Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment rate configurable via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)

Applied to files:

  • src/brainlayer/cli/__init__.py
  • src/brainlayer/mcp/enrich_handler.py
  • tests/test_enrich_defaults.py
  • src/brainlayer/config.py
  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-03T11:34:19.303Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-03T11:34:19.303Z
Learning: Applies to src/brainlayer/cli.py : Use Typer CLI framework for command-line interface in `src/brainlayer/`

Applied to files:

  • src/brainlayer/cli/__init__.py
📚 Learning: 2026-03-22T15:55:22.017Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 100
File: src/brainlayer/enrichment_controller.py:175-199
Timestamp: 2026-03-22T15:55:22.017Z
Learning: In `src/brainlayer/enrichment_controller.py`, the `parallel` parameter in `enrich_local()` is intentionally kept in the function signature (currently unused, suppressed with `# noqa: ARG001`) for API stability. Parallel local enrichment via a thread pool or process pool is planned for a future iteration. Do not flag this as dead code requiring removal.

Applied to files:

  • src/brainlayer/mcp/enrich_handler.py
📚 Learning: 2026-04-01T01:24:44.281Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T01:24:44.281Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment backend priority: Groq (primary/cloud) → Gemini (fallback) → Ollama (offline last-resort), configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable

Applied to files:

  • src/brainlayer/mcp/enrich_handler.py
  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Applies to src/brainlayer/**/*.py : Implement chunk lifecycle columns: `superseded_by`, `aggregated_into`, `archived_at` on chunks table; exclude lifecycle-managed chunks from default search; allow `include_archived=True` to show history

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-04-04T23:24:03.159Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-04T23:24:03.159Z
Learning: Applies to src/brainlayer/{vector_store,search}*.py : Chunk lifecycle: implement columns `superseded_by`, `aggregated_into`, `archived_at` on chunks table; exclude lifecycle-managed chunks from default search

Applied to files:

  • src/brainlayer/enrichment_controller.py
  • src/brainlayer/search_repo.py
📚 Learning: 2026-04-11T16:54:45.631Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-04-11T16:54:45.631Z
Learning: Applies to `src/brainlayer/enrichment_controller.py`, `src/brainlayer/pipeline/write_queue.py`, and related enrichment pipeline files: A per-store single-writer queue is used for SQLite enrichment writes because SQLite allows only one writer at a time; direct concurrent writes caused lock contention under sustained Gemini Flex traffic. Do not flag serialized write patterns in this path as a performance concern — the queue is intentional.

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-04-03T11:43:08.915Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-03T11:43:08.915Z
Learning: Applies to src/brainlayer/*bulk*.py : Before bulk database operations: stop enrichment workers, checkpoint WAL with `PRAGMA wal_checkpoint(FULL)`, drop FTS triggers before bulk deletes

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-04-01T01:24:44.281Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T01:24:44.281Z
Learning: Applies to src/brainlayer/mcp/*.py : MCP tools include: brain_search, brain_store, brain_recall, brain_entity, brain_expand, brain_update, brain_digest, brain_get_person, brain_tags, brain_supersede, brain_archive (legacy brainlayer_* aliases still supported)

Applied to files:

  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-11T16:54:45.631Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-04-11T16:54:45.631Z
Learning: Applies to `src/brainlayer/enrichment_controller.py` and `src/brainlayer/pipeline/rate_limiter.py`: Gemini API calls in the enrichment pipeline are gated by a token bucket rate limiter. The rate is controlled by `BRAINLAYER_ENRICH_RATE` (default `5/s`, burst `10`) to keep throughput inside the Gemini Flex intended envelope. This default supersedes the earlier 0.2 (12 RPM) default for the Gemini Flex integration path.

Applied to files:

  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Applies to src/brainlayer/**/*.py : For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via `enrichment_controller.py`, and Ollama as offline last-resort; allow override via `BRAINLAYER_ENRICH_BACKEND` env var

Applied to files:

  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-03-17T01:04:22.497Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-03-17T01:04:22.497Z
Learning: Applies to src/brainlayer/mcp/**/*.py and brain-bar/Sources/BrainBar/MCPRouter.swift: The 8 required MCP tools are `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_tags`. `brain_tags` is the 8th tool, replacing `brain_get_person`, as defined in the Phase B spec merged in PR `#72`. The Python MCP server already implements `brain_tags`. Legacy `brainlayer_*` aliases must be maintained for backward compatibility.

Applied to files:

  • src/brainlayer/mcp/__init__.py
🔇 Additional comments (3)
src/brainlayer/search_repo.py (1)

177-197: Nice slot-preserving recombination for lexical-only hits.

Rebuilding top_candidates in place keeps non-vector results anchored to their original score slots while still letting MMR diversify the vector-backed positions.

tests/test_enrichment_controller.py (1)

381-397: Good regression coverage for the legacy-index migration.

Using a real SQLite database here exercises the exact index-introspection and duplicate-backfill path that mocked cursors would miss.

Also applies to: 424-447

tests/test_hybrid_search.py (1)

349-410: Strong regression test for MMR slot preservation and dedupe behavior.

This covers the lexical-only slot retention and duplicate-vector suppression path clearly, which is exactly the risky edge case from the rerank change.

Comment thread: src/brainlayer/config.py
Comment on lines +21 to +25:
try:
    return int(value)
except ValueError:
    logger.warning("Invalid %s=%r; using default %s", name, raw, default)
    return default

⚠️ Potential issue | 🟡 Minor

Reject non-positive integers here as well.

int() accepts 0 and negative values, so BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS=-1 becomes the shared default even though this setting is only valid for positive lookback windows. Treat <= 0 the same as malformed input and fall back to default.

Suggested fix
 def get_int_env(name: str, default: int) -> int:
     """Read an integer env var, falling back cleanly on malformed values."""
     raw = os.environ.get(name)
     if raw is None:
         return default

     value = raw.strip()
     if not value:
         return default

     try:
-        return int(value)
+        parsed = int(value)
     except ValueError:
         logger.warning("Invalid %s=%r; using default %s", name, raw, default)
         return default
+    if parsed <= 0:
+        logger.warning("Invalid %s=%r; using default %s", name, raw, default)
+        return default
+    return parsed

Comment on lines +402 to +422
try:
    indexes = list(cursor.execute("PRAGMA index_list(chunks)"))
    has_content_hash_index = False
    for row in indexes:
        index_name = row[1]
        is_unique = bool(row[2])
        quoted_name = index_name.replace('"', '""')
        columns = [info[2] for info in cursor.execute(f'PRAGMA index_info("{quoted_name}")')]
        if "content_hash" not in columns:
            continue
        if is_unique:
            cursor.execute(f'DROP INDEX IF EXISTS "{quoted_name}"')
            continue
        has_content_hash_index = True

    if not has_content_hash_index:
        cursor.execute("CREATE INDEX IF NOT EXISTS idx_content_hash ON chunks(content_hash)")
except Exception:
    pass

return True

⚠️ Potential issue | 🟠 Major

Don't report schema repair success after failed DDL.

Lines 419-422 swallow every failure from the index scan/drop/create path and still return True. If DROP INDEX or CREATE INDEX hits SQLITE_BUSY, callers will cache this store as repaired and continue into writes that can still fail with the same UNIQUE constraint this helper is meant to fix. Retry busy DDL and surface failure instead of passing here.

As per coding guidelines, "Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior" and "Use retry logic on SQLITE_BUSY errors; each worker must use its own database connection to handle concurrency safely".
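One way to satisfy that guideline (a sketch; `execute_with_retry`, the attempt count, backoff values, and the string-based busy check are assumptions, not the repo's actual retry policy):

```python
import sqlite3
import time


def execute_with_retry(cursor, sql, attempts=5, backoff=0.05):
    """Execute a statement, retrying on SQLITE_BUSY-style lock errors with
    exponential backoff; re-raise non-busy errors, give up after `attempts`."""
    for attempt in range(attempts):
        try:
            return cursor.execute(sql)
        except sqlite3.OperationalError as exc:
            busy = "locked" in str(exc) or "busy" in str(exc)
            if not busy or attempt == attempts - 1:
                raise  # surface failure instead of caching "repaired"
            time.sleep(backoff * (2 ** attempt))


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, content_hash TEXT)")
execute_with_retry(cur, 'DROP INDEX IF EXISTS "idx_content_hash_unique"')
execute_with_retry(cur, "CREATE INDEX IF NOT EXISTS idx_content_hash ON chunks(content_hash)")
```

Because every error ultimately propagates, _ensure_content_hash_column could return False (or raise) on exhausted retries rather than reporting success.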


Comment on lines +405 to +414
for row in indexes:
    index_name = row[1]
    is_unique = bool(row[2])
    quoted_name = index_name.replace('"', '""')
    columns = [info[2] for info in cursor.execute(f'PRAGMA index_info("{quoted_name}")')]
    if "content_hash" not in columns:
        continue
    if is_unique:
        cursor.execute(f'DROP INDEX IF EXISTS "{quoted_name}"')
        continue

⚠️ Potential issue | 🟠 Major

Only drop the legacy single-column unique index.

This loop removes any unique index whose column list merely includes content_hash. That also drops composite constraints like (project, content_hash) or (conversation_id, content_hash), which are not the legacy bug this migration is targeting. Please restrict the repair to unique indexes whose indexed columns are exactly ["content_hash"] (and ideally the known legacy name) before issuing DROP INDEX.
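The exact-column guard can be sketched like this (illustrative; idx_proj_hash is a hypothetical composite index used only to show what must survive):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, project TEXT, content_hash TEXT)")
cur.execute("CREATE UNIQUE INDEX idx_content_hash_unique ON chunks(content_hash)")  # legacy bug
cur.execute("CREATE UNIQUE INDEX idx_proj_hash ON chunks(project, content_hash)")   # must survive

for row in cur.execute("PRAGMA index_list(chunks)").fetchall():
    index_name, is_unique = row[1], bool(row[2])
    quoted = index_name.replace('"', '""')
    columns = [info[2] for info in cur.execute(f'PRAGMA index_info("{quoted}")')]
    # Drop only a unique index whose column list is exactly ["content_hash"].
    if is_unique and columns == ["content_hash"]:
        cur.execute(f'DROP INDEX IF EXISTS "{quoted}"')

remaining = [r[1] for r in cur.execute("PRAGMA index_list(chunks)")]
print(remaining)  # → ['idx_proj_hash']
```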


Comment on lines +24 to +28
monkeypatch.delenv("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", raising=False)
importlib.reload(config)
importlib.reload(cli)
importlib.reload(mcp)
importlib.reload(enrich_handler)

⚠️ Potential issue | 🟡 Minor

Restore the original env value before reloading modules in cleanup.

Cleanup currently forces the env var to “unset” before reloading modules. If it was set before this test, subsequent module constants can be reloaded with the wrong baseline and leak state across tests.

Suggested fix
+import os
 import importlib
@@
 def test_invalid_realtime_enrich_since_hours_env_falls_back(monkeypatch):
@@
-    monkeypatch.setenv("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "24h")
+    env_key = "BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS"
+    original = os.environ.get(env_key)
+    monkeypatch.setenv(env_key, "24h")
@@
     finally:
-        monkeypatch.delenv("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", raising=False)
+        if original is None:
+            monkeypatch.delenv(env_key, raising=False)
+        else:
+            monkeypatch.setenv(env_key, original)
         importlib.reload(config)
         importlib.reload(cli)
         importlib.reload(mcp)
         importlib.reload(enrich_handler)

@EtanHey EtanHey merged commit 05c878c into main Apr 16, 2026
7 checks passed
@EtanHey EtanHey deleted the fix/enrichment-content-hash branch April 16, 2026 14:42
