
fix: drop legacy unique content_hash index #245

Merged
EtanHey merged 7 commits into main from fix/enrichment-content-hash on Apr 16, 2026

Conversation


EtanHey (Owner) commented Apr 16, 2026

Summary

  • migrate legacy databases by dropping any unique index on chunks.content_hash and recreating the intended non-unique lookup index
  • keep _ensure_content_hash_column() as the single place that repairs this schema drift before enrichment/backfill writes run
  • add regression coverage for the exact failure mode where duplicate content backfill aborts on a legacy unique index

Verification

  • pytest tests/test_enrichment_controller.py
  • pytest tests/test_concurrent_enrichment.py
  • production DB schema repair verified: idx_content_hash_unique removed, idx_content_hash present
  • focused live enrichment verification on production DB: attempted 3, enriched 3, skipped 0, failed 0

Context

  • this fixes the flatlined enrichment runs caused by UNIQUE constraint failed: chunks.content_hash during UPDATE chunks SET content_hash = ? WHERE id = ?
  • indexing still has a separate operational problem: com.brainlayer.watch is not currently loaded, and recent watch logs show database is locked plus No space left on device; that is not changed in this PR
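The constraint failure is easy to reproduce in isolation. A minimal sketch with an in-memory SQLite database (the table, column, and index names follow the PR description; everything else is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, content_hash TEXT)")
# Legacy schema drift: a UNIQUE index where a plain lookup index was intended.
conn.execute("CREATE UNIQUE INDEX idx_content_hash_unique ON chunks(content_hash)")
conn.execute("INSERT INTO chunks (id) VALUES (1), (2)")

conn.execute("UPDATE chunks SET content_hash = ? WHERE id = ?", ("abc", 1))
try:
    # Second chunk with identical content aborts the backfill.
    conn.execute("UPDATE chunks SET content_hash = ? WHERE id = ?", ("abc", 2))
except sqlite3.IntegrityError as exc:
    print(exc)  # UNIQUE constraint failed: chunks.content_hash

# The repair: drop the unique index, recreate the intended non-unique one.
conn.execute("DROP INDEX IF EXISTS idx_content_hash_unique")
conn.execute("CREATE INDEX IF NOT EXISTS idx_content_hash ON chunks(content_hash)")
conn.execute("UPDATE chunks SET content_hash = ? WHERE id = ?", ("abc", 2))  # now succeeds
```

Note that a SQLite unique index allows multiple NULLs, so the drift only surfaces once enrichment starts backfilling real hashes for duplicate content.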

Note

Medium Risk
Touches enrichment-time SQLite schema/index repair and search reranking logic, so failures could impact enrichment runs or result ordering; changes are localized and covered by targeted tests.

Overview
Fixes enrichment schema drift by enhancing _ensure_content_hash_column() to detect and drop any unique index that includes chunks.content_hash, then ensure a non-unique lookup index exists so backfill/realtime enrichment can write duplicate hashes without aborting.

Centralizes DEFAULT_REALTIME_ENRICH_SINCE_HOURS env parsing into new config.get_int_env() (with malformed-value fallback) and updates CLI/MCP entrypoints to import the shared default; adds regression tests for the env fallback and for legacy-DB unique-index scenarios.
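A sketch of what the shared helper can look like (the function name and env var come from this PR; the body is an illustrative assumption, not the repo's actual implementation):

```python
import logging
import os

logger = logging.getLogger(__name__)


def get_int_env(name: str, default: int) -> int:
    """Read an integer env var; fall back to `default` on missing or malformed values."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return int(raw.strip())
    except ValueError:
        # Warn instead of raising at import time, so one bad env value
        # cannot prevent the CLI/MCP modules from loading.
        logger.warning("Invalid %s=%r; using default %s", name, raw, default)
        return default


os.environ["BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS"] = "24h"  # malformed value
print(get_int_env("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", 8760))  # → 8760
```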

Adjusts MMR reranking in SearchMixin._mmr_rerank_scored_results() to cap vector-based selection to n_results and to keep lexical-only (non-vector) hits in their original score slots; adds coverage for this ordering behavior.
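The recombination rule can be sketched as follows (illustrative only; `recombine` is a hypothetical name, and the real _mmr_rerank_scored_results() derives the MMR pick order from vector similarity, which is stubbed out here as a precomputed list):

```python
def recombine(scored, mmr_order, n_results):
    """Fill the top-n slots: lexical-only hits keep their original score
    slots; vector-backed slots are filled by MMR picks in selection order."""
    picks = iter(mmr_order)
    out = []
    for slot in scored[:n_results]:
        if slot["has_vector"]:
            pick = next(picks, None)
            if pick is None:
                break
            out.append(pick)  # vector slot: MMR-diversified pick
        else:
            out.append(slot)  # lexical-only hit stays anchored to its slot
    return out


scored = [
    {"id": "a", "has_vector": False},  # top FTS-only hit, no embedding yet
    {"id": "b", "has_vector": True},
    {"id": "c", "has_vector": True},
]
# MMR picked c first for diversity; lexical-only 'a' still keeps rank 0.
print([r["id"] for r in recombine(scored, [scored[2], scored[1]], 2)])  # → ['a', 'c']
```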

Reviewed by Cursor Bugbot for commit e37754d.

Note

Drop legacy unique content_hash index and replace with non-unique index in chunks table

  • Updates _ensure_content_hash_column in enrichment_controller.py to detect and drop any unique index on content_hash, then creates a non-unique idx_content_hash index, fixing failures caused by duplicate content hashes.
  • Consolidates DEFAULT_REALTIME_ENRICH_SINCE_HOURS parsing into a new get_int_env helper in config.py, which falls back to the default and logs a warning on malformed input instead of raising at import time.
  • Fixes MMR reranking in SearchMixin._mmr_rerank_scored_results to preserve original position slots for non-vector (lexical-only) hits and cap selection at n_results.

Macroscope summarized e37754d.

Summary by CodeRabbit

  • Improvements
    • Enhanced search result reranking for more relevant and deduplicated results.
    • Improved database index management for content lookups, optimizing performance.
    • Strengthened configuration validation with graceful fallback handling for invalid settings.



coderabbitai bot commented Apr 16, 2026

📝 Walkthrough

This PR consolidates environment-based configuration into a shared config.py module, refactors _ensure_content_hash_column() to properly manage SQLite indexes, and improves MMR reranking logic to better preserve non-vector search results in their original positions while applying deduplication.

Changes

  • Configuration Centralization (src/brainlayer/config.py, src/brainlayer/cli/__init__.py, src/brainlayer/mcp/__init__.py, src/brainlayer/mcp/enrich_handler.py): Introduced config.py with get_int_env() utility and DEFAULT_REALTIME_ENRICH_SINCE_HOURS constant. Removed direct environment parsing from cli and mcp modules; all now import from shared config.
  • SQLite Index Management (src/brainlayer/enrichment_controller.py, tests/test_enrichment_controller.py): Enhanced _ensure_content_hash_column() to inspect indexes via SQLite PRAGMA queries and create the non-unique index idx_content_hash on the chunks table. Added migration tests for legacy unique index removal and backfill verification.
  • MMR Reranking Algorithm (src/brainlayer/search_repo.py, tests/test_hybrid_search.py): Modified _mmr_rerank_scored_results() to limit selection to a target count, include all non-selected MMR candidates, and preserve non-vector hits in their original score positions. Added a test validating that lexical-only entries remain in the top results.
  • Configuration Testing (tests/test_enrich_defaults.py): Added a test for invalid environment variable fallback behavior, verifying all modules correctly default to 8760 when BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS is malformed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Poem

🐰 The config spreads its wings so wide,
From scattered env vars, now unified with pride,
Indexes hop through SQLite's humble store,
While MMR keeps non-vectors, forevermore! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 26.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly and accurately describes the main change: removing a legacy unique index on content_hash and replacing it with a non-unique one, which is the primary structural fix addressed throughout the changeset.



EtanHey commented Apr 16, 2026

@cursor @BugBot review


EtanHey commented Apr 16, 2026

@codex review


EtanHey commented Apr 16, 2026

@coderabbitai review


cursor bot commented Apr 16, 2026

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor


coderabbitai bot commented Apr 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed697058b9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread: src/brainlayer/search_repo.py (outdated)
Comment on lines +189 to +190:
reranked = [mmr_candidates[idx] for idx in selected]
return reranked + fallback_candidates + tail_candidates

P1: Keep non-vector hits in score order after MMR rerank

Chunks can legitimately be missing chunk_vectors rows (for example, newly stored manual/MCP memories before embedding backfill), so they enter fallback_candidates even when they have the highest hybrid score. Returning reranked + fallback_candidates + tail_candidates always pushes those lexical-only hits behind all vector-backed MMR candidates, so hybrid_search(..., n_results=...) can drop the most relevant FTS matches from the top results. This is a retrieval-correctness regression for partial-embedding databases.


Comment thread: src/brainlayer/mcp/__init__.py (outdated)
from .store_handler import _brain_update as _brain_update
from .tags_handler import _brain_tags_mcp as _brain_tags_mcp

DEFAULT_REALTIME_ENRICH_SINCE_HOURS = int(os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "8760"))

P2: Guard enrich default env parsing at module import

DEFAULT_REALTIME_ENRICH_SINCE_HOURS is parsed with int(...) at import time, so any malformed BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS value (e.g. empty string or 24h) raises ValueError and prevents the MCP module from loading at all. Because this happens before request handling, one bad environment value can take the whole tool server down instead of falling back to a safe default.



@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Reviewed by Cursor Bugbot for commit ed69705.

Comment threads (outdated): src/brainlayer/search_repo.py, src/brainlayer/mcp/enrich_handler.py

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e37754d7f6


Comment on lines +419 to +422:
except Exception:
    pass

return True

P1: Propagate content_hash index migration failures

Do not swallow exceptions here: if DROP INDEX/CREATE INDEX fails (for example due to a transient "database is locked" error during startup), _ensure_content_hash_column still returns success and _ensure_enrichment_columns marks the store as schema-ready, so the process will not retry the migration and legacy unique content_hash constraints can continue breaking duplicate-hash backfill/write paths. This makes the new repair logic fail silently in exactly the lock-contention environment called out in this repo.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/brainlayer/config.py`:
- Around line 21-25: The current int parsing returns int(value) even for 0 or
negatives; update the parse so that after converting value = int(value) you
reject non-positive numbers by checking if value <= 0, then call
logger.warning("Invalid %s=%r; using default %s", name, raw, default) and return
default; otherwise return the parsed positive integer. Locate the try/except
block that uses name, raw, default and replace the bare return int(value) with
this validation (keep the same warning message and exception handling).

In `@src/brainlayer/enrichment_controller.py`:
- Around line 402-422: The current try/except around the index scan/drop/create
swallows all exceptions and returns True even when DDL fails; update the logic
in the function containing this block (the code using cursor.execute,
has_content_hash_index, DROP INDEX and CREATE INDEX) to remove the bare
except/pass, implement retry-on-SQLITE_BUSY for each DDL operation (PRAGMA
index_info, DROP INDEX, CREATE INDEX) with a small backoff and limited attempts,
and if retries exhaust or any non-BUSY error occurs, propagate or return a
failure (do not return True); ensure the final return reflects actual success
only after successful DDL or confirmed existing index.
- Around line 405-414: The migration loop currently drops any unique index that
contains "content_hash" even if it's part of a composite index; change the logic
in the loop that iterates over indexes (using variables index_name, is_unique,
quoted_name, columns and cursor.execute/PRAGMA index_info) to only DROP INDEX
when the index is unique and its columns exactly equal ["content_hash"]
(preferably also verify the index_name matches the known legacy name if
available) — otherwise skip it so composite indexes like (project, content_hash)
or (conversation_id, content_hash) are preserved.

In `@tests/test_enrich_defaults.py`:
- Around line 24-28: The test currently calls
monkeypatch.delenv("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", raising=False) then
importlib.reload(config), importlib.reload(cli), importlib.reload(mcp),
importlib.reload(enrich_handler), which can leak state if the env var was set
before the test; update the test cleanup in tests/test_enrich_defaults.py to
capture the original value (e.g., orig =
os.environ.get("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS")) before deleting, then
restore that original value (use monkeypatch.setenv if orig is not None,
otherwise monkeypatch.delenv) and only then call importlib.reload(config),
importlib.reload(cli), importlib.reload(mcp), importlib.reload(enrich_handler)
so module constants reload with the original environment baseline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: dc8e29d2-3ca5-453f-ba28-ecefa273d309

📥 Commits

Reviewing files that changed from the base of the PR and between fd04ecd and e37754d.

📒 Files selected for processing (9)
  • src/brainlayer/cli/__init__.py
  • src/brainlayer/config.py
  • src/brainlayer/enrichment_controller.py
  • src/brainlayer/mcp/__init__.py
  • src/brainlayer/mcp/enrich_handler.py
  • src/brainlayer/search_repo.py
  • tests/test_enrich_defaults.py
  • tests/test_enrichment_controller.py
  • tests/test_hybrid_search.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.11)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: Macroscope - Correctness Check
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior
Enforce one-write-at-a-time concurrency constraint; reads are safe but brain_digest is write-heavy and must not run in parallel with other MCP work
Run pytest before claiming behavior changed safely; current test suite has 929 tests

**/*.py: Use paths.py:get_db_path() for all database path resolution; all scripts and CLI must use this function rather than hardcoding paths
When performing bulk database operations: stop enrichment workers first, checkpoint WAL before and after, drop FTS triggers before bulk deletes, batch deletes in 5-10K chunks, and checkpoint every 3 batches

Files:

  • tests/test_enrichment_controller.py
  • src/brainlayer/cli/__init__.py
  • src/brainlayer/mcp/enrich_handler.py
  • tests/test_enrich_defaults.py
  • src/brainlayer/enrichment_controller.py
  • tests/test_hybrid_search.py
  • src/brainlayer/search_repo.py
  • src/brainlayer/config.py
  • src/brainlayer/mcp/__init__.py
src/brainlayer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/brainlayer/**/*.py: Use retry logic on SQLITE_BUSY errors; each worker must use its own database connection to handle concurrency safely
Classification must preserve ai_code, stack_trace, and user_message verbatim; skip noise entries entirely and summarize build_log and dir_listing entries (structure only)
Use AST-aware chunking via tree-sitter; never split stack traces; mask large tool output
For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via enrichment_controller.py, and Ollama as offline last-resort; allow override via BRAINLAYER_ENRICH_BACKEND env var
Configure enrichment rate via BRAINLAYER_ENRICH_RATE environment variable (default 0.2 = 12 RPM)
Implement chunk lifecycle columns: superseded_by, aggregated_into, archived_at on chunks table; exclude lifecycle-managed chunks from default search; allow include_archived=True to show history
Implement brain_supersede with safety gate for personal data (journals, notes, health/finance); use soft-delete for brain_archive with timestamp
Add supersedes parameter to brain_store for atomic store-and-replace operations
Run linting and formatting with: ruff check src/ && ruff format src/
Run tests with pytest
Use PRAGMA wal_checkpoint(FULL) before and after bulk database operations to prevent WAL bloat

Files:

  • src/brainlayer/cli/__init__.py
  • src/brainlayer/mcp/enrich_handler.py
  • src/brainlayer/enrichment_controller.py
  • src/brainlayer/search_repo.py
  • src/brainlayer/config.py
  • src/brainlayer/mcp/__init__.py
🧠 Learnings (13)
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Applies to src/brainlayer/**/*.py : Configure enrichment rate via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)

Applied to files:

  • src/brainlayer/cli/__init__.py
  • src/brainlayer/mcp/enrich_handler.py
  • tests/test_enrich_defaults.py
  • src/brainlayer/config.py
  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-02T23:32:14.543Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-02T23:32:14.543Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment rate configurable via `BRAINLAYER_ENRICH_RATE` environment variable (default 0.2 = 12 RPM)

Applied to files:

  • src/brainlayer/cli/__init__.py
  • src/brainlayer/mcp/enrich_handler.py
  • tests/test_enrich_defaults.py
  • src/brainlayer/config.py
  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-03T11:34:19.303Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-03T11:34:19.303Z
Learning: Applies to src/brainlayer/cli.py : Use Typer CLI framework for command-line interface in `src/brainlayer/`

Applied to files:

  • src/brainlayer/cli/__init__.py
📚 Learning: 2026-03-22T15:55:22.017Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 100
File: src/brainlayer/enrichment_controller.py:175-199
Timestamp: 2026-03-22T15:55:22.017Z
Learning: In `src/brainlayer/enrichment_controller.py`, the `parallel` parameter in `enrich_local()` is intentionally kept in the function signature (currently unused, suppressed with `# noqa: ARG001`) for API stability. Parallel local enrichment via a thread pool or process pool is planned for a future iteration. Do not flag this as dead code requiring removal.

Applied to files:

  • src/brainlayer/mcp/enrich_handler.py
📚 Learning: 2026-04-01T01:24:44.281Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T01:24:44.281Z
Learning: Applies to src/brainlayer/*enrichment*.py : Enrichment backend priority: Groq (primary/cloud) → Gemini (fallback) → Ollama (offline last-resort), configurable via `BRAINLAYER_ENRICH_BACKEND` environment variable

Applied to files:

  • src/brainlayer/mcp/enrich_handler.py
  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Applies to src/brainlayer/**/*.py : Implement chunk lifecycle columns: `superseded_by`, `aggregated_into`, `archived_at` on chunks table; exclude lifecycle-managed chunks from default search; allow `include_archived=True` to show history

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-04-04T23:24:03.159Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-04T23:24:03.159Z
Learning: Applies to src/brainlayer/{vector_store,search}*.py : Chunk lifecycle: implement columns `superseded_by`, `aggregated_into`, `archived_at` on chunks table; exclude lifecycle-managed chunks from default search

Applied to files:

  • src/brainlayer/enrichment_controller.py
  • src/brainlayer/search_repo.py
📚 Learning: 2026-04-11T16:54:45.631Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-04-11T16:54:45.631Z
Learning: Applies to `src/brainlayer/enrichment_controller.py`, `src/brainlayer/pipeline/write_queue.py`, and related enrichment pipeline files: A per-store single-writer queue is used for SQLite enrichment writes because SQLite allows only one writer at a time; direct concurrent writes caused lock contention under sustained Gemini Flex traffic. Do not flag serialized write patterns in this path as a performance concern — the queue is intentional.

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-04-03T11:43:08.915Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-03T11:43:08.915Z
Learning: Applies to src/brainlayer/*bulk*.py : Before bulk database operations: stop enrichment workers, checkpoint WAL with `PRAGMA wal_checkpoint(FULL)`, drop FTS triggers before bulk deletes

Applied to files:

  • src/brainlayer/enrichment_controller.py
📚 Learning: 2026-04-01T01:24:44.281Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-01T01:24:44.281Z
Learning: Applies to src/brainlayer/mcp/*.py : MCP tools include: brain_search, brain_store, brain_recall, brain_entity, brain_expand, brain_update, brain_digest, brain_get_person, brain_tags, brain_supersede, brain_archive (legacy brainlayer_* aliases still supported)

Applied to files:

  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-11T16:54:45.631Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-04-11T16:54:45.631Z
Learning: Applies to `src/brainlayer/enrichment_controller.py` and `src/brainlayer/pipeline/rate_limiter.py`: Gemini API calls in the enrichment pipeline are gated by a token bucket rate limiter. The rate is controlled by `BRAINLAYER_ENRICH_RATE` (default `5/s`, burst `10`) to keep throughput inside the Gemini Flex intended envelope. This default supersedes the earlier 0.2 (12 RPM) default for the Gemini Flex integration path.

Applied to files:

  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-04-06T08:40:13.531Z
Learnt from: CR
Repo: EtanHey/brainlayer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-06T08:40:13.531Z
Learning: Applies to src/brainlayer/**/*.py : For enrichment backend selection: use Groq as primary backend (cloud, configured in launchd plist), Gemini as fallback via `enrichment_controller.py`, and Ollama as offline last-resort; allow override via `BRAINLAYER_ENRICH_BACKEND` env var

Applied to files:

  • src/brainlayer/mcp/__init__.py
📚 Learning: 2026-03-17T01:04:22.497Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 0
File: :0-0
Timestamp: 2026-03-17T01:04:22.497Z
Learning: Applies to src/brainlayer/mcp/**/*.py and brain-bar/Sources/BrainBar/MCPRouter.swift: The 8 required MCP tools are `brain_search`, `brain_store`, `brain_recall`, `brain_entity`, `brain_expand`, `brain_update`, `brain_digest`, `brain_tags`. `brain_tags` is the 8th tool, replacing `brain_get_person`, as defined in the Phase B spec merged in PR `#72`. The Python MCP server already implements `brain_tags`. Legacy `brainlayer_*` aliases must be maintained for backward compatibility.

Applied to files:

  • src/brainlayer/mcp/__init__.py
🔇 Additional comments (3)
src/brainlayer/search_repo.py (1)

177-197: Nice slot-preserving recombination for lexical-only hits.

Rebuilding top_candidates in place keeps non-vector results anchored to their original score slots while still letting MMR diversify the vector-backed positions.

tests/test_enrichment_controller.py (1)

381-397: Good regression coverage for the legacy-index migration.

Using a real SQLite database here exercises the exact index-introspection and duplicate-backfill path that mocked cursors would miss.

Also applies to: 424-447

tests/test_hybrid_search.py (1)

349-410: Strong regression test for MMR slot preservation and dedupe behavior.

This covers the lexical-only slot retention and duplicate-vector suppression path clearly, which is exactly the risky edge case from the rerank change.

Comment thread: src/brainlayer/config.py
Comment on lines +21 to +25:
try:
    return int(value)
except ValueError:
    logger.warning("Invalid %s=%r; using default %s", name, raw, default)
    return default

⚠️ Potential issue | 🟡 Minor

Reject non-positive integers here as well.

int() accepts 0 and negative values, so BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS=-1 becomes the shared default even though this setting is only valid for positive lookback windows. Treat <= 0 the same as malformed input and fall back to default.

Suggested fix
 def get_int_env(name: str, default: int) -> int:
     """Read an integer env var, falling back cleanly on malformed values."""
     raw = os.environ.get(name)
     if raw is None:
         return default

     value = raw.strip()
     if not value:
         return default

     try:
-        return int(value)
+        parsed = int(value)
     except ValueError:
         logger.warning("Invalid %s=%r; using default %s", name, raw, default)
         return default
+    if parsed <= 0:
+        logger.warning("Invalid %s=%r; using default %s", name, raw, default)
+        return default
+    return parsed

Comment on lines +402 to +422
try:
    indexes = list(cursor.execute("PRAGMA index_list(chunks)"))
    has_content_hash_index = False
    for row in indexes:
        index_name = row[1]
        is_unique = bool(row[2])
        quoted_name = index_name.replace('"', '""')
        columns = [info[2] for info in cursor.execute(f'PRAGMA index_info("{quoted_name}")')]
        if "content_hash" not in columns:
            continue
        if is_unique:
            cursor.execute(f'DROP INDEX IF EXISTS "{quoted_name}"')
            continue
        has_content_hash_index = True

    if not has_content_hash_index:
        cursor.execute("CREATE INDEX IF NOT EXISTS idx_content_hash ON chunks(content_hash)")
except Exception:
    pass

return True

⚠️ Potential issue | 🟠 Major

Don't report schema repair success after failed DDL.

Lines 419-422 swallow every failure from the index scan/drop/create path and still return True. If DROP INDEX or CREATE INDEX hits SQLITE_BUSY, callers will cache this store as repaired and continue into writes that can still fail with the same UNIQUE constraint this helper is meant to fix. Retry busy DDL and surface failure instead of passing here.

As per coding guidelines, "Flag risky DB or concurrency changes explicitly and do not hand-wave lock behavior" and "Use retry logic on SQLITE_BUSY errors; each worker must use its own database connection to handle concurrency safely".
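One way to satisfy that guideline (a sketch; `execute_with_retry`, the attempt count, backoff values, and the string-based busy check are assumptions, not the repo's actual retry policy):

```python
import sqlite3
import time


def execute_with_retry(cursor, sql, attempts=5, backoff=0.05):
    """Execute a statement, retrying on SQLITE_BUSY-style lock errors with
    exponential backoff; re-raise non-busy errors, give up after `attempts`."""
    for attempt in range(attempts):
        try:
            return cursor.execute(sql)
        except sqlite3.OperationalError as exc:
            busy = "locked" in str(exc) or "busy" in str(exc)
            if not busy or attempt == attempts - 1:
                raise  # surface failure instead of caching "repaired"
            time.sleep(backoff * (2 ** attempt))


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, content_hash TEXT)")
execute_with_retry(cur, 'DROP INDEX IF EXISTS "idx_content_hash_unique"')
execute_with_retry(cur, "CREATE INDEX IF NOT EXISTS idx_content_hash ON chunks(content_hash)")
```

Because every error ultimately propagates, _ensure_content_hash_column could return False (or raise) on exhausted retries rather than reporting success.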


Comment on lines +405 to +414
for row in indexes:
    index_name = row[1]
    is_unique = bool(row[2])
    quoted_name = index_name.replace('"', '""')
    columns = [info[2] for info in cursor.execute(f'PRAGMA index_info("{quoted_name}")')]
    if "content_hash" not in columns:
        continue
    if is_unique:
        cursor.execute(f'DROP INDEX IF EXISTS "{quoted_name}"')
        continue

⚠️ Potential issue | 🟠 Major

Only drop the legacy single-column unique index.

This loop removes any unique index whose column list merely includes content_hash. That also drops composite constraints like (project, content_hash) or (conversation_id, content_hash), which are not the legacy bug this migration is targeting. Please restrict the repair to unique indexes whose indexed columns are exactly ["content_hash"] (and ideally the known legacy name) before issuing DROP INDEX.
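The exact-column guard can be sketched like this (illustrative; idx_proj_hash is a hypothetical composite index used only to show what must survive):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, project TEXT, content_hash TEXT)")
cur.execute("CREATE UNIQUE INDEX idx_content_hash_unique ON chunks(content_hash)")  # legacy bug
cur.execute("CREATE UNIQUE INDEX idx_proj_hash ON chunks(project, content_hash)")   # must survive

for row in cur.execute("PRAGMA index_list(chunks)").fetchall():
    index_name, is_unique = row[1], bool(row[2])
    quoted = index_name.replace('"', '""')
    columns = [info[2] for info in cur.execute(f'PRAGMA index_info("{quoted}")')]
    # Drop only a unique index whose column list is exactly ["content_hash"].
    if is_unique and columns == ["content_hash"]:
        cur.execute(f'DROP INDEX IF EXISTS "{quoted}"')

remaining = [r[1] for r in cur.execute("PRAGMA index_list(chunks)")]
print(remaining)  # → ['idx_proj_hash']
```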


Comment on lines +24 to +28
monkeypatch.delenv("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", raising=False)
importlib.reload(config)
importlib.reload(cli)
importlib.reload(mcp)
importlib.reload(enrich_handler)

⚠️ Potential issue | 🟡 Minor

Restore the original env value before reloading modules in cleanup.

Cleanup currently forces the env var to “unset” before reloading modules. If it was set before this test, subsequent module constants can be reloaded with the wrong baseline and leak state across tests.

Suggested fix
+import os
 import importlib
@@
 def test_invalid_realtime_enrich_since_hours_env_falls_back(monkeypatch):
@@
-    monkeypatch.setenv("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", "24h")
+    env_key = "BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS"
+    original = os.environ.get(env_key)
+    monkeypatch.setenv(env_key, "24h")
@@
     finally:
-        monkeypatch.delenv("BRAINLAYER_DEFAULT_ENRICH_SINCE_HOURS", raising=False)
+        if original is None:
+            monkeypatch.delenv(env_key, raising=False)
+        else:
+            monkeypatch.setenv(env_key, original)
         importlib.reload(config)
         importlib.reload(cli)
         importlib.reload(mcp)
         importlib.reload(enrich_handler)

@EtanHey EtanHey merged commit 05c878c into main Apr 16, 2026
7 checks passed
@EtanHey EtanHey deleted the fix/enrichment-content-hash branch April 16, 2026 14:42
