fix: use i18n candidate patterns for entity extraction in miner and palace by mvalentsev · Pull Request #931 · MemPalace/mempalace

mvalentsev · 2026-04-16T00:24:07Z

Summary

#911 refactored entity_detector.py to load candidate patterns from i18n
locale JSON, supporting non-Latin scripts. But three other code paths
still hardcode ASCII-only [A-Z][a-z]{2,} for entity name extraction,
silently missing Cyrillic, accented Latin, and other non-Latin names:

miner.py _extract_entities_for_metadata() -- drawer metadata tags
palace.py build_closet_lines() -- closet index entity tags
entity_registry.py extract_unknown_candidates() -- Wikipedia lookup

For example, mining a file with "Михаил написал код" produces zero entity
metadata because [A-Z] never matches Cyrillic uppercase.

Changes

Add _candidate_entity_words() helper in palace.py that loads
candidate patterns from get_entity_patterns() (same i18n source as
entity_detector), with lazy-cached compiled regexes
Replace hardcoded ASCII regex in all 3 locations with the shared helper
diary_ingest.py imports _extract_entities_for_metadata from miner,
so it gets the fix automatically
Add regression test: Cyrillic name appears in entity metadata when
entity_languages includes "ru"

Test plan

New test: test_entity_metadata_finds_cyrillic_names
Full suite: 945 passed
ruff check + ruff format --check: clean

…alace entity_detector.py was refactored in MemPalace#911 to load candidate patterns from i18n locale JSON files, supporting non-Latin scripts (Cyrillic, accented Latin, etc.). But three other code paths still hardcoded the ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity names in metadata tagging, closet indexing, and registry lookups. Replace the hardcoded regex with a shared _candidate_entity_words() helper that reuses the same i18n candidate_patterns as entity_detector.

mvalentsev · 2026-04-16T05:37:47Z

Rebased on develop after #932 landed. candidate_patterns from get_entity_patterns() are now pre-wrapped with boundary + capture group, so _candidate_entity_words() compiles them directly without re-wrapping. Tests pass on all platforms.

qodo-ai-reviewer · 2026-04-16T13:36:06Z

Hi, _candidate_entity_words() swallows re.error during pattern compilation and permanently caches the partial/empty compiled list, which can silently disable entity extraction for configured languages.

Severity: action required | Category: reliability

How to fix: Log and avoid caching failures

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

mempalace.palace._candidate_entity_words() silently drops invalid regex patterns (re.error) and then caches the resulting compiled regex list forever. This makes i18n entity extraction fail silently and non-recoverably within the process.

Issue Context

Patterns are loaded from i18n JSON via get_entity_patterns(). A single bad pattern (or any unexpected pattern content) currently yields no visibility and may leave _CANDIDATE_RX_CACHE empty.

Fix Focus Areas

mempalace/palace.py[134-160]

Log which pattern failed to compile (include language tuple if available)

Consider not caching when compilation yields an empty set (or cache keyed by language; see separate finding)

Optionally fall back to English candidate patterns if compilation yields none

We noticed a couple of other issues in this PR as well — happy to share if helpful.

Found by Qodo code review

mvalentsev · 2026-04-16T13:58:14Z

Fair point on the silent re.error. The try/except is intentionally defensive -- skip a broken pattern rather than crash the whole extraction pipeline. In practice the patterns are simple character classes from static JSON files ([A-Z][a-z]{1,19} and similar), so re.error is not really reachable here.

A warning log would be a reasonable addition but out of scope for this PR, which just swaps ASCII-only regex for the i18n-aware version.

mvalentsev marked this pull request as ready for review April 16, 2026 00:28

mvalentsev requested review from bensig, igorls and milla-jovovich as code owners April 16, 2026 00:28

igorls added bug Something isn't working area/i18n Multilingual, Unicode, non-English embeddings area/mining File and conversation mining labels Apr 16, 2026

mvalentsev added 2 commits April 16, 2026 10:35

fix: use pre-wrapped candidate patterns after MemPalace#932 refactor

973bd62

mvalentsev force-pushed the fix/i18n-entity-metadata branch from 58db004 to 973bd62 Compare April 16, 2026 05:37

remove unnecessary comment

cde0f5b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: use i18n candidate patterns for entity extraction in miner and palace#931

fix: use i18n candidate patterns for entity extraction in miner and palace#931
mvalentsev wants to merge 3 commits intoMemPalace:developfrom
mvalentsev:fix/i18n-entity-metadata

mvalentsev commented Apr 16, 2026

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

qodo-ai-reviewer commented Apr 16, 2026

Issue description

Issue Context

Fix Focus Areas

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mvalentsev commented Apr 16, 2026

Summary

Changes

Test plan

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

qodo-ai-reviewer commented Apr 16, 2026

Issue description

Issue Context

Fix Focus Areas

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants