Skip to content

fix: use i18n candidate patterns for entity extraction in miner and palace#931

Open
mvalentsev wants to merge 3 commits intoMemPalace:developfrom
mvalentsev:fix/i18n-entity-metadata
Open

fix: use i18n candidate patterns for entity extraction in miner and palace#931
mvalentsev wants to merge 3 commits intoMemPalace:developfrom
mvalentsev:fix/i18n-entity-metadata

Conversation

@mvalentsev
Copy link
Copy Markdown
Contributor

Summary

#911 refactored entity_detector.py to load candidate patterns from i18n
locale JSON, supporting non-Latin scripts. But three other code paths
still hardcode ASCII-only [A-Z][a-z]{2,} for entity name extraction,
silently missing Cyrillic, accented Latin, and other non-Latin names:

  • miner.py _extract_entities_for_metadata() -- drawer metadata tags
  • palace.py build_closet_lines() -- closet index entity tags
  • entity_registry.py extract_unknown_candidates() -- Wikipedia lookup

For example, mining a file with "Михаил написал код" produces zero entity
metadata because [A-Z] never matches Cyrillic uppercase.

Changes

  • Add _candidate_entity_words() helper in palace.py that loads
    candidate patterns from get_entity_patterns() (same i18n source as
    entity_detector), with lazy-cached compiled regexes
  • Replace hardcoded ASCII regex in all 3 locations with the shared helper
  • diary_ingest.py imports _extract_entities_for_metadata from miner,
    so it gets the fix automatically
  • Add regression test: Cyrillic name appears in entity metadata when
    entity_languages includes "ru"

Test plan

  • New test: test_entity_metadata_finds_cyrillic_names
  • Full suite: 945 passed
  • ruff check + ruff format --check: clean

@mvalentsev mvalentsev marked this pull request as ready for review April 16, 2026 00:28
@igorls igorls added bug Something isn't working area/i18n Multilingual, Unicode, non-English embeddings area/mining File and conversation mining labels Apr 16, 2026
…alace

entity_detector.py was refactored in MemPalace#911 to load candidate patterns
from i18n locale JSON files, supporting non-Latin scripts (Cyrillic,
accented Latin, etc.). But three other code paths still hardcoded the
ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity
names in metadata tagging, closet indexing, and registry lookups.

Replace the hardcoded regex with a shared _candidate_entity_words()
helper that reuses the same i18n candidate_patterns as entity_detector.
@mvalentsev mvalentsev force-pushed the fix/i18n-entity-metadata branch from 58db004 to 973bd62 Compare April 16, 2026 05:37
@mvalentsev
Copy link
Copy Markdown
Contributor Author

Rebased on develop after #932 landed. candidate_patterns from get_entity_patterns() are now pre-wrapped with boundary + capture group, so _candidate_entity_words() compiles them directly without re-wrapping. Tests pass on all platforms.

@qodo-ai-reviewer
Copy link
Copy Markdown

Hi, _candidate_entity_words() swallows re.error during pattern compilation and permanently caches the partial/empty compiled list, which can silently disable entity extraction for configured languages.

Severity: action required | Category: reliability

How to fix: Log and avoid caching failures

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

mempalace.palace._candidate_entity_words() silently drops invalid regex patterns (re.error) and then caches the resulting compiled regex list forever. This makes i18n entity extraction fail silently and non-recoverably within the process.

Issue Context

Patterns are loaded from i18n JSON via get_entity_patterns(). A single bad pattern (or any unexpected pattern content) currently yields no visibility and may leave _CANDIDATE_RX_CACHE empty.

Fix Focus Areas

  • mempalace/palace.py[134-160]
    • Log which pattern failed to compile (include language tuple if available)
    • Consider not caching when compilation yields an empty set (or cache keyed by language; see separate finding)
    • Optionally fall back to English candidate patterns if compilation yields none

We noticed a couple of other issues in this PR as well — happy to share if helpful.


Found by Qodo code review

@mvalentsev
Copy link
Copy Markdown
Contributor Author

Fair point on the silent re.error. The try/except is intentionally defensive -- skip a broken pattern rather than crash the whole extraction pipeline. In practice the patterns are simple character classes from static JSON files ([A-Z][a-z]{1,19} and similar), so re.error is not really reachable here.

A warning log would be a reasonable addition but out of scope for this PR, which just swaps ASCII-only regex for the i18n-aware version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/i18n Multilingual, Unicode, non-English embeddings area/mining File and conversation mining bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants