fix: use i18n candidate patterns for entity extraction in miner and palace #931
mvalentsev wants to merge 3 commits into MemPalace:develop
…alace entity_detector.py was refactored in MemPalace#911 to load candidate patterns from i18n locale JSON files, supporting non-Latin scripts (Cyrillic, accented Latin, etc.). But three other code paths still hardcoded the ASCII-only regex [A-Z][a-z]{2,}, silently missing non-Latin entity names in metadata tagging, closet indexing, and registry lookups. Replace the hardcoded regex with a shared _candidate_entity_words() helper that reuses the same i18n candidate_patterns as entity_detector.
58db004 to 973bd62
Rebased on develop after #932 landed. candidate_patterns from get_entity_patterns() are now pre-wrapped with a boundary and a capture group, so _candidate_entity_words() compiles them directly without re-wrapping. Tests pass on all platforms.
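A minimal sketch of what compiling pre-wrapped patterns directly might look like. The pattern strings, the cache, and the names compiled_patterns and candidate_words are illustrative assumptions, not the code in this PR; the real patterns come from get_entity_patterns() and the locale JSON files.

```python
import re
from functools import lru_cache

# Hypothetical pre-wrapped patterns: boundary checks plus one capture group.
# The actual strings in the locale JSON files may differ.
PREWRAPPED = (
    r"(?<!\w)([A-Z][a-z]{1,19})(?!\w)",
    r"(?<!\w)([А-ЯЁ][а-яё]{1,19})(?!\w)",
)


@lru_cache(maxsize=1)
def compiled_patterns():
    # Compiled once on first call, then reused (lazy cache).
    return tuple(re.compile(p) for p in PREWRAPPED)


def candidate_words(text):
    words = []
    for rx in compiled_patterns():
        # Each pattern has exactly one group, so findall yields plain strings.
        words.extend(rx.findall(text))
    return words
```

Because the patterns already carry their own boundary and group, the helper has nothing to wrap; it only compiles and collects matches.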
Hi,
Severity: action required | Category: reliability
How to fix: Log and avoid caching failures
We noticed a couple of other issues in this PR as well, happy to share if helpful.
Found by Qodo code review
Fair point on the silent re.error. The try/except is intentionally defensive -- skip a broken pattern rather than crash the whole extraction pipeline. In practice the patterns are simple character classes from static JSON files ([A-Z][a-z]{1,19} and similar), so re.error is not really reachable here. A warning log would be a reasonable addition but out of scope for this PR, which just swaps ASCII-only regex for the i18n-aware version.
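For reference, the warning log discussed here could look roughly like the following. This is a sketch under assumptions: compile_patterns is a hypothetical helper and the logger name is arbitrary; neither is part of this PR.

```python
import logging
import re

logger = logging.getLogger("entity_patterns")  # arbitrary name for this sketch


def compile_patterns(raw_patterns):
    """Compile each pattern, skipping (and logging) any that fail to compile."""
    compiled = []
    for raw in raw_patterns:
        try:
            compiled.append(re.compile(raw))
        except re.error as exc:
            # Keep the pipeline running, but leave a trace of the bad pattern.
            logger.warning("skipping invalid candidate pattern %r: %s", raw, exc)
    return compiled
```

The behavior stays defensive (a broken pattern is skipped), while the failure is no longer silent.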
Summary
#911 refactored entity_detector.py to load candidate patterns from i18n locale JSON, supporting non-Latin scripts. But three other code paths still hardcode the ASCII-only regex [A-Z][a-z]{2,} for entity name extraction, silently missing Cyrillic, accented Latin, and other non-Latin names:
- miner.py _extract_entities_for_metadata() -- drawer metadata tags
- palace.py build_closet_lines() -- closet index entity tags
- entity_registry.py extract_unknown_candidates() -- Wikipedia lookup

For example, mining a file with "Михаил написал код" produces zero entity metadata because [A-Z] never matches Cyrillic uppercase.

Changes
- _candidate_entity_words() helper in palace.py that loads candidate patterns from get_entity_patterns() (same i18n source as entity_detector), with lazy-cached compiled regexes
- diary_ingest.py imports _extract_entities_for_metadata from miner, so it gets the fix automatically
- entity_languages includes "ru"

Test plan
- test_entity_metadata_finds_cyrillic_names
- ruff check + ruff format --check: clean
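
The Cyrillic example from the summary can be reproduced in isolation. The sketch below compares the old ASCII-only regex with a Cyrillic-aware character class; the second pattern is an illustrative assumption, not the actual entry in the "ru" locale file.

```python
import re

text = "Михаил написал код"

# The hardcoded regex the PR replaces.
ascii_only = re.compile(r"\b([A-Z][a-z]{2,})\b")
# Assumed Cyrillic candidate pattern, for illustration only.
cyrillic = re.compile(r"\b([А-ЯЁ][а-яё]{2,})\b")

ascii_hits = ascii_only.findall(text)
cyrillic_hits = cyrillic.findall(text)

print(ascii_hits)     # []
print(cyrillic_hits)  # ['Михаил']
```

The ASCII character class never matches the uppercase М, so the name is dropped before any later filtering has a chance to see it.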