feat(i18n): add Traditional + Simplified Chinese entity detection #945
Open
lmanchu wants to merge 1 commit into MemPalace:develop
zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.
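To see why the fallback yields no Chinese names, the English candidate pattern quoted in the PR description, `[A-Z][a-z]{1,19}`, can be run directly over mixed text. This is a minimal sketch using only the standard `re` module; it does not call MemPalace's `detect_entities`, and the sample sentence is made up for illustration.

```python
import re

# English candidate pattern as quoted in the PR description.
ENGLISH_CANDIDATE = re.compile(r"[A-Z][a-z]{1,19}")

# Hypothetical sample sentence: one Chinese name plus two Latin-script names.
text = "朱宜振說他會用 Python 跟 Alice 討論"

matches = ENGLISH_CANDIDATE.findall(text)
print(matches)  # ['Python', 'Alice']
```

Only the Latin-script tokens survive; the Chinese name never becomes a candidate.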
This adds entity sections for both locales:
- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
surnames covering >95% of Taiwanese / PRC names), length capped
at {1,2} trailing chars so greedy matches don't swallow the
trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
script-aware wrap (introduced in MemPalace#932) fires `\b` at CJK↔non-CJK
transitions. This is the same mechanism used for Devanagari,
applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
name with no whitespace, so patterns are written as `{name}說`,
`{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `:`, Chinese quotes
「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
question words, conjunctions, UI nouns, and politeness forms.
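The length cap and the whitespace-free verb patterns can be illustrated with plain `re`. This is a sketch under stated assumptions: the six-surname list, the `SURNAMES` and `CJK` names, and the sample text are hypothetical stand-ins, not the contents of `zh-TW.json`, and the script-aware boundary wrap is deliberately left out.

```python
import re

# Hypothetical short surname list; the PR's real list has ~100 entries.
SURNAMES = "朱王陳林李張"
CJK = r"\u4E00-\u9FFF"

# Surname followed by 1-2 more CJK characters (the capped form).
capped = re.compile(f"[{SURNAMES}][{CJK}]{{1,2}}")
# Same prefix with no cap, for comparison.
uncapped = re.compile(f"[{SURNAMES}][{CJK}]+")

text = "朱宜振說要開會"
capped_match = capped.search(text).group()      # '朱宜振'
uncapped_match = uncapped.search(text).group()  # '朱宜振說要開會'

# Verb templates attach directly to the name: no \b or \s+ in between.
verb_templates = ["{name}說", "{name}問", "{name}決定"]
name = "朱宜振"
verb_hits = [
    t for t in verb_templates
    if re.search(t.format(name=re.escape(name)), text)
]
print(capped_match, uncapped_match, verb_hits)
```

With the cap, the candidate stops at the three-character name; without it, the greedy match runs to the end of the CJK run. Only the `{name}說` template fires on this sample.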
**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.
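The limitation can be reproduced in isolation. The sketch below assumes the script-aware wrap behaves like a pair of negative lookarounds over `boundary_chars`; that is an assumption about the loader, not its confirmed implementation, and the surname subset, the `find_names` helper, and the sample lines are hypothetical.

```python
import re

CJK = r"\u4E00-\u9FFF"
SURNAMES = "朱王陳林李張"  # hypothetical subset of the surname list

# Assumed boundary wrap: no CJK character directly before or after the match.
candidate = re.compile(
    f"(?<![{CJK}])[{SURNAMES}][{CJK}]{{1,2}}(?![{CJK}])"
)

def find_names(line: str) -> list[str]:
    """Return candidate names found in one line (illustrative helper)."""
    return candidate.findall(line)

# Non-CJK neighbours on both sides: the name is matched.
at_boundary = find_names("- 朱宜振 (PM): 負責排程")
# CJK on both sides, no punctuation or whitespace: nothing is matched.
inline = find_names("我昨天跟朱宜振討論了進度")
print(at_boundary, inline)
```

Under this assumed wrap, the bullet-style line yields the name while the fully inline mention yields nothing, which matches the behaviour described above.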
**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.
Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`
Full suite: 957 passed, 0 failed.
Problem
`zh-TW` and `zh-CN` are shipped in `mempalace/i18n/` but have no `entity` section. When a Chinese user runs `detect_entities(..., languages=("zh-TW",))`, `get_entity_patterns()` silently falls back to English (`i18n/__init__.py:231-233`), so the English candidate pattern `[A-Z][a-z]{1,19}` is applied to Chinese text. Result: zero Chinese names extracted, only Latin-script names embedded in the Chinese document.
`ja` and `ko` share the same bug (follow-up PRs).
Reproduction (before this PR)
Approach
Add `entity` sections to `zh-TW.json` and `zh-CN.json` that work within the current framework's constraints:
- `candidate_pattern`: common-surname-prefixed CJK n-grams. ~100 surnames covering >95% of Taiwanese and PRC names. Length is capped at `{1,2}` trailing chars so greedy matching doesn't swallow the trailing verb (e.g. `朱宜振說` → entity `朱宜振說` is wrong).
- `boundary_chars: \u4E00-\u9FFF`: reuses the script-aware `\b` infrastructure from "fix(entity_detector): script-aware word boundaries for combining-mark scripts" (#932). Applied to CJK, `\b` fires at CJK↔non-CJK transitions, the same mechanism Devanagari uses.
- `person_verb_patterns`: Chinese verbs attach directly to the name with no whitespace, so patterns are written as `{name}說`, `{name}問`, `{name}決定`, with no `\b` or `\s+` between them.
- `dialogue_patterns`: full-width colon `:`, Chinese quotes 「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 entries covering particles, pronouns, time expressions, question words, conjunctions, UI nouns, and politeness forms.
What you get
Known Limitation (documented in tests)
CJK scripts have no word delimiters. A name flanked by CJK on both sides with no punctuation or whitespace break is not extracted: the framework's `\b(...)\b` wrap can't fire between two CJK characters without a dictionary tokeniser. A test covers this adversarial case explicitly (`test_zh_tw_known_limitation_inline_name_no_boundary`).
In practice this rarely degrades recall: realistic Chinese technical writing has many non-CJK neighbours (bullet lines, inline English, full-width punctuation, newlines), so names that appear 3+ times across a document almost always land at a matchable boundary somewhere. Verified on a realistic zh-TW PKM note: `朱宜振`, appearing in 8 sentences, was extracted 11x with 0.99 person-classification confidence.
Testing
`tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`
Full suite: `pytest tests/ -q`.
Lint: `ruff check mempalace/i18n/ tests/test_entity_detector.py`.
Follow-ups (separate PRs)
- `ja.json`: same treatment (currently falls back to English).
- `ko.json`: same treatment.
Checklist
- Tests (`pytest tests/ -v`)
- Lint (`ruff check`)
- Targets `develop` per CONTRIBUTING.md