feat(i18n): add Traditional + Simplified Chinese entity detection#945

Open
lmanchu wants to merge 1 commit into MemPalace:develop from lmanchu:feat/zh-entity-detection

Conversation


@lmanchu lmanchu commented Apr 16, 2026

Problem

zh-TW and zh-CN are shipped in mempalace/i18n/ but have no entity section. When a Chinese user runs:

detect_entities(paths, languages=("zh-TW",))

get_entity_patterns() silently falls back to English (i18n/__init__.py:231-233), so the English candidate pattern [A-Z][a-z]{1,19} is applied to Chinese text. Result: zero Chinese names are extracted; only Latin-script names embedded in the Chinese document are found. ja and ko share the same bug (follow-up PRs).
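The silent fallback can be sketched in isolation. This is a toy model, not the real loader in i18n/__init__.py; the dict-backed lookup and the structure of `PATTERNS` are assumptions for illustration:

```python
# Toy model of the silent fallback described above; the real loader
# lives in i18n/__init__.py, and these names/structures are illustrative.
PATTERNS = {
    "en": {"candidate_pattern": r"[A-Z][a-z]{1,19}"},
    # zh-TW / zh-CN ship with no "entity" section, so they are absent here.
}

def get_entity_patterns(lang: str) -> dict:
    # Missing locale -> English patterns, with no warning to the caller.
    return PATTERNS.get(lang, PATTERNS["en"])

# zh-TW quietly resolves to the English patterns: the bug this PR fixes.
assert get_entity_patterns("zh-TW") is PATTERNS["en"]
```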

Reproduction (before this PR)

from mempalace.entity_detector import extract_candidates

zh_text = "朱宜振 主持會議。朱宜振 同意 Jeffrey 的方案。朱宜振: 決定 ship。"
extract_candidates(zh_text, languages=("zh-TW",))
# → {}                    ← no Chinese names
extract_candidates(zh_text, languages=("zh-TW", "en"))
# → {"Jeffrey": 1}        ← only English name, misses 朱宜振 entirely

Approach

Add entity sections to zh-TW.json and zh-CN.json that work within the current framework's constraints:

  • candidate_pattern: common-surname-prefixed CJK n-grams. ~100 surnames covering >95% of Taiwanese and PRC names. Length is capped at {1,2} trailing chars so greedy matching doesn't swallow a trailing verb character (e.g. for 朱宜振說, extracting 朱宜振說 instead of 朱宜振 would be wrong).
  • boundary_chars: \u4E00-\u9FFF. Reuses the script-aware \b infrastructure from #932 (fix(entity_detector): script-aware word boundaries for combining-mark scripts). Applied to the CJK range, \b fires at CJK↔non-CJK transitions, the same mechanism Devanagari uses.
  • person_verb_patterns: Chinese verbs attach directly to the name with no whitespace, so patterns are written as {name}說, {name}問, {name}決定 — no \b or \s+ between them.
  • dialogue_patterns: full-width colon ：, Chinese quotes 「」『』, plus the standard Latin forms.
  • pronoun_patterns: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
  • stopwords: ~140 entries — particles, pronouns, time expressions, question words, conjunctions, UI nouns, politeness forms.
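As a rough illustration of how these fields fit together, a zh-TW.json entity section shaped by the bullets above might look like the following. This is a sketch, not the actual file contents: the surname class and stopword list are heavily abbreviated, and the exact key names and template syntax are assumptions.

```json
{
  "entity": {
    "candidate_pattern": "[陳林黃張李王吳劉蔡朱][\\u4E00-\\u9FFF]{1,2}",
    "boundary_chars": "\\u4E00-\\u9FFF",
    "person_verb_patterns": ["{name}說", "{name}問", "{name}決定"],
    "dialogue_patterns": ["{name}：", "{name}:", "{name}「", "{name}『"],
    "pronoun_patterns": ["他", "她", "它", "他們", "她們", "您", "咱"],
    "stopwords": ["的", "了", "我們", "今天", "什麼", "因為"]
  }
}
```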

What you get

# After this PR
zh_text = (
    "# 會議紀錄\n"
    "- 朱宜振 主持\n"
    "- Jeffrey Lai 報告融資\n"
    "朱宜振 跟 Jeffrey 討論 pitch。\n"
    "朱宜振: 「我們要 6 月 launch。」\n"
    "朱宜振 同意 Arnold 的方案。\n"
    "朱宜振 決定 ship pitch。\n"
    # ...8 more mentions...
)
detect_entities(..., languages=("zh-TW", "en"))
# people:    [('朱宜振', 0.99)]       ← correctly classified as person
# uncertain: [('Jeffrey Lai', 0.06), ...]

Known Limitation (documented in tests)

CJK scripts have no word delimiters. A name flanked by CJK on both sides with no punctuation or whitespace break is not extracted — the framework's \b(...)\b wrap can't fire between two CJK characters without a dictionary tokeniser. A test covers this adversarial case explicitly (test_zh_tw_known_limitation_inline_name_no_boundary).

In practice this rarely degrades recall: realistic Chinese technical writing has many non-CJK neighbours (bullet lines, inline English, full-width punctuation, newlines), so names that appear 3+ times across a document almost always land at a matchable boundary somewhere. Verified on a realistic zh-TW PKM note: 朱宜振 appearing in 8 sentences was extracted 11x with 0.99 person-classification confidence.
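Both the boundary behaviour and the limitation can be demonstrated with a standalone regex sketch. `cjk_bounded` below is a hypothetical stand-in for the framework's script-aware \b wrap, not its actual implementation:

```python
import re

# Sketch of a script-aware boundary for the CJK range: the candidate must
# not be flanked by another CJK character. Names here are illustrative.
CJK = r"\u4E00-\u9FFF"

def cjk_bounded(pattern: str) -> str:
    # Lookarounds emulate \b firing only at CJK <-> non-CJK transitions.
    return rf"(?<![{CJK}]){pattern}(?![{CJK}])"

rx = re.compile(cjk_bounded(re.escape("朱宜振")))

assert rx.search("- 朱宜振 主持")             # whitespace boundary: matches
assert rx.search("朱宜振: 決定 ship")          # punctuation boundary: matches
assert rx.search("然後朱宜振離開了") is None   # CJK on both sides: no match
```

The last case is exactly the adversarial input covered by test_zh_tw_known_limitation_inline_name_no_boundary: without a dictionary tokeniser there is no signal to split 後朱宜振離.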

Testing

  • 7 new tests in tests/test_entity_detector.py:
    • test_zh_tw_candidate_extraction_at_boundaries
    • test_zh_tw_person_classification
    • test_zh_tw_stopwords_filter_common_particles
    • test_zh_tw_falls_back_to_english_for_non_cjk_names
    • test_zh_cn_candidate_extraction
    • test_zh_cn_and_zh_tw_union_covers_both_variants
    • test_zh_tw_known_limitation_inline_name_no_boundary
  • Full suite: 957 passed, 0 failed (pytest tests/ -q).
  • Ruff clean (ruff check mempalace/i18n/ tests/test_entity_detector.py).

Follow-ups (separate PRs)

  • ja.json: same treatment (currently falls back to English).
  • ko.json: same treatment.

Checklist

  • Tests pass (pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check)
  • No new dependencies
  • Targets develop per CONTRIBUTING.md

zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.

This adds entity sections for both locales:

- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
  surnames covering >95% of Taiwanese / PRC names), length capped
  at {1,2} trailing chars so greedy matches don't swallow the
  trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
  script-aware wrap (introduced in MemPalace#932) fires `\b` at CJK↔non-CJK
  transitions. This is the same mechanism used for Devanagari,
  applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
  name with no whitespace, so patterns are written as `{name}說`,
  `{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `：`, Chinese quotes
  「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
  question words, conjunctions, UI nouns, and politeness forms.

**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.

**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.

Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`

Full suite: 957 passed, 0 failed.