feat(i18n): add Traditional + Simplified Chinese entity detection#945

Open
lmanchu wants to merge 1 commit into MemPalace:develop from lmanchu:feat/zh-entity-detection

Conversation


@lmanchu lmanchu commented Apr 16, 2026

Problem

zh-TW and zh-CN are shipped in mempalace/i18n/ but have no entity section. When a Chinese user runs:

detect_entities(paths, languages=("zh-TW",))

get_entity_patterns() silently falls back to English (i18n/__init__.py:231-233), so the English candidate pattern [A-Z][a-z]{1,19} is applied to Chinese text. Result: zero Chinese names are extracted; only Latin-script names embedded in the Chinese document are found. ja and ko share the same bug (follow-up PRs).
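The silent fallback can be sketched in isolation. This is a toy model, not the real loader in i18n/__init__.py; the dict-backed lookup and the structure of `PATTERNS` are assumptions for illustration:

```python
# Toy model of the silent fallback described above; the real loader
# lives in i18n/__init__.py, and these names/structures are illustrative.
PATTERNS = {
    "en": {"candidate_pattern": r"[A-Z][a-z]{1,19}"},
    # zh-TW / zh-CN ship with no "entity" section, so they are absent here.
}

def get_entity_patterns(lang: str) -> dict:
    # Missing locale -> English patterns, with no warning to the caller.
    return PATTERNS.get(lang, PATTERNS["en"])

# zh-TW quietly resolves to the English patterns: the bug this PR fixes.
assert get_entity_patterns("zh-TW") is PATTERNS["en"]
```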

Reproduction (before this PR)

from mempalace.entity_detector import extract_candidates

zh_text = "朱宜振 主持會議。朱宜振 同意 Jeffrey 的方案。朱宜振: 決定 ship。"
extract_candidates(zh_text, languages=("zh-TW",))
# → {}                    ← no Chinese names
extract_candidates(zh_text, languages=("zh-TW", "en"))
# → {"Jeffrey": 1}        ← only English name, misses 朱宜振 entirely

Approach

Add entity sections to zh-TW.json and zh-CN.json that work within the current framework's constraints:

  • candidate_pattern: common-surname-prefixed CJK n-grams. ~100 surnames covering >95% of Taiwanese and PRC names. Length is capped at {1,2} trailing chars so greedy matching doesn't swallow a trailing verb character (e.g. for 朱宜振說, extracting 朱宜振說 instead of 朱宜振 would be wrong).
  • boundary_chars: \u4E00-\u9FFF. Reuses the script-aware \b infrastructure from #932 (fix(entity_detector): script-aware word boundaries for combining-mark scripts). Applied to the CJK range, \b fires at CJK↔non-CJK transitions, the same mechanism Devanagari uses.
  • person_verb_patterns: Chinese verbs attach directly to the name with no whitespace, so patterns are written as {name}說, {name}問, {name}決定 — no \b or \s+ between them.
  • dialogue_patterns: full-width colon ：, Chinese quotes 「」『』, plus the standard Latin forms.
  • pronoun_patterns: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
  • stopwords: ~140 entries — particles, pronouns, time expressions, question words, conjunctions, UI nouns, politeness forms.
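As a rough illustration of how these fields fit together, a zh-TW.json entity section shaped by the bullets above might look like the following. This is a sketch, not the actual file contents: the surname class and stopword list are heavily abbreviated, and the exact key names and template syntax are assumptions.

```json
{
  "entity": {
    "candidate_pattern": "[陳林黃張李王吳劉蔡朱][\\u4E00-\\u9FFF]{1,2}",
    "boundary_chars": "\\u4E00-\\u9FFF",
    "person_verb_patterns": ["{name}說", "{name}問", "{name}決定"],
    "dialogue_patterns": ["{name}：", "{name}:", "{name}「", "{name}『"],
    "pronoun_patterns": ["他", "她", "它", "他們", "她們", "您", "咱"],
    "stopwords": ["的", "了", "我們", "今天", "什麼", "因為"]
  }
}
```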

What you get

# After this PR
zh_text = (
    "# 會議紀錄\n"
    "- 朱宜振 主持\n"
    "- Jeffrey Lai 報告融資\n"
    "朱宜振 跟 Jeffrey 討論 pitch。\n"
    "朱宜振: 「我們要 6 月 launch。」\n"
    "朱宜振 同意 Arnold 的方案。\n"
    "朱宜振 決定 ship pitch。\n"
    # ...8 more mentions...
)
detect_entities(..., languages=("zh-TW", "en"))
# people:    [('朱宜振', 0.99)]       ← correctly classified as person
# uncertain: [('Jeffrey Lai', 0.06), ...]

Known Limitation (documented in tests)

CJK scripts have no word delimiters. A name flanked by CJK on both sides with no punctuation or whitespace break is not extracted — the framework's \b(...)\b wrap can't fire between two CJK characters without a dictionary tokeniser. A test covers this adversarial case explicitly (test_zh_tw_known_limitation_inline_name_no_boundary).

In practice this rarely degrades recall: realistic Chinese technical writing has many non-CJK neighbours (bullet lines, inline English, full-width punctuation, newlines), so names that appear 3+ times across a document almost always land at a matchable boundary somewhere. Verified on a realistic zh-TW PKM note: 朱宜振 appearing in 8 sentences was extracted 11x with 0.99 person-classification confidence.
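Both the boundary behaviour and the limitation can be demonstrated with a standalone regex sketch. `cjk_bounded` below is a hypothetical stand-in for the framework's script-aware \b wrap, not its actual implementation:

```python
import re

# Sketch of a script-aware boundary for the CJK range: the candidate must
# not be flanked by another CJK character. Names here are illustrative.
CJK = r"\u4E00-\u9FFF"

def cjk_bounded(pattern: str) -> str:
    # Lookarounds emulate \b firing only at CJK <-> non-CJK transitions.
    return rf"(?<![{CJK}]){pattern}(?![{CJK}])"

rx = re.compile(cjk_bounded(re.escape("朱宜振")))

assert rx.search("- 朱宜振 主持")             # whitespace boundary: matches
assert rx.search("朱宜振: 決定 ship")          # punctuation boundary: matches
assert rx.search("然後朱宜振離開了") is None   # CJK on both sides: no match
```

The last case is exactly the adversarial input covered by test_zh_tw_known_limitation_inline_name_no_boundary: without a dictionary tokeniser there is no signal to split 後朱宜振離.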

Testing

  • 7 new tests in tests/test_entity_detector.py:
    • test_zh_tw_candidate_extraction_at_boundaries
    • test_zh_tw_person_classification
    • test_zh_tw_stopwords_filter_common_particles
    • test_zh_tw_falls_back_to_english_for_non_cjk_names
    • test_zh_cn_candidate_extraction
    • test_zh_cn_and_zh_tw_union_covers_both_variants
    • test_zh_tw_known_limitation_inline_name_no_boundary
  • Full suite: 957 passed, 0 failed (pytest tests/ -q).
  • Ruff clean (ruff check mempalace/i18n/ tests/test_entity_detector.py).

Follow-ups (separate PRs)

  • ja.json: same treatment (currently falls back to English).
  • ko.json: same treatment.

Checklist

  • Tests pass (pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check)
  • No new dependencies
  • Targets develop per CONTRIBUTING.md

zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.

This adds entity sections for both locales:

- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
  surnames covering >95% of Taiwanese / PRC names), length capped
  at {1,2} trailing chars so greedy matches don't swallow the
  trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
  script-aware wrap (introduced in MemPalace#932) fires `\b` at CJK↔non-CJK
  transitions. This is the same mechanism used for Devanagari,
  applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
  name with no whitespace, so patterns are written as `{name}說`,
  `{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `：`, Chinese quotes
  「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
  question words, conjunctions, UI nouns, and politeness forms.

**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.

**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.

Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`

Full suite: 957 passed, 0 failed.