feat: add Indonesian language support #778
dominosaurs wants to merge 2 commits into MemPalace:develop
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.
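A minimal sketch of what a `languages` parameter with a `("en",)` default could look like. One reason a tuple is convenient here is that tuples are hashable, so the language selection can be part of a cache key. The names `build_patterns`, `looks_like_person`, and the `_PERSON_VERBS` table are hypothetical stand-ins for illustration, not the project's actual code.

```python
import re
from functools import lru_cache

# Hypothetical per-locale verb lists; in the PR these come from the locale JSON.
_PERSON_VERBS = {
    "en": ("said", "asked"),
    "id": ("berkata", "bertanya"),
}


@lru_cache(maxsize=256)
def build_patterns(name: str, languages: tuple = ("en",)) -> tuple:
    """Compile '<name> <verb>' patterns for the requested languages only."""
    verbs = []
    for lang in languages:
        verbs.extend(_PERSON_VERBS.get(lang, ()))
    return tuple(
        re.compile(rf"\b{re.escape(name)}\s+{re.escape(verb)}\b", re.IGNORECASE)
        for verb in verbs
    )


def looks_like_person(name: str, text: str, languages=("en",)) -> bool:
    """Return True if any enabled language's person-verb pattern matches."""
    # Convert to a tuple so lists passed by callers are still hashable.
    return any(p.search(text) for p in build_patterns(name, tuple(languages)))
```

With the default, only English verbs are checked, so existing callers see no change; passing `("en", "id")` also checks the Indonesian verbs.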
Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
patterns across requested languages, dedupes lists, unions stopwords,
and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
Devanagari, CJK) can register their own character classes instead of
being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
callers don't poison each other's cache slots
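The merge behaviour described for `get_entity_patterns(langs)` can be sketched roughly as follows. This is an illustration under assumptions: the `_LOCALES` dictionary stands in for the `entity` section of each locale JSON, the key names `person_verbs` and `stopwords` are hypothetical, and the function is named `merge_entity_patterns` to make clear it is not the real helper.

```python
# Hypothetical in-memory stand-in for the "entity" section of each locale JSON.
_LOCALES = {
    "en": {"person_verbs": ["said", "asked"], "stopwords": ["the", "and"]},
    "id": {"person_verbs": ["berkata", "said"], "stopwords": ["yang", "dan"]},
}


def merge_entity_patterns(langs=("en",)):
    """Merge entity patterns across locales.

    Lists are deduplicated while keeping first-seen order, stopwords are
    unioned into a set, and unknown locales fall back to English.
    """
    merged_verbs = []
    stopwords = set()
    for lang in langs:
        section = _LOCALES.get(lang, _LOCALES["en"])  # unknown locale -> English
        for verb in section["person_verbs"]:
            if verb not in merged_verbs:  # dedupe, preserving order
                merged_verbs.append(verb)
        stopwords |= set(section["stopwords"])
    return {"person_verbs": merged_verbs, "stopwords": stopwords}
```

Calling `merge_entity_patterns(("en", "id"))` yields a single verb list with the duplicate `"said"` collapsed, plus one combined stopword set.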
Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.
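To make the candidate-extraction problem concrete, here is a small sketch of how the ASCII-only `[A-Z][a-z]+` class behaves on accented and Cyrillic names. The Cyrillic character class below is an illustrative assumption, not the pattern shipped in any locale file.

```python
import re

# The ASCII-only default mentioned above.
ASCII_CANDIDATE = re.compile(r"[A-Z][a-z]+")

# Hypothetical per-language class a Cyrillic locale might register instead.
CYRILLIC_CANDIDATE = re.compile(r"[А-ЯЁ][а-яё]+")

text = "Иван met José in Jakarta"

ascii_hits = ASCII_CANDIDATE.findall(text)
cyrillic_hits = CYRILLIC_CANDIDATE.findall(text)

print(ascii_hits)     # the Cyrillic name is missed and the accented one is cut short
print(cyrillic_hits)  # the Cyrillic name is found by the locale-specific class
```

The ASCII pattern never sees "Иван" and truncates "José" at the accented letter, which is why a per-language `candidate_pattern` is needed.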
This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
Thanks @dominosaurs! The

Two things before merge:

1. CI is failing on a stale branch, not on your changes. The failure is

git fetch origin
git rebase origin/develop
git push --force-with-lease

2. Please drop the edit to

Minor, non-blocking: the hyphenated verbs in

Optionally: #911 just landed infra for Indonesian-aware entity detection (so names like
Force-pushed from 81664ca to 88f5b5f.
Refine AAAK instruction and expand entity detection patterns.
Thanks for the review and the guidance. I was honestly a bit clueless at first and approached

I've updated the JSON to be more contextual to how Indonesian is actually written today, with the hope that it improves mining quality for Indonesian text in practice. I'm very open to further suggestions from native speakers or anyone with better intuition here.
What does this PR do?
Introduces the Indonesian (id) locale, providing translations for CLI commands, status messages, and core terminology.
Includes language-specific regex patterns for stop words and action detection to support text processing and indexing in Indonesian. The test suite is updated with a sample case to verify correct dialect handling and compression.
How to test
Checklist
- python -m pytest tests/ -v
- ruff check .