feat: add Indonesian language support by dominosaurs · Pull Request #778 · MemPalace/mempalace

dominosaurs · 2026-04-13T09:42:10Z

What does this PR do?

Introduces the Indonesian (id) locale, providing translations for CLI commands, status messages, and core terminology.

Includes language-specific regex patterns for stop words and action detection to support text processing and indexing in Indonesian. The test suite is updated with a sample case to verify correct dialect handling and compression.

How to test

pytest mempalace/i18n/test_i18n.py -v

Checklist

Tests pass (python -m pytest tests/ -v)
No hardcoded paths
Linter passes (ruff check .)

Move all entity-detection lexical patterns (person verbs, pronouns, dialogue markers, project verbs, stopwords, candidate character class) out of hardcoded module-level constants and into the entity section of each locale's JSON in mempalace/i18n/. Adds a languages parameter to every public function so callers union patterns across the desired locales. The default stays ("en",), so all existing callers and tests behave unchanged.

Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges patterns across requested languages, dedupes lists, unions stopwords, and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic, Devanagari, CJK) can register their own character classes instead of being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language callers don't poison each other's cache slots

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that needed entity_detector changes and inlined a _PTBR variant of every constant. That doesn't scale past 2-3 languages: every text gets checked against every language's patterns regardless of relevance, and candidate extraction still drops accented and non-Latin names. This PR sets the standard so future locale contributors only edit one JSON file (no Python changes), and entity detection scales linearly with how many languages a user actually enabled, not how many ship.
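The merging behaviour described above can be illustrated with a minimal sketch. This is not the code from the refactor: the function names `merge_entity_patterns` and `parse_entity_languages`, the in-memory `LOCALES` dictionary, and its key names (`person_verbs`, `stopwords`, `candidate_pattern`) are assumptions made for illustration. Only the environment variable name MEMPALACE_ENTITY_LANGUAGES, the ("en",) default, and the English fallback come from the description.

```python
import os

# Hypothetical stand-in for the "entity" sections of the locale JSON files.
# Key names and word lists are illustrative, not the real schema.
LOCALES = {
    "en": {
        "person_verbs": ["said", "asked"],
        "stopwords": ["the", "and"],
        "candidate_pattern": "[A-Z][a-z]+",
    },
    "pt-br": {
        "person_verbs": ["disse", "said"],
        "stopwords": ["e", "the"],
        "candidate_pattern": "[A-ZÀ-Ú][a-zà-ú]+",
    },
}


def parse_entity_languages(default=("en",)):
    """Read a comma-separated locale list from the environment, else use the default."""
    raw = os.environ.get("MEMPALACE_ENTITY_LANGUAGES", "")
    langs = tuple(part.strip() for part in raw.split(",") if part.strip())
    return langs or tuple(default)


def merge_entity_patterns(langs=("en",)):
    """Union patterns across locales; unknown locales fall back to English."""
    merged = {"person_verbs": [], "stopwords": set(), "candidate_patterns": []}
    for lang in langs:
        entity = LOCALES.get(lang, LOCALES["en"])
        for verb in entity["person_verbs"]:
            if verb not in merged["person_verbs"]:  # dedupe, keep first-seen order
                merged["person_verbs"].append(verb)
        merged["stopwords"] |= set(entity["stopwords"])  # set union
        if entity["candidate_pattern"] not in merged["candidate_patterns"]:
            merged["candidate_patterns"].append(entity["candidate_pattern"])
    return merged
```

Calling `merge_entity_patterns(parse_entity_languages())` would then only pay for the locales a user enabled, which is the scaling property the description argues for.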

igorls · 2026-04-15T16:42:04Z

Thanks @dominosaurs! The id.json content is clean — schema matches en.json, interpolation variables all correct, regex patterns valid.

Two things before merge:

1. CI is failing on a stale branch, not on your changes. The failure is test_version_consistency asserting '3.1.0' == '3.2.0' — this branch predates the v3.2.0 release (#762) so your mempalace/version.py still says 3.1.0. A rebase onto develop will fix it:

git fetch origin
git rebase origin/develop
git push --force-with-lease

2. Please drop the edit to mempalace/i18n/test_i18n.py. That file moves to tests/test_i18n.py via #758 (which will merge ahead of this). The test suite there auto-discovers every *.json in mempalace/i18n/, so the "id" sample is redundant — id.json will be exercised automatically by test_all_languages_load and test_interpolation.
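For readers unfamiliar with the auto-discovery idea in point 2, here is a rough sketch of how a test helper might pick up every locale file without naming it. The helpers `discover_locales` and `missing_keys` and the top-level keys `cli` and `terms` are hypothetical; the real tests (test_all_languages_load, test_interpolation) are not shown in this thread and may work differently.

```python
import json
import tempfile
from pathlib import Path


def discover_locales(i18n_dir):
    """Return {locale_code: parsed_json} for every *.json file in i18n_dir."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(i18n_dir).glob("*.json"))
    }


def missing_keys(reference, candidate):
    """Top-level keys present in the reference locale but absent from the candidate."""
    return sorted(set(reference) - set(candidate))


# Demo against a throwaway directory so the sketch runs without the real repo.
with tempfile.TemporaryDirectory() as tmp:
    sample = json.dumps({"cli": {}, "terms": {}})
    Path(tmp, "en.json").write_text(sample, encoding="utf-8")
    Path(tmp, "id.json").write_text(sample, encoding="utf-8")
    locales = discover_locales(tmp)

# Compare every discovered locale against English.
gaps = {code: missing_keys(locales["en"], data) for code, data in locales.items()}
```

With discovery of this kind, dropping id.json into the directory is enough for it to be checked, so a hand-written "id" sample case adds nothing.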

Minor, non-blocking: the hyphenated verbs in action_pattern (ter-konfigurasi|ter-deploy|ter-migrasi) are valid regex but an unusual Indonesian form — if a native speaker happens to review, worth a second look. Not holding up the merge for it.
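To make the hyphenation concern concrete, the sketch below compares the quoted fragment with a variant where the hyphen is optional. The `tolerant` pattern and the sample sentences are assumptions for illustration only, not a proposal that has been checked by a native speaker, and the word boundaries are added just for this demo.

```python
import re

# Fragment quoted in the review, wrapped in word boundaries for this demo.
hyphenated = re.compile(r"\b(?:ter-konfigurasi|ter-deploy|ter-migrasi)\b", re.IGNORECASE)

# Assumed alternative: make the hyphen optional so both spellings match.
tolerant = re.compile(r"\bter-?(?:konfigurasi|deploy|migrasi)\b", re.IGNORECASE)

# Made-up sample sentences covering hyphenated and unhyphenated spellings.
samples = [
    "Server sudah ter-deploy",
    "Server sudah terdeploy",
    "Data termigrasi semalam",
]

hyphenated_hits = [bool(hyphenated.search(s)) for s in samples]
tolerant_hits = [bool(tolerant.search(s)) for s in samples]
```

Here `hyphenated_hits` is `[True, False, False]` while `tolerant_hits` is `[True, True, True]`, so the strict form would miss text that drops the hyphen.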

Optionally — #911 just landed infra for Indonesian-aware entity detection (so names like Budi, Wayan, Siti get extracted from prose). You can add an entity section to id.json with Indonesian person verbs, pronouns, dialogue markers, and stopwords. Totally optional — the CLI/AAAK work stands on its own.
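As a rough idea of what such an entity section could enable, the sketch below extracts capitalized names that are directly followed by a person verb. Everything in `ID_ENTITY` (key names, verb list, stopword list) and the helper `person_candidates` is hypothetical; the real schema is whatever the entity section of en.json defines, and the actual detector is likely more involved.

```python
import re

# Hypothetical "entity" section for id.json; key names and word lists are illustrative.
ID_ENTITY = {
    "candidate_pattern": r"[A-Z][a-z]+",
    "person_verbs": ["berkata", "bilang", "menjawab"],
    "stopwords": ["Kemarin", "Saya", "Dia"],
}


def person_candidates(text, entity):
    """Capitalized non-stopwords that are directly followed by a person verb."""
    verbs = "|".join(map(re.escape, entity["person_verbs"]))
    pattern = re.compile(rf"\b({entity['candidate_pattern']})\s+(?:{verbs})\b")
    stop = set(entity["stopwords"])
    return [name for name in pattern.findall(text) if name not in stop]


text = "Kemarin Budi bilang deploy gagal, lalu Siti menjawab lewat chat."
names = person_candidates(text, ID_ENTITY)  # ["Budi", "Siti"]
```

The sentence-initial "Kemarin" is capitalized but is not followed by a person verb, so it is not returned; the stopword list guards the cases where such a word does precede a verb.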


dominosaurs · 2026-04-16T10:29:08Z

Thanks for the review and the guidance.

I was honestly a bit clueless at first and approached id.json too narrowly as strict Indonesian-only translation. After looking at it more carefully, I realized that real Indonesian usage today is much more mixed, especially in technical writing. People naturally blend Indonesian with English terms, so a locale that is too rigid ends up less natural and can weaken mining on real text.

I’ve updated the JSON to be more contextual to how Indonesian is actually written today, with the hope that it improves mining quality for Indonesian text in practice. I’m very open to further suggestions from native speakers or anyone with better intuition here.

dominosaurs requested review from bensig and milla-jovovich as code owners April 13, 2026 09:42

dominosaurs changed the title ~~Add Indonesian language support~~ feat: add Indonesian language support Apr 13, 2026

igorls added the area/i18n Multilingual, Unicode, non-English embeddings label Apr 14, 2026

bensig approved these changes Apr 15, 2026


igorls mentioned this pull request Apr 15, 2026

refactor(entity_detector): make multi-language extensible via i18n JSON #911 (Merged, 6 tasks)

dominosaurs force-pushed the feat/id-lang branch from 81664ca to 88f5b5f April 16, 2026 08:16

dominosaurs requested a review from igorls as a code owner April 16, 2026 08:16

dominosaurs marked this pull request as draft April 16, 2026 08:24

feat: Update Indonesian translations

939d4c1

Refine AAAK instruction and expand entity detection patterns.

dominosaurs marked this pull request as ready for review April 16, 2026 10:28

dominosaurs wants to merge 2 commits into MemPalace:develop from dominosaurs:feat/id-lang
