feat: add Indonesian language support #778

Open
dominosaurs wants to merge 2 commits into MemPalace:develop from dominosaurs:feat/id-lang

@dominosaurs dominosaurs commented Apr 13, 2026

What does this PR do?

Introduces the Indonesian (id) locale, providing translations for CLI commands, status messages, and core terminology.

Includes language-specific regex patterns for stop words and action detection to support text processing and indexing in Indonesian. The test suite is updated with a sample case to verify correct dialect handling and compression.

How to test

pytest mempalace/i18n/test_i18n.py -v

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

@dominosaurs dominosaurs changed the title from "Add Indonesian language support" to "feat: add Indonesian language support" Apr 13, 2026
@igorls igorls added the area/i18n label (Multilingual, Unicode, non-English embeddings) Apr 14, 2026
igorls added a commit that referenced this pull request Apr 15, 2026
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.

Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
  patterns across requested languages, dedupes lists, unions stopwords,
  and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
  override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
  Devanagari, CJK) can register their own character classes instead of
  being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
  callers don't poison each other's cache slots
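
The merge behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the actual get_entity_patterns implementation: the in-memory LOCALES table, the key names (person_verbs, pronouns, stopwords), the function name merge_entity_patterns, and the sample words are all assumptions standing in for the real per-locale JSON files.

```python
# Hypothetical stand-in for the "entity" section of each locale's JSON file.
LOCALES = {
    "en": {
        "person_verbs": ["said", "wrote"],
        "pronouns": ["he", "she", "they"],
        "stopwords": ["the", "and"],
    },
    "id": {
        "person_verbs": ["berkata", "menulis"],
        "pronouns": ["dia", "mereka"],
        "stopwords": ["dan", "yang"],
    },
}


def merge_entity_patterns(langs=("en",)):
    """Merge entity patterns across the requested locales (illustrative only)."""
    merged_lists = {}
    stopwords = set()
    for lang in langs:
        # Unknown locales fall back to English, as the commit message describes.
        section = LOCALES.get(lang, LOCALES["en"])
        for key, values in section.items():
            if key == "stopwords":
                stopwords.update(values)  # stopwords are unioned as a set
                continue
            bucket = merged_lists.setdefault(key, [])
            for value in values:
                if value not in bucket:  # dedupe while preserving order
                    bucket.append(value)
    return {**merged_lists, "stopwords": stopwords}
```

Calling merge_entity_patterns(("en", "id")) would return the English lists followed by the Indonesian additions, while an unrecognised code such as "xx" yields the same result as ("en",).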

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.
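
To illustrate the candidate-extraction problem mentioned above, the sketch below runs the ASCII-only [A-Z][a-z]+ class against Cyrillic and accented text. The Cyrillic character class and the helper name find_candidates are assumptions made for this example; the real per-language candidate_pattern values may be defined differently.

```python
import re

# The ASCII-only default quoted in the commit message.
ASCII_CANDIDATE = re.compile(r"[A-Z][a-z]+")

# Hypothetical per-language override for Cyrillic names (assumption, not
# taken from any shipped locale file).
CYRILLIC_CANDIDATE = re.compile(r"[А-ЯЁ][а-яё]+")


def find_candidates(text, pattern):
    """Return every capitalised-word candidate the pattern finds in text."""
    return pattern.findall(text)


russian = "Иван написал Марии"
accented = "João wrote the parser"

# The ASCII class finds nothing in Cyrillic text and truncates accented names.
ascii_on_russian = find_candidates(russian, ASCII_CANDIDATE)
ascii_on_accented = find_candidates(accented, ASCII_CANDIDATE)

# A script-specific class recovers the Cyrillic names.
cyrillic_on_russian = find_candidates(russian, CYRILLIC_CANDIDATE)
```

Here the ASCII pattern returns no candidates for the Russian sentence and only the fragment "Jo" for "João", whereas the Cyrillic class picks up both names.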

This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
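
The cache-key point in the list above (an LRU cache keyed by name and languages) can be sketched with functools.lru_cache. This is only an illustration of the idea: build_name_pattern, the VERBS table, and the regex shape are hypothetical and do not reproduce the real _build_patterns function.

```python
import re
from functools import lru_cache

# Hypothetical per-language verb lists used only for this example.
VERBS = {
    "en": ["said", "wrote"],
    "id": ["berkata", "menulis"],
}


@lru_cache(maxsize=256)
def build_name_pattern(name, languages=("en",)):
    """Compile a 'name + person verb' pattern for the given languages.

    languages must be a tuple so it is hashable and becomes part of the
    cache key together with name.
    """
    verbs = []
    for lang in languages:
        verbs.extend(VERBS.get(lang, VERBS["en"]))
    alternation = "|".join(re.escape(verb) for verb in verbs)
    return re.compile(
        rf"\b{re.escape(name)}\s+(?:{alternation})\b", re.IGNORECASE
    )


english_only = build_name_pattern("Budi", ("en",))
english_and_indonesian = build_name_pattern("Budi", ("en", "id"))
```

Because the language tuple is part of the key, the two calls above produce separate cache entries, so an English-only caller never receives a pattern compiled for a different language set.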
Collaborator

igorls commented Apr 15, 2026

Thanks @dominosaurs! The id.json content is clean — schema matches en.json, interpolation variables all correct, regex patterns valid.

Two things before merge:

1. CI is failing on a stale branch, not on your changes. The failure is test_version_consistency asserting '3.1.0' == '3.2.0' — this branch predates the v3.2.0 release (#762) so your mempalace/version.py still says 3.1.0. A rebase onto develop will fix it:

git fetch origin
git rebase origin/develop
git push --force-with-lease

2. Please drop the edit to mempalace/i18n/test_i18n.py. That file moves to tests/test_i18n.py via #758 (which will merge ahead of this). The test suite there auto-discovers every *.json in mempalace/i18n/, so the "id" sample is redundant — id.json will be exercised automatically by test_all_languages_load and test_interpolation.
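
For readers unfamiliar with the auto-discovery approach, here is a minimal sketch of how a test helper might pick up every locale file without a per-language entry. The helper names discover_locales and load_all_locales are hypothetical, and the real tests/test_i18n.py may be structured differently (for example with pytest parametrization).

```python
import json
from pathlib import Path


def discover_locales(i18n_dir):
    """Return the sorted locale codes for every *.json file in i18n_dir."""
    return sorted(path.stem for path in Path(i18n_dir).glob("*.json"))


def load_all_locales(i18n_dir):
    """Load each discovered locale file and map its code to the parsed JSON."""
    loaded = {}
    for code in discover_locales(i18n_dir):
        with open(Path(i18n_dir) / f"{code}.json", encoding="utf-8") as handle:
            loaded[code] = json.load(handle)
    return loaded
```

With this shape, dropping a new id.json into the directory is enough for it to be loaded and checked alongside the existing locales.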

Minor, non-blocking: the hyphenated verbs in action_pattern (ter-konfigurasi|ter-deploy|ter-migrasi) are valid regex but an unusual Indonesian form — if a native speaker happens to review, worth a second look. Not holding up the merge for it.
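
One possible way to sidestep the hyphenation question, without deciding which spelling is preferred, is to make the hyphen optional in the pattern. The sketch below is an assumption for illustration only: the stem list is a small subset taken from the forms quoted above, and the pattern is not the actual action_pattern from id.json.

```python
import re

# Hypothetical subset of stems from the forms quoted in the review comment.
STEMS = ["konfigurasi", "deploy", "migrasi"]

# "ter" prefix with an optional hyphen, followed by one of the stems.
ACTION = re.compile(r"\bter-?(?:" + "|".join(STEMS) + r")\b", re.IGNORECASE)


def has_action(text):
    """Return True if text contains a hyphenated or unhyphenated ter- form."""
    return ACTION.search(text) is not None
```

Both "ter-deploy" and "terdeploy" would match under this pattern, while unrelated words that merely start with "ter", such as "terminal", would not.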

Optionally — #911 just landed infra for Indonesian-aware entity detection (so names like Budi, Wayan, Siti get extracted from prose). You can add an entity section to id.json with Indonesian person verbs, pronouns, dialogue markers, and stopwords. Totally optional — the CLI/AAAK work stands on its own.

@dominosaurs dominosaurs requested a review from igorls as a code owner April 16, 2026 08:16
@dominosaurs dominosaurs marked this pull request as draft April 16, 2026 08:24
Refine AAAK instruction and expand entity detection patterns.
@dominosaurs dominosaurs marked this pull request as ready for review April 16, 2026 10:28
@dominosaurs
Author

Thanks for the review and the guidance.

I was honestly a bit clueless at first and approached id.json too narrowly as strict Indonesian-only translation. After looking at it more carefully, I realized that real Indonesian usage today is much more mixed, especially in technical writing. People naturally blend Indonesian with English terms, so a locale that is too rigid ends up less natural and can weaken mining on real text.

I’ve updated the JSON to be more contextual to how Indonesian is actually written today, with the hope that it improves mining quality for Indonesian text in practice. I’m very open to further suggestions from native speakers or anyone with better intuition here.

