Skip to content

fix(i18n): resolve language codes case-insensitively (#927)#928

Merged
igorls merged 2 commits intoMemPalace:developfrom
arnoldwender:fix/i18n-lang-case-insensitive
Apr 16, 2026
Merged

fix(i18n): resolve language codes case-insensitively (#927)#928
igorls merged 2 commits intoMemPalace:developfrom
arnoldwender:fix/i18n-lang-case-insensitive

Conversation

@arnoldwender
Copy link
Copy Markdown
Contributor

What and Why

BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1), but the locale files in mempalace/i18n/ mix conventions:

pt-br.json   ← lowercase region
zh-CN.json   ← uppercase region
zh-TW.json   ← uppercase region

On case-sensitive filesystems (Linux, default APFS-CS macOS, strict Windows), --lang PT-BR, --lang zh-cn, or --lang ZH-TW silently missed the file. _load_entity_section() returned {}, the merge loop's found_any stayed False, and entity detection ran in English with no warning.

The cache key in get_entity_patterns() was the raw input tuple, so ("PT-BR",) and ("pt-br",) produced two distinct entries — both wrong.

Reproduction

mempalace mine ~/docs --lang PT-BR
# Expected: Brazilian Portuguese entity patterns merged
# Actual:   silently runs English-only entity detection

Change Summary

  • mempalace/i18n/__init__.py — add _canonical_lang(lang) that resolves any casing to the on-disk filename stem via lowercase comparison. Route load_lang, _load_entity_section, and the get_entity_patterns cache key through it. Behaviour for known locales is unchanged; only the previously-broken case-mismatched paths now succeed.
  • tests/test_i18n_lang_case.py — 8 regression tests: canonical resolution (lower/upper/unknown/empty), load_lang case insensitivity, entity-section parity across cases, cache deduplication across cases, English fallback for genuinely unknown codes still works.

Test Plan

  • ruff check . and ruff format --check . — clean
  • Full suite: pytest tests/ --ignore=tests/benchmarks953 passed
  • New regression suite proves the bug existed: _load_entity_section('PT-BR') returned {} on develop and now returns the same dict as 'pt-br'
  • Confirmed cache deduplication: 3 calls with different casings of zh-CN produce 1 cache entry, not 3

Closes #927

BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1) but the
locale files mix conventions (pt-br.json vs zh-CN.json). On
case-sensitive filesystems, '--lang PT-BR' or '--lang zh-cn' silently
missed the file, _load_entity_section returned {}, and entity
detection ran in English with no warning.

The cache key in get_entity_patterns was built from raw input, so
('PT-BR',) and ('pt-br',) produced two distinct entries, both wrong.

Add _canonical_lang(lang) that resolves any casing to the on-disk
filename stem via lowercase comparison, and route load_lang,
_load_entity_section, and the cache key through it.

Closes MemPalace#927
PEP 604 union syntax (str | None) requires Python 3.10+. The project
supports 3.9 per CI matrix, so use typing.Optional instead.
@arnoldwender
Copy link
Copy Markdown
Contributor Author

CI failure was my mistake — used PEP 604 str | None which requires Python 3.10+. Fixed in 6caac50 by switching to typing.Optional[str]. Local 3.9 + 3.13 verified clean.

@igorls igorls added bug Something isn't working area/i18n Multilingual, Unicode, non-English embeddings labels Apr 15, 2026
Copy link
Copy Markdown
Contributor

@mvalentsev mvalentsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch on the BCP 47 case sensitivity. The bug is real and your fix handles it correctly.

One alternative worth considering: the root cause is that locale filenames mix conventions (pt-br.json lowercase vs zh-CN.json mixed-case). If we rename the files to all-lowercase (zh-cn.json, zh-tw.json), the resolution becomes trivial:

lang_file = _LANG_DIR / f"{lang.strip().lower()}.json"

No glob scan, no lookup table, no new function. pt-br.json already follows this convention, so the inconsistency is only in the Chinese locale filenames.

The tradeoff is that renaming existing files is a (small) breaking change for anyone referencing them by path, while your approach works with any naming convention. Both are valid, curious what maintainers prefer.

@arnoldwender
Copy link
Copy Markdown
Contributor Author

Good point on the naming inconsistency. I'd lean toward my current approach for two reasons:

  1. Robustness to user input, not just shipped files. Even with locale files normalized to lowercase, callers still pass codes from browser Accept-Language headers, config files, CLI args — all of which can arrive in any case (zh-CN, ZH-cn, zh_cn, etc.). The resolver handles that regardless of how we name the files.

  2. Rename is a breaking change for downstream consumers. Anyone importing locale files by path (themes, plugins, third-party tooling) would need coordinated updates.

That said, the two aren't mutually exclusive — renaming files to lowercase and keeping the case-insensitive resolver gives us both the simpler internal code path and the external robustness. Happy to add the rename to this PR if maintainers prefer, or split it into a follow-up. Your call.

@mvalentsev
Copy link
Copy Markdown
Contributor

Fair point on needing normalization regardless of file naming. But .lower() at the API boundary is the resolver -- one line instead of a function that globs the filesystem on every call:

lang = lang.strip().lower()
lang_file = _LANG_DIR / f"{lang}.json"

The locale JSON files are an internal implementation detail, not a public API. Downstream consumers interact through load_lang() / get_entity_patterns(), not by opening files by path. Renaming two files (zh-CN / zh-TW to lowercase) is a safe internal change.

Both approaches work. Leaving it to maintainers to pick whichever they prefer.

@arnoldwender
Copy link
Copy Markdown
Contributor Author

Fair — you're right that load_lang() is the only supported entry point, so the "public API" concern is weaker than I framed it. I'm happy either way. If maintainers prefer the rename-plus-.lower() approach, say the word and I'll fold it into this PR.

@igorls igorls merged commit 65f99ad into MemPalace:develop Apr 16, 2026
6 checks passed
igorls added a commit that referenced this pull request Apr 16, 2026
Bumps version across pyproject.toml, mempalace/version.py, README badge,
and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled
'Unreleased') and adds a 3.3.1 section covering the multi-language
entity-detection infra and the five new locales landed since 2026-04-13.

Highlights:
- Multi-language entity detection infra (#911) + script-aware word
  boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive
  locale resolution (#928) + i18n patterns wired into miner/palace/
  entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907),
  hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales
  (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
@igorls igorls mentioned this pull request Apr 16, 2026
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/i18n Multilingual, Unicode, non-English embeddings bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(i18n): language code lookup is case-sensitive — '--lang PT-BR' silently falls back to English

3 participants