fix(i18n): resolve language codes case-insensitively (#927) by arnoldwender · Pull Request #928 · MemPalace/mempalace

arnoldwender · 2026-04-15T21:33:58Z

What and Why

BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1), but the locale files in mempalace/i18n/ mix conventions:

pt-br.json   ← lowercase region
zh-CN.json   ← uppercase region
zh-TW.json   ← uppercase region

On case-sensitive filesystems (Linux, default APFS-CS macOS, strict Windows), --lang PT-BR, --lang zh-cn, or --lang ZH-TW silently missed the file. _load_entity_section() returned {}, the merge loop's found_any stayed False, and entity detection ran in English with no warning.

The cache key in get_entity_patterns() was the raw input tuple, so ("PT-BR",) and ("pt-br",) produced two distinct entries — both wrong.

Reproduction

mempalace mine ~/docs --lang PT-BR
# Expected: Brazilian Portuguese entity patterns merged
# Actual:   silently runs English-only entity detection

Change Summary

mempalace/i18n/__init__.py — add _canonical_lang(lang) that resolves any casing to the on-disk filename stem via lowercase comparison. Route load_lang, _load_entity_section, and the get_entity_patterns cache key through it. Behaviour for known locales is unchanged; only the previously-broken case-mismatched paths now succeed.
tests/test_i18n_lang_case.py — 8 regression tests: canonical resolution (lower/upper/unknown/empty), load_lang case insensitivity, entity-section parity across cases, cache deduplication across cases, English fallback for genuinely unknown codes still works.

Test Plan

ruff check . and ruff format --check . — clean
Full suite: pytest tests/ --ignore=tests/benchmarks — 953 passed
New regression suite proves the bug existed: _load_entity_section('PT-BR') returned {} on develop and now returns the same dict as 'pt-br'
Confirmed cache deduplication: 3 calls with different casings of zh-CN produce 1 cache entry, not 3

Closes #927

BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1) but the locale files mix conventions (pt-br.json vs zh-CN.json). On case-sensitive filesystems, '--lang PT-BR' or '--lang zh-cn' silently missed the file, _load_entity_section returned {}, and entity detection ran in English with no warning. The cache key in get_entity_patterns was built from raw input, so ('PT-BR',) and ('pt-br',) produced two distinct entries, both wrong. Add _canonical_lang(lang) that resolves any casing to the on-disk filename stem via lowercase comparison, and route load_lang, _load_entity_section, and the cache key through it. Closes MemPalace#927

PEP 604 union syntax (str | None) requires Python 3.10+. The project supports 3.9 per CI matrix, so use typing.Optional instead.

arnoldwender · 2026-04-15T21:37:18Z

CI failure was my mistake — used PEP 604 str | None which requires Python 3.10+. Fixed in 6caac50 by switching to typing.Optional[str]. Local 3.9 + 3.13 verified clean.

mvalentsev

Nice catch on the BCP 47 case sensitivity. The bug is real and your fix handles it correctly.

One alternative worth considering: the root cause is that locale filenames mix conventions (pt-br.json lowercase vs zh-CN.json mixed-case). If we rename the files to all-lowercase (zh-cn.json, zh-tw.json), the resolution becomes trivial:

lang_file = _LANG_DIR / f"{lang.strip().lower()}.json"

No glob scan, no lookup table, no new function. pt-br.json already follows this convention, so the inconsistency is only in the Chinese locale filenames.

The tradeoff is that renaming existing files is a (small) breaking change for anyone referencing them by path, while your approach works with any naming convention. Both are valid, curious what maintainers prefer.

arnoldwender · 2026-04-15T23:58:51Z

Good point on the naming inconsistency. I'd lean toward my current approach for two reasons:

Robustness to user input, not just shipped files. Even with locale files normalized to lowercase, callers still pass codes from browser Accept-Language headers, config files, CLI args — all of which can arrive in any case (zh-CN, ZH-cn, zh_cn, etc.). The resolver handles that regardless of how we name the files.
Rename is a breaking change for downstream consumers. Anyone importing locale files by path (themes, plugins, third-party tooling) would need coordinated updates.

That said, the two aren't mutually exclusive — renaming files to lowercase and keeping the case-insensitive resolver gives us both the simpler internal code path and the external robustness. Happy to add the rename to this PR if maintainers prefer, or split it into a follow-up. Your call.

mvalentsev · 2026-04-16T00:08:52Z

Fair point on needing normalization regardless of file naming. But .lower() at the API boundary is the resolver -- one line instead of a function that globs the filesystem on every call:

lang = lang.strip().lower()
lang_file = _LANG_DIR / f"{lang}.json"

The locale JSON files are an internal implementation detail, not a public API. Downstream consumers interact through load_lang() / get_entity_patterns(), not by opening files by path. Renaming two files (zh-CN / zh-TW to lowercase) is a safe internal change.

Both approaches work. Leaving it to maintainers to pick whichever they prefer.

arnoldwender · 2026-04-16T17:51:54Z

Fair — you're right that load_lang() is the only supported entry point, so the "public API" concern is weaker than I framed it. I'm happy either way. If maintainers prefer the rename-plus-.lower() approach, say the word and I'll fold it into this PR.

Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13. Highlights: - Multi-language entity detection infra (#911) + script-aware word boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive locale resolution (#928) + i18n patterns wired into miner/palace/ entity_registry (#931) - Five new fully-supported locales: pt-br (#156), ru (#760), it (#907), hi (#773), id (#778) - UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (#946) - KnowledgeGraph lock correctness (#884, #887) - Various smaller fixes and improvements

arnoldwender requested review from bensig, igorls and milla-jovovich as code owners April 15, 2026 21:33

fix(i18n): use Optional[str] for Python 3.9 compatibility

6caac50

PEP 604 union syntax (str | None) requires Python 3.10+. The project supports 3.9 per CI matrix, so use typing.Optional instead.

igorls added bug Something isn't working area/i18n Multilingual, Unicode, non-English embeddings labels Apr 15, 2026

mvalentsev reviewed Apr 15, 2026

View reviewed changes

igorls merged commit 65f99ad into MemPalace:develop Apr 16, 2026
6 checks passed

igorls mentioned this pull request Apr 16, 2026

release: v3.3.1 #957

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(i18n): resolve language codes case-insensitively (#927)#928

fix(i18n): resolve language codes case-insensitively (#927)#928
igorls merged 2 commits intoMemPalace:developfrom
arnoldwender:fix/i18n-lang-case-insensitive

arnoldwender commented Apr 15, 2026

Uh oh!

arnoldwender commented Apr 15, 2026

Uh oh!

mvalentsev left a comment

Uh oh!

arnoldwender commented Apr 15, 2026

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

arnoldwender commented Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

arnoldwender commented Apr 15, 2026

What and Why

Reproduction

Change Summary

Test Plan

Uh oh!

arnoldwender commented Apr 15, 2026

Uh oh!

mvalentsev left a comment

Choose a reason for hiding this comment

Uh oh!

arnoldwender commented Apr 15, 2026

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

arnoldwender commented Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants