fix(i18n): resolve language codes case-insensitively (#927) #928
BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1) but the
locale files mix conventions (pt-br.json vs zh-CN.json). On
case-sensitive filesystems, '--lang PT-BR' or '--lang zh-cn' silently
missed the file, _load_entity_section returned {}, and entity
detection ran in English with no warning.
The cache key in get_entity_patterns was built from raw input, so
('PT-BR',) and ('pt-br',) produced two distinct entries, both wrong.
Add _canonical_lang(lang) that resolves any casing to the on-disk
filename stem via lowercase comparison, and route load_lang,
_load_entity_section, and the cache key through it.
Closes MemPalace#927
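For readers skimming the commit message, the resolution step can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the function name `canonical_lang`, the explicit `lang_dir` parameter, and the temporary demo directory are assumptions made so the snippet runs on its own, and the directory scan is just one way to do the lowercase comparison.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def canonical_lang(lang: str, lang_dir: Path) -> Optional[str]:
    """Return the on-disk filename stem that matches `lang` ignoring case.

    Hypothetical helper: returns None for empty or unknown codes.
    """
    wanted = lang.strip().lower()
    if not wanted:
        return None
    # Compare lowercase stems so 'PT-BR' finds pt-br.json and 'zh-cn' finds zh-CN.json.
    for path in sorted(lang_dir.glob("*.json")):
        if path.stem.lower() == wanted:
            return path.stem
    return None


# Demo against a throwaway directory that mimics the mixed naming conventions.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for stem in ("en", "pt-br", "zh-CN"):
        (demo_dir / f"{stem}.json").write_text(json.dumps({}), encoding="utf-8")
    resolved = [canonical_lang(code, demo_dir) for code in ("PT-BR", "zh-cn", "xx", "")]
```

Returning the stem as it exists on disk, rather than the lowercased input, is what lets the same value serve both as a path component and as a cache key.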
PEP 604 union syntax (str | None) requires Python 3.10+. The project supports 3.9 per CI matrix, so use typing.Optional instead.
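A small sketch of the 3.9-compatible spelling, for context. The function `normalize` is a hypothetical stand-in, not a name from this PR. On Python 3.9 an annotation written as `str | None` is evaluated when the function is defined and raises a TypeError unless `from __future__ import annotations` is in effect, whereas `Optional[str]` works on every supported version.

```python
from typing import Optional


def normalize(lang: Optional[str]) -> Optional[str]:
    """Lowercase and trim a language code; pass None through unchanged."""
    if lang is None:
        return None
    return lang.strip().lower()
```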
CI failure was my mistake — used PEP 604 |
mvalentsev
left a comment
Nice catch on the BCP 47 case sensitivity. The bug is real and your fix handles it correctly.
One alternative worth considering: the root cause is that locale filenames mix conventions (pt-br.json lowercase vs zh-CN.json mixed-case). If we rename the files to all-lowercase (zh-cn.json, zh-tw.json), the resolution becomes trivial:
lang_file = _LANG_DIR / f"{lang.strip().lower()}.json"
No glob scan, no lookup table, no new function. pt-br.json already follows this convention, so the inconsistency is only in the Chinese locale filenames.
The tradeoff is that renaming existing files is a (small) breaking change for anyone referencing them by path, while your approach works with any naming convention. Both are valid, curious what maintainers prefer.
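To make the alternative concrete, here is a rough sketch of what resolution could look like if every locale file were renamed to lowercase. The names `LANG_DIR` and `lang_file_for` are placeholders for illustration, and the sketch only builds the path; it does not check that the file exists.

```python
from pathlib import Path

# Hypothetical locale directory; the real project keeps its own constant.
LANG_DIR = Path("i18n")


def lang_file_for(lang: str, lang_dir: Path = LANG_DIR) -> Path:
    """Map any casing of a language code to an all-lowercase filename."""
    return lang_dir / f"{lang.strip().lower()}.json"


paths = [lang_file_for(code).name for code in ("zh-CN", " ZH-TW ", "pt-br")]
```

This only works once the files on disk are themselves lowercase, which is the rename discussed above.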
Good point on the naming inconsistency. I'd lean toward my current approach for two reasons:
That said, the two aren't mutually exclusive: renaming files to lowercase and keeping the case-insensitive resolver gives us both the simpler internal code path and the external robustness. Happy to add the rename to this PR if maintainers prefer, or split it into a follow-up. Your call.
Fair point on needing normalization regardless of file naming. But
lang = lang.strip().lower()
lang_file = _LANG_DIR / f"{lang}.json"
The locale JSON files are an internal implementation detail, not a public API. Downstream consumers interact through
Both approaches work. Leaving it to maintainers to pick whichever they prefer.
Fair — you're right that |
Bumps version across pyproject.toml, mempalace/version.py, README badge, and uv.lock. Finalizes the 3.3.0 CHANGELOG section (was still labeled 'Unreleased') and adds a 3.3.1 section covering the multi-language entity-detection infra and the five new locales landed since 2026-04-13.
Highlights:
- Multi-language entity detection infra (#911) + script-aware word boundaries for combining-mark scripts (#932) + BCP 47 case-insensitive locale resolution (#928) + i18n patterns wired into miner/palace/entity_registry (#931)
- Five new fully-supported locales: pt-br (#156), ru (#760), it (#907), hi (#773), id (#778)
- UTF-8 encoding fix on read_text() calls for non-UTF-8 Windows locales (#946)
- KnowledgeGraph lock correctness (#884, #887)
- Various smaller fixes and improvements
What and Why
BCP 47 language tags are case-insensitive (RFC 5646 §2.1.1), but the locale files in mempalace/i18n/ mix conventions:
On case-sensitive filesystems (Linux, default APFS-CS macOS, strict Windows), --lang PT-BR, --lang zh-cn, or --lang ZH-TW silently missed the file. _load_entity_section() returned {}, the merge loop's found_any stayed False, and entity detection ran in English with no warning.
The cache key in get_entity_patterns() was the raw input tuple, so ("PT-BR",) and ("pt-br",) produced two distinct entries, both wrong.
Reproduction
Change Summary
- mempalace/i18n/__init__.py: add _canonical_lang(lang) that resolves any casing to the on-disk filename stem via lowercase comparison. Route load_lang, _load_entity_section, and the get_entity_patterns cache key through it. Behaviour for known locales is unchanged; only the previously-broken case-mismatched paths now succeed.
- tests/test_i18n_lang_case.py: 8 regression tests covering canonical resolution (lower/upper/unknown/empty), load_lang case insensitivity, entity-section parity across cases, cache deduplication across cases, and the English fallback for genuinely unknown codes.
Test Plan
- ruff check . and ruff format --check .: clean
- pytest tests/ --ignore=tests/benchmarks: 953 passed
- _load_entity_section('PT-BR') returned {} on develop and now returns the same dict as 'pt-br'
- Case variants of zh-CN produce 1 cache entry, not 3
Closes #927
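The cache deduplication check in the test plan can be illustrated with a standalone sketch. Everything here is assumed for the example: `KNOWN_STEMS`, `_canon`, `_patterns_for`, and `get_patterns` are hypothetical stand-ins rather than the project's functions, and `functools.lru_cache` is used only as a convenient cache with an inspectable size.

```python
from functools import lru_cache
from typing import Tuple

# Hypothetical lookup from lowercase code to on-disk stem.
KNOWN_STEMS = {"en": "en", "pt-br": "pt-br", "zh-cn": "zh-CN"}


def _canon(code: str) -> str:
    """Resolve a code to its canonical stem; unknown codes pass through."""
    return KNOWN_STEMS.get(code.strip().lower(), code)


@lru_cache(maxsize=None)
def _patterns_for(canonical_langs: Tuple[str, ...]) -> Tuple[str, ...]:
    # Placeholder for the real pattern loading; returns its key for inspection.
    return canonical_langs


def get_patterns(langs: Tuple[str, ...]) -> Tuple[str, ...]:
    """Canonicalize before the cache lookup so casings share one entry."""
    return _patterns_for(tuple(_canon(code) for code in langs))


for spelling in ("zh-CN", "zh-cn", "ZH-CN"):
    get_patterns((spelling,))
cache_size = _patterns_for.cache_info().currsize
```

Canonicalizing before the cached call, rather than inside it, is what keeps the three spellings from producing three entries.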