A structure-aware document localization project for the GitHub Copilot CLI Challenge, built only with free libraries.
Supported input formats:
- `.txt`
- `.docx`
- text-based `.pdf`
- screenshots/images: `.png`, `.jpg`, `.jpeg` (OCR pipeline)

Supported locales: `de_de`, `es_es`, `fr_fr`, `it_it`, `ja_jp`, `ko_kr`, `pt_br`, `zh_cn`, `zh_tw`
- Language/term localization (rule-based)
- Currency conversion with an editable, locale-specific default USD FX rate
- Date/time + timezone conversion
- Measurement/unit conversion (`mi -> km`, `lb -> kg`, `F -> C`)
- Address/phone/postal adaptation by locale
- Tax/VAT/GST label adaptation + compliance labels
- Legal clause lock/protect zones (`[[LOCK]]...[[/LOCK]]`)
- Terminology memory (`term_memory.json`)
- Style/tone presets (`formal`, `legal`, `technical`, `marketing`)
- Table overflow risk hints (DOCX)
- Cross-reference/TOC/page-reference QA warnings
- Font/script fallback QA checks for CJK
- Approval workflow states (`Draft`, `Legal Review`, `Final`)
- Attractive dashboard UI (hero header, styled cards)
- Animated Before/After scorecards
- Visual side-by-side diff
- Layout risk heatmap
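
To make the lock-zone and unit-conversion features above concrete, here is a minimal sketch of how rule-based conversion could skip text wrapped in `[[LOCK]]...[[/LOCK]]`. The function names, the regular expressions, and the one-decimal rounding are assumptions made for illustration, not the project's actual implementation.

```python
import re

# Lock-zone syntax as listed in the features: [[LOCK]]...[[/LOCK]]
# The capturing group makes re.split keep the locked spans in its output.
LOCK_RE = re.compile(r"(\[\[LOCK\]\].*?\[\[/LOCK\]\])", re.DOTALL)

# Hypothetical conversion table: source unit -> (target unit, converter)
CONVERSIONS = {
    "mi": ("km", lambda v: v * 1.609344),
    "lb": ("kg", lambda v: v * 0.45359237),
    "F": ("C", lambda v: (v - 32) * 5 / 9),
}

# A number followed by one of the known units, e.g. "26.2 mi" or "98.6 F"
UNIT_RE = re.compile(r"(-?\d+(?:\.\d+)?)\s*(mi|lb|F)\b")


def convert_units(text: str) -> str:
    """Rewrite every recognized '<number> <unit>' pair to its metric form."""
    def repl(match: re.Match) -> str:
        value = float(match.group(1))
        target_unit, convert = CONVERSIONS[match.group(2)]
        return f"{convert(value):.1f} {target_unit}"  # rounding is an assumption
    return UNIT_RE.sub(repl, text)


def localize(text: str) -> str:
    """Convert units everywhere except inside lock zones."""
    parts = LOCK_RE.split(text)
    # With one capturing group, locked spans land at the odd indexes.
    return "".join(
        part if index % 2 else convert_units(part)
        for index, part in enumerate(parts)
    )


print(localize("Drive 10 mi. [[LOCK]]Keep 10 mi[[/LOCK]] at 212 F"))
# Drive 16.1 km. [[LOCK]]Keep 10 mi[[/LOCK]] at 100.0 C
```

Splitting on the lock markers first keeps the protected clauses byte-for-byte identical, so any later rule (terminology, dates, currency) can reuse the same split without knowing about locks.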
Dependencies: `streamlit`, `python-docx`, `pypdf`, `reportlab`, `pymupdf` (layout-preserving PDF localization), `pillow`, `pytesseract`
cd "/Users/swatigoyal/Documents/New project/document_localizer_challenge"
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

streamlit run app.py

python -m localizer.cli input.docx output.docx --locale de_de
python -m localizer.cli input.pdf output.pdf --locale ja_jp --source-timezone America/Los_Angeles --tone legal
python -m localizer.cli screenshot.png localized.txt --locale fr_fr --workflow "Legal Review"

Screenshot OCR requires a local Tesseract binary in addition to pytesseract.
- macOS (example): `brew install tesseract`
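
The CLI invocations above use two positional paths plus the `--locale`, `--source-timezone`, `--tone`, and `--workflow` flags. The following is a hypothetical `argparse` sketch of a parser that would accept those commands. The choice lists are copied from this README, but the defaults and the decision to make `--locale` required are assumptions, and the real `localizer.cli` may be structured differently.

```python
import argparse

# Values copied from this README; the parser itself is a hypothetical sketch.
LOCALES = ["de_de", "es_es", "fr_fr", "it_it", "ja_jp",
           "ko_kr", "pt_br", "zh_cn", "zh_tw"]
TONES = ["formal", "legal", "technical", "marketing"]
WORKFLOWS = ["Draft", "Legal Review", "Final"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="localizer.cli")
    parser.add_argument("input", help="source document or screenshot")
    parser.add_argument("output", help="path for the localized output")
    # Assumption: a target locale must always be given.
    parser.add_argument("--locale", required=True, choices=LOCALES)
    parser.add_argument("--source-timezone", default=None,
                        help="IANA zone name, e.g. America/Los_Angeles")
    # Assumption: the defaults below are illustrative only.
    parser.add_argument("--tone", choices=TONES, default="formal")
    parser.add_argument("--workflow", choices=WORKFLOWS, default="Draft")
    return parser


args = build_parser().parse_args([
    "input.pdf", "output.pdf",
    "--locale", "ja_jp",
    "--source-timezone", "America/Los_Angeles",
    "--tone", "legal",
])
print(args.locale, args.source_timezone, args.tone, args.workflow)
# ja_jp America/Los_Angeles legal Draft
```

Restricting `--locale`, `--tone`, and `--workflow` with `choices` lets `argparse` reject unsupported values before any document is opened.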
- Upload DOCX/PDF/screenshot.
- Change locale and watch default FX auto-update.
- Adjust tone + workflow state.
- Run localization and show scorecards, diff, heatmap, QA.
- Download localized output + QA report.