Skip to content

Conversation

johnzhou721
Copy link
Contributor

@johnzhou721 johnzhou721 commented Jun 28, 2025

Fixes #683, fixes #692, fixes #689 (3rd through plugin).
 

Motivation

 
The existing model for maintaining translations for recurring phrases (databags) causes some issues;  translators often do not get the nessacary context, the strings are not full sentences or the entire label, and that the strings for other languages aren't nessacarily updated when a new English key is added; in addition, this is an ad-hoc format and there is no way to keep track of whether a translation is a genuine human one or a machine translation, or how much work we've done in translating databags.
 
gettext is the standard solution for most translation issues; lektor-i18n-plugin provides functionality to translate templates properly; this PR contains the nessacary setup, with a PR on lektor-i18n-plugin (beeware/lektor-i18n-plugin#6) containing some flags ensuring the sorting of strings for safety.
 
While we're updating the lektor-i18n-plugin, since a new release will take some time, some unrelated i18n issues have also been addressed.
 

Summary

 
In this PR:

  • Needed setup for string extraction to work.
  • Converts the templates to use the gettext format;  some unlocalized and/or unpluralized have been fixed along the way.  In addition some more l10n fixes are included, like when generating a superpower list we don’t just use ", " but we allow use of the PO file to customize the item separator.
  • Manually inputs all strings in the existing databags into the PO files.
    • Since the lack of context means that some translations won’t be 100% correct even if done by native people, precautions have been taken to ensure that these likely-not-the-best translations and/or translations not produced by native speakers of that language (from commit history) are marked as FUZZY.
  • Updates the plugin version, and some miscellaneous manual modifications to the top of the PO files, that are easier to explain along with the plugin updates below.

Since I figured that updating the plugin isn’t really done often, I decided to fix all bugs I can find in the plugin to avoid the need to update lektor-i18n-plugin repo again.  The changes for the i18n plugin includes the following list; I was trying to fix all the bugs since I figured making a release would be complicated, some of those changes aren’t really relevant to this PR.

  • Ensuring that all POT generation steps sorts the keys (related to Website translation).
  • Using xgettext to combine POT files, as it can properly combine the headers and context.
  • Filling out the Project-Id-Version header entry when generating PO files from the POT for the first time, since we’re in a position to know what the project name is in the plugin.
    • This was needed for local work as the PO editor I used to fill in the strings would refuse to save any file without Project-Id-Version;  however, new msgmerges simply uses the original header, so I have updated the headers of all the languages manually to match what it’d look like if the PO files are reinitialized (except for a comment block on top saying it’s distributed as the same license as the main project).
  • Template translation now supports pgettext and npgettext methods.
  • There used to be a bug due to a limitation of msginit in lektor-i18n-plugin upstream as well where if your content language is not in English there will be extra automatically populated strings in the English version of the PO file.  This bug is now fixed.  (not really relevant to our own Website, just trying to fix more bugs here)
  • Conversely, when the PO file is updated each time for the English version, the new msgids aren’t automatically filled into the msgstr fields (which is different from when the PO files are initially initialized).  This is now resolved.
    • There’s a functional impact of this for our website:  if a person has a string but later adds a plural form for it, msgmerge will actually fill the new plural field with the previous singular msgstr, creating the illusion of plurals not working, and thus requires manual updating of the English PO file to resolve this.
  • Fixed a bug in lr file translation – see Two Seemingly Untranslated Strings #689.
     
     

Answers about Huge Diffs on Files

  • The huge diff in the POT file is caused by using xgettext to combine POTs; the POT file is now consistent with how our actual PO files are formatted, which is a good thing.  The diff there will be one-off.
  • There’s a big diff in Chinese PO file.  When attempting to fix a bug in the plugin for filling in msgstrs since I used polib I didn’t trust its formatting so I reformatted all the PO files at the end of end of each build, producing unnessacery diffs (since Weblate wraps differently).
    • This no longer happens and I have tested the behavior.
    • Nevertheless it’d be good to have strings wrapped the right way in gettext conventions so I rewrapped all the PO files just for consistency.
       

Things I marked as Fuzzy or Need Editing

  • GitHub blame shows that the phrase has been changed in a PR that is not purely a translation PR. This brings up the possibility of machine translation (although not always, I guess there might be polyglots but let's be on the safe side)
  • When plural forms have been either machine-translated or I’ve just copied the singular form over.
  • When the previous databag storage wasn’t an entire recurring label or sentence.  Translators probably didn't have much context back then
  • When older extracted strings didn’t extract the punctuation out and I manually added them everywhere (since different languages used different conventions).  Sometimes I try to add a semicolon or whatever for that language instead of the English semicolon.

Editing Done

  • All introduced strings have been re-audited an unfuzzied in Simplified Chinese
  • Some strings in Traditional Chinese Google-Translated.

Questions

  • Should I credit the translators properly by putting the databag-translator's name and email address into the po file, if no one so far had translated it on Weblate? Is it a privacy concern to commit... email addresses and (potentially full) names??
  • In the lektorproject, the locale for pt_BR is br and not pt_BR. What is the purpose of that?
    https://github.com/beeware/beeware.github.io/blob/lektor/BeeWare.lektorproject#L17-L20  (EDIT -- normalized by me already)

Copy link
Contributor Author

@johnzhou721 johnzhou721 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes. I'm really sweating right now after slaving away a solid 3 hours of my life that I'll never get back but am happy to give away... so need to step away from this for now, updating all the po files by hand is going to be hard, so @freakboy3742 or any adults w/ a credit card who signed up for a deepL key, maybe we should just deepL all the strings? Or I'll copy them in tomorrow or after dinner today.

@@ -83,7 +83,7 @@ locale = fa_IR
lektor-github-repos = 0.1.1
lektor-gravatar = 0.1.3
lektor-markdown-admonition = 0.3.1
git+https://github.com/beeware/lektor-i18n-plugin@v0.5.4 =
../lektor-i18n-plugin =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to replace with new version after beeware/lektor-i18n-plugin#6 gets merged. That is just for extra safety, seem to work fine without it but you never know when the internal impls of any of those programs change...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

../ is my local checkout I used to update the pos

@@ -1,2 +1,3 @@
[jinja2: **/templates/**.html]
encoding = utf-8
trimmed = True
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -9,6 +9,9 @@

@pass_context
def translate(context, string, bag_name="translate"):
if bag_name == 'translate':
raise RuntimeError("Use the new gettext system instead")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While doing this PR. eventually I will replace all other bags as well.

@@ -8,16 +8,16 @@
<div class="container">
<p>{{ breadcrumbs(this) }}</p>
<h1>{{ this.title }}</h1>
<p>{{ "posted_by"|trans }}
{% if this.mastodon_handle %}
{% set author_link %}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[sweating intensifies]

{% elif this.event_type == "keynote" %}
<p>{% trans title=this.title, url=this.url, talk_title=this.talk_title %}{{ speakers_list }} will be keynoting at {{ title }}, giving a presentation entitled "<a href="{{ url }}">{{ talk_title }}</a>".{% endtrans %}</p>
{% elif this.event_type == "tutorial" %}
<p>{% trans title=this.title, url=this.url, talk_title=this.talk_title %}{{ speakers_list }} will be presenting a tutorial at {{ title }} entitled "<a href="{{ url }}">{{ talk_title }}</a>".{% endtrans %}</p>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the duplication here... but I realize I can't do away with it.

@johnzhou721
Copy link
Contributor Author

@HalfWhitt I apologize that some code you wrote for translation bags might get deleted. The bird has flown away... (continuing the extended metaphor in the other thread)

@johnzhou721
Copy link
Contributor Author

Yikes... looks like if you're doing _(name) when name is a variable, it does not get extracted... this happens with the badge macro. Which means this PR must be reviewed extremely carefully for such things.

@johnzhou721
Copy link
Contributor Author

FYI -- marked some arabic strings as fuzzy b/c cannot figure out where to put periods correctly... i'm using textedit to edit the po files directly, spent some time yak shaving to use emacs po-mode but realized I don't know how to use emacs...

@johnzhou721
Copy link
Contributor Author

Yikes. More strings not extracted out of event.html...

@johnzhou721
Copy link
Contributor Author

Isolating event.html into a seperate directory and deleting random components shows that deleting

{% for slug in this.speaker %}
    {%
        do speaker_names.update(
            {slug: members.filter(F._slug == slug).first().name}
        )
    %}
{% endfor %}

will cause babel to extract properly.

@johnzhou721
Copy link
Contributor Author

Progress: python-babel/babel#1216 reports a simple MWE for this situation.

@johnzhou721
Copy link
Contributor Author

Given that this seems to be a bug, I'm going to work around it by separating the logic into a seperate macro. To pass Python dicts around, I will put filters that converts to and from JSON in our plugin.

[sweats]

@johnzhou721
Copy link
Contributor Author

And... now the messages are extracted! Lektorbuilding to update the translation files and then committing.

@johnzhou721
Copy link
Contributor Author

FYI beeware/lektor-i18n-plugin#5 caused a huge diff in 42a8f3c because of the way xgetttext seems to output stuff differently, but a cursory diff will get you the conclusion that it's actually good because now the difference between the format of the pot and the po file is 0, sans the translated strings.

@johnzhou721
Copy link
Contributor Author

@freakboy3742 @HalfWhitt (latter -- since you started the freeform) I'd like some preliminary comments on this before I go copy all the strings into the po files and suggestions on how to work around the issue linked #684 (comment) ? 'cause this is going to produce huge, multithousand-line diffs.

Read the above comments though, if you have time, especially the last one. Thanks

@johnzhou721
Copy link
Contributor Author

Maybe CI is using Ubuntu 24.04 with gettext 0.21 while I'm using gettext 0.25 on macOS and we're hitting the test case over here at https://github.com/translate/translate/pull/5439/files which differentiates b/w <=0.23 and >0.23...

See: https://launchpad.net/ubuntu/noble/+package/gettext

OK... time to go yak shaving tomorrow to install gettext 0.21...

@johnzhou721
Copy link
Contributor Author

Hmm... maybe let me try my ubuntu 22.04 vm tomorrow. See how that works.


#: https://beeware.org/ (content/contents+en.lr:button-block.label)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI beeware/lektor-i18n-plugin#5 caused a huge diff in 42a8f3c because of the way xgetttext seems to output stuff differently, but a cursory diff will get you the conclusion that it's actually good because now the difference between the format of the pot and the po file is 0, sans the translated strings.

Not in the specific hash but looking at these changes

@johnzhou721
Copy link
Contributor Author

Maybe CI is using Ubuntu 24.04 with gettext 0.21 while I'm using gettext 0.25 on macOS and we're hitting the test case over here at https://github.com/translate/translate/pull/5439/files which differentiates b/w <=0.23 and >0.23...

See: https://launchpad.net/ubuntu/noble/+package/gettext

OK... time to go yak shaving tomorrow to install gettext 0.21...

This is resolved now.

@johnzhou721 johnzhou721 marked this pull request as ready for review June 30, 2025 15:23
@johnzhou721
Copy link
Contributor Author

FWIW: I can't figure out why the zh_CN has this huge wrapping diff. It's formatted using Ubuntu 24.04 on GitHub codespaces; it matches the version used in CI. Maybe some of my plugin changes? The CI isn't re-wrapping it back, either.

@johnzhou721
Copy link
Contributor Author

Strangely, using the new po file with the state of beeware.github.io before this PR causes an update translation, but does not re-wrap the strings back.

What the heck is even going on here? I am totally confused to why on Ubuntu 22.04 and 24.04 both, they both would rewrap the strings from how Weblate wraps them... but this is not happening on the main branch! And copying the new PO file onto the main branch and updating translations in the CI in my fork does not wrap them back!

@johnzhou721 johnzhou721 changed the title Use gettext for recurring phrases!!!!! Use gettext for recurring phrases, fix a bunch of i18n issues!!!!! Jul 22, 2025
@johnzhou721
Copy link
Contributor Author

@freakboy3742 I'm sorry if I'm coming across as pushy, but now that (unlike the description at the top says) the dependency for lektor-i18n is pointing at my fork, and also since this seem to require 2 builds when I'm lektor-servering sometimes for some reason, I'd like preview applied to see if this works with the CI pipeline.

Also -- I still can't figure out why the huge wrapping diff. Seems like the diff is on some newer strings I've (and the other contributers have) translated -- the old ones aren't getting modified. I can't pinpoint the exact issue, though... and now I'm sorta stuck in this aspect. If you get these translation files from Weblate into CI it'd probably rewrap all the strings every time there is a new transltion... so this is an issue we need to solve here... I have completely no idea of what's going on.

Another issue is how to test lektor-i18n-plugin... some of the logic is easy to get wrong in there.

[note I ended up carrying my macbook for some other reason so I have access to a machine to work on this (and togax_qt) in some extremely limited time.]

Thanks for taking a look @ this!

@freakboy3742
Copy link
Member

@freakboy3742 I'm sorry if I'm coming across as pushy, but now that (unlike the description at the top says) the dependency for lektor-i18n is pointing at my fork, and also since this seem to require 2 builds when I'm lektor-servering sometimes for some reason, I'd like preview applied to see if this works with the CI pipeline.

Also -- I still can't figure out why the huge wrapping diff. Seems like the diff is on some newer strings I've (and the other contributers have) translated -- the old ones aren't getting modified. I can't pinpoint the exact issue, though... and now I'm sorta stuck in this aspect. If you get these translation files from Weblate into CI it'd probably rewrap all the strings every time there is a new transltion... so this is an issue we need to solve here... I have completely no idea of what's going on.

Another issue is how to test lektor-i18n-plugin... some of the logic is easy to get wrong in there.

[note I ended up carrying my macbook for some other reason so I have access to a machine to work on this (and togax_qt) in some extremely limited time.]

Thanks for taking a look @ this!

So - I think what you're asking here is "can you please apply the preview tag so I can see if this works in CI".

However, because it's 5 paragraphs long, including some digressions about diffs and the computers you have available while travelling, it's difficult to tell. The diff thing may be a request for assistance/explanation on one aspect of the patch; if that's the case, I honestly couldn't tell you what aspect of the patch you're asking for feedback on.

You have been asked, on many occasions, to be more intentional when you're communicating. Github comments aren't a social forum, and they're not made better by having an unfiltered stream of consciousness rendered in text.

If you're asking someone for something specific - ask for that specific thing. Additional detail only serves to confuse the issue, obscuring what you actually want behind a veil of irrelevant details.

@freakboy3742 freakboy3742 added the preview Approved for an automated preview label Aug 4, 2025
Copy link

github-actions bot commented Aug 4, 2025

Visit the preview URL for this PR (updated for commit 2ebe489):

https://beeware-org--pr684-jinjai18n-04pcm3hg.web.app

(expires Mon, 11 Aug 2025 08:51:55 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: b0da44bc067e7d9a4255c77cb2c5fce572218cec

@johnzhou721
Copy link
Contributor Author

Oh… talking about how Ubuntu 24.04 keeps wanting to rewrap the Weblate output when building this PR but not on main.

Might be related to the POT formatting changes… I’d need to get a minimal example working to see what is going on.

@johnzhou721
Copy link
Contributor Author

Okay... I think it's sort of clear that the rewrapping bug may be an error in Translate Toolkit. I have made an minimum working example and posted at translate/translate#5661

@freakboy3742 So what I'm actually asking for next besides potential ways the bug is happening (which I already filed over there) is for you to take a look at the plugin changes and see if you're okay with them. Doing this one step at a time, let's get that reviewed and merged first before we discuss more issues about this PR here. Thanks!

@johnzhou721
Copy link
Contributor Author

Fwiw before I get on an airplane: the line wrapping thing is a translate toolkit / weblate bug…not sure why it’s not happening to main yet.

(probably because there hasn’t been a bunch of translation update commits in awhile… seems that wrap only happens when the pot gets altered.)

@johnzhou721
Copy link
Contributor Author

FWIW: I had no idea why this sort of thing happened with the line wrapping... msgmerge seem to rewrap less but msgcat rewraps every time -- I was using msgcat in one of the old commits in lektor-i18n-plugin before (to reformat everything once they're done) on the final po file and that reformats. I've pasted in old-formatted strings into the new zh_CN po and clean and built and now there's no changes from the old-formatted strings -- so all this rewrap is one-off (hopefully) due to probably mismatched lektor plugin reinstalls, but I'm not reverting the rewrap b/c it seems to be consistent with what gettext would usually produce anyways.

@freakboy3742 -- this is ready for you to take a look at. sorry for the noise

@johnzhou721
Copy link
Contributor Author

@freakboy3742 I'm sort of sorry for the messy summmary... but I have no idea how to explain the changes I made in a more concise way... could a look be taken at this? Thank you very much.

@freakboy3742
Copy link
Member

I'm under a crunch preparing for a conference trip, so I'm going to be short on time for the next couple of weeks. I'll take a look if I get a chance, but I wouldn't bet on that happening soon.

However - as you've already identified - the PR is its own worst enemy. How can you explain the changes more concisely? Start by summarizing what you have done. What have you done? Why is it required? Why have you taken the approach you have? I read the PR summary here, and my eyes glaze over. I still only have the vaguest idea what you're trying to accomplish, and the mechanism by which it has been achieved. The first block of substantive content in the summary should not be a multi-paragraph ramble about the nature of fuzziness. If fuzziness is an issue - then start with "I've done X; this has some issues with fuzziness, details to follow" (or similar).

@johnzhou721
Copy link
Contributor Author

@freakboy3742 Thanks for the tips for preparing a good PR summary. I will do some extensive refactoring.

@johnzhou721
Copy link
Contributor Author

@freakboy3742 I have updated the PR summary to be more concise and a bit more logically organized in terms of what this does. Feel free to take a look at this and the plugin PR if you have a chance.

Note -- sorry for weird formatting in the description I used Microsoft Word to write the Markdown code but it pasted as an image to I pasted into textedit and then pasted the code into this PR.

Thank ya

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
preview Approved for an automated preview
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Titles on Members Page Not Localized Two Seemingly Untranslated Strings Use Jinja i18n to gettext instead of databags
3 participants