Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generated Components.yml contains duplicate language in Keywords - invalid YAML #92

Closed
robert-ancell opened this issue May 30, 2021 · 5 comments

Comments

@robert-ancell
Copy link

The components file in Ubuntu (/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_hirsute_main_dep11_Components-amd64.yml.gz) fails YAML parsing due to containing duplicate languages in the Keywords mapping.

Specifically the first entry fails:

---
File: DEP-11
Version: '0.12'
Origin: ubuntu-hirsute-main
MediaBaseUrl: https://appstream.ubuntu.com/media/hirsute
Time: 20210422T135021
---
Type: desktop-application
ID: software-properties-gtk.desktop
Package: software-properties-gtk
...
Keywords:
  de_AT:
  - Treiber
  - Paketquellen
  - Paketquelle
  - PPA
  ca_ES:
  - controladors
  - dipòsits
  - dipòsit
  - ppa
  - orígens
  - programari
...
  ca_AD:
  - controladors
  - dipòsits
  - dipòsit
  - ppa
  - orígens
  - programari
  ca_ES:
  - controladors
  - dipòsits
  - dipòsit
  - ppa
  - fonts
  - programari

The YAML specification states:

The content of a mapping node is an unordered set of key: value node pairs, with the restriction that each of the keys is unique

@robert-ancell
Copy link
Author

I'm having trouble setting up appstream-generator to reproduce this case, but I think the issue may be in src/asgen/backends/ubuntu/ubupkg.d:extractLangpacks() where /var/lib/locales/supported.d/ca is parsed and the @valencia modifier may be confusing things.

ca_AD.UTF-8 UTF-8
ca_ES.UTF-8 UTF-8
ca_ES.UTF-8@valencia UTF-8
ca_IT.UTF-8 UTF-8
ca_FR.UTF-8 UTF-8

@robert-ancell
Copy link
Author

@iainlane seems to be the primary author of this file?

@ximion
Copy link
Owner

ximion commented May 30, 2021

Very odd... The @valencia modifier should be stripped together with the .UTF-8 and end up in an associative array, which can also have a key only once. But it looks like that doesn't happen, which means that it gets stripped at YAML serialization, see https://github.com/ximion/appstream/blob/master/src/as-yaml.c#L652
In hindsight, the value sanity check should likely have been done somewhere earlier in the chain to ensure that we end up with just one unique key in a GHashTable before that one is serialized as-is, but I guess this has been done for efficiency in the past.
I do wonder though whether the result would be less surprising if libappstream also changed its behavior here, and it may not actually loose noticeable performance...

The easiest fix though is to not feed "bad" values in in the first place, which should be fixed in asgen (which may not necessarily be an issue limited to the Ubuntu backend).

@iainlane
Copy link
Collaborator

iainlane commented Jun 4, 2021

@ximion can you transfer this issue to appstream please?

This is fixed by something like iainlane/appstream@a0246f1

Not proposed yet, because

  1. I want to write a testcase, and I don't know how to do that because the header for this as-utils-private.h has hidden visibility and the testsuite links the shared library, so the function can't be linked in. Would like some advice on how to do this best.
  2. The function is used in a dodgy way. Note that it modifies the string in place, and is called as the callback from g_hash_table_foreach. That is, we are directly modifying the keys in a hash table, breaking it. We can either change the API and return a copy (then probably use g_string_replace instead of this stuff), or consume the hash table, making as_component_emit_yaml invalidate the AsComponent which is passed to it. That would be a bit weird I think - the first option is probably the way to go. (Or we could modify that line I just linked to to g_strdup, but if all callers are going to strdup then we might as well make the function do it itself IMO).

iainlane pushed a commit to iainlane/appstream that referenced this issue Jun 7, 2021
We have some locales which are dupes if we do this, for example
ca_ES.UTF-8 and ca_ES.UTF-8@valencia. The latter of these should become
ca_ES@valencia in the output. That is what `locale -a` shows.

Previously `as_locale_strip_encoding ()` modified the passed-in string
in place. However, in the one place where don't `g_strdup ()` the string
before passing to this function, it is the key in a `GHashTable`. We
can't do this, and only get away with it because the hash table isn't
touched after this call. Fix the function to instead return a newly
allocated string, and drop the `g_strdup` calls from the other call
sites.

Add a small test for this function too.

ximion/appstream-generator#92
iainlane pushed a commit to iainlane/appstream that referenced this issue Jun 7, 2021
We have some locales which are dupes if we do this, for example
ca_ES.UTF-8 and ca_ES.UTF-8@valencia. The latter of these should become
ca_ES@valencia in the output. That is what `locale -a` shows.

Previously `as_locale_strip_encoding ()` modified the passed-in string
in place. However, in the one place where don't `g_strdup ()` the string
before passing to this function, it is the key in a `GHashTable`. We
can't do this, and only get away with it because the hash table isn't
touched after this call. Fix the function to instead return a newly
allocated string, and drop the `g_strdup` calls from the other call
sites.

Add a small test for this function too.

ximion/appstream-generator#92
iainlane pushed a commit to iainlane/appstream that referenced this issue Jun 21, 2021
We have some locales which are dupes if we do this, for example
ca_ES.UTF-8 and ca_ES.UTF-8@valencia. The latter of these should become
ca_ES@valencia in the output. That is what `locale -a` shows.

Previously `as_locale_strip_encoding ()` modified the passed-in string
in place. However, in the one place where don't `g_strdup ()` the string
before passing to this function, it is the key in a `GHashTable`. We
can't do this, and only get away with it because the hash table isn't
touched after this call. Fix the function to instead return a newly
allocated string, and drop the `g_strdup` calls from the other call
sites.

Add a small test for this function too.

ximion/appstream-generator#92
ximion pushed a commit to ximion/appstream that referenced this issue Jun 21, 2021
We have some locales which are dupes if we do this, for example
ca_ES.UTF-8 and ca_ES.UTF-8@valencia. The latter of these should become
ca_ES@valencia in the output. That is what `locale -a` shows.

Previously `as_locale_strip_encoding ()` modified the passed-in string
in place. However, in the one place where don't `g_strdup ()` the string
before passing to this function, it is the key in a `GHashTable`. We
can't do this, and only get away with it because the hash table isn't
touched after this call. Fix the function to instead return a newly
allocated string, and drop the `g_strdup` calls from the other call
sites.

Add a small test for this function too.

ximion/appstream-generator#92
@iainlane
Copy link
Collaborator

This should be fixed with snap revision 153 (ish) onwards, which is built using the new version of appstream, 0.14.4. Unfortunately, as release pockets are immutable, we have no way of fixing the broken YAML which is out there for hirsute. Fixing that issue would be #29, but it's not fully there yet for this purpose.

Thanks for reporting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants