@HenkKalkwater
Collaborator

@HenkKalkwater HenkKalkwater commented Dec 30, 2024

Parse the lines of a changelog entry using the somewhat standardised
method of writing changelog messages. The parsed messages will be
displayed in a nicely formatted way.

If the parser cannot make any sense out of it, it will fall back to
displaying the changelog in a good, old <pre>-block.

When loading repository data from local files using the --repo-data-dir
option, use the same file names that are used when the files are
downloaded. These became outdated by 0da67c5.
@Olf0
Contributor

Olf0 commented Jan 3, 2025

@HenkKalkwater, I briefly tried to determine where the changelog is sourced from … and failed.

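Hence I just list the usual sources of changelogs I can think of: the rpm/<name>.changes file, the %changelog section in the rpm/<name>.spec file, and the git commit messages.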

These three sources are consolidated by the tar_git service of the SailfishOS-OBS into an "OBS-changelog", see e.g. for the SailfishOS:Chum GUI app.
Unfortunately this creates two main entries for each release when a rpm/<name>.changes file or a %changelog section in the rpm/<name>.spec file exists (because an additional changelog section for the same release with the git commit messages is created by the tar_git service), as you can see for the SailfishOS:Chum GUI app. Which of the two entries comes first depends on the release date mentioned in the changelog relative to the build date of that release at the SailfishOS-OBS!
Note that this "OBS-changelog" still adheres to the described changelog format. IMO for SailfishOS:Chum it is the most convenient way to utilise the "OBS-changelog" for a package.

Luckily the tar_git service of the SailfishOS-OBS ignores the dummy %changelog section I often use to silence rpmlint at the SailfishOS-OBS.

P.S.: Another source for a changelog which some developers use elaborately is the releases page on GitHub / Gitlab / Forgejo (e.g. used by Codeberg), see e.g. for the SailfishOS:Chum GUI Installer. But that is a free text field everybody structures differently.

@HenkKalkwater
Collaborator Author

So far, I'm parsing the changelogs from https://repo.sailfishos.org/obs/sailfishos:/chum/<repo>/repodata/<hash>-other.xml.gz; for Sailfish OS 5.0 that would be https://repo.sailfishos.org/obs/sailfishos:/chum/5.0_aarch64/repodata/229ca2c838411817040f5547a5ad70a478c463e52e62030df744803ff26fd377-other.xml.gz.

<?xml version="1.0" encoding="UTF-8"?>
<otherdata xmlns="http://linux.duke.edu/metadata/other" packages="1869">
  <package pkgid="47853ce7329ddae40cd5bb61b2bd1e8a2c5350ff6e43bc13d623260bc02dfc15" name="AtomicParsley" arch="src">
    <version epoch="0" ver="0.9.6.20221229+obs1" rel="1.4.1.bso"/>
    <changelog author="Nephros &lt;[email protected]&gt; 0.9.0" date="1594728000">- Package for SailfishOS</changelog>
    <changelog author="Nephros &lt;[email protected]&gt; 0.9.0-2" date="1594728001">- rename binary so youtube-dl can find it</changelog>
  <!-- More entries follow … -->
  
  </package>
  <!-- More packages follow… -->
</otherdata>

I thought this was generated from the <name>.changes file in RPMs, but after reading your comment I can confirm that it gets generated from the OBS-changelog, as it seems to combine the sources you mentioned above. See, for example, Sailfish OS Chum GUI.
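For illustration, a minimal Python sketch of reading the `<changelog>` entries out of such a file. It works on a trimmed, inline copy of the sample above; the e-mail address is a placeholder and `iter_changelogs` is a hypothetical helper name, not code from this PR. The real file is gzip-compressed, so it would have to be decompressed first (e.g. with the standard `gzip` module).

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <otherdata> root element.
NS = {"o": "http://linux.duke.edu/metadata/other"}

# Trimmed sample; the e-mail address is a placeholder.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<otherdata xmlns="http://linux.duke.edu/metadata/other" packages="1">
  <package pkgid="4785" name="AtomicParsley" arch="src">
    <version epoch="0" ver="0.9.6.20221229+obs1" rel="1.4.1.bso"/>
    <changelog author="Nephros &lt;nephros@example.org&gt; 0.9.0" date="1594728000">- Package for SailfishOS</changelog>
    <changelog author="Nephros &lt;nephros@example.org&gt; 0.9.0-2" date="1594728001">- rename binary so youtube-dl can find it</changelog>
  </package>
</otherdata>
"""


def iter_changelogs(xml_bytes):
    """Yield (package name, author, timestamp, text) for every <changelog>."""
    root = ET.fromstring(xml_bytes)
    for package in root.findall("o:package", NS):
        for entry in package.findall("o:changelog", NS):
            yield (
                package.get("name"),
                entry.get("author"),   # XML entities are already decoded here
                int(entry.get("date")),
                entry.text or "",
            )


entries = list(iter_changelogs(SAMPLE))
for name, author, date, text in entries:
    print(name, author, date, text, sep=" | ")
```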

The state right now is that I parse out the version number from the "author" tag (who decided that it should be put in there?) with the following regex that will work 99% of the time

(?P<author>.*) *<(?P<email>.*)>[ -]*(?P<version>.*)

and copy the inner text of the <changelog> tag verbatim into a <pre> tag. This, however, is not really readable on mobile browsers. This PR tries to parse the "any other lines" according to the spec you linked, so that it displays them as list items in HTML. Additionally, it tries to parse out the OBS convention of formatting the messages like - [affected part] changelog item and formats them as below to make them more readable.


  • Changelog item 1
  • [affected part 1] Changelog item 2
  • [affected part 1] Changelog item 3
  • [affected part 2] Changelog item 4

When the parsing of this fails due to the changelog not adhering to the formatting guidelines, it simply drops them in a preformatted text block, like the current situation.
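As a rough sketch of that parse-or-fall-back behaviour (not the code in this PR; `ITEM_RE` and `render_changelog` are hypothetical names, and the exact HTML is an assumption):

```python
import html
import re

# "- [affected part] changelog item"; the bracketed part is optional.
ITEM_RE = re.compile(r"^\s*[-*]\s+(?:\[(?P<part>[^\]]+)\]\s*)?(?P<text>\S.*)$")


def render_changelog(body):
    """Return an HTML list, or a <pre> block if any line does not parse."""
    fallback = "<pre>" + html.escape(body) + "</pre>"
    items = []
    for line in body.splitlines():
        if not line.strip():
            continue  # blank lines carry no information
        match = ITEM_RE.match(line)
        if match is None:
            # Give up on structure and show the text verbatim.
            return fallback
        items.append((match.group("part"), match.group("text").rstrip()))
    if not items:
        return fallback
    rendered = []
    for part, text in items:
        prefix = f"<strong>[{html.escape(part)}]</strong> " if part else ""
        rendered.append(f"  <li>{prefix}{html.escape(text)}</li>")
    return "<ul>\n" + "\n".join(rendered) + "\n</ul>"


print(render_changelog("- [core] Fix crash\n- Update docs"))
print(render_changelog("Free-form text without list markers"))
```

The items keep their original order; only the markup changes.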

I considered formatting them as below, but that may rearrange items in the changelog, which can make it harder to follow, which is why I did not implement it.


  • Changelog item 1 without affected part

Affected part 1

  • Changelog item 2
  • Changelog item 3

Affected part 2

  • Changelog item 4
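To illustrate the reordering concern, a small Python sketch with made-up items in which the affected parts are interleaved (`group_by_part` is a hypothetical helper):

```python
# (affected part, text) pairs in changelog order; None means "no affected part".
items = [
    ("part 1", "Changelog item 1"),
    (None, "Changelog item 2"),
    ("part 2", "Changelog item 3"),
    ("part 1", "Changelog item 4"),
]


def group_by_part(items):
    """Group item texts by affected part, keeping first-seen order of the parts."""
    groups = {}
    for part, text in items:
        groups.setdefault(part, []).append(text)
    return groups


grouped = group_by_part(items)
# Reading the groups top to bottom no longer follows the original order.
flattened = [text for texts in grouped.values() for text in texts]
print(flattened)
```

Here item 4 moves up next to item 1, ahead of items 2 and 3.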

@Olf0
Contributor

Olf0 commented May 6, 2025

A quick reply, because I do not want to keep this unanswered for even longer.

I'm parsing the changelogs from https://repo.sailfishos.org/obs/sailfishos:/chum/<repo>/repodata/<hash>-other.xml.gz

Yes, IMO this is the easiest way to obtain that data nicely pre-parsed into specific fields / variables.

The state right now is that I parse out the version number from the "author" tag (who decided that it should be put in there?) with the following regex that will work 99% of the time

IMO any further "parsing" is futile: The old thread on the [email protected] mailing list you called "specification" leaves way too many details undefined. For any modern format specification (i.e. in this millennium) I would expect a concise grammar, e.g. in Backus-Naur form (BNF) or at least expressed in RegExes or something equivalent.

Basis for this consideration:

  • OBS focuses on a single version of a package (precisely: a single <epoch>:<version>-<release> triplet): the current one
    I.e. while some history is retained in log files and OBS allows one to display, e.g., the diff of the RPM spec file to its previous version, conceptually OBS only handles the current version of a package.
    This package info is nicely provided by
    <package name="FooBar" […]>
    <version epoch="0" ver="1.2.3+a4" rel="5.6beta7"/>
  • For this single, current version of a package the aforementioned XML file provides <changelog> entries.
    • Do not try to dissect (i.e. to parse / interpret) the author field. E.g. the "version" part is underspecified, people use <version> and <version>-<release>, and possibly may use <epoch>:<version>-<release> (as per "specification").
      If you really want to try, IMO you should do that in reverse ("from behind"):
      The last whitespace-separated part of the author field may end in a <release> string, which (per RPM specification) MUST NOT contain hyphens (aka dashes: "-") and is separated by a hyphen from the <version> string. Mind that the <version> string is allowed to contain hyphens, hence only the last hyphen of that last part separates the <release> from the <version> string; if that last part contains no hyphen, the <release> string is empty.
      After cutting away the <release> string and its prepended hyphen, the remainder of that last part should be parsed in the regular direction (left to right): If the RegEx ^([[:digit:]]+):(.+)$ matches, the first substring is the <epoch> string and the second one is the <version> string (which should only comprise a subset of [[:graph:]]+, but IMO a parser should be lax, contrary to a generator); if not, the whole remainder is the <version> string and no epoch number is provided (I do not remember if this is equivalent to epoch 0 or 1).
      This is somewhat complicated (because historically the <epoch> and <release> strings were prepended and appended, respectively, to the <version> string, which allows for using almost all characters; so does the <release> string, except for the hyphen / dash) and a bit error-prone to implement. Far worse, many coders spontaneously implement parsers to dissect <epoch>:<version>-<release> strings with wild assumptions about how the character sets of the three substrings may be restricted: Such a parser might work well most of the time, e.g. when a <version> string comprises solely "[[:digit:]]+" separated by "\.", but fails when it is e.g. "1A2+.:~*bcd34-56-7.e" (AFAIR that is a "legal" <version> string: It must start with a decimal digit and must not contain whitespaces or control characters).
    • But actually, AFAIU you want to extract the <epoch>:<version>-<release> strings (with the <epoch> and <release> substrings with their separators being optional) to evaluate them for comparisons, to order changelog entries temporally. Please don't: The slightly complicated algorithm (as above: due to historic reasons) was once "specified" as prose in an email on another mailing-list server run by RedHat (RPM-devel, IIRC), but RedHat switched that one off a few years ago; since then, one must read the RPM source code to get that right, likely a second implementation exists in the zypper package (specifically libzypp), maybe that provides better documentation.
      For exactly that purpose the XML file already provides you with a UNIX timestamp in <changelog date="1234567890">, which can be directly used to mathematically compare them to determine the order of changelog entries.
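For anyone who wants to try the "from behind" dissection anyway, here is a minimal Python sketch of the steps as described above. `split_evr` and `evr_from_author` are hypothetical names, `[0-9]` stands in for `[[:digit:]]`, and a missing epoch is reported as `None` rather than guessed.

```python
import re

# "<epoch>:<version>" with a purely numeric epoch.
EPOCH_RE = re.compile(r"^([0-9]+):(.+)$")


def split_evr(token):
    """Split '[epoch:]version[-release]': release from the right, epoch from the left."""
    version, hyphen, release = token.rpartition("-")
    if not hyphen:
        # No hyphen at all: the whole token is the version, the release is empty.
        version, release = token, ""
    match = EPOCH_RE.match(version)
    if match:
        epoch, version = match.group(1), match.group(2)
    else:
        epoch = None  # no epoch given
    return epoch, version, release


def evr_from_author(author):
    """Apply split_evr to the last whitespace-separated part of the author field."""
    parts = author.split()
    return split_evr(parts[-1]) if parts else (None, "", "")


print(split_evr("0.9.0-2"))
print(split_evr("1:1A2-56-7"))
print(evr_from_author("Nephros <nephros@example.org> 0.9.0"))
```

For mere ordering none of this is needed: the `date` attribute is an integer timestamp and can be compared directly.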

Relevant part

But I might have misunderstood, and your intention is not to further analyse the "[<epoch>:]<version>[-<release>]" string, but only to dissect the author field. Then my reply is "Yes, I think your RegEx points in the right direction" (slightly enhanced to also accept horizontal tabs as whitespace characters, for better anchoring and to avoid duplicate namings):

^(?P<contributor>[[:print:]]*)[ \t]+<(?P<email>[[:graph:]]+)>[ \t]+[- \t]*(?P<versionstring>[[:graph:]]+)$

In case this RegEx does not match (it focuses on anchoring the email address within [ \t]+< and >[ \t]+), I would next try to just separate "contributor" (or alternatively read that part completely into "email", depending on its use later) and "versionstring":

^(?P<contributor>[[:print:]]+)[- \t]+(?P<versionstring>[[:graph:]]+)$

Notes:

  • One may drop each concluding $ and so allow for arbitrary [^[:graph:]]s to follow the "versionstring" (i.e. whitespaces and control characters). Or ("be as specific as possible"), define [ \t] as the only probable ones by substituting each concluding $ by [ \t]*$.
  • When designing RegExes, please mind that each sub-expression is "greedy" (ordered from the left to the right, with the further left ones having precedence), i.e. it consumes as many characters as possible. Hence a ^(?P<contributor>[[:print:]]*)[ \t]*<(?P<email>[[:graph:]]+)>[ \t]*-?[ \t]*(?P<versionstring>[[:graph:]]*)$ will not put the first occurrence of <([[:graph:]]+)> into the variable email, but the last, because everything else in that RegEx is arbitrary: either "any character: ." (I usually try to avoid being so unspecific, so I did in my example here) or "any number of characters: *" (? is almost equally bad, as both can be resolved to zero repetitions), or both (some in your original RegEx).
    Consequently "You <[email protected]> - 1.2.3-4+<deadbee>" would result in contributor="You <[email protected]> - 1.2.3-4+", email="deadbee" and versionstring="". Takeaway: Try to utilise as much information as possible when designing a RegEx ("be as specific as possible"); this helps the RegEx machine to anchor sub-expressions more easily (which often also speeds up the interpretation of a RegEx). If that is not possible, do research ways to suppress the "greediness" of the RegEx machine or of a single sub-expression: IIRC this is dreaded and partially RegEx machine dependent, hence rather a last-resort measure.
  • I hope that JavaScript RegExes support the character classes [:xyz:], because emulating them with ASCII ranges is tricky in times of widespread use of UTF and i18n / l10n (e.g. committer="Ömer Üçtepe"); i.e. then I would really use .* for the contributor field instead of the [[:print:]]* I suggested above.
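The greediness pitfall from the notes can be reproduced with a short Python sketch. Python's `re` has no POSIX classes, so as an approximation `[[:print:]]` is replaced by `.` and `[[:graph:]]` by `\S`; the e-mail address is a placeholder.

```python
import re

# Everything optional: the greedy contributor group swallows the first <...>.
LOOSE = re.compile(
    r"^(?P<contributor>.*)[ \t]*<(?P<email>\S+)>[ \t]*-?[ \t]*(?P<versionstring>\S*)$"
)
# Mandatory whitespace around the address gives the engine something to anchor on.
STRICT = re.compile(
    r"^(?P<contributor>.*)[ \t]+<(?P<email>\S+)>[ \t]+[- \t]*(?P<versionstring>\S+)$"
)

author = "You <you@example.org> - 1.2.3-4+<deadbee>"
loose = LOOSE.match(author).groupdict()
strict = STRICT.match(author).groupdict()
print(loose)
print(strict)
```

With the loose pattern, `email` ends up as `deadbee` and `versionstring` is empty; the strict pattern yields `You`, `you@example.org` and `1.2.3-4+<deadbee>`.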

@HenkKalkwater
Collaborator Author

I do not care about parsing the semantics of the versioning, all I want is to split the author from the version number :).

I order the changelog by the order of appearance (which happens to be in order of date ascending) and straight up display the version number to the end user.

Do not restrict it to word characters (\w)
@Olf0
Contributor

Olf0 commented May 12, 2025

I do not care about parsing the semantics of the versioning, all I want to split the author from the version number :).

For that I marked the start of the part of my prior message which is relevant for you with "Relevant part", now.


P.S.: > all I want to split the author from the version number

… and keep the email address as part of the author field, or to extract it separately?


HTH & Cheers

@HenkKalkwater
Collaborator Author

Aha, I tried your regexes, but since I'm using Python it does not understand things like [:graph:] or [:print:]. Perhaps this regex would be equivalent enough?

^(?P<contributor>.+?)[-\s]+(?P<versionstring>\S+)$

@Olf0
Contributor

Olf0 commented May 12, 2025

I researched a basis to answer your question:

I will try to look at it tomorrow during a long train ride.

P.S.: I hate these trivial "shorthand character classes", because the more modern ones ([:<id>:]) are much more convenient and have been supported for decades by POSIX (hence by practically all UNIX tools since the early 1990s: grep, sed, tr etc.); still, UNIX tools also support the old ones. But "ain't no use complaining".

@Olf0
Contributor

Olf0 commented May 27, 2025

Aha, I tried your regexes, but since I'm using Python it does not understand things like [:graph:] or [:print:]. Perhaps this regex would be equivalent enough?

^(?P<contributor>.+?)[-\s]+(?P<versionstring>\S+)$

Close, and I refrained from re-evaluating my suggestions, because they likely are "close but not really doing it", too.
Imagine
author=Henk Kalkwater <[email protected]> 1.2.3-4
for which your RegEx machine would first try
contributor=Henk; versionstring=Kalkwater
because the +? iterator is explicitly non-greedy,
but then the condition \S+$ likely cannot be fulfilled, IIRC because the regular iterators are not greedy when backtracking.
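For reference, a quick Python check of how the `re` module treats this regex: when `\S+$` fails, the engine backtracks and extends the lazy contributor group until the rest matches. The e-mail address is a placeholder and `split_author_lazy` is a hypothetical helper name.

```python
import re

LAZY_RE = re.compile(r"^(?P<contributor>.+?)[-\s]+(?P<versionstring>\S+)$")


def split_author_lazy(author):
    """Return the named groups as a dict, or None if the regex does not match."""
    match = LAZY_RE.match(author)
    return match.groupdict() if match else None


with_version = split_author_lazy("Henk Kalkwater <henk@example.org> 1.2.3-4")
without_version = split_author_lazy("Henk Kalkwater <henk@example.org>")
trailing_space = split_author_lazy("Henk Kalkwater <henk@example.org> 1.2.3-4 ")
print(with_version)
print(without_version)
print(trailing_space)
```

So the well-formed line is split as intended, but an author field without a version puts the e-mail address into `versionstring`, and a trailing space prevents any match.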

Constructing a RegEx, again:

  • It must end in a version-string, which starts with a digit (either %epoch or %version proper from the .spec file) and then comprises an arbitrary number of non-space characters. To be nice, we may allow for an arbitrary number of space characters to follow before the line ends (trailing space(s) is a common error):
    (\d\S*)\s*$
    Tested by using egrep -o 'author="[^"]+"' bd1acf6…0d5b889-other.xml | sed -e 's/^author="//g' -e 's/"$//g' -e 's/&lt;/</g' -e 's/&gt;/>/g' | egrep -o '[[:digit:]][^[:space:]]*[[:space:]]*$'; this is absolutely fine for extracting the version-string, even if some output looks strange: e.g. 50+git1 (correctly extracted) and 5432.3.4-4 (correctly extracted from malformed entry!)

    Thus the RegEx (?P<versionstring>\d\S*)\s*$ should provide you with the correct version-strings.

  • I would like to provide some leeway for authors who refrain from providing a clear name or stick to the simpler format rules for e-mail addresses. For the latter I consulted Wikipedia, expecting to have to read the IETF RFCs it references, but the article itself was sufficient. Ultimately I settled on either anchoring on the @<domain-part> of the email address or the - as a separator to the version-string.

    • Anchoring on email address
      ^(?P<contributor>.*@[\[\(\w][-\(\w\.:]*[\w\)\]]>?)\s*[-\s]?\s*(?P<versionstring>\d\S*)\s*$
    • Anchoring on - separator (well, ultimately not really after some fiddling)
      ^(?P<contributor>.*?)\s*[-\s]?\s*(?P<versionstring>\d\S*)\s*$
      <Will test, this variant may be fully sufficient>

Yes, it works like that (pseudocode - last edited 2025-05-28):

if RegEx.match( "$author", '^(?P<contributor>.*?)([-\s]+|([-\s]*\s+\S*?-))(?P<versionstring>\d\S*)\s*$' )
then true
else
  contributor := "$author"
  versionstring := ""
fi
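A minimal Python sketch of this splitting step, under a few assumptions: the e-mail addresses are placeholders and `split_author` is a hypothetical helper name. The contributor group is lazy (`.*?`) and the plain separator alternative is tried first, because with a greedy contributor a backtracking engine such as Python's `re` would cut `1.2.3-4` at its last hyphen and report `4` as the version.

```python
import re

AUTHOR_RE = re.compile(
    r"^(?P<contributor>.*?)"        # as short as possible
    r"(?:[-\s]+|[-\s]*\s+\S*?-)"    # separator, or a non-numeric prefix ending in "-"
    r"(?P<versionstring>\d\S*)\s*$"  # starts with a digit, runs to the end of the line
)


def split_author(author):
    """Return (contributor, versionstring); versionstring is '' if none was found."""
    match = AUTHOR_RE.match(author)
    if match is None:
        return author, ""
    return match.group("contributor"), match.group("versionstring")


print(split_author("Henk Kalkwater <henk@example.org> 1.2.3-4"))
print(split_author("Nephros <nephros@example.org> - youtube-dl-open-helper-5432.3.4-4"))
print(split_author("Henk Kalkwater <henk@example.org>"))
```

The second call shows the malformed entry: the package-name prefix is dropped and `5432.3.4-4` remains as the version string.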

If you also want the email address separated, I think this is best done in a staged fashion, from the extracted contributor.
If you want me to (try to) design a robust RegEx for separating contributor into contrib-name and contrib-email, please let me know. Done, see next message, below.

@Olf0
Contributor

Olf0 commented May 27, 2025

if RegEx.match( "$contributor", '^(?P<contrib-name>.*)<(?P<contrib-email>[^<>]+)>\s*$' )
then true
elif RegEx.match( "$contributor", '^(?P<contrib-name>.*)\s+<?(?P<contrib-email>.*@[\[\(\w][-\(\w\.:]*[\w\)\]])>?\s*$' )
then true
elif RegEx.match( "$contributor", '^\s*<?(?P<contrib-email>.*@[\[\(\w][-\(\w\.:]*[\w\)\]])>?\s*$' )
then contrib-name := ""
else
  contrib-name := "$contributor"
  contrib-email := ""
fi
# Strip $contrib-name of leading and trailing '\s*':
RegEx.match( "$contrib-name", '^\s*(?P<contrib-name>.*\S)\s*$' )
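The same staged dissection as a Python sketch. Python's `re` does not accept hyphens in group names, so the groups are called `name` and `email` here; `split_contributor` is a hypothetical helper name and the addresses are placeholders.

```python
import re

# Domain part of an e-mail address, as in the pseudocode.
_DOMAIN = r"@[\[\(\w][-\(\w\.:]*[\w\)\]]"

_PATTERNS = [
    # "Name <email>"
    re.compile(r"^(?P<name>.*)<(?P<email>[^<>]+)>\s*$"),
    # "Name email", possibly with a stray angle bracket
    re.compile(r"^(?P<name>.*)\s+<?(?P<email>.*" + _DOMAIN + r")>?\s*$"),
    # e-mail address only
    re.compile(r"^\s*<?(?P<email>.*" + _DOMAIN + r")>?\s*$"),
]


def split_contributor(contributor):
    """Return (name, email); either one may be empty."""
    for pattern in _PATTERNS:
        match = pattern.match(contributor)
        if match:
            groups = match.groupdict()
            # The third pattern has no "name" group, hence the default.
            return groups.get("name", "").strip(), groups["email"]
    # No e-mail address recognised: everything is the name.
    return contributor.strip(), ""


print(split_contributor("Henk Kalkwater <henk@example.org>"))
print(split_contributor("henk@example.org"))
print(split_contributor("Henk Kalkwater"))
```

`str.strip()` takes the place of the final whitespace-stripping RegEx.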

Edit: IMO it is better to first strictly dissect $contributor into contrib-name and contrib-email (if you want to dissect $contributor at all) and later decide what to display (and how) if one of the contrib-* variables is empty.

HTH!

P.S.: There is a single big offender in the format Nephros <[email protected]> - youtube-dl-open-helper-5432.\d.\d[-\d]*. I will try to pose a PR fixing this at its source.
Edit: I addressed that in the first pseudocode block.
Edit 2: Additionally I posed a PR to resolve this specific issue at its source.

@Olf0
Contributor

Olf0 commented Aug 10, 2025

@HenkKalkwater, does the pseudocode below (copied from my two preceding messages, [1] & [2]), once transformed to JavaScript, fulfil your needs and work well, or is there anything left I can contribute to this PR?

This pseudocode requires the variable author to be set (as discussed above), analyses it, and extracts from it:

  • The variable versionstring, which may be empty when no proper version field has been provided within the author string.
  • contrib-name and / or contrib-email: At least one of them will be non-empty, if author was non-empty.
  • As an intermediate step it extracts the variable contributor, which comprises what will be later dissected into contrib-name and / or contrib-email (logically also non-empty, if author was non-empty).

if RegEx.match( "$author", '^(?P<contributor>.*?)([-\s]+|([-\s]*\s+\S*?-))(?P<versionstring>\d\S*)\s*$' )
then true
else
  contributor := "$author"
  versionstring := ""
fi

if RegEx.match( "$contributor", '^(?P<contrib-name>.*)<(?P<contrib-email>[^<>]+)>\s*$' )
then true
elif RegEx.match( "$contributor", '^(?P<contrib-name>.*)\s+<?(?P<contrib-email>.*@[\[\(\w][-\(\w\.:]*[\w\)\]])>?\s*$' )
then true
elif RegEx.match( "$contributor", '^\s*<?(?P<contrib-email>.*@[\[\(\w][-\(\w\.:]*[\w\)\]])>?\s*$' )
then contrib-name := ""
else
  contrib-name := "$contributor"
  contrib-email := ""
fi
# Strip $contrib-name of leading and trailing '\s*':
RegEx.match( "$contrib-name", '^\s*(?P<contrib-name>.*\S)\s*$' )

@HenkKalkwater
Collaborator Author

Hi @Olf0,

Currently I'm a little busy, which is why I haven't had the time to look at your suggestions, implement and test them. I'll try to look at it on one of the upcoming weekends.

@Olf0
Contributor

Olf0 commented Aug 13, 2025

Currently I'm a little busy

That is absolutely fine, please take your time!

It was just a kind reminder, in case it slipped off your radar; which it did not, as I understand now.
