RFC: Binary Distribution Format #2

Open
wants to merge 3 commits into
base: main

theory
Member

@theory theory commented Jul 11, 2024

Add a new RFC describing the proposed trunk binary distribution format for PGXN packages. Inspired by Python wheel and pgt.dev, aiming to support binaries for every OS and architecture supported by PostgreSQL itself, as well as many versions of PostgreSQL.

Previous discussion.

@theory theory added the documentation Improvements or additions to documentation label Jul 11, 2024
@theory theory self-assigned this Jul 11, 2024
@theory theory force-pushed the binary-distribution-format branch 5 times, most recently from 39613c4 to f4f4d0d Compare July 11, 2024 20:00
@theory theory mentioned this pull request Jul 11, 2024
theory added a commit that referenced this pull request Jul 11, 2024
Add a new RFC for Meta Spec v2. Based on the [PGXN Meta Sketch] and
additional research. Notable differences from v1:

*   Introduce the term "package" separate from "distribution". The
    "package" is the bundle of objects being distributed as a…package.
*   Use only one version for the entire distribution, rather than
    separate versions for each extension included in the distribution.
    This allows the deletion of the notes on comparing versions.
*   Make "Source Distribution" the formal term, but mostly refer to it
    as "Distribution". Will be distinguished from binary distributions
    that ship with their own metadata (see #2).
*   Use JSON data types as the base types instead of generic "list" and
    "map" types.
*   Add "Path", "purl", and "Platform" types
*   Use SPDX License Expressions instead of Perl Software::License-based
    structures.
*   Use the term "property" to describe object key/value pairs, to align
    with JSON Schema.
*   Use an array of objects to describe maintainers.
*   Replace the `provides` property with `contents`, with support for
    multiple kinds of PostgreSQL extensions, including TLEs, loadable
    modules, and background workers.
*   Move the `tags` property to the new `classifications` object, and
    add support for curated categories borrowed from [Trunk].
*   Replace `no_index` with `ignore` and use the gitignore format
    instead of separate lists of files and directories.
*   Rename `prereqs` to `packages` and move it into the new
    `dependencies` property, which also has `postgres`, `pipeline`,
    `platforms`, and `variations` properties. Use [purls] to specify
    dependencies, so that any supported packaging dependency can be
    specified, as well as PGXN packages and Postgres core extensions and
    tools.
*   Remove `release_status`; we'll instead depend on [SemVer] to
    indicate pre-releases.
*   Simplify the `resources` object and add `badges` to it.
*   Add the `artifacts` property, so the extension author can include
    links to other packages or sources for a release.

Also configure `#` to hide a line in `json` code blocks and use it to
encode proper JSON objects without showing the surrounding braces.
Readers can hit the eye button that appears on hover to make the hidden
lines appear.

[PGXN Meta Sketch]: https://justatheory.com/2024/03/rfc-pgxn-metadata-sketch/
[Trunk]: https://pgt.dev
[purls]: https://github.com/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rs
[SemVer]: https://semver.org
@theory theory added rfc New RFC and removed documentation Improvements or additions to documentation labels Jul 11, 2024
Comment on lines 123 to 128
3. For the left part, split on the right-most dash. If the string to the
right of the dash is a valid [SemVer], then the left part is the package
name. If the right string is not a valid [SemVer], try again at the second
right-most dash and check again. Continue until a valid SemVer is produced
or else fail.

SemVers are required to have three parts, with any suffix after a dash (presumably not another semver-like string, so 3.1.2-0 would be valid, but 3.1.2-1.2.3?). This part feels like it has the potential for some ambiguity, to be honest, but I trust you're more familiar with SemVer than I am at this point. :-)
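
A minimal sketch of where that ambiguity comes from, assuming the right-most-dash rule quoted above. `split_rightmost` is a hypothetical helper, not part of the RFC, and the pattern is the regular expression suggested on semver.org:

```python
import re

# Regular expression suggested on semver.org for validating a SemVer string.
SEMVER = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)


def split_rightmost(left_part):
    """Hypothetical helper: try each dash from right to left and stop at
    the first split whose right-hand side is a valid SemVer."""
    pos = len(left_part)
    while True:
        pos = left_part.rfind("-", 0, pos)
        if pos <= 0:
            raise ValueError(f"no valid SemVer found in {left_part!r}")
        name, version = left_part[:pos], left_part[pos + 1:]
        if SEMVER.match(version):
            return name, version


print(split_rightmost("foo-3.1.2-0"))
print(split_rightmost("foo-3.1.2-1.2.3"))
```

With this sketch, `foo-3.1.2-0` splits into `foo` and `3.1.2-0`, but `foo-3.1.2-1.2.3` splits into `foo-3.1.2` and `1.2.3`, because `1.2.3` is itself a valid SemVer and the scan stops at the first match.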

Member Author

Bah, no, of course, 3.1.2-1.2.3 is a valid SemVer.

So perhaps it instead needs to be that distribution names cannot start with a digit or use a digit after a dash. That would simplify things: the version starts at the first digit after a dash. Of existing distributions, it looks like only DBT-2 would be a problem:

select distinct name from distributions where name LIKE '%-%';
           name           
──────────────────────────
 DBT-2
 E-Maj
 foreign-keycloak-wrapper
 hashtypes-aleksabl
 neo4j-fdw
 passwd-fdw
 pg-json
 Slony-I
 soundex-function
 soundex-operator
 soundex-type
 trunklet-format
(12 rows)

Note that this does not apply to the name of the extensions, apps, or anything else inside the distribution, just the name of the thing being distributed.
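
A rough sketch of the simplified rule, assuming a name may not start with a digit or have a digit directly after a dash. The helper names are hypothetical, and the list is copied from the query result above:

```python
import re

# A dash immediately followed by an ASCII digit marks the start of the version.
DASH_BEFORE_DIGIT = re.compile(r"-(?=[0-9])")


def name_is_allowed(name):
    """True if the name has no leading digit and no digit right after a dash."""
    return re.match(r"[0-9]", name) is None and DASH_BEFORE_DIGIT.search(name) is None


def split_at_first_digit(left_part):
    """Split "name-version" at the first dash that is followed by a digit."""
    match = DASH_BEFORE_DIGIT.search(left_part)
    if match is None or match.start() == 0:
        raise ValueError(f"cannot find a version in {left_part!r}")
    return left_part[:match.start()], left_part[match.end():]


EXISTING = [
    "DBT-2", "E-Maj", "foreign-keycloak-wrapper", "hashtypes-aleksabl",
    "neo4j-fdw", "passwd-fdw", "pg-json", "Slony-I", "soundex-function",
    "soundex-operator", "soundex-type", "trunklet-format",
]

print([name for name in EXISTING if not name_is_allowed(name)])
print(split_at_first_digit("pg-json-3.1.2-1.2.3"))
```

Under these assumptions only `DBT-2` is rejected, and `pg-json-3.1.2-1.2.3` splits into `pg-json` and `3.1.2-1.2.3` without retrying at other dashes.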

Member Author

Okay, added the Package Name type to #3. I think that will solve the problem. I'll rewrite this bit.

theory added a commit that referenced this pull request Jul 12, 2024
@theory theory mentioned this pull request Jul 12, 2024
2 tasks
theory added a commit that referenced this pull request Jul 12, 2024
*   Add the "Package Name" type that disallows leading digits.
    [Discussion](#2 (comment)).
*   Document the Number type and allow `0` as a valid value for a
    Version Range.
*   Add the "Path Pattern" type and use it for the "ignore" property.
*   Merge "License" into "License Expression"
*   Rename "Version" to "SemVer", to distinguish it from any other
    versions used for dependencies.
*   Make "Version Range" not specific to SemVers, since it's used for
    all sorts of dependency version requirements. Also, disallow version
    truncation in Version Ranges, since different version formats will
    have different rules.
*   Document that purl version expressions are valid but ignored.
*   Fix various spelling, grammatical, syntax, and narrative errors and
    clumsiness.
*   Update spec URLs to point to rfcs.pgxn.org.
*   Rename `generated_by` to `producer`.
*   Add question about preloading.
*   Note quality binary distribution as sign of success.
@theory theory force-pushed the binary-distribution-format branch from f4f4d0d to 1a778ac Compare July 12, 2024 21:47
theory added 2 commits July 12, 2024 17:54
Made possible by RFC-3 disallowing digits after dashes in package names.
Made possible by forbidding dots (.) in Terms in the metadata spec
(80702c3), making it impossible for a package name to include a semver.
Without the trunk binary distribution format, it will be difficult to build
and deliver cross-platform binary distribution of all the packages on PGXN.

## Prior art

I'll have more substantive comments on things in a bit. But as long as there's a prior-art section ...

As near as I can tell, the Tembo trunk format demonstrated the "expand directory prefixes to pg_config directories" pattern since March 2023. Also, PL/Java demonstrated the same pattern since January 2016. ;)
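
For readers unfamiliar with the pattern, a minimal sketch of expanding archive directory prefixes to `pg_config` directories. The prefix names and helper functions are hypothetical and are not taken from the trunk or PL/Java formats; the `pg_config` options themselves are real:

```python
import subprocess
from pathlib import PurePosixPath

# Hypothetical archive prefixes mapped to real pg_config options.
PREFIX_TO_OPTION = {
    "pkglib": "--pkglibdir",
    "share": "--sharedir",
    "doc": "--docdir",
    "bin": "--bindir",
}


def pg_config_dirs(pg_config="pg_config"):
    """Ask pg_config for each target directory (requires PostgreSQL installed)."""
    dirs = {}
    for prefix, option in PREFIX_TO_OPTION.items():
        result = subprocess.run(
            [pg_config, option], check=True, capture_output=True, text=True
        )
        dirs[prefix] = result.stdout.strip()
    return dirs


def expand(member, dirs):
    """Map an archive member such as "share/extension/pair.control" to its
    install path, rejecting unknown prefixes and parent-directory escapes."""
    parts = PurePosixPath(member).parts
    if len(parts) < 2 or parts[0] not in dirs or ".." in parts:
        raise ValueError(f"unexpected archive member: {member!r}")
    return str(PurePosixPath(dirs[parts[0]]).joinpath(*parts[1:]))
```

An installer would call `pg_config_dirs()` once and then `expand()` for every member it extracts, so the same archive works regardless of where a given PostgreSQL build keeps its files.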

Member Author

Well yes, that's part of why it's called the "Trunk" format. But you're right it should be mentioned here. I wasn't aware of PL/Java using it, though.


Sure, I assumed the reason you hadn't mentioned the earlier PL/Java prior art was that you hadn't known about it, not that you deliberately snubbed it or anything. :)


## Unresolved questions

* Should the archive format be Zip or tarball? PGXN had traditionally used

I can't simplify this choice, but maybe I can complicate it in a useful way.

This seems to be a recurring question for projects of all kinds, with very few actual technical arguments to be made either way. Both are widely supported. Either one can deliver the files. Maybe the most noticeable difference is that zip is more closely associated in people's minds with Windows and tar is more associated in people's minds with *n*x.

That association can sometimes be so strong it leads to problems: a quarter century ago, I was contributing to a project (I think it was the ANTLR parser generator) that actually offered downloads in both formats. The choice of format had been overloaded with extra meaning: it was assumed that people using the tar download would want the text artifacts (sources and docs and such) to come out with *n*x-style line endings, and people using the zip download would want those same artifacts to come out with Windows-style line endings.

Tar, of course, doesn't have any notion of line-ending conversions. What you put in is what comes out. So the project built the tarballs from files with *n*xy line endings.

Zip does have a notion of line-ending conversions, so you can build a zipfile from *n*xy text files and have them become Windowsy when extracted on a Windows box.

But the normal way zip does that is irredeemably broken.

For each entry in a zip, the extractor looks at a bit that flags that entry as a binary or text file. Line ending conversions are applied to text files only, and binary ones are of course left alone. Everything about that is ok; the problem is how that flag bit gets set in the first place. The zip file creator tool sets it by guessing. It samples some bytes of each file, and if it decides they look more ASCII/textish than binaryish, it sets the entry's "apparent ASCII/text file" bit to formally record the stupid guess it made. The extractor will then, deterministically, doing exactly as it's been told, corrupt during extraction every binary file that got mistakenly flagged as text. And ANTLR users were being hit by that.

If someone were to build a low-level zip-file creation tool that actually set the "apparent ASCII/text" flag explicitly according to the known text/binaryness of each file instead of guessing, then a zip created that way could be safely extracted anywhere. It's only the familiar ubiquitous creation tool that sets the flag by guessing. I wasn't able to build a better creation tool using the Java library API because that API's not low-level enough to let me control that flag.

It's tempting to say "well, tar then! it won't ever mess with any of the bytes!". But the line-ending issue hasn't gone away; even in 2025 it is still likely that whatever files in a distribution are source/doc/config/text/etc. will be friendlier to view and edit on the user's system if they get the right line endings. Which isn't really a technically difficult problem, as long as you don't try to do it the way zip did. (Thankfully, Mac OS fell in line with the *n*xy line endings, after Macs originally had yet a third different way for lines to end.)

For a format like trunk that wants to cryptographically hash entries, line-ending conversion makes it necessary to specify which form gets hashed. It should probably be the single known form actually stored in the archive. If extracted on a system with different line endings, it should be understood that extracted text files on the filesystem may hash to unexpected values even though the archive they came from verifies ok.
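
A small illustration of that point, assuming SHA-256 as the digest (the hash algorithm and the sample bytes are assumptions for this sketch, not taken from the RFC):

```python
import hashlib

# Bytes as stored in the archive, with LF line endings.
stored = b"CREATE EXTENSION pair;\n-- end of file\n"


def digest(data):
    """Hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def to_crlf(data):
    """Simulate extraction on a system that wants CRLF line endings."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def to_lf(data):
    """Normalize back to the stored LF form before verifying."""
    return data.replace(b"\r\n", b"\n")


extracted = to_crlf(stored)
print(digest(stored) == digest(extracted))         # hashes differ after conversion
print(digest(stored) == digest(to_lf(extracted)))  # hashes match once normalized
```

The extracted copy no longer matches the recorded hash even though the archive entry verifies, so a verifier would need to hash either the stored bytes or a normalized form of the extracted file.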

When I was later working on PL/Java, years after the ANTLR experience, the choice of format was made easy by noting that absolutely anyone looking at PL/Java is going to have Java installed, and Java's standard library has APIs for zip (and for jar, which is nothing but zip with a manifest file in a specified format that adds information on the content—so, a lot like trunk). If an extractor Java class is added to a jar and named as Main-Class in the manifest, the jar can be run with java -jar foo.jar and so be self-extracting in case a user has a stripped-down Java runtime without a CLI jar tool.

Because the Java API didn't let me get low-level enough to write zip files with the "apparent text" bits correctly non-guessingly set, I just used entries in the manifest to explicitly indicate which members should or shouldn't get text treatment. The regular CLI zip or jar tools would attach no meaning to those when extracting, but I could write the self-extractor class to do the right thing, and then just recommend java -jar foo.jar as always the right way to extract it. And from there it was a short step to also using pg_config to find the directories to extract into.
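
A rough sketch of the same idea outside Java, assuming a tar archive plus a manifest that lists which members are text. The function names and the shape of the manifest (a plain set of member names) are hypothetical; this is not how PL/Java's extractor is implemented:

```python
import io
import tarfile


def make_tar(files):
    """Build an in-memory tarball from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def extract_members(tar_bytes, text_members, newline):
    """Return {name: bytes}, converting line endings only for members that
    the manifest explicitly marks as text; everything else is left untouched."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tf:
        for member in tf.getmembers():
            if not member.isfile():
                continue
            data = tf.extractfile(member).read()
            if member.name in text_members:
                data = data.replace(b"\r\n", b"\n").replace(b"\n", newline)
            out[member.name] = data
    return out


files = {"doc/README.md": b"line one\nline two\n", "pkglib/pair.so": b"\x7fELF\n\x00\n"}
result = extract_members(make_tar(files), {"doc/README.md"}, b"\r\n")
print(result["doc/README.md"])
print(result["pkglib/pair.so"] == files["pkglib/pair.so"])
```

Because the text/binary decision comes from the manifest rather than from sampling bytes, the binary member keeps its embedded `\n` bytes intact.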

For trunk the choice is harder because you can't count on any single thing like Java being around. (Well, PostgreSQL! Maybe if there were a tarfile-fdw or zipfile-fdw and it was in core...)

Maybe there is a Rust (or Python, or some other) library with a zip API that would give explicit control over the "apparent text" flag when creating. You could create zip files using that and then extraction could safely rely on the flag. Of course you'd still have a custom extractor to do the pg_config directory expansion anyway, so it could also follow the PL/Java example and indicate per-entry text/binaryness in the manifest.
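
As one data point, Python's standard `zipfile` module exposes the internal file attributes as `ZipInfo.internal_attr`, and bit 0 of that field is the "apparent ASCII/text" bit in the ZIP specification. A minimal sketch of setting it explicitly follows; the helper names are hypothetical, and whether a given extractor honors the bit is a separate question:

```python
import io
import zipfile

TEXT_FLAG = 0x0001  # bit 0 of the ZIP "internal file attributes" field


def build_archive(files):
    """Build an in-memory zip from {name: (bytes, is_text)}, setting the
    text bit from known information instead of guessing from content."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w") as zf:
        for name, (data, is_text) in files.items():
            info = zipfile.ZipInfo(name)
            info.internal_attr = TEXT_FLAG if is_text else 0
            zf.writestr(info, data)
    return buf.getvalue()


def read_flags(archive_bytes):
    """Return {name: is_text} as recorded in the archive's central directory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {i.filename: bool(i.internal_attr & TEXT_FLAG) for i in zf.infolist()}


archive = build_archive({
    "doc/README.md": (b"hello\n", True),
    "pkglib/pair.so": (b"\x7fELF\n", False),
})
print(read_flags(archive))
```

A custom extractor could then read the flag back the same way and apply line-ending conversion only where it is set.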

I hope that background's at least interesting, if the matter of zip's handling of text hadn't been explicitly raised before. There's documentation on the extractor used in PL/Java (which is actually a few years older than PL/Java) going into more depth than I have gone into here.

Member Author

Wow, I was not aware of this whole line-ending thing. Interestingly, it looks like gitattributes supports text file annotation similar to what you describe for jar files. As PGXN v1 uses zip files, I can see where this would be super useful. Sadly, it doesn't look like git-archive uses this annotation.

Labels
rfc New RFC
3 participants