RFC: Binary Distribution Format #2

theory · 2024-07-11T19:21:12Z

Add a new RFC describing the proposed trunk binary distribution format for PGXN packages. Inspired by Python wheel and pgt.dev, aiming to support binaries for every OS and architecture supported by PostgreSQL itself, as well as many versions of PostgreSQL.

Previous discussion.

Add a new RFC for Meta Spec v2. Based on the [PGXN Meta Sketch] and additional research. Notable differences from v1:

* Introduce the term "package" separate from "distribution". The "package" is the bundle of objects being distributed as a…package.
* Use only one version for the entire distribution, rather than separate versions for each extension included in the distribution. This allows the deletion of the notes on comparing versions.
* Make "Source Distribution" the formal term, but mostly refer to it as "Distribution". Will be distinguished from binary distributions that ship with their own metadata (see #2).
* Use JSON data types as the base types instead of generic "list" and "map" types.
* Add "Path", "purl", and "Platform" types.
* Use SPDX License Expressions instead of Perl Software::License-based structures.
* Use the term "property" to describe object key/value pairs, to align with JSON Schema.
* Use an array of objects to describe maintainers.
* Replace the `provides` property with `contents`, with support for multiple kinds of PostgreSQL extensions, including TLEs, loadable modules, and background workers.
* Move the `tags` property to the new `classifications` object, and add support for curated categories borrowed from [Trunk].
* Replace `no_index` with `ignore` and use the gitignore format instead of separate lists of files and directories.
* Rename `prereqs` to `packages` and move it into the new `dependencies` property, which also has `postgres`, `pipeline`, `platforms`, and `variations` properties. Use [purls] to specify dependencies, so that any supported packaging dependency can be specified, as well as PGXN packages and Postgres core extensions and tools.
* Remove `release_status`; we'll instead depend on [SemVer] to indicate pre-releases.
* Simplify the `resources` object and add `badges` to it.
* Add the `artifacts` property, so the extension author can include links to other packages or sources for a release.

Also configure `#` to hide a line in `json` code blocks and use it to encode proper JSON objects without showing the surrounding braces. Readers can hit the eye button that appears on hover to make the hidden lines appear.

[PGXN Meta Sketch]: https://justatheory.com/2024/03/rfc-pgxn-metadata-sketch/
[Trunk]: https://pgt.dev
[purls]: https://github.com/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rs
[SemVer]: https://semver.org

text/0002-binary-distribution-format.md

pgguru · 2024-07-12T01:48:07Z

text/0002-binary-distribution-format.md

+3.  For the left part, split on the right-most dash. If the string to the
+    right of the dash is a valid [SemVer], then the left part is the package
+    name. If the right string is not a valid [SemVer], try again at the second
+    right-most dash and check again. Continue until a valid SemVer is produced
+    or else fail.


SemVers are required to have 3 parts, with any suffix after a dash (presumably not another semver-like string, so 3.1.2-0 would be valid, but 3.1.2-1.2.3?). This part feels like it has the potential for some ambiguity, to be honest, but I trust you're more familiar with SemVer than I am at this point. :-)

Bah, no, of course, 3.1.2-1.2.3 is a valid Semver.
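To make the ambiguity concrete, here is a minimal Python sketch of the right-most-dash procedure quoted from the RFC above. It is an illustration, not the RFC's reference implementation: the function name `split_rightmost`, the variable `stem` (standing in for "the left part"), and the sample package names are assumptions, and the `SEMVER` pattern follows the regular expression suggested on semver.org.

```python
import re

# One pre-release identifier: numeric without leading zeros, or alphanumeric.
_ID = r"(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"

# SemVer 2.0.0 pattern, following the regular expression suggested on semver.org.
SEMVER = re.compile(
    r"(?:0|[1-9]\d*)\.(?:0|[1-9]\d*)\.(?:0|[1-9]\d*)"  # major.minor.patch
    rf"(?:-{_ID}(?:\.{_ID})*)?"                        # optional pre-release
    r"(?:\+[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*)?"        # optional build metadata
)


def split_rightmost(stem):
    """Split at the right-most dash whose right-hand side is a valid SemVer.

    Tries the right-most dash first, then the second right-most, and so on,
    and raises ValueError if no split yields a valid SemVer.
    """
    pos = len(stem)
    while True:
        pos = stem.rfind("-", 0, pos)  # next dash to the left of the last try
        if pos <= 0:
            raise ValueError(f"no valid SemVer found in {stem!r}")
        name, version = stem[:pos], stem[pos + 1:]
        if SEMVER.fullmatch(version):
            return name, version


print(split_rightmost("pg-json-1.0.0-beta"))  # ('pg-json', '1.0.0-beta')
print(split_rightmost("pair-3.1.2-1.2.3"))    # ('pair-3.1.2', '1.2.3')
```

The second call shows the problem: because `1.2.3` on its own is a valid SemVer, the procedure stops one dash too early and never considers `3.1.2-1.2.3` as the version.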

So perhaps it instead needs to be that distribution names cannot start with a digit or use a digit after a dash. That would simplify things: the version starts at the first digit after a dash. Of existing distributions, it looks like only DBT-2 would be a problem:

    select distinct name from distributions where name LIKE '%-%';
               name
    ──────────────────────────
     DBT-2
     E-Maj
     foreign-keycloak-wrapper
     hashtypes-aleksabl
     neo4j-fdw
     passwd-fdw
     pg-json
     Slony-I
     soundex-function
     soundex-operator
     soundex-type
     trunklet-format
    (12 rows)

Note that this does not apply to the name of the extensions, apps, or anything else inside the distribution, just the name of the thing being distributed.
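The simplified rule proposed here ("the version starts at the first digit after a dash") can be sketched in a few lines of Python. The function name `split_name_version` and the sample strings, including the version numbers, are illustrative assumptions; the sketch also does not validate that the version part is a well-formed SemVer.

```python
import re


def split_name_version(stem):
    """Split a "name-version" string at the first dash followed by a digit.

    Assumes package names never start with a digit and never have a digit
    right after a dash, as proposed in the comment above.
    """
    match = re.search(r"-(?=\d)", stem)
    if match is None or match.start() == 0:
        raise ValueError(f"cannot find a version in {stem!r}")
    return stem[:match.start()], stem[match.end():]


print(split_name_version("pair-3.1.2-1.2.3"))  # ('pair', '3.1.2-1.2.3')
print(split_name_version("pg-json-1.0.0"))     # ('pg-json', '1.0.0')
print(split_name_version("DBT-2-0.61.7"))      # ('DBT', '2-0.61.7')
```

The last call shows why DBT-2 is the one existing name that would be a problem: its digit after a dash is taken as the start of the version.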

Okay, added the Package Name type to #3. I think that will solve the problem. I'll rewrite this bit.


* Add the "Package Name" type that disallows leading digits. [Discussion](#2 (comment)).
* Document the Number type and allow `0` as a valid value for a Version Range.
* Add the "Path Pattern" type and use it for the "ignore" property.
* Merge "License" into "License Expression"
* Rename "Version" to "SemVer", to distinguish it from any other versions used for dependencies.
* Make "Version Range" not-specific to SemVers, since it's used for all sorts of dependency version requirements. Also, disallow version truncation in Version Ranges, since different version formats will have different rules.
* Document that purl version expressions are valid but ignored.
* Fix various spelling, grammatical, syntax, and narrative errors and clumsiness.
* Update spec URLs to point to rfcs.pgxn.org.
* Rename `generated_by` to `producer`.
* Add question about preloading.
* Note quality binary distribution as sign of success.



jcflack · 2025-05-19T22:15:54Z

text/0002-binary-distribution-format.md

+Without the trunk binary distribution format, it will be difficult to build
+and deliver cross-platform binary distribution of all the packages on PGXN.
+
+## Prior art


I'll have more substantive comments on things in a bit. But as long as there's a prior-art section ...

As near as I can tell, the Tembo trunk format demonstrated the "expand directory prefixes to pg_config directories" pattern since March 2023. Also, PL/Java demonstrated the same pattern since January 2016. ;)

theory · May 19, 2025

Well yes, that's part of why it's called the "Trunk" format. But you're right, it should be mentioned here. I wasn't aware of PL/Java using it, though.

jcflack · May 19, 2025

Sure, I assumed the reason you hadn't mentioned the earlier PL/Java prior art was that you hadn't known about it, not that you deliberately snubbed it or anything. :)

jcflack · 2025-05-20T01:25:32Z

text/0002-binary-distribution-format.md

+
+## Unresolved questions
+
+*    Should the archive format be Zip or tarball? PGXN had traditionally used


I can't simplify this choice, but maybe I can complicate it in a useful way.

This seems to be a recurring question for projects of all kinds, with very few actual technical arguments to be made either way. Both are widely supported. Either one can deliver the files. Maybe the most noticeable difference is that zip is more closely associated in people's minds with Windows and tar is more associated in people's minds with *n*x.

That association can sometimes be so strong it leads to problems: a quarter century ago, I was contributing to a project (I think it was the ANTLR parser generator) that actually offered downloads in both formats. The choice of format had been overloaded with extra meaning: it was assumed that people using the tar download would want the text artifacts (sources and docs and such) to come out with *n*x-style line endings, and people using the zip download would want those same artifacts to come out with Windows-style line endings.

Tar, of course, doesn't have any notion of line-ending conversions. What you put in is what comes out. So the project built the tarballs from files with *n*xy line endings.

Zip does have a notion of line-ending conversions, so you can build a zipfile from *n*xy text files and have them become Windowsy when extracted on a Windows box.

But the normal way zip does that is irredeemably broken.

For each entry in a zip, the extractor looks at a bit that flags that entry as a binary or text file. Line ending conversions are applied to text files only, and binary ones are of course left alone. Everything about that is ok; the problem is how that flag bit gets set in the first place. The zip file creator tool sets it by guessing. It samples some bytes of each file, and if it decides they look more ASCII/textish than binaryish, it sets the entry's "apparent ASCII/text file" bit to formally record the stupid guess it made. The extractor will then, deterministically, doing exactly as it's been told, corrupt during extraction every binary file that got mistakenly flagged as text. And ANTLR users were being hit by that.

If someone were to build a low-level zip-file creation tool that actually set the "apparent ASCII/text" flag explicitly according to the known text/binaryness of each file instead of guessing, then a zip created that way could be safely extracted anywhere. It's only the familiar ubiquitous creation tool that sets the flag by guessing. I wasn't able to build a better creation tool using the Java library API because that API's not low-level enough to let me control that flag.

It's tempting to say "well, tar then! it won't ever mess with any of the bytes!". But the line-ending issue hasn't gone away; even in 2025 it is still likely that whatever files in a distribution are source/doc/config/text/etc. will be friendlier to view and edit on the user's system if they get the right line endings. Which isn't really a technically difficult problem, as long as you don't try to do it the way zip did. (Thankfully, Mac OS fell in line with the *n*xy line endings, after Macs originally had yet a third different way for lines to end.)

For a format like trunk that wants to cryptographically hash entries, line-ending conversion makes it necessary to specify which form gets hashed. It should probably be the single known form actually stored in the archive. If extracted on a system with different line endings, it should be understood that extracted text files on the filesystem may hash to unexpected values even though the archive they came from verifies ok.
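A small Python sketch of that point: the digest recorded for a member should be computed over the bytes as stored in the archive, and the same file extracted with converted line endings will hash differently. SHA-256 and the sample file content are assumptions for illustration only; nothing here is taken from the trunk spec.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()


# Bytes as stored in the archive (LF line endings); hypothetical content.
stored = b"-- pair extension\nCREATE TYPE pair AS (k text, v text);\n"

# The same file after a line-ending-converting extraction on Windows.
extracted = stored.replace(b"\n", b"\r\n")

recorded_digest = sha256_hex(stored)  # what the archive metadata would carry

print(sha256_hex(stored) == recorded_digest)     # True: archive verifies
print(sha256_hex(extracted) == recorded_digest)  # False: on-disk copy differs

# Normalizing the extracted copy back to the stored form restores the match.
print(sha256_hex(extracted.replace(b"\r\n", b"\n")) == recorded_digest)  # True
```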

When I was later working on PL/Java, years after the ANTLR experience, the choice of format was made easy by noting that absolutely anyone looking at PL/Java is going to have Java installed, and Java's standard library has APIs for zip (and for jar, which is nothing but zip with a manifest file in a specified format that adds information on the content; so, a lot like trunk). If an extractor Java class is added to a jar and named as `Main-Class` in the manifest, the jar can be run with `java -jar foo.jar` and so be self-extracting in case a user has a stripped-down Java runtime without a CLI jar tool.

Because the Java API didn't let me get low-level enough to write zip files with the "apparent text" bits correctly non-guessingly set, I just used entries in the manifest to explicitly indicate which members should or shouldn't get text treatment. The regular CLI zip or jar tools would attach no meaning to those when extracting, but I could write the self-extractor class to do the right thing, and then just recommend `java -jar foo.jar` as always the right way to extract it. And from there it was a short step to also using `pg_config` to find the directories to extract into.
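The manifest-driven approach can be sketched in Python. This is not PL/Java's extractor or its manifest format: the `manifest.json` member, its `"text"` key, and the function name `extract_with_manifest` are all hypothetical, chosen only to show an extractor that converts line endings for members explicitly listed as text and leaves everything else untouched.

```python
import json
import os
import zipfile
from pathlib import Path

MANIFEST = "manifest.json"  # hypothetical member listing the text files


def extract_with_manifest(archive, dest, newline=os.linesep):
    """Extract archive into dest, converting line endings only for members
    that the manifest explicitly lists as text.

    A real extractor would also reject member names that escape dest.
    """
    dest = Path(dest)
    with zipfile.ZipFile(archive) as zf:
        text_members = set(json.loads(zf.read(MANIFEST))["text"])
        for info in zf.infolist():
            if info.is_dir() or info.filename == MANIFEST:
                continue
            data = zf.read(info)
            if info.filename in text_members:
                # Normalize to LF first, then apply the target convention.
                data = data.replace(b"\r\n", b"\n")
                data = data.replace(b"\n", newline.encode("ascii"))
            target = dest / info.filename
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)
```

Because the decision comes from the manifest rather than from a guess about the bytes, a binary member that happens to contain newline bytes is written out unchanged.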

For trunk the choice is harder because you can't count on any single thing like Java being around. (Well, PostgreSQL! Maybe if there were a tarfile-fdw or zipfile-fdw and it was in core...)

Maybe there is a Rust (or Python, or some other) library with a zip API that would give explicit control over the "apparent text" flag when creating. You could create zip files using that and then extraction could safely rely on the flag. Of course you'd still have a custom extractor to do the pg_config directory expansion anyway, so it could also follow the PL/Java example and indicate per-entry text/binaryness in the manifest.
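As one data point, Python's standard `zipfile` module exposes the internal file attributes of each entry as `ZipInfo.internal_attr`, and bit 0 of that field is the "apparent ASCII/text" flag in the ZIP format. The sketch below sets the flag explicitly per member and reads it back; the helper names and member paths are hypothetical, and whether a given extractor honors the flag is a separate question.

```python
import io
import zipfile

TEXT_FLAG = 0x0001  # bit 0 of the ZIP "internal file attributes" field


def add_member(zf, name, data, is_text):
    """Write one member, setting the text flag explicitly instead of guessing."""
    info = zipfile.ZipInfo(name)
    info.compress_type = zipfile.ZIP_DEFLATED
    info.internal_attr = TEXT_FLAG if is_text else 0
    zf.writestr(info, data)


def text_flags(archive):
    """Map each member name to whether its text flag is set."""
    with zipfile.ZipFile(archive) as zf:
        return {
            info.filename: bool(info.internal_attr & TEXT_FLAG)
            for info in zf.infolist()
        }


def build_example():
    """Build a small in-memory archive with one text and one binary member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        add_member(zf, "doc/README.md", b"# pair\n", is_text=True)
        add_member(zf, "lib/pair.so", b"\x7fELF\x00\x01", is_text=False)
    return buf


print(text_flags(build_example()))
# {'doc/README.md': True, 'lib/pair.so': False}
```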

I hope that background's at least interesting, if the matter of zip's handling of text hadn't been explicitly raised before. There's documentation on the extractor used in PL/Java (which is actually a few years older than PL/Java) going into more depth than I have gone into here.

theory · May 20, 2025

Wow, I was not aware of this whole line-ending thing. Interestingly, it looks like gitattributes supports text file annotation similar to what you describe for jar files. As PGXN v1 uses zip files, I can see where this would be super useful. Sadly, it doesn't look like git-archive uses this annotation.

theory added the documentation label Jul 11, 2024

theory self-assigned this Jul 11, 2024

theory had a problem deploying to github-pages July 11, 2024 19:21 — with GitHub Actions Failure

theory force-pushed the main branch from 5f065a3 to 665f4c2 Compare July 11, 2024 19:23

theory force-pushed the binary-distribution-format branch 5 times, most recently from 39613c4 to f4f4d0d Compare July 11, 2024 20:00

theory mentioned this pull request Jul 11, 2024

RFC: Meta Spec v2 #3

Open

theory added the rfc label and removed the documentation label Jul 11, 2024

pgguru reviewed Jul 12, 2024

theory mentioned this pull request Jul 12, 2024

Anticipated RFCs #4

Open

2 tasks

theory force-pushed the binary-distribution-format branch from f4f4d0d to 1a778ac Compare July 12, 2024 21:47

theory added 2 commits July 12, 2024 17:54

Simplify package name and semver parsing

93e353a

Made possible by RFC-3 disallowing digits after dashes in package names.

Simplify package name/version parsing

e7cfa52

Made possible by forbidding dots (.) in Terms in the metadata spec (80702c3), making it impossible for a package name to include a semver.

theory mentioned this pull request Sep 3, 2024

Design Binary Packaging Architecture pgxn/planning#4

Open

8 tasks

jcflack reviewed May 19, 2025

jcflack reviewed May 20, 2025
