RFC: Binary Distribution Format #2
Add a new RFC for Meta Spec v2. Based on the [PGXN Meta Sketch] and additional research. Notable differences from v1:

* Introduce the term "package" separate from "distribution". The "package" is the bundle of objects being distributed as a…package.
* Use only one version for the entire distribution, rather than separate versions for each extension included in the distribution. This allows the deletion of the notes on comparing versions.
* Make "Source Distribution" the formal term, but mostly refer to it as "Distribution". Will be distinguished from binary distributions that ship with their own metadata (see #2).
* Use JSON data types as the base types instead of generic "list" and "map" types.
* Add "Path", "purl", and "Platform" types.
* Use SPDX License Expressions instead of Perl Software::License-based structures.
* Use the term "property" to describe object key/value pairs, to align with JSON Schema.
* Use an array of objects to describe maintainers.
* Replace the `provides` property with `contents`, with support for multiple kinds of PostgreSQL extensions, including TLEs, loadable modules, and background workers.
* Move the `tags` property to the new `classifications` object, and add support for curated categories borrowed from [Trunk].
* Replace `no_index` with `ignore` and use the gitignore format instead of separate lists of files and directories.
* Rename `prereqs` to `packages` and move it into the new `dependencies` property, which also has `postgres`, `pipeline`, `platforms`, and `variations` properties. Use [purls] to specify dependencies, so that any supported packaging dependency can be specified, as well as PGXN packages and Postgres core extensions and tools.
* Remove `release_status`; we'll instead depend on [SemVer] to indicate pre-releases.
* Simplify the `resources` object and add `badges` to it.
* Add the `artifacts` property, so the extension author can include links to other packages or sources for a release.

Also configure `#` to hide a line in `json` code blocks and use it to encode proper JSON objects without showing the surrounding braces. Readers can hit the eye button that appears on hover to make the hidden lines appear.

[PGXN Meta Sketch]: https://justatheory.com/2024/03/rfc-pgxn-metadata-sketch/
[Trunk]: https://pgt.dev
[purls]: https://github.com/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rs
[SemVer]: https://semver.org
3. For the left part, split on the right-most dash. If the string to the
   right of the dash is a valid [SemVer], then the left part is the package
   name. If the right string is not a valid [SemVer], try again at the second
   right-most dash and check again. Continue until a valid SemVer is produced
   or else fail.
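The step quoted above can be sketched roughly as follows. This is only an illustration, assuming Python and a regular expression adapted from the one suggested in the semver.org FAQ; the function name `split_name_version` is hypothetical and does not come from the RFC.

```python
import re

# Adapted from the regular expression suggested in the semver.org FAQ.
SEMVER = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)


def split_name_version(left: str) -> tuple[str, str]:
    """Split "name-version" by trying dashes from right to left.

    Hypothetical helper: returns (name, version) for the first split,
    starting at the right-most dash, whose right side is a valid SemVer.
    """
    idx = len(left)
    while True:
        # Find the next dash to the left of the previous candidate.
        idx = left.rfind("-", 0, idx)
        if idx <= 0:
            raise ValueError(f"no valid SemVer found in {left!r}")
        name, version = left[:idx], left[idx + 1:]
        if SEMVER.match(version):
            return name, version
```

With this sketch, `pg-json-1.0.0-rc.1` first fails at the right-most dash (`rc.1` is not a SemVer) and then succeeds at the next one. An input such as `foo-3.1.2-1.2.3` stops at the right-most dash, because `1.2.3` is itself a valid SemVer, and so yields the name `foo-3.1.2`.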
SemVers are required to have three parts, with any suffix after a dash (presumably not another semver-like string, so `3.1.2-0` would be valid, but `3.1.2-1.2.3`?). This part feels like it has some potential for ambiguity, to be honest, but I trust you're more familiar with SemVer than I am at this point. :-)
Bah, no, of course, `3.1.2-1.2.3` is a valid SemVer.
So perhaps it instead needs to be that distribution names cannot start with a digit or use a digit after a dash. That would simplify things: the version starts at the first digit after a dash. Of existing distributions, it looks like only `DBT-2` would be a problem:
SELECT DISTINCT name FROM distributions WHERE name LIKE '%-%';
name
──────────────────────────
DBT-2
E-Maj
foreign-keycloak-wrapper
hashtypes-aleksabl
neo4j-fdw
passwd-fdw
pg-json
Slony-I
soundex-function
soundex-operator
soundex-type
trunklet-format
(12 rows)
Note that this does not apply to the name of the extensions, apps, or anything else inside the distribution, just the name of the thing being distributed.
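The proposed rule can be checked against the names listed above with a short sketch. It assumes Python; the function names `is_valid_package_name` and `split_at_first_digit` are hypothetical, and the regular expressions are only one reading of "cannot start with a digit or use a digit after a dash".

```python
import re

# Names returned by the query above.
NAMES = [
    "DBT-2", "E-Maj", "foreign-keycloak-wrapper", "hashtypes-aleksabl",
    "neo4j-fdw", "passwd-fdw", "pg-json", "Slony-I", "soundex-function",
    "soundex-operator", "soundex-type", "trunklet-format",
]


def is_valid_package_name(name: str) -> bool:
    """Proposed rule: no leading digit and no digit directly after a dash."""
    return re.search(r"^\d|-\d", name) is None


def split_at_first_digit(value: str) -> tuple[str, str]:
    """Treat the first digit that follows a dash as the start of the version."""
    match = re.search(r"-(?=\d)", value)
    if match is None or match.start() == 0:
        raise ValueError(f"no version found in {value!r}")
    return value[: match.start()], value[match.end():]


# Names from the list that would violate the proposed rule.
offenders = [name for name in NAMES if not is_valid_package_name(name)]
```

Under this rule a string such as `pg-json-3.1.2-1.2.3` splits unambiguously into `pg-json` and `3.1.2-1.2.3`, while a name like `DBT-2` would be cut at its own digit, which is why it would need to be disallowed.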
Okay, added the Package Name type to #3. I think that will solve the problem. I'll rewrite this bit.
* Add the "Package Name" type that disallows leading digits. [Discussion](#2 (comment)).
* Document the Number type and allow `0` as a valid value for a Version Range.
* Add the "Path Pattern" type and use it for the "ignore" property.
* Merge "License" into "License Expression".
* Rename "Version" to "SemVer", to distinguish it from any other versions used for dependencies.
* Make "Version Range" not specific to SemVers, since it's used for all sorts of dependency version requirements. Also, disallow version truncation in Version Ranges, since different version formats will have different rules.
* Document that purl version expressions are valid but ignored.
* Fix various spelling, grammatical, syntax, and narrative errors and clumsiness.
* Update spec URLs to point to rfcs.pgxn.org.
* Rename `generated_by` to `producer`.
* Add question about preloading.
* Note quality binary distribution as sign of success.
Add a new RFC describing the proposed trunk binary distribution format for PGXN packages. Inspired by Python wheel and pgt.dev, aiming to support binaries for every OS and architecture supported by PostgreSQL itself, as well as many versions of PostgreSQL.
Made possible by RFC-3 disallowing digits after dashes in package names.
Made possible by forbidding dots (.) in Terms in the metadata spec (80702c3), making it impossible for a package name to include a SemVer.
Without the trunk binary distribution format, it will be difficult to build
and deliver cross-platform binary distribution of all the packages on PGXN.

## Prior art
I'll have more substantive comments on things in a bit. But as long as there's a prior-art section ...
As near as I can tell, the Tembo `trunk` format has demonstrated the "expand directory prefixes to `pg_config` directories" pattern since March 2023. Also, PL/Java has demonstrated the same pattern since January 2016. ;)
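For readers unfamiliar with the pattern, here is a minimal sketch of what "expand directory prefixes to `pg_config` directories" could look like. It assumes Python; the prefix names in `PREFIX_OPTIONS` and both function names are hypothetical and are not taken from the trunk or PL/Java layouts. Only the `pg_config` options themselves (`--bindir`, `--pkglibdir`, `--sharedir`, `--docdir`) are real.

```python
import subprocess
from pathlib import PurePosixPath

# Hypothetical archive prefixes, each mapped to the pg_config option
# that reports the matching directory on the target system.
PREFIX_OPTIONS = {
    "bindir": "--bindir",
    "pkglibdir": "--pkglibdir",
    "sharedir": "--sharedir",
    "docdir": "--docdir",
}


def pg_config_dirs(pg_config: str = "pg_config") -> dict[str, str]:
    """Ask pg_config for each directory (requires PostgreSQL to be installed)."""
    return {
        prefix: subprocess.run(
            [pg_config, option], check=True, capture_output=True, text=True
        ).stdout.strip()
        for prefix, option in PREFIX_OPTIONS.items()
    }


def expand(member: str, dirs: dict[str, str]) -> str:
    """Replace the leading prefix of an archive member with a real directory."""
    prefix, _, rest = member.partition("/")
    if prefix not in dirs or not rest:
        raise ValueError(f"unknown directory prefix in {member!r}")
    if ".." in PurePosixPath(rest).parts:
        raise ValueError(f"refusing to leave the target directory: {member!r}")
    return str(PurePosixPath(dirs[prefix]) / rest)
```

An installer would call `pg_config_dirs()` once and then `expand()` for every archive member, so the same archive can land in the right place on systems with different directory layouts.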
Well yes, that's part of why it's called the "Trunk" format. But you're right it should be mentioned here. I wasn't aware of PL/Java using it, though.
Sure, I assumed the reason you hadn't mentioned the earlier PL/Java prior art was that you hadn't known about it, not that you deliberately snubbed it or anything. :)

## Unresolved questions

* Should the archive format be Zip or tarball? PGXN had traditionally used
I can't simplify this choice, but maybe I can complicate it in a useful way.
This seems to be a recurring question for projects of all kinds, with very few actual technical arguments to be made either way. Both are widely supported. Either one can deliver the files. Maybe the most noticeable difference is that zip is more closely associated in people's minds with Windows and tar is more associated in people's minds with *n*x.
That association can sometimes be so strong it leads to problems: a quarter century ago, I was contributing to a project (I think it was the ANTLR parser generator) that actually offered downloads in both formats. The choice of format had been overloaded with extra meaning: it was assumed that people using the tar download would want the text artifacts (sources and docs and such) to come out with *n*x-style line endings, and people using the zip download would want those same artifacts to come out with Windows-style line endings.
Tar, of course, doesn't have any notion of line-ending conversions. What you put in is what comes out. So the project built the tarballs from files with *n*xy line endings.
Zip does have a notion of line-ending conversions, so you can build a zipfile from *n*xy text files and have them become Windowsy when extracted on a Windows box.
But the normal way zip does that is irredeemably broken.
For each entry in a zip, the extractor looks at a bit that flags that entry as a binary or text file. Line ending conversions are applied to text files only, and binary ones are of course left alone. Everything about that is ok; the problem is how that flag bit gets set in the first place. The zip file creator tool sets it by guessing. It samples some bytes of each file, and if it decides they look more ASCII/textish than binaryish, it sets the entry's "apparent ASCII/text file" bit to formally record the stupid guess it made. The extractor will then, deterministically, doing exactly as it's been told, corrupt during extraction every binary file that got mistakenly flagged as text. And ANTLR users were being hit by that.
If someone were to build a low-level zip-file creation tool that actually set the "apparent ASCII/text" flag explicitly according to the known text/binaryness of each file instead of guessing, then a zip created that way could be safely extracted anywhere. It's only the familiar ubiquitous creation tool that sets the flag by guessing. I wasn't able to build a better creation tool using the Java library API because that API's not low-level enough to let me control that flag.
It's tempting to say "well, tar then! it won't ever mess with any of the bytes!". But the line-ending issue hasn't gone away; even in 2025 it is still likely that whatever files in a distribution are source/doc/config/text/etc. will be friendlier to view and edit on the user's system if they get the right line endings. Which isn't really a technically difficult problem, as long as you don't try to do it the way zip did. (Thankfully, Mac OS fell in line with the *n*xy line endings, after Macs originally had yet a third different way for lines to end.)
For a format like trunk that wants to cryptographically hash entries, line-ending conversion makes it necessary to specify which form gets hashed. It should probably be the single known form actually stored in the archive. If extracted on a system with different line endings, it should be understood that extracted text files on the filesystem may hash to unexpected values even though the archive they came from verifies ok.
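A small sketch of why the hashed form has to be pinned down, assuming Python and SHA-256 (the hash algorithm and the sample file contents are assumptions for illustration only):

```python
import hashlib

# Bytes as stored in the archive, with *n*x-style line endings.
stored = b"comment = 'A key/value pair type'\ndefault_version = '1.0.0'\n"

# The same file after a line-ending conversion on extraction.
extracted = stored.replace(b"\n", b"\r\n")


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# The digests differ even though the "text" is the same.
digests_differ = sha256_hex(stored) != sha256_hex(extracted)

# Normalizing back to the stored form recovers the original digest.
normalized = extracted.replace(b"\r\n", b"\n")
digests_match_after_normalizing = sha256_hex(normalized) == sha256_hex(stored)
```

A verifier that hashes the archive members would pass, while one that hashes the converted files on disk would report a mismatch unless it normalizes them first.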
When I was later working on PL/Java, years after the ANTLR experience, the choice of format was made easy by noting that absolutely anyone looking at PL/Java is going to have Java installed, and Java's standard library has APIs for zip (and for jar, which is nothing but zip with a manifest file in a specified format that adds information on the content; so, a lot like trunk). If an extractor Java class is added to a jar and named as `Main-Class` in the manifest, the jar can be run with `java -jar foo.jar` and so be self-extracting in case a user has a stripped-down Java runtime without a CLI jar tool.
Because the Java API didn't let me get low-level enough to write zip files with the "apparent text" bits correctly non-guessingly set, I just used entries in the manifest to explicitly indicate which members should or shouldn't get text treatment. The regular CLI zip or jar tools would attach no meaning to those when extracting, but I could write the self-extractor class to do the right thing, and then just recommend `java -jar foo.jar` as always the right way to extract it. And from there it was a short step to also using `pg_config` to find the directories to extract into.
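The manifest-driven approach described here can be sketched in a few lines. This is not the PL/Java extractor; it assumes Python, and the manifest shape (a mapping from member name to a text flag) and the function name `read_members` are hypothetical.

```python
import io
import os
import zipfile


def read_members(
    archive,
    manifest: dict[str, bool],
    newline: bytes = os.linesep.encode(),
) -> dict[str, bytes]:
    """Read every member, converting line endings only where the manifest says so.

    Members missing from the manifest are treated as binary, the safe default.
    """
    members = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            if manifest.get(name, False):
                # Normalize to LF first, then apply the target line ending.
                data = data.replace(b"\r\n", b"\n").replace(b"\n", newline)
            members[name] = data
    return members
```

Because the decision comes from an explicit manifest rather than from a guess made at archive-creation time, a binary member that happens to look like text is never rewritten.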
For trunk the choice is harder because you can't count on any single thing like Java being around. (Well, PostgreSQL! Maybe if there were a tarfile-fdw or zipfile-fdw and it was in core...)
Maybe there is a Rust (or Python, or some other) library with a zip API that would give explicit control over the "apparent text" flag when creating. You could create zip files using that and then extraction could safely rely on the flag. Of course you'd still have a custom extractor to do the `pg_config` directory expansion anyway, so it could also follow the PL/Java example and indicate per-entry text/binaryness in the manifest.
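As one data point on the library question: Python's standard `zipfile` module exposes each entry's internal file attributes as `ZipInfo.internal_attr`, and bit 0 of that field is the "apparent text" flag in the ZIP format. A minimal sketch, assuming Python; the helper names `write_entries` and `read_text_flags` are hypothetical, and whether a given extractor honors the flag is a separate matter.

```python
import io
import zipfile

# Bit 0 of the ZIP "internal file attributes" field: apparent ASCII/text file.
TEXT_FLAG = 0x0001


def write_entries(target, entries) -> None:
    """Write (name, data, is_text) entries, setting the text flag explicitly."""
    with zipfile.ZipFile(target, "w") as zf:
        for name, data, is_text in entries:
            info = zipfile.ZipInfo(name)
            # No guessing: the caller states whether the member is text.
            info.internal_attr = TEXT_FLAG if is_text else 0
            zf.writestr(info, data)


def read_text_flags(source) -> dict[str, bool]:
    """Report the text flag recorded for each entry."""
    with zipfile.ZipFile(source) as zf:
        return {
            info.filename: bool(info.internal_attr & TEXT_FLAG)
            for info in zf.infolist()
        }
```

A custom extractor could read the same flag back and apply line-ending conversion only to entries where it is set.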
I hope that background's at least interesting, if the matter of zip's handling of text hadn't been explicitly raised before. There's documentation on the extractor used in PL/Java (which is actually a few years older than PL/Java) going into more depth than I have gone into here.
Wow, I was not aware of this whole line-ending thing. Interestingly, it looks like gitattributes supports text file annotation similar to what you describe for jar files. As PGXN v1 uses zip files, I can see where this would be super useful. Sadly, it doesn't look like `git archive` uses this annotation.
Previous discussion.