fix: don't encode ':' or '/' as part of the canonical representation #161

Draft
wants to merge 2 commits into
base: master
from

Conversation

dwalluck
Contributor

This makes the Java canonical representation match the majority of other implementations.

Fixes #122
Fixes #92

@dwalluck
Contributor Author

dwalluck commented Feb 27, 2025

@jeremylong
Collaborator

Lines 112-113, where you say the : is encoded: I don't think it is. The URL just doesn't have a scheme; just a host and path: repository_url=repo.spring.io/release

@dwalluck
Contributor Author

Let me run the full test suite again without the ':' and see if it fails.

@dwalluck
Contributor Author

OK, the first line is for ':'; the second line is for '/' ("%2F"), not ':'.

@dwalluck
Contributor Author

Checking the latest test suite file: if you don't encode the colon, it fails. The file seems to be inconsistent. The colon is encoded in one spot and not in the other.

org.junit.ComparisonFailure: 
Expected :pkg:huggingface/microsoft/deberta-v3-base@559062ad13d311b87b2c455e67dcd5f1c8f65111?repository_url=https://hub-ci.huggingface.co
Actual   :pkg:huggingface/microsoft/deberta-v3-base@559062ad13d311b87b2c455e67dcd5f1c8f65111?repository_url=https%3A//hub-ci.huggingface.co

@jeremylong
Collaborator

I was talking about the one you mentioned above - there doesn't seem to be any encoding.

https://github.com/package-url/purl-spec/blob/8040ff0be50f0c5b1986b1a0947bd539f5405fc4/test-suite-data.json#L112-L113

    "purl": "pkg:Maven/org.apache.xmlgraphics/batik-anim@1.9.1?classifier=sources&repositorY_url=repo.spring.io/release",
    "canonical_purl": "pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?classifier=sources&repository_url=repo.spring.io/release",

@dwalluck
Contributor Author

There is no encoding in the input. However, the output without the patch is "%2F", not '/', I think.

@dwalluck
Contributor Author

dwalluck commented Feb 27, 2025

And I found the inconsistency in the test file: the colon is encoded when it's in the version (sha256%3A). It's not encoded when it's in a key value (repository_url=https:).

This is not right, is it? I am trying to verify if the problem is the code or the test file.

@dwalluck
Contributor Author

dwalluck commented Feb 27, 2025

Lines 112-113, where you say the : is encoded: I don't think it is. The URL just doesn't have a scheme; just a host and path: repository_url=repo.spring.io/release

Sorry, I flipped the examples! L112-113 is for '/', and L88-89 is for ':'. I hope it makes sense now. I updated the original comment.

@@ -460,7 +460,7 @@ private static String uriEncode(String source, Charset charset) {
     }

     private static boolean isUnreserved(int c) {
-        return (isAlpha(c) || isDigit(c) || '-' == c || '.' == c || '_' == c || '~' == c);
+        return (isAlpha(c) || isDigit(c) || '-' == c || '.' == c || '_' == c || '~' == c || ':' == c || '/' == c);
Contributor

@ppkarwasz ppkarwasz Feb 28, 2025

There should rather be a different set for each PURL component:

  • namespace needs to encode neither : nor /.
  • name does not need to encode :, but it needs to encode /, which is used to separate it from namespace.
  • qualifiers probably do not need to encode :, / or ?.
  • subpath can probably add & and = to the mix.

Contributor Author

I will add a param to the percentEncode to take a String of characters to allow.
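
A minimal sketch of what such a parameter could look like, assuming a standalone helper rather than the project's actual `percentEncode`. The class name, the method signature, and the meaning of `allowed` (extra ASCII characters left unencoded on top of the RFC 3986 unreserved set) are all hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class PercentEncodeSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static boolean isUnreserved(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Hypothetical signature: 'allowed' lists extra ASCII characters
    // that the caller wants to leave unencoded for a given component.
    public static String percentEncode(String source, String allowed) {
        StringBuilder sb = new StringBuilder();
        for (byte b : source.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (isUnreserved(c) || (c < 0x80 && allowed.indexOf(c) >= 0)) {
                sb.append((char) c);
            } else {
                // Everything else becomes %XX with uppercase hex digits.
                sb.append('%').append(HEX[c >> 4]).append(HEX[c & 0x0F]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String url = "https://hub-ci.huggingface.co";
        System.out.println(percentEncode(url, ""));   // nothing extra allowed
        System.out.println(percentEncode(url, "/"));  // only '/' allowed
        System.out.println(percentEncode(url, ":/")); // ':' and '/' allowed
    }
}
```

With only '/' allowed, the sketch produces the `https%3A//...` form seen in the failing comparison above; allowing both ':' and '/' produces the unencoded form the test expects. Which set each component should actually receive is the open question in this thread.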

Contributor Author

Really, there are a lot more characters potentially allowed by the RFC than we list here.

I am of the opinion right now to add just enough in order to pass the test suite.

The '?' is not allowed unencoded per the purl spec itself.

@dwalluck
Contributor Author

So, with 8ab171f, this now fixes #122, but not #92. The test suite has "%3A", but all examples given in https://github.com/package-url/purl-spec/blob/8040ff0be50f0c5b1986b1a0947bd539f5405fc4/README.rst?plain=1#L116-L117 say otherwise.

Should we just change the test suite back to how it was, on the assumption that the tests contain a mistake and the current README.rst is correct?

@ppkarwasz
Contributor

The exact definition of namespace, name, version and subpath are still WIP: https://github.com/orgs/package-url/projects/1

Maybe this PR is premature and we should wait for the spec to be finalized.

@dwalluck
Contributor Author

dwalluck commented Mar 3, 2025

The author of the Rust version has proposed the set of characters to use https://github.com/phylum-dev/purl/blob/151168733f75a9802556e4b07eb577b9d99f7cea/purl/src/format.rs#L9-L27

I could take them from here.

@ppkarwasz
Contributor

The author of the Rust version has proposed the set of characters to use https://github.com/phylum-dev/purl/blob/151168733f75a9802556e4b07eb577b9d99f7cea/purl/src/format.rs#L9-L27

I could take them from here.

That code is based on the WHATWG URL standard, while, unless I am mistaken, PURL will be based on RFC 3986. The living standard seems more liberal, so I would base the code on the more conservative RFC 3986. In particular:

  • RFC 3986 says that the space character, ", <, >, \, the backtick character, {, | and } can never occur in a URI and must be percent-encoded. A literal % also needs to be encoded.
  • WHATWG URL allows some of these characters to be unencoded, but it is probably safer to encode them.
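
One way to keep a configurable allow-list on the conservative side is to reject any allowed set that contains one of the characters listed above. This is only a sketch: the class and method names are hypothetical, and the character list is taken from this comment as written, without claiming it is exhaustive.

```java
public class AlwaysEncodeSketch {
    // Characters named in the comment above as never valid unencoded:
    // space, ", %, <, >, \, backtick, {, | and }.
    private static final String ALWAYS_ENCODE = " \"%<>\\`{|}";

    public static boolean mustAlwaysEncode(char c) {
        return ALWAYS_ENCODE.indexOf(c) >= 0;
    }

    // Hypothetical guard: returns the allowed set unchanged, or throws if it
    // would let one of the always-encoded characters through unencoded.
    public static String checkAllowed(String allowed) {
        for (int i = 0; i < allowed.length(); i++) {
            char c = allowed.charAt(i);
            if (mustAlwaysEncode(c)) {
                throw new IllegalArgumentException(
                        "character must always be percent-encoded: '" + c + "'");
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        System.out.println(checkAllowed(":/")); // accepted
        try {
            checkAllowed("{}"); // rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A guard like this would let the per-component sets change as the spec evolves while still failing fast if a set drifts toward the more liberal WHATWG behaviour.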

@dwalluck
Contributor Author

I will revisit this later. It may need to be combined with #174.

@dwalluck dwalluck marked this pull request as draft March 11, 2025 16:19
@dwalluck dwalluck changed the title Don't encode ':' or '/' as part of the canonical representation fix: don't encode ':' or '/' as part of the canonical representation Mar 18, 2025
@dwalluck dwalluck force-pushed the GH-122 branch 3 times, most recently from e42515b to 447ba2a Compare March 18, 2025 21:48
@ppkarwasz
Contributor

Now that Spotless is there, there will be some conflicts. This should help:

git checkout origin/master -- pom.xml
mvn spotless:apply

@dwalluck
Contributor Author

git checkout origin/master -- pom.xml
mvn spotless:apply

OK, it does apply on build, but if you have to fix the conflicts first, then it doesn't really help.

@dwalluck
Contributor Author

git checkout origin/master -- pom.xml

Now you just need to run sortpom:sort on that 😉

@dwalluck dwalluck force-pushed the GH-122 branch 2 times, most recently from 7803217 to b6482d2 Compare March 19, 2025 13:33
@ppkarwasz
Contributor

git checkout origin/master -- pom.xml
mvn spotless:apply

OK, it does apply on build, but if you have to fix the conflicts first, then it doesn't really help.

You can:

  1. Check out pom.xml from master to get the Spotless configuration.
  2. Format your code.
  3. Merge the master branch into your PR branch. 99% of the code should be identical and identically formatted, so few conflicts should occur.

Successfully merging this pull request may close these issues.

Slash character is not expected to be escaped by the specification
Inconsistent colon encoding
3 participants