-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes quoting #358
Fixes quoting #358
Conversation
💖 Thank you for opening your first pull request in this repository! 💖 A few things to keep in mind:
⭐ No matter what, we are really grateful that you put in the effort to do this! ⭐ |
@bmcfee this is a tricky one indeed.
It seems that the 1.7 behaviour is a bug since it changed the default behaviour of 1.6, which was not intended in #313. I would be very keen to retain the behaviour of 1.6 but still allow spaces in file names somehow. 1.7 broke backwards compatibility when it shouldn't have so we need to put out a patch. So I think the test
100% agree with this. The registry file format was supposed to be extremely simple but we ended up adding a bunch of stuff on top of it and now it's breaking. In reality, we should have just gone with a json file that maps directly to the in memory As a first step, could we get the parser handling the following registry? It would be backwards compatible and allow quotes and spaces.
This assumes that the quotes should be removed if they are the first and last character only. Otherwise, they're part of the string and should remain. Is that a valid assumption? If shlex doesn't handle that, how hard would it be to enforce this through our previous simple parser? |
I would definitely advocate for a json-backed format here, as the parsing is (relatively) unambiguous. (And I know that nobody mentioned this, but yaml would be a terrible idea. :))
I don't think this is a valid assumption, and it opens up some weird behaviors: what do we do if a filename begins with a quote? Two quotes? Getting it to behave consistently here could be a bit of a mess, and I think it's much cleaner to have a consistent handling of quote characters throughout the string. It's also a pretty subtle parsing behavior for an end-user to keep in mind, as handling of quotes will differ depending on where they land in the string. I can't think of any other quoting or sanitization schemes that work this way, so there's no precedent that we can direct users to for why it would work this way, and users will have to learn this syntax specifically for pooch. My 2¢ on all of this: shlex parsing is a fine enough choice, even if it's overkill. The whole problem I had was it breaking backward compatibility. The solution proposed in this PR is meant to be a minimally viable parser for the format specification as it currently stands. However, I also agree that it's not ultimately the most sustainable approach, and would advocate for a JSON-backed implementation going forward. |
This PR restores the ability of filenames to contain quotes.
Implementing this required some modifications to the registry parser, as discussed in the issue thread for #357.
The dust isn't quite settled on this though; two of the existing tests are now failing:
The first test fails as follows:
This because the line
Would be technically valid if
forth_element
is considered as the hash (it fails URL validation), and the remaining prefix of the line (some-file.txt second_element third_element
) is taken as the filename.This could be easily fixed by tightening the check on what constitutes a valid hash (eg hex strings with optional algorithm prefix), but one could easily devise a new test case that would pass there. I think the correct behavior is actually to let this one parse.
The second test fails as follows:
This is because the surrounding quotes on the filename are no longer suppressed by the registry parser. (This was the initial cause of my problem with #357.)
I'm actually not sure what the correct behavior here should be. The old (<=1.6) behavior would be to leave the quotes in, in which case this test as written should fail. The test could be revised so that the quotes are included in the search key.
OTOH, I see the logic in allowing quotes to encapsulate filenames like this. Doing this correctly does require some sophisticated implementation (ie the guts of the shlex package) to get the escaping and matching logic right. IMO it's overkill for what's needed here since the file format can only include one or two additional tokens (with no spaces in either) afterward.
Keeping filename parsing as literal as possible has a lot of advantages for simplicity, and seems less likely to surface unexpected behavior (eg suppressing quotes).
Unfortunately, I don't think it's possible to give a solution that's backward compatible with both 1.7.0 and 1.6 behavior, so I understand if the maintainers don't want to adopt the solution I've implemented here.
Relevant issues/PRs:
Fixes #357
Also relevant to #313