Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes quoting #358

Closed
wants to merge 2 commits into from
Closed

Fixes quoting #358

wants to merge 2 commits into from

Conversation

bmcfee
Copy link

@bmcfee bmcfee commented Mar 17, 2023

This PR restores the ability of filenames to contain quotes.

Implementing this required some modifications to the registry parser, as discussed in the issue thread for #357.

The dust isn't quite settled on this though; two of the existing tests are now failing:

       2 failed
         - pooch/tests/test_core.py:471 test_pooch_load_registry_invalid_line
         - pooch/tests/test_core.py:478 test_pooch_load_registry_with_spaces

The first test fails as follows:

――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― test_pooch_load_registry_invalid_line ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

    def test_pooch_load_registry_invalid_line():
        "Should raise an exception when a line doesn't have two elements"
        pup = Pooch(path="", base_url="", registry={})
        with pytest.raises(IOError):
>           pup.load_registry(os.path.join(DATA_DIR, "registry-invalid.txt"))
E           Failed: DID NOT RAISE <class 'OSError'>

pooch/tests/test_core.py:475: Failed

This because the line

some-file.txt second_element third_element forth_element

Would be technically valid if forth_element is considered as the hash (it fails URL validation), and the remaining prefix of the line (some-file.txt second_element third_element) is taken as the filename.

This could be easily fixed by tightening the check on what constitutes a valid hash (eg hex strings with optional algorithm prefix), but one could easily devise a new test case that would pass there. I think the correct behavior is actually to let this one parse.


The second test fails as follows:

――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― test_pooch_load_registry_with_spaces ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

    def test_pooch_load_registry_with_spaces():
        "Should check that spaces in filenames are allowed in registry files"
        pup = Pooch(path="", base_url="")
        pup.load_registry(os.path.join(DATA_DIR, "registry-spaces.txt"))
>       assert "file with spaces.txt" in pup.registry
E       assert 'file with spaces.txt' in {'"file with spaces.txt"': 'baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d', 'other\\ with\\ spaces.txt': 'baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d'}
E        +  where {'"file with spaces.txt"': 'baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d', 'other\\ with\\ spaces.txt': 'baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d'} = <pooch.core.Pooch object at 0x7f93733cbb20>.registry

pooch/tests/test_core.py:482: AssertionError

This is because the surrounding quotes on the filename are no longer suppressed by the registry parser. (This was the initial cause of my problem with #357.)

I'm actually not sure what the correct behavior here should be. The old (<=1.6) behavior would be to leave the quotes in, in which case this test as written should fail. The test could be revised so that the quotes are included in the search key.

OTOH, I see the logic in allowing quotes to encapsulate filenames like this. Doing this correctly does require some sophisticated implementation (ie the guts of the shlex package) to get the escaping and matching logic right. IMO it's overkill for what's needed here since the file format can only include one or two additional tokens (with no spaces in either) afterward.

Keeping filename parsing as literal as possible has a lot of advantages for simplicity, and seems less likely to surface unexpected behavior (eg suppressing quotes).

Unfortunately, I don't think it's possible to give a solution that's backward compatible with both 1.7.0 and 1.6 behavior, so I understand if the maintainers don't want to adopt the solution I've implemented here.

Relevant issues/PRs:
Fixes #357
Also relevant to #313

@welcome
Copy link

welcome bot commented Mar 17, 2023

💖 Thank you for opening your first pull request in this repository! 💖

A few things to keep in mind:

No matter what, we are really grateful that you put in the effort to do this!

@leouieda
Copy link
Member

@bmcfee this is a tricky one indeed.

I don't think it's possible to give a solution that's backward compatible with both 1.7.0 and 1.6 behavior.

It seems that the 1.7 behaviour is a bug since it changed the default behaviour of 1.6, which was not intended in #313. I would be very keen to retain the behaviour of 1.6 but still allow spaces in file names somehow. 1.7 broke backwards compatibility when it shouldn't have so we need to put out a patch.

So I think the test test_pooch_load_registry_invalid_line with the multiple entries should still fail.

Keeping filename parsing as literal as possible has a lot of advantages for simplicity, and seems less likely to surface unexpected behavior (eg suppressing quotes).

100% agree with this. The registry file format was supposed to be extremely simple but we ended up adding a bunch of stuff on top of it and now it's breaking. In reality, we should have just gone with a json file that maps directly to the in memory registry and another one of the urls. The way we handle the registry and custom urls needs to be reworked since it's clearly showing its limitations. But that's something for Pooch 2.0 maybe.


As a first step, could we get the parser handling the following registry? It would be backwards compatible and allow quotes and spaces.

# Regular
foo.txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d
foo.txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d https://url.com
# Quotes should remain as characters
foo'.txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d
foo'.txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d https://url.com
foo".txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d
foo".txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d https://url.com
"foo".txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d
"foo".txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d https://url.com
'foo'.txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d
'foo'.txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d https://url.com
# Escaped spaces
foo\ bar.txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d
foo\ bar.txt baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d https://url.com
# Quotes should be removed after parsing
"foo.txt" baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d
"foo.txt" baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d https://url.com
"foo bar.txt" baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d
"foo bar.txt" baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d https://url.com
'foo bar.txt' baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d
'foo bar.txt' baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d https://url.com

This assumes that the quotes should be removed if they are the first and last character only. Otherwise, they're part of the string and should remain. Is that a valid assumption?

If shlex doesn't handle that, how hard would it be to enforce this through our previous simple parser?

@bmcfee
Copy link
Author

bmcfee commented Mar 29, 2023

In reality, we should have just gone with a json file that maps directly to the in memory registry and another one of the urls. The way we handle the registry and custom urls needs to be reworked since it's clearly showing its limitations. But that's something for Pooch 2.0 maybe.

I would definitely advocate for a json-backed format here, as the parsing is (relatively) unambiguous. (And I know that nobody mentioned this, but yaml would be a terrible idea. :))

This assumes that the quotes should be removed if they are the first and last character only. Otherwise, they're part of the string and should remain. Is that a valid assumption?

I don't think this is a valid assumption, and it opens up some weird behaviors: what do we do if a filename begins with a quote? Two quotes? Getting it to behave consistently here could be a bit of a mess, and I think it's much cleaner to have a consistent handling of quote characters throughout the string.

It's also a pretty subtle parsing behavior for an end-user to keep in mind, as handling of quotes will differ depending on where they land in the string. I can't think of any other quoting or sanitization schemes that work this way, so there's no precedent that we can direct users to for why it would work this way, and users will have to learn this syntax specifically for pooch.

My 2¢ on all of this: shlex parsing is a fine enough choice, even if it's overkill. The whole problem I had was it breaking backward compatibility.

The solution proposed in this PR is meant to be a minimally viable parser for the format specification as it currently stands. However, I also agree that it's not ultimately the most sustainable approach, and would advocate for a JSON-backed implementation going forward.

@bmcfee bmcfee closed this Sep 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pooch 1.7.0 fails when filenames have apostrophes
2 participants