-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pooch 1.7.0 fails when filenames have apostrophes #357
Comments
Confirmed that this is due to this line: Line 660 in a965902
For example, the 1.6.0 behavior would have been: In [2]: line = "Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"
In [3]: line.split()
Out[3]: ["Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"] In 1.7, it becomes: In [11]: shlex.split(line)
Out[11]: ['Karissa_Hobbs_-_Lets_Go_Fishin.ogg'] (note the missing single-quotes) A quick fix here is to use In [12]: shlex.split(shlex.quote(line))
Out[12]: ["Karissa_Hobbs_-_Let's_Go_Fishin'.ogg"] I don't know if this will be safe across platforms, but it would at least fix the bug I'm seeing. |
For the reference: The original bug that let to using |
I think the use of In [2]: shlex.split(shlex.quote("foo' bar baz'.txt"))
Out[2]: ["foo' bar baz'.txt"] |
Ok, I started prepping a PR for this and realized that it's a bit more subtle than just sticking shlex.quote in there, since it will apply to the entire line. If a filename ends with a quote, it will muck up the space parsing, and the whole parser falls apart. The problem as I see it boils down to some ambiguity in the format of the registry file, in particular due to the option of custom urls as a third token. We don't know in advance how many spaces there will be in a line, and that makes splitting tricky. If we didn't have custom urls, it would be enough to do: filename, hash = line.rsplit(' ', maxsplit=2) where rsplit will group all but the last space separator into If we always have a custom URL, it would be equally simple: filename, hash, url = line.rsplit(' ', maxsplit=3) but this will be incorrect if the url is absent and the filename contains a space. What we really need is a way to first check if the final token is a hash or a url, and then act accordingly: # Assuming URLs in registry files are properly URL-encoded, i.e. ' ' → '%20'
prefix, last = line.rsplit(' ', maxsplit=2)
if is_url(last):
url = last
filename, hash = prefix.split(' ', maxsplit=2)
else:
filename, hash, url = prefix, last, None Now, there's a problem that URLs (per RFC 3986) if not fully formed can look an awful lot like hashes. We can't rely on detecting just a We also can't get away with detecting I suspect the simplest solution would be to explicitly match URLs against a subset of supported schemes (http:, https:, ftp:, sftp:, doi:) and treat anything else as if it's a hash. If it's not actually a hash, this will eventually fail later on. Does this sound reasonable? If so, I'd be happy to prep a PR implementing the above strategy. |
Hi @bmcfee, thanks for opening this issue. I'm sorry that the last release broke backward compatibility. This is something we always try to avoid through exhaustive testing, but as you could tell we weren't considering the case where the filenames contain apostrophes. I think the issue you are facing is part of a larger discussion: what type of standard do we want to use for filenames within Pooch. After we merged #315 I think we inadvertently established that we would follow the POSIX standard (see Parsing rules in shlex when POSIX is True). Therefore, if a filename contains a special character, like a space or an apostrophe, we can always use the backslash to escape it. In your particular case, you can modify your registry file to:
and it should work. We use the same strategy for spaces, as was introduced in #315. If we try to modify the So, in my opinion, special characters like spaces, double quotes or apostrophes in filenames should be handled by inserting a backslash before them. What do you think? Does it sounds like a good compromise? |
Thanks @santisoler - I agree with that assessment. I'm totally happy if pooch wants to explicitly adopt POSIX-style syntax for filenames, and I think the confusion only arose because the syntax was implicitly defined (1.7) or under-specified (<1.6). I opened the PR because it wasn't clear if this was intended behavior or a bug to be fixed, but if it's intended behavior we can close it out. I think what this means for my project in particular is that we need to put out a post-release to upper-bound the supported pooch versions, and then revise our registry file with an updated lower bound. One question though: how does this formatting work when creating registry files from a pooch object? I don't see any logic for escaping filenames on write ( Lines 218 to 222 in a965902
|
@bmcfee thanks for being very understanding with this. I think it's best to assume the POSIX syntax is intended (though it was originally an oversight). |
No worries - sounds like a good plan to me! |
Description of the problem:
This just popped up in the librosa test suite after the environments upgraded from pooch 1.6 to 1.7.
We have a registry of data files that are fetched by pooch, as listed here:
https://github.com/librosa/librosa/blob/c800e74f6a6ec5c27e0fa978d7355943cce04359/librosa/util/example_data/registry.txt
Note that three of the files (lines 22-24) contain an apostrophe:
In pooch 1.6, this works as expected. However, after upgrading to 1.7, reading these files fails (see below). I believe this is a regression caused by #315 .
Full code that generated the error
Minimal example via librosa:
This will call out to our example loader:
https://github.com/librosa/librosa/blob/c800e74f6a6ec5c27e0fa978d7355943cce04359/librosa/util/files.py#L43-L98
Full error message
System information
conda list
below:Output of conda list (it's bloated, but you asked for it :))
The text was updated successfully, but these errors were encountered: