Mojo::File handling of utf-8 strings as path names #2043
Comments
A simpler example:
Unicode filename handling in Perl is generally a mess; you can only work with whatever bytes the filesystem gives you, and that's what Mojo::File does. The bytes you received are the correct UTF-8 bytes of the filename, but you rendered them as text output, which encoded the UTF-8 to UTF-8 again, so that output isn't actually indicative of any problem. The issue seems to be that something is upgrading the name before the call to lstat, which unfortunately changes the value seen by Perl's broken filename API.
@Grinnz: no. I have received objects from $dir->list, and those objects were invalid. The reason was not some UTF-8 data in $dir, but the fact that $dir was created from ASCII data for which utf8::is_utf8 was nevertheless true. When a user feeds something to Mojo::File->new(), they usually have no control over whether it is bytes or characters. In my opinion $dir->list should not return invalid objects (i.e. objects on which the $obj->lstat method fails). Note that the problem is not that data from the ->lstat call is upgraded, but that the object returned from ->list->each is invalid.
I understand. I am saying that the output is not incorrect, because rendering text in Mojolicious does not depend on is_utf8 (as most sane code does not). Unfortunately, Perl's filename handling does depend on is_utf8, and that's why some upgrade is breaking your lstat call.
Additionally, is_utf8 does not represent whether it is bytes or characters; it represents how the value is stored in Perl. The filename you received is a set of bytes whether upgraded or downgraded.
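The distinction between a string's value and its internal storage can be seen in a few lines of plain Perl. This is only a minimal sketch: the file name `é.txt` is a made-up example, and using `utf8::encode` to expose the upgraded string's internal buffer is an illustration of what a flag-sensitive call would see, not what Mojo::File itself does.

```perl
use strict;
use warnings;

# Octets of the UTF-8 encoded name "é.txt" (hypothetical example name),
# as a directory listing would return them.
my $bytes = "\xC3\xA9.txt";

# Upgrading changes only the internal storage, not the string's value.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

printf "same value:  %s\n", $bytes eq $upgraded ? "yes" : "no";
printf "same length: %s\n", length($bytes) == length($upgraded) ? "yes" : "no";
printf "is_utf8:     %d vs %d\n",
    utf8::is_utf8($bytes) ? 1 : 0, utf8::is_utf8($upgraded) ? 1 : 0;

# The upgraded string's internal buffer stores each code point (0xC3, 0xA9, ...)
# encoded as UTF-8 again. utf8::encode on a copy makes those octets visible.
my $internal = $upgraded;
utf8::encode($internal);
my $hex = join ' ', map { sprintf '%02X', ord } split //, $internal;
print "internal octets: $hex\n";    # C3 83 C2 A9 2E 74 78 74
```

Both strings compare equal and have the same length, yet the upgraded one is backed by the doubly encoded octets `C3 83 C2 A9`, which is the kind of data a call that reads the raw buffer would hand to the operating system.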
To be clear, this is not a bug in your code, but it may not be easy to fix in Mojolicious correctly.
The bug in Perl where lstat depends on the utf8 flag is the same as this one: Perl/perl5#10550
Steps to reproduce the behavior
When I have a Mojo::File object pointing to a directory, constructed from a path name which is a utf8::is_utf8() string (e.g. taken from a NotYAMLConfig file), it works as expected: I can list the directory using
for ($mojo_file->list->each) { ... }
but the returned objects are broken when the directory contains non-ASCII file names. For example, $entry->lstat on them returns undef. According to strace, Perl tries to do newfstatat(AT_FDCWD, $filename), where $filename contains doubly UTF-8-encoded data.
For example:
Expected behavior
Mojo::File->new() should either utf8::downgrade() its argument itself, or it should warn() about it or reject it altogether. In any case, $dir->list should not return Mojo::File objects which do not correspond to actual files in that directory.
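Until something like that exists, a caller could downgrade the path before constructing the object. The following is a rough sketch of that idea, not a confirmed fix: `path_bytes` is a hypothetical helper that is not part of Mojo::File, and the path `/tmp/example` is made up.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mojo::File): return a copy of $path that is
# stored as bytes, so flag-sensitive built-ins see the intended octets.
sub path_bytes {
    my ($path) = @_;
    my $copy = $path;
    # A true second argument makes utf8::downgrade return false instead of
    # dying when the string contains characters above 0xFF.
    utf8::downgrade($copy, 1)
        or die "Path contains characters above 0xFF; encode it explicitly first\n";
    return $copy;
}

my $dir = "/tmp/example";    # ASCII-only path (made-up example)
utf8::upgrade($dir);         # simulate a value read from a decoded config file

my $safe = path_bytes($dir);
print utf8::is_utf8($safe) ? "still upgraded\n" : "downgraded\n";
```

The result could then be passed on as `Mojo::File->new(path_bytes($dir))`. Whether this avoids the broken objects from ->list in every case has not been verified here.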
Actual behavior
Incorrect Mojo::File objects are constructed by the ->list method, resulting in doubly encoded UTF-8 data being passed to the newfstatat() syscall.