fs.listdir and UnicodeError #120
Comments
Sounds like a bug. What filesystem? A traceback would be great.
This is likely an error that can happen in general on Linux and Unix: a path cannot be guaranteed to be Unicode-decodable, as it is an unspecified byte string with some (possibly unknown) encoding.
To get some feel for the problem of FS encoding (at least on Python 2), see aboutcode-org/scancode-toolkit#688
So as reported in scancode by @dengste
That's a problem with Python 2 only.
Python 3 uses the surrogateescape error handler for this, and one way to emulate it on Python 2 is https://github.com/pjdelport/backports.os by @pjdelport
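For readers unfamiliar with surrogateescape, here is a minimal sketch of the round trip. The byte string is a made-up example name and UTF-8 is assumed as the filesystem encoding; neither comes from the report.

```python
# Hypothetical file name as raw bytes: Latin-1 encoded, so not valid UTF-8.
raw = b"caf\xf1"

# Strict decoding fails on the undecodable byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# (here 0xF1 becomes U+DCF1) instead of raising.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
roundtrip = name.encode("utf-8", "surrogateescape")
```

On POSIX systems, Python 3's os.fsdecode and os.fsencode apply this same handler with the filesystem encoding.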
FWIW, you are not alone there: @jaraco's path.py has the same issue, see jaraco/path#130
Here is a more complete snippet:
|
@willmcgugan one issue is that your fsencode/fsdecode (https://github.com/PyFilesystem/pyfilesystem2/blob/master/fs/_fscompat.py#L5) may not be as thorough as @pjdelport's (https://github.com/pjdelport/backports.os/blob/master/src/backports/os.py). @pjdelport's backport works flawlessly on Python 2 for me, and we have tested it on 100+ million files so far.
Now the key is that
... and only there. So IMHO the fix could be to dabble around there. But then the fsencode/fsdecode dance to ensure correctness on Linux/Unix may need to be done in many other places: not sure.
FWIW, @benhoyt's scandir package does not fare any better on Python 2; see benhoyt/scandir#86
|
@willmcgugan I could take a crack at this as this is a blocker for me. Now, how to best deal with this? |
@pombredanne Would be happy to accept a PR. This is something I intend to look at, but I couldn't say when I'll have the time. Paths have to be unicode in the PyFilesystem API, so the fix would have to be at the boundaries. I'd be interested to know if the scandir code is similarly affected. Feel free to email me if you have any questions.
@willmcgugan scandir code is affected the same way on Python 2. Now the fix is rather involved, as essentially
shrikes: the problem is the "boundaries" are large. For instance, should |
So this eventually means touching almost everything in osfs and fixing a large number of tests.
Maybe a |
Signed-off-by: Philippe Ombredanne <[email protected]>
The approach is that unicode is used everywhere, except on *nix when real access to files is needed. In that case the path is encoded to bytes using the filesystem encoding. Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
@ReimarBauer do you mind testing if the code in #121 from this branch works for you? |
re
The reality is that even on Python 3, you cannot reliably use anything that comes back as unicode from the os/os.path modules on *nix: you need to fsencode it, otherwise it will fail on the cases highlighted here. So shielding users from path semantics with Unicode cannot work as a general rule.
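A small sketch of why such names stay awkward even on Python 3: a str that carries escaped bytes cannot be encoded strictly (for example when writing it to a UTF-8 log), yet it still converts back to the original bytes for file access. The name is a hypothetical example and UTF-8 is assumed as the filesystem encoding.

```python
# Hypothetical name as Python 3 would return it from os.listdir on *nix,
# assuming a UTF-8 filesystem encoding and an on-disk byte 0xF1.
name = "caf\udcf1"

# Strict encoding (e.g. for output to a UTF-8 stream) fails on the lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# For real file access, encode with surrogateescape to get the on-disk bytes back.
on_disk = name.encode("utf-8", "surrogateescape")

# For display only, replace whatever cannot be encoded.
printable = name.encode("utf-8", "replace").decode("utf-8")
```

The display form is lossy, so it is only suitable for messages, never for reopening the file.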
Signed-off-by: Philippe Ombredanne <[email protected]>
Instead I added doc to explain how fsencode can be used if needed. Signed-off-by: Philippe Ombredanne <[email protected]>
* Avoid code duplication with a new _get_validated_syspath() method
* Remove as_bytes arg from getsyspath PyFilesystem#120
Signed-off-by: Philippe Ombredanne <[email protected]>
* This was mistakenly left over
* Remove as_bytes arg from getsyspath PyFilesystem#120
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Following @willmcgugan in PyFilesystem#121 this is:
- removing and/or shortcuts and
- does not override path arg variables
Signed-off-by: Philippe Ombredanne <[email protected]>
* I had somehow introduced a regression with the previous commit
Signed-off-by: Philippe Ombredanne <[email protected]>
Just got back to this and tried to test it :)
The branch above and, I think, also 2.0.18 currently show the same behaviour for my problem. I guess there are other, related issues too.
this makes a list with the content of e.g. The further processing of this list then causes problems, e.g.
My current workaround for this (with this fork and also 2.0.18) is:
This means listdir returns something that isdir cannot handle.
I got a hint from @appleonkel at PythonCamp to use:
    from backports.os import fsdecode
    name = fsdecode(name)
@ReimarBauer this is already something that I integrated in my WIP branch 810ee9b#diff-97766fdc3eaf0f62e76fe6d51fff1be2R8. FWIW, there is a bit more to it than just handling this in scandir (or os.listdir).
@pombredanne Great! Looking forward :) |
I'm seeing this too. I'm considering adding pyfilesystem to http://stromberg.dnsalias.org/~strombrg/backshift/ (a filesystem backup tool), but this bug blocks that. The error I'm getting in a rudimentary REPL test:
...many files listed correctly, but then: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 2: invalid continuation byte |
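The same class of error can be reproduced without any filesystem at all. The file name below is only a guess at the kind of name that triggers it (a Latin-1 encoded name read as UTF-8), not the actual file from this report.

```python
# Hypothetical file name, encoded as Latin-1 on disk: 0xF1 is "ñ" in Latin-1.
raw = b"ma\xf1ana.txt"

# Decoding it as UTF-8 fails: 0xF1 starts a multi-byte sequence,
# but the following byte is a plain ASCII "a".
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

The exception records where decoding stopped (error.start) and why (error.reason), which helps to locate the offending name.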
Work in progress to fix this #120 |
@ReimarBauer @dstromberg @pombredanne There is a work-in-progress effort to address this issue. Please give 2.0.22a0 a try, and let me know if that fixes it.
thx @willmcgugan |
Hi folks.
I just tried walking with osfs:// using 2.0.20, and got no errors. ssh://
gives an error with both 2.0.20 and 2.0.20a0.
BTW, I'm also getting an error on a symlink that causes itself to be
retraversed. IOW: ./c/d/2 -> .. It seems to be trying to traverse
forever. This happens with osfs - I haven't tried it with ssh yet.
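One common way to keep a walk from looping on a symlink like that is to remember which directories were already visited by device and inode. This is a rough sketch using only the standard library on a POSIX system, not how pyfilesystem2 handles it; walk_files_once is a hypothetical helper name.

```python
import os


def walk_files_once(root):
    """Yield file paths under root, visiting each real directory only once."""
    seen = set()
    stack = [root]
    while stack:
        directory = stack.pop()
        # os.stat follows symlinks, so a link back to a parent directory
        # has the same (device, inode) pair as that parent.
        info = os.stat(directory)
        key = (info.st_dev, info.st_ino)
        if key in seen:
            continue
        seen.add(key)
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=True):
                    stack.append(entry.path)
                else:
                    yield entry.path
```

With a layout such as ./c/d/2 -> .. the link is queued once, recognised as the already visited ./c, and skipped.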
I'm trying to use pyfilesystem2 to walk a directory hierarchy I created for
testing backshift: http://stromberg.dnsalias.org/~strombrg/backshift/
The code I'm testing pyfilesystem2 with looks like:
    #!./bin/python3

    """List a couple of test directories, to see if pyfilesystem2 can deal with
    non-unicode filenames and self-referential symlinks."""

    import fs
    import fs.sshfs


    def list_files(filesys):
        """List files in filesys."""
        for path in filesys.walk.files():
            print(type(path), path)


    def main():
        """List a test directory."""
        filesys = fs.open_fs('ssh://localhost/home/dstromberg/src/home-svn/backshift/trunk/tests/50-encoding-2.6-3.1')
        list_files(filesys)
        print()
        filesys = fs.open_fs('ssh://localhost/home/dstromberg/src/home-svn/backshift/trunk/tests/57-symlinks')
        list_files(filesys)


    main()
Thanks!
On Fri, May 4, 2018 at 7:41 AM, Will McGugan wrote:
> @ReimarBauer @dstromberg @pombredanne There is a work in progress effort to
> address this issue. Please give 2.0.22a0 a try, and let me know if that
> fixes it.
--
Dan Stromberg
|
@dstromberg : |
On my system, for whatever reason, I have a file with a wrong encoding in the / dir.
Whenever I want to run
for item in sorted(self.fs.listdir(_sel_dir)):
I have to wrap it in an exception handler for UnicodeDecodeError. I would prefer that it not crash but just ignore this file.
(I am still looking into why that file is there at all.)
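The "ignore this file" behaviour could look roughly like the sketch below. It works on the raw byte names from os.listdir rather than through the PyFilesystem API; both helper names are hypothetical, and UTF-8 is an assumed default encoding.

```python
import os


def decodable_names(raw_names, encoding="utf-8"):
    """Return the names that decode cleanly, silently skipping the rest."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Undecodable name: ignore the entry instead of crashing.
            continue
    return names


def listdir_skipping_undecodable(path="."):
    """List a directory by byte names and keep only the decodable ones."""
    # Passing bytes to os.listdir makes it return bytes names.
    return sorted(decodable_names(os.listdir(os.fsencode(path))))
```

Skipping hides the file from every later step, so this only suits callers that really can do without it.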