Python not able to open file with non-English characters in path
Provide the filename as a unicode string to the open call.
How do you produce the filename?
if provided as a constant by you
Add a line near the beginning of your script:
# -*- coding: utf8 -*-
Then, in a UTF-8 capable editor, set path to the unicode filename:
path = u"D:/bar/クレイジー・ヒッツ!/foo.abc"
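As a rough sketch of this first case: assuming the script really is saved as UTF-8 and the file system can store the name, a unicode path can be passed straight to the open call. The helper name write_then_read is made up for illustration, and a temporary directory stands in for D:/bar.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile


def write_then_read(base_dir, text):
    """Create a file under a non-ASCII folder name and read it back."""
    # Every non-ASCII piece of the path is a unicode literal.
    folder = os.path.join(base_dir, u"クレイジー・ヒッツ!")
    os.mkdir(folder)
    path = os.path.join(folder, u"foo.abc")
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# On Python 2 mkdtemp() returns a byte string; joining it with unicode
# works as long as the temporary path itself is plain ASCII (assumption).
base = tempfile.mkdtemp()
try:
    content = write_then_read(base, u"hello")
finally:
    shutil.rmtree(base)
```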
read from a list of directory contents
Retrieve the contents of the directory using a unicode dirspec:
dir_files = os.listdir(u'.')
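To see why the type of the dirspec matters, here is a small sketch that lists the same directory twice, once with a unicode name and once with a byte-string name. The temporary directory and the file abc.txt are placeholders, and converting between the two forms with sys.getfilesystemencoding() is my assumption about how the temporary path is encoded.

```python
import os
import shutil
import sys
import tempfile

base = tempfile.mkdtemp()
try:
    # Create one file so the listing is not empty.
    open(os.path.join(base, "abc.txt"), "w").close()

    fs_encoding = sys.getfilesystemencoding()
    text_type = type(u"")
    # Build a unicode and a byte-string spelling of the same directory.
    text_dir = base if isinstance(base, text_type) else base.decode(fs_encoding)
    byte_dir = text_dir.encode(fs_encoding)

    unicode_names = os.listdir(text_dir)  # unicode in, unicode names out
    byte_names = os.listdir(byte_dir)     # bytes in, byte-string names out
finally:
    shutil.rmtree(base)
```

Only the unicode listing can represent names that fall outside the byte encoding, which is why the byte-string variant can come back with question marks on Windows.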
read from a text file
Open the filename-containing file using codecs.open to read unicode data from it. You need to specify the encoding of the file yourself, since you are the one who knows what the “default Windows charset” for non-Unicode applications is on your computer.
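A minimal sketch of reading filenames this way. The function name read_filenames is hypothetical, and cp932 is only an assumed example of a Windows charset; substitute whatever your list file actually uses.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def read_filenames(list_path, encoding):
    """Return the non-empty lines of a filename list as unicode strings."""
    names = []
    with codecs.open(list_path, "r", encoding=encoding) as handle:
        for line in handle:
            line = line.strip()  # drop the trailing line ending
            if line:
                names.append(line)
    return names


# Build a small list file in the assumed encoding, then read it back.
fd, list_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(u"D:/bar/クレイジー・ヒッツ!/foo.abc\r\n".encode("cp932"))
try:
    names = read_filenames(list_path, "cp932")
finally:
    os.remove(list_path)
```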
in any case
If path is still a byte string, do a:
path = path.decode("utf8")
before opening the file; substitute the correct encoding if not "utf8". A path that is already unicode needs no decoding.
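One way to make that step safe to call everywhere is a small guard that decodes byte strings and leaves unicode alone. This is only a sketch; to_unicode is a name I made up, and the default encoding is an assumption you should replace with the real one.

```python
# -*- coding: utf-8 -*-


def to_unicode(path, encoding="utf8"):
    """Decode byte-string paths; return unicode paths unchanged."""
    if isinstance(path, bytes):  # bytes is an alias of str on Python 2
        return path.decode(encoding)
    return path


raw = u"D:/bar/クレイジー・ヒッツ!/foo.abc".encode("utf8")
decoded = to_unicode(raw)
unchanged = to_unicode(decoded)  # already unicode, so nothing happens
```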
The path in your error is:
'\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81'
I think this is the UTF-8 encoded version of your filename.
I've created a folder of the same name on Windows 7 and placed a file called 'abc.txt' in it:
>>> import os
>>> a = '\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81'
>>> os.listdir('.')
['?????\xb7???!']
>>> os.listdir(u'.') # Pass unicode to have unicode returned to you
[u'\u30af\u30ec\u30a4\u30b8\u30fc\u30fb\u30d2\u30c3\u30c4\uff01']
>>>
>>> a.decode('utf8') # UTF8 decoding your string matches the listdir output
u'\u30af\u30ec\u30a4\u30b8\u30fc\u30fb\u30d2\u30c3\u30c4\uff01'
>>> os.listdir(a.decode('utf8'))
[u'abc.txt']
So it seems that Duncan's suggestion of path.decode('utf8') does the trick.
Update
I can't test this for you, but I suggest that you try checking whether the path contains non-ASCII characters before doing the .decode('utf8'). This is a bit hacky...
import urllib

# Identity table for bytes 32-125; every other byte maps to '_'
ASCII_TRANS = '_'*32 + ''.join([chr(x) for x in range(32, 126)]) + '_'*130

path = path.strip()
path = path[17:]  # remove the file://localhost/ part
path = urllib.unquote(path)
if path.translate(ASCII_TRANS) != path:  # contains non-ascii
    path = path.decode('utf8')
path = urllib.url2pathname(path)
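If the translate table feels too hacky, here is an untested-on-Windows sketch of the same idea as a function. It swaps the table check for a single UTF-8 decode, which is my substitution: pure ASCII bytes decode unchanged under UTF-8, so one decode covers both cases. The function name, the example URL layout, and the import fallback for Python 3 are my assumptions, and I left out url2pathname because its result depends on the platform.

```python
try:  # Python 2
    from urllib import unquote
except ImportError:  # Python 3: same job, but always returns bytes
    from urllib.parse import unquote_to_bytes as unquote

PREFIX = "file://localhost/"


def file_url_to_path(url):
    """Turn a percent-encoded file URL into a unicode path."""
    url = url.strip()
    if url.startswith(PREFIX):
        url = url[len(PREFIX):]
    raw = unquote(url)  # percent escapes become raw bytes
    return raw.decode("utf8")


# Hypothetical URL built from the folder name in the error message.
example = ("file://localhost/D:/bar/"
           "%E3%82%AF%E3%83%AC%E3%82%A4%E3%82%B8%E3%83%BC"
           "%E3%83%BB%E3%83%92%E3%83%83%E3%83%84%EF%BC%81/abc.txt\n")
result = file_url_to_path(example)
```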