How to remove bad path characters in Python?
I think the safest approach here is to replace any suspicious characters: anything that isn't alphanumeric, -, _, a space, or a period. Here's how you do that:
import re
re.sub(r'[^\w\-_\. ]', '_', filename)
The above replaces every character that's not a letter, digit, '_', '-', '.' or space with an '_'. (In Python 3, \w also matches non-ASCII letters and digits unless you pass re.ASCII.) So, if you're looking at an entire path, you'll want to add os.sep to the list of approved characters as well.
Here's some sample output:
In [27]: re.sub(r'[^\w\-_\. ]', '_', r'some\*-file._n\\ame')
Out[27]: 'some__-file._n__ame'
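For an entire path, the same idea can be sketched with the separator added to the allowed set. This is only a minimal sketch: the helper name `sanitize_path` and its `sep` parameter are hypothetical, and `re.escape` is used so that a backslash separator stays valid inside the character class.

```python
import os
import re

def sanitize_path(path, sep=os.sep):
    # Hypothetical helper: keep word characters, '-', '.', space and the
    # path separator; replace every other character with '_'.
    allowed = r'\w\-. ' + re.escape(sep)
    return re.sub('[^' + allowed + ']', '_', path)

# Passing sep explicitly makes the result independent of the current OS.
print(sanitize_path('some/dir/na*me?.txt', sep='/'))  # some/dir/na_me_.txt
```

Note that on Windows this would also replace the ':' in a drive letter such as 'C:', so a real path may need the drive split off first.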
Unfortunately, the set of acceptable characters varies by OS and by filesystem.
- Windows:
  - Use almost any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
    - The following reserved characters are not allowed: < > : " / \ | ? *
    - Characters whose integer representations are in the range from zero through 31 are not allowed.
    - Any other character that the target file system does not allow.
  - The list of accepted characters can vary depending on the OS and locale of the machine that first formatted the filesystem.
  - .NET has GetInvalidFileNameChars and GetInvalidPathChars, but I don't know how to call those from Python.
- Mac OS: NUL is always excluded, "/" is excluded from POSIX layer, ":" excluded from Apple APIs
  - HFS+: any sequence of non-excluded characters that is representable by UTF-16 in the Unicode 2.0 spec
  - HFS: any sequence of non-excluded characters representable in MacRoman (default) or other encodings, depending on the machine that created the filesystem
  - UFS: same as HFS+
- Linux:
  - native (UNIX-like) filesystems: any byte sequence excluding NUL and "/"
  - FAT, NTFS, other non-native filesystems: varies
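The Windows exclusions quoted above also cover the characters listed for Mac OS (NUL, "/", ":") and for native Linux filesystems (NUL, "/"), so filtering on them is one way to sketch a cross-platform blacklist. The helper name `strip_windows_invalid` is hypothetical, and this does not catch anything else a particular filesystem might reject.

```python
import re

# Reserved characters from the Windows list: < > : " / \ | ? *
# plus the control characters 0 through 31.
_WINDOWS_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

def strip_windows_invalid(name, replacement='_'):
    # Hypothetical helper: replaces only the characters listed above.
    return _WINDOWS_INVALID.sub(replacement, name)

print(strip_windows_invalid('report: draft?.txt'))  # report_ draft_.txt
```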
Your best bet is probably to either be overly conservative on all platforms, or to just try creating the file with that name and handle any errors.
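The "try it and handle errors" option could look something like the sketch below. The helper name `can_create` is hypothetical. It assumes you have a directory you are allowed to write to, and it treats an already existing file as a failure too.

```python
import os

def can_create(directory, name):
    # Hypothetical helper: try to create the file exclusively, then clean up.
    path = os.path.join(directory, name)
    try:
        with open(path, 'x'):
            pass
    except (OSError, ValueError):
        # OSError: rejected by the OS or filesystem, or the file already exists.
        # ValueError: raised by Python 3 for an embedded NUL character.
        return False
    os.remove(path)
    return True
```

The advantage over a fixed character list is that the answer comes from the actual filesystem the file will live on.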