Removing non-printable "gremlin" chars from text files

Replace anything that isn't a desirable character with a blank (delete it):

import re
clean = re.sub(r'[^\s!-~]', '', dirty)

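This keeps all whitespace (spaces, newlines, tabs, etc.) and all "normal" characters: ! (decimal 33) is the first visible ASCII character after the space, and ~ (decimal 126) is the last printable ASCII character below decimal 128. Note that in Python 3, \s in a str pattern also matches non-ASCII whitespace such as the non-breaking space; pass flags=re.ASCII to restrict it to [ \t\n\r\f\v].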
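As a rough illustration of how the pattern behaves, here is a small sketch. The sample string is made up for the example, and the second call assumes you want ASCII-only whitespace, which the original answer does not require:

```python
import re

# Hypothetical sample: an accented letter, BEL, NUL, a tab,
# a non-breaking space (U+00A0) and a trailing newline
dirty = "caf\u00e9\x07 ok\x00\tline\u00a0end\n"

# Default str behaviour: \s is Unicode-aware, so U+00A0 is kept
unicode_ws = re.sub(r'[^\s!-~]', '', dirty)

# With re.ASCII, \s only matches [ \t\n\r\f\v], so U+00A0 is removed too
ascii_ws = re.sub(r'[^\s!-~]', '', dirty, flags=re.ASCII)

print(repr(unicode_ws))
print(repr(ascii_ws))
```

Which variant is preferable probably depends on whether the file is expected to contain legitimate non-ASCII spacing.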


An alternative you might be interested in would be:

import string
clean = lambda dirty: ''.join(filter(string.printable.__contains__, dirty))

It keeps only the characters found in string.printable (ASCII digits, letters, punctuation, and whitespace) and drops everything else from the dirty string it receives.

>>> len(clean(map(chr, range(0x110000))))
100
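One thing worth checking before relying on this: string.printable counts vertical tab and form feed as whitespace, so those two control characters are retained. A minimal sketch with an invented sample string (keep_printable is a hypothetical name, written as a def rather than a lambda purely for readability):

```python
import string

def keep_printable(dirty):
    # Keep only characters that appear in string.printable
    return ''.join(ch for ch in dirty if ch in string.printable)

# Hypothetical sample: tab, vertical tab, form feed, BEL, NUL, accented letter
sample = "tab\there\x0b\x0c bell\x07 nul\x00 caf\u00e9\n"
result = keep_printable(sample)

print(repr(result))
```

If vertical tab and form feed count as gremlins in your files, they would need to be removed separately.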

Try this:

clean = re.sub('[\0\200-\377]', '', dirty)

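The idea is to match each NUL or "high ASCII" character (i.e. \0 and the values 128-255, which do not fit in 7 bits) and remove it. On a Python 3 str this covers only code points up to U+00FF; characters above that range are left in place. You can add more characters as you find them, such as ASCII ESC (\033) or BEL (\007).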
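For instance, extending the class with BEL and ESC might look like the sketch below. The input string is invented for illustration:

```python
import re

# Hypothetical input: NUL, BEL, an ESC-based terminal sequence, a Latin-1 letter
dirty = "a\0b\x07c\x1b[0md\xe9e"

# Original class plus BEL (\007) and ESC (\033), written as octal escapes
clean = re.sub('[\0\007\033\200-\377]', '', dirty)

print(repr(clean))
```

Note that removing ESC alone leaves the rest of a terminal escape sequence (here "[0m") in the text, so a blacklist like this may need further entries depending on the data.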

Or this:

clean = re.sub('[^\040-\176]', '', dirty)

This permits only the limited range of "printable ASCII" (space through ~), but note that it also removes newlines. If you want to keep newlines, tabs, or the like, just add them inside the brackets.
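A minimal sketch of that adjustment, assuming you want to keep tabs and line feeds but not carriage returns (the sample text is made up):

```python
import re

# Hypothetical input with a tab, NUL, CRLF line ending and BEL
dirty = "col1\tcol2\x00\r\nrow\x07 two\n"

# Printable ASCII (space through ~) plus tab and newline
clean = re.sub('[^\040-\176\t\n]', '', dirty)

print(repr(clean))
```

Add \r to the brackets as well if Windows-style line endings should be preserved.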