How to remove non-printable/invisible characters in Ruby?

Codepoint 65279 (U+FEFF) is a zero-width no-break space. It is commonly used as a byte-order mark (BOM).

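You can remove it from a string with gsub (not gsub!, which returns nil when nothing was replaced, so the assignment would leave you with nil for BOM-free input):

my_new_string = my_old_string.gsub("\xEF\xBB\xBF".force_encoding("UTF-8"), '')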

A fast way to check whether you have any invisible characters is to check the length of the string. If it's higher than the number of characters you can see in IRB, you do.
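As a rough sketch of that length check, the snippet below lists the codepoints of a string so hidden characters show up, and strips a leading BOM without mutating the input. The helper names `strip_bom` and `codepoint_report` are made up for illustration, and the input is assumed to be a UTF-8 string.

```ruby
# Hypothetical helper: returns a copy of str without a leading BOM (U+FEFF).
# Assumes str is UTF-8; a string without a BOM comes back unchanged.
def strip_bom(str)
  str.sub(/\A\uFEFF/, "")
end

# Hypothetical helper: lists every codepoint so invisible characters stand out.
def codepoint_report(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

sample = "\uFEFFabc"          # made-up sample: BOM followed by "abc"
puts sample.length            # 4, although only "abc" is visible
p codepoint_report(sample)    # ["U+FEFF", "U+0061", "U+0062", "U+0063"]
puts strip_bom(sample).length # 3
```

Using sub with the \A anchor only touches a BOM at the very start of the string; gsub without the anchor would remove U+FEFF wherever it appears.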


Try this:

>>"aaa\f\d\x00abcd".gsub(/[^[:print:]]/,'.')
=>"aaa.d.abcd"
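If the goal is to target invisible Unicode characters specifically, such as the zero-width space or the BOM, one option is to match by Unicode general category instead. This is only a sketch: `strip_invisible` is a hypothetical name, the input is assumed to be UTF-8, and the pattern also removes tabs and newlines, since those are control characters.

```ruby
# Hypothetical helper: drops control (Cc) and format (Cf) characters.
# Caution: Cc includes "\t", "\n" and "\r", so those are removed as well.
def strip_invisible(str)
  str.gsub(/[\p{Cc}\p{Cf}]/, "")
end

# Made-up sample: zero-width space, BOM and a NUL byte mixed into "abcd".
dirty = "a\u200Bb\uFEFFc\x00d"
puts dirty.length                  # 7
puts strip_invisible(dirty)        # abcd
puts strip_invisible(dirty).length # 4
```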

Ruby can help you convert from one multi-byte character set to another. Search for examples, and read up on String#encode.

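Also, Ruby's Iconv is your friend. Note that it was deprecated in Ruby 1.9.3 and removed from the standard library in 2.0; on newer versions it is available as the iconv gem, and String#encode covers most of the same ground.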

Finally, James Edward Gray II wrote a series of articles that covers this in good detail.

One of the things you can do with those tools is tell them to transcode a character that has no equivalent in the target encoding to a visually similar one, or to drop such characters completely.
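Both options can be sketched with String#encode. The method names and the LOOKALIKES table below are invented for illustration, the input is assumed to be valid UTF-8, and the table only covers two characters; a real mapping would need to be much larger.

```ruby
# Hypothetical lookup table of visually similar ASCII stand-ins.
LOOKALIKES = { "\u00E9" => "e", "\u00FC" => "u" }.freeze

# Transcode to ASCII, swapping in a lookalike where one is known and "?" otherwise.
# The :fallback option is called for each character the target encoding lacks.
def to_ascii_lookalike(str)
  str.encode("US-ASCII", fallback: ->(ch) { LOOKALIKES.fetch(ch, "?") })
end

# Transcode to ASCII, silently dropping anything that cannot be represented.
def to_ascii_drop(str)
  str.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

puts to_ascii_lookalike("caf\u00E9") # cafe
puts to_ascii_drop("caf\u00E9")      # caf
```

Dropping characters is the simpler route but loses information, so the lookalike approach may be preferable when the output is meant for people to read.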

Dealing with alternate character sets is one of the most... irritating things I've ever had to do, because files can contain anything, but be marked as text. You might not expect it and then your code dies or starts throwing errors, because people are so ingenious when coming up with ways to insert alternate characters into content.