How to create a string with a "bad encoding" in Ruby?
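Plenty of single bytes are invalid UTF-8 on their own: any byte from 0x80 up. So 128.chr.force_encoding('utf-8') should work. (128.chr by itself is tagged ASCII-8BIT, where every byte is valid, so it has to be retagged as UTF-8 first.)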
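A quick way to check this in irb, keeping in mind that on Ruby 1.9 and later Integer#chr returns an ASCII-8BIT string for 128..255 and so has to be retagged as UTF-8 before it counts as invalid. This is only a minimal sketch, and the variable names are illustrative:

```ruby
# Integer#chr gives a binary (ASCII-8BIT) string for 128..255,
# and any byte sequence is valid in that encoding.
raw = 128.chr
raw.encoding == Encoding::ASCII_8BIT  # => true
raw.valid_encoding?                   # => true

# Retagging the same byte as UTF-8 makes it invalid:
# 0x80 is a continuation byte with no lead byte before it.
bad = 128.chr.force_encoding('utf-8')
bad.valid_encoding?                   # => false

# In a UTF-8 source file, an escaped literal does the same thing.
literal = "\x80"
literal.valid_encoding?               # => false
```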
Your safe_str method currently never does anything to the string; it is a no-op. The docs for String#encode on Ruby 1.9.3 say:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
This is still true for the current release, 2.0.0 (patch level 247). However, a recent commit to Ruby trunk changes this and also introduces a scrub method that does pretty much what you want.
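For reference, here is a small sketch of how scrub behaves, assuming a Ruby where String#scrub is available (it shipped in 2.1). The sample string is made up for illustration:

```ruby
# A UTF-8 string containing one stray byte (0x80).
bad = "abc\x80def"
bad.valid_encoding?        # => false

# With no argument, invalid bytes become U+FFFD (the replacement character).
bad.scrub                  # => "abc\uFFFDdef"

# With a string argument, they are replaced by that string instead.
bad.scrub('')              # => "abcdef"

# With a block, you decide what each invalid byte run turns into.
bad.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }  # => "abc<80>def"

# scrub returns a new string; the receiver is left alone.
bad.valid_encoding?        # => false
```

Use scrub! if you want to clean the string in place.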
Until a new version of Ruby is released, you will need to round-trip your string through another encoding and back to clean it, as in the second example in this answer to the question you linked to. Something like:
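def safe_str(str)
  # Going UTF-8 -> UTF-16 forces a real conversion, so invalid bytes get dropped
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  # Then convert the cleaned text back to UTF-8
  s.encode!('utf-8', 'utf-16')
end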
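To see the round trip at work, here is a hedged usage sketch. It repeats the method so the snippet runs on its own, and the dirty sample string is invented for the example:

```ruby
def safe_str(str)
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end

# "café ok" with a stray 0xFF byte in the middle.
dirty = "caf\xC3\xA9 \xFFok"
dirty.valid_encoding?   # => false

clean = safe_str(dirty)
clean.valid_encoding?   # => true
clean.encoding          # UTF-8 again after the round trip

# The valid multibyte character survives; only the bad byte is removed.
clean.bytesize          # => 8
```

Because the first encode call is the non-destructive one, the original string is left unchanged.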
Note that your first example of an attempt to create an invalid string won’t work:
bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.valid_encoding? # => true
From the << docs:
If the object is a Integer, it is considered as a codepoint, and is converted to a character before concatenation.
So you’ll always get a valid string.
Your second method, using pack, will create a string with the encoding ASCII-8BIT. If you then change this using force_encoding, you can create a UTF-8 string with an invalid encoding:
bad_str = (100..1000).to_a.pack('c*').force_encoding('utf-8')
bad_str.valid_encoding? # => false
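The reason this works is that force_encoding only changes the encoding tag and leaves the bytes alone. A small sketch to confirm that (the intermediate variable names are illustrative):

```ruby
# pack('c*') emits one byte per integer and returns a binary string.
packed = (100..1000).to_a.pack('c*')
packed.encoding == Encoding::ASCII_8BIT   # => true
packed.valid_encoding?                    # => true, any bytes are valid as binary
packed.bytesize                           # => 901

# Retag a copy as UTF-8: same bytes, different interpretation.
utf8 = packed.dup.force_encoding('utf-8')
utf8.bytes == packed.bytes                # => true
utf8.valid_encoding?                      # => false
```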