When importing CSV data, how do you eliminate "invalid byte sequence in UTF-8"?

CSV.parse(File.read('/path/to/csv').scrub)

Ruby 1.9 and later can convert a string's encoding while detecting and replacing invalid bytes:

str = str.encode('UTF-8', :invalid => :replace)

For unusual strings, such as strings loaded from a file of unknown encoding, it's wise to use #encode instead of a regex, #gsub, or #delete. Those all need to parse the string, and a broken string can't be parsed, so those methods fail.
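
To make that concrete, here is a minimal sketch. The sample string is made up for illustration. The round trip through UTF-16LE is an assumption of this sketch rather than something stated above: it forces a real conversion, because on older Rubies an #encode call whose source and target encodings are the same may return the string unchanged.

```ruby
# Hypothetical input: valid UTF-8 text plus a stray 0xFF byte, which is never valid in UTF-8.
broken = "caf\xC3\xA9 \xFF menu".dup.force_encoding('UTF-8')

# Regex-based methods have to decode the string first, so they raise on broken input.
gsub_failed = begin
  broken.gsub(/menu/, 'list')
  false
rescue ArgumentError
  true
end

# Transcoding to another encoding and back lets :invalid => :replace do its work.
cleaned = broken.encode('UTF-16LE', :invalid => :replace, :replace => '?')
                .encode('UTF-8')

puts gsub_failed
puts cleaned
```

On Ruby 2.1 and later, String#scrub (as in the first snippet above) does the same replacement in one step.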

If you get a message like this:

error ** from ASCII-8BIT to UTF-8

Then you're probably trying to convert a binary string whose bytes are already UTF-8, and you can force the UTF-8 label instead:

str.force_encoding('UTF-8')

If you know the original bytes are not UTF-8, or if the output string has illegal characters, then read up on Ruby encoding transliterations.
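
A small sketch of the difference between converting and relabeling, assuming a made-up input string that holds UTF-8 bytes but is tagged as binary (ASCII-8BIT):

```ruby
# Hypothetical input: the UTF-8 bytes for "café", tagged as binary.
raw = "caf\xC3\xA9".b

# Converting fails, because a binary byte above 0x7F has no defined UTF-8 mapping.
convert_failed = begin
  raw.encode('UTF-8')
  false
rescue Encoding::UndefinedConversionError
  true
end

# Relabeling changes only the encoding tag; the bytes stay exactly the same.
text = raw.dup.force_encoding('UTF-8')

puts convert_failed
puts text
puts text.valid_encoding?
```

Since force_encoding never validates anything, checking valid_encoding? afterwards is a cheap way to find out whether the bytes really were UTF-8.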

I answered a similar question that deals with reading external files in 1.9.2 with non-UTF-8 encodings. I think that answer will help you a lot: Character Encoding issue in Rails v3/Ruby 1.9.2

Note that you need to know the source encoding to convert it to anything reliably. There are libraries, like the one I linked to in my other answer, that can help you determine this.

Also, if you aren't loading the data from a file, you can convert the encoding of a string in 1.9.2 quite easily:

'string'.encode('UTF-8')

However, it's rare that you're building a string in another encoding, and it's best to convert it at the time it's read into your environment if possible.
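
As a rough sketch of converting at read time, the file's encoding can be declared when reading it. The temporary file and its contents are invented for this example, and ISO-8859-1 is only an assumed source encoding:

```ruby
require 'tempfile'

# Hypothetical fixture: a file written in ISO-8859-1, where "é" is the single byte 0xE9.
file = Tempfile.new('latin1')
file.binmode
file.write("caf\xE9\n".b)
file.close

# "external:internal" names what the file contains and what the program wants back.
text = File.read(file.path, :encoding => 'ISO-8859-1:UTF-8')
file.unlink

puts text
puts text.encoding
```

Declaring the encoding at the boundary means the rest of the program only ever sees valid UTF-8 strings.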

Ruby 1.9's CSV has a new parser that works with m17n (multilingualization). The parser works in the Encoding of the IO or String object it reads from. The methods ::foreach, ::open, ::read, and ::readlines take an optional :encoding option, which lets you specify the Encoding.

For example:

CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')

This reads the file as Windows-1251 and converts all strings to UTF-8.

You can also use the more standard encoding name 'ISO-8859-1':

CSV.read('/..', {:headers => true, :col_sep => ';', :encoding => 'ISO-8859-1'})
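
Putting the options together, here is a self-contained sketch. The file name, column names, and rows are made up for illustration; only the CSV.read options come from the examples above.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical fixture: a semicolon-separated file stored in Windows-1251 (Cyrillic).
file = Tempfile.new(['people', '.csv'])
file.binmode
file.write("name;city\nИван;Москва\n".encode('Windows-1251'))
file.close

# Read as Windows-1251 and hand back UTF-8 strings, with the first line as headers.
rows = CSV.read(file.path, :headers => true, :col_sep => ';',
                           :encoding => 'windows-1251:utf-8')
first = rows.first
file.unlink

puts first['name']
puts first['city']
puts first['name'].encoding
```

The part after the colon is the encoding the strings are converted to, so every field arrives as UTF-8 and no later cleanup is needed.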
