How to search and replace dual characters with single Unicode characters in a garbled file?
It looks like your text was encoded in UTF-8 (that is good, as it is the standard on Unix), but then something read it as ISO 8859-1 (Latin-1) or Microsoft's Windows-1252 and wrote out its interpretation as UTF-8. You need to reverse this.
e.g.
echo "passer de trÃ¨s bonnes fÃªtes de fin d'annÃ©e" | iconv --to-code=ISO-8859-1
This takes the garbled text and converts it back to valid UTF-8. iconv reads its input in your locale's encoding by default, so if your system is configured for UTF-8, the output will display correctly.
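The same idea applies to a whole file. The following is only a sketch: the file names garbled.txt and fixed.txt are hypothetical, and the sample file is generated here just so the example is self-contained. It assumes the damage was a single UTF-8 to Latin-1 misreading.

```shell
# Work in a scratch directory; the file names below are made up for this example.
tmpdir=$(mktemp -d)
broken="$tmpdir/garbled.txt"
fixed="$tmpdir/fixed.txt"

# Simulate the damage: treat UTF-8 bytes as ISO-8859-1 and re-encode them as UTF-8.
printf '%s\n' "passer de très bonnes fêtes de fin d'année" \
  | iconv -f ISO-8859-1 -t UTF-8 > "$broken"

# Repair: read the garbled file as UTF-8 and write it out as ISO-8859-1.
# The resulting bytes are the original UTF-8 text.
iconv -f UTF-8 -t ISO-8859-1 "$broken" > "$fixed"

cat "$fixed"
```

Giving -f UTF-8 explicitly avoids depending on the current locale. If iconv stops with a "cannot convert" error, the file may contain characters outside ISO 8859-1; that can happen when the misreading was Windows-1252, in which case -t WINDOWS-1252 may be worth trying.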
Explanation:
If we run
echo è | od -t x1
and
echo ê | od -t x1
we see that the bytes, in hex, are c3 a8 0a and c3 aa 0a. If we then look these codes up at http://www.ascii-code.com/ (these are ISO 8859-1 codes, not ASCII), we see that they give Ã¨ and Ãª, each followed by an invisible character (0a, the newline that echo appends). So now we know what went wrong: something read UTF-8, but interpreted it as ISO 8859-1. We therefore need to reverse it: read the text in whatever encoding it is now in, and convert it to ISO 8859-1 (the reverse of what got us here). The result is valid UTF-8.
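The round trip can be checked byte by byte. This is a small sketch, assuming the script itself is saved as UTF-8; the variable names garbled and repaired are only illustrative.

```shell
# The UTF-8 encoding of è is the two bytes c3 a8.
printf '%s' 'è' | od -An -t x1

# Simulate the misreading: interpret those bytes as ISO-8859-1 (Ã and ¨)
# and re-encode them as UTF-8, which yields the four bytes c3 83 c2 a8.
garbled=$(printf '%s' 'è' | iconv -f ISO-8859-1 -t UTF-8)
printf '%s' "$garbled" | od -An -t x1

# Reverse it: read as UTF-8, write as ISO-8859-1. We are back to c3 a8.
repaired=$(printf '%s' "$garbled" | iconv -f UTF-8 -t ISO-8859-1)
printf '%s' "$repaired" | od -An -t x1
```

printf is used instead of echo so that no trailing newline (0a) appears in the dumps.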