How to recode to UTF-8 conditionally?
This question is quite old, but I think I can contribute to this problem:
First create a script named recodeifneeded:
#!/bin/bash
# Usage: recodeifneeded TARGET_ENCODING FILE
# Find the current encoding of the file
encoding=$(file -i "$2" | sed "s/.*charset=\(.*\)$/\1/")
if [ "$1" != "${encoding}" ]
then
    # Encodings differ, so we have to recode
    echo "recoding from ${encoding} to $1 file : $2"
    recode "${encoding}..$1" "$2"
fi
You can use it this way:
recodeifneeded utf-8 file.txt
So, if you would like to run it recursively and change the encoding of all *.txt files to (let's say) utf-8:
find . -name "*.txt" -exec recodeifneeded utf-8 {} \;
I hope this helps.
This script, adapted from harrymc's idea, which recodes one file conditionally (based on existence of certain UTF-8 encoded Scandinavian characters), seems to work for me tolerably well.
$ cat recode-to-utf8.sh
#!/bin/sh
# Recodes the specified file to UTF-8, unless it seems to be UTF-8 already
result=$(grep -c '[åäöÅÄÖ]' "$1")
if [ "$result" -eq 0 ]
then
    echo "Recoding $1 from ISO-8859-1 to UTF-8"
    recode ISO-8859-1..UTF-8 "$1" # overwrites file
else
    echo "$1 was already UTF-8 (probably); skipping it"
fi
(Batch processing files is of course a simple matter of e.g. for f in *txt; do recode-to-utf8.sh "$f"; done.)
NB: this totally depends on the script file itself being UTF-8. And as this is obviously a very limited solution suited to the kind of files I happen to have, feel free to add better answers which solve the problem in a more generic way.
UTF-8 has strict rules about which byte sequences are valid. This means that if data decodes as valid UTF-8, it almost certainly is UTF-8, so you'll rarely get false positives if you assume that it is.
So you can do something like this (in Python):
def convert_to_utf8(data):
    # data is a byte string (bytes in Python 3)
    try:
        data.decode('UTF-8')
        return data  # was already UTF-8
    except UnicodeError:
        # Not valid UTF-8, so assume it is ISO-8859-1
        return data.decode('ISO-8859-1').encode('UTF-8')
In a shell script, you can use iconv to perform the conversion, but you'll need a means of detecting UTF-8. One way is to use iconv with UTF-8 as both the source and destination encodings. If the file was valid UTF-8, the output will be the same as the input.
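As a rough sketch of that idea: rather than comparing the output with the input, the version below relies on iconv exiting with a non-zero status when it hits a byte sequence that is not valid in the source encoding. The function names is_utf8 and to_utf8_if_needed are made up for this example, and the fallback source encoding (ISO-8859-1) is an assumption carried over from the Python version above, so adjust it to whatever your files really use.

```shell
#!/bin/sh
# Sketch only: convert a file to UTF-8 unless it is already valid UTF-8.
# Assumption: anything that is not valid UTF-8 is ISO-8859-1.

# is_utf8 FILE (hypothetical helper): succeeds if iconv can read FILE
# as UTF-8 without reporting an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# to_utf8_if_needed FILE (hypothetical helper): rewrite FILE in place
# as UTF-8, but only when it is not valid UTF-8 already.
to_utf8_if_needed() {
    if is_utf8 "$1"; then
        echo "$1 is already valid UTF-8; skipping it"
    else
        echo "Recoding $1 from ISO-8859-1 to UTF-8"
        # Do not redirect iconv's output onto its own input file;
        # go through a temporary file instead.
        tmp=$(mktemp) || return 1
        if iconv -f ISO-8859-1 -t UTF-8 "$1" >"$tmp"; then
            cat "$tmp" >"$1"   # overwrite contents, keep permissions
            rm -f "$tmp"
        else
            rm -f "$tmp"
            return 1
        fi
    fi
}
```

With these functions defined, something like for f in *.txt; do to_utf8_if_needed "$f"; done would process a whole directory. Note that a pure ASCII file counts as valid UTF-8 here and is left untouched, which is usually what you want.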