How to recode to UTF-8 conditionally?

This question is quite old, but I think I can contribute something.
First, create a script named recodeifneeded:

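#!/bin/bash
# Usage: recodeifneeded TARGET_ENCODING FILE
# Find the current encoding of the file (-b keeps the file name out of the output)
encoding=$(file -bi "$2" | sed 's/.*charset=\(.*\)$/\1/')

# file reports the charset in lowercase (e.g. utf-8), so pass the target in lowercase too
if [ "$1" != "${encoding}" ]
then
    # Encodings differ, we have to recode
    echo "recoding from ${encoding} to $1 file : $2"
    recode "${encoding}..$1" "$2"
fi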

You can use it this way:

recodeifneeded utf-8 file.txt

So, if you'd like to run it recursively and change the encoding of all *.txt files to (let's say) utf-8:

find . -name "*.txt" -exec recodeifneeded utf-8 {} \;

I hope this helps.

This script, adapted from harrymc's idea, which recodes one file conditionally (based on existence of certain UTF-8 encoded Scandinavian characters), seems to work for me tolerably well.

$ cat recode-to-utf8.sh 

#!/bin/sh
# Recodes specified file to UTF-8, except if it seems to be UTF-8 already

# Count the lines that contain UTF-8 encoded Scandinavian characters
result=$(grep -c '[åäöÅÄÖ]' "$1")
if [ "$result" -eq 0 ]
then
    echo "Recoding $1 from ISO-8859-1 to UTF-8"
    recode ISO-8859-1..UTF-8 "$1" # overwrites file
else
    echo "$1 was already UTF-8 (probably); skipping it"
fi

(Batch processing files is of course a simple matter of e.g. for f in *txt; do recode-to-utf8.sh "$f"; done.)

NB: this totally depends on the script file itself being saved as UTF-8, because the characters in the grep pattern are matched as they are stored in the script. And as this is obviously a very limited solution suited to the kind of files I happen to have, feel free to add better answers which solve the problem in a more generic way.

UTF-8 has strict rules about which byte sequences are valid. This means that if data decodes as valid UTF-8, you'll rarely be wrong if you assume that it is UTF-8.

So you can do something like this (in Python):

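def convert_to_utf8(data):
    # data is a byte string (bytes in Python 3, str in Python 2)
    try:
        data.decode('UTF-8')
        return data  # was already UTF-8
    except UnicodeError:
        # Not valid UTF-8: assume ISO-8859-1 and re-encode
        return data.decode('ISO-8859-1').encode('UTF-8')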

In a shell script, you can use iconv to perform the conversion, but you'll need a means of detecting UTF-8. One way is to use iconv with UTF-8 as both the source and destination encodings. If the file was valid UTF-8, the output will be the same as the input.
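
As a rough sketch of that idea (not a tested, general-purpose tool): the function below round-trips the file through iconv with UTF-8 on both sides and compares the result with the original using cmp. The function name to_utf8_if_needed is made up for this example, and falling back to ISO-8859-1 is an assumption carried over from the Python snippet above; adjust it to whatever legacy encoding your files actually use.

```shell
#!/bin/sh
# Hypothetical helper: convert FILE to UTF-8 unless it already is valid UTF-8.
# Assumption: anything that is not valid UTF-8 is ISO-8859-1.
to_utf8_if_needed() {
    file=$1

    # Round-trip through iconv as UTF-8 -> UTF-8. If the output is
    # byte-for-byte identical to the input, the file is valid UTF-8.
    if iconv -f UTF-8 -t UTF-8 "$file" 2>/dev/null | cmp -s - "$file"; then
        echo "$file: already valid UTF-8, skipping"
        return 0
    fi

    # Convert into a temporary file first so a failed conversion
    # never leaves a half-written original behind.
    tmp=$(mktemp) || return 1
    if iconv -f ISO-8859-1 -t UTF-8 "$file" >"$tmp"; then
        # Write through the existing path to keep the file's permissions.
        cat "$tmp" >"$file"
        rm -f "$tmp"
        echo "$file: converted from ISO-8859-1 to UTF-8"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Called as to_utf8_if_needed file.txt, or in a loop such as for f in *.txt; do to_utf8_if_needed "$f"; done. Note that pure ASCII files pass the check unchanged, since ASCII is a subset of UTF-8.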
