Why does sed fail with international characters, and how can it be fixed?
sed is not very well set up for non-ASCII text. However, you can use (almost) the same code in perl and get the result you want:
perl -pe 's/.*\| //' x
I think the error occurs if the input encoding of the file is different from the preferred encoding of your environment.
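One way to check for such a mismatch is to compare what the locale expects with what the file actually contains. The following is a minimal sketch, assuming iconv is installed; the file names sample_utf8.txt and sample_latin1.txt and the helper is_utf8 are made up for illustration.

```shell
# Hypothetical sample files; printf octal escapes write the raw bytes.
printf 'M\303\266 | Y\n' > sample_utf8.txt    # "ö" as UTF-8 (0xC3 0xB6)
printf 'M\366 | Y\n'     > sample_latin1.txt  # "ö" as ISO-8859-1 (0xF6)

# Character encoding that the current locale expects.
locale charmap

# Hypothetical helper: iconv exits non-zero when the input is not
# valid in the declared source encoding (here UTF-8).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

if is_utf8 sample_utf8.txt; then utf8_ok=yes; else utf8_ok=no; fi
if is_utf8 sample_latin1.txt; then latin1_ok=yes; else latin1_ok=no; fi

echo "sample_utf8.txt decodes as UTF-8: $utf8_ok"
echo "sample_latin1.txt decodes as UTF-8: $latin1_ok"
```

If locale charmap reports UTF-8 but the file does not decode as UTF-8, the mismatch described above is the likely cause.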
Example: the input file "in" is UTF-8
$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Y
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y
UTF-8 can safely be interpreted as ISO-8859-1: since ISO-8859-1 is a single-byte encoding, decoding never fails. You'll get strange characters, but apart from that everything is fine.
Example: the input file "in" is ISO-8859-1
$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Gras Och Stenar Trad - From MöY
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y
ISO-8859-1 cannot be interpreted as UTF-8: decoding the input file fails, because a byte such as 0xF6 (ö in ISO-8859-1) is not a valid UTF-8 sequence. The strange match is probably due to the fact that sed tries to recover rather than fail completely.
The answer is based on Debian Lenny/Sid and sed 4.1.5.
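If the mismatch is the cause, two workarounds with sed itself come to mind. This is only a sketch, not tested on the versions mentioned above; it assumes the input really is ISO-8859-1 and that iconv is available, and the file name sample_latin1.txt is made up for illustration.

```shell
# Hypothetical sample: an ISO-8859-1 line; \366 is the single byte for "ö".
printf 'From M\366 | Y\n' > sample_latin1.txt

# Option 1: run sed in the C locale, so it works on plain bytes
# and never tries to decode multibyte characters.
out_c=$(LC_ALL=C sed 's/.*| //' < sample_latin1.txt)
echo "$out_c"

# Option 2: convert the input to UTF-8 first, then run sed on the result.
out_conv=$(iconv -f ISO-8859-1 -t UTF-8 sample_latin1.txt | sed 's/.*| //')
echo "$out_conv"
```

Option 1 leaves the bytes untouched, so the output stays in the file's original encoding. Option 2 produces UTF-8 output, which is usually what a UTF-8 terminal wants.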