Does . really match any character?

@julio-guerra: I ran into a similar situation while trying to delete lines like the following (note the Ã character):

--MP_/yZa.b._zhqt9OhfqzaÃC

in a file, using

sed 's/^--MP_.*$//g' my_file

The file encoding indicated by the Linux file command was

    file my_file: ISO-8859 text, with very long lines
 file -b my_file: ISO-8859 text, with very long lines
file -bi my_file: text/plain; charset=iso-8859-1

I tried your solution (clever!), with various permutations; e.g.,

LANG=ISO-8859 sed 's/^--MP_.*$//g' my_file

but none of those worked. I found two workarounds:

First, the following Perl expression worked; it emptied that line (the substitution leaves a blank line behind):

perl -pe 's/^--MP_.*$//g' my_file

[For an explanation of the -pe command-line switches, see the StackOverflow question "Perl flags -pe, -pi, -p, -w, -d, -i, -t?"]

Second, after converting the file encoding to UTF-8, the original sed expression worked on the converted file (the Ã character remained, but was now UTF-8-encoded):

iconv -f iso-8859-1 -t utf-8 my_file > my_file.utf8
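As a sketch of what the conversion changes, the snippet below shows that the single ISO-8859-1 byte 0xC3 becomes the two-byte UTF-8 sequence c3 83, and that the conversion and the substitution can be chained in one pipeline without an intermediate file. It assumes the input really is ISO-8859-1; sample_file, utf8_hex and converted are illustrative names.

```shell
# Two-line sample; \303 is the single byte 0xC3, i.e. "Ã" in ISO-8859-1.
sample_file=$(mktemp)
printf '%s\303C\n%s\n' '--MP_/yZa.b._zhqt9Ohfqza' 'keep this line' > "$sample_file"

# Hex dump of the converted character: one Latin-1 byte becomes two UTF-8 bytes.
utf8_hex=$(printf '\303' | iconv -f iso-8859-1 -t utf-8 | od -An -tx1 | tr -d ' \n')

# Convert on the fly, then apply the original substitution to valid UTF-8 input.
converted=$(iconv -f iso-8859-1 -t utf-8 "$sample_file" | sed 's/^--MP_.*$//')

printf '%s\n' "$utf8_hex"
rm -f "$sample_file"
```

If the source encoding is guessed wrong, iconv may fail or produce different characters, which matches the remark that scripted conversions do not always work.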

As I am working with thousands of emails in various encodings, which undergo intermediate processing (bash-scripted conversions to UTF-8 do not always work), the first workaround above (Perl) will probably be the most robust solution for my purposes.
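For batches of mixed-encoding files, another option worth testing is a pattern that never has to consume the non-ASCII byte: the marker prefix is plain ASCII, so an address such as /^--MP_/ with the d command should not depend on how . treats the rest of the line. This is a sketch rather than something taken from the post; strip_marker_lines, the .eml file names and their contents are hypothetical.

```shell
# strip_marker_lines is a hypothetical helper name, not from the original post.
# It deletes every line that starts with the ASCII marker and prints the rest.
strip_marker_lines() {
    sed '/^--MP_/d' "$1"
}

# Two sample "emails": one with a Latin-1 byte (\303) on the marker line, one pure ASCII.
work_dir=$(mktemp -d)
printf '%s\303C\n%s\n' '--MP_/yZa.b._zhqt9Ohfqza' 'latin-1 body' > "$work_dir/a.eml"
printf '%s\n%s\n' '--MP_plainascii' 'ascii body' > "$work_dir/b.eml"

# Process every file in the directory in glob (alphabetical) order.
remaining=$(for f in "$work_dir"/*.eml; do strip_marker_lines "$f"; done)

printf '%s\n' "$remaining"
rm -rf "$work_dir"
```

Unlike the substitution, d removes the line completely, so no blank line is left behind.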

Notes:

sed (GNU sed) 4.4
perl v5.26.1 built for x86_64-linux-thread-multi
Arch Linux x86_64 system

It works for me. It's probably a character encoding problem.

This might help:

Why does sed fail with International characters and how to fix?
http://www.barregren.se/blog/how-use-sed-together-utf8
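One quick way to test the encoding hypothesis is to rerun the command in the C locale, where every byte counts as one character. Note that LC_ALL takes precedence over LC_CTYPE, which takes precedence over LANG, so setting only LANG has no effect when either of the other two is already set in the environment; locale names also usually look like en_US.ISO-8859-1 rather than ISO-8859 (running locale -a lists what is installed). Whether this explains the failed attempts described above is an assumption, not something the posts confirm. The sample file and the names sample_file and result are made up for illustration.

```shell
# Two-line sample; \303 is the single byte 0xC3, i.e. "Ã" in ISO-8859-1.
sample_file=$(mktemp)
printf '%s\303C\n%s\n' '--MP_/yZa.b._zhqt9Ohfqza' 'keep this line' > "$sample_file"

# LC_ALL overrides LC_CTYPE and LANG; in the C locale . matches any single byte.
result=$(LC_ALL=C sed 's/^--MP_.*$//' "$sample_file")

printf '%s\n' "$result"
rm -f "$sample_file"
```

If the line is emptied here but not under the default locale, the byte is most likely an invalid sequence in that locale's encoding.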

