Does . really match any character?
@julio-guerra: I ran into a similar situation while trying to delete lines like the following (note the Ã character):
--MP_/yZa.b._zhqt9OhfqzaÃC
in a file, using
sed 's/^--MP_.*$//g' my_file
The file encoding reported by the Linux file command was:
file my_file: ISO-8859 text, with very long lines
file -b my_file: ISO-8859 text, with very long lines
file -bi my_file: text/plain; charset=iso-8859-1
I tried your solution (clever!), with various permutations; e.g.,
LANG=ISO-8859 sed 's/^--MP_.*$//g' my_file
but none of those worked. I found two workarounds:
- The following Perl expression worked, i.e. deleted that line:
perl -pe 's/^--MP_.*$//g' my_file
[For an explanation of the -pe command-line switches, refer to this StackOverflow answer: Perl flags -pe, -pi, -p, -w, -d, -i, -t?]
- Alternatively, after converting the file encoding to UTF-8, the sed expression worked (the Ã character remained, but was now UTF-8-encoded):
iconv -f iso-8859-1 -t utf-8 my_file > my_file.utf8
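The second workaround can be sketched end to end as a pipeline that converts and filters in one step. This is only an illustration, not the exact commands used above: the sample file contents and the variable names tmp and converted are made up for the demo, and it assumes an iconv that recognises iso-8859-1.

```shell
# Hypothetical sample: one line ending in a raw Latin-1 byte (0xC3 = Ã), one plain line.
tmp=$(mktemp)
printf '%s\303C\n%s\n' '--MP_/yZa.b._zhqt9Ohfqza' 'keep me' > "$tmp"

# Convert to UTF-8 first, then apply the same sed expression to the converted stream.
converted=$(iconv -f iso-8859-1 -t utf-8 "$tmp" | sed 's/^--MP_.*$//')
rm -f "$tmp"

# Prints an empty line (the blanked --MP_ line), then: keep me
printf '%s\n' "$converted"
```

Once the stream is valid UTF-8, sed no longer meets an undecodable byte in the middle of the line, so .* can run to the end of it.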
As I am working with thousands of emails in various encodings that undergo intermediate processing (bash-scripted conversions to UTF-8 do not always work), workaround 1 above will probably be the most robust for my purposes.
Notes:
- sed (GNU sed) 4.4
- perl v5.26.1 built for x86_64-linux-thread-multi
- Arch Linux x86_64 system
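A further option, which may or may not have been among the permutations tried above, is to run sed under the C locale so that every byte counts as one character. The sketch below assumes the failure comes from GNU sed decoding its input according to a UTF-8 locale, where a lone 0xC3 byte is an invalid sequence that . does not match. LC_ALL is used instead of LANG because a value already set in LC_ALL takes precedence over LANG. The sample data and the names tmp and result are hypothetical.

```shell
# Hypothetical sample: one line ending in a raw Latin-1 byte (0xC3 = Ã), one plain line.
tmp=$(mktemp)
printf '%s\303C\n%s\n' '--MP_/yZa.b._zhqt9Ohfqza' 'keep me' > "$tmp"

# In the C locale each byte is a single character, so .* also spans the 0xC3 byte.
result=$(LC_ALL=C sed 's/^--MP_.*$//' "$tmp")
rm -f "$tmp"

# Prints an empty line (the blanked --MP_ line), then: keep me
printf '%s\n' "$result"
```

Note that the substitution only empties the matching line. To drop the line altogether, the delete form sed '/^--MP_/d' would be the usual choice.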
It works for me. It's probably a character encoding problem.
This might help:
- Why does sed fail with International characters and how to fix?
- http://www.barregren.se/blog/how-use-sed-together-utf8