Cannot use `cut -c` (`--characters`) with UTF-8?
You haven't said which `cut` you're using, but since you've mentioned the GNU long option `--characters`, I'll assume it's that one. In that case, note this passage from `info coreutils 'cut invocation'`:
‘-c character-list’ ‘--characters=character-list’
Select for printing only the characters in positions listed in character-list. **The same as `-b` for now**, but internationalization will change that.
(emphasis added)
For the moment, GNU `cut` always works in terms of single-byte "characters", so the behaviour you see is expected.
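As a small illustration of what byte-based selection means, here is a sketch that assumes the text is UTF-8 encoded, where each of these Greek letters takes two bytes. Both `wc -c` and `cut -b` count bytes regardless of locale:

```shell
# Assumes UTF-8 input: α, β and γ are two bytes each.
printf '%s' 'αβγ' | wc -c            # counts bytes, so it reports 6, not 3
printf '%s\n' 'αβγ' | cut -b 1-2     # the first two bytes together form α
```

A `cut` whose `-c` is really `-b` would need `-c 1-2` to print that single character, and `-c 1` would emit only half of it.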
Supporting both the `-b` and `-c` options is required by POSIX. They weren't added to GNU `cut` because it had multi-byte support and they worked properly, but to avoid giving errors on POSIX-compliant input. The same has been done in some other `cut` implementations, although not in FreeBSD's or OS X's, at least.
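Since implementations differ, it can help to check how the local `cut` behaves. The following is only a sketch: `cut_c_mode` is a hypothetical helper name, and it assumes the script and the current locale both use UTF-8 (a character-aware `cut` will still act on bytes in the C locale).

```shell
# Hypothetical helper: report whether this system's `cut -c` selects
# whole characters or single bytes. Assumes a UTF-8 locale.
cut_c_mode() {
    # α is two bytes in UTF-8; a byte-based `cut -c 1` returns only the first byte.
    if [ "$(printf '%s\n' 'αβγ' | cut -c 1)" = 'α' ]; then
        echo 'character-aware'
    else
        echo 'byte-based'
    fi
}

cut_c_mode
```

The result depends on the implementation and locale, so no particular output is claimed here.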
This is the historic behaviour of `-c`. `-b` was newly added to take over the byte role so that `-c` can work with multi-byte characters. Maybe in a few years it will work as desired consistently, although progress hasn't exactly been quick (it's been over a decade already). GNU `cut` doesn't even implement the `-n` option yet, even though it is orthogonal and intended to help the transition. Compatibility problems with old scripts may be a concern, although I don't know definitively what the reason is.
`colrm` (part of `util-linux`, which should already be installed on most distributions) seems to handle internationalization much better:
$ echo 'αβγ' | colrm 3
αβ
$ echo 'αβγ' | colrm 2
α
Beware of the numbering: `colrm N` removes columns from `N` onward, printing characters up to `N-1`.
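To hide the off-by-one numbering, the `colrm` calls can be wrapped in small functions. This is a minimal sketch: `first_chars` and `skip_chars` are hypothetical names, and multibyte handling still depends on `colrm` itself and on running in a UTF-8 locale.

```shell
# Hypothetical helper: keep the first N columns of each line,
# roughly like `cut -c 1-N`. colrm removes from column N+1 onward.
first_chars() {
    colrm "$(( $1 + 1 ))"
}

# Hypothetical helper: drop the first N columns (N >= 1) of each line,
# roughly like `cut -c (N+1)-`. colrm START STOP removes that inclusive range.
skip_chars() {
    colrm 1 "$1"
}

echo 'αβγ' | first_chars 2
echo 'αβγ' | skip_chars 1
```

With plain ASCII input, `echo abcdef | first_chars 2` prints `ab` and `echo abcdef | skip_chars 2` prints `cdef`.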
Since many `grep` implementations are multibyte-aware, you can also use `grep -o` to simulate some uses of `cut -c`.
First two characters:
$ echo Τηεοδ29 | grep -o '^..'
Τη
Last three characters:
$ echo Τηεοδ29 | grep -o '...$'
δ29
Second character:
$ echo Τηεοδ29 | grep -o '^..' | grep -o '.$'
η
Adjust the number of periods, or use interval syntax (`\{x,y\}`, or `{x,y}` with `grep -E`), to simulate `cut` ranges.
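The two-stage `grep -o` trick can be generalised to an arbitrary range. The sketch below uses a hypothetical helper, `char_range`, and assumes a `grep` that supports `-o` and `-E` (GNU and BSD `grep` both do) running in a UTF-8 locale. Unlike `cut`, it prints nothing for lines shorter than the end of the range.

```shell
# Hypothetical helper: print characters M through N of each line,
# roughly like `cut -c M-N`. Usage: char_range M N
char_range() {
    m=$1
    n=$2
    # Stage 1 keeps the first N characters; stage 2 keeps the last N-M+1 of those.
    grep -oE "^.{$n}" | grep -oE ".{$(( n - m + 1 ))}\$"
}

echo 'Τηεοδ29' | char_range 2 4
```

With ASCII input, `echo abcdefg | char_range 2 4` prints `bcd`.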