Find any lines exceeding a certain length
In order of decreasing speed (on a GNU system in a UTF-8 locale and on ASCII input) according to my tests:
grep '.\{80\}' file
perl -nle 'print if length$_>79' file
awk 'length>79' file
sed -n '/.\{80\}/p' file
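As a quick smoke test of the four solutions, here is a sketch on made-up input (the file name sample.txt and the 100-character line are invented for illustration):

```shell
# Create a test file: one short line and one 100-character line.
{ printf '%s\n' short; printf 'x%.0s' $(seq 100); echo; } > sample.txt

# Each of the four solutions should report only the long line:
grep '.\{80\}' sample.txt
perl -nle 'print if length$_>79' sample.txt
awk 'length>79' sample.txt
sed -n '/.\{80\}/p' sample.txt
```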
Except for the perl¹ one (or for awk/grep/sed implementations, like mawk or busybox, that don't support multi-byte characters), those count the length in terms of number of characters (according to the LC_CTYPE setting of the locale) instead of bytes.
If there are bytes in the input that don't form part of valid characters (which happens sometimes when the locale's character set is UTF-8 but the input is in a different encoding), then depending on the solution and tool implementation, those bytes will either count as 1 character, count as 0, or fail to match the . regular expression operator.
For instance, a line that consists of 30 a's, a 0x80 byte, 30 b's, a 0x81 byte and 30 UTF-8 é's (encoded as 0xc3 0xa9) would, in a UTF-8 locale, not match .\{80\} with GNU grep/sed (as that standalone 0x80 byte doesn't match .), would have a length of 30+1+30+1+2*30 = 122 with perl or mawk, and 3*30 = 90 with gawk.
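That example line can be reproduced with printf (the file name mixed.txt is arbitrary). This sketch only checks the byte-level view, since the character counts above depend on the tool and locale:

```shell
# Build the example line: 30 a's, a lone 0x80 byte, 30 b's, a lone
# 0x81 byte, then 30 UTF-8 é's (0xc3 0xa9 each), and a newline.
{
  printf 'a%.0s' $(seq 30); printf '\200'
  printf 'b%.0s' $(seq 30); printf '\201'
  printf '\303\251%.0s' $(seq 30); printf '\n'
} > mixed.txt

# 30+1+30+1+2*30 bytes of content plus the newline = 123 bytes:
wc -c < mixed.txt

# In the C locale every byte is one character, so the line matches:
LC_ALL=C grep -c '.\{80\}' mixed.txt
```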
If you want to count in terms of bytes, fix the locale to C with LC_ALL=C grep/awk/sed...
That would have all 4 solutions consider that the line above contains 122 characters. Except with perl and the GNU tools, you'd still have potential issues for lines that contain NUL characters (0x0 bytes).
¹ the perl behaviour can be affected by the PERL_UNICODE environment variable, though.