Is there an easy way to count the characters of the words in a file, from the terminal?
$ awk '{ print length }' file | sort -n | uniq -c | awk '{ printf("%d character words: %d\n", $2, $1) }'
2 character words: 3
5 character words: 1
7 character words: 1
The first awk filter just prints the length of each line in the file called file. I'm assuming that this file contains one word per line.
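For example, if file contained something like this (the actual words aren't shown in the question, so these are made up to match the counts above):
$ printf 'ab\ncd\nef\nhello\nexample\n' > file
$ awk '{ print length }' file
2
2
2
5
7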
The sort -n (sort the lines from the output of awk numerically in ascending order) and uniq -c (count the number of times each line occurs consecutively) will then create the following output from that for the given data:
3 2
1 5
1 7
This is then parsed by the second awk script, which interprets each line as "X number of lines having Y characters" and produces the wanted output.
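You can see the field mapping by feeding the second awk a single line of that intermediate output:
$ echo '3 2' | awk '{ printf("%d character words: %d\n", $2, $1) }'
2 character words: 3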
The alternative solution is to do it all in awk, keeping counts of the lengths in an array. Which solution is "best" is a tradeoff between efficiency and readability/ease of understanding (and therefore maintainability).
Alternative solution:
$ awk '{ len[length]++ } END { for (i in len) printf("%d character words: %d\n", i, len[i]) }' file
2 character words: 3
5 character words: 1
7 character words: 1
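Note that the traversal order of awk's for (i in len) is unspecified, so the lengths won't necessarily come out sorted; if that matters, pipe the result through sort -n:
$ awk '{ len[length]++ } END { for (i in len) printf("%d character words: %d\n", i, len[i]) }' file | sort -n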
Another way to do it all with awk alone:
$ awk '{words[length()]++} END{for(k in words)print k " character words - " words[k]}' ip.txt
2 character words - 3
5 character words - 1
7 character words - 1
words[length()]++
use the length of the input line as the key to save the count
END{for(k in words)print k " character words - " words[k]}
after all lines are processed, print the contents of the array in the desired format
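If you have GNU awk, you can also get the lengths in ascending order from within the script by setting the gawk-specific PROCINFO["sorted_in"] before the loop:
$ gawk '{words[length()]++} END{PROCINFO["sorted_in"]="@ind_num_asc"; for(k in words)print k " character words - " words[k]}' ip.txt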
Performance comparison; the numbers shown are the best of two runs:
$ wc words.txt
71813 71813 655873 words.txt
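The identical line and word counts (wc prints lines, words and bytes) confirm that this test file also has one word per line.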
$ perl -0777 -ne 'print $_ x 1000' words.txt > long_file.txt
$ du -h --apparent-size long_file.txt
626M long_file.txt
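The -0777 switch makes perl slurp the whole file into $_ in one go, and x 1000 repeats that string, so long_file.txt is just words.txt concatenated 1000 times.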
$ time awk '{words[length()]++} END{for(k in words)print k " character words - " words[k]}' long_file.txt > t1
real 0m20.632s
user 0m20.464s
sys 0m0.108s
$ time perl -lne '$h{length($_)}++ }{ for $n (sort keys %h) {print "$n character words - $h{$n}"}' long_file.txt > t2
real 0m19.749s
user 0m19.640s
sys 0m0.108s
$ time awk '{ print length }' long_file.txt | sort -n | uniq -c | awk '{ printf("%d character words - %d\n", $2, $1) }' > t3
real 1m23.294s
user 1m24.952s
sys 0m1.980s
$ diff -s <(sort t1) <(sort t2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -s <(sort t1) <(sort t3)
Files /dev/fd/63 and /dev/fd/62 are identical
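Sorting both sides before diff is needed because the three solutions emit the lengths in different orders: awk's for (k in words) order is unspecified, and perl's sort keys is a string sort, not a numeric one.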
If the file has only ASCII characters, LC_ALL=C helps, since awk can then treat every byte as a single character:
$ time LC_ALL=C awk '{words[length()]++} END{for(k in words)print k " character words - " words[k]}' long_file.txt > t1
real 0m15.651s
user 0m15.496s
sys 0m0.120s
Not sure why the time for perl didn't change much; probably the encoding has to be set some other way.
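One guess (an assumption, not verified here): perl does not decode its input by default, so its length() was already counting bytes and LC_ALL=C had nothing to speed up. To make perl actually decode UTF-8 you would reach for its -C switch, along the lines of:
$ perl -CSD -lne '$h{length($_)}++ }{ for $n (sort keys %h) {print "$n character words - $h{$n}"}' long_file.txt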
Here's a perl equivalent (with an optional sort):
$ perl -lne '
$h{length($_)}++ }{ for $n (sort keys %h) {print "$n character words - $h{$n}"}
' file
2 character words - 3
5 character words - 1
7 character words - 1
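The }{ is the so-called "Eskimo kiss" trick: -n wraps the code in while (<>) { ... }, so }{ closes that loop and leaves the remaining code to run once after all input has been read. A more explicit equivalent uses an END block:
$ perl -lne '$h{length($_)}++; END { for $n (sort keys %h) { print "$n character words - $h{$n}" } }' file
2 character words - 3
5 character words - 1
7 character words - 1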