How can I find the most frequent word in a .csv file, ignoring duplicates on each line?
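For illustration, the examples below behave as if file.csv contained something like this (a hypothetical sample, reconstructed to match the counts shown in the answers):
red,green,blue
red,red,yellow
blue,green,brown
Note the duplicated red on the second line: it should only be counted once for that line.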
With GNU grep or compatible:
$ grep -nEo '\w+' file.csv|sort -u|cut -d: -f2-|sort|uniq -c|sort -k1rn|head
2 blue
2 green
2 red
1 brown
1 yellow
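The trick here is that grep -n prefixes each extracted word with its line number, so sort -u can drop repeats of the same word on the same line before cut -d: -f2- strips the prefix again. On the hypothetical sample above, the first two stages (grep -nEo and sort -u) would produce something like:
1:blue
1:green
1:red
2:red
2:yellow
3:blue
3:brown
3:green
after which cut removes the line-number prefixes and sort | uniq -c | sort -k1rn counts and ranks the remaining words.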
I would probably reach for perl:
- Use uniq from the List::Util module to de-duplicate each row.
- Use a hash to count the resulting occurrences.
For example:
perl -MList::Util=uniq -F, -lnE '
map { $h{$_}++ } uniq @F
}{
foreach $k (sort { $h{$b} <=> $h{$a} } keys %h) {say "$h{$k}: $k"}
' file.csv
2: red
2: green
2: blue
1: yellow
1: brown
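The }{ here is the usual one-liner trick for getting an END block: it closes the implicit while (<>) { ... } loop that -n wraps around the code and opens a bare block that runs once after all input has been read. A roughly equivalent, more explicit form (same idea, just spelled out) would be:
perl -MList::Util=uniq -F, -lnE '
    map { $h{$_}++ } uniq @F;
    END { say "$h{$_}: $_" for sort { $h{$b} <=> $h{$a} } keys %h }
' file.csv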
If you have no option except the sort and uniq coreutils, then you can implement a similar algorithm with the addition of a shell loop:
while IFS=, read -a words; do
printf '%s\n' "${words[@]}" | sort -u
done < file.csv | sort | uniq -c | sort -rn
2 red
2 green
2 blue
1 yellow
1 brown
However, please refer to Why is using a shell loop to process text considered bad practice?
You can use awk with an associative array and a simple logic check.
awk -F, '
{split("", c); for (i=1; i<=NF; i++)
if (!c[$i]){c[$i]++;wds[$i]++}}
END{for (wd in wds) print wds[wd], wd}' file
Output
1 brown
2 red
1 yellow
2 blue
2 green
Walkthrough
Set the field separator to ,
awk -F, '
You are going to use c to count whether a word has already been seen on the current line, so make sure that per-line count is emptied at the beginning of each line with delete c or split("", c), and then iterate over the fields:
{split("", c); for (i=1; i<=NF; i++)
or
{delete c; for (i=1; i<=NF; i++)
If you have not yet seen the word $i on this line (!c[$i]), then increment the counter for that word with c[$i]++ (to 1, so the condition test fails if it occurs again on the same line) and also increment the overall count for that word with wds[$i]++:
if (!c[$i]){c[$i]++;wds[$i]++}}
When the file is done, just iterate over the wds array and print the count wds[wd] and the word wd:
END{for (wd in wds) print wds[wd], wd}' file
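Since for (wd in wds) visits the keys in no particular order, you could pipe the whole thing through sort -rn if you want the most frequent words first, for example:
awk -F, '{split("", c); for (i=1; i<=NF; i++) if (!c[$i]){c[$i]++;wds[$i]++}} END{for (wd in wds) print wds[wd], wd}' file | sort -rn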
Just for fun, a hacky one with no awk associative array bits:
awk -F, '{for (i=1; i<=NF; i++) print NR, $i}' file |
sort | uniq | awk '{print $2}'| sort | uniq -c | sort -nr
awk out the fields so they are preceded by their line numbers, then sort | uniq to lose the within-line dupes, awk again to lose the numbering, and then revert to your original code.