Remove duplicate entries from a CSV file
The reason myfile.csv is not changing is that the -u option for uniq will only print unique lines. In this file, all lines are duplicated, so they are not printed out. More importantly, though, the output is not saved in myfile.csv, because uniq just prints it to stdout (by default, your console).
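For instance, assuming myfile.csv contained nothing but duplicated lines:
$ cat myfile.csv
a
a
b
b
$ uniq -u myfile.csv
Nothing is printed, since no line is unique, and myfile.csv itself is left untouched.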
You would need to do something like this:
$ sort -u myfile.csv -o myfile.csv
The options mean:
-u - keep only unique lines
-o - output to this file instead of stdout
You should view man sort for more information.
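One caution: redirecting the output instead would not work, because the shell truncates the target file before sort ever reads it. The -o option is safe for in-place use since GNU sort reads all of its input before opening the output file.
$ sort -u myfile.csv > myfile.csv   # wrong: the shell empties myfile.csv first
$ sort -u myfile.csv -o myfile.csv  # safe: sort writes the file itself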
As Belmin showed, sort is great. His answer is best for unsorted data, and it's easy to remember and use.
However, it has a drawback: it changes the order of the input. If you absolutely need the data to stay in its original order, with later duplicates removed, awk may be better.
$ cat myfile.csv
c
a
c
b
b
a
c
$ awk '{if (!($0 in x)) {print $0; x[$0]=1} }' myfile.csv
c
a
b
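To unpack the one-liner: x is an associative array keyed by the whole line, so a line is printed only the first time it is seen. The same idea is often written more tersely as:
$ awk '!seen[$0]++' myfile.csv
Here seen[$0]++ evaluates to 0 (false) the first time a line appears and to a positive count afterwards, so awk's default print action fires exactly once per distinct line.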
Weird edge case, but it does come up from time to time.
Also, if your data is already sorted when you are poking at it, you can just run uniq.
$ cat myfile.csv
a
a
a
b
b
c
c
c
c
c
$ uniq myfile.csv
a
b
c
The drawback to both of my suggestions is that you need to use a temporary file and copy that back in.
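For example (the temporary file name here is just an illustration):
$ awk '!seen[$0]++' myfile.csv > /tmp/myfile.tmp && mv /tmp/myfile.tmp myfile.csv
If you have moreutils installed, sponge soaks up all of its input before writing, which lets the pipeline read from and write to the same file:
$ awk '!seen[$0]++' myfile.csv | sponge myfile.csv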
If you want to maintain the order of the file (not sorted) but still remove duplicates, you can also do this:
awk '!v[$1]++' /tmp/file
For example, given this input:
d
d
a
a
b
b
c
c
c
c
c
It will output:
d
a
b
c
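One caveat with this version: $1 keys on the first whitespace-separated field only, which happens to match the whole line in the single-column example above. For a CSV with several columns, you probably want to compare whole lines instead:
$ awk '!v[$0]++' /tmp/file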