Remove duplicate entries from a CSV file
The reason myfile.csv is not changing is that the -u option for uniq will only print unique lines. In this file, all lines are duplicated, so they are not printed out. More importantly, though, the output is not saved in myfile.csv, because uniq just prints it to stdout (by default, your console).
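For instance, assuming myfile.csv contained nothing but duplicated lines:
$ cat myfile.csv
a
a
b
b
$ uniq -u myfile.csv
Nothing is printed, since no line is unique, and myfile.csv itself is left untouched.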
You would need to do something like this:
$ sort -u myfile.csv -o myfile.csv
The options mean:
-u - keep only unique lines
-o - output to this file instead of stdout
You should view man sort for more information.
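One caution: redirecting the output instead would not work, because the shell truncates the target file before sort ever reads it. The -o option is safe for in-place use since GNU sort reads all of its input before opening the output file.
$ sort -u myfile.csv > myfile.csv   # wrong: the shell empties myfile.csv first
$ sort -u myfile.csv -o myfile.csv  # safe: sort writes the file itself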
As Belmin showed, sort is great. His answer is best for unsorted data, and it's easy to remember and use.
However, it has a drawback: it changes the order of the input. If you absolutely need the data to stay in its original order, with later duplicates removed, awk may be better.
$ cat myfile.csv
c
a
c
b
b
a
c
$ awk '{if (!($0 in x)) {print $0; x[$0]=1} }' myfile.csv
c
a
b
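To unpack the one-liner: x is an associative array keyed by the whole line, so a line is printed only the first time it is seen. The same idea is often written more tersely as:
$ awk '!seen[$0]++' myfile.csv
Here seen[$0]++ evaluates to 0 (false) the first time a line appears and to a positive count afterwards, so awk's default print action fires exactly once per distinct line.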
Weird edge case, but it does come up from time to time.
Also, if your data is already sorted when you are poking at it, you can just run uniq.
$ cat myfile.csv
a
a
a
b
b
c
c
c
c
c
$ uniq myfile.csv
a
b
c
The drawback to both of my suggestions is that you need to use a temporary file and copy that back in.
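For example (the temporary file name here is just an illustration):
$ awk '!seen[$0]++' myfile.csv > /tmp/myfile.tmp && mv /tmp/myfile.tmp myfile.csv
If you have moreutils installed, sponge soaks up all of its input before writing, which lets the pipeline read from and write to the same file:
$ awk '!seen[$0]++' myfile.csv | sponge myfile.csv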
If you want to maintain the order of the file (not sorted) but still remove duplicates, you can also do this:
awk '!v[$1]++' /tmp/file
For example, given this input:
d
d
a
a
b
b
c
c
c
c
c
It will output:
d
a
b
c
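One caveat with this version: $1 keys on the first whitespace-separated field only, which happens to match the whole line in the single-column example above. For a CSV with several columns, you probably want to compare whole lines instead:
$ awk '!v[$0]++' /tmp/file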