Identify duplicate lines in a file without deleting them?
This is a classical problem that can be solved with the uniq command. uniq can detect consecutive duplicate lines and either print only the lines that are never repeated (-u, --unique) or print one copy of each duplicated line (-d, --repeated).
Since the ordering of duplicate lines is not important for you, you should sort the file first. Then use uniq to print unique lines only:
sort yourfile.txt | uniq -u
There is also a -c (--count) option that prefixes each line with the number of occurrences; combined with -d it shows how many times each duplicated line appears. See the manual page of uniq for details.
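As a quick illustration of the difference between these options (using a throw-away inline sample rather than your real file):
printf '%s\n' b a b c a b | sort | uniq -u    # prints: c (lines that occur exactly once)
printf '%s\n' b a b c a b | sort | uniq -d    # prints: a, b (one copy of each repeated line)
printf '%s\n' b a b c a b | sort | uniq -cd   # prints: 2 a, 3 b (repeated lines with their counts)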
If you really do not care about the parts after the first field, you can use the following command to find duplicate keys and print each line number for them (append another | sort -n to have the output sorted by line number):
cut -d ' ' -f1 .bash_history | nl | sort -k2 | uniq -s8 -D
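A close variant of the same idea, assuming GNU uniq: its -f (--skip-fields) option skips whole fields instead of a fixed number of characters, so you do not have to count the width of the numbers that nl prepends:
cut -d ' ' -f1 .bash_history | nl | sort -k2 | uniq -f1 -D | sort -n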
Since you want to see the duplicate lines themselves (using the first field as the key), you cannot use uniq on the file directly. The issue that makes automation difficult is that the title parts vary, and a program cannot automatically determine which title should be considered the final one.
Here is an AWK script (save it as script.awk) that takes your text file as input and prints all duplicate lines so you can decide which to delete. Run it with awk -f script.awk yourfile.txt. Note that it uses arrays of arrays (lines[$1][NR]), which require GNU awk 4.0 or later.
#!/usr/bin/awk -f
{
    # Store the line ($0) grouped per URL ($1) with line number (NR) as key
    lines[$1][NR] = $0;
}
END {
    for (url in lines) {
        # find lines that have the URL occur multiple times
        if (length(lines[url]) > 1) {
            for (lineno in lines[url]) {
                # Print duplicate line for decision purposes
                print lines[url][lineno];
                # Alternative: print line number and line
                #print lineno, lines[url][lineno];
            }
        }
    }
}
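If your awk does not support arrays of arrays (they are a GNU awk 4 extension), the same idea can be written in plain POSIX awk; a sketch that keeps the first field as the key and prints the line number with each duplicate:
awk '
{
    count[$1]++;     # how many times each key (first field) occurs
    line[NR] = $0;   # remember every line in input order
    key[NR]  = $1;   # and the key it belongs to
}
END {
    for (i = 1; i <= NR; i++)
        if (count[key[i]] > 1)
            print i ": " line[i];   # line number and the duplicate line
}' yourfile.txt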
If I understand your question, I think that you need something like:
for dup in $(sort -k1,1 -u file.txt | cut -d' ' -f1); do grep -n -- "$dup" file.txt; done
or:
for dup in $(cut -d " " -f1 file.txt | sort | uniq -d); do grep -n -- "$dup" file.txt; done
where file.txt is the file containing the data you are interested in. The first command prints every line of the file with its line number, grouped by the first field, so repeated keys are easy to spot; the second prints only the line numbers and lines whose first field occurs two or more times (the sort is needed because uniq -d only detects adjacent duplicates).
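A note on these loops: the $(...) form relies on word splitting, which works for the simple keys here but is fragile if a key contains glob characters such as ? or * (not unusual in URLs). A slightly more defensive sketch of the second command reads the keys line by line instead:
cut -d ' ' -f1 file.txt | sort | uniq -d |
while IFS= read -r dup; do
    grep -n -- "$dup" file.txt    # line number and line for each duplicated key
done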
If I read this correctly, all you need is something like
awk '{print $1}' file | sort | uniq -c |
while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done
That will print out the number of the line that contains the dupe and the line itself. For example, using this file:
foo bar baz
http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
bar foo baz
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
baz foo bar
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field
It will produce this output:
2:http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
4:http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
6:http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field
To print only the line numbers, you could do
awk '{print $1}' file | sort | uniq -c |
while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done | cut -d: -f 1
And to print only the line:
awk '{print $1}' file | sort | uniq -c |
while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done | cut -d: -f 2-
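On the example file above, the line-number variant prints just:
2
4
6
and the line-only variant prints the three URL lines without the leading numbers.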
Explanation:
The awk script just prints the first space-separated field of the file; use $N to print the Nth field. sort sorts it and uniq -c counts the occurrences of each line. This is then passed to the while loop, which saves the number of occurrences as $num and the line as $dupe; if $num is greater than one (so the line is duplicated at least once), it searches the file for that line, using -n to print the line number. The -- tells grep that what follows is not a command line option, which is useful when $dupe can start with -.
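The whole pipeline can also be condensed into a single two-pass awk command; a sketch, assuming the file can simply be read twice (the first pass counts the keys, the second prints the line number and line for every key that occurs more than once):
awk 'NR == FNR { count[$1]++; next }      # first pass: count each first field
     count[$1] > 1 { print FNR ": " $0 }  # second pass: print duplicated lines with their numbers
    ' file file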