Identify duplicate lines in a file without deleting them?
This is a classical problem that can be solved with the uniq command. uniq can detect consecutive duplicate lines and either print only the lines that are never repeated (-u, --unique) or print one copy of each duplicated line (-d, --repeated).
Since the ordering of duplicate lines is not important for you, you should sort the file first. Then use uniq to print unique lines only:
sort yourfile.txt | uniq -u
There is also a -c (--count) option that prefixes each line with the number of occurrences; combined with -d it shows how many times each duplicated line appears. See the manual page of uniq for details.
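As a quick illustration of the difference between these options (using a throw-away inline sample rather than your real file):
printf '%s\n' b a b c a b | sort | uniq -u    # prints: c (lines that occur exactly once)
printf '%s\n' b a b c a b | sort | uniq -d    # prints: a, b (one copy of each repeated line)
printf '%s\n' b a b c a b | sort | uniq -cd   # prints: 2 a, 3 b (repeated lines with their counts)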
If you really do not care about the parts after the first field, you can use the following command to find duplicate keys and print each line number for them (append another | sort -n to have the output sorted by line number):
cut -d ' ' -f1 .bash_history | nl | sort -k2 | uniq -s8 -D
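A close variant of the same idea, assuming GNU uniq: its -f (--skip-fields) option skips whole fields instead of a fixed number of characters, so you do not have to count the width of the numbers that nl prepends:
cut -d ' ' -f1 .bash_history | nl | sort -k2 | uniq -f1 -D | sort -n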
Since you want to see the duplicate lines themselves (using the first field as the key), you cannot use uniq on the file directly. The issue that makes automation difficult is that the title parts vary, and a program cannot automatically determine which title should be considered the final one.
Here is an AWK script (save it as script.awk) that takes your text file as input and prints all duplicate lines so you can decide which to delete. Run it with awk -f script.awk yourfile.txt. Note that it uses arrays of arrays (lines[$1][NR]), which require GNU awk 4.0 or later.
#!/usr/bin/awk -f
{
    # Store the line ($0) grouped per URL ($1) with line number (NR) as key
    lines[$1][NR] = $0;
}
END {
    for (url in lines) {
        # find lines that have the URL occur multiple times
        if (length(lines[url]) > 1) {
            for (lineno in lines[url]) {
                # Print duplicate line for decision purposes
                print lines[url][lineno];
                # Alternative: print line number and line
                #print lineno, lines[url][lineno];
            }
        }
    }
}
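If your awk does not support arrays of arrays (they are a GNU awk 4 extension), the same idea can be written in plain POSIX awk; a sketch that keeps the first field as the key and prints the line number with each duplicate:
awk '
{
    count[$1]++;     # how many times each key (first field) occurs
    line[NR] = $0;   # remember every line in input order
    key[NR]  = $1;   # and the key it belongs to
}
END {
    for (i = 1; i <= NR; i++)
        if (count[key[i]] > 1)
            print i ": " line[i];   # line number and the duplicate line
}' yourfile.txt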
If I understand your question, I think that you need something like:
for dup in $(sort -k1,1 -u file.txt | cut -d' ' -f1); do grep -n -- "$dup" file.txt; done
or:
for dup in $(cut -d " " -f1 file.txt | sort | uniq -d); do grep -n -- "$dup" file.txt; done
where file.txt is the file containing the data you are interested in. The first command prints every line of the file with its line number, grouped by the first field, so repeated keys are easy to spot; the second prints only the line numbers and lines whose first field occurs two or more times (the sort is needed because uniq -d only detects adjacent duplicates).
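A note on these loops: the $(...) form relies on word splitting, which works for the simple keys here but is fragile if a key contains glob characters such as ? or * (not unusual in URLs). A slightly more defensive sketch of the second command reads the keys line by line instead:
cut -d ' ' -f1 file.txt | sort | uniq -d |
while IFS= read -r dup; do
    grep -n -- "$dup" file.txt    # line number and line for each duplicated key
done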
If I read this correctly, all you need is something like
awk '{print $1}' file | sort | uniq -c |
while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done
That will print out the number of the line that contains the dupe and the line itself. For example, using this file:
foo bar baz
http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
bar foo baz
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
baz foo bar
http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field
It will produce this output:
2:http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
4:http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
6:http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field
To print only the line numbers, you could do
awk '{print $1}' file | sort | uniq -c |
while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done | cut -d: -f 1
And to print only the line:
awk '{print $1}' file | sort | uniq -c |
while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done | cut -d: -f 2-
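On the example file above, the line-number variant prints just:
2
4
6
and the line-only variant prints the three URL lines without the leading numbers.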
Explanation:
The awk script just prints the first space-separated field of the file; use $N to print the Nth field. sort sorts it and uniq -c counts the occurrences of each line. This is then passed to the while loop, which saves the number of occurrences as $num and the line as $dupe; if $num is greater than one (so the line is duplicated at least once), it searches the file for that line, using -n to print the line number. The -- tells grep that what follows is not a command line option, which is useful when $dupe can start with -.
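The whole pipeline can also be condensed into a single two-pass awk command; a sketch, assuming the file can simply be read twice (the first pass counts the keys, the second prints the line number and line for every key that occurs more than once):
awk 'NR == FNR { count[$1]++; next }      # first pass: count each first field
     count[$1] > 1 { print FNR ": " $0 }  # second pass: print duplicated lines with their numbers
    ' file file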