Pruning subdomains of other domains in a file using a script (bash, awk or similar)

Try this:

rev file \
| sort -u \
| tr '.' ',' \
| awk '$0!~dom_regex{print;dom_regex="^"$0"[,]";};NR==1{dom_regex="^"$0"[,]";print};' \
| tr ',' '.' \
| rev

Output:

4.3.2.1.domain.org
domain.com
anotherdomain.com
domain.net

Explanation:

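  1. Reverse each line, then sort and remove duplicate lines. This groups each domain with its subdomains, with the shortest name (the parent) first.
  2. The dots are turned into commas so that they are matched literally rather than as regex wildcards. awk then checks whether the current line starts with the last printed line followed by a separator (kept as a regex in the variable dom_regex). If it does not, the line is printed and dom_regex is updated. Otherwise the line is a subdomain and is skipped.
  3. Turn the commas back into dots and reverse each line again.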
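
The grouping that step 1 relies on can be inspected by stopping the pipeline right after the sort. This is only a sketch: the function name show_grouping and the three sample names are made up for illustration, and LC_ALL=C is set as an assumption so that the byte order ('.' before letters) does not depend on the locale.

```shell
# show_grouping: hypothetical helper, prints the reversed and sorted names
# so the effect of step 1 can be seen before awk filters anything.
show_grouping() {
    rev "$@" | LC_ALL=C sort -u
}

# Sample input (illustrative only)
printf '%s\n' sub.domain.com anotherdomain.com domain.com | show_grouping
# moc.niamod
# moc.niamod.bus
# moc.niamodrehtona
```

The parent (moc.niamod) comes directly before its subdomain, while the unrelated anotherdomain.com sorts after both, which is what lets awk decide line by line.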

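Here is another version. It compares plain string prefixes instead of building a regex, and the dot prepended to each name keeps hyphenated names such as a-b.c from sorting between b.c and b.b.c: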

sed 's/^/\./' file |
    rev |
    LC_ALL=C sort -u |
    awk 'p == "" || substr($0,1,length(p)) != p { print $0; p = $0 }' |
    rev |
    sed 's/^\.//'

Input:

domain.com
domain.net
sub.domain.com
anotherdomain.com
a.b.c.d.e.domain.net
5.4.3.2.1.domain.org
4.3.2.1.domain.org
b.c
a-b.c
b.b.c
btcapp.api.btc.com
btc.com

Output:

a-b.c
b.c
4.3.2.1.domain.org
btc.com
domain.com
anotherdomain.com
domain.net
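
The b.c, a-b.c and b.b.c lines in this input exercise a corner case: in byte order '-' sorts before '.', so without the extra dot the hyphenated name lands between a parent and its subdomain. The sketch below compares the two sort orders. The function names sorted_plain and sorted_dotted are hypothetical and are not part of the pipeline above.

```shell
# Reverse and sort the names as they are.
sorted_plain() {
    rev "$@" | LC_ALL=C sort -u
}

# Prepend a dot first, so every reversed name ends in '.'.
sorted_dotted() {
    sed 's/^/./' "$@" | rev | LC_ALL=C sort -u
}

printf '%s\n' b.c a-b.c b.b.c | sorted_plain
# c.b
# c.b-a    <- sits between b.c and its subdomain b.b.c
# c.b.b

printf '%s\n' b.c a-b.c b.b.c | sorted_dotted
# c.b-a.
# c.b.     <- parent is now directly followed by its subdomain
# c.b.b.
```

The trailing dot also means the prefix test cannot mistake anotherdomain.com for a subdomain of domain.com, because "moc.niamod." is not a prefix of "moc.niamodrehtona.".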

With your recommended data set at http://p.ip.fi/WRD-, the source file I collected contains 59683 lines and the filtered list has 34824. Running grep btc.com | wc -l on the filtered list reports 36 lines.
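
Since grep btc.com matches any line that merely contains that pattern, the count alone does not show whether any subdomains survived. One possible sanity check, sketched here under the hypothetical name check_pruned, prints every line whose parent domain is also present in the list. An empty result would suggest the pruning worked.

```shell
# check_pruned: hypothetical helper, not part of the pipelines above.
# Prints each name that has a parent domain elsewhere in the same list.
check_pruned() {
    awk '
        { seen[$0] = 1; names[NR] = $0 }   # remember every name
        END {
            for (i = 1; i <= NR; i++) {
                parent = names[i]
                # strip one leading label at a time and look up the remainder
                while ((pos = index(parent, ".")) > 0) {
                    parent = substr(parent, pos + 1)
                    if (parent in seen) { print names[i]; break }
                }
            }
        }
    ' "$@"
}

# Illustrative input: api.btc.com is a subdomain of btc.com, xbtc.com is not.
printf '%s\n' btc.com api.btc.com xbtc.com | check_pruned
# api.btc.com
```

The whole list is held in memory, which should be fine for a file of this size.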


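Try this if your domains have only a single-label extension (.com, .net and so on, not .co.uk). It keeps the first line seen for each pair of trailing labels, so a parent domain must appear before its subdomains in the file: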

awk -F '.' '!seen[$(NF-1)"."$NF]++' file

Output:

domain.com
domain.net
anotherdomain.com
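
Because this one-liner keeps whichever line it meets first for each pair of trailing labels, its result depends on the input order. A possible variant, offered only as a sketch under the hypothetical name prune_by_last_two, sorts the names by label count first so that the shortest name in each group wins regardless of order. Like the one-liner, it still assumes single-label extensions.

```shell
# prune_by_last_two: hypothetical helper; keeps one name per "label.tld" key,
# preferring the name with the fewest labels.
prune_by_last_two() {
    awk -F '.' '{ print NF "\t" $0 }' "$@" |   # prefix each line with its label count
        LC_ALL=C sort -k1,1n |                  # fewest labels first
        cut -f2- |                              # drop the count again
        awk -F '.' '!seen[$(NF-1) "." $NF]++'   # same filter as the one-liner
}

# Illustrative input with the subdomain listed before its parent
printf '%s\n' sub.domain.com domain.net domain.com | prune_by_last_two
# domain.com
# domain.net
```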