pruning subdomains of other domains in a file using script (bash, awk or similar)
Try this:
rev file \
| sort -u \
| tr '.' ',' \
| awk '$0!~dom_regex{print;dom_regex="^"$0"[,]";};NR==1{dom_regex="^"$0"[,]";print};' \
| tr ',' '.' \
| rev
Output:
4.3.2.1.domain.org
domain.com
anotherdomain.com
domain.net
Explanation:
- sort the reversed file and eliminate duplicate lines. This step groups the domains/subdomains of "one kind" together, with the shortest one in front.
- The awk part checks whether the next line is of the same kind (saved as a regex in the variable dom_regex). If not, it prints the line and sets a new dom_regex. Otherwise, the line is skipped.
- Reverse each line again.
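To make the grouping step concrete, here is a small sketch that prints the stream as it looks right after rev and sort -u. The sample domains, the temporary file and the use of LC_ALL=C (for a reproducible sort order) are assumptions added for illustration.

```shell
# Sample input (assumed): a few domains and subdomains in arbitrary order.
tmp=$(mktemp)
printf '%s\n' domain.com domain.net sub.domain.com anotherdomain.com \
    a.b.c.d.e.domain.net 5.4.3.2.1.domain.org 4.3.2.1.domain.org > "$tmp"

# Reverse every line, then sort and drop duplicates.
# Each parent domain now sits directly in front of its own subdomains.
grouped=$(rev "$tmp" | LC_ALL=C sort -u)
printf '%s\n' "$grouped"
rm -f "$tmp"
```

In the printed list, moc.niamod comes immediately before moc.niamod.bus, which is why comparing each line with the last printed line is enough for this data.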
Here is another version
sed 's/^/\./' file |
rev |
LC_ALL=C sort -u |
awk 'p == "" || substr($0,1,length(p)) != p { print $0; p = $0 }' |
rev |
sed 's/^\.//'
Input
domain.com
domain.net
sub.domain.com
anotherdomain.com
a.b.c.d.e.domain.net
5.4.3.2.1.domain.org
4.3.2.1.domain.org
b.c
a-b.c
b.b.c
btcapp.api.btc.com
btc.com
Output
a-b.c
b.c
4.3.2.1.domain.org
btc.com
domain.com
anotherdomain.com
domain.net
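The dot that the first sed adds is what makes the prefix test safe for names like a-b.c: once reversed, c.b-a starts with c.b, so a plain prefix comparison would mistake it for a subdomain of b.c. With the extra dot the comparison is against c.b. instead. A minimal sketch that wraps the same pipeline in a function reading from stdin (the name prune_subdomains is hypothetical):

```shell
# prune_subdomains (hypothetical name): reads one domain per line on stdin
# and prints only those that are not subdomains of another listed domain.
prune_subdomains() {
    sed 's/^/./' |          # prepend a dot so every label ends with a dot once reversed
    rev |
    LC_ALL=C sort -u |      # byte-wise order puts a parent right before its subdomains
    awk 'p == "" || substr($0, 1, length(p)) != p { print; p = $0 }' |
    rev |
    sed 's/^\.//'           # remove the helper dot again
}

result=$(printf '%s\n' b.c a-b.c b.b.c | prune_subdomains)
printf '%s\n' "$result"
```

For these three names it should print a-b.c and b.c and drop b.b.c.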
Trying this with your recommended data set at http://p.ip.fi/WRD-: the source file I collected contains 59683 lines and the filtered list has 34824. Running grep btc.com | wc -l on the filtered list gives 36 lines.
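A count from grep btc.com cannot tell on its own whether the pruning worked, since that pattern also matches unrelated names that merely contain the string. One way to check a filtered list directly is to look for lines whose parent domain is also present. The sketch below is an assumption of how such a check could look, not part of the answers above. The function name find_leftover_subdomains and the sample data are hypothetical.

```shell
# find_leftover_subdomains (hypothetical name): prints every line of the
# given file that has a parent domain listed in the same file.
find_leftover_subdomains() {
    awk '
        NR == FNR { seen[$0] = 1; next }   # first pass: remember every domain
        {
            parent = $0
            # second pass: strip one leading label at a time and look up the rest
            while ((i = index(parent, ".")) > 0) {
                parent = substr(parent, i + 1)
                if (parent in seen) { print; break }
            }
        }
    ' "$1" "$1"
}

# Sample data (assumed): one real subdomain and one look-alike name.
tmp=$(mktemp)
printf '%s\n' btc.com btcapp.api.btc.com xbtc.com > "$tmp"
leftover=$(find_leftover_subdomains "$tmp")
printf '%s\n' "$leftover"
rm -f "$tmp"
```

On a correctly pruned list the function prints nothing. On the sample it reports only btcapp.api.btc.com, because xbtc.com is a different domain even though grep btc.com matches it.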
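Try this if each domain has a single-label extension (e.g. .com rather than .co.uk). It keeps the first line seen for each pair of final labels, so it assumes the bare domain is listed before its subdomains.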
awk -F '.' '!seen[$(NF-1)"."$NF]++' file
Output:
domain.com
domain.net
anotherdomain.com
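Note that this one-liner keys on the last two labels and keeps whichever line comes first, so the result depends on input order. A short sketch showing that behaviour (the function name keep_first_per_domain and the sample data are assumptions):

```shell
# keep_first_per_domain (hypothetical name): same awk one-liner, reading stdin.
keep_first_per_domain() {
    awk -F '.' '!seen[$(NF-1) "." $NF]++'
}

# Parent listed first: the subdomain is dropped.
in_order=$(printf '%s\n' domain.com sub.domain.com | keep_first_per_domain)
# Subdomain listed first: the subdomain is kept and the parent is dropped.
sub_first=$(printf '%s\n' sub.domain.com domain.com | keep_first_per_domain)

printf 'parent first: %s\nsubdomain first: %s\n' "$in_order" "$sub_first"
```

If the file is not already ordered with parents first, the sort-based versions above are the safer choice.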