How to subtract two list (fast)?
You can use comm
to remove anything that’s common to both lists:
listr=($(comm -3 <(printf "%s\n" "${list1[@]}" | sort) <(printf "%s\n" "${list2[@]}" | sort) | sort -n))
This sorts both lists in the order comm
expects, compares them, outputs only items which are unique to either list, and sorts them again in numeric order.
If both lists are sorted lexicographically (as per LC_COLLATE
), you can avoid sorting again:
listr=($(comm --nocheck-order -3 <(printf "%s\n" "${list1[@]}") <(printf "%s\n" "${list2[@]}")))
This also works great if the values you need to compare are stored in files.
#!/bin/zsh
list1=( 1 2 3 4 5 6 7 8 9 10 11 12 )
list2=( 1 2 3 5 7 8 9 11 12 )
listr=("${(@)list1:|list2}")
typeset -p listr
Abstract:
For long lists, if lists are already sorted, comm (alg7) is the fastest.
The zsh solution is (by far) the fastest if no file is read, that is, the lists are given "in memory". However, that is not a fair comparison with all the other solutions that have to read values from files. I changed the original code (which avoided the time to read files from testing) to a code that also include the time to read files.
This is a Community Answer, it is meant to only report the time of each answer.
Please DO edit and add your solution/option to compare all.
List of algorithms
- alg1: naive looped solution.
- alg2: using external
sort
anduniq -u
- alg3: processing an string in bash.
- alg4: join -v on sorted lists (Thanks @Kusalananda)
- alg5: comm (Thanks @Stephen Kitt)
- alg6: zsh (Thanks @Llua)
- alg7: comm but on already sorted files (thanks @Stephen Kitt)
Note from zsh manual:
${name:|arrayname}
If arrayname is the name (N.B., not contents) of an array variable, then any elements contained in arrayname are removed from the substitution of name.
Testing
As there are several ways to solve this, we need a general framework to test (in fairness) the answers.
Some guidelines (change them if you find them unfair):
- Measure enough repetitions to have a reasonably precision.
- Measure inside the shell used (avoid loading/unloading of the shell).
- Avoid output overhead by either not printing or redirecting to /dev/null.
Testing code:
#!/bin/bash
alg1(){
arr=( "${list1[@]}" )
for i in "${list2[@]}"; do
for j in "${!arr[@]}"; do
if [[ "$i" == "${arr[j]}" ]]; then
unset arr["$j"]
break
fi
done
done
printf '%s ' "${arr[@]}"; echo
}
alg2(){
arr=($({ printf '%s\n' "${list1[@]}" "${list2[@]}"; } | sort | uniq -u))
printf '%s ' "${arr[@]}"; echo
}
alg3(){
a=" $(printf '%s ' ${list1[@]})" # Watch the space at the start!!.
for i in "${list2[@]}"; do
a=${a/ "$i" / };
done
printf '%s\n' "$a"
}
alg4(){ join -v 1 list1.txt list2.txt ; }
alg5(){ #listr=$(
comm -3 <(printf "%s\n" "${list1[@]}" | sort) \
<(printf "%s\n" "${list2[@]}" | sort) |
sort -n
#)
}
alg6(){ zsh -c ' alg6(){
list1=( $(cat list1.txt) )
list2=( $(cat list2.txt) )
listr=("${(@)list1:|list2}")
typeset -p listr
}
TIMEFMT="%E %U %S"
time ( for ((k=0;k<'"$1"';k++)); do alg6; done; )
'
}
#: <<-\_comment_
alg7(){ comm -23 list1.txt list2.txt; }
try(){ for ((k=0;k<$1;k++)); do "$2"; done; }
#list1=( 1 2 3 4 5 6 7 8 9 10 11 12 )
#list2=( 1 2 3 5 7 8 9 11 12 )
#list1=( a a b b b c d d )
#list2=( b b c c c d d e )
size=1000000
list1=( "0" $(seq 1 "$size") )
list2=( "${list1[@]}" ); unset "list2[123]" "list2[234]" "list2[345]"
printf '%s\n' "${list1[@]}" | sort >list1.txt
printf '%s\n' "${list2[@]}" | sort >list2.txt
repeats=${1:-10}; shift
out=${1:-no} ; shift
(($#==0)) && set -- alg{1..7}
TIMEFORMAT='%R %U %S'
for i
do printf '%s ' "$i"
if [[ $out == no ]]; then
[[ $i != alg6 ]] &&
time try "$repeats" "$i" >/dev/null ||
time alg6 "$repeats" >/dev/null
else
[[ $i != alg6 ]] &&
time try "$repeats" "$i" ||
time alg6 "$repeats"
fi
done
Results:
Short list (as presented in the code):
$ ./script
alg1 2.056 0.806 1.237
alg2 3.478 3.126 1.756
alg3 0.999 0.728 0.304
alg4 1.186 0.780 0.434
alg5 5.234 1.926 1.722
alg6 2.71s 1.64s 1.26s
2.758 1.637 1.265
alg7 1.156 0.799 0.422
The times for alg6 are as reported by zsh and after as reported by bash.
Also, the execution time of zsh is really smaller (0.050) if the reading of files is removed from the function to outside.
Longer List
Testing a list of only 500 values (10 repeats) reveals that alg1 is very inefficient. Removing it from further testing:
alg1 4.149 3.471 0.657
alg2 0.145 0.055 0.063
alg3 0.219 0.180 0.009
alg4 0.039 0.015 0.003
alg5 0.149 0.018 0.027
alg6 0.06s 0.02s 0.02s
0.087 0.030 0.018
alg7 0.033 0.008 0.008
Testing 5k values (10 repeats) reveals that alg3 is also inefficient:
alg2 0.590 0.526 0.187
alg3 12.957 12.888 0.044
alg4 0.098 0.047 0.008
alg5 0.654 0.028 0.036
alg6 0.16s 0.12s 0.04s
0.211 0.118 0.044
alg7 0.038 0.022 0.014
Testing 50k values (10 repeats):
alg2 6.487 5.838 1.611
alg4 0.488 0.469 0.019
alg5 5.073 0.250 0.056
alg6 1.42s 1.20s 0.21s
1.467 1.206 0.216
alg7 0.271 0.247 0.014
For 500k (10 repeats)
alg4 5.471 5.269 0.156
alg6 15.14s 13.33s 1.91s
15.215 13.335 1.926
alg7 2.833 2.655 0.138
For 1M (10 repeats)
alg4 11.127 10.804 0.251
alg7 5.772 5.525 0.230