Command line tool to "cat" pairwise expansion of all rows in a file
Here's how to do it in awk so that it doesn't have to store the whole file in an array. This is basically the same algorithm as terdon's.
If you like, you can even give it multiple filenames on the command line and it will process each file independently, concatenating the results together.
#!/usr/bin/awk -f
#Cartesian product of records
{
    file = FILENAME
    while ((getline line <file) > 0)
        print $0, line
    close(file)
}
On my system, this runs in about 2/3 the time of terdon's perl solution.
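For example (the file name cartesian.awk here is just an illustration), you can save the script, make it executable, and hand it one or more files; each file is expanded independently and the results are simply concatenated:

chmod +x cartesian.awk
./cartesian.awk file1          # all pairs of rows within file1
./cartesian.awk file1 file2    # pairs within file1, then pairs within file2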
I'm not sure this is better than doing it in memory, but with a sed that reads out its infile for every line in its infile and another on the other side of a pipe alternating Hold space with input lines...
cat <<\IN >/tmp/tmp
Row1,10
Row2,20
Row3,30
Row4,40
IN
</tmp/tmp sed -e 'i\
' -e 'r /tmp/tmp' |
sed -n '/./!n;h;N;/\n$/D;G;s/\n/ /;P;D'
OUTPUT
Row1,10 Row1,10
Row1,10 Row2,20
Row1,10 Row3,30
Row1,10 Row4,40
Row2,20 Row1,10
Row2,20 Row2,20
Row2,20 Row3,30
Row2,20 Row4,40
Row3,30 Row1,10
Row3,30 Row2,20
Row3,30 Row3,30
Row3,30 Row4,40
Row4,40 Row1,10
Row4,40 Row2,20
Row4,40 Row3,30
Row4,40 Row4,40
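In case the second sed reads like line noise, here is my own annotated reading of it, not part of the original one-liner. The first sed emits, for every input line, a blank separator (i\), the line itself, and then the whole file again (r). The second sed, written out as a script file with comments, could equally be fed back in with sed -nf pairs.sed (the file name is only for illustration):

# pairs.sed - annotated copy of the second sed script above
# on a blank separator, fetch the next input line: that is the current "left" row
/./!n
# keep the left row in hold space
h
# pull in the next line: a candidate "right" row
N
# if that next line was the blank separator, restart and pick up a new left row
/\n$/D
# append the held left row, then join left and right with a space
G
s/\n/ /
# print the pair, delete it, and loop again with the left row still in place
P
D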
I did this another way. It does keep a little in memory - it builds a string like:
"$1" -
...once for each line in the file.
pairs(){ [ -e "$1" ] || return
    set -- "$1" "$(IFS=0 n=
        case "${0%sh*}" in (ya|*s) n=-1;; (mk|po) n=+1;; esac
        printf '"$1" - %s' $(printf "%.$(($(wc -l <"$1")$n))d" 0))"
    eval "cat -- $2 </dev/null | paste -d ' \n' -- $2"
}
It is very fast. It cats the file to a pipe as many times as there are lines in the file. On the other side of the pipe, that input is merged with the file itself as many times as there are lines in the file.
The case stuff is just for portability - yash and zsh both add one element to the split, while mksh and posh both lose one. ksh, dash, busybox, and bash all split out to exactly as many fields as there are zeroes as printed by printf. As written, the above renders the same results for every one of the above-mentioned shells on my machine.
If the file is very long, there may be ARG_MAX issues with too many arguments, in which case you would need to introduce xargs or similar as well.
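To make that concrete (this expansion is my illustration, not something the function prints): for the four-line /tmp/tmp above, $2 ends up holding "$1" - repeated four times, so the eval amounts to:

cat -- "$1" - "$1" - "$1" - "$1" - </dev/null |
paste -d ' \n' -- "$1" - "$1" - "$1" - "$1" -

Every - handed to cat reads the empty /dev/null, so cat just writes the file into the pipe four times. paste then takes one line from each of its eight operands per pass: the four opens of the file all sit on the same row, while the four dashes walk through one full copy of the file from the pipe, and the ' \n' delimiter pair alternates joining with a space and breaking the line - which is exactly one row on the left paired against every row on the right.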
Given the same input I used before the output is identical. But, if I were to go bigger...
seq 10 10 10000 | nl -s, >/tmp/tmp
That generates a file almost identical to what I used before (sans 'Row') - but at 1000 lines. You can see for yourself how fast it is:
time pairs /tmp/tmp |wc -l
1000000
pairs /tmp/tmp 0.20s user 0.07s system 110% cpu 0.239 total
wc -l 0.05s user 0.03s system 32% cpu 0.238 total
At 1000 lines there is some slight variation in performance between shells - bash is invariably the slowest - but because the only work they do anyway is generate the arg string (1000 copies of filename -) the effect is minimal. The difference in performance between zsh - as above - and bash is a 100th of a second here.
Here's another version that should work for a file of any length:
pairs2()( [ -e "$1" ] || exit
    # print $2 on its own line, $1 times (callers reset n to 0 first)
    rpt() until [ "$((n+=1))" -gt "$1" ]
          do printf %s\\n "$2"
          done
    [ -n "${1##*/*}" ] || cd -P -- "${1%/*}" || exit
    # after this: $1 = file, $2 = /tmp link name, $3 = line count
    : & set -- "$1" "/tmp/pairs$!.ln" "$(wc -l <"$1")"
    ln -s "$PWD/${1##*/}" "$2" || exit
    n=0 rpt "$3" "$2" | xargs cat | { exec 3<&0
        n=0 rpt "$3" p | sed -nf - "$2" | paste - /dev/fd/3
    }; rm "$2"
)
It creates a soft-link to its first arg in /tmp with a semi-random name so that it won't get hung up on weird filenames. That's important because cat's args are fed to it over a pipe via xargs. cat's output is saved to <&3 while sed prints every line in the first arg as many times as there are lines in that file - and its script is also fed to it via a pipe. Again paste merges its input, but this time it takes only two arguments: - again for its standard input, and the link name /dev/fd/3.
That last - the /dev/fd/[num] link - should work on any Linux system and many more besides, but if it doesn't, creating a named pipe with mkfifo and using that instead should work as well.
The last thing it does is rm the soft-link it creates before exiting.
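If /dev/fd/3 isn't available, the named-pipe fallback mentioned above might look roughly like this - an untested sketch of mine, not the author's code, reusing rpt and the positional parameters from inside pairs2 and a hypothetical fifo name:

fifo=/tmp/pairs$!.fifo
mkfifo "$fifo" || exit
n=0 rpt "$3" "$2" | xargs cat >"$fifo" &   # the repeated cats feed the fifo instead of fd 3
n=0 rpt "$3" p | sed -nf - "$2" | paste - "$fifo"
rm "$fifo" "$2"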
This version is actually faster still on my system. I guess that is because, though it execs more applications, it starts handing them their arguments immediately - whereas before it stacked them all up first.
time pairs2 /tmp/tmp | wc -l
1000000
pairs2 /tmp/tmp 0.30s user 0.09s system 178% cpu 0.218 total
wc -l 0.03s user 0.02s system 26% cpu 0.218 total
Well, you could always do it in your shell:
while read i; do
while read k; do echo "$i $k"; done < sample.txt
done < sample.txt
It is a good deal slower than your awk solution (on my machine, it took ~11 seconds for 1000 lines, versus ~0.3 seconds in awk) but at least it never holds more than a couple of lines in memory.
The loop above works for the very simple data you have in your example. It will choke on backslashes and it will eat trailing and leading spaces. A more robust version of the same thing is:
while IFS= read -r i; do
while IFS= read -r k; do printf "%s %s\n" "$i" "$k"; done < sample.txt
done < sample.txt
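For instance (my own example, not from the question's data), a line containing backslashes and surrounding whitespace shows the difference; the first loop strips both, while the IFS=/-r version reproduces the line exactly:

printf '  C:\\some\\path  \n' > sample.txt
# naive loop:  read eats the backslashes and trims the spaces,
#              so the pair comes out as "C:somepath C:somepath"
# robust loop: both halves of the pair keep the line verbatim,
#              spaces and backslashes included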
Another choice is to use perl instead:
perl -lne '$line1=$_; open(A,"sample.txt");
while($line2=<A>){printf "$line1 $line2"} close(A)' sample.txt
The script above will read each line of the input file (-ln), save it as $line1, open sample.txt again, and print each of its lines along with $line1. The result is all pairwise combinations while only 2 lines are ever stored in memory. On my system, that took only about 0.6 seconds on 1000 lines.