Faster grep function for big (27GB) files

A few things you can try:

1) You are reading input.sam multiple times. It only needs to be read once before your first loop starts. Save the ids to a temporary file which will be read by grep.

2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.

3) Use fgrep (equivalent to grep -F) because you're searching for fixed strings, not regular expressions.

4) Use -f to make grep read patterns from a file, rather than using a loop.

5) Don't write to the output file from multiple processes, as you may end up with interleaved lines and a corrupt file.

After making those changes, this is what your script would become:

awk '{print $1}' input.sam > idsFile.txt
for z in {a..z}
do
 for x in {a..z}
 do
  for y in {a..z}
  do
    LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $1,$10,$11}'
  done
 done
done > output.txt

Also, check out GNU Parallel, which is designed to help you run jobs like these in parallel.
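For example, a minimal sketch with GNU Parallel, assuming the sample_??? glob covers the same files as the loops above and picking -j8 as an arbitrary job count:

# Run one fgrep per sample file, 8 at a time. GNU Parallel groups each
# job's output by default, so lines from different files don't interleave.
# The \$ escapes keep awk's field references from being expanded by the outer shell.
ls sample_??? | parallel -j8 "LC_ALL=C fgrep -f idsFile.txt {} | awk '{print \$1,\$10,\$11}'" > output.txt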


My initial thought is that you're repeatedly spawning grep. Spawning processes is (relatively speaking) very expensive, and I think you'd be better off with some sort of scripted solution (e.g. Perl) that doesn't require the continual process creation.

E.g. on each pass through the inner loop you're kicking off cat and awk (you don't need cat, since awk can read files itself, and doesn't that cat/awk combination return the same thing each time?) and then grep. Then you wait for four greps to finish and go around again.
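As an illustration, here is a minimal one-pass sketch in awk rather than Perl (awk is already part of the question's toolchain). It assumes the IDs sit in column 1 of both idsFile.txt and the sample files, and that the sample_??? glob fits on one command line (batch the file names if not):

# First file: remember every ID. Remaining files: print the wanted
# columns whenever column 1 is a known ID. One process, no re-spawning.
awk 'NR==FNR {ids[$1]; next} $1 in ids {print $1, $10, $11}' idsFile.txt sample_??? > output.txt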

If you have to use grep, you can use

grep -f filename

to specify the set of patterns to match in that file, rather than a single pattern on the command line. I suspect from the above that you can pre-generate such a list.
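For example, a minimal sketch, assuming (as in the other answer) that the IDs live in column 1 of input.sam, shown here for a single sample file:

# Build the pattern list once (sort -u drops duplicate IDs), then hand
# the whole list to one fixed-string grep.
awk '{print $1}' input.sam | sort -u > patterns.txt
LC_ALL=C grep -F -f patterns.txt sample_aaa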

Tags: file, bash, grep, awk