grep - how to output progress bar or status
This wouldn't necessarily require modifying grep
, although you could probably get a more accurate progress bar with such a modification.
If you are grepping "thousands of files" with a single invocation of grep, it is most likely that you are using the -r
option to recursively a directory structure. In that case, it is not even clear that grep
knows how many files it will examine, because I believe it starts examining files before it explores the entire directory structure. Exploring the directory structure first would probably increase the total scan time (and, indeed, there is always a cost to producing progress reports, which is why few traditional Unix utilities do this.)
In any case, a simple but slightly inaccurate progress bar could be obtained by constructing the complete list of files to be scanned and then feeding them to grep
in batches of some size, maybe 100, or maybe based on the total size of the batch. Small batches would allow for more accurate progress reports but they would also increase overhead since they would require additional grep process start-up, and the process start-up time can be more than grepping a small file. The progress report would be updated for each batch of files, so you would want to choose a batch size that gave you regular updates without increasing overhead too much. Basing the batch size on the total size of the files (using, for example, stat
to get the filesize) would make the progress report more exact but add an additional cost to process startup.
One advantage of this strategy is that you could also run two or more greps in parallel, which might speed the process up a bit.
In broad terms, a simple script (which just divides the files by count, not by size, and which doesn't attempt to parallelize).
# Requires bash 4 and Gnu grep
shopt -s globstar
files=(**)
total=${#files[@]}
for ((i=0; i<total; i+=100)); do
echo $i/$total >>/dev/stderr
grep -d skip -e "$pattern" "${files[@]:i:100}" >>results.txt
done
For simplicity, I use a globstar (**
) to safely put all the files in an array. If your version of bash is too old, then you can do it by looping over the output of find
, but that's not very efficient if you have lots of files. Unfortunately, there is no way that I know of to write a globstar expression which only matches files. (**/
only matches directories.) Fortunately, GNU grep provides the -d skip
option which silently skips directories. That means that the file count will be slightly inaccurate, since directories will be counted, but it probably doesn't make much difference.
You probably will want to make the progress report cleaner by using some console codes. The above is just to get you started.
The simplest way to divide that into different processes would be to just divide the list into X different segments and run X different for loops, each with a different starting point. However, they probably won't all finish at the same time so that is sub-optimal. A better solution is GNU parallel. You might do something like this:
find . -type f -print0 |
parallel --progress -L 100 -m -j 4 grep -e "$pattern" > results.txt
(Here -L 100
specifies that up to 100 files should be given to each grep instance, and -j 4
specifies four parallel processes. I just pulled those numbers out of the air; you'll probably want to adjust them.)
This command show the progress (speed and offset), but not the total amount. This could be manually estimated however.
dd if=/input/file bs=1c skip=<offset> | pv | grep -aob "<string>"
Try the parallel program
find * -name \*.[ch] | parallel -j5 --bar '(grep grep-string {})' > output-file
Though I found this to be slower than a simple
find * -name \*.[ch] | xargs grep grep-string > output-file