Count lines in large files
On a multi-core server, use GNU parallel to count the lines of many files in parallel. Each job prints one file's line count (the input is redirected so wc prints only the number, without the filename); paste joins the counts with + and bc evaluates the sum.
find . -name '*.txt' | parallel 'wc -l < {}' 2>/dev/null | paste -sd+ - | bc
To save space, you can even keep all the files compressed. The following line decompresses each file and counts its lines in parallel, then sums all the counts.
find . -name '*.xz' | parallel 'xzcat {} | wc -l' 2>/dev/null | paste -sd+ - | bc
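If bc isn't installed, awk can do the summation instead; this is just an equivalent sketch of the same pipeline:
find . -name '*.txt' | parallel 'wc -l < {}' 2>/dev/null | awk '{ sum += $1 } END { print sum }'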
In my test, spark-shell (which is Scala-based) was the fastest of the tools I compared (grep, sed, awk, perl, wc). Here are the results of a test I ran on a file with 23,782,409 lines:
time grep -c $ my_file.txt;
real 0m44.96s user 0m41.59s sys 0m3.09s
time wc -l my_file.txt;
real 0m37.57s user 0m33.48s sys 0m3.97s
time sed -n '$=' my_file.txt;
real 0m38.22s user 0m28.05s sys 0m10.14s
time perl -ne 'END { $_=$.;if(!/^[0-9]+$/){$_=0;};print "$_" }' my_file.txt;
real 0m23.38s user 0m20.19s sys 0m3.11s
time awk 'END { print NR }' my_file.txt;
real 0m19.90s user 0m16.76s sys 0m3.12s
spark-shell
import org.joda.time._
val t_start = DateTime.now()
sc.textFile("file://my_file.txt").count()
val t_end = DateTime.now()
new Period(t_start, t_end).toStandardSeconds()
res1: org.joda.time.Seconds = PT15S
Your limiting factor is the I/O speed of your storage device, so switching between simple newline/pattern-counting programs won't help much: the difference in execution speed between those programs is likely to be dwarfed by your much slower disk/storage.
But if you have the same file copied across disks/devices, or the file is distributed among those disks, you can certainly perform the operation in parallel. I don't know the details of Hadoop, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each on one part of the file, and sum their results:
$ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
$ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
$ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
$ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &
Notice the & at the end of each command line, so all four run in parallel. dd works like cat here, but lets us specify how many bytes to read (count * bs bytes) and how many to skip at the beginning of the input (skip * bs bytes). It works in blocks, hence the need to give bs as the block size. In this example I've partitioned the 10 GB file into 4 equal chunks of 4 KiB * 655360 = 2684354560 bytes (2.5 GiB), one per job. You may want to set up a script that does this for you based on the file size and the number of parallel jobs you want to run. You also need to sum the results of the four runs, which I haven't done here for lack of shell-scripting ability; a sketch of such a script follows.
If your filesystem is smart enough to split a big file across many devices, like a RAID or a distributed filesystem, and to automatically parallelize I/O requests that can be parallelized, you can do such a split running many parallel jobs against the same file path and still see some speed gain.
EDIT: Another idea that occurred to me: if every line in the file has the same size, you can get the exact number of lines by dividing the file size by the line size, both in bytes, almost instantaneously in a single job. If you only know the mean line size and don't need the exact line count but just an estimate, the same calculation gives a satisfactory answer much faster than the exact methods.
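As a rough sketch of that idea, assuming the first line is representative of the (fixed or mean) line length and using GNU stat for the file size (my_file.txt is just a placeholder name):
LINE=$(head -n 1 my_file.txt | wc -c)   # bytes in one line, newline included
SIZE=$(stat -c %s my_file.txt)          # total file size in bytes (GNU stat)
echo $(( SIZE / LINE ))                 # exact for fixed-size lines, an estimate otherwise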
Try: sed -n '$=' filename
Also, cat is unnecessary: wc -l filename is enough for what you are already doing.