(Python) Counting lines in a huge (>10GB) file as fast as possible
I know it's a bit unfair, but you could do this:
import subprocess
int(subprocess.check_output(["wc", "-l", "C:\\alarm.bat"]).split()[0])
If you're on Windows, check out Coreutils.
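A minimal wrapper around the same idea (a sketch; it assumes wc is on your PATH, e.g. via that Coreutils port):

import subprocess

def wc_count(path):
    # Delegate the counting to the external `wc -l` tool and parse the
    # first field of its output; passing an argument list rather than a
    # string avoids shell quoting issues with paths containing spaces.
    out = subprocess.check_output(["wc", "-l", path])
    return int(out.split()[0])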
A fast one-line solution is:
sum(1 for _ in open(file_path, 'rb'))
It should work on files of arbitrary size.
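If you prefer the file handle to be closed deterministically rather than left to the garbage collector, a small sketch wrapping the same one-liner:

def simple_count(file_path):
    # Binary mode skips decoding and newline translation; the handle is
    # closed as soon as the with-block exits.
    with open(file_path, 'rb') as f:
        return sum(1 for _ in f)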
Ignacio's answer is correct, but it might fail in a 32-bit process.
It can instead be useful to read the file block-wise and count the \n characters in each block:
def blocks(files, size=65536):
    # Yield the file in fixed-size chunks until EOF.
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "r") as f:
    print(sum(bl.count("\n") for bl in blocks(f)))
will do the job.
Note that I don't open the file as binary, so each \r\n is converted to \n, making the counting more reliable.
For Python 3, and to make it more robust when reading files containing all kinds of characters:
def blocks(files, size=65536):
    # Yield the file in fixed-size chunks until EOF.
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "r", encoding="utf-8", errors='ignore') as f:
    print(sum(bl.count("\n") for bl in blocks(f)))
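A further variant (my own sketch, not part of the answer above): doing the block counting in binary mode skips decoding entirely, which is usually faster, though \r\n is then no longer folded into \n:

def count_newlines_binary(path, size=65536):
    # Read fixed-size binary chunks until EOF and count raw newline bytes.
    # A final line without a trailing newline is not counted.
    with open(path, "rb") as f:
        return sum(block.count(b"\n") for block in iter(lambda: f.read(size), b""))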
mmap the file, and count up the newlines.
import mmap

def mapcount(filename):
    # Map the whole file into memory and count lines with the mmap's
    # readline(); the with-block ensures the file is closed.
    with open(filename, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines
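A read-only variant of the same mmap idea (my sketch): scanning with find(b"\n") avoids constructing a line object per line. It counts newline bytes, so a final unterminated line is not included:

import mmap

def mapcount_find(filename):
    # Map the file read-only and jump from newline to newline.
    with open(filename, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        lines = 0
        pos = buf.find(b"\n")
        while pos != -1:
            lines += 1
            pos = buf.find(b"\n", pos + 1)
        return lines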