How to count the total number of lines in a text file using Python
This link (How to get line count cheaply in Python?) has lots of potential solutions, but they all ignore one way to make this run considerably faster: using the unbuffered (raw) interface, working on bytes buffers, and doing your own buffering.
Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:
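def _make_gen(reader):
    # Yield 1 MiB chunks until the reader returns an empty bytes object (EOF).
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

def rawpycount(filename):
    # Read through the raw (unbuffered) interface and count newline bytes.
    with open(filename, 'rb') as f:
        f_gen = _make_gen(f.raw.read)
        return sum(buf.count(b'\n') for buf in f_gen)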
Here are my timings:
rawpycount    0.0048    0.0046    1.00
bufcount      0.0074    0.0066    1.43
wccount       0.01      0.01      2.17
itercount     0.014     0.014     3.04
opcount       0.021     0.02      4.43
kylecount     0.023     0.021     4.58
simplecount   0.022     0.022     4.81
mapcount      0.038     0.032     6.82
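For anyone who wants to reproduce numbers like these, here is a rough sketch of a comparison using the standard `timeit` module. It is not the timing tool mentioned above. `time_counters` and `make_sample_file` are hypothetical helpers written for this example, and the body of `simplecount` is an assumption (only its name appears in the table).

```python
import os
import tempfile
import timeit


def rawpycount(filename, chunk_size=1024 * 1024):
    """Count b'\\n' bytes by reading chunks through the raw interface."""
    count = 0
    with open(filename, 'rb') as f:
        while True:
            buf = f.raw.read(chunk_size)
            if not buf:
                break
            count += buf.count(b'\n')
    return count


def simplecount(filename):
    """Assumed baseline: count lines by iterating over the file object."""
    with open(filename, 'rb') as f:
        return sum(1 for _ in f)


def time_counters(filename, counters, repeat=3, number=5):
    """Return {function name: best seconds per call} for each counter."""
    results = {}
    for func in counters:
        timer = timeit.Timer(lambda f=func: f(filename))
        best = min(timer.repeat(repeat=repeat, number=number))
        results[func.__name__] = best / number
    return results


def make_sample_file(lines=100000):
    """Write a throwaway file with the given number of lines; return its path."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'wb') as f:
        f.write(b'some sample text\n' * lines)
    return path


if __name__ == '__main__':
    path = make_sample_file()
    try:
        timings = time_counters(path, [rawpycount, simplecount])
        for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
            print('%-12s %.4f' % (name, seconds))
    finally:
        os.remove(path)
```

Absolute numbers depend heavily on the machine, the file size, and whether the file is already in the OS cache, so only the relative ordering is worth reading into.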
I would post it there, but I'm a relatively new user on Stack Exchange and don't have the requisite manna.
EDIT:
This can be done completely with generator expressions inline using itertools, but it gets pretty weird looking:
from itertools import takewhile, repeat

def rawbigcount(filename):
    with open(filename, 'rb') as f:
        # Keep reading 1 MiB chunks until read() returns an empty bytes object.
        bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)
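One thing to be aware of with both versions above: they count newline bytes rather than lines, so a file whose last line has no trailing newline comes out one lower than iterating over the lines would give. Below is a minimal sketch of one way to compensate, assuming that difference matters for your data. `count_lines_raw` is a name made up for this example.

```python
def count_lines_raw(filename, chunk_size=1024 * 1024):
    """Count lines via b'\\n', adding one for an unterminated last line."""
    newlines = 0
    last_byte = b''
    with open(filename, 'rb') as f:
        while True:
            buf = f.raw.read(chunk_size)
            if not buf:
                break
            newlines += buf.count(b'\n')
            last_byte = buf[-1:]  # remember the final byte seen so far
    # A non-empty file that does not end in a newline has one more line
    # than it has newline characters.
    if last_byte and last_byte != b'\n':
        newlines += 1
    return newlines
```

The extra bookkeeping is a single slice per chunk, so it should not change the speed in any meaningful way.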
count = 0
with open('filename.txt', 'rb') as f:
    for line in f:
        count += 1
print(count)
You can use sum() with a generator expression here. The generator expression yields a 1 for every line in the file, and sum() adds them all together to give the total count.
with open('text.txt') as myfile:
    count = sum(1 for line in myfile)
Judging by what you have tried, it seems you don't want to count empty lines. In that case you can do:
with open('text.txt') as myfile:
    count = sum(1 for line in myfile if line.rstrip('\n'))
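To make the difference between the two counts concrete, here is a small self-contained sketch that runs both on a throwaway file. The helper `count_lines` and its `skip_empty` flag are names invented for this example.

```python
import os
import tempfile


def count_lines(path, skip_empty=False):
    """Count lines in a text file, optionally skipping empty ones."""
    with open(path) as f:
        if skip_empty:
            return sum(1 for line in f if line.rstrip('\n'))
        return sum(1 for line in f)


# Build a small sample file: three lines with text, two empty ones.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('alpha\n\nbeta\n\ngamma\n')

total = count_lines(sample_path)                        # 5
non_empty = count_lines(sample_path, skip_empty=True)   # 3
os.remove(sample_path)
```

Note that `line.rstrip('\n')` still counts a line containing only spaces or tabs. If those should be skipped as well, `line.strip()` is the usual alternative.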
You can use sum() with a generator expression:
with open('data.txt') as f:
    print(sum(1 for _ in f))
Note that you cannot use len(f), since f is an iterator. _ is a conventional name for throwaway variables; see What is the purpose of the single underscore "_" variable in Python?.
You can use len(f.readlines()), but this creates an additional list in memory and will not work on huge files that don't fit in memory.
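As a quick check of the points above, the following sketch tries all three approaches on a small temporary file. The variable names are illustrative only.

```python
import os
import tempfile

# Create a three-line sample file to experiment with.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('one\ntwo\nthree\n')

# len() is not defined for file objects, so this raises TypeError.
with open(sample_path) as f:
    try:
        len(f)
        len_supported = True
    except TypeError:
        len_supported = False

# Streaming count: only one line is held in memory at a time.
with open(sample_path) as f:
    streamed = sum(1 for _ in f)

# readlines() gives the same answer but builds the full list first.
with open(sample_path) as f:
    loaded = len(f.readlines())

os.remove(sample_path)
```

Both counts agree on any file. The difference is only in memory use, which starts to matter once the file is large.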