Merging and sorting log files in Python
As for the critical sorting function:

def sort_key(line):
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')

This should be passed as the key argument to sort or sorted, not as cmp. It is faster that way, since each line is parsed only once rather than on every comparison. Oh, and you need

from datetime import datetime

in your code to make this work.
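Put together, a minimal self-contained sketch (the sample lines here are made up, and the bracketed "%a %b %d" timestamp format is assumed to match your logs):

```python
from datetime import datetime

def sort_key(line):
    # Everything before the first ']' is the bracketed timestamp
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')

# Hypothetical log lines, deliberately out of chronological order
lines = [
    '[Sat Jun 25 12:00:05 2011] second event\n',
    '[Fri Jun 24 09:30:00 2011] first event\n',
]

merged = sorted(lines, key=sort_key)
```

With key=, each line is parsed exactly once; a cmp function would re-parse both lines on every comparison.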
First off, you will want to use the fileinput module for reading lines from multiple files, like:

data = fileinput.FileInput()
for line in data:
    print line,

which will print all of the lines together. (FileInput objects are iterable but have no readlines() method, so iterate over them directly.) You also want to sort, which you can do with the sorted() built-in.
Assuming your lines start with [2011-07-20 19:20:12], you're golden: that format sorts chronologically under plain alphanumeric ordering, so no extra work is needed beyond:

data = fileinput.FileInput()
for line in sorted(data):
    print line,
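A quick sanity check (with made-up lines) that plain string order really is chronological for that timestamp format:

```python
# Hypothetical lines in the "[YYYY-MM-DD HH:MM:SS]" format; because every
# field is zero-padded and ordered most-significant-first, lexicographic
# order matches chronological order.
lines = [
    '[2011-07-20 19:20:12] later entry\n',
    '[2011-07-19 08:00:00] earlier entry\n',
]

ordered = sorted(lines)
```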
If, however, your timestamps are something more complex, you need to parse them first:

from datetime import datetime

def compareDates(line1, line2):
    # Parse the bracketed timestamps into datetime objects
    # (the format here is assumed; adjust it to match your logs)
    fmt = '[%a %b %d %H:%M:%S %Y'
    parseddate1 = datetime.strptime(line1.split(']')[0], fmt)
    parseddate2 = datetime.strptime(line2.split(']')[0], fmt)
    # Then use those for the sorting
    return cmp(parseddate1, parseddate2)

data = fileinput.FileInput()
for line in sorted(data, cmp=compareDates):
    print line,
For bonus points, you can even do
data = fileinput.FileInput(openhook=fileinput.hook_compressed)
which will enable you to read in gzipped log files.
The usage would then be:
$ python yourscript.py access.log.1 access.log.*.gz
or similar.
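A sketch of the compressed-file hook in action. The gzipped log file and its contents here are fabricated for the demo; note that hook_compressed may yield bytes for .gz files (it does on Python 3 unless you pass an encoding), hence the normalization step:

```python
import fileinput
import gzip
import os
import tempfile

# Fabricated gzipped log file, just for this demo
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'access.log.1.gz')
with gzip.open(path, 'wt') as f:
    f.write('[2011-07-20 19:20:12] compressed entry\n')

# hook_compressed transparently gunzips files ending in .gz
data = fileinput.FileInput([path], openhook=fileinput.hook_compressed)
lines = [l.decode() if isinstance(l, bytes) else l for l in data]
data.close()
```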
You can do this:

import fileinput
import re
from time import strptime

f_names = ['1.log', '2.log']  # names of log files
lines = list(fileinput.input(f_names))
t_fmt = '%a %b %d %H:%M:%S %Y'  # format of time stamps
t_pat = re.compile(r'\[(.+?)\]')  # pattern to extract timestamp
for l in sorted(lines, key=lambda l: strptime(t_pat.search(l).group(1), t_fmt)):
    print l,