How to parse this custom log file in Python
Using @Joran Beasley's answer I came up with the following solution and it seems to work:
Main Points:
- My log files ALWAYS follow the same structure: {Date} - {Type} - {Message} so I used string slicing and splitting to get the items broken up how I needed them. Example the {Date} is always 23 characters and I only want the first 19 characters.
- Using line.startswith("2015") is crazy as dates will change eventually so created a new function that uses some regex to match a date format I am expecting. Once again, my log Dates follow a specific pattern so I could get specific.
- The file is read into the first function "generateDicts()" and then calls the "matchDate()" function to see IF the line being processed matches a {Date} format I am looking for.
- A NEW dict is created everytime a valid {Date} format is found and everything is processed until the NEXT valid {Date} is encountered.
Function to split up the log files.
def generateDicts(log_fh):
currentDict = {}
for line in log_fh:
if line.startswith(matchDate(line)):
if currentDict:
yield currentDict
currentDict = {"date":line.split("__")[0][:19],"type":line.split("-",5)[3],"text":line.split("-",5)[-1]}
else:
currentDict["text"] += line
yield currentDict
with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
listNew= list(generateDicts(f))
Function to see if the line being processed starts with a {Date} that matches the format I am looking for
def matchDate(line):
matchThis = ""
matched = re.match(r'\d\d\d\d-\d\d-\d\d\ \d\d:\d\d:\d\d',line)
if matched:
#matches a date and adds it to matchThis
matchThis = matched.group()
else:
matchThis = "NONE"
return matchThis
create a generator (Im on a generator bend today)
def generateDicts(log_fh):
currentDict = {}
for line in log_fh:
if line.startswith("2015"): #you might want a better check here
if currentDict:
yield currentDict
currentDict = {"date":line.split("-")[0],"type":line.split("-")[2],"text":line.split("-")[-1]}
else:
currentDict["text"] += line
yield currentDict
with open("logfile.txt") as f:
print list(generateDicts(f))
there may be a few minor typos... I didnt actually run this
You can get the fields you are looking for directly from the regex using groups. You can even name them:
>>> import re
>>> date_re = re.compile('(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) (?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)')
>>> found = date_re.match('2016-02-29 12:34:56.789')
>>> if found is not None:
... print found.groupdict()
...
{'a_year': '2016', 'a_second': '56.789', 'a_day': '29', 'a_minute': '34', 'an_hour': '12', 'a_month': '02'}
>>> found.groupdict()['a_month']
'02'
Then create a date class where the constructor's kwargs match the group names. Use a little **magic to create an instance of the object directly from the regex groupdict and you are cooking with gas. In the constructor you can then figure out if 2016 is a leap year and Feb 29 exits.
-lrm