How to tag corrupted data in a dataframe after an error has been raised
This happens because of the way you're populating the dataframe.
sample_data['error_msg'] = str(e)
will actually overwrite the entire column with str(e), because assigning a scalar to a column broadcasts that value to every row.
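To make the difference concrete, here is a minimal sketch contrasting a whole-column assignment with a single-cell assignment. The sample values and the 'bad date' message are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
sample_data = pd.DataFrame({'sec_num': [19850312, 19851345, 19900101]})
sample_data['error_msg'] = None

# Assigning a scalar to a column writes it into every row
sample_data['error_msg'] = 'bad date'
all_rows = sample_data['error_msg'].tolist()

# Assigning through .at with a row label touches only that one cell
sample_data['error_msg'] = None
sample_data.at[1, 'error_msg'] = 'bad date'
one_row = sample_data['error_msg'].tolist()
```

After the first assignment every row carries the message; after the second, only the row with label 1 does.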
This is probably the most efficient way to do it:
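from datetime import date

def int2date(argdate: int):
    # Convert an integer such as 19850312 to a date; return None if it is not a valid date
    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError as e:
        return None  # you could write the row and the error to your logs here

# Invalid rows end up as None, so a null check flags them
df['date_of_birth'] = df.sec_num.apply(int2date)
df['is_in_error'] = df.date_of_birth.isnull()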
However, if you also want to write the errors to the dataframe, you can use this approach, although it might be much slower (there might be faster solutions to this).
# Assumes int2date raises ValueError on invalid input instead of swallowing it
df['date_of_birth'] = None
df['error_msg'] = None
df['is_in_error'] = False
for i, row in df.iterrows():
    try:
        date_of_birth = int2date(row['sec_num'])
        # .at sets a single cell; the older set_value was removed in pandas 1.0
        df.at[i, 'date_of_birth'] = date_of_birth
    except ValueError as e:
        df.at[i, 'is_in_error'] = True
        df.at[i, 'error_msg'] = str(e)
This handles each row separately and will only write the error to the correct index instead of updating the entire column.
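As one possible alternative to the row-by-row loop, the parsing function can return the date and the error message together, so all three columns are built in a single pass. This is only a sketch: `parse_with_error` is a hypothetical helper name, the sample values are made up, and whether it is actually faster on your data would need measuring.

```python
from datetime import date
import pandas as pd

def parse_with_error(argdate: int):
    # Hypothetical helper: returns (date, None) on success, (None, message) on failure
    try:
        year = argdate // 10000
        month = (argdate % 10000) // 100
        day = argdate % 100
        return date(year, month, day), None
    except ValueError as e:
        return None, str(e)

# Hypothetical sample data: one valid value, one bad month, one bad day
df = pd.DataFrame({'sec_num': [19850312, 19851345, 20000230]})

# Parse every value once, then split the pairs into two columns
results = [parse_with_error(int(v)) for v in df['sec_num']]
df['date_of_birth'] = [r[0] for r in results]
df['error_msg'] = [r[1] for r in results]
df['is_in_error'] = df['error_msg'].notnull()
```

Each row gets its own message because the columns are assigned from lists of the same length as the dataframe, not from a single scalar.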
You are in the realm of handling large data. Throwing exceptions out of a loop is often not the best idea there, because an uncaught exception aborts the loop. Like many others, you do not seem to want that.
To achieve that, a typical approach is to use a function which does not throw the exception but returns it instead.
from datetime import date

def int2date(argdate: int):
    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError:
        # Return the exception as a value instead of raising it
        return ValueError("Value:{0} not a legal date.".format(argdate))
With this you can simply map a list of values to the function and receive the exceptions as values in the result list. They lack a traceback, of course, but in such a case this should not be a problem.
You can then walk over the list, replace the exceptions you find with None values, and fill other columns with the message contained in each exception.
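The map-then-walk idea could look roughly like this. It is a sketch using plain lists: the input `values` are made-up samples, and the resulting `dates` and `error_msgs` lists (names chosen here for illustration) could then be assigned to dataframe columns.

```python
from datetime import date

def int2date(argdate: int):
    try:
        year = argdate // 10000
        month = (argdate % 10000) // 100
        day = argdate % 100
        return date(year, month, day)
    except ValueError:
        # Return the exception as a value instead of raising it
        return ValueError("Value:{0} not a legal date.".format(argdate))

# Hypothetical sample input: one valid value and one with month 13
values = [19850312, 19851345]
results = list(map(int2date, values))

# Walk over the results: exceptions become None in one list
# and their message text in the other
dates = [None if isinstance(r, Exception) else r for r in results]
error_msgs = [str(r) if isinstance(r, Exception) else None for r in results]
```

Checking with isinstance keeps valid dates and errors apart without a second parsing pass.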