Infer which columns are datetime
The main problem I see is when parsing numerical values.
I'd propose converting them to strings first
Setup
dat = {
'index': [0, 1, 2, 3, 4, 352867, 352868, 352869],
'columns': ['Mixed', 'Numeric Values', 'Strings'],
'data': [
['2017-07-06 00:00:00', 1, 'HI'],
['2018-02-27 21:30:05', 1, 'HI'],
['2017-04-12 00:00:00', 1, 'HI'],
['2017-05-21 22:05:00', 1, 'HI'],
['2018-01-22 00:00:00', 1, 'HI'],
['2019-10-04 00:00:00', 1, 'HI'],
['None', 1, 'HI'],
['some_string', 1, 'HI']
]
}
df = pd.DataFrame(**dat)
df
Mixed Numeric Values Strings
0 2017-07-06 00:00:00 1 HI
1 2018-02-27 21:30:05 1 HI
2 2017-04-12 00:00:00 1 HI
3 2017-05-21 22:05:00 1 HI
4 2018-01-22 00:00:00 1 HI
352867 2019-10-04 00:00:00 1 HI
352868 None 1 HI
352869 some_string 1 HI
Solution
df.astype(str).apply(pd.to_datetime, errors='coerce')
Mixed Numeric Values Strings
0 2017-07-06 00:00:00 NaT NaT
1 2018-02-27 21:30:05 NaT NaT
2 2017-04-12 00:00:00 NaT NaT
3 2017-05-21 22:05:00 NaT NaT
4 2018-01-22 00:00:00 NaT NaT
352867 2019-10-04 00:00:00 NaT NaT
352868 NaT NaT NaT
352869 NaT NaT NaT
This function will set the data type of a column to datetime, if any value in the column matches the regex pattern(\d{4}-\d{2}-\d{2})+ (e.g. 2019-01-01). Credit to this answer on how to Search for String in all Pandas DataFrame columns and filter that helped with setting and applying the mask.
def presume_date(dataframe):
""" Set datetime by presuming any date values in the column
indicates that the column data type should be datetime.
Args:
dataframe: Pandas dataframe.
Returns:
Pandas dataframe.
Raises:
None
"""
df = dataframe.copy()
mask = dataframe.astype(str).apply(lambda x: x.str.match(
r'(\d{4}-\d{2}-\d{2})+').any())
df_dates = df.loc[:, mask].apply(pd.to_datetime, errors='coerce')
for col in df_dates.columns:
df[col] = df_dates[col]
return df
Working from the suggestion to use dateutil
, this may help. It is still working on the presumption that if there are any date-like values in a column, that the column should be a datetime. I tried to consider different dataframe iterations methods that are faster. I think this answer on How to iterate over rows in a DataFrame in Pandas did a good job describing them.
Note that dateutil.parser
will use the current day or year for any strings like 'December' or 'November 2019' with no year or day values.
import pandas as pd
import datetime
from dateutil.parser import parse
df = pd.DataFrame(columns=['are_you_a_date','no_dates_here'])
df = df.append(pd.Series({'are_you_a_date':'December 2015','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'February 27 2018','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'May 2017 12','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'2017-05-21','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':None,'no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'some_string','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'Processed: 2019/01/25','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'December','no_dates_here':'just a string'}), ignore_index=True)
def parse_dates(x):
try:
return parse(x,fuzzy=True)
except ValueError:
return ''
except TypeError:
return ''
list_of_datetime_columns = []
for row in df:
if any([isinstance(parse_dates(row[0]),
datetime.datetime) for row in df[[row]].values]):
list_of_datetime_columns.append(row)
df_dates = df.loc[:, list_of_datetime_columns].apply(pd.to_datetime, errors='coerce')
for col in list_of_datetime_columns:
df[col] = df_dates[col]
In case you would also like to use the datatime values from dateutil.parser
, you can add this:
for col in list_of_datetime_columns:
df[col] = df[col].apply(lambda x: parse_dates(x))