How to automatically detect columns that contain datetime in a pandas dataframe
There is currently no built-in that converts object columns to datetime automatically. One simple way is a list comprehension with a regex pattern that matches the datetime strings, i.e.
If you have a df (based on @Alexander's df):
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'E'],
                   'col2': ['2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-03 14:13:00'],
                   'col3': [0, 1, 2, 3, 4],
                   'col4': ['2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-03 14:13:00']})
data = [pd.to_datetime(df[x])
        if df[x].astype(str).str.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}').all()
        else df[x]
        for x in df.columns]
df = pd.concat(data, axis=1, keys=[s.name for s in data])
Or with the help of a mask, i.e.
mask = df.astype(str).apply(lambda x: x.str.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}').all())
df.loc[:, mask] = df.loc[:, mask].apply(pd.to_datetime)
df.dtypes
Output:
col1            object
col2    datetime64[ns]
col3             int64
col4    datetime64[ns]
dtype: object
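If you need this in more than one place, the mask approach folds naturally into a small helper. A minimal sketch; convert_matching_columns and its pattern argument are illustrative names, not a pandas API:

import pandas as pd

def convert_matching_columns(df, pattern=r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'):
    # Return a copy with every column whose values all match `pattern`
    # converted to datetime64.
    out = df.copy()
    mask = out.astype(str).apply(lambda s: s.str.match(pattern).all())
    for col in mask.index[mask]:
        # Plain column assignment (rather than .loc) so the dtype reliably
        # changes to datetime64 on recent pandas versions.
        out[col] = pd.to_datetime(out[col])
    return out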
If you have mixed date formats, you can loosen the pattern to r'(\d{2,4}-\d{2}-\d{2,4})+'. E.g.:
ndf = pd.DataFrame({'col3': [0, 1, 2, 3, 4],
                    'col4': ['2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-03 14:13:00'],
                    'col5': ['2017-02-04',
                             '2017-02-04',
                             '17-02-2004 14:13:00',
                             '17-02-2014',
                             '2017-02-03']})
mask = ndf.astype(str).apply(lambda x: x.str.match(r'(\d{2,4}-\d{2}-\d{2,4})+').all())
ndf.loc[:, mask] = ndf.loc[:, mask].apply(pd.to_datetime)
Output:
   col3                col4                col5
0     0 2017-02-04 18:41:00 2017-02-04 00:00:00
1     1 2017-02-04 18:41:00 2017-02-04 00:00:00
2     2 2017-02-04 18:41:00 2004-02-17 14:13:00
3     3 2017-02-04 18:41:00 2014-02-17 00:00:00
4     4 2017-02-03 14:13:00 2017-02-03 00:00:00
Hope it helps.
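Note that with mixed formats pd.to_datetime parses element-wise and has to guess the day/month order: '17-02-2004 14:13:00' above only came out as 17 February because 17 cannot be a month. For genuinely ambiguous strings you may want to pass dayfirst explicitly, e.g.:

import pandas as pd

# '05-02-2004' is ambiguous: month-first wins by default
print(pd.to_datetime('05-02-2004'))                 # 2004-05-02
print(pd.to_datetime('05-02-2004', dayfirst=True))  # 2004-02-05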
You can identify which columns in your dataframe are of type object and then convert only those columns to datetime with errors='coerce', so that values which cannot be parsed become NaT instead of raising. Use combine_first to fall back to the original values wherever parsing failed, which leaves the columns that did convert as datetimes.
df = pd.DataFrame(
{'col1': ['A', 'B', 'C', 'D', 'E'],
'col2': ['2017-02-04 18:41:00',
'2017-02-04 18:41:00',
'2017-02-04 18:41:00',
'2017-02-04 18:41:00',
'2017-02-03 14:13:00'],
'col3': [0, 1, 2, 3, 4]})
object_cols = [col for col, col_type in df.dtypes.items() if col_type == 'object']
# Parse first, then fall back to the originals where parsing failed (NaT),
# so only the fully convertible columns end up as datetime64.
df[object_cols] = df[object_cols].apply(
    pd.to_datetime, errors='coerce').combine_first(df[object_cols])
>>> df
col1 col2 col3
0 A 2017-02-04 18:41:00 0
1 B 2017-02-04 18:41:00 1
2 C 2017-02-04 18:41:00 2
3 D 2017-02-04 18:41:00 3
4 E 2017-02-03 14:13:00 4
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 3 columns):
col1 5 non-null object
col2 5 non-null datetime64[ns]
col3 5 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 160.0+ bytes
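The old coerce=True keyword was replaced by errors='coerce' in later pandas releases, as used above. The same idea can also be packaged as a standalone helper; a sketch, assuming the source columns contain no missing values (convert_datetime_columns is an illustrative name, not a pandas function):

import pandas as pd

def convert_datetime_columns(df):
    # Convert every object column whose values all parse as datetimes.
    out = df.copy()
    for col in out.select_dtypes(include='object').columns:
        parsed = pd.to_datetime(out[col], errors='coerce')  # failures become NaT
        # Keep the parsed column only if nothing was lost in conversion
        # (assumes the original column had no missing values).
        if parsed.notna().all():
            out[col] = parsed
    return out

df = convert_datetime_columns(df)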