Why is pandas.to_datetime slow for non-standard time formats such as '2014/12/31'?
This is because pandas falls back to dateutil.parser.parse
for parsing the strings when the format is non-default or when no format
string is supplied (dateutil is much more flexible, but also slower).
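You can see the fallback directly (a minimal sketch, not pandas' exact internal call): dateutil parses each string one element at a time, which is where the per-element overhead comes from:

from dateutil import parser

# Roughly what pandas does per element when no format is given:
parser.parse('2014/12/31')   # -> datetime.datetime(2014, 12, 31, 0, 0)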
As you have shown above, you can improve the performance by supplying a format
string to to_datetime. Another option is to pass infer_datetime_format=True.
Apparently, infer_datetime_format
cannot infer the format when microseconds are present. With an example without those, you can see a large speed-up:
In [26]: import re
In [27]: import pandas as pd
In [28]: d = '2014-12-24 01:02:03'
In [29]: c = re.sub('-', '/', d)
In [30]: s_c = pd.Series([c]*10000)
In [31]: %timeit pd.to_datetime(s_c)
1 loops, best of 3: 1.14 s per loop
In [32]: %timeit pd.to_datetime(s_c, infer_datetime_format=True)
10 loops, best of 3: 105 ms per loop
In [33]: %timeit pd.to_datetime(s_c, format="%Y/%m/%d %H:%M:%S")
10 loops, best of 3: 99.5 ms per loop
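If you want to reproduce the microseconds caveat yourself, here is a sketch (the series s_us and its sample value are my own): append fractional seconds and compare the same two calls; with microseconds present, infer_datetime_format reportedly fails to infer a format, so little speed-up should be expected:

import pandas as pd

s_us = pd.Series(['2014/12/24 01:02:03.000001'] * 10000)
# %timeit pd.to_datetime(s_us, infer_datetime_format=True)    # falls back to element-wise parsing
# %timeit pd.to_datetime(s_us, format='%Y/%m/%d %H:%M:%S.%f')  # explicit format stays fast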
This question has already been sufficiently answered, but I wanted to add the results of some tests I ran while optimizing my own code.
I was getting this format from an API: "Wed Feb 08 17:58:56 +0000 2017".
Using the default pd.to_datetime(SERIES)
with an implicit conversion, it was taking over an hour to process roughly 20 million rows (depending on how much free memory I had).
With that in mind, I tested three different explicit conversions:
# explicit conversion of essential information only -- parse dt str: concat
def format_datetime_1(dt_series):
    def get_split_date(strdt):
        split_date = strdt.split()
        str_date = split_date[1] + ' ' + split_date[2] + ' ' + split_date[5] + ' ' + split_date[3]
        return str_date
    dt_series = pd.to_datetime(dt_series.apply(get_split_date), format='%b %d %Y %H:%M:%S')
    return dt_series
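For reference, here is what get_split_date does to one sample value, which explains the index choices and the format string:

# "Wed Feb 08 17:58:56 +0000 2017".split() gives
# ['Wed', 'Feb', '08', '17:58:56', '+0000', '2017']
# indices 1, 2, 5, 3 -> 'Feb 08 2017 17:58:56', matching '%b %d %Y %H:%M:%S'
# (the weekday and the '+0000' offset are dropped)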
# explicit conversion of what datetime considers "essential date representation" -- parse dt str: del then join
def format_datetime_2(dt_series):
    def get_split_date(strdt):
        split_date = strdt.split()
        del split_date[4]            # drop the '+0000' UTC offset
        str_date = ' '.join(split_date)
        return str_date
    dt_series = pd.to_datetime(dt_series.apply(get_split_date), format='%c')
    return dt_series
# explicit conversion of what datetime considers "essential date representation" -- parse dt str: concat
def format_datetime_3(dt_series):
    def get_split_date(strdt):
        split_date = strdt.split()
        str_date = split_date[0] + ' ' + split_date[1] + ' ' + split_date[2] + ' ' + split_date[3] + ' ' + split_date[5]
        return str_date
    dt_series = pd.to_datetime(dt_series.apply(get_split_date), format='%c')
    return dt_series
# implicit conversion
def format_datetime_baseline(dt_series):
    return pd.to_datetime(dt_series)
These were the results:
# sample of 250k rows
dt_series_sample = df['created_at'][:250000]
%timeit format_datetime_1(dt_series_sample) # best of 3: 1.56 s per loop
%timeit format_datetime_2(dt_series_sample) # best of 3: 2.09 s per loop
%timeit format_datetime_3(dt_series_sample) # best of 3: 1.72 s per loop
%timeit format_datetime_baseline(dt_series_sample) # best of 3: 1min 9s per loop
The first test results in an impressive 97.7% runtime reduction (1.56 s versus 1 min 9 s)!
Somewhat surprisingly, even %c, the locale's "appropriate date and time representation", takes longer than a fully explicit format string, probably because it is semi-implicit.
Conclusion: the more explicit you are, the faster it will run.
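As a footnote, the same token shuffling can be done with pandas' vectorized string methods instead of a Python-level apply. This is my own variant, not one of the functions timed above, but it follows the same "be explicit" idea and assumes the same "Wed Feb 08 17:58:56 +0000 2017" input:

import pandas as pd

# Hypothetical vectorized variant of format_datetime_1 (not benchmarked here):
def format_datetime_vectorized(dt_series):
    parts = dt_series.str.split()            # one token list per row
    rebuilt = (parts.str[1] + ' ' + parts.str[2] + ' '
               + parts.str[5] + ' ' + parts.str[3])
    return pd.to_datetime(rebuilt, format='%b %d %Y %H:%M:%S')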