Pandas: slow date conversion
Streamlined date parsing with caching
Reading all the data and then converting it is generally slower than converting while reading the CSV, because parsing during the read avoids a second pass over the data. You also don't have to hold the dates in memory as strings.
We can define our own date parser that utilizes a cache for the dates it has already seen.
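import pandas as pd

cache = {}

def cached_date_parser(s):
    # Return the stored result if this string has been parsed before
    if s in cache:
        return cache[s]
    # errors='coerce' (formerly coerce=True) yields NaT for unparseable input
    dt = pd.to_datetime(s, format='%Y%m%d', errors='coerce')
    cache[s] = dt
    return dt
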
df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=cached_date_parser)
This has the same advantage as @fixxxer's answer, parsing each distinct string only once, with the added bonus of not having to read all the data and THEN parse it, which saves both memory and processing time.
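To see the caching effect in isolation, here is a small self-contained sketch. It is a variant under stated assumptions, not the exact code above: it swaps the hand-written dict for `functools.lru_cache`, passes the parser through `converters` rather than `date_parser` (which, as far as I know, pandas 2.0 and later deprecate), and reads made-up inline data through `io.StringIO`.

```python
import io
from functools import lru_cache

import pandas as pd


@lru_cache(maxsize=None)
def cached_date_parser(s):
    # Each distinct string is parsed once; repeats are served from the cache.
    # errors="coerce" turns unparseable strings into NaT instead of raising.
    return pd.to_datetime(s, format="%Y%m%d", errors="coerce")


# Hypothetical sample data: three rows, but only two distinct dates
csv_text = "20130711,1.5\n20130711,2.5\n20130712,3.5\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=None,
    header=None,
    converters={0: cached_date_parser},  # applied to column 0 while reading
)

# misses = number of strings actually parsed, hits = lookups served from cache
info = cached_date_parser.cache_info()
print(df)
print(info)
```

Comparing `info.misses` with the row count shows how much parsing the cache avoided; on real data with many repeated dates the gap is much larger.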
Great suggestion @EdChum! As @EdChum suggests, using infer_datetime_format=True can be significantly faster. Below is my example.
I have a file of temperature data from a sensor log, which looks like this:
RecNum,Date,LocationID,Unused
1,11/7/2013 20:53:01,13.60,"117","1",
2,11/7/2013 21:08:01,13.60,"117","1",
3,11/7/2013 21:23:01,13.60,"117","1",
4,11/7/2013 21:38:01,13.60,"117","1",
...
My code reads the csv and parses the date (parse_dates=['Date']).
With infer_datetime_format=False, it takes 8min 8sec:
Tue Jan 24 12:18:27 2017 - Loading the Temperature data file.
Tue Jan 24 12:18:27 2017 - Temperature file is 88.172 MB.
Tue Jan 24 12:18:27 2017 - Loading into memory. Please be patient.
Tue Jan 24 12:26:35 2017 - Success: loaded 2,169,903 records.
With infer_datetime_format=True, it takes 13sec:
Tue Jan 24 13:19:58 2017 - Loading the Temperature data file.
Tue Jan 24 13:19:58 2017 - Temperature file is 88.172 MB.
Tue Jan 24 13:19:58 2017 - Loading into memory. Please be patient.
Tue Jan 24 13:20:11 2017 - Success: loaded 2,169,903 records.
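If the format is known up front, it can also be passed explicitly instead of being inferred, which should give a similar fast path. Below is a minimal sketch, assuming the timestamps in the sensor log are month-first (`%m/%d/%Y %H:%M:%S`); the strings are copied from the excerpt above and the variable names are only illustrative. As far as I know, pandas 2.0 and later deprecate `infer_datetime_format`, so an explicit format may be the more durable option there.

```python
import pandas as pd

# Timestamps taken from the sample log; one value is repeated on purpose
raw = pd.Series([
    "11/7/2013 20:53:01",
    "11/7/2013 21:08:01",
    "11/7/2013 21:23:01",
    "11/7/2013 20:53:01",
])

# Assumption: month comes first. Use "%d/%m/%Y %H:%M:%S" for day-first data.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S", cache=True)

print(parsed)
```

With an explicit format there is no guessing per value, and a mismatch raises an error rather than silently swapping day and month.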
Note: As @ritchie46's answer states, this solution may be redundant since pandas version 0.25, which added the argument cache_dates that defaults to True.
Try using this function for parsing dates:
import pandas as pd

def lookup(date_pd_series, format=None):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    # Parse each distinct value once, then map every row to its parsed result
    dates = {date: pd.to_datetime(date, format=format) for date in date_pd_series.unique()}
    return date_pd_series.map(dates)
Use it like:
df['date-column'] = lookup(df['date-column'], format='%Y%m%d')
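A quick sanity check, as a sketch only: the sample strings below are made up, and the comparison against plain `pd.to_datetime` is just there to confirm that both routes produce the same values.

```python
import pandas as pd


def lookup(date_pd_series, format=None):
    # Parse each distinct value once, then map every row to its parsed result
    dates = {date: pd.to_datetime(date, format=format)
             for date in date_pd_series.unique()}
    return date_pd_series.map(dates)


# Hypothetical column: 3,000 rows but only two distinct dates
s = pd.Series(["20130711", "20130712", "20130711"] * 1000)

fast = lookup(s, format="%Y%m%d")
direct = pd.to_datetime(s, format="%Y%m%d")

# Element-wise comparison of the two results
same = list(fast) == list(direct)
print(same, s.nunique(), len(s))
```

The gain depends on how repetitive the column is: with mostly unique timestamps the dictionary is nearly as large as the data and the lookup saves little.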
Benchmarks:
$ python date-parse.py
to_datetime: 5799 ms
dateutil: 5162 ms
strptime: 1651 ms
manual: 242 ms
lookup: 32 ms
Source: https://github.com/sanand0/benchmarks/tree/master/date-parse