Get the format in dateutil.parse
I don't know of a way to get the parsed format back from dateutil (or any other Python timestamp parser that I know of).
Implementing your own timestamp parsing function that returns a list of possible formats and the corresponding datetime objects is fairly trivial using datetime.strptime(), but doing it efficiently against a broadly useful list of possible timestamp formats is not.
The following example uses a list of just over 100 formats. It does not even scratch the surface of the wide variety of formats parsed by dateutil. It tests each format in sequence until it exhausts all formats in the list (likely much less efficient than the dateutil approach of locating the various datetime parts independently, as noted in the answer from @alecxe).
In addition, I have included some example timestamp formats that include time zone names (instead of offsets). If you run the example function below against those particular datetime strings, you may find that it does not return the expected matches even though I have included matching formats using the %Z directive. Some explanation of the challenges with using %Z to handle time zone names can be found in issue 22377 at bugs.python.org (just to highlight another non-trivial aspect of implementing your own datetime parsing function).
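To illustrate the %Z limitation, here is a minimal sketch. It assumes CPython's documented behavior that strptime() accepts only UTC, GMT, and the names in time.tzname for %Z. The helper name try_parse and the made-up zone name NOTAZONE are only for illustration.

```python
from datetime import datetime

FMT = '%d %B %Y %H:%M:%S %Z'


def try_parse(datestring, fmt):
    """Return the parsed datetime, or None if strptime rejects the string."""
    try:
        return datetime.strptime(datestring, fmt)
    except ValueError:
        return None


# 'UTC' is one of the names %Z always accepts, but the result is still naive
utc_result = try_parse('22 March 1999 05:06:07 UTC', FMT)
print(utc_result, utc_result.tzinfo)

# A name that is not UTC, GMT, or a local zone name is rejected outright
print(try_parse('22 March 1999 05:06:07 NOTAZONE', FMT))

# 'CET' only parses on machines whose local zone names include CET
print(try_parse('22 March 1999 05:06:07 CET', FMT))
```

So even when a %Z format does match, you get a naive datetime back, and whether a name like CET matches at all depends on the machine running the code.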
With all of those caveats, if you are dealing with a manageable set of potential formats, implementing something simple like the below may get you what you need.
Example function that attempts to match a datetime string against a list of formats and return a dict that includes the original datestring and a list of matches, each a dict that includes a datetime object along with the matched format:
from datetime import datetime


def parse_timestamp(datestring, formats):
    results = {'datestring': datestring, 'matches': []}
    for f in formats:
        try:
            d = datetime.strptime(datestring, f)
        except ValueError:
            # strptime raises ValueError when the string does not match the format
            continue
        results['matches'].append({'datetime': d, 'format': f})
    return results
Example formats and datetime strings:
formats = [
    '%A, %B %d, %Y', '%A, %B %d, %Y %I:%M:%S %p %Z', '%A, %d %B %Y',
    '%B %d %Y', '%B %d, %Y', '%H:%M:%S', '%H:%M:%S,%f', '%H:%M:%S.%f',
    '%Y %b %d %H:%M:%S.%f', '%Y %b %d %H:%M:%S.%f %Z', '%Y %b %d %H:%M:%S.%f*%Z',
    '%Y%m%d %H:%M:%S.%f', '%Y-%m-%d %H:%M:%S %z', '%Y-%m-%d %H:%M:%S%z',
    '%Y-%m-%d %H:%M:%S,%f', '%Y-%m-%d %H:%M:%S,%f%z', '%Y-%m-%d %H:%M:%S.%f',
    '%Y-%m-%d %H:%M:%S.%f%z', '%Y-%m-%d %I:%M %p', '%Y-%m-%d %I:%M:%S %p',
    '%Y-%m-%d*%H:%M:%S', '%Y-%m-%d*%H:%M:%S:%f', '%Y-%m-%dT%H:%M:%S',
    '%Y-%m-%dT%H:%M:%S%Z', '%Y-%m-%dT%H:%M:%S%z', '%Y-%m-%dT%H:%M:%S*%f%z',
    '%Y-%m-%dT%H:%M:%S.%f', '%Y-%m-%dT%H:%M:%S.%f%z', '%Y/%m/%d',
    '%Y/%m/%d*%H:%M:%S', '%a %b %d %H:%M:%S %Z %Y', '%a, %d %b %Y %H:%M:%S %z',
    '%b %d %H:%M:%S', '%b %d %H:%M:%S %Y', '%b %d %H:%M:%S %z',
    '%b %d %H:%M:%S %z %Y', '%b %d %Y', '%b %d %Y %H:%M:%S', '%b %d, %Y',
    '%b %d, %Y %I:%M:%S %p', '%b.%d.%Y', '%d %B %Y', '%d %B %Y %H:%M:%S %Z',
    '%d %b %Y %H:%M:%S', '%d %b %Y %H:%M:%S %z', '%d %b %Y %H:%M:%S*%f',
    '%d%m_%H:%M:%S', '%d%m_%H:%M:%S.%f', '%d-%b-%Y', '%d-%b-%Y %H:%M:%S',
    '%d-%b-%Y %H:%M:%S.%f', '%d-%b-%Y %I:%M:%S %p', '%d-%m-%Y',
    '%d-%m-%Y %I:%M %p', '%d-%m-%Y %I:%M:%S %p', '%d-%m-%y',
    '%d-%m-%y %I:%M %p', '%d-%m-%y %I:%M:%S %p', '%d/%b %H:%M:%S,%f',
    '%d/%b/%Y %H:%M:%S', '%d/%b/%Y %I:%M %p', '%d/%b/%Y:%H:%M:%S',
    '%d/%b/%Y:%H:%M:%S %z', '%d/%m/%Y', '%d/%m/%Y %H:%M:%S %z',
    '%d/%m/%Y %I:%M %p', '%d/%m/%Y %I:%M:%S %p', '%d/%m/%Y %I:%M:%S %p:%f',
    '%d/%m/%Y*%H:%M:%S', '%d/%m/%Y*%H:%M:%S*%f', '%d/%m/%y',
    '%d/%m/%y %H:%M:%S', '%d/%m/%y %H:%M:%S %z', '%d/%m/%y %I:%M %p',
    '%d/%m/%y %I:%M:%S %p', '%d/%m/%y*%H:%M:%S', '%m%d_%H:%M:%S',
    '%m%d_%H:%M:%S.%f', '%m-%d-%Y', '%m-%d-%Y %I:%M %p', '%m-%d-%Y %I:%M:%S %p',
    '%m-%d-%y', '%m-%d-%y %I:%M %p', '%m-%d-%y %I:%M:%S %p', '%m/%d/%Y',
    '%m/%d/%Y %H:%M:%S %z', '%m/%d/%Y %I:%M %p', '%m/%d/%Y %I:%M:%S %p',
    '%m/%d/%Y %I:%M:%S %p:%f', '%m/%d/%Y*%H:%M:%S', '%m/%d/%Y*%H:%M:%S*%f',
    '%m/%d/%y', '%m/%d/%y %H:%M:%S', '%m/%d/%y %H:%M:%S %z',
    '%m/%d/%y %I:%M %p', '%m/%d/%y %I:%M:%S %p', '%m/%d/%y*%H:%M:%S',
    '%y%m%d %H:%M:%S', '%y-%m-%d %H:%M:%S', '%y-%m-%d %H:%M:%S,%f',
    '%y-%m-%d %H:%M:%S,%f %z', '%y/%m/%d %H:%M:%S',
]
datestrings = ['03-11-1999', '03-12-1999 5:06 AM', '03-12-1999 5:06:07 AM', '03-12-99 5:06 AM', '03-12-99 5:06:07 AM', '03/12/1999', '03/12/1999 5:06 AM', '03/12/1999 5:06:07 AM', '03/12/99 5:06 AM', '03/12/99 5:06:07', '03/12/99 5:06:07 AM', '04/23/17 04:34:22 +0000', '0423_11:42:35', '0423_11:42:35.883', '05/09/2017*08:22:14*612', '06/01/22 04:11:05', '08/10/11*13:33:56', '10-04-19 12:00:17', '10-06-26 02:31:29,573', '10/03/2017 07:29:46 -0700', '11-02-11 16:47:35,985 +0000', '11/22/2017*05:13:11', '11:42:35', '11:42:35,173', '11:42:35.173', '12/03/1999', '12/03/1999 5:06 AM', '12/03/99 5:06 AM', '12/3/1999', '12/3/1999 5:06 AM', '12/3/1999 5:06:07 AM', '150423 11:42:35', '19/Apr/2017:06:36:15 -0700', '1999-03-12 05:06:07.0', '1999-03-12 5:06 AM', '1999-03-12 5:06:07 AM', '1999-03-12+01:00', '1999-3-12 5:06 AM', '1999-3-12 5:06:07 AM', '1999/3/12', '20150423 11:42:35.173', '2017 Mar 03 05:12:41.211 PDT', '2017 Mar 10 01:44:20.392', '2017-02-11T18:31:44', '2017-03-10 14:30:12,655+0000', '2017-03-12 13:11:34.222-0700', '2017-03-12T17:56:22-0700', '2017-06-26 02:31:29,573', '2017-07-01T14:59:55.711+0000', '2017-07-04*13:23:55', '2017-07-22T16:28:55.444', '2017-08-19 12:17:55 -0400', '2017-08-19 12:17:55-0400', '2017-09-08T03:13:10', '2017-10-14T22:11:20+0000', '2017-10-30*02:47:33:899', '2017-11-22T10:10:15.455', '2017/04/12*19:37:50', '2018 Apr 13 22:08:13.211*PDT', '2018-02-27 15:35:20.311', '2018-08-20T13:20:10*633+0000', '22 Mar 1999 05:06:07 +0100', '22 March 1999', '22 March 1999 05:06:07 CET', '22-Mar-1999', '22-Mar-1999 05:06:07', '22-Mar-1999 5:06:07 AM', '22/03/1999 5:06:07 AM', '22/Mar/1999 5:06:07 +0100', '22/Mar/99 5:06 AM', '23 Apr 2017 10:32:35*311', '23 Apr 2017 11:42:35', '23-Apr-2017 11:42:35', '23-Apr-2017 11:42:35.883', '23/Apr 11:42:35,173', '23/Apr/2017 11:42:35', '23/Apr/2017:11:42:35', '3-11-1999', '3-12-1999 5:06 AM', '3-12-99 5:06 AM', '3-12-99 5:06:07 AM', '3-22-1999 5:06:07 AM', '3/12/1999', '3/12/1999 5:06 AM', '3/12/1999 5:06:07 AM', 
'3/12/99 5:06 AM', '3/12/99 5:06:07', '8/5/2011 3:31:18 AM:234', '9/28/2011 2:23:15 PM', 'Apr 20 00:00:35 2010', 'Dec 2, 2017 2:39:58 AM', 'Jan 21 18:20:11 +0000 2017', 'Jun 09 2018 15:28:14', 'Mar 16 08:12:04', 'Mar 22 1999', 'Mar 22, 1999', 'Mar 22, 1999 5:06:07 AM', 'Mar.22.1999', 'March 22 1999', 'March 22, 1999', 'Mon Mar 22 05:06:07 CET 1999', 'Mon, 22 Mar 1999 05:06:07 +0100', 'Monday, 22 March 1999', 'Monday, March 22, 1999', 'Monday, March 22, 1999 5:06:07 AM CET', 'Sep 28 19:00:00 +0000']
Example usage:
print(parse_timestamp('2018-08-20T13:20:10*633+0000', formats))
# OUTPUT
# {'datestring': '2018-08-20T13:20:10*633+0000', 'matches': [{'datetime': datetime.datetime(2018, 8, 20, 13, 20, 10, 633000, tzinfo=datetime.timezone.utc), 'format': '%Y-%m-%dT%H:%M:%S*%f%z'}]}
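One thing to watch for with this approach is ambiguity: a single string can match more than one format and yield different datetimes. A minimal sketch, repeating parse_timestamp (catching only ValueError) so the snippet runs on its own; the two-format list is only for illustration:

```python
from datetime import datetime


def parse_timestamp(datestring, formats):
    results = {'datestring': datestring, 'matches': []}
    for f in formats:
        try:
            d = datetime.strptime(datestring, f)
        except ValueError:
            continue
        results['matches'].append({'datetime': d, 'format': f})
    return results


# Day-first and month-first formats both accept this string
result = parse_timestamp('03/12/1999', ['%d/%m/%Y', '%m/%d/%Y'])
for match in result['matches']:
    print(match['format'], '->', match['datetime'].date())

# %d/%m/%Y -> 1999-12-03
# %m/%d/%Y -> 1999-03-12
```

Returning every match rather than only the first one leaves it to the caller to decide which interpretation is right.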
Is there a way to get the "format" after parsing a date in dateutil?
Not possible with dateutil. The problem is that dateutil never has the format as an intermediate result at any point during parsing, because it detects the components of the datetime separately. Take a look at this not quite easy to read source code.
My idea was to:
- Create an object that has a list of candidate specifiers you think might be in the date pattern (the more you add, the more possibilities you will get out the other end)
- Parse the date string
- Create a list of possible specifiers for each element in the string, based on the date and the list of candidates you supplied.
- Recombine them to produce a list of 'possibles'.
If you get only a single candidate, you can be pretty sure it is the right format. But you will often get many possibilities (especially with dates, months, minutes and hours all in the 0-10 range).
Example class:
import re
from itertools import product
from dateutil.parser import parse
from collections import defaultdict, Counter

COMMON_SPECIFIERS = [
    '%a', '%A', '%d', '%b', '%B', '%m',
    '%Y', '%H', '%p', '%M', '%S', '%Z',
]


class FormatFinder:
    def __init__(self,
                 valid_specifiers=COMMON_SPECIFIERS,
                 date_element=r'([\w]+)',
                 delimiter_element=r'([\W]+)',
                 ignore_case=False):
        self.specifiers = valid_specifiers
        joined = (r'' + date_element + r"|" + delimiter_element)
        self.pattern = re.compile(joined)
        self.ignore_case = ignore_case

    def find_candidate_patterns(self, date_string):
        date = parse(date_string)
        tokens = self.pattern.findall(date_string)

        # map each rendered value (e.g. '01') to the specifiers that produce it
        candidate_specifiers = defaultdict(list)

        for specifier in self.specifiers:
            token = date.strftime(specifier)
            candidate_specifiers[token].append(specifier)
            if self.ignore_case:
                candidate_specifiers[token.upper()] = candidate_specifiers[token]
                candidate_specifiers[token.lower()] = candidate_specifiers[token]

        options_for_each_element = []
        for (token, delimiter) in tokens:
            if token:
                if token not in candidate_specifiers:
                    options_for_each_element.append([token])  # just use this verbatim?
                else:
                    options_for_each_element.append(candidate_specifiers[token])
            else:
                options_for_each_element.append([delimiter])

        for parts in product(*options_for_each_element):
            counts = Counter(parts)
            max_count = max(counts[specifier] for specifier in self.specifiers)
            if max_count > 1:
                # this is a candidate with the same item used more than once
                continue
            yield "".join(parts)
And some sample tests:
def test_it_returns_value_from_question_1():
    s = "2014-01-01 00:12:12"
    sut = FormatFinder()
    candidates = sut.find_candidate_patterns(s)
    assert "%Y-%m-%d %H:%M:%S" in candidates


def test_it_returns_value_from_question_2():
    s = 'Jan. 04, 2017'
    sut = FormatFinder()
    candidates = sut.find_candidate_patterns(s)
    candidates = list(candidates)
    assert "%b. %d, %Y" in candidates
    assert len(candidates) == 1


def test_it_can_ignore_case():
    # NB: apparently the 'AM/PM' is meant to be capitalised in my locale!
    # News to me!
    s = "JANUARY 12, 2018 02:12 am"
    sut = FormatFinder(ignore_case=True)
    candidates = sut.find_candidate_patterns(s)
    assert "%B %d, %Y %H:%M %p" in candidates


def test_it_returns_parts_that_have_no_date_component_verbatim():
    # In this string, the 'at' is considered as a 'date' element,
    # but there is no specifier that produces a candidate for it
    s = "January 12, 2018 at 02:12 AM"
    sut = FormatFinder()
    candidates = sut.find_candidate_patterns(s)
    assert "%B %d, %Y at %H:%M %p" in candidates
To make it a bit clearer, here are some examples of using this code in an IPython shell:
In [2]: ff = FormatFinder()
In [3]: list(ff.find_candidate_patterns("2014-01-01 00:12:12"))
Out[3]:
['%Y-%d-%m %H:%M:%S',
'%Y-%d-%m %H:%S:%M',
'%Y-%m-%d %H:%M:%S',
'%Y-%m-%d %H:%S:%M']
In [4]: list(ff.find_candidate_patterns("Jan. 04, 2017"))
Out[4]: ['%b. %d, %Y']
In [5]: list(ff.find_candidate_patterns("January 12, 2018 at 02:12 AM"))
Out[5]: ['%B %d, %Y at %H:%M %p', '%B %M, %Y at %H:%d %p']
In [6]: ff_without_case = FormatFinder(ignore_case=True)
In [7]: list(ff_without_case.find_candidate_patterns("JANUARY 12, 2018 02:12 am"))
Out[7]: ['%B %d, %Y %H:%M %p', '%B %M, %Y %H:%d %p']
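If you have several strings that you know came from the same source, one way to narrow the list is to keep only the candidates that datetime.strptime() accepts for every sample. This is a sketch rather than part of FormatFinder: the helper name filter_candidates and the second sample string are made up, and the candidate list is copied from Out[3] above.

```python
from datetime import datetime


def filter_candidates(candidates, samples):
    """Keep the formats that strptime() accepts for every sample string."""
    surviving = []
    for fmt in candidates:
        try:
            for sample in samples:
                datetime.strptime(sample, fmt)
        except ValueError:
            # at least one sample does not fit this format
            continue
        surviving.append(fmt)
    return surviving


candidates = [
    '%Y-%d-%m %H:%M:%S',
    '%Y-%d-%m %H:%S:%M',
    '%Y-%m-%d %H:%M:%S',
    '%Y-%m-%d %H:%S:%M',
]
samples = ['2014-01-01 00:12:12', '2014-01-31 13:59:07']

print(filter_candidates(candidates, samples))
# ['%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%S:%M']
```

The day-first candidates drop out because 31 is not a valid month, but the minute/second swap survives: both fields accept 0-59, so no sample can rule it out this way.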