Simply using parsec in python
I encourage you to define your own parser using those combinators, rather than constructing the Parser directly.
If you want to construct a Parser by wrapping a function, as the documentation states, fn should accept two arguments: the first is the text and the second is the current position. fn should return a Value built with Value.success or Value.failure, rather than a boolean. You can grep for @Parser in parsec/__init__.py in this package to find more examples of how it works.
For your case in the description, you could define the parser as follows:
import re
from parsec import *

# Note: this `spaces` shadows parsec's own spaces() helper.
spaces = regex(r'\s*', re.MULTILINE)
name = regex(r'[_a-zA-Z][_a-zA-Z0-9]*')

# An opening tag such as <kv> and a closing tag such as </kv>; both return the tag name.
tag_start = spaces >> string('<') >> name << string('>') << spaces
tag_stop = spaces >> string('</') >> name << string('>') << spaces

@generate
def header_kv():
    # One "key: value" line.
    key = yield spaces >> name << spaces
    yield string(':')
    value = yield spaces >> regex('[^\n]+')
    return {key: value}

@generate
def header():
    tag_name = yield tag_start
    values = yield sepBy(header_kv, string('\n'))
    tag_name_end = yield tag_stop
    assert tag_name == tag_name_end
    return {
        'type': 'tag',
        'name': tag_name,
        'values': values
    }

@generate
def body():
    tag_name = yield tag_start
    # Each line is a comma-separated list of cells.
    values = yield sepBy(sepBy1(regex(r'[^\n<,]+'), string(',')), string('\n'))
    tag_name_end = yield tag_stop
    assert tag_name == tag_name_end
    return {
        'type': 'tag',
        'name': tag_name,
        'values': values
    }

parser = header + body
If you run parser.parse(mystr), it yields
({'type': 'tag',
  'name': 'kv',
  'values': [{'key1': '"string"'},
             {'key2': '1.00005'},
             {'key3': '[1,2,3]'}]},
 {'type': 'tag',
  'name': 'csv',
  'values': [['date', 'windspeed', 'direction'],
             ['20190805', '22', 'NNW'],
             ['20190805', '23', 'NW'],
             ['20190805', '20', 'NE']]}
)
You can refine the definition of values in the above code to get the result in the exact form you want.
According to the tests, the proper way to parse your string would be the following:
from parsec import *
possible_chars = letter() | space() | one_of('/.,:"[]') | digit()
parser = many(many(possible_chars) + string("<") >> mark(many(possible_chars)) << string(">"))
parser.parse(mystr)
# [((1, 1), ['k', 'v'], (1, 3)), ((5, 1), ['/', 'k', 'v'], (5, 4)), ((6, 1), ['c', 's', 'v'], (6, 4)), ((11, 1), ['/', 'c', 's', 'v'], (11, 5))]
The construction of the parser:
For the sake of convenience, we first define the characters we wish to match. parsec provides many types:
- letter(): matches any alphabetic character,
- string(str): matches the specified string str,
- space(): matches any whitespace character,
- spaces(): matches multiple whitespace characters,
- digit(): matches any digit,
- eof(): matches the EOF flag of a string,
- regex(pattern): matches a provided regex pattern,
- one_of(str): matches any character from the provided string,
- none_of(str): matches any character that is not in the provided string.
We can combine them with operators, according to the docs:
- |: This combinator implements choice. The parser p | q first applies p. If it succeeds, the value of p is returned. If p fails without consuming any input, parser q is tried. NOTICE: without backtrack,
- +: Joins two or more parsers into one. Returns the aggregate of the results of the two parsers,
- ^: Choice with backtrack. This combinator is used whenever arbitrary lookahead is needed. The parser p ^ q first applies p; if it succeeds, the value of p is returned. If p fails, it pretends that it hasn't consumed any input, and then parser q is tried,
- <<: Ends with a specified parser, and at the end the parser has consumed the end flag,
- <: Ends with a specified parser, and at the end the parser hasn't consumed any input,
- >>: Sequentially composes two actions, discarding any value produced by the first,
- mark(p): Marks the line and column information of the result of the parser p.
Then there are multiple "combinators":
- times(p, mint, maxt=None): repeats parser p from mint to maxt times,
- count(p, n): repeats parser p n times. If n is smaller than or equal to zero, the parser returns an empty list,
- optional(p, default_value=None): makes a parser optional. On success, returns the result; otherwise returns default_value silently, without raising any exception. If default_value is not provided, None is returned instead,
- many(p): repeats parser p zero or more times,
- many1(p): repeats parser p at least once,
- separated(p, sep, mint, maxt=None, end=None): repeats parser p, separated by sep, from mint to maxt times,
- sepBy(p, sep): parses zero or more occurrences of parser p, separated by delimiter sep,
- sepBy1(p, sep): parses at least one occurrence of parser p, separated by delimiter sep,
- endBy(p, sep): parses zero or more occurrences of p, separated and ended by sep,
- endBy1(p, sep): parses at least one occurrence of p, separated and ended by sep,
- sepEndBy(p, sep): parses zero or more occurrences of p, separated and optionally ended by sep,
- sepEndBy1(p, sep): parses at least one occurrence of p, separated and optionally ended by sep.
Using all of that, we have a parser which matches many occurrences of many possible_chars, followed by a <, then we mark the many occurrences of possible_chars up until >.