convert substrings to dict
If the keys don't contain spaces or colons, you could:
- split according to alpha followed by colon to get the tokens
- zip half shifted slices in a dict comprehension to rebuild the dict
like this:
import re,itertools
s = 'k1:some text k2:more text k3:and still more'
toks = [x for x in re.split(r"(\w+):", s) if x]  # raw string for the pattern; filter out empty tokens
# toks => ['k1', 'some text ', 'k2', 'more text ', 'k3', 'and still more']
d = {k:v for k,v in zip(itertools.islice(toks,None,None,2),itertools.islice(toks,1,None,2))}
print(d)
result:
{'k2': 'more text ', 'k1': 'some text ', 'k3': 'and still more'}
Using itertools.islice avoids creating the intermediate sub-lists that slicing with toks[::2] would.
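To make the comparison concrete, here is a minimal sketch that builds the dict both ways and checks that they agree. The names `d_islice` and `d_slices` are only illustrative and do not appear in the answer above.

```python
import itertools
import re

s = 'k1:some text k2:more text k3:and still more'
toks = [x for x in re.split(r"(\w+):", s) if x]

# Lazy version: islice iterates over toks in place, without copying it.
d_islice = dict(zip(itertools.islice(toks, 0, None, 2),
                    itertools.islice(toks, 1, None, 2)))

# Eager version: each slice builds a new list before zip sees it.
d_slices = dict(zip(toks[::2], toks[1::2]))

print(d_islice == d_slices)  # True
```

For a string this short the difference is negligible; it only starts to matter when the token list is large.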
Another regex approach: split the input string into key/value pairs first.
import re
s = 'k1:some text k2:more text k3:and still more'
pat = re.compile(r'\s+(?=\w+:)')
result = dict(i.split(':', 1) for i in pat.split(s))  # maxsplit=1 keeps any colon inside a value
print(result)
The output:
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}
- using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program
- \s+(?=\w+:) - the crucial pattern: it splits the input string on whitespace character(s) (\s+) only if they are followed by a "key" (a word, \w+, with a colon, :)
- (?=...) - stands for a positive lookahead assertion
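As a rough illustration of what the lookahead buys you, the sketch below prints the intermediate chunks before they are turned into a dict. The variable name `chunks` is an addition for this example.

```python
import re

s = 'k1:some text k2:more text k3:and still more'
pat = re.compile(r'\s+(?=\w+:)')

# The lookahead consumes nothing, so only the whitespace in front of
# each key is removed; the key itself stays in the next chunk.
chunks = pat.split(s)
print(chunks)  # ['k1:some text', 'k2:more text', 'k3:and still more']

# Each chunk is then split once on its first colon.
result = dict(chunk.split(':', 1) for chunk in chunks)
print(result)
```

Because the whitespace before each key is what gets consumed, the values come out without trailing spaces.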
If you have a list of your known keys (and maybe also values, but I don't address that in this answer), you can do it with a regex. There might be a shortcut if, say, you can simply assert that the last whitespace before a colon definitely signals the beginning of the key, but this should work as well:
import re
s = 'k1:some text k2:more text k3:and still more'
key_list = ['k1', 'k2', 'k3']
dict_splitter = re.compile(r'(?P<key>({keys})):(?P<val>.*?)(?=({keys})|$)'.format(keys=')|('.join(key_list)))
result = {match.group('key'): match.group('val') for match in dict_splitter.finditer(s)}
print(result)
>> {'k1': 'some text ', 'k2': 'more text ', 'k3': 'and still more'}
Explanation:
(?P<key>({keys})) # match all the defined keys, call that group 'key'
: # match a colon
(?P<val>.*?) # match anything that follows and call it 'val', but
# only as much as necessary..
(?=({keys})|$) # .. as long as whatever follows is either a new key or
# the end of the string
.format(keys=')|('.join(key_list))
# build a string out of the keys where all the keys are
# 'or-chained' after one another, format it into the
# regex wherever {keys} appears.
Caveat 1: If your keys can contain each other, order is important, and you might want to go from long keys to shorter ones in order to force longest matches first: key_list.sort(key=len, reverse=True)
Caveat 2: If your keys contain regex metacharacters, they will break the expression, so they might need to be escaped first: key_list = [re.escape(key) for key in key_list]
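Here is a small sketch that applies both caveats before building the pattern. The input string and the keys 'k', 'kk' and 'a.b' are made up for illustration: two keys overlap and one contains a metacharacter.

```python
import re

s = 'k:one kk:two a.b:three'
key_list = ['k', 'kk', 'a.b']  # hypothetical keys, not from the question

# Caveat 1: try longer keys before shorter ones.
key_list.sort(key=len, reverse=True)
# Caveat 2: escape regex metacharacters such as '.'.
key_list = [re.escape(key) for key in key_list]

dict_splitter = re.compile(
    r'(?P<key>({keys})):(?P<val>.*?)(?=({keys})|$)'.format(keys=')|('.join(key_list))
)
result = {m.group('key'): m.group('val') for m in dict_splitter.finditer(s)}
print(result)  # {'k': 'one ', 'kk': 'two ', 'a.b': 'three'}
```

As in the original, the values keep their trailing space; call .strip() on them if that matters.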
Option 1
If the keys don't have spaces or colons, you can simplify your solution with dict + re.findall (import re first):
>>> dict(re.findall(r'(\S+):(.*?)(?=\s\S+:|$)', s))
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}
Only the placement of the colon (:) determines how keys/values are matched.
Details
(\S+) # match the key (anything that is not a space)
: # colon (matched, but not captured)
(.*?) # non-greedy match - zero or more characters - this matches the value
(?= # use lookahead to determine when to stop matching the value
\s # space
\S+: # anything that is not a space followed by a colon
| # regex OR
$) # end of string
Note that this code assumes the structure as presented in the question. It will fail on strings with invalid structures.
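To give one example of what "invalid structure" can look like, the sketch below feeds the pattern a value that itself contains a word followed by a colon. The input string is invented for this illustration.

```python
import re

# Hypothetical input: the value of k1 contains 'http:', which looks like a key.
s = 'k1:see http://x k2:b'

result = dict(re.findall(r'(\S+):(.*?)(?=\s\S+:|$)', s))
print(result)  # {'k1': 'see', 'http': '//x', 'k2': 'b'}
```

No error is raised; the value is silently cut short and a spurious 'http' key appears, so validate the input first if it may contain such tokens.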
Option 2
Look ma, no regex...
This operates on the same assumption as the one above.
- Split on colon (:)
- All elements but the first and last will need to be split again, on space (to separate keys and values)
- zip adjacent elements, and convert to dictionary
v = s.split(':')
v[1:-1] = [j for i in v[1:-1] for j in i.rsplit(None, 1)]
dict(zip(v[::2], v[1::2]))
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}
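If the slice assignment is hard to follow, this sketch runs the same steps while keeping the intermediate lists. The names `parts` and `flat` are additions for this example.

```python
s = 'k1:some text k2:more text k3:and still more'

# Step 1: split on the colon. Middle elements hold "value next_key".
parts = s.split(':')
print(parts)  # ['k1', 'some text k2', 'more text k3', 'and still more']

# Step 2: split each middle element once, from the right, on whitespace,
# which peels the next key off the end of the previous value.
flat = parts[:1] + [j for i in parts[1:-1] for j in i.rsplit(None, 1)] + parts[-1:]
print(flat)  # ['k1', 'some text', 'k2', 'more text', 'k3', 'and still more']

# Step 3: pair even-indexed keys with odd-indexed values.
result = dict(zip(flat[::2], flat[1::2]))
print(result)
```

Splitting from the right with maxsplit=1 is what lets values contain spaces while keys cannot.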