convert substrings to dict

If the keys don't contain spaces or colons, you could:

split on a word (\w+) followed by a colon to get the tokens
zip the two interleaved slices (even and odd positions) in a dict comprehension to rebuild the dict

like this:

import re
import itertools

s = 'k1:some text k2:more text k3:and still more'
# the capturing group makes re.split keep the keys in the result
toks = [x for x in re.split(r"(\w+):", s) if x]  # filter out empty tokens
# toks => ['k1', 'some text ', 'k2', 'more text ', 'k3', 'and still more']
# even positions are keys, odd positions are values
d = {k: v for k, v in zip(itertools.islice(toks, None, None, 2), itertools.islice(toks, 1, None, 2))}
print(d)

result:

{'k1': 'some text ', 'k2': 'more text ', 'k3': 'and still more'}

Using itertools.islice avoids creating the intermediate sub-lists that toks[::2] and toks[1::2] would.
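The values above keep their trailing spaces, and filtering out empty tokens would misalign keys and values if a value were empty. A minimal sketch of the same split-and-zip idea that handles both, assuming stripped values are wanted; `parse_pairs` is a hypothetical helper name, and plain slices are used instead of islice for brevity:

```python
import re

def parse_pairs(s):
    # With a capturing group, re.split keeps the keys in the result.
    parts = re.split(r"(\w+):", s)
    # parts[0] is whatever precedes the first key (empty for well-formed input).
    toks = parts[1:]
    # Even positions are keys, odd positions are values.
    return {k: v.strip() for k, v in zip(toks[::2], toks[1::2])}

print(parse_pairs('k1:some text k2:more text k3:and still more'))
```

Dropping only the first element, rather than every empty token, is what keeps an empty value paired with its key.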

Another bit of regex magic: split the input string into key/value pairs first:

import re

s = 'k1:some text k2:more text k3:and still more'
pat = re.compile(r'\s+(?=\w+:)')
# maxsplit=1 so that a colon inside a value does not break the pair
result = dict(i.split(':', 1) for i in pat.split(s))

print(result)

The output:

{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}

Compiling the pattern with re.compile() and reusing the resulting regular expression object is more efficient when the expression is used several times in a single program.
\s+(?=\w+:) is the crucial pattern: it splits the input string on whitespace (\s+) only when that whitespace is followed by a "key" (a word \w+ and a colon).
(?=...) is a positive lookahead assertion: it checks what follows without consuming it.
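To see what the lookahead buys you, it may help to look at the intermediate list that the split produces. A small sketch using the same pattern; the second input string is a made-up example with a colon inside a value:

```python
import re

pat = re.compile(r'\s+(?=\w+:)')
s = 'k1:some text k2:more text k3:and still more'

# The whitespace before each key is consumed; the key itself is not.
pairs = pat.split(s)
print(pairs)  # ['k1:some text', 'k2:more text', 'k3:and still more']

# Splitting each pair once on ':' keeps later colons inside the value.
result = dict(p.split(':', 1) for p in pairs)
print(result)

# Hypothetical input whose first value contains a colon.
print(dict(p.split(':', 1) for p in pat.split('url:http://x.org name:bob')))
```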

If you have a list of your known keys (and maybe also values, but I don't address that in this answer), you can do it with a regex. There might be a shortcut if, say, you can simply assert that the last whitespace before a colon definitely signals the beginning of the key, but this should work as well:

import re

s = 'k1:some text k2:more text k3:and still more'
key_list = ['k1', 'k2', 'k3']
dict_splitter = re.compile(r'(?P<key>({keys})):(?P<val>.*?)(?=({keys})|$)'.format(keys=')|('.join(key_list)))
result = {match.group('key'): match.group('val') for match in dict_splitter.finditer(s)}
print(result)
# {'k1': 'some text ', 'k2': 'more text ', 'k3': 'and still more'}

Explanation:

(?P<key>({keys}))  # match all the defined keys, call that group 'key'
:                  # match a colon
(?P<val>.*?)       # match anything that follows and call it 'val', but
                   # only as much as necessary..
(?=({keys})|$)     # .. as long as whatever follows is either a new key or 
                   # the end of the string
.format(keys=')|('.join(key_list))
                   # build a string out of the keys where all the keys are
                   # 'or-chained' after one another, format it into the
                   # regex wherever {keys} appears.

Caveat 1: If your keys can contain each other, order is important, and you might want to go from long keys to shorter ones in order to force longest matches first: key_list.sort(key=len, reverse=True)

Caveat 2: If your keys contain regex metacharacters, they will break the expression, so they need to be escaped first: key_list = [re.escape(key) for key in key_list]
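Both caveats can be folded into the pattern construction. A rough sketch, assuming the keys are known up front; `parse_with_keys` is a hypothetical helper, and it uses one non-capturing alternation instead of a group per key, which should match the same strings:

```python
import re

def parse_with_keys(s, key_list):
    # Caveat 1: try longer keys first.
    keys = sorted(key_list, key=len, reverse=True)
    # Caveat 2: escape metacharacters so keys are matched literally.
    alternation = '|'.join(re.escape(k) for k in keys)
    pattern = re.compile(
        r'(?P<key>{keys}):(?P<val>.*?)(?=(?:{keys})|$)'.format(keys=alternation)
    )
    # Values are stripped here; drop .strip() to keep the raw text.
    return {m.group('key'): m.group('val').strip() for m in pattern.finditer(s)}

print(parse_with_keys('k1:some text k2:more text k3:and still more', ['k1', 'k2', 'k3']))
print(parse_with_keys('k.1:some text k+2:more text', ['k.1', 'k+2']))
```

As in the original pattern, a value that happens to contain one of the keys as plain text will be cut short at that point.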

Option 1
If the keys don't have spaces or colons, you can simplify your solution with dict + re.findall (import re, first):

>>> dict(re.findall(r'(\S+):(.*?)(?=\s\S+:|$)', s))
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}

Only the placement of the colon (:) determines how keys/values are matched.

Details

(\S+)   # capture the key (one or more non-space characters)
:       # colon (matched, but not captured)
(.*?)   # non-greedy match - zero or more characters - this captures the value
(?=     # use lookahead to determine when to stop matching the value
\s      # space
\S+:    # anything that is not a space followed by a colon
|       # regex OR
$)      # end of string

Note that this code assumes the structure as presented in the question. It will fail on strings with invalid structures.
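One way such a failure can go unnoticed is that re.findall simply returns an empty list when nothing matches, so the result is an empty dict rather than an error. A minimal sketch that at least surfaces that case; `parse_or_raise` is a hypothetical wrapper, and it does not catch every malformed input:

```python
import re

PAIR = re.compile(r'(\S+):(.*?)(?=\s\S+:|$)')

def parse_or_raise(s):
    pairs = PAIR.findall(s)
    # No key:value pair at all is treated as an error instead of {}.
    if not pairs:
        raise ValueError('no key:value pairs found in {!r}'.format(s))
    return dict(pairs)

print(parse_or_raise('k1:some text k2:more text k3:and still more'))
```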

Option 2
Look ma, no regex...
This operates on the same assumption as the one above.

Split on colon (:)
All elements but the first and last will need to be split again, on space (to separate keys and values)
zip adjacent elements, and convert to dictionary

v = s.split(':')
v[1:-1] = [j for i in v[1:-1] for j in i.rsplit(None, 1)]

dict(zip(v[::2], v[1::2]))
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}
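The same steps can be wrapped in a function that also checks the assumption it relies on, namely that every middle chunk splits into exactly a value and the next key. This is a sketch, and `parse_no_regex` is a hypothetical name:

```python
def parse_no_regex(s):
    v = s.split(':')
    # Each middle chunk should look like "<value> <next key>".
    middle = [chunk.rsplit(None, 1) for chunk in v[1:-1]]
    if len(v) < 2 or any(len(parts) != 2 for parts in middle):
        raise ValueError('unexpected structure in {!r}'.format(s))
    # Flatten the [value, key] pairs back into the token list.
    v[1:-1] = [part for parts in middle for part in parts]
    return dict(zip(v[::2], v[1::2]))

print(parse_no_regex('k1:some text k2:more text k3:and still more'))
```

Without the check, an empty value such as in 'k1: k2:b' would shift keys and values against each other without raising.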
