Converting a String to a List of Words?

I think this is the simplest way for anyone else stumbling on this post given the late response:

>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']

The most simple way:

Click to copy

>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']

To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:

Click to copy

>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
[u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
[u'And', u'this', u'is', u'my', u'second', u'.']

Try this:

Click to copy

import re

mystr = 'This is a string, with words!'
wordList = re.sub("[^\w]", " ",  mystr).split()

How it works:

From the docs :

Click to copy

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.

so in our case :

pattern is any non-alphanumeric character.

[\w] means any alphanumeric character and is equal to the character set [a-zA-Z0-9_]

a to z, A to Z , 0 to 9 and underscore.

so we match any non-alphanumeric character and replace it with a space .

and then we split() it which splits string by space and converts it to a list

so 'hello-world'

becomes 'hello world'

with re.sub

and then ['hello' , 'world']

after split()

let me know if any doubts come up.

Converting a String to a List of Words?

Tags:

Python

String

List

Text Segmentation

Words

Related

Recent Posts