How to split text into words?

As a variation on @Adam Fridental's answer, which is very good, you could try this regex:

var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";

var matches = Regex.Matches(text, @"\w+[^\s]*\w+|\w");

foreach (Match match in matches) {
    var word = match.Value;
}

I believe this is the shortest regex that will get all the words:

\w+[^\s]*\w+|\w
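
As a rough sketch of how that pattern behaves, the helper below collects the matches into a list. The class and method names are made up for illustration and are not part of the answer. One caveat worth knowing: [^\s]* accepts any run of non-whitespace characters, so words joined by punctuation without a space (for example "mad.You're") come back as a single token.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, shown only to illustrate the pattern above.
public static class RegexWordSplitter
{
    // Either a run that starts and ends with a word character and contains
    // no whitespace in between, or a single word character.
    private static readonly Regex WordPattern = new Regex(@"\w+[^\s]*\w+|\w");

    public static List<string> GetWords(string text)
    {
        return WordPattern.Matches(text)
                          .Cast<Match>()          // MatchCollection is non-generic on older frameworks
                          .Select(m => m.Value)
                          .ToList();
    }
}
```

Calling RegexWordSplitter.GetWords(text) on the sample sentence should yield words such as "Oh", "can't" and "that" with the surrounding quotes and commas removed.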

Split text on whitespace, then trim punctuation.

var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var punctuation = text.Where(Char.IsPunctuation).Distinct().ToArray();
var words = text.Split().Select(x => x.Trim(punctuation));

The result agrees exactly with the example in the question.
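
If the input may contain repeated spaces or tokens made only of punctuation, the same split-then-trim idea can be wrapped up as below. This is a minimal sketch: the class name is hypothetical, and the two extra filtering steps are my assumption about what is wanted rather than something the answer specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper built on the split-then-trim approach above.
public static class TrimWordSplitter
{
    public static List<string> GetWords(string text)
    {
        // Every distinct punctuation character that occurs in the input.
        char[] punctuation = text.Where(char.IsPunctuation).Distinct().ToArray();

        // A null separator array means "split on whitespace";
        // RemoveEmptyEntries drops the empty strings caused by repeated spaces.
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                   .Select(token => token.Trim(punctuation))
                   .Where(token => token.Length > 0) // skip tokens that were only punctuation
                   .ToList();
    }
}
```

Because Trim only touches the ends of each token, apostrophes inside words such as "can't" are preserved.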


First, remove all special characters:

var fixedInput = Regex.Replace(input, "[^a-zA-Z0-9% ._]", string.Empty);
// This regex also strips apostrophes, so the extension method is the better choice

Then split it:

var split = fixedInput.Split(' ');

For a simpler C# solution for removing special characters (one that you can easily change), add this extension method (I added support for the apostrophe):

// Must be declared inside a static class; StringBuilder requires "using System.Text;"
public static string RemoveSpecialCharacters(this string str) {
   var sb = new StringBuilder();
   foreach (char c in str) {
      if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'' || c == ' ') {
         sb.Append(c);
      }
   }
   return sb.ToString();
}

Then use it like so:

var words = input.RemoveSpecialCharacters().Split(' ');

You may be surprised to learn that this extension method is very efficient (it makes a single pass over the string and should be much faster than the Regex), so I suggest you use it ;)

Update

I agree that this is an English-only approach, but to make it Unicode compatible all you have to do is replace:

(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')

With:

char.IsLetter(c)

which supports Unicode. .NET also offers char.IsSymbol and char.IsLetterOrDigit for other cases.
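
Put together, the Unicode-aware variant might look like the sketch below. The class and method names are invented for this example. It keeps the ASCII digit check from the original method and only swaps the letter test for char.IsLetter, as described above.

```csharp
using System.Text;

// Hypothetical container class; extension methods must live in a static class.
public static class UnicodeStringExtensions
{
    public static string RemoveSpecialCharactersUnicode(this string str)
    {
        var sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            // Keep Unicode letters, ASCII digits, apostrophes and spaces.
            if (char.IsLetter(c) || (c >= '0' && c <= '9') || c == '\'' || c == ' ')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

With this version, "café, naïve!".RemoveSpecialCharactersUnicode() keeps the accented letters and drops only the comma and the exclamation mark.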


If you don't want to use a Regex object, you could do something like...

string mystring="Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.";
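// Split(' ') takes a char, so it also compiles on .NET Framework, where Split(string) is not available
List<string> words = mystring.Replace(",", "").Replace(":", "").Replace(".", "").Split(' ').ToList();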

You'll still have to handle the trailing apostrophe at the end of "that,'" and the leading one on "'we're".
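
One possible way to deal with those stray apostrophes, sketched under the assumption that an apostrophe at the very start or end of a token is always a quotation mark: trim it from each token after splitting. The class name is hypothetical, and note that this would also strip a genuine trailing apostrophe, as in the possessive "dogs'".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper extending the Replace-based approach above.
public static class ReplaceWordSplitter
{
    public static List<string> GetWords(string text)
    {
        return text.Replace(",", "").Replace(":", "").Replace(".", "")
                   .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                   .Select(token => token.Trim('\''))  // removes leading/trailing apostrophes only
                   .Where(token => token.Length > 0)   // drop tokens that were just apostrophes
                   .ToList();
    }
}
```

Apostrophes inside a word, as in "can't" or "we're", are left alone because Trim only looks at the ends of the string.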

Tags:

C#

.Net