Split string into sentences in javascript

str.replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|").split("|")

The RegExp (see on Debuggex):

  • (.+|:|!|\?) = The sentence can end not only by ".", "!" or "?", but also by "..." or ":"
  • (\"|\'|)*|}|]) = The sentence can be surrounded by quatation marks or parenthesis
  • (\s|\n|\r|\r\n) = After a sentense have to be a space or end of line
  • g = global
  • m = multiline

Remarks:

  • If you use (?=[A-Z]), the the RegExp will not work correctly in some languages. E.g. "Ü", "Č" or "Á" will not be recognised.

Use lookahead to avoid replacing dot if not followed by space + word char:

sentences = str.replace(/(?=\s*\w)\./g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");

OUTPUT:

["This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."]

You could exploit that the next sentence begins with an uppercase letter or a number.

.*?(?:\.|!|\?)(?:(?= [A-Z0-9])|$)

Regular expression visualization

Debuggex Demo

It splits this text

This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence. Sencenes beginning with numbers work. 10 people like that.

into the sentences:

This is a long string with some numbers [125.000,55 and 140.000] and an end.
This is another sentence.
Sencenes beginning with numbers work.
10 people like that.

jsfiddle


str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")

Output:

[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
  'This is another sentence.' ]

Breakdown:

([.?!]) = Capture either . or ? or !

\s* = Capture 0 or more whitespace characters following the previous token ([.?!]). This accounts for spaces following a punctuation mark which matches the English language grammar.

(?=[A-Z]) = The previous tokens only match if the next character is within the range A-Z (capital A to capital Z). Most English language sentences start with a capital letter. None of the previous regexes take this into account.


The replace operation uses:

"$1|"

We used one "capturing group" ([.?!]) and we capture one of those characters, and replace it with $1 (the match) plus |. So if we captured ? then the replacement would be ?|.

Finally, we split the pipes | and get our result.


So, essentially, what we are saying is this:

1) Find punctuation marks (one of . or ? or !) and capture them

2) Punctuation marks can optionally include spaces after them.

3) After a punctuation mark, I expect a capital letter.

Unlike the previous regular expressions provided, this would properly match the English language grammar.

From there:

4) We replace the captured punctuation marks by appending a pipe |

5) We split the pipes to create an array of sentences.