whitespace in regular expression

\t is not equivalent to \s+, but \s+ does match a tab (\t).

The problem in your example is that the second pattern \s\s+ is looking for two or more whitespace characters, and \t is only one whitespace character.

Here are some examples that should help you understand:

>>> import re
>>> result = re.match(r'\s\s+', '\t')
>>> print(result)
None
>>> result = re.match(r'\s\s+', '\t\t')
>>> print(result)
<re.Match object; span=(0, 2), match='\t\t'>

\s\s+ would also match ' \t', '\n\t', ' \n \t \t\n'.

Also, \s\s* is equivalent to \s+. Both will match one or more whitespace characters.
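If you want to convince yourself of that equivalence, here is a minimal sketch that runs both patterns over a handful of strings and compares where each match ends. The sample strings and the helper name `span` are my own choices for illustration, not anything from your code.

```python
import re

# Two spellings of "one or more whitespace characters".
pattern_star = re.compile(r'\s\s*')
pattern_plus = re.compile(r'\s+')

# Illustrative inputs (assumed for this sketch).
samples = ['', 'x', '\t', ' \t', '\n\t', ' \n \t \t\n', '\t\tabc']

def span(pattern, text):
    """Return the (start, end) of a match at the start of text, or None."""
    m = pattern.match(text)
    return m.span() if m else None

results = [(span(pattern_star, s), span(pattern_plus, s)) for s in samples]

for s, (star_span, plus_span) in zip(samples, results):
    print(repr(s), star_span, plus_span)
```

Every row should show the same span (or None) in both columns.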


\s+ is not equivalent to \t because \s does not mean <space>; it means <whitespace>, a class that includes spaces, tabs, and newlines. A literal space (four of which are sometimes displayed in place of a tab, depending on the application) is simply ' '. That is, hitting the spacebar creates a literal space. That's hardly surprising.
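A quick way to see the difference between the whitespace class and a literal character is to test single characters against each pattern. This is only a sketch; the dictionary of test characters and the helper name `which_match` are made up for the demonstration.

```python
import re

# Single characters to test (chosen for illustration).
chars = {'space': ' ', 'tab': '\t', 'newline': '\n', 'letter': 'a'}

def which_match(pattern):
    """Names of the test characters that the pattern matches in full."""
    return [name for name, c in chars.items() if re.fullmatch(pattern, c)]

by_class = which_match(r'\s')   # the whitespace class
by_space = which_match(r' ')    # a literal space only
by_tab = which_match(r'\t')     # a literal tab only

print(by_class)
print(by_space)
print(by_tab)
```

The class \s accepts the space, the tab, and the newline, while each literal pattern accepts only its own character.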

\s\s will never match a single \t: since \t IS whitespace, the first \s consumes it and nothing is left for the second \s. It will match \t\t, but that's because there are two characters of whitespace (both tabs). When your regex runs \s\s+, it's looking for one character of whitespace followed by one, two, three, or really ANY number more (but at least one). When it reads your regex it does this:

\s\s+

(Figure: Debuggex visualization of the regular expression \s\s+)

The \t matches the first \s, but the \s+ that follows needs at least one more whitespace character. There is no input left, so your regex spits it back out saying "Oh, nope, never mind."

Your first regex does this:

\s\s*

(Figure: Debuggex visualization of the regular expression \s\s*)

Again, the \t matches your first \s. When the regex continues, it finds nothing left for \s* to match, so it takes the "high road" instead and jumps over it. That's why \s\s* matches: the * quantifier means "zero or more," while the + quantifier means "one or more."
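To watch that happen, you can wrap each half of the pattern in a capturing group and look at what each half consumed. The groups are added purely for this sketch and are not part of your original patterns.

```python
import re

text = '\t'  # a single tab

# Same patterns as above, with capturing groups added for inspection.
star = re.match(r'(\s)(\s*)', text)
plus = re.match(r'(\s)(\s+)', text)

# With *, the second group is allowed to match nothing at all.
star_groups = star.groups() if star else None
print(repr(star_groups))

# With +, the second group needs at least one character, so there is no match.
print(plus)

# Give + a second tab and it succeeds.
plus_two = re.match(r'(\s)(\s+)', '\t\t')
plus_two_groups = plus_two.groups() if plus_two else None
print(repr(plus_two_groups))
```

In the first case the second group comes back as an empty string, which is the "or zero" branch in action.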

Tags:

Python

Regex