whitespace in regular expression
\t is not equivalent to \s+, but \s+ should match a tab (\t).
The problem in your example is that the second pattern, \s\s+, is looking for two or more whitespace characters, and \t is only one whitespace character.
Here are some examples that should help you understand:
>>> import re
>>> result = re.match(r'\s\s+', '\t')
>>> print result
None
>>> result = re.match(r'\s\s+', '\t\t')
>>> print result
<_sre.SRE_Match object at 0x10ff228b8>
\s\s+ would also match ' \t', '\n\t', and ' \n \t \t\n'.
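The strings listed above can be checked directly. This is a minimal sketch in Python 3 (the transcript above uses the Python 2 print statement); the variable names are only illustrative.

```python
import re

# Strings from the answer: each contains at least two whitespace characters.
samples = [' \t', '\n\t', ' \n \t \t\n']

pattern = re.compile(r'\s\s+')

# re.match anchors at the start of the string; every sample should match.
results = [pattern.match(s) is not None for s in samples]
print(results)

# A lone tab is only one whitespace character, so \s\s+ cannot match it.
single_tab = pattern.match('\t')
print(single_tab)
```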
Also, \s\s* is equivalent to \s+. Both will match one or more whitespace characters.
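A quick way to convince yourself of that equivalence is to run both patterns over a few inputs and compare what they match. This is only a spot check under assumed test strings (they are not from the question), not a proof; matched_text is a hypothetical helper.

```python
import re

one_or_more = re.compile(r'\s+')
one_then_any = re.compile(r'\s\s*')

# Illustrative inputs: whitespace only, whitespace then text, no whitespace, empty.
samples = ['\t', '  ', ' \n\tabc', 'abc', '']

def matched_text(pattern, text):
    """Return the text matched at the start of the string, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None

# Both patterns should agree on every input, including the non-matches.
same = all(
    matched_text(one_or_more, s) == matched_text(one_then_any, s)
    for s in samples
)
print(same)
```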
\s+ is not equivalent to \t because \s does not mean <space>, but instead means <whitespace>. A literal space (sometimes four of which are used in place of a tab, depending on the application used to display them) is simply ' '. That is, hitting the spacebar creates a literal space. That's hardly surprising.
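To see the difference between <space> and <whitespace> concretely, here is a small Python 3 sketch that tests single characters against \s. The dictionary labels are just names chosen for this example.

```python
import re

# Candidate characters; the keys are only labels for this sketch.
candidates = {
    'space': ' ',
    'tab': '\t',
    'newline': '\n',
    'carriage return': '\r',
    'letter a': 'a',
}

# \s is a character class: it matches any single whitespace character.
is_whitespace = {
    name: re.fullmatch(r'\s', ch) is not None
    for name, ch in candidates.items()
}
print(is_whitespace)

# A literal \t or a literal space in a pattern matches only that one character.
tab_matches_space = re.fullmatch(r'\t', ' ') is not None
space_matches_tab = re.fullmatch(r' ', '\t') is not None
print(tab_matches_space, space_matches_tab)
```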
\s\s will never match a single \t: since \t IS whitespace, the first \s matches it, and nothing is left for the second \s. It will match \t\t, but that's because there are two characters of whitespace (both tab characters). When your regex runs \s\s+, it's looking for one character of whitespace followed by one, two, three, or really ANY number more. When it reads your regex it does this:
\s\s+
The \t matches the first \s, but when the regex reaches the second one there is nothing left to match, so it spits the string back out saying "Oh, nope, never mind."
Your first regex does this:
\s\s*
Again, the \t matches your first \s. When the regex continues, it sees that nothing is left for the second \s to match, so it takes the "high road" instead and skips it. That's why \s\s* matches: the * quantifier means "zero or more," while the + quantifier means "one or more."
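The two walkthroughs can be reproduced side by side. This is a short Python 3 sketch; the variable names are made up for the example.

```python
import re

tab = '\t'

# + requires at least one more whitespace character after the first \s.
plus_match = re.match(r'\s\s+', tab)

# * allows zero more, so the single tab is enough.
star_match = re.match(r'\s\s*', tab)

print(plus_match)         # no match object: the second \s has nothing to consume
print(star_match.span())  # the match covers exactly the one tab character

# With two tabs, both patterns succeed and consume both characters.
plus_two = re.match(r'\s\s+', tab * 2)
star_two = re.match(r'\s\s*', tab * 2)
print(plus_two.span(), star_two.span())
```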