Why doesn't [01-12] range work as expected?
You seem to have misunderstood how character classes definition works in regex.
To match any of the strings 01
, 02
, 03
, 04
, 05
, 06
, 07
, 08
, 09
, 10
, 11
, or 12
, something like this works:
0[1-9]|1[0-2]
References
- regular-expressions.info/Character Classes
- Numeric Ranges (have many examples on matching strings interpreted as numeric ranges)
Explanation
A character class, by itself, attempts to match one and exactly one character from the input string. [01-12]
actually defines [012]
, a character class that matches one character from the input against any of the 3 characters 0
, 1
, or 2
.
The -
range definition goes from 1
to 1
, which includes just 1
. On the other hand, something like [1-9]
includes 1
, 2
, 3
, 4
, 5
, 6
, 7
, 8
, 9
.
Beginners often make the mistakes of defining things like [this|that]
. This doesn't "work". This character definition defines [this|a]
, i.e. it matches one character from the input against any of 6 characters in t
, h
, i
, s
, |
or a
. More than likely (this|that)
is what is intended.
References
- regular-expressions.info/Brackets for Grouping and Alternation with the vertical bar
How ranges are defined
So it's obvious now that a pattern like between [24-48] hours
doesn't "work". The character class in this case is equivalent to [248]
.
That is, -
in a character class definition doesn't define numeric range in the pattern. Regex engines doesn't really "understand" numbers in the pattern, with the exception of finite repetition syntax (e.g. a{3,5}
matches between 3 and 5 a
).
Range definition instead uses ASCII/Unicode encoding of the characters to define ranges. The character 0
is encoded in ASCII as decimal 48; 9
is 57. Thus, the character definition [0-9]
includes all character whose values are between decimal 48 and 57 in the encoding. Rather sensibly, by design these are the characters 0
, 1
, ..., 9
.
See also
- Wikipedia/ASCII
Another example: A to Z
Let's take a look at another common character class definition [a-zA-Z]
In ASCII:
A
= 65,Z
= 90a
= 97,z
= 122
This means that:
[a-zA-Z]
and[A-Za-z]
are equivalent- In most flavors,
[a-Z]
is likely to be an illegal character range- because
a
(97) is "greater than" thanZ
(90)
- because
[A-z]
is legal, but also includes these six characters:[
(91),\
(92),]
(93),^
(94),_
(95),`
(96)
Related questions
- is the regex [a-Z] valid and if yes then is it the same as [a-zA-Z]
A character class in regular expressions, denoted by the [...]
syntax, specifies the rules to match a single character in the input. As such, everything you write between the brackets specify how to match a single character.
Your pattern, [01-12]
is thus broken down as follows:
- 0 - match the single digit 0
- or, 1-1, match a single digit in the range of 1 through 1
- or, 2, match a single digit 2
So basically all you're matching is 0, 1 or 2.
In order to do the matching you want, matching two digits, ranging from 01-12 as numbers, you need to think about how they will look as text.
You have:
- 01-09 (ie. first digit is 0, second digit is 1-9)
- 10-12 (ie. first digit is 1, second digit is 0-2)
You will then have to write a regular expression for that, which can look like this:
+-- a 0 followed by 1-9
|
| +-- a 1 followed by 0-2
| |
<-+--> <-+-->
0[1-9]|1[0-2]
^
|
+-- vertical bar, this roughly means "OR" in this context
Note that trying to combine them in order to get a shorter expression will fail, by giving false positive matches for invalid input.
For instance, the pattern [0-1][0-9]
would basically match the numbers 00-19, which is a bit more than what you want.
I tried finding a definite source for more information about character classes, but for now all I can give you is this Google Query for Regex Character Classes. Hopefully you'll be able to find some more information there to help you.
This also works:
^([1-9]|[0-1][0-2])$
[1-9]
matches single digits between 1 and 9
[0-1][0-2]
matches double digits between 10 and 12
There are some good examples here