How do you unit test regular expressions?
You should always test your regexen, much like any other chunk of code. At their simplest, they're a function that takes a string and returns either a bool or an array of values.
Here are some suggestions on what to think about when it comes to designing unit tests for regexen. These are not hard and fast prescriptions for unit test design, but some guidelines to shape your thinking. As always, weigh the cost of a failure against the time required to implement all of the tests. (I find that 'implementing' the test is the easy part! :-] )
Points to consider:
- Think of every group (the parentheses) as a curly brace.
- Think of every | as a condition. Make sure to test for each branch.
- Think of every modifier (*, +, ? ) as a different path.
- (side note to the above: remember the difference between *, +, ? and *?, +?, and ??.)
- For \d, \s, \w, and their negations, try several characters from each class.
- For * and +, test the 'none', 'exactly one', and 'more than one' cases for each.
- For important 'control' characters (eg, strings in the regex you look for) test to see what happens if they show up in the wrong places. This may surprise you.
- If you have real world data, use as much of it as you can.
- If you don't, make sure to test both the simple and complex forms that should be valid.
- Make sure to test what regex control characters do when inserted.
- Make sure to verify that the empty string is properly accepted/rejected.
- Make sure to verify that a string made of each of the different kinds of space characters is properly accepted or rejected.
- Make sure that case insensitivity (the i flag) is handled properly. This has bitten me more times than almost anything else in text parsing (other than spaces).
- If you have the x, m or s options, make sure you understand what they do and test for it (the behavior here can differ between engines).
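Several of the points above can be sketched as one table of inputs, here with Python's `re` module. The pattern `PET` and the helper `failing_cases` are made up for illustration; the idea is only that each branch, quantifier and flag gets at least one row.

```python
import re

# Hypothetical pattern: "cat" or "dog", an optional "s", a dash, then digits.
PET = re.compile(r"(cat|dog)s?-\d+", re.IGNORECASE)

# (input, should_match) pairs, one per path through the pattern.
CASES = [
    ("cat-1", True),       # first branch of the |
    ("dog-1", True),       # second branch of the |
    ("bird-1", False),     # neither branch
    ("cats-1", True),      # ? taken
    ("catss-1", False),    # ? allows at most one "s"
    ("cat-", False),       # + needs at least one digit
    ("cat-12345", True),   # + with many digits
    ("cat-\u0663", True),  # \d also matches non-ASCII digits in str patterns
    ("CAT-7", True),       # i flag
    ("", False),           # empty string
    (" cat-1", False),     # leading space
    ("cat-1\n", False),    # trailing newline
]


def failing_cases():
    """Return the cases whose actual result differs from the expected one."""
    return [
        (text, expected)
        for text, expected in CASES
        if (PET.fullmatch(text) is not None) != expected
    ]
```

An empty list from `failing_cases()` means every row behaved as expected. The Arabic-Indic digit row is the kind of thing that tends to surprise people, and it is specific to how Python treats `\d` on `str` patterns.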
For a regex that returns lists, also remember:
- Verify that the data you expect is returned, in the right order, in the right fields.
- Verify that slightly modified (and therefore invalid) inputs do not return good data.
- Verify that mixed anonymous groups and named groups parse correctly (eg, (?<name> thing1 ( thing2) ) ). This behavior can be different based on the regex engine you're using.
- Once again, give lots of real world trials.
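For the list-returning case, a minimal sketch in Python might look like this. The `PAIR` pattern and `parse_pairs` are hypothetical names; the point is to check order and fields on good input, and to check that near-miss input returns nothing.

```python
import re

# Hypothetical pattern: key=value pairs where the value is all digits.
PAIR = re.compile(r"(\w+)=(\d+)")


def parse_pairs(text):
    """Return (key, value) tuples in the order they appear in text."""
    return PAIR.findall(text)


def check_parse_pairs():
    # Right data, right order, right fields.
    assert parse_pairs("a=1, b=22") == [("a", "1"), ("b", "22")]
    # Slight modifications must not produce good-looking data.
    assert parse_pairs("a:1") == []      # wrong separator
    assert parse_pairs("a = 1") == []    # spaces around "="
    assert parse_pairs("a=x") == []      # value is not digits
    assert parse_pairs("") == []         # empty input
```

Calling `check_parse_pairs()` runs all of the checks; in a real suite each one would probably be its own test so a failure points at a single case.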
If you use any advanced features, such as non-backtracking groups, make sure you understand completely how the feature works, and using the guidelines above, build example strings that should work for and against each of them.
Depending on your regex library implementation, the way groups are captured may be different as well. Perl 5 numbers groups in 'open paren order'; C# does the same for unnamed groups but treats named groups differently, and so on. Make sure to experiment with your flavor to know exactly what it does.
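As one example of experimenting with a flavor, here is a small sketch for Python's `re`, where named groups are numbered together with unnamed ones in open-paren order. The pattern `NESTED` and the helper `describe` are invented for this illustration, and the result says nothing about other engines.

```python
import re

# Hypothetical pattern with a named group wrapping an unnamed one,
# written with Python's (?P<name>...) syntax.
NESTED = re.compile(r"(?P<outer>ab(cd))(ef)")


def describe(text):
    """Return the captures by number and by name, or None on no match."""
    m = NESTED.fullmatch(text)
    if m is None:
        return None
    return {
        "by_number": m.groups(),
        "by_name": m.groupdict(),
        "outer_index": NESTED.groupindex["outer"],
    }
```

A test that pins down `by_number` for a known input doubles as documentation of the numbering, which helps if the pattern is ever ported to another engine.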
Then, integrate them right in with your other unit tests, either in their own module or alongside the module that contains the regex. For particularly nasty regexen, you may find you need lots and lots of tests to verify that the pattern and all the features you use are correct. If the regex makes up a large part (or nearly all) of the work that the method is doing, I will use the advice above to fashion inputs to test that method and not the regex directly. That way, if you later decide that the regex is not the way to go, or you want to break it up, you can capture the behavior the regex provided without changing the interface, ie, the method that invokes the regex.
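To illustrate testing the method rather than the regex, here is a sketch in Python with two interchangeable implementations of a made-up check, `is_hex_color_regex` and `is_hex_color_plain`. Both names and the cases are assumptions for the example; the same table is run against each.

```python
import re
import string


def is_hex_color_regex(text):
    """Regex-based version: '#' followed by exactly six hex digits."""
    return re.fullmatch(r"#[0-9a-fA-F]{6}", text) is not None


def is_hex_color_plain(text):
    """Regex-free version with the same intended behavior."""
    return (
        len(text) == 7
        and text[0] == "#"
        and all(ch in string.hexdigits for ch in text[1:])
    )


# The cases are written against the interface, not against the pattern.
CASES = [
    ("#ffffff", True),
    ("#FFFFFF", True),
    ("#12ab9F", True),
    ("ffffff", False),      # missing '#'
    ("#fffff", False),      # too short
    ("#fffffff", False),    # too long
    ("#gggggg", False),     # not hex
    ("", False),            # empty string
    ("#ffffff\n", False),   # trailing newline
]


def mismatches(func):
    """Return the inputs for which func disagrees with the expected value."""
    return [text for text, expected in CASES if func(text) != expected]
```

If both `mismatches(is_hex_color_regex)` and `mismatches(is_hex_color_plain)` come back empty, the regex can be swapped out without touching the tests.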
As long as you really know how a regex feature is supposed to work in your flavor of regex, you should be able to develop decent test cases for it. Just make sure you really, really, really do understand how the feature works!
Just throw a bunch of values at it, checking that you get the right result (whether that's match/no-match or a particular replacement value etc).
Importantly, if there are any corner cases where you're not sure whether they'll work, capture them in a unit test and explain in a comment why they work. That way someone else who wants to change the regex will be able to check that the corner case still works, and the comment will give them a hint as to how to fix it if it breaks.
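Here is a small sketch of a commented corner case, using Python's `re`. The function names are hypothetical; the behavior of `$` around a trailing newline is the corner case being recorded.

```python
import re


def is_number_search(text):
    # Corner case: "$" also matches just before a trailing newline,
    # so "123\n" is accepted here even though it is not purely digits.
    return re.search(r"^\d+$", text) is not None


def is_number_strict(text):
    # fullmatch must consume the entire string, so "123\n" is rejected.
    return re.fullmatch(r"\d+", text) is not None


def test_trailing_newline():
    assert is_number_search("123")
    assert is_number_strict("123")
    assert is_number_search("123\n")        # surprising, see comment above
    assert not is_number_strict("123\n")
    assert not is_number_search("123\n\n")  # only one trailing newline is allowed
```

Whoever edits the pattern later can see both that the newline case was considered and which variant was meant to reject it.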
I would create a set of input values with expected output values, much like every other test case.
Also, I can thoroughly recommend the free regex tool Expresso. It's a fantastic regex editor/debugger that has saved me days of pain in the past.
Presumably your regular expressions are contained within a method of a class. For example:
    public bool ValidateEmailAddress( string emailAddr )
    {
        // Validate the email address using regular expression.
        return RegExProvider.Match( this.ValidEmailRegEx, emailAddr );
    }
You can now write tests for this method. I guess the point is that the regex is an implementation detail: your test needs to test the interface, which in this case is just the validate email method.