a regular expression generator for number ranges
You cannot cover your requirement with Character Groups only. Imagine the Range 129-131
. The Pattern 1[2-3][1-9]
would also match 139
which is out of range.
So in this example you need to change the last group to something else: 1[2-3](1|9)
. You can now find this effect as well for the tens and hundrets, leading to the problem that aapattern that basically represents each valid number as a fixed sequence of numbers is the only working solution. (if you don't want an algorithm that needs to track overflows in order to decide whether it should use [2-8]
or (8,9,0,1,2)
)
if you anyway autogenerate the pattern - keep it simple:
128-132
can be written as (I left out the non-matching group addition ?:
for better readability)
(128|129|130|131|132)
algorithm should be ovious, a for, an array, string concatenation and join.
That would already work as expected, but you can also perform some "optimization" on this if you like it more compact:
(128|129|130|131|132) <=>
1(28|29|30|31|32) <=>
1(2(8|9)|3(0|1|2))
more optimization
1(2([8-9])|3([0-2]))
Algorithms for the last steps are out there, look for factorization. An easy way would be to push all the numbers to a tree, depending on the character position:
1
2
8
9
3
0
1
2
and finally iterate over the three and form the pattern 1(2(8|9)|3(0|1|2))
. As a last step, replace anything of the pattern (a|(b|)*?c)
with [a-c]
Same goes for 11-29
:
11-29 <=>
(11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29) <=>
(1(1|2|3|4|5|7|8|9)|2(1|2|3|4|5|7|8|9)) <=>
(1([1-9])|2([1-9])
as an addition you now can proceed with the factorization:
(1([1-9])|2([1-9]) <=>
(1|2)[1-9] <=>
[1-2][1-9]
Here's my solution and an algorithm with complexity O(log n)
(n is the end of the range). I believe it is the simplest one here:
Basically, split your task into these steps:
- Gradually "weaken" the
start
of the range. - Gradually "weaken" the
end
of the range. - Merge those two.
By "weaken", I mean finding the end of range that can be represented by simple regex for this specific number, for example:
145 -> 149,150 -> 199,200 -> 999,1000 -> etc.
Here's a backward one, for the end
of the range:
387 -> 380,379 -> 300,299 -> 0
Merging would be the process of noticing the overlap of 299->0 and 200->999 and combining those into 200->299.
In result, you would get this set of numbers (first list intact, second one inverted:
145, 149, 150, 199, 200, 299, 300, 379, 380, 387
Now, here is the funny part. Take the numbers in pairs, and convert them to ranges:
145-149, 150-199, 200-299, 300-379, 380-387
Or in regex:
14[5-9], 1[5-9][0-9], 2[0-9][0-9], 3[0-7][0-9], 38[0-7]
Here's how the code for the weakening
would look like:
public static int next(int num) {
//Convert to String for easier operations
final char[] chars = String.valueOf(num).toCharArray();
//Go through all digits backwards
for (int i=chars.length-1; i>=0;i--) {
//Skip the 0 changing it to 9. For example, for 190->199
if (chars[i]=='0') {
chars[i] = '9';
} else { //If any other digit is encountered, change that to 9, for example, 195->199, or with both rules: 150->199
chars[i] = '9';
break;
}
}
return Integer.parseInt(String.valueOf(chars));
}
//Same thing, but reversed. 387 -> 380, 379 -> 300, etc
public static int prev(int num) {
final char[] chars = String.valueOf(num).toCharArray();
for (int i=chars.length-1; i>=0;i--) {
if (chars[i] == '9') {
chars[i] = '0';
} else {
chars[i] = '0';
break;
}
}
return Integer.parseInt(String.valueOf(chars));
}
The rest is technical details and is easy to implement. Here's an implementation of this O(log n)
algorithm: https://ideone.com/3SCvZf
Oh, and by the way, it works with other ranges too, for example for range 1-321654
the result is:
[1-9]
[1-9][0-9]
[1-9][0-9][0-9]
[1-9][0-9][0-9][0-9]
[1-9][0-9][0-9][0-9][0-9]
[1-2][0-9][0-9][0-9][0-9][0-9]
3[0-1][0-9][0-9][0-9][0-9]
320[0-9][0-9][0-9]
321[0-5][0-9][0-9]
3216[0-4][0-9]
32165[0-4]
And for 129-131
it's:
129
13[0-1]
I've finally arrived at the following. The overall idea is to start with the beginning of the range, produce a regular expression that will match from that up to but not including the next multiple of 10, then for hundreds, etc. until you have matched things up to the end of the range; then start with the end of the range and work downwards, replacing increasing numbers of digits with 0s to match against similar numbers of 9s, to match the specific end-of-range. Then generate one regular expression for the part of the range if they don't already cover it all.
Special note should be taken of bezmax's routine to convert two numbers to the regular expression that will match them - MUCH easier than dealing with strings or character arrays directly, I think.
Anyway, here it is:
package numbers;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
/**
* Has methods for generating regular expressions to match ranges of numbers.
*/
public class RangeRegexGenerator
{
public static void main(String[] args)
{
RangeRegexGenerator rrg = new RangeRegexGenerator();
// do
// {
// Scanner scanner = new Scanner(System.in);
// System.out.println("enter start, <return>, then end and <return>");
// int start = scanner.nextInt();
// int end = scanner.nextInt();
// System.out.println(String.format("for %d-%d", start, end));
List<String> regexes = rrg.getRegex("0015", "0213");
for (String s: regexes) { System.out.println(s); }
// }
// while(true);
}
/**
* Return a list of regular expressions that match the numbers
* that fall within the range of the given numbers, inclusive.
* Assumes the given strings are numbers of the the same length,
* and 0-left-pads the resulting expressions, if necessary, to the
* same length.
* @param begStr
* @param endStr
* @return
*/
public List<String> getRegex(String begStr, String endStr)
{
int start = Integer.parseInt(begStr);
int end = Integer.parseInt(endStr);
int stringLength = begStr.length();
List<Integer> pairs = getRegexPairs(start, end);
List<String> regexes = toRegex(pairs, stringLength);
return regexes;
}
/**
* Return a list of regular expressions that match the numbers
* that fall within the range of the given numbers, inclusive.
* @param beg
* @param end
* @return
*/
public List<String> getRegex(int beg, int end)
{
List<Integer> pairs = getRegexPairs(beg, end);
List<String> regexes = toRegex(pairs);
return regexes;
}
/**
* return the list of integers that are the paired integers
* used to generate the regular expressions for the given
* range. Each pair of integers in the list -- 0,1, then 2,3,
* etc., represents a range for which a single regular expression
* is generated.
* @param start
* @param end
* @return
*/
private List<Integer> getRegexPairs(int start, int end)
{
List<Integer> pairs = new ArrayList<>();
ArrayList<Integer> leftPairs = new ArrayList<>();
int middleStartPoint = fillLeftPairs(leftPairs, start, end);
ArrayList<Integer> rightPairs = new ArrayList<>();
int middleEndPoint = fillRightPairs(rightPairs, middleStartPoint, end);
pairs.addAll(leftPairs);
if (middleEndPoint > middleStartPoint)
{
pairs.add(middleStartPoint);
pairs.add(middleEndPoint);
}
pairs.addAll(rightPairs);
return pairs;
}
/**
* print the given list of integer pairs - used for debugging.
* @param list
*/
@SuppressWarnings("unused")
private void printPairList(List<Integer> list)
{
if (list.size() > 0)
{
System.out.print(String.format("%d-%d", list.get(0), list.get(1)));
int i = 2;
while (i < list.size())
{
System.out.print(String.format(", %d-%d", list.get(i), list.get(i + 1)));
i = i + 2;
}
System.out.println();
}
}
/**
* return the regular expressions that match the ranges in the given
* list of integers. The list is in the form firstRangeStart, firstRangeEnd,
* secondRangeStart, secondRangeEnd, etc.
* @param pairs
* @return
*/
private List<String> toRegex(List<Integer> pairs)
{
return toRegex(pairs, 0);
}
/**
* return the regular expressions that match the ranges in the given
* list of integers. The list is in the form firstRangeStart, firstRangeEnd,
* secondRangeStart, secondRangeEnd, etc. Each regular expression is 0-left-padded,
* if necessary, to match strings of the given width.
* @param pairs
* @param minWidth
* @return
*/
private List<String> toRegex(List<Integer> pairs, int minWidth)
{
List<String> list = new ArrayList<>();
String numberWithWidth = String.format("%%0%dd", minWidth);
for (Iterator<Integer> iterator = pairs.iterator(); iterator.hasNext();)
{
String start = String.format(numberWithWidth, iterator.next()); // String.valueOf(iterator.next());
String end = String.format(numberWithWidth, iterator.next());
list.add(toRegex(start, end));
}
return list;
}
/**
* return a regular expression string that matches the range
* with the given start and end strings.
* @param start
* @param end
* @return
*/
private String toRegex(String start, String end)
{
assert start.length() == end.length();
StringBuilder result = new StringBuilder();
for (int pos = 0; pos < start.length(); pos++)
{
if (start.charAt(pos) == end.charAt(pos))
{
result.append(start.charAt(pos));
} else
{
result.append('[').append(start.charAt(pos)).append('-')
.append(end.charAt(pos)).append(']');
}
}
return result.toString();
}
/**
* Return the integer at the end of the range that is not covered
* by any pairs added to the list.
* @param rightPairs
* @param start
* @param end
* @return
*/
private int fillRightPairs(List<Integer> rightPairs, int start, int end)
{
int firstBeginRange = end; // the end of the range not covered by pairs
// from this routine.
int y = end;
int x = getPreviousBeginRange(y);
while (x >= start)
{
rightPairs.add(y);
rightPairs.add(x);
y = x - 1;
firstBeginRange = y;
x = getPreviousBeginRange(y);
}
Collections.reverse(rightPairs);
return firstBeginRange;
}
/**
* Return the integer at the start of the range that is not covered
* by any pairs added to its list.
* @param leftInts
* @param start
* @param end
* @return
*/
private int fillLeftPairs(ArrayList<Integer> leftInts, int start, int end)
{
int x = start;
int y = getNextLeftEndRange(x);
while (y < end)
{
leftInts.add(x);
leftInts.add(y);
x = y + 1;
y = getNextLeftEndRange(x);
}
return x;
}
/**
* given a number, return the number altered such
* that any 9s at the end of the number remain, and
* one more 9 replaces the number before the other
* 9s.
* @param num
* @return
*/
private int getNextLeftEndRange(int num)
{
char[] chars = String.valueOf(num).toCharArray();
for (int i = chars.length - 1; i >= 0; i--)
{
if (chars[i] == '0')
{
chars[i] = '9';
} else
{
chars[i] = '9';
break;
}
}
return Integer.parseInt(String.valueOf(chars));
}
/**
* given a number, return the number altered such that
* any 9 at the end of the number is replaced by a 0,
* and the number preceding any 9s is also replaced by
* a 0.
* @param num
* @return
*/
private int getPreviousBeginRange(int num)
{
char[] chars = String.valueOf(num).toCharArray();
for (int i = chars.length - 1; i >= 0; i--)
{
if (chars[i] == '9')
{
chars[i] = '0';
} else
{
chars[i] = '0';
break;
}
}
return Integer.parseInt(String.valueOf(chars));
}
}
This one is correct as far as I've been able to test it; the one posted by bezmax did not quite work, though he had the right idea (that I also came up with) for an overall algorithm, and a major implementation detail or two that were helpful, so I'm leaving the 'answer' checkmark on his response.
I was a little surprised at the amount of interest this generated, though not as much as by just how complex the problem turned out to be.