Is it possible for a computer to "learn" a regular expression by user-provided examples?
The book An Introduction to Computational Learning Theory contains an algorithm for learning a finite automaton. Since every regular language is recognized by some finite automaton, it is possible for a program to learn at least some regular expressions. Kearns and Valiant show some cases where it is not possible to learn a finite automaton. A related problem is learning hidden Markov models, which are probabilistic automata that can describe a character sequence. Note that most modern "regular expressions" used in programming languages are actually more expressive than regular languages, and are therefore sometimes harder to learn.
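To illustrate that last point, here is a minimal Python sketch. The pattern is my own illustration, not something from the book: a backreference lets a pattern accept strings of the form a^n b a^n, a language that no finite automaton can recognize.

```python
import re

# (a+) captures a run of a's; \1 demands exactly the same run again after the b.
# Counting like this is beyond what a finite automaton can do.
beyond_regular = re.compile(r"(a+)b\1")

print(bool(beyond_regular.fullmatch("aaabaaa")))  # same number of a's on both sides
print(bool(beyond_regular.fullmatch("aaabaa")))   # unequal counts
```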
Yes, it is possible: we can generate regexes from examples (text -> desired extractions). This working online tool does the job: http://regex.inginf.units.it/
The Regex Generator++ online tool generates a regex from the provided examples using a GP (genetic programming) search algorithm. The GP algorithm is driven by a multiobjective fitness, which leads to higher performance and a simpler solution structure (Occam's Razor). This tool is a demonstration application by the Machine Learning Lab, University of Trieste (Università degli studi di Trieste). Please look at the video tutorial here.
This is a research project, so you can read about the algorithms used here.
Behold! :-)
Finding a meaningful regex from examples is possible if and only if the provided examples describe the problem well. Consider these examples, which describe an extraction task in which we are looking for particular item codes; the examples are text/extraction pairs:
"The product code is 467-345A" -> "467-345A"
"The item 789-345B is broken" -> "789-345B"
A human looking at the examples may say: "the item codes are things like \d++-345[AB]"
If the real item code format is more permissive but we have not provided other examples, there is no evidence from which to understand the problem well. When the human-generated solution \d++-345[AB] is applied to the following text, it fails:
"On the back of the item there is a code: 966-347Z"
You have to provide other examples to better describe what is a desired match and what is not, e.g.:
"My phone is +39-128-3905 , and the phone product id is 966-347Z" -> "966-347Z"
The phone number is not a product id; this may be an important piece of evidence.
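The effect of the extra examples can be checked with a small Python sketch. This is only an illustration, not how Regex Generator++ works internally. Assumptions: the expected extraction for the third text is taken to be the item code it contains, the last two candidate patterns are hypothetical, and `\d+` stands in for `\d++` because Python's `re` module only accepts possessive quantifiers from version 3.11 onward (the results on these texts are the same).

```python
import re

# Text/extraction pairs from the discussion above. The expected value for the
# third text is an assumption: the item code that appears in it.
examples = [
    ("The product code is 467-345A", "467-345A"),
    ("The item 789-345B is broken", "789-345B"),
    ("On the back of the item there is a code: 966-347Z", "966-347Z"),
    ("My phone is +39-128-3905 , and the phone product id is 966-347Z", "966-347Z"),
]

def extract(pattern, text):
    """Return the first substring matched by pattern, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

def errors(pattern):
    """Count the examples whose extraction differs from the desired one."""
    return sum(extract(pattern, text) != wanted for text, wanted in examples)

candidates = [
    r"\d+-345[AB]",       # fitted to the first two examples only
    r"\d+-\d+\w",         # hypothetical, too loose: grabs part of the phone number
    r"\d{3}-\d{3}[A-Z]",  # hypothetical, consistent with all four examples
]

for pattern in candidates:
    print(pattern, errors(pattern))
```

With only the first two examples, all three candidates look equally good. The third and fourth examples are what separate them: the first pattern finds nothing in either text, and the loose pattern extracts `39-128` from the phone number.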
No computer program will ever be able to generate a meaningful regular expression based solely on a list of valid matches. Let me show you why.
Suppose you provide the examples 111111 and 999999. Should the computer generate:
- A regex matching exactly those two examples:
(111111|999999)
- A regex matching 6 identical digits
(\d)\1{5}
- A regex matching 6 ones and nines
[19]{6}
- A regex matching any 6 digits
\d{6}
- Any of the above three, with word boundaries, e.g.
\b\d{6}\b
- Any of the first three, not preceded or followed by a digit, e.g.
(?<!\d)\d{6}(?!\d)
As you can see, there are many ways in which examples can be generalized into a regular expression. The only way for the computer to build a predictable regular expression is to require you to list all possible matches. Then it could generate a search pattern that matches exactly those matches.
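A short Python sketch makes the ambiguity concrete. The probe strings and helper names are my own assumptions, chosen for illustration: every candidate accepts both examples, yet the candidates disagree on almost everything else.

```python
import re

# The candidate generalizations listed above.
candidates = {
    "exact alternatives": r"(111111|999999)",
    "six identical digits": r"(\d)\1{5}",
    "six ones and nines": r"[19]{6}",
    "any six digits": r"\d{6}",
}

# Hypothetical probe strings: the two examples plus three unseen inputs.
probes = ["111111", "999999", "191919", "222222", "123456"]

def accepted(pattern):
    """Return the probes that the pattern matches in full."""
    return [p for p in probes if re.fullmatch(pattern, p)]

for name, pattern in candidates.items():
    print(f"{name:22} {accepted(pattern)}")

def found(pattern, text):
    """Report whether the pattern matches anywhere inside text."""
    return re.search(pattern, text) is not None

# The two "boundary" variants also disagree with each other:
# \b needs a word/non-word transition, and letters count as word characters.
print(found(r"\b\d{6}\b", "id123456x"))           # no boundary between "d" and "1"
print(found(r"(?<!\d)\d{6}(?!\d)", "id123456x"))  # only neighbouring digits are excluded
```

All four plain candidates accept 111111 and 999999, so the two examples alone give no basis for choosing among them.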
If you don't want to list all possible matches, you need a higher-level description. That's exactly what regular expressions are designed to provide. Instead of providing a long list of 6-digit numbers, you simply tell the program to match "any six digits". In regular expression syntax, this becomes \d{6}.
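The contrast between listing matches and describing them can be sketched in a few lines of Python. The helper name `exact_pattern` is hypothetical: a pattern built mechanically from the examples is predictable but matches nothing beyond the list, while the higher-level description covers unseen inputs.

```python
import re

def exact_pattern(examples):
    """Build a pattern that matches exactly the listed strings and nothing else."""
    return "|".join(re.escape(e) for e in examples)

listed = exact_pattern(["111111", "999999"])
described = r"\d{6}"  # the higher-level description: "any six digits"

for text in ["111111", "123456"]:
    print(text,
          bool(re.fullmatch(listed, text)),     # only the listed strings match
          bool(re.fullmatch(described, text)))  # any six digits match
```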
Any method of providing a higher-level description that is as flexible as regular expressions will also be as complex as regular expressions. All tools like RegexBuddy can do is to make it easier to create and test the high-level description. Instead of using the terse regular expression syntax directly, RegexBuddy enables you to use plain English building blocks. But it can't create the high-level description for you, since it can't magically know when it should generalize your examples and when it should not.
It is certainly possible to create a tool that uses sample text along with guidelines provided by the user to generate a regular expression. The hard part in designing such a tool is deciding how it should ask the user for the guiding information it needs, without making the tool harder to learn than regular expressions themselves, and without restricting the tool to common regex jobs or to simple regular expressions.