Performing regex on a stream
Streamflyer can apply regular expressions to character streams.
Disclosure: I'm its author.
You could use a Scanner and its findWithinHorizon method:
Scanner s = new Scanner(new File("thefile"));
String nextMatch = s.findWithinHorizon(yourPattern, 0);
From the API docs on findWithinHorizon:
If horizon is 0, then the horizon is ignored and this method continues to search through the input looking for the specified pattern without bound. In this case it may buffer all of the input searching for the pattern.
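To make the Scanner approach concrete, here is a minimal sketch that collects every match from a Reader by calling findWithinHorizon repeatedly until it returns null. The class name StreamMatchDemo, the helper findAll, and the sample input are made up for illustration. It assumes the pattern never matches the empty string, since an empty match would not advance the scanner.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class StreamMatchDemo {
    // Collects every match of the pattern found in the reader's input.
    // Assumption: the pattern cannot match the empty string.
    static List<String> findAll(Reader input, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            String match;
            // Horizon 0 means "no bound": the scanner may buffer as much
            // input as it needs while searching for the next match.
            while ((match = scanner.findWithinHorizon(pattern, 0)) != null) {
                matches.add(match);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Reader input = new StringReader("id=17; id=204;\nid=9");
        System.out.println(findAll(input, Pattern.compile("id=\\d+")));
    }
}
```

findWithinHorizon ignores the scanner's delimiters, so matches are found regardless of where whitespace or line breaks fall. If the input can be huge and matches are rare, a positive horizon puts a cap on how far the scanner searches, and therefore on how much it buffers, per call.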
A side note: when matching across multiple lines, you might want to look at the constants Pattern.MULTILINE and Pattern.DOTALL.
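A small sketch of what the two flags change, in case it helps. The class PatternFlagsDemo and the helper firstMatch are made-up names, and the sample text is only an illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternFlagsDemo {
    // Returns the first match of regex (compiled with flags) in text,
    // or null if there is none.
    static String firstMatch(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line";
        // Without DOTALL, '.' stops at the line terminator.
        System.out.println(firstMatch("first.*", 0, text));
        // With DOTALL, '.' also matches '\n', so the match spans both lines.
        System.out.println(firstMatch("first.*", Pattern.DOTALL, text));
        // Without MULTILINE, '^' only matches at the very start of the input.
        System.out.println(firstMatch("^second", 0, text));
        // With MULTILINE, '^' also matches right after a line terminator.
        System.out.println(firstMatch("^second", Pattern.MULTILINE, text));
    }
}
```

The flags can be combined with a bitwise OR, for example Pattern.MULTILINE | Pattern.DOTALL.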
Java's regular expression engine looks unsuitable for stream processing.
I would rather advocate another approach based on "derivative combinators".
The researcher Matt Might has published relevant posts about "derivative combinators" on his blog and suggests a Scala implementation here:
- http://matt.might.net/articles/parsing-with-derivatives/
- http://matt.might.net/articles/nonblocking-lexing-toolkit-based-on-regex-derivatives/
For my part, I managed to extend this implementation with a "capture" ability, but I suspect it could have a significant impact on memory consumption.
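To give a feel for why derivatives suit streams, here is a minimal Java sketch of regex matching with Brzozowski derivatives. It is not the Scala implementation from the articles above and has no capture support. The class Re and its methods are made up for illustration. The matcher holds only the current derivative and one character at a time, so the input never needs to be buffered. It does no memoization or deduplication of alternatives, so for some patterns the derivative terms may grow as input is consumed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical regex AST; each node knows its own derivative.
abstract class Re {
    abstract boolean nullable();   // does this regex accept the empty string?
    abstract Re derive(char c);    // regex for the remaining input after consuming c

    static final Re EMPTY = new Re() {   // matches nothing
        boolean nullable() { return false; }
        Re derive(char c) { return this; }
    };
    static final Re EPS = new Re() {     // matches only the empty string
        boolean nullable() { return true; }
        Re derive(char c) { return EMPTY; }
    };
    static Re ch(final char x) {         // matches the single character x
        return new Re() {
            boolean nullable() { return false; }
            Re derive(char c) { return c == x ? EPS : EMPTY; }
        };
    }
    static Re alt(final Re a, final Re b) {   // a | b
        if (a == EMPTY) return b;
        if (b == EMPTY) return a;
        return new Re() {
            boolean nullable() { return a.nullable() || b.nullable(); }
            Re derive(char c) { return alt(a.derive(c), b.derive(c)); }
        };
    }
    static Re seq(final Re a, final Re b) {   // a followed by b
        if (a == EMPTY || b == EMPTY) return EMPTY;
        if (a == EPS) return b;
        if (b == EPS) return a;
        return new Re() {
            boolean nullable() { return a.nullable() && b.nullable(); }
            Re derive(char c) {
                Re left = seq(a.derive(c), b);
                // If a can be empty, c may also be the first character of b.
                return a.nullable() ? alt(left, b.derive(c)) : left;
            }
        };
    }
    static Re star(final Re a) {              // zero or more repetitions of a
        return new Re() {
            boolean nullable() { return true; }
            Re derive(char c) { return seq(a.derive(c), this); }
        };
    }

    // Whole-input match: take one derivative per character read.
    static boolean matches(Re re, Reader in) {
        try {
            int c;
            while ((c = in.read()) != -1) {
                re = re.derive((char) c);
                if (re == EMPTY) return false;   // dead state, stop reading early
            }
            return re.nullable();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // (a|b)*abb
        Re p = seq(star(alt(ch('a'), ch('b'))), seq(ch('a'), seq(ch('b'), ch('b'))));
        System.out.println(matches(p, new StringReader("ababb")));   // true
        System.out.println(matches(p, new StringReader("abab")));    // false
    }
}
```

Adding captures means each derivative has to carry the text matched so far for every open group, which is one way to see where the extra memory consumption would come from.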