Scanner vs. StringTokenizer vs. String.Split
Let's start by eliminating StringTokenizer
. It is getting old and doesn't even support regular expressions. Its documentation states:
StringTokenizer
is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use thesplit
method ofString
or thejava.util.regex
package instead.
So let's throw it out right away. That leaves split()
and Scanner
. What's the difference between them?
For one thing, split()
simply returns an array, which makes it easy to use a foreach loop:
for (String token : input.split("\\s+") { ... }
Scanner
is built more like a stream:
while (myScanner.hasNext()) {
String token = myScanner.next();
...
}
or
while (myScanner.hasNextDouble()) {
double token = myScanner.nextDouble();
...
}
(It has a rather large API, so don't think that it's always restricted to such simple things.)
This stream-style interface can be useful for parsing simple text files or console input, when you don't have (or can't get) all the input before starting to parse.
Personally, the only time I can remember using Scanner
is for school projects, when I had to get user input from the command line. It makes that sort of operation easy. But if I have a String
that I want to split up, it's almost a no-brainer to go with split()
.
They're essentially horses for courses.
Scanner
is designed for cases where you need to parse a string, pulling out data of different types. It's very flexible, but arguably doesn't give you the simplest API for simply getting an array of strings delimited by a particular expression.String.split()
andPattern.split()
give you an easy syntax for doing the latter, but that's essentially all that they do. If you want to parse the resulting strings, or change the delimiter halfway through depending on a particular token, they won't help you with that.StringTokenizer
is even more restrictive thanString.split()
, and also a bit fiddlier to use. It is essentially designed for pulling out tokens delimited by fixed substrings. Because of this restriction, it's about twice as fast asString.split()
. (See my comparison ofString.split()
andStringTokenizer
.) It also predates the regular expressions API, of whichString.split()
is a part.
You'll note from my timings that String.split()
can still tokenize thousands of strings in a few milliseconds on a typical machine. In addition, it has the advantage over StringTokenizer
that it gives you the output as a string array, which is usually what you want. Using an Enumeration
, as provided by StringTokenizer
, is too "syntactically fussy" most of the time. From this point of view, StringTokenizer
is a bit of a waste of space nowadays, and you may as well just use String.split()
.
Split is slow, but not as slow as Scanner. StringTokenizer is faster than split. However, I found that I could obtain double the speed, by trading some flexibility, to get a speed-boost, which I did at JFastParser https://github.com/hughperkins/jfastparser
Testing on a string containing one million doubles:
Scanner: 10642 ms
Split: 715 ms
StringTokenizer: 544ms
JFastParser: 290ms
StringTokenizer was always there. It is the fastest of all, but the enumeration-like idiom might not look as elegant as the others.
split came to existence on JDK 1.4. Slower than tokenizer but easier to use, since it is callable from the String class.
Scanner came to be on JDK 1.5. It is the most flexible and fills a long standing gap on the Java API to support an equivalent of the famous Cs scanf function family.