boost::tokenizer vs boost::split
The best choice depends on a few factors. If you only need to scan the tokens once, then boost::tokenizer is a good choice in both runtime and space performance (those vectors of tokens can take up a lot of space, depending on the input data).
If you're going to scan the tokens often, or need a container with efficient random access, then boost::split into a vector may be the better option.
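To make the two interfaces concrete, here is a minimal sketch of both approaches; the input string and delimiter are placeholders:

#include <boost/algorithm/string.hpp>
#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string const s = "A^B^C";

    // boost::tokenizer: yields one token at a time as you iterate,
    // without building a container of copies.
    boost::char_separator<char> sep("^");
    boost::tokenizer<boost::char_separator<char>> tok(s, sep);
    for (std::string const& t : tok)
        std::cout << t << '\n';

    // boost::split: copies every token into the vector up front,
    // which buys you reuse and O(1) random access.
    std::vector<std::string> parts;
    boost::split(parts, s, boost::is_any_of("^"));
    std::cout << parts[1] << '\n';  // "B"
}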
For example, with your "A^B^C^...^Z" input string, where the tokens are 1 byte in length, the boost::split/vector<string> method will consume at least 2*N-1 bytes. Given the way strings are stored in most STL implementations, you can figure on it taking more than 8x that count. Storing these strings in a vector is costly in terms of both memory and time.
I ran a quick test on my machine, and a similar pattern with 10 million tokens looked like this:
- boost::split = 2.5s and ~620MB
- boost::tokenizer = 0.9s and 0MB
If you're just doing a one-time scan of the tokens, then clearly the tokenizer is better. But, if you're shredding into a structure that you want to reuse during the lifetime of your application, then having a vector of tokens may be preferred.
If you want to go the vector route, then I'd recommend not a vector<string>, but a vector of string::const_iterator pairs instead. Just shred into pairs of iterators and keep your big string of tokens around for reference. For example:
#include <boost/algorithm/string.hpp>  // boost::split, boost::is_any_of
#include <iostream>
#include <string>
#include <utility>
#include <vector>
using namespace std;
// s is the big input string; each pair references characters inside it
vector<pair<string::const_iterator, string::const_iterator>> tokens;
boost::split(tokens, s, boost::is_any_of("^"));
for (auto beg = tokens.begin(); beg != tokens.end(); ++beg) {
    cout << string(beg->first, beg->second) << endl;
}
This improved version takes 1.6s and 390MB on the same server and test. And, best of all, the memory overhead of this vector is linear in the number of tokens -- not dependent in any way on the length of the tokens -- whereas a std::vector<string> stores a copy of every token.
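If spelling out the pair type feels clunky, Boost's string algorithms also accept a container of boost::iterator_range, which expresses the same idea; a sketch of that variant (I haven't benchmarked it):

#include <boost/algorithm/string.hpp>
#include <boost/range/iterator_range.hpp>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string const s = "A^B^C";
    // Each range points into s, so s must outlive tokens.
    std::vector<boost::iterator_range<std::string::const_iterator>> tokens;
    boost::split(tokens, s, boost::is_any_of("^"));
    for (auto const& r : tokens)
        std::cout << std::string(r.begin(), r.end()) << '\n';
}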
I find rather different results using clang++ -O3 -std=c++11 -stdlib=libc++.
First I extracted a text file with ~470k words separated by commas with no newlines into a giant string, like so:
// assumes #include <boost/filesystem.hpp>, <fstream>, <iostream>, <string>
// and: using namespace std; using namespace boost; using namespace boost::filesystem;
path const inputPath("input.txt");
filebuf buf;
buf.open(inputPath.string(), ios::in);
if (!buf.is_open())
    return cerr << "can't open" << endl, 1;  // this snippet lives inside main()
string str(filesystem::file_size(inputPath), '\0');
buf.sgetn(&str[0], str.size());
buf.close();
Then I ran various timed tests, storing results into a pre-sized vector that was cleared between runs. For example:
// assumes #include <boost/algorithm/string.hpp> and using namespace boost;
void vectorStorage(string const& str)
{
    static size_t const expectedSize = 471785;
    vector<string> contents;
    contents.reserve(expectedSize + 1);
    ...
    {
        timed _("split is_any_of");
        split(contents, str, is_any_of(","));
    }
    if (expectedSize != contents.size()) throw runtime_error("bad size");
    contents.clear();
    ...
}
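The elided runs follow the same shape. Here is my reconstruction of the "hand-coded" and "split is_from_range" variants that appear in the results below -- a guess at the shape, not the author's exact code:

// Hypothetical reconstruction; relies on the same str/contents/timed
// context as vectorStorage above.
{
    timed _("hand-coded");
    size_t pos = 0;
    for (;;)
    {
        size_t comma = str.find(',', pos);
        if (comma == string::npos)
        {
            contents.emplace_back(str, pos);  // final token
            break;
        }
        contents.emplace_back(str, pos, comma - pos);
        pos = comma + 1;
    }
}
contents.clear();
{
    timed _("split is_from_range");
    split(contents, str, is_from_range(',', ','));
}
contents.clear();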
For reference, the timer is just this:
#include <chrono>
#include <iomanip>
#include <iostream>
#include <ratio>
#include <string>
using namespace std;

struct timed
{
    // prints the elapsed time when the guard goes out of scope
    ~timed()
    {
        auto duration = chrono::duration_cast<chrono::duration<double, ratio<1, 1000>>>(
            chrono::high_resolution_clock::now() - start_);
        cout << setw(40) << right << name_ << ": " << duration.count() << " ms" << endl;
    }

    timed(string name = "") :
        name_(name)
    {}

    chrono::high_resolution_clock::time_point const start_ = chrono::high_resolution_clock::now();
    string const name_;
};
I also clocked a single iteration (no vector). Here are the results:
Vector:
    hand-coded:                    54.8777 ms
    split is_any_of:               67.7232 ms
    split is_from_range:           49.0215 ms
    tokenizer:                    119.37 ms

One iteration:
    tokenizer:                     97.2867 ms
    split iterator:                26.5444 ms
    split iterator back_inserter:  57.7194 ms
    split iterator char copy:      34.8381 ms
The tokenizer is so much slower than split that the one-iteration figure doesn't even include the string copy:
{
    string word;
    word.reserve(128);
    timed _("tokenizer");
    boost::char_separator<char> sep(",");
    boost::tokenizer<boost::char_separator<char>> tokens(str, sep);
    for (auto range : tokens)  // walk the tokens; nothing is copied into word
    {}
}
{
    string word;
    timed _("split iterator");
    for (auto it = make_split_iterator(str, token_finder(is_from_range(',', ',')));
         it != decltype(it)(); ++it)
    {
        word = move(copy_range<string>(*it));
    }
}
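For completeness, the "char copy" row presumably assigns the characters straight into the reused string instead of going through copy_range; my guess at its shape:

{
    string word;
    word.reserve(128);  // reuse one buffer across all tokens
    timed _("split iterator char copy");
    for (auto it = make_split_iterator(str, token_finder(is_from_range(',', ',')));
         it != decltype(it)(); ++it)
    {
        word.assign(it->begin(), it->end());  // no temporary string
    }
}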
Unambiguous conclusion: use split.
It might depend on your version of Boost and on how you're using the functionality.
We had a performance issue in some logic that used boost::split from Boost 1.41.0 to handle thousands or hundreds of thousands of smaller strings (each expected to have fewer than 10 tokens). When I ran the code through a performance analyzer, we found that a surprising 39% of the time was spent in boost::split.
We tried some simple "fixes" that didn't materially affect performance, like "we know we won't have more than 10 items on each pass, so preset the vector to 10 items".
Since we didn't actually need the vector and could just iterate the tokens to accomplish the same job, we changed the code to boost::tokenizer, and the same section of code dropped to <1% of the runtime.
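In outline, the change looked something like this; the delimiter and handle() are placeholders, not our actual code:

#include <boost/algorithm/string.hpp>
#include <boost/tokenizer.hpp>
#include <string>
#include <vector>

void handle(std::string const& token);  // per-token work (placeholder)

// Before: every call built a heap-allocated vector of token copies.
void parseWithSplit(std::string const& line)
{
    std::vector<std::string> fields;
    fields.reserve(10);  // the pre-sizing "fix" that didn't help
    boost::split(fields, line, boost::is_any_of("|"));
    for (std::string const& f : fields)
        handle(f);
}

// After: iterate the tokens directly; no vector is ever built.
void parseWithTokenizer(std::string const& line)
{
    boost::char_separator<char> sep("|");
    boost::tokenizer<boost::char_separator<char>> tok(line, sep);
    for (std::string const& f : tok)
        handle(f);
}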