Sometimes you get things wrong

A few years ago, Sean Parent challenged me to provide an implementation of Boyer-Moore searching in C++. I did that, first in boost and then, later as part of the proposed Library Fundamentals Technical Specification.

The idea here is that you have a searcher object, which encapsulates the actual searching. You construct it with the pattern that you want to search for, and then you call the searchers operator() with the corpus that you want to search, and it will return to you the start of the pattern in the corpus, if it exists, and the end of the corpus, if it does not (this is the same set of rules that std::search follows).

But this weekend I realized that this is not the right thing to return. The searcher (and std::search for that matter) should return a “range” (ok, a pair of iterators) denoting the beginning and end of the pattern in the corpus. (Yes, you can get the end of the pattern by incrementing the returned iterator by the length of the pattern, but that’s an O(N) operation if you only have forward iterators.

I came to this realization when I was trying to write a split algorithm. I needed to find a pattern in a sequence (hello, searchers) repeatedly. But I quickly ran aground, because I needed to advance my internal iterators by the size of the pattern. I looked at the searchers I had written for libc++, and sure enough, all that information was there; it just had to be returned. So, as an experiment, I changed the searcher implementation to return a pair<Iterator, Iterator> instead of just an Iterator.

Now my split algorithm is very simple.

It assumes that the searcher returns pair(last, last) when the pattern is not found.

template <typename InIterator, typename Searcher, typename OutIter>
OutIter split(InIterator first, InIterator last, const Searcher &s, OutIter out)
    while (first != last)
        std::pair<InIterator, InIterator> found = s(first, last);
        // We've got three cases here:
        //  The pattern is found in the middle of the input; output a chunk, and go around again
        //  The pattern doesn't exist: output the rest of the input and terminate.
        //  The pattern is found at the end of the input, output an empty chunk and terminate.
        *out++ = std::make_pair(first, found.first);
        if (found.second == last && found.first != found.second) // we found the pattern at the end; output an empty match
            *out++ = std::make_pair(last, last);
        first = found.second;
    return out;

It takes a pair of iterators, which are the corpus to be searched, a searcher (which defines the pattern to be searched for), and an iterator to write the results to.

The moral of this story (again) is that your algorithms should not throw away useful information. This is the second time this has bit me (see my post on copy_while from a couple of years ago); I wonder how many more times I will make this mistake before I finally learn.

[Later] I realized that I wasn’t being clear above. The iterator calculation that you want to avoid is determining the distance between the patterns, not the length of the pattern. So, if you have a pattern of '*', and a source of "ABCD*EFG*HIJKL*MNOP", you can avoid calculating the distance between successive asterisks.



7 thoughts on “Sometimes you get things wrong

  1. Pingback: Raw loops vs. STL algorithms-IT大道

  2. Dave Thomas

    Should ‘out’ be a container of results? What constrains/bounds writes to ‘out’? (Valid answer: Left as an exercise for the reader.)

  3. Marshall Clow Post author

    Actually, it’s a good question. The two approaches that I think are best are either (a) an output iterator that the algorithm writes to, or (b) a “functor” that gets called for each iterator pair generated.

    Before we had lambdas, I would have said that (a) was the best choice. Now I’m less sure.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s