Marshall's C++ Musings

Sometimes you get things wrong

Advertisements

A few years ago, Sean Parent challenged me to provide an implementation of Boyer-Moore searching in C++. I did that, first in boost and then, later as part of the proposed Library Fundamentals Technical Specification.

The idea here is that you have a searcher object, which encapsulates the actual searching. You construct it with the pattern that you want to search for, and then you call the searchers operator() with the corpus that you want to search, and it will return to you the start of the pattern in the corpus, if it exists, and the end of the corpus, if it does not (this is the same set of rules that std::search follows).

But this weekend I realized that this is not the right thing to return. The searcher (and std::search for that matter) should return a “range” (ok, a pair of iterators) denoting the beginning and end of the pattern in the corpus. (Yes, you can get the end of the pattern by incrementing the returned iterator by the length of the pattern, but that’s an O(N) operation if you only have forward iterators.

I came to this realization when I was trying to write a split algorithm. I needed to find a pattern in a sequence (hello, searchers) repeatedly. But I quickly ran aground, because I needed to advance my internal iterators by the size of the pattern. I looked at the searchers I had written for libc++, and sure enough, all that information was there; it just had to be returned. So, as an experiment, I changed the searcher implementation to return a pair<Iterator, Iterator> instead of just an Iterator.

Now my split algorithm is very simple.

It assumes that the searcher returns pair(last, last) when the pattern is not found.

template <typename InIterator, typename Searcher, typename OutIter>
OutIter split(InIterator first, InIterator last, const Searcher &s, OutIter out)
{
    while (first != last)
    {
        std::pair<InIterator, InIterator> found = s(first, last);
        // We've got three cases here:
        //  The pattern is found in the middle of the input; output a chunk, and go around again
        //  The pattern doesn't exist: output the rest of the input and terminate.
        //  The pattern is found at the end of the input, output an empty chunk and terminate.
        *out++ = std::make_pair(first, found.first);
        if (found.second == last && found.first != found.second) // we found the pattern at the end; output an empty match
            *out++ = std::make_pair(last, last);
        first = found.second;
    }
    return out;
}

It takes a pair of iterators, which are the corpus to be searched, a searcher (which defines the pattern to be searched for), and an iterator to write the results to.

The moral of this story (again) is that your algorithms should not throw away useful information. This is the second time this has bit me (see my post on copy_while from a couple of years ago); I wonder how many more times I will make this mistake before I finally learn.

[Later] I realized that I wasn’t being clear above. The iterator calculation that you want to avoid is determining the distance between the patterns, not the length of the pattern. So, if you have a pattern of '*', and a source of "ABCD*EFG*HIJKL*MNOP", you can avoid calculating the distance between successive asterisks.

 

Advertisements