Given a word and a text, we need to return the occurrences of anagrams

Essentially you can slide a window of the length of your word over your input and keep a count of how many of each letter are in the window. When the letter counts in your sliding window match the letter counts of your word, you have a match.

Let your word length be n, and your current position be curr. Create an array, or vector, windCounts of length 26. The entry windCounts[i] stores the number of occurrences of the ith letter of the alphabet seen from position curr - n - 1 to curr.

What you do is you advance curr, and keep your array windCounts up to date, by decrementing the letter that has dropped out of the back of the sliding window, and incrementing the letter count that has appeared in the front of the sliding window. (Obviously until curr > n, you only increment, you just build up your sliding window to the length of your word.)

In C++, you can use a vector for the counts of letters in your word, and for the counts of letters in your sliding window and simply use vector::operator== to do the equality.

Edit: the algorithm is O(N), where N is the length of the text to search. This can be seen from the code below where the loop body is executed for each letter that you slide the window.

#include <string>
#include <vector>
#include <algorithm> // for_each 

using std::string;
using std::vector;

#include <iostream>

int main(int argc, char* argv[])
{
    const string text = "forxxorfxdofr";
    const string word = "for"; 

    // Counts of letters in word
    vector<int> wordCounts(256); // optimization: cut down from 256 to 26 
    std::for_each(word.begin(), word.end(), 
        [&] (char c) { wordCounts[c]++; } );

    // Current position of end of sliding window
    string::const_iterator curr = text.begin() + word.size();
    // Initial sliding window counts
    vector<int> windCounts(256);
    std::for_each(text.begin(), curr,
        [&] (char c) { windCounts[c]++; } );

    // Run sliding window over text
    int numMatches = 0;
    while (1) {
        numMatches += wordCounts == windCounts;
        if (curr == text.end()) {
            break;
        }
        windCounts[*(curr - word.size())]--;
        windCounts[*curr]++;
        ++curr;
    }

    std::cout << numMatches << "\n";

    return 0;
}

You can simply look for the character count.

Say for example that you're looking for anagramms of look. So, you're looking for:

  • a 4 charachter length word,
  • with 1 l, 2 o and 1 k.

Simply process the first 4 letters, store the counts. Check whether you have a match. Add the next character (increment), remove the old character (decrement). Check again. And so on...


TooTone's O(n) solution suffers from having to compare two 256-element vectors for each character of the input text. This can be avoided by tracking the number of positions at which the two vectors differ, and registering a match when this number goes to zero. In fact, we don't even need to store two different vectors at all, since we can just store one vector containing their difference.

Here's my version implementing these optimizations. It's written in plain old C, but should work under C++ with appropriate adjustments:

#include <stdio.h>
#include <limits.h> /* for UCHAR_MAX (usually 255) */

int find_anagrams (char *word, char *text) {
    int len = 0;           /* length of search word */
    int bin[UCHAR_MAX+1];  /* excess count of each char in last len chars of text */
    int mismatch = 0;      /* count of nonzero values in bins[] */
    int found = 0;         /* number of anagrams found */
    int i;                 /* generic loop counter */

    /* initialize bins */
    for (i = 0; i <= UCHAR_MAX; i++) bin[i] = 0;
    for (i = 0; word[i] != '\0'; i++) {
        unsigned char c = (unsigned char) word[i];
        if (bin[c] == 0) mismatch++;
        bin[c]--;
        len++;  /* who needs strlen()? */
    }

    /* iterate through text */
    for (i = 0; text[i] != '\0'; i++) {
        /* add next char in text to bins, keep track of mismatch count */
        unsigned char c = (unsigned char) text[i];
        if (bin[c] == 0) mismatch++;
        if (bin[c] == -1) mismatch--;
        bin[c]++;

        /* remove len-th previous char from bins, keep track of mismatch count */
        if (i >= len) {
            unsigned char d = (unsigned char) text[i - len];
            if (bin[d] == 0) mismatch++;
            if (bin[d] == 1) mismatch--;
            bin[d]--;
        }

        /* if mismatch count is zero, we've found an anagram */
        if (mismatch == 0) {
            found++;
#ifdef DEBUG
            /* optional: print each anagram found */
            printf("Anagram found at position %d: \"", i-len+1);
            fwrite(text+i-len+1, 1, len, stdout);
            printf("\"\n");
#endif
        }
    }
    return found;
}


int main (int argc, char *argv[]) {
    if (argc == 3) {
        int n = find_anagrams(argv[1], argv[2]);
        printf("Found %d anagrams of \"%s\" in \"%s\".\n", n, argv[1], argv[2]);
        return 0;
    } else {
        fprintf(stderr, "Usage: %s <word> <text>\n", (argc ? argv[0] : "countanagrams"));
        return 1;
    }
}

Tags:

Algorithm

C++