Decomposing a discrete signal into a sum of rectangle functions

As discussed in the comments, here is an idea for a greedy algorithm. Basically, the goal is to iteratively find the rectangle that will minimize $|w-w'|$ given the current residual signal. A combination of a maximum subarray and a minimum subarray algorithm finds a rectangle that comes pretty close, if you use the average over the resulting subarray as $a_i$. The only problem is that values below that average (or above, for negative rectangles) should already count as negative (positive), which is a bit difficult to explain: for example, in the subarray $(4, 1, 4)$ with average $3$, the middle sample contributes $+1$ to the subarray sum, yet subtracting $3$ actually increases its error from $1$ to $2$. I don't know if there is a simple way to solve this. One way to reduce this problem is to trim the rectangle if that improves the result. Another is to limit the rectangle length, as longer rectangles tend to have a higher percentage of values below the average. There may be better options, possibly depending on the shape of the signal.

Anyway, I needed to experiment with this to figure out whether it actually works, so here is some code (in C, but should be easy to port):

/* Type of index into signal. */
typedef unsigned int Index;

/* This should be a floating-point type. If you decide to use integers, change
   the division to round away from zero instead of towards zero. */
typedef double Value;

/* Represents a rectangle at [start,end). */
typedef struct {
    Index start, end;
    Value height;
} Rectangle;

/* Trims the rectangle by removing samples whose absolute value will increase
   when preliminaryHeight is subtracted. */
static Index optimizeRectangle(const Value signal[], Rectangle *rectangle, Value preliminaryHeight)
{
    Index result = 0;

    /* Divided into two cases for better efficiency. */
    if (preliminaryHeight >= 0) {
        while (rectangle->start + 1 < rectangle->end) {
            Value value = signal[rectangle->start];
            if (preliminaryHeight - value < value) {
                break;
            }
            rectangle->start++;
            rectangle->height -= value;
            result++;
        }
        while (rectangle->start + 1 < rectangle->end) {
            Value value = signal[rectangle->end - 1];
            if (preliminaryHeight - value < value) {
                break;
            }
            rectangle->end--;
            rectangle->height -= value;
            result++;
        }
    } else {
        while (rectangle->start + 1 < rectangle->end) {
            Value value = signal[rectangle->start];
            if (preliminaryHeight - value > value) {
                break;
            }
            rectangle->start++;
            rectangle->height -= value;
            result++;
        }
        while (rectangle->start + 1 < rectangle->end) {
            Value value = signal[rectangle->end - 1];
            if (preliminaryHeight - value > value) {
                break;
            }
            rectangle->end--;
            rectangle->height -= value;
            result++;
        }
    }

    return result;
}

/* Finds the rectangle that seems most appropriate at the moment, and whose
   length is at most maxResultLength.
   Returns the original length of the rectangle, before optimization. This
   value should be used for maxResultLength in the next iteration. If it is
   zero, the entire signal was zero. */
static Index findRectangle(const Value signal[], Index signalLength, Rectangle *result, Index maxResultLength)
{
    result->start = result->end = 0;
    result->height = 0;
    Value resultHeightAbs = 0;

    /* Start index and height of current maximum and minimum subarrays. */
    Index posStart = 0, negStart = 0;
    Value posHeight = 0, negHeight = 0;

    for (Index index = 0; index < signalLength; index++) {
        /* If the length of a subarray exceeds maxResultLength, backtrack
           to find the next best start index. */
        Index nextIndex = index + 1;
        if (nextIndex - posStart > maxResultLength && index > 0) {
            Index oldStart = posStart;
            posStart = index;
            posHeight = 0;
            Value newHeight = 0;
            for (Index newStart = index - 1; newStart > oldStart; newStart--) {
                newHeight += signal[newStart];
                if (newHeight > posHeight) {
                    posStart = newStart;
                    posHeight = newHeight;
                }
            }
        }
        if (nextIndex - negStart > maxResultLength && index > 0) {
            Index oldStart = negStart;
            negStart = index;
            negHeight = 0;
            Value newHeight = 0;
            for (Index newStart = index - 1; newStart > oldStart; newStart--) {
                newHeight += signal[newStart];
                if (newHeight < negHeight) {
                    negStart = newStart;
                    negHeight = newHeight;
                }
            }
        }

        /* Extend the subarrays. */
        Value value = signal[index];
        posHeight += value;
        negHeight += value;

        /* Throw away the entire maximum subarray if it is negative or zero. */
        if (posHeight <= 0) {
            posStart = nextIndex;
            posHeight = 0;
        } else {
            /* Update result if current maximum subarray is better. */
            if (posHeight > resultHeightAbs) {
                result->start = posStart;
                result->end = nextIndex;
                result->height = posHeight;
                resultHeightAbs = posHeight;
            }
        }
        /* Throw away the entire minimum subarray if it is positive or zero. */
        if (negHeight >= 0) {
            negStart = nextIndex;
            negHeight = 0;
        } else {
            /* Update result if current minimum subarray is better. */
            Value negHeightAbs = -negHeight;
            if (negHeightAbs > resultHeightAbs) {
                result->start = negStart;
                result->end = nextIndex;
                result->height = negHeight;
                resultHeightAbs = negHeightAbs;
            }
        }
    }

    Index resultLength = result->end - result->start;
    if (!resultLength) {
        /* Return now to avoid division by zero. */
        return 0;
    }

    /* Trim rectangle. */
    Value normalizedHeight;
    Index trimLength;
    do {
        normalizedHeight = result->height / (result->end - result->start);
        trimLength = optimizeRectangle(signal, result, normalizedHeight);
    } while (trimLength);

    /* Normalize height. */
    result->height = normalizedHeight;

    return resultLength;
}

/* Subtracts a rectangle from a signal. */
static void subtractRectangle(Value signal[], const Rectangle *rectangle)
{
    for (Index index = rectangle->start; index < rectangle->end; index++) {
        signal[index] -= rectangle->height;
    }
}

/* Decomposes a signal into rectangles. Stores at most resultLength rectangles
   in result, and returns the actual number of rectangles written. All
   rectangles are subtracted from the signal. */
unsigned int extractRectangles(Value signal[], Index signalLength, Rectangle result[], unsigned int resultLength)
{
    Index rectangleLength = signalLength;
    for (unsigned int resultIndex = 0; resultIndex < resultLength; resultIndex++) {
        Rectangle *rectangle = &(result[resultIndex]);
        rectangleLength = findRectangle(signal, signalLength, rectangle, rectangleLength);
        if (!rectangleLength) {
            return resultIndex;
        }
        subtractRectangle(signal, rectangle);
    }
    return resultLength;
}

The main function is extractRectangles. It returns the rectangles as an array, but this can be adapted easily. Have fun.
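
For completeness, here is a minimal usage sketch (the example signal values and the buffer size are arbitrary placeholders):

#include <stdio.h>

int main(void)
{
    /* Toy signal: a noisy step, just for illustration. */
    Value signal[] = {0.1, -0.2, 0.0, 2.1, 1.9, 2.2, 2.0, 0.1, -0.1, 0.0};
    Index signalLength = sizeof(signal) / sizeof(signal[0]);

    /* Extract at most 8 rectangles; the signal is reduced to the residual. */
    Rectangle rectangles[8];
    unsigned int count = extractRectangles(signal, signalLength, rectangles, 8);

    for (unsigned int i = 0; i < count; i++) {
        printf("[%u, %u) height %f\n", rectangles[i].start, rectangles[i].end, rectangles[i].height);
    }
    return 0;
}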


Dear Raskolnikov, Brian and Sebastian, thank you very much for sharing your ideas.

I cross-posted my question in two forums, so you can check the complete discussion by looking here and here.

From the discussion I got three main elements:

  1. Haar wavelets provide a solution to my problem, although a very suboptimal one
  2. This problem could be solved via an approach similar to matching pursuit (but the full enumeration of the dictionary of bases needs to be avoided, because it is too large in my problem)
  3. Sebastian Reichelt provided code for an approach based on maximum and minimum subarray search

All in all, I ended up with four different solutions to the presented problem:

  1. explore_a indicates my original solution suggested in the question (quantizing the $a_i$ values and solving multiple maximum subarray problems)
  2. brute_force indicates the approach based on approximating $\left|w' - w\right|$ by $\left(w' - w\right)^2$. In this approach, it turns out that each rectangle possibility can be evaluated with only two memory reads, one multiplication and one division, making a brute force exploration tractable (see the sketch after this list).
  3. reichelt indicates a port of Sebastian's code to Python (which I used to test the ideas)
  4. abs_max indicates an approach where at each iteration a rectangle of a single element, placed at the maximum absolute value of the signal, is selected.
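
To make point 2 above a bit more concrete, here is a simplified C sketch of the brute_force scoring idea, reusing Sebastian's Value/Index/Rectangle types (the function name and the exhaustive double loop are just for illustration; the actual implementation is the Python code linked below). For an interval $[i,j)$ with sum $S$ and length $L = j - i$, the optimal height is the mean $S/L$ and the resulting reduction of the squared error is $S^2/L$, so with precomputed prefix sums each candidate really needs only two memory reads, one multiplication and one division:

/* Exhaustively score every interval by the squared-error reduction S*S/L that
   subtracting its mean would achieve, and return the best one. */
static void bestRectangleBySquaredError(const Value signal[], Index signalLength, Rectangle *best)
{
    /* prefix[k] = signal[0] + ... + signal[k-1] */
    Value prefix[signalLength + 1];
    prefix[0] = 0;
    for (Index k = 0; k < signalLength; k++) {
        prefix[k + 1] = prefix[k] + signal[k];
    }

    Value bestScore = 0;
    best->start = best->end = 0;
    best->height = 0;
    for (Index i = 0; i < signalLength; i++) {
        for (Index j = i + 1; j <= signalLength; j++) {
            Value sum = prefix[j] - prefix[i];   /* two memory reads */
            Value score = sum * sum / (j - i);   /* one multiplication, one division */
            if (score > bestScore) {
                bestScore = score;
                best->start = i;
                best->end = j;
                best->height = sum / (j - i);
            }
        }
    }
}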

All the code implementing these methods and some example results are available at http://lts.cr/Hzg

At each run the code will generate a new "random" signal (with a noisy step). An example signal (labeled a[0:n]) and the first rectangle selected by each method can be seen below.

Example signal: http://lts.cr/adL

Then each method is run iteratively to approximate the input signal, until the approximation error is very low or the maximum number of iterations has been reached.
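
Schematically, the evaluation loop looks like the sketch below (written against Sebastian's C functions for concreteness; in the experiments each method provides its own equivalent of findRectangle, and errorThreshold/maxIterations are arbitrary parameters):

/* Greedily subtract one rectangle per iteration until the residual L1 error is
   small enough or the iteration budget is exhausted; returns the number of
   iterations actually performed. */
static unsigned int approximate(Value signal[], Index signalLength, Value errorThreshold, unsigned int maxIterations)
{
    Index rectangleLength = signalLength;
    unsigned int iteration;
    for (iteration = 0; iteration < maxIterations; iteration++) {
        Value error = 0;
        for (Index k = 0; k < signalLength; k++) {
            error += signal[k] < 0 ? -signal[k] : signal[k];
        }
        if (error <= errorThreshold) {
            break;
        }
        Rectangle rectangle;
        rectangleLength = findRectangle(signal, signalLength, &rectangle, rectangleLength);
        if (!rectangleLength) {
            break;
        }
        subtractRectangle(signal, &rectangle);
    }
    return iteration;
}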

A typical result can be seen below (and another one here).

Example result 1: http://lts.cr/adT

After running the code multiple times (for different random signals), the behavior seems consistent:

  1. Unsurprisingly, abs_max reaches convergence in a number of iterations equal to the signal length. It does so quite fast (exponential decay).
  2. explore_a decreases the energy fast initially and then tends to stagnate.
  3. reichelt is consistently worse than explore_a, getting stuck at a higher energy level. I hope I did not make any dumb mistakes in the port from C to Python. By visual inspection, the first rectangle selected seems reasonable.
  4. brute_force is consistently the method that decreases the energy the fastest. It also consistently intersects the abs_max solution, which indicates that a better strategy would be to switch from one method to the other.

Obviously the exact behavior changes from run to run and would certainly change depending on the method used to generate the "random" signal. However, I believe these initial results are a good indicator. I feel that it is reasonable now to proceed to generate my real data, and evaluate how well/fast brute_force and explore_a run there.

Feel free to add comments or play around with the code.

Again, thank you very much for your inputs and insights!


I experimented with your code a bit, and found so many things that I figured I should just post another answer:

  • As stated in a comment, you made a mistake when you ported my code. (BTW, I meant "convergence" instead of "conversion," of course).

  • The explore_a function doesn't do what it says on the tin (so you managed to break your own code as well, I guess :-)). Since it directly uses the value of the maximum subarray as the score, only the two instances of a_value closest to zero are ever taken. You need to calculate the effect on $|w-w'|$ instead.

  • It doesn't quite match your description either: The true effect of a_value is that it turns positive samples below a_value into negative samples (and negates the array if a_value is negative). However, this is actually a nice solution for the problem that I tried to describe in my first answer. So I decided to fix the code; now it seems to work really well (see the sketch after this list).

  • I also think that zero should always be tried as an a_value, both with and without negating the signal (which is equivalent to finding the maximum and minimum subarray, similarly to my first algorithm).

  • This can be combined both with length restriction and trimming, as described in my first answer. Length restriction turns out to produce worse results, and it is also rather expensive. The only positive effect is that it degrades to the "maximum absolute value" algorithm once the maximum length becomes 1, so it reaches convergence more quickly. Trimming tends to help a bit, but (as in every greedy algorithm) a better result in one step can lead to worse results in the following steps.
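
To make the scoring fix concrete, here is a small sketch in the style of my first answer (this is only meant to illustrate the idea for one fixed a_value of either sign; the function name is made up and the actual fix lives in the Python code): covering sample $k$ with a rectangle of height $a$ changes its contribution to the error from $|w_k|$ to $|w_k - a|$, so its exact per-sample gain is $|w_k| - |w_k - a|$, and a maximum-subarray pass over these gains finds the best interval for that a_value.

/* For a fixed height a, find the interval that maximizes the reduction in
   |w - w'|. The gain of covering sample k is |w[k]| - |w[k] - a| (negative for
   samples that the rectangle would make worse), and a Kadane-style
   maximum-subarray pass over these gains picks the interval. */
static Value bestIntervalForHeight(const Value signal[], Index signalLength, Value a, Rectangle *best)
{
    best->start = best->end = 0;
    best->height = a;

    Value bestGain = 0;
    Index start = 0;
    Value gainSum = 0;
    for (Index k = 0; k < signalLength; k++) {
        Value value = signal[k];
        Value absBefore = value < 0 ? -value : value;
        Value residual = value - a;
        Value absAfter = residual < 0 ? -residual : residual;
        gainSum += absBefore - absAfter;

        if (gainSum <= 0) {
            /* Covering [start,k] with height a would not decrease the error;
               restart the subarray. */
            start = k + 1;
            gainSum = 0;
        } else if (gainSum > bestGain) {
            bestGain = gainSum;
            best->start = start;
            best->end = k + 1;
        }
    }
    return bestGain;
}

The outer loop then simply tries all candidate a_values (for example the quantized levels from explore_a) and keeps the (a_value, interval) pair with the largest gain.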

Here is an updated graph (of a different signal, since unfortunately you did not use fixed random seeds, and with different order and colors because they are chosen automatically):

[Updated graph, signal length 100]

explore_a is hidden behind explore_a_opt (i.e., with trimming), and explore_a_len (with length restriction) is hidden behind explore_a_len_opt.

Looking at this graph, a few more things come to mind:

  • The fixed explore_a seems to produce better results than brute_force (which is actually a misleading name because it is really still a greedy algorithm, just with brute force in each step). I guess this means that brute_force doesn't quite do what it is supposed to either.

  • The test signal isn't particularly well-suited for approximation using rectangles. It is essentially white noise with two rectangles added to it. Once the algorithms have "found" these rectangles, they can't do much better than abs_max any more.

  • It probably makes most sense to compare the algorithms at a number of steps that is significantly lower than the signal length. (Otherwise, what's the point?)

So here is a graph for a signal of length 500 with more low-frequency content, showing the first 100 steps:

[New graph, signal length 500]

Finally, since all algorithms are greedy, I think you should compare them to something like simulated annealing as well (as I mentioned in the comments).