Unit Testing Machine Learning Code

You may not appreciate the irony, but basically what you have there is legacy code: a chunk of software without any unit tests. Naturally you don't know where to begin. So you may find it helpful to read up on handling legacy code.

The definitive thought on this is Michael Feather's book, Working Effectively with Legacy Code. There used to be a helpful summary​ of that on the ObjectMentor site, but alas the website has gone the way of the company. However WELC has left a legacy in reviews and other articles. Check them out (or just buy the book), although the key lessons are the ones which S.Lott and tvanfosson cover in their replies.


2019 update: I have fixed the link to the WELC summary with a version from the Wayback Machine web archive (thanks @milia).

Also - and despite knowing that answers which comprise mainly links to other sites are low quality answers :) - here is a link to a new (2019 new) Google tutorial on Testing and Debugging ML code. I hope this will be of illumination to future Seekers who stumble across this answer.


Your unit tests need to employ some kind of fuzz factor, either by accepting approximations, or using some kind of probabilistic checks.

For example, if you have some function that returns a floating point result, it is almost impossible to write a test that works correctly across all platforms. Your checks would need to perform the approximation.

TEST_ALMOST_EQ(result, 4.0);

Above TEST_ALMOST_EQ might verify that result is between 3.9 and 4.1 (for example).

Alternatively, if your machine learning algorithms are probabilistic, your tests will need to accommodate for it by taking the average of multiple runs and expecting it to be within some range.

x = 0;
for (100 times) {
  x += result_probabilistic_test();
}

avg = x/100;
TEST_RANGE(avg, 10.0, 15.0);

Ofcourse, the tests are non-deterministic, so you will need to tune them such that you can get non-flaky tests with a high probability. (E.g., increase the number of trials, or increase the range of error).

You can also use mocks for this (e.g, a mock random number generator for your probabilistic algorithms), and they usually help for deterministically testing specific code paths, but they are a lot of effort to maintain. Ideally, you would use a combination of fuzzy testing and mocks.

HTH.


Without seeing your code, it's hard to tell, but I suspect that you are attempting to write tests at too high a level. You might want to think about breaking your methods down into smaller components that are deterministic and testing these. Then test the methods that use these methods by providing mock implementations that return predictable values from the underlying methods (which are probably located on a different object). Then you can write tests that cover the domain of the various methods, ensuring that you have coverage of the full range of possible outcomes. For the small methods you do so by providing values that represent the domain of inputs. For the methods that depend on these, by providing mock implementations that return the range of outcomes from the dependencies.


"then I'm not really testing out everything that could go wrong."

Correct.

The job of unit tests is not to test everything that could go wrong.

The job of unit tests is to test that what you have does the right thing, given specific inputs and specific expected results. The important part here is the specific visible, external requirements are satisfied by specific test cases. Not that every possible thing that could go wrong is somehow prevented.

Nothing can test everything that could go wrong. You can write a proof, but you'll be hard-pressed to write tests for everything.

Choose your test cases wisely.

Further, the job of unit tests is to test that each small part of the overall application does the right thing -- in isolation.

Your "code that approximated a curve with a lower-resolution curve" for example, probably has several small parts that can be tested as separate units. In isolation. The integrated whole could also be tested to be sure that it works.

Your "computationally intensive work on the lower-resolution curve" for example, probably has several small parts that can be tested as separate units. In isolation.

That point of unit testing is to create small, correct units that are later assembled.