efficient 2d mean filter implementation that minimises redundant memory loads?
What you are doing is called Convolution. You convolve the multidimensional data with a smaller kernel of the same number of dimensions. It is a very common task, and there are plenty of libraries for it.
A fast solution (depending on the kernel size) is to calculate the convolution in the frequency domain. You calculate the (multidimensional) FFT of both data and kernel, multiply them, and calculate the inverse FFT. You will find libraries optimized to do just that, eg. for Python there is scipy.ndimage.filters.convolve and scipy.signal.fftconvolve.
Tiling is a common image processing technique to optimize low-level memory access. You allocate square tiles (or cubes) that fit well into the CPU cache. When you access neighbouring pixels they will be close together in memory most of the time. Looping over the whole array gets a bit tricky, though.
For further reading I reccommend the paper Why Modern CPUs Are Starving and What Can Be Done about It, which mentions this memory blocking technique, and points to numerical libraries that implement it.
And finally there is the Integral Image that allows you to calculate the average of an arbitrary rectangle/cuboid with just a very small number of memory accesses.
For the case of a 2D mean filter I would maintain column totals which could then be reused so that for each iteration you only calculate one new column total and then sum the column totals to get the mean. E.g. for a 3x3 mean:
for (i = 1; i < M - 1; ++i)
{
// init first two column sums
col0 = a[i - 1][0] + a[i][0] + a[i + 1][0];
col1 = a[i - 1][1] + a[i][1] + a[i + 1][1];
for (j = 1; j < N - 1; ++j)
{
// calc new col sum
col2 = a[i - 1][j + 1] + a[i][j + 1] + a[i + 1][j + 1];
// calc new mean
mean[i][j] = (col0 + col1 + col2) / 9;
// shuffle col sums
col0 = col1;
col1 = col2;
}
}
This results in only 3 loads per point, rather than 9 as in the naive case, but is still not quite optimal.
You can optimise this further by processing two rows per iteration and maintaining overlapping column sums for rows i and i + 1.