What is the best way to calculate the number of padding bytes?

pad = (-size)&3;

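This should be the fastest: it is one negation and one bitwise AND, with no division and no branch. It gives the number of bytes needed to round size up to the next multiple of 4.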

size 0: pad 0
size 1: pad 3
size 2: pad 2
size 3: pad 1
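
For anyone who wants to drop this into real code, here is a minimal sketch in C. The names pad4 and padded_size4 are mine, not from the question, and I am assuming size is an unsigned type such as size_t. With an unsigned type the negation wraps around and is well defined; with a signed type, negating the most negative value would overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to round size up to the next multiple of 4.
   size_t is unsigned, so -size wraps modulo a power of two
   and the low two bits are exactly (4 - size % 4) % 4. */
size_t pad4(size_t size)
{
    return (-size) & 3u;
}

/* Hypothetical helper: total length once the padding is appended.
   Assumes size + pad does not overflow size_t. */
size_t padded_size4(size_t size)
{
    size_t total = size + pad4(size);
    assert((total & 3u) == 0); /* postcondition: a multiple of 4 */
    return total;
}
```

A size that is already a multiple of 4 gets no padding, so pad4(4) is 0 and padded_size4(5) is 8.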

As long as the optimizing compiler uses bitmasking for the % 4 instead of division, I think your code is probably pretty good. This might be a slight improvement:

// only the last 2 bits (hence & 3) matter
pad = (4 - (size & 3)) & 3;

But again, the optimizing compiler is probably smart enough to reduce your code to this anyway. I can't think of anything better.
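
If you want to convince yourself that the two forms are equivalent, a quick check along these lines should do. This is only a sketch: pad4_mod is my guess at what a modulo-based version looks like, not a copy of the code in the question, and all three function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed shape of a modulo-based version (not taken from the question). */
size_t pad4_mod(size_t size)
{
    return (4 - size % 4) % 4;
}

/* The masked form: only the last 2 bits of size matter. */
size_t pad4_mask(size_t size)
{
    return (4 - (size & 3)) & 3;
}

/* Asserts that both forms agree for every size in [0, limit). */
void check_pad4_forms(size_t limit)
{
    for (size_t size = 0; size < limit; ++size) {
        assert(pad4_mod(size) == pad4_mask(size));
    }
}
```

For unsigned operands, size % 4 and size & 3 are the same value, which is why a compiler can substitute one for the other.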


// pad n bytes up to the next multiple of size (size must be a power of two)
pad n size = (~n + 1) & (size - 1)

This is similar to TypeIA's solution and uses only operations that map directly to machine instructions.

(~n + 1) computes the two's-complement negation of n, i.e. the value that gives 0 when added to n.
& (size - 1) keeps only the low bits that matter, which requires size to be a power of two.

Examples:

pad 13 8 = 3
pad 11 4 = 1
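
The formula above is written as pseudocode. A possible C version, as a sketch only, could look like this. The name pad_to and the parameter name align are my own choices, and I am assuming unsigned size_t arguments. The assert documents the power-of-two requirement, since the mask trick gives wrong answers for other alignments.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to round n up to the next multiple of align.
   align must be a nonzero power of two. */
size_t pad_to(size_t n, size_t align)
{
    /* A power of two has exactly one bit set, so clearing
       its lowest set bit must leave zero. */
    assert(align != 0 && (align & (align - 1)) == 0);

    /* (~n + 1) is the negation of n; the mask keeps the low bits. */
    return (~n + 1) & (align - 1);
}
```

With this, pad_to(13, 8) gives 3 and pad_to(11, 4) gives 1, matching the examples, and a value that is already aligned, such as pad_to(16, 8), gives 0.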