Does "16bit integer PCM data" mean it's signed or unsigned?
16-bit audio is, by convention, usually signed.
Think about what PCM audio is: each sample is how far along its axis the speaker should physically rest at that moment in time. Therefore perfect silence is any constant value, whatever it is, because that represents the speaker not moving.
0 is then the centre of the range, and usually where a microphone should be with no input. -32768 is the speaker as close to one end of its axis as it can be, 32767 is it at the other end.
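To make the signed interpretation concrete, here is a minimal C sketch that builds one 16-bit sample from two raw bytes and scales it to a double. It assumes little-endian byte order (which is what WAV files use); the names pcm16_from_le_bytes and pcm16_to_double are made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper: combine two little-endian bytes into one signed
   16-bit sample. Little-endian order is an assumption, not a given. */
int16_t pcm16_from_le_bytes(uint8_t lo, uint8_t hi) {
    uint16_t raw = (uint16_t)(lo | ((uint16_t)hi << 8));
    /* Interpret the bit pattern as two's complement without relying on
       an implementation-defined unsigned-to-signed conversion. */
    return (raw & 0x8000u) ? (int16_t)((int32_t)raw - 65536) : (int16_t)raw;
}

/* Hypothetical helper: map -32768..32767 onto -1.0..just under 1.0,
   so that 0 stays at the centre of the range. */
double pcm16_to_double(int16_t sample) {
    return (double)sample / 32768.0;
}
```

With this, the bytes 0x00 0x80 come out as -32768 (one end of the axis) and 0xFF 0x7F as 32767 (the other end).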
The safest way to detect silence would be to run a spectral analysis over the relevant range and look for periods where there is no activity in any audible frequency range.
If you're looking for pauses between speech then the easiest thing would probably be to go to somewhere like this, plug in an acceptable frequency range for speech (it's considered to be around 300Hz to around 3500Hz in telephony), your sampling rate and however many multiplications you think you can afford, then copy the coefficients supplied. E.g. I assumed 37 taps across the speech range with a 44100Hz input and, converted to a C array, I got:
    double coefficients[] = {
        -0.000560, -0.001290, -0.002332, -0.003606, -0.004911, -0.005921, -0.006201,
        -0.005256, -0.002610, 0.002106, 0.009059, 0.018139, 0.028924, 0.040691, 0.052479,
        0.063203, 0.071794, 0.077351, 0.079274, 0.077351, 0.071794, 0.063203, 0.052479,
        0.040691, 0.028924, 0.018139, 0.009059, 0.002106, -0.002610, -0.005256, -0.006201,
        -0.005921, -0.004911, -0.003606, -0.002332, -0.001290, -0.000560};
If it were double input, for each input sample c I'd then compute a sampled value:
    double *inputWave = ... input, an infinite array for the purposes of the example ...
    // numberOfTaps is the number of coefficients: 37 with the array given above
    size_t numberOfTaps = sizeof(coefficients) / sizeof(coefficients[0]);
    double sampledValue = 0.0;
    for (size_t coeff = 0; coeff < numberOfTaps; coeff++) {
        sampledValue += coefficients[coeff] * inputWave[c + coeff];
    }
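Real input is finite, so the loop above needs bounds. The following is only a sketch of one way to wrap it: applyFilter, inputLength and output are names invented here, and it simply skips the last numberOfTaps - 1 positions rather than padding the input.

```c
#include <stddef.h>

/* Hypothetical helper: run the per-sample loop over a finite buffer.
   Writes inputLength - numberOfTaps + 1 values to output and returns
   that count (0 if the input is shorter than the filter). */
size_t applyFilter(const double *inputWave, size_t inputLength,
                   const double *coefficients, size_t numberOfTaps,
                   double *output) {
    if (numberOfTaps == 0 || inputLength < numberOfTaps) return 0;
    size_t outputLength = inputLength - numberOfTaps + 1;
    for (size_t c = 0; c < outputLength; c++) {
        double sampledValue = 0.0;
        for (size_t coeff = 0; coeff < numberOfTaps; coeff++) {
            sampledValue += coefficients[coeff] * inputWave[c + coeff];
        }
        output[c] = sampledValue;
    }
    return outputLength;
}
```

Strictly speaking this indexes the coefficients forwards rather than reversed, but the array given above is symmetric so the result is the same either way.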
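What I've then got is a bandpass filter. Only that part of the signal representing sound in the frequency range 300–3500Hz should remain in the output values. In real life no such filter is perfect; increase the number of coefficients to increase the quality of your filter. With only 37 taps at 44100Hz the 300Hz lower edge in particular is barely resolved: the coefficients above sum to about 0.74, so a constant offset in the input mostly passes through.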
Having cut the irrelevant parts of the signal, I could then look for prolonged periods where sampledValue is close to 0.0.
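As a rough sketch of that last step, the function below scans filtered output for the first run of consecutive near-zero values. The name findSilence and the parameters threshold and minRun are assumptions for illustration; sensible values depend on your recording levels and on how long a pause you care about.

```c
#include <stddef.h>

/* Hypothetical helper: find the first run of at least minRun consecutive
   samples whose magnitude is below threshold. Returns the start index of
   that run, or length if there is none. */
size_t findSilence(const double *filtered, size_t length,
                   double threshold, size_t minRun) {
    size_t runStart = 0, runLength = 0;
    for (size_t i = 0; i < length; i++) {
        double magnitude = filtered[i] < 0.0 ? -filtered[i] : filtered[i];
        if (magnitude < threshold) {
            if (runLength == 0) runStart = i;  /* a new quiet run begins */
            runLength++;
            if (runLength >= minRun) return runStart;
        } else {
            runLength = 0;                     /* a loud sample ends the run */
        }
    }
    return length;
}
```

The run length matters because even loud audio passes through zero on every cycle, so a single near-zero sample says nothing. At 44100Hz a quarter-second pause would correspond to a minRun of roughly 11025 samples.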