How can I do real-time voice activity detection in Python?

You should try using Python bindings to webRTC VAD from Google. It's lightweight, fast and provides very reasonable results, based on GMM modelling. As the decision is provided per frame, the latency is minimal.

# Run the VAD on 10 ms of silence. The result should be False.
import webrtcvad
vad = webrtcvad.Vad(2)

sample_rate = 16000
frame_duration = 10  # ms
frame = b'\x00\x00' * int(sample_rate * frame_duration / 1000)
print('Contains speech: %s' % (vad.is_speech(frame, sample_rate))

Also, this article might be useful for you.

I found out that LibROSA could be one of the solutions to your problem. There's a simple tutorial on Medium on using Microphone streaming to realise real-time prediction.

Let's use Short-Time Fourier Transform (STFT) as the feature extractor, the author explains:

To calculate STFT, Fast Fourier transform window size(n_fft) is used as 512. According to the equation n_stft = n_fft/2 + 1, 257 frequency bins(n_stft) are calculated over a window size of 512. The window is moved by a hop length of 256 to have a better overlapping of the windows in calculating the STFT.

stft = np.abs(librosa.stft(trimmed, n_fft=512, hop_length=256, win_length=512))

# Plot audio with zoomed in y axis
def plotAudio(output):
    fig, ax = plt.subplots(nrows=1,ncols=1, figsize=(20,10))
    plt.plot(output, color='blue')
    ax.set_xlim((0, len(output)))
    ax.margins(2, -0.1)

# Plot audio
def plotAudio2(output):
    fig, ax = plt.subplots(nrows=1,ncols=1, figsize=(20,4))
    plt.plot(output, color='blue')
    ax.set_xlim((0, len(output)))

def minMaxNormalize(arr):
    mn = np.min(arr)
    mx = np.max(arr)
    return (arr-mn)/(mx-mn)

def predictSound(X):
    clip, index = librosa.effects.trim(X, top_db=20, frame_length=512, hop_length=64) # Empherically select top_db for every sample
    stfts = np.abs(librosa.stft(clip, n_fft=512, hop_length=256, win_length=512))
    stfts = np.mean(stfts,axis=1)
    stfts = minMaxNormalize(stfts)
    result = model.predict(np.array([stfts]))
    predictions = [np.argmax(y) for y in result]

CHUNKSIZE = 22050 # fixed chunk size
RATE = 22050

p = pyaudio.PyAudio()
stream =, channels=1, 
rate=RATE, input=True, frames_per_buffer=CHUNKSIZE)

#preprocessing the noise around
#noise window
data =
noise_sample = np.frombuffer(data, dtype=np.float32)
print("Noise Sample")
loud_threshold = np.mean(np.abs(noise_sample)) * 10
print("Loud threshold", loud_threshold)
audio_buffer = []
near = 0

    # Read chunk and load it into numpy array.
    data =
    current_window = np.frombuffer(data, dtype=np.float32)
    #Reduce noise real-time
    current_window = nr.reduce_noise(audio_clip=current_window, noise_clip=noise_sample, verbose=False)
        audio_buffer = current_window
            print("Inside silence reign")
                audio_buffer = np.concatenate((audio_buffer,current_window))
                near += 1
                audio_buffer = []
            print("Inside loud reign")
            near = 0
            audio_buffer = np.concatenate((audio_buffer,current_window))

# close stream

Code credit to: Chathuranga Siriwardhana

Full code can be found here.

I think there are two approaches here,

  1. Threshold Approach
  2. Small, deployable, Neural net. Approach

The first one is fast, feasible and can be implemented and tested very fast. while the second one is a bit more difficult to implement. I think you are a bit familiar with 2nd option already.

in the case of the 2nd approach, you will be needing a dataset of speeches that are labeled in a sequence of binary classification like 00000000111111110000000011110000. The neural net should be small and optimized for running on edge devices like mobile.

You can check this out from TensorFlow

This is a voice activity detector. I think it's for your purpose.

Also, check these out.

of course, you should compare performance of the mentioned toolkits and models and the feasibility of the implementation of mobile devices.