What is a good approach for extracting portions of speech from an arbitrary audio file?
EnergyDetector
For Voice Activity Detection, I have been using the EnergyDetector program of the MISTRAL (formerly LIA_RAL) speaker recognition toolkit, which is based on the ALIZE library.
It works on feature files, not audio files, so you first need to extract the energy of the signal. I usually extract cepstral features (MFCC) with the log-energy parameter, and I use that parameter for VAD. You can use sfbcep, a utility from the SPro signal processing toolkit, in the following way:
sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm
It will extract 19 MFCCs + the log-energy coefficient + first- and second-order delta coefficients. The energy coefficient is the 19th; you specify that in the EnergyDetector configuration file (the featureServerMask parameter below).
You will then run EnergyDetector in this way:
EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output
If you use the configuration file that you find at the end of this answer, you need to put output.prm in prm/, and you'll find the segmentation in lbl/.
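Put together, a minimal sketch of the whole pipeline looks like this (directory names are taken from the configuration file below; the path cfg/EnergyDetector.cfg is an assumption):
mkdir -p prm lbl
sfbcep -F PCM16 -p 19 -e -D -A input.wav prm/output.prm
EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output
# the speech/non-speech segmentation is written to lbl/output.lbl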
As a reference, I attach my EnergyDetector configuration file:
*** EnergyDetector Config File ***
loadFeatureFileExtension .prm
minLLK -200
maxLLK 1000
bigEndian false
loadFeatureFileFormat SPRO4
saveFeatureFileFormat SPRO4
saveFeatureFileSPro3DataKind FBCEPSTRA
featureServerBufferSize ALL_FEATURES
featureServerMemAlloc 50000000
featureFilesPath prm/
mixtureFilesPath gmm/
lstPath lst/
labelOutputFrames speech
labelSelectedFrames all
addDefaultLabel true
defaultLabel all
saveLabelFileExtension .lbl
labelFilesPath lbl/
frameLength 0.01
segmentalMode file
nbTrainIt 8
varianceFlooring 0.0001
varianceCeiling 1.5
alpha 0.25
mixtureDistribCount 3
featureServerMask 19
vectSize 1
baggedFrameProbabilityInit 0.1
thresholdMode weight
CMU Sphinx
The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.
A recent addition is GStreamer support, which means that you can use its VAD in a GStreamer media pipeline. See Using PocketSphinx with GStreamer and Python -> The 'vader' element.
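As a rough, untested sketch (the vader element name and its auto-threshold property come from that tutorial; the rest of the pipeline is illustrative):
gst-launch-0.10 filesrc location=input.wav ! decodebin ! audioconvert ! audioresample \
    ! vader name=vad auto-threshold=true ! fakesink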
Other VADs
I have also been using a modified version of the AMR1 codec that outputs a file with speech/non-speech classification, but I cannot find its sources online, sorry.
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code.
It comes with a file, example.py, that does exactly what you're looking for: Given a .wav file, it finds each instance of someone speaking and writes it out to a new, separate .wav file.
The webrtcvad API is extremely simple, in case example.py doesn't do quite what you want:
import webrtcvad

vad = webrtcvad.Vad()  # optionally pass an aggressiveness mode from 0 to 3

# sample must be 16-bit mono PCM audio at 8000, 16000, or 32000 Hz,
# and exactly 10, 20, or 30 milliseconds long
sample_rate = 16000
sample = b"\x00\x00" * int(sample_rate * 0.03)  # 30 ms of silence
print(vad.is_speech(sample, sample_rate))
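And as a rough sketch for a whole file (the file name is hypothetical; assumes a mono, 16-bit WAV at one of the supported rates), you can feed it 30 ms frames in a loop:
import wave

import webrtcvad

vad = webrtcvad.Vad(2)
with wave.open("recording.wav", "rb") as wf:      # hypothetical input file
    sample_rate = wf.getframerate()               # must be 8000, 16000, or 32000 Hz
    frames_per_chunk = int(sample_rate * 0.03)    # 30 ms of samples
    t = 0.0
    chunk = wf.readframes(frames_per_chunk)
    while len(chunk) == frames_per_chunk * 2:     # 2 bytes per 16-bit mono sample
        label = "speech" if vad.is_speech(chunk, sample_rate) else "silence"
        print("%5.2f s  %s" % (t, label))
        t += 0.03
        chunk = wf.readframes(frames_per_chunk)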
pyAudioAnalysis has a silence-removal function.
In this library, silence removal can be as simple as this:
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# Fs is the sampling rate, x the audio signal
[Fs, x] = aIO.readAudioFile("data/recording1.wav")
# 20 ms frames with a 20 ms step; returns the speech segments as [start, end] pairs in seconds
segments = aS.silenceRemoval(x, Fs, 0.020, 0.020, smoothWindow=1.0, Weight=0.3, plot=True)
Internally, silenceRemoval() follows a semi-supervised approach: first, an SVM model is trained to distinguish between high-energy and low-energy short-term frames; to this end, the 10% highest-energy frames and the 10% lowest-energy frames are used. The SVM is then applied (with a probabilistic output) to the whole recording, and dynamic thresholding is used to detect the active segments.
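To actually extract the speech portions into separate files, a small sketch along these lines should work (output file names are hypothetical; scipy is used for writing, since silenceRemoval only returns the segment limits in seconds):
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS
from scipy.io import wavfile

[Fs, x] = aIO.readAudioFile("data/recording1.wav")
segments = aS.silenceRemoval(x, Fs, 0.020, 0.020, smoothWindow=1.0, Weight=0.3)
# each segment is a [start, end] pair in seconds; write it out as its own WAV
for i, (start, end) in enumerate(segments):
    wavfile.write("segment_%02d.wav" % i, Fs, x[int(start * Fs):int(end * Fs)])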