How to combine a neural network and audio, like Classify does?
One thing we have to consider before applying a deep neural network is the size of the data set. If it is very small, which is the case here (only 18 examples), a very deep neural network may not converge well.
There are several ways to deal with a small data set. One common way is to use transfer learning (see example here), which leverages a pretrained network and only trains a small portion of the large network. Another way is to apply data augmentation to generate more data (a small sketch of this follows below). A third way is to use a shallow neural network on dimension-reduced data. I will demonstrate the last one, which is also what Classify does in the first place in your example.
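For the augmentation route, a minimal sketch could look like the following. It is not part of the original workflow; it assumes AudioPitchShift and AudioTimeStretch as the perturbations, and the helper augment is hypothetical.

augment[s_] := With[{a = Audio[s]},
  {a, AudioPitchShift[a, RandomReal[{0.9, 1.1}]],
   AudioTimeStretch[a, RandomReal[{0.9, 1.1}]]}]
(* e.g. Flatten[augment /@ windInstrument] once the instrument examples below are loaded *)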
First of all, we convert the audio into spectrograms.
windInstrument = ExampleData[{"Sound", #}] & /@ {"AltoFlute","AltoSaxophone", "BassClarinet", "BassFlute", "Flute","FrenchHorn", "Oboe", "SopranoSaxophone", "TenorTrombone", "Trumpet", "Tuba"};
nonwindInstrument = ExampleData[{"Sound", #}] & /@ {"Cello","CelloPizzicato", "DoubleBass", "DoubleBassPizzicato", "OrganChord","Viola", "Violin"};
audioToSpectrogram[a_] := Module[{data},
  data = SpectrogramArray[a, Automatic, Automatic, HannWindow];
  (* keep only the first half of the symmetric spectrum, take magnitudes,
     and turn the result into a fixed-size 266x277 image *)
  ImageResize[
   ImageAdjust@
    Image[Reverse@Transpose@Abs[data[[All, 1 ;; Dimensions[data][[2]]/2]]]],
   {266, 277}]
  ]
spectrogramData =
audioToSpectrogram /@ Flatten[{windInstrument, nonwindInstrument}];
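As a quick check, each spectrogram should come out as a 266×277 image:

ImageDimensions[First[spectrogramData]]
(* {266, 277} *)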
We then construct a dimension-reduction function from our data. Each 266×277 spectrogram image is reduced to a vector of length 17:
dr = DimensionReduction[spectrogramData, 17,
Method -> "Linear"]
Now construct the training data from the dimension-reduced data:
trainingData = RandomSample@MapAt[
    dr[audioToSpectrogram[#]] &,
    Flatten[{Thread[windInstrument -> "wind"],
      Thread[nonwindInstrument -> "nonwind"]}],
    {All, 1}];
Construct and train the neural network; we use only two small layers:
net = NetChain[{10, Tanh, 2, Tanh, SoftmaxLayer[]}, "Input" -> 17,
"Output" -> NetDecoder[{"Class", {"wind", "nonwind"}}]];
trained =
NetTrain[net, trainingData,
Method -> {"SGD", "L2Regularization" -> 0.1},
MaxTrainingRounds -> 500]
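As a rough sanity check (there is no held-out set here, so this is only the training accuracy):

N@Mean[Boole[MapThread[SameQ,
    {trained[trainingData[[All, 1]]], trainingData[[All, 2]]}]]]
(* fraction of the 18 training examples classified correctly *)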
Finally, evaluate on some test data:
trained[dr[audioToSpectrogram[ExampleData[{"Sound", #}]]]] & /@
 {"Clarinet", "Piano", "Bassoon"}
(* {"wind", "nonwind", "wind"} *)
In 11.3, NetEncoder supports Audio objects.
windInstrument =
ExampleData[{"Sound", #}] & /@ {"AltoFlute", "AltoSaxophone",
"BassClarinet", "BassFlute", "Flute", "FrenchHorn", "Oboe",
"SopranoSaxophone", "TenorTrombone", "Trumpet", "Tuba"};
nonwindInstrument =
ExampleData[{"Sound", #}] & /@ {"Cello", "CelloPizzicato",
"DoubleBass", "DoubleBassPizzicato", "OrganChord", "Viola",
"Violin"};
(* Convert Sound to Audio objects so they fit NetTrain's input *)
trainingData = Join[Thread[Audio /@ windInstrument -> "windInstrument"],
Thread[Audio /@ nonwindInstrument -> "nonwindInstrument"]];
Let's construct the network. Here I use the AudioMelSpectrogram encoder; it and AudioMFCC are common features in speech tasks.
net = NetChain[{LongShortTermMemoryLayer[30],
LongShortTermMemoryLayer[10], SequenceLastLayer[], 2,
SoftmaxLayer[]},
"Input" -> NetEncoder["AudioMelSpectrogram"],
"Output" -> NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]
net = NetTrain[net, trainingData];
types = {"Clarinet", "Piano", "Bassoon"};
net[Audio@ExampleData[{"Sound", #}]] & /@ types
(*{"windInstrument","nonwindInstrument","windInstrument"}*)
Nowadays, however, there is more focus on the raw audio itself (NetEncoder["Audio"]), as in WaveNet and SampleRNN.
net = NetChain[{ConvolutionLayer[32, 80, "Interleaving" -> True],
LongShortTermMemoryLayer[10], SequenceLastLayer[], 2,
SoftmaxLayer[]}, "Input" -> NetEncoder["Audio"],
"Output" ->
NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]
But it uses so much memory that it is hard to apply in a real application; one way to shrink it is sketched below.
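One way to reduce the memory load is to downsample and truncate the waveform in the encoder and stride the convolution. This is only a sketch under assumptions, not from the original answer: SampleRate and "TargetLength" are documented options of the "Audio" encoder, and the particular values and stride here are arbitrary.

netSmall = NetChain[{
    ConvolutionLayer[32, 80, "Stride" -> 4, "Interleaving" -> True],
    LongShortTermMemoryLayer[10], SequenceLastLayer[], 2, SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Audio", SampleRate -> 8000, "TargetLength" -> 16000}],
   "Output" -> NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]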
And the features in xslittlegrass's answer are similar to NetEncoder["AudioSpectrogram"].
PS: this approach supports out-of-core training, which means it can be used in real applications:
enc = NetEncoder["AudioMelSpectrogram"];
file = "ExampleData/rule30.wav";
a1 = Import[file];
(*in-core Audio object*)
a2 = Audio[file];
(*out-of-core Audio object*)
Dimensions /@ enc[{a1, a2}]
(*{{215,40},{215,40}}*)
ByteCount /@ {a1, a2}
(*{161128,424}*)
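So for a larger corpus you could build the training set from file-backed Audio objects and hand it to NetTrain directly. The directory layout below is hypothetical.

windFiles = FileNames["*.wav", "corpus/wind"];
nonwindFiles = FileNames["*.wav", "corpus/nonwind"];
bigTrainingData = Join[
   Thread[Audio /@ windFiles -> "windInstrument"],
   Thread[Audio /@ nonwindFiles -> "nonwindInstrument"]];
(* NetTrain[net, bigTrainingData] then streams the audio from disk instead of loading it all into memory *)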