Number of feature maps produced after each convolution layer in CNNs
I too had this confusion for a while and only after some digging did the mist clear away.
Difference between convolution with 1 channel and convolution with multiple channels
This is where my understanding was going wrong, so I will make an attempt to explain the difference. I am no expert, so please bear with me.
Convolution operation with a single channel
When we think of a simple grayscale 32x32 image and a convolution operation, we are applying 1 or more convolution matrices in the first layer.
As per your example, each of these convolution matrices of dimension 5x5 produces a 28x28 matrix as output. Why 28x28? Because you can slide a 5x5 window into 32-5+1=28 positions along each axis, assuming stride=1 and padding=0.
In such a scenario, each of your convolution matrices has 5x5=25 trainable weights + 1 trainable bias. You can have as many convolution kernels as you like, but each kernel would be 2-dimensional and each would produce an output matrix of dimension 28x28, which is then fed to the MAXPOOL layer.
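Here is a minimal NumPy sketch (my own illustration with random data, not code from the question) of that single-channel case; it slides one 5x5 kernel over a 32x32 image to confirm the 28x28 output and the 25+1 parameter count:

```python
import numpy as np

# One 5x5 kernel on a 32x32 grayscale image, stride=1, padding=0.
image = np.random.rand(32, 32)
kernel = np.random.rand(5, 5)   # 25 trainable weights
bias = np.random.rand()         # + 1 trainable bias

out_size = 32 - 5 + 1           # = 28 positions along each axis
feature_map = np.empty((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        window = image[i:i+5, j:j+5]
        feature_map[i, j] = np.sum(window * kernel) + bias

print(feature_map.shape)        # (28, 28)
```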
Convolutional operation with multiple channels
What if the image had been an RGB 32x32 picture? As per popular literature, the image should be treated as comprising 3 channels, and a convolution operation should be carried out on these channels. I must admit that I hastily drew some misleading conclusions: I was under the impression that we should use three independent 5x5 convolution matrices, 1 for each channel. I was wrong.
When you have 3 channels, each of your convolution matrices should be of dimension 3x5x5 - think of this as a single unit comprising a 5x5 matrix stacked 3 times. Therefore you have 5x5x3=75 trainable weights + 1 trainable bias.
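To make that concrete, here is a small NumPy sketch (again my own, with random data) showing that one 3x5x5 kernel still produces a single 2-dimensional feature map, because the element-wise products are summed across all 3 channels at every position:

```python
import numpy as np

# One 3x5x5 kernel on a 3x32x32 RGB image: 5x5x3 = 75 weights + 1 bias.
image = np.random.rand(3, 32, 32)   # channels-first layout
kernel = np.random.rand(3, 5, 5)    # one 5x5 slice per channel
bias = np.random.rand()

out_size = 32 - 5 + 1
feature_map = np.empty((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        # Multiply across ALL 3 channels, then sum down to one number.
        window = image[:, i:i+5, j:j+5]   # shape (3, 5, 5)
        feature_map[i, j] = np.sum(window * kernel) + bias

print(feature_map.shape)   # (28, 28): one 2-D map per kernel, not three
```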
What happens in the second convolutional layer?
In your example, I found it easier to visualize the 6 feature maps produced by the first CONV1+MAXPOOL1 layers as 6 channels. Therefore, applying the same RGB logic as before, any convolution kernel that we apply in the second CONV2 layer should have dimension 6x5x5. Why 6? Because CONV1+MAXPOOL1 has produced 6 feature maps. Why 5x5? In your example you have chosen a window dimension of 5x5. Theoretically, I could have chosen 3x3, in which case the kernel dimension would be 6x3x3.
Therefore, in the current example, if you have N2 convolution matrices in the CONV2 layer, then each of these N2 kernels will be of size 6x5x5. In the current example N2=16, and convolving these kernels, each of dimension 6x5x5, with the 6-channel 14x14 input produces N2=16 matrices, each of dimension 10x10 (one matrix per kernel). Why 10? 10=14-5+1 (stride=1, padding=0).
You now have N2=16 matrices lined up for the MAXPOOL2 layer.
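Putting the two layers together, a short PyTorch sketch (my own check; the original post contains no code) traces the shapes through CONV1 -> MAXPOOL1 -> CONV2 with the example numbers:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)   # one grayscale 32x32 image

conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
pool = nn.MaxPool2d(kernel_size=2)
conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)

x = conv1(x); print(x.shape)    # torch.Size([1, 6, 28, 28])
x = pool(x);  print(x.shape)    # torch.Size([1, 6, 14, 14])
x = conv2(x); print(x.shape)    # torch.Size([1, 16, 10, 10])

# Each CONV2 kernel really is 6x5x5:
print(conv2.weight.shape)       # torch.Size([16, 6, 5, 5])
```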
Reference: LeNet architecture
http://deeplearning.net/tutorial/lenet.html
In the figure on that page, notice the encircled region: the operation in the second convolution layer is shown to span across each of the 4 feature maps that were produced by the first layer.
Reference: Andrew Ng lectures
https://youtu.be/bXJx7y51cl0
Reference: What does convolution arithmetic with multiple channels look like?
I found another Stack Overflow question which nicely describes this: How a Convolutional Neural Net handles channels
Take note that in the referenced example, the information in the 3 channels is squashed into a 2-dimensional matrix. This is why your 6 feature maps from the CONV1+MAXPOOL1 layer no longer appear to contribute to the dimension of the first fully connected layer.
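To see where the numbers go from there, the 16@5x5 maps left after MAXPOOL2 are flattened before the first fully connected layer, so that layer sees 16*5*5=400 inputs. A quick PyTorch check (my own sketch; the 120-unit layer size is the usual LeNet choice, used here only for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 5, 5)    # 16@5x5 maps coming out of MAXPOOL2
flat = torch.flatten(x, start_dim=1)
print(flat.shape)               # torch.Size([1, 400]) = 16 * 5 * 5

fc1 = nn.Linear(400, 120)       # LeNet-style first fully connected layer
print(fc1(flat).shape)          # torch.Size([1, 120])
```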
The second convolution works as follows:
- input matrix shape: 6@14x14 (6 channels - result of applying 6 filters in the previous (first) convolution step)
- the 6@14x14 input matrix will be convolved using 16 filters (each filter must have 6 channels in order to match the number of channels in the input matrix)
- this will result in a 16@10x10 output (14-5+1=10 with 5x5 filters, stride=1, padding=0), which the subsequent max pooling reduces to 16@5x5
NOTE: Number of channels of the input matrix and number of channels in each filter must match in order to be able to perform element-wise multiplication.
So the main difference between the first and second convolutions is that the number of channels in the input matrix of the first convolution is 1, so we will use 6 filters where each filter has only one channel (depth of matrix). For the second convolution, the input matrix has 6 channels (feature maps), so each filter for this convolution must have 6 channels as well. For example, each of the 16 filters would have the 6@3x3 shape (taking 3x3 as the spatial size).

The result of a convolution step for a single filter of 6@3x3 shape will be a single channel of WxH (width, height) shape. After applying all 16 filters (where each of them has shape 6@3x3) we will get 16 channels, where each channel is the result of convolution with a single filter.
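A quick parameter count (my own check, using the 5x5 filters of the architecture under discussion rather than the 3x3 example shape above) makes the difference between the two layers concrete:

```python
import torch.nn as nn

conv1 = nn.Conv2d(1, 6, kernel_size=5)    # 6 filters, each 1 channel deep
conv2 = nn.Conv2d(6, 16, kernel_size=5)   # 16 filters, each 6 channels deep

def n_params(layer):
    return sum(p.numel() for p in layer.parameters())

print(n_params(conv1))   # 6 * (1*5*5) + 6   = 156
print(n_params(conv2))   # 16 * (6*5*5) + 16 = 2416
```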
Let me give you a basic idea of how convolution works with a simple example:
If we have an input image of size 32x32 with 3 channels, and if we choose to use a 5x5 filter, the dimensions of the filter would be 5x5x3 (implicitly).
So now the convolution will take place between the corresponding channels of the input image and the filter, producing 3 channels, which are then added to produce a single output for the given 5x5x3 filter.
- Hence whenever we say the depth, we mean the number of filters we wish to apply. If we want a depth of 6, it means we are using 6 filters of size 5x5x3, producing 6 output feature maps.
To answer your question: in the given architecture, they are using 6 filters for the input image, and then 16 filters of size (y x y x 6), where y is the height and width of the chosen filter.
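As a sanity check, this PyTorch snippet (my own, not from the answer) confirms both claims: each first-layer filter is implicitly 3 channels deep for an RGB input, and each second-layer filter is 6 channels deep (here with y=5):

```python
import torch.nn as nn

# RGB input -> 6 filters: each filter is implicitly 3x5x5.
first = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
print(first.weight.shape)    # torch.Size([6, 3, 5, 5])

# 6 feature maps in -> 16 filters: each filter is 6 channels deep.
second = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)
print(second.weight.shape)   # torch.Size([16, 6, 5, 5])
```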