Multi-layer perceptron (MLP) architecture: criteria for choosing the number of hidden layers and the size of the hidden layer?

How many hidden layers?

A model with zero hidden layers will resolve linearly separable data. So unless you already know your data isn't linearly separable, it doesn't hurt to verify this. Why use a more complex model than the task requires? If the data is linearly separable, then a simpler technique will work, and a perceptron will do the job as well.
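One cheap way to run that check is to train a plain perceptron and see whether it reaches zero training errors. The sketch below is only an illustration: the function name `perceptron_separates`, the epoch budget, and the toy AND/XOR data are my own assumptions, not part of any library. A `False` result only means the perceptron did not converge within the budget, which suggests (but does not prove) that the data is not linearly separable.

```python
def perceptron_separates(X, y, max_epochs=1000):
    """Return True if a perceptron reaches zero training errors.

    X: list of feature lists; y: list of labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            activation = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * activation <= 0:  # misclassified (or exactly on the boundary)
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:
            return True   # a separating hyperplane was found
    return False          # no convergence within the epoch budget


# Toy data (illustrative only): AND is linearly separable, XOR is not.
points = [[0, 0], [0, 1], [1, 0], [1, 1]]
and_labels = [-1, -1, -1, 1]
xor_labels = [-1, 1, 1, -1]
print(perceptron_separates(points, and_labels))  # True
print(perceptron_separates(points, xor_labels))  # False
```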

Assuming your data does require separation by a non-linear technique, then always start with one hidden layer. Almost certainly that's all you will need. If your data is separable using an MLP, then that MLP probably only needs a single hidden layer. There is theoretical justification for this, but my reason is purely empirical: many difficult classification/regression problems are solved using single-hidden-layer MLPs, yet I don't recall encountering any multiple-hidden-layer MLPs used to successfully model data, whether on ML bulletin boards, in ML textbooks, in academic papers, etc. They exist, certainly, but the circumstances that justify their use are empirically quite rare.


How many nodes in the hidden layer?

From the MLP academic literature, my own experience, etc., I have gathered and often rely upon several rules of thumb (RoT), which I have also found to be reliable guides (i.e., the guidance was accurate, and even when it wasn't, it was usually clear what to do next):

RoT based on improving convergence:

When you begin the model building, err on the side of more nodes in the hidden layer.

Why? First, a few extra nodes in the hidden layer aren't likely to do any harm: your MLP will still converge. On the other hand, too few nodes in the hidden layer can prevent convergence. Think of it this way: additional nodes provide some excess capacity, i.e., additional weights to store/release signal to the network during iteration (training, or model building). Second, if you begin with additional nodes in your hidden layer, then it's easy to prune them later (as iteration progresses). This is common, and there are diagnostic techniques to assist you (e.g., the Hinton diagram, which is just a visual depiction of the weight matrices, a 'heat map' of the weight values).

RoTs based on size of input layer and size of output layer:

A rule of thumb is for the size of this [hidden] layer to be somewhere between the input layer size ... and the output layer size....

To calculate the number of hidden nodes we use a general rule of: (Number of inputs + outputs) x 2/3
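For concreteness, here is the arithmetic behind the two quoted rules as a small sketch. The function name `hidden_size_candidates` is made up for illustration, and the 10-input/5-output figures are just example numbers; treat the results as starting points, not answers.

```python
def hidden_size_candidates(n_inputs, n_outputs):
    """Starting points suggested by the two quoted rules of thumb."""
    low, high = sorted((n_inputs, n_outputs))
    # "(Number of inputs + outputs) x 2/3", rounded to a whole node count
    two_thirds = round((n_inputs + n_outputs) * 2 / 3)
    return {"between": (low, high), "two_thirds_rule": two_thirds}


print(hidden_size_candidates(10, 5))
# {'between': (5, 10), 'two_thirds_rule': 10}
```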

RoT based on principal components:

Typically, we specify as many hidden nodes as dimensions [principal components] needed to capture 70-90% of the variance of the input data set.
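A minimal sketch of that count, assuming NumPy is available and that "variance captured" means the cumulative explained-variance ratio from PCA on the centered inputs. The helper name `components_for_variance` and the synthetic data (two latent factors mixed into ten features) are my own illustration.

```python
import numpy as np


def components_for_variance(X, target=0.9):
    """Smallest number of principal components whose cumulative
    explained variance reaches `target` (a fraction between 0 and 1)."""
    Xc = X - X.mean(axis=0)                     # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending
    explained = s**2 / np.sum(s**2)             # variance ratio per component
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(s))                       # guard against rounding at target=1.0


# Synthetic inputs: 10 features driven by only 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

k70 = components_for_variance(X, 0.7)
k90 = components_for_variance(X, 0.9)
print(k70, k90)   # candidate hidden-layer sizes under this rule of thumb
```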

And yet the NN FAQ author calls these rules "nonsense" (literally) because they ignore the number of training instances, the noise in the targets (values of the response variables), and the complexity of the feature space.

In his view (and it always seemed to me that he knows what he's talking about), you should choose the number of neurons in the hidden layer based on whether your MLP includes some form of regularization or early stopping.

The only valid technique for optimizing the number of neurons in the Hidden Layer:

During your model building, test obsessively; testing will reveal the signatures of "incorrect" network architecture. For instance, if you begin with an MLP whose hidden layer has a small number of nodes (which you will gradually increase as needed, based on test results), your training error and generalization error will both be high because of bias and underfitting.

Then increase the number of nodes in the hidden layer, one at a time, until the generalization error begins to increase, this time due to overfitting and high variance.
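Here is a rough sketch of that loop, assuming NumPy and a tiny hand-written single-hidden-layer MLP so the example stays self-contained. Everything in it (the XOR-like toy data with 10% label noise, the learning rate, the epoch count, the helper names `train_mlp`, `error_rate`, `make_data`) is an assumption for illustration. It simply picks the size with the lowest validation error; on noisy real data you would want to repeat runs or use cross-validation before trusting the curve.

```python
import numpy as np


def train_mlp(X, y, n_hidden, epochs=2000, lr=0.5, seed=0):
    """Fit a one-hidden-layer MLP (tanh hidden units, sigmoid output) by
    full-batch gradient descent; return a function giving P(class 1)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
    b2 = np.zeros(1)
    t = y.reshape(-1, 1)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)
        P = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))
        dZ2 = (P - t) / len(X)                  # grad of mean cross-entropy
        dH = (dZ2 @ W2.T) * (1.0 - H**2)        # backprop through tanh
        W2 -= lr * (H.T @ dZ2)
        b2 -= lr * dZ2.sum(axis=0)
        W1 -= lr * (X.T @ dH)
        b1 -= lr * dH.sum(axis=0)

    def predict(X_new):
        H_new = np.tanh(X_new @ W1 + b1)
        return (1.0 / (1.0 + np.exp(-(H_new @ W2 + b2)))).ravel()

    return predict


def error_rate(predict, X, y):
    """Fraction of misclassified points at a 0.5 threshold."""
    return float(np.mean((predict(X) > 0.5) != (y > 0.5)))


def make_data(n, rng):
    """XOR-like quadrants with 10% label noise (not linearly separable)."""
    X = rng.uniform(-1, 1, size=(n, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(float)
    flip = rng.random(n) < 0.1
    y[flip] = 1 - y[flip]
    return X, y


rng = np.random.default_rng(1)
X_train, y_train = make_data(200, rng)
X_val, y_val = make_data(200, rng)

results = []
for n_hidden in range(1, 9):                    # grow one node at a time
    predict = train_mlp(X_train, y_train, n_hidden)
    train_err = error_rate(predict, X_train, y_train)
    val_err = error_rate(predict, X_val, y_val)
    results.append((n_hidden, train_err, val_err))
    print(f"{n_hidden:2d}  train={train_err:.3f}  val={val_err:.3f}")

# Smallest size with the lowest validation (generalization) error.
best = min(results, key=lambda r: r[2])[0]
print("chosen hidden size:", best)
```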


In practice, I do it this way:

input layer: the size of my data vector (the number of features in my model) + 1 for the bias node, and not including the response variable, of course

output layer: solely determined by my model: regression (one node) versus classification (number of nodes equal to the number of classes, assuming softmax)

hidden layer: to start, one hidden layer with a number of nodes equal to the size of the input layer. The "ideal" size is more likely to be smaller (i.e., some number of nodes between the number in the input layer and the number in the output layer) rather than larger. Again, this is just an empirical observation, and the bulk of this observation is my own experience. If the project justifies the additional time required, then I start with a single hidden layer composed of a small number of nodes, then (as I explained just above) I add nodes to the hidden layer, one at a time, while calculating the generalization error, training error, bias, and variance. When generalization error has dipped and just before it begins to increase again, the number of nodes at that point is my choice. See figure below.

[Figure omitted]


It is very difficult to choose the number of neurons in a hidden layer, and to choose the number of hidden layers in your neural network.

Usually, for most applications, one hidden layer is enough. Also, the number of neurons in that hidden layer should be between the number of inputs (10 in your example) and the number of outputs (5 in your example).

But the best way to choose the number of neurons and hidden layers is experimentation. Train several neural networks with different numbers of hidden layers and hidden neurons, and measure the performance of those networks using cross-validation. You can stick with the number that yields the best performing network.
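As a sketch of that experiment, assuming scikit-learn is installed: `MLPClassifier` and `cross_val_score` are real scikit-learn APIs, but the synthetic 10-input, 5-class data set, the candidate architectures, and the iteration limit below are arbitrary choices for illustration only.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a problem with 10 inputs and 5 output classes.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# Candidate architectures: tuples of hidden-layer sizes.
candidates = [(5,), (8,), (10,), (8, 8)]

scores = {}
with warnings.catch_warnings():
    # Short training runs may stop before convergence; silence that warning here.
    warnings.simplefilter("ignore", ConvergenceWarning)
    for layers in candidates:
        model = MLPClassifier(hidden_layer_sizes=layers, max_iter=300, random_state=0)
        # Mean accuracy over 5 cross-validation folds.
        scores[layers] = cross_val_score(model, X, y, cv=5).mean()

for layers, score in scores.items():
    print(layers, round(score, 3))

best = max(scores, key=scores.get)
print("best architecture:", best)
```

With differences this small between candidates, it is usually worth preferring the smaller network unless the larger one wins by a clear margin.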


To automate the selection of the best number of layers and best number of neurons for each of the layers, you can use genetic optimization.

The key pieces would be:

  1. Chromosome: Vector that defines how many units are in each hidden layer (e.g. [20,5,1,0,0] meaning 20 units in the first hidden layer, 5 in the second, ... , with layers 4 and 5 missing). You can set a limit on the maximum number of layers to try, and the max number of units in each layer. You should also place restrictions on how the chromosomes are generated. E.g. [10, 0, 3, ... ] should not be generated, because any units after a missing layer (the '3,...') would be irrelevant and would waste evaluation cycles.
  2. Fitness Function: A function that returns the reciprocal of the lowest error achieved on the cross-validation set by a network defined by a given chromosome. You could also include the number of total units, or computation time, if you want to find the "smallest/fastest yet most accurate network".
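The two pieces above could be sketched roughly as follows. This is a toy under stated assumptions, not a tuned genetic algorithm: the limits `MAX_LAYERS` and `MAX_UNITS`, the one-point crossover with repair, the mutation moves, and especially `toy_cv_error` (a hypothetical stand-in for "train this network and return its cross-validation error") are all invented for illustration.

```python
import random

MAX_LAYERS = 5    # assumed limits, chosen only for illustration
MAX_UNITS = 32


def random_chromosome(rng):
    """Units per hidden layer, zero-padded on the right, e.g. [20, 5, 1, 0, 0]."""
    depth = rng.randint(1, MAX_LAYERS)
    units = [rng.randint(1, MAX_UNITS) for _ in range(depth)]
    return units + [0] * (MAX_LAYERS - depth)


def is_valid(chromosome):
    """Reject vectors such as [10, 0, 3, 0, 0]: no units after a missing layer."""
    seen_zero = False
    for units in chromosome:
        if units == 0:
            seen_zero = True
        elif seen_zero:
            return False
    return chromosome[0] > 0


def crossover(a, b, rng):
    """One-point crossover, repaired so nothing follows a missing layer."""
    cut = rng.randint(1, MAX_LAYERS - 1)
    child = a[:cut] + b[cut:]
    if 0 in child:
        first_zero = child.index(0)
        child = child[:first_zero] + [0] * (MAX_LAYERS - first_zero)
    return child


def mutate(chromosome, rng):
    """Resize one active layer, or add/drop the last active layer."""
    active = [u for u in chromosome if u > 0]
    move = rng.choice(["resize", "grow", "shrink"])
    if move == "grow" and len(active) < MAX_LAYERS:
        active.append(rng.randint(1, MAX_UNITS))
    elif move == "shrink" and len(active) > 1:
        active.pop()
    else:
        active[rng.randrange(len(active))] = rng.randint(1, MAX_UNITS)
    return active + [0] * (MAX_LAYERS - len(active))


def fitness(chromosome, cv_error):
    """Reciprocal of the cross-validation error (small epsilon avoids 1/0)."""
    return 1.0 / (cv_error(chromosome) + 1e-9)


def toy_cv_error(chromosome):
    """Hypothetical error surface: pretends one hidden layer of 12 units is ideal.
    In real use, train the network described by the chromosome instead."""
    active = [u for u in chromosome if u > 0]
    return 0.05 + 0.01 * abs(active[0] - 12) + 0.02 * (len(active) - 1)


def evolve(cv_error, generations=30, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=lambda c: fitness(c, cv_error), reverse=True)
        parents = population[: population_size // 2]   # keep the fitter half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in parents
        ]
        population = parents + children
    return max(population, key=lambda c: fitness(c, cv_error))


best = evolve(toy_cv_error)
print(best, is_valid(best))
```

To penalize large networks as the text suggests, one option is to add a term proportional to `sum(chromosome)` to the error before taking the reciprocal.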

You can also consider:

  • Pruning: Start with a large network, then reduce the layers and hidden units, while keeping track of cross-validation set performance.
  • Growing: Start with a very small network, then add units and layers, and again keep track of CV set performance.