What is num_units in tensorflow BasicLSTMCell?
From this brilliant article
num_units can be interpreted as the analogue of the hidden layer in a feed-forward neural network. The number of nodes in the hidden layer of a feed-forward neural network is equivalent to the num_units number of LSTM units in an LSTM cell at every time step of the network.
See the image there too!
The argument n_hidden of BasicLSTMCell (passed as its num_units parameter) is the number of hidden units of the LSTM.
As you said, you should really read Colah's blog post to understand LSTM, but here is a little heads up.
If you have an input x of shape [T, 10], you will feed the LSTM with the sequence of values from t=0 to t=T-1, each of size 10.
At each timestep, you multiply the input by a matrix of shape [10, n_hidden], and get a vector of size n_hidden.
Your LSTM gets at each timestep t:
- the previous hidden state h_{t-1}, of size n_hidden (at t=0, the previous state is [0., 0., ...])
- the input, transformed to size n_hidden
- it will sum these inputs and produce the next hidden state h_t of size n_hidden
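The simplified recurrence above can be sketched in NumPy. This is a rough illustration, not what BasicLSTMCell actually computes: the hidden-to-hidden matrix W_h, the tanh, the random weights, and the function name run_simple_recurrence are all assumptions added here, and a real LSTM adds gates on top of this.

```python
import numpy as np

def run_simple_recurrence(x, n_hidden, seed=0):
    """Run a simplified recurrence over x of shape [T, input_size]."""
    rng = np.random.default_rng(seed)
    input_size = x.shape[1]
    # Input-to-hidden matrix of shape [input_size, n_hidden], as in the text.
    W_x = rng.normal(scale=0.1, size=(input_size, n_hidden))
    # Hidden-to-hidden matrix: an assumption added for this sketch.
    W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    h = np.zeros(n_hidden)  # at t=0 the previous state is all zeros
    states = []
    for t in range(x.shape[0]):
        x_proj = x[t] @ W_x            # input transformed to size n_hidden
        h = np.tanh(x_proj + h @ W_h)  # sum the two contributions, then squash
        states.append(h)
    return np.stack(states)

x = np.ones((5, 10))                   # T=5 timesteps, each of size 10
states = run_simple_recurrence(x, n_hidden=128)
print(states.shape)                    # (5, 128)
```

Whatever the input size is, every hidden state has size n_hidden, which is the point of the parameter.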
From Colah's blog post:
If you just want working code, stick with n_hidden = 128 and you will be fine.
The number of hidden units is a direct representation of the learning capacity of a neural network -- it reflects the number of learned parameters. The value 128 was likely selected arbitrarily or empirically. You can change that value experimentally and rerun the program to see how it affects the training accuracy (you can get better than 90% test accuracy with a lot fewer hidden units). Using more units makes it more likely to perfectly memorize the complete training set (although it will take longer, and you run the risk of over-fitting).
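To get a feel for how the parameter count grows with the number of hidden units, here is a small sketch. It assumes a standard LSTM without peepholes or projection (four gate blocks, each with weights over the concatenated input and previous hidden state, plus a bias), which as far as I know matches BasicLSTMCell. The function name lstm_param_count and the input size of 10 are just for illustration.

```python
def lstm_param_count(input_size, num_units):
    """Learned parameters in a standard LSTM cell (assumed: no peepholes, no projection)."""
    # Four blocks (input gate, forget gate, output gate, candidate),
    # each mapping [input, h_prev] to num_units values.
    kernel = (input_size + num_units) * 4 * num_units
    bias = 4 * num_units
    return kernel + bias

for n in (32, 64, 128):
    print(n, lstm_param_count(10, n))
# 32 5504
# 64 19200
# 128 71168
```

Doubling the number of units more than doubles the parameter count, because the hidden-to-hidden part of the weights grows quadratically.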
The key thing to understand, which is somewhat subtle in Colah's famous blog post (find "each line carries an entire vector"), is that X is an array of data (nowadays often called a tensor) -- it is not meant to be a scalar value. Where, for example, the tanh function is shown, it is meant to imply that the function is broadcast across the entire array (an implicit for loop) -- and not simply performed once per time-step.
As such, the hidden units represent tangible storage within the network, which is manifest primarily in the size of the weights array. And because an LSTM actually does have a bit of its own internal storage separate from the learned model parameters, it has to know how many units there are -- which ultimately needs to agree with the size of the weights. In the simplest case, an RNN has no internal storage -- so it doesn't even need to know in advance how many "hidden units" it is being applied to.
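To make the link between the state size and the weights concrete, here is a minimal NumPy sketch of one step of a standard LSTM. It is an approximation under stated assumptions, not TensorFlow's code: the function names, the gate ordering in the kernel, and the random weights are mine, and it leaves out the forget-gate bias offset that BasicLSTMCell applies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, c_prev, h_prev, kernel, bias):
    """One step of a standard LSTM; c and h are the cell's internal storage."""
    # kernel has shape [input_size + num_units, 4 * num_units]
    z = np.concatenate([x_t, h_prev]) @ kernel + bias
    i, j, f, o = np.split(z, 4)  # gate order is a convention assumed here
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(j)
    h = sigmoid(o) * np.tanh(c)
    return c, h

input_size, num_units = 10, 128
rng = np.random.default_rng(0)
kernel = rng.normal(scale=0.1, size=(input_size + num_units, 4 * num_units))
bias = np.zeros(4 * num_units)

# The state must be allocated with num_units entries before any data is seen.
c = np.zeros(num_units)
h = np.zeros(num_units)
for x_t in rng.normal(size=(5, input_size)):
    c, h = lstm_step(x_t, c, h, kernel, bias)
print(c.shape, h.shape)  # (128,) (128,)
```

If the state were allocated with a different size than the one baked into the kernel, the matrix product would fail, which is why the cell needs the number of units up front.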
- A good answer to a similar question here.
- You can look at the source for BasicLSTMCell in TensorFlow to see exactly how this is used.
Side note: This notation is very common in statistics and machine learning, and other fields that process large batches of data with a common formula (3D graphics is another example). It takes a bit of getting used to for people who expect to see their for loops written out explicitly.
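For anyone used to explicit loops, a tiny NumPy comparison may help. The array values are arbitrary; the point is only that applying tanh to a whole array gives the same result as looping over its elements.

```python
import math
import numpy as np

x = np.array([-1.0, 0.0, 0.5, 2.0])

# Broadcast form: one call, applied to every element (the implicit for loop).
broadcast = np.tanh(x)

# Explicit form: the same computation with the loop written out.
looped = np.array([math.tanh(v) for v in x])

print(np.allclose(broadcast, looped))  # True
```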