Calculating the number of parameters of a GRU layer (Keras)
The key is that TensorFlow separates the biases for the input and recurrent kernels when the parameter reset_after=True is set in GRUCell. You can look at the relevant part of the GRUCell source code:
```python
if self.use_bias:
    if not self.reset_after:
        bias_shape = (3 * self.units,)
    else:
        # separate biases for input and recurrent kernels
        # Note: the shape is intentionally different from CuDNNGRU biases
        # `(2 * 3 * self.units,)`, so that we can distinguish the classes
        # when loading and converting saved weights.
        bias_shape = (2, 3 * self.units)
```
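To see these shapes concretely, you can build a layer and inspect its weights. A minimal sketch, assuming a 16-dimensional input and 32 units:

```python
import tensorflow as tf

layer = tf.keras.layers.GRU(32)            # reset_after=True is the TF2 default
layer.build(input_shape=(None, None, 16))  # (batch, time, features)
kernel, recurrent_kernel, bias = layer.get_weights()
print(kernel.shape)            # (16, 96): input kernel, 3 gates stacked
print(recurrent_kernel.shape)  # (32, 96): recurrent kernel, 3 gates stacked
print(bias.shape)              # (2, 96): separate input and recurrent biases
```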
Taking the reset gate as an example, we generally see the following formula:

$$r_t = \sigma(x_t W_{xr} + h_{t-1} W_{hr} + b_r)$$

But if we set reset_after=True, the actual formula is as follows, with separate biases for the input and recurrent terms:

$$r_t = \sigma(x_t W_{xr} + b_{xr} + h_{t-1} W_{hr} + b_{hr})$$
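Note that for the reset (and update) gate the two biases only ever enter as a sum, so the gate itself cannot tell them apart; the split shape mainly serves the weight-conversion purpose the source comment above mentions. A toy NumPy check (all sizes made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 16))        # input x_t
h = rng.standard_normal((1, 32))        # previous state h_{t-1}
W_xr = rng.standard_normal((16, 32))
W_hr = rng.standard_normal((32, 32))
b_xr = rng.standard_normal(32)
b_hr = rng.standard_normal(32)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
r_separate = sigmoid(x @ W_xr + b_xr + h @ W_hr + b_hr)
r_merged = sigmoid(x @ W_xr + h @ W_hr + (b_xr + b_hr))
print(np.allclose(r_separate, r_merged))  # True: equivalent to one merged bias
```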
As you can see, the default for GRU is reset_after=True in TensorFlow 2, but reset_after=False in TensorFlow 1.x.
So in TensorFlow 2 the number of parameters of the GRU layer should be ((16 + 32) * 32 + 32 + 32) * 3 * 2 = 9600: (16 + 32) * 32 for the input and recurrent kernels, plus 32 + 32 for the two separate bias vectors, times 3 for the three gates, times 2 for the two directions of the bidirectional wrapper.
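To verify the count, here is a minimal sketch, assuming the shapes behind the arithmetic above (16-dimensional inputs, 32 units, and a Bidirectional wrapper, which is where the final * 2 comes from):

```python
import tensorflow as tf

# TF2 default: reset_after=True -> separate input and recurrent biases
model = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32),
                                  input_shape=(None, 16)),
])
model.summary()  # ((16 + 32) * 32 + 32 + 32) * 3 * 2 = 9600 parameters

# TF1-style cell: reset_after=False -> one combined bias per gate
model_v1 = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32, reset_after=False),
                                  input_shape=(None, 16)),
])
model_v1.summary()  # ((16 + 32) * 32 + 32) * 3 * 2 = 9408 parameters
```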
I figured out a little bit more about this, as an addition to the accepted answer. What Keras does in GRUCell.call() is:
With reset_after=False (default in TensorFlow 1):

$$z_t = \sigma(x_t W_{xz} + b_{xz} + h_{t-1} W_{hz} + b_{hz})$$
$$r_t = \sigma(x_t W_{xr} + b_{xr} + h_{t-1} W_{hr} + b_{hr})$$
$$\tilde{h}_t = \tanh(x_t W_{xh} + b_{xh} + (r_t \odot h_{t-1}) W_{hh} + b_{hh})$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
With reset_after=True (default in TensorFlow 2):

$$z_t = \sigma(x_t W_{xz} + b_{xz} + h_{t-1} W_{hz} + b_{hz})$$
$$r_t = \sigma(x_t W_{xr} + b_{xr} + h_{t-1} W_{hr} + b_{hr})$$
$$\tilde{h}_t = \tanh(x_t W_{xh} + b_{xh} + r_t \odot (h_{t-1} W_{hh} + b_{hh}))$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
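In code, the only structural difference between the two modes is where r is applied in the candidate-state computation. A simplified paraphrase of that branch of GRUCell.call() (not a verbatim excerpt; x_h is assumed to already contain x_t W_xh + b_xh):

```python
import numpy as np

def candidate_state(x_h, h_tm1, r, recurrent_kernel_h, b_hh, reset_after):
    # x_h already contains x_t @ W_xh + b_xh
    if reset_after:
        # The recurrent bias is added before gating, so b_hh is scaled by r
        # and cannot be folded into b_xh.
        recurrent_h = r * (h_tm1 @ recurrent_kernel_h + b_hh)
    else:
        # r is applied before the matmul; a separate recurrent bias would
        # just add to b_xh, so one combined bias vector suffices.
        recurrent_h = (r * h_tm1) @ recurrent_kernel_h
    return np.tanh(x_h + recurrent_h)
```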
After training with reset_after=False, b_xz equals b_hz, b_xr equals b_hr and b_xh equals b_hh, because (I assume) TensorFlow realizes that each of these pairs of vectors can be combined into one single parameter vector - just like the OP pointed out in a comment above. However, with reset_after=True, that's not the case for b_xh and b_hh: they can and will be different, so they cannot be combined into one vector, and that's why the total parameter count is higher.
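As a rough empirical check of that last point, here is a sketch that trains a small reset_after=True GRU on random data and then compares the bias halves (the model, data and sizes are all made up for illustration):

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(0)
model = tf.keras.Sequential([
    tf.keras.layers.GRU(32, reset_after=True, input_shape=(None, 16)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
x = np.random.rand(256, 10, 16).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model.fit(x, y, epochs=5, verbose=0)

# bias has shape (2, 3 * units): row 0 is the input bias, row 1 the recurrent bias
input_bias, recurrent_bias = model.layers[0].get_weights()[2]
b_xz, b_xr, b_xh = np.split(input_bias, 3)
b_hz, b_hr, b_hh = np.split(recurrent_bias, 3)

# The z and r bias pairs receive identical gradients and start at zero, so
# they should stay (numerically) equal; b_hh is gated by r_t, so it drifts
# away from b_xh during training.
print(np.allclose(b_xz, b_hz), np.allclose(b_xr, b_hr))  # expected: True True
print(np.allclose(b_xh, b_hh))                           # expected: False
```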