Calculating the number of parameters of a GRU layer (Keras)
The key is that TensorFlow separates the biases for the input and recurrent kernels when the parameter reset_after=True is set in GRUCell. You can look at the relevant part of the GRUCell source code:
```python
if self.use_bias:
    if not self.reset_after:
        bias_shape = (3 * self.units,)
    else:
        # separate biases for input and recurrent kernels
        # Note: the shape is intentionally different from CuDNNGRU biases
        # `(2 * 3 * self.units,)`, so that we can distinguish the classes
        # when loading and converting saved weights.
        bias_shape = (2, 3 * self.units)
```
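To see these shapes concretely, you can build a layer and inspect its weights. A minimal sketch, assuming a 16-dimensional input and 32 units:

```python
import tensorflow as tf

layer = tf.keras.layers.GRU(32)            # reset_after=True is the TF2 default
layer.build(input_shape=(None, None, 16))  # (batch, time, features)
kernel, recurrent_kernel, bias = layer.get_weights()
print(kernel.shape)            # (16, 96): input kernel, 3 gates stacked
print(recurrent_kernel.shape)  # (32, 96): recurrent kernel, 3 gates stacked
print(bias.shape)              # (2, 96): separate input and recurrent biases
```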
Taking the reset gate as an example, we generally see the following formula:

$$r_t = \sigma(x_t W_{xr} + h_{t-1} W_{hr} + b_r)$$

But if we set reset_after=True, the actual formula is as follows, with separate biases for the input and recurrent terms:

$$r_t = \sigma(x_t W_{xr} + b_{xr} + h_{t-1} W_{hr} + b_{hr})$$
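Note that for the reset (and update) gate the two biases only ever enter as a sum, so the gate itself cannot tell them apart; the split shape mainly serves the weight-conversion purpose the source comment above mentions. A toy NumPy check (all sizes made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 16))        # input x_t
h = rng.standard_normal((1, 32))        # previous state h_{t-1}
W_xr = rng.standard_normal((16, 32))
W_hr = rng.standard_normal((32, 32))
b_xr = rng.standard_normal(32)
b_hr = rng.standard_normal(32)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
r_separate = sigmoid(x @ W_xr + b_xr + h @ W_hr + b_hr)
r_merged = sigmoid(x @ W_xr + h @ W_hr + (b_xr + b_hr))
print(np.allclose(r_separate, r_merged))  # True: equivalent to one merged bias
```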
As you can see, the default for GRU is reset_after=True in TensorFlow 2, but reset_after=False in TensorFlow 1.x.
So in TensorFlow 2 the number of parameters of the GRU layer should be ((16 + 32) * 32 + 32 + 32) * 3 * 2 = 9600: (16 + 32) * 32 for the input and recurrent kernels, plus 32 + 32 for the two separate bias vectors, times 3 for the three gates, times 2 for the two directions of the bidirectional wrapper.
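To verify the count, here is a minimal sketch, assuming the shapes behind the arithmetic above (16-dimensional inputs, 32 units, and a Bidirectional wrapper, which is where the final * 2 comes from):

```python
import tensorflow as tf

# TF2 default: reset_after=True -> separate input and recurrent biases
model = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32),
                                  input_shape=(None, 16)),
])
model.summary()  # ((16 + 32) * 32 + 32 + 32) * 3 * 2 = 9600 parameters

# TF1-style cell: reset_after=False -> one combined bias per gate
model_v1 = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32, reset_after=False),
                                  input_shape=(None, 16)),
])
model_v1.summary()  # ((16 + 32) * 32 + 32) * 3 * 2 = 9408 parameters
```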
I figured out a little bit more about this, as an addition to the accepted answer. What Keras does in GRUCell.call() is:
With reset_after=False (default in TensorFlow 1):

$$z_t = \sigma(x_t W_{xz} + b_{xz} + h_{t-1} W_{hz} + b_{hz})$$
$$r_t = \sigma(x_t W_{xr} + b_{xr} + h_{t-1} W_{hr} + b_{hr})$$
$$\tilde{h}_t = \tanh(x_t W_{xh} + b_{xh} + (r_t \odot h_{t-1}) W_{hh} + b_{hh})$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
With reset_after=True (default in TensorFlow 2):

$$z_t = \sigma(x_t W_{xz} + b_{xz} + h_{t-1} W_{hz} + b_{hz})$$
$$r_t = \sigma(x_t W_{xr} + b_{xr} + h_{t-1} W_{hr} + b_{hr})$$
$$\tilde{h}_t = \tanh(x_t W_{xh} + b_{xh} + r_t \odot (h_{t-1} W_{hh} + b_{hh}))$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
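In code, the only structural difference between the two modes is where r is applied in the candidate-state computation. A simplified paraphrase of that branch of GRUCell.call() (not a verbatim excerpt; x_h is assumed to already contain x_t W_xh + b_xh):

```python
import numpy as np

def candidate_state(x_h, h_tm1, r, recurrent_kernel_h, b_hh, reset_after):
    # x_h already contains x_t @ W_xh + b_xh
    if reset_after:
        # The recurrent bias is added before gating, so b_hh is scaled by r
        # and cannot be folded into b_xh.
        recurrent_h = r * (h_tm1 @ recurrent_kernel_h + b_hh)
    else:
        # r is applied before the matmul; a separate recurrent bias would
        # just add to b_xh, so one combined bias vector suffices.
        recurrent_h = (r * h_tm1) @ recurrent_kernel_h
    return np.tanh(x_h + recurrent_h)
```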
After training with reset_after=False, b_xz equals b_hz, b_xr equals b_hr and b_xh equals b_hh, because (I assume) TensorFlow realizes that each of these pairs of vectors can be combined into one single parameter vector - just like the OP pointed out in a comment above. However, with reset_after=True, that's not the case for b_xh and b_hh: they can and will be different, so they cannot be combined into one vector, and that's why the total parameter count is higher.
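As a rough empirical check of that last point, here is a sketch that trains a small reset_after=True GRU on random data and then compares the bias halves (the model, data and sizes are all made up for illustration):

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(0)
model = tf.keras.Sequential([
    tf.keras.layers.GRU(32, reset_after=True, input_shape=(None, 16)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
x = np.random.rand(256, 10, 16).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model.fit(x, y, epochs=5, verbose=0)

# bias has shape (2, 3 * units): row 0 is the input bias, row 1 the recurrent bias
input_bias, recurrent_bias = model.layers[0].get_weights()[2]
b_xz, b_xr, b_xh = np.split(input_bias, 3)
b_hz, b_hr, b_hh = np.split(recurrent_bias, 3)

# The z and r bias pairs receive identical gradients and start at zero, so
# they should stay (numerically) equal; b_hh is gated by r_t, so it drifts
# away from b_xh during training.
print(np.allclose(b_xz, b_hz), np.allclose(b_xr, b_hr))  # expected: True True
print(np.allclose(b_xh, b_hh))                           # expected: False
```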