Understanding TensorFlow LSTM Input Shape

(This answer addresses the case where a direct np.reshape() does not arrange the final array the way we want. np.reshape() will readily produce a 3D array, but watch out for how the values end up organized inside it.)

In my own attempt to finally settle how to shape the input for an RNN, and to stop getting confused by it, here is my personal explanation.

In my case (and I suspect many others organize their feature matrices the same way), most blog posts out there don't help. So let's work through how to convert a 2D feature matrix into a 3D one for RNNs.

Let's say our feature matrix is organized like this: we have 5 observations (i.e. rows; by convention I think that is the most logical term to use) and, in each row, 2 features for EACH timestep (and we have 2 timesteps), like this:

(The df is only there to make this easier to visualize.)

In [1]: import numpy as np                                                           

In [2]: arr = np.random.randint(0,10,20).reshape((5,4))                              

In [3]: arr                                                                          
Out[3]: 
array([[3, 7, 4, 4],
       [7, 0, 6, 0],
       [2, 0, 2, 4],
       [3, 9, 3, 4],
       [1, 2, 3, 0]])

In [4]: import pandas as pd                                                          

In [5]: df = pd.DataFrame(arr, columns=['f1_t1', 'f2_t1', 'f1_t2', 'f2_t2'])         

In [6]: df                                                                           
Out[6]: 
   f1_t1  f2_t1  f1_t2  f2_t2
0      3      7      4      4
1      7      0      6      0
2      2      0      2      4
3      3      9      3      4
4      1      2      3      0

We will now take the values and work with them. The key point is that RNNs add a "timestep" dimension to their input because of their architectural nature. We can picture that dimension as 2D arrays stacked one behind the other, one per timestep. Here we have two timesteps, so we will have two stacked 2D arrays: one for timestep1 and, behind it, one for timestep2.

In that 3D input we still have 5 observations; we just need to arrange them differently. The RNN will take the first row (or the specified batch, but we will keep it simple here) of the first array (i.e. timestep1) and the first row of the second stacked array (i.e. timestep2), then the second row of each, and so on until the last one (the 5th in our example). So each row of each timestep's array must hold the two features for that timestep, with each timestep kept in its own array. Let's see this with the numbers.

I will make two arrays for easier understanding. Given the way the columns are organized in the df, we need to take the first two columns (i.e. features 1 and 2 for timestep1) as our FIRST ARRAY OF THE STACK and the last two columns (the 3rd and the 4th) as our SECOND ARRAY OF THE STACK, so that everything lines up.

In [7]: arrStack1 = arr[:,0:2]                                                       

In [8]: arrStack1                                                                    
Out[8]: 
array([[3, 7],
       [7, 0],
       [2, 0],
       [3, 9],
       [1, 2]])

In [9]: arrStack2 = arr[:,2:4]                                                       

In [10]: arrStack2                                                                   
Out[10]: 
array([[4, 4],
       [6, 0],
       [2, 4],
       [3, 4],
       [3, 0]])

Finally, the only thing we need to do is stack both arrays ("one behind the other") as if they were part of the same final structure:

In [11]: arrfinal3D = np.stack([arrStack1, arrStack2])                               

In [12]: arrfinal3D                                                                  
Out[12]: 
array([[[3, 7],
        [7, 0],
        [2, 0],
        [3, 9],
        [1, 2]],

       [[4, 4],
        [6, 0],
        [2, 4],
        [3, 4],
        [3, 0]]])

In [13]: arrfinal3D.shape                                                            
Out[13]: (2, 5, 2)

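That's it: our feature matrix is now a 3D array of shape (timesteps, observations, features) = (2, 5, 2), built to match the column organization of the 2D feature matrix. Note that this is a time-major layout. tf.nn.dynamic_rnn accepts it only with time_major=True; its default (like Keras recurrent layers) expects (observations, timesteps, features), which you get by passing axis=1 to np.stack or by swapping the first two axes.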
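To make the axis order concrete, here is a small sketch that rebuilds the example array and compares the layouts. The names time_major, batch_major and naive are mine, not part of the walkthrough above; treat this as an illustration rather than the only way to do it.

```python
import numpy as np

# Same values as the example feature matrix above.
arr = np.array([[3, 7, 4, 4],
                [7, 0, 6, 0],
                [2, 0, 2, 4],
                [3, 9, 3, 4],
                [1, 2, 3, 0]])

# Time-major: (timesteps, observations, features), as built in the walkthrough.
time_major = np.stack([arr[:, 0:2], arr[:, 2:4]])

# Batch-major: (observations, timesteps, features).
batch_major = np.stack([arr[:, 0:2], arr[:, 2:4]], axis=1)

# Swapping the first two axes converts one layout into the other.
assert np.array_equal(batch_major, time_major.transpose(1, 0, 2))

# With this column order, a direct reshape gives the batch-major layout...
assert np.array_equal(batch_major, arr.reshape(5, 2, 2))

# ...but a direct reshape to (2, 5, 2) does NOT give the time-major stack.
naive = arr.reshape(2, 5, 2)
assert not np.array_equal(naive, time_major)

print(time_major.shape, batch_major.shape)  # (2, 5, 2) (5, 2, 2)
```

This is the pitfall mentioned at the top: np.reshape always fills the new shape in row-major order, so it only matches the layout you want when the columns happen to be ordered accordingly.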

As a one-liner, all of this can be written as:

In [14]: arrfinal3D_1 = np.stack([arr[:,0:2], arr[:,2:4]])                           

In [15]: arrfinal3D_1                                                                
Out[15]: 
array([[[3, 7],
        [7, 0],
        [2, 0],
        [3, 9],
        [1, 2]],

       [[4, 4],
        [6, 0],
        [2, 4],
        [3, 4],
        [3, 0]]])
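For more than two timesteps, slicing by hand gets tedious. The following is a sketch of a general helper, assuming the columns are grouped by timestep exactly as in the df above (f1_t1, f2_t1, f1_t2, f2_t2, ...). The function name to_3d and its parameters are hypothetical, not from any library.

```python
import numpy as np

def to_3d(arr2d, n_timesteps, n_features, time_major=True):
    """Convert (observations, timesteps * features) into a 3D RNN input.

    Assumes the columns are grouped by timestep: all features of t1,
    then all features of t2, and so on.
    """
    n_obs, n_cols = arr2d.shape
    if n_cols != n_timesteps * n_features:
        raise ValueError("column count must equal n_timesteps * n_features")
    # Row-major reshape splits each row into n_timesteps chunks of n_features.
    batch_major = arr2d.reshape(n_obs, n_timesteps, n_features)
    # Optionally move the timestep axis to the front.
    return batch_major.transpose(1, 0, 2) if time_major else batch_major

# 5 observations, 3 timesteps with 2 features each.
arr = np.arange(30).reshape(5, 6)
out = to_3d(arr, n_timesteps=3, n_features=2)

# Same result as stacking the column slices manually.
expected = np.stack([arr[:, 0:2], arr[:, 2:4], arr[:, 4:6]])
assert np.array_equal(out, expected)
print(out.shape)  # (3, 5, 2)
```

If your columns are grouped by feature instead (f1_t1, f1_t2, f2_t1, f2_t2), this helper would mix them up, so check the column order first.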

The documentation of tf.nn.dynamic_rnn states:

inputs: The RNN inputs. If time_major == False (default), this must be a Tensor of shape: [batch_size, max_time, ...], or a nested tuple of such elements.

In your case, this means that the input should have a shape of [batch_size, 10, 2]. Instead of training on all 4000 sequences at once, you'd use only batch_size many of them in each training iteration. Something like the following should work (added reshape for clarity):

import tensorflow as tf  # TensorFlow 1.x API

batch_size = 32
# batch_size sequences of length 10 with 2 values for each timestep
# (get_batch and X are placeholders for your own data pipeline)
inputs = get_batch(X, batch_size).reshape([batch_size, 10, 2])
# Create LSTM cell with state size 256. Could also use GRUCell, ...
# Note: state_is_tuple=False is deprecated;
# the option might be completely removed in the future
cell = tf.nn.rnn_cell.LSTMCell(256, state_is_tuple=True)
outputs, state = tf.nn.dynamic_rnn(cell,
                                   inputs,
                                   sequence_length=[10]*batch_size,
                                   dtype=tf.float32)

From the documentation, outputs will be of shape [batch_size, 10, 256], i.e. one 256-dimensional output vector for each timestep. state will be an LSTMStateTuple (c, h) whose two tensors each have shape [batch_size, 256]. You could predict your final value, one for each sequence, from that:

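# tf.contrib exists only in TensorFlow 1.x (it was removed in 2.0)
predictions = tf.contrib.layers.fully_connected(state.h,
                                                num_outputs=1,
                                                activation_fn=None)
# get_loss is a placeholder for your own loss function
loss = get_loss(get_batch(Y, batch_size).reshape([batch_size, 1]), predictions)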
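The get_batch helper is not defined in this answer. Below is one possible sketch in plain NumPy, under several assumptions of mine: X is stored as a (4000, 20) array with the 10 timesteps of 2 values flattened per row, Y is a (4000,) array of targets, and batches are drawn at random without replacement. Unlike the calls above, this version returns inputs and targets together so that they stay aligned.

```python
import numpy as np

def get_batch(X, Y, batch_size, rng):
    """Hypothetical helper: draw matching rows of X and Y at random."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return X[idx], Y[idx]

rng = np.random.default_rng(0)

# Placeholder data standing in for the 4000 sequences (assumed layout).
X = rng.normal(size=(4000, 20))   # 10 timesteps x 2 values, flattened per row
Y = rng.normal(size=(4000,))      # one target value per sequence

batch_size = 32
x_batch, y_batch = get_batch(X, Y, batch_size, rng)

# Batch-major input for the RNN and a column vector of targets.
x_batch = x_batch.reshape([batch_size, 10, 2])
y_batch = y_batch.reshape([batch_size, 1])

print(x_batch.shape, y_batch.shape)  # (32, 10, 2) (32, 1)
```

The reshape to [batch_size, 10, 2] is only correct if each row of X lists both values of timestep 1, then both values of timestep 2, and so on.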

The number 256 in the shapes of outputs and state is determined by cell.output_size and cell.state_size, respectively. When creating the LSTMCell like above, these are the same. Also see the LSTMCell documentation.
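To see where these shapes come from without running TensorFlow, here is a minimal NumPy sketch of an LSTM forward pass over a batch-major input. It is not TensorFlow's implementation (the weight initialization, gate ordering and forget bias are my own simplifications), and lstm_forward is a hypothetical name. It only illustrates how the outputs and the final (c, h) state get their shapes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, num_units, rng):
    """Run a randomly initialised LSTM over x of shape [batch, time, features]."""
    batch_size, max_time, n_features = x.shape
    # One weight matrix covering the four gates (input, forget, candidate, output).
    W = rng.normal(scale=0.1, size=(n_features + num_units, 4 * num_units))
    b = np.zeros(4 * num_units)
    h = np.zeros((batch_size, num_units))  # hidden state
    c = np.zeros((batch_size, num_units))  # cell state
    outputs = []
    for t in range(max_time):
        # Each step sees the current timestep slice plus the previous hidden state.
        z = np.concatenate([x[:, t, :], h], axis=1) @ W + b
        i, f, g, o = np.split(z, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    # Stack the per-timestep hidden states along the time axis.
    return np.stack(outputs, axis=1), (c, h)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10, 2))          # [batch_size, 10, 2]
outputs, (c, h) = lstm_forward(x, 256, rng)

print(outputs.shape, c.shape, h.shape)    # (32, 10, 256) (32, 256) (32, 256)

# For full-length sequences, the last output equals the final hidden state h.
assert np.allclose(outputs[:, -1, :], h)
```

This is why predicting from state.h, as above, is equivalent to predicting from the output at the last timestep when every sequence has the full length of 10.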