What "exactly" happens inside embedding layer in pytorch?
That is a really good question! The embedding layer of PyTorch (the same goes for TensorFlow) serves as a lookup table just to retrieve the embeddings for each of the inputs, which are indices. Consider the following case: you have a sentence where each word is tokenized, so each word in your sentence is represented by a unique integer (index). If the list of indices (words) is [1, 5, 9], and you want to encode each of the words with a 50-dimensional vector (embedding), you can do the following:
import torch

# The list of token indices
tokens = torch.tensor([1, 5, 9], dtype=torch.long)
# Define an embedding layer, where you know upfront that in total you
# have 10 distinct words, and you want each word to be encoded with
# a 50-dimensional vector
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=50)
# Obtain the embeddings for each of the words in the sentence
embedded_words = embedding(tokens)
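For reference, here is a minimal check of my own (not part of the original snippet) showing that the result is one 50-dimensional vector per input token:
# Sanity check: one 50-dimensional vector per input token
print(embedded_words.shape)  # torch.Size([3, 50])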
Now, to answer your questions:
During the forward pass, the values for each of the tokens in your sentence are obtained in a similar way to how NumPy's indexing works. Because, under the hood, this is a differentiable operation, during the backward pass (training) PyTorch computes the gradients for each of the embeddings and adjusts them accordingly (see the gradient sketch after these answers).
The weights are the embeddings themselves. The word embedding matrix is actually a weight matrix that will be learned during training.
There is no actual function per se. As we defined above, the sentence is already tokenized (each word is represented with a unique integer), and we can just obtain the embeddings for each of the tokens in the sentence.
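To make the gradient point concrete, here is a minimal sketch of my own (not from the original answer) showing that a backward pass only produces non-zero gradients for the rows of the weight matrix that were actually looked up:
# Sketch: gradients flow only to the rows that were indexed
demo_embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=50)
demo_tokens = torch.tensor([1, 5, 9], dtype=torch.long)
loss = demo_embedding(demo_tokens).sum()
loss.backward()
print(demo_embedding.weight.grad[1].abs().sum())  # non-zero, row 1 was looked up
print(demo_embedding.weight.grad[0].abs().sum())  # tensor(0.), row 0 was never looked up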
Finally, since I have referred to the indexing analogy several times, let us try it out.
import numpy as np

# Let us assume that we have a pre-trained embedding matrix
pretrained_embeddings = torch.rand(10, 50)
# We can initialize our embedding module from the embedding matrix
embedding = torch.nn.Embedding.from_pretrained(pretrained_embeddings)
# Some tokens
tokens = torch.tensor([1,5,9], dtype=torch.long)
# Token embeddings from the lookup table
lookup_embeddings = embedding(tokens)
# Token embeddings obtained with indexing
indexing_embeddings = pretrained_embeddings[tokens]
# Voila! They are the same
np.testing.assert_array_equal(lookup_embeddings.numpy(), indexing_embeddings.numpy())
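One more side note (not part of the original example): from_pretrained freezes the weights by default, so pass freeze=False if the pre-trained embeddings should be fine-tuned during training:
# By default from_pretrained sets requires_grad=False (freeze=True)
trainable_embedding = torch.nn.Embedding.from_pretrained(pretrained_embeddings, freeze=False)
print(trainable_embedding.weight.requires_grad)  # True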
The nn.Embedding layer can serve as a lookup table. This means that if you have a dictionary of n elements, you can call each element by id once you have created the embedding. In this case the size of the dictionary would be num_embeddings and embedding_dim would be 1.
There is nothing to learn in this scenario: you are just indexing elements of a dictionary, or encoding them, you might say. So no forward-pass analysis is needed in this case.
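As a rough illustration of that pure-lookup use (my own sketch, not from the original answer), you can wrap a frozen column of scalar values and retrieve them by id:
import torch

# A frozen embedding with embedding_dim=1 acts as a plain id -> value lookup
values = torch.tensor([[10.0], [20.0], [30.0]])      # 3 ids, one scalar value each
lookup = torch.nn.Embedding.from_pretrained(values)  # frozen by default, nothing to learn
print(lookup(torch.tensor([2, 0])))                  # tensor([[30.], [10.]])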
You may have used this if you used word embeddings like Word2vec.
On the other hand, you may use embedding layers for categorical variables (features, in the general case). There you set num_embeddings to the number of categories you have, and choose embedding_dim as the size of the vector learned for each category.
In that case you start with a randomly initialized embedding layer, and the embeddings for the categories (features) are learned during training through the forward and backward passes, as in the sketch below.
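For instance, here is a minimal sketch (my own, with made-up sizes) for a categorical feature with 4 categories, each mapped to a learnable 8-dimensional vector:
# Embeddings for a categorical feature with 4 distinct categories
num_categories = 4   # num_embeddings: number of distinct category ids
vector_size = 8      # embedding_dim: size of the learned vector per category
category_embedding = torch.nn.Embedding(num_categories, vector_size)
# A batch of category ids; their vectors are adjusted during training
batch_ids = torch.tensor([0, 3, 1], dtype=torch.long)
print(category_embedding(batch_ids).shape)  # torch.Size([3, 8])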