What is a multi-headed model? And what exactly is a 'head' in a model?
The head is the top of a network. For instance, at the bottom (where the data comes in) you take the convolutional layers of some model, say ResNet. In fast.ai, if you call ConvLearner.pretrained, ConvnetBuilder will build a network with a head appropriate to your data (if you are working on a classification problem, it will create a head with a cross-entropy loss; if you are working on a regression problem, it will create a head suited to that).
But you could build a model that has multiple heads. The model could take the activations of the base network (the ResNet conv layers) and feed them to one head, say head1, and then feed the same activations to head2. Or you could have some number of shared layers built on top of ResNet, with only those layers feeding into head1 and head2.
You could even have different layers feed into different heads! There are some nuances to this (for instance, in the fastai library, ConvnetBuilder adds an AdaptivePooling layer on top of the base network if you don’t specify the custom_head argument, and does not add one if you do), but this is the general picture.
- https://forums.fast.ai/t/terminology-question-head-of-neural-network/14819/2
- https://youtu.be/h5Tz7gZT9Fo?t=3613 (1:13:00)
The explanation you found is accurate. Depending on what you want to predict from your data, you need an adequate backbone network and a certain number of prediction heads.
For a basic classification network, for example, you can view ResNet, AlexNet, VGGNet, Inception, ... as the backbone and the fully connected layer as the sole prediction head.
A good example of a problem where you need multiple heads is localization, where you not only want to classify what is in the image but also want to localize the object (find the coordinates of the bounding box around it).
The image below shows the general architecture
The backbone network ("convolution and pooling") is responsible for extracting a feature map from the image that contains higher-level, summarized information. Each head takes this feature map as input and predicts its own output.
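To make the shared-backbone idea concrete, here is a minimal, framework-free sketch in plain Python. It is a toy, not how fastai or any other library builds its models: the "backbone" is a single dense layer with a ReLU standing in for the convolution and pooling stack, and all names (`TwoHeadedModel`, `class_head`, `box_head`, `init_layer`) are made up for illustration. The point is only that both heads read the same feature vector.

```python
import random


def linear(x, weights, bias):
    """Dense layer: one output value per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical init, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwoHeadedModel:
    """Toy model: one shared backbone, two heads on top of it."""

    def __init__(self, n_inputs, n_features, n_classes, seed=0):
        rng = random.Random(seed)
        # Stand-in for the convolutional backbone (assumption: one dense layer).
        self.backbone = init_layer(n_inputs, n_features, rng)
        # Head 1: one score per class.
        self.class_head = init_layer(n_features, n_classes, rng)
        # Head 2: four bounding-box coordinates.
        self.box_head = init_layer(n_features, 4, rng)

    def forward(self, x):
        # The features are computed once and shared by both heads.
        features = relu(linear(x, *self.backbone))
        class_scores = linear(features, *self.class_head)
        box = linear(features, *self.box_head)
        return class_scores, box


model = TwoHeadedModel(n_inputs=8, n_features=16, n_classes=3)
class_scores, box = model.forward([0.5] * 8)
print(len(class_scores), len(box))  # 3 4
```

In a real framework the structure is the same: the forward pass runs the backbone once and hands the result to each head, so the heads add little extra cost compared with training two separate networks.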
The loss that you optimize for during training is usually a weighted sum of the individual losses for each prediction head.
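As a rough sketch of such a weighted sum for the localization example, assume cross-entropy for the classification head and mean squared error for the box head (an assumption for illustration; L1 or smooth L1 losses are also common for box regression). The function names and the weights `w_cls` and `w_box` are hypothetical, and the weights are hyperparameters you would have to tune.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def mean_squared_error(predicted, target):
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def combined_loss(class_scores, true_class, box, true_box, w_cls=1.0, w_box=1.0):
    """Weighted sum of the per-head losses."""
    loss_cls = cross_entropy(class_scores, true_class)
    loss_box = mean_squared_error(box, true_box)
    return w_cls * loss_cls + w_box * loss_box


loss = combined_loss(
    class_scores=[2.0, 0.5, -1.0], true_class=0,
    box=[0.1, 0.2, 0.8, 0.9], true_box=[0.0, 0.25, 0.75, 1.0],
    w_cls=1.0, w_box=0.5,
)
print(loss)
```

Because the gradients of both terms flow back into the shared backbone, the weights control how much each task shapes the shared features.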