YOLO object detection: how does the algorithm predict bounding boxes larger than a grid cell?

Everything outside of the grid cell should be unknown to the neurons predicting the bounding boxes for an object detected in that cell, right?

That's not quite right. The cells correspond to a partition of the image, where each neuron has learned to respond if the center of an object is located within its cell.
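As a minimal sketch of that partition, assuming a 13x13 grid over a 416x416 image (the function name `responsible_cell` and the box format are made up for illustration), the cell that "owns" an object depends only on the object's center, not on its size:

```python
def responsible_cell(box, image_size, grid_size=13):
    """Return (row, col) of the grid cell that contains the box centre.

    box is (x_min, y_min, x_max, y_max) in pixels; image_size is (width, height).
    Hypothetical helper for illustration, not part of any YOLO library.
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Scale the centre to grid units and truncate; clamp so a centre lying
    # exactly on the far edge still maps to the last cell.
    col = min(int(cx / width * grid_size), grid_size - 1)
    row = min(int(cy / height * grid_size), grid_size - 1)
    return row, col


# A 300x200 px object is far larger than one 32x32 px cell,
# yet exactly one cell is responsible for it.
print(responsible_cell((50, 100, 350, 300), (416, 416)))  # (6, 6)
```

The box spans many cells, but only the cell containing its center is trained to respond to it.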

However, the receptive field of those output neurons is much larger than the cell and actually covers the entire image. The network is therefore able to recognize and draw a bounding box around an object much larger than its assigned "center cell".
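To get a feel for how quickly the receptive field outgrows the cell, here is a rough sketch using the standard receptive-field recurrence for stacked convolution and pooling layers. The layer stack below is a toy assumption, not the real YOLO architecture; the real network is deeper (and YOLOv1 ends in fully connected layers), so its output neurons see even more of the image.

```python
def receptive_field(layers):
    """Compute the receptive field of one output unit.

    layers is a list of (kernel_size, stride) pairs, applied in order.
    Returns (receptive_field_px, jump_px), where jump_px is the distance in
    input pixels between neighbouring output units, i.e. the cell size.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # spacing of units, measured in input pixels
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


# Toy stack (an assumption for illustration): five blocks of
# [3x3 conv, 2x2 max-pool with stride 2], followed by three 3x3 convs.
toy_stack = [(3, 1), (2, 2)] * 5 + [(3, 1)] * 3
print(receptive_field(toy_stack))  # (286, 32)
```

Even in this shallow toy stack, each output unit corresponds to a 32 px cell but sees a 286 px window of the input.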

So a cell is centered on the center of the receptive field of the output neuron but is a much smaller part of it. The partition is also somewhat arbitrary: one could imagine, for example, having overlapping cells, in which case you would expect neighboring neurons to fire simultaneously when an object is centered in the overlapping zone of their cells.

YOLO predicts offsets to anchors. The anchors are initialised such that there are 13x13 sets of anchors. (In YOLOv2 each set has k=5 anchors; other YOLO versions use a different k, e.g. YOLOv3 uses 3 per output scale.) The anchors are spread over the image to make sure objects in all parts are detected.

The anchors can have an arbitrary size and aspect ratio, unrelated to the grid size. If your dataset has mostly large foreground objects, then you should initialise your anchors to be large. YOLO learns better if it only has to make small adjustments to the anchors.
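To illustrate why the anchor size, rather than the cell size, sets the scale of the predicted box, here is a small sketch of the box decoding in the style of the YOLOv2 parameterization (sigmoid offsets for the center, exponential scaling of the anchor for width and height). The function name, the anchor values, and the 32 px stride are assumptions for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor, stride=32):
    """Turn raw network outputs into a box in pixels.

    t      : (tx, ty, tw, th) raw outputs for one anchor in one cell
    cell   : (col, row) index of the grid cell
    anchor : (pw, ph) anchor width and height in grid-cell units
    stride : size of one grid cell in pixels (assumed 32 here)
    Returns (centre_x, centre_y, width, height) in pixels.
    """
    tx, ty, tw, th = t
    col, row = cell
    pw, ph = anchor
    # The sigmoid keeps the centre inside the responsible cell...
    bx = (col + sigmoid(tx)) * stride
    by = (row + sigmoid(ty)) * stride
    # ...but width and height scale the anchor and are not tied to the cell.
    bw = pw * math.exp(tw) * stride
    bh = ph * math.exp(th) * stride
    return bx, by, bw, bh


# With all offsets at zero the prediction is simply the anchor itself:
# a 9x5-cell anchor gives a 288x160 px box centred in a 32x32 px cell.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor=(9.0, 5.0)))
# (208.0, 208.0, 288.0, 160.0)
```

Only the center is constrained to the cell. If the anchor is already close to the object's size, the network only has to learn small values of `tw` and `th`.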

Each prediction actually uses information from the whole image. Context from the rest of the image often helps the prediction: for example, black pixels below a vehicle could be either tyres or shadow.

The algorithm doesn't really "know" in which cell the centre of the object is located. But during training we have that information from the ground truth, and we can train it to guess. With enough training, it ends up pretty good at guessing.

The way that works is that the closest anchor to the ground truth is assigned to the object. Other anchors are assigned to other objects or to the background. Anchors assigned to the background are supposed to have a low confidence, while anchors assigned to an object are assessed on the IoU of their bounding boxes. So the training reinforces one anchor to give a high confidence and an accurate bounding box, while the other anchors give a low confidence.

The example in your question doesn't include any predictions with low confidence (probably to keep things simple), but in practice there will be many more low-confidence predictions than high-confidence ones.
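A rough sketch of that assignment step, assuming that "closest" means the anchor whose shape has the highest IoU with the ground-truth box when both are placed at the same center (a common choice, though details differ between YOLO versions). The anchor sizes and helper names are made up for the example.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes sharing the same centre, so only width/height matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def assign_anchor(gt_wh, anchors):
    """Return (index of best-matching anchor, list of IoUs for all anchors)."""
    ious = [shape_iou(gt_wh, anchor) for anchor in anchors]
    best = max(range(len(anchors)), key=lambda i: ious[i])
    return best, ious


# Hypothetical anchors in grid-cell units (k=5) and a 9x5-cell ground-truth box.
anchors = [(1, 1), (2, 4), (4, 2), (6, 6), (10, 5)]
best, ious = assign_anchor((9, 5), anchors)

# Confidence targets for this cell: 1 for the matched anchor, 0 for the rest.
confidence_targets = [1.0 if i == best else 0.0 for i in range(len(anchors))]
print(best, confidence_targets)  # 4 [0.0, 0.0, 0.0, 0.0, 1.0]
```

Every other anchor in every other cell gets a target of 0 unless it is matched to a different object, which is why low-confidence predictions vastly outnumber high-confidence ones.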
