How does shift-and-stitch in a fully convolutional network work?

While this question has been answered, I found an image here that explains shift-and-stitch better. Just imagine your FCN is a single 2x2 max-pooling layer (the numbers represent pixel values, not indices, by the way). The values are max-pooled after each shift, and then we stitch the results back into an image at the original resolution: Shift and Stitch
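
To make the picture concrete, here is a minimal NumPy sketch of that exact setup. The `max_pool_2x2` function stands in for the whole "network" (a downsampling factor of f = 2), and the function names are just my own for illustration: for each of the f² = 4 shifts, the shifted input is pooled, and the coarse output fills the full-resolution pixels congruent to that shift modulo f.

```python
import numpy as np

def max_pool_2x2(x):
    """Stride-2 2x2 max pooling, standing in for the whole FCN."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def shift_and_stitch(image, net=max_pool_2x2, f=2):
    """Run `net` on every (dy, dx)-shifted copy of `image`
    (0 <= dy, dx < f) and interleave the coarse outputs back
    into a full-resolution map."""
    h, w = image.shape
    stitched = np.zeros((h, w), dtype=image.dtype)
    for dy in range(f):
        for dx in range(f):
            # Shift up/left by (dy, dx), padding the far edge.
            shifted = np.pad(image, ((0, dy), (0, dx)), mode='edge')[dy:, dx:]
            coarse = net(shifted)  # shape (h // f, w // f)
            # The pass shifted by (dy, dx) predicts exactly the pixels
            # congruent to (dy, dx) modulo f in the stitched map.
            stitched[dy::f, dx::f] = coarse
    return stitched

image = np.arange(16).reshape(4, 4)   # toy 4x4 "image"
print(shift_and_stitch(image))        # 4x4 stitched output
```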


In an FCN, the final output (by default, without any upsampling tricks) is at a lower resolution than the input. Suppose you have an input image of shape 100x100 and the network produces an output of shape 10x10. Mapping the output directly back to the input resolution will look patchy (even with high-order interpolation).

Now you take the same input, shift it a bit, get the output, and repeat this process multiple times. You end up with a set of output images and a shift vector corresponding to each output. These outputs, together with their shift vectors, can be stitched into a higher-resolution final semantic map, as sketched below.
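
For the 100x100 → 10x10 example above, the downsampling factor is f = 10, so you need f² = 100 forward passes. Here is a minimal sketch of just the stitch step, assuming (my notation) that `outputs` maps each shift vector to the network's coarse output for that shifted input:

```python
import numpy as np

def stitch(outputs, f, height, width):
    """Interleave coarse outputs into one full-resolution map.
    `outputs` maps each shift vector (dy, dx), 0 <= dy, dx < f,
    to the network's (height // f, width // f) output for that shift."""
    full = np.zeros((height, width))
    for (dy, dx), coarse in outputs.items():
        # The pass shifted by (dy, dx) supplies the pixels whose
        # coordinates are congruent to (dy, dx) modulo f.
        full[dy::f, dx::f] = coarse
    return full
```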

One might think of it as taking multiple (shifted) low-resolution photos of an object and stitching them together to get a higher-resolution image.