Why does detach need to be called on the variable in this example?

The top-voted answer is INCORRECT/INCOMPLETE!

Check this: https://github.com/pytorch/examples/issues/116, and have a look at @plopd's answer:

This is not true. Detaching fake from the graph is necessary to avoid forward-passing the noise through G when we actually update the generator. If we do not detach, then, although fake is not needed for gradient update of D, it will still be added to the computational graph and as a consequence of backward pass which clears all the variables in the graph (retain_graph=False by default), fake won't be available when G is updated.

This post also clarifies a lot: https://zhuanlan.zhihu.com/p/43843694 (In Chinese).


ORIGINAL ANSWER (WRONG / INCOMPLETE)

You're right: optimizerD only updates netD, and the gradients on netG are not used before netG.zero_grad() is called, so detaching is not necessary. It just saves time, because you're not computing gradients for the generator.

You're basically also answering your other question yourself: you don't detach fake in the second block because you specifically want to compute gradients on netG to be able to update its parameters.

Note how in the second block real_label is used as the corresponding label for fake, so if the discriminator finds the fake input to be real, the final loss is small, and vice versa, which is precisely what you want for the generator. Not sure if that's what confused you, but it's really the only difference compared to training the discriminator on fake inputs.
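To make the label trick concrete, here is a small sketch of the binary cross-entropy for a single prediction, written in plain Python rather than with nn.BCELoss. The helper name bce and the probabilities 0.9 and 0.1 are made up for illustration; they are not taken from the example being discussed.

```python
import math

def bce(prediction, label):
    """Binary cross-entropy for one prediction strictly between 0 and 1."""
    return -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction))

real_label = 1.0  # the generator step labels fake samples as "real"

# Hypothetical discriminator outputs for a fake sample.
fooled = bce(0.9, real_label)  # D believes the fake is real -> small loss
caught = bce(0.1, real_label)  # D spots the fake -> large loss

print(f"fooled: {fooled:.3f}, caught: {caught:.3f}")
```

The loss is small exactly when the discriminator is fooled, so minimizing it pushes the generator toward samples that D classifies as real.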

EDIT

Please see FatPanda's comment! My original answer is in fact incorrect. PyTorch destroys (parts of) the compute graph when .backward() is called. Without detaching before errD_fake.backward(), the later errG.backward() call would not be able to backprop into the generator, because the required graph is no longer available (unless you specify retain_graph=True). I'm relieved Soumith made the same mistake :D
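A minimal sketch that should reproduce the behaviour, assuming a recent PyTorch version. The tiny nn.Sequential models, the tensor sizes, and the helper name run_step are placeholders invented for this illustration, not the DCGAN networks from the example, and the optimizer steps are left out because they do not affect whether backward succeeds.

```python
import torch
import torch.nn as nn

def run_step(detach_fake):
    """Run a simplified D-then-G pass; return True if errG.backward() succeeds."""
    torch.manual_seed(0)
    # Hypothetical stand-ins for netG and netD.
    netG = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 6))
    netD = nn.Sequential(nn.Linear(6, 1), nn.Sigmoid())
    criterion = nn.BCELoss()

    noise = torch.randn(5, 4)
    fake = netG(noise)  # builds the graph through G exactly once
    fake_label = torch.zeros(5, 1)
    real_label = torch.ones(5, 1)

    # Discriminator pass on the fake batch.
    d_input = fake.detach() if detach_fake else fake
    errD_fake = criterion(netD(d_input), fake_label)
    errD_fake.backward()  # without detach, this also frees G's saved buffers

    # Generator pass: reuses `fake`, so it needs G's graph to still be intact.
    netG.zero_grad()
    errG = criterion(netD(fake), real_label)
    try:
        errG.backward()
    except RuntimeError:
        # Backward through an already-freed graph raises a RuntimeError.
        return False
    return True

print("with detach:   ", run_step(detach_fake=True))
print("without detach:", run_step(detach_fake=False))
```

With detach, the first backward stops at fake and never touches G's part of the graph, so the second backward can still use it. Without detach, the only ways to make the second call work are passing retain_graph=True to the first backward or recomputing fake = netG(noise).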

Tags:

Pytorch