Why does `detach()` need to be called on the variable in this example?
The top voted answer is INCORRECT/INCOMPLETE!
Check this: https://github.com/pytorch/examples/issues/116, and have a look at @plopd's answer:
This is not true. Detaching `fake` from the graph is necessary to avoid forward-passing the noise through G when we actually update the generator. If we do not detach, then, although `fake` is not needed for the gradient update of D, it will still be added to the computational graph, and as a consequence of the `backward` pass, which clears all the variables in the graph (`retain_graph=False` by default), `fake` won't be available when G is updated.
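To make this concrete, here is a minimal, self-contained sketch (tiny stand-in modules, not the actual DCGAN example code) showing that detaching `fake` for the discriminator update keeps the generator's part of the graph alive for the later `errG.backward()` call:

```python
import torch
import torch.nn as nn

# Stand-ins for the real networks, just to demonstrate the graph behaviour.
netG = nn.Linear(4, 4)   # "generator"
netD = nn.Linear(4, 1)   # "discriminator"
criterion = nn.BCEWithLogitsLoss()
optimizerD = torch.optim.SGD(netD.parameters(), lr=0.1)

noise = torch.randn(8, 4)
fake = netG(noise)

# D update: detach so backward() never walks (and frees) G's part of
# the graph; fake's own graph back into netG stays intact.
label = torch.zeros(8, 1)
errD_fake = criterion(netD(fake.detach()), label)
errD_fake.backward()
optimizerD.step()

# G update: reuse the *same* fake tensor. Its graph into netG is still
# alive because the D backward never touched it. Without the earlier
# detach, this backward would raise a "backward through the graph a
# second time" RuntimeError.
label.fill_(1.0)  # real_label: G wants D to call its fakes real
errG = criterion(netD(fake), label)
errG.backward()
assert netG.weight.grad is not None
```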
This post also clarifies a lot: https://zhuanlan.zhihu.com/p/43843694 (In Chinese).
ORIGINAL ANSWER (WRONG / INCOMPLETE)
You're right: `optimizerD` only updates `netD`, and the gradients on `netG` are not used before `netG.zero_grad()` is called, so detaching is not necessary; it just saves time, because you're not computing gradients for the generator.
You're basically also answering your other question yourself: you don't detach `fake` in the second block because you specifically want to compute gradients on `netG` to be able to update its parameters.
Note how in the second block `real_label` is used as the corresponding label for `fake`, so if the discriminator finds the fake input to be real, the final loss is small, and vice versa, which is precisely what you want for the generator. Not sure if that's what confused you, but it's really the only difference compared to training the discriminator on fake inputs.
EDIT
Please see FatPanda's comment! My original answer is in fact incorrect. PyTorch destroys (parts of) the compute graph when `.backward()` is called. Without detaching before `errD_fake.backward()`, the later `errG.backward()` call would not be able to backprop into the generator, because the required graph is no longer available (unless you specify `retain_graph=True`). I'm relieved Soumith made the same mistake :D