Unaggregated gradients / gradients per example in tensorflow
To partly answer my own question after tinkering with this for a while: it appears that it is possible to manipulate gradients per example while still working in batch, by doing the following:
- Create a copy of tf.gradients() that accepts an extra tensor/placeholder with example-specific factors
- Create a copy of _AggregatedGrads() and add a custom aggregation method that uses the example-specific factors
- Call your custom tf.gradients function and give your loss as a list of slices:
    custagg_gradients(
        ys=[cross_entropy[i] for i in xrange(batch_size)],
        xs=variables.trainable_variables(),
        aggregation_method=CUSTOM,
        gradient_factors=gradient_factors
    )
But this will probably have the same complexity as doing individual passes per example, and I need to check if the gradients are correct :-).
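As a correctness check: when the per-example factors do not depend on the trainable variables, the factor-weighted aggregation above should match what you get by simply weighting the per-example losses before differentiating. Below is a minimal reference sketch of that equivalent computation, not the custom-aggregation code itself; the model, cross_entropy and gradient_factors are hypothetical stand-ins:

    import tensorflow as tf

    batch_size = 4
    x = tf.placeholder(tf.float32, [batch_size, 10])
    y = tf.placeholder(tf.int64, [batch_size])
    W = tf.Variable(tf.zeros([10, 3]))
    logits = tf.matmul(x, W)

    # Per-example losses, shape [batch_size]
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=y, logits=logits)
    gradient_factors = tf.placeholder(tf.float32, [batch_size])

    # Weighting each example's loss and differentiating the sum yields the
    # factor-weighted sum of per-example gradients, which is the result the
    # custom aggregation method is meant to produce (assuming the factors
    # are treated as constants with respect to the variables).
    weighted_loss = tf.reduce_sum(gradient_factors * cross_entropy)
    grads = tf.gradients(weighted_loss, tf.trainable_variables())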
tf.gradients returns the gradient of the loss with respect to the variables you pass in. This means that if your loss is a sum of per-example losses, then its gradient is also the sum of the per-example loss gradients.
The summing up is implicit. For instance, if you want to minimize the sum of squared norms of Wx - y errors, the gradient with respect to W is 2(WX - Y)X', where X is the batch of observations and Y is the batch of labels. You never explicitly form "per-example" gradients that you later sum up, so it's not a simple matter of removing some stage in the gradient pipeline.
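A quick way to see the implicit summing is to compare tf.gradients on the summed squared-error loss against the closed form 2(WX - Y)X'. The X, Y and W below are small random stand-ins:

    import numpy as np
    import tensorflow as tf

    X = tf.constant(np.random.randn(3, 5).astype(np.float32))  # one observation per column
    Y = tf.constant(np.random.randn(2, 5).astype(np.float32))  # matching labels
    W = tf.Variable(np.random.randn(2, 3).astype(np.float32))

    residual = tf.matmul(W, X) - Y
    loss = tf.reduce_sum(tf.square(residual))   # sum of per-example squared norms

    grad_autodiff = tf.gradients(loss, [W])[0]  # what tf.gradients returns
    grad_closed_form = 2.0 * tf.matmul(residual, X, transpose_b=True)  # 2(WX - Y)X'

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(np.allclose(*sess.run([grad_autodiff, grad_closed_form])))  # True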
A simple way to get k per-example loss gradients is to use batches of size 1 and do k passes. Ian Goodfellow wrote up how to get all k gradients in a single pass; for that you would need to specify the gradients explicitly rather than relying on the tf.gradients method.
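A minimal sketch of the batches-of-size-1 approach; here x, y_, cross_entropy, xs_batch and ys_batch are hypothetical placeholders and data standing in for your own model:

    import tensorflow as tf

    grad_ops = tf.gradients(cross_entropy, tf.trainable_variables())

    per_example_grads = []
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(k):  # k = number of examples
            # One forward/backward pass per example: feed a length-1 slice of the batch
            grads = sess.run(grad_ops,
                             feed_dict={x: xs_batch[i:i + 1], y_: ys_batch[i:i + 1]})
            per_example_grads.append(grads)

This is simple but slow, since each example costs a full graph evaluation, which is why the single-pass alternatives above are attractive for larger models.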