Derivation of gradient of SVM loss
Let's start with the basics. The so-called gradient is just the ordinary derivative, that is, the slope. For example, the slope of the linear function $y = kx + b$ equals $k$, so its gradient w.r.t. $x$ equals $k$. If $x$ and $k$ are not numbers but vectors, then the gradient is also a vector.
Another piece of good news is that the gradient is a linear operator: you can add functions and multiply them by constants before or after differentiation, and it makes no difference.
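If you want to convince yourself of both facts numerically, here is a tiny sketch (the helper `numeric_derivative` and the concrete numbers are my own illustration, nothing here is specific to SVMs):

```python
def numeric_derivative(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

k, b = 3.0, 1.5
f = lambda x: k * x + b   # the linear function from the text
g = lambda x: x ** 2      # some other differentiable function

x0 = 2.0
print(numeric_derivative(f, x0))                      # ~3.0, i.e. the slope k
print(numeric_derivative(lambda x: f(x) + g(x), x0))  # ~7.0, derivative of the sum
print(numeric_derivative(f, x0) + numeric_derivative(g, x0))  # ~7.0, sum of the derivatives
```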
Now take the definition of the SVM loss function for a single observation, the $i$-th one. It is
$\mathrm{loss} = \max(0,\ \mathrm{something} - w_y x)$
where $\mathrm{something} = w x + \Delta$. Thus, the loss equals $\mathrm{something} - w_y x$ if the latter is positive, and $0$ otherwise.
In the first (positive) case the loss $\mathrm{something} - w_y x$ is linear in $w_y$, so the gradient is just the slope of this function of $w_y$, that is, $-x$.
In the second (non-positive) case the loss is the constant $0$, so its derivative is also $0$. (At the point where $\mathrm{something} - w_y x$ is exactly $0$ the max has a kink and the derivative is not defined; the usual convention is to take it as $0$ there, which is why the strict inequality appears below.)
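Putting the two cases together as one piecewise formula (just a restatement of the above, nothing new):

$$\frac{\partial\,\mathrm{loss}}{\partial w_y} = \begin{cases} -x, & \mathrm{something} - w_y x > 0 \\ 0, & \text{otherwise.} \end{cases}$$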
To write both cases in one equation, we introduce a function called the indicator, $I(\cdot)$, which equals $1$ if its argument is true and $0$ otherwise. With this function, we can write
$\mathrm{derivative} = I(\mathrm{something} - w_y x > 0) \cdot (-x)$
If $\mathrm{something} - w_y x > 0$, the first factor equals $1$ and the gradient equals $-x$. Otherwise, the first factor equals $0$, and so does the gradient. So this is just the two cases rewritten in a single line.
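Here is the same single-observation case as code (a sketch with made-up names and numbers; `w_other` plays the role of the $w$ inside $\mathrm{something}$, and `delta` is $\Delta$):

```python
import numpy as np

def grad_wy_single(w_y, w_other, x, delta=1.0):
    """Gradient of max(0, something - w_y @ x) w.r.t. w_y,
    where something = w_other @ x + delta."""
    something = w_other @ x + delta
    indicator = 1.0 if (something - w_y @ x) > 0 else 0.0  # I(something - w_y x > 0)
    return indicator * (-x)                                # -x if the margin is violated, else zeros

x = np.array([1.0, 2.0, -0.5])
w_y = np.array([0.2, -0.1, 0.3])
w_other = np.array([0.5, 0.4, -0.2])
print(grad_wy_single(w_y, w_other, x))  # here the margin is violated, so this prints -x
```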
Now let's turn from a single $i$-th observation to the whole loss, which is the sum of the individual losses. Because differentiation is linear, the gradient of the sum equals the sum of the gradients, so we can write
$\text{total derivative} = \sum_i I(\mathrm{something}_i - w_y x_i > 0) \cdot (-x_i)$

where $\mathrm{something}_i = w x_i + \Delta$ is the same quantity computed for the $i$-th observation.
Now move the minus sign from $x_i$ out of the sum to the front of the formula, and you get exactly the expression you were asking about.
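And a vectorized sketch of that summed expression, under the same simplified notation (rows of `X` are the observations $x_i$; again, `w_other` and the numbers are only for illustration):

```python
import numpy as np

def grad_wy_total(w_y, w_other, X, delta=1.0):
    """Summed gradient: -sum_i I(something_i - w_y @ x_i > 0) * x_i,
    with something_i = w_other @ x_i + delta and one observation per row of X."""
    margins = X @ w_other + delta - X @ w_y   # something_i - w_y x_i, one value per observation
    mask = (margins > 0).astype(X.dtype)      # the indicator for each observation
    return -(mask[:, None] * X).sum(axis=0)   # the minus sign pulled out to the front

X = np.array([[1.0, 2.0, -0.5],
              [0.0, 1.0,  1.0]])
w_y = np.array([0.2, -0.1, 0.3])
w_other = np.array([0.5, 0.4, -0.2])
print(grad_wy_total(w_y, w_other, X))
```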