Derivative of spectral norm of symmetric matrix

Let $W \in \mathbb{R}^{m\times n}$ and $f(W)=\|W\|_2 = \sigma_1(W)$, where $\sigma_1(W)$ denotes the largest singular value of $W$.

Let $W=U\Sigma V^{T}$ be an SVD of $W$. Then $\|W\|_2=\sigma_1=e_1^{T}\Sigma e_1=e_1^{T}U^{T}(U\Sigma V^{T})V e_1$, so $f(W)=u_1^{T}Wv_1$, where $u_1$ and $v_1$ are the first columns of $U$ and $V$, respectively.

Hence, the derivative can be computed as:

$Df(W)(H)= u_1^{T}Hv_1 = \operatorname{trace}(u_1^{T}Hv_1) = \operatorname{trace}(v_1u_1^{T}H).$

(The terms coming from differentiating $u_1$ and $v_1$ drop out: along a smooth path, $\dot u_1^{T}Wv_1 = \sigma_1\, \dot u_1^{T}u_1 = 0$ and $u_1^{T}W\dot v_1 = \sigma_1\, v_1^{T}\dot v_1 = 0$, since $u_1$ and $v_1$ have constant unit norm.)

Thus, with respect to the Frobenius inner product $\langle A,B\rangle = \operatorname{trace}(A^{T}B)$, the gradient is $\nabla f(W) = u_1v_1^{T}$, which has the same shape as $W$.

P.S.: Note that $\sigma_1 \neq \sigma_2$ is a sufficient condition for the derivative to exist; otherwise the maximization problem defining the induced 2-norm has more than one maximizer.
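
As a quick numerical sanity check (my own addition, not part of the original argument), the following NumPy sketch compares the formula $\nabla f(W) = u_1v_1^{T}$ with central finite differences; the helper names are mine, and a generic random $W$ is used so that $\sigma_1 > \sigma_2$ holds.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 4
W = rng.standard_normal((m, n))          # generic, so sigma_1 > sigma_2 almost surely

spectral_norm = lambda A: np.linalg.norm(A, 2)

# Gradient predicted above: outer product of the top singular vectors.
U, s, Vt = np.linalg.svd(W)
grad = np.outer(U[:, 0], Vt[0, :])

# Central finite differences, entry by entry.
eps = 1e-6
fd = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(W)
        E[i, j] = 1.0
        fd[i, j] = (spectral_norm(W + eps * E) - spectral_norm(W - eps * E)) / (2 * eps)

print(np.max(np.abs(fd - grad)))         # small (finite-difference accuracy)
```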


Here is an approach based on the implicit function theorem, similar to loup blanc's, with a connection to my other answer.

Let $W=U \Sigma V^T$ be an SVD of $W$, where $\Sigma = \operatorname{diag}(\sigma_1,\dots,\sigma_n)$ with $\sigma_k \ge \sigma_{k+1}$. Then $\|W\|_2 = \sigma_1 = \sqrt{\lambda_{\max}(W^TW)}$.

Let $\eta(x,A) = \det(x I -A^TA)$; then $\eta(x,A)=0$ iff $\sqrt{x}$ is a singular value of $A$. Note that $\eta$ is a polynomial in $x$ and the entries of $A$, hence so are its various partial derivatives.
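
For concreteness, here is a small NumPy illustration (mine, with hypothetical names) that $\eta$ vanishes exactly at the squared singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

def eta(x, A):
    # eta(x, A) = det(x I - A^T A); its roots in x are the squared singular values of A.
    k = A.shape[1]
    return np.linalg.det(x * np.eye(k) - A.T @ A)

sigma = np.linalg.svd(A, compute_uv=False)
print([eta(s**2, A) for s in sigma])     # all ~0 (up to round-off)
print(eta(sigma[0]**2 + 1.0, A))         # nonzero away from the roots
```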

Note that ${\partial \det(A) \over \partial A} (\Delta) = \operatorname{tr} ((\operatorname{adj} A) \Delta)$, and if $A$ is invertible, we have ${\partial \det(A) \over \partial A} (\Delta) = \det A \operatorname{tr} (A^{-1} \Delta ) $. The latter form is more convenient to work with.
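
As a sanity check of this determinant derivative (again my own sketch, assuming a generic, hence invertible, random $A$):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))          # invertible almost surely
Delta = rng.standard_normal((4, 4))

# Directional derivative of det at A in direction Delta, using adj(A) = det(A) inv(A).
analytic = np.linalg.det(A) * np.trace(np.linalg.inv(A) @ Delta)

eps = 1e-6
fd = (np.linalg.det(A + eps * Delta) - np.linalg.det(A - eps * Delta)) / (2 * eps)
print(analytic, fd)                      # agree to several digits
```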

If $\sqrt{x}$ is not a singular value of $W$, we have ${ \partial \eta(x, W) \over \partial x} = \det(x I -W^T W) \operatorname{tr} ((x I -W^TW)^{-1} )$. Using the SVD expansion (and continuity), we have ${ \partial \eta(x, W) \over \partial x} = \sum_k \prod_{i\neq k} (x-\sigma_i^2)$ for all $x$.
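
The product form of $\partial \eta / \partial x$ can also be checked numerically; the sketch below (mine, using NumPy, with $m \ge n$ so that $W^TW$ has eigenvalues $\sigma_1^2,\dots,\sigma_n^2$) compares it with a finite difference in $x$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 3                              # m >= n, so W^T W has eigenvalues sigma_1^2, ..., sigma_n^2
W = rng.standard_normal((m, n))
sig2 = np.linalg.svd(W, compute_uv=False) ** 2

def eta(x):
    return np.linalg.det(x * np.eye(n) - W.T @ W)

def deta_dx(x):
    # sum_k prod_{i != k} (x - sigma_i^2)
    return sum(np.prod([x - s for i, s in enumerate(sig2) if i != k]) for k in range(n))

x0, eps = 0.7, 1e-6
print((eta(x0 + eps) - eta(x0 - eps)) / (2 * eps), deta_dx(x0))   # agree
```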

Hence we see that if $\sigma_1 > \sigma_2$ (which also implies $\sigma_1 >0$), then ${ \partial \eta(\sigma_1^2, W) \over \partial x} = \prod_{i\neq 1} (\sigma_1^2-\sigma_i^2) \neq 0$ (only the $k=1$ term survives), and the implicit function theorem gives a differentiable function $\xi$, defined in a neighbourhood of $W$, such that $\xi(W) = \sigma_1^2$ and $\eta(\xi(W'),W') = 0$ for $W'$ near $W$. Hence $\sqrt{\xi(W')}$ is a singular value of $W'$, and continuity of the roots of $x \mapsto \eta(x,W')$ as a function of $W'$ shows that $\|W'\|_2 = \sqrt{\xi(W')}$.

To compute the derivative of $\xi$, we need ${ \partial \eta(\sigma_1^2, W) \over \partial A}(H)$. For $x \neq \sigma_1^2$ near $\sigma_1^2$ we have \begin{eqnarray} { \partial \eta(x, W) \over \partial A}(H) &=& -\det(x I -W^T W) \operatorname{tr} ((x I -W^T W)^{-1} (W^TH+H^TW) ) \\ &=& -2 \det(x I -W^T W) \operatorname{tr} ((x I -W^T W)^{-1} W^TH ) \end{eqnarray} and so, since the implicit function theorem gives ${ \partial \xi(W) \over \partial W}(H) = - \left({ \partial \eta(\sigma_1^2, W) \over \partial A}(H)\right) \Big/ \left({ \partial \eta(\sigma_1^2, W) \over \partial x}\right)$, and the displayed expressions are valid only where $x I -W^T W$ is invertible, we evaluate this ratio as a limit (the determinant factors cancel) and expand using the SVD: \begin{eqnarray} { \partial \xi(W) \over \partial W}(H) &=& \lim_{x \to \sigma_1^2} 2 { \operatorname{tr} ((x I -W^T W)^{-1} W^TH ) \over \operatorname{tr} ((x I -W^T W)^{-1} ) } \\ &=& 2 \lim_{x \to \sigma_1^2} { \operatorname{tr} ((x I -\Sigma^2)^{-1} \Sigma U^T H V ) \over \operatorname{tr} ((x I -\Sigma^2)^{-1} ) } \\ &=& 2 \lim_{x \to \sigma_1^2} { \sum_k { \sigma_k \over x-\sigma_k^2 } [U^THV]_{kk} \over \sum_k { 1 \over x-\sigma_k^2 } } \\ &=& 2 \lim_{x \to \sigma_1^2} { \sum_k (x-\sigma_1^2) { \sigma_k \over x-\sigma_k^2 } [U^THV]_{kk} \over \sum_k (x-\sigma_1^2) { 1 \over x-\sigma_k^2 } } \\ &=& 2 \sigma_1 [U^THV]_{11} \\ &=& 2 \sigma_1 \langle u , H v \rangle \end{eqnarray} where $u = U e_1, v=V e_1$ are left and right singular vectors of $W$ corresponding to the singular value $\sigma_1$.

Finally, since $n(W') = \|W'\|_2 = \sqrt{\xi(W')}$ for $W'$ near $W$, we have ${\partial n(W) \over \partial W}(H) = {1 \over 2 \sqrt{ \xi(W)}} {\partial \xi(W) \over \partial W }(H) = \langle u , H v \rangle$, as in the first approach above.

The above works for any matrix $W$ as long as $\sigma_1 > \sigma_2$.
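
To make this concrete, here is a finite-difference check (a NumPy sketch of my own, with a generic non-symmetric $W$ so that $\sigma_1 > \sigma_2$) of both $D\xi(W)(H) = 2\sigma_1\langle u, Hv\rangle$ and $Dn(W)(H) = \langle u, Hv\rangle$:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((5, 4))          # generic, so sigma_1 > sigma_2
H = rng.standard_normal((5, 4))          # arbitrary direction

U, s, Vt = np.linalg.svd(W)
u, v, s1 = U[:, 0], Vt[0, :], s[0]

xi  = lambda A: np.linalg.norm(A, 2) ** 2    # xi(W) = sigma_1(W)^2
nrm = lambda A: np.linalg.norm(A, 2)

eps = 1e-6
fd_xi = (xi(W + eps * H) - xi(W - eps * H)) / (2 * eps)
fd_n  = (nrm(W + eps * H) - nrm(W - eps * H)) / (2 * eps)

print(fd_xi, 2 * s1 * u @ H @ v)         # D xi(W)(H) = 2 sigma_1 <u, H v>
print(fd_n,  u @ H @ v)                  # D n(W)(H)  =         <u, H v>
```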

If $W$ is symmetric, then we can write the spectral decomposition $W=U \Lambda U^T$, so that $W^T W = U \Lambda^2 U^T$ and the singular values of $W$ are the absolute values of its eigenvalues. Hence $\|W\|_2$ is the absolute value of the eigenvalue of largest absolute value, and the condition $\sigma_1 > \sigma_2$ corresponds to requiring $|\lambda_1| > |\lambda_k|$ for $k >1$ (ordering the eigenvalues by decreasing absolute value). In this case ${\partial n(W) \over \partial W}(H) = (\operatorname{sgn} \lambda_1) \langle u , H u \rangle$, where $u$ is a unit eigenvector corresponding to $\lambda_1$; the sign appears because the right singular vector corresponding to $\sigma_1 = |\lambda_1|$ can be taken to be $u$, and the corresponding left singular vector is then $(\operatorname{sgn} \lambda_1)\, u$.

We have, of course, ${\partial n(W) \over \partial W_{ij}} = {\partial n(W) \over \partial W}(E_{ij}) = (\operatorname{sgn} \lambda_1)\, [u]_i [u]_j$.
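
A numerical illustration of the symmetric case (my own NumPy sketch), deliberately built with a dominant negative eigenvalue so that the $\operatorname{sgn} \lambda_1$ factor matters:

```python
import numpy as np

rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
W = Q @ np.diag([-2.0, 1.0, 0.5]) @ Q.T      # symmetric, dominant eigenvalue is negative

# Unit eigenvector for the eigenvalue of largest absolute value.
w, V = np.linalg.eigh(W)
k = np.argmax(np.abs(w))
u, lam1 = V[:, k], w[k]

grad = np.sign(lam1) * np.outer(u, u)        # sgn(lambda_1) [u]_i [u]_j

eps = 1e-6
fd = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(W)
        E[i, j] = 1.0                        # W_ij treated as an unconstrained entry
        fd[i, j] = (np.linalg.norm(W + eps * E, 2) - np.linalg.norm(W - eps * E, 2)) / (2 * eps)

print(np.max(np.abs(fd - grad)))             # small; without the sign factor the check fails
```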


Another approach is to use subgradients. This approach has the added advantage of characterizing points at which the norm is differentiable in general, not just for symmetric matrices.

Let $n(W) = \|W\|_2$, and note that $n(W) = \sigma_1(W)$, the maximum singular value of $W$.

The relevant result is:

$n$ is differentiable at $W$ iff both of the subspaces of left and right singular vectors of $W$ corresponding to $\sigma_1(W)$ have dimension one.

A consequence of this condition is that if $u,v$ are of unit norm and $\|W\|_2 = \langle u, W v \rangle$, then $u,v$ are left and right singular vectors of $W$ corresponding to the maximum singular value of $W$, and furthermore the pair $(u,v)$ is unique modulo a simultaneous change of sign. This follows from the singular value decomposition.

If $W$ is symmetric, this condition can be stated in terms of eigenvalues: ordering the eigenvalues of $W$ by decreasing absolute value, $n$ is differentiable at $W$ iff $|\lambda_1| > |\lambda_k|$ for all $k>1$ (that is, the eigenvalue of maximum absolute value is unique).

(In particular, it is not sufficient to assume that the maximum eigenvalue is unique: for $W=\operatorname{diag}(1,-1)$ the maximum eigenvalue $1$ is unique, yet $|\lambda_1| = |\lambda_2|$ and $n$ is not differentiable at $W$.)
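
To see the non-differentiability in this example numerically (my own sketch): for $W=\operatorname{diag}(1,-1)$ and $H=I$, the one-sided difference quotients of $t \mapsto \|W+tH\|_2$ disagree at $t=0$:

```python
import numpy as np

W = np.diag([1.0, -1.0])    # maximum eigenvalue (1) is unique, but |lambda_1| = |lambda_2|
H = np.eye(2)

for t in [-1e-3, -1e-6, 1e-6, 1e-3]:
    # difference quotient of t -> ||W + t H||_2 at t = 0
    print(t, (np.linalg.norm(W + t * H, 2) - np.linalg.norm(W, 2)) / t)
# Left and right limits differ (-1 vs +1), so n is not differentiable at W.
```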

A consequence of this condition is that if $u,v$ are of unit norm and $\|W\|_2 = \langle u, W v \rangle$, then $v$ is a unit eigenvector of $W$ corresponding to $\lambda_1$ and $u=(\operatorname{sgn} \lambda_1) v$ (and $v$ is unique modulo sign).

Returning to the general case:

Note that $n(W) = \max_{(u,v) \in B \times B} \phi(W, (u,v))$, where $\phi(W,(u,v)) = \langle u, W v \rangle$ and $B$ is the closed unit ball (so $B \times B$ is compact). For each $(u,v)$, the function $W \mapsto \phi(W, (u,v))$ is convex (in fact, linear), hence $n$ is convex, and the subdifferential $\partial n(W)$ is nonempty at each $W$.
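
As a spot check of this max formulation (my own NumPy sketch; the random sampling is only illustrative), the value $\langle u, Wv\rangle$ at the top singular pair equals $\|W\|_2$ and dominates random unit pairs:

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(W)

# phi(W,(u,v)) = <u, W v>; the maximum over unit u, v is attained at the top singular pair.
best = U[:, 0] @ W @ Vt[0, :]

vals = []
for _ in range(10000):
    u = rng.standard_normal(4); u /= np.linalg.norm(u)
    v = rng.standard_normal(3); v /= np.linalg.norm(v)
    vals.append(u @ W @ v)

print(np.linalg.norm(W, 2), best, max(vals))   # norm == best >= max over random samples
```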

Since $n$ is convex and finite valued, $n$ is Lipschitz and regular (the directional derivatives and generalised directional derivatives coincide).

Note that ${\partial \phi(W,(u,v)) \over \partial W}(H) = \langle u, H v \rangle$.

Let $I(W) = \{(u,v) \in B \times B \, | \, \|W\|_2 = \phi(W, (u,v)) \}$. Note that $(u,v) \in I(W)$ iff $u,v$ are left and right singular vectors of $W$ corresponding to the maximum singular value of $W$ (that is, $\sigma_1(W) = \|W\|_2$).

Then Corollary 2 in Remark 2.8.3 of Clarke's "Optimization and Nonsmooth Analysis" gives $\partial n(W) = \operatorname{co}\{ {\partial \phi(W,(u,v)) \over \partial W}\}_{(u,v) \in I(W)}$. (This is true whether or not $W$ is symmetric.)

Proposition 2.2.4 shows that if $\partial n(W)$ is a singleton, then $n$ is (strictly) differentiable at $W$. Furthermore, Proposition 2.3.6 shows that if $n$ is differentiable at $W$, then $\partial n(W)$ is a singleton.

It should be clear that $\partial n(W)$ is a singleton iff $I(W)$ contains exactly two points (which are symmetric about the origin) iff both of the subspaces of left and right singular vectors of $W$ corresponding to $\sigma_1(W)$ have dimension one. When $n$ is differentiable, we have ${\partial n(W) \over \partial W}(H) = \langle u, H v \rangle$, where $u,v$ are left and right singular vectors corresponding to the maximum singular value of $W$.
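
At a point of non-differentiability, the convex-hull description can be probed numerically. The sketch below (mine, using NumPy) takes a $W$ with a repeated top singular value and checks the subgradient inequality $n(W') \ge n(W) + \langle uv^T, W'-W\rangle$ for various elements of $I(W)$:

```python
import numpy as np

rng = np.random.default_rng(7)
W = np.diag([1.0, 1.0, 0.3])     # top singular value 1 is repeated: n is NOT differentiable here

def subgrad(theta):
    # An element of I(W): v in the top right-singular subspace, u = W v / sigma_1 (= v here).
    v = np.array([np.cos(theta), np.sin(theta), 0.0])
    return np.outer(v, v)        # u v^T

# Check the subgradient inequality n(W') >= n(W) + <G, W' - W> for random W'.
nW = np.linalg.norm(W, 2)
worst = np.inf
for _ in range(5000):
    G = subgrad(rng.uniform(0, 2 * np.pi))
    Wp = W + 0.5 * rng.standard_normal((3, 3))
    gap = np.linalg.norm(Wp, 2) - nW - np.sum(G * (Wp - W))
    worst = min(worst, gap)
print(worst)                     # >= 0 (up to round-off): every such G is a subgradient
```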

If $W$ is symmetric, then $\partial n(W)$ is a singleton iff the eigenvalue of maximum absolute value is unique. When $n$ is differentiable, then ${\partial n(W) \over \partial W}(H) = (\operatorname{sgn} \lambda_1) \langle u, H u \rangle$, where $u$ is a unit eigenvector corresponding to the (unique) maximum absolute value eigenvalue $\lambda_1$. (The corresponding gradient with respect to the Frobenius inner product on the space of matrices is $(\operatorname{sgn} \lambda_1)\, u u^T$.)

Aside: Since $n$ is locally Lipschitz, Rademacher's theorem shows that $n$ is differentiable almost everywhere (Lebesgue). In practice, however, $W$ often has some restricted structure, so this result is of less utility than it might seem at first glance.

We have, of course, ${\partial n(W) \over \partial W_{ij}} = {\partial n(W) \over \partial W}(E_{ij}) = (\operatorname{sgn} \lambda_1)\, [u]_i [u]_j$.