The Sub Gradient and the Proximal Operator of the $ {L}_{2, 1} $ Norm (Mixed Norm)

In the following I will use MATLAB's notation, where the : operator selects a column of a matrix.

Sub Gradient of $ {L}_{2, 1} $ Mixed Norm

$$ f \left( X \right) = {\left\| A X \right\|}_{2, 1} = \sum_{i} {\left\| A {X}_{:, i} \right\|}_{2} $$

Now, for a vector $ x $ the gradient is given by:

$$ \frac{\mathrm{d} {\left\| A x \right\|}_{2} }{\mathrm{d} x} = \frac{ {A}^{T} A x }{ {\left\| A x \right\|}_{2} } $$

Which implies:

$$\begin{align*} \frac{\mathrm{d} {\left\| A X \right\|}_{2, 1} }{\mathrm{d} X} & = \frac{\mathrm{d} \sum_{i} {\left\| A {X}_{:, i} \right\|}_{2} }{\mathrm{d} X} && \text{} \\ & = \sum_{i} \frac{\mathrm{d} {\left\| A {X}_{:, i} \right\|}_{2} }{\mathrm{d} {X}_{:, i}} \boldsymbol{e}_{i}^{T} && \text{Where $ \boldsymbol{e}_{i} $ is the standard $ i $-th basis vector, since column $ i $ appears only in the $ i $-th summand} \\ & = \sum_{i} \frac{ {A}^{T} A {X}_{:, i} }{ {\left\| A {X}_{:, i} \right\|}_{2} } \boldsymbol{e}_{i}^{T} && \text{} \\ & = {A}^{T} A X D \end{align*}$$

Where

$$ D = \operatorname{diag} \left\{ {d}_{1}, {d}_{2}, \ldots, {d}_{n} \right\}, \; {d}_{i} = \begin{cases} 0 & \text{ if } {\left\| A {X}_{:, i} \right\|}_{2} = 0 \\ \frac{1}{{\left\| A {X}_{:, i} \right\|}_{2}} & \text{ if } {\left\| A {X}_{:, i} \right\|}_{2} \neq 0 \end{cases} $$

Remark
For a column of zeros in $ X $ the Sub Gradient of the $ {L}_{2} $ norm of that vector is any vector with $ {L}_{2} $ norm less than or equal to one. In the case above it was chosen to be the zero vector, which indeed has norm less than 1.
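
To make the expression concrete, here is a minimal MATLAB sketch (my own, not the repository code) that builds $ {A}^{T} A X D $ and compares it against a central finite difference of $ f \left( X \right) $ at a point where no column of $ A X $ vanishes:

```matlab
% Minimal sketch: analytic (sub) gradient A' * A * X * D vs. finite differences.
m = 6; n = 4; p = 5;
A = randn(p, m);
X = randn(m, n);

colNorms = sqrt(sum((A * X) .^ 2, 1));      % || A * X(:, i) ||_2 for each column
d        = zeros(1, n);
d(colNorms > 0) = 1 ./ colNorms(colNorms > 0);
G        = A' * A * X * diag(d);            % Analytic (sub) gradient

% Central finite differences of f(X) = sum_i || A * X(:, i) ||_2
f  = @(X) sum(sqrt(sum((A * X) .^ 2, 1)));
h  = 1e-6;
Gn = zeros(m, n);
for ii = 1:numel(X)
    E      = zeros(m, n);
    E(ii)  = h;
    Gn(ii) = (f(X + E) - f(X - E)) / (2 * h);
end

disp(norm(G - Gn, 'fro')); % Should be small (on the order of the step size)
```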

Prox of $ {L}_{2, 1} $ Mixed Norm

The problem is given by:

$$ \arg \min_{X} \frac{1}{2} {\left\| X - Y \right\|}_{F}^{2} + \lambda {\left\| X \right\|}_{2, 1} $$

Where $ X, Y \in \mathbb{R}^{m \times n} $.

Again, this can be decomposed into working on each column of $ X $ separately:

$$\begin{aligned} \arg \min_{X} \frac{1}{2} {\left\| X - Y \right\|}_{F}^{2} + \lambda {\left\| X \right\|}_{2, 1} & = \arg \min_{X} \sum_{i} \left( \frac{1}{2} {\left\| {X}_{:, i} - {Y}_{:, i} \right\|}_{2}^{2} + \lambda {\left\| {X}_{:, i} \right\|}_{2} \right) && \text{} \\ & = \arg \min_{X} \left( \frac{1}{2} {\left\| {X}_{:, 1} - {Y}_{:, 1} \right\|}_{2}^{2} + \lambda {\left\| {X}_{:, 1} \right\|}_{2} \right) && \\ & \quad + \left( \frac{1}{2} {\left\| {X}_{:, 2} - {Y}_{:, 2} \right\|}_{2}^{2} + \lambda {\left\| {X}_{:, 2} \right\|}_{2} \right) && \\ & \quad + \cdots && \\ & \quad + \left( \frac{1}{2} {\left\| {X}_{:, n} - {Y}_{:, n} \right\|}_{2}^{2} + \lambda {\left\| {X}_{:, n} \right\|}_{2} \right) \end{aligned}$$

Each term in the brackets is an independent Prox problem of the $ {L}_{2} $ norm.
Hence the solution is given by:

$$ \hat{X} = \arg \min_{X} \frac{1}{2} {\left\| X - Y \right\|}_{F}^{2} + \lambda {\left\| X \right\|}_{2, 1} $$

Where $ \hat{X}_{:, i} = {Y}_{:, i} \left( 1 - \frac{\lambda}{\max \left( {\left\| {Y}_{:, i} \right\|}_{2} , \lambda \right)} \right) $
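
In MATLAB this closed form amounts to a column wise shrinkage; a minimal sketch (the variable names are my own) might look like:

```matlab
% Column wise shrinkage implementing the closed form prox above.
% Y is an m x n matrix, lambda a positive scalar.
colNorms = sqrt(sum(Y .^ 2, 1));                 % || Y(:, i) ||_2 for each column
shrink   = 1 - lambda ./ max(colNorms, lambda);  % Vanishes when the norm is below lambda
Xhat     = Y .* shrink;                          % Implicit expansion over the rows
```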

MATLAB Code

I implemented MATLAB code to verify the results against a numerical derivative (finite differences) and CVX (as a reference for the Prox).
The full code is in my StackExchange Mathematics Q3307741 GitHub Repository.
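
As a rough outline (not the repository code, and assuming CVX is installed and on the path), the prox verification could look along these lines:

```matlab
% Rough sketch of a prox verification against CVX (assumes CVX is available).
m = 5; n = 4; lambda = 0.7;
Y = randn(m, n);

% Closed form solution (column wise shrinkage).
colNorms = sqrt(sum(Y .^ 2, 1));
Xprox    = Y .* (1 - lambda ./ max(colNorms, lambda));

% CVX reference solution of the same problem.
cvx_begin quiet
    variable X(m, n)
    minimize( 0.5 * square_pos(norm(X - Y, 'fro')) + lambda * sum(norms(X, 2, 1)) )
cvx_end

disp(norm(Xprox - X, 'fro')); % Should be small (up to the solver tolerance)
```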


Let $x_j$ be the $j$th column of $X$. (So $\|X\|_{2,1} = \sum_{j=1}^n \| x_j \|_2$.) Then the separable sum rule for proximal operators tells us that $\text{prox}_{\lambda\| \cdot \|_{2,1}}(X)$ is the $m \times n$ matrix whose $j$th column is $\text{prox}_{\lambda\| \cdot \|_2}(x_j)$.