Derivative of $\sqrt{AA^T}$ with respect to $A$
Define the matrix $F$ such that $$F = \sqrt{AA^T} \;\implies F^2 = AA^T $$ The vec operation can be used to flatten these matrices into vectors. $$a={\rm vec}(A),\quad f={\rm vec}(F)$$ The requested gradient can then be calculated as follows. $$\eqalign{ F\,F &= AA^T \cr F\,dF\,(I)+(I)\,dF\,F &= A\,dA^T\,(I)+(I)\,dA\,A^T \cr (I^T\otimes F+F^T\otimes I)\,{\rm vec}(dF) &= (I^T\otimes A)\,{\rm vec}(dA^T)+(A\otimes I)\,{\rm vec}(dA) \cr \Big(I\otimes F+F\otimes I\Big)\,df &= \Big((I\otimes A)K+(A\otimes I)\Big)\,da \cr \frac{\partial f}{\partial a} &= \Big(I\otimes F+F\otimes I\Big)^+ \Big((I\otimes A)K+(A\otimes I)\Big) \cr }$$ where $M^+$ denotes the pseudoinverse of $M$, $I$ is the identity matrix, and $K$ is the commutation matrix, which satisfies $K\,{\rm vec}(dA)={\rm vec}(dA^T)$. The solution also takes advantage of the fact that $I$ and $F$ are symmetric.
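As a quick numerical sanity check of this formula, here is a minimal sketch (not part of the original answer) assuming numpy/scipy, a column-major vec (numpy's `order="F"`), and a small helper `commutation_matrix` written just for this illustration; the last line compares one directional derivative against a finite difference.

    import numpy as np
    from scipy.linalg import sqrtm

    def commutation_matrix(m, n):
        """K with K @ vec(A) = vec(A.T) for A of shape (m, n), column-major vec."""
        K = np.zeros((m * n, m * n))
        for i in range(m):
            for j in range(n):
                K[j + i * n, i + j * m] = 1.0
        return K

    def vec(M):
        return M.reshape(-1, order="F")   # column-major flattening

    m, n = 3, 4
    rng = np.random.default_rng(0)
    A = rng.standard_normal((m, n))        # m <= n, so A @ A.T is generically full rank
    F = sqrtm(A @ A.T)
    I = np.eye(m)

    grad = np.linalg.pinv(np.kron(I, F) + np.kron(F, I)) @ (
        np.kron(I, A) @ commutation_matrix(m, n) + np.kron(A, I)
    )                                      # shape (m*m, m*n), the flattened gradient

    # finite-difference check of one directional derivative
    eps = 1e-6
    V = rng.standard_normal((m, n))
    fd = vec(sqrtm((A + eps * V) @ (A + eps * V).T) - F) / eps
    print(np.max(np.abs(fd - grad @ vec(V))))   # small, on the order of eps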
The true gradient is a fourth-order tensor, while the above result is a flattened version of it. If you want the full tensor result, the components are not too difficult to calculate, since there is a one-to-one mapping between the elements of the tensor and the elements of the flattened matrix.
Upon re-reading the question, it looks like you are interested in the directional derivative, where changes in $A$ are restricted to the matrix direction $V$. $$\eqalign{ v &= {\rm vec}(V) \\ df_v &= \bigg(\frac{\partial f}{\partial a}\bigg)\cdot v \\ }$$ Once again, unflattening this vector into a matrix is just a one-to-one mapping, i.e. $$\eqalign{ F &\in {\mathbb R}^{m\times n} \implies f \in {\mathbb R}^{mn\times 1} \\ F_{ij} &= f_{\alpha} \\ \alpha &= i+(j-1)\,m \\ i &= 1+(\alpha-1)\,{\rm mod}\,m \\ j &= 1+(\alpha-1)\,{\rm div}\,m \\ }$$
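A small illustration of this index bookkeeping, assuming numpy's column-major ("F") ordering so that it matches the vec convention above (code indices are 0-based, while the formulas are 1-based):

    import numpy as np

    m, n = 3, 4
    F = np.arange(1, m * n + 1).reshape(m, n, order="F")   # any m x n matrix
    f = F.reshape(-1, order="F")                           # f = vec(F)

    i, j = 2, 1                 # 0-based; the formulas above use 1-based indices
    alpha = i + j * m           # 0-based analogue of alpha = i + (j-1)*m
    assert F[i, j] == f[alpha]
    assert np.array_equal(f.reshape(m, n, order="F"), F)   # unflattening recovers F

    # a directional derivative df_v is unflattened the same way, e.g.
    # dF_V = (grad @ vec(V)).reshape(..., order="F") for the gradient sketched above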
The biggest challenge with this derivative is that matrices do not necessarily commute. This post should give you a taste for the difficulties one may encounter.
Nevertheless, let's give it a try. But first, let's generalize the problem a bit by considering the map
$$ f\colon\mathbb S^n_+\longrightarrow\mathbb S^n_+,\, X\longmapsto X^{\alpha} \overset{\text{def}}{=} \exp(\alpha\log(X))$$
which maps a symmetric, positive definite matrix to its matrix power, defined via matrix exponential and matrix logarithm. One can easily check that $\sqrt{X} = X^{1/2}$ by diagonalizing.
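As a quick numerical illustration of this definition (a sketch, not part of the original answer), one can compare $\exp(\tfrac12\log X)$ against the square root obtained by diagonalizing a randomly generated SPD matrix:

    import numpy as np
    from scipy.linalg import expm, logm

    rng = np.random.default_rng(1)
    B = rng.standard_normal((5, 5))
    X = B @ B.T + 5 * np.eye(5)                 # symmetric positive definite

    root_via_explog = expm(0.5 * logm(X))       # X^{1/2} per the definition above

    w, U = np.linalg.eigh(X)                    # X = U diag(w) U^T
    root_via_eigh = U @ np.diag(np.sqrt(w)) @ U.T

    print(np.max(np.abs(root_via_explog - root_via_eigh)))        # near machine precision
    print(np.max(np.abs(root_via_explog @ root_via_explog - X)))  # near machine precision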
1. The commutative case
If $\Delta X$ commutes with $X$, then everything is easy, because then $\log(X\cdot\Delta X)=\log(X) + \log(\Delta X)$ and $\exp(X+\Delta X) = \exp(X)\cdot\exp(\Delta X)$, which do not hold true in the general case.
In the commuting case we have
$$\begin{aligned} \log(X+\Delta X)-\log(X) &= \log(X+\Delta X) + \log(X^{-1}) \\&= \log((X+\Delta X)X^{-1}) \\&= \log(I+X^{-1}\Delta X) \\&= \sum_{k=1}^{\infty}(-1)^{k+1} \frac{(X^{-1}\Delta X)^{k}}{k} \sim X^{-1}\Delta X \\ \exp(X+\Delta X) - \exp(X) &= \exp(X)\exp(\Delta X) - \exp(X) \\&= \exp(X)\big(\exp(\Delta X) -I\big)\\ &= \exp(X)\sum_{k=1}^{\infty} \frac{1}{k!}(\Delta X)^k \sim \exp(X)\Delta X \end{aligned}$$ Hence $\partial_{\Delta X}\log(X) = X^{-1}$ and $\partial_{\Delta X} \exp(X) = \exp(X)$ in the commuting case $[X, \Delta X]=0$. In particular, via the chain rule it follows that $\partial_{\Delta X} X^\alpha = \alpha X^{\alpha-1}$, which matches our expectation from the scalar case.
2. The non-commutative case
This one is much harder; one can try to use the Baker-Campbell-Hausdorff formula and/or the Zassenhaus formula, but the end result won't be pretty. In general, the nice formula from before no longer holds. For example, if $\alpha=2$ then one can easily check that
$$(X+\Delta X)^2-X^2 = X\cdot\Delta X + \Delta X \cdot X + \Delta X^2 \sim X\cdot\Delta X + \Delta X \cdot X \neq 2X\cdot\Delta X$$
We can verify this numerically as well, for example with the Python (3.7) program below. In the commutative case the residuals remain stable when you increase the sample size $N$, but in the general case they keep growing with $N$ (randomly sampled matrices with an extremely large commutator are rather rare, so it takes more samples to hit one...).
import numpy as np
from scipy.linalg import norm, fractional_matrix_power as powm
from scipy.stats import ortho_group, wishart  # to sample orthogonal / SPD matrices

alphas = np.linspace(0.5, 5, 10)
N = 100       # sample size
eps = 10**-8  # perturbation size
n = 6         # matrix size

print("Commutative case")
# simultaneously diagonalizable => commuting
for a in alphas:
    r = 0
    for _ in range(N):
        D = np.diag(np.random.rand(n))
        S = np.diag(np.random.rand(n))
        U = ortho_group.rvs(n)
        X = U.T @ D @ U               # SPD
        DX = eps * U.T @ S @ U        # commutes with X
        num = powm(X + DX, a) - powm(X, a)  # numerical derivative
        ana = a * powm(X, a - 1) @ DX       # formula
        r = max(r, norm(num - ana))
    print(f"alpha: {a:.2f}, max residual {r}")
# residuals should be numerically close to zero

print("General case")
for a in alphas:
    r = 0
    for _ in range(N):
        X = wishart(scale=np.eye(n)).rvs()          # random SPD matrix
        DX = eps * wishart(scale=np.eye(n)).rvs()   # generally does not commute with X
        num = powm(X + DX, a) - powm(X, a)  # numerical derivative
        ana = a * powm(X, a - 1) @ DX       # formula
        r = max(r, norm(num - ana))
    print(f"alpha: {a:.2f}, max residual {r}")
# residuals should be much larger
For starters, $$\frac{\partial}{\partial A_{ij}}(AA^T)_{kl}=\frac{\partial}{\partial A_{ij}}(A_{km}A_{lm})=\delta_{ik}\delta_{jm}A_{lm}+A_{km}\delta_{il}\delta_{jm}=\delta_{ik}A_{lj}+\delta_{il}A_{kj},$$ with summation over repeated indices implied. Next, let $X:=\sqrt{AA^T}$, so that $(AA^T)_{kl}=X_{kp}X_{pl}$ and $$\frac{\partial X_{kp}}{\partial A_{ij}}X_{pl}+\frac{\partial X_{pl}}{\partial A_{ij}}X_{kp}=\delta_{ik}A_{lj}+\delta_{il}A_{kj}.$$ To evaluate the derivative $D_{ijrs}:=\frac{\partial X_{rs}}{\partial A_{ij}}$, note that $$D_{ijrs}(\delta_{kr}X_{sl}+\delta_{sl}X_{kr})=\delta_{ik}A_{lj}+\delta_{il}A_{kj}.$$ So the result is of the form $B^{-1}C$, but with $B,\,C$ as rank-$4$ tensors rather than matrices, and their multiplication and inversion suitably defined.
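One concrete way to carry out that "suitably defined" inversion: for each fixed $(i,j)$ the last display is a Sylvester (Lyapunov) equation $YX+XY=C$ in $Y_{rs}=D_{ijrs}$, with right-hand side $C=e_i a_j^T+a_j e_i^T$, where $a_j$ is the $j$-th column of $A$; since $X$ is symmetric positive definite for full-rank $A$, each such equation has a unique solution. A sketch (variable names are illustrative, not from the answer):

    import numpy as np
    from scipy.linalg import sqrtm, solve_sylvester

    rng = np.random.default_rng(0)
    m, n = 3, 4
    A = rng.standard_normal((m, n))   # m <= n, so X = sqrt(A A^T) is positive definite
    X = sqrtm(A @ A.T)

    D = np.zeros((m, n, m, m))        # D[i, j, r, s] = dX_{rs} / dA_{ij}
    for i in range(m):
        for j in range(n):
            C = np.zeros((m, m))
            C[i, :] += A[:, j]        # delta_{ik} A_{lj} term
            C[:, i] += A[:, j]        # delta_{il} A_{kj} term
            D[i, j] = solve_sylvester(X, X, C)   # solves X Y + Y X = C

    # finite-difference spot check of one component
    eps = 1e-6
    i, j = 1, 2
    Ap = A.copy(); Ap[i, j] += eps
    print(np.max(np.abs((sqrtm(Ap @ Ap.T) - X) / eps - D[i, j])))   # small, ~ eps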