Why I should believe that the derivative of the determinant is the trace

The identity $\det\exp X =\exp\text{tr}X$ is obviously valid for diagonal $X$, and this generalises to diagonalisable matrices (since $X\to SXS^{-1}$ with invertible $S$ changes neither determinants nor traces, and $\exp(SXS^{-1})=S(\exp X)S^{-1}$) and from there to all square matrices (because the diagonalisable matrices are dense over $\mathbb{C}$ and both sides are continuous in $X$). The choice $X=\ln (I-tA)$ for small $t$, where $\ln(I-tA)=-tA+O(t^2)$, gives $$\det (I-tA)=\exp\text{tr}\ln (I-tA)\approx\exp(-t\text{tr}A)\approx 1-t\text{tr}A=\det I-t\text{tr}A.$$
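For readers who want to see the identity hold numerically, here is a minimal sketch in plain Python. The helper names (`mat_mul`, `mat_exp`, `det`, `trace`), the sample matrix `X`, and the choice of a 30-term truncated power series for the exponential are all assumptions made for illustration, not part of the argument above.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(X, terms=30):
    """Truncated power series exp(X) = sum_k X^k / k! (adequate for small X)."""
    n = len(X)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # holds X^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, X)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    n = len(M)
    M = [row[:] for row in M]                  # work on a copy
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if M[p][c] == 0:
            return 0.0
        if p != c:
            M[c], M[p] = M[p], M[c]            # a row swap flips the sign
            d = -d
        d *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n):
                M[r][j] -= f * M[c][j]
    return d

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

# An arbitrary, non-symmetric sample matrix (illustrative only).
X = [[0.3, -1.2, 0.5],
     [0.7, 0.1, -0.4],
     [-0.2, 0.9, -0.6]]

lhs = det(mat_exp(X))        # det exp X
rhs = math.exp(trace(X))     # exp tr X
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point rounding; the sample matrix is not symmetric, so the agreement does not rely on orthogonal diagonalisation.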


We take $N$ vectors with coordinates $(1, 0, 0, \dots)^T, (0, 1, 0, \dots)^T, \dots, (0, 0, \dots, 1)^T$.

$N$ vectors determine a parallelepiped. To find the volume of this parallelepiped, we form the matrix whose columns are the components of these vectors and calculate its determinant. In our case the matrix is the identity matrix, the parallelepiped is a unit cube, its volume is 1, and $\det(I)$ is 1.

We can also read the columns of the matrix $I+tA$ as the coordinates of $N$ vectors. $\det(I+tA)$ is the volume of the parallelepiped formed by these vectors.

Each of these $N$ vectors is close to the corresponding unit vector, and the whole parallelepiped is just a slightly distorted unit cube.

See what happens. There was a vector $(1, 0, 0, \dots)^T$; now we have a slightly different vector $(1+a_{11}t, a_{21}t, \dots)^T$, where $a_{ij}$ are the entries of $A$. When we change the first coordinate, the volume of the parallelepiped increases by approximately $1\cdot a_{11}t$: this is the "area of a square side * thickness of the layer". But when we change the other coordinates, this affects only the regions along the edges of the cube. The resulting change of the parallelepiped's volume is $O(t^2)$ and can be ignored.

It's easy to visualise this in the 3-D case, and not much changes in higher dimensions.

So, applying the same argument to each of the $N$ vectors, the total change of volume is $t(a_{11}+a_{22}+\dots+a_{NN}) + o(t)$.
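
As a quick check of this count in the smallest nontrivial case, take $N=2$ and write $a_{ij}$ for the entries of $A$ (notation assumed here for illustration). Expanding directly: $$\det(I+tA)=(1+a_{11}t)(1+a_{22}t)-a_{12}a_{21}t^2=1+t(a_{11}+a_{22})+t^2\det A.$$ The diagonal entries enter at first order in $t$, while the off-diagonal entries appear only in the $t^2$ term.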

So: $\frac{d}{dt}\det(I + tA)\Big|_{t=0} = \frac{dV}{dt}\Big|_{t=0} = \operatorname{Tr}(A)$
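
The volume argument can also be checked with a finite difference. The sketch below is only an illustration: the matrix `A`, the helper names `det` and `identity_plus`, and the step sizes are arbitrary choices, and the determinant is computed by cofactor expansion, which is fine for small $N$ but not how one would do it for large matrices.

```python
def det(M):
    """Determinant by cofactor expansion along the first row (small N only)."""
    if len(M) == 1:
        return M[0][0]
    total = 0.0
    for j, entry in enumerate(M[0]):
        # Minor: delete row 0 and column j.
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def identity_plus(t, A):
    """Return the matrix I + t*A."""
    n = len(A)
    return [[(1.0 if i == j else 0.0) + t * A[i][j] for j in range(n)]
            for i in range(n)]

# An arbitrary sample matrix (illustrative only).
A = [[2.0, -1.0, 0.5],
     [0.3, -4.0, 1.0],
     [1.5, 0.7, 3.0]]
trace_A = sum(A[i][i] for i in range(len(A)))

steps = (1e-1, 1e-2, 1e-3, 1e-4)
errors = []
for t in steps:
    # (V(t) - V(0)) / t, with V(0) = det(I) = 1
    slope = (det(identity_plus(t, A)) - 1.0) / t
    errors.append(slope - trace_A)
    print(t, slope, slope - trace_A)
```

As $t$ shrinks, the difference quotient should approach $\operatorname{Tr}(A)$, with the discrepancy shrinking roughly in proportion to $t$, as expected from the neglected $O(t^2)$ terms in the volume.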

Update: I guess V.I.Arnold (link suggested in comments) explained the same, but better...