Definition of the total derivative.

This is one of the most fundamental definitions in all of analysis.

It says that the increment $\Delta f:=f(a+h)-f(a)$ of the function value should in first approximation be a linear function of the increment $h$ attached at the point $a$. In other words, we want $$f(a+h)-f(a)=Lh +r(h)\qquad(|h|\ll1)\ ,\tag{1}$$ where $L$ is a linear map and the error $r(h)$ should be smaller by orders of magnitude than the linear term $Lh$ when $h$ is small. Now in general $|Lh|$ will be of order $|h|$ for "most" $h$. This means that we should require that $$\lim_{h\to0}{|r(h)|\over |h|}=0$$ in order to impart any real content to $(1)$. It turns out that this condition determines $L$ uniquely. If it can be satisfied then $f$ is called differentiable at $a$, and one denotes the resulting $L$ by $Df\bigr|_a$, or similar.
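A quick numerical check can make the remainder condition concrete. The sketch below is only an illustration: the map $f(x,y)=(x^2y,\ \sin x+y)$, the point $a=(1,2)$, the direction of $h$ and the helper names are all my own choices, not part of the definition. It computes $|r(h)|/|h|$ once with the Jacobian matrix as $L$ and once with a deliberately perturbed matrix.

```python
import math

# Hypothetical example map f: R^2 -> R^2, chosen only for illustration.
def f(x, y):
    return (x * x * y, math.sin(x) + y)

# Its Jacobian matrix, computed by hand; this is the candidate for L.
def jacobian(x, y):
    return ((2 * x * y, x * x),
            (math.cos(x), 1.0))

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def remainder_ratio(a, h, L):
    """Return |r(h)| / |h| where r(h) = f(a+h) - f(a) - L h."""
    fa = f(*a)
    fah = f(a[0] + h[0], a[1] + h[1])
    Lh = tuple(row[0] * h[0] + row[1] * h[1] for row in L)
    r = tuple(fah[i] - fa[i] - Lh[i] for i in range(2))
    return norm(r) / norm(h)

a = (1.0, 2.0)
direction = (0.6, -0.8)          # a fixed unit vector along which h -> 0
L_good = jacobian(*a)
# Same matrix with one entry changed: still linear, but not the derivative.
L_bad = ((L_good[0][0], L_good[0][1] + 0.5), L_good[1])

good_ratios, bad_ratios = [], []
for k in range(1, 6):
    t = 10.0 ** (-k)
    h = (t * direction[0], t * direction[1])
    good_ratios.append(remainder_ratio(a, h, L_good))
    bad_ratios.append(remainder_ratio(a, h, L_bad))
    print(f"|h| = {t:.0e}   Jacobian: {good_ratios[-1]:.3e}   perturbed: {bad_ratios[-1]:.3e}")
```

With the Jacobian the ratio shrinks roughly in proportion to $|h|$, while with the perturbed matrix it settles near $0.4$ instead of going to $0$. That is what "the condition determines $L$ uniquely" looks like in practice, at least along this one direction.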


I think it's much easier to understand if you write the definition as "$Df_{|a}$ is the unique linear map $L$ (if it exists) satisfying $f(x) = f(a) + L(x-a) + o(\|x-a\|)$ when $x\rightarrow a$".

Maybe it's even clearer if you write "$Df_{|a}$ is the linear part of the unique affine map $A$ (if it exists) satisfying $f(x) = A(x) + o(\|x-a\|)$ when $x\rightarrow a$".

So you can see that $A$ is the best possible affine approximation of $f$ near $a$ (because the error you make by replacing $f$ with $A$ is negligible compared to $\|x-a\|$, and no other affine map achieves an error that small), and $Df_{|a}$ is the linear part of this affine approximation.
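As a small worked example (my own choice of function, just to see the pieces), take $f(x)=x^2$ on $\mathbb{R}$. Expanding around $a$ gives $$x^2 = \underbrace{a^2 + 2a\,(x-a)}_{A(x)} + (x-a)^2 ,$$ and the last term is $o(|x-a|)$ because $(x-a)^2/|x-a| = |x-a| \to 0$. So here the affine map is $A(x)=a^2+2a(x-a)$ and its linear part is $h\mapsto 2a\,h$, which is the usual derivative $2a$ viewed as a linear map.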


The division by $\|h\|$ here is exactly analogous to the division by $h$ in the definition of the (standard) derivative of a real-valued function of a real variable:

$\dfrac{df}{dx}=\underset{h\rightarrow 0}{\lim}\dfrac{f(x+h)-f(x)}{h}$
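To spell the analogy out (a short check, writing $a$ for the base point and $r(h)=f(a+h)-f(a)-Lh$ for the remainder): when $f:\mathbb{R}\to\mathbb{R}$, every linear map has the form $Lh=ch$ for some number $c$, and then $$\frac{|r(h)|}{|h|}=\left|\frac{f(a+h)-f(a)}{h}-c\right| .$$ So the quotient $|r(h)|/|h|$ tends to $0$ exactly when the difference quotient converges to $c$, that is, when $c$ is the ordinary derivative of $f$ at $a$.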

To answer the second question, $Df|_a$ is linear because it satisfies the linearity property, that is, it commutes with addition and scalar multiplication. This is a consequence of how it is defined and is not actually hard to prove (try using the definition to show $Df|_a+Dg|_a=D(f+g)|_a$ and $D(\alpha f)|_a=\alpha Df|_a$ directly; hint: use some linear algebra).

To answer the final question, it is the natural analog of the familiar derivative of a function $f:\mathbb{R} \rightarrow \mathbb{R}$ for the case of a function $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$.