How to prove the chain rule?
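Assuming everything behaves nicely ($g$ is differentiable at $a$, $f$ is differentiable at $g(a)$, and $g(x) \neq g(a)$ whenever $x \neq a$ is sufficiently close to $a$), the derivative of $f(g(x))$ at the point $x = a$ is given by \begin{align*} \lim_{x \to a}\frac{f(g(x)) - f(g(a))}{x-a} &= \lim_{x\to a}\frac{f(g(x)) - f(g(a))}{g(x) - g(a)}\cdot \frac{g(x) - g(a)}{x-a}. \end{align*} Since $g$ is differentiable at $a$, it is continuous there, so $g(x) \to g(a)$ as $x \to a$. The first factor therefore tends to $f'(g(a))$ and the second to $g'(a)$, by the definition of the derivative, so the limit is $f'(g(a))\cdot g'(a)$.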
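To see why the hypothesis $g(x) \neq g(a)$ matters, here is a standard example (added as an illustration, not part of the argument above) in which it fails: $$ g(x) = \begin{cases} x^2 \sin(1/x), & x \neq 0, \\ 0, & x = 0. \end{cases} $$ Here $g$ is differentiable at $a = 0$ with $g'(0) = \lim_{h \to 0} h \sin(1/h) = 0$, yet $g\bigl(1/(n\pi)\bigr) = 0 = g(0)$ for every positive integer $n$. The quotient $\frac{f(g(x)) - f(g(0))}{g(x) - g(0)}$ is therefore undefined at points arbitrarily close to $0$. The chain rule still holds for such a $g$, but this particular argument does not establish it.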
One approach is to use the fact that "differentiability" is equivalent to "approximate linearity", in the sense that if $f$ is defined in some neighborhood of $a$, then $$ f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h}\quad\text{exists} $$ if and only if $$ f(a + h) = f(a) + f'(a) h + o(h)\quad\text{at $a$ (i.e., "for small $h$").} \tag{1} $$ (As usual, "$o(h)$" denotes a function satisfying $o(h)/h \to 0$ as $h \to 0$.)
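As a quick check of (1), take $f(x) = x^3$ (an example chosen here purely for illustration). Expanding, $$ (a + h)^3 = a^3 + 3a^2 h + \bigl(3a h^2 + h^3\bigr), $$ and the bracketed remainder satisfies $(3a h^2 + h^3)/h = 3a h + h^2 \to 0$ as $h \to 0$, so it is $o(h)$. The coefficient of $h$ is $3a^2 = f'(a)$, as (1) predicts.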
If $f$ is differentiable at $a$ and $g$ is differentiable at $b = f(a)$, and if we write $b + k = y = f(x) = f(a + h)$, then $$ k = y - b = f(a + h) - f(a) = f'(a) h + o(h), $$ so $o(k) = o(h)$, i.e., any quantity negligible compared to $k$ is negligible compared to $h$. Now we simply compose the linear approximations of $g$ and $f$: \begin{align*} f(a + h) &= f(a) + f'(a) h + o(h), \\ g(b + k) &= g(b) + g'(b) k + o(k), \\ (g \circ f)(a + h) &= (g \circ f)(a) + g'\bigl(f(a)\bigr)\bigl[f'(a) h + o(h)\bigr] + o(k) \\ &= (g \circ f)(a) + \bigl[g'\bigl(f(a)\bigr) f'(a)\bigr] h + o(h). \end{align*} Since the right-hand side has the form of a linear approximation, (1) implies that $(g \circ f)'(a)$ exists, and is equal to the coefficient of $h$, i.e., $$ (g \circ f)'(a) = g'\bigl(f(a)\bigr) f'(a). $$ One nice feature of this argument is that it generalizes with almost no modifications to vector-valued functions of several variables.
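For a concrete instance of the composition step in the argument above, one can take $f(x) = x^2$ and $g(y) = y^3$ (choices made here purely for illustration), so that $b = f(a) = a^2$: \begin{align*} k &= f(a + h) - f(a) = 2a h + h^2, \\ g(b + k) &= b^3 + 3b^2 k + \bigl(3b k^2 + k^3\bigr) = g(b) + g'(b) k + o(k), \\ (g \circ f)(a + h) &= a^6 + 3a^4 \bigl(2a h + h^2\bigr) + o(k) \\ &= a^6 + 6a^5 h + o(h). \end{align*} The coefficient of $h$ is $6a^5 = 3a^4 \cdot 2a = g'\bigl(f(a)\bigr) f'(a)$, which agrees with differentiating $(g \circ f)(x) = x^6$ directly.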