How is the derivative truly, literally the "best linear approximation" near a point?
As some people on this site might be aware I don't always take downvotes well. So here's my attempt to provide more context to my answer for whoever decided to downvote.
Note that I will confine my discussion to functions $f: D\subseteq \Bbb R \to \Bbb R$ and to ideas that should be simple enough for anyone who's taken a course in scalar calculus to understand. Let me know if I haven't succeeded in some way.
First, it'll be convenient for us to define a new notation. It's called "little oh" notation.
Definition: A function $f$ is called little oh of $g$ as $x\to a$, denoted $f\in o(g)$ as $x\to a$, if
$$\lim_{x\to a}\frac {f(x)}{g(x)}=0$$
Intuitively this means that $f(x)\to 0$ as $x\to a$ "faster" than $g$ does.
Here are some examples:
- $x\in o(1)$ as $x\to 0$
- $x^2 \in o(x)$ as $x\to 0$
- $x\in o(x^2)$ as $x\to \infty$
- $x-\sin(x)\in o(x)$ as $x\to 0$
- $x-\sin(x)\in o(x^2)$ as $x\to 0$
- $x-\sin(x)\not\in o(x^3)$ as $x\to 0$
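If you want to check a couple of these limits numerically, here is a minimal Python sketch (the sample points $10^{-1},\dots,10^{-5}$ and the particular pair of functions are just illustrative choices) that prints the ratio $f(x)/g(x)$ along a sequence approaching $0$:

```python
import math

def ratio(f, g, points):
    """Print f(x)/g(x) along a sequence of points; a ratio tending to 0
    is consistent with f being little oh of g along that sequence."""
    for x in points:
        print(f"x = {x:.0e}:  f(x)/g(x) = {f(x) / g(x):.6f}")

# x - sin(x) vs x^2 as x -> 0: the ratio tends to 0, consistent with o(x^2)
ratio(lambda x: x - math.sin(x), lambda x: x**2, [10**-k for k in range(1, 6)])

# x - sin(x) vs x^3 as x -> 0: the ratio settles near 1/6, so NOT o(x^3)
ratio(lambda x: x - math.sin(x), lambda x: x**3, [10**-k for k in range(1, 6)])
```

The first ratio shrinks toward $0$, while the second levels off near $1/6$, which is exactly why $x-\sin(x)\in o(x^2)$ but $x-\sin(x)\not\in o(x^3)$.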
Now what is an affine approximation? (Note: I prefer to call it affine rather than linear -- if you've taken linear algebra then you'll know why.) It is simply a function $T(x) = A + Bx$ that approximates the function in question.
Intuitively it should be clear which affine function should best approximate the function $f$ very near $a$. It should be $$L(x) = f(a) + f'(a)(x-a).$$ Why? Well consider that any affine function really only carries two pieces of information: slope and some point on the line. The function $L$ as I've defined it has the properties $L(a)=f(a)$ and $L'(a)=f'(a)$. Thus $L$ is the unique line which passes through the point $(a,f(a))$ and has the slope $f'(a)$.
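As a quick numerical sanity check (a minimal Python sketch; the choices $f=\exp$ and $a=0$ are arbitrary), you can watch the error $f(x)-L(x)$ vanish faster than $x-a$ itself:

```python
import math

f = math.exp           # example function
a = 0.0                # point of approximation
fa, dfa = f(a), f(a)   # f(a) and f'(a); for exp at 0 both equal 1

def L(x):
    """Affine approximation of f at a with slope f'(a)."""
    return fa + dfa * (x - a)

for h in [10**-k for k in range(1, 6)]:
    x = a + h
    err = f(x) - L(x)
    print(f"h = {h:.0e}:  error = {err:.3e},  error/h = {err/h:.3e}")
```

The column `error/h` shrinks roughly like $h/2$, which is the "faster than $x-a$" behavior made precise below.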
But we can be a little more rigorous. Below I give a lemma and a theorem that tell us that $L(x) = f(a) + f'(a)(x-a)$ is the best affine approximation of the function $f$ at $a$.
Lemma: If a function $f$, differentiable at $a$, can be written, for all $x$ in some neighborhood of $a$, as $$f(x) = A + B\cdot(x-a) + R(x-a)$$ where $A, B$ are constants and $R\in o(x-a)$ as $x\to a$, then $A=f(a)$ and $B=f'(a)$.
Proof: First notice that because $R\in o(x-a)$ we have $R(x-a) = \frac{R(x-a)}{x-a}\cdot(x-a)\to 0$ as $x\to a$. Since $f$ is continuous at $a$ (being differentiable there), taking the limit $x\to a$ of both sides of the equation immediately gives $f(a)=A$.
Then, rearranging the equation we get (for all $x\ne a$)
$$\frac{f(x)-f(a)}{x-a} = \frac{f(x)-A}{x-a} = \frac{B\cdot (x-a)+R(x-a)}{x-a} = B + \frac{R(x-a)}{x-a}$$
Then taking the limit as $x\to a$ we see that $B=f'(a)$. $\ \ \ \square$
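For a concrete instance of the lemma, take $f(x)=x^2$ and $a=1$. Then $$x^2 = 1 + 2(x-1) + (x-1)^2,$$ with $R(x-1)=(x-1)^2\in o(x-1)$, and indeed $A=1=f(1)$ and $B=2=f'(1)$, just as the lemma predicts.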
Theorem: A function $f$ is differentiable at $a$ iff, for all $x$ in some neighborhood of $a$, $f(x)$ can be written as $$f(x) = f(a) + B\cdot(x-a) + R(x-a)$$ where $B \in \Bbb R$ and $R\in o(x-a)$.
Proof: "$\implies$": If $f$ is differentiable then $f'(a) = \lim_{x\to a} \frac{f(x)-f(a)}{x-a}$ exists. This can alternatively be written $$f'(a) = \frac{f(x)-f(a)}{x-a} + r(x-a)$$ where the "remainder function" $r$ has the property $\lim_{x \to a} r(x-a)=0$. Rearranging this equation we get $$f(x) = f(a) + f'(a)(x-a) -r(x-a)(x-a).$$ Let $R(x-a):= -r(x-a)(x-a)$. Then clearly $R\in o(x-a)$ (confirm this for yourself). So $$f(x) = f(a) + f'(a)(x-a) + R(x-a)$$ as required.
"$\impliedby$": Simple rearrangement of this equation yields
$$B + \frac{R(x-a)}{x-a}= \frac{f(x)-f(a)}{x-a}.$$ The limit as $x\to a$ of the LHS exists and thus the limit also exists for the RHS. This implies $f$ is differentiable by the standard definition of differentiability. $\ \ \ \square$
Taken together, the above lemma and theorem tell us not only that $L(x) = f(a) + f'(a)(x-a)$ is the only affine function whose remainder tends to $0$ as $x\to a$ faster than $x-a$ itself (this is the sense in which this approximation is the best), but also that we can even define the concept of differentiability by the existence of this best affine approximation.
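To see this uniqueness numerically, here is a small Python sketch (with the arbitrary choices $f=\sin$, $a=\pi/4$, and a competing slope of $1$): the remainder of the tangent line is $o(x-a)$, while the remainder of any other line through $(a,f(a))$ is not.

```python
import math

a = math.pi / 4
f = math.sin
tangent_slope = math.cos(a)   # f'(a)
other_slope = 1.0             # any slope different from f'(a)

for h in [10**-k for k in range(1, 6)]:
    x = a + h
    R_tan = f(x) - (f(a) + tangent_slope * (x - a))     # remainder of the tangent line
    R_other = f(x) - (f(a) + other_slope * (x - a))     # remainder of the other line
    print(f"h = {h:.0e}:  R_tan/h = {R_tan/h:+.3e},  R_other/h = {R_other/h:+.3e}")
```

The ratio `R_tan/h` goes to $0$, while `R_other/h` approaches the nonzero constant $f'(a)-s$.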
I'll first give an intuitive answer, then an analytic answer.
Intuitively, the tangent goes in the same direction as the function, following it as closely as possible for a line. Any other line immediately starts to diverge from the function.
Analytically:
Consider the Taylor approximation at $x$: $f(x+h) = f(x)+hf'(x)+h^2f''(x)/2+\cdots$.
This means that, for small $h$, $f(x+h) \approx f(x)+hf'(x)+h^2f''(x)/2$, so that the error $E(x, h) = f(x+h)- (f(x)+hf'(x))$ is about $h^2f''(x)/2$.
Now consider any other line through $(x, f(x))$ with slope $s$, where $s \ne f'(x)$. At $x+h$, its value is $f(x)+sh$, so its error is $e(x, h, s) = f(x+h)-(f(x)+sh)$.
Since $f(x+h)-f(x) \approx hf'(x)+h^2f''(x)/2 $,
$$\begin{aligned} e(x, h, s) &= f(x+h)-(f(x)+sh)\\ &\approx hf'(x)+h^2f''(x)/2-sh\\ &= h(f'(x)-s)+h^2f''(x)/2 \end{aligned}$$
so that $\dfrac{E(x, h)}{e(x, h, s)} \approx \dfrac{h^2f''(x)/2}{h(f'(x)-s)+h^2f''(x)/2} = \dfrac{hf''(x)/2}{f'(x)-s+hf''(x)/2} $.
Since $s \ne f'(x)$, as $h \to 0$ the numerator of this ratio of errors goes to zero, while the denominator stays bounded away from zero.
Therefore the error of the tangent goes to zero faster than the error in any other line through the point.
That is why the tangent is the best linear approximation to the curve.
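A small Python sketch of this error comparison (the choices $f=\cos$, $x=0.5$, and $s=1$ are arbitrary) shows the ratio $E/e$ tending to $0$:

```python
import math

f = math.cos
df = lambda t: -math.sin(t)   # derivative of cos
x, s = 0.5, 1.0               # base point and a slope s != f'(x)

for h in [10**-k for k in range(1, 6)]:
    E = f(x + h) - (f(x) + h * df(x))   # error of the tangent line
    e = f(x + h) - (f(x) + s * h)       # error of the line with slope s
    print(f"h = {h:.0e}:  E/e = {E/e:.3e}")
```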
There is a sense in which the derivative is the best linear approximation. You just have to define "best" approximation in a proper way, taking into account that the derivative is a very local property. In particular, suppose we are trying to approximate $f$ at $x_0$. Then, we make the following definition:
A function $g$ is at least as good an approximation as $h$ if there is some $\varepsilon>0$ such that for any $x$ with $|x-x_0|<\varepsilon$ we have that $|g(x)-f(x)|\leq |h(x)-f(x)|$.
This is to say that, when we compare two functions, we only look at arbitrarily small neighborhoods of the point at which we are approximating. This defeats your strategy: if you take the tangent line and compare it to a secant line passing through $(a,f(a))$ for some $a\neq x_0$, the comparison will exclude $a$ from consideration once $\varepsilon$ is made small enough. Essentially, the important thing is that you fix $\varepsilon$ after you fix the two functions you want to compare. This relation is only a preorder (not even a partial order), so sometimes there is no best approximation.
However, we have two theorems:
$f$ is differentiable at $x_0$ if and only if there is a linear function $g$ which is at least as good an approximation as any other linear function $h$.
If $f$ is differentiable at $x_0$, then $g(x)=f(x_0)+(x-x_0)f'(x_0)$ is the best linear approximation of $f$ at $x_0$.
meaning this definition is equivalent to the usual one. Interestingly, we get the condition of continuity at $x_0$ if we instead ask for the best constant approximation to exist.
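If it helps, here is a rough Python sketch of the comparison in this definition (the choices $f=\sin$, $x_0=1$, and a secant through $x_0$ and $1.5$ are just for illustration): on a wide interval the secant can beat the tangent near its second intersection point, but once $\varepsilon$ is small enough the tangent is at least as good everywhere.

```python
import math

f, x0 = math.sin, 1.0

def tangent(x):
    """Best linear (affine) approximation of f at x0."""
    return f(x0) + math.cos(x0) * (x - x0)

def secant(x):
    """Line through (x0, f(x0)) and (1.5, f(1.5))."""
    return f(x0) + (f(1.5) - f(x0)) / (1.5 - x0) * (x - x0)

for eps in [0.5, 0.1, 0.01]:
    xs = [x0 + eps * k / 100 for k in range(-100, 101)]
    tangent_wins = all(abs(tangent(x) - f(x)) <= abs(secant(x) - f(x)) for x in xs)
    print(f"eps = {eps}:  tangent at least as good on [x0 - eps, x0 + eps]? {tangent_wins}")
```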