What is the intuition behind the definition of the differential of a function?
The most classical answer is probably: $df_p$ is the best linear approximation of $f$ at $p$, in a sense that can be made precise.
Personally, I often like to think of it in the following way: while $f$ transforms positions, the differential of $f$ transforms velocities (or: while $f$ transforms points, $df$ transforms tangent vectors).
More precisely: let $t \mapsto x(t)$ be a curve in $M$ through $p$ at $t=0$. The image of this curve in $N$ is $t \mapsto y(t) = f(x(t))$. Let $v = x'(0) \in T_{p}M$ and $w = y'(0) \in T_{f(p)}N$. Then $w = df_p(v)$.
In short: $$\frac{d}{dt}\Big|_{t=0} f(x(t)) = df_{p}\bigl(x'(0)\bigr), \qquad p = x(0).$$
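The identity above can be checked numerically. The following is only a minimal sketch in the special case $M = N = \mathbb{R}^2$; the map `f`, its hand-computed Jacobian `jacobian_f`, and the curve `curve` are hypothetical examples chosen for illustration, and the velocities are estimated by finite differences rather than computed exactly.

```python
import math

def f(point):
    """Hypothetical smooth map f: R^2 -> R^2, chosen only for illustration."""
    x, y = point
    return (x * y, math.sin(x) + y ** 2)

def jacobian_f(point):
    """Matrix of df at `point`, computed by hand from the formula for f."""
    x, y = point
    return ((y, x),
            (math.cos(x), 2.0 * y))

def curve(t):
    """Hypothetical curve x(t) in R^2 with x(0) = p = (1, 2)."""
    return (1.0 + 3.0 * t, 2.0 - t + t ** 2)

def velocity_at_zero(path, h=1e-6):
    """Central finite-difference estimate of path'(0)."""
    forward, backward = path(h), path(-h)
    return tuple((a - b) / (2.0 * h) for a, b in zip(forward, backward))

def apply_linear(matrix, vector):
    """Multiply a matrix (given as a tuple of rows) by a vector."""
    return tuple(sum(m * v for m, v in zip(row, vector)) for row in matrix)

p = curve(0.0)
v = velocity_at_zero(curve)                    # v = x'(0)
w = velocity_at_zero(lambda t: f(curve(t)))    # w = y'(0), where y(t) = f(x(t))
df_p_v = apply_linear(jacobian_f(p), v)        # df_p(v)

print("w       =", w)
print("df_p(v) =", df_p_v)
```

Up to finite-difference error, the two printed vectors should agree: the velocity of the image curve is the differential applied to the velocity of the original curve.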
I find this interpretation of the differential useful both conceptually and, very often, in practice.
I like to think of it this way:
The "infinitesimal" perspective is not encoded in $df$ or in $v_p$, but in $f$ itself. This is because we define the derivations at a point $p$ not on smooth functions, but on the stalk of germs of smooth functions at $p$, i.e. on $$ \mathcal{C}^{\infty}_p := \left\{ (f,U) : p \in U,\ U \text{ open},\ f \in \mathcal{C}^{\infty}(U,\mathbb{R})\right\} / \sim $$ where $(f,U) \sim (g,V)$ iff there is an open neighbourhood $W \subseteq U \cap V$ of $p$ such that $f|_{W} = g|_{W}$. In particular, this means that the class $f_p$ of $(f,U)$ captures the local behaviour of $f$ at $p$.
Now fix a local coordinate chart $((x_1,\dotsc,x_n),U)$ for some open neighbourhood $U$ of $p$, so that $\frac{\partial}{\partial x_1},\dotsc,\frac{\partial}{\partial x_n}$ is a basis for $T_p M$. Here $\frac{\partial}{\partial x_j}$ is by definition the (unique) derivation such that $$ \frac{\partial}{\partial x_j}(x_{i,p}) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} $$ This means that you can indeed think of $\frac{\partial}{\partial x_j}(f_p)$ as the local change of $f$ at $p$ in the "direction" $\frac{\partial}{\partial x_j}$, i.e. in the same direction in which the coordinate map $x_j$ increases.
So, what is the differential of $f_p$? I think of it as a useful change of perspective: it allows you to think of $\frac{\partial}{\partial x_j}(f_p)$ not as a function of $f_p$, but as a function of $\frac{\partial}{\partial x_j}$. This is useful because it allows you to define a very natural map from germs of smooth functions to the dual space $T^*_p M$. On one hand, with this map you can easily prove that $dx_1,\dotsc,dx_n$ is a basis for this space. On the other hand, it suggests that if we can reasonably map $0$-forms to $1$-forms, then it might not be too hard to extend this idea to a map from $k$-forms to $(k+1)$-forms. Indeed, this is possible and it is what we call the exterior derivative.
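To make this change of perspective concrete, here is a short worked computation. It assumes the usual definition, which is only implicit in the paragraph above, that the differential of $f_p$ is the linear functional $df_p(v) := v(f_p)$ for $v \in T_p M$. Applied to the coordinate germs, this gives $$ dx_i\left(\frac{\partial}{\partial x_j}\right) = \frac{\partial}{\partial x_j}(x_{i,p}) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} $$ so $dx_1,\dotsc,dx_n$ is the basis of $T^*_p M$ dual to $\frac{\partial}{\partial x_1},\dotsc,\frac{\partial}{\partial x_n}$. Writing a tangent vector as $v = \sum_j v_j \frac{\partial}{\partial x_j}$, so that $v_j = dx_j(v)$, linearity then yields $$ df_p(v) = \sum_{j=1}^n v_j \, \frac{\partial}{\partial x_j}(f_p) = \sum_{j=1}^n \frac{\partial}{\partial x_j}(f_p) \, dx_j(v), \qquad \text{i.e.} \qquad df_p = \sum_{j=1}^n \frac{\partial}{\partial x_j}(f_p) \, dx_j . $$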
It might be useful to clarify that Seub is talking about a slightly different thing in his answer. Indeed, consider a morphism of differential manifolds $\phi\colon M \to N$ and fix a point $p \in M$. Then precomposition by $\phi$ gives a map from $\mathcal{C}^{\infty}_{\phi(p)}$ (on $N$) to $\mathcal{C}^{\infty}_p$ (on $M$), simply because if $f\colon N \to \mathbb{R}$ is smooth at $\phi(p)$, then $f \circ \phi$ is smooth at $p$. Then you can define a map $$ \phi_* \colon T_p(M) \to T_{\phi(p)}(N) $$ by putting $\big(\phi_*(v_p)\big)(f_{\phi(p)}) = v_p((f \circ \phi)_p)$ for every $f \in \mathcal{C}^{\infty}_{\phi(p)}$.
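Why are these two things related? Suppose $N = \mathbb{R}$. Then the identity on $N$ has a germ $\mathbf{1}_{\phi(p)} \in \mathcal{C}^{\infty}_{\phi(p)}$ and we have $\big(\phi_*(v_p)\big)(\mathbf{1}_{\phi(p)}) = v_p((\mathbf{1} \circ \phi)_p) = v_p(\phi_p)$. In other words, if we identify $T_{\phi(p)}\mathbb{R}$ with $\mathbb{R}$ by evaluating a derivation on the germ of the identity, then $\phi_*(v_p)$ corresponds to $v_p(\phi_p)$, which is the differential of $\phi_p$ evaluated at $v_p$.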
This perspective is very useful, because it allows you to "compare" the tangent spaces of two manifolds (at two corresponding points). I don't think that its motivation is measuring how $\phi$ changes, though; maybe historically, but not in this formalism. You could probably try to do so by passing through local charts, and indeed Boothby does something similar in examples 1.9 and 1.10, chapter 4, of his book (pp. 112-115 in my edition).
Note: Your question asked about the intuition (or, I think, motivation) of the definition of differential in differential geometry, and I hope I made this clear. On the other hand, from your comments it seems like the sources of your confusion are Spivak's book and the effort to reconcile this definition with the classical notion of differentials as "infinitesimal quantities". Now, to quote L. Ryder's book:
The $1$-form $\mathbf{d}x^{\mu}$ is not the same as the infinitesimal $dx^{\mu}$: it is not a 'number', but a member of the cotangent space $T_p^*$.
Just before the paragraph you quoted in your comments, Spivak writes (emphasis mine):
Classical differential geometers (and classical analysts) did not hesitate to talk about "infinitely small" changes $dx^i$ of the coordinates $x^i$, just as Leibnitz had. No one wanted to admit that this was nonsense, because true results were obtained when these infinitely small quantities were divided into each other (provided one did it in the right way).
This simply means that the classical formalism is not rigorous: you may think of "infinitesimal changes" if it helps you visualise what you're doing, but you can't trust any result you obtain in this way until you prove it using solid definitions.
He then goes on to say that while we cannot say how large an "infinitely small" change is (at least in the framework of standard analysis), we can still say "where our function is going", as with tangent vectors to a curve. Furthermore, classically the differential of a function is seen as its variation in response to an "infinitesimal" variation of its variables, so you could start formalising it as a function of this change. Since we said that we can still (intuitively) think of a change as a tangent vector, the differential then becomes a function on the tangent space.
In this sense, if you think of a smooth curve on a smooth manifold $M$ as a smooth function $c$ from $\mathbb{R}$ to $M$ (which you can, at least locally), then for a parameter value $p \in \mathbb{R}$ the "infinitesimal change" of the curve in response to an "infinitesimal change" of the parameter becomes a function from $T_p\mathbb{R}$ to $T_{c(p)}M$, which we call the differential of $c$ at $p$.
(The following is only about functions defined on some $\Omega\subset{\mathbb R}^n$)
If $p$ is a "generic" point for the function $f:\>{\mathbb R}^n\to{\mathbb R}$ then the rate of change of $f$ when walking away from $p$ depends on the chosen direction, but not in an arbitrary way: There is a direction of maximal increase, there is a hyperplane through $p$ (a plane when $n=3$) with virtually no change of $f$ when walking away staying in this hyperplane, and there is a direction of maximal decrease. One can measure these various rates of change using directional derivatives: If $u$ is a unit vector one can define $$D_uf(p)=\lim_{t\to0+}{f(p+t u)-f(p)\over t}\ .$$ This definition by itself does not tell us anything about how $D_uf(p)$ depends on $u$. In reality there is a certain vector $\nabla f(p)$, called the gradient of $f$ at $p$, such that $D_u f(p)$ can be computed for all $u$ by the formula $$D_u f(p)=\nabla f(p)\cdot u\ .\tag{1}$$ This shows that the "linear behavior" of $f$ when walking away from $p$ is encoded in this vector $\nabla f(p)$ once and for all.
The rule $$X\mapsto \nabla f(p)\cdot X$$ appearing in $(1)$ for the special case $X=u$, a unit vector, turns the vector $\nabla f(p)$ into a linear functional $\phi$ on the tangent space at $p$: For each $X\in T_p$ a value $\phi(X):=\nabla f(p)\cdot X$ is defined. This linear functional is called the differential of $f$ at $p$, and is denoted by $df(p)$. It is related to small changes of $f$ in the neighborhood of $p$ by the formula $$f(p+X)-f(p)=df(p).X+o\bigl(|X|\bigr)\qquad(X\to0)\ .\tag{2}$$ Note that in $(2)$ neither a scalar product, nor unit vectors appear.
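Formulas $(1)$ and $(2)$ can be illustrated numerically. This is a minimal sketch, not a proof: the function `f`, its hand-computed gradient `grad_f`, the point `p` and the unit vector `u` are all hypothetical choices made for illustration, and the limit in the directional derivative is replaced by a difference quotient with a small step.

```python
import math

def f(point):
    """Hypothetical smooth function f: R^2 -> R, chosen only for illustration."""
    x, y = point
    return x ** 2 * y + math.exp(y)

def grad_f(point):
    """Gradient of f, computed by hand from the formula for f."""
    x, y = point
    return (2.0 * x * y, x ** 2 + math.exp(y))

def dot(a, b):
    """Euclidean scalar product of two vectors."""
    return sum(s * t for s, t in zip(a, b))

def df(point, X):
    """The differential df(p) applied to X: a linear functional of X."""
    return dot(grad_f(point), X)

def directional_derivative(point, u, t=1e-6):
    """One-sided difference quotient approximating D_u f(p)."""
    shifted = tuple(c + t * ui for c, ui in zip(point, u))
    return (f(shifted) - f(point)) / t

p = (1.0, 0.5)
u = (math.cos(0.7), math.sin(0.7))   # an arbitrary unit vector

# Formula (1): D_u f(p) should be close to grad f(p) . u
d_u = directional_derivative(p, u)
print("D_u f(p) ~", d_u, "  grad f(p) . u =", df(p, u))

# Formula (2): the remainder divided by |X| should tend to 0 as X -> 0
ratios = []
for k in range(1, 6):
    scale = 10.0 ** (-k)
    X = (scale * u[0], scale * u[1])
    shifted = (p[0] + X[0], p[1] + X[1])
    remainder = f(shifted) - f(p) - df(p, X)
    ratios.append(abs(remainder) / math.hypot(*X))
print("remainder / |X| :", ratios)
```

The second printout is the point of $(2)$: the error of the linear approximation $df(p).X$ shrinks faster than $|X|$ itself, so the listed ratios should decrease towards $0$.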