Why is the gradient normal?
The gradient of a function is normal to the level sets because it is defined that way. The gradient of a function is not the natural derivative. When you have a function f defined on some Euclidean space (more generally, a Riemannian manifold), its derivative at a point, say x, is a function $d_x f$ on tangent vectors. The intuitive way to think of it is that $d_x f(v)$ answers the question:
If I move infinitesimally in the direction v, what happens to f?
So $d_x f$ is not itself a tangent vector. However, as we have an inner product lying around, we can convert it into a tangent vector, which we call ∇f. This answers the question:
What tangent vector u at x best represents $d_x f$?
What we mean by "best represents" is that u should satisfy the condition:
<u, v> = $d_x f(v)$ for all tangent vectors v
Now we look at the level set of f through x. If v is a tangent vector at x which is tangent to the level set, then $d_x f(v) = 0$, since f doesn't change if we go (infinitesimally) in the direction of v. Hence our vector ∇f (the u in the question above) must satisfy <∇f, v> = 0. That is, ∇f is normal to the set of tangent vectors at x which are tangent to the level set.
For a generic x and a generic f (i.e. most of the time), the set of tangent vectors at x which are tangent to the level set of f at x has codimension 1, so this specifies ∇f up to a scalar multiple. The scalar multiple can be found by looking at a tangent vector v such that f does change in the v-direction. If no such v exists, then ∇f = 0, of course.
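To make the role of the inner product concrete, here is a minimal numerical sketch. Everything specific in it is an assumption chosen for illustration: the function f(x, y) = x² + 3y, the point (1, 2), and the second inner product given by a hypothetical positive-definite matrix G. It solves the defining condition <u, v> = (derivative applied to v) for u and checks that u is normal, in that inner product, to a vector tangent to the level set.

```python
# Sketch: the vector u that represents the derivative depends on the inner product.
# Assumptions (not from the text): f(x, y) = x**2 + 3*y, the point p = (1, 2),
# and the inner product <a, b>_G = a^T G b for a symmetric positive-definite G.

def df(p, v):
    """Derivative of f(x, y) = x**2 + 3*y at p, applied to the tangent vector v."""
    x, _ = p
    return 2.0 * x * v[0] + 3.0 * v[1]

def inner(G, a, b):
    """Inner product a^T G b for a 2x2 matrix G given as nested tuples."""
    return sum(a[i] * G[i][j] * b[j] for i in range(2) for j in range(2))

def gradient(G, p):
    """Solve <u, v>_G = df(p, v) for all v, i.e. the linear system G u = c."""
    c = (df(p, (1.0, 0.0)), df(p, (0.0, 1.0)))  # derivative on the basis vectors
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return ((G[1][1] * c[0] - G[0][1] * c[1]) / det,
            (G[0][0] * c[1] - G[1][0] * c[0]) / det)

p = (1.0, 2.0)
tangent = (3.0, -2.0)               # df(p, tangent) = 2*3 + 3*(-2) = 0
euclid = ((1.0, 0.0), (0.0, 1.0))   # the usual dot product
skewed = ((2.0, 1.0), (1.0, 3.0))   # hypothetical alternative inner product

for G in (euclid, skewed):
    u = gradient(G, p)
    # The second printed value should be zero up to rounding.
    print(u, inner(G, u, tangent))
```

With the usual dot product this recovers the familiar vector of partial derivatives. With the other inner product the representing vector changes, and it is normal to the level set with respect to that inner product rather than the Euclidean one, which is the sense in which the gradient is normal "by definition".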
If you are standing on a level set and want to walk some small distance d and get as far as possible from the level set, you want to walk along the normal. Otherwise, if the path you take has a tangent component, it will tend to keep you closer to the level set if d is small enough compared to the size of the level set. Furthermore, getting as far as possible from your level set is approximately the same as walking to the highest/lowest level curve in range, with the approximation improving as d shrinks.
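The claim that the approximation improves as d shrinks can be checked numerically. The sketch below is only an illustration under assumed choices: the function f(x, y) = x² + 4y², the starting point (1, 0.5), and a brute-force search over 3600 sampled directions. It finds the step of length d that raises f the most and compares its direction with the direction of the gradient.

```python
import math

# Assumed example function (not from the text); its gradient is (2x, 8y).
def f(x, y):
    return x**2 + 4.0 * y**2

def best_direction(p, d, samples=3600):
    """Angle (radians) of the step of length d from p that increases f the most."""
    best_angle, best_gain = 0.0, -math.inf
    for k in range(samples):
        angle = 2.0 * math.pi * k / samples
        gain = f(p[0] + d * math.cos(angle), p[1] + d * math.sin(angle)) - f(*p)
        if gain > best_gain:
            best_angle, best_gain = angle, gain
    return best_angle

p = (1.0, 0.5)
grad_angle = math.atan2(8.0 * p[1], 2.0 * p[0])  # direction of the gradient at p

for d in (0.5, 0.1, 0.01):
    # Gap between the best direction for step length d and the gradient direction.
    print(d, abs(best_direction(p, d) - grad_angle))
```

For a large step the best direction is visibly off the normal, because the curvature of the level sets matters; as d shrinks the gap closes, up to the resolution of the sampled angles.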
This is essentially Andrew Stacey's answer, but a bit lower level. This is the story I actually try to get my calculus 3 students to understand.
Let $F: \mathbb{R}^2 \to \mathbb{R}$ and let $p$ be a point of $\mathbb{R}^2$. Then the derivative of $F$ at $p$ is a linear map $D_{F,p}:\mathbb{R}^2 \to \mathbb{R}$, whose matrix with respect to the standard basis is $\begin{bmatrix} \dfrac{\partial F}{\partial x} & \dfrac{\partial F}{\partial y} \end{bmatrix}$, with the partials evaluated at $p$.
This is the unique linear map which satisfies $F(p+h) = F(p)+D_{F,p}(h)+\operatorname{Error}(h)$, where $\displaystyle\lim_{h \to 0} \dfrac{|\operatorname{Error}(h)|}{|h|} = 0$. Notice that $p$ and $h$ are both vectors in $\mathbb{R}^2$.
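If it helps to see the limit condition in action, here is a small sketch. The function $F(x,y) = x^2 y + \sin y$, the point $p = (1, 2)$, and the direction of $h$ are all assumptions picked for illustration. It computes $|Error(h)|/|h|$ for shrinking $h$, using the partial derivatives worked out by hand.

```python
import math

# Assumed example function (not from the text).
def F(x, y):
    return x**2 * y + math.sin(y)

def DF(p, h):
    """Linear map D_{F,p}(h), built from the partial derivatives of F at p."""
    x, y = p
    return 2.0 * x * y * h[0] + (x**2 + math.cos(y)) * h[1]

def error_ratio(p, h):
    """|Error(h)| / |h|, where Error(h) = F(p + h) - F(p) - D_{F,p}(h)."""
    err = F(p[0] + h[0], p[1] + h[1]) - F(*p) - DF(p, h)
    return abs(err) / math.hypot(h[0], h[1])

p = (1.0, 2.0)
for t in (1e-1, 1e-2, 1e-3):
    # h shrinks along the fixed direction (1, -2); the ratio should shrink with it.
    print(t, error_ratio(p, (t, -2.0 * t)))
```

The ratio drops roughly in proportion to $|h|$, which is what you expect when the error is dominated by second-order terms.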
The cool thing about linear maps $\mathbb{R}^n \to \mathbb{R}$ is that they look like dot products! In this case, with $h = \langle a,b \rangle$, evaluating the derivative at the point $p$ gives $D_{F,p}(\langle a,b \rangle) = \begin{bmatrix} \dfrac{\partial F}{\partial x} & \dfrac{\partial F}{\partial y} \end{bmatrix} \begin{bmatrix}a \\ b\end{bmatrix} = \dfrac{\partial F}{\partial x}a + \dfrac{\partial F}{\partial y}b = \left\langle \dfrac{\partial F}{\partial x}, \dfrac{\partial F}{\partial y} \right\rangle \cdot \langle a, b\rangle$
(with the partials evaluated at $p$). This alternative viewpoint on the derivative is useful, because it gives a different geometric interpretation of the derivative. We call $\left\langle \dfrac{\partial F}{\partial x}, \dfrac{\partial F}{\partial y} \right\rangle$ the gradient of $F$.
Now we are interested in the curve $F(x,y) = 0$. Given a point $p=(x_1,y_1)$ on this curve, a tangent direction is a nonzero vector $h = \langle h_1, h_2 \rangle$ for which $D_{F,p}(h) = 0$, because to stay on the curve, the value of the function should not change to first order. Using the geometric interpretation in terms of dot products, we can see that $\left\langle \dfrac{\partial F}{\partial x}, \dfrac{\partial F}{\partial y} \right\rangle \cdot \langle h_1, h_2\rangle = 0$, or geometrically that the gradient is perpendicular to the tangent direction!
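As a concrete check, here is a worked example, assuming the unit circle $F(x,y) = x^2 + y^2 - 1$ and the point $p = (3/5, 4/5)$ (both chosen for illustration):

$$\nabla F(p) = \left\langle 2x, 2y \right\rangle \Big|_{p} = \left\langle \tfrac{6}{5}, \tfrac{8}{5} \right\rangle, \qquad D_{F,p}(\langle h_1, h_2 \rangle) = \tfrac{6}{5}h_1 + \tfrac{8}{5}h_2 .$$

Setting $D_{F,p}(h) = 0$ gives $h_2 = -\tfrac{3}{4}h_1$, so $h = \langle 4, -3 \rangle$ (or any nonzero multiple) is a tangent direction, and indeed

$$\left\langle \tfrac{6}{5}, \tfrac{8}{5} \right\rangle \cdot \langle 4, -3 \rangle = \tfrac{24}{5} - \tfrac{24}{5} = 0 .$$

In this example the gradient is $2p$, so it points along the radius, which matches the familiar fact that the tangent to a circle is perpendicular to the radius.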