How can a gradient be thought of as a function?
A function $F$ can be thought of as a machine that takes in inputs from one set, say $X$, and outputs elements in another set, say $Y$. A function outputs a single element for every input, and we write it $F : X \rightarrow Y$.
In your case, the gradient of $f$ is just a function $\nabla f : \mathbb{R}^2 \rightarrow \mathbb{R}^2$, where for each input $(x,y)$, it outputs the element $\left( \frac{\partial f}{\partial x}(x,y), \frac{\partial f}{\partial y}(x,y) \right)$.
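To make the "machine" picture concrete, here is a minimal sketch in Python. It assumes the scalar field $f(x,y)=x^2+y^2$ purely as an illustration, and the names `f` and `grad_f` are made up for this example. The point is only that `grad_f` takes a pair of numbers and returns a pair of numbers.

```python
def f(x, y):
    """Illustrative scalar field f : R^2 -> R (an assumed example)."""
    return x**2 + y**2


def grad_f(x, y):
    """Gradient of f, written out by hand: a function R^2 -> R^2.

    The two components are the partial derivatives df/dx and df/dy.
    """
    return (2 * x, 2 * y)


# One input pair in, one output pair out.
print(grad_f(1.0, 2.0))
```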
$(1)$ The gradient is a vector-valued function: it maps pairs of numbers $(x,y) \in \mathbb R^2$ to other pairs of numbers $(x',y')\in \mathbb R^2$. We call these pairs of numbers vectors, and they have a very geometric interpretation: they have a length and a direction. Specifically, the gradient at a point gives the direction of steepest ascent, and its length gives the rate of ascent in that direction.
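The steepest-ascent claim can be checked numerically. The sketch below assumes $f(x,y)=x^2+y^2$ and the point $(1,2)$ as an example, and the helper names are hypothetical. It estimates the directional derivative in many unit directions with a central difference, then compares the best direction against the direction of the hand-computed gradient $(2x,2y)$.

```python
import math


def f(x, y):
    """Illustrative scalar field (an assumed example)."""
    return x**2 + y**2


def directional_derivative(func, x, y, angle, h=1e-6):
    """Central-difference estimate of the slope of func at (x, y)
    along the unit vector (cos(angle), sin(angle))."""
    ux, uy = math.cos(angle), math.sin(angle)
    return (func(x + h * ux, y + h * uy) - func(x - h * ux, y - h * uy)) / (2 * h)


def steepest_angle(func, x, y, steps=3600):
    """Brute-force search over evenly spaced directions for the largest slope."""
    angles = (2 * math.pi * k / steps for k in range(steps))
    return max(angles, key=lambda a: directional_derivative(func, x, y, a))


px, py = 1.0, 2.0
best = steepest_angle(f, px, py)

# Direction and length of the analytic gradient (2x, 2y) at the same point.
gradient_angle = math.atan2(2 * py, 2 * px)
gradient_length = math.hypot(2 * px, 2 * py)

print(best, gradient_angle)
print(directional_derivative(f, px, py, best), gradient_length)
```

Up to the resolution of the search, the two angles agree, and the slope in the best direction matches the length of the gradient.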
Also see:
https://en.wikipedia.org/wiki/Vector-valued_function
I also found this Khan Academy video, which is very useful and happens to use the same example as I did: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/gradient-and-graphs
$(2)$ In contrast, when you want to plot a function that maps $\mathbb{R}^2 \rightarrow \mathbb{R}$, like $f(x,y)=x^2 +y^2$, you often define a third variable $z=f(x,y)$ and let its value equal the function value. I have plotted $x^2 +y^2$ in this way below:
Without introducing another axis, we can also give different ranges of function values different colours, so a large value could be very dark and a low value could be very light, or vice versa. We recognise the same function:
The difference between $(1)$ and $(2)$ is the notion of direction. The function you describe is usually called a "scalar field": it only has a notion of magnitude or "value", not of direction. Gradients give you a so-called "vector field", as physicists often call it, which we usually visualise using vector plots (VectorPlot). Below is such a plot, of the vector field $f(x,y)=(x^2+y,y^2+x )$.
The scalar field $x^2+y^2$ from $(2)$, for example, has gradient $(2x,2y)$, which is again a vector field.
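A vector plot simply draws one arrow per sampled point. As a rough sketch of the data behind such a plot, the snippet below samples the field $(x^2+y,\,y^2+x)$ on a small grid. The grid bounds, the grid size and the names `field` and `sample_grid` are arbitrary choices for illustration.

```python
def field(x, y):
    """The vector field (x^2 + y, y^2 + x): one output vector per input point."""
    return (x**2 + y, y**2 + x)


def sample_grid(vector_field, lo=-1.0, hi=1.0, n=3):
    """Evaluate vector_field on an n-by-n grid over [lo, hi] x [lo, hi].

    Returns a list of ((x, y), (u, v)) pairs: where each arrow sits
    and which vector it represents.
    """
    step = (hi - lo) / (n - 1)
    samples = []
    for i in range(n):
        for j in range(n):
            x, y = lo + i * step, lo + j * step
            samples.append(((x, y), vector_field(x, y)))
    return samples


arrows = sample_grid(field)
for (x, y), (u, v) in arrows:
    print(f"at ({x:+.1f}, {y:+.1f}) draw arrow ({u:+.1f}, {v:+.1f})")
```

A scalar field would give a single number at each grid point instead of a pair, which is why it is drawn as a height or a colour rather than as an arrow.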
The gradient operator is a higher-order function: it maps functions to functions. In the case of scalar fields on real vector spaces, $$ \nabla :(\mathbb{R}^n\to\mathbb{R}) \to (\mathbb{R}^n\to\mathbb{R}^n). $$ Thus, if $F:\mathbb{R}^2\to\mathbb{R}$, then $\nabla F : \mathbb{R}^2 \to \mathbb{R}^2$, and if you evaluate that at some point, you get a single vector, like $\nabla F(x,y) : \mathbb{R}^2$. Specifically, $$ \nabla F(x,y) = \begin{pmatrix}\frac{\partial F(x,y)}{\partial x} \\ \frac{\partial F(x,y)}{\partial y}\end{pmatrix}. $$ See also What does the symbol nabla indicate?
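The type signature above translates almost literally into code. The following is a minimal sketch, not a production differentiator: `nabla` is a hypothetical name, the partial derivatives are approximated with central differences (step size `h` is an assumption), and $F(x,y)=x^2+y^2$ is just an example input.

```python
from typing import Callable, Tuple

Vector = Tuple[float, ...]
ScalarField = Callable[[Vector], float]   # R^n -> R
VectorField = Callable[[Vector], Vector]  # R^n -> R^n


def nabla(field: ScalarField, h: float = 1e-6) -> VectorField:
    """Higher-order function: takes a scalar field, returns its (approximate) gradient."""

    def grad(point: Vector) -> Vector:
        components = []
        for i in range(len(point)):
            forward = list(point)
            backward = list(point)
            forward[i] += h
            backward[i] -= h
            # Central difference approximating the i-th partial derivative.
            slope = (field(tuple(forward)) - field(tuple(backward))) / (2 * h)
            components.append(slope)
        return tuple(components)

    return grad


def F(p: Vector) -> float:
    """Example scalar field F(x, y) = x^2 + y^2."""
    return p[0] ** 2 + p[1] ** 2


grad_F = nabla(F)            # a function R^2 -> R^2
value = grad_F((1.0, 2.0))   # a single vector, approximately (2, 4)
print(value)
```

Note the three levels: `nabla` is the operator, `grad_F` is the function it returns, and `value` is the single vector obtained by evaluating that function at a point.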