What is the difference between the Jacobian, Hessian and the Gradient?
Any introductory vector calculus text is a good resource for this. I'll try to be as consistent as I can with Stewart's Calculus, perhaps the most popular calculus textbook in North America.
The Gradient
Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be a scalar field. The gradient $\nabla f: \mathbb{R}^n \rightarrow \mathbb{R}^n$ is the vector of first-order partial derivatives, with components $(\nabla f)_j = \partial f/ \partial x_j$. Because $\nabla f$ maps every point in $\text{dom}(f)$ to a vector, it is a vector field.
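To make the definition concrete, here is a minimal sketch that approximates each component $\partial f/\partial x_j$ with a central difference. The helper name `numerical_gradient`, the step size `h`, and the example function are all my own choices for illustration, not anything from Stewart.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of a scalar field f at the point x (a list of floats)."""
    grad = []
    for j in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[j] += h
        backward[j] -= h
        # Central difference approximates the partial derivative df/dx_j.
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def f(x):
    # Illustrative scalar field: f(x1, x2) = x1^2 + 3*x1*x2
    return x[0] ** 2 + 3 * x[0] * x[1]


# The analytic gradient is (2*x1 + 3*x2, 3*x1), i.e. (8, 3) at the point (1, 2).
g = numerical_gradient(f, [1.0, 2.0])
print(g)
```

The output is one number per input coordinate, which is exactly why the gradient of a scalar field is a vector.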
The Jacobian
Let $\operatorname{F}: \mathbb{R}^n \rightarrow \mathbb{R}^m$ be a vector field. The Jacobian can be considered as the derivative of a vector field. Treating each component of $\operatorname{F}$ as a scalar field in its own right (like $f$ above), the Jacobian is the $m \times n$ matrix whose $i$-th row is the gradient of the $i$-th component of $\operatorname{F}$. If $\mathbf{J}$ is the Jacobian, then
$$\mathbf{J}_{i,j} = \dfrac{\partial \operatorname{F}_i}{\partial x_j}$$
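As a rough sketch of the "one gradient per row" picture, the following approximates $\mathbf{J}_{i,j}$ entry by entry with central differences. The name `numerical_jacobian` and the example map from $\mathbb{R}^2$ to $\mathbb{R}^3$ are assumptions made up for this illustration.

```python
def numerical_jacobian(F, x, h=1e-6):
    """Approximate the m-by-n Jacobian of F: R^n -> R^m at the point x."""
    m = len(F(x))
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        forward = list(x)
        backward = list(x)
        forward[j] += h
        backward[j] -= h
        F_forward = F(forward)
        F_backward = F(backward)
        for i in range(m):
            # Entry (i, j) is the partial derivative of component i w.r.t. x_j.
            J[i][j] = (F_forward[i] - F_backward[i]) / (2 * h)
    return J


def F(x):
    # Illustrative map R^2 -> R^3: F(x1, x2) = (x1*x2, x1 + x2, x2^2)
    return [x[0] * x[1], x[0] + x[1], x[1] ** 2]


# Analytic Jacobian is [[x2, x1], [1, 1], [0, 2*x2]], i.e. [[2, 1], [1, 1], [0, 4]] at (1, 2).
J = numerical_jacobian(F, [1.0, 2.0])
for row in J:
    print(row)
```

Note the shape: three rows (one per output component) and two columns (one per input variable).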
The Hessian
Simply put, the Hessian is the matrix of all second-order partial derivatives of a scalar field $f$: the pure second partials sit on the diagonal and the mixed partials off it.
$$\mathbf{H}_{i, j}=\frac{\partial^{2} f}{\partial x_{i} \partial x_{j}}$$
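If you want to see the entries $\mathbf{H}_{i,j}$ for a concrete function, a second-order central difference gives a quick approximation. This is only a sketch: `numerical_hessian`, the step size, and the example cubic are hypothetical choices of mine.

```python
def numerical_hessian(f, x, h=1e-4):
    """Approximate the n-by-n Hessian of a scalar field f at the point x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]

    def shifted(i, si, j, sj):
        # Return a copy of x moved by si*h along axis i and sj*h along axis j.
        p = list(x)
        p[i] += si * h
        p[j] += sj * h
        return p

    for i in range(n):
        for j in range(n):
            # Central difference for the second partial d^2 f / (dx_i dx_j).
            H[i][j] = (f(shifted(i, 1, j, 1)) - f(shifted(i, 1, j, -1))
                       - f(shifted(i, -1, j, 1)) + f(shifted(i, -1, j, -1))) / (4 * h * h)
    return H


def f(x):
    # Illustrative scalar field: f(x1, x2) = x1^3 + x1*x2^2
    return x[0] ** 3 + x[0] * x[1] ** 2


# Analytic Hessian is [[6*x1, 2*x2], [2*x2, 2*x1]], i.e. [[6, 4], [4, 2]] at (1, 2).
H = numerical_hessian(f, [1.0, 2.0])
for row in H:
    print(row)
```

The diagonal entries are the pure second partials and the off-diagonal entries are the mixed ones, which come out equal for this smooth function.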
In summary:
Gradient: Vector of first-order partial derivatives of a scalar field.
Jacobian: Matrix of gradients for components of a vector field.
Hessian: Matrix of second-order partial derivatives (pure and mixed) of a scalar field.
As an example, the squared error loss $f(\beta_0, \beta_1) = \sum_i (y_i - \beta_0 - \beta_1x_i)^2$ is a scalar field: it maps every pair of coefficients $(\beta_0, \beta_1)$ to a loss value.
The gradient of this scalar field is $$\nabla f = \left< -2 \sum_i( y_i - \beta_0 - \beta_1x_i), -2\sum_i x_i(y_i - \beta_0 - \beta_1x_i) \right>$$
Now, each component of $\nabla f$ is itself a scalar field. Take the gradients of those components, set them as the rows of a matrix, and you've got yourself the Jacobian of $\nabla f$:
$$ \left[\begin{array}{cc} \sum_{i=1}^{n} 2 & \sum_{i=1}^{n} 2 x_{i} \\ \sum_{i=1}^{n} 2 x_{i} & \sum_{i=1}^{n} 2 x_{i}^{2} \end{array}\right]$$
Note that the Hessian of $f$ is the same as the Jacobian of $\nabla f$ (assuming $f$ is twice continuously differentiable, so the mixed partials commute). It would behoove you to prove this to yourself.
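A numerical check is no substitute for the proof, but it can be reassuring. The sketch below differentiates the analytic gradient of the squared error loss numerically and compares the result with the closed-form matrix of entries $2n$, $2\sum_i x_i$, and $2\sum_i x_i^2$. The data points and the helper names are made up for illustration.

```python
# Hypothetical data, made up purely for illustration.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 5.0]


def grad_loss(beta):
    """Analytic gradient of f(b0, b1) = sum_i (y_i - b0 - b1*x_i)^2."""
    b0, b1 = beta
    residuals = [y - b0 - b1 * x for x, y in zip(xs, ys)]
    return [-2 * sum(residuals),
            -2 * sum(x * r for x, r in zip(xs, residuals))]


def numerical_jacobian(F, x, h=1e-6):
    """Approximate the Jacobian of F at x with central differences."""
    m, n = len(F(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        forward, backward = list(x), list(x)
        forward[j] += h
        backward[j] -= h
        F_forward, F_backward = F(forward), F(backward)
        for i in range(m):
            J[i][j] = (F_forward[i] - F_backward[i]) / (2 * h)
    return J


# Closed-form matrix with entries 2n, 2*sum(x_i), 2*sum(x_i^2).
n = len(xs)
closed_form = [[2.0 * n, 2 * sum(xs)],
               [2 * sum(xs), 2 * sum(x * x for x in xs)]]

# Jacobian of the gradient, evaluated at an arbitrary coefficient pair.
J = numerical_jacobian(grad_loss, [0.5, 1.5])
print(J)
print(closed_form)
```

Because the loss is quadratic in the coefficients, the matrix does not depend on where you evaluate it, so any choice of $(\beta_0, \beta_1)$ should give the same answer up to rounding.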
Resources: Calculus: Early Transcendentals by James Stewart (any edition), as well as Wikipedia, which is surprisingly good for these topics.
If you have a function that maps a 1D number to a 1D number, then you can take the derivative of it,
$f(x) = x^2, f'(x) = 2x$
If you have a function that maps an ND vector to a 1D number, then you take the gradient of it,
$f(x) = x^Tx, \nabla f(x) = 2x, x = (x_1, x_2, \ldots, x_N)$
If you have a function that maps an ND vector to an ND vector, then you take the Jacobian of it.
$f(x_1, x_2) = \begin{bmatrix} x_1x_2^2 \\ x_1^2x_2\end{bmatrix}, J_f(x_1, x_2) = \begin{bmatrix} x_2^2 & 2x_1x_2 \\ 2x_1x_2 & x_1^2\end{bmatrix}$
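Here is a small sanity check of this example, as a sketch only. Differentiating each component by hand gives the rows $(x_2^2,\ 2x_1x_2)$ and $(2x_1x_2,\ x_1^2)$; the code compares that with a central-difference approximation at one point. The helper `numerical_jacobian` and the evaluation point are my own choices.

```python
def f(x):
    # The example map: f(x1, x2) = (x1*x2^2, x1^2*x2)
    return [x[0] * x[1] ** 2, x[0] ** 2 * x[1]]


def analytic_jacobian(x):
    # Row i holds the partial derivatives of component i w.r.t. x1 and x2.
    return [[x[1] ** 2, 2 * x[0] * x[1]],
            [2 * x[0] * x[1], x[0] ** 2]]


def numerical_jacobian(F, x, h=1e-6):
    """Approximate the Jacobian of F at x with central differences."""
    m, n = len(F(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        forward, backward = list(x), list(x)
        forward[j] += h
        backward[j] -= h
        F_forward, F_backward = F(forward), F(backward)
        for i in range(m):
            J[i][j] = (F_forward[i] - F_backward[i]) / (2 * h)
    return J


point = [1.0, 2.0]
J_exact = analytic_jacobian(point)      # [[4, 4], [4, 1]] at (1, 2)
J_approx = numerical_jacobian(f, point)
print(J_exact)
print(J_approx)
```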
The Hessian is the Jacobian of the gradient of a function that maps from ND to 1D.
So the gradient, Jacobian and Hessian are different operations for different kinds of functions. You literally cannot take the gradient of an ND $\to$ ND function. That's the difference.