How do you differentiate a product of vectors (that gives a scalar) with respect to a vector?

There are two approaches to taking vector derivatives. First, you can work in coordinates. This will always work, but it is not always pleasant. In this case $$S(\beta) = y^Ty - 2\sum_i \beta_i(X^Ty)_i + \sum_{i,j} \beta_i (X^TX)_{ij} \beta_j,$$ so \begin{align*} \frac{\partial S}{\partial \beta_k} &= -2\sum_i \delta_{ik}(X^Ty)_i + \sum_{i,j} \delta_{ik}(X^TX)_{ij}\beta_j + \sum_{i,j} \beta_i(X^TX)_{ij}\delta_{jk}\\ &= -2(X^Ty)_k + \sum_j (X^TX)_{kj}\beta_j + \sum_i \beta_i(X^TX)_{ik}\\ \frac{\partial S}{\partial \beta} &= -2X^Ty + 2(X^TX)\beta, \end{align*} where the last line uses the symmetry of $X^TX$ to combine the two sums.
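If it helps, here is a quick numerical sanity check of that gradient. This is a minimal numpy sketch with made-up data; none of the names or dimensions come from the question.

```python
# Compare the analytic gradient -2 X^T y + 2 X^T X beta against a
# central finite-difference approximation of S(beta) = ||y - X beta||^2.
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 4                      # arbitrary sizes
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)

def S(b):
    r = y - X @ b
    return r @ r                  # residual sum of squares

analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

eps = 1e-6
numeric = np.array([
    (S(beta + eps * e) - S(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```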

The second approach is to work with the differential $dS(\beta)[\delta \beta]$, which computes the directional derivative $\frac{d}{dt}S(\beta + t\,\delta \beta)\Big\vert_{t = 0}$; since the directional derivative is linear in $\delta\beta$, you must have $$dS(\beta)[\delta \beta] = \left(\frac{\partial S}{\partial \beta}\right)^T\delta \beta,$$ so you can often recover an elegant, coordinate-free expression for the derivative from the differential. I wrote up some notes on this here: https://www.dropbox.com/s/7bj966ifgqiljmt/calculus.pdf?dl=0

In this case \begin{align*} dS(\beta)[\delta \beta] &= -2\delta \beta^TX^Ty + \delta \beta^TX^TX\beta + \beta^TX^TX\delta \beta\\ &= \left[-2y^TX +2\beta^TX^TX\right]\delta \beta, \end{align*} where the second line uses the fact that $\delta\beta^TX^TX\beta$ is a scalar and hence equals its own transpose $\beta^TX^TX\delta\beta$. Reading off the coefficient of $\delta\beta$ gives $\frac{\partial S}{\partial \beta} = -2X^Ty + 2X^TX\beta$, matching the coordinate computation.
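Purely as an illustration (simulated data, arbitrary names), you can check numerically that this differential agrees with the directional derivative $\frac{d}{dt}S(\beta + t\,\delta\beta)\big\vert_{t=0}$:

```python
# Verify dS(beta)[dbeta] = (-2 y^T X + 2 beta^T X^T X) dbeta numerically.
import numpy as np

rng = np.random.default_rng(1)
N, p = 30, 5
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)
dbeta = rng.normal(size=p)        # an arbitrary direction

def S(b):
    r = y - X @ b
    return r @ r

# closed-form differential from the calculation above
dS = (-2 * y @ X + 2 * beta @ X.T @ X) @ dbeta

# central-difference approximation of d/dt S(beta + t*dbeta) at t = 0
t = 1e-6
dS_numeric = (S(beta + t * dbeta) - S(beta - t * dbeta)) / (2 * t)

print(np.isclose(dS, dS_numeric, atol=1e-5))  # True
```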


Recall that the multiple regression linear model is the equation $$Y_i = \beta_0 + \sum_{j=1}^{p}X_{ij}\beta_{j} + \epsilon_i\text{, } i = 1, 2, \dots, N\tag{1}$$ where $\epsilon_i$ is a random variable for each $i$. This can be written in matrix form as \begin{equation*} \begin{array}{c@{}c@{}c@{}c@{}c@{}c} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{bmatrix} &{}={} &\begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ 1 & X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & X_{N1} & X_{N2} & \cdots & X_{Np} \end{bmatrix} &\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} &{}+{} &\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix} \\ \\[0.1ex] \mathbf{y} &{}={} &\mathbf{X} &\boldsymbol{\beta} &{}+{} &\boldsymbol{\epsilon}\text{.} \end{array} \end{equation*} Since $\boldsymbol{\epsilon}$ is a vector of random variables, we call it a random vector. Our aim is to find an estimate for $\boldsymbol{\beta}$. One way to do this is by minimizing the residual sum of squares, or $\text{RSS}$, defined by $$\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^{N}\left(y_i - \sum_{j=0}^{p}x_{ij}\beta_{j}\right)^2$$ where we have defined $x_{i0} = 1$ for all $i$. (These are merely the entries of the first column of the matrix $\mathbf{X}$.) Notice that here we use lowercase letters to indicate that we are working with observed values from the data.

To minimize this, let us find the critical points with respect to the components of $\boldsymbol{\beta}$. For $k = 0, 1, \dots, p$, notice that $$\dfrac{\partial\text{RSS}}{\partial\beta_k}(\boldsymbol{\beta}) = \sum_{i=1}^{N}2\left(y_i - \sum_{j=0}^{p}x_{ij}\beta_{j}\right)(-x_{ik}) = -2\sum_{i=1}^{N}\left(y_ix_{ik} - \sum_{j=0}^{p}x_{ij}x_{ik}\beta_{j}\right)\text{.}$$ Setting this equal to $0$ for $k = 0, 1, \dots, p$, we get what are known as the normal equations: $$\sum_{i=1}^{N}y_ix_{ik} = \sum_{i=1}^{N}\sum_{j=0}^{p}x_{ij}x_{ik}\beta_{j}\text{.}\tag{2}$$ These can be represented in matrix notation. For the left-hand side of $(2)$, for $k = 0, 1, \dots, p$, $$\begin{align*} \sum_{i=1}^{N}y_ix_{ik} &= \begin{bmatrix} x_{1k} & x_{2k} & \cdots & x_{Nk} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{N} \end{bmatrix} = \mathbf{c}_{k+1}^{T}\mathbf{y} \end{align*}$$ where $\mathbf{c}_\ell$ denotes the $\ell$th column of $\mathbf{X}$, $\ell = 1, \dots, p+1$. Stacking these equations for $k = 0, 1, \dots, p$ into a single column vector gives $$\begin{bmatrix} \mathbf{c}_{1}^{T}\mathbf{y} \\ \mathbf{c}_{2}^{T}\mathbf{y} \\ \vdots \\ \mathbf{c}_{p+1}^{T}\mathbf{y} \end{bmatrix} = \begin{bmatrix} \mathbf{c}_{1}^{T} \\ \mathbf{c}_{2}^{T} \\ \vdots \\ \mathbf{c}_{p+1}^{T} \end{bmatrix}\mathbf{y} = \begin{bmatrix} \mathbf{c}_{1} & \mathbf{c}_{2} & \cdots & \mathbf{c}_{p+1} \end{bmatrix}^{T}\mathbf{y} = \mathbf{X}^{T}\mathbf{y}\text{.} $$ "Factoring out" $\mathbf{y}$ here is justified by the row-wise definition of matrix-vector multiplication: stacking the scalars $\mathbf{c}_{k+1}^{T}\mathbf{y}$ is the same as multiplying the matrix whose rows are the $\mathbf{c}_{k+1}^{T}$ by $\mathbf{y}$.
For the right-hand side of $(2)$ ($k = 0, 1, \dots, p$), $$\begin{align*} \sum_{i=1}^{N}\sum_{j=0}^{p}x_{ij}x_{ik}\beta_{j} &= \sum_{j=0}^{p}\left(\sum_{i=1}^{N}x_{ij}x_{ik}\right)\beta_{j} \\ &= \sum_{j=0}^{p}\left(\sum_{i=1}^{N}x_{ik}x_{ij}\right)\beta_{j} \\ &=\sum_{j=0}^{p}\begin{bmatrix} x_{1k} & x_{2k} & \cdots & x_{Nk} \end{bmatrix} \begin{bmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{Nj} \end{bmatrix}\beta_j \\ &= \sum_{j=0}^{p}\mathbf{c}^{T}_{k+1}\mathbf{c}_{j+1}\beta_j \\ &= \mathbf{c}^{T}_{k+1}\sum_{j=0}^{p}\mathbf{c}_{j+1}\beta_j \\ &= \mathbf{c}^{T}_{k+1}\begin{bmatrix} \mathbf{c}_{1} & \mathbf{c}_2 & \cdots & \mathbf{c}_{p+1} \end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \\ &= \mathbf{c}^{T}_{k+1}\mathbf{X}\boldsymbol{\beta}\text{.} \end{align*} $$ Stacking these expressions for $k = 0, 1, \dots, p$ into a single column vector, we have $$\begin{bmatrix} \mathbf{c}^{T}_{1}\mathbf{X}\boldsymbol{\beta}\\ \mathbf{c}^{T}_{2}\mathbf{X}\boldsymbol{\beta}\\ \vdots \\ \mathbf{c}^{T}_{p+1}\mathbf{X}\boldsymbol{\beta}\\ \end{bmatrix} = \begin{bmatrix} \mathbf{c}^{T}_{1}\\ \mathbf{c}^{T}_{2}\\ \vdots \\ \mathbf{c}^{T}_{p+1}\\ \end{bmatrix}\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta}\text{.}$$ Thus, in matrix form, the normal equations are $$\mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^{T}\mathbf{y}\text{.}\tag{3}$$
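As a sanity check, here is a minimal sketch with simulated data (the sizes, coefficients, and noise level are arbitrary, not from the answer): solving $(3)$ directly agrees with a standard least-squares solver.

```python
# Build the design matrix with a leading column of ones (the x_{i0} = 1 column),
# solve the normal equations X^T X beta = X^T y, and compare with np.linalg.lstsq.
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 3
X_raw = rng.normal(size=(N, p))
X = np.column_stack([np.ones(N), X_raw])       # intercept column
beta_true = np.array([1.0, -2.0, 0.5, 3.0])    # made-up coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations (3)
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(beta_hat, beta_lstsq))       # True
```

(In practice one would use a QR- or SVD-based solver like `lstsq` rather than forming $\mathbf{X}^T\mathbf{X}$ explicitly, since the normal equations square the condition number, but the two agree here.)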


The vector derivative $\frac{dF}{d\beta}$ means the vector whose components are $\frac{\partial F}{\partial \beta_k}$. Then $$\frac{\partial}{\partial\beta_k}2\beta^T X^T y=\frac{\partial}{\partial\beta_k}\sum_{i,j}2\beta_i X_{ji} y_j=\sum_{i,j}2\delta_{ik} X_{ji} y_j=\sum_{j}2 X_{jk} y_j=(2X^T y)_k,$$ so indeed $\frac{d}{d\beta}(2\beta^T X^T y)=2 X^T y$, as desired.
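For what it's worth, here is a small finite-difference check of this identity (illustrative only; the dimensions and seed are arbitrary):

```python
# Check that the gradient of F(beta) = 2 beta^T X^T y is the constant vector 2 X^T y.
import numpy as np

rng = np.random.default_rng(3)
N, p = 10, 4
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)

F = lambda b: 2 * b @ X.T @ y     # scalar-valued function of beta

eps = 1e-6
numeric = np.array([(F(beta + eps * e) - F(beta - eps * e)) / (2 * eps)
                    for e in np.eye(p)])

print(np.allclose(numeric, 2 * X.T @ y))  # True
```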