Computational complexity of least squares regression operation

For a least squares regression with $N$ training examples and $C$ features, it takes:

  • $O(C^2N)$ to multiply $X^T$ by $X$
  • $O(CN)$ to multiply $X^T$ by $Y$
  • $O(C^3)$ to compute the LU (or Cholesky) factorization of $X^TX$ and use that to compute the product $(X^TX)^{-1} (X^TY)$

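To make the three steps concrete, here is a minimal NumPy/SciPy sketch of the normal-equation solve (the function name and the random test data are purely illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def normal_equations_ols(X, y):
    """Solve the OLS normal equations (X^T X) beta = X^T y via Cholesky."""
    G = X.T @ X                    # O(C^2 N): form the C x C Gram matrix
    b = X.T @ y                    # O(C N):   form the C-vector X^T y
    c, low = cho_factor(G)         # O(C^3):   Cholesky factorization of X^T X
    return cho_solve((c, low), b)  # O(C^2):   two triangular solves

# Illustrative data: N = 1000 examples, C = 10 features
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
y = rng.standard_normal(1000)
beta = normal_equations_ols(X, y)
```
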
Asymptotically, $O(C^2N)$ dominates $O(CN)$ so we can forget the $O(CN)$ part.

Since you're using the normal equations, I will assume that $N>C$ - otherwise the matrix $X^T X$ would be singular (and hence non-invertible). Given $N>C$, the $O(C^2N)$ term asymptotically dominates the $O(C^3)$ term.

Therefore the total time complexity is $O(C^2N)$. You should note that this is only the asymptotic complexity - in particular, for smallish $C$ and $N$ you may find that computing the LU or Cholesky decomposition of $X^T X$ takes significantly longer than multiplying $X^T$ by $X$.

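If you want to see how this plays out on your own machine, a rough timing sketch along these lines can compare the two dominant steps (the sizes $N$ and $C$ below are arbitrary, and the outcome depends heavily on your BLAS/LAPACK build):

```python
import time
import numpy as np
from scipy.linalg import cho_factor

# Rough timing of the two dominant steps; the result depends heavily on
# the BLAS/LAPACK build and on the (arbitrary) choice of N and C below.
rng = np.random.default_rng(0)
N, C = 500, 200
X = rng.standard_normal((N, C))

t0 = time.perf_counter()
G = X.T @ X            # O(C^2 N) Gram-matrix product
t1 = time.perf_counter()
cho_factor(G)          # O(C^3) Cholesky factorization
t2 = time.perf_counter()

print(f"X^T X:    {t1 - t0:.4f} s")
print(f"Cholesky: {t2 - t1:.4f} s")
```
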
Edit: Note that if you have two datasets with the same features, labeled 1 and 2, and you have already computed $X_1^T X_1$, $X_1^TY_1$ and $X_2^TX_2$, $X_2^TY_2$, then training on the combined dataset costs only $O(C^3)$: you just add the relevant matrices and vectors, and then solve a $C\times C$ linear system.

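A minimal sketch of this trick (the helper name is mine, not a standard API):

```python
import numpy as np

def gram_stats(X, y):
    """Sufficient statistics for OLS: X^T X and X^T y."""
    return X.T @ X, X.T @ y

# Two datasets sharing the same C = 5 features
rng = np.random.default_rng(1)
X1, y1 = rng.standard_normal((300, 5)), rng.standard_normal(300)
X2, y2 = rng.standard_normal((200, 5)), rng.standard_normal(200)

G1, b1 = gram_stats(X1, y1)
G2, b2 = gram_stats(X2, y2)

# Fit on the combined data: add the statistics, then one O(C^3) solve
beta_combined = np.linalg.solve(G1 + G2, b1 + b2)
```
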
In particular, this allows you to do very fast bootstrap, jackknife and cross-validation when you are training an OLS regression (or variants like ridge regression, lasso, constrained OLS, etc.).
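
For instance, in $k$-fold cross-validation each fold's statistics are computed once, and fitting on "all data except fold $k$" is then a single $O(C^3)$ solve per fold. A rough sketch of the idea (function name and data are illustrative):

```python
import numpy as np

def kfold_ols_mse(X, y, k=5):
    """k-fold cross-validated MSE for OLS using per-fold Gram statistics.

    Each fold's X_k^T X_k and X_k^T y_k are computed once; fitting on
    all data except fold k is then a single O(C^3) solve per fold.
    """
    folds = np.array_split(np.arange(len(y)), k)
    stats = [(X[idx].T @ X[idx], X[idx].T @ y[idx]) for idx in folds]
    G_total = sum(G for G, _ in stats)
    b_total = sum(b for _, b in stats)

    errors = []
    for idx, (G_k, b_k) in zip(folds, stats):
        beta = np.linalg.solve(G_total - G_k, b_total - b_k)  # fit without fold k
        resid = y[idx] - X[idx] @ beta                        # predict on held-out fold
        errors.append(np.mean(resid ** 2))
    return np.mean(errors)

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 8))
y = X @ rng.standard_normal(8) + 0.1 * rng.standard_normal(1000)
print(kfold_ols_mse(X, y, k=10))
```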