R-squared on test data

There are a couple of problems here. First, this is not a good way to use lm(...). lm(...) is meant to be used with a data frame, with the formula referencing columns of that data frame. So, assuming your data is in two vectors x and y,

set.seed(1)    # for reproducible example
x <- 1:11000
y <- 3+0.1*x + rnorm(11000,sd=1000)

df <- data.frame(x,y)
# training set
train <- sample(1:nrow(df),0.75*nrow(df))   # random sample of 75% of data

fit <- lm(y~x,data=df[train,])

Now fit holds the model based on the training set. Using lm(...) this way allows you, for example, to generate predictions without all the matrix multiplication.
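To make the point about predictions concrete, here is a small sketch that compares predict(...) with the hand-rolled matrix product on the held-out rows. The names new.df, pred.lm, pred.manual and X are only illustrative and are not part of the original question.

```r
set.seed(1)                                  # same toy data as in this answer
x <- 1:11000
y <- 3 + 0.1*x + rnorm(11000, sd=1000)
df <- data.frame(x, y)
train <- sample(1:nrow(df), 0.75*nrow(df))
fit <- lm(y ~ x, data=df[train,])

new.df <- df[-train,]                        # held-out rows

# predictions straight from the fitted model
pred.lm <- predict(fit, newdata=new.df)

# the same thing by hand: design matrix times coefficient vector
X <- cbind(1, new.df$x)                      # column of 1s for the intercept
pred.manual <- as.vector(X %*% coef(fit))

all.equal(unname(pred.lm), pred.manual)      # the two agree up to rounding
```

With predict(...) the formula, including the intercept and any transformations, is applied to newdata for you, so there is no design matrix to keep in sync by hand.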

The second problem is the definition of R-squared. The conventional definition is:

R.sq = 1 - SS.residual/SS.total

For the training set, and the training set ONLY,

SS.total = SS.regression + SS.residual

so

SS.regression = SS.total - SS.residual,

and therefore

R.sq = SS.regression/SS.total

so R.sq is the fraction of variability in the dataset that is explained by the model, and will always be between 0 and 1.
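As a sketch of why the identity holds for the training set only, expand the total sum of squares around the fitted values (here \hat{y}_i are the model's predictions and \bar{y} is the mean of the observed y in whichever data set is being scored):

```latex
\sum_i (y_i - \bar{y})^2
  = \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\text{SS.regression}}
  + \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{SS.residual}}
  + 2 \sum_i (\hat{y}_i - \bar{y})(y_i - \hat{y}_i)
```

For a least-squares fit with an intercept, evaluated on the data it was fitted to, the residuals sum to zero and are orthogonal to the fitted values, so the cross term vanishes. On any other data set nothing forces the cross term to zero, and the two definitions of R.sq can differ.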

You can see this below.

SS.total      <- with(df[train,],sum((y-mean(y))^2))
SS.residual   <- sum(residuals(fit)^2)
SS.regression <- sum((fitted(fit)-mean(df[train,]$y))^2)
SS.total - (SS.regression+SS.residual)
# [1] 1.907349e-06
SS.regression/SS.total     # fraction of variation explained by the model
# [1] 0.08965502
1-SS.residual/SS.total     # same thing, for model frame ONLY!!! 
# [1] 0.08965502          
summary(fit)$r.squared     # both are = R.squared
# [1] 0.08965502

But this does not work with the test set (e.g., when you make predictions from a model).

test <- -train
test.pred <- predict(fit,newdata=df[test,])
test.y    <- df[test,]$y

SS.total      <- sum((test.y - mean(test.y))^2)
SS.residual   <- sum((test.y - test.pred)^2)
SS.regression <- sum((test.pred - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
# [1] 8958890

# NOT the fraction of variability explained by the model
test.rsq <- 1 - SS.residual/SS.total  
test.rsq
# [1] 0.0924713

# fraction of variability explained by the model
SS.regression/SS.total 
# [1] 0.08956405

In this contrived example there is not much difference, but it is very possible to have an R-sq. value less than 0 (when defined this way).

If, for example, the model is a very poor predictor for the test set, then the residual sum of squares can actually be larger than the total variation in the test set. This is equivalent to saying that the test set is modeled better by its own mean than by the model derived from the training set.

I noticed that you use the first three quarters of your data as the training set, rather than taking a random sample (as in this example). If the dependence of y on x is non-linear, and the x's are in order, then you could get a negative R-sq with the test set.
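A contrived sketch of that situation, assuming a purely quadratic, noise-free relationship (the names df2, train2, fit2 and rsq.test2 are made up for this illustration):

```r
# contrived, noise-free data: y depends on x quadratically
df2 <- data.frame(x = 1:100)
df2$y <- df2$x^2

train2 <- 1:75                # first three quarters, in order (no shuffling)
fit2   <- lm(y ~ x, data = df2[train2, ])

test2.y    <- df2$y[-train2]
test2.pred <- predict(fit2, newdata = df2[-train2, ])

# R-sq on the test set, defined as 1 - SS.residual/SS.total
rsq.test2 <- 1 - sum((test2.y - test2.pred)^2) / sum((test2.y - mean(test2.y))^2)
rsq.test2                     # negative: the test mean beats the extrapolated line
```

The straight line fitted to the lower range of x underestimates every point in the upper range, so the residuals on the test set exceed the spread of the test values around their own mean.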

Regarding OP's comment below, one way to assess the model with a test set is by comparing in-model to out-of-model mean squared error (MSE).

mse.train <- summary(fit)$sigma^2
mse.test  <- sum((test.pred - test.y)^2)/(nrow(df)-length(train)-2)

If we assume that the training and test sets are both normally distributed with the same variance, and that their means follow the same model formula, then the ratio of the two MSEs should have an F-distribution with (n.train-2) and (n.test-2) degrees of freedom. If the MSEs are significantly different based on an F-test, then the model does not fit the test data well.
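A minimal sketch of that comparison, under the assumptions stated above and using the degrees of freedom as given in this answer. The names n.train, n.test, F.ratio, p.upper and p.value are illustrative; the ratio is taken as test over training, so the numerator degrees of freedom are n.test-2.

```r
set.seed(1)                                  # same toy data as in this answer
x <- 1:11000
y <- 3 + 0.1*x + rnorm(11000, sd=1000)
df <- data.frame(x, y)
train <- sample(1:nrow(df), 0.75*nrow(df))
fit <- lm(y ~ x, data=df[train,])

test.pred <- predict(fit, newdata=df[-train,])
test.y    <- df[-train,]$y
n.train   <- length(train)
n.test    <- nrow(df) - n.train

mse.train <- summary(fit)$sigma^2
mse.test  <- sum((test.pred - test.y)^2)/(n.test - 2)

# ratio of out-of-model to in-model MSE
F.ratio <- mse.test/mse.train

# two-sided p-value from the F distribution
p.upper <- pf(F.ratio, df1=n.test-2, df2=n.train-2, lower.tail=FALSE)
p.value <- 2*min(p.upper, 1 - p.upper)
c(F.ratio=F.ratio, p.value=p.value)
```

A small p.value would suggest that the error variance on the test set differs from the one on the training set. Because no coefficients are estimated on the test set, dividing by n.test rather than n.test-2 is also defensible; with samples this large the difference is negligible.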

Have you plotted your test.y and pred.y vs x? This alone will tell you a lot.
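One way to draw that plot with base graphics, sketched on the toy data from this answer (test.df and ord are illustrative names; substitute your own test set and predictions):

```r
set.seed(1)                                  # same toy data as in this answer
x <- 1:11000
y <- 3 + 0.1*x + rnorm(11000, sd=1000)
df <- data.frame(x, y)
train <- sample(1:nrow(df), 0.75*nrow(df))
fit <- lm(y ~ x, data=df[train,])

test.df   <- df[-train,]
test.pred <- predict(fit, newdata=test.df)

# observed test values as points
plot(test.df$x, test.df$y, pch=20, cex=0.3, col="grey50",
     xlab="x", ylab="y", main="Test set: observed vs predicted")

# predictions as a line, sorted by x so the line is drawn left to right
ord <- order(test.df$x)
lines(test.df$x[ord], test.pred[ord], col="red", lwd=2)
legend("topleft", legend=c("observed", "predicted"),
       col=c("grey50", "red"), pch=c(20, NA), lty=c(NA, 1))
```

Systematic curvature or a drift of the points away from the line in one region of x is the kind of pattern that a single R-sq number hides.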


When you use an R2 measure out-of-sample, you lose certain aspects of the interpretation of the R2:

  • The equivalence SSR total = SSR explained + SSR error
  • The fact that R2 is equal to the square of the correlation between y and predicted y
  • The fact that R2 is in [0,1]

If you want to use R, I would recommend the function modelr::rsquare. Note this uses the SSR total from the test sample, not the training sample (as some people seem to advocate).

Here I take an example where the training data has only 3 points, so there is a high risk of ending up with a bad model and hence poor out-of-sample performance. Indeed, you can see that the R2 is negative!

library(modelr)

train <- mtcars[c(1,3,4),]
test  <- mtcars[-c(1,3,4),]

mod <- lm(carb ~ drat, data = train)

Compute on train data:

## train
y_train <- train$carb
SSR_y_train <- sum((y_train-mean(y_train))^2)

cor(fitted(mod), y_train)^2
#> [1] 0.2985092
rsquare(mod, train)
#> [1] 0.2985092
1-sum(residuals(mod)^2)/SSR_y_train
#> [1] 0.2985092

Compute on test data:

## test
pred_test <- predict(mod, newdata = test)
y_test <- test$carb
SSR_y_test <- sum((y_test-mean(y_test))^2)

cor(pred_test, y_test)^2
#> [1] 0.01737236
rsquare(mod, test)
#> [1] -0.6769549

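# nrow(test)-1 = 28, so this factor times var(e) is the sum of squared, mean-centered test errors
1 - (nrow(test)-1) * var(pred_test-y_test)/SSR_y_train   # SSR total from the training sample
#> [1] -19.31621
1 - (nrow(test)-1) * var(pred_test-y_test)/SSR_y_test    # SSR total from the test sample, matches rsquare()
#> [1] -0.6769549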
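Note that the value reported by rsquare(mod, test) above equals 1 - var(errors)/var(y), which suggests that rsquare() works with variances. Since var() centers the errors on their own mean, this is not quite the same as 1 - sum(errors^2)/SSR total when the mean test error is non-zero. The following base-R sketch (no modelr needed; err_test, rsq_var and rsq_sse are illustrative names) puts the two side by side:

```r
train <- mtcars[c(1,3,4),]
test  <- mtcars[-c(1,3,4),]
mod   <- lm(carb ~ drat, data = train)

y_test    <- test$carb
pred_test <- predict(mod, newdata = test)
err_test  <- y_test - pred_test

# variance-based form: errors are centered on their own mean
rsq_var <- 1 - var(err_test)/var(y_test)

# sum-of-squares form: errors are not centered
rsq_sse <- 1 - sum(err_test^2)/sum((y_test - mean(y_test))^2)

c(var_based = rsq_var, sse_based = rsq_sse)
mean(err_test)     # a non-zero mean error is what separates the two
```

The sum-of-squares form can never be larger than the variance-based one, because the uncentered sum of squared errors is at least as large as the centered one.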

If you want a ready-made function, the miscTools package has an rSquared function.

library(miscTools)
# ytest: observed responses in the test set; yhat: model predictions for the same rows
r2 <- rSquared(ytest, resid = ytest-yhat)

Calculating R-squared on the testing data is a little tricky, as you have to remember what your baseline is. Your baseline prediction is the mean of your training data.

Therefore, extending the example provided by @jlhoward above:

SS.test.total      <- sum((test.y - mean(df[train,]$y))^2)
SS.test.residual   <- sum((test.y - test.pred)^2)
SS.test.regression <- sum((test.pred - mean(df[train,]$y))^2)
SS.test.total - (SS.test.regression+SS.test.residual)
# [1] 11617720 not 8958890

test.rsq <- 1 - SS.test.residual/SS.test.total  
test.rsq
# [1] 0.09284556 not 0.0924713

# fraction of variability explained by the model
SS.test.regression/SS.test.total 
# [1] 0.08907705 not 0.08956405

Update: the miscTools::rSquared() function assumes that R-squared is calculated on the same dataset the model was trained on, as it calculates

yy <- y - mean(y)

behind the scenes, in line 184 here: https://github.com/cran/miscTools/blob/master/R/utils.R
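If you want the baseline to be explicit, a small helper along these lines makes the choice visible. This is only a sketch: r2.baseline is a hypothetical function, not part of miscTools or modelr, and it defaults to the mean of the y values it is given.

```r
# hypothetical helper: R-squared relative to an explicit baseline prediction
r2.baseline <- function(y, pred, baseline = mean(y)) {
  1 - sum((y - pred)^2) / sum((y - baseline)^2)
}

set.seed(1)                                  # same toy data as in @jlhoward's answer
x <- 1:11000
y <- 3 + 0.1*x + rnorm(11000, sd=1000)
df <- data.frame(x, y)
train <- sample(1:nrow(df), 0.75*nrow(df))
fit <- lm(y ~ x, data=df[train,])

test.pred <- predict(fit, newdata=df[-train,])
test.y    <- df[-train,]$y

rsq.test.mean  <- r2.baseline(test.y, test.pred)                                # baseline: test mean
rsq.train.mean <- r2.baseline(test.y, test.pred, baseline=mean(df[train,]$y))   # baseline: training mean
c(test.mean=rsq.test.mean, train.mean=rsq.train.mean)
```

Because the sum of squared deviations is smallest around the test set's own mean, the training-mean version can never come out lower than the test-mean version; the two are close whenever the training and test means are close.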