Monday, 10 July 2017
Assessing the Accuracy of our models (R Squared, Adjusted R Squared, RMSE, MAE, AIC)
There are several ways to check the accuracy of our models. Some are printed directly in R within the summary output, while others are just as easy to calculate with specific functions. Please take a look at my previous post for more info on the code.
R squared is probably the most commonly used statistic. It tells us the proportion of variance in the target variable that is explained by the model, and it can be computed as the ratio of the regression sum of squares to the total sum of squares. This is one of the standard measures of accuracy that R prints out, through the function summary, for linear models (for an ANOVA fitted with aov, summary.lm gives the same output).
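As a minimal sketch, the ratio of sums of squares can be checked against the value reported by summary. The model below uses the built-in mtcars data set purely for illustration; it is an assumption of this example and not the data from the previous post.

```r
# Illustrative linear model on the built-in mtcars data (assumed example data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

y <- mtcars$mpg
ss_total      <- sum((y - mean(y))^2)            # total sum of squares
ss_regression <- sum((fitted(fit) - mean(y))^2)  # regression sum of squares

r2_manual  <- ss_regression / ss_total  # ratio described in the text
r2_summary <- summary(fit)$r.squared    # value printed by summary()

all.equal(r2_manual, r2_summary)
```

The two values agree here because the model includes an intercept; without one, R computes R squared differently.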
Adjusted R squared is a form of R squared that is adjusted for the number of predictors in the model. It can be computed as follows:

$$ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} $$

where $R^2$ is the R squared of the model, n is the sample size and p is the number of terms (or predictors) in the model. This index is extremely useful to determine whether our model is overfitting the data. Overfitting happens particularly when the sample size is small: if we keep adding predictors, the R squared can increase simply because the model starts adapting to the noise (or random error) rather than describing the data properly. The standard R squared never decreases when a predictor is added, whereas the adjusted version is penalized for every extra term. It is generally a good sign if the adjusted R squared is similar to the standard R squared.
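The adjustment can be reproduced by hand with the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below again assumes an illustrative model on the built-in mtcars data, with p counted as the number of coefficients excluding the intercept.

```r
# Illustrative model (mtcars is an assumed example data set)
fit <- lm(mpg ~ wt + hp, data = mtcars)

r2 <- summary(fit)$r.squared
n  <- nobs(fit)               # sample size
p  <- length(coef(fit)) - 1   # number of predictors, intercept excluded

adj_r2_manual  <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2_summary <- summary(fit)$adj.r.squared

all.equal(adj_r2_manual, adj_r2_summary)
```

A large gap between r2 and adj_r2_manual would be a hint that some predictors add little beyond fitting noise.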
The previous indexes measure the amount of variance in the target variable that can be explained by our model. This is a good indication, but in some cases we are more interested in quantifying the error in the same measuring unit as the variable. In such cases we need to compute indexes that average the residuals of the model. The problem is that residuals are both positive and negative, and their distribution should be fairly symmetrical around zero (this is actually one of the assumptions in most linear models, so if this is not the case we should be worried). In fact, for a least-squares model with an intercept the residuals sum to zero by construction, so their plain average is always zero and tells us nothing. We therefore need other indexes to quantify the average size of the residuals, for example by averaging the squared residuals:
$$ RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (\hat{Y}_t - Y_t)^2} $$

The root mean squared error (RMSE) is the square root of the mean of the squared residuals, with $\hat{Y}_t$ being the estimated value at point t, $Y_t$ the observed value at t, and n the sample size. The RMSE has the same measuring unit as the variable y.
The mean squared error (MSE) is simply the quantity under the square root in the previous equation, but it is not used as often because it is expressed in squared units. The issue with both the RMSE and the MSE is that, since they square the residuals, they tend to be more affected by extreme values. This means that even if our model explains the large majority of the variation in the data very well, a few exceptions will inflate the value of the RMSE if the discrepancy between observed and predicted values is large. Since these large residuals may be caused by potential outliers, this issue may lead us to overestimate the error.
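Both indexes take one line each in R once the residuals are extracted. The sketch below assumes an illustrative model on the built-in mtcars data, and first confirms that the plain mean of the residuals is effectively zero.

```r
# Illustrative model (mtcars is an assumed example data set)
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)   # observed minus fitted values

mean(res)               # numerically zero, so not informative

mse  <- mean(res^2)     # mean squared error, in squared units of mpg
rmse <- sqrt(mse)       # root mean squared error, in units of mpg

c(MSE = mse, RMSE = rmse)
```

Note that the residual standard error printed by summary (summary(fit)$sigma) is not the same number: it divides the sum of squared residuals by the residual degrees of freedom rather than by n, so it is slightly larger than the RMSE computed here.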
To reduce the problem with potential outliers, we can use the mean absolute error (MAE), where we average the absolute value of the residuals:

$$ MAE = \frac{1}{n} \sum_{t=1}^{n} |\hat{Y}_t - Y_t| $$

This index is more robust against large residuals. Since the RMSE is still widely used, even though its problems are well known, it is always better to calculate and present both in a research paper.
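The different sensitivity of the two indexes is easy to see on a toy example. The helper functions rmse and mae and the numbers below are made up for illustration; they are not part of base R or of the previous post.

```r
# Hypothetical helper functions for the two indexes
rmse <- function(observed, predicted) sqrt(mean((predicted - observed)^2))
mae  <- function(observed, predicted) mean(abs(predicted - observed))

# Toy data: every prediction is off by exactly 1
observed  <- c(10, 12, 14, 16, 18, 20)
predicted <- c(11, 11, 15, 15, 19, 19)

rmse(observed, predicted)   # 1
mae(observed, predicted)    # 1

# Replace the last prediction with a single large miss (residual of 10)
predicted_out <- predicted
predicted_out[6] <- 30

rmse(observed, predicted_out)   # about 4.18
mae(observed, predicted_out)    # 2.5
```

One bad prediction out of six roughly quadruples the RMSE, while the MAE grows by a factor of 2.5.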
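The Akaike Information Criterion (AIC) is another popular index, which we have used in previous posts to compare different models. It is popular because it penalizes the goodness of fit for the number of parameters in the model, which helps to account for overfitting. In its general form it can be computed as follows:

$$ AIC = 2p - 2\ln(\hat{L}) $$

where again p is the number of terms (estimated parameters) in the model and $\hat{L}$ is the maximized likelihood of the model. For a linear model fitted by least squares, the likelihood term depends on the data only through the mean squared error, so the AIC can be read as an error measure corrected for model size. Lower values indicate a better model, and only differences in AIC between models fitted to the same data are meaningful.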
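In R the AIC is available through the AIC function. The sketch below compares two illustrative models on the built-in mtcars data (an assumption of this example) and then reproduces the value by hand from the Gaussian log-likelihood. Note that for lm objects R counts the residual variance as an estimated parameter, so the parameter count k is the number of coefficients plus one.

```r
# Two illustrative nested models (mtcars is an assumed example data set)
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp, data = mtcars)

AIC(fit_small, fit_large)   # the lower value is preferred

# Manual computation for the larger model
n   <- nobs(fit_large)
rss <- sum(residuals(fit_large)^2)     # residual sum of squares
k   <- length(coef(fit_large)) + 1     # coefficients plus residual variance

# Maximized Gaussian log-likelihood; rss / n is the MSE of the model
log_lik    <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
aic_manual <- 2 * k - 2 * log_lik

all.equal(aic_manual, AIC(fit_large))
```

The manual version makes the link with the residual-based indexes explicit: the only data-dependent part of the likelihood term is log(rss / n), the logarithm of the MSE.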