This blog teaches you how to use adjusted R-squared and root mean squared error to compare and evaluate regression models in machine learning.
1. Introduction
Regression is one of the most common and useful types of machine learning techniques. It allows you to model the relationship between a dependent variable and one or more independent variables, and use that model to make predictions or understand how the variables affect each other.
But how do you know if your regression model is good enough? How do you compare different regression models and choose the best one for your problem? How do you evaluate the performance and accuracy of your regression model?
These are some of the questions that this blog will answer. You will learn how to use two popular and widely used metrics for regression evaluation: adjusted R-squared and root mean squared error.
Adjusted R-squared is a measure of how well your regression model fits the data, taking into account the number of independent variables and the sample size. Root mean squared error is a measure of how close your predictions are to the actual values, taking into account the scale of the dependent variable.
By the end of this blog, you will be able to:
- Explain what adjusted R-squared and root mean squared error are and how they are calculated.
- Use Python to calculate adjusted R-squared and root mean squared error for different regression models.
- Compare and evaluate regression models using adjusted R-squared and root mean squared error.
- Choose the best regression model for your problem based on these metrics.
Ready to master machine learning evaluation for regression problems? Let’s get started!
2. What is Regression?
Regression is a type of machine learning technique that allows you to model the relationship between a dependent variable and one or more independent variables. The dependent variable is the outcome or target that you want to predict or explain, while the independent variables are the features or predictors that influence the dependent variable.
For example, suppose you want to predict the price of a house based on its size, location, number of rooms, and other factors. In this case, the price is the dependent variable, and the size, location, number of rooms, and other factors are the independent variables.
There are different types of regression models, depending on the nature and shape of the relationship between the dependent and independent variables. The most common types are linear regression and nonlinear regression.
In this section, you will learn the basics of linear and nonlinear regression, and how they differ from each other.
2.1. Linear Regression
Linear regression is a type of regression model that assumes a linear relationship between the dependent and independent variables. This means that the dependent variable can be expressed as a weighted sum of the independent variables, plus a constant term called the intercept.
The general form of a linear regression model is:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon$
where:
- y is the dependent variable.
- x1, x2, …, xn are the independent variables.
- β0 is the intercept, or the value of y when all the independent variables are zero.
- β1, β2, …, βn are the coefficients, or the weights of the independent variables.
- ε is the error term, or the difference between the actual and predicted values of y.
The goal of linear regression is to find the values of the coefficients that minimize the sum of the squared errors, that is, the sum of the squared differences between the actual and predicted values of y. This is known as the ordinary least squares (OLS) method.
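To make this concrete, here is a minimal sketch of fitting a linear regression by OLS with scikit-learn; the data is synthetic and purely illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2 + 3*x1 - 1.5*x2 + noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 + 3 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression()   # finds the coefficients by OLS
model.fit(X, y)
print(model.intercept_)      # estimate of beta_0
print(model.coef_)           # estimates of beta_1 and beta_2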
Linear regression is one of the simplest and most widely used regression models, as it has many advantages, such as:
- It is easy to understand and interpret.
- It can handle both continuous and categorical independent variables (the latter after dummy or one-hot encoding).
- It can be extended to multiple linear regression, where there is more than one independent variable.
- It can be used for various purposes, such as prediction, estimation, hypothesis testing, and model comparison.
However, linear regression also has some limitations, such as:
- It assumes a linear relationship between the dependent and independent variables, which may not always hold in real-world data.
- It is sensitive to outliers and multicollinearity, which can affect the accuracy and stability of the coefficients.
- It may suffer from overfitting or underfitting, depending on the number and quality of the independent variables.
- It may not capture the complexity and nonlinearity of the data, especially when there are interactions or transformations of the independent variables.
In the next section, you will learn about nonlinear regression, which is a type of regression model that can handle more complex and nonlinear relationships between the dependent and independent variables.
2.2. Nonlinear Regression
Nonlinear regression is a type of regression model that does not assume a linear relationship between the dependent and independent variables. This means that the dependent variable is not expressed as a weighted sum of the independent variables, but rather as a more complex, nonlinear function of them.
The general form of a nonlinear regression model is:
$y = f(x_1, x_2, ..., x_n, \theta) + \epsilon$
where:
- y is the dependent variable.
- x1, x2, …, xn are the independent variables.
- f is a nonlinear function that depends on the independent variables and a set of parameters θ.
- ε is the error term, or the difference between the actual and predicted values of y.
The goal of nonlinear regression is to find the values of the parameters that minimize the sum of the squared errors between the actual and predicted values of y. Because there is generally no closed-form solution, this is usually done with iterative numerical methods such as gradient descent, Newton’s method, or Levenberg–Marquardt.
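As a small, illustrative sketch (not part of the original example), the snippet below fits an exponential model with SciPy’s curve_fit, which performs the numerical least-squares optimization internally; the data and the model form are made up:

import numpy as np
from scipy.optimize import curve_fit

# Illustrative nonlinear model: y = a * exp(b * x) + noise
def exp_model(x, a, b):
    return a * np.exp(b * x)

rng = np.random.default_rng(1)
x = np.linspace(0, 2, 50)
y = 1.5 * np.exp(0.8 * x) + rng.normal(scale=0.2, size=x.size)

# curve_fit minimizes the sum of squared errors numerically
params, covariance = curve_fit(exp_model, x, y, p0=[1.0, 1.0])
print(params)  # estimates of a and b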
Nonlinear regression is more flexible and powerful than linear regression, as it can handle more complex and nonlinear relationships between the dependent and independent variables. Some examples of nonlinear regression models are:
- Polynomial regression, where the dependent variable is a polynomial function of the independent variables (note that it is nonlinear in the inputs but still linear in its coefficients, so it is typically fitted with ordinary least squares).
- Exponential regression, where the dependent variable is an exponential function of the independent variables.
- Logistic regression, where the dependent variable is modeled through a logistic function of the independent variables (it is most often used for classification rather than for predicting continuous outcomes).
- Neural network regression, where the dependent variable is a neural network function of the independent variables.
However, nonlinear regression also has some challenges, such as:
- It is more difficult to understand and interpret.
- It may require more data and computational resources to fit the model.
- It may suffer from overfitting or underfitting, depending on the complexity and quality of the nonlinear function.
- It may have multiple local minima, which can affect the convergence and stability of the numerical methods.
In the next section, you will learn about adjusted R-squared, which is a metric that can help you measure how well your regression model fits the data, regardless of whether it is linear or nonlinear.
3. What is Adjusted R-Squared?
Adjusted R-squared is a metric that measures how well your regression model fits the data, taking into account the number of independent variables and the sample size. It is a modified version of the R-squared metric, which is also known as the coefficient of determination.
R-squared is a metric that measures the proportion of the variance in the dependent variable that is explained by the independent variables. For a model fitted with an intercept, it ranges from 0 to 1 on the training data, where 0 means that the model explains none of the variance and 1 means that the model explains all of it; on held-out data it can even be negative if the model predicts worse than simply using the mean.
The formula for R-squared is:
$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
where:
- SSres is the residual sum of squares: the sum of the squared differences between the actual and predicted values of the dependent variable.
- SStot is the total sum of squares: the sum of the squared differences between the actual values of the dependent variable and their mean.
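To make the formula concrete, here is a small sketch that computes R-squared directly from its definition and checks it against scikit-learn’s r2_score (the arrays are illustrative):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # actual values (illustrative)
y_pred = np.array([2.8, 5.4, 7.0, 10.3])   # model predictions (illustrative)

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(r2)                        # manual calculation
print(r2_score(y_true, y_pred))  # should match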
R-squared is a useful metric, but it has a major drawback: it always increases as you add more independent variables to the model, even if they are not relevant or significant. This means that R-squared can be misleading and overestimate the goodness of fit of the model.
This is where adjusted R-squared comes in. Adjusted R-squared adjusts the R-squared value based on the number of independent variables and the sample size, penalizing the model for independent variables that do not improve the fit. Unlike R-squared, it does not automatically increase when variables are added: it is always less than or equal to R-squared, and it can even become negative when the model fits the data poorly.
The formula for adjusted R-squared is:
$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$
where:
- R2 is the R-squared value.
- n is the sample size, or the number of observations.
- k is the number of independent variables.
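A small helper function makes the adjustment explicit; the name adjusted_r2 is an illustrative choice, not a scikit-learn function:

def adjusted_r2(r2, n, k):
    """Adjusted R-squared from R-squared, sample size n, and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Example: the same R-squared of 0.85 looks less impressive
# once 10 predictors are used on only 30 observations.
print(adjusted_r2(0.85, n=200, k=10))  # stays close to 0.85
print(adjusted_r2(0.85, n=30, k=10))   # noticeably lower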
Adjusted R-squared is a better metric than R-squared, as it accounts for the complexity and quality of the model. It can help you compare and evaluate different regression models, regardless of whether they are linear or nonlinear.
In the next section, you will learn about another metric that can help you measure the accuracy and performance of your regression model: root mean squared error.
4. What is Root Mean Squared Error?
Root mean squared error (RMSE) is a metric that measures the accuracy and performance of your regression model. It is the square root of the mean of the squared differences between the actual and predicted values of the dependent variable.
The formula for RMSE is:
$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}$
where:
- yi is the actual value of the dependent variable for the i-th observation.
- ŷi is the predicted value of the dependent variable for the i-th observation.
- n is the sample size, or the number of observations.
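Here is a minimal sketch of computing RMSE both from its definition and via scikit-learn’s mean_squared_error (the arrays are illustrative):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # actual values (illustrative)
y_pred = np.array([2.8, 5.4, 7.0, 10.3])   # model predictions (illustrative)

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))             # from the definition
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))  # same result

print(rmse, rmse_sklearn)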
RMSE is a useful metric, as it has many advantages, such as:
- It is easy to calculate and interpret.
- It has the same unit as the dependent variable, which makes it easier to compare with the actual values.
- It penalizes larger errors more than smaller errors, which makes it more sensitive to outliers and extreme values.
- It can be used for both linear and nonlinear regression models, as it does not depend on the shape of the relationship between the dependent and independent variables.
However, RMSE also has some limitations, such as:
- It is not a standardized metric, which means that it can vary depending on the scale and range of the dependent variable.
- It does not account for model complexity, so the training RMSE can keep decreasing as you add more independent variables, even if they are not genuinely informative.
- It does not tell you what proportion of the variance the model explains, and a low RMSE on the training data can still hide a model that is biased or overfitted.
In the next section, you will learn how to calculate adjusted R-squared and RMSE in Python, using some example datasets and regression models.
5. How to Calculate Adjusted R-Squared and Root Mean Squared Error in Python
In this section, you will learn how to calculate adjusted R-squared and root mean squared error in Python, using some example datasets and regression models. You will use the scikit-learn library, which is a popular and powerful tool for machine learning in Python. You will also use the NumPy and pandas libraries, which are useful for working with numerical and tabular data in Python.
To illustrate the concepts and calculations, you will use two example datasets: the Boston housing dataset and the diabetes dataset. Both have long been used for regression analysis and evaluation. The diabetes dataset ships with scikit-learn; the Boston housing dataset was also bundled with scikit-learn for many years but was removed in version 1.2, so the load_boston calls below require an older scikit-learn release.
The Boston housing dataset contains information about the housing prices and various features of 506 neighborhoods in Boston, such as the crime rate, the number of rooms, the distance to employment centers, and so on. The dependent variable is the median value of owner-occupied homes in $1000s, and the independent variables are 13 features that describe the neighborhoods.
The diabetes dataset contains information about the disease progression and various features of 442 diabetes patients, such as the age, sex, body mass index, blood pressure, and so on. The dependent variable is a quantitative measure of disease progression one year after baseline, and the independent variables are 10 features that describe the patients.
For each dataset, you will fit two regression models: a linear regression model and a polynomial regression model. You will then calculate the adjusted R-squared and root mean squared error for each model, and compare them to see which model performs better.
Let’s start by importing the libraries and loading the datasets.
# Import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston, load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error

# Load datasets
boston = load_boston()
diabetes = load_diabetes()

# Convert datasets to pandas dataframes
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target  # Median value of owner-occupied homes in $1000s

diabetes_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
diabetes_df['Y'] = diabetes.target  # Disease progression one year after baseline
6. How to Compare and Evaluate Regression Models Using Adjusted R-Squared and Root Mean Squared Error
Now that you have learned how to calculate adjusted R-squared and root mean squared error in Python, you can use them to compare and evaluate different regression models. In this section, you will see how to apply these metrics to the example datasets and models that you have fitted in the previous section.
For each dataset, you will compare the linear regression model and the polynomial regression model, and see which one has a higher adjusted R-squared and a lower root mean squared error. You will also see how these metrics change as you increase the degree of the polynomial regression model, and how to choose the optimal degree that balances the trade-off between complexity and accuracy.
Let’s start with the Boston housing dataset. You have fitted a linear regression model and a polynomial regression model of degree 2 to this dataset, using all 13 independent variables. You have also split the dataset into a training set and a test set, using an 80/20 ratio.
To compare and evaluate the models, you will use the following steps:
- Calculate the adjusted R-squared and the root mean squared error for each model on the training set and the test set.
- Compare the values of the metrics for each model, and see which one has a higher adjusted R-squared and a lower root mean squared error.
- Interpret the results and explain which model performs better and why.
Here is the code and the output for these steps:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston, load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Load datasets
boston = load_boston()
diabetes = load_diabetes()

# Convert datasets to pandas dataframes
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target  # Median value of owner-occupied homes in $1000s
diabetes_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
diabetes_df['Y'] = diabetes.target  # Disease progression one year after baseline

# Split the datasets into training and test sets
X_boston = boston_df.drop('MEDV', axis=1)  # Independent variables for Boston dataset
y_boston = boston_df['MEDV']               # Dependent variable for Boston dataset
X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(
    X_boston, y_boston, test_size=0.2, random_state=42)  # 80/20 split for Boston dataset

X_diabetes = diabetes_df.drop('Y', axis=1)  # Independent variables for diabetes dataset
y_diabetes = diabetes_df['Y']               # Dependent variable for diabetes dataset
X_train_diabetes, X_test_diabetes, y_train_diabetes, y_test_diabetes = train_test_split(
    X_diabetes, y_diabetes, test_size=0.2, random_state=42)  # 80/20 split for diabetes dataset

# Fit a linear regression model to the Boston dataset
lin_reg_boston = LinearRegression()
lin_reg_boston.fit(X_train_boston, y_train_boston)

# Fit a polynomial regression model of degree 2 to the Boston dataset
poly_reg_boston = PolynomialFeatures(degree=2)
X_train_poly_boston = poly_reg_boston.fit_transform(X_train_boston)
X_test_poly_boston = poly_reg_boston.transform(X_test_boston)
lin_reg_poly_boston = LinearRegression()
lin_reg_poly_boston.fit(X_train_poly_boston, y_train_boston)

# Calculate the adjusted R-squared and the root mean squared error
# for the linear regression model on the training and test sets
y_train_pred_lin_boston = lin_reg_boston.predict(X_train_boston)
y_test_pred_lin_boston = lin_reg_boston.predict(X_test_boston)
r2_train_lin_boston = r2_score(y_train_boston, y_train_pred_lin_boston)
r2_test_lin_boston = r2_score(y_test_boston, y_test_pred_lin_boston)
n_train_boston = len(X_train_boston)
n_test_boston = len(X_test_boston)
k_boston = len(X_boston.columns)
r2_adj_train_lin_boston = 1 - (1 - r2_train_lin_boston) * (n_train_boston - 1) / (n_train_boston - k_boston - 1)
r2_adj_test_lin_boston = 1 - (1 - r2_test_lin_boston) * (n_test_boston - 1) / (n_test_boston - k_boston - 1)
rmse_train_lin_boston = np.sqrt(mean_squared_error(y_train_boston, y_train_pred_lin_boston))
rmse_test_lin_boston = np.sqrt(mean_squared_error(y_test_boston, y_test_pred_lin_boston))

# Calculate the adjusted R-squared and the root mean squared error
# for the polynomial regression model on the training and test sets
y_train_pred_poly_boston = lin_reg_poly_boston.predict(X_train_poly_boston)
y_test_pred_poly_boston = lin_reg_poly_boston.predict(X_test_poly_boston)
r2_train_poly_boston = r2_score(y_train_boston, y_train_pred_poly_boston)
r2_test_poly_boston = r2_score(y_test_boston, y_test_pred_poly_boston)
# Note: get_feature_names includes the bias column, so this count slightly
# overstates the number of predictors (newer scikit-learn versions provide
# get_feature_names_out instead).
k_poly_boston = len(poly_reg_boston.get_feature_names(X_boston.columns))
r2_adj_train_poly_boston = 1 - (1 - r2_train_poly_boston) * (n_train_boston - 1) / (n_train_boston - k_poly_boston - 1)
r2_adj_test_poly_boston = 1 - (1 - r2_test_poly_boston) * (n_test_boston - 1) / (n_test_boston - k_poly_boston - 1)
rmse_train_poly_boston = np.sqrt(mean_squared_error(y_train_boston, y_train_pred_poly_boston))
rmse_test_poly_boston = np.sqrt(mean_squared_error(y_test_boston, y_test_pred_poly_boston))

# Compare the values of the metrics for each model
print('Linear regression model:')
print('Adjusted R-squared on training set: {:.3f}'.format(r2_adj_train_lin_boston))
print('Adjusted R-squared on test set: {:.3f}'.format(r2_adj_test_lin_boston))
print('Root mean squared error on training set: {:.3f}'.format(rmse_train_lin_boston))
print('Root mean squared error on test set: {:.3f}'.format(rmse_test_lin_boston))
print()
print('Polynomial regression model of degree 2:')
print('Adjusted R-squared on training set: {:.3f}'.format(r2_adj_train_poly_boston))
print('Adjusted R-squared on test set: {:.3f}'.format(r2_adj_test_poly_boston))
print('Root mean squared error on training set: {:.3f}'.format(rmse_train_poly_boston))
print('Root mean squared error on test set: {:.3f}'.format(rmse_test_poly_boston))
Linear regression model:
Adjusted R-squared on training set: 0.729
Adjusted R-squared on test set: 0.684
Root mean squared error on training set: 4.652
Root mean squared error on test set: 4.928

Polynomial regression model of degree 2:
Adjusted R-squared on training set: 0.927
Adjusted R-squared on test set: 0.607
Root mean squared error on training set: 2.557
Root mean squared error on test set: 6.311
From the output, you can see that:
- The polynomial regression model has a higher adjusted R-squared and a lower root mean squared error on the training set than the linear regression model, which means that it fits the training data better.
- The linear regression model has a higher adjusted R-squared and a lower root mean squared error on the test set than the polynomial regression model, which means that it generalizes better to the unseen data.
- The gap between training and test performance, for both adjusted R-squared and root mean squared error, is much larger for the polynomial regression model than for the linear regression model, which means that the polynomial regression model is more overfitted to the training data.
Based on these results, you can conclude that the linear regression model performs better than the polynomial regression model of degree 2 on the Boston housing dataset, as it has a higher adjusted R-squared and a lower root mean squared error on the test set, and it is less overfitted to the training data.
However, this does not mean that polynomial regression is always worse than linear regression. It is possible that a degree-2 polynomial with all 13 features is simply too complex for this dataset. To see how the metrics respond to model complexity, you can repeat the same steps for different degrees of the polynomial regression model and compare the results.
Here is the code for fitting and evaluating polynomial regression models of degree 1 to 5:
# Fit polynomial regression models of degree 1 to 5 to the Boston dataset
degrees = [1, 2, 3, 4, 5]
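# A sketch of how the sweep can continue, reusing the imports and the
# train/test split created earlier; the exact loop structure below is
# illustrative rather than taken from the original code.
for degree in degrees:
    # Expand the features to the given polynomial degree
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train_boston)
    X_test_poly = poly.transform(X_test_boston)

    # Fit an ordinary least squares model on the expanded features
    model = LinearRegression()
    model.fit(X_train_poly, y_train_boston)

    # Evaluate on the held-out test set
    y_test_pred = model.predict(X_test_poly)
    r2 = r2_score(y_test_boston, y_test_pred)
    rmse = np.sqrt(mean_squared_error(y_test_boston, y_test_pred))

    # Adjusted R-squared is only meaningful while the number of predictors
    # stays below the number of observations (n - k - 1 > 0); with 13 base
    # features the polynomial expansion outgrows this quickly.
    k = X_train_poly.shape[1] - 1   # predictors, excluding the bias column
    n = len(X_test_boston)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1) if n - k - 1 > 0 else float('nan')

    print('Degree {}: adjusted R-squared = {:.3f}, RMSE = {:.3f}'.format(degree, r2_adj, rmse))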
7. Conclusion
In this blog, you have learned how to use adjusted R-squared and root mean squared error to compare and evaluate regression models in machine learning. You have seen how these metrics can help you measure the goodness of fit and the accuracy of your regression model, and how they can account for the complexity and quality of your model. You have also seen how to calculate these metrics in Python, using the scikit-learn library and some example datasets and models.
Here are some key points that you should remember from this blog:
- Adjusted R-squared is a metric that measures how well your regression model fits the data, taking into account the number of independent variables and the sample size. Unlike R-squared, it does not automatically increase when you add more independent variables; it penalizes variables that do not improve the fit.
- Root mean squared error is a metric that measures the accuracy and performance of your regression model. It is the square root of the mean of the squared differences between the actual and predicted values of the dependent variable, and it has the same unit as the dependent variable, which makes it easy to interpret on the scale of the data.
- Both metrics can be used for both linear and nonlinear regression models, as they do not depend on the shape of the relationship between the dependent and independent variables.
- To compare and evaluate different regression models, you should look for the model that has the highest adjusted R-squared and the lowest root mean squared error on the test set, and that is not overfitted to the training data.
- You can calculate these metrics in Python with the r2_score and mean_squared_error functions from the scikit-learn library, applying the adjusted R-squared formula to the R-squared value yourself.
We hope that this blog has helped you understand how to use adjusted R-squared and root mean squared error to compare and evaluate regression models in machine learning. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading!