This blog teaches you how to use Matlab for machine learning, covering supervised learning models such as regression, classification and ensemble methods.
1. Introduction
Machine learning is a branch of artificial intelligence that enables computers to learn from data and make predictions or decisions. Machine learning can be divided into two main types: supervised learning and unsupervised learning. In this blog, we will focus on supervised learning, which is the most common and widely used type of machine learning.
Supervised learning is a process of learning from labeled data, where each input has a corresponding output or target. The goal of supervised learning is to find a function that maps the inputs to the outputs, and then use this function to make predictions on new or unseen data. Supervised learning can be further categorized into two subtypes: regression and classification.
Regression is a type of supervised learning where the output is a continuous or numerical value, such as the price of a house, the height of a person, or the temperature of a room. Classification is a type of supervised learning where the output is a discrete or categorical value, such as the type of a flower, the sentiment of a text, or the gender of a voice.
In this blog, you will learn how to train and evaluate supervised learning models using Matlab, a popular programming language and environment for numerical computing and data analysis. Matlab provides many built-in functions and tools for machine learning, such as the Statistics and Machine Learning Toolbox and the Deep Learning Toolbox (formerly the Neural Network Toolbox). You will also learn how to use ensemble methods, which are techniques that combine multiple models to improve the performance and robustness of the predictions.
By the end of this blog, you will be able to:
- Import and preprocess data for machine learning in Matlab
- Visualize and explore data using Matlab plots and charts
- Train and evaluate regression models, such as linear regression, nonlinear regression, and regularization methods
- Train and evaluate classification models, such as logistic regression, support vector machines, and decision trees
- Train and evaluate ensemble methods, such as bagging, boosting, stacking, and voting
- Compare and select the best model for your data using Matlab tools and metrics
Are you ready to start your machine learning journey with Matlab? Let’s begin!
2. Matlab Basics for Machine Learning
Before we dive into the different types of supervised learning models, we need to learn some basic concepts and skills for machine learning in Matlab. In this section, you will learn how to:
- Import and preprocess data for machine learning
- Visualize and explore data using Matlab plots and charts
These are essential steps for any machine learning project, as they help you understand your data better and prepare it for modeling. Let’s get started!
2.1. Data Import and Preprocessing
The first step for machine learning is to import your data into Matlab. Matlab supports various data formats, such as CSV, Excel, JSON, XML, and more. You can use the Import Data button in the Home tab of the Matlab toolbar, or the readtable function to import tabular data into a table object. A table is a data structure that stores data as rows and columns, similar to a spreadsheet. You can access the table elements using dot notation, indexing, or logical expressions.
For example, suppose you have a CSV file named iris.csv that contains the famous Iris dataset, which has 150 rows and 5 columns. The first four columns are the features (sepal length, sepal width, petal length, and petal width) of three different species of iris flowers (setosa, versicolor, and virginica). The fifth column is the label (species) of each flower. You can import this data into Matlab using the following code:
% Import the data as a table
iris = readtable('iris.csv');

% Display the first 10 rows of the table
head(iris, 10)
This will display the following output:
    sepal_length    sepal_width    petal_length    petal_width      species
    ____________    ___________    ____________    ___________    ____________

        5.1             3.5            1.4             0.2        {'setosa' }
        4.9             3              1.4             0.2        {'setosa' }
        4.7             3.2            1.3             0.2        {'setosa' }
        4.6             3.1            1.5             0.2        {'setosa' }
        5               3.6            1.4             0.2        {'setosa' }
        5.4             3.9            1.7             0.4        {'setosa' }
        4.6             3.4            1.4             0.3        {'setosa' }
        5               3.4            1.5             0.2        {'setosa' }
        4.4             2.9            1.4             0.2        {'setosa' }
        4.9             3.1            1.5             0.1        {'setosa' }
You can access the features and labels of the table using dot notation, such as iris.sepal_length or iris.species. You can also use indexing to access a specific row or column of the table, such as iris(1,:) or iris(:,1), and logical expressions to filter the table based on some condition. Note that readtable imports the species column as a cell array of character vectors, so compare it with strcmp, for example iris(strcmp(iris.species, 'setosa'), :), or convert it to a categorical variable first and then use ==.
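Here is a short sketch of these access patterns, assuming the iris table imported above:

% Access a single column as a vector
lengths = iris.sepal_length;

% Access the first row and the first column of the table
first_row = iris(1, :);
first_col = iris(:, 1);

% Filter rows where the species is setosa (species is a cell array of char at this point)
setosa_rows = iris(strcmp(iris.species, 'setosa'), :);

% Equivalent filter after converting the species column to categorical
species_cat = categorical(iris.species);
setosa_rows = iris(species_cat == 'setosa', :);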
After importing your data, you may need to preprocess it to make it suitable for machine learning. Some common preprocessing steps are:
- Handling missing values: You can use the ismissing function to check if your table has any missing values, and the rmmissing function to remove them.
- Encoding categorical variables: You can use the categorical function to convert your string or numeric variables into categorical variables, which are more efficient and convenient for machine learning. For example, you can convert the species column of the iris table into a categorical variable using iris.species = categorical(iris.species).
- Scaling or normalizing numerical variables: You can use the rescale function to scale your numerical variables to a specified range, such as [0,1] or [-1,1]. This can help improve the performance and stability of some machine learning algorithms.
- Splitting the data into training and test sets: You can use the cvpartition function to create a cross-validation partition object that splits your data into training and test sets, based on a specified ratio or size. You can then use the training and test methods of the partition object to get the indices of the training and test rows of your table.
For example, suppose you want to preprocess the iris table as follows:
- Check and remove any missing values
- Encode the species column as a categorical variable
- Scale the features columns to the range [0,1]
- Split the data into 80% training and 20% test sets
You can use the following code to do so:
% Check and remove any missing values
if any(ismissing(iris), 'all')
    iris = rmmissing(iris);
end

% Encode the species column as a categorical variable
iris.species = categorical(iris.species);

% Scale the feature columns to the range [0,1]
iris(:,1:4) = array2table(rescale(table2array(iris(:,1:4))));

% Split the data into 80% training and 20% test sets
cvp = cvpartition(iris.species, 'Holdout', 0.2);
train_idx = training(cvp);
test_idx = test(cvp);
iris_train = iris(train_idx, :);
iris_test = iris(test_idx, :);
This will create two new tables, iris_train and iris_test, that contain the training and test data, respectively. You can use these tables to train and evaluate your supervised learning models in the next sections.
2.2. Visualization and Exploration
After importing and preprocessing your data, the next step is to visualize and explore it using Matlab plots and charts. Visualization and exploration are important steps for machine learning, as they help you gain insights into your data, such as its distribution, patterns, outliers, and relationships. Visualization and exploration can also help you identify potential problems or errors in your data, such as incorrect values, missing values, or inconsistent units. Moreover, visualization and exploration can help you communicate your findings and results to others, such as your peers, clients, or stakeholders.
Matlab provides many functions and tools for creating various types of plots and charts, such as histograms, scatter plots, box plots, bar charts, pie charts, and more. You can use the Plots tab of the Matlab toolstrip, or functions such as plot, to create simple plots of your data. You can also use the plotmatrix function to create a matrix of scatter plots of your data, which can help you visualize the pairwise relationships between your variables. You can customize your plots and charts using various options and properties, such as titles, labels, legends, colors, markers, and more.
For example, suppose you want to visualize and explore the iris table that you imported and preprocessed in the previous section. You can use the following code to create some plots and charts of your data:
% Create a histogram of the sepal length variable
figure
histogram(iris.sepal_length)
title('Histogram of Sepal Length')
xlabel('Sepal Length (cm)')
ylabel('Frequency')

% Create a scatter plot of sepal length vs. sepal width, grouped by species
figure
gscatter(iris.sepal_length, iris.sepal_width, iris.species)
title('Scatter Plot of Sepal Length and Sepal Width')
xlabel('Sepal Length (cm)')
ylabel('Sepal Width (cm)')

% Create a box plot of the petal length variable, grouped by species
figure
boxplot(iris.petal_length, iris.species)
title('Box Plot of Petal Length')
xlabel('Species')
ylabel('Petal Length (cm)')

% Create a pie chart of the species variable
figure
pie(countcats(iris.species))
title('Pie Chart of Species')
legend(categories(iris.species), 'Location', 'best')

% Create a matrix of scatter plots of the feature variables
figure
plotmatrix(table2array(iris(:,1:4)))
sgtitle('Matrix of Scatter Plots of Features')
This will create the following plots and charts:
- Histogram of Sepal Length
- Scatter Plot of Sepal Length and Sepal Width
- Box Plot of Petal Length
- Pie Chart of Species
- Matrix of Scatter Plots of Features
By looking at these plots and charts, you can observe some interesting facts about your data, such as:
- The sepal length variable has a roughly normal distribution, with a mean of about 5.8 cm and a standard deviation of about 0.8 cm.
- The sepal length and sepal width variables have a weak negative correlation, meaning that as one variable increases, the other variable tends to decrease.
- The species variable has a balanced distribution, with 50 samples for each category.
- The petal length variable has a significant difference between the species categories, with setosa having the smallest values, versicolor having the medium values, and virginica having the largest values.
- The petal length and petal width variables have a strong positive correlation, meaning that as one variable increases, the other variable also increases.
These observations can help you understand your data better and guide you in choosing the appropriate supervised learning models for your data in the next sections.
3. Regression Models
Now that you have imported, preprocessed, visualized, and explored your data, you are ready to train and evaluate some supervised learning models using Matlab. In this section, you will learn how to use Matlab to create and compare different types of regression models, which are supervised learning models that predict a continuous or numerical output.
Regression models are useful for many applications, such as predicting the price of a house, the height of a person, the temperature of a room, or the demand of a product. Regression models can also help you understand the relationship between the input variables and the output variable, and how each input variable affects the output variable.
There are many types of regression models, such as linear regression, nonlinear regression, polynomial regression, logistic regression, and more. Each type of regression model has its own advantages and disadvantages, and may perform better or worse depending on the data and the problem. In this section, you will learn how to use Matlab to train and evaluate three common types of regression models:
- Linear regression, which is a type of regression model that assumes a linear relationship between the input variables and the output variable. Linear regression is simple, fast, and easy to interpret, but it may not capture the complexity or nonlinearity of the data.
- Nonlinear regression, which is a type of regression model that does not assume a linear relationship between the input variables and the output variable. Nonlinear regression can capture the complexity or nonlinearity of the data, but it may be more difficult to fit, interpret, and generalize.
- Regularization and validation, which are techniques that can help improve the performance and robustness of any regression model, by preventing overfitting or underfitting, and by selecting the optimal model parameters and complexity.
By the end of this section, you will be able to:
- Create and fit linear regression models using the fitlm function
- Create and fit nonlinear regression models using the fitnlm function
- Apply regularization and validation techniques using the lasso, ridge, and crossval functions
- Evaluate and compare the performance of different regression models using metrics and plots such as R-squared, RMSE, and MAE, together with the predict and plot methods
Are you ready to learn how to use Matlab for regression models? Let’s begin!
3.1. Linear Regression
Linear regression is one of the simplest and most widely used types of regression models. It assumes that there is a linear relationship between the input features and the output variable, and tries to find the best line that fits the data. The equation of a linear regression model is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$
where y is the output variable, x is the vector of input features, \beta is the vector of coefficients, and \epsilon is the error term. The coefficients represent the slope of the line for each feature, and the error term represents the deviation of the actual output from the predicted output. The goal of linear regression is to find the optimal values of the coefficients that minimize the sum of squared errors (SSE) between the actual and predicted outputs.
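In symbols, writing $\hat{y}_i$ for the model's prediction on the i-th of m training observations, the quantity being minimized is:

$$SSE = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$$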
In Matlab, you can use the fitlm function to create a linear regression model from a table of data. The function returns a LinearModel object that contains various properties and methods for the model, such as the coefficients, the residuals, the R-squared value, the predictions, and the plots. You can also use the predict method of the LinearModel object to make predictions on new or unseen data.
For example, suppose you want to create a linear regression model using the iris_train table that you created in the previous section. You want to use the first four columns (sepal length, sepal width, petal length, and petal width) as the input features, and the fifth column (species) as the output variable. Because fitlm expects a numeric response, the categorical species column is first encoded as 0 (setosa), 1 (versicolor), or 2 (virginica) in a copy of the table. You can use the following code to do so:
% fitlm needs a numeric response, so encode the species as 0, 1, 2 in a copy of the training data
iris_train_reg = iris_train;
iris_train_reg.species = grp2idx(iris_train_reg.species) - 1;

% Create a linear regression model
lm = fitlm(iris_train_reg, 'species ~ sepal_length + sepal_width + petal_length + petal_width');

% Display the model summary
lm

% Make predictions on the test data
y_pred = predict(lm, iris_test(:,1:4));

% Compare the predictions with the actual labels
y_true = iris_test.species;
table(y_true, y_pred)
This will display the following output:
lm =

Linear regression model:
    species ~ 1 + sepal_length + sepal_width + petal_length + petal_width

Estimated Coefficients:
                     Estimate       SE        tStat       pValue
                    _________    ________    _______    __________

    (Intercept)      -0.24056      0.1838    -1.3086       0.19376
    sepal_length      0.22898    0.057374     3.9889    0.00010601
    sepal_width       0.59588     0.04967     11.995    1.0749e-21
    petal_length      -0.5569    0.071502    -7.7878    1.1439e-11
    petal_width       -0.2248    0.098894    -2.2731      0.025083

Number of observations: 120, Error degrees of freedom: 115
Root Mean Squared Error: 0.197
R-squared: 0.93,  Adjusted R-Squared 0.928
F-statistic vs. constant model: 391, p-value = 1.91e-67

      y_true        y_pred
    __________    _________

    setosa         0.011414
    setosa        0.0068765
    setosa         0.011414
    setosa         0.011414
    setosa        0.0068765
    setosa         0.011414
    setosa        0.0068765
    setosa        0.0068765
    setosa         0.011414
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
As you can see, the linear regression model has a high R-squared value of 0.93, which means that it explains 93% of the variance in the output variable. The coefficients are also significant, as they have low p-values. The predictions on the test data are close to the actual labels, as they are near 0 (setosa), 1 (versicolor), or 2 (virginica) under the numeric coding used above. However, linear regression is not well suited to classification problems, as it does not produce discrete outputs. Later, in Section 4.1, you will learn how to use a more appropriate type of model for classification: logistic regression.
3.2. Nonlinear Regression
Linear regression is a powerful and simple technique, but it has some limitations. One of them is that it assumes that the relationship between the input features and the output variable is linear, which may not always be the case. Sometimes, the data may have a nonlinear or complex pattern that cannot be captured by a straight line. In such cases, you need to use a different type of model: nonlinear regression.
Nonlinear regression is a type of regression model that can fit nonlinear or curved functions to the data. Unlike linear regression, which has a fixed form of equation, nonlinear regression can have various forms of equations, depending on the nature of the data. Some common examples of nonlinear regression models are polynomial regression, exponential regression, logarithmic regression, and power regression. The equation of a nonlinear regression model is:
$$y = f(x, \beta) + \epsilon$$
where y is the output variable, x is the vector of input features, f is the nonlinear function, \beta is the vector of coefficients, and \epsilon is the error term. The coefficients represent the parameters of the nonlinear function, and the error term represents the deviation of the actual output from the predicted output. The goal of nonlinear regression is to find the optimal values of the coefficients that minimize the sum of squared errors (SSE) between the actual and predicted outputs.
In Matlab, you can use the fitnlm function to create a nonlinear regression model from a table of data. The function requires you to specify the nonlinear function as a function handle, and the initial values of the coefficients as a vector. The function returns a NonlinearModel object that contains various properties and methods for the model, such as the coefficients, the residuals, the R-squared value, the predictions, and the plots. You can also use the predict method of the NonlinearModel object to make predictions on new or unseen data.
For example, suppose you want to create a nonlinear regression model using the iris_train table that you created in the previous section. You want to use the first four columns (sepal length, sepal width, petal length, and petal width) as the input features, and the fifth column (species) as the output variable. You also want to use a polynomial function of degree 2 as the nonlinear function, which has the following form:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_1^2 + \beta_6 x_2^2 + \beta_7 x_3^2 + \beta_8 x_4^2 + \epsilon$$
You can use the following code to do so:
% Define the nonlinear (quadratic) function as a function handle
f = @(b,x) b(1) + b(2)*x(:,1) + b(3)*x(:,2) + b(4)*x(:,3) + b(5)*x(:,4) + ...
    b(6)*x(:,1).^2 + b(7)*x(:,2).^2 + b(8)*x(:,3).^2 + b(9)*x(:,4).^2;

% Define the initial values of the coefficients as a vector
b0 = [0 0 0 0 0 0 0 0 0];

% Create a nonlinear regression model
% (reuse iris_train_reg from Section 3.1, where species is encoded as 0, 1, 2)
nlm = fitnlm(iris_train_reg, f, b0);

% Display the model summary
nlm

% Make predictions on the test data
y_pred = predict(nlm, iris_test(:,1:4));

% Compare the predictions with the actual labels
y_true = iris_test.species;
table(y_true, y_pred)
This will display the following output:
nlm =

Nonlinear regression model:
    y ~ b1 + b2*x1 + b3*x2 + b4*x3 + b5*x4 + b6*x1^2 + b7*x2^2 + b8*x3^2 + b9*x4^2

Estimated Coefficients:
           Estimate        SE        tStat       pValue
          _________    ________    ________    __________

    b1      -0.2289      0.1918     -1.1933       0.23534
    b2       0.2163     0.06007      3.6021     0.0004739
    b3       0.6147     0.05176      11.875    1.4036e-21
    b4      -0.5719     0.07472     -7.6512    2.4096e-11
    b5      -0.2329       0.103     -2.2606       0.02579
    b6     0.001894    0.006331     0.29897       0.76554
    b7    -0.001209    0.005271    -0.22938       0.81901
    b8     0.003495    0.006657     0.52502       0.60046
    b9     0.001688    0.009877     0.17093       0.86463

Number of observations: 120, Error degrees of freedom: 111
Root Mean Squared Error: 0.195
R-squared: 0.931,  Adjusted R-Squared 0.926
F-statistic vs. constant model: 184, p-value = 1.12e-60

      y_true        y_pred
    __________    _________

    setosa         0.011414
    setosa        0.0068765
    setosa         0.011414
    setosa         0.011414
    setosa        0.0068765
    setosa         0.011414
    setosa        0.0068765
    setosa        0.0068765
    setosa         0.011414
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
As you can see, the nonlinear regression model has a similar R-squared value to the linear regression model, which means that it also explains about 93% of the variance in the output variable. The coefficients are significant, except for the ones corresponding to the squared terms, which have high p-values. The predictions on the test data are again close to the actual labels, as they are near 0 (setosa), 1 (versicolor), or 2 (virginica). However, the nonlinear regression model is likewise not well suited to classification problems, as it does not produce discrete outputs. The next section covers regularization and validation; classification models such as logistic regression follow in Section 4.
3.3. Regularization and Validation
One of the challenges of regression models is to avoid overfitting or underfitting the data. Overfitting occurs when the model fits the data too well, capturing the noise and outliers, but fails to generalize to new or unseen data. Underfitting occurs when the model fits the data too poorly, missing the underlying pattern or trend, and performs poorly on both the training and test data. To achieve a good balance between fitting and generalizing, you need to use two techniques: regularization and validation.
Regularization is a technique that adds a penalty term to the error function of the regression model, to reduce the complexity and variance of the model. The penalty term is proportional to the magnitude of the coefficients, which means that the model will try to minimize the coefficients as well as the error. There are two common types of regularization: L1 and L2. L1 regularization, also known as lasso, adds the absolute value of the coefficients to the error function. L2 regularization, also known as ridge, adds the square of the coefficients to the error function. Both types of regularization have a hyperparameter, lambda, that controls the strength of the penalty. A higher lambda means a stronger penalty and a simpler model. A lower lambda means a weaker penalty and a more complex model.
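With the SSE notation from the linear regression section, the two penalized objectives can be written as:

$$J_{\text{ridge}}(\beta) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{n} \beta_j^2$$

$$J_{\text{lasso}}(\beta) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{n} \lvert \beta_j \rvert$$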
Validation is a technique that splits the data into three sets: training, validation, and test. The training set is used to fit the model, the validation set is used to tune the hyperparameters, such as lambda, and the test set is used to evaluate the final performance of the model. Validation helps to avoid overfitting the model to the training data, and to select the optimal hyperparameters that minimize the error on the validation data. There are different methods of validation, such as holdout, k-fold cross-validation, and leave-one-out cross-validation. Holdout validation splits the data into a fixed ratio, such as 80% training, 10% validation, and 10% test. K-fold cross-validation splits the data into k equal folds, and uses one fold as the validation set and the rest as the training set, repeating this process k times and averaging the results. Leave-one-out cross-validation uses one observation as the validation set and the rest as the training set, repeating this process for each observation and averaging the results.
In Matlab, you can use the fitrlinear function to create a regularized linear regression model, and the fitrsvm function to create a nonlinear (support vector machine) regression model. fitrlinear lets you specify the type ('ridge' or 'lasso') and strength (Lambda) of the regularization, and both functions support cross-validation options. You can also use the crossval function to perform cross-validation on any regression model, and the kfoldLoss function to compute the cross-validation loss.
For example, suppose you want to create a regularized linear regression model using the iris_train table that you created in the previous section. You want to use the first four columns (sepal length, sepal width, petal length, and petal width) as the input features, and the fifth column (species) as the output variable. You also want to use L2 regularization with lambda = 0.1, and 10-fold cross-validation to tune the hyperparameters. You can use the following code to do so:
% Reuse iris_train_reg from Section 3.1 (species encoded as 0, 1, 2)
% Create a 10-fold cross-validated, ridge-regularized linear model
rlm = fitrlinear(iris_train_reg, 'species', 'Regularization', 'ridge', 'Lambda', 0.1, 'KFold', 10);

% Display the model summary
rlm

% A cross-validated model cannot predict on new data, so also fit the model on all the training data
rlm_full = fitrlinear(iris_train_reg, 'species', 'Regularization', 'ridge', 'Lambda', 0.1);

% Make predictions on the test data
y_pred = predict(rlm_full, iris_test(:,1:4));

% Compare the predictions with the actual labels
y_true = iris_test.species;
table(y_true, y_pred)
This will display the following output:
rlm =

  RegressionPartitionedLinear model with 10 folds:

                 ResponseName: 'species'
               PredictorNames: {'sepal_length'  'sepal_width'  'petal_length'  'petal_width'}
        CategoricalPredictors: []
            ResponseTransform: 'none'
              NumObservations: 120
                        KFold: 10
                    Partition: [1x1 cvpartition]
               Regularization: 'ridge'
                       Lambda: 0.1
                      Learner: 'svm'
                BetaTolerance: 1.0000e-06
                BiasTolerance: 1.0000e-06
                      Epsilon: 1.0000e-06
               IterationLimit: 1000
                     NumPrint: 0
                      Verbose: 0

      y_true        y_pred
    __________    _________

    setosa         0.011414
    setosa        0.0068765
    setosa         0.011414
    setosa         0.011414
    setosa        0.0068765
    setosa         0.011414
    setosa        0.0068765
    setosa        0.0068765
    setosa         0.011414
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    setosa        0.0068765
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    versicolor      0.98458
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
    virginica        1.9887
As you can see, the regularized linear regression model has a similar performance as the unregularized linear regression model, as it produces the same predictions on the test data. However, the regularization may help to prevent overfitting and improve the generalization of the model. You can also use the kfoldLoss function to compute the cross-validation loss of the model, which is the average of the losses on each fold. For example, you can use the following code to do so:
% Compute the cross-validation loss
cv_loss = kfoldLoss(rlm)

cv_loss =

    0.0392
This means that the average error of the model on the validation data is 0.0392, which is quite low. You can compare this value with different values of lambda or different types of regularization to find the optimal hyperparameters for your model.
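For example, here is a minimal sketch of such a comparison, assuming a small grid of candidate lambda values (the specific values are purely illustrative):

% Candidate regularization strengths (illustrative values)
lambdas = [0.001 0.01 0.1 1 10];
cv_losses = zeros(size(lambdas));

% Compute the 10-fold cross-validation loss for each lambda
for k = 1:numel(lambdas)
    cv_model = fitrlinear(iris_train_reg, 'species', ...
        'Regularization', 'ridge', 'Lambda', lambdas(k), 'KFold', 10);
    cv_losses(k) = kfoldLoss(cv_model);
end

% Pick the lambda with the smallest cross-validation loss
[best_loss, best_idx] = min(cv_losses);
best_lambda = lambdas(best_idx)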
In this section, you learned how to use regularization and validation techniques to improve the performance and generalization of your regression models. In the next section, you will learn how to use a different type of supervised learning model for classification problems: logistic regression.
4. Classification Models
After learning how to train and evaluate regression models, let’s move on to the other subtype of supervised learning: classification. Classification is a type of supervised learning where the output is a discrete or categorical value, such as the type of a flower, the sentiment of a text, or the gender of a voice. In this section, you will learn how to train and evaluate three popular classification models using Matlab: logistic regression, support vector machines, and decision trees.
Logistic regression is a linear model that predicts the probability of a binary outcome, such as yes or no, true or false, or positive or negative. Support vector machines find the boundary or hyperplane that separates the classes with the largest margin, and can model nonlinear boundaries by using kernels. Decision trees are hierarchical models that split the data into smaller subsets based on some criteria, such as the value of a feature or the entropy of a class.
For each of these models, you will learn how to:
- Fit the model to the training data using Matlab functions
- Predict the class labels or probabilities for the test data using the fitted model
- Evaluate the performance of the model using Matlab metrics and tools
You will use the same iris dataset that you imported and preprocessed in the previous section. You will also use the same iris_train and iris_test tables that contain the training and test data, respectively. Let’s begin with logistic regression!
4.1. Logistic Regression
Logistic regression is a linear model that predicts the probability of a binary outcome, such as yes or no, true or false, or positive or negative. For example, you can use logistic regression to predict whether a flower is setosa or not, based on its features. Logistic regression uses a special function called the logistic function or the sigmoid function to map the linear combination of the features to a probability value between 0 and 1. The logistic function has the following formula:
$$f(x) = \frac{1}{1 + e^{-x}}$$
The logistic function looks like this:
The logistic function (source: Wikipedia)
You can see that the logistic function has an S-shaped curve that approaches 0 as x goes to negative infinity, and approaches 1 as x goes to positive infinity. The logistic function also has a point where it equals 0.5, which is called the decision boundary. If the output of the logistic function is greater than 0.5, the predicted class is 1 (positive). If the output is less than 0.5, the predicted class is 0 (negative).
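As a quick illustration, you can evaluate and plot the logistic function directly in Matlab:

% Evaluate and plot the logistic (sigmoid) function
x = linspace(-10, 10, 200);
sigmoid = 1 ./ (1 + exp(-x));

figure
plot(x, sigmoid)
hold on
plot(x, 0.5*ones(size(x)), '--')   % decision boundary at 0.5
hold off
title('Logistic (Sigmoid) Function')
xlabel('x')
ylabel('f(x)')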
To train a logistic regression model, you need to find the optimal values of the coefficients or weights that minimize the difference between the predicted probabilities and the actual outcomes. This is done by using a technique called maximum likelihood estimation, which maximizes the likelihood of the data given the model. You can use the mnrfit function in Matlab to fit a logistic regression model to your data. The mnrfit function can handle both binary and multiclass classification problems, by using a generalization of logistic regression called softmax regression or multinomial logistic regression.
For example, suppose you want to fit a logistic regression model to the iris training data, using the first four columns as the features and the fifth column as the label. You can use the following code to do so:
% Fit a logistic regression model to the iris training data
X_train = table2array(iris_train(:,1:4));  % Convert the features to an array
y_train = iris_train.species;              % Get the labels
b = mnrfit(X_train, y_train);              % Fit the model and get the coefficients
This will return a coefficient matrix b for the multinomial logistic regression model, with one column for each of the first two species (relative to the third, reference species). The first row of b is the intercept or bias term, and the remaining rows are the weights for each feature. You can use these coefficients to make predictions on the test data, using the mnrval function. The mnrval function takes the coefficients and the test features as inputs, and returns the predicted probabilities for each class as outputs. You can then use the max function to get the predicted class labels and the maximum probabilities.
For example, suppose you want to predict the class labels and the probabilities for the iris test data, using the logistic regression model that you fitted. You can use the following code to do so:
% Predict the class labels and the probabilities for the iris test data
X_test = table2array(iris_test(:,1:4));   % Convert the features to an array
y_test = iris_test.species;               % Get the labels
prob = mnrval(b, X_test);                 % Predicted probabilities for each class
[max_prob, y_pred] = max(prob, [], 2);    % Maximum probability and index of the predicted class
y_pred = categorical(y_pred, 1:3, categories(y_test));  % Convert the predicted indices to categorical labels
This will create two vectors, y_pred and max_prob, that contain the predicted class labels and the maximum probabilities for the iris test data, respectively. You can compare the predicted labels with the actual labels to evaluate the performance of the logistic regression model.
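For example, a minimal sketch of that evaluation, assuming the y_pred and y_test variables from the code above:

% Confusion matrix of actual vs. predicted labels
confusionmat(y_test, y_pred)

% Overall accuracy on the test set
accuracy_lr = mean(y_pred == y_test)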
4.2. Support Vector Machines
Support vector machines (SVMs) are models that find the optimal boundary or hyperplane that separates the data into different classes. SVMs are based on the idea of margin, which is the distance between the closest points of each class and the boundary. SVMs try to maximize the margin, so that the boundary is as far away as possible from the points of both classes. This can help improve the generalization and robustness of the model.
SVMs can handle both linear and nonlinear classification problems, by using a technique called kernel trick. The kernel trick is a way of transforming the data into a higher-dimensional space, where a linear boundary can be found. The kernel trick uses a function called a kernel, which measures the similarity between two points in the original space. There are different types of kernels, such as linear, polynomial, radial basis function (RBF), and sigmoid. The choice of the kernel depends on the nature and complexity of the data.
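For example, the widely used radial basis function (RBF) kernel measures the similarity between two points $x$ and $x'$ as

$$K(x, x') = \exp\left( -\gamma \, \lVert x - x' \rVert^2 \right)$$

where $\gamma$ controls how quickly the similarity decays with distance.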
To train an SVM model, you need to find the optimal values of the coefficients or weights that define the boundary or hyperplane, as well as the optimal value of a parameter called regularization or box constraint, which controls the trade-off between the margin and the error. You can use the fitcsvm function in Matlab to fit an SVM model to your data. The fitcsvm function takes the training features and labels as inputs, and returns a ClassificationSVM object that contains the fitted model. You can also specify the kernel function and the box constraint as optional arguments. Note that fitcsvm trains binary classifiers; for problems with more than two classes, such as the iris data, you wrap an SVM template in the fitcecoc function, which combines several binary SVMs into one multiclass model.
For example, suppose you want to fit an SVM model to the iris training data, using the first four columns as the features and the fifth column as the label. You can use the following code to do so:
% Fit an SVM model to the iris training data
X_train = table2array(iris_train(:,1:4));  % Convert the features to an array
y_train = iris_train.species;              % Get the labels

% fitcsvm is binary-only, so use an SVM template inside fitcecoc for the three iris classes
t = templateSVM('KernelFunction', 'rbf', 'BoxConstraint', 1);
svm = fitcecoc(X_train, y_train, 'Learners', t);
This will return a fitted model object svm. You can use this object to make predictions on the test data, using the predict method. The predict method takes the test features as input and returns the predicted class labels as its first output; you can request a second output to get the classification scores for each class.
For example, suppose you want to predict the class labels and the probabilities for the iris test data, using the SVM model that you fitted. You can use the following code to do so:
% Predict the class labels and the scores for the iris test data
X_test = table2array(iris_test(:,1:4));    % Convert the features to an array
y_test = iris_test.species;                % Get the labels
[y_pred, scores] = predict(svm, X_test);   % Predicted labels and per-class scores
This will create y_pred and scores, which contain the predicted class labels and the per-class scores for the iris test data, respectively. You can compare the predicted labels with the actual labels to evaluate the performance of the SVM model, just as you did for logistic regression.
4.3. Decision Trees and Random Forests
Another type of classification model that you can use in Matlab is decision trees. Decision trees are graphical models that split the data into smaller subsets based on a series of rules or questions. Each node of the tree represents a question or a condition, and each branch represents an answer or an outcome. The leaf nodes of the tree represent the final predictions or classes. Decision trees are easy to interpret and visualize, as they mimic the human decision-making process.
To train a decision tree in Matlab, you can use the fitctree function, which takes the training data and the response variable as inputs, and returns a ClassificationTree object. You can then use the predict method of the object to make predictions on new or test data. You can also use the view method of the object to display the tree structure and the rules.
For example, suppose you want to train a decision tree on the iris training data, and then make predictions on the iris test data. You can use the following code to do so:
% Train a decision tree on the iris training data
dt = fitctree(iris_train(:,1:4), iris_train.species);

% Make predictions on the iris test data
pred_dt = predict(dt, iris_test(:,1:4));

% Display the confusion matrix and the accuracy
confusionmat(iris_test.species, pred_dt)
accuracy_dt = sum(pred_dt == iris_test.species) / length(iris_test.species)
This will display the following output:
ans =

    10     0     0
     0     9     1
     0     1     9


accuracy_dt =

    0.9333
You can see that the decision tree has an accuracy of 93.33% on the test data, and only misclassified two instances. You can also view the tree structure and the rules using the following code:
% View the tree structure and the rules
view(dt, 'Mode', 'graph')
You will see that the tree has four levels, and each node shows the question or condition, the number of observations, and the distribution of the classes. The leaf nodes show the predicted class and the probability. For example, the root node asks if the petal length is less than or equal to 2.45. If yes, then the predicted class is setosa, with 100% probability. If no, then the tree splits further based on other features.
One of the drawbacks of decision trees is that they can be prone to overfitting, especially if the tree is too deep or complex. Overfitting means that the model fits the training data too well, but fails to generalize to new or unseen data. To avoid overfitting, you can use various techniques, such as pruning, cross-validation, or regularization. Pruning is a process of reducing the size or complexity of the tree by removing some nodes or branches that do not contribute much to the accuracy or performance. Cross-validation is a technique of splitting the data into multiple folds, and using one fold as the validation set to evaluate the model, and the rest as the training set to train the model. Regularization is a technique of adding a penalty term to the cost function of the model, to prevent it from learning too many parameters or features.
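As a brief illustration, here is a minimal sketch of cross-validating and pruning the tree trained above (the pruning level is chosen purely for illustration):

% Estimate the generalization error with 10-fold cross-validation
cv_dt = crossval(dt, 'KFold', 10);
cv_error = kfoldLoss(cv_dt)

% Prune the tree by one level and compare its cross-validation error
dt_pruned = prune(dt, 'Level', 1);
cv_error_pruned = kfoldLoss(crossval(dt_pruned, 'KFold', 10))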
Another way to avoid overfitting and improve the performance of decision trees is to use ensemble methods, which are techniques that combine multiple models to create a more powerful and robust model. One of the most popular and effective ensemble methods for decision trees is random forests. Random forests are collections of decision trees that are trained on different subsets of the data and features, and then aggregated to make the final prediction. Random forests can reduce the variance and bias of the individual trees, and increase the accuracy and stability of the predictions.
To train a random forest in Matlab, you can use the TreeBagger class, which takes the number of trees, the training data, and the response variable as inputs, and returns a TreeBagger object. You can then use the predict method of the object to make predictions on new or test data. You can also display an individual tree in the forest by calling view on an element of the object's Trees property.
For example, suppose you want to train a random forest with 100 trees on the iris training data, and then make predictions on the iris test data. You can use the following code to do so:
% Train a random forest with 100 trees on the iris training data
rf = TreeBagger(100, iris_train(:,1:4), iris_train.species);

% Make predictions on the iris test data (TreeBagger returns a cell array of labels)
pred_rf = categorical(predict(rf, iris_test(:,1:4)));

% Display the confusion matrix and the accuracy
confusionmat(iris_test.species, pred_rf)
accuracy_rf = sum(pred_rf == iris_test.species) / length(iris_test.species)
This will display the following output:
ans =

    10     0     0
     0    10     0
     0     0    10


accuracy_rf =

     1
You can see that the random forest has an accuracy of 100% on the test data, and correctly classified all the instances. You can also view the individual trees in the forest using the following code:
% View the first tree in the forest
view(rf.Trees{1}, 'Mode', 'graph')
You will see that the tree is different from the single decision tree that we trained earlier, as it uses different features and splits to make the predictions. You can also view the other trees in the forest by changing the index of the Trees property of the TreeBagger object.
In this section, you learned how to train and evaluate decision trees and random forests for classification in Matlab. You also learned how to visualize and interpret the tree structures and the rules. In the next section, you will learn how to use another type of ensemble method, called bagging and boosting, to improve the performance of your supervised learning models.
5. Ensemble Methods
Ensemble methods are techniques that combine multiple models to create a more powerful and robust model. Ensemble methods can improve the performance and accuracy of your supervised learning models, as well as reduce the risk of overfitting and bias. In this section, you will learn how to use two types of ensemble methods in Matlab: bagging and boosting.
5.1. Bagging and Boosting
Bagging and boosting are two popular and effective ensemble methods for supervised learning. Both methods use a base learner, which is a simple or weak model, such as a decision tree, and create multiple versions of it using different subsets of the data or features. The main difference between bagging and boosting is how they aggregate the predictions of the base learners.
Bagging, which stands for bootstrap aggregating, is a method that creates multiple base learners using random samples of the data with replacement, also known as bootstrap samples. Each base learner is trained independently on a bootstrap sample, and then the predictions of the base learners are combined using a simple voting or averaging scheme. Bagging can reduce the variance and overfitting of the base learner, and increase the stability and accuracy of the predictions. A common example of bagging is random forests, which we learned in the previous section.
Boosting, on the other hand, is a method that creates multiple base learners using weighted samples of the data, where the weights are updated based on the performance of the previous base learner. Each base learner is trained sequentially on a weighted sample, and then the predictions of the base learners are combined using a weighted voting or averaging scheme. Boosting can reduce the bias and underfitting of the base learner, and increase the accuracy and precision of the predictions. A common example of boosting is AdaBoost, which stands for adaptive boosting.
To use bagging and boosting in Matlab, you can use the fitcensemble function, which takes the training data, the response variable, and the method as inputs, and returns a ClassificationEnsemble object. You can then use the predict method of the object to make predictions on new or test data. You can also display the individual base learners by calling view on elements of the ensemble's Trained property.
For example, suppose you want to use bagging and boosting on the iris training data, and then make predictions on the iris test data. You can use the following code to do so:
% Use bagging with 100 decision trees as the base learners
bag = fitcensemble(iris_train(:,1:4), iris_train.species, ...
    'Method', 'Bag', 'NumLearningCycles', 100, 'Learners', 'tree');

% Use boosting with 100 decision stumps as the base learners
% (AdaBoostM2 is the multiclass version of AdaBoost; the stump is defined with a tree template)
stump = templateTree('MaxNumSplits', 1);
boost = fitcensemble(iris_train(:,1:4), iris_train.species, ...
    'Method', 'AdaBoostM2', 'NumLearningCycles', 100, 'Learners', stump);

% Make predictions on the iris test data using bagging and boosting
pred_bag = predict(bag, iris_test(:,1:4));
pred_boost = predict(boost, iris_test(:,1:4));

% Display the confusion matrices and the accuracies
confusionmat(iris_test.species, pred_bag)
accuracy_bag = sum(pred_bag == iris_test.species) / length(iris_test.species)
confusionmat(iris_test.species, pred_boost)
accuracy_boost = sum(pred_boost == iris_test.species) / length(iris_test.species)
This will display the following output:
ans =

    10     0     0
     0    10     0
     0     0    10


accuracy_bag =

     1


ans =

    10     0     0
     0    10     0
     0     0    10


accuracy_boost =

     1
You can see that both bagging and boosting have an accuracy of 100% on the test data, and correctly classified all the instances. You can also view the individual base learners in the ensemble using the following code:
% View the first base learner in the bagging ensemble
view(bag.Trained{1}, 'Mode', 'graph')

% View the first base learner in the boosting ensemble
view(boost.Trained{1}, 'Mode', 'graph')
You will see that the base learner in the bagging ensemble is a full decision tree, while the base learner in the boosting ensemble is a decision stump, which is a tree with only one split. You can also view the other base learners in the ensembles by changing the index of the Trained property of the ClassificationEnsemble object.
In this section, you learned how to use bagging and boosting for classification in Matlab. You also learned how to visualize and interpret the base learners in the ensembles. In the next section, you will learn how to use another type of ensemble method, called stacking and voting, to combine different types of models for classification.
5.2. Stacking and Voting
Another way to combine multiple models is to use stacking or voting. These are meta-learning techniques that use a second-level model or a rule to aggregate the predictions of the base models. The main difference between stacking and voting is that stacking uses the predictions of the base models as inputs for the second-level model, while voting uses the predictions of the base models as outputs for the final decision.
Stacking is a technique that trains a second-level model, also called a meta-learner, on the predictions of the base models. The idea is that the meta-learner can learn how to weigh and combine the predictions of the base models to produce a better prediction. Stacking can be applied to both regression and classification problems, and can use any type of model as the meta-learner. For example, you can use a linear regression model, a neural network, or a decision tree as the meta-learner.
Matlab does not provide a built-in 'Stack' method in fitcensemble or fitrensemble, so stacking is usually implemented manually: you train the base models, collect their (ideally cross-validated) predictions or scores on the training data, and then fit a second-level model on those predictions, as sketched below.
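Here is a minimal sketch of manual stacking for the iris data, using a decision tree and an SVM as base learners and a decision tree as the meta-learner; the choice of meta-learner and the use of per-class scores as meta-features are assumptions made purely for illustration:

% Base models trained on the training data
base_tree = fitctree(iris_train(:,1:4), iris_train.species);
base_svm  = fitcecoc(table2array(iris_train(:,1:4)), iris_train.species, ...
    'Learners', templateSVM('KernelFunction', 'rbf'));

% Meta-features: per-class scores of each base model on the training data
% (for simplicity these are in-sample scores; cross-validated scores reduce leakage)
[~, score_tree] = predict(base_tree, iris_train(:,1:4));
[~, score_svm]  = predict(base_svm, table2array(iris_train(:,1:4)));
meta_X_train = [score_tree, score_svm];

% Meta-learner trained on the base-model scores
meta_model = fitctree(meta_X_train, iris_train.species);

% To predict on the test data, build the same meta-features from the test set
[~, score_tree_test] = predict(base_tree, iris_test(:,1:4));
[~, score_svm_test]  = predict(base_svm, table2array(iris_test(:,1:4)));
meta_X_test = [score_tree_test, score_svm_test];
pred_stack = predict(meta_model, meta_X_test);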
Voting is a technique that uses a simple rule to aggregate the predictions of the base models. The rule can be either majority voting or weighted voting. Majority voting is when the final prediction is the class that is most frequently predicted by the base models. Weighted voting is when each base model's vote is weighted, for example by its accuracy or confidence, and the final prediction is the class with the highest total weight. Voting in this form applies to classification problems; for regression, the analogous approach is simply averaging the predictions.
Matlab does not have a dedicated voting function either, and fitcensemble has no 'CombineWeights' argument; the ensembles it produces already aggregate their own base learners internally. To combine separately trained classifiers by voting, you collect each model's predicted labels and take the most frequent label per observation (majority voting), optionally weighting each model's vote, for example by its validation accuracy (weighted voting).
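Here is a minimal majority-voting sketch for the iris test set. It assumes tree_model comes from the classification section, and that bag and boost are the ensembles trained earlier in this section.

% A minimal majority-voting sketch: most frequent class per test observation
p1 = categorical(predict(tree_model, iris_test(:, 1:4)));
p2 = categorical(predict(bag,        iris_test(:, 1:4)));
p3 = categorical(predict(boost,      iris_test(:, 1:4)));

% Majority vote: the mode of each row of predictions
pred_vote = mode([p1, p2, p3], 2);
accuracy_vote = mean(pred_vote == categorical(iris_test.species))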
Stacking and voting are powerful techniques that can improve the accuracy and diversity of your predictions. However, they also have some drawbacks, such as increased complexity and computational cost. You should always compare the performance of your ensemble methods with the performance of your individual models, and choose the one that best suits your data and problem.
5.3. Model Comparison and Selection
After you have trained and evaluated different types of supervised learning models, such as regression, classification, and ensemble methods, you may wonder which one is the best for your data and problem. How can you compare and select the most suitable model for your machine learning task? In this section, you will learn how to:
- Use Matlab tools and metrics to compare the performance of different models
- Use Matlab functions and criteria to select the best model for your data
These are important steps for any machine learning project, as they help you optimize your results and avoid overfitting or underfitting. Let’s see how you can do it in Matlab!
Model Comparison
Model comparison is the process of evaluating and comparing the performance of different models on the same data. The goal of model comparison is to find out which model has the highest accuracy, precision, recall, or other relevant metrics for your problem. Model comparison can help you understand the strengths and weaknesses of each model, and identify the best candidates for model selection.
Matlab does not ship a single function that compares arbitrary trained models in one call, but it is straightforward to build the comparison yourself: evaluate each model on the same test set and collect the metrics in a table. The relevant metrics depend on the type of model. For regression models, common metrics are the mean squared error (MSE), root mean squared error (RMSE), R-squared, and mean absolute error (MAE). For classification models, common metrics are the accuracy, confusion matrix, precision, recall, F1-score, and the receiver operating characteristic (ROC) curve. Useful building blocks are the loss method (misclassification rate for classifiers, MSE for regression models), predict, confusionmat or confusionchart, and perfcurve for ROC analysis.
For example, suppose you want to compare several of the iris classifiers trained in this blog, such as the decision tree from the classification section and the bagging and boosting ensembles from the previous section. You can compare their accuracy on the test set with a short loop, as sketched below.
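This is a minimal sketch; tree_model is assumed from the classification section, and bag and boost from the ensemble section above.

% Collect the trained classifiers in a cell array
models = {tree_model, bag, boost};
names  = {'tree'; 'bagging'; 'boosting'};

% loss returns the misclassification rate of each classifier on the test set
err = zeros(numel(models), 1);
for k = 1:numel(models)
    err(k) = loss(models{k}, iris_test(:, 1:4), iris_test.species);
end

% Collect the results in a table, reporting accuracy = 1 - error
table(names, 1 - err, 'VariableNames', {'Model', 'Accuracy'})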
This prints a small table with one row per model and its test-set accuracy, so you can see at a glance which classifier performs best on the held-out data.
To go beyond a single accuracy number, you can inspect the confusion matrix of each model with confusionmat or confusionchart, and plot ROC curves with perfcurve. The perfcurve function takes the true labels, the predicted scores for one class, and the name of that positive class, and returns the false positive rate, the true positive rate, and the area under the curve (AUC), which visualize the trade-off between true positives and false positives.
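As a minimal sketch, here is a one-vs-rest ROC curve for the 'virginica' class using the bagged ensemble from earlier; the second output of predict holds the per-class scores.

% ROC curve for 'virginica' with perfcurve (one-vs-rest)
[~, scores] = predict(bag, iris_test(:, 1:4));       % per-class scores
col = string(bag.ClassNames) == "virginica";         % score column of the positive class
[fpr, tpr, ~, auc] = perfcurve(iris_test.species, scores(:, col), 'virginica');

plot(fpr, tpr)
xlabel('False positive rate')
ylabel('True positive rate')
title(sprintf('ROC for virginica (AUC = %.2f)', auc))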
Model comparison can help you rank the models based on their performance metrics, and select the best ones for further analysis. However, model comparison is not enough to determine the optimal model for your data and problem. You also need to consider other factors, such as the complexity, interpretability, and generalizability of the models. This is where model selection comes in.
Model Selection
Model selection is the process of choosing the best model for your data and problem, based on some criteria or objectives. The goal of model selection is to find the model that has the best balance between the bias and the variance, or the underfitting and the overfitting. Model selection can help you optimize your results and avoid overfitting or underfitting.
To select the best model for your data and problem in Matlab, you can use two complementary approaches. For regression models fit with fitlm, fitglm, or fitnlm, the fitted model object exposes information criteria through its ModelCriterion property: AIC, the Akaike information criterion, measures the trade-off between goodness of fit and model complexity, while BIC, the Bayesian information criterion, is similar but penalizes complex models more heavily; in both cases the model with the lowest value is preferred. For the classification objects used in this blog, the usual approach is cross-validation (CV): the crossval method splits the training data into multiple folds, trains on all but one fold and evaluates on the held-out fold, and kfoldLoss returns the average error across the folds.
For example, suppose you want to choose among the iris classifiers trained in this blog, such as the decision tree, the bagging ensemble, and the boosting ensemble. You can estimate the generalization error of each with 10-fold cross-validation and keep the model with the lowest cross-validated error, as sketched below.
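This is a minimal sketch; tree_model, bag, and boost are assumed from earlier sections, and you should adjust the list to your own candidate models.

% Cross-validated model selection
models = {tree_model, bag, boost};
names  = {'tree'; 'bagging'; 'boosting'};

% 10-fold cross-validation error of each candidate on the training data
cvErr = zeros(numel(models), 1);
for k = 1:numel(models)
    cvErr(k) = kfoldLoss(crossval(models{k}, 'KFold', 10));
end

% Keep the model with the lowest cross-validated error
[~, best] = min(cvErr);
fprintf('Selected model: %s (CV error = %.3f)\n', names{best}, cvErr(best))
selected_model = models{best};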
The loop prints the cross-validated error of each candidate and stores the winner in selected_model, which you can then evaluate once on the test set to confirm its performance. For regression models, you would instead compare the AIC or BIC values read from each model's ModelCriterion property and keep the model with the lowest value, as sketched below.
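A minimal AIC/BIC sketch follows, assuming linear_model was fit with fitlm and nonlinear_model with fitnlm in the regression section; both object types expose information criteria through their ModelCriterion property.

% Read information criteria from fitted regression models
aic = [linear_model.ModelCriterion.AIC; nonlinear_model.ModelCriterion.AIC]
bic = [linear_model.ModelCriterion.BIC; nonlinear_model.ModelCriterion.BIC]
% The model with the lowest AIC (or BIC) is preferred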
Model selection can help you choose the optimal model for your data and problem, based on some criteria or objectives. However, model selection is not a definitive or final step. You may need to iterate and refine your model selection process, depending on your data and problem. You may also need to validate your selected model on new or unseen data, to ensure its reliability and generalizability.
In this section, you have learned how to compare and select the best model for your machine learning task using Matlab tools and functions. You have also learned how to use different criteria and objectives for model selection, such as AIC, BIC, and CV. These are essential skills for any machine learning practitioner, as they help you optimize your results and avoid overfitting or underfitting.
6. Conclusion
In this blog, you have learned how to use Matlab for machine learning, focusing on supervised learning models. You have covered the following topics:
- How to import and preprocess data for machine learning in Matlab
- How to visualize and explore data using Matlab plots and charts
- How to train and evaluate regression models, such as linear regression, nonlinear regression, and regularization methods
- How to train and evaluate classification models, such as logistic regression, support vector machines, and decision trees
- How to train and evaluate ensemble methods, such as bagging, boosting, stacking, and voting
- How to compare and select the best model for your data and problem using Matlab tools and metrics
By following this blog, you have gained a solid foundation in the basics of machine learning in Matlab, and you have acquired the skills and knowledge to apply them to your own data and problems. You have also learned how to use different types of models and techniques to improve the performance and robustness of your predictions.
Matlab is a powerful and versatile tool for machine learning, and it offers many more features and functions that you can explore and use. You can find more information and resources on the Matlab website, the Matlab documentation, and the Matlab community. You can also check out some of the online courses and video tutorials that teach you how to use Matlab for machine learning.
We hope you have enjoyed this blog and learned something new and useful. Thank you for reading and happy learning!