1. Introduction
In this blog, you will learn how to use Matlab for feature extraction and feature selection, which are essential steps in machine learning. You will learn different methods and algorithms for extracting and selecting the most relevant features from your data.
But what are features and why are they important? Features are the attributes or variables that describe your data. For example, if you have a dataset of images of handwritten digits, the features could be the pixel values, the size, the shape, the orientation, or the color of each image. Features are important because they determine how well your machine learning model can learn from your data and make accurate predictions.
However, not all features are equally useful or relevant. Some features may be redundant, irrelevant, noisy, or correlated with each other, which can affect the performance and interpretability of your machine learning model. Therefore, it is often necessary to apply feature extraction and feature selection techniques to reduce the dimensionality and complexity of your data, and to improve the quality and efficiency of your machine learning model.
Feature extraction is the process of transforming your original features into a new set of features that capture the most important information from your data. Feature extraction can help you reduce the dimensionality of your data, remove noise, and enhance the discriminative power of your features. Feature extraction methods can be divided into two categories: linear and nonlinear. Linear methods assume that the data can be projected onto a lower-dimensional subspace using a linear transformation, while nonlinear methods use more complex transformations that can capture the nonlinear structure of the data.
Feature selection is the process of selecting a subset of features that are most relevant to your machine learning task. Feature selection can help you reduce the number of features, eliminate irrelevant or redundant features, and simplify your machine learning model. Feature selection methods can be divided into three categories: filter, wrapper, and embedded. Filter methods evaluate the features independently based on some criteria, such as correlation, information gain, or variance. Wrapper methods use a machine learning model to evaluate the features based on the prediction accuracy or error. Embedded methods integrate the feature selection process into the machine learning model, such as using regularization or decision trees.
In this blog, you will learn how to use Matlab to implement some of the most common and effective feature extraction and feature selection methods. You will also learn how to compare and evaluate the results of different methods and choose the best one for your machine learning task.
Are you ready to start? Let’s begin with the first section, where you will learn what features are and why they are important.
2. What are Features and Why are They Important?
In this section, you will take a closer look at what features are and why they matter for machine learning. Features are the attributes or variables that describe your data. For example, if you have a dataset of images of handwritten digits, the features could be the pixel values, the size, the shape, the orientation, or the color of each image. Features are important because they determine how well your machine learning model can learn from your data and make accurate predictions.
However, not all features are equally useful or relevant. Some features may be redundant, irrelevant, noisy, or correlated with each other, which can affect the performance and interpretability of your machine learning model. Therefore, it is often necessary to apply feature extraction and feature selection techniques to reduce the dimensionality and complexity of your data, and to improve the quality and efficiency of your machine learning model.
But how do you know which features are useful or relevant? How do you measure the quality or importance of a feature? There are different ways to answer these questions, depending on the type and goal of your machine learning task. For example, if you are doing a classification task, where you want to assign a label to an instance based on its features, you may want to use features that can discriminate between different classes. If you are doing a regression task, where you want to predict a continuous value based on its features, you may want to use features that have a strong correlation with the target value. If you are doing a clustering task, where you want to group similar instances based on their features, you may want to use features that can capture the similarity or dissimilarity between instances.
There are also different criteria and metrics that can help you evaluate the quality or importance of a feature, such as variance, information gain, mutual information, correlation, chi-square, Fisher score, or ReliefF. These criteria and metrics can be used to rank or score the features based on their relevance to the machine learning task. However, these criteria and metrics are not always reliable or consistent, as they may depend on the distribution, scale, or type of the data. Therefore, it is advisable to use multiple criteria and metrics to compare and validate the results of feature extraction and feature selection methods.
As you can see, features are the key components of any machine learning task, and choosing the right features can make a big difference in the outcome of your machine learning model. In the next sections, you will learn how to use Matlab to implement some of the most common and effective feature extraction and feature selection methods.
3. Feature Extraction Methods in Matlab
In this section, you will learn how to use Matlab to implement some of the most common and effective feature extraction methods. Feature extraction is the process of transforming your original features into a new set of features that capture the most important information from your data. Feature extraction can help you reduce the dimensionality of your data, remove noise, and enhance the discriminative power of your features.
There are many feature extraction methods available, but in this blog, you will focus on three of them: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Independent Component Analysis (ICA). These methods are widely used in machine learning and have different advantages and disadvantages. You will learn how they work, how to apply them to your data, and how to interpret the results.
Before you start, you need to have Matlab installed on your computer, along with the Statistics and Machine Learning Toolbox, which contains the functions and tools used here for feature extraction. You also need some data to work with. You can use your own data, or a built-in dataset such as the Fisher iris dataset, which you can load by typing load fisheriris in the Matlab command window. The wine quality dataset used later in this blog is not built into Matlab; you can download it from the UCI Machine Learning Repository and import it with readtable.
Now that you have everything ready, let’s begin with the first feature extraction method: PCA.
3.1. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is one of the most popular and widely used feature extraction methods in machine learning. PCA is a linear method that aims to find a new set of features, called principal components, that are orthogonal, uncorrelated, and capture the maximum amount of variance in the data. PCA can help you reduce the dimensionality of your data, remove noise, and reveal the underlying structure of your data.
How does PCA work? The basic idea of PCA is to project your data onto a lower-dimensional subspace that preserves the most information from your data. To do this, PCA first computes the covariance matrix of your data, which measures how the features vary with each other. Then, PCA finds the eigenvectors and eigenvalues of the covariance matrix, which represent the directions and magnitudes of the principal components. The eigenvectors with the largest eigenvalues correspond to the principal components that explain the most variance in the data. PCA then transforms your data by multiplying it with a matrix of the selected eigenvectors, which results in a new set of features that are linear combinations of the original features.
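To make these steps concrete, here is a minimal sketch of the mechanics (centering, covariance, eigendecomposition, projection) on the iris measurements; the built-in pca function described next does all of this for you, so treat this only as an illustration.

% PCA by hand on the iris measurements
load fisheriris
X = meas;

% Center the data and compute the covariance matrix
Xc = X - mean(X);
C = cov(Xc);

% Eigendecomposition: columns of V are the principal directions
[V, D] = eig(C);
[evals, order] = sort(diag(D), 'descend');
V = V(:, order);

% Fraction of variance explained by each component
explained = evals / sum(evals);

% Project the centered data onto the first two principal components
scores2 = Xc * V(:, 1:2);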
How do you use PCA in Matlab? Matlab provides a built-in function for PCA, called pca, which takes a matrix of data as input and returns the principal component coefficients (loadings), the principal component scores, the eigenvalues of the covariance matrix, and the percentage of variance explained by each component. Note the output order: the coefficients come first and the scores second. You can use the pca function as follows:

% Load the Fisher iris dataset
load fisheriris

% Features (sepal length, sepal width, petal length, petal width) and labels
X = meas;
y = species;

% Perform PCA on the features
% pccoefs: loadings, pcscores: scores, pcvars: eigenvalues, pcexplained: % of variance
[pccoefs, pcscores, pcvars, ~, pcexplained] = pca(X);

% Plot the first two principal components, grouped by species
gscatter(pcscores(:,1), pcscores(:,2), y)
xlabel('PC1')
ylabel('PC2')
title('PCA of Iris Data')
The output of the code is a scatter plot of the first two principal components, which shows how the three classes of iris flowers are separated along them. The first principal component (PC1) explains about 92.5% of the variance in the data, the second (PC2) about 5.3%, and the remaining components together less than 2.2%, so they can safely be ignored. You can also inspect pccoefs to see how each original feature contributes to each component, pcvars for the eigenvalues of the covariance matrix, and pcexplained for the percentage of variance explained by each component.
As you can see, PCA is a powerful and simple method for feature extraction that can help you reduce the dimensionality and complexity of your data, and reveal the hidden patterns and relationships in your data. In the next section, you will learn another feature extraction method that is similar to PCA, but has a different objective: LDA.
3.2. Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is another feature extraction method that is similar to PCA, but has a different objective. LDA is a linear method that aims to find a new set of features, called linear discriminants, that maximize the separation between different classes in the data. LDA can help you reduce the dimensionality of your data, remove noise, and enhance the classification performance of your machine learning model.
How does LDA work? The basic idea of LDA is to project your data onto a lower-dimensional subspace that preserves the most information for class discrimination. To do this, LDA first computes the mean and the covariance of each class in your data, which measure how the features vary within each class. Then, LDA finds the directions that maximize the ratio of the between-class variance to the within-class variance, which represent the linear discriminants. The directions with the highest ratio correspond to the linear discriminants that explain the most separation between different classes in the data. LDA then transforms your data by multiplying it with a matrix of the selected directions, which results in a new set of features that are linear combinations of the original features.
How do you use LDA in Matlab? Unlike PCA, Matlab does not ship a ready-made lda function that returns discriminant scores. The built-in fitcdiscr function fits an LDA classifier, but the projection onto the linear discriminants is easiest to compute yourself from the within-class and between-class scatter matrices. The following sketch does exactly that:
% Load the Fisher iris dataset
load fisheriris
X = meas;      % sepal length, sepal width, petal length, petal width
y = species;   % setosa, versicolor, virginica

% Within-class (Sw) and between-class (Sb) scatter matrices
classes = unique(y);
mu = mean(X);
Sw = zeros(size(X,2));
Sb = zeros(size(X,2));
for k = 1:numel(classes)
    Xk  = X(strcmp(y, classes{k}), :);
    muk = mean(Xk);
    Sw  = Sw + (Xk - muk)' * (Xk - muk);
    Sb  = Sb + size(Xk,1) * (muk - mu)' * (muk - mu);
end

% Linear discriminants: generalized eigenvectors of (Sb, Sw)
[V, D] = eig(Sb, Sw);
[evals, order] = sort(diag(D), 'descend');
ldcoefs  = V(:, order);            % discriminant directions
ldvars   = evals / sum(evals);     % fraction of class separation explained
ldscores = X * ldcoefs;            % projected data

% Plot the first two linear discriminants
gscatter(ldscores(:,1), ldscores(:,2), y)
xlabel('LD1')
ylabel('LD2')
title('LDA of Iris Data')
The output of the code is a scatter plot of the first two linear discriminants, which shows how the three classes of iris flowers are separated along them. The first linear discriminant (LD1) accounts for about 99.1% of the between-class separation, while the second (LD2) accounts for about 0.9%. With three classes there are at most two meaningful discriminants, so the remaining directions carry essentially no class information and can be ignored. You can also inspect ldcoefs to see the discriminant directions and ldvars to see the fraction of separation explained by each of them.
As you can see, LDA is a powerful and simple method for feature extraction that can help you reduce the dimensionality and complexity of your data, and improve the classification performance of your machine learning model. In the next section, you will learn a feature extraction method with a different goal from PCA and LDA: ICA.
3.3. Independent Component Analysis (ICA)
Independent Component Analysis (ICA) is a feature extraction method with a different goal from PCA and LDA. ICA assumes a linear mixing model and aims to find a new set of features, called independent components, that are statistically independent, non-Gaussian, and capture the maximum amount of information from the data. ICA can help you uncover the hidden sources or factors that generate your data, especially when your data is a mixture of signals from different sources.
How does ICA work? The basic idea of ICA is to assume that your data is a linear combination of some unknown independent sources, and to find a way to separate or unmix these sources. To do this, ICA first whitens your data, which means to transform your data into a new set of features that are uncorrelated and have unit variance. Then, ICA finds a rotation matrix that maximizes the non-Gaussianity of the whitened features, which represent the independent components. The non-Gaussianity of a feature is measured by its kurtosis, which is a measure of how peaked or flat a distribution is. The more non-Gaussian a feature is, the more likely it is to be an independent source. ICA then transforms your data by multiplying it with the rotation matrix, which results in a new set of features that are linear combinations of the independent sources.
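To see what the whitening step does, here is a minimal sketch (illustration only, using the iris measurements rather than a real source-separation problem) that whitens the data so the features become uncorrelated with unit variance, which is the starting point that ICA then rotates.

% Whitening: decorrelate the features and give them unit variance
load fisheriris
X = meas;
Xc = X - mean(X);                    % center the data

% Eigendecomposition of the covariance matrix
[E, D] = eig(cov(Xc));

% Whitening transform: rotate onto the eigenvectors and rescale by 1/sqrt(eigenvalue)
Xw = Xc * E * diag(1 ./ sqrt(diag(D)));

% The covariance of the whitened data is (up to rounding) the identity matrix
disp(cov(Xw))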
How do you use ICA in Matlab? Matlab itself does not include a fastica function: FastICA is a freely available third-party package that you add to your Matlab path (the Statistics and Machine Learning Toolbox offers rica as a built-in alternative, shown later in this section). The fastica function takes a matrix of data with one signal per row and returns the independent components, the mixing matrix, and the separating matrix; you can then measure the non-Gaussianity of each component with the kurtosis function. Assuming the wine quality data has been downloaded from the UCI repository as a CSV file, you can use fastica as follows:
% Load the UCI wine quality dataset (downloaded as a semicolon-delimited CSV)
T = readtable('winequality-red.csv', 'Delimiter', ';');
X = T{:, 1:11};   % physicochemical features (fixed acidity through alcohol)
y = T{:, 12};     % quality scores

% Perform ICA on the features (FastICA expects one signal per row)
[icascores, icamix, icasep] = fastica(X');

% Excess kurtosis of each independent component (zero for a Gaussian)
icakurt = kurtosis(icascores, 1, 2) - 3;

% Plot the first two independent components, colored by quality
scatter(icascores(1,:)', icascores(2,:)', [], y, 'filled')
xlabel('IC1')
ylabel('IC2')
title('ICA of Wine Quality Data')
colorbar
The output of the code is a scatter plot of the first two independent components, which shows how the different quality levels of wine are distributed along them. You may find that the first independent component (IC1) has a high kurtosis value of about 17.8, while the second (IC2) has a low value of about 0.2; because FastICA does not order its components, your numbering may differ. A high kurtosis means the component is strongly non-Gaussian and therefore more likely to correspond to an independent source, while a kurtosis near zero means the component is close to Gaussian and less informative. You can also inspect icamix and icasep to see how the recovered sources mix into the original features, and icakurt to compare the kurtosis values of all the components.
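If you prefer to stay entirely within the Statistics and Machine Learning Toolbox, the built-in rica function (reconstruction ICA, available since R2017a) offers a related decomposition without any third-party code. A minimal sketch, assuming the same wine quality matrix X and quality vector y as above, and an arbitrarily chosen number of components:

% Reconstruction ICA with the built-in rica function
q = 5;                          % number of independent features to learn (chosen arbitrarily)
Mdl = rica(X, q);

% Transform the data into the learned independent features
Z = transform(Mdl, X);

% Plot the first two learned features, colored by wine quality
scatter(Z(:,1), Z(:,2), [], y, 'filled')
xlabel('Feature 1')
ylabel('Feature 2')
title('rica Features of the Wine Quality Data')
colorbar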
As you can see, ICA is a powerful and complex method for feature extraction that can help you uncover the hidden sources or factors that generate your data, especially when your data is a mixture of signals from different sources. In the next section, you will learn how to use Matlab to implement some of the most common and effective feature selection methods.
4. Feature Selection Methods in Matlab
In this section, you will learn how to use Matlab to implement some of the most common and effective feature selection methods. Feature selection is the process of selecting a subset of features that are most relevant to your machine learning task. Feature selection can help you reduce the number of features, eliminate irrelevant or redundant features, and simplify your machine learning model.
There are many feature selection methods available, but in this blog, you will focus on three of them: Filter Methods, Wrapper Methods, and Embedded Methods. These methods are based on different strategies and criteria for selecting the features. You will learn how they work, how to apply them to your data, and how to compare and evaluate the results.
Before you start, you need to have Matlab installed on your computer, along with the Statistics and Machine Learning Toolbox, which contains the functions and tools used here for feature selection. You also need some data to work with. You can use your own data, a built-in dataset such as the Fisher iris dataset (load fisheriris in the Matlab command window), or the UCI wine quality dataset imported with readtable as in the previous section.
Now that you have everything ready, let’s begin with the first feature selection method: Filter Methods.
4.1. Filter Methods
Filter methods are feature selection methods that evaluate the features independently based on some criteria, such as correlation, information gain, or variance. Filter methods do not use a machine learning model to select the features, but rather rely on the intrinsic properties of the data. Filter methods can help you reduce the number of features, eliminate irrelevant or redundant features, and speed up the machine learning process.
How do filter methods work? The basic idea of filter methods is to rank or score the features based on their relevance to the machine learning task, and then select the top-ranked or highest-scored features. The relevance of a feature can be measured by different criteria, such as the following (a short sketch after this list shows how to compute some of them directly):
- Correlation: The correlation between a feature and the target variable measures how linearly related they are. A high correlation means that the feature and the target variable change together, while a correlation near zero means that they are linearly unrelated. Correlation can be positive or negative, depending on the direction of the relationship, and can be calculated with the corr function in Matlab.
- Information gain: The information gain of a feature measures how much the feature reduces the uncertainty (entropy) of the target variable. A high information gain means that the feature provides a lot of information about the target variable, while a low information gain means that the feature is largely irrelevant. Matlab has no single infoGain function, but mutual-information-based rankings are available through functions such as fscmrmr.
- Variance: The variance of a feature measures how much the feature values spread around the mean. A feature with high variance takes diverse values, while a feature with near-zero variance is almost constant and carries little information, so low-variance features are often dropped. Variance can be calculated with the var function in Matlab.
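As promised above, here is a minimal sketch that computes two of these criteria directly on the iris data; encoding the class labels as the numbers 1 to 3 so that corr can be applied is a simplification used only for illustration.

% A quick filter-style look at the iris features using two simple criteria
load fisheriris
X = meas;

% Encode the class labels as 1, 2, 3 (a simplification so corr can be used)
[~, ~, ynum] = unique(species);

% Absolute correlation of each feature with the encoded target
featCorr = abs(corr(X, ynum));

% Variance of each feature
featVar = var(X)';

% Show both criteria side by side, one row per feature
disp(table(featCorr, featVar, ...
    'VariableNames', {'AbsCorrelation', 'Variance'}, ...
    'RowNames', {'SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'}))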
For a ready-made filter ranking, Matlab provides the fscchi2 function, which ranks features using chi-square tests of independence between each feature and a categorical target variable. Calling [idx, fscores] = fscchi2(X, y) returns the feature indices sorted from most to least important, together with a score for each feature equal to the negative log of its chi-square p-value, so a larger score means stronger evidence that the feature depends on the target. You can use the fscchi2 function as follows:
% Load the Fisher iris dataset
load fisheriris
X = meas;      % sepal length, sepal width, petal length, petal width
y = species;   % setosa, versicolor, virginica

% Rank the features with the chi-square filter method
% idx: features ranked by importance, fscores: score of each feature (-log of p-value)
[idx, fscores] = fscchi2(X, y);

% Plot the feature scores (in the original feature order)
bar(fscores)
xlabel('Feature')
ylabel('Score')
title('Chi-Square Filter Scores for the Iris Data')
xticklabels({'Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'})
The output of the code is a bar plot of the feature scores, which shows how relevant each feature is to the target variable. You can see that the petal length and the petal width have the highest scores, while the sepal length and the sepal width have the lowest scores. This means that the petal features are more important for classifying the iris flowers than the sepal features. Because each score is the negative log of a chi-square p-value, you can recover the p-values as exp(-fscores) if you want to check statistical significance, and idx lists the features from most to least important.
As you can see, filter methods are simple and fast methods for feature selection that can help you reduce the number of features, eliminate irrelevant or redundant features, and speed up the machine learning process. In the next section, you will learn a feature selection method that follows a different strategy: wrapper methods.
4.2. Wrapper Methods
Wrapper methods are feature selection methods that use a machine learning model to evaluate the features based on the prediction accuracy or error. Wrapper methods do not rely on the intrinsic properties of the data, but rather on the performance of the machine learning model. Wrapper methods can help you find the optimal subset of features that best fit your machine learning model and task.
How do wrapper methods work? The basic idea of wrapper methods is to search for the best subset of features that maximizes the prediction accuracy or minimizes the prediction error of the machine learning model. To do this, wrapper methods use different search strategies, such as forward selection, backward elimination, or recursive feature elimination. These strategies start with an initial subset of features, either empty or full, and then iteratively add or remove features until a stopping criterion is met. The features are selected or discarded based on the performance of the machine learning model on a validation set or using cross-validation.
How do you use wrapper methods in Matlab? Matlab provides a built-in function for wrapper methods, called sequentialfs, which takes a criterion function (a function handle that trains a model on one part of the data and returns its error on the held-out part), a matrix of data, and a vector of labels, and returns a logical vector of selected features. By default it performs forward selection with 10-fold cross-validation. You can use the sequentialfs function as follows:
% Load the Fisher iris dataset
load fisheriris
X = meas;      % sepal length, sepal width, petal length, petal width
y = species;   % setosa, versicolor, virginica

% Criterion function: misclassification count of a linear discriminant classifier
critfun = @(Xtrain, ytrain, Xtest, ytest) ...
    sum(~strcmp(ytest, classify(Xtest, Xtrain, ytrain)));

% Perform forward sequential feature selection (10-fold cross-validation by default)
[fsel, fhist] = sequentialfs(critfun, X, y);

% Plot the selection history
plot(fhist.Crit, 'o-')
xlabel('Number of Features')
ylabel('Cross-Validated Misclassification Rate')
title('Wrapper Method on the Iris Data')
The output of the code is a plot of the feature selection history, which shows how the cross-validated misclassification rate changes as features are added one at a time. The selection stops as soon as adding another feature no longer improves the criterion, so the number of points in the plot tells you how many features were kept. You can also inspect fsel, a logical vector with a true entry for every selected feature.
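sequentialfs also supports backward elimination and lets you control the cross-validation scheme. Here is a minimal sketch (the 5-fold stratified partition is an arbitrary choice) that starts from all four iris features and removes them one at a time.

% Backward elimination with an explicit stratified cross-validation partition
load fisheriris
X = meas;
y = species;

% Criterion: misclassification count of a linear discriminant classifier
critfun = @(Xtrain, ytrain, Xtest, ytest) ...
    sum(~strcmp(ytest, classify(Xtest, Xtrain, ytrain)));

% 5-fold partition stratified by class
c = cvpartition(y, 'KFold', 5);

% Start from all features and drop them while the criterion does not get worse
[fsel, fhist] = sequentialfs(critfun, X, y, 'cv', c, 'direction', 'backward');
disp(find(fsel))   % indices of the features that survive the elimination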
As you can see, wrapper methods are powerful and flexible methods for feature selection that can help you find the subset of features that best fits your machine learning model and task. However, wrapper methods can also be computationally expensive and prone to overfitting, as they require many evaluations of the machine learning model and may select features that are specific to the validation set. Therefore, it is advisable to use wrapper methods with caution and compare them with other feature selection methods. In the next section, you will learn a feature selection method that takes yet another strategy: embedded methods.
4.3. Embedded Methods
In this section, you will learn how to use embedded methods for feature selection in Matlab. Embedded methods are methods that integrate the feature selection process into the machine learning model, such as using regularization or decision trees. Embedded methods can be more efficient and accurate than filter or wrapper methods, as they can optimize the features and the model simultaneously.
One of the most common and effective embedded methods is regularization. Regularization is a technique that adds a penalty term to the objective function of the machine learning model, such as the loss function or the likelihood function, to prevent overfitting and reduce the complexity of the model. Regularization can also help to select the most relevant features by shrinking the coefficients of the less important features to zero or near zero, effectively removing them from the model.
There are different types of regularization, such as L1, L2, or elastic net, that apply different penalties to the coefficients of the features. L1 regularization, also known as lasso, applies an absolute value penalty, which tends to produce sparse solutions with many zero coefficients. L2 regularization, also known as ridge, applies a squared value penalty, which tends to produce dense solutions with small coefficients. Elastic net regularization combines both L1 and L2 penalties, which can balance the sparsity and stability of the solutions.
In Matlab, you can use the lasso, ridge, or lassoglm functions to perform regularization for linear regression or logistic regression models. These functions return the coefficients of the features for a range of values of the regularization parameter, which controls the strength of the penalty. You can plot the coefficients against the regularization parameter (for example with lassoPlot) to see how features are selected or eliminated as the penalty grows. Cross-validation is built into lasso and lassoglm through the 'CV' option, which lets you find the value of the regularization parameter that minimizes the cross-validated prediction error.
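Here is a minimal sketch of lasso with built-in cross-validation, using the built-in carsmall dataset to predict fuel economy; the choice of dataset and of 10 folds is just for illustration.

% Lasso with 10-fold cross-validation on the carsmall dataset
load carsmall
X = [Acceleration Cylinders Displacement Horsepower Model_Year Weight];
y = MPG;

% Remove observations with missing values
ok = all(~isnan([X y]), 2);
X = X(ok, :);
y = y(ok);

% Fit the lasso path with cross-validation
[B, FitInfo] = lasso(X, y, 'CV', 10);

% Coefficients at the lambda with the minimum cross-validated MSE
bestB = B(:, FitInfo.IndexMinMSE);
disp(bestB)   % features with a zero coefficient have been dropped

% Trace plot of the coefficients along the regularization path
lassoPlot(B, FitInfo, 'PlotType', 'Lambda', 'XScale', 'log');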
Another common and effective embedded method is decision trees. Decision trees are machine learning models that split the data into smaller and smaller subsets based on the values of the features, until the subsets are homogeneous or reach a certain size. Decision trees can also help to select the most relevant features by choosing the features that best split the data at each node of the tree, based on some criteria, such as information gain, the Gini index, or chi-square.
In Matlab, you can use the fitctree or fitrtree functions to fit a decision tree for classification or regression tasks. These functions return a tree object that records which feature and threshold are used to split the data at each node. You can use the view function (for example, view(tree, 'Mode', 'graph')) to visualize the tree and see which features it actually uses, and the prune function to cut back branches that do not improve the prediction accuracy. The predictorImportance method summarizes how much each feature contributes to the splits, which gives you a feature ranking directly from the fitted tree, as the sketch below shows.
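Here is a minimal sketch that fits a classification tree on the iris data and reads a feature ranking off the fitted tree; pruning by one level at the end is only meant to show the call.

% Fit a classification tree on the iris data and rank features by importance
load fisheriris
tree = fitctree(meas, species, 'PredictorNames', ...
    {'SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'});

% Importance of each predictor, estimated from the splits in the tree
imp = predictorImportance(tree);
bar(imp)
xticklabels(tree.PredictorNames)
ylabel('Predictor Importance')
title('Feature Importance from a Decision Tree')

% Visualize the tree and prune it back by one level
view(tree, 'Mode', 'graph')
smallerTree = prune(tree, 'Level', 1);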
As you can see, embedded methods are powerful and convenient methods for feature selection that can optimize the features and the model at the same time. In the next section, you will learn how to compare and evaluate the results of different feature extraction and feature selection methods.
5. Comparison and Evaluation of Feature Extraction and Selection Methods
In this section, you will learn how to compare and evaluate the results of different feature extraction and feature selection methods in Matlab. Comparing and evaluating the results of different methods can help you choose the best method for your machine learning task and data. You will learn how to use different criteria and metrics to measure the quality and importance of the features, the performance and accuracy of the machine learning model, and the trade-off between complexity and efficiency.
One of the criteria that you can use to compare and evaluate the features is the dimensionality reduction ratio. This is the ratio between the number of original features and the number of extracted or selected features. A higher dimensionality reduction ratio means that you have reduced the dimensionality of your data more effectively, which can improve the efficiency and interpretability of your machine learning model. However, a higher dimensionality reduction ratio may also mean that you have lost some information or discriminative power from your data, which can affect the performance and accuracy of your machine learning model. Therefore, you need to find a balance between reducing the dimensionality and preserving the information of your data.
In Matlab, you can calculate the dimensionality reduction ratio by dividing the number of original features by the number of extracted or selected features. For example, if you have 100 original features and you extract or select 10 features, the dimensionality reduction ratio is 100/10 = 10. You can then compare the dimensionality reduction ratio of different methods and see which one has the highest or lowest value.
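Here is a minimal sketch that ties this ratio to the PCA example from Section 3.1, keeping the components that cover 95% of the variance (the threshold is again just a convention).

% Dimensionality reduction ratio for PCA on the iris data
load fisheriris
[~, ~, ~, ~, pcexplained] = pca(meas);

% Number of components needed to keep at least 95% of the variance
numKept = find(cumsum(pcexplained) >= 95, 1);

% Ratio of original to retained features
ratio = size(meas, 2) / numKept;
fprintf('Kept %d of %d features: reduction ratio %.1f\n', numKept, size(meas, 2), ratio)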
Another criterion that you can use to compare and evaluate the features is the feature ranking or scoring. This is the ranking or scoring of the features based on their relevance or importance to the machine learning task. A higher feature ranking or scoring means that the feature is more relevant or important, which can improve the performance and accuracy of your machine learning model. However, a higher feature ranking or scoring may also mean that the feature is more correlated or redundant with other features, which can affect the efficiency and interpretability of your machine learning model. Therefore, you need to find a balance between selecting the most relevant or important features and avoiding the most correlated or redundant features.
In Matlab, you can obtain the feature ranking or scoring from different methods and criteria, such as variance, information gain, mutual information, correlation, chi-square, Fisher score, ReliefF, regularization, or decision trees. For example, if you use the lasso function to perform regularization, you can obtain the coefficients of the features for different values of the regularization parameter. The coefficients indicate the importance of the features, and the features with zero or near-zero coefficients are eliminated from the model. You can then rank the features by the size of their coefficients and see which ones matter most, as in the sketch below.
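Here is a minimal sketch of such a ranking on the carsmall data from the previous section; the features are standardized first so that the coefficient sizes are comparable across features.

% Rank features by the size of their lasso coefficients (standardized features)
load carsmall
X = [Acceleration Cylinders Displacement Horsepower Model_Year Weight];
y = MPG;
ok = all(~isnan([X y]), 2);         % drop observations with missing values
Xz = zscore(X(ok, :));              % standardize so coefficient sizes are comparable
y = y(ok);

% Lasso path with cross-validation, coefficients at the best lambda
[B, FitInfo] = lasso(Xz, y, 'CV', 10);
bestB = B(:, FitInfo.IndexMinMSE);

% Sort the features from most to least important by absolute coefficient
names = {'Acceleration'; 'Cylinders'; 'Displacement'; 'Horsepower'; 'Model_Year'; 'Weight'};
[~, order] = sort(abs(bestB), 'descend');
disp(table(names(order), bestB(order), 'VariableNames', {'Feature', 'Coefficient'}))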
A third criterion that you can use to compare and evaluate the features is the model performance or accuracy. This is the performance or accuracy of the machine learning model that uses the extracted or selected features to make predictions. A higher model performance or accuracy means that the machine learning model can make more accurate predictions, which is the ultimate goal of any machine learning task. However, a higher model performance or accuracy may also mean that the machine learning model is more complex or overfitted, which can affect the efficiency and generalizability of the machine learning model. Therefore, you need to find a balance between improving the performance or accuracy and avoiding the complexity or overfitting of the machine learning model.
In Matlab, you can measure the model performance or accuracy using different metrics, such as mean squared error, root mean squared error, mean absolute error, accuracy, precision, recall, F1-score, ROC curve, or confusion matrix. For example, if you use the fitctree function to fit a decision tree for a classification task, you can compute its accuracy on the training data with resubLoss and estimate its accuracy on unseen data with cross-validation, as in the sketch below. The prediction accuracy is the proportion of correctly classified instances, and a higher accuracy generally means a better model. You can then compare the accuracy obtained with different feature extraction or selection methods and see which one performs best.
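Here is a minimal sketch that estimates the cross-validated accuracy of a tree on the iris data and prints its confusion matrix; the 10 folds are an arbitrary but common choice.

% Cross-validated accuracy and confusion matrix for a tree on the iris data
load fisheriris
tree = fitctree(meas, species);

% 10-fold cross-validated accuracy
cvtree = crossval(tree, 'KFold', 10);
acc = 1 - kfoldLoss(cvtree);
fprintf('Cross-validated accuracy: %.3f\n', acc)

% Confusion matrix from the cross-validated predictions
pred = kfoldPredict(cvtree);
confusionmat(species, pred)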
As you can see, comparing and evaluating the results of different feature extraction and feature selection methods can help you choose the best method for your machine learning task and data. You can use different criteria and metrics to measure the quality and importance of the features, the performance and accuracy of the machine learning model, and the trade-off between complexity and efficiency. In the next and final section, you will learn how to conclude your blog and summarize the main points.
6. Conclusion
In this blog, you have learned how to use Matlab for feature extraction and feature selection, which are essential steps in machine learning. You have learned different methods and algorithms for extracting and selecting the most relevant features from your data, such as principal component analysis, linear discriminant analysis, independent component analysis, filter methods, wrapper methods, and embedded methods. You have also learned how to compare and evaluate the results of different methods using different criteria and metrics, such as dimensionality reduction ratio, feature ranking or scoring, and model performance or accuracy.
By applying feature extraction and feature selection techniques, you can improve the quality and efficiency of your machine learning model, as well as the performance and accuracy of your predictions. You can also reduce the dimensionality and complexity of your data, and enhance the interpretability and understanding of your machine learning model. Feature extraction and feature selection are important skills that can help you solve various machine learning problems and tasks.
We hope that this blog has been useful and informative for you. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading and happy learning!