This blog teaches you how to select and train the best machine learning model for your data using AWS AutoML, a service that automates the end-to-end machine learning workflow.
1. Overview of Model Selection and Training in AWS AutoML
In this first section, you will get an overview of what AWS AutoML automates and why that matters, before moving on to the hands-on steps of selecting and training the best machine learning model for your data.
Machine learning is the process of teaching a computer system to learn from data and make predictions or decisions. However, building a machine learning model can be challenging and time-consuming. You need to:
- Prepare and analyze your data
- Choose the right algorithm and framework
- Train and tune your model
- Deploy and monitor your model
AWS AutoML simplifies this process by automatically performing these tasks for you. You just need to provide your data and specify your problem type and objective metric. AWS AutoML will then explore different algorithms and configurations, train and evaluate multiple models, and select the best one for you.
AWS AutoML can handle various types of problems, such as regression, classification, and natural language processing. It can also handle different types of data, such as tabular, text, and image. AWS AutoML supports popular frameworks such as TensorFlow, PyTorch, and MXNet.
By using AWS AutoML, you can save time and resources, and focus on the business value of your machine learning solution. You can also improve the quality and performance of your model by leveraging the best practices and expertise of AWS.
Are you ready to get started with AWS AutoML? In the next section, you will learn how to create an AutoML job, which is the main unit of work in AWS AutoML.
2. How to Create an AutoML Job
An AutoML job is the main unit of work in AWS AutoML. It represents the entire process of finding, training, and selecting the best machine learning model for your data. To create an AutoML job, you need to use the AWS AutoML console or the AWS SDK for Python (Boto3).
In this section, you will learn how to create an AutoML job using the AWS AutoML console. The console provides a user-friendly interface that guides you through the steps of creating an AutoML job. You will need to:
- Specify the data source and target attribute
- Choose the problem type and objective metric
- Configure the AutoML job settings
Before you start, make sure you have an AWS account and a data set that you want to use for your AutoML job. You can use your own data set or one of the sample data sets provided by AWS. For this tutorial, we will use the Bank Marketing Data Set, which contains information about customers of a Portuguese bank and whether they subscribed to a term deposit or not.
Let’s begin by logging in to the AWS AutoML console and clicking on the Create AutoML job button. You can give your AutoML job a name and a description. You can also choose the IAM role that will grant permissions to your AutoML job. For this tutorial, we will use the default settings and name our AutoML job bank-marketing-automl.
Next, you will need to specify the data source and target attribute for your AutoML job. This is the first step of creating an AutoML job and it is very important to do it correctly. In the next section, you will learn how to do it.
2.1. Specify the Data Source and Target Attribute
The data source is the location where your data is stored. AWS AutoML supports data sources in Amazon S3, Amazon Athena, and Amazon Redshift. For this tutorial, we will use an Amazon S3 bucket as our data source.
The target attribute is the column in your data that you want to predict or classify. AWS AutoML will use this column as the output of your machine learning model. For this tutorial, we will use the y column as our target attribute, which indicates whether a customer subscribed to a term deposit or not.
To specify the data source and target attribute, you need to follow these steps:
- Click on the Specify data source and target attribute button on the AWS AutoML console.
- Select the Amazon S3 option as your data source type.
- Enter the S3 bucket name and prefix where your data is stored. For this tutorial, we will use the sample data set provided by AWS, which is stored under the s3://sagemaker-sample-files/datasets/tabular/uci_bank_marketing prefix.
- Click on the Next button to proceed.
- Select the bank-additional-full.csv file as your data source file. This file contains the data of 41,188 customers and 21 attributes.
- Click on the Next button to proceed.
- Select the y column as your target attribute. This column has two possible values: yes or no.
- Click on the Next button to proceed.
Here, you can review the summary of your data source and target attribute. You can also see a sample of your data and the data statistics. You can edit or delete your data source and target attribute if you need to.
Once you are satisfied with your data source and target attribute, click on the Finish button to complete this step. You have successfully specified the data source and target attribute for your AutoML job.
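If you prefer to script this step with the AWS SDK for Python (Boto3), the data source and target attribute map to the InputDataConfig parameter of the create_auto_ml_job call. The sketch below only builds that piece of the request as a plain dictionary; the S3 URI and column name are the tutorial's Bank Marketing examples, and you should verify the exact field names against the current Boto3 reference:

```python
# Sketch of the input-data portion of a create_auto_ml_job request.
# The S3 prefix and target column match the tutorial's Bank Marketing example.
input_data_config = [
    {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://sagemaker-sample-files/datasets/tabular/uci_bank_marketing",
            }
        },
        "TargetAttributeName": "y",  # the column the model will learn to predict
    }
]
```

With credentials configured, this list would be passed as the InputDataConfig argument when creating the job.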
In the next section, you will learn how to choose the problem type and objective metric for your AutoML job.
2.2. Choose the Problem Type and Objective Metric
The problem type is the type of machine learning task that you want to solve with your data. AWS AutoML supports three main problem types: regression, classification, and natural language processing. The problem type determines the algorithms and frameworks that AWS AutoML will use to train your model.
The objective metric is the measure of how well your model performs on your data. AWS AutoML will use this metric to compare and rank the candidate models that it generates. The objective metric depends on the problem type and the goal of your machine learning solution.
To choose the problem type and objective metric, you need to follow these steps:
- Click on the Choose problem type and objective metric button on the AWS AutoML console.
- Select the problem type that matches your data and goal. For this tutorial, we will select the Binary classification problem type, since we want to predict whether a customer subscribed to a term deposit or not.
- Select the objective metric that you want to optimize for your model. For this tutorial, we will select the F1 macro objective metric, which averages the per-class F1 scores (each per-class F1 being the harmonic mean of precision and recall), so both classes are weighted equally.
- Click on the Next button to proceed.
Here, you can review the summary of your problem type and objective metric. You can also see the list of algorithms and frameworks that AWS AutoML will use to train your model. You can edit or delete your problem type and objective metric if you need to.
Once you are satisfied with your problem type and objective metric, click on the Finish button to complete this step. You have successfully chosen the problem type and objective metric for your AutoML job.
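To make the F1 macro metric concrete, here is a small pure-Python illustration: each class gets its own F1 score (the harmonic mean of that class's precision and recall), and F1 macro is their unweighted average. The precision and recall numbers below are made up for illustration only:

```python
def f1(precision, recall):
    """Per-class F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-class precision/recall for the 'yes' and 'no' classes.
f1_yes = f1(precision=0.60, recall=0.50)
f1_no = f1(precision=0.95, recall=0.97)

# F1 macro: unweighted mean of the per-class F1 scores, so the minority
# 'yes' class counts as much as the majority 'no' class.
f1_macro = (f1_yes + f1_no) / 2
print(round(f1_macro, 4))
```

Because the average is unweighted, F1 macro is a sensible choice for the Bank Marketing Data Set, where far fewer customers subscribe than not.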
In the next section, you will learn how to configure the AutoML job settings for your AutoML job.
2.3. Configure the AutoML Job Settings
The last step of creating an AutoML job is to configure the AutoML job settings. These settings control how AWS AutoML will explore, train, and select the models for your data. You can adjust these settings to optimize the AutoML job for your specific needs and preferences.
Some of the settings that you can configure are:
- The completion criteria, which determine when the AutoML job will stop. You can set a maximum time limit, a maximum number of candidates, or a target objective metric value.
- The security configuration, which specifies the encryption and access policies for your AutoML job. You can use the default settings or create your own security configuration.
- The tags, which are key-value pairs that you can use to organize and identify your AutoML job. You can add up to 50 tags per AutoML job.
For this tutorial, we will use the default settings for the completion criteria and the security configuration. We will also add a tag with the key project and the value bank-marketing to label our AutoML job.
Here, you can check the details of your AutoML job and make any changes if needed. You can also download a JSON file that contains the configuration of your AutoML job. This file can be useful if you want to create similar AutoML jobs in the future.
When you are ready, click on the Create AutoML job button. This will start the AutoML job and take you to the AutoML job dashboard, where you can monitor the progress of your AutoML job.
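For readers who prefer scripting, the console choices made in this section roughly correspond to a single create_auto_ml_job call in Boto3. This sketch only assembles the request dictionary; the role ARN and output bucket are placeholders, and the exact field names should be checked against the current Boto3 AutoML reference:

```python
# Assemble a create_auto_ml_job request mirroring the console choices above.
# The RoleArn and output bucket are placeholders, not real resources.
request = {
    "AutoMLJobName": "bank-marketing-automl",
    "InputDataConfig": [
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://sagemaker-sample-files/datasets/tabular/uci_bank_marketing",
                }
            },
            "TargetAttributeName": "y",
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://your-bucket-name/automl-output"},
    "ProblemType": "BinaryClassification",
    "AutoMLJobObjective": {"MetricName": "F1macro"},
    "AutoMLJobConfig": {
        "CompletionCriteria": {"MaxCandidates": 20}  # example stopping rule
    },
    "RoleArn": "arn:aws:iam::123456789012:role/your-automl-role",  # placeholder
    "Tags": [{"Key": "project", "Value": "bank-marketing"}],
}

# With AWS credentials configured, the job would be started with:
#   import boto3
#   boto3.client("sagemaker").create_auto_ml_job(**request)
```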
Congratulations, you have successfully created an AutoML job using the AWS AutoML console. In the next section, you will learn how to monitor the AutoML job progress and see the results of the model exploration and training.
3. How to Monitor the AutoML Job Progress
Once your AutoML job has been created and configured, it launches and begins the process of finding, training, and selecting the best machine learning model for your data.
While your AutoML job is running, you can monitor its progress and see the results on the AWS AutoML console. You can see the status of your AutoML job, the elapsed time, the number of candidates generated, and the best candidate so far. You can also see the details of each candidate, such as the algorithm, the framework, the objective metric value, and the inference latency.
To monitor the AutoML job progress, you need to follow these steps:
- Click on the AutoML jobs tab on the AWS AutoML console.
- Select the AutoML job that you want to monitor from the list of AutoML jobs.
- Click on the View job details button to see the overview of your AutoML job.
- Click on the Candidates tab to see the list of candidates generated by your AutoML job.
- Click on the Best candidate tab to see the details of the best candidate selected by your AutoML job.
Here, you can review the progress and results of your AutoML job. You can also download the candidate definitions, the model artifacts, and the inference containers for each candidate. You can stop your AutoML job at any time by clicking on the Stop AutoML job button.
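The same monitoring can be done in code by polling the job status. To keep this sketch runnable without AWS credentials, it takes the describe call as a function argument; with Boto3 you would pass something like lambda: sagemaker.describe_auto_ml_job(AutoMLJobName="bank-marketing-automl"):

```python
import time

def wait_for_automl_job(describe_fn, poll_seconds=30, sleep_fn=time.sleep):
    """Poll an AutoML job until it leaves the running states.

    describe_fn should return a dict with an 'AutoMLJobStatus' key,
    as sagemaker.describe_auto_ml_job does.
    """
    while True:
        status = describe_fn()["AutoMLJobStatus"]
        if status in ("Completed", "Failed", "Stopped"):
            return status
        sleep_fn(poll_seconds)  # wait before checking again

# Stub demonstration: the job reports InProgress twice, then Completed.
statuses = iter(["InProgress", "InProgress", "Completed"])
result = wait_for_automl_job(lambda: {"AutoMLJobStatus": next(statuses)},
                             sleep_fn=lambda s: None)
print(result)  # Completed
```

Injecting the sleep function also makes the loop easy to unit test, as the stub above shows.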
Once your AutoML job is completed, you can compare and select the best candidate model for your data. In the next section, you will learn how to do it.
4. How to Compare and Select the Best Candidate Model
After you create an AutoML job, AWS AutoML will start exploring, training, and evaluating different models for your data. You can monitor the progress of your AutoML job from the AutoML job dashboard.
Here, you can see the status of your AutoML job, the completion criteria, the data source, the problem type, and the objective metric. You can also see the list of candidate models that AWS AutoML has generated for your data. Each candidate model has a name, a score, a rank, and a status.
The score is the value of the objective metric that AWS AutoML uses to measure the performance of the model. The rank is the relative position of the model in the list, based on the score. The status indicates whether the model is in progress, completed, or failed.
You can click on any candidate model to see more details about it, such as the algorithm, the framework, the hyperparameters, the feature importance, and the performance metrics. You can also download the model artifacts, such as the model definition and the inference code.
By comparing the candidate models, you can select the best one for your data and your problem. The best model is not necessarily the one with the highest score or the lowest rank. You may also consider other factors, such as the complexity, the interpretability, the robustness, and the generalizability of the model.
For example, you may prefer a simpler model that is easier to understand and explain, even if it has a slightly lower score than a more complex model. Or you may prefer a more robust model that can handle outliers and noise, even if it has a slightly higher rank than a more sensitive model.
Ultimately, the best model is the one that meets your expectations and requirements. You can use your domain knowledge and your business goals to guide your decision. You can also use the AWS AutoML console or the AWS SDK for Python (Boto3) to test and validate your model on new data.
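Candidate comparison can likewise be scripted. The sketch below ranks a list of candidate records by their objective metric value; the record shape loosely follows what SageMaker's list_candidates returns, but treat the field names and scores as illustrative:

```python
# Hypothetical candidate records, shaped loosely like list_candidates output.
candidates = [
    {"CandidateName": "cand-001",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "F1macro", "Value": 0.71}},
    {"CandidateName": "cand-002",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "F1macro", "Value": 0.75}},
    {"CandidateName": "cand-003",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "F1macro", "Value": 0.68}},
]

# Rank candidates from best to worst on the objective metric.
ranked = sorted(candidates,
                key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"],
                reverse=True)
best = ranked[0]
print(best["CandidateName"])  # cand-002
```

As noted above, the top score alone is not the whole story: simplicity, latency, and interpretability can justify picking a lower-ranked candidate.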
In the next section, you will learn how to evaluate the model performance on test data and see how well your model generalizes to unseen data.
5. How to Evaluate the Model Performance on Test Data
After you have selected the best candidate model from the AutoML job, you may want to evaluate its performance on a separate test data set. This can help you assess how well the model generalizes to new and unseen data, and avoid overfitting or underfitting.
To evaluate the model performance on test data, you need to use the AWS AutoML console or the AWS SDK for Python (Boto3). In this section, you will learn how to do it using the AWS AutoML console. You will need to:
- Upload your test data set to Amazon S3
- Create a batch transform job using the best candidate model
- Download and analyze the prediction results
Before you start, make sure you have a test data set that you want to use for evaluation. You can use your own test data set or one of the sample test data sets provided by AWS. For this tutorial, we will use the Bank Marketing Test Data Set, which contains information about customers of a Portuguese bank and whether they subscribed to a term deposit or not. The test data set has the same format and features as the training data set, but the target attribute (y) is not provided.
Let’s begin by uploading our test data set to Amazon S3, which is a secure and scalable cloud storage service. You can use the AWS AutoML console or the AWS CLI to upload your test data set. For this tutorial, we will use the AWS AutoML console and follow these steps:
- Go to the Amazon S3 console and click on the Create bucket button.
- Give your bucket a name and a region, and click on the Create bucket button.
- Select your bucket and click on the Upload button.
- Click on the Add files button and select your test data set file.
- Click on the Upload button and wait for the upload to complete.
Next, you will need to create a batch transform job using the best candidate model from your AutoML job. A batch transform job is a process that applies your model to a large data set and saves the predictions in Amazon S3. You can use the AWS AutoML console or the AWS SDK for Python (Boto3) to create a batch transform job. For this tutorial, we will use the AWS AutoML console and follow these steps:
- Go to the AWS AutoML console and select your AutoML job.
- Click on the Best candidate tab and then on the Create batch transform job button.
- Give your batch transform job a name and a description.
- Specify the S3 location of your test data set and the output path where the predictions will be saved.
- Choose the instance type and the number of instances that you want to use for the batch transform job.
- Click on the Create batch transform job button and wait for the job to complete.
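The console steps above correspond to a create_transform_job request in Boto3. This sketch only builds the request dictionary; the model name, buckets, and instance choices are placeholders, so verify the field names against the current Boto3 reference before using them:

```python
# Sketch of a create_transform_job request; names and buckets are placeholders.
transform_request = {
    "TransformJobName": "bank-marketing-batch-transform",
    "ModelName": "bank-marketing-best-candidate",  # placeholder model name
    "TransformInput": {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://your-bucket-name/test-data",
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line",  # send one CSV row per inference request
    },
    "TransformOutput": {"S3OutputPath": "s3://your-bucket-name/predictions"},
    "TransformResources": {"InstanceType": "ml.m5.large", "InstanceCount": 1},
}

# With AWS credentials configured:
#   import boto3
#   boto3.client("sagemaker").create_transform_job(**transform_request)
```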
Finally, you will need to download and analyze the prediction results from Amazon S3. You can use the AWS AutoML console or the AWS CLI to download the prediction results. For this tutorial, we will use the AWS AutoML console and follow these steps:
- Go to the Amazon S3 console and select the output bucket and folder of your batch transform job.
- Select the prediction file and click on the Download button.
- Open the prediction file with a text editor or a spreadsheet program.
- Compare the predictions with the actual outcomes of the test data set (if available).
- Calculate the accuracy, precision, recall, and F1-score of the model on the test data set.
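The last two steps above, comparing predictions with actual outcomes and computing the metrics, take only a few lines of pure Python. The labels below are made-up stand-ins for the real test data:

```python
def binary_metrics(actual, predicted, positive="yes"):
    """Accuracy, plus precision/recall/F1 for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels for illustration; in practice, read these from the
# test data set and the downloaded prediction file.
actual    = ["no", "no", "yes", "no", "yes", "no"]
predicted = ["no", "yes", "yes", "no", "no", "no"]
print(binary_metrics(actual, predicted))
```

For larger files you would typically load both columns with pandas and apply the same logic.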
By evaluating the model performance on test data, you can get a better understanding of how well the model can handle new and unseen data. You can also identify the strengths and weaknesses of the model, and decide whether you need to tune the model hyperparameters or not. In the next section, you will learn how to tune the model hyperparameters using AWS AutoML.
6. How to Tune the Model Hyperparameters
Hyperparameters are the settings that control the behavior and performance of a machine learning model. They include parameters such as the learning rate, the number of epochs, the batch size, the regularization factor, and so on. Hyperparameter tuning is the process of finding the optimal values for these settings that maximize the model performance on a given data set.
AWS AutoML automatically tunes the hyperparameters of the candidate models during the AutoML job. However, you may want to further tune the hyperparameters of the best candidate model to improve its performance or reduce its complexity. You can do this by using the AWS AutoML console or the AWS SDK for Python (Boto3). In this section, you will learn how to do it using the AWS AutoML console. You will need to:
- Create a tuning job using the best candidate model
- Specify the hyperparameters to tune and the objective metric to optimize
- Monitor the tuning job progress and results
Before you start, make sure you have selected the best candidate model from your AutoML job and that you have a validation data set that you want to use for tuning. You can use your own validation data set or one of the sample validation data sets provided by AWS. For this tutorial, we will use the Bank Marketing Validation Data Set, which contains information about customers of a Portuguese bank and whether they subscribed to a term deposit or not. The validation data set has the same format and features as the training data set, and it includes the target attribute (y), since the tuning job needs the true labels to compute the objective metric.
Let’s begin by creating a tuning job using the best candidate model from your AutoML job. A tuning job is a process that runs multiple training jobs with different hyperparameter values and compares the results based on the objective metric. You can use the AWS AutoML console or the AWS SDK for Python (Boto3) to create a tuning job. For this tutorial, we will use the AWS AutoML console and follow these steps:
- Go to the AWS AutoML console and select your AutoML job.
- Click on the Best candidate tab and then on the Create tuning job button.
- Give your tuning job a name and a description.
- Specify the S3 location of your validation data set and the output path where the tuning results will be saved.
- Choose the instance type and the number of instances that you want to use for the tuning job.
- Click on the Create tuning job button and wait for the job to start.
Next, you will need to specify the hyperparameters to tune and the objective metric to optimize for your tuning job. You can choose from a predefined list of hyperparameters that are relevant for your problem type and framework, or you can add your own custom hyperparameters. You can also choose from a predefined list of objective metrics that are relevant for your problem type and framework, or you can define your own custom objective metric. You can use the AWS AutoML console or the AWS SDK for Python (Boto3) to specify the hyperparameters and the objective metric. For this tutorial, we will use the AWS AutoML console and follow these steps:
- Go to the AWS AutoML console and select your tuning job.
- Click on the Hyperparameters tab and then on the Add hyperparameter button.
- Select a hyperparameter from the list or enter a custom name.
- Choose the type of the hyperparameter (continuous, integer, or categorical).
- Specify the range or the list of values for the hyperparameter.
- Repeat the steps for each hyperparameter that you want to tune.
- Click on the Objective metric tab and then on the Edit button.
- Select an objective metric from the list or enter a custom name.
- Choose the direction of the objective metric (maximize or minimize).
- Click on the Save button and wait for the tuning job to resume.
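The tuning choices above map onto SageMaker's create_hyper_parameter_tuning_job request. This sketch builds the tuning configuration for two hypothetical hyperparameters (learning rate and number of boosting rounds); the field names follow the Boto3 API, but verify them against the current reference before relying on them:

```python
# Sketch of a hyperparameter tuning configuration; the hyperparameter
# names, ranges, and metric name are illustrative, not prescriptive.
tuning_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Maximize",
        "MetricName": "validation:f1",
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 20,  # total training jobs to run
        "MaxParallelTrainingJobs": 2,   # how many run at once
    },
    "ParameterRanges": {
        "ContinuousParameterRanges": [
            {"Name": "learning_rate", "MinValue": "0.001", "MaxValue": "0.1"}
        ],
        "IntegerParameterRanges": [
            {"Name": "num_round", "MinValue": "50", "MaxValue": "500"}
        ],
    },
}

# With AWS credentials configured, this would be submitted as:
#   boto3.client("sagemaker").create_hyper_parameter_tuning_job(
#       HyperParameterTuningJobName="bank-marketing-tuning",
#       HyperParameterTuningJobConfig=tuning_config,
#       TrainingJobDefinition=...,  # derived from the best candidate
#   )
```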
Finally, you will need to monitor the tuning job progress and results. You can see the status, the duration, the best training job, and the best objective metric value of your tuning job. You can also see the details of each training job, such as the hyperparameter values, the objective metric value, the training and validation errors, and the logs. You can use the AWS AutoML console or the AWS SDK for Python (Boto3) to monitor the tuning job. For this tutorial, we will use the AWS AutoML console and follow these steps:
- Go to the AWS AutoML console and select your tuning job.
- Click on the Overview tab and see the summary of your tuning job.
- Click on the Training jobs tab and see the details of each training job.
- Click on the Analytics tab and see the graphs and tables of the hyperparameters and the objective metric.
- Click on the Logs tab and see the logs of each training job.
By tuning the hyperparameters of the best candidate model, you can improve its performance or reduce its complexity. You can also explore the trade-offs between different hyperparameters and their effects on the model. In the next section, you will learn how to deploy the model and make predictions using AWS AutoML.
7. How to Deploy the Model and Make Predictions
After you have selected the best candidate model for your AutoML job, you can deploy it to an endpoint and use it to make predictions on new data. An endpoint is a web service that allows you to access your model from any application or device. You can also monitor and update your endpoint as needed.
In this section, you will learn how to deploy your model and make predictions using the AWS AutoML console and the AWS SDK for Python (Boto3). You will need to:
- Create an endpoint configuration
- Create an endpoint
- Invoke the endpoint
To create an endpoint configuration, you need to specify the name of your model, the instance type and number of instances that you want to use for your endpoint, and the data capture configuration that you want to use for monitoring your endpoint. You can use the default settings or customize them according to your needs.
To create an endpoint configuration using the AWS AutoML console, you can go to the Models tab of your AutoML job and select the model that you want to deploy. Then, you can click on the Create endpoint configuration button and fill in the required fields.
Here, you can give your endpoint configuration a name and choose the instance type and number of instances that you want to use for your endpoint. You can also enable data capture by selecting the Enable data capture checkbox and specifying the S3 bucket and prefix where you want to store the captured data. For this tutorial, we will use the default settings and name our endpoint configuration bank-marketing-endpoint-config.
To create an endpoint configuration using the AWS SDK for Python (Boto3), you can use the create_endpoint_config method of the SageMaker.Client class. You need to pass the name of your model, the instance type and number of instances that you want to use for your endpoint, and the data capture configuration that you want to use for monitoring your endpoint. For example, you can run the following code:
```python
import boto3

sagemaker = boto3.client('sagemaker')

endpoint_config_name = 'bank-marketing-endpoint-config'
model_name = 'bank-marketing-automl-1-0e9f4a8f0c0f4f7a8a-001-6e2d9f9e'
instance_type = 'ml.m5.large'
instance_count = 1

data_capture_config = {
    'EnableCapture': True,
    'InitialSamplingPercentage': 100,  # capture every request and response
    'DestinationS3Uri': 's3://your-bucket-name/your-prefix',
    'CaptureOptions': [
        {'CaptureMode': 'Input'},
        {'CaptureMode': 'Output'}
    ]
}

response = sagemaker.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': model_name,
            'InstanceType': instance_type,
            'InitialInstanceCount': instance_count
        }
    ],
    DataCaptureConfig=data_capture_config
)
print(response)
```
This code will create an endpoint configuration with the name bank-marketing-endpoint-config and return a response with the details of the endpoint configuration.
After you have created an endpoint configuration, you can create an endpoint using the AWS AutoML console or the AWS SDK for Python (Boto3). To create an endpoint using the AWS AutoML console, you can go to the Endpoint configurations tab of your AutoML job and select the endpoint configuration that you want to use. Then, you can click on the Create endpoint button and give your endpoint a name.
Here, you can name your endpoint and click on the Create endpoint button. It may take a few minutes for your endpoint to be created and ready to use. You can check the status of your endpoint on the Endpoints tab of your AutoML job. For this tutorial, we will name our endpoint bank-marketing-endpoint.
To create an endpoint using the AWS SDK for Python (Boto3), you can use the create_endpoint method of the SageMaker.Client class. You need to pass the name of your endpoint and the name of your endpoint configuration. For example, you can run the following code:
```python
import boto3

sagemaker = boto3.client('sagemaker')

endpoint_name = 'bank-marketing-endpoint'
endpoint_config_name = 'bank-marketing-endpoint-config'

response = sagemaker.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)
print(response)
```
This code will create an endpoint with the name bank-marketing-endpoint and return a response with the details of the endpoint. It may take a few minutes for your endpoint to be created and ready to use. You can check the status of your endpoint using the describe_endpoint method of the SageMaker.Client class.
Once your endpoint is ready, you can invoke it to make predictions on new data. You can use the AWS AutoML console or the AWS SDK for Python (Boto3) to invoke your endpoint. To invoke your endpoint using the AWS AutoML console, you can go to the Endpoints tab of your AutoML job and select the endpoint that you want to use. Then, you can click on the Invoke endpoint button and upload a file with the data that you want to make predictions on.
Here, you can choose the file format of your data (CSV or JSON) and the content type of your request (text/csv or application/json). You can also download a sample file to see the expected format of your data. For this tutorial, we will use a CSV file with the same features as the Bank Marketing Data Set, but without the target attribute. We will also use the text/csv content type. After you upload your file, you can click on the Invoke endpoint button and see the predictions returned by your endpoint.
Here, you can see the predictions for each row of your data. The predictions are in the form of probabilities for each possible class. For example, the first row has a probability of 0.93 for the class 0 (no subscription) and a probability of 0.07 for the class 1 (subscription). You can also download the predictions as a CSV file.
To invoke your endpoint using the AWS SDK for Python (Boto3), you can use the invoke_endpoint method of the SageMakerRuntime client. You need to pass the name of your endpoint, the data that you want to make predictions on, and the content type of your request. For example, you can run the following code:
```python
import boto3
import pandas as pd

sagemaker_runtime = boto3.client('sagemaker-runtime')

endpoint_name = 'bank-marketing-endpoint'
content_type = 'text/csv'

data = pd.read_csv('test.csv')
payload = data.to_csv(header=False, index=False)

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=payload
)
print(response['Body'].read().decode())
```
This code will read a CSV file with the same features as the Bank Marketing Data Set, but without the target attribute, and convert it to a string. Then, it will invoke the endpoint with the name bank-marketing-endpoint and the content type text/csv, and pass the data as the body of the request. Finally, it will print the predictions returned by the endpoint. The predictions are in the same format as the ones returned by the AWS AutoML console.
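The raw response body is a CSV string of class probabilities, so a small post-processing step is often useful to turn it into labels. This sketch assumes the two-column format described above (class 0 probability first), which may differ for your model:

```python
def probabilities_to_labels(csv_body, threshold=0.5):
    """Turn 'p_no,p_yes' CSV lines into yes/no labels.

    Assumes each line holds the class-0 ('no') probability first,
    then the class-1 ('yes') probability, as described above.
    """
    labels = []
    for line in csv_body.strip().splitlines():
        p_no, p_yes = (float(v) for v in line.split(","))
        labels.append("yes" if p_yes >= threshold else "no")
    return labels

# Example body shaped like the console output described above.
body = "0.93,0.07\n0.40,0.60\n"
print(probabilities_to_labels(body))  # ['no', 'yes']
```

Adjusting the threshold lets you trade precision against recall, for example to catch more likely subscribers at the cost of more false positives.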
Congratulations! You have successfully deployed your model and made predictions using AWS AutoML. You can now use your model to solve real-world prediction problems on new data.
8. Conclusion and Next Steps
In this blog, you have learned how to use AWS AutoML, a service that automates the end-to-end machine learning workflow. You have learned how to:
- Create an AutoML job and specify the data source and target attribute
- Choose the problem type and objective metric for your AutoML job
- Configure the AutoML job settings and launch the job
- Monitor the AutoML job progress and view the candidate models
- Compare and select the best candidate model for your data
- Evaluate the model performance on test data and tune the model hyperparameters
- Deploy the model to an endpoint and make predictions on new data
By using AWS AutoML, you have saved time and resources, and improved the quality and performance of your machine learning model. You have also leveraged the best practices and expertise of AWS to solve your machine learning problem.
What are the next steps that you can take to further enhance your machine learning solution? Here are some suggestions:
- Explore the other features and capabilities of AWS AutoML, such as data preprocessing, feature engineering, model explainability, and model debugging.
- Try AWS AutoML on different types of data and problems, such as image classification, text sentiment analysis, or time series forecasting.
- Learn more about the algorithms and frameworks that AWS AutoML uses, such as TensorFlow, PyTorch, and MXNet, and how they work under the hood.
- Integrate your AWS AutoML model with other AWS services, such as AWS Lambda, AWS S3, or AWS API Gateway, to create a complete machine learning application.
We hope you have enjoyed this blog and found it useful. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading and happy machine learning!