This blog shows you how to prepare and analyze your data for AWS AutoML using AWS Data Wrangler and AWS Glue. You will learn how to load, clean, transform, and save data with AWS Data Wrangler, and how to catalog your data with an AWS Glue crawler and query it with Amazon Athena.
1. Introduction
Welcome to the second part of the AWS AutoML: A Practical Guide series. In this part, you will learn how to prepare and analyze your data for AWS AutoML using AWS Data Wrangler and AWS Glue.
Data preparation and analysis are essential steps in any machine learning project. They involve loading, cleaning, transforming, and exploring the data to ensure its quality and suitability for the modeling task. Data preparation and analysis can also help you discover useful insights and patterns in your data that can inform your model design and evaluation.
AWS provides several services and tools that can help you with data preparation and analysis. In this tutorial, you will use two of them: AWS Data Wrangler and AWS Glue.
AWS Data Wrangler is an open-source Python library that simplifies data access, transformation, and integration with AWS data services. It allows you to read and write data from various sources, such as Amazon S3, Amazon Redshift, Amazon Athena, and more. It also provides a rich set of data manipulation functions, such as filtering, merging, aggregating, and reshaping data. AWS Data Wrangler is compatible with popular Python data science libraries, such as pandas, numpy, and scikit-learn.
AWS Glue is a fully managed service that provides a serverless data integration platform. It allows you to create, run, and monitor data pipelines that can extract, transform, and load (ETL) data from various sources to various destinations. It also provides a data catalog that stores the metadata and schema of your data sources, making them searchable and queryable. AWS Glue can also generate code for your data pipelines using Python or Scala.
By using AWS Data Wrangler and AWS Glue, you will be able to prepare and analyze your data for AWS AutoML in a scalable, efficient, and cost-effective way. You will also be able to leverage the power and flexibility of Python and AWS to handle complex and diverse data scenarios.
Are you ready to get started? Let’s dive into the data preparation and analysis process with AWS Data Wrangler and AWS Glue.
2. Data Preparation with AWS Data Wrangler
In this section, you will learn how to use AWS Data Wrangler to prepare your data for AWS AutoML. You will perform the following steps:
- Install and import AWS Data Wrangler
- Load data from Amazon S3
- Clean and transform data
- Save data to Amazon S3
As a quick recap from the introduction, AWS Data Wrangler is an open-source Python library that simplifies data access, transformation, and integration with AWS data services such as Amazon S3, Amazon Redshift, and Amazon Athena, and it works seamlessly with popular Python data science libraries such as pandas, numpy, and scikit-learn.
To use AWS Data Wrangler, you need to have Python 3.6 or higher installed on your machine. You also need to have an AWS account and configure your credentials and region. You can follow the instructions here to install and configure AWS Data Wrangler.
Once you have AWS Data Wrangler installed and configured, you can start preparing your data for AWS AutoML. Let’s begin with the first step: installing and importing AWS Data Wrangler.
2.1. Install and Import AWS Data Wrangler
The first step to use AWS Data Wrangler is to install and import it in your Python environment. You can install AWS Data Wrangler using pip, the Python package manager. To do so, open a terminal and run the following command:
pip install awswrangler
This will download and install the latest version of AWS Data Wrangler and its dependencies. You can check the installation by running the following command:
pip show awswrangler
This will display some information about the installed package, such as its version, location, and description.
Once you have installed AWS Data Wrangler, you can import it in your Python code. To do so, open a Python editor or notebook and add the following line at the beginning of your script:
import awswrangler as wr
This will import the AWS Data Wrangler module and assign it the alias wr. You can use this alias to access the functions and classes of AWS Data Wrangler in your code. For example, you can use wr.s3 to access the functions related to Amazon S3, or wr.catalog to access the functions related to the Glue data catalog.
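As an optional sanity check that the import and your AWS credentials are working, you can call a couple of lightweight functions. This is a minimal sketch; the S3 path below is a placeholder, not a file created in this tutorial:

```python
import awswrangler as wr

# List the databases currently registered in the AWS Glue Data Catalog
print(wr.catalog.databases())

# Check whether a specific object exists in Amazon S3 (placeholder path)
print(wr.s3.does_object_exist("s3://my-bucket/customers.csv"))
```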
By installing and importing AWS Data Wrangler, you have completed the first step of data preparation with AWS Data Wrangler. You are now ready to load your data from Amazon S3 and start working with it.
2.2. Load Data from Amazon S3
The second step to use AWS Data Wrangler is to load your data from Amazon S3. Amazon S3 is a scalable and durable object storage service that can store any type of data. You can use AWS Data Wrangler to read data from Amazon S3 into a pandas DataFrame, which is a popular data structure for data analysis in Python.
To load data from Amazon S3, you need to have an S3 bucket that contains your data files. You also need to have the appropriate permissions to access the bucket and the files. You can follow the instructions here to create and configure an S3 bucket.
Once you have an S3 bucket with your data files, you can use the wr.s3.read_csv function to read a CSV file from S3 into a pandas DataFrame. The function takes the following parameters:
- path: The S3 path of the CSV file. It should start with s3:// and include the bucket name and the file name.
- sep: The separator character used in the CSV file. The default is a comma (,).
- header: The row number of the header. The default is 0, which means the first row is the header. If the CSV file has no header, you can set this parameter to None.
- index_col: The column number or name of the index. The default is None, which means no index is assigned. If you want to use a column as the index, you can specify its number or name.
- usecols: The list of columns to read from the CSV file. The default is None, which means all columns are read. If you want to read only a subset of columns, you can specify their numbers or names.
- dtype: The data type of each column. The default is None, which means the data type is inferred from the data. If you want to specify the data type of each column, you can use a dictionary that maps the column names or numbers to the data types.
- parse_dates: The list of columns to parse as dates. The default is False, which means no columns are parsed as dates. If you want to parse some columns as dates, you can specify their numbers or names.
- infer_datetime_format: A boolean value that indicates whether to infer the date format from the data. The default is False, which means the date format is not inferred. If you set this parameter to True, it can speed up the parsing of dates.
- date_parser: A function that parses the date strings in the specified columns. The default is None, which means the default parser is used. If you want to use a custom parser, you can define a function that takes a list of date strings and returns a list of datetime objects.
- chunksize: The number of rows to read at a time. The default is None, which means the whole file is read at once. If you want to read the file in chunks, you can specify a positive integer.
For example, suppose you have a CSV file named customers.csv in your S3 bucket named my-bucket. The file has the following structure:
customer_id | name | email | phone | country | created_at
---|---|---|---|---|---
1 | Alice | alice@example.com | +1-234-567-8901 | USA | 2020-01-01
2 | Bob | bob@example.com | +1-234-567-8902 | Canada | 2020-01-02
3 | Charlie | charlie@example.com | +1-234-567-8903 | UK | 2020-01-03
4 | David | david@example.com | +1-234-567-8904 | Australia | 2020-01-04
5 | Eve | eve@example.com | +1-234-567-8905 | India | 2020-01-05
To read this file into a pandas DataFrame, you can use the following code:
```python
import awswrangler as wr

df = wr.s3.read_csv(
    path="s3://my-bucket/customers.csv",
    parse_dates=["created_at"]
)
```
This will create a DataFrame named df that looks like this:
customer_id | name | email | phone | country | created_at
---|---|---|---|---|---
1 | Alice | alice@example.com | +1-234-567-8901 | USA | 2020-01-01
2 | Bob | bob@example.com | +1-234-567-8902 | Canada | 2020-01-02
3 | Charlie | charlie@example.com | +1-234-567-8903 | UK | 2020-01-03
4 | David | david@example.com | +1-234-567-8904 | Australia | 2020-01-04
5 | Eve | eve@example.com | +1-234-567-8905 | India | 2020-01-05
You can use similar functions to read other types of files from S3, such as JSON, Parquet, Excel, and more. You can check the documentation here to see the available functions and parameters.
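For example, the sketch below reads a Parquet dataset and a JSON Lines file; the S3 paths are placeholders, not files created in this tutorial:

```python
import awswrangler as wr

# Read a Parquet dataset (a folder of Parquet files) into a pandas DataFrame
df_parquet = wr.s3.read_parquet(path="s3://my-bucket/customers_parquet/", dataset=True)

# Read a JSON Lines file into a pandas DataFrame
df_json = wr.s3.read_json(path="s3://my-bucket/customers.json", lines=True)
```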
By loading data from Amazon S3, you have completed the second step of data preparation with AWS Data Wrangler. You are now ready to clean and transform your data and make it ready for AWS AutoML.
2.3. Clean and Transform Data
The third step to use AWS Data Wrangler is to clean and transform your data. Cleaning and transforming data involves applying various operations and functions to your data to ensure its quality and suitability for the modeling task. Some common data cleaning and transformation tasks are:
- Removing or imputing missing values
- Removing or handling outliers
- Removing or encoding categorical variables
- Normalizing or scaling numerical variables
- Creating or extracting new features
AWS Data Wrangler provides a rich set of data manipulation functions that can help you with these tasks. These functions are based on the pandas library, which is a popular data analysis tool in Python. You can use these functions to apply various operations and transformations to your pandas DataFrame that you loaded from Amazon S3.
For example, suppose you have a DataFrame named df that contains some customer data, as shown in the previous section. You want to clean and transform this data for AWS AutoML. You can use the following code to perform some common data cleaning and transformation tasks:
```python
# Remove any rows with missing values
df = df.dropna()

# Remove any rows with duplicate customer_id
df = df.drop_duplicates(subset="customer_id")

# Encode the country column as a numerical variable
df["country"] = df["country"].astype("category").cat.codes

# Normalize the customer_id column using min-max scaling
df["customer_id"] = (df["customer_id"] - df["customer_id"].min()) / (
    df["customer_id"].max() - df["customer_id"].min()
)

# Extract the month and year from the created_at column
df["month"] = df["created_at"].dt.month
df["year"] = df["created_at"].dt.year

# Drop the columns that are not needed for the modeling task
df = df.drop(columns=["name", "email", "phone", "created_at"])
```
This will create a new DataFrame named df that looks like this:
customer_id | country | month | year |
---|---|---|---|
0.00 | 4 | 1 | 2020 |
0.25 | 1 | 1 | 2020 |
0.50 | 3 | 1 | 2020 |
0.75 | 0 | 1 | 2020 |
1.00 | 2 | 1 | 2020 |
You can use other functions and parameters to perform different data cleaning and transformation tasks, depending on your data and your modeling goal. You can check the documentation here to see the available functions and parameters.
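For illustration, if your dataset also contained a numeric column such as age (not present in the sample customer data above), two other common steps, imputing missing values and capping outliers, might look like the following sketch, which operates on the same df DataFrame:

```python
# Impute missing numeric values with the column median instead of dropping rows
# (the "age" column is hypothetical and used only for illustration)
df["age"] = df["age"].fillna(df["age"].median())

# Cap outliers at the 1st and 99th percentiles (winsorization)
lower, upper = df["age"].quantile([0.01, 0.99])
df["age"] = df["age"].clip(lower=lower, upper=upper)
```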
By cleaning and transforming your data, you have completed the third step of data preparation with AWS Data Wrangler. You are now ready to save your data to Amazon S3 and make it ready for AWS AutoML.
2.4. Save Data to Amazon S3
After you have cleaned and transformed your data, you need to save it to Amazon S3 so that you can use it for AWS AutoML. AWS Data Wrangler makes this step very easy, as it allows you to write data to Amazon S3 in various formats, such as CSV, Parquet, JSON, and more. You can also specify the compression type, the partition columns, and the metadata catalog to register your data.
In this tutorial, you will save your data to Amazon S3 in Parquet format, which is a columnar storage format that is optimized for analytics. Parquet files are compressed, splittable, and can support complex data types. You will also partition your data by the target column, which is the column that contains the labels for your machine learning task. Partitioning your data can improve the performance and scalability of your data processing and querying. Finally, you will register your data in the AWS Glue Data Catalog, which will allow you to access and query your data using AWS Glue and Amazon Athena.
To save your data to Amazon S3 using AWS Data Wrangler, you can use the awswrangler.s3.to_parquet() function. This function takes the following parameters:
- df: The pandas DataFrame that contains your data.
- path: The Amazon S3 path where you want to save your data.
- dataset: A boolean value that indicates whether you want to save your data as a dataset or a single file. If you set it to True, you can use the partition_cols and database parameters to partition and register your data.
- partition_cols: A list of column names that you want to use as partition keys. In this tutorial, you will use the target column, which is income.
- database: The name of the database in the AWS Glue Data Catalog where you want to register your data. In this tutorial, you will use the same database that you created in the previous section, which is aws_automl_tutorial.
- table: The name of the table in the AWS Glue Data Catalog where you want to register your data. In this tutorial, you will use adult_data as the table name.
- mode: The mode of writing data to Amazon S3. You can choose from append, overwrite, or overwrite_partitions. In this tutorial, you will use overwrite to replace any existing data in the Amazon S3 path.
The following code snippet shows how to use the awswrangler.s3.to_parquet() function to save your data to Amazon S3:

```python
import awswrangler as wr

# Define the Amazon S3 path where you want to save your data
s3_path = "s3://your-bucket-name/your-folder-name/"

# Save your data to Amazon S3 in Parquet format, partitioned by the target column,
# and registered in the AWS Glue Data Catalog
wr.s3.to_parquet(
    df=df,                           # Your pandas DataFrame
    path=s3_path,                    # Your Amazon S3 path
    dataset=True,                    # Save your data as a dataset
    partition_cols=["income"],       # Partition your data by the target column
    database="aws_automl_tutorial",  # The database in the AWS Glue Data Catalog
    table="adult_data",              # The table in the AWS Glue Data Catalog
    mode="overwrite",                # Overwrite any existing data in the S3 path
)
```
After you run this code, you will have your data saved to Amazon S3 in Parquet format, partitioned by the target column, and registered in the AWS Glue Data Catalog. You can verify this by checking the Amazon S3 console and the AWS Glue console.
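Besides checking the consoles, you can verify the result programmatically. This is a small sketch that reuses the database, table, and S3 path from the code above:

```python
import awswrangler as wr

# Inspect the table that was registered in the AWS Glue Data Catalog
print(wr.catalog.table(database="aws_automl_tutorial", table="adult_data"))

# List the Parquet files (one folder per income partition) written to S3
print(wr.s3.list_objects("s3://your-bucket-name/your-folder-name/"))
```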
Congratulations! You have successfully prepared your data for AWS AutoML using AWS Data Wrangler. You have learned how to load, clean, transform, and save your data with AWS Data Wrangler. You have also learned how to use Parquet format, partitioning, and the AWS Glue Data Catalog to optimize your data for analytics.
In the next section, you will learn how to use AWS Glue to analyze your data and query it with Amazon Athena. Stay tuned!
3. Data Analysis with AWS Glue
In this section, you will learn how to use AWS Glue to analyze your data and query it with Amazon Athena. You will perform the following steps:
- Create a Glue crawler
- Run the Glue crawler
- Explore the Glue data catalog
- Query data with Amazon Athena
As a reminder, AWS Glue is a fully managed, serverless data integration service. It lets you create, run, and monitor ETL pipelines, and it provides a data catalog that stores the metadata and schema of your data sources, making them searchable and queryable.
To use AWS Glue, you need to have an AWS account and configure your credentials and region. You can follow the instructions here to get started with AWS Glue.
Once you have AWS Glue set up, you can start analyzing your data and querying it with Amazon Athena. Let’s begin with the first step: creating a Glue crawler.
3.1. Create a Glue Crawler
A Glue crawler is a component of AWS Glue that scans your data sources and extracts the metadata and schema of your data. It then creates or updates a table in the AWS Glue Data Catalog that contains the information about your data. A Glue crawler can crawl various types of data sources, such as Amazon S3, Amazon DynamoDB, Amazon Redshift, and more. You can configure a Glue crawler to run on a schedule or on demand.
In this tutorial, you will create a Glue crawler that crawls the data that you saved to Amazon S3 in the previous section. You will use the same database and table names that you used with AWS Data Wrangler, so that the Glue crawler can update the existing table in the AWS Glue Data Catalog. You will also specify the Parquet format and the partition columns for your data.
To create a Glue crawler, you can use the AWS Glue console or the AWS CLI. In this tutorial, you will use the AWS Glue console. You can follow the steps below to create a Glue crawler:
- Open the AWS Glue console and choose Crawlers from the left navigation pane.
- Choose Add crawler and enter a name for your crawler. In this tutorial, you will use adult_data_crawler as the crawler name.
- Choose Next and select Specified path in my account as the data store. Enter the Amazon S3 path where you saved your data in the previous section. In this tutorial, you will use s3://your-bucket-name/your-folder-name/ as the data store path.
- Choose Next and select No for adding another data store.
- Choose Next and select Choose an existing IAM role. Select an IAM role that has permissions to access AWS Glue and Amazon S3. In this tutorial, you will use AWSGlueServiceRole as the IAM role.
- Choose Next and select Run on demand as the frequency. You can also choose a schedule for your crawler, but in this tutorial, you will run it manually.
- Choose Next and select Choose an existing database. Select the database that you created in the previous section, aws_automl_tutorial, so that the crawler updates the existing adult_data table rather than creating a new one.
- Under Grouping behavior for S3 data (optional), select Create a single schema for each S3 path. This ensures that the crawler creates a single table for your data; the crawler detects the Parquet format and the income partition column automatically from the folder structure that AWS Data Wrangler created.
- Choose Next, review the crawler settings, and choose Finish to create the crawler.
You have successfully created a Glue crawler that crawls your data in Amazon S3. You can now run the crawler to update the table in the AWS Glue Data Catalog.
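If you prefer to script this step instead of clicking through the console, the same crawler can be created programmatically with boto3. This is a minimal sketch; the account ID, IAM role, bucket, and folder names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler equivalent to the console steps above
glue.create_crawler(
    Name="adult_data_crawler",
    Role="arn:aws:iam::123456789012:role/AWSGlueServiceRole",  # placeholder role ARN
    DatabaseName="aws_automl_tutorial",
    Targets={"S3Targets": [{"Path": "s3://your-bucket-name/your-folder-name/"}]},
    Description="Crawls the adult_data Parquet dataset prepared with AWS Data Wrangler",
)
```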
3.2. Run the Glue Crawler
After you have created the Glue crawler, you need to run it to populate the Glue data catalog with the metadata and schema of your data source. Running the Glue crawler will also create a table in the Glue data catalog that you can use to query your data with Amazon Athena.
To run the Glue crawler, you can use the AWS console, the AWS CLI, or the AWS SDK. In this tutorial, you will use the AWS console to run the Glue crawler. You can follow the steps below to run the Glue crawler:
- Go to the AWS Glue console and select Crawlers from the left menu.
- Find the crawler that you created in the previous section and select it.
- Click on the Run crawler button on the top right corner.
- Wait for the crawler to finish running. You can monitor the status of the crawler on the console. The crawler status will change from Starting to Stopping to Ready when it is done.
- When the crawler is done, you can view the results on the console. You can see the number of tables created, the number of files and records scanned, and the duration of the crawler run.
Congratulations, you have successfully run the Glue crawler and populated the Glue data catalog with your data source metadata and schema. You can now explore the Glue data catalog and query your data with Amazon Athena.
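If you want to automate this step, the following sketch starts the same crawler with boto3 and polls until it returns to the READY state; the crawler name matches the one created earlier:

```python
import time

import boto3

glue = boto3.client("glue")

# Start the crawler on demand, equivalent to clicking Run crawler in the console
glue.start_crawler(Name="adult_data_crawler")

# Poll until the crawler finishes and returns to the READY state
while True:
    state = glue.get_crawler(Name="adult_data_crawler")["Crawler"]["State"]
    print(f"Crawler state: {state}")
    if state == "READY":
        break
    time.sleep(30)
```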
3.3. Explore the Glue Data Catalog
Once you have run the Glue crawler, you can explore the Glue data catalog to see the metadata and schema of your data source. The Glue data catalog is a central repository that stores the information about your data sources, such as the location, format, structure, and properties of your data. You can use the Glue data catalog to search and browse your data sources, as well as to query your data with Amazon Athena.
To explore the Glue data catalog, you can use the AWS console, the AWS CLI, or the AWS SDK. In this tutorial, you will use the AWS console to explore the Glue data catalog. You can follow the steps below to explore the Glue data catalog:
- Go to the AWS Glue console and select Tables from the left menu.
- Find the table that was created by the Glue crawler and select it. You can see the name, description, classification, and location of the table.
- Click on the View partitions button to see the partitions of the table. Partitions are subsets of your data that are organized by a key, such as date, region, or category. Partitions can help you optimize the performance and cost of your queries by reducing the amount of data scanned.
- Click on the View columns button to see the columns of the table. Columns are the attributes or fields of your data, such as name, age, or gender. You can see the name, type, and comment of each column.
- Click on the View properties button to see the properties of the table. Properties are the additional information or metadata of your data, such as the number of rows, the size, or the compression type. You can see the key and value of each property.
By exploring the Glue data catalog, you can get a better understanding of your data source and its characteristics. You can also use the Glue data catalog to query your data with Amazon Athena, which is the next step of this tutorial.
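If you prefer to inspect the catalog from Python instead of the console, AWS Data Wrangler's catalog module exposes the same information. A minimal sketch, using the database and table names from this tutorial:

```python
import awswrangler as wr

# List all tables in the tutorial database
print(wr.catalog.tables(database="aws_automl_tutorial"))

# Show the columns and data types of the adult_data table
print(wr.catalog.table(database="aws_automl_tutorial", table="adult_data"))

# List the partition values discovered for the table
print(wr.catalog.get_partitions(database="aws_automl_tutorial", table="adult_data"))
```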
3.4. Query Data with Amazon Athena
In this section, you will learn how to use Amazon Athena to query your data in the Glue data catalog. Amazon Athena is a serverless interactive query service that allows you to analyze data in Amazon S3 using standard SQL. You can use Amazon Athena to run ad-hoc queries, generate reports, and perform data analysis on your data.
To use Amazon Athena, you need to have an AWS account and configure your credentials and region. You also need to have a table in the Glue data catalog that corresponds to your data source in Amazon S3. You can use the Glue crawler that you created and ran in the previous sections to create the table in the Glue data catalog.
Once you have Amazon Athena set up, you can start querying your data with SQL. You can use the AWS console, the AWS CLI, or the AWS SDK to run queries with Amazon Athena. In this tutorial, you will use the AWS console to run queries with Amazon Athena. You can follow the steps below to run queries with Amazon Athena:
- Go to the Amazon Athena console and select Query Editor from the left menu.
- Select the database and the table that correspond to your data source in the Glue data catalog. You can see the schema and the sample data of the table on the console.
- Type your SQL query in the query editor. You can use the standard SQL syntax and functions to query your data. You can also use the query builder to generate SQL queries from the table schema.
- Click on the Run query button to execute your query. You can see the results of your query on the console. You can also see the query execution time, the data scanned, and the query ID on the console.
- You can save, download, or export your query results to Amazon S3 or other AWS services. You can also save, edit, or delete your queries on the console.
By using Amazon Athena, you can query your data in the Glue data catalog with SQL. You can also perform data analysis and visualization on your query results using AWS services such as Amazon QuickSight, Amazon SageMaker, or AWS Lambda.
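As a concrete illustration, AWS Data Wrangler can also submit Athena queries directly from Python and return the results as a pandas DataFrame. The query below is only a sketch and assumes the adult_data table and income partition registered earlier in this tutorial:

```python
import awswrangler as wr

# Count rows per income partition in the adult_data table via Amazon Athena
df_counts = wr.athena.read_sql_query(
    sql="SELECT income, COUNT(*) AS num_rows FROM adult_data GROUP BY income",
    database="aws_automl_tutorial",
)
print(df_counts)
```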
4. Conclusion
In this tutorial, you have learned how to prepare and analyze your data for AWS AutoML using AWS Data Wrangler and AWS Glue. You have performed the following steps:
- Installed and imported AWS Data Wrangler
- Loaded data from Amazon S3
- Cleaned and transformed data
- Saved data to Amazon S3
- Created a Glue crawler
- Ran the Glue crawler
- Explored the Glue data catalog
- Queried data with Amazon Athena
By using AWS Data Wrangler and AWS Glue, you have been able to prepare and analyze your data for AWS AutoML in a scalable, efficient, and cost-effective way. You have also been able to leverage the power and flexibility of Python and AWS to handle complex and diverse data scenarios.
AWS AutoML is a service that allows you to build, train, and deploy machine learning models without requiring any prior machine learning experience or expertise. AWS AutoML can automatically select the best algorithm, hyperparameters, and feature engineering for your data and use case. You can use AWS AutoML to solve various machine learning problems, such as classification, regression, forecasting, and anomaly detection.
In the next part of this series, you will learn how to use AWS AutoML to create and deploy a machine learning model based on the data that you have prepared and analyzed in this part. You will also learn how to evaluate and monitor your model performance and accuracy.
Thank you for reading this tutorial. We hope you have enjoyed it and learned something new. If you have any questions or feedback, please feel free to leave a comment below. Stay tuned for the next part of this series.