This blog post teaches you how to use BeautifulSoup4 and Pandas to clean scraped data and store it in CSV, JSON, or SQL format, with Python code examples.
1. Introduction
Web scraping is a technique for extracting data from websites using automated scripts or programs. It can be useful for various purposes, such as data analysis, market research, and content aggregation. However, web scraping also comes with challenges, such as data quality, data format, and data storage.
In this tutorial, you will learn how to use BeautifulSoup4 and Pandas to clean and store the scraped data in different formats, such as CSV, JSON, and SQL. BeautifulSoup4 is a Python library that allows you to parse and manipulate HTML and XML documents. Pandas is a Python library that provides high-performance data structures and tools for data analysis.
By the end of this tutorial, you will be able to:
- Scrape data from websites using Python requests and BeautifulSoup4
- Clean the scraped data using BeautifulSoup4 methods and attributes
- Store the cleaned data in Pandas DataFrames
- Export the data to CSV, JSON, and SQL files using Pandas methods
To follow this tutorial, you will need:
- Python 3 installed on your system
- The requests, BeautifulSoup4, and Pandas libraries installed on your system (an install command follows this list)
- A text editor or an IDE of your choice
- A basic understanding of Python syntax and web scraping concepts
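If any of these libraries are missing, you can install them from the command line with pip (requests is included because the examples use it to fetch pages):

pip install requests beautifulsoup4 pandas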
Are you ready to start scraping, cleaning, and storing data with BeautifulSoup4 and Pandas? Let’s begin!
2. Web Scraping Basics
Before you can start cleaning and storing data with BeautifulSoup4 and Pandas, you need to know how to scrape data from websites using Python. Web scraping is the process of extracting data from web pages using automated scripts or programs. Web scraping can be done for various purposes, such as data analysis, market research, content aggregation, and more.
There are many tools and libraries available for web scraping in Python, but in this tutorial, you will use two of the most popular ones: requests and BeautifulSoup4. Requests is a Python library that allows you to send HTTP requests and get the response from a web server. BeautifulSoup4 is a Python library that allows you to parse and manipulate HTML and XML documents.
To scrape data from a website, you need to follow these basic steps:
- Send an HTTP request to the URL of the web page that you want to scrape using requests.
- Get the HTML content of the web page from the response using requests.
- Parse the HTML content using BeautifulSoup4 and create a BeautifulSoup object.
- Find and extract the data elements that you want from the BeautifulSoup object using BeautifulSoup4 methods and attributes.
For example, suppose you want to scrape the title and the summary of the latest news articles from the Bing News website. The following code sketches one way to do that (the class names used to locate the news cards are illustrative; Bing's markup may differ or change over time, so inspect the page and adjust them as needed):
# Import the requests and BeautifulSoup4 libraries
import requests
from bs4 import BeautifulSoup

# Define the URL of the web page to scrape
url = "https://www.bing.com/news"

# Send an HTTP request to the URL and get the response
response = requests.get(url)

# Check if the response is successful (status code 200)
if response.status_code == 200:
    # Get the HTML content from the response
    html = response.text

    # Parse the HTML content using BeautifulSoup4
    soup = BeautifulSoup(html, "html.parser")

    # Find all the news cards on the page
    # (the class names below are illustrative and may change over time)
    cards = soup.find_all("div", class_="news-card")

    # Extract and print the title and the summary of each article
    for card in cards:
        title = card.find("a", class_="title")
        summary = card.find("div", class_="snippet")
        if title and summary:
            print(title.get_text(strip=True))
            print(summary.get_text(strip=True))
This code will output something like this:
Biden to announce new Covid measures for US as Omicron spreads
President Joe Biden is expected to announce new measures to combat Covid-19 in the US, as the Omicron variant continues to spread around the world. Mr Biden will speak at the White House on Thursday ...
UK approves Pfizer Covid vaccine for children aged 5-11
The UK has approved the Pfizer/BioNTech Covid-19 vaccine for children aged between five and 11, the Medicines and Healthcare products Regulatory Agency (MHRA) has announced. The decision comes after ...
Australia v England: Ashes first Test, day one – live!
Over-by-over report: Can England get off to a good start in the Ashes series against Australia at the Gabba? Join Geoff Lemon for updates
As you can see, web scraping is a powerful technique to get data from websites using Python. However, web scraping also comes with some challenges, such as data quality, data format, and data storage. How can you deal with these challenges? That’s what you will learn in the next sections of this tutorial.
3. Data Cleaning with BeautifulSoup4
Once you have scraped the data from a website using requests and BeautifulSoup4, you need to clean the data before you can store it in a desired format. Data cleaning is the process of transforming and improving the quality of the data by removing or correcting errors, inconsistencies, duplicates, and irrelevant information.
Data cleaning is an essential step in web scraping, as the data that you get from a website may not be in a suitable format or structure for your analysis or application. For example, the data may contain HTML tags, whitespace, punctuation, special characters, or other unwanted elements that you need to remove or modify.
In this section, you will learn how to use BeautifulSoup4 to clean the scraped data and prepare it for storage. BeautifulSoup4 provides various methods and attributes that allow you to parse and manipulate HTML and XML documents. You will use some of these methods and attributes to perform the following data cleaning tasks:
- Parsing HTML: You will learn how to create a BeautifulSoup object from the HTML content of a web page and specify the parser that you want to use.
- Extracting Data Elements: You will learn how to find and extract the data elements that you want from the BeautifulSoup object using different criteria, such as tags, attributes, classes, ids, text, and more.
- Handling Missing Values: You will learn how to deal with missing values in the scraped data and replace them with appropriate values or remove them altogether.
By the end of this section, you will be able to clean the scraped data using BeautifulSoup4 and get it ready for storage. Let’s start with parsing HTML.
3.1. Parsing HTML
Parsing HTML is the process of converting the HTML content of a web page into a structured representation that can be manipulated and searched. Parsing HTML is the first step in data cleaning with BeautifulSoup4, as it allows you to create a BeautifulSoup object that contains the parsed HTML document.
To parse HTML with BeautifulSoup4, you need to use the BeautifulSoup() function and pass two arguments: the HTML content and the name of the parser that you want to use. The HTML content can be a string or an open file handle; if you fetched the page with requests, pass response.text or response.content rather than the response object itself. The parser argument names the underlying library that BeautifulSoup4 uses to parse the HTML content.
There are several parsers available for BeautifulSoup4, such as html.parser, lxml, and html5lib. Each parser has its own advantages and disadvantages in terms of speed, leniency, and external dependencies. You can choose the parser that suits your needs, but in this tutorial you will use html.parser, a built-in parser that ships with the Python standard library.
To parse HTML with BeautifulSoup4 and the html.parser parser, you can use the following code:
# Import the BeautifulSoup4 library
from bs4 import BeautifulSoup

# Define the HTML content of a web page as a string
html = """
<html>
<body>
<h1>Web Scraping 107: Cleaning and Storing Data with BeautifulSoup4 and Pandas in Python</h1>
<p>This is a tutorial on web scraping with Python.</p>
<a href="https://www.bing.com">Visit Bing</a>
</body>
</html>
"""

# Parse the HTML content using BeautifulSoup4 and the html.parser parser
soup = BeautifulSoup(html, "html.parser")

# Print the BeautifulSoup object
print(soup)
This code will output the following:
<html>
<body>
<h1>Web Scraping 107: Cleaning and Storing Data with BeautifulSoup4 and Pandas in Python</h1>
<p>This is a tutorial on web scraping with Python.</p>
<a href="https://www.bing.com">Visit Bing</a>
</body>
</html>
As you can see, the BeautifulSoup object contains the parsed HTML document, which is organized into a tree structure of tags, attributes, and text. You can access and manipulate the elements of the BeautifulSoup object using various methods and attributes that BeautifulSoup4 provides. You will learn how to do that in the next section.
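For a quick taste, with the soup object created above you can reach tags directly as attributes of the object:

# Access the first matching tag directly as an attribute of the soup
print(soup.h1.get_text())   # Web Scraping 107: Cleaning and Storing Data ...
print(soup.p.get_text())    # This is a tutorial on web scraping with Python.
print(soup.a.get("href"))   # https://www.bing.com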
3.2. Extracting Data Elements
After you have parsed the HTML content of a web page using BeautifulSoup4 and created a BeautifulSoup object, you can extract the data elements that you want from the object using different criteria, such as tags, attributes, classes, ids, text, and more. Extracting data elements is the second step in data cleaning with BeautifulSoup4, as it allows you to get the relevant information that you need from the parsed HTML document.
BeautifulSoup4 provides various methods and attributes that allow you to find and extract data elements from the BeautifulSoup object. Some of the most common ones are:
- find(): This method returns the first element that matches the given criteria.
- find_all(): This method returns a list of all elements that match the given criteria.
- select(): This method returns a list of elements that match the given CSS selector.
- get_text(): This method returns the text content of an element, including the text of all its descendants.
- get(): This method returns the value of an attribute of an element.
To extract data elements with BeautifulSoup4, you pass the criteria that you want to match to the method that you call. The criteria can be a string, a regular expression, a function, a list, or a dictionary of attributes, and you can combine several criteria in one call, for example by passing a tag name together with a dictionary of attributes or a custom filter function.
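As a quick, self-contained illustration (the snippet and its class names are invented for this example), here are find(), find_all(), select(), get_text(), and get() in action:

# Import the BeautifulSoup4 library
from bs4 import BeautifulSoup

# A small invented document to demonstrate the selection methods
html = """
<div id="articles">
  <a class="title" href="/a1">First article</a>
  <a class="title" href="/a2">Second article</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first element that matches the criteria
first = soup.find("a", class_="title")
print(first.get_text())                        # First article

# find_all() returns a list of all matching elements
for link in soup.find_all("a", class_="title"):
    print(link.get("href"))                    # /a1, then /a2

# select() accepts a CSS selector string
print(len(soup.select("#articles a.title")))   # 2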
For example, suppose you want to extract the title and the summary of the latest news articles from the Bing News website using BeautifulSoup4. The following code sketches one way to do that (as before, the class names are illustrative and may not match Bing's current markup):
# Import the requests and BeautifulSoup4 libraries
import requests
from bs4 import BeautifulSoup

# Define the URL of the web page to scrape
url = "https://www.bing.com/news"

# Send an HTTP request to the URL and get the response
response = requests.get(url)

# Check if the response is successful (status code 200)
if response.status_code == 200:
    # Get the HTML content from the response
    html = response.text

    # Parse the HTML content using BeautifulSoup4 and the html.parser parser
    soup = BeautifulSoup(html, "html.parser")

    # Find all the news cards (the class names are illustrative and may change)
    cards = soup.find_all("div", class_="news-card")

    # Extract and print the title and the summary of each article
    for card in cards:
        title = card.find("a", class_="title")
        summary = card.find("div", class_="snippet")
        if title and summary:
            print(title.get_text(strip=True))
            print(summary.get_text(strip=True))
This code will output something like this:
Biden to announce new Covid measures for US as Omicron spreads
President Joe Biden is expected to announce new measures to combat Covid-19 in the US, as the Omicron variant continues to spread around the world. Mr Biden will speak at the White House on Thursday ...
UK approves Pfizer Covid vaccine for children aged 5-11
The UK has approved the Pfizer/BioNTech Covid-19 vaccine for children aged between five and 11, the Medicines and Healthcare products Regulatory Agency (MHRA) has announced. The decision comes after ...
Australia v England: Ashes first Test, day one – live!
Over-by-over report: Can England get off to a good start in the Ashes series against Australia at the Gabba? Join Geoff Lemon for updates
As you can see, you can extract data elements from the BeautifulSoup object using different criteria, such as tags, attributes, classes, ids, text, and more. However, extracting data elements is not enough to clean the data, as the data may still contain missing values that you need to handle. How can you do that? That’s what you will learn in the next section.
3.3. Handling Missing Values
One of the challenges that you may face when scraping data from websites is that the data may contain missing values. Missing values are values that are not present in the data, either because they are not available, not applicable, or not recorded. Missing values can affect the quality and accuracy of the data, and therefore, you need to handle them properly before you can store the data in a desired format.
Handling missing values is the third step in data cleaning with BeautifulSoup4, as it allows you to deal with the gaps and inconsistencies in the scraped data. There are different ways to handle missing values, depending on the type and the amount of the missing data, the purpose and the context of the analysis, and the preference of the analyst. Some of the common ways to handle missing values are:
- Removing the missing values: This means deleting the rows or columns that contain missing values from the data. This is a simple and fast way to handle missing values, but it can also reduce the size and the representativeness of the data, and introduce bias and errors.
- Replacing the missing values: This means substituting the missing values with some other values, such as the mean, the median, the mode, the previous or the next value, or a constant value. This is a more complex and flexible way to handle missing values, but it can also alter the distribution and the variability of the data, and introduce noise and outliers.
- Ignoring the missing values: This means leaving the missing values as they are and proceeding with the analysis or the storage of the data. This is a simple and convenient way to handle missing values, but it can also affect the validity and the reliability of the results, and cause errors and exceptions.
In this section, you will learn how to use BeautifulSoup4 to handle missing values in the scraped data. You will use some of the methods and attributes that BeautifulSoup4 provides to perform the following tasks:
- Detecting missing values: You will learn how to identify and count the missing values in the scraped data using different criteria, such as empty strings, None values, or specific values.
- Removing missing values: You will learn how to delete the elements that contain missing values from the BeautifulSoup object using different methods, such as decompose(), extract(), or clear().
- Replacing missing values: You will learn how to fill the missing values with other values using the string attribute or methods such as append() and insert().
By the end of this section, you will be able to handle missing values in the scraped data using BeautifulSoup4 and improve the quality and the consistency of the data. A minimal sketch of all three tasks follows.
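Here is that sketch, using an invented two-article snippet in which the second article's summary is empty (the class names are made up for illustration):

# Import the BeautifulSoup4 library
from bs4 import BeautifulSoup

# A made-up document where the second article is missing its summary
html = """
<div class="article"><h2>First</h2><p class="summary">Has a summary</p></div>
<div class="article"><h2>Second</h2><p class="summary"></p></div>
"""
soup = BeautifulSoup(html, "html.parser")

for article in soup.find_all("div", class_="article"):
    summary = article.find("p", class_="summary")
    # Detect: a summary is missing if the element is absent or empty
    if summary is None:
        # Remove: drop an article that has no summary element at all
        article.decompose()
    elif not summary.get_text(strip=True):
        # Replace: fill an empty summary with a placeholder value
        summary.string = "No summary available"

print(soup)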
4. Data Storage with Pandas
After you have cleaned the data with BeautifulSoup4, you need to store the data in a desired format that can be easily accessed and analyzed. Data storage is the final step in web scraping with Python, as it allows you to save the data in a file or a database that can be used for further processing or presentation.
In this section, you will learn how to use Pandas to store the data in different formats, such as CSV, JSON, and SQL. Pandas is a Python library that provides high-performance data structures and tools for data analysis. One of the main data structures that Pandas provides is the DataFrame, which is a two-dimensional tabular data structure that can store data of different types and sizes.
To store the data with Pandas, you need to perform the following tasks:
- Creating DataFrames: You will learn how to create a Pandas DataFrame from the data elements that you extracted with BeautifulSoup4 and organize the data into rows and columns.
- Exporting Data to CSV: You will learn how to export the data from the Pandas DataFrame to a CSV file using the to_csv() method.
- Exporting Data to JSON: You will learn how to export the data from the Pandas DataFrame to a JSON file using the to_json() method.
- Exporting Data to SQL: You will learn how to export the data from the Pandas DataFrame to a SQL database using the to_sql() method.
By the end of this section, you will be able to store the data in different formats using Pandas and use the data for your analysis or application. Let’s start with creating DataFrames.
4.1. Creating DataFrames
A DataFrame is a two-dimensional tabular data structure that can store data of different types and sizes. A DataFrame has rows and columns that can be labeled with index and column names. A DataFrame can be created from various sources, such as lists, dictionaries, arrays, series, or files.
To create a DataFrame from the data elements that you extracted with BeautifulSoup4, you need to use the DataFrame() constructor and pass the data elements as an argument. The data elements can be a list of lists, a list of dictionaries, a dictionary of lists, or a dictionary of dictionaries. You can also specify the index and column names for the DataFrame as additional arguments.
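As a minimal illustration with made-up values, the same two-column table can be built from a dictionary of lists or a list of dictionaries:

import pandas as pd

# The same table built from two of the accepted input shapes
from_dict_of_lists = pd.DataFrame({
    "title": ["First article", "Second article"],
    "summary": ["Summary one", "Summary two"],
})
from_list_of_dicts = pd.DataFrame([
    {"title": "First article", "summary": "Summary one"},
    {"title": "Second article", "summary": "Summary two"},
])

# Both constructions produce an identical DataFrame
print(from_dict_of_lists.equals(from_list_of_dicts))   # True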
For example, suppose you want to create a DataFrame from the title and the summary of the latest news articles that you scraped from the Bing News website using BeautifulSoup4. The following code sketches one way to do that (the class names remain illustrative):
# Import the requests, BeautifulSoup4, and Pandas libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define the URL of the web page to scrape
url = "https://www.bing.com/news"

# Send an HTTP request to the URL and get the response
response = requests.get(url)

# Check if the response is successful (status code 200)
if response.status_code == 200:
    # Get the HTML content from the response
    html = response.text

    # Parse the HTML content using BeautifulSoup4 and the html.parser parser
    soup = BeautifulSoup(html, "html.parser")

    # Find all the news cards (the class names are illustrative and may change)
    cards = soup.find_all("div", class_="news-card")

    # Collect the title and the summary of each article as a list of dictionaries
    articles = []
    for card in cards:
        title = card.find("a", class_="title")
        summary = card.find("div", class_="snippet")
        if title and summary:
            articles.append({"title": title.get_text(strip=True),
                             "summary": summary.get_text(strip=True)})

    # Create a DataFrame from the list of dictionaries
    df = pd.DataFrame(articles)

    # Print the DataFrame
    print(df)
This code will output something like this:
                                               title                                            summary
0  Biden to announce new Covid measures for US as...  President Joe Biden is expected to announce ne...
1  UK approves Pfizer Covid vaccine for children ...  The UK has approved the Pfizer/BioNTech Covid-...
2  Australia v England: Ashes first Test, day one...  Over-by-over report: Can England get off to a ...
As you can see, you can create a DataFrame from the data elements that you extracted with BeautifulSoup4 and organize the data into rows and columns. However, creating a DataFrame is not enough to store the data, as you need to export the data to a file or a database that can be used for further processing or presentation. How can you do that? That’s what you will learn in the next sections.
4.2. Exporting Data to CSV
A CSV (comma-separated values) file is a plain text file that stores tabular data in a simple and compact format. A CSV file consists of rows and columns that are separated by commas or other delimiters. A CSV file can be easily read and written by various programs and applications, such as Excel, Google Sheets, or Pandas.
To export the data from the Pandas DataFrame to a CSV file, you need to use the to_csv() method and pass the name of the file as an argument. You can also specify other arguments, such as the sep delimiter, the encoding, and whether to write the index and the header, to customize the output of the CSV file.
For example, suppose you want to export the DataFrame that you created from the title and the summary of the latest news articles that you scraped from the Bing News website using BeautifulSoup4 and Pandas. You can use the following code to do that:
# Import the requests, BeautifulSoup4, and Pandas libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define the URL of the web page to scrape
url = "https://www.bing.com/news"

# Send an HTTP request to the URL and get the response
response = requests.get(url)

# Check if the response is successful (status code 200)
if response.status_code == 200:
    # Get the HTML content from the response
    html = response.text

    # Parse the HTML content using BeautifulSoup4 and the html.parser parser
    soup = BeautifulSoup(html, "html.parser")

    # Find all the news cards (the class names are illustrative and may change)
    cards = soup.find_all("div", class_="news-card")

    # Collect the title and the summary of each article
    articles = []
    for card in cards:
        title = card.find("a", class_="title")
        summary = card.find("div", class_="snippet")
        if title and summary:
            articles.append({"title": title.get_text(strip=True),
                             "summary": summary.get_text(strip=True)})

    # Create a DataFrame from the list of dictionaries
    df = pd.DataFrame(articles)

    # Export the DataFrame to a CSV file named news.csv, without the index
    df.to_csv("news.csv", index=False)
This code will create a CSV file named news.csv in the same directory as the Python script. Because index=False is passed, the CSV file has exactly two columns, title and summary, with one row per news article; to_csv() automatically wraps fields that contain commas in quotes. The file will look something like this:
title,summary
Biden to announce new Covid measures for US as Omicron spreads,"President Joe Biden is expected to announce new measures to combat Covid-19 in the US, as the Omicron variant continues to spread around the world. Mr Biden will speak at the White House on Thursday ..."
UK approves Pfizer Covid vaccine for children aged 5-11,"The UK has approved the Pfizer/BioNTech Covid-19 vaccine for children aged between five and 11, the Medicines and Healthcare products Regulatory Agency (MHRA) has announced. The decision comes after ..."
"Australia v England: Ashes first Test, day one – live!",Over-by-over report: Can England get off to a good start in the Ashes series against Australia at the Gabba? Join Geoff Lemon for updates
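If the defaults don't fit, the same method accepts keyword arguments for the layout. For instance, this one-liner (the news.tsv file name is arbitrary) writes the same DataFrame as a tab-separated file with an explicit encoding:

# Tab-delimited output, explicit UTF-8 encoding, no index column
df.to_csv("news.tsv", sep="\t", encoding="utf-8", index=False)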
As you can see, you can export the data from the Pandas DataFrame to a CSV file using the to_csv() method and save the data in a simple and compact format. However, CSV is not the only format that you can use to store the data, as there are other formats that may suit your needs better, such as JSON or SQL. How can you export the data to these formats? That’s what you will learn in the next sections.
4.3. Exporting Data to JSON
A JSON (JavaScript Object Notation) file is a lightweight and human-readable data interchange format that stores data as a collection of name-value pairs or an ordered list of values. A JSON file can be easily parsed and generated by various programs and applications, such as JavaScript, Python, or Pandas.
To export the data from the Pandas DataFrame to a JSON file, you need to use the to_json() method and pass the name of the file as an argument. You can also specify other arguments, such as the orient, the indent, and whether to include the index, to customize the output of the JSON file.
For example, suppose you want to export the DataFrame that you created from the title and the summary of the latest news articles that you scraped from the Bing News website using BeautifulSoup4 and Pandas. You can use the following code to do that:
# Import the requests, BeautifulSoup4, and Pandas libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define the URL of the web page to scrape
url = "https://www.bing.com/news"

# Send an HTTP request to the URL and get the response
response = requests.get(url)

# Check if the response is successful (status code 200)
if response.status_code == 200:
    # Get the HTML content from the response
    html = response.text

    # Parse the HTML content using BeautifulSoup4 and the html.parser parser
    soup = BeautifulSoup(html, "html.parser")

    # Find all the news cards (the class names are illustrative and may change)
    cards = soup.find_all("div", class_="news-card")

    # Collect the title and the summary of each article
    articles = []
    for card in cards:
        title = card.find("a", class_="title")
        summary = card.find("div", class_="snippet")
        if title and summary:
            articles.append({"title": title.get_text(strip=True),
                             "summary": summary.get_text(strip=True)})

    # Create a DataFrame from the list of dictionaries
    df = pd.DataFrame(articles)

    # Export the DataFrame to a JSON file as an array of objects
    df.to_json("news.json", orient="records", indent=4)
This code will create a JSON file named news.json in the same directory as the Python script. Because orient="records" is passed, the JSON file contains an array of objects, each representing a news article with two properties: title and summary. The JSON file will look something like this:
[
    {
        "title": "Biden to announce new Covid measures for US as Omicron spreads",
        "summary": "President Joe Biden is expected to announce new measures to combat Covid-19 in the US, as the Omicron variant continues to spread around the world. Mr Biden will speak at the White House on Thursday ..."
    },
    {
        "title": "UK approves Pfizer Covid vaccine for children aged 5-11",
        "summary": "The UK has approved the Pfizer/BioNTech Covid-19 vaccine for children aged between five and 11, the Medicines and Healthcare products Regulatory Agency (MHRA) has announced. The decision comes after ..."
    },
    {
        "title": "Australia v England: Ashes first Test, day one – live!",
        "summary": "Over-by-over report: Can England get off to a good start in the Ashes series against Australia at the Gabba? Join Geoff Lemon for updates"
    }
]
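The array-of-objects layout above comes from passing orient="records" to to_json(); other orientations reshape the same data. For comparison, a sketch of the default, column-oriented layout (the file name is arbitrary):

# The default orientation nests the values under each column name,
# keyed by the row index: {"title": {"0": ..., "1": ...}, ...}
df.to_json("news_columns.json", indent=4)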
As you can see, you can export the data from the Pandas DataFrame to a JSON file using the to_json() method and save the data in a lightweight and human-readable format. However, JSON is not the only format that you can use to store the data, as there are other formats that may suit your needs better, such as CSV or SQL. How can you export the data to these formats? That’s what you will learn in the next sections.
4.4. Exporting Data to SQL
SQL (Structured Query Language) is the standard language for querying and manipulating relational databases, which store data in tables made up of rows and columns. A SQL database can be used for various purposes, such as data analysis, data visualization, data integration, and more.
To export the data from the Pandas DataFrame to a SQL database, you need to use the to_sql() method and pass the name of the table and a connection object (such as a SQLAlchemy engine) as arguments. You can also specify other arguments, such as the schema, the index, and the if_exists behavior, to customize the output of the SQL table.
For example, suppose you want to export the DataFrame that you created from the title and the summary of the latest news articles that you scraped from the Bing News website using BeautifulSoup4 and Pandas. You can use the following code to do that:
# Import the requests, BeautifulSoup4, Pandas, and SQLAlchemy libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from sqlalchemy import create_engine

# Define the URL of the web page to scrape
url = "https://www.bing.com/news"

# Send an HTTP request to the URL and get the response
response = requests.get(url)

# Check if the response is successful (status code 200)
if response.status_code == 200:
    # Get the HTML content from the response
    html = response.text

    # Parse the HTML content using BeautifulSoup4 and the html.parser parser
    soup = BeautifulSoup(html, "html.parser")

    # Find all the news cards (the class names are illustrative and may change)
    cards = soup.find_all("div", class_="news-card")

    # Collect the title and the summary of each article
    articles = []
    for card in cards:
        title = card.find("a", class_="title")
        summary = card.find("div", class_="snippet")
        if title and summary:
            articles.append({"title": title.get_text(strip=True),
                             "summary": summary.get_text(strip=True)})

    # Create a DataFrame from the list of dictionaries
    df = pd.DataFrame(articles)

    # Create a SQLAlchemy engine for a SQLite database file named news.db
    engine = create_engine("sqlite:///news.db")

    # Export the DataFrame to a table named news, replacing it if it exists
    df.to_sql("news", engine, index=False, if_exists="replace")
This code will create a SQLite database file named news.db in the same directory as the Python script. The database will contain a table named news with two columns, title and summary, and one row for each news article. Queried from the sqlite3 shell, the table will look something like this:
sqlite> SELECT * FROM news;
Biden to announce new Covid measures for US as Omicron spreads|President Joe Biden is expected to announce new measures to combat Covid-19 in the US, as the Omicron variant continues to spread around the world. Mr Biden will speak at the White House on Thursday ...
UK approves Pfizer Covid vaccine for children aged 5-11|The UK has approved the Pfizer/BioNTech Covid-19 vaccine for children aged between five and 11, the Medicines and Healthcare products Regulatory Agency (MHRA) has announced. The decision comes after ...
Australia v England: Ashes first Test, day one – live!|Over-by-over report: Can England get off to a good start in the Ashes series against Australia at the Gabba? Join Geoff Lemon for updates
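As a quick sanity check on the round trip (assuming the code above has just run, so engine is still in scope), you can read the table straight back into a DataFrame:

# Read the news table back into a DataFrame to verify the export
df_check = pd.read_sql("SELECT * FROM news", engine)
print(df_check.head())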
As you can see, you can export the data from the Pandas DataFrame to a SQL database using the to_sql() method and save the data in a relational and queryable format. You have now learned how to store the data in different formats using Pandas, such as CSV, JSON, and SQL. You have also completed the web scraping tutorial with Python, BeautifulSoup4, and Pandas. Congratulations!
In the next and final section, you will review what you have learned in this tutorial and get some tips and resources for further learning and practice.
5. Conclusion
In this tutorial, you have learned how to scrape, clean, and store data from websites using Python, BeautifulSoup4, and Pandas. You have also learned how to use various methods and attributes of these libraries to perform different tasks, such as sending HTTP requests, parsing HTML, extracting data elements, handling missing values, creating DataFrames, and exporting data to different formats.
By following this tutorial, you have gained valuable skills and knowledge that can help you with various data-related projects and applications, such as data analysis, data visualization, data integration, and more. You have also learned how to use some of the most popular and powerful tools and libraries for web scraping and data manipulation in Python.
Here are some key points that you should remember from this tutorial:
- Web scraping is a technique for extracting data from websites using automated scripts or programs.
- Requests is a Python library that allows you to send HTTP requests and get the response from a web server.
- BeautifulSoup4 is a Python library that allows you to parse and manipulate HTML and XML documents.
- Data cleaning is the process of transforming and improving the quality of the data by removing or correcting errors, inconsistencies, duplicates, and irrelevant information.
- Pandas is a Python library that provides high-performance data structures and tools for data analysis.
- A DataFrame is a two-dimensional tabular data structure that can store data of different types and sizes.
- You can export the data from the Pandas DataFrame to different formats, such as CSV, JSON, and SQL, using the to_csv(), to_json(), and to_sql() methods.
We hope that you have enjoyed this tutorial and learned something new and useful. If you want to learn more about web scraping and data manipulation with Python, BeautifulSoup4, and Pandas, here are some resources that you can check out:
- Python Documentation: The official documentation of the Python programming language.
- Requests Documentation: The official documentation of the Requests library.
- BeautifulSoup4 Documentation: The official documentation of the BeautifulSoup4 library.
- Pandas Documentation: The official documentation of the Pandas library.
- Real Python: A website that offers high-quality Python tutorials, articles, courses, and books.
- DataCamp: A platform that offers interactive online courses and projects on data science and machine learning with Python.
Thank you for reading this tutorial and happy web scraping!