Web Scraping 106: Scraping Multiple Pages and Websites with BeautifulSoup4 in Python

This blog teaches you how to use BeautifulSoup4 and requests to scrape multiple pages and websites with pagination, links, and APIs in Python.

1. Introduction

Web scraping is a powerful technique to extract data from websites and use it for various purposes, such as data analysis, machine learning, web development, and more. However, not all websites have a single page that contains all the data you need. Sometimes, you may need to scrape multiple pages or even multiple websites to get the complete data set you want.

In this tutorial, you will learn how to use BeautifulSoup4 and requests to scrape multiple pages and websites with pagination, links, and APIs in Python. You will also learn how to handle common challenges and issues that arise when scraping multiple pages, such as rate limiting, dynamic content, and authentication.

By the end of this tutorial, you will be able to scrape multiple pages and websites with ease and confidence, and use the scraped data for your own projects and applications.

Before you start, make sure you have a basic understanding of web scraping, HTML, and Python. If you need a refresher, you can check out our previous tutorials on web scraping basics, BeautifulSoup4 and requests, and extracting data with BeautifulSoup4.

Ready to scrape multiple pages and websites with BeautifulSoup4 and requests? Let’s get started!

2. Web Scraping Basics

In this section, you will learn the basic concepts and principles of web scraping. You will also learn some of the common challenges and limitations of web scraping, and how to overcome them.

Web scraping is the process of extracting data from websites using automated tools or scripts. Web scraping can be used for various purposes, such as data analysis, machine learning, web development, and more.

However, web scraping is not as simple as it sounds. There are many factors that affect the success and efficiency of web scraping, such as:

  • The structure and design of the website
  • The type and amount of data you want to scrape
  • The speed and reliability of your internet connection
  • The legal and ethical issues of web scraping

Therefore, before you start scraping any website, you need to understand how web scraping works and which best practices and tools to use.

How does web scraping work?

Web scraping works by sending HTTP requests to the target website and receiving HTML responses. The HTML response contains the code and content of the website, which can be parsed and extracted using libraries such as BeautifulSoup4 (which parses and manipulates HTML and XML documents) and requests (which sends and receives HTTP requests). Section 2.3 walks through this process in detail, with a worked example.

By combining BeautifulSoup4 and requests, you can scrape almost any website you want, as long as you follow the rules and guidelines of the website.

What are the rules and guidelines of web scraping?

Web scraping is not inherently illegal, but it can be unethical or harmful if done without permission or consideration. Therefore, before you scrape any website, you should always check the following sources:

  • The terms and conditions of the website
  • The robots.txt file of the website
  • The privacy policy of the website

These sources will tell you what you can and cannot do with the data and content of the website, and how to respect the rights and interests of the website owners and users.
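For example, Python's built-in urllib.robotparser module can check a site's robots.txt for you. Here is a minimal sketch (using the BBC site from later examples; substitute whatever site you plan to scrape):

# Check whether a given user agent may fetch a URL, according to robots.txt
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.bbc.com/robots.txt")
rp.read()

# "*" means "any user agent"; returns True if this path may be fetched
print(rp.can_fetch("*", "https://www.bbc.com/news"))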

Some of the common rules and guidelines of web scraping are:

  • Do not scrape personal or sensitive data without consent
  • Do not scrape copyrighted or protected data without permission
  • Do not scrape data that is not publicly available or accessible
  • Do not scrape data at a high frequency or volume that may overload or harm the website
  • Do not scrape data for malicious or fraudulent purposes

If you follow these rules and guidelines, you can scrape data safely and responsibly, and avoid any legal or ethical issues.
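One simple way to follow the frequency guideline in practice is to pause between requests. Here is a minimal sketch using Python's built-in time module (the URLs are placeholders):

import time
import requests

# Placeholder URLs standing in for the pages you want to scrape
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    response = requests.get(url)
    # ... parse the response and extract your data here ...
    time.sleep(2)  # wait two seconds so you don't overload the server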

Now that you have learned the basics of web scraping, you are ready to dig deeper. In the next sections, you will take a closer look at what web scraping is, why it is useful, and how it works step by step.

2.1. What is Web Scraping?

Web scraping is the process of extracting data from websites using automated tools or scripts. Web scraping can be used for various purposes, such as data analysis, machine learning, web development, and more.

Web scraping involves two main steps: sending HTTP requests to the target website, and parsing the HTML response to extract the data you want.

An HTTP request is a message that you send to a web server, asking for a specific resource or action. For example, when you type a URL in your browser, you are sending an HTTP request to the web server, asking for the web page associated with that URL.

An HTML response is a message that the web server sends back to you, containing the resource or action you requested. For example, when you receive a web page from the web server, you are receiving an HTML response that contains the code and content of the web page.

To parse the HTML response, you need to use a library or tool that can understand and manipulate the HTML code and content. One of the most popular and powerful libraries for parsing HTML in Python is BeautifulSoup4. BeautifulSoup4 can help you find and extract the data you want from the HTML response, such as text, links, images, tables, and more.
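To make these two steps concrete, here is a minimal sketch (assuming requests and BeautifulSoup4 are installed, which is covered in Section 3; example.com is a placeholder domain that serves a simple static page):

# Step 1: send an HTTP request; step 2: parse the HTML response
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")      # step 1: HTTP request
soup = BeautifulSoup(response.text, "html.parser")  # step 2: parse the HTML
print(soup.title.get_text())                        # extract the page title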

Web scraping can be very useful and efficient, as it can help you collect large amounts of data from various websites in a short time. However, web scraping also has some challenges and limitations, such as:

  • The structure and design of the website may change over time, making your scraping code obsolete or inaccurate
  • The website may have anti-scraping measures, such as CAPTCHA, IP blocking, or rate limiting, that prevent you from scraping the data you want
  • The website may have dynamic or interactive content, such as JavaScript, AJAX, or web sockets, that require additional tools or techniques to scrape
  • The website may have authentication or authorization requirements, such as login, cookies, or tokens, that restrict your access to the data you want
  • The website may have legal or ethical issues, such as terms and conditions, robots.txt, or privacy policy, that limit your use of the data you scrape

Therefore, before you start scraping any website, you need to do some research and planning, and follow some best practices and guidelines, to ensure that your web scraping is successful and responsible.
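As a small illustration of how some of these challenges are commonly handled, you can send a browser-like User-Agent header and keep cookies in a Session object, which is how many login-protected sites track that you are authenticated. This is only a sketch; the URLs, credentials, and form field names are placeholders:

import requests

# Some sites serve different (or no) content to the default requests User-Agent
headers = {"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"}

# A Session keeps cookies between requests, so a login carries over
session = requests.Session()
session.post("https://example.com/login",  # hypothetical login endpoint
             data={"username": "user", "password": "pass"})
response = session.get("https://example.com/members", headers=headers)
print(response.status_code)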

In the next section, you will learn why web scraping is useful and valuable, and look at some of its common applications and benefits.

2.2. Why Web Scraping?

Web scraping is a useful and valuable technique for many reasons. Some of the common applications and benefits of web scraping are:

  • Data analysis: Web scraping can help you collect and analyze large amounts of data from various sources, such as social media, news, e-commerce, and more. You can use web scraping to perform tasks such as sentiment analysis, trend analysis, market research, and more.
  • Machine learning: Web scraping can help you obtain and prepare data for machine learning models, such as natural language processing, computer vision, and more. You can use web scraping to gather and label data, augment data, and evaluate models.
  • Web development: Web scraping can help you create and improve web applications, such as web crawlers, web scrapers, web bots, and more. You can use web scraping to automate tasks, extract information, and interact with websites.
  • Content creation: Web scraping can help you generate and enhance content, such as articles, blogs, podcasts, videos, and more. You can use web scraping to find and curate content, summarize and paraphrase content, and create original content.
  • Personal use: Web scraping can help you with your personal needs and interests, such as travel, education, entertainment, and more. You can use web scraping to find and compare deals, learn new skills, discover new things, and have fun.

These are just some of the examples of how web scraping can be useful and valuable. Web scraping can also be used for many other purposes, depending on your goals and creativity.

However, web scraping also has some challenges and limitations, as we discussed in the previous section. Therefore, you need to be careful and responsible when you scrape any website, and follow the rules and guidelines of web scraping.

In the next section, you will learn how web scraping works, and the main steps and tools involved.

2.3. How Web Scraping Works

Web scraping works by sending HTTP requests to the target website, and receiving HTML responses from the website. The HTML response contains the code and content of the website, which can be parsed and extracted using various methods and libraries.

One of the most popular and powerful libraries for web scraping in Python is BeautifulSoup4. BeautifulSoup4 is a Python library that allows you to parse and manipulate HTML and XML documents. It can handle different types of parsers, such as html.parser, lxml, html5lib, and more. It can also handle different types of encodings, such as UTF-8, ISO-8859-1, and more.

Another popular and powerful library for web scraping in Python is requests. Requests is a Python library that allows you to send and receive HTTP requests easily and efficiently. It can handle different types of HTTP methods, such as GET, POST, PUT, DELETE, and more. It can also handle different types of parameters, headers, cookies, authentication, and more.

By combining BeautifulSoup4 and requests, you can scrape almost any website you want, as long as you follow the rules and guidelines of the website.

To illustrate how web scraping works, let’s take a simple example. Suppose you want to scrape the title and summary of the latest news articles from the BBC website. How would you do that?

Here are the steps you need to follow:

  1. Find the URL of the website you want to scrape. In this case, it is https://www.bbc.com/news.
  2. Send an HTTP GET request to the URL using requests. This will return an HTML response that contains the code and content of the web page.
  3. Parse the HTML response using BeautifulSoup4. This will create a BeautifulSoup object that represents the HTML document as a nested data structure.
  4. Find and extract the data you want from the BeautifulSoup object using various methods and attributes. For example, you can use the find_all method to find all the elements that match certain criteria, such as tag name, class name, id, etc. You can also use the text attribute to get the text content of an element, or the get method to get the value of an attribute, such as href, src, etc.
  5. Store and process the data you extracted as you wish. For example, you can save the data to a file, a database, or a data frame. You can also perform further analysis or manipulation on the data, such as filtering, sorting, grouping, etc.

Here is an example of what the code for this web scraping task might look like:

# Import requests and BeautifulSoup4
import requests
from bs4 import BeautifulSoup

# Define the URL of the website
url = "https://www.bbc.com/news"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Check if the response is successful
if response.status_code == 200:
    # Parse the HTML response using BeautifulSoup4
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Find and extract the data you want
    # In this case, we want the title and summary of the latest news articles,
    # which live in "gs-c-promo-body" blocks with "gs-c-promo-heading__title"
    # and "gs-c-promo-summary" children (BBC markup at the time of writing;
    # inspect the page and adjust the class names if the markup has changed)
    articles = soup.find_all("div", class_="gs-c-promo-body")
    for article in articles:
        title_tag = article.find("h3", class_="gs-c-promo-heading__title")
        summary_tag = article.find("p", class_="gs-c-promo-summary")
        # Skip promo blocks that are missing a title or a summary
        if title_tag and summary_tag:
            print(title_tag.text)
            print(summary_tag.text)
            print()
else:
    # Handle the error if the response is not successful
    print("Error: Unable to access the website")

This code will print the title and summary of the latest news articles from the BBC website, such as:

UK Covid cases rise by 50,000 in a day
The UK has recorded more than 50,000 new Covid cases in a single day for the first time since mid-January.

US to release 50m barrels of oil to lower prices
The US will release 50 million barrels of oil from its strategic reserve to lower prices, the White House says.

...

As you can see, web scraping is a simple and powerful way to extract data from websites. However, web scraping also has some challenges and limitations, as we discussed in the previous section. Therefore, you need to be careful and responsible when you scrape any website, and follow the rules and guidelines of web scraping.

In the next section, you will learn how to install and import BeautifulSoup4 and requests, and how to make a simple HTTP request.

3. BeautifulSoup4 and requests

In this section, you will set up the two essential tools for web scraping in Python: BeautifulSoup4, which parses HTML so you can extract data from it, and requests, which sends HTTP requests so you can fetch the web pages you want to scrape.

The next two subsections walk through the setup step by step: first installing and importing the two libraries, then making a simple HTTP request and checking the status code of the response.

3.1. Installing and Importing BeautifulSoup4 and requests

In this section, you will learn how to install and import BeautifulSoup4 and requests. This is the first piece of setup for web scraping in Python.

BeautifulSoup4 is a Python library that allows you to parse and manipulate HTML and XML documents. It can help you find and extract the data you want from the HTML response, such as text, links, images, tables, and more.

requests is a Python library that allows you to send and receive HTTP requests easily and efficiently. It can help you communicate with the web server and request the web page you want to scrape.

To use BeautifulSoup4 and requests, you need to install and import them in your Python environment. You can install them using pip, which is a package manager for Python. You can also import them using the import statement, which allows you to access the functions and classes of the libraries.

Here is how you can install and import BeautifulSoup4 and requests:

# Install BeautifulSoup4 and requests using pip
# (run these commands in your terminal or command prompt, not in Python)
pip install beautifulsoup4
pip install requests

# Import BeautifulSoup4 and requests in your Python script
from bs4 import BeautifulSoup
import requests

Once you have installed and imported BeautifulSoup4 and requests, you are ready to start scraping. The first step is to make a simple HTTP request to the URL of the website you want to scrape, which is covered in the next section.

3.2. Making a Simple HTTP Request

In this section, you will learn how to make a simple HTTP request using requests, and how to check the status code of the response. This is the first step of web scraping, as it allows you to communicate with the web server and request the web page you want to scrape.

An HTTP request is a message that you send to a web server, asking for a specific resource or action. For example, when you type a URL in your browser, you are sending an HTTP request to the web server, asking for the web page associated with that URL.

There are different types of HTTP methods, such as GET, POST, PUT, DELETE, and more. Each method has a different purpose and meaning. For web scraping, the most common method is GET, which means you want to get the resource from the web server.

To make an HTTP GET request using requests, you need to use the get function, which takes the URL as an argument and returns an HTTP response object. The HTTP response object contains the status code, headers, cookies, and content of the web page.

Here is how you can make a simple HTTP GET request using requests:

# Define the URL of the website you want to scrape
url = "https://www.bbc.com/news"

# Make an HTTP GET request to the URL using requests
response = requests.get(url)

# Check the status code of the response
print(response.status_code)

The status code of the response tells you if the request was successful or not. The most common status codes are:

  • 200: OK, the request was successful and the resource was returned
  • 404: Not Found, the resource was not found on the web server
  • 403: Forbidden, the web server refused to return the resource
  • 500: Internal Server Error, the web server encountered an error while processing the request

For web scraping, you want to get a status code of 200, which means the web page was successfully returned. If you get any other status code, you need to handle the error or try a different URL.
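If you prefer exceptions over manual status checks, requests also provides a raise_for_status method that raises an error for any unsuccessful status code:

# raise_for_status() raises requests.exceptions.HTTPError for 4xx/5xx responses,
# so you can handle all error codes in one try/except block
try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print(f"Error: {e}")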

In the next section, you will learn how to parse the HTML response using BeautifulSoup4, and how to extract the data you want from the web page.

3.3. Parsing HTML with BeautifulSoup4

In this section, you will learn how to parse the HTML response using BeautifulSoup4, and how to extract the data you want from the web page. This is the second step of web scraping, as it allows you to access and manipulate the HTML document as a nested data structure.

HTML, or HyperText Markup Language, is the standard language for creating web pages. HTML consists of elements, which are the building blocks of the web page. Each element has a tag name, such as <html>, <head>, <body>, <div>, <p>, <a>, <img>, etc. Each element can also have attributes, such as class, id, href, src, etc. Each element can also contain text or other elements, forming a tree-like structure.

Here is an example of a simple HTML document:

<html>
  <head>
    <title>Hello, World!</title>
  </head>
  <body>
    <h1>Hello, World!</h1>
    <p>This is an example web page.</p>
    <a href="https://www.bing.com">Visit Bing</a>
    <img src="example.jpg" alt="Example Image">
  </body>
</html>

When you make an HTTP request to a web server, you receive an HTML response that contains the code and content of the web page. However, the HTML response is just a string of text, which is not easy to work with. You need a way to convert the HTML response into a Python object that you can manipulate and extract data from.

That’s where BeautifulSoup4 comes in. BeautifulSoup4 is a Python library that allows you to parse and manipulate HTML and XML documents. It can handle different types of parsers, such as html.parser, lxml, html5lib, and more. It can also handle different types of encodings, such as UTF-8, ISO-8859-1, and more.

To parse the HTML response using BeautifulSoup4, you need to use the BeautifulSoup function, which takes the HTML response and the parser name as arguments and returns a BeautifulSoup object. The BeautifulSoup object represents the HTML document as a nested data structure, which you can access and manipulate using various methods and attributes.

Here is how you can parse the HTML response using BeautifulSoup4:

# Define the URL of the website you want to scrape
url = "https://www.bbc.com/news"

# Make an HTTP GET request to the URL using requests
response = requests.get(url)

# Check if the response is successful
if response.status_code == 200:
    # Parse the HTML response using BeautifulSoup4
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Print the type and content of the BeautifulSoup object
    print(type(soup))
    print(soup)
else:
    # Handle the error if the response is not successful
    print("Error: Unable to access the website")

This code will print the type and content of the BeautifulSoup object.

As you will see, the BeautifulSoup object is a Python object that contains the HTML elements and their attributes, text, and children. You can use the BeautifulSoup object to find and extract the data you want from the web page, such as text, links, images, tables, and more.
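If the raw output is hard to read, you can also call the prettify method, which returns the parsed HTML as a neatly indented string:

# prettify() re-indents the parsed HTML, which makes the document
# structure much easier to inspect; print only the first 500 characters
print(soup.prettify()[:500])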

In the next section, you will learn how to extract data with BeautifulSoup4, and how to use various methods and attributes to find and access the elements you want.

3.4. Extracting Data with BeautifulSoup4

In this section, you will learn how to extract data with BeautifulSoup4, and how to use various methods and attributes to find and access the elements you want. This is where the parsed HTML document pays off, as you can now pull out exactly the data you need.

Once you have parsed the HTML response using BeautifulSoup4, you can use the BeautifulSoup object to find and extract the data you want from the web page, such as text, links, images, tables, and more. The BeautifulSoup object has many methods and attributes that allow you to navigate and search the HTML document, such as:

  • find: This method takes a tag name and/or a dictionary of attributes as arguments and returns the first element that matches the criteria.
  • find_all: This method takes the same arguments as find, but returns a list of all elements that match the criteria.
  • select: This method takes a CSS selector as an argument and returns a list of all elements that match the selector.
  • get_text: This method returns the text content of an element, including the text of all its children.
  • get: This method takes an attribute name as an argument and returns the value of the attribute for an element.
  • children: This attribute returns an iterator over the direct children of an element.
  • descendants: This attribute returns a generator over all the descendants of an element.
  • parent: This attribute returns the direct parent of an element.
  • parents: This attribute returns a generator over all the ancestors of an element.
  • next_siblings and previous_siblings: These attributes return generators over the siblings that follow or precede an element.
  • next_sibling and previous_sibling: These attributes return the single sibling immediately after or before an element.
  • next_element and previous_element: These attributes return the next or previous element in the parse order of the document.

These methods and attributes can be combined and chained to find and access the elements you want. For example, if you want to find the first paragraph element that has the class “summary” and get its text content, you can use the following code:

# Find the first paragraph element that has the class "summary"
p = soup.find("p", class_="summary")

# Get the text content of the paragraph element
text = p.get_text()

# Print the text content
print(text)

This code will print the text content of the paragraph element, such as:

Some text that summarizes the web page.

You can use these methods and attributes to extract any data you want from the web page, such as text, links, images, tables, and more. You can also store the extracted data in a Python data structure, such as a list, a dictionary, or a pandas dataframe, for further analysis and manipulation.
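Here is a short, self-contained sketch that exercises a few more of these methods on a small hypothetical HTML snippet:

from bs4 import BeautifulSoup

# A small hypothetical HTML snippet to demonstrate navigation
html = """
<div id="content">
    <h1>Example</h1>
    <p class="summary">First summary.</p>
    <p class="summary">Second summary.</p>
    <a href="https://example.com/page">Read more</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select: find elements with a CSS selector
for p in soup.select("p.summary"):
    print(p.get_text())  # First summary. / Second summary.

# get: read an attribute value from an element
print(soup.find("a").get("href"))  # https://example.com/page

# sibling navigation: find_next_sibling skips over whitespace text nodes
first_p = soup.find("p", class_="summary")
print(first_p.find_next_sibling("p").get_text())  # Second summary.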

In the next section, you will learn how to scrape multiple pages and websites with BeautifulSoup4 and requests, and how to handle pagination, links, and APIs.

4. Scraping Multiple Pages

In this section, you will learn how to scrape multiple pages and websites with BeautifulSoup4 and requests, and how to handle pagination, links, and APIs. This is the third step of web scraping, as it allows you to expand your data set and access more information from different sources.

Some websites have a single page that contains all the data you need. However, most websites have multiple pages that are connected by pagination, links, or APIs. Pagination is a technique that divides the data into smaller chunks and displays them on different pages. Links are hyperlinks that point to other web pages or resources. APIs are application programming interfaces that allow you to access data or functionality from another website or service.

To scrape multiple pages and websites, you need to follow these steps:

  1. Identify the pattern or logic that connects the pages or websites you want to scrape.
  2. Use a loop or a recursion to iterate over the pages or websites.
  3. Make an HTTP request to each page or website using requests.
  4. Parse the HTML response using BeautifulSoup4.
  5. Extract the data you want using BeautifulSoup4.
  6. Store the data in a Python data structure, such as a list, a dictionary, or a pandas dataframe.

Depending on the type of pagination, links, or APIs, you may need to use different techniques or libraries to scrape multiple pages and websites. For example, you may need to use regular expressions, string manipulation, or JSON parsing to extract the URLs or parameters of the pages or websites. You may also need to use selenium, scrapy, or other libraries to handle dynamic or complex web pages or websites.
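For instance, here is a small sketch of how string manipulation, urljoin, and a regular expression can be used to work out the URL of the next page (the href value below is hypothetical):

import re
from urllib.parse import urljoin

# A hypothetical "next page" href extracted from a pagination element
next_href = "/search?q=web+scraping&start=20"

# urljoin turns the relative href into an absolute URL
next_url = urljoin("https://www.example.com", next_href)
print(next_url)  # https://www.example.com/search?q=web+scraping&start=20

# A regular expression pulls the offset parameter out of the URL
match = re.search(r"start=(\d+)", next_url)
if match:
    print(int(match.group(1)))  # 20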

In the next subsections, you will learn how to scrape multiple pages and websites with pagination, links, and APIs using BeautifulSoup4 and requests.

4.1. Scraping Pages with Pagination

In this subsection, you will learn how to scrape multiple pages that are connected by pagination. Pagination is a technique that divides the data into smaller chunks and displays them on different pages. For example, when you search for something on Google, you will see a list of results that are divided into several pages, each with a number or a symbol at the bottom.

To scrape multiple pages with pagination, you need to identify the pattern or logic that connects the pages. Usually, the URL of each page will have a parameter that indicates the page number or the offset of the results. For example, if you search for “web scraping” on Google, the URL of the first page will be:

https://www.google.com/search?q=web+scraping

The URL of the second page will be:

https://www.google.com/search?q=web+scraping&start=10

The URL of the third page will be:

https://www.google.com/search?q=web+scraping&start=20

As you can see, the URL of each page has a parameter called start, which indicates the offset of the results: the first page has no start parameter (an offset of zero), the second page has start=10, the third page has start=20, and so on.

Once you have identified the pattern or logic of the pagination, you can use a loop or a recursion to iterate over the pages. For each page, you need to make an HTTP request using requests, parse the HTML response using BeautifulSoup4, extract the data you want using BeautifulSoup4, and store the data in a Python data structure. You can also use a condition to stop the loop or the recursion when you reach the last page or when you have enough data.

Here is an example of how you can scrape multiple pages with pagination using BeautifulSoup4 and requests:

# Define the base URL of the website you want to scrape
base_url = "https://www.google.com/search?q=web+scraping"

# Define a browser-like User-Agent header; Google often serves different
# (or blocked) content to the default requests User-Agent
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Define an empty list to store the scraped data
data = []

# Define a variable to store the offset of the results
offset = 0

# Define a variable to store the maximum number of pages you want to scrape
max_pages = 10

# Define a variable to store the number of pages you have scraped
page_count = 0

# Use a while loop to iterate over the pages
while page_count < max_pages:
    # Define the URL of the current page by adding the offset parameter
    url = base_url + "&start=" + str(offset)
    
    # Make an HTTP GET request to the URL using requests, sending the headers
    response = requests.get(url, headers=headers)
    
    # Check if the response is successful
    if response.status_code == 200:
        # Parse the HTML response using BeautifulSoup4
        soup = BeautifulSoup(response.text, "html.parser")
        
        # Find all the elements that contain the data you want using BeautifulSoup4
        # For example, to scrape the titles and links of the search results, you can
        # look for the "yuRUbf" result containers (Google's markup at the time of
        # writing; it changes frequently, so inspect the page and adjust as needed)
        results = soup.find_all("div", class_="yuRUbf")
        
        # Iterate over the elements and extract the data using BeautifulSoup4
        for result in results:
            # Get the title and the link of the result
            # (skip containers that are missing either one)
            title_tag = result.find("h3")
            link_tag = result.find("a")
            if title_tag is None or link_tag is None:
                continue
            title = title_tag.get_text()
            link = link_tag.get("href")
            
            # Store the data in a dictionary
            item = {"title": title, "link": link}
            
            # Append the dictionary to the data list
            data.append(item)
        
        # Increment the offset by 10
        offset += 10
        
        # Increment the page count by 1
        page_count += 1
        
        # Print a message to indicate the progress
        print(f"Scraped page {page_count}")
    else:
        # Handle the error if the response is not successful
        print("Error: Unable to access the website")
        # Break the loop
        break

# Print the scraped data
print(data)

This code will scrape the first 10 pages of the Google search results for “web scraping”, and print the titles and links of the results, such as:

Scraped page 1
Scraped page 2
Scraped page 3
Scraped page 4
Scraped page 5
Scraped page 6
Scraped page 7
Scraped page 8
Scraped page 9
Scraped page 10
[{'title': 'Web scraping - Wikipedia', 'link': 'https://en.wikipedia.org/wiki/Web_scraping'}, {'title': 'Web Scraping Tutorial: How to Scrape a Website in 2021', 'link': 'https://www.parsehub.com/blog/web-scraping-tutorial/'}, {'title': 'Web Scraping 101: What you Need to Know and How to Scrape ...', 'link': 'https://www.freecodecamp.org/news/web-scraping-101-what-you-need-to-know-and-how-to-scrape-with-python-selenium-beautifulsoup-5946935d93fe/'}, ...]

You can modify this code to scrape any website that has pagination, and to extract any data you want from the web pages. You can also change the max_pages variable to scrape more or fewer pages, depending on your needs. Keep in mind that Google's markup changes frequently and its terms of service restrict automated scraping, so treat this example as an illustration of the pagination pattern rather than a production scraper.

In the next subsection, you will learn how to scrape multiple pages and websites with links using BeautifulSoup4 and requests.

4.2. Scraping Pages with Links

In this subsection, you will learn how to scrape multiple pages and websites that are connected by links. Links are hyperlinks that point to other web pages or resources. For example, when you visit a news website, you will see a list of headlines that link to the full articles on different pages or websites.

To scrape multiple pages and websites with links, you need to identify the elements that contain the links you want to follow. Usually, the links will have an <a> tag with an href attribute that indicates the URL of the target page or website. For example, on the BBC News website, the links to the articles have roughly the following HTML structure (simplified from the markup at the time of writing):

<a class="gs-c-promo-heading" href="/news/world-europe-59090662">
    <h3 class="gs-c-promo-heading__title">Austria to impose lockdown for unvaccinated</h3>
</a>
Once you have identified the elements that contain the links, you need to use a loop or a recursion to iterate over the elements. For each element, you need to extract the URL of the link using BeautifulSoup4, make an HTTP request to the URL using requests, parse the HTML response using BeautifulSoup4, extract the data you want using BeautifulSoup4, and store the data in a Python data structure. You can also use a condition to stop the loop or the recursion when you reach the last element or when you have enough data.

Here is an example of how you can scrape multiple pages and websites with links using BeautifulSoup4 and requests:

# Import urljoin to turn relative links into absolute URLs
from urllib.parse import urljoin

# Define the URL of the website you want to scrape
url = "https://www.bbc.com/news"

# Define an empty list to store the scraped data
# (defined before the request, so it exists even if the request fails)
data = []

# Make an HTTP GET request to the URL using requests
response = requests.get(url)

# Check if the response is successful
if response.status_code == 200:
    # Parse the HTML response using BeautifulSoup4
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Find all the elements that contain the links to the articles using BeautifulSoup4
    # For example, to scrape the headlines and links of the top stories, you can look
    # for "gs-c-promo-heading" links (BBC markup at the time of writing; inspect the
    # page and adjust the class name if the markup has changed)
    headlines = soup.find_all("a", class_="gs-c-promo-heading")
    
    # Iterate over the elements and extract the data using BeautifulSoup4
    for headline in headlines:
        # Get the title of the headline (skip links that have no h3 title)
        title_tag = headline.find("h3")
        if title_tag is None:
            continue
        title = title_tag.get_text()
        
        # Get the link of the headline; BBC hrefs are often relative
        # (e.g. "/news/..."), so join them with the base URL
        link = urljoin(url, headline.get("href"))
        
        # Store the data in a dictionary
        item = {"title": title, "link": link}
        
        # Append the dictionary to the data list
        data.append(item)
        
        # Print a message to indicate the progress
        print(f"Scraped headline: {title}")
else:
    # Handle the error if the response is not successful
    print("Error: Unable to access the website")

# Print the scraped data
print(data)

This code will scrape the headlines and links of the top stories from the BBC News website, and print the data, such as:

Scraped headline: Austria to impose lockdown for unvaccinated
Scraped headline: US to ease travel rules for vaccinated visitors
Scraped headline: COP26: What are the sticking points?
Scraped headline: 'I was trafficked by my boyfriend'
Scraped headline: The man who made the world laugh
...
[{'title': 'Austria to impose lockdown for unvaccinated', 'link': 'https://www.bbc.com/news/world-europe-59090662'}, {'title': 'US to ease travel rules for vaccinated visitors', 'link': 'https://www.bbc.com/news/world-us-canada-59090474'}, {'title': 'COP26: What are the sticking points?', 'link': 'https://www.bbc.com/news/science-environment-59082132'}, {'title': 'I was trafficked by my boyfriend', 'link': 'https://www.bbc.com/news/stories-59086331'}, {'title': 'The man who made the world laugh', 'link': 'https://www.bbc.com/news/entertainment-arts-59093964'}, ...]

You can modify this code to scrape any website that has links, and to extract any data you want from the web pages or websites. You can also change the headlines variable to scrape different sections or categories of the website, depending on your needs.

In the next subsection, you will learn how to scrape multiple pages and websites with APIs using BeautifulSoup4 and requests.

4.3. Scraping Pages with APIs

In this subsection, you will learn how to scrape multiple pages and websites that are connected by APIs. APIs are application programming interfaces that allow you to access data or functionality from another website or service. For example, when you visit a weather website, you will see a map that shows the current weather conditions in different locations, which are retrieved from an API.

To scrape multiple pages and websites with APIs, you need to identify the URL and the parameters of the API you want to use. Usually, the URL of the API will have a base URL and a query string that contains the parameters and values that specify the data you want to request. For example, if you want to use the OpenWeatherMap API to get the current weather data for London, the URL of the API will be:

https://api.openweathermap.org/data/2.5/weather?q=London&appid=YOUR_API_KEY

As you can see, the URL of the API has a base URL of https://api.openweathermap.org/data/2.5/weather, and a query string of ?q=London&appid=YOUR_API_KEY. The query string has two parameters: q, which specifies the city name, and appid, which specifies the API key. You can change the values of these parameters to request different data from the API.

Once you have identified the URL and the parameters of the API, you need to make an HTTP request to the URL using requests, and parse the JSON response using the json module. The JSON response is a data format that contains the data you requested from the API, which can be converted into a Python dictionary using the json.loads function. You can then extract the data you want from the dictionary, and store the data in a Python data structure. You can also use a loop or a recursion to iterate over different values of the parameters, and request more data from the API.

Here is an example of how you can scrape multiple pages and websites with APIs using requests and json:

# Import the requests and json modules
import requests
import json

# Define the base URL of the API you want to use
base_url = "https://api.openweathermap.org/data/2.5/weather"

# Define your API key
api_key = "YOUR_API_KEY"

# Define a list of cities you want to get the weather data for
cities = ["London", "Paris", "New York", "Tokyo", "Beijing"]

# Define an empty list to store the scraped data
data = []

# Use a for loop to iterate over the cities
for city in cities:
    # Define the query string of the API by adding the city name and the API key
    # (alternatively, pass params={"q": city, "appid": api_key} to requests.get,
    # which builds and URL-encodes the query string for you)
    query_string = "?q=" + city + "&appid=" + api_key
    
    # Define the URL of the API by combining the base URL and the query string
    url = base_url + query_string
    
    # Make an HTTP GET request to the URL using requests
    response = requests.get(url)
    
    # Check if the response is successful
    if response.status_code == 200:
        # Parse the JSON response using json
        # (requests also offers a shortcut for this step: json_data = response.json())
        json_data = json.loads(response.text)
        
        # Extract the data you want from the JSON data using dictionary indexing
        # For example, if you want to get the temperature, the humidity, and the description of the weather, you can use the following code:
        temperature = json_data["main"]["temp"]
        humidity = json_data["main"]["humidity"]
        description = json_data["weather"][0]["description"]
        
        # Store the data in a dictionary
        item = {"city": city, "temperature": temperature, "humidity": humidity, "description": description}
        
        # Append the dictionary to the data list
        data.append(item)
        
        # Print a message to indicate the progress
        print(f"Scraped weather data for {city}")
    else:
        # Handle the error if the response is not successful
        print("Error: Unable to access the API")

# Print the scraped data
print(data)

This code will scrape the weather data for five cities from the OpenWeatherMap API, and print the data, such as:

Scraped weather data for London
Scraped weather data for Paris
Scraped weather data for New York
Scraped weather data for Tokyo
Scraped weather data for Beijing
[{'city': 'London', 'temperature': 284.15, 'humidity': 87, 'description': 'light rain'}, {'city': 'Paris', 'temperature': 283.15, 'humidity': 93, 'description': 'overcast clouds'}, {'city': 'New York', 'temperature': 285.37, 'humidity': 82, 'description': 'mist'}, {'city': 'Tokyo', 'temperature': 293.15, 'humidity': 82, 'description': 'broken clouds'}, {'city': 'Beijing', 'temperature': 281.15, 'humidity': 37, 'description': 'clear sky'}]

You can modify this code to scrape any API that provides data or functionality you want, and to extract any data you want from the JSON response. You can also change the cities list to scrape different locations, depending on your needs. Note that the OpenWeatherMap API returns temperatures in Kelvin by default; you can add a units=metric parameter to the query string to get Celsius instead.

In the next section, we wrap up with a conclusion and some best practices and tips for your own web scraping projects.

5. Conclusion

In this blog, you have learned how to scrape multiple pages and websites with BeautifulSoup4 and requests, and how to handle pagination, links, and APIs. You have also learned the basic concepts and principles of web scraping, and some of the common challenges and limitations of web scraping.

Web scraping is a powerful technique to extract data from websites and use it for various purposes, such as data analysis, machine learning, web development, and more. However, web scraping is not as simple as it sounds. There are many factors that affect the success and efficiency of web scraping, such as the structure and design of the website, the type and amount of data you want to scrape, the speed and reliability of your internet connection, and the legal and ethical issues of web scraping.

Therefore, before you start scraping any website, you should always check the terms and conditions, the robots.txt file, and the privacy policy of the website, and respect the rights and interests of the website owners and users. You should also use the best practices and tools to scrape data safely and responsibly, and avoid any legal or ethical issues.

Some of the best practices and tools for web scraping are:

  • Use BeautifulSoup4 and requests to scrape static and simple web pages and websites.
  • Use selenium, scrapy, or other libraries to scrape dynamic or complex web pages and websites.
  • Use regular expressions, string manipulation, or JSON parsing to extract the URLs or parameters of the pages or websites.
  • Use a loop or a recursion to iterate over the pages or websites.
  • Use a condition to stop the loop or the recursion when you reach the last page or when you have enough data.
  • Use a Python data structure, such as a list, a dictionary, or a pandas dataframe, to store the scraped data.
  • Use a CSV file, a JSON file, a database, or an API to export or share the scraped data (a minimal CSV example follows below).
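For example, here is a minimal sketch of that last step, saving a list of scraped dictionaries to a CSV file with Python's built-in csv module (the data shown is a placeholder):

import csv

# Placeholder data in the shape produced by the examples in this blog
data = [
    {"title": "Web scraping - Wikipedia",
     "link": "https://en.wikipedia.org/wiki/Web_scraping"},
]

# Write the list of dictionaries to a CSV file with a header row
with open("scraped_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(data)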

By following these best practices and tools, you can scrape multiple pages and websites with ease and confidence, and use the scraped data for your own projects and applications.

We hope you have enjoyed this blog and learned something new and useful. If you have any questions, comments, or feedback, please feel free to leave them in the comment section below. We would love to hear from you and help you with your web scraping journey.

Thank you for reading and happy scraping!
