1. Introduction
In this tutorial, you will learn how to use pandas indexing and slicing techniques to select and subset dataframes based on labels, positions, or conditions. Indexing and slicing are essential skills for data analysis, as they allow you to access and manipulate specific parts of your data.
Pandas provides several methods for indexing and slicing dataframes, such as loc, iloc, and boolean indexing. Each method has its own advantages and limitations, depending on the type and structure of your data. You will learn how to use each method and when to apply them in different scenarios.
By the end of this tutorial, you will be able to:
- Select rows and columns from a dataframe using loc and iloc
- Filter data based on conditions using boolean indexing
- Combine different indexing and slicing methods to perform complex operations on dataframes
To follow along with this tutorial, you will need to have pandas installed on your machine. You can install pandas using pip or conda, as explained in this guide. You will also need to import pandas and numpy in your Python script or notebook, as shown below:
    import pandas as pd
    import numpy as np
Let’s get started!
2. Basic indexing and slicing
In this section, you will learn how to use basic indexing and slicing techniques to select rows and columns from a dataframe. Indexing and slicing are similar to the operations you can perform on lists or arrays, but with some differences.
First, you need to have a dataframe to work with. You can create one from scratch, or use an existing one from a file or a website. For this tutorial, we will use a sample dataframe that contains information about the top 10 countries by population, area, and GDP in 2020. You can download the dataframe from this link, or use the code below to load it into your Python environment:
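    # Load the dataframe from the URL
    # index_col="Country" makes the country names the row labels,
    # which the label-based examples later in this tutorial rely on
    url = "https://raw.githubusercontent.com/copilot-examples/pandas-tutorial/main/data/countries.csv"
    df = pd.read_csv(url, index_col="Country")
    # Display the first 5 rows of the dataframe
    df.head()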
The output should look something like this (Area is in square kilometers and GDP is in millions of US dollars):
Country | Population | Area | GDP |
---|---|---|---|
China | 1,439,323,776 | 9,596,961 | 14,722,731 |
India | 1,380,004,385 | 3,287,263 | 2,869,868 |
United States | 331,002,651 | 9,833,517 | 20,807,269 |
Indonesia | 273,523,615 | 1,904,569 | 1,088,768 |
Pakistan | 220,892,340 | 881,912 | 278,222 |
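If the URL is not reachable, a small stand-in dataframe can be built by hand. The sketch below is only an assumption about how the data is laid out: it copies the five rows from the table above, uses Country as the row index, and the name df_local is hypothetical. It is a subset of the ten-country file, not a replacement for it.

```python
import pandas as pd

# Rows copied from the table above; this is only a subset of the full file.
data = {
    "Country": ["China", "India", "United States", "Indonesia", "Pakistan"],
    "Population": [1_439_323_776, 1_380_004_385, 331_002_651, 273_523_615, 220_892_340],
    "Area": [9_596_961, 3_287_263, 9_833_517, 1_904_569, 881_912],   # square kilometers
    "GDP": [14_722_731, 2_869_868, 20_807_269, 1_088_768, 278_222],  # million US dollars
}

# Use the country names as row labels (hypothetical variable name).
df_local = pd.DataFrame(data).set_index("Country")

print(df_local.shape)          # (5, 3)
print(list(df_local.columns))  # ['Population', 'Area', 'GDP']
```

Because Country becomes the index, it no longer counts as one of the columns.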
Now that you have a dataframe, you can use basic indexing and slicing to access its elements. There are two ways to do this:
- Using square brackets []
- Using dot notation .
Let’s see how each method works and what the advantages and limitations of each one are.
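As a minimal sketch of the two approaches, the example below uses a small hand-built dataframe (df_small is a hypothetical name, with values taken from the table above) rather than the full file.

```python
import pandas as pd

df_small = pd.DataFrame(
    {
        "Population": [1_439_323_776, 1_380_004_385, 331_002_651],
        "GDP": [14_722_731, 2_869_868, 20_807_269],
    },
    index=pd.Index(["China", "India", "United States"], name="Country"),
)

# Square brackets with a single column name return a Series.
population = df_small["Population"]

# Square brackets with a list of names return a DataFrame.
subset = df_small[["Population", "GDP"]]

# Dot notation is shorthand for a single column. It only works when the name
# is a valid Python identifier and does not clash with a DataFrame attribute.
gdp = df_small.GDP

# Square brackets with a slice select rows by position (end excluded).
first_two = df_small[0:2]

print(type(population).__name__)  # Series
print(type(subset).__name__)      # DataFrame
print(list(first_two.index))      # ['China', 'India']
```

Square brackets are the more general choice: they handle column names with spaces and lists of columns, while dot notation is limited to one simply named column.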
3. Label-based indexing with loc
In this section, you will learn how to use loc, a pandas method for label-based indexing. Label-based indexing means that you can select rows and columns from a dataframe based on their names or labels, rather than their positions.
Label-based indexing is useful when you have meaningful labels for your data, such as country names, dates, categories, etc. It also allows you to select multiple rows or columns at once, using lists or ranges of labels.
To use loc, you need to specify the row and column labels that you want to select inside square brackets, separated by a comma. For example, if you want to select the row for China and the column for GDP, you can write:
    # Select the row for China and the column for GDP
    df.loc["China", "GDP"]
The output should be:
14722731
This means that China’s GDP in 2020 was 14,722,731 million US dollars.
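The same syntax can return a whole row or a whole column instead of a single value. The sketch below assumes a two-row dataframe with Country as the index; df_rc is a hypothetical name.

```python
import pandas as pd

df_rc = pd.DataFrame(
    {
        "Population": [1_439_323_776, 1_380_004_385],
        "Area": [9_596_961, 3_287_263],
        "GDP": [14_722_731, 2_869_868],
    },
    index=pd.Index(["China", "India"], name="Country"),
)

# A single row label returns that row as a Series indexed by column name.
china_row = df_rc.loc["China"]

# A colon in the row position means "all rows"; this returns the GDP column.
gdp_column = df_rc.loc[:, "GDP"]

print(china_row["GDP"])     # 14722731
print(gdp_column["India"])  # 2869868
```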
You can also select multiple rows or columns by using lists or ranges of labels. For example, if you want to select the rows for India, Pakistan, and Bangladesh, and the columns for Population and Area, you can write:
    # Select the rows for India, Pakistan, and Bangladesh, and the columns for Population and Area
    df.loc[["India", "Pakistan", "Bangladesh"], ["Population", "Area"]]
The output should be:
Country | Population | Area |
---|---|---|
India | 1,380,004,385 | 3,287,263 |
Pakistan | 220,892,340 | 881,912 |
Bangladesh | 164,689,383 | 147,570 |
This means that India, Pakistan, and Bangladesh had a combined population of 1,765,586,108 and a combined area of 4,316,745 square kilometers in 2020.
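The combined totals do not have to be added by hand. As a sketch, calling sum() on the selection adds up each column; the dataframe below is rebuilt from the three rows shown above, and df_sum is a hypothetical name.

```python
import pandas as pd

df_sum = pd.DataFrame(
    {
        "Population": [1_380_004_385, 220_892_340, 164_689_383],
        "Area": [3_287_263, 881_912, 147_570],
    },
    index=pd.Index(["India", "Pakistan", "Bangladesh"], name="Country"),
)

# Select the rows and columns by label, then sum each column.
totals = df_sum.loc[["India", "Pakistan", "Bangladesh"], ["Population", "Area"]].sum()

print(totals["Population"])  # 1765586108
print(totals["Area"])        # 4316745
```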
You can also use ranges of labels to select rows or columns. For example, if you want to select the rows from China to Indonesia, and all the columns, you can write:
    # Select the rows from China to Indonesia, and all the columns
    df.loc["China":"Indonesia", :]
The output should be:
Country | Population | Area | GDP |
---|---|---|---|
China | 1,439,323,776 | 9,596,961 | 14,722,731 |
India | 1,380,004,385 | 3,287,263 | 2,869,868 |
United States | 331,002,651 | 9,833,517 | 20,807,269 |
Indonesia | 273,523,615 | 1,904,569 | 1,088,768 |
This means that China, India, the United States, and Indonesia had a combined population of 3,423,854,427, a combined area of 24,622,310 square kilometers, and a combined GDP of 39,488,636 million US dollars in 2020.
Note that when you use ranges of labels, the end label is included in the selection. This is different from position-based indexing, where the end index is excluded.
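The difference is easy to check on a small example. The sketch below compares a label slice with an ordinary Python list slice; df_slice and countries are hypothetical names.

```python
import pandas as pd

countries = ["China", "India", "United States", "Indonesia", "Pakistan"]
df_slice = pd.DataFrame(
    {"GDP": [14_722_731, 2_869_868, 20_807_269, 1_088_768, 278_222]},
    index=pd.Index(countries, name="Country"),
)

# Label slice: both "China" and "Indonesia" are included.
by_label = df_slice.loc["China":"Indonesia"]

# Plain Python slice for comparison: the end position (3) is excluded.
by_list_slice = countries[0:3]

print(list(by_label.index))  # ['China', 'India', 'United States', 'Indonesia']
print(by_list_slice)         # ['China', 'India', 'United States']
```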
As you can see, loc is a powerful and flexible method for label-based indexing. You can use it to select any subset of data from a dataframe, based on the labels of the rows and columns.
4. Position-based indexing with iloc
In this section, you will learn how to use iloc, a pandas method for position-based indexing. Position-based indexing means that you can select rows and columns from a dataframe based on their positions or indices, rather than their labels.
Position-based indexing is useful when you don’t have meaningful labels for your data, or when you want to select data based on a specific order or pattern. It also allows you to select multiple rows or columns at once, using lists or ranges of indices.
To use iloc, you need to specify the row and column indices that you want to select inside square brackets, separated by a comma. For example, if you want to select the first row and the last column of the dataframe, you can write:
    # Select the first row and the last column of the dataframe
    df.iloc[0, -1]
The output should be:
14722731
This means that the GDP of the first country in the dataframe (China) was 14,722,731 million US dollars in 2020.
You can also select multiple rows or columns by using lists or ranges of indices. For example, if you want to select the first three rows and the first two columns of the dataframe, you can write:
    # Select the first three rows and the first two columns of the dataframe
    df.iloc[0:3, 0:2]
The output should be:
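Country | Population | Area |
---|---|---|
China | 1,439,323,776 | 9,596,961 |
India | 1,380,004,385 | 3,287,263 |
United States | 331,002,651 | 9,833,517 |
Because Country is the row index, it is not counted as a column, so the first two columns are Population and Area. The first three countries in the dataframe (China, India, and the United States) had a combined population of 3,150,330,812 and a combined area of 22,717,741 square kilometers in 2020.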
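Lists of positions work as well as ranges, which helps when the rows or columns you want are not adjacent. A minimal sketch, assuming a three-row dataframe with Country as the index (df_pos is a hypothetical name):

```python
import pandas as pd

df_pos = pd.DataFrame(
    {
        "Population": [1_439_323_776, 1_380_004_385, 331_002_651],
        "Area": [9_596_961, 3_287_263, 9_833_517],
        "GDP": [14_722_731, 2_869_868, 20_807_269],
    },
    index=pd.Index(["China", "India", "United States"], name="Country"),
)

# Rows at positions 0 and 2, columns at positions 0 and 2.
picked = df_pos.iloc[[0, 2], [0, 2]]

print(list(picked.index))    # ['China', 'United States']
print(list(picked.columns))  # ['Population', 'GDP']
```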
You can also use negative indices to select rows or columns from the end of the dataframe. For example, if you want to select the last two rows and the last two columns of the dataframe, you can write:
    # Select the last two rows and the last two columns of the dataframe
    df.iloc[-2:, -2:]
The output should be:
Country | Area | GDP |
---|---|---|
Russia | 17,098,242 | 1,464,078 |
Mexico | 1,964,375 | 1,158,210 |
This means that the last two countries in the dataframe (Russia and Mexico) had a combined area of 19,062,617 square kilometers and a combined GDP of 2,622,288 million US dollars in 2020.
Note that when you use ranges of indices, the end index is excluded from the selection. This is different from label-based indexing, where the end label is included.
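The sketch below illustrates both points, negative positions and the excluded end index, on a five-row dataframe; df_tail is a hypothetical name.

```python
import pandas as pd

df_tail = pd.DataFrame(
    {
        "Population": [1_439_323_776, 1_380_004_385, 331_002_651, 273_523_615, 220_892_340],
        "Area": [9_596_961, 3_287_263, 9_833_517, 1_904_569, 881_912],
        "GDP": [14_722_731, 2_869_868, 20_807_269, 1_088_768, 278_222],
    },
    index=pd.Index(
        ["China", "India", "United States", "Indonesia", "Pakistan"], name="Country"
    ),
)

# -1 is the last row; it comes back as a Series named after its label.
last_row = df_tail.iloc[-1]

# -2: means "from the second-to-last position to the end".
last_two = df_tail.iloc[-2:, -2:]

# Positions 1 and 2 only: the end position 3 is excluded.
middle = df_tail.iloc[1:3]

print(last_row.name)           # Pakistan
print(list(last_two.index))    # ['Indonesia', 'Pakistan']
print(list(last_two.columns))  # ['Area', 'GDP']
print(list(middle.index))      # ['India', 'United States']
```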
As you can see, iloc is a powerful and flexible method for position-based indexing. You can use it to select any subset of data from a dataframe, based on the indices of the rows and columns.
5. Boolean indexing with conditions
In this section, you will learn how to use boolean indexing, a pandas technique for selecting data based on conditions. Boolean indexing means that you can filter data from a dataframe based on logical expressions that evaluate to True or False.
Boolean indexing is useful when you want to select data that satisfy certain criteria, such as values that are greater than, equal to, or less than a given number, or values that belong to a certain category, or values that match a certain pattern, etc.
To use boolean indexing, you need to create a boolean Series (one True or False value per row) that represents the condition that you want to apply to the dataframe. For example, if you want to select the countries that have a population greater than 500 million, you can write:
    # Create a boolean Series that represents the condition
    condition = df["Population"] > 500000000
    # Display the boolean Series
    condition
The output should be:
    Country
    China             True
    India             True
    United States    False
    Indonesia        False
    Pakistan         False
    Brazil           False
    Nigeria          False
    Bangladesh       False
    Russia           False
    Mexico           False
    Name: Population, dtype: bool
This means that only China and India satisfy the condition of having a population greater than 500 million.
Once you have the boolean array, you can use it to select the rows from the dataframe that correspond to the True values. For example, if you want to select the rows for China and India, and all the columns, you can write:
    # Select the rows that correspond to the True values, and all the columns
    df[condition]
The output should be:
Country | Population | Area | GDP |
---|---|---|---|
China | 1,439,323,776 | 9,596,961 | 14,722,731 |
India | 1,380,004,385 | 3,287,263 | 2,869,868 |
This means that China and India had a combined population of 2,819,328,161, a combined area of 12,884,224 square kilometers, and a combined GDP of 17,592,599 million US dollars in 2020.
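Conditions are not limited to numeric comparisons. As a sketch of the category and pattern criteria mentioned earlier, the example below tests the country names in the index with isin and the str accessor; df_cond is a hypothetical name and the dataframe is a hand-built subset.

```python
import pandas as pd

df_cond = pd.DataFrame(
    {"Population": [1_439_323_776, 1_380_004_385, 220_892_340, 164_689_383]},
    index=pd.Index(["China", "India", "Pakistan", "Bangladesh"], name="Country"),
)

# Category-style condition: is the row label one of the listed countries?
in_group = df_cond.index.isin(["India", "Pakistan"])

# Pattern-style condition: does the row label end with the letter "a"?
ends_with_a = df_cond.index.str.endswith("a")

print(list(df_cond[in_group].index))     # ['India', 'Pakistan']
print(list(df_cond[ends_with_a].index))  # ['China', 'India']
```

If Country were an ordinary column instead of the index, the same calls would be made on df_cond["Country"].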
You can also combine multiple conditions using logical operators, such as & (and), | (or), and ~ (not). For example, if you want to select the countries that have a GDP greater than 10,000,000 million US dollars or an area less than 1,000,000 square kilometers, you can write:
    # Create a boolean array that represents the combined condition
    condition = (df["GDP"] > 10000000) | (df["Area"] < 1000000)
    # Select the rows that correspond to the True values, and all the columns
    df[condition]
Among the countries whose values appear in this tutorial, the following rows satisfy the condition:
Country | Population | Area | GDP |
---|---|---|---|
China | 1,439,323,776 | 9,596,961 | 14,722,731 |
United States | 331,002,651 | 9,833,517 | 20,807,269 |
Pakistan | 220,892,340 | 881,912 | 278,222 |
Bangladesh | 164,689,383 | 147,570 | 302,571 |
China and the United States satisfy the GDP condition, while Pakistan and Bangladesh satisfy the area condition. India, Indonesia, and Mexico satisfy neither condition, so they are excluded. Any other country in the dataset with an area below 1,000,000 square kilometers would also appear in the output.
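The & and ~ operators follow the same pattern as |. The sketch below uses a five-row subset (df_logic is a hypothetical name) and different thresholds from the example above. Each comparison needs its own parentheses because &, |, and ~ bind more tightly than > and <.

```python
import pandas as pd

df_logic = pd.DataFrame(
    {
        "Population": [1_439_323_776, 1_380_004_385, 331_002_651, 273_523_615, 220_892_340],
        "Area": [9_596_961, 3_287_263, 9_833_517, 1_904_569, 881_912],
        "GDP": [14_722_731, 2_869_868, 20_807_269, 1_088_768, 278_222],
    },
    index=pd.Index(
        ["China", "India", "United States", "Indonesia", "Pakistan"], name="Country"
    ),
)

# AND: both conditions must hold for a row to be kept.
both = df_logic[(df_logic["GDP"] > 1_000_000) & (df_logic["Area"] < 5_000_000)]

# NOT: invert a condition to keep the rows where it is False.
not_large = df_logic[~(df_logic["Population"] > 500_000_000)]

print(list(both.index))       # ['India', 'Indonesia']
print(list(not_large.index))  # ['United States', 'Indonesia', 'Pakistan']
```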
As you can see, boolean indexing is a powerful and flexible technique for selecting data based on conditions. You can use it to filter data from a dataframe based on any logical expression that you can think of.
6. Summary and exercises
In this tutorial, you learned how to use pandas indexing and slicing techniques to select and subset dataframes based on labels, positions, or conditions. You learned how to use three methods for indexing and slicing dataframes:
- loc, for label-based indexing
- iloc, for position-based indexing
- boolean indexing, for condition-based indexing
You also learned how to use lists, ranges, and logical operators to select multiple rows or columns at once, and how to combine different indexing and slicing methods to perform complex operations on dataframes.
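One way to combine the methods, shown here as a sketch, is to pass a boolean condition to loc together with a list of column labels, or to chain iloc after a boolean filter. The names df_combo and mask are hypothetical, and the data is a hand-built subset.

```python
import pandas as pd

df_combo = pd.DataFrame(
    {
        "Population": [1_439_323_776, 1_380_004_385, 331_002_651, 273_523_615],
        "GDP": [14_722_731, 2_869_868, 20_807_269, 1_088_768],
    },
    index=pd.Index(["China", "India", "United States", "Indonesia"], name="Country"),
)

mask = df_combo["Population"] > 500_000_000

# Boolean condition for the rows, label list for the columns.
big_gdp = df_combo.loc[mask, ["GDP"]]

# Filter first, then take the first remaining row by position.
first_smaller = df_combo[~mask].iloc[0]

print(list(big_gdp.index))   # ['China', 'India']
print(big_gdp["GDP"].sum())  # 17592599
print(first_smaller.name)    # United States
```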
Indexing and slicing are essential skills for data analysis, as they allow you to access and manipulate specific parts of your data. You can use them to explore, filter, transform, and aggregate your data, depending on your needs and goals.
To practice what you learned in this tutorial, you can try the following exercises:
- Create a new dataframe from the countries.csv file, but with the columns in a different order: GDP, Population, Area, Country.
- Select the rows for the countries that have an area greater than 5,000,000 square kilometers, and the columns for Country and Area.
- Select the rows for the countries that have a GDP per capita (GDP divided by Population) less than 2,000 US dollars, and the columns for Country, GDP, and Population.
- Select the rows for the countries that have a name that starts with the letter B, and the columns for Country and GDP.
- Select the rows for the countries that have a population density (Population divided by Area) between 100 and 200 people per square kilometer, and the columns for Country, Population, Area, and Population Density.
You can find the solutions to these exercises in this notebook.
Congratulations, you have completed Step 4 of the pandas tutorial! You can move on to Step 5, where you will learn how to use pandas groupby and aggregate functions to perform data analysis.