Step 4: Indexing and slicing dataframes

This post shows you how to use pandas indexing and slicing to access and manipulate dataframes, with examples of loc, iloc, and boolean indexing.

1. Introduction

In this tutorial, you will learn how to use pandas indexing and slicing techniques to select and subset dataframes based on labels, positions, or conditions. Indexing and slicing are essential skills for data analysis, as they allow you to access and manipulate specific parts of your data.

Pandas provides several methods for indexing and slicing dataframes, such as loc, iloc, and boolean indexing. Each method has its own advantages and limitations, depending on the type and structure of your data. You will learn how to use each method and when to apply them in different scenarios.

By the end of this tutorial, you will be able to:

  • Select rows and columns from a dataframe using loc and iloc
  • Filter data based on conditions using boolean indexing
  • Combine different indexing and slicing methods to perform complex operations on dataframes

To follow along with this tutorial, you will need to have pandas installed on your machine. You can install pandas using pip or conda, as explained in this guide. You will also need to import pandas and numpy in your Python script or notebook, as shown below:

import pandas as pd
import numpy as np

Let’s get started!

2. Basic indexing and slicing

In this section, you will learn how to use basic indexing and slicing techniques to select rows and columns from a dataframe. These operations resemble the ones you can perform on lists or arrays, with one key difference: a dataframe has row labels and column labels, so you can select data by name as well as by position.

First, you need to have a dataframe to work with. You can create one from scratch, or use an existing one from a file or a website. For this tutorial, we will use a sample dataframe that contains information about the top 10 countries by population, area, and GDP in 2020. You can download the dataframe from this link, or use the code below to load it into your Python environment:

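# Load the dataframe from the URL, using the Country column as the row labels
# (the loc examples later in this tutorial select rows by country name)
url = "https://raw.githubusercontent.com/copilot-examples/pandas-tutorial/main/data/countries.csv"
df = pd.read_csv(url, index_col="Country")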

# Display the first 5 rows of the dataframe
df.head()

The output should look something like this:

Country        Population     Area       GDP
China          1,439,323,776  9,596,961  14,722,731
India          1,380,004,385  3,287,263   2,869,868
United States    331,002,651  9,833,517  20,807,269
Indonesia        273,523,615  1,904,569   1,088,768
Pakistan         220,892,340    881,912     278,222
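If the URL is unreachable, a small stand-in can be typed by hand. The sketch below is only an illustration: it copies the five rows displayed above (not the full file), and the name sample is an assumption introduced here, not part of the original dataset. It sets the country names as the row labels, which the loc examples later rely on.

```python
import pandas as pd

# Hand-typed copy of the five rows displayed above (not the full CSV file).
sample = pd.DataFrame(
    {
        "Country": ["China", "India", "United States", "Indonesia", "Pakistan"],
        "Population": [1439323776, 1380004385, 331002651, 273523615, 220892340],
        "Area": [9596961, 3287263, 9833517, 1904569, 881912],
        "GDP": [14722731, 2869868, 20807269, 1088768, 278222],
    }
).set_index("Country")  # use the country names as row labels

print(sample.shape)           # number of rows and columns
print(sample.index.tolist())  # the row labels
```

Any example in this tutorial that only touches these five countries can be tried on sample instead of df.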

Now that you have a dataframe, you can use basic indexing and slicing to access its elements. There are two ways to do this:

  • Using square brackets []
  • Using dot notation .

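Square brackets accept a column name, as in df["Population"], or a list of names, as in df[["Population", "GDP"]], and they work for any column name. Dot notation, as in df.Population, is shorter to type, but it only works when the column name is a valid Python identifier that does not clash with an existing dataframe attribute, and it cannot be used to create a new column.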
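The two access styles can be compared side by side. This is a minimal sketch on a hand-typed three-row copy of the data shown above; the name sample is an illustrative stand-in for df.

```python
import pandas as pd

# Hand-typed copy of the first three rows; `sample` stands in for df.
sample = pd.DataFrame(
    {
        "Population": [1439323776, 1380004385, 331002651],
        "Area": [9596961, 3287263, 9833517],
        "GDP": [14722731, 2869868, 20807269],
    },
    index=pd.Index(["China", "India", "United States"], name="Country"),
)

# Square brackets with one column name return a Series.
population = sample["Population"]

# Dot notation returns the same Series, but only for column names that are
# valid Python identifiers and do not clash with a DataFrame attribute.
same_population = sample.Population

# A list of names inside the brackets returns a DataFrame.
subset = sample[["Population", "GDP"]]

# A slice inside the brackets selects rows by position (end excluded).
first_two = sample[0:2]

print(subset)
print(first_two)
```

Square brackets are the safer default, since a column named, say, "count" or "GDP per capita" cannot be reached reliably with dot notation.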

3. Label-based indexing with loc

In this section, you will learn how to use loc, a pandas method for label-based indexing. Label-based indexing means that you can select rows and columns from a dataframe based on their names or labels, rather than their positions.

Label-based indexing is useful when you have meaningful labels for your data, such as country names, dates, categories, etc. It also allows you to select multiple rows or columns at once, using lists or ranges of labels.

To use loc, you need to specify the row and column labels that you want to select, separated by a comma. For example, if you want to select the row for China and the column for GDP, you can write:

# Select the row for China and the column for GDP
df.loc["China", "GDP"]

The output should be:

14722731

This means that China’s GDP in 2020 was 14,722,731 million US dollars.

You can also select multiple rows or columns by using lists or ranges of labels. For example, if you want to select the rows for India, Pakistan, and Bangladesh, and the columns for Population and Area, you can write:

# Select the rows for India, Pakistan, and Bangladesh, and the columns for Population and Area
df.loc[["India", "Pakistan", "Bangladesh"], ["Population", "Area"]]

The output should be:

Country     Population     Area
India       1,380,004,385  3,287,263
Pakistan      220,892,340    881,912
Bangladesh    164,689,383    147,570

This means that India, Pakistan, and Bangladesh had a combined population of 1,765,586,108 and a combined area of 4,316,745 square kilometers in 2020.

You can also use ranges of labels to select rows or columns. For example, if you want to select the rows from China to Indonesia, and all the columns, you can write:

# Select the rows from China to Indonesia, and all the columns
df.loc["China":"Indonesia", :]

The output should be:

Country        Population     Area       GDP
China          1,439,323,776  9,596,961  14,722,731
India          1,380,004,385  3,287,263   2,869,868
United States    331,002,651  9,833,517  20,807,269
Indonesia        273,523,615  1,904,569   1,088,768

This means that China, India, United States, and Indonesia had a combined population of 3,423,854,427, a combined area of 24,622,310 square kilometers, and a combined GDP of 39,488,636 million US dollars in 2020.
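The combined totals quoted above do not have to be added by hand. The sketch below rebuilds the four rows from the output shown (sample is an illustrative stand-in for df) and sums each column of the loc selection.

```python
import pandas as pd

# Hand-typed copy of the four rows shown above; `sample` stands in for df.
sample = pd.DataFrame(
    {
        "Population": [1439323776, 1380004385, 331002651, 273523615],
        "Area": [9596961, 3287263, 9833517, 1904569],
        "GDP": [14722731, 2869868, 20807269, 1088768],
    },
    index=pd.Index(
        ["China", "India", "United States", "Indonesia"], name="Country"
    ),
)

# Label slice (the end label is included), followed by a column-wise sum.
totals = sample.loc["China":"Indonesia", :].sum()
print(totals)
```

The three sums match the population, area, and GDP figures in the paragraph above.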

Note that when you use ranges of labels, the end label is included in the selection. This is different from position-based indexing, where the end index is excluded.

As you can see, loc is a powerful and flexible method for label-based indexing. You can use it to select any subset of data from a dataframe, based on the labels of the rows and columns.
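The inclusive end label is easy to overlook, so here is a minimal sketch that contrasts a label slice with a position slice on a hand-typed three-row copy of the data (sample is an illustrative stand-in for df). It also shows what happens, in pandas 1.0 and later, when a list passed to loc contains a label that is not in the index; "Atlantis" is a made-up label used only to trigger the error.

```python
import pandas as pd

# Hand-typed copy of the first three rows; `sample` stands in for df.
sample = pd.DataFrame(
    {
        "Population": [1439323776, 1380004385, 331002651],
        "GDP": [14722731, 2869868, 20807269],
    },
    index=pd.Index(["China", "India", "United States"], name="Country"),
)

# Label slice: the end label "United States" is included, so 3 rows.
by_label = sample.loc["China":"United States"]

# Position slice: the end position 2 is excluded, so 2 rows.
by_position = sample.iloc[0:2]

print(len(by_label), len(by_position))

# A list containing a label that is not in the index raises KeyError.
try:
    sample.loc[["India", "Atlantis"]]
    missing_label_error = False
except KeyError:
    missing_label_error = True

print(missing_label_error)
```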

4. Position-based indexing with iloc

In this section, you will learn how to use iloc, a pandas method for position-based indexing. Position-based indexing means that you can select rows and columns from a dataframe based on their positions or indices, rather than their labels.

Position-based indexing is useful when you don’t have meaningful labels for your data, or when you want to select data based on a specific order or pattern. It also allows you to select multiple rows or columns at once, using lists or ranges of indices.

To use iloc, you need to specify the row and column indices that you want to select, separated by a comma. For example, if you want to select the first row and the last column of the dataframe, you can write:

# Select the first row and the last column of the dataframe
df.iloc[0, -1]

The output should be:

14722731

This means that the GDP of the first country in the dataframe (China) was 14,722,731 million US dollars in 2020.

You can also select multiple rows or columns by using lists or ranges of indices. For example, if you want to select the first three rows and the first two columns of the dataframe, you can write:

# Select the first three rows and the first two columns of the dataframe
df.iloc[0:3, 0:2]

The output should be:

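Country        Population     Area
China          1,439,323,776  9,596,961
India          1,380,004,385  3,287,263
United States    331,002,651  9,833,517

Because the country names are the row labels rather than a column, the first two columns are Population and Area. The first three countries in the dataframe (China, India, and United States) had a combined population of 3,150,330,812 and a combined area of 22,717,741 square kilometers in 2020.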

You can also use negative indices to select rows or columns from the end of the dataframe. For example, if you want to select the last two rows and the last two columns of the dataframe, you can write:

# Select the last two rows and the last two columns of the dataframe
df.iloc[-2:, -2:]

The output should be:

Country  Area        GDP
Russia   17,098,242  1,464,078
Mexico    1,964,375  1,158,210

This means that the last two countries in the dataframe (Russia and Mexico) had a combined area of 19,062,617 square kilometers and a combined GDP of 2,622,288 million US dollars in 2020.

Note that when you use ranges of indices, the end index is excluded from the selection. This is different from label-based indexing, where the end label is included.

As you can see, iloc is a powerful and flexible method for position-based indexing. You can use it to select any subset of data from a dataframe, based on the indices of the rows and columns.
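The examples above use slices, but iloc also accepts lists of positions and slices with a step. The sketch below is a small illustration on a hand-typed copy of the first five rows (sample is an illustrative stand-in for df).

```python
import pandas as pd

# Hand-typed copy of the first five rows; `sample` stands in for df.
sample = pd.DataFrame(
    {
        "Population": [1439323776, 1380004385, 331002651, 273523615, 220892340],
        "Area": [9596961, 3287263, 9833517, 1904569, 881912],
        "GDP": [14722731, 2869868, 20807269, 1088768, 278222],
    },
    index=pd.Index(
        ["China", "India", "United States", "Indonesia", "Pakistan"],
        name="Country",
    ),
)

# Lists of positions: rows 0 and 2, columns 0 and 2.
picked = sample.iloc[[0, 2], [0, 2]]

# A slice with a step: every second row, all columns.
every_other = sample.iloc[::2]

# A single negative position: the last row, returned as a Series.
last_row = sample.iloc[-1]

print(picked)
print(every_other.index.tolist())
print(last_row.name)
```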

5. Boolean indexing with conditions

In this section, you will learn how to use boolean indexing, a pandas technique for selecting data based on conditions. Boolean indexing means that you can filter data from a dataframe based on logical expressions that evaluate to True or False.

Boolean indexing is useful when you want to select data that satisfy certain criteria, such as values that are greater than, equal to, or less than a given number, or values that belong to a certain category, or values that match a certain pattern, etc.

To use boolean indexing, you need to create a boolean array that represents the condition that you want to apply to the dataframe. For example, if you want to select the countries that have a population greater than 500 million, you can write:

# Create a boolean array that represents the condition
condition = df["Population"] > 500000000

# Display the boolean array
condition

The output should be:

Country
China           True
India           True
United States  False
Indonesia      False
Pakistan       False
Brazil         False
Nigeria        False
Bangladesh     False
Russia         False
Mexico         False
Name: Population, dtype: bool

This means that only China and India satisfy the condition of having a population greater than 500 million.

Once you have the boolean array, you can use it to select the rows from the dataframe that correspond to the True values. For example, if you want to select the rows for China and India, and all the columns, you can write:

# Select the rows that correspond to the True values, and all the columns
df[condition]

The output should be:

Country  Population     Area       GDP
China    1,439,323,776  9,596,961  14,722,731
India    1,380,004,385  3,287,263   2,869,868

This means that China and India had a combined population of 2,819,328,161, a combined area of 12,884,224 square kilometers, and a combined GDP of 17,592,599 million US dollars in 2020.

You can also combine multiple conditions using logical operators, such as & (and), | (or), and ~ (not). Wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and <. For example, if you want to select the countries that have a GDP greater than 10,000,000 million US dollars or an area less than 1,000,000 square kilometers, you can write:

# Create a boolean array that represents the combined condition
condition = (df["GDP"] > 10000000) | (df["Area"] < 1000000)

# Select the rows that correspond to the True values, and all the columns
df[condition]

For the countries whose figures appear in this tutorial, the output should be:

Country        Population     Area       GDP
China          1,439,323,776  9,596,961  14,722,731
United States    331,002,651  9,833,517  20,807,269
Pakistan         220,892,340    881,912     278,222
Bangladesh       164,689,383    147,570     302,571

China and the United States pass the GDP condition, while Pakistan and Bangladesh pass the area condition. India, Indonesia, and Mexico meet neither condition, so they are filtered out. Any other row in the file with an area below 1,000,000 square kilometers would also be included.

As you can see, boolean indexing is a powerful and flexible technique for selecting data based on conditions. You can use it to filter data from a dataframe based on any logical expression that you can think of.
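A boolean condition can also be combined with loc, which lets you filter rows and pick columns in one step. The sketch below is a minimal illustration on a hand-typed copy of the first five rows (sample and the variable names are assumptions introduced here, standing in for df).

```python
import pandas as pd

# Hand-typed copy of the first five rows; `sample` stands in for df.
sample = pd.DataFrame(
    {
        "Population": [1439323776, 1380004385, 331002651, 273523615, 220892340],
        "Area": [9596961, 3287263, 9833517, 1904569, 881912],
        "GDP": [14722731, 2869868, 20807269, 1088768, 278222],
    },
    index=pd.Index(
        ["China", "India", "United States", "Indonesia", "Pakistan"],
        name="Country",
    ),
)

# One True/False value per row.
big = sample["Population"] > 500_000_000

# Rows where the condition is True, restricted to two columns.
big_subset = sample.loc[big, ["Population", "GDP"]]

# ~ negates a condition; each comparison keeps its own parentheses.
small_not_big = sample.loc[~big & (sample["Area"] < 1_000_000)]

print(big_subset)
print(small_not_big.index.tolist())
```

Passing the condition to loc rather than to plain square brackets makes the column selection explicit and avoids chaining two separate indexing steps.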

6. Summary and exercises

In this tutorial, you learned how to use pandas indexing and slicing techniques to select and subset dataframes based on labels, positions, or conditions. You learned how to use three methods for indexing and slicing dataframes:

  • loc, for label-based indexing
  • iloc, for position-based indexing
  • boolean indexing, for condition-based indexing

You also learned how to use lists, ranges, and logical operators to select multiple rows or columns at once, and how to combine different indexing and slicing methods to perform complex operations on dataframes.

Indexing and slicing are essential skills for data analysis, as they allow you to access and manipulate specific parts of your data. You can use them to explore, filter, transform, and aggregate your data, depending on your needs and goals.

To practice what you learned in this tutorial, you can try the following exercises:

  1. Create a new dataframe from the countries.csv file, but with the columns in a different order: GDP, Population, Area, Country.
  2. Select the rows for the countries that have an area greater than 5,000,000 square kilometers, and the columns for Country and Area.
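  3. Select the rows for the countries that have a GDP per capita less than 2,000 US dollars, and the columns for Country, GDP, and Population. Because GDP is stored in millions of US dollars, compute GDP per capita as GDP multiplied by 1,000,000 and divided by Population.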
  4. Select the rows for the countries that have a name that starts with the letter B, and the columns for Country and GDP.
  5. Select the rows for the countries that have a population density (Population divided by Area) between 100 and 200 people per square kilometer, and the columns for Country, Population, Area, and Population Density.

You can find the solutions to these exercises in this notebook.

Congratulations, you have completed Step 4 of the pandas tutorial! You can move on to Step 5, where you will learn how to use pandas groupby and aggregate functions to perform data analysis.
