1. What is Filtering and Why is it Useful?
Filtering is a process of selecting a subset of data from a larger dataset based on some criteria. For example, you might want to filter a dataset of students by their grades, or a dataset of products by their prices. Filtering allows you to focus on the data that is relevant to your analysis, and ignore the data that is not.
In pandas, filtering is done using boolean indexing. Boolean indexing is a technique of using boolean values (True or False) to indicate which rows or columns of a dataframe meet the filtering criteria. For example, if you have a dataframe of students with a column named ‘grade’, you can use boolean indexing to filter the dataframe by the condition that the grade is greater than 80. The result will be a new dataframe that only contains the rows where the grade column has a value greater than 80.
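As a minimal sketch of the example just described, assuming a small hypothetical dataframe (the names and grades below are made up for illustration), the filter can be written in one line:

```python
import pandas as pd

# Hypothetical data, made up for illustration
df = pd.DataFrame({
    'student': ['Ann', 'Ben', 'Cara'],
    'grade': [85, 72, 91],
})

# The comparison builds a True/False series; passing it inside [] keeps the True rows
high = df[df['grade'] > 80]
print(high)
```

Here high holds only the rows for Ann and Cara, since those are the two grades above 80.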
Boolean indexing is a powerful and flexible way of filtering dataframes in pandas. You can use it to filter data by multiple conditions, combine different types of operators, and apply complex logic. You can also use it to filter both rows and columns of a dataframe, or even filter by the values of another dataframe. Boolean indexing is one of the most useful skills you can learn to manipulate data in pandas.
In this section, you will learn the basics of filtering dataframes using boolean indexing. You will learn how to create boolean masks, which are the key components of boolean indexing, and how to apply them to dataframes. You will also learn some common pitfalls and best practices of filtering data in pandas.
2. How to Create Boolean Masks for Filtering
The first step to filter dataframes using boolean indexing is to create a boolean mask. A boolean mask is a pandas series or dataframe that contains boolean values (True or False) that indicate which rows or columns of the original dataframe meet the filtering criteria. You can think of a boolean mask as a filter that you apply to the dataframe to select only the data that you want.
There are different ways to create boolean masks in pandas, depending on the type and complexity of the filtering criteria. In this section, you will learn three common methods to create boolean masks:
- Using comparison operators
- Using logical operators
- Using methods and attributes
Each method has its own advantages and disadvantages, and you can combine them to create more complex boolean masks. You will also learn some tips and tricks to make your boolean masks more efficient and readable.
2.1. Using Comparison Operators
One of the simplest ways to create boolean masks for filtering is to use comparison operators. Comparison operators are symbols that compare the values of two operands and return a boolean value. For example, the operator == checks if the operands are equal, and the operator > checks if the left operand is greater than the right operand.
In pandas, you can use comparison operators to compare a series or a dataframe with a scalar value, another series, or another dataframe. The result will be a series or a dataframe of boolean values that indicate which elements satisfy the comparison. For example, if you have a series of numbers, you can use the operator < to compare it with a scalar value and get a series of True or False values that indicate which numbers are less than the scalar value.
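A short sketch of that series-versus-scalar comparison, using an arbitrary series named values that is assumed here purely for illustration:

```python
import pandas as pd

# Arbitrary numbers, assumed for illustration
values = pd.Series([3, 8, 5, 12])

# Each element is compared with the scalar 6
below_six = values < 6
print(below_six.tolist())  # [True, False, True, False]
```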
To illustrate how to use comparison operators to create boolean masks, let's use a sample dataframe of students with their names, grades, and majors. You can create this dataframe using the following code:
    import pandas as pd

    students = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'grade': [90, 80, 70, 60, 50],
        'major': ['Math', 'CS', 'Biology', 'History', 'Art']
    })
Now, suppose you want to create a boolean mask that indicates which students have a grade greater than 75. You can use the comparison operator > to compare the grade column with the scalar value 75. The result will be a series of boolean values that correspond to each row of the dataframe. You can assign this series to a variable called mask and print it out using the following code:
    mask = students['grade'] > 75
    print(mask)
The output will look something like this:
    0     True
    1     True
    2    False
    3    False
    4    False
    Name: grade, dtype: bool
As you can see, the mask has the same index and length as the original dataframe, and each value indicates whether the corresponding student has a grade greater than 75. You can use this mask to filter the dataframe by the condition that the grade is greater than 75, which we will learn how to do in the next section.
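The properties mentioned above can be checked directly. The sketch below rebuilds the sample dataframe so that it runs on its own; the variable names same_index and n_matches are introduced here only for illustration:

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [90, 80, 70, 60, 50],
    'major': ['Math', 'CS', 'Biology', 'History', 'Art']
})

mask = students['grade'] > 75

# The mask shares the dataframe's index and has one entry per row
same_index = mask.index.equals(students.index)

# True counts as 1, so summing the mask counts the matching rows
n_matches = int(mask.sum())
print(same_index, len(mask), n_matches)
```

Summing a mask is a quick way to see how many rows a condition would keep before you apply it.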
You can also use comparison operators to compare two series or two dataframes element-wise. For example, if you have another series of numbers, you can use the operator != to compare it with the grade column and get a series of boolean values that indicate which elements are not equal. You can use the following code to create and print such a series:
    numbers = pd.Series([90, 70, 70, 60, 40])
    mask = students['grade'] != numbers
    print(mask)
The output will look something like this:
    0    False
    1     True
    2    False
    3    False
    4     True
    dtype: bool
As you can see, the mask has the same index and length as the original series, and each value indicates whether the corresponding elements of the grade column and the numbers series are not equal. You can use this mask to filter the dataframe by the condition that the grade is not equal to the number, which we will learn how to do in the next section.
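pandas also exposes the comparison operators as series methods (eq, ne, lt, le, gt, ge). As a small sketch, the element-wise != comparison above can be written with .ne(); the two forms are expected to produce the same mask:

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60, 50], name='grade')
numbers = pd.Series([90, 70, 70, 60, 40])

mask_operator = grades != numbers   # operator form
mask_method = grades.ne(numbers)    # method form of the same comparison

print(mask_method.tolist())  # [False, True, False, False, True]
```

The method form can be easier to read inside a long chain of method calls; which one to use is a matter of style.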
Comparison operators are useful for creating boolean masks based on simple conditions, such as equality, inequality, or order. However, they have some limitations when it comes to creating more complex conditions, such as combining multiple criteria or checking for membership. For example, if you want to create a boolean mask that indicates which students have a grade greater than 75 or a major in CS, you cannot use a single comparison operator to do so. You will need to use another method to create such a boolean mask, which we will learn in the next subsection.
2.2. Using Logical Operators
Another way to create boolean masks for filtering is to use logical operators. Logical operators combine two or more boolean values and return a single boolean value. In plain Python, the keyword and returns True if both operands are True, and the keyword or returns True if either operand is True. These keywords work on single boolean values only, so they cannot combine pandas series element by element.
In pandas, you combine boolean masks with the element-wise operators & (and), | (or), and ~ (not). For example, if you have two boolean masks that indicate which students have a grade greater than 75 and which students have a major in CS, you can use the operator & to combine them and get a new boolean mask that indicates which students have both a grade greater than 75 and a major in CS.
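A minimal sketch of that "both conditions" case, using pandas' element-wise & operator on the tutorial's sample data (rebuilt here so the block runs on its own; the names high_grade, is_cs, and both are introduced only for illustration):

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [90, 80, 70, 60, 50],
    'major': ['Math', 'CS', 'Biology', 'History', 'Art']
})

high_grade = students['grade'] > 75
is_cs = students['major'] == 'CS'

# & keeps a row only when both masks are True for it
both = high_grade & is_cs
print(both.tolist())  # [False, True, False, False, False]
```

Only Bob satisfies both conditions, so only the second entry is True.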
To illustrate how to use logical operators to create boolean masks, let's use the same sample dataframe of students that we used in the previous subsection. You can create this dataframe using the following code:
    import pandas as pd

    students = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'grade': [90, 80, 70, 60, 50],
        'major': ['Math', 'CS', 'Biology', 'History', 'Art']
    })
Now, suppose you want to create a boolean mask that indicates which students have a grade greater than 75 or a major in CS. You can use the comparison operator > to create a boolean mask for the grade condition, and the operator == to create a boolean mask for the major condition. Then, you can use the element-wise operator | to combine them and get a new boolean mask that satisfies either condition. You can assign this mask to a variable called mask and print it out using the following code:
    # Use | (element-wise or); Python's "or" keyword raises a ValueError on a series
    mask = (students['grade'] > 75) | (students['major'] == 'CS')
    print(mask)
The output will look something like this:
    0     True
    1     True
    2    False
    3    False
    4    False
    dtype: bool
As you can see, the mask has the same index and length as the original dataframe, and each value indicates whether the corresponding student has a grade greater than 75 or a major in CS. You can use this mask to filter the dataframe by the condition that either the grade is greater than 75 or the major is CS, which we will learn how to do in the next section.
When you combine boolean masks with & or |, you need to enclose each condition in parentheses. This is because & and | have higher precedence than comparison operators such as > and ==. Without parentheses, students['grade'] > 75 | students['major'] == 'CS' is parsed as students['grade'] > (75 | students['major']) == 'CS', which fails with an error instead of building the mask you intended. Note also that Python's and and or keywords do not work here, even with parentheses: they try to reduce each series to a single True or False, and pandas raises a ValueError saying that the truth value of a series is ambiguous.
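The ambiguous-truth-value error can be reproduced safely inside a try block. This is only a sketch of the pitfall; the variable message is introduced here to capture the error text:

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [90, 80, 70, 60, 50],
    'major': ['Math', 'CS', 'Biology', 'History', 'Art']
})

message = None
try:
    # Python's "or" asks the first series for a single True/False value
    bad = (students['grade'] > 75) or (students['major'] == 'CS')
except ValueError as err:
    message = str(err)

print(message)

# The element-wise operator works as intended
mask = (students['grade'] > 75) | (students['major'] == 'CS')
print(mask.tolist())  # [True, True, False, False, False]
```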
Logical operators are useful for creating boolean masks based on multiple conditions, such as conjunction (&), disjunction (|), or negation (~). However, they become unwieldy for more specific conditions, such as checking for membership, null values, or duplicates. For example, to create a boolean mask that indicates which students have a major in either Math or CS, you could join two equality checks with |, but the expression grows with every value you add. The next subsection covers methods that express such conditions more concisely.
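Negation has not been shown yet, so here is a brief sketch: the ~ operator flips every value in a mask. The variable name not_cs is an assumption made for this example:

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [90, 80, 70, 60, 50],
    'major': ['Math', 'CS', 'Biology', 'History', 'Art']
})

# ~ inverts the mask: True becomes False and vice versa
not_cs = ~(students['major'] == 'CS')
print(not_cs.tolist())  # [True, False, True, True, True]
```

The parentheses matter here as well, because ~ binds more tightly than ==.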
2.3. Using Methods and Attributes
A third way to create boolean masks for filtering is to use methods and attributes. Methods and attributes are functions and properties that belong to a series or a dataframe and can be accessed using the dot notation. For example, the method .sum() returns the sum of the values in a series or a dataframe, and the attribute .shape returns the dimensions of a series or a dataframe.
In pandas, you can use methods and attributes to create boolean masks based on more specific or complex conditions, such as checking for membership, null values, or duplicates. For example, if you have a series of strings, you can use the method .isin() to check if each element is in a given list of values and get a series of boolean values that indicate which elements are in the list.
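A quick sketch of .isin() on a standalone series of strings; the series fruits and its values are made up for illustration:

```python
import pandas as pd

# Made-up values for illustration
fruits = pd.Series(['apple', 'kiwi', 'pear', 'plum'])

# True where the element appears in the given list
in_list = fruits.isin(['kiwi', 'plum'])
print(in_list.tolist())  # [False, True, False, True]
```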
To illustrate how to use methods and attributes to create boolean masks, let's use the same sample dataframe of students that we used in the previous subsections. You can create this dataframe using the following code:
    import pandas as pd

    students = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'grade': [90, 80, 70, 60, 50],
        'major': ['Math', 'CS', 'Biology', 'History', 'Art']
    })
Now, suppose you want to create a boolean mask that indicates which students have a major in either Math or CS. You can use the method .isin() to check if the major column is in a list of values that contains 'Math' and 'CS'. The result will be a series of boolean values that correspond to each row of the dataframe. You can assign this series to a variable called mask and print it out using the following code:
    mask = students['major'].isin(['Math', 'CS'])
    print(mask)
The output will look something like this:
    0     True
    1     True
    2    False
    3    False
    4    False
    Name: major, dtype: bool
As you can see, the mask has the same index and length as the original dataframe, and each value indicates whether the corresponding student has a major in either Math or CS. You can use this mask to filter the dataframe by the condition that the major is in the list of values, which we will learn how to do in the next section.
You can also use methods and attributes to create boolean masks based on other conditions, such as checking for null values, duplicates, or outliers. For example, if you want to create a boolean mask that indicates which rows of the dataframe have any null values, you can use the method .isnull() to check if each element is null and then use the method .any() to check if any element in each row is null. You can use the following code to create and print such a mask:
    mask = students.isnull().any(axis=1)
    print(mask)
The output will look something like this:
    0    False
    1    False
    2    False
    3    False
    4    False
    dtype: bool
As you can see, the mask has the same index and length as the original dataframe, and each value indicates whether the corresponding row has any null values. Every value is False here because the sample dataframe has no missing data. To keep only the rows without null values, you negate this mask with the ~ operator before applying it, which we will learn how to do in the next section.
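Because the sample dataframe has no missing values, the mask above is all False. The sketch below uses a hypothetical variant of the data, named students_missing here, with two gaps so that the mask has something to flag:

```python
import pandas as pd

# Hypothetical variant of the sample data with missing entries
students_missing = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'grade': [90.0, None, 70.0],
    'major': ['Math', 'CS', None],
})

# True for rows that contain at least one null value
has_null = students_missing.isnull().any(axis=1)
print(has_null.tolist())  # [False, True, True]

# Negate the mask to keep only the complete rows
complete = students_missing[~has_null]
print(complete)
```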
Methods and attributes are useful for creating boolean masks based on more specific or complex conditions, such as checking for membership, null values, or duplicates. The methods shown so far do not cover every custom condition, such as applying your own function or a lambda expression to each element. For string conditions, pandas provides string methods through the .str accessor: for example, students['name'].str.startswith('A') creates a boolean mask that indicates which students have a name that starts with 'A'. Applying arbitrary functions element by element is beyond the scope of this tutorial.
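For string conditions such as a name starting with 'A', pandas offers vectorized string methods through the .str accessor. A minimal sketch on the sample data (the variable name starts_with_a is introduced for illustration):

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [90, 80, 70, 60, 50],
    'major': ['Math', 'CS', 'Biology', 'History', 'Art']
})

# .str.startswith checks each name and returns a boolean mask
starts_with_a = students['name'].str.startswith('A')
print(starts_with_a.tolist())  # [True, False, False, False, False]
```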
3. How to Apply Boolean Masks to DataFrames
After you create a boolean mask for filtering, the next step is to apply it to the dataframe and select the data that meets the filtering criteria. To do this, you can use the indexing operator, which is the square brackets [] that you use to access the elements of a series or a dataframe. When you pass a boolean series to the indexing operator, you get a new dataframe that only contains the rows that have a True value in the mask. The subsections below cover filtering rows, selecting columns, and doing both at once.
3.1. Filtering Rows
To filter rows, pass the boolean mask to the indexing operator, the square brackets []. The result is a new dataframe that only contains the rows that have a True value in the mask.
To illustrate how to apply boolean masks to dataframes, let's use the same sample dataframe of students and the boolean masks that we created in the previous sections. You can create this dataframe and the masks using the following code:
    import pandas as pd

    students = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'grade': [90, 80, 70, 60, 50],
        'major': ['Math', 'CS', 'Biology', 'History', 'Art']
    })

    mask1 = students['grade'] > 75
    # | is the element-wise "or"; the Python keyword "or" raises a ValueError here
    mask2 = (students['grade'] > 75) | (students['major'] == 'CS')
    mask3 = students['major'].isin(['Math', 'CS'])
    mask4 = students.isnull().any(axis=1)
Now, suppose you want to filter the dataframe by the condition that the grade is greater than 75. You can use the mask1 that we created using the comparison operator > and pass it to the indexing operator. The result will be a new dataframe that only contains the rows where the grade column has a value greater than 75. You can assign this dataframe to a variable called filtered and print it out using the following code:
    filtered = students[mask1]
    print(filtered)
The output will look something like this:
        name  grade major
    0  Alice     90  Math
    1    Bob     80    CS
As you can see, the filtered dataframe has the same columns as the original dataframe, but only two rows that satisfy the condition that the grade is greater than 75. You can use the filtered dataframe for further analysis or manipulation.
You can also use the other masks that we created using logical operators and methods to filter the dataframe by different conditions. For example, if you want to filter the dataframe by the condition that either the grade is greater than 75 or the major is CS, you can use the mask2 that we created using the element-wise operator | and pass it to the indexing operator. The result will be a new dataframe that only contains the rows that satisfy either condition. You can use the following code to create and print such a dataframe:
    filtered = students[mask2]
    print(filtered)
The output will look something like this:
        name  grade major
    0  Alice     90  Math
    1    Bob     80    CS
As you can see, the filtered dataframe has the same columns as the original dataframe, but only two rows that satisfy the condition that either the grade is greater than 75 or the major is CS.
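The membership and null-value masks from section 2.3 are applied in the same way. The sketch below rebuilds them so that it runs on its own; the result names by_major and no_nulls are introduced only for this example:

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [90, 80, 70, 60, 50],
    'major': ['Math', 'CS', 'Biology', 'History', 'Art']
})

mask3 = students['major'].isin(['Math', 'CS'])
mask4 = students.isnull().any(axis=1)

# Rows whose major is Math or CS
by_major = students[mask3]

# Rows with no null values: negate the "has a null" mask
no_nulls = students[~mask4]

print(by_major)
print(len(no_nulls))  # 5, since the sample data has no missing values
```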
Applying boolean masks to dataframes is a simple and effective way of filtering data in pandas. You can use it to filter rows by any condition that you can express as a boolean series, including conditions based on another dataframe: for example, a mask built with .isin() can keep the rows whose name appears in the name column of a second dataframe. However, passing a mask to the indexing operator selects rows only. To select specific columns, or rows and columns together, you need the techniques in the next two subsections.
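As a sketch of filtering by the values of another dataframe, assume a second, hypothetical dataframe named club that lists some names. A mask built with .isin() keeps the students whose name appears there:

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [90, 80, 70, 60, 50],
    'major': ['Math', 'CS', 'Biology', 'History', 'Art']
})

# Hypothetical second dataframe, made up for illustration
club = pd.DataFrame({'name': ['Bob', 'Eve', 'Zoe']})

# True where the student's name also appears in club['name']
in_club = students['name'].isin(club['name'])
members = students[in_club]
print(members)
```

Names that exist only in club, such as Zoe here, simply have no matching row in the result.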
3.2. Filtering Columns
Sometimes, you may want to filter the dataframe by selecting only certain columns that are relevant to your analysis. To do this, you can use the indexing operator, which is the square brackets [] that you use to access the elements of a series or a dataframe. You can pass a list of column names as an argument to the indexing operator and get a new dataframe that only contains the columns that are in the list.
To illustrate how to filter columns, let's use the same sample dataframe of students that we used in the previous sections. You can create this dataframe using the following code:
    import pandas as pd

    students = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'grade': [90, 80, 70, 60, 50],
        'major': ['Math', 'CS', 'Biology', 'History', 'Art']
    })
Now, suppose you want to filter the dataframe by selecting only the name and grade columns. You can use the indexing operator and pass a list of column names as an argument. The result will be a new dataframe that only contains the name and grade columns. You can assign this dataframe to a variable called filtered and print it out using the following code:
    filtered = students[['name', 'grade']]
    print(filtered)
The output will look something like this:
          name  grade
    0    Alice     90
    1      Bob     80
    2  Charlie     70
    3    David     60
    4      Eve     50
As you can see, the filtered dataframe has only two columns, name and grade, and the same number of rows as the original dataframe. You can use the filtered dataframe for further analysis or manipulation.
You can also use the indexing operator to filter columns by passing a single column name as an argument. However, the result will not be a dataframe, but a series. A series is a one-dimensional array of data that has an index and a name. For example, if you want to filter the dataframe by selecting only the name column, you can use the indexing operator and pass the column name as an argument. The result will be a series that contains the name column. You can assign this series to a variable called filtered and print it out using the following code:
    filtered = students['name']
    print(filtered)
The output will look something like this:
    0      Alice
    1        Bob
    2    Charlie
    3      David
    4        Eve
    Name: name, dtype: object
As you can see, the filtered series has only one column, name, and the same number of rows as the original dataframe. You can use the filtered series for further analysis or manipulation.
Filtering columns is a simple and effective way of selecting the data that is relevant to your analysis in pandas. You can use it to filter data by any column name that you specify. However, a list of names does not help when you want to choose columns by their values, types, or positions. For example, selecting only the columns that have numeric values is a job for the select_dtypes method, which is mentioned in the conclusion. The next subsection shows how to select rows and columns together.
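For the numeric-columns case mentioned above, pandas provides the select_dtypes method. A minimal sketch on the sample data (the variable name numeric is introduced for illustration):

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [90, 80, 70, 60, 50],
    'major': ['Math', 'CS', 'Biology', 'History', 'Art']
})

# Keep only the columns with a numeric dtype
numeric = students.select_dtypes(include='number')
print(list(numeric.columns))  # ['grade']
```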
3.3. Filtering Both Rows and Columns
Sometimes, you may want to filter the dataframe by selecting both certain rows and certain columns that are relevant to your analysis. To do this, you can use loc and iloc, which are specialized indexers that allow you to access the elements of a dataframe by their labels or positions. loc accepts a boolean mask as its row selector, so you can combine a mask with a list of column labels and get a new dataframe that only contains the matching rows and the chosen columns. iloc works with integer positions and does not accept a boolean series directly.
To illustrate how to filter both rows and columns, let's use the same sample dataframe of students and the boolean masks that we created in the previous sections. You can create this dataframe and the masks using the following code:
    import pandas as pd

    students = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'grade': [90, 80, 70, 60, 50],
        'major': ['Math', 'CS', 'Biology', 'History', 'Art']
    })

    mask1 = students['grade'] > 75
    # | is the element-wise "or"; the Python keyword "or" raises a ValueError here
    mask2 = (students['grade'] > 75) | (students['major'] == 'CS')
    mask3 = students['major'].isin(['Math', 'CS'])
    mask4 = students.isnull().any(axis=1)
Now, suppose you want to filter the dataframe by selecting only the name and grade columns and only the rows where the grade is greater than 75. You can use the loc indexer and pass mask1 as the row selector and a list of column names as the column selector. The result will be a new dataframe that only contains the name and grade columns and the rows where the grade column has a value greater than 75. You can assign this dataframe to a variable called filtered and print it out using the following code:
    filtered = students.loc[mask1, ['name', 'grade']]
    print(filtered)
The output will look something like this:
        name  grade
    0  Alice     90
    1    Bob     80
As you can see, the filtered dataframe has only two columns, name and grade, and only two rows that satisfy the condition that the grade is greater than 75. You can use the filtered dataframe for further analysis or manipulation.
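The row selector passed to loc can also be a combined condition written inline, and the column selector can be a single label, in which case the result is a series. The thresholds below are arbitrary choices for this sketch:

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [90, 80, 70, 60, 50],
    'major': ['Math', 'CS', 'Biology', 'History', 'Art']
})

# Rows: grade above 60 and major other than Math; column: name only
names = students.loc[
    (students['grade'] > 60) & (students['major'] != 'Math'),
    'name'
]
print(names.tolist())  # ['Bob', 'Charlie']
```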
You can also use the iloc method to filter both rows and columns by their positions. For example, if you want to filter the dataframe by selecting only the first and third columns and only the first and fourth rows, you can use the iloc method and pass a list of row positions as the first argument and a list of column positions as the second argument. The result will be a new dataframe that only contains the first and third columns and the first and fourth rows. You can use the following code to create and print such a dataframe:
    filtered = students.iloc[[0, 3], [0, 2]]
    print(filtered)
The output will look something like this:
        name    major
    0  Alice     Math
    3  David  History
As you can see, the filtered dataframe has only two columns, name and major, and only two rows, the first and the fourth. You can use the filtered dataframe for further analysis or manipulation.
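iloc also accepts slices and, for boolean selection, a plain boolean array rather than a boolean series. As a sketch under that assumption, a mask can be converted with .to_numpy() before it is passed to iloc:

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [90, 80, 70, 60, 50],
    'major': ['Math', 'CS', 'Biology', 'History', 'Art']
})

mask1 = students['grade'] > 75

# Slices select the first two rows and the first two columns by position
top_left = students.iloc[:2, :2]

# A boolean array (not a series) selects rows; [0, 1] selects columns by position
by_mask = students.iloc[mask1.to_numpy(), [0, 1]]
print(by_mask)
```

When a mask is already available, loc is usually the more direct choice; the iloc form is mainly useful when the columns are known by position only.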
Filtering both rows and columns is a more advanced and flexible way of selecting the data that is relevant to your analysis in pandas. You can use it to filter data by any combination of masks, labels, or positions that you specify. The row selector passed to loc can be any boolean mask, including one that combines conditions on several columns or one built with .isin() from the values of another dataframe. More involved operations, such as matching rows across dataframes on several columns at once, call for other tools that we will not cover in this tutorial.
4. Conclusion and Further Resources
In this tutorial, you learned how to filter pandas dataframes using boolean indexing. You learned what filtering is and why it is useful, how to create boolean masks for filtering, how to apply boolean masks to dataframes, and how to filter both rows and columns of a dataframe. You also learned some common pitfalls and best practices of filtering data in pandas.
Filtering is a powerful and flexible technique that allows you to select a subset of data from a larger dataset based on some criteria. Filtering can help you focus on the data that is relevant to your analysis, and ignore the data that is not. Filtering can also help you reduce the size and complexity of your data, and improve the performance and readability of your code.
Boolean indexing is one of the most common and useful methods of filtering dataframes in pandas. Boolean indexing uses boolean values (True or False) to indicate which rows or columns of a dataframe meet the filtering criteria. Boolean indexing can handle simple and complex conditions, and can filter both rows and columns of a dataframe. Boolean indexing is also easy to use and understand, as it follows the same logic and syntax as regular indexing.
However, boolean indexing is not the only way of filtering dataframes in pandas. There are other methods that can offer more functionality and flexibility, such as query, where, filter, and select_dtypes. These methods can handle different types of filtering criteria, such as expressions, functions, regex, or data types. They can also offer more control and customization over the filtering process, such as modifying the original dataframe, returning a copy, or preserving the index.
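The following sketch gives a brief taste of three of these methods on the sample data. It is only an illustration of the calls, not a full treatment, and the result names are introduced here for the example:

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [90, 80, 70, 60, 50],
    'major': ['Math', 'CS', 'Biology', 'History', 'Art']
})

# query: express the row condition as a string
by_query = students.query('grade > 75')

# where: keep the original shape and replace non-matching values with NaN
masked = students['grade'].where(students['grade'] > 75)

# filter: select columns by label
by_label = students.filter(items=['name', 'major'])

print(by_query)
print(masked)
print(by_label.columns.tolist())
```

Note that where does not drop rows: the three grades that fail the condition become NaN, which is why the result has the same length as the input.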
If you want to learn more about filtering dataframes in pandas, you can check out the following resources:
- Indexing and selecting data: The official pandas documentation on indexing and selecting data, including boolean indexing and other methods.
- Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects: A comprehensive guide on how to use pandas effectively and efficiently, including filtering dataframes using query and where.
- Python | Pandas dataframe.filter(): A tutorial on how to use the filter method to filter dataframes by labels, regex, or functions.
- Python | Pandas dataframe.select_dtypes(): A tutorial on how to use the select_dtypes method to filter dataframes by data types.
We hope you enjoyed this tutorial and learned something new and useful. Happy filtering!