1. Introduction
In this tutorial, you will learn how to use the filter method to perform pandas dataframe filtering based on column or row labels. Filtering is a common operation in data analysis, as it allows you to select a subset of data that meets certain criteria. For example, you may want to filter a dataframe by the names of the columns, or by the values of the index.
The filter method is a convenient way to select columns or rows by label. It accepts three mutually exclusive arguments: a list of exact labels (items), a substring (like), or a regular expression pattern (regex). It does not accept a callable function; Section 4.2 covers how to apply a custom function to labels.
By the end of this tutorial, you will be able to:
- Use the filter method to filter dataframes with column labels
- Use the filter method to filter dataframes with row labels
- Use different arguments to specify the filtering criteria
- Understand the advantages and limitations of the filter method
To follow along with this tutorial, you will need to have Python installed on your computer, as well as the Pandas library, which is a popular Python library for data analysis. You can use any code editor or IDE of your choice, or you can run the code examples in an interactive Python shell or a Jupyter notebook.
Are you ready to learn how to use the filter method for pandas dataframe filtering? Let’s get started!
2. Creating a Pandas DataFrame
Before you can use the filter method to perform pandas dataframe filtering, you need to have a dataframe to work with. A dataframe is a two-dimensional data structure that consists of rows and columns, similar to a spreadsheet or a table. You can create a dataframe from various sources, such as CSV files, Excel files, SQL databases, Python dictionaries, or lists.
In this tutorial, you will create a simple dataframe from a Python dictionary. The dictionary contains some information about four countries: their names, populations, areas, and GDPs. You will use the pandas.DataFrame constructor to convert the dictionary into a dataframe. You will also specify the column and row labels for the dataframe.
To create a dataframe from a dictionary, you need to import the pandas library and pass the dictionary as the data argument to the pandas.DataFrame constructor. You can also pass a list of column labels as the columns argument, and a list of row labels as the index argument. Here is an example of how to create a dataframe from a dictionary:
    # Import pandas library
    import pandas as pd

    # Create a dictionary of data
    data = {
        "name": ["China", "India", "USA", "Brazil"],
        "population": [1441, 1380, 331, 213],
        "area": [9.6, 3.3, 9.8, 8.5],
        "gdp": [15.4, 2.9, 21.4, 1.8]
    }

    # Create a dataframe from the dictionary
    df = pd.DataFrame(data=data, columns=["name", "population", "area", "gdp"], index=["CHN", "IND", "USA", "BRA"])

    # Print the dataframe
    print(df)
The output of the code is:
           name  population  area   gdp
    CHN   China        1441   9.6  15.4
    IND   India        1380   3.3   2.9
    USA     USA         331   9.8  21.4
    BRA  Brazil         213   8.5   1.8
As you can see, the dataframe has four columns and four rows, with the column and row labels that you specified. The column labels are the names of the columns, and the row labels are the values of the index. You can access the column labels with the df.columns attribute, and the row labels with the df.index attribute.
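As a quick check, the two attributes can be inspected directly. This is a minimal sketch that rebuilds the same dataframe; the variable names column_labels and row_labels are only illustrative.

```python
import pandas as pd

data = {
    "name": ["China", "India", "USA", "Brazil"],
    "population": [1441, 1380, 331, 213],
    "area": [9.6, 3.3, 9.8, 8.5],
    "gdp": [15.4, 2.9, 21.4, 1.8],
}
df = pd.DataFrame(data, index=["CHN", "IND", "USA", "BRA"])

# Both attributes are Index objects; convert them to lists for display
column_labels = list(df.columns)
row_labels = list(df.index)

print(column_labels)  # ['name', 'population', 'area', 'gdp']
print(row_labels)     # ['CHN', 'IND', 'USA', 'BRA']
```

These are the labels that the filter method matches against in the rest of the tutorial.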
Now that you have created a dataframe, you can use the filter method to filter it based on the column or row labels. How do you do that? You will learn in the next section.
3. Filtering with Column Labels
One of the ways to perform pandas dataframe filtering with the filter method is to filter the dataframe based on the column labels. This means that you can select a subset of columns from the dataframe that match a certain criterion. For example, you may want to filter the dataframe by the column names, or by a part of the column names.
The filter method has three parameters that you can use to filter the dataframe with column labels: items, like, and regex. Each of these parameters accepts a different type of argument that specifies the filtering criterion. The three parameters are mutually exclusive: if you pass more than one of them in the same call, pandas raises a TypeError.
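The restriction to a single criterion can be seen with a short experiment. The sketch below uses a small stand-in dataframe (not the tutorial's full one) and simply records whether the call fails; the flag name raised is illustrative.

```python
import pandas as pd

# A small stand-in dataframe, just for this check
df = pd.DataFrame({"name": ["China"], "gdp": [15.4]}, index=["CHN"])

try:
    # Passing two criteria at once is not allowed
    df.filter(items=["name"], like="g")
    raised = False
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")
```

If you need to combine criteria, one option is to chain two filter calls, or to express both conditions in a single regex pattern.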
The items parameter accepts a list of column labels that you want to keep in the dataframe. The filter method will return a new dataframe that contains only the columns that are in the list. For example, if you want to filter the dataframe by the columns "name" and "gdp", you can use the following code:
    # Filter the dataframe by the columns "name" and "gdp"
    df_filtered = df.filter(items=["name", "gdp"])

    # Print the filtered dataframe
    print(df_filtered)
The output of the code is:
           name   gdp
    CHN   China  15.4
    IND   India   2.9
    USA     USA  21.4
    BRA  Brazil   1.8
As you can see, the filtered dataframe has only two columns: "name" and "gdp". The other columns are dropped from the dataframe.
The like parameter accepts a string that is a part of the column labels that you want to keep in the dataframe. The filter method will return a new dataframe that contains only the columns whose labels contain the string. The match is case-sensitive. For example, if you want to filter the dataframe by the columns that have the letter "a" in their names, you can use the following code:
    # Filter the dataframe by the columns that have the letter "a" in their names
    df_filtered = df.filter(like="a")

    # Print the filtered dataframe
    print(df_filtered)
The output of the code is:
           name  population  area
    CHN   China        1441   9.6
    IND   India        1380   3.3
    USA     USA         331   9.8
    BRA  Brazil         213   8.5
As you can see, the filtered dataframe has three columns: "name", "population", and "area", because each of these labels contains the letter "a". The "gdp" column is dropped from the dataframe.
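One way to reason about like is as a plain substring test on each label. The sketch below compares df.filter(like="a") with a hand-written list comprehension; it is an illustration of the behaviour, not how pandas is implemented internally. Note that "population" also contains an "a", so three columns are kept.

```python
import pandas as pd

data = {
    "name": ["China", "India", "USA", "Brazil"],
    "population": [1441, 1380, 331, 213],
    "area": [9.6, 3.3, 9.8, 8.5],
    "gdp": [15.4, 2.9, 21.4, 1.8],
}
df = pd.DataFrame(data, index=["CHN", "IND", "USA", "BRA"])

# Columns selected by the like parameter
by_like = df.filter(like="a")

# The same selection written as an explicit substring test
by_hand = df[[label for label in df.columns if "a" in label]]

print(list(by_like.columns))    # ['name', 'population', 'area']
print(by_like.equals(by_hand))  # True
```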
The regex parameter accepts a regular expression pattern that matches the column labels that you want to keep in the dataframe. The filter method will return a new dataframe that contains only the columns whose labels match the pattern. For example, if you want to filter the dataframe by the columns that start with the letter "n" or end with the letter "p", you can use the following code:
    # Filter the dataframe by the columns that start with the letter "n" or end with the letter "p"
    df_filtered = df.filter(regex="^n|p$")

    # Print the filtered dataframe
    print(df_filtered)
The output of the code is:
           name   gdp
    CHN   China  15.4
    IND   India   2.9
    USA     USA  21.4
    BRA  Brazil   1.8
As you can see, the filtered dataframe has only two columns: "name" and "gdp". The other columns are dropped from the dataframe.
These are the three ways to use the filter method to filter the dataframe with column labels. You can use any of them depending on your filtering criterion and preference. The next two sections look at the items and regex parameters in more detail.
3.1. Using a List of Labels
In this section, you will learn how to use the items parameter of the filter method to perform pandas dataframe filtering with a list of column labels. This is the simplest way to filter the dataframe with column labels, as you only need to provide a list of the exact labels that you want to keep in the dataframe.
The syntax of the filter method with the items parameter is:
    df.filter(items=list_of_labels)
where df is the dataframe that you want to filter, and list_of_labels is a list of column labels that you want to keep in the dataframe. The filter method will return a new dataframe that contains only the columns that are in the list.
The examples in this section use the dataframe df created in Section 2.
If you want to filter the dataframe by the columns "name" and "gdp", you can use the following code:
    # Filter the dataframe by the columns "name" and "gdp"
    df_filtered = df.filter(items=["name", "gdp"])

    # Print the filtered dataframe
    print(df_filtered)
The output of the code is:
           name   gdp
    CHN   China  15.4
    IND   India   2.9
    USA     USA  21.4
    BRA  Brazil   1.8
As you can see, the filtered dataframe has only two columns: "name" and "gdp". The other columns are dropped from the dataframe.
You can use the items parameter to filter the dataframe by any number of column labels. Labels in the list that do not exist in the dataframe are silently ignored rather than raising an error, so a list with no matching labels produces a dataframe with no columns.
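The handling of unknown labels is easy to verify. In the sketch below, "capital" is a hypothetical label that does not exist in the dataframe: filter ignores it when other labels match, and returns a dataframe with no columns when nothing matches.

```python
import pandas as pd

data = {
    "name": ["China", "India", "USA", "Brazil"],
    "population": [1441, 1380, 331, 213],
    "area": [9.6, 3.3, 9.8, 8.5],
    "gdp": [15.4, 2.9, 21.4, 1.8],
}
df = pd.DataFrame(data, index=["CHN", "IND", "USA", "BRA"])

# "capital" is not a column of df (hypothetical label for illustration)
partial = df.filter(items=["name", "capital"])
nothing = df.filter(items=["capital"])

print(list(partial.columns))  # ['name']
print(nothing.shape)          # (4, 0)
```

Because no error is raised, a typo in a label can go unnoticed; comparing the resulting columns with the requested list is one way to catch it.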
Using the items parameter is a quick and easy way to filter the dataframe with column labels, but it has some limitations. For example, you cannot use the items parameter to filter the dataframe by a part of the column labels, or by a pattern that matches the column labels. For that, you need to use the like or regex parameters.
3.2. Using a Regex Pattern
In this section, you will learn how to use the regex parameter of the filter method to perform pandas dataframe filtering with a regular expression pattern. This is a powerful way to filter the dataframe with column labels, as you can use a pattern that matches the labels that you want to keep in the dataframe. For example, you can use a pattern that matches the labels that start or end with a certain letter, or that contain a certain substring.
The syntax of the filter method with the regex parameter is:
    df.filter(regex=pattern)
where df is the dataframe that you want to filter, and pattern is a regular expression pattern that matches the column labels that you want to keep in the dataframe. The filter method will return a new dataframe that contains only the columns whose labels match the pattern.
A regular expression, or regex, is a sequence of characters that defines a search pattern. You can use regex to find, replace, or validate text that matches a certain criterion. For example, you can use regex to find all the words that start with a vowel, to replace all the numbers with dashes, or to validate an email address. Regex has its own syntax and rules; pandas interprets the pattern with Python's re module.
The examples in this section use the dataframe df created in Section 2.
If you want to filter the dataframe by the columns that start with the letter "n" or end with the letter "p", you can use the following regex pattern:
    # Filter the dataframe by the columns that start with the letter "n" or end with the letter "p"
    df_filtered = df.filter(regex="^n|p$")

    # Print the filtered dataframe
    print(df_filtered)
The output of the code is:
           name   gdp
    CHN   China  15.4
    IND   India   2.9
    USA     USA  21.4
    BRA  Brazil   1.8
As you can see, the filtered dataframe has only two columns: "name" and "gdp". The other columns are dropped from the dataframe.
The regex pattern "^n|p$" means that the label should start with the letter "n" or end with the letter "p". The "^" symbol means the beginning of the string, the "|" symbol means the logical OR operator, and the "$" symbol means the end of the string.
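To see why only "name" and "gdp" match, the pattern can be tested against each label with Python's re module. This is a small sketch: it uses re.search, which is the matching behaviour the pandas documentation describes for the regex parameter, and a one-row stand-in dataframe with the same column labels.

```python
import re

import pandas as pd

labels = ["name", "population", "area", "gdp"]
pattern = re.compile(r"^n|p$")

# search() succeeds if the pattern matches anywhere in the label,
# subject to the ^ and $ anchors
matched = [label for label in labels if pattern.search(label)]
print(matched)  # ['name', 'gdp']

# The same pattern applied through filter on a stand-in dataframe
df = pd.DataFrame([[0, 0, 0, 0]], columns=labels)
filtered = list(df.filter(regex=r"^n|p$").columns)
print(filtered)  # ['name', 'gdp']
```

Note that "population" starts with "p" and ends with "n", which is the reverse of what the pattern asks for, so it is not kept.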
You can use any valid regex pattern to filter the dataframe with the regex parameter, as long as it matches the column labels that you want to keep. The parameter is documented as a string, and the pattern is applied as a search, so it can match anywhere in a label unless you anchor it with "^" or "$".
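Because the pattern is an ordinary Python regular expression, inline flags can be used inside the string. The sketch below assumes a hypothetical dataframe with mixed-case column labels ("Name", "GDP", "area") and uses the standard (?i) flag to make the match case-insensitive.

```python
import pandas as pd

# Hypothetical dataframe with mixed-case labels, for illustration only
df = pd.DataFrame({"Name": ["China"], "GDP": [15.4], "area": [9.6]}, index=["CHN"])

# Without the flag, neither "Name" nor "GDP" matches the lowercase pattern
case_sensitive = df.filter(regex="^n|p$")

# (?i) at the start of the pattern turns on case-insensitive matching
case_insensitive = df.filter(regex="(?i)^n|p$")

print(list(case_sensitive.columns))    # []
print(list(case_insensitive.columns))  # ['Name', 'GDP']
```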
Using the regex parameter is a flexible and powerful way to filter the dataframe with column labels, but it has some drawbacks. For example, you need to know the syntax and rules of regex, which can be complex and confusing. Also, the regex pattern may not be very readable or intuitive, especially if it is long or complicated. In those cases, you may prefer to use the items or like parameters, which you learned in the previous sections.
4. Filtering with Row Labels
Another way to perform pandas dataframe filtering with the filter method is to filter the dataframe based on the row labels. This means that you can select a subset of rows from the dataframe that match a certain criterion. For example, you may want to filter the dataframe by the values of the index, or by a part of the index.
The filter method has the same three parameters that you can use to filter the dataframe with row labels: items, like, and regex. Each of these parameters accepts a different type of argument that specifies the filtering criterion. As with columns, you can only use one of these parameters at a time; otherwise pandas raises a TypeError.
However, to use the filter method to filter the dataframe with row labels, you need to specify one more parameter: axis. The axis parameter determines whether you want to filter the dataframe by the column labels or the row labels. For a dataframe, axis=0 (or "index") refers to the row labels, and axis=1 (or "columns") refers to the column labels. If you omit the axis parameter, the filter method works on the column labels, which is why the examples in Section 3 did not need it. To filter the dataframe by the row labels, you need to set the axis parameter to 0.
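In pandas, axis=0 (or "index") selects row labels and axis=1 (or "columns") selects column labels; when axis is omitted, a dataframe is filtered by its columns. The following sketch shows the three cases side by side on the tutorial's dataframe; the variable names are illustrative.

```python
import pandas as pd

data = {
    "name": ["China", "India", "USA", "Brazil"],
    "population": [1441, 1380, 331, 213],
    "area": [9.6, 3.3, 9.8, 8.5],
    "gdp": [15.4, 2.9, 21.4, 1.8],
}
df = pd.DataFrame(data, index=["CHN", "IND", "USA", "BRA"])

# axis=0 matches the labels in df.index (rows)
rows = df.filter(items=["CHN", "USA"], axis=0)

# axis=1 matches the labels in df.columns
cols = df.filter(items=["name", "gdp"], axis=1)

# Omitting axis behaves like axis=1 for a dataframe
default = df.filter(items=["name", "gdp"])

print(list(rows.index))      # ['CHN', 'USA']
print(list(cols.columns))    # ['name', 'gdp']
print(default.equals(cols))  # True
```

The string forms axis="index" and axis="columns" are equivalent and can make the intent easier to read.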
The syntax of the filter method with the axis parameter is:
    df.filter(axis=0, items=list_of_labels)
    df.filter(axis=0, like=string)
    df.filter(axis=0, regex=pattern)
where df is the dataframe that you want to filter, and list_of_labels, string, and pattern are the arguments for the items, like, and regex parameters, respectively. The filter method will return a new dataframe that contains only the rows whose labels match the criterion.
The examples in this section use the dataframe df created in Section 2.
If you want to filter the dataframe by the rows that have the letters "A" or "I" in their index, you can use the following code:
    # Filter the dataframe by the rows that have the letters "A" or "I" in their index
    df_filtered = df.filter(axis=0, regex="[AI]")

    # Print the filtered dataframe
    print(df_filtered)
The output of the code is:
           name  population  area   gdp
    IND   India        1380   3.3   2.9
    USA     USA         331   9.8  21.4
    BRA  Brazil         213   8.5   1.8
As you can see, the filtered dataframe has only three rows: "IND", "USA", and "BRA". The "CHN" row is dropped from the dataframe, and all four columns are kept.
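The like parameter works on row labels in the same way once axis=0 is set. As a small sketch on the tutorial's dataframe, the call below keeps the rows whose index label contains the substring "A":

```python
import pandas as pd

data = {
    "name": ["China", "India", "USA", "Brazil"],
    "population": [1441, 1380, 331, 213],
    "area": [9.6, 3.3, 9.8, 8.5],
    "gdp": [15.4, 2.9, 21.4, 1.8],
}
df = pd.DataFrame(data, index=["CHN", "IND", "USA", "BRA"])

# Keep rows whose index label contains "A" (case-sensitive)
rows_with_a = df.filter(like="A", axis=0)

print(list(rows_with_a.index))    # ['USA', 'BRA']
print(list(rows_with_a.columns))  # ['name', 'population', 'area', 'gdp']
```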
These are the three ways to use the filter method to filter the dataframe with row labels. You can use any of them depending on your filtering criterion and preference. The next sections look at the items parameter for rows and at filtering with a custom function.
4.1. Using a List of Labels
In this section, you will learn how to use the items parameter of the filter method to perform pandas dataframe filtering with a list of row labels. This is similar to the way you used the items parameter to filter the dataframe with column labels, but with one difference: you need to set the axis parameter to 0 to indicate that you want to filter the dataframe by the row labels.
The syntax of the filter method with the items and axis parameters is:
    df.filter(axis=0, items=list_of_labels)
where df is the dataframe that you want to filter, and list_of_labels is a list of row labels that you want to keep in the dataframe. The filter method will return a new dataframe that contains only the rows that are in the list.
The examples in this section use the dataframe df created in Section 2.
If you want to filter the dataframe by the rows "CHN" and "USA", you can use the following code:
    # Filter the dataframe by the rows "CHN" and "USA"
    df_filtered = df.filter(axis=0, items=["CHN", "USA"])

    # Print the filtered dataframe
    print(df_filtered)
The output of the code is:
          name  population  area   gdp
    CHN  China        1441   9.6  15.4
    USA    USA         331   9.8  21.4
As you can see, the filtered dataframe has only two rows: "CHN" and "USA". The other rows are dropped from the dataframe.
You can use the items parameter to filter the dataframe by any number of row labels. Labels in the list that do not exist in the index are silently ignored rather than raising an error, so a list with no matching labels produces a dataframe with no rows.
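This tolerance for unknown labels is one practical difference between filter and label-based selection with loc. In the sketch below, "XXX" is a hypothetical row label that is not in the index: filter skips it, while loc raises a KeyError in current pandas versions.

```python
import pandas as pd

data = {
    "name": ["China", "India", "USA", "Brazil"],
    "population": [1441, 1380, 331, 213],
    "area": [9.6, 3.3, 9.8, 8.5],
    "gdp": [15.4, 2.9, 21.4, 1.8],
}
df = pd.DataFrame(data, index=["CHN", "IND", "USA", "BRA"])

# filter ignores the unknown label "XXX" (hypothetical, for illustration)
kept = df.filter(items=["CHN", "XXX"], axis=0)
print(list(kept.index))  # ['CHN']

# loc is strict about missing labels
try:
    df.loc[["CHN", "XXX"]]
    loc_raised = False
except KeyError:
    loc_raised = True
print(loc_raised)  # True
```

Which behaviour is preferable depends on whether a missing label should be treated as an error in your analysis.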
Using the items parameter is a simple and easy way to filter the dataframe with row labels, but it has some limitations. For example, you cannot use the items parameter to filter the dataframe by a part of the row labels, or by a pattern that matches the row labels. For that, you need to use the like or regex parameters together with the axis parameter.
4.2. Using a Callable Function
In this section, you will learn how to filter a dataframe by its row labels with a callable function. The filter method itself has no parameter that accepts a function: it only takes items, like, regex, and axis. To apply custom logic, you call the function on every row label to build a boolean mask, and then pass that mask to loc. This is an advanced way to filter the dataframe with row labels, as you can use any function that returns a boolean value for each row label. For example, you can use a function that checks if the row label is a palindrome, or if it starts with a vowel.
The pattern is:
    df.loc[df.index.map(function)]
where df is the dataframe that you want to filter, and function is a callable that takes a row label as an input and returns a boolean value as an output. The result is a new dataframe that contains only the rows for which the function returns True.
If you want to keep the rows whose index label starts with a vowel, you can use the following code, which works on the dataframe df created in Section 2:
    # Define a function that checks if the row label starts with a vowel
    def starts_with_vowel(label):
        return label[0] in "AEIOU"

    # Build a boolean mask by applying the function to each row label
    mask = df.index.map(starts_with_vowel)

    # Keep only the rows for which the function returned True
    df_filtered = df.loc[mask]

    # Print the filtered dataframe
    print(df_filtered)
The output of the code is:
          name  population  area   gdp
    IND  India        1380   3.3   2.9
    USA    USA         331   9.8  21.4
As you can see, the filtered dataframe has only two rows: "IND" and "USA". The other rows are dropped from the dataframe.
You can use any callable that returns a boolean value for each row label, including a lambda function, for example df.loc[df.index.map(lambda label: label.endswith("A"))].
Applying a function to the labels is a flexible and powerful way to filter the dataframe with row labels, but it has some drawbacks. You need to write your own function, which takes more effort and is easier to get wrong than a built-in option, and a complex function can make the filtering logic harder to read. When a list, a substring, or a pattern is enough, you may prefer the items, like, or regex parameters of the filter method.
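Since the filter method has no callable parameter, one option is to wrap the mask-and-loc idea in a small helper. The function filter_by_label below is hypothetical (it is not part of pandas); it is a minimal sketch that applies a predicate to the labels of either axis and keeps the entries for which it returns True.

```python
import pandas as pd


def filter_by_label(frame, predicate, axis=0):
    """Keep rows (axis=0) or columns (axis=1) whose label satisfies predicate.

    Hypothetical helper for illustration; not a pandas API.
    """
    labels = frame.index if axis == 0 else frame.columns
    # Evaluate the predicate once per label to build a boolean mask
    mask = [bool(predicate(label)) for label in labels]
    return frame.loc[mask] if axis == 0 else frame.loc[:, mask]


data = {
    "name": ["China", "India", "USA", "Brazil"],
    "population": [1441, 1380, 331, 213],
    "area": [9.6, 3.3, 9.8, 8.5],
    "gdp": [15.4, 2.9, 21.4, 1.8],
}
df = pd.DataFrame(data, index=["CHN", "IND", "USA", "BRA"])

# Rows whose label contains the letter "I"
rows = filter_by_label(df, lambda label: "I" in label)

# Columns whose label is exactly four characters long
cols = filter_by_label(df, lambda label: len(label) == 4, axis=1)

print(list(rows.index))    # ['IND']
print(list(cols.columns))  # ['name', 'area']
```

Lambda functions and named functions work equally well here, because the helper only requires something callable.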
5. Conclusion
In this tutorial, you learned how to use the filter method to perform pandas dataframe filtering based on column or row labels. You learned how to specify the filtering criterion with a list of labels, a substring, or a regex pattern. You also learned how to set the axis parameter to indicate whether you want to filter the dataframe by the column labels or the row labels.
The filter method is a convenient and versatile way to select a subset of a dataframe by label. However, it has some limitations that you should be aware of. It filters only by labels, not by values; to filter by values, use boolean indexing with loc or the query method. It also always returns a new dataframe and leaves the original unchanged. To keep the result, assign it to a variable, for example df = df.filter(items=["name", "gdp"]).
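For comparison, here is a brief sketch of value-based filtering on the tutorial's dataframe. The thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

data = {
    "name": ["China", "India", "USA", "Brazil"],
    "population": [1441, 1380, 331, 213],
    "area": [9.6, 3.3, 9.8, 8.5],
    "gdp": [15.4, 2.9, 21.4, 1.8],
}
df = pd.DataFrame(data, index=["CHN", "IND", "USA", "BRA"])

# Boolean indexing with loc: rows where population exceeds 1000
large_population = df.loc[df["population"] > 1000]

# The query method: rows where gdp exceeds 10
high_gdp = df.query("gdp > 10")

print(list(large_population.index))  # ['CHN', 'IND']
print(list(high_gdp.index))          # ['CHN', 'USA']
```

These selections look at the data in the cells, whereas filter only ever inspects the labels.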
By using the filter method, you can perform various data analysis tasks, such as selecting relevant columns or rows, reducing the size of the dataframe, or simplifying the data. You can also combine the filter method with other pandas methods and functions to perform more complex operations on your dataframes.
We hope you enjoyed this tutorial and learned something new and useful. If you have any questions or feedback, please let us know in the comments below. Happy filtering!