1. Introduction
In this tutorial, you will learn how to use pandas methods to filter and sort dataframes based on values or criteria. Filtering and sorting are common operations that you can apply to your dataframes to extract the information you need or to organize the data in a meaningful way.
Filtering allows you to select a subset of rows from a dataframe that meet certain conditions. For example, you can filter a dataframe by a specific column value, by multiple conditions, by string methods, or by query method. Sorting allows you to arrange the rows of a dataframe in ascending or descending order based on one or more columns or the index. For example, you can sort a dataframe by column values, by index, or by multiple criteria.
To follow along with this tutorial, you will need to have pandas installed on your machine. You can install pandas using pip or conda. You will also need to import pandas and create a sample dataframe that we will use throughout the tutorial. You can copy and paste the code below to create the dataframe.

    import pandas as pd

    # create a sample dataframe
    data = {
        "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
        "age": [25, 30, 35, 40, 45, 50],
        "gender": ["F", "M", "M", "M", "F", "M"],
        "country": ["USA", "UK", "Canada", "Australia", "Germany", "France"],
        "salary": [4000, 5000, 6000, 7000, 8000, 9000],
    }
    df = pd.DataFrame(data)

    # display the dataframe
    df

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
0 | Alice | 25 | F | USA | 4000 |
1 | Bob | 30 | M | UK | 5000 |
2 | Charlie | 35 | M | Canada | 6000 |
3 | David | 40 | M | Australia | 7000 |
4 | Eve | 45 | F | Germany | 8000 |
5 | Frank | 50 | M | France | 9000 |
Now that you have the dataframe ready, let’s see how you can filter and sort it using pandas methods.
2. Filtering dataframes
In this section, you will learn how to filter dataframes using pandas methods. Filtering dataframes means selecting a subset of rows that meet certain conditions. This can help you to extract the information you need from your dataframes or to perform further analysis on them.
There are different ways to filter dataframes using pandas. Some of the most common methods are:
- Filtering by column values: This method allows you to select rows that have a specific value or a range of values in a given column.
- Filtering by multiple conditions: This method allows you to combine two or more conditions using the element-wise operators & (and), | (or), and ~ (not).
- Filtering by string methods: This method allows you to select rows that contain a certain string or match a certain pattern in a given column.
- Filtering by the query() method: This method allows you to write the filter condition as a string expression, similar to a SQL WHERE clause.
In the following subsections, you will see examples of how to use each of these methods to filter the sample dataframe that we created in the previous section. You will also see how to use the shape attribute to check the number of rows and columns in the filtered dataframes.
2.1. Filtering by column values
One of the simplest ways to filter dataframes is by column values. This means selecting rows that have a specific value or a range of values in a given column. For example, you can filter the dataframe by the gender column to get only the rows where the gender is F (female) or M (male).
To filter by column values, you can use the square brackets notation and pass a boolean expression that evaluates to True or False for each row. For example, the expression df["gender"] == "F" will return True for the rows where the gender column is F and False for the rows where the gender column is not F. You can then use this expression inside the square brackets to get the subset of rows where the expression is True. For example, df[df["gender"] == "F"] will return the dataframe with only the female rows.
Here is an example of how to filter the dataframe by the gender column to get only the female rows:

    # filter the dataframe by the gender column
    df_female = df[df["gender"] == "F"]

    # display the filtered dataframe
    df_female

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
0 | Alice | 25 | F | USA | 4000 |
4 | Eve | 45 | F | Germany | 8000 |
You can see that the filtered dataframe has only two rows, where the gender column is F. You can also check the shape of the filtered dataframe to see how many rows and columns it has:

    # check the shape of the filtered dataframe
    df_female.shape

The output of the code should look like this:
(2, 5)
This means that the filtered dataframe has 2 rows and 5 columns.
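To make the boolean expression described above more concrete, the following sketch inspects it on its own before using it as a filter. The variable names mask, n_female, and same_rows are illustrative and not part of the original example; the dataframe is rebuilt so the snippet runs by itself.

```python
import pandas as pd

# Rebuild the tutorial's sample dataframe so this snippet is self-contained.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age": [25, 30, 35, 40, 45, 50],
    "gender": ["F", "M", "M", "M", "F", "M"],
    "country": ["USA", "UK", "Canada", "Australia", "Germany", "France"],
    "salary": [4000, 5000, 6000, 7000, 8000, 9000],
})

# The comparison produces one boolean per row (a boolean Series).
mask = df["gender"] == "F"
print(mask.tolist())  # [True, False, False, False, True, False]

# True counts as 1, so summing the mask counts the matching rows.
n_female = int(mask.sum())

# Square brackets and .loc both keep only the rows where the mask is True.
df_female = df[mask]
same_rows = df.loc[mask]
```

Storing the mask in a variable is optional, but it can make longer filters easier to read and lets you reuse the same condition several times.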
You can also filter by column values using other operators, such as greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), or not equal to (!=). For example, you can filter the dataframe by the age column to get only the rows where the age is greater than 30:

    # filter the dataframe by the age column
    df_age = df[df["age"] > 30]

    # display the filtered dataframe
    df_age

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
2 | Charlie | 35 | M | Canada | 6000 |
3 | David | 40 | M | Australia | 7000 |
4 | Eve | 45 | F | Germany | 8000 |
5 | Frank | 50 | M | France | 9000 |
You can see that the filtered dataframe has four rows, where the age column is greater than 30.
Filtering by column values is a useful method to select rows that have a specific value or a range of values in a given column. You can use different operators to filter by different conditions. In the next subsection, you will learn how to filter by multiple conditions using logical operators.
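The section mentions filtering by a range of values without showing it. One possible sketch, assuming you want ages from 30 to 40 inclusive, uses the Series.between() method. The variable names df_range and df_range_manual are illustrative.

```python
import pandas as pd

# Rebuild the tutorial's sample dataframe so this snippet is self-contained.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age": [25, 30, 35, 40, 45, 50],
    "gender": ["F", "M", "M", "M", "F", "M"],
    "country": ["USA", "UK", "Canada", "Australia", "Germany", "France"],
    "salary": [4000, 5000, 6000, 7000, 8000, 9000],
})

# between() includes both endpoints by default, so 30 and 40 are kept.
df_range = df[df["age"].between(30, 40)]
print(df_range["name"].tolist())  # ['Bob', 'Charlie', 'David']

# The same range written as two explicit comparisons.
df_range_manual = df[(df["age"] >= 30) & (df["age"] <= 40)]
```

Both forms select the same rows; between() is simply shorter when the lower and upper bounds apply to the same column.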
2.2. Filtering by multiple conditions
Sometimes, you may want to filter dataframes by more than one condition. For example, you may want to select rows that satisfy both of the following conditions: the gender is F and the salary is greater than 5000. Or, you may want to select rows that satisfy either of the following conditions: the country is USA or the country is UK. To filter by multiple conditions, you combine boolean expressions with the logical operations and, or, and not.
Python's and, or, and not keywords do not work element-wise on pandas columns, so you use the bitwise operators & (and), | (or), and ~ (not) instead. Each condition must be wrapped in parentheses, because these operators bind more tightly than comparison operators such as == and >. For example, the expression (df["gender"] == "F") & (df["salary"] > 5000) will return True for the rows where both the gender column is F and the salary column is greater than 5000. You can then use this expression inside the square brackets to get the subset of rows where the expression is True. For example, df[(df["gender"] == "F") & (df["salary"] > 5000)] will return the dataframe with only the female rows whose salary is greater than 5000.
Here is an example of how to filter the dataframe by multiple conditions using logical operators:

    # filter the dataframe by multiple conditions
    df_multi = df[(df["gender"] == "F") & (df["salary"] > 5000)]

    # display the filtered dataframe
    df_multi

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
4 | Eve | 45 | F | Germany | 8000 |
You can see that the filtered dataframe has only one row, where both the gender column is F and the salary column is greater than 5000.
You can also use the or operator (|) to select rows that satisfy at least one of several conditions. For example, you can filter the dataframe by the country column to get only the rows where the country is either USA or UK:

    # filter the dataframe by multiple conditions
    df_or = df[(df["country"] == "USA") | (df["country"] == "UK")]

    # display the filtered dataframe
    df_or

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
0 | Alice | 25 | F | USA | 4000 |
1 | Bob | 30 | M | UK | 5000 |
You can see that the filtered dataframe has two rows, where the country column is either USA or UK.
Filtering by multiple conditions using logical operators is a powerful method to select rows that satisfy complex criteria. You can use different operators to filter by different combinations of conditions. In the next subsection, you will learn how to filter by string methods using regular expressions.
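The not operator (~) is mentioned above but not demonstrated. The sketch below shows it, together with Series.isin(), which is a common shorthand when several | conditions test the same column. The variable names are illustrative.

```python
import pandas as pd

# Rebuild the tutorial's sample dataframe so this snippet is self-contained.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age": [25, 30, 35, 40, 45, 50],
    "gender": ["F", "M", "M", "M", "F", "M"],
    "country": ["USA", "UK", "Canada", "Australia", "Germany", "France"],
    "salary": [4000, 5000, 6000, 7000, 8000, 9000],
})

# ~ inverts a boolean condition: keep the rows where gender is NOT "M".
df_not_male = df[~(df["gender"] == "M")]

# isin() replaces a chain of == comparisons joined with |.
df_usa_uk = df[df["country"].isin(["USA", "UK"])]

# Combining both: every row whose country is neither USA nor UK.
df_elsewhere = df[~df["country"].isin(["USA", "UK"])]
```

With this data, df_not_male contains Alice and Eve, df_usa_uk contains Alice and Bob, and df_elsewhere contains the remaining four rows.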
2.3. Filtering by string methods
Another way to filter dataframes is by string methods. This means selecting rows that contain a certain string or match a certain pattern in a given column. For example, you can filter the dataframe by the name column to get only the rows where the name starts with one letter or ends with another.
To filter by string methods, you can use the str accessor and apply various string methods to the column of interest. For example, the method str.startswith() will return True for the rows where the column value starts with a given string, and the method str.endswith() will return True for the rows where the column value ends with a given string. Both comparisons are case-sensitive. You can then use these methods inside the square brackets to get the subset of rows where the methods return True. For example, df[df["name"].str.startswith("A")] will return the dataframe with only the rows whose name starts with A.
Here is an example of how to filter the dataframe by string methods:

    # filter the dataframe by string methods (the comparisons are case-sensitive)
    df_str = df[df["name"].str.startswith("A") | df["name"].str.endswith("e")]

    # display the filtered dataframe
    df_str

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
0 | Alice | 25 | F | USA | 4000 |
2 | Charlie | 35 | M | Canada | 6000 |
4 | Eve | 45 | F | Germany | 8000 |
You can see that the filtered dataframe has three rows, where the name column starts with A or ends with e. Because these methods are case-sensitive, str.endswith("E") would not match any of the names, since they all end in a lowercase letter.
You can also use other string methods that return True or False for each row, such as str.contains() and str.match(). These methods accept regular expressions as arguments, which allow you to specify more complex patterns to filter by. Other methods on the str accessor, such as replace, split, and slice, transform the strings rather than test them. For example, you can filter the dataframe by the country column to get only the rows where the country contains an uppercase A:

    # filter the dataframe by string methods
    df_regex = df[df["country"].str.contains("A")]

    # display the filtered dataframe
    df_regex

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
0 | Alice | 25 | F | USA | 4000 |
3 | David | 40 | M | Australia | 7000 |
You can see that the filtered dataframe has two rows, where the country column contains an uppercase A. Canada is not included because str.contains() is case-sensitive by default; passing case=False makes the match case-insensitive.
Filtering by string methods is a handy method to select rows that contain a certain string or match a certain pattern in a given column. You can use various string methods and regular expressions to filter by different criteria. In the next subsection, you will learn how to filter by query method using a SQL-like syntax.
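As a rough illustration of case-insensitive matching and regular expressions, the sketch below uses the case parameter of str.contains() and two simple patterns. The patterns and variable names are chosen for this example only.

```python
import pandas as pd

# Rebuild the tutorial's sample dataframe so this snippet is self-contained.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age": [25, 30, 35, 40, 45, 50],
    "gender": ["F", "M", "M", "M", "F", "M"],
    "country": ["USA", "UK", "Canada", "Australia", "Germany", "France"],
    "salary": [4000, 5000, 6000, 7000, 8000, 9000],
})

# case=False matches both "a" and "A"; only UK has neither.
df_any_a = df[df["country"].str.contains("a", case=False)]

# The pattern is a regular expression by default: ^[AC] means
# "starts with A or C".
df_starts_a_or_c = df[df["country"].str.contains(r"^[AC]")]

# str.match() anchors the pattern at the start of each string,
# so this keeps names beginning with A, B, or C.
df_names_a_to_c = df[df["name"].str.match(r"[A-C]")]
```

If the text you search for contains characters that are special in regular expressions, such as a dot or a plus sign, passing regex=False to str.contains() treats the pattern as a literal string.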
2.4. Filtering by query method
A final way to filter dataframes is with the query() method. This means writing the filter condition as a string expression, similar to a SQL WHERE clause. For example, you can filter the dataframe by the salary column to get only the rows where the salary is greater than the average salary of the dataframe.
To filter with the query() method, you pass a string that contains the expression to filter by. Inside the string, column names can be used directly, while Python variables from the surrounding code must be prefixed with @. Referring to df itself inside the string without the @ prefix raises an error. For example, after computing mean_salary = df["salary"].mean(), the expression "salary > @mean_salary" will be True for the rows where the salary column is greater than the mean of the salary column. You can then use this expression as an argument to the query() method to get the subset of rows where the expression is True. For example, df.query("salary > @mean_salary") will return the dataframe with only the rows whose salary is greater than the average salary of the dataframe.
Here is an example of how to filter the dataframe with the query() method:

    # compute the average salary, then filter the dataframe with query()
    mean_salary = df["salary"].mean()
    df_query = df.query("salary > @mean_salary")

    # display the filtered dataframe
    df_query

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
3 | David | 40 | M | Australia | 7000 |
4 | Eve | 45 | F | Germany | 8000 |
5 | Frank | 50 | M | France | 9000 |
You can see that the filtered dataframe has three rows, where the salary column is greater than the average salary of the dataframe.
You can also use other expressions and operators in the query method, such as arithmetic, comparison, logical, in, not in, and more. For example, you can filter the dataframe by the name column to get only the rows where the name is in a given list:

    # filter the dataframe by query method
    df_in = df.query("name in ['Alice', 'Bob', 'Charlie']")

    # display the filtered dataframe
    df_in

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
0 | Alice | 25 | F | USA | 4000 |
1 | Bob | 30 | M | UK | 5000 |
2 | Charlie | 35 | M | Canada | 6000 |
You can see that the filtered dataframe has three rows, where the name column is in the list ['Alice', 'Bob', 'Charlie'].
Filtering by query method is a convenient method to filter dataframes based on complex expressions using a SQL-like syntax. You can use various expressions and operators to filter by different criteria. In the next section, you will learn how to sort dataframes using pandas methods.
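The logical operators mentioned above can be sketched as follows. Inside a query string, the words and, or, and not are allowed, and a Python variable is referenced with the @ prefix. The threshold min_salary and the other variable names are made up for this example.

```python
import pandas as pd

# Rebuild the tutorial's sample dataframe so this snippet is self-contained.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age": [25, 30, 35, 40, 45, 50],
    "gender": ["F", "M", "M", "M", "F", "M"],
    "country": ["USA", "UK", "Canada", "Australia", "Germany", "France"],
    "salary": [4000, 5000, 6000, 7000, 8000, 9000],
})

# Hypothetical threshold defined outside the query string.
min_salary = 5000

# Column names are used directly; @min_salary refers to the Python variable.
df_query_multi = df.query("gender == 'F' and salary > @min_salary")

# The equivalent filter written with a boolean mask.
df_mask_multi = df[(df["gender"] == "F") & (df["salary"] > min_salary)]
```

Both versions return only the row for Eve. Which one to use is largely a matter of readability: query strings tend to be shorter, while boolean masks work with any Python expression.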
3. Sorting dataframes
In this section, you will learn how to sort dataframes using pandas methods. Sorting dataframes means arranging the rows of a dataframe in ascending or descending order based on one or more columns or the index. This can help you to organize the data in a meaningful way or to compare the values of different rows.
There are different ways to sort dataframes using pandas. Some of the most common methods are:
- Sorting by column values: This method allows you to sort the rows of a dataframe based on the values of a given column or a list of columns.
- Sorting by index: This method allows you to sort the rows of a dataframe based on the values of the index or a level of a multi-index.
- Sorting by multiple criteria: This method allows you to sort the rows of a dataframe based on a combination of column values and index values, with different orders and priorities.
In the following subsections, you will see examples of how to use each of these methods to sort the sample dataframe that we created in the introduction. You will also see how to use the head() and tail() methods to get the first or last n rows of a sorted dataframe.
3.1. Sorting by column values
One of the simplest ways to sort dataframes is by column values. This means arranging the rows of a dataframe in ascending or descending order based on the values of a given column or a list of columns. For example, you can sort the dataframe by the salary column to get the rows with the highest or lowest salaries.
To sort by column values, you can use the sort_values() method and pass the name of the column or a list of columns to sort by. You can also specify the order of the sorting by using the ascending parameter, which can be True (default) or False. For example, the expression df.sort_values("salary", ascending=False) will sort the dataframe by the salary column in descending order. You can then assign the sorted dataframe to a new variable or overwrite the original dataframe. For example, df_sorted = df.sort_values("salary", ascending=False) will create a new dataframe with the sorted rows.
Here is an example of how to sort the dataframe by column values:

    # sort the dataframe by column values
    df_sorted = df.sort_values("salary", ascending=False)

    # display the sorted dataframe
    df_sorted

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
5 | Frank | 50 | M | France | 9000 |
4 | Eve | 45 | F | Germany | 8000 |
3 | David | 40 | M | Australia | 7000 |
2 | Charlie | 35 | M | Canada | 6000 |
1 | Bob | 30 | M | UK | 5000 |
0 | Alice | 25 | F | USA | 4000 |
You can see that the sorted dataframe has the rows arranged by the salary column in descending order, from the highest to the lowest.
You can also sort by multiple columns by passing a list of column names to the sort_values() method. For example, you can sort the dataframe by the gender column and then by the age column:

    # sort the dataframe by multiple columns
    df_multi = df.sort_values(["gender", "age"])

    # display the sorted dataframe
    df_multi

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
0 | Alice | 25 | F | USA | 4000 |
4 | Eve | 45 | F | Germany | 8000 |
1 | Bob | 30 | M | UK | 5000 |
2 | Charlie | 35 | M | Canada | 6000 |
3 | David | 40 | M | Australia | 7000 |
5 | Frank | 50 | M | France | 9000 |
You can see that the sorted dataframe has the rows arranged by the gender column in ascending order, and then by the age column in ascending order within each gender group.
Sorting by column values is a useful method to arrange the rows of a dataframe in a meaningful order based on the values of a given column or a list of columns. You can use different parameters to sort by different orders and priorities. In the next subsection, you will learn how to sort by index using pandas methods.
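Once a dataframe is sorted, the head() and tail() methods give quick access to the rows at either end. The following is a small sketch of that idea; the variable names top_3 and bottom_2 are illustrative.

```python
import pandas as pd

# Rebuild the tutorial's sample dataframe so this snippet is self-contained.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age": [25, 30, 35, 40, 45, 50],
    "gender": ["F", "M", "M", "M", "F", "M"],
    "country": ["USA", "UK", "Canada", "Australia", "Germany", "France"],
    "salary": [4000, 5000, 6000, 7000, 8000, 9000],
})

# Sort from the highest salary to the lowest.
df_sorted = df.sort_values("salary", ascending=False)

# head(n) returns the first n rows: here, the three highest salaries.
top_3 = df_sorted.head(3)
print(top_3["name"].tolist())  # ['Frank', 'Eve', 'David']

# tail(n) returns the last n rows: here, the two lowest salaries.
bottom_2 = df_sorted.tail(2)
print(bottom_2["name"].tolist())  # ['Bob', 'Alice']
```

Called without an argument, both methods return five rows.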
3.2. Sorting by index
Another way to sort dataframes is by index. This means arranging the rows of a dataframe in ascending or descending order based on the values of the index or a level of a multi-index. For example, you can sort the dataframe by the index to get the rows with the lowest or highest index values.
To sort by index, you can use the sort_index() method and pass the axis parameter, which can be 0 (default) for rows or 1 for columns. You can also specify the order of the sorting by using the ascending parameter, which can be True (default) or False. For example, the expression df.sort_index(axis=0, ascending=False) will sort the dataframe by the row index in descending order. You can then assign the sorted dataframe to a new variable or overwrite the original dataframe. For example, df_sorted = df.sort_index(axis=0, ascending=False) will create a new dataframe with the sorted rows.
Here is an example of how to sort the dataframe by index:

    # sort the dataframe by index
    df_sorted = df.sort_index(axis=0, ascending=False)

    # display the sorted dataframe
    df_sorted

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
5 | Frank | 50 | M | France | 9000 |
4 | Eve | 45 | F | Germany | 8000 |
3 | David | 40 | M | Australia | 7000 |
2 | Charlie | 35 | M | Canada | 6000 |
1 | Bob | 30 | M | UK | 5000 |
0 | Alice | 25 | F | USA | 4000 |
You can see that the sorted dataframe has the rows arranged by the index in descending order, from the highest to the lowest.
You can also sort by a level of a multi-index by passing the level parameter, which can be an integer or a string. For example, you can sort the dataframe by the first level of a multi-index:

    # create a multi-index dataframe
    df_multi = df.set_index(["gender", "name"])

    # sort the dataframe by a level of a multi-index
    df_multi_sorted = df_multi.sort_index(level=0)

    # display the sorted dataframe
    df_multi_sorted

The output of the code should look like this:
| gender | name | age | country | salary |
|---|---|---|---|---|
| F | Alice | 25 | USA | 4000 |
| | Eve | 45 | Germany | 8000 |
| M | Bob | 30 | UK | 5000 |
| | Charlie | 35 | Canada | 6000 |
| | David | 40 | Australia | 7000 |
| | Frank | 50 | France | 9000 |
You can see that the sorted dataframe has the rows arranged by the first level of the multi-index (gender) in ascending order.
Sorting by index is a handy method to arrange the rows of a dataframe in a meaningful order based on the values of the index or a level of a multi-index. You can use different parameters to sort by different axes, orders, and levels. In the next subsection, you will learn how to sort by multiple criteria using pandas methods.
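A frequent practical use of sort_index() is to restore the original row order after sorting by a column, since the original index labels travel with the rows. The sketch below shows this, along with reset_index(drop=True) for the opposite case where you want fresh labels that follow the new order. The variable names are illustrative.

```python
import pandas as pd

# Rebuild the tutorial's sample dataframe so this snippet is self-contained.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age": [25, 30, 35, 40, 45, 50],
    "gender": ["F", "M", "M", "M", "F", "M"],
    "country": ["USA", "UK", "Canada", "Australia", "Germany", "France"],
    "salary": [4000, 5000, 6000, 7000, 8000, 9000],
})

# Sorting by salary keeps each row's original index label (5, 4, ..., 0).
df_by_salary = df.sort_values("salary", ascending=False)

# Sorting by index puts the rows back in their original order.
df_restored = df_by_salary.sort_index()

# reset_index(drop=True) discards the old labels and renumbers from 0.
df_renumbered = df_by_salary.reset_index(drop=True)
```

Here df_restored lists the names from Alice to Frank again, while df_renumbered keeps the salary order but labels the rows 0 through 5.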
3.3. Sorting by multiple criteria
A final way to sort dataframes is by multiple criteria. This means arranging the rows of a dataframe in a custom order based on a combination of column values and index values, with different orders and priorities. For example, you can sort the dataframe by the gender column in ascending order, and then by the salary column in descending order within each gender group.
To sort by multiple criteria, you can use the sort_values() method and pass a list of column names and a list of orders to the by and ascending parameters, respectively. You can also use the sort_index() method and pass a list of levels and a list of orders to the level and ascending parameters, respectively. For example, the expression df.sort_values(by=["gender", "salary"], ascending=[True, False]) will sort the dataframe by the gender column in ascending order, and then by the salary column in descending order within each gender group. You can then assign the sorted dataframe to a new variable or overwrite the original dataframe. For example, df_sorted = df.sort_values(by=["gender", "salary"], ascending=[True, False]) will create a new dataframe with the sorted rows.
Here is an example of how to sort the dataframe by multiple criteria:

    # sort the dataframe by multiple criteria
    df_sorted = df.sort_values(by=["gender", "salary"], ascending=[True, False])

    # display the sorted dataframe
    df_sorted

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
4 | Eve | 45 | F | Germany | 8000 |
0 | Alice | 25 | F | USA | 4000 |
5 | Frank | 50 | M | France | 9000 |
3 | David | 40 | M | Australia | 7000 |
2 | Charlie | 35 | M | Canada | 6000 |
1 | Bob | 30 | M | UK | 5000 |
You can see that the sorted dataframe has the rows arranged by the gender column in ascending order, and then by the salary column in descending order within each gender group.
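The sort_index() variant with a list of levels and a list of orders is described above but not shown. A minimal sketch, assuming a multi-index built from the gender and name columns, could look like this. The variable names are illustrative.

```python
import pandas as pd

# Rebuild the tutorial's sample dataframe so this snippet is self-contained.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age": [25, 30, 35, 40, 45, 50],
    "gender": ["F", "M", "M", "M", "F", "M"],
    "country": ["USA", "UK", "Canada", "Australia", "Germany", "France"],
    "salary": [4000, 5000, 6000, 7000, 8000, 9000],
})

# Move gender and name into a two-level index.
df_by_levels = df.set_index(["gender", "name"])

# Sort gender in ascending order and name in descending order
# within each gender group.
df_levels_sorted = df_by_levels.sort_index(
    level=["gender", "name"], ascending=[True, False]
)
```

With this data, the F group lists Eve before Alice, and the M group runs from Frank down to Bob.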
You can also sort by a combination of column values and index values by chaining the sort_index() and sort_values() methods. Because each call re-sorts the whole dataframe, you sort by the lower-priority criterion first and then apply a stable sort for the higher-priority criterion, which preserves the existing order among rows with equal values. For example, you can sort the dataframe by the gender column in ascending order, and then by the index in descending order within each gender group:

    # sort by the index first, then apply a stable sort (mergesort) by gender
    df_combo = df.sort_index(ascending=False).sort_values(by="gender", kind="mergesort")

    # display the sorted dataframe
    df_combo

The output of the code should look like this:
| | name | age | gender | country | salary |
---|---|---|---|---|---|
4 | Eve | 45 | F | Germany | 8000 |
0 | Alice | 25 | F | USA | 4000 |
5 | Frank | 50 | M | France | 9000 |
3 | David | 40 | M | Australia | 7000 |
2 | Charlie | 35 | M | Canada | 6000 |
1 | Bob | 30 | M | UK | 5000 |
You can see that the sorted dataframe has the rows arranged by the gender column in ascending order, and then by the index in descending order within each gender group.
Sorting by multiple criteria is a powerful method to arrange the rows of a dataframe in a custom order based on a combination of column values and index values, with different orders and priorities. You can use various parameters to sort by different criteria and combinations. The next section summarizes what you have learned and lists some additional resources.
4. Conclusion
In this blog, you learned how to use pandas methods to filter and sort dataframes based on values or criteria. You saw how to use the square brackets notation, the sort_values() method, and the sort_index() method to select and arrange the rows of a dataframe in different ways. You also saw how to use the shape attribute, the head() method, and the tail() method to check the size and the first or last rows of a filtered or sorted dataframe.
Filtering and sorting dataframes are essential skills for data analysis and manipulation. They can help you to extract the information you need or to organize the data in a meaningful way. You can apply these methods to any dataframe that you have or create using pandas.
Here are some key points to remember from this blog:
- Filtering dataframes means selecting a subset of rows that meet certain conditions. You can filter by column values, by multiple conditions, by string methods, or by query method.
- Sorting dataframes means arranging the rows of a dataframe in ascending or descending order based on one or more columns or the index. You can sort by column values, by index, or by multiple criteria.
- You can use the square brackets notation to filter by column values using boolean expressions.
- You can use the sort_values() method to sort by column values using the by and ascending parameters.
- You can use the sort_index() method to sort by index using the axis, level, and ascending parameters.
- You can use the shape attribute to check the number of rows and columns in a filtered or sorted dataframe.
- You can use the head() and tail() methods to get the first or last n rows of a filtered or sorted dataframe.
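The key points above can be combined into a single chain. The following sketch filters, sorts, and then takes the first rows in one expression; the salary threshold and the variable name result are chosen only for illustration.

```python
import pandas as pd

# Rebuild the tutorial's sample dataframe so this snippet is self-contained.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age": [25, 30, 35, 40, 45, 50],
    "gender": ["F", "M", "M", "M", "F", "M"],
    "country": ["USA", "UK", "Canada", "Australia", "Germany", "France"],
    "salary": [4000, 5000, 6000, 7000, 8000, 9000],
})

result = (
    df[df["salary"] > 5000]                                    # filter rows
    .sort_values(["gender", "age"], ascending=[True, False])   # sort the rest
    .head(3)                                                   # keep the first 3
)
print(result["name"].tolist())  # ['Eve', 'Frank', 'David']
print(result.shape)             # (3, 5)
```

Wrapping the chain in parentheses lets each step sit on its own line, which can make longer pipelines easier to follow.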
We hope you enjoyed this blog and learned something new and useful. If you want to learn more about pandas and data analysis, you can check out the following resources:
- Pandas Documentation: The official documentation of pandas, with tutorials, user guides, and API reference.
- Pandas Course on Kaggle: A free online course on pandas, with interactive exercises and notebooks.
- Data Manipulation with pandas on DataCamp: A paid online course on pandas, with video lessons and coding challenges.
Thank you for reading this blog and happy coding!