1. Introduction
In this tutorial, you will learn how to use pandas methods to group and aggregate dataframes based on columns or categories. Grouping and aggregating dataframes are powerful techniques that allow you to perform various operations on subsets of data and obtain summary statistics or insights.
Some of the questions that you can answer with grouping and aggregating dataframes are:
- What is the average salary of employees by department?
- How many products are sold by each category?
- What is the correlation between age and income by gender?
To answer these questions, you need to use two main pandas methods: groupby and agg. These methods allow you to split a dataframe into groups based on one or more columns, apply one or more functions to each group, and combine the results into a new dataframe.
Another pandas method that you can use to aggregate dataframes is pivot_table. This method allows you to create a spreadsheet-style table that summarizes data by different categories and values. You can also add margins and custom functions to the pivot table to enhance the analysis.
By the end of this tutorial, you will be able to use the groupby, agg, and pivot_table methods to group and aggregate dataframes in various ways. The examples use a small sample dataframe of employee records.
Let’s get started!
2. Grouping dataframes with groupby
One of the most common and useful operations that you can perform on dataframes is grouping. Grouping allows you to split a dataframe into smaller subsets based on one or more columns or categories. For example, you can group a dataframe of employees by their department, or a dataframe of products by their category.
Once you have grouped a dataframe, you can apply various functions to each group and obtain summary statistics or insights. For example, you can calculate the mean, median, sum, count, or standard deviation of each group, or apply custom functions that suit your needs.
To group a dataframe, you use the groupby method. The groupby method takes one or more column names (or other grouping keys) as arguments and returns a groupby object (a DataFrameGroupBy). This object does not contain a result yet; it stores how the rows are split into groups and allows you to perform operations on them.
In this section, you will learn how to use the groupby method to group dataframes by different criteria and apply various functions to the groups. You will also learn how to use the agg method to apply multiple functions to each group and obtain a new dataframe with the results.
Let’s start by importing pandas and creating a sample dataframe to work with.
    # Import pandas
    import pandas as pd

    # Create a sample dataframe
    df = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
        'age': [25, 30, 35, 40, 45, 50],
        'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
        'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
        'salary': [4000, 5000, 6000, 7000, 8000, 9000]
    })

    # Print the dataframe
    print(df)
The output of the code is:
          name  age gender department  salary
    0    Alice   25      F         HR    4000
    1      Bob   30      M         IT    5000
    2  Charlie   35      M         IT    6000
    3    David   40      M      Sales    7000
    4      Eve   45      F         HR    8000
    5    Frank   50      M      Sales    9000
This dataframe contains information about six employees, such as their name, age, gender, department, and salary. We can use this dataframe to demonstrate how to group and aggregate dataframes with pandas.
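As a first taste, the introduction's question about the average salary by department can be answered in one line. This is a minimal sketch that rebuilds only the columns it needs so that it runs on its own; the variable name avg_salary is purely illustrative.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

# Split by department, select the salary column, and average each group
avg_salary = df.groupby('department')['salary'].mean()
print(avg_salary)
```

The result is a Series indexed by department. The following sections build up to this step by step.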
2.1. Basic syntax and examples of groupby
The basic syntax of the groupby method is as follows:
    # Group a dataframe by one or more columns
    grouped_df = df.groupby(by=['column1', 'column2', ...])

    # Group a dataframe by a single categorical column
    grouped_df = df.groupby(by='category')

The by argument can be a single column name or a list of column names; typically these are columns that hold categories, such as department or gender. The names column1, column2, and category above are placeholders for your own columns. As described earlier, the groupby method returns a groupby object that contains information about the groups and allows you to perform operations on them.
To see the groups that are created by the groupby method, you can use the groups attribute. This attribute returns a dictionary that maps each group name to the corresponding row labels. For example, if you group the sample dataframe by the department column, you can see the groups as follows:
    # Group the dataframe by the department column
    grouped_df = df.groupby(by='department')

    # See the groups
    print(grouped_df.groups)
The output of the code is:
{'HR': [0, 4], 'IT': [1, 2], 'Sales': [3, 5]}
This means that there are three groups: HR, IT, and Sales. The HR group contains the rows with labels 0 and 4, the IT group contains the rows with labels 1 and 2, and the Sales group contains the rows with labels 3 and 5.
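Besides the groups attribute, a groupby object can be inspected by iterating over it, and its size method reports how many rows each group has. The following is a small sketch under the assumption that the grouping key is a single column name; the variable names are illustrative.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales']
})

grouped_df = df.groupby('department')

# Number of distinct groups
print(grouped_df.ngroups)

# Iterating yields (group name, sub-dataframe) pairs
for dept, group in grouped_df:
    print(dept, list(group['name']))

# Number of rows in each group
sizes = grouped_df.size()
print(sizes)
```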
To access a specific group, you can use the get_group method. This method takes the group name as an argument and returns a dataframe that contains only the rows that belong to that group. For example, if you want to access the IT group, you can do the following:
    # Access the IT group
    it_group = grouped_df.get_group('IT')

    # Print the IT group
    print(it_group)
The output of the code is:
          name  age gender department  salary
    1      Bob   30      M         IT    5000
    2  Charlie   35      M         IT    6000
This is a dataframe that contains only the rows that have IT as the department value.
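One way to build intuition for get_group is to compare it with an ordinary boolean filter, which should select the same rows. This is a sketch for illustration only; the name it_filtered is not part of the tutorial.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

# Rows of the IT group via the groupby object
it_group = df.groupby('department').get_group('IT')

# The same rows selected with a boolean mask
it_filtered = df[df['department'] == 'IT']

print(it_group)
print(it_filtered)
```

The filter is simpler when you need a single group; get_group is convenient when you already have a groupby object and want to look inside it.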
Now that you know how to group a dataframe and access the groups, you can apply various functions to each group and obtain summary statistics or insights. In the next section, you will learn how to use the agg method to apply multiple functions to each group and obtain a new dataframe with the results.
2.2. Applying multiple functions with agg
One of the advantages of using the groupby method is that you can apply multiple functions to each group and obtain a new dataframe with the results. For example, you can calculate the mean, median, sum, count, or standard deviation of each group, or apply custom functions that suit your needs.
To apply multiple functions to each group, you need to use the agg method. The agg method takes one or more functions as arguments and returns a new dataframe with the aggregated values. You can pass the functions as strings, lists, dictionaries, or lambda expressions.
Let’s see some examples of how to use the agg method with the sample dataframe that we grouped by the department column in the previous section.
If you want to apply a single function to each group, you can pass the function name as a string. For example, if you want to calculate the mean age and salary of each department, you can do the following. The numeric columns are selected first, because pandas 2.0 and later raise an error when asked to average text columns such as name:

    # Group the dataframe by the department column
    grouped_df = df.groupby(by='department')

    # Apply the mean function to the numeric columns of each group
    mean_df = grouped_df[['age', 'salary']].agg('mean')

    # Print the new dataframe
    print(mean_df)

The output of the code is:

                 age  salary
    department              
    HR          35.0  6000.0
    IT          32.5  5500.0
    Sales       45.0  8000.0

This is a new dataframe that contains the mean age and salary of each department. Note that the department column becomes the index of the new dataframe.
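If you would rather keep department as a regular column instead of the index, two common options are the as_index=False argument of groupby and the reset_index method. The sketch below shows both; the variable names are illustrative.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

# Option 1: ask groupby not to move the key into the index
mean_flat = df.groupby('department', as_index=False)[['age', 'salary']].mean()

# Option 2: aggregate as usual, then move the index back into a column
mean_reset = df.groupby('department')[['age', 'salary']].mean().reset_index()

print(mean_flat)
print(mean_reset)
```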
If you want to apply multiple functions to each group, you can pass a list of function names as strings. For example, if you want to calculate the mean, median, and standard deviation of the age and salary of each department, you can do the following:

    # Group the dataframe by the department column
    grouped_df = df.groupby(by='department')

    # Apply a list of functions to the numeric columns of each group
    multi_df = grouped_df[['age', 'salary']].agg(['mean', 'median', 'std'])

    # Print the new dataframe
    print(multi_df)

The output of the code is:

                 age                     salary                     
                mean median        std     mean  median          std
    department                                                      
    HR          35.0   35.0  14.142136   6000.0  6000.0  2828.427125
    IT          32.5   32.5   3.535534   5500.0  5500.0   707.106781
    Sales       45.0   45.0   7.071068   8000.0  8000.0  1414.213562

This is a new dataframe that contains the mean, median, and standard deviation of the age and salary of each department. Note that the columns of the new dataframe are hierarchical with two levels: the original column name and the function name. The department is still the row index.
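Hierarchical columns can look intimidating at first. The sketch below shows three ways to work with them: selecting one (column, function) pair with a tuple, selecting all statistics of one column, and flattening the two levels into single names. The joined names such as salary_mean are an illustrative convention, not something pandas produces by itself.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

multi_df = df.groupby('department')[['age', 'salary']].agg(['mean', 'median', 'std'])

# One statistic of one column: index with a (column, function) tuple
salary_mean = multi_df[('salary', 'mean')]

# All statistics of one column: index with the top-level name
salary_stats = multi_df['salary']

# Flatten the two column levels into names like 'salary_mean'
flat = multi_df.copy()
flat.columns = ['_'.join(col) for col in flat.columns]

print(salary_mean)
print(salary_stats)
print(flat.columns.tolist())
```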
If you want to apply different functions to different columns, you can pass a dictionary that maps the column names to the function names as strings. For example, if you want to calculate the mean age and the sum salary of each department, you can do the following:
    # Group the dataframe by the department column
    grouped_df = df.groupby(by='department')

    # Apply a dictionary of functions to each group
    dict_df = grouped_df.agg({'age': 'mean', 'salary': 'sum'})

    # Print the new dataframe
    print(dict_df)
The output of the code is:
                 age  salary
    department              
    HR          35.0   12000
    IT          32.5   11000
    Sales       45.0   16000
This is a new dataframe that contains the mean age and the sum salary of each department. Note that the new dataframe has only the columns that are specified in the dictionary.
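A related option is named aggregation, where each keyword argument to agg names an output column and maps it to a (column, function) pair. The sketch below assumes a pandas version that supports this syntax (0.25 or later); the output names mean_age and total_salary are illustrative.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

# Each keyword becomes a column name in the result
summary = df.groupby('department').agg(
    mean_age=('age', 'mean'),
    total_salary=('salary', 'sum'),
)
print(summary)
```

Compared with the dictionary form, this makes the meaning of each result column explicit in its name.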
If you want to apply custom functions to each group, you can pass lambda expressions as arguments. For example, if you want to calculate the difference between the maximum and minimum age and salary of each department, you can do the following. The numeric columns are selected first, because subtraction is not defined for text columns such as name:

    # Group the dataframe by the department column
    grouped_df = df.groupby(by='department')

    # Apply a lambda function to the numeric columns of each group
    lambda_df = grouped_df[['age', 'salary']].agg(lambda x: x.max() - x.min())

    # Print the new dataframe
    print(lambda_df)

The output of the code is:

                age  salary
    department             
    HR           20    4000
    IT            5    1000
    Sales        10    2000
This is a new dataframe that contains the difference between the maximum and minimum age and salary of each department. Note that the lambda function is applied to each column of each group.
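A lambda has no useful name, which becomes awkward when it is mixed with other functions in a list. One alternative is a small named function, whose name is then used as the column label. This is a sketch; value_range is a hypothetical helper introduced only for this example.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

def value_range(x):
    """Difference between the largest and smallest value of a group."""
    return x.max() - x.min()

# Mix built-in function names with the custom function
stats = df.groupby('department')['salary'].agg(['min', 'max', value_range])
print(stats)
```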
As you can see, the agg method is very flexible and powerful, as it allows you to apply multiple functions to each group and obtain a new dataframe with the results. In the next section, you will learn how to group dataframes by multiple columns and levels.
2.3. Grouping by multiple columns and levels
Sometimes, you may want to group a dataframe by more than one column or level. For example, you may want to group a dataframe of employees by both their department and gender, or a dataframe of products by both their category and subcategory.
To group a dataframe by multiple columns or levels, you can pass a list of column names or level names to the by argument of the groupby method. For example, if you want to group the sample dataframe by both the department and gender columns, you can do the following:
    # Group the dataframe by the department and gender columns
    grouped_df = df.groupby(by=['department', 'gender'])

    # See the groups
    print(grouped_df.groups)
The output of the code is:
{('HR', 'F'): [0, 4], ('IT', 'M'): [1, 2], ('Sales', 'M'): [3, 5]}
This means that there are three groups: HR and F, IT and M, and Sales and M. The HR and F group contains the rows with labels 0 and 4, the IT and M group contains the rows with labels 1 and 2, and the Sales and M group contains the rows with labels 3 and 5.
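Only combinations that actually occur in the data become groups, which is why there is no ('HR', 'M') group. A quick way to see the full grid is to count rows per group and unstack one key into columns, filling absent combinations with 0. The sketch below uses illustrative variable names.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales']
})

# Rows per (department, gender) combination that exists in the data
counts = df.groupby(['department', 'gender']).size()
print(counts)

# Spread gender across columns; absent combinations become 0
table = counts.unstack(fill_value=0)
print(table)
```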
To access a specific group, you can use the get_group method with a tuple of the group names as an argument. For example, if you want to access the HR and F group, you can do the following:
    # Access the HR and F group
    hr_f_group = grouped_df.get_group(('HR', 'F'))

    # Print the HR and F group
    print(hr_f_group)
The output of the code is:
        name  age gender department  salary
    0  Alice   25      F         HR    4000
    4    Eve   45      F         HR    8000
This is a dataframe that contains only the rows that have HR as the department value and F as the gender value.
Once you have grouped a dataframe by multiple columns or levels, you can aggregate it with the agg method exactly as in the previous section. The only difference is that the row index of the new dataframe is hierarchical, with one level for each grouping column.
For example, if you want to calculate the mean, median, and standard deviation of the salary of each department and gender group, you can do the following:
    # Group the dataframe by the department and gender columns
    grouped_df = df.groupby(by=['department', 'gender'])

    # Apply a list of functions to the salary column of each group
    multi_df = grouped_df[['salary']].agg(['mean', 'median', 'std'])

    # Print the new dataframe
    print(multi_df)

The output of the code is:

                       salary                     
                         mean  median          std
    department gender                             
    HR         F       6000.0  6000.0  2828.427125
    IT         M       5500.0  5500.0   707.106781
    Sales      M       8000.0  8000.0  1414.213562

This is a new dataframe that contains the mean, median, and standard deviation of the salary of each department and gender group. Note that the rows have a hierarchical index with two levels (department and gender), and the columns also have two levels (the column name and the function name).
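Rows of a result with a hierarchical index can be looked up with a tuple of keys, and reset_index turns the index levels back into ordinary columns. The following sketch selects the salary column as a Series first, so the result has plain column names; the variable names are illustrative.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

stats = df.groupby(['department', 'gender'])['salary'].agg(['mean', 'median', 'std'])

# Look up one group with a (department, gender) tuple
hr_f = stats.loc[('HR', 'F')]
print(hr_f)

# Turn both index levels back into regular columns
flat = stats.reset_index()
print(flat)
```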
As you can see, grouping by multiple columns or levels allows you to perform more granular and complex analysis on dataframes. In the next section, you will learn how to aggregate dataframes with pivot_table.
3. Aggregating dataframes with pivot_table
Another way to aggregate dataframes is to use the pivot_table method. The pivot_table method allows you to create a spreadsheet-style table that summarizes data by different categories and values. You can also add margins and custom functions to the pivot table to enhance the analysis.
The basic syntax of the pivot_table method is as follows:
    # Create a pivot table from a dataframe
    pivot_df = df.pivot_table(values='column1',
                              index='column2',
                              columns='column3',
                              aggfunc='function',
                              margins=True,
                              margins_name='name',
                              fill_value=0)
The arguments of the pivot_table method are:
- values: the column or columns that you want to aggregate.
- index: the column or columns that you want to use as the row labels of the pivot table.
- columns: the column or columns that you want to use as the column labels of the pivot table.
- aggfunc: the function or functions that you want to apply to the values. You can pass a single function name as a string, a list of function names as strings, or a dictionary that maps the column names to the function names as strings.
- margins: a boolean value that indicates whether to add an extra row and column that apply aggfunc across all rows and columns (totals when aggfunc is sum, overall means when it is mean).
- margins_name: the name that you want to use for that extra row and column.
- fill_value: the value that you want to use to fill the missing or NaN values in the pivot table.
Let’s see some examples of how to use the pivot_table method with the sample dataframe that we created in the previous sections.
If you want to create a simple pivot table that shows the mean salary of each department and gender, you can do the following:
    # Create a simple pivot table
    simple_pivot = df.pivot_table(values='salary', index='department',
                                  columns='gender', aggfunc='mean')

    # Print the pivot table
    print(simple_pivot)
The output of the code is:
    gender           F       M
    department                
    HR          6000.0     NaN
    IT             NaN  5500.0
    Sales          NaN  8000.0
This is a pivot table that shows the mean salary of each department and gender. Note that the department column becomes the index of the pivot table, and the gender column becomes the columns of the pivot table. Also note that there are some missing or NaN values in the pivot table, because some combinations of department and gender do not exist in the original dataframe.
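It can help to see how pivot_table relates to the groupby method from the previous section. For this example, grouping by both columns, averaging, and then unstacking gender into columns should give the same table. This is a sketch of that relationship rather than a description of how pandas implements pivot_table internally.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

# Pivot table of mean salary by department (rows) and gender (columns)
pivot = df.pivot_table(values='salary', index='department',
                       columns='gender', aggfunc='mean')

# The same summary built by hand: group, aggregate, then unstack gender
manual = df.groupby(['department', 'gender'])['salary'].mean().unstack('gender')

print(pivot)
print(manual)
```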
If you want to create a more complex pivot table that shows the mean, median, and standard deviation of the salary and age of each department and gender, you can do the following:
    # Create a complex pivot table
    complex_pivot = df.pivot_table(values=['salary', 'age'], index='department',
                                   columns='gender',
                                   aggfunc=['mean', 'median', 'std'])

    # Print the pivot table
    print(complex_pivot)
The output of the code is shown below. The table has twelve columns and pandas wraps wide tables when printing, so it is shown here in three blocks, one per function:

                     mean                                 
                      age              salary             
    gender              F        M          F        M
    department                                        
    HR               35.0      NaN     6000.0      NaN
    IT                NaN     32.5        NaN   5500.0
    Sales             NaN     45.0        NaN   8000.0

                   median                                 
                      age              salary             
    gender              F        M          F        M
    department                                        
    HR               35.0      NaN     6000.0      NaN
    IT                NaN     32.5        NaN   5500.0
    Sales             NaN     45.0        NaN   8000.0

                      std                                         
                      age                   salary                
    gender              F         M              F             M
    department                                                  
    HR          14.142136       NaN    2828.427125           NaN
    IT                NaN  3.535534            NaN    707.106781
    Sales             NaN  7.071068            NaN   1414.213562

This is a pivot table that shows the mean, median, and standard deviation of the salary and age of each department and gender. Note that the columns are hierarchical with three levels: the function, the value column, and the gender. The row index is still the department.
If you want to add row and column totals to the pivot table, you can set the margins argument to True and specify the margins_name argument. For example, if you want to add row and column totals with the name ‘Total’, you can do the following:
    # Create a pivot table with margins
    margin_pivot = df.pivot_table(values='salary', index='department',
                                  columns='gender', aggfunc='mean',
                                  margins=True, margins_name='Total')

    # Print the pivot table
    print(margin_pivot)
The output of the code is:
    gender           F       M   Total
    department                        
    HR          6000.0     NaN  6000.0
    IT             NaN  5500.0  5500.0
    Sales          NaN  8000.0  8000.0
    Total       6000.0  6750.0  6500.0

This is a pivot table that shows the mean salary of each department and gender, along with an extra row and column named ‘Total’. Because aggfunc is mean, these margins contain means rather than sums: the column shows the mean salary of each department, the row shows the mean salary of each gender (6750 for the four male employees), and the bottom-right cell shows the overall mean salary of 6500.
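The margin values can be cross-checked against direct calculations on the original dataframe. The following sketch compares the margin row and corner cell with a groupby on gender and a plain mean; the variable names are illustrative.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

margin_pivot = df.pivot_table(values='salary', index='department',
                              columns='gender', aggfunc='mean',
                              margins=True, margins_name='Total')

# Direct calculations to compare against the margins
by_gender = df.groupby('gender')['salary'].mean()
overall = df['salary'].mean()

print(margin_pivot.loc['Total'])
print(by_gender)
print(overall)
```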
If you want to fill the missing or NaN values in the pivot table with a specific value, you can set the fill_value argument to that value. For example, if you want to fill the missing or NaN values with 0, you can do the following:
    # Create a pivot table with fill value
    fill_pivot = df.pivot_table(values='salary', index='department',
                                columns='gender', aggfunc='mean',
                                fill_value=0)

    # Print the pivot table
    print(fill_pivot)
The output of the code is:
    gender         F     M
    department            
    HR          6000     0
    IT             0  5500
    Sales          0  8000

This is a pivot table that shows the mean salary of each department and gender, with the missing or NaN values filled with 0. Note that the pivot table does not have any missing or NaN values anymore. Depending on your pandas version, the numbers may be displayed as floats (for example 6000.0 and 0.0) instead of integers.
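Filling with 0 is most natural when the aggregated quantity is a count, because a missing combination really does mean zero employees, whereas a mean salary of 0 could be misread. The sketch below builds a headcount table this way; the name headcount is illustrative.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales']
})

# Count employees per department and gender; absent combinations become 0
headcount = df.pivot_table(values='name', index='department',
                           columns='gender', aggfunc='count',
                           fill_value=0)
print(headcount)
```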
As you can see, the pivot_table method is very flexible and powerful, as it allows you to create a spreadsheet-style table that summarizes data by different categories and values. The following subsections look more closely at the syntax and defaults, at the values, index, and columns arguments, and at margins and custom functions.
3.1. Basic syntax and examples of pivot_table
This subsection takes a closer look at the syntax of the pivot_table method and the default values of its arguments. You can think of a pivot table as a way to rearrange and reorganize data in a more meaningful and readable way.
The basic syntax of the pivot_table method is:
df.pivot_table(values, index, columns, aggfunc, margins, margins_name, fill_value, dropna)
The arguments of the pivot_table method are:
- values: the column or columns that you want to aggregate.
- index: the column or columns that you want to use as the row labels of the pivot table.
- columns: the column or columns that you want to use as the column labels of the pivot table.
- aggfunc: the function or functions that you want to apply to the values. The default is mean.
- margins: a boolean value that indicates whether to add an extra row and column that apply aggfunc across all rows and columns (totals when aggfunc is sum, overall means when it is mean). The default is False.
- margins_name: the name of that extra row and column. The default is “All”.
- fill_value: the value that you want to use to fill the missing values in the pivot table. The default is None.
- dropna: a boolean value that indicates whether to drop the columns that contain only missing values. The default is True.
To demonstrate how to use the pivot_table method, let’s use the same sample dataframe that we used in the previous section.
    # Import pandas
    import pandas as pd

    # Create a sample dataframe
    df = pd.DataFrame({
        'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
        'age': [25, 30, 35, 40, 45, 50],
        'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
        'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
        'salary': [4000, 5000, 6000, 7000, 8000, 9000]
    })

    # Print the dataframe
    print(df)
The output of the code is:
          name  age gender department  salary
    0    Alice   25      F         HR    4000
    1      Bob   30      M         IT    5000
    2  Charlie   35      M         IT    6000
    3    David   40      M      Sales    7000
    4      Eve   45      F         HR    8000
    5    Frank   50      M      Sales    9000
Let’s say we want to create a pivot table that shows the average salary of employees by department and gender. We can use the pivot_table method as follows:
    # Create a pivot table (aggfunc defaults to mean)
    pivot = df.pivot_table(values='salary', index='department', columns='gender')

    # Print the pivot table
    print(pivot)
The output of the code is:
    gender           F       M
    department                
    HR          6000.0     NaN
    IT             NaN  5500.0
    Sales          NaN  8000.0
As you can see, the pivot table has the department as the row labels, the gender as the column labels, and the average salary as the values. The missing values indicate that there are no employees of that gender in that department.
You can also use multiple columns or functions for the values, index, or columns arguments. For example, if you want to see the average and the sum of the salary by department and gender, you can use the aggfunc argument to specify a list of functions:
    # Create a pivot table with multiple functions
    pivot = df.pivot_table(values='salary', index='department',
                           columns='gender', aggfunc=['mean', 'sum'])

    # Print the pivot table
    print(pivot)
The output of the code is:
                  mean               sum          
    gender           F       M         F         M
    department                                    
    HR          6000.0     NaN   12000.0       NaN
    IT             NaN  5500.0       NaN   11000.0
    Sales          NaN  8000.0       NaN   16000.0
Now, the pivot table has two levels of column labels: the function and the gender. The values are the mean and the sum of the salary for each group.
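With two levels of column labels, you can pull out the sub-table of one function by its top-level name, or a single cell with a (function, gender) tuple. The sketch below illustrates both; the variable names are illustrative.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

pivot = df.pivot_table(values='salary', index='department',
                       columns='gender', aggfunc=['mean', 'sum'])

# Sub-table that holds only the sums (columns are the genders)
sums = pivot['sum']
print(sums)

# A single cell: mean salary of women in HR
hr_f_mean = pivot.loc['HR', ('mean', 'F')]
print(hr_f_mean)
```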
3.2. Specifying values, index, and columns
In the previous section, you learned how to use the basic syntax of the pivot_table method to create a simple pivot table. In this section, you will learn how to specify different values, index, and columns for the pivot table and how they affect the output.
As you may recall, the values argument of the pivot_table method determines the column or columns that you want to aggregate. You can use a single column name or a list of column names. For example, if you want to see the average and the sum of the salary and the age by department and gender, you can use a list of column names for the values argument:
    # Create a pivot table with multiple values
    pivot = df.pivot_table(values=['salary', 'age'], index='department',
                           columns='gender', aggfunc=['mean', 'sum'])

    # Print the pivot table
    print(pivot)
The output of the code is:
                   mean                                 sum                           
                    age            salary               age            salary         
    gender            F        M        F        M        F        M        F        M
    department                                                                        
    HR             35.0      NaN   6000.0      NaN     70.0      NaN  12000.0      NaN
    IT              NaN     32.5      NaN   5500.0      NaN     65.0      NaN  11000.0
    Sales           NaN     45.0      NaN   8000.0      NaN     90.0      NaN  16000.0

Now, the pivot table has three levels of column labels: the function, the value, and the gender. The values are the mean and the sum of the salary and the age for each group.
To apply different functions to different columns, pass a dictionary of column names and functions to the aggfunc argument (the values argument itself does not accept a dictionary). For example, if you want to see the mean of the salary, the sum of the age, and the count of the name by department and gender, you can do the following:

    # Create a pivot table with a dictionary of functions
    pivot = df.pivot_table(values=['salary', 'age', 'name'], index='department',
                           columns='gender',
                           aggfunc={'salary': 'mean', 'age': 'sum', 'name': 'count'})

    # Print the pivot table
    print(pivot)

The output of the code is:

                 age        name       salary        
    gender         F     M     F    M       F       M
    department                                       
    HR          70.0   NaN   2.0  NaN  6000.0     NaN
    IT           NaN  65.0   NaN  2.0     NaN  5500.0
    Sales        NaN  90.0   NaN  2.0     NaN  8000.0

Now, the pivot table has two levels of column labels: the value and the gender. Each value column uses its own function: the sum for the age, the count for the name, and the mean for the salary. Combinations of department and gender that do not occur in the data appear as NaN.
The index argument of the pivot_table method determines the column or columns that you want to use as the row labels of the pivot table. You can use a single column name, a list of column names, or an array with the same length as the data. For example, if you want to use the name and the department as the row labels of the pivot table, you can use a list of column names for the index argument:
    # Create a pivot table with multiple index
    pivot = df.pivot_table(values='salary', index=['name', 'department'],
                           columns='gender')

    # Print the pivot table
    print(pivot)
The output of the code is:
    gender                   F       M
    name    department                
    Alice   HR          4000.0     NaN
    Bob     IT             NaN  5000.0
    Charlie IT             NaN  6000.0
    David   Sales          NaN  7000.0
    Eve     HR          8000.0     NaN
    Frank   Sales          NaN  9000.0
Now, the pivot table has two levels of row labels: the name and the department. The values are the salary for each group.
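The columns argument is optional. When only index is given, the pivot table stays in a long format with one row per combination of index values, which should contain the same numbers as a groupby on those columns. The sketch below compares the two on department and gender; the variable names are illustrative.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

# Pivot table with two row levels and no column labels
pivot = df.pivot_table(values='salary', index=['department', 'gender'],
                       aggfunc='mean')

# The corresponding groupby aggregation
grouped = df.groupby(['department', 'gender'])[['salary']].mean()

print(pivot)
print(grouped)
```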
The columns argument of the pivot_table method determines the column or columns that you want to use as the column labels of the pivot table. You can use a single column name, a list of column names, or an array with the same length as the data. For example, if you want to use the gender and the age as the column labels of the pivot table, you can use a list of column names for the columns argument:
    # Create a pivot table with multiple columns
    pivot = df.pivot_table(values='salary', index='department',
                           columns=['gender', 'age'])

    # Print the pivot table
    print(pivot)
The output of the code is:
    gender           F               M                        
    age             25      45      30      35      40      50
    department                                                
    HR          4000.0  8000.0     NaN     NaN     NaN     NaN
    IT             NaN     NaN  5000.0  6000.0     NaN     NaN
    Sales          NaN     NaN     NaN     NaN  7000.0  9000.0
Now, the pivot table has two levels of column labels: the gender and the age. The values are the salary for each group.
As you can see, you can specify different values, index, and columns for the pivot table and create different views of the data. You can experiment with different combinations of arguments and see how they affect the output.
3.3. Adding margins and custom functions
In the previous section, you learned how to specify different values, index, and columns for the pivot table and how they affect the output. In this section, you will learn how to add margins and custom functions to the pivot table and how they enhance the analysis.
One of the arguments of the pivot_table method that you can use to add more information to the pivot table is margins. The margins argument is a boolean value that indicates whether to add an extra row and column that aggregate across all categories. The default is False, which means that no margins are added. If you set it to True, then a row and a column with the name “All” are added. They apply the same aggfunc to all the data in each row or column category, so with the default mean they show overall averages rather than sums.
For example, if you want to see the average salary of employees by department and gender, and also the total average salary for each department and gender, you can use the margins argument as follows:
    # Create a pivot table with margins
    pivot = df.pivot_table(values='salary', index='department',
                           columns='gender', margins=True)

    # Print the pivot table
    print(pivot)
The output of the code is:
    gender           F       M     All
    department                        
    HR          6000.0     NaN  6000.0
    IT             NaN  5500.0  5500.0
    Sales          NaN  8000.0  8000.0
    All         6000.0  6750.0  6500.0

As you can see, the pivot table has a row and a column with the name “All”. The column shows the average salary of each department, the row shows the average salary of each gender, and the bottom-right cell shows the overall average salary (6500). You can also change the name of the margins by using the margins_name argument, which takes a string as the name of the extra row and column. The default is “All”.
Another argument of the pivot_table method that you can use to add more functionality to the pivot table is aggfunc. The aggfunc argument determines the function or functions that you want to apply to the values. The default is mean, which means that the average of the values is calculated. However, you can also use other built-in functions, such as sum, min, max, count, std, or var, or you can use your own custom functions.
For example, if you want to see the average and the standard deviation of the salary by department and gender, you can use the aggfunc argument to specify a list of functions:
    # Create a pivot table with multiple functions
    pivot = df.pivot_table(values='salary', index='department',
                           columns='gender', aggfunc=['mean', 'std'])

    # Print the pivot table
    print(pivot)
The output of the code is:
                  mean                   std             
    gender           F       M             F            M
    department                                           
    HR          6000.0     NaN   2828.427125          NaN
    IT             NaN  5500.0           NaN   707.106781
    Sales          NaN  8000.0           NaN  1414.213562
Now, the pivot table has two levels of column labels: the function and the gender. The values are the mean and the standard deviation of the salary for each group.
You can also use your own custom functions for the aggfunc argument. For example, if you want to see the range of the salary by department and gender, you can define a function that calculates the difference between the maximum and the minimum of the values, and use it for the aggfunc argument:
    # Define a custom function
    # (named value_range so that it does not shadow Python's built-in range)
    def value_range(x):
        return x.max() - x.min()

    # Create a pivot table with a custom function
    pivot = df.pivot_table(values='salary', index='department',
                           columns='gender', aggfunc=value_range)

    # Print the pivot table
    print(pivot)
The output of the code is:
    gender           F       M
    department                
    HR          4000.0     NaN
    IT             NaN  1000.0
    Sales          NaN  2000.0
Now, the pivot table has the range of the salary for each group.
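A custom function can also be combined with built-in function names in a single aggfunc list. In that case the function's own name is used as the top-level column label. The sketch below assumes this naming behaviour and uses a hypothetical helper called value_range.

```python
import pandas as pd

# Subset of the tutorial's sample data (only the columns needed here)
df = pd.DataFrame({
    'gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'Sales'],
    'salary': [4000, 5000, 6000, 7000, 8000, 9000]
})

def value_range(x):
    """Difference between the largest and smallest value of a group."""
    return x.max() - x.min()

# Mean and range of the salary side by side
pivot = df.pivot_table(values='salary', index='department',
                       columns='gender', aggfunc=['mean', value_range])
print(pivot)
```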
As you can see, you can add margins and custom functions to the pivot table and enhance the analysis. You can experiment with different arguments and functions and see how they affect the output.
4. Conclusion
In this tutorial, you have learned how to use pandas methods to group and aggregate dataframes based on columns or categories. You have learned how to use the groupby method to split a dataframe into groups, apply one or more functions to each group, and combine the results into a new dataframe. You have also learned how to use the agg method to apply multiple functions to each group and obtain a new dataframe with the results.
You have also learned how to use the pivot_table method to create a spreadsheet-style table that summarizes data by different categories and values. You have learned how to specify different values, index, and columns for the pivot table and how they affect the output. You have also learned how to add margins and custom functions to the pivot table and enhance the analysis.
By using these methods, you can perform various operations on subsets of data and obtain summary statistics or insights. You can also create different views of the data and make it more meaningful and readable. These techniques are very useful for data analysis and exploration, as they allow you to answer different questions and discover patterns or trends in the data.
We hope you have enjoyed this tutorial and learned something new and useful. If you want to learn more about pandas and how to manipulate and analyze data with it, you can check out the official documentation or some of the other tutorials available online. Happy coding!