This blog will teach you the basics of pandas, a powerful Python library for data analysis. You will learn how to create a dataframe object from a list of dictionaries and how to access and manipulate its elements.
1. What is pandas?
Pandas is a popular Python library for data analysis and manipulation. It provides high-performance and easy-to-use data structures and tools for working with various types of data, such as tabular, time series, text, and more.
The name pandas is derived from “panel data”, an econometrics term for multidimensional structured data sets, and is also a play on the phrase “Python data analysis”. The library is built on top of another Python library called NumPy, which provides fast and efficient numerical computations. Pandas also integrates well with other Python libraries, such as matplotlib for visualization, scikit-learn for machine learning, and TensorFlow for deep learning.
One of the main features of pandas is the dataframe object, which is a two-dimensional, labeled data structure that can store different types of data in each column. A dataframe is similar to a spreadsheet or a database table, but it offers more functionality and flexibility for data analysis. You can think of a dataframe as a collection of series, which are one-dimensional arrays of data with labels.
In this tutorial, you will learn how to create a dataframe from a list of dictionaries, which is a common way of representing data in Python. You will also learn how to view and inspect the dataframe, and how to access and manipulate its elements. By the end of this tutorial, you will have a solid foundation for working with pandas and dataframes.
Are you ready to get started? Let’s dive in!
2. How to install pandas
Before you can start working with pandas and dataframes, you need to install the pandas library on your computer. There are different ways to install pandas, depending on your operating system and Python environment. In this section, we will show you how to install pandas using pip, which is a package manager for Python.
Pip allows you to install and manage Python packages from the Python Package Index (PyPI), which is a repository of open-source software. To use pip, you need to have Python installed on your computer. You can check if you have Python and pip by opening a terminal or command prompt and typing:
python --version
pip --version
If you see the version numbers of Python and pip, then you are good to go. If not, you need to install Python and pip first. You can find the instructions on how to do that on the official Python website.
Once you have Python and pip, you can install pandas by typing the following command in your terminal or command prompt:
pip install pandas
This will download and install the latest version of pandas and its dependencies from PyPI. You can also specify a specific version of pandas by adding == followed by the version number, for example:
pip install pandas==1.3.4
This will install pandas version 1.3.4, an older release used here only as an example. You can check the available versions of pandas on the PyPI page for pandas.
After installing pandas, you can verify that it works by importing it in a Python script or an interactive shell. To do that, type:
import pandas as pd
This will import the pandas library and assign it to the alias pd, which is a common convention among pandas users. You can also check the version of pandas by typing:
pd.__version__
This will print the version number of pandas that you have installed. If you see no errors, then congratulations, you have successfully installed pandas!
3. How to create a dataframe from a list of dictionaries
Now that you have installed pandas, you are ready to create your first dataframe. A dataframe is a two-dimensional, labeled data structure that can store different types of data in each column. You can create a dataframe from various sources of data, such as CSV files, Excel files, SQL databases, and more. In this section, you will learn how to create a dataframe from a list of dictionaries, which is a common way of representing data in Python.
A list of dictionaries is a collection of Python dictionaries, where each dictionary represents a row of data, and each key-value pair represents a column name and a cell value. For example, suppose you have the following list of dictionaries that contains some information about four students:
students = [
    {"name": "Alice", "age": 20, "gender": "F", "major": "Math"},
    {"name": "Bob", "age": 21, "gender": "M", "major": "CS"},
    {"name": "Charlie", "age": 19, "gender": "M", "major": "Physics"},
    {"name": "Diana", "age": 22, "gender": "F", "major": "Biology"}
]
This list of dictionaries can be easily converted to a dataframe using the pandas DataFrame constructor, which takes the list as an argument and returns a dataframe object. You can assign the dataframe to a variable and print it to see the result:
import pandas as pd          # import pandas library
df = pd.DataFrame(students)  # create dataframe from list of dictionaries
print(df)                    # print dataframe
The output should look something like this:
      name  age gender    major
0    Alice   20      F     Math
1      Bob   21      M       CS
2  Charlie   19      M  Physics
3    Diana   22      F  Biology
As you can see, the dataframe has four rows and four columns, corresponding to the four dictionaries and the four key-value pairs in each dictionary. The dataframe also has labels for the rows and columns, which are called the index and the columns, respectively. The index is a sequence of integers that identifies each row, starting from zero. The columns are the names of the keys in the dictionaries, which are also the names of the variables in the dataframe.
Creating a dataframe from a list of dictionaries is a simple and convenient way to store and manipulate data in pandas. However, there are some limitations and caveats that you should be aware of. For example, the dictionaries should have the same keys; otherwise the dataframe will have missing values. Also, the order of the columns in the dataframe may not match the order you want, unless you specify the order explicitly. The sketch below illustrates both points, and the next sections show how to customize the dataframe further.
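Here is a minimal sketch, with invented data, of both caveats: a dictionary with a missing key produces a NaN cell, and the columns argument of the DataFrame constructor fixes the column order:

import pandas as pd

records = [
    {"name": "Alice", "age": 20},
    {"name": "Bob"},                                    # the "age" key is missing in this row
]
df = pd.DataFrame(records, columns=["age", "name"])     # columns= sets the column order explicitly
print(df)

In this sketch, Bob's missing age shows up as NaN, and the age column becomes a floating-point type because an integer column cannot hold NaN.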
3.1. Define a list of dictionaries
The first step to create a dataframe from a list of dictionaries is to define the list of dictionaries. A list of dictionaries is a collection of Python dictionaries, where each dictionary represents a row of data, and each key-value pair represents a column name and a cell value. You can define a list of dictionaries using the square brackets [] and the curly braces {}, as shown below:
# define a list of dictionaries
list_of_dicts = [
    {"key1": "value1", "key2": "value2", "key3": "value3"},
    {"key1": "value4", "key2": "value5", "key3": "value6"},
    {"key1": "value7", "key2": "value8", "key3": "value9"}
]
In this example, the list of dictionaries has three dictionaries, each with three key-value pairs. The keys are “key1”, “key2”, and “key3”, and the values are “value1” to “value9”. You can use any names for the keys and any types of data for the values, as long as they are consistent across the dictionaries.
To define a list of dictionaries, you need to follow some rules and conventions:
- The list of dictionaries must be enclosed in square brackets [] and separated by commas.
- Each dictionary must be enclosed in curly braces {} and separated by commas.
- Each key must be separated from its value by a colon :, and string keys and values must be enclosed in quotes (single ' or double ").
- The order of the key-value pairs in each dictionary does not matter, as long as they have the same keys.
- The order of the dictionaries in the list does matter, as it determines the order of the rows in the dataframe.
Defining a list of dictionaries is a simple and intuitive way to represent data in Python. However, it can also be tedious and error-prone, especially if you have a large amount of data or complex data structures. In that case, you may want to build your list of dictionaries from another source of data, such as a file or a database, as in the sketch below.
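As one illustration, here is a brief sketch that builds a list of dictionaries from a CSV file with the standard library csv module; the file name students.csv is an assumption:

import csv

with open("students.csv", newline="") as f:     # hypothetical CSV file with a header row
    list_of_dicts = list(csv.DictReader(f))     # each row becomes a dict keyed by the header names

Note that csv.DictReader reads every value as a string, so you may still need to convert numeric columns afterwards.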
3.2. Convert the list to a dataframe
Once you have defined your list of dictionaries, you can easily convert it to a dataframe using the pandas DataFrame constructor. The DataFrame constructor is a function that takes a list of dictionaries as an argument and returns a dataframe object. You can import the DataFrame constructor from the pandas library using the following statement:
from pandas import DataFrame # import DataFrame constructor
Then, you can pass your list of dictionaries to the DataFrame constructor and assign the result to a variable. For example, if your list of dictionaries is called list_of_dicts, you can create a dataframe called df by typing:
df = DataFrame(list_of_dicts) # create dataframe from list of dictionaries
This will create a dataframe with the same number of rows and columns as your list of dictionaries, and with the same labels for the rows and columns. You can print the dataframe to see the result by typing:
print(df) # print dataframe
The output should look something like this:
     key1    key2    key3
0  value1  value2  value3
1  value4  value5  value6
2  value7  value8  value9
As you can see, the dataframe has three rows and three columns, corresponding to the three dictionaries and the three key-value pairs in each dictionary. The dataframe also has labels for the rows and columns, which are called the index and the columns, respectively. The index is a sequence of integers that identifies each row, starting from zero. The columns are the names of the keys in the dictionaries, which are also the names of the variables in the dataframe.
Converting a list of dictionaries to a dataframe is a quick and easy way to create a dataframe in pandas. However, you may want to customize some aspects of the dataframe, such as the order of the columns, the labels of the index, or the data types of the values. You can do that by passing some additional arguments to the DataFrame constructor, as in the sketch below.
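As a preview, here is a brief sketch of some of those arguments applied to list_of_dicts; the labels and the chosen data type are arbitrary:

import pandas as pd

df = pd.DataFrame(
    list_of_dicts,
    columns=["key3", "key1", "key2"],      # choose the column order explicitly
    index=["row_a", "row_b", "row_c"],     # replace the default integer index with custom labels
)
df = df.astype({"key2": "string"})         # adjust a column's data type afterwards
print(df)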
3.3. View and inspect the dataframe
After creating a dataframe from a list of dictionaries, you may want to view and inspect the dataframe to check if it has the correct data and structure. Pandas provides several methods and attributes that allow you to do that. In this section, you will learn how to use some of these methods and attributes to view and inspect your dataframe.
One of the simplest ways to view your dataframe is to print it using the print function. For example, if your dataframe is called df, you can print it by typing:
print(df) # print dataframe
This will display the dataframe in a tabular format, with the index on the left and the columns on the top. However, if your dataframe is too large, it may not fit on the screen and some rows or columns may be truncated. To avoid this, you can use the head and tail methods to view the first and last few rows of the dataframe, respectively. For example, you can view the first five rows of the dataframe by typing:
print(df.head()) # print first five rows of dataframe
Similarly, you can view the last five rows of the dataframe by typing:
print(df.tail()) # print last five rows of dataframe
You can also specify the number of rows you want to view by passing an integer argument to the head and tail methods. For example, you can view the first three rows of the dataframe by typing:
print(df.head(3)) # print first three rows of dataframe
To inspect the dataframe, you can use some attributes that provide information about it, such as its shape, size, index, columns, and data types. For example, the shape attribute returns a tuple of (rows, columns), so you can get the number of rows and columns of the dataframe by typing:
print(df.shape) # print shape of dataframe
This will print something like (4, 4), indicating that the dataframe has four rows and four columns. You can also get the number of elements in the dataframe by using the size attribute, which returns an integer. For example, you can get the size of the dataframe by typing:
print(df.size) # print size of dataframe
This will print 16, indicating that the dataframe has 16 elements in total. You can also get the labels of the rows and columns of the dataframe by using the index and columns attributes, which return index and column objects, respectively. For example, you can get the index and columns of the dataframe by typing:
print(df.index)   # print index of dataframe
print(df.columns) # print columns of dataframe
This will print something like RangeIndex(start=0, stop=4, step=1) and Index(['name', 'age', 'gender', 'major'], dtype='object'), indicating that the index is a sequence of integers from 0 to 3 and the columns are the names of the keys in the dictionaries. You can also get the data types of the values in the dataframe by using the dtypes attribute, which returns a series of data types for each column. For example, you can get the data types of the dataframe by typing:
print(df.dtypes) # print data types of dataframe
This will print something like:
name      object
age        int64
gender    object
major     object
dtype: object
This indicates that the name, gender, and major columns are of type object, which means they are strings, and the age column is of type int64, which means it is an integer.
Viewing and inspecting the dataframe is an important step to verify that the dataframe has the correct data and structure. It can also help you to identify any errors or inconsistencies in the data, such as missing values, wrong data types, or incorrect labels. You can use the methods and attributes that we have discussed in this section to view and inspect your dataframe, or you can explore other methods and attributes that pandas provides. You can find more information about them in the official pandas documentation.
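Two other commonly used helpers, shown here as a brief sketch on the students dataframe, are the info method, which summarizes the columns, non-null counts, and data types, and the describe method, which computes basic statistics for the numeric columns:

df.info()             # prints the column names, non-null counts, and data types
print(df.describe())  # prints count, mean, std, min, quartiles, and max for numeric columns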
4. How to access and manipulate dataframe elements
After creating and inspecting your dataframe, you may want to access and manipulate its elements to perform some data analysis and manipulation tasks. Pandas provides several methods and operators that allow you to do that. In this section, you will learn how to use some of these methods and operators to access and manipulate your dataframe elements.
One of the most common tasks is to select a subset of the dataframe based on the columns or rows that you are interested in. You can do that by using the loc and iloc methods, which allow you to select data by labels or positions, respectively. For example, if you want to select the name and major columns of the dataframe, you can use the loc method and pass the column names as a list:
print(df.loc[:, ["name", "major"]]) # print name and major columns of dataframe
The output should look something like this:
      name    major
0    Alice     Math
1      Bob       CS
2  Charlie  Physics
3    Diana  Biology
As you can see, the loc method takes two arguments, separated by a comma. The first argument is for the rows, and the second argument is for the columns. In this case, we used a colon : to indicate that we want all the rows, and a list of column names to indicate that we want only the name and major columns. You can also use the loc method to select data by row labels, such as the index values. For example, if you want to select the first and third rows of the dataframe, you can use the loc method and pass the index values as a list:
print(df.loc[[0, 2], :]) # print first and third rows of dataframe
The output should look something like this:
      name  age gender    major
0    Alice   20      F     Math
2  Charlie   19      M  Physics
In this case, we used a list of index values to indicate that we want only the first and third rows, and a colon : to indicate that we want all the columns. You can also use the loc method to select data by both row and column labels, by passing a list of lists or a slice object. For example, if you want to select the name and age columns of the first two rows of the dataframe, you can use the loc method and pass a slice object:
print(df.loc[:1, ["name", "age"]]) # print name and age columns of first two rows of dataframe
The output should look something like this:
    name  age
0  Alice   20
1    Bob   21
In this case, we used a slice object :1 to indicate that we want the rows from the beginning up to and including the first row, and a list of column names to indicate that we want only the name and age columns. Note that the slice object is inclusive of the end point, unlike the usual Python slicing syntax.
The loc method is useful when you want to select data by labels, but sometimes you may want to select data by positions, such as the row or column numbers. In that case, you can use the iloc method, which works similarly to the loc method, but takes integers as arguments. For example, if you want to select the second and fourth columns of the dataframe, you can use the iloc method and pass the column numbers as a list:
print(df.iloc[:, [1, 3]]) # print second and fourth columns of dataframe
The output should look something like this:
   age    major
0   20     Math
1   21       CS
2   19  Physics
3   22  Biology
As you can see, the iloc method takes two arguments, separated by a comma. The first argument is for the rows, and the second argument is for the columns. In this case, we used a colon : to indicate that we want all the rows, and a list of column numbers to indicate that we want only the second and fourth columns. Note that the column numbers start from zero, so the second column is 1 and the fourth column is 3. You can also use the iloc method to select data by row numbers, such as the first and last rows of the dataframe. For example, you can use the iloc method and pass the row numbers as a list:
print(df.iloc[[0, -1], :]) # print first and last rows of dataframe
The output should look something like this:
    name  age gender    major
0  Alice   20      F     Math
3  Diana   22      F  Biology
In this case, we used a list of row numbers to indicate that we want only the first and last rows, and a colon : to indicate that we want all the columns. Note that you can use negative numbers to indicate the positions from the end, so the last row is -1. You can also use the iloc method to select data by both row and column numbers, by passing a list of lists or a slice object. For example, if you want to select the first three rows and the first two columns of the dataframe, you can use the iloc method and pass a slice object:
print(df.iloc[:3, :2]) # print first three rows and first two columns of dataframe
The output should look something like this:
      name  age
0    Alice   20
1      Bob   21
2  Charlie   19
In this case, we used a slice object :3 to indicate that we want the rows from the beginning up to but not including position 3 (that is, the first three rows), and a slice object :2 to indicate that we want the columns from the beginning up to but not including position 2 (the first two columns). Note that with iloc the slice is exclusive of the end point, unlike the loc method.
The loc and iloc methods are powerful and flexible ways to select data from a dataframe. However, they are not the only ways to do that. You can also use other methods and operators, such as the dot notation, the bracket notation, the query method, the filter method, and the where method, to access and manipulate dataframe elements. We will cover some of these methods and operators in the next sections.
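As a quick sketch of the two simplest of these on the students dataframe, the bracket notation and the dot notation both select a single column as a series (the dot notation only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method):

print(df["major"])   # bracket notation: works for any column name
print(df.major)      # dot notation: shorthand for the same column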
4.1. Select columns and rows
One of the most common tasks in data analysis is to select specific columns and rows from a dataframe. This allows you to focus on the data that you are interested in and perform further operations on it. Pandas provides various methods and attributes for selecting columns and rows, such as loc, iloc, at, iat, and []. In this section, we will explain how to use these methods and attributes and what are the differences between them.
The first thing you need to know is that pandas uses two types of labels for identifying the columns and rows of a dataframe: index and columns. The index is the label for the rows, and the columns are the labels for the columns. You can see the index and columns of a dataframe by using the index and columns attributes, respectively. For example, if you have a dataframe called df, you can type:
df.index
df.columns
This will print the index and columns of the dataframe. By default, the index and columns are integers starting from 0, but you can also assign custom labels to them. For example, you can use the index and columns parameters when creating a dataframe to specify the labels, or you can use the set_index and rename methods to change the labels of an existing dataframe.
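For example, here is a brief sketch of both approaches on the students dataframe from section 3:

df2 = df.set_index("name")                    # use the name column as the row labels
df2 = df2.rename(columns={"major": "field"})  # change a column label
print(df2.index)
print(df2.columns)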
The second thing you need to know is that pandas uses two types of indexing for selecting columns and rows: positional and label-based. Positional indexing means that you use the integer position of the column or row to select it, while label-based indexing means that you use the label of the column or row to select it. For example, if you have a dataframe with three columns and five rows, you can use positional indexing to select the second column and the fourth row by using the integers 1 and 3, respectively. Alternatively, you can use label-based indexing to select the same column and row by using their labels, such as ‘B’ and ‘D’, respectively.
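As a brief sketch of the difference, assuming the hypothetical dataframe just described, with a column labeled ‘B’ in the second position and a row labeled ‘D’ in the fourth position:

print(df.iloc[3, 1])     # positional indexing: fourth row, second column
print(df.loc["D", "B"])  # label-based indexing: row labeled "D", column labeled "B"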
The third thing you need to know is that pandas provides different methods and attributes for selecting columns and rows based on whether you use positional or label-based indexing, and whether you want to select a single element or a range of elements. The following table summarizes the main methods and attributes for selecting columns and rows in pandas:
| Method/Attribute | Description | Example |
| --- | --- | --- |
| [] | Selects one or more columns by label, or a range of rows by position | df['A'] or df[1:3] |
| loc | Selects rows and columns by label, or by a boolean array | df.loc['A':'C', 'B':'D'] or df.loc[df['A'] > 10] |
| iloc | Selects rows and columns by position, or by a boolean array | df.iloc[0:2, 1:3] or df.iloc[[0, 2, 4], [1, 3]] |
| at | Selects a single element by label | df.at['A', 'B'] |
| iat | Selects a single element by position | df.iat[0, 1] |
In the next sections, we will show you how to use several of these methods and attributes in more detail and with examples; since at and iat do not appear again later, a short sketch of them follows below. You will learn how to select columns and rows from a dataframe in various ways and how to use the results for further analysis.
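Here is a brief sketch of at and iat on the students dataframe from section 3 (assuming the default integer index); both return a single cell rather than a sub-dataframe:

print(df.at[0, "age"])   # label-based: the cell at row label 0 and column "age"
print(df.iat[0, 1])      # position-based: the cell in the first row and second column
df.at[0, "age"] = 21     # at and iat can also be used to set a single value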
4.2. Filter and sort data
Another common task in data analysis is to filter and sort data based on certain criteria or preferences. This allows you to extract the data that meets your requirements and arrange it in a meaningful order. Pandas provides various methods and functions for filtering and sorting data, such as boolean indexing, query, where, mask, sort_values, and sort_index. In this section, we will explain how to use these methods and functions and what are the differences between them.
The first thing you need to know is that pandas uses boolean indexing as a powerful and flexible way of filtering data. Boolean indexing means that you use a boolean expression or array to select the rows or columns that satisfy a certain condition. For example, if you have a dataframe called df with a column called ‘A’, you can use boolean indexing to select the rows where the values in column ‘A’ are greater than 10 by typing:
df[df['A'] > 10]
This will return a new dataframe with only the rows that match the condition. You can also combine multiple conditions using logical operators, such as & (and), | (or), and ~ (not). For example, you can select the rows where the values in column ‘A’ are between 10 and 20 by typing:
df[(df['A'] > 10) & (df['A'] < 20)]
You can also select specific columns by passing a list of column names in brackets, or by passing a boolean array aligned with the columns to the loc method. For example, you can select the columns ‘A’ and ‘C’ by typing:
df[['A', 'C']]
Or you can select the columns that have a mean value greater than 10 by passing a boolean array to loc and typing:
df.loc[:, df.mean() > 10]
The second thing you need to know is that pandas provides the query method as an alternative way of filtering data using a string expression. The query method allows you to write the condition as a string, which can be more convenient and readable than using boolean indexing. For example, you can select the rows where the values in column ‘A’ are greater than 10 by typing:
df.query('A > 10')
This will produce the same result as using boolean indexing, but with a simpler syntax. You can also use the query method to combine multiple conditions using logical operators, such as and, or, and not. For example, you can select the rows where the values in column ‘A’ are between 10 and 20 by typing:
df.query('10 < A < 20')
The query method also supports some advanced features, such as using variables, arithmetic operations, and index values in the condition. You can find more details and examples on the official documentation page for the query method.
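For instance, here is a minimal sketch of referencing a local Python variable inside the expression with the @ prefix (the variable name and threshold are arbitrary):

min_value = 10
df.query('A > @min_value')   # the @ prefix lets the expression use a local Python variable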
The third thing you need to know is that pandas provides the where and mask methods as another way of filtering data using a boolean condition. The where and mask methods are similar to boolean indexing, but they have some differences in how they handle the rows or columns that do not match the condition. The where method replaces the values that do not match the condition with NaN (not a number), while the mask method replaces the values that match the condition with NaN. For example, if you have a dataframe called df with a column called ‘A’, you can use the where method to select the rows where the values in column ‘A’ are greater than 10 by typing:
df.where(df['A'] > 10)
This will return a new dataframe with the same shape as the original dataframe, but with NaN values in the rows that do not match the condition. You can also use the mask method to select the rows where the values in column ‘A’ are less than or equal to 10 by typing:
df.mask(df['A'] > 10)
This will return a new dataframe with the same shape as the original dataframe, but with NaN values in the rows that match the condition. You can also use the inplace parameter to modify the original dataframe instead of returning a new one, and the other parameter to specify a different value to replace the values that do not match the condition. You can find more details and examples on the official documentation page for the where method and the official documentation page for the mask method.
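For example, here is a minimal sketch of the other parameter, filling the rows that do not match the condition with 0 instead of NaN:

df.where(df['A'] > 10, other=0)   # rows where 'A' is not greater than 10 are filled with 0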
The fourth thing you need to know is that pandas provides the sort_values and sort_index methods for sorting data based on the values or the labels of the columns and rows. The sort_values method allows you to sort the data by one or more columns, in ascending or descending order, and with different options for handling missing values. For example, if you have a dataframe called df with columns called ‘A’, ‘B’, and ‘C’, you can sort the data by column ‘A’ in ascending order by typing:
df.sort_values(by='A')
This will return a new dataframe with the rows sorted by the values in column ‘A’. You can also sort the data by multiple columns by passing a list of column names to the by parameter, and specify the order for each column by passing a list of booleans to the ascending parameter. For example, you can sort the data by columns ‘A’ and ‘B’, in ascending order for ‘A’ and descending order for ‘B’, by typing:
df.sort_values(by=['A', 'B'], ascending=[True, False])
This will return a new dataframe with the rows sorted by the values in columns ‘A’ and ‘B’, with ties in ‘A’ broken by ‘B’. You can also use the na_position parameter to specify whether to place the missing values at the beginning or the end of the sorted data, and the inplace parameter to modify the original dataframe instead of returning a new one. You can find more details and examples on the official documentation page for the sort_values method.
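For example, here is a minimal sketch that sorts by column ‘A’ in descending order and places rows with missing values first:

df.sort_values(by='A', ascending=False, na_position='first')   # NaN values in 'A' come first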
The sort_index method allows you to sort the data by the labels of the columns or rows, in ascending or descending order, and with different options for handling missing values. For example, if you have a dataframe called df with an index and columns that are strings, you can sort the data by the index in alphabetical order by typing:
df.sort_index()
This will return a new dataframe with the rows sorted by the index. You can also sort the data by the columns by passing axis=1 to the sort_index method, and specify the order by passing a boolean to the ascending parameter. For example, you can sort the data by the columns in reverse alphabetical order by typing:
df.sort_index(axis=1, ascending=False)
This will return a new dataframe with the columns sorted by the labels. You can also use the na_position parameter to specify whether to place the missing values at the beginning or the end of the sorted data, and the inplace parameter to modify the original dataframe instead of returning a new one. You can find more details and examples on the official documentation page for the sort_index method.
In this section, you learned how to filter and sort data in pandas using various methods and functions. You learned how to use boolean indexing, query, where, mask, sort_values, and sort_index to select and arrange the data that meets your requirements.
4.3. Apply functions and calculations
The last thing you need to know in this tutorial is how to apply functions and calculations to the data in a dataframe. This allows you to perform various operations on the data, such as arithmetic, statistics, aggregation, transformation, and more. Pandas provides various methods and functions for applying functions and calculations, such as apply, applymap, agg, and transform. In this section, we will explain how to use these methods and functions and what are the differences between them.
The first thing you need to know is that pandas uses the apply method as a general way of applying a function to the data in a dataframe. The apply method allows you to apply a function to each column or row of a dataframe, and returns a new dataframe or series with the results. For example, if you have a dataframe called df with columns called ‘A’, ‘B’, and ‘C’, you can use the apply method to calculate the sum of each column by typing:
df.apply(sum)
This will return a new series with the sum of each column. You can also use the apply method to calculate the sum of each row by passing axis=1 to the apply method, and specify a different function by passing any callable object to the apply method. For example, you can use the apply method to calculate the mean of each row by typing:
import numpy as np           # NumPy provides the mean function used below
df.apply(np.mean, axis=1)
This will return a new series with the mean of each row, using the np.mean function from the NumPy library. You can also use the apply method to apply a custom function that you define yourself, such as a lambda function or a user-defined function. For example, you can use the apply method to apply a lambda function that adds 1 to each value in the dataframe by typing:
df.apply(lambda x: x + 1)
This will return a new dataframe with each value incremented by 1. You can also use the apply method to apply a user-defined function of your own, such as one that calculates the difference between the maximum and minimum values in a column or row. For example, you can use the apply method to apply a user-defined function that calculates the range of each column by typing:
def value_range(x):   # named value_range to avoid shadowing Python's built-in range
    return x.max() - x.min()

df.apply(value_range)
This will return a new series with the range of each column. You can find more details and examples on the official documentation page for the apply method.
The second thing you need to know is that pandas provides the applymap method as a way of applying a function to each element of a dataframe. (In recent versions of pandas, applymap has been renamed to DataFrame.map; the older name still works but may emit a deprecation warning.) The applymap method allows you to apply a function that takes a single argument and returns a single value to each element of a dataframe, and returns a new dataframe with the results. For example, if you have a dataframe called df with columns called ‘A’, ‘B’, and ‘C’, you can use the applymap method to apply the np.sqrt function from the NumPy library to each element of the dataframe by typing:
df.applymap(np.sqrt)
This will return a new dataframe with the square root of each element. You can also use the applymap method to apply a custom function that you define yourself, such as a lambda function or a user-defined function. For example, you can use the applymap method to apply a lambda function that converts each element to a string by typing:
df.applymap(str)
This will return a new dataframe with each element converted to a string. You can also use the applymap method to apply a user-defined function that takes a single argument and returns a single value, such as a function that checks if a value is even or odd. For example, you can use the applymap method to apply a user-defined function that returns ‘even’ or ‘odd’ depending on the value by typing:
def even_or_odd(x):
    if x % 2 == 0:
        return 'even'
    else:
        return 'odd'

df.applymap(even_or_odd)
This will return a new dataframe with each element labeled as ‘even’ or ‘odd’. You can find more details and examples on the official documentation page for the applymap method.
The third thing you need to know is that pandas provides the agg and transform methods as ways of applying aggregation and transformation functions to the data in a dataframe. The agg and transform methods are similar to the apply method, but they have some differences in how they handle the output and the input of the functions. The agg method allows you to apply one or more aggregation functions to each column or row of a dataframe, and returns a new dataframe or series with the results. An aggregation function is a function that takes a series or a dataframe and returns a single value, such as sum, mean, min, max, count, std, var, and more. For example, if you have a dataframe called df with columns called ‘A’, ‘B’, and ‘C’, you can use the agg method to calculate the sum and the mean of each column by typing:
df.agg(['sum', 'mean'])
This will return a new dataframe with the sum and the mean of each column. You can also use the agg method to apply different aggregation functions to different columns by passing a dictionary of column names and function names to the agg method. For example, you can use the agg method to calculate the sum of column ‘A’, the mean of column ‘B’, and the minimum of column ‘C’ by typing:
df.agg({'A': 'sum', 'B': 'mean', 'C': 'min'})
This will return a new series with the results for each column. You can also use the agg method to apply custom aggregation functions that you define yourself, such as a lambda function or a user-defined function. For example, you can use the agg method to apply a lambda function that calculates the range of each column by typing:
df.agg(lambda x: x.max() - x.min())
This will return a new series with the range of each column. You can find more details and examples on the official documentation page for the agg method.
The transform method allows you to apply one or more transformation functions to each column or row of a dataframe, and returns a new dataframe with the results. A transformation function is a function that takes a series or a dataframe and returns a series or a dataframe with the same shape and index, such as abs, log, sqrt, rank, zscore, and more. For example, if you have a dataframe called df with columns called ‘A’, ‘B’, and ‘C’, you can use the transform method to apply the np.log function from the NumPy library to each column by typing:
df.transform(np.log)
This will return a new dataframe with the natural logarithm of each element. You can also use the transform method to apply different transformation functions to different columns by passing a dictionary of column names and function names to the transform method. For example, you can use the transform method to apply the np.log function to column ‘A’, the np.sqrt function to column ‘B’, and the np.abs function to column ‘C’ by typing:
df.transform({'A': np.log, 'B': np.sqrt, 'C': np.abs})
This will return a new dataframe with the results for each column. You can also use the transform method to apply custom transformation functions that you define yourself, such as a lambda function or a user-defined function.
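For example, here is a minimal sketch of a custom transformation, assuming the columns are numeric: a lambda function that standardizes each column (a z-score).

df.transform(lambda x: (x - x.mean()) / x.std())   # standardize each column to mean 0 and unit variance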
5. Summary and next steps
In this tutorial, you learned the basics of pandas and dataframes, which are essential tools for data analysis and manipulation in Python. You learned how to:
- Install pandas and import it as pd
- Create a dataframe from a list of dictionaries
- View and inspect the dataframe using various attributes and methods
- Select columns and rows from the dataframe using different methods and attributes, such as [], loc, iloc, at, iat, and more
- Filter and sort the data in the dataframe using different methods and functions, such as boolean indexing, query, where, mask, sort_values, sort_index, and more
- Apply functions and calculations to the data in the dataframe using different methods and functions, such as apply, applymap, agg, transform, and more
By completing this tutorial, you have gained a solid foundation for working with pandas and dataframes, which will enable you to perform various data analysis tasks in Python. You can also use pandas to read and write data from different sources, such as CSV, Excel, JSON, SQL, and more. You can find more details and examples on the official documentation page for input/output in pandas.
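As a minimal sketch of reading and writing data (the file name is just an assumption for illustration):

df.to_csv("students.csv", index=False)   # write the dataframe to a CSV file, omitting the index
df2 = pd.read_csv("students.csv")        # read it back into a new dataframe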
As a next step, you can explore more features and functionalities of pandas and dataframes, such as grouping, merging, reshaping, pivoting, plotting, and more. You can find more tutorials and resources on the official documentation page for getting started with pandas.
We hope you enjoyed this tutorial and learned something new and useful. If you have any questions or feedback, please let us know in the comments below. Thank you for reading and happy coding!