Pandas DataFrame#
Learning Objectives
Questions:
What is a Pandas DataFrame?
How is a DataFrame used to manage and manipulate data?
Objectives:
Understand the structure and purpose of a DataFrame.
Create a DataFrame from lists.
Perform basic DataFrame manipulations, such as renaming columns and transposing data.
Understand the structure and purpose of a Pandas Series.
Append Series as columns to an existing DataFrame and create new columns based on existing data.
Import Pandas#
# This line imports the pandas library and aliases it as 'pd'.
import pandas as pd
Why pandas as pd?
Aliasing pandas as pd is a widely adopted convention that shortens the syntax for accessing its functionality.
After this statement, you can use pd to access everything the pandas library provides.
Pandas data table representation#
A DataFrame is a two-dimensional data structure in the Pandas library, designed to hold and manage data efficiently. It organizes data into labeled columns and rows, allowing you to store various types of data, such as numbers, text, and categorical values, all within the same structure.
Creating our first DataFrame#
We start by creating three lists of equal length (i.e., containing the same number of elements).
These lists will be used as columns for a DataFrame, with each list representing a column and each element within the list representing a row in that column.
# Create three lists named 'name', 'age', and 'sex'.
name = ["Braund", "Allen", "Bonnel"]
age = [22, 35, 58]
sex = ["male", "male", "female"]
We use the lists name, age, and sex to fill in the columns.
Each list corresponds to a column in the DataFrame.
Name, Age, and Sex are the titles of these columns.
# Create a DataFrame named 'df' based on three lists.
df = pd.DataFrame({'Name': name, 'Age': age, 'Sex': sex})
DataFrames and dictionaries#
Creating a DataFrame in pandas is similar to creating a dictionary. The keys in the dictionary become the column names, while the values, which are lists or arrays, form the columns’ data.
See also
For more information on dictionaries in Python, see GeeksforGeeks.
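To make the dictionary analogy concrete, the following minimal sketch builds the dictionary first and then passes it to pd.DataFrame. The variable names data and df_from_dict are chosen here for illustration only and are not used elsewhere in this lesson.

```python
import pandas as pd

# A plain Python dictionary: keys are strings, values are lists of equal length.
# (Illustrative names; 'data' and 'df_from_dict' are not part of the lesson.)
data = {
    "Name": ["Braund", "Allen", "Bonnel"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
}

# The keys become the column names; the lists become the column data.
df_from_dict = pd.DataFrame(data)

print(list(data.keys()))           # the dictionary keys
print(list(df_from_dict.columns))  # the column names, in the same order
```

Both print statements should show the same three labels, which is the correspondence described above.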
# Display the DataFrame 'df'.
df
In spreadsheet software, the table representation of our data would look very similar.
# Check the type of the 'df' object using the 'type()' function.
type(df)
Attributes#
We can use the shape attribute to determine the dimensions of the DataFrame.
It returns a tuple representing the number of rows and columns (rows, columns). This is especially helpful if you have not created the DataFrame yourself.
df.shape
We can also use the dtypes attribute to view the data type of each column in the DataFrame.
This command provides information about the data type of each column, such as integer, float, or object (string). This is useful knowledge to have when you start working more in-depth with your data.
df.dtypes
When asking for the shape or dtypes, no parentheses () are used. Both are attributes of DataFrame and Series. (Series will be explained later.)
Attributes of a DataFrame or Series do not need ().
Attributes represent a characteristic of a DataFrame/Series, whereas methods (which require parentheses ()) do something with the DataFrame/Series.
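The difference between attributes and methods can be sketched as follows. The DataFrame demo is a hypothetical stand-in for 'df', and head() is used only as one example of a method; any other DataFrame method would illustrate the same point.

```python
import pandas as pd

# Hypothetical example data, not the lesson's 'df'.
demo = pd.DataFrame({"Name": ["Braund", "Allen", "Bonnel"], "Age": [22, 35, 58]})

# Attribute: no parentheses, it simply reports a property of the DataFrame.
dimensions = demo.shape

# Method: parentheses are required, it performs an operation and returns a result.
first_rows = demo.head(2)

print(dimensions)
print(first_rows)
```

Writing demo.head without parentheses does not run the method; it only refers to the method object itself.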
Transposing a DataFrame#
The transpose() method swaps the DataFrame’s rows and columns, creating ‘df_transposed’.
Transposing is useful for reshaping data, making it easier to compare rows or apply certain operations that are typically column-based.
# Transpose the DataFrame 'df' using the 'transpose()' method.
df_transposed = df.transpose()
# Display the DataFrame 'df_transposed'.
df_transposed
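As a small check of what transposing does, the sketch below uses a hypothetical two-column DataFrame called demo so that the change in shape is visible. It also uses the T attribute, which pandas provides as a shorthand for transpose().

```python
import pandas as pd

# Hypothetical example with 3 rows and 2 columns, so the swap is easy to see.
demo = pd.DataFrame({"Name": ["Braund", "Allen", "Bonnel"], "Age": [22, 35, 58]})
demo_t = demo.transpose()

print(demo.shape)          # (3, 2): three rows, two columns
print(demo_t.shape)        # (2, 3): two rows, three columns
print(list(demo_t.index))  # the former column names are now the row labels

# The 'T' attribute is a shorthand for transpose().
print(demo_t.equals(demo.T))
```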
Renaming columns#
We can rename the columns of our DataFrame after creation.
This is done by assigning a new list of column names to df.columns.
The new column names are Names, Age, and Sex, in that order.
# Rename the columns of the DataFrame 'df'.
df.columns = ['Names', 'Age', 'Sex']
The method below is useful for selectively renaming only one or more columns without changing the entire set of column names:
# Rename the 'Age' column to 'Ages' in the DataFrame 'df'.
df = df.rename(columns={'Age': 'Ages'})
# Our DataFrame now looks like this:
df
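The reason the result of rename() is assigned back to 'df' can be seen in this minimal sketch: by default, rename() returns a new DataFrame and leaves the original untouched. The names demo and renamed are illustrative only.

```python
import pandas as pd

# Hypothetical example data, not the lesson's 'df'.
demo = pd.DataFrame({"Names": ["Braund", "Allen"], "Age": [22, 35]})

# rename() returns a new DataFrame; 'demo' itself is not modified.
renamed = demo.rename(columns={"Age": "Ages"})

print(list(demo.columns))     # still contains 'Age'
print(list(renamed.columns))  # contains 'Ages'
```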
Series#
Each column in a DataFrame is a Series. When we access a column in a DataFrame, this actually returns a Series object containing all the data in that column.
df['Ages']
# Check the type of the 'Ages' column in 'df' using the 'type()' function.
type(df['Ages'])
Creating our own Series#
We can create and name a Series in the following way.
The name parameter assigns the name ‘Fare’ to the Series.
# Create a pandas Series named 'fare' with specified values.
fare = pd.Series(['7.2500', '71.2833', 'Unknown'], name='Fare')
This outputs the values along with their index positions and the name of the Series:
# Display the 'fare' Series.
fare
fare.apply(type).value_counts()
Fare
<class 'str'> 3
Name: count, dtype: int64
# Check the data type of 'fare' using the 'type()' function.
type(fare)
Although our data looks fine at first glance, each cell is actually stored as a string. While we can read the numbers without issue, this will cause problems if we want to use them for calculations.
To solve this, Pandas provides a built-in function called to_numeric(), which converts the values of a Series to integers or floats, depending on the values.
fare = pd.to_numeric(fare)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File pandas/_libs/lib.pyx:2407, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "Unknown"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[19], line 1
----> 1 fare = pd.to_numeric(fare)
File /opt/hostedtoolcache/Python/3.12.11/x64/lib/python3.12/site-packages/pandas/core/tools/numeric.py:235, in to_numeric(arg, errors, downcast, dtype_backend)
233 coerce_numeric = errors not in ("ignore", "raise")
234 try:
--> 235 values, new_mask = lib.maybe_convert_numeric( # type: ignore[call-overload]
236 values,
237 set(),
238 coerce_numeric=coerce_numeric,
239 convert_to_masked_nullable=dtype_backend is not lib.no_default
240 or isinstance(values_dtype, StringDtype)
241 and values_dtype.na_value is libmissing.NA,
242 )
243 except (ValueError, TypeError):
244 if errors == "raise":
File pandas/_libs/lib.pyx:2449, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "Unknown" at position 2
However, as seen above, this produces a ValueError due to the “Unknown” value at position 2.
To solve this, we need to use the errors parameter in to_numeric().
Exercise#
Convert the cells of the ‘fare’ Series to numeric values.
The “Unknown” value must be converted to NaN by applying the errors parameter in to_numeric().
Hint: We can get help by using:
help(pd.to_numeric)
Solution
fare = pd.to_numeric(fare, errors='coerce')
This solution changes all cell values to numeric values.
Any values that cannot be converted, such as strings like “Unknown”, will be replaced with NaN instead of raising an error.
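To check the effect of errors='coerce', the following sketch repeats the conversion on a separate Series and counts the missing values. The names raw and converted are illustrative, chosen so the lesson's 'fare' variable is left alone.

```python
import pandas as pd

# Same values as the lesson's 'fare' Series, under an illustrative name.
raw = pd.Series(["7.2500", "71.2833", "Unknown"], name="Fare")

# Strings that cannot be parsed as numbers become NaN instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # the Series now holds floats
print(converted.isna().sum())  # number of values that could not be parsed
```

Because NaN is a floating-point value, the resulting Series has a float data type even if all remaining values were whole numbers.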
Appending Series#
We can add a Series as a new column to a DataFrame, extending it horizontally.
Here, the name of the ‘fare’ Series (‘Fare’) becomes the column name in the updated DataFrame.
# Convert the Series' cells to numeric values
fare = pd.to_numeric(fare, errors='coerce')
# Concatenate the 'fare' Series to the 'df' DataFrame along the columns (axis=1).
df = pd.concat([df, fare], axis=1)
The axis parameter in pandas.concat() determines whether you are combining data along rows or columns:
axis=0 (default): concatenation along rows
axis=1: concatenation along columns
(When concatenating along axis=1, Pandas aligns on the index by default. If indices do not match, you may see NaN values where data is missing.)
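The index alignment mentioned above can be illustrated with a small sketch. The objects left and right are hypothetical and deliberately given indices that only partly overlap.

```python
import pandas as pd

# Hypothetical data: the DataFrame has index 0, 1, 2 ...
left = pd.DataFrame({"Name": ["Braund", "Allen", "Bonnel"]})
# ... while the Series has index 1, 2, 3.
right = pd.Series([7.25, 71.2833, 53.1], index=[1, 2, 3], name="Fare")

# Values are matched by index label, not by position.
combined = pd.concat([left, right], axis=1)
print(combined)
```

The result has four rows: index 0 has no 'Fare' and index 3 has no 'Name', so both cells are filled with NaN.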
# Display the updated DataFrame 'df'.
df
Exercise#
Given the Series:
extra_row = pd.Series(['Futrelle', 35, 'Female', 53.1000])
Append this Series to ‘df’ but as a row (not a column!).
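Solution
extra_row.index = df.columns
df = pd.concat([df, extra_row.to_frame().T], axis=0, ignore_index=True)
This solution first labels the values in ‘extra_row’ with the column names of ‘df’ and turns the Series into a one-row DataFrame with to_frame().T. It then concatenates that row to the ‘df’ DataFrame along the rows (axis=0), and ignore_index=True renumbers the rows.
Passing the Series directly to concat() with axis=0 would not add a row: pandas treats a Series as a single column when combining it with a DataFrame, so the values would end up in a new column instead. The solution assumes that ‘df’ has exactly four columns at this point (Names, Ages, Sex, and Fare).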
Creating a new column based on existing data#
We can also create a new column based on the data in an existing column.
Here, we create a new column ‘Age_in_3_years’ in the DataFrame ‘df’.
This column is calculated by adding 3 to each value in the ‘Ages’ column.
df['Age_in_3_years'] = df['Ages'] + 3
# Display the updated DataFrame 'df'.
df
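The expression above works on the whole column at once. As a rough illustration, the sketch below compares it with an explicit loop over the values; demo and looped are illustrative names, and the loop is shown only for comparison, not as a recommended approach.

```python
import pandas as pd

# Hypothetical example data, not the lesson's 'df'.
demo = pd.DataFrame({"Ages": [22, 35, 58]})

# Column-wise arithmetic: 3 is added to every value in 'Ages'.
demo["Age_in_3_years"] = demo["Ages"] + 3

# The same calculation written as an explicit loop, for comparison only.
looped = [age + 3 for age in demo["Ages"]]

print(demo)
print(looped)
```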
Exercise#
Create a new column called ‘Fare_in_DKK’ based on the column ‘Fare’.
We assume the old fare prices to be in GBP and the exchange rate to be £1 = 8.7 DKK.
Solution
df['Fare_in_DKK'] = df['Fare'] * 8.7
This solution creates a new column in the DataFrame named ‘Fare_in_DKK’, which contains the fare prices converted from GBP to DKK using the given exchange rate.
Each fare value in GBP is multiplied by the exchange rate to obtain the corresponding fare value in DKK.
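Since the ‘Fare’ column contains a NaN (the former “Unknown” value), it is worth seeing how missing values behave in such a calculation. The sketch below uses an illustrative Series fare_gbp with the same values; the names are assumptions made for this example.

```python
import pandas as pd

# Illustrative Series with the converted fare values, including one NaN.
fare_gbp = pd.Series([7.25, 71.2833, float("nan")], name="Fare")

# Multiplication is applied to each value; NaN stays NaN.
fare_dkk = fare_gbp * 8.7

print(fare_dkk)
```

The missing fare remains missing after the conversion rather than causing an error or becoming zero.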
Key points#
Import the library with import pandas as pd.
A table of data is stored as a pandas DataFrame.
The shape and dtypes attributes are convenient for a first check.
Each column in a DataFrame is a Series.
We can append Series as columns to an existing DataFrame.