Subsets#
Learning Objectives
Questions:
How can we select and filter specific columns and rows in a
DataFrame
?
Objectives:
Select specific columns from a
DataFrame
using square bracket notation ([]
).Use the
loc
property for label-based indexing in aDataFrame
.Use the
iloc
property for position-based indexing in aDataFrame
.Filter rows based on the presence or absence of missing values.
Understand the differences and complementary uses of
.loc[]
and.iloc[]
.
Import Pandas#
# This line imports the pandas library and aliases it as 'pd'.
import pandas as pd
Creating a DataFrame
from a CSV file#
The pd.read_csv()
function is used to read the data from the file ‘data/titanic.csv’.
# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.
titanic = pd.read_csv('data/titanic.csv')
# Display the DataFrame 'titanic'.
titanic
Selecting specific columns#
Selecting specific columns is useful when you want to work with specific subsets of the dataset e.g. you might want to perform statistical calculations based on only the colum Age
or later on when you want to access and manipulate your data in more complex ways.
Access the ‘Age’ column from the DataFrame
‘titanic’ to return a Series
object containing all the data in the ‘Age’ column:
titanic['Age']
# Check the type of the 'Age' column in 'titanic' using the 'type()' function.
type(titanic['Age'])
Use the shape
attribute to determine the dimensions of the Series
.
It returns a tuple representing the number of rows and columns (rows, columns).
titanic['Age'].shape
Calling multiple Series
#
Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame
using double square brackets:
# Select the columns 'Age' and 'Sex' from the 'titanic' DataFrame.
titanic[['Age', 'Sex']]
Note
The inner square brackets define a Python list
with column names, whereas the outer square brackets are used to select the data from a pandas DataFrame
as seen in the previous example.
Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame
using double square brackets, and then apply the type()
function to the resulting DataFrame
subset:
# Use the 'type()' function to determine the data type of the DataFrame subset.
type(titanic[['Age', 'Sex']])
Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame
using double square brackets, and then apply the shape
attribute to the resulting DataFrame
subset:
# Use the 'shape' attribute to determine the dimensions of the DataFrame subset.
titanic[['Age', 'Sex']].shape
Introducing .loc[]
and .iloc[]
#
Now that you’ve learned how to select specific columns from a DataFrame
, it’s time to explore more advanced and flexible ways to access and manipulate your data. As you start filtering rows based on conditions or need to select both rows and columns simultaneously, using .loc[]
and .iloc[]
will give you greater control and clarity in your code.
These powerful tools allow you to:
Safely and explicitly select rows and columns by labels with
.loc[]
.Access rows and columns by their integer positions using
.iloc[]
.Avoid potential issues that can arise with more complex operations, ensuring your data manipulations are clear, consistent, and reliable.
With these methods, you’ll be better equipped to handle more complex data tasks efficiently.
What exactly are loc and iloc?
loc
and iloc
in Pandas are technically properties that return a specialized indexer object. These indexer objects allow for label-based (loc
) and integer-based (iloc
) indexing. Although they behave similarly to methods in that you use them to perform actions (i.e., selecting or slicing data), they are properties because they return an indexing object rather than directly executing a function.
When you access loc
or iloc
on a DataFrame
, what you are getting is not a simple value but an object that has its own methods and behaviors, specifically designed to handle indexing operations. This design allows for the concise and powerful data selection syntax that Pandas is known for.
Avoid ambiguous operations
Some of the operations we’ll be covering can technically be performed without using .loc[]
or .iloc[]
. For instance, you might see data selections or modifications done with simple square brackets like df[df['column'] > value]
. While this can work for basic tasks, it can lead to issues in more complex scenarios, such as unexpected behavior, ambiguity, or performance inefficiencies.
Why is this a bad idea?
Ambiguity: It’s not always clear if you’re modifying a copy of your data or the original
DataFrame
, which can lead to unintended side effects.Potential errors: Using square brackets for complex selections might trigger warnings like
SettingWithCopyWarning
, indicating that your code might not be doing what you think it is.Lack of flexibility: As your data operations become more complex, relying on square brackets can limit your ability to write clear and maintainable code.
For these reasons, we recommend using .loc[]
and .iloc[]
as they provide a more reliable and explicit way to handle data selection and modification in Pandas.
If you’re interested in learning more about the potential pitfalls of using square brackets and the benefits of .loc[]
and .iloc[]
, you can read more in the Pandas documentation on indexing and selecting data.
Filtering rows based on conditional expressions#
With .loc[]
and .iloc[]
, you have more control and flexibility when filtering rows and selecting columns in a DataFrame
.
Let’s start by exploring how you can use these tools to filter rows based on conditions.
Filtering specific rows with .loc[]
#
To filter rows based on a condition, you can place the condition inside the .loc[]
selector.
Here’s how you can filter rows where the Age
column is greater than 35:
# Filter rows in the 'titanic' DataFrame where the 'Age' column is greater than 35.
titanic.loc[titanic['Age'] > 35]
This command filters the rows where the condition titanic['Age'] > 35
is True
, and returns the entire DataFrame
for those rows.
If you only want to select specific columns (e.g., Name
and Pclass
) for those filtered rows, you can add the column names after the comma:
# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.
titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]
When using loc/iloc
The part before the comma is the rows you want to select.
The part after the comma is the columns you want to select.
Using .iloc[]
for positional indexing#
The .iloc[]
method works similarly to .loc[]
, but it selects rows and columns based on their integer positions rather than labels.
For example, to select the first 5 rows and the first 2 columns, you can use:
# Select the first 5 rows and the first 2 columns using .iloc[].
titanic.iloc[:5, :2]
Inclusive vs. Exclusive Slicing#
When slicing with .loc[]
, the slicing is inclusive of the start and end labels. For example:
# Select rows 0 through 3 (inclusive) and columns 'Name' and 'Pclass'.
titanic.loc[0:3, ['Name', 'Pclass']]
This command returns rows 0, 1, 2, and 3, including both the start and end indices.
When slicing with .iloc[]
, the slicing is exclusive of the end position:
# Select rows 0 through 2 (exclusive of 3) and columns 0 and 1 (exclusive of 2) using .iloc[].
titanic.iloc[0:3, 0:2]
Exercises#
Use
loc
to find the passenger named “Dooley, Mr. Patrick”Use
iloc
to find the ~~alien~~ 8. passengerUse
iloc
to find the value in row 100, column 4Use
loc
andstr.contains()
to find all Reverends (“Rev.”)
Solution
titanic.loc[titanic[‘Name’] == “Dooley, Mr. Patrick”]
titanic.iloc[7]
titanic.iloc[99, 3]
titanic.loc[titanic[‘Name’].str.contains(“Rev.”, case=False)]
Using .loc[] and .iloc[] for complex conditions
You can use both .loc[]
and .iloc[]
to filter rows based on more complex conditions, combining multiple criteria with logical operators.
For instance, to filter rows where Pclass
is either 2 or 3 using .loc[]
:
# Filter rows where 'Pclass' is either 2 or 3 using .loc[].
titanic.loc[titanic['Pclass'].isin([2, 3])]
Similarly, if you want to perform position-based filtering using .iloc[]
, you would first need to identify the positions of the relevant rows and columns. However, .iloc[]
is typically less useful for condition-based filtering because it requires knowledge of the exact row and column positions rather than labels.
You can also combine multiple conditions using logical operators like &
(AND) and |
(OR):
# Filter rows where ('Pclass' is either 2 or 3) AND ('Age' is greater than 35),
# and then select the 'Name' and 'Pclass' columns.
titanic.loc[((titanic['Pclass'] == 2) | (titanic['Pclass'] == 3)) & (titanic['Age'] > 35), ['Name', 'Pclass']]
Handling missing values with .loc[]
and .iloc[]
#
Both .loc[]
and .iloc[]
allow you to filter rows based on the presence or absence of missing values.
(However, .loc[]
is generally more convenient when working with labeled data, while .iloc[]
requires knowing the exact positions.)
For example, to filter rows where the Embarked
column is not null using .loc[]
:
# Filter rows where 'Embarked' is not null using .loc[].
titanic.loc[titanic['Embarked'].notna()]
You can then compare the number of extracted rows with the original DataFrame to identify how many rows have missing data:
# Display information about the DataFrame to compare row counts.
titanic.info()
Comparing the two number of rows (889 and 891) tells us, that two rows are missing embarkation data.
As we saw in the exercise on the previous page on reading and writing tabular data, the two rows missing embarkation data can be found like this:
titanic[titanic['Embarked'].isna()]
If we want to solve the same thing using loc
, we would write:
titanic.loc[titanic['Embarked'].isna()]
Selecting rows and columns#
The true power of .loc[]
becomes evident when you want to filter rows and simultaneously select specific columns. For example:
# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.
titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]
Combining loc
and iloc
#
We can use iloc
to select:
The first 100 rows, i.e., rows from index position 0 to 100 (exclusive).
Then further filter these rows using loc
based on the condition where:
‘Age’ is greater than 35.
titanic.iloc[:100].loc[titanic['Age'] > 35]
Exercise#
What is the number of matching records from:
titanic.iloc[:100].loc[titanic['Age'] > 35]
What is the number of matching records from:
titanic.loc[titanic['Age'] > 35].iloc[:100]
Reflect on and explain why the two statements return a different number of rows?
Solution
(We can use the len()
function to solve question 1 and 2.)
17 rows
100 rows
The two expressions differ in the order of filtering and slicing, which affects the final result:
titanic.iloc[:100].loc[titanic['Age'] > 35]
→ First selects the first 100 rows, then filters for rows whereAge > 35
within those 100.titanic.loc[titanic['Age'] > 35].iloc[:100]
→ First filters for all rows whereAge > 35
, then selects the first 100 rows from that filtered set.
Thus, the main difference lies in whether you slice first or filter first, which can lead to very different subsets of data.
Key points#
Access a single column or multiple columns from a
DataFrame
using square bracket notation ([]
).Use label-based indexing (
.loc[]
) and position-based indexing (.iloc[]
) to select specific rows and columns.Slicing with
.loc[]
is inclusive, while slicing with.iloc[]
is exclusive.Filter rows based on conditions and apply additional conditions or select specific columns within the same operation.