Subsets

Subsets#

Learning Objectives

Questions:

How can we select and filter specific columns and rows in a DataFrame?

Objectives:

Select specific columns from a DataFrame using square bracket notation ([]).
Use the loc property for label-based indexing in a DataFrame.
Use the iloc property for position-based indexing in a DataFrame.
Filter rows based on the presence or absence of missing values.
Understand the differences and complementary uses of .loc[] and .iloc[].

Import pandas#

# This line imports the pandas library and aliases it as 'pd'.

import pandas as pd

Creating a `DataFrame` from a CSV file#

The pd.read_csv() function is used to read the data from the file ‘data/titanic.csv’.

# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.

titanic = pd.read_csv('data/titanic.csv')

# Display the DataFrame 'titanic'.

titanic

Show code cell output

Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

Selecting specific columns#

Selecting specific columns from a DataFrame

Selecting specific columns is useful when you want to work with specific subsets of the dataset e.g. you might want to perform statistical calculations based on only the colum Age or later on when you want to access and manipulate your data in more complex ways.

Access the ‘Age’ column from the DataFrame ‘titanic’ to return a Series object containing all the data in the ‘Age’ column:

titanic['Age']

# Check the type of the 'Age' column in 'titanic' using the 'type()' function.

type(titanic['Age'])

Use the shape attribute to determine the dimensions of the Series.
It returns a tuple representing the number of rows and columns (rows, columns).

titanic['Age'].shape

Calling multiple `Series`#

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets:

# Select the columns 'Age' and 'Sex' from the 'titanic' DataFrame.

titanic[['Age', 'Sex']]

Show code cell output

Hide code cell output

	Age	Sex
0	22.0	male
1	38.0	female
2	26.0	female
3	35.0	female
4	35.0	male
...	...	...
886	27.0	male
887	19.0	female
888	NaN	female
889	26.0	male
890	32.0	male

891 rows × 2 columns

Note

The inner square brackets define a Python list with column names, whereas the outer square brackets are used to select the data from a pandas DataFrame as seen in the previous example.

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets, and then apply the type() function to the resulting DataFrame subset:

# Use the 'type()' function to determine the data type of the DataFrame subset.

type(titanic[['Age', 'Sex']])

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets, and then apply the shape attribute to the resulting DataFrame subset:

# Use the 'shape' attribute to determine the dimensions of the DataFrame subset.

titanic[['Age', 'Sex']].shape

Introducing `.loc[]` and `.iloc[]`#

Now that you’ve learned how to select specific columns from a DataFrame, it’s time to explore more advanced and flexible ways to access and manipulate your data. As you start filtering rows based on conditions or need to select both rows and columns simultaneously, using .loc[] and .iloc[] will give you greater control and clarity in your code.

These powerful tools allow you to:

Safely and explicitly select rows and columns by labels with .loc[].
Access rows and columns by their integer positions using .iloc[].
Avoid potential issues that can arise with more complex operations, ensuring your data manipulations are clear, consistent, and reliable.

With these methods, you’ll be better equipped to handle more complex data tasks efficiently.

What exactly are loc and iloc?

loc and iloc in pandas are technically properties that return a specialized indexer object. These indexer objects allow for label-based (loc) and integer-based (iloc) indexing. Although they behave similarly to methods in that you use them to perform actions (i.e., selecting or slicing data), they are properties because they return an indexing object rather than directly executing a function.

When you access loc or iloc on a DataFrame, what you are getting is not a simple value but an object that has its own methods and behaviors, specifically designed to handle indexing operations. This design allows for the concise and powerful data selection syntax that pandas is known for.

Avoid ambiguous operations

Some of the operations we’ll be covering can technically be performed without using .loc[] or .iloc[]. For instance, you might see data selections or modifications done with simple square brackets like df[df['column'] > value]. While this can work for basic tasks, it can lead to issues in more complex scenarios, such as unexpected behavior, ambiguity, or performance inefficiencies.

Why is this a bad idea?

Ambiguity: It’s not always clear if you’re modifying a copy of your data or the original DataFrame, which can lead to unintended side effects.
Potential errors: Using square brackets for complex selections might trigger warnings like SettingWithCopyWarning, indicating that your code might not be doing what you think it is.
Lack of flexibility: As your data operations become more complex, relying on square brackets can limit your ability to write clear and maintainable code.

For these reasons, we recommend using .loc[] and .iloc[] as they provide a more reliable and explicit way to handle data selection and modification in pandas.

If you’re interested in learning more about the potential pitfalls of using square brackets and the benefits of .loc[] and .iloc[], you can read more in the pandas documentation on indexing and selecting data.

Filtering rows based on conditional expressions#

Filtering specific rows from a DataFrame

With .loc[] and .iloc[], you have more control and flexibility when filtering rows and selecting columns in a DataFrame.

Let’s start by exploring how you can use these tools to filter rows based on conditions.

Filtering specific rows with `.loc[]`#

To filter rows based on a condition, you can place the condition inside the .loc[] selector.

Here’s how you can filter rows where the Age column is greater than 35:

# Filter rows in the 'titanic' DataFrame where the 'Age' column is greater than 35.

titanic.loc[titanic['Age'] > 35]

Show code cell output

Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
15	16	1	2	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	0	0	248706	16.0000	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
865	866	1	2	Bystrom, Mrs. (Karolina)	female	42.0	0	0	236852	13.0000	NaN	S
871	872	1	1	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	1	1	11751	52.5542	D35	S
873	874	0	3	Vander Cruyssen, Mr. Victor	male	47.0	0	0	345765	9.0000	NaN	S
879	880	1	1	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	0	1	11767	83.1583	C50	C
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.1250	NaN	Q

217 rows × 12 columns

This command filters the rows where the condition titanic['Age'] > 35 is True, and returns the entire DataFrame for those rows.

If you only want to select specific columns (e.g., Name and Pclass) for those filtered rows, you can add the column names after the comma:

# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.

titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]

Show code cell output

Hide code cell output

	Name	Pclass
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1
6	McCarthy, Mr. Timothy J	1
11	Bonnell, Miss. Elizabeth	1
13	Andersson, Mr. Anders Johan	3
15	Hewlett, Mrs. (Mary D Kingcome)	2
...	...	...
865	Bystrom, Mrs. (Karolina)	2
871	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	1
873	Vander Cruyssen, Mr. Victor	3
879	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	1
885	Rice, Mrs. William (Margaret Norton)	3

217 rows × 2 columns

When using loc/iloc

The part before the comma is the rows you want to select.

The part after the comma is the columns you want to select.

Using `.iloc[]` for positional indexing#

The .iloc[] method works similarly to .loc[], but it selects rows and columns based on their integer positions rather than labels.

For example, to select the first 5 rows and the first 2 columns, you can use:

# Select the first 5 rows and the first 2 columns using .iloc[].

titanic.iloc[:5, :2]

Show code cell output

Hide code cell output

	PassengerId	Survived
0	1	0
1	2	1
2	3	1
3	4	1
4	5	0

Inclusive vs. Exclusive Slicing#

When slicing with .loc[], the slicing is inclusive of the start and end labels. For example:

# Select rows 0 through 3 (inclusive) and columns 'Name' and 'Pclass'.

titanic.loc[0:3, ['Name', 'Pclass']]

Show code cell output

Hide code cell output

	Name	Pclass
0	Braund, Mr. Owen Harris	3
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1
2	Heikkinen, Miss. Laina	3
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1

This command returns rows 0, 1, 2, and 3, including both the start and end indices.

When slicing with .iloc[], the slicing is exclusive of the end position:

# Select rows 0 through 2 (exclusive of 3) and columns 0 and 1 (exclusive of 2) using .iloc[].

titanic.iloc[0:3, 0:2]

Show code cell output

Hide code cell output

	PassengerId	Survived
0	1	0
1	2	1
2	3	1

Exercises#

Use loc to find the passenger named “Dooley, Mr. Patrick”
Use iloc to find the ~~alien~~ 8. passenger
Use iloc to find the value in row 100, column 4
Use loc and str.contains() to find all Reverends (“Rev.”)

Solution

titanic.loc[titanic[‘Name’] == “Dooley, Mr. Patrick”]
titanic.iloc[7]
titanic.iloc[99, 3]
titanic.loc[titanic[‘Name’].str.contains(“Rev.”, case=False)]

Solution 4 selects all rows from titanic where the Name column contains “Rev.”.
The argument case=False makes the search case-insensitive (it ignores uppercase/lowercase differences).
You have not learned about case=False yet, and it is not strictly necessary considering the wording of the exercise.

Using .loc[] and .iloc[] for complex conditions

You can use both .loc[] and .iloc[] to filter rows based on more complex conditions, combining multiple criteria with logical operators.

For instance, to filter rows where Pclass is either 2 or 3 using .loc[]:

# Filter rows where 'Pclass' is either 2 or 3 using .loc[].
titanic.loc[titanic['Pclass'].isin([2, 3])]

Similarly, if you want to perform position-based filtering using .iloc[], you would first need to identify the positions of the relevant rows and columns. However, .iloc[] is typically less useful for condition-based filtering because it requires knowledge of the exact row and column positions rather than labels.

You can also combine multiple conditions using logical operators like & (AND) and | (OR):

# Filter rows where ('Pclass' is either 2 or 3) AND ('Age' is greater than 35),
# and then select the 'Name' and 'Pclass' columns.
titanic.loc[((titanic['Pclass'] == 2) | (titanic['Pclass'] == 3)) & (titanic['Age'] > 35), ['Name', 'Pclass']]

Handling missing values with `.loc[]` and `.iloc[]`#

Both .loc[] and .iloc[] allow you to filter rows based on the presence or absence of missing values.
(However, .loc[] is generally more convenient when working with labeled data, while .iloc[] requires knowing the exact positions.)

For example, to filter rows where the Embarked column is not null using .loc[]:

# Filter rows where 'Embarked' is not null using .loc[].

titanic.loc[titanic['Embarked'].notna()]

Show code cell output

Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

889 rows × 12 columns

You can then compare the number of extracted rows with the original DataFrame to identify how many rows have missing data:

# Display information about the DataFrame to compare row counts.

titanic.info()

Comparing the two number of rows (889 and 891) tells us, that two rows are missing embarkation data.

As we saw in the exercise on the previous page on reading and writing tabular data, the two rows missing embarkation data can be found like this:

titanic[titanic['Embarked'].isna()]

If we want to solve the same thing using loc, we would write:

titanic.loc[titanic['Embarked'].isna()]

Show code cell output

Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
61	62	1	1	Icard, Miss. Amelie	female	38.0	0	0	113572	80.0	B28	NaN
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0	B28	NaN

Selecting rows and columns#

Selecting specific rows and columns from a DataFrame

The true power of .loc[] becomes evident when you want to filter rows and simultaneously select specific columns. For example:

# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.

titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]

Show code cell output

Hide code cell output

	Name	Pclass
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1
6	McCarthy, Mr. Timothy J	1
11	Bonnell, Miss. Elizabeth	1
13	Andersson, Mr. Anders Johan	3
15	Hewlett, Mrs. (Mary D Kingcome)	2
...	...	...
865	Bystrom, Mrs. (Karolina)	2
871	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	1
873	Vander Cruyssen, Mr. Victor	3
879	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	1
885	Rice, Mrs. William (Margaret Norton)	3

217 rows × 2 columns

Combining `loc` and `iloc`#

We can use iloc to select:

The first 100 rows, i.e., rows from index position 0 to 100 (exclusive).

Then further filter these rows using loc based on the condition where:

‘Age’ is greater than 35.

titanic.iloc[:100].loc[titanic['Age'] > 35]

Show code cell output

Hide code cell output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
15	16	1	2	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	0	0	248706	16.0000	NaN	S
25	26	1	3	Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...	female	38.0	1	5	347077	31.3875	NaN	S
30	31	0	1	Uruchurtu, Don. Manuel E	male	40.0	0	0	PC 17601	27.7208	NaN	C
33	34	0	2	Wheadon, Mr. Edward H	male	66.0	0	0	C.A. 24579	10.5000	NaN	S
35	36	0	1	Holverson, Mr. Alexander Oskar	male	42.0	1	0	113789	52.0000	NaN	S
40	41	0	3	Ahlin, Mrs. Johan (Johanna Persdotter Larsson)	female	40.0	1	0	7546	9.4750	NaN	S
52	53	1	1	Harper, Mrs. Henry Sleeper (Myna Haxtun)	female	49.0	1	0	PC 17572	76.7292	D33	C
54	55	0	1	Ostby, Mr. Engelhart Cornelius	male	65.0	0	1	113509	61.9792	B30	C
61	62	1	1	Icard, Miss. Amelie	female	38.0	0	0	113572	80.0000	B28	NaN
62	63	0	1	Harris, Mr. Henry Birkhardt	male	45.0	1	0	36973	83.4750	C83	S
92	93	0	1	Chaffee, Mr. Herbert Fuller	male	46.0	1	0	W.E.P. 5734	61.1750	E31	S
94	95	0	3	Coxon, Mr. Daniel	male	59.0	0	0	364500	7.2500	NaN	S
96	97	0	1	Goldschmidt, Mr. George B	male	71.0	0	0	PC 17754	34.6542	A5	C

Exercise#

What is the number of matching records from:

titanic.iloc[:100].loc[titanic['Age'] > 35]

What is the number of matching records from:

titanic.loc[titanic['Age'] > 35].iloc[:100]

Reflect on and explain why the two statements return a different number of rows?

Solution

(We can use the len() function to solve question 1 and 2.)

17 rows
100 rows
The two expressions differ in the order of filtering and slicing, which affects the final result:

titanic.iloc[:100].loc[titanic['Age'] > 35]
→ First selects the first 100 rows, then filters for rows where Age > 35 within those 100.
titanic.loc[titanic['Age'] > 35].iloc[:100]
→ First filters for all rows where Age > 35, then selects the first 100 rows from that filtered set.

Thus, the main difference lies in whether you slice first or filter first, which can lead to very different subsets of data.

Key points#

Access a single column or multiple columns from a DataFrame using square bracket notation ([]).
Use label-based indexing (.loc[]) and position-based indexing (.iloc[]) to select specific rows and columns.
Slicing with .loc[] is inclusive, while slicing with .iloc[] is exclusive.
Filter rows based on conditions and apply additional conditions or select specific columns within the same operation.

Subsets

Contents

Subsets#

Import pandas#

Creating a DataFrame from a CSV file#

Selecting specific columns#

Calling multiple Series#

Introducing .loc[] and .iloc[]#

Filtering rows based on conditional expressions#

Filtering specific rows with .loc[]#

Using .iloc[] for positional indexing#

Inclusive vs. Exclusive Slicing#

Exercises#

Handling missing values with .loc[] and .iloc[]#

Selecting rows and columns#

Combining loc and iloc#

Exercise#

Key points#

Creating a `DataFrame` from a CSV file#

Calling multiple `Series`#

Introducing `.loc[]` and `.iloc[]`#

Filtering specific rows with `.loc[]`#

Using `.iloc[]` for positional indexing#

Handling missing values with `.loc[]` and `.iloc[]`#

Combining `loc` and `iloc`#