Subsets#

Learning Objectives

Questions:

  • How can we select and filter specific columns and rows in a DataFrame?

Objectives:

  • Select specific columns from a DataFrame using square bracket notation ([]).

  • Use the loc property for label-based indexing in a DataFrame.

  • Use the iloc property for position-based indexing in a DataFrame.

  • Filter rows based on the presence or absence of missing values.

  • Understand the differences and complementary uses of .loc[] and .iloc[].

Import Pandas#

# This line imports the pandas library and aliases it as 'pd'.

import pandas as pd

Creating a DataFrame from a CSV file#

The pd.read_csv() function is used to read the data from the file ‘data/titanic.csv’.

# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.

titanic = pd.read_csv('data/titanic.csv')
# Display the DataFrame 'titanic'.

titanic

Hide code cell output

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns


Selecting specific columns#

Selecting specific columns from a DataFrame

Selecting specific columns is useful when you want to work with specific subsets of the dataset e.g. you might want to perform statistical calculations based on only the colum Age or later on when you want to access and manipulate your data in more complex ways.

Access the ‘Age’ column from the DataFrame ‘titanic’ to return a Series object containing all the data in the ‘Age’ column:

titanic['Age']

Hide code cell output

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
# Check the type of the 'Age' column in 'titanic' using the 'type()' function.

type(titanic['Age'])

Hide code cell output

pandas.core.series.Series

Use the shape attribute to determine the dimensions of the Series.
It returns a tuple representing the number of rows and columns (rows, columns).

titanic['Age'].shape

Hide code cell output

(891,)

Calling multiple Series#

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets:

# Select the columns 'Age' and 'Sex' from the 'titanic' DataFrame.

titanic[['Age', 'Sex']]

Hide code cell output

Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male
... ... ...
886 27.0 male
887 19.0 female
888 NaN female
889 26.0 male
890 32.0 male

891 rows × 2 columns

Note

The inner square brackets define a Python list with column names, whereas the outer square brackets are used to select the data from a pandas DataFrame as seen in the previous example.

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets, and then apply the type() function to the resulting DataFrame subset:

# Use the 'type()' function to determine the data type of the DataFrame subset.

type(titanic[['Age', 'Sex']])

Hide code cell output

pandas.core.frame.DataFrame

Select columns ‘Age’ and ‘Sex’ from the ‘titanic’ DataFrame using double square brackets, and then apply the shape attribute to the resulting DataFrame subset:

# Use the 'shape' attribute to determine the dimensions of the DataFrame subset.

titanic[['Age', 'Sex']].shape

Hide code cell output

(891, 2)

Introducing .loc[] and .iloc[]#

Now that you’ve learned how to select specific columns from a DataFrame, it’s time to explore more advanced and flexible ways to access and manipulate your data. As you start filtering rows based on conditions or need to select both rows and columns simultaneously, using .loc[] and .iloc[] will give you greater control and clarity in your code.

These powerful tools allow you to:

  • Safely and explicitly select rows and columns by labels with .loc[].

  • Access rows and columns by their integer positions using .iloc[].

  • Avoid potential issues that can arise with more complex operations, ensuring your data manipulations are clear, consistent, and reliable.

With these methods, you’ll be better equipped to handle more complex data tasks efficiently.


Filtering rows based on conditional expressions#

Filtering specific rows from a DataFrame

With .loc[] and .iloc[], you have more control and flexibility when filtering rows and selecting columns in a DataFrame.

Let’s start by exploring how you can use these tools to filter rows based on conditions.

Filtering specific rows with .loc[]#

To filter rows based on a condition, you can place the condition inside the .loc[] selector.

Here’s how you can filter rows where the Age column is greater than 35:

# Filter rows in the 'titanic' DataFrame where the 'Age' column is greater than 35.

titanic.loc[titanic['Age'] > 35]

Hide code cell output

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q

217 rows × 12 columns

This command filters the rows where the condition titanic['Age'] > 35 is True, and returns the entire DataFrame for those rows.

If you only want to select specific columns (e.g., Name and Pclass) for those filtered rows, you can add the column names after the comma:

# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.

titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]

Hide code cell output

Name Pclass
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1
6 McCarthy, Mr. Timothy J 1
11 Bonnell, Miss. Elizabeth 1
13 Andersson, Mr. Anders Johan 3
15 Hewlett, Mrs. (Mary D Kingcome) 2
... ... ...
865 Bystrom, Mrs. (Karolina) 2
871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 1
873 Vander Cruyssen, Mr. Victor 3
879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 1
885 Rice, Mrs. William (Margaret Norton) 3

217 rows × 2 columns

When using loc/iloc

The part before the comma is the rows you want to select.

The part after the comma is the columns you want to select.

Using .iloc[] for positional indexing#

The .iloc[] method works similarly to .loc[], but it selects rows and columns based on their integer positions rather than labels.

For example, to select the first 5 rows and the first 2 columns, you can use:

# Select the first 5 rows and the first 2 columns using .iloc[].

titanic.iloc[:5, :2]

Hide code cell output

PassengerId Survived
0 1 0
1 2 1
2 3 1
3 4 1
4 5 0

Inclusive vs. Exclusive Slicing#

When slicing with .loc[], the slicing is inclusive of the start and end labels. For example:

# Select rows 0 through 3 (inclusive) and columns 'Name' and 'Pclass'.

titanic.loc[0:3, ['Name', 'Pclass']]

Hide code cell output

Name Pclass
0 Braund, Mr. Owen Harris 3
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1
2 Heikkinen, Miss. Laina 3
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1

This command returns rows 0, 1, 2, and 3, including both the start and end indices.

When slicing with .iloc[], the slicing is exclusive of the end position:

# Select rows 0 through 2 (exclusive of 3) and columns 0 and 1 (exclusive of 2) using .iloc[].

titanic.iloc[0:3, 0:2]

Hide code cell output

PassengerId Survived
0 1 0
1 2 1
2 3 1

Exercises#

  1. Use loc to find the passenger named “Dooley, Mr. Patrick”

  2. Use iloc to find the ~~alien~~ 8. passenger

  3. Use iloc to find the value in row 100, column 4

  4. Use loc and str.contains() to find all Reverends (“Rev.”)

Handling missing values with .loc[] and .iloc[]#

Both .loc[] and .iloc[] allow you to filter rows based on the presence or absence of missing values.
(However, .loc[] is generally more convenient when working with labeled data, while .iloc[] requires knowing the exact positions.)

For example, to filter rows where the Embarked column is not null using .loc[]:

# Filter rows where 'Embarked' is not null using .loc[].

titanic.loc[titanic['Embarked'].notna()]

Hide code cell output

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

889 rows × 12 columns

You can then compare the number of extracted rows with the original DataFrame to identify how many rows have missing data:

# Display information about the DataFrame to compare row counts.

titanic.info()

Hide code cell output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Comparing the two number of rows (889 and 891) tells us, that two rows are missing embarkation data.

As we saw in the exercise on the previous page on reading and writing tabular data, the two rows missing embarkation data can be found like this:

titanic[titanic['Embarked'].isna()]

If we want to solve the same thing using loc, we would write:

titanic.loc[titanic['Embarked'].isna()]

Hide code cell output

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN

Selecting rows and columns#

Selecting specific rows and columns from a DataFrame

The true power of .loc[] becomes evident when you want to filter rows and simultaneously select specific columns. For example:

# Filter rows where 'Age' is greater than 35 and select only the 'Name' and 'Pclass' columns.

titanic.loc[titanic['Age'] > 35, ['Name', 'Pclass']]

Hide code cell output

Name Pclass
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1
6 McCarthy, Mr. Timothy J 1
11 Bonnell, Miss. Elizabeth 1
13 Andersson, Mr. Anders Johan 3
15 Hewlett, Mrs. (Mary D Kingcome) 2
... ... ...
865 Bystrom, Mrs. (Karolina) 2
871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 1
873 Vander Cruyssen, Mr. Victor 3
879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 1
885 Rice, Mrs. William (Margaret Norton) 3

217 rows × 2 columns

Combining loc and iloc#

We can use iloc to select:

  • The first 100 rows, i.e., rows from index position 0 to 100 (exclusive).

Then further filter these rows using loc based on the condition where:

  • ‘Age’ is greater than 35.

titanic.iloc[:100].loc[titanic['Age'] > 35]

Hide code cell output

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
30 31 0 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 NaN C
33 34 0 2 Wheadon, Mr. Edward H male 66.0 0 0 C.A. 24579 10.5000 NaN S
35 36 0 1 Holverson, Mr. Alexander Oskar male 42.0 1 0 113789 52.0000 NaN S
40 41 0 3 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 40.0 1 0 7546 9.4750 NaN S
52 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 PC 17572 76.7292 D33 C
54 55 0 1 Ostby, Mr. Engelhart Cornelius male 65.0 0 1 113509 61.9792 B30 C
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0000 B28 NaN
62 63 0 1 Harris, Mr. Henry Birkhardt male 45.0 1 0 36973 83.4750 C83 S
92 93 0 1 Chaffee, Mr. Herbert Fuller male 46.0 1 0 W.E.P. 5734 61.1750 E31 S
94 95 0 3 Coxon, Mr. Daniel male 59.0 0 0 364500 7.2500 NaN S
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C

Exercise#

  1. What is the number of matching records from:

titanic.iloc[:100].loc[titanic['Age'] > 35]
  1. What is the number of matching records from:

titanic.loc[titanic['Age'] > 35].iloc[:100]
  1. Reflect on and explain why the two statements return a different number of rows?


Key points#

  • Access a single column or multiple columns from a DataFrame using square bracket notation ([]).

  • Use label-based indexing (.loc[]) and position-based indexing (.iloc[]) to select specific rows and columns.

  • Slicing with .loc[] is inclusive, while slicing with .iloc[] is exclusive.

  • Filter rows based on conditions and apply additional conditions or select specific columns within the same operation.