Read and Write Tabular Data
Learning Objectives

Questions:

- How can we read, inspect, and export tabular data using Pandas?

Objectives:

- Load tabular data from a spreadsheet into a DataFrame.
- Inspect the structure and contents of your dataset.
- Identify missing data within your dataset.
- Apply filters to extract specific subsets of data based on conditions.
- Save and export your dataset to various file formats.
Download the Titanic data
For this page and the ones that follow, we will be working with the Titanic dataset. The full dataset can be found on Kaggle, but we will only be using the training data, which we have made available here.
To follow the exercises and use the code exactly as written below, you will need to download the data and place it in a folder called ‘data’ in the same directory as your Jupyter notebook. In other words, the folder where your notebook is located should contain a subfolder named ‘data’, which holds the file ‘titanic.csv’.
Import Pandas
# This line imports the pandas library and aliases it as 'pd'.
import pandas as pd
Current working directory

The %pwd command (print working directory) is known as a “magic command” in Jupyter Notebook. You can read more about magic commands here. It displays the current working directory (cwd) within the notebook environment, i.e. the folder where the Jupyter Notebook is running. Knowing the current working directory is important for file operations like reading and writing files.
%pwd
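Outside a notebook, for example in a plain Python script where magic commands are not available, the standard library offers an equivalent:

```python
import os

# Return the current working directory as a string,
# the same folder that %pwd reports inside a notebook.
cwd = os.getcwd()
print(cwd)
```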
Creating a DataFrame from a CSV file

CSV (Comma Separated Values) is a simple file format used to store tabular data, such as a spreadsheet or database.

See also

For more information on working with CSV files in Python, see GeeksforGeeks.

The pd.read_csv() function reads the data from the file ‘data/titanic.csv’. The file is located in the ‘data’ directory, relative to the current working directory: the folder where your notebook is located should contain a folder named ‘data’ holding the file ‘titanic.csv’. The resulting DataFrame ‘titanic’ contains the dataset, ready for analysis and manipulation.
# Load the Titanic dataset from a CSV file into a DataFrame named 'titanic'.
titanic = pd.read_csv('data/titanic.csv')
# Display the DataFrame 'titanic'.
titanic
Note that even though we only see the first and last five rows, we actually read the whole DataFrame into the kernel’s memory.
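If loading the full file is ever a concern, pd.read_csv() can read just part of it via its nrows and usecols parameters. A minimal sketch, using a small in-memory CSV as a stand-in for ‘data/titanic.csv’ (the column names are from the Titanic dataset):

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for 'data/titanic.csv'.
csv_data = io.StringIO(
    "PassengerId,Name,Age\n"
    "1,Braund,22\n"
    "2,Cumings,38\n"
    "3,Heikkinen,26\n"
)

# nrows limits how many rows are read; usecols limits which columns.
partial = pd.read_csv(csv_data, nrows=2, usecols=['PassengerId', 'Age'])
print(partial.shape)  # (2, 2)
```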
Memory usage

Displaying only a few rows with the head() method described below keeps the output manageable, although the whole dataset still sits in the kernel’s memory. However, memory will only be an issue with very large datasets, so do not worry too much about it for now.
You can find out how much memory a DataFrame uses with the memory_usage() method:
titanic.memory_usage(deep=True).sum()
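memory_usage() also reports per-column usage, and dividing the total (in bytes) by 1024**2 gives mebibytes. A small self-contained example with a stand-in DataFrame:

```python
import pandas as pd

# A small stand-in DataFrame with one numeric and one string column.
df = pd.DataFrame({'a': range(1000), 'b': ['x'] * 1000})

# Per-column memory usage in bytes; deep=True also counts string contents.
per_column = df.memory_usage(deep=True)
total_mib = per_column.sum() / 1024**2
print(per_column)
print(f"{total_mib:.3f} MiB")
```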
Getting information about the DataFrame

Besides displaying the first and last rows of your DataFrame, there are a few other attributes and methods which can be useful for getting to know your data.
shape

The shape attribute can be used to determine the dimensions of the DataFrame ‘titanic’. It returns a tuple representing the number of rows and columns (rows, columns).
titanic.shape
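Because shape is a plain tuple, it can be unpacked directly into row and column counts. A sketch with a small stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Unpack the (rows, columns) tuple returned by .shape.
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 3 2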
head()

By default, the head() method displays the first five rows of the ‘titanic’ DataFrame. This provides a quick overview of the dataset, showing the column names and the first rows of data.
titanic.head()
tail()

Using the tail() method with 3 as the argument displays the last three rows of the ‘titanic’ DataFrame. This allows us to inspect the end of the dataset.
titanic.tail(3)
A note on head() and tail()

By default, Pandas limits the display of a DataFrame to a maximum of 60 rows. Asking for more than 60 rows (e.g. titanic.head(61)) will display the first five rows, an ellipsis (...) indicating that rows are being skipped, and then the last five rows of the selection. To display all rows without truncation, you can set the max_rows option to None using pd.set_option():
pd.set_option('display.max_rows', None)
Likewise, Pandas limits the display of a DataFrame to a maximum of 20 columns. To display all columns without truncation, you can set the max_columns option to None using pd.set_option():
pd.set_option('display.max_columns', None)
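Both options can be restored to their defaults afterwards with pd.reset_option():

```python
import pandas as pd

# Temporarily lift the display limits...
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# ...and restore the defaults when you are done.
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
print(pd.get_option('display.max_rows'))  # 60
```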
dtypes

The dtypes attribute can be used to view the data types of each column in the ‘titanic’ DataFrame. It reports the data type of each column, such as integer, float, or object (string). Knowing the data type of each column is useful, especially if you are working with data that you did not create yourself. Moreover, it matters for performing statistics and calculations later on, as these can only be done on numeric columns such as integers and floats.
titanic.dtypes
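When only the numeric columns are of interest, select_dtypes() can filter columns by data type. A sketch with a small stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Braund', 'Cumings'],   # object (string)
    'age': [22.0, 38.0],             # float64
    'survived': [0, 1],              # int64
})

# Keep only the columns whose dtype is numeric (ints and floats).
numeric = df.select_dtypes(include='number')
print(list(numeric.columns))  # ['age', 'survived']
```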
info()

The info() method is used to display a concise summary of the ‘titanic’ DataFrame, including the number of non-null values in each column, the data type of each column, and memory usage.
titanic.info()
notna() and isna()

The notna() conditional function returns True for each value that is not a null value (i.e., not NaN). Use the notna() method to create a boolean DataFrame indicating whether each element in ‘titanic’ is not null. This command returns a DataFrame of the same shape as ‘titanic’, where True indicates a non-null value, and False indicates a null value.
titanic.notna()
Use the notna() method followed by sum() to count the non-null values in each column of the ‘titanic’ DataFrame. This command sums the True values (non-null) along each column, providing a count of non-null values per column.
titanic.notna().sum()
The isna() conditional function returns True for each value that is a null value (i.e., NaN). Use the isna() method to create a boolean DataFrame indicating whether each element in ‘titanic’ is null. This command returns a DataFrame of the same shape as ‘titanic’, where True indicates a null value, and False indicates a non-null value.
titanic.isna()
Use the isna() method followed by sum() to count the null values in each column of the ‘titanic’ DataFrame. This command sums the True values (null) along each column, providing a count of null values per column.
titanic.isna().sum()
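The same boolean DataFrame can also answer row-oriented questions, for example which rows contain any missing value at all. A sketch with a small stand-in DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Embarked': ['S', np.nan, 'C'],
    'Age': [22.0, 38.0, np.nan],
})

# any(axis=1) is True for rows with at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(len(rows_with_missing))  # 2
```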
Filtering on column value

We can filter a DataFrame to select rows where a specific column contains a specific value. This command returns a subset of the DataFrame containing only the rows where the ‘PassengerId’ column has the value 666:
titanic[titanic['PassengerId'] == 666]
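Conditions can be combined with & (and) and | (or); note that each condition must be wrapped in parentheses. A sketch with a small stand-in DataFrame using two Titanic column names:

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass': [1, 3, 3, 2],
    'Age': [38.0, 22.0, 26.0, 35.0],
})

# Rows where Pclass is 3 AND Age is under 25.
subset = df[(df['Pclass'] == 3) & (df['Age'] < 25)]
print(len(subset))  # 1
```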
Exercise

Extract the two passengers whose point of departure (‘Embarked’) is unknown (NaN).
Solution
titanic[titanic['Embarked'].isna()]
This solution uses boolean indexing to filter the DataFrame ‘titanic’, extracting only the rows where the ‘Embarked’ column contains missing values (NaN). This approach allows us to identify the passengers whose point of departure is unknown.
Reading and writing Excel files

The to_excel() method is used to save the DataFrame to an Excel file. The argument index=False specifies that we do not want to include the row index in the Excel file. Note that reading and writing Excel files requires an Excel engine such as the openpyxl package to be installed.
# Write the DataFrame 'titanic' to an Excel file named 'titanic.xlsx'.
titanic.to_excel('titanic.xlsx', index=False)
We can read data from an Excel file into a DataFrame using the read_excel() function.
pd.read_excel('titanic.xlsx')
Note that we are only reading the data, not storing it in a variable. To do so, we could write:
new_df = pd.read_excel('titanic.xlsx')
Key points

- Reading data into Pandas from many different file formats or data sources is supported by read_* functions.
- The head(), tail(), and info() methods are convenient for a first check.
- The notna() and isna() methods are useful for finding and isolating missing data.
- Exporting data out of Pandas is provided by different to_* methods.