Python for Data Science

A Crash Course



Processing Tabular Data With pandas



Khalil El Mahrsi
2023
Creative Commons License

What Are Tabular Data?

  • Tabular data are data that are structured into a table (or data frame), i.e., into rows and columns
    • Row → entities, objects, observations, instances
      • e.g., patients, customers, students, plants, cars, houses, ...
    • Column → variables, features, attributes
      • e.g., age, gender, salary, ...
         Age   Job          Marital status   Housing   Loan default
    1    30    unemployed   divorced         no        yes
    2    35    management   married          yes       no
    3    25    services     single           no        no
    4    56    management   divorced         yes       no
    5    22    student      single           no        yes
    6    28    unemployed   single           no        yes
    7    42    unemployed   married          no        yes
    ...  ...   ...          ...              ...       ...

Variable Types

Quantitative (Numerical) Variables

  • A quantitative variable has values that are numeric and that reflect a notion of magnitude
  • Quantitative variables can be
    • Discrete → finite set of countable values (often integers)
      • e.g., number of children per family, number of rooms in a house, ...
    • Continuous → infinitely many possible values
      • e.g., age, height, weight, distance, date and time, ...
  • Math operations on quantitative variables make sense
    • e.g., a person who has 4 children has twice as many children as a person who has 2

Qualitative (Categorical) Variables

  • A qualitative variable's values represent categories (modalities, levels)
  • They do not represent quantities or orders of magnitude
  • Qualitative variables can be
    • Nominal → modalities are unordered
      • e.g., color, job category, marital status, ...
    • Ordinal → an order exists between modalities
      • e.g., cloth sizes (XS, S, M, L, XL, XXL, ...), satisfaction level (very dissatisfied, dissatisfied, neutral, satisfied, very satisfied), ...

What is pandas?

  • pandas is a powerful and intuitive Python package for tabular data analysis and manipulation
  • Based on NumPy → many concepts (indexing, slicing, ...) work similarly
  • Two main object types
    • DataFrame → 2-dimensional data structure storing data of different types (strings, integers, floats, ...) in columns
    • Series → represents a column (series of values)

Installing and Importing pandas

Installing pandas with conda (recommended)


        % conda install pandas
    

Installing pandas with pip


        % pip install pandas
    

Importing pandas (in Python scripts or notebooks)


        >>> import pandas as pd
    

Basic Functionalities

The Bank Marketing Data Set

  • Most examples in this section use the Bank Marketing Data Set
  • Variables
    • age: age in years (numeric)
    • job: the customer's job category (categorical)
    • marital: the customer's marital status (categorical)
    • education: the customer's education level (ordinal)
    • default: whether the customer has a loan in default (categorical)
    • housing: whether the customer has a housing loan (categorical)
    • loan: whether the customer has a personal loan (categorical)
    • ...
    • y: how the customer responded to a marketing campaign (target variable)

Reading and Writing Data Frames

  • CSV (comma-separated values) files are one of the most widely used formats for storing tabular data
    • Use the read_csv() reader function to load a data frame from a CSV file
    • Use the to_csv() writer method to save a data frame to disk as a CSV file

Loading a DataFrame from a CSV file


        df = pd.read_csv(file_path, sep=separator, ...) # sep defaults to ","
    

Saving a DataFrame to a CSV file


        df.to_csv(file_path, ...)
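
A concrete sketch (the file name is hypothetical, and the separator depends on how the file was exported):


        df = pd.read_csv("bank.csv", sep=";")    # assumed semicolon-separated file
        df.to_csv("bank_copy.csv", index=False)  # index=False omits the row index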
    

Head and Tail Methods

  • Use the head() and tail() methods to view a small sample of a Series or DataFrame
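
For illustration, a minimal sketch (assuming the bank data was loaded into df as above):


        df.head()   # first 5 rows by default
        df.tail(3)  # last 3 rows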

Shape, Columns, and Data Types

  • Use the shape attribute to get the shape of a DataFrame or a Series
    • DataFrame → tuple (row_count, column_count)
    • Series → singleton tuple (length,)
  • The column names of a DataFrame can be accessed using its columns attribute
  • Use the dtypes attribute to check the data types of a Series or a DataFrame's columns
    • pandas mostly relies on NumPy arrays and dtypes (bool, int, float, datetime64[ns], ...)
    • pandas also extends some NumPy types (CategoricalDtype, DatetimeTZDtype, ...)
    • Two ways to represent strings: object dtype (default) or StringDtype (recommended)
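
A quick sketch of these attributes (df being the bank data frame):


        df.shape    # (row_count, column_count)
        df.columns  # Index of column names
        df.dtypes   # one dtype per column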

Technical Summary

  • A technical summary of a DataFrame can be accessed using the info() method
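
For example:


        df.info()  # prints the summary described below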

Technical Summary

  • The technical summary contains
    • The type of the DataFrame
    • The row index (a RangeIndex by default) and its number of entries
    • The total number of columns
    • For each column
      • The column's name
      • The count of non-null values
      • The column's data type
    • Column count per data type
    • Total memory usage

Statistical Summary of Numerical Columns

  • Use the describe() method to access a statistical summary (mean, standard deviation, min, max, ...) of numerical columns of a DataFrame
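
For example:


        df.describe()         # summary of all numerical columns
        df["age"].describe()  # summary of a single Series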

Value Counts of Qualitative Columns

  • Use the value_counts() method to count the number of occurrences of each value in a Series (or DataFrame)
    • Use normalize=True in the method call to get relative frequencies (proportions) instead of counts
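
A short sketch on the bank data:


        df["marital"].value_counts()                # counts per category
        df["marital"].value_counts(normalize=True)  # proportions instead of counts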

Selecting a Single Column

  • To select a single column from a DataFrame, specify its name within square brackets → df[col]
    • The retrieved object is a Series
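
For example:


        ages = df["age"]  # a single column name → Series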

Selecting Multiple Columns

  • To select multiple columns, provide a list of column names within square brackets → df[[col_1, col_2, ...]]
    • The retrieved object is a DataFrame
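
For example:


        sub = df[["age", "job", "marital"]]  # a list of names → DataFrame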

Dropping Columns

  • Instead of selecting columns, you can drop unwanted columns using the drop() method
    • Be sure to specify axis=1 (otherwise, pandas will attempt to drop rows)
    • To modify the original data frame, use inplace=True in the method call
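
A minimal sketch (the choice of columns to drop is arbitrary):


        df.drop(["loan", "default"], axis=1)                # returns a new DataFrame
        df.drop(["loan", "default"], axis=1, inplace=True)  # modifies df directly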

Why Select Columns?

  • Two main motivations for selecting or dropping columns
    1. Restrict the data to meaningful variables that are useful for the intended data analysis
    2. Retain variables that are compatible with a technique you intend to use
      • e.g., some machine learning algorithms only make sense when applied to numerical variables

Filtering Rows

  • Rows can be removed using a boolean filter → df[bool_filter]
    • Filter contains True at position i → keep corresponding row
    • Filter contains False at position i → remove corresponding row
  • Most of the time, the filter involves conditions on the columns
    • e.g., keep married clients only
    • e.g., keep clients who are 30 or older
    • etc.
  • Conditions can be combined using logical operators
    • & → element-wise logical and (binary)
    • | → element-wise logical or (binary)
    • ~ → element-wise logical negation (unary)
    • Individual conditions must be parenthesized, as these operators bind more tightly than comparisons

Filtering Rows

  • Example: clients who are married or divorced, unemployed, and 40 or older
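
A sketch of this filter (category values as in the bank data):


        mask = (
            df["marital"].isin(["married", "divorced"])
            & (df["job"] == "unemployed")
            & (df["age"] >= 40)
        )
        df[mask]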

Why Filter Rows?

  • Filtering rows can be motivated by multiple reasons
    • Limiting the analysis to a specific subpopulation of interest
    • Handling outliers and missing values (drop problematic rows)
    • Performance considerations (subsampling a massive data set)

Sorting Data

  • Use the sort_values() method to sort a DataFrame or a Series
    • Data frames can be sorted on multiple columns by providing the list of column names
    • Sorting order (ascending or descending) can be controlled with the ascending argument
    • Use inplace=True in the method call to modify the original DataFrame or Series

Sorting Data

  • Example: sort the data frame by increasing order of age
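
For example:


        df.sort_values("age")  # ascending by default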

Sorting Data

  • Example: sort the data frame by decreasing alphabetical order of marital status and education, and increasing order of age
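
A sketch of this multi-column sort:


        df.sort_values(
            ["marital", "education", "age"],
            ascending=[False, False, True],
        )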

Indexing Rows and Columns

Two main ways of indexing data frames

  • Label-based indexing with .loc
            df.loc[row_lab_index, col_lab_index]
          
    • A label-based index can be
      • A single label (e.g., "age")
      • A list or array of labels (e.g., ["age", "job", "loan"])
      • A slice with labels (e.g., "age":"balance")
      • A boolean array or list
  • Position-based indexing with .iloc
            df.iloc[row_pos_index, col_pos_index]
        
    • Similar to NumPy arrays indexing
    • A position-based index can be
      • An integer (e.g., 4)
      • A list or array of integers (e.g., [4, 2, 10])
      • A slice with integers (e.g., 2:10:2)
      • A boolean array or list
  • If you don't want to index a dimension, leave its index empty or replace it with a colon (:)

Position-Based Indexing

  • Using .iloc to get rows and columns by position
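
A minimal sketch:


        df.iloc[0]          # first row, as a Series
        df.iloc[:5, :3]     # first 5 rows, first 3 columns
        df.iloc[[1, 3], 0]  # rows at positions 1 and 3, first column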

Label-Based Indexing

  • Using .loc to get rows and columns by label
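
A minimal sketch (assuming the default integer row labels):


        df.loc[:, "age":"balance"]                   # all rows, columns age through balance (inclusive)
        df.loc[df["age"] >= 40, ["job", "marital"]]  # boolean row filter + column list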

Modifying a DataFrame's Row Index

  • A DataFrame's row index can be changed using the set_index() method
    • Use inplace=True in the method call to modify the original DataFrame
    • The new index can be
      • One or more existing columns (provide the list of names)
      • One or more arrays, serving as the new index (less common)
    • The new index can replace the existing one or expand it
    • The ability to modify the DataFrame's index enables more interesting label-based indexing of the rows

Modifying a DataFrame's Row Index

  • Example: use the marital column as the DataFrame's row index (instead of the default RangeIndex)
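
A sketch of this re-indexing:


        df_m = df.set_index("marital")
        df_m.loc["married"]  # all rows whose (new) row label is "married"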

Modifying a DataFrame's Row Index

  • Multiple columns can be used as the (multi-level) row index
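
For example (the education level "tertiary" is an assumption about the data's categories):


        df_me = df.set_index(["marital", "education"])
        df_me.loc[("married", "tertiary")]  # index with a tuple of labels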

Hierarchical Indexing (MultiIndex)

  • A pandas DataFrame or Series can have a multi-level (hierarchical) index

Resetting a DataFrame's Index

  • You can reset the DataFrame's index to the default one by using the reset_index() method
    • By default, pandas will re-insert the index as columns in the dataset (use drop=True in the method call to drop it instead)
    • Use inplace=True in the method call to modify the original DataFrame directly
    • For a MultiIndex, you can select which levels to reset (level parameter)
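
A minimal sketch (df_m being the marital-indexed frame from above):


        df_m.reset_index()           # marital becomes a regular column again
        df_m.reset_index(drop=True)  # discard the index entirely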

Working with Variables

Data Cleaning

  • Toy data sets are clean and tidy
  • Real data sets are messy and dirty
    • Duplicates
    • Missing values
      • e.g., a sensor was offline or broken, a person didn't answer a question in a survey, ...
    • Outliers
      • e.g., extreme amounts, ...
    • Value errors
      • e.g., negative ages, birthdates in the future, ...
    • Inconsistent category encoding and spelling mistakes
      • e.g., "unemployed", "Unemployed", "Unemployd", ...
    • Inconsistent formats
      • e.g., 2020-11-19, 2020/11/12, 2020-19-11, ...
  • If nothing is done → garbage in, garbage out!!! 💩💩💩

Deduplicating Data

  • Use the duplicated() method to identify duplicated rows in a DataFrame (or values in a Series)
  • Use drop_duplicates() to remove duplicates from a DataFrame or Series
    • Use inplace=True to modify the original DataFrame or Series
    • Use the subset argument to limit the columns on which to search for duplicates
    • Use the keep argument to indicate which occurrence to retain ("first", "last", or False to drop all duplicates)
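
A minimal sketch (the subset columns are arbitrary):


        df.duplicated()       # boolean Series marking repeated rows
        df.drop_duplicates()  # keep the first occurrence of each row
        df.drop_duplicates(subset=["age", "job"], keep="last")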

Dealing with Missing Values

  • Two main strategies for dealing with missing values
    • Remove rows (or columns) with missing values → viable when the data set is big (or if impacted columns are not important)
    • Replace the missing values
      • Using basic strategies (e.g., replace with a constant, replace with the column's median, ...)
      • Using advanced strategies (e.g., ML algorithms that infer missing values based on values of other columns)

Dropping Missing Values

  • Use the dropna() method to remove rows (or columns) with missing values
    • Important arguments
      • axis: axis along which missing values will be removed
      • how: whether to remove a row or column if all values are missing ("all") or if any value is missing ("any")
      • subset: labels on other axis to consider when looking for missing values
      • inplace: if True, do the operation on the original object
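
A minimal sketch:


        df.dropna()                   # drop rows with at least one missing value
        df.dropna(axis=1, how="all")  # drop columns whose values are all missing
        df.dropna(subset=["age"])     # only consider missing values in age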

Replacing Missing Values

  • Use the fillna() method to replace missing values in a DataFrame
    • Important arguments
      • value: replacement value
      • axis: axis along which to fill missing values
      • inplace: if True, do the operation on the original DataFrame
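
A minimal sketch (the replacement strategies are illustrative):


        df.fillna(0)                          # a constant, for the whole frame
        df["age"].fillna(df["age"].median())  # the column's median, returns a new Series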

Recasting Variables

  • Variables should be typed with the most appropriate data type
    • Binary variables should be encoded as booleans or 0, 1
    • Discrete quantitative variables should be encoded as integers
    • Depending on the intended goal, categorical features can be dummy-encoded
    • etc.
  • Use the convert_dtypes() method to let pandas attempt to infer the most appropriate data types for a data frame's columns
  • Use the astype() method to recast columns (Series) to other types
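
For example (assuming age contains no missing values):


        df = df.convert_dtypes()               # let pandas infer better dtypes
        df["age"] = df["age"].astype("int64")  # explicit recast of one column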

Recasting Variables

  • Original data types in the bank data frame

Recasting Variables

  • Data types after using the convert_dtypes() method

Recasting Variables

  • Converting binary (yes/no) variables to 0, 1
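
A sketch of this conversion (default, housing, and loan being the yes/no columns listed earlier):


        for col in ["default", "housing", "loan"]:
            df[col] = (df[col] == "yes").astype("int64")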

Recasting Variables

  • Recasting categorical variables from strings to pandas' CategoricalDtype
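
A sketch, assuming education has the ordered levels primary < secondary < tertiary (the actual category labels are an assumption):


        from pandas.api.types import CategoricalDtype

        edu_type = CategoricalDtype(
            categories=["primary", "secondary", "tertiary"], ordered=True
        )
        df["education"] = df["education"].astype(edu_type)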

Recasting Variables

  • Sorting on education now respects the category order

Creating New Features

  • Data analyses and machine learning often involve feature engineering, i.e., creating new features from existing ones based on domain knowledge (and intuition)
  • Examples of feature engineering
    • Extracting day of week, month, year, etc. from datetime variables
    • Reverse geocoding (i.e., creating country, state, department, etc. fields from geographical coordinates)
    • Binning
    • One-hot encoding
    • Log-transformation
    • etc.
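
Two illustrative sketches on the bank data (the bin edges and the choice of log transform are arbitrary):


        import numpy as np

        # binning: discretize age into labeled intervals
        df["age_group"] = pd.cut(
            df["age"],
            bins=[0, 30, 45, 60, 120],
            labels=["<30", "30-44", "45-59", "60+"],
        )

        # log-transformation of a skewed, non-negative quantity
        df["log_balance"] = np.log1p(df["balance"].clip(lower=0))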

One-Hot Encoding

  • Machine learning tasks often require one-hot encoding of categorical variables
    • The variable is transformed into multiple columns
    • Each column represents a category of the variable
    • A 1 in a category's column indicates that the original variable had that category
    • All the other categories' columns contain 0
  • Use the get_dummies() function to one-hot encode categorical columns

One-Hot Encoding

  • Dummy encode the education variable, join the dummy variables to the data frame (more on joins later), and drop the original column
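
A sketch of this pipeline:


        dummies = pd.get_dummies(df["education"], prefix="education")
        df = df.join(dummies).drop("education", axis=1)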

Grouping and Aggregating Data

Group By

  • Group by refers to a 3-step process
    1. Splitting the data into groups based on some criteria (e.g., by marital status)
    2. Applying a function to each group separately
      • Aggregation (e.g., computing summary statistics)
      • Transformation (e.g., standardization, NA filling, ...)
      • Filtering (e.g., remove groups with few rows or based on group aggregates, ...)
    3. Combining the results into a data structure (a DataFrame most of the time)

Group By

  • Example: group by marital status and calculate the min, median, and max balance as well as the median age for each group
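
A sketch using named aggregations:


        df.groupby("marital").agg(
            balance_min=("balance", "min"),
            balance_median=("balance", "median"),
            balance_max=("balance", "max"),
            age_median=("age", "median"),
        )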

Group By

  • Example: group by marital status and education level, then calculate median balance, age mean and standard deviation, and number of rows in each group
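
A sketch of this two-level grouping:


        df.groupby(["marital", "education"]).agg(
            balance_median=("balance", "median"),
            age_mean=("age", "mean"),
            age_std=("age", "std"),
            rows=("age", "size"),
        )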

Reshaping Data Frames

Pivoting

  • Pivoting is useful when studying how a given (numeric) variable is conditioned by two or more (discrete) variables
    • The conditioning variables' values are used as dimensions (row and column indexes)
    • The cells contain the values of the conditioned variable for the corresponding dimensions
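
A self-contained sketch on toy data (the city/year/population values are made up for illustration):


        long = pd.DataFrame({
            "city": ["Paris", "Paris", "Lyon", "Lyon"],
            "year": [2020, 2021, 2020, 2021],
            "population": [2.16, 2.15, 0.52, 0.52],
        })
        # one row per city, one column per year, population in the cells
        wide = long.pivot(index="city", columns="year", values="population")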

Melting

  • Melting can be seen as the inverse of pivoting
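
Continuing the toy sketch above, melting recovers the long format:


        wide.reset_index().melt(
            id_vars="city", var_name="year", value_name="population"
        )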

Cross Tabulations

  • Use the crosstab() function to compute cross tabulations (i.e., co-occurrence counts) of two or more categorical Series
    • Can be normalized on rows, columns, etc. using the normalize argument
    • Can be marginalized by passing margins=True in the function call
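
A sketch on the bank data (y being the campaign response variable listed earlier):


        pd.crosstab(df["marital"], df["y"])                     # co-occurrence counts
        pd.crosstab(df["marital"], df["y"], normalize="index")  # row-wise proportions
        pd.crosstab(df["marital"], df["y"], margins=True)       # add row/column totals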

Other Reshaping Operations

  • Other reshaping operations include
    • Stacking and unstacking
    • Pivot tables (generalization of the simple pivot)
    • Exploding list-like columns
    • etc.

Working with Multiple Tables

  • Real data sets are often organized in multiple data tables
  • Each table describes one entity type
    • e.g., a table describes customers, another table describes products, and a third table describes purchases
  • Entities can reference other entities they are related to
  • In order to conduct your analysis, you need to “patch” these tables together

Merging Data Frames

  • Use the merge() method to merge a DataFrame with another DataFrame (or Series)
  • The merge is done with a database-style (SQL) join
  • Usually based on one or more common columns (e.g., the customer_id column in both customers and purchases)
    • If a row from the left object and a row from the right object have matching values for the join columns → a row combining the two is produced
    • If no match is found → output depends on the join type
      • Inner join → no row is produced
      • Left join → for left rows with no match, produce a row (with NA filled right row)
      • Right join → for right rows with no match, produce a row (with NA filled left row)
      • Outer join → combination of left and right join
  • Joins can also be performed on rows (less common)
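
A sketch with the customers and purchases tables described above (assuming both were loaded as data frames):


        # left join: keep all customers, NA-fill purchase columns when no match
        customers.merge(purchases, on="customer_id", how="left")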

Merging Data Frames

  • Multiple tables can be merged together (consecutively)
  • Sometimes, the merge is on columns that do not have the same names (e.g., the id column in products and the product_id column in purchases)
    • Use the left_on and right_on arguments to specify the column names in the left and right data frames respectively
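
For example (id and product_id as described above):


        purchases.merge(products, left_on="product_id", right_on="id")
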
This work is licensed under the
Creative Commons
Attribution-NonCommercial-ShareAlike 4.0
International Public License
(CC BY-NC-SA 4.0)