| | Age | Job | Marital status | Housing | Loan default |
|---|---|---|---|---|---|
| 1 | 30 | unemployed | divorced | no | yes |
| 2 | 35 | management | married | yes | no |
| 3 | 25 | services | single | no | no |
| 4 | 56 | management | divorced | yes | no |
| 5 | 22 | student | single | no | yes |
| 6 | 28 | unemployed | single | no | yes |
| 7 | 42 | unemployed | married | no | yes |
| ... | ... | ... | ... | ... | ... |
Encoding a categorical variable as a number (e.g., int) does not make it
quantitative!
DataFrame → 2-dimensional data structure storing data
of different types (strings, integers, floats, ...) in columns
Series → represents a column (series of values)
Installing pandas with conda (recommended)
% conda install pandas
Installing pandas with pip
% pip install pandas
Importing pandas (in Python scripts or notebooks)
>>> import pandas as pd
age: age in years (numeric)
job: the customer's job category (categorical)
marital: the customer's marital status (categorical)
education: the customer's education level (ordinal)
default: whether the customer has a loan in default (categorical)
housing: whether the customer has a housing loan (categorical)
loan: whether the customer has a personal loan (categorical)
y: how the customer responded to a marketing campaign (target variable)
read_csv() reader function to load a data frame
from a CSV file
to_csv() writer method to save a data frame to disk
as a CSV file
Loading a DataFrame from a CSV file
df = pd.read_csv(file_path, sep=separator, ...) # sep defaults to ","
Saving a DataFrame to a CSV file
df.to_csv(file_path, ...)
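As a minimal sketch of the round trip described above: the in-memory text and the `;` separator below are made up for illustration and stand in for a real file path.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data standing in for a real CSV file
raw = "age;job;marital\n30;unemployed;divorced\n35;management;married\n"

# read_csv() accepts a path or any file-like object; sep defaults to ","
df = pd.read_csv(io.StringIO(raw), sep=";")

# to_csv() returns the CSV text when no path is given;
# index=False leaves the row index out of the output
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file path to `to_csv()` instead writes the same text to disk.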
head() and tail() methods to view a
small sample of a Series or DataFrame
shape attribute to get the shape of a
DataFrame or a Series
DataFrame → tuple
(row_count, column_count)
Series → singleton tuple
(length, )
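A short sketch of `head()`, `tail()`, and `shape`, using a small hand-made data frame that mirrors the first rows of the example table (the real dataset is larger):

```python
import pandas as pd

# Small illustrative data frame (values copied from the example table)
df = pd.DataFrame({
    "age": [30, 35, 25, 56, 22, 28, 42],
    "job": ["unemployed", "management", "services", "management",
            "student", "unemployed", "unemployed"],
})

first_rows = df.head(3)   # first 3 rows (5 by default)
last_rows = df.tail(2)    # last 2 rows (5 by default)

print(df.shape)           # (row_count, column_count)
print(df["age"].shape)    # singleton tuple for a Series
```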
The column names of a DataFrame can be accessed using its
columns attribute
dtypes attribute to check the data types
of a Series or a DataFrame's columns
NumPy data types (bool, int, float, datetime64[ns], ...)
pandas extension data types (CategoricalDtype, DatetimeTZDtype, ...)
Strings: object dtype (default) or StringDtype (recommended)
A concise summary of a DataFrame can be accessed using the
info() method
Index type of the DataFrame (RangeIndex in the example) and its number of
entries
describe() method to access a statistical summary
(mean, standard deviation, min, max, ...) of numerical columns of a
DataFrame
value_counts() method to count the number of
occurrences of each value in a Series (or
DataFrame)
Pass normalize=True in the method call to get proportions instead of counts
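The inspection tools above can be tried together on a toy data frame. The values below are illustrative only and are not the full dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25, 56, 22, 28, 42],
    "job": ["unemployed", "management", "services", "management",
            "student", "unemployed", "unemployed"],
    "housing": ["no", "yes", "no", "yes", "no", "no", "no"],
})

print(df.columns)   # column labels
print(df.dtypes)    # one dtype per column
df.info()           # index type, entry count, per-column dtype and non-null count

# Statistical summary of the numerical columns only
summary = df.describe()

# Absolute counts, then relative frequencies (proportions that sum to 1)
job_counts = df["job"].value_counts()
job_shares = df["job"].value_counts(normalize=True)
print(job_counts)
print(job_shares)
```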
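To select a single column of a DataFrame, specify its name
within square brackets →
df[col] returns a Series
To select several columns, pass a list of names →
df[[col_1, col_2, ...]] returns a DataFrame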
To remove columns, use the drop() method
Specify axis=1
(otherwise, pandas will attempt to drop rows)
Pass inplace=True in the
method call to modify the original DataFrame
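A small sketch of column selection and removal. The data frame is a made-up excerpt, and `drop()` is used without `inplace=True` so that the original stays untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25],
    "job": ["unemployed", "management", "services"],
    "housing": ["no", "yes", "no"],
})

ages = df["age"]             # single name → Series
subset = df[["age", "job"]]  # list of names → DataFrame

# axis=1 targets columns; without it, drop() looks for a row labelled "housing"
without_housing = df.drop("housing", axis=1)
```

The call returns a new data frame, so `df` still contains the `housing` column afterwards.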
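To filter rows, pass a boolean Series within square brackets →
df[bool_filter]
True at position i → keep corresponding row
False at position i → remove corresponding row
Conditions can be combined with logical operators:
& → bit-wise logical and (binary)
| → bit-wise logical or (binary)
~ → bit-wise logical negation (unary)
Each condition evaluates to a boolean Series. The different conditions' Series are then
combined into one Series used to filter the rows.
No need for for loops!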
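Boolean filtering can be sketched as follows. The thresholds and the toy data are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25, 56, 22, 28, 42],
    "job": ["unemployed", "management", "services", "management",
            "student", "unemployed", "unemployed"],
})

# Each comparison yields a boolean Series aligned with the rows of df
is_young = df["age"] < 30
is_unemployed = df["job"] == "unemployed"

young_and_unemployed = df[is_young & is_unemployed]
young_or_unemployed = df[is_young | is_unemployed]
not_young = df[~is_young]

# When conditions are written inline, each one needs its own parentheses
# because & and | bind more tightly than comparison operators
same_result = df[(df["age"] < 30) & (df["job"] == "unemployed")]
```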
sort_values() method to sort a
DataFrame or a Series
Set the sort order with the ascending argument
Pass inplace=True in the method call to modify the original
DataFrame or Series
Although education is
an ordinal variable, pandas sorts it alphabetically since it is encoded as a
string!
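The sketch below illustrates both the `ascending` argument and the alphabetical-sorting pitfall. The education labels `"low"`, `"medium"`, and `"high"` are hypothetical, chosen because their alphabetical order visibly differs from their natural order.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25, 56],
    # Hypothetical ordinal labels stored as plain strings
    "education": ["medium", "low", "high", "low"],
})

# Numeric column, largest first
by_age_desc = df.sort_values("age", ascending=False)

# String column: sorted alphabetically, so "high" comes before "low"
by_education = df.sort_values("education")
print(by_education)
```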
Two main ways for indexing data frames
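.loc → label-based indexing: df.loc[row_lab_index, col_lab_index]
Each index can be a single label ("age"), a list of labels (["age", "job", "loan"]),
or a slice of labels ("age":"balance")
.iloc → position-based indexing: df.iloc[row_pos_index, col_pos_index]
Each index can be a single position (4), a list of positions ([4, 2, 10]),
a slice of positions (2:10:2), or a bare colon to select everything (:)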
.iloc to get rows and columns by position
.loc to get rows and columns by label
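A brief sketch contrasting the two indexers on a toy data frame. The `balance` values are invented. Note that label slices include both endpoints, whereas position slices exclude the stop position.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25, 56, 22],
    "job": ["unemployed", "management", "services", "management", "student"],
    "marital": ["divorced", "married", "single", "divorced", "single"],
    "balance": [100, 2500, 300, 4000, 50],   # made-up values
})

# Label-based (the default RangeIndex uses 0, 1, 2, ... as row labels)
one_cell = df.loc[0, "age"]
some_cols = df.loc[:, ["age", "job"]]
label_slice = df.loc[1:3, "age":"marital"]   # rows 1..3 and columns age..marital, inclusive

# Position-based
pos_cell = df.iloc[4, 0]
pos_rows = df.iloc[[4, 2, 0]]
pos_slice = df.iloc[0:4:2, :]                # positions 0 and 2, all columns
```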
Changing a DataFrame's Row Index
A DataFrame's
row index can be changed
using the set_index() method
Pass inplace=True in the method call to modify the original
DataFrame
Changing a DataFrame's index enables more
interesting label-based indexing of the rows
Example: using the marital column as the
DataFrame's row index (instead of the default
RangeIndex)
A DataFrame or Series can have a
multi-level (hierarchical) index (MultiIndex)
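A minimal sketch of `set_index()` with one column and with two columns (which produces a `MultiIndex`), on a made-up excerpt of the data:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25, 56, 22, 28, 42],
    "job": ["unemployed", "management", "services", "management",
            "student", "unemployed", "unemployed"],
    "marital": ["divorced", "married", "single", "divorced",
                "single", "single", "married"],
})

# Use the marital column as the row index (returns a new DataFrame)
by_marital = df.set_index("marital")
singles = by_marital.loc["single"]      # all rows labelled "single"

# Two columns → two-level (hierarchical) index
multi = df.set_index(["marital", "job"])
singles_by_job = multi.loc["single"]    # select on the first level only
```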
Resetting a DataFrame's Index
Reset a DataFrame's index to the default one
by using the reset_index() method
The old index is kept as a column (pass drop=True in the method call to drop it instead)
Pass inplace=True in the method call to modify the original
DataFrame directly
With a MultiIndex, you can select which levels to reset
(level parameter)
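The following sketch shows `reset_index()` with and without `drop=True`, and with the `level` parameter on a two-level index. The data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25],
    "job": ["unemployed", "management", "services"],
    "marital": ["divorced", "married", "single"],
})

by_marital = df.set_index("marital")

restored = by_marital.reset_index()            # marital becomes a column again
discarded = by_marital.reset_index(drop=True)  # marital is thrown away

# With a MultiIndex, reset only the "job" level and keep "marital" as the index
multi = df.set_index(["marital", "job"])
partial = multi.reset_index(level="job")
```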
Inconsistent category labels: "unemployed", "Unemployed",
"Unemployd", ...
Inconsistent date formats: 2020-11-19, 2020/11/12,
2020-19-11, ...
duplicated() method to identify duplicated rows in a
DataFrame (or values in a Series)
drop_duplicates() to remove duplicates from a
DataFrame or Series
Pass inplace=True to modify the original
DataFrame or Series
subset argument to limit the columns on which to
search for duplicates
keep argument to indicate which of the
duplicates must be retained ("first", "last", or False to drop all duplicates)
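A short sketch of `duplicated()` and `drop_duplicates()` with the `subset` and `keep` arguments. The names and cities are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann"],
    "city": ["Paris", "Rome", "Paris", "Oslo"],
})

# True for every row that repeats an earlier row in full
dup_mask = df.duplicated()

# Remove fully duplicated rows, keeping the first occurrence (default)
deduped = df.drop_duplicates()

# Only compare the name column and keep the last occurrence of each name
last_per_name = df.drop_duplicates(subset=["name"], keep="last")

# keep=False drops every row whose name appears more than once
unique_names_only = df.drop_duplicates(subset=["name"], keep=False)
```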

persons.csv
dropna() method to remove rows (or columns) with
missing values
axis: axis along which missing values will be removed
how: whether to remove a row or column if all values are
missing ("all") or if any value is missing
("any")
subset: labels on other axis to consider when looking for
missing values
inplace: if True, do the operation on the
original object
fillna() method to replace missing values in a
DataFrame
value: replacement value
axis: axis along which to fill missing values
inplace: if True, do the operation on the
original DataFrame
0, 1
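A sketch of `dropna()` and `fillna()` on a tiny data frame with gaps. The replacement values (column mean, `"unknown"`) are one possible choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [30, np.nan, 25, np.nan],
    "job": ["student", None, "services", "management"],
})

complete_rows = df.dropna()                # how="any" is the default
not_empty_rows = df.dropna(how="all")      # drop rows where every value is missing
age_known = df.dropna(subset=["age"])      # only look at the age column

# A dict gives one replacement value per column
filled = df.fillna({"age": df["age"].mean(), "job": "unknown"})
```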
convert_dtypes() method to let pandas attempt to
infer the most appropriate data types for a data frame's columns
astype() method to recast columns
(Series) to other types
convert_dtypes() method
Ordinal columns can be recast to an ordered CategoricalDtype
Sorting by education respects the category order now
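As a sketch of type conversion: `convert_dtypes()` lets pandas pick the types, while `astype()` with an ordered `CategoricalDtype` makes the ordering of an ordinal column explicit. The labels `"low"`, `"medium"`, and `"high"` are hypothetical stand-ins for the real education levels.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25],
    "education": ["medium", "low", "high"],   # hypothetical ordinal labels
})

# Let pandas infer the most appropriate (nullable) dtypes
converted = df.convert_dtypes()
print(converted.dtypes)

# Declare the category order explicitly, then recast the column
level_order = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
df["education"] = df["education"].astype(level_order)

# Sorting now follows the declared order instead of the alphabet
sorted_df = df.sort_values("education")
```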
A 1 in a column indicates which category the original
variable had; the other dummy columns hold 0
get_dummies() function to one-hot encode categorical
columns
Example: one-hot encode the education variable, join the dummy variables
to the data frame (more on joins later),
and drop the original column
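The encode, join, and drop steps can be sketched as follows. The `prefix` argument and the toy values are choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25],
    "education": ["primary", "tertiary", "primary"],
})

# One indicator column per category, named education_<category>
dummies = pd.get_dummies(df["education"], prefix="education")

# Attach the dummy columns (aligned on the row index), then drop the original column
encoded = df.join(dummies).drop("education", axis=1)
print(encoded)
```

Depending on the pandas version, the indicator columns hold booleans or 0/1 integers; `dtype=int` can be passed to `get_dummies()` to force integers.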
DataFrame most of the time)
value_counts() works.
DataFrameGroupBy objects and possible
aggregations and transformations.
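For context, `DataFrameGroupBy` objects are what `DataFrame.groupby()` returns. The sketch below, on made-up data, shows one aggregation and one transformation, and relates group sizes to `value_counts()`.

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["unemployed", "management", "services", "management",
            "student", "unemployed", "unemployed"],
    "age": [30, 35, 25, 56, 22, 28, 42],
})

grouped = df.groupby("job")   # a DataFrameGroupBy object

# Counting rows per group gives the same numbers as df["job"].value_counts()
group_sizes = grouped.size()

# Aggregation: one value per group
mean_age = grouped["age"].mean()

# Transformation: one value per original row (here, the mean age of the row's group)
job_mean_per_row = grouped["age"].transform("mean")
```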
crosstab() function to compute cross tabulations
(i.e., co-occurrence counts) of two or more categorical
Series
Use the normalize argument to get proportions instead of counts
Pass margins=True in the function
call to add row and column totals
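A small sketch of `crosstab()` on two categorical columns, using values copied from the example table. `normalize="index"` is one of several accepted settings and normalizes each row.

```python
import pandas as pd

df = pd.DataFrame({
    "marital": ["divorced", "married", "single", "divorced",
                "single", "single", "married"],
    "housing": ["no", "yes", "no", "yes", "no", "no", "no"],
})

# Co-occurrence counts: rows = marital status, columns = housing loan
counts = pd.crosstab(df["marital"], df["housing"])

# Proportions within each row
row_shares = pd.crosstab(df["marital"], df["housing"], normalize="index")

# Add "All" totals for rows and columns
with_totals = pd.crosstab(df["marital"], df["housing"], margins=True)
print(with_totals)
```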
merge() method to merge a DataFrame with
another DataFrame (or Series)
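Use the on argument when the key column has the same name in both data frames (e.g., the
customer_id column in both customers and purchases)
Use the left_on and right_on arguments to
specify the column names in the left and right data frames respectively when they differ (e.g., the
id column in products and the
product_id column in purchases)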
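A minimal sketch of both merge styles. The table names `customers`, `products`, and `purchases` and their key columns follow the text above, while the rows and the `name` and `label` columns are invented for illustration.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
products = pd.DataFrame({"id": [10, 20], "label": ["pen", "book"]})
purchases = pd.DataFrame({"customer_id": [1, 1, 3], "product_id": [10, 20, 10]})

# Same key name on both sides → on
with_customers = purchases.merge(customers, on="customer_id")

# Different key names → left_on / right_on
full = with_customers.merge(products, left_on="product_id", right_on="id")
print(full)
```

Both calls use the default inner join, so customer 2, who has no purchases, does not appear in the result.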