 | Age | Job | Marital status | Housing | Loan default |
---|---|---|---|---|---|
1 | 30 | unemployed | divorced | no | yes |
2 | 35 | management | married | yes | no |
3 | 25 | services | single | no | no |
4 | 56 | management | divorced | yes | no |
5 | 22 | student | single | no | yes |
6 | 28 | unemployed | single | no | yes |
7 | 42 | unemployed | married | no | yes |
... | ... | ... | ... | ... | ... |
Encoding a categorical variable with numbers (int) does not make it quantitative!
DataFrame → 2-dimensional data structure storing data of different types (strings, integers, floats, ...) in columns
Series → represents a column (series of values)
Installing pandas with conda (recommended)
% conda install pandas
Installing pandas with pip
% pip install pandas
Importing pandas (in Python scripts or notebooks)
>>> import pandas as pd
age: age in years (numeric)
job: the customer's job category (categorical)
marital: the customer's marital status (categorical)
education: the customer's education level (ordinal)
default: whether the customer has a loan in default (categorical)
housing: whether the customer has a housing loan (categorical)
loan: whether the customer has a personal loan (categorical)
y: how the customer responded to a marketing campaign (target variable)
read_csv() → reader function to load a data frame from a CSV file
to_csv() → writer method to save a data frame to disk as a CSV file
Loading a DataFrame from a CSV file
df = pd.read_csv(file_path, sep=separator, ...) # sep defaults to ","
Saving a DataFrame to a CSV file
df.to_csv(file_path, ...)
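As a minimal sketch of the two calls above, the following round trip reads a small semicolon-separated CSV and writes it back with the default comma separator. The CSV content is made up for illustration, and an in-memory buffer stands in for a file path so that the example runs without touching the disk.

```python
import io

import pandas as pd

# Hypothetical CSV content (not from the original data set).
csv_text = "age;job;marital\n30;unemployed;divorced\n35;management;married\n"

# read_csv accepts a path or a file-like object; sep defaults to ",".
df = pd.read_csv(io.StringIO(csv_text), sep=";")

# to_csv accepts a path or a buffer; index=False leaves out the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

With a real file, the buffer would simply be replaced by a path string in both calls.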
head() and tail() → methods to view a small sample of a Series or DataFrame
shape → attribute to get the shape of a DataFrame or a Series
DataFrame → tuple (row_count, column_count)
Series → singleton tuple (length,)
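The snippet below is a small sketch of head(), tail() and shape on a data frame built from the first rows of the example table. Building the frame from a dictionary is only a convenience for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25, 56, 22, 28, 42],
    "job": ["unemployed", "management", "services", "management",
            "student", "unemployed", "unemployed"],
})

first_rows = df.head(3)   # first 3 rows (head() alone returns 5)
last_rows = df.tail(2)    # last 2 rows

print(df.shape)           # (row_count, column_count)
print(df["age"].shape)    # (length,)
```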
The column names of a DataFrame can be accessed using its columns attribute
dtypes → attribute to check the data types of a Series or a DataFrame's columns
NumPy data types (bool, int, float, datetime64[ns], ...)
pandas extension types (CategoricalDtype, DatetimeTZDtype, ...)
Strings: object dtype (default) or StringDtype (recommended)
A summary of a DataFrame can be accessed using the info() method
It reports the type of the DataFrame's index (RangeIndex in the example) and its number of entries
describe() → method to access a statistical summary (mean, standard deviation, min, max, ...) of numerical columns of a DataFrame
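To make the inspection tools concrete, here is a short sketch that calls dtypes, info() and describe() on a four-row frame taken from the example table. By default, describe() only summarizes the numerical columns, which is why only age appears in the result.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25, 56],
    "job": ["unemployed", "management", "services", "management"],
    "housing": ["no", "yes", "no", "yes"],
})

print(df.columns)   # column names
print(df.dtypes)    # one data type per column
df.info()           # index type, number of entries, dtypes, memory usage

summary = df.describe()   # count, mean, std, min, quartiles, max
print(summary)
```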
value_counts() → method to count the number of occurrences of each value in a Series (or DataFrame)
Specify normalize=True in the method call to get relative frequencies instead of counts
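A quick sketch of value_counts() on the job column of the example table, once with raw counts and once with normalize=True:

```python
import pandas as pd

job = pd.Series(
    ["unemployed", "management", "services", "management",
     "student", "unemployed", "unemployed"],
    name="job",
)

counts = job.value_counts()                 # most frequent value first
shares = job.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```

Multiplying the normalized result by 100 gives percentages.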
To select a column of a DataFrame, specify its name within square brackets
df[col] → Series
df[[col_1, col_2, ...]] → DataFrame
Use the drop() method to remove columns from a DataFrame
Specify axis=1 (otherwise, will attempt to drop rows)
Specify inplace=True in the method call to modify the original DataFrame
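The following sketch puts column selection and drop() side by side on a three-row frame taken from the example table. It shows the difference between getting a new object back and modifying the original with inplace=True.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25],
    "job": ["unemployed", "management", "services"],
    "housing": ["no", "yes", "no"],
})

ages = df["age"]              # single name → Series
subset = df[["age", "job"]]   # list of names → DataFrame

# Returns a new DataFrame; df itself is left unchanged.
without_housing = df.drop("housing", axis=1)

# Modifies df directly and returns None.
df.drop(["job"], axis=1, inplace=True)
```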
Rows can be filtered with a boolean Series → df[bool_filter]
True at position i → keep corresponding row
False at position i → remove corresponding row
Conditions can be combined with logical operators:
& → bit-wise logical and (binary)
| → bit-wise logical or (binary)
~ → bit-wise logical negation (unary)
Each condition evaluates to a boolean Series. The different conditions' Series are then combined into one Series used to filter the rows.
No need for for loops!
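As an illustration of boolean filtering, the sketch below builds two conditions on the rows of the example table and combines them. The parentheses around each comparison matter because &, | and ~ bind more tightly than == and <.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25, 56, 22, 28, 42],
    "job": ["unemployed", "management", "services", "management",
            "student", "unemployed", "unemployed"],
    "marital": ["divorced", "married", "single", "divorced",
                "single", "single", "married"],
})

# Each comparison yields a boolean Series with one value per row.
is_unemployed = df["job"] == "unemployed"
is_under_40 = df["age"] < 40

# and: keep rows where both conditions are True.
young_unemployed = df[is_unemployed & is_under_40]

# or + negation: keep rows that are neither single nor student.
others = df[~((df["marital"] == "single") | (df["job"] == "student"))]
```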
sort_values() → method to sort a DataFrame or a Series
The sort order is controlled by the ascending argument
Specify inplace=True in the method call to modify the original DataFrame or Series
Although education is an ordinal variable, pandas sorts it alphabetically since it is encoded as a string!
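The sketch below sorts a tiny frame by a numerical column and then by a string column. The education levels "low", "medium" and "high" are hypothetical labels chosen so that the alphabetical order visibly differs from the natural order.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25],
    "education": ["medium", "low", "high"],  # hypothetical levels
})

by_age = df.sort_values("age", ascending=False)   # oldest first
by_education = df.sort_values("education")        # alphabetical, not ordinal

print(by_education["education"].tolist())
```

The string sort puts "high" before "low", which is the problem the remark above points out.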
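Two main ways for indexing data frames
.loc → label-based indexing: df.loc[row_lab_index, col_lab_index]
Each index can be a single label ("age"), a list of labels (["age", "job", "loan"]), or a slice of labels ("age":"balance")
.iloc → position-based indexing: df.iloc[row_pos_index, col_pos_index]
Each index can be a single position (4), a list of positions ([4, 2, 10]), a slice of positions (2:10:2), or everything (:)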
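Here is a small sketch of both indexers on a four-row frame. The balance values are invented for the example. Note that label slices with .loc include both ends, whereas position slices with .iloc exclude the stop position, as in plain Python.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25, 56],
    "job": ["unemployed", "management", "services", "management"],
    "balance": [100, 250, 0, 1200],  # hypothetical values
    "loan": ["no", "yes", "no", "no"],
})

# Label-based indexing (row labels are 0..3 with the default RangeIndex).
one_value = df.loc[0, "age"]                  # single row, single column
two_columns = df.loc[:, ["age", "loan"]]      # all rows, list of columns
label_slice = df.loc[1:2, "age":"balance"]    # both slice ends included

# Position-based indexing.
one_row = df.iloc[3]                          # fourth row as a Series
picked = df.iloc[[3, 1], 0]                   # rows 3 and 1, first column
every_other = df.iloc[0:4:2, :]               # rows 0 and 2, all columns
```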
.iloc → get rows and columns by position
.loc → get rows and columns by label
DataFrame's Row Index
A DataFrame's row index can be changed using the set_index() method
Specify inplace=True in the method call to modify the original DataFrame
Changing a DataFrame's index enables more interesting label-based indexing of the rows
Example: using the marital column as the DataFrame's row index (instead of the default RangeIndex)
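As a sketch of this example, the code below moves the marital column into the row index and then selects rows by marital status with .loc. The four rows are taken from the example table.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25, 56],
    "job": ["unemployed", "management", "services", "management"],
    "marital": ["divorced", "married", "single", "divorced"],
})

# Returns a new DataFrame; "marital" is no longer a regular column.
by_marital = df.set_index("marital")

# Label-based row selection now works on marital status.
divorced = by_marital.loc["divorced"]
print(divorced)
```

Since the label "divorced" appears twice in the index, the selection returns a DataFrame with both matching rows.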
A DataFrame or Series can have a multi-level (hierarchical) index (MultiIndex)
DataFrame's Index
Reset a DataFrame's index to the default one by using the reset_index() method
The old index is kept as a column (specify drop=True in the method call to drop it instead)
Specify inplace=True in the method call to modify the original DataFrame directly
With a MultiIndex, you can select which levels to reset (level parameter)
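The following sketch builds a two-level index from the marital and job columns and then resets it in three different ways. The frame is a small excerpt of the example table.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25],
    "job": ["unemployed", "management", "services"],
    "marital": ["divorced", "married", "single"],
})

# Passing a list of columns to set_index creates a MultiIndex.
multi = df.set_index(["marital", "job"])

restored = multi.reset_index()             # both levels become columns again
dropped = multi.reset_index(drop=True)     # index levels are discarded
partial = multi.reset_index(level="job")   # only the "job" level is reset
```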
"unemployed", "Unemployed", "Unemployd", ...
2020-11-19, 2020/11/12, 2020-19-11, ...
duplicated() → method to identify duplicated rows in a DataFrame (or values in a Series)
drop_duplicates() → method to remove duplicates from a DataFrame or Series
Pass inplace=True to drop_duplicates() to modify the original DataFrame or Series
subset → argument to limit the columns on which to search for duplicates
keep → argument to indicate which item of the duplicates must be retained ("first", "last", or False to drop all duplicates)
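A minimal sketch of duplicate handling follows. The names and cities are invented for the example, with one fully duplicated row and one name that appears three times.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann"],       # hypothetical data
    "city": ["Bern", "Basel", "Bern", "Zurich"],
})

# True for every row that repeats an earlier row on all columns.
mask = df.duplicated()

unique_rows = df.drop_duplicates()                              # full-row duplicates
last_per_name = df.drop_duplicates(subset=["name"], keep="last")
no_repeats = df.drop_duplicates(subset=["name"], keep=False)    # drop all repeated names
```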
persons.csv
dropna() → method to remove rows (or columns) with missing values
axis: axis along which missing values will be removed
how: whether to remove a row or column if all values are missing ("all") or if any value is missing ("any")
subset: labels on other axis to consider when looking for missing values
inplace: if True, do the operation on the original object
fillna() → method to replace missing values in a DataFrame
value: replacement value
axis: axis along which to fill missing values
inplace: if True, do the operation on the original DataFrame
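To illustrate dropna() and fillna(), the sketch below uses a small frame with a few missing entries. The values and the replacement choices (the column mean for age, the label "unknown" for job) are assumptions made for the example, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [30, np.nan, 25, np.nan],
    "job": ["unemployed", None, "services", "student"],
})

any_missing = df.dropna()                # drop rows with at least one missing value
all_missing = df.dropna(how="all")       # drop rows where every value is missing
job_known = df.dropna(subset=["job"])    # only look at the "job" column

# A dict gives one replacement value per column.
filled = df.fillna({"age": df["age"].mean(), "job": "unknown"})
```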
0
, 1
convert_dtypes() → method to let pandas attempt to infer the most appropriate data types for a data frame's columns
astype() → method to recast columns (Series) to other types
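The two conversion methods can be sketched as follows, on a hypothetical frame whose age column was loaded as text. The exact dtype names printed may vary between pandas versions, so the example only prints them.

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["30", "35", "25"],  # numbers stored as strings
    "job": ["unemployed", "management", "services"],
})

# Explicit recast of one column.
df["age"] = df["age"].astype(int)

# Let pandas pick suitable (nullable) types for every column.
converted = df.convert_dtypes()

print(df.dtypes)
print(converted.dtypes)
```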
Sorting by education respects the category order now, once the column is encoded with an ordered CategoricalDtype.
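A short sketch of an ordered categorical type is shown below. The levels "low", "medium" and "high" are hypothetical stand-ins for the education levels, and the order is whatever is listed in categories.

```python
import pandas as pd

# Hypothetical levels, listed from lowest to highest.
level_type = pd.CategoricalDtype(categories=["low", "medium", "high"],
                                 ordered=True)

education = pd.Series(["medium", "low", "high"]).astype(level_type)

in_order = education.sort_values()    # follows the category order
above_low = education > "low"         # comparisons use the order too

print(in_order.tolist())
```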
A 1 on a column indicates which category the original variable had; the other columns hold 0
get_dummies() → function to one-hot encode categorical columns
Example: one-hot encode the education variable, join the dummy variables to the data frame (more on joins later), and drop the original column
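The three steps of this example (encode, join, drop) can be sketched as below. The education values are illustrative, and the prefix argument is only used here to give the dummy columns readable names.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 35, 25],
    "education": ["primary", "tertiary", "secondary"],  # illustrative values
})

# One column per category, named "education_<category>".
dummies = pd.get_dummies(df["education"], prefix="education")

# Join on the row index, then drop the original column.
encoded = df.join(dummies).drop("education", axis=1)
print(encoded)
```

Each row has exactly one dummy column set, which is the defining property of a one-hot encoding.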
DataFrame most of the time)
value_counts() works.
DataFrameGroupBy objects and possible aggregations and transformations.
crosstab() → function to compute cross tabulations (i.e., co-occurrence counts) of two or more categorical Series
Use the normalize argument to get proportions instead of counts
Specify margins=True in the function call to add row and column totals
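As a sketch, the code below cross-tabulates job against loan default for the seven rows of the example table, then repeats the call with margins and with row-wise normalization.

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["unemployed", "management", "services", "management",
            "student", "unemployed", "unemployed"],
    "default": ["yes", "no", "no", "no", "yes", "yes", "yes"],
})

table = pd.crosstab(df["job"], df["default"])                      # raw counts
with_totals = pd.crosstab(df["job"], df["default"], margins=True)  # adds "All"
row_shares = pd.crosstab(df["job"], df["default"], normalize="index")

print(with_totals)
```

Here normalize="index" makes each row sum to 1; normalize=True would divide by the grand total instead.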
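merge() → method to merge a DataFrame with another DataFrame (or Series)
By default, the merge uses the columns that have the same name in both data frames (e.g., the customer_id column in both customers and purchases)
When the key columns have different names (e.g., the id column in products and the product_id column in purchases), use the left_on and right_on arguments to specify the column names in the left and right data frames respectively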
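Both merge situations can be sketched with three tiny tables. The table and key column names follow the text above; the name and label columns and all values are made up for the example.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"]})
products = pd.DataFrame({"id": [10, 11], "label": ["pen", "book"]})
purchases = pd.DataFrame({"customer_id": [1, 1, 2],
                          "product_id": [10, 11, 10]})

# Shared column name: merge() finds "customer_id" on its own.
with_customers = purchases.merge(customers)

# Different column names: state the key on each side.
full = with_customers.merge(products, left_on="product_id", right_on="id")
print(full)
```

Passing on="customer_id" explicitly in the first call would give the same result and makes the intent easier to read.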