Evaluation
Overview
The objective of the assignment is to evaluate your ability to conduct basic
data exploration and analysis tasks on a simple dataset. As data scientists,
you are expected to be proficient in such tasks, as you will be performing
them on a regular basis as part of your data science and machine learning
projects.
The assignment is to be conducted in pairs. Each group must choose one of the
proposed datasets. You will present your work and main findings in a brief
oral presentation (10 min presentation + 5 min discussion and questions).
Each dataset is accompanied by a few questions that you are expected to
answer. These questions are mainly meant to guide your analysis, and some are
left intentionally vague. Feel free to show your creativity by conducting
further analyses if you wish.
Group Constitution Rules
You are free to choose your partner as long as you respect one simple rule:
students who were in M1 Quantitative Economics last year cannot pair with
students who were not.
Deliverables
The following deliverables are expected from each group:
- A 5–10 page report detailing your main findings.
- The source code of your exploratory data analysis:
  - Hosted in a GitHub or GitLab repository (preferably), or as a zip archive.
  - Both notebooks and Python scripts are accepted.
- An oral presentation of your work and main findings (date TBA):
  - 10 min presentation.
  - 5 min questions.
Tasks and Datasets
# | Task | # Rows | # Features | Availability* |
1 | Customer Churn Analysis (Banking) | 10000 | 13 | Claimed |
2 | Heart Failure Analysis | 299 | 13 | Claimed |
3 | Customer Churn Analysis (Telecommunications) | 7043 | 21 | Claimed |
4 | Personal Loan Marketing Campaign Analysis | 5000 | 14 | Claimed |
5 | Heart Disease Analysis | 303 | 14 | Claimed |
6 | Student Performance Analysis | 349 | 21 | Claimed |
7 | Hotel Booking Cancellations Analysis | 5000 | 20 | Claimed |
8 | Wine Quality Analysis | 2500 | 13 | Claimed |
9 | Employee Churn Analysis (HR) | 1470 | 19 | Claimed |
10 | Metabolic Syndrome Analysis | 2009 | 14 | Claimed |
11 | Speed Dating Analysis | 800 | 26 | Claimed |
* Last updated: 16/09/2024 at 11 AM.
Task 1: Customer Churn Analysis (Banking)
Context
A European bank wants to identify the main factors contributing to customer
churn (i.e., customers leaving the bank and closing their accounts). By doing
so, the bank can target such customers with incentives, or use this knowledge
to propose new products that are better suited to their needs.
Features
The dataset contains the following features.
Feature | Description |
CustomerId | The customer's unique identifier. |
Surname | The customer's last name. |
CreditScore | The customer's credit score. |
Geography | The customer's geographic location (country). |
Gender | The customer's gender. |
Age | The customer's age. |
Tenure | The number of years the customer has been a client of the bank. |
Balance | The customer's bank account balance. |
NumOfProducts | The number of products the customer contracted with the bank. |
HasCrCard | Whether the customer has a credit card or not. |
IsActiveMember | Whether the customer has been recently active (i.e., made transactions) or not. |
EstimatedSalary | An estimate of the customer's annual salary. |
Exited (target) | Whether the customer closed their account (1) or not (0). |
Questions
- What is the churn rate among the bank's customers?
- How are the different variables (gender, age, geography, etc.) distributed in the dataset?
- How do the different variables interact with each other?
  - What are the age, salary, balance, number of products, etc. distributions for each gender group?
  - How are different indicators distributed by country?
  - etc.
- How do the different variables affect churn? What are the causes that can lead to increased (or reduced) customer churn?
- (Optional) Build a simple machine learning classification model that predicts churn based on a customer's features.
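As a starting point for the first questions, here is a minimal sketch of how a churn rate can be computed overall and per group. It uses only the Python standard library and a handful of made-up rows; the sample values and the helper names `churn_rate` and `churn_rate_by` are illustrative assumptions, not part of the dataset.

```python
from collections import defaultdict

# Hypothetical sample rows (the real dataset has 10000 rows and 13 features).
customers = [
    {"Geography": "France", "Exited": 0},
    {"Geography": "France", "Exited": 1},
    {"Geography": "Spain", "Exited": 0},
    {"Geography": "Germany", "Exited": 1},
    {"Geography": "Germany", "Exited": 1},
    {"Geography": "Spain", "Exited": 0},
]


def churn_rate(rows):
    """Share of rows whose Exited flag is 1."""
    return sum(row["Exited"] for row in rows) / len(rows)


def churn_rate_by(rows, key):
    """Churn rate within each distinct value of the column `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {value: churn_rate(group) for value, group in groups.items()}


overall = churn_rate(customers)
by_country = churn_rate_by(customers, "Geography")
print(overall, by_country)
```

If the data is loaded into a pandas DataFrame (assumed here to be named `df`), `df["Exited"].mean()` and `df.groupby("Geography")["Exited"].mean()` should give the same two figures.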
Download dataset
Task 2: Heart Failure Analysis
Context
A renowned hospital's cardiology department is conducting a study to pinpoint
factors that can foretell deadly heart failure. By doing so, the department
will be better prepared to identify patients at risk and provide them with
adequate care.
Features
The dataset contains the following features.
Feature | Description |
age | The patient's age (years). |
anaemia | Whether the patient has anaemia (a decrease in red blood cells or hemoglobin) or not. |
creatinine_phosphokinase | Level of the CPK enzyme in the blood (mcg/L). |
diabetes | Whether the patient has diabetes or not. |
ejection_fraction | Percentage of blood leaving the heart at each contraction. |
high_blood_pressure | Whether the patient has hypertension or not. |
platelets | Platelets in the blood (kiloplatelets/mL). |
serum_creatinine | Level of serum creatinine in the blood (mg/dL). |
serum_sodium | Level of serum sodium in the blood (mEq/L). |
sex | The patient's gender. |
smoking | Whether the patient is a smoker or not. |
time | Follow-up period (in days). |
DEATH_EVENT (target) | Whether the patient died during the follow-up period. |
Questions
- What is the mortality rate due to heart failure among the study's participants?
- How are the different variables (age, gender, presence of diabetes or high blood pressure, etc.) distributed in the dataset?
- How do the variables relate to each other?
  - Is the smoker distribution the same for both genders?
  - How many patients present each combination of underlying health conditions (anaemia only, diabetes only, anaemia + diabetes, etc.)?
  - etc.
- What are the main factors that are related to a higher risk of heart failure?
- (Optional) Build a simple machine learning classification model that predicts if a patient is at risk of dying from heart failure.
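The question about combinations of underlying conditions can be approached by labelling each patient with the set of conditions they present and counting the labels. The sketch below assumes the condition columns are coded 1 (present) and 0 (absent), which should be checked against the actual file; the sample patients and the helper `condition_label` are made up for illustration.

```python
from collections import Counter

# Assumption: each condition column is coded 1 (present) / 0 (absent).
CONDITIONS = ("anaemia", "diabetes", "high_blood_pressure")

# Hypothetical patients, for illustration only.
patients = [
    {"anaemia": 1, "diabetes": 0, "high_blood_pressure": 0},
    {"anaemia": 1, "diabetes": 1, "high_blood_pressure": 0},
    {"anaemia": 0, "diabetes": 0, "high_blood_pressure": 0},
    {"anaemia": 1, "diabetes": 0, "high_blood_pressure": 0},
    {"anaemia": 0, "diabetes": 1, "high_blood_pressure": 1},
]


def condition_label(patient):
    """Build a label such as 'anaemia + diabetes', or 'none' if no condition."""
    present = [name for name in CONDITIONS if patient[name] == 1]
    return " + ".join(present) if present else "none"


# Number of patients per exact combination of conditions.
combo_counts = Counter(condition_label(p) for p in patients)
for label, count in combo_counts.most_common():
    print(f"{label}: {count}")
```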
Download dataset
Task 3: Customer Churn Analysis (Telecommunications)
Context
A telecommunications company wants to identify reasons leading to customer
churn (i.e., customers canceling their phone or internet plans) in order to
improve its client retention strategies through more personalized offers and
commercial gestures.
Features
The dataset contains the following features.
Feature | Description |
customerID | The customer's unique identifier. |
gender | The customer's gender. |
SeniorCitizen | Whether the customer is a senior citizen or not. |
Partner | Whether the customer has a partner or not. |
Dependents | Whether the customer has dependents or not. |
tenure | Number of months the customer has stayed with the company. |
PhoneService | Whether the customer has a phone service or not. |
MultipleLines | Whether the customer has multiple phone lines or not. |
InternetService | The type of the customer's Internet service (DSL; Fiber optic; No). |
OnlineSecurity | Whether the customer has contracted the online security service or not (Yes; No; No internet service). |
OnlineBackup | Whether the customer has contracted the online backup service or not (Yes; No; No internet service). |
DeviceProtection | Whether the customer has contracted the device protection service or not (Yes; No; No internet service). |
TechSupport | Whether the customer has tech support or not (Yes; No; No internet service). |
StreamingTV | Whether the customer has subscribed to the TV streaming service or not (Yes; No; No internet service). |
StreamingMovies | Whether the customer has subscribed to the movie streaming service or not (Yes; No; No internet service). |
Contract | The customer's contract terms (Month-to-month; One year; Two years). |
PaperlessBilling | Whether the customer has opted for paperless billing or not. |
PaymentMethod | The customer's payment method. |
MonthlyCharges | The customer's monthly charges (in dollars). |
TotalCharges | The total amount charged to the customer so far (in dollars). |
Churn (target) | Whether the customer canceled their plan or not. |
Questions
- What is the churn rate among the company's customers?
- How are the different variables (gender, tenure, contract type, etc.) distributed in the dataset?
- How do the different variables interact with each other?
  - What are the distributions of contract type, tenure, seniority, charges, etc. for each gender group?
  - How are different indicators distributed by contract type?
  - etc.
- How do the different variables affect churn? What are the causes that can lead to increased (or reduced) customer churn?
- (Optional) Build a simple machine learning classification model that predicts churn based on a customer's features.
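For the optional model, text-valued features such as Contract usually need a numeric encoding first. The following is a minimal sketch of one-hot encoding with the standard library. The contract levels are copied from the feature table above, but their exact spelling in the file, the Yes/No coding of PaperlessBilling, and the helper names are assumptions to verify against the real data.

```python
# Contract levels as listed in the feature table; the exact spelling in the
# data file is an assumption and should be checked against the real values.
CONTRACT_LEVELS = ("Month-to-month", "One year", "Two years")


def encode_yes_no(value):
    """Map a Yes/No answer to 1/0 (raises KeyError on anything else)."""
    return {"Yes": 1, "No": 0}[value]


def one_hot(value, levels):
    """Return one 0/1 indicator per level, with a single 1 for `value`."""
    if value not in levels:
        raise ValueError(f"unexpected category: {value!r}")
    return [1 if value == level else 0 for level in levels]


def encode_customer(row):
    """Turn a (hypothetical) customer record into a numeric feature vector."""
    numeric = [row["tenure"], row["MonthlyCharges"]]
    binary = [encode_yes_no(row["PaperlessBilling"])]
    return numeric + binary + one_hot(row["Contract"], CONTRACT_LEVELS)


# Made-up customer, for illustration only.
example = {
    "tenure": 12,
    "MonthlyCharges": 70.5,
    "PaperlessBilling": "Yes",
    "Contract": "One year",
}
features = encode_customer(example)
print(features)
```

Raising an error on an unknown category is a deliberate choice in this sketch: it surfaces spelling differences in the data early instead of silently producing an all-zero encoding.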
Download dataset
Task 4: Personal Loan Marketing Campaign Analysis
Context
The retail marketing department of a bank ran a campaign in which it proposed
personal loans to its customers. The department wants to analyze the data in
order to discover insights that might help with tailoring better-targeted
campaigns that can lead to better conversion rates in the future.
Features
The dataset contains the following features.
Feature | Description |
ID | The customer's unique identifier. |
Age | The customer's age (in years). |
Experience | Number of years of professional experience. |
Income | The customer's annual income (in $1000). |
ZIP Code | The customer's home address zip code. |
Family | The customer's family size. |
CCAvg | Average spending on credit cards per month (in $1000). |
Education | Education level (1: Undergraduate; 2: Graduate; 3: Advanced/Professional). |
Mortgage | Value of house mortgage (in $1000) if the customer has one. |
Personal Loan (target) | Whether the customer accepted the personal loan offer in the last campaign or not. |
Securities Account | Whether the customer has a securities account with the bank or not. |
CD Account | Whether the customer has a certificate of deposit account with the bank or not. |
Online | Whether the customer uses the online facilities provided by the bank or not. |
CreditCard | Whether the customer has a credit card issued by the bank or not. |
Questions
- What is the conversion rate (percentage of clients that contracted a personal loan) of the marketing campaign?
- How are the different variables (age, income, education, etc.) distributed in the dataset?
- How do the different variables interact with each other?
  - Are age, income, education, etc. distributed similarly for customers who have securities accounts and those who don't?
  - Are age, income, education, etc. distributed similarly for customers who have CD accounts and those who don't?
  - How many customers have one account only? How many have multiple accounts (i.e., both securities and CD accounts)?
- What are the most important factors that lead to customers responding favorably to the marketing campaign?
- (Optional) Build a simple machine learning classification model that predicts, based on a customer's features, if they will respond to the marketing campaign or not.
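One way to relate a numeric variable such as Income to the campaign outcome is to bin it into brackets and compare the conversion rate per bracket. The sketch below assumes Personal Loan is coded 1 (accepted) and 0 (declined); the bracket boundaries, the sample rows, and the function names are arbitrary choices made for illustration.

```python
from collections import defaultdict

# Hypothetical rows: Income in $1000; Personal Loan assumed coded 1/0.
customers = [
    {"Income": 30, "Personal Loan": 0},
    {"Income": 45, "Personal Loan": 0},
    {"Income": 80, "Personal Loan": 0},
    {"Income": 95, "Personal Loan": 1},
    {"Income": 120, "Personal Loan": 1},
    {"Income": 150, "Personal Loan": 1},
]


def income_bracket(income):
    """Assign an income (in $1000) to an illustrative bracket."""
    if income < 50:
        return "<50"
    if income < 100:
        return "50-99"
    return "100+"


def conversion_by_bracket(rows):
    """Share of customers who accepted the loan within each income bracket."""
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for row in rows:
        bracket = income_bracket(row["Income"])
        totals[bracket] += 1
        accepted[bracket] += row["Personal Loan"]
    return {bracket: accepted[bracket] / totals[bracket] for bracket in totals}


rates = conversion_by_bracket(customers)
print(rates)
```

The bracket edges strongly shape what such a table shows, so it is worth trying a few binnings (or quantile-based bins) before drawing conclusions.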
Download dataset
Task 5: Heart Disease Analysis
Context
A clinic's cardiology department wants to identify the most important
indicators that help diagnose heart disease. By doing so, the clinic will be
able to know which diagnostic tests need to be conducted in priority, thus
leading to more efficiency and decreased expenditures (and smaller medical
bills for the patients).
Features
The dataset contains the following features.
Feature | Description |
age | The patient's age (in years). |
sex | The patient's gender (1: male; 0: female). |
cp | Chest pain type (0: asymptomatic; 1: typical angina; 2: atypical angina; 3: non-anginal pain). |
trestbps | Resting blood pressure (in mm Hg). |
chol | Serum cholesterol (in mg/dl). |
fbs | Whether the patient's fasting blood sugar > 120 mg/dl (1: yes; 0: no). |
restecg | Resting ECG results. 0: normal; 1: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); 2: probable or definite left ventricular hypertrophy by Estes' criteria. |
thalach | Maximum heart rate achieved. |
exang | Exercise-induced angina (1: yes; 0: no). |
oldpeak | ST depression induced by exercise relative to rest. |
slope | The slope of the peak exercise ST segment (0: upsloping; 1: flat; 2: downsloping). |
ca | Number of major vessels (0–3) colored by fluoroscopy. |
thal | 3: Normal; 6: Fixed defect; 7: Reversible defect. |
target (target) | Whether the patient has heart disease (1) or not (0). |
Questions
- How many patients suffer from heart disease in the dataset?
- How are the different variables (age, gender, chest pain type, etc.) distributed?
- How do the different variables interact?
  - Do age, cholesterol level, blood pressure, and other indicators have similar distributions for both genders?
  - Are there correlations between pairs of variables?
  - etc.
- What are the features that are the most likely to indicate the presence of heart disease?
- (Optional) Build a simple machine learning classification model that predicts if a patient suffers from an underlying heart condition.
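For the question on correlations between pairs of variables, the Pearson coefficient is a common first measure for numeric columns such as age, chol, or thalach. Here is a minimal standard-library sketch; the sample values are invented (thalach is built as an exact linear function of age so that the result is easy to check), and `pearson` is a hypothetical helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented values: thalach = 240 - 1.5 * age, so the correlation is -1.
age = [40, 50, 60, 70]
thalach = [180, 165, 150, 135]

r_age_thalach = pearson(age, thalach)
print(round(r_age_thalach, 3))
```

Pearson correlation only captures linear association between numeric variables. It is not meaningful for coded categories such as cp or thal, where comparing group proportions is more appropriate.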
Download dataset
Task 6: Student Performance Analysis
Context
A secondary school wants to investigate social, demographic, and
school-related causes linked to student failure in order to proactively identify
students at risk and provide them with adequate counseling and support.
Features
The dataset contains the following features.
Feature | Description |
sex | The student's gender. |
age | The student's age (in years). |
famsize | Family size (LE3: less than or equal to 3; GT3: greater than 3). |
Pstatus | Parents' cohabitation status (T: living together; A: living apart). |
Mjob | Mother's job. |
Fjob | Father's job. |
guardian | The student's guardian. |
studytime | Weekly study time (1: < 2 hours; 2: 2 to 5 hours; 3: 5 to 10 hours; 4: > 10 hours). |
schoolsup | Whether the student has extra educational support or not. |
famsup | Whether the student has family educational support or not. |
paid | Whether the student attends extra paid classes or not. |
activities | Whether the student has extra-curricular activities or not. |
internet | Whether the student has internet access at home or not. |
romantic | Whether the student is in a romantic relationship or not. |
famrel | Quality of family relationships, from 1 (very bad) to 5 (excellent). |
goout | Frequency of going out with friends, from 1 (very low) to 5 (very high). |
Dalc | Workday alcohol consumption, from 1 (very low) to 5 (very high). |
Walc | Weekend alcohol consumption, from 1 (very low) to 5 (very high). |
health | Current health condition, from 1 (very bad) to 5 (very good). |
absences | Number of school absences. |
success (target) | Whether the student passed or failed. |
Questions
- What is the proportion of students that failed the class?
- How are different variables (age, gender, internet access, family composition, etc.) distributed in the dataset?
- How do variables interact with each other?
  - Are age, absences, family composition, etc. distributed similarly for both genders?
  - What are the age, gender, internet access, alcohol consumption, etc. distributions by parents' cohabitation status?
  - etc.
- What are the principal factors that play a key role in student failure?
- (Optional) Build a simple machine learning classification model that predicts if a student will fail the class.
Download dataset
Task 7: Hotel Booking Cancellations Analysis
Context
A hotel chain wants to identify signs that can help foretell if guests are
going to cancel their room bookings. By doing so, they can better anticipate
the occurrence of such events and adjust their room prices and cancellation
policies accordingly.
Features
The dataset contains the following features.
Feature | Description |
hotel | Whether the booking is made at a resort or a city hotel. |
arrival_date_month | The arrival month. |
arrival_date_week_number | The arrival week number. |
arrival_date_day_of_month | The arrival day of the month. |
stays_in_weekend_nights | The number of weekend nights (Saturday or Sunday) the guest booked. |
stays_in_week_nights | The number of week nights (Monday to Friday) the guest booked. |
adults | Number of adults. |
meal | The type of meal booked (in standard hospitality meal package categories). Undefined/SC: no meal package; BB: Bed & Breakfast; HB: Half board (breakfast + one other meal); FB: Full board (breakfast + lunch + dinner). |
country | The guest's country of origin (ISO 3155–3:2013 format). |
market_segment | The market segment designation (TA = Travel Agents; TO = Tour Operators). |
distribution_channel | The booking's distribution channel (TA = Travel Agents; TO = Tour Operators). |
reserved_room_type | Code of the type of the reserved room. |
assigned_room_type | Code of the type of the assigned room (can be different from the reserved room due to overbooking or customer requests). |
booking_changes | Number of changes made to the booking until check-in or cancellation. |
deposit_type | Type of deposit made by the guest to guarantee the booking. No Deposit: no deposit was made; Non Refund: a deposit was made in the value of the total stay cost; Refundable: a deposit was made with a value under the total cost of stay. |
customer_type | Type of booking. Contract: the booking has a contract associated with it; Group: the booking is associated with a group; Transient: the booking is not associated with a contract, a group, or another transient booking; Transient-party: the booking is transient and associated with at least one other transient booking. |
adr | Average daily rate. |
required_car_parking_spaces | Number of car parking spaces required by the guest. |
total_of_special_requests | Number of special requests (e.g., twin bed, high floor, etc.) made by the guest. |
canceled (target) | Whether the booking was canceled or not. |
Questions
- What is the proportion of cancellations in the dataset?
- How are the different variables (hotel type, market segment, customer type, etc.) distributed in the dataset?
- How do variables interact with each other?
  - Are the numbers of nights stayed distributed similarly across market segments, customer types, countries, etc.?
  - Are the numbers of nights stayed, customer types, etc. distributed similarly across periods (months) of the year, room types, etc.?
  - etc.
- What variables are associated the most with booking cancellations?
- (Optional) Build a simple machine learning classification model that predicts if a guest will cancel their booking or not.
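The questions about nights stayed suggest deriving a total length of stay from the two night-count columns before comparing groups. A minimal sketch of that derived feature, with made-up bookings and hypothetical market segment labels (the actual labels in the file may differ):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical bookings; the segment labels are assumptions.
bookings = [
    {"market_segment": "Online TA", "stays_in_weekend_nights": 2, "stays_in_week_nights": 3},
    {"market_segment": "Online TA", "stays_in_weekend_nights": 0, "stays_in_week_nights": 1},
    {"market_segment": "Direct", "stays_in_weekend_nights": 1, "stays_in_week_nights": 1},
    {"market_segment": "Groups", "stays_in_weekend_nights": 2, "stays_in_week_nights": 5},
    {"market_segment": "Direct", "stays_in_weekend_nights": 0, "stays_in_week_nights": 4},
]


def total_nights(booking):
    """Derived feature: weekend nights plus week nights."""
    return booking["stays_in_weekend_nights"] + booking["stays_in_week_nights"]


def mean_nights_by(rows, key):
    """Average total nights for each distinct value of the column `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(total_nights(row))
    return {value: mean(nights) for value, nights in groups.items()}


nights_by_segment = mean_nights_by(bookings, "market_segment")
print(nights_by_segment)
```

The same grouping helper can be reused with `customer_type`, `country`, or `arrival_date_month` as the key. Means hide the shape of a distribution, so a histogram or box plot per group is a natural next step.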
Download dataset
Task 8: Wine Quality Analysis
Context
A wine aficionado is investigating what differentiates excellent wines from
poor or ordinary ones. Can you help him in his endeavor?
Features
The dataset contains the following features.
Feature | Description |
fixed acidity | Amount of fixed (non-volatile) acid in the wine. |
volatile acidity | Amount of acetic (volatile) acid in the wine. |
citric acid | Amount of citric acid in the wine. |
residual sugar | Amount of sugar after fermentation. |
chlorides | The amount of salt in the wine. |
free sulfur dioxide | Amount of free SO2 in the wine. |
total sulfur dioxide | Total amount of SO2 (free and bound) in the wine. |
density | The wine's density. |
pH | The wine's pH (a lower value means a more acidic wine). |
sulphates | The amount of sulphates in the wine. |
alcohol | Percentage of alcohol in the wine. |
type | Whether the wine is red or white. |
great wine (target) | Whether the wine is great or not. |
Questions
- What is the proportion of great wines in the dataset?
- How are the different variables (acidity, alcohol percentage, pH, etc.) distributed in the dataset?
- Are there correlations between pairs of variables (e.g., fixed and volatile acidity)?
- Are the variables distributed similarly in white and red wines?
- How do the different variables affect wine quality?
  - What makes a white wine great?
  - What makes a red wine great?
- (Optional) Build a simple machine learning classification model that predicts if a wine is great or not based on its characteristics.
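Since the questions ask separately about red and white wines, it can help to compare great and other wines within each type rather than on the pooled data. The sketch below computes, per wine type, the gap in the mean of one feature between the two groups. It assumes `type` holds the strings "red" and "white" and that `great wine` is coded 1/0; both codings, the sample values, and the helper `mean_gap` are assumptions for illustration.

```python
from statistics import mean

# Made-up wines; the "red"/"white" labels and the 1/0 target coding are assumptions.
wines = [
    {"type": "red", "alcohol": 13.0, "great wine": 1},
    {"type": "red", "alcohol": 10.0, "great wine": 0},
    {"type": "red", "alcohol": 11.0, "great wine": 0},
    {"type": "white", "alcohol": 12.0, "great wine": 1},
    {"type": "white", "alcohol": 12.5, "great wine": 1},
    {"type": "white", "alcohol": 9.5, "great wine": 0},
]


def mean_gap(rows, wine_type, feature):
    """Mean of `feature` for great wines minus the mean for the others,
    restricted to one wine type."""
    same_type = [row for row in rows if row["type"] == wine_type]
    great = [row[feature] for row in same_type if row["great wine"] == 1]
    other = [row[feature] for row in same_type if row["great wine"] == 0]
    return mean(great) - mean(other)


gap_red = mean_gap(wines, "red", "alcohol")
gap_white = mean_gap(wines, "white", "alcohol")
print(gap_red, gap_white)
```

A gap in means describes an association in the sample, not a cause. Features measured on different scales are easier to compare once standardized.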
Download dataset
Task 9: Employee Churn Analysis (HR)
Context
An HR department recruited you as a Data Scientist to help identify the main reasons behind employee churn. The knowledge you extract from their historical data can be leveraged to identify and put in place actions that can help retain talent (e.g., clearer career paths, incentives, etc.).
Features
The dataset contains the following features.
Feature | Description |
Age | The employee's age. |
Gender | The employee's gender. |
MaritalStatus | The employee's marital status. |
Education | The employee's education level. |
EducationField | The employee's field of education. |
JobRole | The employee's job title. |
Department | The department to which the employee is assigned. |
BusinessTravel | Whether the employee's job involves frequent, occasional, or no business travel. |
MonthlyIncome | The employee's monthly income. |
DistanceFromHome | The distance between the firm and the employee's home (in kilometers). |
JobSatisfaction | The employee's job satisfaction. |
JobInvolvement | The employee's job involvement. |
RelationshipSatisfaction | The employee's satisfaction with their personal relationships. |
PerformanceRating | The employee's performance rating. |
YearsAtCompany | The number of years the employee has been in the company. |
YearsInCurrentRole | The number of years the employee has been in their current role. |
YearsSinceLastPromotion | The number of years since the employee's last promotion. |
YearsWithCurrManager | The number of years the employee has been with their current manager. |
Churn (target) | Whether the employee has left the company or not. |
Questions
- What is the churn rate among the company's employees?
- How are the different variables (gender, marital status, etc.) distributed?
- Are there correlations between pairs of variables (e.g., gender and monthly income, job satisfaction and distance from home, etc.)?
- What are the variables that influence employee churn the most?
- (Optional) Build a simple machine learning classification model that predicts if a given employee is going to resign in the following months or not.
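Before building the optional classifier, it is useful to know what accuracy a trivial model already reaches, especially when churners are a minority. The following is a sketch of a majority-class baseline: it always predicts the most frequent label seen in the training data. The "Yes"/"No" label values and the tiny train/test split are assumptions made for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test example
    and return that label together with the resulting accuracy."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Made-up Churn labels; the real coding may differ (e.g., 1/0).
train_churn = ["No", "No", "No", "Yes", "No", "Yes"]
test_churn = ["No", "Yes", "No", "No"]

baseline_label, baseline_accuracy = majority_baseline_accuracy(train_churn, test_churn)
print(baseline_label, baseline_accuracy)
```

A trained model is only informative if it clearly beats this figure. With imbalanced classes, precision and recall on the churn class say more than accuracy alone.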
Download dataset
Task 10: Metabolic Syndrome Analysis
Context
Metabolic syndrome is a combination of conditions that significantly raise the risk of a multitude of diseases (coronary heart disease, diabetes, etc.). The hospital you are working at wants to identify, based on historical clinical data, the most important factors linked to this syndrome.
Features
The dataset contains the following features.
Feature | Description |
Age | The patient's age. |
Sex | The patient's gender. |
Marital | The patient's marital status. |
Income | The patient's monthly income. |
Race | The patient's ethnicity. |
WaistCirc | The patient's waist circumference. |
BMI | The patient's BMI (Body Mass Index). |
Albuminuria | The patient's Albuminuria stage. |
UrAlbCr | The patient's Urine Albumin-Creatinine ratio. |
UricAcid | The patient's uric acid level. |
BloodGlucose | The patient's blood glucose level. |
HDL | The patient's HDL (High-Density Lipoprotein) level. |
Triglycerides | The patient's triglycerides level. |
MetabolicSyndrome (target) | Whether the patient suffers from metabolic syndrome or not. |
Questions
- How many patients suffer from metabolic syndrome in the dataset?
- How are the different variables (age, gender, etc.) distributed?
- How do the different variables interact?
  - Are the biological variables similarly distributed in the different gender groups?
  - What correlations exist between biological variables?
- What factor is the most linked to metabolic syndrome?
- (Optional) Build a simple machine learning classification model that can help predict if a given patient suffers from metabolic syndrome or not.
Download dataset
Task 11: Speed Dating Analysis
Context
A dating agency wants to analyze the data collected from a speed dating event it organized in order to identify key elements that lead to successful matches. This can help improve the agency's matchmaking strategy and suggest to their customers to meet people with whom they are more likely to get along.
Features
The dataset contains the following features.
Feature | Description |
iid | The person's unique identifier. |
iid_o | The partner's unique identifier. |
gender | The person's gender. |
gender_o | The partner's gender. |
age | The person's age. |
age_o | The partner's age. |
race | The person's ethnicity. |
race_o | The partner's ethnicity. |
importance_race | How important it is for the person (on a scale of 1 to 10) to date a person of the same ethnic background. |
importance_race_o | How important it is for the partner (on a scale of 1 to 10) to date a person of the same ethnic background. |
importance_religion | How important it is for the person (on a scale of 1 to 10) to date a person of the same religious background. |
importance_religion_o | How important it is for the partner (on a scale of 1 to 10) to date a person of the same religious background. |
importance_attractive * | The importance the person attaches to dating someone attractive. |
importance_sincere * | The importance the person attaches to dating someone sincere. |
importance_intelligent * | The importance the person attaches to dating someone intelligent. |
importance_fun * | The importance the person attaches to dating someone funny. |
importance_ambitious * | The importance the person attaches to dating someone ambitious. |
importance_shared_interests * | The importance the person attaches to dating someone they have common interests with. |
importance_attractive_o * | The importance the partner attaches to dating someone attractive. |
importance_sincere_o * | The importance the partner attaches to dating someone sincere. |
importance_intelligent_o * | The importance the partner attaches to dating someone intelligent. |
importance_fun_o * | The importance the partner attaches to dating someone funny. |
importance_ambitious_o * | The importance the partner attaches to dating someone ambitious. |
importance_shared_interests_o * | The importance the partner attaches to dating someone they have common interests with. |
match (target) | Whether the person and their partner matched (1) or not (0). |
* Participants had 100 points to distribute across the six attributes (i.e., attractive + sincere + intelligent + fun + ambitious + shared_interests = 100).
Questions
- What is the average number of matches per participant?
- For each gender group, what are the three most important attributes when choosing someone to date?
- How do variables relate?
  - Do people that value attractiveness also value ambition?
  - Do preferences change with age?
  - Are the importance of ethnicity and the importance of religion related?
- What are the variables that influence compatibility between two persons the most?
- (Optional) Build a simple machine learning classification model that can help predict if two persons are likely to match.
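Two checks are worth running early on this dataset: the average number of matches per participant, and whether the six importance scores really add up to 100 as the footnote states. The sketch below assumes each row describes one date between a person (`iid`) and a partner (`iid_o`); the sample rows and the helper names are hypothetical.

```python
from collections import defaultdict

IMPORTANCE_COLUMNS = (
    "importance_attractive",
    "importance_sincere",
    "importance_intelligent",
    "importance_fun",
    "importance_ambitious",
    "importance_shared_interests",
)


def make_row(iid, iid_o, match, points):
    """Build a made-up row; `points` follows the order of IMPORTANCE_COLUMNS."""
    row = {"iid": iid, "iid_o": iid_o, "match": match}
    row.update(zip(IMPORTANCE_COLUMNS, points))
    return row


# Hypothetical dates; the last row deliberately breaks the 100-point budget.
dates = [
    make_row(1, 11, 1, (20, 20, 20, 20, 10, 10)),
    make_row(1, 12, 0, (20, 20, 20, 20, 10, 10)),
    make_row(2, 11, 1, (30, 10, 20, 20, 10, 10)),
    make_row(2, 12, 1, (30, 10, 20, 20, 10, 10)),
    make_row(3, 11, 0, (25, 25, 25, 25, 5, 5)),
]


def average_matches(rows):
    """Average number of matches per participant, counted by `iid`."""
    matches = defaultdict(int)
    for row in rows:
        matches[row["iid"]] += row["match"]
    return sum(matches.values()) / len(matches)


def rows_breaking_budget(rows, total=100, tolerance=1e-6):
    """Rows whose six importance scores do not sum to `total`."""
    return [
        row
        for row in rows
        if abs(sum(row[col] for col in IMPORTANCE_COLUMNS) - total) > tolerance
    ]


avg_matches = average_matches(dates)
bad_rows = rows_breaking_budget(dates)
print(avg_matches, [row["iid"] for row in bad_rows])
```

Rows that break the budget are not necessarily errors to delete. Rescaling the six scores so that they sum to 100 is one possible treatment, to be justified in the report.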
Download dataset