Evaluation

Overview

The objective of the assignment is to evaluate your ability to conduct basic data exploration and analysis tasks on a simple dataset. As data scientists, you are expected to be proficient in such tasks, as you will be performing them on a regular basis as part of your data science and machine learning projects.

The assignment is to be conducted in pairs. Each group must choose one of the proposed datasets. You will be presenting your work and main findings through a brief oral presentation (10 min presentation + 5 min discussion and questions).

Each dataset is accompanied by a few questions that you are expected to answer. These questions are mainly meant to guide your analysis, and some are left intentionally vague. Feel free to show your creativity by conducting further analyses.

Group Constitution Rules

You are free to choose your partner as long as you respect one simple rule: students who were in M1 Quantitative Economics last year cannot pair with students who were not.

Deliverables

The following deliverables are expected from each group:

Tasks and Datasets

No. | Task | Rows | Features | Availability*
1 | Customer Churn Analysis (Banking) | 10000 | 13 | Claimed
2 | Heart Failure Analysis | 299 | 13 | Claimed
3 | Customer Churn Analysis (Telecommunications) | 7043 | 21 | Claimed
4 | Personal Loan Marketing Campaign Analysis | 5000 | 14 | Claimed
5 | Heart Disease Analysis | 303 | 14 | Claimed
6 | Student Performance Analysis | 349 | 21 | Claimed
7 | Hotel Booking Cancellations Analysis | 5000 | 20 | Claimed
8 | Wine Quality Analysis | 2500 | 13 | Claimed
9 | Employee Churn Analysis (HR) | 1470 | 19 | Claimed
10 | Metabolic Syndrome Analysis | 2009 | 14 | Claimed
11 | Speed Dating Analysis | 800 | 26 | Claimed

* Last updated: 16/10/2023 at 7 AM.


Task 1: Customer Churn Analysis (Banking)

Context

A European bank wants to identify the main factors contributing to customer churn (i.e., customers leaving the bank and closing their accounts). By doing so, the bank can target such customers with incentives, or use this knowledge to propose new products that are better suited to their needs.

Features

The dataset contains the following features.

Feature Description
CustomerId The customer's unique identifier.
Surname The customer's last name.
CreditScore The customer's credit score.
Geography The customer's geographic location (country).
Gender The customer's gender.
Age The customer's age.
Tenure The number of years the customer has been a client of the bank.
Balance The customer's bank account's balance.
NumOfProducts The number of products the customer contracted with the bank.
HasCrCard Whether the customer has a credit card or not.
IsActiveMember Whether the customer has been recently active (i.e., made transactions) or not.
EstimatedSalary An estimate of the customer's annual salary.
Exited (target) Whether the customer closed their account (1) or not (0).
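
As a possible starting point for this kind of analysis, one can compare the churn rate across the values of a categorical feature. The sketch below uses only the Python standard library. The helper name `rate_by_group` is hypothetical, and the sample rows are made up for illustration (they are not taken from the dataset); real rows would typically be read with `csv.DictReader`.

```python
from collections import defaultdict

def rate_by_group(rows, group_col, target_col):
    """Return {group value: share of rows whose target equals 1}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for row in rows:
        stats = counts[row[group_col]]
        stats[0] += int(row[target_col])  # target is 0 or 1, stored as text in a CSV
        stats[1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}

# Illustrative rows only; values are assumptions, not actual dataset content.
sample = [
    {"Geography": "France", "Exited": "0"},
    {"Geography": "France", "Exited": "1"},
    {"Geography": "Spain", "Exited": "0"},
    {"Geography": "Spain", "Exited": "0"},
]
churn_rates = rate_by_group(sample, "Geography", "Exited")
print(churn_rates)
```

The same helper applies to any task with a 0/1 target, by changing the two column names.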

Questions


Task 2: Heart Failure Analysis

Context

A renowned hospital's cardiology department is conducting a study to pinpoint factors that can foretell deadly heart failure. By doing so, the department will be better prepared to identify patients at risk and provide them with adequate care.

Features

The dataset contains the following features.

Feature Description
age The patient's age (years).
anaemia Whether the patient has anemia (decrease of red blood cells or hemoglobin) or not.
creatinine_phosphokinase Level of the CPK enzyme in the blood (mcg/L).
diabetes Whether the patient has diabetes or not.
ejection_fraction Percentage of blood leaving the heart at each contraction (percentage).
high_blood_pressure Whether the patient has hypertension or not.
platelets Platelets in the blood (kiloplatelets/mL).
serum_creatinine Level of serum creatinine in the blood (mg/dL).
serum_sodium Level of serum sodium in the blood (mEq/L).
sex The patient's gender.
smoking Whether the patient is a smoker or not.
time Follow-up period (in days).
DEATH_EVENT (target) Whether the patient died during the follow-up period or not.

Questions


Task 3: Customer Churn Analysis (Telecommunications)

Context

A telecommunications company wants to identify the reasons leading to customer churn (i.e., customers canceling their phone or internet plans) in order to improve its client retention strategies through more personalized offers and commercial gestures.

Features

The dataset contains the following features.

Feature Description
customerID The customer's unique identifier.
gender The customer's gender.
SeniorCitizen Whether the customer is a senior citizen or not.
Partner Whether the customer has a partner or not.
Dependents Whether the customer has dependents or not.
tenure Number of months the customer has stayed with the company.
PhoneService Whether the customer has a phone service or not.
MultipleLines Whether the customer has multiple phone lines or not.
InternetService The type of the customer's Internet service (DSL; Fiber optic; No).
OnlineSecurity Whether the customer has contracted the online security service or not (Yes; No; No internet service).
OnlineBackup Whether the customer has contracted the online backup service or not (Yes; No; No internet service).
DeviceProtection Whether the customer has contracted the device protection service or not (Yes; No; No internet service).
TechSupport Whether the customer has tech support or not (Yes; No; No internet service).
StreamingTV Whether the customer has subscribed to the TV streaming service or not (Yes; No; No internet service).
StreamingMovies Whether the customer has subscribed to the movie streaming service or not (Yes; No; No internet service).
Contract The customer's contract terms (Month-to-month; One year; Two years).
PaperlessBilling Whether the customer has opted for paperless billing or not.
PaymentMethod The customer's payment method.
MonthlyCharges The customer's monthly charges (in dollars).
TotalCharges The total amount charged to the customer so far (in dollars).
Churn (target) Whether the customer canceled their plan or not.

Questions


Task 4: Personal Loan Marketing Campaign Analysis

Context

The retail marketing department of a bank ran a campaign in which it proposed personal loans to its customers. The department wants to analyze the data in order to discover insights that might help tailor better-targeted campaigns with higher conversion rates in the future.

Features

The dataset contains the following features.

Feature Description
ID The customer's unique identifier.
Age The customer's age (in years).
Experience Number of years of professional experience.
Income The customer's annual income (in $1000).
ZIP Code The customer's home address zip code.
Family The customer's family size.
CCAvg Average monthly credit card spending (in $1000).
Education Education level:
  • 1: Undergrad;
  • 2: Graduate;
  • 3: Advanced/Professional.
Mortgage Value of house mortgage (in $1000) if the customer has one.
Personal Loan (target) Whether the customer contracted the personal loan offer in the last campaign or not.
Securities Account Whether the customer has a securities account with the bank or not.
CD Account Whether the customer has a certificate of deposit account with the bank or not.
Online Whether the customer uses the online facilities provided by the bank or not.
CreditCard Whether the customer has a credit card issued by the bank or not.
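
Numerically coded categories such as Education are easier to read in tables and plots once mapped back to their labels. A minimal sketch, assuming the codes arrive as text from a CSV file; the names `EDUCATION_LABELS` and `decode_education` are hypothetical, and the sample codes are illustrative only.

```python
# Labels follow the Education description in the feature list above.
EDUCATION_LABELS = {1: "Undergrad", 2: "Graduate", 3: "Advanced/Professional"}

def decode_education(code):
    """Map a numeric Education code to its label; unknown codes return None."""
    return EDUCATION_LABELS.get(int(code))

codes = ["1", "3", "2", "1"]  # illustrative values, not taken from the dataset
labels = [decode_education(c) for c in codes]
print(labels)
```

Returning None for an unknown code, rather than raising an error, is a design choice made here so that unexpected values can be counted and inspected afterwards.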

Questions


Task 5: Heart Disease Analysis

Context

A clinic's cardiology department wants to identify the most important indicators that help diagnose heart disease. By doing so, the clinic will be able to know which diagnostic tests need to be conducted in priority, thus leading to more efficiency and decreased expenditures (and smaller medical bills for the patients).

Features

The dataset contains the following features.

Feature Description
age The patient's age (in years).
sex The patient's gender (1: male; 0: female).
cp Chest pain type:
  • 0: asymptomatic
  • 1: typical angina
  • 2: atypical angina
  • 3: non-anginal pain
trestbps Resting blood pressure (in mm Hg).
chol Serum cholesterol (in mg/dl).
fbs Whether the patient's fasting blood sugar > 120 mg/dl (1: yes; 0: no).
restecg Resting ECG results:
  • 0: normal
  • 1: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  • 2: probable or definite left ventricular hypertrophy by Estes' criteria
thalach Maximum heart rate achieved.
exang Exercise induced angina (1: yes; 0: no).
oldpeak ST depression induced by exercise relative to rest.
slope The slope of the peak exercise ST segment:
  • 0: upsloping
  • 1: flat
  • 2: downsloping
ca Number of major vessels (0–3) colored by fluoroscopy.
thal 3: Normal; 6: Fixed defect; 7: Reversible defect.
target (target) Whether the patient has heart disease (1) or not (0).

Questions


Task 6: Student Performance Analysis

Context

A secondary school wants to investigate social, demographic, and school-related causes linked to student failure in order to proactively identify students at risk and provide them with adequate counseling and support.

Features

The dataset contains the following features.

Feature Description
sex The student's gender.
age The student's age (in years).
famsize Family size (LE3: less than or equal to 3; GT3: greater than 3).
Pstatus Parents' cohabitation status (T: living together; A: living apart).
Mjob Mother's job.
Fjob Father's job.
guardian The student's guardian.
studytime Weekly study time:
  • 1: <2 hours;
  • 2: 2 to 5 hours;
  • 3: 5 to 10 hours;
  • 4: > 10 hours.
schoolsup Whether the student has extra educational support or not.
famsup Whether the student has family educational support or not.
paid Whether the student attends extra paid classes or not.
activities Whether the student has extra-curricular activities or not.
internet Whether the student has internet access at home or not.
romantic Whether the student is in a romantic relationship or not.
famrel Quality of family relationships, from 1 (very bad) to 5 (excellent).
goout Frequency of going out with friends, from 1 (very low) to 5 (very high).
Dalc Workday alcohol consumption, from 1 (very low) to 5 (very high).
Walc Weekend alcohol consumption, from 1 (very low) to 5 (very high).
health Current health condition, from 1 (very bad) to 5 (very good).
absences Number of school absences.
success (target) Whether the student passed or failed.

Questions


Task 7: Hotel Booking Cancellations Analysis

Context

A hotel chain wants to identify signs that can help foretell if guests are going to cancel their room bookings. By doing so, they can better anticipate the occurrence of such events and adjust their room prices and cancellation policies accordingly.

Features

The dataset contains the following features.

Feature Description
hotel Whether the booking is made at a resort or a city hotel.
arrival_date_month The arrival month.
arrival_date_week_number The arrival week number.
arrival_date_day_of_month The arrival day of the month.
stays_in_weekend_nights The number of weekend nights (Saturday or Sunday) the guest booked.
stays_in_week_nights The number of week nights (Monday to Friday) the guest booked.
adults Number of adults.
meal The type of meal booked (in standard hospitality meal package categories):
  • Undefined/SC: no meal package;
  • BB: Bed & Breakfast;
  • HB: Half board (breakfast + one other meal);
  • FB: Full board (breakfast + lunch + dinner).
country The guest's country of origin (ISO 3166-3:2013 format).
market_segment The market segment designation (TA = Travel Agents; TO = Tour Operators).
distribution_channel The booking's distribution channel (TA = Travel Agents; TO = Tour Operators).
reserved_room_type Code of the type of the reserved room.
assigned_room_type Code of the type of the assigned room (can be different from reserved room due to overbooking or customer requests).
booking_changes Number of changes made to the booking until check-in or cancellation.
deposit_type Type of deposit made by the guest to guarantee the booking:
  • No Deposit: no deposit was made;
  • Non Refund: a deposit was made in the value of the total stay cost;
  • Refundable: a deposit was made with a value under the total cost of stay.
customer_type Type of booking:
  • Contract: the booking has a contract associated with it;
  • Group: the booking is associated with a group;
  • Transient: the booking is not associated with a contract, a group, or another transient booking;
  • Transient-party: the booking is transient and associated with at least one other transient booking.
adr Average daily rate.
required_car_parking_spaces Number of car parking spaces required by the guest.
total_of_special_requests Number of special requests (e.g., twin bed, high floor, etc.) made by the guest.
canceled (target) Whether the booking was canceled or not.

Questions


Task 8: Wine Quality Analysis

Context

A wine aficionado is investigating what differentiates excellent wines from poor or ordinary ones. Can you help him in his endeavor?

Features

The dataset contains the following features.

Feature Description
fixed acidity Amount of fixed (non-volatile) acid in the wine.
volatile acidity Amount of acetic (volatile) acid in the wine.
citric acid Amount of citric acid in the wine.
residual sugar Amount of sugar after fermentation.
chlorides The amount of salt in the wine.
free sulfur dioxide Amount of free SO2 in the wine.
total sulfur dioxide Total amount of SO2 (free and bound) in the wine.
density The wine's density.
pH The wine's acidity.
sulphates The amount of sulphates in the wine.
alcohol Percentage of alcohol in the wine.
type Whether the wine is red or white.
great wine (target) Whether the wine is great or not.

Questions


Task 9: Employee Churn Analysis (HR)

Context

An HR department recruited you as a Data Scientist to help identify the main reasons behind employee churn. The knowledge you extract from their historical data can be leveraged to identify and put in place actions that can help retain talent (e.g., clearer career paths, incentives, etc.).

Features

The dataset contains the following features.

Feature Description
Age The employee's age.
Gender The employee's gender.
MaritalStatus The employee's marital status.
Education The employee's education level.
EducationField The employee's field of education.
JobRole The employee's job title.
Department The department to which the employee is assigned.
BusinessTravel Whether the employee's job involves frequent, occasional, or no business travel.
MonthlyIncome The employee's monthly income.
DistanceFromHome The distance between the employee's home and the firm (in kilometers).
JobSatisfaction The employee's job satisfaction.
JobInvolvement The employee's job involvement.
RelationshipSatisfaction The employee's satisfaction w.r.t. their personal relationship.
PerformanceRating The employee's performance rating.
YearsAtCompany The number of years the employee has been in the company.
YearsInCurrentRole The number of years the employee has been in their current role.
YearsSinceLastPromotion The number of years since the employee's last promotion.
YearsWithCurrManager The number of years the employee has had the same manager.
Churn (target) Whether the employee has left the company or not.

Questions


Task 10: Metabolic Syndrome Analysis

Context

Metabolic syndrome is a combination of conditions that significantly raise the risk of a multitude of diseases (coronary heart disease, diabetes, etc.). The hospital you are working at wants to identify, based on historical clinical data, the most important factors linked to this syndrome.

Features

The dataset contains the following features.

Feature Description
Age The patient's age.
Sex The patient's gender.
Marital The patient's marital status.
Income The patient's monthly income.
Race The patient's ethnicity.
WaistCirc The patient's waist circumference.
BMI The patient's BMI (Body Mass Index).
Albuminuria The patient's Albuminuria stage.
UrAlbCr The patient's Urine Albumin-Creatinine ratio.
UricAcid The patient's uric acid level.
BloodGlucose The patient's blood glucose level.
HDL The patient's HDL (High-Density Lipoprotein) level.
Triglycerides The patient's triglycerides level.
MetabolicSyndrome (target) Whether the patient suffers from metabolic syndrome or not.

Questions


Task 11: Speed Dating Analysis

Context

A dating agency wants to analyze the data collected from a speed dating event it organized in order to identify key elements that lead to successful matches. This can help improve the agency's matchmaking strategy and suggest to their customers to meet people with whom they are more likely to get along.

Features

The dataset contains the following features.

Feature Description
iid The person's unique identifier.
iid_o The partner's unique identifier.
gender The person's gender.
gender_o The partner's gender.
age The person's age.
age_o The partner's age.
race The person's ethnicity.
race_o The partner's ethnicity.
importance_race How important it is for the person (on a scale of 1 to 10) to date a person of the same ethnic background.
importance_race_o How important it is for the partner (on a scale of 1 to 10) to date a person of the same ethnic background.
importance_religion How important it is for the person (on a scale of 1 to 10) to date a person of the same religious background.
importance_religion_o How important it is for the partner (on a scale of 1 to 10) to date a person of the same religious background.
importance_attractive* The importance the person attaches to dating someone attractive.
importance_sincere* The importance the person attaches to dating someone sincere.
importance_intelligent* The importance the person attaches to dating someone intelligent.
importance_fun* The importance the person attaches to dating someone funny.
importance_ambitious* The importance the person attaches to dating someone ambitious.
importance_shared_interests* The importance the person attaches to dating someone they have common interests with.
importance_attractive_o* The importance the partner attaches to dating someone attractive.
importance_sincere_o* The importance the partner attaches to dating someone sincere.
importance_intelligent_o* The importance the partner attaches to dating someone intelligent.
importance_fun_o* The importance the partner attaches to dating someone funny.
importance_ambitious_o* The importance the partner attaches to dating someone ambitious.
importance_shared_interests_o* The importance the partner attaches to dating someone they have common interests with.
match (target) Whether the person and their partner matched (1) or not (0).

* Participants had 100 points to distribute across the six attributes (i.e., attractive + sincere + intelligent + fun + ambitious + shared_interests = 100).
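
The 100-point constraint in the footnote above can serve as a quick data-quality check. The sketch below flags rows whose six importance values do not add up to 100. The function names and the rounding tolerance are assumptions made for illustration, and the two sample rows are made up rather than taken from the dataset.

```python
ATTRIBUTES = ["attractive", "sincere", "intelligent", "fun", "ambitious", "shared_interests"]

def points_total(row, suffix=""):
    """Sum the six importance_* columns; suffix="" for the person, "_o" for the partner."""
    return sum(float(row[f"importance_{name}{suffix}"]) for name in ATTRIBUTES)

def breaks_budget(row, suffix="", tolerance=0.01):
    """True when the points do not add up to 100 (within an assumed rounding tolerance)."""
    return abs(points_total(row, suffix) - 100) > tolerance

# Illustrative rows only; values are stored as text, as they would be when read from a CSV.
ok_row = {f"importance_{n}": v for n, v in zip(ATTRIBUTES, ["15", "20", "20", "15", "15", "15"])}
bad_row = {f"importance_{n}": v for n, v in zip(ATTRIBUTES, ["30", "20", "20", "20", "10", "10"])}
flags = [breaks_budget(ok_row), breaks_budget(bad_row)]
print(flags)
```

Rows flagged this way are worth inspecting before any analysis that compares importance scores between participants.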

Questions
