Evaluation

Overview

The objective of the assignment is to evaluate your ability to conduct basic data exploration and analysis tasks on a simple dataset. As data scientists, you are expected to be proficient in such tasks, as you will be performing them on a regular basis as part of your data science and machine learning projects.

The assignment is to be conducted in pairs. Each group must choose one of the proposed datasets. You will be presenting your work and main findings through a brief oral presentation (10 min presentation + 5 min discussion and questions).

Each dataset is accompanied by a few questions that you are expected to answer. These questions are mainly meant to guide your analysis, and some are left intentionally vague. Feel free to show your creativity by conducting further analyses if you wish.

Group Constitution Rules

You are free to choose your partner as long as you respect one simple rule: students who were in M1 Quantitative Economics last year cannot pair with students who were not.

Deliverables

The following deliverables are expected from each group:

A 5–10 page report detailing your main findings.
The source code of your exploratory data analysis:

Hosted in a GitHub or GitLab repository (preferably), or as a zip archive.
Both notebooks and Python scripts are accepted.

An oral presentation of your work and main findings (date TBA):

10 min presentation.
5 min questions.

Tasks and Datasets

#	Task	# Rows	# Features	Availability*
1	Customer Churn Analysis (Banking)	10000	13	Claimed
2	Heart Failure Analysis	299	13	Claimed
3	Customer Churn Analysis (Telecommunications)	7043	21	Claimed
4	Personal Loan Marketing Campaign Analysis	5000	14	Claimed
5	Heart Disease Analysis	303	14	Claimed
6	Student Performance Analysis	349	21	Claimed
7	Hotel Booking Cancellations Analysis	5000	20	Claimed
8	Wine Quality Analysis	2500	13	Claimed
9	Employee Churn Analysis (HR)	1470	19	Claimed
10	Metabolic Syndrome Analysis	2009	14	Claimed
11	Speed Dating Analysis	800	26	Claimed

* Last updated: 16/09/2024 at 11 AM.

Task 1: Customer Churn Analysis (Banking)

Context

A European bank wants to identify the main factors contributing to customer churn (i.e., customers leaving the bank and closing their accounts). By doing so, the bank can target such customers with incentives, or use this knowledge to propose new products that are better suited to their needs.

Features

The dataset contains the following features.

Feature	Description
`CustomerId`	The customer's unique identifier.
`Surname`	The customer's last name.
`CreditScore`	The customer's credit score.
`Geography`	The customer's geographic location (country).
`Gender`	The customer's gender.
`Age`	The customer's age.
`Tenure`	The number of years the customer has been a client of the bank.
`Balance`	The customer's bank account's balance.
`NumOfProducts`	The number of products the customer contracted with the bank.
`HasCrCard`	Whether the customer has a credit card or not.
`IsActiveMember`	Whether the customer has been recently active (i.e., made transactions) or not.
`EstimatedSalary`	An estimate of the customer's annual salary.
`Exited` (target)	Whether the customer closed their account (`1`) or not (`0`).

Questions

What is the churn rate among the bank's customers?
How are the different variables (gender, age, geography, etc.) distributed in the dataset?
How do the different variables interact with each other?

What are the age, salary, balance, number of products, etc. distributions for each gender group?
How are different indicators distributed by country?
etc.

How do the different variables affect churn? What are the causes that can lead to increased (or reduced) customer churn?
(Optional) Build a simple machine learning classification model that predicts churn based on a customer's features.
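
As a starting point for the churn-rate questions, note that the churn rate is simply the mean of the `Exited` column, computed either over all customers or within a group. The sketch below is only an illustration: it uses the Python standard library so that it runs anywhere (in practice you would probably load the CSV file with pandas), the rows are made up rather than taken from the dataset, and the helper name `churn_rate` is hypothetical.

```python
from collections import defaultdict

# Toy rows for illustration only; NOT taken from the real dataset.
customers = [
    {"Geography": "France", "Exited": 0},
    {"Geography": "France", "Exited": 1},
    {"Geography": "Spain", "Exited": 0},
    {"Geography": "Germany", "Exited": 1},
    {"Geography": "Germany", "Exited": 1},
    {"Geography": "Spain", "Exited": 0},
]


def churn_rate(rows):
    """Share of rows whose `Exited` flag equals 1 (hypothetical helper)."""
    return sum(r["Exited"] for r in rows) / len(rows)


# Overall churn rate.
overall = churn_rate(customers)

# Churn rate per country: group the rows, then apply the same function.
by_country = defaultdict(list)
for row in customers:
    by_country[row["Geography"]].append(row)
rates = {country: churn_rate(rows) for country, rows in by_country.items()}

print(f"overall: {overall:.1%}")
for country, rate in sorted(rates.items()):
    print(f"{country}: {rate:.1%}")
```

The same grouping pattern applies to any categorical variable (gender, number of products, etc.).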

Download dataset

Task 2: Heart Failure Analysis

Context

A renowned hospital's cardiology department is conducting a study to pinpoint factors that can foretell deadly heart failure. By doing so, the department will be better prepared to identify patients at risk and provide them with adequate care.

Features

The dataset contains the following features.

Feature	Description
`age`	The patient's age (years).
`anaemia`	Whether the patient has anemia (decrease of red blood cells or hemoglobin) or not.
`creatinine_phosphokinase`	Level of the CPK enzyme in the blood (mcg/L).
`diabetes`	Whether the patient has diabetes or not.
`ejection_fraction`	Percentage of blood leaving the heart at each contraction (percentage).
`high_blood_pressure`	Whether the patient has hypertension or not.
`platelets`	Platelets in the blood (kiloplatelets/mL).
`serum_creatinine`	Level of serum creatinine in the blood (mg/dL).
`serum_sodium`	Level of serum sodium in the blood (mEq/L).
`sex`	The patient's gender.
`smoking`	Whether the patient is a smoker or not.
`time`	Follow-up period (in days).
`DEATH_EVENT` (target)	Whether the patient died during the follow-up period.

Questions

What is the mortality rate due to heart failure among the study's participants?
How are the different variables (age, gender, presence of diabetes or high blood pressure, etc.) distributed in the dataset?
How do the variables relate to each other?

Is the smoker distribution the same for both genders?
How many patients present each combination of underlying health conditions (anaemia only, diabetes only, anaemia + diabetes, etc.)?
etc.

What are the main factors that are related to a higher risk of heart failure?
(Optional) Build a simple machine learning classification model that predicts if a patient is at risk of dying from heart failure.
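
The question about combinations of underlying health conditions can be approached by turning each patient's flags into a label and counting the labels. This is a minimal sketch under stated assumptions: the three condition columns are assumed to be coded `1` (present) and `0` (absent), the rows are made up, and `condition_label` is a hypothetical helper.

```python
from collections import Counter

# Condition columns considered here; extend the tuple as needed.
CONDITIONS = ("anaemia", "diabetes", "high_blood_pressure")

# Toy rows for illustration only; NOT taken from the real dataset.
# Assumes each flag is coded 1 (present) or 0 (absent).
patients = [
    {"anaemia": 1, "diabetes": 0, "high_blood_pressure": 0},
    {"anaemia": 1, "diabetes": 1, "high_blood_pressure": 0},
    {"anaemia": 0, "diabetes": 0, "high_blood_pressure": 0},
    {"anaemia": 1, "diabetes": 0, "high_blood_pressure": 0},
    {"anaemia": 0, "diabetes": 1, "high_blood_pressure": 1},
]


def condition_label(patient):
    """Name the combination of conditions a patient presents."""
    present = [c for c in CONDITIONS if patient[c] == 1]
    return " + ".join(present) if present else "none"


combination_counts = Counter(condition_label(p) for p in patients)

for label, count in combination_counts.most_common():
    print(f"{label}: {count}")
```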

Download dataset

Task 3: Customer Churn Analysis (Telecommunications)

Context

A telecommunications company wants to identify the reasons leading to customer churn (i.e., customers canceling their phone or internet plans) in order to improve its client retention strategies through more personalized offers and commercial gestures.

Features

The dataset contains the following features.

Feature	Description
`customerID`	The customer's unique identifier.
`gender`	The customer's gender.
`SeniorCitizen`	Whether the customer is a senior citizen or not.
`Partner`	Whether the customer has a partner or not.
`Dependents`	Whether the customer has dependents or not.
`tenure`	Number of months the customer has stayed with the company.
`PhoneService`	Whether the customer has a phone service or not.
`MultipleLines`	Whether the customer has multiple phone lines or not.
`InternetService`	The type of the customer's Internet service, if any (DSL; Fiber optic; No).
`OnlineSecurity`	Whether the customer has contracted the online security service or not (Yes; No; No internet service).
`OnlineBackup`	Whether the customer has contracted the online backup service or not (Yes; No; No internet service).
`DeviceProtection`	Whether the customer has contracted the device protection service or not (Yes; No; No internet service).
`TechSupport`	Whether the customer has tech support or not (Yes; No; No internet service).
`StreamingTV`	Whether the customer has subscribed to the TV streaming service or not (Yes; No; No internet service).
`StreamingMovies`	Whether the customer has subscribed to the movie streaming service or not (Yes; No; No internet service).
`Contract`	The customer's contract terms (Month-to-month; One year; Two years).
`PaperlessBilling`	Whether the customer has opted for paperless billing or not.
`PaymentMethod`	The customer's payment method.
`MonthlyCharges`	The customer's monthly charges (in dollars).
`TotalCharges`	The total amount charged to the customer so far (in dollars).
`Churn` (target)	Whether the customer canceled their plan or not.

Questions

What is the churn rate among the company's customers?
How are the different variables (gender, tenure, contract type, etc.) distributed in the dataset?
How do the different variables interact with each other?

What are the distributions of contract type, tenure, seniority, charges, etc. for each gender group?
How are different indicators distributed by contract type?
etc.

How do the different variables affect churn? What are the causes that can lead to increased (or reduced) customer churn?
(Optional) Build a simple machine learning classification model that predicts churn based on a customer's features.
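
To see how a numeric indicator is distributed by contract type, one option is to collect its values per group and summarize each group. The sketch below does this for `MonthlyCharges` with the standard library only; the rows are made up and the choice of summary statistics (count, mean, median) is just one possibility.

```python
from collections import defaultdict
from statistics import mean, median

# Toy rows for illustration only; NOT taken from the real dataset.
customers = [
    {"Contract": "Month-to-month", "MonthlyCharges": 70.0},
    {"Contract": "Month-to-month", "MonthlyCharges": 90.0},
    {"Contract": "Month-to-month", "MonthlyCharges": 50.0},
    {"Contract": "One year", "MonthlyCharges": 60.0},
    {"Contract": "One year", "MonthlyCharges": 40.0},
    {"Contract": "Two years", "MonthlyCharges": 30.0},
]

# Collect the monthly charges observed for each contract type.
charges_by_contract = defaultdict(list)
for row in customers:
    charges_by_contract[row["Contract"]].append(row["MonthlyCharges"])

# Summarize each group with its size, mean, and median.
summary = {
    contract: {"n": len(values), "mean": mean(values), "median": median(values)}
    for contract, values in charges_by_contract.items()
}

for contract, stats in summary.items():
    print(contract, stats)
```

Comparing the mean with the median is a quick way to spot skewed groups before plotting them.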

Download dataset

Task 4: Personal Loan Marketing Campaign Analysis

Context

The retail marketing department of a bank ran a campaign in which it proposed personal loans to its customers. The department wants to analyze the data in order to discover insights that might help with tailoring better-targeted campaigns that can lead to better conversion rates in the future.

Features

The dataset contains the following features.

Feature	Description
`ID`	The customer's unique identifier.
`Age`	The customer's age (in years).
`Experience`	Number of years of professional experience.
`Income`	The customer's annual income (in $1000).
`ZIP Code`	The customer's home address zip code.
`Family`	The customer's family size.
`CCAvg`	Average spending on credit card per month (in $1000).
`Education`	Education level: `1`: Undergraduate; `2`: Graduate; `3`: Advanced/Professional.
`Mortgage`	Value of house mortgage (in $1000) if the customer has one.
`Personal Loan` (target)	Whether the customer contracted the personal loan offer in the last campaign or not.
`Securities Account`	Whether the customer has a securities account with the bank or not.
`CD Account`	Whether the customer has a certificate of deposit account with the bank or not.
`Online`	Whether the customer uses the online facilities provided by the bank or not.
`CreditCard`	Whether the customer has a credit card issued by the bank or not.

Questions

What is the conversion rate (percentage of clients that contracted a personal loan) of the marketing campaign?
How are the different variables (age, income, education, etc.) distributed in the dataset?
How do the different variables interact with each other?

Are age, income, education, etc. distributed similarly for customers who have securities accounts and those who don't?
Are age, income, education, etc. distributed similarly for customers who have CD accounts and those who don't?
How many customers have one account only? How many have multiple accounts (i.e., both securities and CD accounts)?

What are the most important factors that lead to customers responding favorably to the marketing campaign?
(Optional) Build a simple machine learning classification model that predicts, based on a customer's features, if they will respond to the marketing campaign or not.
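
For the question on single versus multiple accounts, adding the two account flags gives the number of accounts each customer holds, which can then be tallied. The sketch below assumes that `Securities Account` and `CD Account` are coded `1` (has the account) and `0` (does not), and uses made-up rows.

```python
from collections import Counter

# Toy rows for illustration only; NOT taken from the real dataset.
# Assumes both account columns are coded 1 (has the account) or 0 (does not).
customers = [
    {"Securities Account": 1, "CD Account": 0},
    {"Securities Account": 0, "CD Account": 0},
    {"Securities Account": 1, "CD Account": 1},
    {"Securities Account": 0, "CD Account": 1},
    {"Securities Account": 0, "CD Account": 0},
]

# Number of accounts (0, 1, or 2) held by each customer, then tallied.
accounts_held = Counter(
    row["Securities Account"] + row["CD Account"] for row in customers
)

print("no account:   ", accounts_held[0])
print("one account:  ", accounts_held[1])
print("both accounts:", accounts_held[2])
```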

Download dataset

Task 5: Heart Disease Analysis

Context

A clinic's cardiology department wants to identify the most important indicators that help diagnose heart disease. By doing so, the clinic will be able to know which diagnostic tests need to be conducted in priority, thus leading to more efficiency and decreased expenditures (and smaller medical bills for the patients).

Features

The dataset contains the following features.

Feature	Description
`age`	The patient's age (in years).
`sex`	The patient's gender (`1`: male; `0`: female).
`cp`	Chest pain type: `0`: asymptomatic; `1`: typical angina; `2`: atypical angina; `3`: non-anginal pain.
`trestbps`	Resting blood pressure (in mm Hg).
`chol`	Serum cholesterol (in mg/dl).
`fbs`	Whether the patient's fasting blood sugar > 120 mg/dl (`1`: yes; `0`: no).
`restecg`	Resting ECG results: `0`: normal; `1`: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); `2`: probable or definite left ventricular hypertrophy by Estes' criteria.
`thalach`	Maximum heart rate achieved.
`exang`	Exercise induced angina (`1`: yes; `0`: no).
`oldpeak`	ST depression induced by exercise relative to rest.
`slope`	The slope of the peak exercise ST segment: `0`: upsloping; `1`: flat; `2`: downsloping.
`ca`	Number of major vessels (`0–3`) colored by fluoroscopy.
`thal`	`3`: Normal; `6`: Fixed defect; `7`: Reversible defect.
`target` (target)	Whether the patient has heart disease (`1`) or not (`0`).

Questions

How many patients suffer from heart disease in the dataset?
How are the different variables (age, gender, chest pain type, etc.) distributed?
How do the different variables interact?

Do age, cholesterol level, blood pressure, and other indicators have similar distributions for both genders?
Are there correlations between pairs of variables?
etc.

What are the features that are the most likely to indicate the presence of heart disease?
(Optional) Build a simple machine learning classification model that predicts if a patient suffers from an underlying heart condition.
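
For the question on correlations between pairs of variables, the Pearson coefficient is a common first measure for numeric columns. The sketch below implements it by hand with the standard library and applies it to every pair of columns; the values are made up, and the function name `pearson` is hypothetical. Keep in mind that this coefficient only captures linear relationships and is not meaningful for categorical codes such as `cp` or `thal`.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy values for illustration only; NOT taken from the real dataset.
columns = {
    "age": [40, 50, 60, 70],
    "thalach": [180, 165, 150, 135],
    "chol": [200, 260, 210, 250],
}

# Correlation for every unordered pair of columns.
names = list(columns)
correlations = {
    (a, b): pearson(columns[a], columns[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

for (a, b), r in correlations.items():
    print(f"{a} vs {b}: {r:+.2f}")
```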

Download dataset

Task 6: Student Performance Analysis

Context

A secondary school wants to investigate social, demographic, and school-related causes linked to student failure in order to proactively identify students at risk and provide them with adequate counseling and support.

Features

The dataset contains the following features.

Feature	Description
`sex`	The student's gender.
`age`	The student's age (in years).
`famsize`	Family size (`LE3`: less than or equal to 3; `GT3`: greater than 3).
`Pstatus`	Parents' cohabitation status (`T`: living together; `A`: living apart).
`Mjob`	Mother's job.
`Fjob`	Father's job.
`guardian`	The student's guardian.
`studytime`	Weekly study time: `1`: <2 hours; `2`: 2 to 5 hours; `3`: 5 to 10 hours; `4`: > 10 hours.
`schoolsup`	Whether the student has extra educational support or not.
`famsup`	Whether the student has family educational support or not.
`paid`	Whether the student attends extra paid classes or not.
`activities`	Whether the student has extra-curricular activities or not.
`internet`	Whether the student has internet access at home or not.
`romantic`	Whether the student is in a romantic relationship or not.
`famrel`	Quality of family relationships, from `1` (very bad) to `5` (excellent).
`goout`	Frequency of going out with friends, from `1` (very low) to `5` (very high).
`Dalc`	Workday alcohol consumption, from `1` (very low) to `5` (very high).
`Walc`	Weekend alcohol consumption, from `1` (very low) to `5` (very high).
`health`	Current health condition, from `1` (very bad) to `5` (very good).
`absences`	Number of school absences.
`success` (target)	Whether the student passed or failed.

Questions

What is the proportion of students that failed the class?
How are different variables (age, gender, internet access, family composition, etc.) distributed in the dataset?
How do variables interact with each other?

Are age, absences, family composition, etc. distributed similarly for both genders?
What are the age, gender, internet access, alcohol consumption, etc. distributions by parents' cohabitation status?
etc.

What are the principal factors that play a key role in student failure?
(Optional) Build a simple machine learning classification model that predicts if a student will fail the class.
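
Several features in this dataset are ordinal codes, so a frequency table is easier to read once the codes are mapped back to their labels and kept in their natural order. The sketch below does this for `studytime`, using the labels from the feature table; the observed values are made up.

```python
from collections import Counter

# Labels taken from the feature table above.
STUDYTIME_LABELS = {
    1: "<2 hours",
    2: "2 to 5 hours",
    3: "5 to 10 hours",
    4: ">10 hours",
}

# Toy values for illustration only; NOT taken from the real dataset.
studytime = [2, 1, 2, 3, 2, 4, 1, 2]

counts = Counter(studytime)
total = len(studytime)

# Iterate over the codes in ascending order so the table stays ordinal.
frequency_table = [
    (STUDYTIME_LABELS[code], counts[code], counts[code] / total)
    for code in sorted(STUDYTIME_LABELS)
]

for label, n, share in frequency_table:
    print(f"{label:<14}{n:>3}{share:>8.1%}")
```

Looping over the label mapping rather than over the observed values also keeps categories with zero observations visible in the table.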

Download dataset

Task 7: Hotel Booking Cancellations Analysis

Context

A hotel chain wants to identify signs that can help foretell if guests are going to cancel their room bookings. By doing so, they can better anticipate the occurrence of such events and adjust their room prices and cancellation policies accordingly.

Features

The dataset contains the following features.

Feature	Description
`hotel`	Whether the booking is made at a resort or a city hotel.
`arrival_date_month`	The arrival month.
`arrival_date_week_number`	The arrival week number.
`arrival_date_day_of_month`	The arrival day of the month.
`stays_in_weekend_nights`	The number of weekend nights (Saturday or Sunday) the guest booked.
`stays_in_week_nights`	The number of week nights (Monday to Friday) the guest booked.
`adults`	Number of adults.
`meal`	The type of meal booked (in standard hospitality meal package categories): `Undefined`/`SC`: no meal package; `BB`: Bed & Breakfast; `HB`: Half board (breakfast + one other meal); `FB`: Full board (breakfast + lunch + dinner).
`country`	The guest's country of origin (ISO 3166 country code).
`market_segment`	The market segment designation (TA = Travel Agents; TO = Tour Operators).
`distribution_channel`	The booking's distribution channel (TA = Travel Agents; TO = Tour Operators).
`reserved_room_type`	Code of the type of the reserved room.
`assigned_room_type`	Code of the type of the assigned room (can be different from reserved room due to overbooking or customer requests).
`booking_changes`	Number of changes made to the booking until check-in or cancellation.
`deposit_type`	Type of deposit made by the guest to guarantee the booking: `No Deposit`: no deposit was made; `Non Refund`: a deposit was made in the value of the total stay cost; `Refundable`: a deposit was made with a value under the total cost of stay.
`customer_type`	Type of booking: `Contract`: the booking has a contract associated with it; `Group`: the booking is associated with a group; `Transient`: the booking is not associated with a contract, a group, or another transient booking; `Transient-party`: the booking is transient and associated with at least one other transient booking.
`adr`	Average daily rate.
`required_car_parking_spaces`	Number of car parking spaces required by the guest.
`total_of_special_requests`	Number of special requests (e.g., twin bed, high floor) made by the guest.
`canceled` (target)	Whether the booking was canceled or not.

Questions

What is the proportion of cancellations in the dataset?
How are the different variables (hotel type, market segment, customer type, etc.) distributed in the dataset?
How do variables interact with each other?

Is the length of stay (in nights) distributed similarly across market segments, customer types, countries, etc.?
Are the length of stay, customer types, etc. distributed similarly across periods (months) of the year? Across room types?
etc.

What variables are associated the most with booking cancellations?
(Optional) Build a simple machine learning classification model that predicts if a guest will cancel their booking or not.

Download dataset

Task 8: Wine Quality Analysis

Context

A wine aficionado is investigating what differentiates excellent wines from poor or ordinary ones. Can you help him in his endeavor?

Features

The dataset contains the following features.

Feature	Description
`fixed acidity`	Amount of fixed (non-volatile) acid in the wine.
`volatile acidity`	Amount of acetic (volatile) acid in the wine.
`citric acid`	Amount of citric acid in the wine.
`residual sugar`	Amount of sugar after fermentation.
`chlorides`	The amount of salt in the wine.
`free sulfur dioxide`	Amount of free SO2 in the wine.
`total sulfur dioxide`	Total amount of SO2 (free and bound) in the wine.
`density`	The wine's density.
`pH`	The wine's acidity.
`sulphates`	The amount of sulphates in the wine.
`alcohol`	Percentage of alcohol in the wine.
`type`	Whether the wine is red or white.
`great wine` (target)	Whether the wine is great or not.

Questions

What is the proportion of great wines in the dataset?
How are the different variables (acidity, alcohol percentage, pH, etc.) distributed in the dataset?
Are there correlations between pairs of variables (e.g., fixed and volatile acidity)?
Are the variables distributed similarly in white and red wines?
How do the different variables affect wine quality?

What makes a white wine great?
What makes a red wine great?

(Optional) Build a simple machine learning classification model that predicts if a wine is great or not based on its characteristics.
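
One simple way to approach "what makes a white (or red) wine great" is to compare, within each wine type, the average of a feature for great wines against the average for the others. The sketch below does this for `alcohol`. It assumes `great wine` is coded `1`/`0`, uses made-up rows, and the helper `mean_gap_by_type` is hypothetical. A gap in means is only a descriptive hint, not evidence of a causal effect.

```python
from collections import defaultdict
from statistics import mean

# Toy rows for illustration only; NOT taken from the real dataset.
# Assumes `great wine` is coded 1 (great) or 0 (not great).
wines = [
    {"type": "red", "alcohol": 13.0, "great wine": 1},
    {"type": "red", "alcohol": 10.0, "great wine": 0},
    {"type": "red", "alcohol": 11.0, "great wine": 0},
    {"type": "white", "alcohol": 12.0, "great wine": 1},
    {"type": "white", "alcohol": 11.0, "great wine": 1},
    {"type": "white", "alcohol": 9.5, "great wine": 0},
]


def mean_gap_by_type(rows, feature):
    """Per wine type: mean of `feature` for great wines minus mean for the rest.

    Raises statistics.StatisticsError if a type has no great (or no
    non-great) wines, since the mean of an empty group is undefined.
    """
    values = defaultdict(lambda: {0: [], 1: []})
    for row in rows:
        values[row["type"]][row["great wine"]].append(row[feature])
    return {
        wine_type: mean(groups[1]) - mean(groups[0])
        for wine_type, groups in values.items()
    }


alcohol_gap = mean_gap_by_type(wines, "alcohol")
for wine_type, gap in alcohol_gap.items():
    print(f"{wine_type}: {gap:+.2f}")
```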

Download dataset

Task 9: Employee Churn Analysis (HR)

Context

An HR department recruited you as a Data Scientist to help identify the main reasons behind employee churn. The knowledge you extract from their historical data can be leveraged to identify and put in place actions that can help retain talent (e.g., clearer career paths, incentives, etc.).

Features

The dataset contains the following features.

Feature	Description
`Age`	The employee's age.
`Gender`	The employee's gender.
`MaritalStatus`	The employee's marital status.
`Education`	The employee's education level.
`EducationField`	The employee's field of education.
`JobRole`	The employee's job title.
`Department`	The department to which the employee is assigned.
`BusinessTravel`	Whether the employee's job involves frequent, occasional, or no business travel.
`MonthlyIncome`	The employee's monthly income.
`DistanceFromHome`	The firm's distance to the employee's home (in kilometers).
`JobSatisfaction`	The employee's job satisfaction.
`JobInvolvement`	The employee's job involvement.
`RelationshipSatisfaction`	The employee's satisfaction w.r.t. their personal relationship.
`PerformanceRating`	The employee's performance rating.
`YearsAtCompany`	The number of years the employee has been in the company.
`YearsInCurrentRole`	The number of years the employee has been in their current role.
`YearsSinceLastPromotion`	The number of years since the employee got promoted.
`YearsWithCurrManager`	The number of years the employee has been assigned the same manager.
`Churn` (target)	Whether the employee has left the company or not.

Questions

What is the churn rate among the company's employees?
How are the different variables (gender, marital status, etc.) distributed?
Are there correlations between pairs of variables (e.g., gender and monthly income, job satisfaction and distance from home, etc.)?
What are the variables that influence employee churn the most?
(Optional) Build a simple machine learning classification model that predicts if a given employee is going to resign in the following months or not.
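
For the optional modeling step, the general workflow is: split the data, fit a model on the training part, and compare its test accuracy with a trivial baseline. The sketch below illustrates that workflow with a one-feature decision stump (a single threshold on `YearsAtCompany`). The stump is a dependency-free stand-in chosen for this illustration, not a recommended model; in practice you would more likely use a library such as scikit-learn. The data is made up, `Churn` is assumed to be coded `1`/`0`, and all function names are hypothetical.

```python
# Toy (YearsAtCompany, Churn) pairs for illustration only; NOT real data.
# Assumes `Churn` is coded 1 (left the company) or 0 (stayed).
data = [
    (0, 1), (1, 1), (2, 1), (3, 0), (4, 1), (5, 1), (6, 0), (7, 0),
    (8, 0), (9, 1), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0),
]

# Simple deterministic split: even positions train, odd positions test.
train, test = data[::2], data[1::2]


def predict(threshold, years):
    """Predict churn (1) when tenure is below the threshold, else 0."""
    return int(years < threshold)


def accuracy(threshold, samples):
    """Share of samples whose label the threshold rule predicts correctly."""
    return sum(predict(threshold, x) == y for x, y in samples) / len(samples)


def fit_stump(samples):
    """Return the candidate threshold with the best training accuracy."""
    candidates = sorted({x for x, _ in samples})
    return max(candidates, key=lambda t: accuracy(t, samples))


threshold = fit_stump(train)

# Baseline: always predict the most frequent class seen in training.
majority = int(sum(y for _, y in train) * 2 > len(train))
baseline_accuracy = sum(y == majority for _, y in test) / len(test)

print(f"threshold: {threshold}")
print(f"stump test accuracy:    {accuracy(threshold, test):.3f}")
print(f"baseline test accuracy: {baseline_accuracy:.3f}")
```

Reporting the baseline matters because churn targets are often imbalanced, so a model that always predicts "stays" can already look accurate.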

Download dataset

Task 10: Metabolic Syndrome Analysis

Context

Metabolic syndrome is a combination of conditions that significantly raise the risk of a multitude of diseases (coronary heart disease, diabetes, etc.). The hospital you are working at wants to identify, based on historical clinical data, the most important factors linked to this syndrome.

Features

The dataset contains the following features.

Feature	Description
`Age`	The patient's age.
`Sex`	The patient's gender.
`Marital`	The patient's marital status.
`Income`	The patient's monthly income.
`Race`	The patient's ethnicity.
`WaistCirc`	The patient's waist circumference.
`BMI`	The patient's BMI (Body Mass Index).
`Albuminuria`	The patient's Albuminuria stage.
`UrAlbCr`	The patient's Urine Albumin-Creatinine ratio.
`UricAcid`	The patient's uric acid level.
`BloodGlucose`	The patient's blood glucose level.
`HDL`	The patient's HDL (High-Density Lipoprotein) level.
`Triglycerides`	The patient's triglycerides level.
`MetabolicSyndrome` (target)	Whether the patient suffers from metabolic syndrome or not.

Questions

How many patients suffer from metabolic syndrome in the dataset?
How are the different variables (age, gender, etc.) distributed?
How do the different variables interact?

Are the biological variables similarly distributed in the different gender groups?
What correlations exist between biological variables?

Which factor is most strongly linked to metabolic syndrome?
(Optional) Build a simple machine learning classification model that can help predict if a given patient is subject to metabolic syndrome or not.

Download dataset

Task 11: Speed Dating Analysis

Context

A dating agency wants to analyze the data collected from a speed dating event it organized in order to identify key elements that lead to successful matches. This can help the agency improve its matchmaking strategy and suggest that its customers meet people with whom they are more likely to get along.

Features

The dataset contains the following features.

Feature	Description
`iid`	The person's unique identifier.
`iid_o`	The partner's unique identifier.
`gender`	The person's gender.
`gender_o`	The partner's gender.
`age`	The person's age.
`age_o`	The partner's age.
`race`	The person's ethnicity.
`race_o`	The partner's ethnicity.
`importance_race`	How important it is for the person (on a scale of 1 to 10) to date a person of the same ethnic background.
`importance_race_o`	How important it is for the partner (on a scale of 1 to 10) to date a person of the same ethnic background.
`importance_religion`	How important it is for the person (on a scale of 1 to 10) to date a person of the same religious background.
`importance_religion_o`	How important it is for the partner (on a scale of 1 to 10) to date a person of the same religious background.
`importance_attractive`*	The importance the person attaches to dating someone attractive.
`importance_sincere`*	The importance the person attaches to dating someone sincere.
`importance_intelligent`*	The importance the person attaches to dating someone intelligent.
`importance_fun`*	The importance the person attaches to dating someone funny.
`importance_ambitious`*	The importance the person attaches to dating someone ambitious.
`importance_shared_interests`*	The importance the person attaches to dating someone they have common interests with.
`importance_attractive_o`*	The importance the partner attaches to dating someone attractive.
`importance_sincere_o`*	The importance the partner attaches to dating someone sincere.
`importance_intelligent_o`*	The importance the partner attaches to dating someone intelligent.
`importance_fun_o`*	The importance the partner attaches to dating someone funny.
`importance_ambitious_o`*	The importance the partner attaches to dating someone ambitious.
`importance_shared_interests_o`*	The importance the partner attaches to dating someone they have common interests with.
`match` (target)	Whether the person and their partner matched (`1`) or not (`0`).

* Participants had 100 points to distribute across the six attributes (i.e., attractive + sincere + intelligent + fun + ambitious + shared_interests = 100).
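
The 100-point rule stated in the note above is easy to verify before analyzing the importance columns, since rows that break it may need cleaning or rescaling. The sketch below is a minimal check under stated assumptions: the rows are made up, the helper names are hypothetical, and the tolerance of 0.5 point (to absorb rounding) is an arbitrary choice.

```python
from math import isclose

ATTRIBUTES = (
    "attractive", "sincere", "intelligent",
    "fun", "ambitious", "shared_interests",
)


def points_sum(row, suffix=""):
    """Total points a row assigns across the six attributes.

    Pass suffix="_o" to check the partner's columns instead.
    """
    return sum(row[f"importance_{name}{suffix}"] for name in ATTRIBUTES)


def is_valid_allocation(row, suffix="", tolerance=0.5):
    """True when the six importances add up to 100, within a tolerance."""
    return isclose(points_sum(row, suffix), 100, abs_tol=tolerance)


# Toy rows for illustration only; NOT taken from the real dataset.
rows = [
    {
        "importance_attractive": 20, "importance_sincere": 20,
        "importance_intelligent": 20, "importance_fun": 20,
        "importance_ambitious": 10, "importance_shared_interests": 10,
    },
    {
        "importance_attractive": 30, "importance_sincere": 20,
        "importance_intelligent": 20, "importance_fun": 10,
        "importance_ambitious": 5, "importance_shared_interests": 5,
    },
]

# Positions of the rows whose points do not add up to 100.
invalid = [i for i, row in enumerate(rows) if not is_valid_allocation(row)]
print("rows violating the 100-point rule:", invalid)
```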

Questions

What is the average number of matches per participant?
For each gender group, what are the three most important attributes when choosing someone to date?
How do the variables relate to each other?

Do people who value attractiveness also value ambition?
Do preferences change with age?
Is the importance attached to ethnicity related to the importance attached to religion?

What are the variables that influence compatibility between two persons the most?
(Optional) Build a simple machine learning classification model that can help predict if two persons are likely to match.
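
For the average number of matches per participant, keep in mind that the unit of observation is a meeting, not a person: matches have to be summed per `iid` first and averaged afterwards. The sketch below assumes one row per (person, partner) meeting, listed from the person's side, and uses made-up rows.

```python
from collections import defaultdict
from statistics import mean

# Toy rows for illustration only; NOT taken from the real dataset.
# Assumes one row per (person, partner) meeting, seen from the person's side.
dates = [
    {"iid": 1, "iid_o": 11, "match": 1},
    {"iid": 1, "iid_o": 12, "match": 0},
    {"iid": 2, "iid_o": 11, "match": 1},
    {"iid": 2, "iid_o": 12, "match": 1},
    {"iid": 3, "iid_o": 11, "match": 0},
    {"iid": 3, "iid_o": 12, "match": 0},
]

matches_per_person = defaultdict(int)
for row in dates:
    # Adding 0 still registers the participant, so people without any
    # match are counted in the average.
    matches_per_person[row["iid"]] += row["match"]

average_matches = mean(matches_per_person.values())
print(f"average matches per participant: {average_matches:.2f}")
```

Taking the mean of the `match` column directly would instead give the share of meetings that ended in a match, which answers a different question.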

Download dataset