INSEAD Course:
Data Science (and Machine Learning) for Business

T. Evgeniou
Professor of Decision Sciences and Technology Management,
INSEAD

A. Ovchinnikov
Visiting Professor of Technology, Operations and Decision Sciences,
INSEAD
Distinguished Professor of Management Analytics,
Queen's University, Canada

S. Zoumpoulis
Professor of Decision Sciences,
INSEAD

Course Description

The abundance of data revolutionizes many industries, and creates new, data-intensive business models. To take advantage of this trend, today's MBAs need to be more comfortable with "data science" - an emerging discipline that combines data analytics and business. The goal of this course is to build your capability in data science so that you can better add value through the effective management and use of data in your organizations.

The course will combine three key elements: analytics techniques, business applications, and basic coding/programming (in R, one of the leading open-source tools for analyzing data that you will be able to use in your jobs). The emphasis will be on applications to various business cases in finance, marketing, and operations, among other disciplines.

A pre-requisite for the course is the material covered in the INSEAD core course Uncertainty, Data & Judgment. This course is a follow-up to UDJ. No prior coding experience is required: for most classes you will receive starter code that you will modify and run. Because of that, much of the course will be in the form of a "hands-on" workshop; you will be expected to bring your laptop to class (with all the necessary software tools installed) and actively participate in the learning process.

What you will take away from this course:

The course is built around specific business cases that we will solve in a step-by-step approach, while getting introduced to the topics above.

This is not a course to become "data scientists" or even to become "experts in analytics". The goal is to familiarize participants with what is available and possible for analytics. It is meant to be a starting point.

Technological (and Data Science Training) Evolution

The course has been evolving since it was first launched in 2013 - you can find more readings, example projects, and links in this earlier edition. Both technology/tools and the Data Science and Machine Learning/AI thinking have been evolving over the years. See for example this very important (also today) reproducible project framework, the course technology platform back in 2013, as well as these technical resources at the time. A number of comments, issues, and project ideas have also been contributed by the course alumni - please continue to use and post issues/ideas even after graduating from the course and from INSEAD, you are "writing history"!

Course Tools

The course uses the R programming language, and the main tool is RStudio. All participants are required to install RStudio before the first session.

We will also use GitHub, a platform for coding collaboration that hosts many open-source projects (What is Github?), and as the course progresses we hope you will find it useful to register there and use it to share your work.

A key lesson of the course is that an important success factor for data analytics projects is to have a good balance between creative customization and codified, reproducible and reusable end-to-end analytics processes ("solutions" or templates). We will develop several such codified end-to-end processes/solutions/templates during the course, and you will reuse them in your future work.

Cases and Final Project

Three cases will be assigned and they will be due in groups. Each case will be on a business application (relating to finance, marketing, and operations, among others). The cases will focus on implementing analytics techniques or models taught in class, and deriving relevant managerial insights, through a step-by-step solution approach.

A central part of the course is the group final project. For the final project, every group is required to develop a data analytics solution to a business problem, and share the relevant data. The project should include three parts: a clear process for how to solve the business problem with steps codified using R code and an interactive toolkit; an application of the process using a specific dataset; and a specification for others to use the process, e.g. with different data. The professors will meet with all groups to discuss final project ideas and progress; the TA will be available to help with specific implementation issues.

Tutorials

There will be a teaching assistant for the course, who will run tutorial sessions throughout the course to assist you with getting comfortable with R, understanding and using the R implementation of the machine learning techniques presented in class, as well as implementing the final projects.

Books

There is no required textbook, but these books are recommended as optional background readings:
Data Science for Business: Fundamental Principles of Data Mining and Data-Analytic Thinking (DSB book)
by F. Provost and T. Fawcett (2013)

An open-source (free) online textbook covering much of the material in the early part of the course is Forecasting Principles and Practice (FPP). It uses R for examples. Browse through the book and don't hesitate to use it as an extended help file.


Another open-source (free) online textbook covering some of the course material is An Introduction to Statistical Learning (ISL). It also uses R for examples. Browse through the book and don't hesitate to use it as an extended help file as well.

Grading

3 Group Assignments (cases): 45% (15% each)

Group Final Project: 40%

Individual Class Participation: 15%.
Note: 3% participation points will be deducted for each (1.5 hr) session missed



Course Sessions


Sessions 1-2: Introduction to data science and machine learning. Storytelling with data. Linear regression recall from UDJ and transition from Excel to R.

The session will start with an overview of the course and analytics broadly speaking, and specifically the role of data science, machine learning and AI (with use cases). We will then lightly touch on the science and practice of data visualization (in Tableau and in R) and will ultimately transition to predictive modeling using linear regressions. For that we will recall the linear regression modeling from UDJ and "take it to the next level" with R. A Tableau demo will be done live, and the R code will be provided to run/follow in class.

Read before class:

Case: "Sarah Gets a Diamond" [Note: current INSEAD students can obtain the case via the INSEAD course portal.]

The case contains data about various attributes for 9000+ diamonds. For the first 6000 you also have the price. The goal is to predict the prices of the remaining diamonds (i.e., those with IDs 6001 and above).
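As a rough illustration of the workflow this case calls for (fit a regression on the diamonds with known prices, then predict the rest), here is a minimal sketch in base R. It uses simulated data rather than the case file; the column names ID, Carat and Price, the single predictor, and the round figure of 9000 rows are assumptions made only for this example.

```r
# Simulated stand-in for the case data; column names are hypothetical.
set.seed(1)
n <- 9000
diamonds <- data.frame(ID = 1:n, Carat = runif(n, 0.3, 2.5))
diamonds$Price <- 1000 + 5000 * diamonds$Carat + rnorm(n, sd = 500)
diamonds$Price[diamonds$ID > 6000] <- NA   # price unknown for IDs 6001 and above

# Split into rows with a known price and rows to be predicted
train   <- subset(diamonds, ID <= 6000)
to_pred <- subset(diamonds, ID > 6000)

# Fit a linear regression on the known prices, then predict the unknown ones
fit <- lm(Price ~ Carat, data = train)
to_pred$PredictedPrice <- predict(fit, newdata = to_pred)

head(to_pred[, c("ID", "PredictedPrice")])
```

The script provided in class works on the real data and may use more attributes and a different model specification.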

Prepare before class:

Install RStudio (also see the R project)

Install Tableau

Think about your groups (~6 people per group). Groups are to be finalized in the break between Sessions 1 and 2

(Optional) Explore these applications developed using some of the tools we use in class

In class material:

Slides for Sessions 0102

R script for the case analyses: 0102 R code -- Sarah Gets a Diamond -- Linear Regression.R

Data in the CSV format: 0102 CSV data -- Sarah Gets a Diamond data.csv

Homework for next time

N/A

Optional Readings:

What is R? R Reference Card.

Chapters 1 and 2 from the DSB book

Chapters 4 and 5 from the FPP book

Other Resources:

Datacamp

A Coursera R programming course

A Udemy data science course

A Lynda data science page

Tableau Training

A Lynda Tableau page

Articles of potential interest: An executive's guide to AI (McKinsey), Modern AI for Executives, What can Machine Learning Do? (Science), What Machine Learning Can (and Cannot) Do?.


Tutorial 1: Getting comfortable with R (and coding in general)

Coding is likely new to many of you, hence the goal of the first tutorial is to help you become comfortable with "writing" simple code by modifying the templates given in class. Most of the tutorial will therefore be focused on working through the following problem:

Context: you are selling stuffed toy animals for the holiday season. Each animal is first sold in a test market, and the "test" data is recorded. Then the selected toys are sold in all markets, and the total "sales" data is recorded. In addition, the kind of the toy ("SKU") is recorded.

The goal: predict the sales of a "Bear" for which the test sales were 10 items. First, consider a single-variable regression, and then consider the multi-variable case.

The file T1_Sales_Data_part1 contains the data.
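To give a feel for the two regressions mentioned above, here is a minimal sketch in base R on simulated data. The column names SKU, test and sales, the three toy kinds, and all numbers are assumptions for illustration; the tutorial file has its own layout.

```r
# Simulated toy data; names and numbers are hypothetical, not the tutorial file.
set.seed(2)
toys <- data.frame(
  SKU  = rep(c("Bear", "Rabbit", "Tiger"), each = 20),
  test = round(runif(60, 5, 30))
)
sku_effect <- c(Bear = 50, Rabbit = 0, Tiger = -30)
toys$sales <- 100 + 12 * toys$test +
  sku_effect[as.character(toys$SKU)] + rnorm(60, sd = 15)

# The toy we want a prediction for: a Bear with 10 test sales
new_toy <- data.frame(SKU = "Bear", test = 10)

# Single-variable regression: total sales explained by test sales only
fit_single <- lm(sales ~ test, data = toys)
p_single <- predict(fit_single, newdata = new_toy)

# Multi-variable regression: test sales plus the kind of toy
fit_multi <- lm(sales ~ test + SKU, data = toys)
p_multi <- predict(fit_multi, newdata = new_toy)

c(single = unname(p_single), multi = unname(p_multi))
```

The second model lets each kind of toy have its own baseline level of sales, which is why the two predictions generally differ.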

The final part of the tutorial will introduce you to a powerful package for data manipulation called dplyr.

The file T1_Sales_Data_part2 contains the data for this last section.

Post-tutorial files: The two scripts we saw in the tutorial: T1_regressions and T1_dplyr


Sessions 3-4: Time series modeling (with R)

We will continue exploring data science with R, by considering predicting a time series - a common type of data indexed in chronological order. A variety of analytical techniques will be discussed: moving average, exponential smoothing, noise-trend-seasonality decompositions, trigonometric decompositions, ARIMA models and "dynamic regressions" (ARIMA models with covariates; equivalently, regression models with ARIMA errors). As always, R code will be provided to be followed in class.
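A few of the techniques listed above can be tried with base R alone, as in the sketch below. It uses the built-in AirPassengers series purely as a stand-in for the class data, and the ARIMA order is picked by hand for illustration; the in-class scripts may rely on other packages and other model choices.

```r
y <- AirPassengers   # monthly series shipped with R, used only as a stand-in

# Trailing 12-month moving average (the first 11 values are NA)
ma12 <- stats::filter(y, rep(1 / 12, 12), sides = 1)

# Simple exponential smoothing (no trend, no seasonal component)
ses <- HoltWinters(y, beta = FALSE, gamma = FALSE)

# Trend / seasonality / noise decomposition
parts <- decompose(y, type = "multiplicative")

# Seasonal ARIMA on the log scale; the order is an assumption for this sketch
fit <- arima(log(y), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fc <- predict(fit, n.ahead = 12)

# Back-transform the 12-month-ahead forecast to the original scale
exp(fc$pred)
```

Plotting each object, for example with plot(parts), is a quick way to see what each technique extracts from the series.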

Read before class: [Note, current INSEAD students can obtain the readings and cases via the INSEAD course portal.]

Reading: Time Series

Case: Wells Fargo: Solar Energy for Los Angeles Branches

Separate CSV data file for time series analyses during class: 0304 CSV data -- electric rates.csv

Prepare before class:

Nothing to prepare

In class material:

Slides Sessions 3-4

Intuition-building time series examples: 0304 Time Series examples.xlsx

R script for decompositions, exponential smoothing and trigonometric transforms: 0304 R code -- Timeseries I - decompositions ets and tbats.R

R script for auto-regressive models (ARIMA) and dynamic regression: 0304 R code -- Timeseries II - ARIMAs and dynamic regressions.R

Homework for next time: Assignment 1:

Yahoo's Acquisition of Tumblr + valuation spreadsheet and data. [Note, current INSEAD students can obtain the case, the data, and the valuation spreadsheet via the INSEAD course portal.]

The ultimate question: If you were advising Yahoo's board, would you recommend approving the acquisition of Tumblr for $1.1 billion? Please prepare a slide-deck explaining your recommendation, illustrating it with any necessary graphs/charts/tables, etc. To help you in doing so please consider the following sub-questions:

What was the average monthly growth rate in the number of people worldwide accessing Tumblr's site since its inception (over the last 37 months)? How does it compare to the average monthly growth rate over the past 12 months? What are the valuations of Tumblr given these two growth rates?

Use the data provided to forecast the number of people worldwide accessing Tumblr's site for the next 115 months (June 2013 - Dec 2022).

Given your forecast, revise the valuation of Tumblr (case Exhibit 5, valuation spreadsheet).

Submit PPT/PDF, R code, valuation model on the INSEAD course portal
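For the growth-rate sub-question above, note that "average monthly growth rate" can be read in more than one way. The sketch below shows two common readings in base R: the compound monthly growth rate between the first and last observation, and the simple mean of the month-over-month rates. The visitor counts are made up for illustration and are not the case data.

```r
# Made-up monthly visitor counts (in millions); NOT the case data.
visitors <- c(10, 12, 15, 17, 20, 26, 30)
n_periods <- length(visitors) - 1

# Reading 1: compound monthly growth rate between first and last observation
cmgr <- (visitors[length(visitors)] / visitors[1])^(1 / n_periods) - 1

# Reading 2: arithmetic mean of the month-over-month growth rates
monthly_rates <- diff(visitors) / head(visitors, -1)
mean_rate <- mean(monthly_rates)

c(compound = cmgr, arithmetic_mean = mean_rate)
```

Compounding the first rate over all periods reproduces the final observation exactly, which is not generally true of the arithmetic mean; whichever reading you use, state it in your slides.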

Optional Readings:

Chapters 6,7,8, 9.1 from the FPP book


Sessions 5-6: Introduction to classification.

In Sessions 1-4 we focused on predicting continuous quantities (e.g., price of an item). However, it is equally important and common to predict events (for example, will a customer purchase again, will a borrower default, etc.), a task known as "classification". In this session we will consider the main metrics for understanding classification, and we will discuss two popular techniques: one rooted in statistics (logistic regression), and another rooted in machine learning (CART, classification and regression trees). As we will see, CART is not the most accurate method, but it is nevertheless hugely important as it underlies a large family of popular and powerful ensemble methods, such as random forest and gradient boosting machines, which we will also discuss if time permits.
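As a small preview of what a classification workflow looks like, the sketch below fits a logistic regression in base R and summarizes its predictions in a confusion matrix. The data are simulated, and the variable names, the 70/30 split and the 0.5 cutoff are assumptions for this example rather than choices taken from the case.

```r
# Simulated data: one predictor x and a binary outcome "retained" (hypothetical)
set.seed(3)
n <- 1000
x <- rnorm(n)
retained <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))
d <- data.frame(x, retained)

# Hold out the last 300 rows to evaluate the model on unseen data
train <- d[1:700, ]
test  <- d[701:1000, ]

# Logistic regression: models the probability that retained == 1
fit <- glm(retained ~ x, data = train, family = binomial)
p <- predict(fit, newdata = test, type = "response")

# Turn probabilities into class predictions with a 0.5 cutoff
pred <- factor(as.integer(p > 0.5), levels = 0:1)
actual <- factor(test$retained, levels = 0:1)

# Confusion matrix and accuracy on the held-out rows
cm <- table(actual = actual, predicted = pred)
accuracy <- sum(diag(cm)) / sum(cm)
cm
accuracy
```

The cutoff does not have to be 0.5; moving it trades off the two kinds of error that the confusion matrix separates.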

Read before class: [Note, current INSEAD students can obtain the readings and cases via the INSEAD course portal.]

Reading:

Classification Methods: a process

Modeling Discrete Choice: Categorical Dependent Variables, Logistic Regression, and Maximum Likelihood Estimation

Case: "Retention Modeling at Scholastic Travel Company (A)." UVA-QA-0864 + data

Case: "Retention Modeling at Scholastic Travel Company (B)." UVA-QA-0865 + data. The (B) case will be discussed during Tutorial 2

Skim through Chapter 4 (focus more on Section 4.3), and Sections 7.1 and 8.2-8.5 of the DSB book

Skim through Chapters 4, 8 and 9 of the ISL book

Prepare before class:

Assignment 1 (Yahoo/Tumblr case) is due by 8.25am the day of Sessions #5-6. Submit a .ppt file, your R code, and the valuation model on the INSEAD course portal

In class material:

Slides Sessions 5-6

R script for data cleaning and logistic regression: 0506 R code -- STC (A) Logistic.R

R script for CART: 0506 R code -- STC (A) CART.R

R script for Random Forest: 0506 R code -- STC (AB) random forest.R

R script for XGBoost: 0506 R code -- STC (AB) xgboost.R

Homework for next time: Assignment 2:

The Business Context:

A major bank wants to better predict the likelihood of default for its customers, as well as identify the key drivers that determine this likelihood. They hope that this would inform the bank's decisions on whom to give credit and what credit limit to provide, and also help the bank better understand its current and potential customers, which would inform its future strategy, including its plans to offer targeted credit products to its customers.

The Data:

The bank collected data on 25 000 of their existing clients. Of those, 1 000 were randomly selected to participate in a pilot described below. Data about the remaining 24 000 is in the file "DSB A2 - credit data.xls". The dataset contains various information, including demographic factors, credit data, history of payment, and bill statements of credit card customers from April to September, as well as information on the outcome: did the customer default or not in October. Please refer to the PDF with the assignment on the INSEAD course portal for a snapshot of the data and the data dictionary; the datafile "DSB_S7-8_Credit data.xls" is on the INSEAD course portal.

The Pilot:

Your department wants to pilot a new product, a short-term credit line with the limit of 25,000, and for the purposes of this assignment assume that the line is for 1 month at 2% per month. Moreover, assume that the client who was issued credit and repaid it will more likely use your bank for similar short-term financing needs in the future, which has an additional lifetime value (CLV) of 1,000. However, if the client defaults, then you will be able to recover only 20,000 out of 25,000 credit granted. The data about 1 000 clients that were randomly selected for this pilot is in the file "DSB A2 - new applications.xlsx". This file is on the INSEAD course portal.
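One possible way to turn the pilot's numbers into a decision rule is sketched below in base R. It rests on assumptions that the assignment PDF may state differently (for instance, that a repaying client yields one month of interest plus the CLV, and that a defaulting client yields no interest); treat it as a starting point, with the PDF's assumptions taking precedence.

```r
credit_limit <- 25000   # amount granted
monthly_rate <- 0.02    # 2% for the one-month line
clv          <- 1000    # lifetime value of a client who repays
recovered    <- 20000   # amount recovered if the client defaults

# Assumed payoffs per client (see the caveats in the text above)
gain_if_repaid  <- credit_limit * monthly_rate + clv   # 500 + 1000 = 1500
loss_if_default <- credit_limit - recovered            # 5000

# Expected profit of granting credit, given a predicted default probability
expected_profit <- function(p_default) {
  (1 - p_default) * gain_if_repaid - p_default * loss_if_default
}

# Default probability at which granting credit breaks even
break_even <- gain_if_repaid / (gain_if_repaid + loss_if_default)
break_even
```

Under these assumptions the break-even default probability is 1500 / 6500, or roughly 23%: granting credit has positive expected profit only when the predicted default probability is below that level.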

Assignment Questions:

The ultimate question: which of the 1000 new applicants in the pilot should be issued credit? The more specific questions are in the assignment PDF on the INSEAD course portal, along with some assumptions and hints. That document also explains what and how you need to submit: it provides a Qualtrics survey link for submitting your credit decisions, and answering some further questions. Please also submit to the INSEAD course portal the PPT/PDF explaining your work, the R code and the CSV file with a single column with 1 000 1s and 0s reflecting whether credit should or shouldn't be granted to the respective new applicants.


Tutorial 2: Mid-course help with R. Classification. Feature engineering.

The goal of the second tutorial is threefold. First, to serve as a mid-course Q&A on any of the material thus far. Second, to strengthen your understanding of the classification material. For that, you will work (with the TA, of course) on the STC(B) case - an extension of the work we have done in Sessions 5-6 based on the additional data about the Net Promoter Scores (NPS). Third, to show you examples of "feature engineering": coming up with new variables combining the existing ones.

Case: Retention Modeling at Scholastic Travel Company (B) + Excel data with dictionary + CSV data + R code
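To make "feature engineering" concrete, here is a minimal base R sketch that derives new variables from existing ones. The data frame and its column names are invented for illustration and do not come from the STC data; the only outside fact used is the standard NPS convention that scores of 9 or 10 count as promoters.

```r
# Invented example data; column names are hypothetical, not from the STC files.
trips <- data.frame(
  total_revenue = c(12000, 30000, 8000),
  passengers    = c(30, 60, 10),
  nps           = c(9, 6, 10)
)

# Ratio feature: combines two existing columns into a per-person measure
trips$revenue_per_passenger <- trips$total_revenue / trips$passengers

# Indicator feature: 1 if the NPS score marks a promoter (9 or 10), else 0
trips$promoter <- as.integer(trips$nps >= 9)

trips
```

Engineered variables like these can then be added to a model formula alongside the original columns.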


Sessions 7-8: Wrap-up of Supervised Learning: advanced classification, overfitting and regularization, feature engineering, Deep Learning and discussion of Assignment 2 (credit defaults)

We wrap up the supervised learning module of the course by discussing Assignment 2. Teams will present their approaches to the entire class. As we will see, best results are obtained by combining "better models" with "better data," and we will thus spend some time on feature engineering, i.e., creative ways to define new variables in order to better capture the information contained in the data. We will also discuss the important challenge of overfitting, as well as how to address it using state-of-the-art regularization techniques. We will then discuss several additional techniques, such as tree ensembles (random forests, gradient boosting), support vector machines (SVM), regularizations (LASSO, Ridge) and, time-permitting, artificial neural networks (Deep Learning with TensorFlow).
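Overfitting can be seen in a few lines of base R. The sketch below fits polynomials of increasing degree to simulated data and compares the error on the training rows with the error on held-out rows. The data-generating process, the degrees and the 40/20 split are all assumptions chosen for illustration.

```r
# Simulated data: a smooth signal plus noise
set.seed(4)
n <- 60
x <- runif(n, -1, 1)
y <- sin(3 * x) + rnorm(n, sd = 0.3)

# First 40 points for training, remaining 20 held out for testing
idx <- 1:40
train <- data.frame(x = x[idx], y = y[idx])
test  <- data.frame(x = x[-idx], y = y[-idx])

# Root mean squared error of a fitted model on a given data set
rmse <- function(fit, d) sqrt(mean((d$y - predict(fit, newdata = d))^2))

# Fit polynomials of increasing flexibility and record both errors
results <- data.frame(degree = c(1, 3, 15),
                      train_rmse = NA_real_, test_rmse = NA_real_)
for (i in seq_len(nrow(results))) {
  deg <- results$degree[i]
  fit <- lm(y ~ poly(x, deg), data = train)
  results$train_rmse[i] <- rmse(fit, train)
  results$test_rmse[i]  <- rmse(fit, test)
}
print(results)
```

The training error can only go down as the degree grows, because each model contains the previous one; the held-out error is the one to watch, and regularization methods such as LASSO and Ridge are one way to keep flexible models from chasing noise.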

NOTE: depending on which faculty teaches Sessions 7-8, it may be taught in two slightly different versions. Version 1 introduces "Notebooks" (*.RMD files) as an all-in-one-place solution for both doing and communicating analytics with reproducible, publication-quality output; RMDs are then subsequently used in Sessions 9-10. Version 2 instead presents additional methods, such as Deep Learning.

Version 1:

Read before class:

Classification Process and Credit Card Default Report. Think how you would answer the questions (some overlap with Assignment 2).

Prepare before class:

Assignment 2 is due the day of Sessions #7-8. See submission details on the INSEAD course portal.

In class material:

Slides Sessions 7-8

Version 2:

In class material:

Slides Sessions 7-8

R script for the additional classification methods: 0708 R code -- DSB A2.R

"Minimally Viable Product" existing customer CSV data (no cleaning, no feature-engineering): 0708 CSV data -- DSB A2 -- credit data.csv

"Minimally Viable Product" new customer CSV data (no cleaning, no feature-engineering): 0708 CSV data -- DSB A2 -- new applications.csv

Homework for next time

No Homework

Optional Readings:

Chapters 4, 5, 7, 8 of the DSB book.

Chapters 6, 8, 9 from the ISL book.


Tutorial 3: *.Rmd Notebooks. GitHub.

The goal of this tutorial is (i) to get the participants started with GitHub and the course's GitHub repository; and (ii) to offer participants exposure to *.Rmd files as a way to combine "doing" and "communicating" analytics. This should be particularly handy as starting in Sessions 7-8 we handle *.Rmd files.

Instructions to set up GitHub: "Getting Started with GitHub" instructions


Sessions 9-10: Unsupervised Learning: Clustering, Segmentation and Dimensionality Reduction

In Sessions 1-8 we dealt with supervised learning tasks, where the data contained features (X variables) and a target of interest (Y variable) and we trained a machine to use what we know (known Xs and past/known Ys) to learn about and predict what we do not know (Ys of new/future Xs). But in many analytical tasks we do not have the target: can we learn anything in such cases? The answer is yes and these kinds of tasks are referred to as "unsupervised learning." [To see why this is called "unsupervised" note that in the previous tasks, the existence of past/known Ys was guiding, or "supervising", our learning by providing the answers we needed/wanted to learn.] In this class we will consider two popular unsupervised learning techniques: dimensionality reduction (Principal Component Analysis) and clustering/segmentation (two widely used techniques, k-means and hierarchical clustering, will be discussed in detail). Time-permitting, we will also discuss Association Rules and Anomaly Detection.
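The three techniques named above are all available in base R. The sketch below runs them on the built-in USArrests data, which is used only as a stand-in for the class data; the choice of three clusters and of Ward linkage are assumptions for illustration.

```r
# Standardize the variables so that no single one dominates the distances
X <- scale(USArrests)   # built-in data set, used only as a stand-in

# Principal Component Analysis and the share of variance per component
pca <- prcomp(X)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# k-means with 3 clusters; several random starts for a more stable solution
set.seed(5)
km <- kmeans(X, centers = 3, nstart = 20)

# Hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(X), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# Compare the two segmentations
round(var_explained, 2)
table(kmeans = km$cluster, hierarchical = hc_groups)
```

The cross-table shows how far the two methods agree; cluster labels are arbitrary, so look for a pattern of one large count per row rather than matching numbers.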

NOTE: as with Sessions 7-8, depending on which faculty teaches Sessions 9-10, it may be taught in two slightly different versions. Version 1 emphasizes "Notebooks" (*.RMD files) and the way they capture both doing and communicating analytics with reproducible, publication-quality outputs. Version 2 instead presents the additional methods, such as Association Rules.

Version 1:

Read before class:

Dimensionality Reduction and Derived Attributes (pdf version available here)

Boats: Segmentation Case Boats A (Part I&II) (Note: official case is INSEAD case 09/2012-5849).

Clustering and Segmentation (pdf version available here)

Skim through Chapter 6, read Section 6.4 of the DSB book

Prepare before class:

Set yourself up for the market segmentation notebook by following the steps below. If you are having issues, reach out to the TA.

  1. Follow the "Getting Started with GitHub" instructions. When done, you now have cloned the course's Git repository on a local directory on your computer.
  2. In RStudio, change the working directory to "/MYDIRECTORY/INSEADAnalytics", where you have cloned the course's Git repository. How to do this? You can see the current working directory by running getwd() in the RStudio console and you can change the working directory by running setwd("/MYDIRECTORY/INSEADAnalytics") in the RStudio console, where MYDIRECTORY is the local directory where you chose to clone the INSEADAnalytics Git repository on your computer.
  3. You are now set up to "knit" the Market Segmentation Process. To "knit" means to generate a document that includes both content as well as the output of any embedded R code chunks within the document. Find the MarketSegmentationProcessInClass.Rmd file under your local INSEADAnalytics/CourseSessions/InClassProcess directory. You may find it convenient to customize the name of this file in your local directory. Open it in RStudio, and knit it by pressing the "Knit" button (near the top left of the RStudio Editor window). You can choose to generate an HTML report, which you can then open with a browser.
  4. When knitting, you may run into this issue. You can resolve it by running in your RStudio console the command rmarkdown::render("CourseSessions/InClassProcess/customized_filename_of_MarketSegmentationProcessInClass.Rmd"), which will create the customized_filename_of_MarketSegmentationProcessInClass.html file in your local directory CourseSessions/InClassProcess.

(Optional) Explore some Shiny Dashboard visualization tools

In class material:

We will work on the segmentation template document (we will be editing the corresponding raw file)

Slides Sessions 9-10

Some slides on the Boats case: (Part I) and (Part II).

(Optional) These interactive document tools may also be used during class (running on RStudio): the Interactive Factors Analysis Tool as well as the Interactive Cluster Analysis Tool

(Optional) Complete this Exercise and push it on your individual github

Version 2:

In class material:

Slides Sessions 9-10

R script for the initial clustering analyses: 0910 R code -- clustering.R

CSV data for the initial clustering analyses: 0910 CSV data -- for clustering.csv

R script for the additional clustering and PCA analyses: 0910 R code -- pokemon.R

For the additional analyses we use a real-life dataset from Kaggle. Please follow the link here and download the Pokemon data as well as explore the data dictionary

R script for association rules analyses: 0910 R code -- associatoin rules.R

Homework for next time

Version 1, Assignment 3. Case: Boats (A): A Segmentation Case. Complete the segmentation process and your answers to the questions of Parts 1 and 2 of the Market Segmentation Process. Submit a .Rmd file and a .html file on the INSEAD course portal the day of Sessions #11-12.

Version 2, Assignment 3. Case: Boats (A): A Segmentation Case. Perform the analyses and respond to the questions in the case [Q1: 30+5+5pts, Q2: 20+10+30pts]. Prepare a presentation with your results and submit it, together with your R code (or a .RMD file), to the INSEAD course portal by the day of Sessions #11-12.

Proposal for final project, due the day of Sessions #11-12. Submit on the INSEAD course portal a .Rmd file and a .html file with a short (1-2 pages) description of the business problem, your business solution process, and some preliminary tables with (descriptive) statistics of your data. Also submit the data file.

Optional Readings:

Chapter 6 of the DSB book.

Chapter 10 from the ISL book.


Sessions 11-12: AI in business; The data science process; Guest speaker

We discuss the data analytics process, from understanding the business and the data, to data preparation, to modeling, to evaluation and deployment. A guest speaker will discuss practical challenges and key insights from managing data analytics projects.

Read before class (Optional)

Skim through Chapters 1 and 2 of DSB. Focus on Sections 1.4, 1.7, 1.8, 2.1 and 2.4

Skim through the CRISP-DM documentation

Information Management Issues

How to Tell If You Should Trust Your Statistical Models

Does bigger data lead to better decisions?

Run Field Experiments to Make Sense of Your Big Data

Prepare before class:

Assignment 3 (Boats (A): A Segmentation Case) is due the day of Sessions #11-12. Submit a .Rmd file and a .html file on the INSEAD course portal.

The proposal for the final project is due the day of Sessions #11-12. Submit on the INSEAD course portal a .Rmd file and a .html file with a short (1-2 pages) description of the business problem, your business solution process, and some preliminary tables with (descriptive) statistics of your data. Also submit the data file.

In class material:

Guest speaker - Slides

Slides Sessions 11-12

Homework for next time

Final project, due before Sessions #13-14. Submit a .Rmd file, a .html file, a data file, and your presentation file on the INSEAD course portal

Prepare to present your project in class.


Sessions 13-14: Project presentations

Groups present their final projects.

Every group is required to develop a data analytics solution to a business problem. We expect you to come up with a relevant business problem (ideally from your past or future workplace); to develop a clear process for how to solve the business problem with steps codified using an R notebook; to show an application of the process using a specific dataset; and to draw relevant and actionable business insights.

You are expected to share the data you use. This may be waived if there are privacy or non-disclosure limitations.

Prepare before class:

Final project, due before Sessions #13-14. Submit a .Rmd file, a .html file, a data file, and your presentation file on the INSEAD course portal

Prepare to present your project in class.

In class material:

Slides Sessions 13-14


"Hall of Fame" -- some selected projects from prior course offerings

FBL 22D -- Public Health at Rotterdam

SGP 22D -- Power Price Prediction

FBL 21D -- RecyGen Insect Protein Demand Forecasting

FBL 21J -- Predicting Hospital Readmissions

SGP 21J -- Home Rental Platform

FBL 20D -- Stopping Suicides

FBL 20D -- Quality Dog Food

SGP 20J -- Not Just Random Forests

SGP 20J -- Predicting Wine Prices

SGP 20J -- Optimizing Outpatient Scheduling

SGP 19D -- Loumidis Coffeeshops

SGP 19D -- Spam or Ham

SGP 19D -- Spanish Utility

SGP 19J -- Elanic Fashion



Group Projects, January-February 2016 (INSEAD 2016J) (note: only info/data without confidentiality/NDA constraints):

Energy Consumption (github source here)

Twitter and Elections (github source here)

Travel Website Analytics (github source here)

Google Analytics Dashboard (tool screenshot) (github source here)

Sports Club Analytics (github source here)

Airbnb (github source here)



Group Projects, May-June 2016 (INSEAD 2016D) (note: only info/data without confidentiality/NDA constraints):

Wine Analytics (github source here)

Firm Fundamentals Analysis (github source here)

Airline Customer Segmentation (github source here)

Speed Dating Intelligence (github source here)

Airbnb Pricing in Amsterdam (github source here)



Group Projects, Jan-Feb 2017 (INSEAD 2017J) (note: only info/data without confidentiality/NDA constraints):

HR Analytics Project 1 (github source here), Project 2 (github source here), Project 3 (github source here), Project 4 (github source here), Project 5 (github source here), Project 6 (github source here)

Wine Analytics (github source here)

Risk in NYC (github source here)

Healthcare (github source here)

Movie Sales (github source here)

Mobile Telco Segmentation (github source here)

Lending Club Defaults Project 1 (github source here) and Project 2 (github source here)

Speed Dating, Project 1, Parts 1, 2, 3, 4 (github source here), and Project 2

Travel Website Analytics (github source here)

Mashable News (github source here)

Flight Delays (github source here)

Credit Card Default (github source here)

e-commerce Analytics (github source here)

Airline Fleet Segmentation (github source here)

Restaurant Ratings (github source here)


Group Projects, Jan-Feb 2018 (INSEAD 2018J) (note: only info/data without confidentiality/NDA constraints):

Airbnb Pricing

Predicting Automotive Equity Performance

Classification for Employee Attrition

Project Alcohol

Formula 1 Prediction

Rehabilitation App

IBM HR Analytics