Course Description
The abundance of data is revolutionizing many industries and creating new, data-intensive business models. To take advantage of this trend, today's MBAs need to be more comfortable with "data science" - an emerging discipline that combines data analytics and business. The goal of this course is to build your capability in data science so that you can better add value through the effective management and use of data in your organizations.
The course will combine three key elements: analytics techniques, business applications, and basic coding/programming (in R, one of the leading open-source tools for analyzing data that you will be able to use in your jobs). The emphasis will be on applications to various business cases in finance, marketing, and operations, among other disciplines.
A prerequisite for the course is the material covered in the INSEAD core course Uncertainty, Data & Judgment (UDJ); this course is a follow-up to UDJ. No prior coding experience is required: for most classes you will receive starter code that you will modify and run. Because of that, much of the course will take the form of a "hands-on" workshop; you will be expected to bring your laptop to class (with all the necessary software tools installed) and actively participate in the learning process.
What you will take away from this course:
- Understand key principles and processes for analyzing data and managing analytics projects;
- Learn to better identify new business opportunities for data analytics, and the specific strategies for extracting business value from data;
- Learn several advanced analytics/machine learning techniques: generalized linear models (logistic regression), CART, random forests, SVM, methods for segmentation and clustering, and neural networks;
- Get an introductory exposure to coding (in R) on which you will be able to build in your jobs;
- Get an introductory understanding of data science, "data scientists", and how to work with them.
The course is built around specific business cases that we will solve in a step-by-step approach, while getting introduced to the topics above.
This is not a course to become "data scientists" or even to become "experts in analytics". The goal is to familiarize participants with what is available and possible for analytics. It is meant to be a starting point.
Course Tools
The course uses the R programming language, and the main tool is RStudio. All participants are required to install RStudio before the first session.
Since coding is likely new to many of you, if you experience any issues using the class tools, please post them as "issues" on the course website's GitHub page, after exploring any related past issues there. While answers will be provided by the course TAs, participation points will be awarded to participants for responding to peers' issues. What is GitHub? It is a web-based platform for hosting code and collaborating on it, and as the course progresses we hope you will find it useful to register there and use it to share your work.
A key lesson of the course is that an important success factor for data analytics projects is to have a good balance between creative customization and codified, reproducible and reusable end-to-end analytics processes ("solutions"). We will develop codified end-to-end processes ("solutions") during the course. Examples of reproducible and reusable analytics solutions can also be found in the Microsoft Azure Machine Learning Studio platform. This space is growing and changing fast, with various cloud-based platforms being developed such as Google Cloud Machine Learning Engine and Amazon's Artificial Intelligence on AWS, among others.
Cases and Final Project
Three cases will be assigned, to be completed and submitted in groups. Each case will be on a business application (relating to finance, marketing, and operations, among others). The cases will focus on implementing analytics techniques or models taught in class, and deriving relevant managerial insights, through a step-by-step solution approach.
A central part of the course is the group final project. For the final project, every group is required to develop a data analytics solution to a business problem, and share the relevant data. The project should include three parts: a clear process for how to solve the business problem, with steps codified in R code; an application of the process to a specific dataset; and a specification that lets others use the process, e.g. with different data. The professors will meet with all groups to discuss final project ideas and progress; the TA will be available to help with specific implementation issues.
Books
There is no required textbook, but these books are recommended as optional background readings:
Data Science for Business: Fundamental Principles of Data Mining and Data-Analytic Thinking (DSB book)
by F. Provost and T. Fawcett (2013)
An open-source (free) online textbook covering much of the material in the early part of the course is Forecasting: Principles and Practice (FPP). It also uses R for examples. Browse through the book and don't hesitate to use it as an extended help file.
Grading
See the INSEAD Course Portal.
Course Sessions
Sessions 1-2: Introduction to predictive modeling, linear regression review and transition from Excel to R.
The session will start with an overview of the course and will then transition to predictive modeling using linear regressions. For that we will recall the linear regression modeling from UDJ and "take it to the next level" with R. We will also lightly touch on visualizations (in R and in Tableau). A Tableau demo will be done live, and the R code will be provided to run/follow in class.
Read before class:
Case: "Sarah Gets a Diamond"
The file "0102 Sarah Gets a Diamond data.xls" contains data about various attributes for 9000 diamonds. For the first 6000 you also have the price. The goal is to predict the prices of the remaining diamonds (i.e., those with IDs 6000 and above). You will build a (regression) model to do that in class and we will have a little competition.
Prepare before class:
Install RStudio (also see the R project)
Install Tableau
Think about your groups (4-5 people per group). Groups will be finalized in the break between Sessions 1 and 2.
(Optional) Explore these applications developed using some of the tools we use in class
In class material:
R code we used in Sessions 1-2
Datafile 0102 Sarah Gets a Diamond data.csv
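As a rough illustration of the workflow described for this case (fit a regression on the diamonds with known prices, then predict the rest), here is a minimal sketch in base R. It does not use the course data: the data frame, its columns (`carat`, `cut`) and the 200/100 split are hypothetical stand-ins, and modeling log(price) is just one reasonable choice, not necessarily the one used in class.

```r
# Hypothetical stand-in for the diamond data; the real file's columns may differ.
set.seed(1)
n <- 300
diamonds <- data.frame(
  id    = 1:n,
  carat = runif(n, 0.3, 2.5),
  cut   = factor(sample(c("Good", "Very Good", "Ideal"), n, replace = TRUE))
)
diamonds$price <- exp(7 + 1.2 * diamonds$carat +
                        0.1 * (diamonds$cut == "Ideal") +
                        rnorm(n, sd = 0.15))

# Pretend prices are known only for the first 200 diamonds.
train   <- diamonds[diamonds$id <= 200, ]
holdout <- diamonds[diamonds$id >  200, ]

# Linear regression on log(price), fitted on the training rows only.
fit <- lm(log(price) ~ carat + cut, data = train)

# Predict for the held-out diamonds and convert back to the price scale.
holdout$predicted <- exp(predict(fit, newdata = holdout))

# Mean absolute percentage error on the held-out rows.
mape <- mean(abs(holdout$predicted - holdout$price) / holdout$price) * 100
print(round(mape, 1))
```

In the real case the held-out prices are not available to you, so an error measure like this would be computed on a validation subset carved out of the diamonds with known prices.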
Homework for next time
Read the Time Series note for Sessions #3-4 and browse through the FPP book.
Optional Readings:
What is R? R Reference Card.
Chapters 1 and 2 from the DSB book
Chapters 4 and 5 from the FPP book
Other Resources:
A Coursera R programming course
Tutorial 1: Getting comfortable with R (and coding in general)
Coding is likely new to many of you, hence the goal of the first tutorial is to help you become comfortable with "writing" simple code by modifying the templates given in class. In particular, you will work on the following problem:
Context: you are selling stuffed toy animals for the holiday season. Each animal is first sold in a test market, and the "test" data is recorded. Then the selected toys are sold in all markets, and the total "sales" data is recorded. In addition, the kind of toy ("SKU") is recorded.
The goal: predict the sales of a "Bear" for which the test sales were 10 items. First, consider a single-variable regression, and then consider the multi-variable case.
The file T1 Sales Data contains the data.
In-class files: R-script contains the code produced during the tutorial.
Data wrangling cheat-sheet for dplyr.
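The tutorial problem above can be sketched as follows. This is only an illustration on made-up numbers, not the T1 Sales Data file: the column names `test`, `sales` and `SKU` follow the problem description, but the values, the other SKUs and the effect sizes are assumptions.

```r
# Made-up toy data; the real T1 Sales Data file will differ.
set.seed(42)
toys <- data.frame(
  SKU  = factor(rep(c("Bear", "Rabbit", "Dog"), each = 10)),
  test = rep(1:10, times = 3) * 2
)
sku_effect <- c(Bear = 50, Rabbit = 20, Dog = 0)
toys$sales <- 30 + 12 * toys$test +
  unname(sku_effect[as.character(toys$SKU)]) +
  rnorm(nrow(toys), sd = 8)

# Single-variable regression: sales explained by test-market sales only.
fit_single <- lm(sales ~ test, data = toys)

# Multi-variable regression: add the kind of toy as a categorical predictor.
fit_multi <- lm(sales ~ test + SKU, data = toys)

# Predict sales of a Bear whose test sales were 10 items.
new_toy <- data.frame(test = 10,
                      SKU  = factor("Bear", levels = levels(toys$SKU)))
pred_single <- predict(fit_single, newdata = new_toy)
pred_multi  <- predict(fit_multi,  newdata = new_toy)
print(c(single = unname(pred_single), multi = unname(pred_multi)))
```

The single-variable model ignores the SKU, so its prediction is the same for every kind of toy with 10 test sales; the multi-variable model shifts the prediction by the estimated effect of being a Bear.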
Sessions 3-4: Time series modeling (with R)
We will continue exploring data science with R, by considering predicting a time series - a common type of data indexed in chronological order. A variety of analytical techniques will be discussed: moving average, exponential smoothing, noise-trend-seasonality decompositions, trigonometric decompositions, ARIMA models and "dynamic regressions" (ARIMA models with covariates; equivalently, regression models with ARIMA errors). As always, R code will be provided to be followed in class.
Read before class:
Separate data file for time series analyses Electric Rates.csv
Prepare before class:
Nothing to prepare
In class material:
R code we used in Sessions 3-4, part 1
R code we used in Sessions 3-4, part 2
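For orientation, several of the techniques listed for these sessions are available in base R's stats package. The sketch below runs them on a simulated monthly series, not on Electric Rates.csv; the series and the chosen model orders are assumptions for illustration, and the class code may rely on other packages.

```r
# Simulated monthly series with trend, seasonality and noise (illustrative only).
set.seed(7)
months <- 1:72
y <- ts(100 + 0.8 * months + 10 * sin(2 * pi * months / 12) + rnorm(72, sd = 3),
        start = c(2015, 1), frequency = 12)

# 12-month moving average (NA where the window does not fit).
ma12 <- stats::filter(y, rep(1 / 12, 12), sides = 2)

# Simple exponential smoothing: Holt-Winters without trend or seasonal terms.
ses <- HoltWinters(y, beta = FALSE, gamma = FALSE)

# Classical decomposition into trend, seasonal and random components.
parts <- decompose(y)

# Seasonal ARIMA; the orders here are an arbitrary choice for the sketch.
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the next 12 months.
fc <- predict(fit, n.ahead = 12)
print(round(fc$pred, 1))
```

A dynamic regression would pass covariates through the `xreg` argument of `arima()`, with matching future values supplied as `newxreg` in `predict()`.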
Homework for Sessions 7-8
Yahoo's Acquisition of Tumblr + valuation spreadsheet and data
The ultimate question: If you were advising Yahoo's board, would you recommend approving the acquisition of Tumblr for $1.1 billion? Please prepare a slide deck explaining your recommendation, illustrating it with any necessary graphs/charts/tables, etc. To help you in doing so, please consider the following sub-questions:
What was the average monthly growth rate in the number of people worldwide accessing Tumblr's site since its inception (over the last 37 months)? How does it compare to the average monthly growth rate over the past 12 months? What are the valuations of Tumblr given these two growth rates?
Use the data provided to forecast the number of people worldwide accessing Tumblr's site for the next 115 months (June 2013 - Dec 2022).
Given your forecast, revise the valuation of Tumblr (case Exhibit 5, valuation spreadsheet).
Submit PPT, R code, valuation model on the INSEAD course portal
Optional Readings:
Chapters 6, 7 and 8, and Section 9.1, from the FPP book
Sessions 5-6: Introduction to classification. Logistic regression and CART
In Sessions 1-4 we focused on predicting continuous quantities (e.g., the price of an item). But it is equally important and common to predict events: for example, will a customer purchase again, will a borrower default, etc. This task is referred to as "classification". In this session we will consider the main metrics for evaluating classification models and discuss two popular techniques. The first is logistic regression, a generalization of linear regression. The second is CART (classification and regression trees).
Read before class:
Case: "Retention Modeling at Scholastic Travel Company (A)." UVA-QA-0864 + Excel data + CSV data
Case: "Retention Modeling at Scholastic Travel Company (B)." UVA-QA-0865 + Excel data + CSV data. The (B) case will be discussed during Tutorial 2
Modeling Discrete Choice: Categorical Dependent Variables, Logistic Regression, and Maximum Likelihood Estimation + Excel data and model + data + R code
Skim through Chapter 4 (focus more on Section 4.3), and Sections 7.1 and 8.2-8.5 of the DSB book
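To make the classification workflow concrete, here is a minimal sketch of a logistic regression with a confusion matrix in base R. The data are simulated and the variable names (`tenure`, `nps`, `retained`) are hypothetical; they are not the STC case data. CART is not shown because it needs an add-on package (commonly `rpart`).

```r
# Simulated customer data (illustrative only).
set.seed(123)
n <- 500
customers <- data.frame(
  tenure = rpois(n, 3),
  nps    = sample(0:10, n, replace = TRUE)
)
p_true <- plogis(-2 + 0.4 * customers$tenure + 0.2 * customers$nps)
customers$retained <- rbinom(n, size = 1, prob = p_true)

# Split into training and testing sets.
train_idx <- sample(n, 350)
train <- customers[train_idx, ]
test  <- customers[-train_idx, ]

# Logistic regression: a generalized linear model with a binomial family.
fit <- glm(retained ~ tenure + nps, data = train, family = binomial)

# Predicted probabilities, turned into classes with a 0.5 cutoff.
test$prob <- predict(fit, newdata = test, type = "response")
test$pred <- as.integer(test$prob >= 0.5)

# Confusion matrix and accuracy on the testing set.
confusion <- table(actual    = factor(test$retained, levels = 0:1),
                   predicted = factor(test$pred,     levels = 0:1))
accuracy <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(round(accuracy, 3))
```

The 0.5 cutoff is a default, not a recommendation: moving it trades false positives against false negatives, which is where the business context of each case comes in.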
In class material:
Homework for Sessions 7-8
Assignment 2 (Credit Card Default, A2_new_customers.csv) is due by 8am the day of Sessions #7-8. Submit your report (if possible in .Rmd and .html format, otherwise in a ppt format) on the INSEAD course portal, and be ready to present your work in class.
Note: starting from Sessions #7-8 we will also learn a new way/"process" for doing and communicating analytics, an approach that we will refer to as "notebooks". These are codified "solutions" that feed off live data and present the analyses and their explanation in one self-contained document. For an example of scalable, reusable work, explore this example of how to use the tools to generate (long) reusable reports efficiently. How many lines do you need to edit in order to generate this long report? Here is the source code that generated this report. We will see another example in Sessions #9-10 and will discuss it in detail.
Tutorial 2: Mid-course help with R. Classification. *.Rmd files
The goal of the second tutorial is threefold. First, to serve as a mid-course Q&A on any of the material thus far. Second, to strengthen your understanding of the classification material. For that, you will work (with the TA, of course) on the STC (B) case - an extension of the work we have done in Sessions 5-6, based on additional data about Net Promoter Scores (NPS). Third, the tutorial will offer additional exposure to *.Rmd files as a way to combine "doing" and "communicating" analytics. This should be particularly handy as your Assignment 2 is in the form of an .Rmd file.
Case: "Retention Modeling at Scholastic Travel Company (B)." UVA-QA-0865 + Excel data + CSV data.
R script that we generated during the tutorial can be found here
RMD example that we saw in the tutorial can be found here
Sessions 7-8: Advanced Classification Methods and Dimensionality Reduction
We wrap up the discussion on classification using logistic regression and CART with presentations of the Credit Card Default case. We then switch gears and discuss dimensionality reduction: how to generate a few meaningful derived attributes starting from a large number of raw attributes. We present Principal Component Analysis, a popular technique for dimensionality reduction, in detail. Throughout this session, we will be working on notebooks - a reusable, replicable, and easy-to-share way of doing and communicating analytics with publication-quality output in the form of a single self-contained report.
Read before class:
Dimensionality Reduction and Derived Attributes (pdf version available here)
Boats: Segmentation Case Boats A (Part I&II) (Note: official case is INSEAD case 09/2012-5849).
Prepare before class:
Assignment 2 (Credit Card Default, A2_new_customers.csv) is due by 8am the day of Sessions #7-8. Submit your report (if possible in .Rmd and .html format, otherwise in a ppt format) on the INSEAD course portal, and be ready to present your work in class.
Before next class, please make sure you have downloaded all the course code from GitHub and that you can open the .Rmd files in the directory CourseSessions/InClassProcess in RStudio. You can download and unzip all code directly from the course's GitHub page - see these screenshots (1) and (2). We will be working in class on this file.
In class material:
We will work on Part 1 of the segmentation template document (we will be editing this raw file)
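As a small illustration of the dimensionality reduction idea covered in these sessions, the sketch below runs Principal Component Analysis with base R's `prcomp()` on simulated survey-style data. The six questions and the two underlying factors are assumptions made up for the example; the segmentation template used in class is more elaborate.

```r
# Simulated survey: six raw attributes driven by two latent factors.
set.seed(2024)
n <- 200
latent1 <- rnorm(n)
latent2 <- rnorm(n)
survey <- data.frame(
  q1 = latent1 + rnorm(n, sd = 0.3),
  q2 = latent1 + rnorm(n, sd = 0.3),
  q3 = latent1 + rnorm(n, sd = 0.3),
  q4 = latent2 + rnorm(n, sd = 0.3),
  q5 = latent2 + rnorm(n, sd = 0.3),
  q6 = latent2 + rnorm(n, sd = 0.3)
)

# PCA on standardized attributes.
pca <- prcomp(survey, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component, and cumulatively.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))

# Keep the first two components as derived attributes.
derived <- pca$x[, 1:2]
```

Here two components capture most of the variance because the data were built from two factors; with real data, the number of components to keep is a judgment call informed by the cumulative variance and by interpretability.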
Homework for next time
No Homework
Optional Readings:
Chapter 6.4 of the DSB book, skim through all of Chapter 6.
Sessions 9-10: Clustering for Segmentation
In data analytics we oftentimes want to organize large data into clusters with similar observations within each cluster. In order to identify segments in the data we use clustering techniques, which group the data into a few segments so that data within a segment are similar, while data across segments are different. We will propose and run in class a complete process for clustering and segmentation, and we will discuss in detail two widely used clustering techniques: k-means clustering and hierarchical clustering.
Read before class:
Clustering and Segmentation (pdf version available here)
Skim through Chapter 6, read Section 6.4 of the DSB book
Boats: Segmentation Case Boats A (Part I&II) (Note: official case is INSEAD case 09/2012-5849).
Prepare before class:
Make sure you can "knit" the market segmentation process document and generate an html report (check out this issue if needed).
Explore some Shiny Dashboard visualization tools
In class material:
We will continue work on Part 2 of the segmentation template document (we will be editing this raw file)
(Optional) Some slides on the Boats case (Part I).
(Optional) These interactive document tools may also be used during class (running on RStudio): the Interactive Factors Analysis Tool as well as the Interactive Cluster Analysis Tool
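The two clustering techniques discussed in these sessions can both be run in base R. The sketch below applies them to simulated customer data with three built-in segments; the attribute names and the choice of three clusters are assumptions for illustration and are unrelated to the Boats case data.

```r
# Simulated customers drawn from three distinct segments (illustrative only).
set.seed(99)
customers <- rbind(
  data.frame(price_sensitivity = rnorm(50, 2, 0.5), brand_loyalty = rnorm(50, 8, 0.5)),
  data.frame(price_sensitivity = rnorm(50, 8, 0.5), brand_loyalty = rnorm(50, 2, 0.5)),
  data.frame(price_sensitivity = rnorm(50, 5, 0.5), brand_loyalty = rnorm(50, 5, 0.5))
)

# Standardize so that both attributes contribute equally to distances.
scaled <- scale(customers)

# k-means with several random starts to avoid a poor local solution.
km <- kmeans(scaled, centers = 3, nstart = 25)

# Hierarchical clustering (Ward linkage), cut into three groups.
hc <- hclust(dist(scaled), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# Cross-tabulate the two segmentations to see how far they agree.
agreement <- table(kmeans = km$cluster, hierarchical = hc_groups)
print(agreement)

# Segment profiles: average of each attribute within each k-means cluster.
profiles <- aggregate(customers, by = list(segment = km$cluster), FUN = mean)
print(profiles)
```

Cluster labels are arbitrary, so the two methods may number the same segment differently; what matters in the cross-tabulation is whether each row has one dominant column.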
Homework for next time
Proposal for final project, due at 8am on the day of Sessions #11-12. Submit on the INSEAD course portal a short (1-2 pages) description of the business problem, your business solution process, and some preliminary description of the data you plan to use.
Assignment 3 (Boats (A): A Segmentation Case) is due by 8am the day of Sessions #11-12. You should work only on Parts 1 and 2 of the segmentation template document we use in class, and submit a .Rmd file and a .html file of your work on the INSEAD course portal.
Sessions 11-12: AI, Machine Learning, and Managing Data Science
In this session, we first discuss Artificial Intelligence and Machine Learning, and how they relate to Data Science. We will then focus on how to plan and manage data science projects and processes, and discuss principles such as reusability, replicability, scalability, and shareability.
In class material:
Some slides to browse before class on AI, Machine Learning, Business, and Society (you can also download the file via slideshare).
Read before class (Optional)
Skim through Chapters 1 and 2 of DSB. Focus on Sections 1.4, 1.7, 1.8, 2.1 and 2.4
Skim through the CRISP-DM documentation
How to Tell If You Should Trust Your Statistical Models
Does bigger data lead to better decisions?
Run Field Experiments to Make Sense of Your Big Data
Submit before class:
Assignment 3 (Boats (A): A Segmentation Case) (Parts 1 and 2 of the segmentation template document we use in class) is due by 8am the day of Sessions #11-12. Submit a .Rmd file and a .html file on the INSEAD course portal.
Homework for next time
Final project, due at the beginning of Sessions #13-14. Submit a presentation (ppt and code, or .Rmd and .html files) and data on the INSEAD course portal.
Prepare to present your project in class.
Sessions 13-14: Project presentations
Groups present their final projects.
Submit before class:
Final project, due at the beginning of Sessions #13-14. Submit a presentation (ppt and code, or .Rmd and .html files) and data on the INSEAD course portal.
Prepare to present your project in class.
Extra Readings:
An FDA Proposal for Regulating AI and Machine Learning Medical Devices.
Regulating Autonomous Systems.
Group Projects, January-February 2016 (INSEAD 2016J) (note: only info/data not subject to confidentiality/NDA constraints):
Energy Consumption (github source here)
Twitter and Elections (github source here)
Travel Website Analytics (github source here)
Google Analytics Dashboard (tool screenshot) (github source here)
Sports Club Analytics (github source here)
Airbnb (github source here)
Group Projects, May-June 2016 (INSEAD 2016D) (note: only info/data not subject to confidentiality/NDA constraints):
Wine Analytics (github source here)
Firm Fundamentals Analysis (github source here)
Airline Customer Segmentation (github source here)
Speed Dating Intelligence (github source here)
Airbnb Pricing in Amsterdam (github source here)
Group Projects, Jan-Feb 2017 (INSEAD 2017J) (note: only info/data not subject to confidentiality/NDA constraints):
HR Analytics Project 1 (github source here), Project 2 (github source here), Project 3 (github source here), Project 4 (github source here), Project 5 (github source here), Project 6 (github source here)
Wine Analytics (github source here)
Risk in NYC (github source here)
Healthcare (github source here)
Movie Sales (github source here)
Mobile Telco Segmentation (github source here)
Lending Club Defaults Project 1 (github source here) and Project 2 (github source here)
Speed Dating, Project 1, Parts 1, 2, 3, 4 (github source here), and Project 2
Travel Website Analytics (github source here)
Mashable News (github source here)
Flight Delays (github source here)
Credit Card Default (github source here)
e-commerce Analytics (github source here)
Airline Fleet Segmentation (github source here)
Restaurant Ratings (github source here)