One of our 18J classmates is serious about starting her business: an app that helps patients going through rehabilitation. She has been working hard over the last months to develop her concept and has already run a survey to better understand people’s interest in the product. We would like to help her by applying the segmentation tools we have learned in class. Our objective is to find key insights that will help her target the market correctly. In addition, the survey has only been run with INSEAD students so far, so we would like to give her advice on how to improve it before she rolls it out to a bigger population.
The business problem that we are trying to solve is simple: who are the people who will be willing to use this type of product? Do we need to differentiate the product based on the type of user?
To answer these questions, we will first identify key user descriptors and select the variables that best describe them, and then segment the market using cluster analysis. We will then classify these clusters of users using the profiling data set and understand their key drivers of success using the segmentation data. Based on these steps, we will make recommendations to our classmate for developing the survey further and, hopefully, offer her some insights on how to develop her platform.
The “high level” process template is split into 2 parts, corresponding to course Sessions 3-4 and 5-6, plus an optional last part:
Part 1: We use some of the survey questions to find key customer descriptors (“factors”) using dimensionality reduction techniques described in the Dimensionality Reduction reading of Sessions 3-4.
Part 2: We use the selected customer descriptors to segment the market using cluster analysis techniques described in the Cluster Analysis reading of Sessions 5-6.
We are clearly faced with a limited dataset: there are currently 130 observations and counting. We would therefore like to take this opportunity to explore the data and create a presentation that recommends a path forward to our friend, using the methodology taught in our class. We hope that it will be a valuable experience to apply big data techniques during the research phase of a startup.
First we load the data to use (see the raw .Rmd file to change the data file as needed):
The following are the columns of our dataset:
[1] “timestamp”
[2] “email”
[3] “had_rehab_binary”
[4] “injury_type_categorical_categorical”
[5] “how_easy_1_to_5”
[6] “rehab_time_months”
[7] “self_discipline_1_to_5”
[8] “did_self_exercise_binary”
[9] “helpful_to_have_guidance_categorical_1_to_4”
[10] “prefer_fewer_trips_binary”
[11] “follow_prescriptions_1_to_3”
[12] “psych_support_categorical_1_to_3”
[13] “trust_internet_advice_1_to_5”
[14] “difficulty_maintaining_health_1_to_3”
[15] “use_smartphone_binary”
[16] “difficulty_of_rehab_1_to_5”
[17] “trust_medical_advice_1_to_5”
The data in each column is made of text. We have therefore transformed each string value into a ranking from 1 to 5, with 1 being the worst and 5 the best. Likewise, we transform each yes/no answer into a binary variable.
In addition, we limit the analysis to respondents who said they have already gone through rehabilitation, as those who have not were directed to another set of questions.
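The recoding and filtering steps described above can be sketched as follows. This is only an illustration: the answer wordings in `EASE_SCALE` and the sample rows are assumptions, since the actual survey labels are not reproduced in this report, and only two of the columns are shown.

```python
# Hypothetical answer wordings: the real survey labels are not shown in this report.
EASE_SCALE = {
    "Very hard": 1,
    "Hard": 2,
    "Neutral": 3,
    "Easy": 4,
    "Very easy": 5,
}
YES_NO = {"Yes": 1, "No": 0}


def recode(rows):
    """Turn text answers into numbers and keep only respondents who had rehab."""
    cleaned = []
    for row in rows:
        if YES_NO[row["had_rehab_binary"]] != 1:
            continue  # these respondents answered a different set of questions
        cleaned.append({
            "how_easy_1_to_5": EASE_SCALE[row["how_easy_1_to_5"]],
            "did_self_exercise_binary": YES_NO[row["did_self_exercise_binary"]],
        })
    return cleaned


# Made-up raw answers, for illustration only.
raw = [
    {"had_rehab_binary": "Yes", "how_easy_1_to_5": "Hard", "did_self_exercise_binary": "Yes"},
    {"had_rehab_binary": "No", "how_easy_1_to_5": "", "did_self_exercise_binary": ""},
    {"had_rehab_binary": "Yes", "how_easy_1_to_5": "Very easy", "did_self_exercise_binary": "No"},
]
survey = recode(raw)
```

Keeping the mappings in explicit dictionaries makes it easy to check that "1 is the worst and 5 the best" holds for every question before any analysis is run.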
Now that the data has been cleaned and is ready for analysis, we will see if all 9 questions of the survey are necessary to understand the target customers. For that we will run a principal component analysis.
Visualising the global statistics of the data helps us understand the general dynamics of the survey responses.
Variable | min | 25 percent | median | mean | 75 percent | max | std |
---|---|---|---|---|---|---|---|
how_easy_1_to_5 | 1 | 2 | 3 | 2.97 | 4 | 5 | 1.20 |
rehab_time_months | 0 | 2 | 3 | 3.54 | 4 | 18 | 3.42 |
self_discipline_1_to_5 | 1 | 2 | 2 | 2.42 | 3 | 5 | 1.01 |
did_self_exercise_binary | 0 | 1 | 1 | 0.91 | 1 | 1 | 0.28 |
helpful_to_have_guidance_categorical_1_to_4 | 1 | 3 | 4 | 3.32 | 4 | 4 | 0.93 |
prefer_fewer_trips_binary | 0 | 0 | 1 | 0.59 | 1 | 1 | 0.49 |
follow_prescriptions_1_to_3 | 1 | 1 | 2 | 1.57 | 2 | 3 | 0.61 |
psych_support_categorical_1_to_3 | 0 | 0 | 1 | 0.65 | 1 | 1 | 0.48 |
trust_internet_advice_1_to_5 | 1 | 2 | 2 | 2.43 | 3 | 4 | 0.79 |
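For readers who want to reproduce this kind of summary, a minimal sketch using only the Python standard library is given below. The function name `describe` and the sample answers are made up for illustration. The `inclusive` quantile method is an assumption, chosen because it matches the default quantile definition in R; the report does not state which definition was used.

```python
import statistics


def describe(values):
    """Return the summary statistics shown in the table for one column."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "25 percent": q1,
        "median": median,
        "mean": round(statistics.mean(values), 2),
        "75 percent": q3,
        "max": max(values),
        "std": round(statistics.stdev(values), 2),  # sample standard deviation
    }


# Made-up answers to a 1-to-5 question, for illustration only.
answers = [1, 2, 2, 3, 3, 3, 4, 4, 5]
summary = describe(answers)
```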
In particular, it is interesting to check whether some questions are correlated. Strong correlations would indicate that several questions measure the same thing and that we need to reduce the number of factors used for the segmentation. This does not seem to be the case here (as the table below indicates). However, we will still run the principal component analysis to confirm this hypothesis.
Variable | how_easy_1_to_5 | rehab_time_months | self_discipline_1_to_5 | did_self_exercise_binary | helpful_to_have_guidance_categorical_1_to_4 | prefer_fewer_trips_binary | follow_prescriptions_1_to_3 | psych_support_categorical_1_to_3 | trust_internet_advice_1_to_5 |
---|---|---|---|---|---|---|---|---|---|
how_easy_1_to_5 | 1.00 | 0.12 | 0.34 | -0.01 | -0.06 | 0.05 | 0.08 | -0.22 | -0.08 |
rehab_time_months | 0.12 | 1.00 | -0.09 | 0.16 | 0.02 | -0.10 | -0.16 | -0.33 | 0.06 |
self_discipline_1_to_5 | 0.34 | -0.09 | 1.00 | 0.18 | 0.11 | 0.02 | 0.21 | -0.03 | 0.10 |
did_self_exercise_binary | -0.01 | 0.16 | 0.18 | 1.00 | 0.38 | -0.05 | 0.03 | -0.12 | -0.03 |
helpful_to_have_guidance_categorical_1_to_4 | -0.06 | 0.02 | 0.11 | 0.38 | 1.00 | -0.03 | 0.01 | -0.08 | -0.01 |
prefer_fewer_trips_binary | 0.05 | -0.10 | 0.02 | -0.05 | -0.03 | 1.00 | -0.01 | -0.05 | 0.04 |
follow_prescriptions_1_to_3 | 0.08 | -0.16 | 0.21 | 0.03 | 0.01 | -0.01 | 1.00 | -0.07 | 0.25 |
psych_support_categorical_1_to_3 | -0.22 | -0.33 | -0.03 | -0.12 | -0.08 | -0.05 | -0.07 | 1.00 | 0.02 |
trust_internet_advice_1_to_5 | -0.08 | 0.06 | 0.10 | -0.03 | -0.01 | 0.04 | 0.25 | 0.02 | 1.00 |
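In the table above, the largest off-diagonal value in absolute terms is 0.38 (between did_self_exercise_binary and helpful_to_have_guidance_categorical_1_to_4). The sketch below shows how such a check could be automated: it computes Pearson correlations and returns the most strongly correlated pair of columns. The helper names and the toy columns are assumptions for illustration, not the survey data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strongest_pair(columns):
    """Return (|r|, name_a, name_b) for the most correlated pair of columns."""
    names = list(columns)
    best = (0.0, None, None)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = abs(pearson(columns[a], columns[b]))
            if r > best[0]:
                best = (r, a, b)
    return best


# Made-up columns, for illustration only.
toy = {
    "how_easy_1_to_5": [1, 2, 3, 4, 5],
    "self_discipline_1_to_5": [2, 1, 3, 5, 4],
    "prefer_fewer_trips_binary": [1, 0, 1, 0, 1],
}
r, a, b = strongest_pair(toy)
```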
In fact, when running the PCA we see (table below) that 4 components have eigenvalues greater than 1 and that together they capture more than 60% of the information in the survey.
Component | Eigenvalue | Pct of explained variance | Cumulative pct of explained variance |
---|---|---|---|
Component 1 | 1.68 | 18.67 | 18.67 |
Component 2 | 1.45 | 16.06 | 34.73 |
Component 3 | 1.31 | 14.57 | 49.31 |
Component 4 | 1.12 | 12.49 | 61.80 |
Component 5 | 1.00 | 11.09 | 72.89 |
Component 6 | 0.81 | 8.98 | 81.87 |
Component 7 | 0.62 | 6.93 | 88.80 |
Component 8 | 0.54 | 6.02 | 94.82 |
Component 9 | 0.47 | 5.18 | 100.00 |
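The selection rule used here (keep components with an eigenvalue above 1) can be checked directly from the eigenvalue column. The sketch below uses the rounded values from the table, so the percentages it produces differ from the table in the second decimal. Treating the fifth eigenvalue, displayed as 1.00, as not exceeding 1 is an assumption that matches the choice of 4 components in the text.

```python
# Eigenvalues copied from the table above (already rounded to two decimals).
eigenvalues = [1.68, 1.45, 1.31, 1.12, 1.00, 0.81, 0.62, 0.54, 0.47]

# For standardized variables the eigenvalues sum to the number of variables (9).
total = sum(eigenvalues)
pct = [100 * ev / total for ev in eigenvalues]

# Running total of the explained variance.
cumulative = []
running = 0.0
for p in pct:
    running += p
    cumulative.append(running)

# Eigenvalue-greater-than-1 rule: count components strictly above 1.
n_keep = sum(1 for ev in eigenvalues if ev > 1.0)

print(n_keep, round(cumulative[n_keep - 1], 1))
```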
Variable | Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 |
---|---|---|---|---|---|
helpful_to_have_guidance_categorical_1_to_4 | 0.83 | ||||
did_self_exercise_binary | 0.81 | ||||
self_discipline_1_to_5 | 0.76 | ||||
rehab_time_months | 0.84 | ||||
follow_prescriptions_1_to_3 | 0.71 | ||||
prefer_fewer_trips_binary | 0.98 | ||||
trust_internet_advice_1_to_5 | 0.85 | ||||
psych_support_categorical_1_to_3 | -0.75 | ||||
how_easy_1_to_5 | 0.81 |
During part 1, the following questions have been identified as sufficient to define a segment:

- On a scale from 1 to 5, how easy was the rehab process for you?
- Did you feel like you were missing psychological support during your rehabilitation?
- On a scale from 1 to 5, how much do you trust medical advice available on the internet?
- Do you think it is helpful to have someone who pushes you?

The segmentation will thus be based on those questions.
For the segmentation, we use the Euclidean distance to measure how far respondents are from each other. We have plotted the distances of the data set below.
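As a reminder of what this distance measures, here is a minimal sketch that builds the pairwise Euclidean distance matrix for a few respondents. The respondents and their scores are made up; each tuple is assumed to hold the answers to the four segmentation questions in a fixed order.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two respondents' answer vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(points):
    """Symmetric matrix of pairwise distances, with zeros on the diagonal."""
    n = len(points)
    return [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]


# Hypothetical respondents, answers in the order:
# (how_easy, psych_support, trust_internet_advice, helpful_to_have_guidance)
respondents = [
    (2, 1, 3, 4),
    (2, 1, 3, 4),
    (5, 0, 1, 1),
]
dist = distance_matrix(respondents)
```

Because the questions use different scales, variables with a wider range weigh more in this distance unless the answers are standardized first.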
Looking at the hierarchical clustering, we identified that the population could be divided into 4 different segments (as shown on the dendrogram, which we have not plotted here because it creates an error that stops all other graphs from being plotted).
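To make the idea of cutting a hierarchy into 4 segments concrete, the sketch below implements a small agglomerative clustering in plain Python. It is an illustration only: the complete-linkage rule is an assumption (the report does not state which linkage was used), and the respondents are made up.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, k):
    """Merge the two closest clusters (complete linkage) until k clusters remain.

    Returns a list of clusters, each a sorted list of row indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(euclidean(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Made-up respondents on the four segmentation questions, for illustration only.
respondents = [
    (1, 1, 2, 4), (1, 1, 3, 4),
    (5, 0, 2, 1), (5, 0, 1, 1),
    (3, 0, 4, 4), (3, 0, 5, 4),
    (2, 1, 1, 1), (2, 1, 2, 1),
]
segments = agglomerate(respondents, k=4)
```

Stopping the merges at `k=4` is the programmatic equivalent of cutting the dendrogram at the height where four branches remain.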
The snake graph below gives a good visualisation of the 4 segments.
We have been able to identify:
Segment 1: people with low self-discipline who could benefit from the motivational aspects of an app.
Segment 2: people who had a bad rehab experience and tend to believe medical advice found on the internet.
Segment 3: people who had no specific problems during their rehab.
Segment 4: people who had the worst rehab experience and have good self-discipline.
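The values behind a snake graph are simply the average score of each variable within each segment. A minimal sketch is shown below, assuming the variables are standardized first so that questions on different scales can be compared. The data, the labels and the function names are made up for illustration.

```python
import statistics


def standardize(column):
    """Scale a column to mean 0 and standard deviation 1."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]


def segment_profiles(data, labels):
    """Average standardized score of each variable within each segment."""
    scaled = {name: standardize(col) for name, col in data.items()}
    profiles = {}
    for seg in sorted(set(labels)):
        rows = [i for i, lab in enumerate(labels) if lab == seg]
        profiles[seg] = {
            name: statistics.mean(col[i] for i in rows)
            for name, col in scaled.items()
        }
    return profiles


# Made-up answers and segment labels, for illustration only.
data = {
    "how_easy_1_to_5": [1, 1, 5, 5],
    "self_discipline_1_to_5": [4, 5, 2, 1],
}
labels = [1, 1, 2, 2]
profiles = segment_profiles(data, labels)
```

A segment whose profile sits well below 0 on how_easy_1_to_5 and above 0 on self_discipline_1_to_5 would read like Segment 4 in the list above.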
Out of the four selected components, one looks at the importance of having advice and guidance during rehabilitation, another characterises the need for psychological support during long rehabilitations, a third relates the ease of the rehabilitation to self-discipline, and the last combines how much people trust internet advice with whether or not they follow their prescribed exercises. These 4 categories make sense.
The first phase of this project should be useful to the founding team as it identifies key potential clients or focuses of the business. We advise the founders to consider this as a productive initial step, but only the beginning of a robust information-gathering phase of the project. This next section will discuss our advice to the founding team and recommend some paths forward.
First of all, it must be noted that the clustering done was based on extremely limited data of only INSEAD students with only 160 observations at the time of publishing. Before considering rolling out an app to the general public, a similar but improved survey should be conducted on segments of interest in the general public. We have identified some major changes that could be made in the surveying methodology.
We need y-variables!
We assume the founding team would like to gauge whether the respondents would potentially be interested in using an app and, additionally, find out their willingness to pay. A major challenge for our group is that we are not able to answer this question based on the current survey, as none of the questions seems to be a good direct indicator of this, or even to be related to potential usage of an app. Of all the variables included, “prefer_fewer_trips_binary” seemed, based on judgement, the most relevant for who might benefit from an app, but this was a complete guess and serves as a shaky proxy at best. In the future survey, we propose a final pair of questions to generate robust data that can serve as y-variables for regression analysis:
- Would you download and use an app that has (x), (y), and (z) functionality? (YES/NO)
- How much would you be willing to pay for each virtual PT session? (Enter amount in $)
The team could then do robust analysis that would generate potential download rates, conversion rates, better segmentation, and even start with financials.
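As an illustration of the kind of analysis these two questions would enable, the sketch below computes a download rate and an average stated price per segment. Everything in it is hypothetical: the field names `segment`, `would_download` and `price_usd` are invented for the example, and the responses are made up, since the current survey contains no such answers.

```python
from collections import defaultdict


def rates_by_segment(responses):
    """Share of 'yes' answers and mean stated price per segment."""
    grouped = defaultdict(list)
    for r in responses:
        grouped[r["segment"]].append(r)
    summary = {}
    for seg, rows in sorted(grouped.items()):
        summary[seg] = {
            "download_rate": sum(r["would_download"] for r in rows) / len(rows),
            "mean_price": sum(r["price_usd"] for r in rows) / len(rows),
        }
    return summary


# Made-up responses to the two proposed questions, for illustration only.
responses = [
    {"segment": 1, "would_download": 1, "price_usd": 10.0},
    {"segment": 1, "would_download": 0, "price_usd": 0.0},
    {"segment": 2, "would_download": 1, "price_usd": 20.0},
    {"segment": 2, "would_download": 1, "price_usd": 30.0},
]
summary = rates_by_segment(responses)
```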
The founding team might benefit greatly from a scientific process of writing down what they expect for the means of the question results, the correlations, the potential segments, and especially the causal relationships that may emerge. We can then test whether the business assumptions hold and thus obtain concrete and meaningful survey results. These results would have real business implications that are catered specifically to the founding team’s perceptions and assumptions about their target market.
The current version of the survey had 17 questions, many of which felt repetitive to the survey-takers we spoke with. The survey was longer than expected, which may cause a bias in that only those very interested in the subject of PT would make it through to the end. INSEAD students are happy to fill out a long survey to help their classmates, but for the general public it should be greatly condensed. In our analysis above, we were able to narrow down the results to 4 major factors, for which 4-8 of the important questions would be sufficient. Specifically, we advise replacing the 17 questions with either the four questions we advised above or up to 6 of the following:
Variable | Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 |
---|---|---|---|---|---|
helpful_to_have_guidance_categorical_1_to_4 | 0.83 | ||||
did_self_exercise_binary | 0.81 | ||||
self_discipline_1_to_5 | 0.76 | ||||
rehab_time_months | 0.84 | ||||
follow_prescriptions_1_to_3 | 0.71 | ||||
prefer_fewer_trips_binary | 0.98 | ||||
trust_internet_advice_1_to_5 | 0.85 | ||||
psych_support_categorical_1_to_3 | -0.75 | ||||
how_easy_1_to_5 | 0.81 |
The current 17 questions are a mix of individually formulated multiple-choice questions that do not share any format. Each of them is separate, uses a different scale and has descriptions for each of the numbers on the scale, which calls into question the robustness and continuous nature of the scale. It is debatable whether to treat them as continuous or categorical. We propose instead a much simpler interface where the respondents can quickly answer the most relevant questions in a logical and visually attractive format.
For the Agree/Disagree variables, use a table with all of the questions as the rows and the level of agreement on a scale of 1-5 as the columns. The user just has to quickly tick the answer to a series of questions seen in sequence.
For the binary variables, use a similar table with all of the questions as the rows and Yes/No as the answer-columns.
For the categorical variables, use a pull-down menu with a list of potential injuries, much like how users choose “profession” or “industry” on job applications. The open-ended answers to the “What injury did you have?” question were extremely difficult to filter, as some respondents gave full sentences or stories.
If the founding team considers those currently in rehab as potential customers, it should reach out to PT clinics and survey their clients. This could be done as a partnership with certain rehab facilities and incorporated as part of the many forms that patients fill out. Additionally, orthopaedic doctors may be interested in referring their patients to useful resources. Finally, those with no rehab experience at all are important as well, as anyone can get injured at any time and require rehab. Good surveying will allow the data scientists to know the source of the respondents and get a complete picture of which populations are most relevant.
Google Forms makes it easy for the organizer to send users along a different path based on their answers to the previous questions, which could be very useful in certain situations. In the first survey run, however, this was a problem. Those who had not had rehab did not answer most of the main questions we were interested in, and instead answered a side set of questions. We ended up ignoring their responses completely. Instead, it would be interesting to let all users respond to a series of statements, saying whether or not they agree, and then at the end say whether or not they would be interested in the app. This way their answers are unbiased and it will be possible to identify certain potential relationships that were otherwise hidden, for example, whether those who have not gone through rehab have a perception that rehab should be easy and can be done on an app. Those who are new to rehab might have a completely different willingness to pay, which would be important to quantify.
To summarize, we have taken a very limited dataset and gained some potentially useful insights thanks to the segmenting techniques learned in class. The founding team should think closely about which customers it wants to learn more about… Is it those that are current patients of PT clinics? Or is it patients who know the long and grueling nature of rehab and need added motivation? Maybe, the best segment will end up being those that are completely new to rehab and unfamiliar with its nuances.
With answers to these questions, our team is ready to work with the founders and develop a robust survey that can serve as a very important boost to this business in its very initial phase.