In this assignment we consider an important question for the movie industry: which key movie characteristics and segments explain what makes a movie a top-rated movie? We define top-rated as being among the top 250 movies on IMDB.
IMDB is an online database of information related to movies. We used the dataset as published on Kaggle (https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset), which contains information about 3,871 movies.
Name | Description | Metric (Yes/No) |
---|---|---|
movie_ID | Unique number | Y |
movie_title | Title | N |
duration | Duration in minutes | Y |
color | Film in color (1) or not (0) | Y |
title_year | Year of release | Y |
country | Country of production | N |
language | Original language of movie | N |
content_rating | MPAA rating | N |
budget_USD | Budget of movie | Y |
gross_USD | Cumulative amount grossed by movie | Y |
net_USD | Difference between gross and budget, i.e. profit | Y |
profitable | Profitable (1) or not (0) | Y |
director_name | Name director | N |
director_facebook? | Has Facebook (1), or not (0) | Y |
director_facebook_likes | Number of likes on Facebook | Y |
actor_1_name | Name actor | N |
actor_1_facebook? | Has Facebook (1), or not (0) | Y |
actor_1_facebook_likes | Number of likes on Facebook | Y |
actor_2_name | Name actor | N |
actor_2_facebook? | Has Facebook (1), or not (0) | Y |
actor_2_facebook_likes | Number of likes on Facebook | Y |
actor_3_name | Name actor | N |
actor_3_facebook? | Has Facebook (1), or not (0) | Y |
actor_3_facebook_likes | Number of likes on Facebook | Y |
cast_total_facebook_likes | Number of likes on Facebook | Y |
movie_facebook? | Has Facebook (1), or not (0) | Y |
movie_facebook_likes | Number of likes on Facebook | Y |
facenumber_in_poster | Number of faces in poster | Y |
num_voted_users | Number of IMDB users that voted | Y |
num_user_for_reviews | Number of IMDB user reviews | Y |
imdb_score | Average IMDB score | Y |
imdb_top_250 | In top 250 (1), or not (0) | Y |
The process we defined is split into two parts:
Part 1: Dimensionality reduction: we use factor analysis to combine groups of the original raw attributes into a smaller number of key, meaningful descriptors, so-called factors.
Part 2: Classification: we use classification analysis (in particular CART trees and logistic regression) to predict whether or not a movie will be in the IMDB top 250.
To reach our goal, we took six steps to make the analysis possible:
Step 1: Load the data: load the IMDB data on 3,781 movies with 28 different variables, including movie length, budget, gross revenues, director, main characters, etc.
Step 2: Select the most relevant variables: confirm the data is metric and decide which variables are used as input to the factor analysis
Step 3: Scale the data
Step 4: Analyse the data: check the data through basic visual exploration, descriptive statistics, and correlations between the variables
Step 5: Interpret the components: choose the number of components and interpret them
Step 6: Save the factor scores
We load the data to use (see the raw .Rmd file to change the data file as needed):
# the name of the file with the data used.
datafile_name = "Dataset/IMDB-database.csv"
# the maximum number of observations to show in the report and slides.
# DEFAULT is 10.
max_data_report = 10
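As a minimal sketch (assuming the CSV is comma-separated; ProjectData is a hypothetical name), the data could then be read into a data frame as follows:
# load the raw data into a data frame (ProjectData is a hypothetical name)
ProjectData = read.csv(datafile_name, stringsAsFactors = FALSE)
# show at most max_data_report observations when printing excerpts
head(ProjectData, max_data_report)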
From the 28 variables we selected the 18 that seemed most relevant and made the most business sense for explaining the IMDB score, and converted them to numeric data:
Name | Description |
---|---|
hastopactor1 | In top 20 (1), or not (0) |
hastopactor2 | In top 20 (1), or not (0) |
hastopactor3 | In top 20 (1), or not (0) |
hastopdirector | In top 20 (1), or not (0) |
movie_duration | Duration in minutes |
movie_budget | Budget of movie |
movie_profitable | Profitable (1) or not (0) |
net_USD | Difference between gross and budget, i.e. profit |
cast_facebook_likes | Number of likes on Facebook |
movieposter_faces | Number of faces in poster |
movie_year | Year of release |
actor1_fb_likes | Number of likes on Facebook |
actor2_fb_likes | Number of likes on Facebook |
actor3_fb_likes | Number of likes on Facebook |
director_fb_likes | Number of likes on Facebook |
movie_fb_likes | Number of likes on Facebook |
number_votes | Number of IMDB users that voted |
number_reviews | Number of IMDB user reviews |
We scaled the data because the variables under consideration have very different magnitudes; standardising lets us present results in relative terms and reduces the influence of outliers.
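A minimal sketch of this standardisation, assuming ProjectDataFactor (a hypothetical name) holds the 18 selected numeric columns:
# standardise every column to mean 0 and standard deviation 1
ProjectDataFactor_scaled = apply(ProjectDataFactor, 2, function(x) (x - mean(x)) / sd(x))
# the base R function scale(ProjectDataFactor) gives an equivalent result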
We analysed the data by inspecting a few observations, the descriptive statistics, and the correlations between the variables.
Looking at the observations, the results make business sense: the third observation has the highest scaled value for the number of votes in the output table below, and it is indeed the movie with the highest absolute number of votes in the database.
# ENTER original raw attributes to use. Use numbers, not column names, e.g.
# c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8
factor_attributes_used = c(2:20)
# ENTER the selection criterion for the factors to use. Choices:
# 'eigenvalue', 'variance', 'manual'
factor_selectionciterion = "manual"
# ENTER the desired minimum variance explained (Only used in case 'variance'
# is the factor selection criterion used).
minimum_variance_explained = 65 # between 1 and 100
# ENTER the number of factors to use (Only used in case 'manual' is the
# factor selection criterion used).
manual_numb_factors_used = 7
# ENTER the rotation to be used (e.g. 'none', 'varimax', 'quartimax',
# 'promax', 'oblimin', 'simplimax', and 'cluster' - see help(principal)).
# Default is 'varimax'
rotation_used = "varimax"
We start with some basic visual exploration of a few observations (scaled values):
Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 | |
---|---|---|---|---|---|---|---|---|---|---|
hastopactor1 | -0.40 | 2.50 | -0.40 | 2.50 | -0.40 | -0.40 | -0.40 | -0.40 | -0.40 | -0.40 |
hastopactor2 | -0.17 | -0.17 | 5.74 | -0.17 | -0.17 | -0.17 | -0.17 | -0.17 | -0.17 | -0.17 |
hastopactor3 | -0.07 | -0.07 | -0.07 | -0.07 | -0.07 | -0.07 | -0.07 | -0.07 | -0.07 | -0.07 |
hastopdirector | -0.29 | -0.29 | -0.29 | 3.42 | -0.29 | -0.29 | -0.29 | -0.29 | -0.29 | 3.42 |
net_USD | 9.50 | -0.08 | 3.44 | 1.21 | 1.24 | -0.42 | 0.98 | -0.16 | 0.34 | -2.02 |
cast_facebook_likes | -0.35 | 1.92 | 4.97 | 1.80 | 0.67 | 0.96 | 0.47 | -0.12 | 0.89 | -0.43 |
movieposter_faces | -0.68 | -0.68 | -0.68 | -0.68 | -0.68 | -0.68 | -0.68 | -0.68 | -0.68 | -0.68 |
movie_year | 0.60 | 0.40 | 0.90 | 0.40 | 1.31 | 0.29 | 1.01 | 1.11 | 0.90 | 0.70 |
actor1_fb_likes | -0.43 | 2.07 | 1.24 | 1.04 | 0.47 | 0.66 | 0.47 | -0.18 | 0.47 | -0.44 |
actor2_fb_likes | -0.24 | 0.66 | 4.61 | 1.97 | 0.44 | 1.75 | 0.22 | -0.23 | 1.75 | -0.25 |
actor3_fb_likes | 0.04 | 0.12 | 11.72 | 1.70 | 0.65 | 0.07 | -0.01 | 0.00 | 0.10 | -0.02 |
movie_fb_likes | 1.10 | -0.43 | 7.19 | -0.43 | 8.72 | -0.43 | 5.05 | 2.59 | 2.17 | 0.36 |
number_votes | 5.14 | 2.41 | 6.84 | 1.83 | 1.76 | 0.89 | 2.92 | 1.64 | 2.28 | 0.70 |
movie_duration | 3.08 | 2.67 | 2.45 | 2.09 | 3.31 | 2.67 | 1.50 | 2.45 | 1.95 | 2.09 |
movie_budget | 4.59 | 6.05 | 4.89 | 5.08 | 4.89 | 3.94 | 4.31 | 4.89 | 4.43 | 3.74 |
movie_profitable | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | -1.06 | 0.94 | 0.94 | 0.94 | -1.06 |
director_fb_likes | -0.26 | -0.08 | 6.89 | -0.26 | -0.26 | -0.26 | -0.26 | -0.26 | -0.11 | -0.26 |
number_reviews | 6.66 | 2.22 | 5.80 | 3.84 | 6.57 | 4.98 | 5.39 | 1.15 | 2.18 | 0.52 |
The data we use here have the following descriptive statistics:
min | 25 percent | median | mean | 75 percent | max | std | |
---|---|---|---|---|---|---|---|
hastopactor1 | -0.40 | -0.40 | -0.40 | 0 | -0.40 | 2.50 | 1 |
hastopactor2 | -0.17 | -0.17 | -0.17 | 0 | -0.17 | 5.74 | 1 |
hastopactor3 | -0.07 | -0.07 | -0.07 | 0 | -0.07 | 13.63 | 1 |
hastopdirector | -0.29 | -0.29 | -0.29 | 0 | -0.29 | 3.42 | 1 |
net_USD | -5.84 | -0.45 | -0.23 | 0 | 0.22 | 9.50 | 1 |
cast_facebook_likes | -0.60 | -0.50 | -0.39 | 0 | 0.25 | 33.69 | 1 |
movieposter_faces | -0.68 | -0.68 | -0.19 | 0 | 0.30 | 20.28 | 1 |
movie_year | -7.74 | -0.42 | 0.19 | 0 | 0.70 | 1.31 | 1 |
actor1_fb_likes | -0.50 | -0.45 | -0.43 | 0 | 0.34 | 40.62 | 1 |
actor2_fb_likes | -0.44 | -0.36 | -0.29 | 0 | -0.23 | 29.69 | 1 |
actor3_fb_likes | -0.41 | -0.30 | -0.18 | 0 | -0.04 | 11.72 | 1 |
movie_fb_likes | -0.43 | -0.43 | -0.42 | 0 | 0.08 | 15.79 | 1 |
number_votes | -0.69 | -0.57 | -0.34 | 0 | 0.15 | 10.43 | 1 |
movie_duration | -3.31 | -0.64 | -0.18 | 0 | 0.45 | 9.75 | 1 |
movie_budget | -0.89 | -0.67 | -0.32 | 0 | 0.26 | 6.05 | 1 |
movie_profitable | -1.06 | -1.06 | 0.94 | 0 | 0.94 | 0.94 | 1 |
director_fb_likes | -0.26 | -0.26 | -0.24 | 0 | -0.19 | 7.21 | 1 |
number_reviews | -0.81 | -0.55 | -0.31 | 0 | 0.15 | 11.57 | 1 |
The correlation matrix shows that some of the variables are strongly correlated. The most obvious pairs are the total cast Facebook likes with the main actor's Facebook likes, and the number of IMDB reviews with the number of IMDB votes.
hastopactor1 | hastopactor2 | hastopactor3 | hastopdirector | net_USD | cast_facebook_likes | movieposter_faces | movie_year | actor1_fb_likes | actor2_fb_likes | actor3_fb_likes | movie_fb_likes | number_votes | movie_duration | movie_budget | movie_profitable | director_fb_likes | number_reviews | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hastopactor1 | 1.00 | 0.14 | 0.02 | 0.14 | 0.06 | 0.26 | -0.01 | -0.01 | 0.24 | 0.20 | 0.11 | 0.06 | 0.19 | 0.15 | 0.18 | 0.03 | 0.15 | 0.13 |
hastopactor2 | 0.14 | 1.00 | 0.12 | 0.04 | 0.06 | 0.33 | 0.03 | 0.03 | 0.23 | 0.43 | 0.26 | 0.11 | 0.17 | 0.09 | 0.11 | 0.02 | 0.10 | 0.11 |
hastopactor3 | 0.02 | 0.12 | 1.00 | 0.01 | 0.11 | 0.17 | 0.04 | 0.04 | 0.06 | 0.22 | 0.46 | 0.13 | 0.19 | 0.05 | 0.12 | 0.04 | 0.08 | 0.13 |
hastopdirector | 0.14 | 0.04 | 0.01 | 1.00 | 0.07 | 0.08 | -0.03 | -0.09 | 0.07 | 0.05 | 0.04 | 0.03 | 0.13 | 0.17 | 0.12 | 0.04 | 0.42 | 0.13 |
net_USD | 0.06 | 0.06 | 0.11 | 0.07 | 1.00 | 0.11 | -0.02 | -0.12 | 0.06 | 0.13 | 0.18 | 0.23 | 0.50 | 0.11 | 0.04 | 0.57 | 0.11 | 0.38 |
cast_facebook_likes | 0.26 | 0.33 | 0.17 | 0.08 | 0.11 | 1.00 | 0.08 | 0.12 | 0.94 | 0.64 | 0.49 | 0.20 | 0.25 | 0.13 | 0.24 | 0.05 | 0.12 | 0.19 |
movieposter_faces | -0.01 | 0.03 | 0.04 | -0.03 | -0.02 | 0.08 | 1.00 | 0.07 | 0.06 | 0.07 | 0.10 | 0.01 | -0.04 | 0.02 | -0.03 | 0.00 | -0.05 | -0.08 |
movie_year | -0.01 | 0.03 | 0.04 | -0.09 | -0.12 | 0.12 | 0.07 | 1.00 | 0.09 | 0.12 | 0.11 | 0.30 | 0.02 | -0.13 | 0.23 | -0.13 | -0.05 | 0.02 |
actor1_fb_likes | 0.24 | 0.23 | 0.06 | 0.07 | 0.06 | 0.94 | 0.06 | 0.09 | 1.00 | 0.39 | 0.25 | 0.13 | 0.18 | 0.09 | 0.17 | 0.03 | 0.09 | 0.13 |
actor2_fb_likes | 0.20 | 0.43 | 0.22 | 0.05 | 0.13 | 0.64 | 0.07 | 0.12 | 0.39 | 1.00 | 0.55 | 0.23 | 0.25 | 0.14 | 0.25 | 0.06 | 0.12 | 0.19 |
actor3_fb_likes | 0.11 | 0.26 | 0.46 | 0.04 | 0.18 | 0.49 | 0.10 | 0.11 | 0.25 | 0.55 | 1.00 | 0.27 | 0.27 | 0.13 | 0.27 | 0.08 | 0.12 | 0.21 |
movie_fb_likes | 0.06 | 0.11 | 0.13 | 0.03 | 0.23 | 0.20 | 0.01 | 0.30 | 0.13 | 0.23 | 0.27 | 1.00 | 0.52 | 0.23 | 0.31 | 0.15 | 0.16 | 0.38 |
number_votes | 0.19 | 0.17 | 0.19 | 0.13 | 0.50 | 0.25 | -0.04 | 0.02 | 0.18 | 0.25 | 0.27 | 0.52 | 1.00 | 0.36 | 0.39 | 0.29 | 0.30 | 0.79 |
movie_duration | 0.15 | 0.09 | 0.05 | 0.17 | 0.11 | 0.13 | 0.02 | -0.13 | 0.09 | 0.14 | 0.13 | 0.23 | 0.36 | 1.00 | 0.29 | 0.02 | 0.19 | 0.37 |
movie_budget | 0.18 | 0.11 | 0.12 | 0.12 | 0.04 | 0.24 | -0.03 | 0.23 | 0.17 | 0.25 | 0.27 | 0.31 | 0.39 | 0.29 | 1.00 | -0.03 | 0.10 | 0.42 |
movie_profitable | 0.03 | 0.02 | 0.04 | 0.04 | 0.57 | 0.05 | 0.00 | -0.13 | 0.03 | 0.06 | 0.08 | 0.15 | 0.29 | 0.02 | -0.03 | 1.00 | 0.06 | 0.22 |
director_fb_likes | 0.15 | 0.10 | 0.08 | 0.42 | 0.11 | 0.12 | -0.05 | -0.05 | 0.09 | 0.12 | 0.12 | 0.16 | 0.30 | 0.19 | 0.10 | 0.06 | 1.00 | 0.22 |
number_reviews | 0.13 | 0.11 | 0.13 | 0.13 | 0.38 | 0.19 | -0.08 | 0.02 | 0.13 | 0.19 | 0.21 | 0.38 | 0.79 | 0.37 | 0.42 | 0.22 | 0.22 | 1.00 |
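A minimal sketch of how these descriptive statistics and correlations could be computed (using the hypothetical ProjectDataFactor_scaled from above):
# descriptive statistics of the scaled variables
summary(ProjectDataFactor_scaled)
# correlation matrix, rounded to two decimals
round(cor(ProjectDataFactor_scaled), 2)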
We visualised the variance explained as well as the eigenvalues:
Eigenvalue | Pct of explained variance | Cumulative pct of explained variance | |
---|---|---|---|
Component 1 | 4.22 | 23.45 | 23.45 |
Component 2 | 2.19 | 12.15 | 35.60 |
Component 3 | 1.49 | 8.28 | 43.88 |
Component 4 | 1.48 | 8.20 | 52.08 |
Component 5 | 1.19 | 6.61 | 58.69 |
Component 6 | 1.01 | 5.63 | 64.32 |
Component 7 | 0.99 | 5.48 | 69.80 |
Component 8 | 0.87 | 4.85 | 74.66 |
Component 9 | 0.83 | 4.62 | 79.28 |
Component 10 | 0.68 | 3.78 | 83.06 |
Component 11 | 0.63 | 3.49 | 86.55 |
Component 12 | 0.55 | 3.06 | 89.61 |
Component 13 | 0.50 | 2.80 | 92.41 |
Component 14 | 0.47 | 2.59 | 95.00 |
Component 15 | 0.38 | 2.12 | 97.12 |
Component 16 | 0.35 | 1.94 | 99.06 |
Component 17 | 0.17 | 0.93 | 99.99 |
Component 18 | 0.00 | 0.01 | 100.00 |
To choose the number of factors, we built a scree plot and analysed the cumulative explained variance. We decided to use seven components, which together explain roughly 70% of the total variance.
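A minimal sketch of how the eigenvalues and the scree plot could be obtained with the psych package (the exact call in the underlying tool may differ; ProjectDataFactor_scaled is the hypothetical scaled data from above):
library(psych)
# unrotated principal component solution, used to inspect the eigenvalues
unrotated_results = principal(ProjectDataFactor_scaled, nfactors = ncol(ProjectDataFactor_scaled), rotate = "none")
eigenvalues = unrotated_results$values
# eigenvalues, percentage and cumulative percentage of variance explained
variance_table = cbind(Eigenvalue = eigenvalues,
    Pct_variance = 100 * eigenvalues / sum(eigenvalues),
    Cumulative_pct = 100 * cumsum(eigenvalues) / sum(eigenvalues))
round(variance_table, 2)
# scree plot
plot(eigenvalues, type = "b", xlab = "Component", ylab = "Eigenvalue")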
In this step we interpret the factors by looking at which variables aggregate into each component. As described above, we use the 7 components that together explain roughly 70% of the total variance. The results make business sense:
Component 1: positively relates the movie's Facebook likes with the main and secondary actors' Facebook likes
Component 2: positively combines the number of IMDB movie reviews with the number of IMDB movie votes
Component 3: negatively relates the movie's budget with its net profit
Components 4 and 5: positively combine having a top actor or director with their Facebook likes
Component 6: positively combines the year of the movie with the number of movie Facebook likes
Component 7: reflects the number of faces in the movie poster
To better visualize them, we use what is called a “rotation”. There are many rotation methods; here we selected the varimax rotation. For our data, the 7 selected factors look as follows after this rotation:
Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 | |
---|---|---|---|---|---|---|---|
cast_facebook_likes | 0.93 | 0.07 | 0.07 | 0.23 | 0.02 | 0.13 | 0.06 |
actor1_fb_likes | 0.90 | 0.01 | 0.06 | -0.01 | 0.02 | 0.11 | 0.04 |
actor2_fb_likes | 0.60 | 0.13 | 0.04 | 0.51 | 0.00 | 0.08 | 0.03 |
hastopactor1 | 0.45 | 0.24 | -0.04 | -0.10 | 0.18 | -0.13 | -0.05 |
hastopactor2 | 0.40 | 0.08 | -0.01 | 0.39 | 0.01 | -0.10 | -0.04 |
actor3_fb_likes | 0.29 | 0.14 | 0.08 | 0.76 | 0.02 | 0.13 | 0.09 |
movie_budget | 0.16 | 0.65 | -0.16 | 0.13 | 0.02 | 0.30 | -0.07 |
number_votes | 0.12 | 0.71 | 0.47 | 0.15 | 0.15 | 0.15 | -0.07 |
movie_duration | 0.08 | 0.74 | -0.08 | 0.03 | 0.12 | -0.30 | 0.17 |
hastopdirector | 0.07 | 0.07 | 0.00 | -0.03 | 0.84 | -0.06 | 0.02 |
number_reviews | 0.07 | 0.75 | 0.35 | 0.08 | 0.09 | 0.10 | -0.13 |
movie_year | 0.06 | -0.02 | -0.16 | 0.02 | -0.06 | 0.87 | 0.04 |
movie_fb_likes | 0.06 | 0.46 | 0.25 | 0.14 | 0.06 | 0.57 | 0.04 |
director_fb_likes | 0.06 | 0.13 | 0.08 | 0.10 | 0.82 | 0.04 | -0.05 |
net_USD | 0.03 | 0.18 | 0.85 | 0.10 | 0.04 | -0.04 | -0.02 |
movieposter_faces | 0.03 | -0.03 | 0.00 | 0.05 | -0.03 | 0.05 | 0.97 |
movie_profitable | 0.02 | -0.02 | 0.85 | -0.01 | 0.01 | -0.05 | 0.03 |
hastopactor3 | -0.08 | 0.04 | 0.04 | 0.81 | 0.04 | 0.02 | -0.01 |
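A minimal sketch of how these rotated loadings could be computed, reusing the parameters set above (manual_numb_factors_used and rotation_used):
# factor solution with 7 components and varimax rotation
rotated_results = principal(ProjectDataFactor_scaled, nfactors = manual_numb_factors_used, rotate = rotation_used)
# factor loadings, rounded to two decimals
round(rotated_results$loadings[, 1:manual_numb_factors_used], 2)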
To better visualize and interpret the factors we often “suppress” loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers:
Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 | |
---|---|---|---|---|---|---|---|
cast_facebook_likes | 0.93 |  |  |  |  |  |  |
actor1_fb_likes | 0.90 |  |  |  |  |  |  |
actor2_fb_likes | 0.60 |  |  | 0.51 |  |  |  |
hastopactor1 |  |  |  |  |  |  |  |
hastopactor2 |  |  |  |  |  |  |  |
actor3_fb_likes |  |  |  | 0.76 |  |  |  |
movie_budget |  | 0.65 |  |  |  |  |  |
number_votes |  | 0.71 |  |  |  |  |  |
movie_duration |  | 0.74 |  |  |  |  |  |
hastopdirector |  |  |  |  | 0.84 |  |  |
number_reviews |  | 0.75 |  |  |  |  |  |
movie_year |  |  |  |  |  | 0.87 |  |
movie_fb_likes |  |  |  |  |  | 0.57 |  |
director_fb_likes |  |  |  |  | 0.82 |  |  |
net_USD |  |  | 0.85 |  |  |  |  |
movieposter_faces |  |  |  |  |  |  | 0.97 |
movie_profitable |  |  | 0.85 |  |  |  |  |
hastopactor3 |  |  |  | 0.81 |  |  |  |
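A minimal sketch of the suppression step (threshold 0.5, as above; rotated_results from the sketch earlier):
# blank out loadings with absolute value below 0.5 for readability
suppressed_loadings = round(rotated_results$loadings[, 1:manual_numb_factors_used], 2)
suppressed_loadings[abs(suppressed_loadings) < 0.5] = NA
suppressed_loadings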
We can now either replace all initial variables used in this part with the factor scores, or select one of the initial variables for each selected factor to represent that factor. Here is how the factor scores look for the first few observations:
Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 | |
---|---|---|---|---|---|---|---|---|---|---|
DV (Factor) 1 | -1.12 | 2.12 | 2.38 | 1.77 | -0.53 | 0.63 | -0.34 | -0.57 | 0.43 | -0.62 |
DV (Factor) 2 | 5.85 | 4.21 | 5.03 | 3.63 | 6.24 | 4.21 | 4.80 | 3.59 | 3.44 | 2.47 |
DV (Factor) 3 | 4.83 | -0.79 | 2.43 | -0.06 | 1.01 | -1.45 | 1.13 | -0.47 | 0.11 | -2.54 |
DV (Factor) 4 | -0.65 | -1.11 | 6.48 | -0.07 | -0.59 | -0.08 | -0.72 | -0.43 | -0.11 | -0.24 |
DV (Factor) 5 | -1.43 | -0.89 | 2.46 | 1.22 | -1.16 | -1.17 | -1.01 | -0.81 | -0.88 | 1.64 |
DV (Factor) 6 | 0.41 | -0.10 | 3.32 | -0.15 | 3.98 | -0.25 | 2.91 | 1.72 | 1.54 | 0.57 |
DV (Factor) 7 | -0.91 | -0.77 | -0.92 | -0.71 | -0.37 | -0.83 | -0.85 | -0.37 | -0.61 | -0.29 |
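A minimal sketch of how these factor scores could be extracted and saved (rotated_results from the sketch above; psych::principal computes the scores when given the raw data):
# the factor scores of every movie on the 7 selected factors
factor_scores = round(rotated_results$scores, 2)
colnames(factor_scores) = paste("DV (Factor)", 1:manual_numb_factors_used)
head(factor_scores, max_data_report)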
In order to predict whether a movie will be in the top 250 IMDB movies or not based on the 7 factors we found above, we use classification methods: in particular the CART tree methodology and the logistic regression method.
To start off, we set the variables for CART, and the profit matrix used to evaluate our predictions. As mentioned, the dependent variable is a 0/1 vector indicating whether the movie is in the top 250 or not; the independent variables are the seven factors. For the profit matrix, given the low percentage of top 250 movies (250/4000), we assume that correctly predicting a top 250 movie gives +50 and wrongly predicting that a movie will be a top 250 movie gives -1.
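A minimal sketch of the corresponding profit matrix (rows are the actual class, columns the predicted class; the +50 and -1 values are the assumptions stated above):
# correctly flagging a top 250 movie earns +50; a false positive costs 1; everything else is 0
Profit_Matrix = matrix(c(0, 0, -1, 50), ncol = 2)
rownames(Profit_Matrix) = c("Actual 0", "Actual 1")
colnames(Profit_Matrix) = c("Predict 0", "Predict 1")
Profit_Matrix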
We then determine the parameters for the CART analysis; in particular we set the probability threshold to 65%. We originally used 70%, but some of the segments were at exactly 70%, so we lowered the threshold to capture those.
We also determine how much data we use to estimate the parameters, and how much data we use to assess how correct our model is (we use 80% of the data for estimation).
Finally, we determine the level of complexity of the tree by choosing the complexity parameter; having tried a few options, 0.005 gave a tree that is relatively simple but still has reasonable decision nodes.
Probability_Threshold = 65 # between 1 and 99%; classify as top 250 if the predicted probability exceeds this
estimation_data_percent = 80 # percentage of the data used for estimation
validation_data_percent = 10 # percentage of the data used for validation (the remainder is test data)
random_sampling = 0 # randomly sample the data when splitting (1) or not (0)
CART_cp = 0.005 # complexity parameter of the CART tree
min_segment = 100 # minimum number of observations in a tree segment
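A minimal sketch of how the tree could be estimated with rpart using these parameters (formula_cart and estimation_data are hypothetical names; the report's underlying code may differ):
library(rpart)
library(rpart.plot)
# formula_cart: e.g. imdb_top_250 ~ the 7 factor scores (hypothetical formula)
CART_tree = rpart(formula_cart, data = estimation_data, method = "class",
    control = rpart.control(cp = CART_cp, minbucket = min_segment))
# plot the estimated classification tree
rpart.plot(CART_tree)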
The resulting classification tree is below:
As we can see from the graph, there are 8 different terminal nodes; of these, only 3 classify a movie as being in the top 250. To get a feeling for whether the classification makes sense, we look at two example decisions.
Example decision to classify as a 0: one decision looks at whether Factor 2 < 1.1. Factor 2 corresponds to the number of IMDB reviews and votes for the movie. As movies with many IMDB reviews and votes tend to be popular, a low score on this factor indicates, with fairly high probability, that the movie is not in the top 250.
Example decision to classify as a 1: another decision looks at whether Factor 2 > 1.1, Factor 3 > 0.88 and Factor 6 > 1.8. Essentially, if the movie scores high on these factors (IMDB engagement, profitability, and movie Facebook likes), it is likely to be a top 250 movie, which is intuitive.
We then run a logistic regression to obtain an alternative classification of the movies:
# build the formula "dependent_variable ~ factor1 + factor2 + ..." from the column names
formula_log = paste(colnames(estimation_data[, dependent_variable, drop = F]),
    paste(Reduce(paste, sapply(head(independent_variables, -1), function(i) paste(colnames(estimation_data)[i], "+", sep = ""))),
        colnames(estimation_data)[tail(independent_variables, 1)], sep = ""),
    sep = "~")
# estimate the logistic regression on the estimation sample
logreg_solution <- glm(formula_log, family = binomial(link = "logit"), data = estimation_data)
# round the estimated coefficients and print them
log_coefficients = round(summary(logreg_solution)$coefficients, 1)
iprint.df(log_coefficients)
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -4.1 | 0.2 | -23.8 | 0.0 |
DV..Factor..1 | -0.3 | 0.1 | -2.1 | 0.0 |
DV..Factor..2 | 1.0 | 0.1 | 12.4 | 0.0 |
DV..Factor..3 | 0.8 | 0.1 | 8.8 | 0.0 |
DV..Factor..4 | 0.3 | 0.1 | 3.2 | 0.0 |
DV..Factor..5 | 0.3 | 0.1 | 5.0 | 0.0 |
DV..Factor..6 | -0.2 | 0.1 | -3.1 | 0.0 |
DV..Factor..7 | 0.0 | 0.2 | 0.1 | 0.9 |
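As an aside, the formula construction in the code above is somewhat convoluted; assuming the column names are syntactically valid, base R's reformulate would build an equivalent formula:
# build "dependent ~ factor1 + factor2 + ..." directly from the column names
formula_log = reformulate(termlabels = colnames(estimation_data)[independent_variables],
    response = colnames(estimation_data)[dependent_variable])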
From the logistic regression we can determine which factors are most significant and whether some are not significant: in our case, factor 7 is not statistically significant. This is in line with expectations, as factor 7 is the number of faces in the movie poster, which we did not expect to have a significant impact.
We then compare the two methods by checking how much “weight” each puts on the different factors: the table below shows the importance assigned to each factor in the classification (as a percentage of the importance of the largest factor).
As we can see, the results of the two methodologies are in line, and the most predictive factors are factor 2, factor 3 and factor 5, respectively the number of IMDB reviews and votes, the profitability of the movie, and the director's popularity (top director and director Facebook likes). This is in line with expectations. Factor 7 is the least important with both methodologies.
CART 1 | Logistic Regr. | |
---|---|---|
DV..Factor..1 | -0.4075435 | -0.169354839 |
DV..Factor..2 | 1.0000000 | 1.000000000 |
DV..Factor..3 | 0.9024807 | 0.709677419 |
DV..Factor..4 | 0.1830532 | 0.258064516 |
DV..Factor..5 | 0.3845682 | 0.403225806 |
DV..Factor..6 | -0.4371961 | -0.250000000 |
DV..Factor..7 | 0.1533779 | 0.008064516 |
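A minimal sketch of how the logistic regression column of this table could be reproduced, normalising each (unrounded) coefficient by the largest absolute coefficient with the intercept excluded (the CART column comes from the tree's variable importance and is not reproduced here):
# relative importance of each factor in the logistic regression
log_estimates = summary(logreg_solution)$coefficients[-1, "Estimate"]
round(log_estimates / max(abs(log_estimates)), 3)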
Finally, if we were to use the estimated classification models on the test data, we would get the following profit curves (using the profit parameters set earlier):
The profit curve using the small classification tree:
This result is not reasonable; this may be because the tree only produces a rough probability estimate.
The profit curve using the logistic regression classifier:
The logistic profit curve indicates that our estimation is working: we are able to capture pretty much all the potential profit while targeting only the lowest percentiles.
These are the maximum total profits achieved in the test data using the two classifiers (without any segment-specific analysis so far):
Percentile | Profit | |
---|---|---|
Small Tree | 6.42 | 231 |
Logistic Regression | 12.83 | 411 |
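A minimal sketch of how such a profit curve and maximum profit could be computed on the test data (test_actual and test_probabilities are hypothetical names for the actual 0/1 class and the classifier's predicted probabilities):
# order the test movies from highest to lowest predicted probability of being top 250
ordering = order(test_probabilities, decreasing = TRUE)
# profit of targeting each movie: +50 if it is actually top 250, -1 otherwise
profit_per_movie = ifelse(test_actual[ordering] == 1, 50, -1)
cumulative_profit = cumsum(profit_per_movie)
percentile = 100 * seq_along(profit_per_movie) / length(profit_per_movie)
plot(percentile, cumulative_profit, type = "l", xlab = "Percentile targeted", ylab = "Cumulative profit")
# maximum profit and the percentile at which it is reached
c(max(cumulative_profit), percentile[which.max(cumulative_profit)])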
There is ample room to extend and improve this process:
- We could have chosen a different number of factors (e.g. six), given that the last factor does not seem to be statistically significant in the regression.
- We could add more variables to target more specifically, such as country of production (allowing us to target specific countries rather than taking only a global approach) or film genre (allowing us to target types of movies rather than taking only a generic approach).
- We could refine the prediction target. The process currently predicts whether a movie will make it into the top 250 or not; this could be changed to, e.g., the top 100 or the top 1000. Perhaps the top 250 is too small a target (only about 6% of the movies are a 1). Alternatively, we could have tried to predict the actual IMDB score instead of just whether the movie is a hit.
- With a small alteration, the process could also answer another question: finding the key drivers of, and the link between, movie budget and IMDB success.