The Business Question

In this assignment we consider an important question for the movie industry: which key movie characteristics and segments can explain what makes a movie a top-rated movie? We define top-rated as being among the top 250 movies of IMDB.

The Data

IMDB is an online database of information related to movies; our extract contains information about 3,871 movies. We used the dataset as published on Kaggle (https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset).

Dataset variables

Name Description Metric (Yes/No)
movie_ID Unique number Y
movie_title Title N
duration Duration in minutes Y
color Film in color (1) or not (0) Y
title_year Year of release Y
country Country of production N
language Original language of movie N
content_rating MPAA rating N
budget_USD Budget of movie Y
gross_USD Cumulative amount grossed by movie Y
net_USD Difference between gross and budget, i.e. profit Y
profitable Profitable (1) or not (0) Y
director_name Name director N
director_facebook? Has Facebook (1), or not (0) Y
director_facebook_likes Number of likes on Facebook Y
actor_1_name Name actor N
actor_1_facebook? Has Facebook (1), or not (0) Y
actor_1_facebook_likes Number of likes on Facebook Y
actor_2_name Name actor N
actor_2_facebook? Has Facebook (1), or not (0) Y
actor_2_facebook_likes Number of likes on Facebook Y
actor_3_name Name actor N
actor_3_facebook? Has Facebook (1), or not (0) Y
actor_3_facebook_likes Number of likes on Facebook Y
cast_total_facebook_likes Total Facebook likes of the cast Y
movie_facebook? Has Facebook (1), or not (0) Y
movie_facebook_likes Number of likes on Facebook Y
facenumber_in_poster Number of faces in poster Y
num_voted_users Number of IMDB users that voted Y
num_user_for_reviews Number of IMDB user reviews Y
imdb_score Average IMDB score Y
imdb_top_250 In top 250 (1), or not (0) Y

The Process

The process we defined is split into two parts:

  1. Part 1: Dimensionality reduction: We will use factor analysis to combine groups of the original raw attributes into a smaller number of key, meaningful descriptors, so-called factors.

  2. Part 2: Classification: We will use classification analysis (in particular the CART tree and logistic regression methods) to predict whether or not a movie will be in the top 250 IMDB movies.


Part 1: Dimensionality reduction

To reach our goal, we took six steps to make the analysis possible:

  • Step 1: Load the data: load the IMDB data on 3,871 movies with 28 different variables including movie length, budget, gross revenues, director, main actors, etc.

  • Step 2: Select most relevant factors: confirm that the data used is metric and decide which attributes to keep

  • Step 3: Scale the data

  • Step 4: Analyse the data: check the data through basic visual exploration, descriptive statistics, and correlations of the variables

  • Step 5: Interpret the components: choose and interpret the number of components

  • Step 6: Save the factor scores

Step 1: Load the data

We load the data to use (see the raw .Rmd file to change the data file as needed):

# the name of the file with the data used.
datafile_name = "Dataset/IMDB-database.csv"

# the maximum number of observations to show in the report and slides.
# DEFAULT is 10.
max_data_report = 10
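
For reference, a minimal sketch of how the file would be read (the data frame name ProjectData is our choice, not part of the original code):

# Read the IMDB extract; we expect 3,871 rows and 28 columns
ProjectData <- read.csv(datafile_name)
dim(ProjectData)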

Step 2: Select most relevant factors

From the 28 variables we selected the 18 that seemed most relevant and made most business sense for explaining the IMDB score, and converted them to numeric data:

Name Description
hastopactor1 In top 20 (1), or not (0)
hastopactor2 In top 20 (1), or not (0)
hastopactor3 In top 20 (1), or not (0)
hastopdirector In top 20 (1), or not (0)
movie_duration Duration in minutes
movie_budget Budget of movie
movie_profitable Profitable (1) or not (0)
net_USD Difference between gross and budget, i.e. profit
cast_facebook_likes Total Facebook likes of the cast
movieposter_faces Number of faces in poster
movie_year Year of release
actor1_fb_likes Number of likes on Facebook
actor2_fb_likes Number of likes on Facebook
actor3_fb_likes Number of likes on Facebook
director_fb_likes Number of likes on Facebook
movie_fb_likes Number of likes on Facebook
number_votes Number of IMDB users that voted
number_reviews Number of users IMDB used for review
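
As an illustration, here is a hedged sketch of how some of these derived attributes could be constructed; top20_actors and top20_directors are hypothetical names for the top-20 lists, however those were defined:

# Hypothetical sketch of deriving the new attributes from the raw columns
ProjectData$hastopactor1     <- as.numeric(ProjectData$actor_1_name %in% top20_actors)
ProjectData$hastopdirector   <- as.numeric(ProjectData$director_name %in% top20_directors)
ProjectData$net_USD          <- ProjectData$gross_USD - ProjectData$budget_USD
ProjectData$movie_profitable <- as.numeric(ProjectData$net_USD > 0)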

Step 3: Scale the data

We scaled the data because of the very different magnitudes of the variables under consideration; standardizing each variable to mean 0 and standard deviation 1 allows us to present results in relative terms and to compare variables on a common scale.
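
A minimal sketch of the scaling step, assuming ProjectData_factors holds the 18 selected columns:

# Standardize every attribute to mean 0 and standard deviation 1
ProjectData_scaled <- scale(ProjectData_factors)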

Step 4: Analyse data

We analysed the data by inspecting a sample of observations, the descriptive statistics, and the correlations between the variables.

Looking at the sample of observations, we could confirm that the results make business sense: for example, the third observation has the highest standardized number of votes in the output table below, and it is indeed the movie with the highest absolute number of votes in the database.

# ENTER original raw attributes to use.  Use numbers, not column names, e.g.
# c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8
factor_attributes_used = c(2:20)
# ENTER the selection criterion for the factors to use.  Choices:
# 'eigenvalue', 'variance', 'manual'
factor_selectionciterion = "manual"

# ENTER the desired minimum variance explained (only used in case 'variance'
# is the factor selection criterion used).
minimum_variance_explained = 65  # between 1 and 100

# ENTER the number of factors to use (only used in case 'manual' is the
# factor selection criterion used).
manual_numb_factors_used = 7

# ENTER the rotation eventually used (e.g. 'none', 'varimax', 'quartimax',
# 'promax', 'oblimin', 'simplimax', and 'cluster' - see help(principal)).
# Default is 'varimax'
rotation_used = "varimax"

We start with some basic visual exploration of a few observations:

Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
hastopactor1 -0.40 2.50 -0.40 2.50 -0.40 -0.40 -0.40 -0.40 -0.40 -0.40
hastopactor2 -0.17 -0.17 5.74 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17 -0.17
hastopactor3 -0.07 -0.07 -0.07 -0.07 -0.07 -0.07 -0.07 -0.07 -0.07 -0.07
hastopdirector -0.29 -0.29 -0.29 3.42 -0.29 -0.29 -0.29 -0.29 -0.29 3.42
net_USD 9.50 -0.08 3.44 1.21 1.24 -0.42 0.98 -0.16 0.34 -2.02
cast_facebook_likes -0.35 1.92 4.97 1.80 0.67 0.96 0.47 -0.12 0.89 -0.43
movieposter_faces -0.68 -0.68 -0.68 -0.68 -0.68 -0.68 -0.68 -0.68 -0.68 -0.68
movie_year 0.60 0.40 0.90 0.40 1.31 0.29 1.01 1.11 0.90 0.70
actor1_fb_likes -0.43 2.07 1.24 1.04 0.47 0.66 0.47 -0.18 0.47 -0.44
actor2_fb_likes -0.24 0.66 4.61 1.97 0.44 1.75 0.22 -0.23 1.75 -0.25
actor3_fb_likes 0.04 0.12 11.72 1.70 0.65 0.07 -0.01 0.00 0.10 -0.02
movie_fb_likes 1.10 -0.43 7.19 -0.43 8.72 -0.43 5.05 2.59 2.17 0.36
number_votes 5.14 2.41 6.84 1.83 1.76 0.89 2.92 1.64 2.28 0.70
movie_duration 3.08 2.67 2.45 2.09 3.31 2.67 1.50 2.45 1.95 2.09
movie_budget 4.59 6.05 4.89 5.08 4.89 3.94 4.31 4.89 4.43 3.74
movie_profitable 0.94 0.94 0.94 0.94 0.94 -1.06 0.94 0.94 0.94 -1.06
director_fb_likes -0.26 -0.08 6.89 -0.26 -0.26 -0.26 -0.26 -0.26 -0.11 -0.26
number_reviews 6.66 2.22 5.80 3.84 6.57 4.98 5.39 1.15 2.18 0.52

The data we use here have the following descriptive statistics:

min 25 percent median mean 75 percent max std
hastopactor1 -0.40 -0.40 -0.40 0 -0.40 2.50 1
hastopactor2 -0.17 -0.17 -0.17 0 -0.17 5.74 1
hastopactor3 -0.07 -0.07 -0.07 0 -0.07 13.63 1
hastopdirector -0.29 -0.29 -0.29 0 -0.29 3.42 1
net_USD -5.84 -0.45 -0.23 0 0.22 9.50 1
cast_facebook_likes -0.60 -0.50 -0.39 0 0.25 33.69 1
movieposter_faces -0.68 -0.68 -0.19 0 0.30 20.28 1
movie_year -7.74 -0.42 0.19 0 0.70 1.31 1
actor1_fb_likes -0.50 -0.45 -0.43 0 0.34 40.62 1
actor2_fb_likes -0.44 -0.36 -0.29 0 -0.23 29.69 1
actor3_fb_likes -0.41 -0.30 -0.18 0 -0.04 11.72 1
movie_fb_likes -0.43 -0.43 -0.42 0 0.08 15.79 1
number_votes -0.69 -0.57 -0.34 0 0.15 10.43 1
movie_duration -3.31 -0.64 -0.18 0 0.45 9.75 1
movie_budget -0.89 -0.67 -0.32 0 0.26 6.05 1
movie_profitable -1.06 -1.06 0.94 0 0.94 0.94 1
director_fb_likes -0.26 -0.26 -0.24 0 -0.19 7.21 1
number_reviews -0.81 -0.55 -0.31 0 0.15 11.57 1

In the correlation table below we see that some of the variables are strongly correlated. The most obvious pairs are the total cast Facebook likes with the lead actor's Facebook likes (0.94), and the number of IMDB reviews with the number of IMDB votes (0.79).

hastopactor1 hastopactor2 hastopactor3 hastopdirector net_USD cast_facebook_likes movieposter_faces movie_year actor1_fb_likes actor2_fb_likes actor3_fb_likes movie_fb_likes number_votes movie_duration movie_budget movie_profitable director_fb_likes number_reviews
hastopactor1 1.00 0.14 0.02 0.14 0.06 0.26 -0.01 -0.01 0.24 0.20 0.11 0.06 0.19 0.15 0.18 0.03 0.15 0.13
hastopactor2 0.14 1.00 0.12 0.04 0.06 0.33 0.03 0.03 0.23 0.43 0.26 0.11 0.17 0.09 0.11 0.02 0.10 0.11
hastopactor3 0.02 0.12 1.00 0.01 0.11 0.17 0.04 0.04 0.06 0.22 0.46 0.13 0.19 0.05 0.12 0.04 0.08 0.13
hastopdirector 0.14 0.04 0.01 1.00 0.07 0.08 -0.03 -0.09 0.07 0.05 0.04 0.03 0.13 0.17 0.12 0.04 0.42 0.13
net_USD 0.06 0.06 0.11 0.07 1.00 0.11 -0.02 -0.12 0.06 0.13 0.18 0.23 0.50 0.11 0.04 0.57 0.11 0.38
cast_facebook_likes 0.26 0.33 0.17 0.08 0.11 1.00 0.08 0.12 0.94 0.64 0.49 0.20 0.25 0.13 0.24 0.05 0.12 0.19
movieposter_faces -0.01 0.03 0.04 -0.03 -0.02 0.08 1.00 0.07 0.06 0.07 0.10 0.01 -0.04 0.02 -0.03 0.00 -0.05 -0.08
movie_year -0.01 0.03 0.04 -0.09 -0.12 0.12 0.07 1.00 0.09 0.12 0.11 0.30 0.02 -0.13 0.23 -0.13 -0.05 0.02
actor1_fb_likes 0.24 0.23 0.06 0.07 0.06 0.94 0.06 0.09 1.00 0.39 0.25 0.13 0.18 0.09 0.17 0.03 0.09 0.13
actor2_fb_likes 0.20 0.43 0.22 0.05 0.13 0.64 0.07 0.12 0.39 1.00 0.55 0.23 0.25 0.14 0.25 0.06 0.12 0.19
actor3_fb_likes 0.11 0.26 0.46 0.04 0.18 0.49 0.10 0.11 0.25 0.55 1.00 0.27 0.27 0.13 0.27 0.08 0.12 0.21
movie_fb_likes 0.06 0.11 0.13 0.03 0.23 0.20 0.01 0.30 0.13 0.23 0.27 1.00 0.52 0.23 0.31 0.15 0.16 0.38
number_votes 0.19 0.17 0.19 0.13 0.50 0.25 -0.04 0.02 0.18 0.25 0.27 0.52 1.00 0.36 0.39 0.29 0.30 0.79
movie_duration 0.15 0.09 0.05 0.17 0.11 0.13 0.02 -0.13 0.09 0.14 0.13 0.23 0.36 1.00 0.29 0.02 0.19 0.37
movie_budget 0.18 0.11 0.12 0.12 0.04 0.24 -0.03 0.23 0.17 0.25 0.27 0.31 0.39 0.29 1.00 -0.03 0.10 0.42
movie_profitable 0.03 0.02 0.04 0.04 0.57 0.05 0.00 -0.13 0.03 0.06 0.08 0.15 0.29 0.02 -0.03 1.00 0.06 0.22
director_fb_likes 0.15 0.10 0.08 0.42 0.11 0.12 -0.05 -0.05 0.09 0.12 0.12 0.16 0.30 0.19 0.10 0.06 1.00 0.22
number_reviews 0.13 0.11 0.13 0.13 0.38 0.19 -0.08 0.02 0.13 0.19 0.21 0.38 0.79 0.37 0.42 0.22 0.22 1.00

We visualised the variance explained as well as the eigenvalues:

Eigenvalue Pct of explained variance Cumulative pct of explained variance
Component 1 4.22 23.45 23.45
Component 2 2.19 12.15 35.60
Component 3 1.49 8.28 43.88
Component 4 1.48 8.20 52.08
Component 5 1.19 6.61 58.69
Component 6 1.01 5.63 64.32
Component 7 0.99 5.48 69.80
Component 8 0.87 4.85 74.66
Component 9 0.83 4.62 79.28
Component 10 0.68 3.78 83.06
Component 11 0.63 3.49 86.55
Component 12 0.55 3.06 89.61
Component 13 0.50 2.80 92.41
Component 14 0.47 2.59 95.00
Component 15 0.38 2.12 97.12
Component 16 0.35 1.94 99.06
Component 17 0.17 0.93 99.99
Component 18 0.00 0.01 100.00

To choose the number of factors, we built a scree plot and analysed the cumulative explained variance of the factors. For this factor analysis we decided to use seven components, which together explain about 70% of the total variance, a reasonably high share.
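
A hedged sketch of how the eigenvalue table and the scree plot can be produced with the psych package (continuing the sketches above):

# Unrotated principal component solution; $values holds the eigenvalues
library(psych)
unrotated <- principal(ProjectData_scaled, nfactors = ncol(ProjectData_scaled),
                       rotate = "none", scores = FALSE)
round(unrotated$values, 2)                                        # eigenvalues
round(100 * cumsum(unrotated$values) / sum(unrotated$values), 2)  # cumulative % of variance
plot(unrotated$values, type = "b", ylab = "Eigenvalue")           # scree plot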

Step 5: Interpret the factors

In this step, we interpreted the factors by looking at the variables aggregated in each component. As described above, we use 7 components that together explain about 70% of total variance. Our business sense told us the results were logical:

  • Component 1: positively relates the total cast Facebook likes with the lead and secondary actors' Facebook likes

  • Component 2: positively combines the number of IMDB movie reviews and votes with the movie's duration and budget

  • Component 3: combines net profit and profitability, with a mild negative relation to budget

  • Components 4 and 5: positively combine, respectively, having a top third actor with the supporting actors' Facebook likes (Component 4), and having a top director with the director's Facebook likes (Component 5)

  • Component 6: positively combines the year of the movie with the number of movie Facebook likes

  • Component 7: reflects the number of faces in the movie's poster

To better visualize them, we use what is called a “rotation”. There are many rotation methods; in this case we selected the varimax rotation. For our data, the 7 selected factors look as follows after this rotation:

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
cast_facebook_likes 0.93 0.07 0.07 0.23 0.02 0.13 0.06
actor1_fb_likes 0.90 0.01 0.06 -0.01 0.02 0.11 0.04
actor2_fb_likes 0.60 0.13 0.04 0.51 0.00 0.08 0.03
hastopactor1 0.45 0.24 -0.04 -0.10 0.18 -0.13 -0.05
hastopactor2 0.40 0.08 -0.01 0.39 0.01 -0.10 -0.04
actor3_fb_likes 0.29 0.14 0.08 0.76 0.02 0.13 0.09
movie_budget 0.16 0.65 -0.16 0.13 0.02 0.30 -0.07
number_votes 0.12 0.71 0.47 0.15 0.15 0.15 -0.07
movie_duration 0.08 0.74 -0.08 0.03 0.12 -0.30 0.17
hastopdirector 0.07 0.07 0.00 -0.03 0.84 -0.06 0.02
number_reviews 0.07 0.75 0.35 0.08 0.09 0.10 -0.13
movie_year 0.06 -0.02 -0.16 0.02 -0.06 0.87 0.04
movie_fb_likes 0.06 0.46 0.25 0.14 0.06 0.57 0.04
director_fb_likes 0.06 0.13 0.08 0.10 0.82 0.04 -0.05
net_USD 0.03 0.18 0.85 0.10 0.04 -0.04 -0.02
movieposter_faces 0.03 -0.03 0.00 0.05 -0.03 0.05 0.97
movie_profitable 0.02 -0.02 0.85 -0.01 0.01 -0.05 0.03
hastopactor3 -0.08 0.04 0.04 0.81 0.04 0.02 -0.01

To better visualize and interpret the factors we often “suppress” loadings with small absolute values, e.g. smaller than 0.5. Our factors look as follows after suppressing the small loadings:

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
cast_facebook_likes 0.93
actor1_fb_likes 0.90
actor2_fb_likes 0.60 0.51
hastopactor1
hastopactor2
actor3_fb_likes 0.76
movie_budget 0.65
number_votes 0.71
movie_duration 0.74
hastopdirector 0.84
number_reviews 0.75
movie_year 0.87
movie_fb_likes 0.57
director_fb_likes 0.82
net_USD 0.85
movieposter_faces 0.97
movie_profitable 0.85
hastopactor3 0.81
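
A hedged sketch of how the rotated loadings and the suppression can be computed (again assuming the psych package and the parameters set earlier):

# Varimax-rotated solution with the chosen number of factors
rotated <- principal(ProjectData_scaled, nfactors = manual_numb_factors_used,
                     rotate = rotation_used, scores = TRUE)
loadings_mat <- round(unclass(rotated$loadings), 2)
loadings_mat[abs(loadings_mat) < 0.5] <- NA  # suppress loadings below 0.5 in absolute value
loadings_mat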

Step 6: Save factor scores

We can now either replace all initial variables used in this part with the factor scores, or select one of the initial variables per selected factor to represent that factor. Here is how the factor scores look for the first few movies:

Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
DV (Factor) 1 -1.12 2.12 2.38 1.77 -0.53 0.63 -0.34 -0.57 0.43 -0.62
DV (Factor) 2 5.85 4.21 5.03 3.63 6.24 4.21 4.80 3.59 3.44 2.47
DV (Factor) 3 4.83 -0.79 2.43 -0.06 1.01 -1.45 1.13 -0.47 0.11 -2.54
DV (Factor) 4 -0.65 -1.11 6.48 -0.07 -0.59 -0.08 -0.72 -0.43 -0.11 -0.24
DV (Factor) 5 -1.43 -0.89 2.46 1.22 -1.16 -1.17 -1.01 -0.81 -0.88 1.64
DV (Factor) 6 0.41 -0.10 3.32 -0.15 3.98 -0.25 2.91 1.72 1.54 0.57
DV (Factor) 7 -0.91 -0.77 -0.92 -0.71 -0.37 -0.83 -0.85 -0.37 -0.61 -0.29
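
A minimal sketch of saving the scores (continuing from the rotated solution above; the column naming is our assumption):

# Keep the seven factor scores as the new derived variables
factor_scores <- round(rotated$scores[, 1:manual_numb_factors_used], 2)
colnames(factor_scores) <- paste("DV (Factor)", 1:manual_numb_factors_used)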

Part 2: Classification

In order to predict whether a movie will be in the top 250 IMDB movies based on the 7 factors we found above, we use classification methods: in particular the CART tree methodology and the logistic regression method.

To start off, we set the variables for CART and the profit matrices used to evaluate our predictions. As mentioned, the dependent variable is a 0/1 vector indicating whether or not the movie is in the top 250; the independent variables are the seven factors. For the profit matrix, given the low percentage of top-250 movies (roughly 250 out of 4,000), we assume that correctly predicting a top-250 movie gives +50 and wrongly predicting that a movie will be in the top 250 gives -1.
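
A sketch of the profit matrix just described (the exact row/column layout is our assumption):

# Rows: actual class; columns: predicted class (1 = top 250, 0 = not)
Profit_Matrix <- rbind(c(50, 0),   # actual top 250: a correct prediction earns +50
                       c(-1, 0))   # actual not top 250: predicting top 250 costs -1
rownames(Profit_Matrix) <- c("Actual 1", "Actual 0")
colnames(Profit_Matrix) <- c("Predict 1", "Predict 0")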

We then determine the parameters used for the CART analysis; in particular we set the probability threshold to 65%. We originally used 70%, but some of the segments were at exactly 70%, so we reduced the threshold to capture those.

We also determine how much data we want to use to estimate the parameters, and how much to use to assess how correct our model is (we use 80% of the data for estimation).

Finally, we determine the complexity of the tree by choosing the complexity parameter: having tried a few options, 0.005 gave a tree that is relatively simple but still provides reasonable decision nodes.

Probability_Threshold = 65  # between 1 and 99%
estimation_data_percent = 80
validation_data_percent = 10
random_sampling = 0
CART_cp = 0.005
min_segment = 100
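
A hedged sketch of the tree fit, assuming the rpart package and an estimation_data frame whose 0/1 dependent variable is imdb_top_250 and whose remaining columns are the seven factor scores:

# Fit the classification tree with the parameters set above
library(rpart)
CART_tree <- rpart(imdb_top_250 ~ ., data = data.frame(estimation_data),
                   method = "class",
                   control = rpart.control(cp = CART_cp, minbucket = min_segment))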

The resulting classification tree is below:

As we can see from the graph, there are 8 different leaves (ending points); of these, only 3 would lead a movie to be classified in the top 250. To get a feeling for whether the classification makes sense, we look at two example decisions.

  1. Example decision to classify as a 0: one decision, for example, looks at whether Factor 2 < 1.1. Factor 2 corresponds to the number of IMDB votes and reviews of the movie. As movies with many IMDB reviews and votes tend to be popular, a low score on this factor indicates, with fairly high probability, that the movie will not be in the top 250.

  2. Example decision to classify as a 1: one decision, for example, looks at whether Factor 2 > 1.1, Factor 3 > 0.88 and Factor 6 > 1.8. So essentially if the film has many IMDB votes and reviews, is profitable, and is a recent movie with many Facebook likes, then it is likely to be in the top 250, which is intuitive.

We then run a logistic regression to obtain an alternative classification of the movies:

# Build the regression formula "dependent ~ factor1 + ... + factor7"
formula_log = paste(colnames(estimation_data)[dependent_variable],
    paste(colnames(estimation_data)[independent_variables], collapse = " + "),
    sep = " ~ ")

logreg_solution <- glm(formula_log, family = binomial(link = "logit"), data = estimation_data)

log_coefficients = round(summary(logreg_solution)$coefficients, 1)
iprint.df(log_coefficients)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.1 0.2 -23.8 0.0
DV..Factor..1 -0.3 0.1 -2.1 0.0
DV..Factor..2 1.0 0.1 12.4 0.0
DV..Factor..3 0.8 0.1 8.8 0.0
DV..Factor..4 0.3 0.1 3.2 0.0
DV..Factor..5 0.3 0.1 5.0 0.0
DV..Factor..6 -0.2 0.1 -3.1 0.0
DV..Factor..7 0.0 0.2 0.1 0.9

From the logistic regression we can determine which factors are most significant and whether some are not significant: in our case factor 7 is not statistically significant. This is in line with expectations, as factor 7 is the number of faces in the movie poster, which we did not expect to have a significant impact.

We then compare the two methods by checking how much “weight” each method puts on the different factors: the table below shows the importance assigned to each factor in the classification (as a percentage of the importance of the most important factor).

As we can see, the results of the two methodologies are in line, and the most predictive factors are factor 2, factor 3 and factor 5: respectively the number of IMDB reviews and votes, the movie's profitability, and the director's reputation (top-director status and Facebook likes). This is in line with expectations. Factor 7 is the least important under both methodologies.

CART 1 Logistic Regr.
DV..Factor..1 -0.4075435 -0.169354839
DV..Factor..2 1.0000000 1.000000000
DV..Factor..3 0.9024807 0.709677419
DV..Factor..4 0.1830532 0.258064516
DV..Factor..5 0.3845682 0.403225806
DV..Factor..6 -0.4371961 -0.250000000
DV..Factor..7 0.1533779 0.008064516
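
The logistic regression column appears to be the coefficient z-values scaled by the largest absolute z-value (e.g. 8.8 / 12.4 ≈ 0.71 for factor 3); a minimal sketch of that computation:

# Relative importance from the logistic regression: z-values scaled by the largest
z_values <- summary(logreg_solution)$coefficients[-1, "z value"]  # drop the intercept
round(z_values / max(abs(z_values)), 3)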

Finally, if we were to use the estimated classification models on the test data, we would get the following profit curves (using the profit parameters set earlier).

The profit curve using the small classification tree:

This result is not reasonable; this may be because the tree only performs a rough probability estimation.

The profit curve using the logistic regression classifier:

The logistic profit curve seems to indicate that our estimation is working, as we are able to capture pretty much all the potential profit while covering only the first percentiles of the ranked movies (about the top 13%, see the table below).

These are the maximum total profits achieved in the test data using the two classifiers (without any segment-specific analysis so far):

Percentile Profit
Small Tree 6.42 231
Logistic Regression 12.83 411
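
A hedged sketch of how such a profit curve can be computed, assuming a test_data frame holding the held-out observations and the +50/-1 payoffs set earlier:

# Rank test movies by predicted probability and accumulate profit down the list
probabilities <- predict(logreg_solution, newdata = test_data, type = "response")
ranking <- order(probabilities, decreasing = TRUE)
profit <- cumsum(ifelse(test_data$imdb_top_250[ranking] == 1, 50, -1))
percentile <- 100 * seq_along(profit) / length(profit)
plot(percentile, profit, type = "l", xlab = "Percentile targeted", ylab = "Total profit")
max(profit)  # maximum total profit, as reported in the table above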

Further Ideas

There is ample room to extend this process and improve on it.

  • We could have chosen a different number of factors (e.g. six), given that the last factor does not seem to be statistically significant in the regression

  • One could, for example, add more variables to target more specifically, such as performance by country (allowing us to target specific countries instead of only taking a global approach) or film genres (allowing us to target types of movies instead of only taking a generic approach)

  • Another option would be to refine the prediction. The process currently predicts whether a movie will make it into the top 250 or not; this could be refined to target, for example, the top 100 or the top 1000. Maybe the top 250 is too small a target (only a small fraction of the movies are a 1). Alternatively, we could have tried to predict the actual IMDB score instead of just whether the movie is a hit.

  • Another question that a small alteration to this process could answer is finding the key drivers of, and the link between, movie budget and IMDB success