The Business Decision

This dataset summarizes a heterogeneous set of features of articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks, i.e. how popular any given article is. The dataset is publicly available at the University of California Irvine Machine Learning Repository.

Mashable Inc. is a digital media website founded in 2005. It has been described as a “one stop shop” for social media. As of November 2015, it had over 6,000,000 Twitter followers and over 3,200,000 fans on Facebook.


The Data

First we load the data to use (see the raw .Rmd file to change the data file as needed):

The attributes in the dataset are as follows:

  1. url: URL of the article (non-predictive)
  2. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
  3. n_tokens_title: Number of words in the title
  4. n_tokens_content: Number of words in the content
  5. n_unique_tokens: Rate of unique words in the content
  6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
  7. num_hrefs: Number of links
  8. num_self_hrefs: Number of links to other articles published by Mashable
  9. num_imgs: Number of images
  10. num_videos: Number of videos
  11. average_token_length: Average length of the words in the content
  12. num_keywords: Number of keywords in the metadata
  13. self_reference_min_shares: Min. shares of referenced articles in Mashable
  14. self_reference_max_shares: Max. shares of referenced articles in Mashable
  15. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
  16. global_subjectivity: Text subjectivity
  17. global_sentiment_polarity: Text sentiment polarity
  18. global_rate_positive_words: Rate of positive words in the content
  19. global_rate_negative_words: Rate of negative words in the content
  20. rate_positive_words: Rate of positive words among non-neutral tokens
  21. rate_negative_words: Rate of negative words among non-neutral tokens
  22. title_subjectivity: Title subjectivity
  23. title_sentiment_polarity: Title polarity
  24. abs_title_subjectivity: Absolute subjectivity level
  25. abs_title_sentiment_polarity: Absolute polarity level
  26. shares: Number of shares (target)

Stop words usually refer to the most common words in a language. There is no single universal list of stop words used by all natural language processing tools. For some search engines, they are the most common short function words, such as the, is, at, which, and on.
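As a rough illustration of how the two "rate of unique words" attributes relate to stop words, here is a minimal Python sketch. The stop-word list, the whitespace tokenizer, and the exact rate definitions (unique tokens divided by total tokens) are assumptions made for this example; the dataset's own preprocessing is not documented here and may differ.

```python
# Hypothetical, tiny stop-word list; real NLP tools each ship their own longer list.
STOP_WORDS = {"the", "is", "at", "which", "on"}

def token_rates(text):
    """Return (rate of unique tokens, rate of unique non-stop tokens)."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    non_stop = [t for t in tokens if t not in STOP_WORDS]
    unique_rate = len(set(tokens)) / len(tokens) if tokens else 0.0
    non_stop_unique_rate = len(set(non_stop)) / len(non_stop) if non_stop else 0.0
    return unique_rate, non_stop_unique_rate

unique_rate, non_stop_unique_rate = token_rates(
    "the cat is on the mat which is on the floor"
)
print(round(unique_rate, 2), round(non_stop_unique_rate, 2))
```

In this toy sentence the repeated stop words pull the first rate down, while every remaining content word is distinct, so the second rate is higher.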


Dimensionality Reduction

Steps 1-2: Check the Data

Here is a sample of the first 250 rows of the Dataset:

The data we use here have the following descriptive statistics:

Variable min 25% median mean 75% max std
n_tokens_title 2.00 9.00 10.00 10.40 12.00 23.00 2.11
n_tokens_content 0.00 246.00 409.00 546.41 716.00 8474.00 471.23
n_unique_tokens 0.00 0.47 0.54 0.53 0.61 1.00 0.14
n_non_stop_unique_tokens 0.00 0.63 0.69 0.67 0.75 1.00 0.15
num_hrefs 0.00 4.00 8.00 10.88 14.00 304.00 11.34
num_self_hrefs 0.00 1.00 3.00 3.28 4.00 116.00 3.83
num_imgs 0.00 1.00 1.00 4.54 4.00 128.00 8.30
num_videos 0.00 0.00 0.00 1.25 1.00 91.00 4.11
average_token_length 0.00 4.48 4.66 4.55 4.85 8.04 0.85
num_keywords 1.00 6.00 7.00 7.22 9.00 10.00 1.91
self_reference_min_shares 0.00 641.00 1200.00 4004.19 2600.00 843300.00 19757.91
self_reference_max_shares 0.00 1100.00 2800.00 10336.39 7900.00 843300.00 41067.32
self_reference_avg_sharess 0.00 983.00 2200.00 6409.49 5200.00 843300.00 24234.80
global_subjectivity 0.00 0.40 0.45 0.44 0.51 1.00 0.12
global_sentiment_polarity -0.39 0.06 0.12 0.12 0.18 0.73 0.10
global_rate_positive_words 0.00 0.03 0.04 0.04 0.05 0.16 0.02
global_rate_negative_words 0.00 0.01 0.02 0.02 0.02 0.18 0.01
rate_positive_words 0.00 0.60 0.71 0.68 0.80 1.00 0.19
rate_negative_words 0.00 0.19 0.28 0.29 0.38 1.00 0.16
title_subjectivity 0.00 0.00 0.15 0.28 0.50 1.00 0.32
title_sentiment_polarity -1.00 0.00 0.00 0.07 0.15 1.00 0.27
abs_title_subjectivity 0.00 0.17 0.50 0.34 0.50 0.50 0.19
abs_title_sentiment_polarity 0.00 0.00 0.00 0.16 0.25 1.00 0.23

The data are then scaled so that every variable has mean 0 and standard deviation 1, and the summary statistics are recomputed:

Variable min 25% median mean 75% max std
n_tokens_title -3.97 -0.66 -0.19 0 0.76 5.96 1
n_tokens_content -1.16 -0.64 -0.29 0 0.36 16.82 1
n_unique_tokens -3.87 -0.43 0.06 0 0.57 3.42 1
n_non_stop_unique_tokens -4.37 -0.30 0.11 0 0.53 2.12 1
num_hrefs -0.96 -0.61 -0.25 0 0.27 25.85 1
num_self_hrefs -0.86 -0.60 -0.07 0 0.19 29.39 1
num_imgs -0.55 -0.43 -0.43 0 -0.06 14.87 1
num_videos -0.30 -0.30 -0.30 0 -0.06 21.83 1
average_token_length -5.38 -0.08 0.14 0 0.36 4.13 1
num_keywords -3.26 -0.64 -0.12 0 0.93 1.45 1
self_reference_min_shares -0.20 -0.17 -0.14 0 -0.07 42.48 1
self_reference_max_shares -0.25 -0.22 -0.18 0 -0.06 20.28 1
self_reference_avg_sharess -0.26 -0.22 -0.17 0 -0.05 34.53 1
global_subjectivity -3.80 -0.40 0.09 0 0.56 4.77 1
global_sentiment_polarity -5.29 -0.63 0.00 0 0.60 6.28 1
global_rate_positive_words -2.27 -0.64 -0.03 0 0.61 6.65 1
global_rate_negative_words -1.53 -0.65 -0.12 0 0.47 15.54 1
rate_positive_words -3.59 -0.43 0.15 0 0.62 1.67 1
rate_negative_words -1.84 -0.66 -0.05 0 0.62 4.56 1
title_subjectivity -0.87 -0.87 -0.41 0 0.67 2.21 1
title_sentiment_polarity -4.04 -0.27 -0.27 0 0.30 3.50 1
abs_title_subjectivity -1.81 -0.93 0.84 0 0.84 0.84 1
abs_title_sentiment_polarity -0.69 -0.69 -0.69 0 0.42 3.73 1
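The scaling step can be sketched in a few lines of Python. This is a minimal illustration of standardization (subtract the mean, divide by the sample standard deviation), not the report's own code; the input values are made up for the example and are not rows from the Mashable data.

```python
from statistics import mean, stdev

def standardize(values):
    """Center on the mean and divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in values]

# Made-up title lengths, used only to show the transformation.
title_lengths = [8, 9, 10, 12, 13]
scaled = standardize(title_lengths)
print([round(z, 2) for z in scaled])
```

After this transformation every variable is on the same scale, which is why the mean column is 0 and the std column is 1 in the table above.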

Step 3: Check Correlations

n_tokens_title n_tokens_content n_unique_tokens n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos average_token_length num_keywords self_reference_min_shares self_reference_max_shares self_reference_avg_sharess global_subjectivity global_sentiment_polarity global_rate_positive_words global_rate_negative_words rate_positive_words rate_negative_words title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity
n_tokens_title 1.00 0.02 -0.05 -0.04 -0.05 -0.02 -0.01 0.05 -0.07 -0.01 0.00 0.00 0.00 -0.06 -0.07 -0.07 0.02 -0.07 0.03 0.08 0.00 -0.15 0.04
n_tokens_content 0.02 1.00 -0.40 -0.22 0.42 0.30 0.34 0.10 0.17 0.07 -0.03 0.03 -0.01 0.13 0.02 0.13 0.13 0.10 0.10 0.00 0.02 0.01 0.01
n_unique_tokens -0.05 -0.40 1.00 0.94 -0.11 -0.05 -0.24 0.02 0.66 -0.08 0.05 0.03 0.05 0.49 0.17 0.29 0.18 0.45 0.20 -0.01 -0.03 0.00 -0.02
n_non_stop_unique_tokens -0.04 -0.22 0.94 1.00 -0.11 -0.02 -0.29 0.01 0.71 -0.08 0.04 0.03 0.04 0.52 0.17 0.35 0.22 0.49 0.23 -0.03 -0.03 0.01 -0.04
num_hrefs -0.05 0.42 -0.11 -0.11 1.00 0.40 0.34 0.11 0.22 0.13 0.00 0.08 0.03 0.20 0.09 0.06 0.03 0.10 0.06 0.04 0.04 0.01 0.06
num_self_hrefs -0.02 0.30 -0.05 -0.02 0.40 1.00 0.23 0.08 0.13 0.10 -0.03 0.13 0.02 0.11 0.09 0.12 0.01 0.14 -0.01 -0.01 0.03 0.01 -0.01
num_imgs -0.01 0.34 -0.24 -0.29 0.34 0.23 1.00 -0.07 0.03 0.09 0.01 0.03 0.02 0.08 0.02 -0.04 0.03 -0.02 0.04 0.06 0.05 -0.01 0.06
num_videos 0.05 0.10 0.02 0.01 0.11 0.08 -0.07 1.00 0.00 -0.02 0.00 0.08 0.03 0.08 -0.03 0.07 0.18 -0.04 0.07 0.06 0.02 -0.02 0.05
average_token_length -0.07 0.17 0.66 0.71 0.22 0.13 0.03 0.00 1.00 -0.02 0.03 0.04 0.04 0.60 0.18 0.32 0.23 0.58 0.32 -0.04 -0.02 0.03 -0.04
num_keywords -0.01 0.07 -0.08 -0.08 0.13 0.10 0.09 -0.02 -0.02 1.00 -0.01 0.01 0.00 0.04 0.08 0.05 -0.04 0.03 -0.07 0.02 0.03 -0.01 0.02
self_reference_min_shares 0.00 -0.03 0.05 0.04 0.00 -0.03 0.01 0.00 0.03 -0.01 1.00 0.48 0.82 0.06 0.01 0.00 0.01 0.02 0.01 0.01 0.00 0.00 0.01
self_reference_max_shares 0.00 0.03 0.03 0.03 0.08 0.13 0.03 0.08 0.04 0.01 0.48 1.00 0.85 0.06 0.01 0.02 0.02 0.03 0.02 0.01 0.00 0.00 0.01
self_reference_avg_sharess 0.00 -0.01 0.05 0.04 0.03 0.02 0.02 0.03 0.04 0.00 0.82 0.85 1.00 0.07 0.01 0.01 0.02 0.03 0.02 0.01 0.00 0.00 0.01
global_subjectivity -0.06 0.13 0.49 0.52 0.20 0.11 0.08 0.08 0.60 0.04 0.06 0.06 0.07 1.00 0.34 0.47 0.25 0.49 0.13 0.11 0.03 0.00 0.09
global_sentiment_polarity -0.07 0.02 0.17 0.17 0.09 0.09 0.02 -0.03 0.18 0.08 0.01 0.01 0.01 0.34 1.00 0.57 -0.47 0.73 -0.65 0.02 0.24 -0.03 0.07
global_rate_positive_words -0.07 0.13 0.29 0.35 0.06 0.12 -0.04 0.07 0.32 0.05 0.00 0.02 0.01 0.47 0.57 1.00 0.11 0.63 -0.33 0.11 0.14 -0.14 0.10
global_rate_negative_words 0.02 0.13 0.18 0.22 0.03 0.01 0.03 0.18 0.23 -0.04 0.01 0.02 0.02 0.25 -0.47 0.11 1.00 -0.40 0.78 0.09 -0.14 -0.06 0.06
rate_positive_words -0.07 0.10 0.45 0.49 0.10 0.14 -0.02 -0.04 0.58 0.03 0.02 0.03 0.03 0.49 0.73 0.63 -0.40 1.00 -0.53 -0.02 0.14 -0.02 0.00
rate_negative_words 0.03 0.10 0.20 0.23 0.06 -0.01 0.04 0.07 0.32 -0.07 0.01 0.02 0.02 0.13 -0.65 -0.33 0.78 -0.53 1.00 0.00 -0.19 0.04 -0.03
title_subjectivity 0.08 0.00 -0.01 -0.03 0.04 -0.01 0.06 0.06 -0.04 0.02 0.01 0.01 0.01 0.11 0.02 0.11 0.09 -0.02 0.00 1.00 0.23 -0.49 0.71
title_sentiment_polarity 0.00 0.02 -0.03 -0.03 0.04 0.03 0.05 0.02 -0.02 0.03 0.00 0.00 0.00 0.03 0.24 0.14 -0.14 0.14 -0.19 0.23 1.00 -0.24 0.41
abs_title_subjectivity -0.15 0.01 0.00 0.01 0.01 0.01 -0.01 -0.02 0.03 -0.01 0.00 0.00 0.00 0.00 -0.03 -0.14 -0.06 -0.02 0.04 -0.49 -0.24 1.00 -0.40
abs_title_sentiment_polarity 0.04 0.01 -0.02 -0.04 0.06 -0.01 0.06 0.05 -0.04 0.02 0.01 0.01 0.01 0.09 0.07 0.10 0.06 0.00 -0.03 0.71 0.41 -0.40 1.00
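Each entry of the matrix above is a pairwise correlation between two attributes. The following Python sketch assumes the Pearson coefficient (the source does not name the coefficient) and uses made-up columns that only borrow three attribute names for readability.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Build a nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up values, not taken from the dataset.
columns = {
    "n_unique_tokens": [0.45, 0.50, 0.55, 0.60, 0.70],
    "n_non_stop_unique_tokens": [0.60, 0.66, 0.68, 0.74, 0.82],
    "n_tokens_content": [900, 700, 400, 350, 150],
}
corr = correlation_matrix(columns)
for name, row in corr.items():
    print(name, [round(v, 2) for v in row.values()])
```

Blocks of highly correlated attributes, such as the two unique-token rates (0.94 in the matrix above), are what make dimensionality reduction worthwhile.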

Step 4: Choose number of factors

After running the Principal Component Analysis, we look at the eigenvalues and the variance explained to choose the relevant number of factors:

Component     Eigenvalue   Pct of explained variance   Cumulative pct of explained variance
Component 1         4.02                       17.49                                  17.49
Component 2         2.93                       12.76                                  30.25
Component 3         2.53                       11.01                                  41.27
Component 4         2.37                       10.29                                  51.55
Component 5         2.21                        9.63                                  61.18
Component 6         1.09                        4.76                                  65.94
Component 7         1.00                        4.34                                  70.28
Component 8         0.96                        4.19                                  74.47
Component 9         0.89                        3.86                                  78.34
Component 10        0.78                        3.40                                  81.73
Component 11        0.72                        3.14                                  84.87
Component 12        0.65                        2.81                                  87.68
Component 13        0.61                        2.66                                  90.33
Component 14        0.51                        2.21                                  92.54
Component 15        0.49                        2.11                                  94.65
Component 16        0.38                        1.66                                  96.31
Component 17        0.25                        1.09                                  97.40
Component 18        0.23                        0.99                                  98.39
Component 19        0.20                        0.87                                  99.26
Component 20        0.08                        0.35                                  99.61
Component 21        0.04                        0.16                                  99.77
Component 22        0.03                        0.14                                  99.91
Component 23        0.02                        0.09                                 100.00

Based on the Principal Component Analysis, the first 6 of the 23 components are chosen. Each of them has an eigenvalue above 1, and together they explain 65.94% of the variance.
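The link between the eigenvalues and the explained-variance columns can be reproduced with a short Python sketch. The eigenvalues are copied from the table above, so results differ slightly from the printed percentages because the inputs are already rounded. The "eigenvalue greater than 1" cutoff is an assumption made for illustration; the source does not state which rule led to six factors, and Component 7 (1.00 after rounding) sits right at that threshold.

```python
# Eigenvalues as printed in the table above (rounded to two decimals).
eigenvalues = [4.02, 2.93, 2.53, 2.37, 2.21, 1.09, 1.00, 0.96, 0.89, 0.78,
               0.72, 0.65, 0.61, 0.51, 0.49, 0.38, 0.25, 0.23, 0.20, 0.08,
               0.04, 0.03, 0.02]

total = sum(eigenvalues)
pct = [100 * ev / total for ev in eigenvalues]

# Running total of the explained-variance percentages.
cumulative = []
running = 0.0
for p in pct:
    running += p
    cumulative.append(running)

# Assumed selection rule: keep components whose eigenvalue exceeds 1.
n_factors = sum(1 for ev in eigenvalues if ev > 1.0)
print(n_factors, round(cumulative[n_factors - 1], 2))
```

For standardized data the eigenvalues sum to the number of variables (23 here), so an eigenvalue above 1 means the component carries more variance than a single original variable.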

Step 5: Interpret the factors

We check the correlation of each of these six factors with each of the 23 original attributes:

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
n_non_stop_unique_tokens 0.90 -0.06 0.02 -0.29 -0.05 -0.02
n_unique_tokens 0.86 -0.05 0.04 -0.36 -0.02 -0.07
average_token_length 0.86 -0.10 0.01 0.20 -0.07 -0.11
global_subjectivity 0.76 0.03 0.05 0.24 0.10 0.04
rate_positive_words 0.63 0.70 0.01 0.12 0.00 -0.05
global_rate_positive_words 0.55 0.46 -0.02 0.15 0.16 0.24
global_sentiment_polarity 0.32 0.83 0.00 0.11 0.09 -0.01
global_rate_negative_words 0.30 -0.78 0.00 0.10 0.07 0.24
rate_negative_words 0.22 -0.94 0.01 0.06 -0.06 0.01
num_hrefs 0.09 -0.02 0.03 0.75 0.01 0.00
num_self_hrefs 0.09 0.06 0.04 0.60 -0.06 0.10
num_videos 0.05 -0.05 0.04 0.10 0.00 0.82
self_reference_avg_sharess 0.03 0.00 0.99 0.01 0.00 0.01
self_reference_min_shares 0.02 -0.01 0.85 -0.04 0.01 -0.05
self_reference_max_shares 0.02 0.00 0.86 0.09 0.00 0.08
title_subjectivity 0.02 -0.07 0.01 0.02 0.85 0.06
abs_title_sentiment_polarity 0.01 -0.02 0.01 0.04 0.87 0.00
title_sentiment_polarity 0.00 0.25 0.00 0.05 0.54 -0.05
abs_title_subjectivity -0.01 0.01 0.00 0.04 -0.69 -0.12
n_tokens_content -0.03 -0.04 -0.04 0.77 -0.02 0.17
num_keywords -0.04 0.09 0.00 0.24 0.05 -0.15
n_tokens_title -0.10 -0.04 0.00 -0.07 0.11 0.40
num_imgs -0.12 -0.07 0.02 0.65 0.09 -0.29

To better visualize and interpret the factors we often “suppress” loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers:

Variable                        Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6
n_non_stop_unique_tokens          0.90
n_unique_tokens                   0.86
average_token_length              0.86
global_subjectivity               0.76
rate_positive_words               0.63    0.70
global_rate_positive_words        0.55
global_sentiment_polarity                 0.83
global_rate_negative_words               -0.78
rate_negative_words                      -0.94
num_hrefs                                                 0.75
num_self_hrefs                                            0.60
num_videos                                                                0.82
self_reference_avg_sharess                        0.99
self_reference_min_shares                         0.85
self_reference_max_shares                         0.86
title_subjectivity                                                0.85
abs_title_sentiment_polarity                                      0.87
title_sentiment_polarity                                          0.54
abs_title_subjectivity                                           -0.69
n_tokens_content                                          0.77
num_keywords
n_tokens_title
num_imgs                                                  0.65
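The suppression step is a simple threshold on absolute values. Here is a minimal Python sketch using three rows copied from the full loadings table; the helper name `suppress_small` is made up for this example.

```python
def suppress_small(loadings, threshold=0.5):
    """Replace loadings whose absolute value is below the threshold with None."""
    return {
        name: [v if abs(v) >= threshold else None for v in row]
        for name, row in loadings.items()
    }

# Three rows from the loadings table, ordered Comp.1 to Comp.6.
loadings = {
    "rate_positive_words": [0.63, 0.70, 0.01, 0.12, 0.00, -0.05],
    "rate_negative_words": [0.22, -0.94, 0.01, 0.06, -0.06, 0.01],
    "num_keywords": [-0.04, 0.09, 0.00, 0.24, 0.05, -0.15],
}
suppressed = suppress_small(loadings)
for name, row in suppressed.items():
    print(name, ["" if v is None else f"{v:.2f}" for v in row])
```

A variable such as num_keywords, whose loadings all fall below the threshold, ends up with an empty row, which signals that none of the six factors represents it well.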

Step 6: Save factor scores

We can now replace all the initial variables used in this part with a single initial variable for each selected factor, chosen to represent that factor. Here are the factor scores for the first ten observations (columns V1 to V10):

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
DV (Factor) 1 0.15 0.28 0.17 0.40 -0.14 -0.36 0.29 0.32 -0.04 0.12
DV (Factor) 2 0.67 -1.90 0.28 -0.18 0.10 0.40 1.10 -0.74 1.74 0.40
DV (Factor) 3 -0.12 -0.29 -0.18 -0.23 -0.21 -0.22 -0.10 -0.13 -0.26 0.56
DV (Factor) 4 -0.77 -0.75 -0.61 -0.70 -0.15 -0.01 2.71 0.67 -0.62 -0.16
DV (Factor) 5 -0.89 0.49 -0.84 -0.90 -0.09 -0.87 -0.95 -0.37 1.25 0.47
DV (Factor) 6 0.18 -0.22 -1.02 0.27 -0.30 -1.18 -0.32 -0.78 -0.04 0.80

Where,

DV (Factor) 1: Rate of unique non-stop words in the content

DV (Factor) 2: Rate of negative (or positive) words in the content

DV (Factor) 3: Avg. shares of referenced articles in Mashable

DV (Factor) 4: Number of words in the content

DV (Factor) 5: Absolute polarity level in title

DV (Factor) 6: Number of videos in the article
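The six representative variables listed above coincide with the variable that has the largest absolute loading on each component. Whether the analysis used exactly this rule is an assumption; the Python sketch below only shows that the rule reproduces the list. It uses an excerpt of the loadings table (the rows with the highest loadings), and the function name is hypothetical.

```python
def representative_variables(loadings, n_components):
    """For each component, pick the variable with the largest absolute loading."""
    return [
        max(loadings, key=lambda name: abs(loadings[name][j]))
        for j in range(n_components)
    ]

# Excerpt of the loadings table, ordered Comp.1 to Comp.6.
loadings = {
    "n_non_stop_unique_tokens": [0.90, -0.06, 0.02, -0.29, -0.05, -0.02],
    "n_unique_tokens": [0.86, -0.05, 0.04, -0.36, -0.02, -0.07],
    "rate_positive_words": [0.63, 0.70, 0.01, 0.12, 0.00, -0.05],
    "rate_negative_words": [0.22, -0.94, 0.01, 0.06, -0.06, 0.01],
    "self_reference_avg_sharess": [0.03, 0.00, 0.99, 0.01, 0.00, 0.01],
    "self_reference_max_shares": [0.02, 0.00, 0.86, 0.09, 0.00, 0.08],
    "n_tokens_content": [-0.03, -0.04, -0.04, 0.77, -0.02, 0.17],
    "num_hrefs": [0.09, -0.02, 0.03, 0.75, 0.01, 0.00],
    "abs_title_sentiment_polarity": [0.01, -0.02, 0.01, 0.04, 0.87, 0.00],
    "title_subjectivity": [0.02, -0.07, 0.01, 0.02, 0.85, 0.06],
    "num_videos": [0.05, -0.05, 0.04, 0.10, 0.00, 0.82],
}
representatives = representative_variables(loadings, 6)
for j, name in enumerate(representatives, start=1):
    print(f"Factor {j}: {name}")
```

Note that rate_negative_words represents Factor 2 through a negative loading (-0.94), which is why that factor is described as the rate of negative (or positive) words.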


By focusing on these six factors, Mashable should be better able to predict how widely an article will be shared on social media. Moreover, Mashable could potentially increase the number of shares of each article by choosing values of these attributes that maximize the chance that a reader shares it.


Cluster Analysis and Segmentation

The Data

There are a total of 39,565 URLs in the data. Here are the values for the first 10 URLs of the six variables chosen to represent the factors in the Dimensionality Reduction stage:

n_non_stop_unique_tokens rate_negative_words self_reference_avg_sharess n_tokens_content abs_title_sentiment_polarity num_videos
0.73 0.20 3100.00 288 0.00 0
0.78 0.57 0.00 414 0.20 0
0.79 0.25 727.00 134 0.00 0
0.77 0.32 951.00 281 0.00 0
0.66 0.26 1300.00 499 0.00 0
0.59 0.22 0.00 268 0.00 0
0.55 0.19 3151.16 925 0.00 0
0.60 0.40 2700.00 261 0.10 0
0.70 0.00 0.00 306 0.33 0
0.67 0.25 20900.00 909 0.10 1

Summary Statistics

Variable min 25% median mean 75% max std
n_non_stop_unique_tokens 0 0.63 0.69 0.67 0.75 1 0.15
rate_negative_words 0 0.19 0.28 0.29 0.38 1 0.16
self_reference_avg_sharess 0 983.00 2200.00 6409.49 5200.00 843300 24234.80
n_tokens_content 0 246.00 409.00 546.41 716.00 8474 471.23
abs_title_sentiment_polarity 0 0.00 0.00 0.16 0.25 1 0.23
num_videos 0 0.00 0.00 1.25 1.00 91 4.11

Scaled Summary Statistics

Variable min 25% median mean 75% max std
n_non_stop_unique_tokens -4.37 -0.30 0.11 0 0.53 2.12 1
rate_negative_words -1.84 -0.66 -0.05 0 0.62 4.56 1
self_reference_avg_sharess -0.26 -0.22 -0.17 0 -0.05 34.53 1
n_tokens_content -1.16 -0.64 -0.29 0 0.36 16.82 1
abs_title_sentiment_polarity -0.69 -0.69 -0.69 0 0.42 3.73 1
num_videos -0.30 -0.30 -0.30 0 -0.06 21.83 1

Using Kmeans Clustering

We use k-means clustering, run with Lloyd's algorithm, to look for 3 clusters. Here is the cluster membership for the first 10 URLs:

Observation Number Cluster_Membership
1 3
2 3
3 3
4 3
5 3
6 3
7 2
8 3
9 1
10 3
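For readers unfamiliar with Lloyd's algorithm, the following Python sketch shows its two alternating steps on made-up two-dimensional points. It is an illustration under simplifying assumptions (fixed starting centers supplied by the caller, squared Euclidean distance, 0-based cluster labels), not the code that produced the memberships above.

```python
def lloyd_kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])))
            for point in points
        ]
        if new_labels == labels:
            break  # memberships stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Made-up points forming three obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (21, 0), (20, 1)]
labels, centers = lloyd_kmeans(points, centers=[points[0], points[3], points[6]])
print(labels)
```

With real data the result depends on the starting centers, which is why k-means is usually run several times from different random starts.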

Interpreting the Segments

We compare the average value of each variable in each segment with the population average:

Population Segment 1 Segment 2 Segment 3
n_non_stop_unique_tokens 0.67 0.69 0.48 0.72
rate_negative_words 0.29 0.29 0.24 0.30
self_reference_avg_sharess 6409.49 7785.53 4183.65 6666.43
n_tokens_content 546.41 499.43 1096.93 395.80
abs_title_sentiment_polarity 0.16 0.54 0.11 0.06
num_videos 1.25 1.67 2.35 0.80

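To get a better picture of the magnitude of the differences, each segment average is expressed relative to the population average. The values below are consistent with the segment mean divided by the population mean, minus 1, so 0 means no difference from the population: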

Segment 1 Segment 2 Segment 3
n_non_stop_unique_tokens 0.03 -0.28 0.08
rate_negative_words 0.00 -0.16 0.05
self_reference_avg_sharess 0.21 -0.35 0.04
n_tokens_content -0.09 1.01 -0.28
abs_title_sentiment_polarity 2.49 -0.33 -0.64
num_videos 0.34 0.88 -0.36
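The scaled figures appear to be relative differences from the population mean; this reading is inferred from the numbers rather than stated in the source. The Python sketch below recomputes three rows from the segment averages printed earlier. Because those averages are already rounded, rows with small means (such as abs_title_sentiment_polarity) will not match the table exactly.

```python
def relative_difference(segment_mean, population_mean):
    """Segment mean as a fraction above (+) or below (-) the population mean."""
    return segment_mean / population_mean - 1

# Averages copied from the comparison table: population, then segments 1 to 3.
averages = {
    "self_reference_avg_sharess": (6409.49, [7785.53, 4183.65, 6666.43]),
    "n_tokens_content": (546.41, [499.43, 1096.93, 395.80]),
    "num_videos": (1.25, [1.67, 2.35, 0.80]),
}

scaled_profile = {
    name: [round(relative_difference(seg, pop), 2) for seg in segs]
    for name, (pop, segs) in averages.items()
}
for name, row in scaled_profile.items():
    print(name, row)
```

Read this way, Segment 2 articles are about twice as long as the average article, while Segment 1 stands out mainly through its much higher absolute title polarity.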

Although there seem to be some differences between the three segments, running the k-means clustering numerous times makes it clear that the segments are not very distinct. In other words, the distances between the segments are not that large. As such, it does not make much sense to segment the URLs based on these factors.
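One way to put a number on how distinct a set of segments is, offered here as an assumption since the source does not say how distinctness was judged, is the share of total variation that lies between cluster centers (between-cluster sum of squares divided by total sum of squares). The Python sketch below uses made-up points and a hypothetical function name.

```python
def between_to_total_ratio(points, labels):
    """Between-cluster sum of squares divided by total sum of squares."""
    dim = len(points[0])
    grand = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    total_ss = sum(sum((p[d] - grand[d]) ** 2 for d in range(dim)) for p in points)
    between_ss = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between_ss += len(members) * sum((center[d] - grand[d]) ** 2 for d in range(dim))
    return between_ss / total_ss

# Two tight, far-apart groups: almost all variation lies between clusters.
ratio_tight = between_to_total_ratio([(0, 0), (0, 1), (10, 10), (10, 11)], [0, 0, 1, 1])

# Interleaved groups along a line: little variation lies between clusters.
ratio_loose = between_to_total_ratio([(0, 0), (1, 1), (2, 2), (3, 3)], [0, 1, 0, 1])

print(round(ratio_tight, 3), round(ratio_loose, 3))
```

A ratio close to 1 indicates well-separated segments, while a low ratio, or memberships that change substantially from run to run, supports the conclusion that the segmentation is weak.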