The Business Decision

This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks, i.e. how popular a given article is. The dataset is publicly available from the University of California Irvine Machine Learning Repository.

Mashable Inc. is a digital media website founded in 2005. It has been described as a “one stop shop” for social media. As of November 2015, it has over 6,000,000 Twitter followers and over 3,200,000 fans on Facebook.

The Data

First we load the data to use (see the raw .Rmd file to change the data file as needed):

The attributes in the dataset are as follows:

url: URL of the article (non-predictive)
timedelta: Days between the article publication and the dataset acquisition (non-predictive)
n_tokens_title: Number of words in the title
n_tokens_content: Number of words in the content
n_unique_tokens: Rate of unique words in the content
n_non_stop_unique_tokens: Rate of unique non-stop words in the content
num_hrefs: Number of links
num_self_hrefs: Number of links to other articles published by Mashable
num_imgs: Number of images
num_videos: Number of videos
average_token_length: Average length of the words in the content
num_keywords: Number of keywords in the metadata
self_reference_min_shares: Min. shares of referenced articles in Mashable
self_reference_max_shares: Max. shares of referenced articles in Mashable
self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
global_subjectivity: Text subjectivity
global_sentiment_polarity: Text sentiment polarity
global_rate_positive_words: Rate of positive words in the content
global_rate_negative_words: Rate of negative words in the content
rate_positive_words: Rate of positive words among non-neutral tokens
rate_negative_words: Rate of negative words among non-neutral tokens
title_subjectivity: Title subjectivity
title_sentiment_polarity: Title polarity
abs_title_subjectivity: Absolute subjectivity level
abs_title_sentiment_polarity: Absolute polarity level
shares: Number of shares (target)

Stop words are usually the most common words in a language. There is no single universal list of stop words used by all natural language processing tools. For some search engines, they are the most common short function words, such as the, is, at, which, and on.

Dimensionality Reduction

Steps 1-2: Check the Data

Here is a sample of the first 250 rows of the Dataset:

The data we use here have the following descriptive statistics:

	min	25 percent	median	mean	75 percent	max	std
n_tokens_title	2.00	9.00	10.00	10.40	12.00	23.00	2.11
n_tokens_content	0.00	246.00	409.00	546.41	716.00	8474.00	471.23
n_unique_tokens	0.00	0.47	0.54	0.53	0.61	1.00	0.14
n_non_stop_unique_tokens	0.00	0.63	0.69	0.67	0.75	1.00	0.15
num_hrefs	0.00	4.00	8.00	10.88	14.00	304.00	11.34
num_self_hrefs	0.00	1.00	3.00	3.28	4.00	116.00	3.83
num_imgs	0.00	1.00	1.00	4.54	4.00	128.00	8.30
num_videos	0.00	0.00	0.00	1.25	1.00	91.00	4.11
average_token_length	0.00	4.48	4.66	4.55	4.85	8.04	0.85
num_keywords	1.00	6.00	7.00	7.22	9.00	10.00	1.91
self_reference_min_shares	0.00	641.00	1200.00	4004.19	2600.00	843300.00	19757.91
self_reference_max_shares	0.00	1100.00	2800.00	10336.39	7900.00	843300.00	41067.32
self_reference_avg_sharess	0.00	983.00	2200.00	6409.49	5200.00	843300.00	24234.80
global_subjectivity	0.00	0.40	0.45	0.44	0.51	1.00	0.12
global_sentiment_polarity	-0.39	0.06	0.12	0.12	0.18	0.73	0.10
global_rate_positive_words	0.00	0.03	0.04	0.04	0.05	0.16	0.02
global_rate_negative_words	0.00	0.01	0.02	0.02	0.02	0.18	0.01
rate_positive_words	0.00	0.60	0.71	0.68	0.80	1.00	0.19
rate_negative_words	0.00	0.19	0.28	0.29	0.38	1.00	0.16
title_subjectivity	0.00	0.00	0.15	0.28	0.50	1.00	0.32
title_sentiment_polarity	-1.00	0.00	0.00	0.07	0.15	1.00	0.27
abs_title_subjectivity	0.00	0.17	0.50	0.34	0.50	0.50	0.19
abs_title_sentiment_polarity	0.00	0.00	0.00	0.16	0.25	1.00	0.23

The data are then scaled so that every attribute has a standard deviation of 1, and the summary statistics are recomputed:

	min	25 percent	median	75 percent	max	std
n_tokens_title	-3.97	-0.66	-0.19	0.76	5.96	1
n_tokens_content	-1.16	-0.64	-0.29	0.36	16.82	1
n_unique_tokens	-3.87	-0.43	0.06	0.57	3.42	1
n_non_stop_unique_tokens	-4.37	-0.30	0.11	0.53	2.12	1
num_hrefs	-0.96	-0.61	-0.25	0.27	25.85	1
num_self_hrefs	-0.86	-0.60	-0.07	0.19	29.39	1
num_imgs	-0.55	-0.43	-0.43	-0.06	14.87	1
num_videos	-0.30	-0.30	-0.30	-0.06	21.83	1
average_token_length	-5.38	-0.08	0.14	0.36	4.13	1
num_keywords	-3.26	-0.64	-0.12	0.93	1.45	1
self_reference_min_shares	-0.20	-0.17	-0.14	-0.07	42.48	1
self_reference_max_shares	-0.25	-0.22	-0.18	-0.06	20.28	1
self_reference_avg_sharess	-0.26	-0.22	-0.17	-0.05	34.53	1
global_subjectivity	-3.80	-0.40	0.09	0.56	4.77	1
global_sentiment_polarity	-5.29	-0.63	0.00	0.60	6.28	1
global_rate_positive_words	-2.27	-0.64	-0.03	0.61	6.65	1
global_rate_negative_words	-1.53	-0.65	-0.12	0.47	15.54	1
rate_positive_words	-3.59	-0.43	0.15	0.62	1.67	1
rate_negative_words	-1.84	-0.66	-0.05	0.62	4.56	1
title_subjectivity	-0.87	-0.87	-0.41	0.67	2.21	1
title_sentiment_polarity	-4.04	-0.27	-0.27	0.30	3.50	1
abs_title_subjectivity	-1.81	-0.93	0.84	0.84	0.84	1
abs_title_sentiment_polarity	-0.69	-0.69	-0.69	0.42	3.73	1
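
The scaling step can be sketched as follows. This is a minimal illustration in Python, assuming the scaling is a z-score (subtract the mean, divide by the sample standard deviation), which is consistent with every scaled column having a standard deviation of 1. The function name `standardize` and the toy values are hypothetical and are not taken from the original .Rmd file.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Toy column (hypothetical values, not rows from the dataset).
n_tokens_title = [9, 10, 12, 8, 11]
scaled = standardize(n_tokens_title)
print([round(z, 2) for z in scaled])
```

As a check against the tables above, (2.00 - 10.40) / 2.11 is about -3.98, in line with the scaled minimum of -3.97 reported for n_tokens_title. The small gap presumably comes from rounding in the printed mean and standard deviation.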

Step 3: Check Correlations

	n_tokens_title	n_tokens_content	n_unique_tokens	n_non_stop_unique_tokens	num_hrefs	num_self_hrefs	num_imgs	num_videos	average_token_length	num_keywords	self_reference_min_shares	self_reference_max_shares	self_reference_avg_sharess	global_subjectivity	global_sentiment_polarity	global_rate_positive_words	global_rate_negative_words	rate_positive_words	rate_negative_words	title_subjectivity	title_sentiment_polarity	abs_title_subjectivity	abs_title_sentiment_polarity
n_tokens_title	1.00	0.02	-0.05	-0.04	-0.05	-0.02	-0.01	0.05	-0.07	-0.01	0.00	0.00	0.00	-0.06	-0.07	-0.07	0.02	-0.07	0.03	0.08	0.00	-0.15	0.04
n_tokens_content	0.02	1.00	-0.40	-0.22	0.42	0.30	0.34	0.10	0.17	0.07	-0.03	0.03	-0.01	0.13	0.02	0.13	0.13	0.10	0.10	0.00	0.02	0.01	0.01
n_unique_tokens	-0.05	-0.40	1.00	0.94	-0.11	-0.05	-0.24	0.02	0.66	-0.08	0.05	0.03	0.05	0.49	0.17	0.29	0.18	0.45	0.20	-0.01	-0.03	0.00	-0.02
n_non_stop_unique_tokens	-0.04	-0.22	0.94	1.00	-0.11	-0.02	-0.29	0.01	0.71	-0.08	0.04	0.03	0.04	0.52	0.17	0.35	0.22	0.49	0.23	-0.03	-0.03	0.01	-0.04
num_hrefs	-0.05	0.42	-0.11	-0.11	1.00	0.40	0.34	0.11	0.22	0.13	0.00	0.08	0.03	0.20	0.09	0.06	0.03	0.10	0.06	0.04	0.04	0.01	0.06
num_self_hrefs	-0.02	0.30	-0.05	-0.02	0.40	1.00	0.23	0.08	0.13	0.10	-0.03	0.13	0.02	0.11	0.09	0.12	0.01	0.14	-0.01	-0.01	0.03	0.01	-0.01
num_imgs	-0.01	0.34	-0.24	-0.29	0.34	0.23	1.00	-0.07	0.03	0.09	0.01	0.03	0.02	0.08	0.02	-0.04	0.03	-0.02	0.04	0.06	0.05	-0.01	0.06
num_videos	0.05	0.10	0.02	0.01	0.11	0.08	-0.07	1.00	0.00	-0.02	0.00	0.08	0.03	0.08	-0.03	0.07	0.18	-0.04	0.07	0.06	0.02	-0.02	0.05
average_token_length	-0.07	0.17	0.66	0.71	0.22	0.13	0.03	0.00	1.00	-0.02	0.03	0.04	0.04	0.60	0.18	0.32	0.23	0.58	0.32	-0.04	-0.02	0.03	-0.04
num_keywords	-0.01	0.07	-0.08	-0.08	0.13	0.10	0.09	-0.02	-0.02	1.00	-0.01	0.01	0.00	0.04	0.08	0.05	-0.04	0.03	-0.07	0.02	0.03	-0.01	0.02
self_reference_min_shares	0.00	-0.03	0.05	0.04	0.00	-0.03	0.01	0.00	0.03	-0.01	1.00	0.48	0.82	0.06	0.01	0.00	0.01	0.02	0.01	0.01	0.00	0.00	0.01
self_reference_max_shares	0.00	0.03	0.03	0.03	0.08	0.13	0.03	0.08	0.04	0.01	0.48	1.00	0.85	0.06	0.01	0.02	0.02	0.03	0.02	0.01	0.00	0.00	0.01
self_reference_avg_sharess	0.00	-0.01	0.05	0.04	0.03	0.02	0.02	0.03	0.04	0.00	0.82	0.85	1.00	0.07	0.01	0.01	0.02	0.03	0.02	0.01	0.00	0.00	0.01
global_subjectivity	-0.06	0.13	0.49	0.52	0.20	0.11	0.08	0.08	0.60	0.04	0.06	0.06	0.07	1.00	0.34	0.47	0.25	0.49	0.13	0.11	0.03	0.00	0.09
global_sentiment_polarity	-0.07	0.02	0.17	0.17	0.09	0.09	0.02	-0.03	0.18	0.08	0.01	0.01	0.01	0.34	1.00	0.57	-0.47	0.73	-0.65	0.02	0.24	-0.03	0.07
global_rate_positive_words	-0.07	0.13	0.29	0.35	0.06	0.12	-0.04	0.07	0.32	0.05	0.00	0.02	0.01	0.47	0.57	1.00	0.11	0.63	-0.33	0.11	0.14	-0.14	0.10
global_rate_negative_words	0.02	0.13	0.18	0.22	0.03	0.01	0.03	0.18	0.23	-0.04	0.01	0.02	0.02	0.25	-0.47	0.11	1.00	-0.40	0.78	0.09	-0.14	-0.06	0.06
rate_positive_words	-0.07	0.10	0.45	0.49	0.10	0.14	-0.02	-0.04	0.58	0.03	0.02	0.03	0.03	0.49	0.73	0.63	-0.40	1.00	-0.53	-0.02	0.14	-0.02	0.00
rate_negative_words	0.03	0.10	0.20	0.23	0.06	-0.01	0.04	0.07	0.32	-0.07	0.01	0.02	0.02	0.13	-0.65	-0.33	0.78	-0.53	1.00	0.00	-0.19	0.04	-0.03
title_subjectivity	0.08	0.00	-0.01	-0.03	0.04	-0.01	0.06	0.06	-0.04	0.02	0.01	0.01	0.01	0.11	0.02	0.11	0.09	-0.02	0.00	1.00	0.23	-0.49	0.71
title_sentiment_polarity	0.00	0.02	-0.03	-0.03	0.04	0.03	0.05	0.02	-0.02	0.03	0.00	0.00	0.00	0.03	0.24	0.14	-0.14	0.14	-0.19	0.23	1.00	-0.24	0.41
abs_title_subjectivity	-0.15	0.01	0.00	0.01	0.01	0.01	-0.01	-0.02	0.03	-0.01	0.00	0.00	0.00	0.00	-0.03	-0.14	-0.06	-0.02	0.04	-0.49	-0.24	1.00	-0.40
abs_title_sentiment_polarity	0.04	0.01	-0.02	-0.04	0.06	-0.01	0.06	0.05	-0.04	0.02	0.01	0.01	0.01	0.09	0.07	0.10	0.06	0.00	-0.03	0.71	0.41	-0.40	1.00
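
A correlation table like the one above can be computed pairwise. The sketch below assumes the entries are Pearson correlation coefficients, which the source does not state explicitly. The helper names `pearson`, `correlation_matrix` and `strong_pairs`, as well as the toy columns, are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(vx * vy)

def correlation_matrix(columns):
    """Map each pair of column names to its Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

def strong_pairs(matrix, threshold=0.8):
    """List distinct pairs whose absolute correlation reaches the threshold."""
    names = list(matrix)
    return [(a, b, matrix[a][b])
            for i, a in enumerate(names) for b in names[i + 1:]
            if abs(matrix[a][b]) >= threshold]

# Toy columns (hypothetical values, not rows from the dataset).
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with "a"
    "c": [2.0, 1.0, 4.0, 1.0, 3.0],
}
matrix = correlation_matrix(toy)
print(strong_pairs(matrix))
```

Applied to the table above with a threshold of 0.8, such a filter would flag highly redundant pairs such as n_unique_tokens and n_non_stop_unique_tokens (0.94).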

Step 4: Choose number of factors

After running the Principal Component Analysis, we look at the eigenvalues and the variance explained to choose the relevant number of factors:

	Eigenvalue	Pct of explained variance	Cumulative pct of explained variance
Component 1	4.02	17.49	17.49
Component 2	2.93	12.76	30.25
Component 3	2.53	11.01	41.27
Component 4	2.37	10.29	51.55
Component 5	2.21	9.63	61.18
Component 6	1.09	4.76	65.94
Component 7	1.00	4.34	70.28
Component 8	0.96	4.19	74.47
Component 9	0.89	3.86	78.34
Component 10	0.78	3.40	81.73
Component 11	0.72	3.14	84.87
Component 12	0.65	2.81	87.68
Component 13	0.61	2.66	90.33
Component 14	0.51	2.21	92.54
Component 15	0.49	2.11	94.65
Component 16	0.38	1.66	96.31
Component 17	0.25	1.09	97.40
Component 18	0.23	0.99	98.39
Component 19	0.20	0.87	99.26
Component 20	0.08	0.35	99.61
Component 21	0.04	0.16	99.77
Component 22	0.03	0.14	99.91
Component 23	0.02	0.09	100.00

Based on the Principal Component Analysis, 6 of the 23 components are retained as factors. Together they account for 65.94% of the total variance.
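
The percentages in the table follow directly from the eigenvalues. The sketch below recomputes them from the rounded eigenvalues printed above and applies one common rule of thumb, keeping components whose eigenvalue exceeds 1. That rule is an assumption made for illustration, since the source does not say which criterion was used. The function names are hypothetical.

```python
# Eigenvalues copied from the table above (rounded to two decimals).
EIGENVALUES = [4.02, 2.93, 2.53, 2.37, 2.21, 1.09, 1.00, 0.96, 0.89, 0.78,
               0.72, 0.65, 0.61, 0.51, 0.49, 0.38, 0.25, 0.23, 0.20, 0.08,
               0.04, 0.03, 0.02]

def explained_variance(eigenvalues):
    """Return (percent per component, cumulative percent)."""
    total = sum(eigenvalues)
    pct = [100 * ev / total for ev in eigenvalues]
    cumulative, running = [], 0.0
    for p in pct:
        running += p
        cumulative.append(running)
    return pct, cumulative

def count_above_one(eigenvalues):
    """Number of components with an eigenvalue strictly greater than 1."""
    return sum(1 for ev in eigenvalues if ev > 1.0)

pct, cumulative = explained_variance(EIGENVALUES)
n_factors = count_above_one(EIGENVALUES)
print(n_factors, round(cumulative[n_factors - 1], 1))
```

Under this rule the count is 6, matching the number of factors chosen. Component 7 is borderline: its printed eigenvalue is 1.00, so whether it passes depends on digits lost to rounding.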

Step 5: Interpret the factors

We check the correlation of each of these six factors with each of the original attributes:

	Comp.1	Comp.2	Comp.3	Comp.4	Comp.5	Comp.6
n_non_stop_unique_tokens	0.90	-0.06	0.02	-0.29	-0.05	-0.02
n_unique_tokens	0.86	-0.05	0.04	-0.36	-0.02	-0.07
average_token_length	0.86	-0.10	0.01	0.20	-0.07	-0.11
global_subjectivity	0.76	0.03	0.05	0.24	0.10	0.04
rate_positive_words	0.63	0.70	0.01	0.12	0.00	-0.05
global_rate_positive_words	0.55	0.46	-0.02	0.15	0.16	0.24
global_sentiment_polarity	0.32	0.83	0.00	0.11	0.09	-0.01
global_rate_negative_words	0.30	-0.78	0.00	0.10	0.07	0.24
rate_negative_words	0.22	-0.94	0.01	0.06	-0.06	0.01
num_hrefs	0.09	-0.02	0.03	0.75	0.01	0.00
num_self_hrefs	0.09	0.06	0.04	0.60	-0.06	0.10
num_videos	0.05	-0.05	0.04	0.10	0.00	0.82
self_reference_avg_sharess	0.03	0.00	0.99	0.01	0.00	0.01
self_reference_min_shares	0.02	-0.01	0.85	-0.04	0.01	-0.05
self_reference_max_shares	0.02	0.00	0.86	0.09	0.00	0.08
title_subjectivity	0.02	-0.07	0.01	0.02	0.85	0.06
abs_title_sentiment_polarity	0.01	-0.02	0.01	0.04	0.87	0.00
title_sentiment_polarity	0.00	0.25	0.00	0.05	0.54	-0.05
abs_title_subjectivity	-0.01	0.01	0.00	0.04	-0.69	-0.12
n_tokens_content	-0.03	-0.04	-0.04	0.77	-0.02	0.17
num_keywords	-0.04	0.09	0.00	0.24	0.05	-0.15
n_tokens_title	-0.10	-0.04	0.00	-0.07	0.11	0.40
num_imgs	-0.12	-0.07	0.02	0.65	0.09	-0.29

To better visualize and interpret the factors we often “suppress” loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers:

	Comp.1	Comp.2	Comp.3	Comp.4	Comp.5	Comp.6
n_non_stop_unique_tokens	0.90
n_unique_tokens	0.86
average_token_length	0.86
global_subjectivity	0.76
rate_positive_words	0.63	0.70
global_rate_positive_words	0.55
global_sentiment_polarity		0.83
global_rate_negative_words		-0.78
rate_negative_words		-0.94
num_hrefs				0.75
num_self_hrefs				0.60
num_videos						0.82
self_reference_avg_sharess			0.99
self_reference_min_shares			0.85
self_reference_max_shares			0.86
title_subjectivity					0.85
abs_title_sentiment_polarity					0.87
title_sentiment_polarity					0.54
abs_title_subjectivity					-0.69
n_tokens_content				0.77
num_keywords
n_tokens_title
num_imgs				0.65
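
Suppressing small loadings is a simple thresholding step. The sketch below uses a few rows copied from the loadings table in this step. The function name `suppress_small_loadings` and the use of `None` for a blanked cell are illustrative choices, not the original implementation.

```python
# A few rows from the loadings table (columns: Comp.1 .. Comp.6).
loadings = {
    "rate_positive_words": [0.63, 0.70, 0.01, 0.12, 0.00, -0.05],
    "rate_negative_words": [0.22, -0.94, 0.01, 0.06, -0.06, 0.01],
    "num_keywords":        [-0.04, 0.09, 0.00, 0.24, 0.05, -0.15],
    "num_videos":          [0.05, -0.05, 0.04, 0.10, 0.00, 0.82],
}

def suppress_small_loadings(table, threshold=0.5):
    """Blank out (set to None) loadings whose absolute value is below threshold."""
    return {
        name: [value if abs(value) >= threshold else None for value in row]
        for name, row in table.items()
    }

suppressed = suppress_small_loadings(loadings)
for name, row in suppressed.items():
    # Print blanks for suppressed cells so the layout mirrors the table.
    print(name, ["" if v is None else v for v in row])
```

With the 0.5 threshold, num_keywords has no remaining loadings, which matches its empty row in the suppressed table.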

Step 6: Save factor scores

We can now represent each selected factor either by its factor scores or by one of the initial variables, chosen to stand for that factor. Here are the factor scores for the first ten observations:

	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10
DV (Factor) 1	0.15	0.28	0.17	0.40	-0.14	-0.36	0.29	0.32	-0.04	0.12
DV (Factor) 2	0.67	-1.90	0.28	-0.18	0.10	0.40	1.10	-0.74	1.74	0.40
DV (Factor) 3	-0.12	-0.29	-0.18	-0.23	-0.21	-0.22	-0.10	-0.13	-0.26	0.56
DV (Factor) 4	-0.77	-0.75	-0.61	-0.70	-0.15	-0.01	2.71	0.67	-0.62	-0.16
DV (Factor) 5	-0.89	0.49	-0.84	-0.90	-0.09	-0.87	-0.95	-0.37	1.25	0.47
DV (Factor) 6	0.18	-0.22	-1.02	0.27	-0.30	-1.18	-0.32	-0.78	-0.04	0.80

Each factor is represented by the following initial variable:

DV (Factor) 1: Rate of unique non-stop words in the content

DV (Factor) 2: Rate of negative (or positive) words in the content

DV (Factor) 3: Avg. shares of referenced articles in Mashable

DV (Factor) 4: Number of words in the content

DV (Factor) 5: Absolute polarity level in title

DV (Factor) 6: Number of videos in the article
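
The six representative variables listed above coincide with the attribute that has the largest absolute loading on each component. The sketch below shows that selection rule using the loadings from Step 5. Treating it as the rule actually used is an assumption, and the function name `representative_variables` is hypothetical.

```python
# Loadings copied from the Step 5 table (columns: Comp.1 .. Comp.6).
LOADINGS = {
    "n_non_stop_unique_tokens":     [0.90, -0.06, 0.02, -0.29, -0.05, -0.02],
    "n_unique_tokens":              [0.86, -0.05, 0.04, -0.36, -0.02, -0.07],
    "average_token_length":         [0.86, -0.10, 0.01, 0.20, -0.07, -0.11],
    "global_subjectivity":          [0.76, 0.03, 0.05, 0.24, 0.10, 0.04],
    "rate_positive_words":          [0.63, 0.70, 0.01, 0.12, 0.00, -0.05],
    "global_rate_positive_words":   [0.55, 0.46, -0.02, 0.15, 0.16, 0.24],
    "global_sentiment_polarity":    [0.32, 0.83, 0.00, 0.11, 0.09, -0.01],
    "global_rate_negative_words":   [0.30, -0.78, 0.00, 0.10, 0.07, 0.24],
    "rate_negative_words":          [0.22, -0.94, 0.01, 0.06, -0.06, 0.01],
    "num_hrefs":                    [0.09, -0.02, 0.03, 0.75, 0.01, 0.00],
    "num_self_hrefs":               [0.09, 0.06, 0.04, 0.60, -0.06, 0.10],
    "num_videos":                   [0.05, -0.05, 0.04, 0.10, 0.00, 0.82],
    "self_reference_avg_sharess":   [0.03, 0.00, 0.99, 0.01, 0.00, 0.01],
    "self_reference_min_shares":    [0.02, -0.01, 0.85, -0.04, 0.01, -0.05],
    "self_reference_max_shares":    [0.02, 0.00, 0.86, 0.09, 0.00, 0.08],
    "title_subjectivity":           [0.02, -0.07, 0.01, 0.02, 0.85, 0.06],
    "abs_title_sentiment_polarity": [0.01, -0.02, 0.01, 0.04, 0.87, 0.00],
    "title_sentiment_polarity":     [0.00, 0.25, 0.00, 0.05, 0.54, -0.05],
    "abs_title_subjectivity":       [-0.01, 0.01, 0.00, 0.04, -0.69, -0.12],
    "n_tokens_content":             [-0.03, -0.04, -0.04, 0.77, -0.02, 0.17],
    "num_keywords":                 [-0.04, 0.09, 0.00, 0.24, 0.05, -0.15],
    "n_tokens_title":               [-0.10, -0.04, 0.00, -0.07, 0.11, 0.40],
    "num_imgs":                     [-0.12, -0.07, 0.02, 0.65, 0.09, -0.29],
}

def representative_variables(loadings):
    """For each component, return the attribute with the largest absolute loading."""
    n_components = len(next(iter(loadings.values())))
    picks = []
    for j in range(n_components):
        best = max(loadings, key=lambda name: abs(loadings[name][j]))
        picks.append(best)
    return picks

for j, name in enumerate(representative_variables(LOADINGS), start=1):
    print(f"DV (Factor) {j}: {name}")
```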

By focusing on these six factors, Mashable should be able to better predict whether an article will be shared on social media. Moreover, Mashable can potentially increase the number of shares for each article by setting the value of each of these attributes such that it maximizes the chance that a reader will share that article.

Cluster Analysis and Segmentation

The Data

There are a total of 39,565 URLs in the data. Here are the values for the first 10 URLs of the six variables we chose to represent the factors in the Dimensionality Reduction stage:

n_non_stop_unique_tokens	rate_negative_words	self_reference_avg_sharess	n_tokens_content	abs_title_sentiment_polarity	num_videos
0.73	0.20	3100.00	288	0.00	0
0.78	0.57	0.00	414	0.20	0
0.79	0.25	727.00	134	0.00	0
0.77	0.32	951.00	281	0.00	0
0.66	0.26	1300.00	499	0.00	0
0.59	0.22	0.00	268	0.00	0
0.55	0.19	3151.16	925	0.00	0
0.60	0.40	2700.00	261	0.10	0
0.70	0.00	0.00	306	0.33	0
0.67	0.25	20900.00	909	0.10	1

Summary Statistics

	25 percent	median	mean	75 percent	max	std
n_non_stop_unique_tokens	0.63	0.69	0.67	0.75	1	0.15
rate_negative_words	0.19	0.28	0.29	0.38	1	0.16
self_reference_avg_sharess	983.00	2200.00	6409.49	5200.00	843300	24234.80
n_tokens_content	246.00	409.00	546.41	716.00	8474	471.23
abs_title_sentiment_polarity	0.00	0.00	0.16	0.25	1	0.23
num_videos	0.00	0.00	1.25	1.00	91	4.11

Scaled Summary Statistics

	min	25 percent	median	75 percent	max	std
n_non_stop_unique_tokens	-4.37	-0.30	0.11	0.53	2.12	1
rate_negative_words	-1.84	-0.66	-0.05	0.62	4.56	1
self_reference_avg_sharess	-0.26	-0.22	-0.17	-0.05	34.53	1
n_tokens_content	-1.16	-0.64	-0.29	0.36	16.82	1
abs_title_sentiment_polarity	-0.69	-0.69	-0.69	0.42	3.73	1
num_videos	-0.30	-0.30	-0.30	-0.06	21.83	1

Using Kmeans Clustering

We use k-means clustering (Lloyd's algorithm) to look for 3 clusters. Here is the cluster membership for the first 10 URLs:

Observation Number	Cluster_Membership
1	3
2	3
3	3
4	3
5	3
6	3
7	2
8	3
9	1
10	3
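
For readers unfamiliar with Lloyd's algorithm, the sketch below shows its two alternating steps (assign each point to the nearest center, then move each center to the mean of its points) in plain Python. It is an illustration on hypothetical toy points, not the code that produced the memberships above. The function names, the random initialization and the `seed` parameter are assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centers); labels are 0-based."""
    rng = random.Random(seed)
    # Initialize centers with k distinct points drawn at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy 2-D points (hypothetical; the report clusters six scaled variables).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
          (9.0, 0.2), (9.1, 0.0), (8.8, 0.1)]
labels, centers = lloyd_kmeans(points, k=3)
print(labels)
```

Because the result depends on the initial centers, different seeds can give different segmentations, so repeated runs are a useful check of how stable the clusters are.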

Interpreting the Segments

We compare the average responses for each segment with the population average:

	Population	Segment 1	Segment 2	Segment 3
n_non_stop_unique_tokens	0.67	0.69	0.48	0.72
rate_negative_words	0.29	0.29	0.24	0.30
self_reference_avg_sharess	6409.49	7785.53	4183.65	6666.43
n_tokens_content	546.41	499.43	1096.93	395.80
abs_title_sentiment_polarity	0.16	0.54	0.11	0.06
num_videos	1.25	1.67	2.35	0.80

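To get a better picture of the magnitude of the differences, the segment averages are rescaled relative to the population average (the values below are consistent with segment average / population average - 1):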

	Segment 1	Segment 2	Segment 3
n_non_stop_unique_tokens	0.03	-0.28	0.08
rate_negative_words	0.00	-0.16	0.05
self_reference_avg_sharess	0.21	-0.35	0.04
n_tokens_content	-0.09	1.01	-0.28
abs_title_sentiment_polarity	2.49	-0.33	-0.64
num_videos	0.34	0.88	-0.36
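
The scaled values can be approximately reproduced from the averages in the previous table. The sketch below assumes the scaling is the relative deviation from the population average (segment average / population average - 1). That formula is inferred from the numbers rather than stated in the source, and the function name is hypothetical.

```python
# Averages copied from the "Population / Segment" table above.
POPULATION = {
    "self_reference_avg_sharess": 6409.49,
    "n_tokens_content": 546.41,
    "num_videos": 1.25,
}
SEGMENTS = {
    "Segment 1": {"self_reference_avg_sharess": 7785.53,
                  "n_tokens_content": 499.43, "num_videos": 1.67},
    "Segment 2": {"self_reference_avg_sharess": 4183.65,
                  "n_tokens_content": 1096.93, "num_videos": 2.35},
    "Segment 3": {"self_reference_avg_sharess": 6666.43,
                  "n_tokens_content": 395.80, "num_videos": 0.80},
}

def relative_deviation(segment_mean, population_mean):
    """Relative difference of a segment average from the population average."""
    return segment_mean / population_mean - 1

scaled_segments = {
    segment: {name: relative_deviation(value, POPULATION[name])
              for name, value in means.items()}
    for segment, means in SEGMENTS.items()
}
for segment, values in scaled_segments.items():
    print(segment, {name: round(v, 2) for name, v in values.items()})
```

For attributes whose averages are small and printed with only two decimals (for example abs_title_sentiment_polarity), recomputing from the rounded table gives slightly different values than those reported, presumably because the report used unrounded averages.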

Although there seem to be some differences between the three segments, running the k-means clustering numerous times makes it clear that the segments are not very distinct. In other words, the distances between the segments are not that large. As such, it does not make much sense to segment the URLs based on these factors.

Mashable.com - A Process to Predict Online News Popularity

Abhinandan Saini
