This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks, i.e. how popular any given article is. The dataset is publicly available at the University of California Irvine Machine Learning Repository.
Mashable Inc. is a digital media website founded in 2005. It has been described as a “one stop shop” for social media. As of November 2015, it has over 6,000,000 Twitter followers and over 3,200,000 fans on Facebook.
First we load the data to use (see the raw .Rmd file to change the data file as needed):
The attribute information for the dataset is as follows:
Stop words usually refer to the most common words in a language. There is no single universal list of stop words used by all natural language processing tools. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on.
Here is a sample of the first 250 rows of the Dataset:
The data we use here have the following descriptive statistics:
Variable | min | 25 percent | median | mean | 75 percent | max | std |
---|---|---|---|---|---|---|---|
n_tokens_title | 2.00 | 9.00 | 10.00 | 10.40 | 12.00 | 23.00 | 2.11 |
n_tokens_content | 0.00 | 246.00 | 409.00 | 546.41 | 716.00 | 8474.00 | 471.23 |
n_unique_tokens | 0.00 | 0.47 | 0.54 | 0.53 | 0.61 | 1.00 | 0.14 |
n_non_stop_unique_tokens | 0.00 | 0.63 | 0.69 | 0.67 | 0.75 | 1.00 | 0.15 |
num_hrefs | 0.00 | 4.00 | 8.00 | 10.88 | 14.00 | 304.00 | 11.34 |
num_self_hrefs | 0.00 | 1.00 | 3.00 | 3.28 | 4.00 | 116.00 | 3.83 |
num_imgs | 0.00 | 1.00 | 1.00 | 4.54 | 4.00 | 128.00 | 8.30 |
num_videos | 0.00 | 0.00 | 0.00 | 1.25 | 1.00 | 91.00 | 4.11 |
average_token_length | 0.00 | 4.48 | 4.66 | 4.55 | 4.85 | 8.04 | 0.85 |
num_keywords | 1.00 | 6.00 | 7.00 | 7.22 | 9.00 | 10.00 | 1.91 |
self_reference_min_shares | 0.00 | 641.00 | 1200.00 | 4004.19 | 2600.00 | 843300.00 | 19757.91 |
self_reference_max_shares | 0.00 | 1100.00 | 2800.00 | 10336.39 | 7900.00 | 843300.00 | 41067.32 |
self_reference_avg_sharess | 0.00 | 983.00 | 2200.00 | 6409.49 | 5200.00 | 843300.00 | 24234.80 |
global_subjectivity | 0.00 | 0.40 | 0.45 | 0.44 | 0.51 | 1.00 | 0.12 |
global_sentiment_polarity | -0.39 | 0.06 | 0.12 | 0.12 | 0.18 | 0.73 | 0.10 |
global_rate_positive_words | 0.00 | 0.03 | 0.04 | 0.04 | 0.05 | 0.16 | 0.02 |
global_rate_negative_words | 0.00 | 0.01 | 0.02 | 0.02 | 0.02 | 0.18 | 0.01 |
rate_positive_words | 0.00 | 0.60 | 0.71 | 0.68 | 0.80 | 1.00 | 0.19 |
rate_negative_words | 0.00 | 0.19 | 0.28 | 0.29 | 0.38 | 1.00 | 0.16 |
title_subjectivity | 0.00 | 0.00 | 0.15 | 0.28 | 0.50 | 1.00 | 0.32 |
title_sentiment_polarity | -1.00 | 0.00 | 0.00 | 0.07 | 0.15 | 1.00 | 0.27 |
abs_title_subjectivity | 0.00 | 0.17 | 0.50 | 0.34 | 0.50 | 0.50 | 0.19 |
abs_title_sentiment_polarity | 0.00 | 0.00 | 0.00 | 0.16 | 0.25 | 1.00 | 0.23 |
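The columns of the table above are standard summary functions. The sketch below is a minimal standard-library Python illustration on made-up values, not the code behind this report (which is generated from an .Rmd file). `describe` is a hypothetical helper, and the `inclusive` quartile method is an assumption, chosen because it should correspond to the default quantile definition in R.

```python
import statistics


def describe(values):
    """Return the same seven summary columns used in the table."""
    # Quartiles with linear interpolation between order statistics.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "25 percent": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "75 percent": q3,
        "max": max(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


# Made-up values, for illustration only.
summary = describe([1, 2, 3, 4, 5])
```

Applying `describe` to each attribute column in turn would give one table row per attribute.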
The data are then scaled (each attribute is standardized to mean 0 and standard deviation 1), and the summary statistics are recomputed:
Variable | min | 25 percent | median | mean | 75 percent | max | std |
---|---|---|---|---|---|---|---|
n_tokens_title | -3.97 | -0.66 | -0.19 | 0 | 0.76 | 5.96 | 1 |
n_tokens_content | -1.16 | -0.64 | -0.29 | 0 | 0.36 | 16.82 | 1 |
n_unique_tokens | -3.87 | -0.43 | 0.06 | 0 | 0.57 | 3.42 | 1 |
n_non_stop_unique_tokens | -4.37 | -0.30 | 0.11 | 0 | 0.53 | 2.12 | 1 |
num_hrefs | -0.96 | -0.61 | -0.25 | 0 | 0.27 | 25.85 | 1 |
num_self_hrefs | -0.86 | -0.60 | -0.07 | 0 | 0.19 | 29.39 | 1 |
num_imgs | -0.55 | -0.43 | -0.43 | 0 | -0.06 | 14.87 | 1 |
num_videos | -0.30 | -0.30 | -0.30 | 0 | -0.06 | 21.83 | 1 |
average_token_length | -5.38 | -0.08 | 0.14 | 0 | 0.36 | 4.13 | 1 |
num_keywords | -3.26 | -0.64 | -0.12 | 0 | 0.93 | 1.45 | 1 |
self_reference_min_shares | -0.20 | -0.17 | -0.14 | 0 | -0.07 | 42.48 | 1 |
self_reference_max_shares | -0.25 | -0.22 | -0.18 | 0 | -0.06 | 20.28 | 1 |
self_reference_avg_sharess | -0.26 | -0.22 | -0.17 | 0 | -0.05 | 34.53 | 1 |
global_subjectivity | -3.80 | -0.40 | 0.09 | 0 | 0.56 | 4.77 | 1 |
global_sentiment_polarity | -5.29 | -0.63 | 0.00 | 0 | 0.60 | 6.28 | 1 |
global_rate_positive_words | -2.27 | -0.64 | -0.03 | 0 | 0.61 | 6.65 | 1 |
global_rate_negative_words | -1.53 | -0.65 | -0.12 | 0 | 0.47 | 15.54 | 1 |
rate_positive_words | -3.59 | -0.43 | 0.15 | 0 | 0.62 | 1.67 | 1 |
rate_negative_words | -1.84 | -0.66 | -0.05 | 0 | 0.62 | 4.56 | 1 |
title_subjectivity | -0.87 | -0.87 | -0.41 | 0 | 0.67 | 2.21 | 1 |
title_sentiment_polarity | -4.04 | -0.27 | -0.27 | 0 | 0.30 | 3.50 | 1 |
abs_title_subjectivity | -1.81 | -0.93 | 0.84 | 0 | 0.84 | 0.84 | 1 |
abs_title_sentiment_polarity | -0.69 | -0.69 | -0.69 | 0 | 0.42 | 3.73 | 1 |
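Every attribute in the scaled table has mean 0 and standard deviation 1, which is what z-score standardization produces. The sketch below assumes that this is the scaling used; `standardize` is a hypothetical helper, and the sample values are made up.

```python
import statistics


def standardize(values):
    """Return z-scores: subtract the mean, divide by the sample std."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]


# Made-up title lengths, for illustration only.
title_lengths = [8, 9, 10, 12, 13]
scaled = standardize(title_lengths)

# Worked check against the tables: n_tokens_title has minimum 2.00,
# mean 10.40 and std 2.11, so its scaled minimum is about -3.98.
# The table prints -3.97, presumably because it used unrounded inputs.
scaled_min = (2.00 - 10.40) / 2.11
```

Standardizing first keeps attributes with large ranges, such as the self-reference share counts, from dominating the variance in the later steps.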
The correlation matrix of these attributes is:
Variable | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | num_videos | average_token_length | num_keywords | self_reference_min_shares | self_reference_max_shares | self_reference_avg_sharess | global_subjectivity | global_sentiment_polarity | global_rate_positive_words | global_rate_negative_words | rate_positive_words | rate_negative_words | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
n_tokens_title | 1.00 | 0.02 | -0.05 | -0.04 | -0.05 | -0.02 | -0.01 | 0.05 | -0.07 | -0.01 | 0.00 | 0.00 | 0.00 | -0.06 | -0.07 | -0.07 | 0.02 | -0.07 | 0.03 | 0.08 | 0.00 | -0.15 | 0.04 |
n_tokens_content | 0.02 | 1.00 | -0.40 | -0.22 | 0.42 | 0.30 | 0.34 | 0.10 | 0.17 | 0.07 | -0.03 | 0.03 | -0.01 | 0.13 | 0.02 | 0.13 | 0.13 | 0.10 | 0.10 | 0.00 | 0.02 | 0.01 | 0.01 |
n_unique_tokens | -0.05 | -0.40 | 1.00 | 0.94 | -0.11 | -0.05 | -0.24 | 0.02 | 0.66 | -0.08 | 0.05 | 0.03 | 0.05 | 0.49 | 0.17 | 0.29 | 0.18 | 0.45 | 0.20 | -0.01 | -0.03 | 0.00 | -0.02 |
n_non_stop_unique_tokens | -0.04 | -0.22 | 0.94 | 1.00 | -0.11 | -0.02 | -0.29 | 0.01 | 0.71 | -0.08 | 0.04 | 0.03 | 0.04 | 0.52 | 0.17 | 0.35 | 0.22 | 0.49 | 0.23 | -0.03 | -0.03 | 0.01 | -0.04 |
num_hrefs | -0.05 | 0.42 | -0.11 | -0.11 | 1.00 | 0.40 | 0.34 | 0.11 | 0.22 | 0.13 | 0.00 | 0.08 | 0.03 | 0.20 | 0.09 | 0.06 | 0.03 | 0.10 | 0.06 | 0.04 | 0.04 | 0.01 | 0.06 |
num_self_hrefs | -0.02 | 0.30 | -0.05 | -0.02 | 0.40 | 1.00 | 0.23 | 0.08 | 0.13 | 0.10 | -0.03 | 0.13 | 0.02 | 0.11 | 0.09 | 0.12 | 0.01 | 0.14 | -0.01 | -0.01 | 0.03 | 0.01 | -0.01 |
num_imgs | -0.01 | 0.34 | -0.24 | -0.29 | 0.34 | 0.23 | 1.00 | -0.07 | 0.03 | 0.09 | 0.01 | 0.03 | 0.02 | 0.08 | 0.02 | -0.04 | 0.03 | -0.02 | 0.04 | 0.06 | 0.05 | -0.01 | 0.06 |
num_videos | 0.05 | 0.10 | 0.02 | 0.01 | 0.11 | 0.08 | -0.07 | 1.00 | 0.00 | -0.02 | 0.00 | 0.08 | 0.03 | 0.08 | -0.03 | 0.07 | 0.18 | -0.04 | 0.07 | 0.06 | 0.02 | -0.02 | 0.05 |
average_token_length | -0.07 | 0.17 | 0.66 | 0.71 | 0.22 | 0.13 | 0.03 | 0.00 | 1.00 | -0.02 | 0.03 | 0.04 | 0.04 | 0.60 | 0.18 | 0.32 | 0.23 | 0.58 | 0.32 | -0.04 | -0.02 | 0.03 | -0.04 |
num_keywords | -0.01 | 0.07 | -0.08 | -0.08 | 0.13 | 0.10 | 0.09 | -0.02 | -0.02 | 1.00 | -0.01 | 0.01 | 0.00 | 0.04 | 0.08 | 0.05 | -0.04 | 0.03 | -0.07 | 0.02 | 0.03 | -0.01 | 0.02 |
self_reference_min_shares | 0.00 | -0.03 | 0.05 | 0.04 | 0.00 | -0.03 | 0.01 | 0.00 | 0.03 | -0.01 | 1.00 | 0.48 | 0.82 | 0.06 | 0.01 | 0.00 | 0.01 | 0.02 | 0.01 | 0.01 | 0.00 | 0.00 | 0.01 |
self_reference_max_shares | 0.00 | 0.03 | 0.03 | 0.03 | 0.08 | 0.13 | 0.03 | 0.08 | 0.04 | 0.01 | 0.48 | 1.00 | 0.85 | 0.06 | 0.01 | 0.02 | 0.02 | 0.03 | 0.02 | 0.01 | 0.00 | 0.00 | 0.01 |
self_reference_avg_sharess | 0.00 | -0.01 | 0.05 | 0.04 | 0.03 | 0.02 | 0.02 | 0.03 | 0.04 | 0.00 | 0.82 | 0.85 | 1.00 | 0.07 | 0.01 | 0.01 | 0.02 | 0.03 | 0.02 | 0.01 | 0.00 | 0.00 | 0.01 |
global_subjectivity | -0.06 | 0.13 | 0.49 | 0.52 | 0.20 | 0.11 | 0.08 | 0.08 | 0.60 | 0.04 | 0.06 | 0.06 | 0.07 | 1.00 | 0.34 | 0.47 | 0.25 | 0.49 | 0.13 | 0.11 | 0.03 | 0.00 | 0.09 |
global_sentiment_polarity | -0.07 | 0.02 | 0.17 | 0.17 | 0.09 | 0.09 | 0.02 | -0.03 | 0.18 | 0.08 | 0.01 | 0.01 | 0.01 | 0.34 | 1.00 | 0.57 | -0.47 | 0.73 | -0.65 | 0.02 | 0.24 | -0.03 | 0.07 |
global_rate_positive_words | -0.07 | 0.13 | 0.29 | 0.35 | 0.06 | 0.12 | -0.04 | 0.07 | 0.32 | 0.05 | 0.00 | 0.02 | 0.01 | 0.47 | 0.57 | 1.00 | 0.11 | 0.63 | -0.33 | 0.11 | 0.14 | -0.14 | 0.10 |
global_rate_negative_words | 0.02 | 0.13 | 0.18 | 0.22 | 0.03 | 0.01 | 0.03 | 0.18 | 0.23 | -0.04 | 0.01 | 0.02 | 0.02 | 0.25 | -0.47 | 0.11 | 1.00 | -0.40 | 0.78 | 0.09 | -0.14 | -0.06 | 0.06 |
rate_positive_words | -0.07 | 0.10 | 0.45 | 0.49 | 0.10 | 0.14 | -0.02 | -0.04 | 0.58 | 0.03 | 0.02 | 0.03 | 0.03 | 0.49 | 0.73 | 0.63 | -0.40 | 1.00 | -0.53 | -0.02 | 0.14 | -0.02 | 0.00 |
rate_negative_words | 0.03 | 0.10 | 0.20 | 0.23 | 0.06 | -0.01 | 0.04 | 0.07 | 0.32 | -0.07 | 0.01 | 0.02 | 0.02 | 0.13 | -0.65 | -0.33 | 0.78 | -0.53 | 1.00 | 0.00 | -0.19 | 0.04 | -0.03 |
title_subjectivity | 0.08 | 0.00 | -0.01 | -0.03 | 0.04 | -0.01 | 0.06 | 0.06 | -0.04 | 0.02 | 0.01 | 0.01 | 0.01 | 0.11 | 0.02 | 0.11 | 0.09 | -0.02 | 0.00 | 1.00 | 0.23 | -0.49 | 0.71 |
title_sentiment_polarity | 0.00 | 0.02 | -0.03 | -0.03 | 0.04 | 0.03 | 0.05 | 0.02 | -0.02 | 0.03 | 0.00 | 0.00 | 0.00 | 0.03 | 0.24 | 0.14 | -0.14 | 0.14 | -0.19 | 0.23 | 1.00 | -0.24 | 0.41 |
abs_title_subjectivity | -0.15 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | -0.01 | -0.02 | 0.03 | -0.01 | 0.00 | 0.00 | 0.00 | 0.00 | -0.03 | -0.14 | -0.06 | -0.02 | 0.04 | -0.49 | -0.24 | 1.00 | -0.40 |
abs_title_sentiment_polarity | 0.04 | 0.01 | -0.02 | -0.04 | 0.06 | -0.01 | 0.06 | 0.05 | -0.04 | 0.02 | 0.01 | 0.01 | 0.01 | 0.09 | 0.07 | 0.10 | 0.06 | 0.00 | -0.03 | 0.71 | 0.41 | -0.40 | 1.00 |
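The table above reads as a matrix of pairwise correlations: symmetric, with 1.00 on the diagonal. The sketch below assumes the usual Pearson coefficient and shows how a single entry would be computed from two columns; `pearson` is a hypothetical helper and the inputs are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Co-variation of the two centered series.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Spread of each centered series.
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up columns: one perfectly increasing pair, one perfectly decreasing.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```

Highly correlated pairs in the table, such as n_unique_tokens and n_non_stop_unique_tokens at 0.94, carry overlapping information, which is what the dimensionality reduction exploits.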
After running the Principal Component Analysis, we look at the variance explained as well as the eigenvalues to choose the relevant number of factors:
Component | Eigenvalue | Pct of explained variance | Cumulative pct of explained variance |
---|---|---|---|
Component 1 | 4.02 | 17.49 | 17.49 |
Component 2 | 2.93 | 12.76 | 30.25 |
Component 3 | 2.53 | 11.01 | 41.27 |
Component 4 | 2.37 | 10.29 | 51.55 |
Component 5 | 2.21 | 9.63 | 61.18 |
Component 6 | 1.09 | 4.76 | 65.94 |
Component 7 | 1.00 | 4.34 | 70.28 |
Component 8 | 0.96 | 4.19 | 74.47 |
Component 9 | 0.89 | 3.86 | 78.34 |
Component 10 | 0.78 | 3.40 | 81.73 |
Component 11 | 0.72 | 3.14 | 84.87 |
Component 12 | 0.65 | 2.81 | 87.68 |
Component 13 | 0.61 | 2.66 | 90.33 |
Component 14 | 0.51 | 2.21 | 92.54 |
Component 15 | 0.49 | 2.11 | 94.65 |
Component 16 | 0.38 | 1.66 | 96.31 |
Component 17 | 0.25 | 1.09 | 97.40 |
Component 18 | 0.23 | 0.99 | 98.39 |
Component 19 | 0.20 | 0.87 | 99.26 |
Component 20 | 0.08 | 0.35 | 99.61 |
Component 21 | 0.04 | 0.16 | 99.77 |
Component 22 | 0.03 | 0.14 | 99.91 |
Component 23 | 0.02 | 0.09 | 100.00 |
Based on the Principal Component Analysis, 6 factors out of the 23 are chosen.
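For standardized data the eigenvalues sum to the number of attributes (23 here), so each component's share of variance is its eigenvalue divided by 23. The text does not state which rule led to six components. One common convention, assumed in the sketch below, is to keep components whose eigenvalue exceeds 1. The eigenvalues are copied from the table as printed, so results can differ from the table in the second decimal.

```python
# Eigenvalues as printed in the table (rounded to two decimals).
eigenvalues = [
    4.02, 2.93, 2.53, 2.37, 2.21, 1.09, 1.00, 0.96, 0.89, 0.78, 0.72, 0.65,
    0.61, 0.51, 0.49, 0.38, 0.25, 0.23, 0.20, 0.08, 0.04, 0.03, 0.02,
]

n_variables = len(eigenvalues)  # 23 standardized attributes

# Share of total variance explained by each component, in percent.
pct = [100 * ev / n_variables for ev in eigenvalues]

# Running total of the shares.
cumulative = []
running = 0.0
for share in pct:
    running += share
    cumulative.append(running)

# Assumed selection rule: keep components with eigenvalue strictly above 1.
n_selected = sum(1 for ev in eigenvalues if ev > 1.0)
```

With the printed values this rule keeps six components, which together explain roughly 66 percent of the variance. Component 7 is printed as 1.00 and therefore sits right at the boundary.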
We check the correlation of each of these six factors with each of the original attributes.
Variable | Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 |
---|---|---|---|---|---|---|
n_non_stop_unique_tokens | 0.90 | -0.06 | 0.02 | -0.29 | -0.05 | -0.02 |
n_unique_tokens | 0.86 | -0.05 | 0.04 | -0.36 | -0.02 | -0.07 |
average_token_length | 0.86 | -0.10 | 0.01 | 0.20 | -0.07 | -0.11 |
global_subjectivity | 0.76 | 0.03 | 0.05 | 0.24 | 0.10 | 0.04 |
rate_positive_words | 0.63 | 0.70 | 0.01 | 0.12 | 0.00 | -0.05 |
global_rate_positive_words | 0.55 | 0.46 | -0.02 | 0.15 | 0.16 | 0.24 |
global_sentiment_polarity | 0.32 | 0.83 | 0.00 | 0.11 | 0.09 | -0.01 |
global_rate_negative_words | 0.30 | -0.78 | 0.00 | 0.10 | 0.07 | 0.24 |
rate_negative_words | 0.22 | -0.94 | 0.01 | 0.06 | -0.06 | 0.01 |
num_hrefs | 0.09 | -0.02 | 0.03 | 0.75 | 0.01 | 0.00 |
num_self_hrefs | 0.09 | 0.06 | 0.04 | 0.60 | -0.06 | 0.10 |
num_videos | 0.05 | -0.05 | 0.04 | 0.10 | 0.00 | 0.82 |
self_reference_avg_sharess | 0.03 | 0.00 | 0.99 | 0.01 | 0.00 | 0.01 |
self_reference_min_shares | 0.02 | -0.01 | 0.85 | -0.04 | 0.01 | -0.05 |
self_reference_max_shares | 0.02 | 0.00 | 0.86 | 0.09 | 0.00 | 0.08 |
title_subjectivity | 0.02 | -0.07 | 0.01 | 0.02 | 0.85 | 0.06 |
abs_title_sentiment_polarity | 0.01 | -0.02 | 0.01 | 0.04 | 0.87 | 0.00 |
title_sentiment_polarity | 0.00 | 0.25 | 0.00 | 0.05 | 0.54 | -0.05 |
abs_title_subjectivity | -0.01 | 0.01 | 0.00 | 0.04 | -0.69 | -0.12 |
n_tokens_content | -0.03 | -0.04 | -0.04 | 0.77 | -0.02 | 0.17 |
num_keywords | -0.04 | 0.09 | 0.00 | 0.24 | 0.05 | -0.15 |
n_tokens_title | -0.10 | -0.04 | 0.00 | -0.07 | 0.11 | 0.40 |
num_imgs | -0.12 | -0.07 | 0.02 | 0.65 | 0.09 | -0.29 |
To better visualize and interpret the factors we often “suppress” loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers:
Variable | Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 |
---|---|---|---|---|---|---|
n_non_stop_unique_tokens | 0.90 | | | | | |
n_unique_tokens | 0.86 | | | | | |
average_token_length | 0.86 | | | | | |
global_subjectivity | 0.76 | | | | | |
rate_positive_words | 0.63 | 0.70 | | | | |
global_rate_positive_words | 0.55 | | | | | |
global_sentiment_polarity | | 0.83 | | | | |
global_rate_negative_words | | -0.78 | | | | |
rate_negative_words | | -0.94 | | | | |
num_hrefs | | | | 0.75 | | |
num_self_hrefs | | | | 0.60 | | |
num_videos | | | | | | 0.82 |
self_reference_avg_sharess | | | 0.99 | | | |
self_reference_min_shares | | | 0.85 | | | |
self_reference_max_shares | | | 0.86 | | | |
title_subjectivity | | | | | 0.85 | |
abs_title_sentiment_polarity | | | | | 0.87 | |
title_sentiment_polarity | | | | | 0.54 | |
abs_title_subjectivity | | | | | -0.69 | |
n_tokens_content | | | | 0.77 | | |
num_keywords | | | | | | |
n_tokens_title | | | | | | |
num_imgs | | | | 0.65 | | |
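Suppressing small loadings is a simple thresholding step. The sketch below blanks every loading whose absolute value is below 0.5, using three rows copied from the loadings table; `suppress_small` is a hypothetical helper and `None` stands for an empty cell.

```python
def suppress_small(loadings, threshold=0.5):
    """Replace loadings with absolute value below threshold by None."""
    return {
        name: [value if abs(value) >= threshold else None for value in row]
        for name, row in loadings.items()
    }


# Three rows from the loadings table (Comp.1 to Comp.6).
loadings = {
    "rate_positive_words": [0.63, 0.70, 0.01, 0.12, 0.00, -0.05],
    "rate_negative_words": [0.22, -0.94, 0.01, 0.06, -0.06, 0.01],
    "num_keywords": [-0.04, 0.09, 0.00, 0.24, 0.05, -0.15],
}

suppressed = suppress_small(loadings)
```

A row such as num_keywords ends up entirely blank, meaning that attribute does not load strongly on any of the six components.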
We can now either replace all the initial variables used in this part with the factor scores, or select one of the initial variables for each of the selected factors to represent that factor. Here are the factor scores for the first few articles:
Factor | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 |
---|---|---|---|---|---|---|---|---|---|---|
DV (Factor) 1 | 0.15 | 0.28 | 0.17 | 0.40 | -0.14 | -0.36 | 0.29 | 0.32 | -0.04 | 0.12 |
DV (Factor) 2 | 0.67 | -1.90 | 0.28 | -0.18 | 0.10 | 0.40 | 1.10 | -0.74 | 1.74 | 0.40 |
DV (Factor) 3 | -0.12 | -0.29 | -0.18 | -0.23 | -0.21 | -0.22 | -0.10 | -0.13 | -0.26 | 0.56 |
DV (Factor) 4 | -0.77 | -0.75 | -0.61 | -0.70 | -0.15 | -0.01 | 2.71 | 0.67 | -0.62 | -0.16 |
DV (Factor) 5 | -0.89 | 0.49 | -0.84 | -0.90 | -0.09 | -0.87 | -0.95 | -0.37 | 1.25 | 0.47 |
DV (Factor) 6 | 0.18 | -0.22 | -1.02 | 0.27 | -0.30 | -1.18 | -0.32 | -0.78 | -0.04 | 0.80 |
Where,
DV (Factor) 1: Rate of unique non-stop words in the content
DV (Factor) 2: Rate of negative (or positive) words in the content
DV (Factor) 3: Avg. shares of referenced articles in Mashable
DV (Factor) 4: Number of words in the content
DV (Factor) 5: Absolute polarity level in title
DV (Factor) 6: Number of videos in the article
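The six descriptions above coincide with the attribute that has the largest absolute loading on each component. The sketch below assumes that this is the selection rule (the text does not state it) and applies it to the loadings table as printed.

```python
# Loadings copied from the table above (Comp.1 to Comp.6).
loadings = {
    "n_non_stop_unique_tokens": [0.90, -0.06, 0.02, -0.29, -0.05, -0.02],
    "n_unique_tokens": [0.86, -0.05, 0.04, -0.36, -0.02, -0.07],
    "average_token_length": [0.86, -0.10, 0.01, 0.20, -0.07, -0.11],
    "global_subjectivity": [0.76, 0.03, 0.05, 0.24, 0.10, 0.04],
    "rate_positive_words": [0.63, 0.70, 0.01, 0.12, 0.00, -0.05],
    "global_rate_positive_words": [0.55, 0.46, -0.02, 0.15, 0.16, 0.24],
    "global_sentiment_polarity": [0.32, 0.83, 0.00, 0.11, 0.09, -0.01],
    "global_rate_negative_words": [0.30, -0.78, 0.00, 0.10, 0.07, 0.24],
    "rate_negative_words": [0.22, -0.94, 0.01, 0.06, -0.06, 0.01],
    "num_hrefs": [0.09, -0.02, 0.03, 0.75, 0.01, 0.00],
    "num_self_hrefs": [0.09, 0.06, 0.04, 0.60, -0.06, 0.10],
    "num_videos": [0.05, -0.05, 0.04, 0.10, 0.00, 0.82],
    "self_reference_avg_sharess": [0.03, 0.00, 0.99, 0.01, 0.00, 0.01],
    "self_reference_min_shares": [0.02, -0.01, 0.85, -0.04, 0.01, -0.05],
    "self_reference_max_shares": [0.02, 0.00, 0.86, 0.09, 0.00, 0.08],
    "title_subjectivity": [0.02, -0.07, 0.01, 0.02, 0.85, 0.06],
    "abs_title_sentiment_polarity": [0.01, -0.02, 0.01, 0.04, 0.87, 0.00],
    "title_sentiment_polarity": [0.00, 0.25, 0.00, 0.05, 0.54, -0.05],
    "abs_title_subjectivity": [-0.01, 0.01, 0.00, 0.04, -0.69, -0.12],
    "n_tokens_content": [-0.03, -0.04, -0.04, 0.77, -0.02, 0.17],
    "num_keywords": [-0.04, 0.09, 0.00, 0.24, 0.05, -0.15],
    "n_tokens_title": [-0.10, -0.04, 0.00, -0.07, 0.11, 0.40],
    "num_imgs": [-0.12, -0.07, 0.02, 0.65, 0.09, -0.29],
}

# Assumed rule: for each component, pick the attribute whose loading
# has the largest absolute value.
representatives = []
for comp in range(6):
    best = max(loadings, key=lambda name: abs(loadings[name][comp]))
    representatives.append(best)
```

Under this rule the representatives are n_non_stop_unique_tokens, rate_negative_words, self_reference_avg_sharess, n_tokens_content, abs_title_sentiment_polarity and num_videos.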
By focusing on these six factors, Mashable should be able to better predict whether an article will be shared on social media. Moreover, Mashable can potentially increase the number of shares for each article by setting the value of each of these attributes such that it maximizes the chance that a reader will share that article.
There are a total of 39,565 URLs in the data. Here are the responses for the first 10 URLs based on the six factors we chose in the Dimensionality Reduction stage:
n_non_stop_unique_tokens | rate_negative_words | self_reference_avg_sharess | n_tokens_content | abs_title_sentiment_polarity | num_videos |
---|---|---|---|---|---|
0.73 | 0.20 | 3100.00 | 288 | 0.00 | 0 |
0.78 | 0.57 | 0.00 | 414 | 0.20 | 0 |
0.79 | 0.25 | 727.00 | 134 | 0.00 | 0 |
0.77 | 0.32 | 951.00 | 281 | 0.00 | 0 |
0.66 | 0.26 | 1300.00 | 499 | 0.00 | 0 |
0.59 | 0.22 | 0.00 | 268 | 0.00 | 0 |
0.55 | 0.19 | 3151.16 | 925 | 0.00 | 0 |
0.60 | 0.40 | 2700.00 | 261 | 0.10 | 0 |
0.70 | 0.00 | 0.00 | 306 | 0.33 | 0 |
0.67 | 0.25 | 20900.00 | 909 | 0.10 | 1 |
The descriptive statistics of these six attributes are:
Variable | min | 25 percent | median | mean | 75 percent | max | std |
---|---|---|---|---|---|---|---|
n_non_stop_unique_tokens | 0 | 0.63 | 0.69 | 0.67 | 0.75 | 1 | 0.15 |
rate_negative_words | 0 | 0.19 | 0.28 | 0.29 | 0.38 | 1 | 0.16 |
self_reference_avg_sharess | 0 | 983.00 | 2200.00 | 6409.49 | 5200.00 | 843300 | 24234.80 |
n_tokens_content | 0 | 246.00 | 409.00 | 546.41 | 716.00 | 8474 | 471.23 |
abs_title_sentiment_polarity | 0 | 0.00 | 0.00 | 0.16 | 0.25 | 1 | 0.23 |
num_videos | 0 | 0.00 | 0.00 | 1.25 | 1.00 | 91 | 4.11 |
After scaling, the same six attributes have the following summary statistics:
Variable | min | 25 percent | median | mean | 75 percent | max | std |
---|---|---|---|---|---|---|---|
n_non_stop_unique_tokens | -4.37 | -0.30 | 0.11 | 0 | 0.53 | 2.12 | 1 |
rate_negative_words | -1.84 | -0.66 | -0.05 | 0 | 0.62 | 4.56 | 1 |
self_reference_avg_sharess | -0.26 | -0.22 | -0.17 | 0 | -0.05 | 34.53 | 1 |
n_tokens_content | -1.16 | -0.64 | -0.29 | 0 | 0.36 | 16.82 | 1 |
abs_title_sentiment_polarity | -0.69 | -0.69 | -0.69 | 0 | 0.42 | 3.73 | 1 |
num_videos | -0.30 | -0.30 | -0.30 | 0 | -0.06 | 21.83 | 1 |
We use K-means clustering (Lloyd's algorithm) to look for 3 clusters. Here is the cluster membership for the first 10 URLs:
Observation Number | Cluster_Membership |
---|---|
1 | 3 |
2 | 3 |
3 | 3 |
4 | 3 |
5 | 3 |
6 | 3 |
7 | 2 |
8 | 3 |
9 | 1 |
10 | 3 |
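Lloyd's algorithm alternates two steps until memberships stop changing: assign each observation to its nearest centroid, then move each centroid to the mean of its members. The sketch below is a minimal standard-library version run on made-up 2-D points, not the routine used for this report. `lloyd_kmeans` is a hypothetical helper, and the starting centroids are passed in explicitly so the run is reproducible, whereas library implementations usually pick random starts.

```python
import math


def lloyd_kmeans(points, init_centroids, max_iter=100):
    """Minimal Lloyd's algorithm: alternate assignment and update steps."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # memberships stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up points forming three obvious groups (not the Mashable data).
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
labels, centroids = lloyd_kmeans(
    points, init_centroids=[points[0], points[2], points[4]]
)
```

Labels here are 0-based, while the membership table numbers clusters from 1. Because the result depends on the starting centroids, different random starts can yield different segments when the groups are not well separated.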
We compare the average responses for each segment with the population average:
Variable | Population | Segment 1 | Segment 2 | Segment 3 |
---|---|---|---|---|
n_non_stop_unique_tokens | 0.67 | 0.69 | 0.48 | 0.72 |
rate_negative_words | 0.29 | 0.29 | 0.24 | 0.30 |
self_reference_avg_sharess | 6409.49 | 7785.53 | 4183.65 | 6666.43 |
n_tokens_content | 546.41 | 499.43 | 1096.93 | 395.80 |
abs_title_sentiment_polarity | 0.16 | 0.54 | 0.11 | 0.06 |
num_videos | 1.25 | 1.67 | 2.35 | 0.80 |
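To get a better picture of the magnitude of the differences, each segment average is expressed relative to the population average (the values are consistent with the segment average divided by the population average, minus 1):
Variable | Segment 1 | Segment 2 | Segment 3 |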
---|---|---|---|
n_non_stop_unique_tokens | 0.03 | -0.28 | 0.08 |
rate_negative_words | 0.00 | -0.16 | 0.05 |
self_reference_avg_sharess | 0.21 | -0.35 | 0.04 |
n_tokens_content | -0.09 | 1.01 | -0.28 |
abs_title_sentiment_polarity | 2.49 | -0.33 | -0.64 |
num_videos | 0.34 | 0.88 | -0.36 |
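The report does not state how the table above is computed, but its values are consistent with the relative difference between each segment average and the population average. The sketch below checks two Segment 2 entries under that assumption; `relative_difference` is a hypothetical helper and the inputs are the rounded averages printed earlier.

```python
def relative_difference(segment_mean, population_mean):
    """Segment average relative to the population average, minus 1."""
    return segment_mean / population_mean - 1


# Rounded averages copied from the segment comparison table.
population = {"n_tokens_content": 546.41, "num_videos": 1.25}
segment_2 = {"n_tokens_content": 1096.93, "num_videos": 2.35}

scaled_segment_2 = {
    name: round(relative_difference(segment_2[name], population[name]), 2)
    for name in population
}
```

Both values match the table (1.01 and 0.88). Entries built from small rounded averages can differ more: 0.54 / 0.16 - 1 is about 2.38 for abs_title_sentiment_polarity in Segment 1, against the printed 2.49, which presumably comes from unrounded averages.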
Although there seem to be some differences between the three segments, after running the K-means clustering numerous times it becomes clear that the segments are not very distinct. In other words, the distances between the segments are not that large. As such, it does not make much sense to segment the URLs based on these factors.