Intro

UCI Facebook Metrics https://archive.ics.uci.edu/ml/datasets/Facebook+metrics

Abstract:

Facebook performance metrics for the Facebook page of a renowned cosmetics brand.

Data Set Characteristics: Multivariate
Number of Instances: 500
Area: Business
Attribute Characteristics: Integer
Number of Attributes: 19
Date Donated: 2016-08-05
Associated Tasks: Regression

Data Set Information:

The data relates to posts published during 2014 on the Facebook page of a renowned cosmetics brand.
This dataset contains 500 of the 790 rows and part of the features analyzed by Moro et al. (2016). The remaining rows and features were omitted due to confidentiality issues.


Attribute Information:

It includes 7 features known prior to post publication and 12 features for evaluating post impact (see Tables 2 and 3 from Moro et al., 2016 - complete reference in the 'Citation Request')


Relevant Papers:

(Moro et al., 2016) Moro, S., Rita, P., & Vala, B. (2016). Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, 69(9), 3341-3351.

http://www.math-evry.cnrs.fr/_media/members/aguilloux/enseignements/moro2016.pdf

Literature Review

Paper: Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach

Post Timeframe: 1st of January to the 31st of December of 2014

Post Source: the Facebook page of a worldwide renowned cosmetics brand

Nbr of Posts: 500 of 790

4 Feature Types:

  • Identification—features that allow identifying each individual post
  • Content—the textual content of the post
  • Categorization—features that characterize the post
  • Performance—features that measure the post's impact after publication

7 Inputs (Message, link etc.) + 12 Outputs (Likes, Interactions etc.)

Purpose: Understand social media impact on brand building.

Concept: Source Metrics --> Data Used --> References --> Branding Effect

Experiment

  • Use 7 inputs to predict 12 outputs
  • Check for outliers (Shapiro-Wilk)
  • Calculate absolute difference, percent difference, mean absolute percent difference (see the sketch after this list)
  • Conclude: Some predictive ability, better for interaction vs. visualization
  • SVM Model: Type --> Month --> Type/Likes
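The evaluation metrics in the third bullet are straightforward to compute; as a rough illustration, here is a minimal sketch with made-up actual/predicted Total Interactions values (the paper's exact evaluation procedure may differ):

import numpy as np

# Hypothetical actual vs. predicted Total Interactions for four posts
actual = np.array([100.0, 164.0, 80.0, 1777.0])
predicted = np.array([120.0, 150.0, 95.0, 1500.0])

# Absolute difference per post
abs_diff = np.abs(actual - predicted)            # [ 20.  14.  15. 277.]

# Percent difference per post, relative to the actual value
pct_diff = abs_diff / actual * 100

# Mean absolute percent difference across posts (~15.7% here)
mape = pct_diff.mean()
print('MAPE: %.2f%%' % mape)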

Data Loading

Import libraries

In [1]:
import sys
import pandas as pd
import numpy as np
import scipy as sp

# scikit-learn==0.19.1
# scipy==0.19.1

Read the CSV and inspect the first rows

In [2]:
# Delimiter is ";"
# Use .head(n) to show the first n rows

facebook = pd.read_csv('data/Facebook_metrics/dataset_Facebook.csv',delimiter=";")
facebook.head()
Out[2]:
Page total likes Type Category Post Month Post Weekday Post Hour Paid Lifetime Post Total Reach Lifetime Post Total Impressions Lifetime Engaged Users Lifetime Post Consumers Lifetime Post Consumptions Lifetime Post Impressions by people who have liked your Page Lifetime Post reach by people who like your Page Lifetime People who have liked your Page and engaged with your post comment like share Total Interactions
0 139441 Photo 2 12 4 3 0.0 2752 5091 178 109 159 3078 1640 119 4 79.0 17.0 100
1 139441 Status 2 12 3 10 0.0 10460 19057 1457 1361 1674 11710 6112 1108 5 130.0 29.0 164
2 139441 Photo 3 12 3 3 0.0 2413 4373 177 113 154 2812 1503 132 0 66.0 14.0 80
3 139441 Photo 2 12 2 10 1.0 50128 87991 2211 790 1119 61027 32048 1386 58 1572.0 147.0 1777
4 139441 Photo 2 12 2 3 0.0 7244 13594 671 410 580 6228 3200 396 19 325.0 49.0 393

Clean Data

In [3]:
# Shorten column names
facebook.rename(columns=
                {'Lifetime Post Total Reach': 'LT Post Total Reach',
                 'Lifetime Post Total Impressions': 'LT Post Total Imp',
                 'Lifetime Engaged Users': 'LT Engd Users',
                 'Lifetime Post Consumers': 'LT Post Consumers',
                 'Lifetime Post Consumptions': 'LT Post Consump',
                 'Lifetime Post Impressions by people who have liked your Page': 'LT Post Imp + Liked Page',
                 'Lifetime Post reach by people who like your Page': 'LT Post Reach + Liked Page',
                 'Lifetime People who have liked your Page and engaged with your post': 'LT People + Engd Post',
                 'comment': 'Comment',
                 'like': 'Like',
                 'share': 'Share',
                 'Total Interactions': 'Total Int'
                }, inplace=True)
facebook.head()
Out[3]:
Page total likes Type Category Post Month Post Weekday Post Hour Paid LT Post Total Reach LT Post Total Imp LT Engd Users LT Post Consumers LT Post Consump LT Post Imp + Liked Page LT Post Reach + Liked Page LT People + Engd Post Comment Like Share Total Int
0 139441 Photo 2 12 4 3 0.0 2752 5091 178 109 159 3078 1640 119 4 79.0 17.0 100
1 139441 Status 2 12 3 10 0.0 10460 19057 1457 1361 1674 11710 6112 1108 5 130.0 29.0 164
2 139441 Photo 3 12 3 3 0.0 2413 4373 177 113 154 2812 1503 132 0 66.0 14.0 80
3 139441 Photo 2 12 2 10 1.0 50128 87991 2211 790 1119 61027 32048 1386 58 1572.0 147.0 1777
4 139441 Photo 2 12 2 3 0.0 7244 13594 671 410 580 6228 3200 396 19 325.0 49.0 393

Add IDs

In [4]:
# Create an 'id' column of sequential row indices (floats, matching the output below)
ids = pd.DataFrame({'id': np.arange(len(facebook), dtype=float)})

# Attach the id column to the dataset
facebook = pd.concat([facebook, ids], axis=1)

Pre-Processing

The paper suggests using the Shapiro-Wilk test to identify outliers

Check if Sample looks Gaussian

In [5]:
from scipy.stats import shapiro

# normality test
stat, p = shapiro(facebook['LT People + Engd Post'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
Statistics=0.674, p=0.000
Sample does not look Gaussian (reject H0)

Remove Outliers

Calculate the mean and standard deviation for each of the 12 outputs. If a value is more than 2 standard deviations from the mean, remove that row.
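This filtering step is not actually applied later in the notebook, but a minimal sketch of how it could look, assuming the renamed columns from the cleaning step above (the list of 12 output columns and the sequential per-column filtering are my own choices, not the paper's exact procedure):

from scipy.stats import shapiro

# The 12 output (performance) columns, using the shortened names from the cleaning step
output_cols = ['LT Post Total Reach', 'LT Post Total Imp', 'LT Engd Users',
               'LT Post Consumers', 'LT Post Consump', 'LT Post Imp + Liked Page',
               'LT Post Reach + Liked Page', 'LT People + Engd Post',
               'Comment', 'Like', 'Share', 'Total Int']

filtered = facebook.copy()
for col in output_cols:
    # Normality check for this output metric
    stat, p = shapiro(filtered[col].dropna())
    print('%s: W=%.3f, p=%.3f' % (col, stat, p))

    # Keep only rows within 2 standard deviations of the column mean
    mean, std = filtered[col].mean(), filtered[col].std()
    filtered = filtered[(filtered[col] - mean).abs() <= 2 * std]

print('Rows remaining after filtering:', len(filtered))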

If I were taking this further, I would get creative with engineering additional features.

One-hot encoding for later

In [6]:
facebook = pd.get_dummies(facebook, columns=['Type', 'Category'])

facebook.head()
Out[6]:
Page total likes Post Month Post Weekday Post Hour Paid LT Post Total Reach LT Post Total Imp LT Engd Users LT Post Consumers LT Post Consump ... Share Total Int id Type_Link Type_Photo Type_Status Type_Video Category_1 Category_2 Category_3
0 139441 12 4 3 0.0 2752 5091 178 109 159 ... 17.0 100 0.0 0 1 0 0 0 1 0
1 139441 12 3 10 0.0 10460 19057 1457 1361 1674 ... 29.0 164 1.0 0 0 1 0 0 1 0
2 139441 12 3 3 0.0 2413 4373 177 113 154 ... 14.0 80 2.0 0 1 0 0 0 0 1
3 139441 12 2 10 1.0 50128 87991 2211 790 1119 ... 147.0 1777 3.0 0 1 0 0 0 1 0
4 139441 12 2 3 0.0 7244 13594 671 410 580 ... 49.0 393 4.0 0 1 0 0 0 1 0

5 rows × 25 columns

Train/Test Split

I wasn't able to use Shapiro-Wilk to filter outliers, so we'll continue without that step for now.

Let's assign 20% of the data for testing. Sometimes a 70/25/5 split is chosen to include a dev set for removing bias during model building. See here: https://stackoverflow.com/questions/37114273/how-to-randomly-split-a-dataset-into-training-set-test-set-and-dev-set-in-pyth
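For reference, scikit-learn's train_test_split is the more common way to do a random split; a minimal sketch (not used below, where an indicator column keeps everything in a single DataFrame instead):

from sklearn.model_selection import train_test_split

# 80/20 split; random_state fixes the shuffle so the split is reproducible
train_df, test_df = train_test_split(facebook, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))  # 400 100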

In [7]:
# Build an indicator column: 100 test rows (1) and 400 train rows (0)
is_test = pd.DataFrame({'is_test': [1.0] * 100 + [0.0] * 400})

# Randomize so the test rows are spread across the dataset
# https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
is_test = is_test.sample(frac=1).reset_index(drop=True)

# Attach the indicator column to the dataset
facebook = pd.concat([facebook, is_test], axis=1)
In [8]:
# Output
facebook.head(10)
Out[8]:
Page total likes Post Month Post Weekday Post Hour Paid LT Post Total Reach LT Post Total Imp LT Engd Users LT Post Consumers LT Post Consump ... Total Int id Type_Link Type_Photo Type_Status Type_Video Category_1 Category_2 Category_3 is_test
0 139441 12 4 3 0.0 2752 5091 178 109 159 ... 100 0.0 0 1 0 0 0 1 0 0.0
1 139441 12 3 10 0.0 10460 19057 1457 1361 1674 ... 164 1.0 0 0 1 0 0 1 0 0.0
2 139441 12 3 3 0.0 2413 4373 177 113 154 ... 80 2.0 0 1 0 0 0 0 1 0.0
3 139441 12 2 10 1.0 50128 87991 2211 790 1119 ... 1777 3.0 0 1 0 0 0 1 0 0.0
4 139441 12 2 3 0.0 7244 13594 671 410 580 ... 393 4.0 0 1 0 0 0 1 0 1.0
5 139441 12 1 9 0.0 10472 20849 1191 1073 1389 ... 186 5.0 0 0 1 0 0 1 0 0.0
6 139441 12 1 3 1.0 11692 19479 481 265 364 ... 279 6.0 0 1 0 0 0 0 1 0.0
7 139441 12 7 9 1.0 13720 24137 537 232 305 ... 339 7.0 0 1 0 0 0 0 1 0.0
8 139441 12 7 3 0.0 11844 22538 1530 1407 1692 ... 192 8.0 0 0 1 0 0 1 0 0.0
9 139441 12 6 10 0.0 4694 8668 280 183 250 ... 142 9.0 0 1 0 0 0 0 1 0.0

10 rows × 26 columns

In [9]:
train = facebook[(facebook['is_test'] == 0)]
test = facebook[(facebook['is_test'] == 1)]

train.head()
Out[9]:
Page total likes Post Month Post Weekday Post Hour Paid LT Post Total Reach LT Post Total Imp LT Engd Users LT Post Consumers LT Post Consump ... Total Int id Type_Link Type_Photo Type_Status Type_Video Category_1 Category_2 Category_3 is_test
0 139441 12 4 3 0.0 2752 5091 178 109 159 ... 100 0.0 0 1 0 0 0 1 0 0.0
1 139441 12 3 10 0.0 10460 19057 1457 1361 1674 ... 164 1.0 0 0 1 0 0 1 0 0.0
2 139441 12 3 3 0.0 2413 4373 177 113 154 ... 80 2.0 0 1 0 0 0 0 1 0.0
3 139441 12 2 10 1.0 50128 87991 2211 790 1119 ... 1777 3.0 0 1 0 0 0 1 0 0.0
5 139441 12 1 9 0.0 10472 20849 1191 1073 1389 ... 186 5.0 0 0 1 0 0 1 0 0.0

5 rows × 26 columns

In [10]:
test.head()
Out[10]:
Page total likes Post Month Post Weekday Post Hour Paid LT Post Total Reach LT Post Total Imp LT Engd Users LT Post Consumers LT Post Consump ... Total Int id Type_Link Type_Photo Type_Status Type_Video Category_1 Category_2 Category_3 is_test
4 139441 12 2 3 0.0 7244 13594 671 410 580 ... 393 4.0 0 1 0 0 0 1 0 1.0
15 138414 12 3 10 0.0 10060 19680 1264 1209 1425 ... 108 15.0 0 0 1 0 0 1 0 1.0
19 138414 12 1 11 0.0 1591 2825 121 88 111 ... 42 19.0 0 1 0 0 0 0 1 1.0
20 138414 12 1 3 0.0 2848 5066 200 142 184 ... 81 20.0 0 1 0 0 0 1 0 1.0
44 138353 12 4 11 0.0 4284 8387 355 316 513 ... 58 44.0 0 1 0 0 1 0 0 1.0

5 rows × 26 columns

Time to drop columns before modeling!

In [11]:
# Drop the other 11 output metrics so only the inputs and 'Total Int' remain
to_drop = ['LT Post Total Reach', 'LT Post Total Imp', 'LT Engd Users',
           'LT Post Consumers', 'LT Post Consump', 'LT Post Imp + Liked Page',
           'LT Post Reach + Liked Page', 'LT People + Engd Post',
           'Comment', 'Like', 'Share']
train = train.drop(to_drop, axis='columns')
train = train.dropna()
test = test.drop(to_drop, axis='columns')

X_train = train.drop('Total Int', axis='columns')
X_test = test.drop('Total Int', axis='columns')
y_train = train['Total Int']
y_test = test['Total Int']

Do some sanity checks first!

In [12]:
assert X_train.isnull().sum().sum() == 0
assert y_train.isnull().sum() == 0
assert len(X_train) == len(y_train)
assert X_test.isnull().sum().sum() == 0

Model Building (Output = Total Interactions)

With the full dataset split into test and train, we're ready to build some models.

In [13]:
import lightgbm as lgbm
from sklearn import metrics
from sklearn import model_selection


np.random.seed(42)

model = lgbm.LGBMRegressor(
    objective='regression',
    max_depth=5,
    num_leaves=5 ** 2 - 1,
    learning_rate=0.007,
    n_estimators=30000,
    min_child_samples=80,
    subsample=0.8,
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=0,
    random_state=np.random.randint(10e6)
)

# 6-fold cross-validation on the training set
n_splits = 6
cv = model_selection.KFold(n_splits=n_splits, shuffle=True, random_state=42)

val_scores = [0] * n_splits

# Frame to hold the test-set predictions, averaged across folds
sub = test['id'].to_frame()
sub['Total Int'] = 0

feature_importances = pd.DataFrame(index=X_train.columns)

for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    
    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    
    model.fit(
        X_fit,
        y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='l2',
        early_stopping_rounds=200,
        feature_name=X_fit.columns.tolist(),
        verbose=False
    )
    
    val_scores[i] = np.sqrt(model.best_score_['val']['l2'])
    sub['Total Int'] += model.predict(X_test, num_iteration=model.best_iteration_)
    feature_importances[i] = model.feature_importances_
    
    print('Fold {} RMSE: {:.5f}'.format(i+1, val_scores[i]))
    
sub['Total Int'] /= n_splits

val_mean = np.mean(val_scores)
val_std = np.std(val_scores)

print('Local RMSE: {:.5f} (±{:.5f})'.format(val_mean, val_std))
Fold 1 RMSE: 277.03884
Fold 2 RMSE: 353.74402
Fold 3 RMSE: 146.68958
Fold 4 RMSE: 238.51065
Fold 5 RMSE: 814.34417
Fold 6 RMSE: 298.18591
Local RMSE: 354.75219 (±214.96785)
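One step the notebook stops short of is scoring the averaged fold predictions against the held-out test labels; a minimal sketch of that check, assuming sub and y_test from the cells above:

import numpy as np
from sklearn import metrics

# Compare the averaged cross-validation predictions with the held-out test labels
test_rmse = np.sqrt(metrics.mean_squared_error(y_test, sub['Total Int']))
print('Held-out test RMSE: {:.5f}'.format(test_rmse))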

Model Review

Now we can review the performance of our model. We can use feature importances to get a feeling of what worked well, and make changes to the model as needed.

In [14]:
feature_importances.sort_values(0, ascending=False)
Out[14]:
0 1 2 3 4 5
Category_1 76 208 64 1 81 233
Post Weekday 44 142 29 1 63 325
Paid 16 102 10 0 0 184
id 16 329 0 0 19 315
Page total likes 0 281 0 0 0 336
Post Month 0 0 0 0 0 0
Post Hour 0 274 25 0 0 581
Type_Link 0 0 0 0 0 0
Type_Photo 0 0 0 0 0 0
Type_Status 0 0 0 0 0 0
Type_Video 0 0 0 0 0 0
Category_2 0 0 0 0 0 0
Category_3 0 0 0 0 1 0
is_test 0 0 0 0 0 0
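To summarize the table above, the per-fold importances can be averaged into a single ranking; a minimal sketch:

# Average feature importance across the six folds, highest first
mean_importance = feature_importances.mean(axis=1).sort_values(ascending=False)
print(mean_importance)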

Repeat

We used LightGBM to build a model for Total Interactions, but there are still several things we can do to study further:

  1. Fine-tune the model (above)
  2. Develop models using other algorithms (NNs, SVMs; see the SVR sketch after this list)
  3. Apply those models to predicting the other 11 outputs
  4. Ensemble multiple models together for improved performance
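As a starting point for item 2, here is a minimal support vector regression baseline, loosely in the spirit of the paper's SVM model; the pipeline and parameter values are my own assumptions, not the paper's setup, and it reuses X_train/X_test/y_train/y_test from above:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Scale the features, then fit an RBF-kernel SVR (C and epsilon are illustrative only)
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100.0, epsilon=1.0))
svr.fit(X_train, y_train)

svr_pred = svr.predict(X_test)
svr_rmse = np.sqrt(np.mean((y_test - svr_pred) ** 2))
print('SVR test RMSE: {:.5f}'.format(svr_rmse))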

Conclusions

Insights & Observations

Some observations:

  1. Category_1 seems to make the biggest difference to the number of interactions
  2. Post Weekday and Post Hour appearing among the top factors suggests that timing is a major driver of engagement
  3. Page total likes ranking highly indicates that a larger brand presence affects engagement
  4. The 'id' column (and the 'is_test' indicator) was accidentally left in the feature set and should be removed (see the sketch after this list)!
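A minimal sketch of that fix, reusing the feature frames from above:

# Drop bookkeeping columns from the features before refitting the model
leak_cols = ['id', 'is_test']
X_train_clean = X_train.drop(leak_cols, axis='columns')
X_test_clean = X_test.drop(leak_cols, axis='columns')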

Comparison with Literature

How does this compare with literature?

From the paper discussed in the introduction, the most influential inputs were expected to be post type, month, and number of page likes, in that order. Our results similarly surface post timing and page likes, but not type! This will require further study.

Next Steps

Aside from continued model tuning, it is important to connect insights with domain knowledge to explore new features and improve the model.

This model was built to predict 1 of the 12 outputs (Total Interactions). The same process can be repeated for the other 11 outputs, which would also let us explore how different inputs affect different types of interactions (likes, comments, etc.).

Additionally, the model can be used for prediction: it can receive inputs from new instances (i.e. future posts) to predict their engagement.
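As an illustration of that last point, a hypothetical future post can be scored by building a one-row frame with the same feature columns; every value below is made up:

import pandas as pd

# Hypothetical future post: a paid photo in Category 2, posted on a Tuesday at 10am in July
new_post = pd.DataFrame([{col: 0 for col in X_train.columns}])
new_post['Page total likes'] = 139441
new_post['Post Month'] = 7
new_post['Post Weekday'] = 2
new_post['Post Hour'] = 10
new_post['Paid'] = 1.0
new_post['Type_Photo'] = 1
new_post['Category_2'] = 1

# Reorder columns to match training and predict expected Total Interactions
predicted_interactions = model.predict(new_post[X_train.columns])
print('Predicted Total Interactions: %.1f' % predicted_interactions[0])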

--

That's all for now! Thanks for reading. I hope you found this notebook useful. Feel free to shoot me an email at d@dudonwai.com if you have any questions.