Intro

UCI Facebook Metrics https://archive.ics.uci.edu/ml/datasets/Facebook+metrics

Abstract:

Facebook performance metrics for the Facebook page of a renowned cosmetics brand.

Data Set Characteristics: Multivariate
Number of Instances: 500
Area: Business
Attribute Characteristics: Integer
Number of Attributes: 19
Date Donated: 2016-08-05
Associated Tasks: Regression

Data Set Information:

The data relates to posts published during 2014 on the Facebook page of a renowned cosmetics brand.
This dataset contains 500 of the 790 rows and part of the features analyzed by Moro et al. (2016). The remaining rows and features were omitted due to confidentiality issues.


Attribute Information:

It includes 7 features known prior to post publication and 12 features for evaluating post impact (see Tables 2 and 3 from Moro et al., 2016 - complete reference in the 'Citation Request')


Relevant Papers:

(Moro et al., 2016) Moro, S., Rita, P., & Vala, B. (2016). Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, 69(9), 3341-3351.

http://www.math-evry.cnrs.fr/_media/members/aguilloux/enseignements/moro2016.pdf

Literature Review

Paper: Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach

Post Timeframe: 1st of January to the 31st of December of 2014

Post Source: the Facebook page of a worldwide renowned cosmetics brand

Nbr of Posts: 500 of 790

4 Feature Types:

  • Identification—features that allow identifying each individual post
  • Content—the textual content of the post
  • Categorization—features that characterize the post
  • Performance—features that measure the post's impact after publication

7 Inputs (Message, link etc.) + 12 Outputs (Likes, Interactions etc.)

Purpose: Understand social media impact on brand building.

Concept: Source Metrics --> Data Used --> References --> Branding Effect

Experiment

  • Use 7 inputs to predict 12 outputs
  • Check for outliers (Shapiro-Wilk)
  • Calculate absolute difference, percent difference, mean absolute percent difference (see the sketch after this list)
  • Conclude: Some predictive ability, better for interaction vs. visualization
  • SVM Model: Type --> Month --> Type/Likes
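The evaluation metrics in the third bullet are straightforward to compute; as a rough illustration, here is a minimal sketch with made-up actual/predicted Total Interactions values (the paper's exact evaluation procedure may differ):

import numpy as np

# Hypothetical actual vs. predicted Total Interactions for four posts
actual = np.array([100.0, 164.0, 80.0, 1777.0])
predicted = np.array([120.0, 150.0, 95.0, 1500.0])

# Absolute difference per post
abs_diff = np.abs(actual - predicted)            # [ 20.  14.  15. 277.]

# Percent difference per post, relative to the actual value
pct_diff = abs_diff / actual * 100

# Mean absolute percent difference across posts (~15.7% here)
mape = pct_diff.mean()
print('MAPE: %.2f%%' % mape)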

Data Loading

Import libraries

In [1]:
import sys
import pandas as pd
import numpy as np
import scipy as sp

# scikit-learn==0.19.1
# scipy==0.19.1

Read the CSV and inspect the first rows

In [2]:
# Delimiter is ";"
# Use .head(n) to show the first n rows

facebook = pd.read_csv('data/Facebook_metrics/dataset_Facebook.csv',delimiter=";")
facebook.head()
Out[2]:
Page total likes Type Category Post Month Post Weekday Post Hour Paid Lifetime Post Total Reach Lifetime Post Total Impressions Lifetime Engaged Users Lifetime Post Consumers Lifetime Post Consumptions Lifetime Post Impressions by people who have liked your Page Lifetime Post reach by people who like your Page Lifetime People who have liked your Page and engaged with your post comment like share Total Interactions
0 139441 Photo 2 12 4 3 0.0 2752 5091 178 109 159 3078 1640 119 4 79.0 17.0 100
1 139441 Status 2 12 3 10 0.0 10460 19057 1457 1361 1674 11710 6112 1108 5 130.0 29.0 164
2 139441 Photo 3 12 3 3 0.0 2413 4373 177 113 154 2812 1503 132 0 66.0 14.0 80
3 139441 Photo 2 12 2 10 1.0 50128 87991 2211 790 1119 61027 32048 1386 58 1572.0 147.0 1777
4 139441 Photo 2 12 2 3 0.0 7244 13594 671 410 580 6228 3200 396 19 325.0 49.0 393

Clean Data

In [3]:
# Shorten column names
facebook.rename(columns=
                {'Lifetime Post Total Reach': 'LT Post Total Reach',
                 'Lifetime Post Total Impressions': 'LT Post Total Imp',
                 'Lifetime Engaged Users': 'LT Engd Users',
                 'Lifetime Post Consumers': 'LT Post Consumers',
                 'Lifetime Post Consumptions': 'LT Post Consump',
                 'Lifetime Post Impressions by people who have liked your Page': 'LT Post Imp + Liked Page',
                 'Lifetime Post reach by people who like your Page': 'LT Post Reach + Liked Page',
                 'Lifetime People who have liked your Page and engaged with your post': 'LT People + Engd Post',
                 'comment': 'Comment',
                 'like': 'Like',
                 'share': 'Share',
                 'Total Interactions': 'Total Int'
                }, inplace=True)
facebook.head()
Out[3]:
Page total likes Type Category Post Month Post Weekday Post Hour Paid LT Post Total Reach LT Post Total Imp LT Engd Users LT Post Consumers LT Post Consump LT Post Imp + Liked Page LT Post Reach + Liked Page LT People + Engd Post Comment Like Share Total Int
0 139441 Photo 2 12 4 3 0.0 2752 5091 178 109 159 3078 1640 119 4 79.0 17.0 100
1 139441 Status 2 12 3 10 0.0 10460 19057 1457 1361 1674 11710 6112 1108 5 130.0 29.0 164
2 139441 Photo 3 12 3 3 0.0 2413 4373 177 113 154 2812 1503 132 0 66.0 14.0 80
3 139441 Photo 2 12 2 10 1.0 50128 87991 2211 790 1119 61027 32048 1386 58 1572.0 147.0 1777
4 139441 Photo 2 12 2 3 0.0 7244 13594 671 410 580 6228 3200 396 19 325.0 49.0 393

Add IDs

In [4]:
# Create an 'id' column of sequential row indices (floats, matching the output below)
ids = pd.DataFrame({'id': np.arange(len(facebook), dtype=float)})

# Attach the id column to the dataset
facebook = pd.concat([facebook, ids], axis=1)

Pre-Processing

The paper suggests using the Shapiro-Wilk test to identify outliers

Check if Sample looks Gaussian

In [5]:
from scipy.stats import shapiro

# normality test
stat, p = shapiro(facebook['LT People + Engd Post'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
Statistics=0.674, p=0.000
Sample does not look Gaussian (reject H0)

Remove Outliers

Calculate the mean and standard deviation for each of the 12 outputs. If a value is more than 2 standard deviations from the mean, remove that row.
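This filtering step is not actually applied later in the notebook, but a minimal sketch of how it could look, assuming the renamed columns from the cleaning step above (the list of 12 output columns and the sequential per-column filtering are my own choices, not the paper's exact procedure):

from scipy.stats import shapiro

# The 12 output (performance) columns, using the shortened names from the cleaning step
output_cols = ['LT Post Total Reach', 'LT Post Total Imp', 'LT Engd Users',
               'LT Post Consumers', 'LT Post Consump', 'LT Post Imp + Liked Page',
               'LT Post Reach + Liked Page', 'LT People + Engd Post',
               'Comment', 'Like', 'Share', 'Total Int']

filtered = facebook.copy()
for col in output_cols:
    # Normality check for this output metric
    stat, p = shapiro(filtered[col].dropna())
    print('%s: W=%.3f, p=%.3f' % (col, stat, p))

    # Keep only rows within 2 standard deviations of the column mean
    mean, std = filtered[col].mean(), filtered[col].std()
    filtered = filtered[(filtered[col] - mean).abs() <= 2 * std]

print('Rows remaining after filtering:', len(filtered))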

If I were taking this further, I would get creative with engineering additional features.

One-hot encoding for later

In [6]:
facebook = pd.get_dummies(facebook, columns=['Type', 'Category'])

facebook.head()
Out[6]:
Page total likes Post Month Post Weekday Post Hour Paid LT Post Total Reach LT Post Total Imp LT Engd Users LT Post Consumers LT Post Consump ... Share Total Int id Type_Link Type_Photo Type_Status Type_Video Category_1 Category_2 Category_3
0 139441 12 4 3 0.0 2752 5091 178 109 159 ... 17.0 100 0.0 0 1 0 0 0 1 0
1 139441 12 3 10 0.0 10460 19057 1457 1361 1674 ... 29.0 164 1.0 0 0 1 0 0 1 0
2 139441 12 3 3 0.0 2413 4373 177 113 154 ... 14.0 80 2.0 0 1 0 0 0 0 1
3 139441 12 2 10 1.0 50128 87991 2211 790 1119 ... 147.0 1777 3.0 0 1 0 0 0 1 0
4 139441 12 2 3 0.0 7244 13594 671 410 580 ... 49.0 393 4.0 0 1 0 0 0 1 0

5 rows × 25 columns

Train/Test Split

I wasn't able to use Shapiro-Wilk to filter outliers, so we'll continue without that step for now.

Let's assign 20% of the data for testing. Sometimes a 70/25/5 split is chosen to include a dev set for removing bias during model building. See here: https://stackoverflow.com/questions/37114273/how-to-randomly-split-a-dataset-into-training-set-test-set-and-dev-set-in-pyth
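For reference, scikit-learn's train_test_split is the more common way to do a random split; a minimal sketch (not used below, where an indicator column keeps everything in a single DataFrame instead):

from sklearn.model_selection import train_test_split

# 80/20 split; random_state fixes the shuffle so the split is reproducible
train_df, test_df = train_test_split(facebook, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))  # 400 100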

In [7]:
# Build an indicator column: 100 test rows (1) and 400 train rows (0)
is_test = pd.DataFrame({'is_test': [1.0] * 100 + [0.0] * 400})

# Randomize so the test rows are spread across the dataset
# https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
is_test = is_test.sample(frac=1).reset_index(drop=True)

# Attach the indicator column to the dataset
facebook = pd.concat([facebook, is_test], axis=1)
In [8]:
# Output
facebook.head(10)
Out[8]:
Page total likes Post Month Post Weekday Post Hour Paid LT Post Total Reach LT Post Total Imp LT Engd Users LT Post Consumers LT Post Consump ... Total Int id Type_Link Type_Photo Type_Status Type_Video Category_1 Category_2 Category_3 is_test
0 139441 12 4 3 0.0 2752 5091 178 109 159 ... 100 0.0 0 1 0 0 0 1 0 0.0
1 139441 12 3 10 0.0 10460 19057 1457 1361 1674 ... 164 1.0 0 0 1 0 0 1 0 0.0
2 139441 12 3 3 0.0 2413 4373 177 113 154 ... 80 2.0 0 1 0 0 0 0 1 0.0
3 139441 12 2 10 1.0 50128 87991 2211 790 1119 ... 1777 3.0 0 1 0 0 0 1 0 0.0
4 139441 12 2 3 0.0 7244 13594 671 410 580 ... 393 4.0 0 1 0 0 0 1 0 1.0
5 139441 12 1 9 0.0 10472 20849 1191 1073 1389 ... 186 5.0 0 0 1 0 0 1 0 0.0
6 139441 12 1 3 1.0 11692 19479 481 265 364 ... 279 6.0 0 1 0 0 0 0 1 0.0
7 139441 12 7 9 1.0 13720 24137 537 232 305 ... 339 7.0 0 1 0 0 0 0 1 0.0
8 139441 12 7 3 0.0 11844 22538 1530 1407 1692 ... 192 8.0 0 0 1 0 0 1 0 0.0
9 139441 12 6 10 0.0 4694 8668 280 183 250 ... 142 9.0 0 1 0 0 0 0 1 0.0

10 rows × 26 columns

In [9]:
train = facebook[(facebook['is_test'] == 0)]
test = facebook[(facebook['is_test'] == 1)]

train.head()
Out[9]:
Page total likes Post Month Post Weekday Post Hour Paid LT Post Total Reach LT Post Total Imp LT Engd Users LT Post Consumers LT Post Consump ... Total Int id Type_Link Type_Photo Type_Status Type_Video Category_1 Category_2 Category_3 is_test
0 139441 12 4 3 0.0 2752 5091 178 109 159 ... 100 0.0 0 1 0 0 0 1 0 0.0
1 139441 12 3 10 0.0 10460 19057 1457 1361 1674 ... 164 1.0 0 0 1 0 0 1 0 0.0
2 139441 12 3 3 0.0 2413 4373 177 113 154 ... 80 2.0 0 1 0 0 0 0 1 0.0
3 139441 12 2 10 1.0 50128 87991 2211 790 1119 ... 1777 3.0 0 1 0 0 0 1 0 0.0
5 139441 12 1 9 0.0 10472 20849 1191 1073 1389 ... 186 5.0 0 0 1 0 0 1 0 0.0

5 rows × 26 columns

In [10]:
test.head()
Out[10]:
Page total likes Post Month Post Weekday Post Hour Paid LT Post Total Reach LT Post Total Imp LT Engd Users LT Post Consumers LT Post Consump ... Total Int id Type_Link Type_Photo Type_Status Type_Video Category_1 Category_2 Category_3 is_test
4 139441 12 2 3 0.0 7244 13594 671 410 580 ... 393 4.0 0 1 0 0 0 1 0 1.0
15 138414 12 3 10 0.0 10060 19680 1264 1209 1425 ... 108 15.0 0 0 1 0 0 1 0 1.0
19 138414 12 1 11 0.0 1591 2825 121 88 111 ... 42 19.0 0 1 0 0 0 0 1 1.0
20 138414 12 1 3 0.0 2848 5066 200 142 184 ... 81 20.0 0 1 0 0 0 1 0 1.0
44 138353 12 4 11 0.0 4284 8387 355 316 513 ... 58 44.0 0 1 0 0 1 0 0 1.0

5 rows × 26 columns

Time to drop columns before modeling!

In [11]:
# Drop the other 11 output metrics so only the inputs and 'Total Int' remain
to_drop = ['LT Post Total Reach', 'LT Post Total Imp', 'LT Engd Users',
           'LT Post Consumers', 'LT Post Consump', 'LT Post Imp + Liked Page',
           'LT Post Reach + Liked Page', 'LT People + Engd Post',
           'Comment', 'Like', 'Share']
train = train.drop(to_drop, axis='columns')
train = train.dropna()
test = test.drop(to_drop, axis='columns')

X_train = train.drop('Total Int', axis='columns')
X_test = test.drop('Total Int', axis='columns')
y_train = train['Total Int']
y_test = test['Total Int']

Do some sanity checks first!

In [12]:
assert X_train.isnull().sum().sum() == 0
assert y_train.isnull().sum() == 0
assert len(X_train) == len(y_train)
assert X_test.isnull().sum().sum() == 0

Model Building (Output = Total Interactions)

With the full dataset split into test and train, we're ready to build some models.

In [13]:
import lightgbm as lgbm
from sklearn import metrics
from sklearn import model_selection


np.random.seed(42)

model = lgbm.LGBMRegressor(
    objective='regression',
    max_depth=5,
    num_leaves=5 ** 2 - 1,
    learning_rate=0.007,
    n_estimators=30000,
    min_child_samples=80,
    subsample=0.8,
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=0,
    random_state=np.random.randint(10e6)
)

# 6-fold cross-validation on the training set
n_splits = 6
cv = model_selection.KFold(n_splits=n_splits, shuffle=True, random_state=42)

val_scores = [0] * n_splits

# Frame to hold the test-set predictions, averaged across folds
sub = test['id'].to_frame()
sub['Total Int'] = 0

feature_importances = pd.DataFrame(index=X_train.columns)

for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    
    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    
    model.fit(
        X_fit,
        y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='l2',
        early_stopping_rounds=200,
        feature_name=X_fit.columns.tolist(),
        verbose=False
    )
    
    val_scores[i] = np.sqrt(model.best_score_['val']['l2'])
    sub['Total Int'] += model.predict(X_test, num_iteration=model.best_iteration_)
    feature_importances[i] = model.feature_importances_
    
    print('Fold {} RMSE: {:.5f}'.format(i+1, val_scores[i]))
    
sub['Total Int'] /= n_splits

val_mean = np.mean(val_scores)
val_std = np.std(val_scores)

print('Local RMSE: {:.5f} (±{:.5f})'.format(val_mean, val_std))
Fold 1 RMSE: 277.03884
Fold 2 RMSE: 353.74402
Fold 3 RMSE: 146.68958
Fold 4 RMSE: 238.51065
Fold 5 RMSE: 814.34417
Fold 6 RMSE: 298.18591
Local RMSE: 354.75219 (±214.96785)
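One step the notebook stops short of is scoring the averaged fold predictions against the held-out test labels; a minimal sketch of that check, assuming sub and y_test from the cells above:

import numpy as np
from sklearn import metrics

# Compare the averaged cross-validation predictions with the held-out test labels
test_rmse = np.sqrt(metrics.mean_squared_error(y_test, sub['Total Int']))
print('Held-out test RMSE: {:.5f}'.format(test_rmse))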

Model Review

Now we can review the performance of our model. We can use feature importances to get a feeling of what worked well, and make changes to the model as needed.

In [14]:
feature_importances.sort_values(0, ascending=False)
Out[14]:
0 1 2 3 4 5
Category_1 76 208 64 1 81 233
Post Weekday 44 142 29 1 63 325
Paid 16 102 10 0 0 184
id 16 329 0 0 19 315
Page total likes 0 281 0 0 0 336
Post Month 0 0 0 0 0 0
Post Hour 0 274 25 0 0 581
Type_Link 0 0 0 0 0 0
Type_Photo 0 0 0 0 0 0
Type_Status 0 0 0 0 0 0
Type_Video 0 0 0 0 0 0
Category_2 0 0 0 0 0 0
Category_3 0 0 0 0 1 0
is_test 0 0 0 0 0 0
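To summarize the table above, the per-fold importances can be averaged into a single ranking; a minimal sketch:

# Average feature importance across the six folds, highest first
mean_importance = feature_importances.mean(axis=1).sort_values(ascending=False)
print(mean_importance)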

Repeat

We used LightGBM to build a model for Total Interactions, but there are still several things we can do to study further:

  1. Fine-tune the model (above)
  2. Develop models using other algorithms (NNs, SVMs; see the SVR sketch after this list)
  3. Apply those models to predicting the other 11 outputs
  4. Ensemble multiple models together for improved performance
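As a starting point for item 2, here is a minimal support vector regression baseline, loosely in the spirit of the paper's SVM model; the pipeline and parameter values are my own assumptions, not the paper's setup, and it reuses X_train/X_test/y_train/y_test from above:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Scale the features, then fit an RBF-kernel SVR (C and epsilon are illustrative only)
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100.0, epsilon=1.0))
svr.fit(X_train, y_train)

svr_pred = svr.predict(X_test)
svr_rmse = np.sqrt(np.mean((y_test - svr_pred) ** 2))
print('SVR test RMSE: {:.5f}'.format(svr_rmse))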

Conclusions

Insights & Observations

Some observations:

  1. Category_1 seems to make the biggest difference to the number of interactions
  2. Post Weekday and Post Hour appearing among the top factors suggests that timing is a major driver of engagement
  3. Page total likes ranking highly indicates that a larger brand presence affects engagement
  4. The 'id' column (and the 'is_test' indicator) was accidentally left in the feature set and should be removed (see the sketch after this list)!
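A minimal sketch of that fix, reusing the feature frames from above:

# Drop bookkeeping columns from the features before refitting the model
leak_cols = ['id', 'is_test']
X_train_clean = X_train.drop(leak_cols, axis='columns')
X_test_clean = X_test.drop(leak_cols, axis='columns')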

Comparison with Literature

How does this compare with literature?

From the paper discussed in the introduction, the most influential inputs were expected to be post type, month, and number of page likes, in that order. Our results similarly surface post timing and page likes, but not type! This will require further study.

Next Steps

Aside from continued model tuning, it is important to connect insights with domain knowledge to explore new features and improve the model.

This model was built to predict 1 of the 12 outputs (Total Interactions). The same process can be repeated for the other 11 outputs, which would also let us explore how different inputs affect different types of interactions (likes, comments, etc.).

Additionally, the model can be used for prediction: it can receive inputs from new instances (i.e. future posts) to predict their engagement.
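As an illustration of that last point, a hypothetical future post can be scored by building a one-row frame with the same feature columns; every value below is made up:

import pandas as pd

# Hypothetical future post: a paid photo in Category 2, posted on a Tuesday at 10am in July
new_post = pd.DataFrame([{col: 0 for col in X_train.columns}])
new_post['Page total likes'] = 139441
new_post['Post Month'] = 7
new_post['Post Weekday'] = 2
new_post['Post Hour'] = 10
new_post['Paid'] = 1.0
new_post['Type_Photo'] = 1
new_post['Category_2'] = 1

# Reorder columns to match training and predict expected Total Interactions
predicted_interactions = model.predict(new_post[X_train.columns])
print('Predicted Total Interactions: %.1f' % predicted_interactions[0])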

--

That's all for now! Thanks for reading. I hope you found this notebook useful. Feel free to shoot me an email at d@dudonwai.com if you have any questions.