UCI Facebook Metrics https://archive.ics.uci.edu/ml/datasets/Facebook+metrics
Abstract:
Facebook performance metrics of a renowned cosmetics brand's Facebook page.
Data Set Characteristics: Multivariate
Number of Instances: 500
Area: Business
Attribute Characteristics: Integer
Number of Attributes: 19
Date Donated: 2016-08-05
Associated Tasks: Regression
Data Set Information:
The data is related to posts published during 2014 on the Facebook page of a renowned cosmetics brand.
This dataset contains 500 of the 790 rows and part of the features analyzed by Moro et al. (2016). The remaining rows and features were omitted due to confidentiality concerns.
Attribute Information:
It includes 7 features known prior to post publication and 12 features for evaluating post impact (see Tables 2 and 3 from Moro et al., 2016 - complete reference in the 'Citation Request')
Relevant Papers:
(Moro et al., 2016) Moro, S., Rita, P., & Vala, B. (2016). Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, 69(9), 3341-3351.
http://www.math-evry.cnrs.fr/_media/members/aguilloux/enseignements/moro2016.pdf
Post Timeframe: 1 January to 31 December 2014
Post Source: Facebook page of a worldwide-renowned cosmetics brand
Number of Posts: 500 of 790
Feature Types:
7 inputs (Type, Category, etc.) + 12 outputs (Likes, Total Interactions, etc.)
Purpose: Understand social media impact on brand building.
Concept: Source Metrics --> Data Used --> References --> Branding Effect
Experiment
import sys
import pandas as pd
import numpy as np
import scipy as sp
# scikit-learn==0.19.1
# scipy==0.19.1
# Delimiter is ";"
# Use .head(n) to show the first n rows
facebook = pd.read_csv('data/Facebook_metrics/dataset_Facebook.csv', delimiter=";")
facebook.head()
# Shorten column names
facebook.rename(columns=
{'Lifetime Post Total Reach': 'LT Post Total Reach',
'Lifetime Post Total Impressions': 'LT Post Total Imp',
'Lifetime Engaged Users': 'LT Engd Users',
'Lifetime Post Consumers': 'LT Post Consumers',
'Lifetime Post Consumptions': 'LT Post Consump',
'Lifetime Post Impressions by people who have liked your Page': 'LT Post Imp + Liked Page',
'Lifetime Post reach by people who like your Page': 'LT Post Reach + Liked Page',
'Lifetime People who have liked your Page and engaged with your post': 'LT People + Engd Post',
'comment': 'Comment',
'like': 'Like',
'share': 'Share',
'Total Interactions': 'Total Int'
}, inplace=True)
facebook.head()
import pandas as pd
import numpy as np
# Create an id column (0..499) so rows can be traced back after shuffling and splitting
id = pd.DataFrame({'id': np.arange(len(facebook))})
facebook = pd.concat([facebook, id], axis=1)
The paper suggests using the Shapiro-Wilk test to identify outliers
from scipy.stats import shapiro
# Shapiro-Wilk normality test on one of the output columns
stat, p = shapiro(facebook['LT People + Engd Post'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
Calculate the mean and standard deviation for each of the 12 outputs. If a data point is more than 2 standard deviations from the mean, remove it.
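As a rough illustration of that rule, here is a minimal sketch of a 2-standard-deviation filter. It is not applied in this notebook, and output_cols stands in for a hypothetical list of the 12 output column names.
# Sketch only (not applied below): drop rows more than n_std standard
# deviations from the column mean, for each output column in cols
def remove_outliers(df, cols, n_std=2):
    for col in cols:
        mean, std = df[col].mean(), df[col].std()
        df = df[(df[col] - mean).abs() <= n_std * std]
    return df
# Hypothetical usage: facebook = remove_outliers(facebook, output_cols)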
# One-hot encode the categorical inputs (Type, Category)
facebook = pd.get_dummies(facebook, columns=['Type', 'Category'])
facebook.head()
I wasn't able to use the Shapiro-Wilk test to filter outliers, so we'll continue without outlier removal for now.
Let's assign 20% of the data for testing. Sometimes a 70/25/5 split is chosen instead so that a dev set is available for reducing bias during model building. See here: https://stackoverflow.com/questions/37114273/how-to-randomly-split-a-dataset-into-training-set-test-set-and-dev-set-in-pyth (An equivalent split using scikit-learn is sketched after the manual split below.)
import pandas as pd
import numpy as np
# Create a flag column: 100 rows (20%) marked as test, 400 as train
is_test = pd.DataFrame({'is_test': [1] * 100 + [0] * 400})
# Randomize
# https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
is_test = is_test.sample(frac=1).reset_index(drop=True)
# Attach the flag column to the data
facebook = pd.concat([facebook, is_test], axis=1)
# Output
facebook.head(10)
train = facebook[(facebook['is_test'] == 0)]
test = facebook[(facebook['is_test'] == 1)]
train.head()
test.head()
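As an aside, the same 80/20 split could also be done in one call with scikit-learn; a minimal sketch (train_test_split is available in the 0.19.1 version pinned above):
from sklearn.model_selection import train_test_split
# Alternative: shuffle and hold out 20% for testing in a single call
train_alt, test_alt = train_test_split(facebook, test_size=0.2, random_state=42)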
Time to drop some columns before modeling!
to_drop = ['LT Post Total Reach', 'LT Post Total Imp', 'LT Engd Users',
'LT Post Consumers', 'LT Post Consump', 'LT Post Imp + Liked Page',
'LT Post Reach + Liked Page', 'LT People + Engd Post',
'Comment', 'Like', 'Share']
train = train.drop(to_drop, axis='columns')
train = train.dropna()
test = test.drop(to_drop, axis='columns')
X_train = train.drop('Total Int', axis='columns')
X_test = test.drop('Total Int', axis='columns')
y_train = train['Total Int']
y_test = test['Total Int']
Do some sanity checks first!
assert X_train.isnull().sum().sum() == 0
assert y_train.isnull().sum() == 0
assert len(X_train) == len(y_train)
assert X_test.isnull().sum().sum() == 0
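Since test was not passed through dropna above, it may also be worth checking the held-out targets; a small optional addition:
# Optional extra checks on the held-out set
assert y_test.isnull().sum() == 0
assert len(X_test) == len(y_test)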
With the full dataset split into test and train, we're ready to build some models.
import lightgbm as lgbm
from sklearn import metrics
from sklearn import model_selection
np.random.seed(42)
model = lgbm.LGBMRegressor(
objective='regression',
max_depth=5,
num_leaves=5 ** 2 - 1,
learning_rate=0.007,
n_estimators=30000,
min_child_samples=80,
subsample=0.8,
colsample_bytree=1,
reg_alpha=0,
reg_lambda=0,
random_state=np.random.randint(10e6)
)
n_splits = 6
cv = model_selection.KFold(n_splits=n_splits, shuffle=True, random_state=42)
val_scores = [0] * n_splits
sub = test['id'].to_frame()
sub['Total Int'] = 0
feature_importances = pd.DataFrame(index=X_train.columns)
for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    model.fit(
        X_fit,
        y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='l2',
        early_stopping_rounds=200,
        feature_name=X_fit.columns.tolist(),
        verbose=False
    )
    val_scores[i] = np.sqrt(model.best_score_['val']['l2'])
    sub['Total Int'] += model.predict(X_test, num_iteration=model.best_iteration_)
    feature_importances[i] = model.feature_importances_
    print('Fold {} RMSE: {:.5f}'.format(i + 1, val_scores[i]))
sub['Total Int'] /= n_splits
val_mean = np.mean(val_scores)
val_std = np.std(val_scores)
print('Local RMSE: {:.5f} (±{:.5f})'.format(val_mean, val_std))
Now we can review the performance of our model. We can use the feature importances to get a feel for which inputs worked well, and make changes to the model as needed.
feature_importances.sort_values(0, ascending=False)
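Since one importance column was stored per fold, it may be more informative to rank features by their average across folds; a small sketch (not part of the original analysis):
# Rank features by their mean importance across the CV folds
feature_importances.mean(axis='columns').sort_values(ascending=False)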
We used LightGBM to build a model for Total Interactions, but there are still three things we can do to study further.
Some observations:
How does this compare with the literature?
According to the paper discussed in the introduction, the most influential features were expected to be post type, month, and number of page likes, in that order. Our results similarly show month (more generally, the time of posting) and page likes, but not type! This will require further study.
Aside from continued model tuning, it is important to connect insights with domain knowledge to explore new features and improve the model.
This model was built to predict one of the 12 outputs (Total Interactions). The same process can be repeated for the other 11 outputs, both to improve the model and to explore how different inputs affect different types of interactions (likes, comments, etc.).
Additionally, the model can be used for prediction: it can take the inputs of new instances (i.e. future posts) and predict their expected engagement.
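For illustration, a minimal sketch of scoring a hypothetical future post. The columns must match X_train, and the values filled in here are made up:
# Hypothetical new post: one row with the same columns as X_train, filled with
# zeros except for a few made-up illustrative values
new_post = pd.DataFrame(0, index=[0], columns=X_train.columns)
new_post['Page total likes'] = 135000  # made-up value
new_post['Post Hour'] = 10             # made-up value
print(model.predict(new_post, num_iteration=model.best_iteration_))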
--
That's all for now! Thanks for reading. I hope you found this notebook useful. Feel free to shoot me an email at d@dudonwai.com if you have any questions.