Modeling feature importance

This example notebook shows how you can use the open model hub to model the importance of features on achieving a conversion goal. Such features can be any interactions with your product’s content. The model to run this directly using the open model hub is currently still in development; see the draft PR.

The dataset used here is the same as in Objectiv Up. To run this on your own data, see how to get started in your notebook.

Get started

We first have to instantiate the model hub and an Objectiv DataFrame object.

In:
# set the timeframe of the analysis
start_date = '2022-06-01'
end_date = None
In:
# instantiate the model hub, set the default time aggregation to daily
# and get the root location global context
from matplotlib import pyplot as plt
from modelhub import ModelHub, display_sql_as_markdown
modelhub = ModelHub(time_aggregation='%Y-%m-%d', global_contexts=['root_location'])
# get an Objectiv DataFrame within a defined timeframe
df = modelhub.get_objectiv_dataframe(db_url=DB_URL, start_date=start_date, end_date=end_date)

Define features & conversion

First we have to define the conversion goal that we are predicting, as well as the features that we want to use as predictors.

For this example, we define the conversion goal as reaching the modeling section in our documentation. We want to model the impact of users clicking/pressing in any of the main sections (root locations) of our website. This works for our example, as there is only a limited number of root locations in the dataset, and we assume that there is a causal relation between the number of clicks in these root locations and conversion. Keep this assumption in mind when using this model on your own data.

In:
# define which events to use as conversion events
modelhub.add_conversion_event(location_stack=df.location_stack.json[{'id': 'modeling', '_type': 'RootLocationContext'}:], event_type='PressEvent', name='use_modeling')
# the features that we use for predicting
df['root'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')

We estimate conversion by using the number of presses in each root location with a logistic regression model. The coefficients of this regression can be interpreted as the contribution to conversion (direction and magnitude).

Next, we instantiate the dataset and untrained model.

In:
# instantiate the dataset and the untrained model
X_temp, y_temp, model = modelhub.agg.feature_importance(data=df[df.event_type=='PressEvent'], name='use_modeling', feature_column='root')

This lets you adjust the dataset further, or use the model as-is:

  • y_temp is a BooleanSeries that indicates conversion per user.

  • X_temp is a DataFrame with the number of presses per user_id. For users that converted in the selected dataset, only usage from before reaching conversion is counted.

  • The model is the toolkit that can be used to assess the feature importance on our conversion goal.
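
The rule that only usage from before conversion is counted can be sketched in plain Python. This is an illustration only, not the model hub's implementation: the events, the integer `moment` values, and the names `X_sketch` and `y_sketch` are made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical press events: (user_id, moment, root location, is_conversion_event)
events = [
    ("a", 1, "home", False),
    ("a", 2, "blog", False),
    ("a", 3, "modeling", True),   # user "a" converts here
    ("a", 4, "home", False),      # happens after conversion, so it is ignored
    ("b", 1, "home", False),
    ("b", 2, "about", False),
]

# Find the moment of the first conversion event per user.
first_conversion = {}
for user_id, moment, _root, is_conversion in events:
    if is_conversion:
        first_conversion[user_id] = min(moment, first_conversion.get(user_id, moment))

# Count presses per root location, keeping only presses before conversion.
presses = defaultdict(Counter)
for user_id, moment, root, _is_conversion in events:
    cutoff = first_conversion.get(user_id)
    if cutoff is None or moment < cutoff:
        presses[user_id][root] += 1

X_sketch = {user: dict(counts) for user, counts in presses.items()}
y_sketch = {user: user in first_conversion for user in presses}
print(X_sketch)
print(y_sketch)
```

In this sketch, user "a" keeps one press each for "home" and "blog", while the conversion press and the later "home" press are left out.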

In:
y_temp.head()
Out:
user_id
005aa19c-7e80-4960-928c-a0853355ee5f False
0115c0f1-1145-49bd-80a5-66f4548a7a39 False
01891784-6333-40f1-8be6-739f3adfdb97 False
021f2c2f-f441-4e11-875c-20dc27aaf57e False
02c42c27-1c0d-4e3e-b6c0-403a60e8eb83 False
Name: is_converted, dtype: bool
In:
X_temp.head()
Out:
                                      jobs home join-slack tracking blog about taxonomy privacy
user_id
005aa19c-7e80-4960-928c-a0853355ee5f 0 0 0 0 0 2 0 0
0115c0f1-1145-49bd-80a5-66f4548a7a39 0 1 0 0 0 0 0 0
01891784-6333-40f1-8be6-739f3adfdb97 0 9 0 0 0 0 2 0
021f2c2f-f441-4e11-875c-20dc27aaf57e 1 3 0 0 1 1 0 0
02c42c27-1c0d-4e3e-b6c0-403a60e8eb83 0 0 0 0 0 0 3 0

In our example, we will assess the model’s accuracy in detail, so we won’t jump to the model results yet. Instead, we first look at our data and prepare a proper dataset for the model.

In:
data_set_temp = X_temp.copy()
# we save the columns that are in our dataset, these will be used later.
columns = X_temp.data_columns
data_set_temp['is_converted'] = y_temp
data_set_temp['total_press'] = modelhub.map.sum_feature_rows(X_temp)

Review the dataset

A logistic regression comes with several assumptions that should be fulfilled, such as a sufficient sample size, no influential outliers, and a linear relation between the features and the logit of the goal. We’ll first look at our data to get the best possible dataset for our model.

In:
data_set_temp.describe().head()
Out:
         jobs    home  join-slack  tracking    blog   about  taxonomy  privacy  total_press
__stat
count  543.00  543.00      543.00    543.00  543.00  543.00    543.00   543.00       543.00
mean     0.25    2.59        0.00      0.42    0.31    0.37      0.65     0.00         4.59
std      0.87    2.94        0.04      2.12    0.80    0.86      2.51     0.06         5.14
min      0.00    0.00        0.00      0.00    0.00    0.00      0.00     0.00         1.00
max     12.00   27.00        1.00     30.00    9.00    7.00     33.00     1.00        45.00

This shows that we have 543 samples in our data. The mean and standard deviation are low for most features, while the maximum values are much higher. This indicates that feature usage is heavily skewed: most users have few or no presses in a given root location.

In:
data_set_temp.is_converted.value_counts().head()
Out:
is_converted
False 469
True 74
Name: value_counts, dtype: int64
In:
(data_set_temp.is_converted.value_counts()/data_set_temp.is_converted.count()).head()
Out:
is_converted
False 0.86372
True 0.13628
Name: value_counts, dtype: float64

The dataset is not balanced in terms of users that did or did not reach conversion: 74 of the 543 users converted (13.6%). While this is not necessarily a problem, it influences the metric we choose to look at for model performance. The model that we instantiated already accounts for this.
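
To see why the imbalance matters for the choice of metric, consider a trivial predictor that labels every user as not converted. The short calculation below uses the counts from the output above; the variable names are only illustrative.

```python
# Counts taken from the value_counts() output above.
not_converted, converted = 469, 74
total = not_converted + converted

# A predictor that always answers "not converted" is right for every
# non-converted user and wrong for every converted user.
always_false_accuracy = not_converted / total
always_false_recall = 0 / converted  # it never finds a converted user

print(f"accuracy: {always_false_accuracy:.5f}")
print(f"recall:   {always_false_recall:.5f}")
```

Such a predictor reaches an accuracy of about 86% without learning anything about conversion, which is why plain accuracy is a poor yardstick here.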

We can also plot histograms of the features, so we can inspect the distributions more closely.

In:
figure, axis = plt.subplots(len(columns), 2, figsize=(15,30))

for idx, name in enumerate(columns):
    data_set_temp[data_set_temp.is_converted==True][[name]].plot.hist(bins=20, title='Converted', ax=axis[idx][0])
    data_set_temp[data_set_temp.is_converted==False][[name]].plot.hist(bins=20, title='Not converted', ax=axis[idx][1])
plt.tight_layout()

Histogram plots of features

We see that some features are not useful at all (‘join-slack’ and ‘privacy’), so we will remove them. Moreover, we think that users who click only once in any of the root locations will not provide us with any behavior that explains the goal. Those users might, for instance, be users that wanted to go to our modeling section, and this was the quickest way to get there with the results Google provided them. In that case, the intent of the user (something of which we can never be 100% sure) was going to the modeling section. The features did not convince them.

By filtering like this, it is more likely that the features used on our website did, or did not, convince a user to check out the modeling section of our docs. This is exactly what we are after. An additional advantage is that the distribution of feature usage will most likely become more favorable after removing 1-press users.

In:
# remove useless features.
data_set_temp = data_set_temp.drop(columns=['privacy', 'join-slack'])
# we update the columns that are still in our dataset.
columns = [x for x in data_set_temp.data_columns if x in X_temp.data_columns]
# only use users with more than 1 press.
data_set_temp = data_set_temp[data_set_temp.total_press>1]

If we now rerun the code above to review the dataset, we find that it is more balanced (16.5% converted), although it is a bit small now (406 samples). The distributions, as shown by the summary statistics and the histograms, indeed look better for our model now. We will use this dataset to create the X and y datasets that we will use in the model.
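
As a small illustration of the filtering step, the sketch below applies the same two rules (drop ‘privacy’ and ‘join-slack’, keep users with more than one press) to the five users shown in the X_temp.head() output earlier. It uses plain Python dictionaries rather than an Objectiv DataFrame, and the user ids are shortened for readability.

```python
columns_all = ["jobs", "home", "join-slack", "tracking", "blog", "about", "taxonomy", "privacy"]

# Press counts of the five users from X_temp.head(), user ids shortened.
sample = {
    "005aa19c": [0, 0, 0, 0, 0, 2, 0, 0],
    "0115c0f1": [0, 1, 0, 0, 0, 0, 0, 0],
    "01891784": [0, 9, 0, 0, 0, 0, 2, 0],
    "021f2c2f": [1, 3, 0, 0, 1, 1, 0, 0],
    "02c42c27": [0, 0, 0, 0, 0, 0, 3, 0],
}
dropped = {"privacy", "join-slack"}

filtered = {}
for user_id, counts in sample.items():
    # total_press is computed over all columns, before any column is dropped.
    total_press = sum(counts)
    if total_press > 1:
        row = {name: n for name, n in zip(columns_all, counts) if name not in dropped}
        row["total_press"] = total_press
        filtered[user_id] = row

print(list(filtered))
```

Only the user with a single press on ‘home’ is removed; the other four users stay in the dataset.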

In:
data_set_temp.describe().head()
Out:
         jobs    home  tracking    blog   about  taxonomy  total_press
__stat
count  406.00  406.00    406.00  406.00  406.00    406.00       406.00
mean     0.32    3.20      0.56    0.39    0.47      0.85         5.80
std      0.99    3.17      2.43    0.90    0.97      2.87         5.43
min      0.00    0.00      0.00    0.00    0.00      0.00         2.00
max     12.00   27.00     30.00    9.00    7.00     33.00        45.00
In:
data_set_temp.is_converted.value_counts().head()
Out:
is_converted
False 339
True 67
Name: value_counts, dtype: int64
In:
(data_set_temp.is_converted.value_counts()/data_set_temp.is_converted.count()).head()
Out:
is_converted
False 0.834975
True 0.165025
Name: value_counts, dtype: float64
In:
figure, axis = plt.subplots(len(columns), 2, figsize=(15,30))

for idx, name in enumerate(columns):
    data_set_temp[data_set_temp.is_converted==True][[name]].plot.hist(bins=20, title='Converted', ax=axis[idx][0])
    data_set_temp[data_set_temp.is_converted==False][[name]].plot.hist(bins=20, title='Not converted', ax=axis[idx][1])
plt.tight_layout()

Histogram plots of filtered features

In:
X = data_set_temp[columns]
y = data_set_temp.is_converted

Train and evaluate the model

As mentioned above, the model is based on logistic regression. Logistic regression is a sensible choice because it is a classification method whose feature coefficients are relatively easy to interpret. The feature importance model uses the AUC (area under the ROC curve) to assess performance. This is because we are more interested in the coefficients than in the actual predicted labels, and because this metric can handle imbalanced datasets.

By default, the feature importance model trains a logistic regression model three times, with the dataset split into three folds. This way we not only calculate the AUC on a single test set after training, but can also see whether the model’s coefficients are relatively stable when trained on different data. After fitting, the results (the average coefficients of the three models) and the performance of the three models can be retrieved with model methods.
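
The idea of a three-fold split can be sketched as follows. This assumes a standard k-fold scheme in which each fold serves as the test set once; the model hub may divide training and test data differently. The function name `three_fold_indices` and its `seed` argument are hypothetical and unrelated to the `seed` parameter of `model.fit`.

```python
import random

def three_fold_indices(n_samples, n_folds=3, seed=0):
    """Return (train, test) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# 406 is the number of samples left after filtering.
splits = three_fold_indices(406)
for train, test in splits:
    print(len(train), len(test))
```

Each sample ends up in exactly one test fold, so every model is evaluated on data it was not trained on.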

In:
# train the model
model.fit(X, y, seed=.4)
model.results()
Out:
         coefficients_mean coefficients_std
about -0.403494 0.105327
jobs -0.016932 0.101370
home 0.033575 0.017482
taxonomy 0.084558 0.042645
blog 0.095825 0.104930
tracking 0.170703 0.107747

The mean of the coefficients is returned together with the standard deviation. The lower the standard deviation, the more stable the coefficients across the runs. Our results show that ‘about’ has the most negative impact on conversion, while ‘tracking’, ‘blog’ and ‘taxonomy’ have the most positive impact.
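
One common way to read logistic regression coefficients is as odds ratios: exponentiating a coefficient gives the factor by which the odds of conversion change for one additional press, with the other features held constant. The sketch below applies this to the mean coefficients reported above. It assumes that the coefficients are ordinary log-odds coefficients on unscaled press counts, which is not confirmed here.

```python
import math

# Mean coefficients as reported by model.results() above.
coefficients_mean = {
    "about": -0.403494,
    "jobs": -0.016932,
    "home": 0.033575,
    "taxonomy": 0.084558,
    "blog": 0.095825,
    "tracking": 0.170703,
}

# exp(coefficient) is the multiplicative change in the odds of conversion
# per additional press (assumption: standard, unscaled logistic regression).
odds_ratios = {name: math.exp(coef) for name, coef in coefficients_mean.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda item: item[1]):
    print(f"{name:<10}{ratio:.3f}")
```

Under that assumption, a ratio below 1 (as for ‘about’) means each extra press lowers the odds of conversion, and a ratio above 1 (as for ‘tracking’) means it raises them.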

In:
model.auc()
Out:
0.6935796984854313

The average AUC of our models is 0.69. This is better than a baseline model (0.5 AUC). However, it also means that it is not a perfect model and therefore the chosen features alone cannot predict conversion completely.
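
AUC can be read as the probability that a randomly chosen converted user receives a higher predicted score than a randomly chosen non-converted user, with ties counting as half. The sketch below computes it this way in plain Python on made-up labels and scores. It illustrates the metric and is not the model hub’s own implementation.

```python
def auc(labels, scores):
    """Pairwise AUC: share of (converted, not converted) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted scores for five users.
labels = [True, True, False, False, False]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(auc(labels, scores))

# A model that gives every user the same score cannot rank users at all.
constant_scores = [0.5] * len(labels)
print(auc(labels, constant_scores))
```

The constant predictor scores 0.5, which is the baseline that the 0.69 above is compared against.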

Among other things, future models might be improved by a larger dataset, other explanatory variables (e.g. more detailed locations instead of only root locations), and more information on the users (e.g. user referrer as a proxy for user intent).

In:
model.results(full=True)
Out:
       jobs     home tracking     blog     about taxonomy
0 0.086162 0.013394 0.078106 0.212065 -0.458034 0.053494
0 -0.020472 0.043257 0.288966 0.067309 -0.282082 0.133179
0 -0.116485 0.044074 0.145039 0.008101 -0.470368 0.067001
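
The summary returned by model.results() can be checked against these three rows. The sketch below recomputes the mean and standard deviation per feature with the standard library. Using the sample standard deviation (`statistics.stdev`) is an assumption, but it reproduces the reported values up to rounding of the displayed numbers.

```python
from statistics import mean, stdev

# Coefficients of the three fitted models, copied from the output above.
full_results = {
    "jobs": [0.086162, -0.020472, -0.116485],
    "home": [0.013394, 0.043257, 0.044074],
    "tracking": [0.078106, 0.288966, 0.145039],
    "blog": [0.212065, 0.067309, 0.008101],
    "about": [-0.458034, -0.282082, -0.470368],
    "taxonomy": [0.053494, 0.133179, 0.067001],
}

# Mean and sample standard deviation per feature across the three models.
summary = {name: (mean(values), stdev(values)) for name, values in full_results.items()}

for name, (coef_mean, coef_std) in sorted(summary.items(), key=lambda item: item[1][0]):
    print(f"{name:<10}{coef_mean:>11.6f}{coef_std:>11.6f}")
```

Features whose standard deviation is large relative to the mean, such as ‘jobs’ and ‘blog’, change noticeably between the three models, so their direction of impact should be read with more caution.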

Get the SQL for any analysis

The SQL for any analysis can be exported with one command, so you can use models in production directly to simplify data debugging & delivery to BI tools like Metabase, dbt, etc. See how you can quickly create BI dashboards with this.

That’s it! The model to run this directly using the open model hub is currently still in development; see the draft PR.

Join us on Slack if you have any questions or suggestions.

Next Steps

Try the notebooks in Objectiv Up

Spin up a full-fledged product analytics pipeline with Objectiv Up in under 5 minutes, and play with the included example notebooks yourself.