Feature engineering with Bach
This example shows how Bach can be used for feature engineering. We'll walk through describing the data, finding outliers, transforming data, and grouping and aggregating data, so that we end up with a feature set that is useful for machine learning. A separate example goes into the details of how a data set prepared in Bach can be used for machine learning with sklearn here.
First, we install the open model hub and instantiate the Objectiv DataFrame object. See Getting started with Objectiv for more info on this.
This object points to all data in the data set, which is too large to load into pandas (and therefore sklearn). For the features that we need, we aggregate to user level, at which point the data is small enough to fit in memory.
We start by showing the first couple of rows of the data set and describing the entire data set.
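Bach mirrors the pandas API for these operations. As a minimal sketch of what head() and describe() do, here is the pandas equivalent on a hypothetical toy stand-in for the event data (not the real Objectiv schema):

```python
import pandas as pd

# Hypothetical toy stand-in for the Objectiv event data (not the real schema).
toy = pd.DataFrame({
    'event_type': ['PressEvent', 'VisibleEvent', 'PressEvent'],
    'session_hit_number': [1, 2, 3],
})

print(toy.head())      # first rows of the data set
print(toy.describe())  # summary statistics of the numeric columns
```

In Bach, the same calls run as SQL on the database instead of in memory.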
Creating a feature set
We'd like to create a feature set that describes the behaviour of users. We start by extracting
the root location from the location stack. This indicates which parts of our website users have visited. Using
to_numpy() shows the results as a numpy array.
df['root'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')
This returns ['jobs', 'docs', 'home', …]: the sections of the objectiv.io website.
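Under the hood, the location stack is a list of context dicts. A hedged pandas sketch of the equivalent extraction, on toy data with a hypothetical helper (the real Bach accessor runs this logic in SQL):

```python
import pandas as pd

# Toy location stacks: lists of context dicts (simplified from the real taxonomy).
stacks = pd.Series([
    [{'_type': 'RootLocationContext', 'id': 'home'}, {'_type': 'SectionContext', 'id': 'hero'}],
    [{'_type': 'RootLocationContext', 'id': 'docs'}],
])

def get_root_id(stack):
    # Return the id of the first RootLocationContext in the stack, if any.
    for ctx in stack:
        if ctx['_type'] == 'RootLocationContext':
            return ctx['id']
    return None

roots = stacks.apply(get_root_id)
print(roots.to_numpy())  # → ['home' 'docs']
```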
Check missing values
A quick check shows that there are no missing values to worry about. Now we want a data set with interactions on our different sections, in particular presses, which is an event type. We first get an overview of the different event types that exist and select the one we are interested in.
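Both checks are one-liners in the pandas-style API. A sketch on hypothetical toy data (the real Bach calls look the same but execute in the database):

```python
import pandas as pd

# Hypothetical toy frame standing in for the Bach DataFrame.
toy = pd.DataFrame({
    'root': ['home', 'docs', None, 'jobs'],
    'event_type': ['PressEvent', 'VisibleEvent', 'PressEvent', 'PressEvent'],
})

print(toy.root.isnull().sum())        # number of missing root locations
print(toy.event_type.value_counts())  # overview of the event types
```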
We are interested in ‘PressEvent’.
Creating the variables
The next code block selects only press events, groups by 'user_id' and 'root', and counts the session_hit_number. The results are then unstacked, producing a table where each row represents a user (the index is 'user_id'), the columns are the different root locations, and each value is the number of times that user clicked in that section.
features = df[(df.event_type=='PressEvent')].groupby(['user_id','root']).session_hit_number.count()
features_unstacked = features.unstack()
Fill empty values
Now we do have empty values, so we fill them with 0: an empty value means that the user did not click in that section.
features_unstacked = features.unstack(fill_value=0)
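On toy data, the groupby/unstack pattern with fill_value behaves as follows in pandas (Bach mirrors this API; the user ids and counts below are hypothetical):

```python
import pandas as pd

# Hypothetical toy press events.
toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'root': ['home', 'docs', 'home'],
    'session_hit_number': [1, 2, 1],
})

features = toy.groupby(['user_id', 'root']).session_hit_number.count()
wide = features.unstack(fill_value=0)  # users as rows, root locations as columns
print(wide)
```

User u2 never clicked in 'docs', so that cell is filled with 0 instead of being left empty.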
Describe the data set
We use describe again to get an impression of our newly created per-user data set.
Looking at the mean, some sections are used a lot more than others. The maximum number of clicks also differs considerably per root section. This information can be used to drop some of the variables from our data set, or to use scaling or outlier detection. We will plot histograms of the variables to inspect their distributions.
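As an illustration of the scaling option, a minimal min-max scaling sketch in pandas on hypothetical per-user counts (sklearn's preprocessing module offers equivalent scalers):

```python
import pandas as pd

# Hypothetical toy per-user click counts with very different scales.
wide = pd.DataFrame({'home': [1, 2, 100], 'docs': [0, 1, 3]}, index=['u1', 'u2', 'u3'])

# Min-max scaling squeezes each column into [0, 1], taming scale differences.
scaled = (wide - wide.min()) / (wide.max() - wide.min())
print(scaled)
```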
Visualize the data
from matplotlib import pyplot as plt

figure, axis = plt.subplots(2, 4, figsize=(15, 10))
for idx, name in enumerate(features_unstacked.data_columns):
    df_bins = features_unstacked[name].cut(bins=5)
    df_bins.value_counts().to_pandas().plot(title=name, kind='bar', ax=axis.flat[idx])
The histograms show that the higher values are indeed quite anomalous for most of the root locations. This could be a reason to drop some of these observations or to resort to scaling methods. For now we continue with the data set as is.
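If we did want to tame the anomalous values without dropping users, one common option is to cap them at a high quantile. A hedged pandas sketch on hypothetical counts (not part of the example's actual pipeline):

```python
import pandas as pd

# Hypothetical toy click counts with one extreme value.
clicks = pd.Series([1, 2, 2, 3, 50], name='home')

# One option: clip at the 95th percentile instead of dropping users outright.
capped = clicks.clip(upper=clicks.quantile(0.95))
print(capped)
```

Values below the cap are untouched; only the extreme observation is pulled in.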
Add time feature
Now we want to add a time feature to our data set: the average session length per user. We can use the
model hub for this.
fillna is used to fill missing values with a zero duration; note that this requires the datetime module.

import datetime

features_unstacked['session_duration'] = df.mh.agg.session_duration(groupby='user_id').fillna(datetime.timedelta(0))
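One practical note: sklearn expects numeric input, so a timedelta column has to be converted before model fitting. A hedged sketch in pandas on hypothetical values (the column name matches the one created above):

```python
import datetime
import pandas as pd

# Hypothetical toy per-user durations; in the real flow this column comes
# from the model hub's session_duration aggregation.
pdf = pd.DataFrame({'session_duration': [datetime.timedelta(seconds=90),
                                         datetime.timedelta(0)]})

# Convert timedeltas to a numeric number of seconds for sklearn.
pdf['session_duration'] = pdf['session_duration'].dt.total_seconds()
print(pdf['session_duration'].tolist())  # → [90.0, 0.0]
```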
Export to pandas for sklearn
Now that we have our data set, we can use it for machine learning, for example with sklearn. To do so, call
to_pandas() to get a pandas DataFrame that can be used in sklearn.
pdf = features_unstacked.to_pandas()
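The separate sklearn example goes into this in depth; as a quick smoke test of the exported frame, a hedged sketch fitting a small clustering model (toy data and hypothetical column names, not the real exported features):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy stand-in for features_unstacked.to_pandas().
pdf = pd.DataFrame({'home': [1, 2, 10, 11], 'docs': [0, 1, 5, 6]})

# Fit a small clustering model directly on the pandas DataFrame.
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(pdf)
print(model.labels_)
```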