Bach and sklearn

This example notebook shows how you can use Objectiv to create a basic feature set and use sklearn to do Machine Learning directly on the raw data in your SQL database. We also have an example that goes deeper into :doc:` feature engineering <./feature-engineering>`.

This notebook is also available as a full Jupyter notebook to run on your own data (see how to get started in your notebook). The dataset used here is the same as in Objectiv Up.

Get started

We first have to instantiate the model hub and an Objectiv DataFrame object.

In:
# set the timeframe of the analysis
start_date = '2022-03-01'
end_date = None
In:
from modelhub import ModelHub, display_sql_as_markdown
from sklearn import cluster
# instantiate the model hub and set the default time aggregation to daily
modelhub = ModelHub(time_aggregation='%Y-%m-%d')
# get a Bach DataFrame with Objectiv data within a defined timeframe
df = modelhub.get_objectiv_dataframe(start_date=start_date, end_date=end_date)

This object points to all data in the dataset, which is too large to run in pandas and therefore sklearn. For the dataset that we need, we will aggregate to user level, at which point it is small enough to fit in memory.

Create the dataset

We’ll create a dataset of all the root locations that a user clicked on, per user.

In:
df['root_location'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')
# root_location series is later unstacked and its values might contain dashes
# which are not allowed in BigQuery column names, lets replace them
df['root_location'] = df['root_location'].str.replace('-', '_')
In:
features = df[(df.event_type=='PressEvent')].groupby('user_id').root_location.value_counts()
features.head()
Out:
user_id                               root_location
8d3f2b11-bbc6-46ef-ae71-00b7d6cccf02 modeling 195
home 150
995701a7-a950-4100-b0b8-4382b45544c9 modeling 128
925b7b60-dba0-4b9a-a5ee-8837924e7fc5 tracking 93
b4b5ce02-7215-4193-9417-2df2faae4b03 modeling 74
Name: value_counts, dtype: int64
In:
features_unstacked = features.unstack(fill_value=0)
# sample or not
kmeans_frame = features_unstacked
# for BigQuery the table name should be 'YOUR_PROJECT.YOUR_WRITABLE_DATASET.YOUR_TABLE_NAME'
kmeans_frame = features_unstacked.get_sample(table_name='kmeans_test', sample_percentage=50, overwrite=True)

Now we have a basic feature set that is small enough to fit in memory. This can be used with sklearn, as we demonstrate in this example.

Export to pandas for sklearn

In:
# export to pandas now
pdf = kmeans_frame.to_pandas()
pdf
Out:
                                      about  blog  home  jobs  join_slack  modeling  privacy  taxonomy  tracking
user_id
005aa19c-7e80-4960-928c-a0853355ee5f 2 0 0 0 0 0 0 0 0
007f5fd7-7535-434e-aa3e-3d52f06d63ce 0 0 4 0 0 4 0 0 0
00a9146f-69a7-4d02-ad46-83b15f062d96 0 0 1 0 0 1 0 0 0
00d6517e-be05-4aa1-ac88-9c0b3a7902c5 0 0 2 0 0 0 0 0 0
00e0d3e2-5a9b-4249-a00d-3527eddf945b 0 0 2 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ...
fdf31cae-4906-476a-b206-e0a9644ced52 0 0 0 0 0 0 0 1 0
fe26b80f-cdc8-45f2-b0f7-f0f36b9c0ea2 0 0 0 0 0 1 0 0 0
fe8f4e17-cb38-4d22-abc5-3c0f7bfa60f3 0 0 1 0 0 0 0 2 0
fea402ee-b33f-4313-a7d1-c6ed285d8f9d 0 0 2 0 0 0 0 0 0
ff48d79a-195a-476a-b49d-0e212de43c96 0 0 7 1 0 0 0 0 0
<BLANKLINE>
[712 rows x 9 columns]

Do basic kmeans clustering

Now that we have a pandas DataFrame with our dataset, we can run basic kmeans clustering on it.

In:
# do basic kmeans
est = cluster.KMeans(n_clusters=3)
est.fit(pdf)
pdf['cluster'] = est.labels_

Now you can use the created clusters on your entire dataset again if you add it back to your DataFrame. This is simple, as Bach and pandas work together nicely. Your original Objectiv data now has a ‘cluster’ column.

In:
kmeans_frame['cluster'] = pdf['cluster']
kmeans_frame.sort_values('cluster').head()
Out:
                                      about  blog  home  jobs  join_slack  modeling  privacy  taxonomy  tracking  cluster
user_id
30ea7fc7-7dbe-4080-b980-55120503479d 12 10 25 2 0 3 0 7 4 0
32815da3-f3dd-448d-9ec4-bfed8534c1cc 0 0 0 0 0 0 0 13 0 0
07a41d20-a71b-4612-b46a-30c2508087bc 0 0 1 0 0 3 0 18 0 0
30501147-b09f-4637-a5d8-c205492508e4 0 4 7 0 0 2 0 6 4 0
32bde210-5808-457f-9fe7-cb93c3fc8300 2 2 8 0 0 1 1 6 4 0
In:
df_with_cluster = df.merge(kmeans_frame[['cluster']], on='user_id')
df_with_cluster.head()
Out:
                                             day                  moment                               user_id                                                                                location_stack      event_type                                                 stack_event_types  session_id  session_hit_number   root_location  cluster
event_id
a6629a25-54ec-4cc5-a84a-2440164a0779 2022-06-22 2022-06-22 20:23:24.607 0000bb2f-66e9-4e48-8e2f-7d0a82446ef4 [{'id': 'home', '_type': 'RootLocationContext', '_types': ['AbstractContext', 'AbstractLo... VisibleEvent [AbstractEvent, NonInteractiveEvent, VisibleEvent] 4056 1 home 2
94fa97a3-5623-4b7c-b953-79c9971f4e53 2022-06-22 2022-06-22 20:23:26.001 0000bb2f-66e9-4e48-8e2f-7d0a82446ef4 [{'id': 'home', '_type': 'RootLocationContext', '_types': ['AbstractContext', 'AbstractLo... HiddenEvent [AbstractEvent, HiddenEvent, NonInteractiveEvent] 4056 2 home 2
0041d6c0-af21-4a30-9374-09056b535cfc 2022-06-22 2022-06-22 20:24:00.747 0000bb2f-66e9-4e48-8e2f-7d0a82446ef4 [{'id': 'home', '_type': 'RootLocationContext', '_types': ['AbstractContext', 'AbstractLo... PressEvent [AbstractEvent, InteractiveEvent, PressEvent] 4056 3 home 2
4a957bb2-1b2c-4130-a14f-4f2c11532633 2022-06-22 2022-06-22 20:24:04.599 0000bb2f-66e9-4e48-8e2f-7d0a82446ef4 [{'id': 'about', '_type': 'RootLocationContext', '_types': ['AbstractContext', 'AbstractL... PressEvent [AbstractEvent, InteractiveEvent, PressEvent] 4056 4 about 2
592c232e-5ce4-4ae7-a22a-91daf2d61264 2022-06-22 2022-06-22 20:24:05.339 0000bb2f-66e9-4e48-8e2f-7d0a82446ef4 [{'id': 'home', '_type': 'RootLocationContext', '_types': ['AbstractContext', 'AbstractLo... MediaLoadEvent [AbstractEvent, MediaEvent, MediaLoadEvent, NonInteractiveEvent] 4056 5 home 2

You can use this column just like any other. For example, you can now use your created clusters to group models from the model hub:

In:
modelhub.aggregate.session_duration(df_with_cluster, groupby='cluster').head()
cluster
Out:
0.0   0 days 00:09:25.571389
1.0 0 days 00:11:24.219444
2.0 0 days 00:02:23.977933
NaN 0 days 00:04:39.558403
Name: session_duration, dtype: timedelta64[ns]

Get the SQL for any analysis

The SQL for any analysis can be exported with one command, so you can use models in production directly to simplify data debugging & delivery to BI tools like Metabase, dbt, etc. See how you can quickly create BI dashboards with this.

In:
# show the underlying SQL for this dataframe - works for any dataframe/model in Objectiv
display_sql_as_markdown(features)
Out:
WITH "manual_materialize___3b151791e67a85fc8c95acac4a457b07" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"cookie_id" AS "user_id",
cast("value"->>'_type' AS text) AS "event_type",
cast(cast("value"->>'_types' AS text) AS JSONB) AS "stack_event_types",
cast(cast("value"->>'location_stack' AS text) AS JSONB) AS "location_stack",
cast(cast("value"->>'time' AS text) AS bigint) AS "time"
FROM "data"
),
"getitem_where_boolean___c2409cbd16046c741733e1b3fc173f24" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"location_stack" AS "location_stack",
"time" AS "time"
FROM "manual_materialize___3b151791e67a85fc8c95acac4a457b07"
WHERE ((("day" >= cast('2022-03-01' AS date))) AND (("day" <= cast('2022-06-30' AS date))))
),
"context_data___204983e5a9e66c47992ef746e0877e2e" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types"
FROM "getitem_where_boolean___c2409cbd16046c741733e1b3fc173f24"
),
"session_starts___8a496387168b8a51ab24870a574873b6" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
CASE WHEN (extract(epoch FROM (("moment") - (lag("moment", 1, cast(NULL AS timestamp WITHOUT TIME ZONE)) OVER (PARTITION BY "user_id" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)))) <= cast(1800 AS bigint)) THEN cast(NULL AS boolean)
ELSE cast(TRUE AS boolean)
END AS "is_start_of_session"
FROM "context_data___204983e5a9e66c47992ef746e0877e2e"
),
"session_id_and_count___0e805b87769bab9933174fc44829028a" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"is_start_of_session" AS "is_start_of_session",
CASE WHEN "is_start_of_session" THEN row_number() OVER (PARTITION BY "is_start_of_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ELSE cast(NULL AS bigint)
END AS "session_start_id",
count("is_start_of_session") OVER (ORDER BY "user_id" ASC NULLS LAST, "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "is_one_session"
FROM "session_starts___8a496387168b8a51ab24870a574873b6"
),
"objectiv_sessionized_data___15eaed9578d6a47c065bd9f82a5bf1a6" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"is_start_of_session" AS "is_start_of_session",
"session_start_id" AS "session_start_id",
"is_one_session" AS "is_one_session",
first_value("session_start_id") OVER (PARTITION BY "is_one_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "session_id",
row_number() OVER (PARTITION BY "is_one_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "session_hit_number"
FROM "session_id_and_count___0e805b87769bab9933174fc44829028a"
),
"getitem_where_boolean___f69b89ddfeead29a894fccd7586179ce" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"session_id" AS "session_id",
"session_hit_number" AS "session_hit_number",
REPLACE(cast(coalesce((SELECT jsonb_agg(x.value) FROM jsonb_array_elements("location_stack") WITH ORDINALITY x WHERE ORDINALITY - 1 >= (SELECT min(CASE WHEN ('{"_type": "RootLocationContext"}'::JSONB) <@ value THEN ORDINALITY END) -1 FROM jsonb_array_elements("location_stack") WITH ORDINALITY)), '[]'::JSONB)->0->>'id' AS text), '-', '_') AS "root_location"
FROM "objectiv_sessionized_data___15eaed9578d6a47c065bd9f82a5bf1a6"
WHERE ("event_type" = 'PressEvent')
) SELECT "user_id" AS "user_id",
"root_location" AS "root_location",
cast(sum(cast(1 AS bigint)) AS bigint) AS "value_counts"
FROM "getitem_where_boolean___f69b89ddfeead29a894fccd7586179ce"
GROUP BY "user_id",
"root_location"
ORDER BY cast(sum(cast(1 AS bigint)) AS bigint) DESC NULLS LAST

That’s it! Join us on Slack if you have any questions or suggestions.

Next Steps

Try the notebooks in Objectiv Up

Spin up a full-fledged product analytics pipeline with Objectiv Up in under 5 minutes, and play with the included example notebooks yourself.

Use this notebook with your own data

You can use the example notebooks on any dataset that was collected with Objectiv’s tracker, so feel free to use them to bootstrap your own projects. They are available as Jupyter notebooks on our GitHub repository. See instructions to set up the Objectiv tracker.