Bach and sklearn
This example notebook shows how you can use Objectiv to create a basic feature set and then use sklearn to do machine learning directly on the raw data in your SQL database. We also have an example that goes deeper into :doc:`feature engineering <./feature-engineering>`.
This notebook is also available as a full Jupyter notebook to run on your own data (see how to get started in your notebook). The dataset used here is the same as in Objectiv Up.
Get started
We first have to instantiate the model hub and an Objectiv DataFrame object.
from modelhub import ModelHub, display_sql_as_markdown
from sklearn import cluster

# set the timeframe of the analysis
start_date = '2022-03-01'
end_date = None

# instantiate the model hub and set the default time aggregation to daily
modelhub = ModelHub(time_aggregation='%Y-%m-%d')
# get a Bach DataFrame with Objectiv data within the defined timeframe
df = modelhub.get_objectiv_dataframe(start_date=start_date, end_date=end_date)
This object points to all data in the dataset, which is typically too large to fit in memory, and therefore too large for pandas and sklearn. For the dataset that we need, we will aggregate to user level, at which point it is small enough to fit in memory.
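Bach keeps the data in the database and only runs a query when you ask for output, so it is cheap to peek at a few raw events before aggregating:

# fetch just a few rows to inspect the raw event data
df.head()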
Create the dataset
We’ll create a dataset of all the root locations that a user clicked on, per user.
df['root_location'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')
# the root_location series is unstacked later, and its values may contain dashes,
# which are not allowed in BigQuery column names; replace them with underscores
df['root_location'] = df['root_location'].str.replace('-', '_')
features = df[(df.event_type=='PressEvent')].groupby('user_id').root_location.value_counts()
features.head()
user_id                               root_location
8d3f2b11-bbc6-46ef-ae71-00b7d6cccf02  modeling         195
                                      home             150
995701a7-a950-4100-b0b8-4382b45544c9  modeling         128
925b7b60-dba0-4b9a-a5ee-8837924e7fc5  tracking          93
b4b5ce02-7215-4193-9417-2df2faae4b03  modeling          74
Name: value_counts, dtype: int64
features_unstacked = features.unstack(fill_value=0)
# either run on the full dataset...
kmeans_frame = features_unstacked
# ...or on a sample; for BigQuery the table name should be 'YOUR_PROJECT.YOUR_WRITABLE_DATASET.YOUR_TABLE_NAME'
kmeans_frame = features_unstacked.get_sample(table_name='kmeans_test', sample_percentage=50, overwrite=True)
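If you continue on a sample, the operations you define can later be applied to the full dataset again. A minimal sketch, assuming the `get_unsampled` method of Bach's sampling API:

# switch back from the sample to the full dataset; operations defined on the
# sampled frame are then applied to all data (assumes DataFrame.get_unsampled)
full_frame = kmeans_frame.get_unsampled()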
Now we have a basic feature set that is small enough to fit in memory. This can be used with sklearn, as we demonstrate in this example.
Export to pandas for sklearn
# export the feature set to a pandas DataFrame; this fetches the data into memory
pdf = kmeans_frame.to_pandas()
pdf
                                      about  blog  home  jobs  join_slack  modeling  privacy  taxonomy  tracking
user_id
005aa19c-7e80-4960-928c-a0853355ee5f      2     0     0     0           0         0        0         0         0
007f5fd7-7535-434e-aa3e-3d52f06d63ce      0     0     4     0           0         4        0         0         0
00a9146f-69a7-4d02-ad46-83b15f062d96      0     0     1     0           0         1        0         0         0
00d6517e-be05-4aa1-ac88-9c0b3a7902c5      0     0     2     0           0         0        0         0         0
00e0d3e2-5a9b-4249-a00d-3527eddf945b      0     0     2     0           0         0        0         0         0
...                                     ...   ...   ...   ...         ...       ...      ...       ...       ...
fdf31cae-4906-476a-b206-e0a9644ced52      0     0     0     0           0         0        0         1         0
fe26b80f-cdc8-45f2-b0f7-f0f36b9c0ea2      0     0     0     0           0         1        0         0         0
fe8f4e17-cb38-4d22-abc5-3c0f7bfa60f3      0     0     1     0           0         0        0         2         0
fea402ee-b33f-4313-a7d1-c6ed285d8f9d      0     0     2     0           0         0        0         0         0
ff48d79a-195a-476a-b49d-0e212de43c96      0     0     7     1           0         0        0         0         0

[712 rows x 9 columns]
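sklearn expects a purely numeric matrix, which is exactly what the unstacked counts give us. As a quick sanity check, you can confirm the shape and dtypes of the exported frame:

# verify the exported frame: 712 users by 9 numeric feature columns
print(pdf.shape)
print(pdf.dtypes.unique())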
Do basic kmeans clustering
Now that we have a pandas DataFrame with our dataset, we can run basic kmeans clustering on it.
# fit a basic k-means model with three clusters
est = cluster.KMeans(n_clusters=3)
est.fit(pdf)
pdf['cluster'] = est.labels_
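To get a feel for the result, you can inspect the cluster sizes and the centroids; both come straight from standard sklearn attributes:

# number of users assigned to each cluster
print(pdf['cluster'].value_counts())
# the centroid of each cluster, expressed in the original feature space
print(est.cluster_centers_)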
Now you can use the created clusters on your entire dataset again, simply by adding them back to your Bach DataFrame; Bach and pandas work together nicely here. Your original Objectiv data then has a ‘cluster’ column.
kmeans_frame['cluster'] = pdf['cluster']
kmeans_frame.sort_values('cluster').head()
                                      about  blog  home  jobs  join_slack  modeling  privacy  taxonomy  tracking  cluster
user_id
30ea7fc7-7dbe-4080-b980-55120503479d     12    10    25     2           0         3        0         7         4        0
32815da3-f3dd-448d-9ec4-bfed8534c1cc      0     0     0     0           0         0        0        13         0        0
07a41d20-a71b-4612-b46a-30c2508087bc      0     0     1     0           0         3        0        18         0        0
30501147-b09f-4637-a5d8-c205492508e4      0     4     7     0           0         2        0         6         4        0
32bde210-5808-457f-9fe7-cb93c3fc8300      2     2     8     0           0         1        1         6         4        0
df_with_cluster = df.merge(kmeans_frame[['cluster']], on='user_id')
df_with_cluster.head()
day moment user_id location_stack event_type stack_event_types session_id session_hit_number root_location cluster
event_id
a6629a25-54ec-4cc5-a84a-2440164a0779 2022-06-22 2022-06-22 20:23:24.607 0000bb2f-66e9-4e48-8e2f-7d0a82446ef4 [{'id': 'home', '_type': 'RootLocationContext', '_types': ['AbstractContext', 'AbstractLo... VisibleEvent [AbstractEvent, NonInteractiveEvent, VisibleEvent] 4056 1 home 2
94fa97a3-5623-4b7c-b953-79c9971f4e53 2022-06-22 2022-06-22 20:23:26.001 0000bb2f-66e9-4e48-8e2f-7d0a82446ef4 [{'id': 'home', '_type': 'RootLocationContext', '_types': ['AbstractContext', 'AbstractLo... HiddenEvent [AbstractEvent, HiddenEvent, NonInteractiveEvent] 4056 2 home 2
0041d6c0-af21-4a30-9374-09056b535cfc 2022-06-22 2022-06-22 20:24:00.747 0000bb2f-66e9-4e48-8e2f-7d0a82446ef4 [{'id': 'home', '_type': 'RootLocationContext', '_types': ['AbstractContext', 'AbstractLo... PressEvent [AbstractEvent, InteractiveEvent, PressEvent] 4056 3 home 2
4a957bb2-1b2c-4130-a14f-4f2c11532633 2022-06-22 2022-06-22 20:24:04.599 0000bb2f-66e9-4e48-8e2f-7d0a82446ef4 [{'id': 'about', '_type': 'RootLocationContext', '_types': ['AbstractContext', 'AbstractL... PressEvent [AbstractEvent, InteractiveEvent, PressEvent] 4056 4 about 2
592c232e-5ce4-4ae7-a22a-91daf2d61264 2022-06-22 2022-06-22 20:24:05.339 0000bb2f-66e9-4e48-8e2f-7d0a82446ef4 [{'id': 'home', '_type': 'RootLocationContext', '_types': ['AbstractContext', 'AbstractLo... MediaLoadEvent [AbstractEvent, MediaEvent, MediaLoadEvent, NonInteractiveEvent] 4056 5 home 2
You can use this column just like any other. For example, you can now use your created clusters to group models from the model hub:
modelhub.aggregate.session_duration(df_with_cluster, groupby='cluster').head()
cluster
0.0 0 days 00:09:25.571389
1.0 0 days 00:11:24.219444
2.0 0 days 00:02:23.977933
NaN 0 days 00:04:39.558403
Name: session_duration, dtype: timedelta64[ns]
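The same pattern works for other model hub aggregations. For instance, assuming the `unique_users` aggregate available in the model hub, you could count users per cluster:

# count unique users per cluster (assumes modelhub.aggregate.unique_users)
modelhub.aggregate.unique_users(df_with_cluster, groupby='cluster').head()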
Get the SQL for any analysis
The SQL for any analysis can be exported with one command, so you can use models in production directly, simplifying data debugging and delivery to BI tools like Metabase, or to transformation frameworks like dbt. See how you can quickly create BI dashboards with this.
# show the underlying SQL for this dataframe - works for any dataframe/model in Objectiv
display_sql_as_markdown(features)
WITH "manual_materialize___3b151791e67a85fc8c95acac4a457b07" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"cookie_id" AS "user_id",
cast("value"->>'_type' AS text) AS "event_type",
cast(cast("value"->>'_types' AS text) AS JSONB) AS "stack_event_types",
cast(cast("value"->>'location_stack' AS text) AS JSONB) AS "location_stack",
cast(cast("value"->>'time' AS text) AS bigint) AS "time"
FROM "data"
),
"getitem_where_boolean___c2409cbd16046c741733e1b3fc173f24" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"location_stack" AS "location_stack",
"time" AS "time"
FROM "manual_materialize___3b151791e67a85fc8c95acac4a457b07"
WHERE ((("day" >= cast('2022-03-01' AS date))) AND (("day" <= cast('2022-06-30' AS date))))
),
"context_data___204983e5a9e66c47992ef746e0877e2e" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types"
FROM "getitem_where_boolean___c2409cbd16046c741733e1b3fc173f24"
),
"session_starts___8a496387168b8a51ab24870a574873b6" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
CASE WHEN (extract(epoch FROM (("moment") - (lag("moment", 1, cast(NULL AS timestamp WITHOUT TIME ZONE)) OVER (PARTITION BY "user_id" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)))) <= cast(1800 AS bigint)) THEN cast(NULL AS boolean)
ELSE cast(TRUE AS boolean)
END AS "is_start_of_session"
FROM "context_data___204983e5a9e66c47992ef746e0877e2e"
),
"session_id_and_count___0e805b87769bab9933174fc44829028a" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"is_start_of_session" AS "is_start_of_session",
CASE WHEN "is_start_of_session" THEN row_number() OVER (PARTITION BY "is_start_of_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ELSE cast(NULL AS bigint)
END AS "session_start_id",
count("is_start_of_session") OVER (ORDER BY "user_id" ASC NULLS LAST, "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "is_one_session"
FROM "session_starts___8a496387168b8a51ab24870a574873b6"
),
"objectiv_sessionized_data___15eaed9578d6a47c065bd9f82a5bf1a6" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"is_start_of_session" AS "is_start_of_session",
"session_start_id" AS "session_start_id",
"is_one_session" AS "is_one_session",
first_value("session_start_id") OVER (PARTITION BY "is_one_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "session_id",
row_number() OVER (PARTITION BY "is_one_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "session_hit_number"
FROM "session_id_and_count___0e805b87769bab9933174fc44829028a"
),
"getitem_where_boolean___f69b89ddfeead29a894fccd7586179ce" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"session_id" AS "session_id",
"session_hit_number" AS "session_hit_number",
REPLACE(cast(coalesce((SELECT jsonb_agg(x.value) FROM jsonb_array_elements("location_stack") WITH ORDINALITY x WHERE ORDINALITY - 1 >= (SELECT min(CASE WHEN ('{"_type": "RootLocationContext"}'::JSONB) <@ value THEN ORDINALITY END) -1 FROM jsonb_array_elements("location_stack") WITH ORDINALITY)), '[]'::JSONB)->0->>'id' AS text), '-', '_') AS "root_location"
FROM "objectiv_sessionized_data___15eaed9578d6a47c065bd9f82a5bf1a6"
WHERE ("event_type" = 'PressEvent')
) SELECT "user_id" AS "user_id",
"root_location" AS "root_location",
cast(sum(cast(1 AS bigint)) AS bigint) AS "value_counts"
FROM "getitem_where_boolean___f69b89ddfeead29a894fccd7586179ce"
GROUP BY "user_id",
"root_location"
ORDER BY cast(sum(cast(1 AS bigint)) AS bigint) DESC NULLS LAST
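If you prefer the query as a plain string, for example to save it to a file or hand it to a scheduler, Bach DataFrames also expose a `view_sql` method; a minimal sketch, assuming it is available in your Bach version:

# get the generated SQL as a string instead of rendered markdown
# (assumes bach.DataFrame.view_sql; the Series is converted to a DataFrame first)
sql = features.to_frame().view_sql()
print(sql)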
That’s it! Join us on Slack if you have any questions or suggestions.
Next Steps
Try the notebooks in Objectiv Up
Spin up a full-fledged product analytics pipeline with Objectiv Up in under 5 minutes, and play with the included example notebooks yourself.
Use this notebook with your own data
You can use the example notebooks on any dataset that was collected with Objectiv’s tracker, so feel free to use them to bootstrap your own projects. They are available as Jupyter notebooks on our GitHub repository. See instructions to set up the Objectiv tracker.
Check out related example notebooks
- Feature engineering - see how modeling library Bach can be used for feature engineering.