Open model hub basics

This example notebook shows how you can use the pre-built models from the open model hub in conjunction with the Bach modeling library to quickly build model stacks that answer common analytics questions.

It’s also available as a full Jupyter notebook to run on your own data (see how to get started in your notebook). The dataset used here is the same as in Objectiv Up.

Using the open model hub

The following types of functions/models are provided:

  1. Helper functions: Simplify manipulating and analyzing the data.

  2. Aggregation models: Enable running some of the more common data analyses and product analytics metrics.

  3. Machine learning models: ML models such as logistic regression.

  4. Funnels: Analyze funnels, e.g. discover all the (top) user journeys that lead to conversion or drop-off.

Modeling behavior of users and groups is enabled through configurable Identity Resolution.
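
For instance, an identity can be resolved when creating the DataFrame. The sketch below is hedged: the identity_resolution parameter of get_objectiv_dataframe (the model hub instance is created in the next section) and the 'email' identity name are assumptions based on the Identity Resolution documentation.

# sketch: resolve users by a custom identity instead of the default cookie-based id;
# the identity_resolution parameter and the 'email' identity name are assumptions here
df = modelhub.get_objectiv_dataframe(start_date='2022-03-01', identity_resolution='email')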

Helper functions always return a Series with the same shape and index as the DataFrame they are applied to. This ensures they can be added as a column to that DataFrame. Helper functions that return SeriesBoolean can be used to filter the data. The helper functions can be accessed with the map accessor from a model hub instance.

Aggregation models perform multiple Bach instructions that run some of the more common data analyses and product analytics metrics. They always return aggregated data in some form from the DataFrame they’re applied to. Aggregation models can be accessed with the aggregate accessor from a model hub instance.

Most of the model hub helper functions and aggregation models take data as their first argument: this is the DataFrame with the Objectiv data to apply the model to.
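
For example, the calling pattern looks like this (a minimal sketch, using models that appear later in this notebook):

# helper function: returns a Series with the same shape and index as df
is_new = modelhub.map.is_new_user(df)
# aggregation model: returns aggregated data
daily_users = modelhub.aggregate.unique_users(df)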

For an example of a machine learning model, see the logistic regression example notebook.

Below we’ll showcase a selection of the models from the open model hub.

Get started

We first have to instantiate the model hub and an Objectiv DataFrame object.

In:
# set the timeframe of the analysis
start_date = '2022-03-01'
end_date = None
In:
from modelhub import ModelHub, display_sql_as_markdown
from sql_models.util import is_bigquery
# instantiate the model hub and set the default time aggregation to daily
# and set the global contexts that will be used in this example
modelhub = ModelHub(time_aggregation='%Y-%m-%d', global_contexts=['application'])
# get a Bach DataFrame with Objectiv data within a defined timeframe
df = modelhub.get_objectiv_dataframe(start_date=start_date, end_date=end_date)

Run a simple aggregation model

Calculating the number of unique users is one of the included models. As it is an aggregation model, it's called with modelhub.aggregate.unique_users() (or modelhub.agg.unique_users() for short). It uses the time_aggregation that was set when the model hub was instantiated.

With .head() we immediately query the data to show the results. .to_pandas() can be used to get all results as a pandas object in Python. These (and the following) results are sorted in descending order, so we show the latest data first.
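
The call itself looks like this (the output is omitted here, as it depends on your data):

In:
# count unique users per day, using the default time_aggregation
modelhub.aggregate.unique_users(df).sort_index(ascending=False).head()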

Mapping and filtering

Here we’ll use map operations from the open model hub to label certain users or events, in order to filter on them or combine them with other models later on.

First, we'll use map to label users as new. As time_aggregation was set to '%Y-%m-%d', all hits are labeled as new for the entire day in which the user had their first session.

In:
# label new users & all their events on the defined time_aggregation
df['is_new_user'] = modelhub.map.is_new_user(df)
df.is_new_user.head(10)
Out:
event_id
d4a0cb80-729c-4e17-9a42-6cb48672250f True
75afa7bc-5237-4033-a833-bf9e0e85a3c1 True
0ae59c2c-2a2e-480c-8212-23d7aed2ae3c True
e2d95395-e7c1-4ab1-bf32-616bb485ff02 True
75447a30-f379-4a8f-8568-77b9cb0b5039 True
8e86f8e3-be55-4c5c-8a23-ba8dea38c35d True
854ec5e1-5c27-4c3b-b01c-98671eaa8da7 True
adc6d6e1-9525-4e9c-9a02-1f79cd6428a6 True
17aad036-4cca-4c31-a115-9601510907fc True
e139877e-ef37-426e-bc89-15e11d87effb True
Name: is_new_user, dtype: bool

We can also label conversion events. To do this, we first have to define what a conversion is, by setting the type of event and the location in the product where the event was triggered (see more about the location stack here), using add_conversion_event.

In:
# define a conversion event, and label events whether they converted
modelhub.add_conversion_event(
    location_stack=df.location_stack.json[{'id': 'Quickstart Guide', '_type': 'LinkContext'}:],
    event_type='PressEvent',
    name='quickstart_presses'
)
df['conversion_events'] = modelhub.map.is_conversion_event(df, 'quickstart_presses')
df.conversion_events.head(10)
Out:
event_id
d4a0cb80-729c-4e17-9a42-6cb48672250f False
75afa7bc-5237-4033-a833-bf9e0e85a3c1 False
0ae59c2c-2a2e-480c-8212-23d7aed2ae3c False
e2d95395-e7c1-4ab1-bf32-616bb485ff02 False
75447a30-f379-4a8f-8568-77b9cb0b5039 False
8e86f8e3-be55-4c5c-8a23-ba8dea38c35d False
854ec5e1-5c27-4c3b-b01c-98671eaa8da7 False
adc6d6e1-9525-4e9c-9a02-1f79cd6428a6 False
17aad036-4cca-4c31-a115-9601510907fc False
e139877e-ef37-426e-bc89-15e11d87effb False
Name: conversion_events, dtype: bool

Combine mapping with filtering and aggregation

As the map functions above return a SeriesBoolean, they can be combined with filter and aggregation models. We use the same aggregation model we showed earlier (unique_users), but now apply the df.conversion_events filter to look only at unique converted users per day.

In:
# filter unique users on whether they've converted
modelhub.aggregate.unique_users(df[df.conversion_events]).sort_index(ascending=False).head(10)
Out:
time_aggregation
2022-03-31 1
2022-03-28 1
2022-03-26 1
2022-03-24 1
2022-03-22 3
2022-03-21 4
2022-03-17 1
2022-03-15 1
2022-03-13 1
2022-03-09 2
Name: unique_users, dtype: int64

Other aggregation models can be used similarly. In the example below, the average session duration is calculated just for new users.

In:
# calculate average session duration, filtered on new users
duration_new_users = modelhub.aggregate.session_duration(df[df.is_new_user])
duration_new_users.sort_index(ascending=False).head(10)
Out:
time_aggregation
2022-03-31 0 days 00:04:14.323667
2022-03-30 0 days 00:00:23.893000
2022-03-29 0 days 00:00:02.906500
2022-03-28 0 days 00:03:01.878000
2022-03-27 0 days 00:00:42.750600
2022-03-26 0 days 00:01:34.580250
2022-03-25 0 days 00:02:10.365571
2022-03-24 0 days 00:00:42.824500
2022-03-23 0 days 00:00:54.777167
2022-03-22 0 days 00:01:02.964167
Name: session_duration, dtype: timedelta64[ns]

Using multiple model hub filters

The model hub's map results can be combined and reused. Below we add the results of two helper functions as columns to the original DataFrame, and use them both to filter the data when applying an aggregation model. We calculate the number of users that were new in a given month and converted twice on a single day.

In:
# calculate new users & users that converted twice on a day
df['is_new_user_month'] = modelhub.map.is_new_user(df, time_aggregation='%Y-%m')
df['is_twice_converted'] = modelhub.map.conversions_in_time(df, name='quickstart_presses') == 2
# use results in an aggregation model
modelhub.aggregate.unique_users(df[df.is_new_user_month & df.is_twice_converted]).sort_index(ascending=False).head()
Out:
time_aggregation
2022-03-03 1
Name: unique_users, dtype: int64

Combine model results

Results from aggregation models can be combined if they share the same index type, similar to pandas. Below we calculate the share of new users per day.

In:
# calculate the share of new users per day using results from two aggregation models
new_user_share = modelhub.agg.unique_users(df[df.is_new_user]) / modelhub.agg.unique_users(df)
new_user_share.sort_index(ascending=False).head(10)
Out:
time_aggregation
2022-03-31 0.461538
2022-03-30 0.800000
2022-03-29 0.428571
2022-03-28 0.454545
2022-03-27 1.000000
2022-03-26 0.857143
2022-03-25 0.888889
2022-03-24 0.833333
2022-03-23 0.722222
2022-03-22 0.611111
Name: unique_users, dtype: float64

Crunch data further with the Bach modeling library

All results from the model hub are in the form of a Bach DataFrame or Series. This makes the open model hub and Bach work seamlessly together.

In:
# We'll run a lot of operations on the data in the df DataFrame. To make this easier for
# BigQuery, we tell Bach to materialize the current DataFrame as a temporary table.
# This statement has no direct effect, but any invocation of head() on the DataFrame later
# on will consist of two queries: one to create a temporary table with the current state of the
# DataFrame, and one that queries that table and does subsequent operations.
if is_bigquery(df.engine):
    df = df.materialize(materialization='temp_table')
In:
# now label the number of times a user is converted in a session at a certain moment
df['conversion_count'] = modelhub.map.conversions_in_time(df, name='quickstart_presses')

# use Bach to do any supported operation using pandas syntax.
# select users that converted
converted_users = df[df.conversion_events].user_id.unique()
# select PressEvents of users that converted
df_selection = df[(df.event_type == 'PressEvent') & (df.user_id.isin(converted_users))]
# calculate the number of PressEvents before conversion per session
presses_per_session = (
    df_selection[df_selection.conversion_count == 0]
    .groupby('session_id')
    .session_hit_number.count()
)
presses_per_session_pdf = presses_per_session.to_pandas()

Now let’s see the results, at which point the underlying query is actually executed.

In:
presses_per_session_pdf.head()
Out:
session_id
6 2
10 2
11 1
16 1
23 2
Name: session_hit_number, dtype: int64

See the Bach API reference for all available operations.

Export results to a pandas DataFrame

Bach DataFrames and model hub results can always be exported to pandas, so you can use all of its options as well as pandas-compatible ML packages. Since Bach DataFrame operations run on the full dataset in the SQL data store, it is recommended to export to pandas only once the data is sufficiently small, e.g. after aggregation or selection.
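
For example, a minimal sketch using the new_user_share result calculated earlier:

# the aggregation already ran in the SQL data store, so the result is small enough to export
new_user_share_pdf = new_user_share.to_pandas()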

Below we plot the previously calculated presses per session before conversion, using pandas’ built-in plotting methods.

In:
presses_per_session_pdf.hist()
Out:
<AxesSubplot:...>

Presses per session

Get the SQL for any analysis

The SQL for any analysis can be exported with one command, so you can use models in production directly to simplify data debugging & delivery to BI tools like Metabase, dbt, etc. See how you can quickly create BI dashboards with this.

In:
# show the underlying SQL for this dataframe - works for any dataframe/model in Objectiv
display_sql_as_markdown(new_user_share)
Out:
WITH "manual_materialize___3b151791e67a85fc8c95acac4a457b07" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"cookie_id" AS "user_id",
cast("value"->>'_type' AS text) AS "event_type",
cast(cast("value"->>'_types' AS text) AS JSONB) AS "stack_event_types",
cast(cast("value"->>'location_stack' AS text) AS JSONB) AS "location_stack",
cast(cast("value"->>'time' AS text) AS bigint) AS "time"
FROM "data"
),
"getitem_where_boolean___a0cde0de892cbce88c49587144341273" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"location_stack" AS "location_stack",
"time" AS "time"
FROM "manual_materialize___3b151791e67a85fc8c95acac4a457b07"
WHERE ((("day" >= cast('2022-03-01' AS date))) AND (("day" <= cast('2022-03-31' AS date))))
),
"context_data___f37be8c35ef5c693b5c6757a42caffc9" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types"
FROM "getitem_where_boolean___a0cde0de892cbce88c49587144341273"
),
"session_starts___62ad5aa356650475769808c8a2a58847" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
CASE WHEN (extract(epoch FROM (("moment") - (lag("moment", 1, cast(NULL AS timestamp WITHOUT TIME ZONE)) OVER (PARTITION BY "user_id" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)))) <= cast(1800 AS bigint)) THEN cast(NULL AS boolean)
ELSE cast(TRUE AS boolean)
END AS "is_start_of_session"
FROM "context_data___f37be8c35ef5c693b5c6757a42caffc9"
),
"session_id_and_count___de47919480493af27779b89a88e6ad3a" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"is_start_of_session" AS "is_start_of_session",
CASE WHEN "is_start_of_session" THEN row_number() OVER (PARTITION BY "is_start_of_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ELSE cast(NULL AS bigint)
END AS "session_start_id",
count("is_start_of_session") OVER (ORDER BY "user_id" ASC NULLS LAST, "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "is_one_session"
FROM "session_starts___62ad5aa356650475769808c8a2a58847"
),
"objectiv_sessionized_data___b40afa1117f20eda0fc8812d6c1a6db9" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"is_start_of_session" AS "is_start_of_session",
"session_start_id" AS "session_start_id",
"is_one_session" AS "is_one_session",
first_value("session_start_id") OVER (PARTITION BY "is_one_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "session_id",
row_number() OVER (PARTITION BY "is_one_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "session_hit_number"
FROM "session_id_and_count___de47919480493af27779b89a88e6ad3a"
),
"manual_materialize___f6b6403780d6e60c32d5b7d210c4268e" AS (
SELECT "event_id" AS "event_id",
(min("session_id") OVER (PARTITION BY to_char("moment", 'YYYY"-"MM"-"DD'), "user_id" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) = min("session_id") OVER (PARTITION BY "user_id" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)) AS "is_new_user"
FROM "objectiv_sessionized_data___b40afa1117f20eda0fc8812d6c1a6db9"
),
"merge_sql___47d0bb4b68ff863b9aed7efa1a5e0c36" AS (
SELECT COALESCE("l"."event_id", "r"."event_id") AS "event_id",
"l"."day" AS "day",
"l"."moment" AS "moment",
"l"."user_id" AS "user_id",
"l"."location_stack" AS "location_stack",
"l"."event_type" AS "event_type",
"l"."stack_event_types" AS "stack_event_types",
"l"."session_id" AS "session_id",
"l"."session_hit_number" AS "session_hit_number",
"r"."is_new_user" AS "is_new_user"
FROM "objectiv_sessionized_data___b40afa1117f20eda0fc8812d6c1a6db9" AS l
LEFT
JOIN "manual_materialize___f6b6403780d6e60c32d5b7d210c4268e" AS r
ON ("l"."event_id" = "r"."event_id")
),
"manual_materialize___ce1874e5303e7e9c7bd912bf7de2ffb8" AS (
SELECT "event_id" AS "event_id",
(min("session_id") OVER (PARTITION BY to_char("moment", 'YYYY"-"MM'), "user_id" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) = min("session_id") OVER (PARTITION BY "user_id" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)) AS "is_new_user"
FROM "merge_sql___47d0bb4b68ff863b9aed7efa1a5e0c36"
),
"merge_sql___e1fc7c81b69651e5740c6358e73ef7a0" AS (
SELECT COALESCE("l"."event_id", "r"."event_id") AS "event_id",
"l"."day" AS "day",
"l"."moment" AS "moment",
"l"."user_id" AS "user_id",
"l"."location_stack" AS "location_stack",
"l"."event_type" AS "event_type",
"l"."stack_event_types" AS "stack_event_types",
"l"."session_id" AS "session_id",
"l"."session_hit_number" AS "session_hit_number",
"l"."is_new_user" AS "is_new_user",
(((jsonb_array_length(coalesce((SELECT jsonb_agg(x.value) FROM jsonb_array_elements("l"."location_stack") WITH ORDINALITY x WHERE ORDINALITY - 1 >= (SELECT min(CASE WHEN ('{"id": "Quickstart Guide", "_type": "LinkContext"}'::JSONB) <@ value THEN ORDINALITY END) -1 FROM jsonb_array_elements("l"."location_stack") WITH ORDINALITY)), '[]'::JSONB)) > cast(0 AS bigint))) AND (("l"."event_type" = 'PressEvent'))) AS "conversion_events",
"r"."is_new_user" AS "is_new_user_month"
FROM "merge_sql___47d0bb4b68ff863b9aed7efa1a5e0c36" AS l
LEFT
JOIN "manual_materialize___ce1874e5303e7e9c7bd912bf7de2ffb8" AS r
ON ("l"."event_id" = "r"."event_id")
),
"merge_sql___e54ce05f7f6cbf6221631d8bd5c43765" AS (
SELECT COALESCE("l"."event_id", "r"."event_id") AS "event_id",
(jsonb_array_length(coalesce((SELECT jsonb_agg(x.value) FROM jsonb_array_elements("l"."location_stack") WITH ORDINALITY x WHERE ORDINALITY - 1 >= (SELECT min(CASE WHEN ('{"id": "Quickstart Guide", "_type": "LinkContext"}'::JSONB) <@ value THEN ORDINALITY END) -1 FROM jsonb_array_elements("l"."location_stack") WITH ORDINALITY)), '[]'::JSONB)) > cast(0 AS bigint)) AS "location_stack",
("r"."event_type" = 'PressEvent') AS "event_type"
FROM "merge_sql___47d0bb4b68ff863b9aed7efa1a5e0c36" AS l
FULL OUTER
JOIN "merge_sql___e1fc7c81b69651e5740c6358e73ef7a0" AS r
ON ("l"."event_id" = "r"."event_id")
),
"merge_sql___245b4e11fb326ae748bcf76a60e49e3d" AS (
SELECT COALESCE("l"."event_id", "r"."event_id") AS "event_id",
"l"."moment" AS "moment",
"l"."event_type" AS "event_type",
"l"."session_id" AS "session_id",
(("r"."location_stack") AND ("r"."event_type")) AS "__is_conversion_event"
FROM "merge_sql___e1fc7c81b69651e5740c6358e73ef7a0" AS l
LEFT
JOIN "merge_sql___e54ce05f7f6cbf6221631d8bd5c43765" AS r
ON ("l"."event_id" = "r"."event_id")
),
"conversions_in_time_calculat___7a719f2f24a8fd76d25332e81484a664" AS (
SELECT "event_id" AS "event_id",
"moment" AS "moment",
"event_type" AS "event_type",
"session_id" AS "session_id",
"__is_conversion_event" AS "__is_conversion_event",
CASE WHEN "__is_conversion_event" THEN cast(1 AS bigint)
ELSE cast(0 AS bigint)
END AS "__conversion_counter",
cast(sum(CASE WHEN "__is_conversion_event" THEN cast(1 AS bigint) ELSE cast(0 AS bigint) END) OVER (PARTITION BY "session_id" ORDER BY "session_id" ASC NULLS LAST, "moment" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS bigint) AS "__conversions_in_time"
FROM "merge_sql___245b4e11fb326ae748bcf76a60e49e3d"
),
"merge_sql___8e10601dfa89fb9223b2e299b4295b33" AS (
SELECT COALESCE("l"."event_id", "r"."event_id") AS "event_id",
"l"."day" AS "day",
"l"."moment" AS "moment",
"l"."user_id" AS "user_id",
"l"."location_stack" AS "location_stack",
"l"."event_type" AS "event_type",
"l"."stack_event_types" AS "stack_event_types",
"l"."session_id" AS "session_id",
"l"."session_hit_number" AS "session_hit_number",
"l"."is_new_user" AS "is_new_user",
"l"."conversion_events" AS "conversion_events",
"l"."is_new_user_month" AS "is_new_user_month",
("r"."__conversions_in_time" = cast(2 AS bigint)) AS "is_twice_converted"
FROM "merge_sql___e1fc7c81b69651e5740c6358e73ef7a0" AS l
LEFT
JOIN "conversions_in_time_calculat___7a719f2f24a8fd76d25332e81484a664" AS r
ON ("l"."event_id" = "r"."event_id")
),
"getitem_where_boolean___96d65cb4ff5195536c687e1c6ccdc1c1" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"session_id" AS "session_id",
"session_hit_number" AS "session_hit_number",
"is_new_user" AS "is_new_user",
"conversion_events" AS "conversion_events",
"is_new_user_month" AS "is_new_user_month",
"is_twice_converted" AS "is_twice_converted"
FROM "merge_sql___8e10601dfa89fb9223b2e299b4295b33"
WHERE "is_new_user"
),
"merge_left___9a2a470cf65a510087a4b2204d759d01" AS (
SELECT to_char("moment", 'YYYY"-"MM"-"DD') AS "time_aggregation",
cast(count(DISTINCT "user_id") AS double precision) AS "unique_users"
FROM "getitem_where_boolean___96d65cb4ff5195536c687e1c6ccdc1c1"
GROUP BY to_char("moment", 'YYYY"-"MM"-"DD')
),
"manual_materialize___eebdb7eaf67621c07f039d021e65d7a1" AS (
SELECT to_char("moment", 'YYYY"-"MM"-"DD') AS "time_aggregation",
count(DISTINCT "user_id") AS "unique_users"
FROM "merge_sql___8e10601dfa89fb9223b2e299b4295b33"
GROUP BY to_char("moment", 'YYYY"-"MM"-"DD')
),
"merge_sql___61121cd03ee65d4368107659b1b49463" AS (
SELECT COALESCE("l"."time_aggregation", "r"."time_aggregation") AS "time_aggregation",
"l"."unique_users" AS "unique_users",
"r"."unique_users" AS "unique_users__other"
FROM "merge_left___9a2a470cf65a510087a4b2204d759d01" AS l
FULL OUTER
JOIN "manual_materialize___eebdb7eaf67621c07f039d021e65d7a1" AS r
ON ("l"."time_aggregation" = "r"."time_aggregation")
) SELECT "time_aggregation" AS "time_aggregation",
("unique_users" / "unique_users__other") AS "unique_users"
FROM "merge_sql___61121cd03ee65d4368107659b1b49463"

That’s it! Join us on Slack if you have any questions or suggestions.

Next Steps

Try the notebooks in Objectiv Up

Spin up a full-fledged product analytics pipeline with Objectiv Up in under 5 minutes, and play with the included example notebooks yourself.

Use this notebook with your own data

You can use the example notebooks on any dataset that was collected with Objectiv’s tracker, so feel free to use them to bootstrap your own projects. They are available as Jupyter notebooks on our GitHub repository. See instructions to set up the Objectiv tracker.

  • Open taxonomy how-to - see what you can do with the Bach modeling library and a dataset that is validated against the open analytics taxonomy.

  • Explore your data - see how you can easily explore your data collected with Objectiv.