Logistic Regression
Data collected with Objectiv is strictly structured & designed for modeling, making it ideal for various machine learning models, which can be applied directly without cleaning, transformations, or complex tooling.
This example notebook shows how you can predict user behavior with the Logistic Regression model in the open model hub on a full dataset collected with Objectiv. Examples of predictions you can create:
- Will a user convert?
- Will a user start using a specific product feature or area?
- Will a user have a long active session duration?.
It’s also available as a full Jupyter notebook to run on your own data (see how to get started in your notebook). The dataset used here is the same as in Objectiv Up.
Get started
We first have to instantiate the model hub and an Objectiv DataFrame object.
# set the timeframe of the analysis
start_date = '2022-03-01'
end_date = None
# instantiate the model hub, set the default time aggregation to daily
# and get the application & path global contexts
from modelhub import ModelHub, display_sql_as_markdown
modelhub = ModelHub(time_aggregation='%Y-%m-%d')
# get an Objectiv DataFrame within a defined timeframe
df = modelhub.get_objectiv_dataframe(db_url=DB_URL, start_date=start_date, end_date=end_date)
The location_stack
column, and the columns taken from the global contexts, contain most of the
event-specific data. These columns are JSON typed, and we can extract data from it using the keys of the JSON
objects with SeriesLocationStack methods, or the context
accessor for global context columns. See the open taxonomy example for how to use
the location_stack
and global contexts.
df['root_location'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')
Creating a feature set to predict user behavior
For simple demonstration purposes, we’ll predict if users on our own website will reach the modeling section of our docs, by looking at interactions they have with all the main sections on our site, as defined by the root location.
We’ll create a dataset that counts the number of clicks per user in each section. Note that this is a simple dataset used just for demonstration purposes of the logistic regression functionality, and not so much the results itself. For ins and outs on feature engineering see the feature engineering notebook.
# first replace dashes in the root_location Series, because is unstacked later on
# and dashes are not allowed in BigQuery column names
df['root_location'] = df['root_location'].str.replace('-', '_')
# look at the number of clicks per user in each section; only PressEvents, counting the root_locations
features = df[(df.event_type=='PressEvent')].groupby('user_id').root_location.value_counts()
# unstack the series, to create a DataFrame with the number of clicks per root location as columns
features_unstacked = features.unstack(fill_value=0)
Sample the data
To limit data processing and speed up fitting, let’s take a 10% sample of the full dataset to train the model on. After the model is fitted, it can easily be unsampled again to predict the labels for the entire dataset.
# take a 10% sample to train the model on
# for BigQuery the table name should be 'YOUR_PROJECT.YOUR_WRITABLE_DATASET.YOUR_TABLE_NAME'
features_set_sample = features_unstacked.get_sample('test_lr_sample', sample_percentage=10, overwrite=True)
To predict whether a user clicked in the modeling section of our docs, we will look at the number of clicks in any of the other sections:
X
is a DataFrame that contains the explanatory variables.y
is a SeriesBoolean with the labels we want to predict.
# set the explanatory variables and labels to predict
y_column = 'modeling'
y = features_set_sample[y_column] > 0
X = features_set_sample.drop(columns=[y_column])
# see what `X` looks like
X.head()
about blog home jobs join_slack privacy taxonomy tracking
user_id
007f5fd7-7535-434e-aa3e-3d52f06d63ce 0 0 4 0 0 0 0 0
014c911b-53a9-44e2-9805-369e0b4d598f 0 0 0 0 0 0 1 0
02cf09ee-0ad8-418e-a8a8-82cf3ca4a9b7 0 0 2 0 0 0 0 0
066f3814-6e32-4267-9e87-e918b6708995 0 0 1 0 0 0 0 0
0717f786-572e-461f-b895-53fef9a6a757 0 0 3 0 0 0 0 0
# and see what `y` looks like
y.head()
user_id
007f5fd7-7535-434e-aa3e-3d52f06d63ce True
014c911b-53a9-44e2-9805-369e0b4d598f False
02cf09ee-0ad8-418e-a8a8-82cf3ca4a9b7 False
066f3814-6e32-4267-9e87-e918b6708995 False
0717f786-572e-461f-b895-53fef9a6a757 False
Name: modeling, dtype: boo
Instantiate & fit the logistic regression model
As the model is based on sklearn’s version of LogisticRegression, it can be instantiated with any parameters
that sklearn’s LogisticRegression supports.
In our example we instantiate it with fit_intercept=False
.
lr = modelhub.get_logistic_regression(fit_intercept=False)
The fit
operation then fits it to the passed data. This operation extracts the data from the database under
the hood.
lr.fit(X, y)
LogisticRegression(fit_intercept=False)
Set accuracy & prediction
All of the following operations are carried out directly on the database.
lr.score(X, y)
0.7941176470588235
The model provides the same attributes as sklearn’s Logistic Regression model, such as
coef_
.
# show the coefficients of the fitted model
lr.coef_
array([[-0.82043069, 0.61146569, -0.12662165, -0.23339689, 0. ,
-0.30166917, -0.13706966, 0.37093552]])
Now let’s create columns for the predicted values and the labels in the dataset. Labels are set to True
if
the probability is over 0.5.
features_set_sample['predicted_values'] = lr.predict_proba(X)
features_set_sample['predicted_labels'] = lr.predict(X)
# show the sampled data set, including predictions
features_set_sample.head(20)
about blog home jobs join_slack modeling privacy taxonomy tracking predicted_values predicted_labels
user_id
007f5fd7-7535-434e-aa3e-3d52f06d63ce 0 0 4 0 0 4 0 0 0 0.283592 False
012e2d45-32ed-4f33-84e5-babfd3b10372 0 0 2 0 0 0 0 0 0 0.386190 False
023ec102-bc2f-4f30-bd42-12dcdbd96e31 0 0 2 0 0 0 0 0 0 0.386190 False
02bc6ae7-153c-44e2-ba4a-301984960ac9 0 0 1 0 0 0 0 0 0 0.442338 False
02c42c27-1c0d-4e3e-b6c0-403a60e8eb83 0 0 0 0 0 0 0 3 0 0.372160 False
030f618e-00e8-4d83-b95e-0c2870978e08 0 0 3 0 0 0 0 0 0 0.332914 False
062dded9-cf06-48fc-9923-ebb5e526f154 0 0 2 0 0 0 0 0 0 0.386190 False
0a16f4d9-eb23-4b86-a886-9276ab1912a0 2 0 4 0 0 0 0 0 0 0.247541 False
0abfcb9e-bc78-49c6-8bb0-4187aa4d8ab2 0 0 7 0 0 5 0 0 0 0.164964 False
0bae858b-80c5-4c34-ae29-2aea00be8ac5 1 1 2 1 0 0 0 2 0 0.331506 False
0bc184f0-f879-44a6-978e-f203bd9e03c1 0 2 0 0 0 0 0 0 0 0.780044 True
0bd503b7-4a60-401f-a217-f9e29bd4686b 0 0 1 0 0 0 0 0 0 0.442338 False
0c235f61-53c4-4708-9f35-fa554da293ca 0 0 1 0 0 0 0 0 0 0.442338 False
0dc6bfb4-f054-4f48-b2c1-a07dd6e31cc7 0 0 1 0 0 0 0 0 0 0.442338 False
1190e603-b127-41d5-9e8d-72d2ed3402da 0 0 5 0 0 0 0 0 0 0.238960 False
13292379-1ad7-4dd5-9b90-bf65408fa24e 0 0 1 0 0 0 0 0 0 0.442338 False
13a975e8-3533-47b0-ae04-08a97a5e211c 0 2 0 0 0 0 0 0 0 0.780044 True
1b219545-08e0-43fa-b476-c6bf0d80c0c7 1 0 8 1 0 0 0 0 0 0.085039 False
1e644cdb-5e79-4536-ab17-6948b4c000bf 0 0 1 0 0 0 0 3 1 0.484079 False
1fc27f44-d229-4f21-9d14-bb0a21571f00 0 0 3 0 0 0 0 0 0 0.332914 False
Unsample and get the SQL
The sampled dataset we used above can easily be unsampled.
features_set_full = features_set_sample.get_unsampled()
The SQL for any analysis can be exported with one command, so you can use models in production directly to simplify data debugging & delivery to BI tools like Metabase, dbt, etc. See how you can quickly create BI dashboards with this.
# show the underlying SQL for this dataframe - works for any dataframe/model in Objectiv
display_sql_as_markdown(features_set_full)
WITH "manual_materialize___3b151791e67a85fc8c95acac4a457b07" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"cookie_id" AS "user_id",
cast("value"->>'_type' AS text) AS "event_type",
cast(cast("value"->>'_types' AS text) AS JSONB) AS "stack_event_types",
cast(cast("value"->>'location_stack' AS text) AS JSONB) AS "location_stack",
cast(cast("value"->>'time' AS text) AS bigint) AS "time"
FROM "data"
),
"getitem_where_boolean___c2409cbd16046c741733e1b3fc173f24" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"location_stack" AS "location_stack",
"time" AS "time"
FROM "manual_materialize___3b151791e67a85fc8c95acac4a457b07"
WHERE ((("day" >= cast('2022-03-01' AS date))) AND (("day" <= cast('2022-06-30' AS date))))
),
"context_data___204983e5a9e66c47992ef746e0877e2e" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types"
FROM "getitem_where_boolean___c2409cbd16046c741733e1b3fc173f24"
),
"session_starts___8a496387168b8a51ab24870a574873b6" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
CASE WHEN (extract(epoch FROM (("moment") - (lag("moment", 1, cast(NULL AS timestamp WITHOUT TIME ZONE)) OVER (PARTITION BY "user_id" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)))) <= cast(1800 AS bigint)) THEN cast(NULL AS boolean)
ELSE cast(TRUE AS boolean)
END AS "is_start_of_session"
FROM "context_data___204983e5a9e66c47992ef746e0877e2e"
),
"session_id_and_count___0e805b87769bab9933174fc44829028a" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"is_start_of_session" AS "is_start_of_session",
CASE WHEN "is_start_of_session" THEN row_number() OVER (PARTITION BY "is_start_of_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ELSE cast(NULL AS bigint)
END AS "session_start_id",
count("is_start_of_session") OVER (ORDER BY "user_id" ASC NULLS LAST, "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "is_one_session"
FROM "session_starts___8a496387168b8a51ab24870a574873b6"
),
"objectiv_sessionized_data___15eaed9578d6a47c065bd9f82a5bf1a6" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"is_start_of_session" AS "is_start_of_session",
"session_start_id" AS "session_start_id",
"is_one_session" AS "is_one_session",
first_value("session_start_id") OVER (PARTITION BY "is_one_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "session_id",
row_number() OVER (PARTITION BY "is_one_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "session_hit_number"
FROM "session_id_and_count___0e805b87769bab9933174fc44829028a"
),
"getitem_where_boolean___f69b89ddfeead29a894fccd7586179ce" AS (
SELECT "event_id" AS "event_id",
"day" AS "day",
"moment" AS "moment",
"user_id" AS "user_id",
"location_stack" AS "location_stack",
"event_type" AS "event_type",
"stack_event_types" AS "stack_event_types",
"session_id" AS "session_id",
"session_hit_number" AS "session_hit_number",
REPLACE(cast(coalesce((SELECT jsonb_agg(x.value) FROM jsonb_array_elements("location_stack") WITH ORDINALITY x WHERE ORDINALITY - 1 >= (SELECT min(CASE WHEN ('{"_type": "RootLocationContext"}'::JSONB) <@ value THEN ORDINALITY END) -1 FROM jsonb_array_elements("location_stack") WITH ORDINALITY)), '[]'::JSONB)->0->>'id' AS text), '-', '_') AS "root_location"
FROM "objectiv_sessionized_data___15eaed9578d6a47c065bd9f82a5bf1a6"
WHERE ("event_type" = 'PressEvent')
),
"reset_index___33f5dbafcc0b5e9f64c2ad04e1787eb3" AS (
SELECT "user_id" AS "user_id",
"root_location" AS "root_location",
cast(sum(cast(1 AS bigint)) AS bigint) AS "value_counts"
FROM "getitem_where_boolean___f69b89ddfeead29a894fccd7586179ce"
GROUP BY "user_id",
"root_location"
ORDER BY cast(sum(cast(1 AS bigint)) AS bigint) DESC NULLS LAST
),
"unstack___8dfef8242fc13d14e3c1068b38828af3" AS (
SELECT "user_id" AS "user_id",
max("value_counts") AS "value_counts",
max("root_location") AS "root_location",
max(CASE WHEN ("root_location" = 'about') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "about__value_counts",
max(CASE WHEN ("root_location" = 'blog') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "blog__value_counts",
max(CASE WHEN ("root_location" = 'home') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "home__value_counts",
max(CASE WHEN ("root_location" = 'jobs') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "jobs__value_counts",
max(CASE WHEN ("root_location" = 'join_slack') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "join_slack__value_counts",
max(CASE WHEN ("root_location" = 'modeling') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "modeling__value_counts",
max(CASE WHEN ("root_location" = 'privacy') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "privacy__value_counts",
max(CASE WHEN ("root_location" = 'taxonomy') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "taxonomy__value_counts",
max(CASE WHEN ("root_location" = 'tracking') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "tracking__value_counts"
FROM "reset_index___33f5dbafcc0b5e9f64c2ad04e1787eb3"
GROUP BY "user_id"
),
"get_sample___d78797a51432ec8008301d111eec6f1d" AS (
SELECT "user_id" AS "user_id",
(COALESCE("about__value_counts", cast(0 AS bigint))) AS "about",
(COALESCE("blog__value_counts", cast(0 AS bigint))) AS "blog",
(COALESCE("home__value_counts", cast(0 AS bigint))) AS "home",
(COALESCE("jobs__value_counts", cast(0 AS bigint))) AS "jobs",
(COALESCE("join_slack__value_counts", cast(0 AS bigint))) AS "join_slack",
(COALESCE("modeling__value_counts", cast(0 AS bigint))) AS "modeling",
(COALESCE("privacy__value_counts", cast(0 AS bigint))) AS "privacy",
(COALESCE("taxonomy__value_counts", cast(0 AS bigint))) AS "taxonomy",
(COALESCE("tracking__value_counts", cast(0 AS bigint))) AS "tracking"
FROM "unstack___8dfef8242fc13d14e3c1068b38828af3"
) SELECT "user_id" AS "user_id",
"about" AS "about",
"blog" AS "blog",
"home" AS "home",
"jobs" AS "jobs",
"join_slack" AS "join_slack",
"modeling" AS "modeling",
"privacy" AS "privacy",
"taxonomy" AS "taxonomy",
"tracking" AS "tracking",
(exp(((((((((cast('0.0' AS double precision) + ("about" * cast('-1.0064649743992844' AS double precision))) + ("blog" * cast('-0.05683975149017227' AS double precision))) + ("home" * cast('-0.20963182234527333' AS double precision))) + ("jobs" * cast('1.510545488936254' AS double precision))) + ("join_slack" * cast('0.0' AS double precision))) + ("privacy" * cast('-0.008778031990958772' AS double precision))) + ("taxonomy" * cast('0.018646988267775482' AS double precision))) + ("tracking" * cast('0.5282013453013719' AS double precision)))) / (exp(((((((((cast('0.0' AS double precision) + ("about" * cast('-1.0064649743992844' AS double precision))) + ("blog" * cast('-0.05683975149017227' AS double precision))) + ("home" * cast('-0.20963182234527333' AS double precision))) + ("jobs" * cast('1.510545488936254' AS double precision))) + ("join_slack" * cast('0.0' AS double precision))) + ("privacy" * cast('-0.008778031990958772' AS double precision))) + ("taxonomy" * cast('0.018646988267775482' AS double precision))) + ("tracking" * cast('0.5282013453013719' AS double precision)))) + cast('1.0' AS double precision))) AS "predicted_values",
((exp(((((((((cast('0.0' AS double precision) + ("about" * cast('-1.0064649743992844' AS double precision))) + ("blog" * cast('-0.05683975149017227' AS double precision))) + ("home" * cast('-0.20963182234527333' AS double precision))) + ("jobs" * cast('1.510545488936254' AS double precision))) + ("join_slack" * cast('0.0' AS double precision))) + ("privacy" * cast('-0.008778031990958772' AS double precision))) + ("taxonomy" * cast('0.018646988267775482' AS double precision))) + ("tracking" * cast('0.5282013453013719' AS double precision)))) / (exp(((((((((cast('0.0' AS double precision) + ("about" * cast('-1.0064649743992844' AS double precision))) + ("blog" * cast('-0.05683975149017227' AS double precision))) + ("home" * cast('-0.20963182234527333' AS double precision))) + ("jobs" * cast('1.510545488936254' AS double precision))) + ("join_slack" * cast('0.0' AS double precision))) + ("privacy" * cast('-0.008778031990958772' AS double precision))) + ("taxonomy" * cast('0.018646988267775482' AS double precision))) + ("tracking" * cast('0.5282013453013719' AS double precision)))) + cast('1.0' AS double precision))) > cast('0.5' AS double precision)) AS "predicted_labels"
FROM "get_sample___d78797a51432ec8008301d111eec6f1d"
That’s it! Stay tuned for more metrics to assess model fit, as well as simplifying splitting the data into training and testing datasets.
Join us on Slack if you have any questions or suggestions.
Next Steps
Try the notebooks in Objectiv Up
Spin up a full-fledged product analytics pipeline with Objectiv Up in under 5 minutes, and play with the included example notebooks yourself.
Use this notebook with your own data
You can use the example notebooks on any dataset that was collected with Objectiv’s tracker, so feel free to use them to bootstrap your own projects. They are available as Jupyter notebooks on our GitHub repository. See instructions to set up the Objectiv tracker.
Check out related example notebooks
Feature importance notebook - model the importance of features on achieving a conversion goal.
User Intent notebook - run basic User Intent analysis with Objectiv.