Logistic Regression

Data collected with Objectiv is strictly structured & designed for modeling, making it ideal for various machine learning models, which can be applied directly without cleaning, transformations, or complex tooling.

This example notebook shows how you can predict user behavior with the Logistic Regression model in the open model hub on a full dataset collected with Objectiv. Examples of predictions you can create:

Will a user convert?
Will a user start using a specific product feature or area?
Will a user have a long active session duration?.

It’s also available as a full Jupyter notebook to run on your own data (see how to get started in your notebook). The dataset used here is the same as in Objectiv Up.

Get started

We first have to instantiate the model hub and an Objectiv DataFrame object.

In:

# set the timeframe of the analysis
start_date = '2022-03-01'
end_date = None

In:

# instantiate the model hub, set the default time aggregation to daily
# and get the application & path global contexts
from modelhub import ModelHub, display_sql_as_markdown
modelhub = ModelHub(time_aggregation='%Y-%m-%d')
# get an Objectiv DataFrame within a defined timeframe
df = modelhub.get_objectiv_dataframe(db_url=DB_URL, start_date=start_date, end_date=end_date)

The location_stack column, and the columns taken from the global contexts, contain most of the event-specific data. These columns are JSON typed, and we can extract data from it using the keys of the JSON objects with SeriesLocationStack methods, or the context accessor for global context columns. See the open taxonomy example for how to use the location_stack and global contexts.

In:

df['root_location'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')

Reference

Creating a feature set to predict user behavior

For simple demonstration purposes, we’ll predict if users on our own website will reach the modeling section of our docs, by looking at interactions they have with all the main sections on our site, as defined by the root location.

We’ll create a dataset that counts the number of clicks per user in each section. Note that this is a simple dataset used just for demonstration purposes of the logistic regression functionality, and not so much the results itself. For ins and outs on feature engineering see the feature engineering notebook.

In:

# first replace dashes in the root_location Series, because is unstacked later on
# and dashes are not allowed in BigQuery column names
df['root_location'] = df['root_location'].str.replace('-', '_')

In:

# look at the number of clicks per user in each section; only PressEvents, counting the root_locations
features = df[(df.event_type=='PressEvent')].groupby('user_id').root_location.value_counts()

In:

# unstack the series, to create a DataFrame with the number of clicks per root location as columns
features_unstacked = features.unstack(fill_value=0)

Reference

Sample the data

To limit data processing and speed up fitting, let’s take a 10% sample of the full dataset to train the model on. After the model is fitted, it can easily be unsampled again to predict the labels for the entire dataset.

In:

# take a 10% sample to train the model on
# for BigQuery the table name should be 'YOUR_PROJECT.YOUR_WRITABLE_DATASET.YOUR_TABLE_NAME'
features_set_sample = features_unstacked.get_sample('test_lr_sample', sample_percentage=10, overwrite=True)

To predict whether a user clicked in the modeling section of our docs, we will look at the number of clicks in any of the other sections:

X is a DataFrame that contains the explanatory variables.
y is a SeriesBoolean with the labels we want to predict.

In:

# set the explanatory variables and labels to predict
y_column = 'modeling'
y = features_set_sample[y_column] > 0
X = features_set_sample.drop(columns=[y_column])

In:

# see what `X` looks like
X.head()

Out:

                                      about  blog  home  jobs  join_slack  privacy  taxonomy  tracking
user_id
007f5fd7-7535-434e-aa3e-3d52f06d63ce      0     0     4     0           0        0         0         0
014c911b-53a9-44e2-9805-369e0b4d598f      0     0     0     0           0        0         1         0
02cf09ee-0ad8-418e-a8a8-82cf3ca4a9b7      0     0     2     0           0        0         0         0
066f3814-6e32-4267-9e87-e918b6708995      0     0     1     0           0        0         0         0
0717f786-572e-461f-b895-53fef9a6a757      0     0     3     0           0        0         0         0

In:

# and see what `y` looks like
y.head()

Out:

user_id
007f5fd7-7535-434e-aa3e-3d52f06d63ce     True
014c911b-53a9-44e2-9805-369e0b4d598f    False
02cf09ee-0ad8-418e-a8a8-82cf3ca4a9b7    False
066f3814-6e32-4267-9e87-e918b6708995    False
0717f786-572e-461f-b895-53fef9a6a757    False
Name: modeling, dtype: boo

Reference

Instantiate & fit the logistic regression model

As the model is based on sklearn’s version of LogisticRegression, it can be instantiated with any parameters that sklearn’s LogisticRegression supports. In our example we instantiate it with fit_intercept=False.

In:

lr = modelhub.get_logistic_regression(fit_intercept=False)

The fit operation then fits it to the passed data. This operation extracts the data from the database under the hood.

In:

lr.fit(X, y)

Out:

LogisticRegression(fit_intercept=False)

Set accuracy & prediction

All of the following operations are carried out directly on the database.

In:

lr.score(X, y)

Out:

0.7941176470588235

The model provides the same attributes as sklearn’s Logistic Regression model, such as coef_.

In:

# show the coefficients of the fitted model
lr.coef_

Out:

array([[-0.82043069,  0.61146569, -0.12662165, -0.23339689,  0.        ,
        -0.30166917, -0.13706966,  0.37093552]])

Now let’s create columns for the predicted values and the labels in the dataset. Labels are set to True if the probability is over 0.5.

In:

features_set_sample['predicted_values'] = lr.predict_proba(X)
features_set_sample['predicted_labels'] = lr.predict(X)

In:

# show the sampled data set, including predictions
features_set_sample.head(20)

Out:

                                      about  blog  home  jobs  join_slack  modeling  privacy  taxonomy  tracking  predicted_values  predicted_labels
user_id
007f5fd7-7535-434e-aa3e-3d52f06d63ce      0     0     4     0           0         4        0         0         0          0.283592             False
012e2d45-32ed-4f33-84e5-babfd3b10372      0     0     2     0           0         0        0         0         0          0.386190             False
023ec102-bc2f-4f30-bd42-12dcdbd96e31      0     0     2     0           0         0        0         0         0          0.386190             False
02bc6ae7-153c-44e2-ba4a-301984960ac9      0     0     1     0           0         0        0         0         0          0.442338             False
02c42c27-1c0d-4e3e-b6c0-403a60e8eb83      0     0     0     0           0         0        0         3         0          0.372160             False
030f618e-00e8-4d83-b95e-0c2870978e08      0     0     3     0           0         0        0         0         0          0.332914             False
062dded9-cf06-48fc-9923-ebb5e526f154      0     0     2     0           0         0        0         0         0          0.386190             False
0a16f4d9-eb23-4b86-a886-9276ab1912a0      2     0     4     0           0         0        0         0         0          0.247541             False
0abfcb9e-bc78-49c6-8bb0-4187aa4d8ab2      0     0     7     0           0         5        0         0         0          0.164964             False
0bae858b-80c5-4c34-ae29-2aea00be8ac5      1     1     2     1           0         0        0         2         0          0.331506             False
0bc184f0-f879-44a6-978e-f203bd9e03c1      0     2     0     0           0         0        0         0         0          0.780044              True
0bd503b7-4a60-401f-a217-f9e29bd4686b      0     0     1     0           0         0        0         0         0          0.442338             False
0c235f61-53c4-4708-9f35-fa554da293ca      0     0     1     0           0         0        0         0         0          0.442338             False
0dc6bfb4-f054-4f48-b2c1-a07dd6e31cc7      0     0     1     0           0         0        0         0         0          0.442338             False
1190e603-b127-41d5-9e8d-72d2ed3402da      0     0     5     0           0         0        0         0         0          0.238960             False
13292379-1ad7-4dd5-9b90-bf65408fa24e      0     0     1     0           0         0        0         0         0          0.442338             False
13a975e8-3533-47b0-ae04-08a97a5e211c      0     2     0     0           0         0        0         0         0          0.780044              True
1b219545-08e0-43fa-b476-c6bf0d80c0c7      1     0     8     1           0         0        0         0         0          0.085039             False
1e644cdb-5e79-4536-ab17-6948b4c000bf      0     0     1     0           0         0        0         3         1          0.484079             False
1fc27f44-d229-4f21-9d14-bb0a21571f00      0     0     3     0           0         0        0         0         0          0.332914             False

Reference

Unsample and get the SQL

The sampled dataset we used above can easily be unsampled.

In:

features_set_full = features_set_sample.get_unsampled()

The SQL for any analysis can be exported with one command, so you can use models in production directly to simplify data debugging & delivery to BI tools like Metabase, dbt, etc. See how you can quickly create BI dashboards with this.

In:

# show the underlying SQL for this dataframe - works for any dataframe/model in Objectiv
display_sql_as_markdown(features_set_full)

Out:

WITH "manual_materialize___3b151791e67a85fc8c95acac4a457b07" AS (
        SELECT "event_id" AS "event_id",
               "day" AS "day",
               "moment" AS "moment",
               "cookie_id" AS "user_id",
               cast("value"->>'_type' AS text) AS "event_type",
               cast(cast("value"->>'_types' AS text) AS JSONB) AS "stack_event_types",
               cast(cast("value"->>'location_stack' AS text) AS JSONB) AS "location_stack",
               cast(cast("value"->>'time' AS text) AS bigint) AS "time"
          FROM "data"
       ),
       "getitem_where_boolean___c2409cbd16046c741733e1b3fc173f24" AS (
        SELECT "event_id" AS "event_id",
               "day" AS "day",
               "moment" AS "moment",
               "user_id" AS "user_id",
               "event_type" AS "event_type",
               "stack_event_types" AS "stack_event_types",
               "location_stack" AS "location_stack",
               "time" AS "time"
          FROM "manual_materialize___3b151791e67a85fc8c95acac4a457b07"
         WHERE ((("day" >= cast('2022-03-01' AS date))) AND (("day" <= cast('2022-06-30' AS date))))
       ),
       "context_data___204983e5a9e66c47992ef746e0877e2e" AS (
        SELECT "event_id" AS "event_id",
               "day" AS "day",
               "moment" AS "moment",
               "user_id" AS "user_id",
               "location_stack" AS "location_stack",
               "event_type" AS "event_type",
               "stack_event_types" AS "stack_event_types"
          FROM "getitem_where_boolean___c2409cbd16046c741733e1b3fc173f24"
       ),
       "session_starts___8a496387168b8a51ab24870a574873b6" AS (
        SELECT "event_id" AS "event_id",
               "day" AS "day",
               "moment" AS "moment",
               "user_id" AS "user_id",
               "location_stack" AS "location_stack",
               "event_type" AS "event_type",
               "stack_event_types" AS "stack_event_types",
               CASE WHEN (extract(epoch FROM (("moment") - (lag("moment", 1, cast(NULL AS timestamp WITHOUT TIME ZONE)) OVER (PARTITION BY "user_id" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)))) <= cast(1800 AS bigint)) THEN cast(NULL AS boolean)
                    ELSE cast(TRUE AS boolean)
                     END AS "is_start_of_session"
          FROM "context_data___204983e5a9e66c47992ef746e0877e2e"
       ),
       "session_id_and_count___0e805b87769bab9933174fc44829028a" AS (
        SELECT "event_id" AS "event_id",
               "day" AS "day",
               "moment" AS "moment",
               "user_id" AS "user_id",
               "location_stack" AS "location_stack",
               "event_type" AS "event_type",
               "stack_event_types" AS "stack_event_types",
               "is_start_of_session" AS "is_start_of_session",
               CASE WHEN "is_start_of_session" THEN row_number() OVER (PARTITION BY "is_start_of_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
                    ELSE cast(NULL AS bigint)
                     END AS "session_start_id",
               count("is_start_of_session") OVER (ORDER BY "user_id" ASC NULLS LAST, "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "is_one_session"
          FROM "session_starts___8a496387168b8a51ab24870a574873b6"
       ),
       "objectiv_sessionized_data___15eaed9578d6a47c065bd9f82a5bf1a6" AS (
        SELECT "event_id" AS "event_id",
               "day" AS "day",
               "moment" AS "moment",
               "user_id" AS "user_id",
               "location_stack" AS "location_stack",
               "event_type" AS "event_type",
               "stack_event_types" AS "stack_event_types",
               "is_start_of_session" AS "is_start_of_session",
               "session_start_id" AS "session_start_id",
               "is_one_session" AS "is_one_session",
               first_value("session_start_id") OVER (PARTITION BY "is_one_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "session_id",
               row_number() OVER (PARTITION BY "is_one_session" ORDER BY "moment" ASC NULLS LAST, "event_id" ASC NULLS LAST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "session_hit_number"
          FROM "session_id_and_count___0e805b87769bab9933174fc44829028a"
       ),
       "getitem_where_boolean___f69b89ddfeead29a894fccd7586179ce" AS (
        SELECT "event_id" AS "event_id",
               "day" AS "day",
               "moment" AS "moment",
               "user_id" AS "user_id",
               "location_stack" AS "location_stack",
               "event_type" AS "event_type",
               "stack_event_types" AS "stack_event_types",
               "session_id" AS "session_id",
               "session_hit_number" AS "session_hit_number",
               REPLACE(cast(coalesce((SELECT jsonb_agg(x.value) FROM jsonb_array_elements("location_stack") WITH ORDINALITY x WHERE ORDINALITY - 1 >= (SELECT min(CASE WHEN ('{"_type": "RootLocationContext"}'::JSONB) <@ value THEN ORDINALITY END) -1 FROM jsonb_array_elements("location_stack") WITH ORDINALITY)), '[]'::JSONB)->0->>'id' AS text), '-', '_') AS "root_location"
          FROM "objectiv_sessionized_data___15eaed9578d6a47c065bd9f82a5bf1a6"
         WHERE ("event_type" = 'PressEvent')
       ),
       "reset_index___33f5dbafcc0b5e9f64c2ad04e1787eb3" AS (
        SELECT "user_id" AS "user_id",
               "root_location" AS "root_location",
               cast(sum(cast(1 AS bigint)) AS bigint) AS "value_counts"
          FROM "getitem_where_boolean___f69b89ddfeead29a894fccd7586179ce"
         GROUP BY "user_id",
                  "root_location"
         ORDER BY cast(sum(cast(1 AS bigint)) AS bigint) DESC NULLS LAST
       ),
       "unstack___8dfef8242fc13d14e3c1068b38828af3" AS (
        SELECT "user_id" AS "user_id",
               max("value_counts") AS "value_counts",
               max("root_location") AS "root_location",
               max(CASE WHEN ("root_location" = 'about') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "about__value_counts",
               max(CASE WHEN ("root_location" = 'blog') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "blog__value_counts",
               max(CASE WHEN ("root_location" = 'home') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "home__value_counts",
               max(CASE WHEN ("root_location" = 'jobs') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "jobs__value_counts",
               max(CASE WHEN ("root_location" = 'join_slack') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "join_slack__value_counts",
               max(CASE WHEN ("root_location" = 'modeling') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "modeling__value_counts",
               max(CASE WHEN ("root_location" = 'privacy') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "privacy__value_counts",
               max(CASE WHEN ("root_location" = 'taxonomy') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "taxonomy__value_counts",
               max(CASE WHEN ("root_location" = 'tracking') THEN "value_counts" ELSE cast(NULL AS bigint) END) AS "tracking__value_counts"
          FROM "reset_index___33f5dbafcc0b5e9f64c2ad04e1787eb3"
         GROUP BY "user_id"
       ),
       "get_sample___d78797a51432ec8008301d111eec6f1d" AS (
        SELECT "user_id" AS "user_id",
               (COALESCE("about__value_counts", cast(0 AS bigint))) AS "about",
               (COALESCE("blog__value_counts", cast(0 AS bigint))) AS "blog",
               (COALESCE("home__value_counts", cast(0 AS bigint))) AS "home",
               (COALESCE("jobs__value_counts", cast(0 AS bigint))) AS "jobs",
               (COALESCE("join_slack__value_counts", cast(0 AS bigint))) AS "join_slack",
               (COALESCE("modeling__value_counts", cast(0 AS bigint))) AS "modeling",
               (COALESCE("privacy__value_counts", cast(0 AS bigint))) AS "privacy",
               (COALESCE("taxonomy__value_counts", cast(0 AS bigint))) AS "taxonomy",
               (COALESCE("tracking__value_counts", cast(0 AS bigint))) AS "tracking"
          FROM "unstack___8dfef8242fc13d14e3c1068b38828af3"
       ) SELECT "user_id" AS "user_id",
       "about" AS "about",
       "blog" AS "blog",
       "home" AS "home",
       "jobs" AS "jobs",
       "join_slack" AS "join_slack",
       "modeling" AS "modeling",
       "privacy" AS "privacy",
       "taxonomy" AS "taxonomy",
       "tracking" AS "tracking",
       (exp(((((((((cast('0.0' AS double precision) + ("about" * cast('-1.0064649743992844' AS double precision))) + ("blog" * cast('-0.05683975149017227' AS double precision))) + ("home" * cast('-0.20963182234527333' AS double precision))) + ("jobs" * cast('1.510545488936254' AS double precision))) + ("join_slack" * cast('0.0' AS double precision))) + ("privacy" * cast('-0.008778031990958772' AS double precision))) + ("taxonomy" * cast('0.018646988267775482' AS double precision))) + ("tracking" * cast('0.5282013453013719' AS double precision)))) / (exp(((((((((cast('0.0' AS double precision) + ("about" * cast('-1.0064649743992844' AS double precision))) + ("blog" * cast('-0.05683975149017227' AS double precision))) + ("home" * cast('-0.20963182234527333' AS double precision))) + ("jobs" * cast('1.510545488936254' AS double precision))) + ("join_slack" * cast('0.0' AS double precision))) + ("privacy" * cast('-0.008778031990958772' AS double precision))) + ("taxonomy" * cast('0.018646988267775482' AS double precision))) + ("tracking" * cast('0.5282013453013719' AS double precision)))) + cast('1.0' AS double precision))) AS "predicted_values",
       ((exp(((((((((cast('0.0' AS double precision) + ("about" * cast('-1.0064649743992844' AS double precision))) + ("blog" * cast('-0.05683975149017227' AS double precision))) + ("home" * cast('-0.20963182234527333' AS double precision))) + ("jobs" * cast('1.510545488936254' AS double precision))) + ("join_slack" * cast('0.0' AS double precision))) + ("privacy" * cast('-0.008778031990958772' AS double precision))) + ("taxonomy" * cast('0.018646988267775482' AS double precision))) + ("tracking" * cast('0.5282013453013719' AS double precision)))) / (exp(((((((((cast('0.0' AS double precision) + ("about" * cast('-1.0064649743992844' AS double precision))) + ("blog" * cast('-0.05683975149017227' AS double precision))) + ("home" * cast('-0.20963182234527333' AS double precision))) + ("jobs" * cast('1.510545488936254' AS double precision))) + ("join_slack" * cast('0.0' AS double precision))) + ("privacy" * cast('-0.008778031990958772' AS double precision))) + ("taxonomy" * cast('0.018646988267775482' AS double precision))) + ("tracking" * cast('0.5282013453013719' AS double precision)))) + cast('1.0' AS double precision))) > cast('0.5' AS double precision)) AS "predicted_labels"
  FROM "get_sample___d78797a51432ec8008301d111eec6f1d"

That’s it! Stay tuned for more metrics to assess model fit, as well as simplifying splitting the data into training and testing datasets.

Join us on Slack if you have any questions or suggestions.

Next Steps

Try the notebooks in Objectiv Up

Spin up a full-fledged product analytics pipeline with Objectiv Up in under 5 minutes, and play with the included example notebooks yourself.

Use this notebook with your own data

You can use the example notebooks on any dataset that was collected with Objectiv’s tracker, so feel free to use them to bootstrap your own projects. They are available as Jupyter notebooks on our GitHub repository. See instructions to set up the Objectiv tracker.

Feature importance notebook - model the importance of features on achieving a conversion goal.
User Intent notebook - run basic User Intent analysis with Objectiv.

Logistic Regression

Get started​

Creating a feature set to predict user behavior​

Sample the data​

Instantiate & fit the logistic regression model​

Set accuracy & prediction​

Unsample and get the SQL​

Next Steps​

Try the notebooks in Objectiv Up​

Use this notebook with your own data​

Check out related example notebooks​

Get started

Creating a feature set to predict user behavior

Sample the data

Instantiate & fit the logistic regression model

Set accuracy & prediction

Unsample and get the SQL

Next Steps

Try the notebooks in Objectiv Up

Use this notebook with your own data

Check out related example notebooks