# Snowplow pipeline
The Objectiv Collector supports using the Snowplow pipeline as a sink for Objectiv events, hooking directly into Snowplow's enrichment step. Currently, there is data store support for:
- Google BigQuery, via Google PubSub; and
- Amazon S3, via AWS SQS/Kinesis.
## How to set up Objectiv with Snowplow
In this setup, we assume you already have a fully functional Snowplow pipeline running, including enrichment, loader and iglu repository. If you don't, please see the Snowplow quickstart for Open Source.
Enabling Objectiv involves two steps, as explained next:
- Adding the Objectiv Taxonomy schema to the iglu repository;
- Configuring the Objectiv Collector output to push events into the appropriate message queue.
### 1. Add the Objectiv schema to the iglu repo
This step is required so the Snowplow pipeline (enrichment) can validate the incoming custom contexts.
#### Preparation
- Copy the Objectiv iglu schemas (see here); you can optionally lint them before pushing, as sketched after this list;
- Get the address/URL of your iglu repository;
- Get the UUID of the repo.
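If you want to sanity-check the copied schemas before pushing, igluctl can lint them. A minimal sketch, assuming the schemas were copied into a local `./iglu` directory (the path is just an example):

```bash
# Lint the copied Objectiv schemas before pushing them to the iglu repository.
# The ./iglu path is an example; point it at wherever you copied the schemas.
java -jar igluctl lint ./iglu
```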
#### Pushing the schema
```bash
java -jar igluctl static push --public <path to iglu schemas> <url to repo> <uuid>

# example:
java -jar igluctl static push --public ./iglu https://iglu.example.com myuuid-abcd-abcd-abcd-abcdef12345
```
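For reference, each schema pushed this way is a Snowplow self-describing JSON Schema. The sketch below only illustrates the general shape: the vendor follows the naming scheme described further down, but the name, version and properties shown here are assumptions; the actual schemas are the ones you copied in the preparation step.

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Illustrative sketch only; not the actual Objectiv ApplicationContext schema.",
  "self": {
    "vendor": "io.objectiv.context",
    "name": "ApplicationContext",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "id": { "type": "string" }
  },
  "required": ["id"],
  "additionalProperties": false
}
```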
### 2. Configure output to push events to the data store
The Collector can be configured to push events into a Snowplow message queue, using environment variables; an illustrative sketch follows the list below.
- To send output to GCP/BigQuery, please refer to BigQuery instructions.
- To send output to AWS SQS/Kinesis, please refer to Amazon S3 instructions.
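As an illustration of what such configuration can look like, the sketch below sets output-related environment variables before starting the Collector. The variable names and values are placeholders, not the Collector's actual configuration keys; the BigQuery and Amazon S3 instructions linked above list the real ones.

```bash
# Illustrative only: the variable names below are placeholders, not the
# Collector's actual configuration keys -- see the BigQuery / Amazon S3
# instructions for the real ones.

# GCP / BigQuery via PubSub (placeholder names)
export SNOWPLOW_GCP_PROJECT="my-gcp-project"
export SNOWPLOW_GCP_PUBSUB_TOPIC_RAW="sp-raw"

# AWS S3 via SQS/Kinesis (placeholder names)
export SNOWPLOW_AWS_SQS_QUEUE_RAW="sp-raw-queue"

# Then start the Objectiv Collector as you normally would, with this
# environment in place (e.g. via docker compose).
```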
## Background
The Snowplow pipeline roughly consists of the following components:
Collector
: HTTP(S) endpoint that receives events;

Enrichment
: process that validates incoming events, potentially enriches them (adds metadata);

Loader
: final step, where the validated and enriched events are loaded into persistent storage. Depending on your choice of platform, this could be BigQuery on GCP, Redshift on AWS, etc.;

iglu
: central repository used by the other components to pull schemas for validation of events, contexts, etc.
The Snowplow pipeline uses message queues and Thrift messages to communicate between the components. Objectiv uses its own Collector (which also handles validation) that bypasses the Snowplow collector, and pushes events directly into the message queue that is read by the enrichment step.
Snowplow allows for so-called structured custom contexts to be added to events. This is exactly what Objectiv uses. As with all contexts, they must pass validation in the enrichment step, which is why a schema for the Objectiv custom context must be added to iglu, so Snowplow knows how to validate the context. Furthermore, Snowplow uses that schema to infer the database schema needed to persist the context. How this is handled depends on the loader chosen, e.g. Postgres uses a more relational schema than BigQuery.
## Objectiv to Snowplow events mapping
In a standard Snowplow setup, all data is stored in a table called `events`. Objectiv data is stored in that table by mapping the Objectiv event properties onto the respective Snowplow properties. Objectiv's contexts are stored in custom contexts.
### Events
Event and some context properties are mapped directly onto the Snowplow `events` table. See the table below for details:
| Objectiv property          | SP Tracker property | Snowplow property |
|----------------------------|---------------------|-------------------|
| event.event_id             | eid                 | event_id          |
| event.time                 | ttm                 | true_stamp        |
| event._type                | se_ca               | se_category       |
| ApplicationContext.id      | aid                 | app_id            |
| CookieIdContext.id         | networkUserId       | network_userid    |
| HttpContext.referrer       | refr                | page_referrer     |
| HttpContext.remote_address | ip                  | user_ipaddress    |
| PathContext.id             | url                 | page_url          |
### Global contexts

For every global context, a specific custom context is created, with its own schema in iglu. The naming scheme is `io.objectiv.context/SomeContext`.

NOTE: the `_type` and `_types` properties have been removed.
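To illustrate, a global context such as the ApplicationContext could then travel through the pipeline as a self-describing JSON along the lines below. The schema version and the exact payload are assumptions for illustration only; the `id` property follows the mapping table above.

```json
{
  "schema": "iglu:io.objectiv.context/ApplicationContext/jsonschema/1-0-0",
  "data": {
    "id": "my-application"
  }
}
```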
### Location stack

As order is significant in the location stack, a slightly different approach is taken in storing it. The location stack is stored as a nested structure in a custom context (`io.objectiv/location_stack`).
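As a rough, hypothetical sketch of that idea: a single custom context wrapping the ordered stack. The schema version, the field names and the nested entries below are illustrative assumptions only, not the actual payload format.

```json
{
  "schema": "iglu:io.objectiv/location_stack/jsonschema/1-0-0",
  "data": {
    "location_stack": [
      { "_type": "RootLocationContext", "id": "home" },
      { "_type": "ContentContext", "id": "hero" }
    ]
  }
}
```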