The Objectiv Collector can be configured to work with BigQuery through Snowplow on Google Cloud Platform. The Snowplow GCP setup uses GCP PubSub topics (a message queue) to connect the various stages in the pipeline.
How to set up Objectiv on GCP using Snowplow
We assume below that you've already read how to set up Objectiv with Snowplow.
The setup works as follows:
- Events arrive at the Objectiv Collector, and are validated;
- Good events are published on the `raw` topic on PubSub (which is read by the Enrich process); and
- Bad (invalid) events are published on the `bad` topic on PubSub.
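If the topics don't exist yet, they can be created up front. A minimal sketch using the `gcloud` CLI (the topic ids `sp-raw` and `sp-bad` are example names, not prescribed by the Collector):

```bash
# create the raw and bad PubSub topics; ids are examples, use your own
gcloud pubsub topics create sp-raw
gcloud pubsub topics create sp-bad
```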
Starting the Collector
The configuration of the Collector is controlled through environment variables that determine which outputs are used. Settings specific to the PubSub sink are:
- `SP_GCP_PROJECT`: the `id` of the project on GCP where the PubSub topics are located;
- `SP_GCP_PUBSUB_TOPIC_RAW`: the `id` of the PubSub topic to publish events to;
- `SP_GCP_PUBSUB_TOPIC_BAD`: the `id` of the PubSub topic to publish bad/invalid events to;
- `GOOGLE_APPLICATION_CREDENTIALS`: the path to a `json` file containing a service account on GCP that allows publishing to the PubSub topics.
If these are set, the Snowplow sink is automatically enabled.
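As a minimal sketch, the variables can be set in a shell like this (the project id, topic ids and credentials path are placeholder values):

```bash
# placeholder values; use your own project, topics and service account
export SP_GCP_PROJECT=my-gcp-project
export SP_GCP_PUBSUB_TOPIC_RAW=sp-raw
export SP_GCP_PUBSUB_TOPIC_BAD=sp-bad
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa.json
```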
To run this setup in Docker, make sure that the aforementioned environment variables are properly set and available in the container. Also take care that the path to the credentials is actually available in the container.
When using `docker-compose`, a yaml snippet along the following lines will do the trick (the service definition is a sketch; image, paths and topic ids are examples):
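```yaml
objectiv_collector:
  # image and port are examples; use the Collector image and port you deploy
  image: objectiv/backend:latest
  ports:
    - "5000:5000"
  volumes:
    # make the service account available inside the container
    - /path/to/your/sa.json:/sa.json
  environment:
    # path of the volume-mapped file above
    GOOGLE_APPLICATION_CREDENTIALS: /sa.json
    # the 2 GCP/PubSub variables; values are placeholders
    SP_GCP_PROJECT: my-gcp-project
    SP_GCP_PUBSUB_TOPIC_RAW: sp-raw
```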
The important parts here are:
- Using a volume to make the service account available inside the container;
- Assigning the path of the volume-mapped file correctly to the environment variable;
- Setting the 2 GCP/PubSub variables, so the Collector knows where to push the events.
Running the Collector locally in a dev setup is pretty similar:
```bash
# setup environment
pip install -r requirements.in

# start flask app; assumes the PubSub variables above are exported in this
# shell, and that the app module name matches your checkout
FLASK_APP=objectiv_backend.app flask run
```
Test the setup
The Collector will display a message if the Snowplow config is loaded:
Enabled Snowplow: GCP pipeline.
This indicates that the Collector will try to push events to PubSub. If pushing fails, the logs should hint at what's going wrong. If there are no errors in the Collector logs, the events should have been pushed successfully into the raw topic, to be picked up by Snowplow's enrichment process.
To check whether messages have been successfully received by the PubSub topic, refer to the monitoring of that specific topic in the GCP console: the 'Publish message request count' metric should show more than 0 requests/sec.
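Alternatively, a sketch of checking from the command line with `gcloud` (the subscription name is made up for this example, and `sp-raw` is the example raw topic id from above):

```bash
# attach a temporary subscription to the raw topic and pull a few messages;
# note: a subscription only receives messages published after it was created
gcloud pubsub subscriptions create debug-sub --topic=sp-raw
gcloud pubsub subscriptions pull debug-sub --auto-ack --limit=5

# clean up afterwards
gcloud pubsub subscriptions delete debug-sub
```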
Connect to BigQuery in your notebook
See how to get started in your notebook to connect to the BigQuery database and start modeling.
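Outside the Objectiv modeling stack, a quick sanity check of the connection is also possible with the plain `google-cloud-bigquery` client; in the sketch below the project, dataset and credentials path are placeholders:

```python
from google.cloud import bigquery

# placeholders: use your own service account, project and dataset
client = bigquery.Client.from_service_account_json("/path/to/sa.json")
query = """
    SELECT event_id, collector_tstamp
    FROM `my-gcp-project.my_dataset.events`
    ORDER BY collector_tstamp DESC
    LIMIT 10
"""
# print the ten most recent events loaded into BigQuery
for row in client.query(query).result():
    print(row.event_id, row.collector_tstamp)
```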
BigQuery table structure
In a standard Snowplow BigQuery setup, all data is stored in a table called `events`. Objectiv data is stored in the `events` table by mapping the Objectiv event properties onto the respective Snowplow properties. Objectiv's contexts are stored in custom contexts; the event and some context properties are mapped onto the `events` table directly.
Every global context has its own custom context, and thus its own column in the database. Columns in the database will be named like `contexts_io_objectiv_context_some_global_context_1_0_0`, containing a property for each of the properties in the original context. The `_type` and `_types` properties have been removed.
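For illustration, such a column can be queried directly; the sketch below assumes the ApplicationContext and derives its column name from the naming pattern above (Snowplow's BigQuery loader stores custom contexts as repeated records, hence the offset):

```sql
-- column and table names follow the pattern above, but are assumptions here
SELECT
  contexts_io_objectiv_context_application_context_1_0_0[SAFE_OFFSET(0)].id
FROM
  `my-gcp-project.my_dataset.events`
LIMIT 10
```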
As order is significant in the location stack, a slightly different approach is taken to storing it: the location stack is stored as a whole, as a nested structure in a column of its own.