Google BigQuery

The Objectiv Collector can be configured to work with BigQuery through Snowplow on Google Cloud Platform. The Snowplow GCP setup uses PubSub topics (message queues) to connect the various stages in the pipeline.

How to set up Objectiv on GCP using Snowplow


We assume below that you've already read how to set up Objectiv with Snowplow.

The setup works as follows:

  1. Events arrive at the Objectiv Collector, and are validated;
  2. Good events are published on the raw topic on PubSub (which is read by the Enrich process); and
  3. Bad (invalid) events are published on the bad topic on PubSub.
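
For illustration, here is a minimal sketch of that publishing step, assuming the google-cloud-pubsub client library. The project id, topic ids and the routing function are placeholders, not Objectiv's actual implementation:

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
raw_topic = publisher.topic_path("some-gcp-project", "sp-raw-topic")
bad_topic = publisher.topic_path("some-gcp-project", "sp-bad-topic")

def route(event: dict, is_valid: bool) -> None:
    # Serialize the event; the real pipeline uses Snowplow's collector payload format.
    payload = json.dumps(event).encode("utf-8")
    topic = raw_topic if is_valid else bad_topic   # good -> raw, bad -> bad
    publisher.publish(topic, payload)              # the raw topic is read by Enrich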

Starting the Collector

The Collector is configured through environment variables, which determine which outputs are used. The settings specific to the PubSub sink are:

  • SP_GCP_PROJECT: The id of the GCP project where the PubSub topics are located;
  • SP_GCP_PUBSUB_TOPIC_RAW: The id of the PubSub topic to publish good (validated) events to;
  • SP_GCP_PUBSUB_TOPIC_BAD: The id of the PubSub topic to publish bad (invalid) events to;
  • GOOGLE_APPLICATION_CREDENTIALS: The path to a JSON file with credentials for a GCP service account that is allowed to publish to the PubSub topics.

If these are set, the Snowplow sink is automatically enabled.
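
As an illustration of that switch (not the Collector's actual code), enabling the sink can be thought of as a simple presence check on those variables:

import os

# The PubSub sink only makes sense when the project, topics and credentials are all set.
snowplow_gcp_enabled = all(
    os.environ.get(var)
    for var in (
        "SP_GCP_PROJECT",
        "SP_GCP_PUBSUB_TOPIC_RAW",
        "SP_GCP_PUBSUB_TOPIC_BAD",
        "GOOGLE_APPLICATION_CREDENTIALS",
    )
)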

Using docker-compose

To run this setup in Docker, make sure that the aforementioned environment variables are properly set and available in the container. Also make sure the credentials file is mounted into the container, so the configured path actually exists there.

When using docker-compose, the following YAML snippet will do the trick:

container_name: objectiv_collector
image: objectiv/backend
working_dir: /services
volumes:
  - /path/to/YOUR_SERVICE_ACCOUNT.json:/sa.json
environment:
  GOOGLE_APPLICATION_CREDENTIALS: /sa.json
  SP_GCP_PROJECT: some-gcp-project
  SP_GCP_PUBSUB_TOPIC_RAW: sp-raw-topic    # placeholder topic ids
  SP_GCP_PUBSUB_TOPIC_BAD: sp-bad-topic

The important parts here are:

  • Using a volume to make the service account available inside the container;
  • Assigning the path of the volume-mapped file correctly to the environment variable;
  • Setting the GCP/PubSub variables (SP_GCP_PROJECT, SP_GCP_PUBSUB_TOPIC_RAW and SP_GCP_PUBSUB_TOPIC_BAD), so the Collector knows where to push the events.

Running locally

Running the Collector locally in a dev setup is quite similar:

# setup environment
virtualenv objectiv-venv
source objectiv-venv/bin/activate
pip install -r requirements.txt

# start flask app with the PubSub sink enabled
cd objectiv_backend
SP_GCP_PROJECT=some-gcp-project \
SP_GCP_PUBSUB_TOPIC_RAW=sp-raw-topic \
SP_GCP_PUBSUB_TOPIC_BAD=sp-bad-topic \
GOOGLE_APPLICATION_CREDENTIALS=/path/to/YOUR_SERVICE_ACCOUNT.json \
flask run

Test the setup

If the Snowplow config is loaded, the Collector will display the message: Enabled Snowplow: GCP pipeline.

This indicates that the Collector will try to push events to PubSub. If pushing fails, the logs should hint at what is happening. If there are no errors in the Collector logs, the events should have been pushed successfully into the raw topic, ready to be picked up by Snowplow's enrichment.

To check whether messages are actually arriving on the PubSub topic, refer to the monitoring of that specific topic in the GCP console. The "Publish message request count" metric should show more than 0 requests/sec.
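
Alternatively, you can pull a few messages straight from a subscription attached to the raw topic. The sketch below assumes such a subscription exists; the project and subscription names are placeholders:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("some-gcp-project", "sp-raw-test-subscription")

# Pull up to 5 messages without acknowledging them, so they will be redelivered later.
response = subscriber.pull(request={"subscription": subscription, "max_messages": 5})
for received in response.received_messages:
    print(received.message.message_id, len(received.message.data), "bytes")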

Connect to BigQuery in your notebook

See how to get started in your notebook to connect to the BigQuery database and start modeling.
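
If you just want to verify connectivity outside the Objectiv modeling stack, a plain BigQuery client query also works. The dataset and table names below are placeholders for your own Snowplow setup:

from google.cloud import bigquery

client = bigquery.Client(project="some-gcp-project")
query = """
    SELECT collector_tstamp, event_id
    FROM `some-gcp-project.snowplow.events`
    ORDER BY collector_tstamp DESC
    LIMIT 10
"""
# Print the 10 most recently collected events.
for row in client.query(query).result():
    print(row.collector_tstamp, row.event_id)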

BigQuery table structure

In a standard Snowplow BigQuery setup, all data is stored in a table called events. Objectiv data is stored in this table by mapping Objectiv event properties onto the corresponding Snowplow properties; Objectiv's contexts are stored as Snowplow custom contexts.


Event and some context properties are mapped onto the events table directly.

Global contexts

Every global context has its own custom context, and thus its own column in the table. Columns are named after the context, e.g. contexts_io_objectiv_context_some_global_context_1_0_0, and contain a field for each property of the original Objectiv context.

NOTE: the _type and _types properties have been removed.
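
To read such a column, you can unnest it. This sketch assumes the repeated-record layout Snowplow's BigQuery loader typically creates for contexts, reuses the placeholder names from above, and assumes the context has an id property, as Objectiv contexts do:

from google.cloud import bigquery

client = bigquery.Client(project="some-gcp-project")
query = """
    SELECT event_id, gc.id AS context_id
    FROM `some-gcp-project.snowplow.events`,
        UNNEST(contexts_io_objectiv_context_some_global_context_1_0_0) AS gc
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.event_id, row.context_id)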

Location stack

As order is significant in the location stack, a slightly different approach is taken to storing it: the location stack is stored as a nested structure in contexts_io_objectiv_location_stack_1_0_0.
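
Because the whole stack lives in a single nested column, a simple way to inspect it is to select that column as-is (again with placeholder dataset and table names):

from google.cloud import bigquery

client = bigquery.Client(project="some-gcp-project")
rows = client.query(
    "SELECT event_id, contexts_io_objectiv_location_stack_1_0_0 AS location_stack "
    "FROM `some-gcp-project.snowplow.events` LIMIT 5"
).result()
for row in rows:
    print(row.event_id, row.location_stack)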