Supported Data Stores

Our aim is to support all popular data stores with Bach. Currently, Bach supports AWS Athena, Google BigQuery, and PostgreSQL. Our roadmap includes Databricks, Redshift, ClickHouse, Snowflake, and others.

For information on how to set up Objectiv with each of the data stores, see the Collector sections for PostgreSQL, AWS Athena, or Google BigQuery.

Bach works the same regardless of the underlying database. Generally, anything that works on one supported database will work on the others too, with no code changes needed. There are some notable exceptions, however. Below we list the major platform-specific exceptions, and other things to keep in mind when using Bach with the different supported databases.

AWS Athena

AWS Athena is a serverless cloud-hosted SQL engine that makes it possible to query huge datasets stored on AWS's S3 service.

When running Bach on Athena, one should be aware of the following:

Overwriting tables doesn't delete the old data. Operations that overwrite an existing table (e.g. DataFrame.database_create_table() with if_exists='replace') don't actually remove the data of the original table. The original table definition is removed from Athena, but the data itself remains on AWS S3; you will have to clean that up yourself.
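As a hedged sketch of such a cleanup, assuming the replaced table's data lived under the hypothetical key prefix 'athena/old_table/' in a hypothetical bucket 'my-bucket', the orphaned objects could be removed with boto3:

import boto3

# Delete the objects left behind by the replaced table. The bucket name and
# key prefix below are hypothetical examples; substitute your own.
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
bucket.objects.filter(Prefix='athena/old_table/').delete()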

Temporary tables are not supported. As an alternative, consider using DataFrame.database_create_table().
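For example, a minimal sketch, assuming an existing Bach DataFrame df, a hypothetical table name 'intermediate_results', and that the table name is passed as the first argument:

# Persist the current state of df as a real table; this can stand in for the
# temporary tables that Athena lacks. Remember that replaced tables leave data
# behind on S3, as noted above.
df = df.database_create_table('intermediate_results', if_exists='replace')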

Some Bach data types are represented or stored as different types. Data type support in Athena is limited compared to some other databases. Bach uses workarounds for this that are mostly transparent to the user, but can become apparent when writing resulting data to a table and inspecting that table:

  • The Bach types time and timedelta do not use specialized time types; instead, Athena's double type stores the number of (fractional) seconds (see the sketch after this list).
  • The Bach type json is converted to string when written to a table, as Athena does not support storing json data directly.
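A minimal sketch of the first point, assuming df has timestamp columns 'start' and 'end' (hypothetical names):

# Subtracting two timestamp Series yields a Bach timedelta Series. Within Bach it
# behaves like a timedelta, but on Athena it is stored as a double holding the
# number of (fractional) seconds.
df['duration'] = df['end'] - df['start']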

Bach Series names with special characters get encoded. Athena limits the allowed characters in column names. Therefore, Bach Series names that contain 'special' characters are encoded in SQL, and might be harder to relate to the original Series name. See the tip on using simple Series names.
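The simplest workaround is to rename such Series up front. A sketch, assuming df has a Series named 'Page Views (#)' (hypothetical) and that Bach's DataFrame.rename mirrors the pandas signature:

# Rename the Series to a plain lowercase identifier, so no encoding is needed.
df = df.rename(columns={'Page Views (#)': 'page_views'})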

Google BigQuery

Google BigQuery is a serverless cloud-hosted database that can handle huge datasets. When running Bach on BigQuery, one should be aware of the following:

Bach Series names with special characters get encoded. BigQuery limits the allowed characters in column names. Therefore, Bach Series names that contain 'special' characters are encoded in SQL, and might be harder to relate to the original Series name. See the tip on using simple Series names, and the rename sketch in the AWS Athena section above.

BigQuery cannot execute very complex queries. Sometimes complex Bach operations generate queries that BigQuery cannot execute, and you might get an error:

Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.

One way to avoid this problem is to materialize intermediate results before the query becomes too complex:

# Write the current state of df to a temporary table; subsequent operations query that table.
df = df.materialize(materialization='temp_table')

Also see the tip on using temporary tables.

Special BigQuery Series types. The Series dtypes list and dict are currently only supported on BigQuery, and their functionality is limited. Some functions might not work as expected on such Series; e.g. sorting on such columns will give an error.
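A small sketch of working around the sorting limitation, assuming df has a list-dtype column 'items' and a timestamp column 'moment' (both hypothetical names):

# Sorting directly on the list column would raise an error on BigQuery,
# so sort on a scalar column instead.
df = df.sort_values(by='moment')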

PostgreSQL

PostgreSQL is a very feature-rich database, and as a result Bach is fully supported with very few oddities. Generally, Postgres has little trouble with the complex queries that Bach generates, but it might struggle with big data sets, which can be less of a problem for some of the other supported databases.

Special Postgres Series type. The Series dtype json_postgres is only supported on Postgres. When possible, use the json dtype, which is supported on all databases.
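A minimal sketch, assuming df has a column 'context' (hypothetical) holding json_postgres data, and that Series.astype accepts a dtype name, as in pandas:

# Cast the Postgres-specific dtype to the generic json dtype for portability.
df['context'] = df['context'].astype('json')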