Usage
This page explains how to use Bach:
See the Examples section for some basic Bach examples: creating a DataFrame, simple data operations, aggregate operations, and getting the resulting data filtered and sorted.
See the Best Practices section for tips on how to best use Bach.
Examples
In the examples below, we'll assume that the database has a table called 'example', with a few specific columns. The SQL to create that table can be found below in Appendix: Example Data.
To get a taste of what you can do with Objectiv Bach, there's a demo that enables you to run the full Objectiv pipeline on your local machine. It includes our website as a demo app, a Jupyter Notebook environment with working models, and a Metabase environment to output data to.
Create a DataFrame from a database table
from bach import from_table
import sqlalchemy
# Setup database connection
engine = sqlalchemy.create_engine(DB_URL)
# Create Bach DataFrame representing all the data in the 'example' table, with the 'city_id' as index
df = from_table(engine, 'example', index=['city_id'])
The above fragment queries the database to get the table structure of the 'example' table, but it does not query any of the data in 'example'; it therefore works equally well for a tiny table as for a huge one.
It is also possible to create a DataFrame from an arbitrary SQL query (using from_model) or from an existing pandas DataFrame (using from_pandas).
Basic operations
# Adding a new column
df['column_name'] = 1337
# Setting a new column to the value of an existing column
df['another column'] = df['city']
# Add a column 'century' with the result of some arithmetic
df['century'] = df['founding'] // 100 + 1
# Add a column 'concat', with the result of concatenating strings
df['concat'] = df['city'] + ' is located in ' + df['municipality']
# remove the city column
df = df.drop(columns=['city'])
# rename the 'another column' column to 'city'
df = df.rename(columns={'another column': 'city'})
# Convert the Bach DataFrame to a pandas DataFrame.
# When executing in a notebook this will print the dataframe.
df.to_pandas()
The above operations add, remove, and rename some columns of the DataFrame. However, no actual query is executed on the database until df.to_pandas() is called; the DataFrame operations merely change the symbolic representation of the data in the DataFrame and its Series.
The call to df.to_pandas() here is merely for demonstration purposes. In situations with bigger data sets it should be avoided until the data is actually needed, as it will query the database and transfer all data.
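Because Bach mirrors the pandas API, you can sanity-check the expected results of these operations in plain pandas on a couple of rows from the appendix data, without a database. Note that this sketch uses pandas directly, not Bach:

```python
import pandas as pd

# A few rows from the example table (see Appendix: Example Data)
df = pd.DataFrame({
    'city': ['Ljouwert', 'Snits'],
    'municipality': ['Leeuwarden', 'Súdwest-Fryslân'],
    'founding': [1285, 1456],
})

df['column_name'] = 1337                   # new constant column
df['another column'] = df['city']          # copy an existing column
df['century'] = df['founding'] // 100 + 1  # integer arithmetic
df['concat'] = df['city'] + ' is located in ' + df['municipality']
df = df.drop(columns=['city'])
df = df.rename(columns={'another column': 'city'})
```

The semantics match what Bach compiles to SQL; the key difference is that pandas executes each step eagerly, while Bach only updates its symbolic representation until you ask for the data.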
Aggregate operations
# Group on century, select the 'inhabitants' column, and calculate the maximum value within the group
df_max = df.groupby('century')[['inhabitants']].max()
df_max = df_max.reset_index()
# df_max has two columns: 'century' and 'inhabitants_max'
# merge df_max back into df, the merge will be done on the 'century' column as that is in both df and df_max
df = df.merge(df_max)
# Alternative method: use a window function
century_window = df.window('century')
df['inhabitants_max_2'] = df['inhabitants'].max(century_window)
# Convert the Bach DataFrame to a pandas DataFrame.
# When executing in a notebook this will print the dataframe.
df.to_pandas()
The above example demonstrates how to calculate an aggregate function (in this case max()) on a group of rows, or within a window of rows. Additionally, it shows how to merge two DataFrames. Again, only the optional debug statement df.to_pandas() runs a query; the other operations merely update the internal state of the DataFrame and its Series.
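The same group-by, merge, and window logic can be sketched in plain pandas on a subset of the appendix data. The explicit rename to inhabitants_max mimics the column name Bach assigns automatically, and transform('max') plays the role of the window function; this is a pandas stand-in, not Bach itself:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Ljouwert', 'Drylts', 'Snits', 'Boalsert'],
    'inhabitants': [93485, 3055, 33520, 10120],
    'century': [13, 13, 15, 15],
})

# Group on century and take the maximum inhabitants per group
df_max = df.groupby('century')[['inhabitants']].max().reset_index()
df_max = df_max.rename(columns={'inhabitants': 'inhabitants_max'})

# Merge the per-group maximum back into every row
df = df.merge(df_max, on='century')

# Window-style alternative: transform keeps one row per input row
df['inhabitants_max_2'] = df.groupby('century')['inhabitants'].transform('max')
```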
Filtering, sorting, and output
# Only keep the rows for which inhabitants == inhabitants_max,
# i.e. the cities that are the biggest of all cities founded in the same century
df = df[df.inhabitants == df.inhabitants_max]
# Sort by century
df = df.sort_values('century')
# Only keep selected columns
df = df[['skating_order', 'municipality', 'inhabitants', 'founding', 'city']]
# Query database.
print(df.to_pandas())
# Expected output:
# skating_order municipality inhabitants founding city
# city_id
# 5 5 Súdwest-Fryslân 960 1061 Starum
# 1 1 Leeuwarden 93485 1285 Ljouwert
# 10 10 Waadhoeke 12760 1374 Frjentsjer
# 2 2 Súdwest-Fryslân 33520 1456 Snits
# Show the SQL query used to generate the above output:
print(df.view_sql())
The above example demonstrates filtering out rows and sorting a DataFrame. Without the sort_values() call, the order of the returned rows is non-deterministic. view_sql() can be used to show the compiled SQL query that encompasses all operations done so far.
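Applied to the full appendix table in plain pandas, the same filter-and-sort pipeline reproduces the expected output above. This is a pandas stand-in for the Bach operations, with the window maximum done via transform('max'):

```python
import pandas as pd

# Full contents of the example table (see Appendix: Example Data)
df = pd.DataFrame({
    'city_id': range(1, 12),
    'skating_order': range(1, 12),
    'city': ['Ljouwert', 'Snits', 'Drylts', 'Sleat', 'Starum', 'Hylpen',
             'Warkum', 'Boalsert', 'Harns', 'Frjentsjer', 'Dokkum'],
    'municipality': ['Leeuwarden', 'Súdwest-Fryslân', 'Súdwest-Fryslân',
                     'De Friese Meren', 'Súdwest-Fryslân', 'Súdwest-Fryslân',
                     'Súdwest-Fryslân', 'Súdwest-Fryslân', 'Harlingen',
                     'Waadhoeke', 'Noardeast-Fryslân'],
    'inhabitants': [93485, 33520, 3055, 700, 960, 870, 4440, 10120, 14740,
                    12760, 12675],
    'founding': [1285, 1456, 1268, 1426, 1061, 1225, 1399, 1455, 1234, 1374,
                 1298],
}).set_index('city_id')

df['century'] = df['founding'] // 100 + 1
df['inhabitants_max'] = df.groupby('century')['inhabitants'].transform('max')

# Keep the biggest city per founding century, sorted by century
df = df[df['inhabitants'] == df['inhabitants_max']]
df = df.sort_values('century')
df = df[['skating_order', 'municipality', 'inhabitants', 'founding', 'city']]
```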
Filtering by Index Labels
In the following examples we call the to_pandas() method multiple times, but only for visualization purposes. Please use to_pandas() only when necessary, as it executes the frame's current query.
Here we construct a simple DataFrame to illustrate the label-selection functionality:
In [1]: import pandas
In [2]: data = {
...: 'skating_order': [1, 2, 3, 4, 5],
...: 'city': ['Ljouwert', 'Snits', 'Drylts', 'Sleat', 'Starum'],
...: 'municipality': ['Leeuwarden', 'Súdwest-Fryslân', 'Súdwest-Fryslân', 'De Friese Meren', 'Súdwest-Fryslân'],
...: 'inhabitants': [1285, 1456, 1268, 1426, 1061],
...: }
...:
In [3]: pdf = pandas.DataFrame(data)
In [4]: from bach import DataFrame  # uses the sqlalchemy engine created earlier
   ...: df = DataFrame.from_pandas(engine=engine, df=pdf)
In [5]: df = df.set_index('city')
In [6]: df.to_pandas()
Out[6]:
skating_order municipality inhabitants
city
Ljouwert 1 Leeuwarden 1285
Snits 2 Súdwest-Fryslân 1456
Drylts 3 Súdwest-Fryslân 1268
Sleat 4 De Friese Meren 1426
Starum 5 Súdwest-Fryslân 1061
If you want to select a specific row from the frame, you can simply pass the label to the loc property:
In [7]: df.loc['Drylts'].to_pandas()
Out[7]:
__stacked_index
skating_order 3
municipality Súdwest-Fryslân
inhabitants 1268
Name: __stacked, dtype: object
You can observe that the previous result is a Bach Series; this is always the return type for single-label selection, where all selected columns are stacked into one single index Series.
If you require just a subset of the columns, you can pass them as a second parameter:
In [8]: df.loc['Drylts', ['inhabitants', 'skating_order']].to_pandas()
Out[8]:
__stacked_index
inhabitants 1268
skating_order 3
Name: __stacked, dtype: int64
This will still return a Series; the main difference is that only inhabitants and skating_order are considered in the stacked index.
If a column included in the selection does not exist in the frame, this will raise an error.
In case you want to select/filter rows based on multiple labels, you can pass a list of labels:
In [9]: df.loc[['Drylts', 'Ljouwert'], 'municipality'].to_pandas()
Out[9]:
municipality
city
Ljouwert Leeuwarden
Drylts Súdwest-Fryslân
The code above returns a Bach DataFrame instead; this is always the return type for list-label selection.
Selection by label slices is also possible, for both rows and columns, as the next example shows. Before slicing a frame's rows, you must sort it first, as slicing an unsorted frame is non-deterministic.
In [10]: df.sort_index().loc['Sleat':, 'municipality':].to_pandas()
Out[10]:
municipality inhabitants
city
Sleat De Friese Meren 1426
Snits Súdwest-Fryslân 1456
Starum Súdwest-Fryslân 1061
In case you need to select rows based on a condition, a boolean Series can also be passed to the loc property.
In [11]: df.loc[df['inhabitants'] > 1300].to_pandas()
Out[11]:
skating_order municipality inhabitants
city
Snits 2 Súdwest-Fryslân 1456
Sleat 4 De Friese Meren 1426
In the previous examples, we selected rows by labels that actually exist in the frame. If a label doesn't exist in the frame, no error is raised, since Bach has no notion of which values exist in the frame.
In [12]: df.loc['x'].to_pandas()
Out[12]: Series([], Name: __stacked, dtype: object)
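This is a difference from pandas, which validates labels eagerly: the same lookup on a pandas DataFrame raises a KeyError instead of returning an empty result. A minimal illustration in plain pandas:

```python
import pandas as pd

df = pd.DataFrame(
    {'inhabitants': [1285, 1456]},
    index=pd.Index(['Ljouwert', 'Snits'], name='city'),
)

try:
    df.loc['x']
    missing_label_raised = False
except KeyError:
    missing_label_raised = True  # pandas raises; Bach returns an empty Series
```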
Setting Values to a DataFrame Subset
In the previous section we used the loc property to filter the frame by labels. As in pandas, you are also able to set values for a specific group of rows and update the main frame. This works for all types of selections.
In [13]: df.loc['Drylts'] = 'x'
In [14]: df.to_pandas()
Out[14]:
skating_order municipality inhabitants
city
Ljouwert 1 Leeuwarden 1285
Snits 2 Súdwest-Fryslân 1456
Drylts x x x
Sleat 4 De Friese Meren 1426
Starum 5 Súdwest-Fryslân 1061
You can see that the previous code block sets all Series to 'x' where the index is equal to Drylts.
If the value being set has a different dtype than the Series being modified, the Series's dtype will be changed to string. This does not apply if both have a numerical dtype.
In [15]: df = df.sort_index()
In [16]: df.loc['Sleat':, 'municipality'] = 'Fryslân'
In [17]: df.to_pandas()
Out[17]:
skating_order municipality inhabitants
city
Drylts x x x
Ljouwert 1 Leeuwarden 1285
Sleat 4 Fryslân 1426
Snits 2 Fryslân 1456
Starum 5 Fryslân 1061
As mentioned, setting a value is possible for any type of selection. Notice that sorting is also needed in this case: df.sort_index().loc['Sleat':, 'municipality'] = 'Fryslân' will have no effect on df, since sort_index() returns a new DataFrame.
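The same pitfall exists in plain pandas: assigning through the result of sort_index() modifies a temporary copy, not the original frame. A small pandas sketch of both the pitfall and the fix:

```python
import pandas as pd

df = pd.DataFrame(
    {'municipality': ['Leeuwarden', 'De Friese Meren', 'Súdwest-Fryslân']},
    index=pd.Index(['Ljouwert', 'Sleat', 'Snits'], name='city'),
)

# Pitfall: sort_index() returns a new frame, so this assignment is lost
df.sort_index().loc['Sleat':, 'municipality'] = 'Fryslân'
assert 'Fryslân' not in set(df['municipality'])  # df is unchanged

# Fix: keep the sorted frame, then assign on it
df = df.sort_index()
df.loc['Sleat':, 'municipality'] = 'Fryslân'
```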
Best Practices
Use simple Series names for cleaner SQL
All Series in a Bach DataFrame map directly to database columns with the same name. However, some databases limit the characters that can be used in column names. To accommodate this, Bach will transparently map Series names with 'special' characters to different column names.
This does mean that the column names in the generated SQL query can differ from the names in the DataFrame. If that's undesired, stick to Series names that only contain the characters a-z, 0-9, and _, start with a-z, and are at most 63 characters long.
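These rules can be checked mechanically; the regular expression below is our own illustration of the stated constraints, not part of the Bach API:

```python
import re

# First character a-z, then up to 62 more of a-z, 0-9, or _ (63 chars total)
SIMPLE_NAME = re.compile(r'^[a-z][a-z0-9_]{0,62}$')

assert SIMPLE_NAME.match('city_id')       # fine
assert not SIMPLE_NAME.match('1st_city')  # must start with a-z
assert not SIMPLE_NAME.match('city name') # no spaces or special characters
```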
Use a data sample to limit the data queried
Querying big datasets can be very costly in both time and money.
To get a smaller sample of the current DataFrame, use DataFrame.get_sample():
table_name = 'example-project.writable_dataset.table_name'
df = df.get_sample(table_name, sample_percentage=10)
This creates a permanent table with 10% of the DataFrame's current data. The DataFrame df will use this new table in all later operations. DataFrame.get_unsampled() can be used to switch the DataFrame's source table back to the original, without undoing any of the operations done since the sample was created.
Use temporary tables to limit query complexity
Very complex queries can lead to problems with databases. Sometimes the performance degrades and sometimes databases might not even be able to execute a query at all.
One solution is to materialize the state of your DataFrame into a temporary table in between complex operations.
df = df.materialize(materialization='temp_table')
Calling DataFrame.materialize() does not cause a direct call to the database; rather, it changes the SQL that will be generated later on. That SQL will be split into parts: first a query that creates a temporary table with the current state of the DataFrame, then the SQL for the subsequent operations. In some cases this can help the database a lot.
One way of checking SQL complexity is to print the resulting query:
display_sql_as_markdown(df)
Appendix: Example Data
Data used by some of the usage examples:
CREATE TABLE example (
city_id bigint,
skating_order bigint,
city text,
municipality text,
inhabitants bigint,
founding bigint
);
insert into example(city_id, skating_order, city, municipality, inhabitants, founding) values
(1, 1, 'Ljouwert', 'Leeuwarden', 93485, 1285),
(2, 2, 'Snits', 'Súdwest-Fryslân', 33520, 1456),
(3, 3, 'Drylts', 'Súdwest-Fryslân', 3055, 1268),
(4, 4, 'Sleat', 'De Friese Meren', 700, 1426),
(5, 5, 'Starum', 'Súdwest-Fryslân', 960, 1061),
(6, 6, 'Hylpen', 'Súdwest-Fryslân', 870, 1225),
(7, 7, 'Warkum', 'Súdwest-Fryslân', 4440, 1399),
(8, 8, 'Boalsert', 'Súdwest-Fryslân', 10120, 1455),
(9, 9, 'Harns', 'Harlingen', 14740, 1234),
(10, 10, 'Frjentsjer', 'Waadhoeke', 12760, 1374),
(11, 11, 'Dokkum', 'Noardeast-Fryslân', 12675, 1298);