Examples

Here we’ll give some very basic examples of the usage of Bach: creating a DataFrame, basic operations, aggregate operations, and getting the resulting data filtered and sorted. For this example there is no separate notebook available, but the operations demonstrated here are used in the other example notebooks.

In the examples we’ll assume that the database has a table called ‘example’, with a few specific columns. The SQL to create that table can be found below in Appendix: Example Data.

To get a taste of what you can do with Objectiv Bach, There is a demo available that enables you to run the full Objectiv pipeline on your local machine. It includes our website as a demo app, a Jupyter Notebook environment with working models and a Metabase environment to output data to.

Create a DataFrame from a database table

from bach import from_table
import sqlalchemy
# Setup database connection
engine = sqlalchemy.create_engine(DB_URL)
# Create Bach DataFrame representing all the data in the 'example' table, with the 'city_id' as index
df = from_table(engine, 'example', index=['city_id'])

The above fragment queries the database to get the table structure of the ‘example’ table. But it does not query any of the data in ‘example’, and this thus works equally well for a tiny table as for a huge table.

It is also possible to create a DataFrame from an arbitrary sql query (using from_model) or from an existing pandas DataFrame (using from_pandas).

Basic operations

# Adding a new column
df['column_name'] = 1337
# Setting a new column to the value of an existing column
df['another column'] = df['city']
# Add a column 'century' with the result of some arithmetic
df['century'] = df['founding'] // 100 + 1
# Add a column 'concat', with the result of concatenating strings
df['concat'] = df['city'] + ' is located in ' + df['municipality']
# remove the city column
df = df.drop(columns=['city'])
# rename the 'another column' column to 'city'
df = df.rename(columns={'another column': 'city'})

# Convert the Bach DataFrame to a pandas DataFrame.
# When executing in a notebook this will print the dataframe.
df.to_pandas()

The above operations add/remove/rename some columns of the DataFrame. However no actual query is executed on the Database, until df.to_pandas() is called. The DataFrame operations merely change the symbolic representation of the data in the DataFrame and its Series.

The call to df.to_pandas() here is merely for demonstration purposes, in situations with bigger data sets it should be avoided until the data is needed as it will query the database and transfer all data.

Aggregate operations

# Group on century, select the 'inhabitants' column, and calculate the maximum value within the group
df_max = df.groupby('century')[['inhabitants']].max()
df_max = df_max.reset_index()
# df_max has two columns: 'century' and 'inhabitants_max'
# merge df_max back into df, the merge will be done on the 'century' column as that is in both df and df_max
df = df.merge(df_max)

# Alternative method: use a window function
century_window = df.window('century')
df['inhabitants_max_2'] = df['inhabitants'].max(century_window)

# Convert the Bach DataFrame to a pandas DataFrame.
# When executing in a notebook this will print the dataframe.
df.to_pandas()

The above example demonstrates how we can calculate aggregate functions (in this case max()) on a group of row within a window that contains rows. Additionally it shows how to merge two DataFrames. Again only the optional debug statement df.to_pandas() runs a query, the other operations merely update the internal state of the DataFrame and its Series.

Filtering, sorting, and output

# Only keep the rows for which inhabitants == inhabitants_max,
# i.e. the cities that are the biggest of all cities founded in the same century
df = df[df.inhabitants == df.inhabitants_max]
# Sort by century
df = df.sort_values('century')
# Only keep selected columns
df = df[['skating_order', 'municipality', 'inhabitants', 'founding', 'city']]

# Query database.
print(df.to_pandas())
# Expected output:
# skating_order municipality inhabitants founding city
# city_id
# 5 5 Súdwest-Fryslân 960 1061 Starum
# 1 1 Leeuwarden 93485 1285 Ljouwert
# 10 10 Waadhoeke 12760 1374 Frjentsjer
# 2 2 Súdwest-Fryslân 33520 1456 Snits

# Show the SQL query used to generate the above output:
print(df.view_sql())

The above example demonstrates filtering out rows and sorting a DataFrame. Without the sort_values() the order of the returned rows is non-deterministic. view_sql() can be used to show the compiled SQL query that encompasses all operations done so far.

Filtering by Index Labels

important

In the following examples we call multiple times to_pandas method, we do it only for visualization purposes. Please use only df.to_pandas when necessary, as this will execute the frame’s current query.

Here we construct a simple dataframe for illustrating the label selection functionality:

In [1]: import pandas

In [2]: data = {
...: 'skating_order': [1, 2, 3, 4, 5],
...: 'city': ['Ljouwert', 'Snits', 'Drylts', 'Sleat', 'Starum'],
...: 'municipality': ['Leeuwarden', 'Súdwest-Fryslân', 'Súdwest-Fryslân', 'De Friese Meren', 'Súdwest-Fryslân'],
...: 'inhabitants': [1285, 1456, 1268, 1426, 1061],
...: }
...:

In [3]: pdf = pandas.DataFrame(data)

In [4]: df = DataFrame.from_pandas(engine=engine, df=pdf, convert_objects=True)

In [5]: df = df.set_index('city')

In [6]: df.to_pandas()
Out[6]:
skating_order municipality inhabitants
city
Ljouwert 1 Leeuwarden 1285
Snits 2 Súdwest-Fryslân 1456
Drylts 3 Súdwest-Fryslân 1268
Sleat 4 De Friese Meren 1426
Starum 5 Súdwest-Fryslân 1061

If you want to select a specific row from the frame, you can simply pass the label to the loc property:

In [7]: df.loc['Drylts'].to_pandas()
Out[7]:
__stacked_index
skating_order 3
municipality Súdwest-Fryslân
inhabitants 1268
Name: __stacked, dtype: object

You can observe that the previous result returned a Bach Series, this will always be the return-type for Single-Label Selection, where all selected columns are stacked into one single index series.

If you require just a subset of the columns, you can pass them as a second parameter:

In [8]: df.loc['Drylts', ['inhabitants', 'skating_order']].to_pandas()
Out[8]:
__stacked_index
inhabitants 1268
skating_order 3
Name: __stacked, dtype: int64

This will still return a series, the main difference is that only inhabitants and skating_order are considered in the stacked index.

note

If a column included in the selection does not exist in the frame, this will raise an error.

In case you want to select/filter rows based on multiple labels, you can pass a list of labels:

In [9]: df.loc[['Drylts', 'Ljouwert'], 'municipality'].to_pandas()
Out[9]:
municipality
city
Ljouwert Leeuwarden
Drylts Súdwest-Fryslân

The code from above will return a Bach DataFrame instead, this will always be the return-type for List-Label Selection.

Slicing-Selection by labels is also possible, for both rows and columns. Note the next example:

warning

Before slicing a frame’s rows, you must sort it first. As slicing is a non-deterministic operation.

In [10]: df.sort_index().loc['Sleat':, 'municipality':].to_pandas()
Out[10]:
municipality inhabitants
city
Sleat De Friese Meren 1426
Snits Súdwest-Fryslân 1456
Starum Súdwest-Fryslân 1061

In case you need to select a value based on a condition, a series boolean can also be passed to loc property.

In [11]: df.loc[df['inhabitants'] > 1300].to_pandas()
Out[11]:
skating_order municipality inhabitants
city
Snits 2 Súdwest-Fryslân 1456
Sleat 4 De Friese Meren 1426

In previous examples, we selected rows by labels that actually exist in the frame. In case a label doesn’t exist in the frame, this will not raise any error since Bach has no notion of which values exist in the frame.

In [12]: df.loc['x'].to_pandas()
Out[12]: Series([], Name: __stacked, dtype: object)

Setting Values to DataFrame Subset

In previous section we played around a bit with loc property, by just filtering the frame using labels. As in pandas, you are also able to set values to a specific group of rows and update the main frame. This works for all types of selections.

In [13]: df.loc['Drylts'] = 'x'

In [14]: df.to_pandas()
Out[14]:
skating_order municipality inhabitants
city
Ljouwert 1 Leeuwarden 1285
Snits 2 Súdwest-Fryslân 1456
Drylts x x x
Sleat 4 De Friese Meren 1426
Starum 5 Súdwest-Fryslân 1061

You can see that the previous code block, sets all series to ‘x’ where the index is equal to Drylts.

note

If the value being set has a different dtype than the series to be modified, the series’s dtype will be changed to string. This does not apply if both have a numerical dtype.

In [15]: df = df.sort_index()

In [16]: df.loc['Sleat':, 'municipality'] = 'Fryslân'

In [17]: df.to_pandas()
Out[17]:
skating_order municipality inhabitants
city
Drylts x x x
Ljouwert 1 Leeuwarden 1285
Sleat 4 Fryslân 1426
Snits 2 Fryslân 1456
Starum 5 Fryslân 1061

As we mentioned, setting a value is possible for any type of selection. Notice that sorting is also needed for this case.

important

df.sort_index().loc['Sleat':, 'municipality'] = 'Fryslân' will have no effect on df, since df.sort_index() returns a new dataframe.

Appendix: Example Data

CREATE TABLE example (
city_id bigint,
skating_order bigint,
city text,
municipality text,
inhabitants bigint,
founding bigint
);
insert into example(city_id, skating_order, city, municipality, inhabitants, founding) values
(1, 1, 'Ljouwert', 'Leeuwarden', 93485, 1285),
(2, 2, 'Snits', 'Súdwest-Fryslân', 33520, 1456),
(3, 3, 'Drylts', 'Súdwest-Fryslân', 3055, 1268),
(4, 4, 'Sleat', 'De Friese Meren', 700, 1426),
(5, 5, 'Starum', 'Súdwest-Fryslân', 960, 1061),
(6, 6, 'Hylpen', 'Súdwest-Fryslân', 870, 1225),
(7, 7, 'Warkum', 'Súdwest-Fryslân', 4440, 1399),
(8, 8, 'Boalsert', 'Súdwest-Fryslân', 10120, 1455),
(9, 9, 'Harns', 'Harlingen', 14740, 1234),
(10, 10, 'Frjentsjer', 'Waadhoeke', 12760, 1374),
(11, 11, 'Dokkum', 'Noardeast-Fryslân', 12675, 1298);