DataFrame

class bach.DataFrame

(engine, base_node, index, series, group_by, order_by, savepoints, variables=None)

A mutable DataFrame representing tabular data in a database and enabling operations on that data.

A Bach DataFrame object can be used to process large amounts of data on a database, while using an api that is based on the pandas api. This allows the database to group and aggregate data, sample data and do other operations that are not suitable for in memory processing. At any time it is possible to write your Bach DataFrame to a pandas DataFrame.

Usage

It should generally not be required to construct DataFrame instances manually. A DataFrame can be constructed using the any of the bach classmethods like from_table(), from_model(), or from_pandas(). The returned DataFrame can be thought of as a dict-like container for Bach Series objects.

Getting & Setting columns

Getting data works similar to pandas DataFrame. Single columns can be retrieved with df['column_name'] as well as df.column_name. This will return a single Bach Series. Multiple columns can be retrieved by passing a list of column names like: df[['column_name','other_column_name']]. This returns a Bach DataFrame.

A selection of rows can be selected with python slicing. I.e. df[2:5] returns row 2 to 5. Only positive integers are currently accepted in slices.

SeriesBoolean can also be used to filter DataFrames, and these Series are easily created using comparison operations like equals (==), less-than (<), not(~) on two series, or series with values: boolean_series = a == b. Boolean indexing can be done like df[df.column == 5]. Only rows are returned for which the condition is true.

Label-based selection is also supported by using the .loc attribute. Each label is interpreted as a value contained by the index column. Unlike Pandas, if the label is not found, no exception will be raised. I.e. df.loc['a'] returns rows where the index series is equal to a. Slicing can also be performed (the dataframe must be sorted). I.e. df['a':'d', 'col1:col3']. This will return all rows/columns included in the slicing, where the start and stop are inclusive. For more information about label-based selection, please take a look to loc.

Moving Series around

Values, Series or DataFrames can be set to another DataFrame. Setting Series or DataFrames to another DataFrame is possible if they share the same base node or index dtype. DataFrames and Series share the same base node if they originate from the same data source. In most cases this means that the series that is to be set to the DataFrame is a result of operations on the DataFrame that is started with. If a Series or DataFrame do not share the same base node, the new column is or columns are set using a merge on the index. This works for one level indexes where the dtype of the series is the same as the DataFrame’s index dtype.

Examples

df['a'] = df.column_name + 5
df['b'] = ''

Database access

The data of this DataFrame is always held in the database and operations on the data are performed by the database, not in local memory. Data will only be transferred to local memory when an explicit call is made to one of the functions that transfers data:

head()
to_pandas()
to_numpy()
get_sample()
The property accessors Series.value (Series only), values

Other functions will not transfer data, nor will they trigger any operations to run on the database. Operations on the DataFrame are combined and translated to a single SQL query, which is executed only when one of the above mentioned data-transfer functions is called.

The API of this DataFrame is partially compatible with Pandas DataFrames. For more on Pandas DataFrames see https://pandas.pydata.org/docs/reference/frame.html

Reference by function

Creation


`DataFrame.from_table`(engine, table_name, index)	Instantiate a new DataFrame based on the content of an existing table in the database.
`DataFrame.from_model`(engine, model, index[, ...])	Instantiate a new DataFrame based on the result of the query defined in `model`.
`DataFrame.from_pandas`(engine, df[, ...])	Instantiate a new DataFrame based on the content of a Pandas DataFrame.
`DataFrame.copy`()	Return a copy of this DataFrame.

Value accessors


`DataFrame.head`([n])	Similar to `to_pandas()` but only returns the first `n` rows.
`DataFrame.to_pandas`([limit])	Run a SQL query representing the current state of this DataFrame against the database and return the resulting data as a Pandas DataFrame.
`DataFrame.loc`	The `.loc` accessor offers different methods for label-based selection.

Attributes and underlying data

Axes


`DataFrame.index`	Get the index dictionary `{name: Series}`
`DataFrame.data`	Get the data dictionary `{name: Series}`
`DataFrame.all_series`	Get all index and data Series in a dictionary `{name: Series}`
`DataFrame.index_columns`	Get all the index columns' names in a List
`DataFrame.data_columns`	Get all the data Series' names in a List
`DataFrame.columns`	Get all the data Series' names in a List
`DataFrame.group_by`	Get this DataFrame's grouping, if any.
`DataFrame.order_by`	Get the current sort order, if any.

Types


`DataFrame.dtypes`	Get the data Series' dtypes in a dictionary `{name: dtype}`
`DataFrame.index_dtypes`	Get the index Series' dtypes in a dictionary `{name: dtype}`
`DataFrame.astype`(dtype)	Cast all or some of the data columns to a certain dtype.

Sql Model


`DataFrame.materialize`([node_name, inplace, ...])	Create a copy of this DataFrame with as base_node the current DataFrame's state.
`DataFrame.get_sample`(table_name[, filter, ...])	Returns a DataFrame whose data is a sample of the current DataFrame object.
`DataFrame.get_unsampled`()	Return a copy of the current sampled DataFrame, that undoes calling `get_sample()` earlier.
`DataFrame.view_sql`([limit])	Translate the current state of this DataFrame into a SQL query.

Variables


`DataFrame.create_variable`(name, value, *[, ...])	Create a Series object that can be used as a variable, within the returned DataFrame.
`DataFrame.set_variable`(name, value, *[, dtype])	Return a copy of this DataFrame with the variable value updated.
`DataFrame.get_all_variable_usage`()	Get all variables that influence the values of this DataFrame.

Reshaping, indexing, sorting & merging


`DataFrame.sort_index`([level, ascending])	Sort dataframe by index levels.
`DataFrame.sort_values`(by[, ascending])	Create a new DataFrame with the specified sorting order.
`DataFrame.rename`([mapper, index, columns, ...])	Rename columns.
`DataFrame.drop`([labels, index, columns, ...])	Drop columns from the DataFrame
`DataFrame.drop_duplicates`([subset, keep, ...])	Return a dataframe with duplicated rows removed based on all series labels or a subset of labels.
`DataFrame.dropna`(*[, axis, how, thresh, subset])	Removes rows with missing values (NaN, None and SQL NULL).
`DataFrame.reset_index`([level, drop])	Drops the current index.
`DataFrame.set_index`(keys[, drop, append])	Set this dataframe's index to the the index given in keys
`DataFrame.merge`(right[, how, on, left_on, ...])	Join the right Dataframe or Series on self.
`DataFrame.append`(other[, ignore_index, sort])	Append rows of other dataframes to the the caller dataframe.
`DataFrame.fillna`(*[, value, method, axis, ...])	Fill any NULL value using a method or with a given value.
`DataFrame.ffill`([sort_by, ascending, window])	Fill missing values by propagating the last non-nullable value in each series.
`DataFrame.bfill`([sort_by, ascending, window])	Fill missing values by using the next non-nullable value in each series.
`DataFrame.stack`([dropna])	Stacks all data_columns into a single index series.
`DataFrame.unstack`([level, fill_value, ...])	Pivot a level of the index labels.
`DataFrame.scale`([with_mean, with_std])	Standardizes all numeric series based on mean and population standard deviation.
`DataFrame.minmax_scale`([feature_range])	Scales all numeric series based on a given range.
`DataFrame.get_dummies`([prefix, prefix_sep, ...])	Convert each unique category/value from a string series into a dummy/indicator variable.

Aggregation & windowing


`DataFrame.agg`(func[, axis, numeric_only])	Aggregate using one or more operations over the specified axis.
`DataFrame.aggregate`(func[, axis, numeric_only])	Alias for `agg()`
`DataFrame.groupby`([by])	Group by any of the series currently in this DataDrame, both from index as well as data.
`DataFrame.rollup`(by)	Group by and roll up over the column(s) `by`, replacing any current grouping.
`DataFrame.cube`(by)	Group by and cube over the column(s) `by`.
`DataFrame.window`(**frame_args)	Create a window on the current dataframe grouping and its sorting.
`DataFrame.rolling`(window[, min_periods, ...])	A rolling window of size 'window', by default right aligned.
`DataFrame.expanding`([min_periods, center])	Create an expanding window starting with the first row in the group, with at least `min_period` observations.

Computations & descriptive stats

All types


`DataFrame.describe`([percentiles, include, ...])	Returns descriptive statistics.
`DataFrame.count`([axis, level, numeric_only])	Count all non-NULL values in each column.
`DataFrame.min`([axis, skipna, level, ...])	Returns the minimum of all values in each column.
`DataFrame.max`([axis, skipna, level, ...])	Returns the maximum of all values in each column.
`DataFrame.median`([axis, skipna, level, ...])	Returns the median of all values in each column.
`DataFrame.mode`([axis, skipna, level, ...])	Returns the mode of all values in each column.
`DataFrame.nunique`([axis, skipna])	Returns the number of unique values in each column.
`DataFrame.value_counts`([subset, normalize, ...])	Returns a series containing counts of each unique row in the DataFrame

Numeric


`DataFrame.mean`([axis, skipna, level, ...])	Returns the mean of all values in each column.
`DataFrame.quantile`([q, axis])	Returns the quantile per numeric/timedelta column.
`DataFrame.sem`([axis, skipna, level, ddof, ...])	Returns the unbiased standard error of the mean of each column.
`DataFrame.sum`([axis, skipna, level, ...])	Returns the sum of all values in each column.
`DataFrame.std`([axis, skipna, level, ddof, ...])	Returns the sample standard deviation of each column.
`DataFrame.var`([axis, skipna, level, ddof, ...])	Returns the unbiased variance of each column.

Visualization


`DataFrame.plot`	The `.plot` accessor offers different methods for data visualization.

DataFrame

class bach.DataFrame​

Reference by function​

Creation​

Value accessors​

Attributes and underlying data​

Axes​

Types​

Sql Model​

Variables​

Reshaping, indexing, sorting & merging​

Aggregation & windowing​

Computations & descriptive stats​

All types​

Numeric​

Visualization​