DataFrame

class bach.DataFrame​

(engine, base_node, index, series, group_by, order_by, savepoints, variables=None)

​[source]

A mutable DataFrame representing tabular data in a database and enabling operations on that data.

A Bach DataFrame object can be used to process large amounts of data on a database, while using an api that is based on the pandas api. This allows the database to group and aggregate data, sample data and do other operations that are not suitable for in memory processing. At any time it is possible to write your Bach DataFrame to a pandas DataFrame.

Usage

It should generally not be required to construct DataFrame instances manually. A DataFrame can be constructed using the any of the bach classmethods like from_table(), from_model(), or from_pandas(). The returned DataFrame can be thought of as a dict-like container for Bach Series objects.

Getting & Setting columns

Getting data works similar to pandas DataFrame. Single columns can be retrieved with df['column_name'] as well as df.column_name. This will return a single Bach Series. Multiple columns can be retrieved by passing a list of column names like: df[['column_name','other_column_name']]. This returns a Bach DataFrame.

A selection of rows can be selected with python slicing. I.e. df[2:5] returns row 2 to 5. Only positive integers are currently accepted in slices.

SeriesBoolean can also be used to filter DataFrames, and these Series are easily created using comparison operations like equals (==), less-than (<), not(~) on two series, or series with values: boolean_series = a == b. Boolean indexing can be done like df[df.column == 5]. Only rows are returned for which the condition is true.

Label-based selection is also supported by using the .loc attribute. Each label is interpreted as a value contained by the index column. Unlike Pandas, if the label is not found, no exception will be raised. I.e. df.loc['a'] returns rows where the index series is equal to a. Slicing can also be performed (the dataframe must be sorted). I.e. df['a':'d', 'col1:col3']. This will return all rows/columns included in the slicing, where the start and stop are inclusive. For more information about label-based selection, please take a look to loc.

Moving Series around

Values, Series or DataFrames can be set to another DataFrame. Setting Series or DataFrames to another DataFrame is possible if they share the same base node or index dtype. DataFrames and Series share the same base node if they originate from the same data source. In most cases this means that the series that is to be set to the DataFrame is a result of operations on the DataFrame that is started with. If a Series or DataFrame do not share the same base node, the new column is or columns are set using a merge on the index. This works for one level indexes where the dtype of the series is the same as the DataFrame’s index dtype.

Examples

df['a'] = df.column_name + 5
df['b'] = ''

Database access

The data of this DataFrame is always held in the database and operations on the data are performed by the database, not in local memory. Data will only be transferred to local memory when an explicit call is made to one of the functions that transfers data:

Other functions will not transfer data, nor will they trigger any operations to run on the database. Operations on the DataFrame are combined and translated to a single SQL query, which is executed only when one of the above mentioned data-transfer functions is called.

The API of this DataFrame is partially compatible with Pandas DataFrames. For more on Pandas DataFrames see https://pandas.pydata.org/docs/reference/frame.html

Reference by function​

Creation​

DataFrame.from_table(engine, table_name, index)Instantiate a new DataFrame based on the content of an existing table in the database.
DataFrame.from_model(engine, model, index[, ...])Instantiate a new DataFrame based on the result of the query defined in model.
DataFrame.from_pandas(engine, df[, ...])Instantiate a new DataFrame based on the content of a Pandas DataFrame.
DataFrame.copy()Return a copy of this DataFrame.

Value accessors​

DataFrame.head([n])Similar to to_pandas() but only returns the first n rows.
DataFrame.to_pandas([limit])Run a SQL query representing the current state of this DataFrame against the database and return the resulting data as a Pandas DataFrame.
DataFrame.locThe .loc accessor offers different methods for label-based selection.

Attributes and underlying data​

Axes​

DataFrame.indexGet the index dictionary {name: Series}
DataFrame.dataGet the data dictionary {name: Series}
DataFrame.all_seriesGet all index and data Series in a dictionary {name: Series}
DataFrame.index_columnsGet all the index columns' names in a List
DataFrame.data_columnsGet all the data Series' names in a List
DataFrame.columnsGet all the data Series' names in a List
DataFrame.group_byGet this DataFrame's grouping, if any.
DataFrame.order_byGet the current sort order, if any.

Types​

DataFrame.dtypesGet the data Series' dtypes in a dictionary {name: dtype}
DataFrame.index_dtypesGet the index Series' dtypes in a dictionary {name: dtype}
DataFrame.astype(dtype)Cast all or some of the data columns to a certain dtype.

Sql Model​

DataFrame.materialize([node_name, inplace, ...])Create a copy of this DataFrame with as base_node the current DataFrame's state.
DataFrame.get_sample(table_name[, filter, ...])Returns a DataFrame whose data is a sample of the current DataFrame object.
DataFrame.get_unsampled()Return a copy of the current sampled DataFrame, that undoes calling get_sample() earlier.
DataFrame.view_sql([limit])Translate the current state of this DataFrame into a SQL query.

Variables​

DataFrame.create_variable(name, value, *[, ...])Create a Series object that can be used as a variable, within the returned DataFrame.
DataFrame.set_variable(name, value, *[, dtype])Return a copy of this DataFrame with the variable value updated.
DataFrame.get_all_variable_usage()Get all variables that influence the values of this DataFrame.

Reshaping, indexing, sorting & merging​

DataFrame.sort_index([level, ascending])Sort dataframe by index levels.
DataFrame.sort_values(by[, ascending])Create a new DataFrame with the specified sorting order.
DataFrame.rename([mapper, index, columns, ...])Rename columns.
DataFrame.drop([labels, index, columns, ...])Drop columns from the DataFrame
DataFrame.drop_duplicates([subset, keep, ...])Return a dataframe with duplicated rows removed based on all series labels or a subset of labels.
DataFrame.dropna(*[, axis, how, thresh, subset])Removes rows with missing values (NaN, None and SQL NULL).
DataFrame.reset_index([level, drop])Drops the current index.
DataFrame.set_index(keys[, drop, append])Set this dataframe's index to the the index given in keys
DataFrame.merge(right[, how, on, left_on, ...])Join the right Dataframe or Series on self.
DataFrame.append(other[, ignore_index, sort])Append rows of other dataframes to the the caller dataframe.
DataFrame.fillna(*[, value, method, axis, ...])Fill any NULL value using a method or with a given value.
DataFrame.ffill([sort_by, ascending, window])Fill missing values by propagating the last non-nullable value in each series.
DataFrame.bfill([sort_by, ascending, window])Fill missing values by using the next non-nullable value in each series.
DataFrame.stack([dropna])Stacks all data_columns into a single index series.
DataFrame.unstack([level, fill_value, ...])Pivot a level of the index labels.
DataFrame.scale([with_mean, with_std])Standardizes all numeric series based on mean and population standard deviation.
DataFrame.minmax_scale([feature_range])Scales all numeric series based on a given range.
DataFrame.get_dummies([prefix, prefix_sep, ...])Convert each unique category/value from a string series into a dummy/indicator variable.

Aggregation & windowing​

DataFrame.agg(func[, axis, numeric_only])Aggregate using one or more operations over the specified axis.
DataFrame.aggregate(func[, axis, numeric_only])Alias for agg()
DataFrame.groupby([by])Group by any of the series currently in this DataDrame, both from index as well as data.
DataFrame.rollup(by)Group by and roll up over the column(s) by, replacing any current grouping.
DataFrame.cube(by)Group by and cube over the column(s) by.
DataFrame.window(**frame_args)Create a window on the current dataframe grouping and its sorting.
DataFrame.rolling(window[, min_periods, ...])A rolling window of size 'window', by default right aligned.
DataFrame.expanding([min_periods, center])Create an expanding window starting with the first row in the group, with at least min_period observations.

Computations & descriptive stats​

All types​

DataFrame.describe([percentiles, include, ...])Returns descriptive statistics.
DataFrame.count([axis, level, numeric_only])Count all non-NULL values in each column.
DataFrame.min([axis, skipna, level, ...])Returns the minimum of all values in each column.
DataFrame.max([axis, skipna, level, ...])Returns the maximum of all values in each column.
DataFrame.median([axis, skipna, level, ...])Returns the median of all values in each column.
DataFrame.mode([axis, skipna, level, ...])Returns the mode of all values in each column.
DataFrame.nunique([axis, skipna])Returns the number of unique values in each column.
DataFrame.value_counts([subset, normalize, ...])Returns a series containing counts of each unique row in the DataFrame

Numeric​

DataFrame.mean([axis, skipna, level, ...])Returns the mean of all values in each column.
DataFrame.quantile([q, axis])Returns the quantile per numeric/timedelta column.
DataFrame.sem([axis, skipna, level, ddof, ...])Returns the unbiased standard error of the mean of each column.
DataFrame.sum([axis, skipna, level, ...])Returns the sum of all values in each column.
DataFrame.std([axis, skipna, level, ddof, ...])Returns the sample standard deviation of each column.
DataFrame.var([axis, skipna, level, ddof, ...])Returns the unbiased variance of each column.

Visualization​

DataFrame.plotThe .plot accessor offers different methods for data visualization.