bach.DataFrame

class bach.DataFrame

(engine, base_node, index, series, group_by, order_by, savepoints, variables=None)

[source]

A mutable DataFrame representing tabular data in a database and enabling operations on that data.

A Bach DataFrame object can be used to process large amounts of data on a database, while using an api that is based on the pandas api. This allows the database to group and aggregate data, sample data and do other operations that are not suitable for in memory processing. At any time it is possible to write your Bach DataFrame to a pandas DataFrame.

Usage

It should generally not be required to construct DataFrame instances manually. A DataFrame can be constructed using the any of the bach classmethods like from_table(), from_model(), or from_pandas(). The returned DataFrame can be thought of as a dict-like container for Bach Series objects.

Getting & Setting columns

Getting data works similar to pandas DataFrame. Single columns can be retrieved with df['column_name'] as well as df.column_name. This will return a single Bach Series. Multiple columns can be retrieved by passing a list of column names like: df[['column_name','other_column_name']]. This returns a Bach DataFrame.

A selection of rows can be selected with python slicing. I.e. df[2:5] returns row 2 to 5. Only positive integers are currently accepted in slices.

SeriesBoolean can also be used to filter DataFrames, and these Series are easily created using comparison operations like equals (==), less-than (<), not(~) on two series, or series with values: boolean_series = a == b. Boolean indexing can be done like df[df.column == 5]. Only rows are returned for which the condition is true.

Label-based selection is also supported by using the .loc attribute. Each label is interpreted as a value contained by the index column. Unlike Pandas, if the label is not found, no exception will be raised. I.e. df.loc['a'] returns rows where the index series is equal to a. Slicing can also be performed (the dataframe must be sorted). I.e. df['a':'d', 'col1:col3']. This will return all rows/columns included in the slicing, where the start and stop are inclusive. For more information about label-based selection, please take a look to loc.

Moving Series around

Values, Series or DataFrames can be set to another DataFrame. Setting Series or DataFrames to another DataFrame is possible if they share the same base node or index dtype. DataFrames and Series share the same base node if they originate from the same data source. In most cases this means that the series that is to be set to the DataFrame is a result of operations on the DataFrame that is started with. If a Series or DataFrame do not share the same base node, the new column is or columns are set using a merge on the index. This works for one level indexes where the dtype of the series is the same as the DataFrame’s index dtype.

Examples

df['a'] = df.column_name + 5
df['b'] = ''

Database access

The data of this DataFrame is always held in the database and operations on the data are performed by the database, not in local memory. Data will only be transferred to local memory when an explicit call is made to one of the functions that transfers data:

Other functions will not transfer data, nor will they trigger any operations to run on the database. Operations on the DataFrame are combined and translated to a single SQL query, which is executed only when one of the above mentioned data-transfer functions is called.

The API of this DataFrame is partially compatible with Pandas DataFrames. For more on Pandas DataFrames see https://pandas.pydata.org/docs/reference/frame.html

Methods

agg(func[, axis, numeric_only])Aggregate using one or more operations over the specified axis.
aggregate(func[, axis, numeric_only])Alias for agg()
append(other[, ignore_index, sort])Append rows of other dataframes to the the caller dataframe.
astype(dtype)Cast all or some of the data columns to a certain dtype.
bfill([sort_by, ascending])Fill missing values by using the next non-nullable value in each series.
copy()Return a copy of this DataFrame.
count([axis, level, numeric_only])Count all non-NULL values in each column.
create_variable(name, value, *[, dtype])Create a Series object that can be used as a variable, within the returned DataFrame.
cube(by)Group by and cube over the column(s) by.
describe([percentiles, include, exclude, ...])Returns descriptive statistics.
drop([labels, index, columns, level, errors])Drop columns from the DataFrame
drop_duplicates([subset, keep, ...])Return a dataframe with duplicated rows removed based on all series labels or a subset of labels.
dropna(*[, axis, how, thresh, subset])Removes rows with missing values (NaN, None and SQL NULL).
expanding([min_periods, center])Create an expanding window starting with the first row in the group, with at least min_period observations.
ffill([sort_by, ascending])Fill missing values by propagating the last non-nullable value in each series.
fillna(*[, value, method, axis, sort_by, ...])Fill any NULL value using a method or with a given value.
from_model(engine, model, index[, all_dtypes])Instantiate a new DataFrame based on the result of the query defined in model.
from_pandas(engine, df, convert_objects[, ...])Instantiate a new DataFrame based on the content of a Pandas DataFrame.
from_table(engine, table_name, index[, ...])Instantiate a new DataFrame based on the content of an existing table in the database.
get_all_variable_usage()Get all variables that influence the values of this DataFrame.
get_dummies([prefix, prefix_sep, dummy_na, ...])Convert each unique category/value from a string series into a dummy/indicator variable.
get_sample(table_name[, filter, ...])Returns a DataFrame whose data is a sample of the current DataFrame object.
get_unsampled()Return a copy of the current sampled DataFrame, that undoes calling get_sample() earlier.
groupby([by])Group by any of the series currently in this DataDrame, both from index as well as data.
head([n])Similar to to_pandas() but only returns the first n rows.
materialize([node_name, inplace, limit, ...])Create a copy of this DataFrame with as base_node the current DataFrame's state.
max([axis, skipna, level, numeric_only])Returns the maximum of all values in each column.
mean([axis, skipna, level, numeric_only])Returns the mean of all values in each column.
median([axis, skipna, level, numeric_only])Returns the median of all values in each column.
merge(right[, how, on, left_on, right_on, ...])Join the right Dataframe or Series on self.
min([axis, skipna, level, numeric_only])Returns the minimum of all values in each column.
minmax_scale([feature_range])Scales all numeric series based on a given range.
mode([axis, skipna, level, numeric_only])Returns the mode of all values in each column.
nunique([axis, skipna])Returns the number of unique values in each column.
quantile([q, axis])Returns the quantile per numeric/timedelta column.
rename([mapper, index, columns, axis, ...])Rename columns.
reset_index([level, drop])Drops the current index.
rolling(window[, min_periods, center, closed])A rolling window of size 'window', by default right aligned.
rollup(by)Group by and roll up over the column(s) by, replacing any current grouping.
round([decimals])Returns a DataFrame with rounded numerical values
scale([with_mean, with_std])Standardizes all numeric series based on mean and population standard deviation.
sem([axis, skipna, level, ddof, numeric_only])Returns the unbiased standard error of the mean of each column.
set_index(keys[, drop, append])Set this dataframe's index to the the index given in keys
set_savepoint(name[, materialization])Set the current state as a savepoint in self.savepoints.
set_variable(name, value, *[, dtype])Return a copy of this DataFrame with the variable value updated.
sort_index([level, ascending])Sort dataframe by index levels.
sort_values(by[, ascending])Create a new DataFrame with the specified sorting order.
stack([dropna])Stacks all data_columns into a single index series.
std([axis, skipna, level, ddof, numeric_only])Returns the sample standard deviation of each column.
std_pop([axis, skipna, level, ddof, ...])Returns the population standard deviation of each column.
sum([axis, skipna, level, numeric_only, ...])Returns the sum of all values in each column.
to_numpy()Return a Numpy representation of the DataFrame akin pandas.Dataframe.to_numpy
to_pandas([limit])Run a SQL query representing the current state of this DataFrame against the database and return the resulting data as a Pandas DataFrame.
unstack([level, fill_value, aggregation])Pivot a level of the index labels.
value_counts([subset, normalize, sort, ...])Returns a series containing counts of each unique row in the DataFrame
var([axis, skipna, level, ddof, numeric_only])Returns the unbiased variance of each column.
view_sql([limit])Translate the current state of this DataFrame into a SQL query.
window(**frame_args)Create a window on the current dataframe grouping and its sorting.

Attributes

all_seriesGet all index and data Series in a dictionary {name: Series}
dataGet the data dictionary {name: Series}
data_columnsGet all the data Series' names in a List
dtypesGet the data Series' dtypes in a dictionary {name: dtype}
group_byGet this DataFrame's grouping, if any.
indexGet the index dictionary {name: Series}
index_columnsGet all the index columns' names in a List
index_dtypesGet the index Series' dtypes in a dictionary {name: dtype}
is_materializedReturn true if this DataFrame is in a materialized state, i.e. all information about the DataFrame's values is encoded in self.base_node.
locThe .loc accessor offers different methods for label-based selection.
order_byGet the current sort order, if any.
plotThe .plot accessor offers different methods for data visualization.
savepoints
variablesGet all variables for which values are set, which will be used when querying the database.