bach.DataFrame
class bach.DataFrame
(engine, base_node, index, series, group_by, order_by, savepoints, variables=None)A mutable DataFrame representing tabular data in a database and enabling operations on that data.
A Bach DataFrame object can be used to process large amounts of data on a database, while using an api that is based on the pandas api. This allows the database to group and aggregate data, sample data and do other operations that are not suitable for in memory processing. At any time it is possible to write your Bach DataFrame to a pandas DataFrame.
Usage
It should generally not be required to construct DataFrame instances manually. A DataFrame can be
constructed using the any of the bach classmethods like from_table()
, from_model()
, or
from_pandas()
. The returned DataFrame can be thought of as a dict-like container for Bach
Series objects.
Getting & Setting columns
Getting data works similar to pandas DataFrame. Single columns can be retrieved with df['column_name']
as well as df.column_name
. This will return a single Bach Series. Multiple columns can be retrieved by
passing a list of column names like: df[['column_name','other_column_name']]
. This returns a Bach
DataFrame.
A selection of rows can be selected with python slicing. I.e. df[2:5]
returns row 2 to 5. Only
positive integers are currently accepted in slices.
SeriesBoolean can also be used to filter DataFrames, and these Series are easily created using comparison
operations like equals (==
), less-than (<
), not(~
) on two series, or series with values:
boolean_series = a == b
. Boolean indexing can be done like df[df.column == 5]
. Only rows are
returned for which the condition is true.
Label-based selection is also supported by using the .loc
attribute. Each label is interpreted
as a value contained by the index column. Unlike Pandas, if the label is not found,
no exception will be raised. I.e. df.loc['a']
returns rows where the index series is equal to a
.
Slicing can also be performed (the dataframe must be sorted). I.e. df['a':'d', 'col1:col3']
. This will
return all rows/columns included in the slicing, where the start and stop are inclusive.
For more information about label-based selection, please take a look to loc
.
Moving Series around
Values, Series or DataFrames can be set to another DataFrame. Setting Series or DataFrames to another DataFrame is possible if they share the same base node or index dtype. DataFrames and Series share the same base node if they originate from the same data source. In most cases this means that the series that is to be set to the DataFrame is a result of operations on the DataFrame that is started with. If a Series or DataFrame do not share the same base node, the new column is or columns are set using a merge on the index. This works for one level indexes where the dtype of the series is the same as the DataFrame’s index dtype.
Examples
df['a'] = df.column_name + 5
df['b'] = ''
Database access
The data of this DataFrame is always held in the database and operations on the data are performed by the database, not in local memory. Data will only be transferred to local memory when an explicit call is made to one of the functions that transfers data:
head()
to_pandas()
to_numpy()
get_sample()
- The property accessors
Series.value
(Series only),values
Other functions will not transfer data, nor will they trigger any operations to run on the database. Operations on the DataFrame are combined and translated to a single SQL query, which is executed only when one of the above mentioned data-transfer functions is called.
The API of this DataFrame is partially compatible with Pandas DataFrames. For more on Pandas DataFrames see https://pandas.pydata.org/docs/reference/frame.html
Methods
agg (func[, axis, numeric_only]) | Aggregate using one or more operations over the specified axis. |
aggregate (func[, axis, numeric_only]) | Alias for agg() |
append (other[, ignore_index, sort]) | Append rows of other dataframes to the the caller dataframe. |
astype (dtype) | Cast all or some of the data columns to a certain dtype. |
bfill ([sort_by, ascending, window]) | Fill missing values by using the next non-nullable value in each series. |
copy () | Return a copy of this DataFrame. |
count ([axis, level, numeric_only]) | Count all non-NULL values in each column. |
create_variable (name, value, *[, dtype]) | Create a Series object that can be used as a variable, within the returned DataFrame. |
cube (by) | Group by and cube over the column(s) by . |
database_create_table (table_name, *[, if_exists]) | Write the current state of the DataFrame to a database table. |
describe ([percentiles, include, exclude, ...]) | Returns descriptive statistics. |
drop ([labels, index, columns, level, errors]) | Drop columns from the DataFrame |
drop_duplicates ([subset, keep, ...]) | Return a dataframe with duplicated rows removed based on all series labels or a subset of labels. |
dropna (*[, axis, how, thresh, subset]) | Removes rows with missing values (NaN, None and SQL NULL). |
expanding ([min_periods, center]) | Create an expanding window starting with the first row in the group, with at least min_period observations. |
ffill ([sort_by, ascending, window]) | Fill missing values by propagating the last non-nullable value in each series. |
fillna (*[, value, method, axis, sort_by, ...]) | Fill any NULL value using a method or with a given value. |
from_model (engine, model, index[, ...]) | Instantiate a new DataFrame based on the result of the query defined in model . |
from_pandas (engine, df[, convert_objects, ...]) | Instantiate a new DataFrame based on the content of a Pandas DataFrame. |
from_table (engine, table_name, index[, ...]) | Instantiate a new DataFrame based on the content of an existing table in the database. |
get_all_variable_usage () | Get all variables that influence the values of this DataFrame. |
get_dummies ([prefix, prefix_sep, dummy_na, ...]) | Convert each unique category/value from a string series into a dummy/indicator variable. |
get_sample (table_name[, filter, ...]) | Returns a DataFrame whose data is a sample of the current DataFrame object. |
get_unsampled () | Return a copy of the current sampled DataFrame, that undoes calling get_sample() earlier. |
groupby ([by]) | Group by any of the series currently in this DataDrame, both from index as well as data. |
head ([n]) | Similar to to_pandas() but only returns the first n rows. |
materialize ([node_name, inplace, limit, ...]) | Create a copy of this DataFrame with as base_node the current DataFrame's state. |
max ([axis, skipna, level, numeric_only]) | Returns the maximum of all values in each column. |
mean ([axis, skipna, level, numeric_only]) | Returns the mean of all values in each column. |
median ([axis, skipna, level, numeric_only]) | Returns the median of all values in each column. |
merge (right[, how, on, left_on, right_on, ...]) | Join the right Dataframe or Series on self. |
min ([axis, skipna, level, numeric_only]) | Returns the minimum of all values in each column. |
minmax_scale ([feature_range]) | Scales all numeric series based on a given range. |
mode ([axis, skipna, level, numeric_only]) | Returns the mode of all values in each column. |
nunique ([axis, skipna]) | Returns the number of unique values in each column. |
quantile ([q, axis]) | Returns the quantile per numeric/timedelta column. |
rename ([mapper, index, columns, axis, ...]) | Rename columns. |
reset_index ([level, drop]) | Drops the current index. |
rolling (window[, min_periods, center, closed]) | A rolling window of size 'window', by default right aligned. |
rollup (by) | Group by and roll up over the column(s) by , replacing any current grouping. |
round ([decimals]) | Returns a DataFrame with rounded numerical values |
scale ([with_mean, with_std]) | Standardizes all numeric series based on mean and population standard deviation. |
sem ([axis, skipna, level, ddof, numeric_only]) | Returns the unbiased standard error of the mean of each column. |
set_index (keys[, drop, append]) | Set this dataframe's index to the the index given in keys |
set_savepoint (name[, materialization]) | Set the current state as a savepoint in self.savepoints . |
set_variable (name, value, *[, dtype]) | Return a copy of this DataFrame with the variable value updated. |
sort_index ([level, ascending]) | Sort dataframe by index levels. |
sort_values (by[, ascending]) | Create a new DataFrame with the specified sorting order. |
stack ([dropna]) | Stacks all data_columns into a single index series. |
std ([axis, skipna, level, ddof, numeric_only]) | Returns the sample standard deviation of each column. |
std_pop ([axis, skipna, level, ddof, ...]) | Returns the population standard deviation of each column. |
sum ([axis, skipna, level, numeric_only, ...]) | Returns the sum of all values in each column. |
to_numpy () | Return a Numpy representation of the DataFrame akin pandas.Dataframe.to_numpy |
to_pandas ([limit]) | Run a SQL query representing the current state of this DataFrame against the database and return the resulting data as a Pandas DataFrame. |
unstack ([level, fill_value, aggregation]) | Pivot a level of the index labels. |
value_counts ([subset, normalize, sort, ...]) | Returns a series containing counts of each unique row in the DataFrame |
var ([axis, skipna, level, ddof, numeric_only]) | Returns the unbiased variance of each column. |
view_sql ([limit]) | Translate the current state of this DataFrame into a SQL query. |
window (**frame_args) | Create a window on the current dataframe grouping and its sorting. |
Attributes
all_series | Get all index and data Series in a dictionary {name: Series} |
columns | Get all the data Series' names in a List |
data | Get the data dictionary {name: Series} |
data_columns | Get all the data Series' names in a List |
dtypes | Get the data Series' dtypes in a dictionary {name: dtype} |
group_by | Get this DataFrame's grouping, if any. |
index | Get the index dictionary {name: Series} |
index_columns | Get all the index columns' names in a List |
index_dtypes | Get the index Series' dtypes in a dictionary {name: dtype} |
is_materialized | Return true if this DataFrame is in a materialized state, i.e. all information about the DataFrame's values is encoded in self.base_node. |
loc | The .loc accessor offers different methods for label-based selection. |
order_by | Get the current sort order, if any. |
plot | The .plot accessor offers different methods for data visualization. |
savepoints | |
variables | Get all variables for which values are set, which will be used when querying the database. |