DataFrame
class bach.DataFrameβ
(engine, base_node, index, series, group_by, order_by, savepoints, variables=None)β[source]
A mutable DataFrame representing tabular data in a database and enabling operations on that data.
A Bach DataFrame object can be used to process large amounts of data on a database, while using an api that is based on the pandas api. This allows the database to group and aggregate data, sample data and do other operations that are not suitable for in memory processing. At any time it is possible to write your Bach DataFrame to a pandas DataFrame.
Usage
It should generally not be required to construct DataFrame instances manually. A DataFrame can be
constructed using the any of the bach classmethods like from_table()
, from_model()
, or
from_pandas()
. The returned DataFrame can be thought of as a dict-like container for Bach
Series objects.
Getting & Setting columns
Getting data works similar to pandas DataFrame. Single columns can be retrieved with df['column_name']
as well as df.column_name
. This will return a single Bach Series. Multiple columns can be retrieved by
passing a list of column names like: df[['column_name','other_column_name']]
. This returns a Bach
DataFrame.
A selection of rows can be selected with python slicing. I.e. df[2:5]
returns row 2 to 5. Only
positive integers are currently accepted in slices.
SeriesBoolean can also be used to filter DataFrames, and these Series are easily created using comparison
operations like equals (==
), less-than (<
), not(~
) on two series, or series with values:
boolean_series = a == b
. Boolean indexing can be done like df[df.column == 5]
. Only rows are
returned for which the condition is true.
Label-based selection is also supported by using the .loc
attribute. Each label is interpreted
as a value contained by the index column. Unlike Pandas, if the label is not found,
no exception will be raised. I.e. df.loc['a']
returns rows where the index series is equal to a
.
Slicing can also be performed (the dataframe must be sorted). I.e. df['a':'d', 'col1:col3']
. This will
return all rows/columns included in the slicing, where the start and stop are inclusive.
For more information about label-based selection, please take a look to loc
.
Moving Series around
Values, Series or DataFrames can be set to another DataFrame. Setting Series or DataFrames to another DataFrame is possible if they share the same base node or index dtype. DataFrames and Series share the same base node if they originate from the same data source. In most cases this means that the series that is to be set to the DataFrame is a result of operations on the DataFrame that is started with. If a Series or DataFrame do not share the same base node, the new column is or columns are set using a merge on the index. This works for one level indexes where the dtype of the series is the same as the DataFrameβs index dtype.
Examples
df['a'] = df.column_name + 5
df['b'] = ''
Database access
The data of this DataFrame is always held in the database and operations on the data are performed by the database, not in local memory. Data will only be transferred to local memory when an explicit call is made to one of the functions that transfers data:
head()
to_pandas()
to_numpy()
get_sample()
- The property accessors
Series.value
(Series only),values
Other functions will not transfer data, nor will they trigger any operations to run on the database. Operations on the DataFrame are combined and translated to a single SQL query, which is executed only when one of the above mentioned data-transfer functions is called.
The API of this DataFrame is partially compatible with Pandas DataFrames. For more on Pandas DataFrames see https://pandas.pydata.org/docs/reference/frame.html
Reference by functionβ
Creationβ
DataFrame.from_table (engine, table_name, index) | Instantiate a new DataFrame based on the content of an existing table in the database. |
DataFrame.from_model (engine, model, index[, ...]) | Instantiate a new DataFrame based on the result of the query defined in model . |
DataFrame.from_pandas (engine, df[, ...]) | Instantiate a new DataFrame based on the content of a Pandas DataFrame. |
DataFrame.copy () | Return a copy of this DataFrame. |
Value accessorsβ
DataFrame.head ([n]) | Similar to to_pandas() but only returns the first n rows. |
DataFrame.to_pandas ([limit]) | Run a SQL query representing the current state of this DataFrame against the database and return the resulting data as a Pandas DataFrame. |
DataFrame.loc | The .loc accessor offers different methods for label-based selection. |
Attributes and underlying dataβ
Axesβ
DataFrame.index | Get the index dictionary {name: Series} |
DataFrame.data | Get the data dictionary {name: Series} |
DataFrame.all_series | Get all index and data Series in a dictionary {name: Series} |
DataFrame.index_columns | Get all the index columns' names in a List |
DataFrame.data_columns | Get all the data Series' names in a List |
DataFrame.columns | Get all the data Series' names in a List |
DataFrame.group_by | Get this DataFrame's grouping, if any. |
DataFrame.order_by | Get the current sort order, if any. |
Typesβ
DataFrame.dtypes | Get the data Series' dtypes in a dictionary {name: dtype} |
DataFrame.index_dtypes | Get the index Series' dtypes in a dictionary {name: dtype} |
DataFrame.astype (dtype) | Cast all or some of the data columns to a certain dtype. |
Sql Modelβ
DataFrame.materialize ([node_name, inplace, ...]) | Create a copy of this DataFrame with as base_node the current DataFrame's state. |
DataFrame.get_sample (table_name[, filter, ...]) | Returns a DataFrame whose data is a sample of the current DataFrame object. |
DataFrame.get_unsampled () | Return a copy of the current sampled DataFrame, that undoes calling get_sample() earlier. |
DataFrame.view_sql ([limit]) | Translate the current state of this DataFrame into a SQL query. |
Variablesβ
DataFrame.create_variable (name, value, *[, ...]) | Create a Series object that can be used as a variable, within the returned DataFrame. |
DataFrame.set_variable (name, value, *[, dtype]) | Return a copy of this DataFrame with the variable value updated. |
DataFrame.get_all_variable_usage () | Get all variables that influence the values of this DataFrame. |
Reshaping, indexing, sorting & mergingβ
DataFrame.sort_index ([level, ascending]) | Sort dataframe by index levels. |
DataFrame.sort_values (by[, ascending]) | Create a new DataFrame with the specified sorting order. |
DataFrame.rename ([mapper, index, columns, ...]) | Rename columns. |
DataFrame.drop ([labels, index, columns, ...]) | Drop columns from the DataFrame |
DataFrame.drop_duplicates ([subset, keep, ...]) | Return a dataframe with duplicated rows removed based on all series labels or a subset of labels. |
DataFrame.dropna (*[, axis, how, thresh, subset]) | Removes rows with missing values (NaN, None and SQL NULL). |
DataFrame.reset_index ([level, drop]) | Drops the current index. |
DataFrame.set_index (keys[, drop, append]) | Set this dataframe's index to the the index given in keys |
DataFrame.merge (right[, how, on, left_on, ...]) | Join the right Dataframe or Series on self. |
DataFrame.append (other[, ignore_index, sort]) | Append rows of other dataframes to the the caller dataframe. |
DataFrame.fillna (*[, value, method, axis, ...]) | Fill any NULL value using a method or with a given value. |
DataFrame.ffill ([sort_by, ascending, window]) | Fill missing values by propagating the last non-nullable value in each series. |
DataFrame.bfill ([sort_by, ascending, window]) | Fill missing values by using the next non-nullable value in each series. |
DataFrame.stack ([dropna]) | Stacks all data_columns into a single index series. |
DataFrame.unstack ([level, fill_value, ...]) | Pivot a level of the index labels. |
DataFrame.scale ([with_mean, with_std]) | Standardizes all numeric series based on mean and population standard deviation. |
DataFrame.minmax_scale ([feature_range]) | Scales all numeric series based on a given range. |
DataFrame.get_dummies ([prefix, prefix_sep, ...]) | Convert each unique category/value from a string series into a dummy/indicator variable. |
Aggregation & windowingβ
DataFrame.agg (func[, axis, numeric_only]) | Aggregate using one or more operations over the specified axis. |
DataFrame.aggregate (func[, axis, numeric_only]) | Alias for agg() |
DataFrame.groupby ([by]) | Group by any of the series currently in this DataDrame, both from index as well as data. |
DataFrame.rollup (by) | Group by and roll up over the column(s) by , replacing any current grouping. |
DataFrame.cube (by) | Group by and cube over the column(s) by . |
DataFrame.window (**frame_args) | Create a window on the current dataframe grouping and its sorting. |
DataFrame.rolling (window[, min_periods, ...]) | A rolling window of size 'window', by default right aligned. |
DataFrame.expanding ([min_periods, center]) | Create an expanding window starting with the first row in the group, with at least min_period observations. |
Computations & descriptive statsβ
All typesβ
DataFrame.describe ([percentiles, include, ...]) | Returns descriptive statistics. |
DataFrame.count ([axis, level, numeric_only]) | Count all non-NULL values in each column. |
DataFrame.min ([axis, skipna, level, ...]) | Returns the minimum of all values in each column. |
DataFrame.max ([axis, skipna, level, ...]) | Returns the maximum of all values in each column. |
DataFrame.median ([axis, skipna, level, ...]) | Returns the median of all values in each column. |
DataFrame.mode ([axis, skipna, level, ...]) | Returns the mode of all values in each column. |
DataFrame.nunique ([axis, skipna]) | Returns the number of unique values in each column. |
DataFrame.value_counts ([subset, normalize, ...]) | Returns a series containing counts of each unique row in the DataFrame |
Numericβ
DataFrame.mean ([axis, skipna, level, ...]) | Returns the mean of all values in each column. |
DataFrame.quantile ([q, axis]) | Returns the quantile per numeric/timedelta column. |
DataFrame.sem ([axis, skipna, level, ddof, ...]) | Returns the unbiased standard error of the mean of each column. |
DataFrame.sum ([axis, skipna, level, ...]) | Returns the sum of all values in each column. |
DataFrame.std ([axis, skipna, level, ddof, ...]) | Returns the sample standard deviation of each column. |
DataFrame.var ([axis, skipna, level, ddof, ...]) | Returns the unbiased variance of each column. |
Visualizationβ
DataFrame.plot | The .plot accessor offers different methods for data visualization. |