bach.DataFrame.get_sample

get_sample​

(table_name, filter=None, sample_percentage=None, *, overwrite=False, seed=None)

​[source]

Returns a DataFrame whose data is a sample of the current DataFrame object.

For the sample Dataframe to be created, all data is queried once and a persistent table is created to store the sample data used for the sampled DataFrame.

Use get_unsampled() to switch back to the unsampled data later on. This returns a new DataFrame with all operations that have been done on the sample, applied to that DataFrame.

Will materialize the DataFrame if it is not in a materialized state.

If seed is set (Postgres only), this will create a temporary table from which the sample will be queried using the tablesample bernoulli sql construction.

Parameters​

  • table_name (str) – the name of the underlying sql table that is created to store the sampled data. Can include project_id and dataset on BigQuery, e.g. β€˜project_id.dataset.table_name’
  • filter (SeriesBoolean) – a filter to apply to the dataframe before creating the sample. If a filter is applied, sample_percentage is ignored and thus the bernoulli sample creation is skipped.
  • sample_percentage (int) – the approximate size of the sample as a proportion of all rows. Between 0-100.
  • overwrite (bool) – if True, the sample data is written to table_name, even if that table already exists.
  • seed (int) – optional seed number used to generate the sample. Only supported for Postgres.

Raises​

Exception – If overwrite=False and the table already exists. The exact exception depends on the underlying database.

Returns​

a sampled DataFrame of the current DataFrame.

Return type​

DataFrame

danger

With overwrite=True, if a table already exist with the given name, then that table will be dropped and all data lost!

note

This function queries the database.

note

This function writes to the database.

note

All data in the DataFrame to be sampled is queried to create the sample.