get_sample(table_name, filter=None, sample_percentage=None, *, overwrite=False, seed=None)
Returns a DataFrame whose data is a sample of the current DataFrame object.
To create the sampled DataFrame, all data is queried once and a persistent table is created to store the sample data.
Use get_unsampled() to switch back to the unsampled data later on. This returns a new
DataFrame with all operations that were performed on the sample applied to the unsampled data.
Will materialize the DataFrame if it is not in a materialized state.
If seed is set (Postgres only), this will create a temporary table from which the sample will be
queried using the TABLESAMPLE BERNOULLI SQL construction.
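The Bernoulli sampling mentioned above can be sketched in plain Python to show why the sample size is only approximate and what a seed changes. This is an illustration of the idea, not the library's or Postgres's implementation; the helper name bernoulli_sample is hypothetical.

```python
import random


def bernoulli_sample(rows, sample_percentage, seed=None):
    """Keep each row independently with probability sample_percentage / 100.

    Hypothetical helper for illustration only; it is not part of the API
    documented here.
    """
    if not 0 <= sample_percentage <= 100:
        raise ValueError("sample_percentage must be between 0 and 100")
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    threshold = sample_percentage / 100
    # Every row gets its own independent draw, so the number of rows kept
    # varies around the requested percentage instead of matching it exactly.
    return [row for row in rows if rng.random() < threshold]


rows = list(range(10_000))
sample_a = bernoulli_sample(rows, sample_percentage=10, seed=42)
sample_b = bernoulli_sample(rows, sample_percentage=10, seed=42)
```

With the same seed, sample_a and sample_b contain the same rows, and their length is close to, but not necessarily exactly, 10% of the input.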
Parameters:
table_name (str) – the name of the underlying SQL table that is created to store the sampled data. Can include project_id and dataset on BigQuery, e.g. ‘project_id.dataset.table_name’
filter (SeriesBoolean) – a filter to apply to the DataFrame before creating the sample. If a filter is applied, sample_percentage is ignored and the Bernoulli sample creation is skipped.
sample_percentage (int) – the approximate size of the sample as a percentage of all rows, between 0 and 100.
overwrite (bool) – if True, the sample data is written to table_name, even if that table already exists.
seed (int) – optional seed number used to generate the sample. Only supported for Postgres.
Raises: Exception – if overwrite=False and the table already exists. The exact exception depends on the underlying database.
Returns: a sampled DataFrame of the current DataFrame.
Warning: with overwrite=True, if a table with the given name already exists, that table will
be dropped and all its data lost!
This function queries the database.
This function writes to the database.
All data in the DataFrame to be sampled is queried to create the sample.
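The overwrite behaviour described above can be illustrated with a small, self-contained sketch that uses Python's built-in sqlite3 module instead of Postgres or BigQuery. The helper write_sample_table and the table names are hypothetical, and the exception type shown is specific to SQLite; other databases raise their own exceptions.

```python
import sqlite3


def write_sample_table(conn, table_name, source_table, overwrite=False):
    """Copy source_table into table_name (hypothetical helper, illustration only).

    Table names are interpolated directly into the SQL, so this sketch must
    not be used with untrusted input.
    """
    if overwrite:
        # Mirrors the warning: an existing table is dropped and its data lost.
        conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')
    # Without overwrite, the database itself rejects a duplicate table name.
    conn.execute(f'CREATE TABLE "{table_name}" AS SELECT * FROM "{source_table}"')


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(5)])

write_sample_table(conn, "events_sample", "events")

try:
    write_sample_table(conn, "events_sample", "events")  # table already exists
    raised = False
except sqlite3.OperationalError:
    raised = True

write_sample_table(conn, "events_sample", "events", overwrite=True)
row_count = conn.execute('SELECT COUNT(*) FROM "events_sample"').fetchone()[0]
```

In this sketch the second call fails because the table already exists, while the call with overwrite=True replaces the table and its contents.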