modelhub.ModelHub.get_objectiv_dataframe

get_objectiv_dataframe

(*, db_url=None, table_name=None, start_date=None, end_date=None, bq_credentials_path=None, bq_credentials=None, with_sessionized_data=True, session_gap_seconds=1800, identity_resolution=None, anonymize_unidentified_users=True)

Loads data from a SQL table into a bach.DataFrame object.

The created DataFrame points to where the data is stored in the SQL database. The method also applies several transformations and sets the right data types for all columns, so the models from the model hub can be applied to a DataFrame created with it.

For all databases except BigQuery, the credentials can be specified as part of db_url. For BigQuery, the credentials can be set with either bq_credentials (primary) or bq_credentials_path. Additionally, for all databases it is possible to specify credentials as part of the environment, either as variables, files, or some other method. For more information on specifying credentials as part of the environment, check the documentation of the specific database vendor: Athena, BigQuery, or Postgres.

Parameters

  • db_url (Optional[str]) – the URL that indicates the database dialect and connection arguments. If not given, the DSN environment variable is used to create one. If that is not set either, the default of ‘postgresql://objectiv:@localhost:5432/objectiv’ is used.
  • table_name (Optional[str]) – the name of the SQL table where the data is stored. Defaults to ‘events’ for BigQuery and ‘data’ for other engines.
  • start_date (Optional[str]) – first date for which data is loaded into the DataFrame. If None, data is loaded from the first date in the SQL table. Format as ‘YYYY-MM-DD’.
  • end_date (Optional[str]) – last date for which data is loaded into the DataFrame. If None, data is loaded up to and including the last date in the SQL table. Format as ‘YYYY-MM-DD’.
  • bq_credentials_path (Optional[str]) – optional path to file with BigQuery credentials.
  • bq_credentials (Optional[str]) – optional BigQuery credentials, content from credentials file.
  • with_sessionized_data (bool) – Indicates if DataFrame must include session_id and session_hit_number calculated series.
  • session_gap_seconds (int) – Number of seconds used to determine whether events were triggered during the same session.
  • identity_resolution (Optional[str]) – Identity id to be used for identifying users based on IdentityContext. If no value is provided, then the user_id series will contain the value from the cookie_id column (a UUID).
  • anonymize_unidentified_users (bool) – Indicates if unidentified users are required to be anonymized by setting user_id value to NULL. Otherwise, original UUID value from the cookie will remain.
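
The fallback rules for db_url and table_name described above can be sketched in plain Python. This is only an illustration of the documented defaults, not the library's code: the helper names resolve_db_url and default_table_name are hypothetical, and detecting BigQuery by a ‘bigquery://’ URL prefix is an assumption.

```python
import os
from typing import Mapping, Optional

# Default taken from the db_url parameter description.
DEFAULT_DB_URL = "postgresql://objectiv:@localhost:5432/objectiv"


def resolve_db_url(db_url: Optional[str] = None,
                   env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: explicit argument first, then DSN, then the default."""
    if db_url is not None:
        return db_url
    return env.get("DSN", DEFAULT_DB_URL)


def default_table_name(db_url: str) -> str:
    """Hypothetical helper: 'events' for BigQuery, 'data' for other engines.

    Assumption: a BigQuery connection URL starts with 'bigquery://'.
    """
    return "events" if db_url.startswith("bigquery://") else "data"


url = resolve_db_url(None, env={})
print(url, default_table_name(url))
```

Passing env explicitly keeps the sketch easy to exercise without touching the real process environment.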

Returns

bach.DataFrame with Objectiv data.
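
A minimal usage sketch, assuming the modelhub package is installed and that ModelHub can be constructed with its default arguments. The helpers build_dataframe_kwargs and load_objectiv_dataframe are hypothetical names introduced here for illustration. The dates and connection details are placeholders.

```python
from datetime import date
from typing import Optional


def build_dataframe_kwargs(start: Optional[date] = None,
                           end: Optional[date] = None,
                           session_gap_minutes: int = 30) -> dict:
    """Hypothetical helper: build keyword arguments in the documented formats."""
    return {
        # date.isoformat() yields the required 'YYYY-MM-DD' format.
        "start_date": start.isoformat() if start else None,
        "end_date": end.isoformat() if end else None,
        "with_sessionized_data": True,
        "session_gap_seconds": session_gap_minutes * 60,
    }


def load_objectiv_dataframe(**kwargs):
    """Hypothetical wrapper around the documented method call."""
    # Imported lazily so the sketch can be read without the package installed.
    from modelhub import ModelHub

    hub = ModelHub()  # assumption: default constructor arguments are sufficient
    # All parameters are keyword-only, so they are passed by name.
    return hub.get_objectiv_dataframe(**kwargs)


kwargs = build_dataframe_kwargs(date(2022, 6, 1), date(2022, 6, 30))
print(kwargs)
# df = load_objectiv_dataframe(**kwargs)  # requires a reachable database
```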

Note

If with_sessionized_data is True, Objectiv data will include session_id (int64) and session_hit_number (int64) series.
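
To illustrate what session_gap_seconds controls, here is a gap-based sessionization sketch in plain Python for the events of a single user. It is not the library's implementation, which works on the SQL data. The function name sessionize, the numbering from 1, and the strict "greater than" comparison are all assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def sessionize(moments: List[datetime],
               session_gap_seconds: int = 1800) -> List[Tuple[int, int]]:
    """Return (session_id, session_hit_number) pairs in chronological order.

    Assumptions: ids and hit numbers start at 1, and a new session starts
    when the time since the previous event exceeds the gap.
    """
    gap = timedelta(seconds=session_gap_seconds)
    result = []
    session_id = 0
    hit_number = 0
    previous = None
    for moment in sorted(moments):
        if previous is None or moment - previous > gap:
            # First event, or too long since the last one: open a new session.
            session_id += 1
            hit_number = 1
        else:
            hit_number += 1
        result.append((session_id, hit_number))
        previous = moment
    return result


events = [
    datetime(2022, 6, 1, 10, 0),
    datetime(2022, 6, 1, 10, 10),
    datetime(2022, 6, 1, 10, 50),  # 40 minutes after the previous event
    datetime(2022, 6, 1, 11, 0),
]
print(sessionize(events))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

With the default gap of 1800 seconds, the 40-minute pause before the third event starts a second session.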