bach.DataFrame.drop_duplicates

drop_duplicates(subset=None, keep='first', ignore_index=False, sort_by=None, ascending=True)

Return a dataframe with duplicated rows removed based on all series labels or a subset of labels.

Parameters

  • subset (Optional[Union[str, Sequence[str]]]) – series label or sequence of labels. Rows are treated as duplicates when they share the same values for this combination of series. If not provided, all series labels are used.
  • keep (Union[str, bool]) – Supported values: "first", "last" and False. Determines which duplicates to keep:
    • first: drop all occurrences except the first one
    • last: drop all occurrences except the last one
    • False: drop all occurrences of duplicated rows, keeping none of them

If no value is provided, first occurrences will be kept by default.

  • ignore_index (bool) – if True, the indexes are dropped from the result
  • sort_by (Optional[Union[str, Sequence[str]]]) – series label or sequence of labels used to sort values. Sorting is needed because, without a defined order, the result can be non-deterministic when keep == "first" or keep == "last". If not provided:
  1. If the dataframe already has an order_by, the first and last occurrences are determined based on it.
  2. Otherwise, all series not considered in the duplicate check are used for sorting instead.
  • ascending (Union[bool, List[bool]]) – Whether to sort ascending (True) or descending (False). If this is a list, then sort_by must also be a list and len(ascending) == len(sort_by).
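
How keep, sort_by and ascending interact can be sketched in plain Python. This is only a model of the semantics described above, not how bach implements the operation. The function name drop_duplicates_model and the sample rows are hypothetical; rows are modeled as dicts, ascending is simplified to a single bool, and the case of an existing order_by is not covered.

```python
from typing import Dict, List, Optional, Sequence, Union

Row = Dict[str, object]


def drop_duplicates_model(
    rows: List[Row],
    subset: Optional[Sequence[str]] = None,
    keep: Union[str, bool] = "first",
    sort_by: Optional[Sequence[str]] = None,
    ascending: bool = True,
) -> List[Row]:
    """Plain-Python model of the documented semantics (hypothetical helper)."""
    if keep not in ("first", "last", False):
        raise ValueError("keep must be 'first', 'last' or False")

    labels = list(rows[0].keys()) if rows else []
    # Without a subset, every series takes part in the duplicate check.
    subset = list(subset) if subset is not None else labels
    if sort_by is None:
        # Documented fallback: sort on all series outside the subset.
        sort_by = [label for label in labels if label not in subset]

    # Fix the order first, so that "first" and "last" are well defined.
    ordered = sorted(
        rows,
        key=lambda row: tuple(row[label] for label in sort_by),
        reverse=not ascending,
    )

    # Group rows that share the same values for the subset of series.
    groups: Dict[tuple, List[Row]] = {}
    for row in ordered:
        groups.setdefault(tuple(row[label] for label in subset), []).append(row)

    result: List[Row] = []
    for members in groups.values():
        if keep == "first":
            result.append(members[0])
        elif keep == "last":
            result.append(members[-1])
        elif len(members) == 1:
            # keep=False: only rows that were never duplicated survive.
            result.append(members[0])
    return result


rows = [
    {"city": "Utrecht", "year": 2021},
    {"city": "Utrecht", "year": 2019},
    {"city": "Leiden", "year": 2020},
]
first = drop_duplicates_model(rows, subset=["city"], keep="first", sort_by=["year"])
last = drop_duplicates_model(rows, subset=["city"], keep="last", sort_by=["year"])
none_kept = drop_duplicates_model(rows, subset=["city"], keep=False, sort_by=["year"])
```

In this model, first keeps the 2019 row for Utrecht, last keeps the 2021 row, and none_kept contains only the Leiden row. Passing ascending=False reverses the order, so keep="first" would then select the 2021 row.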

Returns

A new dataframe with the duplicates dropped.

Return type

bach.dataframe.DataFrame