cudf.DataFrame.drop_duplicates#

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)#

Return DataFrame with duplicate rows removed, optionally only considering certain subset of columns.

Parameters
subsetcolumn label or sequence of labels, optional

Only consider certain columns for identifying duplicates, by default use all of the columns.

keep{‘first’, ‘last’, False}, default ‘first’

Determines which duplicates (if any) to keep. - first : Drop duplicates except for the first occurrence. - last : Drop duplicates except for the last occurrence. - False : Drop all duplicates.

inplacebool, default False

Whether to drop duplicates in place or to return a copy.

ignore_indexbool, default False

If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns
DataFrame or None

DataFrame with duplicates removed or None if inplace=True.

Examples

>>> import cudf
>>> df = cudf.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns. Note that order of the rows being returned is not guaranteed to be sorted.

>>> df.drop_duplicates()
     brand style  rating
2  Indomie   cup     3.5
4  Indomie  pack     5.0
3  Indomie  pack    15.0
0  Yum Yum   cup     4.0

To remove duplicates on specific column(s), use subset.

>>> df.drop_duplicates(subset=['brand'])
     brand style  rating
2  Indomie   cup     3.5
0  Yum Yum   cup     4.0

To remove duplicates and keep last occurrences, use keep.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
     brand style  rating
2  Indomie   cup     3.5
4  Indomie  pack     5.0
1  Yum Yum   cup     4.0