cudf.DataFrame.apply

DataFrame.apply(func, axis=1, raw=False, result_type=None, args=(), **kwargs)

Apply a function along an axis of the DataFrame. apply relies on Numba to JIT-compile func, so the operations allowed within func are limited to those supported by Numba's CUDA Python target. For more information, see the cuDF guide to user defined functions.

Support for use of string data within UDFs is provided through the strings_udf RAPIDS library. Supported operations on strings include the subset of functions and string methods that expect an input string but do not return a string. Refer to caveats in the UDF guide referenced above.

Parameters
func: function

Function to apply to each row.

axis: {0 or ‘index’, 1 or ‘columns’}, default 1

Axis along which the function is applied:

  • 0 or ‘index’: apply function to each column. Note: axis=0 is not yet supported.

  • 1 or ‘columns’: apply function to each row.

raw: bool, default False

Not yet supported

result_type: {‘expand’, ‘reduce’, ‘broadcast’, None}, default None

Not yet supported

args: tuple

Positional arguments to pass to func in addition to the row.

Examples

A simple function of a single variable that could be NA:

>>> def f(row):
...     if row['a'] is cudf.NA:
...         return 0
...     else:
...         return row['a'] + 1
...
>>> df = cudf.DataFrame({'a': [1, cudf.NA, 3]})
>>> df.apply(f, axis=1)
0    2
1    0
2    4
dtype: int64

Functions of multiple variables operate in a null-aware manner:

>>> def f(row):
...     return row['a'] - row['b']
...
>>> df = cudf.DataFrame({
...     'a': [1, cudf.NA, 3, cudf.NA],
...     'b': [5, 6, cudf.NA, cudf.NA]
... })
>>> df.apply(f, axis=1)
0      -4
1    <NA>
2    <NA>
3    <NA>
dtype: int64

Functions may conditionally return NA as in pandas:

>>> def f(row):
...     if row['a'] + row['b'] > 3:
...         return cudf.NA
...     else:
...         return row['a'] + row['b']
...
>>> df = cudf.DataFrame({
...     'a': [1, 2, 3],
...     'b': [2, 1, 1]
... })
>>> df.apply(f, axis=1)
0       3
1       3
2    <NA>
dtype: int64

Mixed types are allowed, but the result has the common type rather than object as in pandas:

>>> def f(row):
...     return row['a'] + row['b']
...
>>> df = cudf.DataFrame({
...     'a': [1, 2, 3],
...     'b': [0.5, cudf.NA, 3.14]
... })
>>> df.apply(f, axis=1)
0     1.5
1    <NA>
2    6.14
dtype: float64

Functions may also return scalar values; however, the result will be promoted to a safe type regardless of the data:

>>> def f(row):
...     if row['a'] > 3:
...         return row['a']
...     else:
...         return 1.5
...
>>> df = cudf.DataFrame({
...     'a': [1, 3, 5]
... })
>>> df.apply(f, axis=1)
0    1.5
1    1.5
2    5.0
dtype: float64

Operations against any number of columns are generally supported:

>>> def f(row):
...     v, w, x, y, z = (
...         row['a'], row['b'], row['c'], row['d'], row['e']
...     )
...     return x + (y - (z / w)) % v
...
>>> df = cudf.DataFrame({
...     'a': [1, 2, 3],
...     'b': [4, 5, 6],
...     'c': [cudf.NA, 4, 4],
...     'd': [8, 7, 8],
...     'e': [7, 1, 6]
... })
>>> df.apply(f, axis=1)
0    <NA>
1     4.8
2     5.0
dtype: float64
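
Extra positional arguments can be forwarded to the UDF through the args parameter. The following is a minimal sketch, assuming (per the parameter description) that the values in args are passed to func positionally after the row. The names add_offset, offset, and run_on_gpu are hypothetical and chosen only for illustration. The cuDF call is wrapped in a function so that the UDF body can first be sanity-checked on plain dicts without a GPU:

```python
def add_offset(row, offset):
    # `row` is indexed by column name; `offset` is the extra positional
    # argument supplied through args=(...).
    return row['a'] + offset


def run_on_gpu():
    # Requires a working cuDF installation and a CUDA-capable GPU.
    import cudf

    df = cudf.DataFrame({'a': [1, 2, 3]})
    # Assumption: args are passed to the UDF after the row, in order.
    return df.apply(add_offset, axis=1, args=(10,))


# CPU-only sanity check of the UDF body, using plain dicts as stand-in rows.
cpu_result = [add_offset({'a': value}, 10) for value in [1, 2, 3]]
print(cpu_result)
```

The dict-based check prints [11, 12, 13]. On a machine with cuDF available, calling run_on_gpu() should yield a Series holding the same values, provided the assumption about argument order holds.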

UDFs manipulating string data are allowed, as long as they neither modify strings in place nor create new strings. For example, the following UDF is allowed:

>>> def f(row):
...     st = row['str_col']
...     scale = row['scale']
...     if len(st) == 0:
...         return -1
...     elif st.startswith('a'):
...         return 1 - scale
...     elif 'example' in st:
...         return 1 + scale
...     else:
...         return 42
...
>>> df = cudf.DataFrame({
...     'str_col': ['', 'abc', 'some_example'],
...     'scale': [1, 2, 3]
... })
>>> df.apply(f, axis=1)
0   -1
1   -1
2    4
dtype: int64

However, the following UDF is not allowed because it calls the upper method, which requires creating a new string. Unsupported methods such as this one raise an AttributeError.

>>> def f(row):
...     st = row['str_col'].upper()
...     return 'ABC' in st
...
>>> df.apply(f, axis=1)

For a complete list of supported functions and methods that may be used to manipulate string data, see the UDF guide: <https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html>