libcudf
23.12.00
Files | |
file | stemmer.hpp |
Enumerations | |
enum class | nvtext::letter_type { nvtext::CONSONANT , nvtext::VOWEL } |
Used for specifying letter type to check. More... | |
Functions | |
std::unique_ptr< cudf::column > | nvtext::is_letter (cudf::strings_column_view const &strings, letter_type ltype, cudf::size_type character_index, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns boolean column indicating if character_index of the input strings is a consonant or vowel. More... | |
std::unique_ptr< cudf::column > | nvtext::is_letter (cudf::strings_column_view const &strings, letter_type ltype, cudf::column_view const &indices, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns boolean column indicating if character at indices[i] of strings[i] is a consonant or vowel. More... | |
std::unique_ptr< cudf::column > | nvtext::porter_stemmer_measure (cudf::strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns the Porter Stemmer measurements of a strings column. More... | |
enum class nvtext::letter_type
Used for specifying letter type to check.
Enumerator | |
---|---|
CONSONANT | Letter is a consonant. |
VOWEL | Letter is not a consonant. |
Definition at line 32 of file stemmer.hpp.
std::unique_ptr<cudf::column> nvtext::is_letter(
  cudf::strings_column_view const& strings,
  letter_type ltype,
  cudf::column_view const& indices,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Returns boolean column indicating if character at indices[i] of strings[i] is a consonant or vowel.
Determining consonants and vowels is described in the following paper: https://tartarus.org/martin/PorterStemmer/def.txt
Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise, the result for that string is undefined.
Also, the algorithm only works with English words.
A negative index value checks the character counting from the end of each string. That is, for indices[i] < 0, the letter checked for string strings[i] is at position strings[i].length + indices[i].
A null input element at row i produces a corresponding null entry for row i in the output column.
Exceptions | |
cudf::logic_error | if indices.size() != strings.size() |
cudf::logic_error | if indices contains nulls. |
Parameters | |
strings | Strings column of words to measure. |
ltype | Specify letter type to check. |
indices | The character positions to check in each string. |
mr | Device memory resource used to allocate the returned column's device memory. |
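The consonant rule and per-row index handling described above can be sketched on the host in plain C++. This is only an illustrative reference, not the libcudf implementation: `is_consonant` and `is_letter_host` are hypothetical names, `std::optional` stands in for nullable rows, `std::invalid_argument` stands in for `cudf::logic_error`, and returning false for an out-of-range position is an assumption that the text above does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

enum class letter_type { CONSONANT, VOWEL };

// Porter's rule: a consonant is any letter other than a, e, i, o, u,
// and other than 'y' preceded by a consonant.
bool is_consonant(std::string const& word, std::size_t pos)
{
  assert(pos < word.size());
  char const c = word[pos];
  if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u') return false;
  if (c != 'y' || pos == 0) return true;
  return !is_consonant(word, pos - 1);  // 'y' after a consonant acts as a vowel
}

// Hypothetical host-side analogue of the per-row overload.
std::vector<std::optional<bool>> is_letter_host(
  std::vector<std::optional<std::string>> const& words,
  letter_type ltype,
  std::vector<int> const& indices)
{
  if (indices.size() != words.size()) {
    throw std::invalid_argument("indices.size() != words.size()");
  }
  std::vector<std::optional<bool>> out(words.size());  // all rows start as null
  for (std::size_t i = 0; i < words.size(); ++i) {
    if (!words[i]) continue;  // null input row stays null in the output
    int const len = static_cast<int>(words[i]->size());
    // A negative index counts from the end: length + indices[i].
    int const pos = indices[i] < 0 ? len + indices[i] : indices[i];
    if (pos < 0 || pos >= len) {
      out[i] = false;  // assumption: out-of-range positions match nothing
      continue;
    }
    bool const cons = is_consonant(*words[i], static_cast<std::size_t>(pos));
    out[i] = (ltype == letter_type::CONSONANT) ? cons : !cons;
  }
  return out;
}
```

Under these assumptions, checking `letter_type::VOWEL` for the words "trouble", "toy", "syzygy" at indices 3, 1, 4 gives true, true, false: 'u' and 'o' are vowels, while 'g' is a consonant.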
std::unique_ptr<cudf::column> nvtext::is_letter(
  cudf::strings_column_view const& strings,
  letter_type ltype,
  cudf::size_type character_index,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Returns boolean column indicating if character_index of the input strings is a consonant or vowel.
Determining consonants and vowels is described in the following paper: https://tartarus.org/martin/PorterStemmer/def.txt
Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise, the result for that string is undefined.
Also, the algorithm only works with English words.
A negative index value checks the character counting from the end of each string. That is, for character_index < 0, the letter checked for string strings[i] is at position strings[i].length + character_index.
A null input element at row i produces a corresponding null entry for row i in the output column.
Parameters | |
strings | Strings column of words to measure. |
ltype | Specify letter type to check. |
character_index | The character position to check in each string. |
mr | Device memory resource used to allocate the returned column's device memory. |
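The negative-index arithmetic for a single character_index applied to every row can be isolated in a small helper. The sketch below is a host-side illustration only: `resolve_position` is a hypothetical name, and returning `std::nullopt` for a position outside the word is an assumption, since the text above does not say how out-of-range values are treated.

```cpp
#include <cstddef>
#include <optional>

// Maps a possibly negative character_index to a zero-based position.
// For character_index < 0 the position is length + character_index.
// Returns std::nullopt when the position falls outside the word (assumption).
std::optional<std::size_t> resolve_position(std::size_t length, int character_index)
{
  long long const len = static_cast<long long>(length);
  long long const pos = character_index < 0 ? len + character_index : character_index;
  if (pos < 0 || pos >= len) return std::nullopt;
  return static_cast<std::size_t>(pos);
}
```

For a seven-character word such as "trouble", an index of -1 resolves to position 6 (the final 'e') and -7 resolves to position 0, while -8 and 7 both fall outside the word.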
std::unique_ptr<cudf::column> nvtext::porter_stemmer_measure(
  cudf::strings_column_view const& strings,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Returns the Porter Stemmer measurements of a strings column.
Porter stemming normalizes English words by removing plural and tense endings. The measure of a word is the number m of vowel-consonant sequences when the word is written in the form [C](VC)^m[V], where C is a run of consonants and V is a run of vowels. Reference paper: https://tartarus.org/martin/PorterStemmer/def.txt
Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise, the measure value for that string is undefined.
Also, the algorithm only works with English words.
A null input element at row i produces a corresponding null entry for row i in the output column.
Parameters | |
strings | Strings column of words to measure. |
mr | Device memory resource used to allocate the returned column's device memory. |
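The consonant/vowel pattern counting behind the measure can be illustrated for a single word on the host. The sketch below follows the definition in the referenced Porter paper, where a word has the form [C](VC)^m[V] and the measure is m. It is not the libcudf kernel, `porter_measure_host` is a hypothetical name, and the input is assumed to be one lower-cased English word.

```cpp
#include <string>

// Counts the vowel-run to consonant-run transitions in a word, which equals
// m in Porter's form [C](VC)^m[V].
int porter_measure_host(std::string const& word)
{
  int measure = 0;
  bool have_prev = false;
  bool prev_is_consonant = false;
  for (char const c : word) {
    bool is_consonant = true;
    if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u') {
      is_consonant = false;
    } else if (c == 'y') {
      // 'y' is a consonant unless it is preceded by a consonant.
      is_consonant = !(have_prev && prev_is_consonant);
    }
    // Each vowel followed by a consonant closes one VC sequence.
    if (have_prev && !prev_is_consonant && is_consonant) ++measure;
    prev_is_consonant = is_consonant;
    have_prev = true;
  }
  return measure;
}
```

With this sketch, "tree" and "by" have measure 0, "trouble" and "ivy" have measure 1, and "troubles" and "orrery" have measure 2, matching the examples given in the Porter paper.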