libcudf 23.12.00
subword_tokenize.hpp File Reference
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/strings/strings_column_view.hpp>


Classes

struct  nvtext::hashed_vocabulary
 The vocabulary data for use with the subword_tokenize function.
 
struct  nvtext::tokenizer_result
 Result object for the subword_tokenize function.
 

Namespaces

 nvtext
 NVText APIs.
 

Functions

std::unique_ptr< hashed_vocabulary > nvtext::load_vocabulary_file (std::string const &filename_hashed_vocabulary, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Load the hashed vocabulary file into device memory.
 
tokenizer_result nvtext::subword_tokenize (cudf::strings_column_view const &strings, hashed_vocabulary const &vocabulary_table, uint32_t max_sequence_length, uint32_t stride, bool do_lower_case, bool do_truncate, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Tokenizes the input strings: cleans the text, splits it into tokens, and returns the token-ids from the input vocabulary.
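
The three stages named in the description (clean, split, look up token-ids) can be illustrated with a small host-only sketch. This is not the libcudf implementation and does not call libcudf: it runs on the CPU, uses a plain std::unordered_map in place of nvtext::hashed_vocabulary, splits on whitespace only, and does no subword splitting. The names clean_text and toy_tokenize, the unknown_id parameter, and the cleaning rule (non-alphanumeric bytes become spaces) are all hypothetical choices made for illustration.

```cpp
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of libcudf.
// Keeps ASCII letters and digits (optionally lowercased) and turns every
// other byte, including punctuation and non-ASCII bytes, into a space.
std::string clean_text(std::string const& input, bool do_lower_case)
{
  std::string out;
  out.reserve(input.size());
  for (unsigned char ch : input) {
    if (std::isalnum(ch)) {
      out.push_back(static_cast<char>(do_lower_case ? std::tolower(ch) : ch));
    } else {
      out.push_back(' ');
    }
  }
  return out;
}

// Hypothetical helper, not part of libcudf.
// Cleans the text, splits it on whitespace, and maps each token to its id.
// Tokens missing from the vocabulary map to unknown_id.
std::vector<std::uint32_t> toy_tokenize(
  std::string const& input,
  std::unordered_map<std::string, std::uint32_t> const& vocab,
  std::uint32_t unknown_id,
  bool do_lower_case)
{
  std::vector<std::uint32_t> ids;
  std::istringstream stream(clean_text(input, do_lower_case));
  std::string token;
  while (stream >> token) {
    auto const found = vocab.find(token);
    ids.push_back(found != vocab.end() ? found->second : unknown_id);
  }
  return ids;
}
```

With a two-entry vocabulary mapping "hello" to 1 and "world" to 2, and unknown_id set to 0, calling toy_tokenize on "Hello, WORLD! foo" with do_lower_case set to true yields the ids 1, 2, 0. With do_lower_case set to false, none of the tokens match the lowercase vocabulary, so all three map to 0.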