tokenizer_result subword_tokenize(cudf::strings_column_view const &strings, hashed_vocabulary const &vocabulary_table, uint32_t max_sequence_length, uint32_t stride, bool do_lower_case, bool do_truncate, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
Creates a tokenizer that cleans the text, splits it into tokens and returns token-ids from an input vocabulary.
std::unique_ptr< hashed_vocabulary > load_vocabulary_file(std::string const &filename_hashed_vocabulary, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
Load the hashed vocabulary file into device memory.
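A minimal usage sketch combining load_vocabulary_file and subword_tokenize, assuming an existing cudf::strings_column_view named input and a vocabulary file "hashed_vocab.txt" (both placeholders); the tokenizer settings shown are illustrative values, not defaults:

#include <nvtext/subword_tokenize.hpp>
#include <cudf/strings/strings_column_view.hpp>

nvtext::tokenizer_result tokenize_example(cudf::strings_column_view const& input)
{
  // Load the hashed vocabulary file into device memory once; the loaded
  // vocabulary can be reused across multiple subword_tokenize calls.
  auto vocab = nvtext::load_vocabulary_file("hashed_vocab.txt");

  // Tokenize with at most 64 token-ids per output row. With do_truncate=false,
  // strings producing more than 64 token-ids spill into additional rows, and
  // stride controls how far each new row advances through the token-ids.
  return nvtext::subword_tokenize(input,
                                  *vocab,
                                  64,     // max_sequence_length
                                  48,     // stride
                                  true,   // do_lower_case
                                  false); // do_truncate
}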
The hashed_vocabulary structure holds the vocabulary data for use with the subword_tokenize function.
uint16_t first_token_id
The first token id in the vocabulary.
uint16_t separator_token_id
The separator token id in the vocabulary.
uint16_t unknown_token_id
The unknown token id in the vocabulary.
uint32_t outer_hash_a
The a parameter for the outer hash.
uint32_t outer_hash_b
The b parameter for the outer hash.
std::unique_ptr< cudf::column > table
uint64 column; the flattened hash table with key-value pairs packed in 64 bits.
std::unique_ptr< cudf::column > bin_coefficients
uint64 column; the hashing parameters for each hash bin on the GPU.
std::unique_ptr< cudf::column > bin_offsets
uint16 column; the start index of each bin in the flattened hash table.
std::unique_ptr< cudf::column > cp_metadata
uint32 column; the code point metadata table to use for normalization.
std::unique_ptr< cudf::column > aux_cp_table
uint64 column; the auxiliary code point table to use for normalization.
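The structure is normally produced by load_vocabulary_file rather than filled in by hand. A brief inspection sketch, with the file name as a placeholder:

#include <nvtext/subword_tokenize.hpp>
#include <cudf/column/column.hpp>
#include <iostream>

void inspect_vocabulary()
{
  auto vocab = nvtext::load_vocabulary_file("hashed_vocab.txt");

  // The scalar members are host-side values and can be read directly.
  std::cout << "first token id:     " << vocab->first_token_id << "\n";
  std::cout << "separator token id: " << vocab->separator_token_id << "\n";
  std::cout << "unknown token id:   " << vocab->unknown_token_id << "\n";

  // The column members (table, bin_coefficients, bin_offsets, cp_metadata,
  // aux_cp_table) hold device memory and are consumed by subword_tokenize.
  std::cout << "hash table size:    " << vocab->table->size() << "\n";
}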
The tokenizer_result structure is the result object returned by the subword_tokenize functions.
uint32_t sequence_length
The number of token-ids in each row.
uint32_t nrows_tensor
The number of rows for the output token-ids.
std::unique_ptr< cudf::column > tensor_token_ids
A vector of token-ids for each row.
std::unique_ptr< cudf::column > tensor_metadata
The metadata for each tensor row.
std::unique_ptr< cudf::column > tensor_attention_mask
This mask identifies which tensor-token-ids are valid.
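The tensor columns are flattened device vectors with nrows_tensor * sequence_length entries each; downstream code typically reshapes them into (nrows_tensor, sequence_length) tensors. A small sketch of the row-major indexing this implies (an assumption consistent with the member descriptions above, not an additional API):

#include <nvtext/subword_tokenize.hpp>
#include <cstdint>

// Total number of entries in tensor_token_ids and tensor_attention_mask.
inline uint32_t total_entries(nvtext::tokenizer_result const& result)
{
  return result.nrows_tensor * result.sequence_length;
}

// Flat index of the p-th token-id within the r-th output row,
// assuming row-major layout of tensor_token_ids.
inline uint32_t flat_index(nvtext::tokenizer_result const& result,
                           uint32_t r, uint32_t p)
{
  return r * result.sequence_length + p;
}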