tokenizers {text2vec}    R Documentation

Description
A few simple tokenization functions. For a more comprehensive set of tokenizers, see the tokenizers package:
https://cran.r-project.org/package=tokenizers.
Also check the stringi::stri_split_* functions.
Usage

word_tokenizer(strings, ...)
char_tokenizer(strings, ...)
space_tokenizer(strings, sep = " ", xptr = FALSE, ...)
Arguments

strings   character vector to tokenize.

...       other parameters (usually not used - see source code for details).

sep       character, nchar(sep) = 1 - the single character on which to split
          (space_tokenizer only).

xptr      logical - tokenize at the C++ level; can speed things up by avoiding
          a copy of the result (space_tokenizer only).
Value

A list of character vectors. Each element of the list contains a vector of
tokens for the corresponding input string.
Examples

doc = c("first second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general - performs the split on a fixed single
# whitespace symbol
space_tokenizer(doc, " ")
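To illustrate how the two splitting strategies differ, here is a minimal base-R sketch that emulates the behavior of the tokenizers (it is not the text2vec implementation; the `emulate_*` names are hypothetical, and the word-level split is approximated with a `\W+` regex, whereas text2vec may use different boundary rules):

```r
# Emulated word-level tokenizer: split on runs of non-word characters,
# so punctuation such as "," is discarded rather than kept in the token.
emulate_word_tokenizer <- function(strings) {
  lapply(strsplit(strings, "\\W+"), function(x) x[nzchar(x)])
}

# Emulated space tokenizer: split on a fixed single separator character,
# so punctuation stays attached to the adjacent token.
emulate_space_tokenizer <- function(strings, sep = " ") {
  strsplit(strings, sep, fixed = TRUE)
}

doc <- c("first second", "bla, bla, blaa")
emulate_word_tokenizer(doc)   # second element: "bla" "bla" "blaa"
emulate_space_tokenizer(doc)  # second element: "bla," "bla," "blaa"
```

The contrast in the second element shows why space_tokenizer is faster but less general: it never inspects the text beyond the literal separator, so commas remain glued to the words.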