itoken {text2vec}    R Documentation
Description

This family of functions creates iterators over input objects in order to build vocabularies, or DTM and TCM matrices. These iterators are usually consumed by create_vocabulary, create_dtm, vectorizers, and create_tcm; see their documentation for details.
Usage

itoken(iterable, ...)

## S3 method for class 'list'
itoken(iterable, n_chunks = 10, progressbar = interactive(), ids = NULL, ...)

## S3 method for class 'character'
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  n_chunks = 10, progressbar = interactive(), ids = NULL, ...)

## S3 method for class 'iterator'
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  n_chunks = 1L, progressbar = interactive(), ...)

itoken_parallel(iterable, ...)

## S3 method for class 'character'
itoken_parallel(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  n_chunks = foreach::getDoParWorkers(), ids = NULL, ...)

## S3 method for class 'ifiles_parallel'
itoken_parallel(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  n_chunks = 1L, ...)

## S3 method for class 'list'
itoken_parallel(iterable, n_chunks = foreach::getDoParWorkers(), ids = NULL, ...)
Arguments

iterable
    an object from which to generate an iterator

...
    arguments passed to other methods

n_chunks
    integer, the number of chunks the input object is split into; for
    itoken_parallel, chunks are processed in parallel by the registered
    workers

progressbar
    logical, whether to display a progress bar (shown by default only in
    interactive sessions)

ids
    character vector of document ids. If not supplied, names(iterable) are
    used when available; otherwise ids are generated automatically.

preprocessor
    function which takes a chunk of character vectors and performs all
    preprocessing (for example lowercasing); it should return a character
    vector of cleaned documents

tokenizer
    function which takes a character vector from the preprocessor, splits
    it into tokens, and returns a list of character vectors. To stem, call
    a stemmer inside the tokenizer; see the examples and the sketch below.
Details

S3 methods for creating an itoken iterator from different kinds of input:

list: all elements of the input list should be character vectors
containing tokens (see the sketch after this list)

character: a raw text source; the user must provide a tokenizer function

ifiles: from files; the user must provide a function to read in the file
(passed to ifiles) and a function to tokenize it (passed to itoken)

idir: from a directory; the user must provide a function to read in the
files (passed to idir) and a function to tokenize it (passed to itoken)

ifiles_parallel: from files, in parallel
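For the list method, a minimal sketch (the toy token lists are illustrative only; document ids are taken from the list names):

library(text2vec)
tokens = list(doc1 = c("hello", "world"), doc2 = c("hello", "text2vec"))
it = itoken(tokens)
v = create_vocabulary(it)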
See Also

ifiles, idir, create_vocabulary, create_dtm, vectorizers, create_tcm
data("movie_review")
txt = movie_review$review[1:100]
ids = movie_review$id[1:100]
# iterator without explicit ids (ids are generated automatically)
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
# iterator with explicit document ids
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10, ids = ids)
# Example of a stemming tokenizer (requires the SnowballC package)
# stem_tokenizer = function(x) {
#   lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
# }
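# A sketch of plugging such a tokenizer into itoken, assuming SnowballC
# is installed (kept commented like the definition above):
# it = itoken(txt, tolower, stem_tokenizer, n_chunks = 10)
# v = create_vocabulary(it)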
#------------------------------------------------
# PARALLEL iterators
#------------------------------------------------
N_WORKERS = 1 # change to the number of cores available to the parallel backend
if (require(doParallel)) registerDoParallel(N_WORKERS)
data("movie_review")
it = itoken_parallel(movie_review$review[1:100], n_chunks = N_WORKERS)
system.time(dtm <- create_dtm(it, hash_vectorizer(2^16), type = 'dgTMatrix'))
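# A fresh parallel iterator can feed create_vocabulary in the same way
# (a sketch; the iterator is re-created rather than reused):
it = itoken_parallel(movie_review$review[1:100], n_chunks = N_WORKERS)
v = create_vocabulary(it)
# release the implicit cluster registered above
if (require(doParallel)) stopImplicitCluster()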