create_vocabulary {text2vec}		R Documentation
Creates a vocabulary of unique terms. This function collects unique terms and corresponding statistics. See below for details.
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L)

vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L)

## S3 method for class 'character'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L)

## S3 method for class 'itoken'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L)

## S3 method for class 'itoken_parallel'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
it
	iterator over a list of character vectors. Each element of the list is treated as one document. For the character method, a character vector of predefined terms.

ngram
	integer vector. The lower and upper boundary of the range of n-values for the n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used.

stopwords
	character vector of stopwords to filter out. Note that stopwords are used as is: if the preprocessing function in itoken modifies the text (for example by stemming), the same preprocessing must be applied to the stopwords before passing them here.

sep_ngram
	character, a string used to concatenate the words of an ngram.

window_size
	integer, 0 by default. If window_size > 0, the vocabulary is built from pseudo-documents obtained by sliding a window of this length over each document.

...
	placeholder for additional arguments (not used at the moment).
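As an illustration of the ngram and stopwords arguments, a minimal sketch (the toy documents and stopword list are made up for this example; it assumes the text2vec package is installed):

```r
library(text2vec)

# Two tiny toy documents; "the" will be treated as a stopword.
tokens = list(c("the", "quick", "brown", "fox"),
              c("the", "lazy", "brown", "dog"))
it = itoken(tokens)

# Extract unigrams and bigrams; bigram parts are joined with "_".
v = create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L),
                      stopwords = "the", sep_ngram = "_")

# "the" is filtered out of the vocabulary, while adjacent words
# such as "brown" and "fox" also appear as the bigram "brown_fox".
print(v$term)
```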
text2vec_vocabulary object, which is actually a data.frame
with the following columns:

term
	character vector of unique terms.

term_count
	integer, the number of times each term occurs in the whole corpus.

doc_count
	integer, the number of documents that contain each term.
It also contains meta-information in attributes:
ngram: integer vector, the lower and upper boundary of the
range of n-gram-values.
document_count: integer, the number of documents the vocabulary was
built from.
stopwords: character vector of stopwords
sep_ngram: character separator for ngrams
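These attributes can be read back with attr(); a minimal sketch (assumes text2vec is installed):

```r
library(text2vec)

# Build a vocabulary from two tiny documents.
it = itoken(list(c("a", "b"), c("b", "c")))
v = create_vocabulary(it)

# Meta-information stored alongside the data.frame.
attr(v, "ngram")           # lower and upper n-gram boundaries
attr(v, "document_count")  # number of documents the vocabulary was built from
attr(v, "sep_ngram")       # separator used for ngrams
```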
character: creates text2vec_vocabulary from a predefined
character vector. Terms will be inserted as is, without any checks
(number of ngrams, ngram delimiters, etc.).
itoken: collects unique terms and corresponding statistics from an
itoken iterator.
itoken_parallel: collects unique terms and corresponding
statistics from an itoken_parallel iterator.
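The character method is handy when the set of terms is fixed in advance, for example to guarantee a stable term order across corpora; a minimal sketch (the terms are made up for this example; assumes text2vec is installed):

```r
library(text2vec)

# Terms are inserted as is, without any checks.
v = create_vocabulary(c("apple", "banana", "cherry"))
print(v$term)
```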
library(text2vec)
data("movie_review")
txt = movie_review[['review']][1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
vocab = create_vocabulary(it)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.8,
doc_proportion_min = 0.001, vocab_term_max = 20000)