create_tcm {text2vec}R Documentation

Term-co-occurence matrix construction

Description

This is a function for constructing a term-co-occurrence matrix(TCM). TCM matrix usually used with GloVe word embedding model.

Usage

create_tcm(it, vectorizer, skip_grams_window = 5L,
  skip_grams_window_context = c("symmetric", "right", "left"),
  weights = 1/seq_len(skip_grams_window), ...)

## S3 method for class 'itoken'
create_tcm(it, vectorizer, skip_grams_window = 5L,
  skip_grams_window_context = c("symmetric", "right", "left"),
  weights = 1/seq_len(skip_grams_window), ...)

## S3 method for class 'itoken_parallel'
create_tcm(it, vectorizer, skip_grams_window = 5L,
  skip_grams_window_context = c("symmetric", "right", "left"),
  weights = 1/seq_len(skip_grams_window), ...)

Arguments

it

list of iterators over tokens from itoken. Each element is a list of tokens, that is, tokenized and normalized strings.

vectorizer

function vectorizer function. See vectorizers.

skip_grams_window

integer window for term-co-occurence matrix construction. skip_grams_window should be > 0 if you plan to use vectorizer in create_tcm function. Value of 0L means to not construct the TCM.

skip_grams_window_context

one of c("symmetric", "right", "left") - which context words to use when count co-occurence statistics.

weights

weights for context/distant words during co-occurence statistics calculation. By default we are setting weight = 1 / distance_from_current_word. Should have length equal to skip_grams_window. "symmetric" by default - take into account skip_grams_window left and right.

...

arguments to foreach function which is used to iterate over it.

Details

If a parallel backend is registered, it will construct the TCM in multiple threads. The user should keep in mind that he/she should split data and provide a list of itoken iterators. Each element of it will be handled in a separate thread combined at the end of processing.

Value

dgTMatrix TCM matrix

See Also

itoken create_dtm

Examples

## Not run: 
data("movie_review")

# single thread

tokens = word_tokenizer(tolower(movie_review$review))
it = itoken(tokens)
v = create_vocabulary(jobs)
vectorizer = vocab_vectorizer(v)
tcm = create_tcm(itoken(tokens), vectorizer, skip_grams_window = 3L)

# parallel version

# set to number of cores on your machine
N_WORKERS = 1
if(require(doParallel)) registerDoParallel(N_WORKERS)
splits = split_into(movie_review$review, N_WORKERS)
jobs = lapply(splits, itoken, tolower, word_tokenizer)
v = create_vocabulary(jobs)
vectorizer = vocab_vectorizer(v)
jobs = lapply(splits, itoken, tolower, word_tokenizer)

tcm = create_tcm(jobs, vectorizer, skip_grams_window = 3L, skip_grams_window_context = "symmetric")

## End(Not run)

[Package text2vec version 0.5.1 Index]