ifiles {text2vec}	R Documentation
Creates an iterator over text files on disk
Description
The result of this function is usually passed to the itoken function.
Usage
ifiles(file_paths, reader = readLines)
idir(path, reader = readLines)
ifiles_parallel(file_paths, reader = readLines,
n_chunks = foreach::getDoParWorkers())
Arguments
file_paths
character vector of paths to input files
reader
function which reads text files from disk; it should take a file path as its first argument. The reader() function should return a named character vector: the elements of the vector are documents, and the names of the elements are document ids, which will be used in DTM construction. If the user doesn't provide a named character vector, document ids are generated as file_name + line_number (assuming that each line is a separate document).
path
character, path to a directory. All files in the directory will be read.
n_chunks
integer, defines how many chunks the files will be split into for processing. For example, if you have 32 files and n_chunks = 8, a job (for example, document-term matrix construction) is created for each group of 4 files. If a parallel backend is registered, each job is evaluated in a separate thread (or process), so the groups of files are processed in parallel and all 8 results are combined at the end.
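As a minimal sketch of the reader contract described above (names and id scheme here are illustrative assumptions, not part of the text2vec API): a custom reader takes a file path and returns a named character vector, with documents as elements and document ids as names.

```r
# Hypothetical custom reader: one document per line,
# with ids of the form "file_name.line_number".
read_with_ids = function(path) {
  lines = readLines(path, warn = FALSE)
  # assign document ids so the DTM rows are identifiable
  names(lines) = paste(basename(path), seq_along(lines), sep = ".")
  lines
}
# it could then be used as: ifiles(file_paths, reader = read_with_ids)
```

Any function with this signature works; ifiles() only requires that the return value be a (preferably named) character vector.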
See Also
itoken
Examples
## Not run:
current_dir_files = list.files(path = ".", full.names = TRUE)
files_iterator = ifiles(current_dir_files)
parallel_files_iterator = ifiles_parallel(current_dir_files, n_chunks = 4)
it = itoken_parallel(parallel_files_iterator)
dtm = create_dtm(it, hash_vectorizer(2**16), type = 'dgTMatrix')
## End(Not run)
dir_files_iterator = idir(path = ".")
[Package text2vec version 0.5.1 Index]