The PMI calculation in quanteda.textstats seems faster and more robust than the current widyr-based one, which can fail for very large matrices.
This requires turning the tidy data into a dfm with tidytext::cast_dfm(), potentially setting a dummy "value" of 1 or counting multiple occurrences of the same entity in a document, to match the expected format.
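A minimal sketch of both conversion options, assuming the tidy input has one row per entity mention. The column names doc_id and entity and the toy data are hypothetical, not the package's actual format; cast_dfm() needs quanteda installed.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Hypothetical tidy input: one row per entity mention in a document
entities <- tibble(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3"),
  entity = c("A", "B", "A", "A", "C", "B")
)

# Option 1: count repeated mentions of the same entity per document
entity_counts <- count(entities, doc_id, entity, name = "n")
dfm_counts <- cast_dfm(entity_counts, doc_id, entity, n)

# Option 2: dummy value of 1, i.e. presence/absence per document
entity_presence <- entities %>%
  distinct(doc_id, entity) %>%
  mutate(value = 1)
dfm_presence <- cast_dfm(entity_presence, doc_id, entity, value)
```

Which option is appropriate depends on whether repeated mentions within a document should carry extra weight in the downstream calculation.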
While at it, we should also implement native corpus/dfm support, including when simple co-occurrence counts are used rather than PMI.
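For reference, a base R sketch of how both quantities can be derived from a document-by-entity presence matrix. This is neither the widyr nor the quanteda.textstats implementation, whose exact definitions may differ; the toy matrix is hypothetical, and a real dfm would need a sparse equivalent of these steps.

```r
# Toy document-by-entity presence matrix (hypothetical data)
m <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    0, 1, 0,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("d", 1:4), c("A", "B", "C"))
)

# Simple co-occurrence counts: number of documents containing both entities
# (the diagonal holds each entity's document frequency)
cooc <- crossprod(m)

# Document-level PMI: log( p(a, b) / (p(a) * p(b)) )
n_docs <- nrow(m)
p_joint <- cooc / n_docs
p_single <- diag(cooc) / n_docs
pmi <- log(p_joint / outer(p_single, p_single))
diag(pmi) <- NA  # self-pairs are not meaningful edges
```

Pairs that never co-occur get a PMI of -Inf here, so they would have to be dropped or smoothed before building a network.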
This would only affect calculate_network() (and potentially some data checks so that the functions using it accept corpus/dfm objects).
If full corpora are supported, document that using all words for text network analysis is generally not recommended. Also check how well the method scales to large corpora.