The networks in this dataset can be loaded directly from graph-tool with:

```python
import graph_tool.all as gt
g = gt.collection.ns["bag_of_words/enron"]
```

(and likewise for the other networks available.)
Five text collections in the form of bags-of-words, i.e. a bipartite document–word network. Left nodes are documents and right nodes are words. Edge weights are multiplicities.
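
As a minimal sketch of what this representation means, the snippet below turns a few tokenized documents into weighted document–word edges, where each weight is the number of times the word occurs in that document. The function name `bipartite_edges` and the toy documents are illustrative assumptions and not part of the dataset or its tooling.

```python
from collections import Counter

def bipartite_edges(docs):
    """Return (doc_id, word, count) triples for a list of tokenized documents.

    Documents are the "left" nodes (integer ids), words are the "right"
    nodes, and count is the multiplicity of the word in the document.
    """
    edges = []
    for doc_id, tokens in enumerate(docs):
        # Counter collapses repeated tokens into one weighted edge per word.
        for word, count in sorted(Counter(tokens).items()):
            edges.append((doc_id, word, count))
    return edges

# Hypothetical toy corpus: two already-tokenized documents.
docs = [["graph", "node", "graph"], ["node", "edge"]]
edges = bipartite_edges(docs)
for edge in edges:
    print(edge)
```

Summing the weights over all edges recovers the total number of tokens in the corpus, which is a quick sanity check on any bag-of-words conversion.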
After tokenization and stopword removal, the vocabulary was truncated by keeping only words that occurred more than ten times. Individual document names (i.e., an identifier for each docID) are not provided for copyright reasons.
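
The preprocessing described above can be sketched roughly as follows. This is not the original pipeline: the whitespace tokenizer, the tiny stopword list, and the choice to count occurrences across the whole collection (rather than per document) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stopword list; the one used for the dataset is not specified.
STOPWORDS = {"the", "a", "of", "and"}

def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption, not the original)."""
    return text.lower().split()

def build_vocabulary(texts, min_count=10, stopwords=STOPWORDS):
    """Keep words that occur strictly more than min_count times overall."""
    totals = Counter()
    for text in texts:
        totals.update(t for t in tokenize(text) if t not in stopwords)
    return {word for word, n in totals.items() if n > min_count}

# Toy corpus: "graph" occurs 12 times in total, "rare" and "word" once each.
texts = ["the graph of the graph"] * 6 + ["a rare word"]
vocab = build_vocabulary(texts)
print(vocab)
```

Note the strict inequality: a word occurring exactly ten times would be dropped under this reading of "more than ten times".
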
Name | Nodes | Edges | $\left<k\right>$ | $\sigma_k$ | $\lambda_h$ | $\tau$ | $r$ | $c$ | $\oslash$ | $S$ | Kind | Mode | NPs | EPs | gt | GraphML | GML | csv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
enron | 67,963 | 3,710,420 | 109.19 | 287.73 | 646.11 | 3.65 | -0.23 | 0.00 | 6 | 1.00 | Undirected | Bipartite | is_word name | count | 5.5 MiB | 15.6 MiB | 13.0 MiB | 12.6 MiB |
kos | 10,336 | 353,160 | 68.34 | 92.79 | 198.42 | 2.74 | -0.06 | 0.00 | 4 | 1.00 | Undirected | Bipartite | is_word name | count | 650 KiB | 1.5 MiB | 1.3 MiB | 1.2 MiB |
nips | 13,919 | 746,316 | 107.24 | 184.69 | 412.66 | 1.26 | -0.15 | 0.00 | 6 | 1.00 | Undirected | Bipartite | is_word name | count | 1.2 MiB | 3.2 MiB | 2.7 MiB | 2.4 MiB |
nytimes | 402,660 | 69,679,427 | 346.10 | 1722.63 | 2197.60 | 10.95 | -0.26 | 0.00 | 7 | 1.00 | Undirected | Bipartite | is_word name | count | 94.5 MiB | 263.4 MiB | 207.6 MiB | 234.8 MiB |
pubmed | 8,341,043 | 483,450,157 | 115.92 | 3196.79 | 3405.63 | 2.29 | -0.16 | 0.00 | 5 | 1.00 | Undirected | Bipartite | is_word name | count | 859.3 MiB | 2.081 GiB | 1.749 GiB | 1.826 GiB |
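
As a quick consistency check on the table, and assuming the $\left<k\right>$ column is the mean degree of an undirected graph, each value should equal $2E/N$. The sketch below recomputes it from the Nodes and Edges columns; the dictionary layout and function name are illustrative only.

```python
# (nodes, edges, reported mean degree) copied from the table above.
stats = {
    "enron": (67_963, 3_710_420, 109.19),
    "kos": (10_336, 353_160, 68.34),
    "nips": (13_919, 746_316, 107.24),
    "nytimes": (402_660, 69_679_427, 346.10),
    "pubmed": (8_341_043, 483_450_157, 115.92),
}

def mean_degree(num_nodes, num_edges):
    """Mean degree of an undirected graph: every edge contributes to two nodes."""
    return 2 * num_edges / num_nodes

computed = {name: mean_degree(n, e) for name, (n, e, _) in stats.items()}
for name, (_, _, reported) in stats.items():
    print(f"{name}: computed {computed[name]:.2f}, reported {reported:.2f}")
```

All five recomputed values agree with the reported column to two decimal places.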