Netzschleuder network catalogue, repository and centrifuge

The networks in this dataset can be loaded directly from graph-tool with:
import graph_tool.all as gt
g = gt.collection.ns["bag_of_words/enron"]
(and likewise for the other networks available.)

bag_of_words — Bag of words (2008)

Description

Five text collections in bag-of-words form, each represented as a bipartite document–word network. Left nodes are documents and right nodes are words. Edge weights are multiplicities, i.e. the number of times a word occurs in a document.
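To make the representation concrete, here is a minimal sketch of how tokenized documents map onto weighted document–word edges. It uses only the Python standard library; the `bag_of_words_edges` helper and the toy documents are illustrative assumptions, not part of the dataset or of graph-tool.

```python
from collections import Counter


def bag_of_words_edges(docs):
    """Return (doc_id, word, count) triples for a list of token lists.

    Each triple is one edge of the bipartite network: the document is the
    left node, the word is the right node, and the count is the edge weight.
    """
    edges = []
    for doc_id, tokens in enumerate(docs):
        # Counter collapses repeated tokens into a single weighted edge.
        for word, count in sorted(Counter(tokens).items()):
            edges.append((doc_id, word, count))
    return edges


# Toy input (hypothetical): two already-tokenized documents.
docs = [["graph", "node", "graph"], ["node", "edge"]]
edges = bag_of_words_edges(docs)
```

In this toy case the word "graph" appears twice in document 0, so that edge carries weight 2, while all other edges carry weight 1.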

After tokenization and stopword removal, the vocabulary was truncated by keeping only words that occurred more than ten times. Individual document names (i.e. an identifier for each docID) are not provided for copyright reasons.
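The preprocessing described above can be sketched roughly as follows. This is not the original pipeline: the stopword list, the whitespace tokenizer, and the reading of "occurred more than ten times" as total occurrences across the whole collection are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stopword list; the one actually used is not given in the source.
STOPWORDS = {"the", "a", "of", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and keep purely alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]


def build_vocabulary(texts, min_count=10):
    """Keep words whose total count over all texts is strictly above min_count."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in STOPWORDS)
    # Assumption: the threshold is on total occurrences, not document frequency.
    return {word for word, count in counts.items() if count > min_count}


# Toy corpus (hypothetical): one sentence repeated 11 times plus a rare one.
vocab = build_vocabulary(["the graph of nodes"] * 11 + ["a rare word"])
```

Here "graph" and "nodes" each occur 11 times and survive the cut, while "rare" and "word" occur once and are dropped.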

Tags
Informational Text Bipartite Weighted Metadata
Citation
Upstream URL
http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
Networks
| Name | Nodes | Edges | $\left<k\right>$ | $\sigma_k$ | $\lambda_h$ | $\tau$ | $r$ | $c$ | $\oslash$ | $S$ | Kind | Mode | NPs | EPs | gt | GraphML | GML | csv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| enron | 67,963 | 3,710,420 | 109.19 | 287.73 | 646.11 | 3.65 | -0.23 | 0.00 | 6 | 1.00 | Undirected | Bipartite | is_word, name | count | 5.5 MiB | 15.6 MiB | 13.0 MiB | 12.6 MiB |
| kos | 10,336 | 353,160 | 68.34 | 92.79 | 198.42 | 2.74 | -0.06 | 0.00 | 4 | 1.00 | Undirected | Bipartite | is_word, name | count | 650 KiB | 1.5 MiB | 1.3 MiB | 1.2 MiB |
| nips | 13,919 | 746,316 | 107.24 | 184.69 | 412.66 | 1.26 | -0.15 | 0.00 | 6 | 1.00 | Undirected | Bipartite | is_word, name | count | 1.2 MiB | 3.2 MiB | 2.7 MiB | 2.4 MiB |
| nytimes | 402,660 | 69,679,427 | 346.10 | 1722.63 | 2197.60 | 10.95 | -0.26 | 0.00 | 7 | 1.00 | Undirected | Bipartite | is_word, name | count | 94.5 MiB | 263.4 MiB | 207.6 MiB | 234.8 MiB |
| pubmed | 8,341,043 | 483,450,157 | 115.92 | 3196.79 | 3405.63 | 2.29 | -0.16 | 0.00 | 5 | 1.00 | Undirected | Bipartite | is_word, name | count | 859.3 MiB | 2.081 GiB | 1.749 GiB | 1.826 GiB |
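
As a quick consistency check on the table, and assuming $\left<k\right>$ denotes the mean degree of an undirected network, each reported value should equal $2E/N$, where $N$ and $E$ are the Nodes and Edges columns. The sketch below recomputes it; the `mean_degree` helper is illustrative and not part of graph-tool.

```python
# (nodes, edges) pairs copied from the Nodes and Edges columns of the table.
NETWORKS = {
    "enron": (67_963, 3_710_420),
    "kos": (10_336, 353_160),
    "nips": (13_919, 746_316),
    "nytimes": (402_660, 69_679_427),
    "pubmed": (8_341_043, 483_450_157),
}


def mean_degree(n_nodes, n_edges):
    """Mean degree of an undirected graph: every edge adds one to two degrees."""
    return 2 * n_edges / n_nodes


# Round to two decimals to match the precision used in the table.
recomputed = {
    name: round(mean_degree(n, e), 2) for name, (n, e) in NETWORKS.items()
}
```

The recomputed values agree with the $\left<k\right>$ column to two decimals for all five networks.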