Netzschleuder network catalogue, repository and centrifuge

The networks in this dataset can be loaded directly from graph-tool with:
import graph_tool.all as gt
g = gt.collection.ns["bag_of_words/enron"]
(and likewise for the other networks available.)

bag_of_words — Bag of words (2008)

Description

Five text collections in the form of bags-of-words, i.e. a bipartite document–word network. Left nodes are documents and right nodes are words. Edge weights are multiplicities, i.e. the number of times a word occurs in a document.
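
As a rough illustration of this representation, the sketch below turns a few toy documents into a weighted bipartite edge list of (document, word, multiplicity) triples. The toy corpus, the whitespace tokenizer and the `bipartite_edges` helper are assumptions made for illustration only; they are not part of the dataset or of graph-tool.

```python
from collections import Counter


def bipartite_edges(docs):
    """Return (doc_id, word, count) triples for a list of token lists.

    Each triple is one weighted edge between a document node and a word node.
    """
    edges = []
    for doc_id, tokens in enumerate(docs):
        # Counter gives the multiplicity of every word in this document.
        for word, count in sorted(Counter(tokens).items()):
            edges.append((doc_id, word, count))
    return edges


# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "energy market energy".split(),
    "market report".split(),
]

edges = bipartite_edges(docs)
for doc_id, word, count in edges:
    print(doc_id, word, count)
```

Here document 0 is linked to "energy" with weight 2, which is the kind of multiplicity the edge weights record.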

After tokenization and removal of stopwords, the vocabulary was truncated to keep only words that occurred more than ten times. Individual document names (i.e. an identifier for each docID) are not provided for copyright reasons.
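
The preprocessing described above could look roughly like the following sketch. It is not the procedure actually used to build the dataset: the stopword list, the lowercase whitespace tokenizer and the `build_vocabulary` helper are hypothetical, and the sketch assumes that "occurred more than ten times" refers to total occurrences across the whole collection, which the description does not state explicitly.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "of", "and", "a"}


def build_vocabulary(docs, min_count=10):
    """Return the sorted words whose total count exceeds min_count.

    Assumption: occurrences are summed over all documents.
    """
    totals = Counter()
    for text in docs:
        # Tokenize on whitespace and drop stopwords.
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        totals.update(tokens)
    return sorted(word for word, n in totals.items() if n > min_count)


# Toy corpus: "gas" occurs 12 times in total, "price" 6 times.
docs = ["the price of gas and gas"] * 6 + ["a rare word"]
vocab = build_vocabulary(docs)
print(vocab)
```

With the default threshold only "gas" survives in this toy corpus; lowering `min_count` to 5 would also keep "price".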

Tags

Informational, Text, Bipartite, Weighted, Metadata

Citation

David Newman, "Bag of Words Data Set", http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml

Upstream URL

http://archive.ics.uci.edu/ml/datasets/Bag+of+Words

Networks

Name	Nodes	Edges	$\left<k\right>$	$\sigma_k$	$\lambda_h$	$\tau$	$r$	$\oslash$	$S$	Kind	Mode	NPs	EPs	gt	GraphML	GML	csv
enron	67,963	3,710,420	109.19	287.73	646.11	3.65	-0.23	6	1.00	Undirected	Bipartite	is_word name	count	5.5 MiB	15.6 MiB	13.0 MiB	12.6 MiB
kos	10,336	353,160	68.34	92.79	198.42	2.74	-0.06	4	1.00	Undirected	Bipartite	is_word name	count	650 KiB	1.5 MiB	1.3 MiB	1.2 MiB
nips	13,919	746,316	107.24	184.69	412.66	1.26	-0.15	6	1.00	Undirected	Bipartite	is_word name	count	1.2 MiB	3.2 MiB	2.7 MiB	2.4 MiB
nytimes	402,660	69,679,427	346.10	1722.63	2197.60	10.95	-0.26	7	1.00	Undirected	Bipartite	is_word name	count	94.5 MiB	263.4 MiB	207.6 MiB	234.8 MiB
pubmed	8,341,043	483,450,157	115.92	3196.79	3405.63	2.29	-0.16	5	1.00	Undirected	Bipartite	is_word name	count	859.3 MiB	2.081 GiB	1.749 GiB	1.826 GiB
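
As a quick sanity check on the table, the sketch below recomputes the average degree from the node and edge counts. It assumes that the $\left<k\right>$ column is the mean degree of an undirected graph, 2E/N, which is not stated in the table itself; the `average_degree` helper is introduced here for illustration.

```python
# (nodes, edges, reported average degree), copied from the table above.
networks = {
    "enron": (67_963, 3_710_420, 109.19),
    "kos": (10_336, 353_160, 68.34),
    "nips": (13_919, 746_316, 107.24),
    "nytimes": (402_660, 69_679_427, 346.10),
    "pubmed": (8_341_043, 483_450_157, 115.92),
}


def average_degree(nodes, edges):
    """Mean degree of an undirected graph: every edge has two endpoints."""
    return 2 * edges / nodes


for name, (n, e, reported) in networks.items():
    k = average_degree(n, e)
    print(f"{name}: computed {k:.2f}, reported {reported:.2f}")
```

Under that assumption the computed values agree with the reported column to two decimal places for all five networks.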