The network in this dataset can be loaded directly from graph-tool with:

    import graph_tool.all as gt
    g = gt.collection.ns["trec"]
A bipartite network of documents and the words they contain, extracted from NIST's Text Retrieval Conference (TREC) disks 4 and 5, from 2010. These archives contain material drawn from the Financial Times Ltd., the Congressional Record of the 103rd Congress, the Federal Register, the Foreign Broadcast Information Service, and the Los Angeles Times newspaper. [1]
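
To make the document-word structure concrete, here is a minimal sketch of how such a bipartite edge list could be built from raw text. The toy corpus, the `bipartite_edges` helper, the whitespace tokenization, and the use of word counts as edge weights are all assumptions for illustration; the page does not describe the actual extraction pipeline.

```python
from collections import Counter

# Hypothetical toy corpus; the real dataset is built from TREC disks 4 and 5.
docs = {
    "doc0": "the senate passed the bill",
    "doc1": "the bill failed",
}

def bipartite_edges(docs):
    """Return (document, word, count) triples, one per distinct word per document.

    Assumption: words are split on whitespace and the edge weight is the
    number of times the word occurs in the document.
    """
    edges = []
    for doc_id, text in docs.items():
        for word, count in sorted(Counter(text.split()).items()):
            edges.append((doc_id, word, count))
    return edges

edges = bipartite_edges(docs)

# The two node sets of the bipartite graph: documents on one side, words on the other.
doc_nodes = set(docs)
word_nodes = {word for _, word, _ in edges}
```

Every edge joins a document node to a word node, never two nodes of the same type, which is what makes the network bipartite.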
Name | Nodes | Edges | $\left<k\right>$ | $\sigma_k$ | $\lambda_h$ | $\tau$ | $r$ | $c$ | $\oslash$ | $S$ | Kind | Mode | NPs | EPs | gt | GraphML | GML | csv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
trec | 1,729,302 | 83,629,405 | 96.72 | 1358.95 | 2935.32 | 13.41 | -0.21 | 0.00 | 7 | 1.00 | Undirected | Bipartite | | weight | 152.7 MiB | 426.6 MiB | 405.9 MiB | 349.4 MiB |
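
As a quick consistency check on the table, the mean degree $\left<k\right>$ of an undirected graph should equal twice the edge count divided by the node count. The short sketch below recomputes it from the Nodes and Edges columns; the variable names are illustrative only.

```python
# Values copied from the "trec" row of the table.
nodes = 1_729_302
edges = 83_629_405

# In an undirected graph each edge adds one to the degree of two nodes,
# so the mean degree is 2E / N.
mean_degree = 2 * edges / nodes

print(round(mean_degree, 2))  # 96.72, matching the <k> column
```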