Data Sets

Data sets collected and used in various research projects. All data available here is in the public domain.

  1. Power-law distributions (part 1): 24 univariate quantities that exhibit a heavy-tailed pattern. Quantities drawn from language, cellular biology, communication networks, the World Wide Web, human conflict, ecology, human infrastructure, natural disasters, and various social or economic phenomena.

    Citation: Varies by data set.

  2. Power-law distributions (part 2): 12 univariate quantities that exhibit a heavy-tailed pattern, most of which are reported as binned data. Quantities drawn from human conflict, plant physiology, biomedicine, natural disasters, glaciology, and human infrastructure.

    Citation: Varies by data set.

  3. Food web of grassland species (network and vertex labels), where vertices are species (herbivores, parasites, etc.) and edges indicate predation.

    Citation: H.A. Dawah, B.A. Hawkins and M.F. Claridge, "Structure of the parasitoid communities of grass-feeding chalcid wasps." Journal of Animal Ecology 64, 708-720 (1995).

  4. Terrorist associations for 9/11 attacks (network, vertex labels, and vertex names), where vertices are individuals associated with the 9/11 terrorist attacks, and edges indicate social associations.

    Citation: V. Krebs, "Mapping networks of terrorist cells." Connections 24, 43-52 (2002).

  5. NFL 2009 league network (weighted network and vertex labels), where vertices are NFL teams in 2009, the presence of an edge indicates that a game was played, and edges are weighted by the mean score difference across all such games.

    Citation: C. Aicher, A.Z. Jacobs and A. Clauset, "Learning latent block structure in weighted networks." Journal of Complex Networks 3(2), 221-248 (2015).

  6. Body masses of extant whale species (table, xlsx), where each line is a measurement, with taxonomic information and source reference.

    Citation: A. Clauset, "How large should whales be?" PLOS ONE 8(1), e53967 (2013).

  7. Sizes of terrorist events worldwide, 1968-2008 (events list), where each line is an event with its date, severity, and a few covariates.

    Citation: A. Clauset and R. Woodard, "Estimating the historical and future probabilities of large terrorist events." Annals of Applied Statistics 7(4), 1838-1865 (2013).

  8. Faculty hiring networks (networks and vertex metadata), for 205 Computer Science departments in North America, 112 Business schools in the US, and 144 History departments in the US, representing about 19,000 faculty.

    Citation: A. Clauset, S. Arbesman, and D. B. Larremore, "Systematic inequality and hierarchy in faculty hiring networks." Science Advances 1(1), e1400005 (2015).

  9. Golden Age of Hollywood actor collaborations (networks and vertex names), for 55 actors who were particularly active from 1930 to 1959. This is a directed, weighted, temporal network spanning 1909 to 2009, aggregated at the level of decades.

    Citation: D. Taylor, S. A. Myers, A. Clauset, M. A. Porter, and P. J. Mucha, "Eigenvector-Based Centrality Measures for Temporal Networks." Preprint, arXiv:1507.01266 (2015).
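
The weighted-network format described for the NFL data set (item 5) can be illustrated with a minimal Python sketch. It is not the code used to build the distributed file. The function name build_weighted_network, the input layout (team_a, team_b, score_a, score_b), and the scores are hypothetical, and taking the absolute value of the score difference is an assumption, since the description above does not say whether the difference is signed.

```python
from collections import defaultdict


def build_weighted_network(games):
    """Return {(team_x, team_y): mean absolute score difference}.

    `games` is an iterable of (team_a, team_b, score_a, score_b) tuples.
    An edge exists only for pairs of teams that played at least one game.
    """
    diffs = defaultdict(list)
    for team_a, team_b, score_a, score_b in games:
        # Sort the pair so that A-vs-B and B-vs-A map to one undirected edge.
        edge = tuple(sorted((team_a, team_b)))
        # Assumption: the weight uses the absolute score difference.
        diffs[edge].append(abs(score_a - score_b))
    # Average the differences over all games played by each pair.
    return {edge: sum(d) / len(d) for edge, d in diffs.items()}


# Made-up scores for illustration only (not real 2009 results).
games = [
    ("A", "B", 21, 14),
    ("B", "A", 10, 13),
    ("A", "C", 30, 10),
]
network = build_weighted_network(games)
```

Teams B and C never play each other in this toy input, so no edge appears between them, which mirrors the rule that the presence of an edge indicates that a game was played.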

Code

Open-source implementations of algorithms and models developed by our research group and close collaborators.

  1. Blockmodel Entropy Significance Test (BESTest) and the neoSBM, for characterizing and exploring the relationship between node metadata and network structure. In Matlab and Python, respectively, via Dan Larremore and Leto Peel.

    Citation: L. Peel, D. B. Larremore, and A. Clauset, "The ground truth about metadata and community detection in networks." Science Advances 3(5), e1602548 (2017).

  2. Generalized hierarchical random graph (GHRG) model and change-point detection toolkit for time-evolving networks. Python code (via Leto Peel).

    Citation: L. Peel and A. Clauset, "Detecting change points in the large-scale structure of evolving networks." Proc. AAAI, 2914-2920 (2015).

  3. Bipartite stochastic block model (biSBM) for extracting the communities within a bipartite network, from 2014. Matlab code (via Dan Larremore).

    Citation: D.B. Larremore, A. Clauset and A.Z. Jacobs, "Efficiently inferring community structure in bipartite networks." Phys. Rev. E 90, 012805 (2014).

  4. Weighted stochastic block model (WSBM) for extracting the communities within a weighted network, from 2014. Matlab code.

    Citation: C. Aicher, A.Z. Jacobs and A. Clauset, "Learning latent block structure in weighted networks." Journal of Complex Networks 3(2), 221-248 (2015).

  5. Toolkit for estimating the probability of rare events in heavy-tailed distributions, from 2013. Matlab code.

    Citation: A. Clauset and R. Woodard, "Estimating the historical and future probabilities of large terrorist events." Annals of Applied Statistics 7(4), 1838-1865 (2013).

  6. Toolkit for estimating the rugged shape of the modularity function for a particular network, via simulated annealing and a low-dimensional projection, from 2010. Python code.

    Citation: B.H. Good, Y.-A. de Montjoye and A. Clauset, "The performance of modularity maximization in practical contexts." Physical Review E 81, 046106 (2010).

  7. Toolkit for fitting, testing, and comparing power-law distributions in empirical data, from 2009. Matlab and R code.

    Citation: A. Clauset, C. R. Shalizi and M.E.J. Newman, "Power-law distributions in empirical data." SIAM Review 51(4), 661-703 (2009).

  8. Hierarchical random graphs (HRG) model for extracting hierarchical group structure from networks, from 2008. Can also generate networks with hierarchical structure, and use a fitted model to predict missing links. C/C++ code.

    Citation: A. Clauset, C. Moore and M.E.J. Newman, "Hierarchical structure and the prediction of missing links in networks." Nature 453, 98-101 (2008).

  9. Local community detection algorithm, based on optimizing local modularity, from 2005. C/C++ code.

    Citation: A. Clauset, "Finding local community structure in networks." Phys. Rev. E 72, 026132 (2005).

  10. Clauset-Newman-Moore (CNM) "fast modularity" community detection algorithm, from 2004. C/C++ code.

    Citation: A. Clauset, M.E.J. Newman and C. Moore, "Finding community structure in very large networks." Phys. Rev. E 70, 066111 (2004).
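
Several entries above (items 6 and 10) revolve around the modularity function, which scores a division of a network into communities. As a rough illustration of the quantity itself, and not of the C/C++ or Python implementations listed here, the following Python sketch computes the standard modularity Q of a given partition for an undirected, unweighted network without self-loops. The function name, the edge-list input format, and the toy graph are assumptions made for this example.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    `edges` is a list of (u, v) pairs with no self-loops or duplicates;
    `community_of` maps each vertex to its community label.
    Q = sum over communities c of [ L_c / m - (d_c / (2m))^2 ], where
    m is the number of edges, L_c the number of edges inside c, and
    d_c the total degree of the vertices in c.
    """
    m = len(edges)
    internal = defaultdict(int)    # L_c: edges with both ends in c
    degree_sum = defaultdict(int)  # d_c: summed degree of vertices in c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy graph: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}

q_two = modularity(edges, two_groups)  # 5/14, roughly 0.357
q_one = modularity(edges, one_group)   # 0.0
```

Splitting the toy graph into its two triangles scores higher than placing every vertex in one community. A greedy agglomerative method such as CNM repeatedly merges the pair of communities that most increases this score.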