Inference, Models and Simulation for Complex Systems
CSCI 7000/4830, Fall 2011
Time: Tuesday and Thursday, 11:00am - 12:15pm
Room: ECCR 131
Instructor: Aaron Clauset
Office: ECOT 743
Office hours: Tuesday, 1:30-3:00pm
Email: look it up
Computers are fundamentally changing how science is done by allowing us to record enormous amounts of data on social, biological, technological and physical systems. This data glut should allow us both to address old questions about complex systems in new ways and to answer fundamentally new kinds of questions. But, our ability to do this relies on the development of new algorithms to automatically extract information about patterns in these huge data sets and to reliably test complex scientific hypotheses. Examples include understanding the evolution of the Internet (at various levels), social behavior on the Web (Facebook, Wikipedia, Twitter, etc.), the role of networks in structuring complex social behaviors and biological functions, the emergence of population-scale patterns from seemingly random individual-level behavior, and forecasting rare but important events in the future.
This graduate-level topics course will cover a selection of recent developments in computational approaches to doing science with complex systems. It is not a scientific computing course. Topics will include statistical inference, the structure of complex networks, macro-phenomena in biological evolution and in wars and terrorism, simple mathematical models, and simulation techniques for more complicated models. The focus will be on using computational tools (algorithms) to do science (work with data; test hypotheses; build understanding; make predictions).
Required courses: CSCI 3104 (undergraduate algorithms), two semesters of calculus and basic statistics and probability theory
Recommended courses: CSCI 5454 (graduate algorithms), APPM 3050 (Scientific Computing in Matlab), APPM 3570 (Applied Probability) or APPM 4570 (Statistical Methods)
Note: No biological or social science background is necessary. A quantitative background is mandatory. The concepts and techniques covered in this course depend heavily on basic statistics, scientific programming (Matlab and Mathematica) and calculus (integration and differentiation). Students without sufficient preparation in these areas will struggle to keep up with the lectures and especially with the assignments, each of which has a mathematical and a programming component. Students concerned about their preparation are encouraged to audit the course.
Required (available at the CU Bookstore):
1. Networks: An Introduction by M.E.J. Newman
2. The Sciences of the Artificial (3rd ed.) by H.A. Simon
3. All of Statistics by L. Wasserman
4. Numerical Recipes are recommended. the first two of these are being held on reserve for this course at the CU Library.
Course work and grading
Attendance to the lectures is required.
The first half of the course will be lecture and problem-set driven. The second half of the class will revolve around crowd-source (student) lectures on advanced topics and a small independent research project. The lecture and project are the major deliverables for the class, and students are expected to spend considerable time outside of class preparing them. There are no written examinations in this class; the lecture and project should be viewed as an oral and a final examination. Many lectures will have an accompanying reading assignment from the required texts or a reading from the primary literature. I'll post the latter readings on the class website. Additional detail on the crowd-sourced lectures is given in the syllabus.
Problem sets: The four problem sets will each have some mathematical and some programming problems. Programming problems can be done in any reasonable imperative language, but I highly recommend using Matlab, or something similar, which has data analysis and visualization capabilities built-in. Familiarity with Mathematica will be useful for many of the mathematical problems.
Problem sets will be due roughly two weeks after they are assigned (schedule given below). Solutions must be in PDF format (e.g., typeset using LaTeX), should include all necessary details for me to follow the logic, and must be submitted via email by 11:59pm the day they are due. No late assignments will be accepted. Collaboration is allowed on the problem sets, but you may not copy (in any way) from your collaborators and you must respect University academic policies at all times. To be clear: you may discuss the problems verbally, but you must write up your solutions separately. If you do discuss the problems with someone (and you are encouraged to!), you must then list and describe the extent of your collaboration in your solutions (a footnote is fine). Copying from any source, in any way, including the Web but especially from another student (past or present), is strictly forbidden. If you are unsure about whether something is permitted under these rules, ask me well before the deadline.
Crowd-source lecture: Students will self-organize into teams of two. Each team will then give a full-length (75 minute) technical lecture that (i) frames a scientific question and motivates an algorithm or model to address it, (ii) describes the algorithm or model in appropriate technical detail (at least at the pseudocode level), (iii) describes its performance or results, and (iv) describes interesting extensions, open question, problems, etc. Students should notify me via email of their team composition by September 13. Each team should then notify me via email by September 15 of their preferences on lecture topics in the form of a complete ranking of the topics listed below. I will match teams to topics using a matching algorithm. (The number of topics and the precise schedule may change slightly during the semester depending on the number of students in the class, etc.)
Independent project: The assignment here is to develop and complete a small research-style project based around the tools and concepts developed in the course. This independent project should focus on either inference, modeling or simulation. Students will work independently, but are strongly encouraged to get feedback from each other on their ideas. The deliverables here are a short (20 minute) in-class presentation of their results and a 10-page writeup due via email to me by 11:00am on December 15.
A short project proposal is due via email on October 13. The proposal should be two paragraphs: one explaining any background material, including necessary references to the scientific literature, and a second describing what you're going to do. When choosing a topic, students should be cognizant of the need to make reasonable progress in the 9 weeks remaining in the semester. To help facilitate this, students are strongly encouraged to meet with me outside of class prior to October 13 to discuss potential project topics.
Grading Grades will be assigned based on a 20% attendance, 30% problem sets, 20% lecture and 30% project division. Undergraduates taking CSCI 4830: Instead of doing both the lecture and the project, undergraduate students may choose to do one or the other. Grades will be assigned based on a 30% attendance, 30% problem sets, and 40% lecture/project division.
Week 1: Overview, probability distributions, likelihood functions, the Poisson process (Lecture 0 and Lecture 1)
Week 2: Power-law distributions and power-law tails (Lecture 2 and Lecture 3)
Week 3: Model plausibility, model comparison, and two fallacies (Lecture 4 and Lecture 5)
Week 4: Time series analysis, random walks (Lecture 6 and Lecture 7)
Week 5: Random walk models of macroevolution (Lectures 8 and 9)
Week 6: Probabilistic models of terrorism and wars (Lecture 10)
Week 7: Measures of network structure, algorithms (Lecture 12a and Lecture 12b)
Week 8: Random graph models, citation networks (Lecture 13 and Lecture 14)
Week 9: Modular and hierarchical structure, missing information, etc. (Lecture 15 and Lecture 16)
Week 10: (CSL 1) The bootstrap and (2) duplication-mutation models
Week 11: (3) Privacy and de-anonymizing human data and (4) Levy flights
Week 12: (5) Regression and (6) metabolic networks
Week 13: (7) Gaussian mixture models and (8) People are not? particles
Week 14: Fall break
Week 15: Project presentations
Week 16: Project presentations
data set 1A:
the frequencies of occurrence of US family names in the 1990 US Census.
(From A. Clauset, C.R. Shalizi and M.E.J. Newman (2009) SIAM Review 51, 661.)
data set 1B:
the intensity of wars from 1816-1980 measured as the number of battle deaths per
10,000 of the combined populations of the warring nations.
(From M. Small and J.D. Singer (1982) Sage Publications).
data set 1C:
the aggregate net worth in US dollars of the richest individuals in the United States
in October 2003.
(From M.E.J. Newman (2005) Contemporary Physics 46, 323).
data set 1D:
the numbers of adherents of religious denominations, bodies, and sects, as compiled and
published on the web site adherents.com.
(From A. Clauset, C.R. Shalizi and M.E.J. Newman (2009) SIAM Review 51, 661.)
Problem set 2 (assigned September 16; due September 27)
Problem set 3 (assigned September 30; due October 11)
Problem set 4 (assigned October 11; due October 25)
- L. Breiman, "Statistical Modeling: The Two Cultures." Statistical Science 16, 199-231 (2001).
- R.D. Malmgren, D.B. Stouffer, A.E. Motter and L.A.N. Amaral, "A Poissonian explanation for heavy tails in e-mail communication." PNAS 105, 18153-18158 (2008).
- M.E.J. Newman, "Power laws, Pareto distributions and Zipf's law." Contemporary Physics 46(5), 323-351 (2005).
- M. Mitzenmacher, "A Brief History of Generative Models for Power Law and Lognormal Distributions." Internet Mathematics 1(2), 226-251 (2004).
- A. Clauset, C.R. Shalizi and M.E.J. Newman, "Power-law distributions in empirical data." SIAM Review 51(4), 661-703 (2009).
- S. Goel, A. Broder, E. Gabrilovich and B. Pang, "Anatomy of the Long Tail: Ordinary People with Extraordinary Tastes." Proc. 3rd ACM Workshop on Web Search and Data Mining, New York City (2010).
- A. Gelman and C.R. Shalizi, "Philosophy and the practice of Bayesian statistics." Preprint (2010).
- D.G. Mayo, Error and the Growth of Experimental Knowledge. Chicago Press (1996).
- S. Redner, "Random multiplicative processes: An elementary tutorial." American Journal of Physics 58(3), 267-273 (1990).
- A. Gabel and S. Redner, "Random Walk Picture of Basketball ." arxiv:1109.2825 (2011).
- J. Alroy, "Understanding the dynamics of trends within evolving lineages." Paleobiology 26(3), 319-329 (2000).
- G. Hunt, "Evolution in Fossil Lineages: Paleontology and The Origin of Species." The American Naturalist 176, Supplement S61-S75 (2010).
- A. Clauset and D.H. Erwin, "The Evolution and Distribution of Species Body Size." Science 321, 399-401 (2008).
- A. Clauset, M. Young and K.S. Gleditsch, "On the Frequency of Severe Terrorist Events." Journal of Conflict Resolution 51, 58-87 (2007).
- A. Clauset and K.S. Gleditsch, "The developmental dynamics of terrorist organizations." Preprint, arxiv:0906.3287 (2011).
Weeks 7 and 8:
- M.E.J. Newman, "The structure and function of complex networks." SIAM Review 45, 167-256 (2003).
- S.N. Dorogovtsev and J.F.F. Mendes, "Evolution of networks." Adv. Phys. 51, 1079 (2002).
- M.E.J. Newman, S.H. Strogatz and D.J. Watts, "Random graphs with arbitrary degree distributions and their applications." Physical Review E 64, 026118 (2001).
- D.S. Callaway, J.E. Hopcroft, J.M. Kleinberg, M.E.J. Newman and S.H. Strogatz, "Are randomly grown graphs really random?." Physical Review E 64, 041902 (2001).
- M.E.J. Newman, "The first-mover advantage in scientific publication." European Physics Letters 86, 68001 (2009).
- S. Redner, "Citation statistics from 110 years of Physical Review." Physics Today 58, 49-54 (2005).
LaTeX (general) and TeXShop (Mac)
Matlab license for CU staff (includes student employees)
Mathematica license for CU students
NumPy/SciPy libraries for Python (similar to Matlab)
GNU Octave (similar to Matlab)
Wolfram Alpha (Web interface for simple integration and differentiation)
Machine Learning, Statistical Inference and Induction Notebook (by Cosma Shalizi)
Power Law distributions, etc. Notebook (by Cosma Shalizi)
Some Advice on Process for [Research Projects]
Cytoscape, network visualization software
yEd Graph Editor, network visualization software
Graphviz, network visualization software