Datasets to Analyze
- http://tangra.si.umich.edu/clair/anthology
- http://www.theyrule.net/2004/tr2.php
- http://www.orgnet.com/SN.html
- Datasets mentioned in Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations, Jure Leskovec, Jon Kleinberg and Christos Faloutsos. ACM August 2005.
Datasets: arXiv, Patents, Autonomous Systems, Affiliation network
- Patent Data: http://iv.slis.indiana.edu/db/patents.html
- arXiv Citation Data: http://www.cs.cornell.edu/projects/kddcup/datasets.html
- Enron Email Dataset: http://www.cs.cmu.edu/~enron/
- World-Wide-Web, Actor, Cellular, and Protein Interaction Data: http://www.nd.edu/~networks/resources.htm
- URV Email, Jazz musicians, PGP users, and C. Elegans Metabolic data: http://deim.urv.cat/~aarenas/data/welcome.htm
- Zachary's karate club http://www-personal.umich.edu/~mejn/netdata/karate.zip
- Les Miserables characters http://www-personal.umich.edu/~mejn/netdata/lesmis.zip
- adjectives/nouns from David Copperfield http://www-personal.umich.edu/~mejn/netdata/adjnoun.zip
- American College Football http://www-personal.umich.edu/~mejn/netdata/football.zip
- Dolphin Social Network http://www-personal.umich.edu/~mejn/netdata/dolphins.zip
- Political Blogs http://www-personal.umich.edu/~mejn/netdata/polblogs.zi
- Books about US politics http://www-personal.umich.edu/~mejn/netdata/polbooks.zip
- Power Grid http://www-personal.umich.edu/~mejn/netdata/power.zip
- Neural Networks of C. Elegans http://www-personal.umich.edu/~mejn/netdata/celegansneural.zip
- Condensed Matter Collaborations 2003 http://www-personal.umich.edu/~mejn/netdata/cond-mat.zip
- Condensed Matter Collaborations 2005 http://www-personal.umich.edu/~mejn/netdata/cond-mat-2005.zip
- Astrophysics Collaborations http://www-personal.umich.edu/~mejn/netdata/astro-ph.zip
- High-energy theory collaborations http://www-personal.umich.edu/~mejn/netdata/hep-th.zip
- Coauthorships in network science http://www-personal.umich.edu/~mejn/netdata/netscience.zip
- Internet http://www-personal.umich.edu/~mejn/netdata/as-22july06.zip
- languages http://www.weizmann.ac.il/mcb/UriAlon/Papers/networkMotifs/darwinBookInter_st.txt, http://www.weizmann.ac.il/mcb/UriAlon/Papers/networkMotifs/frenchBookInter_st.txt, http://www.weizmann.ac.il/mcb/UriAlon/Papers/networkMotifs/spanishBookInter_st.txt, http://www.weizmann.ac.il/mcb/UriAlon/Papers/networkMotifs/japaneseBookInter_st.txt
- power grid http://cdg.columbia.edu/uploads/datasets/power_unweighted
- citations http://www.cs.cornell.edu/projects/kddcup/download/hep-th-citations.tar.gz
- comic book characters http://bioinfo.uib.es/~joemiro/marvel/porgat.txt
- prostate cancer http://tangra.si.umich.edu/clair/allnets/pcancer/pcancer_sample.net
- Subset of the Maple Blog collection http://tangra.si.umich.edu/clair/allnets/R1000/R1000-small.net
- Eurovision 1980 http://tangra.si.umich.edu/clair/allnets/eurovision1980/eurovision-1980-small.net
- blog title http://tangra.si.umich.edu/clair/allnets/blogtitles/cheney_all-small.net
- Romanian network (Contact Prof. Radev to receive data)
- Romanian DGA (Contact Prof. Radev to receive data)
- synonyms of all nouns in WordNet (Contact Prof. Radev to receive data)
- kzoo.edu crawl (Contact Prof. Radev to receive data)
- 1000 docs from blogger blogs (Contact Prof. Radev to receive data)
- ACL citation network (Contact Prof. Radev to receive data)
- ACL_Anthology Collection of ACL papers in PDF format. No intrinsic link structure (Contact Prof. Radev to receive data)
- ACL_Anthology1415 A random subset of the ACL anthology (Contact Prof. Radev to receive data)
- AOL collocation graph (Xiaodong Shi) (Contact Prof. Radev to receive data)
- WWW crawl used in Barabasi-Albert paper (Contact Prof. Radev to receive data)
- First 1000 results from Google query for "bulgaria" (Contact Prof. Radev to receive data)
- Cellular network data from The large-scale organization of metabolic networks (Contact Prof. Radev to receive data)
- A web corpus of many .gov sites (Contact Prof. Radev to receive data)
- DUC04 document sample (Contact Prof. Radev to receive data)
- free word association (Contact Prof. Radev to receive data)
- Internet Movie Database (which actors have starred in what movies) (Contact Prof. Radev to receive data)
- 11-sentence sample from LexRank paper (Contact Prof. Radev to receive data)
- Metabolic network data from The large-scale organization of metabolic networks (Contact Prof. Radev to receive data)
- Internet routing data from The National Laboratory for Applied Network Research (NLANR) Project (Contact Prof. Radev to receive data)
- Protein interaction network used in Centrality and lethality of protein networks (Contact Prof. Radev to receive data)
- 200, 500, or 1000 document sample from Maple blog collection (Contact Prof. Radev to receive data)
- Dependency network from Romanian newspapers used in Cancho paper (Contact Prof. Radev to receive data)
- Crawl of University of Michigan (Contact Prof. Radev to receive data)
- Crawl of Washtenaw Community College (Contact Prof. Radev to receive data)
- Wordnet synonym network for nouns (Contact Prof. Radev to receive data)
- OpenPGP Web of Trust (Contact Prof. Radev to receive data)
- http://www.adrian.edu/ (Contact Prof. Radev to receive data)
- Collection of .gov crawls (Contact Prof. Radev to receive data)
- Electrical Engineering and Computer Science at Michigan crawl (http://www.eecs.umich.edu/) (Contact Prof. Radev to receive data)
- Eastern Michigan University Physics Department crawl (Contact Prof. Radev to receive data)
- Assorted news snippets (Contact Prof. Radev to receive data)
- http://www.kines.umich.edu/ crawl (Contact Prof. Radev to receive data)
- Ohio State University website crawl (Contact Prof. Radev to receive data)
- Washtenaw Community College crawl (Contact Prof. Radev to receive data)
- TREC Web Corpus (Contact Prof. Radev to receive data)
Build new datasets
- program committees of conferences in NLP/CL or IR or ML
- syntactic dependencies
- mentions of named entities in text
- Wikipedia
- social networking sites such as MySpace, Facebook, LinkedIn, etc.
- product recommendations from sites such as Amazon, eBay, clothing sites, etc.
- YouTube related videos
- adjective/noun network
- dictionary network: two words are connected if one appears in the dictionary definition of the other
- analyze the AAN author network, collaboration network, or title network (two paper titles are connected if they share a non-stop word)
- people or locations that are mentioned in the same news story
- collocation networks (Dorogovtsev and Mendes)
- co-occurrence or other sentence graphs
- concept, thesaurus, and association graphs
- citation
- Web Related
- similarity-based (e.g., cosine)
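Several of the ideas above define a network by a simple pairwise rule. As a rough illustration, the title network rule (two paper titles are connected if they share a non-stop word) could be sketched as follows. This is only a minimal sketch: the stop-word list, the function names, and the three sample titles are made up for the example and are not part of any dataset listed here.

```python
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def content_words(title):
    """Lowercase a title, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,:;!?()\"'").lower() for w in title.split())
    return {w for w in words if w and w not in STOP_WORDS}


def title_network(titles):
    """Return undirected edges (i, j), i < j, between titles sharing a non-stop word."""
    words = [content_words(t) for t in titles]
    return {
        (i, j)
        for i, j in combinations(range(len(titles)), 2)
        if words[i] & words[j]
    }


# Made-up titles, used only to exercise the rule.
titles = [
    "Parsing with Dependency Grammars",
    "A Survey of Dependency Parsing",
    "Graph Methods for Summarization",
]
edges = title_network(titles)
print(sorted(edges))  # prints [(0, 1)]
```

The all-pairs comparison is quadratic in the number of titles, which is fine for a few thousand titles; for a larger collection, an inverted index from word to titles would presumably be a better starting point.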
Papers on Networks
This section lists papers that can serve as a source of ideas for the report for the first homework assignment.
- How to become a superhero, P. M. Gleiser, J. Stat. Mech. (2007) P09020
http://arxiv.org/abs/0708.2410
- The Political Blogosphere and the 2004 U.S. Election: Divided They Blog, Lada A. Adamic and Natalie Glance (2005)
http://www.blogpulse.com/papers/2005/AdamicGlanceBlogWWW.pdf
- Patterns in syntactic dependency networks, Ramon Ferrer Cancho, Ricard V. Solé, and Reinhard Köhler, PHYSICAL REVIEW E 69, 051915 (2004)
http://complex.upf.es/~ricard/syntaxPRE51915.pdf
- Network properties of written human language, A. P. Masucci and G. J. Rodgers, Phys. Rev. E 74, 026102 (2006)
http://arxiv.org/abs/physics/0605071
- An evaluation of human protein-protein interaction data in the public domain, BMC Bioinformatics 2006, 7(Suppl 5):S19
http://www.biomedcentral.com/1471-2105/7/S5/S19/abstract
Database: hand-curated, with around 25,000 proteins and 35,000 interactions: http://www.hprd.org/download