Datasets and Code
- Protein interaction data set from the HPRD database
http://www.hprd.org/
Manually curated; contains 37,108 interactions.
- Protein interaction networks for various species
http://www.cs.purdue.edu/homes/koyuturk/mawish/
- AAN bibliometric corpus
http://tangra.si.umich.edu/clair/anthology/index.cgi
Available from Prof. Radev upon request
- Prostate cancer genes dataset
We started with a list of seed genes (15 genes) that are known to be related to prostate cancer.
We used around 48,000 articles from the PMCOA corpus (PubMed Central Open Access) to extract the
interactions of the seed genes with each other and with the other genes (neighbor genes).
Then, we extracted the interactions among the neighbor genes too.
The interactions were extracted automatically by using dependency parsing and support vector machines.
When extracting the interactions, all the synonyms of the gene names were considered.
The gene names in the produced network are normalized to their official HGNC symbols (http://www.genenames.org/).
The network consists of 226 nodes and 1,187 edges.
Available from Prof. Radev upon request
- STRING - Protein Association Networks
Protein association networks covering approximately 300 organisms.
Link Provided by Anil Raj.
- Facebook Networks
Columbia Engineering and Microsoft Alumni
Anonymized Facebook datasets. Code to extract more datasets is also available.
Provided by Junfeng He. Contact for code.
- Co-occurrence networks
Two co-occurrence networks created from 7.5M words of the British National Corpus. Each has around 120K nodes. The restricted network has around 1M edges, whereas the unrestricted one has around 3M.
(See "The Small World of Human Language" (Ferrer i Cancho and Solé) for more information on the restricted and unrestricted networks.)
Available from Madhav Krishna upon request (Some network code available as well)
- IMDB dataset
Dataset containing 10,000 random actors from 2007 and all their movies, with genres, keywords, etc.
Available from Sara Stolbach upon request (Code to extract more data is also available)
- Clothing Recommendation Network
Network of URLs in which A is connected to B when B is recommended on A's page (directed network A -> B)
Available from Sara Stolbach upon request
- 1984 by George Orwell
Cleaned-up version of the text of 1984 by George Orwell
Available from Sara Stolbach upon request
- NCAA Data
NCAA Division I basketball games from the 1938-39 season to the 2007-08 season. Regular-season games are included only from the 1979-80 season onwards; before that, only NCAA tournament games are present. The dataset also contains other tournaments such as the NIT.
Available from Matt Chu upon request
- Supreme Court Data
Data: http://jhfowler.ucsd.edu/judicial.htm
A directed network of citations in Pajek format is available from Will Mee upon request. Link provided by Will Mee.
- SMS Data
http://www.cel.iitkgp.ernet.in/~monojit/sms.html and http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/
Adjacency networks built from this data are available from Felix Sanchez Garcia upon request. Links provided by Felix Sanchez Garcia.
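
Several of the networks listed above are directed (for example, the clothing recommendation network, where an edge A -> B means B is recommended on A's page, and the Supreme Court citation network). The file formats of these datasets are not described here, so the following is only a minimal sketch that assumes a plain-text edge list with one whitespace-separated "source target" pair per line. The function names and the sample URLs are hypothetical and are not taken from any of the datasets.

```python
from collections import defaultdict
from io import StringIO


def read_directed_edges(lines):
    """Build a directed adjacency map from 'source target' lines.

    Assumed format (not confirmed for any dataset above): one edge per
    line, two whitespace-separated tokens, '#' starts a comment line.
    """
    successors = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, target = line.split()[:2]
        successors[source].add(target)
        successors.setdefault(target, set())  # keep nodes with no out-edges
    return dict(successors)


def in_degrees(successors):
    """Count incoming edges per node (how often each page is recommended)."""
    counts = {node: 0 for node in successors}
    for targets in successors.values():
        for target in targets:
            counts[target] += 1
    return counts


# Hypothetical sample data, standing in for a real edge-list file.
sample = StringIO(
    "# source target\n"
    "shop.example/a shop.example/b\n"
    "shop.example/a shop.example/c\n"
    "shop.example/b shop.example/c\n"
)
graph = read_directed_edges(sample)
indeg = in_degrees(graph)
```

With a real file, the same function could be called on an open file handle instead of the in-memory sample. A dictionary of sets is used here only to keep the sketch free of dependencies; a graph library would be a reasonable alternative for the larger networks.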