Datasets and Code
- Protein interaction data set from the HPRD database
http://www.hprd.org/
Manually curated; contains 37,108 interactions.
- Protein interaction networks for various species
http://www.cs.purdue.edu/homes/koyuturk/mawish/
- AAN bibliometric corpus
http://tangra.si.umich.edu/clair/anthology/index.cgi
Available from Prof. Radev upon request
- Prostate cancer genes dataset
We started with a list of 15 seed genes that are known to be related to prostate cancer.
We used around 48,000 articles from the PMCOA corpus (PubMed Central Open Access) to extract the
interactions of the seed genes with each other and with other genes (neighbor genes).
Then, we also extracted the interactions among the neighbor genes.
The interactions were extracted automatically using dependency parsing and support vector machines.
When extracting the interactions, all synonyms of the gene names were considered.
The gene names in the produced network are normalized to their official HGNC symbols (http://www.genenames.org/).
The network consists of 226 nodes and 1,187 edges.
Available from Prof. Radev upon request
- STRING - Protein Association Networks
Protein association networks for approximately 300 organisms.
Link provided by Anil Raj.
- Facebook Networks
Columbia Engineering and Microsoft Alumni
Anonymized Facebook datasets. Code to extract more datasets is also available.
Provided by Junfeng He, who can be contacted for the code.
- Co-occurrence networks
Two co-occurrence networks created from 7.5M words of the British National Corpus. Each has around 120K nodes. The restricted network has around 1M edges, whereas the unrestricted one has around 3M.
(See "The small world of human language" by Ferrer i Cancho and Solé for more information on the restricted and unrestricted networks.)
Available from Madhav Krishna upon request (some network code is available as well).
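
For readers unfamiliar with the restricted/unrestricted distinction in the co-occurrence networks above, the following is a minimal sketch, not the code used to build the listed datasets. It assumes that two words are linked in the unrestricted network when they occur within two positions of each other, and that the restricted network keeps only the pairs that co-occur more often than their individual frequencies would predict. The window size, the probability estimates, and the function names (cooccurrence_counts, build_networks) are illustrative assumptions; consult the paper for the exact definitions.

```python
from collections import Counter


def cooccurrence_counts(tokens, max_distance=2):
    """Count unordered pairs of distinct words at most max_distance apart."""
    pair_counts = Counter()
    for i, word in enumerate(tokens):
        # Look ahead only, so each pair of positions is counted once.
        for j in range(i + 1, min(i + max_distance + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                pair_counts[frozenset((word, other))] += 1
    return pair_counts


def build_networks(tokens, max_distance=2):
    """Return (unrestricted, restricted) edge sets for a token sequence.

    Assumption: an edge is 'restricted' when the estimated joint
    probability of the pair exceeds the product of the two word
    probabilities. The estimates used here are deliberately simple.
    """
    word_counts = Counter(tokens)
    total_words = len(tokens)
    pair_counts = cooccurrence_counts(tokens, max_distance)
    total_pairs = sum(pair_counts.values())

    unrestricted = set(pair_counts)
    restricted = set()
    for pair, count in pair_counts.items():
        a, b = tuple(pair)
        p_joint = count / total_pairs
        p_independent = (word_counts[a] / total_words) * (word_counts[b] / total_words)
        if p_joint > p_independent:
            restricted.add(pair)
    return unrestricted, restricted


if __name__ == "__main__":
    sample = "the cat sat on the mat".split()
    unrestricted, restricted = build_networks(sample)
    print(len(unrestricted), "unrestricted edges;", len(restricted), "restricted edges")
```

By construction the restricted edge set is always a subset of the unrestricted one, which matches the edge counts reported for the two networks (around 1M versus around 3M edges over the same set of nodes).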