DASH Associates Shared Haplotypes

About

Genomewide association has been a powerful tool for detecting common disease variants. However, this approach has been underpowered in identifying variation that is poorly represented on commercial SNP arrays, being too rare or population-specific. Recent multipoint methods including SNP tagging and imputation boost the power of detecting and localizing the true causal variant, leveraging common haplotypes in a densely typed panel of reference samples. However, they are limited by the need to obtain a robust population-specific reference panel with sampling deep enough to observe a rare variant of interest. We set out to overcome these challenges by using long stretches of genomic sharing that are identical by descent (IBD). We use such evident sharing between pairs and small subsets of individuals to recover the underlying shared haplotypes that have been co-inherited by these individuals.

We have created a software tool, DASH (DASH Associates Shared Haplotypes), that builds upon pairwise IBD shared segments to infer clusters of IBD individuals. Briefly, for each locus, DASH constructs a graph with links based on IBD at that locus, and uses an iterative min-cut approach to identify clusters. These are densely connected components, each sharing a haplotype. As DASH slides the local window along the genome, links representing new shared segments are added and old ones expire; these changes cause the resultant connected components to grow and shrink. We code the corresponding haplotypes as genetic markers and use them for association testing.
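The connected-component idea described above can be illustrated with a minimal sketch. It is not the DASH implementation: it skips the iterative min-cut refinement and rebuilds the graph from scratch at each window, where DASH updates links incrementally. The tuple layout (hap_a, hap_b, start, end) and the function names are assumptions made for this example.

```python
from collections import defaultdict


def clusters_at(segments, locus, min_size=2):
    """Group haplotypes linked by IBD segments that cover `locus`.

    segments: iterable of (hap_a, hap_b, start, end) tuples
              (hypothetical layout, not the DASH input format).
    Returns a list of sets, one per connected component.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, start, end in segments:
        if start <= locus <= end:      # link is "alive" at this locus
            parent[find(a)] = find(b)  # merge the two components

    groups = defaultdict(set)
    for hap in list(parent):
        groups[find(hap)].add(hap)
    return [g for g in groups.values() if len(g) >= min_size]


def sliding_clusters(segments, window, chrom_end):
    """Yield (locus, clusters) at the midpoint of each window."""
    for left in range(0, chrom_end, window):
        locus = left + window // 2
        yield locus, clusters_at(segments, locus)
```

With segments A-B over 100..500, B-C over 300..900 and D-E over 50..200, the components at position 400 are {A, B, C}; at position 150 they are {A, B} and {D, E}. This shows how clusters grow and shrink as links appear and expire.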

The program has been developed in Itsik Pe'er's Lab of Computational Genetics at Columbia University. It is built in C++ and tested in the Red Hat Linux environment; the source is distributed here in a tar.gz package under the GPL license. If you plan to use DASH in a published analysis, please reference the following manuscript:

DASH: A Method for Identical-by-Descent Haplotype Mapping Uncovers Association with Recent Variation, Alexander Gusev, Eimear E. Kenny, Jennifer K. Lowe, Jaqueline Salit, Richa Saxena, Sekar Kathiresan, David M. Altshuler, Jeffrey M. Friedman, Jan L. Breslow, Itsik Pe'er. The American Journal of Human Genetics 2011

download: dash 1.1.0 (05.27.11)

Usage

The DASH package consists of 32-bit binaries and C++ source for the efficient connected-component-based clustering (src/dash_cc), the more advanced but slower dense-subgraph clustering (src/dash_adv), and additional tools (src/tools). From the command line, extract DASH with tar xzvf dash-X-X-X.tar.gz. Pre-compiled binaries are in the 'bin' directory, but they can be regenerated by entering each of the subdirectories in 'src' and calling make. For dash_adv, a simple test case using inputs from the test subdirectory can be run by calling make test.

DASH-adv uses a modified version of the Boost Graph Library subgraph.hpp class, with all of the necessary files provided in this distribution. If you run into Boost-related issues when compiling, please make sure that a native copy of Boost is not superseding the one referenced.

Input

DASH accepts IBD segments through the standard input, one segment per line, with each line whitespace delimited with the following columns:

Simple Execution

DASH makes several assumptions about the structure of the shared segments. First, all segments are expected to be on the same chromosome; we recommend splitting genomic data into separate chromosomes, which can be easily parallelized. More importantly, DASH assumes that each individual in the pair represents a haploid sample. While DASH allows for some degree of error and attempts to exclude individuals from a haplotype to which they are loosely connected, when a single input individual shares both of its haplotypes with many other samples, DASH will place that individual into the single most likely haplotype cluster rather than both.

A vanilla analysis, first generating IBD segments using our GERMLINE algorithm, would be the following:

germline -haploid
cut -f 1,2,4,10,11 germline.match | dash_cc my_samples.fam my_clusters
cut -f 1-3 my_clusters.clst | awk '{ print 1,"cs"$1,0,int(($2+$3)/2) }' > my_clusters.map
plink --ped my_clusters.ped --map my_clusters.map --pheno my_trait --assoc

From experimentation, we have found the "-haploid -bin_out -min_m 1 -bits 32 -err_hom 1 -err_het 1" flags for GERMLINE to be most effective.
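For readers less familiar with awk, the third command in the example above (building a PLINK .map file from the first three .clst fields) can be mirrored in Python. This is a sketch under the assumptions that the fields are tab separated and that fields 2 and 3 are integer positions; the function name and the chrom parameter are hypothetical additions (the awk command hard-codes chromosome 1).

```python
def clst_to_map_lines(clst_lines, chrom=1):
    """Yield PLINK .map lines: chromosome, marker name, 0, midpoint position.

    Mirrors: awk '{ print 1,"cs"$1,0,int(($2+$3)/2) }'
    Only the first three tab-separated fields of each line are used.
    """
    for line in clst_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        cluster_id, start, end = fields[0], int(fields[1]), int(fields[2])
        # The marker is placed at the midpoint of the cluster region.
        yield f"{chrom} cs{cluster_id} 0 {(start + end) // 2}"
```

A typical use would be to open my_clusters.clst, pass the file object to clst_to_map_lines, and write each yielded line to my_clusters.map.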

Full Experiment

We have written a command-line execution pipeline that starts with unphased PLINK-format data, generates phased haplotypes with BEAGLE, identifies IBD segments with GERMLINE, and processes them with DASH-cc. Please make sure that the 'src/tools' binaries have been compiled, and that links to or copies of the PLINK, BEAGLE (if phasing), and GERMLINE (if detecting IBD) executables are in the 'bin' directory under the names 'plink', 'beagle.jar', and 'germline' respectively.

A full analysis, starting with my_input.ped, my_input.map, and my_input.fam unphased, PLINK-format data would be run as follows:

Output

As it runs, DASH generates a .clst Haplotype cluster file where each line represents a cluster/haplotype with the following tab separated fields:

Fields 2 & 3 represent the shortest region containing the clustered individuals with no change in IBD status; fields 4 & 5 represent the minimum region where all cluster members share an IBD segment. When called using the pipeline scripts, a *.cmap file will be generated corresponding to fields 1-5, and a binary PLINK-format file will be generated corresponding to the haplotypes.

Advanced Options

The DASH-adv program has several command line options to direct the clustering process:

Flag      Default  Description
-help     -        Print this list of commands.
-fam      -        PLINK-format .fam file listing sample IDs. Used to generate ped/map files (see above).
-win      500000   Sliding window size.
-density  0.6      Minimum cluster density.
-r2       0.95     Maximum r^2 for which two haplotypes are considered different and printed; set to 1 to print all.
-min      4        Minimum haplotype/cluster size.
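To give some intuition for the -density and -r2 thresholds, the sketch below computes two commonly used quantities: graph density as the fraction of possible pairwise links that are present, and r^2 as the squared Pearson correlation between two 0/1 carrier vectors. Both definitions are assumptions chosen for illustration; the exact formulas used inside DASH-adv are not given here, and the function names are hypothetical.

```python
def cluster_density(members, links):
    """Fraction of possible member pairs that are linked (assumed definition)."""
    members = set(members)
    n = len(members)
    if n < 2:
        return 0.0
    # Keep each undirected link once, and only links inside the cluster.
    present = {frozenset(p) for p in links
               if len(set(p)) == 2 and set(p) <= members}
    return len(present) / (n * (n - 1) / 2)


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length 0/1 vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov * cov / (vx * vy)
```

Under these definitions, a three-member cluster with two of its three possible links has density 2/3 and would pass the default 0.6 threshold. Two haplotypes carried by exactly the same samples have r^2 = 1, so with the default of 0.95 they would be treated as the same haplotype and not printed separately.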

Contact

For any questions or comments, please contact the developers directly at: {gusev,itsik}@cs.columbia.edu.

Change Log

1.1.0 (05.27.11)
DASH paper published at AJHG
DASH-cc and DASH-adv versions released
Phasing & IBD pipeline added
1.0.0 (09.17.10)
First stable release