GERMLINE

About

GERMLINE is an algorithm for discovering long shared segments of Identity by Descent (IBD) between pairs of individuals in a large population. It takes as input genotype or haplotype marker data for individuals (as well as an optional known pedigree) and generates a list of all pairwise segmental sharing.

GERMLINE uses a novel hashing & extension algorithm which allows for segment identification in haplotype data in time proportional to the number of individuals. Presently, GERMLINE can execute on phased or un-phased data, though we have found performance much improved with phasing. Even including the phasing step, running GERMLINE is still significantly faster than comparable IBD algorithms. Utilities for easily phasing data for GERMLINE are available below. GERMLINE can identify shared segments of any specified length and can allow for any number of mismatching markers.
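
The hashing & extension idea can be illustrated with a toy sketch. This is not GERMLINE's implementation: the function name, the string encoding of haplotypes, and the window and min_windows parameters are all assumptions made for illustration, and mismatch tolerance is left out. Haplotypes are cut into fixed-size windows, individuals with identical windows are grouped by hashing, and consecutive matching windows are joined into longer segments.

```python
from collections import defaultdict
from itertools import combinations


def seed_and_extend(haplotypes, window=4, min_windows=2):
    """Toy hash-and-extend over equal-length haplotype strings.

    Returns {(id_a, id_b): [(start, end), ...]} in marker indices,
    with end exclusive. Illustrative only; no mismatches are allowed.
    """
    n_markers = len(next(iter(haplotypes.values())))
    n_windows = n_markers // window

    # For each pair, collect the window indices where they match exactly.
    seeds = defaultdict(list)
    for w in range(n_windows):
        lo, hi = w * window, (w + 1) * window
        buckets = defaultdict(list)  # window content -> individuals carrying it
        for name, hap in haplotypes.items():
            buckets[hap[lo:hi]].append(name)
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                seeds[(a, b)].append(w)

    # Join runs of consecutive matching windows into segments.
    segments = {}
    for pair, windows in seeds.items():
        runs = []
        start = prev = windows[0]
        for w in windows[1:]:
            if w == prev + 1:
                prev = w
            else:
                runs.append((start, prev))
                start = prev = w
        runs.append((start, prev))
        kept = [(s * window, (e + 1) * window)
                for s, e in runs if e - s + 1 >= min_windows]
        if kept:
            segments[pair] = kept
    return segments
```

Grouping by window content means each window is processed in one pass over the individuals; pairs are only enumerated inside a bucket, i.e. where an exact seed match already exists.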

The program has been developed in Itsik Pe'er's Lab of Computational Genetics at Columbia University. It has been built in C++ and tested in the Red Hat Linux environment; the source is available here in a tar.gz package as well as pre-compiled binaries under the utilities section. GERMLINE is distributed under the GPL license. If you use GERMLINE in a published analysis, please cite Gusev A, Lowe JK, Stoffel M, Daly MJ, Altshuler D, Breslow JL, Friedman JM, Pe'er I (2008) Whole population, genomewide mapping of hidden relatedness. Genome Research.

This work has recently been applied to optimally selecting individuals for sequencing and inferring previously un-typed variants in "Low-pass Genomewide Sequencing and Variant Imputation Using Identity-by-descent in an Isolated Human Population" (2011), A Gusev, MJ Shah, EE Kenny, A Ramachandran, JK Lowe, J Salit, CC Lee, EC Levandowsky, TN Weaver, QC Doan, HE Peckham, SF McLaughlin, MR Lyons, VN Sheth, M Stoffel, FM De La Vega, JM Friedman, JL Breslow, I Pe'er (in submission, pre-pub version).

Usage

From the command line, extract germline with tar xzvf germline-X-X-X.tar.gz, enter the extracted directory, and compile germline with make all. A simple test-case using shortened HapMap samples can be run using make test. The executable is run as germline <options>; it prompts the user for input/output file information and then runs the algorithm.

Input

GERMLINE accepts as input the following formats:

[ doc ] Plink / ped+map
[ doc ] PHASE / HapMap

NOTE: Although the PLINK format is not intended for haplotypes, GERMLINE expects the respective alleles to appear in order; i.e. the first allele always corresponds to one haplotype and the second allele to the other. PLINK arbitrarily re-orders the alleles when processing the files, so the haplotypes may no longer be intact afterwards. We therefore do not recommend handling phased data with PLINK prior to GERMLINE analysis; to target specific regions, use the -from_snp and -to_snp flags instead.
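
To make the allele-order convention concrete, the following minimal sketch splits one ped line into the two haplotypes implied by that convention. It assumes the standard PLINK ped layout of six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by whitespace-separated allele pairs. The function name and the sample line are made up for illustration.

```python
def split_ped_line(line):
    """Split a phased ped line into (fam_id, ind_id, hap0, hap1).

    Assumes six leading columns followed by allele pairs, where the first
    allele of each pair belongs to one haplotype and the second allele to
    the other, as described in the note above.
    """
    fields = line.split()
    fam_id, ind_id = fields[0], fields[1]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected an even number of allele columns")
    hap0 = alleles[0::2]  # first allele of every marker
    hap1 = alleles[1::2]  # second allele of every marker
    return fam_id, ind_id, hap0, hap1


# Hypothetical three-marker individual.
example = "FAM1 IND1 0 0 1 -9 A G C C T A"
print(split_ped_line(example))
```

Any tool that swaps the two alleles within a pair would move alleles between hap0 and hap1, which is why the order must be preserved.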

Output

Upon completion, GERMLINE generates a .match and .log file in the specified location. Each line in the .match file corresponds to a pairwise shared segment, with the following fields:

Family ID 1
Individual ID 1
Family ID 2
Individual ID 2
Chromosome
Segment start (bp)
Segment end (bp)
Segment start (SNP)
Segment end (SNP)
Total SNPs in segment
Genetic length of segment
Units for genetic length (cM or MB)
Mismatching SNPs in segment
1 if Individual 1 is homozygous in match; 0 otherwise
1 if Individual 2 is homozygous in match; 0 otherwise
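
As an illustration of the field order above, here is a minimal sketch of a .match line parser. It assumes the fifteen fields are separated by whitespace and that IDs contain no whitespace; the Match type, its attribute names, and the sample line are hypothetical and not part of GERMLINE.

```python
from typing import NamedTuple


class Match(NamedTuple):
    fam1: str
    ind1: str
    fam2: str
    ind2: str
    chrom: str
    start_bp: int
    end_bp: int
    start_snp: str
    end_snp: str
    n_snps: int
    length: float
    units: str        # "cM" or "MB"
    mismatches: int
    ind1_hom: bool
    ind2_hom: bool


def parse_match_line(line):
    """Parse one .match line, assuming 15 whitespace-separated fields."""
    f = line.split()
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, found {len(f)}")
    return Match(f[0], f[1], f[2], f[3], f[4],
                 int(f[5]), int(f[6]), f[7], f[8], int(f[9]),
                 float(f[10]), f[11], int(f[12]),
                 f[13] == "1", f[14] == "1")


# Made-up line for illustration only.
m = parse_match_line("F1 I1 F2 I2 1 1000 5000 rs1 rs9 120 3.5 cM 0 0 1")
print(m.ind1, m.ind2, m.length, m.units)
```

With records in this form, filtering by length or summing shared length per pair becomes a few lines of ordinary Python.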

Binary Output

To save space, GERMLINE can also generate binary output using the -bin_out flag. This flag will generate three files:

*.bsid Two columns per line for each sample: FAM ID,SAMPLE ID.
*.bmid Four columns per line for each marker: CHROMOSOME,RSID,GENETIC DISTANCE,PHYSICAL DISTANCE.
*.bmatch Binary match file containing integer pointers to samples (from bsid file), markers (from bmid file) and boolean meta-data.

The binary files can be converted back to the standard flat format described above by using the parse_bmatch utility provided with the code. Load the three generated files using parse_bmatch [BMATCH FILE] [BSID FILE] [BMID FILE] and the flat match output will be printed to standard out. See the parse_bmatch.cpp code for binary format details.
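
The conversion step can be scripted. The sketch below assumes the three files share a common prefix and that the compiled parse_bmatch executable is on the PATH; both are assumptions, and the helper names are hypothetical. It builds the argument list in the order given above and redirects standard out to a file.

```python
import subprocess


def parse_bmatch_command(prefix, tool="parse_bmatch"):
    """Build the argument list: tool, then bmatch, bsid and bmid files.

    Assumes all three files are named <prefix>.bmatch/.bsid/.bmid.
    """
    return [tool, prefix + ".bmatch", prefix + ".bsid", prefix + ".bmid"]


def convert_to_flat(prefix, out_path, tool="parse_bmatch"):
    """Run parse_bmatch and write its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(parse_bmatch_command(prefix, tool),
                       stdout=out, check=True)


print(parse_bmatch_command("run1"))
```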

Options

The program has several command line options to direct the segmental sharing process:

Flag	Default	Description
-map	-	File location for genetic distance map. Uses the PLINK map format.
-min_m	3	Minimum length for match to be used for imputation (in cM or MB).
-err_hom	2	The maximum number of mismatching homozygous markers for a slice to still be considered part of a match.
-err_het	0	The maximum number of mismatching heterozygous markers for a slice to still be considered part of a match.
-from_snp	-	Indicate the ID of the first SNP to start processing from.
-to_snp	-	Indicate the ID of the last SNP to end processing with.
-h_extend		Extends from exact seeds using haplotypes rather than genotypes; useful when data is well-phased (e.g. trios)
-homoz		Allow self matches (runs of homozygosity)
-homoz-only		Analyze and report only autozygous/homozygous segments; no IBD is reported, but the analysis is significantly faster.
-haploid		Treat each input individual as two distinct and separate haplotypes. Output IDs will have .0/.1 suffix corresponding to each haplotype. The -err_het flag will have no effect in this analysis.
-bin_out		Generate output matches in binary format; creates .bmatch, .bsid and .bmid files. These files can be converted to flat output using the parse_bmatch utility included and compiled in the package.
-bits	128	Size of each slice (in markers) used for exact matching seeds.
-w_extend		Extend the match beyond the slice end to the first mismatching marker.
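
When GERMLINE is driven from a script, it can help to check flags against the table above before launching a run. The following is a minimal sketch: the helper name is hypothetical, only flags listed in the table are accepted, and the interactive input/output prompts are not handled here.

```python
# Flags from the options table that take a value.
VALUE_FLAGS = {"-map", "-min_m", "-err_hom", "-err_het",
               "-from_snp", "-to_snp", "-bits"}
# Flags from the options table that are simple switches.
SWITCH_FLAGS = {"-h_extend", "-homoz", "-homoz-only",
                "-haploid", "-bin_out", "-w_extend"}


def germline_args(values=None, switches=(), exe="germline"):
    """Return an argument list such as ['germline', '-min_m', '5', ...]."""
    args = [exe]
    for flag, value in (values or {}).items():
        if flag not in VALUE_FLAGS:
            raise ValueError(f"unknown value flag: {flag}")
        args += [flag, str(value)]
    for flag in switches:
        if flag not in SWITCH_FLAGS:
            raise ValueError(f"unknown switch: {flag}")
        args.append(flag)
    return args


print(germline_args({"-min_m": 5, "-bits": 64}, ["-h_extend"]))
```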

Utilities

We have created some script utilities for converting between data formats; the source code is available below. All scripts can be compiled using g++ [script file] -o [output name].

Title	Usage	Download
Phasing Pipeline	Pipeline for phasing PLINK format data with BEAGLE and processing in GERMLINE. README for detailed usage.	phasing_pipeline.tar.gz
Binaries	Pre-compiled binaries of GERMLINE v1.5.0 for Linux 32/64 bit and Windows (cygwin).	32b, 64b, WIN

Contact

For any questions or comments, please contact the developers directly at: {gusev,itsik}@cs.columbia.edu.

Change Log

1.5.1 (03.07.12)
Fixed minor formatting bug with runs of homozygosity

1.5.0 (09.17.10)
Major computational overhaul - algorithm should run 2-3 times faster with 4-fold memory reduction on large datasets.
Added a binary output option (flag -bin_out) to reduce output size, see documentation.
Added a stand-alone C++ script to convert from binary file to standard readable match file.
Added -haploid flag to treat each input individual as two distinct haplotypes; output IDs will have a .0/.1 suffix indicating which respective haplotype the match is on. This is effective for short windows with very well phased data.

1.4.2 (07.22.10)
Fixed bug with -to_snp/-from_snp commands
Added -homoz-only flag for simple homozygosity analysis

1.4.1 (03.22.10)
Fixed bug in PHASE format input
Fixed bug in files with multiple chromosomes
Fixed bug with overlapping segments from -h_extend feature

1.4.0 (08.14.09)
Allows self-matches between individuals (include -homoz flag)
Added columns to indicate whether match is homozygous or heterozygous

1.3.0 (12.22.08)
Now handles unphased data (omit -h_extend flag).
Added options for homozygous / heterozygous in-exact matching.
Output now specifies if match is in cM or MB.

1.2.1 (09.17.08)
Now using the boost dynamic_bitset libraries. These are packaged with the source and do not affect installation/dependency.
Added -bits flag to explicitly define word-length.
Included sample input data & test-case called upon compilation by 'make test_case'.

1.2.0 (09.03.08)
Output format has changed to provide more detailed SNP information (see above).
Can now iteratively process multi-chromosomal data (for PLINK / PED format only).
Genotype calling has been removed for the time being.
Genetic map restructured (see above) and processed as a parameter.

1.0.2 (08.12.08)
Updated the HapMap format input - auto-detection of trio or unrelated input.

1.0.1 (06.09.08)
Added options to perform analysis on specific region (see -from_snp, -to_snp flags).
Added option to print haplotypes and matches (see: -haps, -print flags).

This website has moved to www.gusevlab.org/software; the legacy site is retained below for posterity.