Germline - Genetic Error-tolerant Regional Matching with LINear-time Extension

About

GERMLINE is a program for discovering long shared segments of Identity by Descent (IBD) between pairs of individuals in a large population. It takes as input genotype or haplotype marker data for individuals (as well as an optional known pedigree) and generates a list of all pairwise shared segments.

GERMLINE uses a novel hashing and extension algorithm that identifies segments in haplotype data in time proportional to the number of individuals. At present, GERMLINE can run on phased or unphased data. We have found performance to be much improved with phasing, and phasing followed by a GERMLINE run is still significantly faster than comparable IBD algorithms. Utilities for easily phasing data for GERMLINE are available below. GERMLINE can identify shared segments of any specified length and can allow for any number of mismatching markers.
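
As a rough illustration of the hashing and extension idea described above, the following C++ sketch splits each haplotype into fixed-size slices, groups haplotypes whose slices are identical (the seeds), and then extends each seed into neighbouring slices while the number of mismatching markers per slice stays within a threshold. It is a simplified sketch and not GERMLINE's actual source: the names Seed, find_seeds, count_mismatches and extend_seed are hypothetical, haplotypes are assumed to be equal-length strings of allele characters, and a single max_err threshold stands in for the separate -err_hom and -err_het options.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// A seed: haplotypes a and b are identical across one whole slice.
struct Seed {
    std::size_t a;
    std::size_t b;
    std::size_t slice;  // index of the slice in which they match exactly
};

// Hash every haplotype slice by slice and report pairs that fall into the
// same bucket. slice_len plays the role of the -bits option.
// Hypothetical helper for illustration only.
std::vector<Seed> find_seeds(const std::vector<std::string>& haps,
                             std::size_t slice_len) {
    std::vector<Seed> seeds;
    if (haps.empty() || slice_len == 0) return seeds;
    const std::size_t n_markers = haps[0].size();
    for (const std::string& h : haps) assert(h.size() == n_markers);

    for (std::size_t start = 0, s = 0; start + slice_len <= n_markers;
         start += slice_len, ++s) {
        // Bucket haplotype indices by the exact content of this slice.
        std::unordered_map<std::string, std::vector<std::size_t>> buckets;
        for (std::size_t i = 0; i < haps.size(); ++i) {
            buckets[haps[i].substr(start, slice_len)].push_back(i);
        }
        // Every pair inside one bucket shares the slice exactly.
        for (const auto& kv : buckets) {
            const std::vector<std::size_t>& ids = kv.second;
            for (std::size_t x = 0; x < ids.size(); ++x) {
                for (std::size_t y = x + 1; y < ids.size(); ++y) {
                    seeds.push_back(Seed{ids[x], ids[y], s});
                }
            }
        }
    }
    return seeds;
}

// Count differing markers between two haplotypes inside one slice.
std::size_t count_mismatches(const std::string& x, const std::string& y,
                             std::size_t start, std::size_t len) {
    std::size_t n = 0;
    for (std::size_t i = start; i < start + len; ++i) {
        if (x[i] != y[i]) ++n;
    }
    return n;
}

// Extend a seed left and right over whole slices, accepting a neighbouring
// slice while it has at most max_err mismatching markers.
// Returns the inclusive range (first slice, last slice) of the match.
std::pair<std::size_t, std::size_t> extend_seed(
    const std::vector<std::string>& haps, const Seed& seed,
    std::size_t slice_len, std::size_t max_err) {
    const std::string& x = haps[seed.a];
    const std::string& y = haps[seed.b];
    assert(x.size() == y.size() && slice_len > 0);
    const std::size_t n_slices = x.size() / slice_len;
    std::size_t first = seed.slice;
    std::size_t last = seed.slice;
    while (first > 0 &&
           count_mismatches(x, y, (first - 1) * slice_len, slice_len) <= max_err) {
        --first;
    }
    while (last + 1 < n_slices &&
           count_mismatches(x, y, (last + 1) * slice_len, slice_len) <= max_err) {
        ++last;
    }
    return {first, last};
}
```

In this sketch the hashing pass touches each haplotype once per slice, and pairs are only enumerated inside a bucket, so the pairwise work is limited to haplotypes that already share a slice. For example, with haplotypes 00110011 and 00110111 and a slice length of 4, the first slice yields one seed, and extend_seed grows it over the second slice only when max_err is at least 1.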

The program has been developed in Itsik Pe'er's Lab of Computational Genetics at Columbia University. It is built in C++ and tested in the Red Hat Linux environment; the source is distributed here in a tar.gz package under the GPL license. If you use GERMLINE in a published analysis, please cite Gusev A, Lowe JK, Stoffel M, Daly MJ, Altshuler D, Breslow JL, Friedman JM, Pe'er I (2008) Whole population, genomewide mapping of hidden relatedness. Genome Research.

  Download: germline 1.4.0 (08.14.09)

Usage

From the command line, extract germline with tar xzvf germline-X-X-X.tar.gz, enter the extracted directory, and compile germline with make. A simple test case using shortened HapMap samples can be run with make test_case. The executable is run as germline <options>; it prompts the user for input/output file information and then runs the algorithm.

Input

GERMLINE accepts as input the following formats:

NOTE: Although the PLINK format is not intended for haplotypes, GERMLINE expects the two alleles of each marker to appear in haplotype order; i.e. the first allele always corresponds to one haplotype and the second allele to the other. PLINK arbitrarily re-orders the alleles when it processes files, so we do not recommend handling phased data with PLINK prior to GERMLINE analysis, because the haplotypes may not remain intact. To target specific regions, use GERMLINE's -from_snp and -to_snp flags instead.
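
To make the allele-order convention concrete, the following C++ sketch reads one line of a PLINK .PED file and separates it into the two haplotypes implied by the note above: the first allele of every marker goes to one haplotype and the second allele to the other. It is an illustration rather than GERMLINE's own parser; the function name split_haplotypes is hypothetical, and the sketch assumes whitespace-separated, single-character allele codes after the six leading PED columns.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>

// Split one PED line into two haplotype strings.
// Assumes the standard PED layout: six leading columns (family ID,
// individual ID, paternal ID, maternal ID, sex, phenotype) followed by
// two alleles per marker. Hypothetical helper for illustration only.
std::pair<std::string, std::string> split_haplotypes(const std::string& ped_line) {
    std::istringstream in(ped_line);
    std::string field;

    // Skip the six leading columns.
    for (int i = 0; i < 6; ++i) {
        const bool ok = static_cast<bool>(in >> field);
        assert(ok && "PED line must start with six leading columns");
        (void)ok;
    }

    // Read allele pairs: the first allele extends haplotype 1,
    // the second allele extends haplotype 2.
    std::pair<std::string, std::string> haps;
    std::string a1, a2;
    while (in >> a1 >> a2) {
        haps.first += a1;
        haps.second += a2;
    }
    return haps;
}
```

For the line "F1 I1 0 0 1 -9 A G C C T A" this yields the haplotypes ACT and GCA. If a tool swaps the two alleles at any marker, the resulting strings no longer correspond to the phased haplotypes, which is why the order must be preserved.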

Output

Upon completion, GERMLINE generates a .match and .log file in the specified location. Each line in the .match file corresponds to a pairwise shared segment, with the following fields:

Options

The program has several command line options to direct the segmental sharing process:

Flag       Default  Description
-map       -        File location for genetic distance map. Uses the PLINK map format.
-min_m     5        Minimum length for match to be used for imputation (in cM or MB).
-err_hom   2        The maximum number of mismatching homozygous markers for a slice to still be considered part of a match.
-err_het   0        The maximum number of mismatching heterozygous markers for a slice to still be considered part of a match.
-from_snp  -        Indicate the ID of the first SNP to start processing from.
-to_snp    -        Indicate the ID of the last SNP to end processing with.
-print     -        Print the haplotype sequence for each match along with match information (Warning: This may require a large amount of free space).
-bits      128      Size of each slice (in markers) used for exact matching seeds.
-h_extend  -        Extends from exact seeds using haplotypes rather than genotypes; useful when data is well-phased (e.g. trios).
-homoz     -        Allow self matches (test for homozygosity).

Utilities

We have created some script utilities for converting between data formats; the source code is available below. All scripts can be compiled using g++ [script file] -o [output name].

Source                Command                                          Usage
phasing_pipeline.tgz  bash run.sh [ped file] [map file] [output]       Pipeline for phasing PLINK format data with BEAGLE and processing in GERMLINE. See the README for detailed usage.
ped_to_bgl.cpp        ./ped_to_bgl [ped file] [map file] > out.bgl     Takes PLINK .PED and .MAP files as input and converts them into BEAGLE format to be phased.
bgl_to_ped.cpp        ./bgl_to_ped [beagle file] [fam file] > out.ped  Takes a BEAGLE-format file and a PLINK .FAM file and converts them into PLINK .PED format to be processed by GERMLINE.

Changes

1.4.0 (08.14.09)
Allows self-matches between individuals (include -homoz flag).
Added columns to indicate whether match is homozygous or heterozygous.

1.3.0 (12.22.08)
Now handles unphased data (omit -h_extend flag).
Added options for homozygous / heterozygous inexact matching.
Output now specifies if match is in cM or MB.

1.2.1 (09.17.08)
Now using the boost dynamic_bitset libraries. These are packaged with the source and do not affect installation/dependency.
Added -bits flag to explicitly define word-length.
Included sample input data & test-case called upon compilation by 'make test_case'.

1.2.0 (09.03.08)
Output format has changed to provide more detailed SNP information (see above).
Can now iteratively process multi-chromosomal data (for PLINK / PED format only).
Genotype calling has been removed for the time being.
Genetic map restructured (see above) and processed as a parameter.

1.0.2 (08.12.08)
Updated the HapMap format input - auto-detection of trio or unrelated input.

1.0.1 (06.09.08)
Added options to perform analysis on specific region (see -from_snp, -to_snp flags).
Added option to print haplotypes and matches (see: -haps, -print flags).

Contact

For any questions or comments, please visit our service community or contact the developers directly at: {gusev,itsik}@cs.columbia.edu.