About
GERMLINE is a program for discovering long shared segments of Identity by Descent (IBD) between pairs of individuals in a large population. It takes
as input genotype or haplotype marker data for individuals (as well as an optional known pedigree) and generates a list of all pairwise segmental sharing.
GERMLINE uses a novel hashing & extension algorithm which allows for segment identification in haplotype data in time proportional to the number of
individuals. Presently, GERMLINE can run on phased or unphased data. We have found that performance is much improved with phased data, and phasing
followed by GERMLINE is still significantly faster than comparable IBD algorithms. Utilities for easily phasing data for GERMLINE are available below. GERMLINE can identify shared segments of any specified length and can allow for any number of mismatching markers.
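To make the hashing step more concrete, here is a minimal Python sketch of how exact-match seeds could be found by hashing fixed-size slices of each haplotype (the slice size plays the role of the -bits option). This is a simplified illustration and not GERMLINE's actual implementation: the function name `seed_matches`, the string encoding of haplotypes, and the toy data are all assumptions, and the extension step that grows seeds into full segments is not shown.

```python
from collections import defaultdict
from itertools import combinations


def seed_matches(haplotypes, slice_size):
    """Map each pair of haplotype IDs to the slice indices where they are identical.

    haplotypes: dict of ID -> equal-length string of alleles (hypothetical encoding).
    slice_size: number of markers per slice.
    """
    n_markers = len(next(iter(haplotypes.values())))
    seeds = defaultdict(list)
    for start in range(0, n_markers, slice_size):
        # Hash every haplotype's slice into a bucket keyed by its allele string.
        buckets = defaultdict(list)
        for hap_id, alleles in haplotypes.items():
            buckets[alleles[start:start + slice_size]].append(hap_id)
        # Haplotypes that share a bucket are exact matches on this slice.
        for ids in buckets.values():
            for a, b in combinations(sorted(ids), 2):
                seeds[(a, b)].append(start // slice_size)
    return dict(seeds)


# Toy data: three haplotypes of ten markers each.
haps = {
    "h1": "0101100110",
    "h2": "0101100011",
    "h3": "1101100110",
}
print(seed_matches(haps, 5))
```

Each slice is bucketed in a single pass over the haplotypes, so only haplotypes that land in the same bucket are ever compared with each other.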
The program has been developed in Itsik Pe'er's Lab of Computational Genetics at Columbia
University. It is built in C++ and tested in the Red Hat Linux environment; the source is distributed here in a tar.gz package under the GPL license.
If you use GERMLINE in a published analysis, please cite Gusev A, Lowe JK, Stoffel M, Daly MJ, Altshuler D, Breslow JL, Friedman JM, Pe'er I (2008)
Whole population, genomewide mapping of hidden relatedness. Genome Research.
Usage
From the command line, extract GERMLINE with tar xzvf germline-X-X-X.tar.gz, enter the extracted directory, and compile with make. A simple test case using shortened HapMap samples can be run with make test_case. The executable is run as germline <options>; it prompts the user for input/output file information and then runs the algorithm.
Input
GERMLINE accepts as input the following formats:
- PLINK / ped+map
- PHASE / HapMap
NOTE: Although the PLINK format is not intended for haplotypes, GERMLINE expects the two alleles at each marker to appear in
haplotype order; i.e. the first allele always corresponds to one haplotype and the second allele to the other. PLINK arbitrarily re-orders the
alleles when processing files, so we do not recommend handling phased data with PLINK prior to GERMLINE analysis because the haplotypes
may not remain intact. To target specific regions, use GERMLINE's -from_snp and -to_snp flags instead.
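The allele-order convention can be illustrated with a short Python sketch that splits one line of a phased .ped file into its two haplotypes. It assumes the standard PLINK layout of six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by whitespace-separated allele pairs; the function name `split_ped_haplotypes` and the sample line are hypothetical.

```python
def split_ped_haplotypes(ped_line):
    """Split a phased PLINK .ped line into (family ID, individual ID, hap A, hap B)."""
    fields = ped_line.split()
    meta, alleles = fields[:6], fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected an even number of allele columns")
    hap_a = alleles[0::2]  # first allele of every marker
    hap_b = alleles[1::2]  # second allele of every marker
    return meta[0], meta[1], hap_a, hap_b


# Hypothetical individual with three markers.
line = "FAM1 IND1 0 0 1 -9 A G C C T A"
fid, iid, hap_a, hap_b = split_ped_haplotypes(line)
print(fid, iid, "".join(hap_a), "".join(hap_b))
```

If a tool swaps the two alleles at any marker, the two reconstructed haplotypes are silently mixed, which is why the order must be preserved.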
Output
Upon completion, GERMLINE generates a .match and .log file in the specified location. Each line in the .match file corresponds to a pairwise shared segment, with the following fields:
- Family ID 1
- Individual ID 1
- Family ID 2
- Individual ID 2
- Chromosome
- Segment start (bp)
- Segment end (bp)
- Segment start (SNP)
- Segment end (SNP)
- Total SNPs in segment
- Genetic length of segment
- Units for genetic length (cM or MB)
- Mismatching SNPs in segment
- 1 if Individual 1 is homozygous in match; 0 otherwise
- 1 if Individual 2 is homozygous in match; 0 otherwise
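As a rough sketch of how the .match file could be consumed downstream, the following Python snippet parses one line into named fields in the order listed above. It assumes the fields are whitespace-separated and that IDs contain no embedded spaces; the `Match` type, the `parse_match_line` function, and the sample line are hypothetical and are not part of GERMLINE.

```python
from typing import NamedTuple


class Match(NamedTuple):
    fid1: str
    iid1: str
    fid2: str
    iid2: str
    chrom: str
    start_bp: int
    end_bp: int
    start_snp: str
    end_snp: str
    n_snps: int
    length: float
    units: str        # "cM" or "MB"
    mismatches: int
    hom1: bool        # True if individual 1 is homozygous in the match
    hom2: bool        # True if individual 2 is homozygous in the match


def parse_match_line(line):
    """Parse one whitespace-separated .match line (assumed 15 fields)."""
    f = line.split()
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return Match(f[0], f[1], f[2], f[3], f[4], int(f[5]), int(f[6]),
                 f[7], f[8], int(f[9]), float(f[10]), f[11], int(f[12]),
                 f[13] == "1", f[14] == "1")


# Made-up line for illustration only.
sample = "FAM1 IND1 FAM2 IND2 1 1000000 6500000 rs100 rs900 801 5.75 cM 1 0 0"
m = parse_match_line(sample)
print(m.iid1, m.iid2, m.length, m.units)
```

Keeping the units field next to the length makes it easy to avoid mixing cM and MB values when filtering segments.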
Options
The program has several command line options to direct the segmental sharing process:
| Flag | Default | Description |
|---|---|---|
| -map | - | File location for genetic distance map. Uses the PLINK map format. |
| -min_m | 5 | Minimum length for match to be used for imputation (in cM or MB). |
| -err_hom | 2 | The maximum number of mismatching homozygous markers for a slice to still be considered part of a match. |
| -err_het | 0 | The maximum number of mismatching heterozygous markers for a slice to still be considered part of a match. |
| -from_snp | - | Indicate the ID of the first SNP to start processing from. |
| -to_snp | - | Indicate the ID of the last SNP to end processing with. |
| -print | - | Print the haplotype sequence for each match along with match information (Warning: This may require a large amount of free space). |
| -bits | 128 | Size of each slice (in markers) used for exact matching seeds. |
| -h_extend | - | Extends from exact seeds using haplotypes rather than genotypes; useful when data is well-phased (e.g. trios) |
| -homoz | - | Allow self matches (test for homozygosity) |
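When GERMLINE is driven from a script, it can be convenient to assemble the option list programmatically. The Python sketch below builds an argument list from a dictionary using only the flags in the table above. It assumes that value-taking flags are followed by their value as a separate argument and that flags with no default (such as -h_extend) are plain switches; the helper name `build_germline_args` is hypothetical, and the input/output prompts are not handled here.

```python
def build_germline_args(options):
    """Turn {flag: value} into an argument list for the germline executable.

    True  -> flag is included as a bare switch (assumed behaviour)
    False or None -> flag is omitted
    anything else -> flag followed by the value as a string
    """
    args = ["germline"]
    for flag, value in options.items():
        if value is True:
            args.append(flag)
        elif value is not False and value is not None:
            args.extend([flag, str(value)])
    return args


# Defaults from the table, plus haplotype extension switched on.
opts = {
    "-min_m": 5,
    "-err_hom": 2,
    "-err_het": 0,
    "-bits": 128,
    "-h_extend": True,
    "-homoz": False,
}
print(" ".join(build_germline_args(opts)))
```

The identity checks against True and False are deliberate, so that a numeric value of 0 (as for -err_het) is still passed through rather than dropped.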
Utilities
We have created some script utilities for converting between data formats; the source code is available below. All scripts can
be compiled using g++ [script file] -o [output name].
| Source | Command | Usage |
|---|---|---|
| phasing_pipeline.tgz | bash run.sh [ped file] [map file] [output] | Pipeline for phasing PLINK format data with BEAGLE and processing in GERMLINE. See the README for detailed usage. |
| ped_to_bgl.cpp | ./ped_to_bgl [ped file] [map file] > out.bgl | Takes PLINK .PED and .MAP files as input and converts them into BEAGLE format to be phased. |
| bgl_to_ped.cpp | ./bgl_to_ped [beagle file] [fam file] > out.ped | Takes a BEAGLE-format file and a PLINK .FAM file and converts them into PLINK .PED format to be processed by GERMLINE. |
Changes
1.4.0 (08.14.09)
Allows self-matches between individuals (include -homoz flag).
Added columns to indicate whether match is homozygous or heterozygous.
1.3.0 (12.22.08)
Now handles unphased data (omit -h_extend flag).
Added options for homozygous / heterozygous in-exact matching.
Output now specifies if match is in cM or MB.
1.2.1 (09.17.08)
Now using the boost dynamic_bitset libraries. These are packaged with the source and do not affect installation/dependency.
Added -bits flag to explicitly define word-length.
Included sample input data & test-case called upon compilation by 'make test_case'.
1.2.0 (09.03.08)
Output format has changed to provide more detailed SNP information (see above).
Can now iteratively process multi-chromosomal data (for PLINK / PED format only).
Genotype calling has been removed for the time being.
Genetic map restructured (see above) and processed as a parameter.
1.0.2 (08.12.08)
Updated the HapMap format input - auto-detection of trio or unrelated input.
1.0.1 (06.09.08)
Added options to perform analysis on specific region (see -from_snp, -to_snp flags).
Added option to print haplotypes and matches (see: -haps, -print flags).
Contact
For any questions or comments, please visit our service
community or contact the developers directly at:
{gusev,itsik}@cs.columbia.edu.