akt — ancestry and kinship toolkit
Ancestry and Kinship Tools (AKT) provides a handful of useful statistical genetics routines using the htslib API for input/output. This means it can seamlessly read BCF/VCF files and play nicely with BCFtools. Many command line arguments and parts of this manpage were also borrowed/stolen from BCFtools!
Copyright (c) 2017, Illumina, Inc. All rights reserved. This software is not commercially supported.
AKT is freely available under the GPL3 license. AKT relies on HTSlib and Eigen. Eigen is a header-only library for matrix algebra released under the MPL2. HTSlib is released under the MIT/Expat License. Both libraries are included with AKT.
For a full list of available commands, run akt without arguments. For a full list of available options, run akt COMMAND without arguments.
There are a number of options that are shared by multiple akt subcommands which we list here. We have tried to keep these consistent with BCFtools where possible.
Performs principal component analysis on a BCF/VCF. Can also be used to project samples onto pre-calculated principal components from another cohort. Uses a randomised SVD by default for very fast computation. WGS data is far denser than required for a meaningful PCA, so it is recommended that you provide a thinned set of sites via the -R option.
RedSVD algorithm, which requires this parameter. The higher the number, the more accurate the principal components obtained.
Examples:
./akt pca multisample.bcf -R data/wgs.grch37.vcf.gz -O b -o pca.bcf > pca.txt
The file pca.txt contains:
SAMPLE_ID0 P0 P1 P2 P3 P4
SAMPLE_ID1 P0 P1 P2 P3 P4
...
The BCF file pca.bcf contains:
bcftools query -f "%INFO/WEIGHT\n" pca.bcf
pc00 pc01 pc02 pc03 pc04
pc10 pc11 pc12 pc13 pc14
...
The first index is the site index and the second is the principal component index. Each value is a coefficient (loading) that can be used to project other samples onto these principal components. For example, we could project a new set of samples onto these same PCs via:
./akt pca new_multisample.bcf -W pca.bcf > projections
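As a rough illustration of what projecting onto pre-computed loadings involves, the sketch below multiplies centred genotypes by per-site weights and sums over sites. It is not akt's implementation: the function name `project`, the centring by a per-site mean, and the toy numbers are all assumptions made for the example, and akt's own genotype normalisation may differ.

```python
def project(genotypes, site_means, weights):
    """Project samples onto principal components (illustrative only).

    genotypes:  one list per sample of alt-allele counts (0/1/2), one per site
    site_means: per-site mean genotype in the reference cohort (assumed centring)
    weights:    weights[i][k] is the loading of site i on component k
    """
    n_pcs = len(weights[0])
    projections = []
    for sample in genotypes:
        coords = [0.0] * n_pcs
        for g, mean, w in zip(sample, site_means, weights):
            centred = g - mean
            for k in range(n_pcs):
                coords[k] += centred * w[k]
        projections.append(coords)
    return projections


# Hypothetical toy data: 2 samples, 3 sites, 2 components.
weights = [[0.5, 0.0], [0.0, 1.0], [-0.5, 0.0]]
site_means = [1.0, 1.0, 1.0]
genotypes = [[2, 1, 0], [1, 2, 1]]
coords = project(genotypes, site_means, weights)
```

Each row of `coords` plays the role of one line of the projections file: one coordinate per principal component for one sample.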
Calculates kinship coefficients (and other related metrics) from multi-sample VCF/BCFs. Can be used to detect (closely) related or duplicated samples.
Run the kinship calculation by giving akt a multi-sample vcf/bcf file:
Example usage:
$ akt kin multisample.bcf -R data/wgs.grch37.vcf.gz -n 32 > kin.txt
This outputs the following seven column format:
ID1 ID2 IBD0 IBD1 IBD2 KINSHIP NSNP
The default algorithm (-M 0) used to calculate IBD is taken from PLINK with some minor changes.
As with PLINK, we set KINSHIP = 0.5 * IBD2 + 0.25 * IBD1. Our IBD values may differ slightly from PLINK’s (by design) due to the following differences:
Normalization as follows:
relatives
code.
The second method (-M 1) uses the robust kinship coefficient estimate described in the KING paper. This may be preferable when your cohort has large amounts of population structure. Note that while the kinship coefficient differs from that of -M 0, the IBD estimates and output format are the same as for -M 0.
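The seven-column output and the PLINK-style relation KINSHIP = 0.5 * IBD2 + 0.25 * IBD1 used by the default method can be sketched as follows. The helper names, the assumption that columns are whitespace-separated with an integer NSNP, and the example line are hypothetical and not taken from akt itself.

```python
def parse_kin_line(line):
    """Parse one record in the format: ID1 ID2 IBD0 IBD1 IBD2 KINSHIP NSNP."""
    id1, id2, ibd0, ibd1, ibd2, kinship, nsnp = line.split()
    return {
        "id1": id1,
        "id2": id2,
        "ibd0": float(ibd0),
        "ibd1": float(ibd1),
        "ibd2": float(ibd2),
        "kinship": float(kinship),
        "nsnp": int(nsnp),  # assumes NSNP is written as an integer
    }


def kinship_from_ibd(ibd1, ibd2):
    """PLINK-style kinship from IBD sharing proportions (default method only)."""
    return 0.5 * ibd2 + 0.25 * ibd1


# Hypothetical record resembling an idealised parent/child pair (IBD1 = 1).
record = parse_kin_line("SampleA SampleB 0.0 1.0 0.0 0.25 10000")
expected = kinship_from_ibd(record["ibd1"], record["ibd2"])
```

Under this relation an idealised duplicate pair (IBD2 = 1) gives a kinship of 0.5 and an idealised parent/child pair (IBD1 = 1) gives 0.25. With -M 1 the KINSHIP column comes from the KING estimate instead, so the identity need not hold there.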
Takes the output from akt kin and detects/reconstructs pedigrees from the information. Can also flag duplicated samples and create lists of unrelated samples.
neato -Tpng -O out.allgraph or for family pedigrees dot -Tpng -O out.Fam0.graph.
./akt relatives allibd -g > allrelatives
The output contains duplicates, families and relationship types.
grep ^Dup allrelatives
Dup0 Sample0
Dup0 Sample1
...
grep ^Fam allrelatives
Fam0 Sample2
Fam0 Sample3
...
...
grep ^Type allrelatives
Type Fam0 Sample2 Sample3 Parent/Child
...
grep ^Unrel allrelatives
Sample0
Sample2
...
The file out.allgraph can be viewed with Graphviz, e.g. fdp out.allgraph -Tpng -O, and the families can be viewed using e.g. dot out.Fam0.graph -Tpng -O. The parent/child relations are also recorded in PLINK fam format in out.fam. If e.g. a sibling pair is found, the samples will appear in out.fam without parents. If the direction of the relationship can’t be determined, e.g. for parent/child duos, a random sample is assigned to be the parent in out.fam. The final column in the .fam file specifies how many potential parents the sample had.
Note that relatives is quite aggressive in its pedigree search, and can make errors when founders are missing (for example a mother and two children). We can remove false pedigrees via a simple Mendel consistency check:
akt kin --force -M 1 test.bcf > kinship.txt
akt relatives kinship.txt
akt mendel -p out.fam test.bcf > mendel.txt
python ~/workspace/akt/scripts/check_pedigree.py -fam out.fam -m mendel.txt > corrected.fam
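To make the idea of a Mendel consistency check concrete, here is a minimal sketch for biallelic trio genotypes. It is not the logic of akt mendel or check_pedigree.py: the function names and the representation of genotypes as allele pairs are assumptions made for illustration.

```python
def is_mendel_consistent(father, mother, child):
    """True if the child could carry one allele from each parent.

    Genotypes are pairs of alleles, e.g. (0, 1) for a heterozygote.
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def mendel_error_rate(trios):
    """Fraction of (father, mother, child) genotype triples that are inconsistent."""
    errors = sum(1 for f, m, c in trios if not is_mendel_consistent(f, m, c))
    return errors / len(trios)


# Hypothetical genotypes at four sites for one putative trio.
trios = [
    ((0, 0), (1, 1), (0, 1)),  # consistent
    ((0, 0), (1, 1), (0, 0)),  # inconsistent: mother has no 0 allele
    ((0, 1), (0, 1), (1, 1)),  # consistent
    ((0, 0), (0, 0), (0, 1)),  # inconsistent: neither parent has a 1 allele
]
rate = mendel_error_rate(trios)
```

A correctly inferred pedigree should show an error rate close to the genotyping error rate, while a false pedigree will typically show a much higher one. Where to place the cut-off is a separate decision that this sketch does not address.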
This takes the output from akt kin and creates a list of nominally unrelated individuals.
The algorithm has two options:
Simple greedy algorithm
Stochastic approach - for each sub-graph:
Repeat this i times, storing the largest unconnected set found. If the stochastic approach yields a larger unconnected set than the greedy approach then that is returned, else the greedy result is returned.
Note that this maximum independent set problem is NP-hard.
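The two options above can be sketched on a relatedness graph in which samples are nodes and related pairs are edges. The details here are assumptions rather than akt's code: the greedy pass drops the most-connected sample until no edges remain, the stochastic pass visits samples in random order and keeps each one that has no relative already kept, and for brevity the whole graph is treated as a single sub-graph.

```python
import random


def greedy_unrelated(graph):
    """Drop the sample with the most remaining relatives until no edges remain."""
    adj = {node: set(nbrs) for node, nbrs in graph.items()}
    while adj:
        # Ties are broken alphabetically so the result is deterministic.
        worst = max(sorted(adj), key=lambda n: len(adj[n]))
        if not adj[worst]:
            break
        for nbr in adj.pop(worst):
            adj[nbr].discard(worst)
    return set(adj)


def random_unrelated(graph, rng):
    """One stochastic pass: keep each sample that has no relative already kept."""
    order = sorted(graph)
    rng.shuffle(order)
    kept = set()
    for node in order:
        if not (graph[node] & kept):
            kept.add(node)
    return kept


def unrelated(graph, iterations=10, seed=0):
    """Return the stochastic result only if it beats the greedy one."""
    rng = random.Random(seed)
    best = greedy_unrelated(graph)
    for _ in range(iterations):
        candidate = random_unrelated(graph, rng)
        if len(candidate) > len(best):
            best = candidate
    return best


# Hypothetical graph: A is related to B, C and D; E is related to F.
graph = {
    "A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"},
    "E": {"F"}, "F": {"E"},
}
result = unrelated(graph)
```

Neither pass is guaranteed to find the largest possible set, which is why the best result over several random passes is compared against the greedy answer.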
This performs simple Mendelian phase-by-transmission, with the novelty that FORMAT/PS will be handled sensibly.
Note: this does not do anything clever with complex pedigrees; parental haplotypes are inferred as the transmitted/untransmitted haplotypes of the first listed child. For clever complex pedigree phasing, use Merlin, HAPI or duohmm (which one depends on your use case).
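A minimal sketch of phase-by-transmission at a single biallelic site in one trio is given below. It only illustrates the general idea: the function names and the tuple representation are assumptions, and the sketch ignores FORMAT/PS handling and everything else akt does with real VCF records.

```python
def phase_child(father, mother, child):
    """Return the child's alleles as (paternal, maternal).

    Returns None when the site is Mendel-inconsistent or the phase is
    ambiguous (for example when all three samples are heterozygous).
    """
    a, b = child
    options = set()
    if a in father and b in mother:
        options.add((a, b))
    if b in father and a in mother:
        options.add((b, a))
    return options.pop() if len(options) == 1 else None


def phase_parent(parent, transmitted):
    """Order a parent's alleles as (transmitted, untransmitted)."""
    alleles = list(parent)
    alleles.remove(transmitted)
    return (transmitted, alleles[0])


# Hypothetical site: father 1/1, mother 0/0, child 0/1.
child_phased = phase_child((1, 1), (0, 0), (0, 1))
# Hypothetical heterozygous mother who transmitted allele 1.
mother_phased = phase_parent((0, 1), 1)
```

Ordering each parent's alleles as (transmitted, untransmitted) relative to a single child mirrors the note above: with several children, only the first listed child would determine the parental haplotypes.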