akt

Name

akt — ancestry and kinship toolkit

Synopsis

akt [COMMAND] [OPTIONS]

DESCRIPTION

Ancestry and Kinship Tools (AKT) provides a handful of useful statistical genetics routines using the htslib API for input/output. This means it can seamlessly read BCF/VCF files and play nicely with BCFtools. Many command line arguments and parts of this manpage were also borrowed/stolen from BCFtools!

AKT is freely available under the GPL3 license. AKT relies on HTSlib and Eigen. Eigen is a header-only library for matrix algebra released under the MPL2. HTSlib is a released under the MIT/Expat License. Both libraries are included with AKT.

LIST OF COMMANDS

For a full list of available commands, run akt without arguments. For a full list of available options, run akt COMMAND without arguments.

pca .. principal component analysis
kin .. kinship coefficient calculation
relatives .. find pedigrees using the output from kin
unrelated .. generate a list of unrelated individuals using the output from kin
pedphase .. perform simple Mendelian phasing for duos/trios

COMMON OPTIONS

There are a number of options that are shared by multiple akt subcommands which we list here. We have tried to keep these consistent with BCFtools where possible.

-R, --regions-file FILE: a file (tabixed VCF or bed) containing the markers to perform analysis on. -R uses tabix jumping for fast look up
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,…]: same as -R but a string containing the region eg. "chr1:1000000-2000000"
-T, --targets-file FILE: same as -R but streams rather than tabix jumps ie. is slow
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,…]: same as -r but streams rather than tabix jumps ie. is slow
-S, --samples-file FILE: File of sample names to include or exclude if prefixed with "^"
-s, --samples samples: Comma-separated list of samples to include or exclude if prefixed with "^"
-@, --threads INT: Number of threads to use.
-o, --output-file FILE: Output file name
-O, --output-type b|u|z|v: Output format of vcf b=compressed bcf, z=compressed vcf, u=uncompressed bcf, v=uncompressed vcf

COMMANDS

akt pca [OPTIONS] FILE

Performs principal component analysis on a BCF/VCF. Can also be used to project samples onto pre-calculated principal components from another cohort. Uses a randomised SVD by default for very fast computation. WGS data is far denser than required for a meaningful PCA, it is recommended you provide a thinned set of sites via the -R command.

-o, --output FILE: see Common Options
-O, --output-type b|u|z|v: see Common Options
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,…]: see Common Options
-R, --regions-file file: see Common Options
-s, --samples [^]LIST: subset of samples to annotate, see also Common Options
-S, --samples-file FILE: subset of samples to annotate. If the samples are named differently in the target VCF and the -a, --annotations VCF, the name mapping can be given as "src_name dst_name\n", separated by whitespaces, each pair on a separate line.
-W, --weights FILE: Use precalculated principle components.
-N, --npca VALUE: Number of principle components to calculate.
-a, --alg: Use JacobiSVD PCA algorithm, which is exact to float precision but very slow.
-e, --extra:: VALUE: Default PCA calculation is the inexact RedSVD algorithm, which requires this parameter. The higher the number the more accurate principle components will be obtained.
-F, --svfile: File to output the singular values.
-C, --covdef: Which matrix to take the PCA of. 0 uses mean subtracted genotype matrix; 1 uses mean subtracted and normalized genotype matrix; 2 uses normalized covariance matrix with bias term subtracted from diagonal elements.

Examples:

./akt pca multisample.bcf -R data/wgs.grch37.vcf.gz -O b -o pca.bcf > pca.txt

The file pca.txt contains

SAMPLE_ID0 P0 P1 P2 P3 P4
SAMPLE_ID1 P0 P1 P2 P3 P4
...

The bcf file pca.bcf contains

bcftools query -f "%INFO/WEIGHT\n" pca.bcf
pc00 pc01 pc02 pc03 pc04
pc10 pc11 pc12 pc13 pc14
...

First index is the site index and second which is the coefficient (loading) that can be used to project other samples onto these principal components. For example we could project a new set of samples onto these same PCs via:

./akt pca new_multisample.bcf -W pca.bcf > projections

akt kin [OPTIONS] FILE

Calculates kinship coefficients (and other related metrics) from multi-sample VCF/BCFs. Can be used to detect (closely) related or duplicated samples.

-k, --minkin VALUE: Only output pairs with kinship coefficient greater than VALUE
-F, --freq-file FILE: a file containing population allele frequencies to use in kinship calculation -M, --method '0/1/2` type of estimator. 0:[plink (default) 1:king-robust 2:genetic-relationship-matrix
-a --aftag: VALUE allele frequency tag (default AF)
-@, --threads INT: see Common Options
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,…]: see Common Options
-R, --regions-file file: see Common Options
-s, --samples [^]LIST: subset of samples to annotate, see also Common Options
-S, --samples-file FILE: subset of samples to annotate. If the samples are named differently in the target VCF and the -a, --annotations VCF, the name mapping can be given as "src_name dst_name\n", separated by whitespaces, each pair on a separate line.

Run the kinship calculation by giving akt a multi-sample vcf/bcf file:

Example usage:

$ akt kin multisample.bcf -R data/wgs.grch37.vcf.gz -n 32 > kin.txt

This outputs the following seven column format:

ID1 ID2 IBD0 IBD1 IBD2 KINSHIP NSNP

Choice of estimator:

The default algorithm (-M 0) used to calculate IBD is taken from PLINK with some minor changes.

As with PLINK, we set KINSHIP = 0.5 * IBD2 + 0.25 * IBD1. Our IBD values may slighly differ to PLINK’s (by design) due to the following differences:

No bias correction since allele frequencies are assumed to be accurate
Normalization as follows:
- if IBD0 > 1: IBD0 = 1, IBD1 = 0, IBD2 = 0
- if IBD1 < 0: IBD1 = 0
- if IBD2 < 0: IBD2 = 0
- norm = IBD0 + IBD1 + IBD2
- IBD0 /= norm, IBD1 /= norm; IBD2 /= norm;
We do not follow PLINK which forces IBD to obey consistency conditions - this affects the clustering that is required for the relatives code.

The second method (-M1) uses the robust kinship coefficent estimate describing in the KING paper. This may be preferable when your cohort has large amounts of population structure. Note that while the kinship coefficient differs for -M0, the IBD estimates and output format are the same as for -M0.

akt relatives [OPTIONS] FILE

Takes the output from akt kin and detects/reconstructs pedigrees from the information. Can also flag duplicated samples and create lists of unrelated samples.

-k, --kmin VALUE: Only keep links with kinship above this threshold (searches in this set for duplicate, parent-child and sibling links).
-i, --its VALUE: Iteration parameter for unrelated set output.
-g, --graphout: If present output graphviz files. These can be visualised using e.g. neato -Tpng -O out.allgraph or for family pedigrees dot -Tpng -O out.Fam0.graph.
-p, --prefis PREFIX: Prefix for output files.

./akt relatives allibd -g > allrelatives

The output contains duplicates, families and relationship types.

grep ^Dup allrelatives
Dup0 Sample0
Dup0 Sample1
...
grep ^Fam allrelatives
Fam0 Sample2
Fam0 Sample3
...
...
grep ^Type allrelatives
Type Fam0 Sample2 Sample3 Parent/Child
...
grep ^Unrel allrelatives
Sample0
Sample2
...

The file out.allgraph can be viewed with gviz e.g. fdp out.allgraph -Tpng -O and the families can be viewed using e.g. dot out.Fam0.graph -Tpng -O. The parent child relations are also recorded in PLINK fam format in out.fam. If e.g. a sibling pair, is found the samples will appear in out.fam without parents. If the direction of the relationship can’t be determined e.g. for parent/child duos a random sample is assigned to be the parent in out.fam. The final column in the .fam file specifies how many potential parents the sample had.

Note that relatives is quite a aggressive in its pedigree search, and can make errors when founders are missing (for example a mother and two children). We can remove false pedigrees via a simple Mendel consistency check:

akt kin --force -M 1 test.bcf > kinship.txt
akt relatives kinship.txt
akt mendel -p out.fam test.bcf > mendel.txt
python ~/workspace/akt/scripts/check_pedigree.py -fam out.fam -m mendel.txt > corrected.fam

akt unrelated [OPTIONS] FILE

This takes the output from akt kin and creates a list of nominally unrelated individuals.

-k, --kmin VALUE: individuals with kinship coefficient > value are considered related (default 0.025)
-i, --its VALUE: setting value>0 enables stochastic approach (default 0)

The algorithm has two options:

Simple greedy algorithm

Select individual with smallest number of relatives (defined as kinship coefficient > k) and remove all their relatives.
Repeat 1. until remaining individuals are unrelated.

Stochastic approach - for each sub-graph:

Randomly select individuals within each sub-graph and remove their relatives
Repeat 1. until all individuals are unrelated

Repeat this i times, storing the largest unconnected set found. If the stochastic approach yields a larger unconnected set than the greedy approach then that is returned, else the greedy result is returned.

Note this maximal independent set problem is NP-hard.

akt pedphase [OPTIONS] FILE

This performs simple Mendelian phase-by-transmission, with the novelty that FORMAT/PS will be handled sensibly.

Note: this does not do anything clever with complex pedigrees, parental haplotypes are inferred as the transmitted/untransmitted haplotypes of the first listed child. For clever complex pedigree phasing, use Merlin, HAPI or duohmm (which one depends on your use case).

AUTHORS

Rudy Arthur and Jared O’Connell