gnomAD
Overview
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.
Publication
Koch, L., 2020. Exploring human genomic diversity with gnomAD. Nature Reviews Genetics, 21(8), pp.448-448.
Small Variants
VCF extraction
We currently extract the following info fields from gnomAD genome and exome VCF files:
##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate allele count for samples">
##INFO=<ID=AN,Number=A,Type=Integer,Description="Total number of alleles in samples">
##INFO=<ID=nhomalt,Number=A,Type=Integer,Description="Count of homozygous individuals in samples">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth of informative coverage for each sample; reads with MQ=255 or with bad mates are filtered">
##INFO=<ID=lcr,Number=0,Type=Flag,Description="Variant falls within a low complexity region">
##INFO=<ID=AC_afr,Number=A,Type=Integer,Description="Alternate allele count for samples of African-American ancestry">
##INFO=<ID=AN_afr,Number=A,Type=Integer,Description="Total number of alleles in samples of African-American ancestry">
##INFO=<ID=AF_afr,Number=A,Type=Float,Description="Alternate allele frequency in samples of African-American ancestry">
##INFO=<ID=nhomalt_afr,Number=A,Type=Integer,Description="Count of homozygous individuals in samples of African-American ancestry">
##INFO=<ID=AC_amr,Number=A,Type=Integer,Description="Alternate allele count for samples of Latino ancestry">
##INFO=<ID=AN_amr,Number=A,Type=Integer,Description="Total number of alleles in samples of Latino ancestry">
##INFO=<ID=nhomalt_amr,Number=A,Type=Integer,Description="Count of homozygous individuals in samples of Latino ancestry">
##INFO=<ID=AC_eas,Number=A,Type=Integer,Description="Alternate allele count for samples of East Asian ancestry">
##INFO=<ID=AN_eas,Number=A,Type=Integer,Description="Total number of alleles in samples of East Asian ancestry">
##INFO=<ID=nhomalt_eas,Number=A,Type=Integer,Description="Count of homozygous individuals in samples of East Asian ancestry">
##INFO=<ID=AC_female,Number=A,Type=Integer,Description="Alternate allele count for female samples">
##INFO=<ID=AN_female,Number=A,Type=Integer,Description="Total number of alleles in female samples">
##INFO=<ID=nhomalt_female,Number=A,Type=Integer,Description="Count of homozygous individuals in female samples">
##INFO=<ID=AC_nfe,Number=A,Type=Integer,Description="Alternate allele count for samples of non-Finnish European ancestry">
##INFO=<ID=AN_nfe,Number=A,Type=Integer,Description="Total number of alleles in samples of non-Finnish European ancestry">
##INFO=<ID=nhomalt_nfe,Number=A,Type=Integer,Description="Count of homozygous individuals in samples of non-Finnish European ancestry">
##INFO=<ID=AC_fin,Number=A,Type=Integer,Description="Alternate allele count for samples of Finnish ancestry">
##INFO=<ID=AN_fin,Number=A,Type=Integer,Description="Total number of alleles in samples of Finnish ancestry">
##INFO=<ID=nhomalt_fin,Number=A,Type=Integer,Description="Count of homozygous individuals in samples of Finnish ancestry">
##INFO=<ID=AC_asj,Number=A,Type=Integer,Description="Alternate allele count for samples of Ashkenazi Jewish ancestry">
##INFO=<ID=AN_asj,Number=A,Type=Integer,Description="Total number of alleles in samples of Ashkenazi Jewish ancestry">
##INFO=<ID=nhomalt_asj,Number=A,Type=Integer,Description="Count of homozygous individuals in samples of Ashkenazi Jewish ancestry">
##INFO=<ID=AC_oth,Number=A,Type=Integer,Description="Alternate allele count for samples of uncertain ancestry">
##INFO=<ID=AN_oth,Number=A,Type=Integer,Description="Total number of alleles in samples of uncertain ancestry">
##INFO=<ID=nhomalt_oth,Number=A,Type=Integer,Description="Count of homozygous individuals in samples of uncertain ancestry">
##INFO=<ID=AC_male,Number=A,Type=Integer,Description="Alternate allele count for male samples">
##INFO=<ID=AN_male,Number=A,Type=Integer,Description="Total number of alleles in male samples">
##INFO=<ID=nhomalt_male,Number=A,Type=Integer,Description="Count of homozygous individuals in male samples">
##INFO=<ID=controls_AC,Number=A,Type=Integer,Description="Alternate allele count for samples in the controls subset">
##INFO=<ID=controls_AN,Number=A,Type=Integer,Description="Total number of alleles in samples in the controls subset">
We also extract the following extra fields from the gnomAD exome VCF files:
##INFO=<ID=AC_sas,Number=A,Type=Integer,Description="Alternate allele count for samples of South Asian ancestry">
##INFO=<ID=AN_sas,Number=A,Type=Integer,Description="Total number of alleles in samples of South Asian ancestry">
##INFO=<ID=nhomalt_sas,Number=A,Type=Integer,Description="Count of homozygous individuals in samples of South Asian ancestry">
Computation
Using these, we compute the following:
- Coverage
- Allele count, homozygous count, allele number, and allele frequency for:
  - Global population
  - African/African Americans
  - Admixed Americans
  - Ashkenazi Jews
  - East Asians
  - Finnish
  - Non-Finnish Europeans
  - South Asians
  - Others (population not assigned)
  - Male
  - Female
  - Controls
Note
- Coverage = DP / AN. Frequencies are computed using AC/AN for each population.
- Please note that gnomAD currently has no genome sequencing data for the South Asian (SAS) population.
- Allele count, homozygous count, allele number, and allele frequency for the controls subset are provided for the global population only.
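The two formulas in the note above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and rounding coverage to the nearest integer and returning `None` when AN is zero are assumptions, not documented behavior.

```python
from typing import Optional


def allele_frequency(ac: int, an: int) -> Optional[float]:
    """Return AC / AN, or None when AN is zero (frequency is undefined)."""
    if an == 0:
        return None
    return ac / an


def coverage(dp: int, an: int) -> Optional[int]:
    """Return DP / AN as an integer, or None when AN is zero.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    if an == 0:
        return None
    return round(dp / an)
```

For the global population in the JSON sample on this page, `allele_frequency(5861, 30796)` is approximately 0.190317.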
Merging genomes and exomes
When merging the genomes and exomes, the allele counts and allele numbers will be summed across both of the data sets.
info
- For GRCh37, Illumina Connected Annotations currently uses gnomAD version 2.1 which contains both genomes and exomes data. Genomes and exomes data are merged in the output.
- For GRCh38, Illumina Connected Annotations currently uses gnomAD version 3.0 which doesn't contain the exomes data. Therefore, only genomes data are presented in the output.
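As a rough illustration of the merge described above, the sketch below sums allele counts and allele numbers from the two data sets and recomputes the frequency from the sums. The `Counts` class and `merge_counts` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Counts:
    ac: int  # allele count
    an: int  # allele number

    @property
    def af(self) -> Optional[float]:
        """Allele frequency AC / AN, or None when AN is zero."""
        return self.ac / self.an if self.an else None


def merge_counts(genome: Counts, exome: Counts) -> Counts:
    """Sum allele counts and allele numbers across genomes and exomes."""
    return Counts(ac=genome.ac + exome.ac, an=genome.an + exome.an)
```

Note that the merged frequency is derived from the summed counts rather than by averaging the two source frequencies, so the larger data set carries more weight.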
Filters
The following strategy will be used when there's a conflict in filter status:
Filter status | Genomes PASS | Genomes Filtered |
---|---|---|
Exomes PASS | PASS | Only use exome data |
Exomes Filtered | Only use genome data | Filtered |
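The filter table can be read as a small decision function. The sketch below is one possible encoding; the function name and the returned tuple are hypothetical. It also assumes that when both sources are filtered the data from both are still reported and flagged as filtered, which the table does not state explicitly.

```python
from typing import Tuple


def resolve_filter_conflict(genome_pass: bool, exome_pass: bool) -> Tuple[bool, bool, bool]:
    """Return (use_genome, use_exome, failed_filter) for one variant."""
    if genome_pass and exome_pass:
        return True, True, False   # PASS: merge both data sets
    if exome_pass:
        return False, True, False  # genomes filtered: only use exome data
    if genome_pass:
        return True, False, False  # exomes filtered: only use genome data
    # Both filtered. Assumption: keep both and flag the variant as filtered.
    return True, True, True
```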
VCF download instructions
https://gnomad.broadinstitute.org/downloads
JSON output
"gnomad":{
"coverage":20,
"allAf":0.190317,
"maleAf":0.193,
"femaleAf": 0.1935,
"afrAf":0.222876,
"amrAf":0.121394,
"easAf":0.239802,
"finAf":0.136833,
"nfeAf":0.181282,
"asjAf":0.258278,
"othAf":0.186094,
"allAn":30796,
"maleAn":15096,
"femaleAn":15700,
"afrAn":8664,
"amrAn":832,
"easAn":1618,
"finAn":3486,
"nfeAn":14916,
"asjAn":302,
"othAn":978,
"allAc":5861,
"maleAc":2930,
"femaleAc": 2931,
"afrAc":1931,
"amrAc":101,
"easAc":388,
"finAc":477,
"nfeAc":2704,
"asjAc":78,
"othAc":182,
"allHc":561,
"afrHc":208,
"amrHc":6,
"easHc":42,
"finHc":31,
"nfeHc":242,
"asjHc":13,
"othHc":19,
"maleHc":280,
"femaleHc":281,
"controlsAllAf":0.190317,
"controlsAllAn":30796,
"controlsAllAc":5861,
"lowComplexityRegion":true,
"failedFilter":true
}
Field | Type | Notes |
---|---|---|
coverage | int | average coverage (non-negative integer values) |
allAf | float | allele frequency for all populations. Range: 0 - 1.0 |
maleAf | float | allele frequency for male population. Range: 0 - 1.0 |
femaleAf | float | allele frequency for female population. Range: 0 - 1.0 |
controlsAllAf | float | allele frequency for the controls subset. Range: 0 - 1.0 |
allAc | int | allele count for all populations. Integer. |
maleAc | int | allele count for male population. Integer. |
femaleAc | int | allele count for female population. Integer. |
controlsAllAc | int | allele count for the controls subset. Integer. |
allAn | int | allele number for all populations. Non-zero integer. |
maleAn | int | allele number for male population. Non-zero integer. |
femaleAn | int | allele number for female population. Non-zero integer. |
controlsAllAn | int | allele number for the controls subset. Non-zero integer. |
allHc | int | count of homozygous individuals for all populations. Non-negative integer. |
maleHc | int | count of homozygous individuals for male population. Non-negative integer. |
femaleHc | int | count of homozygous individuals for female population. Non-negative integer. |
afrAf | float | allele frequency for the African / African American population. Range: 0 - 1.0 |
afrAc | int | allele count for the African / African American population. Integer. |
afrAn | int | allele number for the African / African American population. Non-zero integer. |
afrHc | int | count of homozygous individuals for African / African American population. Non-negative integer. |
amrAf | float | allele frequency for the Latino population. Range: 0 - 1.0 |
amrAc | int | allele count for the Latino population. Integer. |
amrAn | int | allele number for the Latino population. Non-zero integer. |
amrHc | int | count of homozygous individuals for Latino population. Non-negative integer. |
easAf | float | allele frequency for the East Asian population. Range: 0 - 1.0 |
easAc | int | allele count for the East Asian population. Integer. |
easAn | int | allele number for the East Asian population. Non-zero integer. |
easHc | int | count of homozygous individuals for East Asian population. Non-negative integer. |
finAf | float | allele frequency for the Finnish population. Range: 0 - 1.0 |
finAc | int | allele count for the Finnish population. Integer. |
finAn | int | allele number for the Finnish population. Non-zero integer. |
finHc | int | count of homozygous individuals for Finnish population. Non-negative integer. |
nfeAf | float | allele frequency for the Non-Finnish European population. Range: 0 - 1.0 |
nfeAc | int | allele count for the Non-Finnish European population. Integer. |
nfeAn | int | allele number for the Non-Finnish European population. Non-zero integer. |
nfeHc | int | count of homozygous individuals for Non-Finnish European population. Non-negative integer. |
othAf | float | allele frequency for the Other population. Range: 0 - 1.0 |
othAc | int | allele count for the Other population. Integer. |
othAn | int | allele number for the Other population. Non-zero integer. |
othHc | int | count of homozygous individuals for Other population. Non-negative integer. |
asjAf | float | allele frequency for the Ashkenazi Jewish population. Range: 0 - 1.0 |
asjAc | int | allele count for the Ashkenazi Jewish population. Integer. |
asjAn | int | allele number for the Ashkenazi Jewish population. Non-zero integer. |
asjHc | int | count of homozygous individuals for the Ashkenazi Jewish population. Non-negative integer. |
sasAf | float | allele frequency for the South Asian population. Range: 0 - 1.0 |
sasAc | int | allele count for the South Asian population. Integer. |
sasAn | int | allele number for the South Asian population. Non-zero integer. |
sasHc | int | count of homozygous individuals for the South Asian population. Non-negative integer. |
failedFilter | bool | True if this variant failed any filters (Note: we do not list the failed filters) |
lowComplexityRegion | bool | True if this variant is located in a low complexity region. |
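A consumer of this JSON object may want to sanity-check that each reported frequency agrees with its count and number. The sketch below is a hypothetical helper, not part of Illumina Connected Annotations; the prefix list and the tolerance are assumptions chosen for illustration.

```python
# Field prefixes used in the table above; each has Af, Ac and An variants.
POPULATIONS = ["all", "male", "female", "afr", "amr", "eas",
               "fin", "nfe", "asj", "oth", "sas"]


def check_frequencies(gnomad: dict, tolerance: float = 1e-3) -> list:
    """Return the prefixes whose Af differs from Ac / An by more than tolerance.

    Populations with a missing field or a zero allele number are skipped.
    """
    mismatches = []
    for pop in POPULATIONS:
        af = gnomad.get(pop + "Af")
        ac = gnomad.get(pop + "Ac")
        an = gnomad.get(pop + "An")
        if af is None or ac is None or not an:
            continue
        if abs(af - ac / an) > tolerance:
            mismatches.append(pop)
    return mismatches
```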
Building the supplementary files
The gnomAD .nsa for Illumina Connected Annotations can be built using the SAUtils command's gnomad subcommand. We will describe building gnomAD version 3.1 here.
Source data files
Input VCF files (one per chromosome) and a .version file are required in a folder to build the .nsa file. For example, my directory contains:
chr10.vcf.bgz chr22.vcf.bgz
chr11.vcf.bgz chr2.vcf.bgz
chr12.vcf.bgz chr3.vcf.bgz
chr13.vcf.bgz chr4.vcf.bgz
chr14.vcf.bgz chr5.vcf.bgz
chr15.vcf.bgz chr6.vcf.bgz
chr16.vcf.bgz chr7.vcf.bgz
chr17.vcf.bgz chr8.vcf.bgz
chr18.vcf.bgz chr9.vcf.bgz
chr19.vcf.bgz chrM.vcf.bgz
chr1.vcf.bgz chrX.vcf.bgz
chr20.vcf.bgz chrY.vcf.bgz
chr21.vcf.bgz gnomad.r3.1.version
The version file is a text file with the following content.
NAME=gnomAD
VERSION=3.1
DATE=2020-10-29
DESCRIPTION=Allele frequencies from Genome Aggregation Database (gnomAD)
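Before running the build, it can be handy to confirm that the version file is well formed. The sketch below parses the `KEY=VALUE` lines shown above into a dictionary; `parse_version_file` is a hypothetical helper, and treating the first `=` as the separator is an assumption.

```python
def parse_version_file(text: str) -> dict:
    """Parse KEY=VALUE lines into a dict, skipping blank or malformed lines."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields
```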
The help menu for the utility is as follows:
SAUtils.dll gnomad
---------------------------------------------------------------------------
SAUtils (c) 2021 Illumina, Inc.
Stromberg, Roy, Lajugie, Jiang, Li, and Kang 3.17.0
---------------------------------------------------------------------------
USAGE: dotnet SAUtils.dll gnomad [options]
Reads provided supplementary data files and populates tsv files
OPTIONS:
--ref, -r <VALUE> compressed reference sequence file
--genome, -g <VALUE> input directory containing VCF (and .version)
files with genomic frequencies
--exome, -e <VALUE> input directory containing VCF (and .version)
files with exomic frequencies
--temp, -t <VALUE> output temp directory for intermediate (per chrom)
NSA files
--out, -o <VALUE> output directory for NSA file
--help, -h displays the help menu
--version, -v displays the version
Here is a sample execution:
dotnet SAUtils.dll gnomad \
--ref ~/References/7/Homo_sapiens.GRCh38.Nirvana.dat --genome genomes/ \
--out ~/SupplementaryDatabase/63/GRCh38 --temp ~/ExternalDataSources/gnomAD/3.1/GRCh38/temp
LoF Gene Metrics
Tab delimited file example
gene transcript obs_mis exp_mis oe_mis mu_mis possible_mis obs_mis_pphen exp_mis_pphen oe_mis_pphen possible_mis_pphen obs_syn exp_syn oe_syn mu_syn possible_syn obs_lof mu_lof possible_lof exp_lof pLI pNull pRec oe_lof oe_syn_lower oe_syn_upper oe_mis_lower oe_mis_upper oe_lof_lower oe_lof_upper constraint_flag syn_z mis_z lof_z oe_lof_upper_rank oe_lof_upper_bin oe_lof_upper_bin_6 n_sites classic_caf max_af no_lofs obs_het_lof obs_hom_lof defined p exp_hom_lof classic_caf_afr classic_caf_amr classic_caf_asj classic_caf_eas classic_caf_fin classic_caf_nfe classic_caf_oth classic_caf_sas p_afr p_amr p_asj p_eas p_fin p_nfe p_oth p_sas transcript_type gene_id transcript_level cds_length num_coding_exons gene_type gene_length exac_pLI exac_obs_lof exac_exp_lof exac_oe_lof brain_expression chromosome start_position end_position
MED13 ENST00000397786 871 1.1178e+03 7.7921e-01 5.5598e-05 14195 314 5.2975e+02 5.9273e-01 6708 422 3.8753e+02 1.0890e+00 1.9097e-05 4248 0 4.9203e-06 1257 9.8429e+01 1.0000e+00 8.9436e-40 1.8383e-16 0.0000e+00 1.0050e+00 1.1800e+00 7.3600e-01 8.2400e-01 0.0000e+00 3.0000e-02 -1.3765e+00 2.6232e+00 9.1935e+00 0 0 0 2 1.2058e-05 8.0492e-06 124782 3 0 124785 1.2021e-05 1.8031e-05 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 9.2812e-05 8.8571e-06 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 9.2760e-05 8.8276e-06 0.0000e+00 0.0000e+00 protein_coding ENSG00000108510 2 6522 30 protein_coding 122678 1.0000e+00 0 6.4393e+01 0.0000e+00 NA 17 60019966 60142643
JSON key to TSV column mapping
JSON key | TSV column | Description |
---|---|---|
pLi | pLI | probability of being intolerant of a single loss-of-function variant (like haploinsufficient genes, observed ~ 0.1*expected) |
pNull | pNull | probability of being completely tolerant of loss of function variation (observed = expected) |
pRec | pRec | probability of being intolerant of two loss of function variants (like recessive genes, observed ~ 0.5*expected) |
synZ | syn_z | corrected synonymous Z score |
misZ | mis_z | corrected missense Z score |
loeuf | oe_lof_upper | loss of function observed/expected upper bound fraction (LOEUF) |
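The mapping above can be applied with a small reader like the one sketched here. It assumes a tab-delimited input with a header row containing at least the `gene` and `gene_id` columns, and it assumes that `NA` or empty values are simply omitted from the result; `gene_metrics` and `COLUMN_TO_KEY` are hypothetical names.

```python
import csv
import io

# TSV column -> JSON key, taken from the mapping table above.
COLUMN_TO_KEY = {
    "pLI": "pLi",
    "pNull": "pNull",
    "pRec": "pRec",
    "syn_z": "synZ",
    "mis_z": "misZ",
    "oe_lof_upper": "loeuf",
}


def gene_metrics(tsv_text: str):
    """Yield (gene symbol, Ensembl gene id, metrics dict) for each TSV row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        metrics = {}
        for column, key in COLUMN_TO_KEY.items():
            value = row.get(column)
            # Assumption: missing values ("NA" or empty) are left out.
            if value not in (None, "", "NA"):
                metrics[key] = float(value)
        yield row["gene"], row["gene_id"], metrics
```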
Gene symbol update
The input file provides Ensembl gene IDs for each entry. We observed that they were unique, while gene symbols may be repeated (multiple lines may have the same gene symbol). Since Ensembl gene IDs are more stable, and the Illumina Connected Annotations transcript cache contains Ensembl gene IDs, we use these IDs to extract the gene symbols from the transcript cache. For example, if ENSG0001 has gene symbol GENE1 in the input but the Illumina Connected Annotations cache says ENSG0001 maps to GENE2, we use GENE2 as the gene symbol for that entry.
Conflict resolution
gnomAD uses Ensembl gene IDs as unique identifiers in the source file, but Illumina Connected Annotations uses HGNC gene symbols. Multiple Ensembl gene IDs can map to the same HGNC symbol, which may result in conflicts, as in the following two MDGA2 entries:
MDGA2 ENST00000426342 306 4.0043e+02 7.6419e-01 2.1096e-05 4724 78 1.6525e+02 4.7202e-01 1923 125 1.3737e+02 9.0993e-01 7.1973e-06 1413 4 2.0926e-06 453 3.8316e+01 9.9922e-01 8.6490e-12 7.8128e-04 1.0440e-01 7.8600e-01 1.0560e+00 6.9500e-01 8.4000e-01 5.0000e-02 2.3900e-01 8.2988e-01 1.6769e+00 5.1372e+00 1529 0 0 7 2.8103e-05 4.0317e-06 124784 7 0 124791 2.8047e-05 9.8167e-05 0.0000e+00 2.8962e-05 0.0000e+00 0.0000e+00 0.0000e+00 3.5391e-05 1.6672e-04 3.2680e-05 0.0000e+00 2.8962e-05 0.0000e+00 0.0000e+00 0.0000e+00 3.5308e-05 1.6492e-04 3.2678e-05 protein_coding ENSG00000139915 2 2181 13 protein_coding 835332 9.9322e-01 3 2.7833e+01 1.0779e-01 NA 14 47308826 48144157
MDGA2 ENST00000439988 438 5.5311e+02 7.9189e-01 2.9490e-05 6608 105 2.0496e+02 5.1228e-01 2386 180 1.9491e+02 9.2351e-01 9.8371e-06 2048 11 2.8074e-06 627 5.1882e+01 6.6457e-01 5.5841e-10 3.3543e-01 2.1202e-01 8.1700e-01 1.0450e+00 7.3100e-01 8.5700e-01 1.3200e-01 3.5100e-01 8.3940e-01 1.7393e+00 5.2595e+00 2989 1 0 9 3.6173e-05 4.0463e-06 124782 9 0 124791 3.6061e-05 1.6228e-04 6.4986e-05 2.8962e-05 0.0000e+00 0.0000e+00 0.0000e+00 4.4275e-05 1.6672e-04 3.2680e-05 6.4577e-05 2.8962e-05 0.0000e+00 0.0000e+00 0.0000e+00 4.4135e-05 1.6492e-04 3.2678e-05 protein_coding ENSG00000272781 3 3075 17 protein_coding 832866 NA NA NA NA NA 14 47311134 48143999
In such cases, Illumina Connected Annotations chooses the entry with the smallest "LOEUF" value. The reason for choosing this value can be highlighted by the following table:
LOEUF decile | Haplo-insufficient | Autosomal Dominant | Autosomal Recessive | Olfactory Genes |
---|---|---|---|---|
0-10% | 104 | 140 | 36 | 0 |
10-20% | 47 | 128 | 72 | 1 |
20-30% | 17 | 86 | 112 | 0 |
30-40% | 8 | 80 | 173 | 4 |
40-50% | 7 | 65 | 206 | 8 |
50-60% | 4 | 54 | 207 | 6 |
60-70% | 0 | 46 | 154 | 18 |
70-80% | 2 | 49 | 120 | 49 |
80-90% | 0 | 34 | 58 | 96 |
90-100% | 0 | 26 | 40 | 174 |
Note
- Table source: https://www.biorxiv.org/content/biorxiv/early/2019/01/28/531210.full-text.pdf
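- This table indicates that genes in the lower LOEUF deciles are more often haploinsufficient or autosomal dominant disease genes, i.e. a lower LOEUF score marks a gene that is less tolerant of loss-of-function variation.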
- Only 15 out of 19685 genes have conflicting entries.
List of genes with conflicting entries
MDGA2:
{"pLI":9.99e-1,"pRec":7.81e-4,"pNull":8.65e-12,"synZ":8.30e-1,"misZ":1.68e0,"loeuf":2.39e-1}
{"pLI":6.65e-1,"pRec":3.35e-1,"pNull":5.58e-10,"synZ":8.39e-1,"misZ":1.74e0,"loeuf":3.51e-1}
CRYBG3:
{"pLI":9.27e-5,"pRec":1.00e0,"pNull":1.88e-7,"synZ":1.82e0,"misZ":4.68e-1,"loeuf":4.93e-1}
{"pLI":2.69e-4,"pRec":1.00e0,"pNull":1.20e-4,"synZ":2.63e0,"misZ":9.80e-1,"loeuf":5.98e-1}
CHTF8:
{"pLI":8.29e-1,"pRec":1.67e-1,"pNull":3.21e-3,"synZ":1.94e0,"misZ":9.48e-1,"loeuf":5.13e-1}
{"pLI":3.73e-1,"pRec":5.84e-1,"pNull":4.29e-2,"synZ":3.33e-1,"misZ":2.91e-1,"loeuf":9.92e-1}
SEPT1:
{"pLI":6.77e-8,"pRec":8.90e-1,"pNull":1.10e-1,"synZ":1.58e-1,"misZ":1.57e0,"loeuf":9.68e-1}
{"pLI":1.96e-8,"pRec":6.71e-1,"pNull":3.29e-1,"synZ":1.68e-1,"misZ":1.41e0,"loeuf":1.08e0}
ARL14EPL:
{"pLI":3.48e-2,"pRec":8.38e-1,"pNull":1.28e-1,"synZ":3.56e-1,"misZ":-1.87e-1,"loeuf":1.23e0}
{"pLI":3.23e-2,"pRec":8.29e-1,"pNull":1.38e-1,"synZ":1.15e0,"misZ":-4.05e-1,"loeuf":1.26e0}
UGT2A1:
{"pLI":2.90e-13,"pRec":1.40e-1,"pNull":8.60e-1,"synZ":-1.29e0,"misZ":-1.77e0,"loeuf":1.18e0}
{"pLI":3.88e-17,"pRec":2.87e-3,"pNull":9.97e-1,"synZ":-8.00e-1,"misZ":-1.40e0,"loeuf":1.53e0}
LTB4R2:
{"pLI":4.39e-4,"pRec":6.71e-1,"pNull":3.29e-1,"synZ":-5.24e-1,"misZ":-2.96e-1,"loeuf":1.40e0}
{"pLI":1.38e-5,"pRec":4.12e-1,"pNull":5.88e-1,"synZ":-4.58e-1,"misZ":-2.02e-1,"loeuf":1.54e0}
CDRT1:
{"pLI":4.98e-14,"pRec":5.31e-1,"pNull":4.69e-1,"synZ":8.18e-1,"misZ":6.57e-1,"loeuf":1.00e0}
{"pLI":3.50e-3,"pRec":6.37e-1,"pNull":3.59e-1,"synZ":4.89e-1,"misZ":6.90e-1,"loeuf":1.63e0}
MUC3A:
{"pLI":1.48e-10,"pRec":5.76e-1,"pNull":4.24e-1,"synZ":5.81e-2,"misZ":-6.01e-1,"loeuf":1.06e0}
{"pLI":4.03e-1,"pRec":4.79e-1,"pNull":1.17e-1,"synZ":4.05e-2,"misZ":-1.60e-1,"loeuf":1.70e0}
COG8:
{"pLI":2.97e-9,"pRec":5.04e-1,"pNull":4.96e-1,"synZ":-1.35e0,"misZ":-9.37e-2,"loeuf":1.13e0}
{"pLI":2.31e-3,"pRec":5.47e-1,"pNull":4.50e-1,"synZ":-4.94e-1,"misZ":-1.48e-1,"loeuf":1.76e0}
AC006486.1:
{"pLI":9.37e-1,"pRec":6.27e-2,"pNull":2.47e-4,"synZ":1.44e0,"misZ":2.12e0,"loeuf":3.41e-1}
{"pLI":1.14e-1,"pRec":6.16e-1,"pNull":2.70e-1,"synZ":-7.57e-2,"misZ":8.33e-2,"loeuf":1.84e0}
AL645922.1:
{"pLI":4.67e-16,"pRec":1.00e0,"pNull":4.15e-5,"synZ":7.99e-1,"misZ":1.61e0,"loeuf":6.92e-1}
{"pLI":1.60e-3,"pRec":2.78e-1,"pNull":7.21e-1,"synZ":-7.30e-2,"misZ":3.21e-1,"loeuf":1.96e0}
NBPF20:
{"pLI":1.42e-7,"pRec":3.40e-2,"pNull":9.66e-1,"synZ":-1.86e0,"misZ":-2.88e0,"loeuf":1.97e0}
{"pLI":1.92e-22,"pRec":7.96e-6,"pNull":1.00e0,"synZ":-9.73e0,"misZ":-7.67e0,"loeuf":1.97e0}
PRAMEF11:
{"pLI":6.16e-4,"pRec":7.42e-1,"pNull":2.58e-1,"synZ":-4.02e0,"misZ":-3.69e0,"loeuf":1.31e0}
{"synZ":-3.33e0,"misZ":-2.59e0}
FAM231D:
{"synZ":-1.98e0,"misZ":-1.44e0}
{"synZ":1.07e0,"misZ":3.13e-1}
Conflict resolution
- Pick the entry with the lowest LOEUF score
- If the same, pick the lowest pLI
- Otherwise pick the entry with the max absolute value of synZ + misZ
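The three rules above can be sketched as a comparison between two candidate entries. This is an interpretation, not the actual implementation: it assumes a rule is only applied when both entries carry the value in question, and it reads "max absolute value of synZ + misZ" as `abs(synZ + misZ)`. The function name `pick_entry` is hypothetical, and the dictionaries use the same keys as the conflicting entries listed above.

```python
def pick_entry(first: dict, second: dict) -> dict:
    """Choose one of two conflicting gene entries."""
    # Rules 1 and 2: lowest LOEUF, then lowest pLI.
    # Assumption: skip a rule when either entry lacks the value or they tie.
    for key in ("loeuf", "pLI"):
        a, b = first.get(key), second.get(key)
        if a is not None and b is not None and a != b:
            return first if a < b else second

    # Rule 3: largest |synZ + misZ| (one reading of the rule text).
    def z_magnitude(entry: dict) -> float:
        return abs(entry.get("synZ", 0.0) + entry.get("misZ", 0.0))

    return first if z_magnitude(first) >= z_magnitude(second) else second
```

Under these assumptions, MDGA2 resolves on LOEUF, NBPF20 (identical LOEUF) resolves on pLI, and FAM231D (no LOEUF or pLI) resolves on the Z scores.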
Download URL
JSON output
"gnomAD":{
"pLi":1.00e0,
"pNull":8.94e-40,
"pRec":1.84e-16,
"synZ":-8.44e-2,
"misZ":5.96e-1,
"loeuf":1.13e0
}
Field | Type | Notes |
---|---|---|
pLi | float | probability of being intolerant of a single loss-of-function variant (like haploinsufficient genes, observed ~ 0.1*expected) |
pNull | float | probability of being completely tolerant of loss of function variation (observed = expected) |
pRec | float | probability of being intolerant of two loss of function variants (like recessive genes, observed ~ 0.5*expected) |
synZ | float | corrected synonymous Z score |
misZ | float | corrected missense Z score |
loeuf | float | loss of function observed/expected upper bound fraction (LOEUF) |
Structural Variants
Publication
Collins, R.L., Brand, H., Karczewski, K.J. et al. 2020. A structural variation reference for medical and population genetics. Nature 581, pp.444–451. https://doi.org/10.1038/s41586-020-2287-8
Note: The gnomAD structural variant annotations are in a preview stage at the moment. Currently, the annotations do not include translocation breakends. Future updates will include a better way of annotating the structural variants.
Source Files
Bed Example
The BED file for GRCh37 was obtained from the original source.
#chrom start end name svtype ALGORITHMS BOTHSIDES_SUPPORT CHR2 CPX_INTERVALS CPX_TYPE END2 END EVIDENCE HIGH_SR_BACKGROUND PCRPLUS_DEPLETED PESR_GT_OVERDISPERSION POS2 PROTEIN_CODING__COPY_GAIN PROTEIN_CODING__DUP_LOF PROTEIN_CODING__DUP_PARTIAL PROTEIN_CODING__INTERGENIC PROTEIN_CODING__INTRONIC PROTEIN_CODING__INV_SPAN PROTEIN_CODING__LOF PROTEIN_CODING__MSV_EXON_OVR PROTEIN_CODING__NEAREST_TSS PROTEIN_CODING__PROMOTER PROTEIN_CODING__UTR SOURCE STRANDS SVLEN SVTYPE UNRESOLVED_TYPE UNSTABLE_AF_PCRPLUS VARIABLE_ACROSS_BATCHES AN AC AF N_BI_GENOS N_HOMREF N_HET N_HOMALT FREQ_HOMREF FREQ_HET FREQ_HOMALT MALE_AN MALE_AC MALE_AF MALE_N_BI_GENOS MALE_N_HOMREF MALE_N_HET MALE_N_HOMALT MALE_FREQ_HOMREF MALE_FREQ_HET MALE_FREQ_HOMALT MALE_N_HEMIREF MALE_N_HEMIALT MALE_FREQ_HEMIREF MALE_FREQ_HEMIALT PAR FEMALE_AN FEMALE_AC FEMALE_AF FEMALE_N_BI_GENOS FEMALE_N_HOMREF FEMALE_N_HET FEMALE_N_HOMALT FEMALE_FREQ_HOMREF FEMALE_FREQ_HET FEMALE_FREQ_HOMALT POPMAX_AF AFR_AN AFR_AC AFR_AF AFR_N_BI_GENOS AFR_N_HOMREF AFR_N_HET AFR_N_HOMALT AFR_FREQ_HOMREF AFR_FREQ_HET AFR_FREQ_HOMALT AFR_MALE_AN AFR_MALE_AC AFR_MALE_AF AFR_MALE_N_BI_GENOS AFR_MALE_N_HOMREF AFR_MALE_N_HET AFR_MALE_N_HOMALT AFR_MALE_FREQ_HOMREF AFR_MALE_FREQ_HET AFR_MALE_FREQ_HOMALT AFR_MALE_N_HEMIREF AFR_MALE_N_HEMIALT AFR_MALE_FREQ_HEMIREF AFR_MALE_FREQ_HEMIALT AFR_FEMALE_AN AFR_FEMALE_AC AFR_FEMALE_AF AFR_FEMALE_N_BI_GENOS AFR_FEMALE_N_HOMREF AFR_FEMALE_N_HET AFR_FEMALE_N_HOMALT AFR_FEMALE_FREQ_HOMREF AFR_FEMALE_FREQ_HET AFR_FEMALE_FREQ_HOMALT AMR_AN AMR_AC AMR_AF AMR_N_BI_GENOS AMR_N_HOMREF AMR_N_HET AMR_N_HOMALT AMR_FREQ_HOMREF AMR_FREQ_HET AMR_FREQ_HOMALT AMR_MALE_AN AMR_MALE_AC AMR_MALE_AF AMR_MALE_N_BI_GENOS AMR_MALE_N_HOMREF AMR_MALE_N_HET AMR_MALE_N_HOMALT AMR_MALE_FREQ_HOMREF AMR_MALE_FREQ_HET AMR_MALE_FREQ_HOMALT AMR_MALE_N_HEMIREF AMR_MALE_N_HEMIALT AMR_MALE_FREQ_HEMIREF AMR_MALE_FREQ_HEMIALT AMR_FEMALE_AN AMR_FEMALE_AC AMR_FEMALE_AF AMR_FEMALE_N_BI_GENOS AMR_FEMALE_N_HOMREF AMR_FEMALE_N_HET AMR_FEMALE_N_HOMALT AMR_FEMALE_FREQ_HOMREF AMR_FEMALE_FREQ_HET AMR_FEMALE_FREQ_HOMALT EAS_AN EAS_AC EAS_AF EAS_N_BI_GENOS EAS_N_HOMREF EAS_N_HET EAS_N_HOMALT EAS_FREQ_HOMREF EAS_FREQ_HET EAS_FREQ_HOMALT EAS_MALE_AN EAS_MALE_AC EAS_MALE_AF EAS_MALE_N_BI_GENOS EAS_MALE_N_HOMREF EAS_MALE_N_HET EAS_MALE_N_HOMALT EAS_MALE_FREQ_HOMREF EAS_MALE_FREQ_HET EAS_MALE_FREQ_HOMALT EAS_MALE_N_HEMIREF EAS_MALE_N_HEMIALT EAS_MALE_FREQ_HEMIREF EAS_MALE_FREQ_HEMIALT EAS_FEMALE_AN EAS_FEMALE_AC EAS_FEMALE_AF EAS_FEMALE_N_BI_GENOS EAS_FEMALE_N_HOMREF EAS_FEMALE_N_HET EAS_FEMALE_N_HOMALT EAS_FEMALE_FREQ_HOMREF EAS_FEMALE_FREQ_HET EAS_FEMALE_FREQ_HOMALT EUR_AN EUR_AC EUR_AF EUR_N_BI_GENOS EUR_N_HOMREF EUR_N_HET EUR_N_HOMALT EUR_FREQ_HOMREF EUR_FREQ_HET EUR_FREQ_HOMALT EUR_MALE_AN EUR_MALE_AC EUR_MALE_AF EUR_MALE_N_BI_GENOS EUR_MALE_N_HOMREF EUR_MALE_N_HET EUR_MALE_N_HOMALT EUR_MALE_FREQ_HOMREF EUR_MALE_FREQ_HET EUR_MALE_FREQ_HOMALT EUR_MALE_N_HEMIREF EUR_MALE_N_HEMIALT EUR_MALE_FREQ_HEMIREF EUR_MALE_FREQ_HEMIALT EUR_FEMALE_AN EUR_FEMALE_AC EUR_FEMALE_AF EUR_FEMALE_N_BI_GENOS EUR_FEMALE_N_HOMREF EUR_FEMALE_N_HET EUR_FEMALE_N_HOMALT EUR_FEMALE_FREQ_HOMREF EUR_FEMALE_FREQ_HET EUR_FEMALE_FREQ_HOMALT OTH_AN OTH_AC OTH_AF OTH_N_BI_GENOS OTH_N_HOMREF OTH_N_HET OTH_N_HOMALT OTH_FREQ_HOMREF OTH_FREQ_HET OTH_FREQ_HOMALT OTH_MALE_AN OTH_MALE_AC OTH_MALE_AF OTH_MALE_N_BI_GENOS OTH_MALE_N_HOMREF OTH_MALE_N_HET OTH_MALE_N_HOMALT OTH_MALE_FREQ_HOMREF OTH_MALE_FREQ_HET OTH_MALE_FREQ_HOMALT OTH_MALE_N_HEMIREF OTH_MALE_N_HEMIALT OTH_MALE_FREQ_HEMIREF OTH_MALE_FREQ_HEMIALT OTH_FEMALE_AN OTH_FEMALE_AC OTH_FEMALE_AF OTH_FEMALE_N_BI_GENOS OTH_FEMALE_N_HOMREF OTH_FEMALE_N_HET OTH_FEMALE_N_HOMALT OTH_FEMALE_FREQ_HOMREF OTH_FEMALE_FREQ_HET OTH_FEMALE_FREQ_HOMALT FILTER
1 10641 10642 gnomAD-SV_v2.1_BND_1_1 BND manta False 15 NA NA 10643 10643 PE,SR False False True 10642 NA NA NA False NA NA NA NA NA NA NA NA NA -1 BND SINGLE_ENDER_-- False False 21366 145 0.006785999983549118 10683 10543 135 5 0.9868950247764587 0.012636899948120117 0.00046803298755548894 10866 69 0.00634999992325902 5433 5366 65 2 0.987667977809906 0.011963900178670883 0.000368120992789045 NA NA NA NA False 10454 76 0.007269999943673615227 5154 70 3 0.9860339760780334 0.013392000459134579 0.0005739430198445916 0.015956999734044075 93972 0.007660999894142151 4699 4629 68 2 0.9851030111312866 0.014471200294792652 0.0004256220126990229 5154 33 0.006403000093996525 2577 2544 33 0 0.9871940016746521 0.012805599719285965 0.0NA NA NA NA 4232 39 0.009216000325977802 2116 2079 35 2 0.9825140237808228 0.01654059998691082 0.0009451800142414868 1910 7 0.003664999967440963 955 949 5 1 0.9937170147895813 0.00523559981957078 0.001047119963914156 950 4 0.004211000166833401 475 472 2 1 0.9936839938163757 0.00421052984893322 0.0021052600350230932 NA NA NA NA 952 3 0.0031510000117123127 476473 3 0 0.9936969876289368 0.006302520167082548 0.0 2296 31 0.013501999899744987 1148 11131 0 0.9729970097541809 0.02700350061058998 0.0 1312 13 0.009909000247716904 656 643 13 0.9801830053329468 0.01981710083782673 0.0 NA NA NA NA 976 18 0.018442999571561813 488470 18 0 0.9631149768829346 0.03688519820570946 0.0 7574 32 0.004224999807775021 3787 37528 2 0.9920780062675476 0.007393720094114542 0.0005281229969114065 3374 17 0.005038999952375889 1681671 15 1 0.9905160069465637 0.008891520090401173 0.000592768017668277 NA NA NA NA 41815 0.003587000072002411 2091 2077 13 1 0.9933050274848938 0.006217120215296745 0.00047823999193497188 3 0.015956999734044075 94 91 3 0 0.968084990978241 0.03191490098834038 0.0 76 0.026316000148653984 38 36 2 0 0.9473680257797241 0.05263160169124603 0.0 NA NA NA NA 112 1 0.008929000236093998 56 55 1 0 0.982142984867096 0.017857100814580917 0.0UNRESOLVED
TSV Example
The TSV file for GRCh38 was obtained from the lifted-over dataset created by dbVar.
#variant_call_accession variant_call_id variant_call_type experiment_id sample_id sampleset_id assembly chr contig outer_start start inner_start inner_stop stop outer_stop insertion_length variant_region_acc variant_region_id copy_number description validation zygosity origin phenotype hgvs_name placement_method placement_rank placements_per_assembly remap_alignment remap_best_within_cluster remap_coverage remap_diff_chr remap_failure_code allele_count allele_frequency allele_number
nssv15777856 gnomAD-SV_v2.1_CNV_10_564_alt_1 copy number variation 1 1 GRCh38.p12 10 736806 738184 nsv4039284 10__782746___784124______GRCh37.p13_copy_number_variation 0 Remapped BestAvailable Single First Pass 0 1 AC=21,AFR_AC=10,AMR_AC=9,EAS_AC=0,EUR_AC=2,OTH_AC=0 AF=0.038889,AFR_AF=0.044643,AMR_AF=0.03913,EAS_AF=0,EUR_AF=0.023256,OTH_AF=0 AN=540,AFR_AN=224,AMR_AN=230,EAS_AN=0,EUR_AN=86,OTH_AN=0
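In the TSV example, the `allele_count`, `allele_frequency` and `allele_number` columns each hold a comma-separated list of `KEY=VALUE` pairs. A minimal sketch for unpacking one such column follows; `parse_population_values` is a hypothetical helper and not part of the actual parser.

```python
def parse_population_values(field: str, cast=float) -> dict:
    """Split a 'KEY=VALUE,KEY=VALUE' string into a dict of converted values."""
    values = {}
    for item in field.split(","):
        key, _, raw = item.partition("=")
        # Skip fragments that lack a key or a value.
        if key and raw:
            values[key.strip()] = cast(raw)
    return values
```

For the row above, passing the allele count string with `cast=int` yields `AC` = 21, and the allele number string yields `AN` = 540, which agrees with the listed `AF` of 0.038889.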
Structural Variant Type Mapping
The source files represented the structural variants with keys using various naming conventions. In the Illumina Connected Annotations JSON output, these keys will be mapped according to the following.
Illumina Connected Annotations JSON SV Type Key | GRCh37 Source SV Type Key | GRCh38 Source SV Type Key |
---|---|---|
copy_number_variation | | copy number variation |
deletion | DEL, CN=0 | deletion |
duplication | DUP | duplication |
insertion | INS | insertion |
inversion | INV | inversion |
mobile_element_insertion | INS:ME | mobile element insertion |
mobile_element_insertion | INS:ME:ALU | alu insertion |
mobile_element_insertion | INS:ME:LINE1 | line1 insertion |
mobile_element_insertion | INS:ME:SVA | sva insertion |
structural alteration | | sequence alteration |
complex_structural_alteration | CPX | |
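The mapping table translates directly into a lookup dictionary. The sketch below combines the source keys from both files into one table; the names are hypothetical, and treating `DEL` and `CN=0` as two separate source keys is an assumption about how the `DEL, CN=0` cell should be read.

```python
# Source SV type key (from either source file) -> JSON SV type key.
SOURCE_TO_JSON_SV_TYPE = {
    "DEL": "deletion",
    "CN=0": "deletion",
    "deletion": "deletion",
    "DUP": "duplication",
    "duplication": "duplication",
    "INS": "insertion",
    "insertion": "insertion",
    "INV": "inversion",
    "inversion": "inversion",
    "INS:ME": "mobile_element_insertion",
    "INS:ME:ALU": "mobile_element_insertion",
    "INS:ME:LINE1": "mobile_element_insertion",
    "INS:ME:SVA": "mobile_element_insertion",
    "mobile element insertion": "mobile_element_insertion",
    "alu insertion": "mobile_element_insertion",
    "line1 insertion": "mobile_element_insertion",
    "sva insertion": "mobile_element_insertion",
    "copy number variation": "copy_number_variation",
    "sequence alteration": "structural alteration",
    "CPX": "complex_structural_alteration",
}


def to_json_sv_type(source_key: str):
    """Return the JSON SV type for a source key, or None if it is unmapped."""
    return SOURCE_TO_JSON_SV_TYPE.get(source_key.strip())
```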
Download URLs
GRCh37
The GRCh37 file was downloaded from the original source. Following table gives some essential data metrics:
https://storage.googleapis.com/gcp-public-data--gnomad/papers/2019-sv/gnomad_v2.1_sv.sites.bed.gz
GRCh38
Note: gnomAD 2.1 structural variant data for GRCh38 is not available from the original source. We use the lifted-over structural variant dataset created by dbVar, available at https://www.ncbi.nlm.nih.gov/sites/dbvarapp/studies/nstd166/
Download URL
JSON output
"gnomAD-preview": [
{
"chromosome": "1",
"begin": 40001,
"end": 47200,
"variantId": "gnomAD-SV_v2.1_DUP_1_1",
"variantType": "duplication",
"failedFilter": true,
"allAf": 0.068963,
"afrAf": 0.135694,
"amrAf": 0.022876,
"easAf": 0.01101,
"eurAf": 0.007846,
"othAf": 0.017544,
"femaleAf": 0.065288,
"maleAf": 0.07255,
"allAc": 943,
"afrAc": 866,
"amrAc": 21,
"easAc": 17,
"eurAc": 37,
"othAc": 2,
"femaleAc": 442,
"maleAc": 499,
"allAn": 13674,
"afrAn": 6382,
"amrAn": 918,
"easAn": 1544,
"eurAn": 4716,
"othAn": 114,
"femaleAn": 6770,
"maleAn": 6878,
"allHc": 91,
"afrHc": 90,
"amrHc": 1,
"easHc": 0,
"eurHc": 0,
"othHc": 55,
"femaleHc": 44,
"maleHc": 47,
"reciprocalOverlap": 0.01839,
"annotationOverlap": 0.16667
}
]
Field | Type | Notes |
---|---|---|
chromosome | string | chromosome number |
begin | integer | position interval start |
end | integer | position interval end |
variantType | string | structural variant type |
variantId | string | gnomAD ID |
allAf | floating point | allele frequency for all populations. Range: 0 - 1.0 |
afrAf | floating point | allele frequency for the African super population. Range: 0 - 1.0 |
amrAf | floating point | allele frequency for the Admixed American super population. Range: 0 - 1.0 |
easAf | floating point | allele frequency for the East Asian super population. Range: 0 - 1.0 |
eurAf | floating point | allele frequency for the European super population. Range: 0 - 1.0 |
othAf | floating point | allele frequency for all other populations. Range: 0 - 1.0 |
femaleAf | floating point | allele frequency for female population. Range: 0 - 1.0 |
maleAf | floating point | allele frequency for male population. Range: 0 - 1.0 |
allAc | integer | allele count for all populations. |
afrAc | integer | allele count for the African super population. |
amrAc | integer | allele count for the Admixed American super population. |
easAc | integer | allele count for the East Asian super population. |
eurAc | integer | allele count for the European super population. |
othAc | integer | allele count for all other populations. |
maleAc | integer | allele count for male population. |
femaleAc | integer | allele count for female population. |
allAn | integer | allele number for all populations. |
afrAn | integer | allele number for the African super population. |
amrAn | integer | allele number for the Admixed American super population. |
easAn | integer | allele number for the East Asian super population. |
eurAn | integer | allele number for the European super population. |
othAn | integer | allele number for all other populations. |
femaleAn | integer | allele number for female population. |
maleAn | integer | allele number for male population. |
allHc | integer | count of homozygous individuals for all populations. |
afrHc | integer | count of homozygous individuals for the African / African American population. |
amrHc | integer | count of homozygous individuals for the Latino population. |
easHc | integer | count of homozygous individuals for the East Asian population. |
eurHc | integer | count of homozygous individuals for the European super population. |
othHc | integer | count of homozygous individuals for all other populations. |
maleHc | integer | count of homozygous individuals for male population. |
femaleHc | integer | count of homozygous individuals for female population. |
failedFilter | boolean | True if this variant failed any filters (Note: we do not list the failed filters) |
reciprocalOverlap | floating point | Reciprocal overlap. Range: 0 - 1.0 |
annotationOverlap | floating point | Fraction of the annotated gnomAD interval that is covered by the overlap with the query variant. Range: 0 - 1.0 |
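The source does not spell out how the two overlap values are computed. The sketch below uses a common definition, stated here as an assumption: reciprocal overlap is the overlap length divided by the longer of the two intervals, and annotation overlap is the overlap length divided by the length of the annotated interval. Coordinates are treated as 1-based and inclusive, and the function name is hypothetical.

```python
from typing import Tuple


def overlap_fractions(var_begin: int, var_end: int,
                      ann_begin: int, ann_end: int) -> Tuple[float, float]:
    """Return (reciprocal_overlap, annotation_overlap) for two intervals."""
    overlap = min(var_end, ann_end) - max(var_begin, ann_begin) + 1
    if overlap <= 0:
        return 0.0, 0.0  # the intervals do not intersect
    var_length = var_end - var_begin + 1
    ann_length = ann_end - ann_begin + 1
    reciprocal = overlap / max(var_length, ann_length)
    annotation = overlap / ann_length
    return reciprocal, annotation
```

With this definition the reciprocal overlap can never exceed the annotation overlap, which is consistent with the sample JSON above (0.01839 versus 0.16667).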
Note: The following fields are not available in GRCh38 because the source file does not contain this information:
Field |
---|
femaleAf |
maleAf |
maleAc |
femaleAc |
femaleAn |
maleAn |
allHc |
afrHc |
amrHc |
easHc |
eurHc |
othHc |
maleHc |
femaleHc |
failedFilter |