Version: 3.22

1000 Genomes

Overview

The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied. It was the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. Data from the 1000 Genomes Project was quickly made available to the worldwide scientific community through freely accessible public databases.

Publication

Sudmant, P., Rausch, T., Gardner, E. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015). https://doi.org/10.1038/nature15394

Populations

The super population membership can be found here: http://www.1000genomes.org/category/population/
We want to capture the allele frequencies for all 26 populations as well as the 5 super populations and the total population.

Small Variants

VCF File Parsing

The original VCF files come with allele frequency fields (e.g. ALL_AF, AMR_AF), but we recompute them from allele counts and allele numbers in order to get six-digit precision. The per-population allele counts and allele numbers (e.g. AMR_AC, AMR_AN) are not provided in the original INFO field; the genotypes need to be parsed to compute that information. Our team converted the original data to VCF entries with allele counts and allele numbers, like the following:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1       15274   rs62636497      A       G,T     100     PASS    AC=1739,3210;AF=0.347244,0.640974;AN=5008;NS=2504;DP=23255;EAS_AF=0.4812,0.5188;AMR_AF=0.2752,0.7205;AFR_AF=0.323,0.6369;EUR_AF=0.2922,0.7078;SAS_AF=0.3497,0.6472;AA=g|||;VT=SNP;MULTI_ALLELIC;EAS_AN=1008;EAS_AC=485,523;EUR_AN=1006;EUR_AC=294,712;AFR_AN=1322;AFR_AC=427,842;AMR_AN=694;AMR_AC=191,500;SAS_AN=978;SAS_AC=342,633
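The recomputation described above can be sketched in a few lines of Python. This is a minimal illustration, not the production code: the function name `allele_frequencies` and the `digits` parameter are hypothetical, and the input values are copied from the rs62636497 entry.

```python
def allele_frequencies(allele_counts, allele_number, digits=6):
    """Return one frequency per alternate allele, rounded to `digits` places."""
    if allele_number <= 0:
        # Assumption: a frequency is undefined when no alleles were called.
        return [None] * len(allele_counts)
    return [round(count / allele_number, digits) for count in allele_counts]


# AC=1739,3210 and AN=5008 from the example entry.
print(allele_frequencies([1739, 3210], 5008))  # [0.347244, 0.640974]
# EAS_AC=485,523 and EAS_AN=1008: more digits than the original EAS_AF=0.4812,0.5188.
print(allele_frequencies([485, 523], 1008))    # [0.481151, 0.518849]
```

Each alternate allele of a multi-allelic entry gets its own frequency, in the same order as the ALT column.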

The ancestral allele, if it exists, is the first value in the pipe-separated AA field (the indel-specific REF, ALT, and IndelType values that follow it are ignored).
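As a small sketch of this rule, the hypothetical helper below takes the raw AA value and returns the first pipe-separated item. Treating an empty value or "." as "no ancestral allele" is an assumption made for this example.

```python
def ancestral_allele(aa_value):
    """Return the first pipe-separated item of an AA value, or None if absent."""
    first = aa_value.split("|", 1)[0].strip()
    # Assumption: an empty string or "." means no ancestral allele is known.
    if first in ("", "."):
        return None
    return first


print(ancestral_allele("g|||"))  # g
print(ancestral_allele("|||"))   # None
```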

We parse the VCF file and extract the following fields from INFO:

AA
AC
AN
EAS_AN
AMR_AN
AFR_AN
EUR_AN
SAS_AN
EAS_AC
AMR_AC
AFR_AC
EUR_AC
SAS_AC
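The extraction of these fields could look like the following sketch. The names `WANTED_KEYS` and `extract_fields` are hypothetical, and the type conversions (comma-separated integer lists for counts, a single integer for allele numbers) are assumptions based on the example entry above.

```python
WANTED_KEYS = {
    "AA", "AC", "AN",
    "EAS_AN", "AMR_AN", "AFR_AN", "EUR_AN", "SAS_AN",
    "EAS_AC", "AMR_AC", "AFR_AC", "EUR_AC", "SAS_AC",
}


def extract_fields(info):
    """Pick the wanted keys out of a VCF INFO string and convert their values."""
    fields = {}
    for item in info.split(";"):
        key, separator, value = item.partition("=")
        if not separator or key not in WANTED_KEYS:
            continue  # skips flags such as MULTI_ALLELIC and unwanted keys
        if key == "AA":
            fields[key] = value
        elif key.endswith("AC"):
            # One count per alternate allele.
            fields[key] = [int(count) for count in value.split(",")]
        else:
            fields[key] = int(value)
    return fields


info = "AC=1739,3210;AF=0.347244,0.640974;AN=5008;AA=g|||;VT=SNP;MULTI_ALLELIC;EAS_AN=1008;EAS_AC=485,523"
print(extract_fields(info))
```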

Conflict Resolution

We have observed conflicting allele frequency information in the source. Take the following example:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1 20505705 rs35377696 C CTCTG,CTG,CTGTG 100 PASS AC=46,1513,152;AF=0.0091853,0.302117,0.0303514;
1 20505705 rs35377696 C CTG 100 PASS AC=4;AF=0.000798722;

That is, the variant 1-20505705-C-CTG has conflicting entries: the same allele appears on two VCF lines with different allele counts and frequencies. To get an idea of how frequently we observe this, here is a table summarizing chrX and all chromosomes. Note that almost all such entries are found in chrX.

Chromosome	# of alleles	# of conflicting alleles	percentage
chrX	834800	2733	0.33%
Total	21413098	2743	0.013%

Currently, we remove the allele frequency of the conflicting allele (i.e., the insertion TG in the example) but keep the allele frequencies of all other alleles in the VCF line.
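The current behavior can be illustrated with the sketch below. It assumes that an allele is identified by its exact chromosome, position, REF, and ALT strings (no normalization or trimming), and that any allele seen on more than one line counts as conflicting. The function name and the record layout are hypothetical.

```python
from collections import defaultdict


def resolve_conflicts(records):
    """Map each allele to its frequency, or to None if it is conflicting.

    `records` is a list of (chrom, pos, ref, alts, afs) tuples, one per VCF line.
    """
    seen = defaultdict(list)
    for chrom, pos, ref, alts, afs in records:
        for alt, af in zip(alts, afs):
            seen[(chrom, pos, ref, alt)].append(af)
    # An allele reported on several lines loses its frequency; others keep theirs.
    return {key: (afs[0] if len(afs) == 1 else None) for key, afs in seen.items()}


records = [
    ("1", 20505705, "C", ["CTCTG", "CTG", "CTGTG"], [0.0091853, 0.302117, 0.0303514]),
    ("1", 20505705, "C", ["CTG"], [0.000798722]),
]
for key, af in resolve_conflicts(records).items():
    print(key, af)
```

For the example above, C>CTG ends up without a frequency, while C>CTCTG and C>CTGTG keep theirs.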

Potential Alternate Solutions

Remove all alleles on any VCF line that contains a conflicting allele. (Recommended by Holly Zheng-Bradley of the 1000 Genomes group, 7/29/2015)
Recalculate the allele frequency for the conflicting allele.
Pick the allele frequency that has the most supporting data.

Download URL

GRCh37 GRCh38

JSON Output

"oneKg":{
   "allAf":0.200879,
   "afrAf":0.210287,
   "amrAf":0.139769,
   "easAf":0.275794,
   "eurAf":0.181909,
   "sasAf":0.173824,
   "allAn":5008,
   "afrAn":1322,
   "amrAn":694,
   "easAn":1008,
   "eurAn":1006,
   "sasAn":978,
   "allAc":1006,
   "afrAc":278,
   "amrAc":97,
   "easAc":278,
   "eurAc":183,
   "sasAc":170
}

Field	Type	Notes
allAf	float	allele frequency for all populations. Range: 0 - 1.0
allAc	int	allele count for all populations. Integer.
allAn	int	allele number for all populations. Non-zero integer.
afrAf	float	allele frequency for the African super population. Range: 0 - 1.0
afrAc	int	allele count for the African super population. Integer.
afrAn	int	allele number for the African super population. Non-zero integer.
amrAf	float	allele frequency for the Admixed American super population. Range: 0 - 1.0
amrAc	int	allele count for the Admixed American super population. Integer.
amrAn	int	allele number for the Admixed American super population. Non-zero integer.
easAf	float	allele frequency for the East Asian super population. Range: 0 - 1.0
easAc	int	allele count for the East Asian super population. Integer.
easAn	int	allele number for the East Asian super population. Non-zero integer.
eurAf	float	allele frequency for the European super population. Range: 0 - 1.0
eurAc	int	allele count for the European super population. Integer.
eurAn	int	allele number for the European super population. Non-zero integer.
sasAf	float	allele frequency for the South Asian super population. Range: 0 - 1.0
sasAc	int	allele count for the South Asian super population. Integer.
sasAn	int	allele number for the South Asian super population. Non-zero integer.
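To show how the fields in this table relate to each other, here is a hedged sketch that assembles a oneKg object from per-population allele counts and allele numbers. The function `build_one_kg` and its input layout are hypothetical. The numbers are the ones from the JSON example above, so the frequencies can be checked against it.

```python
import json

POPULATIONS = ("all", "afr", "amr", "eas", "eur", "sas")


def build_one_kg(counts):
    """Build the oneKg object from {population: (allele_count, allele_number)}."""
    # Assumption: populations with an allele number of zero are left out,
    # since the table requires a non-zero allele number.
    usable = [pop for pop in POPULATIONS if counts[pop][1] > 0]
    entry = {}
    for pop in usable:
        allele_count, allele_number = counts[pop]
        entry[f"{pop}Af"] = round(allele_count / allele_number, 6)
    for pop in usable:
        entry[f"{pop}An"] = counts[pop][1]
    for pop in usable:
        entry[f"{pop}Ac"] = counts[pop][0]
    return entry


counts = {
    "all": (1006, 5008), "afr": (278, 1322), "amr": (97, 694),
    "eas": (278, 1008), "eur": (183, 1006), "sas": (170, 978),
}
print(json.dumps({"oneKg": build_one_kg(counts)}, indent=3))
```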

Structural Variants

VCF File Parsing

The VCF files contain entries like the following:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
22      16050654        esv3647175;esv3647176;esv3647177;esv3647178     A       <CN0>,<CN2>,<CN3>,<CN4> 100     PASS  AC=9,87,599,20;AF=0.00179712,0.0173722,0.119609,0.00399361;AN=5008;CS=DUP_gs;END=16063474;NS=2504;SVTYPE=CNV;DP=22545;EAS_AF=0.001,0.0169,0.2361,0.0099;AMR_AF=0,0.0101,0.219,0.0072;AFR_AF=0.0061,0.0363,0.0053,0;EUR_AF=0,0.007,0.0944,0.003;SAS_AF=0,0.0082,0.1094,0.002;VT=SV       GT      3|0     0|0     0|0     0|0     0|0     0|0     0|4

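Please note that CNVs are allele-specific. For example, HG00096 has the genotype 3|0, i.e. a <CN3> allele on one haplotype and the reference allele (one copy) on the other, so it is effectively copy number 4, which would be a net gain on chr22.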
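The copy number arithmetic for HG00096 can be written out as a short sketch. It assumes that the reference allele contributes one copy per haplotype and that every alternate allele has the form <CNn>; missing genotypes (".") are not handled. The function names are hypothetical.

```python
import re


def haplotype_copy_numbers(alts):
    """Map each allele index to its copy number; index 0 is the reference."""
    copies = {0: 1}  # assumption: the reference haplotype carries one copy
    for index, alt in enumerate(alts, start=1):
        match = re.fullmatch(r"<CN(\d+)>", alt)
        if match is None:
            raise ValueError(f"not a copy-number allele: {alt}")
        copies[index] = int(match.group(1))
    return copies


def total_copy_number(genotype, alts):
    """Sum the copy numbers of the alleles in a genotype such as '3|0'."""
    copies = haplotype_copy_numbers(alts)
    return sum(copies[int(allele)] for allele in re.split(r"[|/]", genotype))


alts = ["<CN0>", "<CN2>", "<CN3>", "<CN4>"]
print(total_copy_number("3|0", alts))  # HG00096: 3 + 1 = 4
print(total_copy_number("0|4", alts))  # HG00103: 1 + 4 = 5
print(total_copy_number("0|0", alts))  # reference on both haplotypes: 2
```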

1000 Genomes contains 5 types of structural variants:

Since the 1000 Genomes data is provided in VCF format, we assume that the coordinates follow the VCF convention, i.e., there is a padding base for symbolic alleles. All intervals can therefore be interpreted as [BEGIN+1, END]. For all variant types except insertions, END is far larger than BEGIN. The distribution of BEGIN and END for insertions is summarized below.
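A minimal sketch of this interpretation, using the hypothetical helper `sv_interval` and the POS and END values of the chr22 CNV shown earlier. Clamping the length to zero is an assumption made here so that entries with END = BEGIN do not produce a negative length.

```python
def sv_interval(pos, end):
    """Interpret a symbolic-allele entry as the closed interval [POS + 1, END]."""
    begin = pos + 1  # POS is the padding base, so the variant starts one base later
    length = max(0, end - begin + 1)
    return begin, end, length


print(sv_interval(16050654, 16063474))  # (16050655, 16063474, 12820)
print(sv_interval(100, 100))            # END = BEGIN gives an empty interval: (101, 100, 0)
```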

Insertion issues

END = BEGIN for 6/165
END = BEGIN+2 for 93/165
END = BEGIN+3 for 11/165
END = BEGIN+4 for 11/165
END - BEGIN ranges from 5 to 1156 for the others.

Converting VCF svTypes to SO sequence alterations

The svType will be captured in our JSON file under the sequenceAlteration key. Here's the translation we'll use according to svType in 1000 Genomes.

svType	Alternative Alleles contain <CN*>	sequenceAlteration
ALU	FALSE	mobile_element_insertion
DUP	TRUE	copy_number_gain
CNV	TRUE	copy_number_gain (observed_gains > 0 and observed_losses = 0); copy_number_loss (observed_gains = 0 and observed_losses > 0); copy_number_variation (otherwise)
DEL	TRUE	copy_number_loss
LINE1	FALSE	mobile_element_insertion
SVA	FALSE	mobile_element_insertion
INV	FALSE	inversion
INS	FALSE	insertion
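The translation table can be expressed as a lookup plus one special case for CNV. The sketch below is illustrative only. In particular, how observed_gains and observed_losses are determined is not spelled out above, so this example assumes they are the numbers of <CNn> alternate alleles above and below a reference copy number (1 by default); that definition and all names are hypothetical.

```python
import re

FIXED_TRANSLATIONS = {
    "ALU": "mobile_element_insertion",
    "LINE1": "mobile_element_insertion",
    "SVA": "mobile_element_insertion",
    "DUP": "copy_number_gain",
    "DEL": "copy_number_loss",
    "INV": "inversion",
    "INS": "insertion",
}


def sequence_alteration(sv_type, alts, reference_copy_number=1):
    """Translate a 1000 Genomes svType into a sequenceAlteration term."""
    if sv_type != "CNV":
        return FIXED_TRANSLATIONS[sv_type]
    matches = (re.fullmatch(r"<CN(\d+)>", alt) for alt in alts)
    copy_numbers = [int(match.group(1)) for match in matches if match]
    # Assumption: gains and losses are counted over the alternate alleles.
    gains = sum(cn > reference_copy_number for cn in copy_numbers)
    losses = sum(cn < reference_copy_number for cn in copy_numbers)
    if gains > 0 and losses == 0:
        return "copy_number_gain"
    if gains == 0 and losses > 0:
        return "copy_number_loss"
    return "copy_number_variation"


print(sequence_alteration("LINE1", ["<INS:ME:LINE1>"]))                 # mobile_element_insertion
print(sequence_alteration("CNV", ["<CN0>", "<CN2>", "<CN3>", "<CN4>"]))  # copy_number_variation
print(sequence_alteration("CNV", ["<CN2>"]))                             # copy_number_gain
```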

Exceptions

We discard structural variants that have no END field, such as the following:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
21      9495848 esv3646347      A       <INS:ME:LINE1>  100     PASS   AC=1543;AF=0.308107;AN=5008;CS=L1_umary;MEINFO=LINE1,5669,6005,+;NS=2504;SVLEN=336;SVTYPE=LINE1;TSD=null;DP=20015;EAS_AF=0.3125;AMR_AF=0.2911;AFR_AF=0.3026;EUR_AF=0.2922;SAS_AF=0.3395;VT=SV   GT      0|0     1|1     1|0     0|1     1|0     1|0     0|0
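A filter for this exception could be sketched as follows. The helper names are hypothetical, and the INFO strings are shortened versions of the entries shown in this section.

```python
def get_end(info):
    """Return the END value from a VCF INFO string, or None if it is absent."""
    for item in info.split(";"):
        key, separator, value = item.partition("=")
        if key == "END" and separator:
            return int(value)
    return None


def keep_structural_variant(info):
    """Structural variants without END are discarded."""
    return get_end(info) is not None


line1_info = "AC=1543;AF=0.308107;AN=5008;CS=L1_umary;MEINFO=LINE1,5669,6005,+;NS=2504;SVLEN=336;SVTYPE=LINE1"
cnv_info = "AC=9,87,599,20;AN=5008;CS=DUP_gs;END=16063474;NS=2504;SVTYPE=CNV"
print(keep_structural_variant(line1_info))  # False
print(keep_structural_variant(cnv_info))    # True
```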

CNVs in chrY

CNVs are the only type of structural variant in chrY.
Since the copy number is provided in the genotype field, we parse it directly from the "CN" field.
For most CNVs in chrY, the reference copy number is 1, but the reference copy number for CNVs in segmental duplication sites is 2 (as in the second example, where samples with the reference genotype have CN=2). All segmental duplication calls have identifiers starting with GS_SD_M2.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096 HG00101 HG00103 HG00105 HG00107 HG00108
Y       2888555 CNV_Y_2888555_3014661   T       <CN2>   100     PASS    AC=1;AF=0.000817661;AN=1223;END=3014661;NS=1233;SVTYPE=CNV;AMR_AF=0.0000;AFR_AF=0.0000;EUR_AF=0.0000;SAS_AF=0.0019;EAS_AF=0.0000;VT=SV  GT:CN:CNL:CNP:CNQ:GP:GQ:PL      0:1:-1000,0,-58.45:-1000,0,-61.55:99:0,-61.55:99:0,585  0:1:-296.36,0,-16.6:-300.46,0,-19.7:99:0,-19.7:99:0,166 0:1:-1000,0,-39.44:-1000,0,-42.54:99:0,-42.54:99:0,394
Y       6128381 GS_SD_M2_Y_6128381_6230094_Y_9650284_9752225    C       <CN1>,<CN3>     100     PASS    AC=4,2;AF=0.00327065,0.00163532;AN=1223;END=6230094;NS=1233;SVTYPE=CNV;AMR_AF=0.0029,0.0029;AFR_AF=0.0016,0.0016;EUR_AF=0.0000,0.0000;SAS_AF=0.0038,0.0000;EAS_AF=0.0000,0.0000;VT=SV;EX_TARGET GT:CN:CNL:CNP:CNQ:GP:GQ 0:2:-1000,-138.78,0,-38.53:-1000,-141.27,0,-41.33:99:0,-141.27,-41.33:99        0:2:-1000,-53.32,0,-17.85:-1000,-55.81,0,-20.64:99:0,-55.81,-20.64:99   0:2:-1000,-71.83,0,-32.5:-1000,-74.32,0,-35.29:99:0,-74.32,-35.29:99    0:2:-1000,-60.96,0,-20.29:-1000,-63.45,0,-23.08:99:0,-63.45,-23.08:99   0:2:-1000,-77.6,0,-31.45:-1000,-80.09,0,-34.24:99:0,-80.09,-34.24:99
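The chrY rules above can be sketched as below: the copy number is read from the CN sub-field of a sample column, and the reference copy number is chosen from the identifier prefix. The function names are hypothetical, and reporting the difference from the reference is an illustrative choice rather than a documented output.

```python
def reference_copy_number(variant_id):
    """Segmental duplication calls (GS_SD_M2 prefix) have a reference copy number of 2."""
    return 2 if variant_id.startswith("GS_SD_M2") else 1


def sample_copy_number(format_keys, sample_value):
    """Read the CN sub-field from a sample column, using FORMAT to locate it."""
    fields = dict(zip(format_keys.split(":"), sample_value.split(":")))
    return int(fields["CN"])


def copy_number_change(variant_id, format_keys, sample_value):
    """Positive values are gains, negative values are losses, 0 is the reference state."""
    return sample_copy_number(format_keys, sample_value) - reference_copy_number(variant_id)


# First sample column of each example line above.
print(copy_number_change(
    "CNV_Y_2888555_3014661",
    "GT:CN:CNL:CNP:CNQ:GP:GQ:PL",
    "0:1:-1000,0,-58.45:-1000,0,-61.55:99:0,-61.55:99:0,585",
))  # 0
print(copy_number_change(
    "GS_SD_M2_Y_6128381_6230094_Y_9650284_9752225",
    "GT:CN:CNL:CNP:CNQ:GP:GQ",
    "0:2:-1000,-138.78,0,-38.53:-1000,-141.27,0,-41.33:99:0,-141.27,-41.33:99",
))  # 0
```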

JSON Output

"oneKg":[
   {
      "chromosome":"1",
      "begin":1595369,
      "end":1612441,
      "variantType": "copy_number_variation",
      "id": "esv3635753;esv3635754;esv3635755;esv3635756;esv3635757",
      "allAn": 5008,
      "allAc": 2702,
      "allAf": 0.539537,
      "afrAf": 0.6052,
      "amrAf": 0.3675,
      "eurAf": 0.5357,
      "easAf": 0.5368,
      "sasAf": 0.5797,
      "reciprocalOverlap": 0.07555
   }
],

Field	Type	Notes
chromosome	string
begin	integer
end	integer
variantType	string
id	string
allAn	integer	allele number for all populations. Non-zero integer.
allAc	integer	allele count for all populations. Integer.
allAf	floating point	allele frequency for all populations. Range: 0 - 1.0
afrAf	floating point	allele frequency for the African super population. Range: 0 - 1.0
amrAf	floating point	allele frequency for the Admixed American super population. Range: 0 - 1.0
eurAf	floating point	allele frequency for the European super population. Range: 0 - 1.0
easAf	floating point	allele frequency for the East Asian super population. Range: 0 - 1.0
sasAf	floating point	allele frequency for the South Asian super population. Range: 0 - 1.0
reciprocalOverlap	floating point	range: 0 - 1.
