1000 Genomes
Overview
The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied. It was the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. Data from the 1000 Genomes Project was quickly made available to the worldwide scientific community through freely accessible public databases.
Publication
Sudmant, P., Rausch, T., Gardner, E. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015). https://doi.org/10.1038/nature15394
Populations
- The super population membership can be found here: (http://www.1000genomes.org/category/population/)
- We want to capture the allele frequencies for all 26 populations as well as the 5 super populations and the total population.
Small Variants
VCF File Parsing
The original VCF files come with allele frequency fields (e.g. ALL_AF, AMR_AF) but we recompute them using allele counts and allele numbers in order to get 6 digit precision. The allele counts and allele numbers (e.g. AMR_AC, AMR_AN) are not expressed in the INFO field. Instead the genotypes need to be parsed to compute that information. Our team converted the original data to VCF entries with allele counts and allele numbers like the following.
#CHROM POS ID REF ALT QUAL FILTER INFO
1 15274 rs62636497 A G,T 100 PASS AC=1739,3210;AF=0.347244,0.640974;AN=5008;NS=2504;DP=23255;EAS_AF=0.4812,0.5188;AMR_AF=0.2752,0.7205;AFR_AF=0.323,0.6369;EUR_AF=0.2922,0.7078;SAS_AF=0.3497,0.6472;AA=g|||;VT=SNP;MULTI_ALLELIC;EAS_AN=1008;EAS_AC=485,523;EUR_AN=1006;EUR_AC=294,712;AFR_AN=1322;AFR_AC=427,842;AMR_AN=694;AMR_AC=191,500;SAS_AN=978;SAS_AC=342,633
The ancestral allele, if it exists, is the first value in the pipe separated AA fields (the Indel specific REF, ALT, IndelType fields are ignored).
We parse the VCF file and extract the following fields from INFO:
- AA
- AC
- AN
- EAS_AN
- AMR_AN
- AFR_AN
- EUR_AN
- SAS_AN
- EAS_AC
- AMR_AC
- AFR_AC
- EUR_AC
- SAS_AC
Conflict Resolution
We have observed conflicting allele frequency information in the source. Take the following example:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 20505705 rs35377696 C CTCTG,CTG,CTGTG 100 PASS AC=46,1513,152;AF=0.0091853,0.302117,0.0303514;
1 20505705 rs35377696 C CTG 100 PASS AC=4;AF=0.000798722;
That is, the variant 1-20505705-C-CTG has conflicting entries. To get an idea of how frequently we observe this, here is a table summarizing ChrX and all chromosomes. Note that almost all such entries are found in ChrX.
Chromosome | # of alleles | # of conflicting alleles | percentage |
---|---|---|---|
chrX | 834800 | 2733 | 0.33% |
Total | 21413098 | 2743 | 0.013% |
Currently, we removed the allele frequency of the conflicting allele (i.e., insertion TG in the example) but keep allele frequencies of all other alleles in the VCF line.
Potential Alternate Solutions
- Remove all alleles that are contained in the vcf lines which have conflicting allele. (Recommended by 1000 genome group Holly Zheng-Bradley, 7/29/2015)
- Recalculate the allele frequency for the conflicting allele.
- Pick the allele frequency that has the highest data support.
Download URL
JSON Output
"oneKg":{
"allAf":0.200879,
"afrAf":0.210287,
"amrAf":0.139769,
"easAf":0.275794,
"eurAf":0.181909,
"sasAf":0.173824,
"allAn":5008,
"afrAn":1322,
"amrAn":694,
"easAn":1008,
"eurAn":1006,
"sasAn":978,
"allAc":1006,
"afrAc":278,
"amrAc":97,
"easAc":278,
"eurAc":183,
"sasAc":170
}
Field | Type | Notes |
---|---|---|
allAf | float | allele frequency for all populations. Range: 0 - 1.0 |
allAc | int | allele count for all populations. Integer. |
allAn | int | allele number for all populations. Non-zero integer. |
afrAf | float | allele frequency for the African super population. Range: 0 - 1.0 |
afrAc | int | allele count for the African super population. Integer. |
afrAn | int | allele number for the African super population. Non-zero integer. |
amrAf | float | allele frequency for the Ad Mixed American super population. Range: 0 - 1.0 |
amrAc | int | allele count for the Ad Mixed American super population. Integer. |
amrAn | int | allele number for the Ad Mixed American super population. Non-zero integer. |
easAf | float | allele frequency for the East Asian super population. Range: 0 - 1.0 |
easAc | int | allele count for the East Asian super population. Integer. |
easAn | int | allele number for the East Asian super population. Non-zero integer. |
eurAf | float | allele frequency for the European super population. Range: 0 - 1.0 |
eurAc | int | allele count for the European super population. Integer. |
eurAn | int | allele number for the European super population. Non-zero integer. |
sasAf | float | allele frequency for the South Asian super population. Range: 0 - 1.0 |
sasAc | int | allele count for the South Asian super population. Integer. |
sasAn | int | allele number for the South Asian super population. Non-zero integer. |
Structural Variants
VCF File Parsing
The VCF files contain entries like the following:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
22 16050654 esv3647175;esv3647176;esv3647177;esv3647178 A <CN0>,<CN2>,<CN3>,<CN4> 100 PASS AC=9,87,599,20;AF=0.00179712,0.0173722,0.119609,0.00399361;AN=5008;CS=DUP_gs;END=16063474;NS=2504;SVTYPE=CNV;DP=22545;EAS_AF=0.001,0.0169,0.2361,0.0099;AMR_AF=0,0.0101,0.219,0.0072;AFR_AF=0.0061,0.0363,0.0053,0;EUR_AF=0,0.007,0.0944,0.003;SAS_AF=0,0.0082,0.1094,0.002;VT=SV GT 3|0 0|0 0|0 0|0 0|0 0|0 0|4
Please note that, CNVs are allele-specific. For example, HG00096 is effectively copy number 4, which would be a net gain on chr22.
1000 Genomes contains 5 types of structural variants:
- CNV
- DEL
- DUP
- INS
- INV
Since data of 1000 genomes is provided in VCF format, we assume that the coordinates follow the vcf format, i.e., there is a padding base for symbolic alleles. So all the interval can be interpreted as [BEGIN+1, END]. Similarly, for all other variant types except insertion, END is far larger than BEGIN. The distribution of BEGIN and END for insertions is summarized below.
Insertion issues
- END = BEGIN for 6/165
- END = BEGIN+2 for 93/165
- END = BEGIN+3 for 11/165
- END = BEGIN+4 for 11/165
- END – BEGIN range from 5 to 1156 for others.
Converting VCF svTypes to SO sequence alterations
The svType will be captured in our JSON file under the sequenceAlteration key. Here's the translation we'll use according to svType in 1000 Genomes.
svType | Alternative Alleles contain <CN*> | sequenceAlteration |
---|---|---|
ALU | FALSE | mobile_element_insertion |
DUP | TRUE | copy_number_gain |
CNV | TRUE | copy_number_gain (observed_gains >0 and observed_losses =0) copy_number_loss (observed_gains = 0 and observed_losses > 0) copy_number_variation (otherwise) |
DEL | TRUE | copy_number_loss |
LINE1 | FALSE | mobile_element_insertion |
SVA | FALSE | mobile_element_insertion |
INV | FALSE | inversion |
INS | FALSE | insertion |
Exceptions
We discard structural variants without END
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
21 9495848 esv3646347 A <INS:ME:LINE1> 100 PASS AC=1543;AF=0.308107;AN=5008;CS=L1_umary;MEINFO=LINE1,5669,6005,+;NS=2504;SVLEN=336;SVTYPE=LINE1;TSD=null;DP=20015;EAS_AF=0.3125;AMR_AF=0.2911;AFR_AF=0.3026;EUR_AF=0.2922;SAS_AF=0.3395;VT=SV GT 0|0 1|1 1|0 0|1 1|0 1|0 0|0
CNVs in chrY
- No other types of structural variants exist in chrY
- Since copy number is provided in genotype field, we directly parse the copy number from "CN" field.
- For most CNVs in chrY, the reference copy number is 1, but the refence number for CNVs in segmental duplication sites is 2 (<CN2> in the 2nd example). All segmental duplication calls have identifiers starting with GS_SD_M2.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00101 HG00103 HG00105 HG00107 HG00108
Y 2888555 CNV_Y_2888555_3014661 T <CN2> 100 PASS AC=1;AF=0.000817661;AN=1223;END=3014661;NS=1233;SVTYPE=CNV;AMR_AF=0.0000;AFR_AF=0.0000;EUR_AF=0.0000;SAS_AF=0.0019;EAS_AF=0.0000;VT=SV GT:CN:CNL:CNP:CNQ:GP:GQ:PL 0:1:-1000,0,-58.45:-1000,0,-61.55:99:0,-61.55:99:0,585 0:1:-296.36,0,-16.6:-300.46,0,-19.7:99:0,-19.7:99:0,166 0:1:-1000,0,-39.44:-1000,0,-42.54:99:0,-42.54:99:0,394
Y 6128381 GS_SD_M2_Y_6128381_6230094_Y_9650284_9752225 C <CN1>,<CN3> 100 PASS AC=4,2;AF=0.00327065,0.00163532;AN=1223;END=6230094;NS=1233;SVTYPE=CNV;AMR_AF=0.0029,0.0029;AFR_AF=0.0016,0.0016;EUR_AF=0.0000,0.0000;SAS_AF=0.0038,0.0000;EAS_AF=0.0000,0.0000;VT=SV;EX_TARGET GT:CN:CNL:CNP:CNQ:GP:GQ 0:2:-1000,-138.78,0,-38.53:-1000,-141.27,0,-41.33:99:0,-141.27,-41.33:99 0:2:-1000,-53.32,0,-17.85:-1000,-55.81,0,-20.64:99:0,-55.81,-20.64:99 0:2:-1000,-71.83,0,-32.5:-1000,-74.32,0,-35.29:99:0,-74.32,-35.29:99 0:2:-1000,-60.96,0,-20.29:-1000,-63.45,0,-23.08:99:0,-63.45,-23.08:99 0:2:-1000,-77.6,0,-31.45:-1000,-80.09,0,-34.24:99:0,-80.09,-34.24:99
JSON Output
"oneKg":[
{
"chromosome":"1",
"begin":1595369,
"end":1612441,
"variantType": "copy_number_variation",
"id": "esv3635753;esv3635754;esv3635755;esv3635756;esv3635757",
"allAn": 5008,
"allAc": 2702,
"allAf": 0.539537,
"afrAf": 0.6052,
"amrAf": 0.3675,
"eurAf": 0.5357,
"easAf": 0.5368,
"sasAf": 0.5797,
"reciprocalOverlap": 0.07555
}
],
Field | Type | Notes |
---|---|---|
chromosome | string | |
begin | integer | |
end | integer | |
variantType | string | |
id | string | |
allAn | integer | allele number for all populations. Non-zero integer. |
allAc | integer | allele count for all populations. Integer. |
allAf | floating point | allele frequency for all populations. Range: 0 - 1.0 |
afrAf | floating point | allele frequency for the African super population. Range: 0 - 1.0 |
amrAf | floating point | allele frequency for the Ad Mixed American super population. Range: 0 - 1.0 |
eurAf | floating point | allele frequency for the European super population. Range: 0 - 1.0 |
easAf | floating point | allele frequency for the East Asian super population. Range: 0 - 1.0 |
sasAf | floating point | allele frequency for the South Asian super population. Range: 0 - 1.0 |
reciprocalOverlap | floating point | range: 0 - 1. |