Skip to main content
Version: 3.23

1000 Genomes

Overview

The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied. It was the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. Data from the 1000 Genomes Project was quickly made available to the worldwide scientific community through freely accessible public databases.

Publication

Sudmant, P., Rausch, T., Gardner, E. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015). https://doi.org/10.1038/nature15394

Populations

Small Variants

VCF File Parsing

The original VCF files come with allele frequency fields (e.g. ALL_AF, AMR_AF) but we recompute them using allele counts and allele numbers in order to get 6 digit precision. The allele counts and allele numbers (e.g. AMR_AC, AMR_AN) are not expressed in the INFO field. Instead the genotypes need to be parsed to compute that information. Our team converted the original data to VCF entries with allele counts and allele numbers like the following.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1 15274 rs62636497 A G,T 100 PASS AC=1739,3210;AF=0.347244,0.640974;AN=5008;NS=2504;DP=23255;EAS_AF=0.4812,0.5188;AMR_AF=0.2752,0.7205;AFR_AF=0.323,0.6369;EUR_AF=0.2922,0.7078;SAS_AF=0.3497,0.6472;AA=g|||;VT=SNP;MULTI_ALLELIC;EAS_AN=1008;EAS_AC=485,523;EUR_AN=1006;EUR_AC=294,712;AFR_AN=1322;AFR_AC=427,842;AMR_AN=694;AMR_AC=191,500;SAS_AN=978;SAS_AC=342,633

The ancestral allele, if it exists, is the first value in the pipe separated AA fields (the Indel specific REF, ALT, IndelType fields are ignored).

We parse the VCF file and extract the following fields from INFO:

  • AA
  • AC
  • AN
  • EAS_AN
  • AMR_AN
  • AFR_AN
  • EUR_AN
  • SAS_AN
  • EAS_AC
  • AMR_AC
  • AFR_AC
  • EUR_AC
  • SAS_AC

Conflict Resolution

We have observed conflicting allele frequency information in the source. Take the following example:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1 20505705 rs35377696 C CTCTG,CTG,CTGTG 100 PASS AC=46,1513,152;AF=0.0091853,0.302117,0.0303514;
1 20505705 rs35377696 C CTG 100 PASS AC=4;AF=0.000798722;

That is, the variant 1-20505705-C-CTG has conflicting entries. To get an idea of how frequently we observe this, here is a table summarizing ChrX and all chromosomes. Note that almost all such entries are found in ChrX.

Chromosome# of alleles# of conflicting allelespercentage
chrX83480027330.33%
Total2141309827430.013%

Currently, we removed the allele frequency of the conflicting allele (i.e., insertion TG in the example) but keep allele frequencies of all other alleles in the VCF line.

Potential Alternate Solutions

  • Remove all alleles that are contained in the vcf lines which have conflicting allele. (Recommended by 1000 genome group Holly Zheng-Bradley, 7/29/2015)
  • Recalculate the allele frequency for the conflicting allele.
  • Pick the allele frequency that has the highest data support.

Download URL

GRCh37 GRCh38

JSON Output

"oneKg":{
"allAf":0.200879,
"afrAf":0.210287,
"amrAf":0.139769,
"easAf":0.275794,
"eurAf":0.181909,
"sasAf":0.173824,
"allAn":5008,
"afrAn":1322,
"amrAn":694,
"easAn":1008,
"eurAn":1006,
"sasAn":978,
"allAc":1006,
"afrAc":278,
"amrAc":97,
"easAc":278,
"eurAc":183,
"sasAc":170
}
FieldTypeNotes
allAffloatallele frequency for all populations. Range: 0 - 1.0
allAcintallele count for all populations. Integer.
allAnintallele number for all populations. Non-zero integer.
afrAffloatallele frequency for the African super population. Range: 0 - 1.0
afrAcintallele count for the African super population. Integer.
afrAnintallele number for the African super population. Non-zero integer.
amrAffloatallele frequency for the Ad Mixed American super population. Range: 0 - 1.0
amrAcintallele count for the Ad Mixed American super population. Integer.
amrAnintallele number for the Ad Mixed American super population. Non-zero integer.
easAffloatallele frequency for the East Asian super population. Range: 0 - 1.0
easAcintallele count for the East Asian super population. Integer.
easAnintallele number for the East Asian super population. Non-zero integer.
eurAffloatallele frequency for the European super population. Range: 0 - 1.0
eurAcintallele count for the European super population. Integer.
eurAnintallele number for the European super population. Non-zero integer.
sasAffloatallele frequency for the South Asian super population. Range: 0 - 1.0
sasAcintallele count for the South Asian super population. Integer.
sasAnintallele number for the South Asian super population. Non-zero integer.

Structural Variants

VCF File Parsing

The VCF files contain entries like the following:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
22 16050654 esv3647175;esv3647176;esv3647177;esv3647178 A <CN0>,<CN2>,<CN3>,<CN4> 100 PASS AC=9,87,599,20;AF=0.00179712,0.0173722,0.119609,0.00399361;AN=5008;CS=DUP_gs;END=16063474;NS=2504;SVTYPE=CNV;DP=22545;EAS_AF=0.001,0.0169,0.2361,0.0099;AMR_AF=0,0.0101,0.219,0.0072;AFR_AF=0.0061,0.0363,0.0053,0;EUR_AF=0,0.007,0.0944,0.003;SAS_AF=0,0.0082,0.1094,0.002;VT=SV GT 3|0 0|0 0|0 0|0 0|0 0|0 0|4

Please note that, CNVs are allele-specific. For example, HG00096 is effectively copy number 4, which would be a net gain on chr22.

1000 Genomes contains 5 types of structural variants:

  • CNV
  • DEL
  • DUP
  • INS
  • INV

Since data of 1000 genomes is provided in VCF format, we assume that the coordinates follow the vcf format, i.e., there is a padding base for symbolic alleles. So all the interval can be interpreted as [BEGIN+1, END]. Similarly, for all other variant types except insertion, END is far larger than BEGIN. The distribution of BEGIN and END for insertions is summarized below.

Insertion issues

  • END = BEGIN for 6/165
  • END = BEGIN+2 for 93/165
  • END = BEGIN+3 for 11/165
  • END = BEGIN+4 for 11/165
  • END – BEGIN range from 5 to 1156 for others.

Converting VCF svTypes to SO sequence alterations

The svType will be captured in our JSON file under the sequenceAlteration key. Here's the translation we'll use according to svType in 1000 Genomes.

svTypeAlternative Alleles contain <CN*>sequenceAlteration
ALUFALSEmobile_element_insertion
DUPTRUEcopy_number_gain
CNVTRUEcopy_number_gain (observed_gains >0 and observed_losses =0)
copy_number_loss (observed_gains = 0 and observed_losses > 0)
copy_number_variation (otherwise)
DELTRUEcopy_number_loss
LINE1FALSEmobile_element_insertion
SVAFALSEmobile_element_insertion
INVFALSEinversion
INSFALSEinsertion

Exceptions

We discard structural variants without END

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
21 9495848 esv3646347 A <INS:ME:LINE1> 100 PASS AC=1543;AF=0.308107;AN=5008;CS=L1_umary;MEINFO=LINE1,5669,6005,+;NS=2504;SVLEN=336;SVTYPE=LINE1;TSD=null;DP=20015;EAS_AF=0.3125;AMR_AF=0.2911;AFR_AF=0.3026;EUR_AF=0.2922;SAS_AF=0.3395;VT=SV GT 0|0 1|1 1|0 0|1 1|0 1|0 0|0

CNVs in chrY

  • No other types of structural variants exist in chrY
  • Since copy number is provided in genotype field, we directly parse the copy number from "CN" field.
  • For most CNVs in chrY, the reference copy number is 1, but the refence number for CNVs in segmental duplication sites is 2 (<CN2> in the 2nd example). All segmental duplication calls have identifiers starting with GS_SD_M2.
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096 HG00101 HG00103 HG00105 HG00107 HG00108
Y 2888555 CNV_Y_2888555_3014661 T <CN2> 100 PASS AC=1;AF=0.000817661;AN=1223;END=3014661;NS=1233;SVTYPE=CNV;AMR_AF=0.0000;AFR_AF=0.0000;EUR_AF=0.0000;SAS_AF=0.0019;EAS_AF=0.0000;VT=SV GT:CN:CNL:CNP:CNQ:GP:GQ:PL 0:1:-1000,0,-58.45:-1000,0,-61.55:99:0,-61.55:99:0,585 0:1:-296.36,0,-16.6:-300.46,0,-19.7:99:0,-19.7:99:0,166 0:1:-1000,0,-39.44:-1000,0,-42.54:99:0,-42.54:99:0,394
Y 6128381 GS_SD_M2_Y_6128381_6230094_Y_9650284_9752225 C <CN1>,<CN3> 100 PASS AC=4,2;AF=0.00327065,0.00163532;AN=1223;END=6230094;NS=1233;SVTYPE=CNV;AMR_AF=0.0029,0.0029;AFR_AF=0.0016,0.0016;EUR_AF=0.0000,0.0000;SAS_AF=0.0038,0.0000;EAS_AF=0.0000,0.0000;VT=SV;EX_TARGET GT:CN:CNL:CNP:CNQ:GP:GQ 0:2:-1000,-138.78,0,-38.53:-1000,-141.27,0,-41.33:99:0,-141.27,-41.33:99 0:2:-1000,-53.32,0,-17.85:-1000,-55.81,0,-20.64:99:0,-55.81,-20.64:99 0:2:-1000,-71.83,0,-32.5:-1000,-74.32,0,-35.29:99:0,-74.32,-35.29:99 0:2:-1000,-60.96,0,-20.29:-1000,-63.45,0,-23.08:99:0,-63.45,-23.08:99 0:2:-1000,-77.6,0,-31.45:-1000,-80.09,0,-34.24:99:0,-80.09,-34.24:99

JSON Output

"oneKg":[
{
"chromosome":"1",
"begin":1595369,
"end":1612441,
"variantType": "copy_number_variation",
"id": "esv3635753;esv3635754;esv3635755;esv3635756;esv3635757",
"allAn": 5008,
"allAc": 2702,
"allAf": 0.539537,
"afrAf": 0.6052,
"amrAf": 0.3675,
"eurAf": 0.5357,
"easAf": 0.5368,
"sasAf": 0.5797,
"reciprocalOverlap": 0.07555
}
],
FieldTypeNotes
chromosomestring
begininteger
endinteger
variantTypestring
idstring
allAnintegerallele number for all populations. Non-zero integer.
allAcintegerallele count for all populations. Integer.
allAffloating pointallele frequency for all populations. Range: 0 - 1.0
afrAffloating pointallele frequency for the African super population. Range: 0 - 1.0
amrAffloating pointallele frequency for the Ad Mixed American super population. Range: 0 - 1.0
eurAffloating pointallele frequency for the European super population. Range: 0 - 1.0
easAffloating pointallele frequency for the East Asian super population. Range: 0 - 1.0
sasAffloating pointallele frequency for the South Asian super population. Range: 0 - 1.0
reciprocalOverlapfloating pointrange: 0 - 1.