Version: 3.27

CpG Island

Overview

This supplementary data source is used only when annotating methylation gVCF files. It defines the genomic intervals that are CpG islands. The supplementary data is available for both GRCh37 and GRCh38, and both data sets are obtained from the publicly available UCSC FTP site.

File URL for GRCh37

File URL for GRCh38

The file schema is explained in this link

Parsing

We read every entry from these TSV files and store all of its fields so that they can be reported in the output.
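As an illustration only, the parsing step can be sketched in Python as follows. This is not the annotator's actual implementation. The column order follows the UCSC cpgIslandExt table schema, the function name `parse_cpg_islands` is hypothetical, and the sample row is reconstructed from the example output in the Output section. The sketch copies coordinates unchanged and strips the `chr` prefix to match the example output; both choices are assumptions.

```python
import csv
import io

# Column order of the UCSC cpgIslandExt table (per the UCSC schema).
UCSC_COLUMNS = [
    "bin", "chrom", "chromStart", "chromEnd", "name",
    "length", "cpgNum", "gcNum", "perCpg", "perGc", "obsExp",
]


def parse_cpg_islands(handle):
    """Yield one dict per CpG island row, keyed by the output field names."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the '#bin ...' header line if present.
        if not row or row[0].startswith("#"):
            continue
        rec = dict(zip(UCSC_COLUMNS, row))
        chrom = rec["chrom"]
        yield {
            # Assumption: drop the 'chr' prefix, as in the example output.
            "chromosome": chrom[3:] if chrom.startswith("chr") else chrom,
            # Assumption: coordinates are copied as-is, with no conversion.
            "begin": int(rec["chromStart"]),
            "end": int(rec["chromEnd"]),
            "regionId": rec["name"],
            "biotype": "cpgIsland",
            "bin": int(rec["bin"]),
            "length": int(rec["length"]),
            "cpgNum": int(rec["cpgNum"]),
            "gcNum": int(rec["gcNum"]),
            "percentageCpg": float(rec["perCpg"]),
            "percentageGc": float(rec["perGc"]),
            "observedToExpectedRatio": float(rec["obsExp"]),
        }


# Sample row reconstructed from the example output, for illustration only.
sample = "745\tchr21\t20997105\t20998245\tCpG: 89\t1140\t89\t674\t15.6\t59.1\t0.89\n"
islands = list(parse_cpg_islands(io.StringIO(sample)))
print(islands[0]["regionId"], islands[0]["length"])
```

Keeping every column, rather than only the interval, is what allows the extra fields to be reported alongside the methylation statistics.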

Output

An example methylation annotation for a CpG island region is shown below:

{
  "chromosome": "21",
  "begin": 20997105,
  "end": 20998245,
  "regionId": "CpG: 89",
  "biotype": "cpgIsland",
  "bin": 745,
  "length": 1140,
  "cpgNum": 89,
  "gcNum": 674,
  "percentageCpg": 15.6,
  "percentageGc": 59.1,
  "observedToExpectedRatio": 0.89,
  "samples": [
    {
      "sampleId": "mate_len_200bp_100X",
      "averageCpGMethylation": 0.5085069662921352,
      "totalCpGCoverage": 5413,
      "totalCpGPosition": 44.5
    }
  ]
}

The output includes the biotype cpgIsland and contains additional fields derived from the supplementary data.

The following table describes these extra fields:

| Field | Description |
|---|---|
| bin | Indexing bin used for efficient spatial querying. |
| length | Total length of the CpG island region. |
| cpgNum | Number of CpG dinucleotides within the island. |
| gcNum | Number of G and C nucleotides within the island. |
| percentageCpg | Percentage of CpG dinucleotides in the region. |
| percentageGc | Percentage of GC content in the region. |
| observedToExpectedRatio | The ratio of observed to expected CpG counts. |
| regionId | Corresponds to the name column from the original UCSC TSV file. |
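The two percentage fields can be related to the count fields with a short check. The sketch below is an illustration, not part of the annotator; the helper names `percentage_cpg` and `percentage_gc` are hypothetical. It assumes the usual UCSC convention that each CpG dinucleotide contributes two bases to the CpG percentage. The values in the example output above are consistent with that convention.

```python
import json

# Subset of the example annotation shown in the Output section.
annotation = json.loads("""
{
  "length": 1140,
  "cpgNum": 89,
  "gcNum": 674,
  "percentageCpg": 15.6,
  "percentageGc": 59.1
}
""")


def percentage_cpg(cpg_num, length):
    """Percentage of bases that belong to a CpG dinucleotide.

    Assumption: each CpG covers two bases, hence the factor of 2.
    """
    return round(100.0 * 2 * cpg_num / length, 1)


def percentage_gc(gc_num, length):
    """Percentage of bases that are G or C."""
    return round(100.0 * gc_num / length, 1)


cpg_pct = percentage_cpg(annotation["cpgNum"], annotation["length"])
gc_pct = percentage_gc(annotation["gcNum"], annotation["length"])

# Compare the recomputed values with the ones reported in the annotation.
print(cpg_pct == annotation["percentageCpg"], gc_pct == annotation["percentageGc"])
```

A check like this can help catch column mix-ups when the supplementary file is regenerated, since the reported percentages should agree with the counts to one decimal place.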