CpG Island
Overview
This supplementary data source is used only when annotating methylation gVCF files. It defines the genomic intervals that correspond to CpG islands. The data is available for both GRCh37 and GRCh38, and both versions are obtained from the publicly available UCSC FTP site.
The file schema is explained in this link
Parsing
We read every entry from the UCSC TSV files and store all of its fields so that they can be reported in the output.
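As an illustration only, the parsing step could be sketched in Python as follows. The column order is an assumption based on the UCSC `cpgIslandExt` table (bin, chrom, chromStart, chromEnd, name, length, cpgNum, gcNum, perCpg, perGc, obsExp), and the `CpgIsland` class and `parse_cpg_islands` function are hypothetical names, not the annotator's actual implementation. Stripping the `chr` prefix is also an assumption, made only to match the chromosome naming in the example output.

```python
import io
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class CpgIsland:
    """One CpG island record (hypothetical container, named after the output fields)."""
    bin: int
    chromosome: str
    begin: int
    end: int
    region_id: str
    length: int
    cpg_num: int
    gc_num: int
    percentage_cpg: float
    percentage_gc: float
    observed_to_expected_ratio: float


def parse_cpg_islands(lines: Iterable[str]) -> Iterator[CpgIsland]:
    """Yield one CpgIsland per data line, skipping blank and '#' comment lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 11:
            raise ValueError(f"expected 11 columns, got {len(cols)}: {line!r}")
        chrom = cols[1]
        # Assumption: drop the UCSC "chr" prefix to match the example output ("21").
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        yield CpgIsland(
            bin=int(cols[0]),
            chromosome=chrom,
            begin=int(cols[2]),
            end=int(cols[3]),
            region_id=cols[4],
            length=int(cols[5]),
            cpg_num=int(cols[6]),
            gc_num=int(cols[7]),
            percentage_cpg=float(cols[8]),
            percentage_gc=float(cols[9]),
            observed_to_expected_ratio=float(cols[10]),
        )


# Usage: a single line built from the values in the example output.
sample = "745\tchr21\t20997105\t20998245\tCpG: 89\t1140\t89\t674\t15.6\t59.1\t0.89\n"
islands = list(parse_cpg_islands(io.StringIO(sample)))
print(islands[0].region_id, islands[0].length)
```

Every column is kept, which mirrors the statement above that all fields are stored for output.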
Output
An example of a methylation annotation for a CpG island region is shown below:
{
  "chromosome": "21",
  "begin": 20997105,
  "end": 20998245,
  "regionId": "CpG: 89",
  "biotype": "cpgIsland",
  "bin": 745,
  "length": 1140,
  "cpgNum": 89,
  "gcNum": 674,
  "percentageCpg": 15.6,
  "percentageGc": 59.1,
  "observedToExpectedRatio": 0.89,
  "samples": [
    {
      "sampleId": "mate_len_200bp_100X",
      "averageCpGMethylation": 0.5085069662921352,
      "totalCpGCoverage": 5413,
      "totalCpGPosition": 44.5
    }
  ]
}
The output includes the biotype cpgIsland and contains additional fields derived from the supplementary data.
The following table describes these extra fields:
| Field | Description |
|---|---|
| bin | Indexing bin used for efficient spatial querying. |
| length | Total length of the CpG island region. |
| cpgNum | Number of CpG dinucleotides within the island. |
| gcNum | Number of G and C nucleotides within the island. |
| percentageCpg | Percentage of CpG dinucleotides in the region. |
| percentageGc | Percentage of GC content in the region. |
| observedToExpectedRatio | The ratio of observed to expected CpG counts. |
| regionId | Corresponds to the name column from the original UCSC TSV file. |
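
To make the relationship between the count and percentage fields concrete, the short Python sketch below recomputes the two percentages from the counts in the example output. The formulas are an assumption based on the usual UCSC convention, in which each CpG dinucleotide contributes two bases to the CpG percentage. The function names are hypothetical. `observedToExpectedRatio` is not recomputed, because it depends on separate C and G counts that are not part of the output.

```python
def percentage_cpg(cpg_num: int, length: int) -> float:
    """Percentage of bases in CpG dinucleotides (assumed: two bases per CpG)."""
    return 100.0 * 2 * cpg_num / length


def percentage_gc(gc_num: int, length: int) -> float:
    """Percentage of bases that are G or C."""
    return 100.0 * gc_num / length


# Counts taken from the example output above.
length, cpg_num, gc_num = 1140, 89, 674

cpg_pct = round(percentage_cpg(cpg_num, length), 1)
gc_pct = round(percentage_gc(gc_num, length), 1)
print(cpg_pct)  # 15.6, matching "percentageCpg"
print(gc_pct)   # 59.1, matching "percentageGc"
```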