COSMIC
Overview
COSMIC, the Catalogue of Somatic Mutations in Cancer, is the world's largest source of expert manually curated somatic mutation information relating to human cancers.
Publication
John G Tate, Sally Bamford, Harry C Jubb, Zbyslaw Sondka, David M Beare, Nidhi Bindal, Harry Boutselakis, Charlotte G Cole, Celestino Creatore, Elisabeth Dawson, Peter Fish, Bhavana Harsha, Charlie Hathaway, Steve C Jupe, Chai Yin Kok, Kate Noble, Laura Ponting, Christopher C Ramshaw, Claire E Rye, Helen E Speedy, Ray Stefancsik, Sam L Thompson, Shicai Wang, Sari Ward, Peter J Campbell, Simon A Forbes. (2019) COSMIC: the Catalogue Of Somatic Mutations In Cancer, Nucleic Acids Research, Volume 47, Issue D1
Professional data source
This is a Professional data source and is not available freely. Please contact annotation_support@illumina.com if you would like to obtain it.
Small Variants
Our main COSMIC deliverable provides annotations for both coding and non-coding variants throughout the genome. As of COSMIC v96, this includes 28.7M variants spanning the human genome. Illumina Connected Annotations currently parses four files to extract the relevant content:
- CosmicCodingMuts.vcf.gz
- CosmicNonCodingVariants.vcf.gz
- CosmicMutantExport.tsv.gz
- CosmicNCV.tsv.gz
VCF extraction
Example
#CHROM POS ID REF ALT QUAL FILTER INFO
1 65797 COSV58737189 T C . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN23957695;CDS=c.9+224T>C;AA=p.?;HGVSC=ENST00000641515.2:c.9+224T>C;HGVSG=1:g.65797T>C;CNT=1
Parsing
From the VCF files, we're mainly interested in the following columns:
CHROM
POS
ID
REF
ALT
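As a minimal sketch of this step, the Python function below pulls those five columns out of one tab-separated VCF data line. The names `CosmicVcfRecord` and `parse_vcf_line` are hypothetical and are not part of Illumina Connected Annotations; the INFO column is simply ignored here.

```python
from typing import NamedTuple, Optional


class CosmicVcfRecord(NamedTuple):
    """The five VCF columns of interest (hypothetical container)."""
    chrom: str
    pos: int
    cosmic_id: str
    ref: str
    alt: str


def parse_vcf_line(line: str) -> Optional[CosmicVcfRecord]:
    """Return the columns of interest, or None for header and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    # VCF data lines are tab-separated; only the first five columns are used.
    chrom, pos, cosmic_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return CosmicVcfRecord(chrom, int(pos), cosmic_id, ref, alt)


# Shortened version of the example line above (INFO trimmed to CNT=1).
example = "1\t65797\tCOSV58737189\tT\tC\t.\t.\tCNT=1"
print(parse_vcf_line(example))
```

The ID column carries the COSMIC genomic mutation ID, which is what links a VCF entry to the rows of the TSV files.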
TSV extraction
Example
Gene name Accession Number Gene CDS length HGNC ID Sample name ID_sample ID_tumour Primary site Site subtype 1 Site subtype 2 Site subtype 3 Primary histology Histology subtype 1 Histology subtype 2 Histology subtype 3 Genome-wide screen GENOMIC_MUTATION_ID LEGACY_MUTATION_ID MUTATION_ID Mutation CDS Mutation AA Mutation Description Mutation zygosity LOH GRCh Mutation genome position Mutation strand Resistance Mutation Mutation somatic status Pubmed_PMID ID_STUDY Sample Type Tumour origin Age HGVSP HGVSC HGVSG
MCF2L_ENST00000375604 ENST00000375604.6 3372 14576 RK091_C01 1918867 1806188 liver NS NS NS carcinoma NS NS NS y COSV65049364 COSN1601909 113108365 c.73+3096A>G p.? Unknown het 38 13:113005079-113005079 + - Variant of unknown origin 322 fresh/frozen - NOS primary ENST00000375604.6:c.73+3096A>G 13:g.113005079A>G
Parsing
From the TSV file, we're mainly interested in the following columns:
GENOMIC_MUTATION_ID
ID_sample
Primary site
Site subtype 1
Primary histology
Histology subtype 1
Pubmed_PMID
Resistance Mutation
Mutation somatic status
For all the histologies and sites, we replace all the underscores with spaces. For example, salivary_gland would become salivary gland.
Parsing
To aggregate the data in Illumina Connected Annotations, we perform the following:
- Parse the coding and non-coding TSV files to retrieve the histologies, sites, PubMed IDs, somatic status, and resistance mutation status. Histologies and sites are tracked with respect to sample IDs.
- Parse the coding and non-coding VCF files to retrieve the genomic variant for each entry.
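The first of these two steps could look roughly like the Python sketch below, which groups TSV rows by GENOMIC_MUTATION_ID and tracks sites and histologies per sample ID. This is an illustration under stated assumptions, not the actual implementation: the function name `aggregate_tsv` is hypothetical, and the marker strings "Confirmed somatic variant" and "Yes" are assumed values that should be checked against the COSMIC release in use. Joining the result with the VCF entries by ID is left out.

```python
import csv
import io


def aggregate_tsv(handle):
    """Group TSV rows by GENOMIC_MUTATION_ID (sketch, not the real parser)."""
    variants = {}
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        key = row["GENOMIC_MUTATION_ID"]
        entry = variants.get(key)
        if entry is None:
            entry = {
                "sites": {},        # (primary, subtype) -> set of sample IDs
                "histologies": {},  # (primary, subtype) -> set of sample IDs
                "pubmed_ids": set(),
                "confirmed_somatic": False,
                "drug_resistance": False,
            }
            variants[key] = entry

        sample_id = row["ID_sample"]
        site = (row["Primary site"], row["Site subtype 1"])
        histology = (row["Primary histology"], row["Histology subtype 1"])
        entry["sites"].setdefault(site, set()).add(sample_id)
        entry["histologies"].setdefault(histology, set()).add(sample_id)

        if row["Pubmed_PMID"]:
            entry["pubmed_ids"].add(int(row["Pubmed_PMID"]))
        # Assumed marker strings; verify against the actual COSMIC files.
        if row["Mutation somatic status"] == "Confirmed somatic variant":
            entry["confirmed_somatic"] = True
        if row["Resistance Mutation"] == "Yes":
            entry["drug_resistance"] = True
    return variants


# Tiny in-memory input restricted to the columns of interest.
header = ("GENOMIC_MUTATION_ID\tID_sample\tPrimary site\tSite subtype 1\t"
          "Primary histology\tHistology subtype 1\tPubmed_PMID\t"
          "Resistance Mutation\tMutation somatic status\n")
line = ("COSV65049364\t1918867\tliver\tNS\tcarcinoma\tNS\t\t-\t"
        "Variant of unknown origin\n")
print(aggregate_tsv(io.StringIO(header + line)))
```

Storing sample IDs in sets means a sample that appears in several rows for the same variant is only counted once when the per-site and per-histology sample counts are derived.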
Aggregating Histologies & Sites
For sites and histologies, we observe that the subtype provides additional description but is still dependent on the primary value. For example, the primary site might be skin, but the subtype is foot. Therefore, we will combine the values in the following manner: skin (foot).
COSMIC uses NS to show that a value is empty. If the subtype is NS, we will use the primary site or histology on its own instead.
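The combination rule can be sketched in a few lines of Python. The function name `format_label` is hypothetical, and the sketch also applies the underscore replacement described earlier. How a primary value of NS should be handled is not covered by the text above, so the sketch does not treat it specially.

```python
def format_label(primary: str, subtype: str) -> str:
    """Combine a primary site or histology with its first subtype."""
    primary = primary.replace("_", " ")
    subtype = subtype.replace("_", " ")
    # COSMIC uses NS for an empty value, so only the primary value is kept.
    if subtype == "NS":
        return primary
    return f"{primary} ({subtype})"


print(format_label("skin", "foot"))                   # skin (foot)
print(format_label("carcinoma", "ductal_carcinoma"))  # carcinoma (ductal carcinoma)
print(format_label("salivary_gland", "NS"))           # salivary gland
```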
Download URL
GRCh37
GRCh38
JSON Output
{
"id":"COSV58272668",
"numSamples":8,
"refAllele":"-",
"altAllele":"CCT",
"histologies":[
{
"name":"carcinoma (serous carcinoma)",
"numSamples":2
},
{
"name":"meningioma (fibroblastic)",
"numSamples":1
},
{
"name":"carcinoma",
"numSamples":1
},
{
"name":"carcinoma (squamous cell carcinoma)",
"numSamples":1
},
{
"name":"meningioma (transitional)",
"numSamples":1
},
{
"name":"carcinoma (adenocarcinoma)",
"numSamples":1
},
{
"name":"other (neoplasm)",
"numSamples":1
}
],
"sites":[
{
"name":"ovary",
"numSamples":2
},
{
"name":"meninges",
"numSamples":2
},
{
"name":"thyroid",
"numSamples":2
},
{
"name":"cervix",
"numSamples":1
},
{
"name":"large intestine (colon)",
"numSamples":1
}
],
"pubMedIds":[
25738363,
27548314
],
"confirmedSomatic":true,
"drugResistance":true, /* not in this particular COSMIC variant */
"isAlleleSpecific":true
}
Field | Type | Notes |
---|---|---|
id | string | COSMIC Genomic Mutation ID |
numSamples | int | |
refAllele | string | |
altAllele | string | |
histologies | count array | phenotypic descriptions |
sites | count array | tissue types |
pubMedIds | int array | PubMed IDs |
confirmedSomatic | bool | true when the variant is a confirmed somatic variant |
drugResistance | bool | true when the variant has been associated with drug resistance |
Count
Field | Type | Notes |
---|---|---|
name | string | description |
numSamples | int | |
Gene Fusions
Gene fusions are manually curated from peer reviewed publications by expert COSMIC curators. A comprehensive literature curation is completed for each fusion pair when it is released in the database. Currently COSMIC includes information on fusions involved in solid tumours and leukaemias.
TSV extraction
Example
SAMPLE_ID SAMPLE_NAME PRIMARY_SITE SITE_SUBTYPE_1 SITE_SUBTYPE_2 SITE_SUBTYPE_3 PRIMARY_HISTOLOGY HISTOLOGY_SUBTYPE_1 HISTOLOGY_SUBTYPE_2 HISTOLOGY_SUBTYPE_3 FUSION_ID TRANSLOCATION_NAME 5'_CHROMOSOME 5'_STRAND 5'_GENE_ID 5'_GENE_NAME 5'_LAST_OBSERVED_EXON 5'_GENOME_START_FROM 5'_GENOME_START_TO 5'_GENOME_STOP_FROM 5'_GENOME_STOP_TO 3'_CHROMOSOME 3'_STRAND 3'_GENE_ID 3'_GENE_NAME 3'_FIRST_OBSERVED_EXON 3'_GENOME_START_FROM 3'_GENOME_START_TO 3'_GENOME_STOP_FROM 3'_GENOME_STOP_TO FUSION_TYPE PUBMED_PMID
749711 HCC1187 breast NS NS NS carcinoma ductal_carcinoma NS NS 665 ENST00000360863.10(RGS22):r.1_3555::ENST00000369518.1(SYCP1):r.2100_3452 8 - 197199 RGS22 22 99981937 99981937 100106116 100106116 1 + 212470 SYCP1_ENST00000369518 24 114944339 114944339 114995367 114995367 Inferred Breakpoint 20033038
Parsing
From the TSV file, we're mainly interested in the following columns:
SAMPLE_ID
PRIMARY_SITE
PRIMARY_HISTOLOGY
HISTOLOGY_SUBTYPE_1
FUSION_ID
TRANSLOCATION_NAME
PUBMED_PMID
For all the histologies and sites, we replace all the underscores with spaces. For example, salivary_gland would become salivary gland.
Parsing
To create the gene fusion entries in Illumina Connected Annotations, we perform the following on each row in the TSV file:
- Group all entries by FUSION_ID
- Using all the entries related to this FUSION_ID:
- Collect all the PubMed IDs
- Tally the number of observed sample IDs
- Grab the HGVS r. notation (should not change throughout the FUSION_ID)
- Tally the number of samples observed for each histology
- Tally the number of samples observed for each site
- Extract the transcript IDs from the HGVS notation and lookup the associated gene symbols
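The grouping steps above can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: `aggregate_fusions`, `count_samples`, and the `transcript_to_gene` dictionary (a stand-in for the gene symbol lookup) are hypothetical names, the raw FUSION_ID is kept as the identifier, and only the primary site and histology are tallied, without the subtype combination.

```python
import re
from collections import defaultdict

# Matches Ensembl transcript IDs and drops the version suffix.
TRANSCRIPT_RE = re.compile(r"(ENST\d+)(?:\.\d+)?")


def count_samples(entries, column):
    """Count distinct sample IDs for each value of the given column."""
    samples = defaultdict(set)
    for entry in entries:
        samples[entry[column].replace("_", " ")].add(entry["SAMPLE_ID"])
    return [{"name": name, "numSamples": len(ids)}
            for name, ids in samples.items()]


def aggregate_fusions(rows, transcript_to_gene):
    """Build one entry per FUSION_ID from parsed TSV rows (dicts)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["FUSION_ID"]].append(row)

    fusions = []
    for fusion_id, entries in grouped.items():
        # The HGVS r. notation should be identical across a FUSION_ID.
        hgvsr = entries[0]["TRANSLOCATION_NAME"]
        transcripts = TRANSCRIPT_RE.findall(hgvsr)
        fusions.append({
            "id": fusion_id,
            "numSamples": len({e["SAMPLE_ID"] for e in entries}),
            "geneSymbols": [transcript_to_gene[t] for t in transcripts
                            if t in transcript_to_gene],
            "hgvsr": hgvsr,
            "histologies": count_samples(entries, "PRIMARY_HISTOLOGY"),
            "sites": count_samples(entries, "PRIMARY_SITE"),
            "pubMedIds": sorted({int(e["PUBMED_PMID"]) for e in entries
                                 if e["PUBMED_PMID"]}),
        })
    return fusions


# The example row above, reduced to the columns of interest.
row = {
    "SAMPLE_ID": "749711", "PRIMARY_SITE": "breast",
    "PRIMARY_HISTOLOGY": "carcinoma", "HISTOLOGY_SUBTYPE_1": "ductal_carcinoma",
    "FUSION_ID": "665", "PUBMED_PMID": "20033038",
    "TRANSLOCATION_NAME": "ENST00000360863.10(RGS22):r.1_3555::"
                          "ENST00000369518.1(SYCP1):r.2100_3452",
}
lookup = {"ENST00000360863": "RGS22", "ENST00000369518": "SYCP1"}
print(aggregate_fusions([row], lookup))
```

Counting distinct sample IDs with sets, rather than counting rows, keeps a sample that is listed more than once for the same fusion from inflating the tallies.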
Aggregating Histologies & Sites
The aggregation of histologies and sites was previously described in the Small Variants section.
Known Issues
There are some issues with the HGVS RNA notation:
- For coding transcripts, HGVS numbering should use CDS coordinates. COSMIC currently uses cDNA coordinates for all of its fusions.
Download URL
GRCh37
GRCh38
JSON Output
"cosmicGeneFusions":[
{
"id":"COSF881",
"numSamples":6,
"geneSymbols":[
"MYB",
"NFIB"
],
"hgvsr":"ENST00000341911.5(MYB):r.1_2368::ENST00000397581.2(NFIB):r.2592_3318",
"histologies":[
{
"name":"adenoid cystic carcinoma",
"numSamples":6
}
],
"sites":[
{
"name":"salivary gland (submandibular)",
"numSamples":1
},
{
"name":"salivary gland (parotid)",
"numSamples":1
},
{
"name":"salivary gland (nasal cavity)",
"numSamples":1
},
{
"name":"breast",
"numSamples":3
}
],
"pubMedIds":[
19841262
]
}
]
Field | Type | Notes |
---|---|---|
id | string | COSMIC fusion ID |
numSamples | int | |
geneSymbols | string array | 5' gene & 3' gene |
hgvsr | string | HGVS RNA translocation fusion notation |
histologies | count array | phenotypic descriptions |
sites | count array | tissue types |
pubMedIds | int array | PubMed IDs |
Count
Field | Type | Notes |
---|---|---|
name | string | description |
numSamples | int | |
Cancer Gene Census
TSV extraction
Example
GENE_NAME CELL_TYPE PUBMED_PMID HALLMARK IMPACT DESCRIPTION CELL_LINE
PRDM16 18496560 role in cancer oncogene oncogene
PRDM16 16015645 role in cancer fusion fusion
Parsing
To extract information about tumor suppressor genes (TSGs) and oncogenes, we keep only the rows that describe the "role in cancer" attribute. Among these, rows with the value "TSG" mark tumor suppressor genes and rows with the value "oncogene" mark oncogenes. Some genes have "TSG/oncogene" as their role, which indicates that they can act as both.
Columns
Only the following columns are needed to gather the required roles in cancer:
GENE_NAME
IMPACT
HALLMARK
Possible Roles in Cancer
While parsing, only the following roles in cancer are found:
fusion
TSG
oncogene
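A minimal Python sketch of this filtering is shown below. It is not the actual implementation: the function name `roles_in_cancer` is hypothetical, it assumes the HALLMARK column holds the text "role in cancer" and the IMPACT column holds the role (as in the example rows above), and splitting a combined value such as "TSG/oncogene" on "/" is an assumption.

```python
import csv
import io

KNOWN_ROLES = ("fusion", "TSG", "oncogene")


def roles_in_cancer(handle):
    """Map each gene name to its list of roles, in first-seen order."""
    roles = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["HALLMARK"] != "role in cancer":
            continue
        # Assumption: a combined value such as "TSG/oncogene" is split on "/".
        for role in (row["IMPACT"] or "").split("/"):
            if role in KNOWN_ROLES:
                gene_roles = roles.setdefault(row["GENE_NAME"], [])
                if role not in gene_roles:
                    gene_roles.append(role)
    return roles


# The two example rows above, written out with explicit tab separators.
tsv = (
    "GENE_NAME\tCELL_TYPE\tPUBMED_PMID\tHALLMARK\tIMPACT\tDESCRIPTION\tCELL_LINE\n"
    "PRDM16\t\t18496560\trole in cancer\toncogene\toncogene\t\n"
    "PRDM16\t\t16015645\trole in cancer\tfusion\tfusion\t\n"
)
print(roles_in_cancer(io.StringIO(tsv)))
```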
Parsing Stats
The file contained the following number of instances for each role type:
Role in cancer | Total Instances |
---|---|
fusion | 149 |
TSG | 195 |
oncogene | 181 |
Total | 525 |
Known Issues
None
Download URL
JSON Output
{
"name": "PRDM16",
"hgncId": 14000,
"ncbiGeneId": "63976",
"ensemblGeneId": "ENSG00000142611",
"cosmic": {
"roleInCancer": [
"oncogene",
"fusion"
]
}
}
Field | Type | Notes |
---|---|---|
roleInCancer | string array | Possible roles in cancer |