COSMIC
Overview
COSMIC, the Catalogue of Somatic Mutations in Cancer, is the world's largest source of expert manually curated somatic mutation information relating to human cancers.
Publication
John G Tate, Sally Bamford, Harry C Jubb, Zbyslaw Sondka, David M Beare, Nidhi Bindal, Harry Boutselakis, Charlotte G Cole, Celestino Creatore, Elisabeth Dawson, Peter Fish, Bhavana Harsha, Charlie Hathaway, Steve C Jupe, Chai Yin Kok, Kate Noble, Laura Ponting, Christopher C Ramshaw, Claire E Rye, Helen E Speedy, Ray Stefancsik, Sam L Thompson, Shicai Wang, Sari Ward, Peter J Campbell, Simon A Forbes. (2019) COSMIC: the Catalogue Of Somatic Mutations In Cancer, Nucleic Acids Research, Volume 47, Issue D1
Licensed Content
Commercial companies are required to acquire a license from COSMIC. At the moment, this means that our COSMIC content is only available in Illumina's products and services, not in the open source distribution.
Since many of you are academic users, we will enable a COSMIC login in our downloader later this year that will allow academic and commercial organizations (with a license) to access our COSMIC data sources.
Gene Fusions
Gene fusions are manually curated from peer-reviewed publications by expert COSMIC curators. A comprehensive literature curation is completed for each fusion pair when it is released in the database. Currently, COSMIC includes information on fusions involved in solid tumours and leukaemias.
TSV File
Example
SAMPLE_ID SAMPLE_NAME PRIMARY_SITE SITE_SUBTYPE_1 SITE_SUBTYPE_2 SITE_SUBTYPE_3 PRIMARY_HISTOLOGY HISTOLOGY_SUBTYPE_1 HISTOLOGY_SUBTYPE_2 HISTOLOGY_SUBTYPE_3 FUSION_ID TRANSLOCATION_NAME 5'_CHROMOSOME 5'_STRAND 5'_GENE_ID 5'_GENE_NAME 5'_LAST_OBSERVED_EXON 5'_GENOME_START_FROM 5'_GENOME_START_TO 5'_GENOME_STOP_FROM 5'_GENOME_STOP_TO 3'_CHROMOSOME 3'_STRAND 3'_GENE_ID 3'_GENE_NAME 3'_FIRST_OBSERVED_EXON 3'_GENOME_START_FROM 3'_GENOME_START_TO 3'_GENOME_STOP_FROM 3'_GENOME_STOP_TO FUSION_TYPE PUBMED_PMID
749711 HCC1187 breast NS NS NS carcinoma ductal_carcinoma NS NS 665 ENST00000360863.10(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452 8 - 197199 RGS22 22 99981937 99981937 100106116 100106116 1 + 212470 SYCP1_ENST00000369518 24 114944339 114944339 114995367 114995367 Inferred Breakpoint 20033038
Parsing
From the TSV file, we're mainly interested in the following columns:
SAMPLE_ID
PRIMARY_SITE
PRIMARY_HISTOLOGY
HISTOLOGY_SUBTYPE_1
FUSION_ID
TRANSLOCATION_NAME
PUBMED_PMID
info
For all the histologies and sites, we replace all the underscores with spaces. For example, salivary_gland becomes salivary gland.
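The column selection and underscore replacement described above can be sketched in Python as follows. This is a minimal illustration rather than Nirvana's actual parser: the function name `parse_rows`, the constant names, and the inline sample data are assumptions made for this example, and only the columns listed above are kept.

```python
import csv
import io

# Columns listed above as the ones of interest.
COLUMNS = [
    "SAMPLE_ID", "PRIMARY_SITE", "PRIMARY_HISTOLOGY", "HISTOLOGY_SUBTYPE_1",
    "FUSION_ID", "TRANSLOCATION_NAME", "PUBMED_PMID",
]

# Histology and site columns, where underscores are replaced with spaces.
TEXT_COLUMNS = ["PRIMARY_SITE", "PRIMARY_HISTOLOGY", "HISTOLOGY_SUBTYPE_1"]


def parse_rows(handle):
    """Yield one dict per TSV row, keeping only the columns of interest."""
    for row in csv.DictReader(handle, delimiter="\t"):
        record = {name: row[name] for name in COLUMNS}
        for name in TEXT_COLUMNS:
            record[name] = record[name].replace("_", " ")
        yield record


# Hypothetical two-line TSV (header plus one row) used only for illustration.
SAMPLE_TSV = (
    "SAMPLE_ID\tPRIMARY_SITE\tPRIMARY_HISTOLOGY\tHISTOLOGY_SUBTYPE_1\t"
    "FUSION_ID\tTRANSLOCATION_NAME\tFUSION_TYPE\tPUBMED_PMID\n"
    "749711\tsalivary_gland\tcarcinoma\tductal_carcinoma\t"
    "665\tplaceholder\tInferred Breakpoint\t20033038\n"
)
rows = list(parse_rows(io.StringIO(SAMPLE_TSV)))
```

For the real file, the same function could be fed a handle from `gzip.open(path, "rt")`, since the export is distributed as a gzipped TSV.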
Aggregation
To create the gene fusion entries in Nirvana, we process the rows in the TSV file as follows:
- Group all entries by FUSION_ID
- Using all the entries related to this FUSION_ID:
  - Collect all the PubMed IDs
  - Tally the number of observed sample IDs
  - Grab the HGVS r. notation (should not change throughout the FUSION_ID)
  - Tally the number of samples observed for each histology
  - Tally the number of samples observed for each site
  - Extract the transcript IDs from the HGVS notation and look up the associated gene symbols
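The grouping and tallying steps above might look roughly like the following Python sketch. Several things here are assumptions for illustration and not confirmed by the source: the function name `aggregate_fusions`, the hypothetical `HISTOLOGY` and `SITE` keys (standing in for the already-resolved histology and site names), and the choice to count each sample ID once per fusion. The gene symbol lookup is omitted because it depends on a transcript database.

```python
import re
from collections import Counter, defaultdict

# Matches Ensembl transcript IDs such as ENST00000360863.10.
TRANSCRIPT_ID = re.compile(r"ENST\d+(?:\.\d+)?")


def aggregate_fusions(rows):
    """Group rows by FUSION_ID and summarise each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["FUSION_ID"]].append(row)

    entries = {}
    for fusion_id, group in groups.items():
        # Assumption: a sample is counted once per fusion, using its first row.
        first_row_per_sample = {}
        for row in group:
            first_row_per_sample.setdefault(row["SAMPLE_ID"], row)
        samples = list(first_row_per_sample.values())

        # The HGVS r. notation should not change within a FUSION_ID.
        hgvs = group[0]["TRANSLOCATION_NAME"]
        entries[fusion_id] = {
            "numSamples": len(samples),
            "pubMedIds": sorted(
                {int(r["PUBMED_PMID"]) for r in group if r["PUBMED_PMID"]}
            ),
            "hgvsr": hgvs,
            "histologies": Counter(r["HISTOLOGY"] for r in samples),
            "sites": Counter(r["SITE"] for r in samples),
            "transcriptIds": TRANSCRIPT_ID.findall(hgvs),
        }
    return entries
```

The returned `transcriptIds` list is the input one would hand to whatever transcript-to-gene-symbol lookup is available.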
Fixing the HGVS RNA Notation
ENST00000360863.6(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452
There are some issues with the HGVS RNA notation:
- The two transcripts should be linked by a double colon (::).
- For coding transcripts, HGVS numbering should use CDS coordinates. Right now, COSMIC is using cDNA coordinates for all their fusions.
- If only the breakpoint is truly known, the recommendation is to use ? marks.
We chose to only update the linkage between each transcript using double colons (::). While we could have recalculated the HGVS notation using the supplied breakpoints, we chose not to because the resulting notation would be quite different from the original material. This would potentially lead to some confusion.
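One way to express the linkage fix is a single regular expression substitution, sketched below in Python. This is an illustrative assumption rather than Nirvana's actual code: the name `fix_linkage` is hypothetical, and the pattern assumes that every partner starts with an Ensembl transcript ID (ENST...) followed directly by the gene symbol in parentheses.

```python
import re

# An underscore that is immediately followed by a transcript ID and "("
# is treated as the join between two fusion partners. Underscores inside
# coordinate ranges (for example r.1_3555) are left untouched.
LINKAGE = re.compile(r"_(?=ENST\d+(?:\.\d+)?\()")


def fix_linkage(hgvs):
    """Replace the underscore joining two transcripts with a double colon."""
    return LINKAGE.sub("::", hgvs)


original = "ENST00000360863.6(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452"
fixed = fix_linkage(original)
# fixed == "ENST00000360863.6(RGS22):r.1_3555::ENST00000369518.1(SYCP1):r.2100_3452"
```

Requiring the opening parenthesis after the transcript ID keeps the substitution from firing on gene names that embed a transcript ID, such as SYCP1_ENST00000369518.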
Aggregating Histologies
For histologies, we want to capture the most specific description available. In the example above, we saw that the primary histology was carcinoma, but the subtype was ductal carcinoma. In this case, we would use the subtype for the annotation.
COSMIC uses NS to show that a value is empty. If the subtype is NS, we will use the primary histology instead.
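The histology rule can be written as a small helper. The sketch below is a hedged illustration: `resolve_histology` and `NOT_SPECIFIED` are names invented for this example, and the function also applies the underscore replacement described in the parsing section so that it works on raw TSV values.

```python
# COSMIC's marker for an empty value.
NOT_SPECIFIED = "NS"


def resolve_histology(primary, subtype):
    """Return the most specific histology: the subtype unless it is NS."""
    chosen = primary if subtype == NOT_SPECIFIED else subtype
    return chosen.replace("_", " ")


specific = resolve_histology("carcinoma", "ductal_carcinoma")
fallback = resolve_histology("carcinoma", "NS")
# specific == "ductal carcinoma", fallback == "carcinoma"
```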
Aggregating Sites
For sites, we observe that the subtype provides additional description but is still dependent on the primary site value. For example, the primary site might be skin, but the subtype is foot. Therefore, we will combine the values in the following manner: skin (foot).
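A minimal Python sketch of the site rule follows. The name `resolve_site` is hypothetical, and the handling of an NS subtype (returning the primary site on its own) is an assumption inferred from the bare "breast" entry in the JSON output, not something the text states explicitly.

```python
# COSMIC's marker for an empty value.
NOT_SPECIFIED = "NS"


def resolve_site(primary, subtype):
    """Combine a primary site and its subtype as 'primary (subtype)'."""
    primary = primary.replace("_", " ")
    if subtype == NOT_SPECIFIED:
        # Assumption: with no subtype, the primary site is used alone.
        return primary
    return f"{primary} ({subtype.replace('_', ' ')})"


combined = resolve_site("skin", "foot")
primary_only = resolve_site("breast", "NS")
# combined == "skin (foot)", primary_only == "breast"
```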
Known Issues
There are some issues with the HGVS RNA notation:
- The two transcripts should be linked by a double colon (::). We fixed this aspect in Nirvana.
- For coding transcripts, HGVS numbering should use CDS coordinates. Right now, COSMIC is using cDNA coordinates for all their fusions.
Download URL
- https://cancer.sanger.ac.uk/cosmic/file_download/GRCh37/cosmic/v94/CosmicFusionExport.tsv.gz
- https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/v94/CosmicFusionExport.tsv.gz
JSON Output
"cosmicGeneFusions":[
   {
      "id":"COSF881",
      "numSamples":6,
      "geneSymbols":[
         "MYB",
         "NFIB"
      ],
      "hgvsr":"ENST00000341911.5(MYB):r.1_2368::ENST00000397581.2(NFIB):r.2592_3318",
      "histologies":[
         {
            "name":"adenoid cystic carcinoma",
            "numSamples":6
         }
      ],
      "sites":[
         {
            "name":"salivary gland (submandibular)",
            "numSamples":1
         },
         {
            "name":"salivary gland (parotid)",
            "numSamples":1
         },
         {
            "name":"salivary gland (nasal cavity)",
            "numSamples":1
         },
         {
            "name":"breast",
            "numSamples":3
         }
      ],
      "pubMedIds":[
         19841262
      ]
   }
]
| Field | Type | Notes |
|---|---|---|
| id | string | COSMIC fusion ID |
| numSamples | int | |
| geneSymbols | string array | 5' gene & 3' gene |
| hgvsr | string | HGVS RNA translocation fusion notation |
| histologies | count array | phenotypic descriptions |
| sites | count array | tissue types |
| pubMedIds | int array | PubMed IDs |
Count
| Field | Type | Notes |
|---|---|---|
| name | string | description |
| numSamples | int | |
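To make the relationship between the tallies and the JSON fields concrete, here is a small Python sketch that assembles one entry in the shape documented above. The helper names `to_count_array` and `build_entry` are hypothetical, and the values are copied from the JSON example; this is not how Nirvana itself serializes its output.

```python
import json


def to_count_array(tally):
    """Turn a {name: count} mapping into the count array described above."""
    return [{"name": name, "numSamples": count} for name, count in tally.items()]


def build_entry(fusion_id, num_samples, gene_symbols, hgvsr,
                histologies, sites, pubmed_ids):
    """Assemble one gene fusion entry using the documented field names."""
    return {
        "id": fusion_id,
        "numSamples": num_samples,
        "geneSymbols": list(gene_symbols),  # 5' gene first, then 3' gene
        "hgvsr": hgvsr,
        "histologies": to_count_array(histologies),
        "sites": to_count_array(sites),
        "pubMedIds": list(pubmed_ids),
    }


entry = build_entry(
    "COSF881", 6, ["MYB", "NFIB"],
    "ENST00000341911.5(MYB):r.1_2368::ENST00000397581.2(NFIB):r.2592_3318",
    {"adenoid cystic carcinoma": 6},
    {
        "salivary gland (submandibular)": 1,
        "salivary gland (parotid)": 1,
        "salivary gland (nasal cavity)": 1,
        "breast": 3,
    },
    [19841262],
)
text = json.dumps({"cosmicGeneFusions": [entry]}, indent=3)
```

Because Python dictionaries preserve insertion order, the count arrays come out in the order the tallies were built.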