Version: 3.16

COSMIC

Overview

COSMIC, the Catalogue of Somatic Mutations in Cancer, is the world's largest source of expert manually curated somatic mutation information relating to human cancers.

Publication

John G Tate, Sally Bamford, Harry C Jubb, Zbyslaw Sondka, David M Beare, Nidhi Bindal, Harry Boutselakis, Charlotte G Cole, Celestino Creatore, Elisabeth Dawson, Peter Fish, Bhavana Harsha, Charlie Hathaway, Steve C Jupe, Chai Yin Kok, Kate Noble, Laura Ponting, Christopher C Ramshaw, Claire E Rye, Helen E Speedy, Ray Stefancsik, Sam L Thompson, Shicai Wang, Sari Ward, Peter J Campbell, Simon A Forbes. (2019) COSMIC: the Catalogue Of Somatic Mutations In Cancer, Nucleic Acids Research, Volume 47, Issue D1

Licensed Content

Commercial companies are required to acquire a license from COSMIC. At the moment, this means that our COSMIC content is only available in Illumina's products and services, not in the open source distribution.

Since many of you are academic users, we will enable a COSMIC login in our downloader later this year. This will allow academic users and licensed commercial organizations to access our COSMIC data sources.

Gene Fusions

Gene fusions are manually curated from peer reviewed publications by expert COSMIC curators. A comprehensive literature curation is completed for each fusion pair when it is released in the database. Currently COSMIC includes information on fusions involved in solid tumours and leukaemias.

TSV File

Example

SAMPLE_ID       SAMPLE_NAME     PRIMARY_SITE    SITE_SUBTYPE_1  SITE_SUBTYPE_2  SITE_SUBTYPE_3  PRIMARY_HISTOLOGY      HISTOLOGY_SUBTYPE_1      HISTOLOGY_SUBTYPE_2     HISTOLOGY_SUBTYPE_3     FUSION_ID       TRANSLOCATION_NAME      5'_CHROMOSOME   5'_STRAND       5'_GENE_ID      5'_GENE_NAME    5'_LAST_OBSERVED_EXON   5'_GENOME_START_FROM    5'_GENOME_START_TO      5'_GENOME_STOP_FROM     5'_GENOME_STOP_TO       3'_CHROMOSOME   3'_STRAND       3'_GENE_ID      3'_GENE_NAME   3'_FIRST_OBSERVED_EXON   3'_GENOME_START_FROM    3'_GENOME_START_TO      3'_GENOME_STOP_FROM     3'_GENOME_STOP_TO      FUSION_TYPE      PUBMED_PMID
749711 HCC1187 breast NS NS NS carcinoma ductal_carcinoma NS NS 665 ENST00000360863.10(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452 8 - 197199 RGS22 22 99981937 99981937 100106116 100106116 1 + 212470 SYCP1_ENST00000369518 24 114944339 114944339 114995367 114995367 Inferred Breakpoint 20033038

Parsing

From the TSV file, we're mainly interested in the following columns:

  • SAMPLE_ID
  • PRIMARY_SITE
  • PRIMARY_HISTOLOGY
  • HISTOLOGY_SUBTYPE_1
  • FUSION_ID
  • TRANSLOCATION_NAME
  • PUBMED_PMID
Note: For all histologies and sites, we replace underscores with spaces. For example, salivary_gland becomes salivary gland.
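As an illustration of the parsing step, here is a minimal Python sketch. It is not Nirvana's implementation: the function name `read_fusion_rows` and the constants are hypothetical, and it assumes the file is tab-separated with the header row shown in the example above.

```python
import csv

# Columns of interest, as listed above.
COLUMNS = [
    "SAMPLE_ID",
    "PRIMARY_SITE",
    "PRIMARY_HISTOLOGY",
    "HISTOLOGY_SUBTYPE_1",
    "FUSION_ID",
    "TRANSLOCATION_NAME",
    "PUBMED_PMID",
]

# Site and histology columns whose underscores are replaced with spaces.
NORMALIZED = ("PRIMARY_SITE", "PRIMARY_HISTOLOGY", "HISTOLOGY_SUBTYPE_1")


def read_fusion_rows(lines):
    """Yield one dict per TSV row, keeping only the columns of interest.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        record = {name: row[name] for name in COLUMNS}
        for name in NORMALIZED:
            record[name] = record[name].replace("_", " ")
        yield record
```

With an open file handle `handle`, `list(read_fusion_rows(handle))` would return the reduced records in file order.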

Aggregation

To create the gene fusion entries in Nirvana, we perform the following steps on the rows of the TSV file:

  • Group all entries by FUSION_ID
  • Using all the entries related to this FUSION_ID:
    • Collect all the PubMed IDs
    • Tally the number of observed sample IDs
    • Grab the HGVS r. notation (it should be identical for every entry with the same FUSION_ID)
    • Tally the number of samples observed for each histology
    • Tally the number of samples observed for each site
  • Extract the transcript IDs from the HGVS notation and look up the associated gene symbols
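The aggregation steps above can be sketched roughly as follows. This is a hedged illustration, not Nirvana's code. It makes several assumptions: each record is a dict that already carries hypothetical `histology` and `site` keys (the final labels described in the sections below), tallies count each sample ID once, transcript IDs are Ensembl `ENST` identifiers, and the gene symbol lookup is left out because it depends on a transcript data source.

```python
import re
from collections import defaultdict

# Assumption: transcripts in the notation are Ensembl IDs such as ENST00000360863.6.
TRANSCRIPT_ID = re.compile(r"ENST\d+(?:\.\d+)?")


def aggregate_fusions(records):
    """Group records by FUSION_ID and summarize each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["FUSION_ID"]].append(record)

    entries = []
    for fusion_id, rows in groups.items():
        samples = {row["SAMPLE_ID"] for row in rows}
        pubmed_ids = sorted({int(row["PUBMED_PMID"]) for row in rows if row["PUBMED_PMID"]})

        # Track distinct samples per label so a repeated sample is counted once.
        by_histology = defaultdict(set)
        by_site = defaultdict(set)
        for row in rows:
            by_histology[row["histology"]].add(row["SAMPLE_ID"])
            by_site[row["site"]].add(row["SAMPLE_ID"])

        # The notation is expected to be the same on every row of the group.
        hgvs = rows[0]["TRANSLOCATION_NAME"]

        entries.append({
            "id": fusion_id,  # raw FUSION_ID; any prefixing is out of scope here
            "numSamples": len(samples),
            "hgvsr": hgvs,
            "transcriptIds": TRANSCRIPT_ID.findall(hgvs),
            "histologies": [{"name": n, "numSamples": len(s)} for n, s in by_histology.items()],
            "sites": [{"name": n, "numSamples": len(s)} for n, s in by_site.items()],
            "pubMedIds": pubmed_ids,
        })
    return entries
```

The `transcriptIds` field is only an intermediate value in this sketch. A real pipeline would map those IDs to gene symbols.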

Fixing the HGVS RNA Notation

ENST00000360863.6(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452

There are some issues with the HGVS RNA notation:

  • The two transcripts should be linked by a double colon ::.
  • For coding transcripts, HGVS numbering should use CDS coordinates. Right now COSMIC is using cDNA coordinates for all their fusions.
  • If only the breakpoint is truly known, the HGVS recommendation is to use question marks (?).

We chose to only update the linkage between each transcript using double colons ::. While we could have recalculated the HGVS notation using the supplied breakpoints, we chose not to because the resulting notation would be quite different from the original material. This would potentially lead to some confusion.
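The linkage fix can be pictured with a small Python sketch. The function name `fix_linkage` is hypothetical, and the approach assumes every transcript in the notation is an Ensembl `ENST` identifier, so the only underscores that directly precede `ENST` are the ones joining two transcripts.

```python
import re


def fix_linkage(hgvs):
    """Replace the underscore that joins two transcripts with a double colon.

    Underscores inside a coordinate range (e.g. r.1_3555) are left alone
    because they are followed by a digit, not by a transcript ID.
    """
    return re.sub(r"_(?=ENST\d)", "::", hgvs)


original = "ENST00000360863.6(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452"
print(fix_linkage(original))
# prints ENST00000360863.6(RGS22):r.1_3555::ENST00000369518.1(SYCP1):r.2100_3452
```

The coordinates themselves are untouched, in line with the decision above not to recalculate the notation.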

Aggregating Histologies

For histologies we want to capture the most specific description available. In the example above, we saw that the primary histology was carcinoma, but the subtype was ductal carcinoma. In this case we would use the subtype for the annotation.

COSMIC uses NS to show that a value is empty. If the subtype is NS, we will use the primary histology instead.
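A minimal sketch of this rule in Python, with a hypothetical function name `pick_histology`. It also applies the underscore replacement described in the parsing section so the result is ready for display.

```python
EMPTY = "NS"  # COSMIC's marker for an empty value


def pick_histology(primary, subtype):
    """Return the most specific histology label available."""
    label = primary if subtype == EMPTY else subtype
    return label.replace("_", " ")


print(pick_histology("carcinoma", "ductal_carcinoma"))  # prints ductal carcinoma
print(pick_histology("carcinoma", "NS"))                # prints carcinoma
```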

Aggregating Sites

For sites, we observe that the subtype provides additional description but is still dependent on the primary site value. For example, the primary site might be skin, but the subtype is foot. Therefore, we will combine the values in the following manner: skin (foot).
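The site rule could be sketched like this in Python. The function name `format_site` is hypothetical. Treating an `NS` subtype as "no subtype" is an assumption based on COSMIC's use of `NS` for empty values; it is consistent with the plain "breast" entry in the JSON output below.

```python
EMPTY = "NS"  # COSMIC's marker for an empty value


def format_site(primary, subtype):
    """Combine the primary site and its subtype as 'primary (subtype)'."""
    primary = primary.replace("_", " ")
    if subtype == EMPTY:
        # Assumption: with no subtype, only the primary site is reported.
        return primary
    return f"{primary} ({subtype.replace('_', ' ')})"


print(format_site("skin", "foot"))               # prints skin (foot)
print(format_site("salivary_gland", "parotid"))  # prints salivary gland (parotid)
print(format_site("breast", "NS"))               # prints breast
```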

Known Issues

There are some issues with the HGVS RNA notation:

  • The two transcripts should be linked by a double colon ::. We fixed this aspect in Nirvana.
  • For coding transcripts, HGVS numbering should use CDS coordinates. Right now COSMIC is using cDNA coordinates for all their fusions.

Download URL

JSON Output

"cosmicGeneFusions":[
   {
      "id":"COSF881",
      "numSamples":6,
      "geneSymbols":[
         "MYB",
         "NFIB"
      ],
      "hgvsr":"ENST00000341911.5(MYB):r.1_2368::ENST00000397581.2(NFIB):r.2592_3318",
      "histologies":[
         {
            "name":"adenoid cystic carcinoma",
            "numSamples":6
         }
      ],
      "sites":[
         {
            "name":"salivary gland (submandibular)",
            "numSamples":1
         },
         {
            "name":"salivary gland (parotid)",
            "numSamples":1
         },
         {
            "name":"salivary gland (nasal cavity)",
            "numSamples":1
         },
         {
            "name":"breast",
            "numSamples":3
         }
      ],
      "pubMedIds":[
         19841262
      ]
   }
]

| Field | Type | Notes |
|:--|:--|:--|
| id | string | COSMIC fusion ID |
| numSamples | int | |
| geneSymbols | string array | 5' gene & 3' gene |
| hgvsr | string | HGVS RNA translocation fusion notation |
| histologies | count array | phenotypic descriptions |
| sites | count array | tissue types |
| pubMedIds | int array | PubMed IDs |

Count

| Field | Type | Notes |
|:--|:--|:--|
| name | string | description |
| numSamples | int | |