COSMIC
Overview
COSMIC, the Catalogue of Somatic Mutations in Cancer, is the world's largest source of expert manually curated somatic mutation information relating to human cancers.
Publication
John G Tate, Sally Bamford, Harry C Jubb, Zbyslaw Sondka, David M Beare, Nidhi Bindal, Harry Boutselakis, Charlotte G Cole, Celestino Creatore, Elisabeth Dawson, Peter Fish, Bhavana Harsha, Charlie Hathaway, Steve C Jupe, Chai Yin Kok, Kate Noble, Laura Ponting, Christopher C Ramshaw, Claire E Rye, Helen E Speedy, Ray Stefancsik, Sam L Thompson, Shicai Wang, Sari Ward, Peter J Campbell, Simon A Forbes. (2019) COSMIC: the Catalogue Of Somatic Mutations In Cancer, Nucleic Acids Research, Volume 47, Issue D1
Licensed Content
Commercial companies are required to acquire a license from COSMIC. At the moment, this means that our COSMIC content is only available in Illumina's products and services, not in the open source distribution.
Since many of you are academic users, we will enable a COSMIC login in our downloader later this year that will allow academic and commercial organizations (with a license) to access our COSMIC data sources.
Gene Fusions
Gene fusions are manually curated from peer-reviewed publications by expert COSMIC curators. A comprehensive literature curation is completed for each fusion pair when it is released in the database. Currently, COSMIC includes information on fusions involved in solid tumours and leukaemias.
TSV File
Example
SAMPLE_ID       SAMPLE_NAME     PRIMARY_SITE    SITE_SUBTYPE_1  SITE_SUBTYPE_2  SITE_SUBTYPE_3  PRIMARY_HISTOLOGY      HISTOLOGY_SUBTYPE_1      HISTOLOGY_SUBTYPE_2     HISTOLOGY_SUBTYPE_3     FUSION_ID       TRANSLOCATION_NAME      5'_CHROMOSOME   5'_STRAND       5'_GENE_ID      5'_GENE_NAME    5'_LAST_OBSERVED_EXON   5'_GENOME_START_FROM    5'_GENOME_START_TO      5'_GENOME_STOP_FROM     5'_GENOME_STOP_TO       3'_CHROMOSOME   3'_STRAND       3'_GENE_ID      3'_GENE_NAME   3'_FIRST_OBSERVED_EXON   3'_GENOME_START_FROM    3'_GENOME_START_TO      3'_GENOME_STOP_FROM     3'_GENOME_STOP_TO      FUSION_TYPE      PUBMED_PMID
749711  HCC1187 breast  NS      NS      NS      carcinoma       ductal_carcinoma        NS      NS      665     ENST00000360863.10(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452 8       -       197199  RGS22   22      99981937       99981937 100106116       100106116       1       +       212470  SYCP1_ENST00000369518   24      114944339       114944339       114995367       114995367       Inferred Breakpoint     20033038
Parsing
From the TSV file, we're mainly interested in the following columns:
- SAMPLE_ID
- PRIMARY_SITE
- PRIMARY_HISTOLOGY
- HISTOLOGY_SUBTYPE_1
- FUSION_ID
- TRANSLOCATION_NAME
- PUBMED_PMID
For all histologies and sites, we replace underscores with spaces. For example, salivary_gland becomes salivary gland.
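The parsing step described above could be sketched in Python roughly as follows. This is only an illustration, not Nirvana's actual implementation: the names `parse_rows`, `COLUMNS_OF_INTEREST` and `NORMALIZED_COLUMNS` are hypothetical, and the sketch assumes a tab-delimited file whose first line is the header shown in the example.

```python
import csv

# Columns we keep from each row, as listed above.
COLUMNS_OF_INTEREST = [
    "SAMPLE_ID",
    "PRIMARY_SITE",
    "PRIMARY_HISTOLOGY",
    "HISTOLOGY_SUBTYPE_1",
    "FUSION_ID",
    "TRANSLOCATION_NAME",
    "PUBMED_PMID",
]

# Site and histology columns in which underscores are replaced with spaces.
NORMALIZED_COLUMNS = ["PRIMARY_SITE", "PRIMARY_HISTOLOGY", "HISTOLOGY_SUBTYPE_1"]


def parse_rows(lines):
    """Yield one dict per TSV row, restricted to the columns of interest.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    # QUOTE_NONE: treat quote characters (such as the ' in 5'_STRAND) literally.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        record = {name: row[name] for name in COLUMNS_OF_INTEREST}
        for name in NORMALIZED_COLUMNS:
            record[name] = record[name].replace("_", " ")
        yield record
```

For a gzip-compressed export, the lines could be supplied with `gzip.open(path, "rt")` from the standard library.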
Aggregation
To create the gene fusion entries in Nirvana, we process the rows of the TSV file as follows:
- Group all entries by FUSION_ID
- Using all the entries related to this FUSION_ID:
  - Collect all the PubMed IDs
  - Tally the number of observed sample IDs
  - Grab the HGVS r. notation (should not change throughout the FUSION_ID)
  - Tally the number of samples observed for each histology
  - Tally the number of samples observed for each site
- Extract the transcript IDs from the HGVS notation and look up the associated gene symbols
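The aggregation steps above can be sketched in Python as shown below. This is a simplified illustration under several assumptions, not Nirvana's code: the record keys (`fusion_id`, `sample_id`, `hgvs`, `histology`, `site`, `pubmed_id`) and the function names are hypothetical, samples are counted as distinct sample IDs, and the gene symbol lookup is left out because it needs a transcript-to-gene mapping that is not described here.

```python
import re
from collections import defaultdict

# Assumption: transcript IDs are Ensembl IDs with an optional version suffix.
TRANSCRIPT_ID = re.compile(r"ENST\d+(?:\.\d+)?")


def count_samples(rows, key):
    """Count the distinct sample IDs observed for each value of `key`."""
    samples = defaultdict(set)
    for row in rows:
        samples[row[key]].add(row["sample_id"])
    return {name: len(ids) for name, ids in samples.items()}


def aggregate_fusions(records):
    """Group records by fusion ID and summarize each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["fusion_id"]].append(record)

    fusions = {}
    for fusion_id, rows in groups.items():
        # The HGVS notation should be identical for all rows of one fusion ID.
        hgvs = rows[0]["hgvs"]
        fusions[fusion_id] = {
            "numSamples": len({row["sample_id"] for row in rows}),
            "hgvsr": hgvs,
            "transcriptIds": TRANSCRIPT_ID.findall(hgvs),
            "histologies": count_samples(rows, "histology"),
            "sites": count_samples(rows, "site"),
            "pubMedIds": sorted({row["pubmed_id"] for row in rows}),
        }
    return fusions
```

Counting distinct sample IDs rather than rows avoids double counting if a sample appears on more than one row for the same fusion; whether that situation occurs in the COSMIC export is not stated here.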
Fixing the HGVS RNA Notation
ENST00000360863.6(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452
There are some issues with the HGVS RNA notation:
- The two transcripts should be linked by a double colon ::.
- For coding transcripts, HGVS numbering should use CDS coordinates. Right now COSMIC is using cDNA coordinates for all their fusions.
- If only the breakpoint is truly known, the recommendation is to use question marks (?)
We chose to only update the linkage between each transcript using double colons ::. While we could have recalculated the HGVS notation using the supplied breakpoints, we chose not to because the resulting notation would be quite different from the original material. This would potentially lead to some confusion.
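The linkage fix can be illustrated with a short Python sketch. It assumes that the join between transcripts is always an underscore immediately followed by an Ensembl transcript ID (`ENST...`); the function name `fix_hgvs_linkage` is hypothetical and this is not necessarily how Nirvana performs the replacement.

```python
import re

# An underscore that is directly followed by an Ensembl transcript ID marks
# the join between two transcripts. Underscores inside ranges such as
# r.1_3555 are followed by digits, so they are left untouched.
TRANSCRIPT_JOIN = re.compile(r"_(?=ENST\d)")


def fix_hgvs_linkage(hgvs):
    """Replace the underscore between transcripts with a double colon."""
    return TRANSCRIPT_JOIN.sub("::", hgvs)


original = "ENST00000360863.6(RGS22):r.1_3555_ENST00000369518.1(SYCP1):r.2100_3452"
print(fix_hgvs_linkage(original))
# ENST00000360863.6(RGS22):r.1_3555::ENST00000369518.1(SYCP1):r.2100_3452
```

Only the separator changes; the coordinates are copied through unmodified, which matches the decision to stay close to the original material.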
Aggregating Histologies
For histologies we want to capture the most specific description available. In the example above, we saw that the primary histology was carcinoma, but the subtype was ductal carcinoma. In this case we would use the subtype for the annotation.
COSMIC uses NS to show that a value is empty. If the subtype is NS, we will use the primary histology instead.
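As a small Python sketch of this rule (the function name `most_specific_histology` is hypothetical, and the inputs are assumed to have already had underscores replaced with spaces):

```python
NOT_SPECIFIED = "NS"  # COSMIC's marker for an empty value


def most_specific_histology(primary, subtype):
    """Prefer the histology subtype; fall back to the primary histology."""
    if subtype == NOT_SPECIFIED:
        return primary
    return subtype
```

With the example row above, `most_specific_histology("carcinoma", "ductal carcinoma")` returns `"ductal carcinoma"`, while a subtype of `NS` falls back to `"carcinoma"`.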
Aggregating Sites
For sites, we observe that the subtype provides additional description but is still dependent on the primary site value. For example, the primary site might be skin, but the subtype is foot. Therefore, we will combine the values in the following manner: skin (foot).
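A minimal Python sketch of the site combination follows. The function name `describe_site` is hypothetical. The handling of an `NS` subtype (returning the primary site alone) is an assumption, inferred from the JSON output, where `breast` appears without a parenthesized subtype.

```python
NOT_SPECIFIED = "NS"  # COSMIC's marker for an empty value


def describe_site(primary, subtype):
    """Combine a primary site and its subtype as 'primary (subtype)'."""
    if subtype == NOT_SPECIFIED:
        # Assumption: with no subtype, the primary site is used on its own.
        return primary
    return f"{primary} ({subtype})"
```

For example, `describe_site("skin", "foot")` returns `"skin (foot)"`.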
Known Issues
There are some issues with the HGVS RNA notation:
- The two transcripts should be linked by a double colon ::. We fixed this aspect in Nirvana.
- For coding transcripts, HGVS numbering should use CDS coordinates. Right now COSMIC is using cDNA coordinates for all their fusions.
Download URL
- https://cancer.sanger.ac.uk/cosmic/file_download/GRCh37/cosmic/v94/CosmicFusionExport.tsv.gz
- https://cancer.sanger.ac.uk/cosmic/file_download/GRCh38/cosmic/v94/CosmicFusionExport.tsv.gz
JSON Output
   "cosmicGeneFusions":[
      {
         "id":"COSF881",
         "numSamples":6,
         "geneSymbols":[
            "MYB",
            "NFIB"
         ],
         "hgvsr":"ENST00000341911.5(MYB):r.1_2368::ENST00000397581.2(NFIB):r.2592_3318",
         "histologies":[
            {
               "name":"adenoid cystic carcinoma",
               "numSamples":6
            }
         ],
         "sites":[
            {
               "name":"salivary gland (submandibular)",
               "numSamples":1
            },
            {
               "name":"salivary gland (parotid)",
               "numSamples":1
            },
            {
               "name":"salivary gland (nasal cavity)",
               "numSamples":1
            },
            {
               "name":"breast",
               "numSamples":3
            }
         ],
         "pubMedIds":[
            19841262
         ]
      }
   ]
| Field | Type | Notes | 
|---|---|---|
| id | string | COSMIC fusion ID | 
| numSamples | int | |
| geneSymbols | string array | 5' gene & 3' gene | 
| hgvsr | string | HGVS RNA translocation fusion notation | 
| histologies | count array | phenotypic descriptions | 
| sites | count array | tissue types | 
| pubMedIds | int array | PubMed IDs | 
Count
| Field | Type | Notes | 
|---|---|---|
| name | string | description | 
| numSamples | int | |
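To show how the fields in the tables above fit together, here is a hedged Python sketch that reads the `cosmicGeneFusions` array from a JSON string. It assumes the array is available as a key of an enclosing JSON object; where that object sits within Nirvana's full output is not covered here, and the function name `summarize_fusions` is hypothetical.

```python
import json


def summarize_fusions(json_text):
    """Return one summary dict per entry in the cosmicGeneFusions array."""
    annotation = json.loads(json_text)
    summaries = []
    for fusion in annotation.get("cosmicGeneFusions", []):
        # Each histology/site entry is a count object: name + numSamples.
        site_counts = {
            site["name"]: site["numSamples"] for site in fusion.get("sites", [])
        }
        summaries.append(
            {
                "id": fusion["id"],
                "genes": tuple(fusion["geneSymbols"]),  # 5' gene, 3' gene
                "numSamples": fusion["numSamples"],
                "sites": site_counts,
            }
        )
    return summaries
```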