Version: 3.19 (unreleased)

COSMIC

Overview

COSMIC, the Catalogue of Somatic Mutations in Cancer, is the world's largest source of expert manually curated somatic mutation information relating to human cancers.

Publication

John G Tate, Sally Bamford, Harry C Jubb, Zbyslaw Sondka, David M Beare, Nidhi Bindal, Harry Boutselakis, Charlotte G Cole, Celestino Creatore, Elisabeth Dawson, Peter Fish, Bhavana Harsha, Charlie Hathaway, Steve C Jupe, Chai Yin Kok, Kate Noble, Laura Ponting, Christopher C Ramshaw, Claire E Rye, Helen E Speedy, Ray Stefancsik, Sam L Thompson, Shicai Wang, Sari Ward, Peter J Campbell, Simon A Forbes. (2019) COSMIC: the Catalogue Of Somatic Mutations In Cancer, Nucleic Acids Research, Volume 47, Issue D1

Licensed Content

Commercial companies are required to acquire a license from COSMIC. At the moment, this means that our COSMIC content is only available in Illumina's products and services, not in the open source distribution.

Since many of you are academic users, we will enable a COSMIC login in our downloader later this year that will allow academic and commercial organizations (with a license) to access our COSMIC data sources.

Small Variants

Our main COSMIC deliverable provides annotations for both coding and non-coding variants throughout the genome. As of COSMIC v96, this includes 28.7M variants spanning the human genome. Nirvana currently parses four files to extract the relevant content:

  • CosmicCodingMuts.vcf.gz
  • CosmicNonCodingVariants.vcf.gz
  • CosmicMutantExport.tsv.gz
  • CosmicNCV.tsv.gz

VCF extraction

Example

#CHROM  POS ID  REF ALT QUAL  FILTER  INFO
1 65797 COSV58737189 T C . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN23957695;CDS=c.9+224T>C;AA=p.?;HGVSC=ENST00000641515.2:c.9+224T>C;HGVSG=1:g.65797T>C;CNT=1

Parsing

From the VCF files, we're mainly interested in the following columns:

  • CHROM
  • POS
  • ID
  • REF
  • ALT
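
As a rough illustration of the column extraction above, the following Python sketch pulls the five columns of interest out of a single VCF data line. The `VcfRecord` type and `parse_vcf_line` function are hypothetical names chosen for this example and are not part of Nirvana; the sketch also assumes tab-delimited input and a single ALT allele per line.

```python
from typing import NamedTuple, Optional


class VcfRecord(NamedTuple):
    """The five VCF columns of interest (hypothetical container)."""
    chrom: str
    pos: int
    cosmic_id: str
    ref: str
    alt: str


def parse_vcf_line(line: str) -> Optional[VcfRecord]:
    """Return the columns of interest, or None for header and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    # VCF columns are tab-delimited; only the first five are needed here.
    chrom, pos, cosmic_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return VcfRecord(chrom, int(pos), cosmic_id, ref, alt)


# Shortened version of the example line shown above.
example = "1\t65797\tCOSV58737189\tT\tC\t.\t.\tGENE=OR4F5_ENST00000641515;CNT=1"
record = parse_vcf_line(example)
print(record)
```

The INFO column is ignored here because the remaining annotations are taken from the TSV files.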

TSV extraction

Example

Gene name Accession Number  Gene CDS length HGNC ID Sample name ID_sample ID_tumour Primary site  Site subtype 1  Site subtype 2  Site subtype 3  Primary histology Histology subtype 1 Histology subtype 2 Histology subtype 3 Genome-wide screen  GENOMIC_MUTATION_ID LEGACY_MUTATION_ID  MUTATION_ID Mutation CDS  Mutation AA Mutation Description  Mutation zygosity LOH GRCh  Mutation genome position  Mutation strand Resistance Mutation Mutation somatic status Pubmed_PMID ID_STUDY  Sample Type Tumour origin Age HGVSP HGVSC HGVSG
MCF2L_ENST00000375604 ENST00000375604.6 3372 14576 RK091_C01 1918867 1806188 liver NS NS NS carcinoma NS NS NS y COSV65049364 COSN1601909 113108365 c.73+3096A>G p.? Unknown het 38 13:113005079-113005079 + - Variant of unknown origin 322 fresh/frozen - NOS primary ENST00000375604.6:c.73+3096A>G 13:g.113005079A>G

Parsing

From the TSV file, we're mainly interested in the following columns:

  • GENOMIC_MUTATION_ID
  • ID_sample
  • Primary site
  • Site subtype 1
  • Primary histology
  • Histology subtype 1
  • Pubmed_PMID
  • Resistance Mutation
  • Mutation somatic status

Info: For all histologies and sites, we replace underscores with spaces. For example, salivary_gland becomes salivary gland.

Parsing

To aggregate the data in Nirvana, we perform the following:

  • Parse the coding and non-coding TSV files to retrieve the histologies, sites, PubMed IDs, somatic status, and resistance mutation status. Histologies and sites are tracked with respect to sample IDs.
  • Parse the coding and non-coding VCF files to retrieve the genomic variant for each entry
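
The TSV step above can be sketched in Python as follows. This is a minimal illustration, not Nirvana's implementation: the function name `aggregate_tsv` and the shape of the returned dictionaries are assumptions, the sample data contains only the columns of interest, and the second data row is invented for the example. Somatic status and resistance values are kept as raw strings rather than interpreted.

```python
import csv
import io
from collections import defaultdict

# Only the columns of interest are included; the second data row is hypothetical.
SAMPLE_TSV = (
    "GENOMIC_MUTATION_ID\tID_sample\tPrimary site\tSite subtype 1\t"
    "Primary histology\tHistology subtype 1\tPubmed_PMID\t"
    "Resistance Mutation\tMutation somatic status\n"
    "COSV65049364\t1918867\tliver\tNS\tcarcinoma\tNS\t\t-\tVariant of unknown origin\n"
    "COSV65049364\t2000001\tliver\tNS\tcarcinoma\tNS\t25738363\t-\tVariant of unknown origin\n"
)


def new_entry():
    return {
        "sites": {},          # sample ID -> (primary site, site subtype 1)
        "histologies": {},    # sample ID -> (primary histology, histology subtype 1)
        "pubmed_ids": set(),
        "somatic_statuses": set(),
        "resistance_flags": set(),
    }


def aggregate_tsv(handle):
    """Group TSV rows by GENOMIC_MUTATION_ID, tracking sites and histologies per sample."""
    entries = defaultdict(new_entry)
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        entry = entries[row["GENOMIC_MUTATION_ID"]]
        sample_id = row["ID_sample"]
        entry["sites"][sample_id] = (row["Primary site"], row["Site subtype 1"])
        entry["histologies"][sample_id] = (
            row["Primary histology"],
            row["Histology subtype 1"],
        )
        if row["Pubmed_PMID"].isdigit():  # the column is often empty
            entry["pubmed_ids"].add(int(row["Pubmed_PMID"]))
        entry["somatic_statuses"].add(row["Mutation somatic status"])
        entry["resistance_flags"].add(row["Resistance Mutation"])
    return dict(entries)


entries = aggregate_tsv(io.StringIO(SAMPLE_TSV))
print(entries["COSV65049364"]["pubmed_ids"])
```

Keying sites and histologies by sample ID means a sample that appears in several rows is only counted once when the values are tallied later.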

Aggregating Histologies & Sites

For sites and histologies, we observe that the subtype provides additional description but is still dependent on the primary site value. For example, the primary site might be skin, but the subtype is foot. Therefore, we will combine the values in the following manner: skin (foot).

COSMIC uses NS to show that a value is empty. If the subtype is NS, we use the primary site or primary histology on its own.
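
The combination rule can be illustrated with a short Python sketch. The helper names `format_label` and `tally` are hypothetical, and the sketch assumes that the underscore replacement described earlier is applied to both the primary value and the subtype.

```python
from collections import Counter


def format_label(primary: str, subtype: str) -> str:
    """Combine a primary site/histology with its subtype, e.g. 'skin (foot)'."""
    primary = primary.replace("_", " ")
    subtype = subtype.replace("_", " ")
    if subtype == "NS":  # NS marks an empty value, so only the primary is kept
        return primary
    return f"{primary} ({subtype})"


def tally(per_sample: dict) -> Counter:
    """Count samples per label; per_sample maps sample ID -> (primary, subtype)."""
    return Counter(format_label(p, s) for p, s in per_sample.values())


print(format_label("skin", "foot"))
print(format_label("salivary_gland", "NS"))
print(tally({"s1": ("ovary", "NS"), "s2": ("ovary", "NS"), "s3": ("large_intestine", "colon")}))
```

The resulting counts correspond to the name and numSamples pairs reported in the JSON output.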

Download URL

GRCh37

GRCh38

JSON Output

{
"id":"COSV58272668",
"numSamples":8,
"refAllele":"-",
"altAllele":"CCT",
"histologies":[
{
"name":"carcinoma (serous carcinoma)",
"numSamples":2
},
{
"name":"meningioma (fibroblastic)",
"numSamples":1
},
{
"name":"carcinoma",
"numSamples":1
},
{
"name":"carcinoma (squamous cell carcinoma)",
"numSamples":1
},
{
"name":"meningioma (transitional)",
"numSamples":1
},
{
"name":"carcinoma (adenocarcinoma)",
"numSamples":1
},
{
"name":"other (neoplasm)",
"numSamples":1
}
],
"sites":[
{
"name":"ovary",
"numSamples":2
},
{
"name":"meninges",
"numSamples":2
},
{
"name":"thyroid",
"numSamples":2
},
{
"name":"cervix",
"numSamples":1
},
{
"name":"large intestine (colon)",
"numSamples":1
}
],
"pubMedIds":[
25738363,
27548314
],
"confirmedSomatic":true,
"drugResistance":true, /* not in this particular COSMIC variant */
"isAlleleSpecific":true
}

Field             Type         Notes
id                string       COSMIC Genomic Mutation ID
numSamples        int
refAllele         string
altAllele         string
histologies       count array  phenotypic descriptions
sites             count array  tissue types
pubMedIds         int array    PubMed IDs
confirmedSomatic  bool         true when the variant is a confirmed somatic variant
drugResistance    bool         true when the variant has been associated with drug resistance

Count

Field       Type    Notes
name        string  description
numSamples  int

Gene Fusions

Gene fusions are manually curated from peer reviewed publications by expert COSMIC curators. A comprehensive literature curation is completed for each fusion pair when it is released in the database. Currently COSMIC includes information on fusions involved in solid tumours and leukaemias.

TSV extraction

Example

SAMPLE_ID SAMPLE_NAME PRIMARY_SITE  SITE_SUBTYPE_1  SITE_SUBTYPE_2  SITE_SUBTYPE_3  PRIMARY_HISTOLOGY HISTOLOGY_SUBTYPE_1 HISTOLOGY_SUBTYPE_2 HISTOLOGY_SUBTYPE_3 FUSION_ID TRANSLOCATION_NAME  5'_CHROMOSOME 5'_STRAND 5'_GENE_ID  5'_GENE_NAME  5'_LAST_OBSERVED_EXON 5'_GENOME_START_FROM  5'_GENOME_START_TO  5'_GENOME_STOP_FROM 5'_GENOME_STOP_TO 3'_CHROMOSOME 3'_STRAND 3'_GENE_ID  3'_GENE_NAME  3'_FIRST_OBSERVED_EXON  3'_GENOME_START_FROM  3'_GENOME_START_TO  3'_GENOME_STOP_FROM 3'_GENOME_STOP_TO FUSION_TYPE PUBMED_PMID
749711 HCC1187 breast NS NS NS carcinoma ductal_carcinoma NS NS 665 ENST00000360863.10(RGS22):r.1_3555::ENST00000369518.1(SYCP1):r.2100_3452 8 - 197199 RGS22 22 99981937 99981937 100106116 100106116 1 + 212470 SYCP1_ENST00000369518 24 114944339 114944339 114995367 114995367 Inferred Breakpoint 20033038

Parsing

From the TSV file, we're mainly interested in the following columns:

  • SAMPLE_ID
  • PRIMARY_SITE
  • PRIMARY_HISTOLOGY
  • HISTOLOGY_SUBTYPE_1
  • FUSION_ID
  • TRANSLOCATION_NAME
  • PUBMED_PMID

Info: For all histologies and sites, we replace underscores with spaces. For example, salivary_gland becomes salivary gland.

Parsing

To create the gene fusion entries in Nirvana, we perform the following on the rows in the TSV file:

  • Group all entries by FUSION_ID
  • Using all the entries related to this FUSION_ID:
    • Collect all the PubMed IDs
    • Tally the number of observed sample IDs
    • Grab the HGVS r. notation (should not change throughout the FUSION_ID)
    • Tally the number of samples observed for each histology
    • Tally the number of samples observed for each site
  • Extract the transcript IDs from the HGVS notation and lookup the associated gene symbols
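
The grouping steps above might look like the following Python sketch. It is an illustration under several assumptions rather than Nirvana's code: `build_fusions` and `TRANSCRIPT_RE` are hypothetical names, site and histology labels are simplified to the primary values, the second input row is invented, and the gene symbol lookup is left out because it depends on a transcript-to-gene mapping not shown here.

```python
import re
from collections import defaultdict

# Matches Ensembl transcript IDs that precede a "(GENE)" group in the r. notation.
TRANSCRIPT_RE = re.compile(r"(ENST\d+(?:\.\d+)?)\(")


def build_fusions(rows):
    """Group rows by FUSION_ID and summarise each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["FUSION_ID"]].append(row)

    fusions = {}
    for fusion_id, entries in grouped.items():
        histologies = defaultdict(set)  # label -> sample IDs
        sites = defaultdict(set)
        for e in entries:
            histologies[e["PRIMARY_HISTOLOGY"].replace("_", " ")].add(e["SAMPLE_ID"])
            sites[e["PRIMARY_SITE"].replace("_", " ")].add(e["SAMPLE_ID"])
        hgvsr = entries[0]["TRANSLOCATION_NAME"]  # expected to be constant per fusion
        fusions[fusion_id] = {
            "numSamples": len({e["SAMPLE_ID"] for e in entries}),
            "pubMedIds": sorted(
                {int(e["PUBMED_PMID"]) for e in entries if e["PUBMED_PMID"].isdigit()}
            ),
            "hgvsr": hgvsr,
            "transcriptIds": TRANSCRIPT_RE.findall(hgvsr),
            "histologies": {name: len(ids) for name, ids in histologies.items()},
            "sites": {name: len(ids) for name, ids in sites.items()},
        }
    return fusions


HGVSR = "ENST00000360863.10(RGS22):r.1_3555::ENST00000369518.1(SYCP1):r.2100_3452"
rows = [
    {"SAMPLE_ID": "749711", "PRIMARY_SITE": "breast", "PRIMARY_HISTOLOGY": "carcinoma",
     "FUSION_ID": "665", "TRANSLOCATION_NAME": HGVSR, "PUBMED_PMID": "20033038"},
    # Hypothetical second sample, added only to show the tallies.
    {"SAMPLE_ID": "749712", "PRIMARY_SITE": "breast", "PRIMARY_HISTOLOGY": "carcinoma",
     "FUSION_ID": "665", "TRANSLOCATION_NAME": HGVSR, "PUBMED_PMID": "20033038"},
]
fusions = build_fusions(rows)
print(fusions["665"]["transcriptIds"])
```

Tallying by sets of sample IDs keeps a sample from being counted twice if it appears in more than one row for the same fusion.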

Aggregating Histologies & Sites

Aggregating Histologies & Sites was previously described in the small variants section.

Known Issues

There are some issues with the HGVS RNA notation:

  • For coding transcripts, HGVS numbering should use CDS coordinates. COSMIC currently uses cDNA coordinates for all of its fusions.

Download URL

GRCh37

GRCh38

JSON Output

   "cosmicGeneFusions":[
{
"id":"COSF881",
"numSamples":6,
"geneSymbols":[
"MYB",
"NFIB"
],
"hgvsr":"ENST00000341911.5(MYB):r.1_2368::ENST00000397581.2(NFIB):r.2592_3318",
"histologies":[
{
"name":"adenoid cystic carcinoma",
"numSamples":6
}
],
"sites":[
{
"name":"salivary gland (submandibular)",
"numSamples":1
},
{
"name":"salivary gland (parotid)",
"numSamples":1
},
{
"name":"salivary gland (nasal cavity)",
"numSamples":1
},
{
"name":"breast",
"numSamples":3
}
],
"pubMedIds":[
19841262
]
}
]

Field        Type          Notes
id           string        COSMIC fusion ID
numSamples   int
geneSymbols  string array  5' gene & 3' gene
hgvsr        string        HGVS RNA translocation fusion notation
histologies  count array   phenotypic descriptions
sites        count array   tissue types
pubMedIds    int array     PubMed IDs

Count

Field       Type    Notes
name        string  description
numSamples  int

Cancer Gene Census

TSV Extraction

Example

GENE_NAME       CELL_TYPE       PUBMED_PMID     HALLMARK        IMPACT  DESCRIPTION     CELL_LINE
PRDM16 18496560 role in cancer oncogene oncogene
PRDM16 16015645 role in cancer fusion fusion

Parsing

To extract information about tumor suppressor genes (TSGs) and oncogenes, we filter the data on the "role in cancer" attribute. Rows with the value "TSG" identify tumor suppressor genes, and rows with the value "oncogene" identify oncogenes. Some genes have both "TSG" and "oncogene" as their role, which indicates that they can act as either.

Columns

Only the following columns are needed to gather the required roles in cancer:

  • GENE_NAME
  • IMPACT
  • HALLMARK
Possible Roles in Cancer

While parsing, only the following roles in cancer are found:

  • fusion
  • TSG
  • oncogene
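
The role filtering can be sketched in Python as shown here. The sketch assumes, based on the example rows, that the HALLMARK column carries the "role in cancer" attribute and that the IMPACT column carries the role itself; `roles_in_cancer` and `KNOWN_ROLES` are hypothetical names, and the third input row is invented to show a row being skipped.

```python
from collections import defaultdict

KNOWN_ROLES = {"fusion", "TSG", "oncogene"}


def roles_in_cancer(rows):
    """Collect the roles for each gene from rows flagged as 'role in cancer'."""
    roles = defaultdict(list)
    for row in rows:
        if row["HALLMARK"] != "role in cancer":
            continue  # other hallmark rows are not needed
        role = row["IMPACT"]
        # Keep first-seen order and avoid listing a role twice for one gene.
        if role in KNOWN_ROLES and role not in roles[row["GENE_NAME"]]:
            roles[row["GENE_NAME"]].append(role)
    return dict(roles)


rows = [
    {"GENE_NAME": "PRDM16", "HALLMARK": "role in cancer", "IMPACT": "oncogene"},
    {"GENE_NAME": "PRDM16", "HALLMARK": "role in cancer", "IMPACT": "fusion"},
    # Hypothetical row with a different hallmark, which is ignored.
    {"GENE_NAME": "PRDM16", "HALLMARK": "some other hallmark", "IMPACT": "promotes"},
]
gene_roles = roles_in_cancer(rows)
print(gene_roles)
```

For the two example rows, the collected roles line up with the roleInCancer array in the JSON output for PRDM16.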
Parsing Stats

The file contained the following number of instances for each role type:

Role in cancer  Total Instances
fusion          149
TSG             195
oncogene        181
Total           525

Known Issues

None

Download URL

JSON output

   {
"name": "PRDM16",
"hgncId": 14000,
"ncbiGeneId": "63976",
"ensemblGeneId": "ENSG00000142611",
"cosmic": {
"roleInCancer": [
"oncogene",
"fusion"
]
}
}

Field         Type          Notes
roleInCancer  string array  Possible roles in cancer