Version: 3.22

COSMIC

Overview

COSMIC, the Catalogue of Somatic Mutations in Cancer, is the world's largest source of expert manually curated somatic mutation information relating to human cancers.

Publication

John G Tate, Sally Bamford, Harry C Jubb, Zbyslaw Sondka, David M Beare, Nidhi Bindal, Harry Boutselakis, Charlotte G Cole, Celestino Creatore, Elisabeth Dawson, Peter Fish, Bhavana Harsha, Charlie Hathaway, Steve C Jupe, Chai Yin Kok, Kate Noble, Laura Ponting, Christopher C Ramshaw, Claire E Rye, Helen E Speedy, Ray Stefancsik, Sam L Thompson, Shicai Wang, Sari Ward, Peter J Campbell, Simon A Forbes. (2019) COSMIC: the Catalogue Of Somatic Mutations In Cancer, Nucleic Acids Research, Volume 47, Issue D1

Professional data source

This is a Professional data source and is not available freely. Please contact annotation_support@illumina.com if you would like to obtain it.

Small Variants

Our main COSMIC deliverable provides annotations for both coding and non-coding variants throughout the genome. As of COSMIC v96, this includes 28.7M variants spanning the human genome. Illumina Connected Annotations currently parses four files to extract the relevant content:

CosmicCodingMuts.vcf.gz
CosmicNonCodingVariants.vcf.gz
CosmicMutantExport.tsv.gz
CosmicNCV.tsv.gz

VCF extraction

Example

#CHROM  POS ID  REF ALT QUAL  FILTER  INFO
1 65797 COSV58737189  T C . . GENE=OR4F5_ENST00000641515;STRAND=+;LEGACY_ID=COSN23957695;CDS=c.9+224T>C;AA=p.?;HGVSC=ENST00000641515.2:c.9+224T>C;HGVSG=1:g.65797T>C;CNT=1

Parsing

From the VCF files, we're mainly interested in the following columns:

CHROM
POS
ID
REF
ALT
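
As a minimal sketch of this step (not the actual Illumina Connected Annotations parser), the five columns can be pulled from a tab-delimited VCF data line as shown here. The function name `parse_vcf_line` and the layout of the returned dictionary are illustrative assumptions.

```python
def parse_vcf_line(line):
    """Return the five columns of interest from one VCF data line.

    Hypothetical helper for illustration; header lines (starting with '#')
    yield None.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt = fields[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt,
    }


# Shortened version of the example record above (INFO field truncated).
example = "1\t65797\tCOSV58737189\tT\tC\t.\t.\tGENE=OR4F5_ENST00000641515;CNT=1"
record = parse_vcf_line(example)
```

The remaining columns (QUAL, FILTER, INFO) are ignored in this sketch because the variant details used downstream come from the TSV files.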

TSV extraction

Example

Gene name Accession Number  Gene CDS length HGNC ID Sample name ID_sample ID_tumour Primary site  Site subtype 1  Site subtype 2  Site subtype 3  Primary histology Histology subtype 1 Histology subtype 2 Histology subtype 3 Genome-wide screen  GENOMIC_MUTATION_ID LEGACY_MUTATION_ID  MUTATION_ID Mutation CDS  Mutation AA Mutation Description  Mutation zygosity LOH GRCh  Mutation genome position  Mutation strand Resistance Mutation Mutation somatic status Pubmed_PMID ID_STUDY  Sample Type Tumour origin Age HGVSP HGVSC HGVSG
MCF2L_ENST00000375604 ENST00000375604.6 3372  14576 RK091_C01 1918867 1806188 liver NS  NS  NS  carcinoma NS  NS  NS  y COSV65049364  COSN1601909 113108365 c.73+3096A>G  p.? Unknown het   38  13:113005079-113005079  + - Variant of unknown origin   322 fresh/frozen - NOS  primary     ENST00000375604.6:c.73+3096A>G  13:g.113005079A>G

Parsing

From the TSV files, we're mainly interested in the following columns:

GENOMIC_MUTATION_ID
ID_sample
Primary site
Site subtype 1
Primary histology
Histology subtype 1
Pubmed_PMID
Resistance Mutation
Mutation somatic status
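
One way to read only these columns is sketched below with Python's standard `csv` module. The helper name `read_cosmic_tsv` and the in-memory sample are assumptions made for illustration; the sample keeps a single extra column (`Gene name`) to show that unlisted columns are dropped.

```python
import csv
import io

# Columns named in the list above.
COLUMNS = [
    "GENOMIC_MUTATION_ID", "ID_sample", "Primary site", "Site subtype 1",
    "Primary histology", "Histology subtype 1", "Pubmed_PMID",
    "Resistance Mutation", "Mutation somatic status",
]


def read_cosmic_tsv(handle):
    """Yield a dict holding only the columns of interest for each row."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {name: row[name] for name in COLUMNS}


# Reduced sample based on the example row above.
sample = io.StringIO(
    "Gene name\t" + "\t".join(COLUMNS) + "\n"
    "MCF2L_ENST00000375604\tCOSV65049364\t1918867\tliver\tNS\tcarcinoma\tNS"
    "\t\t-\tVariant of unknown origin\n"
)
rows = list(read_cosmic_tsv(sample))
```

For the compressed files, the same function could be given a handle from `gzip.open(path, "rt")`.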

info

For all histologies and sites, we replace underscores with spaces. For example, salivary_gland becomes salivary gland.

Parsing

To aggregate the data in Illumina Connected Annotations, we perform the following:

Parse the coding and non-coding TSV files to retrieve the histologies, sites, PubMed IDs, somatic status, and resistance mutation status. Histologies and sites are tracked with respect to sample IDs.
Parse the coding and non-coding VCF files to retrieve the genomic variant for each entry
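
The sample-based tracking in the first step can be sketched as follows. This is an illustration only: it assumes that a sample reported more than once for the same variant and histology is counted once, and the identifiers (`COSV1`, `s1`, and so on) are placeholders rather than real COSMIC data.

```python
from collections import defaultdict


def tally_samples(rows):
    """Map mutation ID -> histology name -> number of distinct samples."""
    samples = defaultdict(lambda: defaultdict(set))
    for mutation_id, sample_id, histology in rows:
        samples[mutation_id][histology].add(sample_id)
    return {
        mutation_id: {name: len(ids) for name, ids in by_name.items()}
        for mutation_id, by_name in samples.items()
    }


rows = [
    ("COSV1", "s1", "carcinoma"),
    ("COSV1", "s1", "carcinoma"),   # same sample listed twice, counted once
    ("COSV1", "s2", "carcinoma"),
    ("COSV1", "s3", "meningioma"),
]
counts = tally_samples(rows)
```

Sites can be tallied the same way by swapping the histology name for the site name.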

Aggregating Histologies & Sites

For sites and histologies, we observe that the subtype provides additional description but is still dependent on the primary site value. For example, the primary site might be skin, but the subtype is foot. Therefore, we will combine the values in the following manner: skin (foot).

COSMIC uses NS to show that a value is empty. If the subtype is NS, we use the primary site or primary histology on its own.
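
The combination rule, together with the underscore replacement mentioned earlier, can be sketched as a small function. The name `describe` is hypothetical, and the sketch does not cover cases the text leaves unspecified, such as a primary value that is itself NS.

```python
def describe(primary, subtype):
    """Combine a primary value with its first subtype, e.g. 'skin (foot)'."""
    primary = primary.replace("_", " ")
    subtype = subtype.replace("_", " ")
    if subtype == "NS":
        # No subtype recorded: fall back to the primary value alone.
        return primary
    return f"{primary} ({subtype})"


examples = [
    describe("skin", "foot"),
    describe("carcinoma", "NS"),
    describe("salivary_gland", "parotid"),
]
```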

Download URL

GRCh37

GRCh38

JSON Output

{
   "id":"COSV58272668",
   "numSamples":8,
   "refAllele":"-",
   "altAllele":"CCT",
   "histologies":[
      {
         "name":"carcinoma (serous carcinoma)",
         "numSamples":2
      },
      {
         "name":"meningioma (fibroblastic)",
         "numSamples":1
      },
      {
         "name":"carcinoma",
         "numSamples":1
      },
      {
         "name":"carcinoma (squamous cell carcinoma)",
         "numSamples":1
      },
      {
         "name":"meningioma (transitional)",
         "numSamples":1
      },
      {
         "name":"carcinoma (adenocarcinoma)",
         "numSamples":1
      },
      {
         "name":"other (neoplasm)",
         "numSamples":1
      }
   ],
   "sites":[
      {
         "name":"ovary",
         "numSamples":2
      },
      {
         "name":"meninges",
         "numSamples":2
      },
      {
         "name":"thyroid",
         "numSamples":2
      },
      {
         "name":"cervix",
         "numSamples":1
      },
      {
         "name":"large intestine (colon)",
         "numSamples":1
      }
   ],
   "pubMedIds":[
      25738363,
      27548314
   ],
   "confirmedSomatic":true,
   "drugResistance":true, /* not in this particular COSMIC variant */
   "isAlleleSpecific":true
}

Field	Type	Notes
id	string	COSMIC Genomic Mutation ID
numSamples	int
refAllele	string
altAllele	string
histologies	count array	phenotypic descriptions
sites	count array	tissue types
pubMedIds	int array	PubMed IDs
confirmedSomatic	bool	true when the variant is a confirmed somatic variant
drugResistance	bool	true when the variant has been associated with drug resistance
isAlleleSpecific	bool

Count

Field	Type	Notes
name	string	description
numSamples	int

Gene Fusions

Gene fusions are manually curated from peer-reviewed publications by expert COSMIC curators. A comprehensive literature curation is completed for each fusion pair when it is released in the database. Currently, COSMIC includes information on fusions involved in solid tumours and leukaemias.

TSV extraction

Example

SAMPLE_ID SAMPLE_NAME PRIMARY_SITE  SITE_SUBTYPE_1  SITE_SUBTYPE_2  SITE_SUBTYPE_3  PRIMARY_HISTOLOGY HISTOLOGY_SUBTYPE_1 HISTOLOGY_SUBTYPE_2 HISTOLOGY_SUBTYPE_3 FUSION_ID TRANSLOCATION_NAME  5'_CHROMOSOME 5'_STRAND 5'_GENE_ID  5'_GENE_NAME  5'_LAST_OBSERVED_EXON 5'_GENOME_START_FROM  5'_GENOME_START_TO  5'_GENOME_STOP_FROM 5'_GENOME_STOP_TO 3'_CHROMOSOME 3'_STRAND 3'_GENE_ID  3'_GENE_NAME  3'_FIRST_OBSERVED_EXON  3'_GENOME_START_FROM  3'_GENOME_START_TO  3'_GENOME_STOP_FROM 3'_GENOME_STOP_TO FUSION_TYPE PUBMED_PMID
749711  HCC1187 breast  NS  NS  NS  carcinoma ductal_carcinoma  NS  NS  665 ENST00000360863.10(RGS22):r.1_3555::ENST00000369518.1(SYCP1):r.2100_3452  8 - 197199  RGS22 22  99981937  99981937  100106116 100106116 1 + 212470  SYCP1_ENST00000369518 24  114944339 114944339 114995367 114995367 Inferred Breakpoint 20033038

Parsing

From the TSV file, we're mainly interested in the following columns:

SAMPLE_ID
PRIMARY_SITE
PRIMARY_HISTOLOGY
HISTOLOGY_SUBTYPE_1
FUSION_ID
TRANSLOCATION_NAME
PUBMED_PMID

info

For all histologies and sites, we replace underscores with spaces. For example, salivary_gland becomes salivary gland.

Parsing

To create the gene fusion entries in Illumina Connected Annotations, we perform the following on the rows in the TSV file:

Group all entries by FUSION_ID
Using all the entries related to this FUSION_ID:
- Collect all the PubMed IDs
- Tally the number of observed sample IDs
- Take the HGVS r. notation (it should be identical for every entry with this FUSION_ID)
- Tally the number of samples observed for each histology
- Tally the number of samples observed for each site
Extract the transcript IDs from the HGVS notation and look up the associated gene symbols
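
The grouping steps above can be sketched as follows. This is an illustration under stated assumptions, not the production code: the function name, the output keys, and the transcript ID regular expression are hypothetical, and the gene symbol lookup is left out because it needs a transcript-to-gene mapping that is not described here. Only the site tally is shown; the histology tally follows the same pattern.

```python
import re
from collections import defaultdict

# Matches a versioned Ensembl transcript ID such as ENST00000341911.5.
TRANSCRIPT_ID = re.compile(r"ENST\d+\.\d+")


def aggregate_fusions(rows):
    """Group row dicts by FUSION_ID and summarise each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["FUSION_ID"]].append(row)

    fusions = {}
    for fusion_id, entries in groups.items():
        # The r. notation is expected to be the same for every entry.
        hgvsr = entries[0]["TRANSLOCATION_NAME"]
        site_samples = defaultdict(set)
        for entry in entries:
            site_samples[entry["PRIMARY_SITE"]].add(entry["SAMPLE_ID"])
        fusions[fusion_id] = {
            "numSamples": len({e["SAMPLE_ID"] for e in entries}),
            "hgvsr": hgvsr,
            "transcriptIds": TRANSCRIPT_ID.findall(hgvsr),
            "pubMedIds": sorted(
                {int(e["PUBMED_PMID"]) for e in entries if e["PUBMED_PMID"]}
            ),
            "sites": {site: len(ids) for site, ids in site_samples.items()},
        }
    return fusions


HGVSR = "ENST00000360863.10(RGS22):r.1_3555::ENST00000369518.1(SYCP1):r.2100_3452"
rows = [
    {"SAMPLE_ID": "749711", "FUSION_ID": "665", "TRANSLOCATION_NAME": HGVSR,
     "PRIMARY_SITE": "breast", "PUBMED_PMID": "20033038"},
    # Second row is made up to show the tally; it is not real COSMIC data.
    {"SAMPLE_ID": "000000", "FUSION_ID": "665", "TRANSLOCATION_NAME": HGVSR,
     "PRIMARY_SITE": "breast", "PUBMED_PMID": "20033038"},
]
fusions = aggregate_fusions(rows)
```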

Aggregating Histologies & Sites

Aggregating Histologies & Sites was previously described in the small variants section.

Known Issues

There are some issues with the HGVS RNA notation:

For coding transcripts, HGVS numbering should use CDS coordinates. COSMIC currently uses cDNA coordinates for all of its fusions.

Download URL

GRCh37

CosmicFusionExport.tsv.gz

GRCh38

CosmicFusionExport.tsv.gz

JSON Output

   "cosmicGeneFusions":[
      {
         "id":"COSF881",
         "numSamples":6,
         "geneSymbols":[
            "MYB",
            "NFIB"
         ],
         "hgvsr":"ENST00000341911.5(MYB):r.1_2368::ENST00000397581.2(NFIB):r.2592_3318",
         "histologies":[
            {
               "name":"adenoid cystic carcinoma",
               "numSamples":6
            }
         ],
         "sites":[
            {
               "name":"salivary gland (submandibular)",
               "numSamples":1
            },
            {
               "name":"salivary gland (parotid)",
               "numSamples":1
            },
            {
               "name":"salivary gland (nasal cavity)",
               "numSamples":1
            },
            {
               "name":"breast",
               "numSamples":3
            }
         ],
         "pubMedIds":[
            19841262
         ]
      }
   ]

Field	Type	Notes
id	string	COSMIC fusion ID
numSamples	int
geneSymbols	string array	5' gene & 3' gene
hgvsr	string	HGVS RNA translocation fusion notation
histologies	count array	phenotypic descriptions
sites	count array	tissue types
pubMedIds	int array	PubMed IDs

Count

Field	Type	Notes
name	string	description
numSamples	int

Cancer Gene Census

TSV Extraction

Example

GENE_NAME       CELL_TYPE       PUBMED_PMID     HALLMARK        IMPACT  DESCRIPTION     CELL_LINE
PRDM16      18496560    role in cancer  oncogene    oncogene
PRDM16      16015645    role in cancer  fusion  fusion

Parsing

To extract information about tumor suppressor genes (TSGs) and oncogenes, we filter the data on the "role in cancer" attribute. Rows with the value "TSG" identify tumor suppressor genes, and rows with the value "oncogene" identify oncogenes. Some genes carry both the "TSG" and "oncogene" roles, which indicates that they can act as either.

Columns

Only the following columns are needed to gather the roles in cancer:

GENE_NAME
IMPACT
HALLMARK

Possible Roles in Cancer

While parsing, only the following roles in cancer were found:

fusion
TSG
oncogene
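
A minimal sketch of this filtering is shown below. It assumes, based on the example rows above, that the "role in cancer" marker sits in the HALLMARK column and that the role itself sits in the IMPACT column. The function name is hypothetical, and the third sample row is a placeholder added only to show a row being skipped.

```python
import csv
import io
from collections import defaultdict


def roles_in_cancer(handle):
    """Map each gene name to the list of distinct roles found for it."""
    roles = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["HALLMARK"] != "role in cancer":
            continue  # other hallmark rows are not needed here
        gene, role = row["GENE_NAME"], row["IMPACT"]
        if role not in roles[gene]:
            roles[gene].append(role)
    return dict(roles)


sample = io.StringIO(
    "GENE_NAME\tCELL_TYPE\tPUBMED_PMID\tHALLMARK\tIMPACT\tDESCRIPTION\tCELL_LINE\n"
    "PRDM16\t\t18496560\trole in cancer\toncogene\toncogene\t\n"
    "PRDM16\t\t16015645\trole in cancer\tfusion\tfusion\t\n"
    "PRDM16\t\t0\tplaceholder hallmark\tplaceholder\tplaceholder\t\n"
)
gene_roles = roles_in_cancer(sample)
```

The resulting list for PRDM16 lines up with the `roleInCancer` array in the JSON output for that gene.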

Parsing Stats

The file contained the following number of instances for each role type:

Role in cancer	Total Instances
fusion	149
TSG	195
oncogene	181
Total	525

Known Issues

None

Download URL

Cancer_Gene_Census_Hallmarks_Of_Cancer.tsv.gz

JSON output

{
  "name": "PRDM16",
  "hgncId": 14000,
  "ncbiGeneId": "63976",
  "ensemblGeneId": "ENSG00000142611",
  "cosmic": {
    "roleInCancer": [
      "oncogene",
      "fusion"
    ]
  }
}

Field	Type	Notes
roleInCancer	string array	Possible roles in cancer
