FusionCatcher
Overview
FusionCatcher is a well-known tool that searches for novel and known somatic fusion genes, translocations, and chimeras in RNA-seq data. While FusionCatcher itself is not part of Nirvana, we have included a subset of its genomic databases in Nirvana.
Publication
Daniel Nicorici, Mihaela Şatalan, Henrik Edgren, Sara Kangaspeska, Astrid Murumägi, Olli Kallioniemi, Sami Virtanen, Olavi Kilkku. (2014) FusionCatcher – a tool for finding somatic fusion genes in paired-end RNA-sequencing data. bioRxiv 011650
Supported Data Sources
Oncogenes
The following data sources are aggregated and used to populate the isOncogene field in the gene JSON object:
| Description | Reference | Data | FusionCatcher filename |
|---|---|---|---|
| Bushman | | bushmanlab.org | cancer_genes.txt |
| ONGENE | JGG | bioinfo-minzhao.org | oncogenes_more.txt |
| UniProt tumor genes | NAR | uniprot.org | tumor_genes.txt |
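The aggregation described above can be pictured as a set union over the three source files. The following is a minimal sketch, not Nirvana's implementation: `build_oncogene_set` and `is_oncogene` are hypothetical names, and the assignment of gene IDs to files is made up for illustration.

```python
def build_oncogene_set(sources):
    """Union the gene IDs from every oncogene source into one set."""
    oncogenes = set()
    for gene_ids in sources.values():
        oncogenes.update(gene_ids)
    return oncogenes


# Toy file contents keyed by FusionCatcher filename.
# Which ID appears in which file is invented for this example.
sources = {
    "cancer_genes.txt": ["ENSG00000000938", "ENSG00000003402"],
    "oncogenes_more.txt": ["ENSG00000000938", "ENSG00000005469"],
    "tumor_genes.txt": ["ENSG00000005884"],
}
oncogenes = build_oncogene_set(sources)


def is_oncogene(gene_id):
    """A gene is flagged when at least one source lists it."""
    return gene_id in oncogenes


print(is_oncogene("ENSG00000005884"))
```

With a union, a gene listed by several sources is still stored only once, so the flag does not depend on how many sources agree.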
Germline
| Nirvana label | Reference | Data | FusionCatcher filename |
|---|---|---|---|
| 1000 Genomes Project | PLOS ONE | | 1000genomes.txt |
| Healthy (strong support) | | | banned.txt |
| Illumina Body Map 2.0 | | EBI | bodymap2.txt |
| CACG | Genomics | | cacg.txt |
| ConjoinG | PLOS ONE | | conjoing.txt |
| Healthy prefrontal cortex | BMC Medical Genomics | NCBI GEO | cortex.txt |
| Duplicated Genes Database | PLOS ONE | genouest.org | dgd.txt |
| GTEx healthy tissues | | gtexportal.org | gtex.txt |
| Healthy | | | healthy.txt |
| Human Protein Atlas | MCP | EBI | hpa.txt |
| Babiceanu non-cancer tissues | NAR | NAR | non-cancer_tissues.txt |
| non-tumor cell lines | | | non-tumor_cells.txt |
| TumorFusions normal | NAR | NAR | tcga-normal.txt |
Somatic
| Nirvana label | Reference | Data | FusionCatcher filename |
|---|---|---|---|
| Alaei-Mahabadi 18 cancers | PNAS | | 18cancers.txt |
| DepMap CCLE | | depmap.org | ccle.txt |
| CCLE Klijn | Nature Biotechnology | Nature Biotechnology | ccle2.txt |
| CCLE Vellichirammal | Molecular Therapy Nucleic Acids | | ccle3.txt |
| Cancer Genome Project | | COSMIC | cgp.txt |
| ChimerKB 4.0 | NAR | kobic.re.kr | chimerdb4kb.txt |
| ChimerPub 4.0 | NAR | kobic.re.kr | chimerdb4pub.txt |
| ChimerSeq 4.0 | NAR | kobic.re.kr | chimerdb4seq.txt |
| COSMIC | NAR | COSMIC | cosmic.txt |
| Bao gliomas | Genome Research | | gliomas.txt |
| Known | | | known.txt |
| Mitelman DB | ISB-CGC | Google Cloud | mitelman.txt |
| TCGA oesophageal carcinomas | Nature | | oesophagus.txt |
| Bailey pancreatic cancers | Nature | Nature | pancreases.txt |
| PCAWG | Cell | ICGC | pcawg.txt |
| Robinson prostate cancers | Cell | Cell | prostate_cancer.txt |
| TCGA | | cancer.gov | tcga.txt |
| TumorFusions tumor | NAR | NAR | tcga-cancer.txt |
| TCGA Gao | Cell | Cell | tcga2.txt |
| TCGA Vellichirammal | Molecular Therapy Nucleic Acids | | tcga3.txt |
| TICdb | BMC Genomics | unav.edu | ticdb.txt |
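To illustrate how the Nirvana labels in the germline and somatic tables relate to the files, the sketch below looks up which files contain a gene pair and reports the matching labels. It is an assumption-laden example: `collect_sources` is a hypothetical helper, only a few table rows are copied into the label maps, the file contents are toy data, and sorting the labels is a choice made here for determinism.

```python
# Filename -> Nirvana label, copied from a few rows of the tables above.
GERMLINE_LABELS = {
    "1000genomes.txt": "1000 Genomes Project",
    "gtex.txt": "GTEx healthy tissues",
}
SOMATIC_LABELS = {
    "cosmic.txt": "COSMIC",
    "oesophagus.txt": "TCGA oesophageal carcinomas",
}


def collect_sources(pair, pairs_by_file, labels):
    """Return the labels of every file in `labels` that lists `pair`."""
    return sorted(
        labels[filename]
        for filename, pairs in pairs_by_file.items()
        if filename in labels and pair in pairs
    )


# Toy contents: file membership of this pair is invented for the example.
pair = ("ENSG00000006210", "ENSG00000102962")
pairs_by_file = {
    "1000genomes.txt": {pair},
    "gtex.txt": set(),
    "cosmic.txt": {pair},
    "oesophagus.txt": {pair},
}

germline = collect_sources(pair, pairs_by_file, GERMLINE_LABELS)
somatic = collect_sources(pair, pairs_by_file, SOMATIC_LABELS)
print(germline, somatic)
```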
Gene Pair TSV File
Most of the data files in FusionCatcher are two-column TSV files containing the Ensembl gene IDs that are paired together.
Example
Here are the first few lines of the 1000genomes.txt file:
ENSG00000006210 ENSG00000102962
ENSG00000006652 ENSG00000181016
ENSG00000014138 ENSG00000149798
ENSG00000026297 ENSG00000071242
ENSG00000035499 ENSG00000155959
ENSG00000055211 ENSG00000131013
ENSG00000055332 ENSG00000179915
ENSG00000062485 ENSG00000257727
ENSG00000065978 ENSG00000166501
ENSG00000066044 ENSG00000104980
Parsing
In Nirvana, we will only import a gene pair if both Ensembl gene IDs are recognized from either our GRCh37 or GRCh38 cache files.
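The import rule can be sketched as follows. This is an illustration rather than Nirvana's actual parser: `parse_gene_pairs` is a hypothetical function, and `known_gene_ids` is assumed to be the combined set of Ensembl gene IDs from the GRCh37 and GRCh38 caches.

```python
def parse_gene_pairs(lines, known_gene_ids):
    """Keep only the pairs whose two gene IDs are both recognized."""
    pairs = []
    for line in lines:
        fields = line.split()  # splits on tabs (and any other whitespace)
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        first, second = fields
        if first in known_gene_ids and second in known_gene_ids:
            pairs.append((first, second))
    return pairs


lines = [
    "ENSG00000006210\tENSG00000102962",
    "ENSG00000006652\tENSG00000181016",
    "",
]
# Assumed set of recognized IDs; ENSG00000181016 is left out on purpose.
known_gene_ids = {"ENSG00000006210", "ENSG00000102962", "ENSG00000006652"}

pairs = parse_gene_pairs(lines, known_gene_ids)
print(pairs)
```

In this toy input the second pair is dropped because only one of its two IDs is in the recognized set.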
Gene TSV File
Some of the data files are single-column files containing Ensembl gene IDs. This format is commonly used by the oncogene data sources.
Example
Here are the first few lines of the oncogenes_more.txt file:
ENSG00000000938
ENSG00000003402
ENSG00000005469
ENSG00000005884
ENSG00000006128
ENSG00000006453
ENSG00000006468
ENSG00000007350
ENSG00000008294
ENSG00000008952
Parsing
Nirvana only imports a gene if its Ensembl gene ID is recognized by Nirvana.
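A single-column file can be filtered in the same spirit as the gene pair files. The sketch below assumes that rule carries over unchanged; `parse_gene_ids` and `known_gene_ids` are hypothetical names, and the set of recognized IDs is toy data.

```python
def parse_gene_ids(lines, known_gene_ids):
    """Keep only the gene IDs that are recognized."""
    gene_ids = []
    for line in lines:
        gene_id = line.strip()
        if gene_id and gene_id in known_gene_ids:
            gene_ids.append(gene_id)
    return gene_ids


lines = ["ENSG00000000938", "ENSG09000000002", "", "ENSG00000003402"]
# Assumed set of recognized IDs for this example.
known_gene_ids = {"ENSG00000000938", "ENSG00000003402"}

gene_ids = parse_gene_ids(lines, known_gene_ids)
print(gene_ids)
```

An ID such as ENSG09000000002 is not treated specially here; it is dropped simply because it is absent from the recognized set.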
Known Issues
FusionCatcher also creates custom Ensembl genes (e.g. ENSG09000000002) to handle missing Ensembl genes. Nirvana ignores these entries since we only include the gene IDs that are currently recognized by Nirvana.
We suspect that these were originally RefSeq genes; if so, we can support them directly in Nirvana in the future.
Download URL
https://sourceforge.net/projects/fusioncatcher/files/data
JSON Output
"fusionCatcher":[
  {
    "genes":{
      "first":{
        "hgnc":"ETV6",
        "isOncogene":true
      },
      "second":{
        "hgnc":"RUNX1"
      },
      "isParalogPair":true,
      "isPseudogenePair":true,
      "isReadthrough":true
    },
    "germlineSources":[
      "1000 Genomes Project"
    ],
    "somaticSources":[
      "COSMIC",
      "TCGA oesophageal carcinomas"
    ]
  }
]
| Field | Type | Notes |
|---|---|---|
| genes | genes object | 5' gene & 3' gene |
| germlineSources | string array | matches in known germline data sources |
| somaticSources | string array | matches in known somatic data sources |
genes
| Field | Type | Notes |
|---|---|---|
| first | gene object | 5' gene |
| second | gene object | 3' gene |
| isParalogPair | bool | true when the two genes are paralogs of each other |
| isPseudogenePair | bool | true when the two genes are pseudogenes of each other |
| isReadthrough | bool | true when this fusion gene is a readthrough event (both are on the same strand and there are no genes between them) |
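The readthrough condition in the table (same strand, no genes in between) can be pictured with a small sketch. This is not how Nirvana or FusionCatcher computes the flag: the `Gene` record, the `is_readthrough` function, the same-chromosome requirement, and the choice to count an intervening gene on either strand are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    # Hypothetical minimal gene record.
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def is_readthrough(a, b, all_genes):
    """True when a and b share a strand and no other gene lies between them."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return False
    left, right = (a, b) if a.start <= b.start else (b, a)
    gap_start, gap_end = left.end, right.start
    if gap_start >= gap_end:
        return True  # overlapping or adjacent genes leave no gap to check
    for gene in all_genes:
        if gene in (a, b) or gene.chrom != a.chrom:
            continue
        # Assumption: a gene on either strand overlapping the gap counts.
        if gene.start < gap_end and gene.end > gap_start:
            return False
    return True


a = Gene("chr1", 100, 200, "+")
b = Gene("chr1", 500, 600, "+")
c = Gene("chr1", 300, 400, "-")
print(is_readthrough(a, b, [a, b]), is_readthrough(a, b, [a, b, c]))
```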
gene
| Field | Type | Notes |
|---|---|---|
| hgnc | string | gene symbol, e.g. MSH6 |
| isOncogene | bool | true when this gene is an oncogene |
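As a usage illustration, the fields above can be read with any JSON library. The sketch below wraps the example `fusionCatcher` array in a bare object, since where the array sits within the full Nirvana output is not described here. `summarize` is a hypothetical helper, and treating an absent boolean field as false is an assumption suggested by the example, where `isOncogene` is omitted for RUNX1.

```python
import json

annotation = json.loads("""
{
  "fusionCatcher": [
    {
      "genes": {
        "first": {"hgnc": "ETV6", "isOncogene": true},
        "second": {"hgnc": "RUNX1"},
        "isParalogPair": true,
        "isPseudogenePair": true,
        "isReadthrough": true
      },
      "germlineSources": ["1000 Genomes Project"],
      "somaticSources": ["COSMIC", "TCGA oesophageal carcinomas"]
    }
  ]
}
""")


def summarize(entry):
    """Flatten one fusionCatcher entry into a small summary dict."""
    genes = entry["genes"]
    first, second = genes["first"], genes["second"]
    return {
        "fusion": f'{first["hgnc"]}-{second["hgnc"]}',
        # Assumption: a missing isOncogene field means false.
        "has_oncogene": first.get("isOncogene", False)
        or second.get("isOncogene", False),
        "germline": entry.get("germlineSources", []),
        "somatic": entry.get("somaticSources", []),
    }


summaries = [summarize(entry) for entry in annotation["fusionCatcher"]]
print(summaries[0]["fusion"])
```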