FusionCatcher
Overview
FusionCatcher is a well-known tool that searches for novel and known somatic fusion genes, translocations, and chimeras in RNA-seq data. Although FusionCatcher itself is not part of Illumina Connected Annotations, we include a subset of its genomic databases in Illumina Connected Annotations.
Publication
Daniel Nicorici, Mihaela Şatalan, Henrik Edgren, Sara Kangaspeska, Astrid Murumägi, Olli Kallioniemi, Sami Virtanen, Olavi Kilkku. (2014) FusionCatcher – a tool for finding somatic fusion genes in paired-end RNA-sequencing data. bioRxiv 011650
Supported Data Sources
Oncogenes
The following data sources are aggregated and used to populate the isOncogene field in the gene JSON object:
Description | Reference | Data | FusionCatcher filename |
---|---|---|---|
Bushman | | bushmanlab.org | cancer_genes.txt |
ONGENE | JGG | bioinfo-minzhao.org | oncogenes_more.txt |
UniProt tumor genes | NAR | uniprot.org | tumor_genes.txt |
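The aggregation described above can be pictured as a set union over the three files. This is a minimal sketch, not the actual Illumina Connected Annotations implementation: the function names are hypothetical, and the file contents are small illustrative stand-ins rather than the real data.

```python
from __future__ import annotations

from typing import Iterable


def load_gene_ids(lines: Iterable[str]) -> set[str]:
    """Collect one Ensembl gene ID per non-empty line."""
    return {line.strip() for line in lines if line.strip()}


def build_oncogene_set(sources: dict[str, Iterable[str]]) -> set[str]:
    """Union the gene IDs found in every oncogene source."""
    oncogenes: set[str] = set()
    for lines in sources.values():
        oncogenes |= load_gene_ids(lines)
    return oncogenes


# Illustrative stand-ins for the three files (NOT their real contents).
sources = {
    "cancer_genes.txt": ["ENSG00000000938\n"],
    "oncogenes_more.txt": ["ENSG00000000938\n", "ENSG00000003402\n"],
    "tumor_genes.txt": ["ENSG00000005469\n", "\n"],
}

oncogenes = build_oncogene_set(sources)

# A gene would be flagged when it appears in at least one source.
is_oncogene = "ENSG00000003402" in oncogenes
```

Under this reading, a gene listed in any one of the sources is enough to set isOncogene to true.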
Germline
Illumina Connected Annotations label | Reference | Data | FusionCatcher filename |
---|---|---|---|
1000 Genomes Project | PLOS ONE | | 1000genomes.txt |
Healthy (strong support) | | | banned.txt |
Illumina Body Map 2.0 | | EBI | bodymap2.txt |
CACG | Genomics | | cacg.txt |
ConjoinG | PLOS ONE | | conjoing.txt |
Healthy prefrontal cortex | BMC Medical Genomics | NCBI GEO | cortex.txt |
Duplicated Genes Database | PLOS ONE | genouest.org | dgd.txt |
GTEx healthy tissues | | gtexportal.org | gtex.txt |
Healthy | | | healthy.txt |
Human Protein Atlas | MCP | EBI | hpa.txt |
Babiceanu non-cancer tissues | NAR | NAR | non-cancer_tissues.txt |
non-tumor cell lines | | | non-tumor_cells.txt |
TumorFusions normal | NAR | NAR | tcga-normal.txt |
Somatic
Illumina Connected Annotations label | Reference | Data | FusionCatcher filename |
---|---|---|---|
Alaei-Mahabadi 18 cancers | PNAS | | 18cancers.txt |
DepMap CCLE | | depmap.org | ccle.txt |
CCLE Klijn | Nature Biotechnology | Nature Biotechnology | ccle2.txt |
CCLE Vellichirammal | Molecular Therapy Nucleic Acids | | ccle3.txt |
Cancer Genome Project | | COSMIC | cgp.txt |
ChimerKB 4.0 | NAR | kobic.re.kr | chimerdb4kb.txt |
ChimerPub 4.0 | NAR | kobic.re.kr | chimerdb4pub.txt |
ChimerSeq 4.0 | NAR | kobic.re.kr | chimerdb4seq.txt |
COSMIC | NAR | COSMIC | cosmic.txt |
Bao gliomas | Genome Research | | gliomas.txt |
Known | | | known.txt |
Mitelman DB | ISB-CGC | Google Cloud | mitelman.txt |
TCGA oesophageal carcinomas | Nature | | oesophagus.txt |
Bailey pancreatic cancers | Nature | Nature | pancreases.txt |
PCAWG | Cell | ICGC | pcawg.txt |
Robinson prostate cancers | Cell | Cell | prostate_cancer.txt |
TCGA | | cancer.gov | tcga.txt |
TumorFusions tumor | NAR | NAR | tcga-cancer.txt |
TCGA Gao | Cell | Cell | tcga2.txt |
TCGA Vellichirammal | Molecular Therapy Nucleic Acids | | tcga3.txt |
TICdb | BMC Genomics | unav.edu | ticdb.txt |
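One way to picture how the germline and somatic tables relate to the annotation is a lookup from FusionCatcher filename to label, applied to every file that contains a given gene pair. The sketch below is an illustration only: `matching_labels` is a hypothetical helper, the per-file contents are made up, and treating a pair as unordered and returning the labels sorted are both assumptions rather than documented behavior.

```python
from __future__ import annotations

# Filename -> label, copied from a few rows of the tables above.
GERMLINE_LABELS = {
    "1000genomes.txt": "1000 Genomes Project",
    "gtex.txt": "GTEx healthy tissues",
}
SOMATIC_LABELS = {
    "cosmic.txt": "COSMIC",
    "oesophagus.txt": "TCGA oesophageal carcinomas",
    "ticdb.txt": "TICdb",
}


def matching_labels(
    first: str,
    second: str,
    pairs_by_file: dict[str, set[frozenset[str]]],
    labels: dict[str, str],
) -> list[str]:
    """Return the labels of every source file that lists this gene pair."""
    # Assumption: a pair matches regardless of the order of its two genes.
    key = frozenset((first, second))
    return sorted(
        label
        for filename, label in labels.items()
        if key in pairs_by_file.get(filename, set())
    )


# Hypothetical per-file contents, used only to exercise the lookup.
pair = ("ENSG00000006210", "ENSG00000102962")
pairs_by_file = {
    "1000genomes.txt": {frozenset(pair)},
    "cosmic.txt": {frozenset(pair)},
    "oesophagus.txt": {frozenset(reversed(pair))},
    "ticdb.txt": set(),
}

germline_sources = matching_labels(*pair, pairs_by_file, GERMLINE_LABELS)
somatic_sources = matching_labels(*pair, pairs_by_file, SOMATIC_LABELS)
```

With these made-up inputs the pair collects one germline label and two somatic labels, the same shape as the germlineSources and somaticSources arrays in the JSON output.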
Gene Pair TSV File
Most of the data files in FusionCatcher are two-column TSV files containing the Ensembl gene IDs that are paired together.
Example
Here are the first few lines of the 1000genomes.txt file:
ENSG00000006210 ENSG00000102962
ENSG00000006652 ENSG00000181016
ENSG00000014138 ENSG00000149798
ENSG00000026297 ENSG00000071242
ENSG00000035499 ENSG00000155959
ENSG00000055211 ENSG00000131013
ENSG00000055332 ENSG00000179915
ENSG00000062485 ENSG00000257727
ENSG00000065978 ENSG00000166501
ENSG00000066044 ENSG00000104980
Parsing
In Illumina Connected Annotations, we will only import a gene pair if both Ensembl gene IDs are recognized from either our GRCh37 or GRCh38 cache files.
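The import rule above can be sketched as a small filter over the two-column file. This is an illustration under stated assumptions, not the actual importer: `parse_gene_pairs` is a hypothetical name, `known_ids` stands in for the gene IDs found in the GRCh37 and GRCh38 cache files, and skipping malformed lines is an assumption.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def parse_gene_pairs(
    lines: Iterable[str], known_ids: set[str]
) -> Iterator[tuple[str, str]]:
    """Yield gene pairs whose two Ensembl gene IDs are both recognized."""
    for line in lines:
        # split() accepts tabs as well as spaces between the two columns.
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: blank or malformed lines are skipped
        first, second = fields
        if first in known_ids and second in known_ids:
            yield first, second


# First three lines of the 1000genomes.txt example above.
sample = [
    "ENSG00000006210\tENSG00000102962\n",
    "ENSG00000006652\tENSG00000181016\n",
    "ENSG00000014138\tENSG00000149798\n",
]

# Hypothetical set of recognized IDs; ENSG00000149798 is left out on
# purpose to show a pair being rejected.
known_ids = {
    "ENSG00000006210",
    "ENSG00000102962",
    "ENSG00000006652",
    "ENSG00000181016",
    "ENSG00000014138",
}

pairs = list(parse_gene_pairs(sample, known_ids))
```

Only the first two pairs survive here, because the third pair has one ID missing from the hypothetical `known_ids` set.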
Gene TSV File
Some of the data files are single-column files containing Ensembl gene IDs. This is commonly used in the data files representing oncogene data sources.
Example
Here are the first few lines of the oncogenes_more.txt file:
ENSG00000000938
ENSG00000003402
ENSG00000005469
ENSG00000005884
ENSG00000006128
ENSG00000006453
ENSG00000006468
ENSG00000007350
ENSG00000008294
ENSG00000008952
Parsing
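Assuming single-column files follow the same rule as gene pairs (a gene is kept only when its Ensembl gene ID is recognized), the filter might look like the sketch below. The function name and the `known_ids` set are hypothetical, and the custom ID `ENSG09000000002` from the Known Issues section is included to show how such an entry would be dropped.

```python
from __future__ import annotations

from typing import Iterable


def parse_gene_ids(lines: Iterable[str], known_ids: set[str]) -> list[str]:
    """Keep the Ensembl gene IDs that appear in the recognized set."""
    kept: list[str] = []
    for line in lines:
        gene_id = line.strip()
        if gene_id and gene_id in known_ids:
            kept.append(gene_id)
    return kept


# Two IDs from the oncogenes_more.txt example plus one custom
# FusionCatcher ID that is not a real Ensembl gene.
sample = [
    "ENSG00000000938\n",
    "ENSG09000000002\n",
    "ENSG00000003402\n",
]

# Hypothetical stand-in for the IDs found in the cache files.
known_ids = {"ENSG00000000938", "ENSG00000003402"}

genes = parse_gene_ids(sample, known_ids)
```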
Known Issues
FusionCatcher also creates custom Ensembl genes (e.g. ENSG09000000002) to handle missing Ensembl genes. Illumina Connected Annotations will ignore these entries since we only include the gene IDs that are currently recognized by Illumina Connected Annotations.
I suspect that these were originally RefSeq genes, and if so, we can support those directly in Illumina Connected Annotations in the future.
Download URL
https://sourceforge.net/projects/fusioncatcher/files/data
JSON Output
"fusionCatcher":[
   {
      "genes":{
         "first":{
            "hgnc":"ETV6",
            "isOncogene":true
         },
         "second":{
            "hgnc":"RUNX1"
         },
         "isParalogPair":true,
         "isPseudogenePair":true,
         "isReadthrough":true
      },
      "germlineSources":[
         "1000 Genomes Project"
      ],
      "somaticSources":[
         "COSMIC",
         "TCGA oesophageal carcinomas"
      ]
   }
]
Field | Type | Notes |
---|---|---|
genes | genes object | 5' gene & 3' gene |
germlineSources | string array | matches in known germline data sources |
somaticSources | string array | matches in known somatic data sources |
genes
Field | Type | Notes |
---|---|---|
first | gene object | 5' gene |
second | gene object | 3' gene |
isParalogPair | bool | true when the two genes are paralogs of each other |
isPseudogenePair | bool | true when the two genes are pseudogenes of each other |
isReadthrough | bool | true when this fusion gene is a readthrough event (both are on the same strand and there are no genes between them) |
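The readthrough definition in the table (same strand, no genes in between) can be sketched as a simple interval check. This is only one reading of that definition: the `Gene` record, the coordinates, the same-chromosome requirement, and counting any overlapping gene on either strand as "between" are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def is_readthrough(first: Gene, second: Gene, genes: Iterable[Gene]) -> bool:
    """True when both genes share a strand and no gene lies between them."""
    if first.chrom != second.chrom or first.strand != second.strand:
        return False
    left, right = sorted((first, second), key=lambda g: g.start)
    gap_start, gap_end = left.end, right.start
    if gap_end <= gap_start:
        return True  # overlapping or adjacent genes leave no room in between
    for gene in genes:
        if gene in (first, second) or gene.chrom != first.chrom:
            continue
        # Any gene overlapping the gap counts as lying between the two.
        if gene.start < gap_end and gene.end > gap_start:
            return False
    return True


# Made-up genes and coordinates, used only to exercise the check.
a = Gene("A", "chr1", 100, 200, "+")
b = Gene("B", "chr1", 500, 600, "+")
between = Gene("C", "chr1", 300, 400, "-")
elsewhere = Gene("D", "chr2", 300, 400, "+")
b_minus = Gene("B", "chr1", 500, 600, "-")

adjacent = is_readthrough(a, b, [a, b, elsewhere])
interrupted = is_readthrough(a, b, [a, b, between])
opposite = is_readthrough(a, b_minus, [a, b_minus])
```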
gene
Field | Type | Notes |
---|---|---|
hgnc | string | gene symbol, e.g. MSH6 |
isOncogene | bool | true when this gene is an oncogene |
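A consumer of this output might read an entry as in the sketch below, which reuses the sample JSON shown earlier. The `summarize` helper is hypothetical, and the sketch assumes that a boolean field missing from the JSON (such as isOncogene on the second gene in the sample) should be read as false.

```python
import json

# The sample entry from the JSON Output section, wrapped in an object.
SAMPLE = """
{
  "fusionCatcher": [
    {
      "genes": {
        "first": {"hgnc": "ETV6", "isOncogene": true},
        "second": {"hgnc": "RUNX1"},
        "isParalogPair": true,
        "isPseudogenePair": true,
        "isReadthrough": true
      },
      "germlineSources": ["1000 Genomes Project"],
      "somaticSources": ["COSMIC", "TCGA oesophageal carcinomas"]
    }
  ]
}
"""


def summarize(entry: dict) -> dict:
    """Flatten one fusionCatcher entry into a small summary dictionary."""
    genes = entry["genes"]
    first, second = genes["first"], genes["second"]
    return {
        "pair": (first["hgnc"], second["hgnc"]),  # 5' gene, then 3' gene
        # Assumption: a missing boolean field means false.
        "oncogenes": [
            g["hgnc"] for g in (first, second) if g.get("isOncogene", False)
        ],
        "is_readthrough": genes.get("isReadthrough", False),
        "germline": entry.get("germlineSources", []),
        "somatic": entry.get("somaticSources", []),
    }


summaries = [summarize(e) for e in json.loads(SAMPLE)["fusionCatcher"]]
```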