Version: 3.18

Annotating COVID-19

The Nirvana development team is mainly focused on providing annotations for the human genome. This focus lets us concentrate our resources on understanding human health.

However, nothing in our architecture prevents us from supporting other genomes. Earlier this year, we had an opportunity to put that statement to the test: we added support for annotating the genome of SARS-CoV-2, the virus that causes COVID-19.

In addition to normal transcript annotation, we also supply:

allele frequencies
protein domains

SARS-CoV-2 Galaxy Project

The allele frequencies used by Nirvana were provided by the SARS-CoV-2 Galaxy Project. This is an international effort that provides ongoing analysis of COVID-19 using Galaxy, BioConda, and public research infrastructures.

Getting Nirvana

If you don't have Nirvana already, please consult our Getting Started page first.

Downloading the COVID-19 data files

The following zip file contains the new gene models, reference, and external data sources for SARS-CoV-2. Go to the directory that contains your Nirvana Data directory, then download and unzip the file:

cd ~/Nirvana
curl -O https://illumina.github.io/NirvanaDocumentation/files/Covid19Data.zip
unzip Covid19Data.zip

Download a COVID-19 VCF file

Here's a COVID-19 VCF file you can play around with:

curl -O https://illumina.github.io/NirvanaDocumentation/files/Covid19Mutations.vcf.gz

Running Nirvana

Once you have downloaded the data sets, use the following command to annotate your VCF:

dotnet bin/Release/netcoreapp2.1/Nirvana.dll \
     -c Data/Cache/SARS-CoV-2/SARS-CoV-2 \
     --sd Data/SupplementaryAnnotation/SARS-CoV-2 \
     -r Data/References/SARS-CoV-2.ASM985889v3.dat \
     -i Covid19Mutations.vcf.gz \
     -o Covid19Mutations

the -c argument specifies the cache prefix
the --sd argument specifies the supplementary annotation directory
the -r argument specifies the compressed reference path
the -i argument specifies the input VCF path
the -o argument specifies the output filename prefix
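If you launch Nirvana from a script rather than a shell, the same arguments can be assembled as a list. The following is a minimal sketch, not part of Nirvana itself: the helper name `build_nirvana_command` is hypothetical, and the paths are simply the ones used in the command above.

```python
import shlex


def build_nirvana_command(cache_prefix, sa_dir, reference, input_vcf, output_prefix,
                          nirvana_dll="bin/Release/netcoreapp2.1/Nirvana.dll"):
    """Return the Nirvana invocation as an argument list (hypothetical helper)."""
    return [
        "dotnet", nirvana_dll,
        "-c", cache_prefix,    # cache prefix
        "--sd", sa_dir,        # supplementary annotation directory
        "-r", reference,       # compressed reference path
        "-i", input_vcf,       # input VCF path
        "-o", output_prefix,   # output filename prefix
    ]


command = build_nirvana_command(
    cache_prefix="Data/Cache/SARS-CoV-2/SARS-CoV-2",
    sa_dir="Data/SupplementaryAnnotation/SARS-CoV-2",
    reference="Data/References/SARS-CoV-2.ASM985889v3.dat",
    input_vcf="Covid19Mutations.vcf.gz",
    output_prefix="Covid19Mutations",
)

# Print the command for inspection; nothing is executed here.
print(shlex.join(command))
```

To actually run the annotation, the list could be passed to `subprocess.run(command, check=True)` from the directory that contains `bin` and `Data`.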

When running Nirvana, performance metrics are shown as it evaluates each chromosome in the input VCF file:

---------------------------------------------------------------------------
Nirvana                                             (c) 2020 Illumina, Inc.
Stromberg, Roy, Lajugie, Jiang, Li, and Kang                         3.12.0
---------------------------------------------------------------------------

Initialization                                         Time     Positions/s
---------------------------------------------------------------------------
Cache                                               00:00:00.0
SA Position Scan                                    00:00:00.0         1763

Reference                                Preload    Annotation   Variants/s
---------------------------------------------------------------------------
NC_045512                               00:00:00.0  00:00:00.1          173

Summary                                                Time         Percent
---------------------------------------------------------------------------
Initialization                                      00:00:00.0        2.0 %
Preload                                             00:00:00.0        0.3 %
Annotation                                          00:00:00.1        6.0 %

Time: 00:00:01.5

The output will be a JSON file called Covid19Mutations.json.gz. Here's the full JSON file.
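The compressed output can be read directly with Python's standard library. The sketch below assumes that the annotated records sit in a top-level `positions` array; that key name is an assumption about the file layout rather than something shown on this page, and `load_positions` is a hypothetical helper name.

```python
import gzip
import json


def load_positions(path):
    """Read a gzipped Nirvana JSON file and return its list of position records.

    Assumption: the records are stored under a top-level "positions" key.
    An empty list is returned if that key is absent.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        document = json.load(handle)
    return document.get("positions", [])


# Example usage (requires the file produced by the command above):
# positions = load_positions("Covid19Mutations.json.gz")
# print(len(positions), "annotated positions")
```

This loads the whole file into memory, which is reasonable for a genome as small as SARS-CoV-2.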

Investigating the Results

Here's an example of what a COVID-19 variant looks like in the JSON output:

{
   "chromosome":"NC_045512.2",
   "position":27323,
   "refAllele":"C",
   "altAlleles":[
      "T"
   ],
   "filters":[
      "PASS"
   ],
   "proteinDomains":[
      {
         "start":27202,
         "end":27384,
         "proteinId":"YP_009724394.1",
         "domainId":"cl13556",
         "domainName":"Sars6 super family",
         "reciprocalOverlap":0.00546,
         "annotationOverlap":0.00546
      }
   ],
   "variants":[
      {
         "vid":"NC_045512.2-27323-C-T",
         "chromosome":"NC_045512.2",
         "begin":27323,
         "end":27323,
         "refAllele":"C",
         "altAllele":"T",
         "variantType":"SNV",
         "hgvsg":"NC_045512.2:g.27323C>T",
         "alleleFrequency":{
            "refAllele":"C",
            "altAllele":"T",
            "allAc":8,
            "allAn":1058,
            "allAf":0.007561
         },
         "transcripts":[
            {
               "transcript":"YP_009724394.1",
               "source":"RefSeq",
               "bioType":"protein_coding",
               "codons":"tCt/tTt",
               "aminoAcids":"S/F",
               "cdnaPos":"122",
               "cdsPos":"122",
               "exons":"1/1",
               "proteinPos":"41",
               "geneId":"43740572",
               "hgnc":"ORF6",
               "consequence":[
                  "missense_variant"
               ],
               "hgvsc":"YP_009724394.1:c.122C>T",
               "hgvsp":"YP_009724394.1:p.(Ser41Phe)",
               "proteinId":"YP_009724394.1"
            },
            {
               "transcript":"YP_009724395.1",
               "source":"RefSeq",
               "bioType":"protein_coding",
               "geneId":"43740573",
               "hgnc":"ORF7a",
               "consequence":[
                  "upstream_gene_variant"
               ],
               "proteinId":"YP_009724395.1"
            }
         ]
      }
   ]
}
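To show how the nested fields relate to each other, here is a small sketch that walks a trimmed copy of the record above and prints one row per transcript. The field names come from the example; the `summarize` helper is hypothetical. The two checks at the end reflect our reading of the numbers (allele frequency as `allAc / allAn`, and domain overlap as one variant base over the length of the domain) and are not a statement of how Nirvana computes them.

```python
# Trimmed copy of the example record; only the fields used below are kept.
position = {
    "chromosome": "NC_045512.2",
    "position": 27323,
    "proteinDomains": [
        {"start": 27202, "end": 27384, "domainName": "Sars6 super family",
         "annotationOverlap": 0.00546},
    ],
    "variants": [
        {
            "vid": "NC_045512.2-27323-C-T",
            "begin": 27323,
            "end": 27323,
            "alleleFrequency": {"allAc": 8, "allAn": 1058, "allAf": 0.007561},
            "transcripts": [
                {"hgnc": "ORF6", "consequence": ["missense_variant"],
                 "hgvsp": "YP_009724394.1:p.(Ser41Phe)"},
                {"hgnc": "ORF7a", "consequence": ["upstream_gene_variant"]},
            ],
        }
    ],
}


def summarize(record):
    """Return (vid, gene, consequences, hgvsp, allAf) for every transcript."""
    rows = []
    for variant in record.get("variants", []):
        frequency = variant.get("alleleFrequency", {})
        for transcript in variant.get("transcripts", []):
            rows.append((
                variant["vid"],
                transcript.get("hgnc"),
                ",".join(transcript.get("consequence", [])),
                transcript.get("hgvsp"),  # None when no protein change is reported
                frequency.get("allAf"),
            ))
    return rows


for row in summarize(position):
    print(row)

# Consistency checks under the assumptions stated above.
frequency = position["variants"][0]["alleleFrequency"]
recomputed_af = round(frequency["allAc"] / frequency["allAn"], 6)  # 8 / 1058

domain = position["proteinDomains"][0]
domain_length = domain["end"] - domain["start"] + 1  # 183 bases
overlap_fraction = round(1 / domain_length, 5)       # one variant base inside the domain
```

With these values, `recomputed_af` matches the reported `allAf` of 0.007561 and `overlap_fraction` matches the reported overlap of 0.00546.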
