Skip to main content
Version: 3.19 (unreleased)

ClinVar

Overview

ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation.

Publication

Melissa J Landrum, Jennifer M Lee, Mark Benson, Garth R Brown, Chen Chao, Shanmuga Chitipiralla, Baoshan Gu, Jennifer Hart, Douglas Hoffman, Wonhee Jang, Karen Karapetyan, Kenneth Katz, Chunlei Liu, Zenith Maddipatla, Adriana Malheiro, Kurt McDaniel, Michael Ovetsky, George Riley, George Zhou, J Bradley Holmes, Brandi L Kattman, Donna R Maglott, ClinVar: improving access to variant interpretations and supporting evidence, Nucleic Acids Research, 46, Issue D1, 4 January 2018, Pages D1062–D1067, https://doi.org/10.1093/nar/gkx1153

RCV File

Example

Here's a full RCV entry.

Parsing

In the following section, we discuss which field of the XML was used to extract information that is presented in the JSON output.

ID

<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinVarAccession Acc="RCV000000001" Version="2">
</ClinVarSet>

The Acc and Version fields are merged to form the ID (RCV000000001.2)

LastUpdatedDate

<ClinVarSet>
<ReferenceClinVarAssertion DateCreated="2012-08-13" DateLastUpdated="2016-02-17" ID="57604" >
</ClinVarSet>

Significance

<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>
</ClinVarSet>

ReviewStatus

<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>
</ClinVarSet>

Phenotypes

<ReferenceClinVarAssertion>
<TraitSet Type="Disease" ID="62">
<Trait Type="Disease">
<Name>
<ElementValue Type="Preferred">Joubert syndrome 9</ElementValue>
</Name>
</Trait>
</TraitSet>
</ReferenceClinVarAssertion>

We only use the field with Type="Preferred". Multiple phenotypes may be reported

Location, Variant Type and Variant Id

<ReferenceClinVarAssertion>
<GenotypeSet Type="CompoundHeterozygote" ID="424709">
<MeasureSet Type="Variant" ID="81">
<Measure Type="single nucleotide variant" ID="15120">
<SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38"
AssemblyStatus="current" Chr="10" Accession="NC_000010.11" start="89222510"
stop="89222510" display_start="89222510" display_stop="89222510" variantLength="1"
positionVCF="89222510" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
<SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25"
AssemblyStatus="previous" Chr="10" Accession="NC_000010.10" start="90982267"
stop="90982267" display_start="90982267" display_stop="90982267" variantLength="1"
positionVCF="90982267" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
</Measure>
</MeasureSet>
</GenotypeSet>
</ReferenceClinVarAssertion>
  • The variant position is extracted from the fields for their respective assemblies.
  • Updated records contain positionVCF, referenceAlleleVCF and alternateAlleleVCF fields and when present, we use them to create the variant.
  • For older records, since "start' and "stop" fields are not always available, we use the "display_start" and "display_end" fields.
  • If a required allele is not available, we extract it from the reference sequence.
  • Only variants having a dbSNP id are extracted.
  • Note that a ClinVar accession may have multiple variants associated with it (possible in different locations)
  • VariantId is extracted from the MeasureSet attributes.
  • VariantType is extracted from the Measure attributes.
    unsupported variant types

    We currently don't support the following variant types:

    • Microsatellite
    • protein only
    • fusion
    • Complex
    • Variation
    • Translocation

MedGen, OMIM, Orphanet IDs

<ReferenceClinVarAssertion>
<TraitSet Type="Disease" ID="175">
<Trait ID="3036" Type="Disease">
<XRef ID="C0086651" DB="MedGen"/>
<XRef ID="309297" DB="Orphanet"/>
<XRef ID="582" DB="Orphanet"/>
<XRef Type="MIM" ID="253000" DB="OMIM"/>
</Trait>
</TraitSet>
</ReferenceClinVarAssertion>

AlleleOrigins

<ClinVarAssertion>
<Origin>germline</Origin>
</ClinVarAssertion>

We only extract all Allele Origins from Submissions (SCV) entries.

PubMedIds

<ClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<Citation Type="general">
<ID Source="PubMed">12114475</ID>
</Citation>
</ClinicalSignificance>
<AttributeSet>
<Attribute Type="AssertionMethod">LMM Criteria</Attribute>
<Citation>
<ID Source="PubMed">24033266</ID>
</Citation>
</AttributeSet>
<ObservedIn>
<ObservedData ID="9727445">
<Citation Type="general">
<ID Source="PubMed">9113933</ID>
</Citation>
</ObservedData>
</ObservedIn>
<Citation Type="general">
<ID Source="PubMed">23757202</ID>
</Citation>
</ClinVarAssertion>

We only extract all Pubmed Ids from Submissions (SCV) entries.

Parsing Significance

Extracting significance(s) may involve parsing multiple fields. Take the following snippets into consideration.

<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>

<ClinicalSignificance DateLastEvaluated="2016-10-13">
<ReviewStatus>criteria provided, multiple submitters, no conflicts</ReviewStatus>
<Description>Pathogenic/Likely pathogenic</Description>
</ClinicalSignificance>

<ClinicalSignificance DateLastEvaluated="2012-06-07">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Conflicting interpretations of pathogenicity</Description>
<Explanation DataSource="ClinVar" Type="public">Pathogenic(1);Uncertain significance(1)</Explanation>
</ClinicalSignificance>

Given the evidence, we converted the significance field into an array of strings which may be parsed out of the Descriptions or Explanation fields.

Varying Delimiters

The delimiters in each field may vary. Currently, the delimiters for Description are , and /. The delimiters for Explanation are ; and /.

VCV File

Example

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ClinVarVariationRelease xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://ftp.ncbi.nlm.nih.gov/pub/clinvar/xsd_public/clinvar_variation/variation_archive_1.4.xsd" ReleaseDate="2019-12-31">
<VariationArchive VariationID="431749" VariationName="GRCh37/hg19 1p36.31(chr1:6051187-6158763)" VariationType="copy number gain" DateCreated="2017-08-12" DateLastUpdated="2019-09-10" Accession="VCV000431749" Version="1" RecordType="included" NumberOfSubmissions="0" NumberOfSubmitters="0">
<RecordStatus>current</RecordStatus>
<Species>Homo sapiens</Species>
<IncludedRecord>
<SimpleAllele AlleleID="425239" VariationID="431749">
<GeneList>
<Gene Symbol="KCNAB2" FullName="potassium voltage-gated channel subfamily A regulatory beta subunit 2" GeneID="8514" HGNC_ID="HGNC:6229" Source="calculated" RelationshipType="genes overlapped by variant">
<Location>
<CytogeneticLocation>1p36.31</CytogeneticLocation>
<SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38" AssemblyStatus="current" Chr="1" Accession="NC_000001.11" start="5992639" stop="6101186" display_start="5992639" display_stop="6101186" Strand="+"/>
<SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25" AssemblyStatus="previous" Chr="1" Accession="NC_000001.10" start="6052357" stop="6161252" display_start="6052357" display_stop="6161252" Strand="+"/>
</Location>
<OMIM>601142</OMIM>
</Gene>
<Gene Symbol="NPHP4" FullName="nephrocystin 4" GeneID="261734" HGNC_ID="HGNC:19104" Source="calculated" RelationshipType="genes overlapped by variant">
<Location>
<CytogeneticLocation>1p36.31</CytogeneticLocation>
<SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38" AssemblyStatus="current" Chr="1" Accession="NC_000001.11" start="5862810" stop="5992425" display_start="5862810" display_stop="5992425" Strand="-"/>
<SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25" AssemblyStatus="previous" Chr="1" Accession="NC_000001.10" start="5922869" stop="6052532" display_start="5922869" display_stop="6052532" Strand="-"/>
</Location>
<OMIM>607215</OMIM>
</Gene>
</GeneList>
<Name>GRCh37/hg19 1p36.31(chr1:6051187-6158763)</Name>
<VariantType>copy number gain</VariantType>
<Location>
<CytogeneticLocation>1p36.31</CytogeneticLocation>
<SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25" forDisplay="true" AssemblyStatus="previous" Chr="1" Accession="NC_000001.10" start="6051187" stop="6158763" display_start="6051187" display_stop="6158763"/> </Location>
<Interpretations>
<Interpretation NumberOfSubmissions="0" NumberOfSubmitters="0" Type="Clinical significance">
<Description>no interpretation for the single variant</Description>
</Interpretation>
</Interpretations>
<XRefList>
<XRef Type="Interpreted" ID="431733" DB="ClinVar"/>
</XRefList>
</SimpleAllele>
<ReviewStatus>no interpretation for the single variant</ReviewStatus>
<Interpretations>
<Interpretation NumberOfSubmissions="0" NumberOfSubmitters="0" Type="Clinical significance">
<Description>no interpretation for the single variant</Description>
</Interpretation>
</Interpretations>
<SubmittedInterpretationList>
<SCV Title="SUB1895145" Accession="SCV000296057" Version="1"/>
</SubmittedInterpretationList>
<InterpretedVariationList>
<InterpretedVariation VariationID="431733" Accession="VCV000431733" Version="1"/>
</InterpretedVariationList>
</IncludedRecord>
</VariationArchive>
</ClinVarVariationRelease>

Parsing

In the following section, we discuss which field of the XML was used to extract information that is presented in the JSON output.

id

<VariationArchive VariationID="431749" VariationName="GRCh37/hg19 1p36.31(chr1:6051187-6158763)" VariationType="copy number gain" DateCreated="2017-08-12" DateLastUpdated="2019-09-10" Accession="VCV000431749" Version="1" RecordType="included" NumberOfSubmissions="0" NumberOfSubmitters="0">

The Acc and Version fields are merged to form the ID (RCV000000001.2)

significance

<ClinVarVariationRelease>
<VariationArchive>
<IncludedRecord>
<SimpleAllele>
<Interpretations>
<Interpretation NumberOfSubmissions="0" NumberOfSubmitters="0" Type="Clinical significance">
<Description>no interpretation for the single variant</Description>
</Interpretation>
</Interpretations>
</SimpleAllele>
</IncludedRecord>
</VariationArchive>
</ClinVarVariationRelease>

May have multiple significances listed.

reviewStatus

<ClinVarVariationRelease>
<VariationArchive>
<IncludedRecord>
<ReviewStatus>no interpretation for the single variant</ReviewStatus>
</IncludedRecord>
</VariationArchive>
</ClinVarVariationRelease>

Known Issues

Known Issues
  • The XML file contains ~1k more entries (out of 162K) than the VCF file
  • The XML file does not have a field indicating that a record is associated with the reference base - something that was present in VCF
  • The XML file contains entries (e.g. RCV000016645 version=1) which have IUPAC ambiguous bases ("R", "Y", "H", etc.) as their alternate allele

Download URLs

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_00-latest.xml.gz

https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/ClinVarVariationRelease_00-latest.xml.gz

JSON Output

small variants:

"clinvar":[
{
"id":"VCV000036581.3",
"reviewStatus":"reviewed by expert panel",
"significance":[
"benign"
],
"refAllele":"G",
"altAllele":"A",
"lastUpdatedDate":"2020-03-01",
"isAlleleSpecific":true
},
{
"id":"RCV000030258.4",
"variationId":"VCV000036581.3",
"reviewStatus":"reviewed by expert panel",
"alleleOrigins":[
"germline"
],
"refAllele":"G",
"altAllele":"A",
"phenotypes":[
"Lynch syndrome"
],
"medGenIds":[
"C1333990"
],
"omimIds":[
"120435"
],
"significance":[
"benign"
],
"lastUpdatedDate":"2017-05-01",
"isAlleleSpecific":true
}
]

large variants:

"clinvar":[
{
"chromosome":"1",
"begin":629025,
"end":8537745,
"variantType":"copy_number_loss",
"id":"RCV000051993.4",
"variationId":"VCV000058242.1",
"reviewStatus":"criteria provided, single submitter",
"alleleOrigins":[
"not provided"
],
"phenotypes":[
"See cases"
],
"significance":[
"pathogenic"
],
"lastUpdatedDate":"2022-04-21",
"pubMedIds":[
"21844811"
]
},
{
"id":"VCV000058242.1",
"reviewStatus":"criteria provided, single submitter",
"significance":[
"pathogenic"
],
"lastUpdatedDate":"2022-04-21"
},
......
]
FieldTypeNotes
idstringClinVar ID
variationIdstringClinVar VCV ID
variantTypestringvariant type
reviewStatusstringsee possible values below
alleleOriginsstring arraysee possible values below
refAllelestring
altAllelestring
phenotypesstring array
medGenIdsstring arrayMedGen IDs
omimIdsstring arrayOMIM IDs
orphanetIdsstring arrayOrphanet IDs
significancestring arraysee possible values below
lastUpdatedDatestringyyyy-MM-dd
pubMedIdsstring arrayPubMed IDs
isAlleleSpecificbooltrue when the current variant alternate allele matches the ClinVar alternate allele

reviewStatus:

  • no assertion provided
  • no assertion criteria provided
  • criteria provided, single submitter
  • practice guideline
  • classified by multiple submitters
  • criteria provided, conflicting interpretations
  • criteria provided, multiple submitters, no conflicts
  • no interpretation for the single variant

alleleOrigins:

  • unknown
  • other
  • germline
  • somatic
  • inherited
  • paternal
  • maternal
  • de-novo
  • biparental
  • uniparental
  • not-tested
  • tested-inconclusive

significance:

  • uncertain significance
  • not provided
  • benign
  • likely benign
  • likely pathogenic
  • pathogenic
  • drug response
  • histocompatibility
  • association
  • risk factor
  • protective
  • affects
  • conflicting data from submitters
  • other
  • no interpretation for the single variant
  • conflicting interpretations of pathogenicity

Building the supplementary files

The ClinVar .nsa and .nsi for Nirvana can be built using the SAUtils command's clinvar subcommand.

Source data files

Two input .xml files and a .version file are required in order to build the .nsa and .nsi file. You should have the following files:

ClinVarFullRelease_00-latest.xml.gz     ClinVarVariationRelease_00-latest.xml.gz
ClinVarFullRelease_00-latest.xml.gz.version

The version file is a text file with the follwoing format.

NAME=ClinVar
VERSION=20220505
DATE=2022-05-05
DESCRIPTION=A freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence

The help menu for the utility is as follows:

dotnet SAUtils.dll clinvar
---------------------------------------------------------------------------
SAUtils (c) 2022 Illumina, Inc.
Stromberg, Roy, Platzer, Siddiqui, Ouyang, et al 3.18.1
---------------------------------------------------------------------------

USAGE: dotnet SAUtils.dll clinvar [options]
Creates a supplementary database with ClinVar annotations

OPTIONS:
--ref, -r <VALUE> compressed reference sequence file
--rcv, -i <VALUE> ClinVar Full release XML file
--vcv, -c <VALUE> ClinVar Variation release XML file
--out, -o <VALUE> output directory
--help, -h displays the help menu
--version, -v displays the version

dotnet SAUtils.dll clinvar

Here is a sample execution:

dotnet ~/development/Nirvana/bin/Debug/net6.0/SAUtils.dll clinvar \\
--ref ~/development/References/7/Homo_sapiens.GRCh38.Nirvana.dat --rcv ClinVarFullRelease_00-latest.xml.gz \\
--vcv ClinVarVariationRelease_00-latest.xml.gz --out ~/development/SupplementaryDatabase/63/GRCh38
---------------------------------------------------------------------------
SAUtils (c) 2022 Illumina, Inc.
Stromberg, Roy, Platzer, Siddiqui, Ouyang, et al 3.18.1
---------------------------------------------------------------------------

Found 1535677 VCV records
Unknown vcv id:225946 found in RCV000211201.2
Unknown vcv id:225946 found in RCV000211253.2
Unknown vcv id:225946 found in RCV000211375.2
Unknown vcv id:976117 found in RCV001253316.1
Unknown vcv id:1321016 found in RCV001776995.2
3 unknown VCVs found in RCVs.
225946,976117,1321016
0 unknown VCVs found in RCVs.
Chromosome 1 completed in 00:00:15.1
Chromosome 2 completed in 00:00:20.0
Chromosome 3 completed in 00:00:09.7
Chromosome 4 completed in 00:00:05.9
Chromosome 5 completed in 00:00:09.8
Chromosome 6 completed in 00:00:08.3
Chromosome 7 completed in 00:00:08.7
Chromosome 8 completed in 00:00:06.2
Chromosome 9 completed in 00:00:08.6
Chromosome 10 completed in 00:00:07.0
Chromosome 11 completed in 00:00:11.7
Chromosome 12 completed in 00:00:08.0
Chromosome 13 completed in 00:00:06.3
Chromosome 14 completed in 00:00:06.0
Chromosome 15 completed in 00:00:06.6
Chromosome 16 completed in 00:00:10.8
Chromosome 17 completed in 00:00:13.8
Chromosome 18 completed in 00:00:02.9
Chromosome 19 completed in 00:00:08.7
Chromosome 20 completed in 00:00:03.6
Chromosome 21 completed in 00:00:02.4
Chromosome 22 completed in 00:00:03.6
Chromosome MT completed in 00:00:00.2
Chromosome X completed in 00:00:07.5
Chromosome Y completed in 00:00:00.0
Maximum bp shifted for any variant:2
Writing 37097 intervals to database...

Time: 00:13:26.9