ClinVar
Overview
ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation.
Publication
Melissa J Landrum, Jennifer M Lee, Mark Benson, Garth R Brown, Chen Chao, Shanmuga Chitipiralla, Baoshan Gu, Jennifer Hart, Douglas Hoffman, Wonhee Jang, Karen Karapetyan, Kenneth Katz, Chunlei Liu, Zenith Maddipatla, Adriana Malheiro, Kurt McDaniel, Michael Ovetsky, George Riley, George Zhou, J Bradley Holmes, Brandi L Kattman, Donna R Maglott, ClinVar: improving access to variant interpretations and supporting evidence, Nucleic Acids Research, 46, Issue D1, 4 January 2018, Pages D1062–D1067, https://doi.org/10.1093/nar/gkx1153
RCV File
Example
Here's a full RCV entry.
Parsing
In the following section, we discuss which field of the XML was used to extract information that is presented in the JSON output.
ID
<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinVarAccession Acc="RCV000000001" Version="2">
</ClinVarSet>
The Acc and Version fields are merged to form the ID (RCV000000001.2)
LastUpdatedDate
<ClinVarSet>
<ReferenceClinVarAssertion DateCreated="2012-08-13" DateLastUpdated="2016-02-17" ID="57604" >
</ClinVarSet>
Significance
<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>
</ClinVarSet>
ReviewStatus
<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>
</ClinVarSet>
Phenotypes
<ReferenceClinVarAssertion>
<TraitSet Type="Disease" ID="62">
<Trait Type="Disease">
<Name>
<ElementValue Type="Preferred">Joubert syndrome 9</ElementValue>
</Name>
</Trait>
</TraitSet>
</ReferenceClinVarAssertion>
We only use the field with Type="Preferred". Multiple phenotypes may be reported
Location, Variant Type and Variant Id
<ReferenceClinVarAssertion>
<GenotypeSet Type="CompoundHeterozygote" ID="424709">
<MeasureSet Type="Variant" ID="81">
<Measure Type="single nucleotide variant" ID="15120">
<SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38"
AssemblyStatus="current" Chr="10" Accession="NC_000010.11" start="89222510"
stop="89222510" display_start="89222510" display_stop="89222510" variantLength="1"
positionVCF="89222510" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
<SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25"
AssemblyStatus="previous" Chr="10" Accession="NC_000010.10" start="90982267"
stop="90982267" display_start="90982267" display_stop="90982267" variantLength="1"
positionVCF="90982267" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
</Measure>
</MeasureSet>
</GenotypeSet>
</ReferenceClinVarAssertion>
- The variant position is extracted from the fields for their respective assemblies.
- Updated records contain positionVCF, referenceAlleleVCF and alternateAlleleVCF fields and when present, we use them to create the variant.
- For older records, since "start' and "stop" fields are not always available, we use the "display_start" and "display_end" fields.
- If a required allele is not available, we extract it from the reference sequence.
- Only variants having a dbSNP id are extracted.
- Note that a ClinVar accession may have multiple variants associated with it (possible in different locations)
- VariantId is extracted from the MeasureSet attributes.
- VariantType is extracted from the Measure attributes.
unsupported variant types
We currently don't support the following variant types:
- Microsatellite
- protein only
- fusion
- Complex
- Variation
- Translocation
MedGen, OMIM, Orphanet IDs
<ReferenceClinVarAssertion>
<TraitSet Type="Disease" ID="175">
<Trait ID="3036" Type="Disease">
<XRef ID="C0086651" DB="MedGen"/>
<XRef ID="309297" DB="Orphanet"/>
<XRef ID="582" DB="Orphanet"/>
<XRef Type="MIM" ID="253000" DB="OMIM"/>
</Trait>
</TraitSet>
</ReferenceClinVarAssertion>
AlleleOrigins
<ClinVarAssertion>
<Origin>germline</Origin>
</ClinVarAssertion>
We only extract all Allele Origins from Submissions (SCV) entries.
PubMedIds
<ClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<Citation Type="general">
<ID Source="PubMed">12114475</ID>
</Citation>
</ClinicalSignificance>
<AttributeSet>
<Attribute Type="AssertionMethod">LMM Criteria</Attribute>
<Citation>
<ID Source="PubMed">24033266</ID>
</Citation>
</AttributeSet>
<ObservedIn>
<ObservedData ID="9727445">
<Citation Type="general">
<ID Source="PubMed">9113933</ID>
</Citation>
</ObservedData>
</ObservedIn>
<Citation Type="general">
<ID Source="PubMed">23757202</ID>
</Citation>
</ClinVarAssertion>
We only extract all Pubmed Ids from Submissions (SCV) entries.
Parsing Significance
Extracting significance(s) may involve parsing multiple fields. Take the following snippets into consideration.
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>
<ClinicalSignificance DateLastEvaluated="2016-10-13">
<ReviewStatus>criteria provided, multiple submitters, no conflicts</ReviewStatus>
<Description>Pathogenic/Likely pathogenic</Description>
</ClinicalSignificance>
<ClinicalSignificance DateLastEvaluated="2012-06-07">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Conflicting interpretations of pathogenicity</Description>
<Explanation DataSource="ClinVar" Type="public">Pathogenic(1);Uncertain significance(1)</Explanation>
</ClinicalSignificance>
Given the evidence, we converted the significance field into an array of strings which may be parsed out of the Descriptions
or Explanation
fields.
Varying Delimiters
The delimiters in each field may vary. Currently, the delimiters for Description
are ,
and /
. The delimiters for Explanation
are ;
and /
.
VCV File
Example
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ClinVarVariationRelease xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://ftp.ncbi.nlm.nih.gov/pub/clinvar/xsd_public/clinvar_variation/variation_archive_1.4.xsd" ReleaseDate="2019-12-31">
<VariationArchive VariationID="431749" VariationName="GRCh37/hg19 1p36.31(chr1:6051187-6158763)" VariationType="copy number gain" DateCreated="2017-08-12" DateLastUpdated="2019-09-10" Accession="VCV000431749" Version="1" RecordType="included" NumberOfSubmissions="0" NumberOfSubmitters="0">
<RecordStatus>current</RecordStatus>
<Species>Homo sapiens</Species>
<IncludedRecord>
<SimpleAllele AlleleID="425239" VariationID="431749">
<GeneList>
<Gene Symbol="KCNAB2" FullName="potassium voltage-gated channel subfamily A regulatory beta subunit 2" GeneID="8514" HGNC_ID="HGNC:6229" Source="calculated" RelationshipType="genes overlapped by variant">
<Location>
<CytogeneticLocation>1p36.31</CytogeneticLocation>
<SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38" AssemblyStatus="current" Chr="1" Accession="NC_000001.11" start="5992639" stop="6101186" display_start="5992639" display_stop="6101186" Strand="+"/>
<SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25" AssemblyStatus="previous" Chr="1" Accession="NC_000001.10" start="6052357" stop="6161252" display_start="6052357" display_stop="6161252" Strand="+"/>
</Location>
<OMIM>601142</OMIM>
</Gene>
<Gene Symbol="NPHP4" FullName="nephrocystin 4" GeneID="261734" HGNC_ID="HGNC:19104" Source="calculated" RelationshipType="genes overlapped by variant">
<Location>
<CytogeneticLocation>1p36.31</CytogeneticLocation>
<SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38" AssemblyStatus="current" Chr="1" Accession="NC_000001.11" start="5862810" stop="5992425" display_start="5862810" display_stop="5992425" Strand="-"/>
<SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25" AssemblyStatus="previous" Chr="1" Accession="NC_000001.10" start="5922869" stop="6052532" display_start="5922869" display_stop="6052532" Strand="-"/>
</Location>
<OMIM>607215</OMIM>
</Gene>
</GeneList>
<Name>GRCh37/hg19 1p36.31(chr1:6051187-6158763)</Name>
<VariantType>copy number gain</VariantType>
<Location>
<CytogeneticLocation>1p36.31</CytogeneticLocation>
<SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25" forDisplay="true" AssemblyStatus="previous" Chr="1" Accession="NC_000001.10" start="6051187" stop="6158763" display_start="6051187" display_stop="6158763"/> </Location>
<Interpretations>
<Interpretation NumberOfSubmissions="0" NumberOfSubmitters="0" Type="Clinical significance">
<Description>no interpretation for the single variant</Description>
</Interpretation>
</Interpretations>
<XRefList>
<XRef Type="Interpreted" ID="431733" DB="ClinVar"/>
</XRefList>
</SimpleAllele>
<ReviewStatus>no interpretation for the single variant</ReviewStatus>
<Interpretations>
<Interpretation NumberOfSubmissions="0" NumberOfSubmitters="0" Type="Clinical significance">
<Description>no interpretation for the single variant</Description>
</Interpretation>
</Interpretations>
<SubmittedInterpretationList>
<SCV Title="SUB1895145" Accession="SCV000296057" Version="1"/>
</SubmittedInterpretationList>
<InterpretedVariationList>
<InterpretedVariation VariationID="431733" Accession="VCV000431733" Version="1"/>
</InterpretedVariationList>
</IncludedRecord>
</VariationArchive>
</ClinVarVariationRelease>
Parsing
In the following section, we discuss which field of the XML was used to extract information that is presented in the JSON output.
id
<VariationArchive VariationID="431749" VariationName="GRCh37/hg19 1p36.31(chr1:6051187-6158763)" VariationType="copy number gain" DateCreated="2017-08-12" DateLastUpdated="2019-09-10" Accession="VCV000431749" Version="1" RecordType="included" NumberOfSubmissions="0" NumberOfSubmitters="0">
The Acc and Version fields are merged to form the ID (RCV000000001.2)
significance
<ClinVarVariationRelease>
<VariationArchive>
<IncludedRecord>
<SimpleAllele>
<Interpretations>
<Interpretation NumberOfSubmissions="0" NumberOfSubmitters="0" Type="Clinical significance">
<Description>no interpretation for the single variant</Description>
</Interpretation>
</Interpretations>
</SimpleAllele>
</IncludedRecord>
</VariationArchive>
</ClinVarVariationRelease>
May have multiple significances listed.
reviewStatus
<ClinVarVariationRelease>
<VariationArchive>
<IncludedRecord>
<ReviewStatus>no interpretation for the single variant</ReviewStatus>
</IncludedRecord>
</VariationArchive>
</ClinVarVariationRelease>
Known Issues
Known Issues
- The XML file contains ~1k more entries (out of 162K) than the VCF file
- The XML file does not have a field indicating that a record is associated with the reference base - something that was present in VCF
- The XML file contains entries (e.g. RCV000016645 version=1) which have IUPAC ambiguous bases ("R", "Y", "H", etc.) as their alternate allele
Download URLs
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_00-latest.xml.gz
JSON Output
small variants:
"clinvar":[
{
"id":"VCV000036581.3",
"reviewStatus":"reviewed by expert panel",
"significance":[
"benign"
],
"refAllele":"G",
"altAllele":"A",
"lastUpdatedDate":"2020-03-01",
"isAlleleSpecific":true
},
{
"id":"RCV000030258.4",
"variationId":"VCV000036581.3",
"reviewStatus":"reviewed by expert panel",
"alleleOrigins":[
"germline"
],
"refAllele":"G",
"altAllele":"A",
"phenotypes":[
"Lynch syndrome"
],
"medGenIds":[
"C1333990"
],
"omimIds":[
"120435"
],
"significance":[
"benign"
],
"lastUpdatedDate":"2017-05-01",
"isAlleleSpecific":true
}
]
large variants:
"clinvar":[
{
"chromosome":"1",
"begin":629025,
"end":8537745,
"variantType":"copy_number_loss",
"id":"RCV000051993.4",
"variationId":"VCV000058242.1",
"reviewStatus":"criteria provided, single submitter",
"alleleOrigins":[
"not provided"
],
"phenotypes":[
"See cases"
],
"significance":[
"pathogenic"
],
"lastUpdatedDate":"2022-04-21",
"pubMedIds":[
"21844811"
]
},
{
"id":"VCV000058242.1",
"reviewStatus":"criteria provided, single submitter",
"significance":[
"pathogenic"
],
"lastUpdatedDate":"2022-04-21"
},
......
]
Field | Type | Notes |
---|---|---|
id | string | ClinVar ID |
variationId | string | ClinVar VCV ID |
variantType | string | variant type |
reviewStatus | string | see possible values below |
alleleOrigins | string array | see possible values below |
refAllele | string | |
altAllele | string | |
phenotypes | string array | |
medGenIds | string array | MedGen IDs |
omimIds | string array | OMIM IDs |
orphanetIds | string array | Orphanet IDs |
significance | string array | see possible values below |
lastUpdatedDate | string | yyyy-MM-dd |
pubMedIds | string array | PubMed IDs |
isAlleleSpecific | bool | true when the current variant alternate allele matches the ClinVar alternate allele |
reviewStatus:
- no assertion provided
- no assertion criteria provided
- criteria provided, single submitter
- practice guideline
- classified by multiple submitters
- criteria provided, conflicting interpretations
- criteria provided, multiple submitters, no conflicts
- no interpretation for the single variant
alleleOrigins:
- unknown
- other
- germline
- somatic
- inherited
- paternal
- maternal
- de-novo
- biparental
- uniparental
- not-tested
- tested-inconclusive
significance:
- uncertain significance
- not provided
- benign
- likely benign
- likely pathogenic
- pathogenic
- drug response
- histocompatibility
- association
- risk factor
- protective
- affects
- conflicting data from submitters
- other
- no interpretation for the single variant
- conflicting interpretations of pathogenicity
Building the supplementary files
The ClinVar .nsa
and .nsi
for Illumina Connected Annotations can be built using the SAUtils
command's clinvar
subcommand.
Source data files
Two input .xml
files and a .version
file are required in order to build the .nsa
and .nsi
file. You should have the following files:
ClinVarFullRelease_00-latest.xml.gz ClinVarVariationRelease_00-latest.xml.gz
ClinVarFullRelease_00-latest.xml.gz.version
The version file is a text file with the follwoing format.
NAME=ClinVar
VERSION=20220505
DATE=2022-05-05
DESCRIPTION=A freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence
The help menu for the utility is as follows:
dotnet SAUtils.dll clinvar
---------------------------------------------------------------------------
SAUtils (c) 2022 Illumina, Inc.
Stromberg, Roy, Platzer, Siddiqui, Ouyang, et al 3.18.1
---------------------------------------------------------------------------
USAGE: dotnet SAUtils.dll clinvar [options]
Creates a supplementary database with ClinVar annotations
OPTIONS:
--ref, -r <VALUE> compressed reference sequence file
--rcv, -i <VALUE> ClinVar Full release XML file
--vcv, -c <VALUE> ClinVar Variation release XML file
--out, -o <VALUE> output directory
--help, -h displays the help menu
--version, -v displays the version
dotnet SAUtils.dll clinvar
Here is a sample execution:
dotnet SAUtils.dll clinvar \\
--ref ~/development/References/7/Homo_sapiens.GRCh38.Nirvana.dat --rcv ClinVarFullRelease_00-latest.xml.gz \\
--vcv ClinVarVariationRelease_00-latest.xml.gz --out ~/development/SupplementaryDatabase/63/GRCh38
---------------------------------------------------------------------------
SAUtils (c) 2022 Illumina, Inc.
Stromberg, Roy, Platzer, Siddiqui, Ouyang, et al 3.18.1
---------------------------------------------------------------------------
Found 1535677 VCV records
Unknown vcv id:225946 found in RCV000211201.2
Unknown vcv id:225946 found in RCV000211253.2
Unknown vcv id:225946 found in RCV000211375.2
Unknown vcv id:976117 found in RCV001253316.1
Unknown vcv id:1321016 found in RCV001776995.2
3 unknown VCVs found in RCVs.
225946,976117,1321016
0 unknown VCVs found in RCVs.
Chromosome 1 completed in 00:00:15.1
Chromosome 2 completed in 00:00:20.0
Chromosome 3 completed in 00:00:09.7
Chromosome 4 completed in 00:00:05.9
Chromosome 5 completed in 00:00:09.8
Chromosome 6 completed in 00:00:08.3
Chromosome 7 completed in 00:00:08.7
Chromosome 8 completed in 00:00:06.2
Chromosome 9 completed in 00:00:08.6
Chromosome 10 completed in 00:00:07.0
Chromosome 11 completed in 00:00:11.7
Chromosome 12 completed in 00:00:08.0
Chromosome 13 completed in 00:00:06.3
Chromosome 14 completed in 00:00:06.0
Chromosome 15 completed in 00:00:06.6
Chromosome 16 completed in 00:00:10.8
Chromosome 17 completed in 00:00:13.8
Chromosome 18 completed in 00:00:02.9
Chromosome 19 completed in 00:00:08.7
Chromosome 20 completed in 00:00:03.6
Chromosome 21 completed in 00:00:02.4
Chromosome 22 completed in 00:00:03.6
Chromosome MT completed in 00:00:00.2
Chromosome X completed in 00:00:07.5
Chromosome Y completed in 00:00:00.0
Maximum bp shifted for any variant:2
Writing 37097 intervals to database...
Time: 00:13:26.9