Skip to main content
Version: 3.2.5

ClinVar

Overview

ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation.

Publication

Melissa J Landrum, Jennifer M Lee, Mark Benson, Garth R Brown, Chen Chao, Shanmuga Chitipiralla, Baoshan Gu, Jennifer Hart, Douglas Hoffman, Wonhee Jang, Karen Karapetyan, Kenneth Katz, Chunlei Liu, Zenith Maddipatla, Adriana Malheiro, Kurt McDaniel, Michael Ovetsky, George Riley, George Zhou, J Bradley Holmes, Brandi L Kattman, Donna R Maglott, ClinVar: improving access to variant interpretations and supporting evidence, Nucleic Acids Research, 46, Issue D1, 4 January 2018, Pages D1062–D1067, https://doi.org/10.1093/nar/gkx1153

RCV File

Example

Here's a full RCV entry.

Parsing

In the following section, we discuss which field of the XML was used to extract information that is presented in the JSON output.

ID

<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinVarAccession Acc="RCV000000001" Version="2">
</ClinVarSet>

The Acc and Version fields are merged to form the ID (RCV000000001.2)

LastUpdatedDate

<ClinVarSet>
<ReferenceClinVarAssertion DateCreated="2012-08-13" DateLastUpdated="2016-02-17" ID="57604" >
</ClinVarSet>

Significance

<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>
</ClinVarSet>

ReviewStatus

<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>
</ClinVarSet>

Phenotypes

<ReferenceClinVarAssertion>
<TraitSet Type="Disease" ID="62">
<Trait Type="Disease">
<Name>
<ElementValue Type="Preferred">Joubert syndrome 9</ElementValue>
</Name>
</Trait>
</TraitSet>
</ReferenceClinVarAssertion>

We only use the field with Type="Preferred". Multiple phenotypes may be reported

Location and Variant Id

<ReferenceClinVarAssertion>
<GenotypeSet Type="CompoundHeterozygote" ID="424709">
<MeasureSet Type="Variant" ID="81">
<Measure Type="single nucleotide variant" ID="15120">
<SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38"
AssemblyStatus="current" Chr="10" Accession="NC_000010.11" start="89222510"
stop="89222510" display_start="89222510" display_stop="89222510" variantLength="1"
positionVCF="89222510" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
<SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25"
AssemblyStatus="previous" Chr="10" Accession="NC_000010.10" start="90982267"
stop="90982267" display_start="90982267" display_stop="90982267" variantLength="1"
positionVCF="90982267" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
</Measure>
</MeasureSet>
</GenotypeSet>
</ReferenceClinVarAssertion>
  • The variant position is extracted from the fields for their respective assemblies.
  • Updated records contain positionVCF, referenceAlleleVCF and alternateAlleleVCF fields and when present, we use them to create the variant.
  • For older records, since "start' and "stop" fields are not always available, we use the "display_start" and "display_end" fields.
  • If a required allele is not available, we extract it from the reference sequence.
  • Only variants having a dbSNP id are extracted.
  • Note that a ClinVar accession may have multiple variants associated with it (possible in different locations)
  • VariantId is extracted from the MeasureSet attributes.

MedGen, OMIM, Orphanet IDs

<ReferenceClinVarAssertion>
<TraitSet Type="Disease" ID="175">
<Trait ID="3036" Type="Disease">
<XRef ID="C0086651" DB="MedGen"/>
<XRef ID="309297" DB="Orphanet"/>
<XRef ID="582" DB="Orphanet"/>
<XRef Type="MIM" ID="253000" DB="OMIM"/>
</Trait>
</TraitSet>
</ReferenceClinVarAssertion>

AlleleOrigins

<ClinVarAssertion>
<Origin>germline</Origin>
</ClinVarAssertion>

We only extract all Allele Origins from Submissions (SCV) entries.

PubMedIds

<ClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<Citation Type="general">
<ID Source="PubMed">12114475</ID>
</Citation>
</ClinicalSignificance>
<AttributeSet>
<Attribute Type="AssertionMethod">LMM Criteria</Attribute>
<Citation>
<ID Source="PubMed">24033266</ID>
</Citation>
</AttributeSet>
<ObservedIn>
<ObservedData ID="9727445">
<Citation Type="general">
<ID Source="PubMed">9113933</ID>
</Citation>
</ObservedData>
</ObservedIn>
<Citation Type="general">
<ID Source="PubMed">23757202</ID>
</Citation>
</ClinVarAssertion>

We only extract all Pubmed Ids from Submissions (SCV) entries.

Parsing Significance

Extracting significance(s) may involve parsing multiple fields. Take the following snippets into consideration.

<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>

<ClinicalSignificance DateLastEvaluated="2016-10-13">
<ReviewStatus>criteria provided, multiple submitters, no conflicts</ReviewStatus>
<Description>Pathogenic/Likely pathogenic</Description>
</ClinicalSignificance>

<ClinicalSignificance DateLastEvaluated="2012-06-07">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Conflicting interpretations of pathogenicity</Description>
<Explanation DataSource="ClinVar" Type="public">Pathogenic(1);Uncertain significance(1)</Explanation>
</ClinicalSignificance>

Given the evidence, we converted the significance field into an array of strings which may be parsed out of the Descriptions or Explanation fields.

Varying Delimiters

The delimiters in each field may vary. Currently, the delimiters for Description are , and /. The delimiters for Explanation are ; and /.

Known Issues

Known Issues
  • The XML file contains ~1k more entries (out of 162K) than the VCF file
  • The XML file does not have a field indicating that a record is associated with the reference base - something that was present in VCF
  • The XML file contains entries (e.g. RCV000016645 version=1) which have IUPAC ambiguous bases ("R", "Y", "H", etc.) as their alternate allele

Download URL

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_00-latest.xml.gz

JSON Output

"clinvar":[
{
"id":"RCV000030258.4",
"reviewStatus":"reviewed by expert panel",
"alleleOrigins":[
"germline"
],
"refAllele":"G",
"altAllele":"A",
"phenotypes":[
"Lynch syndrome"
],
"medGenIds":[
"C1333990"
],
"omimIds":[
"120435"
],
"significance":[
"benign"
],
"lastUpdatedDate":"2017-05-01",
"isAlleleSpecific":true
}
]
FieldTypeNotes
idstringClinVar ID
reviewStatusstringsee possible values below
alleleOriginsstring arraysee possible values below
refAllelestring
altAllelestring
phenotypesstring array
medGenIdsstring arrayMedGen IDs
omimIdsstring arrayOMIM IDs
orphanetIdsstring arrayOrphanet IDs
significancestring arraysee possible values below
lastUpdatedDatestringyyyy-MM-dd
pubMedIdsstring arrayPubMed IDs
isAlleleSpecificbooltrue when the current variant alternate allele matches the ClinVar alternate allele

reviewStatus:

  • no assertion provided
  • no assertion criteria provided
  • criteria provided, single submitter
  • practice guideline
  • classified by multiple submitters
  • criteria provided, conflicting interpretations
  • criteria provided, multiple submitters, no conflicts
  • no interpretation for the single variant

alleleOrigins:

  • unknown
  • other
  • germline
  • somatic
  • inherited
  • paternal
  • maternal
  • de-novo
  • biparental
  • uniparental
  • not-tested
  • tested-inconclusive

significance:

  • uncertain significance
  • not provided
  • benign
  • likely benign
  • likely pathogenic
  • pathogenic
  • drug response
  • histocompatibility
  • association
  • risk factor
  • protective
  • affects
  • conflicting data from submitters
  • other
  • no interpretation for the single variant
  • conflicting interpretations of pathogenicity