ClinVar
Overview
ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation.
Publication
Melissa J Landrum, Jennifer M Lee, Mark Benson, Garth R Brown, Chen Chao, Shanmuga Chitipiralla, Baoshan Gu, Jennifer Hart, Douglas Hoffman, Wonhee Jang, Karen Karapetyan, Kenneth Katz, Chunlei Liu, Zenith Maddipatla, Adriana Malheiro, Kurt McDaniel, Michael Ovetsky, George Riley, George Zhou, J Bradley Holmes, Brandi L Kattman, Donna R Maglott, ClinVar: improving access to variant interpretations and supporting evidence, Nucleic Acids Research, 46, Issue D1, 4 January 2018, Pages D1062–D1067, https://doi.org/10.1093/nar/gkx1153
RCV File
Example
Here's a full RCV entry.
Parsing
In the following section, we discuss which field of the XML was used to extract information that is presented in the JSON output.
ID
<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinVarAccession Acc="RCV000000001" Version="2">
</ClinVarSet>
The Acc and Version fields are merged to form the ID (RCV000000001.2)
LastUpdatedDate
<ClinVarSet>
<ReferenceClinVarAssertion DateCreated="2012-08-13" DateLastUpdated="2016-02-17" ID="57604" >
</ClinVarSet>
Significance
<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>
</ClinVarSet>
ReviewStatus
<ClinVarSet>
<ReferenceClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>
</ClinVarSet>
Phenotypes
<ReferenceClinVarAssertion>
<TraitSet Type="Disease" ID="62">
<Trait Type="Disease">
<Name>
<ElementValue Type="Preferred">Joubert syndrome 9</ElementValue>
</Name>
</Trait>
</TraitSet>
</ReferenceClinVarAssertion>
We only use the field with Type="Preferred". Multiple phenotypes may be reported
Location and Variant Id
<ReferenceClinVarAssertion>
<GenotypeSet Type="CompoundHeterozygote" ID="424709">
<MeasureSet Type="Variant" ID="81">
<Measure Type="single nucleotide variant" ID="15120">
<SequenceLocation Assembly="GRCh38" AssemblyAccessionVersion="GCF_000001405.38"
AssemblyStatus="current" Chr="10" Accession="NC_000010.11" start="89222510"
stop="89222510" display_start="89222510" display_stop="89222510" variantLength="1"
positionVCF="89222510" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
<SequenceLocation Assembly="GRCh37" AssemblyAccessionVersion="GCF_000001405.25"
AssemblyStatus="previous" Chr="10" Accession="NC_000010.10" start="90982267"
stop="90982267" display_start="90982267" display_stop="90982267" variantLength="1"
positionVCF="90982267" referenceAlleleVCF="C" alternateAlleleVCF="T"/>
</Measure>
</MeasureSet>
</GenotypeSet>
</ReferenceClinVarAssertion>
- The variant position is extracted from the fields for their respective assemblies.
- Updated records contain positionVCF, referenceAlleleVCF and alternateAlleleVCF fields and when present, we use them to create the variant.
- For older records, since "start' and "stop" fields are not always available, we use the "display_start" and "display_end" fields.
- If a required allele is not available, we extract it from the reference sequence.
- Only variants having a dbSNP id are extracted.
- Note that a ClinVar accession may have multiple variants associated with it (possible in different locations)
- VariantId is extracted from the MeasureSet attributes.
MedGen, OMIM, Orphanet IDs
<ReferenceClinVarAssertion>
<TraitSet Type="Disease" ID="175">
<Trait ID="3036" Type="Disease">
<XRef ID="C0086651" DB="MedGen"/>
<XRef ID="309297" DB="Orphanet"/>
<XRef ID="582" DB="Orphanet"/>
<XRef Type="MIM" ID="253000" DB="OMIM"/>
</Trait>
</TraitSet>
</ReferenceClinVarAssertion>
AlleleOrigins
<ClinVarAssertion>
<Origin>germline</Origin>
</ClinVarAssertion>
We only extract all Allele Origins from Submissions (SCV) entries.
PubMedIds
<ClinVarAssertion>
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<Citation Type="general">
<ID Source="PubMed">12114475</ID>
</Citation>
</ClinicalSignificance>
<AttributeSet>
<Attribute Type="AssertionMethod">LMM Criteria</Attribute>
<Citation>
<ID Source="PubMed">24033266</ID>
</Citation>
</AttributeSet>
<ObservedIn>
<ObservedData ID="9727445">
<Citation Type="general">
<ID Source="PubMed">9113933</ID>
</Citation>
</ObservedData>
</ObservedIn>
<Citation Type="general">
<ID Source="PubMed">23757202</ID>
</Citation>
</ClinVarAssertion>
We only extract all Pubmed Ids from Submissions (SCV) entries.
Parsing Significance
Extracting significance(s) may involve parsing multiple fields. Take the following snippets into consideration.
<ClinicalSignificance DateLastEvaluated="1996-04-01">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Pathogenic</Description>
</ClinicalSignificance>
<ClinicalSignificance DateLastEvaluated="2016-10-13">
<ReviewStatus>criteria provided, multiple submitters, no conflicts</ReviewStatus>
<Description>Pathogenic/Likely pathogenic</Description>
</ClinicalSignificance>
<ClinicalSignificance DateLastEvaluated="2012-06-07">
<ReviewStatus>no assertion criteria provided</ReviewStatus>
<Description>Conflicting interpretations of pathogenicity</Description>
<Explanation DataSource="ClinVar" Type="public">Pathogenic(1);Uncertain significance(1)</Explanation>
</ClinicalSignificance>
Given the evidence, we converted the significance field into an array of strings which may be parsed out of the Descriptions
or Explanation
fields.
Varying Delimiters
The delimiters in each field may vary. Currently, the delimiters for Description
are ,
and /
. The delimiters for Explanation
are ;
and /
.
Known Issues
Known Issues
- The XML file contains ~1k more entries (out of 162K) than the VCF file
- The XML file does not have a field indicating that a record is associated with the reference base - something that was present in VCF
- The XML file contains entries (e.g. RCV000016645 version=1) which have IUPAC ambiguous bases ("R", "Y", "H", etc.) as their alternate allele
Download URL
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_00-latest.xml.gz
JSON Output
"clinvar":[
{
"id":"RCV000030258.4",
"reviewStatus":"reviewed by expert panel",
"alleleOrigins":[
"germline"
],
"refAllele":"G",
"altAllele":"A",
"phenotypes":[
"Lynch syndrome"
],
"medGenIds":[
"C1333990"
],
"omimIds":[
"120435"
],
"significance":[
"benign"
],
"lastUpdatedDate":"2017-05-01",
"isAlleleSpecific":true
}
]
Field | Type | Notes |
---|---|---|
id | string | ClinVar ID |
reviewStatus | string | see possible values below |
alleleOrigins | string array | see possible values below |
refAllele | string | |
altAllele | string | |
phenotypes | string array | |
medGenIds | string array | MedGen IDs |
omimIds | string array | OMIM IDs |
orphanetIds | string array | Orphanet IDs |
significance | string array | see possible values below |
lastUpdatedDate | string | yyyy-MM-dd |
pubMedIds | string array | PubMed IDs |
isAlleleSpecific | bool | true when the current variant alternate allele matches the ClinVar alternate allele |
reviewStatus:
- no assertion provided
- no assertion criteria provided
- criteria provided, single submitter
- practice guideline
- classified by multiple submitters
- criteria provided, conflicting interpretations
- criteria provided, multiple submitters, no conflicts
- no interpretation for the single variant
alleleOrigins:
- unknown
- other
- germline
- somatic
- inherited
- paternal
- maternal
- de-novo
- biparental
- uniparental
- not-tested
- tested-inconclusive
significance:
- uncertain significance
- not provided
- benign
- likely benign
- likely pathogenic
- pathogenic
- drug response
- histocompatibility
- association
- risk factor
- protective
- affects
- conflicting data from submitters
- other
- no interpretation for the single variant
- conflicting interpretations of pathogenicity