Skip to main content
Version: 3.24 (unreleased)

Primate AI-3D

Overview

Primate AI is a deep residual neural network for classifying the pathogenicity of missense mutations.

The newer version, PrimateAI-3D, uses a 3D convolutional neural network, to predict protein variant pathogenicity using structural information. The model's innovative use of primate sequencing and structural data offers promising insights into variant interpretation and disease gene identification. The predictive score range between 0 and 1, with 0 being benign and 1 being most pathogenic.

For more details, refer to these publications:

Publication
  1. Hong Gao et al. ,The landscape of tolerated genetic variation in humans and primates. Science 380, eabn8153 (2023). https://doi.org/10.1126/science.abn8197
  2. Sundaram, L., Gao, H., Padigepati, S.R. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 50, 1161–1170 (2018). https://doi.org/10.1038/s41588-018-0167-z
Professional data source

This is a Professional data source and is not available freely. Please contact annotation_support@illumina.com if you would like to obtain it.

Parsing

TSV File

chr pos non_flipped_ref non_flipped_alt gene_name   change_position_1based  ref_aa  alt_aa  score_PAI3D percentile_PAI3D     refseq prediction   per_gene_percentile_PAI3D  hgnc
chr1 69094 G A ENST00000335137.4 2 V M 0.6169436463713646 0.5200308441794135 NM_001005484.1 pathogenic 0.699207135777998 OR4F5
chr1 69094 G C ENST00000335137.4 2 V L 0.5557043975591658 0.4271457250214688 NM_001005484.1 benign 0.6053022794846382 OR4F5
chr1 69094 G T ENST00000335137.4 2 V L 0.5557043975591658 0.4271457391722522 NM_001005484.1 benign 0.6053022794846382 OR4F5
chr1 69095 T A ENST00000335137.4 2 V E 0.8063537482917307 0.8032228720356267 NM_001005484.1 pathogenic 0.9202180376610506 OR4F5
chr1 69095 T C ENST00000335137.4 2 V A 0.5795628190040587 0.4631329075815453 NM_001005484.1 benign 0.6442021803766105 OR4F5
chr1 69095 T G ENST00000335137.4 2 V G 0.7922330142557621 0.7834049546930125 NM_001005484.1 pathogenic 0.900396432111001 OR4F5

From the file, all columns are parsed:

  • chr
  • pos
  • non_flipped_ref
  • non_flipped_alt
  • gene_name
  • change_position_1based
  • ref_aa
  • alt_aa
  • score_PAI3D
  • percentile_PAI3D
  • refseq
  • prediction
  • per_gene_percentile_PAI3D
  • hgnc

The fields gene_name and refseq define the Ensembl and RefSeq transcript IDs respectively. These transcripts are passed as-is and some of them might be unrecognized/deprecated by RefSeq/Ensembl.

GRCh37

Note that for GRCh37, a lifted over file is provided. The file is not sorted, therefore it must first be sorted. Also note that certain RefSeq transcripts appear not to have been mapped during the lift-over process.

Pre-processing

Sorting

gzcat PrimateAI-3D.hg19.txt.gz | sort -t $'\t'  -k1,1 -k2,2n | gzip > PrimateAI-3D.hg19_sorted.tsv.gz

SA Generation

dotnet SAUtils.dll \
PrimateAi \
--r "${References}/Homo_sapiens.GRCh38.Nirvana.dat" \
--i "${ExternalDataSources}/PrimateAI/3D/PrimateAI-3D.hg38.txt.gz" \
--o "${SaUtilsOutput]"

Known Issues

Known Issues

Some transcript IDs defined in the data file are obsolete, retired, or updated. They are not removed or modified by Illumina Connected Annotations, and are passed as-is from the PrimateAI-3D data source.

Example:

ENST00000643905.1 transcript is retired according to Ensembl

NM_182838.2 transcript is removed because it is a pseudo-gene according to RefSeq

Download URL

https://primad.basespace.illumina.com/

JSON Output

"primateAI-3D": [
{
"aminoAcidPosition": 2,
"refAminoAcid": "V",
"altAminoAcid": "M",
"score": 0.616944,
"scorePercentile": 0.52,
"genePercentile": 0.7,
"classification": "pathogenic",
"ensemblTranscriptId": "ENST00000335137.4",
"refSeqTranscriptId": "NM_001005484.1",
"geneSymbol":"OR4F5"
}
]
FieldTypeNotes
aminoAcidPositionintAmino Acid Position (1-based)
refAminoAcidstringReference Amino Acid
altAminoAcidstringAlternate Amino Acid
ensemblTranscriptIdstringTranscript ID (Ensembl)
refSeqTranscriptIdstringTranscript ID (RefSeq)
scorePercentilefloatrange: 0 - 1.0
genePercentilefloatrange: 0 - 1.0
scorefloatrange: 0 - 1.0
classificationstringpathogenic or benign classification
geneSymbolstringHGNC gene symbol