Version: 3.23

PrimateAI-3D

Overview

PrimateAI is a deep residual neural network for classifying the pathogenicity of missense mutations.

The newer version, PrimateAI-3D, uses a 3D convolutional neural network to predict the pathogenicity of protein variants from structural information. The model's use of primate sequencing and structural data supports variant interpretation and disease gene identification. The predicted score ranges from 0 to 1, where 0 is the most benign and 1 is the most pathogenic.

For more details, refer to these publications:

Publication
  1. Gao, H. et al. The landscape of tolerated genetic variation in humans and primates. Science 380, eabn8153 (2023). https://doi.org/10.1126/science.abn8197
  2. Sundaram, L., Gao, H., Padigepati, S.R. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 50, 1161–1170 (2018). https://doi.org/10.1038/s41588-018-0167-z
Professional data source

This is a Professional data source and is not freely available. Please contact annotation_support@illumina.com if you would like to obtain it.

Parsing

TSV File

chr pos non_flipped_ref non_flipped_alt gene_name change_position_1based ref_aa alt_aa score_PAI3D percentile_PAI3D refseq prediction
chr1 69094 G A ENST00000335137.4 2 V M 0.6169436463713646 0.5200308441794135 NM_001005484.1 pathogenic
chr1 69094 G C ENST00000335137.4 2 V L 0.5557043975591658 0.4271457250214688 NM_001005484.1 benign
chr1 69094 G T ENST00000335137.4 2 V L 0.5557043975591658 0.4271457391722522 NM_001005484.1 benign
chr1 69095 T A ENST00000335137.4 2 V E 0.8063537482917307 0.8032228720356267 NM_001005484.1 pathogenic
chr1 69095 T C ENST00000335137.4 2 V A 0.5795628190040587 0.4631329075815453 NM_001005484.1 benign
chr1 69095 T G ENST00000335137.4 2 V G 0.7922330142557621 0.7834049546930125 NM_001005484.1 pathogenic

From the TSV file, all columns are parsed:

  • chr
  • pos
  • non_flipped_ref
  • non_flipped_alt
  • gene_name
  • change_position_1based
  • ref_aa
  • alt_aa
  • score_PAI3D
  • percentile_PAI3D
  • refseq
  • prediction

The gene_name and refseq fields hold the Ensembl and RefSeq transcript IDs, respectively. These transcript IDs are passed through as-is, and some of them may be unrecognized or deprecated by Ensembl or RefSeq.
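As an illustration of the column layout above, the following minimal Python sketch reads rows into dictionaries. It is not the parser used by SAUtils. It assumes the file is tab-delimited with the header as its first line; the function name `parse_primate_ai_3d` and the in-memory sample are hypothetical and exist only for this example.

```python
import csv
import io

def parse_primate_ai_3d(handle):
    """Yield one dict per data row, keyed by the header names.

    Assumes a tab-delimited file whose first line is the header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Convert the numeric columns; all other columns stay as strings.
        row["pos"] = int(row["pos"])
        row["change_position_1based"] = int(row["change_position_1based"])
        row["score_PAI3D"] = float(row["score_PAI3D"])
        row["percentile_PAI3D"] = float(row["percentile_PAI3D"])
        yield row

# Header and two rows copied from the sample, joined with tabs.
header = ["chr", "pos", "non_flipped_ref", "non_flipped_alt", "gene_name",
          "change_position_1based", "ref_aa", "alt_aa",
          "score_PAI3D", "percentile_PAI3D", "refseq", "prediction"]
rows = [
    ["chr1", "69094", "G", "A", "ENST00000335137.4", "2", "V", "M",
     "0.6169436463713646", "0.5200308441794135", "NM_001005484.1", "pathogenic"],
    ["chr1", "69094", "G", "C", "ENST00000335137.4", "2", "V", "L",
     "0.5557043975591658", "0.4271457250214688", "NM_001005484.1", "benign"],
]
sample = "\n".join("\t".join(fields) for fields in [header] + rows) + "\n"

records = list(parse_primate_ai_3d(io.StringIO(sample)))
# gene_name carries the Ensembl transcript ID, refseq the RefSeq one.
print(records[0]["gene_name"], records[0]["refseq"])
```

For a real file, the same function could be given a handle from `gzip.open(path, "rt")`.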

GRCh37

For GRCh37, a lifted-over file is provided. This file is not sorted, so it must be sorted before it is used. Also note that certain RefSeq transcripts appear not to have been mapped during the lift-over process.

Pre-processing

Sorting

gzcat PrimateAI-3D.hg19.txt.gz | sort -t $'\t'  -k1,1 -k2,2n | gzip > PrimateAI-3D.hg19_sorted.tsv.gz
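To confirm the result before running SA generation, the ordering can be checked with a short script. The sketch below is one possible check, not part of the toolchain: it tests that rows are ordered by the first column as a string and then by the second column as a number, mirroring the `-k1,1 -k2,2n` keys. It assumes byte-order string comparison (as with `LC_ALL=C`), so results can differ from `sort` under other locales. The function name `is_sorted_by_chr_pos` and the `has_header` flag are hypothetical.

```python
def is_sorted_by_chr_pos(lines, has_header=True):
    """Return True if rows are ordered by column 1 (string), then column 2 (numeric)."""
    previous = None
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue  # skip the header line, if the file has one
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], int(fields[1]))
        if previous is not None and key < previous:
            return False
        previous = key
    return True

# Small in-memory examples; a real file could be read with gzip.open(path, "rt").
ordered = ["chr\tpos\n", "chr1\t69094\n", "chr1\t69095\n", "chr2\t41612\n"]
shuffled = ["chr\tpos\n", "chr1\t69095\n", "chr1\t69094\n"]
print(is_sorted_by_chr_pos(ordered), is_sorted_by_chr_pos(shuffled))
```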

SA Generation

dotnet SAUtils.dll \
PrimateAi \
--r "${References}/Homo_sapiens.GRCh38.Nirvana.dat" \
--i "${ExternalDataSources}/PrimateAI/3D/PrimateAI-3D.hg38.txt.gz" \
--o "${SaUtilsOutput}"

Known Issues

Some transcript IDs defined in the data file are obsolete, retired, or updated. They are not removed or modified by Illumina Connected Annotations, and are passed as-is from the PrimateAI-3D data source.

Example:

The ENST00000643905.1 transcript is retired according to Ensembl.

The NM_182838.2 transcript has been removed because it is a pseudogene according to RefSeq.

Download URL

https://primad.basespace.illumina.com/

JSON Output

"primateAI-3D": [
  {
    "aminoAcidPosition": 2,
    "refAminoAcid": "V",
    "altAminoAcid": "M",
    "score": 0.616944,
    "scorePercentile": 0.52,
    "classification": "pathogenic",
    "ensemblTranscriptId": "ENST00000335137.4",
    "refSeqTranscriptId": "NM_001005484.1"
  }
]

Field                 Type    Notes
aminoAcidPosition     int     Amino acid position (1-based)
refAminoAcid          string  Reference amino acid
altAminoAcid          string  Alternate amino acid
ensemblTranscriptId   string  Transcript ID (Ensembl)
refSeqTranscriptId    string  Transcript ID (RefSeq)
scorePercentile       float   Range: 0 - 1.0
score                 float   Range: 0 - 1.0
classification        string  pathogenic or benign classification
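
To show how the TSV columns appear to correspond to the JSON fields, here is a minimal Python sketch. The column-to-field mapping and the rounding (six decimals for the score, two for the percentile) are inferred from the single example on this page and are assumptions, not the documented behavior of Illumina Connected Annotations. The function name `to_json_entry` is hypothetical.

```python
import json

def to_json_entry(row):
    """Map one parsed TSV row (a dict of strings) to the JSON field names."""
    return {
        "aminoAcidPosition": int(row["change_position_1based"]),
        "refAminoAcid": row["ref_aa"],
        "altAminoAcid": row["alt_aa"],
        # Rounding inferred from the example output; treat it as an assumption.
        "score": round(float(row["score_PAI3D"]), 6),
        "scorePercentile": round(float(row["percentile_PAI3D"]), 2),
        "classification": row["prediction"],
        "ensemblTranscriptId": row["gene_name"],
        "refSeqTranscriptId": row["refseq"],
    }

# First data row of the TSV sample from the Parsing section.
row = {
    "chr": "chr1", "pos": "69094",
    "non_flipped_ref": "G", "non_flipped_alt": "A",
    "gene_name": "ENST00000335137.4", "change_position_1based": "2",
    "ref_aa": "V", "alt_aa": "M",
    "score_PAI3D": "0.6169436463713646",
    "percentile_PAI3D": "0.5200308441794135",
    "refseq": "NM_001005484.1", "prediction": "pathogenic",
}
entry = to_json_entry(row)
print(json.dumps({"primateAI-3D": [entry]}, indent=2))
```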