Primate AI-3D
Overview
Primate AI is a deep residual neural network for classifying the pathogenicity of missense mutations.
The newer version, PrimateAI-3D, uses a 3D convolutional neural network, to predict protein variant pathogenicity using structural information. The model's innovative use of primate sequencing and structural data offers promising insights into variant interpretation and disease gene identification. The predictive score range between 0 and 1, with 0 being benign and 1 being most pathogenic.
For more details, refer to these publications:
Publication
- Hong Gao et al. ,The landscape of tolerated genetic variation in humans and primates. Science 380, eabn8153 (2023). https://doi.org/10.1126/science.abn8197
- Sundaram, L., Gao, H., Padigepati, S.R. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 50, 1161–1170 (2018). https://doi.org/10.1038/s41588-018-0167-z
Professional data source
This is a Professional data source and is not available freely. Please contact annotation_support@illumina.com if you would like to obtain it.
Parsing
TSV File
chr pos non_flipped_ref non_flipped_alt gene_name change_position_1based ref_aa alt_aa score_PAI3D percentile_PAI3D refseq prediction
chr1 69094 G A ENST00000335137.4 2 V M 0.6169436463713646 0.5200308441794135 NM_001005484.1 pathogenic
chr1 69094 G C ENST00000335137.4 2 V L 0.5557043975591658 0.4271457250214688 NM_001005484.1 benign
chr1 69094 G T ENST00000335137.4 2 V L 0.5557043975591658 0.4271457391722522 NM_001005484.1 benign
chr1 69095 T A ENST00000335137.4 2 V E 0.8063537482917307 0.8032228720356267 NM_001005484.1 pathogenic
chr1 69095 T C ENST00000335137.4 2 V A 0.5795628190040587 0.4631329075815453 NM_001005484.1 benign
chr1 69095 T G ENST00000335137.4 2 V G 0.7922330142557621 0.7834049546930125 NM_001005484.1 pathogenic
From the CSV file, all columns are parsed:
chr
pos
non_flipped_ref
non_flipped_alt
gene_name
change_position_1based
ref_aa
alt_aa
score_PAI3D
percentile_PAI3D
refseq
prediction
The fields gene_name
and refseq
define the Ensembl and RefSeq transcript IDs respectively.
These transcripts are passed as-is and some of them might be unrecognized/deprecated by RefSeq/Ensembl.
GRCh37
Note that for GRCh37, a lifted over file is provided. The file is not sorted, therefore it must first be sorted. Also note that certain RefSeq transcripts appear not to have been mapped during the lift-over process.
Pre-processing
Sorting
gzcat PrimateAI-3D.hg19.txt.gz | sort -t $'\t' -k1,1 -k2,2n | gzip > PrimateAI-3D.hg19_sorted.tsv.gz
SA Generation
dotnet SAUtils.dll \
PrimateAi \
--r "${References}/Homo_sapiens.GRCh38.Nirvana.dat" \
--i "${ExternalDataSources}/PrimateAI/3D/PrimateAI-3D.hg38.txt.gz" \
--o "${SaUtilsOutput]"
Known Issues
Known Issues
Some transcript IDs defined in the data file are obsolete, retired, or updated. They are not removed or modified by Illumina Connected Annotations, and are passed as-is from the PrimateAI-3D data source.
Example:
ENST00000643905.1 transcript is retired according to Ensembl
NM_182838.2 transcript is removed because it is a pseudo-gene according to RefSeq
Download URL
https://primad.basespace.illumina.com/
JSON Output
"primateAI-3D": [
{
"aminoAcidPosition": 2,
"refAminoAcid": "V",
"altAminoAcid": "M",
"score": 0.616944,
"scorePercentile": 0.52,
"classification": "pathogenic",
"ensemblTranscriptId": "ENST00000335137.4",
"refSeqTranscriptId": "NM_001005484.1"
}
]
Field | Type | Notes |
---|---|---|
aminoAcidPosition | int | Amino Acid Position (1-based) |
refAminoAcid | string | Reference Amino Acid |
altAminoAcid | string | Alternate Amino Acid |
ensemblTranscriptId | string | Transcript ID (Ensembl) |
refSeqTranscriptId | string | Transcript ID (RefSeq) |
scorePercentile | float | range: 0 - 1.0 |
score | float | range: 0 - 1.0 |
classification | string | pathogenic or benign classification |