PrimateAI-3D
Overview
PrimateAI is a deep residual neural network for classifying the pathogenicity of missense mutations.
The newer version, PrimateAI-3D, uses a 3D convolutional neural network to predict protein variant pathogenicity from structural information. The model combines primate sequencing data with protein structural data to support variant interpretation and disease gene identification. The predicted score ranges from 0 to 1, with 0 being benign and 1 being most pathogenic.
For more details, refer to these publications:
Publication
- Gao, H. et al. The landscape of tolerated genetic variation in humans and primates. Science 380, eabn8153 (2023). https://doi.org/10.1126/science.abn8197
- Sundaram, L., Gao, H., Padigepati, S.R. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 50, 1161–1170 (2018). https://doi.org/10.1038/s41588-018-0167-z
Professional data source
This is a Professional data source and is not available freely. Please contact annotation_support@illumina.com if you would like to obtain it.
Parsing
TSV File
chr pos non_flipped_ref non_flipped_alt gene_name change_position_1based ref_aa alt_aa score_PAI3D percentile_PAI3D refseq prediction per_gene_percentile_PAI3D hgnc
chr1 69094 G A ENST00000335137.4 2 V M 0.6169436463713646 0.5200308441794135 NM_001005484.1 pathogenic 0.699207135777998 OR4F5
chr1 69094 G C ENST00000335137.4 2 V L 0.5557043975591658 0.4271457250214688 NM_001005484.1 benign 0.6053022794846382 OR4F5
chr1 69094 G T ENST00000335137.4 2 V L 0.5557043975591658 0.4271457391722522 NM_001005484.1 benign 0.6053022794846382 OR4F5
chr1 69095 T A ENST00000335137.4 2 V E 0.8063537482917307 0.8032228720356267 NM_001005484.1 pathogenic 0.9202180376610506 OR4F5
chr1 69095 T C ENST00000335137.4 2 V A 0.5795628190040587 0.4631329075815453 NM_001005484.1 benign 0.6442021803766105 OR4F5
chr1 69095 T G ENST00000335137.4 2 V G 0.7922330142557621 0.7834049546930125 NM_001005484.1 pathogenic 0.900396432111001 OR4F5
From the file, all columns are parsed:
- chr
- pos
- non_flipped_ref
- non_flipped_alt
- gene_name
- change_position_1based
- ref_aa
- alt_aa
- score_PAI3D
- percentile_PAI3D
- refseq
- prediction
- per_gene_percentile_PAI3D
- hgnc
The fields gene_name and refseq define the Ensembl and RefSeq transcript IDs, respectively.
These transcript IDs are passed through as-is, and some of them may be unrecognized or deprecated by RefSeq or Ensembl.
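As a rough illustration of how one row of this file could be read, the sketch below splits a tab-separated line into the fourteen columns of the header and converts the numeric ones. The helper name `parse_row` and the use of a plain dictionary are assumptions made for this example; they do not describe how Illumina Connected Annotations parses the file internally.

```python
# Column names exactly as they appear in the PrimateAI-3D TSV header.
COLUMNS = [
    "chr", "pos", "non_flipped_ref", "non_flipped_alt", "gene_name",
    "change_position_1based", "ref_aa", "alt_aa", "score_PAI3D",
    "percentile_PAI3D", "refseq", "prediction",
    "per_gene_percentile_PAI3D", "hgnc",
]

INT_COLUMNS = {"pos", "change_position_1based"}
FLOAT_COLUMNS = {"score_PAI3D", "percentile_PAI3D", "per_gene_percentile_PAI3D"}


def parse_row(line):
    """Hypothetical helper: turn one tab-separated data line into a dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    row = dict(zip(COLUMNS, values))
    for name in INT_COLUMNS:
        row[name] = int(row[name])
    for name in FLOAT_COLUMNS:
        row[name] = float(row[name])
    return row


# First data row of the example file, rebuilt with tab separators.
example = "\t".join([
    "chr1", "69094", "G", "A", "ENST00000335137.4", "2", "V", "M",
    "0.6169436463713646", "0.5200308441794135", "NM_001005484.1",
    "pathogenic", "0.699207135777998", "OR4F5",
])
row = parse_row(example)

# gene_name holds the Ensembl transcript ID, refseq the RefSeq transcript ID.
print(row["gene_name"], row["refseq"], row["score_PAI3D"])
```

Transcript IDs are kept as plain strings here, in line with the pass-through behavior described above; the sketch makes no attempt to validate them against Ensembl or RefSeq.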
GRCh37
Note that for GRCh37, a lifted-over file is provided. This file is not sorted, so it must be sorted before use. Also note that certain RefSeq transcripts appear not to have been mapped during the lift-over process.
Pre-processing
Sorting
gzcat PrimateAI-3D.hg19.txt.gz | sort -t $'\t' -k1,1 -k2,2n | gzip > PrimateAI-3D.hg19_sorted.tsv.gz
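To check the result of the sort, something like the following sketch could be used. It mirrors the `-k1,1 -k2,2n` keys by comparing the chromosome as text and the position as an integer. The helper names `sort_key` and `is_sorted` are hypothetical, the header line is assumed to have been skipped before the check, and Python compares strings by code point, which matches `sort` only when it runs in the C locale.

```python
def sort_key(line):
    """Key mirroring `sort -k1,1 -k2,2n`: chromosome as text, position as number."""
    fields = line.rstrip("\n").split("\t")
    return (fields[0], int(fields[1]))


def is_sorted(lines):
    """Return True if the data lines never decrease under sort_key."""
    previous = None
    for line in lines:
        key = sort_key(line)
        if previous is not None and key < previous:
            return False
        previous = key
    return True


# Two truncated rows (only the leading columns matter for the ordering).
unsorted_rows = [
    "chr1\t69095\tT\tA",
    "chr1\t69094\tG\tA",
]
sorted_rows = sorted(unsorted_rows, key=sort_key)

print(is_sorted(unsorted_rows), is_sorted(sorted_rows))
```

Because the first key is compared as text, `chr10` orders before `chr2`, just as it does with the `sort` command above.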
SA Generation
dotnet SAUtils.dll \
PrimateAi \
--r "${References}/Homo_sapiens.GRCh38.Nirvana.dat" \
--i "${ExternalDataSources}/PrimateAI/3D/PrimateAI-3D.hg38.txt.gz" \
--o "${SaUtilsOutput}"
Known Issues
Some transcript IDs defined in the data file are obsolete, retired, or updated. They are not removed or modified by Illumina Connected Annotations, and are passed as-is from the PrimateAI-3D data source.
Examples:
The ENST00000643905.1 transcript is retired according to Ensembl.
The NM_182838.2 transcript was removed because it is a pseudogene according to RefSeq.
Download URL
https://primad.basespace.illumina.com/
JSON Output
"primateAI-3D": [
{
"aminoAcidPosition": 2,
"refAminoAcid": "V",
"altAminoAcid": "M",
"score": 0.616944,
"scorePercentile": 0.52,
"genePercentile": 0.7,
"classification": "pathogenic",
"ensemblTranscriptId": "ENST00000335137.4",
"refSeqTranscriptId": "NM_001005484.1",
"geneSymbol":"OR4F5"
}
]
Field | Type | Notes |
---|---|---|
aminoAcidPosition | int | Amino Acid Position (1-based) |
refAminoAcid | string | Reference Amino Acid |
altAminoAcid | string | Alternate Amino Acid |
ensemblTranscriptId | string | Transcript ID (Ensembl) |
refSeqTranscriptId | string | Transcript ID (RefSeq) |
scorePercentile | float | range: 0 - 1.0 |
genePercentile | float | range: 0 - 1.0 |
score | float | range: 0 - 1.0 |
classification | string | pathogenic or benign classification |
geneSymbol | string | HGNC gene symbol |
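For downstream use, the fields in the table can be read with any JSON library. The sketch below wraps the fragment shown above in braces so that it parses on its own, then selects entries by classification and score. The helper `pathogenic_entries` and its `min_score` parameter are hypothetical names for this example, and where the `primateAI-3D` array sits inside a full annotation file is not shown here.

```python
import json

# The JSON fragment from this section, wrapped in braces so it parses standalone.
fragment = """
{
  "primateAI-3D": [
    {
      "aminoAcidPosition": 2,
      "refAminoAcid": "V",
      "altAminoAcid": "M",
      "score": 0.616944,
      "scorePercentile": 0.52,
      "genePercentile": 0.7,
      "classification": "pathogenic",
      "ensemblTranscriptId": "ENST00000335137.4",
      "refSeqTranscriptId": "NM_001005484.1",
      "geneSymbol": "OR4F5"
    }
  ]
}
"""


def pathogenic_entries(annotation, min_score=0.0):
    """Hypothetical filter: entries classified pathogenic with score >= min_score."""
    entries = annotation.get("primateAI-3D", [])
    return [
        entry for entry in entries
        if entry["classification"] == "pathogenic" and entry["score"] >= min_score
    ]


annotation = json.loads(fragment)
for entry in pathogenic_entries(annotation):
    print(entry["geneSymbol"], entry["refSeqTranscriptId"], entry["score"])
```

Using `annotation.get(...)` with an empty default means that an object without a `primateAI-3D` array simply yields no entries.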