Primate AI
Overview
Primate AI is a deep residual neural network for classifying the pathogenicity of missense mutations.
The newer version, PrimateAI-3D, uses a 3D convolutional neural network to predict the pathogenicity of protein variants from structural information. The model's use of primate sequencing and structural data supports variant interpretation and disease gene identification. The predicted score ranges from 0 to 1, with 0 being benign and 1 being most pathogenic.
For more details, refer to these publications:
Publication
- Gao, H. et al. The landscape of tolerated genetic variation in humans and primates. Science 380, eabn8197 (2023). https://doi.org/10.1126/science.abn8197
- Sundaram, L., Gao, H., Padigepati, S.R. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 50, 1161–1170 (2018). https://doi.org/10.1038/s41588-018-0167-z
Professional data source
This is a Professional data source and is not freely available. Please contact annotation_support@illumina.com if you would like to obtain it.
Primate AI is available in two versions based on assembly:
- Primate AI 3D: Only available for GRCh38
- Primate AI: Only available for GRCh37
The two versions differ in file structure and content, so they are handled separately:
Primate AI 3D: GRCh38
Parsing
CSV File
,chr,pos,non_flipped_ref,non_flipped_alt,gene_name,change_position_1based,ref_aa,alt_aa,score_PAI3D,percentile_PAI3D,refseq
0,chr1,69094,G,A,ENST00000335137.4,2,V,M,0.6169436463713646,0.5200308441794135,NM_001005484.1
1,chr1,69094,G,C,ENST00000335137.4,2,V,L,0.5557043975591658,0.4271457250214688,NM_001005484.1
2,chr1,69094,G,T,ENST00000335137.4,2,V,L,0.5557043975591658,0.4271457391722522,NM_001005484.1
From the CSV file, all columns are parsed:
chr
pos
ref
alt
gene_name
change_position_1based
ref_aa
alt_aa
score_PAI3D
percentile_PAI3D
refseq
The fields `gene_name` and `refseq` define the Ensembl and RefSeq transcript IDs, respectively.
These transcript IDs are passed through as-is, and some of them may be unrecognized or deprecated by RefSeq or Ensembl.
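As an illustration of the column layout above, the following is a minimal Python sketch that reads the sample rows into typed records. It is not the SAUtils parser. The `Pai3dRecord` type and `parse_pai3d` function are hypothetical names, and mapping `ref` and `alt` to the `non_flipped_ref` and `non_flipped_alt` header columns is an assumption based on the sample header.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Pai3dRecord:
    # Hypothetical record type for illustration; not the SAUtils data model.
    chrom: str
    pos: int
    ref: str
    alt: str
    ensembl_transcript: str
    aa_position: int
    ref_aa: str
    alt_aa: str
    score: float
    percentile: float
    refseq_transcript: str


def parse_pai3d(handle):
    """Yield one record per CSV row; the unnamed leading index column is ignored."""
    for row in csv.DictReader(handle):
        yield Pai3dRecord(
            chrom=row["chr"],
            pos=int(row["pos"]),
            # Assumption: ref/alt correspond to the non_flipped_* columns.
            ref=row["non_flipped_ref"],
            alt=row["non_flipped_alt"],
            ensembl_transcript=row["gene_name"],  # passed through as-is
            aa_position=int(row["change_position_1based"]),
            ref_aa=row["ref_aa"],
            alt_aa=row["alt_aa"],
            score=float(row["score_PAI3D"]),
            percentile=float(row["percentile_PAI3D"]),
            refseq_transcript=row["refseq"],  # passed through as-is
        )


# First two data rows from the sample file shown above.
SAMPLE = """,chr,pos,non_flipped_ref,non_flipped_alt,gene_name,change_position_1based,ref_aa,alt_aa,score_PAI3D,percentile_PAI3D,refseq
0,chr1,69094,G,A,ENST00000335137.4,2,V,M,0.6169436463713646,0.5200308441794135,NM_001005484.1
1,chr1,69094,G,C,ENST00000335137.4,2,V,L,0.5557043975591658,0.4271457250214688,NM_001005484.1
"""

records = list(parse_pai3d(io.StringIO(SAMPLE)))
for record in records:
    print(record)
```

For the compressed file, the same function could be given a handle from `gzip.open(path, "rt", newline="")`.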
Parsing Command
dotnet SAUtils.dll \
PrimateAi \
--r "${References}/Homo_sapiens.GRCh38.Nirvana.dat" \
--i "${ExternalDataSources}/PrimateAI/3D/PAI3D_wholeProteome_23_04_11.percentiles.pkg.refseq.csv.gz" \
--o "${SaUtilsOutput}"
Known Issues
Some transcript IDs defined in the data file are obsolete, retired, or updated. They are not removed or modified by Illumina Connected Annotations, and are passed as-is from the PrimateAI-3D data source.
Example:
ENST00000643905.1 transcript is retired according to Ensembl
NM_182838.2 transcript is removed because it is a pseudo-gene according to RefSeq
Download URL
https://primad.basespace.illumina.com/
Primate AI: GRCh37
Parsing
TSV File
chr pos ref alt refAA altAA strand_1pos_0neg trinucleotide_context UCSC_gene ExAC_coverage primateDL_score
chr10 1046704 C T R C 1 CCG uc001ift.3 45.49 0.849114537239
chr10 1046704 C G R G 1 CCG uc001ift.3 45.49 0.795686006546
From the TSV file, we're mainly interested in the following columns:
chr
pos
ref
alt
primateDL_score
We also use `UCSC_gene` to filter out variants that don't have matching gene models in Illumina Connected Annotations.
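The column selection and gene filter described above can be sketched in Python as follows. This is an illustration only, not the actual parser. The function name `parse_primate_ai` and the `known_ucsc_ids` set are hypothetical, and the sketch assumes a tab-delimited file whose first line is the header shown above.

```python
import csv
import io

# Header and rows copied from the sample above, with explicit tab delimiters.
SAMPLE_TSV = (
    "chr\tpos\tref\talt\trefAA\taltAA\tstrand_1pos_0neg\t"
    "trinucleotide_context\tUCSC_gene\tExAC_coverage\tprimateDL_score\n"
    "chr10\t1046704\tC\tT\tR\tC\t1\tCCG\tuc001ift.3\t45.49\t0.849114537239\n"
    "chr10\t1046704\tC\tG\tR\tG\t1\tCCG\tuc001ift.3\t45.49\t0.795686006546\n"
)


def parse_primate_ai(handle, known_ucsc_ids):
    """Yield (chr, pos, ref, alt, score) for rows whose UCSC gene is known."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["UCSC_gene"] not in known_ucsc_ids:
            continue  # no matching gene model, so the variant is skipped
        yield (
            row["chr"],
            int(row["pos"]),
            row["ref"],
            row["alt"],
            float(row["primateDL_score"]),
        )


# Hypothetical set of UCSC IDs that have a matching gene model.
kept = list(parse_primate_ai(io.StringIO(SAMPLE_TSV), {"uc001ift.3"}))
# With an empty set, every variant is filtered out.
dropped = list(parse_primate_ai(io.StringIO(SAMPLE_TSV), set()))
print(len(kept), len(dropped))
```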
Pre-processing
Converting UCSC IDs
Primate AI only provides UCSC IDs. As an initial pre-processing step, we'll need to convert these to either Entrez or Ensembl Gene IDs.
The following queries are used to download the conversions from UCSC:
mysql -h genome-mysql.soe.ucsc.edu -u genome -A -P 3306 \
-e "select * FROM knownToLocusLink;" hg19 > ucsc_locuslink.tsv
mysql -h genome-mysql.soe.ucsc.edu -u genome -A -P 3306 \
-e "select knownToEnsembl.name, knownToEnsembl.value, ensGene.name2 FROM knownToEnsembl, ensGene WHERE knownToEnsembl.value = ensGene.name;" \
hg19 > ucsc_ensembl.tsv
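Once exported, the two files can be loaded into lookup tables keyed by UCSC ID. The Python sketch below shows one way to do that. It assumes each file is tab-separated with a header line, with the UCSC ID in the first column. The IDs in the sample data are made up, and the function names and the Entrez-before-Ensembl fallback order are assumptions for illustration, not the pre-processor's documented behavior.

```python
import csv
import io


def load_mapping(handle, value_column):
    """Map the UCSC ID in column 0 to the gene ID in `value_column`."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    mapping = {}
    for fields in reader:
        if len(fields) > value_column:
            mapping[fields[0]] = fields[value_column]
    return mapping


def convert(ucsc_id, entrez_map, ensembl_map):
    """Return an Entrez ID if available, else an Ensembl gene ID, else None."""
    return entrez_map.get(ucsc_id) or ensembl_map.get(ucsc_id)


# Made-up rows that mimic the shape of the two exports.
LOCUSLINK = "name\tvalue\nuc000aaa.1\t1001\nuc000bbb.2\t1002\n"
ENSEMBL = (
    "name\tvalue\tname2\n"
    "uc000bbb.2\tENST00000000002\tENSG00000000002\n"
    "uc000ccc.1\tENST00000000003\tENSG00000000003\n"
)

entrez = load_mapping(io.StringIO(LOCUSLINK), value_column=1)
# Column 2 (name2) holds the Ensembl gene ID in the second query's output.
ensembl = load_mapping(io.StringIO(ENSEMBL), value_column=2)

for ucsc_id in ("uc000aaa.1", "uc000ccc.1", "uc000zzz.9"):
    print(ucsc_id, convert(ucsc_id, entrez, ensembl))
```

IDs that resolve through neither table return `None`, which corresponds to the unconverted UCSC IDs reported by the pre-processor.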
Running the Pre-Processor
The Primate AI pre-processor can be run as follows:
dotnet PrimateAiPreProcessor.dll UGA_develop.tsv PrimateAI_scores_v0.2.tsv.gz \
ucsc_locuslink.tsv ucsc_ensembl.tsv PrimateAI_0.2_GRCh37.tsv.gz
During conversion, 0.5% of the UCSC IDs cannot be converted to either Entrez or Ensembl gene IDs. Once the gene IDs have been acquired, we check to see which are available in Illumina Connected Annotations.
The following Entrez Gene IDs were not found:
399753
401980
504189
504191
100293534
Here is the output from the pre-processor:
- loading UCSC to Entrez Gene ID dictionary... 73,432 genes loaded.
- loading UCSC to Ensembl Gene ID dictionary... 76,178 genes loaded.
- loading UGA gene ID to gene dictionary... 103,277 genes loaded.
- parsing Primate AI variants... 70,121,953 variants parsed.
# variants with unknown gene ID: 27,253 / 70,121,953
# genes with unknown gene ID: 109 / 19,614
# variants not in UGA: 2,036 / 70,121,953
# genes not in UGA: 6 / 19,614
Known Issues
The Primate AI data set provides raw scores, but the scores are biased according to gene context. For example, a score of 0.4 means something different in `TP53` than it does in `KRAS`.
As a result, the Primate AI team provided guidance on aggregating these scores and presenting them as percentiles with respect to the associated gene. According to their research, the 25th percentile is a good proxy for benign variants and the 75th percentile is a good proxy for pathogenic variants.
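To make the idea of a per-gene percentile concrete, here is a small Python sketch. The exact percentile definition used in production is not stated here, so the sketch assumes one simple definition: the fraction of scores in the same gene that are less than or equal to the given score. The scores are invented for illustration, and `gene_percentiles` is a hypothetical helper.

```python
from bisect import bisect_right
from collections import defaultdict


def gene_percentiles(variants):
    """Given a list of (gene, raw_score), return (gene, raw_score, percentile)."""
    scores_by_gene = defaultdict(list)
    for gene, score in variants:
        scores_by_gene[gene].append(score)
    for scores in scores_by_gene.values():
        scores.sort()

    result = []
    for gene, score in variants:
        ranked = scores_by_gene[gene]
        # Fraction of this gene's scores that are <= the current score.
        percentile = bisect_right(ranked, score) / len(ranked)
        result.append((gene, score, round(percentile, 2)))
    return result


# Invented raw scores: the same raw value 0.4 appears in both genes.
variants = [
    ("TP53", 0.4), ("TP53", 0.6), ("TP53", 0.8), ("TP53", 0.9),
    ("KRAS", 0.1), ("KRAS", 0.2), ("KRAS", 0.3), ("KRAS", 0.4),
]

ranked_variants = gene_percentiles(variants)
for gene, score, percentile in ranked_variants:
    print(gene, score, percentile)
```

In this toy data, a raw score of 0.4 is the lowest score in `TP53` (percentile 0.25) but the highest in `KRAS` (percentile 1.0), which is the kind of gene-context bias that percentiles are meant to correct.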
Download URL
https://basespace.illumina.com/s/cPgCSmecvhb4
JSON Output
GRCh38
"primateAI-3D": [
{
"aminoAcidPosition": 2,
"refAminoAcid": "V",
"altAminoAcid": "M",
"score": 0.616944,
"scorePercentile": 0.52,
"ensemblTranscriptId": "ENST00000335137.4",
"refSeqTranscriptId": "NM_001005484.1"
}
]
Field | Type | Notes |
---|---|---|
aminoAcidPosition | int | Amino Acid Position (1-based) |
refAminoAcid | string | Reference Amino Acid |
altAminoAcid | string | Alternate Amino Acid |
ensemblTranscriptId | string | Transcript ID (Ensembl) |
refSeqTranscriptId | string | Transcript ID (RefSeq) |
scorePercentile | float | range: 0 - 1.0 |
score | float | range: 0 - 1.0 |
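The relationship between the CSV columns and the JSON fields can be sketched as below. The field mapping follows the sample row and the JSON example above. The rounding (six decimals for `score`, two for `scorePercentile`) is inferred from that single example and is an assumption, not documented behavior. `to_json_entry` is a hypothetical helper.

```python
import json


def to_json_entry(row):
    """Build one primateAI-3D JSON entry from a parsed CSV row (a dict)."""
    return {
        "aminoAcidPosition": int(row["change_position_1based"]),
        "refAminoAcid": row["ref_aa"],
        "altAminoAcid": row["alt_aa"],
        # Assumption: rounding inferred from the documented example output.
        "score": round(float(row["score_PAI3D"]), 6),
        "scorePercentile": round(float(row["percentile_PAI3D"]), 2),
        "ensemblTranscriptId": row["gene_name"],
        "refSeqTranscriptId": row["refseq"],
    }


# Values taken from the first data row of the sample CSV file.
row = {
    "change_position_1based": "2",
    "ref_aa": "V",
    "alt_aa": "M",
    "score_PAI3D": "0.6169436463713646",
    "percentile_PAI3D": "0.5200308441794135",
    "gene_name": "ENST00000335137.4",
    "refseq": "NM_001005484.1",
}

entry = to_json_entry(row)
print(json.dumps({"primateAI-3D": [entry]}, indent=2))
```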
GRCh37
"primateAI": [
{
"hgnc":"TP53",
"scorePercentile":0.3
}
]
Field | Type | Notes |
---|---|---|
hgnc | string | HGNC Gene Symbol |
scorePercentile | float | range: 0 - 1.0 |