Version: 3.22

Primate AI

Overview

Primate AI is a deep residual neural network for classifying the pathogenicity of missense mutations.

The newer version, PrimateAI-3D, uses a 3D convolutional neural network to predict protein variant pathogenicity from structural information. It combines primate sequencing data with protein structure data to support variant interpretation and disease gene identification. The predicted score ranges from 0 to 1, where 0 is benign and 1 is most pathogenic.

For more details, refer to these publications:

Publication
  1. Gao, H. et al. The landscape of tolerated genetic variation in humans and primates. Science 380, eabn8197 (2023). https://doi.org/10.1126/science.abn8197
  2. Sundaram, L., Gao, H., Padigepati, S.R. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 50, 1161–1170 (2018). https://doi.org/10.1038/s41588-018-0167-z
Professional data source

This is a Professional data source and is not available freely. Please contact annotation_support@illumina.com if you would like to obtain it.

Primate AI is available in two versions based on assembly:

  1. Primate AI 3D: Only available for GRCh38
  2. Primate AI: Only available for GRCh37

The two versions differ in file structure and content, so they are handled separately:

Primate AI 3D: GRCh38

Parsing

CSV File

,chr,pos,non_flipped_ref,non_flipped_alt,gene_name,change_position_1based,ref_aa,alt_aa,score_PAI3D,percentile_PAI3D,refseq
0,chr1,69094,G,A,ENST00000335137.4,2,V,M,0.6169436463713646,0.5200308441794135,NM_001005484.1
1,chr1,69094,G,C,ENST00000335137.4,2,V,L,0.5557043975591658,0.4271457250214688,NM_001005484.1
2,chr1,69094,G,T,ENST00000335137.4,2,V,L,0.5557043975591658,0.4271457391722522,NM_001005484.1

From the CSV file, all columns are parsed:

  • chr
  • pos
  • ref
  • alt
  • gene_name
  • change_position_1based
  • ref_aa
  • alt_aa
  • score_PAI3D
  • percentile_PAI3D
  • refseq

The fields gene_name and refseq define the Ensembl and RefSeq transcript IDs respectively. These transcripts are passed as-is and some of them might be unrecognized/deprecated by RefSeq/Ensembl.
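The column handling described above can be sketched in Python. This is a minimal illustration, not the SAUtils implementation: the `Pai3dRecord` type and `parse_rows` function are hypothetical names, and the sketch assumes that the `ref` and `alt` fields are read from the `non_flipped_ref` and `non_flipped_alt` columns shown in the sample header.

```python
import csv
import io
from dataclasses import dataclass

# Rows copied from the CSV excerpt above; the first (unnamed) column is a row index.
SAMPLE = (
    ",chr,pos,non_flipped_ref,non_flipped_alt,gene_name,change_position_1based,"
    "ref_aa,alt_aa,score_PAI3D,percentile_PAI3D,refseq\n"
    "0,chr1,69094,G,A,ENST00000335137.4,2,V,M,"
    "0.6169436463713646,0.5200308441794135,NM_001005484.1\n"
    "1,chr1,69094,G,C,ENST00000335137.4,2,V,L,"
    "0.5557043975591658,0.4271457250214688,NM_001005484.1\n"
)


@dataclass
class Pai3dRecord:  # hypothetical record type, for illustration only
    chrom: str
    pos: int
    ref: str
    alt: str
    ensembl_transcript: str
    aa_position: int
    ref_aa: str
    alt_aa: str
    score: float
    percentile: float
    refseq_transcript: str


def parse_rows(handle):
    """Yield one Pai3dRecord per CSV data row."""
    for row in csv.DictReader(handle):
        yield Pai3dRecord(
            chrom=row["chr"],
            pos=int(row["pos"]),
            ref=row["non_flipped_ref"],  # assumption: source of "ref"
            alt=row["non_flipped_alt"],  # assumption: source of "alt"
            ensembl_transcript=row["gene_name"],  # kept as-is, not validated
            aa_position=int(row["change_position_1based"]),
            ref_aa=row["ref_aa"],
            alt_aa=row["alt_aa"],
            score=float(row["score_PAI3D"]),
            percentile=float(row["percentile_PAI3D"]),
            refseq_transcript=row["refseq"],  # kept as-is, not validated
        )


records = list(parse_rows(io.StringIO(SAMPLE)))
```

For the real data file, the same function could be given a handle from `gzip.open(path, "rt")` instead of the in-memory sample.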

Parsing Command

dotnet SAUtils.dll \
PrimateAi \
--r "${References}/Homo_sapiens.GRCh38.Nirvana.dat" \
--i "${ExternalDataSources}/PrimateAI/3D/PAI3D_wholeProteome_23_04_11.percentiles.pkg.refseq.csv.gz" \
--o "${SaUtilsOutput}"

Known Issues

Some transcript IDs defined in the data file are obsolete, retired, or updated. They are not removed or modified by Illumina Connected Annotations, and are passed as-is from the PrimateAI-3D data source.

Example:

ENST00000643905.1 transcript is retired according to Ensembl

NM_182838.2 transcript is removed because it is a pseudogene according to RefSeq

Download URL

https://primad.basespace.illumina.com/

Primate AI: GRCh37

Parsing

TSV File

chr pos ref alt refAA   altAA   strand_1pos_0neg    trinucleotide_context   UCSC_gene   ExAC_coverage   primateDL_score
chr10 1046704 C T R C 1 CCG uc001ift.3 45.49 0.849114537239
chr10 1046704 C G R G 1 CCG uc001ift.3 45.49 0.795686006546

From the TSV file, we're mainly interested in the following columns:

  • chr
  • pos
  • ref
  • alt
  • primateDL_score

We also use UCSC_gene to filter out variants that don't have matching gene models in Illumina Connected Annotations.
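The column selection and gene filter described above might look like the following sketch. It is illustrative only: `parse_scores` and the `known_ucsc_ids` set are hypothetical, and the sketch assumes the file is tab-delimited with the header as its first line.

```python
import csv
import io

HEADER = ["chr", "pos", "ref", "alt", "refAA", "altAA", "strand_1pos_0neg",
          "trinucleotide_context", "UCSC_gene", "ExAC_coverage", "primateDL_score"]
# Data rows copied from the TSV excerpt above.
ROWS = [
    ["chr10", "1046704", "C", "T", "R", "C", "1", "CCG", "uc001ift.3", "45.49", "0.849114537239"],
    ["chr10", "1046704", "C", "G", "R", "G", "1", "CCG", "uc001ift.3", "45.49", "0.795686006546"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def parse_scores(handle, known_ucsc_ids):
    """Yield (chr, pos, ref, alt, score) for rows whose UCSC gene is known."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["UCSC_gene"] not in known_ucsc_ids:
            continue  # no matching gene model, so the variant is skipped
        yield (row["chr"], int(row["pos"]), row["ref"], row["alt"],
               float(row["primateDL_score"]))


kept = list(parse_scores(io.StringIO(SAMPLE), {"uc001ift.3"}))
dropped = list(parse_scores(io.StringIO(SAMPLE), set()))
```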

Pre-processing

Converting UCSC IDs

Primate AI only provides UCSC IDs. As an initial pre-processing step, we'll need to convert these to either Entrez or Ensembl Gene IDs.

The following queries are used to download the conversions from UCSC:

mysql -h genome-mysql.soe.ucsc.edu -u genome -A -P 3306 \
-e "select * FROM knownToLocusLink;" hg19 > ucsc_locuslink.tsv

mysql -h genome-mysql.soe.ucsc.edu -u genome -A -P 3306 \
-e "select knownToEnsembl.name, knownToEnsembl.value, ensGene.name2 FROM knownToEnsembl, ensGene WHERE knownToEnsembl.value = ensGene.name;" \
hg19 > ucsc_ensembl.tsv
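
Once downloaded, the two TSV files can be loaded into lookup tables. The sketch below is a hypothetical illustration rather than the pre-processor's code. It assumes the mysql client wrote a header row with the column names `name` and `value` (plus `name2` for the Ensembl query), and the IDs in the sample strings are made-up placeholders.

```python
import csv
import io

# Made-up stand-ins for ucsc_locuslink.tsv and ucsc_ensembl.tsv (placeholder IDs).
LOCUSLINK_TSV = "name\tvalue\nuc_example.1\t1111\n"
ENSEMBL_TSV = (
    "name\tvalue\tname2\n"
    "uc_example.1\tENST_EXAMPLE_1\tENSG_EXAMPLE_1\n"
    "uc_example.2\tENST_EXAMPLE_2\tENSG_EXAMPLE_2\n"
)


def load_mapping(handle, key_column, value_column):
    """Build a dictionary from two columns of a tab-delimited file with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[key_column]: row[value_column] for row in reader}


ucsc_to_entrez = load_mapping(io.StringIO(LOCUSLINK_TSV), "name", "value")
ucsc_to_ensembl = load_mapping(io.StringIO(ENSEMBL_TSV), "name", "name2")


def gene_ids(ucsc_id):
    """Return (entrez_gene_id, ensembl_gene_id); either may be None."""
    return ucsc_to_entrez.get(ucsc_id), ucsc_to_ensembl.get(ucsc_id)
```

A UCSC ID for which both lookups return `None` corresponds to the "unknown gene ID" case counted in the pre-processor output.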

Running the Pre-Processor

The Primate AI pre-processor can be run as follows:

dotnet PrimateAiPreProcessor.dll UGA_develop.tsv PrimateAI_scores_v0.2.tsv.gz \
ucsc_locuslink.tsv ucsc_ensembl.tsv PrimateAI_0.2_GRCh37.tsv.gz

During conversion, about 0.5% of the UCSC IDs cannot be converted to either Entrez or Ensembl gene IDs. Once the gene IDs have been acquired, we check which of them are available in Illumina Connected Annotations.

The following Entrez Gene IDs were not found:

399753
401980
504189
504191
100293534

Here is the output from the pre-processor:

- loading UCSC to Entrez Gene ID dictionary... 73,432 genes loaded.
- loading UCSC to Ensembl Gene ID dictionary... 76,178 genes loaded.
- loading UGA gene ID to gene dictionary... 103,277 genes loaded.
- parsing Primate AI variants... 70,121,953 variants parsed.

# variants with unknown gene ID: 27,253 / 70,121,953
# genes with unknown gene ID: 109 / 19,614

# variants not in UGA: 2,036 / 70,121,953
# genes not in UGA: 6 / 19,614

Known Issues

The Primate AI data set provides raw scores, but the scores are biased according to gene context. For example, a score of 0.4 means something different in TP53 than it does in KRAS.

As a result, the Primate AI team provided guidance on aggregating these scores and presenting them as percentiles with respect to the associated gene. According to their research, the 25th percentile is a good proxy for benign variants and the 75th percentile is a good proxy for pathogenic variants.
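
To make the idea of a gene-relative percentile concrete, here is a minimal sketch. The exact percentile definition used by the pre-processor is not stated in this document, so the sketch assumes an empirical rank (the fraction of a gene's scores that are less than or equal to the given score). The gene names and scores are made up.

```python
from bisect import bisect_right
from collections import defaultdict


def gene_percentiles(variants):
    """variants: list of (gene, raw_score). Returns (gene, raw_score, percentile) tuples."""
    by_gene = defaultdict(list)
    for gene, score in variants:
        by_gene[gene].append(score)
    for scores in by_gene.values():
        scores.sort()

    result = []
    for gene, score in variants:
        ranked = by_gene[gene]
        # Fraction of this gene's scores that are <= the current score.
        percentile = bisect_right(ranked, score) / len(ranked)
        result.append((gene, score, round(percentile, 2)))
    return result


# Made-up scores: the same raw score (0.4) appears in two hypothetical genes.
variants = [
    ("GENE_A", 0.4), ("GENE_A", 0.6), ("GENE_A", 0.8), ("GENE_A", 0.9),
    ("GENE_B", 0.1), ("GENE_B", 0.2), ("GENE_B", 0.3), ("GENE_B", 0.4),
]
percentiles = gene_percentiles(variants)
```

In this toy data, a raw score of 0.4 is the lowest score in GENE_A and the highest in GENE_B, so it maps to very different percentiles in the two genes.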

Download URL

https://basespace.illumina.com/s/cPgCSmecvhb4

JSON Output

GRCh38

"primateAI-3D": [
  {
    "aminoAcidPosition": 2,
    "refAminoAcid": "V",
    "altAminoAcid": "M",
    "score": 0.616944,
    "scorePercentile": 0.52,
    "ensemblTranscriptId": "ENST00000335137.4",
    "refSeqTranscriptId": "NM_001005484.1"
  }
]
Field                  Type     Notes
aminoAcidPosition      int      Amino Acid Position (1-based)
refAminoAcid           string   Reference Amino Acid
altAminoAcid           string   Alternate Amino Acid
ensemblTranscriptId    string   Transcript ID (Ensembl)
refSeqTranscriptId     string   Transcript ID (RefSeq)
scorePercentile        float    range: 0 - 1.0
score                  float    range: 0 - 1.0
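
The relationship between the CSV columns and the JSON fields can be sketched as follows. This is an illustration, not the annotator's code: `to_json_entry` is a hypothetical helper, and the rounding (six decimals for `score`, two for `scorePercentile`) is inferred from the example values on this page rather than from a documented rule.

```python
import json


def to_json_entry(row):
    """Map one parsed PrimateAI-3D CSV row (a dict of strings) to a JSON-ready dict."""
    return {
        "aminoAcidPosition": int(row["change_position_1based"]),
        "refAminoAcid": row["ref_aa"],
        "altAminoAcid": row["alt_aa"],
        "score": round(float(row["score_PAI3D"]), 6),  # assumed rounding
        "scorePercentile": round(float(row["percentile_PAI3D"]), 2),  # assumed rounding
        "ensemblTranscriptId": row["gene_name"],
        "refSeqTranscriptId": row["refseq"],
    }


# First data row of the CSV excerpt shown in the GRCh38 parsing section.
row = {
    "change_position_1based": "2",
    "ref_aa": "V",
    "alt_aa": "M",
    "score_PAI3D": "0.6169436463713646",
    "percentile_PAI3D": "0.5200308441794135",
    "gene_name": "ENST00000335137.4",
    "refseq": "NM_001005484.1",
}
entry = to_json_entry(row)
print(json.dumps({"primateAI-3D": [entry]}, indent=2))
```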

GRCh37

"primateAI": [
  {
    "hgnc": "TP53",
    "scorePercentile": 0.3
  }
]
Field              Type     Notes
hgnc               string   HGNC Gene Symbol
scorePercentile    float    range: 0 - 1.0