Version: 3.22

Primate AI

Overview

Primate AI is a deep residual neural network for classifying the pathogenicity of missense mutations.

The newer version, PrimateAI-3D, uses a 3D convolutional neural network, to predict protein variant pathogenicity using structural information. The model's innovative use of primate sequencing and structural data offers promising insights into variant interpretation and disease gene identification. The predictive score range between 0 and 1, with 0 being benign and 1 being most pathogenic.

For more details, refer to these publications:

Publication

Hong Gao et al. ,The landscape of tolerated genetic variation in humans and primates. Science 380, eabn8153 (2023). https://doi.org/10.1126/science.abn8197
Sundaram, L., Gao, H., Padigepati, S.R. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 50, 1161–1170 (2018). https://doi.org/10.1038/s41588-018-0167-z

Professional data source

This is a Professional data source and is not available freely. Please contact annotation_support@illumina.com if you would like to obtain it.

Primate AI is available in two versions based on assembly:

Primate AI 3D: Only available for GRCh38
Primate AI: Only available for GRCh37

Both have different file structures, and information. Therefore, they are handled separately:

Primate AI 3D: GRCh38

Parsing

CSV File

,chr,pos,non_flipped_ref,non_flipped_alt,gene_name,change_position_1based,ref_aa,alt_aa,score_PAI3D,percentile_PAI3D,refseq
0,chr1,69094,G,A,ENST00000335137.4,2,V,M,0.6169436463713646,0.5200308441794135,NM_001005484.1
1,chr1,69094,G,C,ENST00000335137.4,2,V,L,0.5557043975591658,0.4271457250214688,NM_001005484.1
2,chr1,69094,G,T,ENST00000335137.4,2,V,L,0.5557043975591658,0.4271457391722522,NM_001005484.1

From the CSV file, all columns are parsed:

chr
pos
ref
alt
gene_name
change_position_1based
ref_aa
alt_aa
score_PAI3D
percentile_PAI3D
refseq

The fields gene_name and refseq define the Ensembl and RefSeq transcript IDs respectively. These transcripts are passed as-is and some of them might be unrecognized/deprecated by RefSeq/Ensembl.

Parsing Command

dotnet SAUtils.dll \
PrimateAi \
--r "${References}/Homo_sapiens.GRCh38.Nirvana.dat" \
--i "${ExternalDataSources}/PrimateAI/3D/PAI3D_wholeProteome_23_04_11.percentiles.pkg.refseq.csv.gz" \
--o "${SaUtilsOutput]"

Known Issues

Some transcript IDs defined in the data file are obsolete, retired, or updated. They are not removed or modified by Illumina Connected Annotations, and are passed as-is from the PrimateAI-3D data source.

Example:

ENST00000643905.1 transcript is retired according to Ensembl

NM_182838.2 transcript is removed because it is a pseudo-gene according to RefSeq

Download URL

https://primad.basespace.illumina.com/

Primate AI: GRCh37

Parsing

TSV File

chr pos ref alt refAA   altAA   strand_1pos_0neg    trinucleotide_context   UCSC_gene   ExAC_coverage   primateDL_score
chr10   1046704 C   T   R   C   1   CCG uc001ift.3  45.49   0.849114537239
chr10   1046704 C   G   R   G   1   CCG uc001ift.3  45.49   0.795686006546

From the TSV file, we're mainly interested in the following columns:

chr
pos
ref
alt
primateDL_score

We also use UCSC_gene to filter out variants that don't have matching gene models in Illumina Connected Annotations.

Pre-processing

Converting UCSC IDs

Primate AI only provides UCSC IDs. As an initial pre-processing step, we'll need to convert these to either Entrez or Ensembl Gene IDs.

The following queries are used to download the conversions from UCSC:

mysql -h genome-mysql.soe.ucsc.edu -u genome -A -P 3306 \
      -e "select * FROM knownToLocusLink;" hg19 > ucsc_locuslink.tsv

mysql -h genome-mysql.soe.ucsc.edu -u genome -A -P 3306 \
      -e "select knownToEnsembl.name, knownToEnsembl.value, ensGene.name2 FROM knownToEnsembl, ensGene WHERE knownToEnsembl.value = ensGene.name;" \
      hg19 > ucsc_ensembl.tsv

Running the Pre-Processor

The Primate AI pre-processor can be run as follows:

dotnet PrimateAiPreProcessor.dll UGA_develop.tsv PrimateAI_scores_v0.2.tsv.gz \
     ucsc_locuslink.tsv ucsc_ensembl.tsv PrimateAI_0.2_GRCh37.tsv.gz

During conversion, 0.5% of the UCSC Ids cannot be converted to either Entrez or Ensembl gene IDs. Once the gene IDs have been acquired, we check to see which are available in Illumina Connected Annotations.

The following Entrez Gene IDs were not found:

Here is the output from the pre-processor:

- loading UCSC to Entrez Gene ID dictionary... 73,432 genes loaded.
- loading UCSC to Ensembl Gene ID dictionary... 76,178 genes loaded.
- loading UGA gene ID to gene dictionary... 103,277 genes loaded.
- parsing Primate AI variants... 70,121,953 variants parsed.

# variants with unknown gene ID: 27,253 / 70,121,953
# genes with unknown gene ID:    109 / 19,614

# variants not in UGA: 2,036 / 70,121,953
# genes not in UGA:    6 / 19,614

Known Issues

The Primate AI data set provides raw scores, but the scores are biased according to gene context. I.e. a 0.4 means something different in TP53 than it does in KRAS.

As a result, the Primate AI team provided guidance on aggregating these scores and presenting them as percentiles with respect to the associated gene. According to their research, the 25^th percentile is a good proxy for benign variants and the 75^th percentile is a good proxy for pathogenic variants.

Download URL

https://basespace.illumina.com/s/cPgCSmecvhb4

JSON Output

GRCh38

"primateAI-3D": [
  {
    "aminoAcidPosition": 2,
    "refAminoAcid": "V",
    "altAminoAcid": "M",
    "score": 0.616944,
    "scorePercentile": 0.52,
    "ensemblTranscriptId": "ENST00000335137.4",
    "refSeqTranscriptId": "NM_001005484.1"
  }
]

Field	Type	Notes
aminoAcidPosition	int	Amino Acid Position (1-based)
refAminoAcid	string	Reference Amino Acid
altAminoAcid	string	Alternate Amino Acid
ensemblTranscriptId	string	Transcript ID (Ensembl)
refSeqTranscriptId	string	Transcript ID (RefSeq)
scorePercentile	float	range: 0 - 1.0
score	float	range: 0 - 1.0

GRCh37

"primateAI": [
   {
      "hgnc":"TP53",
      "scorePercentile":0.3,
   }
]

Field	Type	Notes
hgnc	string	HGNC Gene Symbol
scorePercentile	float	range: 0 - 1.0

Overview​

Publication

Professional data source

Primate AI 3D: GRCh38​

Parsing​

CSV File​

Parsing Command​

Known Issues​

Known Issues

Example:​

Download URL​

Primate AI: GRCh37​

Parsing​

TSV File​

Pre-processing​

Converting UCSC IDs​

Running the Pre-Processor​

Known Issues​

Known Issues

Download URL​

JSON Output​

GRCh38​

GRCh37​

Overview

Primate AI 3D: GRCh38

Parsing

CSV File

Parsing Command

Known Issues

Example:

Download URL

Primate AI: GRCh37

Parsing

TSV File

Pre-processing

Converting UCSC IDs

Running the Pre-Processor

Known Issues

Download URL

JSON Output

GRCh38

GRCh37