One of the features of hap.py is the ability to produce TP, FP and FN counts in specific genomic subsets, as specified by the user. This makes it possible to compare performance metrics across genomic loci, as well as across analysis groups. However, replicate variability and variable subset sizes pose a challenge when interpreting these types of results. For example, if we observe the counts in the table below across 4 different replicates, can we confidently claim that there is a difference in recall across conditions for subset A? And can we say that there is a difference across subsets for condition 2?

SUBSET    TP.CONDITION_1    TP.CONDITION_2    TRUTH.TOTAL
A         81, 82, 82, 82    74, 75, 76, 75    82
B         2, 2, 2, 2        2, 2, 2, 2        2

In this vignette we will describe a method to calculate credible intervals for stratified counts that takes the challenges stated above into account. In particular, we will be comparing stratified recall between a set of PCR-Free and Nano builds from NA12878.

Obtaining stratified counts

We will be relying on a pre-processed dataset that we have generated by running hap.py on VCFs from BaseSpace Data Central - “NovaSeq S2: TruSeq PCR-Free 350 (Replicates of NA12878)” and “NovaSeq S2: TruSeq Nano 550 (Replicates of NA12878)”.

The example dataset consists of a set of replicates for NA12878, sequenced on NovaSeq S2 and analysed using WGSv8. Of these, 5 replicates have been processed using TruSeq PCR-Free and 8 using TruSeq Nano. Our goal will be to compare recall across both library preparation workflows, both at a genome-wide level and across specific regions (coding exons in ACMG genes, high AT/GC regions and a repeat masker track). Note that, for speed reasons, we have focused our analyses on chr22, and have subsetted the truth VCF and stratification BEDs accordingly.

Loading demo results into R

Let’s first load hap.py outputs into R using a happyR samplesheet:
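A minimal sketch of that step is shown below; it assumes a samplesheet with group_id, replicate_id and happy_prefix columns (the columns used later in this vignette) and the read_samplesheet_() reader from happyR. The prefixes are placeholders, so adjust them to point at your own hap.py outputs.

library(happyR)
library(tidyverse)

# hypothetical samplesheet: one row per hap.py run, giving the analysis
# group, the replicate id and the hap.py output prefix
samplesheet <- tibble::tribble(
  ~group_id,  ~replicate_id,  ~happy_prefix,
  "PCR-Free", "replicate_1",  "happy_results/pcrfree_rep1",
  "Nano",     "replicate_1",  "happy_results/nano_rep1"
)

# returns a happy_samplesheet object exposing $samplesheet and $results
hap_samplesheet <- read_samplesheet_(samplesheet)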

Once we have a happy_samplesheet object we can rely on the extract_results function to access our metrics of interest. Since hap.py saves stratified counts in *.extended.csv files, we can use the following:

stratified_counts <- extract_results(hap_samplesheet$results, table = "extended") %>% 
  # focus on PASS calls in level 0 subsets, excluding TS* subsets
  filter(Subtype == "*", Filter == "PASS", Subset.Level == 0, !grepl(pattern = "TS*", Subset)) %>% 
  inner_join(hap_samplesheet$samplesheet) %>% 
  mutate(.type_group = paste(Type, group_id))
#> Joining, by = "happy_prefix"

stratified_counts %>% head
#> # A tibble: 6 x 69
#>   Type  Subtype Subset Filter Genotype QQ.Field    QQ METRIC.Recall
#>   <chr> <chr>   <chr>  <chr>  <chr>    <chr>    <dbl>         <dbl>
#> 1 INDEL *       *      PASS   *        QUAL        NA         0.906
#> 2 INDEL *       codin… PASS   *        QUAL        NA         1    
#> 3 INDEL *       high.… PASS   *        QUAL        NA         0.875
#> 4 INDEL *       high.… PASS   *        QUAL        NA         0.905
#> 5 INDEL *       repea… PASS   *        QUAL        NA         0.892
#> 6 SNP   *       *      PASS   *        QUAL        NA         0.945
#> # … with 61 more variables: METRIC.Precision <dbl>, METRIC.Frac_NA <dbl>,
#> #   METRIC.F1_Score <dbl>, FP.gt <int>, FP.al <int>, Subset.Size <dbl>,
#> #   Subset.IS_CONF.Size <dbl>, Subset.Level <dbl>, TRUTH.TOTAL <int>,
#> #   TRUTH.TOTAL.ti <dbl>, TRUTH.TOTAL.tv <dbl>, TRUTH.TOTAL.het <dbl>,
#> #   TRUTH.TOTAL.homalt <dbl>, TRUTH.TOTAL.TiTv_ratio <dbl>,
#> #   TRUTH.TOTAL.het_hom_ratio <dbl>, TRUTH.TP <int>, TRUTH.TP.ti <dbl>,
#> #   TRUTH.TP.tv <dbl>, TRUTH.TP.het <dbl>, TRUTH.TP.homalt <dbl>,
#> #   TRUTH.TP.TiTv_ratio <dbl>, TRUTH.TP.het_hom_ratio <dbl>,
#> #   TRUTH.FN <int>, TRUTH.FN.ti <dbl>, TRUTH.FN.tv <dbl>,
#> #   TRUTH.FN.het <dbl>, TRUTH.FN.homalt <dbl>, TRUTH.FN.TiTv_ratio <dbl>,
#> #   TRUTH.FN.het_hom_ratio <dbl>, QUERY.TOTAL <int>, QUERY.TOTAL.ti <dbl>,
#> #   QUERY.TOTAL.tv <dbl>, QUERY.TOTAL.het <dbl>, QUERY.TOTAL.homalt <dbl>,
#> #   QUERY.TOTAL.TiTv_ratio <dbl>, QUERY.TOTAL.het_hom_ratio <dbl>,
#> #   QUERY.TP <int>, QUERY.TP.ti <dbl>, QUERY.TP.tv <dbl>,
#> #   QUERY.TP.het <dbl>, QUERY.TP.homalt <dbl>, QUERY.TP.TiTv_ratio <dbl>,
#> #   QUERY.TP.het_hom_ratio <dbl>, QUERY.FP <int>, QUERY.FP.ti <dbl>,
#> #   QUERY.FP.tv <dbl>, QUERY.FP.het <dbl>, QUERY.FP.homalt <dbl>,
#> #   QUERY.FP.TiTv_ratio <dbl>, QUERY.FP.het_hom_ratio <dbl>,
#> #   QUERY.UNK <int>, QUERY.UNK.ti <dbl>, QUERY.UNK.tv <dbl>,
#> #   QUERY.UNK.het <dbl>, QUERY.UNK.homalt <dbl>,
#> #   QUERY.UNK.TiTv_ratio <dbl>, QUERY.UNK.het_hom_ratio <dbl>,
#> #   happy_prefix <chr>, group_id <chr>, replicate_id <chr>,
#> #   .type_group <chr>

Let’s inspect the number of true events that overlap with each of our regions of interest:
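The original output is not reproduced here; a minimal sketch of one way to tabulate truth counts per subset from the stratified_counts table built above could be:

# number of truth variants overlapping each subset, by variant type
stratified_counts %>% 
  group_by(Type, Subset) %>% 
  summarise(truth_total = median(TRUTH.TOTAL)) %>% 
  arrange(Type, desc(truth_total))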

We can see how some subsets contain a large number of events (e.g. * and repeat.masker), whilst others barely contain any observations (e.g. coding.exons.acmg.ensembl87).

Calculating HDIs

In order to reliably compare TP counts across subsets and analysis groups, we propose a method to compute highest posterior density intervals (HDIs) from hap.py outputs. Our method accounts for variable subset sizes and replicate variability, and is based on the following steps:

  1. Model counts using Negative Binomial sampling: \[k_{ij} \sim NB(\rho_{i}*n_{ij}, \sigma)\] where \(k_{ij}\) is the number of observed successes for subset \(i\) in replicate \(j\), \(\rho_{i}\) is the success rate for subset \(i\), \(n_{ij}\) is the total count for subset \(i\) in replicate \(j\) and \(\sigma\) is a dispersion parameter, with priors \(\rho_{i} \sim Beta(1, 1)\) and \(\sigma \sim Normal(\mu, \sigma^2)\)

  2. Use MCMC to obtain posterior distributions for the success rate, per replicate
  3. Compute HDIs by sampling from the aggregate posterior

We provide a set of helper functions attached to this vignette (stratified_counts_util.R) that make it easy to calculate HDIs from the extended metrics table that we have previously imported; e.g. for stratified recall:
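The call below is only a sketch of how those helpers could be used for recall; it assumes that sample_posterior() and estimate_hdi() from stratified_counts_util.R accept the same successes_field/totals_field arguments used in the simulation further down, with TRUTH.TP as successes out of TRUTH.TOTAL truth variants.

# not run
source("stratified_counts_util.R")

# fit the per-subset model using TP counts as successes out of the total
# number of truth variants, i.e. recall
recall_stan_result <- sample_posterior(m = stratified_counts,
                                       successes_field = "TRUTH.TP",
                                       totals_field = "TRUTH.TOTAL")

# summarise the posterior draws into highest density intervals
recall_hdi <- estimate_hdi(r = recall_stan_result)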

Calibrating HDIs with simulated data

One way to evaluate the accuracy of our HDIs is through simulated data, where we know what the true success rate is and, as a result, we can measure how often that value is captured by our HDI estimates. However, we do not want to rely on a simulation approach that is based solely on the model that we have used for estimation, since it might not be fully representative of real data.

Let us take recall measurements across \(r\) replicates as an example. For a subset with \(n\) total variants we can classify each variant into one of these 3 groups:

  1. the variant is consistently called across all of the replicates (always seen - A)
  2. the variant is only called in a subset of the replicates (sometimes seen - S)
  3. the variant is never called (never seen - N)

We can then define our true recall \(\rho\) as follows:

\[\rho = (n*q_A*x_A + n*q_S*x_S + n*q_N*x_N) / n\]

where \(n\) is the total number of variants in the subset; \(q_A\), \(q_S\) and \(q_N\) are the probabilities of a variant belonging to groups A, S and N, respectively; and \(x_A\), \(x_S\) and \(x_N\) are the recalls within each group, with \(x_A\) always 1, \(x_S\) variable and \(x_N\) always 0.

From these parameters, we can simulate TP counts in one sample by drawing from a multinomial distribution that assigns each of the \(n\) variants to one of the groups using the defined probabilities \(\{q_A, q_S, q_N\}\). We can next simulate replicates from this sample by performing random binomial draws from each group with recalls \(\{x_A, x_S, x_N\}\), which in practice will be \(\{1, x_S, 0\}\). Note that the number of true positives for groups A and N will be fixed across replicates by definition, but we can still use \(x_S\) to simulate replicate variability. As a result of this variability, our observed recall will differ slightly from the theoretical expectation \(\rho\).

Generating the simulated dataset

We can implement the simulation strategy discussed above as follows:

# not run

# simulate a sample with N subsets, by randomly assigning values to {q_A, q_S, q_N}, x_S and n
n_subsets = 1000
sim_sample = lapply(1:n_subsets, function(i) {
    # draw group probabilities {q_A, q_S, q_N} that sum to 1
    q = runif(3, min = 0, max = 1)
    q = round(q/sum(q), 4)
    data.frame(
        i = i,
        qa = q[1],
        qs = q[2],
        qn = q[3]
    )
}) %>% 
  bind_rows() %>%
  mutate(x = sample(x = seq(0, 1, 1e-4), size = n_subsets, replace = TRUE)) %>% 
  mutate(n = sample(x = seq(1e3, 1e4, 1), size = n_subsets, replace = TRUE)) %>% 
  mutate(expected_rho = round((n*qa + n*qs*x) / n, 4)) %>% 
  mutate(subset_qa_qs_qn_x_n = sprintf("S%s_%s_%s_%s_%s_%s", i, qa, qs, qn, x, n))

sim_sample %>% head

# simulate r replicates from the sample by drawing counts in each subset, with probabilities {1, x, 0}
get_counts <- function(subset_qa_qs_qn_x_n, r) {
    # recover the simulation parameters encoded in the subset label,
    # converting them back to numeric values
    s <- data.frame(subset_qa_qs_qn_x_n, stringsAsFactors = FALSE) %>% 
        separate(subset_qa_qs_qn_x_n, into = c("subset", "qa", "qs", "qn", "x", "n"),
                 sep = "_", convert = TRUE)
    
    # assign each of the n variants to one of the three groups
    counts <- table(sample(
      factor(c("always", "sometimes", "never")), s$n, prob = c(s$qa, s$qs, s$qn), replace = TRUE))
    # group A variants are recovered in every replicate, group S variants are
    # recovered with probability x per replicate, group N variants never are
    always <- counts["always"]
    sometimes <- rbinom(r, counts["sometimes"], s$x)
    per_replicate_successes <- data.frame(
        subset_qa_qs_qn_x_n = subset_qa_qs_qn_x_n,
        replicate_id = paste("R", 1:r, sep = ""),
        successes = always + sometimes,
        stringsAsFactors = FALSE
    )
    per_replicate_successes
}

sim_replicates <- lapply(sim_sample$subset_qa_qs_qn_x_n, function(s) get_counts(s, r = 5)) %>% 
  bind_rows()

sim_replicates %>% head

# combine the sample and replicate datasets
sim_data <- sim_sample %>% 
  inner_join(sim_replicates) %>% 
  mutate(observed_rho = round(successes / n, 4))

sim_data %>% head
write_csv(sim_data, path = "sim_data.csv")

# calculate HDIs
stan_result <- sample_posterior(m = sim_data %>% rename(Subset = subset_qa_qs_qn_x_n),
                                successes_field = "successes", totals_field = "n")
sim_hdi <- estimate_hdi(r = stan_result) %>%
    separate(subset, into = c("subset", "qa", "qs", "qn", "x", "n"), sep = "_") %>% 
    mutate(n = as.numeric(n)) %>% 
    mutate(obs.range = obs.max - obs.min) %>% 
    mutate(hdi.range = posterior.hdi.high - posterior.hdi.low)

sim_hdi %>% head
write_csv(sim_hdi, path = "sim_hdi.csv")

Simulation results

By comparing the expected vs. observed success rates (\(\rho\)) from our simulated dataset, we can confirm that our simulation has indeed produced the intended variability:
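The original plot is not reproduced here; a minimal ggplot2 sketch of that comparison, assuming the sim_data table built above, could look like:

# not run
library(ggplot2)

# expected vs. observed rho; points off the diagonal reflect replicate noise
ggplot(sim_data, aes(x = expected_rho, y = observed_rho)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  geom_point(alpha = 0.3) +
  labs(x = "Expected success rate", y = "Observed success rate")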

We can also inspect whether our HDIs respond to the variables of interest, that is, subset size and replicate variability:
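As a sketch (assuming sim_hdi has one row per subset, with the hdi.range, obs.range and n columns computed above), one could plot HDI width against both variables:

# not run
# wider HDIs are expected for small subsets and for subsets with high
# replicate-to-replicate variability (large obs.range)
ggplot(sim_hdi, aes(x = n, y = hdi.range, colour = obs.range)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  labs(x = "Subset size (n)", y = "HDI width", colour = "Observed range")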

And we can confirm that our HDIs contain the simulated per-replicate probabilities by inspecting a few subsets:
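As a sketch of such a spot check (again assuming the sim_data and sim_hdi objects built above, with the column names used so far), one could overlay per-replicate observed rates on the estimated intervals for a handful of subsets:

# not run
# HDIs (error bars) vs. per-replicate observed recall (points) for 6 subsets
subset_ids <- paste0("S", 1:6)
sim_hdi %>% 
  filter(subset %in% subset_ids) %>% 
  ggplot(aes(x = subset)) +
  geom_errorbar(aes(ymin = posterior.hdi.low, ymax = posterior.hdi.high), width = 0.2) +
  geom_point(data = sim_data %>% 
               mutate(subset = paste0("S", i)) %>% 
               filter(subset %in% subset_ids),
             aes(y = observed_rho), colour = "firebrick", alpha = 0.7) +
  labs(x = "Subset", y = "Recall")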

Finally, we can extend the evaluation to all subsets and quantify what proportion of HDIs contain all 5 replicate observations (this should be close to 95%):
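A minimal way to compute that proportion, assuming the HDI bounds and the per-subset obs.min/obs.max columns in sim_hdi are on the same success-rate scale, is sketched below; a subset counts as covered when its full range of replicate observations falls inside the HDI:

# not run
sim_hdi %>% 
  mutate(covered = posterior.hdi.low <= obs.min & posterior.hdi.high >= obs.max) %>% 
  summarise(prop_covered = mean(covered))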