Sixty-second Take: Is the QTN Paradigm Rotten to the Core?

In January, I published a paper in Nature Communications with Sudhir Kumar and colleagues on the surprisingly poor performance of predictive algorithms for regulatory variant annotation.  I happen to think it is one of my more important scientific contributions, yet despite it being highlighted by the editors, the early Altmetric returns are as underwhelming as the predictions themselves.  Perhaps this is because everyone already knows that regulatory prediction sucks, so the surprise factor is not there; or maybe because everyone is so invested in the QTN paradigm being true that they don't want to hear it.  Either way, I hope interest picks up.

The paper arises from a collaboration with Sudhir as part of the NHGRI’s NoVa Consortium of investigators involved in fine-mapping of causal variants through a combination of computational and experimental approaches.  Our studies involve association mapping of eQTL under the assumption of multiple causal variants for each transcript, incorporation of evolutionary and functional covariates into the modeling, and single-cell CRISPR experimental validation.  We started with the assumption that published scores that purport to predict pathogenicity work well, expecting to refine credible intervals to a small number of candidate polymorphisms for each gene.  

Not so fast, my friend!  Turns out, first, that they do not do so well, and second, that most of the signal they pick up is evolutionary conservation anyway, so the scope for improvement is limited.  I'll get to a third, more profound, implication as this Take develops.  Most of the work of the paper was done by Li Liu at Arizona State and Max Sanderford at Temple, with contributions from Ravi Patel and Pramod Chandrashekar at both places.  It is published as Nature Communications volume 10, article 330 (January 2019): "Biological relevance of computationally predicted pathogenicity of noncoding variants".

Our starting point was six published methods, each of which claims high accuracy in predicting the functional impact of human polymorphisms, with 'areas under the curve' greater than 80%, a crude hint that they have good sensitivity and specificity.  That is, the scores identify most of the causal variants while making few mistakes.  The scores are CADD, CATO, DeepSEA, Eigen, GWAVA, and LINSIGHT, and they have been cited several thousand times collectively, just in the past few years, as justification for conclusions regarding the roles of variants in gene regulation.  To be fair, CADD's performance for coding-region variants does seem to be rather good, and the vast majority of high CADD scores are coding; and the others have considerable intuitive appeal, or deploy sophisticated machine learning methods that Gen-Xers love.
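It is worth being precise about what that 'area under the curve' number measures.  AUC is just the probability that a randomly chosen pathogenic variant is scored above a randomly chosen neutral one, which can be computed directly as a rank statistic.  Here is a minimal sketch of that calculation; the simulated score distributions are my own illustrative assumptions, not data from any of the six tools.

```python
# Sketch: AUC as the Mann-Whitney rank statistic.  The two score
# distributions below are made up purely for illustration.
import random

def auc(pos_scores, neg_scores):
    """P(random positive scores above random negative); ties count half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

random.seed(0)
# Hypothetical 'pathogenic' scores shifted up relative to 'neutral' ones,
# with the shift chosen so the separation lands near AUC ~ 0.8.
pathogenic = [random.gauss(1.2, 1.0) for _ in range(1000)]
neutral    = [random.gauss(0.0, 1.0) for _ in range(1000)]
print(f"AUC = {auc(pathogenic, neutral):.2f}")
```

Note that nothing in this statistic depends on how many neutral variants surround each pathogenic one, which is exactly why a high AUC can coexist with dismal performance in a credible interval, as discussed below.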

Here’s the problem.  Everyone is comfortable now with the notion that the majority of GWAS hits are regulatory.  It is not hard to see how polymorphisms that affect binding of transcription factors, or influence epigenetic modifications, or mediate splicing, or structurally alter chromatin, could adjust the levels of transcript abundance.  Hence, the task before us is to work out exactly which polymorphisms in a regulatory region are responsible for both the gene expression and trait association at each locus: that is, to fine map the joint eQTL and GWAS hits.  The standard paradigm is that each QTL can be reduced to a QTN, a single quantitative trait nucleotide.  Since each causal variant is usually accompanied by dozens or even a hundred or more other variants in high linkage disequilibrium that give similar statistical signals, the sense is that incorporating functional and/or evolutionary priors should help pinpoint the causal variant in each credible interval.  If the univariate associations across an interval look like the panel on the left at the CPVL locus, that seems very reasonable; but if they look like the one on the right at AMFR (which is often the case) then you can see the task is a bit more complicated.  By the way, you can scan such profiles for the blood transcriptome at our new eQTLHub Shiny app put together by Biao Zeng.

Anyway, Li and Max first asked whether the six tools were any good at discriminating among the four possible alleles at each position in the genome.  Turns out that four of them are site-specific, meaning that they make no attempt to do so and report the same score for all four alleles.  They were designed to discriminate among sites, so cannot tell whether the A or the G is more likely to be pathogenic.  The other two tools, CADD and DeepSEA, do assign scores to each allele, but they might as well not bother.  When we asked whether they could discriminate between highly likely pathogenic and highly likely near-neutral alleles, their performance was barely better than guessing.  One of the reviewers gave us a really hard time about this over two review cycles, and forced us into a lot more analysis that is buried in the supplement but definitely adds to the rigor of the conclusions, since it shows that how we define pathogenicity does not really matter.  The way we did it, similar to but more conservative than CADD's approach, was to contrast sites that are common in the human genome but not associated with any traits (the neutral set) with those that are never found in other primates and hence subject to deleterious selection (the pathogenic set).  There are a lot of details in the paper, but the bottom line is that the predicted scores do not discriminate among alleles at a position.

Next we wondered whether the scores could discriminate among sites in a credible interval.  For this, we switched our definition of pathogenicity to be based on a set of 764 disease-associated variants (DAVs) from the Human Gene Mutation Database (HGMD) that pass very strict filters reminiscent of the ACMG guidelines for clinical interpretation of coding variants.  Our strict reviewer gave us a hard time about this as well, but again it turns out that the specifics do not much matter: none of the six scores are particularly good at picking out these DAVs from the crowd of frequency- and region-matched SNPs thought not to be deleterious, which we call common population polymorphisms (CPPs).  Not even the very rare ones.  Up to a quarter of the time, it is possible to find a CPP in the vicinity of the DAV that has a more pathogenic score than the DAV itself.  Most of the time, the difference in scores between the DAV and the CPP is too small to be biologically meaningful.  The two exceptions, unsurprisingly, are sites in the promoter, or in ultra-conserved sequence elements.  Why don’t the scores work?  In large part because it turns out that the underlying measures used by the various predictors are themselves highly correlated within up to a kilobase (or more) of any site.

As if this weren’t enough, third, we asked how the predictions stand up under the more biologically reasonable scenario where someone is trying to pick out the DAV from the entire credible interval, rather than from just a single matched site.  It is not well enough appreciated that positive predictive value, also known as precision, is highly sensitive to the ratio of cases to controls (in this situation, causal variants to innocent polymorphisms).  Consequently, matching the presumptive DAV against tens or a hundred candidates drastically reduces predictive performance relative to matching it against just one alternative.  We conclude that future improvements and real-world application of predictive algorithms need to be sensitive to this pervasive bias.
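The arithmetic here is simple but brutal, so it is worth spelling out.  Precision is TP / (TP + FP), and the false positives scale with the number of innocent polymorphisms in the interval.  Here is a minimal sketch, using an assumed operating point of 80% sensitivity and 80% specificity (my own illustrative numbers, not figures from the paper):

```python
# Sketch (illustrative numbers, not from the paper): how precision collapses
# as the ratio of causal variants to linked non-causal polymorphisms grows,
# even when sensitivity and specificity stay fixed.

def precision(sensitivity: float, specificity: float,
              n_causal: int, n_neutral: int) -> float:
    """PPV = TP / (TP + FP) for a classifier at a fixed operating point."""
    tp = sensitivity * n_causal            # causal variants correctly flagged
    fp = (1.0 - specificity) * n_neutral   # innocent variants wrongly flagged
    return tp / (tp + fp)

# One causal variant vs a single matched control:
print(f"1 vs 1:   PPV = {precision(0.8, 0.8, 1, 1):.2f}")    # 0.80
# One causal variant vs 100 polymorphisms in the credible interval:
print(f"1 vs 100: PPV = {precision(0.8, 0.8, 1, 100):.3f}")  # 0.038
```

Under these assumptions, the same score that is right 80% of the time in a head-to-head comparison flags roughly twenty false positives for every true causal variant once the whole LD block is in play.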

OK, then, so what is the more profound insight I want to draw from all this, and why the provocative title of the Take?  I don’t expect everyone to agree with this interpretation, but I think the parsimonious claim that QTL can be reduced to QTN must now be treated with skepticism.  Sure, there are cases where it is absolutely the case that a single site is responsible for regulating gene expression.  Careful dissection of SORT1 showed this early on, and entrenched the paradigm that molecular genetic manipulation will link causal variants to gene expression to heart disease or whatever.  But read the fine details: even there, it is not as though the one site explains all of the effect.  And I well recall Cathy Laurie’s heroic efforts to dissect one of the best-studied Drosophila loci three decades ago, before human genetics came to dominate popgen: the Adh fast-slow allele, which turned out to involve both an amino acid change and a regulatory change, or two, or what Facebook would call “it’s complicated”.

What evidence is there against the counter-proposition that in fact haplotype effects are just that, the cumulative influence of multiple variants in the linkage disequilibrium block? This is a little different from recognizing that several independent sites in low LD with one another give rise to multiple GWAS signals at many loci.  It is the claim that very often what is assumed to be a single causal variant for each high LD block is in fact multiple causal variants.  If several sites have the same statistical signal, and similar evolutionary conservation, and similar profiles of DNase hypersensitivity or chromatin accessibility or association with methylation, then why shouldn’t they all be contributing?  Experimental assays can be impressive and are becoming more high-throughput, but I doubt they have the resolution to pick out the subtle effects that influence gene activity and that natural selection can see.  The possibility that soft selection acting on multiple functional variants is responsible for association profiles such as the one at the top right needs to be taken more seriously. Time to question the one-association, one-nucleotide paradigm.


This month’s Freedom Watch is a shout out to the people of Venezuela.  Self-determination is a right primarily of people, not of nation-states or those who claim to represent them.  Let’s hope the aid gets through.  People before politics.

Source: genomestake