OMG, it’s been quite a while! As I am writing these lines, I feel like I’ve gone back in time! Much has happened since 2015. Mostly, I’ve gone French and YouTube, arguably quite successfully. Also, last year, I published my first book, on Bayesianism, a topic I have become very fond of. The book is still only in French, but the English version should hopefully be coming out soon. Meanwhile, here’s a post about an important application of Bayes rule which, weirdly enough, didn’t make it into my book.
I believe that what we’re going to discuss here is a relatively neglected aspect of science, even though it seems absolutely critical. Namely, we’re going to discuss the reliability of its results. It may seem wrong to question this these days, as a lot of pseudosciences seem to be gaining in popularity. But actually, I’d argue that the rise of pseudosciences is all the more reason why we should really do our best to better understand what makes science valid, or not. In fact, many scholars have raised huge concerns about the validity of science before me, as wonderfully summed up by this brilliant Veritasium video:
Many scientists even acknowledge that science is undergoing a replication crisis, which has led to many, perhaps even most, scientifically published results being wrong. Here, we’ll aim at clarifying the reasons for this replication crisis, and at underlining what ought to be done to solve it.
In fact, I first thought that I should write something very serious about it, like for a research journal or something. But this quickly frustrated me. This is because the simpler, though very approximate, explanations actually seemed much more pedagogically useful to me. In this article, I’ll present a very basic toy model of publication, which relies on lots of unjustified assumptions. But arguably, it still allows us to pinpoint some of the critical points to keep in mind when judging the validity of scientifically published results.
The fundamental equation (of a toy model)
The scientific process is extremely complicated and convoluted. After all, we’re dealing here with the frontier of knowledge. A lot of expertise is involved, peer review plays a big role, as well as journals’ politics, trends within science and so on. Modeling it all would be a nightmare! It would basically boil down to doing all of science better than scientists themselves. Understanding how science really works is extremely hard and complicated.
This is why we’ve got to simplify the scientific process to have a chance to say something nontrivial about it, even though caveats will apply. This is what I’ll propose here, with a very basic toy model of publication. We consider three events in the study of a theory. First, we assume that the theory may be true, denoted $T$, or not, which we denote $\neg T$. Second, we assume that the theory is tested, typically through some statistical test. This test may yield a statistically significant signal $SS$. We shall only consider the case where statistically significant signals are signals of rejection of the theory. Third and finally, we assume that this will lead to a published result $PR$ or not.
Moreover, for simplicity, we assume that only statistically significant signals get published. Thus, only two chains of events lead to publication, namely $T \rightarrow SS \rightarrow PR$, or $\neg T \rightarrow SS \rightarrow PR$. Since published results are rejections of the theory, the validity of the published results corresponds to the probability of theory $\neg T$, given that the results have been published. In other words, the validity is measured by $\mathbb P[\neg T|PR]$. Published results are highly believable if $\mathbb P[\neg T|PR]$ is large.
To simplify computation, we shall rather study the odds of $\neg T$ given publication, i.e. the quantity $validity = \mathbb P[\neg T|PR] / \mathbb P[T|PR]$. Bayes rule, combined with our previous assumptions, then yields:
$$validity = \frac{\mathbb P[PR|\neg T]}{\mathbb P[PR|T]} \frac{\mathbb P[\neg T]}{\mathbb P[T]} = \frac{\mathbb P[PR|\neg T,SS]}{\mathbb P[PR|T,SS]} \frac{\mathbb P[SS|\neg T]}{\mathbb P[SS|T]} \frac{\mathbb P[\neg T]}{\mathbb P[T]}.$$
This equation is really the heart of this article. But it may be hard to understand if you are not familiar with statistical tests and conditional probabilities. So let’s replace the weird and complicated mathematical notations by words that roughly convey their meaning.
Namely, we define $prior = \mathbb P[\neg T] / \mathbb P[T]$, $power = \mathbb P[SS|\neg T]$, $threshold = \mathbb P[SS|T]$, $clickbait = \mathbb P[PR|T,SS] / \mathbb P[PR|\neg T,SS]$. With these new quantities, we finally obtain the equation of the validity of science, according to our toy model of publication:
$$validity = \frac{power \cdot prior}{threshold \cdot clickbait}.$$
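To make this more concrete, here is a minimal Monte Carlo sketch of the toy model, in Python, with purely illustrative (made-up) numbers for the four quantities; it simply runs the chain $T \rightarrow SS \rightarrow PR$ many times and checks that the simulated odds match the formula.

```python
import random

# Purely illustrative (made-up) numbers for the four quantities of the toy model.
P_NOT_T = 0.2               # P[not T]: probability that the tested theory is false
POWER = 0.5                 # P[SS | not T]
THRESHOLD = 0.05            # P[SS | T]
P_PR_GIVEN_T_SS = 0.9       # P[PR | T, SS]: surprising rejections of true theories get published more
P_PR_GIVEN_NOT_T_SS = 0.3   # P[PR | not T, SS]

def one_study():
    """Run one study through the chain T -> SS -> PR."""
    t_is_false = random.random() < P_NOT_T
    significant = random.random() < (POWER if t_is_false else THRESHOLD)
    if not significant:
        return t_is_false, False  # no significant signal, hence no publication
    p_publish = P_PR_GIVEN_NOT_T_SS if t_is_false else P_PR_GIVEN_T_SS
    return t_is_false, random.random() < p_publish

correct_rejections = wrong_rejections = 0
for _ in range(1_000_000):
    t_is_false, published = one_study()
    if published:
        if t_is_false:
            correct_rejections += 1  # T was indeed false
        else:
            wrong_rejections += 1    # T was true, yet its rejection got published

prior = P_NOT_T / (1 - P_NOT_T)
clickbait = P_PR_GIVEN_T_SS / P_PR_GIVEN_NOT_T_SS
print("simulated validity:", correct_rejections / wrong_rejections)   # ~0.83
print("formula validity:  ", POWER * prior / (THRESHOLD * clickbait)) # ~0.83
```

With these made-up numbers, the validity odds are about 0.83, meaning that slightly more than half of the published rejections actually target theories that are true.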
Now, the rest of this article will be a discussion of this equation. In particular, we shall discuss the validity of the terminology, the likely values of the terms (depending on fields) and what ought to be done to increase this quantity (and whether that really is a desirable goal).
Classical statistics terms
Let’s start with the easiest notion, which is that of $threshold$. The threshold discussed here will typically be that of the p-value statistical test. Under standard assumptions, it can easily be shown that the threshold is the probability $\mathbb P[SS|T]$ of a statistically significant signal assuming $T$ true. In many areas of science, it is chosen to be 5%. Some scientists argue that it should be brought down to 1%. This would indeed be a great way to increase the validity of science. All else being equal (which is definitely unrealistic though), this would multiply the validity of science by 5!
A very important caveat though is that there may be reasons why $\mathbb P[SS|T]$ may sometimes get larger than the threshold of the statistical test. Namely, there may be imperfections in measurement devices that have gone unnoticed. This increases the likelihood of a statistically significant signal despite the theory being true. In fact, it may be much worse because of additional biases, as explained in this paper by Professor John Ioannidis. These biases may typically be large when financial or publishing incentives for researchers are huge and lead to distortion or misconduct. Also, such biases may result from a flawed application of statistical tools. Because of all of this, dividing the p-value threshold by 5 will probably not actually divide the value of $threshold$ by 5. Depending on fields, this quantity might for instance remain at 10%.
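As a rough numerical sketch of this point (the functional form below is a simplifying assumption, loosely inspired by the bias parameter in Ioannidis’s paper): suppose a fraction of the analyses that should have come out negative end up significant anyway.

```python
def effective_threshold(alpha, bias):
    """P[SS|T] when a fraction `bias` of the analyses that should have been
    negative turn out significant anyway (a simplifying assumption, loosely
    inspired by Ioannidis's bias parameter)."""
    return alpha + bias * (1 - alpha)

for alpha in (0.05, 0.01):
    for bias in (0.0, 0.05, 0.10):
        print(f"alpha={alpha}, bias={bias}: P[SS|T] ~ {effective_threshold(alpha, bias):.3f}")
# With a 10% bias, lowering alpha from 5% to 1% only moves P[SS|T]
# from about 0.145 to about 0.109: nowhere near a division by 5.
```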
Another well-known term in this equation is the $power$ of the statistical test. Unfortunately, this is also a trickier notion to pinpoint, especially in the setting we define. Recall that it is equal to the probability $\mathbb P[SS|\neg T]$ of a statistically significant signal, assuming $T$ false. The problem is that $\neg T$ is often not much of a predictive theory. It will typically roughly correspond to some proposition of the form $x \neq 0$, where $x$ is a parameter of a model.
For this reason, in some studies, the negation $\neg T$ of the theory to be tested is replaced by some alternative hypothesis. Let’s call it $A$. The trouble is that if $A$ is very close to $T$, typically if it assumes $x = \epsilon$, then $A$ will actually make mostly the same predictions as $T$. In particular, it will be both difficult and not very useful to reject $T$, since it would mean replacing it with a nearly identical theory.
In fact, instead of a single alternative $A$, a more Bayesian approach would usually consider an infinite number of alternatives $A$ to theory $T$. In such cases, it is arguably quite clear that $T$ is not that likely, since it is in competition with infinitely many other alternatives! Moreover, to structure the competition fairly, instead of assuming that all alternatives are on an equal footing, Bayesianism will usually rather add a prior distribution over all alternatives, and will rather puzzle over how this prior distribution is changed by the unveiling of data. Evidently, much more needs to be said on this fascinating topic. Don’t worry. The English version of my book on this topic is forthcoming…
Now, it might be relevant to choose $A$ to be the most credible alternative to $T$, or something like this. This is what has been proposed by Professor Valen Johnson. Using an alternative theory somewhat based on this principle, and numerous disputable simplifying assumptions, Johnson estimated that, for a p-value threshold at 5% and p-values of the order of 1 to 5%, the ratio $power/threshold$ was somewhere between 4 and 6 for well-designed statistical tests. For a p-value threshold at 1%, this quantity would be around 14.
However, given our previous caveats about the value of $threshold$, it seems that we should rather regard the ratio $power/threshold$ to be around, say, 3 for a 5% threshold, and around, say, 7 for a 1% threshold. These quantities are extremely approximate estimates. And of course, they should not be taken as is.
The prior
We can now move on to the most hated term of classical statistics, namely the concept of $prior$. Note that a Bayesian prior is usually defined as $\mathbb P[\neg T]$, not as the prior odds $prior = \mathbb P[\neg T]/\mathbb P[T]$. Please forgive this abuse of terminology, whose purpose is to simplify as much as possible the understanding of the fundamental equation of this article. For the purpose of this article (and this article only!), $prior$ is thus a real-valued number between 0 and $\infty$.
But what does it represent? Well, as the equation $prior = \mathbb P[\neg T]/\mathbb P[T]$ says, $prior$ quantifies how much more likely $\neg T$ seemed compared to $T$ before collecting any data. The larger $prior$ is, the more believable $\neg T$ seemed prior to the study.
Now, if you’re not a Bayesian, it may be tempting to argue that there’s no such thing as a prior. Or that science should be performed without prejudice. There may be a problem with the connotation of words here. In fact, prior is a synonym (with opposite connotation) of “current state of knowledge”. And it seems irrational to analyze data without taking into account our “current state of knowledge”. In fact, many statisticians, for instance here, argue that the current problem of statistical tests is that they don’t sufficiently rely on our current understanding of the world. Statistical analyses, they argue, must be contextualized.
To give a clear example where prior-less reasoning seems to fail, let’s look at parapsychology, which may test something like precognition. The reason why parapsychology papers do not seem to yield reliable results usually has nothing to do with the method they apply. Indeed, they usually apply the supposedly “gold standard” of science, namely double-blind randomized controlled trials with a p-value statistical test. You cannot blame them for their “scientific method”.
The reason why, despite the rigor of their method, parapsychologists nevertheless obtain results that do not seem trustworthy is because of the prior. In fact, parapsychologists mostly test very likely theories. In other words, they choose to test theories whose $prior$ is near-zero. Yet, if the $prior$ is near-zero, according to our fundamental equation, then the validity of the published result will be near-zero too.
It is worth putting this in perspective, including with respect to the value of $threshold$. In particle physics, this $threshold$ is extremely low. It is at around $10^{-7}$ (and if there’s no experimental error, this would correspond to $power/threshold \approx 10^5$). This sounds like it should be sufficient. Well, not necessarily. Given that Bayes rule is multiplicative, if a theory is really wrong, it will become exponentially unbelievable with respect to the amount of collected data. In other words, by collecting, say, thousands of data points, one could get to a credence of $10^{-100}$ for a given very wrong theory. Put differently, the scales of the term $prior$ may be very different from those of $threshold$ or $power$. This is why $prior$ arguably plays a much bigger role in judging the validity of scientifically published results than $threshold$.
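Here is a tiny numerical illustration of this multiplicative effect (the per-data-point likelihood ratio of 1.3 is a made-up assumption): the credence in a wrong theory collapses exponentially with the number of independent data points.

```python
# Made-up assumption: each independent data point is 1.3 times more likely
# under the alternatives than under the (wrong) theory T.
likelihood_ratio = 1.3
prior_odds_against_T = 1.0  # start undecided

for n in (10, 100, 1000):
    posterior_odds = prior_odds_against_T * likelihood_ratio ** n
    credence_in_T = 1 / (1 + posterior_odds)
    print(f"{n} data points: credence in T ~ {credence_in_T:.1e}")
# 1000 such data points push the credence in T down to about 1e-114,
# a scale that no p-value threshold comes close to.
```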
Add to this the very nonnegligible probability of experimental error, in which case $power/threshold$ may actually be of the order of 100, and we can see why the discovery of faster-than-light neutrinos was promptly rejected by physicists.
In particular, in such a case, the quantity $power \cdot prior / threshold$ may actually be smaller than 1. In fact, in cases where a threshold of 1% is used, it suffices that the prior odds of $\neg T$ be 1 to 10, which arguably may not be absurd in, say, clinical tests of drugs, for $power \cdot prior / threshold$ to be smaller than 1. This would mean that despite publication arguing for $\neg T$, the theory $T$ is still more credible than $\neg T$. Or put differently, the publication is more likely to be wrong.
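To make the arithmetic explicit with the rough (and disputable) numbers from above, namely $power/threshold \approx 7$ for a 1% threshold and prior odds of 1 to 10:

$$\frac{power}{threshold} \cdot prior \approx 7 \times \frac{1}{10} = 0.7 < 1,$$

so even before accounting for any publication bias, the odds would tip in favor of $T$ despite the published rejection.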
So, you may ask, why on earth would we try to reject theories $T$ with large priors? Isn’t it the root cause of most publications being false? Yes it is. But this doesn’t seem to be a bug of science. It actually seems to be a feature of science. Typically, scientists are often said to have to test theories over and over, even though such theories have been supported so far by a lot of evidence. This means that they will have to test likely theories, which means that the $prior$ of rejection is very small.
There is a more practical reason why scientists may want to reject theories that seem very likely prior to analysis: these are the results that are often the most prestigious. Such results, if they hold, will be much more cited and celebrated. And of course, this gets even worse when scientists have motives, such as clearing the use of some drug. This can be heavily aggravated by p-hacking, which can be regarded as the scaling or automation of the analysis of true theories. This scaling or automation may hugely increase the number of true theories to be tested, with the guarantee that statistically significant results will eventually be found. And this leads to the final problematic piece of the puzzle: clickbait.
Clickbait
So let’s discuss the $clickbait$ term, which may also be called the publication bias. Recall that $clickbait = \mathbb P[PR|T,SS]/\mathbb P[PR|\neg T,SS]$. In other words, it is about how much more likely it is to publish the rejection of a true theory $T$, as opposed to that of a false theory $T$, given the same amount of statistical significance.
It may seem that the $clickbait$ term should equal 1. After all, from far away, it may seem that the publication peer-review process is mostly about judging the validity of the analysis. Indeed, it is often said that peer review mostly aims at validating results. But that seems very far from how peer review processes actually work. At least from my experience. And from the experience of essentially all scientists I know.
In fact, I’d argue that much of peer review has nearly nothing to do with validating results. After all, quite often, reviewers do not bother to ask for the raw data, nor will they redo the computations. Instead, I’d argue that peer review is mostly about judging whether a given paper is worthy of publication in this or that journal. Peer review is about grading adequately the importance or the value of such or such findings.
This is why, at peer review, not all theories are treated equally, even when they have the same statistical significance. In particular, the rejection of a theory that really seemed true is much more clickbaity than the rejection of a theory that did not seem true. This is why $clickbait$ is probably larger than 1, if not much, much, much larger than 1.
Note that this will be particularly the case for journals that receive a large number of submissions. Indeed, the more submissions there are, the more room there is for publication bias. This arguably partly explains why Nature and Science are particularly prone to the replication crisis and to paper retractions.
The focus on p-values seems to be aggravating the clickbait problem. Indeed, by removing other considerations that could have favored the rejection of false theories rather than that of true theories, p-values may be serving as an excuse to validate any paper with a statistically significant signal, and to then shift the discussion towards the clickbaitness of the papers whose p-value was publishable.
Unfortunately, it seems that the scales of clickbaitness may be at least comparable to that of the p-value threshold. Evidently, this probably strongly depends on the domain and the journal that we consider. Combining it all, given that I already essentially argued that $power \cdot prior / threshold$ might already be smaller than 1, we should expect $validity = (power \cdot prior) / (threshold \cdot clickbait)$ to actually be much smaller than 1. This is why, arguably, especially if the p-value is used as a central threshold for publication, we really should expect the overwhelming majority of scientifically published results to be wrong. This is not flattering news for science.
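As a purely illustrative combination of the rough numbers used throughout this article (all of them disputable), take $power/threshold \approx 3$ for a 5% threshold, prior odds of 1 to 10, and a clickbait factor of, say, 10:

$$validity \approx \frac{3 \times 0.1}{10} = 0.03, \qquad \text{i.e.} \quad \mathbb P[\neg T|PR] = \frac{0.03}{1.03} \approx 3\%.$$

With such made-up numbers, about 97% of the published rejections would target theories that are actually true.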
Should published results be more valid?
It may seem that I am criticizing scientists for preferring the test of very likely theories, and journals for favoring clickbait papers. But this is not at all the point of this article. What does seem to be suggested by our toy model of publication though, is that these phenomena undermine the validity of scientifically published results. It seems unfortunate to me that these important features of science are not sufficiently underlined by the scientific community.
Having said this, I would actually argue that science should not aim at the validity of scientifically published results. In fact, I believe that it is a great thing that scientists still investigate theories that really seem to hold, and that journals favor the publications of really surprising results. Indeed, science research is arguably not about establishing truths. It seems rather to be about advancing the frontier of knowledge. This frontier is full of uncertainty. And it is very much the case that, in the shroud of mystery, steps that seem forward may eventually turn out to be backwards.
In particular, from an information theoretic point of view, what we should aim at is not the validity of what we say, but, rather, some high-information content. Importantly, high-information does not mean true. Rather, it means information that is likely to greatly change what we thought we knew.
This discussion makes more sense in a Bayesian framework. High-information contents would typically be those that justifiably greatly modify our prior beliefs, when Bayes rule is applied. Typically, a study that shows that we greatly neglect the risk of nuclear war may be worth putting forward, not because nuclear war is argued to be very likely by the study, but perhaps because it argues that whoever gave a $10^{-15}$ probability of nuclear war in the 21st century would have to update this belief, given the study, to, say, $10^{-3}$. Such an individual would still find the study “unreliable”. Yet, the study would still have been of high information, since it would have upset the probability by 12 orders of magnitude!
Still more precisely, and still in a Bayesian framework, the question of which research ought to be undertaken, published and publicized would correspond to information that becomes useful to update strategies to achieve some goal, say, improving the state of the world. In other words, in this framework, roughly speaking, a piece of research $PR$ is worth publishing if
$$\max_{action} \mathbb E[world|action,PR] \gg \max_{action} \mathbb E[world|action].$$
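Here is a minimal value-of-information sketch of what this inequality is pointing at, with entirely made-up utilities and beliefs: a published result $PR$ is valuable when conditioning on it changes the best available decision.

```python
# Toy decision problem: should we deploy a drug? (All numbers are made up.)
UTILITIES = {
    ("deploy", "drug_works"): 10.0,
    ("deploy", "drug_fails"): -20.0,
    ("hold_off", "drug_works"): 0.0,
    ("hold_off", "drug_fails"): 0.0,
}

def best_expected_utility(p_works):
    """max over actions of the expected utility, given our belief p_works."""
    return max(
        p_works * UTILITIES[(action, "drug_works")]
        + (1 - p_works) * UTILITIES[(action, "drug_fails")]
        for action in ("deploy", "hold_off")
    )

belief_before_pr = 0.3  # prior belief that the drug works
belief_after_pr = 0.8   # belief after updating on the published result (assumed)

print("without PR:", best_expected_utility(belief_before_pr))  # 0.0 -> hold off
print("with PR:   ", best_expected_utility(belief_after_pr))   # 4.0 -> deploy
# By this (crude) criterion, PR is worth publishing: conditioning on it
# strictly improves the best achievable expected utility.
```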
The more general approach to goal-directed research publication (and, more generally, to what ought to be communicated) is better captured by the still more general framework of reinforcement learning, and especially of the AIXI framework. Arfff… There’s so much more to say!
On another note, it seems unfortunate to me that there is not a clearer line between scientific investigations and scientific consensus. In particular, both seem to deserve to be communicated, which may be done through publications. But it may be worth distinguishing interesting findings from what scientists actually believe about such or such topic. And importantly, scientific investigations perhaps shouldn’t aim at truths. They should perhaps aim at highlighting results that greatly upset readers’ beliefs — even if they don’t actually convince them.
Conclusion
Wow that was long! I didn’t think I would reach the usual length of Science4All articles. It does feel good to be back, though I will probably not really be back. You probably shouldn’t expect another new article in the next 3 years!
Just to sum up, the validity of a scientifically published result does depend on a few classical statistics notions, like the threshold of p-value tests and statistical power. And arguably, yes, it’d probably be better to lower the p-value threshold, even to, say, 0.1% as proposed by Valen Johnson. However, this does not seem sufficient at all. It seems also important to be watchful of the publication bias, also known as the clickbaitness of the paper. Perhaps most importantly, one should pay attention to the prior credence in the theory to be tested. Unfortunately, this is hard to do, as it usually requires a lot of expertise. Also, it’s very controversial because this would make science subjective, which is something that some scientists abhor.
On this note, I’d like to quote one of my favorite scientists of all time, the great Ray Solomonoff: “Subjectivity in science has usually been regarded as Evil — that it is something that does not occur in “true science” — that if it does occur, the results are not “science” at all. The great statistician, R. A. Fisher, was of this opinion. He wanted to make statistics “a true science” free of the subjectivity that had been so much a part of its history. I feel that Fisher was seriously wrong in this matter, and that his work in this area has profoundly damaged the understanding of statistics in the scientific community — damage from which it is recovering all too slowly.”
Unfortunately, overall, it is far from clear how much we should trust a scientifically published result. This is very context-dependent. It depends on the theory, the area of research, the peer review process, the policy of the journal and the current state of knowledge. Science is complex. Nevertheless, the lack of validity of scientifically published results need not undermine the validity of the scientific consensus. In fact, in my book, I argue through a Bayesian argument that, in many contexts, the scientific consensus should be given a much greater credence than any scientific publication, and than any scientist. Indeed, especially if this consensus is fed with a lot of data that are likely to change the scientists’ priors, it can be argued that the opinion of the scientific community gets updated in a similar manner as the way Bayes rule requires priors to be updated (this is also related to things like multiplicative weights update or the Lotka-Volterra equations).
But more on that in a book whose English version is forthcoming…