
Image by Guy Beauchamp.

What Keeps a Bayesian Awake At Night? Part 2: Night Time

foundations theory

By Wessel Bruinsma, Andrew Y. K. Foong, Richard E. Turner

The theory of subjective probability describes ideally consistent behaviour and ought not, therefore, be taken too literally.

Leonard Jimmie Savage (1917–1971)

In the first post in this series, we laid out the standard arguments that we and many others have used to support the edifice which is Bayesian inference. In this post, we identify the weaknesses in these arguments that cause us to lose sleep at night.

We’ve split these weaknesses into three types: first, weaknesses in the standard mathematical justifications for the probabilistic approach; second, weaknesses arising in practice from the modelling stage; and, third, weaknesses arising from realising the inference stage due to computational constraints.

Weakness 1: Standard justifications have problems

We have seen that probability theory and Bayesian decision theory are usually justified in one of several ways, but do these standard justifications really stand up to scrutiny? Let’s go through each of the seven arguments in turn.

(1) de Finetti’s exchangeability theorem presupposes a probability distribution over the data and shows this naturally leads to distributions over parameters. However, it does not justify the use of probability in the first place.
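For reference, one common statement of the representation theorem is sketched below. It assumes an infinitely exchangeable sequence $x_1, x_2, \ldots$ and suitable regularity conditions; the symbols $x_i$ and $\theta$ are notation chosen here for illustration.

```latex
p(x_1, \ldots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \theta) \, p(\theta) \, \mathrm{d}\theta
```

The left-hand side is already a probability distribution over the data. The theorem tells us what follows from having one, not why we should have one.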

(2) Cox’s theorem does justify the probabilistic approach, but making the argument watertight turns out to be far more delicate than textbooks such as Jaynes (2003) or Bishop (2006) would have you believe. Making the theorem mathematically rigorous requires additional technical assumptions that muddy the clarity of the argument. As Paris (1994, p. 24) puts it: “[W]hen an attempt is made to fill in all the details some of the attractiveness of the original is lost.”1 Moreover, there remains disagreement about the desirability of several of Cox’s theorem’s other assumptions.2 Perhaps the most controversial assumption is that plausibilities are represented by real numbers. As a consequence, for any two propositions, one of the two must have higher (or equal) plausibility than the other. (This is called universal comparability.) But what if we are truly ignorant about two matters, or have not yet formed an opinion? In that case, is it reasonable to require that the plausibilities assigned to the matters are necessarily comparable?3

(3) The Dutch book argument tells us that any coherent actor should use probability to express their beliefs, but the argument leaves open how the actor should update their beliefs in light of new evidence (Hacking, 1976). This is because the standard Dutch book setup is static: it does not involve a step where beliefs are updated on the basis of new information. That Bayes’ rule should be used for this purpose requires additional assumptions. Dynamic variants of the Dutch book argument attempt to fix this flaw, but again the force of the argument is diminished and open to criticism (Skyrms, 1987).4
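The static setup can be illustrated with a minimal sketch. The events, prices, and the helper `agent_payoff` are hypothetical choices made for this example: an agent states a price it regards as fair for a bet paying 1 on each of two mutually exclusive and exhaustive events, and a bookie sells it both bets.

```python
def agent_payoff(prices, outcome):
    """Net payoff to an agent who buys a unit-stake bet on every event.

    prices maps each event to the price the agent considers fair for a
    bet that pays 1 if that event occurs. The events are assumed to be
    mutually exclusive and exhaustive, so exactly one bet pays out.
    """
    winnings = sum(1.0 for event in prices if event == outcome)
    cost = sum(prices.values())
    return winnings - cost


# Hypothetical beliefs: the first set of prices sums to 1.2, the second to 1.
incoherent = {"rain": 0.6, "no rain": 0.6}
coherent = {"rain": 0.6, "no rain": 0.4}

incoherent_payoffs = {o: agent_payoff(incoherent, o) for o in incoherent}
coherent_payoffs = {o: agent_payoff(coherent, o) for o in coherent}

print("incoherent:", incoherent_payoffs)  # a loss whichever event occurs
print("coherent:  ", coherent_payoffs)    # no guaranteed loss
```

Nothing in this setup involves observing data and revising the prices afterwards, which is why the static argument alone says nothing about Bayes’ rule.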

(4) Savage’s theorem guarantees a nearly unique loss function (unique up to an affine transform) and a unique probability, which together form the expected loss. The troubling aspect of Savage’s theorem is that the constructed loss function depends only on the outcome of a decision, and Savage’s axioms imply that outcomes of decisions have a value which is independent of the state of the world (Karni, 2005). As Wakker & Zank (1998) remark, this disentanglement can be undesirable. They give the example of health insurance, where the value of money depends on sickness or health. If the loss function is allowed to depend on other aspects of the world, then the probability constructed by the theorem is no longer unique (Karni, 2005). For this reason, Karni (2005) argues that the probability constructed by the theorem is arbitrary and thus cannot realistically represent the decision maker’s beliefs.

(5) Doob’s consistency theorem, (6) the optimality of Bayesian predictions, and (7) Wald’s theorem are all only unit tests: how reassured should we really be that Bayesian inference passes them? It is not clear that the guarantees on the optimality of Bayesian predictions are the guarantees you care about. For example, typically we’re faced with a single dataset and care only about performance on it alone, rather than the average performance across many potential datasets. Similarly, the admissibility of a decision rule in Wald’s theorem is desirable but not sufficient: indeed, there are admissible estimators that are not reasonable.5 Doob’s consistency theorem also suffers from subtle theoretical issues (Diaconis & Freedman, 1986), which we discuss below in the context of model mismatch.

Take away. It is clear, then, that uniquely and precisely justifying the three-stage Bayesian approach via a single argument is much more delicate than many would have you believe. However, these issues don’t trouble our sleep. That is partly because it is reassuring that so many and such diverse arguments suggest that it is a sensible approach.6 However, it is also because there are far bigger issues to worry about.

Weakness 2: Modelling is hard and inferences are sensitive to innocuous details

All practitioners of Bayesian inference will know well that beliefs are typically only roughly encoded into probabilistic models. There are at least three good reasons for this. First, it is often hard to really pin down precisely what you believe: What tail behaviour should a variable have? Are there latent variables at play? Et cetera. Second, even when you do have an ideal model in mind, it can be mathematically challenging to accurately translate that into a probability distribution. Third, many modelling choices are often based on convenience such as mathematical tractability.

Should we be worried about this? Surely roughly encoding our prior knowledge is sufficient?

Unfortunately, seemingly small or irrelevant inaccuracies in the model can greatly affect the posterior and therefore downstream decision making. For example, Diaconis & Freedman (1986) show that “in high-dimensional problems, arbitrary details of the prior really matter”. Similarly, Kass & Raftery (1995) conclude that, in the context of Bayesian model comparison, “[t]he chief limits of Bayes factors are their sensitivity to the assumptions in the parametric model and the choice of priors.” Indeed, for this reason, the textbook presentation of model comparison, such as that in David MacKay’s excellent textbook, should be seen as only a pedagogical depiction of Bayesian inference at work, rather than an approach that will bear practical fruit: discrete Bayesian model comparison does not work in practice.7 As a more recent example of the importance of priors in high-dimensional settings, experiments by Wenzel et al. (2020) suggest that the usual choice of Gaussian prior8 for Bayesian neural network models contributes to the cold posterior effect, in which the Bayesian posterior is outperformed on prediction tasks by strongly tempered versions.
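The sensitivity of Bayes factors to the prior can be seen in a toy example. The sketch below is an illustration constructed for this purpose and is not taken from the cited papers: data are assumed to be Gaussian with unit variance, the null model fixes the mean at zero, and the alternative places a zero-mean Gaussian prior with standard deviation `tau` on the mean. The names `normal_pdf` and `bayes_factor_01` and all numerical values are hypothetical.

```python
import math


def normal_pdf(x, var):
    """Density of a zero-mean Gaussian with variance var, evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def bayes_factor_01(xbar, n, tau):
    """Bayes factor in favour of H0 (mean = 0) over H1 (mean ~ N(0, tau^2)).

    With unit-variance Gaussian data, the sample mean xbar of n points is
    sufficient: it is N(0, 1/n) under H0 and N(0, tau^2 + 1/n) under H1.
    """
    return normal_pdf(xbar, 1.0 / n) / normal_pdf(xbar, tau**2 + 1.0 / n)


# Hypothetical data summary: 100 observations with sample mean 0.25.
n, xbar = 100, 0.25
taus = [0.1, 1.0, 10.0, 100.0]
factors = [bayes_factor_01(xbar, n, tau) for tau in taus]

for tau, bf in zip(taus, factors):
    verdict = "favours H0" if bf > 1.0 else "favours H1"
    print(f"tau = {tau:6.1f}   BF01 = {bf:8.3f}   ({verdict})")
```

The data are identical in every row. Only the width of the prior under the alternative changes, and in this example that alone is enough to flip which model the Bayes factor prefers.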

Can theory about the performance of the Bayesian approach in the face of model misspecification act as a comfort blanket? There are theorems, analogous to Doob’s consistency theorem in the well-specified case, that describe situations when the Bayesian posterior will still concentrate on the true parameter value even when the model is misspecified (Kleijn & van der Vaart, 2006; De Blasi & Walker, 2013; Ramamoorthi et al., 2015), but, as Grünwald & van Ommen (2017) point out9, “[these theorems hold] under regularity conditions that are substantially stronger than those needed for consistency when the model is correct.”10

Theoretical understanding of the consequences of model misspecification is not yet available to rescue us. The probabilistic approach to modelling therefore poses an unsettling dichotomy: on the one hand, you’re free to choose the prior, because it is your belief, your prior; but on the other hand, you should choose your prior absolutely right, because seemingly small or irrelevant changes can greatly affect the conclusions.

A way forward? This weakness of the Bayesian approach used to keep us awake at night. However, taking a slightly different perspective cures our insomnia. The conventional view dogmatically insists that Bayesians should initially and immutably encapsulate prior assumptions up front in the modelling step before seeing data. When the data arrive, they perform inference, make decisions, and at that point they’re done. The alternative perspective instead uses the three-stage process as a consistency checking device: if I made these assumptions, then these would be the corresponding coherent statistical inferences. In this way, we are free to explore a range of different assumptions, assessing their consequences both for the model and for the inferences we make. We are free to modify the model accordingly until we arrive at something that well approximates what we believe. This view has been called the hypothetico–deductive view of Bayesian inference (Gelman & Shalizi, 2011) and it has the advantage that this is the way many of us use these methods in practice. The disadvantage is that double-dipping is possible, which requires that you are careful about the checks and diagnostics that you perform on the model and corresponding inferences.

Weakness 3: Approximate inference

The weakness that kept Zoubin Ghahramani (and many other Bayesians) awake at night is that the Bayesian posterior is computationally intractable in all but the simplest cases. Consequently, approximations are necessary, which means that all the theoretical guarantees and justifications that are true for the exact Bayesian posterior no longer hold. Worse still, we do not know in what ways approximation will affect different aspects of the solution.

For example, one common approximation technique is variational inference (Wainwright & Jordan, 2008), which side-steps intractable computations arising from application of the sum and product rules by projecting distributions onto simpler tractable alternatives. Variational inference is therefore guaranteed to be incoherent: it returns different solutions from exactly applying the sum and product rules. How do these approximate inferences differ from the true ones?

Variational inference is known to (1) underestimate posterior uncertainty if factorised approximations are used11 and (2) bias parameter learning so that overly simple models are returned that underfit the data (Turner & Sahani, 2011). An example of the first phenomenon is the observation that the mean-field variational approximation in neural networks is unable to model in-between uncertainty (Foong et al., 2019). Examples of the second phenomenon are over-pruning in mixture models (MacKay, 2001; Blei & Jordan, 2006), variational autoencoders12 (Burda et al., 2015; Chen et al., 2016; Zhao et al., 2017; Yeung et al., 2017), and Bayesian neural networks (Trippe & Turner, 2018).
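The first phenomenon can be reproduced in a case where everything is available in closed form. The sketch below assumes a correlated two-dimensional Gaussian standing in for the true posterior (the correlation value is a hypothetical choice) and uses the standard result that the factorised Gaussian minimising KL(q || p) has variances equal to the reciprocals of the diagonal of the true precision matrix.

```python
import numpy as np

# Hypothetical "true posterior": zero-mean Gaussian with correlation rho.
rho = 0.9
cov = np.array([[1.0, rho],
                [rho, 1.0]])
prec = np.linalg.inv(cov)

# True marginal variances are the diagonal of the covariance matrix.
true_marginal_var = np.diag(cov)

# Optimal mean-field Gaussian under KL(q || p): each factor's variance is
# the reciprocal of the matching diagonal entry of the precision matrix.
mean_field_var = 1.0 / np.diag(prec)

print("true marginal variances:", true_marginal_var)
print("mean-field variances:   ", mean_field_var)
```

For this target the mean-field variance equals 1 - rho^2, so the stronger the posterior correlation, the more severely the factorised approximation understates the marginal uncertainty.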

Inevitably, these errors are amplified if repeated approximation steps are required. For example, in online or continual learning the goal is to incorporate new observations sequentially without forgetting old ones (Nguyen et al., 2017). In theory, repeated application of Bayes’ rule provides the optimal solution, but in practice approximations like variational inference are required at each step. This causes amplification of the approximation error, which eventually leads the model to “forget” old data. Consequently, Bayesian methods are approximate and have to be checked and benchmarked for catastrophic forgetting, just like their non-Bayesian counterparts.

We lack a general theory that justifies variational inference. Recent work has shown that it is frequentist consistent (Wang & Blei, 2017) under assumptions similar to those required for consistency of Bayesian inference under model misspecification. Although useful, this is a long way short of a general compelling justification for the approach.13

Similar arguments can be made against other approaches to approximate inference such as Monte Carlo methods. For example, Markov chain Monte Carlo (MCMC) is guaranteed to eventually give you a close-enough answer; but it may take an unfeasibly long time to do so and diagnosing whether you have reached that point is very difficult (we often rely on heuristics to check whether the chain has been run for long enough). So, although excellent software for performing MCMC exists,14 ensuring that it is performing correctly is delicate.15
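One widely used heuristic of this kind is the Gelman–Rubin potential scale reduction factor, which compares between-chain and within-chain variance. The sketch below is a simplified version of the classic statistic (no chain splitting or rank normalisation, which modern software typically adds), applied to synthetic "chains" rather than the output of a real sampler. The function name `gelman_rubin_rhat` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def gelman_rubin_rhat(chains):
    """Classic potential scale reduction factor for one scalar quantity.

    chains has shape (num_chains, num_draws).
    """
    chains = np.asarray(chains, dtype=float)
    _, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()       # mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)    # variance of the chain means
    pooled = (n - 1) / n * within + between / n      # pooled variance estimate
    return np.sqrt(pooled / within)


rng = np.random.default_rng(0)

# Synthetic stand-ins for sampler output: four chains of 1000 draws each.
mixed = rng.normal(0.0, 1.0, size=(4, 1000))
# The same draws, but with two chains shifted to a different region.
stuck = mixed + np.array([[-3.0], [-3.0], [3.0], [3.0]])

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print(f"R-hat, chains that agree:    {rhat_mixed:.3f}")
print(f"R-hat, chains that disagree: {rhat_stuck:.3f}")
```

A value close to 1 is necessary but not sufficient: if every chain is stuck in the same region, the chains agree with one another and the statistic looks healthy even though the sampler has not explored the posterior.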

Let us take a step back and consider approximate inference in the context of Bayesian decision theory. In its ideal form, Bayesian decision theory is a cleanly separated three-step procedure consisting of model specification, inference and loss minimisation. However, in practice the use of approximate inference entangles the three steps: models which are simpler result in more accurate, and therefore more coherent, inference; and downstream decision making dictates what aspects of the true posterior your approximate posterior should capture and therefore feeds into the inference stage rather than being decoupled from it (Lacoste-Julien et al., 2011).16

A way forward? To our minds, the jury is still out on the best way forward. Hopefully a more general theory (potentially based on decision making under uncertainty and computational constraints that extends Cox’s, Ramsey’s and de Finetti’s, or Savage’s ideas) will emerge. In the meantime, continuing to provide more limited theoretical characterisation of the properties of existing inference approaches is vital. Practitioners should also test the hell out of their inference schemes to gain confidence in them. Testing on special cases where ground truth posteriors or underlying variables are known is helpful. The acid test is whether your inference scheme works on the real world data you care about, so test cases also need to replicate aspects of this situation. Here the ideas of probabilistic models and being able to sample fake data, potentially from models trained on real data, are very useful. This fits nicely with the hypothetico–deductive loop: specify your model, perform approximate inference, check your inferences, now check your model, review your modelling choices, and start the process again.
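As one illustration of testing on a special case with a known ground truth posterior, the sketch below runs a simple random-walk Metropolis sampler on a Bernoulli model with a uniform prior, where the exact posterior is a Beta distribution, and compares the sampled moments with the exact ones. The data (`k` successes in `n` trials), the step size, and the function names are hypothetical choices for this example.

```python
import numpy as np


def log_posterior(theta, k, n):
    """Unnormalised log posterior: Bernoulli likelihood, uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -np.inf
    return k * np.log(theta) + (n - k) * np.log1p(-theta)


def metropolis(k, n, num_samples, step, rng):
    """Random-walk Metropolis with Gaussian proposals of scale `step`."""
    theta = 0.5
    log_p = log_posterior(theta, k, n)
    samples = np.empty(num_samples)
    for i in range(num_samples):
        proposal = theta + step * rng.normal()
        log_p_new = log_posterior(proposal, k, n)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if np.log(rng.uniform()) < log_p_new - log_p:
            theta, log_p = proposal, log_p_new
        samples[i] = theta
    return samples


rng = np.random.default_rng(1)
k, n = 7, 20  # hypothetical data: 7 successes in 20 trials

# Discard the first 2000 draws as burn-in.
samples = metropolis(k, n, num_samples=20000, step=0.2, rng=rng)[2000:]

# Exact posterior is Beta(a, b) with a = 1 + k and b = 1 + n - k.
a, b = 1 + k, 1 + n - k
exact_mean = a / (a + b)
exact_var = a * b / ((a + b) ** 2 * (a + b + 1))

print(f"mean: sampled {samples.mean():.4f}, exact {exact_mean:.4f}")
print(f"var:  sampled {samples.var():.5f}, exact {exact_var:.5f}")
```

Agreement here only builds confidence for problems that resemble this one. A one-dimensional unimodal posterior says little about how the same scheme behaves on the high-dimensional, possibly multimodal posteriors that arise in practice, which is why test cases also need to mimic the real setting.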

Of course the amazing and diverse practical successes of Bayesian inference — from cracking Enigma and finding Air France Flight 447, to the TrueSkill match-making system — give us great confidence in the utility of the probabilistic approach.

Conclusion

We started the first of these two posts by remarking that it was striking that Michael Jordan, Geoff Hinton, and Zoubin Ghahramani — individuals who had made huge contributions to the probabilistic brand of machine learning — maintained reservations about that very approach. We hope that this post has brought colour to these concerns.

Michael Jordan thought that Bayesians and frequentists should reconcile and operate arm-in-arm. This makes sense since Bayesians develop inference procedures and frequentist methods can be used to evaluate them.

Geoff Hinton’s view was that Bayesian approaches don’t cut it for the problems he is interested in, like object recognition and machine translation. This is understandable since we have seen that Bayesian methods are useful when (1) you’re able to encode knowledge carefully into a detailed probabilistic model; (2) you are uncertain and want to propagate uncertainty accurately; and (3) you can perform accurate approximate inference. Is this really true in situations like object recognition or machine translation? These problems involve learning complex high-dimensional representations from stacks of largely noise-free images, words, or speech, and, to quote Zoubin Ghahramani, “if our goal is to learn a representation (…) then Bayesian inference gets in the way”.17

Zoubin Ghahramani admitted that approximate inference kept him awake at night. We have seen that the arguments that we make to justify the Bayesian approach — Cox’s and Savage’s theorems, Dutch books, et cetera — are practically meaningless because our inference is approximate. The elegant separation of modelling, inference, and decision making steps is, in fact, a tangled web.

So, when you teach the ideas of probabilistic modelling and inference and write boilerplate in your papers, we’d ask that you pause for breath before trotting out the standard justifications. Instead, consider evangelising a little less about how principled the approach is, mention the importance of exploring different modelling assumptions and their consequences, and stress that the quality of approximate inference should be evaluated in a systematic and principled way.

  1. Also quoted by Terenin & Draper (2015). Halpern (1999) constructs a counterexample to Cox’s theorem if the additional technical assumption, Paris’ (1994) density assumption, is omitted. 

  2. Van Horn (2003) concludes that “[they] cannot make a compelling case for all of [Cox’s theorem’s] requirements, however, and there remains disagreement as to the desirability of several of them.” 

  3. Interestingly, there are two-dimensional theories of probability; see Section 4.1.2 by Bakker (2016) for an overview. 

  4. Skyrms (1987, p. 4) states a dynamic Dutch book argument due to David Lewis (reported by Paul Teller; 1973, 1976). Skyrms also says (p. 19) that “(…) not every learning situation is of the kind to which conditionalization applies. The situation may not be of the kind that can be correctly described or even usefully approximated as the attainment of certainty by the agent in some proposition in his degree of belief space. The rule of belief change by probability kinematics on a partition was designed to apply to a much broader class of learning situations than the rule of conditionalization.” In the paper, Skyrms describes the Observation Game, a learning situation where conditionalisation does not apply. There, a generalisation of conditionalisation called probability kinematics (Jeffrey, 1965), essentially Bayes’ rule with a different prior, is necessary and, under certain conditions, sufficient to be bulletproof, a strong coherence condition that excludes a Dutch book.

  5. For example, see Example 5.7.2 by Lehmann & Casella (1998; also Makani, 1997). 

  6. Savage’s (1962) comment on personal probability that was reproduced at the start of this post captures this sentiment well. 

  7. Andrew Gelman and David MacKay discuss this issue.

  8. We previously wrote “the usual choice of a Gaussian prior”, which, unfortunately, is slightly ambiguous. We mean to refer to the usual choices of $\mathbf{w} \sim \mathcal{N}(0, 1)$ or $\mathbf{w} \sim \mathcal{N}(0, 1/\sqrt{n})$. 

  9. See also this great answer by Peter Grünwald on StackExchange. 

  10. As a solution, Grünwald & van Ommen (2017) propose to replace the posterior by a generalised posterior, which depends on a learning rate. This, however, requires an appropriate choice of the learning rate, obscures the meaning of the likelihood, and, more importantly, invalidates the fundamental justifications which hold for Bayes’ rule. 

  11. Often the approximations involve some form of factorisation assumption, but alternatives exist that have different properties. For example, inducing point approximations for Gaussian processes (Titsias, 2009) tend to overestimate uncertainty. 

  12. As an attempt to improve the variational approximation, it has been proposed to recalibrate the variational objective by reweighting the KL term (Alemi et al., 2017; Higgins et al., 2017), which is closely related to the cold posterior effect. But by attempting to fix variational inference in this way, we are blurring the lines between Bayesian modelling and end-to-end optimisation, producing cleverly regularised estimators rather than models which reason according to fundamental principles.

  13. A pragmatic solution identifies standard practice, which uses point estimates for the unknowns such as $\hat X_{\text{MAP}}$, as effectively summarising the posterior $p(X \mid D)$ by a Dirac delta function, a probability distribution concentrated on $\hat X_{\text{MAP}}$. David MacKay (2003, Section 33.6) says that, “[f]rom this perspective, any approximating distribution $Q(x;\theta)$, no matter how crummy it is, has to be an improvement on the spike produced by the standard method!” However, “has” is doing a lot of work in this sentence.

  14. For example, Stan, PyMC3, or Turing.jl, but there are more out there.

  15. Michael Betancourt has a great post about responsible use of MCMC in practice.

  16. Indeed, the idea of decision-making-aware inference has been successful in the meta-learning setting (Gordon et al., 2019). It seems likely that the entanglement of inference and decision making (as a consequence of the intractability of the posterior) could justify the recent trend towards end-to-end systems, which directly optimise for the metric of interest, thereby circumventing these issues. 

  17. The development of “Bayesian-inspired” approaches, which blend end-to-end deep learning with ideas from probabilistic modelling and inference is arguably a reaction to these concerns. The goal of these developments is to provide excellent representation learning and strong predictive performance whilst still handling uncertainty, but without claiming strict adherence to the Bayesian dogma. 

Published on 31 March 2021.