# Posterior consistency under (possible) misspecification

We assume, without too much loss of generality, that our priors are discrete. When dealing with Hellinger separable density spaces, it is possible to discretize posterior distributions to study consistency (see this post about it).

Let $\Pi$ be a prior on a countable space $\mathcal{N} = \{f_1, f_2, f_3, \dots\}$ of probability density functions, with $\Pi(f) > 0$ for all $f \in \mathcal{N}$. Data $X_1, X_2, X_3, \dots$ follows (independently) some unknown distribution $P_0$ with density $f_0$.

We denote by $D_{KL}(f_0, f) = \int f_0 \log\frac{f_0}{f}$ the Kullback-Leibler divergence and we let $D_{\frac{1}{2}}(f_0, f) = 1 - \int \sqrt{f_0 f}$ be half of the squared Hellinger distance.

The following theorem states that the posterior distribution of $\Pi$ accumulates in Hellinger neighborhoods of $f_0$, assuming the prior is root-summable (i.e. $\sum_{f \in \mathcal{N}} \Pi(f)^\alpha < \infty$ for some $\alpha > 0$). In the well-specified case (i.e. $\inf_{f \in \mathcal{N}} D_{KL}(f_0, f) = 0$), the posterior accumulates in any neighborhood of $f_0$. In the misspecified case, small neighborhoods of $f_0$ could be empty, but the posterior distribution still accumulates in sufficiently large neighborhoods (how large exactly is a function of $\alpha$ and $\inf_{f \in \mathcal{N}} D_{KL}(f_0, f)$).
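To make the misspecified case concrete, here is a minimal sketch, not the setting of the theorem itself: the countable space $\mathcal{N}$ is replaced by a hypothetical finite list of Bernoulli models, none of which equals the true $f_0$, and the prior is uniform. The names `posterior`, `kl_bernoulli` and `half_sq_hellinger`, as well as all numerical values, are assumptions chosen for illustration. The simulation suggests that the posterior mass piles up on the model minimizing $D_{KL}(f_0, f)$.

```python
import math
import random

def posterior(prior, models, data):
    """Posterior weights over a finite list of Bernoulli models.

    prior  : prior weights Pi(f_k), all positive
    models : success probabilities p_k defining the densities f_k
    data   : list of 0/1 observations
    """
    ones = sum(data)
    zeros = len(data) - ones
    # Work on the log scale to avoid underflow for large samples.
    log_w = [math.log(w) + ones * math.log(p) + zeros * math.log(1 - p)
             for w, p in zip(prior, models)]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def kl_bernoulli(p0, p):
    """D_KL(f_0, f) for Bernoulli densities with parameters p0 and p."""
    return p0 * math.log(p0 / p) + (1 - p0) * math.log((1 - p0) / (1 - p))

def half_sq_hellinger(p0, p):
    """D_{1/2}(f_0, f) = 1 - sum of sqrt(f_0 f) for Bernoulli densities."""
    return 1 - (math.sqrt(p0 * p) + math.sqrt((1 - p0) * (1 - p)))

random.seed(0)
p0 = 0.33                            # true parameter, absent from the model list
models = [0.1, 0.3, 0.5, 0.7, 0.9]   # hypothetical misspecified model space
prior = [0.2] * len(models)          # uniform prior (assumption)
data = [1 if random.random() < p0 else 0 for _ in range(5000)]

post = posterior(prior, models, data)
best = min(range(len(models)), key=lambda k: kl_bernoulli(p0, models[k]))
print("posterior weights:", [round(w, 4) for w in post])
print("KL-closest model:", models[best])
print("its half squared Hellinger distance to f_0:",
      half_sq_hellinger(p0, models[best]))
```

With these choices the true density is not in the model space, so no Hellinger neighborhood of $f_0$ of radius smaller than the printed distance contains any model.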

# The choice of prior in Bayesian nonparametrics – part 2

See part 1. Most proofs are omitted; I’ll post them with the complete pdf later this week.

# The structure of $\mathcal{M}$

Recall that $\mathbb{M}$ is a Polish space (i.e. a complete and separable metric space). It is endowed with its Borel $\sigma$-algebra $\mathfrak{B}$, which is the smallest family of subsets of $\mathbb{M}$ that contains its topology and that is closed under countable unions and intersections. All subsets of $\mathbb{M}$ we consider in the following are supposed to be part of $\mathfrak{B}$. A probability measure on $\mathbb{M}$ is a function $\mu : \mathfrak{B} \rightarrow [0,1]$ such that for any countable partition $A_1, A_2, A_3, \dots$ of $\mathbb{M}$ we have that $\sum_{i=1}^\infty \mu(A_i) = 1.$ The set $\mathcal{M}$ consists of all such probability measures.

Note that since $\mathbb{M}$ is complete and separable, every probability measure $\mu \in \mathcal{M}$ is regular (and tight). It means that the measure of any $A\subset \mathbb{M}$ can be well approximated from the measure of compact subsets of $A$ as well as from the measure of open super-sets of $A$:

$\mu(A) = \sup \left\{\mu(K) \,|\, K \subset A \text{ is compact}\right\}\\ = \inf \left\{\mu(U) \,|\, U \supset A \text{ is open}\right\}.$

## Metrics on $\mathcal{M}$

Let me review some facts. A natural metric used to compare the mass allocation of two measures $\mu, \nu \in \mathcal{M}$ is the total variation distance defined by

$\|\mu - \nu\|_{TV} = \sup_{A \subset \mathbb{M}}|\mu(A) - \nu(A)|.$
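As a small sanity check of this definition, the sketch below works on a hypothetical four-point space, where the supremum can be computed by brute force over all subsets. It compares the result with half of the $\ell^1$ distance between the mass functions, which is the standard closed form in the discrete case. The function names and the two measures are assumptions made for illustration.

```python
from itertools import combinations

def tv_half_l1(mu, nu):
    """Total variation as half of the l1 distance between mass functions."""
    return 0.5 * sum(abs(mu[x] - nu[x]) for x in mu)

def tv_brute_force(mu, nu):
    """Total variation as the supremum of |mu(A) - nu(A)| over all subsets A."""
    points = list(mu)
    best = 0.0
    for r in range(len(points) + 1):
        for subset in combinations(points, r):
            gap = abs(sum(mu[x] for x in subset) - sum(nu[x] for x in subset))
            best = max(best, gap)
    return best

# Two hypothetical probability measures on the same four points.
mu = {"a": 0.5, "b": 0.3, "c": 0.2, "d": 0.0}
nu = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

print(tv_half_l1(mu, nu), tv_brute_force(mu, nu))
```

Here the supremum is attained on the set where $\mu$ puts more mass than $\nu$, namely $\{a, b\}$.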

# Remark on the asymptotics of the likelihood ratio and the K.-L. divergence

## The problem.

Let $f, g, h$ be three densities and suppose that $x_i \sim h$, $i \in \mathbb{N}$, independently. What happens to the likelihood ratio

$\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}$

as $n\rightarrow \infty$?

Clearly, it depends. If $h = g \not = f$, then

$\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0$

almost surely at an exponential rate. More generally, if $h$ is closer to $g$ than to $f$, in some sense, we’d expect that $\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0$. Such a measure of “closeness” or “divergence” between probability distributions is given by the Kullback-Leibler divergence

$D_{KL}(f, g) = \int f \log\frac{f}{g}.$

It can be verified that $D_{KL}(f,g) \geq 0$ with equality if and only if $f=g$, and that

$D_{KL}(h,g) < D_{KL}(h,f) \Longrightarrow \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0 \qquad (1)$

almost surely at an exponential rate. Thus the K.-L. divergence can be used to solve our problem.
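The exponential rate in $(1)$ can be made visible numerically. When both divergences are finite, the law of large numbers gives $\frac{1}{n}\sum_{i=1}^n \log\frac{f(x_i)}{g(x_i)} \rightarrow D_{KL}(h,g) - D_{KL}(h,f)$ almost surely, which is negative under the hypothesis of $(1)$. The following is a minimal simulation sketch with Bernoulli densities; the parameters `pf`, `pg`, `ph` and the helper names are assumptions chosen for illustration.

```python
import math
import random

def bern_logpdf(x, p):
    """Log-density of a Bernoulli(p) observation x in {0, 1}."""
    return math.log(p) if x == 1 else math.log(1 - p)

def kl_bernoulli(a, b):
    """Kullback-Leibler divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

random.seed(1)
pf, pg, ph = 0.8, 0.45, 0.5    # h is closer to g than to f in K.-L. divergence
n = 20000
xs = [1 if random.random() < ph else 0 for _ in range(n)]

# Average log likelihood ratio (1/n) * sum log f(x_i)/g(x_i).
mean_log_ratio = sum(bern_logpdf(x, pf) - bern_logpdf(x, pg) for x in xs) / n
# Its almost sure limit.
limit = kl_bernoulli(ph, pg) - kl_bernoulli(ph, pf)

print("empirical rate:", mean_log_ratio)
print("limiting rate :", limit)
```

A negative limiting rate $-r$ means that the likelihood ratio itself behaves roughly like $e^{-rn}$.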

## Better measures of divergence?

There are other measures of divergence that can determine the asymptotic behavior of the likelihood ratio as in $(1)$ (e.g. the discrete distance). However, in this note, I give conditions under which the K.-L. divergence is, up to topological equivalence, the “best” measure of divergence.

# ‘Pataphysics

Tune in to Today’s Random Time!

Note: It’s meant to be useless and absurd.

# The choice of prior in Bayesian nonparametrics – Introduction

In preparation for the 11th Bayesian nonparametrics conference, I’m writing (and rewriting) notes on the background of our research (i.e. some of the general theory of Bayesian nonparametrics). There are some good books on the subject (such as Bayesian Nonparametrics (Ghosh and Ramamoorthi, 2003)), but I wanted a more introductory focus and to present Choi and Ramamoorthi’s very clear point of view on posterior consistency (Remarks on the consistency of posterior distributions, 2008).

# 1. Introduction

Let $\mathbb{X}$ be a complete and separable metric space and let $\mathcal{M}$ be the space of all probability measures on $\mathbb{X}$. Some unknown distribution $P_0\in \mathcal{M}$ is generating observable data $\mathcal{D}_n = (X_1, X_2, \dots, X_n) \in \mathbb{X}^n$, where each $X_i$ is independently drawn from $P_0$. The problem is to learn about $P_0$ using only $\mathcal{D}_n$ and prior knowledge.

Example (Discovery probabilities).
A cryptographer observes words, following some distribution $P_0$, in an unknown countable language $\mathcal{L}$. What are the $P_0$-probabilities of the words observed thus far? What is the probability that the next word to be observed has never been observed before?

## 1.1 Learning and uncertainty

We need a workable definition of learning. As a first approximation, we can consider learning to be the reduction of uncertainty about $P_0$. This requires a quantification of how uncertain we are to begin with. Then, hopefully, as data is gathered our uncertainty decreases and we are able to pinpoint $P_0$.

This is the core of Bayesian learning, although our definition is not yet entirely satisfactory. There are some difficulties with this idea of quantifying uncertainty, at least when using information-theoretic concepts. The solution we adopt here is the use of probabilities to quantify uncertain knowledge (Bayesians would also talk of subjective probabilities quantifying rational belief). For example, you may know that a coin is likely to be fair, although it is not impossible that both sides of the coin are the same. This is uncertain knowledge about the distribution of heads and tails in the coin flips, and you could assign probabilities to the different possibilities.

More formally, prior uncertain knowledge about $P_0$ is quantified by a probability measure $\Pi$ on $\mathcal{M}$. For any $A \subset \mathcal{M}$, $\Pi(A)$ is the prior probability that “$P_0 \in A$”. Then, given data $\mathcal{D}_n$, prior probabilities are adjusted to posterior probabilities: $\Pi$ becomes $\Pi_n$, the conditional distribution of $\Pi$ given $\mathcal{D}_n$. The celebrated Bayes’ theorem provides a formula to calculate $\Pi_n$ from $\Pi$ and $\mathcal{D}_n$. Thus we have an operational definition of learning in our statistical framework.

Learning is rationally adjusting uncertain knowledge in the light of new information.
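The coin example can be turned into a small computation. The sketch below assumes a hypothetical prior $\Pi$ supported on only three distributions (a fair coin, a two-headed coin and a two-tailed coin) with weights 0.98, 0.01 and 0.01; these numbers and the function names are illustrative assumptions, not part of the general framework, where $\Pi$ lives on all of $\mathcal{M}$.

```python
# Probability of heads under each hypothetical candidate for P_0.
HEADS_PROB = {"fair": 0.5, "two heads": 1.0, "two tails": 0.0}

def coin_likelihood(hypothesis, flips):
    """Probability of the observed sequence of flips under a hypothesis."""
    p = HEADS_PROB[hypothesis]
    out = 1.0
    for flip in flips:
        out *= p if flip == "H" else 1.0 - p
    return out

def bayes_update(prior, likelihood, data):
    """Bayes' theorem on a finite set: posterior is prior times likelihood,
    renormalized to sum to one."""
    unnorm = {h: w * likelihood(h, data) for h, w in prior.items()}
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}

prior = {"fair": 0.98, "two heads": 0.01, "two tails": 0.01}  # assumed weights
after_5 = bayes_update(prior, coin_likelihood, "HHHHH")
after_10 = bayes_update(prior, coin_likelihood, "HHHHHHHHHH")

print(after_5)
print(after_10)
```

A single observed head rules out the two-tailed coin, and a long run of heads gradually shifts belief from the fair coin to the two-headed one, even though the latter started with a small prior probability.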

For explanations as to why probabilities are well suited to the representation of uncertain knowledge, I refer the reader to Pearl (Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988). We will also see that the operation of updating prior probabilities to posterior probabilities does work as intended.

## 1.2 The choice of prior

Specifying prior probabilities, that is quantifying prior uncertain knowledge, is not a simple task. It is especially difficult when uncertainty is over the non-negligible part $\mathcal{M}$ of an infinite-dimensional vector space. Fortunately, “probability is not about numbers, it is about the structure of reasoning”, as Glenn Shafer puts it (cited in Pearl, 1988, p.15). The exact numbers given to the events “$P_0 \in A$” are not of foremost importance; what matters is how probabilities are more qualitatively put together, and how this relates to the learning process.

Properties of prior distributions, opening them to scrutiny, criticism and discussion, must be identified and related to what happens as more and more data is gathered.

Part 2.

# Explanations

Angular data arises in many scientific fields, such as in experimental biology for the study of animal orientation, and in bioinformatics in relation to the protein structure prediction problem.

The statistical analysis of this data requires adapted tools such as $2\pi$-periodic density models. Fernandez-Duran (Biometrics, 60(2), 2004) proposed non-negative trigonometric sums (i.e. non-negative trigonometric polynomials) as a flexible family of circular distributions. However, the coefficients of trigonometric polynomials expressed in the standard basis $1, \cos(x), \sin(x), \dots$ are difficult to interpret and we do not see how an informative prior could be specified through this parametrization. Moreover, the use of this basis was criticized by Ferreira et al. (Bayesian Analysis, 3(2), 2008) as resulting in a “wiggly approximation, unlikely to be useful in most real applications”.

### Trigonometric density basis

Here, we suggest the use of a density basis of the trigonometric polynomials and argue it is well suited to statistical applications. In particular, coefficients of trigonometric densities expressed in this basis possess an intuitive geometric interpretation. Furthermore, we show how “wiggliness” can be precisely controlled using this basis and how another geometric constraint, periodic unimodality, can be enforced [first proposition on the poster]. To ensure that nothing is lost by using this basis, we also show that the whole model consists of precisely all positive trigonometric densities, together with the basis functions [first theorem on the poster].

### Prior specification

Priors can be specified on the coefficients of mixtures in our basis and on the degree of the trigonometric polynomials to be used. Through the interpretability of the coefficients and the shape-preserving properties of the basis, different types of prior knowledge may be incorporated. Together with an approximate understanding of mass allocation, these include:

• periodic unimodality;
• bounds on total variation; and
• knowledge of the marginal distributions (in the multivariate case).

The priors obtained this way are part of a well-studied family called sieve priors, including the well-known Bernstein-Dirichlet prior, and are finite mixtures with an unknown number of components. Most results and interpretations about the Bernstein-Dirichlet prior (see Petrone & Wasserman (J. R. Stat. Soc. B., 64(1), 2002), Kruijer and Van der Vaart (J. Stat. Plan. Inference, 138(7), 2008), McVinish et al. (Scand. J. Statist., 36(2), 2009)) can carry over to the priors we consider, but we do not discuss them further.

### Approximation-theoretic framework

Our density models arise as the image of “shape-preserving” linear approximation operators. This approximation-theoretic relationship is used to obtain a notably large prior Kullback-Leibler support and ensures strong posterior consistency at every bounded (not necessarily continuous) density. The result partly relies on known properties of sieve priors, as well as general consistency results (Walker (Ann. Statist., 32(5), 2004)), but extends known results by removing a usual continuity hypothesis on the densities at which consistency is achieved (see Wu & Ghosal (Electron. J. Stat., 2, 2008), Petrone & Veronese (Statistica Sinica, 20, 2010)). For contraction rates, higher order smoothness conditions are usually required (see Shen & Ghosal (Scand. J. Statist., 42(4), 2015)).

For example, consider the prior induced by the random density

$T_n \mathcal{D} := \sum_i \mathcal{D}(R_{i,n}) C_{i,n},\qquad (1)$

where $\mathcal{D}$ is a Dirichlet process, $n$ is distributed on $\mathbb{N}$ and the sets $R_{i,n}$ form a partition of the circle. It has strong posterior consistency at every bounded density provided that the associated operator

$T_n : f \mapsto \sum_i C_{i,n} \int_{R_{i,n}} f$

is such that $\|T_n f - f\|_\infty \rightarrow 0$ for all continuous $f$.
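To illustrate the approximation condition, here is a minimal sketch in which the basis densities $C_{i,n}$ are swapped for normalized indicator functions of $n$ equal arcs $R_{i,n}$ (a histogram operator). This is an assumption made for simplicity and is not the trigonometric density basis discussed above. The test density `f` and its antiderivative `F` are also hypothetical choices. The sup-norm error is evaluated on a fine grid and shrinks as $n$ grows.

```python
import math

TWO_PI = 2 * math.pi

def f(x):
    """A continuous density on the circle (illustrative choice)."""
    return (1 + 0.8 * math.cos(x)) / TWO_PI

def F(x):
    """Antiderivative of f, so the integral of f over [a, b] is F(b) - F(a)."""
    return (x + 0.8 * math.sin(x)) / TWO_PI

def T(n, x):
    """Histogram version of T_n f at x: the average of f over the arc
    R_{i,n} containing x, with C_{i,n} = 1/width on R_{i,n} and 0 elsewhere."""
    width = TWO_PI / n
    i = min(int((x % TWO_PI) / width), n - 1)
    a, b = i * width, (i + 1) * width
    mass = F(b) - F(a)          # integral of f over R_{i,n}
    return mass / width

def sup_error(n, grid_size=4000):
    """Approximate sup-norm distance between T_n f and f on a fine grid."""
    grid = (TWO_PI * k / grid_size for k in range(grid_size))
    return max(abs(T(n, x) - f(x)) for x in grid)

errors = {n: sup_error(n) for n in (4, 16, 64)}
for n, err in errors.items():
    print(n, err)
```

This operator also maps densities to densities, since the arc masses of a density sum to one.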

More generally, let $\mathbb{F}$ be a set of bounded densities on some compact metric space $\mathbb{M}$, let $T_n : L^1(\mathbb{M}) \rightarrow L^1(\mathbb{M})$, $n \in \mathbb{N}$, be a sequence of operators that are:

• shape preserving: $T_n$ maps densities to densities and $T_n(\mathbb{F}) \subset \mathbb{F}$; and
• approximating: $\|T_n f - f\|_\infty \rightarrow 0$ for all continuous $f$;

and finally let $\Pi_n$ be priors on $T_n(\mathbb{F})$ with full support. A sieve prior on $\mathbb{F}$ is defined by

$\Pi : A \mapsto \sum_n \rho(n) \Pi_n(A \cap T_n(\mathbb{F}))$.

Theorem.
If $0 < \rho(n) < Ce^{-c d_n}$ for some increasing sequence $d_n$ bounding the dimensions of $T_n (\mathbb{F})$, then the posterior distribution of $\Pi$ is strongly consistent at each density of $\mathbb{F}$.

The approximation theory literature is rich in such operators. The theorem shows that they provide strongly consistent priors on arbitrary density spaces simply given priors $\Pi_n$ on $T_n(\mathbb{F})$.

Basic density estimation:

A thousand samples (grey histogram) were drawn from the density in orange. The prior is defined by (1) with the Dirichlet process centered on the uniform density and with a precision parameter of 2. The degree $n$ is distributed as a $\text{Poiss}(15)$. The blue line is the posterior mean, the dark blue shaded region is a 50% pointwise credible region around the median, and the light blue shaded region is a 90% credible region.
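A draw from a prior of the form (1) can be sketched as follows. Two simplifications are assumed and are not taken from the figure: the basis densities $C_{i,n}$ are replaced by normalized indicators of $n$ equal arcs, and a degree of zero is bumped to one. With a Dirichlet process centered on the uniform density and precision 2, the masses of $n$ equal arcs follow a Dirichlet distribution with all parameters equal to $2/n$, which is sampled here by normalizing gamma variables. All function names are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    """Poisson sampler by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def draw_weights(n, precision, rng):
    """Dirichlet(precision/n, ..., precision/n) vector: the masses that a
    Dirichlet process with uniform base measure gives to n equal arcs."""
    g = [rng.gammavariate(precision / n, 1.0) for _ in range(n)]
    total = sum(g)
    return [v / total for v in g]

def random_density(rng, mean_degree=15, precision=2.0):
    """One draw of a random step density on the circle."""
    n = max(1, poisson(mean_degree, rng))   # assumption: degree 0 becomes 1
    weights = draw_weights(n, precision, rng)
    width = 2 * math.pi / n

    def density(x):
        # Height on arc i is its mass divided by its length.
        i = min(int((x % (2 * math.pi)) / width), n - 1)
        return weights[i] / width

    return n, weights, density

rng = random.Random(0)
n, weights, density = random_density(rng)
print("degree:", n)
print("density at 1.0:", density(1.0))
```

Because the precision parameter is small, most of the mass of a typical draw sits on a few arcs, so prior draws look spiky rather than uniform.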

# Approximation

Talk (20 minutes) at the séminaire du 5e.

I present the Weierstrass approximation theorem for periodic functions, using a basis of the trigonometric polynomials recently suggested by Róth et al. (2009). This basis lends itself naturally to our application.

Weierstrass approximation theorem.
Let $f : \mathbb{R} \rightarrow \mathbb{R}$ be a $2\pi$-periodic function. If $f$ is continuous, then one can construct trigonometric polynomials $f_1, f_2, f_3, \dots$ such that

$f(x) = \sum_{i=1}^{\infty} f_i(x)$

and such that the series above converges uniformly.

This theorem appears in several fields: in topology to prove Brouwer’s fixed-point theorem, in geometry for the isoperimetric inequality, and in algebraic geometry for the Nash-Tognoli theorem. It implies that $\{1, \cos(x), \sin(x), \cos(2x), \sin(2x), \dots\}$, as an orthonormal system (after normalization), is complete in $L^2(\mathbb{S}^1)$. More generally, it is used to reduce a problem about continuous functions to a problem about polynomials, where differential calculus and linear algebra apply. Constructive proofs of the theorem also provide tools for regression and for the reconstruction of curves and surfaces.
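One constructive route can be sketched numerically. The example below does not use the basis of Róth et al.; it swaps in the classical kernel $K_n(t) \propto \left(\frac{1 + \cos t}{2}\right)^n$, a non-negative trigonometric polynomial of degree $n$, so that the convolution of $f$ with $K_n$ is a trigonometric polynomial $p_n$ of degree at most $n$. The integral is replaced by a Riemann sum, and the test function $|\sin x|$ is an arbitrary choice. Given such a uniformly convergent sequence, the series of the theorem is obtained by taking $f_1 = p_1$ and $f_i = p_i - p_{i-1}$.

```python
import math

TWO_PI = 2 * math.pi

def f(x):
    """A continuous 2*pi-periodic function with kinks (illustrative choice)."""
    return abs(math.sin(x))

def smooth(n, x, m=2000):
    """Riemann-sum approximation of the convolution of f with the kernel
    K_n(t) proportional to ((1 + cos t) / 2)**n, normalized to total mass 1."""
    num = 0.0
    den = 0.0
    for k in range(m):
        t = TWO_PI * k / m
        w = ((1 + math.cos(x - t)) / 2) ** n
        num += w * f(t)
        den += w
    return num / den

def sup_error(n, grid_size=200):
    """Approximate sup-norm distance between the smoothed function and f."""
    grid = [TWO_PI * j / grid_size for j in range(grid_size)]
    return max(abs(smooth(n, x) - f(x)) for x in grid)

errors = {n: sup_error(n) for n in (4, 16, 64)}
for n, err in errors.items():
    print(n, err)
```

The error is largest at the kinks of $|\sin x|$ and decreases slowly, which is typical of positive-kernel approximations of non-smooth functions.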

# UQÀM

UQAM’s logo drawn in our bivariate angular density model. It is the type of picture we’d expect from the convolution of an image with a Gaussian kernel, but here it is obtained using finite mixtures. The difficulty lies in choosing the density basis in such a way as to constrain “bumpiness”, even though a finite number of “bumps” are summed together.

# Drawings

I use these grids to construct smooth approximations to bivariate angular probability densities (of interest in protein bioinformatics) and to specify semiparametric priors on bivariate angular density spaces. The grid on the torus illustrates the proper angular wrapping behavior of a planar grid. I drew them in Mathematica.

Code and variants below.

where $R$ is a side chain called the $R$-group. There are 20 different amino acids found in proteins, each characterized by its $R$-group.