# Remark on the asymptotics of the likelihood ratio and the K.-L. divergence

## The problem.

Let $f, g, h$ be three densities and suppose that $x_i \sim h$, $i \in \mathbb{N}$, independently. What happens to the likelihood ratio

$\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}$

as $n\rightarrow \infty$?

Clearly, it depends. If $h = g \not = f$, then

$\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0$

almost surely at an exponential rate. More generally, if $h$ is closer to $g$ than to $f$, in some sense, we’d expect that $\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0$. Such a measure of “closeness” or “divergence” between probability distributions is given by the Kullback-Leibler divergence

$D_{KL}(f, g) = \int f \log\frac{f}{g}.$

It can be verified that $D_{KL}(f,g) \geq 0$ with equality if and only if $f=g$, and that

$D_{KL}(h,g) < D_{KL}(h,f) \Longrightarrow \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0 \qquad (1)$

almost surely at an exponential rate: by the strong law of large numbers, $\frac{1}{n}\sum_{i=1}^n \log\frac{f(x_i)}{g(x_i)} \rightarrow \int h \log\frac{f}{g} = D_{KL}(h,g) - D_{KL}(h,f) < 0$ almost surely. Thus the K.-L. divergence can be used to solve our problem.
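As a quick sanity check (an illustration, not part of the argument), here is a small Monte Carlo sketch with the hypothetical choices $h = N(0,1)$, $g = N(0.5, 1)$ and $f = N(2, 1)$, for which $D_{KL}(h, g) = 0.125 < 2 = D_{KL}(h, f)$. We track the logarithm of the likelihood ratio to avoid floating-point underflow.

```python
# Monte Carlo sketch: data drawn from h = N(0,1), likelihood ratio comparing
# f = N(2,1) against g = N(0.5,1).  Since D_KL(h, g) = 0.125 < 2 = D_KL(h, f),
# the product prod_i f(x_i)/g(x_i) should vanish exponentially fast.
import math
import random

def log_normal_pdf(x, mu, sigma=1.0):
    """Log density of N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

random.seed(0)
n = 100_000
log_lr = 0.0  # running value of log prod_i f(x_i)/g(x_i)
for _ in range(n):
    x = random.gauss(0.0, 1.0)  # x ~ h
    log_lr += log_normal_pdf(x, 2.0) - log_normal_pdf(x, 0.5)

# By the strong law of large numbers,
# log_lr / n -> D_KL(h, g) - D_KL(h, f) = 0.125 - 2 = -1.875 < 0.
print(log_lr / n)
```

The empirical rate $\frac{1}{n}\log\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}$ settles near $-1.875$, matching $D_{KL}(h,g) - D_{KL}(h,f)$.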

## Better measures of divergence?

There are other measures of divergence that can determine the asymptotic behavior of the likelihood ratio as in $(1)$ (e.g. the discrete distance). However, in this note, I give conditions under which the K.-L. divergence is, up to topological equivalence, the “best” measure of divergence.

This is of interest in Bayesian nonparametrics. The hypothesis that a density $f_0$ is in the Kullback-Leibler support of a prior $\Pi$ on a density space $\mathbb{F}$ is used to ensure that

$\int_{\mathbb{F}} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)} \Pi(df)$

does not converge to $0$ exponentially fast as $n\rightarrow \infty$ and $x_i \sim f_0$. Under the conditions we specify, our remark implies that the hypothesis that “$f_0$ is in the Kullback-Leibler support of $\Pi$” may not be replaced by a weaker one.

## Statement of the remark

Some notation. Let $\mathcal{M}$ be the space of all probability measures on some measurable space $\mathbb{M}$. If $\mu, \nu \in \mathcal{M}$, then both are absolutely continuous with respect to $\tau = \mu + \nu$ and possess densities $f, g$ such that $d\mu = f\,d\tau$ and $d\nu = g\,d\tau$. We denote by $d\mu /d\nu$ the ratio of densities $f / g$, which in fact does not depend on the choice of dominating measure $\tau$. The likelihood ratio $\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}$ is abbreviated to $\frac{d\mu}{d\nu}(X)$, depending implicitly on $n$. The Kullback-Leibler divergence between $\mu$ and $\nu$ is

$D_{KL}(\mu, \nu) = \int \log\frac{d\mu}{d\nu} d\mu.$

We let $D$ be any other function $\mathcal{M}\times \mathcal{M} \rightarrow [0,\infty]$ such that $D(\mu, \nu) = 0$ iff $\mu = \nu$ and such that if $D(\lambda, \nu) < D(\lambda, \mu)$ and $x_i \sim \lambda$, $i \in \mathbb{N}$, independently, then there exists an $\varepsilon > 0$ with

$e^{\varepsilon n} \frac{d\mu}{d\nu}(X) = e^{\varepsilon n} \prod_{i=1}^n \frac{d\mu}{d\nu}(x_i) \rightarrow 0$

almost surely as $n \rightarrow \infty$. The topology on a subset $\mathcal{F} \subset \mathcal{M}$ induced by such a function $D$ is the topology generated by the sets

$\{\nu \,|\, D(\mu, \nu) < \varepsilon\},\quad \mu \in \mathcal{F}, \, \varepsilon > 0.$

The remark below shows that any exponential rate of convergence of the likelihood ratio is picked up by the K.-L. divergence. It is rather obvious (albeit a bit technical), but I thought it was worth writing up properly.

Remark.
Let $x_i \sim \mu$, $i \in \mathbb{N}$, independently, and let $\frac{d\nu}{d\mu}(X) = \prod_{i=1}^n \frac{d\nu}{d\mu}(x_i)$.

1. We have that $D_{KL}(\mu, \nu) < \infty$ if and only if $\frac{d\nu}{d\mu}(X)$ does not converge to $0$ faster than every exponential rate (i.e. there exists $\varepsilon > 0$ such that $e^{\varepsilon n}\frac{d\nu}{d\mu}(X) \rightarrow \infty$).
2. If $\mathcal{F} \subset \mathcal{M}$ is such that $D_{KL}(\mu, \nu) < \infty$ for all $\mu, \nu \in \mathcal{F}$, then

$D(\lambda, \mu) < D(\lambda, \nu) \Longrightarrow D_{KL}(\lambda, \mu) < D_{KL}(\lambda, \nu)$

and the topology on $\mathcal{F}$ induced by $D_{KL}$ is weaker than the topology induced by any other function $D$ defined as above.

Proof of 1.
Suppose that $D_{KL}(\mu, \nu) < \infty$. Since, by the strong law of large numbers, $D_{KL}(\mu, \nu) + \varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i) \rightarrow \varepsilon > 0$ almost surely for every $\varepsilon > 0$, we find that

$\log\left(e^{n (D_{KL}(\mu, \nu) + \varepsilon)} \frac{d\nu}{d\mu}(X)\right) = n (D_{KL}(\mu, \nu) + \varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i)) \rightarrow \infty$

for all $\varepsilon > 0$.

If $D_{KL}(\mu, \nu) = \infty$, then for all $\varepsilon > 0$ we have

$\log \left(e^{n \varepsilon}\frac{d\nu}{d\mu}(X)\right) = n\left(\varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i)\right) \rightarrow -\infty$

since $\frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i) \rightarrow \infty$ almost surely. (The strong law applies here in the extended sense: the negative part of $\log\frac{d\mu}{d\nu}$ is $\mu$-integrable, since $\left(\log\frac{d\nu}{d\mu}\right)^+ \leq \frac{d\nu}{d\mu}$ and $\int \frac{d\nu}{d\mu}\,d\mu \leq 1$.) $\Box$
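Part 1 can also be checked numerically (an illustrative sketch, with the hypothetical choices $\mu = N(0,1)$ and $\nu = N(1,1)$, for which $D_{KL}(\mu, \nu) = 1/2$ and $\log\frac{d\nu}{d\mu}(x) = x - \frac{1}{2}$): since $\frac{1}{n}\log\frac{d\nu}{d\mu}(X) \rightarrow -D_{KL}(\mu,\nu)$, inflating by $e^{n(D_{KL}+\varepsilon)}$ sends the likelihood ratio to $+\infty$ while inflating by $e^{n(D_{KL}-\varepsilon)}$ still sends it to $0$.

```python
# Numerical sketch of part 1 (illustration only, not part of the proof):
# mu = N(0,1), nu = N(1,1), so D_KL(mu, nu) = 0.5 and, for x ~ mu,
# log(dnu/dmu)(x) = x - 1/2.
import random

random.seed(1)
n = 100_000
D, eps = 0.5, 0.1

# s = log (dnu/dmu)(X) = sum_i log(dnu/dmu)(x_i), with x_i ~ mu
s = sum(random.gauss(0.0, 1.0) - 0.5 for _ in range(n))

# By the strong law of large numbers, s / n -> -D.  Hence:
print(n * (D + eps) + s)  # log(e^{n(D+eps)} dnu/dmu(X)): grows to +infinity
print(n * (D - eps) + s)  # log(e^{n(D-eps)} dnu/dmu(X)): falls to -infinity
```

The two printed quantities have opposite signs for large $n$: the K.-L. divergence is exactly the critical exponential rate.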

Proof of 2.
Suppose that there exist $\lambda, \mu, \nu \in \mathcal{F}$ such that $D(\lambda, \mu) < D(\lambda, \nu)$ but $D_{KL}(\lambda, \mu) \geq D_{KL}(\lambda, \nu)$. Then, there is an $\varepsilon > 0$ such that $e^{2\varepsilon n} \frac{d\nu}{d\mu}(X) \rightarrow 0$ and hence, since $D_{KL}(\lambda, \mu) - D_{KL}(\lambda, \nu) \geq 0$, also

$e^{n\left(2\varepsilon - (D_{KL}(\lambda, \mu) - D_{KL}(\lambda, \nu))\right)} \frac{d\nu}{d\mu}(X) = e^{n(D_{KL}(\lambda, \nu) + \varepsilon )} \frac{d\nu}{d\lambda}(X) \Big/\left( e^{n(D_{KL}(\lambda, \mu) - \varepsilon)} \frac{d\mu}{d\lambda}(X) \right) \rightarrow 0.$

But $e^{n(D_{KL}(\lambda, \nu) + \varepsilon )} \frac{d\nu}{d\lambda}(X) \rightarrow \infty$ by the proof of part 1, and $e^{n(D_{KL}(\lambda, \mu) - \varepsilon)} \frac{d\mu}{d\lambda}(X) \rightarrow 0$ by the same strong law argument (with $-\varepsilon$ in place of $+\varepsilon$), which yields a contradiction.

Since $D(\lambda, \mu) =0$ iff $\lambda = \mu$, this implies that the topology on $\mathcal{F}$ induced by $D_{KL}$ is weaker than the one induced by $D$. $\Box$