# Combinatorics of Phylogenetic Trees

The following is based on a weekend project that I also presented as a short talk in an undergraduate combinatorics seminar. The project is self-contained and mostly based on independent work. Ideas and inspiration came from discussions with my teacher and from the introduction of Diaconis and Holmes (1998). Theorem 2 is from Semple and Steel (2003). Tree pictures were produced with Sagemath and Latex.

French pdf.

# 1. Introduction

A phylogenetic tree is a rooted binary tree with labeled leaves.

These trees are used in biology to represent the evolutive history of species. The leaves are the identified species, the root is a common anscestor, and branching represents speciation.

An interesting problem is that of reconstructing the phylogenetic tree that best explains the observed biological characterics of a set of species. A naive mathematical formulation of this problem is proposed in section 4, and used to implement a tree reconstruction algorithm.

# 2. Phylogenetic trees

Let $X$ be a finite set, totally ordered in any way, such as a set of species. A phylogenetic tree on $X$ is a rooted binary tree with leaves set $X$. Two phylogenetic trees are said to be equal if they are isomorphic under an application which, when restricted to $X$, is the identity on $X$. This is simply a formalisation of the idea that the leaves are labelled, whereas the other vertices are anonymous. In the following, we do not make special cases of this quotient.

## 2.1 Notations

We write $\texttt{Phy}(X)$ for the set of all phylogenetic trees on $X$, and if $n \in \mathbb{N}$, then $\texttt{Phy}(n) = \texttt{Phy}(\{1,2,\dots, n\})$. For $\mathcal{A} \in \texttt{Phy}(X)$ with root $r$ and $s$ any vertex, there exists a unique path from $s$ to $r$. The vertices on this path are ordered in the obvious way as going away from the root. The subtree of a vertex $s$ is the subtree of $\mathcal{A}$ containing $s$ and all the vertices under $s$. The principal branch of $\mathcal{A}$ is the component of $\mathcal{A} - r$ with the smallest leaf; its secondary branch is the other component of $\mathcal{A} - r$. Similarly, the principal branch of an interior vertex $s$ is the principal branch of the subtree of $s$, and the secondary branch of $s$ is the secondary branch of the subtree of $s$. If $s$ is any vertex that is not the root, its parent $p$ is the vertex above it and its cousin is the vertex under $p$ that is not $s$.

## 2.2 Properties

We begin with an useful elementary lemma. The number of elements in $X$ is $n$.

Lemma 1.
Let $\mathcal{A} \in \texttt{Phy}(X)$. It has $2n-1$ vertices and $2n-2$ edges.

Proof.
If $\mathcal{A}$ has only two leaves, then it has 3 vertex and 2 edges. Otherwise, let $n_1$ and $n_2$ be the number of leaves in the principal and secondary branches of $\mathcal{A}$. By induction, $\mathcal{A}$ has $(2n_1 - 1)+(2n_2-1)+1 = 2n-1$ vertices and $(2n_1-2)+(2n_2-2)+2 = 2n-2$ edges. $\Box$

We can now count the number of phylogenetic trees with $n$ leaves. At the same time, we will see a first recursive characterization of $\texttt{Phy}(X)$.

Theorem 2 (See Semple and Steel, 2003).
For all $n \in \mathbb{N}$, there exists exactly

$1 \cdots 3 \cdots 5 \cdots (2n-3) = \frac{(2n-3)!}{(n-2)! 2^{n-2}} =: (2n-3)!!$

different phylogenetic trees with $n$ leaves.

Proof.
Let $T : \texttt{Phy}(n) \rightarrow \texttt{Phy}(n-1)$ be defined as follows. For $\mathcal{A}$ a tree with root $r$, $T(\mathcal{A})$ is obtained in two steps:

1. remove the root $r$ and add an edge between the roots of the two components so obtained; and
2. remove the leaf $n$.

The root of $T(\mathcal{A})$ is the vertex adjacent to the leaf $n$ at the begining of the second step.

Each tree $\mathcal{A}' \in \texttt{Phy}(n-1)$ has at most $2n-3$ inverse images, since either $n$ was adjacent to the root of the preimage, or one of the $2n-4$ edges of $\mathcal{A}'$ was added by $T$. Since a tree is identifiable to is principal and secondary branches, it is straightforward to verify that $\mathcal{A}'$ has precisely $2n-3$ inverse images. It follows that $|\texttt{Phy}(n)| = (2n-3)|\texttt{Phy}(n-1)|$. $\Box$

Remark 3.
This proof provides a recursive enumeration of $\texttt{Phy}(n)$, by enumerating the preimage of each $\mathcal{A}' \in \texttt{Phy}(n-1)$.

Remark 4.
It also provides a way of sampling random phylogenetic trees, uniformly distributed on $\texttt{Phy}(n)$: given $\mathcal{A}' \sim \text{Uniform}(\texttt{Phy}(n-1)$, we let $\mathcal{A} \sim \text{Uniform}(T^{-1}(\mathcal{A}'))$. This can be used to integrate over $\texttt{Phy}(n)$.

# 3. Exploring the tree space

## 3.1 Representation

Let us describe another recursive nature of $\texttt{Phy}(X)$. Here, $\sum$ denotes the disjoint union of sets. We then have a bijection

$\texttt{Phy}(X) \rightarrow \sum\limits_{\min X \in A \subset X} \texttt{Phy}(A) \times \texttt{Phy}(X \backslash A)$

that maps a tree to the pair of its principal and secondary branches. By letting $\texttt{Phy}(\{x\}) \simeq x$, for all $x \in X$, the elements of $\texttt{Phy}(X)$ can be identified as shown below.

Furthermore, it follows from this bijection that the exponential generating function $F(x) = \sum_{n=0}^{\infty} a_n x^n/n!$ of $a_n := |\texttt{Phy}(n)|$, $a_0 = 0$, $a_1 = a_2 = 1$, satisfies $F(x) = x + F(x)^2/2$. In other words,

$F(x) = 1-\sqrt{1-2x}$.

## 3.2 Graph structure on the tree space

Let $\mathcal{A} \in \texttt{Phy}(X)$ and let $s$ be an interior vertex of $\mathcal{A}$ ($s$ is not the root and is not a leaf). We consider two transformations of $\mathcal{A}$, $\tau_1(s)$ and $\tau_2(s)$, defined in the diagram below. Here, $A$ is the principal branch of $s$, $B$ is the secondary branch of $s$, and $r$ is the parent of $s$. Also, $C$ is the branch of $r$ not containing $s$.

It is possible to give a formal recursive definition of these transformations, but this suffices for our purposes.

Now, two trees $\mathcal{A}$ and $\mathcal{A}'$ are said to be adjacent in $\texttt{Phy}(X)$ if one can be transformed into the other by such a transformation, for some interior vertex. The following theorem describes some important aspects of this graph structure on $\texttt{Phy}(X)$.

Theorem 5.
Let $n = |X|$.We have

1. each vertex of $\texttt{Phy}(X)$ is of dregree $2n-4$; and
2. the graph $\texttt{Phy}(X)$ is connected and its diameter is bounded by $(n-1)(n-2)$.

Proof of (1).
Let $\mathcal{A} \in \texttt{Phy}(X)$ and note that the transformations of the type $\tau_i(s)$ are invertible. The degree of $\mathcal{A}$ is therefore the cardinality of the set $\{\tau_i(s)(\mathcal{A}) \,|\, i \in \{1,2\}, s \text{ is an interior vertex}\}$. We show that $\tau_i(s) (\mathcal{A}) \not = \tau_j(s') (\mathcal{A})$ whenever $(i, s) \not = (j,s')$, from which the claim follows. Three cases are possible.

1. If $s = s'$ and $i \not = j$. By definition, $\tau_i(s)(\mathcal{A}) \not = \tau_j(s)(\mathcal{A})$.
2. If none of $s$ and $s'$ is a descendant of the other. Then, there exists a vertex $r$ such that $s$ and $s'$ are in distinct branches of $r$. It follows that $\tau_i(s)(\mathcal{A}) \not = \tau_j(s')(\mathcal{A})$.
3. If $s'$ is a descendant of $s$, $s \not = s'$. In this case, consider the parent $r$ of $s$. The transform $\tau_i(s)$ modifies the two branches of $r$, whereas $\tau_j(s')$ only modifies one of them. It follows that $\tau_i(s)(\mathcal{A}) \not = \tau_j(s')(\mathcal{A})$. $\Box$

Proof of (2).
Suppose, without loss of generality, that $X = \{1,2,\dots, n\}$. Let $\mathcal{E}_n \in \texttt{Phy}(n)$ be the elementary tree on $X$. Let $\mathcal{A} \in \texttt{Phy}(X)$. We show that $\mathcal{A}$ can be transformed into $\mathcal{E}_n$ through at most ${n-1 \choose 2}$ transformations of the type $\tau_i(s)$. It takes at most $n-2$ transformations to make the leaf $n$ adjacent to the root. By induction, it then takes at most $n-2 + {n-2 \choose 2} = {n-1 \choose 2}$ steps to transform $\mathcal{A}$ in $\mathcal{E}(n)$. Since each $\tau_i(s)$ is invertible, we obtain that $\texttt{Phy}(X)$ is connected and that its diameter is bounded by $2 {n-1 \choose 2}$. $\Box$

Remark 6.
Theorem 5 can be used to specify random walks on $\texttt{Phy}(X)$ that converge to an arbitrary positive distribution, using a Metropolis-Hastings algorithm.

# 4. Phylogenetic tree reconstruction

Let $X$ be a set of species. We assume that the evolutive history of the species in $X$ is described by some unknown tree $\mathsf{A} \in \texttt{Phy}(X)$. A priori, $\mathsf{A}$ is (say) uniformly distributed on $\texttt{Phy}(X)$. We observe for each specie $x \in X$ a set of characters $\mathcal{C}_x \subset \mathcal{C}$, $\mathcal{C}$ being the set of all biological characters under consideration. For instance, we may have $\mathcal{C} = \{\text{fur}, \text{scales}\}$. Bears have fur whereas snakes have scales.

Now, consider the following naive character propagation model. In a tree $\mathcal{A} \in \texttt{Phy}(X)$, the root $r$ has no distinctive characters. The children $a$ and $b$ of $r$ spawn new characters sets $\mathcal{C}_a$ and $\mathcal{C}_b$. We suppose that $\mathcal{C}_a \cap \mathcal{C}_b = \emptyset$ (if $a$ and $b$ were to share a character $c$, then we would assume this character is also present in their parent $r$). Now, if $s$ is any vertex of $\mathcal{A}$ that is not a leaf, with character set $\mathcal{C}_r$, the children $x$ and $y$ spawn character sets $\mathcal{C}_x\subset \mathcal{C}$ and $\mathcal{C}_y\subset \mathcal{C}$ such that $\mathcal{C}_x \cap \mathcal{C}_y = \mathcal{C}_s$, through some probability distribution. This distribution should be specified by a biologist; here it will be implicitely defined in a very simple likelihood function.

Given observed characters $\mathcal{D} = \{\mathcal{C}_x\}_{x \in X}$ and a candidate tree $\mathcal{A}$, we can reconstruct the character sets $\mathcal{C}_s$, for each $s \in \mathcal{A}$, as follows. If $\mathcal{C}_x$ and $\mathcal{C}_y$ are the character sets of the children of $s$, then $\mathcal{C}_s = \mathcal{C}_x \cap \mathcal{C}_y$. We finally define the likelihood of $\mathcal{A}$ given $\mathcal{D}$ through

$\log P(\mathcal{D}\,|\, \mathcal{A}) = \text{cst.} + \sum_{s = (x, y)} \log(|\mathcal{C}_x| - |\mathcal{C}_s| + 1) + \log(|\mathcal{C}_y| - |\mathcal{C}_s| + 1),$

where cst. is a universal constant and the summation is taken over all vertices $s \in \mathcal{A}$ having children $x$ and $y$. This likelihood was obtained through a partially specified probability model. It is maximized when the number of new characters spawned by each vertex (that is not the root) is constant, whenever that is possible.

To reconstruct $\mathsf{A} \in \texttt{Phy}(X)$, we look for the tree $\mathcal{A} \in \texttt{Phy}(X)$ of maximal a posteriori probability

$P(\mathcal{A} \,|\, \mathcal{D}) \propto P(\mathcal{D} \,|\, \mathcal{A}) P(\mathcal{A}).$

Through gradient ascent on the graph $\texttt{Phy}(X)$, we obtain local maximums.

(see the pdf.)

# References.

[1] Semple, C. et Steel, M., Phylogenetics, Oxford University Press, 2003.
[2] Diaconis, P. et Holmes, S., Matchings and Phylogenetic Trees, Proceedings of the National Academy of Sciences of the United States of America, Vol. 95, No. 25, 1998.