Learning from data is powering the modern world. In the age of big data and abundant computational resources, statistics and machine learning have become ubiquitous. Modern statistical modeling can help us accomplish a wide breadth of useful tasks, such as regression, classification, or clustering. In *regression* models, we seek to estimate the relationship between the response variable and its predictors. In *classification* models, we seek to classify data points into classes. This is typically done in a *supervised* setting, in which the model is fit using true class labels. *Clustering* seeks to partition the data into a set of classes in an *unsupervised* setting, that is, to group the data such that points in the same group have similar properties to one another, without knowing any true class labels.

In this article, we discuss a class of models used for clustering called *Dirichlet process mixture models*, which allow us to cluster data according to its natural structure without the need to presuppose a quantity of clusters.

Dirichlet process mixture models fall into the category of Bayesian nonparametric models. To understand this term, we examine both descriptors:

A *Bayesian model* is a model in which the model parameters are treated as random variables, and inference is performed using Bayes’ theorem: for parameters \(\theta\) and observed data \(y\), we have \[p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)}\,. \label{bayesthm}\] From Bayes’ theorem, we observe that the *posterior density* \(p(\theta|y)\), the density of the distribution of the parameter given the observed data, is proportional to the product of the *prior density* \(p(\theta)\), the density of the distribution of the parameter before observing data, and the likelihood \(p(y|\theta)\) of the observed data.

A *nonparametric model* is a model in which the number of model parameters scales with the data on which it is fit. In this sense, it can be thought of as a model in which the parameter space is infinite-dimensional rather than finite-dimensional. Note that nonparametric models still have parameters, despite the name. Examples of *parametric* models include binomial models and Gaussian models.

Thus, a Bayesian nonparametric model embraces both of these descriptors: it is a model whose parameters form a random vector in an (in principle) infinite-dimensional space, with inference performed according to Bayes’ theorem.

While this category describes a broad class of models, such models commonly fall into two subcategories: regression models, which make use of Gaussian processes, and hierarchical models, which make use of the Dirichlet process.

The Dirichlet process mixture model, the principal model in the latter class, will be our focus. These models fit data to a mixture of parametric component distributions, with the key property that the number of components is inferred from the data and can grow with it. This nonparametric nature allows the model to be used to cluster data in which the number of clusters is not known *a priori*, and is thought to grow with the size of the data. As one might posit, this type of model has a variety of relevant applications: galaxy localization, grouping documents by topic, and neuroimaging problems.

The Dirichlet process itself is a probability distribution whose realizations are themselves discrete probability distributions. It is used as a prior distribution on the mixing measure in Dirichlet process mixture models. We first provide preliminary primers on Bayesian inference and exchangeability, then motivate and build up the theory of the Dirichlet process by discussing limiting proportions in Polya’s urn models and the extension to the Blackwell-MacQueen urn scheme. We then discuss Sethuraman’s stick-breaking construction and elaborate on properties of the Dirichlet process. Next, we discuss mixture modeling, building up from the finite case to motivate the Dirichlet process as a prior in nonparametric Bayesian models. Finally, we give examples of Dirichlet process mixture models fit to real data, discuss approximate inference for this class of models, and point to future directions for the interested reader.

In order to work up to hierarchical Bayesian models like the Dirichlet process mixture model, we first refer back to the basics of Bayesian inference. Model parameters are viewed as random variables rather than fixed, unknown quantities, and we seek to perform inference on these parameters through the usage of Bayes’ theorem.

One of the simplest examples of Bayesian inference involves binomial data, like flipping a coin. Say we have a coin with a probability \(\theta\) of yielding heads. After flipping the coin \(n\) times, we can calculate the posterior distribution on \(\theta\) given our observed flips. We now define the beta distribution, which will play an instrumental role in this example.

We define the \(\text{Beta}(\alpha,\beta)\) distribution, for \(\alpha,\beta>0\), as the distribution on \([0,1]\) with density \[p(\theta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}\,.\]

We will use the beta distribution as our prior distribution on \(\theta\) for Bayesian inference (that is, \(\theta\sim\text{Beta}(\alpha,\beta)\)). A reader unfamiliar with this problem may wonder why we chose a beta distribution as the prior on \(\theta\). This is because for binomial data like this, the beta distribution is a *conjugate prior*: the posterior on \(\theta\) given some observed data will be in the same parametric family as the prior, so in this case, the posterior will always be a beta distribution as well, with different parameters. To see this, suppose we observe \(y\) heads out of \(n\) total flips, and our prior on \(\theta\) is \(\text{Beta}(\alpha,\beta)\). Then we have \[\begin{aligned}
p(\theta\lvert y) &= \frac{p(\theta)p(y|\theta)}{p(y)} \\[1em]
&= \frac{p(\theta)p(y|\theta)}{\int_0^1 p(\theta)p(y|\theta)\,\mathrm{d}\theta} \\[1em]
&=\frac{\binom{n}{y}\theta^y(1-\theta)^{n-y} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\int_0^1\binom{n}{y}\theta^y(1-\theta)^{n-y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\,\mathrm{d}\theta} \\[1em]
&= \frac{\theta^{y+\alpha-1} (1-\theta)^{n+\beta-y-1}}{\int_0^1\theta^{y+\alpha-1} (1-\theta)^{n+\beta-y-1}\,\mathrm{d}\theta} \\[1em]
&= \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+y)\Gamma(\beta+n-y)}\,\theta^{\alpha+y-1}(1-\theta)^{\beta+n-y-1}
\label{betabinomialconj}\end{aligned}\] which is exactly the density of a \(\text{Beta}(\alpha+y,\beta+n-y)\) random variable. In this example, we see that we can obtain the posterior simply by incrementing the first parameter \(\alpha\) by the number of successes (heads) observed and the second parameter \(\beta\) by the number of failures (tails) observed.
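The update above can be sketched in a few lines of code. This is a minimal illustration using only the Python standard library; the function name, the \(\text{Beta}(2,2)\) prior, and the true success probability are our own illustrative choices.

```python
# Conjugate beta-binomial update: increment alpha by the number of heads,
# beta by the number of tails.
import random

def beta_binomial_update(alpha, beta, flips):
    """Return posterior (alpha, beta) after observing a list of 0/1 flips."""
    y = sum(flips)                     # number of heads (successes)
    n = len(flips)                     # total number of flips
    return alpha + y, beta + n - y

random.seed(0)
theta_true = 0.7
flips = [1 if random.random() < theta_true else 0 for _ in range(100)]

a_post, b_post = beta_binomial_update(2.0, 2.0, flips)
post_mean = a_post / (a_post + b_post)   # mean of the Beta posterior
```

With 100 flips, the posterior mean concentrates near the true success probability, mirroring the behavior of the visualization that follows.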

To see this in action, observe the visualization below. This shows the \(\text{Beta}(\alpha,\beta)\) prior distribution on \(\theta\), and updates itself to the posterior distribution on \(\theta\) given the observed data with each flip of the coin. To reset it, simply change the true success probability or a parameter of the prior.

As we see, after flipping the coin more and more, the mass of the posterior density concentrates around the true success probability. Notice that obtaining enough data can correct a poorly chosen prior on \(\theta\), but a sharper prior requires more data to change, as it encodes higher confidence around a certain value.

With the previous example in mind, we can consider the multinomial version of the same problem. For example, we might have a \(k\)-sided die, and we wish to perform inference on \(\theta\in S_{k-1}:=\{x\in\mathbb{R}^k: x_i\geq 0,\ \sum_{i=1}^k x_i=1\}\), the vector of probabilities of the die landing on each face. We now define the Dirichlet distribution, which is essentially a multivariate extension of the beta distribution.

We define the \(\text{Dirichlet}(\alpha)\) distribution, for \(\alpha=(\alpha_1,\ldots,\alpha_k)\) with each \(\alpha_i>0\), as the distribution on \(S_{k-1}\) with density \[p(\theta)=\frac{\Gamma\left(\sum_{i=1}^k\alpha_i\right)}{\prod_{i=1}^k\Gamma(\alpha_i)}\prod_{i=1}^k\theta_i^{\alpha_i-1}\,.\]

Following the previous problem, we can do inference in the same manner, with the Dirichlet distribution being the conjugate prior for multinomial data. Explicitly, consider a \(\text{Dirichlet}(\alpha)\) prior on \(\theta\), where \(\alpha=(\alpha_1,\ldots,\alpha_k)\). After observing a number of rolls of the die, we let \(y_i\) be the number of observed rolls of face \(i\), for \(i=1,\ldots,k\). Then the posterior on \(\theta\) given our observed data is \(\text{Dirichlet}(\alpha_1+y_1,\ldots,\alpha_k+y_k)\). It can be viewed exactly as a multivariate extension of the beta-binomial problem. We see an example below, which would correspond to a three-sided die with color-coded faces.
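The Dirichlet-multinomial update is just as mechanical as the beta-binomial one. A minimal sketch, with the function name and the example counts being our own:

```python
# Conjugate Dirichlet-multinomial update: add each face's observed count
# to the corresponding prior parameter.
def dirichlet_update(alpha, counts):
    """Posterior Dirichlet parameters given per-face counts."""
    return [a + y for a, y in zip(alpha, counts)]

# Uniform Dirichlet(1, 1, 1) prior on a three-sided die; we then observe
# faces 1, 2, 3 a total of (5, 2, 3) times across ten rolls.
posterior = dirichlet_update([1, 1, 1], [5, 2, 3])
posterior_mean = [a / sum(posterior) for a in posterior]
```

The posterior mean of each face probability is its updated parameter divided by the total, exactly generalizing the beta posterior mean.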

Before building up to the Dirichlet process through Polya’s urn, we will discuss a property of data called exchangeability and an important result which will be instrumental in establishing the existence of the Dirichlet process.

The reader is likely to be familiar with the concept of *independence*.

Two events \(A\) and \(B\) (measurable subsets of the sample space) are independent if and only if \[P(A\cap B)=P(A)P(B)\,.\]

This notion can be extended to collections of sets by saying the collections \(\mathcal{F}_1,\mathcal{F}_2\) are independent if for every \(E_1\in\mathcal{F}_1\), \(E_2\in\mathcal{F}_2\), \(E_1\) and \(E_2\) are independent.

Finally, two random variables \(X,Y\) are independent if the \(\sigma\)-algebras generated by them, \(\sigma(X)\) and \(\sigma(Y)\), are independent.

We can give a more intuitive characterization as well: real-valued random variables \(X,Y\) are independent if and only if their joint cumulative distribution function (CDF) equals the product of their marginal CDFs, \[F_{X,Y}(t,s)=F_X(t)F_Y(s)\] for all \(t,s\) in the respective sample spaces. If both \(X\) and \(Y\) have a density, we can characterize independence similarly: \[p_{X,Y}(t,s)=p_X(t)p_Y(s)\] for all \(t,s\) in the respective sample spaces.

Instead of independence, we are interested in the weaker condition of *exchangeability*, which is another property of finite or countable sets of random variables.

The random variables \(X_1,\ldots,X_n\) are *exchangeable* if and only if for any permutation \(\sigma\in S_n\) (the symmetric group on \(n\) elements), the joint distribution of \(X_{\sigma(1)},\ldots,X_{\sigma(n)}\) is the same as the joint distribution of \(X_1,\ldots,X_n\). A countable sequence of random variables \(\{X_n:n\in\mathbb{N}\}\) is exchangeable if the joint distribution is the same under any finite permutation of the indices. Equivalently, the sequence is exchangeable if \(X_1,\ldots,X_k\) are exchangeable for every \(k\in\mathbb{N}\).

It is important to note that while independence implies exchangeability, the converse does not hold; we will see an example of this soon!

The reason why we are discussing exchangeable random variables is due to a characterization by Bruno De Finetti:

Let \(\{X_n:n\in\mathbb{N}\}\) be a sequence of random variables. The sequence \(\{X_n:n\in\mathbb{N}\}\) is exchangeable if and only if for any \(n\in\mathbb{N}\), \[F(X_1,\ldots,X_n) = \int \prod_{i=1}^n F_X(X_i)\,\mathrm{d}P(F_X)\,.\] where \(F_X\) is the limiting empirical distribution defined as \[F_X(t) = \lim_{k\to\infty} \frac{1}{k}\sum_{i=1}^k \mathbb{I}_{\{X_i \leq t\}}\] when this limit exists. If \(F_X\) is indexed by a parameter \(\theta\), and has a density, we also have \[P(X_1=x_1,\ldots,X_n=x_n) = \int \prod_{i=1}^n p(x_i|\theta)\,\mathrm{d}F(\theta)\,.\] Intuitively, this means that exchangeable random variables are independent and identically distributed, conditioned on their limiting empirical distribution.

A proof in this general form is available in the appendix of O’Neill (2009).

Let \(\{X_n:n\in\mathbb{N}\}\) be a sequence of \(\{0,1\}\)-valued random variables. This sequence is exchangeable if and only if for any \(n\in\mathbb{N}\), \[P(X_1=x_1,\ldots,X_n=x_n)=\int_0^1 \theta^y(1-\theta)^{n-y}\,\mathrm{d}F(\theta) \label{defin}\] where \(y=\sum_{i=1}^n x_i\) and \(F(\theta)\) is the distribution function of the limiting empirical distribution, defined as \[F(\theta) := P(\overline{X}_\infty\leq \theta)\] for \[\overline{X}_\infty := \lim_{k\to\infty} \frac{1}{k}\sum_{i=1}^k X_i\] when this limit exists.

Now, we will switch gears to discuss a model known as Polya’s urn, which will not only provide us with a nontrivial example of exchangeable random variables (that is, an example in which they are exchangeable but not independent), but also serve as a vehicle towards understanding the Dirichlet process.

Suppose an urn contains two colors of balls. When considering finite or infinite sampling from the urn, it is typical to consider sampling with replacement and sampling without replacement. Under these conditions, the number of balls drawn of a certain class follows a binomial or hypergeometric distribution respectively. In the simple Polya’s urn model, we consider a third option: sampling with replacement and the addition of another ball of the same class into the urn. This type of process operates in a ‘rich get richer’ fashion, as drawing balls from one class makes that class more likely to be drawn in the future.

There exist many generalizations of this model: cases with any finite number of classes, countably many classes, as well as generalizations in which any integer number of balls is added or removed from the urn at each draw. For our purposes, we consider the process where just one additional ball is added at each draw.

Let’s see what draws from a Polya’s urn process look like. Considering an urn with three colors of balls, set the initial quantities in the urn and use the buttons to draw samples from the urn. Change the initial conditions to reset the demonstration.

This model acts as a stepping stone between familiar material and the Dirichlet process. In order to understand this transition, we will study some of the properties of Polya’s urn. Primarily, this provides an excellent example of exchangeable random variables. To gain a better understanding of the model, we will discuss an urn with just two colors of balls, initially containing \(a\) black and \(b\) white balls. As we move forward, we will extend the results from this case first to the case with finitely many colors, and then to include an unbounded quantity of colors, which will give rise to the Dirichlet process.

Let \(X_1,\ldots, X_n\) be defined such that \(X_k\) is the indicator random variable for the event that the \(k\)th draw is a black ball. That is, \[X_k=\begin{cases}1 & \text{the $k$th draw is a black ball}\\0 & \text{the $k$th draw is a white ball}\end{cases} \label{polyaindicatordef}\] Observe that \(X_1,\ldots,X_n\) are not independent, as each \(X_k\) depends on \(\{X_j:j=1,\ldots,k-1\}\), since the probability of drawing a black ball at time \(k\) depends on the balls drawn at every previous time step. For example, if we begin with one black and one white ball, and draw a black ball on the first draw, the second draw will be a black ball with probability \(\frac{2}{3}\). However, if we drew a white ball on the first draw, the probability of drawing black on the second would be only \(\frac{1}{3}\).
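The draw-by-draw dynamics just described are straightforward to simulate. A minimal sketch using only the standard library; the function name and seeding are our own choices.

```python
# Simulate draws from a two-color Polya urn: draw a ball in proportion to
# current counts, replace it, and add one more ball of the same color.
import random

def polya_urn_draws(a, b, n, rng):
    """Return the indicators X_1, ..., X_n (1 = black) for n urn draws."""
    black, white = a, b
    draws = []
    for _ in range(n):
        x = 1 if rng.random() < black / (black + white) else 0
        if x:
            black += 1
        else:
            white += 1
        draws.append(x)
    return draws
```

An urn that starts with only black balls can never produce a white draw, which gives a simple sanity check on the dynamics.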

\(X_1,\ldots,X_n\), the random variables corresponding to draws from the urn process, are exchangeable.

*Proof.* Explicitly, the joint distribution of \(X_1,\ldots,X_n\) is given by \[\begin{aligned}
\begin{split}
&P(X_1=x_1,\ldots,X_n=x_n) \\
=&P(X_1=x_1)P(X_2=x_2\,|\,X_1=x_1)\cdots P(X_n=x_n\,|\,X_1=x_1,\ldots,X_{n-1}=x_{n-1}) \\
=&\left(\frac{a}{a+b}x_1+\frac{b}{a+b}(1-x_1)\right)\cdot\left(\frac{a+x_1}{a+b+1}x_2+\frac{b+(1-x_1)}{a+b+1}(1-x_2)\right)\cdots \\
& \qquad\qquad\qquad\qquad\qquad\qquad\,\cdot \left(\frac{a+\sum_{i=1}^{n-1}x_i}{a+b+n-1}x_n + \frac{b+(n-1-\sum_{i=1}^{n-1}x_i)}{a+b+n-1}(1-x_n)\right) \\
=&\frac{a(a+1)\cdots\left(a+\sum_{i=1}^nx_i-1\right)\quad\cdot\quad b(b+1)\cdots\left(b+n-\sum_{i=1}^nx_i-1\right)}{(a+b)(a+b+1)\cdots(a+b+n-1)} \\
=&\frac{a^{\overline{y}}\, b^{\overline{n-y}}}{(a+b)^{\overline{n}}} \\
=& \frac{\Gamma(a+y)\Gamma(b+n-y)}{\Gamma(a+b+n)}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \\
=& \frac{B(a+y,b+n-y)}{B(a,b)}
\end{split}
\label{polyajointdist}\end{aligned}\] where \(y=\sum_{i=1}^nx_i\), the number of black balls drawn in total, \(z^{\overline{m}}=z(z+1)\cdots(z+m-1)\) denotes the rising factorial, \(\Gamma(\cdot)\) denotes the gamma function, and \(B(\cdot,\cdot)\) denotes the beta function. Thus, we see explicitly that the joint distribution depends on \((x_1,\ldots,x_n)\) only through \(y\), and is therefore invariant under permutation of the draws, establishing exchangeability. ◻

As we saw above, the quantity \[y=\sum_{i=1}^nX_i\,,\] the total number of black balls drawn after \(n\) draws, is sufficient to determine the joint distribution of \(X_1,\ldots,X_n\): the probability \(P(X_1=x_1,\ldots,X_n=x_n)\) depends on \((x_1,\ldots,x_n)\) only through \(\sum_{i=1}^nx_i\), not on the order of the draws. We note that this is actually a characterization of exchangeability for \(\{0,1\}\)-valued random variables.
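We can verify this exchangeability numerically: the sequential product of conditional probabilities should agree across reorderings with the same \(y\), and should match the closed form \(B(a+y,\,b+n-y)/B(a,b)\). A small sketch (function names are our own; `math.lgamma` is used to evaluate the beta function stably):

```python
# Check that the probability of a Polya urn draw sequence depends only on
# the number of black draws y, and matches B(a+y, b+n-y) / B(a, b).
import math

def polya_seq_prob(a, b, xs):
    """P(X_1 = x_1, ..., X_n = x_n) computed draw by draw."""
    black, white, p = a, b, 1.0
    for x in xs:
        p *= (black if x else white) / (black + white)
        if x:
            black += 1
        else:
            white += 1
    return p

def beta_fn(p, q):
    """Beta function via log-gamma for numerical stability."""
    return math.exp(math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q))
```

For \(a=2\), \(b=3\), the sequences \((1,0,1,0)\) and \((0,0,1,1)\) both have \(y=2\) and probability \(B(4,5)/B(2,3)=3/70\).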

Using this exchangeability property of the Polya’s urn model, we seek to answer a natural question about Polya’s urn: its behavior as the number of draws from the urn gets very large. In particular, we wish to determine the limiting empirical distribution of draws from the urn, which describes exactly the limiting proportion of balls of a specific color after infinitely many draws. It turns out, thanks to exchangeability, we can apply De Finetti’s theorem to help us determine this limiting distribution. Define \[\overline{X}_k := \frac{1}{k}\sum_{i=1}^k X_i\,,\] and \[\overline{X}_\infty=\lim_{k\to\infty}\overline{X}_k=\lim_{k\to\infty}\frac{1}{k}\sum_{i=1}^k X_i\] the limiting proportion of black balls *that have been drawn*.

The proportion of black balls drawn from the urn converges almost surely to a limiting random variable \(\overline{X}_\infty\sim\text{Beta}(a,b)\).

After conditioning on this limiting proportion, draws from the urn are independent and identically Bernoulli distributed. That is, we have \[P(X_1=x_1,\ldots,X_n=x_n\,|\,\overline{X}_\infty = \theta)=\theta^y(1-\theta)^{n-y}\,,\] where \(y=\sum_{i=1}^n x_i\), or equivalently \[P\left(\sum_{i=1}^nX_i = y\,|\,\overline{X}_\infty=\theta\right) = \binom{n}{y} \theta^y(1-\theta)^{n-y}\] which can also be stated as \[\sum_{i=1}^nX_i \,\Big|\,\overline{X}_\infty \sim \text{Binomial}\left(n,\overline{X}_\infty\right)\,.\]

*Proof.* First, in order to make sure that \(\overline{X}_\infty\) is well-defined, we need to ensure convergence of the sequence \(\{\overline{X}_n:n\in\mathbb{N}\}\). To do this, consider the proportion of black balls *in the urn* after \(n\) draws, \[P_n := \frac{a+n\overline{X}_n}{a+b+n}\,,\] and note that the probability that draw \(n+1\) is black is exactly \(P_n\). We then have \[\begin{aligned}
E[P_{n+1}|P_1,\ldots,P_n] &=
\left(\frac{(a+b+n)P_n+1}{a+b+n+1}\right)\left(P_n\right) + \left(\frac{(a+b+n)P_n}{a+b+n+1}\right)\left(1-P_n\right) \\
&= \frac{(a+b+n)P_n^2 + P_n + (a+b+n)P_n - (a+b+n)P_n^2}{a+b+n+1} \\
&= P_n\,,\end{aligned}\] so \(\{P_n:n\in\mathbb{N}\}\) is a *martingale*. If the reader is not familiar with martingales, they describe stochastic processes with the above property: their conditional expectation given all previous states is exactly equal to the most recent state. This concept was initially developed for the mathematical formalization of gambling. With this in mind, we can apply a fundamental result in martingale theory, Doob’s martingale convergence theorem, which guarantees that \(P_n\) converges almost surely. Since \(P_n-\overline{X}_n\to 0\) as \(n\to\infty\), the sequence \(\{\overline{X}_n:n\in\mathbb{N}\}\) converges almost surely to the same limit \(\overline{X}_\infty\).

Moving forward, since each \(X_k\) is \(\{0,1\}\)-valued and exchangeable, as we have shown earlier, De Finetti’s theorem implies that \[P(X_1=x_1,\ldots, X_n=x_n) = \int_0^1 \theta^y (1-\theta)^{n-y}\,\mathrm{d}F(\theta)\] where \(y=\sum_{i=1}^n x_i\) and \(F(\theta)=P(\overline{X}_\infty\leq \theta)\) is the distribution function of the limiting proportion of black balls. In addition, we have an expression for the joint distribution from earlier. Equating these, we observe \[\int_0^1 \theta^y (1-\theta)^{n-y}\,\mathrm{d}F(\theta)=\frac{B(a+y,b+n-y)}{B(a,b)}\,.\] Using the definition of the beta function, we therefore observe \[\begin{aligned} \int_0^1 \theta^y(1-\theta)^{n-y}\,\mathrm{d}F(\theta)&=\frac{1}{B(a,b)}\int_0^1 \theta^{a+y-1}(1-\theta)^{b+n-y-1}\,\mathrm{d}\theta \\ &= \int_0^1\left(\theta^y(1-\theta)^{n-y}\right)\left(\frac{1}{B(a,b)}\theta^{a-1}(1-\theta)^{b-1}\right)\,\mathrm{d}\theta\,.\end{aligned}\] Since the quantity in parentheses on the right is continuous as a function of \(\theta\), the Riemann integral on the right is equivalent to the Riemann-Stieltjes integral on the left, where \(F(\theta)\) is the antiderivative of the quantity in parentheses. Furthermore, this quantity is exactly the density of a \(\text{Beta}(a,b)\) random variable, so \(F(\theta)\) is the distribution function of a \(\text{Beta}(a,b)\) random variable. Therefore, we have used De Finetti’s theorem to show the desired result, that \[\overline{X}_\infty\sim \text{Beta}(a,b)\,.\] The second claim follows directly from O’Neill’s formulation of De Finetti’s theorem. ◻
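We can check this limit empirically by truncating the urn process and comparing the sample mean of the final urn proportion to the \(\text{Beta}(a,b)\) mean \(a/(a+b)\). A minimal sketch, with parameter choices and function names of our own:

```python
# Monte Carlo check: the (truncated) limiting proportion of black balls in a
# Polya urn started with (a, b) balls should behave like a Beta(a, b) draw.
import random

def polya_limit_sample(a, b, n_draws, rng):
    """Approximate one draw of the limiting black-ball proportion
    by truncating the urn process after n_draws steps."""
    black, white = a, b
    for _ in range(n_draws):
        if rng.random() < black / (black + white):
            black += 1
        else:
            white += 1
    return black / (black + white)

rng = random.Random(42)
samples = [polya_limit_sample(2, 3, 200, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)   # should approach E[Beta(2, 3)] = 0.4
```

The histogram of such samples approximates the \(\text{Beta}(2,3)\) density, just as in the article's two-color visualization.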

Now that we understand this result in the case of the urn with two colors of balls, we seek to generalize. If the urn has finitely many colors/classes, say \(k\), an analogous result holds as well. We let \(X_n:=(X^{(1)}_n,\ldots,X^{(k)}_n)\) denote the indicator vector of the color of the ball drawn at time \(n\). Similarly to the previous case, given a vector of initial starting quantities \(\alpha=(\alpha_1,\ldots,\alpha_k)\), the limit \[\overline{X}_\infty = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n X_i\] exists almost surely and in \(L^1\), and \[\overline{X}_\infty\sim\text{Dirichlet}(\alpha)\,.\]
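A quick simulation makes the multicolor analogue concrete: the average of many truncated limiting-proportion draws should match the \(\text{Dirichlet}(\alpha)\) mean \(\alpha_i/\sum_j\alpha_j\). This is a sketch under our own parameter choices.

```python
# Multicolor Polya urn: truncated limiting proportions should average out
# to the Dirichlet mean alpha_i / sum(alpha).
import random

def multicolor_urn_limit(alpha, n_draws, rng):
    """Truncated limiting color proportions for an urn with initial counts alpha."""
    counts = list(alpha)
    for _ in range(n_draws):
        r = rng.random() * sum(counts)
        for j, c in enumerate(counts):   # pick a color in proportion to its count
            if r < c:
                counts[j] += 1
                break
            r -= c
    total = sum(counts)
    return [c / total for c in counts]

rng = random.Random(11)
props = [multicolor_urn_limit([2, 3, 5], 300, rng) for _ in range(1000)]
avg = [sum(p[i] for p in props) / len(props) for i in range(3)]
# Dirichlet(2, 3, 5) has mean (0.2, 0.3, 0.5)
```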

We can see an example of this in the visualization below.

Limiting proportions of each color are drawn from the urn on the right and added to the histogram on the left. As we see, the distribution of these limiting proportions converges to the Dirichlet distribution with parameter corresponding to the initial quantities.

Note that we cannot truly simulate a draw of the limiting proportions; we must instead truncate the sequence at some point. However, as we see, this still provides a good approximation of the limiting proportion.

Finally, we can once again extend this result, now to the case with countably many colors/classes, known as the Blackwell-MacQueen sampling scheme. Suppose the urn is now filled with \(\alpha\) black balls. At each time interval, we draw a ball from the urn. If a black ball is drawn, we replace the black ball and add an additional ball of a new color. If a non-black ball is drawn, we follow the standard procedure of replacing that ball and adding an additional ball of the same color. In essence, this is an extension of the urn scheme to include an unlimited countable quantity of colors, with the parameter \(\alpha\) representing the system’s affinity for creating new colors.

Once again, we wish to determine the limiting behavior of this process after infinitely many draws from the urn. Similarly to Polya’s urn with finitely many colors, after infinitely many draws, this process will converge to a limiting discrete measure with probability \(1\); the sequence of limiting proportions, taken in the order the colors first appear, follows what is known as the *GEM distribution* (named for Griffiths, Engen, and McCloskey) with parameter \(\alpha\).

To better understand this GEM distribution, we turn our attention to a property of the Blackwell-MacQueen urn. Since the urn begins with only \(\alpha\) black balls, after the first draw, the urn will necessarily contain \(\alpha\) black balls and one ball of a new color, which we call color 1. Now, considering the urn to contain balls of either color 1 or not of color 1, we can observe that it is equivalent to an urn with balls of only two colors that initially begins with 1 ball of color 1 and \(\alpha\) balls of another color. Similarly, if we ignore any balls of color 1, then balls of color 2 will behave like a Polya’s urn with 1 initial ball of color 2 and \(\alpha\) initial balls of another color. We can continue this to apply to balls of color \(k\), if we ignore balls of colors \(1,\ldots,k-1\).

Let \(\overline{X}_\infty^{(k)}\) be the limiting proportion of balls of color \(k\) in the Blackwell-MacQueen urn. Using what we know about Polya’s urn with two colors and our observation above, \[\overline{X}_\infty^{(1)}\sim\text{Beta}(1,\alpha)\] Now, ignoring balls of color 1, the relative limiting proportion of balls of color 2 will be \(\text{Beta}(1,\alpha)\) as well. Therefore, the limiting proportion of balls of color 2 in the urn is given by \[\overline{X}_\infty^{(2)}\,\Big|\,\overline{X}_\infty^{(1)}\sim \left(1-\overline{X}_\infty^{(1)}\right)\text{Beta}(1,\alpha)\,.\] For arbitrary \(k\in\mathbb{N}\), we extend this to observe that \[\overline{X}_\infty^{(k)}\,\Big|\,\overline{X}_\infty^{(1)},\ldots,\overline{X}_\infty^{(k-1)} \sim \left(\prod_{i=1}^{k-1}\left(1-\overline{X}_\infty^{(i)}\right)\right)\text{Beta}(1,\alpha)\,.\] Looking at individual colors in the Blackwell-MacQueen urn scheme in this way provides insight into the behavior of the GEM distribution to which the limiting proportions converge. This type of perspective becomes valuable when we wish to sample from the GEM distribution, and later the Dirichlet process.

The previous ideas are commonly framed in terms of the *stick-breaking process*.

Suppose we have a stick of unit length. First, we draw \(\beta_1\sim\text{Beta}(1,\alpha)\), and break off this proportion of the stick. At the second time step, we take the remaining piece, draw \(\beta_2\sim\text{Beta}(1,\alpha)\), and break off this proportion. Continuing this process, we let \(w_k\) denote the length of the piece broken off at time \(k\). Then \[w_k = \beta_k\prod_{i=1}^{k-1}\left(1-\beta_i\right)\,.\] Through our observations about the Blackwell-MacQueen urn scheme earlier, we see that the stochastic process given by the sequence of weights \(\{w_n:n\in\mathbb{N}\}\) follows the GEM distribution with parameter \(\alpha\). As \(\alpha\) grows larger, the distribution \(\text{Beta}(1,\alpha)\) favors lower values, which favors a lower, more spread out set of weights, corresponding to a stick broken into more relatively even pieces. This is exactly analogous to how larger \(\alpha\) favors the addition of new colors of balls in the Blackwell-MacQueen urn scheme.

Another very related process is the *restaurant seating process*. Suppose a restaurant exists with an unlimited quantity of tables. The first customer sits at a table, and each subsequent customer chooses to sit at either an already populated table or a new table, favoring tables in proportion to the number of people already there. The restaurant seating process is mathematically defined as a stochastic process \(\{X_n:n\in\mathbb{N}\}\) with realizations in the space of partitions of \(\{1,\ldots,n\}\). \(X_1\) is set to \(\{\{1\}\}\), the trivial partition of a one-element set.

We define \(X_n\) to be the partition \(E_1,\ldots,E_k\) of \(\{1,\ldots,n\}\), with \(k\leq n\). The subsequent realization \(X_{n+1}\) is obtained by adding \(n+1\) to one of \(E_1,\ldots, E_{k+1}\) with probabilities given by \[P(n+1 \in E_j) =
\begin{cases}
\frac{|E_j|}{n+1} \quad j=1,\ldots,k\\
\frac{1}{n+1} \quad j=k+1
\end{cases}\] where \(E_{k+1}\) is a new block of the partition. We can slightly generalize this process by adding the *scaling parameter* \(\alpha\). In this case, we assign customers to tables with probabilities \[P(n+1 \in E_j) =
\begin{cases}
\frac{|E_j|}{n+\alpha} \quad j=1,\ldots,k\\
\frac{\alpha}{n+\alpha} \quad j=k+1
\end{cases}\] with the condition \(\alpha\geq 0\) so that the probabilities are always nonnegative. The scaling parameter \(\alpha\) represents the new customer’s affinity towards joining a new table, with \(\alpha=0\) being the case where all customers will sit at the same table. There exists a further two-parameter generalization including an additional parameter \(\beta\), but we will discuss the one-parameter model, equivalent to the case where \(\beta=0\), due to its closer connection with the Dirichlet process.
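The seating rule above can be sketched as follows. This is a minimal simulation (pure standard library; the function name is our own) that tracks only the table sizes, which is all the seating probabilities depend on.

```python
# Restaurant seating process: customer i+1 joins table j with probability
# |E_j| / (i + alpha), or starts a new table with probability alpha / (i + alpha).
import random

def crp_partition(n, alpha, rng):
    """Seat n customers; return the list of table sizes."""
    tables = []
    for i in range(n):                     # i customers already seated
        r = rng.random() * (i + alpha)     # uniform over total "weight"
        for j, size in enumerate(tables):
            if r < size:
                tables[j] += 1             # join an existing table
                break
            r -= size
        else:
            tables.append(1)               # start a new table
    return tables
```

With \(\alpha=0\), every customer after the first joins the lone occupied table, matching the edge case described above.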

It is straightforward to see that this process is fundamentally identical to the Blackwell-MacQueen urn scheme, with customers replaced by balls and tables replaced by colors. Indeed, the limiting proportions of customers at each table follow the GEM distribution with parameter \(\alpha\).

Another interesting question concerns the distribution of the number of tables after \(n\) customers have entered the restaurant (equivalently, the number of colors in the Blackwell-MacQueen urn after \(n\) draws).

The expected number of tables after \(n\) customers have entered the restaurant is approximately \[\alpha\log\left(1+\frac{n}{\alpha}\right)\] for large \(\alpha\) and large \(n\).

*Proof.* For each customer, regardless of what has happened previously, the probability of starting a new table is given by \(\frac{\alpha}{N+\alpha}\) where \(N\) is the number of customers who have previously entered. Thus, at time \(k\), it is given by \(\frac{\alpha}{k-1+\alpha}\). Note that this is independent of any previous observations. Therefore, letting \(T_n\) denote the number of tables at time \(n\), we have \[\begin{aligned}
E[T_n] &= \sum_{k=1}^n \frac{\alpha}{k-1+\alpha} \\
&= \alpha \sum_{k=0}^{n-1} \frac{1}{k+\alpha} \\
&= \alpha \left(\psi(\alpha+n)-\psi(\alpha)\right) \\
&\approx \alpha(\log(\alpha+n)-\log(\alpha)) \\
&= \alpha\log\left(1+\frac{n}{\alpha}\right)\end{aligned}\] where \(\psi\) is the digamma function, and the approximation \(\psi(x)\approx\log x\) holds when the arguments are large. ◻

Finally, we are ready to use our knowledge of the Blackwell-MacQueen urn scheme and its related representations to define the Dirichlet process.

Let \(H\) be a probability measure over a sample space \(\Omega\) and \(\alpha\in\mathbb{R}\) with \(\alpha>0\). Let \(X\) be a random probability measure on \(\Omega\). If for every finite measurable partition \(E_1,\ldots, E_n\) of \(\Omega\) we have \[\big(X(E_1),\ldots,X(E_n)\big)\sim \text{Dirichlet}\big(\alpha\,H(E_1),\ldots,\alpha\,H(E_n)\big)\] then \(X\) follows a Dirichlet process with *base measure* \(H\) and *concentration parameter* \(\alpha\), denoted \(X\sim \text{DP}(H,\alpha)\).

In this sense, the defining property of the Dirichlet process is that its marginal distributions are Dirichlet distributed. However, while this explicitly defines a Dirichlet process, it does not tell us what it actually is, or even if it exists.

Historically, the existence of the Dirichlet process was first shown by Ferguson.

We can construct the Dirichlet process from the stick-breaking process by associating a draw from some underlying distribution \(H\) on a sample space \(\Omega\) with each component of the stick-breaking distribution. Precisely, we can view a realization from the Dirichlet process as a sum of infinitely many weighted point masses.

Consider two infinite-dimensional vectors: weights \(\{w_n:n\in\mathbb{N}\}\sim \text{GEM}(\alpha)\) (so that \(\sum_{n=1}^\infty w_n=1\)) and locations \(\{x_n:n\in\mathbb{N}\}\overset{iid}{\sim}H\). The random measure \[\mu:=\sum_{i=1}^\infty w_i\delta_{x_i}\] follows a Dirichlet process with base measure \(H\) and concentration parameter \(\alpha\).

Thus, this association of stick-breaking weights with locations from \(H\) is a construction of the Dirichlet process. We will not give a proof of this, but the interested reader can refer to section 3 of Sethuraman (1994)

This allows us to form an intuitive understanding of what a realization from the Dirichlet process looks like by considering the processes discussed earlier: the Blackwell-MacQueen urn scheme, restaurant seating process, and the stick-breaking process. Once again, suppose we have an underlying measure \(H\) on a sample space \(\Omega\). Then, to construct the Dirichlet process, we can associate to each independent draw from \(H\) a weight corresponding to one of the following:

The length of a stick in the stick-breaking process

The limiting proportion of balls of a certain color in the Blackwell-MacQueen urn scheme

The limiting proportion of customers at a specific table in the restaurant seating process

Now, we provide a visualization that illustrates how random measures created in this fashion satisfy the defining property of the Dirichlet process. If \(\mu\sim \text{DP}(H,\alpha)\), then when we divide the sample space \(\Omega\) into a finite partition \(E_1,\ldots, E_n\), the random vector \((\mu(E_1),\ldots,\mu(E_n))\) will follow a Dirichlet distribution with parameter \((\alpha H(E_1),\ldots,\alpha H(E_n))\).

For example, let \(H\) be uniform on \(\Omega=[0,1]\times [0,1]\). We can draw samples from \(\text{DP}(H,\alpha)\) using the stick-breaking process with uniform draws as locations, and visually represent them as circles in the square, with area proportional to the weight of each point mass. In the visualization below, move the line endpoints to partition the square, and observe the limiting distribution of the proportion of mass contained in one region, which should approximate \(\text{Beta}(\alpha A_1, \alpha A_2)\), where \(A_1,A_2\) denote the areas of the two regions.

Remember that in the image on the left, we show just one draw from the Dirichlet process. The histogram on the right is a distributional approximation, in which the proportion of mass contained in a region is sampled over a large number of draws from the process.
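The same experiment can be run in code: draw many truncated stick-breaking realizations over the unit square with a uniform base measure, and record the mass falling to the left of a vertical split at \(x=0.3\). Since \(\text{Beta}(\alpha A_1,\alpha A_2)\) has mean \(A_1\), the empirical mean of these masses should approach the area of the left region. The truncation level and seed below are arbitrary choices for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_sticks, n_draws, split = 2.0, 500, 2000, 0.3

masses = np.empty(n_draws)
for d in range(n_draws):
    # one truncated stick-breaking draw from DP(Uniform([0,1]^2), alpha)
    betas = rng.beta(1.0, alpha, size=n_sticks)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining
    xs = rng.uniform(size=(n_sticks, 2))        # atoms iid uniform on the square
    masses[d] = weights[xs[:, 0] < split].sum()  # mass of the region {x < 0.3}

print(masses.mean())   # should be close to A_1 = 0.3
```

A histogram of `masses` is exactly the distributional approximation shown on the right of the visualization.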

Now that we know what the Dirichlet process looks like, we can discuss some of its properties.

Now, we identify the parameters of the Dirichlet process with notions of mean and variance.

The expectation of a Dirichlet process is its base measure.

*Proof.* Let \(X\sim \text{DP}(H,\alpha)\). Then, using the partitioning property of the Dirichlet process, for any measurable set \(A\) in the sample space, we have \[X(A) \sim \text{Beta}\left(\alpha H(A), \alpha(1-H(A))\right)\] and thus \[E[X(A)] = \frac{\alpha H(A)}{\alpha H(A)+\alpha(1-H(A))} = H(A)\,.\] ◻

Thus, the base measure \(H\) is indeed the mean of the Dirichlet process. However, even if \(H\) is continuous, samples from the Dirichlet process are almost surely discrete.

We can use the following visualization to examine what samples from the Dirichlet process look like. Here, we can draw samples from \(\text{DP}(H,\alpha)\) where \(H\) is a standard normal distribution on the real line and \(\alpha\) is user-selected.

To motivate the useful property of the Dirichlet process as a prior in Bayesian inference, we first turn back to its finite dimensional counterpart. In Bayesian inference, the beta distribution is inherently useful as a prior distribution for a binomial data-generating process due to its conjugate nature. More explicitly, when our data is drawn from something of binomial nature, such as repeated flips of a coin, we can put a beta prior on the proportion to yield a beta posterior on the proportion as well. Indeed, the same is done using the Dirichlet distribution for multinomial data, like rolling a \(k\) sided (unfair) die.
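The beta-binomial update can be written down directly: a \(\text{Beta}(a,b)\) prior combined with \(k\) successes in \(n\) trials yields a \(\text{Beta}(a+k,\,b+n-k)\) posterior. A minimal sketch, with the prior and data values chosen arbitrarily:

```python
# Conjugate beta-binomial update: Beta(a, b) prior on the coin's heads probability.
a, b = 2.0, 2.0          # prior pseudo-counts
n, k = 10, 7             # observed: 7 heads out of 10 flips

a_post, b_post = a + k, b + (n - k)        # posterior is Beta(a + k, b + n - k)
posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)    # (2 + 7) / (2 + 7 + 2 + 3) = 9/14
```

The Dirichlet-multinomial update works identically, adding the per-category counts to the prior parameter vector.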

We now extend this to the case when our data comes from a categorical discrete distribution over a countably infinite number of categories. A prototypical problem of this type is clustering data points when the number of clusters is thought to grow without bound as the sample size increases. An example of this could be grouping articles by their topics, as the number of topics should grow with the number of articles. We seek to determine which group each data point belongs in, when the number of potential groups is unknown and bounded only by the number of data points.

Let \(X\sim \text{DP}(H,\alpha)\), and let \(\Omega\) be the sample space of \(H\). If we consider exchangeable samples \(\theta_1,\ldots,\theta_n\) drawn from \(X\) as our data, we look to infer the posterior distribution \(X|\theta_1,\ldots,\theta_n\).

If \(X\sim\text{DP}(H,\alpha)\) and we observed \(\theta_1,\ldots,\theta_n\) as draws from \(X\), the posterior distribution is given by \[X|\theta_1,\ldots,\theta_n\sim\text{DP}\left(\frac{\alpha\,H+\sum_{i=1}^n\delta_{\theta_i}}{\alpha+n}, \alpha+n\right)\,.\]

*Proof.* To show that the posterior of \(X|\theta_1,\ldots,\theta_n\) must necessarily be a Dirichlet process as well, we consider the defining property of a Dirichlet process. Let \(E_1,\ldots,E_k\) be a finite partition of \(\Omega\). Note that since \(X\) is a Dirichlet process, we have \[\left(X(E_1),\ldots,X(E_k)\right)\sim\text{Dirichlet}\big(\alpha\,H(E_1),\ldots,\alpha\,H(E_k)\big)\,.\] If we define \[n_j = \sum_{i=1}^n{1_{\{\theta_i\in E_j\}}}\,,\] the count of observed samples that lie in \(E_j\), then we observe that \[\left(X(E_1),\ldots,X(E_k)\right)|\theta_1,\ldots,\theta_n\sim\text{Dirichlet}(\alpha\,H(E_1)+n_1,\ldots,\alpha\,H(E_k)+n_k)\] using the conjugacy between the Dirichlet and multinomial distributions. Since \(\alpha H(E_j)+n_j=(\alpha+n)\cdot\frac{\alpha H(E_j)+\sum_{i=1}^n\delta_{\theta_i}(E_j)}{\alpha+n}\) and this holds for every finite partition, the posterior is the Dirichlet process stated above. ◻
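This conjugate update is easy to verify numerically: on any finite partition, the posterior Dirichlet parameters are the prior parameters plus the partition counts, and the posterior mean matches the updated base measure \((\alpha H+\sum_i\delta_{\theta_i})/(\alpha+n)\) evaluated on each cell. A sketch with an arbitrary three-cell partition:

```python
import numpy as np

alpha = 2.0
H_masses = np.array([0.5, 0.3, 0.2])   # H(E_1), H(E_2), H(E_3) for a partition of Omega
counts = np.array([4, 1, 0])           # n_j: observed samples falling in each cell
n = counts.sum()

prior_params = alpha * H_masses               # Dirichlet prior parameters
post_params = prior_params + counts           # conjugate update
post_mean = post_params / post_params.sum()   # E[X(E_j) | theta_1, ..., theta_n]

# Same quantity computed from the posterior base measure (alpha*H + counts)/(alpha + n)
base_measure_on_cells = (alpha * H_masses + counts) / (alpha + n)
```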

Intuitively, this should be logical as we are augmenting the prior base measure \(H\) by adding a collection of point masses corresponding to the observed data.

Examining this posterior distribution, we see that the new base measure is a weighted average of the previous measure \(H\) and the observed data. \(\alpha\) acts as a weight for the prior base distribution \(H\) while the number of observations \(n\) acts as a weight for the empirical distribution of the observed data. As \(\alpha\to 0\), the base measure loses its impact on the data, and the posterior is entirely determined by the observed data. A similar phenomenon occurs as the number of observed data points \(n\) grows large, regardless of the value for \(\alpha\).

Notice that the posterior predictive distribution of a new observation \(\tilde{\theta}\) can be obtained by marginalizing the distribution of \(\tilde{\theta}\) given \(\theta_1,\ldots,\theta_n\) over the posterior of \(X|\theta_1,\ldots,\theta_n\). Since the base measure of a Dirichlet process is its expectation, we have \[\tilde{\theta}|\theta_1,\ldots,\theta_n\sim E(X|\theta_1,\ldots,\theta_n)=\frac{\alpha\,H+\sum_{i=1}^n\delta_{\theta_i}}{\alpha+n}\,.\]

At this point, we pivot from the theoretical foundations and properties of the Dirichlet process to a class of models in which they are widely used as prior distributions. Mixture models are probabilistic models that suppose that the data comes from a mixture of component distributions. Inference on these models seeks to determine the parameters of each distribution from the data, such that the mixture of the distributions best represents the data, without knowing which data points come from each distribution. They are widely used to model subpopulations within a greater population, such as certain species within an animal population.

The general density of a finite mixture model is given by a weighted sum of densities from subpopulations. Explicitly, for a mixture with \(k\) components (subdistributions), its density is \[p(x|\pi,\theta) = \sum_{i=1}^k \pi_i p_i(x|\theta_i)\] where \(\pi\) is a vector of weights often called *mixing weights*, and \(p_i(x)\) is the density of the \(i\)th subdistribution, \(i=1,\ldots,k\). We see that in order for this to be a valid density, the mixing weights must sum to one. When fitting the model, we seek to infer the mixing weights \(\pi\) and component parameters \(\theta\) from the data.
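As a small numerical sketch of this density, consider a two-component univariate Gaussian mixture; its density is just a convex combination of normal densities, and it integrates to one because each component does. The weights and parameters below are arbitrary:

```python
import numpy as np

def norm_pdf(x, mu, sd):
    """Density of N(mu, sd^2)."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, weights, means, sds):
    """Finite mixture density: sum_i pi_i * N(x | mu_i, sd_i^2)."""
    return sum(w * norm_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

weights = [0.4, 0.6]                  # mixing weights, must sum to 1
means, sds = [-2.0, 1.0], [1.0, 0.5]

grid = np.linspace(-8.0, 8.0, 4001)
density = mixture_pdf(grid, weights, means, sds)
dx = grid[1] - grid[0]
total_mass = (0.5 * (density[:-1] + density[1:]) * dx).sum()  # trapezoid rule, ~1
```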

Given observations \(x_1,\ldots,x_n\) and a parametric class of distributions \(F(\theta)\), we suppose that each observation \(x_i\) belongs to a subpopulation indexed by a latent variable \(z_i\). This model has two hyperparameters: \(\alpha\), the parameter of the Dirichlet prior of the mixing weights, and \(\beta\), the parameter(s) of the prior of each component parameter \(\theta_i\), \(i=1,\ldots,k\). \[\begin{aligned} \pi|\alpha&\sim\text{Dirichlet}\left(\frac{\alpha}{k},\ldots,\frac{\alpha}{k}\right)\\ z_i|\pi&\sim\text{Categorical}(\pi) \\ \theta_i|H,\beta &\sim H(\beta) \\ x_i|z_i,\theta_1,\ldots,\theta_k &\sim F(\theta_{z_i})\end{aligned}\] \(H\), the prior on the component parameters, is typically selected to be the conjugate prior to \(F\).

As an example, let’s consider a bivariate Gaussian mixture model with three components. For observed data \(x_1,\ldots,x_n\), we now can reformulate the above model to have multivariate normal component distributions: \[\begin{aligned}
\pi|\alpha &\sim \text{Dirichlet}\left(\frac{\alpha}{3},\frac{\alpha}{3},\frac{\alpha}{3}\right) \\
z_i|\pi &\sim \text{Categorical}(\pi) \\
(\mu_i,\Lambda_i)|\mu_0,\kappa_0,\nu_0,T_0 &\sim \text{NW}(\mu_0,\kappa_0,\nu_0, T_0) \\
x_i|z_i,\mu_1,\Lambda_1,\ldots,\mu_3,\Lambda_3 &\sim \mathcal{N}(\mu_{z_i},\Lambda_{z_i}^{-1})\end{aligned}\] where \(\text{NW}\) denotes the normal-Wishart distribution, the conjugate prior of the multivariate normal with unknown mean and covariance.
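Drawing one component's parameters from the normal-Wishart prior can be sketched in two steps: sample a precision matrix \(\Lambda\sim\text{Wishart}(\nu_0,T_0)\), then sample \(\mu\sim\mathcal{N}(\mu_0,(\kappa_0\Lambda)^{-1})\). The hyperparameter values below are arbitrary placeholders, and we use the fact that for integer \(\nu_0\) a Wishart draw is a sum of outer products of Gaussian vectors:

```python
import numpy as np

rng = np.random.default_rng(3)
mu0 = np.zeros(2)            # prior mean
kappa0, nu0 = 1.0, 4         # mean-precision scaling and (integer) degrees of freedom
T0 = np.eye(2)               # scale matrix of the Wishart

# Lambda ~ Wishart(nu0, T0), built as G^T G with rows G_i ~ N(0, T0)
L = np.linalg.cholesky(T0)
G = rng.standard_normal((nu0, 2)) @ L.T
Lam = G.T @ G                                  # 2x2 precision matrix

# mu | Lambda ~ N(mu0, (kappa0 * Lambda)^{-1})
mu = rng.multivariate_normal(mu0, np.linalg.inv(kappa0 * Lam))
```

Repeating this \(k\) times yields the \((\mu_i,\Lambda_i)\) pairs in the model above.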

Let’s fit this model to some data. We will be considering the Palmer penguins dataset.

Now, we extend the idea of finite mixture models to the infinite case. Suppose now that instead of a model where our data comes from a fixed, finite number of component distributions, we do not impose such a restriction. Consider a density of the following form: \[p(x|\theta,P) = \int p(x|\theta)\,\mathrm{d}P(\theta)\,.\] That is, the model is a mixture of the component distributions with respect to a mixing measure \(P\). Due to the nice conjugacy properties we showed earlier, the Dirichlet process turns out to be a desirable prior to use for \(P\) in a Bayesian modeling setting.

Suppose we have a set of exchangeable data points \(x_1,\ldots,x_n\). We again suppose that this data is generated from a mixture of components each with a parametric distribution \(F(\theta)\). We let \(P\) represent the mixing distribution over these parameters, and let our prior on \(P\) be a Dirichlet process with base measure \(H\) and concentration \(\alpha\). First introduced in Antoniak (1974), the Dirichlet process mixture model can be written as:

\[\begin{aligned} x_i|\theta_i\sim F(\theta_i) \\ \theta_i|P\sim P \\ P|H,\alpha \sim \text{DP}(H,\alpha)\end{aligned}\]

While this model nicely makes use of the theory that we have introduced earlier, it can be fairly unintuitive, especially when compared to the finite mixture model. In practice, we can modify the above formulation to make it more analogous with the finite case by adding a latent variable \(z_i\) that once again describes the class label of each data point \(x_i\).

We can suppose that the model is a countable weighted average of subdistributions, with the form \[p(x|\pi,\theta)=\sum_{i=1}^\infty \pi_i\,p_i(x|\theta_i)\,,\qquad \sum_{i=1}^\infty \pi_i = 1\,.\] In practice, since we fit these models to finite sets of data, we note that the number of components is practically bounded by the number of data points \(n\), as otherwise the model would consist of components from which none of the data comes. Furthermore, the actual bound is much tighter, as the number of components actually follows the same logarithmic bound that we see on the number of tables in the restaurant seating process. It may seem as if the infinite case is the same as a finite mixture model with \(n\) components, but the key distinction is that the number of clusters is not fixed, and is fit to the data.

Similarly to the finite case, considering observed data \(x_1,\ldots,x_n\), we suppose again that each \(x_i\) belongs to a subpopulation indexed by a latent variable \(z_i\). However, in this case, we let our hyperparameter \(\alpha\) be the parameter of a GEM distribution used as the prior for the mixing weights \(\pi\), and again let \(\beta\) be the parameter of the prior of each component parameter \(\theta_i\), \(i=1,\ldots,n\). Note that we make use here of the number of data points as a practical bound for the number of components. Explicitly, we can write the model as:

\[\begin{aligned} \pi|\alpha &\sim\text{GEM}(\alpha) \\ z_i|\pi &\sim \text{Categorical}(\pi) \\ \theta_i|H,\beta &\sim H(\beta) \\ x_i|z_i,\theta_1,\ldots,\theta_n &\sim F(\theta_{z_i})\end{aligned}\] Thus, the infinite mixture model, when fit to some data, can be thought of similarly to the finite mixture model, with the key difference being the GEM prior on the mixing weights as opposed to a symmetric Dirichlet distribution.
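The generative process above can be simulated directly with a truncated GEM prior: sample stick-breaking weights, assign each point a component label, and draw the point from its component's distribution. This sketch uses univariate normal components and an arbitrary normal prior on the component means:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, n, trunc = 1.5, 300, 100      # concentration, sample size, truncation level

# pi | alpha ~ GEM(alpha), truncated at `trunc` components
betas = rng.beta(1.0, alpha, size=trunc)
pi = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
pi /= pi.sum()                       # renormalize the truncated weights

theta = rng.normal(0.0, 5.0, size=trunc)   # component means, theta_i ~ H = N(0, 25)
z = rng.choice(trunc, size=n, p=pi)        # z_i | pi ~ Categorical(pi)
x = rng.normal(theta[z], 1.0)              # x_i | z_i ~ N(theta_{z_i}, 1)

n_used = len(np.unique(z))   # components actually occupied by the data
```

Even with 100 available components, only a handful end up occupied, which is exactly the behavior that lets the model infer the number of clusters from data.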

To see this in action, we can demonstrate this model with multivariate Gaussian components on the same dataset to which we fit the finite model. The explicit modeling relations are similar to the finite case, but with a GEM/stick-breaking prior on the mixing weights. Explicitly, the model can be written as follows:

\[\begin{aligned} \pi|\alpha &\sim\text{GEM}(\alpha) \\ z_i|\pi &\sim \text{Categorical}(\pi) \\ (\mu_i,\Lambda_i)|\mu_0,\kappa_0,\nu_0,T_0 &\sim \text{NW}(\mu_0,\kappa_0,\nu_0,T_0) \\ x_i|z_i,\mu_1,\Lambda_1,\ldots,\mu_n,\Lambda_n &\sim \mathcal{N}(\mu_{z_i},\Lambda_{z_i}^{-1})\end{aligned}\] where \(\mathcal{N}\) denotes the multivariate normal distribution and \(\text{NW}\) denotes the normal-Wishart distribution. The model has 5 hyperparameters: \(\alpha\), the parameter of the stick-breaking prior on the mixing weights, and \(\mu_0,\kappa_0,\nu_0,T_0\), the parameters of the normal-Wishart prior on the mean and precision matrix of each component distribution.

In these examples, we fit the models to the (scaled) data using 1000 iterations of a Markov chain Monte Carlo algorithm known as Gibbs sampling.

As \(\alpha\) varies, the number of clusters that the algorithm realizes can be vastly different. We show a few examples with lower and higher \(\alpha\) values respectively to see this difference.

In the above example, we observe that the DP mixture model is able to correctly identify the number of clusters in the data, and provide a very similar partitioning to the finite mixture model with a prespecified number of clusters. When compared to the true species labels, this model achieves an average clustering accuracy of 0.9824, which is an excellent result when considering that the number of components was not specified in the model.

After defining a model, the next consideration is *fitting it* to our data, commonly referred to as *inference* in a Bayesian setting, equivalent to *training* in machine learning. Recall that Bayes’ theorem tells us that the posterior distribution of our parameters \(\theta\) given our data \(x\) is proportional to the product of the prior distribution on the parameter and the likelihood of the data: \[p(\theta|x)\propto p(x|\theta)p(\theta)\,.\] In many cases, especially in the foundational period of Bayesian statistics, inference was only feasible when the prior and likelihood are of the right form such that the posterior takes on a known distribution. Specifically, conjugate priors were used, so that the posterior takes the same form as the prior with updated parameters. The reason the general case was not feasible was the difficulty of computing the normalizing factor \(p(x)\), which is necessary to make the posterior a proper distribution.

However, in 1990, Gelfand and Smith published a landmark paper regarding the usage of sampling methods, or Monte Carlo methods, to draw samples from posterior distributions in a general case.

Another popular approximate inference method is variational inference (VI). VI is simple in essence: we first select a family of distributions for the posterior and choose the member of the family that is ‘closest’ to the true posterior, with ‘closest’ meaning having the minimum Kullback-Leibler divergence to the true posterior. Effectively, this turns finding the posterior into an optimization problem rather than a Monte Carlo problem. Generally speaking, VI is less computationally intensive than sampling, but does not offer the guarantee of asymptotically exact samples from the posterior.

At this point, the reader should have obtained an overview of the Dirichlet process and an intuitive understanding of its theoretical background and its role in mixture modeling. As we have seen, the Dirichlet process is a useful tool in nonparametric Bayesian modeling, and can easily be extended beyond what we have seen here as well. For example, the hierarchical Dirichlet process extends these ideas to share clusters across multiple groups of related data.

We would like to thank Professor Jeremy Teitelbaum (University of Connecticut) for his overwhelming support and assistance throughout the duration of this project.