Monday, November 16, 2009

Tagging Game Simulation


After several months of absence due to huge changes in my environment, I've finally spared some time and enthusiasm to invest in my blog! So be glad, dear reader.

Today, I'll take you on a little and hopefully enjoyable trip into the concept of assigning names to things. This seemingly innocent task lies at the center of several contemporary issues surrounding information in the human context. Of course, our long-lasting friend Probability Theory comes along with us.

For the purpose of the argument, let's consider the following non-iterated game:
Two players are given a common piece of evidence. Evidence here is anything the players can identify with good certainty: a single word, a complete work, a picture, etc. The players must then independently come up with a single word. If the words are the same, they win the round.
We require the game to be one-shot because we don't want the players to figure out a cheating strategy, like always returning the same word (although non-matching words can be kept secret to reduce the feasibility of this strategy). Does this game remind you of something? It is actually a minimalist version of the ESP game by Luis von Ahn, which seems to be a pretty efficient way to get good tags on pictures using human minds. His presentation of the work is a must-see.

Anyway, we're going to focus on another aspect of the game, namely: what is the probability of winning a round? By design, this is also the probability that the descriptions given by any two players of the same evidence coincide, which is in essence the naming problem.

Heuristically, this probability must be pretty low on random evidence. Try something like "sick pigeons fall on people's heads". Candidate one-word descriptions include: umbrella, stroke, disgusting, absurd, crazy. We can then try to guess the most "popular" answer for this evidence and return that one, as it would intuitively increase our chances of winning.

Let's now dive further into a Bayesian probabilistic model of the naming game. Let $n$ be a positive integer representing the underlying vocabulary size, $e$ the random variable representing the evidence (we're not interested in its density function, though), and $x$ the chosen candidate word. For convenience, we'll represent $x$ as an $n$-dimensional vector with a 1 on the coordinate corresponding to the word and 0s everywhere else. The goal is to model the distribution $p(x|e)$, which in our case is a mass function of the form:

$$p(x|e) = \prod_{i=1}^n{\theta_i^{x_i}}$$

where the parameters of the model $\theta_i$ are positive and sum to 1. Notice that each evidence $e$ gets its own set of $\theta$s. Now, what do the $\theta$s look like?
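As a concrete illustration (a toy sketch in Python, with made-up $\theta$ values for a three-word vocabulary): since $x$ is one-hot, the product $\prod_i \theta_i^{x_i}$ simply picks out the $\theta$ of the chosen word.

```python
import numpy as np

# Toy parameters for a hypothetical 3-word vocabulary (made-up values).
theta = np.array([0.5, 0.3, 0.2])

def p_x_given_e(x, theta):
    """Mass function p(x|e) = prod_i theta_i^{x_i} for a one-hot x."""
    return float(np.prod(theta ** x))

# One-hot encoding of the second word: the product collapses to theta[1].
x = np.array([0, 1, 0])
```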

A reasonable guess is that they follow Zipf's law. This particular kind of distribution is actually very common in the context of human language (and language in general as a communication protocol) and deserves its own post for decent coverage. In any case, there is both theoretical and practical evidence to support such a choice.

Under this assumption, and for the simplest possible form of a Zipf-like distribution (which is good enough for this back-of-the-envelope model), we can derive our parameters quite simply as:

$$\theta_i \propto i^{-s}$$

where the words $i$ are ordered by decreasing popularity and $s \geq 1$.

We'll fix $n = 100{,}000$ (reasonable for the English language; see AskOxford). $s$ can be seen as a measure of the difficulty of the problem: a low $s$ means a difficult problem where people don't come up with "hit words", while a high $s$ makes the distribution very sharp around the first few words, meaning that some consensus emerges. As a reference level, $s \approx 1.1$ for the distribution of words in standard English text.
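Building the normalized parameters is a one-liner. Here is a quick sketch (Python with NumPy; `zipf_params` is a hypothetical helper name, not from any library):

```python
import numpy as np

def zipf_params(n, s):
    """Normalized Zipf parameters: theta_i proportional to i^(-s), i = 1..n."""
    ranks = np.arange(1, n + 1)
    weights = ranks ** -float(s)
    return weights / weights.sum()

# n = 100,000 words, s = 1.1 as a reference difficulty level.
theta = zipf_params(100_000, 1.1)
```

The mass concentrates on the first ranks: the head of the distribution carries most of the probability, which is exactly the "hit words" effect.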

The model is simple in its outset, but what about a closed form for $p(x^{(1)}=x^{(2)}|e)$, namely the probability of two independent names being the same? Since the two players' choices are independent, the probability that both pick word $i$ is $\theta_i^2$; summing over all words, this probability is nothing other than:

$$p=\sum_{i=1}^n{\theta_i^2}$$

And there is no further simplification in general. In the limiting configuration $n \rightarrow \infty$ and $s > 1$ (the condition is $s > 1$ rather than $s > 0.5$ because otherwise $p(x|e)$ would not be normalizable), we have $p = \zeta(2s)/\zeta(s)^2$, where $\zeta$ is the elusive Riemann zeta function.
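We can sanity-check the closed form against a direct Monte Carlo simulation of the game, with both players drawing a word independently from the same Zipf distribution. A sketch, using a hypothetical helper implementing $\theta_i \propto i^{-s}$ so the snippet is self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

def zipf_params(n, s):
    """Normalized Zipf parameters: theta_i proportional to i^(-s)."""
    ranks = np.arange(1, n + 1)
    weights = ranks ** -float(s)
    return weights / weights.sum()

def match_probability(theta):
    """Closed form: p = sum_i theta_i^2."""
    return float((theta ** 2).sum())

def simulate_rounds(theta, rounds=100_000):
    """Monte Carlo: both players draw a word independently; count matches."""
    a = rng.choice(len(theta), size=rounds, p=theta)
    b = rng.choice(len(theta), size=rounds, p=theta)
    return float((a == b).mean())

theta = zipf_params(10_000, 1.2)
# The simulated match rate agrees with sum(theta^2) up to sampling noise.
```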

A sensible range for $s$ is $[1, 1.5]$. Usually, $s = 1$ is accepted as a good choice for word frequencies in documents. However, the asymptotic properties of a Zipf distribution with $s = 1$ are unpleasant: because $\sum_{i=1}^n{\frac{1}{i}} \rightarrow \infty$ as $n \rightarrow \infty$, the normalizing constant grows without bound and we would have $p \rightarrow 0$ as the vocabulary size grows...
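This vanishing is easy to observe numerically. For $s = 1$ the normalizer is the harmonic number $H_n \sim \ln n$, so $p \approx (\pi^2/6)/(\ln n)^2$, which decays, albeit very slowly. A sketch (same hypothetical Zipf parameterization as a helper):

```python
import numpy as np

def zipf_params(n, s):
    """Normalized Zipf parameters: theta_i proportional to i^(-s)."""
    ranks = np.arange(1, n + 1)
    weights = ranks ** -float(s)
    return weights / weights.sum()

# For s = 1 the normalizer is the harmonic number H_n ~ ln(n), so
# p = (sum_i 1/i^2) / H_n^2 ~ (pi^2 / 6) / (ln n)^2 -> 0 as n grows.
for n in (10**3, 10**4, 10**5):
    theta = zipf_params(n, 1.0)
    print(n, (theta ** 2).sum())
```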

We see how difficult it is to independently give the same name to the same evidence. The naming problem is tough stuff in general, and there are some nice empirical articles in the field of human-computer interaction studying how to name the functionalities of a piece of software so that people can easily find what they need.