Monday, February 16, 2009

Structure Learning in Markov Random Fields: the wrong approach

Here is an overview of the problem. Let $X = (X_1,\ldots,X_l)$ be a composite random variable made of a set of random variables $(X_i)_{1 \leq i \leq l}$. Each $X_i$ takes its values in an alphabet of finite cardinality $m$. We have a sample of $n$ observed values of $X$. The goal is to estimate the probability distribution of $X$. This data-mining problem is particularly interesting in the context of EDAs (Estimation of Distribution Algorithms).
So we want to approximate $p(X)$ by analyzing the sample. The following identity (the chain rule of conditional probability) always holds:


$p(X=x)=\prod_{1 \leq i \leq l} p(X_i=x_i|(X_k=x_k)_{i < k \leq l})$

Because the sample is finite, computing $p(X)$ or the conditional probabilities above directly by counting occurrences of the events is not feasible. In a nutshell, a Markov Random Field (MRF) is a probability distribution model closely related to Bayesian Networks. Like a BN, an MRF natively provides a graphical view of the dependencies between variables. However, where a BN works by identifying directed dependencies between variables, thus resulting in a directed acyclic graph, an MRF is built around the notion of cliques of variables.

A clique is a subset of $(X_1,\ldots,X_l)$, and each clique $c_i$ has an associated clique function $\psi_i$, called a potential. Cliques are chosen so that $p(X=x)$ factorizes over the $(\psi_i)$. We write:


$p(X=x)=\frac{1}{Z}\prod_i{\psi_i(\pi_i(x))}$

where $Z$ is a (rather problematic) normalization constant, and $\pi_i$ is the projection operator returning the variables in the clique $c_i$.
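As a minimal sketch of this factorization (the cliques and potentials below are hypothetical, chosen only for illustration), an MRF over three binary variables can be evaluated by brute force, which also shows why $Z$ is problematic: it requires summing over the whole domain.

```python
import itertools

# Hypothetical MRF: 3 binary variables, two cliques c_a = {X1, X2} and
# c_b = {X2, X3}, with arbitrary positive potentials (not learned from data).
def psi_a(x1, x2):
    return 2.0 if x1 == x2 else 0.5

def psi_b(x2, x3):
    return 1.5 if x2 != x3 else 1.0

def unnormalized(x):
    x1, x2, x3 = x
    return psi_a(x1, x2) * psi_b(x2, x3)

# Z sums the product of potentials over the whole domain: this exhaustive
# sum over m^l configurations is exactly what makes Z hard in general.
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=3))

def p(x):
    return unnormalized(x) / Z

# Sanity check: p is a probability distribution.
assert abs(sum(p(x) for x in itertools.product([0, 1], repeat=3)) - 1.0) < 1e-12
```

For $l$ variables over an alphabet of size $m$, this brute-force $Z$ costs $O(m^l)$, which is why practical MRF methods avoid computing it directly.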

Of course, choosing the right cliques from the initial sample, and then learning the clique functions is a complex task.

A good choice for the cliques would be to isolate the strongly coupled variables, i.e. the variables whose joint probability differs markedly from the product of their marginal probabilities. This is exactly the idea behind the chi-squared test. We will now briefly evaluate the efficiency of this technique.

Let’s say we want to find the pairs of strongly coupled variables. The following computes a coupling indicator between variables $X_1$ and $X_2$:


$ I_{1,2} = \sum_{x_1,x_2}\frac{(p_1(x_1)p_2(x_2)-p_{1,2}(x_1,x_2))^2}{p_1(x_1)p_2(x_2)} $

where $p_1(x_1)$ denotes $p(X_1=x_1)$, and so on. We approximate these probabilities by counting occurrences in the sample. In this case, the approximation is good only if $m^2 \ll n$.
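A minimal sketch of this estimator (the function name and the toy sample are mine, not from any particular library):

```python
from collections import Counter

def coupling_indicator(sample, i, j):
    """I_{i,j}: chi-squared-style coupling between variables i and j,
    with probabilities estimated by counting (reasonable when m*m << n)."""
    n = len(sample)
    p_i = Counter(x[i] for x in sample)
    p_j = Counter(x[j] for x in sample)
    p_ij = Counter((x[i], x[j]) for x in sample)
    total = 0.0
    for a in p_i:
        for b in p_j:
            expected = (p_i[a] / n) * (p_j[b] / n)  # independence hypothesis
            observed = p_ij[(a, b)] / n             # 0 if never seen together
            total += (expected - observed) ** 2 / expected
    return total

# Toy sample where X1 always copies X2: the coupling is far from zero.
sample = [(0, 0), (1, 1)] * 500
print(coupling_indicator(sample, 0, 1))  # → 1.0
```

On an independent sample (all four value pairs equally frequent) the indicator drops to 0, which is what makes it usable for ranking candidate 2-cliques.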

Evaluating $p_1$ and $p_{1,2}$ this way costs at least $O(n)$ in time ($O(m)$ and $O(m^2)$ in space, respectively). Computing $I_{1,2}$ is of complexity $O(n+m^2)$.

So, computing all the $I_{i,j}$ this way would take $O((n+m^2)l^2)$ time and $O(l^2)$ space.
But notice that in computing the upper triangular part of the $(I_{i,j})$ matrix, we perform the same computations several times. Can we do better?

The trouble is not computing the marginals $p_i$: computing them all takes $O(nl)$ time ($O(ml)$ space). The real issue is the joint probabilities. There is no way to compute them all in linear time as above, because there are still $O(l^2)$ of them, and dynamic programming techniques don't seem to help. So $O(nl^2)$, and we are still left with this ugly $l^2$.
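To make the accounting concrete, here is a sketch (my own naming) that shares a single $O(nl)$ pass for the marginals but still pays $O(n)$ per pair for the joint counts — the $l^2$ bottleneck in code form:

```python
from collections import Counter
from itertools import combinations

def all_pairwise_indicators(sample, l):
    n = len(sample)
    # One O(n*l) pass computes every marginal: this part is cheap.
    marginals = [Counter(x[i] for x in sample) for i in range(l)]
    indicators = {}
    # Each of the O(l^2) pairs still needs its own O(n) joint count.
    for i, j in combinations(range(l), 2):
        joint = Counter((x[i], x[j]) for x in sample)
        total = 0.0
        for a in marginals[i]:
            for b in marginals[j]:
                expected = (marginals[i][a] / n) * (marginals[j][b] / n)
                observed = joint[(a, b)] / n
                total += (expected - observed) ** 2 / expected
        indicators[(i, j)] = total
    return indicators

# Toy sample: X1 copies X2, while X3 is independent of both.
sample = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 250
print(all_pairwise_indicators(sample, 3))  # (0, 1) stands out; others are ~0
```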

Nevertheless, the computation could be parallelized by preparing packets of data. This is an interesting problem that would deserve a separate post.

So, finding 2-cliques in the data is quadratic in the number of variables. What about 3-variable dependencies? The chi-squared test for 3 variables costs $O(n+m^3)$. Thus, computing the 3-dimensional $(I_{i,j,k})$ takes something like $O((n+m^3)l^3)$. More generally, computing the $k$-variable dependencies takes $O((n+m^k)l^k)$, since the number of $k$-subsets $\binom{l}{k}$ is $O(l^k)$.

Let’s say the approximation of $p$ done by counting is correct if $n>10m^k$. Then $k$ must satisfy $k \leq \lfloor \log_m \frac{n}{10} \rfloor$. For $m=4$ (ATGC, the genetic code) and $n=10000$, we get $k \leq 4$.
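Working out this bound numerically (the helper name and the factor-of-10 threshold are just the rule of thumb above):

```python
import math

def max_interaction_order(n, m, factor=10):
    """Largest k with n > factor * m**k, i.e. k <= floor(log_m(n / factor))."""
    return math.floor(math.log(n / factor, m))

# ATGC alphabet (m = 4) with n = 10000 samples: 4^4 = 256 < 1000 < 4^5 = 1024.
print(max_interaction_order(10000, 4))  # → 4
```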

When $n$ is very large (data-mining context) and $m$ very small, the chi-squared method costs $O(nl^k)$, where $k$ is the size of the biggest interaction to be computed. The space complexity is not pretty either! This is one big reason why the problem of finding the cliques, i.e. the structure of MRFs, is currently addressed by heuristic methods, such as greedy hill climbing. Your favorite search engine will probably return some interesting results for the title of this article (careful though, only its first part)!

Wednesday, February 4, 2009

Why this blog?

Vague astonishment and a bit of incomprehension from someone close to me this afternoon:
What's gotten into you, starting a blog all of a sudden? Doesn't it bother you to show who you are to anyone who comes along on the web? Does it answer some kind of need?
No doubt about it, I've really stepped into it with this blog! These perfectly legitimate questions are also my own. They are the ones every blogger has asked since the very beginning. For the web is not the village square, full of people you can always get to know better if the mood takes you. The web is a one-way space in which all sorts of individuals and organizations can harvest what you have sown there, and far more efficiently than most people imagine. Data-mining techniques have existed for decades, and the explosion of readily available data makes for a choice meal. And if you know a little about human nature, it is contradictory to claim you never have anything to hide on the Internet! It is not a matter of shame; it is a requirement of rationality. It is therefore necessary to set limits not to be crossed. As far as I am concerned, blogging is not talking to a confidant, to a friend, or even to a perfect stranger. The anonymity of the web and the anonymity of a stranger are incommensurable. So this blog will never discuss so-called "personal" details of anyone's life, and a fortiori of anyone I know.

Blogging is first of all leaving a trace, for others, of what one finds important, beautiful, necessary, or simply funny. That may be an indirect way of exposing one's life, since these things are constitutive of our very existence. But it is also the reason why there are so many writers and so many books. We confusedly hope to be useful to someone one day, to awaken in that person thoughts similar to our own, to share the experience of the human condition, to borrow Hannah Arendt's terms.

A blog is above all a creation (more or less "artistic"). It is the result of a human production, which in the end has no answer to the question of why. For my part, blogging is a new experience. I shall see where it leads me. Carpe diem!

Monday, February 2, 2009

Evolutionary metaheuristics: what is a suitable fitness function?

Evolutionary metaheuristics such as Genetic Algorithms or, more recently, Estimation of Distribution Algorithms are useful when the optimization problem to solve presents no apparent structure.

In other words, when you have to find an $x_0$ living in a domain $D$ such that

$f(x_0)=\max_{x \in D} f(x)$

and you have either:
  1. A messy domain $D$, for example a huge discrete one
  2. A messy function $f$, for example a non-convex or discontinuous one
then you may reasonably invoke these metaheuristics. In evolutionary metaheuristics, $f$ is called the fitness function, and a population $P \subset D$ is evolved towards the high values of $f$. If all went well, you can finally pick your $x_0$ from $P$.
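To fix ideas, here is a minimal GA sketch (everything in it — the one-max fitness, truncation selection, the parameter values — is my own illustrative choice, not a reference implementation):

```python
import random

def f(x):
    """'One-max' fitness: count the ones. Smooth in Hamming distance,
    hence friendly to a GA."""
    return sum(x)

def crossover(a, b):
    """One-point crossover of two bit strings."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(x, rate=0.01):
    return [bit ^ 1 if random.random() < rate else bit for bit in x]

def evolve(length=30, pop_size=40, generations=60):
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=f)                      # pick x_0 from P

best = evolve()
print(f(best))  # close to 30 if all went well
```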

However, as exciting as evolutionary computing may sound, this tool will perform poorly, and even fail to evolve any population, when applied to ill-designed fitness functions. The existence of such unfriendly functions is guaranteed by the famous no-free-lunch theorem(s). It states that for static optimization tasks such as ours, any algorithm, when benchmarked over all possible functions $f$, performs as well as a random search over $D$. In other words, no algorithm can be better than any other on all possible fitness functions. Again: for each and every optimization algorithm, there is a disastrous class of functions $f$ on which it fails to compute a good solution. This result holds regardless of the type of algorithm. They are all included, even the ones we don't know yet! There is no static optimization algorithm performing well on every $f$, just as there is no hotel serving free lunches every day.

This is a theoretical result: the space where the functions $f$ live is so huge that it may seem we won't even be capable of exhibiting such disastrous functions for evolutionary algorithms. As a matter of fact, we can!

Disclaimer: the following part is solely based on my previous experience with evolutionary algorithms. It doesn't rely on truly scientific evidence, just on my own intuition.

Genetic Algorithms and Estimation of Distribution Algorithms rely implicitly on a minimal local stability of the fitness function.
Let $\delta$ be a distance over $D$. For convenience, we assume that $f(D) \subset \mathbb{R}$. Minimal local stability means that for almost any $x$ in $D$, slightly changing $x$ changes $f(x)$ only moderately. This is formalized (a little more) by:

$\exists\, \varepsilon, C > 0$ with $\varepsilon \sim \min_{x \neq y} \delta(x,y)$ and $C \ll \max f - \min f$, such that for almost any $x, y \in D$:

$\delta(x,y) \leq \varepsilon \Rightarrow \mid f(x) - f(y) \mid < C$
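This condition can be probed empirically. The sketch below (my own diagnostic, with hypothetical names) samples bit strings, moves to a nearest neighbor at Hamming distance 1 (playing the role of the minimal distance), and measures how often the fitness jump stays under the budget $C$:

```python
import random

def local_stability_ratio(f, length, C, trials=1000, seed=None):
    """Fraction of sampled neighbor pairs (x, y) with |f(x) - f(y)| < C."""
    rng = random.Random(seed)
    stable = 0
    for _ in range(trials):
        x = [rng.randint(0, 1) for _ in range(length)]
        y = list(x)
        y[rng.randrange(length)] ^= 1        # nearest neighbor of x
        if abs(f(x) - f(y)) < C:
            stable += 1
    return stable / trials

smooth = lambda x: sum(x)                    # one-max: a one-bit flip moves f by 1
noisy = lambda x: random.random() * len(x)   # "discontinuous everywhere"

print(local_stability_ratio(smooth, 30, C=2))  # → 1.0
print(local_stability_ratio(noisy, 30, C=2))   # well below 1
```

A ratio near 1 suggests the fitness is locally stable enough for a GA or an EDA to make progress; a low ratio is the "discontinuous everywhere" regime discussed below.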

The local discontinuities of the fitness don't matter, because they are quickly absorbed by the population in a GA, or by the distribution model in an EDA. Everywhere else, the behavior of the fitness is somewhat predictable, and we can rely on this stability to carry out the evolution process. In EDAs, local stability plays a major role, because it ultimately legitimates the approximations made while learning the model parameters.

In a GA, when we cross over (notation $\otimes$) $x$ and $y$, we hope that, most of the time, $f(x \otimes y)$ will be comparable to $\max(f(x),f(y))$. In the early stages of evolving $P$, we can say that the algorithm is searching for a stability zone in $D$ where this assumption can be made. The existence of such a stability zone in $D$ is, to my mind, related to the local stability property of $f$. One heuristic argument in favor of this assumption is that a GA will fail when $f$ is discontinuous everywhere, because of the induced randomness: with such a messy fitness function, there is no direction for the algorithm to go.

The moral of the story: always check whether the fitness function has some kind of regularity. If $f$ is locally stable, we can hope that the algorithms will perform honorably. However, this is probably only a necessary condition for good results...