Here is a map of the problem. Let $X = (X_1,\dots,X_l)$ be a composite random variable made of a set of random variables $(X_i)_{1 \leq i \leq l}$. Each $X_i$ takes its values in an alphabet of finite cardinality $m$. We have a sample of $n$ observations of $X$. The goal is to estimate the probability distribution of $X$. This data-mining problem is particularly interesting in the context of EDAs (Estimation of Distribution Algorithms).
So we want to approximate $p(X)$ by analyzing the sample. The following identity always holds by definition (chain rule of conditional probabilities):
$p(X=x)=\prod_{1 \leq i \leq l} p(X_i=x_i|(X_k=x_k)_{i < k \leq l})$
Because of the finite cardinality of the sample, computing $p(X)$ or the conditional probabilities above directly, by simply counting occurrences of the events, is not feasible. In a nutshell, a Markov Random Field (MRF) is a probability distribution model close to the Bayesian Network (BN). Like a BN, an MRF natively provides a graphical view of the variable dependencies. However, where a BN works by identifying implications between variables, resulting in a directed acyclic graph, an MRF focuses on the notion of cliques of variables.
A clique is a subset of $(X_1,…,X_l)$, and each clique $c_i$ has an associated clique function $\psi_i$, called a potential. Cliques are chosen so that $p(X=x)$ factorizes over the $(\psi_i)$. We write:
$p(X=x)=\frac{1}{Z}\prod_i \psi_i(\pi_i(x))$
where $Z$ is a (rather problematic) normalization constant, and $\pi_i$ is the projection operator returning the variables of clique $c_i$.
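For instance (a purely illustrative choice of cliques), with $l=4$, $c_1=\{X_1,X_2\}$ and $c_2=\{X_2,X_3,X_4\}$, the factorization reads $p(x)=\frac{1}{Z}\,\psi_1(x_1,x_2)\,\psi_2(x_2,x_3,x_4)$.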
Of course, choosing the right cliques from the initial sample, and then learning the clique functions is a complex task.
A good choice for the cliques would be to isolate the strongly coupled variables, i.e. the variables whose joint probability is very different from the product of their marginal probabilities. This is exactly the idea of the chi-squared ($\chi^2$) test. We will now briefly evaluate the efficiency of this technique.
Let’s say we want to find the pairs of strongly coupled variables. The following computes the correlation indicator between variables $X_1$ and $X_2$.
$ I_{1,2} = \sum_{x_1,x_2}\frac{(p_1(x_1)p_2(x_2)-p_{1,2}(x_1,x_2))^2}{p_1(x_1)p_2(x_2)} $
where $p_1(x_1)$ denotes $p(X_1=x_1)$, and so on. We approximate these probabilities by counting occurrences in the sample. The approximation obtained this way is good only if $m^2 \ll n$.
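As a concrete illustration, here is a minimal Python sketch of this counting-based estimate, assuming the sample is integer-coded in $\{0,\dots,m-1\}$ (the function name `coupling_indicator` is mine, not standard):

```python
import numpy as np

def coupling_indicator(x1, x2, m):
    """Coupling indicator I between two variables, with all
    probabilities estimated by counting occurrences in the sample.
    x1, x2: integer-coded observations in {0, ..., m-1}, length n."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    n = len(x1)
    # Marginals p_1 and p_2: O(n) time, O(m) space each.
    p1 = np.bincount(x1, minlength=m) / n
    p2 = np.bincount(x2, minlength=m) / n
    # Joint p_{1,2}: O(n) time, O(m^2) space.
    p12 = np.zeros((m, m))
    np.add.at(p12, (x1, x2), 1)
    p12 /= n
    # Sum over the m^2 cells; cells with an empty marginal also have
    # an empty joint count, so they contribute nothing and can be skipped.
    prod = np.outer(p1, p2)
    mask = prod > 0
    return np.sum((prod[mask] - p12[mask]) ** 2 / prod[mask])
```

On simulated data, the indicator stays close to $0$ for independent columns and grows markedly when one column is (a noisy copy of) the other.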
Evaluating $p_1$ and $p_{1,2}$ this way costs at least $O(n)$ time each ($O(m)$ and $O(m^2)$ in space, respectively). Computing $I_{1,2}$ thus has complexity $O(n+m^2)$.
So, computing all the $I_{i,j}$ this way would take $O((n+m^2)l^2)$ time and $O(l^2)$ space.
But notice that in computing the upper triangular part of the $(I_{i,j})$ matrix, we perform the same computations several times. Can we do better?
The trouble is not the computation of the marginals $p_i$: computing them all takes $O(nl)$ time ($O(ml)$ space). The real issue is the joint probabilities. There is no way to compute them all in linear time as above, because there are still $O(l^2)$ of them, and dynamic programming techniques do not seem to be of any help. So $O(nl^2)$ overall, and we are still left with this ugly $l^2$.
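A sketch of this scheme (again assuming integer-coded data; `all_pairwise_indicators` is an illustrative name): marginals are precomputed once in $O(nl)$, and each of the $O(l^2)$ joints still costs $O(n)$.

```python
import numpy as np

def all_pairwise_indicators(sample, m):
    """sample: (n, l) integer array with values in {0, ..., m-1}.
    Returns the upper-triangular matrix of indicators I_{i,j}."""
    n, l = sample.shape
    # All marginals in one pass over the columns: O(nl) time, O(ml) space.
    marginals = [np.bincount(sample[:, i], minlength=m) / n for i in range(l)]
    I = np.zeros((l, l))
    for i in range(l):
        for j in range(i + 1, l):
            # Each joint distribution still costs O(n): O(nl^2) overall.
            p_joint = np.zeros((m, m))
            np.add.at(p_joint, (sample[:, i], sample[:, j]), 1)
            p_joint /= n
            prod = np.outer(marginals[i], marginals[j])
            mask = prod > 0
            I[i, j] = np.sum((prod[mask] - p_joint[mask]) ** 2 / prod[mask])
    return I
```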
Nevertheless, the computation could be parallelized by preparing packets of data. This is an interesting problem in itself, which would deserve a separate post.
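For instance, one way (among many) to form such packets, sketched with Python's standard multiprocessing module; the packet layout and function names are my own assumptions:

```python
from itertools import combinations
from multiprocessing import Pool

import numpy as np

def _indicator_packet(args):
    # Worker: computes I_{i,j} for one packet of (i, j) pairs,
    # reusing coupling_indicator() from the sketch above.
    sample, m, pairs = args
    return [(i, j, coupling_indicator(sample[:, i], sample[:, j], m))
            for (i, j) in pairs]

def parallel_indicators(sample, m, n_workers=4):
    # Deal the l*(l-1)/2 pairs round-robin into one packet per worker.
    pairs = list(combinations(range(sample.shape[1]), 2))
    packets = [pairs[w::n_workers] for w in range(n_workers)]
    with Pool(n_workers) as pool:
        results = pool.map(_indicator_packet,
                           [(sample, m, p) for p in packets])
    return [entry for packet in results for entry in packet]
```

(On platforms using the "spawn" start method, this needs to run under an `if __name__ == "__main__":` guard.)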
So, finding 2-cliques in the data is quadratic in the number of variables. What about 3-variable dependencies? The chi-squared test for 3 variables costs $O(n+m^3)$. Thus, computing the 3-dimensional $(I_{i,j,k})$ takes something like $O((n+m^3)l^3)$. More generally, computing the $k$-variable dependencies would take $O((n+m^k)l^k)$ for $k \leq \frac{l}{2}$.
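The per-subset counting step generalizes directly. A minimal sketch (dictionary-based, so space is bounded by the number of distinct observed tuples, at most $\min(n, m^k)$; `joint_counts` is an illustrative name):

```python
from collections import Counter

import numpy as np

def joint_counts(sample, variables):
    """Counts occurrences of each observed value combination over a
    subset of variables (columns): O(n) time per subset, and at most
    O(min(n, m^k)) space for a subset of size k."""
    return Counter(tuple(row) for row in sample[:, variables])
```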
Let’s say the approximation of $p$ by counting is correct if $n > 10\,m^k$. Then $k$ must satisfy $k \leq \lfloor \log_m \frac{n}{10} \rfloor$. For $m=4$ (ATGC, the genetic code) and $n=10000$, we get $k \leq 4$.
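A two-line check of that bound (the function name is mine):

```python
from math import floor, log

def max_interaction_size(n, m):
    # Largest k with n > 10 * m**k, i.e. k <= floor(log_m(n / 10)).
    return floor(log(n / 10, m))

print(max_interaction_size(10000, 4))  # -> 4
```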
When $n$ is very high (data-mining context) and $m$ very low, the chi-squared method costs $O(nl^k)$, where $k$ is the size of the biggest interaction to be computed. Space complexity is not lovely either! This is one big reason why the problem of finding cliques, i.e. the structure of MRFs, is currently addressed by heuristic methods, such as greedy hill climbing. Your favorite search engine will probably return some interesting results on the title of this article (careful though, only the first part of it)!