Wednesday, October 13, 2010

The Social Network: a flawed voting scheme

As it happens, I saw The Social Network. In my defense, I should mention that this is solely the unfortunate result of my friends' perfidy: a moment of inattention and bim! you're suddenly learning about the romantic life of Mark Zuckerberg - or whatever. Nothing really made me go "wait a second", except at the very beginning of the film, when our protagonist happily codes a voting scheme for ranking Harvard's girls - by "hotness", of course. As the story goes, he calls his best friend over to explain "the chess playing algorithm". I am clueless as to what chess has to do with this, but let's move on. (EDIT: They are probably talking about the Elo rating system, which does indeed make some sense.)

Back at my university, we periodically had the same thing going on over our local network (at the beginning of each academic year - I'll let you guess why). The concept is fairly primitive. You are presented with two girls' pictures and you click on whichever of the two you like best. Then two other girls are randomly paired and presented to you, and so on. In the end, after many many votes, you magically end up with some ranking of the girls. No comment on the social value of this. So, what is really going on behind the scenes?

Let's get more precise. Basically, you have $n$ voters and $m$ choices. Each voter has a complete ordering over $\{1,...,m\}$, meaning that for any two choices $x,y$, one has either $x < y$ or $x > y$. Is complete ordering a realistic assumption? Well, it does not matter, as long as the voting scheme only ever offers you the $<$ or $>$ alternative, so we will restrict ourselves to this case. There are exactly $2^{\frac{m(m-1)}{2}}$ possible distinct complete orderings, some of which are not transitive: you can have $a>b$ and $b>c$ but $a < c$. But is it possible that if I prefer $a$ to $b$ and $b$ to $c$, then I actually prefer $c$ to $a$? The answer is negative if I am being self-consistent. Thus, an opinion is a transitive complete ordering, i.e. a permutation. Therefore, voters' opinions are reasonably accounted for by permutations over the $m$ choices, and there are exactly $m!$ possible distinct opinions.

Thus, if $S_m$ denotes the group of permutations of size $m$, the voting scheme is adequately represented by a function $f$ whose type is:

$$f : S_m^n \rightarrow S_m$$
Functions such as $f$ are called aggregation functions, because their aim is to aggregate voters' information into a single, hopefully fair and representative, global choice. Now, unpleasant things start happening here. Even without knowing how $f$ works, we have some important limitations on what $f$ can achieve at best. Perhaps the simplest thing to think of is Condorcet's voting paradox: there exist configurations of voters' rankings such that $f$ will contradict a majority of voters' preferences.

Take 3 voters with personal preferences $a>b>c$, $b>c>a$ and $c>a>b$. We see that a strict majority of 2 voters prefer $a$ to $b$, $b$ to $c$ and $c$ to $a$. The majority preference relation is thus cyclic, hence not transitive, while $f$ is restricted to delivering a permutation. Any resulting aggregated ranking will therefore contradict at least one of the 3 majority preferences.
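The cycle is easy to check mechanically. Here is a minimal Python sketch (the ballots are the ones from the example above; the function name is mine) that tallies the pairwise majorities:

```python
# The three ballots from the example, each listed from best to worst
ballots = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]

def majority_prefers(x, y):
    """True if a strict majority of ballots ranks x above y."""
    votes = sum(b.index(x) < b.index(y) for b in ballots)
    return votes > len(ballots) / 2

for x, y in [("a", "b"), ("b", "c"), ("c", "a")]:
    print(x, ">", y, ":", majority_prefers(x, y))  # prints True each time
```

Each pairwise contest is won 2 votes to 1, so the three `True` lines together form the cycle $a > b > c > a$ that no permutation can reproduce.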
Arrow's impossibility theorem extends and sharpens this puzzling result: basically, there exists no aggregation function for this problem that is fair, in the sense of satisfying several natural requirements. So, whatever the actual algorithm does, it cannot claim to find the majority ranking of the candidates - because no such thing exists in general. That's fairly poor for something supposed to give the general opinion of voters, right?

But in all honesty, we did not need Arrow's theorem to figure out that the idea does not work so well. The easiest way to implement this kind of thing is to keep track of the number of clicks on each candidate and to rank them accordingly. Cheating is thus extremely easy: just click repeatedly on someone's face. And the more you click, the stronger your voice is. From my perspective, this seems to be the number one reason why, after a short initial enthusiastic phase, such applications invariably die out. And that's for the best :-)

Tuesday, October 12, 2010

A poor man's fractal

Self-similarity is more common than one might think. Below is a succession of four x5 zooms into the middle part of a symmetric 1-dimensional random walk with a million steps. This extensively studied stochastic process is defined by the following simple iteration:

$$X_{t+1}=X_t + B_t$$
where the $B_t$ are independent random variables taking the values +1 and -1 with probability $\frac{1}{2}$ each. Basically, you flip a coin at each time $t$ and go up or down accordingly.
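As a quick aside, the iteration is a three-liner to simulate. Here is a minimal Python sketch (function name and seed are mine; the plotting that produced the figures is left out):

```python
import random

def random_walk(steps, rng):
    """Iterate X_{t+1} = X_t + B_t, with B_t = +1 or -1 equiprobably."""
    x, path = 0, [0]
    for _ in range(steps):
        x += 1 if rng.random() < 0.5 else -1
        path.append(x)
    return path

# a million coin flips, as in the figures
path = random_walk(1_000_000, random.Random(42))
print(min(path), max(path))
```

Zooming is then just a matter of slicing `path` and rescaling the axes.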
Each shaded green region is reproduced in the graph immediately below it. The interesting thing to notice is how similarly the $X$ curve behaves at any zoom level. Of course, there are some artifacts due to the anti-aliasing algorithm - most notably in the variation of the perceived thickness of the curve - but overall, the eye notices the same general unevenness at every level. Put another way, if you ignore the labels on the $x$ and $y$ axes, looking at any such graph of a region of $X$ is not sufficient to determine at what scale you are observing $X$. It could span a thousand time steps as it could span billions of billions! Benoît Mandelbrot gave this example quite a long time ago already, but I thought it would be nice (and easy) to get a better-looking picture than in the early publications.

By the way, if you have ever glanced at charts of stock market prices, you'll probably have noticed that they too look the same at many scales - and across different stocks. There are some elegant and appealing (but far from mainstream) mathematical developments in fractal methods applied to finance - developments rooted in this observation of self-similarity. The big buzzword here is stable distributions: probability distributions that do not necessarily have means or standard deviations, but still possess the wonderful property of being stable under addition, namely the sum of two independent random variables following a stable distribution itself follows a stable distribution. But I am already off-topic.

A last word about the symmetric 1-dimensional walk. No matter where you start ($X_0$), you'll get back to this value with probability 1 at some time in the future. The little surprise is that the wait can be immense: the expectation of the return time is infinite. Your mileage may vary, infinitely in this case. Actually, it is possible to prove that the return time $T$ asymptotically follows a scaling distribution:

$$\mathbb{P}(T \geq t) = O(\frac{1}{\sqrt{t}})$$

$T$ is heavy-tailed: its tail probability decreases very slowly. If stock prices followed symmetric 1-dimensional random walks (they do not appear to), then no matter what happened, you could always get your money back at some possibly infinitely remote time in the future, provided the stock still exists...
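This scaling is easy to observe numerically. The Monte Carlo sketch below (horizon, sample size and names are my own choices) estimates the survival probability $\mathbb{P}(T > t)$ at two horizons a factor of 4 apart; if the tail really decays like $1/\sqrt{t}$, the two estimates should differ by a factor of about 2:

```python
import random

def first_return_time(max_steps, rng):
    """First time a symmetric walk started at 0 hits 0 again,
    or None if it has not returned within max_steps."""
    x = 0
    for t in range(1, max_steps + 1):
        x += rng.choice((-1, 1))
        if x == 0:
            return t
    return None

rng = random.Random(1)
walks = 20_000
times = [first_return_time(1600, rng) for _ in range(walks)]
surv_400 = sum(t is None or t > 400 for t in times) / walks
surv_1600 = sum(t is None for t in times) / walks
# quadrupling t should roughly halve the tail probability
print(surv_400, surv_1600, surv_400 / surv_1600)
```

The walks that have not returned by the horizon are exactly the ones that make the expectation blow up.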

Sunday, August 1, 2010

The socks in the clean-laundry basket

Here is another little computation with a practical-life flavor that fans of the genre should enjoy. The motivation for this one lies in how hard I find it to pair up my socks when they come out of the dryer.

Most of my pairs of socks are plain white or black. They are only distinguishable by a few subtle patterns for the black pairs, and by colored stripes at calf level for the sport pairs. Since I cannot clearly make out the patterns and the colored stripes on socks folded with the other clothes, I have to perform a kind of drawing without replacement on my socks and pair them up as I go. I usually have to draw at least 5 or 6 of them before getting a complete pair, knowing that I have roughly fifteen pairs in my basket.

Intuitively, 6 socks before finding a pair is really a lot. Is there a mathematical explanation for my bad luck? It turns out that there is one, although the final formula is particularly ugly. After computation, the expected number of socks you must draw without replacement from a clean-laundry basket containing $n$ pairs before getting a first pair is:

$$e_n=\sum_{k=0}^{n} (k+1) \frac{k}{2n-k} 2^{k-1} \frac{(2n-k)!}{(2n-1)!} \frac{(n-1)!}{(n-k)!}$$

The numerical application gives $e_{15} \approx 6.9$, which more or less matches experience, modulo my discernment. Note that asymptotically we get a birthday-paradox-like behavior, with $e_n \sim \sqrt{n}$. Along with the formula, here are the two recurrences I used to derive it - for debugging purposes.

$$a_{k+1}=2 \frac{n-k}{2n-k}a_k, a_1=1$$
$$b_{k+1}=\frac{k}{2n-k}a_k$$

where $a_k$ is the probability that there is no pair among $k$ socks drawn without replacement, and $b_k$ is the probability of getting a first pair on the $k$-th draw. May some kind souls volunteer to check the formula and/or find an aesthetically pleasing simplification :)
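For would-be checkers, here is a small Python sketch (function names are mine) that evaluates the expectation exactly from the recurrence on the $a_k$, and confirms it by Monte Carlo:

```python
import random
from fractions import Fraction

def expected_draws(n):
    """Exact expected number of socks drawn (without replacement, from n
    pairs) until a first pair, via E[T] = sum_k P(T > k) and the
    recurrence a_{k+1} = 2(n-k)/(2n-k) * a_k."""
    a = Fraction(1)       # a_1 = 1: a single sock never makes a pair
    total = Fraction(2)   # the a_0 = 1 and a_1 = 1 terms
    for k in range(1, n):
        a *= Fraction(2 * (n - k), 2 * n - k)
        total += a
    return total

def draws_until_pair(n, rng):
    """One experiment: shuffle 2n socks, draw until two of a kind."""
    socks = [i for i in range(n) for _ in range(2)]
    rng.shuffle(socks)
    seen = set()
    for count, s in enumerate(socks, 1):
        if s in seen:
            return count
        seen.add(s)

rng = random.Random(0)
exact = float(expected_draws(15))
trials = 100_000
mc = sum(draws_until_pair(15, rng) for _ in range(trials)) / trials
print(exact, mc)  # both close to 6.9
```

Using `Fraction` keeps the recurrence exact, so any discrepancy with the closed-form formula would point at the formula, not at floating-point noise.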

Thursday, March 4, 2010

Guess my password!

This article is somewhat linked to the tagging game simulation, but takes a more security-oriented point of view. We will show some theoretical results on the security of password-protected resources, such as your favorite email account, your online banking account, or more simply your operating system account. We will consider two problems here, with the second one explaining the logs you observe when you connect an SSH server to the open Internet - or why your credential choices matter a lot.
  • Suppose you are the bad guy and want to hack into a password protected resource by brute force guessing. What is your expected number of tries?
Well, that depends on the probability distribution of the occurrence frequencies of the passwords in use. In an ideal world, all $n$ possible passwords would be equally likely, so that the distribution would be uniform. Thus, your expected number of tries would be $n/2=O(n)$. So far, so good.

However, this uniformity assumption is unrealistic in many cases. A simple example is when people choose the name of a loved one as a password. I know what you are thinking: "hey, everybody knows that you should not use dictionary words or otherwise easily guessable passwords". First of all, not everybody knows, and some people do not even want to hear about it. Second, there are cases where these are exactly the kind of passwords you will get or create. Just think of those account-recovery questions: "what is the name of your mother/your favorite pet?".

So, in most of the cases (and the studies), the probability distribution is highly non-uniform: a few passwords are far more common than all the others. Once again, we will postulate a power law for this distribution: the Zipf distribution, the quantitative staple for modeling human-generated content. And according to this inspiring post on the Security Research Lab of Cambridge blog, human names also follow this power law.

For passwords $i=1,\dots,n$ ordered by decreasing popularity and $s \geq 1$, the associated Zipf distribution is defined by:

$$p_i=\frac{1}{Z}\frac{1}{i^s}$$

where $p_i$ is the probability of occurrence of password $i$ among all the accounts, and $Z$ is the normalization factor such that all the $p_i$ sum to $1$:

$$Z=\sum_{i}\frac{1}{i^s}$$

Suppose you, the bad guy, have a list of those passwords in decreasing popularity order - which is not hard to compile for names. Your strategy is then to try them one by one in that order. Thus, your expected number of tries is now:

$$E=\sum_i i p_i=\frac{1}{Z}\sum_{i}\frac{1}{i^{s-1}}$$

We'll suppose that $s \in ]1;2[$ - otherwise our model is not as interesting (usually, $s$ is taken to be $1.1$). Under these conditions, we obtain:

$$E=\frac{1}{Z(2-s)}n^{2-s}+O(1)$$

So, $E=O(n^{2-s})$ which is not so good compared to $O(n)$. For $s=1.1$, $E=O(n^{0.9})$ which is or is not acceptable, depending on your philosophy of life.
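To see these orders of magnitude concretely, here is a small Python sketch (names and parameter values are my own) that evaluates $E$ by direct summation:

```python
def zipf_expected_tries(n, s):
    """Expected number of guesses against one account whose password is
    Zipf(s)-distributed over n candidates, with the attacker guessing
    in popularity order: E = sum_i i*p_i = (1/Z) * sum_i 1/i^(s-1)."""
    Z = sum(1.0 / i**s for i in range(1, n + 1))
    return sum(1.0 / i**(s - 1) for i in range(1, n + 1)) / Z

# E grows roughly like n^(2-s), well below the uniform case n/2
for n in (10_000, 1_000_000):
    print(n, zipf_expected_tries(n, 1.1))
```

Multiplying $n$ by 100 multiplies $E$ by far less than 100, which is the $O(n^{2-s})$ behavior showing through.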

But wait, this is not how bad guys act on the web: they rarely focus on a single target. Their problem is more something like the following:
  • I know the $k$ most common passwords for a given service. How many accounts will I have to probe before succeeding?
It turns out that the answer is surprisingly low, as we shall now see. The expected number of probed accounts is now:

$$E=\sum_i i(1-\bar p_k)^{i-1} \bar p_k$$

where $\bar p_k=\sum_{i \leq k} p_i$ is the probability of hacking into an account with one of the $k$ most common passwords. Unrolling the computation yields a surprise. It turns out that this expected number is nothing else than:

$$E \sim \frac{1}{\bar p_k}$$

when $n \gg 1$. Now, substituting the actual value of $\bar p_k$, we obtain:

$$E \sim \frac{Z}{\sum_{i \leq k} i^{-s}}$$

Here comes the final result: take $k=3$ (3 attempts on each account), $n=1,000,000$ (the number of English proper names, perhaps?) and $s=1.1$. You'll get $E \sim 6$. That's right: 6 accounts on average.
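This back-of-the-envelope figure is easy to reproduce. The sketch below (my own naming; it normalizes $Z$ by summing only up to $n$, which lands the estimate a bit below 6 - the exact figure depends on how $Z$ is truncated) computes $1/\bar p_k$ directly:

```python
def spray_expected_accounts(n, s, k):
    """Expected number of accounts to probe when trying only the k most
    common passwords on each account, passwords Zipf(s) over n:
    E ~ 1 / p_bar_k = Z / sum_{i<=k} 1/i^s."""
    Z = sum(1.0 / i**s for i in range(1, n + 1))
    p_bar_k = sum(1.0 / i**s for i in range(1, k + 1)) / Z
    return 1.0 / p_bar_k

print(spray_expected_accounts(1_000_000, 1.1, 3))  # a handful of accounts
```

Either way, the answer is a handful of accounts, not thousands.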

Edit: Obviously, 6 is an underestimate, because the distribution of passwords does not exactly follow a power law, especially for the most common passwords. The bare Zipf distribution has a tendency to overestimate the probabilities of the head events. A more realistic model would be a Zipf-Mandelbrot distribution. Do you think people are not so foolish? See how people choose their passwords.

Despite its simplicity, the model gives the basic intuition as to why script kiddies can penetrate a decent number of systems in a short period of time.

Tuesday, March 2, 2010

Notes on the Therac-25 case

The Therac-25 device is now a classic example of lethal failure in a complex system with ill-designed software control over a physical process. The technical causes of the failure are extensively documented today, as engineers learned (somewhat slowly) from their past mistakes. These aspects truly deserve to be known by anyone interested in the safety of critical systems, but another equally decisive factor in the failure was proper human response - or rather the lack of it.

The Therac-25 was a medical linear accelerator used for radiotherapy during the '80s, until the device was recalled for major changes once its unsafety became all too blatant. Basically, the device could operate in two modes. In the first, the accelerated electron beam is shaped and then directly aimed at the patient for skin-level treatment. In the second, the accelerated electrons are collided with a tungsten target, producing X-rays which are shaped and directed at the patient for in-depth treatment. The X-ray mode used the machine's full beam energy of 25 MeV together with a far higher beam current, since most of the beam is absorbed by the target, whereas the electron mode used much weaker beams.

Guess what happened? From 1985 to 1987, not one or two, but at least 6 patients unfortunately acted as the tungsten target for the high-intensity beam, ultimately leading to radiation-induced deaths and permanent incapacities for the survivors. Their disturbing stories are briefly recounted in Nancy Leveson's report on the case for the IEEE Computer journal (1995 update).

Of course, there are technical explanations for this failure. The software (written in assembly language, i.e. the lowest possible level short of raw '0's and '1's) was widely reported to be a shoddy reuse of an earlier version and presented several critical race conditions. In a nutshell, if instructions were given too rapidly to the machine (e.g. switch from electron mode to X-ray mode and fire), the tungsten target would not have enough time to position correctly, while the accelerator would already be firing at full power, as if in X-ray mode.

Interestingly, the previous model, the Therac-20, whose code was reused by the Therac-25, had similar race conditions. However, the big difference is that the Therac-20 had built-in hardware interlocks which prevented the accelerator from firing at full power if the tungsten target was misaligned. Thus, software malfunctions occurred, but merely resulted in lost time (restart the device) and nothing more. On the Therac-25 there were no such mechanisms, so the software alone was expected to ensure total safety - after all, software is pure logic and pure logic never produces wrong results, right?

Moreover, the whole design of the interlocking between software and hardware was highly dysfunctional, as there was no position sensor which could have reported a wrong alignment in the X-ray mode.

It is nonetheless puzzling that 6 accidents were needed before the Therac-25 was recognized as a fundamentally unsafe machine. Actually, in the first of the cases, such recognition never happened at all, neither from the hospital nor from the manufacturer. The very slow learning process appears to be a result of overconfidence in software, overconfidence in product reliability and, last but not least, overconfidence in the manufacturer's experts and practices.

To finish, here is a memorable quote from N. Leveson's report:
Virtually all complex software can be made to behave in an unexpected fashion under some conditions: there will always be another software bug. [...] We cannot eliminate all software errors, but we can often protect against their worst effects, and we can recognize their likelihood in our decision making.
And don't take for granted that reused code is safer: it all depends on how you are using it.

Following are two free links about security for today's software developer: