This article is somewhat linked to the tagging game simulation, but will take a more security-oriented point of view. We will show some theoretical results on the security of password protected resources, such as your favorite email account, your online banking account, or more simply your operating system's account. We will consider two problems here, with the last one explaining the logs you observe when you connect a ssh server to the open Internet - or why your credential choices matter a lot.
- Suppose you are the bad guy and want to hack into a password protected resource by brute force guessing. What is your expected number of tries?
Well, that depends on the probability distribution of the apparition frequencies of the passwords in use. In an ideal world, all $n$ passwords would be unique so that the distribution would be uniform. Thus, your expected number of tries would be $n/2=O(n)$. So far, so good.
However, this uniformity assumption is unrealistic in many cases. A simple example is when people chose the name of their beloved ones as a password. I know what you are thinking: "hey, everybody knows that you should not use dictionary words or otherwise easily guessable passwords". First of all, everybody don't know, and some people do not even want to hear about it. Second, there are cases when these are exactly the kind of passwords you will get or create. Just think of those account retrieval questions: "what is the name of your mother/your favorite pet?".
So, in most of the cases (and the studies), the probability distribution is highly non-uniform: a few passwords are far more common than all the others. Once again, we will postulate a power law for this distribution: the Zipf distribution, which is the quantitative custard tart human generated content. And according to this inspiring post on the Security Research Lab of Cambridge blog human names also follow this power law.
For $i=1..n$ passwords ordered by decreasing popularity and $s \geq 1$, the associated Zipf distribution is defined by:
$$p_i=\frac{1}{Z}\frac{1}{i^s}$$
where $p_i$ is the probability of occurrence of password $i$ among all the accounts, $Z$ is the normalization factor such as all $p_i$ sum to $1$:
$$Z=\sum_{i}\frac{1}{i^s}$$
Suppose you, the bad guy, have a list of those passwords in decreasing popularity order - which is not hard to compile for names. Then, your strategy would be to try these one by one in that order. Thus, your expected number of tries is now:
$$E=\sum_i i p_i=\frac{1}{Z}\sum_{i}\frac{1}{i^{s-1}}$$
We'll suppose that $s \in ]1;2[$ - otherwise our model is not as interesting (usually, $s$ is taken to be $1.1$). Under these conditions, we obtain:
$$E=\frac{1}{Z(2-s)}n^{2-s}+O(1)$$
So, $E=O(n^{2-s})$ which is not so good compared to $O(n)$. For $s=1.1$, $E=O(n^{0.9})$ which is or is not acceptable, depending on your philosophy of life.
But wait, this is not how bad guys act on the web: they rarely focus on a single target. Their problem is more something like the following:
- I know the $k$ most common passwords for a given service. How many accounts will I have to probe before succeeding?
It turns out that the answer is surprisingly low as we shall now see. The expected number of probed accounts is now:
$$E=\sum_i i(1-\bar p_k)^{i-1} \bar p_k$$
where $\bar p_k=\sum_{i \leq k} p_i$ is the probability of hacking into an account with one of the $k$ most common passwords. Unrolling the computation yields a surprise. It turns out that this expected number is nothing else than:
$$E \sim \frac{1}{\bar p_k}$$
when $n>>1$. Now substituting with the actual value of $\bar p_k$, we obtain:
$$E \sim \frac{Z}{\sum_{i \leq k}{i^s}}$$.
Here comes the final result: take $k=3$ (3 tentatives on each account), $n=1,000,000$ (number of English proper names?) and $s=1.1$. You'll get $E \sim 6$. That's right, 6 accounts on average.
Edit: Obviously, 6 is an underestimate because the distribution of passwords do not exactly follow a power law, especially for the most common passwords. The bare Zipf distribution has a tendency to overestimate the probabilities of the head events. A more realistic model would be a Zipf-Mandelbrot distribution. Do you think people are not so foolish? See how people choose their passwords.
Despite its simplicity, the model gives the basic intuition why script kiddies can penetrate a decent amount of systems in a small period of time.
Despite its simplicity, the model gives the basic intuition why script kiddies can penetrate a decent amount of systems in a small period of time.