When we use data for product development, we do it with a greater objective: to make decision-making less subjective, perhaps even exact.

The problem is that, without mastering the vast body of knowledge that underpins this much-desired scientific rigor, we are bound to fall into traps we did not even know existed. Here’s another one of those.

The Law of Large Numbers

The Law of Large Numbers (LLN) is a mathematical theorem (to see why it is called a law and not a theorem, look up the Strong Law of Large Numbers). It states that the average of the results obtained over a large number of experiments should approach the theoretical expected value, also called the mathematical expectation or simply the expected value.

This means that these two values get closer as the number of experiments increases.

LLN describes how the average of a sequence of results behaves as the number of experiments (or trials) tends to infinity. A related curiosity is the Infinite Monkey Theorem.

No relation to Design but it is worth researching to find out what mathematicians do when they get bored.

LLN guarantees that — when certain conditions are met — the experimental result will converge to the theoretical probabilistic result.

A practical example

The definitions can sometimes be a little too hard to grasp. Let’s see how the law behaves in an everyday example. I simulated ten independent rolls (trials) of a fair die.

Figure 1. A random roll of ten dice using Random.org.

The average of the ten values rolled was 5. We know that the arithmetic mean of the possible values of a die is (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5. Why are these values different?

For a fair die (one whose faces each have a 1/6 chance of landing face up), the average of the rolls should be close to 3.5. This is the value we call the expectation: the theoretical value expected for the experiment, as given by probability theory.

But the mathematical framework assumes an infinite number of independent experiments, which is not something we can replicate in the real world.

This means that the observed experimental value will almost always differ somewhat from the expected value. How much the two differ depends on the number of trials.
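To make the difference between the two numbers concrete, here is a minimal Python sketch. The function names, the seed, and the use of Python's `random` module are my own assumptions for illustration; the original experiment used Random.org.

```python
import random
from fractions import Fraction

FACES = [1, 2, 3, 4, 5, 6]  # faces of a fair six-sided die


def expected_value(faces):
    """Theoretical mean: every face weighted by the same probability 1/len(faces)."""
    return sum(Fraction(face, len(faces)) for face in faces)


def sample_mean(n_rolls, rng):
    """Observed mean of n_rolls simulated rolls."""
    rolls = [rng.choice(FACES) for _ in range(n_rolls)]
    return sum(rolls) / n_rolls


if __name__ == "__main__":
    rng = random.Random(42)  # arbitrary seed, only so the run is repeatable
    print("expected value:", float(expected_value(FACES)))
    print("observed mean of 10 rolls:", sample_mean(10, rng))
```

Change the seed a few times and the observed mean of ten rolls will jump around, while the expected value stays put.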

Figure 2. Visualization of the convergence of the observed average to the theoretical average.

Figure 2 is a simulation of a thousand rolls showing how the observed average evolves as the number of trials increases. Initially, the experimental value is quite different from the expected value, as in our example above, where the average was 5.

But, as we can see, this value gets closer to the theoretical value as the trials accumulate. If we could throw dice forever, the two values would coincide in the limit.

And that’s exactly what LLN says: the numbers will be different at first but, as the experiments accumulate, the observed average, which started at 5, will slowly converge until it gets very, very close to 3.5.

With millions of experiments, we could consider that the value, for practical purposes, is exactly 3.5.
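The convergence described above can be reproduced with a few lines of Python. This is only a sketch under my own assumptions (the `running_averages` helper, the seed, and the checkpoints are illustrative, and it is not the code behind Figure 2).

```python
import random


def running_averages(n_rolls, seed=None):
    """Return the average observed after each roll of a simulated fair die."""
    rng = random.Random(seed)
    total = 0
    averages = []
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)  # one roll; randint includes both endpoints
        averages.append(total / i)  # average of the first i rolls
    return averages


if __name__ == "__main__":
    averages = running_averages(100_000, seed=1)  # arbitrary seed
    for n in (10, 100, 1_000, 10_000, 100_000):
        print(f"after {n:>7} rolls: {averages[n - 1]:.3f}")
```

The first checkpoints can land far from 3.5; the later ones should sit very close to it, which is the behavior LLN describes.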

The Law of Small Numbers

To assume that the Law of Large Numbers is also valid for small samples is the statistical analysis bias that Daniel Kahneman and Amos Tversky called the Law of Small Numbers.

They showed that this inability to judge statistical events accurately is common to almost all of us, even people trained in mathematics or psychology. The full paper is “Belief in the Law of Small Numbers” (Tversky and Kahneman, 1971).

Basically, the Law of Small Numbers says that we will treat experiments carried out with very different sample sizes in the same way: the findings of a usability test with 5 users will be treated as if they were from an experiment with 5,000 users.

Going back to the dice example, this bias would make us believe that the theoretical value is 5 (or close to it) because that is what we observed in the experiment, ignoring the negligible number of trials. In practice, it is overconfidence in what has been found.

Figure 3. Visualization of the convergence of the observed average to the theoretical average.

Figure 3 shows the same situation as the previous figure: rolls of a fair die. Unlike the previous one, in this test (generated randomly by an algorithm) the observed average never actually reached the value 3, which is already a good indication of how much sample variance can affect the experimental average observed in small samples.

If the researcher stopped the study after only 10 trials (solid red line), she would conclude that the trend of the data (dashed red line) is decreasing and that the average is close to 3.25. That would be an error: with only a few more trials, she would see the value start to rise.

Believing that the theoretical average is close to 3.25 and that the data show a downward trend as more tests are carried out is what we do daily when we test with 5 users and say “the average time of execution for this task is 79 seconds”.

If that reasoning seems incongruous in the case of a simple die, imagine the horrors of trying to apply it to something complex like human beings.
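One way to get a feel for how often a small study misleads us is to simulate many studies of the same size and count how many land far from 3.5. The sketch below does that; the function name, the 0.5 tolerance, and the number of simulated studies are arbitrary choices of mine, not values from the paper.

```python
import random

EXPECTED = 3.5  # theoretical mean of a fair six-sided die
FACES = [1, 2, 3, 4, 5, 6]


def share_of_misleading_studies(rolls_per_study, n_studies=1000,
                                tolerance=0.5, seed=None):
    """Fraction of simulated studies whose average misses 3.5 by more than `tolerance`."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(n_studies):
        rolls = rng.choices(FACES, k=rolls_per_study)  # one whole study
        average = sum(rolls) / rolls_per_study
        if abs(average - EXPECTED) > tolerance:
            misses += 1
    return misses / n_studies


if __name__ == "__main__":
    for size in (10, 100, 1000):
        share = share_of_misleading_studies(size, seed=7)  # arbitrary seed
        print(f"{size:>4} rolls per study: {share:.1%} of studies off by more than 0.5")
```

With 10 rolls per study, a sizeable share of the studies should miss by more than half a point; with 1,000 rolls, almost none should.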

Impacts on the designer’s workflow

Biases in statistical analysis and a lack of probabilistic intuition are widespread behaviors that have been studied for some time. Most of them, however, like the Monte Carlo Fallacy, do not usually harm the modern designer (unless they are also a fan of Blackjack).

Unfortunately, the same cannot be said of the Law of Small Numbers.

Below are some behaviors identified by Kahneman and Tversky, applied to the UX research context.

The behaviors are stronger when applied to usability tests and the metrics taken from them, but we also find this bias in small quantitative tests. See if you recognize any of these situations.

1 — Bet on hypothesis validation

The designer believes that the numbers extracted from their usability tests have some statistical value, overestimating the power of the tests.

They bet on validating research hypotheses (which are sometimes not even falsifiable in light of the proposed experiment) based on insignificantly small samples, without realizing that the odds against the experiment’s validity are extremely high.

2 — Excessively relying on initial patterns

The designer unduly relies on initial trends in the data from the first tests, just like the researcher who stopped the study after 10 trials in Figure 3. In addition, they trust the stability of the observed pattern, overestimating the significance of the findings.

In other words, they see the experimental average decreasing over the first tests and therefore believe this behavior will persist however far the study is extended.

3 — Believe too much in the replicability of studies

The Law of Small Numbers gives people high confidence in the replicability of their results, which amounts to underestimating the width of confidence intervals.

It is as if we were sure that repeating the same usability test would give exactly the same results, which is unlikely. The two dice experiments showed us just that: experiments with small samples are likely to differ widely from one another.
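To see how wide a confidence interval gets with a handful of participants, here is a rough sketch that computes a 95% interval for a mean task time. Everything in it is an assumption for illustration: the five task times are made up (chosen so they average 79 seconds), the helper name is mine, and the t-based interval assumes the times are roughly normally distributed, which real task times often are not.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_confidence_interval(samples):
    """95% t-interval for the mean of a small sample (2 to 10 observations)."""
    n = len(samples)
    mean = statistics.mean(samples)
    # Standard error of the mean, using the sample standard deviation.
    standard_error = statistics.stdev(samples) / math.sqrt(n)
    margin = T_CRIT_95[n - 1] * standard_error
    return mean - margin, mean + margin


if __name__ == "__main__":
    task_times = [52, 61, 79, 95, 108]  # hypothetical times in seconds, 5 users
    low, high = mean_confidence_interval(task_times)
    print(f"mean: {statistics.mean(task_times)} s")
    print(f"95% interval: {low:.1f} s to {high:.1f} s")
```

With numbers like these, the interval spans dozens of seconds, so a repeat of the same test could plausibly report a very different average.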

4 — Ignoring the Sample Variance

The designer whose life is governed by the Law of Small Numbers rarely — if ever — attributes deviations in the results obtained to sample variance, whose impact is less diluted in small samples. He always finds a causal explanation (or at least tries to) for the observed discrepancies.

It’s like we’ve said before: if experiments are so different for something as simple as a die, imagine how little control we have over experiments that involve hundreds of variables.

And how do we mitigate this effect?

We don’t always have the luxury of doing dozens of usability tests or doing quantitative research with thousands of users. In such cases, the best thing to do is to be aware of all the biases that might be involved and try to avoid falling into common pitfalls.

The following proposals are but suggestions. I do not claim to bring the final answer to this complicated problem, just a take on how to prevent this bias from destroying your research.

1 — Find complementary information via alternative forms of research

Fill in the blanks left by usability tests and interviews with alternative methods. Discover new ways to collect information about your user.

Be a good scientist: use the data to try to destroy your hypotheses — and not prove them right.

Observe behavioral metrics, listen to feedback through all points of contact between your company and the customer. Be creative and collect different categories of data from multiple sources.

2 — Improve the sampling design and increase the variability

The quality of your outcome is proportional to the quality of your sampling. That is, good research begins with the sampling design.

The real world is quite diverse and your sample needs to reflect that. Testing only with specific income, age, or behavior groups will invariably introduce biases. Convenience sampling should not be the norm. If you cannot avoid it, at least be aware of it and point it out in the final report.

And of course, study sampling theory to learn more about the biases behind different sampling designs.

3 — Repeat tests

When testing a new feature or flow, take the opportunity to re-test that old design that is now live. New research will help to consolidate the numbers obtained previously.

Whenever possible, redo experiments. That way you increase the body of evidence for (or against) your solutions.

4 — Do not make inferences from usability tests

We should not stop doing usability tests just because the numbers we sometimes try to pull out of them are not statistically significant — no, not at all.

They are great for finding usability flaws or errors in the product concept. We just can’t misuse statistical tools to pretend that what we do allows us to safely infer the behavior of a population.

Usability tests are not meant for inferring the behavior of a population, at least not with scientific confidence. Doing so only creates resistance in the team to changing what needs to be changed, because you are now attached to half-truths.

That 28-second average time to complete a task, or that 75% success rate in task number 2 of your 4-user test, has no statistical value and cannot be used to talk about the rest of your users, at least not confidently. If you need this level of sophistication, there are better tools and methods for it, and they do work.
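As an illustration of how little a 75% success rate from 4 users pins down, the sketch below computes a Wilson score interval for 3 successes out of 4. The Wilson interval is my choice of method (one common option for small samples), and the function name and the z value of 1.96 for roughly 95% confidence are assumptions for this example.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% confidence."""
    p = successes / n
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denominator
    half_width = (z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))) / denominator
    return center - half_width, center + half_width


if __name__ == "__main__":
    low, high = wilson_interval(successes=3, n=4)  # 75% success in a 4-user test
    print(f"observed: 75%, plausible range: {low:.0%} to {high:.0%}")
```

For 3 out of 4, the range runs from roughly 30% to 95%, which is another way of saying that the test tells us very little about the success rate of the whole user base.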

Final considerations

Perhaps the bias of the Law of Small Numbers does not have a unique and infallible solution, but it seems worthwhile (and I would dare say interesting) to know about it. It is as Kahneman and Tversky say in their article:

Even if the bias cannot be unlearned, we can still learn to recognize its existence and take the necessary precautions.

We need to be aware of the challenges that Design faces when it collides with other disciplines. The advantage, however, is that these disciplines usually already have a vast body of knowledge we can take advantage of.

We must always remember to study a lot and not get stuck in the bubble of what we are already familiar with. Thus, we make the discipline of Design even stronger (and more fun).