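P value is the probability of seeing results at least as extreme as the ones we observed if the change actually had no effect. A small P value means the results would be surprising under random chance alone; it is not the probability that the results are real. P-Hacking is a term used to describe manipulating the data or the analysis until it yields the desired P value. All of us do this with our experiments, consciously or not.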

Source: Atoz Markets

To prevent P-Hacking you need to understand 3 concepts - Multiple comparison, Power analysis and Confidence intervals.

Multiple comparison → Limit the number of cohorts, metrics and segments

The more comparisons you make, the more likely you are to see at least one false positive. In other words, you can end up making business decisions based on a difference that is just noise. In the context of experiments, comparisons can be cohorts, metrics, or dimensions of a metric.
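To get a feel for how fast this adds up, here is a minimal sketch. It assumes every comparison is independent and tested at the usual 5% significance level, which is a simplification (real metrics are often correlated). The function name is mine, purely for illustration.

```python
def family_wise_error_rate(num_comparisons: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent comparisons,
    assuming none of the comparisons has a real effect."""
    return 1 - (1 - alpha) ** num_comparisons


for m in (1, 5, 10, 20):
    rate = family_wise_error_rate(m)
    print(f"{m:>2} comparisons -> {rate:.0%} chance of at least one false positive")
```

Under these assumptions, 10 comparisons already give you roughly a 40% chance of a spurious "win" somewhere.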

PMs (myself included) get this wrong when we slap dozens of metrics on an experiment in the hope of finding some positive result, or under the guise of protecting other parts of the business.

You can avoid that by deciding on the comparisons before running the experiment and limiting them. As a general rule of thumb, I use at most 5 metrics and 2 or 3 treatment cohorts per experiment.

If you absolutely need to use more metrics, you should “correct” for the problem.

One way to do so is Bonferroni correction (in practice the correction is best left to Data Scientists).
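The idea behind Bonferroni is simple enough to sketch: divide the significance level by the number of comparisons and hold every metric to that stricter threshold. The metric names and P values below are made up for illustration.

```python
def bonferroni_significant(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Flag which comparisons stay significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)  # stricter per-comparison threshold
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical P values from one experiment with four metrics.
observed = {
    "conversion": 0.004,
    "revenue_per_user": 0.030,
    "session_length": 0.049,
    "retention": 0.200,
}

# With 4 metrics the threshold drops from 0.05 to 0.0125.
print(bonferroni_significant(observed))
```

Three of these metrics would pass an uncorrected 0.05 cutoff, but only `conversion` survives the correction.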

However, the correction makes the significance threshold stricter, which reduces the power, so the experiment needs a larger sample size.

You should factor that in while conducting the power analysis.

Power analysis → Compute the optimal stopping point

Power is the probability of detecting an effect when one truly exists. Power analysis is used to estimate the minimum sample size (number of users per cohort) the experiment needs to reach the desired power.

PMs get this wrong when, to save time, we stop the experiment before it has enough samples because it is trending positive.

This is one of the biggest reasons that experiments fail. As seen in the figure, T’ is the optimal stopping point based on the power analysis. Running the experiment until that point leads you to the conclusion that the change has a negative impact on the metric. However, at T, the change incorrectly seems to have a positive impact.
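You can convince yourself of this with a small simulation. The sketch below runs A/A experiments (both cohorts have the same true conversion rate, so any "significant" result is a false positive) and compares two habits: checking once at the planned end, versus peeking at regular intervals and calling it as soon as the result looks significant. The conversion rate, sample sizes, peek schedule and the simple two-proportion z-test are all assumptions I picked for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_test_p_value(conv_a: int, conv_b: int, n: int) -> float:
    """Two-sided P value for a difference in conversion rate, n users per cohort."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return 1.0  # no variation yet, nothing to test
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = ((conv_b - conv_a) / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(num_experiments=500, users_per_cohort=1000, peek_every=100,
             rate=0.10, alpha=0.05, seed=7):
    """Return (false positive rate with peeking, false positive rate at the end only)."""
    rng = random.Random(seed)
    called_early = 0
    significant_at_end = 0
    for _ in range(num_experiments):
        conv_a = conv_b = 0
        crossed = False
        for n in range(1, users_per_cohort + 1):
            conv_a += rng.random() < rate  # True counts as 1 conversion
            conv_b += rng.random() < rate
            if n % peek_every == 0 and z_test_p_value(conv_a, conv_b, n) < alpha:
                crossed = True  # a peeking PM would have stopped here
        called_early += crossed
        significant_at_end += z_test_p_value(conv_a, conv_b, users_per_cohort) < alpha
    return called_early / num_experiments, significant_at_end / num_experiments


peeking_rate, final_rate = simulate()
print(f"False positives when peeking: {peeking_rate:.1%}")
print(f"False positives at the planned end: {final_rate:.1%}")
```

The end-only rate should land near the 5% you signed up for, while the peeking rate is typically several times higher, even though nothing about the product changed.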

The power analysis depends on the decision metric, minimum detectable effect (expected lift), and the desired power level:

80% power is the conventional target: an 80% chance of detecting the effect if it really exists.
Minimum detectable effect is inversely related to the required sample size. For example, a much larger sample size is needed to detect a 1% change than a 10% change; as a rough guide, halving the effect you want to detect quadruples the sample you need. You can shorten the experiment duration by either using a different metric or by increasing the expected lift.

There are several online calculators that you can use to conduct a power analysis (Power and Sample Size | Free Online Calculators, Sample Size Calculator and Power/Sample Size Calculator).
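If you are curious what such calculators do under the hood, here is a rough sketch for a conversion-rate metric. It uses the standard normal-approximation formula for comparing two proportions with a two-sided test; the function name, the 10% baseline and the default 5% significance level are my assumptions, and a real calculator may use a slightly different formula.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_cohort(baseline_rate: float, relative_lift: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per cohort to detect a relative lift in a rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical 10% baseline conversion rate.
for lift in (0.10, 0.05, 0.01):
    n = sample_size_per_cohort(0.10, lift)
    print(f"{lift:.0%} lift -> about {n:,} users per cohort")

# A Bonferroni correction for 5 metrics means a stricter alpha and a bigger sample.
print(sample_size_per_cohort(0.10, 0.10, alpha=0.05 / 5))
```

Playing with `relative_lift` shows why chasing a 1% lift takes so much longer than a 10% one, and passing a corrected `alpha` shows the extra cost of adding more metrics.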

Confidence intervals (CI) → Better identify P-Hacking

While this section doesn’t directly prevent P-Hacking it will help you understand and identify it better.

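CI is the range of values for the metric that are plausible given the data. Usually we use the 95% CI. Strictly speaking, 95% describes the method: if we repeated the experiment many times, about 95% of the intervals built this way would contain the true value. In practice, a CI of -10 to +10 tells you that the true effect could plausibly be anywhere in that range, including negative or zero.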

PMs get this wrong when we only look at the point estimate of the metric and not the entire interval.

In my experience the majority of people make decisions by only looking at the midpoint of the CI, not the entire range. This creates a false sense of security. Practical advice for interpreting the CI:

The narrower the CI, the more precise the estimate and the higher the confidence in the results.
The further the whole confidence interval sits from 0, the stronger the evidence of a real effect.
If the CI around the decision metric includes 0, then we can’t say that the results are significant.
Most of the time, the CI stabilizes once the experiment has reached adequate power.
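To make this concrete, here is a minimal sketch that computes a 95% CI for the difference in conversion rate between a control and a treatment cohort, using the normal approximation. The function name and the numbers are hypothetical, and your experimentation platform may compute intervals differently.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a: int, n_a: int, conv_b: int, n_b: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """CI for (treatment rate - control rate) using the normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical results: 10,000 users per cohort.
low, high = diff_confidence_interval(1000, 10_000, 1060, 10_000)
print(f"Lift of 0.6 points, CI: [{low:+.2%}, {high:+.2%}]")   # interval includes 0

low2, high2 = diff_confidence_interval(1000, 10_000, 1150, 10_000)
print(f"Lift of 1.5 points, CI: [{low2:+.2%}, {high2:+.2%}]")  # interval excludes 0
```

Both treatments "beat" control on the point estimate, but only the second one has an interval that stays clear of 0.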

I am a Marketplace Product Manager at Yelp. I lead product for "Request-A-Quote" (messaging based Home & Local services product).

Previously, I built the Yelp experimentation program from the ground up