Fairness in Decision-Making for Criminal Justice
Do machine learning algorithms ensure fairness in the criminal justice system, or do they perpetuate existing inequalities?
Risk assessments are an integral part of the criminal justice system. They inform judges about potential hazards and help identify who may be at risk of reoffending.
For example, a risk assessment that is used to inform a judge on sentencing decisions should be able to predict whether or not a defendant is going to commit a new crime during or after their probation. Data about other defendants is a key input to forming this risk assessment.
In the criminal justice system, there is increasing support for using algorithmic models (machine learning) in deriving risk assessments to aid judges in the decision-making process — these models learn from information about past and current defendants. Advocates argue that machine learning may lead to more efficient decisions and decrease the bias that is inherent in human judgement.
Critics argue that such models perpetuate inequalities found in historical data and therefore harm historically marginalized groups of people.
Although fairness is not a purely technical problem, we can still leverage basic statistical frameworks for evaluating fairness, and compelling phenomena still arise as a result.
In this study, we will explore a risk assessment algorithm called COMPAS, created by Northpointe (now equivant). COMPAS examines a defendant’s criminal record and other personal information to assess how likely they are to recidivate in the next two years. You can read more about the investigation carried out by ProPublica on this issue, which drew attention to the ethical implications of leveraging machine learning in decision-making.
We will look at the COMPAS risk scores between Caucasians and African Americans. A COMPAS risk score of 1 indicates ‘low risk’ while a risk score of 10 indicates ‘high risk.’ In addition, we will follow ProPublica’s analysis and filter out cases where the COMPAS screening happened more than 30 days before or after the arrest.
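Concretely, this filter can be sketched in pandas. The column names below (`days_b_screening_arrest`, `decile_score`, `two_year_recid`) come from ProPublica’s published dataset; the rows here are made up for illustration, whereas the real analysis would read the full CSV.

```python
import pandas as pd

# Toy rows mimicking ProPublica's compas-scores-two-years.csv; the column
# names are taken from that public dataset, but the values are made up.
df = pd.DataFrame({
    "race": ["African-American", "Caucasian", "Caucasian", "African-American"],
    "decile_score": [7, 2, 3, 9],
    "days_b_screening_arrest": [0, -5, 45, 10],
    "two_year_recid": [1, 0, 0, 1],
})

# ProPublica's filter: keep only cases where the COMPAS screening happened
# within 30 days of the arrest, in either direction.
filtered = df[df["days_b_screening_arrest"].abs() <= 30]
print(len(filtered))  # → 3 (the row 45 days out is dropped)
```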
First, let’s visualize the number of defendants per decile score:
With 3,175 African American defendants and 2,103 Caucasian defendants in the sample, we can see that for the Caucasian group, the distribution appears skewed towards lower risk decile scores.
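The counts behind a visualization like this reduce to a single groupby; a minimal sketch on made-up rows with the same assumed column names:

```python
import pandas as pd

# Toy data standing in for the filtered COMPAS sample.
df = pd.DataFrame({
    "race": ["African-American"] * 4 + ["Caucasian"] * 4,
    "decile_score": [1, 5, 8, 8, 1, 1, 2, 6],
})

# Number of defendants per decile score, split by race: the table
# behind a grouped bar chart of the two score distributions.
counts = df.groupby(["race", "decile_score"]).size().unstack(fill_value=0)
print(counts)
```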
Next, we look at the number of defendants per violent decile score. A rating of 1 indicates ‘low risk’ of being violent while a rating of 10 indicates ‘high risk’ of being violent:
We also see that for the Caucasian group, the distribution is skewed towards lower risk violent decile scores. For both visualizations, we cannot attribute this difference to race alone. There could be confounders, such as gender, age, and other attributes the COMPAS score examines, that affect these risk scores. For the rest of this article, we will look at three common statistical criteria used for answering the question ‘Is this algorithm fair?’
- Equalizing positive rates (the rate at which we predict a defendant will recidivate is the same for Caucasian and African American defendants).
- Equalizing error rates (the proportion of times we misclassify a defendant who actually recidivated is the same for both Caucasians and African Americans, and the proportion of times we misclassify a defendant who did not actually recidivate is the same for both groups).
- Calibration (among all defendants that get a risk score r, on average an r proportion of them should actually be classified as positive — aka, likely to recidivate)
According to the criteria above, the three conditions are, respectively:
- P(δ(X) = 1 | A = Caucasian) = P(δ(X) = 1 | A = African American), where δ is our decision rule and X is our data
- P(δ(X) = 1 | Y = 0, A = Caucasian) = P(δ(X) = 1 | Y = 0, A = African American) and P(δ(X) = 0 | Y = 1, A = Caucasian) = P(δ(X) = 0 | Y = 1, A = African American), where Y indicates whether the defendant actually recidivated
- P(Y = 1 | R = r, A = Caucasian) = P(Y = 1 | R = r, A = African American) = r, where R is the risk score
Although we do not have the true data of whether or not the defendant recidivated at prediction time, we can observe what happens when we use COMPAS risk scores to create a classifier to predict whether an individual will recidivate.
We start by observing the classifier’s outcomes with the decision threshold placed at each decile score. Ignore the rates for deciles 1 and 10, as it is trivial to achieve equality at those thresholds (can you see why?).
For positive rates, we find:
From this visualization, we can clearly see that the classifier does not satisfy equalizing positive rates for all thresholds. African Americans are more likely to be classified as ‘high risk’ than Caucasians for all decision thresholds.
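The positive rate at each candidate threshold can be computed as follows; a sketch on made-up scores, with column names assumed from the ProPublica data:

```python
import pandas as pd

# Toy COMPAS-style decile scores for the two groups (made-up values).
df = pd.DataFrame({
    "race": ["African-American"] * 5 + ["Caucasian"] * 5,
    "decile_score": [2, 5, 7, 9, 10, 1, 2, 3, 5, 8],
})

def positive_rate(group, threshold):
    """Fraction of a group classified 'high risk' (score >= threshold)."""
    scores = df.loc[df["race"] == group, "decile_score"]
    return (scores >= threshold).mean()

for t in range(2, 10):  # deciles 1 and 10 are trivial, so we skip them
    print(t, positive_rate("African-American", t), positive_rate("Caucasian", t))
```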
We can still achieve equalized positive rates, however. The next criterion describes a threshold-selection method that can be adapted to equalize positive rates.
Does enforcing equal positive rates solve all issues of fairness in this situation? We can come up with decision rules that are undeniably unfair, but still satisfy the criterion of equal positive rates.
In this scenario, equalizing positive rates would not adequately address fairness, because what matters in the criminal justice system is not merely how many defendants are labeled ‘high risk.’ For example, we could classify everyone as ‘high risk,’ which equalizes positive rates but is indisputably unfair.
For error rates, we find:
From these visualizations, we can clearly see that the classifier does not satisfy equalizing error rates for all thresholds. In particular, the first graph shows that African Americans who did not recidivate in the next two years were more likely to be misclassified as ‘high risk’. The second graph shows that Caucasians who recidivated within the next two years were more likely to be mistakenly labeled as ‘low risk’.
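The per-group error rates behind these graphs can be sketched like this (made-up outcomes; `two_year_recid` is the ProPublica label for recidivism within two years):

```python
import pandas as pd

# Toy data with made-up true outcomes and COMPAS deciles.
df = pd.DataFrame({
    "race": ["African-American"] * 6 + ["Caucasian"] * 6,
    "decile_score":   [2, 4, 6, 7, 9, 10, 1, 2, 3, 5, 7, 9],
    "two_year_recid": [0, 1, 0, 1, 1, 1,  0, 0, 1, 0, 1, 1],
})

def error_rates(group, threshold):
    """(false positive rate, false negative rate) for one group."""
    g = df[df["race"] == group]
    pred = g["decile_score"] >= threshold           # 'high risk' prediction
    fpr = pred[g["two_year_recid"] == 0].mean()     # non-recidivists flagged high risk
    fnr = (~pred)[g["two_year_recid"] == 1].mean()  # recidivists labeled low risk
    return fpr, fnr

print(error_rates("African-American", 5))  # → (0.5, 0.25)
print(error_rates("Caucasian", 5))
```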
But we can still equalize error rates by choosing two thresholds, one per group, at which the error rates are equal. A common way to find such thresholds is with a ROC curve: we look for the intersection of the two curves shown below.
We can equalize error rates by selecting two thresholds, one for each group, such that the true positive rate and false positive rate are equal. This is true because the false negative rate is just (1 − true positive rate).
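As a sketch of that idea on made-up data, we can brute-force the pair of per-group thresholds whose ROC points are closest; at coincident points the false positive rates match, and so do the false negative rates:

```python
# Toy (decile score, recidivated?) pairs for two groups; made-up values.
group_a = [(1, 0), (3, 0), (5, 1), (6, 0), (8, 1), (10, 1)]
group_b = [(1, 0), (2, 0), (4, 1), (7, 1), (9, 0), (10, 1)]

def roc_point(data, t):
    """(false positive rate, true positive rate) when 'high risk' means score >= t."""
    tp = sum(1 for s, y in data if s >= t and y == 1)
    fp = sum(1 for s, y in data if s >= t and y == 0)
    pos = sum(y for _, y in data)
    neg = len(data) - pos
    return fp / neg, tp / pos

# Brute-force the pair of per-group thresholds with the closest ROC points,
# skipping the trivial deciles 1 and 10.
best = min(
    ((ta, tb) for ta in range(2, 10) for tb in range(2, 10)),
    key=lambda p: sum(
        (a - b) ** 2 for a, b in zip(roc_point(group_a, p[0]), roc_point(group_b, p[1]))
    ),
)
print(best, roc_point(group_a, best[0]), roc_point(group_b, best[1]))
```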
Although equalizing error rates would ensure that both groups would have the same proportion of misclassifications, complex issues still arise.
First, at decision time, judges do not know who is truly a ‘high risk’ or ‘low risk’ defendant. Applying different thresholds to defendants based on race also strikes many people as unfair in itself. Secondly, equalizing the error rates for African Americans and Caucasians requires making the predictions worse for one of the groups.
Rather than worsening the predictions for one of the groups, it would be better to think critically about why the error rates are different between groups and try to address some of the underlying causes.
For calibration, we find:
In order to achieve calibration, one must satisfy the constraint mentioned previously, which translates to “among the defendants who received the same COMPAS score, black and white defendants reoffend at comparable rates.”
The ‘Rate of Positive Outcomes’ is the rate at which, given a COMPAS score, defendants actually recidivate.
We can visualize this criterion by the graph above, and that is precisely what Northpointe argues the COMPAS algorithm achieves. Although the graph above does not look quite calibrated, the deviation we see in some of the deciles may be due to the scarcity of the data in the corresponding group and deciles. For example, the score decile of 10 has 227 defendants for African Americans and 50 defendants for Caucasians.
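The calibration check itself is a one-line groupby over observed outcomes; a sketch on made-up rows, with column names assumed from the ProPublica data:

```python
import pandas as pd

# Toy data; real column names from the ProPublica dataset, made-up values.
df = pd.DataFrame({
    "race": ["African-American"] * 4 + ["Caucasian"] * 4,
    "decile_score":   [3, 3, 8, 8, 3, 3, 8, 8],
    "two_year_recid": [0, 1, 1, 1, 1, 0, 1, 1],
})

# Observed rate of positive outcomes per (race, decile) cell; calibration
# asks that each column be (roughly) equal across the two rows.
calib = df.groupby(["race", "decile_score"])["two_year_recid"].mean().unstack()
print(calib)
```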
Calibration is often natural to consider for fairness because it is an a priori guarantee. The decision-maker sees the score R(X) = r at decision time, and knows based on this score what the frequency of positive outcomes is on average.
Why is any of this important?
It turns out that ProPublica’s analysis of Northpointe’s risk assessment algorithm, COMPAS, found that black defendants were far more likely than white defendants to be misclassified as a higher risk of recidivism and white defendants were more likely to be misclassified as a lower risk of recidivism.
We have shown that this is precisely the disparity that equalizing error rates is meant to rule out, and COMPAS fails to satisfy it. Interestingly, Northpointe claims that the COMPAS algorithm is fair because it is calibrated (although the graph above does not look perfectly calibrated, we are only working with a sample of the data, so we can assume the scoring algorithm is approximately calibrated on the full data).
Two common non-discrimination criteria that machine learning practitioners and scientists work to satisfy when creating classification algorithms are sufficiency and separation. In this study, separation says that the classifier decisions are independent of race conditioned on whether or not recidivism occurred.
This means that for examples where recidivism actually occurred, the probability that the classifier outputs a positive decision (likely to recidivate) should not differ between the races.
This is precisely what the definition of equalizing error rates is, and what ProPublica argues is not satisfied by the COMPAS algorithm, and is therefore unfair. Sufficiency says that whether or not recidivism occurred is independent of race conditioned on the classifier decisions.
This means that for all of the examples where the classifier outputs a positive decision, the probability of recidivism actually having occurred for those examples should not differ between the races.
This is precisely what the definition of calibration is in our case and what Northpointe satisfies in the COMPAS algorithm, which they argue is fair.
So, why not satisfy both criteria? A collection of results known as “incompatibility results” proves that these fairness criteria cannot all hold simultaneously when the groups have different base rates (outside of degenerate cases). This means we cannot satisfy all of them at once: if we calibrate the COMPAS algorithm, then we cannot also equalize error rates.
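This tension can be demonstrated numerically. In the made-up example below, both groups are perfectly calibrated, yet because their base rates differ, a shared threshold produces different error rates:

```python
from fractions import Fraction

# Each entry: (risk score, # who recidivated, # who did not). Both groups
# are perfectly calibrated: 80% of the 0.8 bucket and 20% of the 0.2
# bucket reoffend. Only the base rates differ (0.5 vs 0.35).
group_a = [(0.8, 8, 2), (0.2, 2, 8)]
group_b = [(0.8, 8, 2), (0.2, 6, 24)]

def error_rates(group, threshold=0.5):
    """(false positive rate, false negative rate) when 'high risk' means score >= threshold."""
    fp = sum(neg for s, _, neg in group if s >= threshold)
    fn = sum(pos for s, pos, _ in group if s < threshold)
    negatives = sum(neg for _, _, neg in group)
    positives = sum(pos for _, pos, _ in group)
    return Fraction(fp, negatives), Fraction(fn, positives)

print(error_rates(group_a))  # → (Fraction(1, 5), Fraction(1, 5))
print(error_rates(group_b))  # → (Fraction(1, 13), Fraction(3, 7))
```

Despite identical calibration within each score bucket, the group with the lower base rate ends up with a lower false positive rate and a higher false negative rate.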
In conclusion, statistical fairness criteria on their own cannot be used as a “proof of fairness.” But, they can provide a starting point for thinking about issues of fairness and help surface important normative questions about decision-making.
In this study, we unraveled the trade-offs and tensions between different potential interpretations of fairness in an attempt to find a useful solution.
This study brings to light the ethical implications of delegating power to machine learning and algorithms for guiding impactful decisions, and shows that a purely technical solution to fairness is complex and often inadequate.
In sentencing decisions and predictive policing, it may be best to abandon learned models unless they are trained on data vetted for discrimination (for example, excluding race as an attribute) and evaluated by fairness experts in all relevant domains.