Fairness in Machine Learning

George Ferre
6 min read · Jul 11, 2021

Decisions driven by machine learning are omnipresent in our society. Anything from recommending your next TV show on Netflix to deciding whether a given medicine is safe enough to prescribe is driven by machine learning, and for good reason. Used correctly, data can help us take a more structured and objective approach to decision making. However, it is important that data scientists realize that data-driven decision making does not inherently mean the decisions are fair.

Those last two sentences may seem contradictory. It is tempting to assume that since a model makes decisions in a neutral manner using objective data, any model made in good faith will be fair. That is what Amazon hoped when creating a model to make hiring suggestions based on applicants’ resumes: machine learning would find the best applicants based on merit. Unfortunately, since the tech industry is male dominated, the model learned to favor men when choosing the best applicants. This is not the only instance of machine learning leading to unfair decision making. In fact, a wider study titled Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions by Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour, and Kristian Lum examines this phenomenon.

The Foundation of a Fair Model

The paper starts by outlining societal biases (bias in the sense of fairness, rather than in the strictly statistical sense) that can get in the way of building a fair model. Following that, it outlines how we can measure fairness, the constraints on measuring fairness, the causal assumptions behind the inputs to machine learning models, and finally how to move forward. “Prediction-Based Decisions and Fairness” uses two running scenarios to illustrate its points: a model to predict whether a potential borrower will pay back a loan, and a model to predict whether a person who has been arrested should be detained before trial.

When attempting to build a fair model, biases that are ingrained in society can influence the data. When teaching a model about risk for pre-trial release, a released person committing another offense is recorded as a failure. This is logically consistent with the goal of minimizing unnecessary holding of individuals while also minimizing crime that could have been prevented. But for this to be fair, we would have to assume that policing itself is fair. If minority communities are policed more heavily than non-minority communities, then the model will be taught that people from minority communities are at higher risk of re-offending.
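To make that feedback loop concrete, here is a tiny simulation of my own (not from the paper). Both groups re-offend at exactly the same true rate, but offenses in one group are detected and recorded more often, so the labels a model would actually train on make that group look riskier. The group names, rates, and detection probabilities are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: both groups truly re-offend at the same 20% rate,
# but offenses in group B are detected (and therefore labeled) twice as often.
true_rate = 0.20
detection_rate = {"A": 0.40, "B": 0.80}

for group in ["A", "B"]:
    truly_reoffended = rng.random(n) < true_rate
    recorded = truly_reoffended & (rng.random(n) < detection_rate[group])
    print(group, "observed re-offense rate:", round(recorded.mean(), 3))

# The observed rates land around 0.08 vs 0.16 even though the underlying
# behavior is identical, so a model trained on these labels learns that
# group B is "riskier".
```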

Measuring Fairness

When measuring fairness, we can start with single-threshold fairness. Essentially, if we create a function that produces a score, where being above a threshold leads to one result and being below it leads to another, the threshold needs to be the same across groups. That may be a bit abstract, so let’s use an example. If a loan company creates a function to score an applicant’s ability to pay back a loan, where a score above x points results in a loan and a score below x points results in a denial, there should be a single value of x for everyone (a short code sketch after the list below makes this concrete). This is a good start, but it is not necessarily everything needed to create a fair model. “Prediction-Based Decisions and Fairness” contends that single-threshold fairness rests on three assumptions that are not always true:

  1. Evaluations are made separately. However, in some instances the evaluation of one data point can affect the evaluation of another.
  2. Evaluations are symmetric. In other words, the decisions made by a model will impact groups similarly.
  3. Evaluations are simultaneous. In reality, over time the decisions made by a model can change the behavior of groups.
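To make the single-threshold idea concrete, here is a minimal sketch in Python. The scores, the 620 cutoff, and the group split are all made up for illustration; the point is only that one cutoff is applied to everyone.

```python
import numpy as np

def decide(scores: np.ndarray, threshold: float = 620.0) -> np.ndarray:
    """Approve any applicant whose score clears the single, group-blind cutoff."""
    return scores >= threshold

# Hypothetical applicants from two groups, scored by the same model.
scores_group_a = np.array([580, 640, 700, 615])
scores_group_b = np.array([590, 625, 710, 605])

print(decide(scores_group_a))  # [False  True  True False]
print(decide(scores_group_b))  # [False  True  True False]

# A shared cutoff is applied to everyone, but as the assumptions above suggest,
# that alone says nothing about whether the scores themselves treat the groups
# comparably.
```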

When ensuring a model is fair, it is important to check prediction accuracy across groups. Confusion matrices and checks for AUC parity are powerful tools in this respect.
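As a sketch of what that check might look like in practice (with synthetic data and a made-up 0.5 decision threshold), the snippet below computes false positive rates, false negative rates, and AUC separately for each group using scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(1)

# Synthetic data: a true outcome, a model score, and a group label per applicant.
n = 1_000
group = rng.choice(["A", "B"], size=n)
y_true = rng.integers(0, 2, size=n)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.25, size=n), 0, 1)
y_pred = (y_score >= 0.5).astype(int)

# Compare error rates and ranking quality across groups.
for g in ["A", "B"]:
    mask = group == g
    tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask]).ravel()
    print(f"group {g}: FPR={fp / (fp + tn):.2f}, "
          f"FNR={fn / (fn + tp):.2f}, "
          f"AUC={roc_auc_score(y_true[mask], y_score[mask]):.2f}")
```

Large gaps between groups on any of these numbers are a signal to dig back into the features and labels before trusting the model.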

The Northpointe Debate

A specific instance where machine learning fairness is examined in the article is a debate between ProPublica and Northpointe (now Equivant) over a tool called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions). COMPAS is used to provide a risk score for defendants pre-trial. ProPublica published an article called “Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks.” The article uses both statistical analysis and case studies in which black defendants appeared to be given higher risk scores than white counterparts for similar or lesser offenses. Northpointe responded with a study alleging that the difference in the accuracy of scores between white and black defendants was not statistically significant. “Prediction-Based Decisions and Fairness” does not take the side of ProPublica or Northpointe. Instead, Mitchell et al. point out that without perfect prediction, it is essentially impossible to conclusively call the model fair.
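As I understand the debate, the two sides were also leaning on different fairness metrics: ProPublica emphasized error rates such as the false positive rate, while Northpointe emphasized calibration-style measures such as predictive parity. The confusion-matrix counts below are made up (they are not from either study), but they show how both kinds of claims can be true at once for the same set of predictions.

```python
# Made-up confusion-matrix counts for two groups: (TP, FP, FN, TN).
counts = {
    "group 1": (180, 120, 120, 580),
    "group 2": (360, 240, 140, 260),
}

for group, (tp, fp, fn, tn) in counts.items():
    ppv = tp / (tp + fp)  # predictive parity: of those flagged high risk, how many re-offended?
    fpr = fp / (fp + tn)  # false positive rate: of those who did not re-offend, how many were flagged?
    print(f"{group}: PPV={ppv:.2f}, FPR={fpr:.2f}")

# Both groups have the same PPV (0.60), yet the false positive rate is nearly
# three times higher for group 2. Whether the model looks "fair" depends
# entirely on which definition you pick.
```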

Conclusion and My Thoughts

All of this really only touches on the work of Mitchell et al., and I encourage you to read the paper, as it goes into far more depth. However, I would like to avoid this blog being a SparkNotes version of the paper, so I will discuss the conclusion of the article and give some of my own thoughts. The conclusion stresses that machine learning should not be avoided, and that human intuition can fall into the same pitfalls of reaching unfair conclusions. Essentially, we need to be mindful when selecting features for our models and fully articulate our goals when building them.

In a practical sense, it may be impossible to make a completely fair model. When reading the article, the example of building a model to decide who gets approved or denied for a loan was particularly interesting to me. I used to work for a loan company, and I know how much care goes into making sure that gender and race are not direct factors in whether someone gets a loan. But when it comes to auditing for fairness, there is really no way to tell whether somebody who was denied would actually have paid off the loan with no issues, so gauging accuracy on those applicants is not possible. Practically, we can only make laws ensuring that loan companies give everyone a fair shot at a loan.

On the subject of releasing defendants before their court date based on risk level, I think it is important to remember that such a model is only one piece in a wider puzzle of creating a fair system. A model like this should be built with care and consistently re-evaluated to make sure that no group is being unfairly singled out as more safe or more dangerous. Beyond that, whenever we find a reason a group might be singled out, we need to re-examine our assumptions about why we think a given feature is “better” or “worse” than another. On a wider scale, features that appear to be effective at identifying defendants who fail to appear in court should be clearly communicated. And if such a feature is concentrated in particular groups, it is worth looking further into whether the underlying causes can be alleviated.

In my (admittedly brief) experience building predictive models, I have found that it can be easy to tunnel in on pushing accuracy up. But it is important to stay cognizant of the broader goal. When trying to predict risk for releasing defendants pre-trial, good accuracy should just be the start. For those who were deemed high risk but never had issues again, why were they considered high risk? For those deemed low risk who failed to show up, why didn’t they? You may find that something as simple as a text reminder could have prevented them from missing their court date.


George Ferre

My name is George Ferre. I am currently working to become a data scientist. I hope to share insight into the process as I progress through bootcamp and onward.