Several years ago, the non-profit newsroom ProPublica investigated a machine-learning software program used by courts around the country to predict the likelihood of future criminal behavior and help inform parole and sentencing decisions.
In its investigation, ProPublica found that the program identified Black defendants as potential repeat criminals almost twice as often as white defendants. But when actual recidivism rates were compared with the software’s predictions, there was no such disparity: Black and white defendants re-offended at roughly the same rates. Further, white defendants were more likely than Black defendants to be misidentified as “low risk.”
ProPublica determined that the cause of this bias was the software algorithm itself, and that the data informing the algorithm all but guaranteed a racist outcome.
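ProPublica’s core finding can be framed as a gap in error rates between groups. The rough sketch below, with invented data and column names rather than ProPublica’s actual dataset or method, shows how one might check a risk tool for that kind of gap:

```python
import pandas as pd

# Hypothetical audit data: one row per defendant, with the tool's "high risk" flag,
# whether the person actually re-offended, and a group label. All values are invented.
df = pd.DataFrame({
    "group":      ["Black", "Black", "Black", "Black", "white", "white", "white", "white"],
    "high_risk":  [1, 1, 1, 0, 0, 1, 0, 0],
    "reoffended": [1, 0, 0, 0, 1, 0, 0, 0],
})

# False positive rate per group: how often people who did NOT re-offend were flagged high risk.
did_not_reoffend = df[df["reoffended"] == 0]
fpr_by_group = did_not_reoffend.groupby("group")["high_risk"].mean()
print(fpr_by_group)  # a large gap between groups is the kind of disparity ProPublica reported
```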
Since then, the issue of “bias” and “fairness” in machine learning and artificial intelligence (AI) has crept into debates about how Google and Facebook ads target customers, how credit scores are calculated, the limitations (and legality) of facial-recognition software, and many other areas of our increasingly data-driven lives. As a result, eliminating bias in AI algorithms has also become a serious area of study for the scientists and engineers responsible for developing the next generation of artificial intelligence.
Defining “fairness” in AI
Aileen Nielsen is a data scientist and professor of Law and Economics at ETH Zurich who studies issues of fairness and bias in machine learning and artificial intelligence. In a recent presentation, Fairness Considerations in Applied AI, Nielsen explained why creating an algorithmic form of fairness is so difficult, and outlined steps that data scientists can take to guard against building algorithms that inadvertently discriminate.
“Defining fairness is not as easy as you might think,” Nielsen said, adding that an inability to agree on what is “fair” in different spheres of life has led to a situation where “industry, technology, and the law are all going to be working with different definitions of fairness.” After all, something deemed fair in one context may be unfair, or even illegal, in another.
Consider that for almost a decade, Facebook allowed advertisers to target their ads using so-called “ethnic affinity” labels, which identified one’s affinity for a specific ethnic group (Black, Hispanic, Italian, etc.) without identifying the user’s ethnicity. So in theory “you could be a white woman with an ‘affinity’ for Hispanic culture,” Nielsen said. But in practice, Facebook’s affinity labels did an effective job of identifying a user’s actual ethnicity, which advertisers used to target or avoid certain ethnic populations.
This sort of targeting “isn’t necessarily bad if you are just selling a product,” Nielsen added. “Where it becomes clearly illegal is when you’re advertising something like housing, jobs, or credit,” and are using Facebook’s ad-targeting algorithm to discriminate against specific ethnic populations. After multiple lawsuits, Facebook finally agreed last year to revise its ad-targeting algorithm, “but it was happening in plain sight,” Nielsen noted.

Bias and unfairness can creep into algorithms any number of ways, Nielsen explained, often unintentionally. In all likelihood, “no one at Facebook wanted to facilitate discrimination, but something went wrong down the pipeline,” she said, emphasizing that this is why programmers need to be aware of how bias can seep in, and learn to anticipate how certain choices in the development of algorithms can inadvertently lead to unfair outcomes.
Causes of unfair bias
When data scientists use “proxies,” i.e., substitute one data set for another, they run the risk of skewing the data if the proxy itself is faulty. One example Nielsen pointed to was that of hospitals using “healthcare spending” as a proxy for “race” (in lieu of difficult-to-obtain medical records) without accounting for different healthcare spending patterns, or for the numerous disparities in how Black and white patients are treated once they are admitted to a hospital.
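To see how a faulty proxy can quietly reintroduce a protected attribute, consider a minimal, entirely synthetic sketch (the numbers and the spending proxy below are invented for illustration, not drawn from Nielsen’s example): when the proxy correlates strongly with group membership, a model that never sees the group is still effectively conditioning on it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population: the model is never shown the group label,
# but the proxy feature (say, prior healthcare spending) differs sharply by group.
group = rng.integers(0, 2, size=10_000)                          # hidden protected attribute
spending = rng.normal(np.where(group == 1, 2_000, 5_000), 800)   # proxy the model does see

# The proxy alone recovers the hidden group most of the time...
guessed_group = (spending < 3_500).astype(int)
accuracy = (guessed_group == group).mean()
print(f"group recovered from the proxy alone: {accuracy:.0%}")

# ...so any model trained on this proxy is effectively using group membership,
# even though it is nominally "blind" to it.
```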
Old, bad data
Old or incomplete data sets are another source of potential bias, both in the development and implementation of algorithmic models, she said, noting that some of the data sets that developers are using to build models are more than 40 years old, and that old data can contain hidden biases.
Incomplete data sets can be just as misleading, for example when, as has historically been the case with credit scores, women and minorities are unequally represented. In such cases, Nielsen observed, models based on incorrect or incomplete data can create informational “feedback loops” that strengthen and reinforce even small biases in the data. And in any case, minorities are by definition less visible to machine-learning algorithms, because most of the training data used to “teach” algorithms is skewed in favor of the majority, she added.
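One way to picture such a feedback loop is a toy lending simulation (everything below, including the cautious approval policy, the threshold, and the numbers, is invented for illustration and is not from Nielsen’s talk): both groups repay at the same true rate, but the group with less historical data never gets the chance to generate more.

```python
import numpy as np

rng = np.random.default_rng(1)

true_repay_rate = 0.9                              # identical for both groups
history = {"A": (900, 1_000), "B": (88, 100)}      # (repayments observed, loans observed)
threshold = 0.88                                   # lender lends to a group only if confident enough

for round_ in range(6):
    for g, (repaid, loans) in history.items():
        # A cautious lender discounts small samples: the less data, the bigger the penalty.
        lower_bound = repaid / loans - 0.5 / np.sqrt(loans)
        if lower_bound >= threshold:
            # Approved applicants generate fresh repayment records, so the estimate keeps improving.
            new_loans = 200
            new_repaid = rng.binomial(new_loans, true_repay_rate)
            history[g] = (repaid + new_repaid, loans + new_loans)
        # A rejected group generates no new data, so its noisy estimate never improves.
    print(round_, {g: loans_seen for g, (_, loans_seen) in history.items()})
```

The widening gap in access comes from the initial imbalance in data, not from any difference in behavior, which is the essence of the feedback loops Nielsen describes.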
Labeling
Incorrect or imprecise “labeling” can also cause bias and fairness problems, Nielsen said. Labeling is how data scientists annotate and classify certain properties and characteristics of a data point in order to make it searchable by an algorithm.
Nielsen demonstrated the importance of labeling by doing a Google search for the phrase “unprofessional hairstyles for work,” which returned a collection of images predominantly of women, many of them Black. She then modified the search by adding the word “men,” which again yielded many Black hairstyles, including those of some women and of certain people, like Harvard professor Cornel West, who have impeccable professional credentials.
According to Nielsen, these search results are influenced mainly by how Google has labeled or “tagged” elements in each picture. For better or worse, coders tend to tag the most distinctive or unusual elements in a photo, she said, which dictates how the image is “seen” by the algorithm. Also, the hairstyle search is complicated by the word “unprofessional,” which Nielsen admitted “has some assumptions built into it, so it’s legitimate to ask if the question itself is fair.”
The larger point is that “labels imply cognitive and linguistic biases,” Nielsen said, so it’s important for data scientists to be aware of how choices in the labeling process can introduce unintentional biases into any given data set. Unfairness can also creep in through the back door via models that are technically “blind” to race and ethnicity but end up discriminating anyway, because optimizing for “accuracy” tends to favor the majority.
“Blindness isn’t always a bad thing, but it usually is for our machine-learning models,” Nielsen said. “Blindness does not produce fairness. What we actually need is domain awareness and an awareness that machine learning does not inherently privilege minority groups.”
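A quick illustration of why optimizing a “blind” model for overall accuracy tends to favor the majority (the group sizes and predictions below are invented): when one group supplies 90% of the data, a model can look excellent on aggregate metrics while serving the minority group badly.

```python
import numpy as np

# Invented example: 900 majority-group examples, 100 minority-group examples.
y_major = np.array([1] * 450 + [0] * 450)   # majority labels
y_minor = np.array([1] * 80 + [0] * 20)     # minority labels, mostly positive

# A model tuned to the majority's patterns: perfect there, useless for the minority.
pred_major = y_major.copy()                  # 100% correct on the majority
pred_minor = np.zeros_like(y_minor)          # always predicts 0 for the minority

overall_accuracy = np.concatenate([pred_major == y_major, pred_minor == y_minor]).mean()
minority_accuracy = (pred_minor == y_minor).mean()
print(f"overall: {overall_accuracy:.0%}, minority group: {minority_accuracy:.0%}")  # 92% vs. 20%
```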
Fixing the fairness problem
For data scientists, addressing the problem of unfairness in machine learning and artificial intelligence requires defining certain statistical qualities of “fairness,” then tweaking and testing the algorithm to ensure a fair (or at least fairer) result. Unfortunately, Nielsen explained, the road to fairness in machine learning is littered with obstacles, not all of which can be easily overcome.

“Fairness remains an art,” Nielsen admitted. “Not every method is 100% guaranteed not to introduce additional discrimination, and no method is guaranteed to only reduce your accuracy by a certain amount.”
For example, one can strive for demographic parity in a data set, but individual differences in the population might make a blind equalizing approach unfair. Likewise, group-oriented definitions of fairness might conflict with definitions of fairness at the individual level, and labels used to identify a group might be considered fair or unfair, depending on who is creating or interpreting those labels.
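These conflicts can be made concrete. In the sketch below (toy data, invented numbers), the same set of decisions satisfies one common group-level criterion, equal opportunity, while violating another, demographic parity; closing the parity gap would require approving applicants the model considers unqualified or rejecting ones it considers qualified, which is exactly the group-versus-individual tension described above.

```python
import numpy as np

# Toy decisions for two groups (all numbers invented).
# y_true: actual outcome (e.g., would repay); y_pred: model's positive decision (e.g., approve).
group  = np.array(["A"] * 10 + ["B"] * 10)
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0,   1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0,   1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

def selection_rate(mask):
    """Demographic parity compares the share of positive decisions in each group."""
    return y_pred[mask].mean()

def true_positive_rate(mask):
    """Equal opportunity compares how often truly qualified people get a positive decision."""
    return y_pred[mask & (y_true == 1)].mean()

for name, metric in [("demographic parity (selection rate)", selection_rate),
                     ("equal opportunity (true positive rate)", true_positive_rate)]:
    a, b = metric(group == "A"), metric(group == "B")
    print(f"{name}: A={a:.2f}, B={b:.2f}, gap={abs(a - b):.2f}")
```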
Considerations for scientists
Still, there are all sorts of opportunities to think about fairness in the applied data science and applied AI pipeline, Nielsen insisted. For example, scientists who use proxy data should ask themselves whether it is “likely that the proxy is going to be misunderstood further down the pipeline and misapplied.”
The U.K. government ran into this very problem when, after canceling the country’s annual A-level exams for university admission due to the coronavirus, it tried to use an algorithm to simulate what the results would have been, using student grades as a proxy for A-level scores. The result? About 40% of students saw their scores downgraded from their teachers’ predictions, and students from disadvantaged schools were disproportionately affected. Parents and teachers were understandably furious, and the government’s misguided data experiment was promptly scrapped.
Ensuring data quality is another safeguard against bias, Nielsen said. For example, if a data set is compromised, whether through sampling bias (e.g., police or income data), old data, or obvious discrimination, scientists could consider creating a fresh data set that doesn’t reflect old biases or other questionable attributes, she explained.
Performance modeling and stress testing are also important for identifying algorithmic bias, especially if the algorithm is going to be deployed in public. “A bank might develop a credit algorithm for one region and find no discrimination in that region,” Nielsen offered. “But they might give or sell that algorithm to another bank where that’s not the case, and the algorithm does lead to drastically disparate impacts for different gender or racial groups.”
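A stress test of this kind can be as simple as recomputing group-level approval rates for every population where the model is deployed. The sketch below uses the “four-fifths rule” as a rough benchmark, with invented regions and numbers (neither the rule nor the data come from Nielsen’s talk):

```python
import pandas as pd

# Hypothetical audit of one credit model deployed in two regions (all numbers invented).
audit = pd.DataFrame({
    "region":        ["north"] * 4 + ["south"] * 4,
    "group":         ["X", "X", "Y", "Y"] * 2,
    "approval_rate": [0.42, 0.44, 0.41, 0.40,   0.45, 0.47, 0.28, 0.26],
})

# "Four-fifths rule" heuristic: flag a region if any group's approval rate falls
# below 80% of the most-favored group's rate in that region.
for region, slice_ in audit.groupby("region"):
    rates = slice_.groupby("group")["approval_rate"].mean()
    ratio = rates.min() / rates.max()
    verdict = "looks OK" if ratio >= 0.8 else "possible disparate impact"
    print(f"{region}: min/max approval ratio = {ratio:.2f} ({verdict})")
```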
Larger social concerns
Similar concerns are currently being raised about certain contact-tracing and personal surveillance tools being used in some countries to combat COVID-19. And it’s not just about data privacy.
“I think we’re going to see an interesting social debate, not just with data privacy, but also with automated decision-making tools that might quarantine people, that might give people a risk score, that will have very real impacts on our freedom of movement and our access to healthcare, and things like that,” Nielsen said.
“There’s also a concern that AI surveillance tools developed for contact tracing could be deployed more generally to support other forms of digital repression,” Nielsen cautioned. “But in a way, I think it’s going to highlight fairness issues even more when countries do start adopting these tools.”
Developing trust in AI
In the meantime, working to develop more trustworthy forms of artificial intelligence may soon become a professional obligation for data scientists, especially those developing AI tools for use in the public domain. “Once there are agreed-upon mechanisms for developing trustworthy AI, it’s going to become an ethical, and possibly legal, duty to be thinking about these things,” Nielsen surmised.
At the moment, however, agreement about what is fair in an AI algorithm remains elusive. In data science there are at least 25 different definitions of fairness, Nielsen noted, but “most methods are achieving only one” of those definitions.
The rest will have to wait for a better algorithm.