Bias in datasets for AI+ML: why we also need to pay attention to the other side of the fairness coin in AI+ML

By Bryan Boots

In May 2019, Bryan presented on “Bias in Datasets for AI+ML” at the annual Code for America Summit in Oakland, CA. This article is adapted from that presentation.

You may have heard a lot in recent years about the push to reduce or remove bias in the algorithms used for artificial intelligence (AI) and machine learning (ML). This has become an important topic and an area of active research for computer scientists, business leaders, legal scholars, ethicists, and others in academia, government, and business.

A related and equally important topic – though one less frequently discussed – is how the datasets fed into these AI/ML algorithms can bias the recommendations these systems give, regardless of how fair the algorithm itself is.

First, though, some definitions. When we talk about AI/ML in the world today, we’re not yet at the point – or even very close to it – of having an artificial general intelligence à la Terminator. What most people really mean when they say AI/ML in an everyday business sense is the automation of statistical inference. What does that mean?

Many (though not all) AI/ML systems rely on statistical techniques of varying levels of sophistication, as well as other mathematical tools such as linear algebra and matrix operations; they often, but not always, use massive datasets and immense computing power.
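To make “automation of statistical inference” a little more concrete, here is a minimal sketch of one of the simplest statistical techniques, a least-squares line fit, written as a small program. The ad spend and sales figures are invented purely for illustration, and real systems are usually far more elaborate than this.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: ad spend (in $1,000s) and units sold.
ad_spend = [1, 2, 3, 4, 5]
units_sold = [12, 14, 16, 18, 20]

slope, intercept = fit_line(ad_spend, units_sold)

# The "inference": estimate sales at a spend level we have not tried yet.
predicted_units = slope * 6 + intercept
print(slope, intercept, predicted_units)
```

The program never “understands” advertising; it summarizes the pattern in the data it was given and extends that pattern to a new case.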

Numerous sub-domains and applications of AI/ML exist, including recommendation systems, computer vision, natural language processing (NLP), prediction models, and myriad others. Examples include Amazon’s product recommendation engine; Netflix’s show and movie recommendation engine; Pandora’s music recommendation engine; predictive policing (PredPol); facial recognition (Clearview AI); self-driving cars (Waymo); news outlets writing stories using AI systems; and many more.

One thing all AI/ML systems have in common is that they are based on two core components: an algorithm and a dataset (or datasets). Even deep learning models – which many correctly describe as “black boxes” – are ultimately driven by dataset(s) and algorithm(s), one or both of which can be biased.

At its most basic level, an algorithm is a set of instructions. Anyone can write an algorithm without writing any computer code (of course, implementing these algorithms using code in the fastest and most efficient manner is one of the pursuits of computer scientists).

A simple analogy is a recipe for chocolate chip cookies. To make the perfect chocolate chip cookie, you need 1) the right recipe (set of instructions), and 2) the right ingredients (in the right amounts, and of high quality!). The right recipe – your instruction set – is your algorithm. The right ingredients (in the right amounts, and of high quality) are the datasets that you feed into your algorithm.

Keeping this analogy in mind, it should become clearer why the right “ingredients” – the right dataset(s) – that you feed to your AI/ML algorithm can make a huge difference. In your baking endeavors, if you use sub-par ingredients, you will get unappetizing cookies. In your AI/ML endeavors, if you use the wrong dataset, or a dataset that represents the real world poorly, your algorithm can make recommendations that are wrong or even downright harmful.
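The recipe-and-ingredients point can be sketched in a few lines of code. In this hypothetical example, the “recipe” is a deliberately simple recommender that suggests whichever item appears most often in a purchase log. The recipe never changes; only the ingredients do. The product names and counts are assumptions made up for illustration.

```python
from collections import Counter


def recommend(purchases):
    """The 'recipe': recommend whichever item appears most often in the data."""
    return Counter(purchases).most_common(1)[0][0]


# Hypothetical purchase logs. The first is assumed to reflect all customers;
# the second is assumed to come from a single store that under-records tea sales.
representative_log = ["tea"] * 60 + ["coffee"] * 40
skewed_log = ["tea"] * 10 + ["coffee"] * 40

rec_representative = recommend(representative_log)
rec_skewed = recommend(skewed_log)

print(rec_representative)  # tea
print(rec_skewed)          # coffee
```

The same algorithm gives opposite recommendations, and nothing in the code itself would tell you which dataset was the flawed one.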

One example of potentially harmful recommendations resulting from biased datasets comes from what are known as “criminal risk assessment” systems. These AI/ML-based systems ingest large amounts of historical data about recidivism rates, apply algorithms to these data, and predict whether a specific individual is likely to commit another crime – and therefore whether the court should grant that person parole.

What’s the problem here? One problem is that such a system can only make predictions based on what has happened in the past; if the person in question truly has reformed, that change would not be reflected in these data. Additionally, the data are population-level data. They may in fact be useful and accurate for predicting how a group of convicted offenders is likely to behave as a whole in terms of recidivism, but applying population-level insights to a specific individual is fraught with moral issues, let alone the statistical problems with doing so.
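The gap between a population-level rate and an individual prediction can be illustrated with a toy sketch. Everything here is hypothetical: the group label, the counts, and the scoring rule are assumptions chosen to show the statistical point, and this is not how any particular risk assessment product works.

```python
# Hypothetical historical records: (group label, reoffended?)
history = [("group_x", True)] * 40 + [("group_x", False)] * 60


def group_rate(records, group):
    """Share of past records in `group` where the person reoffended."""
    outcomes = [reoffended for g, reoffended in records if g == group]
    return sum(outcomes) / len(outcomes)


rate = group_rate(history, "group_x")

# Score three new (hypothetical) individuals using only their group membership.
new_people = ["person_1", "person_2", "person_3"]
scores = {person: rate for person in new_people}

print(rate)    # 0.4
print(scores)  # every person receives the same score
```

The 40% figure may describe the group well, yet each individual will either reoffend or not, and in this made-up history 60 of every 100 people with the same score did not. The score says nothing about which of those two outcomes applies to the person standing before the court.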

To take this same thinking and apply it to a realistic business application, let’s say your company sells two products: Product A and Product B. Product B is a brand new product that you launched just one month ago, so you have very little historical data on actual sales figures. Product A has been sold very successfully by your company for 15 years, but it is starting to show its age. Because you’ve been selling it for 15 years, though, you have extensive historical data on its sales – who buys it, why, how often, at what price, how they have responded to promotions, etc.

Now let’s say an unenlightened colleague of yours recommends applying an AI/ML algorithm to your data to identify the best sales prospects for Product B. What are some of the potential problems here? If you use historical data that is overwhelmingly based on sales of Product A, you may receive any number of bad recommendations: that nobody wants to buy the product, that the pricing is all wrong, that you should promote it in the wrong ways, and so on. Moreover, perhaps Product B is really aimed at a market adjacent to the one you’re already serving. If so, recommendations that have you promote Product B to existing customers who have little need for it can be wasteful at best and disastrous at worst.
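Here is a rough sketch of how the Product A history can drown out Product B. The customer segments and record counts are invented for illustration, and the “algorithm” is just a frequency count, but the same effect can appear in more sophisticated models trained on pooled data.

```python
from collections import Counter

# Hypothetical sales records: (product, customer_segment).
# 15 years of Product A sales versus one month of Product B sales.
sales = (
    [("A", "enterprise")] * 900
    + [("A", "small_business")] * 100
    + [("B", "small_business")] * 18
    + [("B", "enterprise")] * 2
)


def top_segment(records, product=None):
    """Most common buyer segment, optionally restricted to one product."""
    segments = [seg for prod, seg in records if product is None or prod == product]
    return Counter(segments).most_common(1)[0][0]


pooled_answer = top_segment(sales)          # uses all history, mostly Product A
product_b_answer = top_segment(sales, "B")  # uses Product B records only

# Fraction of the pooled history that is actually about Product B.
share_b = sum(1 for prod, _ in sales if prod == "B") / len(sales)

print(pooled_answer)     # enterprise
print(product_b_answer)  # small_business
print(round(share_b, 3))
```

With under 2% of the records coming from Product B, the pooled answer simply echoes Product A’s customer base. The Product B-only answer rests on just 20 records, so it is thin evidence too, which is a reminder that a small but relevant dataset and a large but irrelevant one each carry their own risks.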

If your organization is considering implementing, or has already implemented, some form of AI/ML system to improve your business processes, you’re at the forefront of today’s digital revolution. You should, however, be sure you’re asking your vendors (if you purchase outside technology) or your technology team (if you develop these solutions internally) questions not only about the fairness and reliability of the algorithms they have chosen to implement, but also about the “ingredients” – the datasets – that they have used to feed these algorithms.



About the Author

Bryan is Assistant Teaching Professor with UMKC’s Bloch School, and Managing Director for Venture Creation with UMKC’s Regnier Institute for Entrepreneurship & Innovation. You can reach him at

