An Autobiographical Journey Through AI

In 2015, 25-year-old graduate student Joy Buolamwini had an unnerving experience when using face-tracking software for a class project. The software identified the visages of her light-skinned classmates as faces. It even recognized a white beauty mask that she donned as a face. Then Joy looked into the webcam: no detection. The software did not recognize Joy’s features as a face. She wondered: How could one investigate whether systematic misclassification occurs with AI algorithms?

Three years later, Joy and co-author Timnit Gebru published the article “Gender shades: Intersectional accuracy disparities in commercial gender classification.” They ran a dataset of 1,270 faces from three African and three European countries through gender classification software from IBM, Microsoft, and Face++. Each algorithm classified each scanned face as male or female. Overall, the algorithms achieved high accuracy, correctly identifying the gender of 88, 94, and 90 percent of the faces, respectively. But a different picture emerged when Buolamwini and Gebru examined accuracy for subgroups defined by skin tone and gender (Figure 1). Nearly 100 percent of the lighter-skinned male faces were correctly classified as male. For other groups, however, the algorithms made more errors. In particular, two of the three algorithms correctly classified darker-skinned female faces as female only 65 percent of the time; the third algorithm was slightly better at 80 percent, but still far below the accuracy for lighter-skinned men.

Figure 1. Percentage of faces classified as the correct gender. Data from Table 4 of Buolamwini and Gebru (2018).
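As a rough illustration of the kind of disaggregated evaluation behind Figure 1, here is a minimal sketch in Python (not the authors’ code; the column names and the handful of rows are hypothetical) showing that the calculation amounts to computing accuracy overall and then separately by skin tone and gender:

```python
# Sketch of a disaggregated accuracy calculation.
# Hypothetical column names and rows, not the Gender Shades data.
import pandas as pd

faces = pd.DataFrame({
    "skin_tone": ["lighter", "lighter", "lighter", "darker", "darker", "darker"],
    "gender":    ["male",    "female",  "male",    "female", "female", "male"],
    "predicted": ["male",    "female",  "male",    "male",   "female", "male"],
})

faces["correct"] = faces["gender"] == faces["predicted"]

# Overall accuracy can look high even when a subgroup fares poorly.
print("Overall accuracy:", faces["correct"].mean())

# Accuracy broken down by skin tone and gender, the comparison in Figure 1.
print(faces.groupby(["skin_tone", "gender"])["correct"].mean())
```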

Buolamwini and Gebru (2018) provided striking, concrete evidence of how uncritical use of artificial intelligence (AI) algorithms can affect population subgroups differently, and established Joy as an expert on the potential of AI to increase inequities. In the months after the article’s publication, while completing her PhD at MIT, Joy met with tech industry leaders, spoke at the Davos World Economic Forum, and testified before Congress. If algorithms make mistakes on gender more often for darker-skinned females than for other population groups, what other predictions might have inequitable accuracy? Screening resumes for hiring? Approval for a mortgage? Triage in an emergency room? Self-driving cars recognizing objects as pedestrians? A store matching a shopper’s face to a database of known shoplifters?*

Dr. Joy Buolamwini has now written a book about her journey from child of Ghanaian immigrants in Oxford, Mississippi, to leader in algorithmic justice. I read Unmasking AI: My Mission to Protect What Is Human in a World of Machines while traveling to a mini-conference on the use of AI in federal statistics. I had thought this would be a dry, somewhat somnifacient book about the details of AI algorithms. Instead, the reader learns how various AI methods work as the author weaves her discoveries about them into suspenseful episodes from her life. Will she finish her PhD? (Spoiler alert: yes.) How will her testimony be received by Congress? Will her efforts for algorithmic justice influence the trajectory of AI research? You’ll have to read the book to find out.

Along the way, she also describes hidden biases in other technologies. For example, cameras are often thought to be neutral, but for much of the 20th century, film exposure and development were calibrated using a “Shirley card” — an image of a white woman — which caused people with dark skin to be poorly rendered in photographs. The problem was fixed, Buolamwini writes, only after “furniture and chocolate companies complained that the rich browns of their products were not being well represented.”

Buolamwini reminds us that AI is not a new concept. McCarthy et al. (1955) used the term artificial intelligence in a Dartmouth research proposal for the summer of 1956, which was “to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” Part of the proposal considered “how can a computer be programmed to use language.”

Buolamwini does not mention the role of statisticians in the development of AI, but in fact statisticians have been using AI methods for more than 200 years. The branch of AI called machine learning involves predicting an outcome y from a set of inputs. That is precisely what is done by linear regression, which was described in 1805 by French mathematician Adrien-Marie Legendre (Plackett, 1972). Many of the methods used today for machine learning involve much more computation, and allow the form of the model to be determined by the data (instead of specified by the analyst), but the essential feature of predicting y from other variables is the same.
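For readers who like to see the connection in code, here is a minimal sketch (in Python, with made-up numbers) of fitting a line by least squares and using it to predict y at a new x. The fit-then-predict pattern is the same one that today’s machine learning models follow, however elaborate their functional forms.

```python
# Minimal least squares sketch: predict y from x, the task Legendre's
# 1805 method addresses. The numbers are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Least squares estimates of the intercept and slope.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict y at a new x that lies within the range of the data.
x_new = 3.5
print("Predicted y at x =", x_new, ":", beta[0] + beta[1] * x_new)
```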

Machine learning methods are statistical tools, and just like other statistical tools they need to be used properly. In regression classes, statisticians teach that one should make predictions from a linear regression model only for the population sampled and only within the range of the data used to fit the model (in other words, don’t extrapolate beyond your data). These same principles apply to other AI tools. The prediction models for the gender classification algorithms had been developed on databases consisting mostly of white men. It’s not surprising, then, that the algorithms would be less accurate for other groups. When IBM, after meeting with Buolamwini, developed a new gender classification model with more diverse training datasets, the accuracy for darker-skinned females exceeded 95 percent.
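The effect of an unrepresentative training set can be seen in a toy simulation (the numbers below are entirely invented and have nothing to do with the facial analysis systems): fit one prediction model to a training set dominated by group A, then compare its prediction error for group A and for an underrepresented group B whose relationship between input and outcome differs.

```python
# Toy simulation: a model fit to data dominated by one group can be
# accurate overall yet much less accurate for an underrepresented group.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b = 950, 50                      # group B is underrepresented

# The two groups follow different relationships between x and y.
x_a = rng.uniform(0, 10, n_a)
y_a = 2.0 * x_a + rng.normal(0, 1, n_a)
x_b = rng.uniform(0, 10, n_b)
y_b = 0.5 * x_b + rng.normal(0, 1, n_b)

# Fit a single line to the pooled (mostly group A) training data.
x = np.concatenate([x_a, x_b])
y = np.concatenate([y_a, y_b])
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def rmse(xg, yg):
    """Root mean squared prediction error for one group."""
    return np.sqrt(np.mean((yg - (beta[0] + beta[1] * xg)) ** 2))

print("RMSE, group A:", round(rmse(x_a, y_a), 2))
print("RMSE, group B:", round(rmse(x_b, y_b), 2))
```

The pooled fit tracks group A closely and misses group B badly, which is the statistical analogue of training a classifier mostly on lighter-skinned male faces and then deploying it on everyone.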

One theme of the conference I attended on AI for federal statistics was using data that are already “out there” to reduce the costs of producing official statistics. The National Academies of Sciences, Engineering, and Medicine (2023) report on using alternative data sources describes one of the drawbacks of conveniently collected data (as opposed to randomly selected samples or designed experiments): such datasets often reflect the biases of society. Buolamwini (2023, chapter 10) puts this well: “When ground truth is shaped by convenience sampling, grabbing what is most readily available and applying labels in a subjective manner, it represents the standpoint of the makers of the system, not a standalone objective truth.”


Copyright (c) 2024 Sharon L. Lohr

Footnotes and References

*In 2019, the National Institute of Standards and Technology issued a report confirming the gender and race disparities for facial recognition (Grother et al., 2019). The authors evaluated 189 algorithms from 99 developers for two types of matches: “one-to-one,” which confirms whether an input photo matches a specific photo in a database (such as for unlocking a smartphone), and “one-to-many,” which determines whether an input photo matches any photo in a database (which might be used for criminal investigations or deportee detection). Grother et al. (2019) found racial disparities for both types of matching. For one-to-one matching, false positive rates (a false positive occurs when the algorithm declares two faces to belong to the same person even though the faces belong to different persons) were higher for Asian and African American faces, which would make it easier for an imposter to unlock those persons’ phones or impersonate them at a border crossing. False positive rates were also 2 to 5 times higher for women than for men, with differences varying by age, country of origin, and algorithm. And false positive rates were higher for African American females in one-to-many matching with an FBI mugshot database, which is “particularly important because the consequences could include false accusations.”

Buolamwini, J. (2023). Unmasking AI: My Mission to Protect What Is Human in a World of Machines. New York: Random House.

Buolamwini, J. and Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of the 1st Conference on Fairness, Accountability and Transparency, 81, 77–91.

Grother, P., Ngan, M., and Hanaoka, K. (2019). Face Recognition Vendor Test (FRVT). Part 3: Demographic Effects. Washington, DC: National Institute of Standards and Technology.

Lohr, S. (2019). Measuring Crime: Behind the Statistics. Boca Raton, FL: CRC Press. Chapter 10 on “Big Data and Crime Statistics” discusses use of AI models for policing. The models themselves are neutral, but if the training data exhibit bias, then that bias is amplified in the predictions.

McCarthy, J., Minsky, M. L., Rochester, N., and Shannon, C. E. (1955). A proposal for the Dartmouth Summer Research Project on Artificial Intelligence. Reprinted in AI Magazine in 2006.

National Academies of Sciences, Engineering, and Medicine (2023). Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources. Lohr, S. L., Weinberg, D. H., and Marton, K., eds. Washington, DC: National Academies Press.

Plackett, R. L. (1972). Studies in the history of probability and statistics XXIX: The discovery of the method of least squares. Biometrika 59(2), 239–251.

Sharon Lohr