Words used in text-mining research carry bias, study finds

The word lists that researchers package and share to measure bias in online texts often carry words, or “seeds,” with their own baked-in biases and stereotypes, which can skew findings, new Cornell research shows.

For instance, the presence of the seed term “mom” in a text analysis exploring gender in domestic work would skew results female, because the word itself is gendered.

“We need to know what biases are coded in models and datasets. What our paper does is step back and turn a critical lens on the measurement tools themselves,” said Maria Antoniak, a doctoral student and first author of “Bad Seeds: Evaluating Lexical Methods for Bias Measurement,” presented in August at the joint conference of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.

“What we find is there can be biases there as well. Even the tools we use are designed in particular ways,” Antoniak said.

Antoniak co-authored “Bad Seeds” with her adviser, David Mimno, associate professor in the Department of Information Science in the Cornell Ann S. Bowers College of Computing and Information Science.

“The seeds can contain biases and stereotypes on their own, especially if packaged for future researchers to use,” she said. “Some seeds aren’t documented or are found deep in the code. If you just use the tool, you’d never know.”

In the digital humanities, and in the broader field of natural language processing (NLP), scholars bring computing power to bear on written language, mining thousands of digitized volumes and millions of words to find patterns that inform a wide range of inquiry.

It’s through this kind of computational analysis that digital humanities and NLP scholars at Cornell are learning more about gender bias in sports journalism, the impeccable skills of ancient Greek authors at imitating their predecessors, the best approaches to soothing a person reaching out to a crisis text line, and the culture that informed British fiction in the late 19th and early 20th centuries.

In past research, Antoniak mined online birth stories and learned about new parents’ feelings of powerlessness in the delivery room. Most recently, in a paper published at this month’s ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW), she and Cornell co-authors analyzed an online book review community to understand how users refine and redefine literary genres.

This type of text analysis can also be used to measure bias throughout an entire digital library, or corpus, whether that’s all of Wikipedia, say, or the collected works of Shakespeare. To do that, researchers use online lexicons, or banks of words and seed terms. These lexicons are not always vetted: some are crowd-sourced, others are hand-curated by researchers or pulled from prior research.
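A minimal sketch of what such a seed-based measurement can look like, using placeholder embeddings and made-up seed lists rather than the paper’s actual seed sets or method: a concept’s gender association is scored by comparing its seed words’ embedding similarity to a female seed set versus a male seed set, so a gendered word like “mom” hiding in the concept’s seed list pulls the score in one direction on its own.

    import numpy as np

    # Placeholder embeddings for illustration only; in practice these would
    # come from a model (e.g., word2vec) trained on the corpus under study.
    rng = np.random.default_rng(0)
    vocab = ["she", "her", "woman", "mother", "he", "his", "man", "father",
             "cooking", "laundry", "chores", "mom"]
    vectors = {w: rng.normal(size=50) for w in vocab}

    def cosine(u, v):
        # Cosine similarity between two embedding vectors.
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def association(word, seeds_a, seeds_b):
        # Difference in mean similarity of `word` to seed set A vs. seed set B.
        sim_a = np.mean([cosine(vectors[word], vectors[s]) for s in seeds_a])
        sim_b = np.mean([cosine(vectors[word], vectors[s]) for s in seeds_b])
        return sim_a - sim_b

    female_seeds = ["she", "her", "woman", "mother"]
    male_seeds = ["he", "his", "man", "father"]
    # Hypothetical "domestic work" seed list containing the gendered seed "mom".
    domestic_seeds = ["cooking", "laundry", "chores", "mom"]

    # Average association of the domestic-work seeds with female vs. male seeds.
    # With real embeddings, "mom" would pull this toward the female side
    # regardless of how the corpus actually discusses domestic work.
    score = np.mean([association(w, female_seeds, male_seeds) for w in domestic_seeds])
    print(f"domestic-work gender association: {score:+.3f}")

With real embeddings trained on the corpus under study, the same computation is sensitive to exactly which seeds are chosen, which is the risk the paper examines.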

Antoniak’s motivation to investigate lexicons for bias came after seeing wonky results in her own research when using an online lexicon of seed terms. 

“I trusted the words, which came from trusted authors, but when I looked at the lexicon, it wasn’t what I expected,” Antoniak said. “The original researchers may have done a fabulous job in curating their seeds for their datasets. That doesn’t mean you can just pick it up and apply it to the next case.”

As explained in “Bad Seeds,” the seeds used for bias measurement can themselves have cultural and cognitive biases.

“The goal isn’t to undermine findings but to help researchers think through potential risks of seed sets used for bias detection,” she said. “Investigate them and test them for yourself to ensure results are trustworthy.”

As part of her findings, Antoniak recommends that digital humanities and NLP researchers trace the origins of seed sets and features, manually examine and test them, and document all seeds and their rationales.
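One lightweight way to follow that advice, shown here as a hypothetical documentation format rather than one prescribed by the paper, is to keep each seed set next to its source and rationale instead of burying the words deep in analysis code:

    # Hypothetical seed-set registry: record where each list came from and why
    # its words were chosen, so later users can audit and adapt it.
    SEED_SETS = {
        "female": {
            "seeds": ["she", "her", "woman", "mother"],
            "source": "hand-curated by the research team for this corpus",
            "rationale": "gendered pronouns and kinship terms common in the corpus",
        },
        "domestic_work": {
            "seeds": ["cooking", "laundry", "chores"],
            "source": "adapted from prior work; reviewed before reuse",
            "rationale": "activity terms only; gendered words such as 'mom' excluded",
        },
    }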

This research is supported by the National Science Foundation.

Louis DiPietro is a communications specialist for the Cornell Ann S. Bowers College of Computing and Information Science.
