It’s axiomatic that history is written by the winners.
But what if you belonged to a group so marginalized by society that it often doesn’t even make it into the history books?
This is the case for black women, say a group of social science and humanities researchers who are using supercomputers to shine a light into the dark corners of history and find obscure historical records that point to the presence of black women, even when the women were not able to speak for themselves.
I recently sat down with this dynamic group of researchers to talk about their fascinating research and, more broadly, how social scientists are using supercomputers and Big Data to revolutionize humanities research.
Gathered around the table were:
- Ruby Mendenhall, Associate Professor in Sociology, African American Studies, Urban and Regional Planning, and Social Work, University of Illinois, Urbana-Champaign (UIUC)
- Nicole Marie Brown, Visiting Professor in Sociology, Illinois Wesleyan University
- Malaika McKee, Visiting Assistant Professor of African American Studies, UIUC
- Ismini Lourentzou, Doctoral Student, Computer Science, UIUC
- Mark Van Moer, Senior Visualization Programmer, National Center for Supercomputing Applications, UIUC
- Michael Black, Assistant Professor in English, University of Massachusetts, Lowell
In a nutshell, what was this research about?
Mendenhall: Except for such outstanding figures as the poet Phillis Wheatley and black women abolitionists like Maria Stewart, the first woman to speak in public, for more than 300 years, American black women’s voices were for the most part lost to history, because they were in chattel slavery and seen as subhuman. In many parts of the U.S., it was illegal for a slave to read and write or for a white person to teach a slave to read and write. So black women’s first-person voices often do not show up in the historical record. Writing and entering the historical record was an act of power and privilege. When you do see information on black women, it’s usually through the voices of white men.
However, we know that black women were there, in large numbers! I’ve long been interested in the experiences of black women in America—their resilience in the face of incredible oppression. We wanted to shine a light on places in history where black women are invisible to see if we could document their presence and, maybe, recover their voices.
How did you go about it?
Mendenhall: The traditional way of going about research like this is to squirrel yourself away in a dark library and pore over stacks of dusty historical records, searching for references to your research topic. In academia, this is called close reading, and it’s very labor-intensive. To cover the period of time we wanted to cover (from 1746 to 2014) and the amount of data we had in front of us (hundreds of thousands of documents), it would take several researchers several lifetimes. It just wasn’t a practical project to consider.
However, around 2007, Kevin Franklin, the Executive Director of the Institute for Computing in the Humanities, Arts, and Social Sciences (I-CHASS) gave a presentation to University of Illinois faculty in African American Studies about using Big Data in social science research. I was really intrigued about how I could use this technology in my research, so I talked to Kevin over the years about various Big Data projects.
I also talked to Michael Simeone, who was then the Associate Director of I-CHASS, about my research questions and whether I could use Big Data to answer them. I also invited Michael to give lectures about Big Data in my research methods class. As a result of these conversations, we worked together on a fellowship with the Institute for Advanced Computing Applications and Technologies (IACAT), which gave me access to the needed computing resources and to computer science and data visualization experts.
What data did you want to examine?
Mendenhall: We worked with Harriet Green, an English and digital humanities librarian, to identify research sources. She told us about HathiTrust, one of the largest digitized historical collections, which includes many unusual documents such as diaries, essays, and periodicals not found in other collections. Our other source was JSTOR, a digital library containing many academic journals, scholarly books, and primary sources. Our initial searches involved looking at over one million documents for information about black women. Our final corpus of data included 800,000 documents.
What were you looking for?
Brown: That’s an interesting question. When you do close reading, you start at places you know. But we learned that by using high-performance computing (HPC), we were able to uncover not only the presence of black women in traditional texts (the places we knew) but also the places where they were absent. For example, if black children were being discussed by physicians, we knew that black mothers were there in those contexts, and we could find those instances about their lived experiences in nontraditional documents. Computation allowed us to systematically mine large amounts of data and to identify places where black women’s experience and history could be reclaimed.
McKee: One example is in a fascinating book called Medical Apartheid by Harriet Washington, in which the author shares her research on ways that the medical field has used black bodies to advance the field of medicine from colonial times on. Very macabre experiments were done on black women to advance the field of gynecology, for example. So in our data mining, we searched for terms like “uterus,” “vaginas,” “natal reports,” and believed, based on the context, that the material that came back was about black women, even though they weren’t explicitly named.
Mendenhall: It’s important to point out that our computational research was augmented with close readings and intermediate readings (looking at the metadata) to confirm our hypotheses.
McKee: Absolutely. The distant reading—computerized scans by algorithms—has to be married to close reading to make sure that you’re getting what you think you’re getting.
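To make the pairing of distant and close reading concrete, here is a minimal sketch of a term search that pulls short context snippets for a human reader to review. It is not the team's code. The search terms are the ones quoted above, while the function names, the snippet window, and the two sample documents are invented for illustration.

```python
import re

# Search terms quoted in the interview; everything else here is hypothetical.
SEARCH_TERMS = ["uterus", "vaginas", "natal reports"]


def find_term_contexts(doc_id, text, terms, window=40):
    """Return (doc_id, term, snippet) for every whole-word match of a term."""
    hits = []
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            # Keep some surrounding text so a reader can judge the context.
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            hits.append((doc_id, term, text[start:end]))
    return hits


def flag_for_close_reading(corpus, terms):
    """Scan every document and collect the snippets that need human review."""
    queue = []
    for doc_id, text in corpus.items():
        queue.extend(find_term_contexts(doc_id, text, terms))
    return queue


# Invented placeholder documents, not drawn from HathiTrust or JSTOR.
sample_corpus = {
    "doc-001": "The physician's natal reports for the county were filed each spring.",
    "doc-002": "A treatise on crop rotation and soil.",
}

close_reading_queue = flag_for_close_reading(sample_corpus, SEARCH_TERMS)
for doc_id, term, snippet in close_reading_queue:
    print(doc_id, term, snippet)
```

The script only builds the queue. Deciding whether a snippet really concerns black women remains a close-reading judgment, as the researchers stress.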
The computer brawn behind the brains
Mendenhall began her Big Data research three years ago on a cluster at University of Illinois, Urbana-Champaign. When she outgrew that system, she worked with the National Science Foundation’s Extreme Science and Engineering Discovery Environment (XSEDE) and Extended Collaborative Support Services (ECSS) to locate a suitable HPC resource. It recommended the Bridges supercomputer at Pittsburgh Supercomputing Center (PSC). Mendenhall had also previously used the Blacklight system at PSC before it was decommissioned last year.
Bridges is a data-focused machine designed for Big Data applications like Mendenhall’s work, and it aims to introduce new communities to HPC. Bridges is built on Hewlett Packard Enterprise (HPE) large-memory servers, Apollo 2000 nodes, Intel® Xeon® processors E7-8860 v3, and the Intel® Omni-Path Architecture, an element of the Intel® Scalable System Framework. Bridges includes four HPE Superdome X servers with 12 terabytes of shared memory each, 42 HPE ProLiant DL580 servers, and 800 Apollo 2000 nodes.
Intel Omni-Path fabric provides connectivity between the different nodes, giving each node complete visibility into all of the storage nodes along multiple paths. This gives the I/O system high performance and resilience, even when other jobs cause congestion.
Bridges is well suited to text mining applications because of its large memory and core count. MALLET and CTM, the primary software programs Mendenhall’s team used, work best when loading the entire data set into memory. Both applications are multithreaded Java applications, so the more cores/nodes, the better. Bridges’ individual Large Shared Memory and Extreme Shared Memory nodes can support 3TB and 12TB of RAM, respectively.
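The pattern described here, holding the whole corpus in memory and spreading per-document work across many workers, can be sketched in a few lines. This is an illustration of the pattern only, not MALLET or CTM, and the function names and sample documents are invented. Note one caveat about the sketch: in standard CPython, threads show the structure but do not speed up CPU-bound work, so a real run would use a process pool or a multithreaded tool such as the Java programs named above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(text):
    """Count lowercase, whitespace-separated tokens in one document."""
    return Counter(text.lower().split())


def count_corpus(documents, max_workers=4):
    """Count words across an in-memory list of documents with a worker pool."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each worker handles one document; partial counts are merged here.
        for partial in pool.map(count_words, documents):
            total.update(partial)
    return total


# Invented placeholder documents held entirely in memory.
documents = ["the diary of a nurse", "the letters of a teacher", "a diary"]
word_counts = count_corpus(documents)
print(word_counts.most_common(3))
```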
How long did it take the supercomputer to pore through hundreds of thousands of documents?
Mendenhall and Lourentzou: As we mentioned, it would take several lifetimes of close reading for scholars to review and rank 800,000 documents. Using a standard sequential-processing computer, it would have taken about three months. Using a supercomputer, it took 1.5 days.
In academic research, time is money, so speedups of this magnitude are critical in completing projects and arriving at results. Researchers receive grants for specific periods of time, so if you can get results back in days versus months and push your research to the next level, that’s huge.
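The figures quoted above imply a rough speedup factor. The short calculation below works it out, assuming that "about three months" means 90 days. That conversion is an assumption made here, not a number from the researchers.

```python
# Assumption: "about three months" is treated as 90 days.
SEQUENTIAL_DAYS = 90.0
# Figure quoted in the interview.
SUPERCOMPUTER_DAYS = 1.5


def speedup(baseline_days, accelerated_days):
    """Return how many times faster the accelerated run is."""
    return baseline_days / accelerated_days


speedup_factor = speedup(SEQUENTIAL_DAYS, SUPERCOMPUTER_DAYS)
print(f"Approximate speedup: {speedup_factor:.0f}x")  # prints: Approximate speedup: 60x
```

Under that assumption the supercomputer run is roughly 60 times faster, which is the difference between one analysis per grant quarter and several per week.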
Ismini, you provided the computer science know-how, along with Mike Black, to turn these humanities researchers’ questions into algorithms, correct?
Lourentzou: Yes, this project is a real-life application of text mining research, and I was honored to be part of it. I was already working on a project related to topic modeling, so it was a perfect fit for me. I was able to divide this team’s research materials into subtopics based on different time periods and use an algorithm to search for similarities and differences among these subtopics. HPC is a great tool for this.
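As a rough illustration of comparing subcorpora from different time periods, the sketch below measures how similar the vocabulary of two periods is and lists the terms unique to one of them. It is a simplified stand-in, not the topic modeling algorithm the team used: it compares raw word frequencies with cosine similarity. The period labels, function names, and sample texts are all invented.

```python
import math
from collections import Counter


def term_frequencies(documents):
    """Build one word-frequency vector for a list of documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two frequency vectors (0.0 to 1.0)."""
    shared = set(a) & set(b)
    dot = sum(a[term] * b[term] for term in shared)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical subcorpora; the labels stand in for real date ranges.
periods = {
    "period_a": ["church family school"],
    "period_b": ["church family work"],
    "period_c": ["railroad steel factory"],
}
vectors = {name: term_frequencies(docs) for name, docs in periods.items()}

sim_ab = cosine_similarity(vectors["period_a"], vectors["period_b"])
sim_ac = cosine_similarity(vectors["period_a"], vectors["period_c"])
only_in_a = sorted(set(vectors["period_a"]) - set(vectors["period_b"]))

print(f"period_a vs period_b: {sim_ab:.2f}")
print(f"period_a vs period_c: {sim_ac:.2f}")
print("terms only in period_a:", only_in_a)
```

A topic model would group co-occurring words into themes before comparing periods. The frequency vectors here skip that step to keep the example short.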
And you found that the algorithm was everything, correct?
McKee: Absolutely. In fact, through this work, it’s become exceedingly clear to us as social scientists that we need to become docents with computer scientists and digital humanities scholars if we’re going to create meaningful research. You have to ask the right questions and, in computer science as in most other fields, it’s white males who are in charge of creating the questions!
Look at Facebook or Google or any of the other search and social media sites we use every day—even major media outlets. Who is writing the algorithms that determine what we see? By and large, white males.
If we as black female social scientists don’t understand Big Data’s effect on everything happening in our society, we’re at a disadvantage. We need to understand that algorithms often yield research results that reinforce social and racial inequality, because they’re being written without the perspectives of people of color. So we as black women need to understand the algorithms, write the algorithms, if we want to uncover what needs to be uncovered.
You sound fired up!
McKee: We are! We are very excited about the potential of applying Big Data and HPC to social science research and harnessing them to answer new questions.
Such as?
McKee: Ruby and I are working on a racial micro-aggressions project to help universities have deeper conversations about campus diversity. Many American universities are diverse but filled with racial tension. We are creating a phone app that will let members of a campus community enter instances of racial micro-aggression in real time, in a social media context. We will use Big Data to analyze this input and provide feedback about the campus racial climate, if you will, in real time. Our goal is to give university administrators real data with which to understand what’s going on in classrooms, dorms, and hallways, to make campus environments more welcoming.
What else?
Mendenhall: We’re applying Big Data to uncover the voices of black women who worked with W.E.B. Du Bois, the American sociologist, historian, and civil rights activist. We are doing more research into the role of black women in American medicine, which Malaika mentioned earlier. And we’d love to apply these research techniques to the Black Lives Matter movement and the many interactions between black citizens and police that are playing out across America. The possibilities of uncovering hidden stories and hidden voices are endless.
To learn more about Mendenhall’s research, visit https://www.nsf.gov/discoveries/disc_summ.jsp?cntn_id=137797 .