Fact or Fiction?
Vetting text with natural language processing algorithms
While teaching my instrumentation and measurement courses, I am often met with surprise at the statement, “Scientists are the members of our civilization charged with discovering, documenting and disseminating truth.” I then spend more than a few minutes justifying this assertion. To support this statement, think about the consequences facing other professions that don’t deal in facts. Exaggerations in advertising result in increased sales, poetic license is used to embellish solid stories into dramatic screenplays, and political mud-slinging is viewed simply as playing hardball. In contrast, the fabrication of scientific research findings has ended individual careers and brought down entire organizations. If you need to know the truth of the matter, ask a scientist.
The U.S. Department of Homeland Security (DHS) is aware of this fact and has awarded a multimillion-dollar research grant to a cadre of computer science departments from the University of Pittsburgh, Cornell University and the University of Utah to develop automated algorithms capable of discerning fact from opinion in written text. The group is led by Professor Janyce Wiebe, Director of Pitt’s Intelligent Systems Program, and benefits from the talents of Professor Claire Cardie of Cornell and Professor Ellen Riloff of Utah, all of whom are experts in Natural Language Processing (NLP). Even though reading and writing represent two-thirds of the foundational topics of education, computers currently display aptitude for only the final third, arithmetic. NLP is a field of algorithmic intelligence that strives to imbue computer systems with a conversational interface. While data mining is successful at finding relationships between price and sales figures, the data must be gathered and properly formatted by human operators capable of gleaning information from written reports. An NLP interface would speed this process and enable analysis of data appearing in the worldwide social database of news reports, blogs and discussion groups we call the Internet.
A simple translation of text into a digital representation of meaning is itself a difficult process; however, for data mining results to have any integrity, the original data must be true in the first place. Suffering from the axiom of “garbage in = garbage out,” NLP systems must possess the ability to discern between fact and opinion. If one were to read early sixteenth-century texts concerning the nature of our universe, a simple poll would reveal a majority opinion that the earth is at the center. Heliocentric theories would be regarded as fringe heresy, even if they were accompanied by supporting facts. The modern concern of DHS is one of committing resources to an imagined threat or dismissing a real threat as false.
The DHS grant calls for the development of accurate (read: truthful) and robust techniques for extracting, summarizing, and tracking information about global events and beliefs from free text. As with all scientific instruments, this process is enabled by proper calibration. Before an analytical balance is used to make measurements, it must be calibrated against the accepted standard of one gram. NLP instruments must likewise be trained to recognize domain-specific patterns and relationships that distinguish asserted facts from subjective beliefs. This involves traditional classification methods that have been trained to recognize statements as assertions when accompanied by words like “said” and “according to,” and as subjective opinions when modified by verbs such as “fears,” “suspects” or “suggests.” Subjective expressions are then further classified by their source so that they may be evaluated for their level of expert reliability. “…suggests it will snow tomorrow” is more reliable when appearing in a dispatch from the National Weather Service than on a daily horoscope page. The development of new scientific instruments is often accompanied by a refinement in our understanding of the universe. As a scientist, I am anxious to witness this tool’s ability to gather truth from such a vast source of information. As an editorialist, I am also anxious to see the results when it is applied to the front pages of our nation’s major newspapers.
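To make the cue-word idea concrete, here is a minimal sketch in Python of the kind of classification described above: labeling a sentence as an assertion or an opinion based on the reporting and attitude verbs it contains. This is an illustration only, not the research group’s actual system; the cue lists, function name and labels are my own inventions, and a real classifier would be trained on annotated corpora rather than hand-written word lists.

```python
# Illustrative sketch only: a hand-built cue-word classifier, far simpler
# than the trained statistical classifiers the DHS-funded project would use.

# Cue words signaling an asserted, attributed statement of fact.
ASSERTION_CUES = {"said", "according to", "reported", "stated", "announced"}

# Verbs signaling a subjective belief or opinion.
SUBJECTIVE_CUES = {"fears", "suspects", "suggests", "believes", "hopes", "doubts"}

def classify(sentence: str) -> str:
    """Label a sentence 'opinion', 'assertion', or 'unknown' by cue words.

    Subjective cues are checked first, since a belief verb marks the
    content as opinion even when it is also attributed to a source.
    """
    text = sentence.lower()
    if any(cue in text for cue in SUBJECTIVE_CUES):
        return "opinion"
    if any(cue in text for cue in ASSERTION_CUES):
        return "assertion"
    return "unknown"

print(classify("The spokesman said the bridge reopened Monday."))   # assertion
print(classify("The agency suspects the report was fabricated."))   # opinion
print(classify("It will snow tomorrow."))                           # unknown
```

A production system would then weight each classified expression by the reliability of its source, so that the same “suggests it will snow tomorrow” scores higher from the National Weather Service than from a horoscope page.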
Bill Weaver is an assistant professor in the Integrated Science, Business and Technology Program at La Salle University. He may be contacted at editor@ScientificComputing.com.