Computer scientists across the U.S. are attempting to build a machine that could translate some of the world’s most obscure languages.
Scientists plan to build a system that can answer queries typed in English using documents written in so-called “low resource” languages, meaning languages for which relatively little written material is available. These could potentially include Kurdish, Serbo-Croatian, Khmer, Hmong, and Somali.
Teams will initially explore how the systems can work in a competition among teams from several research universities, including Johns Hopkins University, the University of Southern California, and Columbia University, as well as Raytheon BBN Technologies. The project will begin this month and run in three phases over the next four years.
The project is supported in part by a $10.7 million grant from the Office of the Director of National Intelligence (DNI), which is seeking a system that can gather intelligence from and analyze more languages, particularly those for which few or no automated tools exist for information retrieval or machine translation.
“The biggest challenge we’re going to have with this setup is there’s not much data,” said Philipp Koehn, a computer science professor in Johns Hopkins’ Whiting School of Engineering, in a statement to his university. Koehn is leading a group of professors, research scientists, post-doctoral fellows, and doctoral students at Johns Hopkins that is participating in these research efforts.
The project aims to sharply cut the time and the amount of information needed to put a translation system into use for intelligence agents, according to the DNI.
Koehn said the DNI is expected to send the researchers information on a specific language they can use to test the new system.
The groups are going to compile online samples of target-language text that have already been translated into English (about 3,500 pages’ worth) and begin machine analysis of language patterns, including sentence structure and the positions of verbs, adjectives, and other components.
By using this analysis technique, scientists hope to develop algorithms that automatically translate the target language.
The researchers hope the system will be able to respond to queries that include a word or term and a topic area or domain, and that its responses will tell the user how the retrieved material is relevant to the query.