Scientists have taken the mask off a new neural network to better understand how it makes its decisions.
Researchers from the Intelligence and Decision Technologies Group at the Massachusetts Institute of Technology's Lincoln Laboratory have created a new, transparent neural network that performs human-like reasoning steps to answer questions about the contents of images.
The model, dubbed the Transparency by Design Network (TbD-net), visually renders its thought process as it solves problems, enabling human analysts to interpret its decision-making. Despite that transparency, it outperforms today’s best visual-reasoning neural networks.
Neural networks consist of input and output layers, as well as layers in between that transform the input into the correct output. Some deep neural networks are so complex that it is effectively impossible to follow this transformation process.
The researchers, however, set out to make the new network’s inner workings transparent, which could allow them to teach the network to correct any faulty assumptions.
“Progress on improving performance in visual reasoning has come at the cost of interpretability,” Ryan Soklaski, who built TbD-net with fellow researchers Arjun Majumdar, David Mascharka, and Philip Tran, said in a statement.
To close the gap between performance and interpretability, the researchers included a collection of modules—small neural networks that are specialized to perform specific subtasks. For example, when TbD-net is asked a visual reasoning question about an image, it breaks down the question into subtasks and assigns the appropriate module to fulfill its part. Each module builds off the previous module’s deduction to eventually reach a final answer.
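To make that idea concrete, here is a minimal sketch in Python of how a question might be turned into a chain of specialized modules. The subtask names, the module library, and the toy scene are illustrative assumptions, not the Lincoln Laboratory implementation, whose modules are small neural networks operating on image features.

```python
def assemble_program(subtasks, module_library):
    """Look up one specialized module per subtask and chain them together."""
    modules = [module_library[name] for name in subtasks]

    def program(state):
        # Each module builds on the previous module's output.
        for module in modules:
            state = module(state)
        return state

    return program

# Hypothetical library of specialized modules (simple placeholders, not real
# neural networks) and a toy scene described by object attributes.
module_library = {
    "filter_large": lambda objs: [o for o in objs if o["size"] == "large"],
    "filter_metal": lambda objs: [o for o in objs if o["material"] == "metal"],
    "query_color":  lambda objs: objs[0]["color"] if objs else None,
}

scene = [
    {"size": "large", "material": "metal",  "color": "blue"},
    {"size": "small", "material": "rubber", "color": "red"},
]

# "What color is the large metal object?" broken into three subtasks.
program = assemble_program(
    ["filter_large", "filter_metal", "query_color"], module_library)
print(program(scene))  # -> blue
```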
As a whole, the network uses natural-language-processing techniques to interpret the question and break it into subtasks, followed by multiple computer-vision techniques that interpret the imagery.
“Breaking a complex chain of reasoning into a series of smaller sub-problems, each of which can be solved independently and composed, is a powerful and intuitive means for reasoning,” Majumdar said in a statement.
Each module’s output is depicted visually in an “attention mask,” which overlays heat-map blobs on the objects in the image that the module identifies as its answer. These visualizations let human analysts see how each module interprets the image.
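For readers who want to picture what such a mask looks like, the snippet below overlays a made-up attention mask on a placeholder image using matplotlib; it is a generic illustration, not the team’s own visualization code.

```python
import numpy as np
import matplotlib.pyplot as plt

image = np.random.rand(64, 64, 3)      # placeholder image
attention = np.zeros((64, 64))
attention[20:40, 25:45] = 1.0          # pretend the module attends to this region

plt.imshow(image)
plt.imshow(attention, cmap="hot", alpha=0.4)  # translucent heat-map overlay
plt.axis("off")
plt.show()
```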
To answer a question such as “What color is the large metal cube in a given image?”, one module first isolates the large objects in the image to produce an attention mask. The next module takes that output and selects which of the objects identified as large by the previous module are also metal.
That module’s output is passed to the next module, which identifies which of those large, metal objects is also a cube, and that result is then sent to a module that can determine the color of objects.
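A toy sketch of that chain, assuming a scene of hand-labelled objects and one attention weight per object (the real modules operate on image feature maps), might look like this:

```python
objects = [
    {"size": "large", "material": "metal",  "shape": "cube",   "color": "blue"},
    {"size": "small", "material": "rubber", "shape": "sphere", "color": "red"},
    {"size": "large", "material": "rubber", "shape": "cube",   "color": "green"},
]

def filter_by(attr, value):
    """Build a 'filter' module: keep attention only on objects whose attr matches."""
    def module(attention):
        return [a if obj[attr] == value else 0.0
                for a, obj in zip(attention, objects)]
    return module

def query_color(attention):
    """A 'query' module: read the color of the most strongly attended object."""
    best = max(range(len(objects)), key=lambda i: attention[i])
    return objects[best]["color"]

# "What color is the large metal cube?"
attention = [1.0] * len(objects)                       # start by attending to everything
attention = filter_by("size", "large")(attention)      # keep the large objects
attention = filter_by("material", "metal")(attention)  # ...that are metal
attention = filter_by("shape", "cube")(attention)      # ...and are cubes
print(query_color(attention))                          # -> blue
```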
TbD-net achieved 98.7 percent accuracy on a visual question-answering dataset consisting of 70,000 training images and 700,000 questions, along with test and validation sets of 15,000 images and 150,000 questions.
Because the network is transparent, the researchers were able to see where it went wrong and refine the system, improving accuracy to 99.1 percent.