Web searching with object instance recognition
Toward the end of last summer, LEGO introduced the latest incarnation of its most successful product line, the LEGO Mindstorms NXT Robotics Toolset. Complete with a 32-bit microcontroller, three rotational actuators, and four sensors for touch, sound, light and distance, these kits have become the foundation of our undergraduate Intelligent Systems course as well as my 11-year-old son’s favorite birthday present. Unlike traditional LEGO models, the Mindstorms are bundled with
[Figure: Neighboring image features are grouped into triplets that are transformed mathematically to account for differences in translation, rotation and scaling between the "wild image" and the canonical image in the database.]
programming software used to animate the creations. While there is something to be said for the amount of imagination required to zoom around the house with a large-scale Rebel X-Wing fighter in hand, the prospect of designing your own X-Wing with actuating wings and a blinking laser blaster, complete with sounds sampled from the Star Wars DVD, offers an even greater opportunity to exercise additional areas of the designer's brain.
And therein lies the most important lesson of the NXT Toolset: it does not include a biological brain. The instructions provide the steps necessary to assemble a humanoid robot and program it to perform some impressive movements, but breathing life into the creations requires good old imagination or the help of some special effects. The outer shell of the binocular ultrasonic range sensor is reminiscent of the robot from the 1986 film Short Circuit, and it is our brain that infers it comes complete with stereo color vision, a photographic memory and object recognition. Alas, a quick glance at the schematic reveals that it simply contains a collinear transmitter/receiver pair and a timing circuit that calculates the distance between the sensor and a planar object placed in its field of view. This is great stuff to a sensor enthusiast, but the robot has little chance of recognizing the face of its creator.
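The timing circuit described above amounts to a simple calculation: the sensor emits a ping, times the echo, and converts the round-trip time to distance using the speed of sound. A minimal sketch, assuming a pulse in dry air at room temperature (the function name and example timing are illustrative, not taken from the NXT documentation):

```python
# Sketch of the ultrasonic range calculation: the pulse travels out
# and back, so the one-way distance is half the round-trip path.

SPEED_OF_SOUND_M_PER_S = 343.0  # dry air at roughly 20 degrees C

def range_from_echo(round_trip_seconds: float) -> float:
    """Distance to the reflecting object, in meters."""
    return SPEED_OF_SOUND_M_PER_S * round_trip_seconds / 2.0

# A 5.8 ms echo corresponds to an object roughly 1 meter away.
print(round(range_from_echo(0.0058), 2))  # -> 0.99
```

The division by two is the whole trick: the timer measures the full out-and-back flight, while the quantity of interest is the one-way range.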
Researcher Larry Zitnick, currently with the Interactive Visual Media Group at Microsoft Research, is working on the development of systems capable of more than range finding. A recent graduate of the Robotics Institute at Carnegie Mellon University, Dr. Zitnick is an expert in image analysis and has set himself to the task of developing algorithms that can locate visual objects and recall their identity. The technology is being realized in a research prototype called "Lincoln" (available at lincoln.msresearch.us) that allows its users to search the Web for information about an object by snapping a color photograph of it with an Internet-enabled camera phone or PDA. Lincoln analyzes the digital photograph for words and printed images, such as those appearing on a DVD jacket or in a magazine. The system then compares these features to those contained in a cataloged image database and, if a match is found, the image's keywords are used as the basis of a Web search. At the moment, the system is designed to work with books, magazines, posters, paintings and labeled products. In practice, the user sees an object of interest, snaps its picture, and Lincoln returns Web information such as a product Web site, price or concert venues. Especially when using the small keypad of a PDA or the number pad of a Smartphone, the system saves wear and tear on the user's thumbs.
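The flow described above (extract features from the photo, match them against a cataloged database, then reuse the matched image's keywords as search terms) can be sketched as follows. All function and field names here are hypothetical; Lincoln's actual implementation and matching score are not public:

```python
# Illustrative sketch of a photo-to-keywords lookup. The "features"
# stand in for whatever visual descriptors the real system extracts.

def lincoln_lookup(photo_features, database):
    """Match a photo's features against cataloged images and return
    the best match's keywords for a Web search, or None if no match."""
    best_match, best_score = None, 0
    for entry in database:
        score = len(photo_features & entry["features"])  # shared features
        if score > best_score:
            best_match, best_score = entry, score
    if best_match is None:
        return None
    return best_match["keywords"]  # seed terms for the Web search

catalog = [
    {"features": {"f1", "f2", "f3"}, "keywords": ["Star Wars", "DVD"]},
    {"features": {"f4", "f5"}, "keywords": ["magazine"]},
]
print(lincoln_lookup({"f2", "f3", "f9"}, catalog))  # -> ['Star Wars', 'DVD']
```

The point of the sketch is the indirection: the photograph itself is never searched; it is only a key into a catalog whose text keywords drive an ordinary Web query.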
There is no small amount of behind-the-scenes technical wizardry required for the handheld electronic device to display such high functionality. The algorithmic brain of the system must extract image features and compare them to features in a database of images. As with all measurement systems, the immediate problem is one of calibration. The image of the object found in the "wild" is very unlikely to have been acquired under the conditions of lighting, resolution and perspective used to capture the image in the database. Similar to the triangulation used by the Global Positioning System to determine spatial location, Lincoln uses triplets of neighboring features within the image to compensate for location, magnification and perspective. The 2-D position of each vertex of the wild-image triplet can be transformed mathematically to compensate for image translation, rotation and magnification and then compared with the canonical version of the image contained in the database. The problem becomes computationally expensive as the number of unique image triplets increases. Similar objects, such as people in the same family, require a large number of fine details in order to tell them apart. Even our brains are often confused by identical twins. For this reason, Lincoln does not work with people, plants or pets. Until our technology matures enough to include the production of biological neural network computers, our current maxim will hold true: A small amount of electronic ability requires an enormous amount of human imagination.
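The triplet-alignment idea can be illustrated with a small amount of geometry. Treating 2-D points as complex numbers, a similarity transform (rotation, uniform scaling and translation) is written as w = a*z + b; two point correspondences fix a and b exactly, and the third vertex of the triplet then tests whether the wild triplet really matches the canonical one. This is a sketch of the general technique, not Lincoln's actual algorithm:

```python
# 2-D points as complex numbers: multiplication by a complex constant
# rotates and scales, addition translates.

def fit_similarity(z0, z1, w0, w1):
    """Solve w = a*z + b from two point correspondences."""
    a = (w1 - w0) / (z1 - z0)   # encodes rotation and scale
    b = w0 - a * z0             # encodes translation
    return a, b

def triplet_matches(wild, canonical, tol=1e-6):
    """True if the wild triplet aligns with the canonical one under
    some rotation + uniform scale + translation."""
    a, b = fit_similarity(wild[0], wild[1], canonical[0], canonical[1])
    residual = abs(a * wild[2] + b - canonical[2])
    return residual < tol

# A canonical triplet, and the same triplet doubled in size, rotated
# 90 degrees and shifted: the third vertex still lines up.
canon = [0 + 0j, 1 + 0j, 0 + 1j]
wild = [(2j) * z + (5 + 3j) for z in canon]
print(triplet_matches(wild, canon))  # -> True
```

Note that the match test costs a handful of arithmetic operations per triplet, which is exactly why the overall problem grows expensive: the number of candidate triplet pairings, not the per-pair cost, dominates as the database grows.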
Bill Weaver is an assistant professor in the Integrated Science, Business and Technology Program at La Salle University. He may be contacted at editor@ScientificComputing.com.