Researchers from IBM demonstrated the future of large-scale
storage systems by successfully scanning 10 billion files on a single system in
just 43 minutes, shattering the previous record of one billion files in three
hours by a factor of 37.
This advance allows data environments growing at unprecedented scales to be
unified on a single platform, instead of being distributed across several
systems that must be separately managed. It also dramatically reduces and
simplifies data management tasks, allowing more information to be stored with the
same technology rather than requiring the continual purchase of more storage.
In 1998, IBM researchers unveiled a highly scalable,
clustered parallel file system called General Parallel File System (GPFS),
which was further tuned to make this breakthrough possible. GPFS represents a
major advance in scaling storage performance and capacity, while keeping
management costs flat. This innovation could help organizations cope with the
exploding growth of data, transactions and digitally aware sensors, and other
devices that comprise Smarter Planet systems. It is ideally suited for
applications requiring high-speed access to large volumes of data such as data
mining to determine customer buying behaviors across massive data sets, seismic
data processing, risk management and financial analysis, weather modeling, and
scientific research.
Driving new levels of storage performance
This breakthrough was achieved using GPFS running on a cluster of ten eight-core
systems backed by solid-state storage; performing the selection took 43 minutes.
The GPFS policy rules engine provides comprehensive capabilities for expressing
and servicing data management tasks.
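GPFS policies are written in an SQL-like rules language and applied across the file system's metadata. As a rough illustration (the rule name, pool names, and 30-day threshold below are invented for this example, not taken from the demonstration), a migration rule might look like:

```sql
/* Illustrative GPFS policy rule: move files untouched for 30 days
   from the fast 'system' pool to a cheaper 'data' pool. */
RULE 'migrate_cold'
  MIGRATE FROM POOL 'system'
  TO POOL 'data'
  WHERE CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '30' DAYS
```

Rules like this are evaluated against every file's metadata, which is why the speed of the metadata scan directly determines how quickly tasks such as placement, aging, backup, and migration can run.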
GPFS’s advanced algorithm makes full use of all
processor cores on all of these machines in every phase of the task (data read,
sorting, and rules evaluation). GPFS stores the metadata on solid-state storage
appliances with only 6.8 TB of capacity, exploiting their excellent random-access
performance and high data transfer rates. The appliances sustain
hundreds of millions of data input-output operations, while GPFS
continuously identifies, selects, and sorts the right set of files among the 10
billion on the system.
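The three phases described above (read metadata, evaluate the rule, sort the matches) can be sketched in miniature. This is a minimal Python illustration of the idea of fanning rule evaluation out across parallel workers, not IBM's implementation; the record layout, the 30-day rule, and the worker count are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import NamedTuple


class FileMeta(NamedTuple):
    """Toy stand-in for a file's metadata record (fields are illustrative)."""
    path: str
    size: int
    atime_days: int  # days since last access


def matches_rule(meta: FileMeta) -> bool:
    # Hypothetical policy rule: select files not accessed for 30+ days.
    return meta.atime_days >= 30


def scan(records: list[FileMeta], workers: int = 8) -> list[FileMeta]:
    # Phase 1-2: split the metadata into shards and evaluate the rule
    # on each shard in parallel, keeping all workers busy.
    chunk = max(1, len(records) // workers)
    shards = [records[i:i + chunk] for i in range(0, len(records), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        parts = ex.map(lambda shard: [m for m in shard if matches_rule(m)],
                       shards)
    # Phase 3: merge the per-shard results and sort (coldest files first).
    selected = [m for part in parts for m in part]
    return sorted(selected, key=lambda m: m.atime_days, reverse=True)


# 100 synthetic metadata records; the real demonstration scanned 10 billion.
records = [FileMeta(f"/fs/f{i}", i * 10, i % 60) for i in range(100)]
cold = scan(records)
```

At real scale the shards would be regions of on-disk metadata read from the solid-state appliances, and the merge/sort phase would itself be parallel, but the pipeline shape is the same.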
“[The] demonstration of GPFS scalability will pave the
way for new products that address the challenges of a rapidly growing,
multi-zettabyte world,” says Doug Balog, vice president, storage
platforms, IBM. “This has the potential to enable much larger data
environments to be unified on a single platform and dramatically reduce and
simplify data management tasks such as data placement, aging, backup and
migration of individual files.”
The previous record was also set by IBM researchers, at the Supercomputing 2007
conference in Reno, NV, where they demonstrated the ability to scan one billion
files in three hours.
“Businesses in every industry are looking to the future
of storage and data management as we face a problem springing from the very
core of our success—managing the massive amounts of data we create on a daily
basis,” says Bruce Hillsberg, director of storage systems, IBM Research—Almaden.
“From banking systems to MRIs and traffic sensors, our day-to-day lives
are engulfed in data. But it can only be useful if it is effectively stored,
analyzed, and applied, and businesses and governments have relied on smarter
technology systems as the means to manage and leverage the constant influx of
data and turn it into valuable insights.”