Pioneering algorithms sort through massive unstructured datasets using familiar desktop tools
Figure 1: K-means discovers clusters of data and can be accelerated with Star-P. This plot shows five clusters.
Despite the astonishingly powerful and inexpensive computation available to scientists, engineers and analysts these days, it’s all for naught if the researchers aren’t sure what they’re looking to discover in the first place. Much of today’s most important research seeks the unknown results of sifting through huge proverbial haystacks of data, a practice known as knowledge discovery (KD). To this end, researchers are pioneering new mathematical techniques for sorting through enormous unstructured datasets to find meaningful trends, patterns or clusters of data. The technology is enabling them to figure out which questions are relevant, which data would be useful to that end, and what knowledge can be derived from the inquiries. Popular knowledge discovery techniques derive from established algorithms such as k-means clustering, principal component analysis (PCA), non-negative matrix factorization (NMF) and singular-value decomposition (SVD). The problem is that, while datasets in the multi-terabyte and petabyte range have become commonplace in science, government and industry, parallel implementations of those algorithms on high performance computers (HPCs), necessary for both memory capacity and execution speed, have not kept pace.
Key algorithms
Before we can address the computational challenge of executing the code on large datasets, it’s important to understand how the knowledge discovery algorithms work. When users believe that data is highly redundant or irrelevant, but in unknown ways, they often use dimensionality reduction. For instance, human faces can be characterized by numerous anatomical features that can be automatically extracted from their images. However, it’s difficult to know which of those features vary together or which do little to discriminate among faces. Dimensionality (or rank) reduction algorithms reduce data from a large number of dimensions down to just a handful.
The remaining dimensions are chosen for their ability to maximize the discrimination among the different data points, effectively giving a brief synopsis of the data. The reduced number of dimensions enables clustering algorithms, which are often the next step in a workflow, to work more effectively, and can make the data more understandable to a human. Some dimensionality reduction algorithms preserve a notion of “distance” between data points, while others do not; this can be important when the original data has some notion of physical distance to it. Prominent dimensionality-reduction algorithms include PCA, SVD and NMF. Some researchers favor NMF because its non-negative values make the dimensions it identifies more readily interpretable; however, because it is formulated as an optimization, it does not define unique factors and is sensitive to initial conditions.
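As a concrete illustration (not the article’s own code), the following minimal Python/NumPy sketch reduces a dataset to its top k dimensions via a truncated singular-value decomposition, which is the core computation behind PCA. The function name and the synthetic data are assumptions made for the example.

```python
import numpy as np

def reduce_dimensions(X, k):
    """Project an (n_samples x n_features) matrix onto its top-k
    principal directions using a truncated SVD (the core of PCA)."""
    # Center each feature so the leading singular vectors capture
    # variance rather than the mean offset.
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U * diag(s) * Vt
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Coordinates of each sample in the reduced k-dimensional space.
    return Xc @ Vt[:k].T

# Example: 1,000 samples described by 50 partly redundant features,
# summarized by three dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 50))
X_reduced = reduce_dimensions(X, 3)
print(X_reduced.shape)  # (1000, 3)
```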
Clustering, also known as unsupervised classification, is another important algorithm in knowledge discovery. Clustering assigns each data point to a cluster, maximizing the similarity among points in the same cluster and minimizing the similarity among points in different clusters. Clustering on real-world data can be difficult because many popular algorithms require the user to specify how many clusters should be created, which is hard to know in advance, and because the algorithms don’t provide insight into why data points were assigned to particular clusters. As examples, clustering is used in climate change research to identify the types of land cover that exist (shrub, grass, urban, agricultural, etcetera), and to identify annual climate patterns that are typical and have good predictive power. K-means clustering works well in this situation, where k centroids define the k clusters in the result, based on a chosen distance metric; often users run the algorithm for a range of values of k to see which value gives the best results. Other clustering algorithms include hierarchical, spectral (which can start to look like dimensionality reduction) and nearest-neighbor.
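For readers who want to see the mechanics, here is a compact, serial k-means sketch in Python/NumPy. It is purely illustrative: the random initialization, convergence test and function signature are assumptions, not any particular package’s implementation.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Cluster the rows of X into k groups by alternating assignment
    and centroid-update steps (Lloyd's algorithm, Euclidean metric)."""
    rng = np.random.default_rng(seed)
    # Start from k points chosen at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Users often sweep k (e.g., k = 2..10) and compare the within-cluster
# spread to pick the value that best summarizes the data.
```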
The parallel KD challenge
The challenge is that knowledge discovery users seeking to exploit these KD algorithms are typically domain experts (scientists or analysts), not math experts or computer scientists. Their habit is to explore and visualize the data interactively from their desktops using very high level languages (VHLLs) like the M language of MATLAB [1] or Python. Without HPCs and appropriate tools for using them at their disposal, they are forced to limit the size of the datasets and the types of problems they explore rather than make a very large and time-consuming investment in parallelizing the KD algorithms. This serial-only approach retains the productivity gains of VHLLs but sacrifices the ability to scale problems to larger computers. As a result, the execution time of many researchers’ analyses has ballooned from minutes or hours to days and even weeks, denying them the benefit of interacting with their data in real time.
However, the traditional approach to knowledge discovery with HPCs requires the serial algorithms originally developed in a VHLL to be rewritten into C, then parallelized using the message passing interface (MPI). While this approach can yield code that scales to large sizes, the C and MPI interfaces are at such a low conceptual level that a rewrite can often take several months, even for experienced parallel programmers. In addition, once parallelized, modifications to the algorithms are equally expensive. Further, common KD algorithms often do not decompose naturally to run in parallel on multiple physical processors, making implementation even more cumbersome. This programming cost leads researchers to parallelize only KD algorithms they are certain they want to use, forgoing the ability to try several different KD algorithms on full-scale data.
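To make concrete how much of the decomposition and communication becomes the programmer’s responsibility, the sketch below shows just the assignment-and-update step of a message-passing k-means. It uses Python with mpi4py rather than C and MPI for brevity, and assumes the data rows have already been distributed across ranks; it is an illustration of the style of code involved, not a production implementation.

```python
# Sketch of one distributed k-means iteration with explicit message passing.
# Assumes each MPI rank already holds its own block of rows `X_local`
# and a copy of the current `centroids`; only the communication pattern
# is shown.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def distributed_kmeans_step(X_local, centroids):
    k, d = centroids.shape
    # Local assignment of each point to its nearest centroid.
    dists = np.linalg.norm(X_local[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Local partial sums and counts per cluster.
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for j in range(k):
        members = X_local[labels == j]
        sums[j] = members.sum(axis=0)
        counts[j] = len(members)
    # Global reduction: every rank ends up with the full sums and counts.
    global_sums = np.empty_like(sums)
    global_counts = np.empty_like(counts)
    comm.Allreduce(sums, global_sums, op=MPI.SUM)
    comm.Allreduce(counts, global_counts, op=MPI.SUM)
    # Recompute centroids, leaving empty clusters where they were.
    nonempty = global_counts > 0
    new_centroids = centroids.copy()
    new_centroids[nonempty] = global_sums[nonempty] / global_counts[nonempty, None]
    return labels, new_centroids
```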
New techniques
Fortunately, new programming models have emerged that allow knowledge discovery users to explore massive, unstructured datasets using their familiar desktop tools, and then to automatically parallelize the code to run on HPCs. This problem-solving model combines the best of both worlds: the familiarity and interactivity of popular desktop tools with the computational muscle of powerful multi-core servers, bridged by straightforward and interactive parallelization software. It has the added benefit of not requiring research teams to make huge investments in hardware and programmer training.
One example of this approach is a system that researchers at Pacific Northwest National Laboratory (PNNL) are creating to identify faces in streaming video. In another example, ecologists are using an improved modeling approach to look at wildlife migration and gene flows across fragmented landscapes using electrical network theory.
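The article does not detail the ecologists’ model, but the electrical-network idea can be illustrated briefly: treat habitat patches as nodes and corridors as resistors, and compute the effective resistance between patches from the graph Laplacian, where lower resistance suggests easier movement and gene flow. The Python/NumPy sketch below is a generic illustration of that calculation under those assumptions, not the researchers’ code.

```python
import numpy as np

def effective_resistance(weights):
    """Effective resistance between every pair of nodes in a resistor
    network, given a symmetric matrix of edge conductances (0 = no edge)."""
    # Graph Laplacian: degree matrix minus conductance (adjacency) matrix.
    L = np.diag(weights.sum(axis=1)) - weights
    # The Moore-Penrose pseudoinverse handles the Laplacian's rank deficiency.
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    # R_ij = Lp_ii + Lp_jj - 2 * Lp_ij
    return d[:, None] + d[None, :] - 2 * Lp

# Example: three habitat patches, with patches 0 and 1 well connected and
# patch 2 reachable only through a weaker corridor.
W = np.array([[0.0, 2.0, 0.5],
              [2.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
print(effective_resistance(W))
```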
Sidebar: Designing Wildlife Habitats for Wide-ranging Species
Sidebar: Understanding Associations among Faces in Streaming Video
View to the future
Over the next few years, we can expect the availability of packages like the Knowledge Discovery Suite to accelerate the development of effective algorithms for knowledge discovery, providing much improved insight to researchers who are facing an onslaught of data from increasingly finer-grained sensors.
Reference
1. MATLAB is a registered trademark of The MathWorks. Other product or brand names are trademarks or registered trademarks of their respective holders. ISC’s products are not sponsored or endorsed by The MathWorks, or by any other trademark owner referred to in this document.
Steve Reinhardt is Vice President of Joint Research at Interactive Supercomputing. He may be contacted at [email protected].
Acronyms
GPU Graphics Processing Unit
HPC High Performance Computer
ICA Independent Component Analysis
KD Knowledge Discovery
KDS Knowledge Discovery Suite
MPI Message Passing Interface
NMF Non-negative Matrix Factorization
PCA Principal Component Analysis
PNNL Pacific Northwest National Laboratory
SVD Singular-value Decomposition
VHLL Very High Level Language
Y2Y Yellowstone to Yukon Corridor