Researchers have developed software using two new techniques to help computer chip designers improve memory systems. The techniques rely on “performance cloning,” which can assess the behavior of software without compromising privileged data or proprietary computer code.
Computer chip manufacturers try to design their chips to provide the best possible performance. But to find the most effective designs, manufacturers need to know what sort of software their clients will be using.
“For example, programs that model protein folding use a lot of computing power, but very little data — so manufacturers know to design chips with lots of central processing units (CPUs), but significantly less memory storage than would be found on other chips,” says Yan Solihin, an associate professor of electrical and computer engineering at North Carolina State University and an author of two papers describing the new techniques.
However, many large customers — from major corporations to Wall Street firms — don’t want to share their code with outsiders. And that makes it tough for chip manufacturers to develop the best possible chip designs.
One way to address this problem is through performance cloning. The concept behind performance cloning is that a chip manufacturer would give profiler software to a client. The client would use the profiler to assess its proprietary software, and the profiler would then generate a statistical report on the proprietary software’s performance. That report could be given to the chip manufacturer without compromising the client’s data or code.
The profiler report would then be fed into generator software, which can develop a synthetic program that mimics the performance characteristics of the client’s software. This synthetic program would then serve as the basis for designing chips that will better meet the client’s needs.
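To make the profile-then-generate workflow concrete, here is a minimal sketch in Python. It is an illustration only, not the researchers' profiler or generator: the function names `profile` and `generate`, the choice of a stride histogram as the "statistical report," and the sample addresses are all assumptions made for this example.

```python
import random
from collections import Counter

def profile(trace):
    """Client side (hypothetical): reduce an address trace to summary statistics.

    Only a histogram of strides (differences between consecutive addresses)
    and the trace length leave the client, not the addresses themselves.
    """
    strides = Counter(b - a for a, b in zip(trace, trace[1:]))
    return {"length": len(trace), "strides": dict(strides)}

def generate(report, seed=0):
    """Manufacturer side (hypothetical): build a synthetic trace from the report.

    Strides are sampled at random in proportion to how often they were observed,
    so the clone resembles the original statistically without copying it.
    """
    rng = random.Random(seed)
    values = list(report["strides"].keys())
    weights = list(report["strides"].values())
    addr = 0  # the clone starts at an arbitrary address
    clone = [addr]
    for _ in range(report["length"] - 1):
        addr += rng.choices(values, weights=weights)[0]
        clone.append(addr)
    return clone

# Made-up proprietary trace, used only to exercise the sketch.
trace = [0, 64, 128, 192, 64, 128, 4096, 4160]
report = profile(trace)
clone = generate(report)
```

A real profiler would collect far richer statistics than a single histogram, but the division of labor is the same: the client runs the profiler, and only the report reaches the chip manufacturer.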
Previous work at Ghent University and the University of Texas at Austin has used performance cloning to address issues related to CPU design — but those initiatives did not focus on memory systems, which are an important element of overall chip design.
Researchers have now developed software using two new techniques to help optimize memory systems.
The first technique, called MEMST (Memory EMulation using Stochastic Traces), assesses memory in a synthetic program by focusing on the amount of memory a program uses, the location of the data being retrieved and the pattern of retrieval.
For example, MEMST looks at how often a program retrieves data from the same location in a short period of time, and at how likely a program is to retrieve data from a location that is near other data that’s been retrieved recently. Both of these variables affect how quickly the program can retrieve data.
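The two variables described above are commonly called temporal locality and spatial locality. The Python sketch below shows one simple way they could be measured on an address trace. It is not MEMST's actual method: the function name `locality_stats`, the window of four recent accesses and the 64-byte radius are assumptions chosen for illustration.

```python
from collections import deque

def locality_stats(trace, window=4, radius=64):
    """Return (temporal, spatial) locality fractions for an address trace.

    temporal: share of accesses that repeat an address seen in the last
              `window` accesses.
    spatial:  share of accesses that land within `radius` bytes of a recently
              seen address without repeating it exactly.
    """
    if not trace:
        return 0.0, 0.0
    recent = deque(maxlen=window)  # sliding window of the most recent addresses
    temporal = spatial = 0
    for addr in trace:
        if addr in recent:
            temporal += 1
        elif any(abs(addr - r) <= radius for r in recent):
            spatial += 1
        recent.append(addr)
    return temporal / len(trace), spatial / len(trace)

# Made-up trace: two repeats of address 0 and one access 64 bytes away from it.
temporal, spatial = locality_stats([0, 0, 64, 4096, 0, 8192])
```

In this made-up trace, two of the six accesses repeat a recent address and one falls near a recent address, so the function returns fractions of 2/6 and 1/6.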
The second technique, called MeToo, focuses on memory timing behavior: how often the program retrieves data and whether the program has periods in which it makes many memory requests in a short time. Memory timing behavior can have a significant impact on how a memory system is designed.
For example, if you think of memory requests as cars, you don’t want to have a traffic jam — so you may want to be sure there are enough lanes for the traffic. These traffic lanes equate to memory bandwidth; the broader the bandwidth, the more lanes there are.
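The traffic analogy can be put in numbers with a short Python sketch that compares a program's busiest stretch with its average request rate. This is a hypothetical illustration rather than part of MeToo: the function name `peak_and_mean_rate`, the 10-cycle window and the request timestamps are all made up for the example.

```python
def peak_and_mean_rate(timestamps, window):
    """Return (peak, mean) memory requests per window of `window` cycles.

    peak: the largest number of requests that fall inside any single window.
    mean: the average number of requests per window over the whole run.
    """
    if not timestamps:
        return 0, 0.0
    times = sorted(timestamps)
    peak = 0
    start = 0  # index of the oldest request still inside the current window
    for end, t in enumerate(times):
        while t - times[start] >= window:
            start += 1
        peak = max(peak, end - start + 1)
    span = times[-1] - times[0] + 1  # total cycles covered by the trace
    mean = len(times) * window / span
    return peak, mean

# Two bursts of four requests, separated by a long quiet period.
bursty = [0, 1, 2, 3, 100, 101, 102, 103]
peak, mean = peak_and_mean_rate(bursty, window=10)
```

Here the average is less than one request per 10-cycle window, yet each burst packs four requests into a single window. A memory system sized only for the average would hit exactly the traffic jam described above.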
“Both MEMST and MeToo are useful for chip designers, particularly for designers who work on memory components, such as DRAM, memory controllers and memory buses,” Solihin says.
The new techniques expand on previous work done by Solihin that used performance cloning to look at cache memory.
“Our next step is to take MEMST and MeToo, as well as our work on cache memory, and develop an integrated program that we can commercialize,” says Solihin, author of the forthcoming Fundamentals of Parallel Multicore Architecture, which addresses memory hierarchy design.
The paper on MEMST, “MEMST: Cloning Memory Behavior Using Stochastic Traces,” will be presented at the International Symposium on Memory Systems, being held October 5 to 8 in Washington, DC. The paper was co-authored by Solihin and Ganesh Balakrishnan of Advanced Micro Devices, who is a former NC State Ph.D. student.
The paper on MeToo, “MeToo: Stochastic Modeling of Memory Traffic Timing Behavior,” will be presented at the International Conference on Parallel Architecture and Compilation, being held October 18 to 21 in San Francisco, CA. Lead author of the paper is Yipeng Wang, a Ph.D. student at NC State. Co-authors are Balakrishnan and Solihin. The work was supported by the National Science Foundation under grant number CNS-0834664.
“MEMST: Cloning Memory Behavior Using Stochastic Traces”
Authors: Ganesh Balakrishnan, Advanced Micro Devices; Yan Solihin, North Carolina State University
Presented: International Symposium on Memory Systems, Oct. 5-8, Washington, D.C.
Abstract: Memory Controller and DRAM architecture are critical aspects of Chip Multi Processor (CMP) design. A good design needs an in-depth understanding of end-user workloads. However, designers rarely get insights into end-user workloads because of the proprietary nature of source code or data. Workload cloning is an emerging approach that can bridge this gap by creating a proxy for the proprietary workload (clone). Cloning involves profiling workloads to glean key statistics and then generating a clone offline for use in the design environment. However, there are no existing cloning techniques for accurately capturing memory controller and DRAM behavior that can be used by designers for a wide design space exploration. We propose Memory EMulation using Stochastic Traces, MEMST, a highly accurate black box cloning framework for capturing DRAM and MC behavior. We provide a detailed analysis of statistics that are necessary to model a workload accurately. We will also show how a clone can be generated from these statistics using a novel stochastic method. Finally, we will validate our framework across a wide design space by varying DRAM organization, address mapping, DRAM frequency, page policy, scheduling policy, input bus bandwidth, chipset latency, DRAM die revision, DRAM generation and DRAM refresh policy. We evaluated MEMST using CPU2006, BioBench, Stream and PARSEC benchmark suites across the design space for single-core, dual-core, quad-core and octa-core CMPs. We measured both performance and power metrics for the original workload and clones. The clones show a very high degree of correlation with the original workload for over 7900 data points with an average error of 1.8% and 1.6% for transaction latency and DRAM power respectively.
“MeToo: Stochastic Modeling of Memory Traffic Timing Behavior”
Authors: Yipeng Wang and Yan Solihin, North Carolina State University; Ganesh Balakrishnan, Advanced Micro Devices
Presented: International Conference on Parallel Architecture and Compilation, Oct. 18-21, San Francisco, California
Abstract: The memory subsystem (memory controller, bus, and DRAM) is becoming a bottleneck in computer system performance. Optimizing the design of the multicore memory subsystem requires good understanding of the representative workload. A common practice in designing the memory subsystem is to rely on trace simulation. However, the conventional method of relying on traditional traces faces two major challenges. First, many software users are apprehensive about sharing their code (source or binaries) due to the proprietary nature of the code or secrecy of data, so representative traces are sometimes not available. Second, there is a feedback loop where memory performance affects processor performance, which in turn alters the timing of memory requests that reach the bus. Such feedback loop is difficult to capture with traces. In this paper, we present MeToo, a framework for generating synthetic memory traffic for memory subsystem design exploration. MeToo uses a small set of statistics that summarizes the performance behavior of the original applications, and generates synthetic traces or executables stochastically, allowing applications to remain proprietary. MeToo uses novel methods for mimicking the memory feedback loop. We validate MeToo clones, and show very good fit with the original applications’ behavior, with an average error of only 4.2%, which is a small fraction of the errors obtained using geometric inter-arrival (commonly used in queueing models) and uniform inter-arrival.