Maximizing Application Performance through Inter-Procedural Optimization
Newer generation compilers deliver whole program optimization
Fred Chow, Ph.D.
Software applications typically comprise multiple source files that are compiled separately and linked together to create an executable program. This approach, known as separate compilation, limits most traditional commercial compilers: only incomplete program information is available during compilation, forcing them to make worst-case assumptions wherever code accesses external data or calls external functions. Newer generation compilers solve this problem by performing whole program optimization, which lets them make intelligent decisions about where and how to apply various optimizations, resulting in higher-performing applications. By collecting information over the entire program, a compiler can better judge the applicability and safety of many optimization techniques. The result is a wider variety of optimization techniques applied to greater effect.
Figure 1: Compilation model built on inter-procedural analysis and optimization to create higher-performing applications
Whole program optimization can be performed through a technique called inter-procedural analysis, which brings all source files together early in the compilation process, before most optimization and code generation is performed. Working from intermediate representation files, a compiler can perform inter-procedural analysis on the entire program, invoke its back end to optimize and generate object code, and finally invoke the standard linker to produce the final executable. The process has two stages: analysis and optimization. In the analysis stage, information is collected for the whole program; the application source code is examined on several levels, and the information gathered is used during the optimization stage to make intelligent decisions about how best to transform the code and achieve the best possible performance.
The first step in analyzing application source code is understanding how different functions relate to one another. To that end, a compiler may first construct a program call graph: a representation of all functions and their caller/callee relationships. Using the call graph, the compiler determines whether each program variable is modified or referenced inside each function call. Alias information is generated for every variable whose address is taken, enabling pointer accesses to be optimized more aggressively and minimizing their impact on application performance. All variable information gathered during the analysis phase is stored, enabling the back end of the compiler to perform additional optimization steps.
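To make this concrete, consider the kind of facts such an analysis records for a tiny C program. This is an illustrative sketch; the function names and the annotations in the comments are hypothetical, not the output of any particular compiler:

```c
int g = 0;                  /* global whose mod/ref status the analysis tracks */

static void leaf(void) {    /* analysis records that leaf modifies g           */
    g++;
}

static int middle(int *p) { /* calling leaf transitively modifies g, so        */
    leaf();                 /* middle is recorded as modifying g as well       */
    return *p + g;          /* p carries alias information: it may point at    */
}                           /* any variable whose address is taken             */

int entry(void) {           /* call graph: entry -> middle -> leaf             */
    int local = 10;
    return middle(&local);  /* &local is taken, so local joins alias analysis  */
}
```

With these facts stored, the back end knows, for example, that a call to `middle` may change `g` but no other global, so it can keep other globals in registers across the call.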
As shown in Figure 1, newer generation compilers use a compilation model built on inter-procedural analysis and optimization to create high-performing applications. Once the application source is analyzed, several types of optimizations are performed automatically by the compiler:
• Inlining: Perhaps the most important inter-procedural optimization, inlining replaces a call to a function with the body of that function. Function call overhead is eliminated, and the back-end phases of the compiler can work on larger sections of code, potentially exposing optimizations that would have been impossible when less code was visible. For example, inlining may form a loop nest that enables aggressive loop transformations. Inlining must be performed with care, however, to ensure it does not degrade performance: overgrown functions and programs can increase instruction cache misses, exhaust available registers, force more frequent memory accesses, and slow down later stages of compilation. Compilers that take a whole program compilation approach can inline any function into any other, even when the two are not in the same source file, so newer generation compilers can inline far more aggressively than traditional ones.
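The before-and-after effect of inlining can be sketched in C. Here `sum_sq_inlined` shows what the compiler effectively produces from `sum_sq_calls` (the names are illustrative): the call disappears from the loop body, leaving a plain loop the back end can unroll or vectorize:

```c
static int sq(int x) { return x * x; }

/* Before inlining: the loop body contains a function call, which
   blocks most loop optimizations. */
int sum_sq_calls(int n) {
    int s = 0;
    for (int i = 1; i <= n; i++)
        s += sq(i);
    return s;
}

/* After inlining: the body of sq is substituted for the call, so the
   whole computation is visible to the loop optimizer. */
int sum_sq_inlined(int n) {
    int s = 0;
    for (int i = 1; i <= n; i++)
        s += i * i;             /* body of sq replaces the call */
    return s;
}
```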
• Constant propagation and function cloning: Many function calls pass constants, including variable addresses, as parameters. Replacing a formal parameter with a known constant value creates opportunities for optimization. For example, portions of a function often become unreachable and can be deleted as dead code. Constants can be exploited in two ways. First, if all calls to a function use a given constant, constant propagation is performed, modifying the function without increasing program size. Second, if a function is called with several constant parameters, function cloning is performed to create alternate functions that use customized parameters.
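Both cases can be seen in a minimal C sketch (function names are hypothetical). If every caller passes `mode == 0`, propagation lets the compiler simplify `blend` in place; if only some callers do, a specialized clone like `blend_mode0` is generated for those call sites:

```c
/* Original function: behavior depends on the mode parameter. */
int blend(int x, int mode) {
    if (mode == 0)
        return x * 2;
    return x / 3;       /* unreachable once mode is known to be 0, */
}                       /* so it can be deleted as dead code       */

/* Clone the compiler could generate for call sites that pass
   mode == 0: smaller, branch-free, and a better inlining candidate. */
int blend_mode0(int x) {
    return x * 2;
}
```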
• Dead variable elimination: Dead variable elimination removes all global variables that are never used, as well as the code that updates them, thereby speeding execution.
• Dead function elimination: Dead functions (functions that are never called) can result from inlining and cloning techniques, or from continual program modification during the development process. Dead function elimination removes such functions, saving valuable space and reducing memory and cache consumption.
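Both eliminations appear in the small C fragment below (names are hypothetical). Note that only a whole-program view can prove an externally visible variable or function dead; a compiler that sees one file at a time must conservatively keep it:

```c
int used_counter  = 0;   /* live: read and written by bump below           */
int stale_counter = 0;   /* dead variable: written but never read, so the  */
                         /* variable and its updates can both be removed   */

int retired_helper(int x) {  /* dead function: whole-program analysis can  */
    return x + 1;            /* prove no call site remains, e.g. after     */
}                            /* every call to it was inlined               */

int bump(void) {
    stale_counter++;         /* this update dies with stale_counter        */
    return ++used_counter;
}
```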
• Common padding: Common padding improves array alignment in FORTRAN common blocks. Traditional compilers, seeing one file at a time, cannot safely change the layout of user variables in a common block; newer compilers can, because every subroutine that references the block is visible to them. Padding a common block improves array alignment, enabling arrays to be vectorized and accessed more efficiently and potentially reducing cache conflicts, improving performance. A similar technique can rearrange C and C++ structures.
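The C/C++ analogue is easy to see with `sizeof`. This is a sketch, and the exact sizes depend on the target ABI, but on typical 64-bit targets declaring the widest field first lets most of the padding collapse:

```c
#include <stddef.h>

/* Declaration order forces alignment padding around the double. */
struct loose {
    char   tag;   /* 1 byte, then padding so d is 8-byte aligned */
    double d;
    char   flag;  /* 1 byte, then tail padding                   */
};                /* typically 24 bytes on 64-bit targets        */

/* Same fields reordered widest-first: most padding disappears. */
struct tight {
    double d;
    char   tag;
    char   flag;
};                /* typically 16 bytes on the same targets      */
```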
• Common block splitting: Common block splitting divides a FORTRAN common block into smaller pieces, reducing data cache conflicts during program execution.
• Procedure reordering: Procedure reordering lays out functions based on their call relationships, potentially reducing instruction cache thrashing during program execution.
Figure 2: SPEC CPU2000 “base” benchmark improvements achieved using inter-procedural analysis
To demonstrate the profound effect that inter-procedural analysis can have on application performance, my colleagues and I ran the SPEC CPU2000 benchmark suite, compiled with the new PathScale EKO Compiler, on a 2.2 GHz Opteron-based system configured with 64 KB of on-chip primary cache, 1 MB of on-chip secondary cache, and 1 GB of main memory. The bar chart in Figure 2 illustrates the effect of inter-procedural analysis on the "base" tests of the CPU2000 suite, in which every benchmark is compiled with the same four optimization options, a test suited to users who are not interested in hunting for the peak flags for a given application. Inter-procedural analysis improves the performance of 17 of the benchmarks, by as much as 26.6 percent, with an average (geometric mean) improvement of six percent. Additional inter-procedural tuning flags can improve all of these programs further.
Developers are constantly seeking new ways to improve application performance. While architecture and algorithms are important factors, development tools can also have a significant impact. Traditional compilers perform a separate compilation on each program source file, forcing them to make worst-case assumptions that result in poor application performance. Developers need more sophisticated tools that can take advantage of all program information to better apply analysis and optimization techniques and create high-performance applications. These tools are being delivered to the market today by a new generation of compilers with the capabilities described above.
Fred Chow, Ph.D. is Director of Compiler Engineering at PathScale, Inc. He may be contacted at email@example.com