Will Any Answer Do?
Can supercomputers make simulation answers more accurate?
Computer modeling now supports nearly every area of scientific research and, in many cases, is the most viable (or only) way to obtain useful predictions, explore phenomena and test hypotheses. Given this pervasive role, the correctness of computer models is critical to scientific advancement, perhaps even more so than other key characteristics such as ease of use, speed or cost. In short, results matter. That is an interesting contention: that correctness trumps ease, speed and cost. Even as I write it, I'm unsure. But consider: however simple, quick and cheap a prediction is, it can never advance science if it is wrong or, just as bad, if it is unclear how close to the truth it is. Clearly, models that are hard to use, slow to run or expensive can affect whether it is possible to use a simulation; correctness affects whether it is even worth using.
Who checks?
Computer models may be developed by the research group using them, by close collaborators or by a distant third party, such as a commercial provider. Models developed in-house often use components from third-party providers, such as libraries for interprocess communication, data management, I/O or common mathematical functions. It is reasonable to expect the supplier of a model (or component) to make sure it is fit for purpose. However, whether you build your own models from third-party components or exclusively use commercially supplied models, it is entirely your responsibility to ensure that your use of the model or component stays within a certified part of parameter space. In other words, are you using it for a purpose for which it has been tested and shown to be fit?
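Even a very simple guard can make that responsibility explicit. The C sketch below is purely illustrative: the parameter name and the validated range are invented, and a real code would take its limits from the supplier's validation documentation rather than hard-coding them.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical guard: refuse to run when an input parameter lies
     * outside the range over which the model (or a third-party
     * component) has been validated. The limits here are invented. */
    static void require_in_certified_range(const char *name, double value,
                                           double lo, double hi)
    {
        if (value < lo || value > hi) {
            fprintf(stderr,
                    "parameter %s = %g is outside the validated range [%g, %g]\n",
                    name, value, lo, hi);
            exit(EXIT_FAILURE);
        }
    }

    int main(void)
    {
        double reynolds_number = 5.0e6;   /* example model input */

        /* Suppose the solver was only validated for 1e3 <= Re <= 1e6. */
        require_in_certified_range("Reynolds number", reynolds_number, 1e3, 1e6);

        printf("all inputs within certified ranges; running model\n");
        return 0;
    }

Failing loudly at start-up is crude, but it is far cheaper than discovering after a long run that the results were never trustworthy.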
Beyond that, it is simply good scientific practice to make sure your equipment (or model, in this case) is behaving as expected, i.e. that its predictions are credible and agree with known data points. You may even get lucky: many a scientific discovery has arisen from investigating the unexpected behavior of measuring equipment.
Finding fault
Developers of models (including researchers who develop their own software) have a responsibility to test the models before they are released for production use. Techniques could include comparing the predictions to known data points, physical measurements, historical data of known accuracy, or predictions from other models (perhaps using different principles or algorithms).
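As a minimal sketch of the first of those techniques, the C program below compares a stand-in model against a handful of invented reference values, within an agreed tolerance; model_predict, the data and the tolerance are placeholders rather than anything drawn from a real code.

    #include <math.h>
    #include <stdio.h>

    /* Toy stand-in for the model under test; a real model is far more
     * complex, but the checking logic is the same. */
    static double model_predict(double x)
    {
        return x * x;
    }

    int main(void)
    {
        /* Illustrative reference points, e.g. measurements or results
         * from an independent model, with an agreed tolerance. */
        const double inputs[]    = { 1.0, 2.0, 3.0 };
        const double reference[] = { 1.0, 4.0, 9.01 };
        const double tolerance   = 1e-2;   /* acceptable relative error */
        int failures = 0;

        for (int i = 0; i < 3; ++i) {
            double predicted = model_predict(inputs[i]);
            double rel_error = fabs(predicted - reference[i]) / fabs(reference[i]);
            if (rel_error > tolerance) {
                printf("FAIL at x = %g: predicted %g, reference %g\n",
                       inputs[i], predicted, reference[i]);
                ++failures;
            }
        }

        if (failures == 0)
            printf("all validation checks passed\n");
        else
            printf("%d validation check(s) failed\n", failures);
        return 0;
    }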
Of course, it is crucial not to make the dangerous assumption that physical testing is always better than computer predictions. Physical testing has its own sources of errors, assumptions, regions of validity, etcetera. Disagreement between computer and measurement does not mean the computer is wrong — explore the accuracy of both!
All computer simulation users must guard against the easy trap of assuming that predictions must be more accurate than they were before, simply because the model now uses a higher resolution, has a snazzy new physics algorithm, ran for longer, or processed more input data. All of these might produce more accurate results; indeed, they are the main routes to more accurate results, but there is no guarantee.
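A tiny, contrived illustration of the point: in the C sketch below, a forward-difference estimate of a derivative improves as the step size shrinks, until floating-point round-off takes over and further 'resolution' makes the answer worse. On typical double-precision hardware the error bottoms out around h = 1e-8 and then grows again.

    #include <math.h>
    #include <stdio.h>

    /* Forward-difference estimate of d/dx sin(x) at x = 1. Shrinking
     * the step h reduces truncation error but amplifies round-off, so
     * beyond a point "higher resolution" gives a worse answer. */
    int main(void)
    {
        const double x = 1.0;
        const double exact = cos(x);

        for (double h = 1e-1; h > 1e-16; h /= 10.0) {
            double approx = (sin(x + h) - sin(x)) / h;
            printf("h = %8.1e   error = %8.1e\n", h, fabs(approx - exact));
        }
        return 0;
    }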
HPC is here to help
High performance computing (HPC) can make better modeling and predictions possible through faster and more accurate production use of the models. The extra computational power of HPC can deliver innovation and real advantage to any research program or business. But, as with any method of research or design, the user has a responsibility to check the predictions. Thankfully, HPC itself can help, by devoting some of that extra computational power to validating the modeling.
In fact, one of the biggest roles of HPC in science and engineering is exploring the validity of the models. The models are pushed to extreme scales, data sets, boundary conditions, etcetera, to help establish confidence that the use of the model in less extreme regions of parameter space is safe.
Extreme scales, whether finer resolution, longer simulated time or more parallel processing, offer the potential for more accurate simulations, but they also open up new sources of error. Perhaps the larger number of iterations required for the longer time span pushes the algorithms beyond their stability limits? Maybe the greater number of data points introduces numbers both bigger (counts, ranges) and smaller (time-steps, discretization) than the code can handle numerically? Is the developer sure that every subroutine and library call has been designed and certified for this new regime?
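As one concrete example of the 'numbers too big and too small' problem, the deliberately simplified C sketch below accumulates simulated time in single precision; once the clock grows large enough, adding the small time-step no longer changes the stored value and time silently stops advancing.

    #include <stdio.h>

    /* A single-precision time accumulator: when the time-step falls
     * below the spacing of representable values near the current time,
     * the clock silently stalls. */
    int main(void)
    {
        float t = 0.0f;
        const float dt = 1.0e-4f;

        for (long step = 0; step < 100000000L; ++step) {
            float t_next = t + dt;
            if (t_next == t) {
                printf("time stopped advancing at t = %g after %ld steps\n",
                       t, step);
                return 0;
            }
            t = t_next;
        }
        printf("final time = %g\n", t);
        return 0;
    }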
HPC, i.e. parallel processing, offers more computing power, but it brings its own sources of correctness error, such as race conditions. Since parallel processing is, for the foreseeable future, the only realistic way forward, the verification issues associated with it are becoming pervasive. And where parallel processing becomes massively parallel, i.e. supercomputing, fault tolerance becomes critical. This must be accounted for within the model itself, not just assumed to be dealt with by the system. The faults include the obvious, such as node failures, but also the less obvious, such as data corruption in memory, incorrectly executed instructions, and intermittent and silent errors.
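Race conditions in particular are easy to create and hard to spot. The deliberately broken C/OpenMP sketch below (illustrative only; compile with a flag such as gcc -fopenmp) sums one million ones twice: the unsynchronized loop typically produces a different wrong answer on every run, while the version with a reduction clause is correct.

    #include <stdio.h>

    /* A classic race condition: many threads update a shared
     * accumulator with no synchronization. */
    int main(void)
    {
        const int n = 1000000;
        double racy = 0.0, safe = 0.0;

        #pragma omp parallel for                   /* BUG: unsynchronized updates */
        for (int i = 0; i < n; ++i)
            racy += 1.0;

        #pragma omp parallel for reduction(+:safe) /* correct version */
        for (int i = 0; i < n; ++i)
            safe += 1.0;

        printf("racy sum = %.0f   correct sum = %.0f\n", racy, safe);
        return 0;
    }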
New methods
The huge increase in computational power from HPC means that different methods may be better able to deliver the science. For example, the model could be re-architected to tolerate infrastructure faults and data corruption, or enhanced to perform ongoing accuracy checks during simulations (consuming some of that extra HPC grunt). Perhaps some speed can be traded for more accurate methods; this could be especially powerful when combined with optimization work to make the model more scalable, recovering the lost speed by efficiently using more processors.
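To sketch what an ongoing accuracy check might look like (a toy C example that assumes the model has a known conserved quantity; a real code would monitor whatever invariants, residuals or balances it has), the loop below integrates a simple oscillator and periodically verifies that the total energy has not drifted.

    #include <math.h>
    #include <stdio.h>

    /* Toy harmonic oscillator integrated with the semi-implicit Euler
     * method; every so often the total energy, which should be
     * conserved, is compared against its initial value. */
    int main(void)
    {
        double x = 1.0, v = 0.0;                  /* position, velocity */
        const double dt = 1e-3, k = 1.0, m = 1.0;
        const double e0 = 0.5 * m * v * v + 0.5 * k * x * x;

        for (long step = 1; step <= 1000000; ++step) {
            v -= (k / m) * x * dt;                /* update velocity, then position */
            x += v * dt;

            if (step % 200000 == 0) {             /* spend a little extra compute on checking */
                double e = 0.5 * m * v * v + 0.5 * k * x * x;
                double drift = fabs(e - e0) / e0;
                printf("step %7ld   relative energy drift = %.2e\n", step, drift);
                if (drift > 1e-3)
                    printf("  warning: accuracy check failed; results are suspect\n");
            }
        }
        return 0;
    }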
Such model enhancements could be made either by collaborating with a group that specializes in HPC software development, leaving the core model development team free to focus on the science aspects of the model, or by growing HPC skills within the (science) model development team. In the latter case, the HPC effort should be seen in the same light as mathematics or basic programming skills: valuable underpinning expertise without which the science might not be possible.
Andrew Jones is Vice President of HPC Consulting at Numerical Algorithms Group. He may be reached at [email protected].