“Given a data set, you can fit thousands of models at the push of a button, but how do you choose the best? With so many candidate models, overfitting is a real danger. Is the monkey who typed Hamlet really a good writer?” Thus, with two excellent observations and some humor, the back cover describes how the authors help researchers select, refine, and evaluate mathematical models that may describe their data.
Right up front, the authors tell us that some mathematical preparation beyond the minimal is necessary to fully utilize this book. This includes knowledge of the likelihood function, applied regression, and matrix calculations. The targeted audience is at the statistical master’s level with experience in regression modeling. This may not be as onerous as it sounds. Many scientists have frequent use for model construction, and biological and physical processes are being studied and modeled on ever more complex levels, so this tool would be a very useful addition to the analytic repertoire. Unfortunately, many will find the math challenging. Note that this is an applied text and the authors do not rely on heavy-duty theory and proofs, but there are many theorems and derivations to assist with understanding, and many interesting problems from a variety of disciplines.
The reader’s interest is piqued in the first chapter as several problems are introduced and briefly discussed. They come from such diverse areas as anthropology (Egyptian skull measurements), literature (plagiarism detection), medicine (cirrhosis survival data; low birth weight data), and sports (football; speed skating). At this point, the authors suggest the importance of standard tools such as Akaike’s information criterion (AIC), the Bayesian information criterion (BIC), and likelihood estimation, as well as newer tools such as the focused information criterion. These tools are covered in later chapters, and those interested mainly in the applied aspects may skip some of the material. The later chapters briefly show how the equations are used to build the model and how they feed into the final selection. Unfortunately for the non-statistician, the actual plug-and-chug (stick in all the numbers and see how it’s actually done) is omitted from the book. There is, however, a Web site with the data and programs in “R” to assist the novice. Each chapter is supplemented by exercises that lean a bit more heavily toward the mathematical (read: theoretical) aspects than the applied. Actual work with data is left to the Web site.
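For readers who want a feel for the plug-and-chug step, here is a minimal sketch of how AIC and BIC might be computed by hand for two competing Gaussian models. It is not taken from the book or its Web site. The data values, the function names, and the choice of models (mean fixed at zero versus mean estimated from the data) are all assumptions made for illustration. The sketch uses the common "smaller is better" convention, AIC = 2k - 2 log L and BIC = k ln(n) - 2 log L, where k is the number of estimated parameters; some texts use the opposite sign, in which case larger values are preferred.

```python
import math


def gaussian_loglik(data, mu, sigma2):
    """Log-likelihood of the data under a normal model N(mu, sigma2)."""
    n = len(data)
    rss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)


def aic(loglik, k):
    """Akaike's information criterion, 'smaller is better' convention."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion, 'smaller is better' convention."""
    return k * math.log(n) - 2 * loglik


def compare_models(data):
    """Fit two nested Gaussian models by maximum likelihood and score them."""
    n = len(data)

    # Model A (hypothetical): mean fixed at 0, variance estimated (k = 1).
    s2_a = sum(x ** 2 for x in data) / n
    ll_a = gaussian_loglik(data, 0.0, s2_a)

    # Model B (hypothetical): mean and variance both estimated (k = 2).
    mu_b = sum(data) / n
    s2_b = sum((x - mu_b) ** 2 for x in data) / n
    ll_b = gaussian_loglik(data, mu_b, s2_b)

    return {
        "fixed_mean": {"loglik": ll_a, "aic": aic(ll_a, 1), "bic": bic(ll_a, 1, n)},
        "free_mean": {"loglik": ll_b, "aic": aic(ll_b, 2), "bic": bic(ll_b, 2, n)},
    }


# Made-up measurements, for illustration only.
data = [2.1, 1.9, 2.4, 1.7, 2.2, 2.0, 1.8, 2.3]
results = compare_models(data)

for name, scores in results.items():
    print(name, {key: round(value, 2) for key, value in scores.items()})
```

Under this convention the model with the lower score is preferred. The extra parameter in the second model is charged a penalty of 2 by AIC and ln(n) by BIC, which is how both criteria guard against the overfitting danger mentioned on the back cover.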
I found a few of the derivations rather onerous but, for those sticking to the applications, there are excellent summaries of how the data will be analyzed, as well as references. The reference section is quite helpful and inclusive. The subject index is much too brief, with minimal cross-referencing (as is usual in these texts). Check out the Table of Contents at the Amazon.com site to get a more detailed flavor of the material.
By way of summary, this is a decent first reference on the subject, well-targeted to its mathematical audience but somewhat heavy for the research scientist. It is an excellent introduction, however, to a subject that is presently underutilized in the applied world and deserves much more exercise in practice.
Availability
Model Selection and Model Averaging, by Gerda Claeskens and Nils Lid Hjort (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge University Press, New York. 2008. $70.
John Wass is a statistician based in Chicago, IL. He may be reached at editor@ScientificComputing.com.