New strategies to clean, normalize and analyze data are covered with extraordinary clarity and minimal mathematics
The authors’ preface takes me back 10 years to the Joint Statistical Meetings (JSM). This once-a-year event gathers statisticians, data analysts and programmers to hear many fine presentations, take classes and partake in hands-on training. I had just begun to get very interested in genomics and was shocked to find that my JSM had turned into something akin to the Genetics Society of America meeting! It seemed as if every other presentation had to do with genomics and the analysis of microarray data. The authors comment on this explosion of interest.
The introduction of newer sequencing methodologies, DNA microarrays and high-throughput technology has resulted in a deluge of large data sets that require new strategies for cleaning, normalizing and analyzing them. All of these and more are covered in approximately 300 pages with extraordinary clarity and minimal mathematics. The chapter list is displayed below.
1. A Brief Introduction
2. Genomics Basics
3. Microarrays
4. Processing the Scanned Image
5. Preprocessing Microarray Data
6. Summarization
7. Two-Group Comparative Experiments
8. Model-Based Inference and Experimental Design Considerations
9. Analysis of Gene Sets
10. Pattern Discovery
11. Class Prediction
12. Protein Arrays
Of course, these are followed by the obligatory References and Index sections. The reference section is particularly rich in the more famous papers and names from the early history of microarrays.
Most chapters close with a software notes section. The authors, fortunately, recommend some good commercial statistical packages such as SAS, JMP (but curiously not JMP Genomics and other suites specifically designed for genomic studies), DataDesk, Matlab and MINITAB. I’m not familiar with DataDesk, and believe that Matlab can be more profitably used by engineers than by the biologists at whom the book is aimed. And, of course, for the academicians who may be into pain and suffering, there is always program-writing software such as R and S-PLUS, as well as packages specifically tailored for genomics, in this case Bioconductor and the authors’ own DNAMER. The one big (and in your editor’s humble opinion ONLY) advantage of these programming options is the price. They’re free!
One point on which your editor has strong feelings is the use of multiple programming languages. Academics and other users badly need to standardize on one package that everyone can use, so that results can be compared among groups. R seems to be the most frequently used and is a good choice, having a learning curve that is less steep than most and many pre-written statistical routines.
OK, off the soapbox and onto the book. One of the more refreshing sections, early on, describes the 11 data sets used throughout the book to illustrate analytic techniques, with background information that gives readers the domain knowledge necessary to interpret the analytic results adequately. When we do get into the technical areas, most are examples of clarity and brevity in scientific writing. In this, as in most areas of science, it would be ridiculously easy to write at the level of doctoral candidates, but the authors make the mathematics accessible to the biologists and the biology accessible to the computer types.
While we are on the topic of mathematical genomics, the level of the math used is mostly high school algebra with minimal calculus and, as the data come from fixed slides, there is no need for differential equations at this introductory level. As an added bonus, there is none of the dreaded “The proof is left to the student” or the even more dreaded “It can be shown that…” (Sure it can, after 3 pages of calculations.)
The output graphics are the simple types that come out of R, and the mathematical terms are all defined and described, usually at first appearance. The use of flowcharts to describe processes such as normalization of data is an added plus, as is the extremely logical flow of chapter materials. The reader is led from the introduction through the transformations and normalizations that precede data analysis, and the authors are always careful to justify these mathematical acrobatics with the necessary domain knowledge. The importance of both biological and technical replicates is stressed in one entire subsection, and this is of primary concern, as estimating variation without true replicates is a prime statistical felony, one that the chip developers circumvent in a variety of creative ways.
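For readers who want a feel for what such preprocessing involves, here is a minimal sketch in Python. It is not taken from the book: the function names and the toy intensities are hypothetical, and median centering is used only as one simple stand-in for the normalization methods the authors describe. It log-transforms raw intensities, centers each array at its median, and then estimates per-gene variation across replicate arrays.

```python
import math
import statistics

def log2_transform(array):
    """Log2-transform raw intensities (assumed strictly positive)."""
    return [math.log2(x) for x in array]

def median_center(array):
    """Subtract the array's own median so every array is centered at zero."""
    med = statistics.median(array)
    return [x - med for x in array]

def per_gene_sd(arrays):
    """Sample standard deviation of each gene across replicate arrays."""
    return [statistics.stdev(values) for values in zip(*arrays)]

# Hypothetical data: 3 replicate arrays x 4 genes.
# The second array was scanned uniformly twice as bright as the others.
raw = [
    [100.0, 200.0, 400.0, 800.0],
    [200.0, 400.0, 800.0, 1600.0],
    [100.0, 200.0, 400.0, 800.0],
]

logged = [log2_transform(a) for a in raw]
normalized = [median_center(a) for a in logged]

raw_sds = per_gene_sd(logged)       # inflated by the brightness difference
norm_sds = per_gene_sd(normalized)  # brightness artifact removed

print("SD before normalization:", [round(s, 3) for s in raw_sds])
print("SD after normalization: ", [round(s, 3) for s in norm_sds])
```

In this toy case the apparent gene-to-gene variation before normalization is entirely a scanner artifact, and it disappears once each array is centered. Note also that `statistics.stdev` raises an error when given a single value, which is the programmatic version of the point about replicates: with one array per condition there is nothing from which to estimate variation.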
In summary, this is an excellent text for both life scientists and computer scientists/mathematicians. Highly recommended.
Availability
• Exploration and Analysis of DNA Microarray and Other High-Dimensional Data. D. Amaratunga, J. Cabrera, and Ziv Shkedy. John Wiley & Sons. Hoboken, NJ pp. 317 + xiii. (2014). $120
John Wass is a statistician based in Chicago, IL. He may be reached at [email protected].