While having many pluses, this book is not for every budding data analyst
I can most simply describe this book by quoting from the back cover: Motivation — “…how can you get started in a wide-ranging, interdisciplinary field that’s so clouded in hype?” Background Needed — “If you’re familiar with linear algebra, probability, and statistics, and have programming experience…”
However, all this just describes a limited set of skills needed to get started. It says nothing about the authors’ attack on the problem (i.e., learning data science, or at least getting started) or the level of knowledge needed to derive maximal benefit from this book. In that respect, it is a most maddening exercise in bouncing from the overly simple to the overly complex. And it does this rapidly and through most of the technical portions of the book. Let’s take a closer look…
To begin, in reading the first 54 pages, the reader gets the idea that there is a lot of intense introspection on data science and scientists, and that the principal author (with a Ph.D. in statistics) would rather be in the psychology or philosophy department. After that, we get a gradual introduction to linear regression and data distributions in the context of algorithms. This section is nicely done, as it is well-thought-out and of solid pedagogical value. The authors go slowly and explain all of the terminology in concise and straightforward terms. The same goes for the next gradual step, multiple linear regression. A big plus here is the review of assumptions that must be met if the method is actually to be used, a subject of paramount importance that is usually glossed over or buried in many textbooks.
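To make the point about assumptions concrete, here is a minimal sketch of checking one of them, linearity, by looking at residuals. It is my own illustration and not taken from the book (which uses R); the data are made up, and the helper names `fit_line` and `residuals` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Observed value minus fitted value for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up data that is deliberately curved (y = x squared), so a
# straight line is the wrong model even though a fit can be computed.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
res = residuals(xs, ys, slope, intercept)

# If the linearity assumption held, residuals would scatter around
# zero with no pattern. Here they are positive at both ends and
# negative in the middle, a U shape that signals a curved relationship.
for x, r in zip(xs, res):
    print(f"x={x}  residual={r:+.2f}")
```

The fit itself runs without complaint, which is exactly why the assumptions deserve a check of their own. In practice one would plot the residuals rather than print them.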
K-Nearest Neighbor algorithms get a similar fine treatment with note taken of scaling problems and explanations of each of the more commonly used distance metrics. However, we soon get hot and heavy into mathematical symbols and expressions, as well as R code that can quickly bury the uninitiated. This is typified by the inclusion of texts by Hastie and Tibshirani as well as Casella and Berger in the suggested reading. These are anything but introductory texts and will not be profitably read by many scientists, business types and technicians, some of whom are actually doing data analyses with large data sets or trying to understand reports from others.
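The scaling problem mentioned for K-Nearest Neighbor is easy to show in a few lines. The following is a rough sketch of my own, not the book's code: the four data points are invented, and the function names are hypothetical. It uses two common distance metrics, Euclidean and Manhattan, and a simple min-max rescaling.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def min_max_scale(rows):
    """Rescale every column to the range 0 to 1."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


def nearest(query_index, rows, distance):
    """Index of the row closest to rows[query_index], excluding itself."""
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: distance(query, rows[i]))


# Invented records: (age in years, income in dollars).
people = [
    (25, 50_000),  # 0: the query point
    (26, 58_000),  # 1: nearly the same age, somewhat higher income
    (60, 50_500),  # 2: much older, nearly the same income
    (45, 90_000),  # 3
]

# On raw values, income dominates the distance simply because its
# numbers are larger, so the 60-year-old looks like the nearest neighbor.
raw_neighbor = nearest(0, people, euclidean)

# After rescaling, age and income contribute on comparable terms.
scaled_neighbor = nearest(0, min_max_scale(people), euclidean)

print("nearest on raw data:", raw_neighbor)
print("nearest on scaled data:", scaled_neighbor)
```

Swapping `euclidean` for `manhattan` gives the same two answers on this toy data; the choice of metric matters less here than whether the variables were put on a common scale first.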
To be fair, the authors do state in the preface: “Don’t expect a machine learning textbook. Instead, expect full immersion into the multifaceted aspects of data science from multiple points of view. This is a survey of the existing landscape…” True to that warning, the book is not an extensive how-to manual.
For those new to R, a lot more introduction is needed than merely snippets of code. I find that many authors, and even commercial vendors touting the marvels of their software, leave out the very first step of actually getting the data into the program. Usually, they have the data set cleaned, prepped and pre-loaded. In real work, loading the data can require extra steps, particularly where separate databases must first be matched on their data and labels.
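As an illustration of that neglected first step, here is a small sketch of loading two sources and matching them on a label. It is written in Python rather than R, uses only the standard library, and everything in it (the two inline tables, the column names, the `normalize_key` and `load_rows` helpers) is invented for the example.

```python
import csv
import io

# Two invented exports that describe the same samples but disagree on
# the key column's name, letter case and stray whitespace.
measurements_csv = """Sample ID,Reading
 A-01 ,4.2
a-02,5.1
A-03,3.9
"""

labels_csv = """sample_id,group
A-01,control
A-02,treated
A-04,treated
"""


def normalize_key(value):
    """Trim whitespace and unify case so the keys can be compared."""
    return value.strip().upper()


def load_rows(text, key_column):
    """Read CSV text into a dict keyed by the normalized label."""
    reader = csv.DictReader(io.StringIO(text))
    return {normalize_key(row[key_column]): row for row in reader}


measurements = load_rows(measurements_csv, "Sample ID")
labels = load_rows(labels_csv, "sample_id")

# Keep only samples present in both sources.
merged = [
    {
        "sample_id": key,
        "reading": float(measurements[key]["Reading"]),
        "group": labels[key]["group"],
    }
    for key in sorted(measurements.keys() & labels.keys())
]

# Samples found in one source but not the other need a human decision.
unmatched = sorted(measurements.keys() ^ labels.keys())

print(merged)
print("unmatched:", unmatched)
```

Reporting the unmatched keys, rather than silently dropping them, is the part most often skipped; it is usually where the cleaning work actually starts.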
There are other small technical glitches, such as Figure 4.1, where the text is far too small and light to read. Also, much in that chapter sounds as if it were addressed primarily to the IT department, so the back-cover comments quoted above ring true: you already need (what I consider to be advanced) knowledge in statistics and computer programming, as well as some domain knowledge in the area of work.
To summarize, this interesting book does have many useful hints, tips and tricks for addressing specific types of problems, as well as their pitfalls. The hammer and nail story with linear regression is classic! Explanations of algorithms are excellent, and there are also interesting asides on people and the history of algorithms, statistics, etcetera. It also was very nice to see all known versions of key words describing variables and analytic features, which are often quite confusing to the novice. I would have appreciated far more scientific examples than the business ones that were in abundance. However, author/contributor backgrounds must be considered.
Interested readers are strongly urged to go to the book’s site at Amazon.com and read sections of the scanned-in pages. While having many pluses, this book is not for every budding data analyst.
Availability
Doing Data Science: Straight Talk from the Front Line, by Rachel Schutt and Cathy O’Neil. O’Reilly Media, Inc., Sebastopol, CA. 406 pp. (2014). $39.99. ISBN: 1449358659
John Wass is a statistician based in Chicago, IL. He may be reached at editor@ScientificComputing.com.