Research & Development World

Doing Data Science: Straight Talk from the Front Line

By R&D Editors | June 5, 2014

I can most simply describe this book by quoting from the back cover: Motivation — “…how can you get started in a wide-ranging, interdisciplinary field that’s so clouded in hype?” Background Needed — “If you’re familiar with linear algebra, probability, and statistics, and have programming experience…”

However, all this just describes a limited set of skills needed to get started. It says nothing about the authors’ attack on the problem (i.e., learning data science, or at least getting started) or the level of knowledge needed to derive maximal benefit from this book. In that respect, the book is a maddening exercise in bouncing from the overly simple to the overly complex, and it does so rapidly and throughout most of the technical portions. Let’s take a closer look…

To begin, the first 54 pages give the reader the idea that there is a lot of intense introspection on data science and scientists, and that the principal author (with a Ph.D. in statistics) would rather be in the psychology or philosophy department. After that, we get a gradual introduction to linear regression and data distributions in the context of algorithms. This section is nicely done: it is well thought out and of solid pedagogical value. The authors go slowly and explain all of the terminology in concise and straightforward terms. The same goes for the next gradual step, multiple linear regression. A big plus here is the review of assumptions that must be met if the test is actually to be used, a subject of paramount importance that is usually glossed over or buried in many textbooks.
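To illustrate the kind of assumption check praised above, here is a minimal Python sketch. It is not taken from the book (which uses R); the data are made up, and the helper names `fit_line`, `residuals` and `spread_ratio` are hypothetical. It fits a straight line by ordinary least squares and then makes a deliberately crude check of the constant-variance assumption by comparing residual spread at low and high values of x.

```python
from statistics import mean, pvariance

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(xs, ys, intercept, slope):
    """Observed minus fitted values; these are what the assumptions are about."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def spread_ratio(xs, res):
    """Residual variance in the upper half of x divided by that in the lower half.

    A ratio far from 1 hints that the constant-variance assumption is doubtful.
    This is a rough screening device, not a formal test.
    """
    ordered = [r for _, r in sorted(zip(xs, res))]
    half = len(ordered) // 2
    return pvariance(ordered[half:]) / pvariance(ordered[:half])

# Made-up illustrative data, not from the book.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

intercept, slope = fit_line(xs, ys)
res = residuals(xs, ys, intercept, slope)
print("intercept:", round(intercept, 3), "slope:", round(slope, 3))
print("residual spread ratio (upper/lower):", round(spread_ratio(xs, res), 3))
```

In practice one would also plot the residuals against the fitted values and examine their distribution; the point of the sketch is only that the fit and the check of its assumptions are separate steps.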

K-Nearest Neighbor algorithms get a similarly fine treatment, with note taken of scaling problems and explanations of each of the more commonly used distance metrics. However, we soon get hot and heavy into mathematical symbols and expressions, as well as R code that can quickly bury the uninitiated. This is typified by the inclusion of texts by Hastie and Tibshirani as well as Casella and Berger in the suggested reading. These are anything but introductory texts and will not be profitably read by many scientists, business types and technicians, some of whom are actually doing data analyses with large data sets or trying to understand reports from others.
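The scaling problem mentioned above can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the book's R code: the training points (income in dollars, age in years), the labels, and the helper names `make_min_max_scaler` and `knn_predict` are all invented for this example. It shows two common distance metrics and how min-max scaling can change which neighbor counts as nearest.

```python
from collections import Counter
from math import sqrt

def euclidean(a, b):
    """Straight-line distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def make_min_max_scaler(points):
    """Return a function that rescales each feature to the 0-1 range of `points`."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    # Guard against a constant feature, which would give a zero range.
    spans = [(hi - lo) or 1.0 for lo, hi in zip(lows, highs)]
    return lambda p: tuple((v - lo) / s for v, lo, s in zip(p, lows, spans))

def knn_predict(train, query, k=1, distance=euclidean):
    """Majority label among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda row: distance(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical training data: (income in dollars, age in years) -> label.
train = [
    ((30000, 25), "young"),
    ((31000, 60), "older"),
    ((90000, 30), "young"),
    ((88000, 62), "older"),
]
query = (30100, 59)

# Unscaled: income differences in dollars swamp age differences in years.
unscaled = knn_predict(train, query, k=1)

# Scaled: both features contribute on a comparable footing.
scale = make_min_max_scaler([features for features, _ in train])
scaled_train = [(scale(features), label) for features, label in train]
scaled = knn_predict(scaled_train, scale(query), k=1)

print("unscaled prediction:", unscaled)
print("scaled prediction:  ", scaled)
```

With these made-up numbers the unscaled prediction follows the income column alone, while the scaled prediction follows age. Passing `distance=manhattan` swaps the metric without changing anything else.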

The authors do state in the preface: “Don’t expect a machine learning textbook. Instead, expect full immersion into the multifaceted aspects of data science from multiple points of view. This is a survey of the existing landscape…” It is not, however, an extensive how-to manual.

For those new to R, a lot more introduction is needed than mere snippets of code. I find that many authors, and even commercial vendors touting the marvels of their software, leave out the very first step of actually getting the data into the program. Usually, they have the data set cleaned, prepped and pre-loaded into the program. Getting real data in can require extra steps, particularly where databases need to be matched as to data and labels.
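As a rough idea of what that first step can involve, the following Python sketch reads two small tables and matches them on a shared identifier. Everything in it is an assumption made for illustration: the inline CSV text stands in for real files, and the column names, sample IDs and the helpers `normalize_key`, `load_table` and `match_tables` are invented, not drawn from the book.

```python
import csv
import io

# Stand-ins for two files whose ID columns are named and formatted differently.
MEASUREMENTS_CSV = """sample_id,concentration
S-001,4.2
s-002 ,5.1
S-004,3.3
"""

LABELS_CSV = """Sample ID,group
S-001,control
S-002,treated
S-003,treated
"""

def normalize_key(raw):
    """Trim whitespace and unify case so 's-002 ' and 'S-002' match."""
    return raw.strip().upper()

def load_table(text, key_column):
    """Read CSV text into a dict of rows keyed by the normalized ID."""
    reader = csv.DictReader(io.StringIO(text))
    return {normalize_key(row[key_column]): row for row in reader}

def match_tables(left, right):
    """Merge rows that share a key and report the keys found on only one side."""
    matched = {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}
    only_left = sorted(left.keys() - right.keys())
    only_right = sorted(right.keys() - left.keys())
    return matched, only_left, only_right

measurements = load_table(MEASUREMENTS_CSV, "sample_id")
labels = load_table(LABELS_CSV, "Sample ID")
matched, only_measured, only_labeled = match_tables(measurements, labels)

print("matched IDs:", sorted(matched))
print("measured but unlabeled:", only_measured)
print("labeled but unmeasured:", only_labeled)
```

Reporting the unmatched IDs, rather than silently dropping them, is the design choice worth noting: mismatched labels are exactly the kind of problem that pre-loaded demonstration data sets hide.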

There are other small technical glitches, such as Figure 4.1, where the text is far too small and light to read. Also, much in that chapter sounds like it was addressed primarily to the IT department, so the back-cover comments quoted above ring true: you already need (what I consider to be advanced) knowledge in statistics and computer programming, as well as some domain knowledge in the area of work.

To summarize, this interesting book has many useful hints, tips and tricks for addressing specific types of problems, and it points out pitfalls as well. The hammer and nail story with linear regression is classic! Explanations of algorithms are excellent, and there are also interesting asides on people and the history of algorithms, statistics, et cetera. It was also very nice to see all known versions of the key words describing variables and analytic features, which are often quite confusing to the novice. I would have appreciated far more scientific examples than the business ones that were in abundance. However, author/contributor backgrounds must be considered.

Interested readers are strongly urged to go to the book’s site at Amazon.com and read sections of the scanned-in pages. While having many pluses, this book is not for every budding data analyst.

Availability
Doing Data Science: Straight Talk from the Front Line, by Rachel Schutt and Cathy O’Neil. O’Reilly Media, Inc., Sebastopol, CA. 406 pp. (2014). $39.99. ISBN: 1449358659

John Wass is a statistician based in Chicago, IL. He may be reached at [email protected].

Copyright © 2026 WTWH Media LLC. All Rights Reserved. The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of WTWH Media