It’s easy to get overwhelmed by the marketing hype these days around “big data.” But what does it actually mean in practice? What’s the difference between big data, data science and the kinds of analytics that organizations have been using for years? How do big data systems and modern data science work together to answer new kinds of questions? Scientific Computing spoke with Roy Wilds, Ph.D., of PHEMI, a Vancouver-based big data company, to find out.
Dr. Wilds is the Chief Data Scientist at PHEMI Systems, a big data warehouse company, where he leads the PHEMI data science team in helping customers extract the most insight from their data. He has deep expertise in data mining, machine learning and analyzing very large datasets, a strong background in Python, R and SQL, and substantial experience with Hadoop’s distributed technologies. Roy earned an Honors Bachelor of Science in Physics from Simon Fraser University, and his Master of Science and Ph.D. from the Department of Mathematics and Statistics at McGill University.
What’s the biggest misconception about big data?
I think when people talk about the data revolution, they’re usually focusing on big data systems. But that’s really only half the story. “Big data” implies this new generation of distributed computing frameworks, like Apache Spark, that can collect huge amounts of information from all manner of sources and aggregate it in one place. That’s a big deal. But the other half of the story is about what you do with that data—and that’s where data science comes in.
When we go beyond classical analytics and get into modern data science, we can start asking different kinds of questions. We can look at other types of analysis, such as natural language processing and dimensionality reduction, and create more sophisticated proactive and predictive data models.
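As a rough illustration of the dimensionality reduction Dr. Wilds mentions, here is a minimal sketch of principal component analysis (PCA) via a singular value decomposition. The data is entirely synthetic and the variable names are illustrative, not from any PHEMI system:

```python
import numpy as np

# Synthetic "patient" data: 100 records, 10 correlated measurements
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))     # 2 hidden factors drive everything
mixing = rng.normal(size=(2, 10))      # each measurement mixes the factors
data = latent @ mixing + 0.1 * rng.normal(size=(100, 10))

# PCA via SVD: project the 10-D data onto its top 2 principal components
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ Vt[:2].T          # shape (100, 2)

# Fraction of total variance captured by the top 2 components
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(reduced.shape, round(float(explained), 3))
```

Because the synthetic data really is driven by two hidden factors, the two components capture nearly all of the variance; that is the sense in which dimensionality reduction lets you ask simpler questions of high-dimensional data.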
Most organizations, especially in healthcare and life sciences, are already collecting vast amounts of information. If we want to tap its full potential to provide prescriptive insights, we need to be thinking about both sides of that story.
Can you give some examples?
I can offer several. Take the city of Houston. They have massive amounts of health data from the Texas Medical Center that they’re now correlating with decades of data from public opinion surveys, standardized testing from the public school systems and other sources. All of these sources used to be completely separate, but by aggregating them, they’re discovering some amazing connections, such as a much stronger link between asthma and student performance than anyone had realized.
I’ll give you another, also in Houston. They have decades’ worth of data about infrastructure damage from hurricanes and storm surges. They’re linking that information with years’ worth of data on socioeconomic trends in the region to predict which populations will be most impacted by flooding. Both of these data science projects are now becoming the basis for policy.
We also work with a lot of health systems and academic research institutions that are doing precision medicine. They’re ingesting huge amounts of data on gene mutations, patient demographics and clinical outcomes for different types of disease treatments, usually cancer. They’re using patients’ individual genetic profiles to predict how well they’ll respond to certain chemotherapies, so they can steer some patients towards one drug and other patients to another.
In all of these cases, you can see some common threads. First, you have massive amounts of unstructured data that can’t easily be put into a traditional relational database, but that can be accommodated by a distributed big data framework. When you do that, you’re suddenly able to ask much more open-ended questions than are possible with conventional relational data models. That’s the leap you take when you move from basic analytics to big data and modern data science.
Are those examples typical? How effectively are most organizations using their data right now?
A lot of organizations are still very much in the infancy of the data revolution. By that I mean they understand that they should be learning more from their data, but they’re relying on basic analytics to do it. They’re using traditional relational databases, or even just Excel spreadsheets, to try to see what’s happening with their customers or patients or citizens.
This approach is fine if you’re trying to understand certain statistics about a relatively limited set of data, or if you’re looking for specific answers to well-defined questions.

But the reality is, the information you need to gain predictive insights is usually much messier. Think about a healthcare organization trying to develop more targeted therapies. Or even more difficult, imagine you want to find out if there are early indicators for disease in sources that haven’t really been mined before. You’d want to analyze a whole range of complex and unstructured data sources: genomes and imaging studies, legacy medical record systems, text from physician notes. You can’t capture all of that in a spreadsheet. And even if you could, you can’t do anything with it unless you’re able to wrangle that data into a form where you can run meaningful data science on it. That’s the power of big data and distributed programming frameworks—the ability to bring all of that together in a way that it actually becomes useful.
Can you point to a real-world example that really speaks to the jump from conventional analytics to more advanced data science?
One great one is natural language processing. Let’s say you have tens of thousands of hours of transcripts of recorded physician notes that are part of a patient’s clinical record, and you want to mine that data and compare it with patient outcomes to see if you can identify patterns that hadn’t been recognized before.
If you were trying to do that in the world of traditional relational databases, you’d run into a few problems right off the bat. First, all of this information usually lives in highly complex, nested documents, so you’d have to spend considerable time developing an initial data model just to store it in a database. One of the major benefits of distributed big data programming frameworks is that they let you skip much of that process. Instead of tediously designing a schema and then extracting, transforming and loading (“ETL’ing”) these files, you have a system that can collect that whole mass of text and use Spark to extract analytics-ready structures.
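The idea of turning nested documents into analytics-ready rows can be sketched in plain Python. This hypothetical record and its field names are invented for illustration; in practice a framework like Spark would do this at scale across millions of documents:

```python
# A hypothetical nested clinical document, as it might arrive from a source system
record = {
    "patient_id": "P-001",
    "encounters": [
        {"date": "2016-03-01", "notes": [{"author": "Dr. A", "text": "..."},
                                         {"author": "Dr. B", "text": "..."}]},
        {"date": "2016-04-15", "notes": [{"author": "Dr. A", "text": "..."}]},
    ],
}

def flatten(doc):
    """Yield one flat, analytics-ready row per note (patient, date, author, text)."""
    for encounter in doc["encounters"]:
        for note in encounter["notes"]:
            yield {"patient_id": doc["patient_id"],
                   "date": encounter["date"],
                   "author": note["author"],
                   "text": note["text"]}

rows = list(flatten(record))
print(len(rows))   # one nested document becomes 3 flat rows
```

Once the data is in this flat shape, it can be queried and joined like any table, which is what makes it usable for downstream analysis.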
But then you also have to deal with the reality that human language is inherently messy. There may be a dozen different terms and acronyms and shorthand that different clinicians use to refer to the same thing. There may be acronyms that mean one thing in a certain context and something entirely different in another.
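One common way to tame that messiness is a normalization map from raw terms to canonical concepts, with context used to disambiguate acronyms. The mappings below are a hand-built toy example, not a real clinical vocabulary:

```python
# Hypothetical synonym map: many clinical shorthands for one canonical concept
CANONICAL = {
    "mi": "myocardial infarction",
    "ami": "myocardial infarction",
    "heart attack": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}

# Some acronyms are ambiguous: what "MS" means depends on the clinical context
CONTEXT_SENSITIVE = {
    "ms": {"neurology": "multiple sclerosis", "cardiology": "mitral stenosis"},
}

def normalize(term, context=None):
    """Map a raw term to one canonical concept, using context for ambiguous acronyms."""
    key = term.lower().strip()
    if key in CONTEXT_SENSITIVE and context in CONTEXT_SENSITIVE[key]:
        return CONTEXT_SENSITIVE[key][context]
    return CANONICAL.get(key, key)

print(normalize("MI"))                       # myocardial infarction
print(normalize("MS", context="neurology"))  # multiple sclerosis
```

Real systems replace these hand-built dictionaries with curated terminologies and statistical models, but the core problem, many surface forms for one concept, is exactly the one this sketch shows.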
A group of researchers at Stanford is actually working through this problem right now. They’re analyzing millions of physician notes, trying to determine whether patients who were prescribed a common heartburn medication ended up being more likely to have a heart attack or other cardiac complications. To do it, they first have to negotiate all of the messiness associated with natural language to isolate actual cardiac complications. Then they have to verify that each complication occurred after the patient was prescribed the medication. These researchers have managed to navigate these issues, and several others, and uncovered very strong evidence that these medications really do increase the risk of cardiac problems.
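The temporal check in that study, keeping only patients whose cardiac event came after the prescription, reduces to a simple date comparison once the events are extracted. The patient records below are fabricated for illustration:

```python
from datetime import date

# Hypothetical extracted events: patient -> (prescription_date, cardiac_event_date or None)
patients = {
    "P-001": (date(2014, 5, 1), date(2015, 2, 10)),  # event after prescription
    "P-002": (date(2014, 6, 1), date(2013, 9, 3)),   # event predates prescription
    "P-003": (date(2014, 7, 1), None),               # no cardiac event at all
}

# Keep only patients whose cardiac event followed the prescription
post_rx_events = {pid for pid, (rx, event) in patients.items()
                  if event is not None and event > rx}
print(sorted(post_rx_events))   # ['P-001']
```

The hard part, as the interview notes, is not this comparison but getting reliable dates and events out of messy free text in the first place.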
That’s the power of combining big data systems with modern data science. You’re going from, at best, basic descriptive analytics, to predictive data models that show you things you couldn’t see before and drive real, meaningful change.