By combining data from a variety of nontraditional sources, a team led by researchers at Harvard Medical School and Boston Children’s Hospital has developed predictive models of flu-like activity that provide robust real-time estimates (“nowcasts”) of flu activity and accurate forecasts of flu-like illness levels up to three weeks into the future.
The team’s findings, published October 29, 2015, in PLoS Computational Biology, show that their approach, called ensemble modeling, results in predictions that are more robust than those generated from any one data source alone and that rival in real time the accuracy of the CDC’s retrospective flu reporting.
“We’ve focused for many years on using individual data sources for tracking a range of diseases,” said study senior author John Brownstein, HMS associate professor of pediatrics at Boston Children’s and co-founder of the disease tracking site HealthMap. “This represents the next logical step: combining data in a new way where the whole is more valuable than the sum of its parts.”
“Weather forecasting is an established discipline and has become ingrained in society,” he added. “We think the time is ripe for the same to happen with disease forecasting.”
While the CDC closely monitors seasonal flu-like illness activity across the U.S., the data reports it generates and distributes to clinicians and public health authorities are historically one to two weeks out of date. Because accurate predictions could help guide hospitals and health systems in allocating resources for flu care, many groups have attempted to create models that could provide accurate real-time snapshots of current flu activity, as well as predictions of forthcoming activity. The most famous of these attempts was probably Google Flu Trends, launched in 2008 but decommissioned in 2015.
“There are many data sources and models that can be used to predict flu-like symptoms in the population,” said study lead author Mauricio Santillana of the Boston Children’s Computational Health Informatics Program and the Harvard John A. Paulson School of Engineering and Applied Sciences. “But our question was, if we have many models each predicting flu activity, do we gain anything by combining them?”
Santillana and Brownstein’s team started with four separate nowcasting models of flu-like illness activity. Each was fed aggregated, anonymized, national-level data from one of four sources:
- Google searches
- Twitter posts
- electronic health record (EHR) data from EHR manager athenahealth
- Flu Near You, a participatory surveillance system developed by HealthMap
In an approach similar to that used by weather forecasters to predict hurricane tracks, the team then used machine-learning techniques to generate a set of ensemble models that incorporated the results produced by the four single-source models.
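To make the idea of combining single-source models concrete, here is a minimal sketch of one simple ensemble scheme: weighting each model by the inverse of its past error. This is an illustration only, not the machine-learning methods the team used. The source names, the function names and every number below are made up for the example.

```python
def mean_squared_error(predictions, truth):
    """Average squared gap between a model's past nowcasts and the reference values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, truth)) / len(truth)


def inverse_error_weights(history, truth):
    """Give each model a weight proportional to 1 / (its past mean squared error)."""
    inverse = {
        name: 1.0 / (mean_squared_error(preds, truth) + 1e-12)  # guard against zero error
        for name, preds in history.items()
    }
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def ensemble_nowcast(weights, current):
    """Combine this week's single-source nowcasts into one weighted estimate."""
    return sum(weights[name] * current[name] for name in weights)


# Hypothetical past reference values (percent of visits for flu-like illness).
truth = [1.0, 1.5, 2.2, 3.0]

# Hypothetical past nowcasts from four single-source models.
history = {
    "search": [1.1, 1.4, 2.4, 3.1],
    "social": [1.3, 1.2, 2.5, 3.3],
    "ehr": [1.0, 1.6, 2.1, 2.9],
    "participatory": [1.4, 1.1, 2.8, 3.5],
}

weights = inverse_error_weights(history, truth)

# Hypothetical nowcasts for the current week, one per source.
current = {"search": 3.6, "social": 3.9, "ehr": 3.4, "participatory": 4.1}
estimate = ensemble_nowcast(weights, current)

print(weights)
print(estimate)
```

In this toy setup, the source that tracked the reference values most closely in the past gets the largest say in the combined estimate, which always lands between the lowest and highest single-source nowcasts.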
To determine their ensemble models’ accuracy and robustness, Santillana and Brownstein’s team compared their results to those of each of the four real-time source models, as well as to both CDC’s historical flu-like illness reports and Google Flu Trends-based nowcasts from the 2013-14 and 2014-15 flu seasons.
The ensemble models not only outperformed their four real-time source models, but when compared to CDC’s historical flu-like illness reports, they generated better forecasts of both the timing and the magnitude of flu-like illness activity at each time horizon measured (“this week,” “next week,” “in two weeks”) than models that rely on historical information only.
The ensemble predictions also accurately tracked CDC’s reports of actual flu activity, with near-perfect correlation (0.99 Pearson correlation) for real-time estimates and slightly lower correlation (0.90 Pearson correlation) at the two-week time horizon.
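For readers unfamiliar with the Pearson correlation, the sketch below shows how it can be computed for two weekly series. A value of 1 means the two series rise and fall in perfect lockstep. The function name and both series are invented for illustration and are not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical weekly values: reported activity vs. a model's estimates.
reported = [1.0, 1.5, 2.2, 3.0, 2.6, 1.8]
predicted = [1.1, 1.4, 2.3, 2.9, 2.7, 1.7]

r = pearson(reported, predicted)
print(r)
```

Note that correlation measures how closely two series move together, not whether their levels match, so it is usually read alongside an error measure.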
Thus, Santillana points out, the answer to his question is yes. “If we combine multiple data sources, we get a stronger, more robust, more accurate prediction of flu activity.”
One of the keys to the model’s success, he added, is the inclusion of social media and EHR data.
“People sometimes wonder if the information that we are getting from social media or EHRs is really valuable, and we could get away with building models based on historical data,” he said. “But we found that the data sources we had access to provided us with information that was better than just looking at historical patterns.”
The research team hopes to increase the models’ geographic resolution (right now, they predict flu activity only on a national scale), to extend the models’ capabilities to other diseases for which multiple data sources are available (e.g., dengue), and to track disease activity in other nations. They also hope to produce a publicly available flu prediction tool based on their models.
“What have people in informatics, medicine and public health dreamed of for years? The ability to leverage all manner of data — historic, social, EHR and so on — to create a learning health system,” said Brownstein, who is also chief innovation officer at Boston Children’s. “With this approach, we think we’ve taken a big step in that direction. Our job now is to see if we can refine and expand upon it and apply it in ways that can benefit as many people as possible.”
The study was supported by the National Library of Medicine (grant number R01LM010812) and the National Institute of Environmental Health Sciences (grant number K01ES025438).