Technology is advancing by leaps and bounds, and with it the volume of information that society handles every day. That data, however, needs to be organized, analyzed and cross-referenced before it can be used to predict patterns. This is one of the main functions of what is known as ‘Big Data’, the 21st-century crystal ball capable of predicting the response to a specific medical treatment, the workings of a smart building and even the behavior of the Sun based on certain variables.
Researchers in the KIDS research group at the University of Cordoba’s Department of Computer Science and Numerical Analysis have improved models that predict several variables simultaneously from the same set of input variables, reducing the amount of data needed for an accurate forecast. One example is a method that predicts several parameters related to soil quality from a set of variables such as crops planted, tillage and the use of pesticides.
“When you are dealing with a large volume of data, there are two solutions. You either increase computer performance, which is very expensive, or you reduce the quantity of information needed for the process to be done properly,” says researcher Sebastian Ventura, one of the authors of the research article.
When building a predictive model, two issues need to be dealt with: the number of variables that come into play and the number of examples fed into the system to obtain reliable results. Working from the idea that less is more, the study reduces the number of examples by eliminating those that are redundant or “noisy” and therefore contribute no useful information to the creation of a better predictive model.
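The idea of discarding “noisy” examples can be illustrated with a minimal Python sketch. This is not the technique developed by the researchers, which the article does not detail; it is a simple neighbor-based filter chosen for illustration. The function name `select_instances`, the parameters `k` and `tol`, and the toy data are all assumptions.

```python
import math
import statistics


def select_instances(X, Y, k=3, tol=5.0):
    """Return the indices of the examples to keep.

    Illustrative filter (an assumption, not the authors' method): an
    example is dropped as noisy when its target vector lies farther than
    `tol` from the component-wise median target of its k nearest
    neighbors in input space.
    """
    keep = []
    for i in range(len(X)):
        # Rank every other example by distance to example i.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )
        neighbors = others[:k]
        # The median is used so that a single noisy neighbor
        # does not distort the reference target.
        reference = [
            statistics.median(Y[j][d] for j in neighbors)
            for d in range(len(Y[i]))
        ]
        if math.dist(Y[i], reference) <= tol:
            keep.append(i)
    return keep


# Toy data: one input variable, two outputs (y1 = x, y2 = 2x).
# The example at index 3 has been corrupted on purpose.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
Y = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [30.0, 60.0], [4.0, 8.0], [5.0, 10.0]]

kept = select_instances(X, Y, k=3, tol=5.0)
print(kept)
```

On this toy data the corrupted example is the only one removed. A filter of this kind only targets noise; detecting redundant examples would need an additional criterion.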
As Oscar Reyes, the lead author of the research, points out, “we have developed a technique that can tell you which set of examples you need so that the forecast is not only reliable but could even be better.” In some of the 18 databases analyzed, they were able to reduce the amount of information by 80% without affecting predictive performance, meaning that only about a fifth of the original data was used. All of this, says Reyes, “means saving energy and money in the building of a model, as less computing power is required.” It also saves time, which matters for applications that work in real time, since “it doesn’t make sense for a model to take half an hour to run if you need a prediction every five minutes.”
As the authors of the research point out, systems that predict several variables simultaneously (which may be related to one another) from several input variables, known as multi-output regression models, are gaining importance due to the wide range of applications that “could be analyzed under this paradigm of machine learning,” such as those related to healthcare, water quality, cooling systems for buildings and environmental studies.
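To make the notion of a multi-output regression model concrete, here is a minimal Python sketch in which a single model predicts two outputs from the same inputs. It uses a plain nearest-neighbor average, chosen only for brevity; it is not the model studied by the researchers. The function name `predict_multi_output`, the variable meanings and all numbers are hypothetical.

```python
import math


def predict_multi_output(X_train, Y_train, x_new, k=2):
    """Predict all outputs at once for one new input vector.

    Illustrative approach (an assumption): average the target vectors
    of the k training examples closest to x_new.
    """
    ranked = sorted(
        range(len(X_train)),
        key=lambda i: math.dist(X_train[i], x_new),
    )
    nearest = ranked[:k]
    n_outputs = len(Y_train[0])
    return [
        sum(Y_train[i][d] for i in nearest) / len(nearest)
        for d in range(n_outputs)
    ]


# Hypothetical data: two input variables per example and two
# outputs to be predicted together.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
Y_train = [[1.0, 10.0], [3.0, 20.0], [2.0, 30.0], [9.0, 90.0]]

prediction = predict_multi_output(X_train, Y_train, [0.9, 0.1], k=2)
print(prediction)
```

Each call returns one value per output, so every additional target variable is predicted from the same set of inputs without training a separate model.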