The predictive analytics landscape covers a wide variety of techniques and methods designed to derive insights from data. These techniques, which include statistical modeling methods, classification rules, forecasting techniques, simulation models, and machine learning tools, have been used successfully for many years on structured data (data that consists of numeric or categorical attributes, where the number of categories is limited). Recently, the volume and variety of data available for analysis have exploded, and most of this data is in non-traditional forms that the traditional techniques were not designed to handle.
This article describes how you can transform non-traditional data, such as unstructured data (text) or semi-structured data (networks), into a structured form that you can then use to augment traditional data. Combining both types of data provides greater opportunities for actionable insight.
Traditional predictive modeling tools use structured data to predict a response variable, such as the likelihood of responding to a credit card offer, the probability of defaulting on a loan, or the possibility of reacting adversely to a drug treatment. Often, these applications include many sources of unstructured data that, until recently, have gone untapped. One of the most commonly available forms of such data is textual data, such as call center notes, warranty claims, survey responses, social media data, and blogs and tweets about new product releases.
Chakraborty and Pagolu (2014) describe an illustrative example of how such text data can be used in predictive models. Models that detect which customers are likely to “churn” (switch to a different carrier in telecommunications, or to a different bank in the financial industry, and so on) are used in many industries. Recent studies have shown that adding insight gained from customer feedback (for example, survey responses) can improve the predictive power of such a model. The text data is first converted into structured data by a transformation such as Singular Value Decomposition (SVD) or clustering. This structured data is then used to augment the other traditional attributes in the model. Nareddy and Chakraborty (2011) include a detailed example that illustrates this process and shows that adding information from textual data can reduce misclassification rates considerably.
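The transformation step can be sketched in a few lines of Python. This is a minimal illustration rather than the method used in the cited papers: the comments, the structured attributes (age and number of contacts), and the function names are all made up for the example, and a plain word-count matrix stands in for whatever term weighting a real text analytics tool would apply. Each comment is reduced to two SVD scores, which are then appended to the traditional attributes.

```python
import numpy as np

# Toy, made-up customer comments (hypothetical data for illustration only).
comments = [
    "dropped calls and poor coverage",
    "billing error again very unhappy",
    "great service and good coverage",
    "poor coverage dropped calls daily",
]
# Hypothetical structured attributes per customer: [age, number_of_contacts].
structured = np.array([[34, 5], [51, 7], [29, 1], [42, 6]], dtype=float)


def term_document_matrix(docs):
    """Return (counts, vocabulary): one row per document, one column per term."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: j for j, word in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for word in doc.split():
            counts[i, index[word]] += 1
    return counts, vocab


def svd_text_features(docs, k=2):
    """Project each document onto the top-k singular directions of the counts."""
    counts, _ = term_document_matrix(docs)
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]


# Two numeric text features per customer, appended to the structured columns.
text_features = svd_text_features(comments, k=2)
augmented = np.hstack([structured, text_features])
print(augmented.shape)  # (4, 4): 2 traditional attributes + 2 SVD scores
```

The augmented matrix can then be passed to any traditional modeling tool in place of the original structured attributes. The choice of `k` is an assumption here; in practice it would be tuned against model performance.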
Network Data
Most predictive models (regression models, decision trees, neural networks, and so on) are built using attributes that pertain to individual observations, which often contain all the information about a specific customer or individual item. For example, if you want to identify customers who might be good candidates for a new smartphone model, you might build a probability-of-response model based on specific characteristics of the individual customers. You might also suspect that a new smartphone campaign would be more successful if you offered attractive deals to the people on your target list who are most popular in social networks. Modern social media enable you to collect information about “friends” and “friends of friends” so that you can easily build up a “network” of customers, from which you can gain valuable insight. This data is usually represented by a graph that shows individuals as nodes and relationships as links between them.
Examples of analytical techniques used for incorporating network information into predictive models can be found in Baesens and Verbeke (2012). One such example of transforming network information into a structured form for a traditional predictive model relates to the use of first- and second-order relationships between customers (through social media connections or otherwise). These relationships can then be used as attributes in addition to traditional attributes, such as an individual’s age, recency of contact, number of contacts, and so on. Figure 1 shows some typical data for an analysis to determine the likelihood that a customer will “churn” (switch to a competing product). The premise for including such information in a predictive model is that individual customers are not isolated entities; they are often influenced by friends and relatives in their decisions to continue with a company’s services or to switch to a competitor.
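As a rough sketch of how first- and second-order relationships might be turned into structured attributes, the Python example below counts each customer's direct connections and friends of friends, along with how many of them have already churned. The customer names, links, churn flags, and feature names are hypothetical, and this is only one simple way to derive such attributes, not the specific technique presented by Baesens and Verbeke.

```python
from collections import defaultdict

# Hypothetical friendship links and churn flags (toy data for illustration).
links = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "eve")]
churned = {"ann": False, "bob": True, "cat": False, "dan": True, "eve": False}


def build_adjacency(edges):
    """Represent the undirected graph as a mapping from node to neighbor set."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def network_features(customer, adjacency, churned):
    """Count first- and second-order connections and how many have churned."""
    first = set(adjacency[customer])
    second = set()
    for friend in first:
        second |= adjacency[friend]
    # Friends of friends exclude the customer and the direct friends.
    second -= first | {customer}
    return {
        "n_first_order": len(first),
        "n_first_order_churned": sum(churned[c] for c in first),
        "n_second_order": len(second),
        "n_second_order_churned": sum(churned[c] for c in second),
    }


adjacency = build_adjacency(links)
features = {c: network_features(c, adjacency, churned) for c in churned}
print(features["ann"])
```

Each resulting dictionary is one row of structured data that can sit alongside attributes such as age or recency of contact in a traditional churn model.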
Other Sources of Data
In addition to text data and network data, other sources of non-traditional data, such as voice, video, image, and streaming sensor data, can also be used effectively. The most common method of incorporating voice data into the analysis is to convert it to text and then apply text analytics techniques. For video and image data, an initial analysis typically involves finding similarities and detecting anomalies as well as temporal and spatial variations. In all such cases, you can take best advantage of your existing techniques by finding a way to transform the non-traditional data into a structured form and then using traditional techniques to mine it.
Analyzing all sources of data of multiple types in one overarching framework is clearly an area of rich opportunities for research and is worthwhile for adding valuable insight to business decisions.
- Baesens, B., and Verbeke, W. 2012. “Social Networks in Data Mining: Challenges and Applications.” SAS Talks. http://support.sas.com/community/events/sastalks/presentations/SocialNetworksinDataMiningSAStalks.pdf
- Chakraborty, G., and Pagolu, M. 2014. “Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining.” In Proceedings of the SAS Global 2014 Conference. http://support.sas.com/resources/papers/proceedings14/1288-2014.pdf.
- Nareddy, M., and Chakraborty, G. 2011. “Improving Customer Loyalty Program through Text Mining of Customers’ Comments.” In Proceedings of the SAS Global 2011 Conference. http://support.sas.com/resources/papers/proceedings11/223-2011.pdf.
Radhika Kulkarni, Ph.D. is SAS Vice President for Advanced Analytics R&D, and a 2014 INFORMS Fellow. She may be reached at editor@ScientificComputing.com.