As data sets grow larger and more complex, machine learning methods are increasingly used in the biomedical community, both to explore data and to build predictive models for phenotypes of interest. Many researchers are not trained in building and interpreting such models, and it can be difficult to choose the appropriate approach in a given context. Here, I will describe two types of machine learning methods, clustering and classification, and discuss how and when they can be used and how to avoid some common pitfalls.
When choosing between machine learning methods, it is important to distinguish between supervised and unsupervised methods, which are used in different contexts and for different purposes. Unsupervised methods do not use any external information about the objects to be analyzed (annotations, such as disease status or other traits), but rather try to find the dominant structure or patterns in the data, which can then be interpreted by the researcher. Clustering is an example of an unsupervised method, where the goal is to partition the objects into subgroups without using any external annotation. Principal component analysis (PCA) is another example.
Supervised methods, on the other hand, typically aim at building models that predict or “explain” some pre-specified annotation, e.g., disease status or the response to a treatment. This annotation may or may not correspond to the main pattern(s) in the data. Classification, or predictive modeling, is an example of supervised learning. Given some data and a sample annotation, the aim is to build a model from the data that is able to predict the value of the sample annotation in a new sample for which we are only given the data.
It is important to recognize whether the goal of a study requires a supervised or an unsupervised approach. For example, if the goal is to build a model that can predict the disease status of a patient, we should use a supervised approach. Using an unsupervised approach like clustering or PCA will likely mix the signal that we are interested in with other, unrelated signals and generate a worse predictor, unless the disease status is actually the dominant signal in the data. On the other hand, if the goal is to get an overview of a data set, to see which patterns are strongest and whether the samples naturally partition into subgroups, an unsupervised method like clustering or PCA should be used.
An aspect to keep in mind when using supervised methods, especially on high-dimensional data sets, is that, since we are explicitly searching for patterns that are associated with the annotation we want to predict, we will almost certainly find something in the vast data space that predicts the annotation well in the current data set. However, this is not what we are ultimately interested in, since we already know the annotation values in this data set. Instead, we want to know whether the derived model can generalize, i.e., predict the value of the annotation in an independent data set that the model has not seen before, and where we may have only the data, but no information about the annotation. Thus, supervised models must always be validated in independent data sets; good predictive performance in the training data is not, by itself, evidence that the model is good. This is usually less important for unsupervised methods, which are more often used to summarize, explore and describe a given data set.
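To make the point about training performance concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the data are simulated pure noise, the labels are assigned at random, and the model is a simple one-nearest-neighbor rule (each sample receives the label of its closest training sample). None of the names come from any particular software package.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

N_FEATURES = 100  # number of measured variables (hypothetical)
N_TRAIN = 20      # samples used to build the model
N_TEST = 200      # independent samples the model has never seen


def simulate(n_samples):
    """Simulate pure-noise data with labels that are unrelated to the data."""
    data = [tuple(rng.gauss(0.0, 1.0) for _ in range(N_FEATURES))
            for _ in range(n_samples)]
    labels = [rng.choice(["healthy", "disease"]) for _ in range(n_samples)]
    return data, labels


def predict_1nn(train_data, train_labels, sample):
    """Return the label of the training sample closest to `sample`."""
    nearest = min(range(len(train_data)),
                  key=lambda i: math.dist(train_data[i], sample))
    return train_labels[nearest]


def accuracy(train_data, train_labels, data, labels):
    """Fraction of `data` whose predicted label matches the known label."""
    hits = sum(predict_1nn(train_data, train_labels, x) == y
               for x, y in zip(data, labels))
    return hits / len(labels)


train_data, train_labels = simulate(N_TRAIN)
test_data, test_labels = simulate(N_TEST)

train_acc = accuracy(train_data, train_labels, train_data, train_labels)
test_acc = accuracy(train_data, train_labels, test_data, test_labels)
print(f"accuracy on training data:    {train_acc:.2f}")
print(f"accuracy on independent data: {test_acc:.2f}")
```

In this setup the training accuracy is perfect by construction, because every training sample is its own nearest neighbor, even though there is nothing to learn. On the independent samples the accuracy should fall back to roughly chance level, which is the only honest estimate of how useful the model is.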
Clustering
Two types of clustering methods are offered in Qlucore Omics Explorer: hierarchical clustering (combined with heatmaps) and k-means clustering. Both are used for the same purpose: to find subgroups among the samples, such that samples within one group are more “similar” to each other than samples belonging to different groups, where “similar” can be formally defined in various ways. The difference is that the hierarchical clustering builds a “cluster tree” (or dendrogram), which organizes the samples hierarchically but does not directly divide them into clusters, while the k-means clustering partitions the samples into a pre-defined number of groups.
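The difference between the two kinds of output can be illustrated with a small sketch in plain Python. It is not how Qlucore Omics Explorer implements these methods; the toy data, the function names, the deterministic starting centroids and the choice of average linkage are all assumptions made to keep the example short and reproducible.

```python
import math

# Toy data (hypothetical): six samples measured on two variables,
# forming two well-separated groups.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]


def kmeans(points, k, max_iter=100):
    """Partition `points` into `k` groups and return one label per point."""
    # Deterministic farthest-point start (a simplification for this sketch;
    # random starting centroids are more common in practice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        # Assign every sample to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels


def hierarchical(points):
    """Agglomerative clustering (average linkage); returns the merge history."""
    clusters = {i: [i] for i in range(len(points))}  # one cluster per sample
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(math.dist(points[i], points[j])
                        for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)  # join the closest pair
    return merges


kmeans_labels = kmeans(samples, k=2)
merge_history = hierarchical(samples)
print("k-means labels:", kmeans_labels)
for left, right, distance in merge_history:
    print(f"merge {left} + {right} at average distance {distance:.2f}")
```

The k-means function returns a flat group label for each sample and needs the number of groups up front. The hierarchical function returns only the sequence of merges, which is the information a dendrogram displays; to obtain groups one still has to decide where to cut the tree.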
Practical situations where one would like to use a clustering approach include:
- to evaluate whether there are subtypes of a particular disease, i.e., whether the samples group into different clusters based on some measured data. These clusters may represent different disease types, which have different prognoses and behavior.
- to explore a data set and look for artifacts. This can be done by clustering the data and examining whether the obtained clusters are associated with the signal(s) of interest, or rather with spurious ones, such as batch effects or other technical artifacts.
Classification
Classification models consist of two parts: the variables that are used and a rule to combine the values of these variables in order to obtain a predicted value of a given sample annotation. Both are important, and are usually determined together. Several ways of building a classification model, such as random trees, support vector machines and k-nearest neighbor algorithms, are provided in Qlucore Omics Explorer.
Two examples of practical situations where one would like to use a classification approach are:
- to build a model that can use gene expression values to predict the prognosis of a cancer patient
- to build a model that can assign a sample to one of several previously defined disease subtypes, based on some observed biomedical data
As noted above, it is important that a predictive model is evaluated on independent data, and not on the same data on which it was built. Overfitting refers to the situation where a model is “too specifically adapted” to a given data set and does not generalize to other data sets. Usually, this is a sign that the model, in its attempt to fit the training data well, has adapted too much to the random noise in that data set. The noise in an independent data set will likely be different, and the model may then no longer work.
Cross-validation is a technique that can be used to estimate a model’s expected performance based on a single data set. The underlying idea is to split the data set into a training part and a test part multiple times, each time building the model on the training part and evaluating its performance on the test part (which was not used to build the model).
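The splitting scheme described above can be sketched in a few lines of plain Python. This is one common variant (k-fold cross-validation, where every sample is used for testing exactly once), paired with a deliberately simple nearest-centroid classifier. The toy data, the function names and the choice of classifier are assumptions for illustration only.

```python
import math
import random


def fit_centroids(data, labels):
    """Build the model: one mean profile (centroid) per class."""
    centroids = {}
    for cls in set(labels):
        members = [x for x, y in zip(data, labels) if y == cls]
        centroids[cls] = tuple(sum(col) / len(members)
                               for col in zip(*members))
    return centroids


def predict(centroids, sample):
    """Predict the class whose centroid is closest to `sample`."""
    return min(centroids, key=lambda cls: math.dist(centroids[cls], sample))


def make_folds(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and deal them into `n_folds` test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(data, labels, n_folds=5, seed=0):
    """Estimate accuracy by testing each fold on a model built without it."""
    hits = 0
    for test_idx in make_folds(len(data), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in held_out]
        # The model only ever sees the training part of the current split.
        model = fit_centroids([data[i] for i in train_idx],
                              [labels[i] for i in train_idx])
        hits += sum(predict(model, data[i]) == labels[i] for i in test_idx)
    return hits / len(data)


# Toy data (hypothetical): two variables, two clearly separated classes.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
        (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = ["healthy"] * 5 + ["disease"] * 5

folds = make_folds(len(data), 5)
cv_accuracy = cross_validate(data, labels, n_folds=5)
print(f"cross-validated accuracy: {cv_accuracy:.2f}")
```

The important detail is that the model is rebuilt from scratch inside the loop, so the samples in the test part never influence the model that is evaluated on them. On this artificially easy data the estimate is perfect; on real data it will typically be lower than the accuracy measured on the training data.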
The word classification is usually used to describe predictive modeling where the sample annotation is categorical. To predict a numeric/continuous annotation, one typically uses regression.
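As a small illustration of the distinction, the sketch below predicts two different annotations from the same single variable: a categorical one with a threshold rule, and a continuous one with an ordinary least-squares line. The numbers, variable names and both simple models are hypothetical and chosen only to show the different kinds of output.

```python
def fit_line(x, y):
    """Ordinary least squares for one variable: y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_threshold(x, classes, low="low", high="high"):
    """Classification rule: cut halfway between the two class means."""
    mean_low = sum(xi for xi, c in zip(x, classes) if c == low) / classes.count(low)
    mean_high = sum(xi for xi, c in zip(x, classes) if c == high) / classes.count(high)
    return (mean_low + mean_high) / 2


expression = [1.0, 2.0, 3.0, 4.0]        # one measured variable (hypothetical)
response = [2.0, 4.0, 6.0, 8.0]          # continuous annotation
status = ["low", "low", "high", "high"]  # categorical annotation

new_sample = 5.0

# Regression: the prediction is a number.
intercept, slope = fit_line(expression, response)
predicted_response = intercept + slope * new_sample

# Classification: the prediction is a category.
threshold = fit_threshold(expression, status)
predicted_status = "high" if new_sample > threshold else "low"

print("predicted response:", predicted_response)
print("predicted status:", predicted_status)
```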
Charlotte Soneson is an analyst at Qlucore AB.