Machine Learning

Overview

Teaching: 70 min
Exercises: 20 min
Questions
  • What is machine learning?

  • What kind of tasks can I apply machine learning to?

  • How do I perform machine learning with scikit-learn?

  • How do I interpret my results?

Objectives
  • Get a rough overview of machine learning.

  • Clean up a real world data set.

  • Get an introduction to scikit-learn.

  • Apply dimensionality reduction, clustering, classification, and anomaly detection.

  • Visualize results in diverse ways.

Episode Credit

This episode is adapted and enhanced from https://github.com/YaleDHLab/lab-workshops/tree/master/machine-learning.

As a real-world dataset for practicing machine learning tasks, we will use relative abundances of gut microbiota. In a recent publication,

Wibowo, Marsha C., Zhen Yang, Maxime Borry, Alexander Hübner, Kun D. Huang, Braden T. Tierney, Samuel Zimmerman, et al. 2021. “Reconstruction of Ancient Microbial Genomes from the Human Gut.” Nature, May. https://doi.org/10.1038/s41586-021-03532-0.

modern industrial and non-industrial fecal microbiota were compared to a set of ancient samples (palaeofaeces).

ETL relative microbial abundances

import pandas as pd

While the following command is running, take a look at the spreadsheet itself.

raw = pd.read_excel(
    "data/41586_2021_3532_MOESM6_ESM.xlsx",
    sheet_name="Tab1-MetaPhlAn2_output",
    header=None,
)

What did we extract from the spreadsheet?

raw.head(10)
                                                 0           1    ...       799       800
0                                                NaN  Paleofeces  ...      Soil      Soil
1                                                NaN      UT43.2  ...  1043.4.1  3567.1.1
2                                         k__Archaea     7.18607  ...         0    0.0049
3                        k__Archaea|p__Euryarchaeota     7.18607  ...         0    0.0049
4        k__Archaea|p__Euryarchaeota|c__Archaeoglobi           0  ...         0         0
5  k__Archaea|p__Euryarchaeota|c__Archaeoglobi|o_...           0  ...         0         0
6  k__Archaea|p__Euryarchaeota|c__Archaeoglobi|o_...           0  ...         0         0
7  k__Archaea|p__Euryarchaeota|c__Archaeoglobi|o_...           0  ...         0         0
8        k__Archaea|p__Euryarchaeota|c__Halobacteria           0  ...         0         0
9  k__Archaea|p__Euryarchaeota|c__Halobacteria|o_...           0  ...         0         0

[10 rows x 801 columns]

This is the output of a particular software tool (MetaPhlAn2) that estimates microbial abundances from metagenomic sequencing (MGS) data. The single letter with a double-underscore prefix in the first column denotes the taxonomic rank. Multiple ranks are separated by | and together describe a lineage.
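For example, one of the lineage strings above splits into its individual ranks like this:

"k__Archaea|p__Euryarchaeota|c__Halobacteria".split("|")
['k__Archaea', 'p__Euryarchaeota', 'c__Halobacteria']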

The first two rows in the table describe the type of sample and a sample identifier.

raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3636 entries, 0 to 3635
Columns: 801 entries, 0 to 800
dtypes: object(801)
memory usage: 22.2+ MB

Because the first two rows of every column contain text (the sample type and the sample identifier), all columns are of type object rather than a numerical type.
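As an aside, and just as a sketch (the tidying below handles this more systematically with astype), the numeric block of the table could be converted directly:

raw.iloc[2:, 1:].apply(pd.to_numeric).dtypes.value_counts()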

Let’s make this tidy!

Tidying up

Sample meta information

We will split off the information that describes the samples. Along the way, we get to know another pandas feature, pipe, which allows us to chain function calls. In order to make full use of the chaining, we also need a small helper function that assigns a value to a single element of a data frame.

def set_element(df, row_idx, col_idx, value):
    result = df.copy()
    result.iat[row_idx, col_idx] = value
    return result
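Before applying this to our table, here is a tiny, hypothetical illustration of pipe: df.pipe(func, *args) is equivalent to func(df, *args), which lets us keep writing a fluent chain of method calls.

toy = pd.DataFrame({"a": [1, 2, 3]})
toy.pipe(lambda df, factor: df * factor, 10).rename(columns={"a": "scaled"})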

Our meta information is in the first two rows. We set meaningful names in the currently empty first-column fields, then use the first column as the row index. Next, we transpose the table so that our two rows become two columns, and finally we reset the index of the transposed data frame so that we end up with a range index starting at zero.

observations = (
    raw.iloc[:2, :]
    .pipe(set_element, 0, 0, "label")
    .pipe(set_element, 1, 0, "sample")
    .set_index(0)
    .T.reset_index(drop=True)
)

The result looks as follows:

observations.head(10)
0           label   sample
0      Paleofeces   UT43.2
1      Paleofeces    AW107
2      Paleofeces    AW108
3      Paleofeces   AW110A
4      Paleofeces    Zape1
5      Paleofeces    Zape2
6      Paleofeces    Zape3
7      Paleofeces   UT30.3
8  Non-industrial  MEX3158
9  Non-industrial  MEX3160

Inspect the observations

Can you confirm that we haven’t lost any samples? Additionally, is there something interesting to be said about the data?

Solution

observations.shape
(800, 2)

There are 800 observations, matching the 800 sample columns of the relative abundance table.

observations.describe()
0            label      sample
count          800         800
unique           4         800
top     Industrial  A34_02_1FE
freq           418           1

We see that there are four unique labels and that the label ‘Industrial’ is the most common.

observations["label"].value_counts()
Industrial        418
Non-industrial    371
Paleofeces          8
Soil                3
Name: label, dtype: int64

An overview of the number of samples per label.

Now that we have the sample meta information in a tidy format, it is time that we clean up the relative abundances.

Relative abundances

There is a lot of information contained in the first column. Let’s split this information into multiple columns. The taxonomic rank deserves its own column so that we can later filter on it. For convenience, we also keep the kingdom of every taxon for later data filtering.

Our strategy is implemented in the following function:

def transform_taxonomic_rank(df):
    result = df.rename(columns={0: "taxonomy"})
    # Split every lineage string into its individual ranks.
    lineage = result["taxonomy"].str.split("|")
    # The first element of each lineage holds the kingdom, e.g., 'k__Archaea'.
    result["kingdom"] = (
        lineage.map(lambda element: element[0])
        .str.split("__", n=1, expand=True)
        .iloc[:, -1]
    )
    # The last element holds the most specific rank and its name, e.g., 'g__Halobacterium'.
    result[["rank", "taxonomy"]] = lineage.map(lambda element: element[-1]).str.split(
        "__", n=1, expand=True
    )
    return result
transformed = transform_taxonomic_rank(raw.iloc[2:, :])

Let’s see what our result looks like.

Inspect the transformed data frame

Briefly investigate the transformed data frame.

Solution

transformed.head(10)
                         taxonomy        1        2        3  ... 799     800  kingdom rank
2                         Archaea  7.18607  3.52607  0.10627  ...   0  0.0049  Archaea    k
3                   Euryarchaeota  7.18607  3.52607  0.10627  ...   0  0.0049  Archaea    p
4                    Archaeoglobi        0        0        0  ...   0       0  Archaea    c
5                 Archaeoglobales        0        0        0  ...   0       0  Archaea    o
6                Archaeoglobaceae        0        0        0  ...   0       0  Archaea    f
7   Archaeoglobaceae_unclassified        0        0        0  ...   0       0  Archaea    g
8                    Halobacteria        0        0        0  ...   0       0  Archaea    c
9                 Halobacteriales        0        0        0  ...   0       0  Archaea    o
10               Halobacteriaceae        0        0        0  ...   0       0  Archaea    f
11                  Halobacterium        0        0        0  ...   0       0  Archaea    g

[10 rows x 803 columns]
transformed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3634 entries, 2 to 3635
Columns: 803 entries, taxonomy to rank
dtypes: object(803)
memory usage: 22.3+ MB
transformed.describe()
              taxonomy       1       2       3  ...     799     800   kingdom  rank
count             3633  3634.0  3634.0  3634.0  ...  3634.0  3634.0      3633  3634
unique            3527   150.0   134.0   132.0  ...   123.0   292.0         4     9
top     Viruses_noname     0.0     0.0     0.0  ...     0.0     0.0  Bacteria     s
freq                 5  3312.0  3349.0  3342.0  ...  3399.0  3056.0      3020  1459

[4 rows x 803 columns]
transformed["kingdom"].value_counts()
Bacteria     3020
Viruses       479
Eukaryota      91
Archaea        43
Name: kingdom, dtype: int64
transformed["rank"].value_counts()
s               1459
t               1267
g                503
f                216
o                 98
c                 58
p                 28
k                  4
unclassified       1
Name: rank, dtype: int64

From our inspection of the data, we decide that it is best to remove the abundance of unclassified sequencing reads. In addition, we do not fully trust the read assignment to virus genomes, and viruses do not play a large role here, so we remove those as well.

Similarly to before, we set the new columns as the row index, convert all abundances from object to float, transpose the data frame, and reset the index again for a nice range starting from zero.

abundance = (
    transformed.query("kingdom != 'Viruses' and rank != 'unclassified'")
    .set_index(["kingdom", "rank", "taxonomy"])
    .astype(float)
    .T.reset_index(drop=True)
)

We have chosen three columns as the index here. That means we are creating an index with multiple levels (a pandas.MultiIndex). Multi-indices can be tricky to work with; we will simply use them here, and a full understanding is not required.

abundance.head(10)
kingdom   Archaea                ...               Eukaryota              
rank            k             p  ...                       s             t
taxonomy  Archaea Euryarchaeota  ... Enterocytozoon_bieneusi GCA_000209485
0         7.18607       7.18607  ...                     0.0           0.0
1         3.52607       3.52607  ...                     0.0           0.0
2         0.10627       0.10627  ...                     0.0           0.0
3         0.32794       0.32794  ...                     0.0           0.0
4         0.52044       0.52044  ...                     0.0           0.0
5         0.02790       0.02790  ...                     0.0           0.0
6         0.03779       0.03779  ...                     0.0           0.0
7         2.80318       2.80318  ...                     0.0           0.0
8         0.00000       0.00000  ...                     0.0           0.0
9         0.00000       0.00000  ...                     0.0           0.0

[10 rows x 3154 columns]
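As a minimal sketch of working with this column MultiIndex, here are two equivalent ways to pull out all species-level columns (alternatives to the get_loc_level approach used further below; both return a frame with 800 rows and one column per species):

# Cross-section on the "rank" level; xs drops the selected level by default.
abundance.xs("s", axis=1, level="rank")
# The same selection with an explicit boolean mask over the column labels.
abundance.loc[:, abundance.columns.get_level_values("rank") == "s"]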

Relative abundances are percentages of the total counts. They apply within each taxonomic rank, so summing the relative abundances per rank should give a value close to 100%. It may be lower, since we removed viruses and unclassified reads.

import numpy as np

We can create a basic sanity check based on the above ideas: for each rank individually, the relative abundances should sum to at most 100 in every sample.

for rank in abundance.columns.unique(level="rank"):
    view = abundance.loc[:, abundance.columns.get_loc_level(rank, level="rank")[0]]
    total = view.sum(axis=1)
    assert ((total < 100.0) | np.isclose(total, 100.0)).all()

How many variables (columns, features) do we actually have per rank?

for rank in abundance.columns.unique(level="rank"):
    print(rank, abundance.columns.get_loc_level(rank, level="rank")[0].sum())
k 3
p 27
c 57
o 92
f 184
g 443
s 1265
t 1083

Since we have the most variables at the species level, and because we would not outright trust the ability of metagenomic sequencing to distinguish strains, we will use the species level for our machine learning tasks. We can always move to a higher taxonomic rank to test performance on more aggregated signals.

species_abundance = abundance.loc[
    :, abundance.columns.get_loc_level("s", level="rank")[0]
].copy()
species_abundance.shape
(800, 1265)

This was a lot of work, you may think, and we haven’t even done any machine learning yet! You are completely right about that feeling. According to a white paper by Google engineers, creating machine learning models is a tiny fraction of the work.

Image taken from a NIPS conference paper.

Introduction to Machine Learning

Andrew Ng, a prominent machine learning expert, has defined machine learning as “the science of getting computers to act without being explicitly programmed.” This workshop is meant to give a quick introduction to some of the techniques one can use to build algorithms that meet this criterion. Specifically, we will discuss the following sub-fields within machine learning:

  • Dimensionality reduction

  • Clustering

  • Classification

  • Anomaly detection

Let’s dive in!

Dimensionality Reduction

Our relative abundance data set at the species level has a large number of variables (features in machine learning lingo). We have 1265 features and only 800 samples. Microbiome data are often also quite sparse, meaning that most species are observed in only a small number of samples.

Such “high-dimensional” data sets can be quite hard to work with and reason about, not to mention impossible to visualize. High-dimensional data also pose specific challenges to many machine learning models (see The Curse of Dimensionality). To work around these challenges, it’s often helpful to reduce the number of dimensions required to express a given data set.

One popular way to reduce the dimensionality of a data set is to use a technique called Principal Component Analysis. PCA tries to find a lower dimensional representation of data points by projecting them down into a smaller dimensional space in a way that minimizes loss of information.

To get an intuition about PCA, suppose you have points in two dimensions, and you wish to reduce the dimensionality of your data set to a single dimension. To do so, you could find the center of the points then create a line $L$ with a random orientation that passes through that center. One can then project each point onto $L$ such that an imaginary line between the point and $L$ forms a right angle. Within this “projection”, each 2D point can be represented by just its position along the 1D line $L$, effectively giving us a 1D representation of the point’s position in its original space. Furthermore, we can use the difference between the largest and smallest values of points projected onto $L$ as a measure of the amount of “variance” or “spread” within the data captured by $L$. The greater this spread, the greater the amount of “signal” from the original data set is represented in the projection.

If one were to slowly rotate $L$ and continue measuring the difference between the greatest and smallest values on $L$ at each orientation, one could find the orientation of the projection line that minimizes information loss (maximizes spread). This line of minimal information loss is shown in pink below. Once that line is discovered, we can project all of our points onto that lower-dimensional embedding. See the red points below when the black line is collinear with the pink line:

For a beginner-friendly deep dive into the mechanics behind this form of dimensionality reduction, check out Josh Starmer's step-by-step guide to PCA.
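To make this intuition concrete, here is a minimal, self-contained sketch with scikit-learn on hypothetical 2D points (not the microbiome data): a one-component PCA recovers the orientation of the line $L$ and reports how much of the spread it captures.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 2D points with one dominant direction of spread.
rng = np.random.default_rng(123456)
x = rng.normal(size=200)
points = np.column_stack([x, 0.6 * x + rng.normal(scale=0.3, size=200)])

pca_1d = PCA(n_components=1).fit(points)
projected = pca_1d.transform(points)        # position of each point along L
print(pca_1d.components_[0])                # orientation (unit vector) of L
print(pca_1d.explained_variance_ratio_[0])  # share of the total variance kept by L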

What makes this kind of dimensionality reduction useful for research? There are two primary uses: (visual) data exploration and data analysis.

Exploratory data visualization with UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualization but also for general non-linear dimension reduction. The author Leland McInnes has a recorded presentation about how the method works in detail.

UMAP considers the distances between samples in a particular space (the manifold) and projects those distances into a Cartesian coordinate system, typically with two dimensions. Since we will supply UMAP with a precomputed metric, we need to transform our relative abundances into a square distance matrix that contains the pairwise distances between all samples.

In microbiome analyses, a common metric to compare samples is the Bray-Curtis dissimilarity. It is a $\beta$-diversity measure since we can use it to compare samples from different ecological niches. We use scikit-bio for a fast implementation of this measure.

from skbio.diversity import beta_diversity
bc_dm = beta_diversity(
    metric="braycurtis", counts=species_abundance, ids=observations["sample"]
)
print(bc_dm)
800x800 distance matrix
IDs:
'UT43.2', 'AW107', 'AW108', 'AW110A', 'Zape1', 'Zape2', 'Zape3', 'UT30.3', ...
Data:
[[0.         0.75103997 0.81117371 ... 0.97354678 0.98426476 0.95241561]
 [0.75103997 0.         0.50414618 ... 0.99686836 0.99895874 0.96373897]
 [0.81117371 0.50414618 0.         ... 0.9969896  0.99919453 0.97885096]
 ...
 [0.97354678 0.99686836 0.9969896  ... 0.         0.8080406  0.59635894]
 [0.98426476 0.99895874 0.99919453 ... 0.8080406  0.         0.6525434 ]
 [0.95241561 0.96373897 0.97885096 ... 0.59635894 0.6525434  0.        ]]
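As a quick sanity check, we can reproduce one entry of this matrix by hand, assuming the standard Bray-Curtis definition $\mathrm{BC}(u, v) = \sum_i |u_i - v_i| / \sum_i (u_i + v_i)$:

u = species_abundance.iloc[0].to_numpy()
v = species_abundance.iloc[1].to_numpy()
np.abs(u - v).sum() / (u + v).sum()  # should be close to bc_dm.data[0, 1] above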

Now we are ready to create a lower dimensional projection of this distance matrix.

We would like to take this opportunity to introduce a cool (and in this case useful) feature of Jupyter: interactivity. In order to make the interactivity work as intended, we will define a few functions. One for the UMAP, another to plot the result, and a third to handle the interactivity.

from umap import UMAP
def get_umap_embedding(num):
    """Create a particular projection using a given number of neighbours."""
    result = pd.DataFrame(
        data=UMAP(
            n_neighbors=num, random_state=123456, metric="precomputed", n_components=2
        ).fit_transform(bc_dm.data.copy()),
        columns=["D1", "D2"],
        index=observations.index,
    )
    result[["label", "sample"]] = observations[["label", "sample"]].copy()
    return result
import altair as alt
def plot_umap(df, num):
    """Create an interactive altair chart from a UMAP result."""
    selection = alt.selection_multi(fields=["label"])
    color = alt.condition(
        selection,
        alt.Color("label:N", scale=alt.Scale(scheme="set1"), legend=None),
        alt.value("lightgray"),
    )
    shape = alt.Shape("label", legend=None)

    scatter = (
        alt.Chart(df, title=f"Number of Neighbours = {num}")
        .encode(
            x="D1", y="D2", color=color, shape=shape, tooltip=["sample:N", "label:N"]
        )
        .mark_point(filled=True)
        .interactive()
    )

    legend = (
        alt.Chart(df)
        .encode(
            y=alt.Y("label:N", axis=alt.Axis(orient="right")), color=color, shape=shape
        )
        .mark_point()
        .add_selection(selection)
    )

    return (scatter | legend).configure_view(continuousWidth=800, continuousHeight=600)
from functools import lru_cache
@lru_cache()
def run_umap(num: int):
    embedding = get_umap_embedding(num)
    return plot_umap(embedding, num)
from ipywidgets import interact, widgets

In order to avoid expensive recalculation of results when moving the slider back to previously used numbers, we cache the results with the @lru_cache decorator above.
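As a tiny, hypothetical illustration of what the cache buys us (slow_square is not part of the analysis): the function body only runs once per distinct argument, and repeated calls return the stored result.

@lru_cache()
def slow_square(n):
    print(f"computing {n} squared")
    return n * n

slow_square(4)  # prints, then returns 16
slow_square(4)  # returns 16 straight from the cache, without printing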

interact(run_umap, num=widgets.IntSlider(min=5, max=95, step=10, value=15))

The number of neighbours considered is an important parameter for UMAP. Nonetheless, the global structures look very similar. Compare the results to the PCA plot from the paper.

There are very few palaeofaeces and soil samples. We exclude both from further machine learning tasks.

We build a single boolean mask so that the rows of the filtered abundance table stay aligned with the filtered observations.

industrial_vs_non_mask = observations["label"].isin(["Industrial", "Non-industrial"])
industrial_vs_non_obs = observations[industrial_vs_non_mask].reset_index(drop=True)
industrial_vs_non_obs.shape
(789, 2)
industrial_vs_non_abn = species_abundance.loc[industrial_vs_non_mask, :].reset_index(
    drop=True
)
industrial_vs_non_abn.shape
(789, 1265)

Clustering

Clustering is a powerful machine learning technique and one that often requires some kind of distance metric. The goal of a clustering algorithm is to create some groups of observations, where each group contains similar observations.

There are a variety of methods for clustering vectors, including density-based clustering, hierarchical clustering, and centroid clustering. One of the most intuitive and most commonly used centroid-based methods is K-Means Clustering.

Given a collection of points in a space, K-Means selects K “centroid” points randomly (colored green below), then assigns each non-centroid point to the centroid to which it’s closest. Using these preliminary groupings, the next step is to find the geometric center of each group, using the same technique one would use to find the center of a square. These group centers become the new centroids, and again each point is assigned to the centroid to which it’s closest. This process continues until centroid movement falls below some minimal movement threshold, after which the clustering is complete.
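The following is a minimal from-scratch sketch of that iteration on hypothetical points, just to make the two alternating steps explicit (it ignores edge cases such as empty clusters); for the real analysis below we use scikit-learn’s KMeans.

import numpy as np

def simple_kmeans(points, k, n_iter=10, seed=0):
    points = np.asarray(points)
    rng = np.random.default_rng(seed)
    # start from k randomly chosen points as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        centroids = np.array([points[assignment == j].mean(axis=0) for j in range(k)])
    return centroids, assignment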

As discussed before, we use PCA to reduce the number of dimensions in our data to two.

from sklearn.decomposition import PCA
pca = PCA(n_components=2).fit(industrial_vs_non_abn)
pca.explained_variance_ratio_
array([0.26181605, 0.1059267 ])
reduced_species = pca.transform(industrial_vs_non_abn)
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = 16, 12
def plot_kmeans_clusters():
    # Step size of the mesh. Decrease to increase the resolution of the plot.
    h = 0.02

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    x_min, x_max = reduced_species[:, 0].min() - 1, reduced_species[:, 0].max() + 1
    y_min, y_max = reduced_species[:, 1].min() - 1, reduced_species[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Obtain labels for each point in mesh. Use last trained model.
    Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(1)
    plt.clf()
    plt.imshow(
        Z,
        interpolation="nearest",
        extent=(xx.min(), xx.max(), yy.min(), yy.max()),
        cmap=plt.cm.Paired,
        aspect="auto",
        origin="lower",
    )

    for key, sub in industrial_vs_non_obs.groupby("label", as_index=False, sort=False):
        plt.scatter(
            reduced_species[sub.index, 0],
            reduced_species[sub.index, 1],
            label=key,
        )

    # Plot the centroids as a white X
    centroids = kmeans.cluster_centers_
    plt.scatter(
        centroids[:, 0],
        centroids[:, 1],
        marker="x",
        s=169,
        linewidths=3,
        color="w",
        zorder=10,
    )

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.legend()
    plt.show()
kmeans = KMeans(n_clusters=2, n_init=16, random_state=123456).fit(reduced_species)
plot_kmeans_clusters()

[Figure: K-Means with 2 clusters: cluster regions over the PCA-reduced species abundances, with samples coloured by label and centroids marked as white crosses]

kmeans = KMeans(n_clusters=8, n_init=16, random_state=123456).fit(reduced_species)
plot_kmeans_clusters()

[Figure: the same plot with 8 K-Means clusters]

Classification

The goal of a classification task is to predict whether a given observation in a data set possesses some particular property or attribute. To make these predictions, we measure the attributes of several labeled data observations, then compare unknown observations to those measurements.

In our data set we have microbiome samples labeled either as ‘Industrial’ or ‘Non-industrial’. Scikit-learn can help us to split our data into a training set and a testing set.

K-Nearest neighbors classifiers

With a $K$-nearest neighbors classifier, we start with a labeled data set. We then add new, unlabeled observations to the data. For each, we consult the $K$ labeled observations to which the unlabeled observation is closest, where $K$ is an odd integer we use for all classifications. We then find the most common label among those $K$ observations (the “$K$-nearest neighbors”) and give the new observation that label.

The following diagram shows this scenario. Our new observation (represented by the question mark) has some points near it that are labeled with a triangle or star. Suppose we have chosen to use 3 for our value of $K$. In that case, we consult the 3 nearest labeled points near the question mark. Those 3 nearest neighbors have labels: star, triangle, triangle. Using a majority vote, we give the question mark a triangle label.

Examining the plot above, we can see that if $K$ were set to 1, we would classify the question mark as a star, but if $K$ is 3 or 5, we would classify the question mark as a triangle. That is to say, $K$ is an important parameter in a $K$-nearest neighbors classifier.
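As a toy version of the star/triangle picture above (hypothetical coordinates, purely for illustration), with $K = 3$ the two nearby triangles outvote the single nearby star:

from sklearn.neighbors import KNeighborsClassifier

toy_points = [[0.0, 0.0], [0.5, 0.5], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
toy_labels = ["triangle", "triangle", "star", "star", "star"]
toy_knn = KNeighborsClassifier(n_neighbors=3).fit(toy_points, toy_labels)
toy_knn.predict([[0.4, 0.2]])  # expect ['triangle']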

Let’s see how we can use our labeled microbiome samples for classification.

With scikit-learn’s convenience function train_test_split, we split our data into a training set and a test set. By default, the split is approximately 3:1 (75% training, 25% testing).

from sklearn.model_selection import train_test_split
train_abundance, test_abundance, train_label, test_label = train_test_split(
    reduced_species, industrial_vs_non_obs["label"]
)
np.unique(train_label, return_counts=True)
(array(['Industrial', 'Non-industrial'], dtype=object), array([317, 274]))
np.unique(test_label, return_counts=True)
(array(['Industrial', 'Non-industrial'], dtype=object), array([101,  97]))
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import matthews_corrcoef
@lru_cache
def run_knn(num):
    knn_clf = KNeighborsClassifier(num).fit(train_abundance, train_label)
    return matthews_corrcoef(test_label, knn_clf.predict(test_abundance))
interact(run_knn, num=widgets.IntSlider(min=1, max=9, step=2, value=3))

Measure performance

Can you systematically measure the classifier model performance (as measured by Matthews’ correlation coefficient) for a wide range of $K$-values? Plot the results using any method that we have learnt.

Solution

We create a data frame for storing the $K$-values and performance.

performance = pd.DataFrame({"k": list(range(1, 99, 2))})
performance["mcc"] = performance["k"].map(run_knn)
performance.plot.scatter("k", "mcc")

[Figure: scatter plot of the Matthews correlation coefficient (mcc) against the number of neighbours k]

For each observation we pass as input to knn_clf.predict(), the function returns a predicted label (either ‘Industrial’ or ‘Non-industrial’). Just like that, we’ve trained a machine learning classifier and classified some test data!
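To make this concrete, we can peek at a few individual predictions next to the true labels (example_clf here is a throwaway fit, analogous to what run_knn does internally; the exact values depend on the random split):

example_clf = KNeighborsClassifier(5).fit(train_abundance, train_label)
list(zip(example_clf.predict(test_abundance[:5]), test_label[:5]))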

The classification example above shows how we can classify just a single point in space, but suppose we want to analyze the way a classifier would classify each possible point in some space. To do so, we can transform our space into a grid of units, then classify each point in that grid. Analyzing a space in this way is known as identifying a classifier’s decision boundary, because this analysis shows one the boundaries between different classification outcomes in the space. This kind of analysis is very helpful in training machine learning models, because studying a classifier’s decision boundary can help one see how to improve the classifier.

Let’s take a look at the decision boundary of our classifier:

def plot_decision_boundary(clf, X, labels, h=0.1):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu, alpha=0.8)

    # Plot also the training points
    for key, sub in labels.groupby("label", as_index=False, sort=False):
        plt.scatter(X[sub.index, 0], X[sub.index, 1], edgecolor="black", label=key)

    plt.legend(loc="best")
    plt.show()
knn_clf = KNeighborsClassifier(5).fit(train_abundance, train_label)
plot_decision_boundary(
    knn_clf,
    train_abundance,
    pd.DataFrame({"label": train_label}).reset_index(drop=True),
)

[Figure: decision boundary of a 5-nearest-neighbours classifier with the training points overlaid]

For each pixel in the plot above, we retrieve the $K$ closest points with known labels. We then use a majority vote of those labels to assign the label of the pixel. This is exactly analogous to predicting a label for an unlabeled point. In both cases, we take a majority vote of the $K$ closest points with known labels. Working in this way, we can use labeled data to classify unlabeled data. That’s all there is to $K$-nearest neighbors classification!

It’s worth noting that $K$-nearest neighbors is only one of many popular classification algorithms. From a high-level point of view, each classification algorithm works in a similar way. Each requires a certain number of observations with known labels and each uses those labeled observations to classify unlabeled observations. However, different classification algorithms use different logic to assign a label to new observations which means different classification algorithms have very different decision boundaries.

In the chart below, each row plots the decision boundaries that several classifiers produce for the same data set. Notice how some classifiers work better with certain data shapes.

Image credit

For an intuitive introduction to many of these classifiers, including Support Vector Machines, Decision Trees, Neural Networks, and Naive Bayes classifiers, see Luis Serrano’s introduction to machine learning video discussed in the Going Further section below.

Anomaly Detection

Anomaly detection refers to the identification of anomalies, or outliers, in data sets. While detecting anomalies in a single dimension can be quite simple, finding anomalies in high-dimensional space is a difficult problem.

One technique for identifying anomalies in high-dimensional data sets is an Isolation Forest. An Isolation Forest identifies outliers in a data set by randomly dividing the space until each point is isolated from the others. After repeating this procedure several times, the Isolation Forest flags points that were consistently isolated more quickly than the rest as outliers.

This illustration demonstrates the method by which these outliers are quickly identified. Isolated points are colored green and labeled with the iteration on which they were isolated. If you repeat the procedure several times, you’ll see the outlier is consistently isolated quickly, which allows the Isolation Forest to identify that point as an outlier.

If we run the simulation above a number of times, we should see that the outlier point is consistently isolated quickly, while it usually takes more iterations to isolate the other points. This is the chief intuition behind the Isolation Forest’s outlier classification strategy: outliers are isolated quickly because they are farther from the other points in the data set.
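Here is a toy sketch of that intuition with scikit-learn (hypothetical points, not the microbiome data): a single far-away point is flagged with the label -1, while points in the dense cluster get 1.

import numpy as np
from sklearn.ensemble import IsolationForest

toy_rng = np.random.default_rng(0)
toy = np.concatenate([toy_rng.normal(0.0, 1.0, size=(50, 2)), [[8.0, 8.0]]])
toy_clf = IsolationForest(random_state=0).fit(toy)
toy_clf.predict([[0.0, 0.0], [8.0, 8.0]])  # expect array([ 1, -1])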

Let’s try this on our microbiome samples.

from sklearn.ensemble import IsolationForest
isolation_clf = IsolationForest(random_state=123456, n_jobs=-1).fit(train_abundance)
def plot_iforest_decision_boundary(clf, X, test_X, test_result, margin=6, mesh=0.5):
    """
    Create and display the decision boundary for an isolation forest.
    """
    # get the x and y grid domains
    x_domain = [X[:, 0].min() - margin, X[:, 0].max() + margin]
    y_domain = [X[:, 1].min() - margin, X[:, 1].max() + margin]
    # get a list of values from min to max, counting by `mesh`
    x_range = np.arange(x_domain[0], x_domain[1], mesh)
    y_range = np.arange(y_domain[0], y_domain[1], mesh)
    # create the data with which to color the background grid
    xx, yy = np.meshgrid(x_range, y_range)
    # compute the anomaly score for each point of the grid
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    # reshape Z into a 2D grid
    Z = Z.reshape(xx.shape)
    # fill in the grid values
    plt.contourf(xx, yy, Z, cmap=plt.cm.YlGn)
    # add the training points
    plt.scatter(
        X[:, 0],
        X[:, 1],
        c="green",
        edgecolors="black",
        alpha=0.4,
        label="training observation",
    )
    inliers = test_X[test_result == 1, :]
    outliers = test_X[test_result == -1, :]
    # plot the inliers and outliers
    if inliers.shape[0] > 0:
        plt.scatter(
            inliers[:, 0],
            inliers[:, 1],
            c="white",
            edgecolors="black",
            label="non-outlier",
        )
    if outliers.shape[0] > 0:
        plt.scatter(
            outliers[:, 0], outliers[:, 1], c="red", edgecolors="black", label="outlier"
        )
    # add a legend to the plot
    plt.legend(loc="best")
    plt.show()
plot_iforest_decision_boundary(
    isolation_clf, train_abundance, test_abundance, isolation_clf.predict(test_abundance)
)

[Figure: Isolation Forest decision boundary with training observations and test samples marked as non-outliers or outliers]

We see that any sample outside the central area is classified as an outlier. In just a few lines of code, we have created, trained, and deployed a machine learning model for detecting outliers in high-dimensional data!

Removed samples

In the original study, some samples were removed. Can our classifier detect them as outliers?

Prepare data

As before we load the data from the spreadsheet and tidy it up.

removed = pd.read_excel(
    "data/41586_2021_3532_MOESM6_ESM.xlsx", sheet_name="Tab2-MetaPhlAn2_removed_samples"
)
removed
                                            Unnamed: 0    TS929A  ...    AW113    AW116
0                                           k__Archaea  44.35047  ...  0.47962  0.02775
1                          k__Archaea|p__Euryarchaeota  44.35047  ...  0.47962  0.02775
2       k__Archaea|p__Euryarchaeota|c__Methanobacteria  44.35047  ...  0.47962  0.02775
3    k__Archaea|p__Euryarchaeota|c__Methanobacteria...  44.35047  ...  0.47962  0.02775
4    k__Archaea|p__Euryarchaeota|c__Methanobacteria...  44.35047  ...  0.47962  0.02775
..                                                 ...       ...  ...      ...      ...
729  k__Viruses|p__Viruses_noname|c__Viruses_noname...   0.00000  ...  0.46174  0.00000
730  k__Viruses|p__Viruses_noname|c__Viruses_noname...  12.62477  ...  0.00000  0.00000
731  k__Viruses|p__Viruses_noname|c__Viruses_noname...  12.62477  ...  0.00000  0.00000
732  k__Viruses|p__Viruses_noname|c__Viruses_noname...  12.62477  ...  0.00000  0.00000
733  k__Viruses|p__Viruses_noname|c__Viruses_noname...  12.62477  ...  0.00000  0.00000

[734 rows x 8 columns]
def transform_removed_taxonomic_rank(df: pd.DataFrame) -> pd.DataFrame:
    result = df.rename(columns={"Unnamed: 0": "taxonomy"})
    lineage = result["taxonomy"].str.split("|")
    result["kingdom"] = (
        lineage.map(lambda element: element[0])
        .str.split("__", n=1, expand=True)
        .iloc[:, -1]
    )
    result[["rank", "taxonomy"]] = lineage.map(lambda element: element[-1]).str.split(
        "__", n=1, expand=True
    )
    return result
intermediate = transform_removed_taxonomic_rank(removed)
intermediate.head(10)
                                  taxonomy    TS929A  TS889  ...    AW116  kingdom  rank
0                                  Archaea  44.35047    0.0  ...  0.02775  Archaea     k
1                            Euryarchaeota  44.35047    0.0  ...  0.02775  Archaea     p
2                          Methanobacteria  44.35047    0.0  ...  0.02775  Archaea     c
3                       Methanobacteriales  44.35047    0.0  ...  0.02775  Archaea     o
4                      Methanobacteriaceae  44.35047    0.0  ...  0.02775  Archaea     f
5                       Methanobrevibacter  44.35047    0.0  ...  0.02741  Archaea     g
6               Methanobrevibacter_smithii   0.00000    0.0  ...  0.01242  Archaea     s
7  Methanobrevibacter_smithii_unclassified   0.00000    0.0  ...  0.01242  Archaea     t
8          Methanobrevibacter_unclassified  44.35047    0.0  ...  0.01499  Archaea     s
9                           Methanosphaera   0.00000    0.0  ...  0.00035  Archaea     g

[10 rows x 10 columns]
intermediate.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 734 entries, 0 to 733
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   taxonomy  734 non-null    object 
 1   TS929A    734 non-null    float64
 2   TS889     734 non-null    float64
 3   TS895     734 non-null    float64
 4   UT2.12    734 non-null    float64
 5   UT3.6     734 non-null    float64
 6   AW113     734 non-null    float64
 7   AW116     734 non-null    float64
 8   kingdom   734 non-null    object 
 9   rank      734 non-null    object 
dtypes: float64(7), object(3)
memory usage: 57.5+ KB
intermediate[["taxonomy", "kingdom", "rank"]].describe()
              taxonomy   kingdom rank
count              734       734  734
unique             731         4    8
top     Viruses_noname  Bacteria    s
freq                 3       675  258
intermediate["kingdom"].value_counts()
Bacteria     675
Eukaryota     27
Viruses       20
Archaea       12
Name: kingdom, dtype: int64
intermediate["rank"].value_counts()
s    258
t    204
g    129
f     75
o     33
c     21
p     10
k      4
Name: rank, dtype: int64
removed_abundance = (
    intermediate.query("kingdom != 'Viruses'")
    .set_index(["kingdom", "rank", "taxonomy"])
    .T.reset_index(drop=True)
)
removed_abundance
kingdom    Archaea                ...           Eukaryota              
rank             k             p  ...                   s             t
taxonomy   Archaea Euryarchaeota  ... Sordaria_macrospora GCA_000182805
0         44.35047      44.35047  ...             0.00000       0.00000
1          0.00000       0.00000  ...             0.00000       0.00000
2          0.52994       0.52994  ...             0.00000       0.00000
3          0.00666       0.00666  ...             0.00000       0.00000
4          0.03470       0.03470  ...             0.00000       0.00000
5          0.47962       0.47962  ...             0.96528       0.96528
6          0.02775       0.02775  ...             0.55437       0.55437

[7 rows x 714 columns]

Now that we have the data in a tidy format, we need to align the removed samples to the exact species columns that the PCA was fitted on (filling in missing species with zeros) before projecting them into the same two-dimensional space.

removed_species = removed_abundance.loc[
    :, removed_abundance.columns.get_loc_level("s", level="rank")[0]
].copy()
removed_species.shape
(7, 254)
industrial_vs_non_abn.shape
(789, 1265)
removed_species = removed_species.reindex(
    columns=industrial_vs_non_abn.columns, fill_value=0.0
)
removed_species.shape
(7, 1265)
reduced_removed = pca.transform(removed_species)
reduced_removed.shape
(7, 2)
plot_iforest_decision_boundary(
    isolation_clf,
    reduced_species,
    reduced_removed,
    isolation_clf.predict(reduced_removed),
)

[Figure: Isolation Forest decision boundary with the projected removed samples, all marked as outliers]

As we could have expected from the small number of shared species, the removed samples lie far from all the other samples and are thus all classified as outliers.

Going Further

The snippets above are meant only to give a brief introduction to some of the most popular techniques in machine learning so you can decide whether this kind of analysis might be useful in your research. If it seems like machine learning will be important in your work, you may want to check out some of the resources listed below (arranged roughly from least to most technical):

Key Points

  • Preparing data in the right format is often the hardest task.

  • Machine learning provides methods for tasks such as dimensionality reduction, clustering, classification, and anomaly detection.

  • Having good visualizations is crucial for interpretation.