Introduction to Python for Data Science

Reading and Working with Tabular Data

Overview

Teaching: 50 min
Exercises: 40 min
Questions
  • What is a Pandas data frame?

  • How do I get an overview of my tabular data?

  • How do I read tabular data in different formats?

  • How do I access subsets of a data frame?

  • How do I calculate simple statistics like the mean?

Objectives
  • Read data in various formats.

  • Select individual values and subsections from data.

  • Transform a data frame from messy to tidy data.

  • Understand common data workflows (ETL) and strategies (split-apply-combine).

Read data

We use Pandas for reading, writing, and handling tabular data in Python as it is the de facto standard tool to do so.

Just as np is by convention the alias for numpy, so is pd for pandas.

import pandas as pd

We start by reading some data from a comma-separated values (CSV) file.

growth = pd.read_csv("data/yeast-growth.csv")

Inspect your data

Depending on your current Python environment, a pandas.DataFrame object will automatically be displayed as a plain-text table or as a prettier HTML table.

growth
    well  timepoint     od concentration_level  concentration
0      a          1  0.017                 low           0.01
1      b          1  0.017                 low           0.03
2      c          1  0.018              medium           1.00
3      d          1  0.017              medium           3.00
4      e          1  0.017              medium          30.00
..   ...        ...    ...                 ...            ...
450    c         65  0.228              medium           1.00
451    d         65  0.221              medium           3.00
452    e         65  0.065              medium          30.00
453    f         65  0.035                high         100.00
454    g         65  0.031                high         300.00

[455 rows x 5 columns]

By default, the output is truncated so as not to show too many rows and columns. We can inspect particular parts of a data frame using dedicated methods.

The beginning of a data frame can be accessed with the head method.

growth.head()
  well  timepoint     od concentration_level  concentration
0    a          1  0.017                 low           0.01
1    b          1  0.017                 low           0.03
2    c          1  0.018              medium           1.00
3    d          1  0.017              medium           3.00
4    e          1  0.017              medium          30.00

Correspondingly, the end of a data frame can be accessed with the tail method.

growth.tail()
    well  timepoint     od concentration_level  concentration
450    c         65  0.228              medium            1.0
451    d         65  0.221              medium            3.0
452    e         65  0.065              medium           30.0
453    f         65  0.035                high          100.0
454    g         65  0.031                high          300.0

You can also ask for a random sample of rows. This is great for inspection since you get a chance to see rows from different parts of the data frame. It can also be convenient to develop an analysis with a subset of your data first because it will be faster (this matters if you work with tables that contain millions of rows).

growth.sample(10)
    well  timepoint     od concentration_level  concentration
332    d         48  0.120              medium           3.00
263    e         38  0.046              medium          30.00
139    g         20  0.032                high         300.00
376    f         54  0.036                high         100.00
415    c         60  0.231              medium           1.00
424    e         61  0.063              medium          30.00
392    a         57  0.234                 low           0.01
142    c         21  0.049              medium           1.00
312    e         45  0.051              medium          30.00
12     f          2  0.015                high         100.00

Notice the randomly drawn index labels.

Summaries

Looking at the actual values in your data is important for developing an understanding of them, especially when you work with new data. However, to get an overview of your data, there are some useful summary methods.

If you just want to know the dimensions of your data frame, you can do so with the shape attribute.

growth.shape
(455, 5)

As you will see later on, it is sometimes necessary to look at the data types that pandas has inferred from your table.

growth.dtypes
well                    object
timepoint                int64
od                     float64
concentration_level     object
concentration          float64
dtype: object

The info method conveniently combines information on the shape and types in your data frame. In addition, it tells you about the index used and the memory consumption.

growth.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 455 entries, 0 to 454
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   well                 455 non-null    object 
 1   timepoint            455 non-null    int64  
 2   od                   455 non-null    float64
 3   concentration_level  455 non-null    object 
 4   concentration        455 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 17.9+ KB

Instead of summarizing the data frame object itself, you can get a high-level overview of the statistical distribution of your data.

growth.describe()
        timepoint          od  concentration
count  455.000000  455.000000     455.000000
mean    33.000000    0.077193      62.005714
std     18.782314    0.068110     102.928567
min      1.000000    0.015000       0.010000
25%     17.000000    0.032000       0.030000
50%     33.000000    0.042000       3.000000
75%     49.000000    0.099500     100.000000
max     65.000000    0.237000     300.000000

By default, this will only describe numeric columns. You can also get an overview of all columns, including text and categorical columns (generally of type object).

growth.describe(include="all")
       well   timepoint          od concentration_level  concentration
count   455  455.000000  455.000000                 455     455.000000
unique    7         NaN         NaN                   3            NaN
top       a         NaN         NaN              medium            NaN
freq     65         NaN         NaN                 195            NaN
mean    NaN   33.000000    0.077193                 NaN      62.005714
std     NaN   18.782314    0.068110                 NaN     102.928567
min     NaN    1.000000    0.015000                 NaN       0.010000
25%     NaN   17.000000    0.032000                 NaN       0.030000
50%     NaN   33.000000    0.042000                 NaN       3.000000
75%     NaN   49.000000    0.099500                 NaN     100.000000
max     NaN   65.000000    0.237000                 NaN     300.000000

The NaN values displayed in the output are simply Pandas’ representation of missing values. Just think of them as empty cells.
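
If you want to check for missing values explicitly, you can count them per column. A minimal sketch, assuming the growth frame from above (every count should be zero here, since info reported 455 non-null values in each column):

growth.isna().sum()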

Wording

I try to use data frame when talking about the object for computation and table when talking about data more generally, but I might mix them up since in my mind they are the same concept.

Tangent: Tidy Data

Hadley Wickham coined the concept of tidy data for statistics. What is tidy data? In short: every variable forms a column, every observation forms a row, and every type of observational unit forms its own table.

You can also dive into the original publication from 2014, but it suffices to say that a tidy data structure is more intuitive for data visualization (plotting) and statistical analysis.
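
As a minimal sketch of what this means in practice (a tiny made-up table, not the lesson data): melt turns a wide layout with one column per time point into a long, tidy layout with one measurement per row.

wide = pd.DataFrame({"well": ["a", "b"], "t1": [0.017, 0.018], "t2": [0.021, 0.025]})
wide.melt(id_vars="well", var_name="timepoint", value_name="od")
  well timepoint     od
0    a        t1  0.017
1    b        t1  0.018
2    a        t2  0.021
3    b        t2  0.025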

Reading files from different locales

Depending on what country’s standard your computer is set to (the ‘locale’), software such as Excel will use different characters to separate fields. For example, the default on a computer with the DK locale is to use ; to separate fields and , to mark the decimal point. Try finding the right arguments to pandas.read_table to get something sensible out of "data/example-uk.txt" and "data/example-dk.txt". Use the inspection methods from before to see how pandas reads the data.

UK

Solution

uk_data = pd.read_table("data/example-uk.txt")
uk_data
      name;height;age;income
0       Ian;183;27;12,000.01
1     Peter;162;28;11,000.50
2  Bernhard;173;30;10,000.00
3    Steven;163;32;12,500.00
uk_data = pd.read_table("data/example-uk.txt", sep=";")
uk_data
       name  height  age     income
0       Ian     183   27  12,000.01
1     Peter     162   28  11,000.50
2  Bernhard     173   30  10,000.00
3    Steven     163   32  12,500.00
uk_data.dtypes
name      object
height     int64
age        int64
income    object
dtype: object
uk_data = pd.read_table("data/example-uk.txt", sep=";", thousands=",")
uk_data
       name  height  age    income
0       Ian     183   27  12000.01
1     Peter     162   28  11000.50
2  Bernhard     173   30  10000.00
3    Steven     163   32  12500.00
uk_data.dtypes
name       object
height      int64
age         int64
income    float64
dtype: object

DK

Solution

dk_data = pd.read_table("data/example-dk.txt")
dk_data
     name;height;age;income
0       Ian;183;27;12000,01
1     Peter;162;28;11100,50
2  Bernhard;173;30;11000,00
3    Steven;163;32;12500,00
dk_data = pd.read_table("data/example-dk.txt", sep=";")
dk_data
       name  height  age    income
0       Ian     183   27  12000,01
1     Peter     162   28  11100,50
2  Bernhard     173   30  11000,00
3    Steven     163   32  12500,00
dk_data.dtypes
name      object
height     int64
age        int64
income    object
dtype: object
dk_data = pd.read_table("data/example-dk.txt", sep=";", decimal=",")
dk_data
       name  height  age    income
0       Ian     183   27  12000.01
1     Peter     162   28  11100.50
2  Bernhard     173   30  11000.00
3    Steven     163   32  12500.00
dk_data.dtypes
name       object
height      int64
age         int64
income    float64
dtype: object

Other data formats

Excel

When you read data from spreadsheets (.xls, .xlsx, .ods), you do not need to worry about field separators or number formatting (but you need extra packages such as xlrd, openpyxl, or odfpy).

excel_data = pd.read_excel("data/example-dk.xlsx")
excel_data
       name  height  age    income
0       Ian     183   27  12000.01
1     Peter     162   28  11100.50
2  Bernhard     173   30  11000.00
3    Steven     163   32  12500.00

However, be careful with Excel’s own interpretation of strings as dates when you load tabular files in Excel itself (see Gene name errors are widespread in the scientific literature and Guidelines for human gene nomenclature).

Pandas will try to detect the right dependency for opening a spreadsheet.

date_data = pd.read_excel("data/example-datetime.ods")
date_data
        date      time
0 2020-03-02  03:34:23
1 2021-02-02  02:24:00
2 2021-09-01  16:01:22

Common date and time formats are automatically detected and converted to suitable Python data types.

date_data.dtypes
date    datetime64[ns]
time            object
dtype: object

You can conveniently combine multiple columns into one datetime object.

pd.read_excel("data/example-datetime.ods", parse_dates={"datetime": ["date", "time"]})
             datetime
0 2020-03-02 03:34:23
1 2021-02-02 02:24:00
2 2021-09-01 16:01:22

Further formats

There are many more formats that Pandas can read data from. Just type pd.read_ and press tab to see the list of auto-completion options; notable ones include pd.read_json, pd.read_parquet, pd.read_sql, and pd.read_html. Many readers, including pd.read_csv, also accept a URL instead of a local file path:

pd.read_csv(
    "https://raw.githubusercontent.com/data-science-for-biotech/python-pandas-viz-ml/gh-pages/_episodes_rmd/data/yeast-growth.csv"
)
    well  timepoint     od concentration_level  concentration
0      a          1  0.017                 low           0.01
1      b          1  0.017                 low           0.03
2      c          1  0.018              medium           1.00
3      d          1  0.017              medium           3.00
4      e          1  0.017              medium          30.00
..   ...        ...    ...                 ...            ...
450    c         65  0.228              medium           1.00
451    d         65  0.221              medium           3.00
452    e         65  0.065              medium          30.00
453    f         65  0.035                high         100.00
454    g         65  0.031                high         300.00

[455 rows x 5 columns]
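
Writing works analogously through the to_* methods. As a sketch with a hypothetical file name (the columnar Parquet format additionally requires pyarrow or fastparquet):

growth.to_parquet("data/yeast-growth.parquet")
pd.read_parquet("data/yeast-growth.parquet").shape
(455, 5)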

Accessing data

Let’s look at our old friend the growth data table again.

growth
    well  timepoint     od concentration_level  concentration
0      a          1  0.017                 low           0.01
1      b          1  0.017                 low           0.03
2      c          1  0.018              medium           1.00
3      d          1  0.017              medium           3.00
4      e          1  0.017              medium          30.00
..   ...        ...    ...                 ...            ...
450    c         65  0.228              medium           1.00
451    d         65  0.221              medium           3.00
452    e         65  0.065              medium          30.00
453    f         65  0.035                high         100.00
454    g         65  0.031                high         300.00

[455 rows x 5 columns]

By label

We can access individual columns (pandas.Series) by their names as strings.

growth["od"]
0      0.017
1      0.017
2      0.018
3      0.017
4      0.017
       ...  
450    0.228
451    0.221
452    0.065
453    0.035
454    0.031
Name: od, Length: 455, dtype: float64

Notice how the display of a series differs from that of a data frame.

We can make powerful label-based selections with the loc attribute.

growth.loc[:, "od"]
0      0.017
1      0.017
2      0.018
3      0.017
4      0.017
       ...  
450    0.228
451    0.221
452    0.065
453    0.035
454    0.031
Name: od, Length: 455, dtype: float64

Our data frames have two dimensions: the rows and columns. Here, we specify that we want all rows (:) and a single column ("od"). We can also select one or more columns by passing a list, which will always return a data frame.

growth.loc[:, ["od"]]
        od
0    0.017
1    0.017
2    0.018
3    0.017
4    0.017
..     ...
450  0.228
451  0.221
452  0.065
453  0.035
454  0.031

[455 rows x 1 columns]

We use at to access an individual element by its row and column labels.

growth.at[451, "od"]
0.221

By index

We might be tempted to access a column by its index position directly.

growth[2]
KeyError: 2

This does not work: only labels are allowed here. Instead, we use iloc for integer location-based indexing.

growth.iloc[:, 2]
0      0.017
1      0.017
2      0.018
3      0.017
4      0.017
       ...  
450    0.228
451    0.221
452    0.065
453    0.035
454    0.031
Name: od, Length: 455, dtype: float64

Again, we can select one or more columns as a data frame.

growth.iloc[:, [2]]
        od
0    0.017
1    0.017
2    0.018
3    0.017
4    0.017
..     ...
450  0.228
451  0.221
452  0.065
453  0.035
454  0.031

[455 rows x 1 columns]

We can also access individual elements.

growth.iat[451, 2]
0.221

By slices

We can also define label-based ranges over rows and columns.

growth.loc[:, "od":"concentration"]
        od concentration_level  concentration
0    0.017                 low           0.01
1    0.017                 low           0.03
2    0.018              medium           1.00
3    0.017              medium           3.00
4    0.017              medium          30.00
..     ...                 ...            ...
450  0.228              medium           1.00
451  0.221              medium           3.00
452  0.065              medium          30.00
453  0.035                high         100.00
454  0.031                high         300.00

[455 rows x 3 columns]
growth.loc[10:15, :]
   well  timepoint     od concentration_level  concentration
10    d          2  0.015              medium           3.00
11    e          2  0.019              medium          30.00
12    f          2  0.015                high         100.00
13    g          2  0.016                high         300.00
14    a          3  0.016                 low           0.01
15    b          3  0.015                 low           0.03

loc

Be careful: in label-based selection with loc, ranges include the end point, unlike numeric ranges in Python in general. Also be aware that, since we have an integer index here, the labels coincide with the index positions.

growth.iloc[10:15, :]
   well  timepoint     od concentration_level  concentration
10    d          2  0.015              medium           3.00
11    e          2  0.019              medium          30.00
12    f          2  0.015                high         100.00
13    g          2  0.016                high         300.00
14    a          3  0.016                 low           0.01

N.B.: Index location ranges exclude the end point, just like regular Python slices.

By conditions

Pandas inherits the conditional selection of elements from numpy. The conditions are essentially Boolean series or data frames known as masks.

growth[growth["od"] > 0.2]
    well  timepoint     od concentration_level  concentration
308    a         45  0.201                 low           0.01
315    a         46  0.208                 low           0.01
322    a         47  0.217                 low           0.01
329    a         48  0.227                 low           0.01
330    b         48  0.205                 low           0.03
336    a         49  0.229                 low           0.01
337    b         49  0.213                 low           0.03
343    a         50  0.232                 low           0.01
344    b         50  0.219                 low           0.03
350    a         51  0.233                 low           0.01
351    b         51  0.224                 low           0.03
357    a         52  0.237                 low           0.01
358    b         52  0.229                 low           0.03
364    a         53  0.236                 low           0.01
365    b         53  0.230                 low           0.03
366    c         53  0.207              medium           1.00
371    a         54  0.235                 low           0.01
372    b         54  0.231                 low           0.03
373    c         54  0.213              medium           1.00
378    a         55  0.237                 low           0.01
379    b         55  0.230                 low           0.03
380    c         55  0.218              medium           1.00
385    a         56  0.232                 low           0.01
386    b         56  0.231                 low           0.03
387    c         56  0.226              medium           1.00
392    a         57  0.234                 low           0.01
393    b         57  0.233                 low           0.03
394    c         57  0.230              medium           1.00
399    a         58  0.232                 low           0.01
400    b         58  0.228                 low           0.03
401    c         58  0.231              medium           1.00
406    a         59  0.230                 low           0.01
407    b         59  0.228                 low           0.03
408    c         59  0.230              medium           1.00
413    a         60  0.228                 low           0.01
414    b         60  0.225                 low           0.03
415    c         60  0.231              medium           1.00
420    a         61  0.227                 low           0.01
421    b         61  0.225                 low           0.03
422    c         61  0.230              medium           1.00
427    a         62  0.226                 low           0.01
428    b         62  0.222                 low           0.03
429    c         62  0.230              medium           1.00
430    d         62  0.205              medium           3.00
434    a         63  0.229                 low           0.01
435    b         63  0.225                 low           0.03
436    c         63  0.230              medium           1.00
437    d         63  0.212              medium           3.00
441    a         64  0.224                 low           0.01
442    b         64  0.221                 low           0.03
443    c         64  0.231              medium           1.00
444    d         64  0.217              medium           3.00
448    a         65  0.224                 low           0.01
449    b         65  0.220                 low           0.03
450    c         65  0.228              medium           1.00
451    d         65  0.221              medium           3.00

These masks can be combined using Boolean logic, making them very powerful tools for selecting just the right data.

growth[(growth["od"] > 0.2) & (growth["concentration_level"] == "medium")]
    well  timepoint     od concentration_level  concentration
366    c         53  0.207              medium            1.0
373    c         54  0.213              medium            1.0
380    c         55  0.218              medium            1.0
387    c         56  0.226              medium            1.0
394    c         57  0.230              medium            1.0
401    c         58  0.231              medium            1.0
408    c         59  0.230              medium            1.0
415    c         60  0.231              medium            1.0
422    c         61  0.230              medium            1.0
429    c         62  0.230              medium            1.0
430    d         62  0.205              medium            3.0
436    c         63  0.230              medium            1.0
437    d         63  0.212              medium            3.0
443    c         64  0.231              medium            1.0
444    d         64  0.217              medium            3.0
450    c         65  0.228              medium            1.0
451    d         65  0.221              medium            3.0

You can combine masks with loc to also sub-select columns.

growth.loc[(growth["od"] > 0.2) & (growth["concentration_level"] == "medium"), ["od", "timepoint"]]
        od  timepoint
366  0.207         53
373  0.213         54
380  0.218         55
387  0.226         56
394  0.230         57
401  0.231         58
408  0.230         59
415  0.231         60
422  0.230         61
429  0.230         62
430  0.205         62
436  0.230         63
437  0.212         63
443  0.231         64
444  0.217         64
450  0.228         65
451  0.221         65

Sometimes these conditions can be tedious to write; the query method offers a shorter form. Be careful: inside query you write Python-style conditions (and, or, not) rather than numpy-style mask operators (&, |, ~).

growth.query("od > 0.2 and concentration_level == 'medium'")
    well  timepoint     od concentration_level  concentration
366    c         53  0.207              medium            1.0
373    c         54  0.213              medium            1.0
380    c         55  0.218              medium            1.0
387    c         56  0.226              medium            1.0
394    c         57  0.230              medium            1.0
401    c         58  0.231              medium            1.0
408    c         59  0.230              medium            1.0
415    c         60  0.231              medium            1.0
422    c         61  0.230              medium            1.0
429    c         62  0.230              medium            1.0
430    d         62  0.205              medium            3.0
436    c         63  0.230              medium            1.0
437    d         63  0.212              medium            3.0
443    c         64  0.231              medium            1.0
444    d         64  0.217              medium            3.0
450    c         65  0.228              medium            1.0
451    d         65  0.221              medium            3.0

query is fancy enough to reference variables from the surrounding scope when you prefix them with @.

threshold = 0.2

growth.query("od > @threshold and concentration_level == 'medium'")
    well  timepoint     od concentration_level  concentration
366    c         53  0.207              medium            1.0
373    c         54  0.213              medium            1.0
380    c         55  0.218              medium            1.0
387    c         56  0.226              medium            1.0
394    c         57  0.230              medium            1.0
401    c         58  0.231              medium            1.0
408    c         59  0.230              medium            1.0
415    c         60  0.231              medium            1.0
422    c         61  0.230              medium            1.0
429    c         62  0.230              medium            1.0
430    d         62  0.205              medium            3.0
436    c         63  0.230              medium            1.0
437    d         63  0.212              medium            3.0
443    c         64  0.231              medium            1.0
444    d         64  0.217              medium            3.0
450    c         65  0.228              medium            1.0
451    d         65  0.221              medium            3.0

Views

Selecting a subset of your data frame by any of the previous means may create a view on that part of the data frame rather than a copy. This is faster, since the data are not copied, and it lets you change elements in a part of the table. Be careful: this can lead to unintended consequences.

growth.loc[0:5, :] = None
growth.head(10)
   well  timepoint     od concentration_level  concentration
0  None        NaN    NaN                None            NaN
1  None        NaN    NaN                None            NaN
2  None        NaN    NaN                None            NaN
3  None        NaN    NaN                None            NaN
4  None        NaN    NaN                None            NaN
5  None        NaN    NaN                None            NaN
6     g        1.0  0.015                high         300.00
7     a        2.0  0.015                 low           0.01
8     b        2.0  0.018                 low           0.03
9     c        2.0  0.021              medium           1.00
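
If you want to derive a subset to work with, without any risk of modifying the original table, request an explicit copy. A minimal sketch (the variable name is arbitrary):

wells_a = growth.loc[growth["well"] == "a", :].copy()  # changes to wells_a do not propagate back to growth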

A comprehensive tutorial can be found in the Pandas documentation.

Working with data

Just like numpy, Pandas offers basic statistical methods out-of-the-box. We can apply them to single columns,

growth["od"].min()
0.015

multiple columns,

growth[["od", "concentration"]].max()
od                 0.237
concentration    300.000
dtype: float64

or even rows.

growth[["od", "concentration"]].mean(axis=1)
0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
         ...   
450      0.6140
451      1.6105
452     15.0325
453     50.0175
454    150.0155
Length: 455, dtype: float64
growth[["od", "concentration"]].std(axis=1)
0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
          ...    
450      0.545886
451      1.965050
452     21.167241
453     70.685929
454    212.110114
Length: 455, dtype: float64

Split-apply-combine

Another powerful operation and a very common strategy for data transformation is split, apply, combine. This pattern is provided by pandas in the form of the groupby method.

The split-apply-combine strategy: divide the data into chunks, run a given (reducing) function on each chunk separately, and finally combine the chunks again.

growth.groupby("well")[["od", "concentration"]].mean()
            od  concentration
well                         
a     0.123500           0.01
b     0.117969           0.03
c     0.109375           1.00
d     0.088328           3.00
e     0.043891          30.00
f     0.033937         100.00
g     0.029738         300.00
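
You are not limited to a single statistic per group. As a sketch, agg computes several summaries in one call:

growth.groupby("concentration_level")["od"].agg(["mean", "std", "max"])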

Cleaning messy data

Can you transform this messy table into a tidy one? You may need the melt function to change the table layout.

messy = pd.read_csv("data/yeast-growth-messy.csv")

Solution

messy
      V1      V2      V3     V4     V5     V6  ...    V63    V64    V65    V66    V67    V68
0  Test1     low    0.01  0.017  0.015  0.016  ...  0.228  0.227  0.226  0.229  0.224  0.224
1  Test1     low    0.03  0.017  0.018  0.015  ...  0.225  0.225  0.222  0.225  0.221  0.220
2  Test1  medium    1.00  0.018  0.021  0.016  ...  0.231  0.230  0.230  0.230  0.231  0.228
3  Test1  medium    3.00  0.017  0.015  0.017  ...  0.190  0.196  0.205  0.212  0.217  0.221
4  Test1  medium   30.00  0.017  0.019  0.016  ...  0.062  0.063  0.062  0.063  0.063  0.065
5  Test1    high  100.00  0.016  0.015  0.015  ...  0.037  0.035  0.035  0.034  0.035  0.035
6  Test1    high  300.00  0.015  0.016  0.015  ...  0.031  0.032  0.031  0.031  0.030  0.031

[7 rows x 68 columns]
messy.columns
Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31',
       'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41',
       'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51',
       'V52', 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61',
       'V62', 'V63', 'V64', 'V65', 'V66', 'V67', 'V68'],
      dtype='object')
del messy["V1"]
from string import ascii_lowercase
messy["well"] = list(ascii_lowercase[:len(messy)])
long_data = pd.melt(messy, id_vars=["V2", "V3", "well"])
long_data
         V2      V3 well variable  value
0       low    0.01    a       V4  0.017
1       low    0.03    b       V4  0.017
2    medium    1.00    c       V4  0.018
3    medium    3.00    d       V4  0.017
4    medium   30.00    e       V4  0.017
..      ...     ...  ...      ...    ...
450  medium    1.00    c      V68  0.228
451  medium    3.00    d      V68  0.221
452  medium   30.00    e      V68  0.065
453    high  100.00    f      V68  0.035
454    high  300.00    g      V68  0.031

[455 rows x 5 columns]
long_data["variable"].replace(dict(zip(long_data["variable"].unique(), range(1, len(long_data["variable"].unique()) + 1))), inplace=True)
long_data.rename(columns={"V2": "concentration_level", "V3": "concentration", "variable": "timepoint"}, inplace=True)
long_data
    concentration_level  concentration well  timepoint  value
0                   low           0.01    a          1  0.017
1                   low           0.03    b          1  0.017
2                medium           1.00    c          1  0.018
3                medium           3.00    d          1  0.017
4                medium          30.00    e          1  0.017
..                  ...            ...  ...        ...    ...
450              medium           1.00    c         65  0.228
451              medium           3.00    d         65  0.221
452              medium          30.00    e         65  0.065
453                high         100.00    f         65  0.035
454                high         300.00    g         65  0.031

[455 rows x 5 columns]

Tangent: Extract, Transform, Load (ETL)

The process we have just seen is known as extract, transform, load or simply ETL. It describes the ubiquitous data workflow of reading data (extract with, for example, pd.read_csv), cleaning up the data (transform) as we have done with the messy growth curves above, and storing the tidy data again for subsequent use (load with, for example, to_excel).
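
As a sketch of the final load step (hypothetical output file name; writing .xlsx requires openpyxl), the tidied growth data from the exercise above could be stored like this:

long_data.to_excel("data/yeast-growth-tidy.xlsx", index=False)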

Outlook

Key Points

  • Use pandas.read_* and pandas.DataFrame.to_* to import / export data.

  • The method info describes the data frame object.

  • The method describe summarizes value distributions in columns.

  • You can use labels or index locations to select both subsets and elements from your data frame.

  • Selection using conditions is very powerful.

  • Selections create views on your original data.

  • Use mean, max, min, and others to calculate simple statistics.

  • Use split-apply-combine to calculate statistics within groups in a data frame.


Morning Break

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

Key Points


Visualize Your Data

Overview

Teaching: 45 min
Exercises: 30 min
Questions
  • What tools exist to plot data in Python?

  • How do I make a basic plot?

  • What visuals are available?

  • How can I best visualize groups of data?

Objectives
  • Get an introduction to Pandas plotting.

  • Get a high-level overview on plotting in Python.

  • Learn about the grammar of graphics.

  • Learn to apply the grammar of graphics through altair.

  • Learn how to group data.

  • Learn how to use facets.

  • Develop an understanding of the rich visualization diversity that exists.

Native Pandas plotting

import pandas as pd

Let’s work with the familiar growth data from last episode.

growth = pd.read_csv("data/yeast-growth.csv")

Pandas provides a few quick plotting methods. It uses matplotlib under the hood to generate graphics.

import matplotlib.pyplot as plt

The default plot method draws lines for every numeric column in the data frame using the index as the horizontal axis.

growth.plot()

plot of chunk unnamed-chunk-6

We can specify a different kind of plot, choosing columns explicitly.

growth.plot.scatter("timepoint", "od")

plot of chunk unnamed-chunk-7

Other plot types include the histogram. Here you can also see that we use matplotlib directly in order to customize the plot further.

growth["concentration"].plot.hist(bins=30)
<AxesSubplot:ylabel='Frequency'>
plt.xlabel("concentration")
Text(0.5, 0, 'concentration')
plt.show()

plot of chunk unnamed-chunk-8

Another type of plot for showing distributions is the box plot which, again, by default uses numeric columns.

growth.plot.box()

plot of chunk unnamed-chunk-9

My opinion on native Pandas plotting:

State of plotting in Python

https://pyviz.org/overviews/index.html

Grammar of Graphics

The grammar of graphics was described by Leland Wilkinson. It was later updated by Hadley Wickham with ideas developed during the creation of his famous R plotting software ggplot2. This blog has a decent summary of both. In loose terms, the idea is that every visualization has the following ingredients: data, a coordinate system, an encoding that maps variables in the data to visual channels (such as position, colour, or shape), visual marks that draw the encoded data (such as points, lines, or bars), and optionally statistical transformations and facets.

This all sounds a bit abstract. Let’s dive into examples.

Introduction to altair

We need the altair package, of course.

import altair as alt
width = 800
height = 600

Here we create an altair Chart object with our tidy growth data frame. The default coordinate system is Cartesian with two axes. We encode our data and specify that the time points should describe our x-axis coordinate and optical density the y-axis coordinate. Finally, we want this two-dimensional data to be displayed by the visual mark points.

The default output size is a bit small in my opinion so we configure it to be a bit larger.

alt.Chart(growth).encode(
    x="timepoint", y="od"
).mark_point().configure_view(continuousWidth=width, continuousHeight=height)

Basic point mark plot

We can, at any point, store our current chart specification in a variable and later apply further specifications to it.

base = alt.Chart(growth).encode(x="timepoint", y="od")

How about using lines as the visual mark?

base.mark_line().configure_view(continuousWidth=width, continuousHeight=height)

Ungrouped line mark plot

The above plot has the same problems as the Pandas default plot. Our data was recorded per well. We can use this variable to encode a third dimension that groups our data. Common options are colour or shape/linetype.

base.encode(color="well").mark_line(point=True).configure_view(
    continuousWidth=width, continuousHeight=height
)

Line mark plot grouped by color

This looks much better already. We also changed the visual mark to be a point and line.

Instead of defining separate groups for each growth time course, we can separate each individual time course into its own plot. This is known as facetting. Using facets can be handy but information is more spread out. Comparing different time courses is harder in this format.

base.mark_line(point=True).facet("well", columns=3)

Faceted line mark plot

Let’s look again at the individual data points grouped by well.

chart = alt.Chart(growth).encode(x="timepoint", y="od", color="well").mark_point()
chart.configure_view(continuousWidth=width, continuousHeight=height)

Colored point mark plot

Instead of connecting each consecutive data point with a straight line, we can apply a smoothing data transformation. In the following example, we use a locally estimated scatterplot smoothing (LOESS) transformation.

(
    chart
    + chart.transform_loess(
        "timepoint", "od", groupby=["well"], bandwidth=0.15
    ).mark_line()
).configure_view(continuousWidth=width, continuousHeight=height)

LOESS transform plot

We can now combine the concepts that we have learnt to introduce a fourth dimension, the measured concentration. Since the concentration is a measured quantity, we use a continuous colour scale to represent it. We still group our data by wells and apply smoothing. As an example, we vary the point shape.

Please note that, for the first time, we explicitly encode the type of the variables. Q means quantitative (continuous) and N means nominal (categorical).

fancy_chart = alt.Chart(growth).encode(
    x="timepoint:Q",
    y="od:Q",
    detail="well:N",
    color=alt.Color("concentration:Q", scale=alt.Scale(type="log", scheme="greenblue")),
)
(
    fancy_chart.transform_loess(
        "timepoint", "od", groupby=["well", "concentration"], bandwidth=0.1
    ).mark_line()
    + fancy_chart.encode(shape="well:N").mark_point()
).configure_view(continuousWidth=width, continuousHeight=height)

Fancy plot with grouping, LOESS transform, and color by concentration

Other visual marks

Naturally, altair offers more than just lines and points.

An interactive histogram:

brush = alt.selection_interval(encodings=["x"])

base = (
    alt.Chart(growth).mark_bar().encode(y="count():Q").properties(width=600, height=100)
)

alt.vconcat(
    base.encode(
        alt.X(
            "od:Q", bin=alt.Bin(maxbins=30, extent=brush), scale=alt.Scale(domain=brush)
        )
    ),
    base.encode(
        alt.X("od:Q", bin=alt.Bin(maxbins=30)),
    ).add_selection(brush),
)

Or how about a box plot of optical density per well?

alt.Chart(growth).mark_boxplot().encode(x="well:O", y="od:Q").configure_view(continuousWidth=width, continuousHeight=height)

A boxplot showing the distribution of optical density grouped by well.

Many more examples exist.

Key Points

  • Pandas provides quick ways to create simple visualizations.

  • A layered grammar of graphics implementation provides a structured approach to plotting.

  • A good implementation can make expressing complex visualizations straightforward.


Lunch Break

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

Key Points


Machine Learning

Overview

Teaching: 70 min
Exercises: 20 min
Questions
  • What is machine learning?

  • What kind of tasks can I apply machine learning to?

  • How do I perform machine learning with scikit-learn?

  • How do I interpret my results?

Objectives
  • Get a rough overview on machine learning.

  • Clean up a real world data set.

  • Get an introduction to scikit-learn.

  • Apply dimensionality reduction, clustering, classification, and anomaly detection.

  • Visualize results in diverse ways.

Episode Credit

This episode is adapted and enhanced from https://github.com/YaleDHLab/lab-workshops/tree/master/machine-learning.

As a real-world data set for practicing machine learning tasks, we will use relative abundances of gut microbiota. In a recent publication

Wibowo, Marsha C., Zhen Yang, Maxime Borry, Alexander Hübner, Kun D. Huang, Braden T. Tierney, Samuel Zimmerman, et al. 2021. “Reconstruction of Ancient Microbial Genomes from the Human Gut.” Nature, May. https://doi.org/10.1038/s41586-021-03532-0.

modern industrial and pre-industrial fecal microbiota were compared to some ancient samples (palaeofaeces).

ETL relative microbial abundances

import pandas as pd

While the following command is running, take a look at the spreadsheet itself.

raw = pd.read_excel(
    "data/41586_2021_3532_MOESM6_ESM.xlsx",
    sheet_name="Tab1-MetaPhlAn2_output",
    header=None,
)

What did we extract from the spreadsheet?

raw.head(10)
                                                 0           1    ...       799       800
0                                                NaN  Paleofeces  ...      Soil      Soil
1                                                NaN      UT43.2  ...  1043.4.1  3567.1.1
2                                         k__Archaea     7.18607  ...         0    0.0049
3                        k__Archaea|p__Euryarchaeota     7.18607  ...         0    0.0049
4        k__Archaea|p__Euryarchaeota|c__Archaeoglobi           0  ...         0         0
5  k__Archaea|p__Euryarchaeota|c__Archaeoglobi|o_...           0  ...         0         0
6  k__Archaea|p__Euryarchaeota|c__Archaeoglobi|o_...           0  ...         0         0
7  k__Archaea|p__Euryarchaeota|c__Archaeoglobi|o_...           0  ...         0         0
8        k__Archaea|p__Euryarchaeota|c__Halobacteria           0  ...         0         0
9  k__Archaea|p__Euryarchaeota|c__Halobacteria|o_...           0  ...         0         0

[10 rows x 801 columns]

This is the output of a particular software tool (MetaPhlAn2) for estimating microbial abundances from metagenomic sequencing (MGS) data. In the first column, the single letter followed by a double underscore denotes the taxonomic rank. Ranks are separated by | and together describe a lineage.
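
As a small, made-up illustration of that naming scheme, the last entry of a lineage carries the rank letter and the taxon name separated by a double underscore:

lineage = "k__Archaea|p__Euryarchaeota|c__Halobacteria"
lineage.split("|")[-1].split("__", 1)
['c', 'Halobacteria']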

The first two rows in the table describe the type of sample and a sample identifier.

raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3636 entries, 0 to 3635
Columns: 801 entries, 0 to 800
dtypes: object(801)
memory usage: 22.2+ MB

Due to the sample identifiers present in every column, all columns are of type object rather than a numeric type.

Let’s make this tidy!

Tidying up

Sample meta information

We will split off the information that describes the samples. Along the way, we get to know another Pandas feature, the pipe method, which allows us to chain function calls. In order to make full use of the chaining, we also need a small helper function that assigns a value to the data frame.

def set_element(df, row_idx, col_idx, value):
    result = df.copy()
    result.iat[row_idx, col_idx] = value
    return result

Our meta information is in the first two rows. We then set meaningful names in the currently empty fields. Next we set the first column as the row index. Then we transpose the table such that our two rows become two columns and reset the index of that transposed data frame such that we end up with a range index starting at zero.

observations = (
    raw.iloc[:2, :]
    .pipe(set_element, 0, 0, "label")
    .pipe(set_element, 1, 0, "sample")
    .set_index(0)
    .T.reset_index(drop=True)
)

The result looks as follows:

observations.head(10)
0           label   sample
0      Paleofeces   UT43.2
1      Paleofeces    AW107
2      Paleofeces    AW108
3      Paleofeces   AW110A
4      Paleofeces    Zape1
5      Paleofeces    Zape2
6      Paleofeces    Zape3
7      Paleofeces   UT30.3
8  Non-industrial  MEX3158
9  Non-industrial  MEX3160

Inspect the observations

Can you confirm that we haven’t lost any samples? Additionally, is there something interesting to be said about the data?

Solution

observations.shape
(800, 2)

The observations contain 800 rows, one per sample column in the raw table, so we have not lost any samples.

observations.describe()
0            label      sample
count          800         800
unique           4         800
top     Industrial  A34_02_1FE
freq           418           1

We see that there are four unique labels and that the label ‘Industrial’ is the most common.

observations["label"].value_counts()
Industrial        418
Non-industrial    371
Paleofeces          8
Soil                3
Name: label, dtype: int64

An overview of the number of samples per label.

Now that we have the sample meta information in a tidy format, it is time that we clean up the relative abundances.

Relative abundances

There is a lot of information contained in the first column. Let’s split this information into multiple columns. The taxonomic rank deserves its own column so that we can later filter on it. For convenience, we also keep the kingdom of every sample for later data filtering.

Our strategy is then: rename the first column to taxonomy, split each lineage on |, derive the kingdom from the first element of the lineage (the part after its __ prefix), and derive the rank and the taxon name by splitting the last element on __.

def transform_taxonomic_rank(df):
    result = df.rename(columns={0: "taxonomy"})
    lineage = result["taxonomy"].str.split("|")
    result["kingdom"] = (
        lineage.map(lambda element: element[0])
        .str.split("__", n=1, expand=True)
        .iloc[:, -1]
    )
    result[["rank", "taxonomy"]] = lineage.map(lambda element: element[-1]).str.split(
        "__", n=1, expand=True
    )
    return result
transformed = transform_taxonomic_rank(raw.iloc[2:, :])

Let’s see what our result looks like.

Inspect the transformed data frame

Briefly investigate the transformed data frame.

Solution

transformed.head(10)
                         taxonomy        1        2        3  ... 799     800  kingdom rank
2                         Archaea  7.18607  3.52607  0.10627  ...   0  0.0049  Archaea    k
3                   Euryarchaeota  7.18607  3.52607  0.10627  ...   0  0.0049  Archaea    p
4                    Archaeoglobi        0        0        0  ...   0       0  Archaea    c
5                 Archaeoglobales        0        0        0  ...   0       0  Archaea    o
6                Archaeoglobaceae        0        0        0  ...   0       0  Archaea    f
7   Archaeoglobaceae_unclassified        0        0        0  ...   0       0  Archaea    g
8                    Halobacteria        0        0        0  ...   0       0  Archaea    c
9                 Halobacteriales        0        0        0  ...   0       0  Archaea    o
10               Halobacteriaceae        0        0        0  ...   0       0  Archaea    f
11                  Halobacterium        0        0        0  ...   0       0  Archaea    g

[10 rows x 803 columns]
transformed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3634 entries, 2 to 3635
Columns: 803 entries, taxonomy to rank
dtypes: object(803)
memory usage: 22.3+ MB
transformed.describe()
              taxonomy       1       2       3  ...     799     800   kingdom  rank
count             3633  3634.0  3634.0  3634.0  ...  3634.0  3634.0      3633  3634
unique            3527   150.0   134.0   132.0  ...   123.0   292.0         4     9
top     Viruses_noname     0.0     0.0     0.0  ...     0.0     0.0  Bacteria     s
freq                 5  3312.0  3349.0  3342.0  ...  3399.0  3056.0      3020  1459

[4 rows x 803 columns]
transformed["kingdom"].value_counts()
Bacteria     3020
Viruses       479
Eukaryota      91
Archaea        43
Name: kingdom, dtype: int64
transformed["rank"].value_counts()
s               1459
t               1267
g                503
f                216
o                 98
c                 58
p                 28
k                  4
unclassified       1
Name: rank, dtype: int64

From our inspection of the data, we decide that it’s best to remove the abundance of unclassified sequencing reads. In addition, I don’t trust the read assignment to virus genomes and they don’t play a large role. We remove those as well.

Similarly to before, we then set the new columns as the row index, convert all abundances from object to float, transpose the data frame, and reset the index again so that we get a nice range starting from zero.

abundance = (
    transformed.query("kingdom != 'Viruses' and rank != 'unclassified'")
    .set_index(["kingdom", "rank", "taxonomy"])
    .astype(float)
    .T.reset_index(drop=True)
)

We have chosen three columns as the index here. That means we are creating an index with multiple levels (a pandas.MultiIndex). Multi-indices can be tricky to work with. We will just use them here without expecting a full understanding.

abundance.head(10)
kingdom   Archaea                ...               Eukaryota              
rank            k             p  ...                       s             t
taxonomy  Archaea Euryarchaeota  ... Enterocytozoon_bieneusi GCA_000209485
0         7.18607       7.18607  ...                     0.0           0.0
1         3.52607       3.52607  ...                     0.0           0.0
2         0.10627       0.10627  ...                     0.0           0.0
3         0.32794       0.32794  ...                     0.0           0.0
4         0.52044       0.52044  ...                     0.0           0.0
5         0.02790       0.02790  ...                     0.0           0.0
6         0.03779       0.03779  ...                     0.0           0.0
7         2.80318       2.80318  ...                     0.0           0.0
8         0.00000       0.00000  ...                     0.0           0.0
9         0.00000       0.00000  ...                     0.0           0.0

[10 rows x 3154 columns]
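
As a sketch, one convenient way to pull out all columns of a single rank from such a multi-level column index is a cross-section with xs (an alternative to the get_loc_level calls used below):

abundance.xs("p", axis=1, level="rank").head()  # all phylum-level columns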

Relative abundances are percentages of the total counts. They apply within each taxonomic rank, so adding up the relative abundances of a sample at one rank should come close to 100%. It may be lower since we removed viruses and unclassified reads.

import numpy as np

We can create a basic sanity check based on the above ideas. For each rank individually, the relative abundances should sum up to (at most) 100 in every sample.

for rank in abundance.columns.unique(level="rank"):
    view = abundance.loc[:, abundance.columns.get_loc_level(rank, level="rank")[0]]
    total = view.sum(axis=1)
    assert ((total < 100.0) | np.isclose(total, 100.0)).all()

How many variables (columns, features) do we actually have per rank?

for rank in abundance.columns.unique(level="rank"):
    print(rank, abundance.columns.get_loc_level(rank, level="rank")[0].sum())
k 3
p 27
c 57
o 92
f 184
g 443
s 1265
t 1083

Since we have the most variables at the species level and because I wouldn’t outright trust MGS’ ability to distinguish strains, we will use the species level for our machine learning tasks. We can always go to a higher taxonomic rank to test performance on more aggregate signals.

species_abundance = abundance.loc[
    :, abundance.columns.get_loc_level("s", level="rank")[0]
].copy()
species_abundance.shape
(800, 1265)

This was a lot of work, you may think, and we haven’t even done any machine learning yet! You are completely right about that feeling. According to a white paper by Google engineers, creating machine learning models is a tiny fraction of the work.

Image taken from a NIPS conference paper.

Introduction to Machine Learning

Andrew Ng, a prominent machine learning expert, has defined machine learning as “the science of getting computers to act without being explicitly programmed.” This workshop is meant to give a quick introduction to some of the techniques one can use to build algorithms that meet this criterion. Specifically, we will discuss the following sub-fields within machine learning: dimensionality reduction, clustering, classification, and anomaly detection.

Let’s dive in!

Dimensionality Reduction

Our relative abundance data set at the species level has a large number of variables (features in machine learning lingo). We have 1265 features and only 800 samples. Microbiome data are often also quite sparse, which means that most species are observed in only a small number of samples.

Such “high-dimensional” data sets can be quite hard to work with and reason about, not to mention impossible to visualize. High-dimensional data also pose specific challenges to many machine learning models (see The Curse of Dimensionality). To work around these challenges, it’s often helpful to reduce the number of dimensions required to express a given data set.

One popular way to reduce the dimensionality of a data set is to use a technique called Principal Component Analysis. PCA tries to find a lower dimensional representation of data points by projecting them down into a smaller dimensional space in a way that minimizes loss of information.

To get an intuition about PCA, suppose you have points in two dimensions, and you wish to reduce the dimensionality of your data set to a single dimension. To do so, you could find the center of the points then create a line $L$ with a random orientation that passes through that center. One can then project each point onto $L$ such that an imaginary line between the point and $L$ forms a right angle. Within this “projection”, each 2D point can be represented by just its position along the 1D line $L$, effectively giving us a 1D representation of the point’s position in its original space. Furthermore, we can use the difference between the largest and smallest values of points projected onto $L$ as a measure of the amount of “variance” or “spread” within the data captured by $L$. The greater this spread, the greater the amount of “signal” from the original data set is represented in the projection.

If one were to slowly rotate $L$ and continue measuring the difference between the greatest and smallest values on $L$ at each orientation, one could find the orientation of the projection line that minimizes information loss (maximizes spread). This line of minimal information loss is shown in pink below. Once that line is discovered, we can project all of our points onto that lower-dimensional embedding. See the red points below when the black line is collinear with the pink line:

For a beginner-friendly deep dive into the mechanics behind this form of dimensionality reduction, check out Josh Starmer's step-by-step guide to PCA.
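
As a minimal sketch of that idea on synthetic data (not the lesson’s data set), we can project correlated two-dimensional points onto the single direction of maximal spread with scikit-learn:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=123456)
points = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated 2-D cloud
line_positions = PCA(n_components=1).fit_transform(points)  # 1-D coordinates along the line L
line_positions.shape
(200, 1)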

What makes this kind of dimensionality reduction useful for research? There are two primary uses: (visual) data exploration and data analysis.

Exploratory data visualization with UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualization but also for general non-linear dimension reduction. The author Leland McInnes has a recorded presentation about how the method works in detail.

UMAP considers the distances between samples in a particular space (the manifold) and projects those distances into a Cartesian coordinate system, typically with two dimensions. That means we need to transform our relative abundances into a square distance matrix that contains the distance between every pair of samples.

In microbiome analyses, a common metric to compare samples is the Bray-Curtis dissimilarity. It is a $\beta$-diversity measure since we can use it to compare samples from different ecological niches. We use scikit-bio for a fast implementation of this measure.

from skbio.diversity import beta_diversity
bc_dm = beta_diversity(
    metric="braycurtis", counts=species_abundance, ids=observations["sample"]
)
print(bc_dm)
800x800 distance matrix
IDs:
'UT43.2', 'AW107', 'AW108', 'AW110A', 'Zape1', 'Zape2', 'Zape3', 'UT30.3', ...
Data:
[[0.         0.75103997 0.81117371 ... 0.97354678 0.98426476 0.95241561]
 [0.75103997 0.         0.50414618 ... 0.99686836 0.99895874 0.96373897]
 [0.81117371 0.50414618 0.         ... 0.9969896  0.99919453 0.97885096]
 ...
 [0.97354678 0.99686836 0.9969896  ... 0.         0.8080406  0.59635894]
 [0.98426476 0.99895874 0.99919453 ... 0.8080406  0.         0.6525434 ]
 [0.95241561 0.96373897 0.97885096 ... 0.59635894 0.6525434  0.        ]]

Now we are ready to create a lower dimensional projection of this distance matrix.

We would like to take this opportunity to introduce a cool (and in this case useful) feature of Jupyter: interactivity. In order to make the interactivity work as intended, we will define a few functions: one for the UMAP embedding, another to plot the result, and a third to handle the interactivity.

from umap import UMAP
def get_umap_embedding(num):
    """Create a particular projection using a given number of neighbours."""
    result = pd.DataFrame(
        data=UMAP(
            n_neighbors=num, random_state=123456, metric="precomputed", n_components=2
        ).fit_transform(bc_dm.data.copy()),
        columns=["D1", "D2"],
        index=observations.index,
    )
    result[["label", "sample"]] = observations[["label", "sample"]].copy()
    return result
import altair as alt
def plot_umap(df, num):
    """Create an interactive altair chart from a UMAP result."""
    selection = alt.selection_multi(fields=["label"])
    color = alt.condition(
        selection,
        alt.Color("label:N", scale=alt.Scale(scheme="set1"), legend=None),
        alt.value("lightgray"),
    )
    shape = alt.Shape("label", legend=None)

    scatter = (
        alt.Chart(df, title=f"Number of Neighbours = {num}")
        .encode(
            x="D1", y="D2", color=color, shape=shape, tooltip=["sample:N", "label:N"]
        )
        .mark_point(filled=True)
        .interactive()
    )

    legend = (
        alt.Chart(df)
        .encode(
            y=alt.Y("label:N", axis=alt.Axis(orient="right")), color=color, shape=shape
        )
        .mark_point()
        .add_selection(selection)
    )

    return (scatter | legend).configure_view(continuousWidth=800, continuousHeight=600)
from functools import lru_cache
@lru_cache()
def run_umap(num: int):
    embedding = get_umap_embedding(num)
    return plot_umap(embedding, num)
from ipywidgets import interact, widgets

In order to avoid expensive recalculation when moving the slider back to a previously selected number, we cache the results with functools.lru_cache.

interact(run_umap, num=widgets.IntSlider(min=5, max=95, step=10, value=15))

The number of neighbours considered is an important parameter for UMAP. Nonetheless, the global structures look very similar. Compare the results to the PCA plot from the paper.

There are very few palaeofaeces and soil samples. We exclude both from further machine learning tasks.

industrial_vs_non_obs = observations[
    observations["label"].isin(["Industrial", "Non-industrial"])
].reset_index(drop=True)
industrial_vs_non_obs.shape
(789, 2)
industrial_vs_non_abn = species_abundance.loc[
    industrial_vs_non_obs.index, :
].reset_index(drop=True)
industrial_vs_non_abn.shape
(789, 1265)

Clustering

Clustering is a powerful machine learning technique and one that often requires some kind of distance metric. The goal of a clustering algorithm is to create some groups of observations, where each group contains similar observations.

There are a variety of methods for clustering vectors, including density-based clustering, hierarchical clustering, and centroid clustering. One of the most intuitive and most commonly used centroid-based methods is K-Means Clustering.

Given a collection of points in a space, K-Means first selects $K$ “centroid” points at random, then assigns each remaining point to the centroid to which it is closest. Using these preliminary groupings, the next step is to find the geometric center of each group by averaging the coordinates of its members. These group centers become the new centroids, and again each point is assigned to the centroid to which it is closest. This process continues until centroid movement falls below some minimal threshold, after which the clustering is complete.
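
The loop just described can be written down in a few lines of NumPy. The following is only an illustrative sketch (it assumes no cluster ever becomes empty); for the real data we rely on scikit-learn's implementation below.

import numpy as np
def kmeans_sketch(points, k=2, n_iter=10, seed=123456):
    """Illustrative K-Means loop: assign points to centroids, then move the centroids."""
    rng = np.random.default_rng(seed)
    # Start from K randomly chosen observations as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        # Move each centroid to the mean position (geometric center) of its group.
        centroids = np.array([points[assignment == i].mean(axis=0) for i in range(k)])
    return centroids, assignment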

As discussed before, we use PCA to reduce the number of dimensions in our data to two.

from sklearn.decomposition import PCA
pca = PCA(n_components=2).fit(industrial_vs_non_abn)
pca.explained_variance_ratio_
array([0.26181605, 0.1059267 ])
reduced_species = pca.transform(industrial_vs_non_abn)
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = 16, 12
def plot_kmeans_clusters():
    # Step size of the mesh. Decrease it to increase the resolution of the plot.
    h = 0.02

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    x_min, x_max = reduced_species[:, 0].min() - 1, reduced_species[:, 0].max() + 1
    y_min, y_max = reduced_species[:, 1].min() - 1, reduced_species[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Obtain labels for each point in mesh. Use last trained model.
    Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(1)
    plt.clf()
    plt.imshow(
        Z,
        interpolation="nearest",
        extent=(xx.min(), xx.max(), yy.min(), yy.max()),
        cmap=plt.cm.Paired,
        aspect="auto",
        origin="lower",
    )

    for key, sub in industrial_vs_non_obs.groupby("label", as_index=False, sort=False):
        plt.scatter(
            reduced_species[sub.index, 0],
            reduced_species[sub.index, 1],
            label=key,
        )

    # Plot the centroids as a white X
    centroids = kmeans.cluster_centers_
    plt.scatter(
        centroids[:, 0],
        centroids[:, 1],
        marker="x",
        s=169,
        linewidths=3,
        color="w",
        zorder=10,
    )

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.legend()
    plt.show()
kmeans = KMeans(n_clusters=2, n_init=16, random_state=123456).fit(reduced_species)
plot_kmeans_clusters()

Plot: K-Means clustering (k=2) of the PCA-reduced species abundances, with cluster regions shaded and centroids marked by white crosses.

kmeans = KMeans(n_clusters=8, n_init=16, random_state=123456).fit(reduced_species)
plot_kmeans_clusters()

Plot: the same K-Means clustering with k=8 clusters.

Classification

The goal of a classification task is to predict whether a given observation in a data set possesses some particular property or attribute. To make these predictions, we measure the attributes of several labeled data observations, then compare unknown observations to those measurements.

In our data set we have microbiome samples labeled either as ‘Industrial’ or ‘Non-industrial’. Scikit-learn can help us to split our data into a training set and a testing set.

K-Nearest neighbors classifiers

With a $K$-nearest neighbors classifier, we start with a labeled data set. We then add new, unlabeled observations to the data. For each, we consult the $K$ labeled observations to which the unlabeled observation is closest, where $K$ is an odd integer we use for all classifications. We then find the most common label among those $K$ observations (the “$K$-nearest neighbors”) and give the new observation that label.

The following diagram shows this scenario. Our new observation (represented by the question mark) has some points near it that are labeled with a triangle or star. Suppose we have chosen to use 3 for our value of $K$. In that case, we consult the 3 nearest labeled points near the question mark. Those 3 nearest neighbors have labels: star, triangle, triangle. Using a majority vote, we give the question mark a triangle label.

Examining the diagram above, we can see that if $K$ were set to 1, we would classify the question mark as a star, but if $K$ is 3 or 5, we would classify it as a triangle. That is to say, $K$ is an important parameter in a $K$-nearest neighbors classifier.
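
As a minimal sketch of the majority vote just described (for the real data we use scikit-learn's KNeighborsClassifier below), classifying a single new point could look like this:

import numpy as np
from collections import Counter
def knn_predict_sketch(train_points, train_labels, new_point, k=3):
    """Label a new point by majority vote among its k nearest labelled neighbours."""
    distances = np.linalg.norm(np.asarray(train_points) - np.asarray(new_point), axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(np.asarray(train_labels)[nearest])
    return votes.most_common(1)[0][0]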

Let’s see how we can use our labeled microbiome samples for classification.

With scikit-learn’s convenience function, we split our data into a training set and a test set. By default, the split is approximately 3:1, i.e., roughly 75% of the samples end up in the training set and 25% in the test set.

from sklearn.model_selection import train_test_split
train_abundance, test_abundance, train_label, test_label = train_test_split(
    reduced_species, industrial_vs_non_obs["label"]
)
np.unique(train_label, return_counts=True)
(array(['Industrial', 'Non-industrial'], dtype=object), array([317, 274]))
np.unique(test_label, return_counts=True)
(array(['Industrial', 'Non-industrial'], dtype=object), array([101,  97]))
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import matthews_corrcoef
@lru_cache
def run_knn(num):
    knn_clf = KNeighborsClassifier(num).fit(train_abundance, train_label)
    return matthews_corrcoef(test_label, knn_clf.predict(test_abundance))
interact(run_knn, num=widgets.IntSlider(min=1, max=9, step=2, value=3))
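
The Matthews correlation coefficient (MCC) summarizes the whole confusion matrix in a single number between -1 (total disagreement) and 1 (perfect prediction), with 0 corresponding to random guessing. In terms of true/false positives and negatives it is defined as $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$.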

Measure performance

Can you systematically measure the classifier model performance (as measured by Matthews’ correlation coefficient) for a wide range of $K$-values? Plot the results using any method that we have learnt.

Solution

We create a data frame for storing the $K$-values and performance.

performance = pd.DataFrame({"k": list(range(1, 99, 2))})
performance["mcc"] = performance["k"].map(run_knn)
performance.plot.scatter("k", "mcc")

Plot: Matthews correlation coefficient (mcc) as a function of the number of neighbours k.

For each observation we pass as input to knn_clf.predict(), the function returns a predicted label (here, either ‘Industrial’ or ‘Non-industrial’). Just like that, we’ve trained a machine learning classifier and classified some test data!

The classification example above shows how we can classify just a single point in space, but suppose we want to analyze the way a classifier would classify each possible point in some space. To do so, we can transform our space into a grid of units and then classify each point in that grid. Analyzing a space in this way is known as identifying a classifier’s decision boundary, because it shows the boundaries between different classification outcomes in the space. This kind of analysis is very helpful when training machine learning models, because studying a classifier’s decision boundary can suggest how to improve the classifier.

Let’s take a look at the decision boundary of our classifier:

def plot_decision_boundary(clf, X, labels, h=0.1):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu, alpha=0.8)

    # Plot also the training points
    for key, sub in labels.groupby("label", as_index=False, sort=False):
        plt.scatter(X[sub.index, 0], X[sub.index, 1], edgecolor="black", label=key)

    plt.legend(loc="best")
    plt.show()
knn_clf = KNeighborsClassifier(5).fit(train_abundance, train_label)
plot_decision_boundary(
    knn_clf,
    train_abundance,
    pd.DataFrame({"label": train_label}).reset_index(drop=True),
)

Plot: decision boundary of the 5-nearest-neighbours classifier over the two principal components, with the training points overlaid.

For each pixel in the plot above, we retrieve the $K$ closest points with known labels. We then use a majority vote of those labels to assign the label of the pixel. This is exactly analogous to predicting a label for an unlabeled point. In both cases, we take a majority vote of the $K$ closest points with known labels. Working in this way, we can use labeled data to classify unlabeled data. That’s all there is to $K$-nearest neighbors classification!

It’s worth noting that $K$-nearest neighbors is only one of many popular classification algorithms. From a high-level point of view, each classification algorithm works in a similar way. Each requires a certain number of observations with known labels and each uses those labeled observations to classify unlabeled observations. However, different classification algorithms use different logic to assign a label to new observations which means different classification algorithms have very different decision boundaries.

In the chart below, each row plots the decision boundaries that several classifiers produce for the same data set. Notice how some classifiers work better with certain data shapes.


For an intuitive introduction to many of these classifiers, including Support Vector Machines, Decision Trees, Neural Networks, and Naive Bayes classifiers, see Luis Serrano’s introduction to machine learning video discussed in the Going Further section below.
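
If you would like to try a few of these alternatives on our data, the rough sketch below (not part of the original analysis) fits several scikit-learn classifiers on the same training split and compares their test performance; the plot_decision_boundary function above works for any of them, since they all provide predict_proba.

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# Fit several classifiers to the same PCA-reduced training data and compare
# their Matthews correlation coefficients on the test set.
for name, clf in [
    ("K-nearest neighbors", KNeighborsClassifier(5)),
    ("Support vector machine", SVC(probability=True, random_state=123456)),
    ("Decision tree", DecisionTreeClassifier(random_state=123456)),
    ("Naive Bayes", GaussianNB()),
]:
    clf.fit(train_abundance, train_label)
    print(name, matthews_corrcoef(test_label, clf.predict(test_abundance)))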

Anomaly Detection

Anomaly detection refers to the identification of anomalies, or outliers, in data sets. While detecting anomalies in a single dimension can be quite simple, finding anomalies in high-dimensional space is a difficult problem.

One technique for identifying anomalies in high-dimensional data sets is the Isolation Forest. An Isolation Forest identifies outliers in a data set by randomly dividing the space until each point is isolated from the others. After repeating this procedure several times, the Isolation Forest flags points that are, on average, isolated more quickly than the rest as outliers.

In an illustration of this procedure, isolated points are colored green and labeled with the iteration on which they were isolated. Repeating the procedure many times shows that an outlier is consistently isolated within a few splits, while it usually takes more iterations to isolate the other points. This is the chief intuition behind the Isolation Forest outlier classification strategy: outliers are isolated quickly because they are farther from the other points in the data set.
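
To build some intuition, here is a toy one-dimensional sketch (made-up values, not the lesson data) that repeatedly splits the value range at random positions and counts how many splits it takes to isolate a given point; the obvious outlier is, on average, isolated much faster than the other values.

import numpy as np
def isolation_depth(values, target, rng):
    """Count how many random splits are needed to isolate `target` from the other values."""
    lower, upper = values.min(), values.max()
    remaining = values
    depth = 0
    while len(remaining) > 1:
        split = rng.uniform(lower, upper)
        depth += 1
        if target < split:
            remaining = remaining[remaining < split]
            upper = split
        else:
            remaining = remaining[remaining >= split]
            lower = split
    return depth
rng = np.random.default_rng(123456)
values = np.array([0.9, 1.0, 1.1, 1.2, 1.3, 8.0])  # 8.0 is the outlier
for value in values:
    print(value, np.mean([isolation_depth(values, value, rng) for _ in range(200)]))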

Let’s try this on our microbiome samples.

from sklearn.ensemble import IsolationForest
isolation_clf = IsolationForest(random_state=123456, n_jobs=-1).fit(train_abundance)
def plot_iforest_decision_boundary(clf, X, test_X, test_result, margin=6, mesh=0.5):
    """
    Create and display the decision boundary for an isolation forest.
    """
    # get the x and y grid domains
    x_domain = [X[:, 0].min() - margin, X[:, 0].max() + margin]
    y_domain = [X[:, 1].min() - margin, X[:, 1].max() + margin]
    # get a list of values from min to max, counting by `mesh`
    x_range = np.arange(x_domain[0], x_domain[1], mesh)
    y_range = np.arange(y_domain[0], y_domain[1], mesh)
    # create the data with which to color the background grid
    xx, yy = np.meshgrid(x_range, y_range)
    # classify each unit of the grid
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    # reshape Z into a 2D grid
    Z = Z.reshape(xx.shape)
    # fill in the grid values
    plt.contourf(xx, yy, Z, cmap=plt.cm.YlGn)
    # add the training points
    plt.scatter(
        X[:, 0],
        X[:, 1],
        c="green",
        edgecolors="black",
        alpha=0.4,
        label="training observation",
    )
    inliers = test_X[test_result == 1, :]
    outliers = test_X[test_result == -1, :]
    # plot the inliers and outliers
    if inliers.shape[0] > 0:
        plt.scatter(
            inliers[:, 0],
            inliers[:, 1],
            c="white",
            edgecolors="black",
            label="non-outlier",
        )
    if outliers.shape[0] > 0:
        plt.scatter(
            outliers[:, 0], outliers[:, 1], c="red", edgecolors="black", label="outlier"
        )
    # add a legend to the plot
    plt.legend(loc="best")
    plt.show()
plot_iforest_decision_boundary(
    isolation_clf, train_abundance, test_abundance, isolation_clf.predict(test_abundance)
)

Plot: decision boundary of the isolation forest with training observations (green), non-outlier test samples (white), and outliers (red).

We see that any sample outside the central area is classified as an outlier. In just a few lines of code, we can create, train, and deploy a machine learning model for detecting outliers in high-dimensional data!

Removed samples

In the original study, some samples were removed. Can our classifier detect them as outliers?

Prepare data

As before we load the data from the spreadsheet and tidy it up.

removed = pd.read_excel(
    "data/41586_2021_3532_MOESM6_ESM.xlsx", sheet_name="Tab2-MetaPhlAn2_removed_samples"
)
removed
                                            Unnamed: 0    TS929A  ...    AW113    AW116
0                                           k__Archaea  44.35047  ...  0.47962  0.02775
1                          k__Archaea|p__Euryarchaeota  44.35047  ...  0.47962  0.02775
2       k__Archaea|p__Euryarchaeota|c__Methanobacteria  44.35047  ...  0.47962  0.02775
3    k__Archaea|p__Euryarchaeota|c__Methanobacteria...  44.35047  ...  0.47962  0.02775
4    k__Archaea|p__Euryarchaeota|c__Methanobacteria...  44.35047  ...  0.47962  0.02775
..                                                 ...       ...  ...      ...      ...
729  k__Viruses|p__Viruses_noname|c__Viruses_noname...   0.00000  ...  0.46174  0.00000
730  k__Viruses|p__Viruses_noname|c__Viruses_noname...  12.62477  ...  0.00000  0.00000
731  k__Viruses|p__Viruses_noname|c__Viruses_noname...  12.62477  ...  0.00000  0.00000
732  k__Viruses|p__Viruses_noname|c__Viruses_noname...  12.62477  ...  0.00000  0.00000
733  k__Viruses|p__Viruses_noname|c__Viruses_noname...  12.62477  ...  0.00000  0.00000

[734 rows x 8 columns]
def transform_removed_taxonomic_rank(df: pd.DataFrame) -> pd.DataFrame:
    """Split the MetaPhlAn lineage strings into kingdom, rank, and taxon name."""
    result = df.rename(columns={"Unnamed: 0": "taxonomy"})
    # Each entry looks like 'k__Archaea|p__Euryarchaeota|...'; split the lineage.
    lineage = result["taxonomy"].str.split("|")
    # The kingdom is encoded in the first element (after the 'k__' prefix).
    result["kingdom"] = (
        lineage.map(lambda element: element[0])
        .str.split("__", n=1, expand=True)
        .iloc[:, -1]
    )
    # The last element carries the most specific rank letter and the taxon name.
    result[["rank", "taxonomy"]] = lineage.map(lambda element: element[-1]).str.split(
        "__", n=1, expand=True
    )
    return result
intermediate = transform_removed_taxonomic_rank(removed)
intermediate.head(10)
                                  taxonomy    TS929A  TS889  ...    AW116  kingdom  rank
0                                  Archaea  44.35047    0.0  ...  0.02775  Archaea     k
1                            Euryarchaeota  44.35047    0.0  ...  0.02775  Archaea     p
2                          Methanobacteria  44.35047    0.0  ...  0.02775  Archaea     c
3                       Methanobacteriales  44.35047    0.0  ...  0.02775  Archaea     o
4                      Methanobacteriaceae  44.35047    0.0  ...  0.02775  Archaea     f
5                       Methanobrevibacter  44.35047    0.0  ...  0.02741  Archaea     g
6               Methanobrevibacter_smithii   0.00000    0.0  ...  0.01242  Archaea     s
7  Methanobrevibacter_smithii_unclassified   0.00000    0.0  ...  0.01242  Archaea     t
8          Methanobrevibacter_unclassified  44.35047    0.0  ...  0.01499  Archaea     s
9                           Methanosphaera   0.00000    0.0  ...  0.00035  Archaea     g

[10 rows x 10 columns]
intermediate.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 734 entries, 0 to 733
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   taxonomy  734 non-null    object 
 1   TS929A    734 non-null    float64
 2   TS889     734 non-null    float64
 3   TS895     734 non-null    float64
 4   UT2.12    734 non-null    float64
 5   UT3.6     734 non-null    float64
 6   AW113     734 non-null    float64
 7   AW116     734 non-null    float64
 8   kingdom   734 non-null    object 
 9   rank      734 non-null    object 
dtypes: float64(7), object(3)
memory usage: 57.5+ KB
intermediate[["taxonomy", "kingdom", "rank"]].describe()
              taxonomy   kingdom rank
count              734       734  734
unique             731         4    8
top     Viruses_noname  Bacteria    s
freq                 3       675  258
intermediate["kingdom"].value_counts()
Bacteria     675
Eukaryota     27
Viruses       20
Archaea       12
Name: kingdom, dtype: int64
intermediate["rank"].value_counts()
s    258
t    204
g    129
f     75
o     33
c     21
p     10
k      4
Name: rank, dtype: int64
removed_abundance = (
    intermediate.query("kingdom != 'Viruses'")
    .set_index(["kingdom", "rank", "taxonomy"])
    .T.reset_index(drop=True)
)
removed_abundance
kingdom    Archaea                ...           Eukaryota              
rank             k             p  ...                   s             t
taxonomy   Archaea Euryarchaeota  ... Sordaria_macrospora GCA_000182805
0         44.35047      44.35047  ...             0.00000       0.00000
1          0.00000       0.00000  ...             0.00000       0.00000
2          0.52994       0.52994  ...             0.00000       0.00000
3          0.00666       0.00666  ...             0.00000       0.00000
4          0.03470       0.03470  ...             0.00000       0.00000
5          0.47962       0.47962  ...             0.96528       0.96528
6          0.02775       0.02775  ...             0.55437       0.55437

[7 rows x 714 columns]

Now that we have the data in a tidy format, we add the species columns that are missing from the removed samples (filling them with zeros) before reducing this set to two dimensions with PCA as well.

removed_species = removed_abundance.loc[
    :, removed_abundance.columns.get_loc_level("s", level="rank")[0]
].copy()
removed_species.shape
(7, 254)
industrial_vs_non_abn.shape
(789, 1265)
removed_species = removed_species.reindex(
    columns=removed_species.columns.union(industrial_vs_non_abn.columns), fill_value=0
)
removed_species
kingdom                     Archaea  ...       Eukaryota
rank                              s  ...               s
taxonomy Halobacterium_unclassified  ... Ustilago_maydis
0                                 0  ...               0
1                                 0  ...               0
2                                 0  ...               0
3                                 0  ...               0
4                                 0  ...               0
5                                 0  ...               0
6                                 0  ...               0

[7 rows x 1229 columns]
# Reduce the reindexed species-level abundances to two dimensions.
reduced_removed = PCA(n_components=2).fit_transform(removed_species.values)
reduced_removed.shape
(7, 2)
plot_iforest_decision_boundary(
    isolation_clf,
    reduced_species,
    reduced_removed,
    isolation_clf.predict(reduced_removed),
)

Plot: isolation forest decision boundary with the seven removed samples, all classified as outliers (red).

As we could have expected from the small number of shared species, the removed samples are far from all the other samples and are thus all classified as outliers.

Going Further

The snippets above are meant only to give a brief introduction to some of the most popular techniques in machine learning so you can decide whether this kind of analysis might be useful in your research. If it seems like machine learning will be important in your work, you may want to check out some of the resources listed below (arranged roughly from least to most technical):

Key Points

  • Preparing data in the right format is often the hardest task.

  • Machine learning provides methods for tasks such as dimensionality reduction, clustering, classification, and anomaly detection.

  • Having good visualizations is crucial for interpretation.


Afternoon Break

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

Key Points


Machine Learning (continued)

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

Please see the previous episode for the content.

Key Points


Wrap-up and Outlook

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

Getting help:

Further course materials (in the same style of the course):

Good coding practices and advice:

Python libraries relevant for biotech (not covered in the course):

Popular visualization and dashboard libraries (not covered in the course):

Finding more packages:

Key Points