Visualize Your Data

Overview

Teaching: 45 min
Exercises: 30 min
Questions
  • What tools exist to plot data in Python?

  • How do I make a basic plot?

  • What visuals are available?

  • How can I best visualize groups of data?

Objectives
  • Get an introduction to Pandas plotting.

  • Get a high-level overview of plotting in Python.

  • Learn about the grammar of graphics.

  • Learn to apply the grammar of graphics through altair.

  • Learn how to group data.

  • Learn how to use facets.

  • Develop an understanding of the rich visualization diversity that exists.

Native Pandas plotting

import pandas as pd

Let’s work with the familiar growth data from the last episode.

growth = pd.read_csv("data/yeast-growth.csv")

Pandas provides a few quick plotting methods. It uses matplotlib under the hood to generate graphics.

import matplotlib.pyplot as plt

The default plot method draws lines for every numeric column in the data frame using the index as the horizontal axis.

growth.plot()
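To see which columns the default plot would pick up, we can list the numeric columns ourselves. The sketch below uses a tiny made-up data frame as a stand-in for the real file; only the column names follow the lesson data, the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for data/yeast-growth.csv (same column names, made-up values).
growth = pd.DataFrame(
    {
        "well": ["a", "a", "b", "b"],
        "timepoint": [1, 2, 1, 2],
        "od": [0.02, 0.05, 0.03, 0.09],
        "concentration": [0.0, 0.0, 0.5, 0.5],
    }
)

# Numeric columns only; the text column "well" is left out.
numeric_columns = growth.select_dtypes("number").columns.tolist()
print(numeric_columns)
```

Each of these columns becomes one line, drawn against the row index rather than against the time points, which is why the default plot is hard to read for tidy data.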


We can specify a different kind of plot, choosing columns explicitly.

growth.plot.scatter("timepoint", "od")


Other plot types include the histogram. Because Pandas draws with matplotlib, we can also call matplotlib directly to customize the plot further, here to label the horizontal axis.

growth["concentration"].plot.hist(bins=30)
plt.xlabel("concentration")
plt.show()
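As a rough illustration of what the bins argument does, the binning step can be reproduced with NumPy alone. The values below are made up and only stand in for the concentration column.

```python
import numpy as np

# Made-up concentration values, only for illustration.
concentration = np.array([0.0, 0.1, 0.1, 0.5, 1.0, 1.0, 2.0, 4.0])

# Split the value range into 30 equally wide bins and count the values in each.
counts, edges = np.histogram(concentration, bins=30)
print(len(edges))    # 30 bins are delimited by 31 edges
print(counts.sum())  # every value falls into exactly one bin
```

The bar heights of the histogram correspond to counts, and the bar boundaries to edges.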


Another type of plot for showing distributions is the box plot, which again uses all numeric columns by default.

growth.plot.box()


My opinion on native Pandas plotting:

State of plotting in Python

https://pyviz.org/overviews/index.html

Grammar of Graphics

The grammar of graphics was described by Leland Wilkinson. It was later updated by Hadley Wickham with ideas developed during the creation of his famous R plotting software ggplot2. This blog has a decent summary of both. In loose terms, the idea is that every visualization has the following ingredients:

This all sounds a bit abstract. Let’s dive into examples.

Introduction to altair

We need the altair package, of course.

import altair as alt
width = 800
height = 600

Here we create an altair Chart object with our tidy growth data frame. The default coordinate system is Cartesian with two axes. We encode our data and specify that the time points should describe the x-axis coordinate and the optical density the y-axis coordinate. Finally, we want this two-dimensional data to be displayed using point marks.

The default output size is rather small in my opinion, so we configure the view to be larger.

alt.Chart(growth).encode(
    x="timepoint", y="od"
).mark_point().configure_view(continuousWidth=width, continuousHeight=height)

Basic point mark plot

We can, at any point, store our current chart specification in a variable and later apply further specifications to it.

base = alt.Chart(growth).encode(x="timepoint", y="od")

How about using lines as the visual mark?

base.mark_line().configure_view(continuousWidth=width, continuousHeight=height)

Ungrouped line mark plot

The above plot has the same problems as the Pandas default plot. Our data was recorded per well. We can use this variable to encode a third dimension that groups our data. Common options are colour or shape/linetype.

base.encode(color="well").mark_line(point=True).configure_view(
    continuousWidth=width, continuousHeight=height
)

Line mark plot grouped by color
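For comparison, getting one line per well out of native Pandas plotting requires reshaping the tidy data first. A minimal sketch, assuming every combination of time point and well occurs exactly once, and again using made-up values:

```python
import pandas as pd

# Hypothetical stand-in for data/yeast-growth.csv (same column names, made-up values).
growth = pd.DataFrame(
    {
        "well": ["a", "a", "b", "b"],
        "timepoint": [1, 2, 1, 2],
        "od": [0.02, 0.05, 0.03, 0.09],
    }
)

# One row per time point, one column per well.
wide = growth.pivot(index="timepoint", columns="well", values="od")
print(wide)
```

Calling wide.plot() on this reshaped frame would then draw one line per well against the time points. With altair, the tidy data can stay as it is and the grouping is expressed in the encoding.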

This looks much better already. We also changed the visual mark to show both points and lines.

Instead of distinguishing the growth time courses by colour within one plot, we can give each individual time course its own plot. This is known as faceting. Facets can be handy, but the information is more spread out, which makes comparing different time courses harder.

base.mark_line(point=True).facet("well", columns=3)

Faceted line mark plot

Let’s look again at the individual data points grouped by well.

chart = alt.Chart(growth).encode(x="timepoint", y="od", color="well").mark_point()
chart.configure_view(continuousWidth=width, continuousHeight=height)

Colored point mark plot

Instead of connecting each consecutive data point with a straight line, we can apply a smoothing data transformation. In the following example, we use a locally estimated scatterplot smoothing (LOESS) transformation.

(
    chart
    + chart.transform_loess(
        "timepoint", "od", groupby=["well"], bandwidth=0.15
    ).mark_line()
).configure_view(continuousWidth=width, continuousHeight=height)

LOESS transform plot
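To give an intuition for what such a smoother does, here is a simplified sketch in NumPy. It is not the implementation that altair relies on; the function name loess_sketch and its details are assumptions made for illustration, and it expects distinct x values. For every point it fits a straight line to the nearest neighbours, giving closer neighbours more weight.

```python
import numpy as np


def loess_sketch(x, y, bandwidth=0.3):
    """Smooth y over x with locally weighted straight-line fits."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Number of nearest neighbours used for each local fit (at least 3).
    k = min(len(x), max(3, int(np.ceil(bandwidth * len(x)))))
    smoothed = np.empty(len(x))
    for i, x0 in enumerate(x):
        distance = np.abs(x - x0)
        nearest = np.argsort(distance)[:k]
        # Tricube weights: close neighbours count more than distant ones.
        # Scaling by slightly more than the largest distance keeps every
        # neighbour's weight above zero.
        scaled = distance[nearest] / (distance[nearest].max() * 1.001)
        weights = (1 - scaled**3) ** 3
        # np.polyfit applies weights to the residuals, hence the square root.
        slope, intercept = np.polyfit(
            x[nearest], y[nearest], deg=1, w=np.sqrt(weights)
        )
        smoothed[i] = slope * x0 + intercept
    return smoothed


# Made-up, perfectly linear data: the smoother should leave it unchanged.
timepoint = np.arange(10)
od = 0.1 * timepoint + 0.5
smoothed = loess_sketch(timepoint, od, bandwidth=0.5)
```

In this sketch a larger bandwidth means more neighbours per fit and therefore a smoother curve.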

We can now combine the concepts that we have learnt to introduce a fourth dimension, the measured concentration. Since the concentration is a measured quantity, we use a continuous colour scale to represent it. We still group our data by wells and apply smoothing. As an example, we vary the point shape.

Please note that, for the first time, we explicitly encode the type of the variables. Q means quantitative (continuous) and N means nominal (categorical).

fancy_chart = alt.Chart(growth).encode(
    x="timepoint:Q",
    y="od:Q",
    detail="well:N",
    color=alt.Color("concentration:Q", scale=alt.Scale(type="log", scheme="greenblue")),
)
(
    fancy_chart.transform_loess(
        "timepoint", "od", groupby=["well", "concentration"], bandwidth=0.1
    ).mark_line()
    + fancy_chart.encode(shape="well:N").mark_point()
).configure_view(continuousWidth=width, continuousHeight=height)

Fancy plot with grouping, LOESS transform, and color by concentration

Other visual marks

Naturally, altair offers more than just lines and points.

An interactive histogram:

brush = alt.selection_interval(encodings=["x"])

base = (
    alt.Chart(growth).mark_bar().encode(y="count():Q").properties(width=600, height=100)
)

alt.vconcat(
    base.encode(
        alt.X(
            "od:Q", bin=alt.Bin(maxbins=30, extent=brush), scale=alt.Scale(domain=brush)
        )
    ),
    base.encode(
        alt.X("od:Q", bin=alt.Bin(maxbins=30)),
    ).add_selection(brush),
)

Or how about a box plot of optical density per well?

alt.Chart(growth).mark_boxplot().encode(x="well:O", y="od:Q").configure_view(
    continuousWidth=width, continuousHeight=height
)

A boxplot showing the distribution of optical density grouped by well.
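The box in such a plot spans the quartiles of each group, with the median marked inside. As a sketch, the same numbers can be computed with Pandas; the values below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the growth data (made-up values).
growth = pd.DataFrame(
    {
        "well": ["a", "a", "a", "b", "b", "b"],
        "od": [0.1, 0.2, 0.3, 0.2, 0.4, 0.6],
    }
)

# Lower quartile, median, and upper quartile of optical density per well.
quartiles = growth.groupby("well")["od"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

Each row of the result corresponds to one box in the plot.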

Many more examples exist.

Key Points

  • Pandas provides quick ways to create simple visualizations.

  • A layered grammar of graphics implementation provides a structured approach to plotting.

  • A good implementation can make expressing complex visualizations straightforward.