Reading and Working with Tabular Data

Overview

Teaching: 50 min
Exercises: 40 min

Questions

What is a Pandas data frame?

How do I get an overview on my tabular data?

How do I read tabular data in different formats?

How do I access subsets of a data frame?

How do I calculate simple statistics like the mean?

Objectives

Read data in various formats.

Select individual values and subsections from data.

Transform a data frame from messy to tidy data.

Understand common data workflows (ETL) and strategies (split-apply-combine).

Read data

We use Pandas for reading, writing, and handling tabular data in Python as it is the de facto standard tool to do so.

Just as np is by convention the alias for numpy, so is pd for pandas.

import pandas as pd

We start by reading some data from a comma-separated values (CSV) file.

growth = pd.read_csv("data/yeast-growth.csv")

Inspect your data

Depending on your current Python environment a pandas.DataFrame object will automatically be displayed as text formatted to a table or as a prettier HTML table.

growth

    well  timepoint     od concentration_level  concentration
    a          1  0.017                 low           0.01
    b          1  0.017                 low           0.03
    c          1  0.018              medium           1.00
    d          1  0.017              medium           3.00
    e          1  0.017              medium          30.00
..   ...        ...    ...                 ...            ...
  c         65  0.228              medium           1.00
  d         65  0.221              medium           3.00
  e         65  0.065              medium          30.00
  f         65  0.035                high         100.00
  g         65  0.031                high         300.00

[455 rows x 5 columns]

By default, the output is configured to not show too many rows and columns. We can access more specific parts of a data frame using specific methods.

The beginning of a data frame can be accessed with the head method.

growth.head()

  well  timepoint     od concentration_level  concentration
  a          1  0.017                 low           0.01
  b          1  0.017                 low           0.03
  c          1  0.018              medium           1.00
  d          1  0.017              medium           3.00
  e          1  0.017              medium          30.00

Correspondingly, the end of a data frame can be accessed with the tail method.

growth.tail()

    well  timepoint     od concentration_level  concentration
  c         65  0.228              medium            1.0
  d         65  0.221              medium            3.0
  e         65  0.065              medium           30.0
  f         65  0.035                high          100.0
  g         65  0.031                high          300.0

You can also ask for a random number of rows. This is great for inspection since you have a chance to see rows from different parts of the data frame. It can also be convenient to develop an analysis with a subset of your data first because it will be faster (this matters if you work with tables that contain millions of rows).

growth.sample(10)

    well  timepoint     od concentration_level  concentration
  d         48  0.120              medium           3.00
  e         38  0.046              medium          30.00
  g         20  0.032                high         300.00
  f         54  0.036                high         100.00
  c         60  0.231              medium           1.00
  e         61  0.063              medium          30.00
  a         57  0.234                 low           0.01
  c         21  0.049              medium           1.00
  e         45  0.051              medium          30.00
   f          2  0.015                high         100.00

Notice the random indices.

Summaries

Looking at the actual values in your data is important to develop an understanding of them if you work with new data. However, in order to get an overview on your data, there are some useful summary methods.

If you just want to know the dimensions of your data frame, you can do so with the shape attribute.

growth.shape

(455, 5)

As you will see later on, it is sometimes necessary to look at the data types that pandas has inferred from your table.

growth.dtypes

well                    object
timepoint                int64
od                     float64
concentration_level     object
concentration          float64
dtype: object

The info method conveniently combines information on the shape and types in your data frame. In addition, it tells you about the index used and the memory consumption.

growth.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 455 entries, 0 to 454
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   well                 455 non-null    object 
 1   timepoint            455 non-null    int64  
 2   od                   455 non-null    float64
 3   concentration_level  455 non-null    object 
 4   concentration        455 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 17.9+ KB

Instead of summarizing the data frame object itself, you can get a high-level overview on the statistical distribution of your data.

growth.describe()

        timepoint          od  concentration
count  455.000000  455.000000     455.000000
mean    33.000000    0.077193      62.005714
std     18.782314    0.068110     102.928567
min      1.000000    0.015000       0.010000
25%     17.000000    0.032000       0.030000
50%     33.000000    0.042000       3.000000
75%     49.000000    0.099500     100.000000
max     65.000000    0.237000     300.000000

By default, this will only describe numeric columns. You can also get an overview on all columns including text and categorical columns (generally of type object).

growth.describe(include="all")

       well   timepoint          od concentration_level  concentration
count   455  455.000000  455.000000                 455     455.000000
unique    7         NaN         NaN                   3            NaN
top       a         NaN         NaN              medium            NaN
freq     65         NaN         NaN                 195            NaN
mean    NaN   33.000000    0.077193                 NaN      62.005714
std     NaN   18.782314    0.068110                 NaN     102.928567
min     NaN    1.000000    0.015000                 NaN       0.010000
25%     NaN   17.000000    0.032000                 NaN       0.030000
50%     NaN   33.000000    0.042000                 NaN       3.000000
75%     NaN   49.000000    0.099500                 NaN     100.000000
max     NaN   65.000000    0.237000                 NaN     300.000000

The NaN values displayed in the output are simply Pandas’ concept of missing values. Just pretend that they are empty cells.

Wording

I try to use data frame when talking about the object for computation and table when talking about data more generally; but I might mix them up as in my mind they are the same concept.

Tangent: Tidy Data

Hadley Wickham coined the concept of tidy data for statistics. What is tidy data?

You can also dive into the original publication from 2014 but it suffices to say that a tidy data structure is more intuitive for data visualization (plotting) and statistical analysis.

Reading files from different locales

Depending on what country’s standard your computer is set to (the ‘locale’), software such as Excel will use different characters to separate fields, for example, the default for a computer with the DK locale will be to use ; to separate fields and , to separate decimals. Try finding the right arguments to pandas.read_table to get something sensible out of "data/example-uk.txt" and "data/example-dk.txt". Use the inspection methods from before to see how pandas reads the data.

UK

Solution

uk_data = pd.read_table("data/example-uk.txt")
uk_data

      name;height;age;income
     Ian;183;27;12,000.01
   Peter;162;28;11,000.50
Bernhard;173;30;10,000.00
  Steven;163;32;12,500.00

uk_data = pd.read_table("data/example-uk.txt", sep=";")
uk_data

       name  height  age     income
     Ian     183   27  12,000.01
   Peter     162   28  11,000.50
Bernhard     173   30  10,000.00
  Steven     163   32  12,500.00

uk_data.dtypes

name      object
height     int64
age        int64
income    object
dtype: object

uk_data = pd.read_table("data/example-uk.txt", sep=";", thousands=",")
uk_data

       name  height  age    income
     Ian     183   27  12000.01
   Peter     162   28  11000.50
Bernhard     173   30  10000.00
  Steven     163   32  12500.00

uk_data.dtypes

name       object
height      int64
age         int64
income    float64
dtype: object

DK

Solution

dk_data = pd.read_table("data/example-dk.txt")
dk_data

     name;height;age;income
     Ian;183;27;12000,01
   Peter;162;28;11100,50
Bernhard;173;30;11000,00
  Steven;163;32;12500,00

dk_data = pd.read_table("data/example-dk.txt", sep=";")
dk_data

       name  height  age    income
     Ian     183   27  12000,01
   Peter     162   28  11100,50
Bernhard     173   30  11000,00
  Steven     163   32  12500,00

dk_data.dtypes

name      object
height     int64
age        int64
income    object
dtype: object

dk_data = pd.read_table("data/example-dk.txt", sep=";", decimal=",")
dk_data

       name  height  age    income
     Ian     183   27  12000.01
   Peter     162   28  11100.50
Bernhard     173   30  11000.00
  Steven     163   32  12500.00

dk_data.dtypes

name       object
height      int64
age         int64
income    float64
dtype: object

Other data formats

Excel

When you read data from spreadsheets (.xls, .xlsx, .ods), you do not need to worry about field separators or number formatting (but you need extra packages such as xlrd, openpyxl, or odfpy).

excel_data = pd.read_excel("data/example-dk.xlsx")
excel_data

       name  height  age    income
     Ian     183   27  12000.01
   Peter     162   28  11100.50
Bernhard     173   30  11000.00
  Steven     163   32  12500.00

However, you may need to be careful with Excel’s interpretation of strings as dates (see Gene name errors are widespread in the scientific literature and Guidelines for human gene nomenclature) when loading tabular files in Excel.

Pandas will try to detect the right dependency for opening a spreadsheet.

date_data = pd.read_excel("data/example-datetime.ods")
date_data

        date      time
2020-03-02  03:34:23
2021-02-02  02:24:00
2021-09-01  16:01:22

Common date and time formats are automatically detected and converted to suitable Python data types.

date_data.dtypes

date    datetime64[ns]
time            object
dtype: object

You can conveniently combine multiple columns into one datetime object.

pd.read_excel("data/example-datetime.ods", parse_dates={"datetime": ["date", "time"]})

             datetime
2020-03-02 03:34:23
2021-02-02 02:24:00
2021-09-01 16:01:22

Further formats

There are many more formats that Pandas can read data from. Just type pd.read_ and press tab to see the list of auto-completion options. A few notable ones:

All file-based read functions accept a remote URL as an option. We could have loaded the growth data with

pd.read_csv(
    "https://raw.githubusercontent.com/data-science-for-biotech/python-pandas-viz-ml/gh-pages/_episodes_rmd/data/yeast-growth.csv"
)

    well  timepoint     od concentration_level  concentration
    a          1  0.017                 low           0.01
    b          1  0.017                 low           0.03
    c          1  0.018              medium           1.00
    d          1  0.017              medium           3.00
    e          1  0.017              medium          30.00
..   ...        ...    ...                 ...            ...
  c         65  0.228              medium           1.00
  d         65  0.221              medium           3.00
  e         65  0.065              medium          30.00
  f         65  0.035                high         100.00
  g         65  0.031                high         300.00

[455 rows x 5 columns]

In the context of the web, read_json can be very handy
Tables are often stored in SQL databases, in that case read_sql is your friend

Accessing data

Let’s look at our old friend the growth data table again.

growth

    well  timepoint     od concentration_level  concentration
    a          1  0.017                 low           0.01
    b          1  0.017                 low           0.03
    c          1  0.018              medium           1.00
    d          1  0.017              medium           3.00
    e          1  0.017              medium          30.00
..   ...        ...    ...                 ...            ...
  c         65  0.228              medium           1.00
  d         65  0.221              medium           3.00
  e         65  0.065              medium          30.00
  f         65  0.035                high         100.00
  g         65  0.031                high         300.00

[455 rows x 5 columns]

By label

We can access individual columns (pandas.Series) by their names as string.

growth["od"]

    0.017
    0.017
    0.018
    0.017
    0.017
       ...  
  0.228
  0.221
  0.065
  0.035
  0.031
Name: od, Length: 455, dtype: float64

Notice how the display of series is different from data frames.

We can make powerful label based selections with the loc attribute.

growth.loc[:, "od"]

    0.017
    0.017
    0.018
    0.017
    0.017
       ...  
  0.228
  0.221
  0.065
  0.035
  0.031
Name: od, Length: 455, dtype: float64

Our data frames have two dimensions: the rows and columns. Here, we specify that we want all rows : and a single column "od". We can also select one or more columns as a list which will always return a data frame.

growth.loc[:, ["od"]]

        od
  0.017
  0.017
  0.018
  0.017
  0.017
..     ...
0.228
0.221
0.065
0.035
0.031

[455 rows x 1 columns]

We use at to access an individual element with labels.

growth.at[451, "od"]

0.221

By index

We might be tempted to access a column by its index position directly.

growth[2]

Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: 2

Detailed traceback:
  File "<string>", line 1, in <module>
  File "/home/moritz/.pyenv/versions/3.8.10/envs/biotech/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/moritz/.pyenv/versions/3.8.10/envs/biotech/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err

This does not work. Only labels are allowed here. Instead we use iloc for index location.

growth.iloc[:, 2]

    0.017
    0.017
    0.018
    0.017
    0.017
       ...  
  0.228
  0.221
  0.065
  0.035
  0.031
Name: od, Length: 455, dtype: float64

Again, we can select one or more columns as a data frame.

growth.iloc[:, [2]]

        od
  0.017
  0.017
  0.018
  0.017
  0.017
..     ...
0.228
0.221
0.065
0.035
0.031

[455 rows x 1 columns]

We can also access individual elements.

growth.iat[451, 2]

0.221

By slices

We can also define label-based ranges over rows and columns.

growth.loc[:, "od":"concentration"]

        od concentration_level  concentration
  0.017                 low           0.01
  0.017                 low           0.03
  0.018              medium           1.00
  0.017              medium           3.00
  0.017              medium          30.00
..     ...                 ...            ...
0.228              medium           1.00
0.221              medium           3.00
0.065              medium          30.00
0.035                high         100.00
0.031                high         300.00

[455 rows x 3 columns]

growth.loc[10:15, :]

   well  timepoint     od concentration_level  concentration
  d          2  0.015              medium           3.00
  e          2  0.019              medium          30.00
  f          2  0.015                high         100.00
  g          2  0.016                high         300.00
  a          3  0.016                 low           0.01
  b          3  0.015                 low           0.03

loc

Be careful that in label-based accession with loc, ranges are inclusive unlike numeric ranges in Python in general. Also be aware that since we have an integer index here, the label coincides with the index position.

growth.iloc[10:15, :]

   well  timepoint     od concentration_level  concentration
  d          2  0.015              medium           3.00
  e          2  0.019              medium          30.00
  f          2  0.015                high         100.00
  g          2  0.016                high         300.00
  a          3  0.016                 low           0.01

N.B.: Index location ranges are exclusive.

By conditions

Pandas inherits from numpy the conditional selection of elements. These are essentially Boolean series or data frames known as masks.

growth[growth["od"] > 0.2]

    well  timepoint     od concentration_level  concentration
  a         45  0.201                 low           0.01
  a         46  0.208                 low           0.01
  a         47  0.217                 low           0.01
  a         48  0.227                 low           0.01
  b         48  0.205                 low           0.03
  a         49  0.229                 low           0.01
  b         49  0.213                 low           0.03
  a         50  0.232                 low           0.01
  b         50  0.219                 low           0.03
  a         51  0.233                 low           0.01
  b         51  0.224                 low           0.03
  a         52  0.237                 low           0.01
  b         52  0.229                 low           0.03
  a         53  0.236                 low           0.01
  b         53  0.230                 low           0.03
  c         53  0.207              medium           1.00
  a         54  0.235                 low           0.01
  b         54  0.231                 low           0.03
  c         54  0.213              medium           1.00
  a         55  0.237                 low           0.01
  b         55  0.230                 low           0.03
  c         55  0.218              medium           1.00
  a         56  0.232                 low           0.01
  b         56  0.231                 low           0.03
  c         56  0.226              medium           1.00
  a         57  0.234                 low           0.01
  b         57  0.233                 low           0.03
  c         57  0.230              medium           1.00
  a         58  0.232                 low           0.01
  b         58  0.228                 low           0.03
  c         58  0.231              medium           1.00
  a         59  0.230                 low           0.01
  b         59  0.228                 low           0.03
  c         59  0.230              medium           1.00
  a         60  0.228                 low           0.01
  b         60  0.225                 low           0.03
  c         60  0.231              medium           1.00
  a         61  0.227                 low           0.01
  b         61  0.225                 low           0.03
  c         61  0.230              medium           1.00
  a         62  0.226                 low           0.01
  b         62  0.222                 low           0.03
  c         62  0.230              medium           1.00
  d         62  0.205              medium           3.00
  a         63  0.229                 low           0.01
  b         63  0.225                 low           0.03
  c         63  0.230              medium           1.00
  d         63  0.212              medium           3.00
  a         64  0.224                 low           0.01
  b         64  0.221                 low           0.03
  c         64  0.231              medium           1.00
  d         64  0.217              medium           3.00
  a         65  0.224                 low           0.01
  b         65  0.220                 low           0.03
  c         65  0.228              medium           1.00
  d         65  0.221              medium           3.00

These masks can be combined using Boolean logic, making them very powerful tools for selecting just the right data.

growth[(growth["od"] > 0.2) & (growth["concentration_level"] == "medium")]

    well  timepoint     od concentration_level  concentration
  c         53  0.207              medium            1.0
  c         54  0.213              medium            1.0
  c         55  0.218              medium            1.0
  c         56  0.226              medium            1.0
  c         57  0.230              medium            1.0
  c         58  0.231              medium            1.0
  c         59  0.230              medium            1.0
  c         60  0.231              medium            1.0
  c         61  0.230              medium            1.0
  c         62  0.230              medium            1.0
  d         62  0.205              medium            3.0
  c         63  0.230              medium            1.0
  d         63  0.212              medium            3.0
  c         64  0.231              medium            1.0
  d         64  0.217              medium            3.0
  c         65  0.228              medium            1.0
  d         65  0.221              medium            3.0

You can combine masks with loc to also sub-select columns.

growth.loc[(growth["od"] > 0.2) & (growth["concentration_level"] == "medium"), ["od", "timepoint"]]

        od  timepoint
0.207         53
0.213         54
0.218         55
0.226         56
0.230         57
0.231         58
0.230         59
0.231         60
0.230         61
0.230         62
0.205         62
0.230         63
0.212         63
0.231         64
0.217         64
0.228         65
0.221         65

Sometimes these conditions can be tedious to write, use the query method for a shorter form. Be careful, query uses Python’s condition logic, not numpy’s.

growth.query("od > 0.2 and concentration_level == 'medium'")

    well  timepoint     od concentration_level  concentration
  c         53  0.207              medium            1.0
  c         54  0.213              medium            1.0
  c         55  0.218              medium            1.0
  c         56  0.226              medium            1.0
  c         57  0.230              medium            1.0
  c         58  0.231              medium            1.0
  c         59  0.230              medium            1.0
  c         60  0.231              medium            1.0
  c         61  0.230              medium            1.0
  c         62  0.230              medium            1.0
  d         62  0.205              medium            3.0
  c         63  0.230              medium            1.0
  d         63  0.212              medium            3.0
  c         64  0.231              medium            1.0
  d         64  0.217              medium            3.0
  c         65  0.228              medium            1.0
  d         65  0.221              medium            3.0

query is fancy enough to know how to use variables.

threshold = 0.2

growth.query("od > @threshold and concentration_level == 'medium'")

    well  timepoint     od concentration_level  concentration
  c         53  0.207              medium            1.0
  c         54  0.213              medium            1.0
  c         55  0.218              medium            1.0
  c         56  0.226              medium            1.0
  c         57  0.230              medium            1.0
  c         58  0.231              medium            1.0
  c         59  0.230              medium            1.0
  c         60  0.231              medium            1.0
  c         61  0.230              medium            1.0
  c         62  0.230              medium            1.0
  d         62  0.205              medium            3.0
  c         63  0.230              medium            1.0
  d         63  0.212              medium            3.0
  c         64  0.231              medium            1.0
  d         64  0.217              medium            3.0
  c         65  0.228              medium            1.0
  d         65  0.221              medium            3.0

Views

Selecting a subset of your data frame by any of the previous means creates a view on that part of the data frame. This is faster since the data is not copied and lets you change elements in a part of the table. Be careful that this may lead to unintended consequences.

growth.loc[0:5, :] = None

growth.head(10)

   well  timepoint     od concentration_level  concentration
None        NaN    NaN                None            NaN
None        NaN    NaN                None            NaN
None        NaN    NaN                None            NaN
None        NaN    NaN                None            NaN
None        NaN    NaN                None            NaN
None        NaN    NaN                None            NaN
   g        1.0  0.015                high         300.00
   a        2.0  0.015                 low           0.01
   b        2.0  0.018                 low           0.03
   c        2.0  0.021              medium           1.00

A comprehensive tutorial can be found in the Pandas documentation.

Working with data

Just like numpy, Pandas offers basic statistical methods out-of-the-box. We can apply them to single columns,

growth["od"].min()

0.015

multiple columns,

growth[["od", "concentration"]].max()

od                 0.237
concentration    300.000
dtype: float64

or even rows.

growth[["od", "concentration"]].mean(axis=1)

         NaN
         NaN
         NaN
         NaN
         NaN
         ...   
    0.6140
    1.6105
   15.0325
   50.0175
  150.0155
Length: 455, dtype: float64

growth[["od", "concentration"]].std(axis=1)

           NaN
           NaN
           NaN
           NaN
           NaN
          ...    
    0.545886
    1.965050
   21.167241
   70.685929
  212.110114
Length: 455, dtype: float64

Split-apply-combine

Another powerful operation and a very common strategy for data transformation is split, apply, combine. This pattern is provided by pandas in the form of the groupby method.

The split-apply-combine strategy. Divide data into chunks, then run a given (reducing) function on each chunk sepearately, and finally combine the chunks again.

Image credit

growth.groupby("well")[["od", "concentration"]].mean()

            od  concentration
well                         
a     0.123500           0.01
b     0.117969           0.03
c     0.109375           1.00
d     0.088328           3.00
e     0.043891          30.00
f     0.033937         100.00
g     0.029738         300.00

Cleaning messy data

Can you transform this messy table into a tidy one? You may need the melt function to change the table layout.

messy = pd.read_csv("data/yeast-growth-messy.csv")

Solution

messy

      V1      V2      V3     V4     V5     V6  ...    V63    V64    V65    V66    V67    V68
Test1     low    0.01  0.017  0.015  0.016  ...  0.228  0.227  0.226  0.229  0.224  0.224
Test1     low    0.03  0.017  0.018  0.015  ...  0.225  0.225  0.222  0.225  0.221  0.220
Test1  medium    1.00  0.018  0.021  0.016  ...  0.231  0.230  0.230  0.230  0.231  0.228
Test1  medium    3.00  0.017  0.015  0.017  ...  0.190  0.196  0.205  0.212  0.217  0.221
Test1  medium   30.00  0.017  0.019  0.016  ...  0.062  0.063  0.062  0.063  0.063  0.065
Test1    high  100.00  0.016  0.015  0.015  ...  0.037  0.035  0.035  0.034  0.035  0.035
Test1    high  300.00  0.015  0.016  0.015  ...  0.031  0.032  0.031  0.031  0.030  0.031

[7 rows x 68 columns]

messy.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31',
       'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41',
       'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51',
       'V52', 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61',
       'V62', 'V63', 'V64', 'V65', 'V66', 'V67', 'V68'],
      dtype='object')

del messy["V1"]

from string import ascii_lowercase

messy["well"] = list(ascii_lowercase[:len(messy)])

long_data = pd.melt(messy, id_vars=["V2", "V3", "well"])
long_data

         V2      V3 well variable  value
     low    0.01    a       V4  0.017
     low    0.03    b       V4  0.017
  medium    1.00    c       V4  0.018
  medium    3.00    d       V4  0.017
  medium   30.00    e       V4  0.017
..      ...     ...  ...      ...    ...
medium    1.00    c      V68  0.228
medium    3.00    d      V68  0.221
medium   30.00    e      V68  0.065
  high  100.00    f      V68  0.035
  high  300.00    g      V68  0.031

[455 rows x 5 columns]

long_data["variable"].replace(dict(zip(long_data["variable"].unique(), range(1, len(long_data["variable"].unique()) + 1))), inplace=True)

long_data.rename(columns={"V2": "concentration_level", "V3": "concentration", "variable": "timepoint"}, inplace=True)

long_data

    concentration_level  concentration well  timepoint  value
                 low           0.01    a          1  0.017
                 low           0.03    b          1  0.017
              medium           1.00    c          1  0.018
              medium           3.00    d          1  0.017
              medium          30.00    e          1  0.017
..                  ...            ...  ...        ...    ...
            medium           1.00    c         65  0.228
            medium           3.00    d         65  0.221
            medium          30.00    e         65  0.065
              high         100.00    f         65  0.035
              high         300.00    g         65  0.031

[455 rows x 5 columns]

Tangent: Extract, Transform, Load (ETL)

The process we have just seen is known as extract, transform, load or simply ETL. It describes the ubiquitous data workflow of reading data (extract with, for example, pd.read_csv), cleaning up the data (transform) as we have done with the messy growth curves above, and storing the tidy data again for subsequent use (load with, for example, to_excel).

Outlook

Tidy pandas data frames are excellent structures for visualizing data, either with the built-in methods or other packages. This will be part of the next episode.
The package statsmodels can be used to fit statistical models to tidy data frames in a convenient fashion using R-style formulas.
Data frames due to their column names are more descriptive than pure numeric matrices (numpy.ndarray) which is handy in many tasks, for example, in machine learning as shown in the last episode.

Key Points

Use pandas.read_* and pandas.DataFrame.to_* to import / export data.

The method info describes the data frame object.

The method describe summarizes value distributions in columns.

You can use labels or index locations to select both subsets and elements from your data frame.

Selection using conditions is very powerful.

Selections create views on your original data.

Use mean, max, min, and others to calculate simple statistics.

Use split-apply-combine to calculate statistics within groups in a data frame.

lesson home

Introduction to Python for Data Science

next episode

Reading and Working with Tabular Data

Overview

Read data

Inspect your data

Summaries

Wording

Tangent: Tidy Data

Reading files from different locales

UK

Solution

DK

Solution

Other data formats

Excel

Further formats

Accessing data

By label

By index

By slices

`loc`

By conditions

Views

Working with data

Split-apply-combine

Cleaning messy data

Solution

Tangent: Extract, Transform, Load (ETL)

Outlook

Key Points

lesson home

next episode