Sets

Overview

Teaching: 10 min
Exercises: 10 min

Questions

What is a set, and how do I use it?

Objectives

Explain how sets work.

Learn about set operations.

A set keeps an unordered collection of unique items.

Different from lists
- Lists have a defined order and can contain duplicate values, whereas sets are unordered and contain only unique items.
Efficient operations
- Sets offer fast membership testing and support operations like union, intersection, and difference.
Mutability
- You can add or remove items from a set, but duplicate entries are ignored.

beatles = set(['John', 'Paul', 'George', 'Ringo'])
print('Beatles:', beatles)
print('Length:', len(beatles))
beatles.add('Ringo')  # Adding 'Ringo' again does nothing
print('Beatles:', beatles)
print('Length:', len(beatles))

Beatles: {'John', 'Paul', 'George', 'Ringo'}
Length: 4
Beatles: {'John', 'Paul', 'George', 'Ringo'}
Length: 4

Check Membership with `in`

Use the in keyword to test if an element exists in a set.

print('Ringo is one of the Beatles:', 'Ringo' in beatles)
print('Keith is one of the Beatles:', 'Keith' in beatles)

Ringo is one of the Beatles: True
Keith is one of the Beatles: False

Adding and Removing Items

Add Items with `add()`

beatles.add('Pete')
print('After adding Pete:', beatles)

After adding Pete: {'John', 'Paul', 'George', 'Ringo', 'Pete'}

Remove Items with `remove()`

beatles.remove('Pete')
print('After removing Pete:', beatles)

After removing Pete: {'John', 'Paul', 'George', 'Ringo'}

Set Operations

Union of Sets

Combine two sets using union() (or the | operator):

odd = set([1, 3, 5, 7, 9])
even = set([2, 4, 6, 8, 10])
all_numbers = odd.union(even)
print('Odd numbers:', odd)
print('Even numbers:', even)
print('All numbers:', all_numbers)

Odd numbers: {1, 3, 5, 7, 9}
Even numbers: {2, 4, 6, 8, 10}
All numbers: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

Note

Since sets are unordered, the printed order of elements may vary.

Intersection of Sets

Retrieve common elements with intersection():

primes = set([2, 3, 5, 7])
odd_primes = primes.intersection(odd)
print('Primes that are odd:', odd_primes)

Primes that are odd: {3, 5, 7}

Difference of Sets

Get elements in one set but not in another with difference():

even_non_primes = even.difference(primes)
print('Even numbers that are not prime:', even_non_primes)

Even numbers that are not prime: {4, 6, 8, 10}

Order Matters in difference()

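The result of difference() depends on the order of the sets: a.difference(b) returns the items of a that are not in b, which is generally not the same as b.difference(a).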
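To make the order dependence concrete, here is a small sketch. It redefines `even` and `primes` with the same values used earlier so that it runs on its own; the names `even_not_prime` and `prime_not_even` are chosen only for illustration. The `-` operator gives the same result as `difference()`.

```python
# Same values as the sets used earlier in this lesson
even = {2, 4, 6, 8, 10}
primes = {2, 3, 5, 7}

# Elements of even that are not in primes
even_not_prime = even.difference(primes)
# Elements of primes that are not in even
prime_not_even = primes.difference(even)

print('even minus primes:', sorted(even_not_prime))
print('primes minus even:', sorted(prime_not_even))

# The - operator gives the same result as difference()
print('Operator form matches:', even - primes == even_not_prime)
```

Swapping the two sets changes the answer from the even non-primes to the odd primes.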

Converting a Set to a Sorted List

To display a set in order, convert it to a list and sort it:

sorted_primes = sorted(primes)
print('Sorted primes:', sorted_primes)

Sorted primes: [2, 3, 5, 7]

Initialising Challenge

What does the following program print?

letters = set('Hello world!')
sorted_letters = list(letters)
sorted_letters.sort()
print('Letters in greeting:', sorted_letters)

Fill in the Blanks Challenge

Fill in the blanks “__” so that the program below produces the output shown.

multiples_of_two = set([2, 4, 6, 8, 10])
multiples_of_three = set([3, 6, 9])
result1 = multiples_of_two.__(multiples_of_three)
print('1', result1)
result2 = multiples_of_three.__(multiples_of_two)
sorted_result2 = sorted(result2)
print('2', sorted_result2)

1 {6}
2 [3, 9]

Comparing Bacterial Isolates

You have two sets representing bacteria isolated from two different sources:
clinical_isolates = {"Staphylococcus aureus", "Escherichia coli", "Pseudomonas aeruginosa", "Klebsiella pneumoniae"}
environmental_isolates = {"Bacillus subtilis", "Escherichia coli", "Staphylococcus epidermidis", "Pseudomonas aeruginosa"}
Write code to:

Print the bacteria common to both sets (intersection).

Print the bacteria unique to the clinical sample (difference).

Print all unique bacteria from both sets (union) as a sorted list.

Solution
~~~python
# 1. Intersection: bacteria present in both samples
common_bacteria = clinical_isolates.intersection(environmental_isolates)
print("Common bacteria:", common_bacteria)

# 2. Difference: bacteria unique to the clinical sample
unique_clinical = clinical_isolates.difference(environmental_isolates)
print("Unique to clinical sample:", unique_clinical)

# 3. Union: all unique bacteria from both sets, sorted alphabetically
all_bacteria = sorted(clinical_isolates.union(environmental_isolates))
print("All unique bacteria:", all_bacteria)
~~~

Key Points

A set stores an unordered collection of unique values.

Sets automatically remove duplicate entries.

Sets support operations like union, intersection, and difference.

Dictionaries

Overview

Teaching: 15 min
Exercises: 15 min

Questions

What is a dictionary, and how do I use it?

Objectives

Explain how dictionaries work.

Learn about dictionary operations.

A dictionary lets you store data associated with a custom key.

Different from lists
- You access list elements by their position, whereas you access dictionary values by their key, and the keys need not be consecutive integers.
Different from sets
- Sets contain only unique values with no associated keys, while dictionaries map keys to values.
A very fast data structure
- Lookups, insertions, and deletions in dictionaries are very fast (average-case O(1)).

Basic Operations

Creation: Use curly braces {} or the dict() constructor.
Access: Use square brackets (dict[key]) or the get() method.
Modification: Add or update entries using assignment (dict[key] = value).
Deletion: Use the del statement or the pop() method.
Keys: Use dict.keys() to get all keys in the dictionary.
Values: Use dict.values() to get all values in the dictionary.

# A simple example of a dictionary
# Keys must be hashable; strings and numbers (ints or floats) are the most common
# Values can be of any type
data = {'a': "Hello", 'b': 2, 'c': ["apple", "banana", "cherry"]}
print(len(data))         # Output: 3
print(data['a'])         # Output: Hello
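The basic operations listed above can be sketched in a few lines. The dictionary `gene_lengths` and its contents are hypothetical example data, not part of the lesson's data set.

```python
# Hypothetical example data: gene names mapped to sequence lengths
gene_lengths = {'geneA': 1200, 'geneB': 850}

# Access: square brackets raise KeyError for a missing key,
# while get() returns a default value instead
print(gene_lengths['geneA'])
print(gene_lengths.get('geneZ', 0))

# Modification: assignment adds a new entry or updates an existing one
gene_lengths['geneC'] = 430
gene_lengths['geneB'] = 900

# Deletion: del removes an entry, pop() removes it and returns its value
del gene_lengths['geneA']
removed = gene_lengths.pop('geneB')
print('Removed value:', removed)

# Keys and values
print(list(gene_lengths.keys()))
print(list(gene_lengths.values()))
```

Using `get()` with a default is often convenient when a missing key is expected rather than an error.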

Example: DNA Codon Table

Consider a simplified DNA codon table where each codon (a triplet of nucleotides) maps to an amino acid.

# Example input here
codon_to_amino = {
    "ATG": "Methionine",
    "TTT": "Phenylalanine",
    "TTC": "Phenylalanine",
    "TAA": "Stop",
    "TAG": "Stop",
    "TGA": "Stop"
}
# Access amino acid for codon 'ATG'
print("Codon ATG codes for:", codon_to_amino["ATG"])

Codon ATG codes for: Methionine

Initialising

What does the following program print?

codon_dict = {"ATG": "Methionine", "TAA": "Stop", "TAG": "Stop"}
codon_dict["ATG"] = "Start"  # Change: ATG now maps to Start
codon_dict["TGA"] = "Stop"
print("Updated codon dictionary:", codon_dict)

Hint: Since Python 3.7, dictionaries keep their keys in insertion order; in older versions the order may vary.

Solution

The program prints a dictionary with keys `'ATG'`, `'TAA'`, `'TAG'`, and `'TGA'`. The value for `'ATG'` is updated to `"Start"`. For example:

~~~python
Updated codon dictionary: {'ATG': 'Start', 'TAA': 'Stop', 'TAG': 'Stop', 'TGA': 'Stop'}
~~~

Fill in the Blanks

Fill in the blanks so that the program below retrieves the correct amino acid for the given codon.
codon_translation = {"GGT": "Glycine", "GGC": "Glycine", "GGA": "Glycine", "GGG": "Glycine"}
amino_acid = codon_translation[______]
print("Amino acid for GGT:", amino_acid)
Expected output: Amino acid for GGT: Glycine

Solution

Replace the blank with `"GGT"`:

~~~python
amino_acid = codon_translation["GGT"]
~~~

Adding a New Codon

Extend the dictionary by adding the codon "CCC" for "Proline" to the dictionary below.
codon_dict = {"ATG": "Methionine", "TAA": "Stop"}
# Add code here
print(codon_dict)
Expected output: {'ATG': 'Methionine', 'TAA': 'Stop', 'CCC': 'Proline'}

Solution

~~~python
codon_dict["CCC"] = "Proline"
~~~

Key Points

A dictionary stores values accessible by unique keys.

Dictionaries may contain values of different types.

Programming Style

Overview

Teaching: 15 min
Exercises: 15 min

Questions

How can I make my programs more readable?

How do most programmers format their code?

Objectives

Provide sound justifications for basic rules of coding style.

Refactor one-page programs to make them more readable and justify the changes.

Use Python community coding standards (PEP-8).

Follow standard Python style in your code.

PEP8 is a style guide for Python that discusses topics such as:
- How you should name variables.
- How you should use indentation.
- How to structure your import statements.

Adhering to PEP8 makes it easier for other Python programmers (and your future self) to read and understand your code. Tools such as pycodestyle (formerly named pep8) or the Flake8 extension for VS Code can check your code for compliance.
```
# Run the Zen of Python
import this
```
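As a rough illustration of the three topics listed above (naming, indentation, and imports), here is a short PEP8-style snippet. The function `circle_area` and the variable names are made up for this sketch.

```python
# Imports go at the top of the file, one module per line
import math
import statistics


def circle_area(radius):
    """Return the area of a circle with the given radius."""
    # Four spaces per indentation level; spaces around operators
    return math.pi * radius ** 2


# Variable and function names use lowercase_with_underscores
sample_radii = [1.0, 2.0, 3.0]
mean_area = statistics.mean([circle_area(r) for r in sample_radii])
print('Mean area:', mean_area)
```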

Use docstrings for functions

If the first statement in a function is a string literal (not assigned to a variable), it becomes the function’s docstring.
Docstrings provide online help accessible via the help() function.

def average(values):
    """
    Return the average of values, or None if no values are supplied.
    """
    if len(values) == 0:
        return None
    return sum(values) / len(values)

help(average)

Help on function average in module __main__:

average(values)
    Return the average of values, or None if no values are supplied.

Multiline Strings

We often use multiline strings for documentation. These start and end with three quote characters (single or double).
"""This string spans
multiple lines.

Blank lines are allowed."""

Examples of Bad/Not Pythonic Code

Example 1: Poor Naming and Formatting

# Bad: confusing function name, lack of whitespace, and multiple statements on one line
def f(x,y):return x+y; print(f(1,2))

Issues:

The function name f is not descriptive.
No whitespace around operators or after commas.
Multiple statements on one line reduce readability. Worse, the print() call sits after return inside the function body, so it never runs.

Improved Version:

def add_numbers(a, b):
    """
    Return the sum of a and b.
    """
    return a + b

result = add_numbers(1, 2)
print(result)

Example 2: Importance of Good Comments

Without Comments:

def process_data(data):
    result = []
    for d in data:
        if d % 2 == 0:
            result.append(d ** 2)
        else:
            result.append(d ** 3)
    return result

print(process_data([1, 2, 3, 4]))

Issues:

Missing comments and docstring.

With Clear Comments:

def process_data(data):
    """
    Process each number in the list:
    - Square even numbers.
    - Cube odd numbers.
    """
    result = []
    for d in data:
        # Check if the number is even
        if d % 2 == 0:
            result.append(d ** 2)
        else:
            # Number is odd: cube it
            result.append(d ** 3)
    return result

print(process_data([1, 2, 3, 4]))

Exercises

What Will Be Shown?

Highlight the lines in the code below that will be available as online help. Are there lines that should be made available but won’t be? Will any lines produce a syntax error or a runtime error?

"Find maximum edit distance between multiple sequences."
# This finds the maximum distance between all sequences.

def overall_max(sequences):
    '''Determine overall maximum edit distance.'''

    highest = 0
    for left in sequences:
        for right in sequences:
            '''Avoid checking sequence against itself.'''
            if left != right:
                this = edit_distance(left, right)
                highest = max(highest, this)

    # Report.
    return highest

Document This

Turn the comment on the following function into a docstring and check that help displays it properly.
def middle(a, b, c):
    # Return the middle value of three.
    # Assumes the values can actually be compared.
    values = [a, b, c]
    values.sort()
    return values[1]

Messy code

Read the code and try to predict what it does.
Run it: Does it produce the expected counts?
Refactor the code to improve its readability and structure.
Compare your solution with a partner and discuss your changes.

# Messy code - fix me!
dna = "ATCGATCGAATTCG"
k = 3
kmers = {}
i = 0
while i < len(dna):
    if i + k <= len(dna):
        s = ""
        j = 0
        while j < k:
            s = s + dna[i+j]
            j = j + 1
        if s in kmers:
            kmers[s] = kmers[s] + 1
        else:
            kmers[s] = 1
    i = i + 1
print(kmers)

Solution

~~~python
def count_kmers(dna, k):
    """
    Count all k-mers (substrings of length k) in the given DNA string.

    Parameters:
        dna (str): The DNA sequence.
        k (int): The length of each k-mer.

    Returns:
        dict: A dictionary mapping each k-mer to its count.
    """
    counts = {}
    for i in range(len(dna) - k + 1):
        kmer = dna[i:i+k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

# Example usage
dna_sequence = "ATCGATCGAATTCG"
kmer_length = 3
kmer_counts = count_kmers(dna_sequence, kmer_length)
print(kmer_counts)
~~~

Key Points

Follow standard Python style in your code.

Use docstrings to provide online help.

Debugging

Overview

Teaching: 15 min
Exercises: 15 min

Questions

How can I debug my program?

Objectives

Debug code containing an error systematically.

Identify ways of making code less error-prone and more easily tested.

Once testing has uncovered problems, the next step is to fix them. Many novices do this by making more-or-less random changes to their code until it seems to produce the right answer, but that’s very inefficient (and the result is usually only correct for the one case they’re testing). The more experienced a programmer is, the more systematically they debug, and most follow some variation on the rules explained below.

Know What It’s Supposed to Do

The first step in debugging something is to know what it’s supposed to do. “My program doesn’t work” isn’t good enough: in order to diagnose and fix problems, we need to be able to tell correct output from incorrect. If we can write a test case for the failing case — i.e., if we can assert that with these inputs, the function should produce that result — then we’re ready to start debugging. If we can’t, then we need to figure out how we’re going to know when we’ve fixed things.
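A test case of this kind can be as small as a single assertion. The sketch below uses a hypothetical function, `mean_length`, chosen only because its correct answer is easy to work out by hand.

```python
def mean_length(sequences):
    """Return the mean length of the sequences in a list."""
    total = sum(len(seq) for seq in sequences)
    return total / len(sequences)


# With these inputs, the function should produce that result:
# lengths 2, 4 and 6 have a mean of 4.
assert mean_length(['AT', 'ATCG', 'ATCGAT']) == 4.0
print('test passed')
```

If the assertion fails, we know exactly which input to investigate and what the answer should have been.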

But writing test cases for scientific software is frequently harder than writing test cases for commercial applications, because if we knew what the output of the scientific code was supposed to be, we wouldn’t be running the software: we’d be writing up our results and moving on to the next program. In practice, scientists tend to do the following:

Test with simplified data. Before doing statistics on a real data set, we should try calculating statistics for a single record, for two identical records, for two records whose values are one step apart, or for some other case where we can calculate the right answer by hand.
Test a simplified case. If our program is supposed to simulate magnetic eddies in rapidly-rotating blobs of supercooled helium, our first test should be a blob of helium that isn’t rotating, and isn’t being subjected to any external electromagnetic fields. Similarly, if we’re looking at the effects of climate change on speciation, our first test should hold temperature, precipitation, and other factors constant.
Compare to an oracle. A test oracle is something whose results are trusted, such as experimental data, an older program, or a human expert. We use test oracles to determine if our new program produces the correct results. If we have a test oracle, we should store its output for particular cases so that we can compare it with our new results as often as we like without re-running that program.
Check conservation laws. Mass, energy, and other quantities are conserved in physical systems, so they should be in programs as well. Similarly, if we are analyzing patient data, the number of records should either stay the same or decrease as we move from one analysis to the next (since we might throw away outliers or records with missing values). If “new” patients start appearing out of nowhere as we move through our pipeline, it’s probably a sign that something is wrong.
Visualize. Data analysts frequently use simple visualizations to check both the science they’re doing and the correctness of their code (just as we did in the opening lesson of this tutorial). This should only be used for debugging as a last resort, though, since it’s very hard to compare two visualizations automatically.
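Two of the strategies above, testing with simplified data and checking conservation laws, can be combined in a short sketch. The function `drop_missing` and the three patient records are hypothetical and exist only for this illustration.

```python
def drop_missing(records):
    """Return only the records that have no missing (None) values."""
    return [rec for rec in records if None not in rec]


# Simplified data: three hypothetical patient records (weight, height),
# small enough that the right answer can be worked out by hand
records = [(70, 1.8), (80, None), (150, 1.7)]
cleaned = drop_missing(records)

# Known answer for the simplified case
assert cleaned == [(70, 1.8), (150, 1.7)]

# Conservation check: filtering may remove records but never add them
assert len(cleaned) <= len(records)
print('kept', len(cleaned), 'of', len(records), 'records')
```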

Make It Fail Every Time

We can only debug something when it fails, so the second step is always to find a test case that makes it fail every time. The “every time” part is important because few things are more frustrating than debugging an intermittent problem: if we have to call a function a dozen times to get a single failure, the odds are good that we’ll scroll past the failure when it actually occurs.

As part of this, it’s always important to check that our code is “plugged in”, i.e., that we’re actually exercising the problem that we think we are. Every programmer has spent hours chasing a bug, only to realize that they were actually calling their code on the wrong data set or with the wrong configuration parameters, or are using the wrong version of the software entirely. Mistakes like these are particularly likely to happen when we’re tired, frustrated, and up against a deadline, which is one of the reasons late-night (or overnight) coding sessions are almost never worthwhile.

Make It Fail Fast

If it takes 20 minutes for the bug to surface, we can only do three experiments an hour. That doesn’t just mean we’ll get less data in more time: we’re also more likely to be distracted by other things as we wait for our program to fail, which means the time we are spending on the problem is less focused. It’s therefore critical to make it fail fast.

As well as making the program fail fast in time, we want to make it fail fast in space, i.e., we want to localize the failure to the smallest possible region of code:

The smaller the gap between cause and effect, the easier the connection is to find. Many programmers therefore use a divide and conquer strategy to find bugs, i.e., if the output of a function is wrong, they check whether things are OK in the middle, then concentrate on either the first or second half, and so on.
N things can interact in N² different ways, so every line of code that isn’t run as part of a test means more than one thing we don’t need to worry about.

Change One Thing at a Time, For a Reason

Replacing random chunks of code is unlikely to do much good. (After all, if you got it wrong the first time, you’ll probably get it wrong the second and third as well.) Good programmers therefore change one thing at a time, for a reason. They are either trying to gather more information (“is the bug still there if we change the order of the loops?”) or test a fix (“can we make the bug go away by sorting our data before processing it?”).

Every time we make a change, however small, we should re-run our tests immediately, because the more things we change at once, the harder it is to know what’s responsible for what (those N² interactions again). And we should re-run all of our tests: more than half of fixes made to code introduce (or re-introduce) bugs, so re-running all of our tests tells us whether we have regressed.
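One simple way to make re-running every test cheap is to collect the tests in one place. This is only a sketch: `add_numbers`, the two test functions, and `run_all_tests` are illustrative names, and a testing tool would normally do this collection for us.

```python
def add_numbers(a, b):
    """Return the sum of a and b."""
    return a + b


def test_positive():
    assert add_numbers(1, 2) == 3


def test_negative():
    assert add_numbers(-1, -2) == -3


def run_all_tests():
    """Run every test function and return how many were run."""
    tests = [test_positive, test_negative]
    for test in tests:
        test()
    return len(tests)


# Call this after every change, however small
print(run_all_tests(), 'tests passed')
```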

Keep Track of What You’ve Done

Good scientists keep track of what they’ve done so that they can reproduce their work, and so that they don’t waste time repeating the same experiments or running ones whose results won’t be interesting. Similarly, debugging works best when we keep track of what we’ve done and how well it worked. If we find ourselves asking, “Did left followed by right with an odd number of lines cause the crash? Or was it right followed by left? Or was I using an even number of lines?” then it’s time to step away from the computer, take a deep breath, and start working more systematically.

Records are particularly useful when the time comes to ask for help. People are more likely to listen to us when we can explain clearly what we did, and we’re better able to give them the information they need to be useful.

Version Control Revisited

Version control is often used to reset software to a known state during debugging, and to explore recent changes to code that might be responsible for bugs. In particular, most version control systems have a blame command that will show who last changed particular lines of code…

Be Humble

And speaking of help: if we can’t find a bug in 10 minutes, we should be humble and ask for help. Just explaining the problem aloud is often useful, since hearing what we’re thinking helps us spot inconsistencies and hidden assumptions.

Asking for help also helps alleviate confirmation bias. If we have just spent an hour writing a complicated program, we want it to work, so we’re likely to keep telling ourselves why it should, rather than searching for the reason it doesn’t. People who aren’t emotionally invested in the code can be more objective, which is why they’re often able to spot the simple mistakes we have overlooked.

Part of being humble is learning from our mistakes. Programmers tend to get the same things wrong over and over: either they don’t understand the language and libraries they’re working with, or their model of how things work is wrong. In either case, taking note of why the error occurred and checking for it next time quickly turns into not making the mistake at all.

And that is what makes us most productive in the long run. As the saying goes, “A week of hard work can sometimes save you an hour of thought.” If we train ourselves to avoid making some kinds of mistakes, to break our code into modular, testable chunks, and to turn every assumption (or mistake) into an assertion, it will actually take us less time to produce working programs, not more.

Debug With a Neighbor

Take a function that you have written today, and introduce a tricky bug. Your function should still run, but will give the wrong output. Switch seats with your neighbor and attempt to debug the bug that they introduced into their function. Which of the principles discussed above did you find helpful?

Not Supposed to be the Same

You are assisting a researcher with Python code that computes the Body Mass Index (BMI) of patients. The researcher is concerned because all patients seemingly have identical BMIs, despite having different physiques. BMI is calculated as weight in kilograms divided by the square of height in metres.

Use the debugging principles in this exercise and locate problems with the code. What suggestions would you give the researcher for ensuring any later changes they make work correctly?
patients = [[70, 1.8], [80, 1.9], [150, 1.7]]

def calculate_bmi(weight, height):
    return weight / (height ** 2)

for patient in patients:
    height, weight = patients[0]
    bmi = calculate_bmi(height, weight)
    print("Patient's BMI is: %f" % bmi)
Patient's BMI is: 21.604938
Patient's BMI is: 21.604938
Patient's BMI is: 21.604938

Solution

* The loop is not being utilised correctly. `height` and `weight` are always set to the first patient's data during each iteration of the loop.
* The `height` and `weight` names are swapped both when unpacking the patient data and in the call to `calculate_bmi(...)`. The two swaps cancel out here, but they make the code misleading and fragile.
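One possible corrected version is sketched below. It assumes, as the data suggests, that each patient is stored as `[weight, height]`; the hand-checkable assertion and the `bmis` list are additions for illustration.

```python
patients = [[70, 1.8], [80, 1.9], [150, 1.7]]


def calculate_bmi(weight, height):
    """Return BMI given weight in kilograms and height in metres."""
    return weight / (height ** 2)


# A hand-checkable test case: 100 kg at 2 m gives 100 / 4 = 25
assert calculate_bmi(100, 2.0) == 25.0

bmis = []
for patient in patients:
    # Unpack the current patient, in the order the data is stored
    weight, height = patient
    bmi = calculate_bmi(weight, height)
    bmis.append(bmi)
    print("Patient's BMI is: %f" % bmi)
```

Keeping the small assertion in the script gives the researcher a quick check that later changes have not broken the calculation.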

Hidden Debugging Script (for students without a function to debug)

```python
def count_kmers(dna, k):
    """
    Count all k-mers (substrings of length k) in the given DNA string.

    Parameters:
        dna (str): The DNA sequence.
        k (int): The length of each k-mer.

    Returns:
        dict: A dictionary mapping each k-mer to its count.
    """
    counts = {}
    for i in range(len(dna) - k + 1):
        kmer = dna[k:i+k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

# Example usage
dna_sequence = "ATCGATCGAATTCG"
kmer_length = 3
kmer_counts = count_kmers(dna_sequence, kmer_length)
print(kmer_counts)
```

Key Points

Know what code is supposed to do before trying to debug it.

Make it fail every time.

Make it fail fast.

Change one thing at a time, and for a reason.

Keep track of what you’ve done.

Be humble.

Morning break

Overview

Teaching: 0 min
Exercises: 0 min

Questions

Objectives

Key Points

Defensive Programming

Overview

Teaching: 30 min
Exercises: 15 min

Questions

How can I make my programs more reliable?

Objectives

Explain what an assertion is.

Add assertions that check the program’s state is correct.

Correctly add precondition and postcondition assertions to functions.

Explain what test-driven development is, and use it when creating new functions.

Explain why variables should be initialized using actual data values rather than arbitrary constants.

Our previous lessons have introduced the basic tools of programming: variables and lists, file I/O, loops, conditionals, and functions. What they haven’t done is show us how to tell whether a program is getting the right answer, and how to tell if it’s still getting the right answer as we make changes to it.

To achieve that, we need to:

Write programs that check their own operation.
Write and run tests for widely-used functions.
Make sure we know what “correct” actually means.

The good news is, doing these things will speed up our programming, not slow it down. As in real carpentry — the kind done with lumber — the time saved by measuring carefully before cutting a piece of wood is much greater than the time that measuring takes.

Assertions

The first step toward getting the right answers from our programs is to assume that mistakes will happen and to guard against them. This is called defensive programming, and the most common way to do it is to add assertions to our code so that it checks itself as it runs. An assertion is simply a statement that something must be true at a certain point in a program. When Python sees one, it evaluates the assertion’s condition. If it’s true, Python does nothing, but if it’s false, Python halts the program immediately and prints the error message if one is provided. For example, this piece of code halts as soon as the loop encounters a value that isn’t positive:

numbers = [1.5, 2.3, 0.7, -0.001, 4.4]
total = 0.0
for num in numbers:
    assert num > 0.0, 'Data should only contain positive values'
    total += num
print('total is:', total)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-19-33d87ea29ae4> in <module>()
      2 total = 0.0
      3 for num in numbers:
----> 4     assert num > 0.0, 'Data should only contain positive values'
      5     total += num
      6 print('total is:', total)

AssertionError: Data should only contain positive values

Programs like the Firefox browser are full of assertions: 10–20% of the code they contain is there to check that the other 80–90% is working correctly. Broadly speaking, assertions fall into three categories:

A precondition is something that must be true at the start of a function in order for it to work correctly.
A postcondition is something that the function guarantees is true when it finishes.
An invariant is something that is always true at a particular point inside a piece of code.
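The three categories can be seen side by side in a small sketch. The function `running_total` is a hypothetical example written for this purpose.

```python
def running_total(values):
    """Return the cumulative sums of a list of non-negative numbers."""
    # Precondition: must hold before the function can work correctly
    assert all(v >= 0 for v in values), 'Values must be non-negative'

    totals = []
    current = 0
    for v in values:
        current += v
        totals.append(current)
        # Invariant: always true at this point inside the loop
        assert current >= v, 'Running total fell below the latest value'

    # Postcondition: guaranteed to be true when the function finishes
    assert len(totals) == len(values), 'One total is expected per value'
    return totals


print(running_total([1, 2, 3]))
```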

For example, suppose we are representing rectangles using a tuple of four coordinates (x0, y0, x1, y1), representing the lower left and upper right corners of the rectangle. In order to do some calculations, we need to normalize the rectangle so that the lower left corner is at the origin and the longest side is 1.0 units long. This function does that, but checks that its input is correctly formatted and that its result makes sense:

def normalize_rectangle(rect):
    """Normalizes a rectangle so that it is at the origin and 1.0 units long on its longest axis.
    Input should be of the format (x0, y0, x1, y1).
    (x0, y0) and (x1, y1) define the lower left and upper right corners
    of the rectangle, respectively."""
    assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
    x0, y0, x1, y1 = rect
    assert x0 < x1, 'Invalid X coordinates'
    assert y0 < y1, 'Invalid Y coordinates'

    dx = x1 - x0
    dy = y1 - y0
    if dx > dy:
        scaled = float(dx) / dy
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'

    return (0, 0, upper_x, upper_y)

The preconditions on lines 6, 8, and 9 catch invalid inputs:

print(normalize_rectangle( (0.0, 1.0, 2.0) )) # missing the fourth coordinate

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-2-1b9cd8e18a1f> in <module>
----> 1 print(normalize_rectangle( (0.0, 1.0, 2.0) )) # missing the fourth coordinate

<ipython-input-1-c94cf5b065b9> in normalize_rectangle(rect)
      4     (x0, y0) and (x1, y1) define the lower left and upper right corners
      5     of the rectangle, respectively."""
----> 6     assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
      7     x0, y0, x1, y1 = rect
      8     assert x0 < x1, 'Invalid X coordinates'

AssertionError: Rectangles must contain 4 coordinates

print(normalize_rectangle( (4.0, 2.0, 1.0, 5.0) )) # X axis inverted

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-3-325036405532> in <module>
----> 1 print(normalize_rectangle( (4.0, 2.0, 1.0, 5.0) )) # X axis inverted

<ipython-input-1-c94cf5b065b9> in normalize_rectangle(rect)
      6     assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
      7     x0, y0, x1, y1 = rect
----> 8     assert x0 < x1, 'Invalid X coordinates'
      9     assert y0 < y1, 'Invalid Y coordinates'
     10

AssertionError: Invalid X coordinates

The post-conditions on lines 20 and 21 help us catch bugs by telling us when our calculations might have been incorrect. For example, if we normalize a rectangle that is taller than it is wide everything seems OK:

print(normalize_rectangle( (0.0, 0.0, 1.0, 5.0) ))

(0, 0, 0.2, 1.0)

but if we normalize one that’s wider than it is tall, the assertion is triggered:

print(normalize_rectangle( (0.0, 0.0, 5.0, 1.0) ))

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-5-8d4a48f1d068> in <module>
----> 1 print(normalize_rectangle( (0.0, 0.0, 5.0, 1.0) ))

<ipython-input-1-c94cf5b065b9> in normalize_rectangle(rect)
     19
     20     assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
---> 21     assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'
     22
     23     return (0, 0, upper_x, upper_y)

AssertionError: Calculated upper Y coordinate invalid

Re-reading our function, we realize that line 14 should divide dy by dx rather than dx by dy. In a Jupyter notebook, you can display line numbers by typing Ctrl+M followed by L. If we had left out the assertion at the end of the function, we would have created and returned something that had the right shape as a valid answer, but wasn’t. Detecting and debugging that would almost certainly have taken more time in the long run than writing the assertion.

But assertions aren’t just about catching errors: they also help people understand programs. Each assertion gives the person reading the program a chance to check (consciously or otherwise) that their understanding matches what the code is doing.

Most good programmers follow two rules when adding assertions to their code. The first is, fail early, fail often. The greater the distance between when and where an error occurs and when it’s noticed, the harder the error will be to debug, so good code catches mistakes as early as possible.

The second rule is, turn bugs into assertions or tests. Whenever you fix a bug, write an assertion that catches the mistake should you make it again. If you made a mistake in a piece of code, the odds are good that you have made other mistakes nearby, or will make the same mistake (or a related one) the next time you change it. Writing assertions to check that you haven’t regressed (i.e., haven’t re-introduced an old problem) can save a lot of time in the long run, and helps to warn people who are reading the code (including your future self) that this bit is tricky.
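Applied to the rectangle example, this rule might look like the sketch below: the division in the wide case is corrected, and the input that exposed the bug is kept as a regression test. The docstring is shortened and the two assertions at the end are additions, so this is one possible fix rather than the only one.

```python
def normalize_rectangle(rect):
    """Normalize a rectangle so that it is at the origin and
    1.0 units long on its longest axis."""
    assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
    x0, y0, x1, y1 = rect
    assert x0 < x1, 'Invalid X coordinates'
    assert y0 < y1, 'Invalid Y coordinates'

    dx = x1 - x0
    dy = y1 - y0
    if dx > dy:
        # Fixed: the wide case divides dy by dx
        scaled = dy / dx
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = dx / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'
    return (0, 0, upper_x, upper_y)


# Regression test: the wide rectangle that used to trigger the bug
assert normalize_rectangle((0.0, 0.0, 5.0, 1.0)) == (0, 0, 1.0, 0.2)
# The tall rectangle that already worked should still work
assert normalize_rectangle((0.0, 0.0, 1.0, 5.0)) == (0, 0, 0.2, 1.0)
```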

Test-Driven Development

An assertion checks that something is true at a particular point in the program. The next step is to check the overall behavior of a piece of code, i.e., to make sure that it produces the right output when it’s given a particular input. For example, suppose we need to find where two or more time series overlap. The range of each time series is represented as a pair of numbers, which are the time the interval started and ended. The output is the largest range that they all include:

Figure: graph showing three number lines and, at the bottom, the interval where they all overlap.

Most novice programmers would solve this problem like this:

Write a function range_overlap.
Call it interactively on two or three different inputs.
If it produces the wrong answer, fix the function and re-run that test.

This clearly works — after all, thousands of scientists are doing it right now — but there’s a better way:

Write a short function for each test.
Write a range_overlap function that should pass those tests.
If range_overlap produces any wrong answers, fix it and re-run the test functions.

Writing the tests before writing the function they exercise is called test-driven development (TDD). Its advocates believe it produces better code faster because:

If people write tests after writing the thing to be tested, they are subject to confirmation bias, i.e., they subconsciously write tests to show that their code is correct, rather than to find errors.
Writing tests helps programmers figure out what the function is actually supposed to do.

Here are three tests for range_overlap, each written as an assertion:

assert range_overlap([ (0.0, 1.0) ]) == (0.0, 1.0)
assert range_overlap([ (2.0, 3.0), (2.0, 4.0) ]) == (2.0, 3.0)
assert range_overlap([ (0.0, 1.0), (0.0, 2.0), (-1.0, 1.0) ]) == (0.0, 1.0)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-25-d8be150fbef6> in <module>()
----> 1 assert range_overlap([ (0.0, 1.0) ]) == (0.0, 1.0)
      2 assert range_overlap([ (2.0, 3.0), (2.0, 4.0) ]) == (2.0, 3.0)
      3 assert range_overlap([ (0.0, 1.0), (0.0, 2.0), (-1.0, 1.0) ]) == (0.0, 1.0)

AssertionError:

The error is actually reassuring: we haven’t written range_overlap yet, so if the tests passed, it would be a sign that someone else had and that we were accidentally using their function.

And as a bonus of writing these tests, we’ve implicitly defined what our input and output look like: we expect a list of pairs as input, and produce a single pair as output.

Something important is missing, though. We don’t have any tests for the case where the ranges don’t overlap at all:

assert range_overlap([ (0.0, 1.0), (5.0, 6.0) ]) == ???

What should range_overlap do in this case: fail with an error message, produce a special value like (0.0, 0.0) to signal that there’s no overlap, or something else? Any actual implementation of the function will do one of these things; writing the tests first helps us figure out which is best before we’re emotionally invested in whatever we happened to write before we realized there was an issue.

And what about this case?

assert range_overlap([ (0.0, 1.0), (1.0, 2.0) ]) == ???

Do two segments that touch at their endpoints overlap or not? Mathematicians usually say “yes”, but engineers usually say “no”. The best answer is “whatever is most useful in the rest of our program”, but again, any actual implementation of range_overlap is going to do something, and whatever it is ought to be consistent with what it does when there’s no overlap at all.

Since we’re planning to use the range this function returns as the X axis in a time series chart, we decide that:

every overlap has to have non-zero width, and
we will return the special value None when there’s no overlap.

None is built into Python, and means “nothing here”. (Other languages often call the equivalent value null or nil). With that decision made, we can finish writing our last two tests:

assert range_overlap([ (0.0, 1.0), (5.0, 6.0) ]) == None
assert range_overlap([ (0.0, 1.0), (1.0, 2.0) ]) == None

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-26-d877ef460ba2> in <module>()
----> 1 assert range_overlap([ (0.0, 1.0), (5.0, 6.0) ]) == None
      2 assert range_overlap([ (0.0, 1.0), (1.0, 2.0) ]) == None

AssertionError:

Again, we get an error because we haven’t written our function, but we’re now ready to do so:

def range_overlap(ranges):
    """Return common overlap among a set of [left, right] ranges."""
    max_left = 0.0
    min_right = 1.0
    for (left, right) in ranges:
        max_left = max(max_left, left)
        min_right = min(min_right, right)
    return (max_left, min_right)

Take a moment to think about why we calculate the left endpoint of the overlap as the maximum of the input left endpoints, and the overlap right endpoint as the minimum of the input right endpoints. We’d now like to re-run our tests, but they’re scattered across three different cells. To make running them easier, let’s put them all in a function:

def test_range_overlap():
    assert range_overlap([ (0.0, 1.0), (5.0, 6.0) ]) == None
    assert range_overlap([ (0.0, 1.0), (1.0, 2.0) ]) == None
    assert range_overlap([ (0.0, 1.0) ]) == (0.0, 1.0)
    assert range_overlap([ (2.0, 3.0), (2.0, 4.0) ]) == (2.0, 3.0)
    assert range_overlap([ (0.0, 1.0), (0.0, 2.0), (-1.0, 1.0) ]) == (0.0, 1.0)
    assert range_overlap([]) == None

We can now test range_overlap with a single function call:

test_range_overlap()

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-29-cf9215c96457> in <module>()
----> 1 test_range_overlap()

<ipython-input-28-5d4cd6fd41d9> in test_range_overlap()
      1 def test_range_overlap():
----> 2     assert range_overlap([ (0.0, 1.0), (5.0, 6.0) ]) == None
      3     assert range_overlap([ (0.0, 1.0), (1.0, 2.0) ]) == None
      4     assert range_overlap([ (0.0, 1.0) ]) == (0.0, 1.0)
      5     assert range_overlap([ (2.0, 3.0), (2.0, 4.0) ]) == (2.0, 3.0)

AssertionError:

The first test that was supposed to produce None fails, so we know something is wrong with our function. We don’t know whether the other tests passed or failed because Python halted the program as soon as it spotted the first error. Still, some information is better than none, and if we trace the behavior of the function with that input, we realize that we’re initializing max_left and min_right to 0.0 and 1.0 respectively, regardless of the input values. This violates another important rule of programming: always initialize from data.
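As a minimal sketch of what "initialize from data" could look like here, the version below seeds max_left and min_right from the first range instead of from fixed constants, and returns None for an empty list or an overlap of zero width. It is one possible fix under the decisions made above, not the only way to write the function:

```python
def range_overlap(ranges):
    """Return common overlap among a set of [left, right] ranges."""
    # An empty list has no overlap to report.
    if not ranges:
        return None
    # Initialize from data: start with the first range, not fixed constants.
    max_left, min_right = ranges[0]
    for (left, right) in ranges[1:]:
        max_left = max(max_left, left)
        min_right = min(min_right, right)
    # Every overlap must have non-zero width, so touching endpoints give None.
    if max_left >= min_right:
        return None
    return (max_left, min_right)
```

With this definition in place, calling test_range_overlap() should finish without raising an AssertionError.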

Pre- and Post-Conditions

Suppose you are writing a function called average that calculates the average of the numbers in a list. What pre-conditions and post-conditions would you write for it? Compare your answer to your neighbor’s: can you think of a function that will pass your tests but not theirs, or vice versa?
Solution
# a possible pre-condition:
assert len(input_list) > 0, 'List length must be non-zero'
# a possible post-condition:
assert numpy.min(input_list) <= average <= numpy.max(input_list), \
    'Average should be between min and max of input values (inclusive)'
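To see both conditions at work, here is a small self-contained sketch that wraps them in a complete function. The function body and the variable name result are our own assumptions, and we use the built-in min, max, and sum instead of numpy so that the example runs on its own:

```python
def average(input_list):
    """Return the arithmetic mean of a non-empty list of numbers."""
    # pre-condition: there must be at least one value to average
    assert len(input_list) > 0, 'List length must be non-zero'
    result = sum(input_list) / len(input_list)
    # post-condition: the mean cannot lie outside the range of the inputs
    assert min(input_list) <= result <= max(input_list), \
        'Average should be between min and max of input values (inclusive)'
    return result

print(average([1, 2, 3, 4]))
```

With floating-point inputs, rounding could in principle push the result marginally outside the range, so a more careful version might allow a small tolerance in the post-condition.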

Testing Assertions

Given a sequence of car counts, the function get_total_cars returns the total number of cars.
get_total_cars([1, 2, 3, 4])
10
get_total_cars(['a', 'b', 'c'])
ValueError: invalid literal for int() with base 10: 'a'
Explain in words what the assertions in this function check, and for each one, give an example of input that will make that assertion fail.
def get_total_cars(values):
    assert len(values) > 0
    for element in values:
        assert int(element)
    values = [int(element) for element in values]
    total = sum(values)
    assert total > 0
    return total
Solution

The first assertion checks that the input sequence values is not empty. An empty sequence such as [] will make it fail.

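The second assertion is meant to check that each value in the list can be turned into an integer. Input such as [1, 2, 'c', 3] stops the function at this line, although strictly speaking int('c') raises a ValueError before the assertion itself is evaluated. A value of 0, as in [1, 0, 3], makes the assertion itself fail, because int(0) is 0, which counts as false.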

The third assertion checks that the total of the list is greater than 0. Input such as [-10, 2, 3] will make it fail.

Key Points

Program defensively, i.e., assume that errors are going to arise, and write code to detect them when they do.

Put assertions in programs to check their state as they run, and to help readers understand how those programs are supposed to work.

Use preconditions to check that the inputs to a function are safe to use.

Use postconditions to check that the output from a function is safe to use.

Write tests before writing code in order to help determine exactly what that code is supposed to do.

Command-Line Programs

Overview

Teaching: 30 min
Exercises: 30 min

Questions

How can I write Python programs that will work like Unix command-line tools?

Objectives

Use the values of command-line arguments in a program.

Handle flags and files separately in a command-line program.

Read data from standard input in a program so that it can be used in a pipeline.

The Jupyter Notebook and other interactive tools are great for prototyping code and exploring data, but sooner or later we will want to use our program in a pipeline or run it in a shell script to process thousands of data files. In order to do that, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a dataset and prints the average inflammation per patient.

Switching to Shell Commands

In this lesson we are switching from typing commands in a Python interpreter to typing commands in a shell terminal window (such as bash). When you see a $ in front of a command, it tells you to run that command in the shell rather than in the Python interpreter.

This program does exactly what we want - it prints the average inflammation per patient for a given file.

$ python ../code/readings_04.py --mean inflammation-01.csv

We might also want to look at the minimum of the first four lines

$ head -4 inflammation-01.csv | python ../code/readings_06.py --min

or the maximum inflammations in several files one after another:

$ python ../code/readings_04.py --max inflammation-*.csv

Our scripts should do the following:

If no filename is given on the command line, read data from standard input.
If one or more filenames are given, read data from them and report statistics for each file separately.
Use the --min, --mean, or --max flag to determine what statistic to print.

To make this work, we need to know how to handle command-line arguments in a program, and understand how to handle standard input. We’ll tackle these questions in turn below.

Command-Line Arguments

Using the text editor of your choice, save the following in a text file called sys_version.py:

import sys
print('version is', sys.version)

The first line imports a library called sys, which is short for “system”. It defines values such as sys.version, which describes which version of Python we are running. We can run this script from the command line like this:

$ python sys_version.py

version is 3.4.3+ (default, Jul 28 2015, 13:17:50)
[GCC 4.9.3]

Create another file called argv_list.py and save the following text to it.

import sys
print('sys.argv is', sys.argv)

The strange name argv stands for “argument values”. Whenever Python runs a program, it takes all of the values given on the command line and puts them in the list sys.argv so that the program can determine what they were. If we run this program with no arguments:

$ python argv_list.py

sys.argv is ['argv_list.py']

the only thing in the list is the name of our script as it was typed on the command line, which is always sys.argv[0]. If we run it with a few arguments, however:

$ python argv_list.py first second third

sys.argv is ['argv_list.py', 'first', 'second', 'third']

then Python adds each of those arguments to that magic list.

With this in hand, let’s build a version of readings.py that always prints the per-patient mean of a single data file. The first step is to write a function that outlines our implementation, and a placeholder for the function that does the actual work. By convention this function is usually called main, though we can call it whatever we want:

$ cat ../code/readings_01.py

import sys
import numpy


def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = numpy.loadtxt(filename, delimiter=',')
    for row_mean in numpy.mean(data, axis=1):
        print(row_mean)

This function gets the name of the script from sys.argv[0], because that’s where it’s always put, and the name of the file to process from sys.argv[1]. Here’s a simple test:

$ python ../code/readings_01.py inflammation-01.csv

There is no output because we have defined a function, but haven’t actually called it. Let’s add a call to main:

$ cat ../code/readings_02.py

import sys
import numpy

def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = numpy.loadtxt(filename, delimiter=',')
    for row_mean in numpy.mean(data, axis=1):
        print(row_mean)

if __name__ == '__main__':
    main()

and run that:

$ python ../code/readings_02.py inflammation-01.csv

Running Versus Importing

Running a Python script in bash is very similar to importing that file in Python. The biggest difference is that we don’t expect anything to happen when we import a file, whereas when running a script, we expect to see some output printed to the console.

In order for a Python script to work as expected when imported or when run as a script, we typically put the part of the script that produces output in the following if statement:
if __name__ == '__main__':
    main()  # Or whatever function produces output
When you import a Python file, __name__ is the name of that file (e.g., when importing readings.py, __name__ is 'readings'). However, when running a script in bash, __name__ is always set to '__main__' in that script so that you can determine if the file is being imported or run as a script.

The Right Way to Do It

If our programs can take complex parameters or multiple filenames, we shouldn’t handle sys.argv directly. Instead, we should use Python’s argparse library, which handles common cases in a systematic way, and also makes it easy for us to provide sensible error messages for our users. We will not cover this module in this lesson but you can go to Tshepang Lekhonkhobe’s Argparse tutorial that is part of Python’s Official Documentation.
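To give a flavour of what argparse looks like, here is a minimal sketch of how the flags and filenames used in this lesson might be declared. The function name parse_args, the description text, and the choice of 'mean' as a default action are our own assumptions for illustration; they are not part of the readings scripts:

```python
import argparse

def parse_args(argv):
    """Parse a list of command-line arguments (everything after the script name)."""
    parser = argparse.ArgumentParser(
        description='Print per-patient statistics for inflammation data.')
    # Only one of the three flags may be given at a time.
    group = parser.add_mutually_exclusive_group()
    for flag in ['min', 'mean', 'max']:
        group.add_argument('--' + flag, dest='action',
                           action='store_const', const=flag)
    # Assumption: fall back to the mean when no flag is given.
    parser.set_defaults(action='mean')
    # Zero or more filenames; an empty list would mean "read standard input".
    parser.add_argument('filenames', nargs='*')
    return parser.parse_args(argv)

args = parse_args(['--max', 'small-01.csv', 'small-02.csv'])
print(args.action, args.filenames)
```

In a real script we would call parse_args(sys.argv[1:]). A side benefit is that argparse generates a --help message and reports unrecognized flags for us.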

Handling Multiple Files

The next step is to teach our program how to handle multiple files. Since 60 lines of output per file is a lot to page through, we’ll start by using three smaller files, each of which has three days of data for two patients:

$ ls small-*.csv

small-01.csv small-02.csv small-03.csv

$ cat small-01.csv

0,0,1
0,1,2

$ python ../code/readings_02.py small-01.csv

0.333333333333
1.0

Using small data files as input also allows us to check our results more easily: here, for example, we can see that our program is calculating the mean correctly for each line, whereas we were really taking it on faith before. This is yet another rule of programming: test the simple things first.

We want our program to process each file separately, so we need a loop that executes once for each filename. If we specify the files on the command line, the filenames will be in sys.argv, but we need to be careful: sys.argv[0] will always be the name of our script, rather than the name of a file. We also need to handle an unknown number of filenames, since our program could be run for any number of files.

The solution to both problems is to loop over the contents of sys.argv[1:]. The ‘1’ tells Python to start the slice at location 1, so the program’s name isn’t included; since we’ve left off the upper bound, the slice runs to the end of the list, and includes all the filenames. Here’s our changed program readings_03.py:

$ cat ../code/readings_03.py

import sys
import numpy

def main():
    script = sys.argv[0]
    for filename in sys.argv[1:]:
        data = numpy.loadtxt(filename, delimiter=',')
        for row_mean in numpy.mean(data, axis=1):
            print(row_mean)

if __name__ == '__main__':
    main()

and here it is in action:

$ python ../code/readings_03.py small-01.csv small-02.csv

0.333333333333
1.0
13.6666666667
11.0
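The slice sys.argv[1:] used in readings_03.py can be tried out on an ordinary list. In this small sketch the list argv is a hypothetical stand-in for what sys.argv would contain; it shows that the slice drops the script name, and that it is simply empty when no filenames are given:

```python
# Hypothetical stand-in for sys.argv when two files are named on the command line.
argv = ['readings_03.py', 'small-01.csv', 'small-02.csv']
filenames = argv[1:]          # everything after the script name
print(filenames)

# With no filenames, the slice is an empty list rather than an error.
no_files = ['readings_03.py'][1:]
print(no_files)
```

Because the slice of a one-element list is empty, the for loop in main simply runs zero times when no files are named.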

The Right Way to Do It

At this point, we have created three versions of our script called readings_01.py, readings_02.py, and readings_03.py. We wouldn’t do this in real life: instead, we would have one file called readings.py that we committed to version control every time we got an enhancement working. For teaching, though, we need all the successive versions side by side.

Handling Command-Line Flags

The next step is to teach our program to pay attention to the --min, --mean, and --max flags. These always appear before the names of the files, so we could do this:

$ cat ../code/readings_04.py

import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]

    for filename in filenames:
        data = numpy.loadtxt(filename, delimiter=',')

        if action == '--min':
            values = numpy.min(data, axis=1)
        elif action == '--mean':
            values = numpy.mean(data, axis=1)
        elif action == '--max':
            values = numpy.max(data, axis=1)

        for val in values:
            print(val)

if __name__ == '__main__':
    main()

This works:

$ python ../code/readings_04.py --max small-01.csv

1.0
2.0

but there are several things wrong with it:

main is too large to read comfortably.
If we give only one additional argument on the command line instead of the two the program needs (one for the flag and one for the filename), the program will not throw an exception but will run anyway. Because sys.argv[1] is always treated as the action, even when it is actually a filename, the list of filenames is empty and nothing is processed. Silent failures like this are always hard to debug.
The program should check if the submitted action is one of the three recognized flags.

This version pulls the processing of each file out of the loop into a function of its own. It also checks that action is one of the allowed flags before doing any processing, so that the program fails fast:

$ cat ../code/readings_05.py

import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
           'Action is not one of --min, --mean, or --max: ' + action
    for filename in filenames:
        process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

if __name__ == '__main__':
    main()

This is four lines longer than its predecessor, but broken into more digestible chunks of 8 and 12 lines.

Handling Standard Input

The next thing our program has to do is read data from standard input if no filenames are given so that we can put it in a pipeline, redirect input to it, and so on. Let’s experiment in another script called count_stdin.py:

$ cat ../code/count_stdin.py

import sys

count = 0
for line in sys.stdin:
    count += 1

print(count, 'lines in standard input')

This little program reads lines from a special “file” called sys.stdin, which is automatically connected to the program’s standard input. We don’t have to open it — Python and the operating system take care of that when the program starts up — but we can do almost anything with it that we could do to a regular file. Let’s try running it as if it were a regular command-line program:

$ python ../code/count_stdin.py < small-01.csv

2 lines in standard input
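Since the loop only needs something it can read line by line, the same counting logic can be tried without a shell. The sketch below is an illustration under our own assumptions: the function name count_lines is hypothetical, and io.StringIO stands in for sys.stdin with text that mirrors small-01.csv:

```python
import io

def count_lines(stream):
    """Count the lines in any file-like object, such as sys.stdin."""
    count = 0
    for line in stream:
        count += 1
    return count

# io.StringIO behaves like an open file; here it plays the role of standard input.
fake_stdin = io.StringIO('0,0,1\n0,1,2\n')
print(count_lines(fake_stdin), 'lines in standard input')
```

Passing sys.stdin instead of fake_stdin would give the behaviour of count_stdin.py.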

A common mistake is to try to run something that reads from standard input like this:

$ python ../code/count_stdin.py small-01.csv

i.e., to forget the < character that redirects the file to standard input. In this case, there’s nothing in standard input, so the program waits at the start of the loop for someone to type something on the keyboard. Since that is not what we intended, the program appears to be stuck, and we have to halt it by pressing Ctrl+C in the terminal (or by using the Interrupt option from the Kernel menu if we launched it from the Notebook).

We now need to rewrite the program so that it loads data from sys.stdin if no filenames are provided. Luckily, numpy.loadtxt can handle either a filename or an open file as its first parameter, so we don’t actually need to change process. Only main changes:

$ cat ../code/readings_06.py

import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
           'Action is not one of --min, --mean, or --max: ' + action
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

if __name__ == '__main__':
    main()

Let’s try it out:

$ python ../code/readings_06.py --mean < small-01.csv

0.333333333333
1.0

That’s better. In fact, that’s done: the program now does everything we set out to do.

Arithmetic on the Command Line

Write a command-line program that does addition and subtraction:

$ python arith.py add 1 2

3

$ python arith.py subtract 3 4

-1

Solution

import sys

def main():
    assert len(sys.argv) == 4, 'Need exactly 3 arguments'

    operator = sys.argv[1]
    assert operator in ['add', 'subtract', 'multiply', 'divide'], \
        'Operator is not one of add, subtract, multiply, or divide: bailing out'
    try:
        operand1, operand2 = float(sys.argv[2]), float(sys.argv[3])
    except ValueError:
        print('cannot convert input to a number: bailing out')
        return

    do_arithmetic(operand1, operator, operand2)

def do_arithmetic(operand1, operator, operand2):

    if operator == 'add':
        value = operand1 + operand2
    elif operator == 'subtract':
        value = operand1 - operand2
    elif operator == 'multiply':
        value = operand1 * operand2
    elif operator == 'divide':
        value = operand1 / operand2
    print(value)

main()

Finding Particular Files

Using the glob module introduced earlier, write a simple version of ls that shows files in the current directory with a particular suffix. A call to this script should look like this:

$ python my_ls.py py

left.py
right.py
zero.py

Solution

import sys
import glob

def main():
    """prints names of all files with sys.argv[1] as suffix"""
    assert len(sys.argv) >= 2, 'Argument list cannot be empty'
    suffix = sys.argv[1] # NB: behaviour is not as you'd expect if sys.argv[1] is *
    glob_input = '*.' + suffix # construct the input
    glob_output = sorted(glob.glob(glob_input)) # call the glob function
    for item in glob_output: # print the output
        print(item)
    return

main()

Changing Flags

Rewrite readings.py so that it uses -n, -m, and -x instead of --min, --mean, and --max respectively. Is the code easier to read? Is the program easier to understand?

Solution

# this is code/readings_07.py
import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['-n', '-m', '-x'], \
           'Action is not one of -n, -m, or -x: ' + action
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '-n':
        values = numpy.min(data, axis=1)
    elif action == '-m':
        values = numpy.mean(data, axis=1)
    elif action == '-x':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

main()

Adding a Help Message

Separately, modify readings.py so that if no parameters are given (i.e., no action is specified and no filenames are given), it prints a message explaining how it should be used.

Solution

# this is code/readings_08.py
import sys
import numpy

def main():
    script = sys.argv[0]
    if len(sys.argv) == 1: # no arguments, so print help message
        print("""Usage: python readings_08.py action filenames
              action must be one of --min --mean --max
              if filenames is blank, input is taken from stdin;
              otherwise, each filename in the list of arguments is processed in turn""")
        return

    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
           'Action is not one of --min, --mean, or --max: ' + action
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

main()

Adding a Default Action

Separately, modify readings.py so that if no action is given it displays the means of the data.

Solution

# this is code/readings_09.py
import sys
import numpy

def main():
    script = sys.argv[0]
    # check the length first so that running with no arguments at all does not fail
    if len(sys.argv) > 1 and sys.argv[1] in ['--min', '--mean', '--max']:
        action = sys.argv[1]
        filenames = sys.argv[2:]
    else: # if no action given
        action = '--mean'    # set a default action, that being mean
        filenames = sys.argv[1:] # start the filenames one place earlier in the argv list

    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

main()

A File-Checker

Write a program called check.py that takes the names of one or more inflammation data files as arguments and checks that all the files have the same number of rows and columns. What is the best way to test your program?

Solution

import sys
import numpy

def main():
    script = sys.argv[0]
    filenames = sys.argv[1:]
    if len(filenames) <= 1: # nothing to compare
        print('Need at least 2 files to compare')
    else:
        nrow0, ncol0 = row_col_count(filenames[0])
        print('First file %s: %d rows and %d columns' % (filenames[0], nrow0, ncol0))
        for filename in filenames[1:]:
            nrow, ncol = row_col_count(filename)
            if nrow != nrow0 or ncol != ncol0:
                print('File %s does not check: %d rows and %d columns' % (filename, nrow, ncol))
            else:
                print('File %s checks' % filename)
        return

def row_col_count(filename):
    try:
        nrow, ncol = numpy.loadtxt(filename, delimiter=',').shape
    except ValueError:
        # 'ValueError' error is raised when numpy encounters lines that
        # have different number of data elements in them than the rest of the lines,
        # or when lines have non-numeric elements
        nrow, ncol = (0, 0)
    return nrow, ncol

main()

Counting Lines

Write a program called line_count.py that works like the Unix wc command:

If no filenames are given, it reports the number of lines in standard input.
If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.

Solution

import sys

def main():
    """print each input filename and the number of lines in it,
       and print the sum of the number of lines"""
    filenames = sys.argv[1:]
    sum_nlines = 0 #initialize counting variable

    if len(filenames) == 0: # no filenames, just stdin
        sum_nlines = count_file_like(sys.stdin)
        print('stdin: %d' % sum_nlines)
    else:
        for filename in filenames:
            nlines = count_file(filename)
            print('%s %d' % (filename, nlines))
            sum_nlines += nlines
        print('total: %d' % sum_nlines)

def count_file(filename):
    """count the number of lines in a file"""
    with open(filename, 'r') as f:
        nlines = len(f.readlines())
    return nlines

def count_file_like(file_like):
    """count the number of lines in a file-like object (eg stdin)"""
    n = 0
    for line in file_like:
        n = n+1
    return n

main()

Generate an Error Message

Write a program called check_arguments.py that prints usage then exits the program if no arguments are provided. (Hint: You can use sys.exit() to exit the program.)
$ python check_arguments.py
usage: python check_arguments.py filename.txt
$ python check_arguments.py filename.txt
Thanks for specifying arguments!
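One possible shape for such a program is sketched below. The function name check_arguments and the exit status of 1 are our own assumptions; the exercise only asks that usage is printed and the program exits:

```python
import sys

USAGE = 'usage: python check_arguments.py filename.txt'

def check_arguments(argv):
    """Return a thank-you message, or print usage and exit if no arguments are given."""
    # argv[0] is the script name, so fewer than 2 items means no arguments.
    if len(argv) < 2:
        print(USAGE)
        sys.exit(1)   # assumption: a non-zero status signals incorrect usage
    return 'Thanks for specifying arguments!'

# A hypothetical argument list standing in for sys.argv.
print(check_arguments(['check_arguments.py', 'filename.txt']))
```

In the finished script, the last line would pass sys.argv instead of the hypothetical list.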

Key Points

The sys library connects a Python program to the system it is running on.

The list sys.argv contains the command-line arguments that a program was run with.

Avoid silent failures.

The pseudo-file sys.stdin connects to a program’s standard input.

Lunch break

Overview

Teaching: 0 min
Exercises: 0 min

Questions

Objectives

Key Points

Project: Roman numeral converter

Overview

Teaching: 0 min
Exercises: 150 min

Questions

Write a Roman numeral converter

Objectives

Learn how to solve a problem by designing a Python program

Get more hands-on experience

As our first programming exercise, we’ll build a Roman numeral converter. This will help you to go through the full process of designing a program to solve a data challenge.

Introduction to Roman Numerals

Roman numerals originated in ancient Rome and remained in common use throughout Western Europe until the late Middle Ages. They are composed of combinations of letters from the Latin alphabet: I, V, X, L, C, D, and M, which stand for:

I = 1
V = 5
X = 10
L = 50
C = 100
D = 500
M = 1000

Basic Rules of Roman Numerals:

Addition: When a smaller numeral appears after a larger or equal one, you add the values (e.g., VI is 6 because V + I = 6).
Subtraction: When a smaller numeral appears before a larger one, you subtract the smaller from the larger (e.g., IV is 4 because I is before V). The most common subtraction combinations are:
- I can be placed before V and X to make 4 and 9.
- X can be placed before L and C to make 40 and 90.
- C can be placed before D and M to make 400 and 900.
Repetition: The same symbol can be repeated up to three times in succession. For example:
- III = 3
- XXX = 30
Combination: Roman numerals are usually written from largest to smallest from left to right, apart from when doing subtractions.

Examples:

III = 1 + 1 + 1 = 3
IV = 5 - 1 = 4
IX = 10 - 1 = 9
LVIII = 50 + 5 + 3 = 58
MCMXCIV = 1000 + (1000 - 100) + (100 - 10) + (5 - 1) = 1994
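If you would like something to compare your own design against, the addition and subtraction rules above can be sketched in a few lines. This is only one possible approach, so try the project yourself first. The names ROMAN_VALUES and roman_to_int are our own, and the sketch assumes the input is a well-formed numeral: it does not check the repetition rule or reject invalid sequences.

```python
# Value of each Roman symbol, as listed in the table above.
ROMAN_VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}

def roman_to_int(numeral):
    """Convert a well-formed Roman numeral string to an integer."""
    total = 0
    for i, symbol in enumerate(numeral):
        value = ROMAN_VALUES[symbol]
        # Subtraction rule: a smaller symbol directly before a larger one is subtracted.
        if i + 1 < len(numeral) and value < ROMAN_VALUES[numeral[i + 1]]:
            total -= value
        else:
            # Addition rule: otherwise the symbol's value is added.
            total += value
    return total

print(roman_to_int('MCMXCIV'))
```

An unknown character raises a KeyError here; deciding how to report invalid input is part of the design work for the project.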

Why It Matters in Programming

Converting between Roman and Arabic numerals involves understanding both addition and subtraction rules, making it an excellent exercise for developing algorithms. You’ll need to read in strings of characters while adhering to specific logic patterns, a fundamental skill in programming.

As you embark on this project, think about the different approaches you can take. Consider efficiency, readability, and how you might handle edge cases or errors (such as invalid numeral sequences). This will not only help you write better code but also enhance your problem-solving skills.

Bonus challenge

Write an Arabic to Roman numeral converter, i.e., one that goes the other way around.

Key Points
