# Python Coroutines

In this post we’ll cover Python coroutines. First we’ll study David Beazley’s great tutorial [1] to learn about generators and coroutines, and how to build a multitask system from the ground-up.

Goddess Durga – source: Flickr

We’ll then take a look at a Python module called asyncio which makes use of coroutines to enable asynchronous I/O.

## Generators

Python generators are functions what allow return the execution back to the caller such that when called again it will resume its execution from the same place. It’s the same concept of JavaScript generators, which we talked about before.

The syntax for yielding the execution back is via the yield keyword. Here’s an example of a custom implementation of range():

Because the execution can be yielded at any time in the code, we can use infinite loops:

Another way to use generators is to “pipe” (in the Linux way), one generator into another, like in the example where we read lines from a dictionary file and find those ending in “ion”:

This also highlights one advantage of using generators abstraction: the lower memory footprint since it can process one line at a time. Sure, we could achieve the same with regular functions, but they would need to be combined in a single block:

Which is efficient but doesn’t allow modularity.

Combined with the fact that generators can be infinite loops, one can model functionality like tail and grep as generators (e.g. tail -f | grep foo) which is the exact example Beazley provides in his presentation [1].

## Coroutines

In PEP 342 it explains the motivation behind coroutines:

Coroutines are a natural way of expressing many algorithms, such as simulations, games, asynchronous I/O, and other forms of event-driven programming or co-operative multitasking. Python’s generator functions are almost coroutines — but not quite — in that they allow pausing execution to produce a value, but do not provide for values or exceptions to be passed in when execution resumes. They also do not allow execution to be paused within the try portion of try/finally blocks, and therefore make it difficult for an aborted coroutine to clean up after itself.

There are four main changes:

• yield is an expression instead of statement, which means you can assign the return type of yield to a variable.
• send() method which can be used to send value to the yield expression. send() is a general version of next(), which is equivalent to .send(None).
• throw() method which raises an exception at the place where the last yield was called.
• close() method which raises a GeneratorExit at the place where the last yield was called. Effectively used to force terminating the generator.

Here’s a simple coroutine that just prints what it receives:

Notice that we have to call an empty send() to “initialize” the coroutine. According to PEP 342:

Because generator-iterators begin execution at the top of the generator’s function body, there is no yield expression to receive a value when the generator has just been created. Therefore, calling send() with a non-None argument is prohibited when the generator iterator has just started (…). Thus, before you can communicate with a coroutine you must first call next() or send(None) to advance its execution to the first yield expression.

In the example below we highlight how this flow can be confusing. The first send() receives the value from yield, but it is only able to send a value through a subsequent send().

Since in a coroutine most of the times the first send() will be to initialize the coroutine, Beazley provides a decorator to abstract that:

### Piping

We can chain multiple coroutines the same we did with generators but now the order is somewhat reversed. To see why, consider our generator pipe example rewritten with coroutines. In here we pass print_step() to filter_step() and filter_step() to get_words(). Contrast this with the generator version where we start with get_words, pass its results to filter_step() and then print_step().

### Event Handling

Beazley provides an example using an XML parser to manipulate XML and convert to JSON. The API of xml.sax relies on the visitor pattern, where at important steps of the AST traversal (e.g. startElement, character, endElement) it calls specific methods of a content handler.

The example uses an adapter layer (cosax.py), to allow using coroutines instead the content handler. It also combines multiple smaller coroutines to produce a more complex ones, demonstrating the composability of coroutines.

After some preliminary overview of Operating Systems Beazley builds a prototype of a multitasking system by modeling coroutines as tasks (no threads or processes).

• Task – wrapper around coroutine
• Scheduler – can create new tasks and schedule tasks – keeps a loop as long as there are tasks. Existing tasks can cause tasks to be scheduled. JavaScript event loop.

Yielding inside a task is like an OS interruption (trap).

#### System Calls

One of the interesting implementation of the multitasking system is system calls. The task/coroutine issues a system call by providing it as a value to the yield (e.g. yield GetTid()). This is received by the scheduler which can provide the necessary context to the specific system call implementation.

The tutorial then covers one of the most important parts of the multitasking system which is the ability to do asynchronous I/O operations (e.g. reading and writing to a file).

The idea is to introduce a “system call” corresponding to performing I/O. When a task invokes that, it doesn’t get rescheduled but put in a staging area (self.read_waiting in the code below). The scheduler has then a continuously run task that polls the OS to check if any of the file descriptors corresponding to the tasks in the staging area are ready, in which case the task is rescheduled.

Beazley uses that system to implement a simple server that receives data via a socket:

#### Trampolining

One of the limitations of the scheduler that is developed up to this point is that the yield has to happen at the top-level of coroutine, that is, it cannot invoke other functions that yield, which limits refactoring and modularization.

To overcome this limitation, the author proposes a technique called trampolining. Here’s an example where the execution of add() is done by the top level code, not by main():

We can do a similar thing in the multitasking system. Because the coroutines can recursively call other coroutines, we need to keep a callstack. We can do it inside Task by keeping an explicit stack:

In the example above, when a coroutine that is wrapped in Task makes another coroutine call, for example:

client, addr = yield Accept(sock)

Within Task, result will be a generator, so the current coroutine will be put on the stack and the execution will pass to that generator instead:

self.target = result

When that sub-coroutine yields the execution, the original caller is resumed

self.target = self.stack.pop()

Note that this all happen transparently to the scheduler itself.

This concludes the study of the multitask system from the ground up.

## Modern Asynchronous I/O

The tutorial we just studied is from 2009 and while still relevant, Python went through many changes since then.

We’ll now cover some of the development since the generator-based coroutines were introduced and that culminated in a standard module for handling asynchronous I/O.

### Delegating to subgenerators: yield from

PEP 380 introduced a new syntax for delegating the execution to another generator until that generator is finished. Simeon [5] provides an example where using yield from simplifies the code. In the code below generator() calls generator2() and generator3() in a loop to “exhaust” their execution.

The same can be accomplished with yield for, which does that implicitly:

### Async / Await

PEP 492 Introduces the async and await syntax. The motivation contains these two reasons among others:

It is easy to confuse coroutines with regular generators, since they share the same syntax; this is especially true for new developers.

Whether or not a function is a coroutine is determined by a presence of yield or yield from statements in its body, which can lead to unobvious errors when such statements appear in or disappear from function body during refactoring

To address these issues, a function containing the async is considered a native coroutine (as opposed to a generator-based coroutine):

A native coroutine can await generator-based coroutines in which case it has the same  behavior as the wait for. Borrowing from the example above:

### The asyncio module

PEP 3156 Describes a system to support asynchronous I/O, which is now known as the asyncio module. It proposes a coroutine-based multitasking system similar to the one Beazley describes. It has support to common I/O operations like those involving files, sockets, subprocesses, etc.

We won’t dive into much detail in this post, but we can see parallels to the scheduler we  studied previously. Here’s an example where asyncio.run() schedules the coroutine main() which in turn executes the sub-coroutines sleep() one after another:

This is not taking advantage of the non-blocking capabilities of asyncio. Why? Recall that await is equivalent to yield from and that causes the current coroutine to wait the call to finish until it continues. If we run this code we get:

will sleep: 1 slept: 1 will sleep: 2 slept: 2

What we want is to schedule both sub-coroutines at the same time, and asyncio allows that via the gather() method:

If we run this code we get:

will sleep 1 will sleep 2 slept 1 slept 2

Which means that the first sleep executed but yielded the execution back to the scheduler after the sleep() call. The second sleep() got run, printing “will sleep 2”.

## Conclusion

I’ve used async/await and event loops in other languages such as Hack and JavaScript, but only recently ran into it in Python. I kept seeing mentions of coroutines and that led me to study it in more details. Overall I felt like I learned a lot.

David Beazley’s tutorial is really good. They’re thorough and provide lots of analogies with operating systems concepts.

I also liked the presentation medium: the slides are self-contained (all the information is present as text) and he shows the whole code at first then repeats the code multiple times in subsequent slides, highlighting the important pieces in each of them. This is difficult to achieve in flat text like a blog post.

## References

[1] David Beazley – A Curious Course on Coroutines and Concurrency
[2] PEP 342 – Coroutines via Enhanced Generators
[3] PEP 380 — Syntax for Delegating to a Subgenerator
[4] PEP 492 — Coroutines with async and await syntax
[5] Simeon Visser – Python 3: Using “yield from” in Generators – Part 1
[6] PEP 3156 — Asynchronous IO Support Rebooted: the “asyncio” Module

# 2019 in Review

This is a meta-post to review what happened in 2019.

## Posts Summary

This year I continued my efforts to learn about Bioinformatics. I wrote about the DNA Assembly problem, Protein Fold prediction and Protein Design.

I covered some Computer Science topics such as Consistent Hashing, the Aho-Corasick algorithm and Constructing Trees from a Distance Matrix (as part of my studies in Bioinformatics).

For Web Development I learned about JavaScript Async Functions and CDNs in more details. I explored Observable, a JavaScript notebook.

As part of my resolution for 2019, I learned a bit more about Rust, in particular its memory management model which is the part I struggled (and still do) the most with. I also started revisiting Python because of work, and started with Python Type Hints.

I assembled my own Desktop which inspired me to learn more about the Von Neumann Architecture. I also went back to using Linux (after many years exclusive on MacOS).

The most popular blog post of the year was the Python Type Hints (10k views, 9k visits in just 2 days!). It was somehow feature on the front page of Hacker News being at position #8 at some point! The discussion was mostly about the merits of typed vs. untyped languages but it was a nice surprise nevertheless.

With some delays here and there, I kept the resolution to post once a month on average. The blog completed 7 years with 92 posts.

## Resolutions for 2020

My plan is to focus in learning Python, since now I have an opportunity to use it at work. I’ll keep Rust on the side and learn more about it if time permits.

My studies on Bioinformatics made me feel that the computational models are too simplistic to be useful in practice. I’d love to get exposed to real-world examples, even if they’re ugly heuristics, but for now I’m done with the theory. As I dived deeper on the subjects, I started seeing more and more mentions to quantum physics and quantum computing.

I studied QM back in college, but I cannot claim to understand even the basics of it. As for Quantum Computing, I know as much as Justin Trudeau ;) My focus will be in learning Quantum Computing.

I’m still interested in reading more papers, especially on distributed systems and AI, and mobile development (I mentioned these for the sake of tradition at this point, since I never end up following up on them).

## Personal

At the end of the year I like to look back and reflect on and remember all the things I’ve done besides work and the technical blog.

### Trips

In summary, in 2019 I visited Israel, Egypt, India. Within the US, I had a chance to visit many national parks.

Early in the year, after a work trip, I went to Egypt and visited some sites in Cairo and Luxor.

1. Giza Pyramids, 2. Temple in Luxor, 3. Lanterns in the Khan el-Khalili

In November we attended a wedding in Hyderabad. We decided to tour around New Delhi, Jaipur and Agra before heading south to Hyderabad and Tirupati. From my selection below, one can see how much I admire the Mughal architecture.

Top: 1. Humayun Tomb (New Delhi), 2. Hawa Mahal (Jaipur), 3. Jal Mahal. Bottom: 1. Karauli Temple, 2. Taj Mahal (Agra), 3. Qtub Tombs (Hyderabad)

I had a chance to see many new National Parks and their natural wonders, as part of a quick trip to Seattle, Washington and a road trip in Southern Utah.

Top: 1. Lake Diablo (North Cascades), 2. Angel’s Landing Overlook (Zion), 3. Canyons (Bryce). Bottom: 4. Rock formations along the road 5. Famous arch (Arches) 6. Islands in the sky (Canyonlands).

### Books

Books I enjoyed this year:

• Range: Why Generalists Triumph in a Specialized World – David Epstein makes a case that being a generalist can lead to a more successful path. He provided a bunch of examples including Roger Federer and Galileo. Engaging.
• A History of Western Philosophy – Bertrand Russell covers many philosophers and schools of thought. Good perspective on Western Europe’s history. His accounts often expose his own beliefs and opinions.
• Ten Drugs: How Plants, Powders, and Pills Have Shaped the History of Medicine – Thomas Hager describes the history of 10 drugs, from opioids, to vaccines and high colestherol drugs. Engaging.
• The Book of Why – Judea Pearl describes some of his work such as Causal Models and Do Calculus in accessible language. The ideas are very clever and interesting, but the language of the book is a bit hyperbolic (e.g. terms like miracle, revolution and “scientists couldn’t have done this before“) which gets in the way.
• Algorithms to Live By – Brian Christian and Tom Griffiths draw ideas from Computer Science (mostly combinatorial optimization) to solve problems from real life. Great to see many problems and algorithms from grad school in here.
• Outliers – Malcom Gladwell analyzes several success stories to make a compelling case that success is not about just genius and hard-work. The environment where you were born and grow up matter a lot more than we think.
• Educated: A Memoir – Intense and powerful story by Tara Westover. She grew up in an environment very hostile to her education but succeeded in achieving high goals against the odds. Really well written.
• Shoe Dog: A Memoir by the Creator of NIKE – history of Nike from the perspective of its founder, Phil Knight. He is a good writer and Nike had a very interesting initial business model.
• Mythos: The Greek Myths Retold – Stephen Fry provides a nice overview of the Greek Mythology focusing on the gods and semi-gods (not heroes). Casual writing style and interesting connections to the current world (via etymologies).

I’m looking forward to many more adventures, growth and learning in 2020!

# Observable

Observable is a web-first interactive notebook for data analysis, visualization, and exploration [1].

It was created by Mike Bostock, the creator of d3.js, which we discussed previously way back in 2014 [2]. At some point, Mike stepped down from d3.js to focus on a new library called d3.express, which he presented in 2017 during the OpenVis conference [3] and was still in the makes. d3.express eventually got renamed to Observable.

I’ve been experimenting with Observable and in this post I’ll cover some of my learnings while trying to build two visualizations.

One of the benefits of notebooks like Observable is to co-locate code with its output. This makes it much easier to explain code because in addition to markdown comments, we can modify the code and see the results in real-time. This means that a lot of the documentation of APIs available in Observable are notebooks themselves. For these reasons, this post consists of a series of links to notebooks with high-level comments.

Before we start, some assumptions I’m making are familiarity with imperative programming and the basics of JavaScript.

## What is Observable?

Why Observable – this notebook explains the motivation behind Observable and how it works at a high level:

An Observable notebook consists of reactive cells, each defining a piece of state. Rather than track dependencies between mutable values in your head, the runtime automatically extracts dependencies via static analysis and efficiently re-evaluates cells whenever things change, like a spreadsheet

Five-Minute Introduction – provides a hands-on approach of the many topics, some of which we’ll cover in more detail in this post:

• Real-time evaluation of JavaScript snippets
• Cell references
• Markdown
• HTML/DOM display
• Promises
• Generators
• Views
• Importing from other notebooks

Let’s start by delving into the details of JavaScript snippets and cell references.

## Observable vs. JavaScript

Observable’s not JavaScript – in this notebook Bostock explains the differences between JavaScript and Observable. Some notes I found interesting from that notebook:

• A cell is composed by a cell name and an expression:
• [cell name] = [expression]

It can be simple statements like 1 + 2, or multi-line statements like

• The value of  cell_name can be used in other cells, but by default is read-only.
• Each cell can only have one  cell_name  assignment. In other words, it can only “export” one variable. It’s possible to cheat by exporting an object. Note that because curly brackets are used to denote code blocks, we need to wrap an object literal with parenthesis:

• Cells can refer to other cells in any part of the code – Observable builds a dependency DAG to figure out the right order. This also means dependencies cannot be circular. How Observable Runs explains this in more details.
• Constructs like async functions (await) and generators (yield) have special behavior in a cell. We’ll expand in the Generators section below.
• Cells can be mutated if declared with a qualifier (mutable). We’ll expand in the Mutability section below.

## Mutability

Introduction to Mutable State – cells are read-only by default but Observable allows changing the value of a cell by declaring it mutable. When changing the value of the cell elsewhere, the mutable keyword also must be used.

It’s important to note that the cell is immutable, but if the cell is a reference to a value, say an Object, then the value can be mutated. This can lead to unexpected behavior, so  I created a notebook, Mutating references, to highlight this issue.

## Markdown

Markdown summary – is a great cheat sheet of Markdown syntax in Observable. I find the Markdown syntax in Observable verbose which is not ideal given text is so prevalent in notebooks:

mdHello World 
More cumbersome still is typing inline code. Because it uses backticks, It has to be escaped:

mdHello \code\

To be fair, most code will be typed as cells, so this shouldn’t be too common. It supports LaTeX via KaTeX which is awesome. KaTeX on Observable contains a bunch of examples.

## Generators

Generators are a JavaScript construct that allows a function to yield the execution back to the caller and resume from where it stopped when it’s called again. We talked about this in the Async Functions in JavaScript post.

Introduction to Generators explains how generators can be used in Observable. The combination of generators and delays (via promises) is a great mechanism to implement a ticker, which is the base of animation. Here’s a very simple way to define a ticker in Observable (remember that Promises are awaited without extra syntax):

## Views

Introduction to Views explains what views are and how to construct one from scratch. I like this recommendation:

If there is some value that you would like the user to control in your notebook, then usually the right way to represent that value is as a view.

Views are thus convenient ways to expose parameters via buttons and knobs such as text inputs, dropdowns and checkboxes.

I ran into a problem when using checkbox with labels in it, like the cell below:

It does not yield true/false as I wanted. This notebook, Checkbox, discusses the problem and offers a solution. This is an example where great but imperfect abstractions make it hard to understand when something doesn’t work. It’s worth learning about how viewof is implemented behind the scenes.

In one of my experiments I needed to synchronize a view and a ticker. Bostock provides a neat abstraction in his Synchronized Views notebook.

## Importing from other notebooks

Introduction to Imports shows how easy it is to import cells from other notebooks.

It also shows that it’s possible to override the value of some of the cells in that notebook:

This is neat, because the chart cell depends on the data, and by allowing overriding data, it allows parameterizing chart. In other words, we import chart as a function as opposed to a value (the result of chart(data)).

Versioning. When importing a notebook we always import the most recent version, which means it could potentially break if the notebook’s contract changes. The introduction above mentions the possibility of specifying the version when importing but didn’t provide an example on how to do so.

My strategy so far has been to clone a notebook if I’m really concerned about it changing, but that means one has to manually update / re-fork if they want to sync with the upstream changes.

## React Hooks

I’m used to write UI components with React and luckily Jed Fox created a notebook which I forked into this React Notebook which allowed me to use React Hooks to write a view.

## Conclusion

I really like Observable and am looking forward to add implement more data visualizations in it. I experimented with 2 animations so far: a simplistic Solar System model and a spinning Globe.

One natural question that crossed my mind is whether I should have written this whole post as an Observable notebook. In the end I decided to stick with an old-school static post for two reasons:

• Consistency: Observable only supports JavaScript, so if I’m writing a post about Python I’d need to fallback to a post, and I wouldn’t want to have a mix of both.
• Durability: Observable has a lot of potential but it’s still mostly experimental and I’m not sure I can rely on it sticking around for a long time.

## References

• Introduction
• Mutability
• Markdown
• Views
•  Importing
• Integration

### Other

[1] Observable – Why Observable?
[2] Introduction to d3.js
[3] OpenViz Conf – 2017
[4] Async Functions in JavaScript

# Python Type Hints

Jukka Lehtosalo is a Finnish Software Engineer at Dropbox. He developed an optional type system for Python, mypy, during his PhD thesis in Cambridge. After meeting in a Python conference, Guido van Rossum (creator of Python) invited Lehtosalo to join him at Dropbox. They started adopting mypy in real use cases during a Hackathon, which led to mypy being one of the most popular Python type checkers [2].

In this post we’ll cover mypy in general terms as well many examples demonstrating the syntax and capabilities of this type checker.

## Introduction

mypy is a static type checker. This means it does not run during the code execution, so it’s mostly useful during development, much like tests. It relies on manual annotations in the code called type hints, which identify the types of arguments, return types, internal variables and member variables.

One simple example of such annotations is for a function to compute the length of a string:

We are indicating that the input argument s is a string and it return an integer corresponding to its length. The main job of a type checker is to find inconsistencies in the annotations in our code (or in the libraries we use). For example, here’s an incorrect type annotation:

When both operands are integers, the operator + returns another integer, so n + 1 is an integer, which is a clear contradiction given we provided the return type as string.

### Type annotations are free unit tests

This contrived example doesn’t make the usefulness of a type checker obvious, but imagine a case where we implicitly make assumption on the argument type. For example, a small program to compute how much an exchange house would give a user for their cash.

If we don’t perform input validation correctly, the code above might work for the tested scenarios but would fail in production the moment the user provided a string. Of course good QA will cover scenarios like these, but this is to sort of guarantees one gets for free by making the types explicit:

The previous code would now fail the type checker validation.

### Type hints are optional and incremental

Most of the design of type annotations are well documented in PEP 484, the document claims:

Python will remain a dynamically typed language, and the authors have no desire to ever make type hints mandatory, even by convention.

This also seems to imply that Python type hints will always be partial / gradual, since if full typing is required, it will make transition from non-typed to fully typed codebases prohibitive. Also, there are concrete benefits even with partial typing.

A partial type system makes it optional to add type annotations to variables, instead of it being fully mandatory (like Java or C++). The type checker then performs validation with whatever information it has in hands.

Incomplete typing can be dangerous if developers build trust on the type checker while it’s only performing partial checks due to incomplete information. Let’s consider an example:

At first glance it seems a well typed program and if we run the type checker it will pass. But if we run main(), we’ll have a runtime error. The problem is that expect_string is missing the return type, so in main(), the type checker cannot infer the type of untyped, so it doesn’t perform any validation on expects_int().

### Local inference

The previous example also highlights an important aspect of the mypy type checker: it only performs inferences at the function boundary. In theory it should infer that expects_string returns str because we’re returning its argument and then infer that untyped is a string.

This is fine in this example but it could be very expensive to make these inferences, especially if we need to consider branching and recursive calls to other functions. In theory the type checker could go only a few levels deep in the function call but this would make the behavior of the type checker very hard to reason about.

For that reason, the type check will only consider the type of the functions being called. For example, it knows expects_string() expects a string and returns no type, so this is what it will assign to untyped no matter what happens inside expects_string().

Now that we know the basics of the type checker, let’s cover some of the syntax and more advanced typing that mypy supports.

## Examples

### Setup

Before we start, it’s useful to be able to test the snippets. To do so, copy the code into a file, say example.py and run this command in the terminal:

mypy example.py

which will print any type errors that exist in example.py. mypy can be installed via Python packaging system, pip. Make sure to user Python 3:

pip3 install mypy

### Primitive types

bool, int, str, float are the types one will most likely use in functions. As seen above, we can use these to type arguments and return types:

### Composed types

We’ll look into generics later, but it should be straightforward to understand the typing of composed types like lists, dictionaries and tuples:

It’s worth noting that these types need to be explicitly imported from the typing module.

### None vs. NoReturn

None is used to indicate that no return value is to be expected from a function.

NoReturn is to indicate the function should not return via the normal flow:

### Local Variables

The example below demonstrates the syntax for typing local variables. In general typing local variables is not necessary since their type can often be inferred from the assignment / initialization.

### Optional and Union

Optional[T] is a generic type that indicates the type is either T or None. For example:

Optional is a special case of a more general concept of Union of types:

My personal take is that Union should be avoided (except for special cases like Optional) because it makes the code harder to deal with (having to handle multiple types) and it’s often better via inheritance (base type representing the union).

### Any vs. object

Any is equivalent to not providing the type annotation. On the other hand, object is the base of all types, so it would be more like a Union of all the types. A variable of type object can be assigned a value of any type, but the value of an object variable can only be assigned to other object variables. It’s possible to refine the type of a variable to a more specific one. See “Refining types”.

### Classes

There are three main things we need to consider when annotating variables in a class context:

• We can add types to the member variable (_n in this case).
• We don’t type self: it’s assumed to be the type of the class where it’s defined.
• The return type of __init__ is None

### Callables

Callables can be used to type higher order functions. Here’s an example where we pass a function (lambda) as argument to a map function:

The type of the function is Callable. The first element, [int] in the example above, is a list of the types of the arguments. The second argument is the return type.

As another example, if we want to define a reduce function on strings, our callback has now type Callable[[str, str], str] because it takes 2 arguments.

### Generics: Type Variables

Type variables allow us to add constraints to the types of one or more argument and/or return types so that they share the same type. For example, the function below is typed such that if we pass List[int] as argument, it will return Optional[int]:

Note that the string passed to the TypeVar() function must match the of the variable it is assigned to. This is an inelegant syntax but I’m imagining it’s the result of working around syntax limitations of Python (and the difficulties in changing the core Python syntax for annotations).

We can use multiple TypeVars in a function:

Constraints. According to [3] it’s also possible to limit the type var to be of a specific types:

TypeVar supports constraining parametric types to a fixed set of possible types (note: those types cannot be parametrized by type variables).

It also notes:

There should be at least two constraints, if any; specifying a single constraint is disallowed.

Which makes sense, if we were to restrict a TypeVar to a single type we might as well use that type directly.

In the example below we allow Tmix to be bound to either int or str. Note this is different from Union[int, str] because the latter is both int and str at the same time, while the former is either int or str, depending on how it’s called. The third call to fmix() below would be valid for a Union.

### Parametrized Classes

We’ve just seen how to parametrize functions via TypeVar. We can also extend such functionality to classes via the Generic base class:

### Ignoring type hints

During the transition from untyped to typed code, it might be necessary to temporarily turn off type checking in specific parts of the code. For example, imagine a scenario where types were added to a widely used function but many existing callers are passing incorrect types.

The comment # type: ignore makes the type checker skip the current line (if an inline comment) or the next line (if a line comment). Here’s an obvious type violation that is ignored by the type checker:

It’s also possible to turn off type-checking completely for the file:

A # type: ignore comment on a line by itself at the top of a file, before any docstrings, imports, or other executable code, silences all errors in the file

### Refining types: isinstance() and cast()

In some cases we might receive values from untyped code or ambiguous types like Unions or object. Two ways of informing the type checker about the specific type is by explicit check via isinstance() or cast(). The isinstance() will be usually in a if clause:

This allows the type checker to infer the type of x within the x block, so it won’t complain about the call of inc(x).

Another, more drastic, approach is to use cast([type], x) which returns x if it has in runtime but otherwise throws an exception, but this allows the type checker to refine the type of x to statically. Here’s an example:

It’s a bummer that the order of arguments of cast([type], var) and isisntance(var, [type]) are inconsistent.

### Arbitrary argument lists

It’s possible to type arbitrary argument lists, both the positional and named ones. In the example below, args has type Tuple[str, ...] and kwds has type Dict[str, int]. Note that the ... in Tuple[str, ...] indicates an arbitrary-length tuple of strings. It’s unclear to me why it’s a tuple and not a list, but I’d guess it has to do with how it represents non-arbitrary argument lists (e.g. foo(x: int, y: str) would be Tuple[int, str]).

## Conclusion

I really like Python and it’s my go-to language for small examples and automation scripts. However, having had to work with a large code base before, I was really displeased by its lack of types.

I’m also a big proponent of typed code, especially for large, multi-person code bases, but I understand the appeal of rapid development for prototypes and Python now offers both.

As we saw in some examples, the syntax is cumbersome at times but overall I find the mpyp type checker pretty expressive.

## References

[2] Dropbox Blog: Thank you, Guido
[3] PEP 484 — Type Hints

# Aho-Corasick

Alfred Aho is a Canadian computer scientist. He is famous for the book Principles of Compiler Design co-authored with Jeffrey Ullman. Aho also developed a number of utilities for UNIX which has widespread usage, including the AWK programming language (acronym for Aho, Weinberger and Kernigham) and variants of grep.

Unfortunately, I couldn’t find much information about Margareth J. Corasick, except that she worked at Bell Labs and together with Aho, developed the Aho-Corasick algorithm, which we’ll talk about in this post.

### Multiple-string search

We have studied before the problem of finding a string in a text, for which the Knuth-Morris-Prat algorithm is well-known.

Aho and Corasick studied the problem of finding a set of strings in a text. More precisely, given a body of text H and a set of search terms S, we want to find all occurrences of each term of S in H.

They came up with the Aho-Corasick algorithm which is able to solve this problem in linear time on the size of H and the size of the output (i.e. the total number of characters that need to be printed when a match is reported).

It consists in pre-computing a look-up structure from S and then scanning the text using it. Whenever it detects a match was found, it prints the corresponding matches. Let’s see first how to construct the structure.

### The look-up structure

The look-up structure is constructed in two phases, each to construct two sets of edges: goto and fail edges.

The goto edges are essentially the edges of a trie containing the entries in S. The trie can be seen as a state machine where each node represents a prefix of one of the entries in S and the transition function tells us to which prefix to go next given an input character. These edges are thus called goto arrows. We denote by g(s, a) the node we navigate to if we’re at node s and receive input a.

Given a text H = [a1,a2,…,a_n], we can start at the root and follow the edges labeled a1, a2 and so on. If we land on a node representing an entry in S we print it.

Eventually though, we may reach a node s such that the next character a_j doesn’t have a corresponding edge g(s, a_j). We know that s is a suffix of [a1,a2,…,a_{j-1}], say s = [a_k,…,a_{j-1}]. We also know there’s no other unreported entry in S that starts at k but there might be one that starts at a_{k+1}. We could restart the search at k+1 and at the root of the tree, but can we do better?

Because S is fixed, no matter what text H we have, by the time we encounter a dead end at a node r, we know that the last characters seen are  [a_k,…,a_{j-1}] = r. Now suppose s = [a_{k+1},…,a_{j-1}, a_j] happens to be in the trie. Then we can continue our search from s, without having to backtrack. If k+1 doesn’t work, we can try k+2, k+3, and so forth. More generally, say that we eventually found the smallest index k* > k such that s* = [a_k*,…,a_{j-1}, a_j] is in the trie. In other words s* is the longest proper suffix of s in the trie. When we fail to find s = r + a_j in the trie, we can resume at s*. This suggests we could have a fail edge from s to s* in our trie to quickly resume the search without backtrack.

Technically, s doesn’t exist in the trie, so we can’t add an edge there. However, we can still add the failure edge in r, and denote it as f(r, a_j) = s* for all labels a_j not in a goto edge.

Given the goto and fail edges, we have a proper state machine which we can use to find matches in H. Let’s combine them into a single set of edges e(s, a) (that is e(s, a) = g(s, a) if it exists, otherwise f(s, a)).

We are still missing printing out the matches. Because of the shortcuts we take via the fail edges, we never explicitly visit a suffix of a state s. For this reason, when visiting a state s, we need to print all its suffixes that belong to S.

We’ll now see how to construct the fail edges and the output.

Building the fail edges

We’ll show how to construct the failure edges efficiently. We define auxiliary edges ps(s) which represents the longest proper suffix of s that is in the trie. Note that f(r, a) is ps(r + a) for when g(r, a) doesn’t exist.

The idea is that if we know ps(r) for all nodes s of depth up to l in the trie, we can compute ps(r + a), that is for the nodes in the next level.

Consider each node r of depth l and each symbol a. Let s = r + a. We know ps(r) is in the trie, but is ps(r) + a in the trie? Not necessarily, but maybe a suffix of ps(r) is? We can try ps(ps(r)) and further suffixes until we find one that is in the trie, or we reach the empty string.

How can we make sure to only process node s only after all the edges from lower levels have been processed? We can process the nodes in bread-first order, which can be achieved using a queue. The pseudo-code could be:

Note how instead of searching for the largest suffix by removing one character at a time and checking if it exists in the trie, we’re “jumping” to suffixes via repeated application of ps(), which is what makes the algorithm efficient as we shall see. You might be asking if we’re not missing any valid suffix when doing that, but no worries, we analyze the correctness of this in Lemma 1 (Appendix).

Building the output

If we compute output(s) in a bread-first order, we can assume we know output(r) for nodes r lower level than s. Assuming we already computed ps(s), we have that output(s) = output(ps(s)) + {s} if s is in S:

How do we know by jumping through suffixes via ps(s) we are not missing any matches? We demonstrate that in Lemma 2 (Appendix).

### Implementation in Python

I’ve implemented the pseudo-code proposed here in Python, and made it available on Github. The main implementation aspect is how we compute the output list. To avoid storing the matches multiple times, we model it as a linked list. More precisely, when computing output(s) from ps(s), if ps(s) is itself a keyword, we point output(s) to ps(s). Otherwise we point it to output(ps(s)).

When printing output(s) we just need to traverse the list,:

### Conclusion

It was a lot of work to study Aho Corasick’s paper [1], understand the proof and re-write with my own words, but this provided me much better intuition behind the algorithm.

Like the KMP, I’ve known this algorithm before but never took the time to fully understand it.

My initial implementation was in Rust, but I got stuck when modeling the nodes because of self-reference. This led me to this Stackoverflow answer which I haven’t time to study, but seems like a good idea for a post.

### Appendix

#### Correctness

Lemma 1. Let s be a prefix in the look-up structure. Then ps(s) is the longest proper suffix of s that exists in the look-up structure T.

Let’s prove this by induction on the size of s. The base is for nodes of level 1, where ps(s) = root (empty string), which is true, since the proper suffix of a single letter is empty.

Assuming that ps(s’) is the longest proper suffix of s that exists in the look-up structure for |s’| < |s| , let’s prove this is true for ps(s) as well. Let’s show first that ps(s) is a proper suffix of s in T. We know that ps(r) is a suffix of r, and so is ps(ps(r)) since a suffix of a suffix of r is also a suffix of r, and so on. Let’s define ps_k(r) = ps(ps_{k-1}(r)) and ps_n(r) such that g(ps_n(r), a) exists. By construction we assigned ps(s) = g(ps_n(r), a). Since ps_n(r) is a suffix of r, and r + a = s, we know what ps_n(r) + a is suffix of s. We’ll show there’s no other proper suffix or s in loop-up structure larger than ps(s). Suppose there is a s* in T such that |s*| > |ps(s)|. Let r* be = s* + a. Note that r* is a suffix of r and it exists in T (since s* is in there and T it’s a prefix tree). It cannot be larger than ps(r) because of our induction hypothesis. r* has to be larger than ps_n(r) since |ps(s)| = |ps_n(r)| + 1, |s*| > |ps(s)| and |s*| = |r*| + 1.

The first observation is that r* != ps_k(r) for any k. Otherwise the algorithm would have set ps(s) = g(r*, a) = s*. So there must be some k for which |ps_k(r)| > |r*| > |ps_{k+1}(r)|, this means that for r’ = ps_k(r), there’s a suffix r* suc that |r*| > |ps(r’)|. But this is a contradiction of the induction hypothesis saying that ps(r’) is the largest suffix of r’ in T.

Lemma 2. output(s) contains y if and only if y is in S and a suffix of s.

Let’s prove one direction: if output(s) contains y then y is in S and a suffix of s. This is easy to see by construction, output(s) contains s if it’s in S, and also the outputs of ps(s), which is a suffix of s, and by induction are all in S and suffix of ps(s), and hence of s.

Now suppose y is a keyword and suffix of s. Let’s show that there’s a k for which ps_k(s) = y. If this was not the case, then by a similar argument we did in Lemma 1, there is ps_{k-1}(s) such that y is suffix of it, and |y| > |ps_{k}(s)|, which violates Lemma 1, since  ps(ps_{k-1}(s)) = ps_{k}(s) is not the longest possible. Since ps_k(s) = y, by construction, we know that output(y) will eventually be merged into output(s).

Lemma 3. At the end of the loop in the Algorithm 1, state s is in the look-up structure T and it is a suffix of [a1,a2,…,aj].

This is true for a1: it would be either in state [a1] in T, or the empty state, both being suffix of [a1]. Assuming this hypothesis holds for j-1, we were in a state r before consuming aj, and r is the longest suffix of  [a1,a2,…,aj-1].

By construction we know that the end state s is still in T, since we only followed valid edges of T (gotos and fails). We need to show s is the longest suffix of [a1,a2,…,aj]. For the sake of contradiction, assume there’s s* such that |s*| > |s| in T that is a suffix of [a1,a2,…,aj]. Let r* be such that s* = r* + aj. Note that r* is in T and is a suffix of [a1,a2,…,aj-1] and by the inductive hypothesis |r*| < |r|. If we used a goto edge, then |r*| > |r| which contradicts the inductive hypothesis. If we used a fail edge, then s = f(r, aj). Using the same argument from Lemma 1, we have that r* is a suffix of r, and that r* is not ps_k(r) for any k, so it must be that for some k, x = ps_k(r), |x| > |r*| > |ps(x)|and is a contradiction because r* is a proper suffix of x but longer than ps(x).

Theorem 1. The Aho Corasick algorithm is correct.

We need to show that every substring of text which is in S, will be reported. Any substring we were to report end at an index j of H, so it suffices to show we report every suffix of [a1,a2,…,aj] in S, for all j.

Let s be the node we are at the end of each loop in Algorithm 1. Suppose there’s a suffix s* of [a1,a2,…,aj] in S that is not in output(s). From Lemma 2, we know that every keyword in S that is a suffix of s is in output(s), it follows that s* is not a suffix of s, so |s*| > |s|, which cannot be from Lemma 3.

QED

#### Complexity

Theorem 2. The look-up stricture can be constructed in linear time of the number of characters in S.

Let’s define size(S) = the total number of characters in S and A the set of symbols of the alphabet.

It’s known to that we can construct a trie from a set of string in S, so the goto edges are covered.

For constructing the ps() function, the cost will be associated on how many times we execute the inner loop. We visit each node once and for each of the |A| symbols, we perform the inner loop. Since|f(s)| < |s|, the number of times we execute the inner-most loop for a given state is bounded by |s|, so for each state we execute the inner-most loop |s|*|A| times. If we sum over all states, we have size(S)*|A|. Assuming |A| is constant and small, we can ignore it in the overall complexity. It’s worth noting Aho and Corasick’s original paper don’t depend on |A|.

The cost of constructing the output() function is O(1) if we were to use a shared linked list. In that case output(s) can be constructing with a cell s and have it point to the list of output(f(s)), so again, this would be bounded by size(S).

For the search algorithm, let’s ignore the cost of printing the output for now. We can follow an edge in O(1) time and do so exactly once per character.

In the worst case there could be a lot matches at each position. For example, if S = {a, aa, aaa, …, a^(n/2)} and H = a^n.  For n/2 positions we’ll need to output all entries in S, leading to a complexity of size(S)*n.

Since these values need to be printed, there’s no way to be faster than that and this could be the dominating factor in the total complexity. In practice however the number of matches is much smaller.

### References

[1] Efficient String Matching: An Aid to Bibliographic Search
[2] Alfred Aho – Wikipedia

# Protein Design

We previously learned about the problem of predicting the folding of a protein, that is, given a chain of amino acids, find its final 3D structure. This time we’re interested in the reverse problem, that is, given a 3D structure, find some chain of amino-acid that would lead to that structure once fully folded. This problem is called Protein Design.

In this post we’ll focus on mathematical models for this problem, studying its computational complexity and discuss possible solutions.

### Mathematical Modeling

The 3D structure of a protein can be divided into two parts: the backbone and the side chains. In the model proposed by Piece and Winfree [3], we assume the backbone is rigid and that we’ll try to find the amino-acids for the side chains such that it minimizes some energy function.

This means we’re not really trying to predict the whole chain of amino acids, but a subset of those amino that will end up on the side chains.

The amino acids on the side-chain can have a specific orientation [2, 3], known as rotamer, which in turn can be represented by a single value, its dihedral angle. We can define some arbitrary order for these amino acids and label them with an index, which we call position.

At each position there are multiple rotamers possible and we need to select them to minimize the overall energy function. More formally, for each position i, let $R_i$ be the set of rotamers available and $r_i \in R_i$ the chosen rotamer.

The model assumes the cost function of the structure is pairwise decomposable, meaning that we can account for the interaction of each pair independently when calculating the total energy:

Where $E(r_i, r_j)$ is the energy cost of the interaction between positions i and j, assuming rotamers $r_i$ and $r_j$ respectivelly. The definition of E can be based on molecular dynamics such as AMBER.

In [3], the authors call this optimization problem PRODES (PROtein DESign).

### PRODES is NP-Hard

Pierce and Winfree [3] prove that PRODES is NP-Hard. We’ll provide an informal idea of the proof here.

First we need to prove that the decision version of PRODES is NP-Hard. The decision version of PRODES is, given a value K, determine whether there is a set of rotamers such that the energy cost is less or equal K. We’ll call it PRODESd. We can then prove that PRODESd is NP-complete by showing that it belongs to the NP complexity class and by reducing, in polynomial time, some known NP-complete problem to it.

We claim that this problem is in NP because given an instance for the problem we can verify in polynomial time whether it is a solution (i.e. returns true), since we just need to evaluate the cost function and verify whether it’s less than K.

We can reduce the 3-SAT problem, known to be NP-complete, to PRODESd. The idea is that we can map every instance of the 3-SAT to an instance of PRODESd and show that the result is “true” for that instance of 3-SAT if and only if the result is “true” for the mapped instance of PRODESd.

Let’s start by formalizing PRODESd as a graph problem, with an example shown in the picture below:

a) has 3 positions with their sets of rotamers. b) each rotamer becomes a vertex grouped into sets. We pick exactly one vertex per set and try to minimize the total cost of the edges associated with the selected vertices. Image source: [3]

Now, given a 3-SAT instance, we create a vertex set for each clause $C_i$ (containing a vertex for each literal), and a vertex set for each variable $x_i$ (containing vertices T and F). For each literal $x_i$ we add an edge of weight 1 to the vertex F of the set corresponding to variable $x_i$. Conversely, for each negated literal, $\bar x_i$, we add an edge of weight 1 to the vertex T. All other edges have weight 0.

For example, the instance $(x_1 + \bar x_2 + x_3) \cdot ( \bar x_1 + x_2 + x_3) \cdot (\bar x_3)$ yields the following graph where only edges of non-zero weight are shown:

Source: [3]

We claim that this 3-SAT instance is satisfiable if and only if the PRODESd is true for K=0. The idea is the following: for each vertex set corresponding to a variable we pick either T or F, which corresponds to assigning true or false to the variable. For each vertex set corresponding to a clause we pick a literal that will evaluate to true (and hence make the clause true). It’s possible to show that if 3-SAT is satisfiable, there’s a solution for PRODESd that avoids any edge with non-zero weight.

### Integer Linear Programming

Now that we know that PRODES is unlikely to have an efficient algorithm to solve it, we can attempt to obtain exact solutions using integer linear programming model. Let’s start with some definitions:

We can define our variables as:

The object function of our model becomes:

Finally, the constraints are:

Equation (1) says we should pick exactly one rotamer for each position. Constraints (2) and (3) enforce that $x_{i, j} = 1$ if and only if $r_i = r_j = 1$.

Note: the LaTeX used to generate the images above are available here.

### Conclusion

The study of protein prediction led me to protein design, which is a much simpler problem, even though from the computational complexity perspective it’s still an intractable problem.

The model we studied is very simple and makes a lot of assumptions, but it’s interesting as a theoretical computer science problem. Still I’m not sure how useful it is in practice.

### References

[1] Wikipedia – Protein design
[2] Wikipedia – Rotamer
[3] Protein Design is NP-hard – Niles A. Pierce, Erik Winfree

# Protein Folding Prediction

In this post we’ll study an important problem in bioinformatics, which consists in predicting protein folding using computational methods.

We’ll cover basic concepts like proteins, protein folding, why it is important and then discuss a few methods used to predict them.

## Amino Acids and Proteins

Amino acids are any molecules containing an Amino (-NH2) + Carboxyl (COOH) + and a side chain (denoted as R), which is the part that varies between different amino acids [1].

Etymology. A molecule containing a carboxyl (COOH) is known as carboxylid acid, so together with the Amino part, they give name to amino acids [2].

An amino acid where R attaches to the first carbon (see Figure 1) is known as α-amino acids (read as alpha amino acids).

Figure 1: alpha and beta carbons in a Carbonyl group (link)

There are 500 different amino acids known, but only 22 of them are known to be present in proteins and all of them are α-amino acids. Since we’re talking about proteins, we’ll implicitly assume amino acids are the alpha amino ones. See Figure 2.

Figure 2: Amino acid

Amino acids can be chained, forming structures known as pepitides, which in turn can be chained forming structures like polipeptides. Some polipeptides are called proteins, usually when they’re large and have some specific function, but the distinction is not precise.

Etymology.Peptide comes from the Greek to digest. According to [3], scientists Starling and Bayliss discovered a substance called secretin, which is produced in the stomach and signals the pancreas to secrete pancreatic juice, important in the digestion process. Secretin was the first known peptide and inspired its name.

## Protein Folding

To be functional, a protein needs to fold into a 3-D structure. The sequence of amino acids in that protein is believed [6] to be all that is needed to determine the final shape of a protein and its function. This final shape is also known as native fold.

There are 4 levels of folding, conveniently named primary, secondary, tertiary and quaternary.

• Primary structure is the non-folded form, the “linearized” chain of amino acids.
• Secondary structure is formed when parts of the chain bond together via hydrogen bonds, and can be categorized into alpha-helix and beta-strands.
• Tertiary structure is when the alpha-helix and beta-sheet fold onto globular structures.
• Quaternary structure is when 2 or more peptide chains that have already been folded combine with each other.

## Prediction

The prediction of protein folds usually involves determining the tertiary (most common form) structure from its primary form. We’ve seen before in DNA Sequencing  that reading amino acid sequences is easier when its denatured into a linear chain. On the other hand, analyzing the intricate folded forms can be very hard [7], even with techniques such as X-ray crystalography.

## Why is this useful?

If the protein structure is known, we can infer its function on the basis of structural similarity with other proteins we know more about. One can also predict which molecules or drugs can efficiently bind and how they will bind to protein [7].

From Wikipedia’s Nutritious Rice for the World [8], protein prediction can also help understand the function of genes:

… prediction tools will determine the function of each protein and the role of the gene that encodes it.

## Computational Methods

In this section we’ll focus on the computation methods used to aid the prediction of protein folds. There are 2 main types, template based and de novo.

Template based methods rely on proteins with known folds which can be used as template for determining the fold of a target protein with unknown fold. This only works well if the target protein is similar enough to the “template” one, or has some shared ancestry with it, in which case they are said to be homologous.

De novo methods do not rely on existing protein for templates. It uses known physico-chemical properties to determine the native structure of a protein. We can model the folding as an optimization problem where the cost function is some measure of structure stability (e.g. energy level) and steps are taken to minimize such function given the physico-chemical constraints. Molecules that have the same bonds but different structures are known as conformations, which form the search space of our optimization problem.

Let’s now cover some implementations of de novo methods.

Rosetta

One implementation of de novo method is the Rosetta method [9]. It starts off with the primary form of the protein. For each segment of length 9 it finds a set of known “folds” for that segment. A fold in this case is represented by torsion angles on the peptide bonds. The search consists in randomly choosing one of the segments and applying the torsion of its corresponding known folds to the current solution and repeat.

Note that while this method does not rely on protein templates, it still assumes an existing database of folds for amino acid fragments. However, by working with smaller chains, the chances of finding a matching template are much higher.

AlphaFold

One major problem with de novo methods is that they’re computationally expensive, since it has to analyze many solutions to optimize them, which is a similar problem chess engines have to.

Google’s DeepMind has made the news with breakthrough performance with its Go engine, AlphaGo. They used a similar approach to create a solver for protein folding prediction, AlphaFold. This approach placed first in a competition called CASP.

CASP is a competition held annually to measure the progress of predictors. The idea is to ask competitors to predict the shape of a protein whose native fold is known but not public.

Crowdsourcing

Another approach to the computational costs involved in the prediction is to distribute the work across several machines. One such example is Folding@home, which relies on volunteers lending their idle CPU time to perform chunks of work in a distributed fashion which is then sent back to the server.

An alternative to offering idle CPU power from volunteers’ machines is to lend CPU power from their own brains. Humans are good at solving puzzles and softwares like Foldit rely on this premise to find suitable folds. According to Wikipedia:

A 2010 paper in the science journal Nature credited Foldit’s 57,000 players with providing useful results that matched or outperformed algorithmically computed solutions

## Conclusion

Here are some questions that I had before writing this post:

• What is protein folding? Why predicting it important?
• What are some of the existing computational methods to help solving this problem?

Questions that are still unanswered are details on the algorithms like the Rosetta method. I also learned about Protein Design in the process, which I’m looking into studying next. The approach used by DeepMind seem very interesting, but as of this writing, they haven’t published a paper with more details.

As a final observation, while researching for this post, I ran into this Stack Exchange discussion, in which one of the comments summarizes my impressions:

This field is pretty obscure and difficult to get around in. There aren’t any easy introductions that I know of.

I found that the information available is often vague and sparse. I’d love to find a good textbook reference to learn more details. One possible explanation is that this field is relatively new, and there’s a lot of proprietary information due to this being a lucrative field.

## References

[1] Wikipedia – Amino acid
[2] Wikipedia – Carboxylic acid
[3] The Research History of Peptide
[4] Wikipedia – Homology modeling
[5] Wikipedia – Threading (protein sequence)
[6] Wikipedia – De novo protein structure prediction
[7] Quora: Why is protein structure prediction important?
[8] Wikipedia – Nutritious Rice for the World
[9] The Rosetta method for Protein Structure Prediction
[10] AlphaFold: Using AI for scientific discovery