# Python Coroutines

In this post we’ll cover Python coroutines. First we’ll study David Beazley’s great tutorial [1] to learn about generators and coroutines, and how to build a multitask system from the ground-up.

Goddess Durga – source: Flickr

We’ll then take a look at a Python module called asyncio which makes use of coroutines to enable asynchronous I/O.

## Generators

Python generators are functions what allow return the execution back to the caller such that when called again it will resume its execution from the same place. It’s the same concept of JavaScript generators, which we talked about before.

The syntax for yielding the execution back is via the yield keyword. Here’s an example of a custom implementation of range():

Because the execution can be yielded at any time in the code, we can use infinite loops:

Another way to use generators is to “pipe” (in the Linux way), one generator into another, like in the example where we read lines from a dictionary file and find those ending in “ion”:

This also highlights one advantage of using generators abstraction: the lower memory footprint since it can process one line at a time. Sure, we could achieve the same with regular functions, but they would need to be combined in a single block:

Which is efficient but doesn’t allow modularity.

Combined with the fact that generators can be infinite loops, one can model functionality like tail and grep as generators (e.g. tail -f | grep foo) which is the exact example Beazley provides in his presentation [1].

## Coroutines

In PEP 342 it explains the motivation behind coroutines:

Coroutines are a natural way of expressing many algorithms, such as simulations, games, asynchronous I/O, and other forms of event-driven programming or co-operative multitasking. Python’s generator functions are almost coroutines — but not quite — in that they allow pausing execution to produce a value, but do not provide for values or exceptions to be passed in when execution resumes. They also do not allow execution to be paused within the try portion of try/finally blocks, and therefore make it difficult for an aborted coroutine to clean up after itself.

There are four main changes:

• yield is an expression instead of statement, which means you can assign the return type of yield to a variable.
• send() method which can be used to send value to the yield expression. send() is a general version of next(), which is equivalent to .send(None).
• throw() method which raises an exception at the place where the last yield was called.
• close() method which raises a GeneratorExit at the place where the last yield was called. Effectively used to force terminating the generator.

Here’s a simple coroutine that just prints what it receives:

Notice that we have to call an empty send() to “initialize” the coroutine. According to PEP 342:

Because generator-iterators begin execution at the top of the generator’s function body, there is no yield expression to receive a value when the generator has just been created. Therefore, calling send() with a non-None argument is prohibited when the generator iterator has just started (…). Thus, before you can communicate with a coroutine you must first call next() or send(None) to advance its execution to the first yield expression.

In the example below we highlight how this flow can be confusing. The first send() receives the value from yield, but it is only able to send a value through a subsequent send().

Since in a coroutine most of the times the first send() will be to initialize the coroutine, Beazley provides a decorator to abstract that:

### Piping

We can chain multiple coroutines the same we did with generators but now the order is somewhat reversed. To see why, consider our generator pipe example rewritten with coroutines. In here we pass print_step() to filter_step() and filter_step() to get_words(). Contrast this with the generator version where we start with get_words, pass its results to filter_step() and then print_step().

### Event Handling

Beazley provides an example using an XML parser to manipulate XML and convert to JSON. The API of xml.sax relies on the visitor pattern, where at important steps of the AST traversal (e.g. startElement, character, endElement) it calls specific methods of a content handler.

The example uses an adapter layer (cosax.py), to allow using coroutines instead the content handler. It also combines multiple smaller coroutines to produce a more complex ones, demonstrating the composability of coroutines.

After some preliminary overview of Operating Systems Beazley builds a prototype of a multitasking system by modeling coroutines as tasks (no threads or processes).

• Task – wrapper around coroutine
• Scheduler – can create new tasks and schedule tasks – keeps a loop as long as there are tasks. Existing tasks can cause tasks to be scheduled. JavaScript event loop.

Yielding inside a task is like an OS interruption (trap).

#### System Calls

One of the interesting implementation of the multitasking system is system calls. The task/coroutine issues a system call by providing it as a value to the yield (e.g. yield GetTid()). This is received by the scheduler which can provide the necessary context to the specific system call implementation.

The tutorial then covers one of the most important parts of the multitasking system which is the ability to do asynchronous I/O operations (e.g. reading and writing to a file).

The idea is to introduce a “system call” corresponding to performing I/O. When a task invokes that, it doesn’t get rescheduled but put in a staging area (self.read_waiting in the code below). The scheduler has then a continuously run task that polls the OS to check if any of the file descriptors corresponding to the tasks in the staging area are ready, in which case the task is rescheduled.

Beazley uses that system to implement a simple server that receives data via a socket:

#### Trampolining

One of the limitations of the scheduler that is developed up to this point is that the yield has to happen at the top-level of coroutine, that is, it cannot invoke other functions that yield, which limits refactoring and modularization.

To overcome this limitation, the author proposes a technique called trampolining. Here’s an example where the execution of add() is done by the top level code, not by main():

We can do a similar thing in the multitasking system. Because the coroutines can recursively call other coroutines, we need to keep a callstack. We can do it inside Task by keeping an explicit stack:

In the example above, when a coroutine that is wrapped in Task makes another coroutine call, for example:

client, addr = yield Accept(sock)

Within Task, result will be a generator, so the current coroutine will be put on the stack and the execution will pass to that generator instead:

self.target = result

When that sub-coroutine yields the execution, the original caller is resumed

self.target = self.stack.pop()

Note that this all happen transparently to the scheduler itself.

This concludes the study of the multitask system from the ground up.

## Modern Asynchronous I/O

The tutorial we just studied is from 2009 and while still relevant, Python went through many changes since then.

We’ll now cover some of the development since the generator-based coroutines were introduced and that culminated in a standard module for handling asynchronous I/O.

### Delegating to subgenerators: yield from

PEP 380 introduced a new syntax for delegating the execution to another generator until that generator is finished. Simeon [5] provides an example where using yield from simplifies the code. In the code below generator() calls generator2() and generator3() in a loop to “exhaust” their execution.

The same can be accomplished with yield for, which does that implicitly:

### Async / Await

PEP 492 Introduces the async and await syntax. The motivation contains these two reasons among others:

It is easy to confuse coroutines with regular generators, since they share the same syntax; this is especially true for new developers.

Whether or not a function is a coroutine is determined by a presence of yield or yield from statements in its body, which can lead to unobvious errors when such statements appear in or disappear from function body during refactoring

To address these issues, a function containing the async is considered a native coroutine (as opposed to a generator-based coroutine):

A native coroutine can await generator-based coroutines in which case it has the same  behavior as the wait for. Borrowing from the example above:

### The asyncio module

PEP 3156 Describes a system to support asynchronous I/O, which is now known as the asyncio module. It proposes a coroutine-based multitasking system similar to the one Beazley describes. It has support to common I/O operations like those involving files, sockets, subprocesses, etc.

We won’t dive into much detail in this post, but we can see parallels to the scheduler we  studied previously. Here’s an example where asyncio.run() schedules the coroutine main() which in turn executes the sub-coroutines sleep() one after another:

This is not taking advantage of the non-blocking capabilities of asyncio. Why? Recall that await is equivalent to yield from and that causes the current coroutine to wait the call to finish until it continues. If we run this code we get:

will sleep: 1 slept: 1 will sleep: 2 slept: 2

What we want is to schedule both sub-coroutines at the same time, and asyncio allows that via the gather() method:

If we run this code we get:

will sleep 1 will sleep 2 slept 1 slept 2

Which means that the first sleep executed but yielded the execution back to the scheduler after the sleep() call. The second sleep() got run, printing “will sleep 2”.

## Conclusion

I’ve used async/await and event loops in other languages such as Hack and JavaScript, but only recently ran into it in Python. I kept seeing mentions of coroutines and that led me to study it in more details. Overall I felt like I learned a lot.

David Beazley’s tutorial is really good. They’re thorough and provide lots of analogies with operating systems concepts.

I also liked the presentation medium: the slides are self-contained (all the information is present as text) and he shows the whole code at first then repeats the code multiple times in subsequent slides, highlighting the important pieces in each of them. This is difficult to achieve in flat text like a blog post.

## References

[1] David Beazley – A Curious Course on Coroutines and Concurrency
[2] PEP 342 – Coroutines via Enhanced Generators
[3] PEP 380 — Syntax for Delegating to a Subgenerator
[4] PEP 492 — Coroutines with async and await syntax
[5] Simeon Visser – Python 3: Using “yield from” in Generators – Part 1
[6] PEP 3156 — Asynchronous IO Support Rebooted: the “asyncio” Module

# Python Type Hints

Jukka Lehtosalo is a Finnish Software Engineer at Dropbox. He developed an optional type system for Python, mypy, during his PhD thesis in Cambridge. After meeting in a Python conference, Guido van Rossum (creator of Python) invited Lehtosalo to join him at Dropbox. They started adopting mypy in real use cases during a Hackathon, which led to mypy being one of the most popular Python type checkers [2].

In this post we’ll cover mypy in general terms as well many examples demonstrating the syntax and capabilities of this type checker.

## Introduction

mypy is a static type checker. This means it does not run during the code execution, so it’s mostly useful during development, much like tests. It relies on manual annotations in the code called type hints, which identify the types of arguments, return types, internal variables and member variables.

One simple example of such annotations is for a function to compute the length of a string:

We are indicating that the input argument s is a string and it return an integer corresponding to its length. The main job of a type checker is to find inconsistencies in the annotations in our code (or in the libraries we use). For example, here’s an incorrect type annotation:

When both operands are integers, the operator + returns another integer, so n + 1 is an integer, which is a clear contradiction given we provided the return type as string.

### Type annotations are free unit tests

This contrived example doesn’t make the usefulness of a type checker obvious, but imagine a case where we implicitly make assumption on the argument type. For example, a small program to compute how much an exchange house would give a user for their cash.

If we don’t perform input validation correctly, the code above might work for the tested scenarios but would fail in production the moment the user provided a string. Of course good QA will cover scenarios like these, but this is to sort of guarantees one gets for free by making the types explicit:

The previous code would now fail the type checker validation.

### Type hints are optional and incremental

Most of the design of type annotations are well documented in PEP 484, the document claims:

Python will remain a dynamically typed language, and the authors have no desire to ever make type hints mandatory, even by convention.

This also seems to imply that Python type hints will always be partial / gradual, since if full typing is required, it will make transition from non-typed to fully typed codebases prohibitive. Also, there are concrete benefits even with partial typing.

A partial type system makes it optional to add type annotations to variables, instead of it being fully mandatory (like Java or C++). The type checker then performs validation with whatever information it has in hands.

Incomplete typing can be dangerous if developers build trust on the type checker while it’s only performing partial checks due to incomplete information. Let’s consider an example:

At first glance it seems a well typed program and if we run the type checker it will pass. But if we run main(), we’ll have a runtime error. The problem is that expect_string is missing the return type, so in main(), the type checker cannot infer the type of untyped, so it doesn’t perform any validation on expects_int().

### Local inference

The previous example also highlights an important aspect of the mypy type checker: it only performs inferences at the function boundary. In theory it should infer that expects_string returns str because we’re returning its argument and then infer that untyped is a string.

This is fine in this example but it could be very expensive to make these inferences, especially if we need to consider branching and recursive calls to other functions. In theory the type checker could go only a few levels deep in the function call but this would make the behavior of the type checker very hard to reason about.

For that reason, the type check will only consider the type of the functions being called. For example, it knows expects_string() expects a string and returns no type, so this is what it will assign to untyped no matter what happens inside expects_string().

Now that we know the basics of the type checker, let’s cover some of the syntax and more advanced typing that mypy supports.

## Examples

### Setup

Before we start, it’s useful to be able to test the snippets. To do so, copy the code into a file, say example.py and run this command in the terminal:

mypy example.py

which will print any type errors that exist in example.py. mypy can be installed via Python packaging system, pip. Make sure to user Python 3:

pip3 install mypy

### Primitive types

bool, int, str, float are the types one will most likely use in functions. As seen above, we can use these to type arguments and return types:

### Composed types

We’ll look into generics later, but it should be straightforward to understand the typing of composed types like lists, dictionaries and tuples:

It’s worth noting that these types need to be explicitly imported from the typing module.

### None vs. NoReturn

None is used to indicate that no return value is to be expected from a function.

NoReturn is to indicate the function should not return via the normal flow:

### Local Variables

The example below demonstrates the syntax for typing local variables. In general typing local variables is not necessary since their type can often be inferred from the assignment / initialization.

### Optional and Union

Optional[T] is a generic type that indicates the type is either T or None. For example:

Optional is a special case of a more general concept of Union of types:

My personal take is that Union should be avoided (except for special cases like Optional) because it makes the code harder to deal with (having to handle multiple types) and it’s often better via inheritance (base type representing the union).

### Any vs. object

Any is equivalent to not providing the type annotation. On the other hand, object is the base of all types, so it would be more like a Union of all the types. A variable of type object can be assigned a value of any type, but the value of an object variable can only be assigned to other object variables. It’s possible to refine the type of a variable to a more specific one. See “Refining types”.

### Classes

There are three main things we need to consider when annotating variables in a class context:

• We can add types to the member variable (_n in this case).
• We don’t type self: it’s assumed to be the type of the class where it’s defined.
• The return type of __init__ is None

### Callables

Callables can be used to type higher order functions. Here’s an example where we pass a function (lambda) as argument to a map function:

The type of the function is Callable. The first element, [int] in the example above, is a list of the types of the arguments. The second argument is the return type.

As another example, if we want to define a reduce function on strings, our callback has now type Callable[[str, str], str] because it takes 2 arguments.

### Generics: Type Variables

Type variables allow us to add constraints to the types of one or more argument and/or return types so that they share the same type. For example, the function below is typed such that if we pass List[int] as argument, it will return Optional[int]:

Note that the string passed to the TypeVar() function must match the of the variable it is assigned to. This is an inelegant syntax but I’m imagining it’s the result of working around syntax limitations of Python (and the difficulties in changing the core Python syntax for annotations).

We can use multiple TypeVars in a function:

Constraints. According to [3] it’s also possible to limit the type var to be of a specific types:

TypeVar supports constraining parametric types to a fixed set of possible types (note: those types cannot be parametrized by type variables).

It also notes:

There should be at least two constraints, if any; specifying a single constraint is disallowed.

Which makes sense, if we were to restrict a TypeVar to a single type we might as well use that type directly.

In the example below we allow Tmix to be bound to either int or str. Note this is different from Union[int, str] because the latter is both int and str at the same time, while the former is either int or str, depending on how it’s called. The third call to fmix() below would be valid for a Union.

### Parametrized Classes

We’ve just seen how to parametrize functions via TypeVar. We can also extend such functionality to classes via the Generic base class:

### Ignoring type hints

During the transition from untyped to typed code, it might be necessary to temporarily turn off type checking in specific parts of the code. For example, imagine a scenario where types were added to a widely used function but many existing callers are passing incorrect types.

The comment # type: ignore makes the type checker skip the current line (if an inline comment) or the next line (if a line comment). Here’s an obvious type violation that is ignored by the type checker:

It’s also possible to turn off type-checking completely for the file:

A # type: ignore comment on a line by itself at the top of a file, before any docstrings, imports, or other executable code, silences all errors in the file

### Refining types: isinstance() and cast()

In some cases we might receive values from untyped code or ambiguous types like Unions or object. Two ways of informing the type checker about the specific type is by explicit check via isinstance() or cast(). The isinstance() will be usually in a if clause:

This allows the type checker to infer the type of x within the x block, so it won’t complain about the call of inc(x).

Another, more drastic, approach is to use cast([type], x) which returns x if it has in runtime but otherwise throws an exception, but this allows the type checker to refine the type of x to statically. Here’s an example:

It’s a bummer that the order of arguments of cast([type], var) and isisntance(var, [type]) are inconsistent.

### Arbitrary argument lists

It’s possible to type arbitrary argument lists, both the positional and named ones. In the example below, args has type Tuple[str, ...] and kwds has type Dict[str, int]. Note that the ... in Tuple[str, ...] indicates an arbitrary-length tuple of strings. It’s unclear to me why it’s a tuple and not a list, but I’d guess it has to do with how it represents non-arbitrary argument lists (e.g. foo(x: int, y: str) would be Tuple[int, str]).

## Conclusion

I really like Python and it’s my go-to language for small examples and automation scripts. However, having had to work with a large code base before, I was really displeased by its lack of types.

I’m also a big proponent of typed code, especially for large, multi-person code bases, but I understand the appeal of rapid development for prototypes and Python now offers both.

As we saw in some examples, the syntax is cumbersome at times but overall I find the mpyp type checker pretty expressive.

## References

[2] Dropbox Blog: Thank you, Guido
[3] PEP 484 — Type Hints

# Aho-Corasick

Alfred Aho is a Canadian computer scientist. He is famous for the book Principles of Compiler Design co-authored with Jeffrey Ullman. Aho also developed a number of utilities for UNIX which has widespread usage, including the AWK programming language (acronym for Aho, Weinberger and Kernigham) and variants of grep.

Unfortunately, I couldn’t find much information about Margareth J. Corasick, except that she worked at Bell Labs and together with Aho, developed the Aho-Corasick algorithm, which we’ll talk about in this post.

### Multiple-string search

We have studied before the problem of finding a string in a text, for which the Knuth-Morris-Prat algorithm is well-known.

Aho and Corasick studied the problem of finding a set of strings in a text. More precisely, given a body of text H and a set of search terms S, we want to find all occurrences of each term of S in H.

They came up with the Aho-Corasick algorithm which is able to solve this problem in linear time on the size of H and the size of the output (i.e. the total number of characters that need to be printed when a match is reported).

It consists in pre-computing a look-up structure from S and then scanning the text using it. Whenever it detects a match was found, it prints the corresponding matches. Let’s see first how to construct the structure.

### The look-up structure

The look-up structure is constructed in two phases, each to construct two sets of edges: goto and fail edges.

The goto edges are essentially the edges of a trie containing the entries in S. The trie can be seen as a state machine where each node represents a prefix of one of the entries in S and the transition function tells us to which prefix to go next given an input character. These edges are thus called goto arrows. We denote by g(s, a) the node we navigate to if we’re at node s and receive input a.

Given a text H = [a1,a2,…,a_n], we can start at the root and follow the edges labeled a1, a2 and so on. If we land on a node representing an entry in S we print it.

Eventually though, we may reach a node s such that the next character a_j doesn’t have a corresponding edge g(s, a_j). We know that s is a suffix of [a1,a2,…,a_{j-1}], say s = [a_k,…,a_{j-1}]. We also know there’s no other unreported entry in S that starts at k but there might be one that starts at a_{k+1}. We could restart the search at k+1 and at the root of the tree, but can we do better?

Because S is fixed, no matter what text H we have, by the time we encounter a dead end at a node r, we know that the last characters seen are  [a_k,…,a_{j-1}] = r. Now suppose s = [a_{k+1},…,a_{j-1}, a_j] happens to be in the trie. Then we can continue our search from s, without having to backtrack. If k+1 doesn’t work, we can try k+2, k+3, and so forth. More generally, say that we eventually found the smallest index k* > k such that s* = [a_k*,…,a_{j-1}, a_j] is in the trie. In other words s* is the longest proper suffix of s in the trie. When we fail to find s = r + a_j in the trie, we can resume at s*. This suggests we could have a fail edge from s to s* in our trie to quickly resume the search without backtrack.

Technically, s doesn’t exist in the trie, so we can’t add an edge there. However, we can still add the failure edge in r, and denote it as f(r, a_j) = s* for all labels a_j not in a goto edge.

Given the goto and fail edges, we have a proper state machine which we can use to find matches in H. Let’s combine them into a single set of edges e(s, a) (that is e(s, a) = g(s, a) if it exists, otherwise f(s, a)).

We are still missing printing out the matches. Because of the shortcuts we take via the fail edges, we never explicitly visit a suffix of a state s. For this reason, when visiting a state s, we need to print all its suffixes that belong to S.

We’ll now see how to construct the fail edges and the output.

Building the fail edges

We’ll show how to construct the failure edges efficiently. We define auxiliary edges ps(s) which represents the longest proper suffix of s that is in the trie. Note that f(r, a) is ps(r + a) for when g(r, a) doesn’t exist.

The idea is that if we know ps(r) for all nodes s of depth up to l in the trie, we can compute ps(r + a), that is for the nodes in the next level.

Consider each node r of depth l and each symbol a. Let s = r + a. We know ps(r) is in the trie, but is ps(r) + a in the trie? Not necessarily, but maybe a suffix of ps(r) is? We can try ps(ps(r)) and further suffixes until we find one that is in the trie, or we reach the empty string.

How can we make sure to only process node s only after all the edges from lower levels have been processed? We can process the nodes in bread-first order, which can be achieved using a queue. The pseudo-code could be:

Note how instead of searching for the largest suffix by removing one character at a time and checking if it exists in the trie, we’re “jumping” to suffixes via repeated application of ps(), which is what makes the algorithm efficient as we shall see. You might be asking if we’re not missing any valid suffix when doing that, but no worries, we analyze the correctness of this in Lemma 1 (Appendix).

Building the output

If we compute output(s) in a bread-first order, we can assume we know output(r) for nodes r lower level than s. Assuming we already computed ps(s), we have that output(s) = output(ps(s)) + {s} if s is in S:

How do we know by jumping through suffixes via ps(s) we are not missing any matches? We demonstrate that in Lemma 2 (Appendix).

### Implementation in Python

I’ve implemented the pseudo-code proposed here in Python, and made it available on Github. The main implementation aspect is how we compute the output list. To avoid storing the matches multiple times, we model it as a linked list. More precisely, when computing output(s) from ps(s), if ps(s) is itself a keyword, we point output(s) to ps(s). Otherwise we point it to output(ps(s)).

When printing output(s) we just need to traverse the list,:

### Conclusion

It was a lot of work to study Aho Corasick’s paper [1], understand the proof and re-write with my own words, but this provided me much better intuition behind the algorithm.

Like the KMP, I’ve known this algorithm before but never took the time to fully understand it.

My initial implementation was in Rust, but I got stuck when modeling the nodes because of self-reference. This led me to this Stackoverflow answer which I haven’t time to study, but seems like a good idea for a post.

### Appendix

#### Correctness

Lemma 1. Let s be a prefix in the look-up structure. Then ps(s) is the longest proper suffix of s that exists in the look-up structure T.

Let’s prove this by induction on the size of s. The base is for nodes of level 1, where ps(s) = root (empty string), which is true, since the proper suffix of a single letter is empty.

Assuming that ps(s’) is the longest proper suffix of s that exists in the look-up structure for |s’| < |s| , let’s prove this is true for ps(s) as well. Let’s show first that ps(s) is a proper suffix of s in T. We know that ps(r) is a suffix of r, and so is ps(ps(r)) since a suffix of a suffix of r is also a suffix of r, and so on. Let’s define ps_k(r) = ps(ps_{k-1}(r)) and ps_n(r) such that g(ps_n(r), a) exists. By construction we assigned ps(s) = g(ps_n(r), a). Since ps_n(r) is a suffix of r, and r + a = s, we know what ps_n(r) + a is suffix of s. We’ll show there’s no other proper suffix or s in loop-up structure larger than ps(s). Suppose there is a s* in T such that |s*| > |ps(s)|. Let r* be = s* + a. Note that r* is a suffix of r and it exists in T (since s* is in there and T it’s a prefix tree). It cannot be larger than ps(r) because of our induction hypothesis. r* has to be larger than ps_n(r) since |ps(s)| = |ps_n(r)| + 1, |s*| > |ps(s)| and |s*| = |r*| + 1.

The first observation is that r* != ps_k(r) for any k. Otherwise the algorithm would have set ps(s) = g(r*, a) = s*. So there must be some k for which |ps_k(r)| > |r*| > |ps_{k+1}(r)|, this means that for r’ = ps_k(r), there’s a suffix r* suc that |r*| > |ps(r’)|. But this is a contradiction of the induction hypothesis saying that ps(r’) is the largest suffix of r’ in T.

Lemma 2. output(s) contains y if and only if y is in S and a suffix of s.

Let’s prove one direction: if output(s) contains y then y is in S and a suffix of s. This is easy to see by construction, output(s) contains s if it’s in S, and also the outputs of ps(s), which is a suffix of s, and by induction are all in S and suffix of ps(s), and hence of s.

Now suppose y is a keyword and suffix of s. Let’s show that there’s a k for which ps_k(s) = y. If this was not the case, then by a similar argument we did in Lemma 1, there is ps_{k-1}(s) such that y is suffix of it, and |y| > |ps_{k}(s)|, which violates Lemma 1, since  ps(ps_{k-1}(s)) = ps_{k}(s) is not the longest possible. Since ps_k(s) = y, by construction, we know that output(y) will eventually be merged into output(s).

Lemma 3. At the end of the loop in the Algorithm 1, state s is in the look-up structure T and it is a suffix of [a1,a2,…,aj].

This is true for a1: it would be either in state [a1] in T, or the empty state, both being suffix of [a1]. Assuming this hypothesis holds for j-1, we were in a state r before consuming aj, and r is the longest suffix of  [a1,a2,…,aj-1].

By construction we know that the end state s is still in T, since we only followed valid edges of T (gotos and fails). We need to show s is the longest suffix of [a1,a2,…,aj]. For the sake of contradiction, assume there’s s* such that |s*| > |s| in T that is a suffix of [a1,a2,…,aj]. Let r* be such that s* = r* + aj. Note that r* is in T and is a suffix of [a1,a2,…,aj-1] and by the inductive hypothesis |r*| < |r|. If we used a goto edge, then |r*| > |r| which contradicts the inductive hypothesis. If we used a fail edge, then s = f(r, aj). Using the same argument from Lemma 1, we have that r* is a suffix of r, and that r* is not ps_k(r) for any k, so it must be that for some k, x = ps_k(r), |x| > |r*| > |ps(x)|and is a contradiction because r* is a proper suffix of x but longer than ps(x).

Theorem 1. The Aho Corasick algorithm is correct.

We need to show that every substring of text which is in S, will be reported. Any substring we were to report end at an index j of H, so it suffices to show we report every suffix of [a1,a2,…,aj] in S, for all j.

Let s be the node we are at the end of each loop in Algorithm 1. Suppose there’s a suffix s* of [a1,a2,…,aj] in S that is not in output(s). From Lemma 2, we know that every keyword in S that is a suffix of s is in output(s), it follows that s* is not a suffix of s, so |s*| > |s|, which cannot be from Lemma 3.

QED

#### Complexity

Theorem 2. The look-up stricture can be constructed in linear time of the number of characters in S.

Let’s define size(S) = the total number of characters in S and A the set of symbols of the alphabet.

It’s known to that we can construct a trie from a set of string in S, so the goto edges are covered.

For constructing the ps() function, the cost will be associated on how many times we execute the inner loop. We visit each node once and for each of the |A| symbols, we perform the inner loop. Since|f(s)| < |s|, the number of times we execute the inner-most loop for a given state is bounded by |s|, so for each state we execute the inner-most loop |s|*|A| times. If we sum over all states, we have size(S)*|A|. Assuming |A| is constant and small, we can ignore it in the overall complexity. It’s worth noting Aho and Corasick’s original paper don’t depend on |A|.

The cost of constructing the output() function is O(1) if we were to use a shared linked list. In that case output(s) can be constructing with a cell s and have it point to the list of output(f(s)), so again, this would be bounded by size(S).

For the search algorithm, let’s ignore the cost of printing the output for now. We can follow an edge in O(1) time and do so exactly once per character.

In the worst case there could be a lot matches at each position. For example, if S = {a, aa, aaa, …, a^(n/2)} and H = a^n.  For n/2 positions we’ll need to output all entries in S, leading to a complexity of size(S)*n.

Since these values need to be printed, there’s no way to be faster than that and this could be the dominating factor in the total complexity. In practice however the number of matches is much smaller.

### References

[1] Efficient String Matching: An Aid to Bibliographic Search
[2] Alfred Aho – Wikipedia

# Numerical Representations as inspiration for Data Structures

In this chapter Okasaki describes a technique for developing efficient data structures through analogies with numerical representations, in particular the binary and its variations.

We’ve seen this pattern arise with Binomial Heaps in the past. Here the author presents the technique in its general form and applies it to another data structure, binary random access lists.

### Binary Random Access Lists

These lists allows efficient insertion at/removal from the beginning, and also access and update at a particular index.

The simple version of this structure consists in distributing the elements in complete binary leaf trees. A complete binary leaf tree (CBLF) is one that only stores elements only at the leaves, so a tree with height i, has 2^(i+1)-1 nodes, but only 2^i elements.

Consider an array of size n, and let Bn be the binary representation of n. If the i-th digit of Bn is 1, then we have a tree containing 2^i leaves. We then distribute the elements into these trees, starting with the least significant digit (i.e. the smallest tree) and traversing the tree in
pre-order.

For example, an array of elements (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) has 11 elements, which is 1011 in binary. So we have one tree with a single leave (1), a tree with 2 leaves (2, 3) and another containing 8 leaves (4, 5, 6, 7, 8, 9, 10, 11).

We use a dense representation of the binary number, by having explicit elements for the 0 digit. Here’s a possible signature for this implementation:

Inserting a new element consists in adding a new tree with a single leave. If a tree already exists for a given size, they have to be merged into a tree of the next index. Merging two CBLFs of the same size is straightforward. We just need to make them children of a new root. Since elements are stored in pre-order, the tree being inserted or coming from carry over should be the left child.

Looping back to our example, if we want to insert the element 100, we first insert a tree with a single leaf (100). Since the least significant digit already has an element, we need to merge them into a new tree containing (100, 1) and try to insert at the next position. A conflict will arise with (2, 3), so we again merge them into (100, 1, 2, 3) and try the next position. We finally succeed in inserting at position 2, for a new list containing trees like (100, 1, 2, 3) and (4, 5, 6, 7, 8, 9, 10, 11).

The complexity of inserting an element is O(log n) in the worst case which requires merging tree for all digits (e.g. if Bn = 111...111). Merging two trees is O(1).

Removing the first element is analogous to decrementing the number, borrowing from the next digit if the current digit is 0.

Searching for an index consists in first finding the tree containing the index and then searching within the tree. More specifically, because the elements are sorted beginning from the smallest tree to the largest, we can find the right tree just by inspecting the number of elements in each tree until we find the one whose range includes the desired index. Within a tree, elements are stored in pre-order, so we can find the right index in O(height) of the tree.

After finding the right index, returning the element at that index is trivial. Updating the element at a given index requires rebuilding the tree when returning from the recursive calls.

Okasaki then proposes a few different numbering systems that allow to perform insertion/removal in O(1) time. Here we’ll only discuss the less obvious but more elegant one, using skew binary numbers.

### Skew Binary Random Access Lists

A skew binary number representation supports the digits 0, 1 and 2.

The weight of the i-th digit is 2^(i+1) - 1. In its canonical form, it only allows the least significant non-zero digit to be 2.

Examples:

Decimal and Skewed Binary

It’s possible to show this number system offers a unique representation for decimal numbers. See the Appendix for a sketch of the proof and an algorithm for converting decimals to skewed binary numbers.

Incrementing a number follows these rules:

• If there’s a digit 2 in the number, turn it into 0 and increment the next digit. By definition that is either 0 or 1, so we can safely increment it without having to continue carrying over.
• Otherwise the least significant digit is either 0 or 1, and it can be incremented without carry overs.

The advantage of this number system is that increments (similarly, decrements) never carry over more than once so the complexity O(1), as opposed to the worst-case O(log n) for regular binary numbers.

A skew binary random access list can be implemented using this idea. We use a sparse representation (that is, not including 0s). Each digit one with position i corresponds to a tree with (2^(i+1) - 1) elements, in this case a complete binary tree with height i+1. A digit 2 is represented by two consecutive trees
with same weight.

Adding a new element to the beginning of the list is analogous to incrementing the number, which we saw can be done in O(1). Converting a digit 0 to 1 or 1 to 2, is a matter of prepending a tree to a list. To convert a 2 to 0 and increment the next position, we need to merge two trees representing it with the element to be inserted. Because each tree is traversed in pre-order, we make the element the root of the tree.

Elements are inserted in pre-order in each tree, so when searching for an
index, we can first find the right tree by looking at the tree sizes and within a tree we can do a “binary search” in O(height) of the tree.

### Binomial Heaps

In this chapter, this technique is also applied to improve the worst case runtime of insertion of binomial heaps. The implementation, named Skewed Binomial Heap, is on github.

### Conclusion

This chapter demonstrated that binary representations are a useful analogy to come up with data structures and algorithms, because they’re simple. This simplicity can lead to inefficient running times, though. Representations such as skewed binary numbers can improve the worst case of some operations with the trade-off of less intuitive and more complex implementations.

### Appendix A – Proof

Sketch of the proof. First, it’s obvious that two different decimals cannot map to the same binary representation. Otherwise the same equation with the same weights would result in different values. We need to show that two binary representations do not map to the same decimal.

Suppose it does, and let them be B1 and B2. Let k be the largest position where these number have a different digit. Without loss of generality, suppose that B1[k] > B2[k].

First case. suppose that B1[k] = 1, and B2[k] = 0 and B2 doesn’t have any digit 2. B1 is then at least $M + 2^{k+1} - 1$, while B2 is at most $M + \sum_{i = 1}^{k} (2^{i} - 1)$ which is $M + 2^{k + 1} - k$ (M corresponds to the total weight of digits in positions > k). This implies that B2 < B1, a contradiction.

Second case. suppose that B1[k] = 1, but now B2 does have a digit 2 at position j. It has to be that j < k. Since only zeros follow it, we can write B2‘s upper bound as

$M + \sum_{i = j + 1}^{k} (2^{i} - 1) + 2^{j + 1} - 1$

Since $2(2^{j + 1} - 1) < 2^{j + 2} - 1$, we have

$\sum_{i = j + 1}^{k} (2^{i} - 1) + 2^{j + 1} - 1 < \sum_{i = j + 2}^{k} (2^{i} - 1) + 2^{j + 2} - 1$

We can continue this argument until we get that B2 is less than $M + 2(2^{k} - 1)$ which is less than $M + 2^{k + 1} - 1$, B1.

Third case. Finally, suppose we have B1' such that B1'[k] = 2, and B2'[k] = 1. We can subtract $2^{k+1} - 1$ from both and reduce to the previous case. ▢

### Appendix B – Conversion algorithm

Converting from a decimal representation to a binary one is straightforward, but it’s more involved to do so for skewed binary numbers.

Suppose we allow trailing zeros and have all the numbers with k-digits. For example, if k=2, we have 00, 01, 02, 10, 11, 12 and 20. We can construct the numbers with k+1-digits by either prefixing 0 or 1, and the additional 2 followed by k zeros. For k=3, we have 000, 001, 002, 010, 011, 012, 020, 100, 101, 102, 110, 111, 112, 120 and finally 200.

More generally, we can see there are 2^(k+1) - 1 numbers with k digits. We can construct the k+1 digits by appending 0 or 1 and then adding an extra number which is starts 2 and is followed by k zeros, for a total of 2^(k+1) - 1 + 2^(k+1) - 1 + 1 = 2^(k + 2) - 1, so we can see this invariant holds by induction on k, and we verify that is true for the base, since for k = 1 we enumerated 3 numbers.

This gives us a method to construct the skewed number representation if we know the number of its digits say, k. If the number is the first 2^(k) - 1 numbers, that is, between 0 and 2^k - 2, we know it starts with a 0. If it’s the next 2^(k) - 1, that is, between 2^k - 1 and 2^(k+1) - 3, we know it starts with a 1. If it’s the next one, exactly 2^(k+1) - 2, we know it starts with a 2.

We can continue recursively by subtracting the corresponding weight for this digit until k = 0. We can find out how many digits a number n has (if we’re to exclude leading zeros) by find the smallest k such that 2^(k+1)-1 is greater than n. For example, for 8, the smallest k is 3, since 2^(k+1)-1 = 15, and 2^(k)-1 = 7.

The Python code below uses these ideas to find the skewed binary number representation in O(log n):

One might ask: why not OCaml? My excuse is that I already have a number theory repository in Python, so it seemed like a nice addition. Converting this to functional code, in particular OCaml is easy.

This algorithm requires an additional O(log n) memory, while the conversion to a binary number can be done with constant extra memory. My intuition is that this is possible because the weights for the binary numbers are powers of the same number, 2^k, unlike the skewed numbers’ weights. Is it possible to work around this?

# Tree Ring Matching using the KMP Algorithm

Disclaimer: trees and rings are concepts found in Mathematics and Computer Science, but in this post they refer to the biology concepts ;)

### Biosphere 2

Earlier this year I went to Arizona and visited the Biosphere 2. Its initial goal was to simulate a close environment where a group of 6 humans should survive for 2 years without any exchange with the exterior (except sunlight, of course). They had different biomes, including a rain forest, an ocean to a desert.

Biosphere 2

The plan didn’t go well because they had missed interaction of the plants in one of the seasons, which causes the level of oxygen to fall under the accepted level, so they had to get extern supply. The mission did last 2 years, but was not a perfect closed environment.

Later, another mission was attempted, but had to be abandoned after a few months. The project lost its funding and no more attempts were performed. Nowadays, the place is used by biologists to study environments under controlled conditions and also open to the public for tours (the Nautilus magazine recently wrote more about this).

The tour includes a small museum explaining some of the research, one of them is analyzing trees to understand the ecosystem from the past. According to one of the exhibits, tree rings can be used to determine the age of the tree and also some climate from that period, a process called dendrochronology. The more rings a tree has, the old it is and the width of the tree varies with the climate during the period of its growth.

A lot of the trees alive today are not too old. On the other hand they have fossil of older trees. Since they possibly co-existed for a period of time, it’s possible to combine the data from the rings of both trees by matching the outer rings of the fossil tree with the inner rings of the recent tree.

Tool to explain the matching system

### The Longest Prefix-Suffix String Overlap problem

I thought about the tree ring matching for a while and realized this can be reduced to the problem of finding the longest overlap between the suffix and prefix of two strings. More formally, given strings A and B, find the longest suffix of A that is also a prefix of B.

This problem can be easily solved using a simple quadratic algorithm, which is is probably fast enough for real world instances, possibly in the order of thousands of rings. In Python, we can use the array access to do:

def longest_suffix_prefix_naive(suffix, prefix):
minlen = min(len(prefix), len(suffix))
max_match = 0
for match_len in range(1, minlen + 1):
if prefix[:match_len] == suffix[-match_len:]:
max_match = match_len
return max_match


In the code above, A is named suffix and B prefix.

But can we do better? This problem resembles a string search problem, where we are given a text T and a query string Q and want to find whether Q appears in T as substring. One famous algorithm to solve this problem is the KMP, named after their authors Knuth Morris and Pratt (Morris came up with the algorithm independently from Knuth and Pratt) which is linear on the size of T and Q.

We’ll see how to use the ideas behind the KMP to solve our problem. So let’s start by reviewing the KMP algorithm.

### The KMP Algorithm: A refresher

The most straightforward way to search for a query Q in a text is for every character in T, check if Q occurs in that position:

def substr(Q, T):

tlen = len(T)
qlen = len(Q)

for i, _ in enumerate(T):
match = 0
# Searches for Q starting from position i in T
while match < qlen and \
i + match < tlen and \
Q[match] == T[i + match]:
match += 1

if match == qlen:
print 'found substring in position', i


#### Avoiding redundant searches

Imagine that our query string had no repeated characters. How could we improve the search?

Q = abcde
T = abcdfabcde
^ mismatch in position 4


If we were using our naive idea, after failing to find Q in position 4, we’d need to start the search over from T(1) = 'b'. But since all characters are different in Q, we know that there’s no 'a' between positions 1 and 3 because they matched the rest of strings of Q, so we can safely ignore positions 1 to 3 and continue from position 4.

The only case we would need to look back would be if 'a' appeared between 1 and 3, which would also mean 'a' appeared more than once in Q. For example, we could have

Q = ababac
^ mismatch in position 5


When there is a mismatch in position 5, we need to go back to position 2, because 'abababac' could potentially be there. Let MM be the current mismatched string, 'ababad' above and M the string we had right before the mismatch, i.e., MM without the last letter, in our case, 'ababa'. Note that M is necessarily a prefix of Q (because MM is the first mismatch).

The string M = 'ababa', has a potential of containing Q because it has a longest proper suffix that is also a prefix of Q, 'aba'. Why is it important to be a proper prefix? We saw that M is a prefix of Q, hence the longest suffix of M that is a prefix of Q would be M itself. We are already searched for M and know it will be a mismatch, so we’re interested in something besides it.

Let’s define LSP(A, B) as being the length of the longest proper suffix of A that is a prefix of B. For example, LSP(M, Q) = len('aba') = 3. If we find a mismatch MM, we know we need to go back LSP(M, Q) positions in M to restart the search, but we also know that the first LSP(M, Q) positions will be a match in Q, so we can skip the first LSP(M, Q) letters from Q. In our previous example we have:

Q =   ababac
^ continue search from here
(not from position 2)


We’ll have another mismatch but now MM = 'abad' and LSP(M, Q) = len('a') = 1. Using the same logic the next step will be:

Q =     ababac
^ continue search from here
(not from position 2)


Now MM = 'ad' and LSP(M, Q) = 0. Now we shouldn’t go back any positions, but also won’t skip any characters from Q:

Q =      ababac
^ continue search from here


If we know LSP(M, Q) it’s easy to know how many characters to skip. We can simplify LSP(M, Q) by noting that M is a prefix of Q and can be unambiguously represented by its length, that is M = prefix(Q, len(M)). Let lsp(i) = LSP(prefix(Q, i), Q). Then LSP(M, Q) = lsp(len(M)). Since lsp() only depends on Q, so we can pre-compute lsp(i) for all i = 0 to len(Q) - 1.

Suppose we have lsp ready. The KMP algorithm would be:

def kmp(Q, T):

lsp = precompute_lsp(Q)

i = 0
j = 0

qlen = len(Q)
tlen = len(T)

while i < tlen and j < qlen:

if j  0:
j = lsp[j - 1]
else:
i += 1

# string was found
if j == qlen:
print i - j


We have to handle 3 cases inside the loop:

1: A match, in which case we keep searching both in T, Q.

2: A mismatch, but there’s a chance Q could appear inside the mismatched part, so we move Q the necessary amount, but leave T.

3: A mismatch, but there’s no chance Q could appear in the mismatched part, in which case we can advance T.

This algorithm is simple but it’s not easy to see it’s linear. The argument behind it is very clever. First, observe that lsp[j] < j, because we only account for proper suffixes. It’s clear that i is bounded by len(T), but it might not increase in all iterations of the while loop. The key observation is that (i - j) is also bounded by len(T). Now in the first condition, i increases while (i - j) remains constant. In the second, i might remain constant but j decreases and hence (i - j) increases. So every iteration either i or (i - j) increases, so at most 2T iterations can occur.

#### The pre-computed lsp

How can we pre-compute the lsp array? One simple idea is to, for every suffix of Q, find the maximum prefix it forms with Q. In the following code, i denotes the start of the suffix and j the start of the prefix.

def precompute_lsp_naive(Q):
qlen = len(Q)
lsp = [0]*qlen

i = 1
while i < qlen :

j = 0
while (i + j < qlen and Q[j] == Q[i + j]):
lsp[i + j] = max(lsp[i + j], j + 1)
j += 1

i += 1
return lsp


The problem is that this code is O(len(Q)^2). In practice len(Q) should be much less than len(T), but we can do in O(len(Q)), using a similar idea from the main idea of the KMP, but this time trying to find Q in itself.

The main difference now is that we’ll compute lsp on the fly. While searching for Q in Q, when we have a matching prefix, we update the lsp. When we have a mismatch, instead of re-setting the search to the beginning of Q we skip some characters given by the lsp. This works because j < i and we had lsp(k) for k <= i defined. In the following code, the only changes we performed were adding the lsp attribution and having T = Q.

def precompute_lsp(Q):

qlen = len(Q)

lsp = [0]*qlen

i = 1
j = 0
while i  0:
j = lsp[j - 1]
else:
i += 1

return lsp


We can use exactly the same arguments from the main code of KMP to prove this routine is O(len(Q)). The total complexity of KMP is O(len(Q) + len(T)).

### Solving our problem

At this point we can observe that the problem we were initially trying to solve, that is, find the lsp(A, B), is the core of the KMP algorithm. We can search B, the prefix, in A, the suffix. If we can keep track of the indices i and j, we just need to find when i equals to the length of T, which means there was a (partial) match of B in the suffix of A.

We can generalize our kmp function to accept a callback that yields the indices of the strings at any step:

def kmp_generic(Q, T, yield_indices):

lsp = precompute_lsp(Q)

i = 0
j = 0

qlen = len(Q)
tlen = len(T)

while i < tlen:

if j  0:
j = lsp[j - 1]
else:
i += 1

yield_indices(i, j)


Now, if we want an array with the indices of all occurrences of Q in T, we can do:

def kmp_all_matches(Q, T):
matches = []
qlen = len(Q)
def match_accumulator(i, j):
if j == qlen:
matches.append(i - j)
kmp_generic(Q, T, match_accumulator)
return matches


Here we use a nested function as our callback. We can also use kmp_generic to solve our original problem:

def longest_suffix_prefix(suffix, prefix):
slen = len(suffix)
max_match = [None]
def max_matcher(i, j):
if max_match[0] is None and i == slen:
max_match[0] = j

# Search prefix in suffix
kmp(prefix, suffix, max_matcher)

return 0 if max_match[0] is None else max_match[0]


Note that the nested function has read-only access to the variables in the scope of the outer function, like max_match and slen. To be able to share data with the nested function, we have to work with references, so we define max_match as a single-element array.

### Conclusion

I used to participate in programming competitions. One of the most interesting parts of the problems are their statements. It’s usually an interesting random story from which you have to extract the actual computer science problem. A lot of the times though, the problem writer thinks about a problem and then how to create a story around that problem. It’s nice when we can find real-world problems that can be modelled as classic computer science problems.

I’ve used the KMP several times in the past but never took the time to study it carefully until now, which only happened because I wrote it down as a post.

### References

[1] Wikipedia – Knuth–Morris–Pratt algorithm
[2] Annual Growth Rings

# Revisiting Python: Object Oriented Programming

This is the third post in the series on revisiting Python. In the previous post we talked about Python functions.

We’ll talk about classes in Python 2.7. Following the premisses of previous posts, this is not intended to be an introduction to Object Oriented Programming, but rather how we can use this paradigm in Python.

We’ll first cover basic concepts like classes and inheritance. Later we’ll discuss some more advanced features from the Python language that allows us extending the set of built-in patterns provided by the language.

### Classes

Classes are the building blocks of object oriented programming. Python classes look similar to those in languages like C++ and Java, but there are some important differences, which we’ll comment on this section.

Classes are objects. When writing a class definition the code is executed and an class object is created and assigned to a name corresponding to the class name. For example, in the code below, an object representing the class is assigned to a variable in the current scope MyClass.

class MyClass:
print 'hi'

print MyClass # __main__.MyClass
x = MyClass
print x       # __main__.MyClass


If we run the code above it will print ‘hi’. We can manipulate MyClass as a regular variable, assigning it to other variables, passing it as parameter to functions, etc.

Instantiating. We can create an instance of a given class by using a function call notation, that is, MyClass().

This will also call the method __init()__ which can be used to initialize the instance, like a constructor. Functions defined within a class become methods to the instances. Methods can be called using the syntax instance.method(). For example:

class MyClass:
def myMethod(self):
print(self)

instance = MyClass()

instance.myMethod() # Prints &amp;lt;__main__.MyClass instance at 0x10891f290&amp;gt;


When invoking a function using instance.method(), instance() is bound to the first argument to method() in this case. We usually name this first parameter self, but it’s just a convention.

Class vs. instance members. Note that local variables defined at the class level belong to the class object, not to a specific instance object. Making an analogy to C++, this is a static member variable.

class MyClass:
x = 10

inst = MyClass()
inst.x = 20
inst2 = MyClass()
print inst2.x # 20


To make a variable instance specific, we need to make use of the instance passed as the first parameter to the methods.

class MyClass:
def __init__(self):
self.x = 10

inst = MyClass()
inst.x = 20
inst2 = MyClass()
print inst2.x


All methods and variables defined at the class level are stored internally in the __dict__ variable.

class MyClass:
x = 10
def f():
pass

print MyClass.__dict__
# 'x': 1, '__module__': '__main__', '__doc__': None, 'f': &amp;lt;function f at 0x109f32d70&amp;gt;}


Methods. In Python, methods are “static”, but by default it requires an instance of the class to be bound to the first parameter. As we saw above, this happens automatically when we use the instance.method() syntax. Alternatively, we could explicitly pass the instance to the static method, using the SomeClass.method(instance) syntax. To illustrate that, imagine we have a method in our class a method printX(). We can invoke it either by a method from the instance object or from the class object passing the instance:

class MyClass:
def __init__(self):
self.x = 10

def printX(self):
print self.x

inst = MyClass()

# Both are equivalent:
inst.printX()
MyClass.printX(inst)


Methods are essentially functions assigned to class member variables. To make the point clear, we can rewrite the previous snippet replacing the inline definition of printX() by an external function:

def externalPrintX(self):
print self.x

class MyClass:
def __init__(self):
self.x = 10

printX = externalPrint


Note that the external function still needs to have as the first arguments an instance of MyClass.

Given that all methods are public and data variables take precedence over static ones, it can cause some confusion if we assign functions to instance members:

def externalPrintX():
print 'external hello world'

class MyClass:
def __init__(self):
self.x = 10
self.printX = externalPrintX

def printX(self):
print 'hello world'

inst = MyClass()
inst.printX()


Here, the method printX() method got replaced by another function, externalPrintX(). Note that, differently from regular methods, the instance is not bound to the first argument when invoking inst.printX().

Static Methods. Since we’re required to provide an instance to the first argument, it’s not truly static in the C++ class terminology. It’s possible to override this requirement by using the staticmethod() function. To illustrate this, here’s an example:

class ClassWithStaticMethod():
def regularMethod(x):
print 'This is not bound to any instance:', x

staticMethod = staticmethod(regularMethod)

x = ClassWithStaticMethod()
x.staticMethod(10) # This is not bound to any instance: 10


Class Methods. are different from regular methods because they always receive the class object instead of the instance object as the first argument. More specifically, when calling from a class object, it is bound automatically as the first parameter and when called from the instance object, its corresponding class object is bound instead. For example,

class ClassWithClassMethod():
def regularMethod(type):
print type
classMethod = ClassMethod(regularMethod)

ClassWithClassMethod.classMethod()

x = ClassWithClassMethod() # ClassWithClassMethod
x.classMethod()            # ClassWithClassMethod


### Inheritance

The syntax for inheritance is class Derived(Base). Methods defined in the base classes can be overridden by derived classes. If a name (variable or method) is referenced from an instance, it’s searched from the current class, going up on the inheritance chain, until the definition is first found.

# animal.py
class Animal:
def breath(self):
return 'breathing air'
def eat(self):
return 'eating'

class Dog(Animal):
def eat(self):
return Animal.eat(self) + ' meat'

dog = Dog()
print dog.breath()  # breathing air
print dog.eat()     # eating meat


In this example, eat() is first found at Dog, while breath() is only found at Animal. Note that we can refer to a specific implementation by providing the class explicitly, like Animal.eat(self).

Old-style vs. new style classes. In a nutshell, the difference between old-style and new-style classes is that the latter is a descendant of object. Thus, by default user defined classes are old-style by default (though old-style classes are removed from Python 3).

According to the docs, the motivation for creating the new-style classes is to unify Python types and user defined classes. In this model, classes are able to derive from built-in types [2].

Multiple-inheritance. is supported in Python. The syntax is a comma separated list of base classes:

class DerivedClassName(Base1, Base2, Base3):
...


Method resolution order (MRO). We just saw that for simple inheritance chains, the natural lookup order for methods is straightforward. For multiple-inheritance it can be complicated mainly because of the diamond pattern.

For old-style classes, the method lookup order, MRO, is defined by a “depth-first search”. It first looks in the entire inheritance chain of Base1, then Base2, and so on. The MRO is always well defined as long as the inheritance graph is a DAG (direct acyclic graph).

The problem is that it’s not intuitive. Depending on the order a given class appears in a descendant, its own MRO can vary. For example, consider the following complex dependency:

class A:
def name(self):
return 'a'

class B:
def name(self):
return 'b'

class C(B, A):
def composed_name(self):
return 'c ' + self.name()

class D(A, C): pass

d = D()
print d.composed_name()


When we call composed_name() on the instance of class D, the name resolution will only find it in class C. This class on its turn needs to resolve the name() method. Should we get it from class B, since it’s listed first on C‘s inheritance?

That’s not what happens. name() is resolved from the class D, because there, A appears before. This is a very confusing behavior and error-prone.

In new-style classes, the diamond pattern exists by default when multiple inheritance is used, since everything ends up deriving from object. But in the new style, the type of inheritance that can be created is more strict. It uses an interesting algorithm, C3, described in [3] to create a consistent ordering or throw an exception in case such ordering doesn’t exist.

super. In the animal.py example, we had to hard code Animal.eat() to be able to call Dog‘s parent class. The super() function takes a class object and an instance, and it returns a proxy object on which we can call methods and it will resolve them based on the MRO the object class. More specifically, consider the following example:

class A(object):
def method(self):
print 'calling method from a'

class B(object):
def method(self):
print 'calling method from b'
def anotherMethod(self):
print 'calling another method from b'

class C(A, B):
def method(self):
print 'calling method from c'
def anotherMethod(self):
print 'calling another method from c'

def run(self):
proxy = super(C, self)
proxy.method()          # calling method from a
proxy.anotherMethod()   # calling another method from b

C().run()


In this example, super() returned an object, proxy, which can resolve method() and anotherMethod() properly. Since C derives from A first, it resolves it to A.method(). On the other hand, since anotherMethod() is only defined in B, it will resolve to B.anotherMethod().

### Descriptors

So far, we discussed the basic components of object oriented programming. Now we’ll describe some advanced concepts that will help us better understand how some functions like staticmethod() work.

A class is said to be a descriptor if it implements any of __get__(), __set__() or __delete__(). In particular, if it doesn’t implement __set__(), it’s called non-data descriptor. If it implements both __get__() and __set__(), it’s a data descriptor [4, 5].

If another class contains class members that are descriptors, the __get__() and __set()__ will be executed when do reads and assignments respectively. For a concrete example, considering the following descriptor:

class Descriptor(object):
def __get__(self, obj, objtype):
print 'getting x'
return self.x

def __set__(self, obj, val):
print 'setting x to ' + str(val)
self.x = val


It stores a value internally at the variable x, but includes some logging when setting or getting this value. Now, suppose we have a class that has a member that is a descriptor:

class ClassWithDescriptors(object):
member = Descriptor()

x = ClassWithDescriptors()
x.member = 20
print x.member


When we try to assign a value to x.member, the method __set__() from Descriptor is called instead. The same applies to when we try to read from x.member. It’s important that member is a class member, not an instance member. For example, if we have the instance member variable, it would not print the logging messages:

class ClassWithDescriptors(object):
def __init__(self):
self.instance_member = Descriptor()

x = ClassWithDescriptors()

x.instance_member = 20   # This is just overriding instance_member with an integer
print x.instance_member


staticmethod() and classmethod(). One interesting use case for descriptors to implement the staticmethod() and classmethod() “functions”. We could write the following descriptor for each of these, respectively:

class StaticMethod(object):
def __init__(self, f):
self.f = f

def __get__(self, obj, objtype=None):
return self.f

class ClassMethod(object):
def __init__(self, f):
self.f = f

def __get__(self, obj, objtype=None):
if objtype is None:
objtype = type(obj)
def newfunc(*args):
return self.f(objtype, *args)
return newfunc


### Functors

Python has the concept of function objects or functors. It’s an object that can be invoked as a function so long its class defines the __call__() method. A simple example is:

class Multiplier:
def __init__(self, factor):
self._factor = factor
def __call__(self, a):
return self._factor * a

double = Multiplier(2)
triple = Multiplier(3)
print double(5) # Prints 10
print triple(7) # Prints 21


One advantage of functors over regular functions is the capability of carrying context.

### Decorators

Decorator is a design pattern in which we can modify the behavior of an object without modifying all the objects from that particular class. It’s also known as a wrapper, because we wrap the object inside another object that adds the desired behavior.

We’ll now describe a motivation for using decorators and how Python offers a syntax sugar for this pattern using annotations. For our example, imagine we have a function that takes a three variables and returns the sum of them:

def add(a, b, c):
return a + b + c


Now imagine we want to validate the parameters passed to add(). We could add that to the beginning of the function, but maybe we already have a function to do the validation for us. In this case, we could wrap the function in a another one, which would first perform the parameters validation and then call the function. The code below does that:

def validate(f):
def closure(*args):
for arg in args:
assert isinstance(arg, int)
return f(*args)
return closure

# Throws an exception


validate() takes a function f() and wraps it inside closure(), which verifies all arguments are integers before invoking f(). So by passing add() to validate(), we’re getting a decorated function validated_add() which performs validation on the arguments.

Python offers a syntax sugar to do exactly that. By annotating a function with @something, it passes the function to a function named something(), that should act as a function transformation, and uses the transformed function instead.

@validate
return a + b + c


Imagine we want to customize the validate function to enable us to define the type of each element passed to the decorated function.

Annotations with parameters. We can do that by passing parameters to the decorator, using this syntax @something(arg1, arg2, ...). In this form, we actually construct a decorator builder, so we need to wrap our validate() function inside another one:

def validate(*arg_types):
def validate_impl(f):
def closure(*args):
assert len(arg_types) == len(args), &amp;quot;Arguments and types don't match&amp;quot;
for i in range(len(args)):
assert isinstance(args[i], arg_types[i])
return f(*args)
return closure
return validate_impl


With this new validate() function, we can update the annotation to our simple add() function:

@validate(int, int, int)
return a + b + c


We can now use this as a very simple type validation. Imagine we have a function that takes a string and repeats it N times. We can enforce the types using the same annotation:

@validate(str, int)
def rep(s, n):
return &amp;quot;&amp;quot;.join([s]*n)


Decorators can also be function objects, so we can define a class to do a similar work the validate() function does:

class Validator:
def __init__(self, *arg_types):
self._arg_types = arg_types

def __call__(self, f):
arg_types = self._arg_types
def closure(*args):
assert len(arg_types) == len(args), &amp;quot;Arguments and types don't match&amp;quot;
for i in range(len(args)):
assert isinstance(args[i], arg_types[i])
return f(*args)
return closure


In this case, the parameters defined in the annotation are passed to the constructor of the class.

Finally, two very common annotations are staticmethod and classmethod. There is no magic here. These cause the staticmethod() and classmethod() instances to be instantiated, and we saw how these work in previous sections.

### Conclusion

In this post we covered object oriented programming in Python. We learned details about classes, especially how they compare to classes in other languages. We then discussed inheritance, including multiple inheritance and old and new-style classes. We then delved into more advanced concepts related to object oriented programming. This included descriptors, functors and decorators.

Python requires relatively few special constructs to accomplish features from other object-oriented languages. Examples include the concept of super and static methods, which are implemented using general purpose features from Python like descriptors and decorators.

### Reference

[1] The Python Tutorial – Classes
[2] Python.org Docs – New-style Classes
[3] The Python 2.3 Method Resolution Order
[4] Python Descriptors Demystified
[5] Python HOWTOs – Descriptor HowTo Guide

# Revisiting Python: Functions

This is the second post in the series on revisiting Python. In the first post we discussed the motivation for these posts and started by revisiting Built-in Types.

In this post, we’ll talk about functions in Python 2.7.

### Functions

Functions in python consist of the keyword def, a function name and a list of arguments:

def hello(arg1, arg2):
print "hello world"


The pythonic-way to document what a function does is through docstrings (block of strings delimited by triple-double quotes:

def hello(arg1, arg2):
"""
Prints hello world
"""
print "hello world"


Some tools are able to parse these docstrings to automatically build documentation.

Namespace. is a map from names (variables names, function names) to objects (their values). Examples of namespaces include the built-in namespace, the global namespace within a module and local namespace within a function.

Scope. is a textual region of a Python program where a namespace is directly accessible.

For example, within a function we have a local scope. Variable from the arguments or created within the function itself belong to this local scope. Consider the following example:

d = 4
def foo(a):
b = 1
return abs(a - b)
foo(10)


In the code above, a, b are in the local scope. d is in the module global scope and the function abs is in the built-in scope. When we refer to a variable or function name, Python search scopes in the following order:

1. Local scope
2. Local scope of the calling functions
3. Global scope within a module
4. Built-in scope

Item 2 requires some clarification. In Python we can define a function inside another function. For example,

def outer():
scope_name = 'outer'
def inner():
print(scope_name)
return inner

scope_name = 'global'
# Get the inner function
inner = outer()
inner() # outer


In the code above, scope_name is defined both in the global scope and in the outer() function scope. When referenced within inner(), the outer scope takes precedence, so this will print 'outer'. Note how the scope is resolved statically, when defining inner(), not in runtime, explaining why the docs emphasize “textual region“. This is known as lexical scoping, as we discussed when studying the R language.

Variables that are not in the local scope are read-only by default. If we try to assign a new value to an existing variable in an outer scope, we are just creating a new local variable with the same name. Note that if the global variable is a reference to an object, we can modify the object, but we still can’t re-assign a value to that global variable.

x = [1, 2, 3]

def modify_global_variable_content():
x.append(4)

def reassign_global_variable():
x = [10, 20, 30]

modify_global_variable_content()
reassign_global_variable()
print x


In the code above, modify_global_variable_content() modifies the content of x, but reassign_global_variable() just creates a local variable x. To actually modify the variable we can make it explicit using the global keyword.

def reassign_global_variable():
global x
x = [10, 20, 30]


Functions can be assigned to variables and passed around, for example:

def inc(x):
return x + 1

def print_f(f, x):
print f(x)

print_f(inc, 2)


Here we pass the function inc(), which then is assigned to the variable f in print_f(), and used to invoke the original function.

Default arguments. It’s possible to define default arguments to functions, like in C++. The initialization only happens once and the value is shared between subsequent calls, so for objects, this is what happens:

def append(x, xs = []):
xs.append(x)
return xs

print append(1) # [1]
print append(2) # [1, 2]
print append(3) # [1, 2, 3]


Keyword arguments. The default match between the arguments passed to a function and the arguments the function receives is positional, that is, the first value is assigned to the first parameter and so on. Besides that, we can make the argument names explicit.

def listify(a, b, c):
return [a, b, c]
print listify(b = 2, c = 3, a = 1)


There are some constraints though: 1. Positional arguments must come before keyword arguments. 2. If we use a keyword for a given argument, we have to name all required arguments that appear after it in the function definition.

def listify(a, b, c):
return [a, b, c]

# Error: Violates rule 1
print listify(c = 3, 1, 2)

# Error: Violates rule 2
print listify(2, 3, a = 1)


This feature is one of my favorite in Python, which I miss from many other languages. This is particular useful in the scenario where a function takes many default arguments, but we only want to provide the last one.

Var arg. Functions can define two different sets of variable arguments. An argument named * is assigned with a list of all position arguments passed beyond the required arguments; an argument named ** is assigned with a dictionary with all keyword arguments not in the required arguments. A simple example:

def f(fixed, *arguments, **keywords):
print fixed     # 1
print arguments # (2, 3)
print keywords  # {a: 4, b: 5}

f(1, 2, 3, a = 4, b = 5)


Argument unpacking. With an analogous idea, it’s possible to call a function with a list of arguments using the * modifier. For example:

def f(a, b, c):
print a, b, c
args = [1, 2, 3]
f(*args)


Or a map of keyword arguments using the ** modifier:

def f(a, b, c):
print a, b, c
args = {'b': 2, 'c': 3, 'a': 1}
f(**args)


### Generators

In the previous post we saw the iterator type. Remember how to create a custom iterator, he had to define a class and implement the required methods (__iter__() and next()).

We can also instantiate a generator from a function, by using the yield keyword. For example,

def sample_generator(n):
for i in range(n):
yield i


The result of the function is an iterator. Every time we run next(), we execute the function until it reaches a yield statement, in which case it return that value.

it = sample_generator(10)
it.next() # 0
it.next() # 1
...


Generators are particularly useful if the iterator has to work in sequential steps that performs different actions based at each step. For example, consider the following iterator:

class MyIterator:
def __init__(self, x):
self.step = 0
self.x = x

def __iter__(self):
return self

def next(self):
if self.step == 0:
ret = self.x
elif self.step == 1:
ret = 2*self.x
elif self.step == 2:
ret = self.x*self.x
else:
raise StopIteration

self.step += 1
return ret


This is a toy example that runs different functions of x based on which stage it is on. Using generators, this would simply become:

def my_generator(x):
yield x
yield 2*x
yield x*x


A generator can use the return statement (without value) to stop the iterator, which is equivalent to raise StopIteration.

It’s possible to send values back to generators through the method send(). Consider the following example:

def generator_with_sends(x):
while True:
y = (yield x)
if y is not None:
x = y

it = generator_with_sends(5)
print it.next()   # 5
print it.send(10) # 10
print it.next()   # 10


Note that it.send(10) returns 10, not 5. This is because yield halts when the right hand side of the statement is evaluated but before the assignment. Calling send() will resume from there, but this time the return value of (yield x) will be the value passed to send. The code will then execute until the next yield, in which x contains the value passed to y. It’s not allowed to call send() before the generator had yielded the first time.

When a generator is garbage-collected it raises an GeneratorExit exception. It’s possible to force that by calling the stop() method.

def stubborn_generator():
try:
while True:
yield 'no'
except GeneratorExit:
print 'generator stopped. clean up...'
raise StopIteration
finally:
print 'can do some shared clean up here'

def scoped():
s = stubborn_generator()
print s.next()

scoped()


### Map, filter, reduce

Python has built-in function that work with sequences: map(), filter() and reduce() being the most common. These functions work with iterators too. For example:

# map_generator.py
def generate_ints(n):
for i in range(n):
yield i

def double(x):
return x*2

print map(double, generate_ints(5))
# [0, 2, 4, 6, 8]


The problem is that it has to convert a generator to a list, so it has to evaluate a generator until it stops, thus we can’t do this with an “infinite” iterator, like the one from this generator:

def generate_forever():
i = 0
while True:
yield i
i += 1


the alternative is using the itertools library, which contains the imap() and ifilter() methods, which generate new iterators with the function applied.

h = itertools.imap(double, generate_forever())
for i in range(10):
print h.next()


We can see generators as lazily evaluated functions, which is the default behavior in languages like Haskell.

### Partial function application

One very useful functional paradigm is partial application. It’s available in the functools module. For example, we can re-write our double() (from map_generator.py) function as follows:

# partial.py
import operator
import functools

double = functools.partial(operator.mul, 2)


Here, operator.mul is just the function version for the * operator, that is,

def mul(a, b):
return a*b


The code partial.py will return another function where the first argument to the operator.mul function is replaced by 2. The nice thing about the partial() function is that it accepts keyword arguments:

def f(a, b, c, d):
return [a, b, c, d]
g = functools.partial(f, b=2, c=3)
print g(a=1, d=4)


### Conclusion

In this second post we covered functions in Python. We learned about namespaces, scopes and variable number of arguments. We also learned about generators and increased our knowledge on iterators by introducing the itertools module. Finally, we saw that Python can work with higher-order functions by using libraries like functools.

### References

[1] The Python Tutorial: More Control Flow Tools
[2] The Python HOWTOs: Functional Programming HOWTO