Python Coroutines

In this post we’ll cover Python coroutines. First we’ll study David Beazley’s great tutorial [1] to learn about generators and coroutines, and how to build a multitask system from the ground-up.


Goddess Durga – source: Flickr

We’ll then take a look at a Python module called asyncio which makes use of coroutines to enable asynchronous I/O.

Generators

Python generators are functions that can return ("yield") execution back to the caller and, when called again, resume from where they left off. It's the same concept as JavaScript generators, which we talked about before.

The syntax for yielding the execution back is via the yield keyword. Here’s an example of a custom implementation of range():
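
A minimal sketch of what such a range()-like generator could look like (my own version, not the original snippet):

def my_range(n):
    i = 0
    while i < n:
        yield i   # hands execution back to the caller; resumes here on the next call
        i += 1

for x in my_range(3):
    print(x)      # 0, 1, 2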

Because the execution can be yielded at any time in the code, we can use infinite loops:
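
For example, a generator for the infinite sequence of natural numbers (again, my own sketch):

def naturals():
    i = 0
    while True:      # an infinite loop is fine: values are only produced on demand
        yield i
        i += 1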

Another way to use generators is to “pipe” (in the Linux way), one generator into another, like in the example where we read lines from a dictionary file and find those ending in “ion”:
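
A possible version of that pipeline, assuming a words file like /usr/share/dict/words (the file path and function names are my own):

def get_words(filename):
    with open(filename) as f:
        for line in f:
            yield line.strip()

def ending_in(words, suffix):
    for word in words:
        if word.endswith(suffix):
            yield word

for word in ending_in(get_words('/usr/share/dict/words'), 'ion'):
    print(word)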

This also highlights one advantage of the generator abstraction: a lower memory footprint, since it can process one line at a time. Sure, we could achieve the same with regular functions, but they would need to be combined in a single block:
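
Something along these lines (a sketch of the non-modular equivalent):

def print_words_ending_in(filename, suffix):
    with open(filename) as f:
        for line in f:
            word = line.strip()
            if word.endswith(suffix):
                print(word)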

This is efficient but doesn't allow for modularity.

Combined with the fact that generators can be infinite loops, one can model functionality like tail and grep as generators (e.g. tail -f | grep foo) which is the exact example Beazley provides in his presentation [1].

Coroutines

PEP 342 explains the motivation behind coroutines:

Coroutines are a natural way of expressing many algorithms, such as simulations, games, asynchronous I/O, and other forms of event-driven programming or co-operative multitasking. Python’s generator functions are almost coroutines — but not quite — in that they allow pausing execution to produce a value, but do not provide for values or exceptions to be passed in when execution resumes. They also do not allow execution to be paused within the try portion of try/finally blocks, and therefore make it difficult for an aborted coroutine to clean up after itself.

There are four main changes:

  • yield is an expression instead of a statement, which means you can assign the value produced by yield to a variable.
  • send() method, which can be used to send a value to the yield expression. send() is a generalized version of next(): calling next() is equivalent to calling .send(None).
  • throw() method which raises an exception at the place where the last yield was called.
  • close() method which raises a GeneratorExit at the place where the last yield was called. Effectively used to force terminating the generator.

Here’s a simple coroutine that just prints what it receives:
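
A minimal sketch of such a coroutine (not the tutorial's exact code):

def printer():
    while True:
        received = yield        # execution pauses here until a send()
        print('got:', received)

p = printer()
p.send(None)                    # "initialize" the coroutine (same as next(p))
p.send('hello')                 # prints: got: hello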

Notice that we have to call an empty send() to “initialize” the coroutine. According to PEP 342:

Because generator-iterators begin execution at the top of the generator’s function body, there is no yield expression to receive a value when the generator has just been created. Therefore, calling send() with a non-None argument is prohibited when the generator iterator has just started (…). Thus, before you can communicate with a coroutine you must first call next() or send(None) to advance its execution to the first yield expression.

In the example below we highlight how this flow can be confusing: the first send() only receives the value produced by the first yield; a value can only be delivered into the coroutine by a subsequent send().
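
A sketch illustrating this (the values and names are mine):

def echo():
    received = yield 1   # (a) produces 1; later receives 'x' here
    print('received:', received)
    yield 2              # (b) produces 2

e = echo()
print(e.send(None))      # prints 1: runs up to (a), nothing is received yet
print(e.send('x'))       # prints "received: x" and then 2: 'x' goes into (a)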

Since the first send() of a coroutine is almost always used just to initialize it, Beazley provides a decorator to abstract that away:
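
The decorator is along these lines (a sketch of the idea):

def coroutine(func):
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        cr.send(None)            # advance to the first yield automatically
        return cr
    return start

@coroutine
def printer():
    while True:
        print('got:', (yield))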

Piping

We can chain multiple coroutines the same way we did with generators, but now the order is somewhat reversed. To see why, consider our generator pipe example rewritten with coroutines. Here we pass print_step() to filter_step() and filter_step() to get_words(). Contrast this with the generator version, where we start with get_words(), pass its results to filter_step() and then to print_step().
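
A sketch of the coroutine version of the pipeline (reusing the hypothetical @coroutine decorator above; get_words() now pushes data instead of being pulled from):

@coroutine
def print_step():
    while True:
        word = yield
        print(word)

@coroutine
def filter_step(suffix, target):
    while True:
        word = yield
        if word.endswith(suffix):
            target.send(word)

def get_words(filename, target):
    with open(filename) as f:
        for line in f:
            target.send(line.strip())

get_words('/usr/share/dict/words', filter_step('ion', print_step()))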

Event Handling

Beazley provides an example using an XML parser to manipulate XML and convert to JSON. The API of xml.sax relies on the visitor pattern, where at important steps of the AST traversal (e.g. startElement, character, endElement) it calls specific methods of a content handler.

The example uses an adapter layer (cosax.py) to allow using coroutines instead of a content handler. It also combines multiple smaller coroutines to produce more complex ones, demonstrating the composability of coroutines.

Coroutines as tasks

After some preliminary overview of Operating Systems Beazley builds a prototype of a multitasking system by modeling coroutines as tasks (no threads or processes).

  • Task – wrapper around coroutine
  • Scheduler – can create new tasks and schedule tasks – it keeps a loop running as long as there are tasks, and existing tasks can cause new tasks to be scheduled, much like the JavaScript event loop.

Yielding inside a task is like an OS interruption (trap).
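
A stripped-down sketch of these two pieces (Beazley's real classes carry more functionality):

from collections import deque

class Task:
    taskid = 0
    def __init__(self, target):
        Task.taskid += 1
        self.tid = Task.taskid
        self.target = target       # the wrapped coroutine
        self.sendval = None

    def run(self):
        # resume the coroutine until its next yield (the "trap")
        return self.target.send(self.sendval)

class Scheduler:
    def __init__(self):
        self.ready = deque()

    def new(self, target):
        self.ready.append(Task(target))

    def mainloop(self):
        while self.ready:
            task = self.ready.popleft()
            try:
                task.run()
            except StopIteration:
                continue               # the task finished
            self.ready.append(task)    # otherwise, reschedule it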

System Calls

One of the interesting parts of the multitasking system implementation is system calls. The task/coroutine issues a system call by providing it as a value to the yield (e.g. yield GetTid()). This is received by the scheduler, which can provide the necessary context to the specific system call implementation.
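
A sketch of what a GetTid system call might look like (assuming the scheduler injects the current task and itself into the call before invoking handle()):

class SystemCall:
    def handle(self):
        pass

class GetTid(SystemCall):
    def handle(self):
        # self.task and self.sched are set by the scheduler
        self.task.sendval = self.task.tid   # becomes the value returned by the yield
        self.sched.schedule(self.task)      # put the task back in the ready queue

# in the scheduler's main loop, something like:
#     result = task.run()
#     if isinstance(result, SystemCall):
#         result.task = task
#         result.sched = self
#         result.handle()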

The tutorial then covers one of the most important parts of the multitasking system which is the ability to do asynchronous I/O operations (e.g. reading and writing to a file).

The idea is to introduce a "system call" corresponding to performing I/O. When a task invokes it, it doesn't get rescheduled but is put in a staging area (self.read_waiting in the code below). The scheduler then has a continuously running task that polls the OS to check whether any of the file descriptors corresponding to the tasks in the staging area are ready, in which case the corresponding task is rescheduled.
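
A sketch of what this could look like, using select() for the polling (the names ReadWait and read_waiting follow the description above; the details are my own):

import select

class ReadWait(SystemCall):          # SystemCall as in the earlier sketch
    def __init__(self, f):
        self.f = f
    def handle(self):
        fd = self.f.fileno()
        # park the task until fd is readable; it is not rescheduled yet
        self.sched.read_waiting[fd] = self.task

# a continuously scheduled task inside the Scheduler could poll with:
#     if self.read_waiting:
#         ready, _, _ = select.select(self.read_waiting, [], [], 0)
#         for fd in ready:
#             self.schedule(self.read_waiting.pop(fd))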

Beazley uses that system to implement a simple server that receives data via a socket:
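
Roughly, the server is a coroutine that yields I/O system calls. Beazley's version wraps the socket operations in dedicated calls like Accept(); this sketch uses the hypothetical ReadWait directly:

import socket

def handle_client(client):
    while True:
        yield ReadWait(client)
        data = client.recv(65536)
        if not data:
            break
        client.send(data)
    client.close()

def server(port, sched):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(('', port))
    sock.listen(5)
    while True:
        yield ReadWait(sock)           # wait until there is a connection to accept
        client, addr = sock.accept()
        sched.new(handle_client(client))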

Trampolining

One of the limitations of the scheduler that is developed up to this point is that the yield has to happen at the top-level of coroutine, that is, it cannot invoke other functions that yield, which limits refactoring and modularization.

To overcome this limitation, the author proposes a technique called trampolining. Here’s an example where the execution of add() is done by the top level code, not by main():
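
A sketch of the trampolining pattern:

def add(x, y):
    yield x + y

def main():
    result = yield add(2, 2)   # yields the add() generator itself
    print(result)

# the top-level "trampoline" drives add(), not main():
m = main()
sub = m.send(None)             # sub is the add(2, 2) generator
value = sub.send(None)         # run add() to obtain 4
try:
    m.send(value)              # feed the result back into main(), which prints it
except StopIteration:
    pass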

We can do a similar thing in the multitasking system. Because the coroutines can recursively call other coroutines, we need to keep a callstack. We can do it inside Task by keeping an explicit stack:
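
A sketch of Task.run() with an explicit stack (simplified from the tutorial's version; SystemCall as in the earlier sketch):

import types

class Task:
    def __init__(self, target):
        self.target = target
        self.sendval = None
        self.stack = []

    def run(self):
        while True:
            try:
                result = self.target.send(self.sendval)
                if isinstance(result, SystemCall):
                    return result
                if isinstance(result, types.GeneratorType):
                    # a sub-coroutine was called: push the caller and descend
                    self.stack.append(self.target)
                    self.sendval = None
                    self.target = result
                else:
                    # a plain value was yielded: return it to the caller
                    if not self.stack:
                        return
                    self.sendval = result
                    self.target = self.stack.pop()
            except StopIteration:
                if not self.stack:
                    raise
                self.sendval = None
                self.target = self.stack.pop()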

In the example above, when a coroutine that is wrapped in Task makes another coroutine call, for example:

client, addr = yield Accept(sock)

Within Task, result will be a generator, so the current coroutine will be put on the stack and the execution will pass to that generator instead:

self.target = result

When that sub-coroutine yields the execution, the original caller is resumed:

self.target = self.stack.pop()

Note that this all happens transparently to the scheduler itself.

This concludes the study of the multitask system from the ground up.

Modern Asynchronous I/O

The tutorial we just studied is from 2009 and, while still relevant, Python has gone through many changes since then.

We’ll now cover some of the development since the generator-based coroutines were introduced and that culminated in a standard module for handling asynchronous I/O.

Delegating to subgenerators: yield from

PEP 380 introduced a new syntax for delegating the execution to another generator until that generator is finished. Simeon [5] provides an example where using yield from simplifies the code. In the code below generator() calls generator2() and generator3() in a loop to “exhaust” their execution.
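
The example is roughly the following (an adapted sketch, not Simeon's exact code):

def generator2():
    for i in range(10):
        yield i

def generator3():
    for j in range(10, 20):
        yield j

def generator():
    for i in generator2():
        yield i
    for j in generator3():
        yield j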

The same can be accomplished with yield from, which does that implicitly:
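
A sketch of the equivalent code:

def generator():
    yield from generator2()
    yield from generator3()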

Async / Await

PEP 492 introduces the async and await syntax. The motivation contains these two reasons, among others:

It is easy to confuse coroutines with regular generators, since they share the same syntax; this is especially true for new developers.

Whether or not a function is a coroutine is determined by a presence of yield or yield from statements in its body, which can lead to unobvious errors when such statements appear in or disappear from function body during refactoring

To address these issues, a function defined with async def is considered a native coroutine (as opposed to a generator-based coroutine):
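
A minimal example:

import asyncio

async def fetch_value():
    return 42

print(asyncio.run(fetch_value()))   # 42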

A native coroutine can await generator-based coroutines, in which case it has the same behavior as yield from. Borrowing from the example above:
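
A sketch of how this can be done (not the exact example from the post; the generator-based coroutine has to be marked as such, e.g. with types.coroutine):

import types

@types.coroutine
def gen_based():
    yield                          # a generator-based coroutine
    return 'done'

async def native():
    result = await gen_based()     # same effect as "yield from gen_based()"
    return result

# driving it by hand, without an event loop:
c = native()
c.send(None)                       # runs up to the yield inside gen_based()
try:
    c.send(None)                   # finishes; StopIteration carries the result
except StopIteration as e:
    print(e.value)                 # done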

The asyncio module

PEP 3156 describes a system to support asynchronous I/O, which is now known as the asyncio module. It proposes a coroutine-based multitasking system similar to the one Beazley describes. It has support for common I/O operations like those involving files, sockets, subprocesses, etc.

We won't dive into much detail in this post, but we can see parallels to the scheduler we studied previously. Here's an example where asyncio.run() schedules the coroutine main(), which in turn executes the sub-coroutine sleep() twice, one call after the other:
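
A version of that example (sleep() here is my own wrapper around asyncio.sleep() that prints before and after):

import asyncio

async def sleep(i):
    print('will sleep:', i)
    await asyncio.sleep(i)
    print('slept:', i)

async def main():
    await sleep(1)
    await sleep(2)

asyncio.run(main())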

This is not taking advantage of the non-blocking capabilities of asyncio. Why? Recall that await is equivalent to yield from, which causes the current coroutine to wait for the call to finish before it continues. If we run this code we get:

will sleep: 1
slept: 1
will sleep: 2
slept: 2

What we want is to schedule both sub-coroutines at the same time, and asyncio allows that via asyncio.gather():
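
For example (with the same sleep() wrapper as above):

import asyncio

async def main():
    await asyncio.gather(sleep(1), sleep(2))

asyncio.run(main())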

If we run this code we get:

will sleep: 1
will sleep: 2
slept: 1
slept: 2

This means the first sleep() executed but yielded execution back to the scheduler at the asyncio.sleep() call, at which point the second sleep() ran, printing "will sleep: 2".

Conclusion

I've used async/await and event loops in other languages such as Hack and JavaScript, but only recently ran into them in Python. I kept seeing mentions of coroutines and that led me to study them in more detail. Overall I feel like I learned a lot.

David Beazley's tutorial is really good. It's thorough and provides lots of analogies with operating systems concepts.

I also liked the presentation medium: the slides are self-contained (all the information is present as text) and he shows the whole code at first then repeats the code multiple times in subsequent slides, highlighting the important pieces in each of them. This is difficult to achieve in flat text like a blog post.

References

[1] David Beazley – A Curious Course on Coroutines and Concurrency
[2] PEP 342 – Coroutines via Enhanced Generators
[3] PEP 380 — Syntax for Delegating to a Subgenerator
[4] PEP 492 — Coroutines with async and await syntax
[5] Simeon Visser – Python 3: Using “yield from” in Generators – Part 1
[6] PEP 3156 — Asynchronous IO Support Rebooted: the “asyncio” Module

Python Type Hints

Jukka Lehtosalo is a Finnish Software Engineer at Dropbox. He developed an optional type system for Python, mypy, during his PhD thesis in Cambridge. After meeting in a Python conference, Guido van Rossum (creator of Python) invited Lehtosalo to join him at Dropbox. They started adopting mypy in real use cases during a Hackathon, which led to mypy being one of the most popular Python type checkers [2].

In this post we'll cover mypy in general terms, as well as many examples demonstrating the syntax and capabilities of this type checker.

Introduction

mypy is a static type checker. This means it does not run during the code execution, so it’s mostly useful during development, much like tests. It relies on manual annotations in the code called type hints, which identify the types of arguments, return types, internal variables and member variables.

One simple example of such annotations is for a function to compute the length of a string:
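
For example (an illustrative sketch):

def length(s: str) -> int:
    return len(s)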

We are indicating that the input argument s is a string and that the function returns an integer corresponding to its length. The main job of a type checker is to find inconsistencies between the annotations and the code (or the libraries we use). For example, here's an incorrect type annotation:
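
Something like:

def invalid_inc(n: int) -> str:
    return n + 1   # error: Incompatible return value type (got "int", expected "str")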

When both operands are integers, the operator + returns another integer, so n + 1 is an integer, which is a clear contradiction given we provided the return type as string.

Type annotations are free unit tests

This contrived example doesn't make the usefulness of a type checker obvious, but imagine a case where we implicitly make assumptions about the argument type, for example in a small program to compute how much an exchange house would give a user for their cash.
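
A hypothetical untyped version (the rate and names are made up):

def convert(amount, rate):
    return amount * rate

def handle_request(user_input):
    # user_input comes straight from a form; we assume it's a number
    print('You get:', convert(user_input, 3.98))

handle_request(100)     # works
handle_request('100')   # fails at runtime: can't multiply a str by a float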

If we don't perform input validation correctly, the code above might work for the tested scenarios but would fail in production the moment a user provides a string. Of course good QA will cover scenarios like these, but this is the sort of guarantee one gets for free by making the types explicit:
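
For example:

def convert(amount: float, rate: float) -> float:
    return amount * rate

def handle_request(user_input: str) -> None:
    # error: Argument 1 to "convert" has incompatible type "str"; expected "float"
    print('You get:', convert(user_input, 3.98))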

The previous code would now fail the type checker validation.

Type hints are optional and incremental

Most of the design of type annotations is well documented in PEP 484. The document claims:

Python will remain a dynamically typed language, and the authors have no desire to ever make type hints mandatory, even by convention.

This also seems to imply that Python type hints will always be partial / gradual, since if full typing is required, it will make transition from non-typed to fully typed codebases prohibitive. Also, there are concrete benefits even with partial typing.

A partial type system makes it optional to add type annotations to variables, instead of it being fully mandatory (like in Java or C++). The type checker then performs validation with whatever information it has at hand.

Incomplete typing can be dangerous if developers build trust in the type checker while it's only performing partial checks due to incomplete information. Let's consider an example:
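
A sketch of such a case (the function names match the discussion below):

def expects_string(s: str):      # note: the return type annotation is missing
    return s

def expects_int(n: int) -> int:
    return n + 1

def main() -> None:
    untyped = expects_string('a')
    expects_int(untyped)         # passes the type checker, fails at runtime

main()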

At first glance it seems like a well-typed program, and if we run the type checker it will pass. But if we run main(), we'll get a runtime error. The problem is that expects_string() is missing its return type, so in main() the type checker cannot infer the type of untyped and it doesn't perform any validation on the call to expects_int().

Local inference

The previous example also highlights an important aspect of the mypy type checker: it only performs inferences at the function boundary. In theory it should infer that expects_string returns str because we’re returning its argument and then infer that untyped is a string.

This is fine in this example but it could be very expensive to make these inferences, especially if we need to consider branching and recursive calls to other functions. In theory the type checker could go only a few levels deep in the function call but this would make the behavior of the type checker very hard to reason about.

For that reason, the type checker will only consider the signatures of the functions being called. For example, it knows expects_string() expects a string and declares no return type, so that (an unknown type) is what it will assign to untyped, no matter what happens inside expects_string().

Now that we know the basics of the type checker, let’s cover some of the syntax and more advanced typing that mypy supports.

Examples

Setup

Before we start, it’s useful to be able to test the snippets. To do so, copy the code into a file, say example.py and run this command in the terminal:

mypy example.py

which will print any type errors that exist in example.py. mypy can be installed via Python's packaging system, pip. Make sure to use Python 3:

pip3 install mypy

Primitive types

bool, int, str, float are the types one will most likely use in functions. As seen above, we can use these to type arguments and return types:
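
For example:

def celsius_to_fahrenheit(celsius: float) -> float:
    return celsius * 9 / 5 + 32

def repeat(message: str, times: int) -> str:
    return message * times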

Composed types

We’ll look into generics later, but it should be straightforward to understand the typing of composed types like lists, dictionaries and tuples:
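
For example:

from typing import Dict, List, Tuple

def count_words(words: List[str]) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def min_max(xs: List[int]) -> Tuple[int, int]:
    return (min(xs), max(xs))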

It’s worth noting that these types need to be explicitly imported from the typing module.

None vs. NoReturn

None is used to indicate that no return value is to be expected from a function.

NoReturn is to indicate the function should not return via the normal flow:
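
For example:

from typing import NoReturn

def log(message: str) -> None:
    print(message)                 # returns normally, just without a useful value

def fail(message: str) -> NoReturn:
    raise RuntimeError(message)    # never returns via the normal flow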

Local Variables

The example below demonstrates the syntax for typing local variables. In general typing local variables is not necessary since their type can often be inferred from the assignment / initialization.
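
For example:

def example() -> None:
    x: int = 10
    name: str = 'mypy'
    y = x + 1    # no annotation needed: inferred as int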

Optional and Union

Optional[T] is a generic type that indicates the type is either T or None. For example:
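
For example:

from typing import List, Optional

def find(xs: List[int], target: int) -> Optional[int]:
    for i, x in enumerate(xs):
        if x == target:
            return i
    return None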

Optional is a special case of a more general concept of Union of types:
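
For example:

from typing import Union

def normalize(key: Union[int, str]) -> str:
    if isinstance(key, int):
        return str(key)
    return key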

My personal take is that Union should be avoided (except for special cases like Optional) because it makes the code harder to deal with (having to handle multiple types), and it's often better modeled via inheritance (a base type representing the union).

Any vs. object

Any is equivalent to not providing the type annotation. On the other hand, object is the base of all types, so it would be more like a Union of all the types. A variable of type object can be assigned a value of any type, but the value of an object variable can only be assigned to other object variables. It’s possible to refine the type of a variable to a more specific one. See “Refining types”.
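
A small sketch of the asymmetry:

from typing import Any

anything: Any = 10
anything = 'now a string'   # fine: Any turns off checking

obj: object = 10            # fine: an int is an object
n: int = obj                # error: an object is not necessarily an int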

Classes

There are three main things we need to consider when annotating variables in a class context:

  • We can add types to the member variable (_n in this case).
  • We don’t type self: it’s assumed to be the type of the class where it’s defined.
  • The return type of __init__ is None
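
A minimal sketch illustrating these points:

class Counter:
    _n: int                          # member variable annotation

    def __init__(self, start: int) -> None:
        self._n = start              # self itself is not annotated

    def inc(self) -> None:
        self._n += 1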

Callables

Callables can be used to type higher order functions. Here’s an example where we pass a function (lambda) as argument to a map function:
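
For example (my own variant of such a map function):

from typing import Callable, List

def map_ints(f: Callable[[int], int], xs: List[int]) -> List[int]:
    return [f(x) for x in xs]

print(map_ints(lambda x: x * 2, [1, 2, 3]))   # [2, 4, 6]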

The type of the function is Callable. The first element, [int] in the example above, is a list of the types of the arguments. The second argument is the return type.

As another example, if we want to define a reduce function on strings, our callback has now type Callable[[str, str], str] because it takes 2 arguments.
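
A sketch:

from typing import Callable, List

def reduce_str(f: Callable[[str, str], str], xs: List[str], initial: str) -> str:
    acc = initial
    for x in xs:
        acc = f(acc, x)
    return acc

print(reduce_str(lambda a, b: a + b, ['a', 'b', 'c'], ''))   # abc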

Generics: Type Variables

Type variables allow us to add constraints to the types of one or more argument and/or return types so that they share the same type. For example, the function below is typed such that if we pass List[int] as argument, it will return Optional[int]:
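
A sketch of such a function:

from typing import List, Optional, TypeVar

T = TypeVar('T')

def first(xs: List[T]) -> Optional[T]:
    return xs[0] if xs else None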

Note that the string passed to the TypeVar() function must match the name of the variable it is assigned to. This is an inelegant syntax, but I imagine it's the result of working around syntax limitations of Python (and the difficulties in changing the core Python syntax for annotations).

We can use multiple TypeVars in a function:
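
For example:

from typing import Dict, List, TypeVar

K = TypeVar('K')
V = TypeVar('V')

def zip_to_dict(keys: List[K], values: List[V]) -> Dict[K, V]:
    return dict(zip(keys, values))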

Constraints. According to [3] it's also possible to limit the type variable to a specific set of types:

TypeVar supports constraining parametric types to a fixed set of possible types (note: those types cannot be parametrized by type variables).

It also notes:

There should be at least two constraints, if any; specifying a single constraint is disallowed.

Which makes sense, if we were to restrict a TypeVar to a single type we might as well use that type directly.

In the example below we allow Tmix to be bound to either int or str. Note this is different from Union[int, str]: with the Union each argument may independently be an int or a str, while with the constrained type variable all of them must have the same type in a given call. The third call to fmix() below would be valid for a Union.
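
A sketch of that example:

from typing import TypeVar

Tmix = TypeVar('Tmix', int, str)

def fmix(a: Tmix, b: Tmix) -> Tmix:
    return a

fmix(1, 2)        # ok: both arguments are int
fmix('a', 'b')    # ok: both arguments are str
fmix(1, 'b')      # error for Tmix, but it would be valid for Union[int, str]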

Parametrized Classes

We’ve just seen how to parametrize functions via TypeVar. We can also extend such functionality to classes via the Generic base class:
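
For example, a minimal generic stack:

from typing import Generic, List, TypeVar

T = TypeVar('T')

class Stack(Generic[T]):
    def __init__(self) -> None:
        self._items: List[T] = []

    def push(self, item: T) -> None:
        self._items.append(item)

    def pop(self) -> T:
        return self._items.pop()

s: Stack[int] = Stack()
s.push(1)
s.push('a')   # error: "str" is not compatible with the stack's type parameter "int"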

Ignoring type hints

During the transition from untyped to typed code, it might be necessary to temporarily turn off type checking in specific parts of the code. For example, imagine a scenario where types were added to a widely used function but many existing callers are passing incorrect types.

The comment # type: ignore makes the type checker skip the current line (if an inline comment) or the next line (if a line comment). Here’s an obvious type violation that is ignored by the type checker:
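
For example:

def inc(n: int) -> int:
    return n + 1

inc('a')  # type: ignore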

It’s also possible to turn off type-checking completely for the file:

A # type: ignore comment on a line by itself at the top of a file, before any docstrings, imports, or other executable code, silences all errors in the file

Refining types: isinstance() and cast()

In some cases we might receive values from untyped code or ambiguous types like Unions or object. Two ways of informing the type checker about the specific type are an explicit check via isinstance() or a cast(). The isinstance() check will usually be in an if clause:
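
For example:

from typing import Union

def inc(n: int) -> int:
    return n + 1

def handle(x: Union[int, str]) -> None:
    if isinstance(x, int):
        print(inc(x))       # x is narrowed to int here
    else:
        print(x.upper())    # and to str here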

This allows the type checker to infer the type of x within the if block, so it won't complain about the call to inc(x).

Another, more drastic, approach is to use cast([type], x), which performs no check at runtime (it simply returns x), but tells the type checker to treat x as having that type statically. Here's an example:
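
For example:

from typing import cast

def inc(n: int) -> int:
    return n + 1

def handle(x: object) -> None:
    n = cast(int, x)   # no runtime check: we are promising x is an int
    print(inc(n))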

It's a bummer that the argument orders of cast([type], var) and isinstance(var, [type]) are inconsistent.

Arbitrary argument lists

It’s possible to type arbitrary argument lists, both the positional and named ones. In the example below, args has type Tuple[str, ...] and kwds has type Dict[str, int]. Note that the ... in Tuple[str, ...] indicates an arbitrary-length tuple of strings. It’s unclear to me why it’s a tuple and not a list, but I’d guess it has to do with how it represents non-arbitrary argument lists (e.g. foo(x: int, y: str) would be Tuple[int, str]).
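
For example:

def foo(*args: str, **kwds: int) -> None:
    print(args)   # e.g. ('a', 'b')       - of type Tuple[str, ...]
    print(kwds)   # e.g. {'x': 1, 'y': 2} - of type Dict[str, int]

foo('a', 'b', x=1, y=2)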

Conclusion

I really like Python and it’s my go-to language for small examples and automation scripts. However, having had to work with a large code base before, I was really displeased by its lack of types.

I’m also a big proponent of typed code, especially for large, multi-person code bases, but I understand the appeal of rapid development for prototypes and Python now offers both.

As we saw in some examples, the syntax is cumbersome at times, but overall I find the mypy type checker pretty expressive.

References

[1] mypy: About
[2] Dropbox Blog: Thank you, Guido
[3] PEP 484 — Type Hints

Aho-Corasick

Alfred Aho is a Canadian computer scientist. He is famous for the book Principles of Compiler Design, co-authored with Jeffrey Ullman. Aho also developed a number of utilities for UNIX which have widespread usage, including the AWK programming language (an acronym for Aho, Weinberger and Kernighan) and variants of grep.

Unfortunately, I couldn't find much information about Margaret J. Corasick, except that she worked at Bell Labs and, together with Aho, developed the Aho-Corasick algorithm, which we'll talk about in this post.

Multiple-string search

We have studied before the problem of finding a string in a text, for which the Knuth-Morris-Pratt algorithm is well known.

Aho and Corasick studied the problem of finding a set of strings in a text. More precisely, given a body of text H and a set of search terms S, we want to find all occurrences of each term of S in H.

They came up with the Aho-Corasick algorithm which is able to solve this problem in linear time on the size of H and the size of the output (i.e. the total number of characters that need to be printed when a match is reported).

It consists in pre-computing a look-up structure from S and then scanning the text using it. Whenever it detects a match was found, it prints the corresponding matches. Let’s see first how to construct the structure.

The look-up structure

The look-up structure is constructed in two phases, each to construct two sets of edges: goto and fail edges.

The goto edges are essentially the edges of a trie containing the entries in S. The trie can be seen as a state machine where each node represents a prefix of one of the entries in S and the transition function tells us to which prefix to go next given an input character. These edges are thus called goto arrows. We denote by g(s, a) the node we navigate to if we’re at node s and receive input a.

Given a text H = [a1,a2,…,a_n], we can start at the root and follow the edges labeled a1, a2 and so on. If we land on a node representing an entry in S we print it.

Eventually though, we may reach a node s such that the next character a_j doesn’t have a corresponding edge g(s, a_j). We know that s is a suffix of [a1,a2,…,a_{j-1}], say s = [a_k,…,a_{j-1}]. We also know there’s no other unreported entry in S that starts at k but there might be one that starts at a_{k+1}. We could restart the search at k+1 and at the root of the tree, but can we do better?

Because S is fixed, no matter what text H we have, by the time we encounter a dead end at a node r, we know that the last characters seen are  [a_k,…,a_{j-1}] = r. Now suppose s = [a_{k+1},…,a_{j-1}, a_j] happens to be in the trie. Then we can continue our search from s, without having to backtrack. If k+1 doesn’t work, we can try k+2, k+3, and so forth. More generally, say that we eventually found the smallest index k* > k such that s* = [a_k*,…,a_{j-1}, a_j] is in the trie. In other words s* is the longest proper suffix of s in the trie. When we fail to find s = r + a_j in the trie, we can resume at s*. This suggests we could have a fail edge from s to s* in our trie to quickly resume the search without backtrack.

Technically, s doesn’t exist in the trie, so we can’t add an edge there. However, we can still add the failure edge in r, and denote it as f(r, a_j) = s* for all labels a_j not in a goto edge.

Given the goto and fail edges, we have a proper state machine which we can use to find matches in H. Let’s combine them into a single set of edges e(s, a) (that is e(s, a) = g(s, a) if it exists, otherwise f(s, a)).

We are still missing printing out the matches. Because of the shortcuts we take via the fail edges, we never explicitly visit a suffix of a state s. For this reason, when visiting a state s, we need to print all its suffixes that belong to S.
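
Putting it together, the scanning phase looks roughly like this (a sketch; e is the combined transition function as a dict keyed by (state, character), output maps a state to the keywords to report, and missing transitions are assumed to lead back to the root):

def search(text, e, output, root=0):
    state = root
    for j, a in enumerate(text):
        state = e.get((state, a), root)
        for keyword in output.get(state, []):
            print('match of', keyword, 'ending at position', j)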

We’ll now see how to construct the fail edges and the output.

Building the fail edges

We’ll show how to construct the failure edges efficiently. We define auxiliary edges ps(s) which represents the longest proper suffix of s that is in the trie. Note that f(r, a) is ps(r + a) for when g(r, a) doesn’t exist.

The idea is that if we know ps(r) for all nodes r of depth up to l in the trie, we can compute ps(r + a), that is, the values for the nodes in the next level.

Consider each node r of depth l and each symbol a. Let s = r + a. We know ps(r) is in the trie, but is ps(r) + a in the trie? Not necessarily, but maybe a suffix of ps(r) is? We can try ps(ps(r)) and further suffixes until we find one that is in the trie, or we reach the empty string.

How can we make sure to process a node s only after all the nodes from lower levels have been processed? We can process the nodes in breadth-first order, which can be achieved using a queue. The pseudo-code could be:
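
A sketch of this construction (assuming goto is a dict mapping a node to its {character: child} edges, and the root is node 0):

from collections import deque

def build_fail_edges(goto, root=0):
    ps = {root: root}
    queue = deque()
    for a, child in goto.get(root, {}).items():
        ps[child] = root               # depth-1 nodes fail to the root
        queue.append(child)

    while queue:
        r = queue.popleft()
        for a, s in goto.get(r, {}).items():
            queue.append(s)
            # jump through suffixes of r until one has an outgoing edge labeled a
            state = ps[r]
            while a not in goto.get(state, {}) and state != root:
                state = ps[state]
            ps[s] = goto.get(state, {}).get(a, root)
    return ps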

Note how instead of searching for the largest suffix by removing one character at a time and checking if it exists in the trie, we’re “jumping” to suffixes via repeated application of ps(), which is what makes the algorithm efficient as we shall see. You might be asking if we’re not missing any valid suffix when doing that, but no worries, we analyze the correctness of this in Lemma 1 (Appendix).

Building the output

If we compute output(s) in breadth-first order, we can assume we know output(r) for nodes r at lower levels than s. Assuming we have already computed ps(s), we have that output(s) = output(ps(s)) + {s} if s is in S:
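
In code this would be something like (is_keyword(s) tells whether the node s corresponds to an entry of S):

def build_output(nodes_in_bfs_order, ps, is_keyword):
    output = {}
    for s in nodes_in_bfs_order:
        output[s] = list(output.get(ps[s], []))
        if is_keyword(s):
            output[s].append(s)
    return output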

How do we know by jumping through suffixes via ps(s) we are not missing any matches? We demonstrate that in Lemma 2 (Appendix).

Implementation in Python

I’ve implemented the pseudo-code proposed here in Python, and made it available on Github. The main implementation aspect is how we compute the output list. To avoid storing the matches multiple times, we model it as a linked list. More precisely, when computing output(s) from ps(s), if ps(s) is itself a keyword, we point output(s) to ps(s). Otherwise we point it to output(ps(s)).

When printing output(s) we just need to traverse the list:

Conclusion

It was a lot of work to study Aho and Corasick's paper [1], understand the proof and re-write it in my own words, but this gave me a much better intuition behind the algorithm.

Like the KMP, I’ve known this algorithm before but never took the time to fully understand it.

My initial implementation was in Rust, but I got stuck when modeling the nodes because of self-references. This led me to this StackOverflow answer, which I haven't had time to study, but it seems like a good topic for a post.

Appendix

Correctness

Lemma 1. Let s be a prefix in the look-up structure. Then ps(s) is the longest proper suffix of s that exists in the look-up structure T.

Let's prove this by induction on the size of s. The base case is for nodes of level 1, where ps(s) = root (the empty string), which is true, since the only proper suffix of a single letter is the empty string.

Assuming that ps(s') is the longest proper suffix of s' that exists in the look-up structure for all |s'| < |s|, let's prove this is true for ps(s) as well. Let's show first that ps(s) is a proper suffix of s in T. We know that ps(r) is a suffix of r, and so is ps(ps(r)), since a suffix of a suffix of r is also a suffix of r, and so on. Let's define ps_k(r) = ps(ps_{k-1}(r)) and let ps_n(r) be the first iterate such that g(ps_n(r), a) exists. By construction we assigned ps(s) = g(ps_n(r), a). Since ps_n(r) is a suffix of r, and r + a = s, we know that ps_n(r) + a is a suffix of s. We'll now show there's no other proper suffix of s in the look-up structure larger than ps(s). Suppose there is an s* in T such that |s*| > |ps(s)|. Let r* be such that s* = r* + a. Note that r* is a suffix of r and it exists in T (since s* is in T and T is a prefix tree). It cannot be larger than ps(r) because of our induction hypothesis, and it has to be larger than ps_n(r), since |ps(s)| = |ps_n(r)| + 1, |s*| > |ps(s)| and |s*| = |r*| + 1.

The first observation is that r* != ps_k(r) for any k; otherwise the algorithm would have set ps(s) = g(r*, a) = s*. So there must be some k for which |ps_k(r)| > |r*| > |ps_{k+1}(r)|. This means that for r' = ps_k(r) there's a suffix r* such that |r*| > |ps(r')|. But this contradicts the induction hypothesis, which says that ps(r') is the largest suffix of r' in T.

Lemma 2. output(s) contains y if and only if y is in S and a suffix of s.

Let's prove one direction: if output(s) contains y then y is in S and is a suffix of s. This is easy to see by construction: output(s) contains s if it's in S, and also the outputs of ps(s), which is a suffix of s; by induction those are all in S and suffixes of ps(s), and hence of s.

Now suppose y is a keyword and a suffix of s. Let's show that there's a k for which ps_k(s) = y. If this was not the case, then by an argument similar to the one in Lemma 1, there would be some ps_{k-1}(s) such that y is a suffix of it and |y| > |ps_k(s)|, which violates Lemma 1, since ps(ps_{k-1}(s)) = ps_k(s) would not be the longest possible. Since ps_k(s) = y, by construction we know that output(y) will eventually be merged into output(s).

Lemma 3. At the end of each iteration of the loop in Algorithm 1, the state s is in the look-up structure T and it is the longest suffix of [a1,a2,…,aj] that is in T.

This is true for a1: we would be either in the state [a1], if it is in T, or in the empty state (the root); in both cases the state is the longest suffix of [a1] in T. Assuming the hypothesis holds for j-1, we were in a state r before consuming aj, and r is the longest suffix of [a1,a2,…,aj-1] in T.

By construction we know that the end state s is still in T, since we only followed valid edges of T (gotos and fails). We need to show s is the longest suffix of [a1,a2,…,aj] in T. For the sake of contradiction, assume there's an s* in T such that |s*| > |s| and s* is a suffix of [a1,a2,…,aj]. Let r* be such that s* = r* + aj. Note that r* is in T and is a suffix of [a1,a2,…,aj-1], so by the inductive hypothesis |r*| <= |r|. If we used a goto edge, then |s| = |r| + 1 >= |r*| + 1 = |s*|, which contradicts |s*| > |s|. If we used a fail edge, then s = f(r, aj); note that r* != r, otherwise g(r, aj) would exist and we would have used a goto edge. Using the same argument from Lemma 1, r* is a proper suffix of r and r* is not ps_k(r) for any k, so it must be that for some k, x = ps_k(r) satisfies |x| > |r*| > |ps(x)|, which is a contradiction because r* is a proper suffix of x but longer than ps(x).

Theorem 1. The Aho Corasick algorithm is correct.

We need to show that every substring of the text which is in S will be reported. Any substring we were to report ends at some index j of H, so it suffices to show that we report every suffix of [a1,a2,…,aj] that is in S, for all j.

Let s be the node we end up in at the end of each iteration of the loop in Algorithm 1. Suppose there's a suffix s* of [a1,a2,…,aj] in S that is not in output(s). From Lemma 2, we know that every keyword in S that is a suffix of s is in output(s); it follows that s* is not a suffix of s, so |s*| > |s|, which cannot be, by Lemma 3.

QED

Complexity

Theorem 2. The look-up structure can be constructed in time linear in the number of characters in S.

Let’s define size(S) = the total number of characters in S and A the set of symbols of the alphabet.

It's known that we can construct a trie from the set of strings S in linear time, so the goto edges are covered.

For constructing the ps() function, the cost is associated with how many times we execute the inner loop. We visit each node once and, for each of the |A| symbols, we perform the inner loop. Since |ps(s)| < |s|, the number of times we execute the inner-most loop for a given state is bounded by |s|, so for each state we execute the inner-most loop at most |s|*|A| times. If we sum over all states, we have size(S)*|A|. Assuming |A| is constant and small, we can ignore it in the overall complexity. It's worth noting that the bound in Aho and Corasick's original paper doesn't depend on |A|.

The cost of constructing the output() function is O(1) per node if we use a shared linked list. In that case output(s) can be constructed with a cell for s pointing to the list of output(ps(s)), so again this is bounded by size(S).

For the search algorithm, let’s ignore the cost of printing the output for now. We can follow an edge in O(1) time and do so exactly once per character.

In the worst case there could be a lot of matches at each position. For example, if S = {a, aa, aaa, …, a^(n/2)} and H = a^n, then for n/2 positions we'll need to output all entries in S, leading to a complexity of size(S)*n.

Since these values need to be printed, there’s no way to be faster than that and this could be the dominating factor in the total complexity. In practice however the number of matches is much smaller.

References

[1] Efficient String Matching: An Aid to Bibliographic Search
[2] Alfred Aho – Wikipedia

Numerical Representations as inspiration for Data Structures


In this chapter Okasaki describes a technique for developing efficient data structures through analogies with numerical representations, in particular the binary and its variations.

We’ve seen this pattern arise with Binomial Heaps in the past. Here the author presents the technique in its general form and applies it to another data structure, binary random access lists.

Binary Random Access Lists

These lists allow efficient insertion at / removal from the beginning, and also access and update at a particular index.

The simple version of this structure consists of distributing the elements among complete binary leaf trees. A complete binary leaf tree (CBLF) is one that stores elements only at the leaves, so a tree with height i has 2^(i+1)-1 nodes but only 2^i elements.

Consider an array of size n, and let Bn be the binary representation of n. If the i-th digit of Bn is 1, then we have a tree containing 2^i leaves. We then distribute the elements into these trees, starting with the least significant digit (i.e. the smallest tree) and traversing each tree in pre-order.

For example, an array with the elements (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) has 11 elements, which is 1011 in binary. So we have one tree with a single leaf (1), a tree with 2 leaves (2, 3) and another containing 8 leaves (4, 5, 6, 7, 8, 9, 10, 11).

We use a dense representation of the binary number, by having explicit elements for the 0 digit. Here’s a possible signature for this implementation:
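
A rough Python rendition of what such a signature could look like (just to make the dense-digit representation concrete; the names are mine and the method bodies are omitted):

from typing import Any, List, Optional

class BinaryRandomAccessList:
    # digits of the binary number, least significant first;
    # None plays the role of the digit 0, a complete binary leaf tree the digit 1
    def __init__(self) -> None:
        self.digits: List[Optional[Any]] = []

    def cons(self, x): ...        # insert at the beginning
    def head(self): ...           # first element
    def tail(self): ...           # remove the first element
    def lookup(self, i): ...      # element at index i
    def update(self, i, x): ...   # copy with the element at index i replaced by x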

Inserting a new element consists of adding a new tree with a single leaf. If a tree already exists for a given size, the two have to be merged into a tree at the next index. Merging two CBLFs of the same size is straightforward: we just need to make them children of a new root. Since elements are stored in pre-order, the tree being inserted (or coming from a carry-over) should be the left child.
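
A sketch of the merge (link) and insertion steps under that representation (my own Python rendition):

class Leaf:
    def __init__(self, value):
        self.value, self.size = value, 1

class Node:
    def __init__(self, left, right):
        self.left, self.right, self.size = left, right, left.size + right.size

def link(t1, t2):
    # make two equal-sized trees children of a new root; the carried tree goes left
    return Node(t1, t2)

def cons(digits, x):
    # digits: list of trees/None, least significant position first (None = digit 0)
    carry, result = Leaf(x), []
    for digit in digits:
        if carry is None or digit is None:
            result.append(digit if carry is None else carry)
            carry = None
        else:
            result.append(None)          # 1 + 1 = 0, carry to the next position
            carry = link(carry, digit)
    if carry is not None:
        result.append(carry)
    return result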

Looping back to our example, if we want to insert the element 100, we first insert a tree with a single leaf (100). Since the least significant digit already has an element, we need to merge them into a new tree containing (100, 1) and try to insert at the next position. A conflict will arise with (2, 3), so we again merge them into (100, 1, 2, 3) and try the next position. We finally succeed in inserting at position 2, for a new list containing trees like (100, 1, 2, 3) and (4, 5, 6, 7, 8, 9, 10, 11).

The complexity of inserting an element is O(log n) in the worst case, which requires merging trees for all digits (e.g. if Bn = 111...111). Merging two trees is O(1).

Removing the first element is analogous to decrementing the number, borrowing from the next digit if the current digit is 0.

Searching for an index consists of first finding the tree containing the index and then searching within the tree. More specifically, because the elements are stored in order from the smallest tree to the largest, we can find the right tree just by inspecting the number of elements in each tree until we find the one whose range includes the desired index. Within a tree, elements are stored in pre-order, so we can find the right index in O(height) of the tree.

After finding the right index, returning the element at that index is trivial. Updating the element at a given index requires rebuilding the tree when returning from the recursive calls.

Okasaki then proposes a few different numbering systems that allow to perform insertion/removal in O(1) time. Here we’ll only discuss the less obvious but more elegant one, using skew binary numbers.


Skew Binary Random Access Lists

A skew binary number representation supports the digits 0, 1 and 2.

The weight of the i-th digit is 2^(i+1) - 1. In its canonical form, it only allows the least significant non-zero digit to be 2.

Examples:


Decimal and Skewed Binary

It’s possible to show this number system offers a unique representation for decimal numbers. See the Appendix for a sketch of the proof and an algorithm for converting decimals to skewed binary numbers.

Incrementing a number follows these rules:

  • If there’s a digit 2 in the number, turn it into 0 and increment the next digit. By definition that is either 0 or 1, so we can safely increment it without having to continue carrying over.
  • Otherwise the least significant digit is either 0 or 1, and it can be incremented without carry overs.
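
A sketch of the increment, with the number stored as a list of digits, least significant first:

def increment(digits):
    if 2 in digits:
        i = digits.index(2)       # at most one 2, and it is the lowest non-zero digit
        digits[i] = 0
        if i + 1 < len(digits):
            digits[i + 1] += 1    # that digit was 0 or 1, so no further carry is needed
        else:
            digits.append(1)
    elif digits:
        digits[0] += 1            # 0 -> 1 or 1 -> 2, no carry
    else:
        digits.append(1)
    return digits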

The advantage of this number system is that increments (and similarly decrements) never carry over more than once, so the complexity is O(1), as opposed to the worst-case O(log n) for regular binary numbers.

A skew binary random access list can be implemented using this idea. We use a sparse representation (that is, not including 0s). Each digit 1 at position i corresponds to a tree with 2^(i+1) - 1 elements, in this case a complete binary tree with height i+1. A digit 2 is represented by two consecutive trees with the same weight.

Adding a new element to the beginning of the list is analogous to incrementing the number, which we saw can be done in O(1). Converting a digit 0 to 1, or a 1 to 2, is a matter of prepending a tree to the list. To convert a 2 to 0 and increment the next position, we need to merge the two trees representing the 2, together with the element to be inserted. Because each tree stores its elements in pre-order, we make the new element the root of the merged tree.

Elements are inserted in pre-order in each tree, so when searching for an index, we can first find the right tree by looking at the tree sizes, and within a tree we can do a "binary search" in O(height) of the tree.

Binomial Heaps

In this chapter, this technique is also applied to improve the worst case runtime of insertion of binomial heaps. The implementation, named Skewed Binomial Heap, is on github.

Conclusion

This chapter demonstrated that binary representations are a useful analogy to come up with data structures and algorithms, because they’re simple. This simplicity can lead to inefficient running times, though. Representations such as skewed binary numbers can improve the worst case of some operations with the trade-off of less intuitive and more complex implementations.

Appendix A – Proof

Sketch of the proof. First, it's obvious that two different decimals cannot map to the same skewed binary representation; otherwise the same representation with the same weights would evaluate to two different values. We need to show that two distinct skewed binary representations do not map to the same decimal.

Suppose it does, and let them be B1 and B2. Let k be the largest position where these number have a different digit. Without loss of generality, suppose that B1[k] > B2[k].

First case: suppose that B1[k] = 1, B2[k] = 0 and B2 doesn't have any digit 2. B1 is then at least M + 2^{k+1} - 1, while B2 is at most M + \sum_{i = 1}^{k} (2^{i} - 1), which is M + 2^{k + 1} - 2 - k (M corresponds to the total weight of the digits in positions > k). This implies that B2 < B1, a contradiction.

Second case: suppose that B1[k] = 1, but now B2 does have a digit 2, at some position j. It has to be that j < k. Since only zeros follow it, we can write B2's upper bound as

M + \sum_{i = j + 1}^{k} (2^{i} - 1) + 2^{j + 1} - 1

Since 2(2^{j + 1} - 1) < 2^{j + 2} - 1, we have

\sum_{i = j + 1}^{k} (2^{i} - 1) + 2^{j + 1} - 1 < \sum_{i = j + 2}^{k} (2^{i} - 1) + 2^{j + 2} - 1

We can continue this argument until we get that B2 is at most M + 2(2^{k} - 1), which is less than M + 2^{k + 1} - 1, that is, less than B1.

Third case. Finally, suppose we have B1' such that B1'[k] = 2, and B2'[k] = 1. We can subtract 2^{k+1} - 1 from both and reduce to the previous case. ▢

Appendix B – Conversion algorithm

Converting from a decimal representation to a binary one is straightforward, but it’s more involved to do so for skewed binary numbers.

Suppose we allow trailing zeros and have all the numbers with k-digits. For example, if k=2, we have 00, 01, 02, 10, 11, 12 and 20. We can construct the numbers with k+1-digits by either prefixing 0 or 1, and the additional 2 followed by k zeros. For k=3, we have 000, 001, 002, 010, 011, 012, 020, 100, 101, 102, 110, 111, 112, 120 and finally 200.

More generally, we can see there are 2^(k+1) - 1 numbers with k digits. We can construct the numbers with k+1 digits by prefixing each of them with 0 or 1 and then adding one extra number, which starts with 2 and is followed by k zeros, for a total of (2^(k+1) - 1) + (2^(k+1) - 1) + 1 = 2^(k+2) - 1. So the invariant holds by induction on k, and we can verify it for the base case, since for k = 1 we enumerated 3 numbers.

This gives us a method to construct the skewed number representation if we know the number of its digits say, k. If the number is the first 2^(k) - 1 numbers, that is, between 0 and 2^k - 2, we know it starts with a 0. If it’s the next 2^(k) - 1, that is, between 2^k - 1 and 2^(k+1) - 3, we know it starts with a 1. If it’s the next one, exactly 2^(k+1) - 2, we know it starts with a 2.

We can continue recursively, subtracting the corresponding weight for each digit, until k = 0. We can find out how many digits a number n has (excluding leading zeros) by finding the smallest k such that 2^(k+1) - 1 is greater than n. For example, for 8 the smallest such k is 3, since 2^(k+1) - 1 = 15, whereas for k = 2 we'd have 2^(k+1) - 1 = 7.

The Python code below uses these ideas to find the skewed binary number representation in O(log n):
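
A version of that idea (my own implementation of the recursive construction above, most significant digit first):

def to_skew_binary(n):
    # number of digits: smallest k such that 2^(k+1) - 1 > n
    k = 0
    while (1 << (k + 1)) - 1 <= n:
        k += 1

    digits = []
    for i in range(k, 0, -1):
        weight = (1 << i) - 1      # weight of the digit at position i - 1
        if n < weight:             # among the first 2^i - 1 numbers: leading 0
            digits.append(0)
        elif n < 2 * weight:       # among the next 2^i - 1 numbers: leading 1
            digits.append(1)
            n -= weight
        else:                      # the single remaining number: leading 2
            digits.append(2)
            n -= 2 * weight
    return digits

print(to_skew_binary(8))   # [1, 0, 1], i.e. 7 + 1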

One might ask: why not OCaml? My excuse is that I already have a number theory repository in Python, so it seemed like a nice addition. Converting this to functional code, in particular OCaml is easy.

This algorithm requires an additional O(log n) memory, while the conversion to a binary number can be done with constant extra memory. My intuition is that this is possible because the weights for the binary numbers are powers of the same number, 2^k, unlike the skewed numbers’ weights. Is it possible to work around this?

Tree Ring Matching using the KMP Algorithm

Disclaimer: trees and rings are concepts found in Mathematics and Computer Science, but in this post they refer to the biology concepts ;)

Biosphere 2

Earlier this year I went to Arizona and visited Biosphere 2. Its initial goal was to simulate a closed environment where a group of 6 humans should survive for 2 years without any exchange with the exterior (except sunlight, of course). It had different biomes, including a rain forest, an ocean and a desert.


Biosphere 2

The plan didn't go well because they had missed an interaction of the plants with one of the seasons, which caused the oxygen level to fall below the accepted level, so they had to get an external supply. The mission did last 2 years, but it was not a perfectly closed environment.

Later, another mission was attempted, but had to be abandoned after a few months. The project lost its funding and no more attempts were performed. Nowadays, the place is used by biologists to study environments under controlled conditions and also open to the public for tours (the Nautilus magazine recently wrote more about this).

The tour includes a small museum explaining some of the research; one project analyzes trees to understand the ecosystem of the past. According to one of the exhibits, tree rings can be used to determine the age of the tree and also something about the climate of each period, a process called dendrochronology. The more rings a tree has, the older it is, and the width of the rings varies with the climate during the period of growth.

A lot of the trees alive today are not very old. On the other hand, there are fossils of older trees. Since they possibly co-existed for a period of time, it's possible to combine the data from the rings of both trees by matching the outer rings of the fossil tree with the inner rings of the recent tree.


Tool to explain the matching system

The Longest Prefix-Suffix String Overlap problem

I thought about the tree ring matching for a while and realized this can be reduced to the problem of finding the longest overlap between the suffix and prefix of two strings. More formally, given strings A and B, find the longest suffix of A that is also a prefix of B.

This problem can be easily solved using a simple quadratic algorithm, which is probably fast enough for real-world instances, possibly in the order of thousands of rings. In Python, we can use slicing to do:

def longest_suffix_prefix_naive(suffix, prefix):
    minlen = min(len(prefix), len(suffix))
    max_match = 0
    for match_len in range(1, minlen + 1):
        if prefix[:match_len] == suffix[-match_len:]:
            max_match = match_len
    return max_match

In the code above, A is named suffix and B prefix.

But can we do better? This problem resembles the string search problem, where we are given a text T and a query string Q and want to find whether Q appears in T as a substring. One famous algorithm to solve this problem is KMP, named after its authors Knuth, Morris and Pratt (Morris came up with the algorithm independently from Knuth and Pratt), which is linear in the sizes of T and Q.

We’ll see how to use the ideas behind the KMP to solve our problem. So let’s start by reviewing the KMP algorithm.

The KMP Algorithm: A refresher

The most straightforward way to search for a query Q in a text is for every character in T, check if Q occurs in that position:

def substr(Q, T):

    tlen = len(T)
    qlen = len(Q)

    for i, _ in enumerate(T):
        match = 0
        # Searches for Q starting from position i in T
        while match < qlen and \
              i + match < tlen and \
              Q[match] == T[i + match]:
            match += 1

        if match == qlen:
            print 'found substring in position', i

Avoiding redundant searches

Imagine that our query string had no repeated characters. How could we improve the search?

Q = abcde
T = abcdfabcde
        ^ mismatch in position 4

If we were using our naive idea, after failing to find Q at position 4, we'd need to start the search over from T[1] = 'b'. But since all characters in Q are different, we know that there's no 'a' between positions 1 and 3, because those positions matched other characters of Q, so we can safely skip positions 1 to 3 and continue from position 4.

The only case we would need to look back would be if 'a' appeared between 1 and 3, which would also mean 'a' appeared more than once in Q. For example, we could have

Q = ababac
T = ababadac...
         ^ mismatch in position 5

When there is a mismatch at position 5, we need to go back to position 2, because 'ababac' could potentially start there. Let MM be the current mismatched string, 'ababad' above, and M the string we had right before the mismatch, i.e., MM without the last letter, in our case 'ababa'. Note that M is necessarily a prefix of Q (because MM is the first mismatch).

The string M = 'ababa' has the potential of being followed by Q because it has a long proper suffix that is also a prefix of Q, 'aba'. Why is it important that it be a proper suffix? We saw that M is a prefix of Q, hence the longest (not necessarily proper) suffix of M that is a prefix of Q would be M itself. We already searched starting at M and know it leads to a mismatch, so we're interested in something other than M itself.

Let’s define LSP(A, B) as being the length of the longest proper suffix of A that is a prefix of B. For example, LSP(M, Q) = len('aba') = 3. If we find a mismatch MM, we know we need to go back LSP(M, Q) positions in M to restart the search, but we also know that the first LSP(M, Q) positions will be a match in Q, so we can skip the first LSP(M, Q) letters from Q. In our previous example we have:

Q =   ababac
T = ababadac...
         ^ continue search from here
           (not from position 2)

We’ll have another mismatch but now MM = 'abad' and LSP(M, Q) = len('a') = 1. Using the same logic the next step will be:

Q =     ababac
T = ababadac...
         ^ continue search from here
           (not from position 2)

Now MM = 'ad' and LSP(M, Q) = 0. Now we shouldn’t go back any positions, but also won’t skip any characters from Q:

Q =      ababac
T = ababadac...
         ^ continue search from here

If we know LSP(M, Q) it’s easy to know how many characters to skip. We can simplify LSP(M, Q) by noting that M is a prefix of Q and can be unambiguously represented by its length, that is M = prefix(Q, len(M)). Let lsp(i) = LSP(prefix(Q, i), Q). Then LSP(M, Q) = lsp(len(M)). Since lsp() only depends on Q, so we can pre-compute lsp(i) for all i = 0 to len(Q) - 1.

Suppose we have lsp ready. The KMP algorithm would be:

def kmp(Q, T):

    lsp = precompute_lsp(Q)

    i = 0
    j = 0

    qlen = len(Q)
    tlen = len(T)

    while i < tlen and j < qlen:

        # case 1: characters match - advance in both T and Q
        if Q[j] == T[i]:
            i += 1
            j += 1
        # case 2: mismatch, but a prefix of Q may still end here - shift Q using lsp
        elif j > 0:
            j = lsp[j - 1]
        # case 3: mismatch and no partial match to reuse - advance in T
        else:
            i += 1

        # string was found
        if j == qlen:
            print i - j

We have to handle 3 cases inside the loop:

1: A match, in which case we keep searching both in T, Q.

2: A mismatch, but there’s a chance Q could appear inside the mismatched part, so we move Q the necessary amount, but leave T.

3: A mismatch, but there’s no chance Q could appear in the mismatched part, in which case we can advance T.

This algorithm is simple but it's not easy to see that it's linear. The argument behind it is very clever. First, observe that lsp[j - 1] < j, because we only account for proper suffixes. It's clear that i is bounded by len(T), but it might not increase in all iterations of the while loop. The key observation is that (i - j) is also bounded by len(T). Now, in the first case, i increases while (i - j) remains constant. In the second, i remains constant but j decreases and hence (i - j) increases. In the third, i increases. So in every iteration either i or (i - j) increases, so at most 2*len(T) iterations can occur.

The pre-computed lsp

How can we pre-compute the lsp array? One simple idea is to, for every suffix of Q, find the maximum prefix it forms with Q. In the following code, i denotes the start of the suffix and j the start of the prefix.

def precompute_lsp_naive(Q):
    qlen = len(Q)
    lsp = [0]*qlen

    i = 1
    while i < qlen :

        j = 0
        while (i + j < qlen and Q[j] == Q[i + j]):
            lsp[i + j] = max(lsp[i + j], j + 1)
            j += 1

        i += 1
    return lsp

The problem is that this code is O(len(Q)^2). In practice len(Q) should be much smaller than len(T), but we can do it in O(len(Q)) using an idea similar to the main idea of the KMP, this time searching for Q within Q itself.


The main difference now is that we'll compute lsp on the fly. While searching for Q in Q, when we have a matching prefix, we update the lsp. When we have a mismatch, instead of re-setting the search to the beginning of Q, we skip some characters as given by the lsp. This works because j < i, and lsp[k] is already defined for k < i. In the following code, the only changes we performed were adding the lsp assignment, starting i at 1, and having T = Q.

def precompute_lsp(Q):

    qlen = len(Q)

    lsp = [0]*qlen

    i = 1
    j = 0
    while i < qlen:

        if Q[j] == Q[i]:
            i += 1
            j += 1
            lsp[i - 1] = j      # the "lsp attribution"
        elif j > 0:
            j = lsp[j - 1]
        else:
            i += 1

    return lsp

We can use exactly the same arguments from the main code of KMP to prove this routine is O(len(Q)). The total complexity of KMP is O(len(Q) + len(T)).

Solving our problem

At this point we can observe that the problem we were initially trying to solve, finding LSP(A, B), is at the core of the KMP algorithm. We can search for B, the prefix, in A, the suffix. If we keep track of the indices i and j, we just need to find when i equals the length of T, which means there was a (partial) match of B at the end of A.

We can generalize our kmp function to accept a callback that yields the indices of the strings at any step:

def kmp_generic(Q, T, yield_indices):

    lsp = precompute_lsp(Q)

    i = 0
    j = 0

    qlen = len(Q)
    tlen = len(T)

    while i < tlen:

        if j < qlen and Q[j] == T[i]:
            i += 1
            j += 1
        elif j > 0:
            j = lsp[j - 1]
        else:
            i += 1

        yield_indices(i, j)

Now, if we want an array with the indices of all occurrences of Q in T, we can do:

def kmp_all_matches(Q, T):
    matches = []
    qlen = len(Q)
    def match_accumulator(i, j):
        if j == qlen:
            matches.append(i - j)
    kmp_generic(Q, T, match_accumulator)
    return matches

Here we use a nested function as our callback. We can also use kmp_generic to solve our original problem:

def longest_suffix_prefix(suffix, prefix):
    slen = len(suffix)
    max_match = [None]
    def max_matcher(i, j):
        if max_match[0] is None and i == slen:
            max_match[0] = j

    # Search prefix in suffix
    kmp_generic(prefix, suffix, max_matcher)

    return 0 if max_match[0] is None else max_match[0]

Note that the nested function has read-only access to the variables in the scope of the outer function, like max_match and slen. To be able to share data with the nested function, we have to work with references, so we define max_match as a single-element array.

Conclusion

I used to participate in programming competitions. One of the most interesting parts of the problems are their statements. It’s usually an interesting random story from which you have to extract the actual computer science problem. A lot of the times though, the problem writer thinks about a problem and then how to create a story around that problem. It’s nice when we can find real-world problems that can be modelled as classic computer science problems.

I’ve used the KMP several times in the past but never took the time to study it carefully until now, which only happened because I wrote it down as a post.

References

[1] Wikipedia – Knuth–Morris–Pratt algorithm
[2] Annual Growth Rings

Revisiting Python: Object Oriented Programming

This is the third post in the series on revisiting Python. In the previous post we talked about Python functions.


We'll talk about classes in Python 2.7. Following the premise of the previous posts, this is not intended to be an introduction to object-oriented programming, but rather a look at how we can use this paradigm in Python.

We’ll first cover basic concepts like classes and inheritance. Later we’ll discuss some more advanced features from the Python language that allows us extending the set of built-in patterns provided by the language.

Classes

Classes are the building blocks of object oriented programming. Python classes look similar to those in languages like C++ and Java, but there are some important differences, which we’ll comment on in this section.

Classes are objects. When a class definition is executed, a class object is created and assigned to a name corresponding to the class name. For example, in the code below, an object representing the class is assigned to the name MyClass in the current scope.

class MyClass:
    print 'hi'

print MyClass # __main__.MyClass
x = MyClass
print x       # __main__.MyClass

If we run the code above it will print ‘hi’. We can manipulate MyClass as a regular variable, assigning it to other variables, passing it as parameter to functions, etc.

Instantiating. We can create an instance of a given class by using a function call notation, that is, MyClass().

This will also call the method __init__(), which can be used to initialize the instance, like a constructor. Functions defined within a class become methods of its instances. Methods can be called using the syntax instance.method(). For example:

class MyClass:
    def myMethod(self):
        print(self)

instance = MyClass()

instance.myMethod() # Prints <__main__.MyClass instance at 0x10891f290>

When invoking a function using instance.method(), instance is bound to the first parameter of method(). We usually name this first parameter self, but it’s just a convention.

Class vs. instance members. Note that variables defined at the class level belong to the class object, not to a specific instance object. Making an analogy to C++, this is like a static member variable.

class MyClass:
    x = 10

inst = MyClass()
MyClass.x = 20 # modifies the class attribute, shared by all instances
inst2 = MyClass()
print inst2.x # 20

To make a variable instance specific, we need to make use of the instance passed as the first parameter to the methods.

class MyClass:
    def __init__(self):
        self.x = 10

inst = MyClass()
inst.x = 20
inst2 = MyClass()
print inst2.x # 10

All methods and variables defined at the class level are stored internally in the __dict__ variable.

class MyClass:
    x = 10
    def f():
        pass

print MyClass.__dict__
# {'x': 10, '__module__': '__main__', '__doc__': None, 'f': <function f at 0x109f32d70>}

Methods. In Python, methods are essentially “static” functions, but by default they require an instance of the class to be bound to the first parameter. As we saw above, this happens automatically when we use the instance.method() syntax. Alternatively, we can explicitly pass the instance, using the SomeClass.method(instance) syntax. To illustrate that, imagine our class has a method printX(). We can invoke it either from the instance object or from the class object, passing the instance:

class MyClass:
    def __init__(self):
        self.x = 10

    def printX(self):
        print self.x

inst = MyClass()

# Both are equivalent:
inst.printX() 
MyClass.printX(inst)

Methods are essentially functions assigned to class member variables. To make the point clear, we can rewrite the previous snippet replacing the inline definition of printX() by an external function:

def externalPrintX(self):
    print self.x

class MyClass:
    def __init__(self):
        self.x = 10

    printX = externalPrintX

Note that the external function still needs to take an instance of MyClass as its first argument.

Given that all members are public and instance attributes take precedence over class attributes, it can cause some confusion if we assign functions to instance members:

def externalPrintX():
    print 'external hello world'

class MyClass:
    def __init__(self):
        self.x = 10
        self.printX = externalPrintX

    def printX(self):
        print 'hello world'

inst = MyClass()
inst.printX()

Here, the printX() method got replaced by another function, externalPrintX(). Note that, differently from regular methods, the instance is not bound to the first argument when invoking inst.printX(), so this prints 'external hello world'.

Static Methods. Since we’re required to provide an instance as the first argument, methods are not truly static in the C++ sense. It’s possible to remove this requirement by using the staticmethod() function. To illustrate, here’s an example:

class ClassWithStaticMethod():
    def regularMethod(x):
        print 'This is not bound to any instance:', x

    staticMethod = staticmethod(regularMethod)

x = ClassWithStaticMethod()
x.staticMethod(10) # This is not bound to any instance: 10

Class Methods. are different from regular methods because they always receive the class object instead of the instance object as the first argument. More specifically, when called from a class object, the class is bound automatically as the first parameter, and when called from an instance object, the instance's class is bound instead. For example,

class ClassWithClassMethod():
    def regularMethod(type):
        print type
    classMethod = classmethod(regularMethod)

ClassWithClassMethod.classMethod() # ClassWithClassMethod

x = ClassWithClassMethod()
x.classMethod()                    # ClassWithClassMethod

Inheritance

The syntax for inheritance is class Derived(Base). Methods defined in the base classes can be overridden by derived classes. If a name (variable or method) is referenced from an instance, it's searched starting from the instance's class, going up the inheritance chain, until the definition is first found.

# animal.py
class Animal:
    def breath(self):
        return 'breathing air'
    def eat(self):
        return 'eating'

class Dog(Animal):
    def eat(self):
        return Animal.eat(self) + ' meat'

dog = Dog()
print dog.breath()  # breathing air
print dog.eat()     # eating meat

In this example, eat() is first found at Dog, while breath() is only found at Animal. Note that we can refer to a specific implementation by providing the class explicitly, like Animal.eat(self).

Old-style vs. new-style classes. In a nutshell, the difference between old-style and new-style classes is that the latter descend from object. Thus, user defined classes are old-style by default (old-style classes were removed in Python 3).

According to the docs, the motivation for creating the new-style classes is to unify Python types and user defined classes. In this model, classes are able to derive from built-in types [2].
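
For instance, here's a minimal sketch (LoggingList is a made-up class) of a user defined class deriving from the built-in list type, which is only possible with new-style classes:

class LoggingList(list):
    def append(self, item):
        print 'appending', item
        list.append(self, item)

l = LoggingList()
l.append(3)  # appending 3
print l      # [3]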

Multiple-inheritance. is supported in Python. The syntax is a comma separated list of base classes:

class DerivedClassName(Base1, Base2, Base3):
   ...

Method resolution order (MRO). We just saw that for simple inheritance chains, the natural lookup order for methods is straightforward. For multiple-inheritance it can be complicated mainly because of the diamond pattern.

For old-style classes, the method lookup order, MRO, is defined by a “depth-first search”. It first looks in the entire inheritance chain of Base1, then Base2, and so on. The MRO is always well defined as long as the inheritance graph is a DAG (directed acyclic graph).

The problem is that it’s not intuitive. Depending on the order a given class appears in a descendant, its own MRO can vary. For example, consider the following complex dependency:

class A:
    def name(self):
        return 'a'

class B:
    def name(self):
        return 'b'

class C(B, A):
    def composed_name(self):
        return 'c ' + self.name()

class D(A, C): pass

d = D()
print d.composed_name()

When we call composed_name() on the instance of class D, the name resolution only finds it in class C. This class, in turn, needs to resolve the name() method. Should we get it from class B, since it's listed first in C‘s inheritance list?

That's not what happens: name() is resolved from D's point of view, and in D's lookup order A appears before B, so it returns 'a'. This behavior is confusing and error-prone.

In new-style classes, the diamond pattern exists by default whenever multiple inheritance is used, since everything ends up deriving from object. But the new style is also stricter about which hierarchies are allowed: it uses an interesting algorithm, C3, described in [3], to create a consistent ordering, or throws an exception in case such an ordering doesn't exist.
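
To illustrate (a small sketch reusing the classes from the previous example, now made new-style), the MRO is exposed via the __mro__ attribute, and Python refuses to create a class whose bases can't be ordered consistently:

class A(object): pass
class B(object): pass
class C(B, A): pass

print C.__mro__
# (<class '__main__.C'>, <class '__main__.B'>, <class '__main__.A'>, <type 'object'>)

try:
    class D(A, C): pass  # A before C contradicts C's own order (B, A)
except TypeError as e:
    print 'inconsistent hierarchy:', e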

super. In the animal.py example, we had to hard-code Animal.eat() to call Dog‘s parent class. The super() function takes a class object and an instance, and returns a proxy object on which we can call methods; they will be resolved based on the MRO of the instance's class, starting after the given class. More specifically, consider the following example:

class A(object):
    def method(self):
        print 'calling method from a'

class B(object):
    def method(self):
        print 'calling method from b'
    def anotherMethod(self):
        print 'calling another method from b'

class C(A, B):
    def method(self):
        print 'calling method from c'
    def anotherMethod(self):
        print 'calling another method from c'

    def run(self):
        proxy = super(C, self)
        proxy.method()          # calling method from a
        proxy.anotherMethod()   # calling another method from b

C().run()

In this example, super() returned an object, proxy, which can resolve method() and anotherMethod() properly. Since C derives from A first, it resolves it to A.method(). On the other hand, since anotherMethod() is only defined in B, it will resolve to B.anotherMethod().

Descriptors

So far, we discussed the basic components of object oriented programming. Now we’ll describe some advanced concepts that will help us better understand how some functions like staticmethod() work.

A class is said to be a descriptor if it implements any of __get__(), __set__() or __delete__(). In particular, if it implements only __get__(), it's called a non-data descriptor. If it also implements __set__() (and/or __delete__()), it's a data descriptor [4, 5].

If another class contains class members that are descriptors, __get__() and __set__() will be executed when we do reads and assignments, respectively. For a concrete example, consider the following descriptor:

class Descriptor(object):
    def __get__(self, obj, objtype):
        print 'getting x'
        return self.x

    def __set__(self, obj, val):
        print 'setting x to ' + str(val)
        self.x = val

It stores a value internally at the variable x, but includes some logging when setting or getting this value. Now, suppose we have a class that has a member that is a descriptor:

class ClassWithDescriptors(object):
    member = Descriptor()

x = ClassWithDescriptors()
x.member = 20
print x.member

When we try to assign a value to x.member, the method __set__() from Descriptor is called instead. The same applies when we read from x.member. It's important that member is a class member, not an instance member. If it were an instance member, as below, the logging messages would not be printed:

class ClassWithDescriptors(object):
    def __init__(self):
        self.instance_member = Descriptor()

x = ClassWithDescriptors()

x.instance_member = 20   # This is just overriding instance_member with an integer
print x.instance_member  

staticmethod() and classmethod(). One interesting use case for descriptors is implementing the staticmethod() and classmethod() “functions”. We could write the following descriptors for each of them, respectively:

class StaticMethod(object):
    def __init__(self, f):
        self.f = f

    def __get__(self, obj, objtype=None):
        return self.f

class ClassMethod(object):
    def __init__(self, f):
        self.f = f

    def __get__(self, obj, objtype=None):
        if objtype is None:
            objtype = type(obj)
        def newfunc(*args):
            return self.f(objtype, *args)
        return newfunc
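
As a quick sanity check, here's a hedged usage sketch of these classes (MyClass, helper() and which() are made-up names); assigning the wrapped functions as class members mimics the behavior of the built-ins:

class MyClass(object):
    def helper(x):
        print 'static:', x
    helper = StaticMethod(helper)

    def which(cls):
        print 'class:', cls.__name__
    which = ClassMethod(which)

MyClass.helper(42)  # static: 42
MyClass().which()   # class: MyClass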

Functors

Python has the concept of function objects or functors. A functor is an object that can be invoked as a function, as long as its class defines the __call__() method. A simple example is:

class Multiplier:
    def __init__(self, factor):
        self._factor = factor
    def __call__(self, a):
        return self._factor * a

double = Multiplier(2)
triple = Multiplier(3)
print double(5) # Prints 10
print triple(7) # Prints 21

One advantage of functors over regular functions is the capability of carrying context.

Decorators

Decorator is a design pattern in which we can modify the behavior of an object without modifying other objects of the same class. It's also known as a wrapper, because we wrap the object inside another object that adds the desired behavior.

We’ll now describe a motivation for using decorators and how Python offers syntactic sugar for this pattern using annotations. For our example, imagine we have a function that takes three arguments and returns their sum:

def add(a, b, c):
    return a + b + c

Now imagine we want to validate the parameters passed to add(). We could add that to the beginning of the function, but maybe we already have a function that does the validation for us. In this case, we could wrap the function in another one, which would first perform the parameter validation and then call the original function. The code below does that:

def validate(f):
    def closure(*args):
        for arg in args:
            assert isinstance(arg, int)
        return f(*args)
    return closure

validated_add = validate(add)

print validated_add(5, 2, 3)
# Throws an exception
print validated_add(5, 2, 3.0) 

validate() takes a function f() and wraps it inside closure(), which verifies all arguments are integers before invoking f(). So by passing add() to validate(), we’re getting a decorated function validated_add() which performs validation on the arguments.

Python offers syntactic sugar to do exactly that. By annotating a function with @something, we pass the function to a function named something(), which acts as a function transformation, and the transformed function is used in its place.

@validate
def add(a, b, c):
    return a + b + c

Imagine we want to customize the validate function to enable us to define the type of each element passed to the decorated function.

Annotations with parameters. We can do that by passing parameters to the decorator, using this syntax @something(arg1, arg2, ...). In this form, we actually construct a decorator builder, so we need to wrap our validate() function inside another one:

def validate(*arg_types):
    def validate_impl(f):
        def closure(*args):
            assert len(arg_types) == len(args), "Arguments and types don't match"
            for i in range(len(args)):
                assert isinstance(args[i], arg_types[i])
            return f(*args)
        return closure
    return validate_impl

With this new validate() function, we can update the annotation to our simple add() function:

@validate(int, int, int)
def add(a, b, c):
    return a + b + c

We can now use this as a very simple type validation. Imagine we have a function that takes a string and repeats it N times. We can enforce the types using the same annotation:

@validate(str, int)
def rep(s, n):
    return "".join([s]*n)

Decorators can also be function objects, so we can define a class that does work similar to what the validate() function does:

class Validator:
    def __init__(self, *arg_types):
        self._arg_types = arg_types

    def __call__(self, f):
        arg_types = self._arg_types
        def closure(*args):
            assert len(arg_types) == len(args), "Arguments and types don't match"
            for i in range(len(args)):
                assert isinstance(args[i], arg_types[i])
            return f(*args)
        return closure

In this case, the parameters defined in the annotation are passed to the constructor of the class, and the decorated function is passed to __call__().
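
A short usage sketch of the Validator class above, mirroring the earlier rep() example:

@Validator(str, int)
def rep(s, n):
    return "".join([s]*n)

print rep("ab", 3)  # ababab
# rep("ab", "3") would raise an AssertionError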

Finally, two very common annotations are staticmethod and classmethod. There is no magic here: they simply cause staticmethod() and classmethod() objects to be created, and we saw how these work in previous sections.

Conclusion

In this post we covered object oriented programming in Python. We learned details about classes, especially how they compare to classes in other languages. We then discussed inheritance, including multiple inheritance and old and new-style classes. We then delved into more advanced concepts related to object oriented programming. This included descriptors, functors and decorators.

Python requires relatively few special constructs to accomplish features from other object-oriented languages. Examples include the concept of super and static methods, which are implemented using general purpose features from Python like descriptors and decorators.

References

[1] The Python Tutorial – Classes
[2] Python.org Docs – New-style Classes
[3] The Python 2.3 Method Resolution Order
[4] Python Descriptors Demystified
[5] Python HOWTOs – Descriptor HowTo Guide

Revisiting Python: Functions

This is the second post in the series on revisiting Python. In the first post we discussed the motivation for these posts and started by revisiting Built-in Types.

python-logo

In this post, we’ll talk about functions in Python 2.7.

Functions

Functions in Python consist of the keyword def, a function name and a list of arguments:

def hello(arg1, arg2):
  print "hello world"

The pythonic way to document what a function does is through docstrings (blocks of strings delimited by triple double quotes):

def hello(arg1, arg2):
  """
    Prints hello world
  """
  print "hello world"

Some tools are able to parse these docstrings to automatically build documentation.
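
For instance, the docstring is available at runtime through the function's __doc__ attribute, which is what such tools read (a minimal sketch):

def hello():
    """Prints hello world."""
    print "hello world"

print hello.__doc__  # Prints hello world.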

Namespace. is a map from names (variable names, function names) to objects (their values). Examples of namespaces include the built-in namespace, the global namespace within a module and the local namespace within a function.

Scope. is a textual region of a Python program where a namespace is directly accessible.

For example, within a function we have a local scope. Variables coming from the arguments or created within the function itself belong to this local scope. Consider the following example:

d = 4
def foo(a):
    b = 1
    return abs(a - b)
foo(10)

In the code above, a and b are in the local scope, d is in the module's global scope and the function abs is in the built-in scope. When we refer to a variable or function name, Python searches the scopes in the following order:

1. Local scope
2. Local scope of the calling functions
3. Global scope within a module
4. Built-in scope

Item 2 requires some clarification. In Python we can define a function inside another function. For example,

def outer():
    scope_name = 'outer'
    def inner():
        print(scope_name)
    return inner

scope_name = 'global'
# Get the inner function
inner = outer()
inner() # outer

In the code above, scope_name is defined both in the global scope and in the outer() function's scope. When referenced within inner(), the outer scope takes precedence, so this will print 'outer'. Note how the scope is resolved statically, when defining inner(), not at runtime, which explains why the docs emphasize “textual region“. This is known as lexical scoping, as we discussed when studying the R language.

Variables that are not in the local scope are read-only by default. If we try to assign a new value to an existing variable in an outer scope, we are just creating a new local variable with the same name. Note that if the global variable is a reference to an object, we can modify the object, but we still can’t re-assign a value to that global variable.

x = [1, 2, 3]

def modify_global_variable_content():
    x.append(4) 

def reassign_global_variable():
    x = [10, 20, 30]

modify_global_variable_content()
reassign_global_variable()
print x

In the code above, modify_global_variable_content() modifies the content of x, but reassign_global_variable() just creates a local variable x. To actually reassign the global variable, we have to make it explicit using the global keyword.

def reassign_global_variable():
    global x
    x = [10, 20, 30]

Functions can be assigned to variables and passed around, for example:

def inc(x):
    return x + 1

def print_f(f, x):
    print f(x)

print_f(inc, 2)

Here we pass the function inc(), which then is assigned to the variable f in print_f(), and used to invoke the original function.

Default arguments. It's possible to define default arguments for functions, like in C++. The default value is evaluated only once and shared between subsequent calls, so for mutable objects this is what happens:

def append(x, xs = []):
    xs.append(x)
    return xs

print append(1) # [1]
print append(2) # [1, 2]
print append(3) # [1, 2, 3]
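
If we want a fresh list on every call, the usual idiom (a sketch, not from the original example) is to use None as a sentinel:

def append(x, xs=None):
    if xs is None:
        xs = []
    xs.append(x)
    return xs

print append(1) # [1]
print append(2) # [2]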

Keyword arguments. The default match between the arguments passed to a function and the arguments the function receives is positional, that is, the first value is assigned to the first parameter and so on. Besides that, we can make the argument names explicit.

def listify(a, b, c):
    return [a, b, c]
print listify(b = 2, c = 3, a = 1)

There are some constraints though: 1. Positional arguments must come before keyword arguments. 2. If we use a keyword for a given argument, we have to name all required arguments that appear after it in the function definition.

def listify(a, b, c):
    return [a, b, c]

# Error: Violates rule 1
print listify(c = 3, 1, 2)

# Error: Violates rule 2
print listify(2, 3, a = 1)

This feature is one of my favorite in Python, which I miss from many other languages. This is particular useful in the scenario where a function takes many default arguments, but we only want to provide the last one.
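
For example (connect() is a hypothetical function, just for illustration), we can skip all the intermediate defaults and name only the argument we care about:

def connect(host, port=80, timeout=10, retries=3, debug=False):
    return (host, port, timeout, retries, debug)

print connect('example.com', debug=True)
# ('example.com', 80, 10, 3, True)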

Var arg. Functions can define two different sets of variable arguments. An argument prefixed with * is assigned a tuple with all positional arguments passed beyond the required ones; an argument prefixed with ** is assigned a dictionary with all keyword arguments not among the required ones. A simple example:

def f(fixed, *arguments, **keywords):
    print fixed     # 1
    print arguments # (2, 3) 
    print keywords  # {a: 4, b: 5}

f(1, 2, 3, a = 4, b = 5)

Argument unpacking. With an analogous idea, it’s possible to call a function with a list of arguments using the * modifier. For example:

def f(a, b, c):
    print a, b, c
args = [1, 2, 3] 
f(*args)

Or a map of keyword arguments using the ** modifier:

def f(a, b, c):
    print a, b, c
args = {'b': 2, 'c': 3, 'a': 1}
f(**args)

Generators

In the previous post we saw the iterator type. Remember that to create a custom iterator, we had to define a class and implement the required methods (__iter__() and next()).

We can also instantiate a generator from a function, by using the yield keyword. For example,

def sample_generator(n):
    for i in range(n):
        yield i

The result of the function is an iterator. Every time we call next(), we execute the function until it reaches a yield statement, at which point it returns that value.

it = sample_generator(10)
it.next() # 0
it.next() # 1
...

Generators are particularly useful if the iterator has to work in sequential steps that perform different actions at each step. For example, consider the following iterator:

class MyIterator:
    def __init__(self, x):
        self.step = 0
        self.x = x

    def __iter__(self):
        return self

    def next(self):
        if self.step == 0:
            ret = self.x
        elif self.step == 1:
            ret = 2*self.x
        elif self.step == 2:
            ret = self.x*self.x
        else:
            raise StopIteration

        self.step += 1
        return ret

This is a toy example that runs different functions of x based on which stage it is on. Using generators, this would simply become:

def my_generator(x):
    yield x
    yield 2*x
    yield x*x

A generator can use the return statement (without value) to stop the iterator, which is equivalent to raise StopIteration.
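
For example (a small sketch, first_n() is a made-up name):

def first_n(xs, n):
    count = 0
    for x in xs:
        if count == n:
            return  # stops the generator, same as raise StopIteration
        yield x
        count += 1

print list(first_n([1, 2, 3, 4], 2))  # [1, 2]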

It’s possible to send values back to generators through the method send(). Consider the following example:

def generator_with_sends(x):
    while True:
        y = (yield x)
        if y is not None:
            x = y

it = generator_with_sends(5)
print it.next()   # 5
print it.send(10) # 10
print it.next()   # 10

Note that it.send(10) returns 10, not 5. This is because yield halts after the right-hand side of the statement is evaluated but before the assignment. Calling send() resumes from there, but this time the value of the (yield x) expression is the value passed to send(). The code then executes until the next yield, at which point x contains the value we sent. It's not allowed to call send() before the generator has yielded for the first time.

When a generator is garbage-collected, a GeneratorExit exception is raised inside it. It's possible to force that by calling the close() method.

def stubborn_generator():
    try:
        while True:
            yield 'no'
    except GeneratorExit:
        print 'generator stopped. clean up...'
        raise StopIteration
    finally:
        print 'can do some shared clean up here'

def scoped():
    s = stubborn_generator()
    print s.next()

scoped()

Map, filter, reduce

Python has built-in functions that work with sequences, map(), filter() and reduce() being the most common. These functions work with iterators too. For example:

# map_generator.py
def generate_ints(n):
    for i in range(n):
        yield i

def double(x):
    return x*2

print map(double, generate_ints(5)) 
# [0, 2, 4, 6, 8]

The problem is that map() has to convert the generator to a list, so it must evaluate the generator until it stops; thus we can't do this with an “infinite” iterator, like the one produced by this generator:

def generate_forever():
    i = 0
    while True:
        yield i
        i += 1

The alternative is using the itertools library, which contains the imap() and ifilter() functions, which return new iterators with the function applied.

import itertools

h = itertools.imap(double, generate_forever())
for i in range(10):
    print h.next()

We can see generators as lazily evaluated functions, which is the default behavior in languages like Haskell.

Partial function application

One very useful functional paradigm is partial application. It’s available in the functools module. For example, we can re-write our double() (from map_generator.py) function as follows:

# partial.py
import operator
import functools

double = functools.partial(operator.mul, 2)

Here, operator.mul is just the function version for the * operator, that is,

def mul(a, b):
    return a*b

The code in partial.py builds another function in which the first argument to operator.mul is fixed to 2. A nice thing about the partial() function is that it also accepts keyword arguments:

def f(a, b, c, d):
    return [a, b, c, d]
g = functools.partial(f, b=2, c=3)
print g(a=1, d=4)

Conclusion

In this second post we covered functions in Python. We learned about namespaces, scopes and variable number of arguments. We also learned about generators and increased our knowledge on iterators by introducing the itertools module. Finally, we saw that Python can work with higher-order functions by using libraries like functools.

References

[1] The Python Tutorial: More Control Flow Tools
[2] The Python HOWTOs: Functional Programming HOWTO

Revisiting Python: Basic Types

I’ve been using Python for quite a while now, but always on pet projects, small scripts, programming contests and I feel that my knowledge didn’t really improve much. I had a chance to work on a Python project and realized there are many basic things I don’t know.

python-logo

Every time I dedicate some time to learn and write about some subject, I feel like my knowledge about it improves a lot. Thus, I decided to revisit some Python concepts and write a series of posts about things I didn’t know in Python 2.7. The first of our posts will be the Built-in types.

Boolean Types

Boolean types can take True and False as values. Falsy values include False, None, 0, empty containers (strings, lists, dictionaries) and instances of any class that implements __len__() and returns 0.

Logical operators. Many languages use && and ||. In Python it is and and or, and they work the same way (short-circuiting).

Negation is also more verbose: not instead of !. Another logical operator, in, tests whether an element belongs to a sequence/container, or whether a string is a substring of another:

> a = [1, 2, 3]
> 3 in a
True
> b = {'a': 'apple', 'b': 'banana'}
> 'b' in b
True
> 'banana' in b
False
> 'banana' not in b
True
> 'ban' in 'banana'
True

Note that in can be combined with the not operator for better readability. Another operator is is (similar to the triple equal in PHP or JavaScript): it returns True only if both operands point to the same object. For small integers and interned strings it may behave like ==, but that's a caching implementation detail, as the examples below show:

> a = [1, 2]
> b = [1, 2]
> c = a
> a is b
False
> a is c
True
> [1,2] is [1,2]
False
> "abc" is "abc"
True
> (1-1) is 0
True
> 1 is not 0
True

This operator can also be combined with the not operator.

Comparison operators. They're the same as in most languages. In Python, they work with custom classes if these classes implement the following methods: __lt__(), __le__(), __gt__(), __ge__(), or the __cmp__() method.

Numeric Types

Numeric types can be one of the following: int (plain integer), float (floating point number), long (long integer) and complex. int's are like C++'s long, with at least 32 bits of precision. float's are equivalent to C++'s double, usually 64 bits: 52 explicit bits for the mantissa (53 significant bits), 11 bits for the exponent and 1 bit for the sign. long's have unlimited precision. complex is a pair of floats (named real and imag).

Math operations. Most operators are the same as in other languages. A different one is //, which performs floored division; for comparison:

> 11./2
5.5
> 11.//2
5.0
> 11/2
5

Note that if both values are integers, the division is integer. divmod() is also interesting. It’s basically divmod(a, b) = (a // b, a % b).

Bitwise operators work as in C++.

Numeric types are classes too (int and long, for instance, implement numbers.Integral). We can invoke methods on variables, but not directly on integer literals, because the parser reads 6. as the start of a float literal (wrapping the literal in parentheses, (6).bit_length(), works):

> 6.bit_length()
Syntax error
> b = 6
> b.bit_length()
3

Iterator Types

Python supports a concept of iteration over containers. Classes that implement the __iter__() and next() methods are of type iterator. The next() method should return the current value and proceed to the next value, using raise StopIteration when the end is reached.

We can implement our own (x)range function as an example:

class Range:
    def __init__(self, start, end = None, step = 1):
        if (end is None):
            end = start
            start = 0
        self.start = start
        self.end = end
        self.step = step
    def __iter__(self):
        return Iterator(self.start, self.end, self.step)

class Iterator:
    def __init__(self, start, end, step):
        self.end = end
        self.step = step
        self.counter = start
    def __iter__(self):
        return self
    def next(self):
        if self.counter >= self.end:
            raise StopIteration
        value = self.counter
        self.counter += self.step
        return value

for i in Range(1, 10, 2):
    print i

Sequence Types

The types that fall in this family of type are str, unicode, list, tuple, bytearray, buffer, xrange.

Strings. A string is a sequence of characters, each an 8-bit ASCII-encoded value (strings also have some overhead besides the characters [1]). String literals can be written in single or double quotes.

Formatting strings: strings accept a syntax similar to C's sprintf. One interesting form is passing a dictionary of values and naming the patterns by key name:

> print '%(language)s has %(number)03d quote types.' % \
...       {"language": "Python", "number": 2}
Python has 002 quote types.

There is also an alternative formatting style using the .format() method. A discussion can be read here.

Unicode strings. Python 2 has a separate unicode string type, which stores code points rather than bytes. Literals of this type can be created by prefixing the value with a u, for example u'abc'.

Tuples. are shallowly “immutable” containers: their contents can't be changed, but the objects their elements point to might be. Tuples can be written without parentheses and used on the LHS of an assignment to unpack values. For example:

> a, b = 1, 2
> a
1
> b
2

Here, (1, 2) and (a, b) are tuples. Tuples support bracket access, for example:

> x = 1, 2
> x[0]
1

Tuples are hashable if all their elements are hashable. This allows using tuples in sets or as dictionary keys.
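
For example (a small sketch):

> positions = {}
> positions[(2, 3)] = 'knight'
> positions[(2, 3)]
'knight'
> positions[[2, 3]] = 'rook'
TypeError: unhashable type: 'list'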

Lists. are mutable sequences. They're indexed from 0 to length-1 and access outside this range throws an exception. The + operator can be used to concatenate lists. Conveniently, the * operator, where the first operand is a list and the second operand is an integer N, creates a new list with the elements repeated N times, copying references rather than the objects (this works for tuples too).

Lists can also be accessed using slices, in which case another list is returned, for example:

> x = range(10) # [0, 1, 2, ..., 9]
> x[0:5]
[0, 1, 2, 3, 4]

We must be careful when making copies of lists whose elements are references to objects (for example other lists). In the example below, it's likely not doing what we would want:

> lists = [[]] * 3
> lists
[[], [], []]
> lists[0].append(3)
> lists
[[3], [3], [3]]
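
The usual fix (a sketch) is to create the inner lists independently, for example with a list comprehension:

> lists = [[] for _ in range(3)]
> lists[0].append(3)
> lists
[[3], [], []]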

Xranges. The existence of the xrange type is justified by the xrange() function. The python docs explain it well:

This function is very similar to range(), but returns an xrange object instead of a list. This is an opaque sequence type which yields the same values as the corresponding list, without actually storing them all simultaneously. The advantage of xrange() over range() is minimal (since xrange() still has to create the values when asked for them) except when a very large range is used on a memory-starved machine or when all of the range’s elements are never used (such as when the loop is usually terminated with break).

Bytearrays. are essentially mutable strings.

Buffer. is intended for memory-efficient manipulation of large arrays, which otherwise would require a copy. Guido van Rossum describes an example [2]:

It was created with the desire to avoid an expensive memory-copy operation when reading or writing large arrays. For example, if you have an array object containing several millions of double precision floating point numbers, and you want to dump it to a file, you might prefer to do the I/O directly from the array’s memory buffer rather than first copying it to a string.

This Stack Overflow question also discusses the subject.

Set types

Set and Frozenset. The main difference between these two is that set is mutable while frozenset is immutable. These structures are implemented using a hash table, so all elements in a set/frozenset must be hashable. A type is hashable if it implements __hash__() (whose value shouldn't change during the object's lifetime) and either __eq__() or __cmp__(). Since frozenset is immutable and all its elements are hashable, frozenset itself is hashable.

In Python 2.7, sets can be constructed by a shorthand {'foo', 'bar'}.

Map types

Dictionaries are associative arrays. They’re called dict in Python code. Dictionary keys must be hashable.

We can get a dictionary’s keys and values by .keys() and .values() respectively. The .items() method returns an array of pairs, each containing key and value. These methods all return copies, so if we assign .keys() to a variable and make changes to the dictionary, the changes won’t get reflected in the list assigned to the variable.

To get references instead of copies, we can use the .viewkeys(), .viewvalues() and .viewitems() methods, which are read-only references.
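
For example (a small sketch):

> d = {'a': 1}
> ks = d.keys()       # a copy
> kv = d.viewkeys()   # a read-only view
> d['b'] = 2
> ks
['a']
> kv
dict_keys(['a', 'b'])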

File Objects

Returned by the open() function. It’s used for file manipulation operations.

Memory View

It was introduced in Python 2.7 and is a replacement for the buffer type.

Context Manager Types

Any user defined class that implements __enter__() and __exit__() is a context manager type. These types are useful for abstracting a specific try/finally pattern. More specifically, imagine we have the following pseudo-code:

set things up
try:
  do something
finally:
  tear things down

If we do “set things up” and “tear things down” in many places and only change “do something”, we can abstract those in a class implementing a context manager type:

class ControlledExecution:
  def __enter__(self):
    set things up
    return thing
  def __exit__(self, type, value, traceback):
    tear things down

with ControlledExecution() as thing:
  do something

This post from Effbot has a very clear explanation.

Conclusion

In this post we covered the basic types from the Python environment. We were able to learn some interesting features even of basic types like booleans and numerics. We also covered some more esoteric types like buffer and xrange. We got some exposure to other features like context managers.

In the next post in the series we’ll talk about functions.

References

[1] Theano – Python Memory Management
[2] Python-Dev – The buffer interface
[3] Python Docs – Built-in Types

Bloom Filters

Burton Howard Bloom is an MIT alumnus who, while employed by the Computer Usage Company, developed and published a data structure that later became famous as the “Bloom Filter” [1].

In this post we’ll talk about Bloom Filters. We’ll give a description of the data structure and its operations. We’ll then study the theory that explains the performance of this data structure. Next, we’ll describe an implementation in Python, run some experiments and finally discuss applications.

Introduction

A Bloom filter is a probabilistic data structure that enables inserting elements into a set and testing whether an element belongs to this set, sacrificing accuracy for lower memory usage.

More precisely, it can say an element is in the set when in fact it’s not (false positive), but if it says it’s not in the set, then it’s true. Also, the original variant doesn’t allow removal of elements [2].

649px-Bloom_filter.svg

In its original form, it doesn’t allow removing elements.

Algorithm

The bloom filter structure consists of a bit array of size m. We then define k independent hash functions, which can distribute a given value into any of the m positions uniformly.

Inserting an element consists in applying the k hash functions to the element and set the bit in the k positions returned by these functions to 1.

Querying for an element consists in applying the same k hash functions and testing whether all the bits in the k positions returned by these functions is set to 1.

If the test returns positive, there's a chance the element is there, but it could be a false positive, since the bits could have been set by the insertion of other elements. On the other hand, if it returns negative, we're sure the element is not there, because we never unset a bit. Also, as the number of elements in the set grows, the probability of false positives increases.

Time complexity. Both insertion and querying for an element are constant-time operations, since they are proportional to k, assuming random access memory and O(1) hash functions.
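
To make the two operations concrete, here's a minimal sketch, assuming bit is the bit array of size m (defined in the implementation section below) and hashes(x) is a hypothetical helper returning the k positions for an element x:

def insert(x):
    for pos in hashes(x):
        bit[pos] = True

def contains(x):
    # may return a false positive, but never a false negative
    return all(bit[pos] for pos in hashes(x))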

Probability of false positives. The probability p that the algorithm will return true when the element is not there (false positive), can be described in terms of m, k, n, through the following equation:

p = (1 - e^{-kn/m})^k

In the appendix A, we provide an approximated derivation of this result.

Optimal number of hash functions. For given values n and m, we have a choice of how many hash functions to use, k. Intuitively, if we use too few hash functions, we may increase the chances of collision when querying for an element, whereas if we use too many, the array will be filled up earlier and thus the collisions will also increase.

It’s possible to prove (a proof is sketched in appendix B), that for fixed m, n, the value of k that minimizes p is given by:

k = ln(2)m/n
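
As a quick numeric sketch (optimal_k() is a made-up helper), this is how we could pick k for given m and n:

import math

def optimal_k(m, n):
    # k = ln(2) * m / n, rounded to the nearest positive integer
    return max(1, int(round(math.log(2) * m / n)))

print optimal_k(1000, 250)  # 3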

Python Implementation

In this experiment, we use the bitarray python module. It’s a C implementation that represents a bit array efficiently. It has conveniently overloaded operators so most of the time it’s like working with arrays.

We can define a sufficiently large value of our bit array length, m, for example:

from bitarray import bitarray
m = 1 << 20
bit = bitarray(m)

# bitarray doesn't guarantee the bits are all set to 0
# upon initialization
bit.setall(False)

The murmur algorithm

We need to pick a hash function to use with our Bloom filter. References such as [3] suggest using an algorithm called Murmur. This stack exchange thread has a nice survey about different hash functions.

Murmur is a hash algorithm developed by Austin Appleby. Even though it's not suitable for cryptographic applications, in practice it is very fast and tends to distribute real-world instances well.

According to the author, the name comes from an early implementation detail, which used multiply-rotate-multiply-rotate operations to mix the internal state of the hash (hence MuR-MuR).

From the source code, this algorithm seems to shuffle the bits from the input key by using multiplication, shifting and xor operations.

The author mentions using simulated annealing to find some of the constants in the final mix of the algorithm. This final part is used to cause the avalanche effect, in which very similar inputs are hashed to very different values.

There is a Python library called pyhash that provides an interface to several hash functions (implemented in C++).

To install it, we can use the easy_install command. Easy Install is a Python module used to download packages. pyhash in particular is available in the default repository, PyPI (the Python Package Index), so all we need to do is:

> sudo easy_install pyhash

To use it within a Python script:

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
hash_value = hasher(input)

Families of hash functions

In Jonathan Ellis‘s post, he mentions a strategy for generating a family of hash functions. He cites a paper from Kirsch and Mitzenmacher [4] which shows we can generate a set of k hash functions from 2 independent hash functions in the following way:

h_i(x) = h_A(x) + i \cdot h_B(x), \, i = 0, \cdots, k-1

In practice, since murmur3_x64_128() generates 128-bit hashes, we can use the first 64-bits as the first hash function, and the last 64-bits as the second.

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
h128 = hasher(input)
h64l = h128 & ((1L << 64) - 1)
h64u = h128 >> 64

# m is the size of the bit array, k the number of hash functions
hashes = map(
  lambda i: (h64l + i*h64u) % m,
  range(k)
)

All the code is available on github.

Experiments

Before running experiments with our Bloom filter implementation, let's try to visualize the distribution of the Murmur hash functions. For this first experiment, we compute the hash function of keys ranging from 1 to 10k and take it modulo 10k, so all values fall in the same range. Then we plot the keys vs. their hashes in a scatter plot, by exporting this data as a CSV file and rendering it in R (using ggplot2):

ggplot(data, aes(x=i, y=h_i)) + geom_point(size=1.5) 

murmur3

From the chart, we can see it apparently distributes the keys uniformly, without any obvious pattern. This is not sufficient to prove it’s a good hash function, but it’s a necessary property.

To test the family of hashes, we can add a new column, which has i for the i-th hash function. Then, in ggplot, we render each class with a different color with the following command:

ggplot(data, aes(x=i, y=h_i, color=class)) 
  + geom_point(size=1.5) + facet_grid(class ~ .) 

multi

We can now compare the distributions between each of the hashes by plotting scatter plots for each pair i, j. Now we use another ggplot2 command to generate a 5×5 grid of charts:

ggplot(data, aes(x=i, y=h_i, color=class)) 
  + geom_point(size=1.5) + facet_wrap(~class, ncol=5) 

pairwise

We can see each chart has a uniform pattern, which is a good sign if we’re looking for independent hash functions. For comparison, let’s plot the same chart for a set of hash functions derived only from a single hash function, for example

h_i(x) = i \cdot h_A(x), i = 1, \cdots, k

bad_hash

Now, let’s analyze the false positive rate. We can do that by counting the number of false positives FP and the true negatives TN and evaluating:

\dfrac{FP}{FP + TN}

We fix m and k and shuffle an array from 0 to n-1 to simulate a sampling without replacement. After each inserted element of this array, we go over all non-inserted elements (FP + TN) and count how many of those our Bloom filter thinks they are in the set FP. We then plot a line chart with the number of elements inserted vs. the false positive rate, using qplot() R function:

qplot(
  data$n, 
  data$prob, 
  ylab='Pr(false positive)', 
  xlab='N elements'
)

bloom

Let’s plot the effect of increasing the size of the bit array, using the optimal k and increasing the number of elements. In this particular experiment, we use n=250 and try different values of m: 100, 250, 500, 1000

qplot(
  data$n, 
  data$prob, 
  ylab='Pr(false positive)', 
  xlab='N elements', 
  size=I(1), 
  color=data$m
) + scale_colour_manual(
  values=c("red", "blue", "green", "black"), 
  name="m values"
)

Rplot04

One strange behaviour seems to happen towards the right end of the chart for some curves: the false positive rate seems to drop. I've double-checked the code and it looks sane. One possible explanation is that we are calculating the probability over the non-inserted elements, and as we approach the end, the sample size is smaller, so the noise is more significant. Other than that, the probability of false positives tends to increase as we add more and more elements.

Now, let's analyze the effect of the number of hash functions used. We fix n=250, m=1000 and try different values of k: 1, 5, 10, 50.

Rplot05

We can see that using many hash functions can be bad because it fills up the array too fast (k=50). In general k=5 and k=10 perform best for most part of the spectrum.

Applications

Bloom filters are suitable where false positives are tolerable. For example, they can be used as a first step to avoid a lookup in a cache. Periodically, a cache server could send a Bloom filter to a local machine, corresponding to the items it has cached. If the Bloom filter says an element is in the cache, it might not be true, but if it says the item is not there, the local client can skip the request to the cache and talk directly to the underlying storage.

Bloom filters use little memory, so they are suitable for network transmission.

Broder and Mitzenmacher discuss a lot more examples [5], especially network applications.

Conclusion

In this post we learned about Bloom filters. The idea itself is quite simple, by studying the theory we are able to review some probability and calculus concepts.

The implementation made us aware of details which are usually not discussed in theoretical texts, for example, which hash functions to use and how to generate a family of hash functions.

We used two Python libraries, pyhash and bitarray and learned a little bit about the Python packaging system, PyPI. We got some experience with the ggplot2 R library, which I plan to post about later.

What we didn’t cover was variants of Bloom filters, like the count version, which allows deletion. Chazelle et al., introduced the Bloomier Filters, which is a generalization of Bloom filters [6].

References

[1] Quora – Where can one find a photo and biographical details for Burton Howard Bloom, inventor of the Bloom filter?
[2] Wikipedia – Bloom Filter
[3] Spyved – All you ever wanted to know about writing bloom filters
[4] Less hashing, same performance: Building a better Bloom filter – A Kirsch, M. Mitzenmacher.
[5] Network Applications of Bloom Filters – A. Broder, M. Mitzenmacher
[6] Bloomier Filters – B. Chazelle et al.
[7] Probability and Computing: Randomized Algorithms and Probabilistic Analysis – M. Mitzenmacher, E. Upfal

Appendix A: Probability of false positives

Let’s consider an array of m bits and the insertion operation. The probability of a given bit to be set by one of the hash functions is 1/m. Conversely, the probability of a bit not being set is 1 - 1/m. By the hypothesis of independent hash functions, the probability of a bit not being set by any of the k hash functions is

\left(1 - \frac{1}{m}\right)^k

Inserting n elements is equivalent to running the k functions n times, so after inserting n elements, the probability that a given bit is set to 0 is

\left(1 - \frac{1}{m}\right)^{kn}

or the probability it’s set to one as

1 - \left(1 - \frac{1}{m}\right)^{kn}

Now, the probability of a false positive, p, is the probability of all k positions being set for an element that was never inserted.

p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k

According to [2], though, this derivation is not accurate, since it assumes that the probability of each bit being set is independent of the others. The Wikipedia article sketches a more precise proof from Mitzenmacher and Upfal [7] which arrives at the same result.

To simplify the equation, we’ll use the following limit:

\lim_{x\to\infty} \left( 1 - \frac{1}{x} \right)^x = \frac{1}{e}

So, for large values of m, we can assume that

\left(1 - \frac{1}{m}\right)^{kn} = \left(\left(1 - \frac{1}{m}\right)^{m}\right)^{kn/m} \approx \left(\frac{1}{e}\right)^{kn/m} = e^{-kn/m}

Finally we have:

p = (1 - e^{-kn/m})^k

Appendix B: Optimal number of hash functions

If we make p a function of k, it’s possible to prove it has a global minimum for positive k‘s so we can use the derivative to find the value of k that minimizes the false positive probability.

To find the derivative, note that minimizing p is the same as minimizing \ln(p) = k \ln(1 - e^{-ck}). Differentiating with respect to k gives

\ln(1 - e^{-ck}) + \frac{k}{1 - e^{-ck}} c e^{-ck}

where c = n/m is a constant. If we substitute y = e^{-ck} (so k = -\ln(y)/c), this becomes

\ln(1 - y) - \frac{\ln(y)}{(1 - y)} y

Setting this equal to 0, we get the equation:

\ln(1 - y) (1-y) = \ln(y) y

To solve this for y, let x = 1 - y, so the condition becomes

\ln(x) x = \ln(y) y

subject to x + y = 1. The condition is symmetric under swapping x and y, and x = y = 1/2 satisfies it; one can check it is the only solution in (0, 1) and that it indeed corresponds to a minimum of p.

Replacing x back, we find out that y = 1/2 and finally k = ln(2)m/n

The Algorithm X and the Dancing Links

Introduction

In a paper called Dancing Links [1], Donald Knuth shows an algorithm that can be used to solve puzzles like Sudoku via backtracking in an efficient way.

The backtracking algorithm is named simply Algorithm X for lack of a better name [1] and because it's very simple and not the focus of the paper.

The main concept is actually a data structure used to implement the algorithm X. It’s a sparse matrix where Knuth uses some clever tricks to make removing/restoring columns and rows efficient and in-place operations. He refers to these operations as dancing links, in allusion to how the pointers from the cells change during these operations.

In this post we'll describe in more detail the problem we are trying to solve and then present the idea of Algorithm X. We'll follow with the description of the data structure and how the main steps of the algorithm can be implemented using dancing links.

Finally, we'll present a simple example implementation in Python.

Set Cover

The Sudoku puzzle can be modeled as a more general problem, the set cover problem.

Given a set of items U and a set S of sets, each of which covers some subset of U, the problem consists in finding a subset of S such that each element of U is covered by exactly one chosen set (this “cover each element exactly once” variant is also known as the exact cover problem). This problem is known to be NP-Complete.

The set cover can be viewed as a binary matrix where the columns represent the elements to be covered and the rows represent the sets. An entry 1 in the cell i,j, means that the set i covers element j.

The objective is then to find a subset of the rows such that each column has an entry 1 in exactly one of the chosen rows.

In fact, this is the constraint matrix of a common integer linear programming formulation for this problem.
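
As a tiny concrete instance (a sketch, not from the paper), consider four elements and four candidate sets encoded as a binary matrix:

mat = [
    [1, 0, 0, 1],  # set 0 covers elements {0, 3}
    [0, 1, 1, 0],  # set 1 covers elements {1, 2}
    [1, 1, 0, 0],  # set 2 covers elements {0, 1}
    [0, 0, 1, 1],  # set 3 covers elements {2, 3}
]
# Rows {0, 1} form a solution: each column has exactly one 1 among them.
# Rows {2, 3} form another solution.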

The Algorithm X

Knuth’s algorithm performs a full search of all possible solutions recursively, in such a way that at each node of the search tree we have a submatrix representing a sub-problem.

At a given node, we try adding a given set to our solution. We then discard all elements covered by this set and also discard all other sets that cover at least one of these elements, because by definition an element can't be covered by more than one set, so we are sure these other sets won't be in the solution. We then repeat the procedure on the remaining subproblem.

If, at any point, there’s an element that can’t be covered by any set, we backtrack, trying to pick a different set. On the other hand, if we have no element left, our current solution is feasible.

More formally, at a given node of the backtracking tree, we have a binary matrix M. We first pick some column j. For each row i such that M_{ij} = 1, we try adding i to our current solution and recurse over a submatrix M', constructed by removing from M all columns j' such that M_{ij'} = 1, and all rows i' such that M_{i'k} = M_{ik} = 1 for some column k.

Dancing Links

A naive implementation of the above algorithm would scan the whole matrix to generate the submatrix and it would store a new matrix in each node of the search tree.

Knuth’s insight was to represent the binary matrix as a doubly linked sparse matrix. As we’ll see later, this structure allows us to undo the operations we do for the recursion, so we can always work with a single instance of such sparse matrix.

The basic idea of a sparse matrix is creating a node for each non-zero entry and linking it to its adjacent cells in the same column and in the same row.

In our case, our nodes (depicted in green in Fig. 1) are doubly linked and form a circular chain. We also have one header node (blue) for each column that is linked to the first and last nodes of that column and a single header (yellow) that connects the header of the first column and the last one.

Here’s an example for the matrix \begin{pmatrix}0 & 1\\1 & 1\end{pmatrix}:

Figure 1: Sparse matrix example

Figure 1: Sparse matrix example

Note that the pointers that are coming/going to the border of the page are circular.

For each node we also have a pointer to the corresponding header of its column.

Removing a node. A known way to remove or de-attach a node from a doubly linked circular list is to make its neighbors to point to each other:

node.prev.next = node.next
node.next.prev = node.prev

Restoring a node. Knuth tells us that it’s also possible to restore or re-attach a node to its original position assuming we didn’t touch it since the removal:

node.prev.next = node
node.next.prev = node

Removing a column. For our algorithm, removing a column is just a matter of de-attaching its corresponding header from the other headers (not from the nodes in the column), so we call this a horizontal de-attachment.

Removing a row. To remove a row, we want to de-attach each node in that row from its vertical neighbors, but we are not touching the links between nodes of the same row, so we call this a vertical de-attachment.

A Python implementation

We’ll present a simple implementation of these ideas in Python. The complete code can be found on Github.

Data Structures

Our basic structures are nodes representing cells and heads representing columns (plus a special head sentinel). We need all four links (left, up, right, down), but we don't need to declare them explicitly because Python allows us to set them dynamically. We also have an additional field pointing to the corresponding header. The main difference is that we only attach/de-attach nodes vertically and heads horizontally, so we have different methods for them:

class Node:
    def __init__(self, row, col):
        self.row, self.col = row, col

    def deattach(self):
        self.up.down = self.down
        self.down.up = self.up
        
    def attach(self):
        self.down.up = self.up.down = self

class Head:
    def __init__(self, col):
        self.col = col

    def deattach(self):
        self.left.right = self.right
        self.right.left = self.left
        
    def attach(self):
        self.right.left = self.left.right = self

Now we need to build our sparse matrix out of a regular python matrix. We basically create one node for each entry 1 in the matrix, one head for each column and one global head. We then link them with helper functions:

class SparseMatrix:

    def createLeftRightLinks(self, srows):
        # Circularly link the nodes of each row via left/right pointers
        for srow in srows:
            n = len(srow)
            for j in range(n):
                srow[j].right = srow[(j + 1) % n]
                srow[j].left = srow[(j - 1 + n) % n]

    def createUpDownLinks(self, scols):
        # Circularly link the nodes of each column via up/down pointers;
        # scol[0] is the column header, which every node points back to
        for scol in scols:
            n = len(scol)
            for i in range(n):
                scol[i].down = scol[(i + 1) % n]
                scol[i].up = scol[(i - 1 + n) % n]
                scol[i].head = scol[0]
        
    def __init__(self, mat):
        
        nrows = len(mat)
        ncols = len(mat[0])
    
        srow = [[ ] for _ in range(nrows)]
        heads = [Head(j) for j in range(ncols)]        
        scol = [[head] for head in heads]

        # Head of the column heads
        self.head = Head(-1)
        heads = [self.head] + heads
            
        self.createLeftRightLinks([heads])
            
        for i in range(nrows):
            for j in range(ncols):
                if mat[i][j] == 1:
                    node = Node(i, j)
                    scol[j].append(node)
                    srow[i].append(node)

        self.createLeftRightLinks(srow)
        self.createUpDownLinks(scol)
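
To make the structure concrete, here’s a small usage example (hypothetical, not from the original code) that builds the sparse matrix for the Figure 1 example and follows a few links:

mat = [[0, 1],
       [1, 1]]
smat = SparseMatrix(mat)

h = smat.head.right            # header of column 0
print(h.col)                   # 0
node = h.down                  # the only node in column 0
print(node.row, node.col)      # 1 0
print(node.right.col)          # 1 -- its neighbor in row 1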

Iterators

We were repeating the following code in several places:

it = node.left
while it != node:
  # Do some stuff
  it = it.left

With left eventually replaced by right, up or down. So we abstract that using an iterator:

class NodeIterator:

    def __init__(self, node):
        self.curr = self.start = node

    def __iter__(self):
        return self

    def __next__(self):
        _next = self.move(self.curr)
        if _next == self.start:
            raise StopIteration
        else:
            self.curr = _next
            return _next

    def move(self, node):
        raise NotImplementedError

This basically traverses a circular linked list using a specific move operation, stopping when it gets back to the starting node (which is itself not visited). So we can implement specific iterators for each of the directions:

class LeftIterator (NodeIterator):
    def move(self, node):
        return node.left

class RightIterator (NodeIterator):
    def move(self, node):
        return node.right

class DownIterator (NodeIterator):
    def move(self, node):
        return node.down

class UpIterator (NodeIterator):
    def move(self, node):
        return node.up

Then, our previous while block becomes

for it in LeftIterator(node):
  # Do some stuff
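
For instance, using the sparse matrix for the Figure 1 example again (a hypothetical snippet), we can walk a row; note that the iterator never visits the node we start from:

smat = SparseMatrix([[0, 1],
                     [1, 1]])
start = smat.head.right.down      # the node at (1, 0)
for cell in RightIterator(start):
    print(cell.row, cell.col)     # prints: 1 1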

Algorithm

With our data structures and syntactic-sugar iterators in place, we’re ready to implement our backtracking algorithm.

The basic operations are covering and uncovering a column. Covering consists of removing the column itself and also all the rows in its list (remember that a column can be covered by exactly one row, so once we pick one of those rows the others can be removed from the candidate list).

class DancingLinks:

    def cover(self, col):
        col.deattach()
        for row in DownIterator(col):
            for cell in RightIterator(row):
                cell.deattach()

    def uncover(self, col):
        for row in UpIterator(col):
            for cell in LeftIterator(row):
                cell.attach()
        col.attach()    

   ...

When covering a row of a column col, we start from the cell to the right of the cell in col and finish at the cell on its left; thus we never de-attach the cell belonging to col from its vertical neighbors. This is not needed because the column has already been removed from the matrix, and it allows for a more elegant implementation.

It’s important that uncover() performs its operations in the reverse order of cover(), so we don’t mess up the pointers in the matrix.

The main body of the algorithm is given below. It is essentially the definition of Algorithm X and returns True whenever it has found a solution.

    def backtrack(self):
        # Let's cover the first uncovered item
        col = self.smat.head.right
        # No column left
        if col == self.smat.head:
            return True
        # No set to cover this element
        if col.down == col:
            return False

        self.cover(col)
        
        for row in DownIterator(col):

            for cell in RightIterator(row):
                self.cover(cell.head)

            if self.backtrack():
                self.solution.append(row)
                return True

            for cell in LeftIterator(row):
                self.uncover(cell.head)

        self.uncover(col)
        
        return False
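
To see the whole thing in action, here’s a hypothetical driver. The DancingLinks constructor was elided above, so the Solver subclass below guesses at the minimal setup backtrack() needs (the smat and solution attributes); the real constructor is in the GitHub repo:

class Solver(DancingLinks):
    # Hypothetical constructor, for illustration only
    def __init__(self, mat):
        self.smat = SparseMatrix(mat)
        self.solution = []

solver = Solver([
    [1, 0, 1],   # row 0 covers columns 0 and 2
    [0, 1, 1],   # row 1 covers columns 1 and 2
    [0, 1, 0],   # row 2 covers column 1
])
if solver.backtrack():
    print(sorted(node.row for node in solver.solution))  # [0, 2]

Rows 0 and 2 are indeed the only exact cover here: together they hit every column exactly once.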

Knuth notes that in order to reduce the expected size of the search tree we should minimize the branching factor at the earlier nodes, which means selecting the column with the fewest 1 entries: the branching factor at a node is exactly the number of candidate rows for the chosen column. For the sake of simplicity we are choosing the first column instead.
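
For reference, here’s a minimal sketch of that heuristic (not part of the original code): scan the remaining column headers and pick the one with the fewest nodes, counting on the fly since our Head class doesn’t store a size field.

    def chooseColumn(self):
        # Hypothetical helper: return the column with the fewest remaining nodes
        best, best_size = None, float('inf')
        for col in RightIterator(self.smat.head):
            size = sum(1 for _ in DownIterator(col))
            if size < best_size:
                best, best_size = col, size
        return best

backtrack() would then use col = self.chooseColumn() instead of always taking self.smat.head.right.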

Conclusion

In this post we revisited concepts like the set cover problem and the sparse matrix data structure. We saw that with a clever trick we can remove and insert rows and columns efficiently and in-place.

Complexity

Suppose that at a given node we have an n \times m matrix. In the naive approach, for each candidate row we need to go through all of the rows and columns to generate a new matrix, leading to an O(n^2 m) cost per node.

In the worst case, using a sparse matrix will lead to the same complexity, but hard set cover instances are generally sparse, so in practice we may have a performance boost. Also, since we do everything in-place, our memory footprint is considerably smaller.

References

[1] Dancing Links – D. E. Knuth (arxiv.org)
[2] Wikipedia – Algorithm X
[3] Wikipedia – Dancing Links