Numerical Representations as inspiration for Data Structures


In this chapter Okasaki describes a technique for developing efficient data structures through analogies with numerical representations, in particular binary numbers and their variations.

We’ve seen this pattern arise with Binomial Heaps in the past. Here the author presents the technique in its general form and applies it to another data structure, binary random access lists.

Binary Random Access Lists

These lists allow efficient insertion at and removal from the beginning, as well as access and update at a particular index.

The simple version of this structure consists of distributing the elements among complete binary leaf trees. A complete binary leaf tree (CBLT) is one that stores elements only at the leaves, so a tree of height i has 2^(i+1)-1 nodes but only 2^i elements.

Consider an array of size n, and let Bn be the binary representation of n. If the i-th digit of Bn is 1, then we have a tree containing 2^i leaves. We then distribute the elements into these trees, starting with the least significant digit (i.e. the smallest tree), filling each tree's leaves in order (pre-order).

For example, an array of elements (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) has 11 elements, which is 1011 in binary. So we have one tree with a single leaf (1), one tree with 2 leaves (2, 3) and another with 8 leaves (4, 5, 6, 7, 8, 9, 10, 11).

We use a dense representation of the binary number, keeping explicit entries for the 0 digits. Here's a possible signature for this implementation:
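(A rough Python sketch of this representation; the Leaf and Node names are illustrative rather than the original code.)

class Leaf(object):
    def __init__(self, value):
        self.size = 1
        self.value = value

class Node(object):
    def __init__(self, left, right):
        self.size = left.size + right.size
        self.left = left
        self.right = right

# Dense representation: a list of digits, least significant first. Each digit
# is either None (the digit 0) or a complete binary leaf tree with 2^i
# elements (the digit 1 at position i).
# e.g. the list (1, 2, 3) becomes [Leaf(1), Node(Leaf(2), Leaf(3))].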

Inserting a new element consists of adding a new tree with a single leaf. If a tree already exists for a given size, the two have to be merged into a tree of the next size. Merging two CBLTs of the same size is straightforward: we just make them children of a new root. Since elements are stored in pre-order, the tree being inserted or carried over becomes the left child.

Looping back to our example, if we want to insert the element 100, we first insert a tree with a single leaf (100). Since the least significant digit already has a tree, we merge them into a new tree containing (100, 1) and try to insert it at the next position. A conflict arises with (2, 3), so we again merge them into (100, 1, 2, 3) and try the next position. We finally succeed in inserting at position 2, leaving a list with the trees (100, 1, 2, 3) and (4, 5, 6, 7, 8, 9, 10, 11).

Merging two trees is O(1), but inserting an element is O(log n) in the worst case, since it may require merging trees for all digits (e.g. if Bn = 111...111).
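A sketch of insertion using the Leaf and Node classes above (again illustrative, not the original implementation):

def link(carry, digit):
    # Merge two trees of the same size. The carried tree becomes the left
    # child so that elements remain in pre-order.
    return Node(carry, digit)

def cons(value, digits):
    # Insert at the beginning of the list: analogous to incrementing a
    # binary number, carrying over while the current digit is non-zero.
    digits = list(digits)  # keep the operation non-destructive for simplicity
    carry = Leaf(value)
    for i, digit in enumerate(digits):
        if digit is None:
            digits[i] = carry
            return digits
        carry, digits[i] = link(carry, digit), None
    digits.append(carry)
    return digits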

Removing the first element is analogous to decrementing the number, borrowing from the next digit if the current digit is 0.

Searching for an index consists of first finding the tree containing that index and then searching within the tree. More specifically, because the elements are stored in order from the smallest tree to the largest, we can find the right tree just by inspecting the number of elements in each tree until we find the one whose range includes the desired index. Within a tree, elements are stored in pre-order, so we can find the right leaf in O(height) of the tree.

After finding the right index, returning the element at that index is trivial. Updating the element at a given index requires rebuilding the tree when returning from the recursive calls.
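A lookup sketch along the same lines (using the illustrative classes from before):

def lookup(index, digits):
    # Find the tree whose range of indices contains the target index,
    # then descend into it.
    for digit in digits:
        if digit is None:
            continue
        if index < digit.size:
            return lookup_tree(index, digit)
        index -= digit.size
    raise IndexError('index out of bounds')

def lookup_tree(index, tree):
    if isinstance(tree, Leaf):
        return tree.value
    half = tree.size // 2
    if index < half:
        return lookup_tree(index, tree.left)
    return lookup_tree(index - half, tree.right)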

Okasaki then proposes a few different numbering systems that allow performing insertion/removal in O(1) time. Here we'll only discuss the less obvious but more elegant one, which uses skew binary numbers.


Skew Binary Random Access Lists

A skew binary number representation supports the digits 0, 1 and 2.

The weight of the i-th digit is 2^(i+1) - 1. In its canonical form, it only allows the least significant non-zero digit to be 2.

Examples (decimal and skew binary):

  Decimal | Skew binary
  0       | 0
  1       | 1
  2       | 2
  3       | 10
  4       | 11
  5       | 12
  6       | 20
  7       | 100
  8       | 101
  9       | 102
  10      | 110

It’s possible to show this number system offers a unique representation for decimal numbers. See the Appendix for a sketch of the proof and an algorithm for converting decimals to skewed binary numbers.

Incrementing a number follows these rules:

  • If there’s a digit 2 in the number, turn it into 0 and increment the next digit. By definition that is either 0 or 1, so we can safely increment it without having to continue carrying over.
  • Otherwise the least significant digit is either 0 or 1, and it can be incremented without carry overs.

The advantage of this number system is that increments (and, similarly, decrements) never carry over more than once, so the complexity is O(1), as opposed to the worst-case O(log n) for regular binary numbers.
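As a small sketch, if we keep only the weights of the non-zero digits (smallest first), with a digit 2 showing up as a repeated weight, the increment looks like this:

def increment(weights):
    # weights: the weights 2^(i+1) - 1 of the non-zero digits, smallest first.
    if len(weights) >= 2 and weights[0] == weights[1]:
        # The lowest non-zero digit is a 2: turn it into 0 and carry once,
        # since 2 * (2^(i+1) - 1) + 1 = 2^(i+2) - 1.
        return [weights[0] + weights[1] + 1] + weights[2:]
    # Otherwise just add a new digit of weight 1.
    return [1] + weights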

A skew binary random access list can be implemented using this idea. We use a sparse representation (that is, not including 0s). Each digit 1 at position i corresponds to a tree with 2^(i+1) - 1 elements, in this case a complete binary tree of height i+1. A digit 2 is represented by two consecutive trees with the same weight.

Adding a new element to the beginning of the list is analogous to incrementing the number, which we saw can be done in O(1). Converting a digit 0 to 1, or a 1 to 2, is a matter of prepending a tree to the list. To convert a 2 to 0 and increment the next position, we link the two trees representing the digit 2 and make the new element the root; because each tree is traversed in pre-order, the new element correctly comes first.
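A sketch of this insertion, representing the list as (weight, tree) pairs with trees as nested tuples (names and representation are illustrative):

def skew_cons(value, trees):
    # trees: list of (weight, tree) pairs, smallest weight first.
    if len(trees) >= 2 and trees[0][0] == trees[1][0]:
        (weight, t1), (_, t2) = trees[0], trees[1]
        # Two trees of the same weight represent a digit 2: link them under a
        # new root holding the inserted element, preserving pre-order.
        return [(2 * weight + 1, (value, t1, t2))] + trees[2:]
    # Otherwise prepend a singleton tree (a new digit of weight 1).
    return [(1, (value, None, None))] + trees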

Elements are inserted in pre-order in each tree, so when searching for an
index, we can first find the right tree by looking at the tree sizes and within a tree we can do a “binary search” in O(height) of the tree.

Binomial Heaps

In this chapter, the technique is also applied to improve the worst-case runtime of insertion in binomial heaps. The resulting implementation, named Skewed Binomial Heap, is on github.

Conclusion

This chapter demonstrated that binary representations are a useful analogy for coming up with data structures and algorithms, because they're simple. This simplicity can lead to inefficient running times, though. Representations such as skewed binary numbers can improve the worst case of some operations, with the trade-off of less intuitive and more complex implementations.

Appendix A – Proof

Sketch of the proof. First, it's clear that two different decimals cannot map to the same skewed binary representation; otherwise the same digits with the same weights would have to sum to two different values. We need to show the converse: that two different representations do not map to the same decimal.

Suppose they do, and let them be B1 and B2. Let k be the largest position where these numbers have different digits. Without loss of generality, suppose that B1[k] > B2[k].

First case: suppose that B1[k] = 1, B2[k] = 0 and B2 doesn't have any digit 2. B1 is then at least M + 2^{k+1} - 1, while B2 is at most M + \sum_{i = 1}^{k} (2^{i} - 1), which is M + 2^{k + 1} - k - 2 (M corresponds to the total weight of the digits in positions > k, which are the same in both numbers). This implies that B2 < B1, a contradiction.

Second case: suppose that B1[k] = 1, but now B2 does have a digit 2, at position j. It must be that j < k. Since only zeros follow it, we can write B2's upper bound as

M + \sum_{i = j + 1}^{k} (2^{i} - 1) + 2^{j + 1} - 1

Since 2(2^{j + 1} - 1) < 2^{j + 2} - 1, we have

\sum_{i = j + 1}^{k} (2^{i} - 1) + 2^{j + 1} - 1 < \sum_{i = j + 2}^{k} (2^{i} - 1) + 2^{j + 2} - 1

We can continue this argument until we get that B2 is at most M + 2(2^{k} - 1), which is less than M + 2^{k + 1} - 1, a lower bound for B1.

Third case: finally, suppose that B1[k] = 2 and B2[k] = 1. We can subtract 2^{k+1} - 1 from both and reduce to the previous cases. ▢

Appendix B – Conversion algorithm

Converting from a decimal representation to a binary one is straightforward, but it’s more involved to do so for skewed binary numbers.

Suppose we allow leading zeros and consider all the numbers with k digits. For example, if k=2, we have 00, 01, 02, 10, 11, 12 and 20. We can construct the numbers with k+1 digits by prefixing these with 0 or 1, and adding one extra number: 2 followed by k zeros. For k=3, we have 000, 001, 002, 010, 011, 012, 020, 100, 101, 102, 110, 111, 112, 120 and finally 200.

More generally, we can see there are 2^(k+1) - 1 numbers with k digits. We construct the numbers with k+1 digits by prefixing 0 or 1 and then adding an extra number, which starts with 2 and is followed by k zeros, for a total of (2^(k+1) - 1) + (2^(k+1) - 1) + 1 = 2^(k + 2) - 1. So this invariant holds by induction on k, and the base case holds since for k = 1 we have the 3 numbers 0, 1 and 2.

This gives us a method to construct the skewed binary representation of a number if we know its number of digits, say k. If the number is among the first 2^k - 1 numbers, that is, between 0 and 2^k - 2, we know it starts with a 0. If it's among the next 2^k - 1, that is, between 2^k - 1 and 2^(k+1) - 3, we know it starts with a 1. If it's the next (and last) one, exactly 2^(k+1) - 2, we know it starts with a 2.

We can continue recursively, subtracting the corresponding weight for each chosen digit, until k = 0. We can find out how many digits a number n has (excluding leading zeros) by finding the smallest k such that 2^(k+1)-1 is greater than n. For example, for 8 the smallest such k is 3, since 2^(3+1)-1 = 15 > 8, while 2^(2+1)-1 = 7 <= 8.

The Python code below uses these ideas to find the skewed binary number representation in O(log n):
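(A minimal sketch of such a conversion follows; the function name and the digit-list output format are choices made for this sketch, not necessarily those of the original repository.)

def to_skew_binary(n):
    # Returns the skew binary digits of n, most significant first.
    if n == 0:
        return [0]

    # Number of digits: the smallest k such that 2^(k+1) - 1 > n.
    k = 0
    while (1 << (k + 1)) - 1 <= n:
        k += 1

    digits = []
    while k > 0:
        # Among the k-digit numbers (allowing leading zeros), the first
        # 2^k - 1 start with 0, the next 2^k - 1 start with 1, and the
        # single largest one starts with 2.
        weight = (1 << k) - 1
        if n >= 2 * weight:
            digit = 2
        elif n >= weight:
            digit = 1
        else:
            digit = 0
        digits.append(digit)
        n -= digit * weight
        k -= 1
    return digits

For example, to_skew_binary(8) returns [1, 0, 1], matching the 3-digit example above.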

One might ask: why not OCaml? My excuse is that I already have a number theory repository in Python, so it seemed like a nice addition. Converting this to functional code, in particular OCaml, is easy.

This algorithm requires an additional O(log n) memory, while the conversion to a regular binary number can be done with constant extra memory. My intuition is that this is possible because the weights of binary digits are all powers of the same number, 2, unlike the skewed binary weights. Is it possible to work around this?


Tree Ring Matching using the KMP Algorithm

Disclaimer: trees and rings are concepts found in Mathematics and Computer Science, but in this post they refer to the biology concepts ;)

Biosphere 2

Earlier this year I went to Arizona and visited Biosphere 2. Its initial goal was to simulate a closed environment where a group of 6 humans would survive for 2 years without any exchange with the exterior (except sunlight, of course). It had different biomes, including a rain forest, an ocean and a desert.

Biosphere 2

The plan didn't go well because they had missed an interaction of the plants with one of the seasons, which caused the oxygen level to fall below the acceptable level, so they had to bring in an external supply. The mission did last 2 years, but it was not a perfectly closed environment.

Later, another mission was attempted, but it had to be abandoned after a few months. The project lost its funding and no more attempts were made. Nowadays, the place is used by biologists to study environments under controlled conditions and is also open to the public for tours (the Nautilus magazine recently wrote more about this).

The tour includes a small museum explaining some of the research, one example being the analysis of trees to understand ecosystems from the past. According to one of the exhibits, tree rings can be used to determine the age of the tree and also some of the climate from that period, a process called dendrochronology. The more rings a tree has, the older it is, and the width of the rings varies with the climate during the period of growth.

A lot of the trees alive today are not very old. On the other hand, there are fossils of older trees. Since they possibly co-existed for a period of time, it's possible to combine the data from the rings of both trees by matching the outer rings of the fossil tree with the inner rings of the recent tree.


Tool to explain the matching system

The Longest Prefix-Suffix String Overlap problem

I thought about the tree ring matching for a while and realized this can be reduced to the problem of finding the longest overlap between the suffix and prefix of two strings. More formally, given strings A and B, find the longest suffix of A that is also a prefix of B.

This problem can be easily solved using a simple quadratic algorithm, which is probably fast enough for real-world instances, likely in the order of thousands of rings. In Python, we can use slicing to do:

def longest_suffix_prefix_naive(suffix, prefix):
    minlen = min(len(prefix), len(suffix))
    max_match = 0
    for match_len in range(1, minlen + 1):
        if prefix[:match_len] == suffix[-match_len:]:
            max_match = match_len
    return max_match

In the code above, A is named suffix and B prefix.

But can we do better? This problem resembles the string search problem, where we are given a text T and a query string Q and want to find whether Q appears in T as a substring. One famous algorithm for this is KMP, named after its authors Knuth, Morris and Pratt (Morris came up with the algorithm independently from Knuth and Pratt), which is linear in the sizes of T and Q.

We’ll see how to use the ideas behind the KMP to solve our problem. So let’s start by reviewing the KMP algorithm.

The KMP Algorithm: A refresher

The most straightforward way to search for a query Q in a text T is, for every position in T, to check whether Q occurs starting at that position:

def substr(Q, T):

    tlen = len(T)
    qlen = len(Q)

    for i, _ in enumerate(T):
        match = 0
        # Searches for Q starting from position i in T
        while match < qlen and \
              i + match < tlen and \
              Q[match] == T[i + match]:
            match += 1

        if match == qlen:
            print 'found substring in position', i

Avoiding redundant searches

Imagine that our query string had no repeated characters. How could we improve the search?

Q = abcde
T = abcdfabcde
        ^ mismatch in position 4

If we were using our naive idea, after failing to find Q at position 4, we'd need to restart the search from T[1] = 'b'. But since all characters in Q are different, we know there's no 'a' between positions 1 and 3, because those positions matched other characters of Q, so we can safely skip positions 1 to 3 and continue from position 4.

The only case in which we would need to look back is if 'a' appeared between positions 1 and 3, which would also mean 'a' appears more than once in Q. For example, we could have

Q = ababac
T = ababadac...
         ^ mismatch in position 5

When there is a mismatch at position 5, we need to go back to position 2, because Q could potentially start there (it would match if T were 'abababac...'). Let MM be the current mismatched string, 'ababad' above, and M the string we had right before the mismatch, i.e., MM without the last letter, in our case 'ababa'. Note that M is necessarily a prefix of Q (because MM is the first mismatch).

The string M = 'ababa' has the potential of being extended into an occurrence of Q because it has a proper suffix that is also a prefix of Q, namely 'aba'. Why is it important that the suffix be proper? We saw that M is a prefix of Q, so the longest (not necessarily proper) suffix of M that is a prefix of Q would be M itself; we have already searched starting with M and know it ends in a mismatch, so we're interested in something other than it.

Let’s define LSP(A, B) as being the length of the longest proper suffix of A that is a prefix of B. For example, LSP(M, Q) = len('aba') = 3. If we find a mismatch MM, we know we need to go back LSP(M, Q) positions in M to restart the search, but we also know that the first LSP(M, Q) positions will be a match in Q, so we can skip the first LSP(M, Q) letters from Q. In our previous example we have:

Q =   ababac
T = ababadac...
         ^ continue search from here 
           (not from position 2)

We’ll have another mismatch but now MM = 'abad' and LSP(M, Q) = len('a') = 1. Using the same logic the next step will be:

Q =     ababac
T = ababadac...
         ^ continue search from here 
           (not from position 2)

Now MM = 'ad' and LSP(M, Q) = 0, so we shouldn't go back any positions, but we also won't skip any characters from Q:

Q =      ababac
T = ababadac...
         ^ continue search from here

If we know LSP(M, Q) it's easy to know how many characters to skip. We can simplify LSP(M, Q) by noting that M is a prefix of Q and can be unambiguously represented by its length, that is, M = prefix(Q, len(M)). Let lsp(i) = LSP(prefix(Q, i), Q). Then LSP(M, Q) = lsp(len(M)). Since lsp() only depends on Q, we can pre-compute lsp(i) for all i = 0 to len(Q) - 1.

Suppose we have lsp ready. The KMP algorithm would be:

def kmp(Q, T):

    lsp = precompute_lsp(Q)

    i = 0
    j = 0

    qlen = len(Q)
    tlen = len(T)

    while i < tlen and j < qlen:

        if j < qlen and Q[j] == T[i]:
            j += 1
            i += 1
        elif j > 0:
            j = lsp[j - 1]
        else:
            i += 1

        # string was found
        if j == qlen:
            print i - j

We have to handle 3 cases inside the loop:

1: A match, in which case we keep searching both in T, Q.

2: A mismatch, but there’s a chance Q could appear inside the mismatched part, so we move Q the necessary amount, but leave T.

3: A mismatch, but there’s no chance Q could appear in the mismatched part, in which case we can advance T.

This algorithm is simple but it's not easy to see that it's linear. The argument behind it is very clever. First, observe that lsp[j] < j, because we only account for proper suffixes. It's clear that i is bounded by len(T), but it might not increase in all iterations of the while loop. The key observation is that (i - j) is also bounded by len(T). In the first condition, i increases while (i - j) remains constant. In the second, i remains constant but j decreases, hence (i - j) increases. In the third, i increases. So every iteration increases either i or (i - j), and at most 2 len(T) iterations can occur.

The pre-computed lsp

How can we pre-compute the lsp array? One simple idea is, for every suffix of Q, to find the longest prefix of Q it matches. In the following code, i denotes the start of the suffix and j the offset within the prefix.

def precompute_lsp_naive(Q):
    qlen = len(Q)
    lsp = [0]*qlen

    i = 1
    while i < qlen :

        j = 0
        while (i + j < qlen and Q[j] == Q[i + j]):
            lsp[i + j] = max(lsp[i + j], j + 1)
            j += 1

        i += 1
    return lsp

The problem is that this code is O(len(Q)^2). In practice len(Q) should be much smaller than len(T), but we can do it in O(len(Q)) using an idea similar to the main idea of KMP, this time searching for Q within itself.


The main difference now is that we'll compute lsp on the fly. While searching for Q in Q itself, when we have a matching prefix we update the lsp. When we have a mismatch, instead of resetting the search to the beginning of Q, we skip some characters as given by the lsp. This works because j < i and lsp(k) is already defined for k < i. In the following code, the only changes we made were adding the lsp assignment and having T = Q.

def precompute_lsp(Q):

    qlen = len(Q)

    lsp = [0]*qlen

    i = 1
    j = 0
    while i < qlen:

        if Q[j] == Q[i]:
            j += 1
            i += 1
            # We only added this line
            lsp[i - 1] = j 
        elif j > 0:
            j = lsp[j - 1] 
        else:
            i += 1

    return lsp

We can use exactly the same arguments from the main code of KMP to prove this routine is O(len(Q)). The total complexity of KMP is O(len(Q) + len(T)).

Solving our problem

At this point we can observe that the problem we were initially trying to solve, that is, finding LSP(A, B), is the core of the KMP algorithm. We can search for B, the prefix, in A, the suffix. If we keep track of the indices i and j, we just need to look at the moment when i equals the length of the text (that is, A), at which point j is the length of the longest suffix of A that is a prefix of B.

We can generalize our kmp function to accept a callback that yields the indices of the strings at any step:

def kmp_generic(Q, T, yield_indices):

    lsp = precompute_lsp(Q)

    i = 0
    j = 0

    qlen = len(Q)
    tlen = len(T)

    while i < tlen:

        if j < qlen and Q[j] == T[i]:
            j += 1
            i += 1
        elif j > 0:
            j = lsp[j - 1]
        else:
            i += 1

        yield_indices(i, j)

Now, if we want an array with the indices of all occurrences of Q in T, we can do:

def kmp_all_matches(Q, T):
    matches = []
    qlen = len(Q)
    def match_accumulator(i, j):
        if j == qlen:
            matches.append(i - j)
    kmp_generic(Q, T, match_accumulator)
    return matches

Here we use a nested function as our callback. We can also use kmp_generic to solve our original problem:

def longest_suffix_prefix(suffix, prefix):
    slen = len(suffix)
    max_match = [None]
    def max_matcher(i, j):
        if max_match[0] is None and i == slen:
            max_match[0] = j

    # Search prefix in suffix
    kmp_generic(prefix, suffix, max_matcher)

    return 0 if max_match[0] is None else max_match[0]

Note that the nested function has read-only access to the variables in the scope of the outer function, like max_match and slen. To be able to share data with the nested function, we have to work with references, so we define max_match as a single-element array.

Conclusion

I used to participate in programming competitions. One of the most interesting parts of the problems are their statements. It’s usually an interesting random story from which you have to extract the actual computer science problem. A lot of the times though, the problem writer thinks about a problem and then how to create a story around that problem. It’s nice when we can find real-world problems that can be modelled as classic computer science problems.

I’ve used the KMP several times in the past but never took the time to study it carefully until now, which only happened because I wrote it down as a post.

References

[1] Wikipedia – Knuth–Morris–Pratt algorithm
[2] Annual Growth Rings

Revisiting Python: Object Oriented Programming

This is the third post in the series on revisiting Python. In the previous post we talked about Python functions.


We'll talk about classes in Python 2.7. Following the premises of previous posts, this is not intended to be an introduction to Object Oriented Programming, but rather a look at how we can use this paradigm in Python.

We'll first cover basic concepts like classes and inheritance. Later we'll discuss some more advanced features of the Python language that allow us to extend the set of built-in patterns provided by the language.

Classes

Classes are the building blocks of object oriented programming. Python classes look similar to those in languages like C++ and Java, but there are some important differences, which we’ll comment on this section.

Classes are objects. When a class definition is executed, a class object is created and assigned to a name corresponding to the class name. For example, in the code below, an object representing the class is assigned to the name MyClass in the current scope.

class MyClass:
    print 'hi'

print MyClass # __main__.MyClass
x = MyClass
print x       # __main__.MyClass

If we run the code above it will print ‘hi’. We can manipulate MyClass as a regular variable, assigning it to other variables, passing it as parameter to functions, etc.

Instantiating. We can create an instance of a given class by using a function call notation, that is, MyClass().

This will also call the method __init__(), which can be used to initialize the instance, like a constructor. Functions defined within a class become methods of its instances. Methods can be called using the syntax instance.method(). For example:

class MyClass:
    def myMethod(self):
        print(self)

instance = MyClass()

instance.myMethod() # Prints <__main__.MyClass instance at 0x10891f290>

When invoking a function using instance.method(), instance is bound to the first parameter of method(). We usually name this first parameter self, but it's just a convention.

Class vs. instance members. Note that variables defined at the class level belong to the class object, not to a specific instance. Making an analogy to C++, they are like static member variables.

class MyClass:
    x = 10

inst = MyClass()
MyClass.x = 20
inst2 = MyClass()
print inst2.x # 20

To make a variable instance specific, we need to make use of the instance passed as the first parameter to the methods.

class MyClass:
    def __init__(self):
        self.x = 10

inst = MyClass()
inst.x = 20
inst2 = MyClass()
print inst2.x # 10

All methods and variables defined at the class level are stored internally in the __dict__ variable.

class MyClass:
    x = 10
    def f():
        pass

print MyClass.__dict__
# {'x': 10, '__module__': '__main__', '__doc__': None, 'f': <function f at 0x109f32d70>}

Methods. In Python, methods are essentially "static" functions, but by default they require an instance of the class to be bound to the first parameter. As we saw above, this happens automatically when we use the instance.method() syntax. Alternatively, we can explicitly pass the instance to the method, using the SomeClass.method(instance) syntax. To illustrate, imagine our class has a method printX(). We can invoke it either from the instance object or from the class object, passing the instance:

class MyClass:
    def __init__(self):
        self.x = 10

    def printX(self):
        print self.x

inst = MyClass()

# Both are equivalent:
inst.printX() 
MyClass.printX(inst)

Methods are essentially functions assigned to class member variables. To make the point clear, we can rewrite the previous snippet replacing the inline definition of printX() by an external function:

def externalPrintX(self):
    print self.x

class MyClass:
    def __init__(self):
        self.x = 10

    printX = externalPrintX

Note that the external function still needs to take an instance of MyClass as its first argument.

Given that all methods are public and that instance attributes take precedence over class attributes, it can cause some confusion if we assign functions to instance members:

def externalPrintX():
    print 'external hello world'

class MyClass:
    def __init__(self):
        self.x = 10
        self.printX = externalPrintX

    def printX(self):
        print 'hello world'

inst = MyClass()
inst.printX() # external hello world

Here, the printX() method got replaced by another function, externalPrintX(). Note that, differently from regular methods, the instance is not bound to the first argument when invoking inst.printX().

Static Methods. Since we're normally required to provide an instance as the first argument, methods are not truly static in the C++ terminology. It's possible to remove this requirement by using the staticmethod() function. To illustrate, here's an example:

class ClassWithStaticMethod():
    def regularMethod(x):
        print 'This is not bound to any instance:', x

    staticMethod = staticmethod(regularMethod)

x = ClassWithStaticMethod()
x.staticMethod(10) # This is not bound to any instance: 10

Class Methods are different from regular methods because they always receive the class object instead of the instance object as the first argument. More specifically, when called from a class object, the class is bound automatically as the first parameter, and when called from an instance object, the instance's class is bound instead. They are created with the classmethod() function. For example,

class ClassWithClassMethod():
    def regularMethod(type):
        print type
    classMethod = classmethod(regularMethod)

ClassWithClassMethod.classMethod() # ClassWithClassMethod

x = ClassWithClassMethod()
x.classMethod()                    # ClassWithClassMethod

Inheritance

The syntax for inheritance is class Derived(Base). Methods defined in the base classes can be overridden by derived classes. If a name (variable or method) is referenced from an instance, it’s searched from the current class, going up on the inheritance chain, until the definition is first found.

# animal.py
class Animal:
    def breath(self):
        return 'breathing air'
    def eat(self):
        return 'eating'

class Dog(Animal):
    def eat(self):
        return Animal.eat(self) + ' meat'

dog = Dog()
print dog.breath()  # breathing air
print dog.eat()     # eating meat

In this example, eat() is first found at Dog, while breath() is only found at Animal. Note that we can refer to a specific implementation by providing the class explicitly, like Animal.eat(self).

Old-style vs. new-style classes. In a nutshell, the difference between old-style and new-style classes is that the latter are descendants of object. Thus, user defined classes are old-style by default (old-style classes were removed in Python 3).

According to the docs, the motivation for creating the new-style classes is to unify Python types and user defined classes. In this model, classes are able to derive from built-in types [2].

Multiple inheritance is supported in Python. The syntax is a comma separated list of base classes:

class DerivedClassName(Base1, Base2, Base3):
   ...

Method resolution order (MRO). We just saw that for simple inheritance chains, the natural lookup order for methods is straightforward. For multiple-inheritance it can be complicated mainly because of the diamond pattern.

For old-style classes, the method lookup order, MRO, is defined by a “depth-first search”. It first looks in the entire inheritance chain of Base1, then Base2, and so on. The MRO is always well defined as long as the inheritance graph is a DAG (direct acyclic graph).

The problem is that it’s not intuitive. Depending on the order a given class appears in a descendant, its own MRO can vary. For example, consider the following complex dependency:

class A:
    def name(self):
        return 'a'

class B:
    def name(self):
        return 'b'

class C(B, A):
    def composed_name(self):
        return 'c ' + self.name()

class D(A, C): pass

d = D()
print d.composed_name()

When we call composed_name() on the instance of class D, the name resolution only finds it in class C. This class, in turn, needs to resolve the name() method. Should we get it from class B, since it's listed first in C's inheritance list?

That's not what happens. name() is resolved starting from class D's MRO, in which A appears before B. This is a very confusing and error-prone behavior.

In new-style classes, the diamond pattern exists by default whenever multiple inheritance is used, since everything ends up deriving from object. But in the new style, the kinds of inheritance graphs that can be created are more restricted. Python uses an interesting algorithm, C3, described in [3], to create a consistent ordering or throw an exception in case such an ordering doesn't exist.

super. In the animal.py example, we had to hard-code Animal.eat() to be able to call the implementation in Dog's parent class. The super() function takes a class object and an instance, and it returns a proxy object on which we can call methods; they will be resolved based on the MRO of the instance's class, starting after the given class. More specifically, consider the following example:

class A(object):
    def method(self):
        print 'calling method from a'

class B(object):
    def method(self):
        print 'calling method from b'
    def anotherMethod(self):
        print 'calling another method from b'

class C(A, B):
    def method(self):
        print 'calling method from c'
    def anotherMethod(self):
        print 'calling another method from c'

    def run(self):
        proxy = super(C, self)
        proxy.method()          # calling method from a
        proxy.anotherMethod()   # calling another method from b

C().run()

In this example, super() returned an object, proxy, which can resolve method() and anotherMethod() properly. Since C derives from A first, it resolves it to A.method(). On the other hand, since anotherMethod() is only defined in B, it will resolve to B.anotherMethod().

Descriptors

So far, we discussed the basic components of object oriented programming. Now we’ll describe some advanced concepts that will help us better understand how some functions like staticmethod() work.

A class is said to be a descriptor if it implements any of __get__(), __set__() or __delete__(). In particular, if it doesn’t implement __set__(), it’s called non-data descriptor. If it implements both __get__() and __set__(), it’s a data descriptor [4, 5].

If another class contains class members that are descriptors, __get__() and __set__() will be executed when we do reads and assignments, respectively. For a concrete example, consider the following descriptor:

class Descriptor(object):
    def __get__(self, obj, objtype):
        print 'getting x'
        return self.x

    def __set__(self, obj, val):
        print 'setting x to ' + str(val)
        self.x = val

It stores a value internally at the variable x, but includes some logging when setting or getting this value. Now, suppose we have a class that has a member that is a descriptor:

class ClassWithDescriptors(object):
    member = Descriptor()

x = ClassWithDescriptors()
x.member = 20
print x.member

When we try to assign a value to x.member, the method __set__() from Descriptor is called instead. The same applies to when we try to read from x.member. It’s important that member is a class member, not an instance member. For example, if we have the instance member variable, it would not print the logging messages:

class ClassWithDescriptors(object):
    def __init__(self):
        self.instance_member = Descriptor()

x = ClassWithDescriptors()

x.instance_member = 20   # This is just overriding instance_member with an integer
print x.instance_member  

staticmethod() and classmethod(). One interesting use case for descriptors is implementing the staticmethod() and classmethod() "functions". We could write the following descriptors for each of these, respectively:

class StaticMethod(object):
    def __init__(self, f):
        self.f = f

    def __get__(self, obj, objtype=None):
      return self.f

class ClassMethod(object):
    def __init__(self, f):
        self.f = f

    def __get__(self, obj, objtype=None):
        if objtype is None:
            objtype = type(obj)
        def newfunc(*args):
            return self.f(objtype, *args)
        return newfunc

Functors

Python has the concept of function objects, or functors. A functor is an object that can be invoked as a function, as long as its class defines the __call__() method. A simple example is:

class Multiplier:
    def __init__(self, factor):
        self._factor = factor
    def __call__(self, a):
        return self._factor * a

double = Multiplier(2)
triple = Multiplier(3)
print double(5) # Prints 10
print triple(7) # Prints 21

One advantage of functors over regular functions is the capability of carrying context.

Decorators

Decorator is a design pattern in which we can modify the behavior of an object without modifying all the objects from that particular class. It’s also known as a wrapper, because we wrap the object inside another object that adds the desired behavior.

We'll now describe a motivation for using decorators and how Python offers syntactic sugar for this pattern using annotations. For our example, imagine we have a function that takes three variables and returns their sum:

def add(a, b, c):
    return a + b + c

Now imagine we want to validate the parameters passed to add(). We could add that to the beginning of the function, but maybe we already have a function to do the validation for us. In this case, we could wrap the function in another one, which would first perform the parameter validation and then call the original function. The code below does that:

def validate(f):
    def closure(*args):
        for arg in args:
            assert isinstance(arg, int)
        return f(*args)
    return closure

validated_add = validate(add)

print validated_add(5, 2, 3)
# Throws an exception
print validated_add(5, 2, 3.0) 

validate() takes a function f() and wraps it inside closure(), which verifies all arguments are integers before invoking f(). So by passing add() to validate(), we’re getting a decorated function validated_add() which performs validation on the arguments.

Python offers syntactic sugar to do exactly that. By annotating a function with @something, we pass the function to a function named something(), which should act as a function transformation, and the transformed function is used instead.

@validate
def add(a, b, c):
    return a + b + c

Imagine we want to customize the validate function to enable us to define the type of each element passed to the decorated function.

Annotations with parameters. We can do that by passing parameters to the decorator, using the syntax @something(arg1, arg2, ...). In this form, we actually construct a decorator builder, so we need to wrap our validate() function inside another one:

def validate(*arg_types):
    def validate_impl(f):
        def closure(*args):
            assert len(arg_types) == len(args), "Arguments and types don't match"
            for i in range(len(args)):
                assert isinstance(args[i], arg_types[i])
            return f(*args)
        return closure
    return validate_impl

With this new validate() function, we can update the annotation to our simple add() function:

@validate(int, int, int)
def add(a, b, c):
    return a + b + c

We can now use this as a very simple type validation. Imagine we have a function that takes a string and repeats it N times. We can enforce the types using the same annotation:

@validate(str, int)
def rep(s, n):
    return "".join([s]*n)

Decorators can also be function objects, so we can define a class to do work similar to what the validate() function does:

class Validator:
    def __init__(self, *arg_types):
        self._arg_types = arg_types

    def __call__(self, f):
        arg_types = self._arg_types
        def closure(*args):
            assert len(arg_types) == len(args), "Arguments and types don't match"
            for i in range(len(args)):
                assert isinstance(args[i], arg_types[i])
            return f(*args)
        return closure

In this case, the parameters defined in the annotation are passed to the constructor of the class.

Finally, two very common annotations are @staticmethod and @classmethod. There is no magic here: these cause staticmethod() and classmethod() instances to be created, and we saw how they work in previous sections.
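For example (a small illustrative class):

class Example(object):
    @staticmethod
    def double(x):
        # No instance or class gets bound to the first argument.
        return x * 2

    @classmethod
    def create(cls):
        # The class object is bound to cls.
        return cls()

print Example.double(4)  # 8
print Example.create()   # a new Example instance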

Conclusion

In this post we covered object oriented programming in Python. We learned details about classes, especially how they compare to classes in other languages. We then discussed inheritance, including multiple inheritance and old and new-style classes. We then delved into more advanced concepts related to object oriented programming. This included descriptors, functors and decorators.

Python requires relatively few special constructs to accomplish features from other object-oriented languages. Examples include the concept of super and static methods, which are implemented using general purpose features from Python like descriptors and decorators.

Reference

[1] The Python Tutorial – Classes
[2] Python.org Docs – New-style Classes
[3] The Python 2.3 Method Resolution Order
[4] Python Descriptors Demystified
[5] Python HOWTOs – Descriptor HowTo Guide

Revisiting Python: Functions

This is the second post in the series on revisiting Python. In the first post we discussed the motivation for these posts and started by revisiting Built-in Types.


In this post, we’ll talk about functions in Python 2.7.

Functions

Functions in Python are defined with the keyword def, followed by a function name and a list of arguments:

def hello(arg1, arg2):
  print "hello world"

The pythonic way to document what a function does is through docstrings (blocks of strings delimited by triple double quotes):

def hello(arg1, arg2):
  """
    Prints hello world
  """
  print "hello world"

Some tools are able to parse these docstrings to automatically build documentation.

Namespace. A namespace is a map from names (variable names, function names) to objects (their values). Examples of namespaces include the built-in namespace, the global namespace within a module and the local namespace within a function.

Scope. A scope is a textual region of a Python program where a namespace is directly accessible.

For example, within a function we have a local scope. Variables from the arguments or created within the function itself belong to this local scope. Consider the following example:

d = 4
def foo(a):
    b = 1
    return abs(a - b)
foo(10)

In the code above, a and b are in the local scope, d is in the module's global scope and the function abs is in the built-in scope. When we refer to a variable or function name, Python searches the scopes in the following order:

1. Local scope
2. Local scopes of the enclosing functions
3. Global scope within the module
4. Built-in scope

Item 2 requires some clarification. In Python we can define a function inside another function. For example,

def outer():
    scope_name = 'outer'
    def inner():
        print(scope_name)
    return inner

scope_name = 'global'
# Get the inner function
inner = outer()
inner() # outer

In the code above, scope_name is defined both in the global scope and in the outer() function's scope. When referenced within inner(), the outer scope takes precedence, so this will print 'outer'. Note how the scope is resolved statically, when defining inner(), not at runtime, which explains why the docs emphasize "textual region". This is known as lexical scoping, as we discussed when studying the R language.

Variables that are not in the local scope are read-only by default. If we try to assign a new value to an existing variable from an outer scope, we are actually just creating a new local variable with the same name. Note that if the global variable is a reference to an object, we can modify the object, but we still can't re-assign a value to that global variable.

x = [1, 2, 3]

def modify_global_variable_content():
    x.append(4) 

def reassign_global_variable():
    x = [10, 20, 30]

modify_global_variable_content()
reassign_global_variable()
print x

In the code above, modify_global_variable_content() modifies the content of x, but reassign_global_variable() just creates a local variable x. To actually modify the variable we can make it explicit using the global keyword.

def reassign_global_variable():
    global x
    x = [10, 20, 30]

Functions can be assigned to variables and passed around, for example:

def inc(x):
    return x + 1

def print_f(f, x):
    print f(x)

print_f(inc, 2)

Here we pass the function inc(), which then is assigned to the variable f in print_f(), and used to invoke the original function.

Default arguments. It's possible to define default arguments for functions, like in C++. The initialization only happens once, and the value is shared between subsequent calls, so for mutable objects this is what happens:

def append(x, xs = []):
    xs.append(x)
    return xs

print append(1) # [1]
print append(2) # [1, 2]
print append(3) # [1, 2, 3]

Keyword arguments. By default the matching between the arguments passed to a function and the parameters the function declares is positional, that is, the first value is assigned to the first parameter and so on. Besides that, we can name the arguments explicitly:

def listify(a, b, c):
    return [a, b, c]
print listify(b = 2, c = 3, a = 1)

There are some constraints though: 1. Positional arguments must come before keyword arguments. 2. If we use a keyword for a given argument, we have to name all required arguments that appear after it in the function definition.

def listify(a, b, c):
    return [a, b, c]

# Error: Violates rule 1
print listify(c = 3, 1, 2)

# Error: Violates rule 2
print listify(2, 3, a = 1)

This feature is one of my favorites in Python, and one I miss in many other languages. It is particularly useful in the scenario where a function takes many default arguments but we only want to provide the last one.

Var args. Functions can define two different sets of variable arguments: an argument prefixed with * is assigned a tuple of all positional arguments passed beyond the required arguments; an argument prefixed with ** is assigned a dictionary of all keyword arguments not among the required arguments. A simple example:

def f(fixed, *arguments, **keywords):
    print fixed     # 1
    print arguments # (2, 3) 
    print keywords  # {a: 4, b: 5}

f(1, 2, 3, a = 4, b = 5)

Argument unpacking. With an analogous idea, it’s possible to call a function with a list of arguments using the * modifier. For example:

def f(a, b, c):
    print a, b, c
args = [1, 2, 3] 
f(*args)

Or a map of keyword arguments using the ** modifier:

def f(a, b, c):
    print a, b, c
args = {'b': 2, 'c': 3, 'a': 1}
f(**args)

Generators

In the previous post we saw the iterator type. Remember that to create a custom iterator we had to define a class and implement the required methods (__iter__() and next()).

We can also instantiate a generator from a function, by using the yield keyword. For example,

def sample_generator(n):
    for i in range(n):
        yield i

The result of calling the function is an iterator. Every time we call next(), we execute the function until it reaches a yield statement, at which point it returns that value.

it = sample_generator(10)
it.next() # 0
it.next() # 1
...

Generators are particularly useful if the iterator has to work in sequential steps that perform different actions at each step. For example, consider the following iterator:

class MyIterator:
    def __init__(self, x):
        self.step = 0
        self.x = x

    def __iter__(self):
        return self

    def next(self):
        if self.step == 0:
            ret = self.x
        elif self.step == 1:
            ret = 2*self.x
        elif self.step == 2:
            ret = self.x*self.x
        else:
            raise StopIteration

        self.step += 1
        return ret

This is a toy example that runs different functions of x based on which stage it is on. Using generators, this would simply become:

def my_generator(x):
    yield x
    yield 2*x
    yield x*x

A generator can use the return statement (without a value) to stop the iterator, which is equivalent to raising StopIteration.
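For example (an illustrative generator that stops early):

def first_n_powers(n):
    for i in [1, 2, 4, 8, 16]:
        if n == 0:
            return  # equivalent to raise StopIteration
        yield i
        n -= 1

print list(first_n_powers(3)) # [1, 2, 4]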

It’s possible to send values back to generators through the method send(). Consider the following example:

def generator_with_sends(x):
    while True:
        y = (yield x)
        if y is not None:
            x = y

it = generator_with_sends(5)
print it.next()   # 5
print it.send(10) # 10
print it.next()   # 10

Note that it.send(10) returns 10, not 5. This is because yield halts after the right-hand side of the statement is evaluated but before the assignment. Calling send() resumes from there, but this time the value of (yield x) is the value passed to send(), so y (and then x) gets 10, and execution continues until the next yield, which yields the updated x. It's not allowed to call send() before the generator has yielded for the first time.

When a generator is garbage-collected, a GeneratorExit exception is raised inside it. It's possible to force that by calling the close() method.

def stubborn_generator():
    try:
        while True:
            yield 'no'
    except GeneratorExit:
        print 'generator stopped. clean up...'
        raise StopIteration
    finally:
        print 'can do some shared clean up here'

def scoped():
    s = stubborn_generator()
    print s.next()

scoped()

Map, filter, reduce

Python has built-in functions that work with sequences, map(), filter() and reduce() being the most common. These functions work with iterators too. For example:

# map_generator.py
def generate_ints(n):
    for i in range(n):
        yield i

def double(x):
    return x*2

print map(double, generate_ints(5)) 
# [0, 2, 4, 6, 8]

The problem is that map() has to convert the generator to a list, so it has to evaluate the generator until it stops; thus we can't do this with an "infinite" iterator, like the one from this generator:

def generate_forever():
    i = 0
    while True:
        yield i
        i += 1

The alternative is to use the itertools library, which contains the imap() and ifilter() functions; these return new iterators with the function applied lazily.

import itertools

h = itertools.imap(double, generate_forever())
for i in range(10):
    print h.next()

We can see generators as lazily evaluated functions, which is the default behavior in languages like Haskell.

Partial function application

One very useful functional paradigm is partial application. It’s available in the functools module. For example, we can re-write our double() (from map_generator.py) function as follows:

# partial.py
import operator
import functools

double = functools.partial(operator.mul, 2)

Here, operator.mul is just the function version for the * operator, that is,

def mul(a, b):
    return a*b

The code in partial.py returns another function in which the first argument to operator.mul is bound to 2. A nice thing about the partial() function is that it accepts keyword arguments:

def f(a, b, c, d):
    return [a, b, c, d]
g = functools.partial(f, b=2, c=3)
print g(a=1, d=4)

Conclusion

In this second post we covered functions in Python. We learned about namespaces, scopes and variable number of arguments. We also learned about generators and increased our knowledge on iterators by introducing the itertools module. Finally, we saw that Python can work with higher-order functions by using libraries like functools.

References

[1] The Python Tutorial: More Control Flow Tools
[2] The Python HOWTOs: Functional Programming HOWTO

Revisiting Python: Basic Types

I've been using Python for quite a while now, but always on pet projects, small scripts and programming contests, and I feel that my knowledge didn't really improve much. I had a chance to work on a Python project and realized there are many basic things I don't know.

python-logo

Every time I dedicate some time to learn and write about some subject, I feel like my knowledge about it improves a lot. Thus, I decided to revisit some Python concepts and write a series of posts about things I didn’t know in Python 2.7. The first of our posts will be the Built-in types.

Boolean Types

Boolean types can take True and False as values. Falsy values include False, 0, empty sequences (string, list, dictionary) and instances of any class that implements __len__() and returns 0.

Logical operators. Many languages use && and ||. In Python they are and and or, and they work the same way (short-circuiting).

Negation is also more verbose: not instead of !. Another logical operator, in, tests whether an element belongs to a sequence/container or whether a string is a substring of another:

> a = [1, 2, 3]
> 3 in a
True
> b = {'a': 'apple', 'b': 'banana'}
> 'b' in b
True
> 'banana' in b
False
> 'banana' not in b
True
> 'ban' in 'banana'
True

Note that it can be combined with the not operator for better readability. Another operator is the is operator (similar to the triple equal in PHP or Javascript). It returns True only if both operands point to the same object; for some basic immutable values, such as short strings and small integers, it may behave like == because of interning, but that's an implementation detail:

> a = [1, 2]
> b = [1, 2]
> c = a
> a is b
False
> a is c
True
> [1,2] is [1,2]
False
> "abc" is "abc"
True
> (1-1) is 0
True
> 1 is not 0
True

This operator can also be combined with the not operator.

Comparison operators. They are the same as in most languages. In Python, they work with custom classes if these classes implement the rich comparison methods (__lt__(), __le__(), __gt__(), __ge__(), __eq__(), __ne__()) or the __cmp__() method.

Numeric Types

Numeric types can be one of the following: int (plain integer), float (floating point number), long (long integer) and complex. ints are like C++'s long, with at least 32 bits of precision. floats are equivalent to C++'s double, usually 64 bits: 53 bits for the mantissa (one of them implicit), 11 bits for the exponent and 1 bit for the sign. longs have unlimited precision. complex is a pair of floats (named real and imag).

Math operations. Most operators are the same as in other languages. A different one is //, which performs floored division. For comparison:

> 11./2
5.5
> 11.//2
5.0
> 11/2
5

Note that if both values are integers, the division is integer. divmod() is also interesting. It’s basically divmod(a, b) = (a // b, a % b).

Bitwise operators work as in C++.

Numeric types are classes too; integers, for example, implement numbers.Integral. We can invoke methods on variables, but not directly on numeric literals:

> 6.bit_length()
Syntax error
> b = 6
> b.bit_length()
3

Iterator Types

Python supports a concept of iteration over containers. Classes that implement the __iter__() and next() methods are iterators. The next() method should return the current value and advance to the next one, raising StopIteration when the end is reached.

We can implement our own (x)range function as an example:

class Range:
    def __init__(self, start, end = None, step = 1):
        if (end is None):
            end = start
            start = 0
        self.start = start
        self.end = end
        self.step = step
    def __iter__(self):
        return Iterator(self.start, self.end, self.step)

class Iterator:
    def __init__(self, start, end, step):
        self.end = end
        self.step = step
        self.counter = start
    def __iter__(self):
        return self
    def next(self):
        if self.counter >= self.end:
            raise StopIteration
        value = self.counter
        self.counter += self.step
        return value

for i in Range(1, 10, 2):
    print i

Sequence Types

The types that fall in this family of type are str, unicode, list, tuple, bytearray, buffer, xrange.

Strings. A string is a sequence of characters, which themselves are 8-bit ASCII values (strings have some overhead besides the characters [1]). String literals can be written in single or double quotes.

Formatting strings. Strings accept a formatting syntax similar to sprintf from C. One interesting form is passing a dictionary of values and naming the patterns by key:

> print '%(language)s has %(number)03d quote types.' % \
...       {"language": "Python", "number": 2}
Python has 002 quote types.

There is also an alternative formatting mechanism using the .format() method. A discussion can be read here.
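For example, the snippet above could be rewritten with .format() roughly as:

> print '{language} has {number:03d} quote types.'.format(language="Python", number=2)
Python has 002 quote types.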

Unicode strings. Python has a separate unicode type for Unicode text (converted to and from bytes using encodings such as UTF-8). Literals of this type can be created by prefixing the value with u, for example u'abc'.

Tuples. Tuples are shallowly "immutable" containers: their contents can't be changed, but the objects their elements point to might be. They can be written without parentheses and can be used on the left-hand side of an assignment to unpack values. For example:

> a, b = 1, 2
> a
1
> b
2

Here, (1, 2) and (a, b) are tuples. Tuples also support bracket access, for example:

> x = 1, 2
> x[0]
1

Tuples are hashable if all their elements are hashable. This allows using tuples as elements of sets or as dictionary keys.
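For example:

> d = {}
> d[(1, 2)] = 'point'
> d[(1, 2)]
'point'
> d[([1], 2)] = 'fails'
TypeError: unhashable type: 'list'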

Lists. Lists are mutable sequences. They're indexed from 0 to length-1, and access outside this range throws an exception. The + operator can be used to concatenate lists. Conveniently, the * operator, where the first operand is a list and the second operand is an integer N, creates a list with the contents repeated N times as shallow copies (this works for tuples too).

Lists can also be accessed using ranges (slices), in which case another list is returned, for example:

> x = range(10) # [0, 1, 2, ..., 9]
> x[0:5]
[0, 1, 2, 3, 4]

We must be careful when making copies of lists whose elements are references to objects (for example other lists). In the example below, the result is likely not what we would want:

> lists = [[]] * 3
> lists
[[], [], []]
> lists[0].append(3)
> lists
[[3], [3], [3]]

Xranges. The existence of the xrange type is justified by the xrange() function. The python docs explain it well:

This function is very similar to range(), but returns an xrange object instead of a list. This is an opaque sequence type which yields the same values as the corresponding list, without actually storing them all simultaneously. The advantage of xrange() over range() is minimal (since xrange() still has to create the values when asked for them) except when a very large range is used on a memory-starved machine or when all of the range’s elements are never used (such as when the loop is usually terminated with break).

Bytearrays. Bytearrays are essentially mutable strings.

Buffer. The buffer type is intended for memory-efficient manipulation of large arrays, which would otherwise cause copies. Guido van Rossum describes an example [2]:

It was created with the desire to avoid an expensive memory-copy operation when reading or writing large arrays. For example, if you have an array object containing several millions of double precision floating point numbers, and you want to dump it to a file, you might prefer to do the I/O directly from the array’s memory buffer rather than first copying it to a string.

This Stack Overflow question also discusses the subject.

Set types

Set and Frozenset. The main difference between these two is that set is mutable while frozenset is immutable. These structures are implemented using a hash table, so all elements in a set/frozenset must be hashable. A type is hashable if it implements __hash__() (whose value shouldn't change during the object's lifetime) and either __eq__() or __cmp__(). Since frozenset is immutable and all its elements are hashable, frozenset itself is hashable.
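
As an illustration, here's a minimal sketch of a user-defined hashable type (the Point class below is made up for the example) that can be stored in a set:

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __hash__(self):
        # hash an immutable view of the state; it shouldn't change during the lifetime
        return hash((self.x, self.y))
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

points = set([Point(1, 2), Point(1, 2), Point(3, 4)])
print len(points)  # 2 -- equal points collapse into a single entry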

In Python 2.7, sets can be constructed by a shorthand {'foo', 'bar'}.

Map types

Dictionaries are associative arrays. They’re called dict in Python code. Dictionary keys must be hashable.

We can get a dictionary's keys and values by .keys() and .values() respectively. The .items() method returns a list of pairs, each containing a key and its value. These methods all return copies, so if we assign .keys() to a variable and then change the dictionary, the changes won't be reflected in the list assigned to the variable.

To get references instead of copies, we can use the .viewkeys(), .viewvalues() and .viewitems() methods, which are read-only references.
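
A quick example of the difference:

> d = {'a': 1, 'b': 2}
> keys_copy = d.keys()      # a list (copy)
> keys_view = d.viewkeys()  # a read-only view
> d['c'] = 3
> 'c' in keys_copy
False
> 'c' in keys_view
True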

File Objects

Returned by the open() function. It’s used for file manipulation operations.
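
A minimal example, assuming a file named example.txt exists in the current directory:

> f = open('example.txt')
> first_line = f.readline()
> f.close()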

Memory View

It was introduced in Python 2.7 and is a replacement for the buffer type.

Context Manager Types

Any user-defined class that implements __enter__() and __exit__() is a context manager type. These types are useful for abstracting a specific try/finally pattern. More specifically, imagine we have the following pseudo-code:

set things up
try:
  do something
finally:
  tear things down

If we do “set things up” and “tear things down” in many places and only change “do something”, we can abstract those in a class implementing a context manager type:

class ControlledExecution:
  def __enter__(self):
    set things up
    return thing
  def __exit__(self, type, value, traceback):
    tear things down

with ControlledExecution() as thing:
  do something
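
As a concrete (made-up) example following the same pattern, a timer that sets things up on entry and reports on exit:

import time

class Timer(object):
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, type, value, traceback):
        print 'elapsed: %.3fs' % (time.time() - self.start)

with Timer():
    sum(xrange(1000000))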

This post from Effbot has a very clear explanation.

Conclusion

In this post we covered the basic types from the Python environment. We were able to learn about some interesting features even in basic types like booleans and numerics. We also covered some more esoteric types like buffer and xrange, and got some exposure to other features like context managers.

In the next post in the series we’ll talk about functions.

References

[1] Theano – Python Memory Management
[2] Python-Dev – The buffer interface
[3] Python Docs – Built-in Types

Bloom Filters

Burton Howard Bloom is an MIT alumnus who, while employed by the Computer Usage Company, developed and published a data structure that later became famous as the "Bloom Filter" [1].

In this post we’ll talk about Bloom Filters. We’ll give a description of the data structure and its operations. We’ll then study the theory that explains the performance of this data structure. Next, we’ll describe an implementation in Python, run some experiments and finally discuss applications.

Introduction

A Bloom filter is a probabilistic data structure that supports inserting elements into a set and testing whether an element belongs to this set, sacrificing accuracy for lower memory usage.

More precisely, it can say an element is in the set when in fact it's not (a false positive), but if it says an element is not in the set, then it's definitely not there. Also, the original variant doesn't allow removal of elements [2].

Figure: a Bloom filter (source: Wikipedia)

Algorithm

The bloom filter structure consists of a bit array of size m. We then define k independent hash functions, which can distribute a given value into any of the m positions uniformly.

Inserting an element consists in applying the k hash functions to the element and setting the bits in the k positions returned by these functions to 1.

Querying for an element consists in applying the same k hash functions and testing whether all the bits in the k positions returned by these functions are set to 1.

If the test returns positive, there's a chance the element is there, but it could be a false positive, since the bits could have been set by the insertion of other elements. On the other hand, if it returns negative, we're sure the element is not there, because we never unset a bit. Also, as the number of elements in the set grows, the probability of false positives increases.

Time complexity. Both insertion and querying for an element take O(k) time, independent of the number of elements stored, assuming random access memory and O(1) hash functions.

Probability of false positives. The probability p that the algorithm will return true when the element is not there (a false positive) can be described in terms of m, k and n through the following equation:

p = (1 - e^{-kn/m})^k

In Appendix A, we provide an approximate derivation of this result.

Optimal number of hash functions. For given values n and m, we have a choice of how many hash functions to use, k. Intuitively, if we use too few hash functions, we may increase the chances of collision when querying for an element, whereas if we use too many, the array will be filled up earlier and thus the collisions will also increase.

It's possible to prove (a proof is sketched in Appendix B) that, for fixed m and n, the value of k that minimizes p is given by:

k = ln(2)m/n
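
As an illustration (the values of n and m below are arbitrary), we can plug these formulas in directly:

import math

n = 1000   # number of elements we expect to insert
m = 10000  # number of bits in the array
k = int(round(math.log(2) * m / n))         # optimal number of hash functions
p = (1 - math.exp(-float(k) * n / m)) ** k  # expected false positive rate
print k, p  # about 7 hash functions and a false positive rate below 1%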

Python Implementation

In this experiment, we use the bitarray python module. It’s a C implementation that represents a bit array efficiently. It has conveniently overloaded operators so most of the time it’s like working with arrays.

We can define a sufficiently large value of our bit array length, m, for example:

from bitarray import bitarray
m = 1 << 20
bit = bitarray(m)

# bitarray doesn't guarantee the bits are all set to 0
# upon initialization
bit.setall(False)

The murmur algorithm

We need to pick a hash function to use with our Bloom filter. References such as [3] suggest using an algorithm called Murmur. This stack exchange thread has a nice survey about different hash functions.

Murmur is a hash algorithm developed by Austin Appleby. Even though it's not suitable for cryptographic applications, in practice it is very fast and tends to distribute real-world instances well.

According to the author, the name comes from an early implementation detail, which used multiply-rotate-multiply-rotate operations to mix the internal state of the hash (hence MuR-MuR).

From the source code, this algorithm seems to shuffle the bits from the input key by using multiplication, shifting and xor operations.

The author mentions using simulated annealing to find some of the constants of the final mix of the algorithm. This final part is used to cause the avalanche effect, in which very similar inputs are hashed to very different values.

There is a Python library called pyhash that has an interface for several hash functions (implemented in C++).

To install it, we can use the easy_install command. Easy Install is a Python module used to download packages. pyhash in particular is available in the default repository, PyPI (the Python package index), so all we need to do is:

> sudo easy_install pyhash

To use it within a Python script:

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
hash_value = hasher(input)

Families of hash functions

In Jonathan Ellis‘s post, he mentions a strategy for generating a family of hash functions. He cites a paper from Kirsch and Mitzenmacher [4] which shows we can generate a set of k hash functions from 2 independent hash functions in the following way:

h_i(x) = h_A(x) + i \cdot h_B(x), \, i = 0, \cdots, k-1

In practice, since murmur3_x64_128() generates 128-bit hashes, we can use the first 64-bits as the first hash function, and the last 64-bits as the second.

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
h128 = hasher(input)
h64l = h128 & ((1L << 64) - 1)
h64u = h128 >> 64

# positions must lie in the bit array, whose size is m
hashes = map(
  lambda i: (h64l + i*h64u) % m,
  range(k)
)
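
Putting these pieces together, here's a minimal sketch of a Bloom filter on top of bitarray and pyhash. The class and method names are my own, not necessarily the ones used in the github code:

from bitarray import bitarray
from pyhash import murmur3_x64_128

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bitarray(m)
        self.bits.setall(False)
        self.hasher = murmur3_x64_128()

    def _positions(self, key):
        # derive k hash functions from the two 64-bit halves of murmur3
        h128 = self.hasher(key)
        h64l = h128 & ((1L << 64) - 1)
        h64u = h128 >> 64
        return [(h64l + i * h64u) % self.m for i in range(self.k)]

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def contains(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(1 << 20, 5)
bf.insert('foo')
print bf.contains('foo')  # True
print bf.contains('bar')  # False (with high probability)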

All the code is available on github.

Experiments

Before running experiments with our Bloom filter implementation, let's try to visualize the distribution of the Murmur hash functions. For this first experiment, we compute the hash function of keys ranging from 1 to 10k and take it modulo 10k, so they're all in the same range of values. Then we plot the keys vs. their hashes in a scatter plot, by exporting this data as a CSV file and rendering in R (using ggplot2):

ggplot(data, aes(x=i, y=h_i)) + geom_point(size=1.5) 

Figure: scatter plot of keys vs. their Murmur3 hash values

From the chart, we can see it apparently distributes the keys uniformly, without any obvious pattern. This is not sufficient to prove it’s a good hash function, but it’s a necessary property.

To test the family of hashes, we can add a new column, which has i for the i-th hash function. Then, in ggplot, we render each class with a different color with the following command:

ggplot(data, aes(x=i, y=h_i, color=class)) +
  geom_point(size=1.5) + facet_grid(class ~ .)

Figure: hash values for each function in the family, one facet per function

We can now compare the distributions between each of the hashes by plotting scatter plots for each pair i, j. Now we use another ggplot2 command to generate a 5×5 grid of charts:

ggplot(data, aes(x=i, y=h_i, color=class)) +
  geom_point(size=1.5) + facet_wrap(~class, ncol=5)

Figure: grid of scatter plots, one per hash function

We can see each chart has a uniform pattern, which is a good sign if we’re looking for independent hash functions. For comparison, let’s plot the same chart for a set of hash functions derived only from a single hash function, for example

h_i(x) = i \cdot h_A(x), i = 1, \cdots, k

Figure: the same plot for hash functions derived from a single hash function

Now, let’s analyze the false positive rate. We can do that by counting the number of false positives FP and the true negatives TN and evaluating:

\dfrac{FP}{FP + TN}

We fix m and k and shuffle an array from 0 to n-1 to simulate sampling without replacement. After inserting each element of this array, we go over all non-inserted elements (FP + TN) and count how many of those our Bloom filter thinks are in the set (FP). We then plot a line chart of the number of inserted elements vs. the false positive rate, using the qplot() R function:

qplot(
  data$n, 
  data$prob, 
  ylab='Pr(false positive)', 
  xlab='N elements'
)

Figure: number of inserted elements vs. false positive rate

Let's plot the effect of increasing the size of the bit array, using the optimal k and increasing the number of elements. In this particular experiment, we use n=250 and try different values of m: 100, 250, 500, 1000.

qplot(
  data$n, 
  data$prob, 
  ylab='Pr(false positive)', 
  xlab='N elements', 
  size=I(1), 
  color=data$m
) + scale_colour_manual(
  values=c("red", "blue", "green", "black"), 
  name="m values"
)

Figure: false positive rate for different values of m

One strange behaviour seems to happen towards the right end of the chart for some curves — the false positive rate seems to drop. I've double checked the code and it looks sane. One possible explanation is that we are calculating the probability over the non-inserted elements and, as we approach the end, the sample size is smaller, so the noise is more significant. Other than that, the probability of false positives tends to increase as we add more and more elements.

Now, let's analyze the effect of the number of hash functions used. We fix n=250, m=1000 and try out different values of k: 1, 5, 10, 50.

Figure: false positive rate for different values of k

We can see that using too many hash functions can be bad because it fills up the array too fast (k=50). In general, k=5 and k=10 perform best for most of the spectrum.

Applications

Bloom filters are suitable where false positives are tolerable. For example, they can be used as a first step to avoid a lookup to a cache. Periodically a cache server could send a Bloom filter to a local machine, corresponding to items it has cached. If the Bloom filter says an element is in the cache, it might not be true, but if it says the item is not there, the local client can avoid a request to the cache and talk directly to the underlying storage.
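
A hypothetical sketch of that flow, where bloom, cache and storage are placeholder objects with contains()/get() methods:

def lookup(key, bloom, cache, storage):
    # A negative answer from the filter is always correct,
    # so we can skip the round trip to the cache server.
    if not bloom.contains(key):
        return storage.get(key)
    # A positive answer may be a false positive, so the cache
    # can still miss and we fall back to the storage.
    value = cache.get(key)
    if value is None:
        value = storage.get(key)
    return value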

Bloom filters use little memory, so they are suitable for network transmission.

Broder and Mitzenmacher discuss a lot more examples [5], especially network applications.

Conclusion

In this post we learned about Bloom filters. The idea itself is quite simple, and by studying the theory we were able to review some probability and calculus concepts.

The implementation made us aware of details which are usually not discussed in theoretical texts, for example, which hash functions to use and how to generate a family of hash functions.

We used two Python libraries, pyhash and bitarray and learned a little bit about the Python packaging system, PyPI. We got some experience with the ggplot2 R library, which I plan to post about later.

What we didn't cover were variants of Bloom filters, like the counting version, which allows deletion. Chazelle et al. introduced Bloomier filters, a generalization of Bloom filters [6].

References

[1] Quora – Where can one find a photo and biographical details for Burton Howard Bloom, inventor of the Bloom filter?
[2] Wikipedia – Bloom Filter
[3] Spyved – All you ever wanted to know about writing bloom filters
[4] Less hashing, same performance: Building a better Bloom filter – A Kirsch, M. Mitzenmacher.
[5] Network Applications of Bloom Filters – A. Broder, M. Mitzenmacher
[6] Bloomier Filters – B. Chazelle et al.
[7] Probability and Computing: Randomized Algorithms and Probabilistic Analysis – M. Mitzenmacher, E. Upfal

Appendix A: Probability of false positives

Let's consider an array of m bits and the insertion operation. The probability of a given bit being set by one of the hash functions is 1/m. Conversely, the probability of a bit not being set is 1 - 1/m. By the hypothesis of independent hash functions, the probability of a bit not being set by any of the k hash functions is

\left(1 - \frac{1}{m}\right)^k

Inserting n elements is equivalent to running the k functions n times, so after inserting n elements, the probability that a given bit is set to 0 is

\left(1 - \frac{1}{m}\right)^{kn}

or the probability it’s set to one as

1 - \left(1 - \frac{1}{m}\right)^{kn}

Now, the probability of a false positive, p, is the probability of the k positions being set for an element that was never inserted.

p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k

According to [2] though, this derivation is not accurate, since it assumes that the events of each bit being set are independent of each other. The Wikipedia article sketches a more precise proof from Mitzenmacher and Upfal [7] which arrives at the same result.

To simplify the equation, we’ll use the following limit:

\lim_{x\to\infty} \left( 1 - 1/x \right)^x = 1/e

So, for large values of m, we can assume that

(1 - 1/m)^{kn} = \left((1 - 1/m)^{m}\right)^{kn/m} \approx (1/e)^{kn/m} = e^{-kn/m}

Finally we have:

p = (1 - e^{-kn/m})^k

Appendix B: Optimal number of hash functions

If we make p a function of k, it's possible to prove it has a global minimum for positive k, so we can use the derivative to find the value of k that minimizes the false positive probability.

To find the derivative, we can use the generalized power rule (for a function of the form f(k)^{g(k)}), which gives p'(k) as p(k) multiplied by the factor below. Since p(k) > 0, it suffices to find where this factor vanishes:

\ln(1 - e^{-ck}) + \frac{k}{(1 - e^{-ck})} c e^{-ck}

where c = n/m is a constant. If we substitute y = e^{-ck} (so that k = -\ln(y)/c), we get

\ln(1 - y) - \frac{\ln(y)}{(1 - y)} y

By setting it equal to 0, we get the equation:

\ln(1 - y) (1-y) = \ln(y) y

To solve this for y, let x = 1 - y,

\ln(x) x = \ln(y) y

Since \ln(n) and n are both monotonic functions, so is \ln(n) \cdot n. Then, if the equation above holds, we can conclude that x = y, because otherwise, if x > y, \ln(x) x > \ln(y) y, and if x < y, \ln(x) x < \ln(y) y.

Replacing x back, we find that y = 1/2 and finally k = ln(2)m/n.

The Algorithm X and the Dancing Links

Introduction

In a paper called Dancing Links [1], Donald Knuth shows an algorithm that can be used to solve puzzles like Sudoku via backtracking in an efficient way.

The backtracking algorithm is named simply Algorithm X for lack of a better name [1] and because it's very simple and not the focus of the paper.

The main concept is actually a data structure used to implement the algorithm X. It’s a sparse matrix where Knuth uses some clever tricks to make removing/restoring columns and rows efficient and in-place operations. He refers to these operations as dancing links, in allusion to how the pointers from the cells change during these operations.

In this post we'll describe in more detail the problem we are trying to solve and then present the idea of Algorithm X. We'll follow with a description of the data structure and how the main steps of the algorithm can be implemented using dancing links.

Finally, we'll present a simple example implementation in Python.

Set Cover

The Sudoku puzzle can be modeled as a more general problem, the set cover problem.

Given a set of items U and a set S of sets, each of which covers some subset of U, the set cover problem consists in finding a subset of S such that each element is covered by exactly one set. This problem is known to be NP-Complete.

The set cover can be viewed as a binary matrix where the columns represent the elements to be covered and the rows represent the sets. An entry 1 in the cell i,j, means that the set i covers element j.

The objective is then to find a subset of the rows such that each column has an entry 1 in exactly one of the chosen rows.

In fact, this is the constraint matrix of a common integer linear programming formulation for this problem.

The Algorithm X

Knuth’s algorithm performs a full search of all possible solutions recursively, in such a way that at each node of the search tree we have a submatrix representing a sub-problem.

At a given node, we try adding a given set to our solution. We then discard all elements covered by this set and also discard all other sets that cover at least one of these elements, because by definition one element can't be covered by more than one set, so we are sure these other sets won't be in the solution. We then repeat the operation on the remaining subproblem.

If, at any point, there’s an element that can’t be covered by any set, we backtrack, trying to pick a different set. On the other hand, if we have no element left, our current solution is feasible.

More formally, at a given node of the backtrack tree, we have a binary matrix M. We first pick some column j. For each row i such that M_{ij} = 1, we try adding i to our current solution and recurse over a submatrix M', constructed by removing from M all columns j' such that M_{ij'} = 1, and all rows i' such that M_{i'k} = M_{ik} = 1 for some column k.

Dancing Links

A naive implementation of the above algorithm would scan the whole matrix to generate the submatrix and it would store a new matrix in each node of the search tree.

Knuth’s insight was to represent the binary matrix as a doubly linked sparse matrix. As we’ll see later, this structure allows us to undo the operations we do for the recursion, so we can always work with a single instance of such sparse matrix.

The basic idea of a sparse matrix is to create a node for each non-zero entry and link it to its adjacent cells in the same column and in the same row.

In our case, our nodes (depicted in green in Fig. 1) are doubly linked and form a circular chain. We also have one header node (blue) for each column that is linked to the first and last nodes of that column and a single header (yellow) that connects the header of the first column and the last one.

Here’s an example for the matrix \begin{pmatrix}0 & 1\\1 & 1\end{pmatrix}:

Figure 1: Sparse matrix example

Note that the pointers that are coming/going to the border of the page are circular.

For each node we also have a pointer to the corresponding header of its column.

Removing a node. A known way to remove or de-attach a node from a doubly linked circular list is to make its neighbors point to each other:

node.prev.next = node.next
node.next.prev = node.prev

Restoring a node. Knuth tells us that it's also possible to restore or re-attach a node to its original position, assuming we didn't touch it since the removal. The trick is that the removed node still points to its old neighbors, so we can use it to stitch itself back in:

node.prev.next = node
node.next.prev = node

Removing a column. For our algorithm, removing a column is just a matter of de-attaching its corresponding header from the other headers (not from the nodes in the column), so we call this a horizontal de-attachment.

Removing a row. To remove a row, we want to de-attach each node in that row from its vertical neighbors, but we are not touching the links between nodes of the same row, so we call this a vertical de-attachment.

A Python implementation

We’ll present a simple implementation of these ideas in Python. The complete code can be found on Github.

Data Structures

Our basic structures are nodes, representing cells, and heads, representing columns (plus a special head sentinel). We need all four links (left, up, right, down), but we don't need to declare them explicitly because Python allows us to set them dynamically. Each node also has an additional field pointing to the corresponding header. The main difference between the two structures is that we only attach/de-attach nodes vertically and heads horizontally, so we have different methods for them:

class Node:
    def __init__(self, row, col):
        self.row, self.col = row, col

    def deattach(self):
        self.up.down = self.down
        self.down.up = self.up
        
    def attach(self):
        self.down.up = self.up.down = self

class Head:
    def __init__(self, col):
        self.col = col

    def deattach(self):
        self.left.right = self.right
        self.right.left = self.left
        
    def attach(self):
        self.right.left = self.left.right = self

Now we need to build our sparse matrix out of a regular python matrix. We basically create one node for each entry 1 in the matrix, one head for each column and one global head. We then link them with helper functions:

class SparseMatrix:

    def createLeftRightLinks(self, srows):
        for srow in srows:
            n = len(srow)
            for j in range(n):
                srow[j].right = srow[(j + 1) % n]
                srow[j].left = srow[(j - 1 + n) % n]
            
    def createUpDownLinks(self, scols):
        for scol in scols:
            n = len(scol)
            for i in range(n):
                scol[i].down = scol[(i + 1) % n]
                scol[i].up = scol[(i - 1 + n) % n]
                scol[i].head = scol[0]
        
    def __init__(self, mat):
        
        nrows = len(mat)
        ncols = len(mat[0])
    
        srow = [[ ] for _ in range(nrows)]
        heads = [Head(j) for j in range(ncols)]        
        scol = [[head] for head in heads]

        # Head of the column heads
        self.head = Head(-1)
        heads = [self.head] + heads
            
        self.createLeftRightLinks([heads])
            
        for i in range(nrows):
            for j in range(ncols):
                if mat[i][j] == 1:
                    node = Node(i, j)
                    scol[j].append(node)
                    srow[i].append(node)

        self.createLeftRightLinks(srow)
        self.createUpDownLinks(scol)

Iterators

We were repeating the following code in several places:

it = node.left
while it != node:
  # Do some stuff
  it = it.left

With left eventually replaced by right, up or down. So we abstract that using an iterator:

class NodeIterator:

    def __init__(self, node):
        self.curr = self.start = node

    def __iter__(self):
        return self

    def next(self):
        _next = self.move(self.curr)
        if _next == self.start:
            raise StopIteration
        else:
            self.curr = _next
            return _next

    def move(self, node):
        raise NotImplementedError

This basically goes through a linked list using a specific move operation. So we can implement specific iterators for each of the directions:

class LeftIterator (NodeIterator):
    def move(self, node):
        return node.left

class RightIterator (NodeIterator):
    def move(self, node):
        return node.right

class DownIterator (NodeIterator):
    def move(self, node):
        return node.down

class UpIterator (NodeIterator):
    def move(self, node):
        return node.up

Then, our previous while block becomes

for it in LeftIterator(node):
  # Do some stuff

Algorithm

With our data structures and syntax sugar iterators set, we’re ready to implement our backtracking algorithm.

The basic operation is the covering and uncovering of a column. Covering a column consists in de-attaching its header and also de-attaching, from their own columns, all the rows that have a 1 in this column (remember that a column must be covered by exactly one row, so once one of these rows is chosen the others can be removed from the candidate list).

class DancingLinks:

    def cover(self, col):
        col.deattach()
        for row in DownIterator(col):
            for cell in RightIterator(row):
                cell.deattach()

    def uncover(self, col):
        for row in UpIterator(col):
            for cell in LeftIterator(row):
                cell.attach()
        col.attach()    

   ...

When covering a row of a column col, we start at the column to the right of col and finish at the column on its left, thus we never de-attach the cell belonging to col from its vertical neighbors. This is not needed because we have already "removed" the column from the matrix, and it allows for a more elegant implementation.

It's important that uncover performs the operations in the reverse order of cover, so we don't mess up the pointers in the matrix.

The main body of the algorithm is given below. It is essentially the definition of Algorithm X and returns true whenever it has found a solution.

    def backtrack(self):
        # Let's cover the first uncovered item
        col = self.smat.head.right
        # No column left
        if col == self.smat.head:
            return True
        # No set to cover this element
        if col.down == col:
            return False

        self.cover(col)
        
        for row in DownIterator(col):

            for cell in RightIterator(row):
                self.cover(cell.head)

            if self.backtrack():
                self.solution.append(row)
                return True

            for cell in LeftIterator(row):
                self.uncover(cell.head)

        self.uncover(col)
        
        return False
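
To tie everything together, here's a small usage sketch. It assumes the DancingLinks constructor (elided above) stores the sparse matrix as self.smat and initializes self.solution to an empty list:

mat = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
dlx = DancingLinks(SparseMatrix(mat))
if dlx.backtrack():
    # prints [1, 0]: rows 1 and 0 together cover every column exactly once
    print [node.row for node in dlx.solution]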

Knuth notes that in order to reduce the expected size of the search tree we should minimize the branching factor at earlier nodes by selecting the column with the fewest 1 entries, since the number of rows containing a 1 in the chosen column is exactly the number of branches we explore at that node. But for the sake of simplicity we are choosing the first one.

Conclusion

In this post we revisited concepts like the set cover problem and the sparse matrix data structure. We saw that with a clever trick we can remove and insert rows and columns efficiently and in-place.

Complexity

Suppose that at a given node we have an n \times m matrix. In the naive approach, for each candidate row we need to go through all of the rows and columns to generate a new matrix, leading to O(n^2 m) work per node.

In the worst case, using a sparse matrix will lead to the same complexity, but hard set cover instances are generally sparse, so in practice we may have a performance boost. Also, since we do everything in-place, our memory footprint is considerably smaller.

References

[1] Dancing Links – D. E. Knuth (arxiv.org)
[2] Wikipedia – Algorithm X
[3] Wikipedia – Dancing Links