Tree Ring Matching using the KMP Algorithm

Disclaimer: trees and rings are concepts found in Mathematics and Computer Science, but in this post they refer to the biology concepts ;)

Biosphere 2

Earlier this year I went to Arizona and visited Biosphere 2. Its initial goal was to simulate a closed environment where a group of 6 humans should survive for 2 years without any exchange with the exterior (except sunlight, of course). It contained different biomes, ranging from a rain forest and an ocean to a desert.

The plan didn’t go well because they had missed an interaction of the plants in one of the seasons, which caused the oxygen level to fall below the accepted threshold, so they had to get an external supply. The mission did last 2 years, but it was not a perfectly closed environment.

Later, another mission was attempted, but had to be abandoned after a few months. The project lost its funding and no more attempts were performed. Nowadays, the place is used by biologists to study environments under controlled conditions and is also open to the public for tours (the Nautilus magazine recently wrote more about this).

The tour includes a small museum explaining some of the research, one of which is analyzing trees to understand the ecosystem of the past. According to one of the exhibits, tree rings can be used to determine the age of a tree and also to infer the climate of each period, a process called dendrochronology. The more rings a tree has, the older it is, and the width of each ring varies with the climate during the period of its growth.

A lot of the trees alive today are not very old. On the other hand, there are fossils of older trees. Since they possibly co-existed for a period of time, it’s possible to combine the data from the rings of both trees by matching the outer rings of the fossil tree with the inner rings of the recent tree.

Tool to explain the matching system

The Longest Prefix-Suffix String Overlap problem

I thought about the tree ring matching for a while and realized this can be reduced to the problem of finding the longest overlap between the suffix and prefix of two strings. More formally, given strings A and B, find the longest suffix of A that is also a prefix of B.

This problem can be easily solved using a simple quadratic algorithm, which is probably fast enough for real-world instances, possibly in the order of thousands of rings. In Python, we can use slicing to do:

def longest_suffix_prefix_naive(suffix, prefix):
    minlen = min(len(prefix), len(suffix))
    max_match = 0
    for match_len in range(1, minlen + 1):
        if prefix[:match_len] == suffix[-match_len:]:
            max_match = match_len
    return max_match

In the code above, A is named suffix and B prefix.

But can we do better? This problem resembles a string search problem, where we are given a text T and a query string Q and want to find whether Q appears in T as a substring. One famous algorithm to solve this problem is KMP, named after its authors Knuth, Morris and Pratt (Morris came up with the algorithm independently from Knuth and Pratt), which is linear in the sizes of T and Q.

We’ll see how to use the ideas behind the KMP to solve our problem. So let’s start by reviewing the KMP algorithm.

The KMP Algorithm: A refresher

The most straightforward way to search for a query Q in a text is for every character in T, check if Q occurs in that position:

def substr(Q, T):

    tlen = len(T)
    qlen = len(Q)

    for i, _ in enumerate(T):
        match = 0
        # Searches for Q starting from position i in T
        while match < qlen and \
              i + match < tlen and \
              Q[match] == T[i + match]:
            match += 1

        if match == qlen:
            print 'found substring in position', i

Avoiding redundant searches

Imagine that our query string had no repeated characters. How could we improve the search?

Q = abcde
T = abcdfabcde
        ^ mismatch in position 4

If we were using our naive idea, after failing to find Q in position 4, we’d need to restart the search from T(1) = 'b'. But since all characters in Q are different, we know there’s no 'a' between positions 1 and 3, because those positions matched characters of Q other than 'a', so we can safely ignore positions 1 to 3 and continue from position 4.

The only case we would need to look back would be if 'a' appeared between 1 and 3, which would also mean 'a' appeared more than once in Q. For example, we could have

Q = ababac
T = ababadac...
         ^ mismatch in position 5

When there is a mismatch in position 5, we need to go back to position 2, because Q could potentially start there (T could have begun with 'abababac'). Let MM be the current mismatched string, 'ababad' above, and M the string we had right before the mismatch, i.e., MM without the last letter, in our case 'ababa'. Note that M is necessarily a prefix of Q (because MM is the first mismatch).

The string M = 'ababa' still has the potential of being part of a match of Q because it has a longest proper suffix that is also a prefix of Q, 'aba'. Why is it important to be a proper suffix? We saw that M is a prefix of Q, hence the longest (not necessarily proper) suffix of M that is a prefix of Q would be M itself. We have already searched for M and know it leads to a mismatch, so we’re interested in something other than M itself.

Let’s define LSP(A, B) as being the length of the longest proper suffix of A that is a prefix of B. For example, LSP(M, Q) = len('aba') = 3. If we find a mismatch MM, we know we need to go back LSP(M, Q) positions in M to restart the search, but we also know that the first LSP(M, Q) positions will be a match in Q, so we can skip the first LSP(M, Q) letters from Q. In our previous example we have:

Q =   ababac
T = ababadac...
         ^ continue search from here 
           (not from position 2)

We’ll have another mismatch but now MM = 'abad' and LSP(M, Q) = len('a') = 1. Using the same logic the next step will be:

Q =     ababac
T = ababadac...
         ^ continue search from here 
           (not from position 2)

Now MM = 'ad' and LSP(M, Q) = 0. This time we shouldn’t go back any positions, but we also won’t skip any characters from Q:

Q =      ababac
T = ababadac...
         ^ continue search from here

If we know LSP(M, Q), it’s easy to know how many characters to skip. We can simplify LSP(M, Q) by noting that M is a prefix of Q and can be unambiguously represented by its length, that is, M = prefix(Q, len(M)). Let lsp(i) = LSP(prefix(Q, i), Q). Then LSP(M, Q) = lsp(len(M)). Since lsp() only depends on Q, we can pre-compute lsp(i) for all i = 0 to len(Q) - 1.
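
For instance, for Q = 'ababac' from our running example, indexing lsp by the last position of each prefix, the pre-computed values are:

i:      0 1 2 3 4 5
Q[i]:   a b a b a c
lsp[i]: 0 0 1 2 3 0

For example, lsp[4] = 3 because the prefix 'ababa' has 'aba' as its longest proper suffix that is also a prefix of Q.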

Suppose we have lsp ready. The KMP algorithm would be:

def kmp(Q, T):

    lsp = precompute_lsp(Q)

    i = 0
    j = 0

    qlen = len(Q)
    tlen = len(T)

    while i < tlen and j < qlen:

        if j < qlen and Q[j] == T[i]:
            j += 1
            i += 1
        elif j > 0:
            j = lsp[j - 1]
        else:
            i += 1

        # string was found
        if j == qlen:
            print i - j

We have to handle 3 cases inside the loop:

1: A match, in which case we advance both in T and Q.

2: A mismatch, but there’s a chance Q could appear inside the mismatched part, so we shift Q by the necessary amount, but don’t advance in T.

3: A mismatch, but there’s no chance Q could appear in the mismatched part, in which case we can advance in T.

This algorithm is simple but it’s not easy to see it’s linear. The argument behind it is very clever. First, observe that lsp[j] < j, because we only account for proper suffixes. It’s clear that i is bounded by len(T), but it might not increase in all iterations of the while loop. The key observation is that (i - j) is also bounded by len(T) and never decreases. In the first case, i increases while (i - j) remains constant. In the second, i remains constant but j decreases, so (i - j) increases. In the third, both i and (i - j) increase. So in every iteration either i or (i - j) increases, and at most 2·len(T) iterations can occur.

The pre-computed lsp

How can we pre-compute the lsp array? One simple idea is to, for every suffix of Q, find the maximum prefix it forms with Q. In the following code, i denotes the start of the suffix and j the start of the prefix.

def precompute_lsp_naive(Q):
    qlen = len(Q)
    lsp = [0]*qlen

    i = 1
    while i < qlen :

        j = 0
        while (i + j < qlen and Q[j] == Q[i + j]):
            lsp[i + j] = max(lsp[i + j], j + 1)
            j += 1

        i += 1
    return lsp

The problem is that this code is O(len(Q)^2). In practice len(Q) should be much smaller than len(T), but we can do it in O(len(Q)), using an idea similar to the main idea of KMP, but this time trying to find Q in itself.

The main difference now is that we’ll compute lsp on the fly. While searching for Q in Q, when we have a matching prefix, we update the lsp. When we have a mismatch, instead of re-setting the search to the beginning of Q, we skip some characters given by the lsp. This works because j < i and lsp(k) is already computed for all k < i. In the following code, the only changes we made were adding the lsp assignment and having T = Q.

def precompute_lsp(Q):

    qlen = len(Q)

    lsp = [0]*qlen

    i = 1
    j = 0
    while i < qlen:

        if Q[j] == Q[i]:
            j += 1
            i += 1
            # We only added this line
            lsp[i - 1] = j 
        elif j > 0:
            j = lsp[j - 1] 
        else:
            i += 1

    return lsp

We can use exactly the same arguments from the main code of KMP to prove this routine is O(len(Q)). The total complexity of KMP is O(len(Q) + len(T)).

Solving our problem

At this point we can observe that the problem we were initially trying to solve, that is, finding LSP(A, B), is the core of the KMP algorithm. We can search for B, the prefix, in A, the suffix. If we keep track of the indices i and j, we just need to find when i equals the length of T, at which point j is the length of the longest prefix of B that matches a suffix of A.

We can generalize our kmp function to accept a callback that yields the indices of the strings at any step:

def kmp_generic(Q, T, yield_indices):

    lsp = precompute_lsp(Q)

    i = 0
    j = 0

    qlen = len(Q)
    tlen = len(T)

    while i < tlen:

        if j < qlen and Q[j] == T[i]:
            j += 1
            i += 1
        elif j > 0:
            j = lsp[j - 1]
        else:
            i += 1

        yield_indices(i, j)

Now, if we want an array with the indices of all occurrences of Q in T, we can do:

def kmp_all_matches(Q, T):
    matches = []
    qlen = len(Q)
    def match_accumulator(i, j):
        if j == qlen:
            matches.append(i - j)
    kmp_generic(Q, T, match_accumulator)
    return matches

Here we use a nested function as our callback. We can also use kmp_generic to solve our original problem:

def longest_suffix_prefix(suffix, prefix):
    slen = len(suffix)
    max_match = [None]
    def max_matcher(i, j):
        if max_match[0] is None and i == slen:
            max_match[0] = j

    # Search prefix in suffix
    kmp_generic(prefix, suffix, max_matcher)

    return 0 if max_match[0] is None else max_match[0]

Note that the nested function has read-only access to the variables in the scope of the outer function, like max_match and slen. To be able to share data with the nested function, we have to work with references, so we define max_match as a single-element array.
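
To tie this back to the tree rings, here’s a small usage sketch where the ring sequences are encoded as made-up strings:

fossil_tree = 'xyzabab'   # older tree: we match the end of this...
recent_tree = 'ababcdef'  # ...with the beginning of this one

print longest_suffix_prefix(fossil_tree, recent_tree) # 4, the overlap 'abab'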

Conclusion

I used to participate in programming competitions. One of the most interesting parts of the problems are their statements. It’s usually an interesting random story from which you have to extract the actual computer science problem. A lot of the times though, the problem writer thinks about a problem and then how to create a story around that problem. It’s nice when we can find real-world problems that can be modelled as classic computer science problems.

I’ve used the KMP several times in the past but never took the time to study it carefully until now, which only happened because I wrote it down as a post.


Revisiting Python: Object Oriented Programming

This is the third post in the series on revisiting Python. In the previous post we talked about Python functions.

We’ll talk about classes in Python 2.7. Following the premises of previous posts, this is not intended to be an introduction to Object Oriented Programming, but rather to show how we can use this paradigm in Python.

We’ll first cover basic concepts like classes and inheritance. Later we’ll discuss some more advanced features of the Python language that allow us to extend the set of built-in patterns provided by the language.

Classes

Classes are the building blocks of object oriented programming. Python classes look similar to those in languages like C++ and Java, but there are some important differences, which we’ll comment on in this section.

Classes are objects. When a class definition is executed, a class object is created and assigned to a name corresponding to the class name. For example, in the code below, an object representing the class is assigned to the variable MyClass in the current scope.

class MyClass:
    print 'hi'

print MyClass # __main__.MyClass
x = MyClass
print x       # __main__.MyClass

If we run the code above it will print ‘hi’. We can manipulate MyClass as a regular variable, assigning it to other variables, passing it as parameter to functions, etc.

Instantiating. We can create an instance of a given class by using a function call notation, that is, MyClass().

This will also call the method __init__(), which can be used to initialize the instance, like a constructor. Functions defined within a class become methods of its instances. Methods can be called using the syntax instance.method(). For example:

class MyClass:
    def myMethod(self):
        print(self)

instance = MyClass()

instance.myMethod() # Prints <__main__.MyClass instance at 0x10891f290>

When invoking a function using instance.method(), instance is bound to the first argument of method(). We usually name this first parameter self, but it’s just a convention.

Class vs. instance members. Note that local variables defined at the class level belong to the class object, not to a specific instance object. Making an analogy to C++, this is a static member variable.

class MyClass:
    x = 10

inst = MyClass()
MyClass.x = 20
inst2 = MyClass()
print inst2.x # 20

To make a variable instance specific, we need to make use of the instance passed as the first parameter to the methods.

class MyClass:
    def __init__(self):
        self.x = 10

inst = MyClass()
inst.x = 20
inst2 = MyClass()
print inst2.x # 10

All methods and variables defined at the class level are stored internally in the __dict__ variable.

class MyClass:
    x = 10
    def f():
        pass

print MyClass.__dict__
# {'x': 10, '__module__': '__main__', '__doc__': None, 'f': <function f at 0x109f32d70>}

Methods. In Python, methods are essentially “static”, but by default they require an instance of the class to be bound to the first parameter. As we saw above, this happens automatically when we use the instance.method() syntax. Alternatively, we could explicitly pass the instance to the method, using the SomeClass.method(instance) syntax. To illustrate that, imagine our class has a method printX(). We can invoke it either from the instance object or from the class object, passing the instance:

class MyClass:
    def __init__(self):
        self.x = 10

    def printX(self):
        print self.x

inst = MyClass()

# Both are equivalent:
inst.printX() 
MyClass.printX(inst)

Methods are essentially functions assigned to class member variables. To make the point clear, we can rewrite the previous snippet replacing the inline definition of printX() by an external function:

def externalPrintX(self):
    print self.x

class MyClass:
    def __init__(self):
        self.x = 10

    printX = externalPrintX

Note that the external function still needs to take as its first argument an instance of MyClass.

Given that all members are public and instance members take precedence over class members, it can cause some confusion if we assign functions to instance members:

def externalPrintX():
    print 'external hello world'

class MyClass:
    def __init__(self):
        self.x = 10
        self.printX = externalPrintX

    def printX(self):
        print 'hello world'

inst = MyClass()
inst.printX() # external hello world

Here, the printX() method got replaced by another function, externalPrintX(). Note that, differently from regular methods, the instance is not bound to the first argument when invoking inst.printX().

Static Methods. Since we’re normally required to provide an instance as the first argument, methods are not truly static in the C++ terminology. It’s possible to override this requirement by using the staticmethod() function. To illustrate this, here’s an example:

class ClassWithStaticMethod():
    def regularMethod(x):
        print 'This is not bound to any instance:', x

    staticMethod = staticmethod(regularMethod)

x = ClassWithStaticMethod()
x.staticMethod(10) # This is not bound to any instance: 10

Class Methods are different from regular methods because they always receive the class object instead of the instance object as the first argument. More specifically, when called from a class object, the class is bound automatically as the first parameter, and when called from an instance object, its corresponding class object is bound instead. For example:

class ClassWithClassMethod():
    def regularMethod(type):
        print type
    classMethod = classmethod(regularMethod)

ClassWithClassMethod.classMethod() # __main__.ClassWithClassMethod

x = ClassWithClassMethod()
x.classMethod()                    # __main__.ClassWithClassMethod

Inheritance

The syntax for inheritance is class Derived(Base). Methods defined in the base classes can be overridden by derived classes. If a name (variable or method) is referenced from an instance, it’s searched from the current class, going up on the inheritance chain, until the definition is first found.

# animal.py
class Animal:
    def breath(self):
        return 'breathing air'
    def eat(self):
        return 'eating'

class Dog(Animal):
    def eat(self):
        return Animal.eat(self) + ' meat'

dog = Dog()
print dog.breath()  # breathing air
print dog.eat()     # eating meat

In this example, eat() is first found at Dog, while breath() is only found at Animal. Note that we can refer to a specific implementation by providing the class explicitly, like Animal.eat(self).

Old-style vs. new-style classes. In a nutshell, the difference between old-style and new-style classes is that the latter are descendants of object. Thus, user defined classes are old-style by default (old-style classes were removed in Python 3).

According to the docs, the motivation for creating the new-style classes is to unify Python types and user defined classes. In this model, classes are able to derive from built-in types [2].

Multiple-inheritance. is supported in Python. The syntax is a comma separated list of base classes:

class DerivedClassName(Base1, Base2, Base3):
   ...

Method resolution order (MRO). We just saw that for simple inheritance chains, the natural lookup order for methods is straightforward. For multiple-inheritance it can be complicated mainly because of the diamond pattern.

For old-style classes, the method lookup order, MRO, is defined by a “depth-first search”. It first looks in the entire inheritance chain of Base1, then Base2, and so on. The MRO is always well defined as long as the inheritance graph is a DAG (direct acyclic graph).

The problem is that it’s not intuitive. Depending on the order a given class appears in a descendant, its own MRO can vary. For example, consider the following complex dependency:

class A:
    def name(self):
        return 'a'

class B:
    def name(self):
        return 'b'

class C(B, A):
    def composed_name(self):
        return 'c ' + self.name()

class D(A, C): pass

d = D()
print d.composed_name()

When we call composed_name() on the instance of class D, the name resolution will only find it in class C. This class in turn needs to resolve the name() method. Should we get it from class B, since it’s listed first in C‘s inheritance?

That’s not what happens. name() is resolved starting from class D, where A appears before C (and hence before B), so it returns 'a'. This is a very confusing and error-prone behavior.

In new-style classes, the diamond pattern exists by default when multiple inheritance is used, since everything ends up deriving from object. But in the new style, the type of inheritance that can be created is more strict. It uses an interesting algorithm, C3, described in [3] to create a consistent ordering or throw an exception in case such ordering doesn’t exist.
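
For instance, porting the hierarchy above to new-style classes makes Python reject class D outright, because no consistent ordering exists (the exact error message may vary between versions):

class A(object): pass
class B(object): pass
class C(B, A): pass

# TypeError: Cannot create a consistent method resolution
# order (MRO) for bases A, C
class D(A, C): pass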

super. In the animal.py example, we had to hard-code Animal.eat() to be able to call Dog‘s parent class implementation. The super() function takes a class object and an instance, and returns a proxy object on which we can call methods; it resolves them based on the MRO of the object’s class. More specifically, consider the following example:

class A(object):
    def method(self):
        print 'calling method from a'

class B(object):
    def method(self):
        print 'calling method from b'
    def anotherMethod(self):
        print 'calling another method from b'

class C(A, B):
    def method(self):
        print 'calling method from c'
    def anotherMethod(self):
        print 'calling another method from c'

    def run(self):
        proxy = super(C, self)
        proxy.method()          # calling method from a
        proxy.anotherMethod()   # calling another method from b

C().run()

In this example, super() returned an object, proxy, which can resolve method() and anotherMethod() properly. Since C derives from A first, it resolves it to A.method(). On the other hand, since anotherMethod() is only defined in B, it will resolve to B.anotherMethod().

Descriptors

So far, we discussed the basic components of object oriented programming. Now we’ll describe some advanced concepts that will help us better understand how some functions like staticmethod() work.

A class is said to be a descriptor if it implements any of __get__(), __set__() or __delete__(). In particular, if it doesn’t implement __set__(), it’s called non-data descriptor. If it implements both __get__() and __set__(), it’s a data descriptor [4, 5].

If another class contains class members that are descriptors, __get__() and __set__() will be executed when we do reads and assignments, respectively. For a concrete example, consider the following descriptor:

class Descriptor(object):
    def __get__(self, obj, objtype):
        print 'getting x'
        return self.x

    def __set__(self, obj, val):
        print 'setting x to ' + str(val)
        self.x = val

It stores a value internally in the variable x, but includes some logging when setting or getting this value. Now, suppose we have a class that has a member that is a descriptor:

class ClassWithDescriptors(object):
    member = Descriptor()

x = ClassWithDescriptors()
x.member = 20    # prints: setting x to 20
print x.member   # prints: getting x, then 20

When we try to assign a value to x.member, the method __set__() from Descriptor is called instead. The same applies when we try to read from x.member. It’s important that member is a class member, not an instance member. For example, if we used an instance member variable, it would not print the logging messages:

class ClassWithDescriptors(object):
    def __init__(self):
        self.instance_member = Descriptor()

x = ClassWithDescriptors()

x.instance_member = 20   # This is just overriding instance_member with an integer
print x.instance_member  

staticmethod() and classmethod(). One interesting use case for descriptors is implementing the staticmethod() and classmethod() “functions”. We could write the following descriptors for each of these, respectively:

class StaticMethod(object):
    def __init__(self, f):
        self.f = f

    def __get__(self, obj, objtype=None):
        return self.f

class ClassMethod(object):
    def __init__(self, f):
        self.f = f

    def __get__(self, obj, objtype=None):
        if objtype is None:
            objtype = type(obj)
        def newfunc(*args):
            return self.f(objtype, *args)
        return newfunc
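
As a quick sanity check, here’s a minimal sketch using these two descriptors (the class and method names below are made up for illustration):

class Example(object):
    def _static_impl(x):
        print 'static got', x

    def _class_impl(cls, x):
        print 'class method got', cls, x

    my_static = StaticMethod(_static_impl)
    my_class = ClassMethod(_class_impl)

Example.my_static(1)   # static got 1
Example().my_class(2)  # class method got <class '__main__.Example'> 2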

Functors

Python has the concept of function objects or functors. A functor is an object that can be invoked as a function, as long as its class defines the __call__() method. A simple example is:

class Multiplier:
    def __init__(self, factor):
        self._factor = factor
    def __call__(self, a):
        return self._factor * a

double = Multiplier(2)
triple = Multiplier(3)
print double(5) # Prints 10
print triple(7) # Prints 21

One advantage of functors over regular functions is the capability of carrying context.

Decorators

Decorator is a design pattern in which we can modify the behavior of an object without modifying all the objects from that particular class. It’s also known as a wrapper, because we wrap the object inside another object that adds the desired behavior.

We’ll now describe a motivation for using decorators and how Python offers syntactic sugar for this pattern using annotations. For our example, imagine we have a function that takes three variables and returns their sum:

def add(a, b, c):
    return a + b + c

Now imagine we want to validate the parameters passed to add(). We could add that to the beginning of the function, but maybe we already have a function to do the validation for us. In this case, we could wrap the function in another one, which would first perform the parameter validation and then call the original function. The code below does that:

def validate(f):
    def closure(*args):
        for arg in args:
            assert isinstance(arg, int)
        return f(*args)
    return closure

validated_add = validate(add)

print validated_add(5, 2, 3)
# Throws an exception
print validated_add(5, 2, 3.0) 

validate() takes a function f() and wraps it inside closure(), which verifies all arguments are integers before invoking f(). So by passing add() to validate(), we’re getting a decorated function validated_add() which performs validation on the arguments.

Python offers syntactic sugar to do exactly that. By annotating a function with @something, it passes the function to a function named something(), which should act as a function transformation, and uses the transformed function instead.

@validate
def add(a, b, c):
    return a + b + c

Imagine we want to customize the validate function to enable us to define the type of each element passed to the decorated function.

Annotations with parameters. We can do that by passing parameters to the decorator, using this syntax @something(arg1, arg2, ...). In this form, we actually construct a decorator builder, so we need to wrap our validate() function inside another one:

def validate(*arg_types):
    def validate_impl(f):
        def closure(*args):
            assert len(arg_types) == len(args), "Arguments and types don't match"
            for i in range(len(args)):
                assert isinstance(args[i], arg_types[i])
            return f(*args)
        return closure
    return validate_impl

With this new validate() function, we can update the annotation to our simple add() function:

@validate(int, int, int)
def add(a, b, c):
    return a + b + c

We can now use this as a very simple type validation. Imagine we have a function that takes a string and repeats it N times. We can enforce the types using the same annotation:

@validate(str, int)
def rep(s, n):
    return "".join([s]*n)

Decorators can also be function objects, so we can define a class to do work similar to what the validate() function does:

class Validator:
    def __init__(self, *arg_types):
        self._arg_types = arg_types

    def __call__(self, f):
        arg_types = self._arg_types
        def closure(*args):
            assert len(arg_types) == len(args), "Arguments and types don't match"
            for i in range(len(args)):
                assert isinstance(args[i], arg_types[i])
            return f(*args)
        return closure

In this case, the parameters defined in the annotation are passed to the constructor of the class.

Finally, two very common annotations are staticmethod and classmethod. There is no magic here: they cause staticmethod() and classmethod() objects to be instantiated, and we saw how these work in previous sections.
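
For completeness, here’s a minimal sketch of both annotations in action (the Counter class is made up for illustration):

class Counter(object):
    count = 0

    def __init__(self):
        Counter.count += 1

    @staticmethod
    def description():
        print 'counts its own instances'

    @classmethod
    def how_many(cls):
        print cls.count

Counter()
Counter()
Counter.description() # counts its own instances
Counter.how_many()    # 2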

Conclusion

In this post we covered object oriented programming in Python. We learned details about classes, especially how they compare to classes in other languages. We then discussed inheritance, including multiple inheritance and old and new-style classes. We then delved into more advanced concepts related to object oriented programming. This included descriptors, functors and decorators.

Python requires relatively few special constructs to accomplish features from other object-oriented languages. Examples include the concept of super and static methods, which are implemented using general purpose features from Python like descriptors and decorators.

References

[1] The Python Tutorial – Classes
[2] Python.org Docs – New-style Classes
[3] The Python 2.3 Method Resolution Order
[4] Python Descriptors Demystified
[5] Python HOWTOs – Descriptor HowTo Guide

Revisiting Python: Functions

This is the second post in the series on revisiting Python. In the first post we discussed the motivation for these posts and started by revisiting Built-in Types.

In this post, we’ll talk about functions in Python 2.7.

Functions

Functions in python consist of the keyword def, a function name and a list of arguments:

def hello(arg1, arg2):
  print "hello world"

The pythonic way to document what a function does is through docstrings (a block of text delimited by triple double quotes):

def hello(arg1, arg2):
  """
    Prints hello world
  """
  print "hello world"

Some tools are able to parse these docstrings to automatically build documentation.

Namespace. A namespace is a map from names (variable names, function names) to objects (their values). Examples of namespaces include the built-in namespace, the global namespace within a module and the local namespace within a function.

Scope. A scope is a textual region of a Python program where a namespace is directly accessible.

For example, within a function we have a local scope. Variables from the arguments or created within the function itself belong to this local scope. Consider the following example:

d = 4
def foo(a):
    b = 1
    return abs(a - b)
foo(10)

In the code above, a and b are in the local scope, d is in the module’s global scope and the function abs is in the built-in scope. When we refer to a variable or function name, Python searches scopes in the following order:

1. Local scope
2. Local scope of enclosing functions
3. Global scope within a module
4. Built-in scope

Item 2 requires some clarification. In Python we can define a function inside another function. For example,

def outer():
    scope_name = 'outer'
    def inner():
        print(scope_name)
    return inner

scope_name = 'global'
# Get the inner function
inner = outer()
inner() # outer

In the code above, scope_name is defined both in the global scope and in the outer() function scope. When referenced within inner(), the outer scope takes precedence, so this will print 'outer'. Note how the scope is resolved statically, when defining inner(), not at runtime, which explains why the docs emphasize “textual region“. This is known as lexical scoping, as we discussed when studying the R language.

Variables that are not in the local scope are read-only by default. If we try to assign a new value to an existing variable in an outer scope, we are just creating a new local variable with the same name. Note that if the global variable is a reference to an object, we can modify the object, but we still can’t re-assign a value to that global variable.

x = [1, 2, 3]

def modify_global_variable_content():
    x.append(4) 

def reassign_global_variable():
    x = [10, 20, 30]

modify_global_variable_content()
reassign_global_variable()
print x # [1, 2, 3, 4]

In the code above, modify_global_variable_content() modifies the content of x, but reassign_global_variable() just creates a local variable x. To actually modify the variable we can make it explicit using the global keyword.

def reassign_global_variable():
    global x
    x = [10, 20, 30]

Functions can be assigned to variables and passed around, for example:

def inc(x):
    return x + 1

def print_f(f, x):
    print f(x)

print_f(inc, 2)

Here we pass the function inc(), which then is assigned to the variable f in print_f(), and used to invoke the original function.

Default arguments. It’s possible to define default arguments for functions, like in C++. The initialization only happens once and the value is shared between subsequent calls, so for mutable default values like lists, this is what happens:

def append(x, xs = []):
    xs.append(x)
    return xs

print append(1) # [1]
print append(2) # [1, 2]
print append(3) # [1, 2, 3]

Keyword arguments. The default match between the arguments passed to a function and the arguments the function receives is positional, that is, the first value is assigned to the first parameter and so on. Besides that, we can make the argument names explicit.

def listify(a, b, c):
    return [a, b, c]
print listify(b = 2, c = 3, a = 1) # [1, 2, 3]

There are some constraints though: 1. Positional arguments must come before keyword arguments. 2. If we use a keyword for a given argument, we have to name all required arguments that appear after it in the function definition.

def listify(a, b, c):
    return [a, b, c]

# Error: Violates rule 1
print listify(c = 3, 1, 2)

# Error: Violates rule 2
print listify(2, 3, a = 1)

This feature is one of my favorite in Python, which I miss from many other languages. This is particular useful in the scenario where a function takes many default arguments, but we only want to provide the last one.

Var args. Functions can define two different sets of variable arguments. An argument prefixed with * is assigned a tuple with all positional arguments passed beyond the required arguments; an argument prefixed with ** is assigned a dictionary with all keyword arguments not among the required arguments. A simple example:

def f(fixed, *arguments, **keywords):
    print fixed     # 1
    print arguments # (2, 3) 
    print keywords  # {'a': 4, 'b': 5}

f(1, 2, 3, a = 4, b = 5)

Argument unpacking. With an analogous idea, it’s possible to call a function with a list of arguments using the * modifier. For example:

def f(a, b, c):
    print a, b, c
args = [1, 2, 3] 
f(*args)

Or a map of keyword arguments using the ** modifier:

def f(a, b, c):
    print a, b, c
args = {'b': 2, 'c': 3, 'a': 1}
f(**args)

Generators

In the previous post we saw the iterator type. Remember that to create a custom iterator, we had to define a class and implement the required methods (__iter__() and next()).

We can also instantiate a generator from a function, by using the yield keyword. For example,

def sample_generator(n):
    for i in range(n):
        yield i

The result of calling the function is an iterator. Every time we call next(), we execute the function until it reaches a yield statement, in which case it returns that value.

it = sample_generator(10)
it.next() # 0
it.next() # 1
...

Generators are particularly useful if the iterator has to work in sequential steps that perform different actions at each step. For example, consider the following iterator:

class MyIterator:
    def __init__(self, x):
        self.step = 0
        self.x = x

    def __iter__(self):
        return self

    def next(self):
        if self.step == 0:
            ret = self.x
        elif self.step == 1:
            ret = 2*self.x
        elif self.step == 2:
            ret = self.x*self.x
        else:
            raise StopIteration

        self.step += 1
        return ret

This is a toy example that runs different functions of x based on which stage it is on. Using generators, this would simply become:

def my_generator(x):
    yield x
    yield 2*x
    yield x*x

A generator can use the return statement (without a value) to stop the iterator, which is equivalent to raising StopIteration.
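
For example, here’s a small sketch of a generator that uses return to stop early (bounded() is a made-up name):

def bounded(x, limit):
    while True:
        if x > limit:
            return # same effect as raise StopIteration
        yield x
        x += 1

print list(bounded(8, 10)) # [8, 9, 10]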

It’s possible to send values back to generators through the method send(). Consider the following example:

def generator_with_sends(x):
    while True:
        y = (yield x)
        if y is not None:
            x = y

it = generator_with_sends(5)
print it.next()   # 5
print it.send(10) # 10
print it.next()   # 10

Note that it.send(10) returns 10, not 5. This is because yield halts when the right-hand side of the statement is evaluated, but before the assignment. Calling send() resumes from there, but this time the return value of (yield x) is the value passed to send(). The code then executes until the next yield, at which point x contains the value that was assigned to y. It’s not allowed to call send() before the generator has yielded for the first time.

When a generator is garbage-collected, a GeneratorExit exception is raised inside it. It’s possible to force that by calling the close() method.

def stubborn_generator():
    try:
        while True:
            yield 'no'
    except GeneratorExit:
        print 'generator stopped. clean up...'
        raise StopIteration
    finally:
        print 'can do some shared clean up here'

def scoped():
    s = stubborn_generator()
    print s.next()

scoped()

Map, filter, reduce

Python has built-in functions that work with sequences: map(), filter() and reduce() being the most common. These functions work with iterators too. For example:

# map_generator.py
def generate_ints(n):
    for i in range(n):
        yield i

def double(x):
    return x*2

print map(double, generate_ints(5)) 
# [0, 2, 4, 6, 8]

The problem is that map() has to convert the generator to a list, so it has to evaluate the generator until it stops; thus we can’t do this with an “infinite” iterator, like the one from this generator:

def generate_forever():
    i = 0
    while True:
        yield i
        i += 1

The alternative is using the itertools library, which contains the imap() and ifilter() functions, which return new iterators with the function applied lazily:

import itertools

h = itertools.imap(double, generate_forever())
for i in range(10):
    print h.next()

We can see generators as lazily evaluated functions, which is the default behavior in languages like Haskell.

Partial function application

One very useful functional paradigm is partial application. It’s available in the functools module. For example, we can re-write our double() (from map_generator.py) function as follows:

# partial.py
import operator
import functools

double = functools.partial(operator.mul, 2)

Here, operator.mul is just the function version for the * operator, that is,

def mul(a, b):
    return a*b

The code in partial.py returns another function in which the first argument to operator.mul is bound to 2. A nice thing about the partial() function is that it also accepts keyword arguments:

def f(a, b, c, d):
    return [a, b, c, d]
g = functools.partial(f, b=2, c=3)
print g(a=1, d=4) # [1, 2, 3, 4]

Conclusion

In this second post we covered functions in Python. We learned about namespaces, scopes and variable number of arguments. We also learned about generators and increased our knowledge on iterators by introducing the itertools module. Finally, we saw that Python can work with higher-order functions by using libraries like functools.


Revisiting Python: Basic Types

I’ve been using Python for quite a while now, but always on pet projects, small scripts, programming contests and I feel that my knowledge didn’t really improve much. I had a chance to work on a Python project and realized there are many basic things I don’t know.

Every time I dedicate some time to learn and write about some subject, I feel like my knowledge about it improves a lot. Thus, I decided to revisit some Python concepts and write a series of posts about things I didn’t know in Python 2.7. The first of these posts will be about the built-in types.

Boolean Types

Boolean types can take True and False as values. Falsy values include False, 0, empty containers (string, list, dictionary) and instances of any class that implements __len__() and returns 0.
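
As a quick illustration of the last case (the EmptyBox class is made up):

class EmptyBox:
    def __len__(self):
        return 0

if not EmptyBox():
    print 'falsy' # prints 'falsy'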

Logical operators. Many languages use && and ||. In Python the operators are and and or, and they work the same way (short-circuiting).

Negation is also more verbose: not instead of !. Another operator, in, tests whether an element belongs to a sequence/container, or whether a string is a substring of another:

> a = [1, 2, 3]
> 3 in a
True
> b = {'a': 'apple', 'b': 'banana'}
> 'b' in b
True
> 'banana' in b
False
> 'banana' not in b
True
> 'ban' in 'banana'
True

Note that it can be combined with the not operator for better readability. Another operator is the is operator (similar to triple equals in PHP or Javascript). It only returns True if both operands point to the same object. For small integers and short string literals it often agrees with == because CPython caches and reuses those objects, but that is an implementation detail and shouldn’t be relied upon:

> a = [1, 2]
> b = [1, 2]
> c = a
> a is b
False
> a is c
True
> [1,2] is [1,2]
False
> "abc" is "abc"
True
> (1-1) is 0
True
> 1 is not 0
True

This operator can also be combined with the not operator.

Comparison operators. They work the same as in most languages. In Python, they also work with custom classes if these classes implement the following methods: __lt__(), __le__(), __gt__(), __ge__(), or the __cmp__() method.
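
As a small sketch, a made-up Money class only needs __lt__ to support the < operator:

class Money:
    def __init__(self, amount):
        self.amount = amount
    def __lt__(self, other):
        return self.amount < other.amount

print Money(5) < Money(10)  # True
print Money(10) < Money(5)  # False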

Numeric Types

Numeric types can be one of the following: int (plain integer), float (floating point number), long (long integer) and complex. ints are like C++’s long, with at least 32 bits of precision. floats are equivalent to C++’s double, usually 64 bits: 52 explicit bits for the mantissa (53 with the implicit leading bit), 11 bits for the exponent and 1 bit for the sign. longs have unlimited precision. complex is a pair of floats (named real and imag).

Math operations. Most operators are the same as in other languages. A different one is //, which performs floored division. For comparison:

> 11./2
5.5
> 11.//2
5.0
> 11/2
5

Note that if both values are integers, the division is an integer division. divmod() is also interesting: divmod(a, b) = (a // b, a % b).
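
For example:

> divmod(11, 2)
(5, 1)
> divmod(11., 2)
(5.0, 1.0)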

Bitwise operators work as in C++.

Numeric types are classes too (int and long implement numbers.Integral). We can invoke methods on variables, but invoking them directly on a numeric literal like 6.bit_length() is a syntax error, because the parser reads 6. as the start of a float; wrapping the literal in parentheses, (6).bit_length(), works:

> 6.bit_length()
Syntax error
> b = 6
> b.bit_length()
3

Iterator Types

Python supports a concept of iteration over containers. Classes that implement the __iter__() and next() methods are of type iterator. The next() method should return the current value and proceed to the next value, using raise StopIteration when the end is reached.

We can implement our own (x)range function as an example:

class Range:
    def __init__(self, start, end = None, step = 1):
        if (end is None):
            end = start
            start = 0
        self.start = start
        self.end = end
        self.step = step
    def __iter__(self):
        return Iterator(self.start, self.end, self.step)

class Iterator:
    def __init__(self, start, end, step):
        self.end = end
        self.step = step
        self.counter = start
    def __iter__(self):
        return self
    def next(self):
        if self.counter >= self.end:
            raise StopIteration
        value = self.counter
        self.counter += self.step
        return value

for i in Range(1, 10, 2):
    print i

Sequence Types

The types that fall in this family of type are str, unicode, list, tuple, bytearray, buffer, xrange.

Strings. A str is a sequence of characters, each an 8-bit encoded ASCII value (strings have some overhead besides the characters [1]). String literals can be written in single or double quotes.

Formatting strings: Python accepts a syntax similar to sprintf from C. One interesting form is passing a dictionary of values and naming the patterns by key name:

> print '%(language)s has %(number)03d quote types.' % \
...       {"language": "Python", "number": 2}
Python has 002 quote types.

There is also an alternative formatting using the .format() method. A discussion can be read here.

Unicode strings. Python has a separate unicode string type for text. Literals of this type can be created by prefixing the value with a u, for example u'abc'.

Tuples. Tuples are shallowly “immutable” containers. Their contents can’t be changed, but the objects their elements point to might be. They can be written without parentheses and can be used on the LHS of an assignment to unpack values. For example:

> a, b = 1, 2
> a
1
> b
2

Here, (1, 2) and (a, b) are tuples. Tuples also support bracket access, for example:

> x = 1, 2
> x[0]
1

Tuples are hashable if all their elements are hashable. This allows using tuples in sets or as dictionary keys.
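
For example, a tuple can be used as a dictionary key, but a tuple containing a mutable element like a list cannot:

> positions = {(0, 0): 'origin', (1, 2): 'point'}
> positions[(1, 2)]
'point'
> {([1], 2): 'invalid'}
TypeError: unhashable type: 'list'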

Lists. Lists are mutable sequences. They’re indexed from 0 to length-1 and access outside this range throws an exception. The + operator can be used to concatenate lists. Conveniently, the * operator, where the first operand is a list and the second operand is an integer N, concatenates N shallow copies of the list (this works for tuples too).

Lists can also be accessed using slices, in which case another list is returned, for example:

> x = range(10) # [0, 1, ..., 9]
> x[0:5]
[0, 1, 2, 3, 4]

We must be careful in making copies of arrays where the elements are references to objects (for example other lists). In the example below, it’s likely not doing what we would want:

> lists = [[]] * 3
> lists
[[], [], []]
> lists[0].append(3)
> lists
[[3], [3], [3]]

Xranges. The existence of the xrange type is justified by the xrange() function. The python docs explain it well:

This function is very similar to range(), but returns an xrange object instead of a list. This is an opaque sequence type which yields the same values as the corresponding list, without actually storing them all simultaneously. The advantage of xrange() over range() is minimal (since xrange() still has to create the values when asked for them) except when a very large range is used on a memory-starved machine or when all of the range’s elements are never used (such as when the loop is usually terminated with break).

Bytearrays. Bytearrays are essentially mutable strings.
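
For example, unlike a str, a bytearray can be modified in place:

> b = bytearray('hello')
> b[0] = ord('H')
> b
bytearray(b'Hello')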

Buffer. The buffer type is intended for memory-efficient manipulation of large arrays, which otherwise would require a copy. Guido van Rossum describes an example [2]:

It was created with the desire to avoid an expensive memory-copy operation when reading or writing large arrays. For example, if you have an array object containing several millions of double precision floating point numbers, and you want to dump it to a file, you might prefer to do the I/O directly from the array’s memory buffer rather than first copying it to a string.

This Stack Overflow question also discusses the subject.

Set types

Set and Frozenset. The main difference between these two is that set is mutable while frozenset is immutable. These structures are implemented using a hash table, so all elements in a set/frozenset must be hashable. A type is hashable if it implements __hash__() (whose result shouldn’t change during the object’s lifetime) and either __eq__() or __cmp__(). Since frozenset is immutable and all its elements are hashable, frozenset itself is hashable.

In Python 2.7, sets can be constructed by a shorthand {'foo', 'bar'}.

Map types

Dictionaries are associative arrays. They’re called dict in Python code. Dictionary keys must be hashable.

We can get a dictionary’s keys and values by .keys() and .values() respectively. The .items() method returns an array of pairs, each containing key and value. These methods all return copies, so if we assign .keys() to a variable and make changes to the dictionary, the changes won’t get reflected in the list assigned to the variable.

To get references instead of copies, we can use the .viewkeys(), .viewvalues() and .viewitems() methods, which are read-only references.
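
For example, a copy returned by .keys() does not see later changes, while a view returned by .viewkeys() does:

> d = {'a': 1, 'b': 2}
> keys = d.keys()       # a copy
> view = d.viewkeys()   # a read-only view
> d['c'] = 3
> 'c' in keys
False
> 'c' in view
True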

File Objects

Returned by the open() function. It’s used for file manipulation operations.

Memory View

It was introduced in Python 2.7 and is a replacement for the buffer type.

Context Manager Types

Any user defined class that implements __enter__() and __exit__() is a context manager type. These types are useful for abstracting a specific try/finally pattern. More specifically, imagine we have the following pseudo-code:

set things up
try:
  do something
finally:
  tear things down

If we do “set things up” and “tear things down” in many places and only change “do something”, we can abstract those in a class implementing a context manager type:

class ControlledExecution:
  def __enter__(self):
    set things up
    return thing
  def __exit__(self, type, value, traceback):
    tear things down

with ControlledExecution() as thing:
  do something

This post from Effbot has a very clear explanation.
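
As a concrete, runnable sketch of this pattern, here is a hypothetical Timer context manager that reports how long the block took:

import time

class Timer:
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, type, value, traceback):
        print 'block took %.3f seconds' % (time.time() - self.start)
        return False # don't suppress exceptions

with Timer():
    sum(xrange(1000000))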

Conclusion

In this post we covered the basic types from the Python environment. We were able to learn some interesting features even from basic types like booleans and numerics. We also covered some more esoteric types like buffer and xrange. We got some exposure to other features like context managers.

In the next post in the series we’ll talk about functions.

References

[1] Theano – Python Memory Management
[2] Python-Dev – The buffer interface
[3] Python Docs – Built-in Types

Bloom Filters

Burton Howard Bloom is an MIT alumnus who, while employed by the Computer Usage Company, developed and published a data structure that later became famous as the “Bloom Filter” [1].

In this post we’ll talk about Bloom Filters. We’ll give a description of the data structure and its operations. We’ll then study the theory that explains the performance of this data structure. Next, we’ll describe an implementation in Python, run some experiments and finally discuss applications.

Introduction

A Bloom filter is a probabilistic data structure that enables inserting elements into a set and testing whether an element belongs to this set, sacrificing accuracy for lower memory usage.

More precisely, it can say an element is in the set when in fact it’s not (a false positive), but if it says an element is not in the set, then it’s truly absent. Also, the original variant doesn’t allow removal of elements [2].

Algorithm

The Bloom filter structure consists of a bit array of size m. We then define k independent hash functions, each of which distributes a given value uniformly over the m positions.

Inserting an element consists of applying the k hash functions to the element and setting the bits in the k positions returned by these functions to 1.

Querying for an element consists of applying the same k hash functions and testing whether all the bits in the k positions returned by these functions are set to 1.

If the test returns positive, there’s a chance the element is there, but it could be a false positive, since the bits could have been set by the insertion of other elements. On the other hand, if it returns negative, we’re sure the element is not there, because we never unset a bit. Also, as the number of elements in the set grows, the probability of false positives increases.
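
Putting the two operations together, a minimal sketch of the structure could look like this, assuming we are given a list hash_functions with k functions, each mapping an element to a position in [0, m):

class BloomFilter:
    def __init__(self, m, hash_functions):
        self.bits = [False] * m
        self.hash_functions = hash_functions

    def insert(self, element):
        for h in self.hash_functions:
            self.bits[h(element)] = True

    def contains(self, element):
        # may return a false positive, but never a false negative
        return all(self.bits[h(element)] for h in self.hash_functions)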

Time complexity. Both insertion and querying for an element are constant-time operations, since they are proportional to k, assuming random access memory and O(1) hash functions.

Probability of false positives. The probability p that the algorithm will return true when the element is not there (false positive), can be described in terms of m, k, n, through the following equation:

p = (1 - e^{-kn/m})^k

In appendix A, we provide an approximate derivation of this result.

Optimal number of hash functions. For given values n and m, we have a choice of how many hash functions to use, k. Intuitively, if we use too few hash functions, we may increase the chances of collision when querying for an element, whereas if we use too many, the array will be filled up earlier and thus the collisions will also increase.

It’s possible to prove (a proof is sketched in appendix B), that for fixed m, n, the value of k that minimizes p is given by:

k = ln(2)m/n
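
As a small sanity check of these formulas (using n=250 and m=1000, the setting of the experiments below):

import math

def false_positive_rate(n, m, k):
    return (1 - math.exp(-float(k) * n / m)) ** k

def optimal_k(n, m):
    return math.log(2) * m / n

print optimal_k(250, 1000)              # ~2.77, so we'd use k = 3
print false_positive_rate(250, 1000, 3) # ~0.147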

Python Implementation

In this experiment, we use the bitarray python module. It’s a C implementation that represents a bit array efficiently. It has conveniently overloaded operators so most of the time it’s like working with arrays.

We can define a sufficiently large value of our bit array length, m, for example:

from bitarray import bitarray
m = 1 << 20
bit = bitarray(m)

# bitarray doesn't guarantee the bits are all set to 0
# upon initialization
bit.setall(False)

The murmur algorithm

We need to pick a hash function to use with our Bloom filter. References such as [3] suggest using an algorithm called Murmur. This stack exchange thread has a nice survey about different hash functions.

Murmur is a hash algorithm developed by Austin Appleby. Even though it’s not suitable for cryptographic applications, in practice it is very fast and tends to distribute real world instances well.

According to the author, the name comes from an early implementation detail, which used multiply-rotate-multiply-rotate operations to mix the internal state of the hash (hence MuR-MuR).

From the source code, this algorithm seems to shuffle the bits from the input key by using multiplication, shifting and xor operations.

The author mentions using simulated annealing to find some of the constants of the final mix of algorithm. This final part is used to cause the avalanche effect, in which very similar inputs are hashed to very different values.

There is a Python library called pyhash that has an interface to several hash functions (implemented in C++).

To install it, we can use the easy_install command. Easy Install is a Python module used to download packages. pyhash in particular is available in the default repository, PyPI (the Python package index), so all we need to do is:

> sudo easy_install pyhash

To use it within a Python script:

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
hash_value = hasher(input)

Families of hash functions

In Jonathan Ellis‘s post, he mentions a strategy for generating a family of hash functions. He cites a paper from Kirsch and Mitzenmacher [4] which shows we can generate a set of k hash functions from 2 independent hash functions in the following way:

h_i(x) = h_A(x) + i \cdot h_B(x), \, i = 0, \cdots, k-1

In practice, since murmur3_x64_128() generates 128-bit hashes, we can use the first 64-bits as the first hash function, and the last 64-bits as the second.

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
h128 = hasher(input)
h64l = h128 & ((1L << 64) - 1)
h64u = h128 >> 64

# k is the number of hash functions, m the size of the bit array
hashes = map(
  lambda i: (h64l + i*h64u) % m,
  range(k)
)

All the code is available on github.

Experiments

Before running experiments with our Bloom filter implementation, let’s try to visualize the distribution of the Murmur hash functions. For this first experiment, we compute the hash of keys ranging from 1 to 10k and take it modulo 10k, so all values fall in the same range. Then we plot the keys vs. their hashes in a scatter plot, by exporting this data as a CSV file and rendering it in R (using ggplot2):

ggplot(data, aes(x=i, y=h_i)) + geom_point(size=1.5) 

(Figure: scatter plot of keys vs. their Murmur hash values)

From the chart, we can see it apparently distributes the keys uniformly, without any obvious pattern. This is not sufficient to prove it’s a good hash function, but it’s a necessary property.

To test the family of hashes, we can add a new column containing the index i of the i-th hash function. Then, in ggplot, we render each class with a different color with the following command:

ggplot(data, aes(x=i, y=h_i, color=class)) 
  + geom_point(size=1.5) + facet_grid(class ~ .) 

(Figure: one panel per hash function in the family)

We can now compare the distributions between each of the hashes by plotting scatter plots for each pair i, j. Now we use another ggplot2 command to generate a 5×5 grid of charts:

ggplot(data, aes(x=i, y=h_i, color=class)) 
  + geom_point(size=1.5) + facet_wrap(~class, ncol=5) 

(Figure: 5x5 grid of pairwise scatter plots between hash functions)

We can see each chart has a uniform pattern, which is a good sign if we’re looking for independent hash functions. For comparison, let’s plot the same chart for a set of hash functions derived only from a single hash function, for example

h_i(x) = i \cdot h_A(x), i = 1, \cdots, k

(Figure: scatter plots for hash functions derived from a single hash function)

Now, let’s analyze the false positive rate. We can do that by counting the number of false positives FP and the true negatives TN and evaluating:

\dfrac{FP}{FP + TN}

We fix m and k and shuffle an array of the integers 0 to n-1 to simulate sampling without replacement. After inserting each element of this array, we go over all non-inserted elements (FP + TN) and count how many of them our Bloom filter thinks are in the set (FP). We then plot a line chart of the number of inserted elements vs. the false positive rate, using the qplot() R function:

qplot(
  data$n, 
  data$prob, 
  ylab='Pr(false positive)', 
  xlab='N elements'
)

Chart: false positive rate vs. number of inserted elements
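
For reference, here is a minimal sketch of this experiment loop in Python. The BloomFilter class and its add()/contains() methods are assumed names for illustration; they are not necessarily the ones used in the actual implementation on GitHub.

import random

# Hypothetical interface: BloomFilter(m, k) exposing add(x) and contains(x)
def false_positive_curve(bloom_filter, n):
    universe = list(range(n))
    random.shuffle(universe)  # sampling without replacement

    rates = []
    inserted = set()
    for x in universe:
        bloom_filter.add(x)
        inserted.add(x)

        # Count false positives among the elements not inserted yet
        remaining = [y for y in universe if y not in inserted]
        if not remaining:
            break
        fp = sum(1 for y in remaining if bloom_filter.contains(y))
        rates.append(fp / float(len(remaining)))  # FP / (FP + TN)
    return rates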

Let’s plot the effect of increasing the size of the bit array, using the optimal k and increasing the number of elements. In this particular experiment, we use n=250 and try different values of m: 100, 250, 500, 1000

qplot(
  data$n, 
  data$prob, 
  ylab='Pr(false positive)', 
  xlab='N elements', 
  size=I(1), 
  color=data$m
) + scale_colour_manual(
  values=c("red", "blue", "green", "black"), 
  name="m values"
)

Chart: false positive rate for different bit array sizes m (n=250, optimal k)

One strange behaviour seems to happen towards the right end of the chart for some curves: the false positive rate seems to drop. I've double-checked the code and it looks sane. One possible explanation is that we are calculating the rate over the non-inserted elements, and as we approach the end the sample gets smaller, so the noise is more significant. Other than that, the probability of false positives tends to increase as we add more and more elements.

Now, let's analyze the effect of the number of hash functions used. We fix n=250, m=1000 and try different values of k: 1, 5, 10, 50.

Chart: false positive rate for different numbers of hash functions k (n=250, m=1000)

We can see that using too many hash functions can be bad because it fills up the array too fast (k=50). In general, k=5 and k=10 perform best for most of the range.

Applications

Bloom filters are suitable where false positives are tolerable. For example, they can be used as a first step to avoid a lookup to a cache. Periodically, a cache server could send a Bloom filter to a local machine, corresponding to the items it has cached. If the Bloom filter says an element is in the cache, it might not be true, but if it says the item is not there, the local client can skip the request to the cache and talk directly to the underlying storage.
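
As a rough sketch of this pattern (the local_bloom, cache_client and storage interfaces below are hypothetical):

def get(key, local_bloom, cache_client, storage):
    # If the Bloom filter says the key is definitely not cached,
    # skip the round-trip to the cache server entirely
    if not local_bloom.contains(key):
        return storage.get(key)
    value = cache_client.get(key)  # may still miss (false positive)
    if value is None:
        value = storage.get(key)
    return value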

Bloom filters use little memory, so they are suitable for network transmission.

Broder and Mitzenmacher discuss a lot more examples [5], especially network applications.

Conclusion

In this post we learned about Bloom filters. The idea itself is quite simple, and by studying the theory we were able to review some probability and calculus concepts.

The implementation made us aware of details which are usually not discussed in theoretical texts, for example, which hash functions to use and how to generate a family of hash functions.

We used two Python libraries, pyhash and bitarray, and learned a little bit about the Python packaging system, PyPI. We also got some experience with the ggplot2 R library, which I plan to post about later.

What we didn't cover were variants of Bloom filters, like the counting version, which allows deletion. Chazelle et al. introduced Bloomier filters, a generalization of Bloom filters [6].

References

[1] Quora – Where can one find a photo and biographical details for Burton Howard Bloom, inventor of the Bloom filter?
[2] Wikipedia – Bloom Filter
[3] Spyced – All you ever wanted to know about writing bloom filters
[4] Less hashing, same performance: Building a better Bloom filter – A. Kirsch, M. Mitzenmacher
[5] Network Applications of Bloom Filters – A. Broder, M. Mitzenmacher
[6] Bloomier Filters – B. Chazelle et al.
[7] Probability and Computing: Randomized Algorithms and Probabilistic Analysis – M. Mitzenmacher, E. Upfal

Appendix A: Probability of false positives

Let's consider an array of m bits and the insertion operation. The probability that a given bit is set by one of the hash functions is 1/m. Conversely, the probability of a bit not being set is 1 - 1/m. By the hypothesis of independent hash functions, the probability of a bit not being set by any of the k hash functions is

\left(1 - \frac{1}{m}\right)^k

Inserting n elements is equivalent to running the k functions n times, so after inserting n elements, the probability that a given bit is set to 0 is

\left(1 - \frac{1}{m}\right)^{kn}

and the probability that it's set to 1 is

1 - \left(1 - \frac{1}{m}\right)^{kn}

Now, the probability of a false positive, p, is the probability that all k positions are set for an element that was never inserted.

p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k

According to [2] though, this derivation is not accurate, since it assumes that the probability of each bit being set is independent of the others. The Wikipedia article sketches a more precise proof, from Mitzenmacher and Upfal [7], which arrives at the same result.

To simplify the equation, we’ll use the following limit:

\lim_{x\to\infty} \left( 1 - \frac{1}{x} \right)^x = \frac{1}{e}

So, for large values of m, we can assume that

\left(1 - \frac{1}{m}\right)^{kn} = \left(\left(1 - \frac{1}{m}\right)^{m}\right)^{kn/m} \approx \left(\frac{1}{e}\right)^{kn/m}

Finally we have:

p = \left(1 - e^{-kn/m}\right)^k

Appendix B: Optimal number of hash functions

If we make p a function of k, it’s possible to prove it has a global minimum for positive k‘s so we can use the derivative to find the value of k that minimizes the false positive probability.

Since the logarithm is monotonic, minimizing p is equivalent to minimizing \ln p = k \ln(1 - e^{-ck}). Differentiating with respect to k, we get

\ln(1 - e^{-ck}) + \frac{k c e^{-ck}}{1 - e^{-ck}}

where c = n/m is a constant. If we substitute y = e^{-ck} (so that k = -\ln(y)/c), this becomes

\ln(1 - y) - \frac{\ln(y)}{(1 - y)} y

Setting this to 0, we get the equation:

\ln(1 - y) (1-y) = \ln(y) y

To solve this for y, let x = 1 - y,

\ln(x) x = \ln(y) y

Since z and \ln(z) are both increasing functions, so is z \ln(z). Then, if the equality above holds, we can conclude that x = y, because otherwise, if x > y then \ln(x) x > \ln(y) y, and if x < y then \ln(x) x < \ln(y) y.

Replacing x back, we find that y = 1/2 and finally k = \ln(2) \cdot m/n.
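
To make these formulas concrete, here's a small Python helper (not part of the implementation above) that computes the optimal k and the approximate false positive probability for given n and m:

from math import exp, log

def optimal_k(m, n):
    # k = ln(2) * m / n, rounded to the nearest integer >= 1
    return max(1, int(round(log(2) * m / n)))

def false_positive_prob(m, n, k):
    # p = (1 - e^{-kn/m})^k
    return (1 - exp(-float(k) * n / m)) ** k

# Example with the values used in the experiments: n=250, m=1000
m, n = 1000, 250
k = optimal_k(m, n)
print("k = %d, p = %.3f" % (k, false_positive_prob(m, n, k)))  # k = 3, p ~ 0.147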

The Algorithm X and the Dancing Links

Introduction

In his paper Dancing Links [1], Donald Knuth shows an algorithm that can be used to solve puzzles like Sudoku via backtracking in an efficient way.

The backtracking algorithm is named simply Algorithm X for lack of a better name [1] and because it's very simple and not the focus of the paper.

The main concept is actually a data structure used to implement the algorithm X. It’s a sparse matrix where Knuth uses some clever tricks to make removing/restoring columns and rows efficient and in-place operations. He refers to these operations as dancing links, in allusion to how the pointers from the cells change during these operations.

In this post we'll describe in more detail the problem we are trying to solve and then present the idea of Algorithm X. We'll follow with the description of the data structure and how the main steps of the algorithm can be implemented using dancing links.

Finally, we'll present a simple example implementation in Python.

Set Cover

The Sudoku puzzle can be modeled as a more general problem, the set cover problem.

Given a set of items U and a set S of sets, each of which covers some subset of U, the set cover problem here consists in finding a subset of S such that each element is covered by exactly one set (this variant is also known as the exact cover problem). This problem is known to be NP-complete.

The set cover can be viewed as a binary matrix where the columns represent the elements to be covered and the rows represent the sets. An entry 1 in cell (i, j) means that set i covers element j.

The objective is then to find a subset of the rows such that each column has an entry 1 in exactly one of the chosen rows.

In fact, this is the constraint matrix of a common integer linear programming formulation for this problem.

The Algorithm X

Knuth’s algorithm performs a full search of all possible solutions recursively, in such a way that at each node of the search tree we have a submatrix representing a sub-problem.

At a given node, we try adding a given set to our solution. We then discard all elements covered by this set and also discard all other sets that cover at least one of these elements, because by definition one element can't be covered by more than one set, so we are sure these other sets won't be in the solution. We then repeat the operation on the remaining subproblem.

If, at any point, there’s an element that can’t be covered by any set, we backtrack, trying to pick a different set. On the other hand, if we have no element left, our current solution is feasible.

More formally, at a given node of the backtracking tree, we have a binary matrix M. We first pick some column j. For each row i such that M_{ij} = 1, we try adding i to our current solution and recurse over a submatrix M', constructed by removing from M all columns j' such that M_{ij'} = 1, and all rows i' such that M_{i'k} = M_{ik} = 1 for some column k.
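
Before introducing the data structure, here's a minimal sketch of this recursion using plain Python sets instead of a matrix (the representation and names here are just for illustration; the matrix-based, dancing-links implementation comes later in the post):

def algorithm_x(sets, elements):
    # `sets` maps a name to the frozenset of elements it covers;
    # `elements` is the set of elements still uncovered.
    # Returns a list of set names forming an exact cover, or None.
    if not elements:
        return []
    e = next(iter(elements))  # pick some uncovered element
    for name, s in list(sets.items()):
        if e not in s:
            continue
        # Keep only the sets disjoint from s; drop the elements covered by s
        remaining = {n: t for n, t in sets.items() if not (t & s)}
        rest = algorithm_x(remaining, elements - s)
        if rest is not None:
            return [name] + rest
    return None

# Example: cover {0, 1, 2} with A={0,1}, B={1}, C={2} -> ['A', 'C']
print(algorithm_x({'A': frozenset({0, 1}),
                   'B': frozenset({1}),
                   'C': frozenset({2})}, {0, 1, 2}))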

Dancing Links

A naive implementation of the above algorithm would scan the whole matrix to generate the submatrix and it would store a new matrix in each node of the search tree.

Knuth’s insight was to represent the binary matrix as a doubly linked sparse matrix. As we’ll see later, this structure allows us to undo the operations we do for the recursion, so we can always work with a single instance of such sparse matrix.

The basic idea of a sparse matrix is to create a node for each non-zero entry and link it to its adjacent cells in the same column and in the same row.

In our case, our nodes (depicted in green in Fig. 1) are doubly linked and form a circular chain. We also have one header node (blue) for each column that is linked to the first and last nodes of that column and a single header (yellow) that connects the header of the first column and the last one.

Here’s an example for the matrix \begin{pmatrix}0 & 1\\1 & 1\end{pmatrix}:

Figure 1: Sparse matrix example

Note that the pointers that are coming/going to the border of the page are circular.

For each node we also have a pointer to the corresponding header of its column.

Removing a node. A known way to remove or de-attach a node from a doubly linked circular list is to make its neighbors point to each other:

node.prev.next = node.next
node.next.prev = node.prev

Restoring a node. Knuth tells us that it’s also possible to restore or re-attach a node to its original position assuming we didn’t touch it since the removal:

node.prev.next = node
node.next.prev = node

Removing a column. For our algorithm, removing a column is just a matter of de-attaching its corresponding header from the other headers (not from the nodes in the column), so we call this a horizontal de-attachment.

Removing a row. To remove a row, we want to de-attach each node in that row from its vertical neighbors, but we are not touching the links between nodes of the same row, so we call this a vertical de-attachment.

A Python implementation

We’ll present a simple implementation of these ideas in Python. The complete code can be found on Github.

Data Structures

Our basic structures are nodes, representing cells, and heads, representing columns (plus a special head sentinel). We need all four links (left, up, right, down), but we don't need to declare them explicitly because Python allows us to set them dynamically. Each node also has an additional field pointing to the corresponding header. The main difference between the two structures is that we only attach/de-attach nodes vertically and heads horizontally, so we have different methods for them:

class Node:
    def __init__(self, row, col):
        self.row, self.col = row, col

    def deattach(self):
        self.up.down = self.down
        self.down.up = self.up
        
    def attach(self):
        self.down.up = self.up.down = self

class Head:
    def __init__(self, col):
        self.col = col

    def deattach(self):
        self.left.right = self.right
        self.right.left = self.left
        
    def attach(self):
        self.right.left = self.left.right = self

Now we need to build our sparse matrix out of a regular python matrix. We basically create one node for each entry 1 in the matrix, one head for each column and one global head. We then link them with helper functions:

class SparseMatrix:

    def createLeftRightLinks(self, srows):
        for srow in srows:
            n = len(srow)
            for j in range(n):
                srow[j].right = srow[(j + 1) % n]
                srow[j].left = srow[(j - 1 + n) % n]
            
    def createUpDownLinks(self, scols):
        for scol in scols:
            n = len(scol)
            for i in range(n):
                scol[i].down = scol[(i + 1) % n]
                scol[i].up = scol[(i - 1 + n) % n]
                scol[i].head = scol[0]
        
    def __init__(self, mat):
        
        nrows = len(mat)
        ncols = len(mat[0])
    
        srow = [[ ] for _ in range(nrows)]
        heads = [Head(j) for j in range(ncols)]        
        scol = [[head] for head in heads]

        # Head of the column heads
        self.head = Head(-1)
        heads = [self.head] + heads
            
        self.createLeftRightLinks([heads])
            
        for i in range(nrows):
            for j in range(ncols):
                if mat[i][j] == 1:
                    node = Node(i, j)
                    scol[j].append(node)
                    srow[i].append(node)

        self.createLeftRightLinks(srow)
        self.createUpDownLinks(scol)
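
As a quick sanity check (not in the original post), we can build the sparse matrix for the example in Figure 1 and follow a couple of links:

# The matrix from Figure 1
smat = SparseMatrix([[0, 1],
                     [1, 1]])

first_col = smat.head.right   # header of column 0
print(first_col.col)          # 0
print(first_col.down.row)     # 1: the only 1-entry in column 0 is in row 1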

Iterators

We were repeating the following code in several places:

it = node.left
while it != node:
  # Do some stuff
  it = it.left

With left eventually replaced by right, up or down. So we abstract that using an iterator:

class NodeIterator:

    def __init__(self, node):
        self.curr = self.start = node

    def __iter__(self):
        return self

    def next(self):
        _next = self.move(self.curr)
        if _next == self.start:
            raise StopIteration
        else:
            self.curr = _next
            return _next

    # Python 3 uses __next__ for the iterator protocol
    __next__ = next

    def move(self, node):
        raise NotImplementedError

This basically goes through a linked list using a specific move operation. So we can implement specific iterators for each of the directions:

class LeftIterator (NodeIterator):
    def move(self, node):
        return node.left

class RightIterator (NodeIterator):
    def move(self, node):
        return node.right

class DownIterator (NodeIterator):
    def move(self, node):
        return node.down

class UpIterator (NodeIterator):
    def move(self, node):
        return node.up

Then, our previous while block becomes

for it in LeftIterator(node):
  # Do some stuff

Algorithm

With our data structures and syntactic-sugar iterators set, we're ready to implement our backtracking algorithm.

The basic operation is the covering and uncovering of a column. Covering a column consists in removing that column and also all the rows in its column list (remember that a column must be covered by exactly one row, so we can remove the other rows from the candidate list).

class DancingLinks:

    def cover(self, col):
        col.deattach()
        for row in DownIterator(col):
            for cell in RightIterator(row):
                cell.deattach()

    def uncover(self, col):
        for row in UpIterator(col):
            for cell in LeftIterator(row):
                cell.attach()
        col.attach()    

   ...

When covering a row of a column col, we start from the column to the right of col and finish at the column on its left; thus, we don't actually de-attach the cell belonging to col from its vertical neighbors. This is not needed because we have already "removed" that column from the matrix, and it allows for a more elegant implementation.

It's important that uncover performs its operations in the reverse order of cover, so we don't mess up the pointers in the matrix.

The main body of the algorithm is given below; it is essentially the definition of Algorithm X and returns True whenever it has found a solution.

    def backtrack(self):
        # Let's cover the first uncovered item
        col = self.smat.head.right
        # No column left
        if col == self.smat.head:
            return True
        # No set to cover this element
        if col.down == col:
            return False

        self.cover(col)
        
        for row in DownIterator(col):

            for cell in RightIterator(row):
                self.cover(cell.head)

            if self.backtrack():
                self.solution.append(row)
                return True

            for cell in LeftIterator(row):
                self.uncover(cell.head)

        self.uncover(col)
        
        return False

Knuth notes that in order to reduce the expected size of the search tree we should minimize the branching factor at the earlier nodes, which is done by selecting the column with the fewest 1 entries, since the rows covering the chosen column are exactly the branches we create. But for the sake of simplicity we are choosing the first one.
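
The constructor and the driver of the DancingLinks class are not shown above; here's a minimal sketch of how they might look, together with a usage example on the matrix from Figure 1 (the constructor signature and the solve() method are assumptions, not necessarily what's in the GitHub code):

class DancingLinks:

    def __init__(self, mat):
        self.smat = SparseMatrix(mat)
        self.solution = []

    # cover(), uncover() and backtrack() as defined above...

    def solve(self):
        # Returns the indices of the rows forming a cover, or None if none exists
        if self.backtrack():
            return [node.row for node in self.solution]
        return None

# Row 1 covers both columns, so the expected answer is [1]
solver = DancingLinks([[0, 1],
                       [1, 1]])
print(solver.solve())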

Conclusion

In this post we revisited concepts like the set cover problem and the sparse matrix data structure. We saw that with a clever trick we can remove and insert rows and columns efficiently and in-place.

Complexity

Suppose that at a given node we have an n \times m matrix. In the naive approach, for each candidate row we need to go through all of the rows and columns to generate a new matrix, leading to an O(n^2 m) cost per node.

In the worst case, using a sparse matrix will lead to the same complexity, but hard set cover instances are generally sparse, so in practice we may have a performance boost. Also, since we do everything in-place, our memory footprint is considerably smaller.

References

[1] Dancing Links – D. E. Knuth (arxiv.org)
[2] Wikipedia – Algorithm X
[3] Wikipedia – Dancing Links

Skip Lists in Python

A skip list is a probabilistic data structure that allows efficient search, insertion and removal operations. It was invented by William Pugh [1] in 1989.

Other structures with efficient operations are self-balancing binary trees, such as AVL, red-black and splay trees. But they are often considered difficult to implement.

On the other hand, skip lists are much like multiple linked lists with some randomization.

Figure 1: Skip list

In the first level, we have a regular linked list with the elements sorted. Each element of this list has a probability p to be also present in the level above. The second level will probably contain fewer elements and each of these elements will also have a chance p to be on the third level, and so on. Figure 1 shows an example of a skip list.

We’ll implement a simple version of the skip list in python. To start, we define a skip list node. We’ll represent each level where the node appears by a list of pointers to the next nodes.

class SkipNode:
    def __init__(self, height = 0, elem = None):
        self.elem = elem
        self.next = [None]*height

Our skip list is just a sentinel skip node with height initially set to 0 and that stores a null element.

class SkipList:
    def __init__(self):
        self.head = SkipNode()

Now, let’s implement the search operation of this list.

Search

To search for an element q in a skip list, we begin at the topmost level of the header. We go through the list at this level until we find the node with the largest element that is smaller than q.

We then go to the level below and again search for the node with the largest element that is smaller than q, but this time we begin the search from the node found in the level above.

When we find such a node, we go down again and repeat this process until we reach the bottom level. The node x found at the bottom level is the largest element that is smaller than q in the whole list, and if q is in the list, it will be to the right of x.

Also, we want to keep track of the node found at each level right before going down to the level below, since this will make the insertion and deletion operations very simple.

This idea can be translated into the following code:

def updateList(self, elem):

    update = [None]*len(self.head.next)
    x = self.head

    for i in reversed(range(len(self.head.next))):
        while x.next[i] != None and x.next[i].elem < elem:
            x = x.next[i]
        update[i] = x

    return update

It returns, for each level, the node containing the greatest value that is smaller than elem.

The actual find function returns the node corresponding to the query element or None if it is not present in the skip list.

def find(self, elem, update = None):
    if update == None:
        update = self.updateList(elem)
    if len(update) > 0:
        candidate = update[0].next[0]
        if candidate != None and candidate.elem == elem:
            return candidate
    return None

Complexity

The complexity of the search is given by the following Theorem [2]:

Theorem: The number of moves in a search is O(\log n) with high probability.

By high probability we mean that we can set an arbitrarily high probability by increasing the constant hidden in the O() notation. The proof of this Theorem is sketched in Appendix A.

Insertion

Insertion consists of deciding the height of the new node using randomHeight() and, for each level up to this height, inserting the new node after the node given by update.

def insert(self, elem):

    node = SkipNode(self.randomHeight(), elem)

    while len(self.head.next) < len(node.next):
        self.head.next.append(None)

    update = self.updateList(elem)            
    if self.find(elem, update) == None:
        for i in range(len(node.next)):
            node.next[i] = update[i].next[i]
            update[i].next[i] = node
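
The randomHeight() helper is not listed in the post; here's a minimal sketch, assuming a promotion probability p = 1/2 (the actual implementation in the repository may differ):

import random

def randomHeight(self, p = 0.5):
    # Flip a biased coin until it fails; each success adds one level,
    # so the height follows a geometric distribution
    height = 1
    while random.random() < p:
        height += 1
    return height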

Deletion

The deletion is pretty much like the insertion, but now we delete the node found using find() from all levels in which it appears.

def remove(self, elem):

    update = self.updateList(elem)
    x = self.find(elem, update)
    if x != None:
        for i in range(len(x.next)):
            update[i].next[i] = x.next[i]

Note that, for the sake of simplicity, we do not resize head.next when the lists at the top levels become empty. Doing so would not change the theoretical worst-case complexity, but in practice it might improve performance slightly.
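
A short usage sketch, assuming the functions above (updateList, find, insert, remove and randomHeight) are defined as methods of SkipList:

s = SkipList()
for x in [3, 1, 7, 5]:
    s.insert(x)

print(s.find(5) is not None)  # True
print(s.find(4) is not None)  # False

s.remove(5)
print(s.find(5) is not None)  # False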

Computational Experiments

The complete implementation of the skip list is available on GitHub, as well as some test cases and a simple implementation of a linked list.

Our computational experiments consist in comparing the execution time for linked lists (O(n) per operation), our simple skip list (O(\log n) with high probability) and an implementation of a red-black tree (worst-case O(\log n) per operation).

We ran 10000 insertions in each of these structures with a random sequence, an increasing sequence and a decreasing sequence. We measure the CPU time in seconds:

Table 1: Times for insertion in different structures

As we can see, for random input Red-black tree and Skip list have similar performance. For increasing and decreasing sequences, Skip list performed better than Red-black tree because the former is unaffected by the ordering of insertions, while the latter has to make many balancing operations in such cases.

As for the linked list, we can verify it's much slower than the other two structures, since it has O(n) worst-case insertion time, but it outperforms them when the elements are inserted in decreasing order, since in that case insertion is O(1) :)

Conclusion

There is a combination of insertions and removals that may degenerate a skip list to a linked list, so the operations of searching, inserting and deleting become O(n) in the worst case scenario, though it is very unlikely to happen.

Demaine's analysis [2] is stronger than Pugh's [1]: the latter proves that the cost of a search is O(\log n) on average (i.e. expected cost), while the former proves it is O(\log n) with high probability.

References

[1] Skip Lists: A Probabilistic Alternative to Balanced Trees – W. Pugh.
[2] Introduction to Algorithms MIT – Lecture 12: Skip Lists – Erik Demaine

Appendix A: Proofs

In this section we present the proof of the Theorem stated in the post. Let L(n) = \log_{1/p} n.

Before proving the Theorem, let’s prove the following Lemma:

Lemma: The number of levels in a skip list is O(L(n)) with high probability.

Proof:

Let's bound the error probability, that is, the probability that there are more than c \cdot L(n) levels. We use Boole's inequality (the union bound), which says that for a set of events \{E_1, E_2, \cdots, E_k\}:

Pr\{E_1 \cup E_2 \cup \cdots \cup E_k \} \le Pr\{E_1\} + Pr\{E_2\} + \cdots + Pr\{E_k \}

Thus,

Pr\{ \mbox{max level } \ge c \cdot L(n) \} \le
n \cdot Pr\{\mbox{node level } \ge c \cdot L(n) \}

Since each node's height follows a geometric distribution, let the node level be the number of levels above the bottom list (so it starts at 0). For a given level x:

Pr\{\mbox{node level } = x\} = p^{x}(1 - p)

Pr\{\mbox{node level } < x\} = \sum_{i = 0}^{x - 1} Pr\{\mbox{node level } = i\} =
\, (1 - p)\sum_{i = 0}^{x - 1} p^i = 1 - p^{x}

Pr\{\mbox{node level } \ge x \} = p^{x}

Also, p^{c \cdot L(n)} = p^{c \cdot \log_{1/p} n} = \frac{1}{n^{c}}

Finally,

Pr\{ \mbox{max level } < c \cdot L(n) \} \ge 1 - \frac{1}{n^{c - 1}}

Thus, for a sufficiently large constant c, we have a very high probability, which proves the Lemma.

Theorem: The number of moves in a search is O(\log n) with high probability.

Proof:

Let’s prove that the number of moves in a search is O(L(n)) with high probability.

First, consider the reversed path to find the returned element. This reversed path consists of up and left movements along the skip list. Note first that the number of up movements is bounded by the number of levels of the skip list.

An up movement is done with probability p, which is the case that the current element has at least one more level. Otherwise a left movement is done with probability 1 - p.

Then, the length of the path is given by the number of movements until we reach c \cdot L(n) up movements. We claim that such number of movements is O(L(n)) with high probability.

To prove our claim, let the number of movements be c'cL(n) for some other constant c'. In [2], some combinatorial relations are used to show that

Pr\{\mbox{\# up moves} \le cL(n) \mbox{ among } c'cL(n) \mbox{ moves}\} \le \frac{1}{n^\alpha}

where \alpha = ((c'- 1) - L(c'e)) \cdot c. We also have the converse:

Pr\{\mbox{\# up moves} > cL(n) \mbox{ among } c'cL(n) \mbox{ moves}\} \ge 1 - \frac{1}{n^\alpha}

Since (c'- 1) > L(c'e) for sufficiently large values of c', we can choose c' to make \alpha arbitrarily large, which makes the probability above very high, proving the claim.

We conclude that the number of movements is O(cL(n)) = O(L(n)) with high probability.