# Largest sets of subsequences in OCaml

I’ve noticed that there is this set of words in English that look very similar: tough, though, through, thought, thorough, through and trough. Except thought, they have one property in common: they’re all subsequence of thorough. It made me wonder if there are interesting sets of words that are subsequences of other words.

Word cloud made with Jason Davie’s tool (https://www.jasondavies.com/wordcloud/)

This post is an attempt to answer a more general question: given a list of words, what is the largest set of these words such that they’re subsequences of a given word?

A word A is a subsequence of a word B if A can be obtained by removing zero or more characters from B. For example, “ac” is a subsequence of “abc”, so is “bc” and even “abc”, but not “ba” nor “aa”.

A simple algorithm to determine if a word A is a subsequence of another is to start with 2 pointers at the beginning of each word, pA and pB. We move pB forward until pA and pB point to the same character. In that case we move pA forward. A is a subsequence of B if and only if we reach the end of A before B. We could then iterate over each word W and find all the words that are subsequences of W. If the size of the dictionary is n, and the size of the largest word is w, this algorithm would run in $O(n^2 w)$.

For English words, we can use entries from /usr/share/dict/words. In this case, n is around 235k (wc -l /usr/share/dict/words), so a quadratic algorithm will take a while to run (around 5e10 operations).

Another approach is to generate all subsequences of words for a given word W and search the dictionary for the generated word. There are $O(2^w)$ subsequences of a word of length $w$. If we use a hash table, we can then do it in $O(n w 2^w)$. In /usr/share/dict/words, the length of the largest word, w, is 24.

Running a calculation with the numbers (R script), the number of high-level operations is 4e10, about the same order of magnitude as the quadratic algorithm.

Distribution using ggplot2

A third approach is to use a trie. This data structure allows us to store the words from the dictionary in a space-efficient way and we can search for all subsequences using this structure. The trie will have at most 2e6 characters (sum of characters of all words), less because of shared prefixes. Since any valid subsequence has to be a node in the trie, the cost of search for a given word cannot be more than the size of the trie t, so the complexity per word is $O(\min(2^w, t))$. A back of envelope calculation gives us 2e9. But we’re hoping that the size of the trie will be much less than 2e6.

Before implementing the algorithm, let’s define out trie data structure.

### The Trie data structure

A trie is a tree where each node has up to |E| children, where |E| is the size of the alphabet in consideration. For this problem, we’ll use lower case ascii only, so it’s 26. The node has also a flag telling whether there’s a word ending at this node.

Notice that in this implementation of trie, the character is in the edge of the trie, not in the node. The Map structure from the stlib uses a tree underneath so get and set operations are $O(log |E|)$.

The insertion is the core method of the structure. At a given node we have a string we want to insert. We look at the first character of the word. If a corresponding edge exists, we keep following down that path. If not, we first create a new node.

To decide whether a trie has a given string, we just need to traverse the trie until we either can’t find an edge to follow or after reaching the end node it doesn’t have the hasEntry flag set to true:

This and other trie methods are available on github.

### The search algorithm

Given a word W, we can search for all its subsequences in a trie with the following recursive algorithm: given a trie and a string we perform two searches: 1) for all the subsequences that contain the first character of current string, in which case we “consume” the first character and follow the corresponding node and 2) for all the subsequences that do not contain the first character of the current string, in which case we “consume” the character but stay at the current node. In pseudo-code:

Search(t: TrieNode, w: string):
Let c be the first character of w.
Let wrest be w with the first character removed

If t contains a word, it's a subsequence of the
original word. Save it.

// Pick character c
Search(t->child[c], wrest)

// Do not pick character c
Search(t, wrest)


The implementation in OCaml is given below:

### Experiments

Our experiment consists in loading the words from /usr/share/dict/words into a trie, and then, for each word in the dictionary, look for its subsequences. The full code is available on github.

The code takes 90 seconds to run on my laptop. Not too bad but I’m still exploring ways to improve the performance. One optimization I tried is to, instead of returning an explicit list of strings as mentioned in the search implementation, return them encoded in a trie, since we can save some operations due to shared prefixes. I have that version on github, but unfortunately that takes 240 seconds to run and requires more memory.

Another way is to parallelize the code. The search for subsequences is independent for each word, so it’s an embarrassingly parallel case. I haven’t tried this path yet.

The constructed trie has 8e5 nodes or ~40% of the size of sum of characters.

#### Subsequences of “thorough”

The question that inspired this analysis was finding all the subsequences of thorough. It turns out it has 44 subsequences, but most of them are boring, that is, single letter or small words that look completely unrelated to the original word. The most interesting ones are those that start with t and have at least three letters. I selected some of them here:

• tho
• thoo
• thoro
• thorough
• thou
• though
• thro
• throu
• through
• thug
• tog
• tou
• toug
• tough
• trough
• trug
• tug

The word with most subsequences is pseudolamellibranchiate, 1088! The word cloud at the beginning of the post contains the 100 words with the largest number of subsequences. I tried to find interesting words among these, but they’re basically the largest words – large words have exponentially more combination of subsequences, and hence the chance of them existing in the dictionary is greater. I tried to come up with penalization for the score:

1) Divide the number of subsequences by the word’s length. This is not enough, the largest words still show on top.
2) Apply log2 to the number of subsequences and divide by the word’s length. In theory this should account for the exponential number of subsequences of a word. This turns out to be too much of a penalization and the smallest word fare too well in this scenario.

I plotted the distribution of number of subsequences by word lengths. We can see a polynomial curve but with increased variance:

Generated with this ggplot2

In the chart above, we’d see all points with the same x-value in a single vertical line. One neat visualization trick is to add noise (jitter) so we also get a sense of density.

If we use a box plot instead, we can see a quadratic pattern more clearly by looking at the median for each length.

Generated with this ggplot2

Given this result, I tried a final scoring penalization, by dividing the number of subsequences by the square of the length of the word, but it’s still not enough to surface too many interesting words. Among the top 25, streamlined is the most common word, and it has 208 subsequences.

One interesting fact is that the lowest scoring words are those with repeated patterns, for example: kivikivi, Mississippi, kinnikinnick, curucucu and deedeed. This is basically because we only count unique subsequences.

### Conclusion

This was a fun problem to think about and even though it didn’t have very interesting findings, I learned more about OCaml and R. After having to deal with bugs, compilation and execution errors, I like OCaml more than before and I like R less than before.

R has too many ways of doing the same thing and the API is too lenient. That works well for the 80% of the cases which it supports, but finding what went wrong in the other 20% is a pain. OCaml on the other hand is very strict. It doesn’t even let you add an int and a float without an explicit conversion.

I learned an interesting syntax that allows to re-use the qualifier/namespace between several operations when chaining them, for example:

I also used the library Batteries for the first time. It has a nice extension for the rather sparse String module. It allows us to simply do open Batteries but that overrides a lot of the standard modules and that can be very confusing. I was scratching my head for a long time to figure out why the compiler couldn’t find the union() function in the Map module, even though I seemed to have the right version, until I realized it was being overridden by Batteries. From now on, I’ll only use the specific modules, such as BatString, so it’s easy to tell which method is coming from which module.

### References

OCaml

• [1] OCaml Tutorials > Map
• [2] Strings – Batteries included
• [3] Using batteries when compiling

R

• [1] R Tutorial – Histogram
• [2] Creating plots in R using ggplot2 – part 10: boxplots
• [3] My R Cheat sheet

# Amortization and Persistence via Lazy Evaluation

In this chapter Okasaki works around the problem of doing amortized analysis with persistent data structures because the amortized analysis assumes in place modification while for persistent data structures (partial) copies are made. The intuition is that lazy evaluation, which comes with memoization and avoids recomputation, solves this problem.

He adapts the Banker’s and Physicists’s methods to work with lazy evaluated operations and applies them to a few structures including Binomial Heaps, Queues and Lazy Pairing Heaps. In this post we’ll only cover the examples of the Queues using both methods.

We’ll first introduce some concepts and terminology, then we’ll present a queue implementation using lazy evaluation that allows us analyzing it under persistence. Following that we’ll explain the Banker’s and Physicist’s methods and prove that the implementation for push/pop has an efficient amortized cost.

### Persistence as a DAG

An execution trace is a DAG where nodes represent the operations (e.g. updates to a data structure) and an edge from nodes $v$ to $v'$ indicates that the operation corresponding to $v'$ uses the output of the one corresponding to $v$.

For example, if we have this set of operations:

let a = push 0 newEmpty
let b = push 1 a
let c = pop b
let d = push 2 b
let e = append c d
let f = pop c
let g = push 3 d


The corresponding execution graph is:

Execution Graph

The logical history of an operation v is the set of all operations it depends on (directly or indirectly, and including itself). Equivalently, in terms of the DAG, it’s the set of nodes that have a directed path to v.

A logical future of an operation v is any directed path from v to a terminal node.

### A framework for analyzing lazy evaluated data structures

We need to introduce a few more concepts and terminology. Note: we’ll use suspension and lazy operation interchangeably, which can be evaluated or forced.

The unshared cost of an operation is the time it would take to execute it if it had already been performed and memoized before, so if the operation involves any expression that is lazy, that expression would be O(1).

The shared cost of an operation is the time it would take it to execute (force) all the suspensions created (but not evaluated) by the operation.

The complete cost is the sum of the shared and unshared costs. Alternatively, the complete cost of an operation is the time it would take to execute the operation if lazy evaluation was replaced by strict. To see why, first we note that the unshared costs have to be paid regardless of laziness. Since we’re assuming no laziness, the operation has to pay the cost associated with the suspension it creates, which corresponds to the shared costs. Note that under this assumption we wouldn’t need to account for the cost of forcing suspensions created by previous operations because in theory they have already been evaluated.

When talking about a sequence of operations, we can break down the shared costs into two types: realized and unrealized costs. The realized costs are the shared costs from suspensions were actually forced by some operation in the sequence. Example: say that operations A and B are in the sequence and A creates a suspension, and then B forces it. The cost for B to force it is included in the realized cost. The unrealized costs are the shared costs for suspensions that were created but never evaluated within the sequence. The total actual cost of a sequence of operations is the sum of the realized costs and the unshared costs.

Throughout a set of operations, we keep track of the accumulated debt, which starts at 0 at the beginning of the sequence. Whenever an operation is performed, we add its shared cost to it. For each operation, we can decide how much of this debt we want to pay. When the debt of a suspension is paid off, we can force it. The amortized cost of an operation is its unshared cost plus the amount of debt it paid (note that it does not include the realized cost). Note that as long as we always pay the cost of a suspension before it’s forced, the amortized cost will be an upper bound on the actual cost.

This framework simplifies the analysis for the case when a suspension is used more than once by assuming that its debt was paid off within the logical history of when it was forced, so we can always analyze a sequence of operations and don’t worry about branching. This might cause the debt being paid multiple times, but it simplifies the analysis.

The author uses the term discharge debit as synonym of pay off debt. I find the latter term easier to grasp, so I’ll stick with it throughout this post.

Let’s introduce an example first and then proceed with the explanation of the Physicist’s method and the corresponding analysis of the example.

### The Stream Queue

To allow efficient operations on a queue in the presence of persistence, we can make some of the operations lazy. Recall in a previous post we defined a queue using two lists. To avoid immediate computation, a natural replacement for lists is using its lazy version, the stream data structure, which we also talked about in a previous post.

For the list-based queue, the invariant was that if the front list is empty, then the rear list must be empty as well. For the stream queue, we have a tighter constraint: ‘front’ must be always greater or equal than ‘rear’. This constraint is necessary for the analysis.

The definition of the stream queue is the following:

We store the lengths of the streams explicitly for efficiency.

We’ll be using the Stream developed in the previous chapter, so we’ll refer to the module Stream2 to avoid ambiguity with the standard Stream module.

Inserting an element at the end of the queue is straightforward, since the rear stream represents the end of the queue and is reversed:

The problem is that inserting at rear can cause the invariant of the queue to be violated. check() changes the structure so to conform to the invariant by potentially reversing rear and concatenating with front:

Removing an element from the queue requires us to evaluate the first element of the front stream. Again, the invariant can be violated in this case so we need to invoke check() again:

The complete code for the stream queue is on Github.

### Analysis using the Banker’s Method

The idea of the Banker’s Method is basically define an invariant for the accumulated debt and a strategy for paying it off (that is, decide how much debt each operation pays off). Then we show that whenever we need to force a suspension, the invariant guarantees that the accumulated debt has been paid off. One property of the Banker’s method is that it allows associating the debt to specific locations of the data structure. This is particularly interesting for streams, because it contains multiple (nested) suspensions, so we might force parts of this structure before we paid the debt associated with the entire structure.

By inspection, we can see that the unshared cost of both push and pop are O(1). It’s obvious in the case of push, and in the case of pop, in theory check could take O(m) where m is the size of the queue, but since Stream2.concat() and Stream2.reverse() are both lazy, and hence memoized, they are not included in the unshared costs.

To show that the amortized cost of both operations is O(1), we can show that paying off O(1) debt at each operation is enough to pay for the suspension before it is forced. For the queue, we also need to associate the debt with parts of the data structure, so that we could force the suspension of only some parts of it (for example, on the stream we can evaluate only the head, not necessarily the entire structure).

We now define an invariant that should be respected by all operations. Let $d(i)$ be the debt at the i-th node on the front stream, and $D(i) = \sum_{j=0}^{i} d(j)$ the accumulated debt up to node $i$. The invariant is:

$D(i) \le \min(2i, |f| - |r|)$

This constraint allows us evaluating the head at any time, because $D(0) = 0$, which means its debt has been paid off. The second term in min(), guarantees that if $|f| = |r|$ the entire stream can be evaluated, because $D(i) = 0$ for all $i$.

The author then proves that by paying off one debt in push() and two debt units in pop() is enough to keep the debt under the constraint.

### Queue with suspended lists

Because the Physicist’s method cannot assign costs to specific parts of the data structure, it doesn’t matter if the structure can be partially forced (like streams) or if it’s monolithic. With that in mind, we can come up with a simpler implementation of the queue by working with suspended lists instead of streams. Only the front list has to be suspended because the cost we want to avoid, the reversal of the back list and concatenation to the front list, happens on the front list.

On the other hand, we don’t want to evaluate the front list when we perform a peek or pop, so we keep a evaluated version of the front list too.

The signature of the structure is as follows:

As mentioned in the code above, the invariants we want to enforce is that the front list is never smaller than the rear list

and that the evaluated version of the front list is never empty if the lazy version still has some elements.

The push and pop operations are similar to the other versions of queue, but since we mutate the structure, we might need to adjust it to conform to the invariants:

Finally, because of our second invariant, peeking at the queue is straightforward:

The complete code for the suspended queue is on Github.

### Analysis using the Physicist’s Method

We’ve seen the Physicist’s Method in a previous post when we’re ignore the persistence of the data structures. We adapt the method to work with debits instead of credits. To avoid confusion, we’ll use $\Psi$ to represent the potential function. $\Psi(i)$, represents the accumulated debt of the structure at step $i$. At each operation we may decide to pay off some debit, which will be then included in the amortized cost. We have that $\Psi(i) - \Psi(i - 1)$ is the increase in debt after operation $i$. Remember that the shared cost of an operation corresponds to the increase in debt if we don’t pay any of the debt. Thus, we can find out how much debt was paid off then by $s_i - \Psi(i) - \Psi(i - 1)$, where $s(i)$ is the shared costs of operation $i$. Let $u(i)$ and $c(i)$ be the unshared and complete costs of the operation $i$. Given that, by definition, $c(i) = u(i) + s(i)$, we can then express the amortized cost as:

$a_i = cc_i - (\Psi(i) - \Psi(i - 1))$

To analyze the suspended queue we need to assign values to the potentials such that by the time we need to evaluate a suspension the potential on the structure is 0 (that is, the debt has been paid off). For the suspended queues we’ll use the following potential function:

$\Psi(q) = \min(2|w|, |f| - |r|)$

Where w is the forcedFront, f is lazyFront and r is rear.

We now claim that the amortized cost of push is at most 2. If we push and element that doesn’t cause a rotation (i.e. doesn’t violate $|f| \ge |r|$), then $|r|$ increases by 1, and the potential decreases by 1. No shared is incurred and the unshared cost, inserting an element at the beginning of rear is 1, hence the amortized cost for this case is 1 – (-1) = 2. If it does cause a rotation, then it must be that after the insertion $|f| = m$ and $|r| = m + 1$. After the rotation we have $|f| = 2*m + 1$ and $|r| = 0$, but w hasn’t changed and cannot be larger than the original $f$, so the potential function is at most $2*m$. The reversal of $r$ costs $m + 1$ and concatenating to a list of size $x$ costs $x$ (discussed previously), plus the cost of initially appending an element to read, so the unshared cost is $2*m + 2$. No suspensions were created, so the amortized cost is given by $2*m + 2 - (2*m) = 2$.

Our next claim is that the amortized cost of pop is at most 4. Again, if pop doesn’t cause a rotation, $|w|$ decreases by 1, so the potential is reduced by 2. The unshared cost is 1, removing an element from $w$, and the shared cost, 1 comes from the suspension that lazily removes the head of lazyFront. The amortized cost is 2 – (-2) = 4. Note that if $w = 0$, we’ll evaluate $f$, but the ideas is that it has been paid off already. If the pop operation causes a rotation, then the analysis is similar to the push case, except that the complete cost is must account for the shared cost of lazily reming the head of lazyFront, so it’s $2*m + 3$, for an amortized cost of 3.

Note that when the suspensions are evaluated, the potential is 0, either when $|f| = |r|$ or $|w| = 0$.

### Conclusion

In this post we covered a simple data structure, the queue, and modified it to be lazy evaluated and can show, with theory, that it allows for efficient amortized costs. The math for proving the costs is not complicated. The hardest part for me to grasp is to get the intuition of how the analysis works. The analogy with debt is very useful.

Meta: Since wordpress code plugin doesn’t support syntax highlighting for OCaml, I’m experimenting with the Gist plugin. Other advantages is that Gists allow comments and has version control!

# Amortization Analysis

In Chapter 5 of Purely Functional Data Structures, Okasaki introduces the amortization analysis as preparation for the complexity analysis of functional data structures.

The idea of amortized analysis is to prove the complexity of an operation when the average cost per operation is less than the worst case cost of each individual operation.

In this post we’ll perform analysis of 2 data structures and show that their amortized running time is much better than their worst case. There are 2 methods presented in the book, the Banker’s Method and the Physicist Method. I found the Physicist easier to work with, so I’ll only use that. Also, I find it easy to understand concepts with an examples, so we’ll first introduce the data structure and then introduce the analysis.

### Queue with lists

An easy way to implement the queue data structure is by using two lists, one representing the beginning (left) and the other the end (right). Inserting/removing an element from the beginning of a list in OCaml is O(1), while doing it at the end is O(n). For this reason, we store the end of the queue reversed in the right list. (In imperative languages the opposite is true, insertions/removals at the end are O(1) and at the beginning O(n), so the left array that is reversed).

Queue using lists representing [1, 2, 3, 4, 5, 6]

Insertions are performed by appending an element to the beginning of the right array. Removing from the front of the queue requires removing the first element of the left array. When the left array is empty, we need to reverse the contents of the right array and move it to the left.

We work with an invariant, in which the left array is only empty if the right array is empty as well. That requires a slight change to the removal, where we perform the reversal and move of the right array when the left array has only one element and is about to become empty.

I’ve implemented this idea in Ocaml.

### Analysis – The Physicist Method

If we only look at the worst case, we will get that removing from a queue is an O(n) operation. We can improve this by looking at the amortized cost. One way to perform this analysis is via the physicist method, in which we define a potential (or cost) function $\Phi(d)$ for the data structure $d$. In the queue case, we can define it to be the size of the right list.

The amortized cost of step i, $a_i$ is defined by

$a_i = t_i + \Phi(d_i) - \Phi(d_{i-1})$

Where $t_i$ is the actual cost (in number of operations)

The intuition is that some costly operations such as reversing the list, which takes O(n) operations, can only happen after O(n) insertions happened, so the average cost per operation is O(1). Translating it to the potential analogy, each insersion operation increases the potential of the data structure by 1, so by the time a revert happens, the right list is emptied and that causes a loss of potential of n, which had been accumulated and that offsets the cost of the operation $t_i$.

We can get the accumulated amortized cost of the operations up to a specific step by summing them, and we’ll get

$A_i = T_i + \Phi(d_i) - \Phi(d_0)$

If we choose $\Phi(d)$ such that $\Phi(d_i) \ge \Phi(d_0)$, for all i, then A_i is an upper bound of $T_i$. Since we assume $d_0$ is the empty queue, $\Phi(d_0) = 0$ and we have $\Phi(d_i) \ge 0$, this property is satisfied.

One important remark is that this analysis is only applicable to non-persistent data structures, which is not the case in OCaml. We present the OCaml code as exercise, but the correct analysis will done in later chapters.

### Splay Trees

Splay Tree is a binary search tree. The insertion of an element e consists in spliting the current tree in 2 subtrees, one of all elements less or equal than e, which we’ll call small and another will all elements larger than e, which we’ll call big. We then make e the root of tree and small the left subtree and big the right subtree.

Partition

The partition operation is O(height of tree). The intuition is, when looking for the small subtree, imagine we’re analysing a node x. If x >= e, then we know all elements in the left subtree are also >= e, so are all the elements in the right subtree, so we can discard it and only look at the left subtree. The same idea applies to looking for the big subtree.

In an ideal scenario, the tree will be well balanced in such a way that the height of the tree is O(log n), and so the insertion operation. But that’s not always the case, so we introduce a balancing strategy. Whenever we follow 2 lefts in a row when searching for the big subtree, we perform a right rotation. Similarly, when following 2 rights in a row when searching for the small subtree, we perform a left rotation.

These re-balancing are enough to make partition and also insertion an O(log n) amortized time, as we shall see in the next section.

Rotation

The operation for finding the minimum element is trivial. To delete the minimum element, we also perform rotations. Since we only traverse the left paths, we only have to do a left rotation.

I’ve implemented these ideas in OCaml.

### Amortized Analysis of Splay Trees

We first define a potential function. Given a tree $t$ and a node $x \in t$, let $t(x)$ be the subtree of $t$ rooted in $x$. Then $\phi(t(x)) = \log(|t(x)| + 1)$ be a potential function where $|t(x)|$ the number of nodes of the subtree of which $x$ is root, and let the potential of a tree $t$, $\Phi(t)$, be the sum of the sub-potentials of all nodes of $t$, that is,

(1) $\Phi(t) = \sum_{x \in t} \phi(t(x))$.

The amortized equation of calling partition() is the actual cost of the partition, plus the potential of the resulting trees (small and big) minus the potential of the original tree. Let’s denote small(t) as $s$, and big(t) as $b$.

(2) $\mathcal{A}(t) = \mathcal{T}(t) + \Phi(s) + \Phi(b) - \Phi(t)$

We can prove that $\mathcal{A}(t) \le 1 + 2 \log(|t|)$ (see appendix), which means the amortized cost of the partition operation is $O(\log n)$. It implies that insertion also has an amortized cost of $O(\log n)$.

The same idea can be used to prove the amortized costs of the deletion of the minimum element. Okasaki claims that splay trees are one of the fastest implementations for heaps when persistence is not needed.

### Conclusion

I found amortized analysis very hard to grasp. My feeling is that the amortized analysis is too abstracted from the real problem, making it not very intuitive. It’s also not clear how one picks the potential function and whether there’s a choice of potential that might result in a better than the “actual” amortized cost.

### Appendix

As we can see in code, there are four cases to consider: left-left, left-right, right-left and right-right. The left-* and right-* are symmetric, so let’s focus on the left-left and left-right cases. For the left-left case, suppose we perform the following partition:

Left-left case of partition()

(3) $\mathcal{A}(t) = \mathcal{T}(t) + \Phi(s) + \Phi(b) - \Phi(t)$

Since $= \mathcal{T}$ corresponds to the number of recursive calls to partition, $\mathcal{T}(t) = 1 + \mathcal{T}(u)$, so

(4) $= 1 + \mathcal{T}(u) + \Phi(s) + \Phi(b) - \Phi(t)$

The recursive call to u yields (s’, b’), so we have the equivalence, $\mathcal{A}(u) = \mathcal{T}(u) + \Phi(s') + \Phi(b') - \Phi(u)$,

(5) $= 1 + \mathcal{A}(u) - \Phi(s') - \Phi(b') + \Phi(u) + \Phi(s) + \Phi(b) - \Phi(t)$

Since, s = s’,

(6) $= 1 + \mathcal{A}(u) - \Phi(b') + \Phi(u) + \Phi(b) - \Phi(t)$

and $\Phi(b') - \Phi(b) = \phi(b(x)) + \phi(b(y)) + \Phi(c) + \Phi(d)$, and $\Phi(t) - \Phi(u) = \phi(t(x)) + \phi(t(y)) + \Phi(c) + \Phi(d)$, we can cancel out some terms and simplify the equation above to

(7) $= 1 + \mathcal{A}(u) - \phi(b(x)) + \phi(b(y)) - \phi(t(x)) - \phi(t(y))$

by induction, $\mathcal{A}(u) \le 1 + 2 \phi(u)$,

(8) $\le 2 + 2\phi(u) + \phi(b(x)) + \phi(b(y)) - \phi(t(x)) - \phi(t(y))$

Since $u$ is a subtree of $t(y)$, $\phi(u) < \phi(t(y))$ and since the nodes in $b$ is a subset of t, $\phi(b(y)) \le \phi(t(x))$

(9) $< 2 + \phi(u) + \phi(b(x))$

It’s possible to show that, given $x, y, z$ such that $y + z \le x$, then $1 + \log y + \log z < 2 \log x$. We have that $|t| = |u| + |c| + |d| + 2$, while $|b(x)| = |c| + |d| + 1$, thus, $|u| + |b(x)| < |t|$ and by the relation above, $1 + \log (|u|) + \log (|b(x)|) + 1 < 2 \log (|t|) + 1$, $\phi(u) + \phi(b(x)) + 1 < 2 \phi(t)$, which yields

(10) $\mathcal{A}(t) < 1 + 2 \phi(t)$.

It’s very subtle, but in this last step we used the assumption of the rotation, in particular in $|b(x)| = |c| + |d| + 1$, if we hadn’t performed the rotation, $|b(x)| = |b| + |c| + |d| + 2$, and $|u| + |b(x)| < |t|$ wouldn’t hold!

For the left-right case, we can do a similar analysis, except that now we’ll apply the recursion to $c$, and the resulting partition will look different, as depicted below:

Left-right case of partition()

In this case, (8) will become

(11) $\le 2 + 2\phi(c) + \phi(b(x)) + \phi(b(y)) - \phi(t(x)) - \phi(t(y))$

We then use that $\phi(c) < \phi(t(y))$ and $\phi(t(y)) < \phi(t(x))$ this will allow us to cancel out some terms to

(12) $< 2 + \phi(b(x)) + \phi(b(y))$

Similar to what we've done before, we can show that $|b(x)| + |b(y)| = |s| + |b| = |t|$ since the partitions of $|t|$ are mutually exclusive. We can still use the result from the other case and get (10).

### References

[1] Wikipedia – Amortized Analysis

# Streams and Lazy Evaluation in OCaml

In this short post we’ll discuss lazy evaluation in OCaml and study a data structure called Stream. It’s based mainly on chapter 4 of Purely Functional Data Structures and it’s part of a series of study notes on that book.

### Lazy Evaluation

Lazy evaluation is a property in which an expression is not evaluated immediately (suspended) and when it’s evaluated the first time, the subsequent calls are cached (memoized). Functional languages like Haskell are lazy evaluated, but not OCaml, which is eagerly evaluated. Because the results are memoized, expressions that are lazily evaluated must always return the same value given the same inputs. In Haskell it’s easy to enforce because functions are pure, that is, they do not rely on side effects.

Lazy? Source: Flickr – Brian Gratwicke

In the book the author defines a notation for lazy evaluation:

datatype a susp = $of a In OCaml, we can work with lazily evaluated expressions through the Lazy module. The definition of a suspension is similar: type 'a t = 'a lazy_t and we can use the lazy construct. Let’s define a simple expensive function, a naive Fibonacci, which runs at O(2^n) time: let rec fibo n = if n <= 1 then 1 else (fibo (n - 1)) + (fibo (n - 2)) ;;  We can create a lazy evaluated version of it: let lazy_fibo n = lazy (fibo n);;  We can see that by assigning it to a variable, it doesn’t cause the function to be executed: let r = lazy_fibo 42;;  The author defines a matching operator ($) that causes a lazy expression to be evaluated, but I couldn’t find a corresponding operator in OCaml. Nevertheless, the Lazy module has the force() function, which does exactly that:

Lazy.force r;; // It might take a while!


Note that if we execute the same expression again, the second time it returns much faster, because of the memoization.

We are now ready to introduce the stream data structure.

### Streams

A stream is a lazy version of a linked list. Recall that a linked list is composed of nodes which point to the next node, or equivalently, to the remaining of the list. The usual definition of a linked list is:

type 'a node = Nil | Node of 'a * 'a node


If we want to be explicit that a node is actually pointing to a sublist, we could use an intermediate type, listType:

type 'a node = Nil | Node of 'a * 'a list
and 'a list = 'a node


Note that node and list are mutually recursive (they depend on each other), so we have to define them together by using the and construct.

In a stream the pointer to the remaining of the list is lazily evaluated, so the type is:

type 'a streamCell = Nil | StreamCell of 'a * 'a stream
and
'a stream = ('a streamCell) Lazy.t


With this basic structure we can implement many of the list functions for streams.

#### Concatenation

let rec (++) (streamA: 'a stream) (streamB: 'a stream): ('a stream) =
let computedStreamA = Lazy.force streamA in
match computedStreamA with
| Nil -> streamB
| StreamCell (elem, rest) -> lazy (StreamCell (elem, rest ++ streamB))
;;


Note that it never evaluates streamB and it only evaluates the first cell of streamA.

To help us testing, we can define function to convert from a list:

let rec fromList (l: 'a list): ('a stream) = match l with
| [] -> lazy Nil
| x :: xs -> lazy (StreamCell (x, fromList xs))
;;


and a function that forces the evaluation of the entire stream, essentially converting it back to a list:

let rec toList (stream: 'a stream): ('a list) =
let computedStream = Lazy.force stream in
match computedStream with
| Nil -> []
| StreamCell (elem, rest) -> elem :: (toList rest)
;;


#### Take

The take(n) function returns the first n elements from a stream. Like the concat function, only the first node of the stream is evaluated. The recursive call is suspended.

let rec take (n: int) (stream: 'a stream) : ('a stream) =
if n == 0 then lazy Nil
else
let computedStream = Lazy.force stream in
match computedStream with
| Nil -> lazy Nil
| StreamCell (elem, rest) -> lazy (StreamCell (elem, (take (n - 1) rest)))
;;


#### Drop

The drop(n) function removes the first n elements from a stream and returns the result. In this case, we need to evaluate all the n recursive calls:

let rec drop (n: int) (stream: 'a stream): ('a stream) =
if n == 0 then stream
else
let computedStream = Lazy.force stream in
match computedStream with
| Nil -> lazy Nil
| StreamCell (_, rest) -> drop (n - 1) rest
;;


take and drop look very similar but one is lazy while the other is not. That’s because the head of the stream is not suspended, but the tail is. In the drop case we need to find the (n+1)-th element that will be the new head of the stream. In the take case, we’re not changing the head, and since the tail is suspended, it can wait.

#### Reverse

The reverse function reverts the order the elements in a stream. In this case it’s more obvious that since we’re changing the location of the head, it must be eagerly evaluated.

let reverse (stream: 'a stream): ('a stream) =
let rec reverse' = fun oldStream newStream ->
let computedStream = Lazy.force oldStream in
match computedStream with
| Nil -> newStream
| StreamCell (elem, rest) -> reverse' rest  (lazy (StreamCell (elem, newStream)))
in reverse' stream (lazy Nil)
;;


### Conclusion

In this post we saw that OCaml is not lazy evaluated but we can rely on the Lazy module to accomplish that. We also learned a new data structure, stream, which is recursively lazily evaluated and operations like concat and take play well with laziness, while other like drop and reverse do not.

The full implementation with comments is available on github.

### References

[1] OCaml Module Lazy
[2] cyocum – Mutually Recursive Types
[3] Implementing lazy from scratch in OCaml

# Persistent Data Structures

In this post we’ll discuss some basic data structures in functional programming and their implementation in OCaml. Our main reference are the first two chapters of Purely Functional Data Structures by Chris Okasaki.

The first chapter explains persistent data structures and the second chapter provides examples of three common data structures in imperative programming and how to implement them in ML (a functional language).

### Persistent Data Structures

The core idea behind designing data structures in functional languages is that they’re immutable, that is, if you want to modify it, a new copy has to be made.

If we want to modify a data structure in the naive way, we would need to clone the entire structure with the modifications. Instead, this is done in a smart way so different versions of the data structure share them. Two examples are given to illustrate this idea: lists and binary trees.

Lists are a central native data structure in functional languages. In Ocaml, they’re implemented as linked list. From our data structure class, we know that inserting an element at the beginning of a list is an O(1) operation, but this requires modifying pointers in the data structure. As we said data structures are immutable, so we need to make a copy to perform this operation.

Note that we only need to make the new node point to the beginning of the original list, which means we don’t have to clone this list, we just reuse it.

Now, if we’re inserting in the middle of the list will require copying the existing nodes until we reach our newly inserted element, which we can then point to the remaining of the original list. Consider the example below, where we insert the element 4 in the second position of the list.

Inserting a new element (4) in a linked list

Note that inserting an element at the beginning of a linked list is much more efficient than at the end, both in terms of complexity and space. If we were to insert it at the end, we would require full copies and insertion would be O(length of the list). As an experiment, we can write a function that generates a list representing a range of integers in two ways.

In example 1, we create the list by inserting an element at the beginning of the array,

let rec list_range_fast start_range end_range =
if start_range == end_range then []
else start_range :: (list_range_fast (start_range + 1) end_range)
;;


In example 2 we do it at the end:

let rec list_range_slow start_range end_range =
if start_range == end_range then []
else (list_range_slow start_range (end_range - 1)) @ [end_range]
;;


If I run the slow version with start_range = 0 and end_range = 50000, it takes over a minute to run on my computer, while the fast version runs in a few milliseconds.

### Immutable Binary Trees

Binary trees are a generalized version of a linked list and it works the same way. The key point to notice is that when inserting an element in a (binary) tree only the nodes in the path from the root to the inserted node need to have their pointers modified, so we need to clone a number of nodes that are at most the height of the tree. See the example below:

Inserting an element in a binary tree. Green nodes are the new ones

With this basic concept understood for simple constructs like linked lists and binary trees, we can construct basic data structures on top of them: heaps and balanced binary trees.

### Leftist Heaps

A (minimum) leftist heap is a binary tree with the following properties:

1. The value of a node is smaller or equal the value of its children.

Let the spine of a node be the path defined by following the right children until a leaf is found (i.e. the rightmost path), and the rank of a node the length of its spine.

2. The rank of the left child is always greater or equal than the right child.

It’s possible to show that the rank of a heap of n nodes is less or equal than O(floor(log n + 1)).

Insertion. Before talking about inserting an element in a heap, let’s define the more general merge() function, which merges two leftist heaps, A and B, into one. We compare the root of each heap, say A.val and B.val. If A.val <= B.val, we make A.val the root of the new heap, A.left the left of the new heap and the right of the new heap will be the merge of A.right and B. If A.val >= B.val we do an analogous operation.

Note that the result of the merge might violate property 2, since we’re adding a new node to the right subtree. This is easy to fix: we just to swap the left and right subtree when coming back from the recursion. This is what makeTree() does.

Note that we always perform the merges of the right, which means that the number of such merges will be proportional to the largest rank the original heaps, which implies we can do it in O(log(A.size + B.size)).

In the context of immutable structures, this implementation of heap is efficient because the only nodes whose pointers we need to mutate (even when a swap is needed) are on the spine of the trees, so it only needs to create O(log n) new nodes.

The insertion process consists in creating a heap with one element and merging into the target heap.

Returning the minimum element is trivial: we just need to return the element at the root. Removing the minimum is easy as well: just merge the left and right subtree using merge().

My implementation for the leftist heap is on github.

### Binomial Heaps

Binomial heaps are an alternative heap implementation. Before defining a binomial heap, let’s introduce the binomial tree. A binomial tree is a can be defined recursively based on a property called rank. A single node is a binomial tree of rank 0. A tree of rank r > 0, is formed by combining two trees of rank r-1 making one tree the leftmost child of the other. We denote this operation as linking.

Examples of binomial trees of different ranks

A binomial heap is a list of binomial trees ordered by increasing rank, with no trees with the same rank.

Insertion. Before talking about insertion of an element into a heap, let’s see how to insert a binomial tree of rank r into the heap. To keep the order of the rank of the tree list, we need to traverse through it to find the right place to insert it. Since we can’t have two trees with the same rank, if we run into this case we need to merge those two trees into one with rank r + 1. This can cascade so we might need to repeat this process with the new tree.

Linking a tree is a direct application of its definition. We just need to decide which of the trees will become child of the other. Since we want a minimum heap, we need to guarantee that the minimum element is always on top, so we always make the tree with smallest root to be on top.

Linking is a constant time operation, since we only need to update the pointer of the root of the top tree, which also means we only need to clone the root node to generate a new immutable binomial tree.

For the complexity of traversing the heap list, we can first node that a heap of rank r has 2^r nodes. Which means that a heap of n elements has at most log(n) trees. In fact, a neat way to represent a binomial heap of size n is by its binary representation. For example, 10(dec) = 1010(bin), so for a heap of size 10, we should have a list of trees with ranks 1 and 3. This shows that we can traverse a list in O(log n) and in the worst case we might need to clone this many number of nodes.

Returning the minimum element requires us to scan the list of trees, because even though we know the minimum element is the root of a tree, we don’t know which tree it is, so this operation is O(log n). For removing this element, we can define an auxiliary function, removeMinTree(), which removes the tree with the minimum element from the tree list. Since we only want to remove the root of this tree, we need to re-insert the subtrees back to the heap.

One key observation is that, in a binomial tree of rank r, the children of the root are also binomial trees of ranks from 0 to r-1, which form a binomial heap. We can then define a merge() function that merges two heaps using an idea similar to a merge sort. If we refer back to the analogy of the binary representation of the heap size, the merge operation is analogous to adding two binary numbers!

My implementation for the binomial heap is on github.

### Red Black Trees

A red-black tree is a binary seach tree in which every node is either Red or Black. It respects the following invariants:

1. No red node has a red child
2. Every path from the root to an empty node has the same number of black nodes

Property. the height of a red-black tree of n nodes is at most 2*floor(log(n + 1))

Proof sketch. If the tree has a path to an empty node of length greater than 2*floor(log(n + 1)), this must have more than floor(log(n + 1)) black nodes in the path because we cannot have two consecutive red nodes (invariant 1). Now, by removing all the red nodes, we must have a complete tree of height >= floor(log(n) + 1) + 1, which means 2^(floor(log(n + 1)) + 1) - 1 which is greater than n.

Membership. Since a Red-Black tree is a binary search tree, search can be done in O(height of the tree) which is O(log n) by the property above.

Insertion. inserting an element in a Red Black tree is similar to inserting it into a binary search tree. The challenge is that it may violate one of the invariants. To avoid that we must perform re-balancing along the path we follow to insert the node in the tree.

We always color the inserted node Red. This doesn’t violate the Black nodes constraint, but it might violate the Red nodes one, in case its parent is also Red. If that’s the grandparent of the inserted is necessarily Black. We now have 4 possible scenarios, depicted at the top, right, bottom and left trees:

Unbalanced Red-Black trees and the result of the balancing operation

We assume that the subtrees a, b, c and d are balanced. For all these 4 cases we can perform a “rotation” to achieve the tree at the center.

The Black nodes constraint is not violated, because the path from the root still has the same number of Black nodes as before, and we fixed the consecutive Reds. Notice that y might violate the Red node constraints with its parent, so we need to do it recursively.

In terms of complexity, the insertion can be done in O(log n) and rebalancing takes a constant amount of time. Regarding the immutable structure, notice we only need to change nodes from the insertion path, so only O(log n) nodes have to be cloned.

My implementation for the Red Black tree in Ocaml is on github.

### Conclusion

The first two chapters from this book are really interesting. I have seen the binomial heap and Red-Black trees before in data structure classes and also implemented some data structures such as AVL trees in functional programming in the past (Lisp) during college.

I wasn’t aware of the immutability of data in functional programming until much later, when I was learning Haskell. Okasaki first introduces this concept in Chapter 2, so it allow us keeping that in mind when studying the implementation of functional data structures.

He doesn’t make it explicit on Chapter 3 that the data structures presented are efficient in terms of extra memory necessary to clone them, but they are easy to see.

The ML syntax is very similar to OCaml, but it was a good exercise implementing the code on my own. I tried making it more readable with comments and longer variable names. This also led me to learn a few OCaml constructs and libraries, including:

* How to perform assertions (See the binomial heap implementation)
* Unit testing (using OUnit2 library)
* The “or matching” pattern (see the balance() function in red_black_tree.ml)

# Tree Ring Matching using the KMP Algorithm

Disclaimer: trees and rings are concepts found in Mathematics and Computer Science, but in this post they refer to the biology concepts ;)

### Biosphere 2

Earlier this year I went to Arizona and visited the Biosphere 2. Its initial goal was to simulate a close environment where a group of 6 humans should survive for 2 years without any exchange with the exterior (except sunlight, of course). They had different biomes, including a rain forest, an ocean to a desert.

Biosphere 2

The plan didn’t go well because they had missed interaction of the plants in one of the seasons, which causes the level of oxygen to fall under the accepted level, so they had to get extern supply. The mission did last 2 years, but was not a perfect closed environment.

Later, another mission was attempted, but had to be abandoned after a few months. The project lost its funding and no more attempts were performed. Nowadays, the place is used by biologists to study environments under controlled conditions and also open to the public for tours (the Nautilus magazine recently wrote more about this).

The tour includes a small museum explaining some of the research, one of them is analyzing trees to understand the ecosystem from the past. According to one of the exhibits, tree rings can be used to determine the age of the tree and also some climate from that period, a process called dendrochronology. The more rings a tree has, the old it is and the width of the tree varies with the climate during the period of its growth.

A lot of the trees alive today are not too old. On the other hand they have fossil of older trees. Since they possibly co-existed for a period of time, it’s possible to combine the data from the rings of both trees by matching the outer rings of the fossil tree with the inner rings of the recent tree.

Tool to explain the matching system

### The Longest Prefix-Suffix String Overlap problem

I thought about the tree ring matching for a while and realized this can be reduced to the problem of finding the longest overlap between the suffix and prefix of two strings. More formally, given strings A and B, find the longest suffix of A that is also a prefix of B.

This problem can be easily solved using a simple quadratic algorithm, which is is probably fast enough for real world instances, possibly in the order of thousands of rings. In Python, we can use the array access to do:

def longest_suffix_prefix_naive(suffix, prefix):
minlen = min(len(prefix), len(suffix))
max_match = 0
for match_len in range(1, minlen + 1):
if prefix[:match_len] == suffix[-match_len:]:
max_match = match_len
return max_match


In the code above, A is named suffix and B prefix.

But can we do better? This problem resembles a string search problem, where we are given a text T and a query string Q and want to find whether Q appears in T as substring. One famous algorithm to solve this problem is the KMP, named after their authors Knuth Morris and Pratt (Morris came up with the algorithm independently from Knuth and Pratt) which is linear on the size of T and Q.

We’ll see how to use the ideas behind the KMP to solve our problem. So let’s start by reviewing the KMP algorithm.

### The KMP Algorithm: A refresher

The most straightforward way to search for a query Q in a text is for every character in T, check if Q occurs in that position:

def substr(Q, T):

tlen = len(T)
qlen = len(Q)

for i, _ in enumerate(T):
match = 0
# Searches for Q starting from position i in T
while match < qlen and \
i + match < tlen and \
Q[match] == T[i + match]:
match += 1

if match == qlen:
print 'found substring in position', i


#### Avoiding redundant searches

Imagine that our query string had no repeated characters. How could we improve the search?

Q = abcde
T = abcdfabcde
^ mismatch in position 4


If we were using our naive idea, after failing to find Q in position 4, we’d need to start the search over from T(1) = 'b'. But since all characters are different in Q, we know that there’s no 'a' between positions 1 and 3 because they matched the rest of strings of Q, so we can safely ignore positions 1 to 3 and continue from position 4.

The only case we would need to look back would be if 'a' appeared between 1 and 3, which would also mean 'a' appeared more than once in Q. For example, we could have

Q = ababac
^ mismatch in position 5


When there is a mismatch in position 5, we need to go back to position 2, because 'abababac' could potentially be there. Let MM be the current mismatched string, 'ababad' above and M the string we had right before the mismatch, i.e., MM without the last letter, in our case, 'ababa'. Note that M is necessarily a prefix of Q (because MM is the first mismatch).

The string M = 'ababa', has a potential of containing Q because it has a longest proper suffix that is also a prefix of Q, 'aba'. Why is it important to be a proper prefix? We saw that M is a prefix of Q, hence the longest suffix of M that is a prefix of Q would be M itself. We are already searched for M and know it will be a mismatch, so we’re interested in something besides it.

Let’s define LSP(A, B) as being the length of the longest proper suffix of A that is a prefix of B. For example, LSP(M, Q) = len('aba') = 3. If we find a mismatch MM, we know we need to go back LSP(M, Q) positions in M to restart the search, but we also know that the first LSP(M, Q) positions will be a match in Q, so we can skip the first LSP(M, Q) letters from Q. In our previous example we have:

Q =   ababac
^ continue search from here
(not from position 2)


We’ll have another mismatch but now MM = 'abad' and LSP(M, Q) = len('a') = 1. Using the same logic the next step will be:

Q =     ababac
^ continue search from here
(not from position 2)


Now MM = 'ad' and LSP(M, Q) = 0. Now we shouldn’t go back any positions, but also won’t skip any characters from Q:

Q =      ababac
^ continue search from here


If we know LSP(M, Q) it’s easy to know how many characters to skip. We can simplify LSP(M, Q) by noting that M is a prefix of Q and can be unambiguously represented by its length, that is M = prefix(Q, len(M)). Let lsp(i) = LSP(prefix(Q, i), Q). Then LSP(M, Q) = lsp(len(M)). Since lsp() only depends on Q, so we can pre-compute lsp(i) for all i = 0 to len(Q) - 1.

Suppose we have lsp ready. The KMP algorithm would be:

def kmp(Q, T):

lsp = precompute_lsp(Q)

i = 0
j = 0

qlen = len(Q)
tlen = len(T)

while i < tlen and j < qlen:

if j < qlen and Q[j] == T[i]:
j += 1
i += 1
elif j > 0:
j = lsp[j - 1]
else:
i += 1

# string was found
if j == qlen:
print i - j


We have to handle 3 cases inside the loop:

1: A match, in which case we keep searching both in T, Q.

2: A mismatch, but there’s a chance Q could appear inside the mismatched part, so we move Q the necessary amount, but leave T.

3: A mismatch, but there’s no chance Q could appear in the mismatched part, in which case we can advance T.

This algorithm is simple but it’s not easy to see it’s linear. The argument behind it is very clever. First, observe that lsp[j] < j, because we only account for proper suffixes. It’s clear that i is bounded by len(T), but it might not increase in all iterations of the while loop. The key observation is that (i - j) is also bounded by len(T). Now in the first condition, i increases while (i - j) remains constant. In the second, i might remain constant but j decreases and hence (i - j) increases. So every iteration either i or (i - j) increases, so at most 2T iterations can occur.

#### The pre-computed lsp

How can we pre-compute the lsp array? One simple idea is to, for every suffix of Q, find the maximum prefix it forms with Q. In the following code, i denotes the start of the suffix and j the start of the prefix.

def precompute_lsp_naive(Q):
qlen = len(Q)
lsp = [0]*qlen

i = 1
while i < qlen :

j = 0
while (i + j < qlen and Q[j] == Q[i + j]):
lsp[i + j] = max(lsp[i + j], j + 1)
j += 1

i += 1
return lsp


The problem is that this code is O(len(Q)^2). In practice len(Q) should be much less than len(T), but we can do in O(len(Q)), using a similar idea from the main idea of the KMP, but this time trying to find Q in itself.

The main difference now is that we’ll compute lsp on the fly. While searching for Q in Q, when we have a matching prefix, we update the lsp. When we have a mismatch, instead of re-setting the search to the beginning of Q we skip some characters given by the lsp. This works because j < i and we had lsp(k) for k <= i defined. In the following code, the only changes we performed were adding the lsp attribution and having T = Q.

def precompute_lsp(Q):

qlen = len(Q)

lsp = [0]*qlen

i = 1
j = 0
while i < qlen:

if Q[j] == Q[i]:
j += 1
i += 1
# We only added this line
lsp[i - 1] = j
elif j > 0:
j = lsp[j - 1]
else:
i += 1

return lsp


We can use exactly the same arguments from the main code of KMP to prove this routine is O(len(Q)). The total complexity of KMP is O(len(Q) + len(T)).

### Solving our problem

At this point we can observe that the problem we were initially trying to solve, that is, find the lsp(A, B), is the core of the KMP algorithm. We can search B, the prefix, in A, the suffix. If we can keep track of the indices i and j, we just need to find when i equals to the length of T, which means there was a (partial) match of B in the suffix of A.

We can generalize our kmp function to accept a callback that yields the indices of the strings at any step:

def kmp_generic(Q, T, yield_indices):

lsp = precompute_lsp(Q)

i = 0
j = 0

qlen = len(Q)
tlen = len(T)

while i < tlen:

if j < qlen and Q[j] == T[i]:
j += 1
i += 1
elif j > 0:
j = lsp[j - 1]
else:
i += 1

yield_indices(i, j)


Now, if we want an array with the indices of all occurrences of Q in T, we can do:

def kmp_all_matches(Q, T):
matches = []
qlen = len(Q)
def match_accumulator(i, j):
if j == qlen:
matches.append(i - j)
kmp_generic(Q, T, match_accumulator)
return matches


Here we use a nested function as our callback. We can also use kmp_generic to solve our original problem:

def longest_suffix_prefix(suffix, prefix):
slen = len(suffix)
max_match = [None]
def max_matcher(i, j):
if max_match[0] is None and i == slen:
max_match[0] = j

# Search prefix in suffix
kmp(prefix, suffix, max_matcher)

return 0 if max_match[0] is None else max_match[0]


Note that the nested function has read-only access to the variables in the scope of the outer function, like max_match and slen. To be able to share data with the nested function, we have to work with references, so we define max_match as a single-element array.

### Conclusion

I used to participate in programming competitions. One of the most interesting parts of the problems are their statements. It’s usually an interesting random story from which you have to extract the actual computer science problem. A lot of the times though, the problem writer thinks about a problem and then how to create a story around that problem. It’s nice when we can find real-world problems that can be modelled as classic computer science problems.

I’ve used the KMP several times in the past but never took the time to study it carefully until now, which only happened because I wrote it down as a post.

### References

[1] Wikipedia – Knuth–Morris–Pratt algorithm
[2] Annual Growth Rings

# Bloom Filters

Burton Howard Bloom is a MIT alumni, who, mwhile employed by the Computer Usage Company, developed and published a data structure that later became famous as “Bloom Filter” [1].

In this post we’ll talk about Bloom Filters. We’ll give a description of the data structure and its operations. We’ll then study the theory that explains the performance of this data structure. Next, we’ll describe an implementation in Python, run some experiments and finally discuss applications.

### Introduction

Bloom filter is a probabilist data structure that enables inserting elements in a set and test whether an element belongs this set, sacrificing accuracy for lower memory usage.

More precisely, it can say an element is in the set when in fact it’s not (false positive), but if it says it’s not in the set, then it’s true. Also, the original variant doesn’t allow removal of elements [2].

In its original form, it doesn’t allow removing elements.

### Algorithm

The bloom filter structure consists of a bit array of size $m$. We then define k independent hash functions, which can distribute a given value into any of the $m$ positions uniformly.

Inserting an element consists in applying the $k$ hash functions to the element and set the bit in the $k$ positions returned by these functions to 1.

Querying for an element consists in applying the same $k$ hash functions and testing whether all the bits in the k positions returned by these functions is set to 1.

If the test returns positive, there’s a chance the element is there, but it could be a false positive, since the bits could have been set by the insertion of other elements. On the other hand, if it returns negative, we’re sure the element is there, because we never unset a bit. Also, as the number of elements in the set grow, the probability of false positives increases.

Time complexity. Both insertion and querying for an element are constant-time operations, since they are proportional to $k$, assuming random access memory and $O(1)$ hash functions.

Probability of false positives. The probability $p$ that the algorithm will return true when the element is not there (false positive), can be described in terms of $m$, $k$, $n$, through the following equation:

$p = (1 -e^{(kn/m)})^k$

In the appendix A, we provide an approximated derivation of this result.

Optimal number of hash functions. For given values $n$ and $m$, we have a choice of how many hash functions to use, $k$. Intuitively, if we use too few hash functions, we may increase the chances of collision when querying for an element, whereas if we use too many, the array will be filled up earlier and thus the collisions will also increase.

It’s possible to prove (a proof is sketched in appendix B), that for fixed $m$, $n$, the value of $k$ that minimizes $p$ is given by:

$k = ln(2)m/n$

### Python Implementation

In this experiment, we use the bitarray python module. It’s a C implementation that represents a bit array efficiently. It has conveniently overloaded operators so most of the time it’s like working with arrays.

We can define a sufficiently large value of our bit array length, $m$, for example:

from bitarray import bitarray
m = 1 << 20
bit = bitarray(m)

# bitarray doesn't guarantee the bits are all set to 0
# upon initialization
bit.setall(False)


### The murmur algorithm

We need to pick a hash function to use with our Bloom filter. References such as [3] suggest using an algorithm called Murmur. This stack exchange thread has a nice survey about different hash functions.

Murmur is a hash algorithm developed by Austin Appleby. Even though it’s not suitable for cryptographic applications, in practice it is very fast and tends to distributed real world instances well.

According to the author, the name comes from an early implementation detail, which used multiply-rotate-multiply-rotate operations to mix the internal state of the hash (hence MuR-MuR).

From the source code, this algorithm seems to shuffle the bits from the input key by using multiplication, shifting and xor operations.

The author mentions using simulated annealing to find some of the constants of the final mix of algorithm. This final part is used to cause the avalanche effect, in which very similar inputs are hashed to very different values.

There’s is a Python library called pyhash that has an interface for several hash functions (implemented in C++).

To install it, we can use the easy_install command. Easy Install is a python module that is used to download packages. pyhash is particular is available in the default repository, PyPI (python package index), so all we need to do is:

> sudo easy_install pyhash


To use it within a Python script:

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
hash_value = hasher(input)


### Families of hash functions

In Jonathan Ellis‘s post, he mentions a strategy for generating a family of hash functions. He cites a paper from Kirsch and Mitzenmacher [4] which shows we can generate a set of $k$ hash functions from 2 independent hash functions in the following way:

$h_k(x) = h_A(x) + i \cdot h_B(x), \, i = 0, \cdots, k-1$

In practice, since murmur3_x64_128() generates 128-bit hashes, we can use the first 64-bits as the first hash function, and the last 64-bits as the second.

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
h128 = hasher(input)
h64l = h128 & ((1L << 64) - 1)
h64u = h128 >> 64

hashes = map(
lambda i: (h64l + i*h64u) % k,
range(k)
)


All the code is available on github.

### Experiments

Before running experiments with our Bloom filter implementation, let’s try to visualize the distribution of the Murmur hash functions. For this first experiment, we compute the hash function of keys ranging form 1 to 10k and module it 10k, so they’re all in the same range of values. Then we plot the keys vs. their hashes in a scatter plot, by exporting this data as a CSV file and rendering in R (using ggplot2):

ggplot(data, aes(x=i, y=h_i)) + geom_point(size=1.5)


From the chart, we can see it apparently distributes the keys uniformly, without any obvious pattern. This is not sufficient to prove it’s a good hash function, but it’s a necessary property.

To test the family of hashes, we can add a new column, which has i for the i-th hash function. Then, in ggplot, we render each class with a different color with the following command:

ggplot(data, aes(x=i, y=h_i, color=class))
+ geom_point(size=1.5) + facet_grid(class ~ .)


We can now compare the distributions between each of the hashes by plotting scatter plots for each pair i, j. Now we use another ggplot2 command to generate a 5×5 grid of charts:

ggplot(data, aes(x=i, y=h_i, color=class))
+ geom_point(size=1.5) + facet_wrap(~class, ncol=5)


We can see each chart has a uniform pattern, which is a good sign if we’re looking for independent hash functions. For comparison, let’s plot the same chart for a set of hash functions derived only from a single hash function, for example

$h_i(x) = i \cdot h_A(x), i = 1, \cdots, k$

Now, let’s analyze the false positive rate. We can do that by counting the number of false positives FP and the true negatives TN and evaluating:

$\dfrac{FP}{FP + TN}$

We fix $m$ and $k$ and shuffle an array from 0 to n-1 to simulate a sampling without replacement. After each inserted element of this array, we go over all non-inserted elements $(FP + TN)$ and count how many of those our Bloom filter thinks they are in the set $FP$. We then plot a line chart with the number of elements inserted vs. the false positive rate, using qplot() R function:

qplot(
data$n, data$prob,
ylab='Pr(false positive)',
xlab='N elements'
)


Let’s plot the effect of increasing the size of the bit array, using the optimal k and increasing the number of elements. In this particular experiment, we use n=250 and try different values of m: 100, 250, 500, 1000

qplot(
data$n, data$prob,
ylab='Pr(false positive)',
xlab='N elements',
size=I(1),
color=data\$m
) + scale_colour_manual(
values=c("red", "blue", "green", "black"),
name="m values"
)


One strange behaviour seems to happen towards the right end of the chart for some curves — the false probability ratio seems to drop. I’ve double checked the code and it looks sane. One possible explanation is that we are calculating the probability over the non-inserted elements and as it approaches the end, the sample size is smaller, so the noise is more significant. Other than that, the probability of false positives tend to increase as we add more and more elements.

Now, let’s analyze the effect of the number of hash functions used. We fix it in n=250, m=1000 and try out different values of k=1, 5, 10, 50.

We can see that using many hash functions can be bad because it fills up the array too fast (k=50). In general k=5 and k=10 perform best for most part of the spectrum.

### Applications

Bloom filters are suitable where false positives are tolerable. For example, it can be used as a first step to avoid lookup to a cache. Periodically a cache server could send a Bloom filter to a local machine, corresponding to items it has cached. If the Bloom filter says an element is on cache, it might not be true, but if it says the item is not there, the local client can avoid a request to the cache and talk directly to the underlying storage.

Bloom filters use little memory, so they are suitable for network transmission.

Broder and Mitzenmacher discuss a lot more examples [5], especially network applications.

### Conclusion

In this post we learned about Bloom filters. The idea itself is quite simple, by studying the theory we are able to review some probability and calculus concepts.

The implementation made us aware of details which are usually not discussed in theoretical texts, for example, which hash functions to use and how to generate a family of hash functions.

We used two Python libraries, pyhash and bitarray and learned a little bit about the Python packaging system, PyPI. We got some experience with the ggplot2 R library, which I plan to post about later.

What we didn’t cover was variants of Bloom filters, like the count version, which allows deletion. Chazelle et al., introduced the Bloomier Filters, which is a generalization of Bloom filters [6].

### References

[1] Quora – Where can one find a photo and biographical details for Burton Howard Bloom, inventor of the Bloom filter?
[2] Wikipedia – Bloom Filter
[3] Spyved – All you ever wanted to know about writing bloom filters
[4] Less hashing, same performance: Building a better Bloom filter – A Kirsch, M. Mitzenmacher.
[5] Network Applications of Bloom Filters – A. Broder, M. Mitzenmacher
[6] Bloomier Filters – B. Chazelle et al.
[7] Probability and Computing: Randomized Algorithms and Probabilistic Analysis – M. Mitzenmacher, E. Upfal

### Appendix A: Probability of false positives

Let’s consider an array of m bits and the insertion operation. The probability of a given bit to be set by one of the hash functions is $1/m$. Conversely, the probability of a bit not being set is $1 - 1/m$. By the hypothesis of independent hash functions, the probability of a bit not being set by any of the $k$ hash functions is

$\left(1 - \frac{1}{m}\right)^k$

Inserting $n$ elements is equivalent to running the $k$ functions $n$ times, so after inserting $n$ elements, the probability that a given bit is set to 0 is

$\left(1 - \frac{1}{m}\right)^{kn}$

or the probability it’s set to one as

$1 - \left(1 - \frac{1}{m}\right)^{kn}$

Now, the probability of a false positive $p$, is the probability $k$ positions being set for a element that was never inserted.

$p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k$

According to [2] though, this derivation is not accurate, since assumes that the probability of each bit being set is independent from each other. The Wikipedia article sketches a more precise proof from Mitzenmacher and Upfal [7] which arrives at the same result.

To simplify the equation, we’ll use the following limit:

$\lim_{x\to\infty} \left( 1 - 1/x \right) = 1/e$

So, for large values of m, we can assume that

$(1 - 1/m)^{kn} = (1 - 1/m)^{(mkn)/m} \approx (1/e)^{(kn/m)}$

Finally we have:

$p = (1 -e^{(kn/m)})^k$

### Appendix B: Optimal number of hash functions

If we make $p$ a function of $k$, it’s possible to prove it has a global minimum for positive $k$‘s so we can use the derivative to find the value of $k$ that minimizes the false positive probability.

To find the derivative, we can use the generalized power rule to get

$\ln(1 - e^{-ck}) + \frac{k}{(1 - e^{-ck})} c e^{-ck}$

where $c=n/m$ is a constant. If we make $y = e^{-ck}$ (and use $k = \ln(y)/-c$)

$\ln(1 - y) - \frac{\ln(y)}{(1 - y)} y$

By making it equals to 0, we get the equation:

$\ln(1 - y) (1-y) = \ln(y) y$

To solve this for $y$, let $x = 1 - y$,

$\ln(x) x = \ln(y) y$

Since $\ln(n)$ and $n$ are both monotonic functions, so is $\ln(n)n$. Then if () holds, we can conclude that $x = y$, because otherwise, if $x > y$, $\ln(x) x > \ln(y) y$ and if $x < y$, $\ln(x) x < \ln(y) y$.

Replacing $x$ back, we find out that $y = 1/2$ and finally $k = ln(2)m/n$