Largest sets of subsequences in OCaml

I’ve noticed that there is this set of words in English that look very similar: tough, though, through, thought, thorough, through and trough. Except thought, they have one property in common: they’re all subsequence of thorough. It made me wonder if there are interesting sets of words that are subsequences of other words.

Word cloud made with Jason Davie’s tool (https://www.jasondavies.com/wordcloud/)

This post is an attempt to answer a more general question: given a list of words, what is the largest set of these words such that they’re subsequences of a given word?

A word A is a subsequence of a word B if A can be obtained by removing zero or more characters from B. For example, “ac” is a subsequence of “abc”, so is “bc” and even “abc”, but not “ba” nor “aa”.

A simple algorithm to determine if a word A is a subsequence of another is to start with 2 pointers at the beginning of each word, pA and pB. We move pB forward until pA and pB point to the same character. In that case we move pA forward. A is a subsequence of B if and only if we reach the end of A before B. We could then iterate over each word W and find all the words that are subsequences of W. If the size of the dictionary is n, and the size of the largest word is w, this algorithm would run in O(n^2 w).

For English words, we can use entries from /usr/share/dict/words. In this case, n is around 235k (wc -l /usr/share/dict/words), so a quadratic algorithm will take a while to run (around 5e10 operations).

Another approach is to generate all subsequences of words for a given word W and search the dictionary for the generated word. There are O(2^w) subsequences of a word of length w. If we use a hash table, we can then do it in O(n w 2^w). In /usr/share/dict/words, the length of the largest word, w, is 24.

Running a calculation with the numbers (R script), the number of high-level operations is 4e10, about the same order of magnitude as the quadratic algorithm.

Distribution using ggplot2

A third approach is to use a trie. This data structure allows us to store the words from the dictionary in a space-efficient way and we can search for all subsequences using this structure. The trie will have at most 2e6 characters (sum of characters of all words), less because of shared prefixes. Since any valid subsequence has to be a node in the trie, the cost of search for a given word cannot be more than the size of the trie t, so the complexity per word is O(\min(2^w, t)). A back of envelope calculation gives us 2e9. But we’re hoping that the size of the trie will be much less than 2e6.

Before implementing the algorithm, let’s define out trie data structure.

The Trie data structure

A trie is a tree where each node has up to |E| children, where |E| is the size of the alphabet in consideration. For this problem, we’ll use lower case ascii only, so it’s 26. The node has also a flag telling whether there’s a word ending at this node.

Notice that in this implementation of trie, the character is in the edge of the trie, not in the node. The Map structure from the stlib uses a tree underneath so get and set operations are O(log |E|).

The insertion is the core method of the structure. At a given node we have a string we want to insert. We look at the first character of the word. If a corresponding edge exists, we keep following down that path. If not, we first create a new node.

To decide whether a trie has a given string, we just need to traverse the trie until we either can’t find an edge to follow or after reaching the end node it doesn’t have the hasEntry flag set to true:

This and other trie methods are available on github.

The search algorithm

Given a word W, we can search for all its subsequences in a trie with the following recursive algorithm: given a trie and a string we perform two searches: 1) for all the subsequences that contain the first character of current string, in which case we “consume” the first character and follow the corresponding node and 2) for all the subsequences that do not contain the first character of the current string, in which case we “consume” the character but stay at the current node. In pseudo-code:

Search(t: TrieNode, w: string):
    Let c be the first character of w.
    Let wrest be w with the first character removed

    If t contains a word, it's a subsequence of the
    original word. Save it.

    // Pick character c
    Search(t->child[c], wrest)

    // Do not pick character c
    Search(t, wrest)

The implementation in OCaml is given below:

Experiments

Our experiment consists in loading the words from /usr/share/dict/words into a trie, and then, for each word in the dictionary, look for its subsequences. The full code is available on github.

The code takes 90 seconds to run on my laptop. Not too bad but I’m still exploring ways to improve the performance. One optimization I tried is to, instead of returning an explicit list of strings as mentioned in the search implementation, return them encoded in a trie, since we can save some operations due to shared prefixes. I have that version on github, but unfortunately that takes 240 seconds to run and requires more memory.

Another way is to parallelize the code. The search for subsequences is independent for each word, so it’s an embarrassingly parallel case. I haven’t tried this path yet.

The constructed trie has 8e5 nodes or ~40% of the size of sum of characters.

Subsequences of “thorough”

The question that inspired this analysis was finding all the subsequences of thorough. It turns out it has 44 subsequences, but most of them are boring, that is, single letter or small words that look completely unrelated to the original word. The most interesting ones are those that start with t and have at least three letters. I selected some of them here:

  • tho
  • thoo
  • thoro
  • thorough
  • thou
  • though
  • thro
  • throu
  • through
  • thug
  • tog
  • tou
  • toug
  • tough
  • trough
  • trug
  • tug

The word with most subsequences is pseudolamellibranchiate, 1088! The word cloud at the beginning of the post contains the 100 words with the largest number of subsequences. I tried to find interesting words among these, but they’re basically the largest words – large words have exponentially more combination of subsequences, and hence the chance of them existing in the dictionary is greater. I tried to come up with penalization for the score:

1) Divide the number of subsequences by the word’s length. This is not enough, the largest words still show on top.
2) Apply log2 to the number of subsequences and divide by the word’s length. In theory this should account for the exponential number of subsequences of a word. This turns out to be too much of a penalization and the smallest word fare too well in this scenario.

I plotted the distribution of number of subsequences by word lengths. We can see a polynomial curve but with increased variance:

Generated with this ggplot2

In the chart above, we’d see all points with the same x-value in a single vertical line. One neat visualization trick is to add noise (jitter) so we also get a sense of density.

If we use a box plot instead, we can see a quadratic pattern more clearly by looking at the median for each length.

Generated with this ggplot2

Given this result, I tried a final scoring penalization, by dividing the number of subsequences by the square of the length of the word, but it’s still not enough to surface too many interesting words. Among the top 25, streamlined is the most common word, and it has 208 subsequences.

One interesting fact is that the lowest scoring words are those with repeated patterns, for example: kivikivi, Mississippi, kinnikinnick, curucucu and deedeed. This is basically because we only count unique subsequences.

Conclusion

This was a fun problem to think about and even though it didn’t have very interesting findings, I learned more about OCaml and R. After having to deal with bugs, compilation and execution errors, I like OCaml more than before and I like R less than before.

R has too many ways of doing the same thing and the API is too lenient. That works well for the 80% of the cases which it supports, but finding what went wrong in the other 20% is a pain. OCaml on the other hand is very strict. It doesn’t even let you add an int and a float without an explicit conversion.

I learned an interesting syntax that allows to re-use the qualifier/namespace between several operations when chaining them, for example:

I also used the library Batteries for the first time. It has a nice extension for the rather sparse String module. It allows us to simply do open Batteries but that overrides a lot of the standard modules and that can be very confusing. I was scratching my head for a long time to figure out why the compiler couldn’t find the union() function in the Map module, even though I seemed to have the right version, until I realized it was being overridden by Batteries. From now on, I’ll only use the specific modules, such as BatString, so it’s easy to tell which method is coming from which module.

References

OCaml

  • [1] OCaml Tutorials > Map
  • [2] Strings – Batteries included
  • [3] Using batteries when compiling

R

  • [1] R Tutorial – Histogram
  • [2] Creating plots in R using ggplot2 – part 10: boxplots
  • [3] My R Cheat sheet

Amortization and Persistence via Lazy Evaluation

In this chapter Okasaki works around the problem of doing amortized analysis with persistent data structures because the amortized analysis assumes in place modification while for persistent data structures (partial) copies are made. The intuition is that lazy evaluation, which comes with memoization and avoids recomputation, solves this problem.

He adapts the Banker’s and Physicists’s methods to work with lazy evaluated operations and applies them to a few structures including Binomial Heaps, Queues and Lazy Pairing Heaps. In this post we’ll only cover the examples of the Queues using both methods.

We’ll first introduce some concepts and terminology, then we’ll present a queue implementation using lazy evaluation that allows us analyzing it under persistence. Following that we’ll explain the Banker’s and Physicist’s methods and prove that the implementation for push/pop has an efficient amortized cost.

Persistence as a DAG

An execution trace is a DAG where nodes represent the operations (e.g. updates to a data structure) and an edge from nodes v to v' indicates that the operation corresponding to v' uses the output of the one corresponding to v.

For example, if we have this set of operations:

let a = push 0 newEmpty
let b = push 1 a
let c = pop b
let d = push 2 b
let e = append c d
let f = pop c
let g = push 3 d 

The corresponding execution graph is:

Execution Graph

The logical history of an operation v is the set of all operations it depends on (directly or indirectly, and including itself). Equivalently, in terms of the DAG, it’s the set of nodes that have a directed path to v.

A logical future of an operation v is any directed path from v to a terminal node.

A framework for analyzing lazy evaluated data structures

We need to introduce a few more concepts and terminology. Note: we’ll use suspension and lazy operation interchangeably, which can be evaluated or forced.

The unshared cost of an operation is the time it would take to execute it if it had already been performed and memoized before, so if the operation involves any expression that is lazy, that expression would be O(1).

The shared cost of an operation is the time it would take it to execute (force) all the suspensions created (but not evaluated) by the operation.

The complete cost is the sum of the shared and unshared costs. Alternatively, the complete cost of an operation is the time it would take to execute the operation if lazy evaluation was replaced by strict. To see why, first we note that the unshared costs have to be paid regardless of laziness. Since we’re assuming no laziness, the operation has to pay the cost associated with the suspension it creates, which corresponds to the shared costs. Note that under this assumption we wouldn’t need to account for the cost of forcing suspensions created by previous operations because in theory they have already been evaluated.

When talking about a sequence of operations, we can break down the shared costs into two types: realized and unrealized costs. The realized costs are the shared costs from suspensions were actually forced by some operation in the sequence. Example: say that operations A and B are in the sequence and A creates a suspension, and then B forces it. The cost for B to force it is included in the realized cost. The unrealized costs are the shared costs for suspensions that were created but never evaluated within the sequence. The total actual cost of a sequence of operations is the sum of the realized costs and the unshared costs.

Throughout a set of operations, we keep track of the accumulated debt, which starts at 0 at the beginning of the sequence. Whenever an operation is performed, we add its shared cost to it. For each operation, we can decide how much of this debt we want to pay. When the debt of a suspension is paid off, we can force it. The amortized cost of an operation is its unshared cost plus the amount of debt it paid (note that it does not include the realized cost). Note that as long as we always pay the cost of a suspension before it’s forced, the amortized cost will be an upper bound on the actual cost.

This framework simplifies the analysis for the case when a suspension is used more than once by assuming that its debt was paid off within the logical history of when it was forced, so we can always analyze a sequence of operations and don’t worry about branching. This might cause the debt being paid multiple times, but it simplifies the analysis.

The author uses the term discharge debit as synonym of pay off debt. I find the latter term easier to grasp, so I’ll stick with it throughout this post.

Let’s introduce an example first and then proceed with the explanation of the Physicist’s method and the corresponding analysis of the example.

The Stream Queue

To allow efficient operations on a queue in the presence of persistence, we can make some of the operations lazy. Recall in a previous post we defined a queue using two lists. To avoid immediate computation, a natural replacement for lists is using its lazy version, the stream data structure, which we also talked about in a previous post.

For the list-based queue, the invariant was that if the front list is empty, then the rear list must be empty as well. For the stream queue, we have a tighter constraint: ‘front’ must be always greater or equal than ‘rear’. This constraint is necessary for the analysis.

The definition of the stream queue is the following:

We store the lengths of the streams explicitly for efficiency.

We’ll be using the Stream developed in the previous chapter, so we’ll refer to the module Stream2 to avoid ambiguity with the standard Stream module.

Inserting an element at the end of the queue is straightforward, since the rear stream represents the end of the queue and is reversed:

The problem is that inserting at rear can cause the invariant of the queue to be violated. check() changes the structure so to conform to the invariant by potentially reversing rear and concatenating with front:

Removing an element from the queue requires us to evaluate the first element of the front stream. Again, the invariant can be violated in this case so we need to invoke check() again:

The complete code for the stream queue is on Github.

Analysis using the Banker’s Method

The idea of the Banker’s Method is basically define an invariant for the accumulated debt and a strategy for paying it off (that is, decide how much debt each operation pays off). Then we show that whenever we need to force a suspension, the invariant guarantees that the accumulated debt has been paid off. One property of the Banker’s method is that it allows associating the debt to specific locations of the data structure. This is particularly interesting for streams, because it contains multiple (nested) suspensions, so we might force parts of this structure before we paid the debt associated with the entire structure.

By inspection, we can see that the unshared cost of both push and pop are O(1). It’s obvious in the case of push, and in the case of pop, in theory check could take O(m) where m is the size of the queue, but since Stream2.concat() and Stream2.reverse() are both lazy, and hence memoized, they are not included in the unshared costs.

To show that the amortized cost of both operations is O(1), we can show that paying off O(1) debt at each operation is enough to pay for the suspension before it is forced. For the queue, we also need to associate the debt with parts of the data structure, so that we could force the suspension of only some parts of it (for example, on the stream we can evaluate only the head, not necessarily the entire structure).

We now define an invariant that should be respected by all operations. Let d(i) be the debt at the i-th node on the front stream, and D(i) = \sum_{j=0}^{i} d(j) the accumulated debt up to node i. The invariant is:

D(i) \le \min(2i, |f| - |r|)

This constraint allows us evaluating the head at any time, because D(0) = 0, which means its debt has been paid off. The second term in min(), guarantees that if |f| = |r| the entire stream can be evaluated, because D(i) = 0 for all i.

The author then proves that by paying off one debt in push() and two debt units in pop() is enough to keep the debt under the constraint.

Queue with suspended lists

Because the Physicist’s method cannot assign costs to specific parts of the data structure, it doesn’t matter if the structure can be partially forced (like streams) or if it’s monolithic. With that in mind, we can come up with a simpler implementation of the queue by working with suspended lists instead of streams. Only the front list has to be suspended because the cost we want to avoid, the reversal of the back list and concatenation to the front list, happens on the front list.

On the other hand, we don’t want to evaluate the front list when we perform a peek or pop, so we keep a evaluated version of the front list too.

The signature of the structure is as follows:

As mentioned in the code above, the invariants we want to enforce is that the front list is never smaller than the rear list

and that the evaluated version of the front list is never empty if the lazy version still has some elements.

The push and pop operations are similar to the other versions of queue, but since we mutate the structure, we might need to adjust it to conform to the invariants:

Finally, because of our second invariant, peeking at the queue is straightforward:

The complete code for the suspended queue is on Github.

Analysis using the Physicist’s Method

We’ve seen the Physicist’s Method in a previous post when we’re ignore the persistence of the data structures. We adapt the method to work with debits instead of credits. To avoid confusion, we’ll use \Psi to represent the potential function. \Psi(i), represents the accumulated debt of the structure at step i. At each operation we may decide to pay off some debit, which will be then included in the amortized cost. We have that \Psi(i) - \Psi(i - 1) is the increase in debt after operation i. Remember that the shared cost of an operation corresponds to the increase in debt if we don’t pay any of the debt. Thus, we can find out how much debt was paid off then by s_i - \Psi(i) - \Psi(i - 1), where s(i) is the shared costs of operation i. Let u(i) and c(i) be the unshared and complete costs of the operation i. Given that, by definition, c(i) = u(i) + s(i), we can then express the amortized cost as:

a_i = cc_i - (\Psi(i) - \Psi(i - 1))

To analyze the suspended queue we need to assign values to the potentials such that by the time we need to evaluate a suspension the potential on the structure is 0 (that is, the debt has been paid off). For the suspended queues we’ll use the following potential function:

\Psi(q) = \min(2|w|, |f| - |r|)

Where w is the forcedFront, f is lazyFront and r is rear.

We now claim that the amortized cost of push is at most 2. If we push and element that doesn’t cause a rotation (i.e. doesn’t violate |f| \ge |r|), then |r| increases by 1, and the potential decreases by 1. No shared is incurred and the unshared cost, inserting an element at the beginning of rear is 1, hence the amortized cost for this case is 1 – (-1) = 2. If it does cause a rotation, then it must be that after the insertion |f| = m and |r| = m + 1. After the rotation we have |f| = 2*m + 1 and |r| = 0, but w hasn’t changed and cannot be larger than the original f, so the potential function is at most 2*m. The reversal of r costs m + 1 and concatenating to a list of size x costs x (discussed previously), plus the cost of initially appending an element to read, so the unshared cost is 2*m + 2. No suspensions were created, so the amortized cost is given by 2*m + 2 - (2*m) = 2.

Our next claim is that the amortized cost of pop is at most 4. Again, if pop doesn’t cause a rotation, |w| decreases by 1, so the potential is reduced by 2. The unshared cost is 1, removing an element from w, and the shared cost, 1 comes from the suspension that lazily removes the head of lazyFront. The amortized cost is 2 – (-2) = 4. Note that if w = 0, we’ll evaluate f, but the ideas is that it has been paid off already. If the pop operation causes a rotation, then the analysis is similar to the push case, except that the complete cost is must account for the shared cost of lazily reming the head of lazyFront, so it’s 2*m + 3, for an amortized cost of 3.

Note that when the suspensions are evaluated, the potential is 0, either when |f| = |r| or |w| = 0.

Conclusion

In this post we covered a simple data structure, the queue, and modified it to be lazy evaluated and can show, with theory, that it allows for efficient amortized costs. The math for proving the costs is not complicated. The hardest part for me to grasp is to get the intuition of how the analysis works. The analogy with debt is very useful.

Meta: Since wordpress code plugin doesn’t support syntax highlighting for OCaml, I’m experimenting with the Gist plugin. Other advantages is that Gists allow comments and has version control!

Amortization Analysis

In Chapter 5 of Purely Functional Data Structures, Okasaki introduces the amortization analysis as preparation for the complexity analysis of functional data structures.

The idea of amortized analysis is to prove the complexity of an operation when the average cost per operation is less than the worst case cost of each individual operation.

In this post we’ll perform analysis of 2 data structures and show that their amortized running time is much better than their worst case. There are 2 methods presented in the book, the Banker’s Method and the Physicist Method. I found the Physicist easier to work with, so I’ll only use that. Also, I find it easy to understand concepts with an examples, so we’ll first introduce the data structure and then introduce the analysis.

Queue with lists

An easy way to implement the queue data structure is by using two lists, one representing the beginning (left) and the other the end (right). Inserting/removing an element from the beginning of a list in OCaml is O(1), while doing it at the end is O(n). For this reason, we store the end of the queue reversed in the right list. (In imperative languages the opposite is true, insertions/removals at the end are O(1) and at the beginning O(n), so the left array that is reversed).

x

Queue using lists representing [1, 2, 3, 4, 5, 6]

Insertions are performed by appending an element to the beginning of the right array. Removing from the front of the queue requires removing the first element of the left array. When the left array is empty, we need to reverse the contents of the right array and move it to the left.

We work with an invariant, in which the left array is only empty if the right array is empty as well. That requires a slight change to the removal, where we perform the reversal and move of the right array when the left array has only one element and is about to become empty.

I’ve implemented this idea in Ocaml.

Analysis – The Physicist Method

If we only look at the worst case, we will get that removing from a queue is an O(n) operation. We can improve this by looking at the amortized cost. One way to perform this analysis is via the physicist method, in which we define a potential (or cost) function \Phi(d) for the data structure d. In the queue case, we can define it to be the size of the right list.

The amortized cost of step i, a_i is defined by

a_i = t_i + \Phi(d_i) - \Phi(d_{i-1})

Where t_i is the actual cost (in number of operations)

The intuition is that some costly operations such as reversing the list, which takes O(n) operations, can only happen after O(n) insertions happened, so the average cost per operation is O(1). Translating it to the potential analogy, each insersion operation increases the potential of the data structure by 1, so by the time a revert happens, the right list is emptied and that causes a loss of potential of n, which had been accumulated and that offsets the cost of the operation t_i.

We can get the accumulated amortized cost of the operations up to a specific step by summing them, and we’ll get

A_i = T_i + \Phi(d_i) - \Phi(d_0)

If we choose \Phi(d) such that \Phi(d_i) \ge \Phi(d_0), for all i, then A_i is an upper bound of T_i. Since we assume d_0 is the empty queue, \Phi(d_0) = 0 and we have \Phi(d_i) \ge 0, this property is satisfied.

One important remark is that this analysis is only applicable to non-persistent data structures, which is not the case in OCaml. We present the OCaml code as exercise, but the correct analysis will done in later chapters.

Splay Trees

Splay Tree is a binary search tree. The insertion of an element e consists in spliting the current tree in 2 subtrees, one of all elements less or equal than e, which we’ll call small and another will all elements larger than e, which we’ll call big. We then make e the root of tree and small the left subtree and big the right subtree.

Partition

Partition

The partition operation is O(height of tree). The intuition is, when looking for the small subtree, imagine we’re analysing a node x. If x >= e, then we know all elements in the left subtree are also >= e, so are all the elements in the right subtree, so we can discard it and only look at the left subtree. The same idea applies to looking for the big subtree.

In an ideal scenario, the tree will be well balanced in such a way that the height of the tree is O(log n), and so the insertion operation. But that’s not always the case, so we introduce a balancing strategy. Whenever we follow 2 lefts in a row when searching for the big subtree, we perform a right rotation. Similarly, when following 2 rights in a row when searching for the small subtree, we perform a left rotation.

These re-balancing are enough to make partition and also insertion an O(log n) amortized time, as we shall see in the next section.

Rotation

Rotation

The operation for finding the minimum element is trivial. To delete the minimum element, we also perform rotations. Since we only traverse the left paths, we only have to do a left rotation.

I’ve implemented these ideas in OCaml.

Amortized Analysis of Splay Trees

We first define a potential function. Given a tree t and a node x \in t, let t(x) be the subtree of t rooted in x. Then \phi(t(x)) = \log(|t(x)| + 1) be a potential function where |t(x)| the number of nodes of the subtree of which x is root, and let the potential of a tree t, \Phi(t), be the sum of the sub-potentials of all nodes of t, that is,

(1) \Phi(t) = \sum_{x \in t} \phi(t(x)).

The amortized equation of calling partition() is the actual cost of the partition, plus the potential of the resulting trees (small and big) minus the potential of the original tree. Let’s denote small(t) as s, and big(t) as b.

(2) \mathcal{A}(t) = \mathcal{T}(t) + \Phi(s) + \Phi(b) - \Phi(t)

We can prove that \mathcal{A}(t) \le 1 + 2 \log(|t|) (see appendix), which means the amortized cost of the partition operation is O(\log n). It implies that insertion also has an amortized cost of O(\log n).

The same idea can be used to prove the amortized costs of the deletion of the minimum element. Okasaki claims that splay trees are one of the fastest implementations for heaps when persistence is not needed.

Conclusion

I found amortized analysis very hard to grasp. My feeling is that the amortized analysis is too abstracted from the real problem, making it not very intuitive. It’s also not clear how one picks the potential function and whether there’s a choice of potential that might result in a better than the “actual” amortized cost.

Appendix

As we can see in code, there are four cases to consider: left-left, left-right, right-left and right-right. The left-* and right-* are symmetric, so let’s focus on the left-left and left-right cases. For the left-left case, suppose we perform the following partition:

Left-left case of partition()

Left-left case of partition()

We start with (2)

(3) \mathcal{A}(t) = \mathcal{T}(t) + \Phi(s) + \Phi(b) - \Phi(t)

Since = \mathcal{T} corresponds to the number of recursive calls to partition, \mathcal{T}(t) = 1 + \mathcal{T}(u), so

(4) = 1 + \mathcal{T}(u) + \Phi(s) + \Phi(b) - \Phi(t)

The recursive call to u yields (s’, b’), so we have the equivalence, \mathcal{A}(u) = \mathcal{T}(u) + \Phi(s') + \Phi(b') - \Phi(u),

(5) = 1 + \mathcal{A}(u) - \Phi(s') - \Phi(b') + \Phi(u) + \Phi(s) + \Phi(b) - \Phi(t)

Since, s = s’,

(6) = 1 + \mathcal{A}(u) - \Phi(b') + \Phi(u) + \Phi(b) - \Phi(t)

and \Phi(b') - \Phi(b) = \phi(b(x)) + \phi(b(y)) + \Phi(c) + \Phi(d), and \Phi(t) - \Phi(u) = \phi(t(x)) + \phi(t(y)) + \Phi(c) + \Phi(d), we can cancel out some terms and simplify the equation above to

(7) = 1 + \mathcal{A}(u) - \phi(b(x)) + \phi(b(y)) - \phi(t(x)) - \phi(t(y))

by induction, \mathcal{A}(u) \le 1 + 2 \phi(u),

(8) \le 2 + 2\phi(u) + \phi(b(x)) + \phi(b(y)) - \phi(t(x)) - \phi(t(y))

Since u is a subtree of t(y), \phi(u) < \phi(t(y)) and since the nodes in b is a subset of t, \phi(b(y)) \le \phi(t(x))

(9) < 2 + \phi(u) + \phi(b(x))

It’s possible to show that, given x, y, z such that y + z \le x, then 1 + \log y + \log z < 2 \log x. We have that |t| = |u| + |c| + |d| + 2, while |b(x)| = |c| + |d| + 1, thus, |u| + |b(x)| < |t| and by the relation above, 1 + \log (|u|) + \log (|b(x)|) + 1 < 2 \log (|t|) + 1, \phi(u) + \phi(b(x)) + 1 < 2 \phi(t), which yields

(10) \mathcal{A}(t) < 1 + 2 \phi(t).

It’s very subtle, but in this last step we used the assumption of the rotation, in particular in |b(x)| = |c| + |d| + 1, if we hadn’t performed the rotation, |b(x)| = |b| + |c| + |d| + 2, and |u| + |b(x)| < |t| wouldn’t hold!

For the left-right case, we can do a similar analysis, except that now we’ll apply the recursion to c, and the resulting partition will look different, as depicted below:

Left-right case of partition()

Left-right case of partition()

In this case, (8) will become

(11) \le 2 + 2\phi(c) + \phi(b(x)) + \phi(b(y)) - \phi(t(x)) - \phi(t(y))

We then use that \phi(c) < \phi(t(y)) and \phi(t(y)) < \phi(t(x)) this will allow us to cancel out some terms to

(12) < 2 + \phi(b(x)) + \phi(b(y))

Similar to what we've done before, we can show that |b(x)| + |b(y)| = |s| + |b| = |t| since the partitions of |t| are mutually exclusive. We can still use the result from the other case and get (10).

References

[1] Wikipedia – Amortized Analysis

Streams and Lazy Evaluation in OCaml

In this short post we’ll discuss lazy evaluation in OCaml and study a data structure called Stream. It’s based mainly on chapter 4 of Purely Functional Data Structures and it’s part of a series of study notes on that book.

Lazy Evaluation

Lazy evaluation is a property in which an expression is not evaluated immediately (suspended) and when it’s evaluated the first time, the subsequent calls are cached (memoized). Functional languages like Haskell are lazy evaluated, but not OCaml, which is eagerly evaluated. Because the results are memoized, expressions that are lazily evaluated must always return the same value given the same inputs. In Haskell it’s easy to enforce because functions are pure, that is, they do not rely on side effects.

Lazy?

Lazy? Source: Flickr – Brian Gratwicke

In the book the author defines a notation for lazy evaluation:

datatype a susp = $ of a

In OCaml, we can work with lazily evaluated expressions through the Lazy module. The definition of a suspension is similar:

type 'a t = 'a lazy_t

and we can use the lazy construct. Let’s define a simple expensive function, a naive Fibonacci, which runs at O(2^n) time:

let rec fibo n =
  if n <= 1 then 1
  else (fibo (n - 1)) + (fibo (n - 2))
;;

We can create a lazy evaluated version of it:

let lazy_fibo n = lazy (fibo n);;

We can see that by assigning it to a variable, it doesn’t cause the function to be executed:

let r = lazy_fibo 42;;

The author defines a matching operator ($) that causes a lazy expression to be evaluated, but I couldn’t find a corresponding operator in OCaml. Nevertheless, the Lazy module has the force() function, which does exactly that:

Lazy.force r;; // It might take a while!

Note that if we execute the same expression again, the second time it returns much faster, because of the memoization.

We are now ready to introduce the stream data structure.

Streams

A stream is a lazy version of a linked list. Recall that a linked list is composed of nodes which point to the next node, or equivalently, to the remaining of the list. The usual definition of a linked list is:

type 'a node = Nil | Node of 'a * 'a node

If we want to be explicit that a node is actually pointing to a sublist, we could use an intermediate type, listType:

type 'a node = Nil | Node of 'a * 'a list
and 'a list = 'a node

Note that node and list are mutually recursive (they depend on each other), so we have to define them together by using the and construct.

In a stream the pointer to the remaining of the list is lazily evaluated, so the type is:

type 'a streamCell = Nil | StreamCell of 'a * 'a stream
and
'a stream = ('a streamCell) Lazy.t

With this basic structure we can implement many of the list functions for streams.

Concatenation

Let’s start with the concat operator (++):

let rec (++) (streamA: 'a stream) (streamB: 'a stream): ('a stream) =
  let computedStreamA = Lazy.force streamA in
  match computedStreamA with
    | Nil -> streamB
    | StreamCell (elem, rest) -> lazy (StreamCell (elem, rest ++ streamB))
;;

Note that it never evaluates streamB and it only evaluates the first cell of streamA.

To help us testing, we can define function to convert from a list:

let rec fromList (l: 'a list): ('a stream) = match l with
  | [] -> lazy Nil
  | x :: xs -> lazy (StreamCell (x, fromList xs))
;;

and a function that forces the evaluation of the entire stream, essentially converting it back to a list:

let rec toList (stream: 'a stream): ('a list) =
  let computedStream = Lazy.force stream in
  match computedStream with
    | Nil -> []
    | StreamCell (elem, rest) -> elem :: (toList rest)
;;

Take

The take(n) function returns the first n elements from a stream. Like the concat function, only the first node of the stream is evaluated. The recursive call is suspended.

let rec take (n: int) (stream: 'a stream) : ('a stream) =
  if n == 0 then lazy Nil
  else
    let computedStream = Lazy.force stream in
    match computedStream with
      | Nil -> lazy Nil
      | StreamCell (elem, rest) -> lazy (StreamCell (elem, (take (n - 1) rest)))
;;

Drop

The drop(n) function removes the first n elements from a stream and returns the result. In this case, we need to evaluate all the n recursive calls:

let rec drop (n: int) (stream: 'a stream): ('a stream) =
  if n == 0 then stream
  else
    let computedStream = Lazy.force stream in
    match computedStream with
      | Nil -> lazy Nil
      | StreamCell (_, rest) -> drop (n - 1) rest
;;

take and drop look very similar but one is lazy while the other is not. That’s because the head of the stream is not suspended, but the tail is. In the drop case we need to find the (n+1)-th element that will be the new head of the stream. In the take case, we’re not changing the head, and since the tail is suspended, it can wait.

Reverse

The reverse function reverts the order the elements in a stream. In this case it’s more obvious that since we’re changing the location of the head, it must be eagerly evaluated.

let reverse (stream: 'a stream): ('a stream) =
  let rec reverse' = fun oldStream newStream ->
    let computedStream = Lazy.force oldStream in
    match computedStream with
      | Nil -> newStream
      | StreamCell (elem, rest) -> reverse' rest  (lazy (StreamCell (elem, newStream)))
  in reverse' stream (lazy Nil)
;;

Conclusion

In this post we saw that OCaml is not lazy evaluated but we can rely on the Lazy module to accomplish that. We also learned a new data structure, stream, which is recursively lazily evaluated and operations like concat and take play well with laziness, while other like drop and reverse do not.

The full implementation with comments is available on github.

References

[1] OCaml Module Lazy
[2] cyocum – Mutually Recursive Types
[3] Implementing lazy from scratch in OCaml

Persistent Data Structures

Book

In this post we’ll discuss some basic data structures in functional programming and their implementation in OCaml. Our main reference are the first two chapters of Purely Functional Data Structures by Chris Okasaki.

The first chapter explains persistent data structures and the second chapter provides examples of three common data structures in imperative programming and how to implement them in ML (a functional language).

Persistent Data Structures

The core idea behind designing data structures in functional languages is that they’re immutable, that is, if you want to modify it, a new copy has to be made.

If we want to modify a data structure in the naive way, we would need to clone the entire structure with the modifications. Instead, this is done in a smart way so different versions of the data structure share them. Two examples are given to illustrate this idea: lists and binary trees.

Immutable Linked Lists

Lists are a central native data structure in functional languages. In Ocaml, they’re implemented as linked list. From our data structure class, we know that inserting an element at the beginning of a list is an O(1) operation, but this requires modifying pointers in the data structure. As we said data structures are immutable, so we need to make a copy to perform this operation.

Note that we only need to make the new node point to the beginning of the original list, which means we don’t have to clone this list, we just reuse it.

Now, if we’re inserting in the middle of the list will require copying the existing nodes until we reach our newly inserted element, which we can then point to the remaining of the original list. Consider the example below, where we insert the element 4 in the second position of the list.

Inserting a new element (4) in a linked list

Inserting a new element (4) in a linked list

Note that inserting an element at the beginning of a linked list is much more efficient than at the end, both in terms of complexity and space. If we were to insert it at the end, we would require full copies and insertion would be O(length of the list). As an experiment, we can write a function that generates a list representing a range of integers in two ways.

In example 1, we create the list by inserting an element at the beginning of the array,

let rec list_range_fast start_range end_range =
  if start_range == end_range then []
  else start_range :: (list_range_fast (start_range + 1) end_range)
;;

In example 2 we do it at the end:

let rec list_range_slow start_range end_range =
  if start_range == end_range then []
  else (list_range_slow start_range (end_range - 1)) @ [end_range]
;;

If I run the slow version with start_range = 0 and end_range = 50000, it takes over a minute to run on my computer, while the fast version runs in a few milliseconds.

Immutable Binary Trees

Binary trees are a generalized version of a linked list and it works the same way. The key point to notice is that when inserting an element in a (binary) tree only the nodes in the path from the root to the inserted node need to have their pointers modified, so we need to clone a number of nodes that are at most the height of the tree. See the example below:

Inserting an element in a binary tree

Inserting an element in a binary tree. Green nodes are the new ones

With this basic concept understood for simple constructs like linked lists and binary trees, we can construct basic data structures on top of them: heaps and balanced binary trees.

Leftist Heaps

A (minimum) leftist heap is a binary tree with the following properties:

1. The value of a node is smaller or equal the value of its children.

Let the spine of a node be the path defined by following the right children until a leaf is found (i.e. the rightmost path), and the rank of a node the length of its spine.

2. The rank of the left child is always greater or equal than the right child.

It’s possible to show that the rank of a heap of n nodes is less or equal than O(floor(log n + 1)).

Insertion. Before talking about inserting an element in a heap, let’s define the more general merge() function, which merges two leftist heaps, A and B, into one. We compare the root of each heap, say A.val and B.val. If A.val <= B.val, we make A.val the root of the new heap, A.left the left of the new heap and the right of the new heap will be the merge of A.right and B. If A.val >= B.val we do an analogous operation.

Note that the result of the merge might violate property 2, since we’re adding a new node to the right subtree. This is easy to fix: we just to swap the left and right subtree when coming back from the recursion. This is what makeTree() does.

Note that we always perform the merges of the right, which means that the number of such merges will be proportional to the largest rank the original heaps, which implies we can do it in O(log(A.size + B.size)).

In the context of immutable structures, this implementation of heap is efficient because the only nodes whose pointers we need to mutate (even when a swap is needed) are on the spine of the trees, so it only needs to create O(log n) new nodes.

The insertion process consists in creating a heap with one element and merging into the target heap.

Returning the minimum element is trivial: we just need to return the element at the root. Removing the minimum is easy as well: just merge the left and right subtree using merge().

My implementation for the leftist heap is on github.

Binomial Heaps

Binomial heaps are an alternative heap implementation. Before defining a binomial heap, let’s introduce the binomial tree. A binomial tree is a can be defined recursively based on a property called rank. A single node is a binomial tree of rank 0. A tree of rank r > 0, is formed by combining two trees of rank r-1 making one tree the leftmost child of the other. We denote this operation as linking.

Binomial Heap

Examples of binomial trees of different ranks

A binomial heap is a list of binomial trees ordered by increasing rank, with no trees with the same rank.

Insertion. Before talking about insertion of an element into a heap, let’s see how to insert a binomial tree of rank r into the heap. To keep the order of the rank of the tree list, we need to traverse through it to find the right place to insert it. Since we can’t have two trees with the same rank, if we run into this case we need to merge those two trees into one with rank r + 1. This can cascade so we might need to repeat this process with the new tree.

Linking a tree is a direct application of its definition. We just need to decide which of the trees will become child of the other. Since we want a minimum heap, we need to guarantee that the minimum element is always on top, so we always make the tree with smallest root to be on top.

Linking is a constant time operation, since we only need to update the pointer of the root of the top tree, which also means we only need to clone the root node to generate a new immutable binomial tree.

For the complexity of traversing the heap list, we can first node that a heap of rank r has 2^r nodes. Which means that a heap of n elements has at most log(n) trees. In fact, a neat way to represent a binomial heap of size n is by its binary representation. For example, 10(dec) = 1010(bin), so for a heap of size 10, we should have a list of trees with ranks 1 and 3. This shows that we can traverse a list in O(log n) and in the worst case we might need to clone this many number of nodes.

Returning the minimum element requires us to scan the list of trees, because even though we know the minimum element is the root of a tree, we don’t know which tree it is, so this operation is O(log n). For removing this element, we can define an auxiliary function, removeMinTree(), which removes the tree with the minimum element from the tree list. Since we only want to remove the root of this tree, we need to re-insert the subtrees back to the heap.

One key observation is that, in a binomial tree of rank r, the children of the root are also binomial trees of ranks from 0 to r-1, which form a binomial heap. We can then define a merge() function that merges two heaps using an idea similar to a merge sort. If we refer back to the analogy of the binary representation of the heap size, the merge operation is analogous to adding two binary numbers!

My implementation for the binomial heap is on github.

Red Black Trees

A red-black tree is a binary seach tree in which every node is either Red or Black. It respects the following invariants:

1. No red node has a red child
2. Every path from the root to an empty node has the same number of black nodes

Property. the height of a red-black tree of n nodes is at most 2*floor(log(n + 1))

Proof sketch. If the tree has a path to an empty node of length greater than 2*floor(log(n + 1)), this must have more than floor(log(n + 1)) black nodes in the path because we cannot have two consecutive red nodes (invariant 1). Now, by removing all the red nodes, we must have a complete tree of height >= floor(log(n) + 1) + 1, which means 2^(floor(log(n + 1)) + 1) - 1 which is greater than n.

Membership. Since a Red-Black tree is a binary search tree, search can be done in O(height of the tree) which is O(log n) by the property above.

Insertion. inserting an element in a Red Black tree is similar to inserting it into a binary search tree. The challenge is that it may violate one of the invariants. To avoid that we must perform re-balancing along the path we follow to insert the node in the tree.

We always color the inserted node Red. This doesn’t violate the Black nodes constraint, but it might violate the Red nodes one, in case its parent is also Red. If that’s the grandparent of the inserted is necessarily Black. We now have 4 possible scenarios, depicted at the top, right, bottom and left trees:

Unbalanced Red-Black trees and the result of the balancing operation

Unbalanced Red-Black trees and the result of the balancing operation

We assume that the subtrees a, b, c and d are balanced. For all these 4 cases we can perform a “rotation” to achieve the tree at the center.

The Black nodes constraint is not violated, because the path from the root still has the same number of Black nodes as before, and we fixed the consecutive Reds. Notice that y might violate the Red node constraints with its parent, so we need to do it recursively.

In terms of complexity, the insertion can be done in O(log n) and rebalancing takes a constant amount of time. Regarding the immutable structure, notice we only need to change nodes from the insertion path, so only O(log n) nodes have to be cloned.

My implementation for the Red Black tree in Ocaml is on github.

Conclusion

The first two chapters from this book are really interesting. I have seen the binomial heap and Red-Black trees before in data structure classes and also implemented some data structures such as AVL trees in functional programming in the past (Lisp) during college.

I wasn’t aware of the immutability of data in functional programming until much later, when I was learning Haskell. Okasaki first introduces this concept in Chapter 2, so it allow us keeping that in mind when studying the implementation of functional data structures.

He doesn’t make it explicit on Chapter 3 that the data structures presented are efficient in terms of extra memory necessary to clone them, but they are easy to see.

The ML syntax is very similar to OCaml, but it was a good exercise implementing the code on my own. I tried making it more readable with comments and longer variable names. This also led me to learn a few OCaml constructs and libraries, including:

* How to perform assertions (See the binomial heap implementation)
* Unit testing (using OUnit2 library)
* The “or matching” pattern (see the balance() function in red_black_tree.ml)

OCaml Modules

camel

One prevalent syntax in some OCaml code I’ve encountered is about modules, so I decided to study them a little bit to be able to better understand the code I’ve been reading.

This post is based on Chapters 4 and 9 from Real World OCaml [1, 2], which we started in the last post.

Defining modules

By default, a file defines a module. Module names are derived from filenames and is always capitalized (even if the filename is not). Let’s define a toy example:

myModule.ml:

let a = 1;;

main.ml:

Printf.printf "%d\n" MyModule.a;;

We can now compile using the built-in ocamlc (make sure to follow the setup here):

ocamlc myModule.ml main.ml

Note that the order is important. Since main.ml depends on myModule.ml, it has to be included first. In case the files do not live in the same directory, we can use the -I option. For example, if myModule was in another_dir, we could compile using:

ocamlc -I another_dir/ another_dir/myModule.ml main.ml

A module has two parts: the definition and the signature. A module defined by a file filename.ml can be constrained by a signature in a file called filename.mli. The definition (mli) should be included in the compiling step and appear before the definition. For example, we could have

myModule.mli:

val a : int;;

and compile using:

ocamlc myModule.mli myModule.ml main.ml

Submodules

It’s possible to define multiple modules inside a file through submodules. We can add this to myModule.ml:

module MySubmodule : sig
  val inc : int -> int
end = struct
  let inc x = x + 1
end

and in main.ml:

Printf.printf "%d\n" (MyModule.MySubmodule.inc 1);;

Note that we still need to provide the module name defined by the filename. The first part of the definition is the type signature of the module and the second the definition. The general syntax is:

module <module name> : sig
  type <abstract type name>
  val <name> : <type signature>
end = struct
  type <abstract type name> = <concrete type>
  let <name> <definition>

Alternatively, we can separate the type definition from the implementation. In that case, we create a separate file with extension .mli containing the interface

myModule.mli:

module MySubmodule : sig
  val inc : int -> int
end

and in myModule.ml:

module MySubmodule = struct
  let inc x = x + 1
end

In general, the syntax for the signature is:

module type <module type name> : sig
  type <abstract type name>
  val <name> : <type signature>
end

and for the module definition is:

module <module name> : <module signature> = struct
  type <abstract type name> = <concrete type>
  let <name> <definition>

Including modules

Modules are made available when linking during compile time, but if we want to use a function from a module, we still need to qualify it. We can alternatively include them explicitly to avoid qualifying module names (similar to use namespace in C++).

open Module
foo

Instead of:

Module.foo

We can also invoke open inline:

# let average x y =
    let open Int64 in
    x + y / of_int 2;;
val average : int64 -> int64 -> int64 = <fun>

or alias the module to a shorter name with the let module construct:

let print_median m =
  let module C = Counter in
  match m with
  | C.Median string -> printf "True median:\n   %s\n" string
  | C.Before_and_after (before, after) ->
    printf "Before and after median:\n   %s\n   %s\n" before after

Functors

Functors are functions that transform modules, so it’s a function from a module to a module. A basic example is provided in [2]. First we define a toy signature signature:

module type X_int = sig 
  val x : int 
end;;

Then, we define a functor, which we call Increment:

module Increment (M : X_int) = struct
    let y = M.x + 1
end;;

What tells us this is a function is the extra parameter the module takes (M: X_int). In here, M is the name we give to the module and X_int is its interface. Since the interface tells us M has the x value, we can access it within our function. Note that Increment acts like a function, taking M (module) as parameter and returning another module, defined by the struct block. In this case, the type signature is different because the returned module has y instead of x. We can force the returned type of a function by adding a constraint:

module Increment (M : X_int) : X_int = struct
    let x = M.x + 1
end;;

Now, if we try to use y, we can a compilation error. To fix it, we just change y to x.

Functors cannot be used by themselves. They’re useful for creating modules out of existing modules. For an example, imagine we have a module implementing X_int:

module Three = struct
  let x = 3
end;;

We can create a new module Four, by transforming our Three module:

module Four = Increment(Three);;

// Testing the modules
Printf.printf "%d\n" Three.x;; // 3
Printf.printf "%d\n" Four.x;;  // 4

“Abstract” functors

In [2], the authors provide an example of an MakeInterval module, in which there’s a dependent type. First it creates a Comparable signature:

module type Comparable = sig
  type t
  val compare : t -> t -> int
end;;

to make it shorter (and less “real world”), I’ve created a simpler version here, MakeElement:

module MakeElement (InputModule : Comparable) = struct
  type t = Element of InputModule.t
  let create x = Element x
end;;

we can then create a module:

module IntElement = MakeElement(Int);;

The above works because Int satisfies the constraint defined by the Comparable module signature. The authors make a point here that sticking to standard conventions can improve reuse such as cases like this. We can finally use

# let e = IntElement.create 10;;
val e : IntElement.t = IntElement.Element 10

The authors argue that the IntElement exposes implementation details because Element is “exposed”:

# let e2 = IntElement.Element 10;;
val e2 : IntElement.t = IntElement.Element 10

One solution is to constrain the return type of the functor and not expose Element in the signature. The signature would look like:

module type ElementInterface = sig
  type t
  type element
  val create : element -> t
end;;

and the functor would look like:

module MakeElement (InputModule : Comparable): ElementInterface = struct
  type t = Element of InputModule.t
  let create x = Element x
end;;

The problem is the type element is not bound to anything, so we have to explicitly do it when defining the functor. The construct is the following

module MakeElement (InputModule : Comparable): 
  (ElementInterface with type element = InputModule.t) = struct
  type t = Element of InputModule.t
  type element = InputModule.t
  let create x = Element x
end;;

now MakeElement return a module with interface ElementInterface which doesn’t expose Element.

Destructive Substitution

One problem with the approach above is the redundant binding of type element. One slightly different syntax removes that requirement:

module MakeElement (InputModule : Comparable): 
  (ElementInterface with type element := InputModule.t) = 
struct
  type t = Element of InputModule.t
  let create x = Element x
end;;

We basically changed from = to :=, which is called destructive substitution.

Multiple Interfaces

It’s possible to “extend” more than one module signature when creating a new one. For example:

module type MyFunctor = sig
   include MyModuleInterface
   include MyModuleInterface2 with type t := t
end;;

Conclusion

Modules seems a very powerful concept in Ocaml. Besides organizing files, modules can act as functions and can model concepts from Object Oriented Programming like classes and interfaces.

I’ve been following a different study strategy while learning Ocaml. I try to read some real world code and when I get stuck understand a syntax or a pattern, I can search for them in a book. This makes it more interesting than reading a book cover to cover.

References

[1] Real World OCaml – Chapter 4. Files, Modules, and Programs
[2] Real World OCaml – Chapter 9. Functors
[3] Compiling OCaml projects

Exploring OCaml

camel

I’ve been learning the basics of OCaml in order to be able to follow the examples from Purely Functional Data Structures, from Chris Okasaki.

My background is that I have zero knowledge about OCaml, but having studied a bit of Haskell, I’m familiar with the main concepts in functional programming, so I’ll try to focus on syntax + difference between the two languages. I’ll probably skim through concepts from functional programming I’ve already covered previously on past Haskell posts.

Setup

I used to skip the setup from posts, but it might be useful to someone, especially with a similar environment, so I decided to include it here. This setup assumes MacOS and emacs.

Installing

Easily available on Brew:

$ brew install ocaml
$ brew install opam
$ ocaml

For other OS’es: https://ocaml.org/docs/install.html

Emacs syntax highlighting

Looks like tuareg is a popular mode for developing OCaml in emacs:

opam init
opam install tuareg

At the configuration file (~/.emacs.d/init.el) we can add:

(load "~/.opam/system/share/emacs/site-lisp/tuareg-site-file")

OCaml CLI

Typing ocaml in the terminal will bring up the CLI, but it’s not very interactive, not having a good way to go back previous command or edit strings in the middle. I’ve learned about this command line called rlwrap that implements this functionality. We can easily install it on mac:

brew install rlwrap

And invoke ocaml like this:

rlwrap ocaml

We can also add a simple alias to have these features as default:

alias ocaml='rlwrap ocaml'

Language Basics

Statements

Statements boundaries are defined by ;;. Example:

# 1 + 1;;
- : int = 2

This let us define multi-line expressions,

# 1 +
# 1
# ;;
- : int = 2

Comments

(* This is a single-line comment. *)

(* This is a
 * multi-line
 * comment.
 *)

Load code from file

It can become tedious to edit functions in the CLI. It’s possible to execute the contents of a file:

> ocaml
# #use "my-file.ml";;

Basic Types

* int – 31 or 63-bits, depending on the platform one of the bits is used for internal memory management

When writing literal number values, the underscore character is ignored (as long as it’s not the leading character). For example:

# 10_1___02__  + 1;;
- : int = 10103

This can be useful to define large numbers in a more user friendly way:

# let x = 12_434_934
val x : int = 12434934 

* float – IEEE double-precision floating point

OCaml doesn’t do explicit casts, especially between ints and floats. We have to cast using functions like int_of_float or float_of_int. Examples:

# 1 + 1;;
- : int = 2
# 1.3 +. 1.2;;
- : float 2.5
# 1 + 1.0;;
Error: This expression has type float but an expression was expected of type int
# 1 +. 1.
Error: This expression has type int but an expression was expected of type float
# 1 + (int_of_float 1.0)
- : int = 2
# (float_of_int 1) +. 1.
- : float 2.

Note that the sum operator for floats has an extra . character (+.)

* bool – true/false
* char – 8-bit ascii character
* string – more than a list of char, efficient internal representation

Variables

We can define variables by the use of let

# let a = 3 in
  let b = 4 in
  a + b;;
- : int = 3

This looks like imperative code at the first glance, but it’s slightly different. let <expr1> in <expr2>. The expr1 is only made available inside the scope of expr2. For example:

# let a = 3 in
  let b = 4 in
  a + b;;
- : int = 3
# a;;
Error: Unbound value a

Here the variable a was defined only for the expression:

  let b = 4 in
  a + b;;

When we terminated it with ;; , a was out of scope. We can also bind multiple variables in the same expression, for example:

# let a = 3 and let b = 4 in
  a + b;;
- : int = 3

Functions

Defining Functions

Example: A simple sum function

# let sum a b = a + b;;
val sum : int -> int -> int = <fun>
# sum 1 2
- : int = 3

Note how the type signature syntax is very similar to Haskell.

Explicit function type signature

Sometimes to avoid ambiguity, we might want to provide the types for the inputs/outputs of the function. For example, we might want to define a sum function only intended to be used by ints.

let intSum (a: int) (b: int) : int = a + b;;

Lambda functions

Denoted by the use of the fun construct, it’s useful for passing simple functions as parameter (for example to a map over lists).

# let doubleValues ls = 
  List.map (fun x -> 2*x) ls
;;
val doubleValues : int list -> int list = <fun>
# doubleValues [1; 2; 3];;
- : int list = [2; 4; 6]

Recursive functions

A function must be explicitly marked as recursive by add a rec, according to [2], due to technical reasons – mainly related to type inference.

Example: Computing the factorial of a natural number n:

# let rec factorial n =
  if n == 0 then 1
  else n * (factorial (n -1))
;;
val factorial : int -> int = <fun>
# factorial 10;;
- : int = 3628800

Matching patterns

Similar to Haskell, Ocaml has pattern match which we can use to decide which body of function to apply. For example, to invert a list we can do:

# let rec reverseList xs =
  match xs with
  | [] -> []
  | x :: xs -> (invertList xs) @ [x]
;;
# reverseList [1; 2; 3];;
- : int list = [3; 2; 1]

The _ operator indicates all the non-matched cases. For example, we could rewrite the reverseList function as

# let rec reverseList xs =
  match xs with
  | x :: xs -> (invertList xs) @ [x]
  | _ -> []
;;
# reverseList [1; 2; 3];;
- : int list = [3; 2; 1]

Labeled arguments

We can prefix a function parameter with ~ to indicate it’s a labeled (named) argument. Example:

# let div ~a ~b = float a /. float b;;
val div : a:int -> b:int -> float = <fun>
# div 10 2
- : float = 5.
# div ~b:10 ~a:2
- : float : 0.2

Note how the variable name shows up in the function’s type signature. It’s important because we may pass a function with labeled arguments to another function and it may make use of this fact (it’s also useful for documentation).

If the variable name matches the named parameter, we don’t need to repeat ourselves:

# let a = 10
val a : int = 10
# leb b = 2
val b : int = 2
# div ~b ~a

We can also apply currying using named arguments. For example, if we want generate a new function with the value b “bound”, we can do

# let b = 2
val b : int = 2
# let half = div ~b;;
val half : a:int -> float = <fun>
# half ~a
- : float = 0.5

When currying, positional parameters (i.e. non-labeled arguments) are always bound before the labeled ones. For example:

# let makeList3 ~a ~b c = [a; b; c];;
val makeList3 : a:'a -> b:'a -> 'a -> 'a list = <fun>
# let makeList2 = makeList3 1
val makeList3 : a:'a -> b:'a -> 'a list = <fun>
# makeList2 2 3
- : int list = [2; 3; 1]

Optional parameters

Optional parameters are prefixed with ? like sep in the example below:

# let concat ?sep x y =
  let sep = match sep with None -> "" | Some x -> x in
    x ^ sep ^ y
;;
val concat : ?sep:string -> string -> string -> string = <fun>

# concat "ab" "cd";;
- : string = "abbc"
# concat "ab" "bc" ~sep:",";;
- : string = "ab,cd"

The value coming from an optional parameter is either None or Some x. An optional parameter is also a labeled parameter.

In the example above, we use pattern matching to provide a default value to sep. There’s a shorter syntax for this:

let concat ?(sep=" ") x y =
  x ^ sep ^ y
;;
val concat : ?sep:string -> string -> string -> string = <fun>

By providing a default value, the value in the sep variable won’t be either None/Some, but the actual type expected, in this case a string.

It can be tricky to apply curry in the presence of optional arguments. [2] discusses in detail the heuristics applied by the compiler in different scenarios.

Conclusion

In this post we covered the very minimum to get our hands dirty with some OCaml code and learn the basic syntax. I don’t plan to study [2] for now. I’m mostly interested in learning enough to follow along Purely Functional Data Structures.

Some first impressions: the ocaml CLI is pretty limited, but thanks to rlwrap it becomes manageable. Haskell is more elegant, Ocaml is more practical.

For the next post in the series I plan to study the most basic and simple data structure in functional programming: lists.

References

[1] OCaml.org – The Basics
[2] Real World OCaml – Chapter 2. Variables and Functions