Paper Reading – F1: A Distributed SQL Database That Scales

In this post we'll discuss F1: A Distributed SQL Database That Scales. We'll provide a brief overview of the paper's contents and study the system's architecture and implementation in more detail. At the end, we provide an Appendix covering some distributed systems and database concepts mentioned throughout the paper.

F1 is a database used at Google to serve data for the AdWords product. This is the system that displays embedded ads on external pages using Google's infrastructure.

F1 is named after a term from Genetics, F1 hybrid, in analogy to the idea that it combines the best aspects of relational databases and NoSQL systems (my initial thought was that it was named after Formula One).

At a high level, F1 is a layer on top of the Spanner database, which we covered in a previous post. It adds features such as a SQL interface (worth noting that Spanner has since evolved to support SQL natively too). Spanner handles a lot of the distributed-systems difficulties under the hood, providing some guarantees, and F1 builds relational database features on top of that.

The paper mentions that AdWords migrated from a sharded MySQL solution, which was very expensive to maintain, especially when the data outgrew an instance and re-sharding was necessary. This data redistribution is now handled transparently by Spanner, and while F1's individual operations have higher latency, the team structured the data in such a way that the system is performant in practice (5-10 ms for reads and 50-150 ms for writes).

Architecture

F1 is deployed on multiple servers, across geographically distributed datacenters. A client sends requests to a load balancer, which then routes them to an F1 server, usually in the closest datacenter.

Architecture of F1: two datacenters, each containing multiple machines running F1 servers, Spanner and CFS instances.

F1 servers are usually in the same Data Center as the Spanner servers that contain their data. F1 can communicate to Spanner servers outside of its datacenter directly, but Spanner only communicates with the Colossus file system (CFS) within the same datacenter because CFS is not globally replicated.

F1 servers are stateless (except while a client performs a pessimistic transaction), which makes them easier to scale, since no data movement is required when servers are added or removed.

In addition to F1 servers, there is an F1 master and a pool of F1 slaves. According to the paper, the master controls which processes are in the pool, and the F1 slaves execute parts of the query plan on behalf of F1 servers, but it's not clear to me why they need to be a separate component.

Data Model

The table data can be organized in a hierarchical schema much like Spanner (see "Data Model" in the post about Spanner). This hierarchical schema is optional but it's a key piece in making the performance of F1 competitive with the previous sharded MySQL system AdWords used.

In this hierarchy, we have a root table and child tables. For each row in the root table, called a root row, we have child rows clustered under it based on their matching keys.

For example, say that we have tables Customer and Campaign. The Customer table has key (customerID) and Campaign has key (customerID, campaignID). A possible structure for the rows would be:

[Figure: Customer root rows with their Campaign child rows clustered under them]

Where Campaign rows with customerID=3 are clustered together under the corresponding Customer row.

The paper mentions that a given cluster of child rows falls within the same Spanner directory, which means that a transaction touching a single cluster typically involves a single Spanner server, avoiding the overhead of synchronization between multiple servers (see Read-write transactions in the Spanner post).

Indexes. Tables support two types of indexes: local and global. Local index entries are stored on the same Spanner server as the rows they index, but the index key must include the key of the root table. Global indexes have no such restriction, but they can be distributed across multiple Spanner servers and are very expensive to update within a transaction.

Schema changes. Updating the schema of a table is challenging because the rows of the table are distributed, which makes schema consistency expensive to achieve, and the system cannot have downtime.

The proposed solution is to break a schema change into smaller steps and guarantee that no two servers are ever more than one step apart (so at most two schema versions are active at any time), which makes the change safe.

Writes

Transactions

F1 supports 3 types of transactions: snapshot transactions, pessimistic transactions and optimistic transactions. It relies on the transactions supported by Spanner, so it's worth re-reading the Implementation Details of the Spanner post.

Snapshot transactions are read-only transactions and map to the corresponding read-only transactions from Spanner. The pessimistic transaction maps to the read-write transactions from Spanner.

The optimistic transaction consists of multiple reads followed by a single write. F1 stores an extra column with the last modified timestamp for each row. When the client performs the reads, it stores the largest timestamp it saw for any row it received. It sends this timestamp together with the final write request. F1 then performs a pessimistic transaction to read only the timestamps of the affected rows. If any of them differ from the timestamp sent by the client, it aborts. Otherwise it performs the write.
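To make the mechanism concrete, here is a minimal sketch of that check over an in-memory row store (the store, row ids and timestamps are made up for illustration; this is not F1's API):

module RowStore = Map.Make (Int)

type row = { value : string; last_modified : int }

(* Read a row, folding its timestamp into the client-side maximum. *)
let read store id max_ts =
  match RowStore.find_opt id store with
  | Some r -> (Some r.value, max max_ts r.last_modified)
  | None -> (None, max_ts)

(* Commit: in F1 this check runs inside a short pessimistic transaction. *)
let commit store ~read_ts ~now writes =
  let conflict (id, _) =
    match RowStore.find_opt id store with
    | Some r -> r.last_modified > read_ts
    | None -> false
  in
  if List.exists conflict writes then Error "abort: row modified since read"
  else
    Ok
      (List.fold_left
         (fun s (id, v) -> RowStore.add id { value = v; last_modified = now } s)
         store writes)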

It's worth noting that F1 doesn't create a transaction until the very last write request is sent. It assumes there were no writes between those reads and the final write, hence optimistic. This is the default transaction type, and the paper describes a few advantages it has over the pessimistic type.

Locks

F1 supports fine-grained locking of tables, including row-level and cell-level (a set of columns for a given row) locks.

Change history

F1 stores the change history of the data. Changes are stored as regular tables, in a structure called ChangeBatch, as children of the root table. If a transaction updates multiple root rows, several ChangeBatches are created under their corresponding root tables.

This feature allows clients to subscribe to changes in F1 tables using a publish-subscribe system. Whenever a client receives a notification, it can ask for the incremental changes after a specific timestamp and apply them to the data it has stored locally.

Reads

F1 supports both NoSQL and SQL interfaces. The SQL dialect was extended to support Protocol Buffers, which are strongly-typed structured data (as opposed to loosely typed structures such as JSON). This extension allows, for example, reading and updating internal fields of such structures.

Local vs distributed. F1 supports either centralized execution (running on a single F1 node) or distributed execution. The query optimizer decides which one to use. The paper details the process of executing a query in a distributed fashion.

Distributed Query Example

The paper describes an example of a SQL query being mapped to an execution plan. It involves two joins, filters and aggregations:

[Figure: the example SQL query, joining AdClick, AdGroupCreative and Creative, with filters and a GROUP BY aggregation]

This SQL query is parsed and converted into a query plan made up of operators, which are executed by multiple machines. A possible execution plan for the sample query is:

[Figure: a possible distributed execution plan for the sample query]

The arrows indicate the data flow, starting at the bottom. The first join is a lookup join (hash join). The operation reads rows from the AdClick table until it has about 100k unique lookup keys stored in a hash table. Then it performs a single (batched) lookup against the other table, AdGroupCreative. The rows from the AdClick table are kept in memory for faster lookup.

As soon as the join for these keys is completed, the joined rows are streamed to the next stage of the pipeline.
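As a toy illustration of this batched lookup join (not F1's implementation; batch_lookup stands in for the single round trip against the inner table), a sketch in OCaml:

module IntMap = Map.Make (Int)

let lookup_join ~batch_size ~batch_lookup (outer_rows : (int * 'a) list) =
  (* Join one full batch of buffered outer rows: one round trip to the inner table. *)
  let flush batch acc =
    if IntMap.is_empty batch then acc
    else
      let keys = List.map fst (IntMap.bindings batch) in
      List.fold_left
        (fun acc (k, inner_row) ->
          match IntMap.find_opt k batch with
          | Some outer -> List.map (fun o -> (o, inner_row)) outer @ acc
          | None -> acc)
        acc (batch_lookup keys)
  in
  let batch, acc =
    List.fold_left
      (fun (batch, acc) (k, row) ->
        let batch =
          IntMap.update k
            (function None -> Some [ row ] | Some rs -> Some (row :: rs))
            batch
        in
        (* Flush once enough distinct lookup keys are buffered (~100k in F1). *)
        if IntMap.cardinal batch >= batch_size then (IntMap.empty, flush batch acc)
        else (batch, acc))
      (IntMap.empty, []) outer_rows
  in
  flush batch acc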

The second join, with Creative, is a distributed join. It first repartitions each row from each table based on the values of the columns listed in the USING clause. Each partition might end up on a different machine for the next stage, which consists of joining matching rows.

Finally, the rows are repartitioned again, now by the values of the GROUP BY columns, and aggregators then apply the aggregation to the sets of rows within each partition.

Distributed Execution Overview

More generally, the query plan created by F1 is a DAG (directed acyclic graph) where each node is an operator like the join or aggregator described above. Note that multiple instances of the same operator run in parallel.

The paper says:

A technique frequently used by distributed database systems is to take advantage of an explicit co-partitioning of the stored data.

It's not very clear to me what they mean by that, especially because they don't cite any references, but from context it suggests that the base operators (the scan and lookup join) run on the same machines as the data (co-located) and do as much of the processing upfront as possible. This helps minimize data transfer, which can become the bottleneck in a distributed computation. F1 cannot do that because Spanner abstracts the data location away from F1.

A side effect is that there's a lot of network traffic. The authors claim that Google has network switch hardware improvements which allow servers to communicate with each other at close to full network speed.

When the in-memory hash tables grow too large, F1 writes part of the data to disk. So while F1 doesn't store data persistently, it still needs to write to disk for intermediate operations.

For efficiency, F1 doesn't write checkpoints to disk. The downside is that the system is not fault tolerant: a failure in any stage of the execution causes the entire query to fail. Retries are done transparently, but long queries (>1h) are bound to fail.

Other features

F1 exposes data from intermediate nodes to clients. This avoids having all the data concentrated at the last node of the query execution. They cite MapReduce jobs as consumers of this feature.

Conclusion

In this post we learned about one of Google's many distributed databases, F1. It's a loosely coupled layer on top of Spanner that provides a more familiar abstraction: a relational database.

It seems that we could make an analogy between Google's systems and similar open source solutions. The Colossus File System (CFS) is the distributed filesystem that could map to the Hadoop Distributed File System (HDFS), Spanner would map to Hadoop's YARN, and F1, providing SQL semantics on top of Spanner, could be mapped to Hive, which does the same for Hadoop. It's a very rough comparison, and maybe Spanner is more similar to Spark, but it's interesting to see the patterns and relationships between these systems.

References

[1] F1: A Distributed SQL Database That Scales
[2] Spanner: Google’s Globally-Distributed Database

Appendix: Terminology and Concepts

I was unfamiliar with some of the terminology and concepts used throughout the paper, so I had to do some extra research. Here I attempt to explain some of these topics by quoting snippets from the paper.

We also have a lot of experience with eventual consistency systems at Google

Eventual consistency means that a set of servers might not contain the most recent updates but eventually will. For example, the client might issue a write that affects multiple machines. If the system only provides eventual consistency then a subsequent read is not guaranteed to get the data up-to-date with those writes.

The quote mentions experience with eventual consistency in a negative way, because of the extra complexities that clients have to deal with to work around this limitation.


CSS Layout

Introduction

In a previous post we studied some major components of the browser, including the rendering engine. In this post we’ll dig a bit further on the layout of the render tree and take a look at an important piece of this process: the cascading style sheets or CSS.

To recap, every DOM element is usually represented by one rectangle. The job of the rendering engine is to determine the properties of these rectangles: their size (height, width), their position (top, left) and their stacking order when they overlap. In this post we'll see how different CSS properties can affect the resulting rectangle.

Note: we’ll use box and rectangle interchangeably in this post.

CSS

History


CSS stands for cascading style sheets. It was proposed by Håkon Wium Lie in 1994.

The CSS 1 specification was finished in 1996. CSS 2 was created in 1997 to address some issues with the previous version. CSS 3 was started in 1998 but hasn't been concluded yet! The thing is that CSS 3 is subdivided into modules, each fairly independent of the others, which resulted in different modules being at different stages.

The diagram below provides a good overview of the different CSS modules and their stages:


CSS Modules and levels

Wikipedia has an interesting history of the development of CSS, including the initial lack of compliance with the specification, which caused a lot of headaches for front-end developers, especially when working with early versions of Internet Explorer.

It's interesting to take a look at the early days of CSS because they play a big role in how it looks today (due to backward compatibility). It's useful to remember that back when CSS first came around, web pages were generally pure HTML containing mostly text, so a lot of CSS was designed around concepts such as paragraphs and simple images.

In this post we’ll focus on the CSS modules that affect the layout of the DOM elements, in particular the Visual formatting model. According to the spec, these are factors that influence the layout of a box:

* Box dimensions (height and width properties)
* Type of box (display property)
* Positional schemes (position, left and top properties)
* Relationship between elements (hierarchy of DOM tree)
* External information (e.g. window size)

We’ll briefly cover some of these properties and then play with a few examples.

The display property

There are many possible values for the display property. The main ones are none, block, inline and inline-block. I’ve been using flex increasingly but that deserves a post in itself (this article is a great reference).

display: none removes the element from the layout calculation, so it's effectively invisible.

A value of block causes the element to be visually formatted as a block [5] (other values like list-item and table do too). In general a block is a box that starts on a new line and takes the entire width of the parent.

An inline box on the other hand starts from the left of the previous box. It also ignores explicit width and height values and any explicit vertical spacing (i.e. top/bottom of margin/padding).

The main difference between inline and inline-block is that the latter does account for width and height, and for vertical spacing [6].

The position property

There are 5 possible values for the position property: static, relative, absolute, fixed, sticky.

static is the default positioning scheme and it follows the normal positioning flow of the page.

A relatively positioned element accounts for the top and left properties. These offsets are relative to the element's own normal (static) position.

An absolutely positioned element is similar to a relative one, except that it is removed from the normal layout flow (i.e. other elements ignore its existence when being positioned) and its top and left are relative to the first positioned ancestor in the DOM tree (or the document if there is none). A positioned element is any element whose position is not static.

Here is an example where we only change the inner (red) div to static, relative and absolute, respectively.

[Figure: the same markup with the inner (red) div set to static, relative and absolute, respectively]

In the first example, the inner div ignores the top/left properties. In the third example it "crosses" the boundary of the middle (blue) box because the blue box is not positioned.

Note that in these examples we have set overflow to auto. This is a hack to prevent margin collapsing.

An element with position: fixed is similar to position: absolute, except that instead of having its offset relative to a positioned ancestor, it is usually relative to the viewport. That implies that such an element does not scroll with the page. The exception happens when one of its ancestors has transform, perspective, or filter set to something other than none, in which case it behaves much like an absolutely positioned element.

Finally, as described in [2] an element with position:sticky is treated as relatively positioned until it crosses a specified threshold, at which point it is treated as fixed until it reaches the boundary of its parent.

The float property

When an element is floated, it is taken out of the normal layout flow of the document. It is shifted to the left (assuming a float: left) until it touches the edge of its containing box, or another floated element.

Because float implies the use of the block layout, it modifies the computed value of most of the display values to block.

Floats are better understood in relationship with other types of boxes. Let’s check some examples.

Float + Block

Whether the renderer accounts for a floated element when positioning a new block depends on the block formatting context. For example:

[Figure: a floated (blue) box, a later block (green) that ignores it, and the green block's child (yellow) that does not]

The green block ignores the presence of the blue block but its child (yellow) does not. That’s because blue and yellow are in different block formatting contexts.

Float + Inline

Analogous to a block context, an inline formatting context is a set of inline elements. In such context, the boxes are laid out horizontally. Each “row” of inline elements is called a line box. In the presence of floats, the spec states the following:

In general, the left edge of a line box touches the left edge of its containing block and the right edge touches the right edge of its containing block. However, floating boxes may come between the containing block edge and the line box edge. Thus, although line boxes in the same inline formatting context generally have the same width (that of the containing block), they may vary in width if available horizontal space is reduced due to floats.

We can see an example of that below. In this example the first three line boxes have a shorter width than the fourth one due to the presence of the floated element.

[Figure: line boxes shortened by a floated element; the first three line boxes are narrower than the fourth]

Clear

When applied to a floating element, clear moves the margin edge of the element below the margin edge of all relevant floats (depending on whether it’s clear left, right or both). Here’s an example:

[Figure: an element with clear placed below the preceding floats]

The z-index property

Besides determining the size and position of boxes, the layout engine needs to determine how to handle overlapping. The boxes ordering is transitive, meaning that if a box A is under B, and B is under C, A has to be under C.

The main attribute to control the stacking order of elements is the z-index property (in reference to the z-axis, commonly used as the "depth" dimension in 3D). But this number only applies to boxes within the same stacking context. As we'll see now, it's more complicated than it seems.

First, we need to define the concept of a stacking context. A stacking context encompasses a set of DOM elements which can be compared to each other. The relative order of stacking contexts always takes precedence over the individual order of elements within them. For example, imagine that we have 2 stacking contexts:

Stack context 1: [A -> B -> C -> D]
Stack context 2: [E -> F -> G]
Stack context 1 -> Stack context 2

The arrow (->) represents that the element on the left is on top of the element on the right. Because A, B, C and D belong to context 1, all of them are placed over E, F and G, no matter how big the z-index of elements in the second group are.

Another way to see it is that a stacking context defines an “atomic position”, meaning that elements from outside it cannot be placed in between its elements. It has to be either above or below.

In the DOM tree, if a node E satisfies some conditions, it starts a new stack context, which means that all elements in the DOM subtree of that element will be under that stack context (we say E is the root of such stacking context). Note that a stack context can contain other stack contexts. A few properties that cause a stacking context to be formed are:

* position: absolute or relative and z-index other than auto
* position: fixed or sticky
* opacity less than 1.
* others

Note that z-index has no effect on position: static elements, so a corollary is that whenever z-index takes effect (on a positioned element, with a value other than auto), a new stacking context is created.

Within a stacking context, the order of elements is defined by the spec. Here’s a simplified version, from bottom to top:

* The background and borders of the element forming the stacking context;
* Stacking contexts with negative z-indexes;
* Non-positioned block boxes;
* Non-positioned floats;
* Non-positioned inlined boxes;
* Stacking contexts with positive z-indexes;

Another corollary is that a parent is always rendered below its children due to the first rule above.

Here is an example with elements of each of these categories.

[Figure: an example with elements from each of these categories]


Here is an interesting example with stacking contexts.

[Figure: a stacking context example with red, green, purple, black and pink boxes]

The red box is the parent, so it has to be under the green box. The green, purple and black boxes are in the same stacking-context so they are ordered based on the z-indexes (since red doesn’t start a stack context, green belongs to the top-level stack context). Finally, note how pink has the highest z-index, but is still placed under green because it belongs to the stack context of purple, which is placed under green.

References

[1] MDN Web Docs: CSS display
[2] MDN Web Docs: CSS Position
[3] MDN Web Docs: CSS Float
[4] A Complete Guide to Flexbox
[5] MDN Web Docs: Block formatting context
[6] StackOverflow: display: inline vs inline-block
[7] All about floats
[8] CSS 2.1 Specification: Visual formatting model

Conclusion

My main goal with this study was to learn more about the layout algorithm used by rendering engines, but it turned out to be an analysis of some CSS properties. It was an interesting exercise anyway. I work with CSS on a daily basis, but oftentimes I don't take the time to read the spec or understand why a given set of properties behaves in a certain way.

2017 in Review

This is a meta-post to review what happened in 2017.

I focused on finishing reading Purely Functional Data Structures, by Chris Okasaki, which was one of my goals from last year, so it’s no surprise that more than half of my posts were notes on it. In the process, I learned more about OCaml’s features. Speaking of OCaml, I used it to answer a question on subsequences.

I also completed a course on Scala, which was another of my 2017’s resolutions.

I finally took some time to play with my Raspberry Pi, to develop a web server and a simple monitoring system.

Other things included learning more about how JavaScript works and about Google's Spanner database. I also attended a conference, the OpenVis Conf.

Personal

The end of the year is a good time to look back and remember all the things I’ve done besides work and the technical blog.

Trips

As in 2016, 2017 was very busy with trips.

I visited quite a few cities in the Eastern US this year. This included Washington DC, Baltimore, Boston and Chicago. I must say they have better museums than the West coast. My favorite ones were the Smithsonian and the Boston MFA.


Top: Lincoln Memorial in Washington DC, Downtown Boston. Bottom: Marina City Towers in Chicago, MIT in Cambridge.

I love National Parks and was glad to be able to visit a few of them. This included trips to Sequoia and Kings Canyon National Parks, here in California. In Seattle, I made a quick trip to the Olympic National Park. And more recently, Carlsbad Caverns and Big Bend (New Mexico and Texas).


Top: Trail in Kings Canyon National Park; Cape Flattery, near Olympic National Park.
Bottom: Santa Elena Canyon in Big Bend National Park; White Sands National Monument

For a long time I wished to visit Japan and I was finally able to make it. I spent 2 weeks there, exploring the main cities like Tokyo, Kyoto, Nara, Osaka, Komatsu, Hiroshima, Himeji.


Top: 1. Tokyo, 2. Temple in Nara, 3. Fushimi Inari-taisha.
Bottom: 4. Itsukushima Shrine, 5. Himeji Castle, 6. Mount Hakusan National Park

Books

I read a lot of fiction this year.

War and Peace took me a long time to get through, but it was a very rewarding read. It's a mix of history, philosophy and fiction, set during the 1812 invasion of Russia by Napoleon. Slaughterhouse-Five is also about war, this time WWII, with a dark-humor take.

If on a winter’s night a traveler by Italo Calvino is sort of a meta-book, full of self-references. Pretty interesting read.

Hard-Boiled Wonderland and the End of the World, Strange Weather in Tokyo and Snow Country all have their background in Japan. I enjoyed the idea of reading books related to a country before visiting it.

I’m usually not into science-fiction but I enjoyed reading Red Mars.


For non-fiction, Survival of the Sickest was my favorite science book of the year. Homo Deus was also very interesting. It's from Yuval Harari, the author of Sapiens. Even though it's about the future of humanity, I liked his discussions of the present better – especially about humanism.

The Blog in 2017

Like last year, the most popular post was the Introduction to the Parsec Library, with 1.6k visits. From this year, the notes on Google’s Spanner database paper got 250 views. Overall it had 9.5k visitors.

I kept the resolution to post at least once a month. The blog completed 5 years with 66 posts.

Resolutions for 2018

For this year, I plan to learn Rust. I haven’t decided which technical book to read yet. I also want to learn mobile development and read more papers.

Generalized Tries in OCaml

In Chapter 10 of Purely Functional Data Structures, Okasaki describes a generalized trie. In this post we’ll see how tries can be seen as maps over a specific type of keys and how we can construct maps over any type of keys.

We discussed tries in OCaml in a previous post. At first glance, it looks like a special data structure but we’ll now see it’s basically a map.

The basic form of a map is a pair of (key, value). In most programming languages the key has to be a scalar (e.g. int, string) or some complex type that can be converted to one (think of hashCode() in Java).

For simplicity, let’s assume our scalar is an int and our value is of generic type T. Also, note that the syntax we present below is not valid OCaml (I find OCaml’s type syntax a bit hard to read). A map with int keys can be represented as:

map<int, T>

If we want a two dimensional key, for example (int, int), we can have a map of maps:

map<int, map<int, T>>

and if the dimensions are not fixed, for example, a list of integers, then we can define a recursive structure for our map

map<list<int>, T> =
  Node of (option<T> * map<int, map<list<int>, T>>)

If we think of strings as lists of characters, a trie is then a map where the key is a list of characters. From the type above we can simply change the key to a list of characters; for a basic trie, T is a boolean:

map<string, bool> =
  Node of (bool * map<char, map<string, bool>>)

Note that option<bool> is redundant, so we can use bool only.

Maps over trees


We can now generalize the key to more complex recursive types such as trees. Suppose we have the following type for a tree:

tree<int> = (int * tree<int> * tree<int>)

The outermost map is indexed by the root of the tree. The inner maps are indexed by the left and right subtrees respectively:

map<tree<int>, T> =
  map<int, map<tree<int>, map<tree<int>, T>>>

If we expand the key of the second map we get the following:

map<tree<int>, map<tree<int>, T>> =
  map<int, map<tree<int>, map<tree<int>, map<tree<int>, T>>>>

It gets pretty involved very quickly but because we traverse these types recursively, the implementation is still simple. Let’s see how to implement these in OCaml. The type of a map over an int tree can be defined as follows:
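A sketch of what this type can look like (names are my own; the author's actual code differs in details):

module IntMap = Map.Make (Int)

(* the key type: a binary tree of ints *)
type tree = Leaf | Node of int * tree * tree

(* a map from tree-shaped keys to values of type 'a: the option holds the
   value for the Leaf key; the IntMap is keyed by the root element, its
   values are maps keyed by the left subtree, whose values are in turn maps
   keyed by the right subtree *)
type 'a trie = Trie of 'a option * 'a trie trie IntMap.t

let empty : 'a trie = Trie (None, IntMap.empty)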

Note that the outermost map is a regular IntMap, which uses the root element of the tree, a scalar, as its key.

The search function takes a tree representing the key, and the map. The base case is when the tree is empty, when we’re just past a leaf. The recursive case consists in obtaining the maps in order, first using the root element, then using the left tree and finally using the right tree:
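A sketch of the search function, using the type above:

let rec find : 'a. tree -> 'a trie -> 'a =
 fun key (Trie (value, children)) ->
  match key with
  | Leaf -> (match value with Some v -> v | None -> raise Not_found)
  | Node (x, left, right) ->
      let by_left = IntMap.find x children in  (* : 'a trie trie *)
      let by_right = find left by_left in      (* : 'a trie *)
      find right by_right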

Note that the recursive call is non-uniform, so we need explicit annotations, as we discussed previously.

The insertion is similar, except that when we fail to find a key in any of the maps, we need to first insert it with an empty element before recursing.

Because the Map.find() implementation throws exceptions when a key doesn’t exist, we can wrap the call with a try and if an exception occurs, we can insert the missing key (alternatively we could use Map.mem()).
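A sketch of insertion along those lines, reusing empty and find from the sketches above:

let rec insert : 'a. tree -> 'a -> 'a trie -> 'a trie =
 fun key v (Trie (value, children)) ->
  match key with
  | Leaf -> Trie (Some v, children)
  | Node (x, left, right) ->
      (* missing intermediate maps are created on the way down *)
      let by_left = try IntMap.find x children with Not_found -> empty in
      let by_right = try find left by_left with Not_found -> empty in
      let by_right' = insert right v by_right in
      let by_left' = insert left by_right' by_left in
      Trie (value, IntMap.add x by_left' children)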

Maps over products and sums

There are two ways a structure can branch out recursively: through combination or through alternative. For a combination, refer to our definition of Node:

Node of 'a * 'a tree * 'a tree

In theory, any combination of values for 'a, 'a tree, 'a tree is a possible value for a Node. The * in between the components in fact represents the Cartesian product operator. This is also known as a product.

For alternative, the type of the tree itself can be either Empty or Node:

type 'a tree = Empty | Node of (...)

In this case, the valid values for a tree is the sum of values of each alternative. Hence, this is also known as sum.

We can generalize maps to keys of any type by looking at the key type's definition. If it's a product, we end up with nested maps. For example, if

Tk = (Tk1 * Tk2)

then the map over Tk can be defined as

map<Tk, T> = map<Tk1, map<Tk2, T>>

In our example, this becomes the nested map 'a trie trie IntMap.t.

For sums, if the type is

Tk = Tk1 | Tk2

the map over Tk would end up as a product of maps:

map<Tk, T> = (map<Tk1, T> * map<Tk2, T>)

In our example, this becomes the product of ('a option) and ('a trie trie IntMap.t); option can be thought of as a map over a single possible key.

Conclusion

In this post we saw how a trie can be modeled as a map using the same principles as the ones we use to construct matrices, that is, two-dimensional maps (nested maps). We then generalized the string keys to trees, and implemented the insert/find functions in OCaml. I found it pretty hard to reason about these structures.

We then went a step further and saw how to construct maps based on the key structure. And we learned about product and sum of types when discussing recursive types.

Mutually Recursive Modules in OCaml

In Chapter 10 of Purely Functional Data Structures, Okasaki describes a technique called data structure bootstrapping. It’s a way to reuse existing implementation of data structures to construct (bootstrap) new ones.

In one of the examples he creates a new heap implementation with an efficient merge operation using another heap as basis, but it turns out that to implement this, we need to rely on mutually recursive modules, that is, two modules A and B, where A depends on B, and B depends on A.

In this post we’ll study the bootstrapped heap and learn how to implement mutually recursive modules in OCaml.

Heap with efficient merging

Assume we have a heap implementation with O(1) insert, and O(log n) merge, findMin and deleteMin operations. We've seen such an implementation with Skewed Binomial Heaps.

We’ll see how to construct a new heap implementation which will improve the merge complexity to O(1).

Let’s call the base heap PrimaryHeap and define our heap type as
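Something along these lines (a sketch of the shape, not the author's exact code):

type heap = Empty | Heap of Element.t * PrimaryHeap.heap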

This type can be either empty or a node with an element (the root) and a primary heap whose elements are bootstrapped heaps themselves; that is, heap and PrimaryHeap.heap form mutually recursive types. Note that the above is not valid OCaml code; we're using it to explain the theoretical concept.

We can think of this as a k-ary tree where the element is the root and the children of that node are the subtrees, but these subtrees are stored in a heap instead of an array.

The root element at each node is the smallest among all of the subtrees. Hence obtaining the minimum element of a heap (findMin) is trivial: we simply return that element.

Merging two bootstrapped heaps is analogous to linking two trees. The only invariant we need to maintain is that the smallest root continues being the root.

Since the primary heap has O(1) insert, the bootstrapped heap has O(1) merge, which was our goal. Note that we can implement insert using merge by creating a singleton node and merging it with an existing heap.

We need to handle the deletion of the minimum element, which is the more involved operation. It consists in discarding the root of the present node, and finding a new root from the primary heap.

Since each element in the primary heap is itself a bootstrapped heap, we first obtain the bootstrapped heap containing the smallest element.

Then we remove this node from primaryHeap, and we merge its own primary heap (minPrimaryHeap) back into primaryHeap.

Finally we make newMinElem, the root of the removed node, the new root element of our top-level bootstrapped heap. The complete deleteMin combines these three steps.

The only missing part is defining the correct type of the bootstrapped heap.

Mutually Recursive Modules


Drawing Hands by M. C. Escher

Okasaki mentions that recursive structures are not supported in Standard ML (at least at the time my copy of the book was printed), but they are supported in OCaml.

To make modules mutually depend on one another, we need to mark them as recursive via the rec keyword and declare both modules at the same time, using the and connector. Let's work with a toy example: two modules, Even and Odd, where each depends on the other.

This will lead to a compilation error:

Error: Recursive modules require an explicit module type.

We need to write the signatures explicitly:
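For the toy example, that looks something like this (the function bodies are my own):

module rec Even : sig
  val is_even : int -> bool
end = struct
  (* defined in terms of Odd, which is in turn defined in terms of Even *)
  let is_even n = if n = 0 then true else Odd.is_odd (n - 1)
end

and Odd : sig
  val is_odd : int -> bool
end = struct
  let is_odd n = if n = 0 then false else Even.is_even (n - 1)
end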

This blog post from Jane Street describes a way to define mutually recursive modules by only listing their signatures.

The OCaml compiler can infer the implementation part from the type definitions, but unfortunately this won’t work if the module has function definitions, which is the case for our heap implementation.

Things get more complicated in our case because the primary heap implementation uses a functor to generate a heap, taking the element's module as a parameter. In this case the element's module is our bootstrapped heap. Let's walk through a valid module definition piece by piece.

type t = Empty | Heap of Element.t * PrimaryHeap.heap

is the definition we presented above. We also implement the methods from the Set.OrderedType interface, namely compare, since this is the interface the heap maker expects. The comparison is based solely on the root element.

Then we declare the PrimaryHeap type at the same time, with type IHeapWithMerge, and because tv is unbound in that interface, we need to bind it to BootstrappedElement.t:

PrimaryHeap: IHeapWithMerge with type tv := BootstrappedElement.t

Finally we provide the implementation, using the result of the SkewBinomialHeap() functor having the BootstrappedElement module as element type:

PrimaryHeap (...) = SkewBinomialHeap(BootstrappedElement)

The syntax is pretty involved, but it accomplishes what we wanted. We can further refine this definition by adding

include Set.OrderedType with type t := t

to the BootstrappedElement signature. This includes the entire interface of Set.OrderedType.

These newly defined modules are defined within a functor, the BootstrappedHeap, together with the methods we defined above. Like other heap generators, the functor takes a module representing the element type as parameter. In this case we can also allow the primary heap type to be passed as parameter so we don’t have to use SkewBinomialHeap as implementation. Any heap with merge will do.

The constructors defined within BootstrappedElement are visible within BootstrappedHeap, but they need qualification, such as BootstrappedElement.Heap. To avoid repeating this qualifier, we can use:

include BootstrappedElement

The complete implementation for BootstrappedHeap can be found on github.

Conclusion

The idea of using existing implementations of a data structure to yield improved implementations is amazing! The mutually recursive nature of the bootstrapped heap got me at first, but making analogies with a k-ary tree made it easier to understand.

I was struggling a lot to get the syntax right for the recursive modules required for this implementation until I stumbled upon this github repository, from which I learned many new things about OCaml.

References

[1] Purely Functional Data Structures, Chapter 10 – Chris Okasaki
[2] Jane Street Tech Blog – A trick: recursive modules from recursive signatures
[3] Github yuga/readpfds: bootStrappedHeap.ml

Polymorphic Recursion in OCaml

In Chapter 10 of Purely Functional Data Structures, Okasaki describes recursive types that are non-uniform. In this post we’ll learn more about these types, how to implement them in OCaml and see an example by studying the implementation of Random Access Binary Lists using such a construct.

Uniform recursive type

As an example of a uniform recursive data structure we have a simple list:

Cons(1, Cons(2, Cons(3, Nil)))

Which has a recursive type, for example

type 'a myList = Nil | Cons of 'a * 'a myList

Each element of the list is either Nil (terminal) or it has a value of a polymorphic type 'a, followed by a recursive list also of type 'a.

Non-uniform recursive type

Now, what if the type of the nested list is not the same as that of the current list? Then we have a non-uniform polymorphic recursive type, for example:

type 'a seq = Nil | Cons of 'a * ('a * 'a) seq

We'll name this a sequence. In an int seq, the value in the first node has type int, the element in the second node has type (int, int), the third ((int, int), (int, int)) and so on. This structure is equivalent to a complete binary tree, where the i-th element of the seq represents the i-th level of the tree.

An example of value with this type is:

Cons (1, Cons ((2, 3), Cons (((4, 5), (6, 7)), Nil)))

We need a special syntax to type recursive functions that take non-uniform recursive types, because the recursive call might be made at a different instantiation of the polymorphic type than the caller's. By default OCaml tries to infer the generic types of the function and bind them to specific instances [2]. For example, in

let f: 'a list -> 'a list = fun x -> 13 :: x

OCaml will assume 'a is int and will compile fine. We can see this by pasting that code in the command line, utop.

utop # let f: 'a list -> 'a list = fun x -> 13 :: x;;
val f : int list -> int list = <fun>

The function will then not be polymorphic anymore. To prevent OCaml from auto-binding specific type instances, we can use a special syntax introduced in OCaml 3.12 [3]

utop # let f3: 'a. 'a list -> 'a list = fun x -> 13 :: x;;

This time we’ll get a compilation error:

Error: This definition has type int list -> int list which is less general than 'a. 'a list -> 'a list

The important thing is that this allows us to bind the recursive calls to different types. According to the Jane Street Tech Blog [3]:

Note that a polymorphic type annotation holds inside the body of a recursive definition as well as outside, allowing what is known as polymorphic recursion, where a recursive call is made at some non-trivial instantiation of the polymorphic type.

So for example, we can write this function to calculate the size of a sequence:
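A sketch of such a function, using the 'a seq type above (the 'a. annotation is what allows the recursive call at type ('a * 'a)):

let rec size : 'a. 'a seq -> int = function
  | Nil -> 0
  | Cons (_, rest) -> 1 + 2 * size rest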

The problem with this structure is that it can only represent lists of size in the form of 2^k - 1. To work around that, we allow some items to not hold any elements at all, so that each item corresponds to a digit in the binary representation of the size of the list.

type 'a seq = Nil | Zero of ('a * 'a) seq | One of 'a * ('a * 'a) seq

For example, we can now represent a list of 10 elements, as

Zero (One ((1, 2), Zero (One ((((3, 4), (5, 6)), ((7, 8), (9, 10))), Nil))))


Sequence binary random access list

We can use a sequence to implement a binary random access list in a concise way.

Insertion

Inserting an element at the beginning is analogous to incrementing the binary number, that is, starting from the least significant digit, if it’s a zero, we make it one, if it’s one we make it a 0, and carry over a 1, by adding it to the next digit.

The carry over process is simple in this structure. Because the type of an item following an item of type 'a is ('a, 'a), to merge the element to be added with the existing element, we simply make a tuple and pass it to the recursive call.
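A sketch of this insertion, for the Zero/One sequence type above:

let rec cons : 'a. 'a -> 'a seq -> 'a seq =
 fun x s ->
  match s with
  | Nil -> One (x, Nil)
  | Zero rest -> One (x, rest)
  (* carry over: pair the new element with the existing one *)
  | One (y, rest) -> Zero (cons (x, y) rest)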

Head and Tail

Removing or retrieving the first element is analogous to decrementing a binary number. If the digit is a one, we make it zero and return the element and the resulting list. If it's a zero, we make a recursive call to get the next available element. However, since the returned element is of type ('a, 'a) and our return type is 'a, we return the first value of the pair and put the second one back into the list as a One.

Implementing head and tail using popAux is now trivial
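A sketch of popAux (here pop_aux) together with head and tail, reusing the type above:

exception Empty

let rec pop_aux : 'a. 'a seq -> 'a * 'a seq = function
  | Nil -> raise Empty
  | One (x, Nil) -> (x, Nil)  (* avoid leaving a trailing Zero *)
  | One (x, rest) -> (x, Zero rest)
  | Zero rest ->
      (* pop a pair one level down; keep the first element, put the second back *)
      let (x, y), rest' = pop_aux rest in
      (x, One (y, rest'))

let head s = fst (pop_aux s)
let tail s = snd (pop_aux s)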

Lookup

Finding an element can be done by transforming the problem into smaller instances.

It helps to look at some simple examples. Let’s consider 3 cases.

Case 1. If we had a single element, we either return it if the index is 0, or throw if it’s larger than that.

0: (0)

Case 2. If we have 2 elements,

0: ()
1: (0, 1)

Notice that when we go from the first level to the second, all items “doubled” in size, so we can “transform” this to the single element case by treating pairs as a single element, but since the index has individual elements granularity, we need to transform it by halving it. We reduced it to Case 1.

If our initial index was either 0 or 1, it’s now 0, and we found our element in level 1.


1: (0)

The problem is that we need to return a single element at level 0, not a pair, so we need to undo the transformation. We can use the parity of the original index to decide which side of the pair to return. If it's even, we return the first element, otherwise the second one.

Case 3. If we have 3 elements,
0: (0)
1: (1, 2)

and our index is larger than 0, we move to the next level but we need to account for the level we’re skipping, so the transformation of index would look like:
0: ()
1: (0)

which is reduced to Case 2.

These 3 cases can be used to find elements larger than 3. For example, say we have 10 elements and want to find the element at position 6:

0: ()
1: (0, 1)
2: ()
3: (((2, 3), (4, 5)), ((6, 7), (8, 9)))

Step 1. This is Case 2. We need to transform this by treating pairs as elements and halving the index:

1': (0)
2': ()
3': ((1, 2), (3, 4))

Note how this reduced the problem to finding the element at position 3 of a list of size 5. Step 2. We are now in Case 3, where we need to skip the current element:

1': ()
2': ()
3': (((0), (1)), ((2), (3)))

Our index is now 2. Step 3. we go one level down

2: ()
3: (0, 1)

With an index of 1. Step 4. Finally, we halve it once again and find the right level that contains our index.

3: (0)

We now need to recurse back to find exactly which element on that level to pick. On Step 4, we can see our index 1 was on the right side of the pair in level 3, so we pick the right side, that is, ((6, 7), (8, 9)).

On Step 3, our index 2 was on the left side of the innermost pair, that is (6, 7). On Step 2, we skipped the current element but didn't change levels, so there's no need to choose an element from the pair. Finally, on Step 1, the index 6 was on the left side of the innermost pair, which should return the element with a corresponding index 6.

In general, we can tell which side of the innermost pair to pick by observing that the indexes are ordered from left to right in a given level. And because every level has an even number of elements, we can assume that the first index in the level – or the first element in the first pair – is even. Hence the parity of the index is sufficient to determine which side of the pair to pick.

With this algorithm in mind, the lookup function is quite compact:
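A sketch of it, following the halving/parity argument above:

let rec lookup : 'a. int -> 'a seq -> 'a =
 fun i s ->
  match s with
  | Nil -> raise Not_found
  | One (x, rest) -> if i = 0 then x else lookup (i - 1) (Zero rest)
  | Zero rest ->
      (* find the pair at index i/2 one level down, then pick a side by parity *)
      let x, y = lookup (i / 2) rest in
      if i mod 2 = 0 then x else y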

The update follows a similar idea as the lookup, but the problem is that we need to return the updated level when returning from the recursion. That is, we need to update the level before returning.

To accomplish that, we can pass a callback, the updater, that encodes which pair we would pick at each level. We start with a function that simply returns the element to be updated

(fun _ -> element)

Then, at each level we create a new updater, which applies the previous updater on the left or right side of the pair, depending on the parity of the index:

When we finally find the level that has our index, we can apply the function, which has the effect of “narrowing” down the elements from a given level to a single element, replacing the value at the target index and then returning the updated elements when returning from the recursion.

After applying the updater, we return the updated level recursively.
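A sketch of update along those lines (fupdate applies the updater at index i; it reuses cons and the seq type from above):

let rec fupdate : 'a. ('a -> 'a) -> int -> 'a seq -> 'a seq =
 fun f i s ->
  match s with
  | Nil -> raise Not_found
  | One (x, rest) ->
      if i = 0 then One (f x, rest)
      else cons x (fupdate f (i - 1) (Zero rest))
  | Zero rest ->
      (* narrow the updater to one side of the pair based on parity *)
      let f' (x, y) = if i mod 2 = 0 then (f x, y) else (x, f y) in
      Zero (fupdate f' (i / 2) rest)

let update i element s = fupdate (fun _ -> element) i s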

Structural Decomposition

Okasaki introduces this implementation in the context of Structural Decomposition, which is a technique for creating data structures from incomplete ones. In this example, the raw implementation of the sequence can only represent lists of size 2^k - 1, but by modeling each node in the sequence as either a Zero or a One, where a Zero holds no elements, we can work around the size restriction.

Conclusion

The implementation of a binary random access list using sequences is very neat, but also quite hard to understand.

One of the major selling points of shorter code is that it tends to contain less bugs and also less corner cases. On the other hand, if the code is also harder to read and understand, it might be harder to spot bugs.

This post helped me understand a bit more about OCaml’s type system. Digging a little also led me to interesting topics such as Parametric Polymorphism [4] and Existential vs. Universally quantified types [5].

References

[1] Purely Functional Data Structures – Chris Okasaki
[2] Jane Street – Ensuring that a function is polymorphic
[3] Ensuring that a function is polymorphic in OCaml 3.12
[4] Wikipedia – Parametric polymorphism
[5] Existential vs. Universally quantified types in Haskell

Numerical Representations as inspiration for Data Structures


In this chapter of Purely Functional Data Structures, Okasaki describes a technique for developing efficient data structures through analogies with numerical representations, in particular the binary representation and its variations.

We’ve seen this pattern arise with Binomial Heaps in the past. Here the author presents the technique in its general form and applies it to another data structure, binary random access lists.

Binary Random Access Lists

These lists allows efficient insertion at/removal from the beginning, and also access and update at a particular index.

The simple version of this structure consists in distributing the elements across complete binary leaf trees. A complete binary leaf tree (CBLF) is one that stores elements only at the leaves, so a tree with height i has 2^(i+1)-1 nodes but only 2^i elements.

Consider an array of size n, and let Bn be the binary representation of n. If the i-th digit of Bn is 1, then we have a tree containing 2^i leaves. We then distribute the elements into these trees, starting with the least significant digit (i.e. the smallest tree) and traversing the tree in pre-order.

For example, an array of elements (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) has 11 elements, which is 1011 in binary. So we have one tree with a single leaf (1), a tree with 2 leaves (2, 3) and another containing 8 leaves (4, 5, 6, 7, 8, 9, 10, 11).

We use a dense representation of the binary number, by having explicit elements for the 0 digit. Here’s a possible signature for this implementation:
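For example (the operation names below are my own guesses, not necessarily the author's):

module type RandomAccessList = sig
  type 'a t
  val empty : 'a t
  val is_empty : 'a t -> bool
  val cons : 'a -> 'a t -> 'a t
  val head : 'a t -> 'a
  val tail : 'a t -> 'a t
  val lookup : int -> 'a t -> 'a
  val update : int -> 'a -> 'a t -> 'a t
end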

Inserting a new element consists in adding a new tree with a single leaf. If a tree already exists for a given size, the two trees have to be merged into a tree of the next size. Merging two CBLFs of the same size is straightforward: we just need to make them children of a new root. Since elements are stored in pre-order, the tree being inserted or coming from a carry over should be the left child.

Looping back to our example, if we want to insert the element 100, we first insert a tree with a single leaf (100). Since the least significant digit already has an element, we need to merge them into a new tree containing (100, 1) and try to insert at the next position. A conflict will arise with (2, 3), so we again merge them into (100, 1, 2, 3) and try the next position. We finally succeed in inserting at position 2, for a new list containing trees like (100, 1, 2, 3) and (4, 5, 6, 7, 8, 9, 10, 11).

The complexity of inserting an element is O(log n) in the worst case, which requires merging trees for all digits (e.g. if Bn = 111...111). Merging two trees is O(1).
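A sketch of the dense representation and of insertion (a cached size at internal nodes makes merging and searching easy; this is my own sketch, not the author's exact code):

type 'a tree = Leaf of 'a | Node of int * 'a tree * 'a tree  (* int = number of leaves *)
type 'a digit = Zero | One of 'a tree
type 'a rlist = 'a digit list  (* least significant digit first *)

let size = function Leaf _ -> 1 | Node (s, _, _) -> s

(* link two trees of the same size; the new/carried tree becomes the left child *)
let link t1 t2 = Node (size t1 + size t2, t1, t2)

let rec cons_tree t = function
  | [] -> [ One t ]
  | Zero :: ds -> One t :: ds
  | One t' :: ds -> Zero :: cons_tree (link t t') ds

let cons x rl = cons_tree (Leaf x) rl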

Removing the first element is analogous to decrementing the number, borrowing from the next digit if the current digit is 0.

Searching for an index consists in first finding the tree containing the index and then searching within the tree. More specifically, because the elements are sorted beginning from the smallest tree to the largest, we can find the right tree just by inspecting the number of elements in each tree until we find the one whose range includes the desired index. Within a tree, elements are stored in pre-order, so we can find the right index in O(height) of the tree.

After finding the right index, returning the element at that index is trivial. Updating the element at a given index requires rebuilding the tree when returning from the recursive calls.

Okasaki then proposes a few different numbering systems that allow performing insertion/removal in O(1) time. Here we'll only discuss the less obvious but more elegant one, using skew binary numbers.


Skew Binary Random Access Lists

A skew binary number representation supports the digits 0, 1 and 2.

The weight of the i-th digit is 2^(i+1) - 1. In its canonical form, it only allows the least significant non-zero digit to be 2.

Examples:

Decimal  Skew binary
1        1
2        2
3        10
4        11
5        12
6        20
7        100
8        101
9        102
10       110

Decimal and Skewed Binary

It’s possible to show this number system offers a unique representation for decimal numbers. See the Appendix for a sketch of the proof and an algorithm for converting decimals to skewed binary numbers.

Incrementing a number follows these rules:

  • If there’s a digit 2 in the number, turn it into 0 and increment the next digit. By definition that is either 0 or 1, so we can safely increment it without having to continue carrying over.
  • Otherwise the least significant digit is either 0 or 1, and it can be incremented without carry overs.

The advantage of this number system is that increments (similarly, decrements) never carry over more than once, so the complexity is O(1), as opposed to the worst-case O(log n) for regular binary numbers.

A skew binary random access list can be implemented using this idea. We use a sparse representation (that is, not including 0s). Each digit 1 at position i corresponds to a tree with 2^(i+1) - 1 elements, in this case a complete binary tree with height i+1. A digit 2 is represented by two consecutive trees of the same weight.

Adding a new element to the beginning of the list is analogous to incrementing the number, which we saw can be done in O(1). Converting a digit 0 to 1, or 1 to 2, is a matter of prepending a tree to the list. To convert a 2 to 0 and increment the next position, we merge the two trees representing it, together with the element to be inserted. Because each tree is traversed in pre-order, we make the new element the root of the merged tree.
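A sketch of this insertion (trees paired with their weights, smallest first; a repeated leading weight plays the role of a digit 2):

type 'a tree = Leaf of 'a | Node of 'a * 'a tree * 'a tree
type 'a rlist = (int * 'a tree) list

let cons x = function
  (* a "2" at the front: combine both trees, with the new element as root *)
  | (w1, t1) :: (w2, t2) :: rest when w1 = w2 ->
      (1 + w1 + w2, Node (x, t1, t2)) :: rest
  (* otherwise just prepend a singleton tree of weight 1 *)
  | ts -> (1, Leaf x) :: ts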

Elements are inserted in pre-order in each tree, so when searching for an index, we can first find the right tree by looking at the tree sizes and within a tree we can do a "binary search" in O(height) of the tree.

Binomial Heaps

In this chapter, this technique is also applied to improve the worst case runtime of insertion of binomial heaps. The implementation, named Skewed Binomial Heap, is on github.

Conclusion

This chapter demonstrated that binary representations are a useful analogy to come up with data structures and algorithms, because they’re simple. This simplicity can lead to inefficient running times, though. Representations such as skewed binary numbers can improve the worst case of some operations with the trade-off of less intuitive and more complex implementations.

Appendix A – Proof

Sketch of the proof. First, it's obvious that two different decimals cannot map to the same skewed binary representation. Otherwise the same equation with the same weights would result in different values. We need to show that two different skewed binary representations do not map to the same decimal.

Suppose they do, and let them be B1 and B2. Let k be the largest position where these numbers have a different digit. Without loss of generality, suppose that B1[k] > B2[k].

First case. Suppose that B1[k] = 1, B2[k] = 0, and B2 doesn't have any digit 2. B1 is then at least M + 2^{k+1} - 1, while B2 is at most M + \sum_{i = 1}^{k} (2^{i} - 1), which is M + 2^{k + 1} - k - 2 (M corresponds to the total weight of digits in positions > k). This implies that B2 < B1, a contradiction.

Second case. suppose that B1[k] = 1, but now B2 does have a digit 2 at position j. It has to be that j < k. Since only zeros follow it, we can write B2‘s upper bound as

M + \sum_{i = j + 1}^{k} (2^{i} - 1) + 2^{j + 1} - 1

Since 2(2^{j + 1} - 1) < 2^{j + 2} - 1, we have

\sum_{i = j + 1}^{k} (2^{i} - 1) + 2^{j + 1} - 1 < \sum_{i = j + 2}^{k} (2^{i} - 1) + 2^{j + 2} - 1

We can continue this argument until we get that B2 is less than M + 2(2^{k} - 1) which is less than M + 2^{k + 1} - 1, B1.

Third case. Finally, suppose we have B1' such that B1'[k] = 2, and B2'[k] = 1. We can subtract 2^{k+1} - 1 from both and reduce to the previous case. ▢

Appendix B – Conversion algorithm

Converting from a decimal representation to a binary one is straightforward, but it’s more involved to do so for skewed binary numbers.

Suppose we allow leading zeros and consider all the numbers with k digits. For example, if k=2, we have 00, 01, 02, 10, 11, 12 and 20. We can construct the numbers with k+1 digits by either prefixing 0 or 1, plus the additional number 2 followed by k zeros. For k=3, we have 000, 001, 002, 010, 011, 012, 020, 100, 101, 102, 110, 111, 112, 120 and finally 200.

More generally, we can see there are 2^(k+1) - 1 numbers with k digits. We can construct the numbers with k+1 digits by prefixing 0 or 1 and then adding an extra number which starts with 2 and is followed by k zeros, for a total of 2^(k+1) - 1 + 2^(k+1) - 1 + 1 = 2^(k + 2) - 1, so this invariant holds by induction on k, and we can verify it's true for the base case, since for k = 1 we enumerate 3 numbers.

This gives us a method to construct the skewed number representation if we know the number of its digits, say k. If the number is among the first 2^k - 1 numbers, that is, between 0 and 2^k - 2, we know it starts with a 0. If it's among the next 2^k - 1, that is, between 2^k - 1 and 2^(k+1) - 3, we know it starts with a 1. If it's the next one, exactly 2^(k+1) - 2, we know it starts with a 2.

We can continue recursively by subtracting the corresponding weight for this digit until k = 0. We can find out how many digits a number n has (if we're to exclude leading zeros) by finding the smallest k such that 2^(k+1)-1 is greater than n. For example, for 8, the smallest k is 3, since 2^(k+1)-1 = 15, while for k = 2 we'd only get 7.

These ideas give an O(log n) algorithm for finding the skewed binary number representation, which I implemented in Python.

One might ask: why not OCaml? My excuse is that I already have a number theory repository in Python, so it seemed like a nice addition. Converting this to functional code, in particular OCaml, is easy; a rough sketch follows.
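As an OCaml sketch of the same idea (my own translation, not the original code):

let to_skew n =
  (* number of digits: the smallest k such that 2^(k+1) - 1 > n *)
  let rec num_digits k = if (1 lsl (k + 1)) - 1 > n then k else num_digits (k + 1) in
  let rec go k n acc =
    if k = 0 then List.rev acc
    else
      let w = (1 lsl k) - 1 in  (* weight of the most significant remaining digit *)
      let d = if n >= 2 * w then 2 else if n >= w then 1 else 0 in
      go (k - 1) (n - d * w) (d :: acc)
  in
  (* digits returned most significant first; [] represents 0 *)
  go (num_digits 0) n []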

This algorithm requires an additional O(log n) memory, while the conversion to a binary number can be done with constant extra memory. My intuition is that this is possible because the weights for binary numbers are all powers of the same base (2), unlike the skewed numbers' weights. Is it possible to work around this?