Async Functions in JavaScript

Async functions are a JavaScript construct introduced in the ES2017 spec. Put simply, async functions allow writing asynchronous code using synchronous-looking syntax.

In this post we'll discuss async functions in JavaScript, covering related concepts such as iterators and generators, which can be used to implement async functions when they're not supported by the runtime.

A Glimpse

Before we start delving into the details, let's get a sense of why async functions are useful, especially in the context of Promises.

In case you don’t know about Promises in JavaScript or need a refresher, I recommend checking that out first. If you’d like, we talked about Promises in a previous post which might serve as an introduction.

As we've learned, they're very handy for reducing the so-called "callback hell". Here's a contrived example prior to Promises:
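The original snippet isn't preserved here, so the following is a sketch of what such nested-callback code looks like; f1, f2 and f3 are hypothetical asynchronous functions that accept callbacks:

```js
f1((result1) => {
  f2(result1, (result2) => {
    f3(result2, (result3) => {
      console.log(result3);
    });
  });
});
```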

We saw that with Promises we can simplify this to:
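Again as a sketch, assuming f1, f2 and f3 now return Promises:

```js
f1()
  .then((result1) => f2(result1))
  .then((result2) => f3(result2))
  .then((result3) => console.log(result3));
```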

Which is much cleaner. Now, the async/await syntax allows even cleaner code by making it look synchronous:
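The same sketch written with async/await:

```js
async function run() {
  const result1 = await f1();
  const result2 = await f2(result1);
  const result3 = await f3(result2);
  console.log(result3);
}
```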

We can now proceed to some details on how async functions work. To start, let's learn about two intermediate concepts, namely iterators and generators.

JavaScript Iterators

An iterator is basically an object with a next() method, which returns an object with fields value and done, the latter being a boolean indicating whether the iteration has finished. An iterator is more like a design or code pattern: it's not explicitly supported by the JavaScript runtime in any special way.

An iterable, on the other hand, is a contract indicating that an object can be iterated over. To indicate this we add a special property keyed by Symbol.iterator, which maps to a function returning an iterator (similar to an interface in an OOP language) – and this construct is handled as a special case by the runtime.

In the example below we create an example iterator and use it with the for-of construct:
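The original example isn't preserved here; below is a sketch of an iterable counter that keeps its state in this.cnt, consistent with the discussion that follows:

```js
const counter = {
  cnt: 0,
  next() {
    this.cnt++;
    if (this.cnt > 3) {
      return { value: undefined, done: true };
    }
    return { value: this.cnt, done: false };
  },
  // implementing the iterable contract: for-of will use this method
  [Symbol.iterator]() {
    return this;
  },
};

for (const value of counter) {
  console.log(value); // 1, 2, 3
}
```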

JavaScript Generators

Generators are syntax sugar for iterators in that they allow us to avoid keeping track of state ourselves (in the example above, via this.cnt). They do so by allowing the function to yield execution back to the caller and resume from where it stopped when it's called again. Behind the scenes, a generator creates an object with the same structure as the iterator object we defined above, namely one with a next() method. It's much clearer with an example:
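A sketch of the counter above rewritten as a generator:

```js
function* counterGenerator() {
  let cnt = 0;
  while (cnt < 3) {
    cnt++;
    yield cnt; // pauses here; resumes on the next call to next()
  }
}

const gen = counterGenerator();
console.log(gen.next()); // { value: 1, done: false }
console.log(gen.next()); // { value: 2, done: false }
console.log(gen.next()); // { value: 3, done: false }
console.log(gen.next()); // { value: undefined, done: true }
```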

First, we indicate the function is a generator with the * modifier (i.e. function*). Here we don’t have to explicitly define the next() function and we don’t need to keep track of the variable cnt outside of the function – it will be resumed from the state it had when we called yield.

As with iterators, we can make generators iterable by implementing the contract. In this case we create an object with a special property keyed by Symbol.iterator which maps to the generator function:
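For example (a sketch, not the original snippet):

```js
const iterableCounter = {
  [Symbol.iterator]: function* () {
    yield 1;
    yield 2;
    yield 3;
  },
};

for (const value of iterableCounter) {
  console.log(value); // 1, 2, 3
}
```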

Async Functions <> Promises

We’re now ready to come back to async functions. We can think of async functions as syntax sugar for Promises. Suppose a function f() exists that returns a Promise. If we want to use the result of that Promise and return a new one, we could do, for example:
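A sketch of this; the body of f() is just a stand-in, the point being that both functions return Promises:

```js
function f() {
  return Promise.resolve(41);
}

function g() {
  // uses the result of f()'s Promise and returns a new Promise
  return f().then((result) => result + 1);
}

g().then((value) => console.log(value)); // 42
```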

Instead, we could replace g() with an async function, which “understands” Promises and returns them, making it possible to easily mix with Promise code. The code above would look like:
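The async version might look like:

```js
async function g() {
  const result = await f();
  return result + 1; // implicitly wrapped in a Promise
}

g().then((value) => console.log(value)); // callers still receive a Promise: 42
```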

Note how we swapped a Promise-based implementation for an async one without requiring any changes to the callers, which continue to receive Promises.

Handling errors. Async functions have a familiar syntax for error handling too. Suppose our function f() rejects with some probability:
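A sketch of such an f(), together with Promise-style error handling via catch():

```js
function f() {
  return new Promise((resolve, reject) => {
    if (Math.random() < 0.5) {
      reject(new Error("something went wrong"));
    } else {
      resolve(42);
    }
  });
}

function g() {
  return f()
    .then((result) => result + 1)
    .catch((error) => console.error(error.message));
}
```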

If we replace g() with an async version, we can use the familiar try/catch syntax:
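A sketch of the async equivalent:

```js
async function g() {
  try {
    const result = await f();
    return result + 1;
  } catch (error) {
    console.error(error.message);
  }
}
```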

Async Functions as Generators

As of this writing most major browsers support async functions in their latest versions, with the exception of Internet Explorer. For a while though, if developers wanted to use async functions they needed to rely on transpilation (i.e. translating their async-based code into browser-compatible code). One of the most popular tools for this is Babel, which transpiles code using async functions into code using generators and some helpers.

We can study that generated code to learn how to implement async-like functions using generators. Consider this simple example chaining two Promises using an async function:
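A sketch of such a function, assuming f1() and f2() return Promises:

```js
const g = async function () {
  const result1 = await f1();
  const result2 = await f2(result1);
  return result2;
};
```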

If we translate it using Babel we get some generated code. I removed parts dealing with error handling and inlined some definitions to make it easier to read. Here’s the result:
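The original transpiled output isn't reproduced here; the snippet below is a hand-written approximation of its overall shape (error handling omitted), not Babel's actual output:

```js
function _asyncToGenerator(fn) {
  return function (...args) {
    const gen = fn.apply(this, args);
    return new Promise((resolve) => {
      function step(arg) {
        const { value, done } = gen.next(arg);
        if (done) {
          resolve(value);
          return;
        }
        // `value` is the Promise that was awaited (now yielded): chain it
        // and resume the generator with its result.
        Promise.resolve(value).then(step);
      }
      step(undefined);
    });
  };
}

const g = _asyncToGenerator(function* () {
  const result1 = yield f1(); // was: await f1()
  const result2 = yield f2(result1); // was: await f2(result1)
  return result2;
});
```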

Let’s see what is happening here. First, we note that our async function got translated into a generator, basically replacing the await with yield. Then it’s transformed somehow via the _asyncToGenerator() function.

In _asyncToGenerator() we’re basically invoking the generator recursively (via gen.next()) and at each level we chain the Promise returned by a yield call with the result of the recursion. Finally we wrap it in a Promise which is what the async function does implicitly.

Intuition. Let's try to gain a high-level intuition of what's happening here. The ability to resume execution of a function at particular points (via yield in this case) is what enables us to avoid passing callbacks everywhere. Part of why we needed to pass callbacks is that we had to carry the "code to run next" around, but having the runtime keep track of where to resume solves this problem. For example, in a Promise world, code 1 and code 2 are wrapped in arrow functions:
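For example (a sketch):

```js
f1().then((result1) => {
  // code 1
  return f2(result1);
}).then((result2) => {
  // code 2
});
```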

In a world where we can remember where we were when an async execution happened, we can in-line the code:
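That is, inside an async function:

```js
const result1 = await f1();
// code 1
const result2 = await f2(result1);
// code 2
```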

This translation relies on generators being fully supported by the runtime. In a world where generators didn't exist as first-class citizens, how could we implement them via helpers and transpilation? We could probably use some sort of iterators and switch statements to simulate resuming execution at specific points in the code, but this is out of scope for this post and left as food for thought.

Conclusion

In this post we learned about some language features that help with code authoring and readability, namely generators and async functions. These are very useful abstractions that have ended up being added to programming languages such as Python, C#, and Hack.


Von Neumann Architecture

John von Neumann was a Hungarian-American mathematician, physicist, computer scientist and polymath, often regarded as the greatest mathematician of his time. He has contributed to a wide range of fields including quantum mechanics, geometry, topology, game theory, cellular automata, linear programming and computer architecture.

In this post we’ll discuss his contribution to the architecture of modern computers, known as von Neumann architecture (aka Princeton architecture).

Historical Background

Von Neumann was working on the Manhattan Project, which required a lot of computation (in particular to solve differential equations). He got involved in the design of the EDVAC computer together with J. Presper Eckert and John Mauchly, and together they wrote a document titled First Draft of a Report on the EDVAC [1]. Unfortunately, the report circulated with only von Neumann's name on it, and the architecture based on the report ended up carrying only his name [2].

Furthermore, around the same time Alan Turing, who proposed the concept of stored-programs in the form of theoretical Universal Turing Machines (in the paper On Computable Numbers, with an Application to the Entscheidungsproblem), also wrote a paper Proposed Electronic Calculator, discussing the practical aspects of constructing such machines.

These independent approaches led to a debate on whether stored-program machines should actually be referred to as von Neumann machines.

Overview


von Neumann architecture diagram (source: Wikipedia)

The architecture consists of 5 specific parts [1]:

  • (i) Central Arithmetic part (CA): an arithmetic logic unit (circuit capable of performing elementary arithmetic and bitwise operations) and a set of registers (small fast-access memory).
  • (ii) Central Control (CC): a general purpose unit to carry out the execution of the instructions, to be stored elsewhere.
  • (iii) Memory (M):  to store data during the program’s execution and also to store the program’s instructions.

The draft also specifies that there must be a way to connect these three parts. The analogy it makes to the human brain is interesting:

The three specific parts CA, CC (together C) and M correspond to the associative neurons in the human nervous system. It remains to discuss the equivalents of the sensory or afferent and the motor or efferent neurons.

The external world is represented by the external medium, called R (which is not considered a part of the machine).

  • (iv) Input mechanism (I): a way to transfer information from R to C (CA + CC) and M.
  • (v) Output mechanism (O): a way to transfer information from C and M to R.

The authors in [1] also pose an interesting question of whether information should be stored in M or R, with R supposedly representing some sort of external memory. It resembles the more modern distinction between volatile memory (RAM) and persistent memory (disk).

Modifications

One bottleneck of the original von Neumann machines is that both data and instructions go through the same bus. This imposes a potential limit on speed because data and instructions cannot be read in parallel.

The Harvard architecture doesn’t have this issue by separating the memory (or at least having separate channels of communication with the central unit) and was implemented in the Harvard Mark I computer [3].


Harvard architecture. (source: Wikipedia)

However, there may be advantages to treating instructions as data, since this allows for concepts such as just-in-time compilation, where instructions might be written to memory during runtime and then read as data. Modern computers (ARM, x86) use the so-called Modified Harvard architecture [4], which overcomes the bottleneck of the von Neumann architecture by having a dedicated memory with a copy of the program (in the form of a CPU instruction cache), while instructions can still be read as data when needed.

Limitations of classical architectures

We’ll now focus on the limitations of current implementations of the Modified Harvard architecture. It’s really hard to make any concrete estimates on the early architecture proposals because they’re very high-level and the actual implementation might vary widely.

Processing Power

In 1965 Gordon Moore, co-founder of Fairchild Semiconductor (and later Intel), wrote a paper predicting that the number of transistors in a chip would double every year for the next decade. The industry was eager to follow this prophecy and the trend held for several decades (it was later adjusted to every 2 years), but it slowed down in the early 2010s. As of today, CPUs have on the order of tens of billions of transistors.


Moore's Law: transistor count (log axis) over year. (source: Wikipedia)

It’s impractical to think that we’ll be able to put more transistors in a chip forever since there are physical constraints. The question is what kind of constraints are we going to hit?

To pack more transistors into a chip we can either increase the size of the chip or reduce the size of the transistor.

Increase the size of the chip. This incurs more power consumption and heat dissipation, and information has to travel longer distances, making computation potentially slower. Furthermore, large chips might be infeasible for small devices like smartphones.

Reduce the size of the transistor. For the current transistor design the size has a hard lower bound at the silicon atom, which is about 0.2nm [6], but before we get there, we have to figure out how to manufacture transistors with such precision in a cost-effective way. As of this writing, 14nm seems to be the smallest size that can be viably produced.

One question we need to ask ourselves is how a large number of transistors in a chip translates into computing power. Transistor count is a convenient proxy for CPU performance because it's easy to measure, but what can we actually achieve with more transistors? They allow higher parallelism via multiple cores, and transistors can be used to build CPU caches, which improve the speed of common operations.

Another way to potentially increase processing power is to keep the number of transistors constant but reduce the chip size. This decreases the distance electrons need to travel and, by dissipating less heat, makes it possible to increase the clock frequency.

Besides reducing the size of the transistor, other strategies are being explored to reduce the chip’s area: instead of the classic 2D square layout, chip manufacturers are exploring stacking approaches to reduce the overall size of the chip.

Memory Bandwidth

The speed improvements of RAM haven't kept pace with those of the CPU. The widening gap can become so large that memory speed becomes a bottleneck for CPU speed (known as the Memory Wall [8]).

To work around this limitation, CPU caches are currently used. Alternative solutions include adding on-chip memory to reduce the latency of moving data around.

Conclusion

I didn't have a good idea of what to write about, but I was interested in understanding how close we are to the practical limits of current computer architectures, so the idea was to go back to the early inception of computer architectures and learn a bit about them.

We made rapid progress in the early days and had steady progress for a long time, so it's reasonable to be optimistic, but progress has been slowing down, at least for general-purpose single-node computation. We've seen specialized hardware take hold with GPUs and TPUs, and also the rise of parallel, concurrent and distributed computing.

Quantum computers still seem a faraway dream. I wonder if there's any value in rethinking the classical architecture model from scratch to see if we can escape from this local minimum.

References

[1] Introduction to “The First Draft Report on the EDVAC”
[2] Wikipedia – Von Neumann architecture
[3] Wikipedia – Harvard architecture
[4] Wikipedia – Modified Harvard architecture
[5] Wikipedia – Transistor count
[6] Is 14nm the end of the road for silicon chips?
[7] Intel demos first Lakefield chip design using its 3D stacking architecture
[8] Wikipedia – Random-access memory: memory wall

Constructing Trees from a Distance Matrix


Richard Dawkins is an evolutionary biologist and author of many science books. In The Blind Watchmaker he explains how complex systems can exist without the need of an intelligent design.

Chapter 10 of that book delves into the tree of life. He argues that the tree of life is not arbitrary taxonomy like the classification of animals into kingdoms or families, but it is more like a family tree, where the branching of the tree uniquely describes the true ancestry relationship between the nodes.

Even though we've made great strides in genetics and mapped the DNA of several different species, determining the structure of the tree is very difficult. First, we need to define a suitable metric that encodes the ancestry proximity of two species. In other words, if species A evolved into B and C, we need a metric that would lead us to link A-B and A-C but not B-C. Another problem is that internal nodes can be missing (e.g. an ancestor species went extinct without leaving fossils).


David Hill’s tree of life based on sequenced genomes. Source: Wikipedia

In this post we’ll deal with a much simpler version of this problem, in which we have the metric well defined, we know the distance between every pair of nodes (perfect information), and all our nodes are leaves, so we have the freedom to decide the internal nodes of the tree.

This simplified problem can be formalized as follows:

Constructing a tree from its distance matrix problem. Suppose we are given an n x n distance matrix D. Construct a tree with n leaves such that the distance between every pair of leaves is represented by D.

To reduce the amount of possible solutions, we will assume a canonical representation of a tree. A canonical tree doesn’t have any nodes with degree 2. We can always reduce a tree with nodes with degree 2 into a canonical one. For example:


Nodes with degree 2 can be removed and the edges combined.

Terminology

Let's introduce the terminology necessary to define the algorithm for solving our problem. A distance matrix D is a square matrix where d_ij represents the distance between elements i and j. This matrix is symmetric (d_ij = d_ji), all off-diagonal entries are positive, the diagonal entries are 0, and every triplet (i, j, k) satisfies the triangle inequality, that is,

d_ik <= d_ij + d_jk

A distance matrix is additive if there is a solution to the problem above.

We say two leaves are neighbors if they share a common parent. An edge connecting a leaf to its parent is called a limb (edges connecting internal nodes are not limbs).

Deciding whether a matrix is additive

We can decide whether a matrix is additive via the 4-point theorem:

Four-point Theorem. Let D be a distance matrix. If, for every possible set of 4 indexes (i, j, k, l), the following inequality holds (for some permutation):

(1) d_ij + d_kl <= d_ik + d_jl = d_il + d_jk

then D is additive.

Sketch of proof. We can derive the general idea from the example tree below:

[Figure: a tree with four leaves i, j, k and l. Leaves i and j hang off one internal node via limbs of length a and b; leaves k and l hang off another internal node via limbs of length c and d; the two internal nodes are connected by an edge of length e.]

We can verify that (1) holds by inspecting the edges on the paths between each pair of leaves. This will be our base case for induction.

Now, we'll show that if we're given a distance matrix satisfying (1), we are able to reconstruct a valid tree from it. We have that d_ik = a + e + c, d_jl = b + e + d, d_ij = a + b and d_kl = c + d. If we add the first two and subtract the last two, we have d_ik + d_jl - d_ij - d_kl = 2e, so we have

e = (d_ik + d_jl - d_ij - d_kl) / 2

We know from (1) that d_ik + d_jl >= d_ij + d_kl, so e is non-negative.

If we add d_ik and d_ij and subtract d_jk, we get d_ik + d_ij - d_jk = 2a, so

a = (d_ik + d_ij - d_jk) / 2

To show that a is non-negative, we need to remember that a distance matrix satisfies the triangle inequality, that is, for any three nodes x, y, z, d_xy + d_yz >= d_xz. In our case, this means d_ij + d_ik >= d_jk, hence d_ik + d_ij - d_jk >= 0 and a is non-negative. We can use analogous ideas to derive the values for b, c and d, shown below.
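For completeness, under the same labeling as the figure above, the analogous expressions are:

b = (d_ij + d_jk - d_ik) / 2
c = (d_ik + d_kl - d_il) / 2
d = (d_jl + d_kl - d_jk) / 2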

For the more general case, if the property holds for every possible set of 4 leaves (i, j, k, l), then for each quadruple there's a permutation of the four leaves such that the tree induced by the paths between each pair of leaves looks like the specific example we showed above.

For at least one of these quadruplets, i and j will be neighbors in the reconstructed tree. With the computed values of a, b, c, d, e, we are able to merge i and j into their parent and generate the distance matrix for n-1 leaves, and we can keep doing this until n = 4. We still need to prove that this modified (n-1) x (n-1) matrix satisfies the 4-point theorem if and only if the n x n one does.

Limb cutting approach

We’ll see next an algorithm for constructing a tree from an additive matrix.

The general idea is that even though we don’t know the structure of the tree that will “yield” our additive matrix, we are able to determine the length of the limb of any leaf.  Knowing that, we can remove the corresponding leaf (and limb) from our unknown tree by removing the corresponding row and column in the distance matrix. We can then solve for the n-1 x n-1 case. Once we have the solution (a tree) for the smaller problem we can “attach” the leaf into the tree.

To compute the limb length and where to attach it, we can rely on the following theorem.

Limb Length Theorem: Given an additive matrix D and a leaf j, limbLength(j) is equal to the minimum of

(2) (d_ij + d_jk - d_ik)/2

over all pairs of leaves i and k.

The idea behind the theorem is that if we remove parent(j) from the unknown tree, it will divide the tree into at least 3 subtrees (one being leaf j on its own). This means that there exist leaves i and k in different subtrees (neither being j's). The path from i to k therefore has to go through parent(j), and the paths from i to j and from j to k are disjoint except for j's limb, so we can conclude that:

d_ik = d_ij + d_jk - 2*limbLength(j)

which yields (2) for limbLength(j). We can show now that for i and k on the same subtree d_ik <= d_ij + d_jk - 2*limbLength(j), and hence

limbLength(j) <= (d_ij + d_jk - d_ik)/2

This means that finding the minimum of (2) will satisfy these constraints.

Attaching leaf j back in. From the argument above, there is at least one pair of leaves (i, k) that yields the minimum limbLength(j) and belongs to different subtrees when parent(j) is removed. This means that parent(j) lies on the path between i and k. We need to plug j in at some point on this path such that computing the distance from j to i and from j to k yields d_ij and d_jk respectively. This might fall in the middle of an edge, in which case we need to create a new node. Note that as long as the edges all have positive lengths, there's only one location within the path from i to k where we can attach j.

Note: There’s a missing detail in the induction argument here. How can we guarantee that no matter what tree is returned from the inductive step, it is such that attaching j will yield consistent distances from j to all other leaves besides i and k?

This constructive proof gives us an algorithm to find a tree for an additive matrix.

Runtime complexity. Finding limbLength(j) takes O(n^2) time since we need to inspect every pair of entries in D. We can generate an n-1 x n-1 matrix in O(n^2) and find the attachment point in O(n). Since each recursive step is proportional to the size of the matrix and we have n such steps, the total runtime complexity is O(n^3).

Detecting non-additive matrices. If we find a non-positive limbLength(j), that is sufficient to conclude the matrix is non-additive, since if a corresponding tree existed, limbLength(j) would have to equal the length of j's limb, which must be positive. However, is this necessary? It could be that we find a positive value for limbLength(j) but, when trying to attach j back in, the distances won't match.

The answer to this question goes back to the missing detail on the induction step and I don’t know how to answer.

The Neighbor-Joining Algorithm

Naruya Saitou and Masatoshi Nei developed an algorithm, called Neighbor Joining, that also constructs a tree from an additive matrix, but has the additional property that for non-additive ones it serves as heuristic.

The idea behind it is simple: it transforms the distance matrix D into another n x n matrix, D*, such that the minimum non-diagonal entry, say d*_ij, in that matrix corresponds to neighboring vertices (i, j) in the tree – which is generally not true of the original distance matrix.
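For reference, this transformed matrix is commonly defined (in the neighbor-joining literature) as

d*_ij = (n - 2) * d_ij - totalDist(i) - totalDist(j)

where totalDist(i) is the sum of the distances from i to every other leaf.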

The proof that D* has this property is non-trivial and we won't provide it here. Chapter 7 of [1] has more details and the proof.

Given this property, we can find i and j such that d*_ij is minimal, compute the limb lengths limbLength(i) and limbLength(j), replace the two leaves with a single leaf m, and solve the problem recursively. With the tree returned by the recursive step we can then attach i and j to m, which becomes their parent.

Conclusion

In this post we saw how to construct a tree from the distances between its leaves. The algorithms are relatively simple, but proving that they work is not. I got the general idea of the proofs but didn't get 100% of the details.

The idea of reconstructing the genealogical tree of all the species is fascinating and is a very interesting application of graph theory.

References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P. – Chapter 10

[2] The Blind Watchmaker – Richard Dawkins

Consistent Hashing


Daniel Lewin was an Israeli-American mathematician and entrepreneur. He was aboard the American Airlines Flight 11, which was hijacked by al-Qaeda during the September 11 attacks.

Tom Leighton is a professor (on leave) of Applied Mathematics at CSAIL @ MIT and an expert on algorithms for network applications.

Together, Lewin and Leighton founded the company Akamai, which was a pioneer in the business of content delivery networks (CDNs) and is currently one of the top players in the segment. One of the key technologies employed by the company was the use of consistent hashing, which we’ll present in this post.

Motivation

One of the main purposes of the CDN is to be a cache for static data. Due to large amounts of data, we cannot possibly store the cache in a single machine. Instead we’ll have many servers each of which will be responsible for storing a portion of the data.

We can see this as a distributed key-value store, and we have two main operations: read and write. For the write part, we provide the data to be written and an associated key (address). For the read part, we provide the key and the system either returns the stored data or decides it doesn’t exist.

In scenarios where we cannot make any assumptions over the pattern of data and keys, we can try to distribute the entries uniformly over the set of servers. One simple way to do this is to hash the keys and get the remainder of the division by N (mod N), where N corresponds to the number of servers. Then we assign the entry (key, value) to the corresponding server.

The problem arises when the set of servers changes very frequently. This can happen in practice, for example, if servers fail and need to be put offline, or we might need to reintroduce servers after reboots or even add new servers for scaling.

Changing the value of N would cause almost all keys to be redistributed to different servers, which is very inefficient. We need to devise a way to hash the keys such that adding or removing servers only requires a few keys to change servers.

Consistent Hashing

The key idea of the consistent hashing algorithm is to hash the servers into the same table as the keys. A possible key for a server could be its IP address.

Say that our hash function h() generates a 32-bit integer. Then, to determine to which server we will send a key k, we find the server s whose hash h(s) is the smallest that is larger than h(k). To make the process simpler, we assume the table is circular, which means that if we cannot find a server with hash larger than h(k), we wrap around and start looking from the beginning of the array.


Big blue circles are servers, orange circles are keys. Right: If we remove server S3, only entries corresponding to keys K5 and K4 need to be moved / re-assigned.

If we assume that the hash distributes the keys uniformly, including the server keys, we’ll still get a uniform distribution of keys to each server.

The advantage comes when adding and removing servers. When adding a new server sx to the system, its hash will fall between two server hashes, say h(s1) and h(s2), in the circle. Only the keys from h(s1) to h(sx), which belonged to s2, will change servers, to sx. Conversely, when removing a server sx, only the keys assigned to it will need to go to a different server, in this case the server that immediately follows sx.

How can we find the server associated with a given key? The naive way is to scan the hashes linearly until we find a server hash. A more efficient way is to keep the server hashes in a balanced binary search tree, so we can find the smallest value larger than h(x) in O(log n), while adding and removing servers from the tree are also O(log n) operations.

Implementation in Rust

We will provide an implementation of the ideas above in Rust as an exercise. We define the interface of our structure as

Note that we’ll store the list of servers (containers) and keys (entries) in separate structures. We can store the entries in a simple hash table since we just need efficient insertion, deletion and look up. For the containers we need insertion, deletion but also finding the smallest element that is larger than a given value, which we’ll call successor. As we discussed above, we can use a binary balanced search tree which allow all these operations in O(log n), for example a Red-Black tree. I found this Rust implementation of the Red-Black tree [1].

Finally, we also include the hash function as part of the structure in case we want to customize the implementation (handy for testing), but we provide a default implementation.

To “construct” a new structure, we define a method new() in the implementation section, and use farmhash as the default implementation for the hash function [2].
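The original definitions aren't reproduced here, but a minimal sketch of such a structure could look like the following. For simplicity I use the standard library's BTreeMap in place of the Red-Black tree crate [1] and DefaultHasher in place of farmhash [2]; all names are illustrative rather than the original post's:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

pub struct ConsistentHashTable {
    containers: BTreeMap<u32, String>, // container hash -> container id
    entries: HashMap<String, String>,  // entry key -> value
    hash_fn: fn(&str) -> u32,          // injectable hash function (handy for testing)
}

fn default_hash(s: &str) -> u32 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish() as u32
}

impl ConsistentHashTable {
    pub fn new() -> Self {
        ConsistentHashTable {
            containers: BTreeMap::new(),
            entries: HashMap::new(),
            hash_fn: default_hash,
        }
    }
}
```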

The insertion and removal are already provided by the data structures, and are trivial to extend to ours. The interesting method is determining the server corresponding to a given key, namely get_container_id_for_entry().

In there we need to traverse the Red-Black tree to find the successor of our value v. The API of the Red-Black tree doesn’t have such method, only one to search for the exact key. However due to the nature of binary search trees, we can guarantee that the smallest element greater than the searched value v will be visited while searching for v.

Thus, we can modify the search algorithm to include a visitor, that is, a callback that is called whenever a node is visited during the search. In the code below we start with a reference to the root, temp, and in a loop we keep traversing the tree depending on comparison between the key and the value at the current node.

Let’s take a detour to study the Rust code a bit. First, we see the unsafe block [3]. It can be used to de-reference a raw pointer. A raw pointer is similar to a C pointer, i.e. it points to a specific memory address. When we de-reference the pointer, we have access to the value stored in that memory address. For example:
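The original example isn't preserved; here is a minimal sketch of dereferencing a raw pointer inside an unsafe block:

```rust
fn main() {
    let x = 42;
    let p = &x as *const i32; // a raw pointer holding x's memory address
    unsafe {
        // dereferencing a raw pointer is only allowed inside an unsafe block
        println!("{}", *p);
    }
}
```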

The reason we need the unsafe block in our implementation is that self.root is a raw pointer to RBTreeNode, as we can see in line 1 and 4 below:

The other part worth mentioning is the type of the visitor function. It’s defined as

It relies on several concepts from Rust, including Traits, Closures, and Trait Bounds [4, 5]. The syntax indicates that the type of visitor must satisfy FnMut(&K), which means a closure taking a single parameter of type &K (K being the type of the keys in the RB tree). There are three traits a closure can implement: Fn, FnMut and FnOnce. FnMut allows closures that capture and mutate variables in their environment (see Capturing the Environment with Closures). We need this because our visitor will update a variable defined outside of the closure, as we'll see next.
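A self-contained illustration of an FnMut(&K) trait bound (not the original definition): the visitor closure captures and mutates a variable from its environment, which is exactly what we need below.

```rust
fn visit_all<K, F>(keys: &[K], mut visitor: F)
where
    F: FnMut(&K),
{
    for k in keys {
        visitor(k); // the closure may mutate state it captured
    }
}

fn main() {
    let keys = vec![3, 1, 4, 1, 5];
    let mut count = 0;
    visit_all(&keys, |_k| count += 1); // mutates `count` from the environment
    println!("visited {} keys", count);
}
```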

We are now done with our detour into the Rust features realm, so we can analyze the closure we pass as visitor. It’s a simple idea: whenever we visit a node, we check if it’s greater than our searched value and if it’s smaller than the one we found so far. It’s worth noticing we define closest_key outside of the closure but mutate it inside it:

We also need to handle a corner case: if the hash of the value is larger than all of the containers' hashes, we wrap around our virtual circular table and return the container with the smallest hash:
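Under the BTreeMap simplification from the earlier sketch (the original post does this with a visitor over the Red-Black tree), the whole lookup, including the wrap-around, might look like:

```rust
use std::collections::BTreeMap;

fn get_container_for_hash(containers: &BTreeMap<u32, String>, h: u32) -> Option<&String> {
    containers
        .range(h..) // smallest container hash >= h
        .next()
        .or_else(|| containers.iter().next()) // wrap around to the smallest hash
        .map(|(_, id)| id)
}

fn main() {
    let mut containers = BTreeMap::new();
    containers.insert(100, "s1".to_string());
    containers.insert(200, "s2".to_string());
    println!("{:?}", get_container_for_hash(&containers, 150)); // Some("s2")
    println!("{:?}", get_container_for_hash(&containers, 250)); // Some("s1"), wrapped around
}
```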

The full implementation is on Github and it also contains a set of basic unit tests.

Conclusion

The idea of a consistent hash is very clever. It relies on the fact that binary search trees can be used to search not only exact values (those stored in the nodes) but also the closest value to a given query.

In a sense, this use of binary trees is analogous to a common use of quad-trees, which is to subdivide the 2d space into regions. In our case we’re subdividing the 1d line into segments, or more precisely, we’re subdividing  a 1d circumference into segments, since our line wraps around.

I struggled quite a bit with Rust's strict typing, especially around passing lambda functions as arguments and also setting up the testing. I found the mocking capabilities of the Rust toolchain lacking, and decided to use dependency injection for the hash function to make it easier to test. I did learn a ton, though!

References

[1] GitHub: /tickbh/rbtree-rs
[2] GitHub: seiflotfy/rust-farmhash
[3] The Rust Programming Language book – Ch19: Unsafe Rust
[4] The Rust Programming Language book – Ch13: Closures: Anonymous Functions that Can Capture Their Environment
[5] The Rust Programming Language book – Ch13: Traits: Defining Shared Behavior

Rust Memory Management

Graydon Hoare is a Software Developer who created the Rust programming language while working at Mozilla Research [1]. He has an interesting presence on the internet, for example responding to this question and on Twitter.

In this post we'll talk about one of the key features of Rust, and the one I find hardest to wrap my head around: its memory management.

Motivation

In most programming languages we either have to manage memory allocation ourselves or rely on a complex garbage collector over which we have limited control and which can lead to unpredictable performance bottlenecks.

Rust has a set of constraints around memory allocation that results in deterministic, automatic memory management.

We'll now delve into what these constraints are and how they enforce the guarantees necessary for efficient memory management by the runtime.

Ownership Rules

Rust has the following basic rules around ownership:

  • Each value in Rust has a variable that’s called its owner
  • There can only be one owner at a time
  • When the owner goes out of scope, the value will be dropped

They’re very basic but have deep implications in how we think about programs. Let’s analyze some of them.

Assignment transfers ownership

To conform to the second rule, whenever we assign a variable to another, we transfer the ownership from one variable to the other. For example:
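A minimal sketch of the behavior described below (the original snippet isn't preserved):

```rust
fn main() {
    let vec1 = vec![1, 2, 3];
    let vec2 = vec1; // ownership of the vector moves from vec1 to vec2
    // println!("{:?}", vec1); // compile error: use of moved value `vec1`
    println!("{:?}", vec2);
}
```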

In the example above, vec1 transfers ownership of its data to vec2. This means that we can neither read nor write to vec1 anymore. It's as if it had gone out of scope (unless it is later assigned ownership of some other data).

By having a single owner we don't have to worry about keeping track of references to a given object, as garbage collectors do, to know when we are allowed to free the memory. For example, if we had:
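The original snippet isn't preserved; the sketch below is laid out so that the line numbers referenced next still apply:

```rust
fn main() {                   // line 1
    let vec1 = vec![1, 2, 3]; // line 2
    let vec2 = vec1;          // line 3: ownership moves to vec2
}                             // line 4: vec2 goes out of scope, the memory is freed
```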

Because vec2 owns the vector allocated initially and it goes out of scope after line 4, the runtime can free the memory safely, since we know vec1 cannot access that data after line 3.

Similarly, when we use a variable as argument to a function, its data is transferred to the parameter. In the example below, vec1‘s data is transferred to vec in mutate_vec(), so we cannot access it in line 9.
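Again a sketch, laid out so that the line number mentioned above still applies:

```rust
fn mutate_vec(mut vec: Vec<i32>) {
    vec.push(4);
}

fn main() {
    let vec1 = vec![1, 2, 3];
    mutate_vec(vec1); // vec1's data moves into the parameter `vec`

    // println!("{:?}", vec1); // line 9: compile error, vec1's data was moved
}
```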

One way to return the ownership back to vec1 is for the function to return the argument.

References & Borrowing

To avoid transferring ownership on assignment, we can use references. In the example below, vec2 "borrows" the data from vec1 (we take a reference with &vec1), but there's no ownership transfer. We get read access to the data via vec2, which we can use by dereferencing it with *vec2.
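A sketch:

```rust
fn main() {
    let vec1 = vec![1, 2, 3];
    let vec2 = &vec1;           // vec2 borrows vec1's data
    println!("{}", (*vec2)[0]); // read access by dereferencing
    println!("{:?}", vec1);     // vec1 still owns the data and remains usable
}
```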

If we want write access, we need to make vec1 mutable and also obtain a mutable reference to vec1 via the &mut operator, like in the example below:
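A sketch:

```rust
fn main() {
    let mut vec1 = vec![1, 2, 3];
    let vec2 = &mut vec1;   // mutable borrow of vec1
    vec2.push(4);           // write access through the mutable reference
    println!("{:?}", vec1); // fine here: the mutable borrow is no longer in use
}
```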

However, if we try to access the data via vec1 while vec2 has a mutable reference to it, we’ll get an error.

We can take as many (non-mutable) references as we want:

But once we try to obtain a mutable reference, we get an error:
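A sketch of both situations; the snippet below intentionally fails to compile once the mutable borrow is added:

```rust
fn main() {
    let mut vec1 = vec![1, 2, 3];
    let r1 = &vec1;
    let r2 = &vec1;
    println!("{:?} {:?}", r1, r2); // any number of shared borrows is fine
    let m = &mut vec1; // error: cannot borrow `vec1` as mutable because it is
                       // also borrowed as immutable
    println!("{:?} {:?}", r1, m);
}
```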

Read and write locks. Rust enforces these constraints to prevent race condition bugs in multi-thread applications. In fact the borrowing mechanism via references is very similar to read locks and write locks.

To recall, if some data has a read lock, we can acquire as many other read locks as we want but we cannot acquire a write lock. This way we prevent inconsistency between multiple reads, since we prevent data mutations by not allowing writes. Conversely, if some data has a write lock, we can acquire neither read nor write locks.

We can see that the regular borrowing implements a read lock while a mutable borrowing implements a write lock.

Dangling references. One potential issue with references is returning a reference to some variable that went out of scope.

Luckily, the compiler prevents this case with a compile error. It's not always the case that we cannot return a reference, though. If the reference we're returning was passed in as an argument, it's still a valid one, for example:
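A sketch:

```rust
fn first_element(vec: &Vec<i32>) -> &i32 {
    &vec[0] // fine: the returned reference is tied to the argument's lifetime
}
```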

However, if we try to pass two parameters, we’ll run into a compile error:
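A sketch; this intentionally does not compile ("missing lifetime specifier" on the return type), because the compiler can't tell which argument the result borrows from:

```rust
fn get_largest(vec1: &Vec<i32>, vec2: &Vec<i32>) -> &Vec<i32> {
    if vec1.len() > vec2.len() {
        vec1
    } else {
        vec2
    }
}
```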

To see why it's not possible to guarantee this is memory-safe, consider the following code calling get_largest:
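A sketch of such a caller (hypothetical, since get_largest itself doesn't compile):

```rust
fn main() {
    let vec1 = vec![1, 2, 3];
    let result;
    {
        let vec2 = vec![1, 2, 3, 4];
        result = get_largest(&vec1, &vec2);
    } // vec2 is dropped here
    println!("{:?}", result); // could be reading freed memory if this were allowed
}
```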

Here we're sending references to both vec1 and vec2, and returning either of them. If vec2 happened to be larger, it would be assigned to result and we'd try to access its data after vec2 went out of scope.

However, if result is used while both vec1 and vec2 are in scope, it should be theoretically safe to allow calling get_largest:
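A sketch of that safe usage pattern:

```rust
fn main() {
    let vec1 = vec![1, 2, 3];
    {
        let vec2 = vec![1, 2, 3, 4];
        let result = get_largest(&vec1, &vec2);
        println!("{:?}", result); // result is only used while vec1 and vec2 are alive
    }
}
```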

In fact, it’s possible and we’ll see how next, but we’ll need to introduce a new terminology.

Lifetimes

The lifetime of a variable is the span of the program during which the variable is valid. In the example below, the lifetime of variable a goes from line 2 to line 7. The lifetime of b goes from line 4 to 5, c's is line 5, and d's is line 7. Note that a's lifetime contains all the other lifetimes and b's lifetime contains c's.
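A sketch laid out to match the line numbers above:

```rust
fn main() {                // line 1
    let a = vec![1, 2, 3]; // line 2: a's lifetime starts
    {                      // line 3
        let b = &a;        // line 4: b's lifetime starts
        let c = b.len();   // line 5: c's lifetime; also b's last use
    }                      // line 6
    let d = a.len();       // line 7: d's lifetime; also a's last use
}                          // line 8
```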

We can annotate a function using a syntax similar to generics to parametrize the function arguments by their lifetimes. For example, if we want to include the lifetimes of the arguments in get_largest(), we can do:
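A sketch of the annotated version, using the name referenced later in the post:

```rust
fn get_largest_with_lifetime<'a>(vec1: &'a Vec<i32>, vec2: &'a Vec<i32>) -> &'a Vec<i32> {
    if vec1.len() > vec2.len() {
        vec1
    } else {
        vec2
    }
}
```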

This is essentially binding each variable to the lifetime specified by 'a. Note it's requiring that both arguments share the lifetime 'a and that the return type has that same lifetime.

Now, if we replace get_largest() with get_largest_with_lifetime(), we won't get compile errors. In the example below, result has the lifetime common to vec1 and vec2, which is vec2's lifetime. This means we're fine using result within the inner block.
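A sketch:

```rust
fn main() {
    let vec1 = vec![1, 2, 3];
    {
        let vec2 = vec![1, 2, 3, 4];
        let result = get_largest_with_lifetime(&vec1, &vec2);
        println!("{:?}", result); // ok: result isn't used after vec2 is dropped
    }
}
```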

Conclusion

Rust documentation is very detailed and well written. All the concepts presented in this post are explained in length there. In here I’m presenting them in my own words and different examples (vec instead of string).

I also tried to group topics around memory management, which is different from the documentation. In there “lifetimes” are not covered under ownership and I skipped the generics discussion.

References

[1] Wikipedia – Rust (programming language)
[2] Rust Documentation: What is Ownership?
[3] Rust Documentation: References & Borrowing
[4] Validating References with Lifetimes

DNA Assembly

In this post we’ll discuss the problem of reconstructing a DNA segment from its fragments, also known as DNA assembly.


Context. When we first talked about DNA sequencing, we learned that there’s no good way to “scan” the whole series of nucleotides from a DNA with current technology. What can be done is to break the target segment into many small (overlapping) segments and then rely on computers to help with the task of reconstructing the original DNA sequence from these segments.

We can start by making some assumptions about the nature of the fragments. First, we'll assume every fragment has the same length, and second that we have every possible fragment of that length.

For instance, if our sequence is TATGGGGTGC, all possible fragments of length 3 are: ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG.

Note that the fragments overlap with each other. For example, TAT and ATG have an overlap of AT. This is crucial for us to solve the problem: if there was no overlap it would be impossible to order the fragments to obtain the original sequence, since there would be no "link" between any two fragments.

Let’s state the problem more formally given these constraints.

The String Reconstruction Problem

Definitions. A k-mer of a string S is any substring of S with length k. The String Reconstruction Problem consists of reconstructing the string S given the set of all of its k-mers.

Reusing the example from above, if we are given ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG, we could reconstruct the string TATGGGGTGC. We’ll now see a way to solve this problem.

Solution. Assuming a solution exists, it will consist of an (ordered) sequence of the k-mers such that adjacent k-mers overlap in k-1 positions. For the example above, the permutation is

TAT, ATG, TGG, GGG, GGG, GGT, GTG, TGC

And two adjacent k-mers such as TGG and GGG, overlap in k-1 positions (GG).

We can model this as a graph problem: define a directed graph where each vertex corresponds to a k-mer, and an edge (u, v) exists if and only if the suffix of k-mer u (of length k-1) equals the prefix of k-mer v – in other words, u overlaps with v.

Now, if we can find a path visiting each vertex exactly once, that will spell a valid reconstructed string. This is known as the Hamiltonian path problem, and for general graphs it's NP-Hard.

Instead, we can model the problem using a different graph: for each k-mer, we have a vertex corresponding to its prefix of length k-1 and another corresponding to its suffix of length k-1. For example, for the k-mer TGG, there would be vertices TG and GG. There's then an edge from u to v if the string obtained by overlapping u and v in k-2 positions is a k-mer in the input. In the example above, there's an edge from TG to GG because TGG is a k-mer. Note that we can have repeated (multiple) edges.
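A small sketch (not from the original post) of building this graph in Rust: each k-mer contributes one edge from its (k-1)-prefix to its (k-1)-suffix, and repeated k-mers yield parallel edges.

```rust
use std::collections::HashMap;

fn build_graph(kmers: &[&str]) -> HashMap<String, Vec<String>> {
    let mut adj: HashMap<String, Vec<String>> = HashMap::new();
    for &kmer in kmers {
        let k = kmer.len();
        let prefix = kmer[..k - 1].to_string();
        let suffix = kmer[1..].to_string();
        adj.entry(prefix).or_insert_with(Vec::new).push(suffix);
    }
    adj
}

fn main() {
    let kmers = ["ATG", "GGG", "GGG", "GGT", "GTG", "TAT", "TGC", "TGG"];
    for (u, vs) in build_graph(&kmers) {
        println!("{} -> {:?}", u, vs);
    }
}
```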

In this new graph, if we can find a path visiting each edge exactly once, we’ll find a reconstructed string to the set of k-mers. To see why, we can observe that each edge in this new graph is a k-mer and two consecutive edges must overlap in k-1 positions (the overlap being the vertex that “links” these two edges). The graph for the example we discussed above can be seen below:


Graph representing the k-mer set: ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG

Note that this second graph is the line graph of the one we first used, in a similar fashion that a de Bruijn graph of dimension n is a line graph of the one with dimension n-1. In fact, these graphs are a subgraph of the de Bruijn graphs.

As we saw in our discussion of Eulerian Circuits, this is a much easier problem to solve.

Dealing with Ambiguity

Even if we are able to solve the String Reconstruction problem, we might not end up with the right DNA segment. In [1], the authors provide the example TATGCCATGGGATGTT, which has the same 3-mer composition as TAATGGGATGCCATGTT. Let's see some strategies employed to work around this problem.

The String Reconstruction from Read-Pairs Problem

While it’s currently infeasible to generate longer fragments that would reduce ambiguity, it’s possible to obtain what is called read-pairs. These are a pair of k-mers that are separated by a distance of exactly d in the target segment.

For example, TGC and ATG are a pair of 3-mers separated by distance 1 in TATGCCATGGGATGTT. We refer to a pair of k-mers separated by distance d as (k, d)-mers, or (pattern1 | pattern2) if we can omit the distance.

Solution. We can construct a de Bruijn-like graph for this problem too, which we’ll call Paired de Bruijn graph. First, let’s define the prefix and suffix of a (k, d)-mer.  Given a (k, d)-mer in the form of (a1, ..., ak | b1, ..., bk), its prefix is given by (a1, ..., ak-1 | b1, ..., bk-1) and its suffix by (a2, ..., ak | b2, ..., bk).

For every (k, d)-mer, we’ll have one vertex corresponding to the prefix and one to the suffix of this (k, d)-mer. There’s an edge from vertex u to vertex v if there’s a (k, d)-mer whose prefix is u and suffix is v.

Similar to the solution to the String Reconstruction problem, we can find an Eulerian path, but in this case that might not yield a valid solution. In [1] the authors provide an example:

Consider the set of (2, 1)-mers given by (AG|AG), (AG | TG), (CA | CT), (CT | CA), (CT | CT), (GC | GC), (GC | GC), (GC | GC), (TG | TG).

After constructing the graph, one of the possible Eulerian paths is (AG|AG) → (GC | GC) → (CA | CT) → (AG | TG) → (GC | GC) → (CT | CT) → (TG | TG) → (GC | GC) →  (CT | CA) which spells AGCAAGCTGCTGCA, which is a valid solution.

However, another valid Eulerian path, (AG|AG) → (GC | GC) →  (CT | CT) →  (TG | TG)  → (GC | GC) → (CA | CT) → (AG | TG) → (GC | GC) →  (CT | CA) does not yield a valid string.

In [1] the authors don’t provide an explicit way to overcome this issue but they go on to describe how to enumerate all Eulerian paths, which seems to suggest a brute-force approach.

Practical Challenges

Missing fragments. One of the assumptions we made – that all fragments of the sequence are present – doesn't hold true for state-of-the-art sequencers.

A technique to address this issue is to break the fragments into smaller ones until we get full coverage.


Left: 10-mers not providing full coverage. Right: 5-mers obtained from 10-mers and having full coverage.

This trades off coverage with ambiguity, since smaller k-mers are more likely to contain repeats and that might not lead to a single solution.

Typos. Another limitation of sequencers is that they can misread nucleotides. If we perform multiple reads – some containing the correct nucleotides, some not – we'll end up with a graph where some paths are valid and some are not. It's possible to remove the invalid ones via heuristics, but these are not perfect and can lead to the removal of valid paths.

Conclusion

While following the textbook [1] I felt a bit lost with so many detours and kept losing track of the main problem being solved. This is partly because the textbook is meant to be accessible to people without prior knowledge of Computer Science, so it needs to lay the groundwork for concepts such as Graph Theory.

One thing I missed from the content was a section on experimental results. Bioinformatics is a highly applied branch of Computer Science and all of these methods are heuristics or based on approximate models. I'd be interested in knowing how well they perform in practice.

What I liked most about the presentation style is that it provides a simpler solution first, describes its issues and then provides a better solution. This helps with understanding why a given algorithm is the way it is.

Reconstructing a string using (k,d)-mers in an efficient way seems like an open problem, given the solution presented requires brute force in the worst case. I wonder if there has been any progress since.

References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[2] Wikipedia – Sequence Assembly

2018 in Review

This is a meta-post to review what happened in 2018.

Posts Summary

This year I set out to learn about Bioinformatics. I completed the Bioinformatics class on Coursera. Under this umbrella I wrote about Cell Biology and DNA Sequencing. I’m on my way to write about DNA Fragment Assembly, but wanted to work out the theory behind it first, which led me to Eulerian Circuits and De Bruijn Graphs.


I was curious about current technologies such as Blockchain and Two-factor Authentication and wrote a bit about them.

One of my resolutions for last year was to learn the Rust programming language. I implemented the code from a few of my posts using it, including the HyperLogLog data structure and a solver for a game called Bulls and Cows. I still ventured a bit with OCaml by learning BuckleScript (JavaScript with OCaml syntax).

I continued my slow progress in studying Distributed Systems. This year I read Google’s F1 Database paper and wrote about LSM Trees.

Besides BuckleScript, I haven’t dedicated too much time to Web Development topics, the other one being Layout properties of CSS.

The Blog in 2018

The most popular post is still the 2014 Introduction to the Parsec Library, with 1.3k visits. From this year's posts, the recreational math problem Bulls and Cows was the most viewed. Overall the blog had a total of 9.6k visitors.

I kept the resolution to post once a month on average. The blog completed 6 years with 79 posts.

Resolutions for 2019

I'll repeat my resolutions from 2018 for 2019. I don't think I learned nearly enough Rust, especially around memory management, and I've only scratched the surface of Bioinformatics. Besides DNA analysis, I learned about other problems, like protein folding, that seem exciting.

I haven’t done any mobile projects and only read one paper, so I’ll put these on the bucket list as well.

Personal

The end of the year is a good time to look back and remember all the things I’ve done besides work and the technical blog.

Trips

I enjoy traveling and 2018 had plenty of trips. I had never been to Europe before, and this year I happened to go twice! Once for work, to England, and another time for pleasure, to Greece.

In England I explored mostly around London including the cities of Bath and Dover.


Top: Tower Bridge; Iconic double-decker bus; Rosetta Stone at the British Museum. Bottom: Roman Baths; Dover Cliffs; Windsor Castle.

The trip to Greece included Athens, Santorini and a train ride to Kalambaka, to see the Meteora monasteries.


Top: Athens seen from the Acropolis, the Parthenon, and Santorini. Bottom: Temple of Zeus in Athens, a monastery on top of a mountain and the Akrotiri museum.

There were also trips around the US, including Albuquerque in New Mexico, New Orleans in Louisiana and Los Angeles in California.


Top: Taos Pueblo near Santa Fe NM, Petroglyphs in Albuquerque NM, Venice Canals in Los Angeles. Bottom: Getty Museum in Los Angeles; Jackson Square in Louisiana; French Quarter in Louisiana.

There was also a trip to Montana, to the Glacier National Park. I really like National Parks and I’m glad to have visited this one, which is very beautiful.


Glacier National Park: Iceberg Glacier, Mountain Goat and Bearhat Mountain

 

Books

This year I read a lot of non-fiction, especially science-related. My favorites science books were:

  • Blind Watchmaker by Richard Dawkins. He delves into Darwin's theory of evolution to conclude it's the most probable explanation for the existence of life on Earth.
  • Genome: The Autobiography of a Species in 23 Chapters by Matt Ridley is a highly engaging and elucidating tour of our genome. Each chapter is dedicated to one chromosome and he provides an example of trait or disease related to it.
  • The Ghost Map: The Story of London’s Most Terrifying Epidemic – and How It Changed Science, Cities, and the Modern World by Steven Johnson. It describes some fascinating detective work from a doctor during a time we knew a lot less about diseases.

In the realms of humanities,

  • Enlightenment Now: The Case for Reason, Science, Humanism, and Progress by Steven Pinker is a thorough presentation of facts and data backing the claim that, despite localized setbacks and short-term regressions, the world has been becoming more progressive. The idea that stuck with me the most is that each new generation tends to be more progressive than their parents, which shines an optimistic light on a more humane future.
  • Why Nations Fail: The Origins of Power, Prosperity, and Poverty makes the claim that the success of a nation has nothing to do with geography, race or culture. There is a lot of world history in this book and I learned a bunch about different countries.

I also enjoyed some biographies,

  • I Am Malala – inspiring story from a Pakistani girl who went on to win the Nobel Peace prize in 2014. Great overview of the history of Pakistan, and the life of a civilian under the regime of the Taliban.
  • Born a Crime – The comedian Trevor Noah had a pretty happening life. The book covers the recent history of South Africa and especially the Apartheid. He provides an interesting perspective on growing up on the later part of that regime and for being the son of a black mother and a white father.

and reading stuff related to trips,

  • For Greece, I chose The King Must Die by Mary Renault. It is a fiction set in the mythical kingdom of Minos in Crete. I really like the fact it alludes to Greek myths but the story itself does not rely on supernatural elements.
  • For Montana, I picked A River Runs Through It by Norman Maclean. It’s a short story set in rural Montana and a constant theme is family relations, fishing and the big questions of life.
  • A Modern History of Japan by Andrew Gordon. I was in Japan in 2017, not in 2018, but I only managed to finish the book this past year. I learned a bunch about the recent Japanese history, but not in enough detail to change how I thought about the experiences from my trip.

Movies

I haven’t watched many movies but really enjoyed Coco and Crazy Rich Asians.