# Von Neumann Architecture

John von Neumann was a Hungarian-American mathematician, physicist, computer scientist and polymath, often regarded as the greatest mathematician of his time. He has contributed to a wide range of fields including quantum mechanics, geometry, topology, game theory, cellular automata, linear programming and computer architecture.

In this post we’ll discuss his contribution to the architecture of modern computers, known as von Neumann architecture (aka Princeton architecture).

### Historical Background

Von Neumann was working at the Manhattan project, which required a lot of computation (in particular to solve differential equations). He got involved on the design of the EDVAC computer together with J. Presper Eckert and John Mauchly and together they wrote a document titled First Draft of a Report on the EDVAC [1]. For an unfortunate reason the report circulated only with von Neumann’s name on it, and the architecture based on the report has only von Neumann’s name [2].

Furthermore, around the same time Alan Turing, who proposed the concept of stored-programs in the form of theoretical Universal Turing Machines (in the paper On Computable Numbers, with an Application to the Entscheidungsproblem), also wrote a paper Proposed Electronic Calculator, discussing the practical aspects of constructing such machines.

These independent approaches led to a debate on whether stored-program machines should be actually referred to von Neumann machines.

### Overview

von Neumann architecture diagram (source: Wikipedia)

The architecture consists of 5 specific parts [1]:

• (i) Central Arithmetic part (CA): an arithmetic logic unit (circuit capable of performing elementary arithmetic and bitwise operations) and a set of registers (small fast-access memory).
• (ii) Central Control (CC): a general purpose unit to carry out the execution of the instructions, to be stored elsewhere.
• (iii) Memory (M):  to store data during the program’s execution and also to store the program’s instructions.

The draft also specifies that there must be a way to connect between these 3 parts. It’s interesting the analogy it makes to the human brain:

The three specific parts CA, CC (together C) and M correspond to the associative neurons in the human nervous system. It remains to discuss the equivalents of the sensory or afferent and the motor or efferent neurons.

The external world is represented by the external medium, called R (which is not considered a part of the machine).

• (iv) Input mechanism (I): a way to transfer information from R to C (CA + CC) and M.
• (v) Output mechanism (O): a way to transfer information from C and M to R.

The authors in [1] also pose an interesting question on whether information should be stored in M or R. Supposedly R representing some sort of external memory. It does resemble a more modern debate on volatile (RAM) or persistent memory (disk).

### Modifications

One bottleneck of the original von Neumann machines is that both data and instruction go through the same bus. This offers a potential limit on speed because data and instructions cannot be read in parallel.

The Harvard architecture doesn’t have this issue by separating the memory (or at least having separate channels of communication with the central unit) and was implemented in the Harvard Mark I computer [3].

Harvard architecture. (source: Wikipedia)

However, there might be advantages of treating data and instructions as data since this allows for concepts such as just-in-time compilation where instructions might be written to memory during runtime and read as data. Modern computers (ARM, x86) use the so-called Modified Harvard architecture [4] which overcomes the bottleneck from von Neumann architecture by having a dedicated memory with a copy of the program (in the form CPU cache) but instructions can still be read as data when needed.

### Limitations of classical architectures

We’ll now focus on the limitations of current implementations of the Modified Harvard architecture. It’s really hard to make any concrete estimates on the early architecture proposals because they’re very high-level and the actual implementation might vary widely.

### Processing Power

In 1965 Gordon Moore, founder of Fairchild Semiconductor, wrote a paper predicting that the number of transistors in the CPU would double every year for the next decade. The industry was eager to follow this prophecy and the trend followed for several decades (it was later adjusted to every 2 years), but it slowed down in the early 2010s. As of today, CPUs have in the order of 10’s billions transistors.

Moore’s Law: transistor count (log axis) over year. (source: Wikpedia)

It’s impractical to think that we’ll be able to put more transistors in a chip forever since there are physical constraints. The question is what kind of constraints are we going to hit?

To pack more transistors in a chip we have to increase the size of the chip.

Increase the size of the chip. this incurs in more power consumption and heat dissipation and information has to travel longer distances, making computation potentially slower. Furthermore, large chips might be infeasible for small devices like smartphones.

Reduce the size of the transistor. For the current transistor design the size has a hard lower bound of the Silicon atom, which is about 0.2nm [6], but before we get there, we have to figure out how to manufacture them with such precision in a cost effective way. As of today, 14nm seems to be the smallest size that can be viably produced.

One question we need to ask ourselves is how having a large number of transistor in a chip translates into computing power? It’s a convenient proxy for CPU performance because it’s easy to measure. However, what can we achieve with more transistors? It allows higher parallelism via multiple cores and transistors can be used to build CPU caches, which improves the speed of common operations.

Other ways to potentially increase processing power is to keep the number of transistors constant but reduce the chip size. This decreases the distance needed for electrons to travel and by dissipating less heat, it’s possible to increase the clock frequency.

Besides reducing the size of the transistor, other strategies are being explored to reduce the chip’s area: instead of the classic 2D square layout, chip manufacturers are exploring stacking approaches to reduce the overall size of the chip.

### Memory Bandwidth

The speed improvements of RAM memories haven’t followed the ones from the CPU. The widening gap can become so large that memory speed will become a bottleneck for the CPU speed (known as the Memory Wall [8]).

To work around this limitation CPU cache is currently being used. Alternative solutions include adding an in-chip memory to reduce latency in the transportation of data.

### Conclusion

I didn’t have a good idea of what to write, but I was interested in understanding how close we are to practical limits of current computer architectures, so the idea was to go back to the early inception of computer architectures and learn some about it.

We’ve made rapid progress in the early days and had steady progress for a long time, so it’s reasonable to be optimistic, but progress has been slowing down, at least in the general purpose single-node computation. We’ve seen specialized hardware take place with GPUs and TPUs, and also the increase of parallel, concurrent and distributed computing.

Quantum computer still seems a dream far away. I wonder if there’s any value in rethinking the classical architecture model from scratch to see if we can escape from local minimum?

### References

[1] Introduction to “The First Draft Report on the EDVAC”
[2] Wikipedia – Von Neumann architecture
[3] Wikipedia – Harvard architecture
[4] Wikipedia – Modified Harvard architecture
[5] Wikipedia – Transistor count
[6] Is 14nm the end of the road for silicon chips?
[7] Intel demos first Lakefield chip design using its 3D stacking architecture
[8] Wikipedia -Random-access memory: memory wall

# Constructing Trees from a Distance Matrix

Richard Dawkins is an evolutionary biologist and author of many science books. In The Blind Watchmaker he explains how complex systems can exist without the need of an intelligent design.

Chapter 10 of that book delves into the tree of life. He argues that the tree of life is not arbitrary taxonomy like the classification of animals into kingdoms or families, but it is more like a family tree, where the branching of the tree uniquely describes the true ancestry relationship between the nodes.

Even though we made great strides in genetics and mapped the DNA from several different species, determining the structure of the tree is very difficult. First, we need to define a suitable metric that would encode the ancestry proximity of two species. In other words, if species A evolved into B and C, we need a metric that would lead us to link A-B and A-C but not B-C. Another problem is that internal nodes can be missing (e.g. ancestor species went extinct without fossils).

David Hill’s tree of life based on sequenced genomes. Source: Wikipedia

In this post we’ll deal with a much simpler version of this problem, in which we have the metric well defined, we know the distance between every pair of nodes (perfect information), and all our nodes are leaves, so we have the freedom to decide the internal nodes of the tree.

This simplified problem can be formalized as follows:

Constructing a tree for its distance matrix problem. Suppose we are given a n x n distance matrix D. Construct a tree with n leaves such that the distance between every pair of leaves can be represented by D.

To reduce the amount of possible solutions, we will assume a canonical representation of a tree. A canonical tree doesn’t have any nodes with degree 2. We can always reduce a tree with nodes with degree 2 into a canonical one. For example:

Nodes with degree 2 can be removed and the edges combined.

### Terminology

Let’s introduce the terminology necessary to define the algorithm for solving our problem. A distance matrix D is a square matrix where d_ij represents the distance between elements i and j. This matrix is symmetric (d_ij = d_ji), all off-diagonal entries are positive, the diagonal entries are 0, and a triplet (i, j, k) satisfy the triangle inequality, that is,

d_ik <= d_ij + d_jk

A distance matrix is additive if there is a solution to the problem above.

We say two leaves are neighbors if they share a common parent. An edge connecting a leaf to its parent is called limb (edges connecting internal nodes are not limbs).

### Deciding whether a matrix is additive

We can decide whether a matrix is additive via the 4-point theorem:

Four-point Theorem. Let D be a distance matrix. If, for every possible set of 4 indexes (i, j, k, l), the following inequality holds (for some permutation):

(1) d_ij + d_kl <= d_ik + d_jl = d_il + d_jk

then D is additive.

Sketch of proof. We can derive the general idea from the example tree below:

We can see by inspection that (1) is true by inspecting the edges on the path between each pair of leaves. This will be our base case for induction.

Now, we’ll show that if we’re given a distance matrix satisfying (1), we are able to reconstruct a valid tree from it. We have that d_ik = a + e + c, d_jl = b + e + d, d_ij = a + b and d_kl = c + d. If we add the first two terms and subtract the last two, we have d_ik + d_jl - d_ij + d_kl = 2e, so we have

e = (d_ik + d_jl - d_ij + d_kl) / 2

We know from (1) that d_ik + d_jl >= d_kl + d_ij > d_kl, so e is positive.

If we add d_ik and d_ij and subtract d_jl, we get d_ik + d_ij - d_jk = 2a, so

a = (d_ik + d_ij - d_jk) / 2

To show that a is positive, we need to remember that a distance matrix satisfy the triangle inequality, that is, for any three nodes, x, y, z, d_xy + d_yz >= d_xz. In our case, this means d_ij + d_jk >= d_ik, hence d_ij >= d_ik - d_jk and a is positive. We can use analogous ideas to derive the values for b, c and d.

To show this for the more general case, if we can show that for every possible set of 4 leaves (i, j, k, l) this property is held, then we can show there’s a permutation of these four leaves such that the tree from the induced paths between each pair of leaves looks like the specific example we showed above.

For at least one one of these quadruplets, i and j will be neighbors in the reconstructed tree. With the computed values of a, b, c, d, e, we are able to merge i and j into its parent and generate the distance matrix for n-1 leaves, which we can keep doing until n = 4. We still need to prove that this modified n-1 x n-1 satisfies the 4-point theorem if and only if the n x n does.

### Limb cutting approach

We’ll see next an algorithm for constructing a tree from an additive matrix.

The general idea is that even though we don’t know the structure of the tree that will “yield” our additive matrix, we are able to determine the length of the limb of any leaf.  Knowing that, we can remove the corresponding leaf (and limb) from our unknown tree by removing the corresponding row and column in the distance matrix. We can then solve for the n-1 x n-1 case. Once we have the solution (a tree) for the smaller problem we can “attach” the leaf into the tree.

To compute the limb length and where to attach it, we can rely on the following theorem.

Limb Length Theorem: Given an additive matrix D and a leaf j, limbLength(j) is equal to the minimum of

(2) (d_ij + d_jk - d_ik)/2

over all pairs of leaves i and k.

The idea behind the theorem is that if we remove parent(j) from the unknown tree, it will divide it into at least 3 subtrees (one being leaf j on its own). This means that there exists leaves i and k that are in different subtrees. This tells us that the path from i to k has to go through parent(j) and also that the path from i to j and from j to k are disjoint except for j‘s limb, so we can conclude that:

d_ik = d_ij + d_jk - 2*limbLength(j)

which yields (2) for limbLength(j). We can show now that for i and k on the same subtree d_ik <= d_ij + d_jk - 2*limbLength(j), and hence

limbLength(j) <= (d_ij + d_jk - d_ik)/2

This means that finding the minimum of (2) will satisfy these constraints.

Attaching leaf j back in. From the argument above, there are at least one pair of leaves (i, k) that yields the minimum limbLength(j) that belongs to different subtrees when parent(j) is removed. This means that parent(j) lies in the path between i and k. We need to plug in j at some point on this path such that when computing the distance from j to i and from j to k, it will yield d_ij and d_jk respectively. This might fall in the middle of an edge, in which case we need to create a new node. Note that as long as the edges all have positive values, there’s only one location within the path from i to k that we can attach j.

Note: There’s a missing detail in the induction argument here. How can we guarantee that no matter what tree is returned from the inductive step, it is such that attaching j will yield consistent distances from j to all other leaves besides i and k?

This constructive proof gives us an algorithm to find a tree for an additive matrix.

Runtime complexity. Finding limbLength(j) takes O(n^2) time since we need to inspect every pair of entries in D. We can generate an n-1 x n-1 matrix in O(n^2) and find the attachment point in O(n). Since each recursive step is proportional to the size of the matrix and we have n such steps, the total runtime complexity is O(n^3).

Detecting non-additive matrices. If we find a non-positive limbLength(j), this condition is sufficient for a matrix to be considered non-additive, since if we have a corresponding tree we know it has to represent the length of j’s limb. However, is this necessary? It could be that we find a positive value for limbLength(j) but when trying to attach j back in the distances won’t match.

The answer to this question goes back to the missing detail on the induction step and I don’t know how to answer.

### The Neighbor-Joining Algorithm

Naruya Saitou and Masatoshi Nei developed an algorithm, called Neighbor Joining, that also constructs a tree from an additive matrix, but has the additional property that for non-additive ones it serves as heuristic.

The idea behind is simple: It transforms the distance matrix D into another n x n matrix, D*, such that the minimum non-diagonal entry, say d*_ij, in that matrix corresponds to neighboring vertices (i ,j) in the tree, which is generally not true for a distance matrix.

The proof that D* has this property is non-trivial and will not provide here. Chapter 7 of [1] has more details and the proof.

Given this property, we can find i and j such that d*_ij is minimal and compute the limbs distances limbLength(i) and limbLength(j), replace them with a single leaf m, and solve the problem recursively. With the tree returned by the recursive step we can then attach i and j into m, which will become their parents.

### Conclusion

In this post we saw how to construct a tree from the distance between its leaves. The algorithms are relatively simple, but proving that they work is not. I got the general idea of the proofs but didn’t get with 100% of detail.

The idea of reconstructing the genealogical tree of all the species is fascinating and is a very interesting application of graph theory.

### References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P. – Chapter 10

[2] The Blink Watchmaker – Richard Dawkins

# Consistent Hashing

Daniel Lewin was an Israeli-American mathematician and entrepreneur. He was aboard the American Airlines Flight 11, which was hijacked by al-Qaeda during the September 11 attacks.

Tom Leighton is a professor (on leave) of Applied Mathematics at CSAIL @ MIT and an expert on algorithms for network applications.

Together, Lewin and Leighton founded the company Akamai, which was a pioneer in the business of content delivery networks (CDNs) and is currently one of the top players in the segment. One of the key technologies employed by the company was the use of consistent hashing, which we’ll present in this post.

### Motivation

One of the main purposes of the CDN is to be a cache for static data. Due to large amounts of data, we cannot possibly store the cache in a single machine. Instead we’ll have many servers each of which will be responsible for storing a portion of the data.

We can see this as a distributed key-value store, and we have two main operations: read and write. For the write part, we provide the data to be written and an associated key (address). For the read part, we provide the key and the system either returns the stored data or decides it doesn’t exist.

In scenarios where we cannot make any assumptions over the pattern of data and keys, we can try to distribute the entries uniformly over the set of servers. One simple way to do this is to hash the keys and get the remainder of the division by N (mod N), where N corresponds to the number of servers. Then we assign the entry (key, value) to the corresponding server.

The problem arises when the set of servers changes very frequently. This can happen in practice, for example, if servers fail and need to be put offline, or we might need to reintroduce servers after reboots or even add new servers for scaling.

Changing the value of N would cause almost complete redistribution of the keys to different servers which is very inefficient. We need to devise a way to hash the keys in a way that adding or removing servers will only require few keys from changing servers.

### Consistent Hashing

The key idea of the consistent hashing algorithm is to include the key for the server in the hash table. A possible key for the server could be its IP address.

Say that our hash function h() generates a 32-bit integer. Then, to determine to which server we will send a key k, we find the server s whose hash h(s) is the smallest that is larger than h(k). To make the process simpler, we assume the table is circular, which means that if we cannot find a server with hash larger than h(k), we wrap around and start looking from the beginning of the array.

Big blue circles are servers, orange circles are keys. Right: If we remove server S3, only entries corresponding to keys K5 and K4 need to be moved / re-assigned.

If we assume that the hash distributes the keys uniformly, including the server keys, we’ll still get a uniform distribution of keys to each server.

The advantage comes to when adding and removing servers to the list. When adding a new server sx to the system, its hash will be in between 2 server hashes, say h(s1) and h(s2) in the circle. Only the keys from h(s1) to h(sx), which belonged to s2, will change servers, to sx. Conversely, when removing a server sx, only the keys assigned to it will need to go to a different server, in this case the server that immediately follows sx.

How can we find the server associated to a given key? The naive way is to scan linearly the hashes until we find a server hash. A more efficient way is to keep the server hashes in a binary balanced search tree, so we can find the leaf with the smallest value larger that h(x) in O(log n), while adding and removing servers to the tree is also a O(log n) operation.

### Implementation in Rust

We will provide an implementation of the ideas above in Rust as an exercise. We define the interface of our structure as

Note that we’ll store the list of servers (containers) and keys (entries) in separate structures. We can store the entries in a simple hash table since we just need efficient insertion, deletion and look up. For the containers we need insertion, deletion but also finding the smallest element that is larger than a given value, which we’ll call successor. As we discussed above, we can use a binary balanced search tree which allow all these operations in O(log n), for example a Red-Black tree. I found this Rust implementation of the Red-Black tree [1].

Finally, we also include the hash function as part of the structure in case we want customize the implementation (handy for testing), but we can provide a default implementation.

To “construct” a new structure, we define a method new() in the implementation section, and use farmhash as the default implementation for the hash function [2].

The insertion and removal are already provided by the data structures, and are trivial to extend to ours. The interesting method is determining the server corresponding to a given key, namely get_container_id_for_entry().

In there we need to traverse the Red-Black tree to find the successor of our value v. The API of the Red-Black tree doesn’t have such method, only one to search for the exact key. However due to the nature of binary search trees, we can guarantee that the smallest element greater than the searched value v will be visited while searching for v.

Thus, we can modify the search algorithm to include a visitor, that is, a callback that is called whenever a node is visited during the search. In the code below we start with a reference to the root, temp, and in a loop we keep traversing the tree depending on comparison between the key and the value at the current node.

Let’s take a detour to study the Rust code a bit. First, we see the unsafe block [3]. It can be used to de-reference a raw pointer. A raw pointer is similar to a C pointer, i.e. it points to a specific memory address. When we de-reference the pointer, we have access to the value stored in that memory address. For example:

The reason we need the unsafe block in our implementation is that self.root is a raw pointer to RBTreeNode, as we can see in line 1 and 4 below:

The other part worth mentioning is the type of the visitor function. It’s defined as

It relies on several concepts from Rust, including Traits, Closures, and Trait Bounds [4, 5]. The syntax indicates that the type of visitor must be FnMut(&K), which in turns mean a closure that has a single parameter of type &K (K is the type of the key of the RB tree). There are three traits a closure can implement: Fn, FnMut and FnOnce. FnMut allows closures that can capture and mutate variables in their environment (see Capturing the Environment with Closures). We need this because our visitor will update a variable defined outside of the closure as we’ll see next.

We are now done with our detour into the Rust features realm, so we can analyze the closure we pass as visitor. It’s a simple idea: whenever we visit a node, we check if it’s greater than our searched value and if it’s smaller than the one we found so far. It’s worth noticing we define closest_key outside of the closure but mutate it inside it:

We also need to handle a corner case which is that if the hash of the value is larger than all of those of the containers, in which case we wrap around our virtual circular table and return the container with smallest hash:

The full implementation is on Github and it also contains a set of basic unit tests.

### Conclusion

The idea of a consistent hash is very clever. It relies on the fact that binary search trees can be used to search not only exact values (those stored in the nodes) but also the closest value to a given query.

In a sense, this use of binary trees is analogous to a common use of quad-trees, which is to subdivide the 2d space into regions. In our case we’re subdividing the 1d line into segments, or more precisely, we’re subdividing  a 1d circumference into segments, since our line wraps around.

I struggled quite a bit with the Rust strict typing, especially around passing lambda functions as arguments and also setting up the testing. I found the mocking capability from the Rust toolchain lacking, and decided to work with dependency injection to mock the hash function and easier to test. I did learn a ton, though!

### References

[1] GitHub: /tickbh/rbtree-rs
[2] GitHub: seiflotfy/rust-farmhash
[3] The Rust Programming Language book – Ch19: Unsafe Rust
[4] The Rust Programming Language book – Ch13: Closures: Anonymous Functions that Can Capture Their Environment
[5] The Rust Programming Language book – Ch13: Traits: Defining Shared Behavior

# De Bruijn Graphs and Sequences

Nicolaas Govert de Bruijn was a Dutch mathematician, born in the Hague and taught University of Amsterdam and Technical University Eindhoven.

Irving John Good was a British mathematician who worked with Alan Turing, born to a Polish Jewish family in London. De Bruijn and Good independently developed a class of graphs known as de Bruijn graphs, which we’ll explore in this post.

## Definition

A de Bruijn graph [1] is a directed graph defined over a dimension n and a set S of m symbols. The set of vertices in this graph corresponds to the m^n possible sequences of symbols with length n (symbols can be repeated).

There’s a directed edge from vertex u to v if the sequence from v can be obtained from u by removing u’s first element and then appending a symbol at the end. For example, if S = {A, B, C, D}, n = 3 and u = ABC, then there’s an edge from ABC to BC*, that is, BCA, BCB, BCC and BCD.

## Properties

We can derive some basic properties for de Bruijn graphs.

1) Every vertex has exactly m incoming and m outgoing edges.

We saw from the example above that ABC had edges to any vertex BC*, where * is any of the m symbols in S. Conversely, any sequence in the form *AB can be transformed into ABC, by dropping the first symbol and appending ‘C’.

2) Every de Bruijn graph is Eulerian.

In our last post we discussed about Eulerian graphs and learned that a necessary and sufficient condition for a directed graph to have an Eulerian cycle is that all the vertices in the graph have the same in-degree and out-degree and that it’s strongly connected. The first condition is clearly satisfied given the Property 1) above.

To see that a de Bruijn graph is strongly connected, we just need to note that it’s possible to convert any sequence into another by removing the first character and replacing the last with the appropriate one in at most n steps. For example, given the string ABC, we can convert it to BDD by doing ABC -> BCB -> CBD -> BDD. Since each such step corresponds to traversing an edge in the de Bruijn graph, we can see it’s possible to go from any vertex to another, making the graph strongly connected.

3) A de Bruijn graph over the set of symbols S and dimension n is the line graph of the de Bruijn graph over set S and dimension n – 1

A line graph of a given graph G has vertices corresponding to edges in G, and there are edges between two vertices if the corresponding edges in G share a vertex. More formally, let G = (V, E) be an undirected graph. The line graph of G, denoted by L(G) has a set of vertex V’ corresponding to E.  Let u’, v’ be two vertices from V’, corresponding to edges e1 and e2 in E. There’s an edge between u’ and v’ if e1 and e2 have one vertex in common.

It’s possible to generalize this to directed graphs by changing the definition of edges slightly: let u’, v’ be two vertices from V’, corresponding to the directed edges e1 = (a, b) and e2 = (c, d) in E. Then there’s a directed edge from u’ to v’ if and only if b = c.

We can gain an intuition on Property 3 by looking at an example with set S = {0, 1} and constructing a de Bruijn graph with n = 2 from one with n = 1. In Figure 1, the vertices from n = 2 are the labeled edges of n = 1. The edges in n = 2 correspond to the directed paths of length 2 in n = 1. We highlighted in red one of such paths. In n = 1, the path is given by (0, 1) and (1, 1), which became (01, 11) in n = 2.

Figure 1: Constructing a De Bruijn graph over  symbols{0, 1} and dimension n = 2 from one with dimension n = 1

4) Every de Bruijn graph is Hamiltonian

This follows from Properties 2 and 3. We claim that an Eulerian cycle in a De Bruijn graph in dimension n is a Hamiltonian path in dimension n + 1. That’s because we visit every edge exactly once and each edge corresponds to a vertex in the graph in dimension n + 1. Given two consecutive edges in the Eulerian cycle in dimension n, (u, v) and (v, w), from Property 3 we know that there’s an edge from the corresponding vertex (u, v)’ to vertex (v, w)’ in dimension n + 1.

## De Bruijn Sequence

The de Bruijn sequence of dimension n on a set of symbols S, denoted B(S, n), is a cyclic sequence in which every possible sequences of length n appears as substring. The length of such sequence is |S|^n.

Since |S|^n is also the number of distinct sequences of length n, we can conclude that this sequence is the shortest possible. To see why, let B be a de Bruijn sequence. We can assign an index p to each sequence s of length n based on where it appears in B such that the substring B[p, p + n – 1] represents s. Since each of the |S|^n sequences are distinct, they cannot have the same index p. Hence, there must be at least |S|^n indexes, and thus B must be at least that long.

It’s possible to construct a de Bruijn sequence B(S, n) from the Hamiltonian path of a de Bruijn graph over S and dimension n. Two adjacent nodes in the Hamiltonian path share n-1 symbols, so if we start with a vertex v, each new vertex in the path only adds one symbol. It would have a total of n + (|S|^n – 1), but since the last n-1 symbols of the sequence overlap with the beginning when we wrap in a cycle, the cyclic sequence has length |S|^n.

Note that we can construct an Hamiltonian cycle for a de Bruijn graph in polynomial time because it’s equivalent to the Eulerian path in one dimension below. Hence we have a polynomial time algorithm to construct the de Bruijn sequence.

## Applications

#### Cracking Locks

A de Bruijn sequence can be used to brute-force a lock without an enter key, that is, one that opens whenever the last n digits tried are correct. A naive brute force would need to try all |S|^n typing n digits every time, for a total of |S|^n. Using a de Bruijn sequence we would make use of the overlap between trials, and only need to type |S|^n digits in total.

#### Finding the Least Significant Bit

The other interesting application mentioned in [2] is to determine the index of the least significant bit in an unsigned int (32-bits). The code provided is given by:

Let’s understand what the code above is doing. For now, let’s assume v > 0 and we’ll handle v = 0 as a special case later.

In the code above, (v & -v) has the effect of “isolating” the least significant bit. Since v is unsigned, -v is its two’s complement, that is, we complement the digits of v (~v) and add one. Let p be the position of the least significant digit in v. The bits in positions lower than p will be 1 in ~v, and in position p it’s a 0. When incremented by 1, they’ll turn into 1 in position p and 0 in the lower positions. In the positions higher than p, v and -v will be have complementary bits. When doing a bitwise AND, the only position where both operands have 1 is p, hence it will be the number (1 << p) (or 2^p).

Then we multiply the result above by 0x077CB531U which is the de Bruijn sequence B({0, 1}, 5) in hexadecimal. In binary this is 00000111011111001011010100110001, which is a 32-bit number.  Because v & -v is a power of 2 (2^p), multiplying a number by it is the same as bit-shifting it to the left p positions. Then we shift it to the right by 27 positions, which has the effect of capturing the 5 most significant bits from the resulting multiplication. If we treat the number as a string of characters (note that most significant bits are the first characters), the left shift followed by the right shift is equivalent to selecting a “substring” from position p to p+5.

For example, if p = 13, a left shift on 00000111011111001011010100110001 would result in 10010110101001100010000000000000. Then a right shift of 27, would pick the 5 leftmost bits, 10010. If we treat 00000111011111001011010100110001 as a string, 10010 shows up as a substring 00000111011111001011010100110001 in positions [13, 17].

Since this is a de Bruijn sequence for n = 5, every substring of length 5 corresponds to a unique 5-bit number and conversely every 5-bit number is present in this sequence. Now we just need to keep a map from the 5-bit number we obtained via the bit manipulation to the actual number we wanted, which we store in MultiplyDeBruijnBitPosition. Since 10010 is 18, we’ll have an entry MultiplyDeBruijnBitPosition[18] = 13.

Finally, for the special case where v = 0, we have that v & -v is 0 and the algorithm will return 0.

#### Assembling DNA Fragments

In [3] Compeau and Pevzner proposed a method to assemble fragments of DNA into its original form. The problem can be modeled as the k-universal circular string problem.

Definition: Consider a list of sequences s_1, s_2, …, s_n, each of which having the same size k, having the property that s_i’ s suffix and s_i+1 ‘ s prefix overlap in k-1 positions. That is, the last k-1 characters in s_i are the same as the first k-1 characters in s_i+1. We are given the sequences in no particular order. The objective is to find a composed string S which is the result of the overlap of s_1, s_2, …, s_n in order.

This problem can be modeled as a de Bruijn graph where each sequence is associated with a vertex. If sequence s_i’s suffix and s_j’s prefix overlap in k-1 positions, we add a directed edge from vertex s_i to s_j. We then find an Hamiltonian path in the de Bruijn graph and the order in which the vertices are visited will give us the desired string.

## Variants: The Super-permutation

One variant to the de Bruijn sequence problem is to, instead of finding a universal sequence containing all possible sequences of length n, find one containing all the permutations of the symbols in S. Instead of the |S|^ n sequences as input, we’ll have |S|! sequences. This is know as the Super-permutation problem.

For example, for S = {1, 2}, it want to find a sequence containing: 12 and 21. The sequence 121 is a possible solution. For S = {1, 2, 3}, we have now 123, 132, 213, 231, 312 and 321. The shortest 123121321. John Carlos Baez tweets about this problem in [4]. Finding the shortest sequence that includes all permutations is an open problem!

We know optimal solution for n up to 5. The best known lower bound for this problem is n! + (n−1)! + (n−2)! + n − 3 while the upper bound is n! + (n−1)! + (n−2)! + (n−3)! + n − 3 [5].

## Conclusion

In this post I was mainly interested in learning more about de Bruijn graphs after reading about them in Bioinformatics Algorithms by Compeau and Pevzner [3]. I ended up learning about de Bruijn sequences and realized that the problem was similar to one I read about recently on John’s Twitter. It was a nice coincidence.

## References

[1] Wikipedia: De Bruijn graph
[2] Wikipedia: De Bruijn sequence
[3] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[4] Twitter: John Carlos Baez
[5] Wikipedia: Superpermutation

# Eulerian Circuits

Leonhard Euler was a Swiss mathematician in the 18th century. His paper on a problem known as the Seven Bridges of Königsberg is regarded as the first in the history in Graph Theory.

The history goes that in the city of Königsberg, in Prussia, there were seven bridges connecting different mass of lands along the Pregel river (see Figure 1). The challenge was to find a path through the city that crossed the bridges exactly once. Euler showed that no such solution existed.

Interesting unrelated fact: Today Königsberg called Kaliningrad in Russia, and Kaliningrad is actually separated from Russia geographically, lying between Lithuania and Poland.

Figure 1: Map of Königsberg and the seven bridges. Source: Wikipedia

The solution to the Seven Bridges of Königsberg problem eventually led to a branch of Mathematics known as Graph Theory. In this post we’ll be talking about the theoretical framework that can be used to solve problems like the Seven Bridges of Königsberg, which is known as Eulerian Circuits.

We’ll provide a general definition to the problem, discuss a solution and implementation, and finally present some extensions and variations to the problem.

## Definition

Let G(V, E) be a connected undirected graph, where V is the set of vertices and E the set of directed edges, and where (v, w) denotes an edge between vertices v and w. The Eulerian circuit problem consists in finding a circuit that traverses every edge of this graph exactly once or deciding no such circuit exists.

An Eulerian graph is a graph for which an Eulerian circuit exists.

## Solution

We’ll first focus on the problem of deciding whether a connected graph has an Eulerian circuit. We claim that an Eulerian circuit exists if and only if every vertex in the graph has an even number of edges.

We can see this is a necessary condition. Let v be a node with an odd number of edges. Any circuit traversing all edges will have to traverse v. Moreover, it will have to use one edge to “enter” v and one edge to “leave” v. Since this circuit can traverse each edge no more than one time, it will have to use different edges each time, meaning it needs 2 edges every time it crosses v. If there are an odd number of edges, one edge will be left unvisited.

To show this is sufficient, we can provide an algorithm that always finds an Eulerian circuit in a graph satisfying these conditions. Start from any vertex v and keep traversing edges, deleting them from the graph afterwards. We can’t get stuck on any vertex besides v, because whenever we enter an edge there must be an exit edge since every node has an even number of edges. Thus eventually we’ll come back to v, and this path form a circuit.

This circuit doesn’t necessarily cover all the edges in the graph though, nor it means that are other circuits starting from v in the remaining graph. It must be however, that some node w in the circuit we just found has another circuit starting from it. We can repeat the search for every such node and we’ll always find another sub-circuit (this is a recursive procedure, and we might find sub-sub-circuits). Note that after we remove the edges from a circuit, the resulting graph might be disconnected, but each individual component is still Eulerian.

Once we have all the circuits, we can assemble them into a single circuit by starting the circuit from v. When we encounter a node w that has a sub-circuit, we take a “detour” though that sub-circuit which will lead us back to w, and we can continue on the main circuit.

## Implementation

We’ll use the algorithm first described by Hierholzer to efficiently solve the Eulerian circuit problem, based on the proof sketched in the previous session.

The basic idea is that given a graph and a starting vertex v, we traverse edges until we find a circuit. As we’re traversing the edges, we delete them from the graph.

Once we have the circuit, we traverse it once more to look for any vertices that still have edges, which means these vertices will have sub-circuits. For each of these vertices we merge the sub-circuit into the main one. Assume the main circuit is given by a list  of vertices $(v, p_2, ... , p_k-1, w, p_k+1, ..., p_n-1, v)$ and w is a vertex with a sub-circuit. Let $(w, q_1, ..., q_m-1, w)$ be the sub-circuit starting from w. We can construct a new circuit $(v, p_2, ..., p_k-1, w, q_1, ..., q_m-1, w, p_k+1, ..., p_n-1, v)$.

Let’s look at a specific implementation using JavaScript (with Flow). The core of the algorithm implements the ideas discussed above:

The complete code is on Github.

## Analysis

We’ll now demonstrate that the algorithm described above runs in linear time of the size of the edges (i.e. O(|E|)).

Note that find_circuit() is a recursive function, but we claim that the number of times the while() loop executes across all function calls is bounded by the number of edges. The key is in the function:

graph.getNextEdgeForVertex(vertex);

graph is a convenience abstraction to an adjacency list, where for each vertex we keep a pointer to the last edge visited. Because of this,  getNextEdgeForVertex() will visit each edge of the graph at most once and we never “go back”. Since the graph object is shared across all function calls (global), we can see that the number of calls to getNextEdgeForVertex() is bounded by O(|E|), so is the number of times all while() loops execute.

Now we just need to prove that every other operation in the while loop is O(1). The only non-obvious one is:

graph.deleteEdge(edge);

This is a lazy deletion, meaning that we just set a flag in edge saying it’s deleted and it will later be taken into account by callers like graph.getNextEdgeForVertex() and graph.getDegree(). Hence, this is an O(1) operation.

For getNextEdgeForVertex(), we must skip edges that have been deleted, so we might need to iterate over a few edges before we find an undeleted one (or none if the graph is not Eulerian – in which case we terminate the algorithm). Since we’re still always processing at least one edge in every call to getNextEdgeForVertex() the argument about the total calls being bounded by O(|E|) holds.

In order for getDegree() to be an O(1) operation, we need to keep a non-lazy count of the degree of a vertex, but we can do it in O(1) when deleting an edge.

Finally, let’s analyze the second loop. The number of iterations is proportional to the length of the circuit. Since every possible circuit found (including the ones found recursively) are disjoint, the total number of times we loop over the vertices from circuits (across all function calls) is also bounded by the number of edges.

We already saw getDegree() is O(1) even with lazy deletion. The remaining operation is

path.insertAtVertex(vertex, subPath);

if we store the paths as a linked list of vertices, inserting subPath at a given node can be done in O(1) if we keep a reference from each vertex to its last (any actually) occurrence in the path.

## Directed Graphs

We can extend the definition of Eulerian graphs to directed graphs. Let G(V, A) be a strongly connected graph, where V is the set of vertices and A the set of directed edges, and where (v, w) indicate a directed edge from v to w. The Eulerian circuit problem for a directed graph consists in finding a directed circuit that traverses every edge of this graph exactly once or deciding no such circuit exists.

It’s possible to show that such a circuit exists if and only if the strongly connected directed graph has, for each vertex v, the same in-degree and out-degree. The algorithm is essentially the same.

## Counting Eulerian Circuits in directed graphs

It’s possible to count the number of different Eulerian circuits in a directed graph. According to the BEST theorem (named after de Bruijn, van Aardenne-Ehrenfest, Smith and Tutte) [3], the number of Eulerian circuits in a directed graph can be given by [4]:

$ec(G) = t_w(G) \prod_{v \in V}(deg(v) - 1)!$ (1)

Where deg(v) represents the in-degree (or out-degree) or a vertex v and t_w(G) is the number of arborescences rooted in a vertex w (simply put, an arborescence is analogous to a spanning tree for a directed graph – but we can only include edges that are directed away from the root).

It’s possible to show that t_w(G) is the same for any vertex w if G is Eulerian. We can compute t_w(G) via the Matrix-Tree theorem [2], which says t_w(G) is equal to the determinant of the Laplacian of G without vertex w. Let’s try to understand the idea behind this equation.

The mapping from an arborescence to an Eulerian path can be made by the following. Let r be the root of a possible arborescence of G. Now, let r be the reference starting point for an Eulerian path in G (note this is just for reference, since there’s no starting point in a circuit).

We say that an Eulerian path is associated with a given arborescence if for each vertex v, the last edge passing through v, say (v, v’), belongs to the arborescence. This is more clear with an example. Consider the digraph from Figure 2. Here we’ll consider the arborescences rooted in A.

Figure 2: Directed Graph

This graph has 2 possible arborescences depicted on the left in Figures 3 and 4. In Figure 3, we can see that the edge (B, D) has to be visited before (B, C) because (B, C) is in the arborescence.

Figure 3: One of the arborescences of G and a corresponding Eulerian circuit

Now, in Figure 4, because it’s (B, D) that’s in the arborescence, it has to be visited after we visit (B, C).

Figure 4: Another of the arborescence of G and a corresponding Eulerian circuit

Note that there can be more than one Eulerian path to a given arborescence. If B had more out-edges, we’d have multiple choices, since the arborescence only specifies the last edge to be taken, not the intermediate ones. More specifically, imagine B had k out-edges. Then we could traverse the first k-1 in any combination of orders, which leads to a total of (k – 1)! ways of doing so.

The same applies to all other nodes. Due to properties of Eulerian circuits, the choice of the out-edge at a given node can be seen as independent of the choice at other nodes, so the total possible Eulerian circuits corresponding to any arborescence is given by the product of the degrees from equation (1), namely:

$\prod_{v \in V}(deg(v) - 1)!$ (2)

The key property of categorizing Eulerian circuits into arborescence classes is that they’re disjoint, that is, a Eulerian circuit corresponds to exactly one arborescence. This, in conjunction with the fact that the vertices degrees in Equation (2) are from the original graph, and hence independent of a arborescence, lead us to the two independent factors in equation (1).

## Counting Eulerian Circuits in undirected graphs

Counting Eulerian circuits in undirected graphs is a much harder problem. It belongs to a complexity class known as #P-complete. This means that:

1. It belongs to the #P class, which can informally be seen as the counting version of NP problems. For example: deciding whether a given graph has an Hamiltonian circuit (path that traverses all vertices exactly once) is a problem in the NP class. Counting how many Hamiltonian circuits existing in that graph is the corresponding problem in the #P class.
2. It belongs to the #P-hard class, which means that any problem in #P can be reduced to it via a polynomial-time transformation.

Valiant proved the first condition in [5] while Brightwell and Winkler proved the second in [6] by reducing another #P-complete problem (counting Eulerian orientations) to it.

Note that a problem in the #P class is as hard as the equivalent class in NP, because we can reduce a problem in NP to #P. For example, we can decide whether a graph has an Hamiltonian circuit (NP problem) by counting the number of circuits it has (#P problem). The answer will be “yes” if it the #P version returns a number greater than 0 and “no” otherwise.

Because the problem of counting Eulerian circuits in an undirected graph being in #P, we can conclude that there’s no efficient (polynomial time) algorithm to solve it unless P = NP.

## Conclusion

In this post we covered Eulerian circuits in an informal way and provided an implementation for it in JavaScript. I spend quite some time to setup the JavaScript environment to my taste. I strongly prefer using typed JavaScript (with Flow) and using ES6 syntax. I decided to write it in JavaScript with the potential to create a step-by-step interactive tool to demonstrate how the algorithm works.

I was familiar with the concept of Eulerian circuits, but I didn’t remember the algorithms to solve it, even though I was exposed to one of them in the past. It was a good learning experience to write the code from scratch to really understand what I was doing.

This is the first time I see the #P complexity class. It’s always nice to learn about new theories when digging further on a specific topic.

## References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[2] Matrix-Tree Theorem for Directed Graphs – Margoliash, J.
[3] Circuits and trees in oriented linear graphs – Aardenne-Ehrenfest, van T., Bruijn, de N.G.
[4] Wikipedia – BEST Theorem
[5] The complexity of computing the permanent – L. G. Valiant
[6] Counting Eulerian circuits is #P-complete – Brightwell, G. and Winkler, P.

# Blockchain

Blockchain has become a very popular technology recently due to the spread of Bitcoin. In this post, we’ll focus on the details of blockchain with a focus on Computer Science, studying it as a distributed data structure.

### Motivation

Blockchain is a distributed data structure that can be used to implement a distributed ledger system. A distributed ledger orchestrates important operations (for example financial transactions) without the need of a centralized arbiter (such as a financial institution).

The reason to prefer decentralized systems could be from costs of operations: having a few financial institutions mediate all transactions require a lot of work; To avoid a single point of failure (more reliable), and finally to not have to trust a few companies with our assets.

Additionally, by being decentralized, the expectation is that it becomes less likely to regulate and thus far it has enable global currencies like bitcoin.

### Challenges

The main challenge with a distributed ledger is how to distribute the work and data across nodes in a network and, more importantly, how to make sure we can trust that information.

An initial naive idea could be to store a copy of all accounts balances in every node of the network. The problem of storing only the balances of accounts is that it’s very hard to verify whether a given payment went through or make sure the balances are consistent. To address that, the blockchain also stores the whole history of transactions that led to the current balances (think of version control).

Safety. Even if we have the whole history of the transactions, it’s possible for a bad actor to tamper with this history on their own benefit, so we need a safety mechanism to prevent that.

Consistency. Finally, because we store copies of data that is changing all the time in multiple machines, we’ll invariably hit problems with consistency and sources of truth.

Blockchain is a data structure designed to address these challenges as we shall see next. We’ll use bitcoin as the application when describing examples.

### Data Structure

A blockchain is a chain of entities called blocks that is replicated to several nodes in a network. Each block has a hash which is computed as a function of its contents and the hash of the previous node in the chain.

For the bitcoin case, the content of the block is a set of transactions adding up to 1MB. The hash of the contents is the hash of the combination of the hashes of individual transactions. More specifically, these hashes can be organized in a binary tree (also known as Merkle Tree) where the transactions hashes are the leaves and the hashes of individual inner nodes are the hash of the concatenation of the hashes of the children.

Merkle Tree

Source of truth. There might be several different versions of the chain around the network, either due to inconsistency or bad actors. The assumption is that the chain with most nodes that is agreed upon by the majority of the nodes (over 50%) is the accurate and most up-to-date version of the chain and the one that should be trusted.

User C receiving the chains from A and B. Since B’s chain is longer, C will take it as the source of truth.

### Inserting Blocks

The insertion consists of adding a new transaction to the blockchain. In terms of our bitcoin example, this is essentially user A sending some amount of money to user B.

To start the process, node A broadcasts a message with the transaction, signing it with its private key. The other nodes on the network have node A’s public key, so they can check the authenticity of the transaction.

The transaction stays in a pool of unresolved transactions. At any time, there are many nodes in the network constructing a new block, which contains several transactions. These nodes are also called miners. Not all nodes in the network need to be miners. Note that each block can pick any transaction from the pool so the set of transactions in a block under construction can be different between miners.

Adding a transaction to its block consists of verifying things like its authenticity, the validity (e.g. to make sure there are enough funds to be spend), then inserting it to the Merkle Tree, which allows recomputing the root hash in O(log n) time.

Each block should be around 1MB in size, so when a miner is done collecting enough transactions it can start wrapping up its work. It needs to compute the block hash which is a string such that when concatenated with the Merkle tree root hash and the previous block in the largest chain will generate a hash with k leading 0s. Since the hash function used is cryptographic, it cannot be easily reverse-engineered. The only way to find such string is via brute force. The number k is a parameter that determines the difficulty in finding the hash (the value is controlled by some central authority but rarely changes). The idea of this hash computation is that it’s purposely a CPU-expensive operation, so that it’s too costly for attackers to forge hashes as we’ll see later.

Once it finds the hash, it can broadcast the new block to the network where other miners can start to validate this block. They can check that:

• The transactions in the block are valid
• The block was added to the largest chain known
• The hash generated for the block is correct

The verification step should be much cheaper than the process of generating the hash. Once the block passes the validation, the miner adds it to the largest chain. Now, when there’s a call to get the largest chain, this last block will be included. On the other hand, if any of the validation steps failed, the block is rejected.

Because of this, miners building their blocks can also periodically check for updates in the network for larger chains. If, while they’re building their blocks or computing the hash a new larger chain arrives, it’s better to start over, since it will be rejected later anyway.

### Checking for valid transactions

In the bitcoin example, suppose I’m a vendor and I want to verify that your payment to me went through before I ship you my product.

To check whether a transaction was validated, a node just needs to ask for the current longest blockchain, which is the source of truth, and see if the transaction appears in any of the blocks.

### Double-Spending

Say user A has exactly 1 coin, and that it sends to B as payment. Immediately after, A sends that same coin to another user C. How does C make sure that the coin it’s receiving is not spent before?

In a decentralized ledger, this could lead to problems since different nodes could disagree to where the coin went. In the blockchain, there are two scenarios:

• A to B  got included in a block first which later became part of the longest chain (and hence the source of truth). The transaction from A to C would be rejected from being added to future blocks due to insufficient funds.
• A to B and A to C got picked up to be included in the same block. The miner constructing the block would consider the second transaction it picked up invalid. Even if it was malicious and it included both transactions, it would fail validation when broadcasting it to the network.

The case in which A to C gets picked up first is analogous.

### Changing History

Suppose that user A, after performing a payment to and receiving its product from user B, wants to reuse the transaction and send the money to another user C. In theory it could edit the destination of the transaction from the chain (or even replace the block entirely), but to make this modified chain become the source of truth, it would also need to make it the longest.

As we’ve seen above, adding a block to a chain with a valid hash is extremely expensive. To make it the new source of truth, it would need to add at least one block in addition to the forged one. This means it would need to outpace all the other miners in the network that are also continuously computing new blocks in parallel. The paper [1] claims that unless the attacker controls over 50% of the CPU power in the network, this is extremely unlikely to succeed.

Note that a user cannot modify the transactions it is not the sender of, since a transaction from user A to user B is signed with A’s private key. If user C wanted to redirect A’s money to itself, it would need to know A’s private key, so it could sign the transaction pretending it’s A.

### Incentives

As we mentioned earlier, not all nodes in the network need to be miners. So why would any node volunteer to spend energy in form of CPU cycles to compute hashes? In bitcoin, the incentive is that the miner whose block gets added to the chain receives bitcoins as reward. These bitcoins are collected as fees from the normal transactions.

Bitcoin Mining Farms to compute block hashes and collect rewards

There must incentives to make sure that a miner won’t keep building blocks with one or zero transactions, to skip doing any validations for transactions. The reward should take into account the transaction, and possibly its value and their age (to make sure “unimportant” transactions won’t be ignored forever).

### Conclusion

In researching the material for this post, I ran into a lot of articles and videos covering blockchain without going into much detail. As always, I only realized my understanding had a lot of gaps once I started writing this post.

Overall I know a good deal more about what Blockchain is and what it’s not. I’m curious to see what other applications will make use of this technology and if we can come up with a trustworthy system that doesn’t require wasting a lot of CPU power.

### References

[1] Bitcoin: A Peer-to-Peer Electronic Cash System
[2] Hashcash: A Denial of Service Counter-Measure – A proof-of-work algorithm
[3] Blockchain: how mining works and transactions are processed in seven steps
[4] Bitcoin P2P e-cash paper – The proof-of-work chain is a solution to the Byzantine Generals’ Problem

# Two-factor authentication

In this post we’ll talk about some popular security measures to protect user accounts on the web via two-factor authentication. The term refers to the requirement of two methods of authentication for logging in into a given account. The first method is mostly always a password, and the second is one of the methods we’ll describe in this post.

### Why do we need an additional form of authentication?

In an ideal world, people would have strong (long, not complex) passwords, which would never get stolen and people would never forget them. In the real world, applications have to deal with two scenarios: 1) someone else knows your password or 2) you forgot your password.

#### Scenario 1: They are not who they claim to be

If someone else knows your password, the system needs to somehow know that this person is not you.

They can then employ a secondary method of authentication to verify that you are yourself. In theory they could ask for a secondary password or ask a security question. The problem with these is that they’re exposed to the same set of vulnerability that might have compromised the original password in the first place, for example, the password is too easy to crack or there was a breach of database storing plain text passwords. In addition, since these secondary methods are to be used in very rare occasions, it’s extremely likely you’ll incur in the second problem, i.e. forget your password.

Physical devices. Nowadays, security systems can almost always rely on the fact that even if someone has your password, they do not have your physical belongings (e.g. cellphone). Some websites allow users to setup the requirement to use both a password and a secondary authentication to access the account.

#### Scenario 2: I’m who I claim to be

To address the problem of a user losing a password, some websites offers a recovery mechanism, usually by sending a secure email with a link to re-set the password or, in case of email applications like GMail, allowing the secondary authentication method as an alternative to inputing your password.

Websites such as GMail and Github also have a set of auto-generated “master passwords” that you can print and store in a safe place. After used, these passwords become invalid. This is one of the safest options, but it also requires more effort from the user (printing and making sure they can find the printed document when needed).

Scenario 1 deals with security, and Scenario 2 deals with usability (recovering passwords), and these are usually at odds with which other. Security systems have to find the balance between the two.

We’ll now cover three popular secondary authentication mechanisms: SMS (text messages), third-party app authentication and hardware authentication.

### SMS

In the SMS (Short Message Service) method, the server generates a short code and sends it to the user via a text (SMS) message which is valid for a few minutes. The user can then copy the code from the phone to the computer and send the code to the server which can then authenticate the initial request.

During this period of time, the user account is technically vulnerable to a very weak code (a 6-digit number) which is very easy to crack. However, this period is very narrow, which great limits the ability of a bad agent to take any action.

#### Vulnerabilities

The real danger of the SMS method is a bad agent being able to intercept the SMS message that is supposed to go to the user. According to this Wired article, the telecoms use a network called SS7 (Signaling System No. 7) to transport text messages. This network relies on trust to implement features such as roaming, which enables a person from New York to receive/send text messages when they’re traveling to Berlin. In this case a carrier in Berlin could request the user’s carrier back in New York to receive the text messages so it can deliver to the user.

This system has a vulnerability because a hacked carrier could be used to intercept the text messages by pretending it’s doing so on behalf of a user. The carriers might not do any checks to verify the authenticity of the request. Hence, if an attacker knows your email, phone number and has access to a hacked carrier, they could technically hack into your account.

### App Authentication

Another authentication method is to install a third-party app that can be used to generated the authentication codes. One popular option is the Google Authenticator App, which you can install on your phone (Android or iOS).

It uses the Time-based One-time Password algorithm or TOTP [2, 3]. The general idea is to perform a one-time registration between your phone and the server which consists of having both store a secret.

Whenever the client needs to authenticate itself, it uses the current timestamp and the secret to generate a hash, and from this hash it extracts a simpler code (6 characters) that the user copies and sends to the server. The server performs the same operation and if the generated code matches, it accepts the authentication.

The precision of the timestamp defines on how much time the user has to copy and send the code to the server. For example, the server can define the timestamp granularity to be 30 seconds. This also defines how long the server is vulnerable, since the code is usually short and hence easier to crack via brute force, so it cannot be too long.

### Hardware Authentication

A more recent approach to authentication is using a dedicated piece of hardware. YubiKey is an example of such device, which can be connected to the USB port. One way it can be used is part of the open authentication protocol called Universal 2nd Factor (U2F), developed by Google and Yubico (the company that manufactures YubiKey). We’ll describe this protocol next. In the discussion that follows we’ll refer to the Yubikey device generically as U2F.

The general flow consists of a enrollment phase, where the use registers the U2F in the target webpage. The webpage asks for a confirmation, which the user can do by tapping the U2F, which sends some information to the webpage, which stores it.

The other part is the signing phase. When this webpage needs to verify the user, say during login, it can ask the user to tap the U2F, which will send information that can be validated by the webpage to make sure it’s the same device that was registered in the first step.

#### Implementation details

One of the designs of this system is to be cross compatible and require no extra configuration from the user, like installing drivers. To achieve that, the communication between the U2F and the server is mediated via the browser. That means that the website calls a browser API (via JavaScript) which in turn communicates with the U2F. Henceforth when we refer to the communication between the U2F and the server, we’re implicitly assuming it’s done via the browser.

During the enrollment process, the device generates a pair of public and private keys (public-key cryptography). It sends the public key to the server which stores it together with other information. During the signing phase the server can generate a challenge (string), encrypt with the public key and send it to the U2F, which can decode it. At this point, the user is asked to tap the U2F. Once that it’s done, it sends the challenge back to the server encrypted with its private key. If the server can then decode the message, it can trust the U2F and authenticate the user.

The reason a new public and private key is generated at every enrollment is for privacy reasons (not security). This is to prevent the case of different websites that enable U2F, to share data between them and be able to track the user. For example, if the same public key was used for all enrollments, a site A and B would be able to identify the user via their public key and share this information among themselves. If site A is a online shopping, it could use this information to show targeted ads in site B.

Stateless U2F. The problem of having to generate a pair of public/private keys every time is that now the U2F has to store them somehow. Since another important part of design is for the U2F to be very accessible, the implication is that they also have to be cheap. Hence, the protocol cannot assume the device has embedded storage. The solution is to send the pair for the server to store!

This seems to defeat the whole purpose of using cryptography but this information is sent to the server encrypted, which only the U2F itself can decode. Now, in addition to the server storing the public key, it has to store this extra information which the protocol calls Key Handle [5]. During the signing phase it sends not only the encrypted challenge, but also the Key Handle.

Man-in-the-middle. One potential security hole could be a scam website that looks like the real one and acts as a man-in-the-middle. First, the user will provide the scam site with the username and password. The scam site can then forward these to the real site to trigger the secondary factor request, which will send down the Key Handle and encrypted challenge. The scam site will forward it back to the U2F. Then the U2F would encrypt the challenge, which would be sent to the scam site, which in turn would relay it to the real site, finally allowing the bad actor to login as the user.

To prevent that, the site origin can be stored in the Key Handle as well. Before deciding to send data back, the U2F can check if the origin of the server and match it against the data in the Key Handle. The site origin is hard to tamper with when using an HTTPS connection unless the real site’s certificates are compromised.

Vendor reliability. Another component of the security is the trust in the manufacturer of the device. It could have malicious intent or flawed implementation. To address that concern, the U2F should also contain an extra pair of attestation public-private pair of keys. The attestation is to prove the identity of the manufacturer/vendor. During the enrollment, the public key that is generated is encrypted with the private attestation key. The public attestation key is made available by some trusted organization for the server to consult. If it’s able to decode the generated public key, then it can trust the U2F vendor.

### Conclusion

In this post we covered 3 methods of extra protection to the online identity. We saw that SMS has serious vulnerability while third-party and hardware authentication are much safer, which is no surprise since SMS were not initially designed to serve as a secure channel for authentication. No method is 100% secure but recent authentication mechanisms go to great lengths to reduce the vulnerable surface area to a minimum.

Note how all these methods assume the possession of a physical device separate from the computer where we’re trying to log into. Physical devices are much harder to steal compared to piece of information like passwords.

### References

[1] Information Security – How does Google Authenticator work?
[2] Wikipedia – HMAC-based One-time Password algorithm
[3] Wikipedia – Time-based One-time Password algorithm
[4] Wired – Fixing the cell network flaw that lets hackers drain bank accounts
[5] Google U2F (Gnubby) Documents – Snapshot prior to joining FIDO