# Consistent Hashing

Daniel Lewin was an Israeli-American mathematician and entrepreneur. He was aboard the American Airlines Flight 11, which was hijacked by al-Qaeda during the September 11 attacks.

Tom Leighton is a professor (on leave) of Applied Mathematics at CSAIL @ MIT and an expert on algorithms for network applications.

Together, Lewin and Leighton founded the company Akamai, which was a pioneer in the business of content delivery networks (CDNs) and is currently one of the top players in the segment. One of the key technologies employed by the company was the use of consistent hashing, which we’ll present in this post.

### Motivation

One of the main purposes of the CDN is to be a cache for static data. Due to large amounts of data, we cannot possibly store the cache in a single machine. Instead we’ll have many servers each of which will be responsible for storing a portion of the data.

We can see this as a distributed key-value store, and we have two main operations: read and write. For the write part, we provide the data to be written and an associated key (address). For the read part, we provide the key and the system either returns the stored data or decides it doesn’t exist.

In scenarios where we cannot make any assumptions over the pattern of data and keys, we can try to distribute the entries uniformly over the set of servers. One simple way to do this is to hash the keys and get the remainder of the division by N (mod N), where N corresponds to the number of servers. Then we assign the entry (key, value) to the corresponding server.

The problem arises when the set of servers changes very frequently. This can happen in practice, for example, if servers fail and need to be put offline, or we might need to reintroduce servers after reboots or even add new servers for scaling.

Changing the value of N would cause almost complete redistribution of the keys to different servers which is very inefficient. We need to devise a way to hash the keys in a way that adding or removing servers will only require few keys from changing servers.

### Consistent Hashing

The key idea of the consistent hashing algorithm is to include the key for the server in the hash table. A possible key for the server could be its IP address.

Say that our hash function h() generates a 32-bit integer. Then, to determine to which server we will send a key k, we find the server s whose hash h(s) is the smallest that is larger than h(k). To make the process simpler, we assume the table is circular, which means that if we cannot find a server with hash larger than h(k), we wrap around and start looking from the beginning of the array.

Big blue circles are servers, orange circles are keys. Right: If we remove server S3, only entries corresponding to keys K5 and K4 need to be moved / re-assigned.

If we assume that the hash distributes the keys uniformly, including the server keys, we’ll still get a uniform distribution of keys to each server.

The advantage comes to when adding and removing servers to the list. When adding a new server sx to the system, its hash will be in between 2 server hashes, say h(s1) and h(s2) in the circle. Only the keys from h(s1) to h(sx), which belonged to s2, will change servers, to sx. Conversely, when removing a server sx, only the keys assigned to it will need to go to a different server, in this case the server that immediately follows sx.

How can we find the server associated to a given key? The naive way is to scan linearly the hashes until we find a server hash. A more efficient way is to keep the server hashes in a binary balanced search tree, so we can find the leaf with the smallest value larger that h(x) in O(log n), while adding and removing servers to the tree is also a O(log n) operation.

### Implementation in Rust

We will provide an implementation of the ideas above in Rust as an exercise. We define the interface of our structure as

Note that we’ll store the list of servers (containers) and keys (entries) in separate structures. We can store the entries in a simple hash table since we just need efficient insertion, deletion and look up. For the containers we need insertion, deletion but also finding the smallest element that is larger than a given value, which we’ll call successor. As we discussed above, we can use a binary balanced search tree which allow all these operations in O(log n), for example a Red-Black tree. I found this Rust implementation of the Red-Black tree [1].

Finally, we also include the hash function as part of the structure in case we want customize the implementation (handy for testing), but we can provide a default implementation.

To “construct” a new structure, we define a method new() in the implementation section, and use farmhash as the default implementation for the hash function [2].

The insertion and removal are already provided by the data structures, and are trivial to extend to ours. The interesting method is determining the server corresponding to a given key, namely get_container_id_for_entry().

In there we need to traverse the Red-Black tree to find the successor of our value v. The API of the Red-Black tree doesn’t have such method, only one to search for the exact key. However due to the nature of binary search trees, we can guarantee that the smallest element greater than the searched value v will be visited while searching for v.

Thus, we can modify the search algorithm to include a visitor, that is, a callback that is called whenever a node is visited during the search. In the code below we start with a reference to the root, temp, and in a loop we keep traversing the tree depending on comparison between the key and the value at the current node.

Let’s take a detour to study the Rust code a bit. First, we see the unsafe block [3]. It can be used to de-reference a raw pointer. A raw pointer is similar to a C pointer, i.e. it points to a specific memory address. When we de-reference the pointer, we have access to the value stored in that memory address. For example:

The reason we need the unsafe block in our implementation is that self.root is a raw pointer to RBTreeNode, as we can see in line 1 and 4 below:

The other part worth mentioning is the type of the visitor function. It’s defined as

It relies on several concepts from Rust, including Traits, Closures, and Trait Bounds [4, 5]. The syntax indicates that the type of visitor must be FnMut(&K), which in turns mean a closure that has a single parameter of type &K (K is the type of the key of the RB tree). There are three traits a closure can implement: Fn, FnMut and FnOnce. FnMut allows closures that can capture and mutate variables in their environment (see Capturing the Environment with Closures). We need this because our visitor will update a variable defined outside of the closure as we’ll see next.

We are now done with our detour into the Rust features realm, so we can analyze the closure we pass as visitor. It’s a simple idea: whenever we visit a node, we check if it’s greater than our searched value and if it’s smaller than the one we found so far. It’s worth noticing we define closest_key outside of the closure but mutate it inside it:

We also need to handle a corner case which is that if the hash of the value is larger than all of those of the containers, in which case we wrap around our virtual circular table and return the container with smallest hash:

The full implementation is on Github and it also contains a set of basic unit tests.

### Conclusion

The idea of a consistent hash is very clever. It relies on the fact that binary search trees can be used to search not only exact values (those stored in the nodes) but also the closest value to a given query.

In a sense, this use of binary trees is analogous to a common use of quad-trees, which is to subdivide the 2d space into regions. In our case we’re subdividing the 1d line into segments, or more precisely, we’re subdividing  a 1d circumference into segments, since our line wraps around.

I struggled quite a bit with the Rust strict typing, especially around passing lambda functions as arguments and also setting up the testing. I found the mocking capability from the Rust toolchain lacking, and decided to work with dependency injection to mock the hash function and easier to test. I did learn a ton, though!

### References

[1] GitHub: /tickbh/rbtree-rs
[2] GitHub: seiflotfy/rust-farmhash
[3] The Rust Programming Language book – Ch19: Unsafe Rust
[4] The Rust Programming Language book – Ch13: Closures: Anonymous Functions that Can Capture Their Environment
[5] The Rust Programming Language book – Ch13: Traits: Defining Shared Behavior

# Rust Memory Management

Graydon Hoare is a Software Developer who created the Rust programming language while working at Mozilla Research [1]. He has an interesting presence on the internet, for example responding to this question and on Twitter.

In this post we’ll talk about one of the key features of Rust, which I find the hardest to wrap my head around, which is its memory management.

### Motivation

In most programming languages we either have to manage memory allocation ourselves or rely on a complex garbage collector to which we have limited control and can lead to unpredictable performance bottlenecks.

Rust has a set of constraints around memory allocation that results in a deterministic automatic memory management.

We’ll now delve into more details on what these constraints are, how they enforce the necessary guarantees to enable an efficient memory management by the runtime.

### Ownership Rules

Rust has the following basic rules around ownership:

• Each value in Rust has a variable that’s called its owner
• There can only be one owner at a time
• When the owner goes out of scope, the value will be dropped

They’re very basic but have deep implications in how we think about programs. Let’s analyze some of them.

### Assignment transfers ownership

To conform the second rule, whenever we assign a variable to another, we are transferring the ownership from one variable to another. For example:

In the example above, vec1 transfer ownership of its data to vec2. This means that we can neither read nor write to vec1 anymore. It’s as if it was out of scope (unless it is assigned ownership to some other data later).

By having a single owner we don’t have to worry about keeping track of references to a given object as garbage collectors do to know when we are allowed to free the memory. For example, if we had:

Because vec2 owns the vector allocated initially and it goes out of scope after line 4, the runtime can free the memory safely, since we know vec1 cannot access that data after line 3.

Similarly, when we use a variable as argument to a function, its data is transferred to the parameter. In the example below, vec1‘s data is transferred to vec in mutate_vec(), so we cannot access it in line 9.

One way to return the ownership back to vec1 is for the function to return the argument.

### References & Borrowing

To avoid transferring data to a variable on assignment, we can use references. In the example below, vec2 “borrows” the data from vec1 but there’s no ownership transfer. We have read access to the data via vec2, which we can do by dereferencing vec2 with &vec2.

If we want write access, we need to make vec1 mutable and also obtain a mutable reference to vec1 via the &mut operator, like in the example below:

However, if we try to access the data via vec1 while vec2 has a mutable reference to it, we’ll get an error.

We can take as many (non-mutable) references as we want:

But once we try to obtain a mutable reference, we get an error:

Read and write locks. Rust enforces these constraints to prevent race condition bugs in multi-thread applications. In fact the borrowing mechanism via references is very similar to read locks and write locks.

To recall, if some data has a read lock, we can acquire as many other reads locks as we want but we cannot acquire a write lock. This way we prevent data inconsistency between multiple reads since we prevent data mutations by not allowing writes. Conversely, if some data has a write lock, we cannot acquire neither read or write locks.

We can see that the regular borrowing implements a read lock while a mutable borrowing implements a write lock.

Dangling references. One potential issue with references is that if we return a reference to some variable that went out of scope.

Luckily, the compiler will prevent this case by making a compile error. It’s now always the case that we cannot return a reference. If the reference we’re returning was sent as argument, it’s still a valid one, for example:

However, if we try to pass two parameters, we’ll run into a compile error:

To see why it’s not possible to guarantee this memory-safe, we can consider the following code calling get_largest:

Here we’re sending references for both vec1 and vec2, and returning back either of them. If vec2 happened to be larger, we’d return it to result and try to access the data after vec2 went out of scope.

However, if result is used while both vec1 and vec2 are in scope, it should be theoretically safe to allow calling get_largest:

In fact, it’s possible and we’ll see how next, but we’ll need to introduce a new terminology.

The lifetime of a variable represents the duration in which the variable is valid. In the example below, the lifetime of variable a is from line 2 to line 7. The lifetimes for b is from 4 to 5, for c is line 5 and for d is line 7. Note that a’s lifetime contains all other lifetimes and b’s lifetime contains c’s lifetime.

We can annotate a function using a syntax similar to generics to parametrize the function arguments by their lifetimes. For example, if we want to include the lifetimes of the arguments in get_largest(), we can do:

This is essentially binding the each variable to the lifetime specified by 'a. Note it’s forcing that both variables have the same lifetime and that the return type has the same lifetime as the input parameters.

Now, if we replace get_largest() with get_largest_with_lifetime(), we won’t get compiler errors. In the example below, result has the same lifetime was the common lifetimes of vec1 and vec2, which is vec2‘s lifetime. This means we’re fine using result with the inner block.

### Conclusion

Rust documentation is very detailed and well written. All the concepts presented in this post are explained in length there. In here I’m presenting them in my own words and different examples (vec instead of string).

I also tried to group topics around memory management, which is different from the documentation. In there “lifetimes” are not covered under ownership and I skipped the generics discussion.

### References

[1] Wikipedia – Rust (programming language)
[2] Rust Documentation: What is Ownership?
[3] Rust Documentation: References & Borrowing

# DNA Assembly

In this post we’ll discuss the problem of reconstructing a DNA segment from its fragments, also known as DNA assembly.

Context. When we first talked about DNA sequencing, we learned that there’s no good way to “scan” the whole series of nucleotides from a DNA with current technology. What can be done is to break the target segment into many small (overlapping) segments and then rely on computers to help with the task of reconstructing the original DNA sequence from these segments.

We can start by making some assumptions over the nature of the fragments to start. First, we’ll assume every fragment has the same length and second that we have every possible fragment of such length.

For instance, if our sequence is TATGGGGTGC, all possible fragments of length 3 are: ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG.

Note that the fragments overlap with each other. For example, TAT and ATG have an overlap of AT. This is crucial for us solve the problem, since if there was no overlap it would be impossible to order the fragments to obtain the original sequence, since there would be no “link” between any two fragments.

Let’s state the problem more formally given these constraints.

### The String Reconstruction Problem

Definitions. A k-mer of a string S is any substring of S with length k.  The String Composition Problem consists in, given the set of all k-mers from S, reconstruct the string S.

Reusing the example from above, if we are given ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG, we could reconstruct the string TATGGGGTGC. We’ll now see a way to solve this problem.

Solution. Assuming a solution exists, it will consist of a (ordered) sequence of the k-mers such that adjacent k-mers overlap in k-1 positions. From the example above, the permutation is

TAT, ATG, TGG, GGG, GGG, GGT, GTG, TGC

And two adjacent k-mers such as TGG and GGG, overlap in k-1 positions (GG).

We can model this problem as a graph problem. If we define a directed graph where each vertex corresponds to a k-mer, and an edge (u, v) exists if and only if the suffix of k-mer u is the prefix of the k-mer v, in other words, u overlaps with v.

Now, if we can find a path visiting each vertex exactly once, that will be a valid reconstructed string. This is known as the Hamiltonian path problem for general graphs it’s a NP-Hard problem.

Instead, we can model this problem using a graph as follows: for each k-mer, we have a vertex corresponding to its prefix of length k-1 and another to its suffix with length k-1. For example, for a k-mer TGG, there would exist vertices TG and GG. There’s then an edge from u to v, if the overlap of u and v in k-2 positions is a k-mer in the input. In the example, above, there’s an edge from TG to GG because TGG is a k-mer. Note we can have repeated (multiple) edges.

In this new graph, if we can find a path visiting each edge exactly once, we’ll find a reconstructed string to the set of k-mers. To see why, we can observe that each edge in this new graph is a k-mer and two consecutive edges must overlap in k-1 positions (the overlap being the vertex that “links” these two edges). The graph for the example we discussed above can be seen below:

Graph representing the k-mer set:ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG

Note that this second graph is the line graph of the one we first used, in a similar fashion that a de Bruijn graph of dimension n is a line graph of the one with dimension n-1. In fact, these graphs are a subgraph of the de Bruijn graphs.

As we saw in our discussions about Eulerian Circuits that this is a much easier problem to solve.

#### Dealing with Ambiguity

Even if are able to solve the String Reconstruction problem, we might not end up with the right DNA segment. In [1], the authors provide the example TATGCCATGGGATGTT which has the same 3-mer composition of TAATGGGATGCCATGTT. Let’s see strategies employed to work around this problem.

### The String Reconstruction from Read-Pairs Problem

While it’s currently infeasible to generate longer fragments that would reduce ambiguity, it’s possible to obtain what is called read-pairs. These are a pair of k-mers that are separated by a distance of exactly d in the target segment.

For example, TGC and ATG are a pair of 3-mers separated by distance 1 in TATGCCATGGGATGTT. We refer to a pair of k-mers separated by distance d as (k, d)-mers, or (pattern1 | pattern2) if we can omit the distance.

Solution. We can construct a de Bruijn-like graph for this problem too, which we’ll call Paired de Bruijn graph. First, let’s define the prefix and suffix of a (k, d)-mer.  Given a (k, d)-mer in the form of (a1, ..., ak | b1, ..., bk), its prefix is given by (a1, ..., ak-1 | b1, ..., bk-1) and its suffix by (a2, ..., ak | b2, ..., bk).

For every (k, d)-mer, we’ll have one vertex corresponding to the prefix and one to the suffix of this (k, d)-mer. There’s an edge from vertex u to vertex v if there’s a (k, d)-mer whose prefix is u and suffix is v.

Similar to the solution to the String Reconstruction problem, we can find an Eulerian path, but in this case that might not yield a valid solution. In [1] the authors provide an example:

Consider the set of (2, 1)-mers is given by (AG|AG), (AG | TG), (CA | CT), (CT | CA), (CT | CT), (GC | GC), (GC | GC), (GC | GC), (TG | TG).

After constructing the graph, one of the possible Eulerian paths is (AG|AG) → (GC | GC) → (CA | CT) → (AG | TG) → (GC | GC) → (CT | CT) → (TG | TG) → (GC | GC) →  (CT | CA) which spells AGCAAGCTGCTGCA, which is a valid solution.

However, another valid Eulerian path, (AG|AG) → (GC | GC) →  (CT | CT) →  (TG | TG)  → (GC | GC) → (CA | CT) → (AG | TG) → (GC | GC) →  (CT | CA) does not yield a valid string.

In [1] the authors don’t provide an explicit way to overcome this issue but they go on to describe how to enumerate all Eulerian paths, which seems to suggest a brute-force approach.

### Practical Challenges

Missing fragments. One of the assumptions we made, that all fragments of the sequence are present doesn’t hold true for the state-of-the-art sequencers.

A technique to address this issue is to break the fragments into smaller ones until we get full coverage.

Left: 10-mers not providing full coverage. Right: 5-mers obtained from 10-mers and having full coverage.

This trades off coverage with ambiguity, since smaller k-mers are more likely to contain repeats and that might not lead to a single solution.

Typos. Another limitation of sequencers is that they can misread nucleotides. If we perform multiple reads – some containing the correct nucleotides, some not – we’ll end up with a graph where some paths are valid and some are not. It’s possible to remove them via heuristics but they’re not perfect and can lead to the removal of valid paths.

### Conclusion

While following the textbook [1] I felt a bit lost due to so many detours and kept losing track of the main problem being solved. Because the textbook is meant to also be accessible to people without prior knowledge of Computer Science, so it does need to provide the base for concepts such as Graph Theory.

One think I missed from the content were a section for experiment results. Bioinformatics is a highly applied branch of Computer Science and all of these methods are heuristics or based on approximate models. I’d be interested in knowing how well they perform in practice.

What I liked the most about the presentation style is that it provides a simpler solution first, describe the issues with it and then provide a better solution. This helps understanding on why a given algorithm is this or that way.

Reconstructing a string using (k,d)-mers in an efficient way seems like an open problem, given the solution presented requires brute force in the worst case. I wonder if there has been any progress since.

### References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[2] Wikipedia – Sequence Assembly

# 2018 in Review

This is a meta-post to review what happened in 2018.

## Posts Summary

This year I set out to learn about Bioinformatics. I completed the Bioinformatics class on Coursera. Under this umbrella I wrote about Cell Biology and DNA Sequencing. I’m on my way to write about DNA Fragment Assembly, but wanted to work out the theory behind it first, which led me to Eulerian Circuits and De Bruijn Graphs.

I was curious about current technologies such as Blockchain and Two-factor Authentication and wrote a bit about them.

One of my resolutions for last year was to learn the Rust programming language. I implemented the code from few of my posts using it, including the HyperLogLog data structure and a solver for a game called Bulls and Cows. I still ventured a bit with OCaml by learning BuckleScript (JavaScript with OCaml syntax).

I continued my slow progress in studying Distributed Systems. This year I read Google’s F1 Database paper and wrote about LSM Trees.

Besides BuckleScript, I haven’t dedicated too much time to Web Development topics, the other one being Layout properties of CSS.

The most popular post is still the 2014 Introduction to the Parsec Library, with 1.3k visits. From this year, the recreational math problem, Bulls and Cows was most viewed. Overall the blog had a total of 9.6k visitors.

I kept the resolution to post once a month on average. The blog completed 6 years with 79 posts.

## Resolutions for 2019

I’ll repeat my resolutions from 2018 for 2019. I don’t think I learned nearly the minimum of Rust, especially around memory management, and I’ve only scratched the surface on Bioinformatics. Besides DNA analysis I learned more about other problems like protein folding that seem exciting.

I haven’t done any mobile projects and only read one paper, so I’ll put these on the bucket list as well.

## Personal

The end of the year is a good time to look back and remember all the things I’ve done besides work and the technical blog.

### Trips

I enjoy traveling and 2018 had plenty of trips. I haven’t had been to Europe before and this year I happened to go twice! Once for work, to England and another time for pleasure, to Greece.

In England I explored mostly around London including the cities of Bath and Dover.

Top: Tower Bridge; Iconic double-decker bus; Rosetta Stone at the British Museum. Bottom: Roman Baths; Dover Cliffs; Windsor Castle.

The trip to Greece included Athens, Santorini and a train ride to Kalambaka, to see the Meteora monasteries.

Top: Athens seen from the Acropolis, the Parthenon, and Santorini. Bottom: Temple of Zeus in Athens, a monastery on top of a mountain and the Akrotiri museum.

There were also trips around the US, including Albuquerque in New Mexico, New Orleans in Louisiana and Los Angeles in California.

Top: Taos Pueblo near Santa Fe NM, Petroglyphs in Albuquerque NM, Venice Canals in Los Angeles. Bottom: Getty Museum in Los Angeles; Jackson Square in Louisiana; French Quarter in Louisiana.

There was also a trip to Montana, to the Glacier National Park. I really like National Parks and I’m glad to have visited this one, which is very beautiful.

Glacier National Park: Iceberg Glacier, Mountain Goat and Bearhat Mountain

### Books

This year I read a lot of non-fiction, especially science-related. My favorites science books were:

•  Blind Watchmaker from Richard Dawkins. He delves into Darwin’s theory of evolution to  conclude it’s the most probable explanation for the existence of life on Earth.
• Genome: The Autobiography of a Species in 23 Chapters by Matt Ridley is a highly engaging and elucidating tour of our genome. Each chapter is dedicated to one chromosome and he provides an example of trait or disease related to it.
• The Ghost Map: The Story of London’s Most Terrifying Epidemic – and How It Changed Science, Cities, and the Modern World by Steven Johnson. It describes some fascinating detective work from a doctor during a time we knew a lot less about diseases.

In the realms of humanities,

• Enlightenment Now: The Case for Reason, Science, Humanism, and Progress by Steven Pinker is a thorough presentation of facts and data to back the claim that despite localized set backs and short-term regressions, the world has been becoming more progressive. The idea that stuck the most with me is that each new generation tends to be more progressive than their parents, which shines a optimist light to a more humane future.
• Why Nations Fail: The Origins of Power, Prosperity, and Poverty makes the claim that the success of a nation has nothing to do with geography, race or culture. There is a lot of world history in this book and I learned a bunch about different countries.

I also enjoyed some biographies,

• I Am Malala – inspiring story from a Pakistani girl who went on to win the Nobel Peace prize in 2014. Great overview of the history of Pakistan, and the life of a civilian under the regime of the Taliban.
• Born a Crime – The comedian Trevor Noah had a pretty happening life. The book covers the recent history of South Africa and especially the Apartheid. He provides an interesting perspective on growing up on the later part of that regime and for being the son of a black mother and a white father.

and reading stuff related to trips,

• For Greece, I chose The King Must Die by Mary Renault. It is a fiction written by Mary Renault with most of it set in the mythical kingdom of Minos in Crete. I really like the fact it alludes to Greek myths but the story itself does not rely on supernatural elements.
• For Montana, I picked A River Runs Through It by Norman Maclean. It’s a short story set in rural Montana and a constant theme is family relations, fishing and the big questions of life.
• A Modern History of Japan by Andrew Gordon. I was in Japan in 2017, not in 2018, but I only managed to finish the book this past year. I learned a bunch about the recent Japanese history, but not in enough detail to change how I thought about the experiences from my trip.

### Movies

I haven’t watched many movies but really enjoyed Coco and Crazy Rich Asians.

# De Bruijn Graphs and Sequences

Nicolaas Govert de Bruijn was a Dutch mathematician, born in the Hague and taught University of Amsterdam and Technical University Eindhoven.

Irving John Good was a British mathematician who worked with Alan Turing, born to a Polish Jewish family in London. De Bruijn and Good independently developed a class of graphs known as de Bruijn graphs, which we’ll explore in this post.

## Definition

A de Bruijn graph [1] is a directed graph defined over a dimension n and a set S of m symbols. The set of vertices in this graph corresponds to the m^n possible sequences of symbols with length n (symbols can be repeated).

There’s a directed edge from vertex u to v if the sequence from v can be obtained from u by removing u’s first element and then appending a symbol at the end. For example, if S = {A, B, C, D}, n = 3 and u = ABC, then there’s an edge from ABC to BC*, that is, BCA, BCB, BCC and BCD.

## Properties

We can derive some basic properties for de Bruijn graphs.

1) Every vertex has exactly m incoming and m outgoing edges.

We saw from the example above that ABC had edges to any vertex BC*, where * is any of the m symbols in S. Conversely, any sequence in the form *AB can be transformed into ABC, by dropping the first symbol and appending ‘C’.

2) Every de Bruijn graph is Eulerian.

In our last post we discussed about Eulerian graphs and learned that a necessary and sufficient condition for a directed graph to have an Eulerian cycle is that all the vertices in the graph have the same in-degree and out-degree and that it’s strongly connected. The first condition is clearly satisfied given the Property 1) above.

To see that a de Bruijn graph is strongly connected, we just need to note that it’s possible to convert any sequence into another by removing the first character and replacing the last with the appropriate one in at most n steps. For example, given the string ABC, we can convert it to BDD by doing ABC -> BCB -> CBD -> BDD. Since each such step corresponds to traversing an edge in the de Bruijn graph, we can see it’s possible to go from any vertex to another, making the graph strongly connected.

3) A de Bruijn graph over the set of symbols S and dimension n is the line graph of the de Bruijn graph over set S and dimension n – 1

A line graph of a given graph G has vertices corresponding to edges in G, and there are edges between two vertices if the corresponding edges in G share a vertex. More formally, let G = (V, E) be an undirected graph. The line graph of G, denoted by L(G) has a set of vertex V’ corresponding to E.  Let u’, v’ be two vertices from V’, corresponding to edges e1 and e2 in E. There’s an edge between u’ and v’ if e1 and e2 have one vertex in common.

It’s possible to generalize this to directed graphs by changing the definition of edges slightly: let u’, v’ be two vertices from V’, corresponding to the directed edges e1 = (a, b) and e2 = (c, d) in E. Then there’s a directed edge from u’ to v’ if and only if b = c.

We can gain an intuition on Property 3 by looking at an example with set S = {0, 1} and constructing a de Bruijn graph with n = 2 from one with n = 1. In Figure 1, the vertices from n = 2 are the labeled edges of n = 1. The edges in n = 2 correspond to the directed paths of length 2 in n = 1. We highlighted in red one of such paths. In n = 1, the path is given by (0, 1) and (1, 1), which became (01, 11) in n = 2.

Figure 1: Constructing a De Bruijn graph over  symbols{0, 1} and dimension n = 2 from one with dimension n = 1

4) Every de Bruijn graph is Hamiltonian

This follows from Properties 2 and 3. We claim that an Eulerian cycle in a De Bruijn graph in dimension n is a Hamiltonian path in dimension n + 1. That’s because we visit every edge exactly once and each edge corresponds to a vertex in the graph in dimension n + 1. Given two consecutive edges in the Eulerian cycle in dimension n, (u, v) and (v, w), from Property 3 we know that there’s an edge from the corresponding vertex (u, v)’ to vertex (v, w)’ in dimension n + 1.

## De Bruijn Sequence

The de Bruijn sequence of dimension n on a set of symbols S, denoted B(S, n), is a cyclic sequence in which every possible sequences of length n appears as substring. The length of such sequence is |S|^n.

Since |S|^n is also the number of distinct sequences of length n, we can conclude that this sequence is the shortest possible. To see why, let B be a de Bruijn sequence. We can assign an index p to each sequence s of length n based on where it appears in B such that the substring B[p, p + n – 1] represents s. Since each of the |S|^n sequences are distinct, they cannot have the same index p. Hence, there must be at least |S|^n indexes, and thus B must be at least that long.

It’s possible to construct a de Bruijn sequence B(S, n) from the Hamiltonian path of a de Bruijn graph over S and dimension n. Two adjacent nodes in the Hamiltonian path share n-1 symbols, so if we start with a vertex v, each new vertex in the path only adds one symbol. It would have a total of n + (|S|^n – 1), but since the last n-1 symbols of the sequence overlap with the beginning when we wrap in a cycle, the cyclic sequence has length |S|^n.

Note that we can construct an Hamiltonian cycle for a de Bruijn graph in polynomial time because it’s equivalent to the Eulerian path in one dimension below. Hence we have a polynomial time algorithm to construct the de Bruijn sequence.

## Applications

#### Cracking Locks

A de Bruijn sequence can be used to brute-force a lock without an enter key, that is, one that opens whenever the last n digits tried are correct. A naive brute force would need to try all |S|^n typing n digits every time, for a total of |S|^n. Using a de Bruijn sequence we would make use of the overlap between trials, and only need to type |S|^n digits in total.

#### Finding the Least Significant Bit

The other interesting application mentioned in [2] is to determine the index of the least significant bit in an unsigned int (32-bits). The code provided is given by:

Let’s understand what the code above is doing. For now, let’s assume v > 0 and we’ll handle v = 0 as a special case later.

In the code above, (v & -v) has the effect of “isolating” the least significant bit. Since v is unsigned, -v is its two’s complement, that is, we complement the digits of v (~v) and add one. Let p be the position of the least significant digit in v. The bits in positions lower than p will be 1 in ~v, and in position p it’s a 0. When incremented by 1, they’ll turn into 1 in position p and 0 in the lower positions. In the positions higher than p, v and -v will be have complementary bits. When doing a bitwise AND, the only position where both operands have 1 is p, hence it will be the number (1 << p) (or 2^p).

Then we multiply the result above by 0x077CB531U which is the de Bruijn sequence B({0, 1}, 5) in hexadecimal. In binary this is 00000111011111001011010100110001, which is a 32-bit number.  Because v & -v is a power of 2 (2^p), multiplying a number by it is the same as bit-shifting it to the left p positions. Then we shift it to the right by 27 positions, which has the effect of capturing the 5 most significant bits from the resulting multiplication. If we treat the number as a string of characters (note that most significant bits are the first characters), the left shift followed by the right shift is equivalent to selecting a “substring” from position p to p+5.

For example, if p = 13, a left shift on 00000111011111001011010100110001 would result in 10010110101001100010000000000000. Then a right shift of 27, would pick the 5 leftmost bits, 10010. If we treat 00000111011111001011010100110001 as a string, 10010 shows up as a substring 00000111011111001011010100110001 in positions [13, 17].

Since this is a de Bruijn sequence for n = 5, every substring of length 5 corresponds to a unique 5-bit number and conversely every 5-bit number is present in this sequence. Now we just need to keep a map from the 5-bit number we obtained via the bit manipulation to the actual number we wanted, which we store in MultiplyDeBruijnBitPosition. Since 10010 is 18, we’ll have an entry MultiplyDeBruijnBitPosition[18] = 13.

Finally, for the special case where v = 0, we have that v & -v is 0 and the algorithm will return 0.

#### Assembling DNA Fragments

In [3] Compeau and Pevzner proposed a method to assemble fragments of DNA into its original form. The problem can be modeled as the k-universal circular string problem.

Definition: Consider a list of sequences s_1, s_2, …, s_n, each of which having the same size k, having the property that s_i’ s suffix and s_i+1 ‘ s prefix overlap in k-1 positions. That is, the last k-1 characters in s_i are the same as the first k-1 characters in s_i+1. We are given the sequences in no particular order. The objective is to find a composed string S which is the result of the overlap of s_1, s_2, …, s_n in order.

This problem can be modeled as a de Bruijn graph where each sequence is associated with a vertex. If sequence s_i’s suffix and s_j’s prefix overlap in k-1 positions, we add a directed edge from vertex s_i to s_j. We then find an Hamiltonian path in the de Bruijn graph and the order in which the vertices are visited will give us the desired string.

## Variants: The Super-permutation

One variant to the de Bruijn sequence problem is to, instead of finding a universal sequence containing all possible sequences of length n, find one containing all the permutations of the symbols in S. Instead of the |S|^ n sequences as input, we’ll have |S|! sequences. This is know as the Super-permutation problem.

For example, for S = {1, 2}, it want to find a sequence containing: 12 and 21. The sequence 121 is a possible solution. For S = {1, 2, 3}, we have now 123, 132, 213, 231, 312 and 321. The shortest 123121321. John Carlos Baez tweets about this problem in [4]. Finding the shortest sequence that includes all permutations is an open problem!

We know optimal solution for n up to 5. The best known lower bound for this problem is n! + (n−1)! + (n−2)! + n − 3 while the upper bound is n! + (n−1)! + (n−2)! + (n−3)! + n − 3 [5].

## Conclusion

In this post I was mainly interested in learning more about de Bruijn graphs after reading about them in Bioinformatics Algorithms by Compeau and Pevzner [3]. I ended up learning about de Bruijn sequences and realized that the problem was similar to one I read about recently on John’s Twitter. It was a nice coincidence.

## References

[1] Wikipedia: De Bruijn graph
[2] Wikipedia: De Bruijn sequence
[3] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[5] Wikipedia: Superpermutation

# Eulerian Circuits

Leonhard Euler was a Swiss mathematician in the 18th century. His paper on a problem known as the Seven Bridges of Königsberg is regarded as the first in the history in Graph Theory.

The history goes that in the city of Königsberg, in Prussia, there were seven bridges connecting different mass of lands along the Pregel river (see Figure 1). The challenge was to find a path through the city that crossed the bridges exactly once. Euler showed that no such solution existed.

Interesting unrelated fact: Today Königsberg called Kaliningrad in Russia, and Kaliningrad is actually separated from Russia geographically, lying between Lithuania and Poland.

Figure 1: Map of Königsberg and the seven bridges. Source: Wikipedia

The solution to the Seven Bridges of Königsberg problem eventually led to a branch of Mathematics known as Graph Theory. In this post we’ll be talking about the theoretical framework that can be used to solve problems like the Seven Bridges of Königsberg, which is known as Eulerian Circuits.

We’ll provide a general definition to the problem, discuss a solution and implementation, and finally present some extensions and variations to the problem.

## Definition

Let G(V, E) be a connected undirected graph, where V is the set of vertices and E the set of directed edges, and where (v, w) denotes an edge between vertices v and w. The Eulerian circuit problem consists in finding a circuit that traverses every edge of this graph exactly once or deciding no such circuit exists.

An Eulerian graph is a graph for which an Eulerian circuit exists.

## Solution

We’ll first focus on the problem of deciding whether a connected graph has an Eulerian circuit. We claim that an Eulerian circuit exists if and only if every vertex in the graph has an even number of edges.

We can see this is a necessary condition. Let v be a node with an odd number of edges. Any circuit traversing all edges will have to traverse v. Moreover, it will have to use one edge to “enter” v and one edge to “leave” v. Since this circuit can traverse each edge no more than one time, it will have to use different edges each time, meaning it needs 2 edges every time it crosses v. If there are an odd number of edges, one edge will be left unvisited.

To show this is sufficient, we can provide an algorithm that always finds an Eulerian circuit in a graph satisfying these conditions. Start from any vertex v and keep traversing edges, deleting them from the graph afterwards. We can’t get stuck on any vertex besides v, because whenever we enter an edge there must be an exit edge since every node has an even number of edges. Thus eventually we’ll come back to v, and this path form a circuit.

This circuit doesn’t necessarily cover all the edges in the graph though, nor it means that are other circuits starting from v in the remaining graph. It must be however, that some node w in the circuit we just found has another circuit starting from it. We can repeat the search for every such node and we’ll always find another sub-circuit (this is a recursive procedure, and we might find sub-sub-circuits). Note that after we remove the edges from a circuit, the resulting graph might be disconnected, but each individual component is still Eulerian.

Once we have all the circuits, we can assemble them into a single circuit by starting the circuit from v. When we encounter a node w that has a sub-circuit, we take a “detour” though that sub-circuit which will lead us back to w, and we can continue on the main circuit.

## Implementation

We’ll use the algorithm first described by Hierholzer to efficiently solve the Eulerian circuit problem, based on the proof sketched in the previous session.

The basic idea is that given a graph and a starting vertex v, we traverse edges until we find a circuit. As we’re traversing the edges, we delete them from the graph.

Once we have the circuit, we traverse it once more to look for any vertices that still have edges, which means these vertices will have sub-circuits. For each of these vertices we merge the sub-circuit into the main one. Assume the main circuit is given by a list  of vertices $(v, p_2, ... , p_k-1, w, p_k+1, ..., p_n-1, v)$ and w is a vertex with a sub-circuit. Let $(w, q_1, ..., q_m-1, w)$ be the sub-circuit starting from w. We can construct a new circuit $(v, p_2, ..., p_k-1, w, q_1, ..., q_m-1, w, p_k+1, ..., p_n-1, v)$.

Let’s look at a specific implementation using JavaScript (with Flow). The core of the algorithm implements the ideas discussed above:

The complete code is on Github.

## Analysis

We’ll now demonstrate that the algorithm described above runs in linear time of the size of the edges (i.e. O(|E|)).

Note that find_circuit() is a recursive function, but we claim that the number of times the while() loop executes across all function calls is bounded by the number of edges. The key is in the function:

graph.getNextEdgeForVertex(vertex);

graph is a convenience abstraction to an adjacency list, where for each vertex we keep a pointer to the last edge visited. Because of this,  getNextEdgeForVertex() will visit each edge of the graph at most once and we never “go back”. Since the graph object is shared across all function calls (global), we can see that the number of calls to getNextEdgeForVertex() is bounded by O(|E|), so is the number of times all while() loops execute.

Now we just need to prove that every other operation in the while loop is O(1). The only non-obvious one is:

graph.deleteEdge(edge);

This is a lazy deletion, meaning that we just set a flag in edge saying it’s deleted and it will later be taken into account by callers like graph.getNextEdgeForVertex() and graph.getDegree(). Hence, this is an O(1) operation.

For getNextEdgeForVertex(), we must skip edges that have been deleted, so we might need to iterate over a few edges before we find an undeleted one (or none if the graph is not Eulerian – in which case we terminate the algorithm). Since we’re still always processing at least one edge in every call to getNextEdgeForVertex() the argument about the total calls being bounded by O(|E|) holds.

In order for getDegree() to be an O(1) operation, we need to keep a non-lazy count of the degree of a vertex, but we can do it in O(1) when deleting an edge.

Finally, let’s analyze the second loop. The number of iterations is proportional to the length of the circuit. Since every possible circuit found (including the ones found recursively) are disjoint, the total number of times we loop over the vertices from circuits (across all function calls) is also bounded by the number of edges.

We already saw getDegree() is O(1) even with lazy deletion. The remaining operation is

path.insertAtVertex(vertex, subPath);

if we store the paths as a linked list of vertices, inserting subPath at a given node can be done in O(1) if we keep a reference from each vertex to its last (any actually) occurrence in the path.

## Directed Graphs

We can extend the definition of Eulerian graphs to directed graphs. Let G(V, A) be a strongly connected graph, where V is the set of vertices and A the set of directed edges, and where (v, w) indicate a directed edge from v to w. The Eulerian circuit problem for a directed graph consists in finding a directed circuit that traverses every edge of this graph exactly once or deciding no such circuit exists.

It’s possible to show that such a circuit exists if and only if the strongly connected directed graph has, for each vertex v, the same in-degree and out-degree. The algorithm is essentially the same.

## Counting Eulerian Circuits in directed graphs

It’s possible to count the number of different Eulerian circuits in a directed graph. According to the BEST theorem (named after de Bruijn, van Aardenne-Ehrenfest, Smith and Tutte) [3], the number of Eulerian circuits in a directed graph can be given by [4]:

$ec(G) = t_w(G) \prod_{v \in V}(deg(v) - 1)!$ (1)

Where deg(v) represents the in-degree (or out-degree) or a vertex v and t_w(G) is the number of arborescences rooted in a vertex w (simply put, an arborescence is analogous to a spanning tree for a directed graph – but we can only include edges that are directed away from the root).

It’s possible to show that t_w(G) is the same for any vertex w if G is Eulerian. We can compute t_w(G) via the Matrix-Tree theorem [2], which says t_w(G) is equal to the determinant of the Laplacian of G without vertex w. Let’s try to understand the idea behind this equation.

The mapping from an arborescence to an Eulerian path can be made by the following. Let r be the root of a possible arborescence of G. Now, let r be the reference starting point for an Eulerian path in G (note this is just for reference, since there’s no starting point in a circuit).

We say that an Eulerian path is associated with a given arborescence if for each vertex v, the last edge passing through v, say (v, v’), belongs to the arborescence. This is more clear with an example. Consider the digraph from Figure 2. Here we’ll consider the arborescences rooted in A.

Figure 2: Directed Graph

This graph has 2 possible arborescences depicted on the left in Figures 3 and 4. In Figure 3, we can see that the edge (B, D) has to be visited before (B, C) because (B, C) is in the arborescence.

Figure 3: One of the arborescences of G and a corresponding Eulerian circuit

Now, in Figure 4, because it’s (B, D) that’s in the arborescence, it has to be visited after we visit (B, C).

Figure 4: Another of the arborescence of G and a corresponding Eulerian circuit

Note that there can be more than one Eulerian path to a given arborescence. If B had more out-edges, we’d have multiple choices, since the arborescence only specifies the last edge to be taken, not the intermediate ones. More specifically, imagine B had k out-edges. Then we could traverse the first k-1 in any combination of orders, which leads to a total of (k – 1)! ways of doing so.

The same applies to all other nodes. Due to properties of Eulerian circuits, the choice of the out-edge at a given node can be seen as independent of the choice at other nodes, so the total possible Eulerian circuits corresponding to any arborescence is given by the product of the degrees from equation (1), namely:

$\prod_{v \in V}(deg(v) - 1)!$ (2)

The key property of categorizing Eulerian circuits into arborescence classes is that they’re disjoint, that is, a Eulerian circuit corresponds to exactly one arborescence. This, in conjunction with the fact that the vertices degrees in Equation (2) are from the original graph, and hence independent of a arborescence, lead us to the two independent factors in equation (1).

## Counting Eulerian Circuits in undirected graphs

Counting Eulerian circuits in undirected graphs is a much harder problem. It belongs to a complexity class known as #P-complete. This means that:

1. It belongs to the #P class, which can informally be seen as the counting version of NP problems. For example: deciding whether a given graph has an Hamiltonian circuit (path that traverses all vertices exactly once) is a problem in the NP class. Counting how many Hamiltonian circuits existing in that graph is the corresponding problem in the #P class.
2. It belongs to the #P-hard class, which means that any problem in #P can be reduced to it via a polynomial-time transformation.

Valiant proved the first condition in [5] while Brightwell and Winkler proved the second in [6] by reducing another #P-complete problem (counting Eulerian orientations) to it.

Note that a problem in the #P class is as hard as the equivalent class in NP, because we can reduce a problem in NP to #P. For example, we can decide whether a graph has an Hamiltonian circuit (NP problem) by counting the number of circuits it has (#P problem). The answer will be “yes” if it the #P version returns a number greater than 0 and “no” otherwise.

Because the problem of counting Eulerian circuits in an undirected graph being in #P, we can conclude that there’s no efficient (polynomial time) algorithm to solve it unless P = NP.

## Conclusion

In this post we covered Eulerian circuits in an informal way and provided an implementation for it in JavaScript. I spend quite some time to setup the JavaScript environment to my taste. I strongly prefer using typed JavaScript (with Flow) and using ES6 syntax. I decided to write it in JavaScript with the potential to create a step-by-step interactive tool to demonstrate how the algorithm works.

I was familiar with the concept of Eulerian circuits, but I didn’t remember the algorithms to solve it, even though I was exposed to one of them in the past. It was a good learning experience to write the code from scratch to really understand what I was doing.

This is the first time I see the #P complexity class. It’s always nice to learn about new theories when digging further on a specific topic.

## References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[2] Matrix-Tree Theorem for Directed Graphs – Margoliash, J.
[3] Circuits and trees in oriented linear graphs – Aardenne-Ehrenfest, van T., Bruijn, de N.G.
[4] Wikipedia – BEST Theorem
[5] The complexity of computing the permanent – L. G. Valiant
[6] Counting Eulerian circuits is #P-complete – Brightwell, G. and Winkler, P.

# Blockchain

Blockchain has become a very popular technology recently due to the spread of Bitcoin. In this post, we’ll focus on the details of blockchain with a focus on Computer Science, studying it as a distributed data structure.

### Motivation

Blockchain is a distributed data structure that can be used to implement a distributed ledger system. A distributed ledger orchestrates important operations (for example financial transactions) without the need of a centralized arbiter (such as a financial institution).

The reason to prefer decentralized systems could be from costs of operations: having a few financial institutions mediate all transactions require a lot of work; To avoid a single point of failure (more reliable), and finally to not have to trust a few companies with our assets.

Additionally, by being decentralized, the expectation is that it becomes less likely to regulate and thus far it has enable global currencies like bitcoin.

### Challenges

The main challenge with a distributed ledger is how to distribute the work and data across nodes in a network and, more importantly, how to make sure we can trust that information.

An initial naive idea could be to store a copy of all accounts balances in every node of the network. The problem of storing only the balances of accounts is that it’s very hard to verify whether a given payment went through or make sure the balances are consistent. To address that, the blockchain also stores the whole history of transactions that led to the current balances (think of version control).

Safety. Even if we have the whole history of the transactions, it’s possible for a bad actor to tamper with this history on their own benefit, so we need a safety mechanism to prevent that.

Consistency. Finally, because we store copies of data that is changing all the time in multiple machines, we’ll invariably hit problems with consistency and sources of truth.

Blockchain is a data structure designed to address these challenges as we shall see next. We’ll use bitcoin as the application when describing examples.

### Data Structure

A blockchain is a chain of entities called blocks that is replicated to several nodes in a network. Each block has a hash which is computed as a function of its contents and the hash of the previous node in the chain.

For the bitcoin case, the content of the block is a set of transactions adding up to 1MB. The hash of the contents is the hash of the combination of the hashes of individual transactions. More specifically, these hashes can be organized in a binary tree (also known as Merkle Tree) where the transactions hashes are the leaves and the hashes of individual inner nodes are the hash of the concatenation of the hashes of the children.

Merkle Tree

Source of truth. There might be several different versions of the chain around the network, either due to inconsistency or bad actors. The assumption is that the chain with most nodes that is agreed upon by the majority of the nodes (over 50%) is the accurate and most up-to-date version of the chain and the one that should be trusted.

User C receiving the chains from A and B. Since B’s chain is longer, C will take it as the source of truth.

### Inserting Blocks

The insertion consists of adding a new transaction to the blockchain. In terms of our bitcoin example, this is essentially user A sending some amount of money to user B.

To start the process, node A broadcasts a message with the transaction, signing it with its private key. The other nodes on the network have node A’s public key, so they can check the authenticity of the transaction.

The transaction stays in a pool of unresolved transactions. At any time, there are many nodes in the network constructing a new block, which contains several transactions. These nodes are also called miners. Not all nodes in the network need to be miners. Note that each block can pick any transaction from the pool so the set of transactions in a block under construction can be different between miners.

Adding a transaction to its block consists of verifying things like its authenticity, the validity (e.g. to make sure there are enough funds to be spend), then inserting it to the Merkle Tree, which allows recomputing the root hash in O(log n) time.

Each block should be around 1MB in size, so when a miner is done collecting enough transactions it can start wrapping up its work. It needs to compute the block hash which is a string such that when concatenated with the Merkle tree root hash and the previous block in the largest chain will generate a hash with k leading 0s. Since the hash function used is cryptographic, it cannot be easily reverse-engineered. The only way to find such string is via brute force. The number k is a parameter that determines the difficulty in finding the hash (the value is controlled by some central authority but rarely changes). The idea of this hash computation is that it’s purposely a CPU-expensive operation, so that it’s too costly for attackers to forge hashes as we’ll see later.

Once it finds the hash, it can broadcast the new block to the network where other miners can start to validate this block. They can check that:

• The transactions in the block are valid
• The block was added to the largest chain known
• The hash generated for the block is correct

The verification step should be much cheaper than the process of generating the hash. Once the block passes the validation, the miner adds it to the largest chain. Now, when there’s a call to get the largest chain, this last block will be included. On the other hand, if any of the validation steps failed, the block is rejected.

Because of this, miners building their blocks can also periodically check for updates in the network for larger chains. If, while they’re building their blocks or computing the hash a new larger chain arrives, it’s better to start over, since it will be rejected later anyway.

### Checking for valid transactions

In the bitcoin example, suppose I’m a vendor and I want to verify that your payment to me went through before I ship you my product.

To check whether a transaction was validated, a node just needs to ask for the current longest blockchain, which is the source of truth, and see if the transaction appears in any of the blocks.

### Double-Spending

Say user A has exactly 1 coin, and that it sends to B as payment. Immediately after, A sends that same coin to another user C. How does C make sure that the coin it’s receiving is not spent before?

In a decentralized ledger, this could lead to problems since different nodes could disagree to where the coin went. In the blockchain, there are two scenarios:

• A to B  got included in a block first which later became part of the longest chain (and hence the source of truth). The transaction from A to C would be rejected from being added to future blocks due to insufficient funds.
• A to B and A to C got picked up to be included in the same block. The miner constructing the block would consider the second transaction it picked up invalid. Even if it was malicious and it included both transactions, it would fail validation when broadcasting it to the network.

The case in which A to C gets picked up first is analogous.

### Changing History

Suppose that user A, after performing a payment to and receiving its product from user B, wants to reuse the transaction and send the money to another user C. In theory it could edit the destination of the transaction from the chain (or even replace the block entirely), but to make this modified chain become the source of truth, it would also need to make it the longest.

As we’ve seen above, adding a block to a chain with a valid hash is extremely expensive. To make it the new source of truth, it would need to add at least one block in addition to the forged one. This means it would need to outpace all the other miners in the network that are also continuously computing new blocks in parallel. The paper [1] claims that unless the attacker controls over 50% of the CPU power in the network, this is extremely unlikely to succeed.

Note that a user cannot modify the transactions it is not the sender of, since a transaction from user A to user B is signed with A’s private key. If user C wanted to redirect A’s money to itself, it would need to know A’s private key, so it could sign the transaction pretending it’s A.

### Incentives

As we mentioned earlier, not all nodes in the network need to be miners. So why would any node volunteer to spend energy in form of CPU cycles to compute hashes? In bitcoin, the incentive is that the miner whose block gets added to the chain receives bitcoins as reward. These bitcoins are collected as fees from the normal transactions.

Bitcoin Mining Farms to compute block hashes and collect rewards

There must incentives to make sure that a miner won’t keep building blocks with one or zero transactions, to skip doing any validations for transactions. The reward should take into account the transaction, and possibly its value and their age (to make sure “unimportant” transactions won’t be ignored forever).

### Conclusion

In researching the material for this post, I ran into a lot of articles and videos covering blockchain without going into much detail. As always, I only realized my understanding had a lot of gaps once I started writing this post.

Overall I know a good deal more about what Blockchain is and what it’s not. I’m curious to see what other applications will make use of this technology and if we can come up with a trustworthy system that doesn’t require wasting a lot of CPU power.

### References

[1] Bitcoin: A Peer-to-Peer Electronic Cash System
[2] Hashcash: A Denial of Service Counter-Measure – A proof-of-work algorithm
[3] Blockchain: how mining works and transactions are processed in seven steps
[4] Bitcoin P2P e-cash paper – The proof-of-work chain is a solution to the Byzantine Generals’ Problem