# DNA Assembly

In this post we’ll discuss the problem of reconstructing a DNA segment from its fragments, also known as DNA assembly.

Context. When we first talked about DNA sequencing, we learned that there’s no good way to “scan” the whole series of nucleotides from a DNA with current technology. What can be done is to break the target segment into many small (overlapping) segments and then rely on computers to help with the task of reconstructing the original DNA sequence from these segments.

We can start by making some assumptions over the nature of the fragments to start. First, we’ll assume every fragment has the same length and second that we have every possible fragment of such length.

For instance, if our sequence is TATGGGGTGC, all possible fragments of length 3 are: ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG.

Note that the fragments overlap with each other. For example, TAT and ATG have an overlap of AT. This is crucial for us solve the problem, since if there was no overlap it would be impossible to order the fragments to obtain the original sequence, since there would be no “link” between any two fragments.

Let’s state the problem more formally given these constraints.

### The String Reconstruction Problem

Definitions. A k-mer of a string S is any substring of S with length k.  The String Composition Problem consists in, given the set of all k-mers from S, reconstruct the string S.

Reusing the example from above, if we are given ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG, we could reconstruct the string TATGGGGTGC. We’ll now see a way to solve this problem.

Solution. Assuming a solution exists, it will consist of a (ordered) sequence of the k-mers such that adjacent k-mers overlap in k-1 positions. From the example above, the permutation is

TAT, ATG, TGG, GGG, GGG, GGT, GTG, TGC

And two adjacent k-mers such as TGG and GGG, overlap in k-1 positions (GG).

We can model this problem as a graph problem. If we define a directed graph where each vertex corresponds to a k-mer, and an edge (u, v) exists if and only if the suffix of k-mer u is the prefix of the k-mer v, in other words, u overlaps with v.

Now, if we can find a path visiting each vertex exactly once, that will be a valid reconstructed string. This is known as the Hamiltonian path problem for general graphs it’s a NP-Hard problem.

Instead, we can model this problem using a graph as follows: for each k-mer, we have a vertex corresponding to its prefix of length k-1 and another to its suffix with length k-1. For example, for a k-mer TGG, there would exist vertices TG and GG. There’s then an edge from u to v, if the overlap of u and v in k-2 positions is a k-mer in the input. In the example, above, there’s an edge from TG to GG because TGG is a k-mer. Note we can have repeated (multiple) edges.

In this new graph, if we can find a path visiting each edge exactly once, we’ll find a reconstructed string to the set of k-mers. To see why, we can observe that each edge in this new graph is a k-mer and two consecutive edges must overlap in k-1 positions (the overlap being the vertex that “links” these two edges). The graph for the example we discussed above can be seen below:

Graph representing the k-mer set:ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG

Note that this second graph is the line graph of the one we first used, in a similar fashion that a de Bruijn graph of dimension n is a line graph of the one with dimension n-1. In fact, these graphs are a subgraph of the de Bruijn graphs.

As we saw in our discussions about Eulerian Circuits that this is a much easier problem to solve.

#### Dealing with Ambiguity

Even if are able to solve the String Reconstruction problem, we might not end up with the right DNA segment. In [1], the authors provide the example TATGCCATGGGATGTT which has the same 3-mer composition of TAATGGGATGCCATGTT. Let’s see strategies employed to work around this problem.

### The String Reconstruction from Read-Pairs Problem

While it’s currently infeasible to generate longer fragments that would reduce ambiguity, it’s possible to obtain what is called read-pairs. These are a pair of k-mers that are separated by a distance of exactly d in the target segment.

For example, TGC and ATG are a pair of 3-mers separated by distance 1 in TATGCCATGGGATGTT. We refer to a pair of k-mers separated by distance d as (k, d)-mers, or (pattern1 | pattern2) if we can omit the distance.

Solution. We can construct a de Bruijn-like graph for this problem too, which we’ll call Paired de Bruijn graph. First, let’s define the prefix and suffix of a (k, d)-mer.  Given a (k, d)-mer in the form of (a1, ..., ak | b1, ..., bk), its prefix is given by (a1, ..., ak-1 | b1, ..., bk-1) and its suffix by (a2, ..., ak | b2, ..., bk).

For every (k, d)-mer, we’ll have one vertex corresponding to the prefix and one to the suffix of this (k, d)-mer. There’s an edge from vertex u to vertex v if there’s a (k, d)-mer whose prefix is u and suffix is v.

Similar to the solution to the String Reconstruction problem, we can find an Eulerian path, but in this case that might not yield a valid solution. In [1] the authors provide an example:

Consider the set of (2, 1)-mers is given by (AG|AG), (AG | TG), (CA | CT), (CT | CA), (CT | CT), (GC | GC), (GC | GC), (GC | GC), (TG | TG).

After constructing the graph, one of the possible Eulerian paths is (AG|AG) → (GC | GC) → (CA | CT) → (AG | TG) → (GC | GC) → (CT | CT) → (TG | TG) → (GC | GC) →  (CT | CA) which spells AGCAAGCTGCTGCA, which is a valid solution.

However, another valid Eulerian path, (AG|AG) → (GC | GC) →  (CT | CT) →  (TG | TG)  → (GC | GC) → (CA | CT) → (AG | TG) → (GC | GC) →  (CT | CA) does not yield a valid string.

In [1] the authors don’t provide an explicit way to overcome this issue but they go on to describe how to enumerate all Eulerian paths, which seems to suggest a brute-force approach.

### Practical Challenges

Missing fragments. One of the assumptions we made, that all fragments of the sequence are present doesn’t hold true for the state-of-the-art sequencers.

A technique to address this issue is to break the fragments into smaller ones until we get full coverage.

Left: 10-mers not providing full coverage. Right: 5-mers obtained from 10-mers and having full coverage.

This trades off coverage with ambiguity, since smaller k-mers are more likely to contain repeats and that might not lead to a single solution.

Typos. Another limitation of sequencers is that they can misread nucleotides. If we perform multiple reads – some containing the correct nucleotides, some not – we’ll end up with a graph where some paths are valid and some are not. It’s possible to remove them via heuristics but they’re not perfect and can lead to the removal of valid paths.

### Conclusion

While following the textbook [1] I felt a bit lost due to so many detours and kept losing track of the main problem being solved. Because the textbook is meant to also be accessible to people without prior knowledge of Computer Science, so it does need to provide the base for concepts such as Graph Theory.

One think I missed from the content were a section for experiment results. Bioinformatics is a highly applied branch of Computer Science and all of these methods are heuristics or based on approximate models. I’d be interested in knowing how well they perform in practice.

What I liked the most about the presentation style is that it provides a simpler solution first, describe the issues with it and then provide a better solution. This helps understanding on why a given algorithm is this or that way.

Reconstructing a string using (k,d)-mers in an efficient way seems like an open problem, given the solution presented requires brute force in the worst case. I wonder if there has been any progress since.

### References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[2] Wikipedia – Sequence Assembly

# 2018 in Review

This is a meta-post to review what happened in 2018.

## Posts Summary

This year I set out to learn about Bioinformatics. I completed the Bioinformatics class on Coursera. Under this umbrella I wrote about Cell Biology and DNA Sequencing. I’m on my way to write about DNA Fragment Assembly, but wanted to work out the theory behind it first, which led me to Eulerian Circuits and De Bruijn Graphs.

I was curious about current technologies such as Blockchain and Two-factor Authentication and wrote a bit about them.

One of my resolutions for last year was to learn the Rust programming language. I implemented the code from few of my posts using it, including the HyperLogLog data structure and a solver for a game called Bulls and Cows. I still ventured a bit with OCaml by learning BuckleScript (JavaScript with OCaml syntax).

I continued my slow progress in studying Distributed Systems. This year I read Google’s F1 Database paper and wrote about LSM Trees.

Besides BuckleScript, I haven’t dedicated too much time to Web Development topics, the other one being Layout properties of CSS.

The most popular post is still the 2014 Introduction to the Parsec Library, with 1.3k visits. From this year, the recreational math problem, Bulls and Cows was most viewed. Overall the blog had a total of 9.6k visitors.

I kept the resolution to post once a month on average. The blog completed 6 years with 79 posts.

## Resolutions for 2019

I’ll repeat my resolutions from 2018 for 2019. I don’t think I learned nearly the minimum of Rust, especially around memory management, and I’ve only scratched the surface on Bioinformatics. Besides DNA analysis I learned more about other problems like protein folding that seem exciting.

I haven’t done any mobile projects and only read one paper, so I’ll put these on the bucket list as well.

## Personal

The end of the year is a good time to look back and remember all the things I’ve done besides work and the technical blog.

### Trips

I enjoy traveling and 2018 had plenty of trips. I haven’t had been to Europe before and this year I happened to go twice! Once for work, to England and another time for pleasure, to Greece.

In England I explored mostly around London including the cities of Bath and Dover.

Top: Tower Bridge; Iconic double-decker bus; Rosetta Stone at the British Museum. Bottom: Roman Baths; Dover Cliffs; Windsor Castle.

The trip to Greece included Athens, Santorini and a train ride to Kalambaka, to see the Meteora monasteries.

Top: Athens seen from the Acropolis, the Parthenon, and Santorini. Bottom: Temple of Zeus in Athens, a monastery on top of a mountain and the Akrotiri museum.

There were also trips around the US, including Albuquerque in New Mexico, New Orleans in Louisiana and Los Angeles in California.

Top: Taos Pueblo near Santa Fe NM, Petroglyphs in Albuquerque NM, Venice Canals in Los Angeles. Bottom: Getty Museum in Los Angeles; Jackson Square in Louisiana; French Quarter in Louisiana.

There was also a trip to Montana, to the Glacier National Park. I really like National Parks and I’m glad to have visited this one, which is very beautiful.

Glacier National Park: Iceberg Glacier, Mountain Goat and Bearhat Mountain

### Books

This year I read a lot of non-fiction, especially science-related. My favorites science books were:

•  Blind Watchmaker from Richard Dawkins. He delves into Darwin’s theory of evolution to  conclude it’s the most probable explanation for the existence of life on Earth.
• Genome: The Autobiography of a Species in 23 Chapters by Matt Ridley is a highly engaging and elucidating tour of our genome. Each chapter is dedicated to one chromosome and he provides an example of trait or disease related to it.
• The Ghost Map: The Story of London’s Most Terrifying Epidemic – and How It Changed Science, Cities, and the Modern World by Steven Johnson. It describes some fascinating detective work from a doctor during a time we knew a lot less about diseases.

In the realms of humanities,

• Enlightenment Now: The Case for Reason, Science, Humanism, and Progress by Steven Pinker is a thorough presentation of facts and data to back the claim that despite localized set backs and short-term regressions, the world has been becoming more progressive. The idea that stuck the most with me is that each new generation tends to be more progressive than their parents, which shines a optimist light to a more humane future.
• Why Nations Fail: The Origins of Power, Prosperity, and Poverty makes the claim that the success of a nation has nothing to do with geography, race or culture. There is a lot of world history in this book and I learned a bunch about different countries.

I also enjoyed some biographies,

• I Am Malala – inspiring story from a Pakistani girl who went on to win the Nobel Peace prize in 2014. Great overview of the history of Pakistan, and the life of a civilian under the regime of the Taliban.
• Born a Crime – The comedian Trevor Noah had a pretty happening life. The book covers the recent history of South Africa and especially the Apartheid. He provides an interesting perspective on growing up on the later part of that regime and for being the son of a black mother and a white father.

and reading stuff related to trips,

• For Greece, I chose The King Must Die by Mary Renault. It is a fiction written by Mary Renault with most of it set in the mythical kingdom of Minos in Crete. I really like the fact it alludes to Greek myths but the story itself does not rely on supernatural elements.
• For Montana, I picked A River Runs Through It by Norman Maclean. It’s a short story set in rural Montana and a constant theme is family relations, fishing and the big questions of life.
• A Modern History of Japan by Andrew Gordon. I was in Japan in 2017, not in 2018, but I only managed to finish the book this past year. I learned a bunch about the recent Japanese history, but not in enough detail to change how I thought about the experiences from my trip.

### Movies

I haven’t watched many movies but really enjoyed Coco and Crazy Rich Asians.

# De Bruijn Graphs and Sequences

Nicolaas Govert de Bruijn was a Dutch mathematician, born in the Hague and taught University of Amsterdam and Technical University Eindhoven.

Irving John Good was a British mathematician who worked with Alan Turing, born to a Polish Jewish family in London. De Bruijn and Good independently developed a class of graphs known as de Bruijn graphs, which we’ll explore in this post.

## Definition

A de Bruijn graph [1] is a directed graph defined over a dimension n and a set S of m symbols. The set of vertices in this graph corresponds to the m^n possible sequences of symbols with length n (symbols can be repeated).

There’s a directed edge from vertex u to v if the sequence from v can be obtained from u by removing u’s first element and then appending a symbol at the end. For example, if S = {A, B, C, D}, n = 3 and u = ABC, then there’s an edge from ABC to BC*, that is, BCA, BCB, BCC and BCD.

## Properties

We can derive some basic properties for de Bruijn graphs.

1) Every vertex has exactly m incoming and m outgoing edges.

We saw from the example above that ABC had edges to any vertex BC*, where * is any of the m symbols in S. Conversely, any sequence in the form *AB can be transformed into ABC, by dropping the first symbol and appending ‘C’.

2) Every de Bruijn graph is Eulerian.

In our last post we discussed about Eulerian graphs and learned that a necessary and sufficient condition for a directed graph to have an Eulerian cycle is that all the vertices in the graph have the same in-degree and out-degree and that it’s strongly connected. The first condition is clearly satisfied given the Property 1) above.

To see that a de Bruijn graph is strongly connected, we just need to note that it’s possible to convert any sequence into another by removing the first character and replacing the last with the appropriate one in at most n steps. For example, given the string ABC, we can convert it to BDD by doing ABC -> BCB -> CBD -> BDD. Since each such step corresponds to traversing an edge in the de Bruijn graph, we can see it’s possible to go from any vertex to another, making the graph strongly connected.

3) A de Bruijn graph over the set of symbols S and dimension n is the line graph of the de Bruijn graph over set S and dimension n – 1

A line graph of a given graph G has vertices corresponding to edges in G, and there are edges between two vertices if the corresponding edges in G share a vertex. More formally, let G = (V, E) be an undirected graph. The line graph of G, denoted by L(G) has a set of vertex V’ corresponding to E.  Let u’, v’ be two vertices from V’, corresponding to edges e1 and e2 in E. There’s an edge between u’ and v’ if e1 and e2 have one vertex in common.

It’s possible to generalize this to directed graphs by changing the definition of edges slightly: let u’, v’ be two vertices from V’, corresponding to the directed edges e1 = (a, b) and e2 = (c, d) in E. Then there’s a directed edge from u’ to v’ if and only if b = c.

We can gain an intuition on Property 3 by looking at an example with set S = {0, 1} and constructing a de Bruijn graph with n = 2 from one with n = 1. In Figure 1, the vertices from n = 2 are the labeled edges of n = 1. The edges in n = 2 correspond to the directed paths of length 2 in n = 1. We highlighted in red one of such paths. In n = 1, the path is given by (0, 1) and (1, 1), which became (01, 11) in n = 2.

Figure 1: Constructing a De Bruijn graph over  symbols{0, 1} and dimension n = 2 from one with dimension n = 1

4) Every de Bruijn graph is Hamiltonian

This follows from Properties 2 and 3. We claim that an Eulerian cycle in a De Bruijn graph in dimension n is a Hamiltonian path in dimension n + 1. That’s because we visit every edge exactly once and each edge corresponds to a vertex in the graph in dimension n + 1. Given two consecutive edges in the Eulerian cycle in dimension n, (u, v) and (v, w), from Property 3 we know that there’s an edge from the corresponding vertex (u, v)’ to vertex (v, w)’ in dimension n + 1.

## De Bruijn Sequence

The de Bruijn sequence of dimension n on a set of symbols S, denoted B(S, n), is a cyclic sequence in which every possible sequences of length n appears as substring. The length of such sequence is |S|^n.

Since |S|^n is also the number of distinct sequences of length n, we can conclude that this sequence is the shortest possible. To see why, let B be a de Bruijn sequence. We can assign an index p to each sequence s of length n based on where it appears in B such that the substring B[p, p + n – 1] represents s. Since each of the |S|^n sequences are distinct, they cannot have the same index p. Hence, there must be at least |S|^n indexes, and thus B must be at least that long.

It’s possible to construct a de Bruijn sequence B(S, n) from the Hamiltonian path of a de Bruijn graph over S and dimension n. Two adjacent nodes in the Hamiltonian path share n-1 symbols, so if we start with a vertex v, each new vertex in the path only adds one symbol. It would have a total of n + (|S|^n – 1), but since the last n-1 symbols of the sequence overlap with the beginning when we wrap in a cycle, the cyclic sequence has length |S|^n.

Note that we can construct an Hamiltonian cycle for a de Bruijn graph in polynomial time because it’s equivalent to the Eulerian path in one dimension below. Hence we have a polynomial time algorithm to construct the de Bruijn sequence.

## Applications

#### Cracking Locks

A de Bruijn sequence can be used to brute-force a lock without an enter key, that is, one that opens whenever the last n digits tried are correct. A naive brute force would need to try all |S|^n typing n digits every time, for a total of |S|^n. Using a de Bruijn sequence we would make use of the overlap between trials, and only need to type |S|^n digits in total.

#### Finding the Least Significant Bit

The other interesting application mentioned in [2] is to determine the index of the least significant bit in an unsigned int (32-bits). The code provided is given by:

Let’s understand what the code above is doing. For now, let’s assume v > 0 and we’ll handle v = 0 as a special case later.

In the code above, (v & -v) has the effect of “isolating” the least significant bit. Since v is unsigned, -v is its two’s complement, that is, we complement the digits of v (~v) and add one. Let p be the position of the least significant digit in v. The bits in positions lower than p will be 1 in ~v, and in position p it’s a 0. When incremented by 1, they’ll turn into 1 in position p and 0 in the lower positions. In the positions higher than p, v and -v will be have complementary bits. When doing a bitwise AND, the only position where both operands have 1 is p, hence it will be the number (1 << p) (or 2^p).

Then we multiply the result above by 0x077CB531U which is the de Bruijn sequence B({0, 1}, 5) in hexadecimal. In binary this is 00000111011111001011010100110001, which is a 32-bit number.  Because v & -v is a power of 2 (2^p), multiplying a number by it is the same as bit-shifting it to the left p positions. Then we shift it to the right by 27 positions, which has the effect of capturing the 5 most significant bits from the resulting multiplication. If we treat the number as a string of characters (note that most significant bits are the first characters), the left shift followed by the right shift is equivalent to selecting a “substring” from position p to p+5.

For example, if p = 13, a left shift on 00000111011111001011010100110001 would result in 10010110101001100010000000000000. Then a right shift of 27, would pick the 5 leftmost bits, 10010. If we treat 00000111011111001011010100110001 as a string, 10010 shows up as a substring 00000111011111001011010100110001 in positions [13, 17].

Since this is a de Bruijn sequence for n = 5, every substring of length 5 corresponds to a unique 5-bit number and conversely every 5-bit number is present in this sequence. Now we just need to keep a map from the 5-bit number we obtained via the bit manipulation to the actual number we wanted, which we store in MultiplyDeBruijnBitPosition. Since 10010 is 18, we’ll have an entry MultiplyDeBruijnBitPosition[18] = 13.

Finally, for the special case where v = 0, we have that v & -v is 0 and the algorithm will return 0.

#### Assembling DNA Fragments

In [3] Compeau and Pevzner proposed a method to assemble fragments of DNA into its original form. The problem can be modeled as the k-universal circular string problem.

Definition: Consider a list of sequences s_1, s_2, …, s_n, each of which having the same size k, having the property that s_i’ s suffix and s_i+1 ‘ s prefix overlap in k-1 positions. That is, the last k-1 characters in s_i are the same as the first k-1 characters in s_i+1. We are given the sequences in no particular order. The objective is to find a composed string S which is the result of the overlap of s_1, s_2, …, s_n in order.

This problem can be modeled as a de Bruijn graph where each sequence is associated with a vertex. If sequence s_i’s suffix and s_j’s prefix overlap in k-1 positions, we add a directed edge from vertex s_i to s_j. We then find an Hamiltonian path in the de Bruijn graph and the order in which the vertices are visited will give us the desired string.

## Variants: The Super-permutation

One variant to the de Bruijn sequence problem is to, instead of finding a universal sequence containing all possible sequences of length n, find one containing all the permutations of the symbols in S. Instead of the |S|^ n sequences as input, we’ll have |S|! sequences. This is know as the Super-permutation problem.

For example, for S = {1, 2}, it want to find a sequence containing: 12 and 21. The sequence 121 is a possible solution. For S = {1, 2, 3}, we have now 123, 132, 213, 231, 312 and 321. The shortest 123121321. John Carlos Baez tweets about this problem in [4]. Finding the shortest sequence that includes all permutations is an open problem!

We know optimal solution for n up to 5. The best known lower bound for this problem is n! + (n−1)! + (n−2)! + n − 3 while the upper bound is n! + (n−1)! + (n−2)! + (n−3)! + n − 3 [5].

## Conclusion

In this post I was mainly interested in learning more about de Bruijn graphs after reading about them in Bioinformatics Algorithms by Compeau and Pevzner [3]. I ended up learning about de Bruijn sequences and realized that the problem was similar to one I read about recently on John’s Twitter. It was a nice coincidence.

## References

[1] Wikipedia: De Bruijn graph
[2] Wikipedia: De Bruijn sequence
[3] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[5] Wikipedia: Superpermutation

# Eulerian Circuits

Leonhard Euler was a Swiss mathematician in the 18th century. His paper on a problem known as the Seven Bridges of Königsberg is regarded as the first in the history in Graph Theory.

The history goes that in the city of Königsberg, in Prussia, there were seven bridges connecting different mass of lands along the Pregel river (see Figure 1). The challenge was to find a path through the city that crossed the bridges exactly once. Euler showed that no such solution existed.

Interesting unrelated fact: Today Königsberg called Kaliningrad in Russia, and Kaliningrad is actually separated from Russia geographically, lying between Lithuania and Poland.

Figure 1: Map of Königsberg and the seven bridges. Source: Wikipedia

The solution to the Seven Bridges of Königsberg problem eventually led to a branch of Mathematics known as Graph Theory. In this post we’ll be talking about the theoretical framework that can be used to solve problems like the Seven Bridges of Königsberg, which is known as Eulerian Circuits.

We’ll provide a general definition to the problem, discuss a solution and implementation, and finally present some extensions and variations to the problem.

## Definition

Let G(V, E) be a connected undirected graph, where V is the set of vertices and E the set of directed edges, and where (v, w) denotes an edge between vertices v and w. The Eulerian circuit problem consists in finding a circuit that traverses every edge of this graph exactly once or deciding no such circuit exists.

An Eulerian graph is a graph for which an Eulerian circuit exists.

## Solution

We’ll first focus on the problem of deciding whether a connected graph has an Eulerian circuit. We claim that an Eulerian circuit exists if and only if every vertex in the graph has an even number of edges.

We can see this is a necessary condition. Let v be a node with an odd number of edges. Any circuit traversing all edges will have to traverse v. Moreover, it will have to use one edge to “enter” v and one edge to “leave” v. Since this circuit can traverse each edge no more than one time, it will have to use different edges each time, meaning it needs 2 edges every time it crosses v. If there are an odd number of edges, one edge will be left unvisited.

To show this is sufficient, we can provide an algorithm that always finds an Eulerian circuit in a graph satisfying these conditions. Start from any vertex v and keep traversing edges, deleting them from the graph afterwards. We can’t get stuck on any vertex besides v, because whenever we enter an edge there must be an exit edge since every node has an even number of edges. Thus eventually we’ll come back to v, and this path form a circuit.

This circuit doesn’t necessarily cover all the edges in the graph though, nor it means that are other circuits starting from v in the remaining graph. It must be however, that some node w in the circuit we just found has another circuit starting from it. We can repeat the search for every such node and we’ll always find another sub-circuit (this is a recursive procedure, and we might find sub-sub-circuits). Note that after we remove the edges from a circuit, the resulting graph might be disconnected, but each individual component is still Eulerian.

Once we have all the circuits, we can assemble them into a single circuit by starting the circuit from v. When we encounter a node w that has a sub-circuit, we take a “detour” though that sub-circuit which will lead us back to w, and we can continue on the main circuit.

## Implementation

We’ll use the algorithm first described by Hierholzer to efficiently solve the Eulerian circuit problem, based on the proof sketched in the previous session.

The basic idea is that given a graph and a starting vertex v, we traverse edges until we find a circuit. As we’re traversing the edges, we delete them from the graph.

Once we have the circuit, we traverse it once more to look for any vertices that still have edges, which means these vertices will have sub-circuits. For each of these vertices we merge the sub-circuit into the main one. Assume the main circuit is given by a list  of vertices $(v, p_2, ... , p_k-1, w, p_k+1, ..., p_n-1, v)$ and w is a vertex with a sub-circuit. Let $(w, q_1, ..., q_m-1, w)$ be the sub-circuit starting from w. We can construct a new circuit $(v, p_2, ..., p_k-1, w, q_1, ..., q_m-1, w, p_k+1, ..., p_n-1, v)$.

Let’s look at a specific implementation using JavaScript (with Flow). The core of the algorithm implements the ideas discussed above:

The complete code is on Github.

## Analysis

We’ll now demonstrate that the algorithm described above runs in linear time of the size of the edges (i.e. O(|E|)).

Note that find_circuit() is a recursive function, but we claim that the number of times the while() loop executes across all function calls is bounded by the number of edges. The key is in the function:

graph.getNextEdgeForVertex(vertex);

graph is a convenience abstraction to an adjacency list, where for each vertex we keep a pointer to the last edge visited. Because of this,  getNextEdgeForVertex() will visit each edge of the graph at most once and we never “go back”. Since the graph object is shared across all function calls (global), we can see that the number of calls to getNextEdgeForVertex() is bounded by O(|E|), so is the number of times all while() loops execute.

Now we just need to prove that every other operation in the while loop is O(1). The only non-obvious one is:

graph.deleteEdge(edge);

This is a lazy deletion, meaning that we just set a flag in edge saying it’s deleted and it will later be taken into account by callers like graph.getNextEdgeForVertex() and graph.getDegree(). Hence, this is an O(1) operation.

For getNextEdgeForVertex(), we must skip edges that have been deleted, so we might need to iterate over a few edges before we find an undeleted one (or none if the graph is not Eulerian – in which case we terminate the algorithm). Since we’re still always processing at least one edge in every call to getNextEdgeForVertex() the argument about the total calls being bounded by O(|E|) holds.

In order for getDegree() to be an O(1) operation, we need to keep a non-lazy count of the degree of a vertex, but we can do it in O(1) when deleting an edge.

Finally, let’s analyze the second loop. The number of iterations is proportional to the length of the circuit. Since every possible circuit found (including the ones found recursively) are disjoint, the total number of times we loop over the vertices from circuits (across all function calls) is also bounded by the number of edges.

We already saw getDegree() is O(1) even with lazy deletion. The remaining operation is

path.insertAtVertex(vertex, subPath);

if we store the paths as a linked list of vertices, inserting subPath at a given node can be done in O(1) if we keep a reference from each vertex to its last (any actually) occurrence in the path.

## Directed Graphs

We can extend the definition of Eulerian graphs to directed graphs. Let G(V, A) be a strongly connected graph, where V is the set of vertices and A the set of directed edges, and where (v, w) indicate a directed edge from v to w. The Eulerian circuit problem for a directed graph consists in finding a directed circuit that traverses every edge of this graph exactly once or deciding no such circuit exists.

It’s possible to show that such a circuit exists if and only if the strongly connected directed graph has, for each vertex v, the same in-degree and out-degree. The algorithm is essentially the same.

## Counting Eulerian Circuits in directed graphs

It’s possible to count the number of different Eulerian circuits in a directed graph. According to the BEST theorem (named after de Bruijn, van Aardenne-Ehrenfest, Smith and Tutte) [3], the number of Eulerian circuits in a directed graph can be given by [4]:

$ec(G) = t_w(G) \prod_{v \in V}(deg(v) - 1)!$ (1)

Where deg(v) represents the in-degree (or out-degree) or a vertex v and t_w(G) is the number of arborescences rooted in a vertex w (simply put, an arborescence is analogous to a spanning tree for a directed graph – but we can only include edges that are directed away from the root).

It’s possible to show that t_w(G) is the same for any vertex w if G is Eulerian. We can compute t_w(G) via the Matrix-Tree theorem [2], which says t_w(G) is equal to the determinant of the Laplacian of G without vertex w. Let’s try to understand the idea behind this equation.

The mapping from an arborescence to an Eulerian path can be made by the following. Let r be the root of a possible arborescence of G. Now, let r be the reference starting point for an Eulerian path in G (note this is just for reference, since there’s no starting point in a circuit).

We say that an Eulerian path is associated with a given arborescence if for each vertex v, the last edge passing through v, say (v, v’), belongs to the arborescence. This is more clear with an example. Consider the digraph from Figure 2. Here we’ll consider the arborescences rooted in A.

Figure 2: Directed Graph

This graph has 2 possible arborescences depicted on the left in Figures 3 and 4. In Figure 3, we can see that the edge (B, D) has to be visited before (B, C) because (B, C) is in the arborescence.

Figure 3: One of the arborescences of G and a corresponding Eulerian circuit

Now, in Figure 4, because it’s (B, D) that’s in the arborescence, it has to be visited after we visit (B, C).

Figure 4: Another of the arborescence of G and a corresponding Eulerian circuit

Note that there can be more than one Eulerian path to a given arborescence. If B had more out-edges, we’d have multiple choices, since the arborescence only specifies the last edge to be taken, not the intermediate ones. More specifically, imagine B had k out-edges. Then we could traverse the first k-1 in any combination of orders, which leads to a total of (k – 1)! ways of doing so.

The same applies to all other nodes. Due to properties of Eulerian circuits, the choice of the out-edge at a given node can be seen as independent of the choice at other nodes, so the total possible Eulerian circuits corresponding to any arborescence is given by the product of the degrees from equation (1), namely:

$\prod_{v \in V}(deg(v) - 1)!$ (2)

The key property of categorizing Eulerian circuits into arborescence classes is that they’re disjoint, that is, a Eulerian circuit corresponds to exactly one arborescence. This, in conjunction with the fact that the vertices degrees in Equation (2) are from the original graph, and hence independent of a arborescence, lead us to the two independent factors in equation (1).

## Counting Eulerian Circuits in undirected graphs

Counting Eulerian circuits in undirected graphs is a much harder problem. It belongs to a complexity class known as #P-complete. This means that:

1. It belongs to the #P class, which can informally be seen as the counting version of NP problems. For example: deciding whether a given graph has an Hamiltonian circuit (path that traverses all vertices exactly once) is a problem in the NP class. Counting how many Hamiltonian circuits existing in that graph is the corresponding problem in the #P class.
2. It belongs to the #P-hard class, which means that any problem in #P can be reduced to it via a polynomial-time transformation.

Valiant proved the first condition in [5] while Brightwell and Winkler proved the second in [6] by reducing another #P-complete problem (counting Eulerian orientations) to it.

Note that a problem in the #P class is as hard as the equivalent class in NP, because we can reduce a problem in NP to #P. For example, we can decide whether a graph has an Hamiltonian circuit (NP problem) by counting the number of circuits it has (#P problem). The answer will be “yes” if it the #P version returns a number greater than 0 and “no” otherwise.

Because the problem of counting Eulerian circuits in an undirected graph being in #P, we can conclude that there’s no efficient (polynomial time) algorithm to solve it unless P = NP.

## Conclusion

In this post we covered Eulerian circuits in an informal way and provided an implementation for it in JavaScript. I spend quite some time to setup the JavaScript environment to my taste. I strongly prefer using typed JavaScript (with Flow) and using ES6 syntax. I decided to write it in JavaScript with the potential to create a step-by-step interactive tool to demonstrate how the algorithm works.

I was familiar with the concept of Eulerian circuits, but I didn’t remember the algorithms to solve it, even though I was exposed to one of them in the past. It was a good learning experience to write the code from scratch to really understand what I was doing.

This is the first time I see the #P complexity class. It’s always nice to learn about new theories when digging further on a specific topic.

## References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[2] Matrix-Tree Theorem for Directed Graphs – Margoliash, J.
[3] Circuits and trees in oriented linear graphs – Aardenne-Ehrenfest, van T., Bruijn, de N.G.
[4] Wikipedia – BEST Theorem
[5] The complexity of computing the permanent – L. G. Valiant
[6] Counting Eulerian circuits is #P-complete – Brightwell, G. and Winkler, P.

# Blockchain

Blockchain has become a very popular technology recently due to the spread of Bitcoin. In this post, we’ll focus on the details of blockchain with a focus on Computer Science, studying it as a distributed data structure.

### Motivation

Blockchain is a distributed data structure that can be used to implement a distributed ledger system. A distributed ledger orchestrates important operations (for example financial transactions) without the need of a centralized arbiter (such as a financial institution).

The reason to prefer decentralized systems could be from costs of operations: having a few financial institutions mediate all transactions require a lot of work; To avoid a single point of failure (more reliable), and finally to not have to trust a few companies with our assets.

Additionally, by being decentralized, the expectation is that it becomes less likely to regulate and thus far it has enable global currencies like bitcoin.

### Challenges

The main challenge with a distributed ledger is how to distribute the work and data across nodes in a network and, more importantly, how to make sure we can trust that information.

An initial naive idea could be to store a copy of all accounts balances in every node of the network. The problem of storing only the balances of accounts is that it’s very hard to verify whether a given payment went through or make sure the balances are consistent. To address that, the blockchain also stores the whole history of transactions that led to the current balances (think of version control).

Safety. Even if we have the whole history of the transactions, it’s possible for a bad actor to tamper with this history on their own benefit, so we need a safety mechanism to prevent that.

Consistency. Finally, because we store copies of data that is changing all the time in multiple machines, we’ll invariably hit problems with consistency and sources of truth.

Blockchain is a data structure designed to address these challenges as we shall see next. We’ll use bitcoin as the application when describing examples.

### Data Structure

A blockchain is a chain of entities called blocks that is replicated to several nodes in a network. Each block has a hash which is computed as a function of its contents and the hash of the previous node in the chain.

For the bitcoin case, the content of the block is a set of transactions adding up to 1MB. The hash of the contents is the hash of the combination of the hashes of individual transactions. More specifically, these hashes can be organized in a binary tree (also known as Merkle Tree) where the transactions hashes are the leaves and the hashes of individual inner nodes are the hash of the concatenation of the hashes of the children.

Merkle Tree

Source of truth. There might be several different versions of the chain around the network, either due to inconsistency or bad actors. The assumption is that the chain with most nodes that is agreed upon by the majority of the nodes (over 50%) is the accurate and most up-to-date version of the chain and the one that should be trusted.

User C receiving the chains from A and B. Since B’s chain is longer, C will take it as the source of truth.

### Inserting Blocks

The insertion consists of adding a new transaction to the blockchain. In terms of our bitcoin example, this is essentially user A sending some amount of money to user B.

To start the process, node A broadcasts a message with the transaction, signing it with its private key. The other nodes on the network have node A’s public key, so they can check the authenticity of the transaction.

The transaction stays in a pool of unresolved transactions. At any time, there are many nodes in the network constructing a new block, which contains several transactions. These nodes are also called miners. Not all nodes in the network need to be miners. Note that each block can pick any transaction from the pool so the set of transactions in a block under construction can be different between miners.

Adding a transaction to its block consists of verifying things like its authenticity, the validity (e.g. to make sure there are enough funds to be spend), then inserting it to the Merkle Tree, which allows recomputing the root hash in O(log n) time.

Each block should be around 1MB in size, so when a miner is done collecting enough transactions it can start wrapping up its work. It needs to compute the block hash which is a string such that when concatenated with the Merkle tree root hash and the previous block in the largest chain will generate a hash with k leading 0s. Since the hash function used is cryptographic, it cannot be easily reverse-engineered. The only way to find such string is via brute force. The number k is a parameter that determines the difficulty in finding the hash (the value is controlled by some central authority but rarely changes). The idea of this hash computation is that it’s purposely a CPU-expensive operation, so that it’s too costly for attackers to forge hashes as we’ll see later.

Once it finds the hash, it can broadcast the new block to the network where other miners can start to validate this block. They can check that:

• The transactions in the block are valid
• The block was added to the largest chain known
• The hash generated for the block is correct

The verification step should be much cheaper than the process of generating the hash. Once the block passes the validation, the miner adds it to the largest chain. Now, when there’s a call to get the largest chain, this last block will be included. On the other hand, if any of the validation steps failed, the block is rejected.

Because of this, miners building their blocks can also periodically check for updates in the network for larger chains. If, while they’re building their blocks or computing the hash a new larger chain arrives, it’s better to start over, since it will be rejected later anyway.

### Checking for valid transactions

In the bitcoin example, suppose I’m a vendor and I want to verify that your payment to me went through before I ship you my product.

To check whether a transaction was validated, a node just needs to ask for the current longest blockchain, which is the source of truth, and see if the transaction appears in any of the blocks.

### Double-Spending

Say user A has exactly 1 coin, and that it sends to B as payment. Immediately after, A sends that same coin to another user C. How does C make sure that the coin it’s receiving is not spent before?

In a decentralized ledger, this could lead to problems since different nodes could disagree to where the coin went. In the blockchain, there are two scenarios:

• A to B  got included in a block first which later became part of the longest chain (and hence the source of truth). The transaction from A to C would be rejected from being added to future blocks due to insufficient funds.
• A to B and A to C got picked up to be included in the same block. The miner constructing the block would consider the second transaction it picked up invalid. Even if it was malicious and it included both transactions, it would fail validation when broadcasting it to the network.

The case in which A to C gets picked up first is analogous.

### Changing History

Suppose that user A, after performing a payment to and receiving its product from user B, wants to reuse the transaction and send the money to another user C. In theory it could edit the destination of the transaction from the chain (or even replace the block entirely), but to make this modified chain become the source of truth, it would also need to make it the longest.

As we’ve seen above, adding a block to a chain with a valid hash is extremely expensive. To make it the new source of truth, it would need to add at least one block in addition to the forged one. This means it would need to outpace all the other miners in the network that are also continuously computing new blocks in parallel. The paper [1] claims that unless the attacker controls over 50% of the CPU power in the network, this is extremely unlikely to succeed.

Note that a user cannot modify the transactions it is not the sender of, since a transaction from user A to user B is signed with A’s private key. If user C wanted to redirect A’s money to itself, it would need to know A’s private key, so it could sign the transaction pretending it’s A.

### Incentives

As we mentioned earlier, not all nodes in the network need to be miners. So why would any node volunteer to spend energy in form of CPU cycles to compute hashes? In bitcoin, the incentive is that the miner whose block gets added to the chain receives bitcoins as reward. These bitcoins are collected as fees from the normal transactions.

Bitcoin Mining Farms to compute block hashes and collect rewards

There must incentives to make sure that a miner won’t keep building blocks with one or zero transactions, to skip doing any validations for transactions. The reward should take into account the transaction, and possibly its value and their age (to make sure “unimportant” transactions won’t be ignored forever).

### Conclusion

In researching the material for this post, I ran into a lot of articles and videos covering blockchain without going into much detail. As always, I only realized my understanding had a lot of gaps once I started writing this post.

Overall I know a good deal more about what Blockchain is and what it’s not. I’m curious to see what other applications will make use of this technology and if we can come up with a trustworthy system that doesn’t require wasting a lot of CPU power.

### References

[1] Bitcoin: A Peer-to-Peer Electronic Cash System
[2] Hashcash: A Denial of Service Counter-Measure – A proof-of-work algorithm
[3] Blockchain: how mining works and transactions are processed in seven steps
[4] Bitcoin P2P e-cash paper – The proof-of-work chain is a solution to the Byzantine Generals’ Problem

# Two-factor authentication

In this post we’ll talk about some popular security measures to protect user accounts on the web via two-factor authentication. The term refers to the requirement of two methods of authentication for logging in into a given account. The first method is mostly always a password, and the second is one of the methods we’ll describe in this post.

### Why do we need an additional form of authentication?

In an ideal world, people would have strong (long, not complex) passwords, which would never get stolen and people would never forget them. In the real world, applications have to deal with two scenarios: 1) someone else knows your password or 2) you forgot your password.

#### Scenario 1: They are not who they claim to be

If someone else knows your password, the system needs to somehow know that this person is not you.

They can then employ a secondary method of authentication to verify that you are yourself. In theory they could ask for a secondary password or ask a security question. The problem with these is that they’re exposed to the same set of vulnerability that might have compromised the original password in the first place, for example, the password is too easy to crack or there was a breach of database storing plain text passwords. In addition, since these secondary methods are to be used in very rare occasions, it’s extremely likely you’ll incur in the second problem, i.e. forget your password.

Physical devices. Nowadays, security systems can almost always rely on the fact that even if someone has your password, they do not have your physical belongings (e.g. cellphone). Some websites allow users to setup the requirement to use both a password and a secondary authentication to access the account.

#### Scenario 2: I’m who I claim to be

To address the problem of a user losing a password, some websites offers a recovery mechanism, usually by sending a secure email with a link to re-set the password or, in case of email applications like GMail, allowing the secondary authentication method as an alternative to inputing your password.

Websites such as GMail and Github also have a set of auto-generated “master passwords” that you can print and store in a safe place. After used, these passwords become invalid. This is one of the safest options, but it also requires more effort from the user (printing and making sure they can find the printed document when needed).

Scenario 1 deals with security, and Scenario 2 deals with usability (recovering passwords), and these are usually at odds with which other. Security systems have to find the balance between the two.

We’ll now cover three popular secondary authentication mechanisms: SMS (text messages), third-party app authentication and hardware authentication.

### SMS

In the SMS (Short Message Service) method, the server generates a short code and sends it to the user via a text (SMS) message which is valid for a few minutes. The user can then copy the code from the phone to the computer and send the code to the server which can then authenticate the initial request.

During this period of time, the user account is technically vulnerable to a very weak code (a 6-digit number) which is very easy to crack. However, this period is very narrow, which great limits the ability of a bad agent to take any action.

#### Vulnerabilities

The real danger of the SMS method is a bad agent being able to intercept the SMS message that is supposed to go to the user. According to this Wired article, the telecoms use a network called SS7 (Signaling System No. 7) to transport text messages. This network relies on trust to implement features such as roaming, which enables a person from New York to receive/send text messages when they’re traveling to Berlin. In this case a carrier in Berlin could request the user’s carrier back in New York to receive the text messages so it can deliver to the user.

This system has a vulnerability because a hacked carrier could be used to intercept the text messages by pretending it’s doing so on behalf of a user. The carriers might not do any checks to verify the authenticity of the request. Hence, if an attacker knows your email, phone number and has access to a hacked carrier, they could technically hack into your account.

### App Authentication

Another authentication method is to install a third-party app that can be used to generated the authentication codes. One popular option is the Google Authenticator App, which you can install on your phone (Android or iOS).

It uses the Time-based One-time Password algorithm or TOTP [2, 3]. The general idea is to perform a one-time registration between your phone and the server which consists of having both store a secret.

Whenever the client needs to authenticate itself, it uses the current timestamp and the secret to generate a hash, and from this hash it extracts a simpler code (6 characters) that the user copies and sends to the server. The server performs the same operation and if the generated code matches, it accepts the authentication.

The precision of the timestamp defines on how much time the user has to copy and send the code to the server. For example, the server can define the timestamp granularity to be 30 seconds. This also defines how long the server is vulnerable, since the code is usually short and hence easier to crack via brute force, so it cannot be too long.

### Hardware Authentication

A more recent approach to authentication is using a dedicated piece of hardware. YubiKey is an example of such device, which can be connected to the USB port. One way it can be used is part of the open authentication protocol called Universal 2nd Factor (U2F), developed by Google and Yubico (the company that manufactures YubiKey). We’ll describe this protocol next. In the discussion that follows we’ll refer to the Yubikey device generically as U2F.

The general flow consists of a enrollment phase, where the use registers the U2F in the target webpage. The webpage asks for a confirmation, which the user can do by tapping the U2F, which sends some information to the webpage, which stores it.

The other part is the signing phase. When this webpage needs to verify the user, say during login, it can ask the user to tap the U2F, which will send information that can be validated by the webpage to make sure it’s the same device that was registered in the first step.

#### Implementation details

One of the designs of this system is to be cross compatible and require no extra configuration from the user, like installing drivers. To achieve that, the communication between the U2F and the server is mediated via the browser. That means that the website calls a browser API (via JavaScript) which in turn communicates with the U2F. Henceforth when we refer to the communication between the U2F and the server, we’re implicitly assuming it’s done via the browser.

During the enrollment process, the device generates a pair of public and private keys (public-key cryptography). It sends the public key to the server which stores it together with other information. During the signing phase the server can generate a challenge (string), encrypt with the public key and send it to the U2F, which can decode it. At this point, the user is asked to tap the U2F. Once that it’s done, it sends the challenge back to the server encrypted with its private key. If the server can then decode the message, it can trust the U2F and authenticate the user.

The reason a new public and private key is generated at every enrollment is for privacy reasons (not security). This is to prevent the case of different websites that enable U2F, to share data between them and be able to track the user. For example, if the same public key was used for all enrollments, a site A and B would be able to identify the user via their public key and share this information among themselves. If site A is a online shopping, it could use this information to show targeted ads in site B.

Stateless U2F. The problem of having to generate a pair of public/private keys every time is that now the U2F has to store them somehow. Since another important part of design is for the U2F to be very accessible, the implication is that they also have to be cheap. Hence, the protocol cannot assume the device has embedded storage. The solution is to send the pair for the server to store!

This seems to defeat the whole purpose of using cryptography but this information is sent to the server encrypted, which only the U2F itself can decode. Now, in addition to the server storing the public key, it has to store this extra information which the protocol calls Key Handle [5]. During the signing phase it sends not only the encrypted challenge, but also the Key Handle.

Man-in-the-middle. One potential security hole could be a scam website that looks like the real one and acts as a man-in-the-middle. First, the user will provide the scam site with the username and password. The scam site can then forward these to the real site to trigger the secondary factor request, which will send down the Key Handle and encrypted challenge. The scam site will forward it back to the U2F. Then the U2F would encrypt the challenge, which would be sent to the scam site, which in turn would relay it to the real site, finally allowing the bad actor to login as the user.

To prevent that, the site origin can be stored in the Key Handle as well. Before deciding to send data back, the U2F can check if the origin of the server and match it against the data in the Key Handle. The site origin is hard to tamper with when using an HTTPS connection unless the real site’s certificates are compromised.

Vendor reliability. Another component of the security is the trust in the manufacturer of the device. It could have malicious intent or flawed implementation. To address that concern, the U2F should also contain an extra pair of attestation public-private pair of keys. The attestation is to prove the identity of the manufacturer/vendor. During the enrollment, the public key that is generated is encrypted with the private attestation key. The public attestation key is made available by some trusted organization for the server to consult. If it’s able to decode the generated public key, then it can trust the U2F vendor.

### Conclusion

In this post we covered 3 methods of extra protection to the online identity. We saw that SMS has serious vulnerability while third-party and hardware authentication are much safer, which is no surprise since SMS were not initially designed to serve as a secure channel for authentication. No method is 100% secure but recent authentication mechanisms go to great lengths to reduce the vulnerable surface area to a minimum.

Note how all these methods assume the possession of a physical device separate from the computer where we’re trying to log into. Physical devices are much harder to steal compared to piece of information like passwords.

### References

[1] Information Security – How does Google Authenticator work?
[2] Wikipedia – HMAC-based One-time Password algorithm
[3] Wikipedia – Time-based One-time Password algorithm
[4] Wired – Fixing the cell network flaw that lets hackers drain bank accounts
[5] Google U2F (Gnubby) Documents – Snapshot prior to joining FIDO

# DNA Sequencing

Frederick Sanger. Wikipedia

Frederick Sanger was a British biochemist. He is known for the first sequencing of a protein (1955) and a method for sequencing DNA that bears his name, the Sanger Method. Sanger won two Nobel Prizes in Chemistry [1].

In this post we’ll talk about one of the first steps of DNA analysis, DNA sequencing, which is obtaining the data from the DNA, how it’s performed (we’ll focus on the Sanger method) and some interesting computational problems associated with it.

This is the second post in the series of Biochemistry studies from a Computer Scientist perspective. Our first post is a brief discussion of basic concepts in Cell Biology.

## DNA Sequencing

DNA sequencing is the determination of the physical order of the nucleotide bases in a molecule of DNA.

The first living organism to have its genome sequenced was a bacteria, Haemophilus influenzae, whose genome is about 1.8 million base pairs.

Genome represents the whole set of  genetic information of an organism. For a bacteria, it’s singular circular chromosome, but for humans it’s the set of all 23 pairs of chromosomes. A base-pair (shortened as bp) refers to a pair of nucleotides (bases) that are bound together in the double-strands of DNA.

In the human genome, there are 3 billion of base-pairs, and it took 13 years for it to be completed.

There are two main methods of sequencing, Sanger and Next-generation [5]. We’ll talk about the Sanger in details and discuss briefly the Next-generation from a real-world use case.

## Sanger Sequencing

The Sanger method is able to determine the nucleotide sequence of small fragments (up to abound 900bps) [5] of DNA.

### Overview

The first step is cloning the fragment into multiple copies (like billions) by a process called amplification. This is essentially mimicking the DNA cloning process in an artificial setup. In very high level we have:

• Separate the double strands (denaturation)
• Add a special molecule (primer) to the extremity of each strand
• Extend the primer with nucleotides until it reaches the other extremity

Once this is complete, we end up with two double-stranded fragments. We can repeat the process to generate many copies of the original fragment (theoretically doubling at each step). This process is known as Polymerase Chain Reaction (PCR).

Once we have enough copies of the fragment, we do a similar process, but this time, we also allow extending the primer with a special nucleotide named (dideoxy nucleotide). The key difference is that once it’s added, a dideoxy nucleotide cannot be further extended, and it also contains a special marker that causes each different base to have a different color. The process is now the following:

• Separate the double strands (denaturation)
• Add a special molecule (primer) to the extremity of each strand
• Add to the primer either
• Regular nucleotide (non-terminal – can be further extended)
• Dideoxy nucleotide (terminal – cannot be further extended)

Now, instead of ending up with more clones, we’ll have fragments with a variety of lengths.

We then run theses fragments through a process called Capillary Gel Electrophoresis which roughly consists in subjecting the fragments to a electric field, which then causes fragments to move with speed proportional to their length (the smaller the fragment the faster it moves). Once a group of fragments (which have the same length) reach a sensor at end of the field, we make use of the color marker in the special nucleotide at the tip of the fragment to determine the base of that dideoxy nucleotide. This enables us to determine the sequence of the nucleotides in the original fragment!

To given an example, say that the original sequence is GATTCAGC. Because there are many copies of the fragment after amplification, it’s very likely that we’ll end up with fragments with all possible lengths, from 1 to 8. Since a fragment of length 1 moves faster than any other, it will reach the sensor first. We know that a fragment with length 1 is a base pair G-C, and C is a dideoxy nucleotide (if it was not, it would have continued extended further). Say the color marker for C is orange. Then the first color to be captured by the sensor will beorange. The second set of fragments to reach the sensor is of length 2, which is a fragment (G-C, A-T), where T is a dideoxy nucleotide. If we assume it has color red, that’s the color which will be captured next by the sensor. You can see that based on the colors captured by the sensor we can infer the nucleotide sequence in the original segment.

Fragments with different lengths and color markers. Image copied from Whole-Genome Sequencing

### Details

Let’s go in more details for these processes. For both cases we work with solutions. Finer grained manipulation of molecules is infeasible.

To separate the double strands we heat the solution up to 96ºC, in a process we call denaturation. The high temperature causes the hydrogen bonds between pairs of nucleotides to break.

In the same solution we have the primer molecules (aka oligonucleotides) which are carefully chosen to match the beginning of the DNA strand. They also bind to DNA at a higher temperature than the strands (e.g. 55ºC). This is important because we can now lower the temperature of the solution slowly, enough so that primers can bind, but not so low to the point where the original strands will join back together. This slow cooling is called annealing. These primers can be synthesized artificially.

Gap in understanding: how to choose the right primer, since we need to know at least some of sequence from the nucleotide in order to know which primer will end up binding there? One possible explanation is that we know the exact sequence where a DNA was cut if we restriction enzymes, since their binding site is known [9] and hence we have an idea of the result endpoints after the cut.

# □

We also add an enzyme, DNA polymerase, and nucleotides to the solution. The enzyme is able to extend the original primer segment by binding free nucleotides in the solution. In particular we use the enzyme of a bacteria that lives at 70ºC (Thermus Acquaticus), also know as Taq polymerase because it is functional at higher temperatures. Performing the replication at a higher temperature prevents the separated strands from gluing together again.

The dideoxy nucleotide are modified versions of the regular nucleotide by supressing the OH group. They can still be incorporated to the primer via the DNA polymerase, but they prevent other nucleotides to binding to them via the sugar-phosphate binding.

In addition these dideoxy nucleotides contain a fluorescent molecule whose color is unique for each different type of base. How are these molecules “inserted” into the nucleotides? The abstract of [6] states:

Avian myeloblastosis virus reverse transcriptase is used in a modified dideoxy DNA sequencing protocol to produce a complete set of fluorescence-tagged fragments in one reaction mixture.

which suggests it’s possible to synthesize them by using a specific virus.

Dideoxynucleotide vs deoxynucleotide. The lack of the OH group prevents a ddNTP from binding to the “next” nucleotide, effectively terminating the chain. Image copied from Whole-Genome Sequencing

## Next-Generation Sequencing (Illumina)

The Sanger method is a very slow process, making it infeasible for analyzing large amounts of DNAs such as the human genome.

Modern sequencers make use of the “Next-generation” methods, which consist in massive parallelism to speed up the process. The most advanced sequencer in the market is produced by a company called Illumina. As of the time of this writing, their top equipment, Hiseq X Ten, costs about $10 million dollars, can sequence about 18k full genomes a year and it costs about$1000 per genome [2, 3].

Illumina’s educational video [4], describes the process:

• Cut a long sequence into small fragments
• Have the segments attach to beads in a glass plate via the adapters. The beads are basically the primers.
• Amplify the segments (massive cloning)
• Extend the primer with fluorescent nucleotides
• Every time a nucleotide is added, it emits light which is captured by a sensor

After the process is over, we have the sequence of many fragments based on the colors that were emitted.

We can see that the principles resemble the Sanger method, but it uses different technologies to allow a very automated and parallel procedure.

This whole process is very vague and it’s hard to have a good understanding of it. It’s understandable given that a lot of the information is likely industry secret.

## Sequencing vs Genotyping in personal genomics

Some of the most popular personal genetic analysis companies, such as 23andMe, provide a service in which they analyze the user DNA for a small fee. It’s way cheaper than the full genome analysis provided by Illumina, but that’s because these companies don’t do DNA sequencing, but rather genotyping.

Genotyping is the process of determining which genetic variants an individual possesses. This is easier than sequencing because a lot of known diseases and traits can be traced back to specific regions and specific chromosomes.

This is the information you most likely want to know about yourself. Remember that the majority of DNA in complex organisms is not useful (introns). In humans genome, exome (the part of DNA consisting of exons) account for less than 2% of total DNA.

Sequencing technology has not yet progressed to the point where it is feasible to sequence an entire person’s genome quickly and cheaply enough to keep costs down for consumers. It took the Human Genome Project, a consortium of multiple research labs, over 10 years to sequence the whole genomes of just a few individuals.

## Sequence Assembly Problem

Current technology is unable to sequence large segments of DNA, so it has to break it down into small fragments. Once that is done, we need to reconstruct the original sequence from the data of individual fragments.

There are two scenarios:

• Mapping: matching fragments of DNA to existing sequences by some similarity criteria
• De-novo: reconstruct the DNA when there’s no prior data about that sequence.

There is a family of computational methods for the mapping case known as Sequence Assembly, which we’ll study in more details in a future post.

Gap in understanding: It’s unclear what exact information is expected for the De-novo sequencing. It is impossible to determine the original sequence by only having data about individual segments, in the same way it’s impossible to reconstruct a book by only knowing it’s paragraphs contents (but not their order), borrowing the analogy from Wikipedia [8].

One thing I can think of is that if we repeat the experiments multiple times, each time cutting the molecules at different places, it might be possible to infer the relative order of the segments if they’re unique enough.

For example, if the original segment is: GATTCAGC and we run two experiments, one yielding (GAT, TCAGC) and another (GATTC, AGC), then in this case there’s only one way to assemble (order) these sequences in a consistent way.

# □

## Conclusion

In this post we studied some aspects of DNA sequencing. I found the Sanger method fascinating. In Computer Science ideas usually translate to algorithms and code very directly, but in other science branches the mapping from ideas to implementation is a problem in itself.

In Mechanics for example, we have to work with the physical world, so when converting from a circular movement to a linear one requires some clever tricks.

This needs to be taken to another level in Molecular Biology, because we don’t have direct access to the medium like we do in a mechanical device, for example, we can’t directly manipulate a double strand of DNA fragment to separate it, but have to resort to indirect ways.

The field of Biotechnology seems to be progressing at such a pace that it was challenging to find good sources of information. I’m yet to find a resource that explains end to end the steps from the process that takes a sample of cells and outputs the DNA nucleotides to a computer, including the technical challenges in doing so. This is what this post aimed to do.

## References

[1] Wikipedia – Frederick Sanger
[2] Genohub – Illumina’s Latest Release: HiSeq 3000, 4000, NextSeq 550 and HiSeq X5
[3] Illumina Hiseq-X
[4] Youtube – Illumina Sequencing by Synthesis
[5] Khan Academy – DNA sequencing
[6] Science Magazine: A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides
[7] Towards Data Science – DNA Sequence Data Analysis – Starting off in Bioinformatics
[8] Wikipedia – Sequence assembly
[9] Dummies – How Scientists Cut DNA with Restriction Enzymes