# Protein Design

We previously learned about the problem of predicting the folding of a protein, that is, given a chain of amino acids, find its final 3D structure. This time we’re interested in the reverse problem, that is, given a 3D structure, find some chain of amino-acid that would lead to that structure once fully folded. This problem is called Protein Design.

In this post we’ll focus on mathematical models for this problem, studying its computational complexity and discuss possible solutions.

### Mathematical Modeling

The 3D structure of a protein can be divided into two parts: the backbone and the side chains. In the model proposed by Piece and Winfree [3], we assume the backbone is rigid and that we’ll try to find the amino-acids for the side chains such that it minimizes some energy function.

This means we’re not really trying to predict the whole chain of amino acids, but a subset of those amino that will end up on the side chains.

The amino acids on the side-chain can have a specific orientation [2, 3], known as rotamer, which in turn can be represented by a single value, its dihedral angle. We can define some arbitrary order for these amino acids and label them with an index, which we call position.

At each position there are multiple rotamers possible and we need to select them to minimize the overall energy function. More formally, for each position i, let $R_i$ be the set of rotamers available and $r_i \in R_i$ the chosen rotamer.

The model assumes the cost function of the structure is pairwise decomposable, meaning that we can account for the interaction of each pair independently when calculating the total energy:

Where $E(r_i, r_j)$ is the energy cost of the interaction between positions i and j, assuming rotamers $r_i$ and $r_j$ respectivelly. The definition of E can be based on molecular dynamics such as AMBER.

In [3], the authors call this optimization problem PRODES (PROtein DESign).

### PRODES is NP-Hard

Pierce and Winfree [3] prove that PRODES is NP-Hard. We’ll provide an informal idea of the proof here.

First we need to prove that the decision version of PRODES is NP-Hard. The decision version of PRODES is, given a value K, determine whether there is a set of rotamers such that the energy cost is less or equal K. We’ll call it PRODESd. We can then prove that PRODESd is NP-complete by showing that it belongs to the NP complexity class and by reducing, in polynomial time, some known NP-complete problem to it.

We claim that this problem is in NP because given an instance for the problem we can verify in polynomial time whether it is a solution (i.e. returns true), since we just need to evaluate the cost function and verify whether it’s less than K.

We can reduce the 3-SAT problem, known to be NP-complete, to PRODESd. The idea is that we can map every instance of the 3-SAT to an instance of PRODESd and show that the result is “true” for that instance of 3-SAT if and only if the result is “true” for the mapped instance of PRODESd.

Let’s start by formalizing PRODESd as a graph problem, with an example shown in the picture below:

a) has 3 positions with their sets of rotamers. b) each rotamer becomes a vertex grouped into sets. We pick exactly one vertex per set and try to minimize the total cost of the edges associated with the selected vertices. Image source: [3]

Now, given a 3-SAT instance, we create a vertex set for each clause $C_i$ (containing a vertex for each literal), and a vertex set for each variable $x_i$ (containing vertices T and F). For each literal $x_i$ we add an edge of weight 1 to the vertex F of the set corresponding to variable $x_i$. Conversely, for each negated literal, $\bar x_i$, we add an edge of weight 1 to the vertex T. All other edges have weight 0.

For example, the instance $(x_1 + \bar x_2 + x_3) \cdot ( \bar x_1 + x_2 + x_3) \cdot (\bar x_3)$ yields the following graph where only edges of non-zero weight are shown:

Source: [3]

We claim that this 3-SAT instance is satisfiable if and only if the PRODESd is true for K=0. The idea is the following: for each vertex set corresponding to a variable we pick either T or F, which corresponds to assigning true or false to the variable. For each vertex set corresponding to a clause we pick a literal that will evaluate to true (and hence make the clause true). It’s possible to show that if 3-SAT is satisfiable, there’s a solution for PRODESd that avoids any edge with non-zero weight.

### Integer Linear Programming

Now that we know that PRODES is unlikely to have an efficient algorithm to solve it, we can attempt to obtain exact solutions using integer linear programming model. Let’s start with some definitions:

We can define our variables as:

The object function of our model becomes:

Finally, the constraints are:

Equation (1) says we should pick exactly one rotamer for each position. Constraints (2) and (3) enforce that $x_{i, j} = 1$ if and only if $r_i = r_j = 1$.

Note: the LaTeX used to generate the images above are available here.

### Conclusion

The study of protein prediction led me to protein design, which is a much simpler problem, even though from the computational complexity perspective it’s still an intractable problem.

The model we studied is very simple and makes a lot of assumptions, but it’s interesting as a theoretical computer science problem. Still I’m not sure how useful it is in practice.

### References

[1] Wikipedia – Protein design
[2] Wikipedia – Rotamer
[3] Protein Design is NP-hard – Niles A. Pierce, Erik Winfree

# Protein Folding Prediction

In this post we’ll study an important problem in bioinformatics, which consists in predicting protein folding using computational methods.

We’ll cover basic concepts like proteins, protein folding, why it is important and then discuss a few methods used to predict them.

## Amino Acids and Proteins

Amino acids are any molecules containing an Amino (-NH2) + Carboxyl (COOH) + and a side chain (denoted as R), which is the part that varies between different amino acids [1].

Etymology. A molecule containing a carboxyl (COOH) is known as carboxylid acid, so together with the Amino part, they give name to amino acids [2].

An amino acid where R attaches to the first carbon (see Figure 1) is known as α-amino acids (read as alpha amino acids).

Figure 1: alpha and beta carbons in a Carbonyl group (link)

There are 500 different amino acids known, but only 22 of them are known to be present in proteins and all of them are α-amino acids. Since we’re talking about proteins, we’ll implicitly assume amino acids are the alpha amino ones. See Figure 2.

Figure 2: Amino acid

Amino acids can be chained, forming structures known as pepitides, which in turn can be chained forming structures like polipeptides. Some polipeptides are called proteins, usually when they’re large and have some specific function, but the distinction is not precise.

Etymology.Peptide comes from the Greek to digest. According to [3], scientists Starling and Bayliss discovered a substance called secretin, which is produced in the stomach and signals the pancreas to secrete pancreatic juice, important in the digestion process. Secretin was the first known peptide and inspired its name.

## Protein Folding

To be functional, a protein needs to fold into a 3-D structure. The sequence of amino acids in that protein is believed [6] to be all that is needed to determine the final shape of a protein and its function. This final shape is also known as native fold.

There are 4 levels of folding, conveniently named primary, secondary, tertiary and quaternary.

• Primary structure is the non-folded form, the “linearized” chain of amino acids.
• Secondary structure is formed when parts of the chain bond together via hydrogen bonds, and can be categorized into alpha-helix and beta-strands.
• Tertiary structure is when the alpha-helix and beta-sheet fold onto globular structures.
• Quaternary structure is when 2 or more peptide chains that have already been folded combine with each other.

## Prediction

The prediction of protein folds usually involves determining the tertiary (most common form) structure from its primary form. We’ve seen before in DNA Sequencing  that reading amino acid sequences is easier when its denatured into a linear chain. On the other hand, analyzing the intricate folded forms can be very hard [7], even with techniques such as X-ray crystalography.

## Why is this useful?

If the protein structure is known, we can infer its function on the basis of structural similarity with other proteins we know more about. One can also predict which molecules or drugs can efficiently bind and how they will bind to protein [7].

From Wikipedia’s Nutritious Rice for the World [8], protein prediction can also help understand the function of genes:

… prediction tools will determine the function of each protein and the role of the gene that encodes it.

## Computational Methods

In this section we’ll focus on the computation methods used to aid the prediction of protein folds. There are 2 main types, template based and de novo.

Template based methods rely on proteins with known folds which can be used as template for determining the fold of a target protein with unknown fold. This only works well if the target protein is similar enough to the “template” one, or has some shared ancestry with it, in which case they are said to be homologous.

De novo methods do not rely on existing protein for templates. It uses known physico-chemical properties to determine the native structure of a protein. We can model the folding as an optimization problem where the cost function is some measure of structure stability (e.g. energy level) and steps are taken to minimize such function given the physico-chemical constraints. Molecules that have the same bonds but different structures are known as conformations, which form the search space of our optimization problem.

Let’s now cover some implementations of de novo methods.

Rosetta

One implementation of de novo method is the Rosetta method [9]. It starts off with the primary form of the protein. For each segment of length 9 it finds a set of known “folds” for that segment. A fold in this case is represented by torsion angles on the peptide bonds. The search consists in randomly choosing one of the segments and applying the torsion of its corresponding known folds to the current solution and repeat.

Note that while this method does not rely on protein templates, it still assumes an existing database of folds for amino acid fragments. However, by working with smaller chains, the chances of finding a matching template are much higher.

AlphaFold

One major problem with de novo methods is that they’re computationally expensive, since it has to analyze many solutions to optimize them, which is a similar problem chess engines have to.

Google’s DeepMind has made the news with breakthrough performance with its Go engine, AlphaGo. They used a similar approach to create a solver for protein folding prediction, AlphaFold. This approach placed first in a competition called CASP.

CASP is a competition held annually to measure the progress of predictors. The idea is to ask competitors to predict the shape of a protein whose native fold is known but not public.

Crowdsourcing

Another approach to the computational costs involved in the prediction is to distribute the work across several machines. One such example is Folding@home, which relies on volunteers lending their idle CPU time to perform chunks of work in a distributed fashion which is then sent back to the server.

An alternative to offering idle CPU power from volunteers’ machines is to lend CPU power from their own brains. Humans are good at solving puzzles and softwares like Foldit rely on this premise to find suitable folds. According to Wikipedia:

A 2010 paper in the science journal Nature credited Foldit’s 57,000 players with providing useful results that matched or outperformed algorithmically computed solutions

## Conclusion

Here are some questions that I had before writing this post:

• What is protein folding? Why predicting it important?
• What are some of the existing computational methods to help solving this problem?

Questions that are still unanswered are details on the algorithms like the Rosetta method. I also learned about Protein Design in the process, which I’m looking into studying next. The approach used by DeepMind seem very interesting, but as of this writing, they haven’t published a paper with more details.

As a final observation, while researching for this post, I ran into this Stack Exchange discussion, in which one of the comments summarizes my impressions:

This field is pretty obscure and difficult to get around in. There aren’t any easy introductions that I know of.

I found that the information available is often vague and sparse. I’d love to find a good textbook reference to learn more details. One possible explanation is that this field is relatively new, and there’s a lot of proprietary information due to this being a lucrative field.

## References

[1] Wikipedia – Amino acid
[2] Wikipedia – Carboxylic acid
[3] The Research History of Peptide
[4] Wikipedia – Homology modeling
[5] Wikipedia – Threading (protein sequence)
[6] Wikipedia – De novo protein structure prediction
[7] Quora: Why is protein structure prediction important?
[8] Wikipedia – Nutritious Rice for the World
[9] The Rosetta method for Protein Structure Prediction
[10] AlphaFold: Using AI for scientific discovery

# Constructing Trees from a Distance Matrix

Richard Dawkins is an evolutionary biologist and author of many science books. In The Blind Watchmaker he explains how complex systems can exist without the need of an intelligent design.

Chapter 10 of that book delves into the tree of life. He argues that the tree of life is not arbitrary taxonomy like the classification of animals into kingdoms or families, but it is more like a family tree, where the branching of the tree uniquely describes the true ancestry relationship between the nodes.

Even though we made great strides in genetics and mapped the DNA from several different species, determining the structure of the tree is very difficult. First, we need to define a suitable metric that would encode the ancestry proximity of two species. In other words, if species A evolved into B and C, we need a metric that would lead us to link A-B and A-C but not B-C. Another problem is that internal nodes can be missing (e.g. ancestor species went extinct without fossils).

David Hill’s tree of life based on sequenced genomes. Source: Wikipedia

In this post we’ll deal with a much simpler version of this problem, in which we have the metric well defined, we know the distance between every pair of nodes (perfect information), and all our nodes are leaves, so we have the freedom to decide the internal nodes of the tree.

This simplified problem can be formalized as follows:

Constructing a tree for its distance matrix problem. Suppose we are given a n x n distance matrix D. Construct a tree with n leaves such that the distance between every pair of leaves can be represented by D.

To reduce the amount of possible solutions, we will assume a canonical representation of a tree. A canonical tree doesn’t have any nodes with degree 2. We can always reduce a tree with nodes with degree 2 into a canonical one. For example:

Nodes with degree 2 can be removed and the edges combined.

### Terminology

Let’s introduce the terminology necessary to define the algorithm for solving our problem. A distance matrix D is a square matrix where d_ij represents the distance between elements i and j. This matrix is symmetric (d_ij = d_ji), all off-diagonal entries are positive, the diagonal entries are 0, and a triplet (i, j, k) satisfy the triangle inequality, that is,

d_ik <= d_ij + d_jk

A distance matrix is additive if there is a solution to the problem above.

We say two leaves are neighbors if they share a common parent. An edge connecting a leaf to its parent is called limb (edges connecting internal nodes are not limbs).

### Deciding whether a matrix is additive

We can decide whether a matrix is additive via the 4-point theorem:

Four-point Theorem. Let D be a distance matrix. If, for every possible set of 4 indexes (i, j, k, l), the following inequality holds (for some permutation):

(1) d_ij + d_kl <= d_ik + d_jl = d_il + d_jk

Sketch of proof. We can derive the general idea from the example tree below:

We can see by inspection that (1) is true by inspecting the edges on the path between each pair of leaves. This will be our base case for induction.

Now, we’ll show that if we’re given a distance matrix satisfying (1), we are able to reconstruct a valid tree from it. We have that d_ik = a + e + c, d_jl = b + e + d, d_ij = a + b and d_kl = c + d. If we add the first two terms and subtract the last two, we have d_ik + d_jl - d_ij + d_kl = 2e, so we have

e = (d_ik + d_jl - d_ij + d_kl) / 2

We know from (1) that d_ik + d_jl >= d_kl + d_ij > d_kl, so e is positive.

If we add d_ik and d_ij and subtract d_jl, we get d_ik + d_ij - d_jk = 2a, so

a = (d_ik + d_ij - d_jk) / 2

To show that a is positive, we need to remember that a distance matrix satisfy the triangle inequality, that is, for any three nodes, x, y, z, d_xy + d_yz >= d_xz. In our case, this means d_ij + d_jk >= d_ik, hence d_ij >= d_ik - d_jk and a is positive. We can use analogous ideas to derive the values for b, c and d.

To show this for the more general case, if we can show that for every possible set of 4 leaves (i, j, k, l) this property is held, then we can show there’s a permutation of these four leaves such that the tree from the induced paths between each pair of leaves looks like the specific example we showed above.

For at least one one of these quadruplets, i and j will be neighbors in the reconstructed tree. With the computed values of a, b, c, d, e, we are able to merge i and j into its parent and generate the distance matrix for n-1 leaves, which we can keep doing until n = 4. We still need to prove that this modified n-1 x n-1 satisfies the 4-point theorem if and only if the n x n does.

### Limb cutting approach

We’ll see next an algorithm for constructing a tree from an additive matrix.

The general idea is that even though we don’t know the structure of the tree that will “yield” our additive matrix, we are able to determine the length of the limb of any leaf.  Knowing that, we can remove the corresponding leaf (and limb) from our unknown tree by removing the corresponding row and column in the distance matrix. We can then solve for the n-1 x n-1 case. Once we have the solution (a tree) for the smaller problem we can “attach” the leaf into the tree.

To compute the limb length and where to attach it, we can rely on the following theorem.

Limb Length Theorem: Given an additive matrix D and a leaf j, limbLength(j) is equal to the minimum of

(2) (d_ij + d_jk - d_ik)/2

over all pairs of leaves i and k.

The idea behind the theorem is that if we remove parent(j) from the unknown tree, it will divide it into at least 3 subtrees (one being leaf j on its own). This means that there exists leaves i and k that are in different subtrees. This tells us that the path from i to k has to go through parent(j) and also that the path from i to j and from j to k are disjoint except for j‘s limb, so we can conclude that:

d_ik = d_ij + d_jk - 2*limbLength(j)

which yields (2) for limbLength(j). We can show now that for i and k on the same subtree d_ik <= d_ij + d_jk - 2*limbLength(j), and hence

limbLength(j) <= (d_ij + d_jk - d_ik)/2

This means that finding the minimum of (2) will satisfy these constraints.

Attaching leaf j back in. From the argument above, there are at least one pair of leaves (i, k) that yields the minimum limbLength(j) that belongs to different subtrees when parent(j) is removed. This means that parent(j) lies in the path between i and k. We need to plug in j at some point on this path such that when computing the distance from j to i and from j to k, it will yield d_ij and d_jk respectively. This might fall in the middle of an edge, in which case we need to create a new node. Note that as long as the edges all have positive values, there’s only one location within the path from i to k that we can attach j.

Note: There’s a missing detail in the induction argument here. How can we guarantee that no matter what tree is returned from the inductive step, it is such that attaching j will yield consistent distances from j to all other leaves besides i and k?

This constructive proof gives us an algorithm to find a tree for an additive matrix.

Runtime complexity. Finding limbLength(j) takes O(n^2) time since we need to inspect every pair of entries in D. We can generate an n-1 x n-1 matrix in O(n^2) and find the attachment point in O(n). Since each recursive step is proportional to the size of the matrix and we have n such steps, the total runtime complexity is O(n^3).

Detecting non-additive matrices. If we find a non-positive limbLength(j), this condition is sufficient for a matrix to be considered non-additive, since if we have a corresponding tree we know it has to represent the length of j’s limb. However, is this necessary? It could be that we find a positive value for limbLength(j) but when trying to attach j back in the distances won’t match.

The answer to this question goes back to the missing detail on the induction step and I don’t know how to answer.

### The Neighbor-Joining Algorithm

Naruya Saitou and Masatoshi Nei developed an algorithm, called Neighbor Joining, that also constructs a tree from an additive matrix, but has the additional property that for non-additive ones it serves as heuristic.

The idea behind is simple: It transforms the distance matrix D into another n x n matrix, D*, such that the minimum non-diagonal entry, say d*_ij, in that matrix corresponds to neighboring vertices (i ,j) in the tree, which is generally not true for a distance matrix.

The proof that D* has this property is non-trivial and will not provide here. Chapter 7 of [1] has more details and the proof.

Given this property, we can find i and j such that d*_ij is minimal and compute the limbs distances limbLength(i) and limbLength(j), replace them with a single leaf m, and solve the problem recursively. With the tree returned by the recursive step we can then attach i and j into m, which will become their parents.

### Conclusion

In this post we saw how to construct a tree from the distance between its leaves. The algorithms are relatively simple, but proving that they work is not. I got the general idea of the proofs but didn’t get with 100% of detail.

The idea of reconstructing the genealogical tree of all the species is fascinating and is a very interesting application of graph theory.

### References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P. – Chapter 10

[2] The Blink Watchmaker – Richard Dawkins

# DNA Assembly

In this post we’ll discuss the problem of reconstructing a DNA segment from its fragments, also known as DNA assembly.

Context. When we first talked about DNA sequencing, we learned that there’s no good way to “scan” the whole series of nucleotides from a DNA with current technology. What can be done is to break the target segment into many small (overlapping) segments and then rely on computers to help with the task of reconstructing the original DNA sequence from these segments.

We can start by making some assumptions over the nature of the fragments to start. First, we’ll assume every fragment has the same length and second that we have every possible fragment of such length.

For instance, if our sequence is TATGGGGTGC, all possible fragments of length 3 are: ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG.

Note that the fragments overlap with each other. For example, TAT and ATG have an overlap of AT. This is crucial for us solve the problem, since if there was no overlap it would be impossible to order the fragments to obtain the original sequence, since there would be no “link” between any two fragments.

Let’s state the problem more formally given these constraints.

### The String Reconstruction Problem

Definitions. A k-mer of a string S is any substring of S with length k.  The String Composition Problem consists in, given the set of all k-mers from S, reconstruct the string S.

Reusing the example from above, if we are given ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG, we could reconstruct the string TATGGGGTGC. We’ll now see a way to solve this problem.

Solution. Assuming a solution exists, it will consist of a (ordered) sequence of the k-mers such that adjacent k-mers overlap in k-1 positions. From the example above, the permutation is

TAT, ATG, TGG, GGG, GGG, GGT, GTG, TGC

And two adjacent k-mers such as TGG and GGG, overlap in k-1 positions (GG).

We can model this problem as a graph problem. If we define a directed graph where each vertex corresponds to a k-mer, and an edge (u, v) exists if and only if the suffix of k-mer u is the prefix of the k-mer v, in other words, u overlaps with v.

Now, if we can find a path visiting each vertex exactly once, that will be a valid reconstructed string. This is known as the Hamiltonian path problem for general graphs it’s a NP-Hard problem.

Instead, we can model this problem using a graph as follows: for each k-mer, we have a vertex corresponding to its prefix of length k-1 and another to its suffix with length k-1. For example, for a k-mer TGG, there would exist vertices TG and GG. There’s then an edge from u to v, if the overlap of u and v in k-2 positions is a k-mer in the input. In the example, above, there’s an edge from TG to GG because TGG is a k-mer. Note we can have repeated (multiple) edges.

In this new graph, if we can find a path visiting each edge exactly once, we’ll find a reconstructed string to the set of k-mers. To see why, we can observe that each edge in this new graph is a k-mer and two consecutive edges must overlap in k-1 positions (the overlap being the vertex that “links” these two edges). The graph for the example we discussed above can be seen below:

Graph representing the k-mer set:ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG

Note that this second graph is the line graph of the one we first used, in a similar fashion that a de Bruijn graph of dimension n is a line graph of the one with dimension n-1. In fact, these graphs are a subgraph of the de Bruijn graphs.

As we saw in our discussions about Eulerian Circuits that this is a much easier problem to solve.

#### Dealing with Ambiguity

Even if are able to solve the String Reconstruction problem, we might not end up with the right DNA segment. In [1], the authors provide the example TATGCCATGGGATGTT which has the same 3-mer composition of TAATGGGATGCCATGTT. Let’s see strategies employed to work around this problem.

### The String Reconstruction from Read-Pairs Problem

While it’s currently infeasible to generate longer fragments that would reduce ambiguity, it’s possible to obtain what is called read-pairs. These are a pair of k-mers that are separated by a distance of exactly d in the target segment.

For example, TGC and ATG are a pair of 3-mers separated by distance 1 in TATGCCATGGGATGTT. We refer to a pair of k-mers separated by distance d as (k, d)-mers, or (pattern1 | pattern2) if we can omit the distance.

Solution. We can construct a de Bruijn-like graph for this problem too, which we’ll call Paired de Bruijn graph. First, let’s define the prefix and suffix of a (k, d)-mer.  Given a (k, d)-mer in the form of (a1, ..., ak | b1, ..., bk), its prefix is given by (a1, ..., ak-1 | b1, ..., bk-1) and its suffix by (a2, ..., ak | b2, ..., bk).

For every (k, d)-mer, we’ll have one vertex corresponding to the prefix and one to the suffix of this (k, d)-mer. There’s an edge from vertex u to vertex v if there’s a (k, d)-mer whose prefix is u and suffix is v.

Similar to the solution to the String Reconstruction problem, we can find an Eulerian path, but in this case that might not yield a valid solution. In [1] the authors provide an example:

Consider the set of (2, 1)-mers is given by (AG|AG), (AG | TG), (CA | CT), (CT | CA), (CT | CT), (GC | GC), (GC | GC), (GC | GC), (TG | TG).

After constructing the graph, one of the possible Eulerian paths is (AG|AG) → (GC | GC) → (CA | CT) → (AG | TG) → (GC | GC) → (CT | CT) → (TG | TG) → (GC | GC) →  (CT | CA) which spells AGCAAGCTGCTGCA, which is a valid solution.

However, another valid Eulerian path, (AG|AG) → (GC | GC) →  (CT | CT) →  (TG | TG)  → (GC | GC) → (CA | CT) → (AG | TG) → (GC | GC) →  (CT | CA) does not yield a valid string.

In [1] the authors don’t provide an explicit way to overcome this issue but they go on to describe how to enumerate all Eulerian paths, which seems to suggest a brute-force approach.

### Practical Challenges

Missing fragments. One of the assumptions we made, that all fragments of the sequence are present doesn’t hold true for the state-of-the-art sequencers.

A technique to address this issue is to break the fragments into smaller ones until we get full coverage.

Left: 10-mers not providing full coverage. Right: 5-mers obtained from 10-mers and having full coverage.

This trades off coverage with ambiguity, since smaller k-mers are more likely to contain repeats and that might not lead to a single solution.

Typos. Another limitation of sequencers is that they can misread nucleotides. If we perform multiple reads – some containing the correct nucleotides, some not – we’ll end up with a graph where some paths are valid and some are not. It’s possible to remove them via heuristics but they’re not perfect and can lead to the removal of valid paths.

### Conclusion

While following the textbook [1] I felt a bit lost due to so many detours and kept losing track of the main problem being solved. Because the textbook is meant to also be accessible to people without prior knowledge of Computer Science, so it does need to provide the base for concepts such as Graph Theory.

One think I missed from the content were a section for experiment results. Bioinformatics is a highly applied branch of Computer Science and all of these methods are heuristics or based on approximate models. I’d be interested in knowing how well they perform in practice.

What I liked the most about the presentation style is that it provides a simpler solution first, describe the issues with it and then provide a better solution. This helps understanding on why a given algorithm is this or that way.

Reconstructing a string using (k,d)-mers in an efficient way seems like an open problem, given the solution presented requires brute force in the worst case. I wonder if there has been any progress since.

### References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[2] Wikipedia – Sequence Assembly

# De Bruijn Graphs and Sequences

Nicolaas Govert de Bruijn was a Dutch mathematician, born in the Hague and taught University of Amsterdam and Technical University Eindhoven.

Irving John Good was a British mathematician who worked with Alan Turing, born to a Polish Jewish family in London. De Bruijn and Good independently developed a class of graphs known as de Bruijn graphs, which we’ll explore in this post.

## Definition

A de Bruijn graph [1] is a directed graph defined over a dimension n and a set S of m symbols. The set of vertices in this graph corresponds to the m^n possible sequences of symbols with length n (symbols can be repeated).

There’s a directed edge from vertex u to v if the sequence from v can be obtained from u by removing u’s first element and then appending a symbol at the end. For example, if S = {A, B, C, D}, n = 3 and u = ABC, then there’s an edge from ABC to BC*, that is, BCA, BCB, BCC and BCD.

## Properties

We can derive some basic properties for de Bruijn graphs.

1) Every vertex has exactly m incoming and m outgoing edges.

We saw from the example above that ABC had edges to any vertex BC*, where * is any of the m symbols in S. Conversely, any sequence in the form *AB can be transformed into ABC, by dropping the first symbol and appending ‘C’.

2) Every de Bruijn graph is Eulerian.

In our last post we discussed about Eulerian graphs and learned that a necessary and sufficient condition for a directed graph to have an Eulerian cycle is that all the vertices in the graph have the same in-degree and out-degree and that it’s strongly connected. The first condition is clearly satisfied given the Property 1) above.

To see that a de Bruijn graph is strongly connected, we just need to note that it’s possible to convert any sequence into another by removing the first character and replacing the last with the appropriate one in at most n steps. For example, given the string ABC, we can convert it to BDD by doing ABC -> BCB -> CBD -> BDD. Since each such step corresponds to traversing an edge in the de Bruijn graph, we can see it’s possible to go from any vertex to another, making the graph strongly connected.

3) A de Bruijn graph over the set of symbols S and dimension n is the line graph of the de Bruijn graph over set S and dimension n – 1

A line graph of a given graph G has vertices corresponding to edges in G, and there are edges between two vertices if the corresponding edges in G share a vertex. More formally, let G = (V, E) be an undirected graph. The line graph of G, denoted by L(G) has a set of vertex V’ corresponding to E.  Let u’, v’ be two vertices from V’, corresponding to edges e1 and e2 in E. There’s an edge between u’ and v’ if e1 and e2 have one vertex in common.

It’s possible to generalize this to directed graphs by changing the definition of edges slightly: let u’, v’ be two vertices from V’, corresponding to the directed edges e1 = (a, b) and e2 = (c, d) in E. Then there’s a directed edge from u’ to v’ if and only if b = c.

We can gain an intuition on Property 3 by looking at an example with set S = {0, 1} and constructing a de Bruijn graph with n = 2 from one with n = 1. In Figure 1, the vertices from n = 2 are the labeled edges of n = 1. The edges in n = 2 correspond to the directed paths of length 2 in n = 1. We highlighted in red one of such paths. In n = 1, the path is given by (0, 1) and (1, 1), which became (01, 11) in n = 2.

Figure 1: Constructing a De Bruijn graph over  symbols{0, 1} and dimension n = 2 from one with dimension n = 1

4) Every de Bruijn graph is Hamiltonian

This follows from Properties 2 and 3. We claim that an Eulerian cycle in a De Bruijn graph in dimension n is a Hamiltonian path in dimension n + 1. That’s because we visit every edge exactly once and each edge corresponds to a vertex in the graph in dimension n + 1. Given two consecutive edges in the Eulerian cycle in dimension n, (u, v) and (v, w), from Property 3 we know that there’s an edge from the corresponding vertex (u, v)’ to vertex (v, w)’ in dimension n + 1.

## De Bruijn Sequence

The de Bruijn sequence of dimension n on a set of symbols S, denoted B(S, n), is a cyclic sequence in which every possible sequences of length n appears as substring. The length of such sequence is |S|^n.

Since |S|^n is also the number of distinct sequences of length n, we can conclude that this sequence is the shortest possible. To see why, let B be a de Bruijn sequence. We can assign an index p to each sequence s of length n based on where it appears in B such that the substring B[p, p + n – 1] represents s. Since each of the |S|^n sequences are distinct, they cannot have the same index p. Hence, there must be at least |S|^n indexes, and thus B must be at least that long.

It’s possible to construct a de Bruijn sequence B(S, n) from the Hamiltonian path of a de Bruijn graph over S and dimension n. Two adjacent nodes in the Hamiltonian path share n-1 symbols, so if we start with a vertex v, each new vertex in the path only adds one symbol. It would have a total of n + (|S|^n – 1), but since the last n-1 symbols of the sequence overlap with the beginning when we wrap in a cycle, the cyclic sequence has length |S|^n.

Note that we can construct an Hamiltonian cycle for a de Bruijn graph in polynomial time because it’s equivalent to the Eulerian path in one dimension below. Hence we have a polynomial time algorithm to construct the de Bruijn sequence.

## Applications

#### Cracking Locks

A de Bruijn sequence can be used to brute-force a lock without an enter key, that is, one that opens whenever the last n digits tried are correct. A naive brute force would need to try all |S|^n typing n digits every time, for a total of |S|^n. Using a de Bruijn sequence we would make use of the overlap between trials, and only need to type |S|^n digits in total.

#### Finding the Least Significant Bit

The other interesting application mentioned in [2] is to determine the index of the least significant bit in an unsigned int (32-bits). The code provided is given by:

Let’s understand what the code above is doing. For now, let’s assume v > 0 and we’ll handle v = 0 as a special case later.

In the code above, (v & -v) has the effect of “isolating” the least significant bit. Since v is unsigned, -v is its two’s complement, that is, we complement the digits of v (~v) and add one. Let p be the position of the least significant digit in v. The bits in positions lower than p will be 1 in ~v, and in position p it’s a 0. When incremented by 1, they’ll turn into 1 in position p and 0 in the lower positions. In the positions higher than p, v and -v will be have complementary bits. When doing a bitwise AND, the only position where both operands have 1 is p, hence it will be the number (1 << p) (or 2^p).

Then we multiply the result above by 0x077CB531U which is the de Bruijn sequence B({0, 1}, 5) in hexadecimal. In binary this is 00000111011111001011010100110001, which is a 32-bit number.  Because v & -v is a power of 2 (2^p), multiplying a number by it is the same as bit-shifting it to the left p positions. Then we shift it to the right by 27 positions, which has the effect of capturing the 5 most significant bits from the resulting multiplication. If we treat the number as a string of characters (note that most significant bits are the first characters), the left shift followed by the right shift is equivalent to selecting a “substring” from position p to p+5.

For example, if p = 13, a left shift on 00000111011111001011010100110001 would result in 10010110101001100010000000000000. Then a right shift of 27, would pick the 5 leftmost bits, 10010. If we treat 00000111011111001011010100110001 as a string, 10010 shows up as a substring 00000111011111001011010100110001 in positions [13, 17].

Since this is a de Bruijn sequence for n = 5, every substring of length 5 corresponds to a unique 5-bit number and conversely every 5-bit number is present in this sequence. Now we just need to keep a map from the 5-bit number we obtained via the bit manipulation to the actual number we wanted, which we store in MultiplyDeBruijnBitPosition. Since 10010 is 18, we’ll have an entry MultiplyDeBruijnBitPosition[18] = 13.

Finally, for the special case where v = 0, we have that v & -v is 0 and the algorithm will return 0.

#### Assembling DNA Fragments

In [3] Compeau and Pevzner proposed a method to assemble fragments of DNA into its original form. The problem can be modeled as the k-universal circular string problem.

Definition: Consider a list of sequences s_1, s_2, …, s_n, each of which having the same size k, having the property that s_i’ s suffix and s_i+1 ‘ s prefix overlap in k-1 positions. That is, the last k-1 characters in s_i are the same as the first k-1 characters in s_i+1. We are given the sequences in no particular order. The objective is to find a composed string S which is the result of the overlap of s_1, s_2, …, s_n in order.

This problem can be modeled as a de Bruijn graph where each sequence is associated with a vertex. If sequence s_i’s suffix and s_j’s prefix overlap in k-1 positions, we add a directed edge from vertex s_i to s_j. We then find an Hamiltonian path in the de Bruijn graph and the order in which the vertices are visited will give us the desired string.

## Variants: The Super-permutation

One variant to the de Bruijn sequence problem is to, instead of finding a universal sequence containing all possible sequences of length n, find one containing all the permutations of the symbols in S. Instead of the |S|^ n sequences as input, we’ll have |S|! sequences. This is know as the Super-permutation problem.

For example, for S = {1, 2}, it want to find a sequence containing: 12 and 21. The sequence 121 is a possible solution. For S = {1, 2, 3}, we have now 123, 132, 213, 231, 312 and 321. The shortest 123121321. John Carlos Baez tweets about this problem in [4]. Finding the shortest sequence that includes all permutations is an open problem!

We know optimal solution for n up to 5. The best known lower bound for this problem is n! + (n−1)! + (n−2)! + n − 3 while the upper bound is n! + (n−1)! + (n−2)! + (n−3)! + n − 3 [5].

## Conclusion

In this post I was mainly interested in learning more about de Bruijn graphs after reading about them in Bioinformatics Algorithms by Compeau and Pevzner [3]. I ended up learning about de Bruijn sequences and realized that the problem was similar to one I read about recently on John’s Twitter. It was a nice coincidence.

## References

[1] Wikipedia: De Bruijn graph
[2] Wikipedia: De Bruijn sequence
[3] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[5] Wikipedia: Superpermutation

# DNA Sequencing

Frederick Sanger. Wikipedia

Frederick Sanger was a British biochemist. He is known for the first sequencing of a protein (1955) and a method for sequencing DNA that bears his name, the Sanger Method. Sanger won two Nobel Prizes in Chemistry [1].

In this post we’ll talk about one of the first steps of DNA analysis, DNA sequencing, which is obtaining the data from the DNA, how it’s performed (we’ll focus on the Sanger method) and some interesting computational problems associated with it.

This is the second post in the series of Biochemistry studies from a Computer Scientist perspective. Our first post is a brief discussion of basic concepts in Cell Biology.

## DNA Sequencing

DNA sequencing is the determination of the physical order of the nucleotide bases in a molecule of DNA.

The first living organism to have its genome sequenced was a bacteria, Haemophilus influenzae, whose genome is about 1.8 million base pairs.

Genome represents the whole set of  genetic information of an organism. For a bacteria, it’s singular circular chromosome, but for humans it’s the set of all 23 pairs of chromosomes. A base-pair (shortened as bp) refers to a pair of nucleotides (bases) that are bound together in the double-strands of DNA.

In the human genome, there are 3 billion of base-pairs, and it took 13 years for it to be completed.

There are two main methods of sequencing, Sanger and Next-generation [5]. We’ll talk about the Sanger in details and discuss briefly the Next-generation from a real-world use case.

## Sanger Sequencing

The Sanger method is able to determine the nucleotide sequence of small fragments (up to abound 900bps) [5] of DNA.

### Overview

The first step is cloning the fragment into multiple copies (like billions) by a process called amplification. This is essentially mimicking the DNA cloning process in an artificial setup. In very high level we have:

• Separate the double strands (denaturation)
• Add a special molecule (primer) to the extremity of each strand
• Extend the primer with nucleotides until it reaches the other extremity

Once this is complete, we end up with two double-stranded fragments. We can repeat the process to generate many copies of the original fragment (theoretically doubling at each step). This process is known as Polymerase Chain Reaction (PCR).

Once we have enough copies of the fragment, we do a similar process, but this time, we also allow extending the primer with a special nucleotide named (dideoxy nucleotide). The key difference is that once it’s added, a dideoxy nucleotide cannot be further extended, and it also contains a special marker that causes each different base to have a different color. The process is now the following:

• Separate the double strands (denaturation)
• Add a special molecule (primer) to the extremity of each strand
• Add to the primer either
• Regular nucleotide (non-terminal – can be further extended)
• Dideoxy nucleotide (terminal – cannot be further extended)

Now, instead of ending up with more clones, we’ll have fragments with a variety of lengths.

We then run theses fragments through a process called Capillary Gel Electrophoresis which roughly consists in subjecting the fragments to a electric field, which then causes fragments to move with speed proportional to their length (the smaller the fragment the faster it moves). Once a group of fragments (which have the same length) reach a sensor at end of the field, we make use of the color marker in the special nucleotide at the tip of the fragment to determine the base of that dideoxy nucleotide. This enables us to determine the sequence of the nucleotides in the original fragment!

To given an example, say that the original sequence is GATTCAGC. Because there are many copies of the fragment after amplification, it’s very likely that we’ll end up with fragments with all possible lengths, from 1 to 8. Since a fragment of length 1 moves faster than any other, it will reach the sensor first. We know that a fragment with length 1 is a base pair G-C, and C is a dideoxy nucleotide (if it was not, it would have continued extended further). Say the color marker for C is orange. Then the first color to be captured by the sensor will beorange. The second set of fragments to reach the sensor is of length 2, which is a fragment (G-C, A-T), where T is a dideoxy nucleotide. If we assume it has color red, that’s the color which will be captured next by the sensor. You can see that based on the colors captured by the sensor we can infer the nucleotide sequence in the original segment.

Fragments with different lengths and color markers. Image copied from Whole-Genome Sequencing

### Details

Let’s go in more details for these processes. For both cases we work with solutions. Finer grained manipulation of molecules is infeasible.

To separate the double strands we heat the solution up to 96ºC, in a process we call denaturation. The high temperature causes the hydrogen bonds between pairs of nucleotides to break.

In the same solution we have the primer molecules (aka oligonucleotides) which are carefully chosen to match the beginning of the DNA strand. They also bind to DNA at a higher temperature than the strands (e.g. 55ºC). This is important because we can now lower the temperature of the solution slowly, enough so that primers can bind, but not so low to the point where the original strands will join back together. This slow cooling is called annealing. These primers can be synthesized artificially.

Gap in understanding: how to choose the right primer, since we need to know at least some of sequence from the nucleotide in order to know which primer will end up binding there? One possible explanation is that we know the exact sequence where a DNA was cut if we restriction enzymes, since their binding site is known [9] and hence we have an idea of the result endpoints after the cut.

# □

We also add an enzyme, DNA polymerase, and nucleotides to the solution. The enzyme is able to extend the original primer segment by binding free nucleotides in the solution. In particular we use the enzyme of a bacteria that lives at 70ºC (Thermus Acquaticus), also know as Taq polymerase because it is functional at higher temperatures. Performing the replication at a higher temperature prevents the separated strands from gluing together again.

The dideoxy nucleotide are modified versions of the regular nucleotide by supressing the OH group. They can still be incorporated to the primer via the DNA polymerase, but they prevent other nucleotides to binding to them via the sugar-phosphate binding.

In addition these dideoxy nucleotides contain a fluorescent molecule whose color is unique for each different type of base. How are these molecules “inserted” into the nucleotides? The abstract of [6] states:

Avian myeloblastosis virus reverse transcriptase is used in a modified dideoxy DNA sequencing protocol to produce a complete set of fluorescence-tagged fragments in one reaction mixture.

which suggests it’s possible to synthesize them by using a specific virus.

Dideoxynucleotide vs deoxynucleotide. The lack of the OH group prevents a ddNTP from binding to the “next” nucleotide, effectively terminating the chain. Image copied from Whole-Genome Sequencing

## Next-Generation Sequencing (Illumina)

The Sanger method is a very slow process, making it infeasible for analyzing large amounts of DNAs such as the human genome.

Modern sequencers make use of the “Next-generation” methods, which consist in massive parallelism to speed up the process. The most advanced sequencer in the market is produced by a company called Illumina. As of the time of this writing, their top equipment, Hiseq X Ten, costs about $10 million dollars, can sequence about 18k full genomes a year and it costs about$1000 per genome [2, 3].

Illumina’s educational video [4], describes the process:

• Cut a long sequence into small fragments
• Have the segments attach to beads in a glass plate via the adapters. The beads are basically the primers.
• Amplify the segments (massive cloning)
• Extend the primer with fluorescent nucleotides
• Every time a nucleotide is added, it emits light which is captured by a sensor

After the process is over, we have the sequence of many fragments based on the colors that were emitted.

We can see that the principles resemble the Sanger method, but it uses different technologies to allow a very automated and parallel procedure.

This whole process is very vague and it’s hard to have a good understanding of it. It’s understandable given that a lot of the information is likely industry secret.

## Sequencing vs Genotyping in personal genomics

Some of the most popular personal genetic analysis companies, such as 23andMe, provide a service in which they analyze the user DNA for a small fee. It’s way cheaper than the full genome analysis provided by Illumina, but that’s because these companies don’t do DNA sequencing, but rather genotyping.

Genotyping is the process of determining which genetic variants an individual possesses. This is easier than sequencing because a lot of known diseases and traits can be traced back to specific regions and specific chromosomes.

This is the information you most likely want to know about yourself. Remember that the majority of DNA in complex organisms is not useful (introns). In humans genome, exome (the part of DNA consisting of exons) account for less than 2% of total DNA.

Sequencing technology has not yet progressed to the point where it is feasible to sequence an entire person’s genome quickly and cheaply enough to keep costs down for consumers. It took the Human Genome Project, a consortium of multiple research labs, over 10 years to sequence the whole genomes of just a few individuals.

## Sequence Assembly Problem

Current technology is unable to sequence large segments of DNA, so it has to break it down into small fragments. Once that is done, we need to reconstruct the original sequence from the data of individual fragments.

There are two scenarios:

• Mapping: matching fragments of DNA to existing sequences by some similarity criteria
• De-novo: reconstruct the DNA when there’s no prior data about that sequence.

There is a family of computational methods for the mapping case known as Sequence Assembly, which we’ll study in more details in a future post.

Gap in understanding: It’s unclear what exact information is expected for the De-novo sequencing. It is impossible to determine the original sequence by only having data about individual segments, in the same way it’s impossible to reconstruct a book by only knowing it’s paragraphs contents (but not their order), borrowing the analogy from Wikipedia [8].

One thing I can think of is that if we repeat the experiments multiple times, each time cutting the molecules at different places, it might be possible to infer the relative order of the segments if they’re unique enough.

For example, if the original segment is: GATTCAGC and we run two experiments, one yielding (GAT, TCAGC) and another (GATTC, AGC), then in this case there’s only one way to assemble (order) these sequences in a consistent way.

# □

## Conclusion

In this post we studied some aspects of DNA sequencing. I found the Sanger method fascinating. In Computer Science ideas usually translate to algorithms and code very directly, but in other science branches the mapping from ideas to implementation is a problem in itself.

In Mechanics for example, we have to work with the physical world, so when converting from a circular movement to a linear one requires some clever tricks.

This needs to be taken to another level in Molecular Biology, because we don’t have direct access to the medium like we do in a mechanical device, for example, we can’t directly manipulate a double strand of DNA fragment to separate it, but have to resort to indirect ways.

The field of Biotechnology seems to be progressing at such a pace that it was challenging to find good sources of information. I’m yet to find a resource that explains end to end the steps from the process that takes a sample of cells and outputs the DNA nucleotides to a computer, including the technical challenges in doing so. This is what this post aimed to do.

## References

[1] Wikipedia – Frederick Sanger
[2] Genohub – Illumina’s Latest Release: HiSeq 3000, 4000, NextSeq 550 and HiSeq X5
[3] Illumina Hiseq-X
[4] Youtube – Illumina Sequencing by Synthesis
[5] Khan Academy – DNA sequencing
[6] Science Magazine: A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides
[7] Towards Data Science – DNA Sequence Data Analysis – Starting off in Bioinformatics
[8] Wikipedia – Sequence assembly
[9] Dummies – How Scientists Cut DNA with Restriction Enzymes

# Cell biology and programming

Rosalind Franklin was an English chemist and X-ray crystallographer. She is best known for her work on the X-ray diffraction images of DNA, particularly Photo 51, while at King’s College, London, which led to the discovery of the DNA structure by Watson and Crick.

James Watson, Francis Crick and Maurice Wilkins shared the Nobel Prize in Physiology or Medicine in 1962. According to Wikipedia [1], Watson suggested that Franklin would have ideally been awarded a Nobel Prize in Chemistry four years after Franklin passed away due to ovary cancer.

Photo 51

In this post we’ll study some basic concepts of cell biology, mostly around the DNA. We’ll start by introducing the structure of the DNA and then two of its main functions: replication and synthesis of protein. Since this is a programming blog, we’ll provide analogies (some forceful) to computer systems to help us relate to prior knowledge.

The end goal of this post to provide a good basis for later learning bio-informatics algorithms.

### Genome

Genome is the set of information necessary for creating an organism. In a computer programming analogy we can think of the genome as the entire source code of an application.

#### Chromosome

Let’s recall that cells can be classified into two categories, eukaryotic (from the Greek, good + nucleus) and prokaryotic (before + nucleus). As the name suggests, eukaryotic cells have a well define nucleus surrounded by a membrane.

In eukaryotic cells the genome is divided across multiple chromosomes which are basically compacted DNA. They need to be compacted to fit within the nucleus. According to Science Focus [3], if stretched, the DNA chain would be 2 meters long.

Humans cells usually have 23 pairs of chromosomes, 22 of which are called autosomes and they are numbered based on size (autosome 1 is the largest). The remaining is a sex chromosome and can be of type either X or Y.

Men and women share the same types of autosomes, but females have two copies of chromosome X, and men one chromosome X and one Y.

Chromosomes are usually depicted as X-like structures and neatly arranged [4], but recent research was able to visualize the actual structure and it looks more like:

Prokaryotes (e.g. bacteria) on the other hand typically store their entire genome within a single circular DNA chain.

In our computer programming analogy, the chromosome serve as units of organization of the source code, for example the files containing the code. If your application is simple, like a bacteria, the entire source code could be stored in a single file!

We can push the analogy even further and think of the tangled DNA within the chromosome as the “minification” step that JavaScript applications apply to the original source code to reduce network payload.

#### DNA

The deoxyribonucleic acid, more commonly known as DNA is a structure usually composed of two strands (chains) connected through steps to form a double helix.

Conceptual representation of the DNA: the double helix

In our analogy the DNA could represent the text of the source code.

#### Nucleotide

Nucleotides are the discrete units that form the base of the DNA. In the DNA it can be one of: Adenine, Cytosine, Guanine and Thymine.

Chemically speaking, we can divide the nucleotide in 3 parts: a sugar group, a phosphate group and a nitrogen base. The first two are common among the nucleotides, while the base differentiates them.

Guanine, one of the 4 possible nucleotides in the DNA

Any two nucleotides can be connected through their sugar and phosphate groups (the sugar group of one nucleotide links to the phosphate group of the next). Nucleotides linked this way form the backbone of a single strand of the DNA.

In addition, two nucleotides can be linked together via their nitrogen base, but in this case, there’s a restriction on which two bases can be linked together. Adenine can only be paired with Thymine, and Cytosine can only be paired with Guanine. These pairings form the “steps” in between two DNA strands.

4 nucleotides linked together through the sugar-phosphate groups or through the nitrogen bases.

Because the phosphate group of a nucleotide is linked to the sugar group of the next, we can define a direction for a chain of nucleotides. The endpoint that ends with the phosphate group has 5 carbon molecules and is called 5′ (read as five prime), while the sugar group has 3 and is called 3′ (three prime). The two strands in a DNA molecule are oriented in opposite directions, as depicted in the figure above.

We can now complete the analogy of computer programming by stating that nucleotides are the characters that compose the text of the source code (DNA). In our case, the alphabet contains 4 letters: A (Adenine), C (Cytosine), G (Guanine), T (Thymine).

### The replication

The DNA is capable of replicating itself so that it can be passed down to new formed cells.

In high-level, the double strands of the DNA start separating and other structures start binding new nucleotides to each strand (templates) until both the strands are completely duplicated (and form a double strand again).

The separation of the strands is triggered by the protein helicase, and can happen at any point of the DNA and it might happen in many places at the same time. One way to visualize this is opening a zipper jacket from the middle and keep pushing it open in one direction.

While the strands are being separated, proteins called DNA polymerase starts scanning each strand and making a copy strand by adding nucleotides to it.

The DNA polymerase can only extend an existing chain of nucleotide, so it requires an initial fragment to start with. For that reason, in the beginning of the duplication process, a small fragment of DNA or RNA, called primer, needs to be connected to the strand.

One important limitation the polymerase has is that it can only process a strand in one specific direction: from 3′ to 5′. But since we saw that strands are oriented in opposite direction of each other, it means that the replication process doesn’t happen symmetrically on both strands.

For the strand oriented 3′ to 5′ the replication is smooth, generating a continuous strand. For the reverse strand though, it will be done in pieces, because the polymerase is adding nucleotides in the opposite side of where the opening is happening. These pieces are known as Okazaki fragments and later joined together by another protein called ligase.

We can see these two cases taking place in the picture below:

One interesting fact about this process is that errors do happen and there are error corrections in place to minimize them. The result of the replication is also not 100% accurate, especially at the endpoints of the strands, where each new replica formed has its endpoints shorter than the original template. To compensate for this, the DNA has repeated redundant segments of nucleotides at the endpoints, know as telomeres, but eventually this extra segments get wore off to a point they cannot be used for replication. The shortening of telomeres is associated with aging.

In our analogy to computer programming, we could imagine the replication being the distribution of copies of the source code to other people. This analogy is weak, though. If we want to be precise, we’d need to come up with a program that is capable of printing its own source as output. This was named Quine by Douglas Hofstadter in Gödel, Escher, Bach [6] and it’s an example of a self-replicating automata, also studied by the famous computer scientist John von Neumann.

### Protein Production

A second function of the DNA is the production of proteins. The first step, called transcription, is very similar to replication: One of the strands serve as template, and a new strand with complementary base is generated. The difference is that instead of Thymine, the nucleotide Uracil is used. The resulting structure is called mRNA, short for messenger RNA.

Production of mRNA and its exit to the cytoplasm

The mRNA detaches itself from the DNA strand and exits the nucleus to the cytoplasm where the second step, translation, begins. In there, it is “interpreted” by a structure called ribosome. Every 3 nucleotides, denominated codon, translates to an amino acid, which in turn form a protein. The mapping of every possible 64 combinations of codons are displayed below:

Mapping of codons to amino-acids. For example, GGA maps to Glycine.

The tRNA, short for transfer RNA, is a small chain of RNA that connects amino-acids to their corresponding codon in the mRNA. There are some special codons that indicate the end of the process. The output of this translation will be a peptide chain.

Synthesis of a protein

The production of protein is the actual functioning of the DNA. If we link to the computer programming model, we could think of the source code (DNA) being interpreted or compiled into an actual program that can be executed. It’s incredible how biological systems evolved to define an explicit code for the end of the translation, much like how a semi-color or new line often indicates the end of an expression.

### Applications of Bio-informatics

How can computer science help with the study of biological systems? Computers are  good at repetitive tasks and handling large amounts of data. Here are some applications according to Wikipedia [7].

• DNA sequencing – consists of determining the order of nucleotides in the DNA from raw data. Computers are useful here because the raw data often come as fragments that need to be merged.
• Protein structure prediction – given a sequence of nucleotides, determine the chain of amino-acids is well-understood, but predicting the structure of the final protein is an open problem.
• Reduce noise in massive datasets output by experiments.

### Conclusion

From reading a lot of introductory material, I felt that there were a lot of  imprecise or ambiguous descriptions. That seems to come from the sheer complexity of biological systems and most results coming from empirical evidence, which currently only provide an incomplete picture of the whole, but new information still comes at a very frequent pace, sometimes making existing models obsolete or incorrect.

I last studied Biology back in high school. I don’t intend to study it in depth, but just enough to understand how computer science can be used to solve particular problems.

The idea of modeling living organisms as organic computers is fascinating. As we discover more about the inner workings of cells, I hope we can come up with better models and be able to predict their outcome with greater accuracy.

### References

[1] Rosalind Franklin – Wikipedia
[2] Photo 51 – Wikipedia
[3] How long is your DNA?
[4] How many chromosomes do people have?
[5] What a chromosome really looks like
[6] Gödel, Escher, Bach: An Eternal Golden Braid – Douglas R. Hofstadter
[7] Bioinformatics – Wikipedia