# Protein Design

We previously learned about the problem of predicting the folding of a protein, that is, given a chain of amino acids, find its final 3D structure. This time we’re interested in the reverse problem, that is, given a 3D structure, find some chain of amino-acid that would lead to that structure once fully folded. This problem is called Protein Design.

In this post we’ll focus on mathematical models for this problem, studying its computational complexity and discuss possible solutions.

### Mathematical Modeling

The 3D structure of a protein can be divided into two parts: the backbone and the side chains. In the model proposed by Piece and Winfree [3], we assume the backbone is rigid and that we’ll try to find the amino-acids for the side chains such that it minimizes some energy function.

This means we’re not really trying to predict the whole chain of amino acids, but a subset of those amino that will end up on the side chains.

The amino acids on the side-chain can have a specific orientation [2, 3], known as rotamer, which in turn can be represented by a single value, its dihedral angle. We can define some arbitrary order for these amino acids and label them with an index, which we call position.

At each position there are multiple rotamers possible and we need to select them to minimize the overall energy function. More formally, for each position i, let $R_i$ be the set of rotamers available and $r_i \in R_i$ the chosen rotamer.

The model assumes the cost function of the structure is pairwise decomposable, meaning that we can account for the interaction of each pair independently when calculating the total energy:

Where $E(r_i, r_j)$ is the energy cost of the interaction between positions i and j, assuming rotamers $r_i$ and $r_j$ respectivelly. The definition of E can be based on molecular dynamics such as AMBER.

In [3], the authors call this optimization problem PRODES (PROtein DESign).

### PRODES is NP-Hard

Pierce and Winfree [3] prove that PRODES is NP-Hard. We’ll provide an informal idea of the proof here.

First we need to prove that the decision version of PRODES is NP-Hard. The decision version of PRODES is, given a value K, determine whether there is a set of rotamers such that the energy cost is less or equal K. We’ll call it PRODESd. We can then prove that PRODESd is NP-complete by showing that it belongs to the NP complexity class and by reducing, in polynomial time, some known NP-complete problem to it.

We claim that this problem is in NP because given an instance for the problem we can verify in polynomial time whether it is a solution (i.e. returns true), since we just need to evaluate the cost function and verify whether it’s less than K.

We can reduce the 3-SAT problem, known to be NP-complete, to PRODESd. The idea is that we can map every instance of the 3-SAT to an instance of PRODESd and show that the result is “true” for that instance of 3-SAT if and only if the result is “true” for the mapped instance of PRODESd.

Let’s start by formalizing PRODESd as a graph problem, with an example shown in the picture below:

a) has 3 positions with their sets of rotamers. b) each rotamer becomes a vertex grouped into sets. We pick exactly one vertex per set and try to minimize the total cost of the edges associated with the selected vertices. Image source: [3]

Now, given a 3-SAT instance, we create a vertex set for each clause $C_i$ (containing a vertex for each literal), and a vertex set for each variable $x_i$ (containing vertices T and F). For each literal $x_i$ we add an edge of weight 1 to the vertex F of the set corresponding to variable $x_i$. Conversely, for each negated literal, $\bar x_i$, we add an edge of weight 1 to the vertex T. All other edges have weight 0.

For example, the instance $(x_1 + \bar x_2 + x_3) \cdot ( \bar x_1 + x_2 + x_3) \cdot (\bar x_3)$ yields the following graph where only edges of non-zero weight are shown:

Source: [3]

We claim that this 3-SAT instance is satisfiable if and only if the PRODESd is true for K=0. The idea is the following: for each vertex set corresponding to a variable we pick either T or F, which corresponds to assigning true or false to the variable. For each vertex set corresponding to a clause we pick a literal that will evaluate to true (and hence make the clause true). It’s possible to show that if 3-SAT is satisfiable, there’s a solution for PRODESd that avoids any edge with non-zero weight.

### Integer Linear Programming

Now that we know that PRODES is unlikely to have an efficient algorithm to solve it, we can attempt to obtain exact solutions using integer linear programming model. Let’s start with some definitions:

We can define our variables as:

The object function of our model becomes:

Finally, the constraints are:

Equation (1) says we should pick exactly one rotamer for each position. Constraints (2) and (3) enforce that $x_{i, j} = 1$ if and only if $r_i = r_j = 1$.

Note: the LaTeX used to generate the images above are available here.

### Conclusion

The study of protein prediction led me to protein design, which is a much simpler problem, even though from the computational complexity perspective it’s still an intractable problem.

The model we studied is very simple and makes a lot of assumptions, but it’s interesting as a theoretical computer science problem. Still I’m not sure how useful it is in practice.

### References

[1] Wikipedia – Protein design
[2] Wikipedia – Rotamer
[3] Protein Design is NP-hard – Niles A. Pierce, Erik Winfree

# Eulerian Circuits

Leonhard Euler was a Swiss mathematician in the 18th century. His paper on a problem known as the Seven Bridges of Königsberg is regarded as the first in the history in Graph Theory.

The history goes that in the city of Königsberg, in Prussia, there were seven bridges connecting different mass of lands along the Pregel river (see Figure 1). The challenge was to find a path through the city that crossed the bridges exactly once. Euler showed that no such solution existed.

Interesting unrelated fact: Today Königsberg called Kaliningrad in Russia, and Kaliningrad is actually separated from Russia geographically, lying between Lithuania and Poland.

Figure 1: Map of Königsberg and the seven bridges. Source: Wikipedia

The solution to the Seven Bridges of Königsberg problem eventually led to a branch of Mathematics known as Graph Theory. In this post we’ll be talking about the theoretical framework that can be used to solve problems like the Seven Bridges of Königsberg, which is known as Eulerian Circuits.

We’ll provide a general definition to the problem, discuss a solution and implementation, and finally present some extensions and variations to the problem.

## Definition

Let G(V, E) be a connected undirected graph, where V is the set of vertices and E the set of directed edges, and where (v, w) denotes an edge between vertices v and w. The Eulerian circuit problem consists in finding a circuit that traverses every edge of this graph exactly once or deciding no such circuit exists.

An Eulerian graph is a graph for which an Eulerian circuit exists.

## Solution

We’ll first focus on the problem of deciding whether a connected graph has an Eulerian circuit. We claim that an Eulerian circuit exists if and only if every vertex in the graph has an even number of edges.

We can see this is a necessary condition. Let v be a node with an odd number of edges. Any circuit traversing all edges will have to traverse v. Moreover, it will have to use one edge to “enter” v and one edge to “leave” v. Since this circuit can traverse each edge no more than one time, it will have to use different edges each time, meaning it needs 2 edges every time it crosses v. If there are an odd number of edges, one edge will be left unvisited.

To show this is sufficient, we can provide an algorithm that always finds an Eulerian circuit in a graph satisfying these conditions. Start from any vertex v and keep traversing edges, deleting them from the graph afterwards. We can’t get stuck on any vertex besides v, because whenever we enter an edge there must be an exit edge since every node has an even number of edges. Thus eventually we’ll come back to v, and this path form a circuit.

This circuit doesn’t necessarily cover all the edges in the graph though, nor it means that are other circuits starting from v in the remaining graph. It must be however, that some node w in the circuit we just found has another circuit starting from it. We can repeat the search for every such node and we’ll always find another sub-circuit (this is a recursive procedure, and we might find sub-sub-circuits). Note that after we remove the edges from a circuit, the resulting graph might be disconnected, but each individual component is still Eulerian.

Once we have all the circuits, we can assemble them into a single circuit by starting the circuit from v. When we encounter a node w that has a sub-circuit, we take a “detour” though that sub-circuit which will lead us back to w, and we can continue on the main circuit.

## Implementation

We’ll use the algorithm first described by Hierholzer to efficiently solve the Eulerian circuit problem, based on the proof sketched in the previous session.

The basic idea is that given a graph and a starting vertex v, we traverse edges until we find a circuit. As we’re traversing the edges, we delete them from the graph.

Once we have the circuit, we traverse it once more to look for any vertices that still have edges, which means these vertices will have sub-circuits. For each of these vertices we merge the sub-circuit into the main one. Assume the main circuit is given by a list  of vertices $(v, p_2, ... , p_k-1, w, p_k+1, ..., p_n-1, v)$ and w is a vertex with a sub-circuit. Let $(w, q_1, ..., q_m-1, w)$ be the sub-circuit starting from w. We can construct a new circuit $(v, p_2, ..., p_k-1, w, q_1, ..., q_m-1, w, p_k+1, ..., p_n-1, v)$.

Let’s look at a specific implementation using JavaScript (with Flow). The core of the algorithm implements the ideas discussed above:

The complete code is on Github.

## Analysis

We’ll now demonstrate that the algorithm described above runs in linear time of the size of the edges (i.e. O(|E|)).

Note that find_circuit() is a recursive function, but we claim that the number of times the while() loop executes across all function calls is bounded by the number of edges. The key is in the function:

graph.getNextEdgeForVertex(vertex);

graph is a convenience abstraction to an adjacency list, where for each vertex we keep a pointer to the last edge visited. Because of this,  getNextEdgeForVertex() will visit each edge of the graph at most once and we never “go back”. Since the graph object is shared across all function calls (global), we can see that the number of calls to getNextEdgeForVertex() is bounded by O(|E|), so is the number of times all while() loops execute.

Now we just need to prove that every other operation in the while loop is O(1). The only non-obvious one is:

graph.deleteEdge(edge);

This is a lazy deletion, meaning that we just set a flag in edge saying it’s deleted and it will later be taken into account by callers like graph.getNextEdgeForVertex() and graph.getDegree(). Hence, this is an O(1) operation.

For getNextEdgeForVertex(), we must skip edges that have been deleted, so we might need to iterate over a few edges before we find an undeleted one (or none if the graph is not Eulerian – in which case we terminate the algorithm). Since we’re still always processing at least one edge in every call to getNextEdgeForVertex() the argument about the total calls being bounded by O(|E|) holds.

In order for getDegree() to be an O(1) operation, we need to keep a non-lazy count of the degree of a vertex, but we can do it in O(1) when deleting an edge.

Finally, let’s analyze the second loop. The number of iterations is proportional to the length of the circuit. Since every possible circuit found (including the ones found recursively) are disjoint, the total number of times we loop over the vertices from circuits (across all function calls) is also bounded by the number of edges.

We already saw getDegree() is O(1) even with lazy deletion. The remaining operation is

path.insertAtVertex(vertex, subPath);

if we store the paths as a linked list of vertices, inserting subPath at a given node can be done in O(1) if we keep a reference from each vertex to its last (any actually) occurrence in the path.

## Directed Graphs

We can extend the definition of Eulerian graphs to directed graphs. Let G(V, A) be a strongly connected graph, where V is the set of vertices and A the set of directed edges, and where (v, w) indicate a directed edge from v to w. The Eulerian circuit problem for a directed graph consists in finding a directed circuit that traverses every edge of this graph exactly once or deciding no such circuit exists.

It’s possible to show that such a circuit exists if and only if the strongly connected directed graph has, for each vertex v, the same in-degree and out-degree. The algorithm is essentially the same.

## Counting Eulerian Circuits in directed graphs

It’s possible to count the number of different Eulerian circuits in a directed graph. According to the BEST theorem (named after de Bruijn, van Aardenne-Ehrenfest, Smith and Tutte) [3], the number of Eulerian circuits in a directed graph can be given by [4]:

$ec(G) = t_w(G) \prod_{v \in V}(deg(v) - 1)!$ (1)

Where deg(v) represents the in-degree (or out-degree) or a vertex v and t_w(G) is the number of arborescences rooted in a vertex w (simply put, an arborescence is analogous to a spanning tree for a directed graph – but we can only include edges that are directed away from the root).

It’s possible to show that t_w(G) is the same for any vertex w if G is Eulerian. We can compute t_w(G) via the Matrix-Tree theorem [2], which says t_w(G) is equal to the determinant of the Laplacian of G without vertex w. Let’s try to understand the idea behind this equation.

The mapping from an arborescence to an Eulerian path can be made by the following. Let r be the root of a possible arborescence of G. Now, let r be the reference starting point for an Eulerian path in G (note this is just for reference, since there’s no starting point in a circuit).

We say that an Eulerian path is associated with a given arborescence if for each vertex v, the last edge passing through v, say (v, v’), belongs to the arborescence. This is more clear with an example. Consider the digraph from Figure 2. Here we’ll consider the arborescences rooted in A.

Figure 2: Directed Graph

This graph has 2 possible arborescences depicted on the left in Figures 3 and 4. In Figure 3, we can see that the edge (B, D) has to be visited before (B, C) because (B, C) is in the arborescence.

Figure 3: One of the arborescences of G and a corresponding Eulerian circuit

Now, in Figure 4, because it’s (B, D) that’s in the arborescence, it has to be visited after we visit (B, C).

Figure 4: Another of the arborescence of G and a corresponding Eulerian circuit

Note that there can be more than one Eulerian path to a given arborescence. If B had more out-edges, we’d have multiple choices, since the arborescence only specifies the last edge to be taken, not the intermediate ones. More specifically, imagine B had k out-edges. Then we could traverse the first k-1 in any combination of orders, which leads to a total of (k – 1)! ways of doing so.

The same applies to all other nodes. Due to properties of Eulerian circuits, the choice of the out-edge at a given node can be seen as independent of the choice at other nodes, so the total possible Eulerian circuits corresponding to any arborescence is given by the product of the degrees from equation (1), namely:

$\prod_{v \in V}(deg(v) - 1)!$ (2)

The key property of categorizing Eulerian circuits into arborescence classes is that they’re disjoint, that is, a Eulerian circuit corresponds to exactly one arborescence. This, in conjunction with the fact that the vertices degrees in Equation (2) are from the original graph, and hence independent of a arborescence, lead us to the two independent factors in equation (1).

## Counting Eulerian Circuits in undirected graphs

Counting Eulerian circuits in undirected graphs is a much harder problem. It belongs to a complexity class known as #P-complete. This means that:

1. It belongs to the #P class, which can informally be seen as the counting version of NP problems. For example: deciding whether a given graph has an Hamiltonian circuit (path that traverses all vertices exactly once) is a problem in the NP class. Counting how many Hamiltonian circuits existing in that graph is the corresponding problem in the #P class.
2. It belongs to the #P-hard class, which means that any problem in #P can be reduced to it via a polynomial-time transformation.

Valiant proved the first condition in [5] while Brightwell and Winkler proved the second in [6] by reducing another #P-complete problem (counting Eulerian orientations) to it.

Note that a problem in the #P class is as hard as the equivalent class in NP, because we can reduce a problem in NP to #P. For example, we can decide whether a graph has an Hamiltonian circuit (NP problem) by counting the number of circuits it has (#P problem). The answer will be “yes” if it the #P version returns a number greater than 0 and “no” otherwise.

Because the problem of counting Eulerian circuits in an undirected graph being in #P, we can conclude that there’s no efficient (polynomial time) algorithm to solve it unless P = NP.

## Conclusion

In this post we covered Eulerian circuits in an informal way and provided an implementation for it in JavaScript. I spend quite some time to setup the JavaScript environment to my taste. I strongly prefer using typed JavaScript (with Flow) and using ES6 syntax. I decided to write it in JavaScript with the potential to create a step-by-step interactive tool to demonstrate how the algorithm works.

I was familiar with the concept of Eulerian circuits, but I didn’t remember the algorithms to solve it, even though I was exposed to one of them in the past. It was a good learning experience to write the code from scratch to really understand what I was doing.

This is the first time I see the #P complexity class. It’s always nice to learn about new theories when digging further on a specific topic.

## References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[2] Matrix-Tree Theorem for Directed Graphs – Margoliash, J.
[3] Circuits and trees in oriented linear graphs – Aardenne-Ehrenfest, van T., Bruijn, de N.G.
[4] Wikipedia – BEST Theorem
[5] The complexity of computing the permanent – L. G. Valiant
[6] Counting Eulerian circuits is #P-complete – Brightwell, G. and Winkler, P.

# Turing Machines and Undecidability

Jeffrey Ullman is a professor of Computer Science at Stanford. He is famous for his book Introduction to Automata Theory, Languages, and Computation. He teaches the Automata course in Coursera, and I’ve just finished the most recent edition.

Ullman’s research interests include database theory, data integration, data mining, and education using the information infrastructure. He was Sergey Brin’s PhD advisor and won the Knuth Prize in 2000 [1].

We divide this post in two parts. In the first part, we’ll provide a brief overview of automata theory, including basic concepts and terminology. In Part 2, we’ll focus on the limitations of automatons by discussing undecidability.

## Part 1: Automata Theory

In the following sections, we’ll go from the more limited automata, to a push-down automata and then to a turing machine.

### Finite automata

We can think of a finite automaton as a state machine. It’s represented by a directed graph where the nodes represent states and each edge $(u, v)$ connecting two states has a label, which is a symbol from a given alphabet $\Sigma$. Consider an initial state $q_0$, a set of final states $F$ and a string $w$ of the alphabet $\Sigma$.

To be more generic, we can describe the edges of the graph as a transition function. It takes as input a state $p$ and a symbol $c$ and outputs a new state $q$. We denote such function as $\delta$, so we have $\delta(p, c) \rightarrow q$.

Figure 1: Sample DFA

We can then decide if input $w$ is accepted by this automaton by simulating it on the automaton. That consists in consuming each character $c$ from $w$ at a time and moving from one state to another by following the edge labeled $c$. If there’s no such edge, we can assume it goes to a dead-end state where it gets stuck and is rejected. If after consuming all the input it is in one of the final states, we say it’s accepted, otherwise it’s rejected.

If there’s always at most one edge to take at any point of the simulation, we have a deterministic way to decide wheter $w$ is accepted, so we call such class of automaton, determinist finite automata or DFA for short. In case it has multiple choices, we have a non-deterministic finite automata, or NFA.

Surprisingly, a DFA is as powerful as a NFA, because we can simulate a NFA with a DFA, though such DFA might have an exponential number of states. DFA’s are also as powerful as regular expressions, which is out of the scope of this post.

We define a set of strings as a language. We say that an automaton defines a given language if it accepts all and only the strings contained in this language.

In the case of finite automata, the class of languages that they define is called regular languages.

### Push-down automata

A push-down automata is essentially a finite automata equipped with a stack. The symbols that can go into the stack are those from the input and also some additional ones. In this automaton, the transition function takes as input, the state $p$, the input symbol $c$ and the element at the top of the stack $X$. It outputs the new state $q$ and the new top of the stack that will replace the top $X$, which can be more than one element or event the empty symbol, meaning that we didn’t put any element on top.

This type of automaton can be shown to be equivalent to a context-free grammar (CFG), which is out of the scope of this post as well, but it’s very common for describing programming languages.

The set of languages for which there is a push-down automata that define them, is called context-free languages, which contains the regular languages.

### Turing machines

What could we accomplish with a finite automaton with 2 stacks? Quite a bit. Such automata are called two-push down stack automata. It’s possible to show that a finite automaton with 2 stacks is equivalent to a Turing machine.

So, what is exactly a Turing machine? It’s a device with an state, an infinite tape and a needle that points to a given position on this tape. The input $w$ is initially written on the tape and the needle points to the first character of $w$. All the other positions in the tape contains the blank symbol.

Figure 2: A concept of the Turing Machine

The transition function that takes as input the state $p$ and the input symbol pointed by the needle $c$ and it outputs the new state $q$, the new symbol that will be written replacing $c$ and the direction that the needle moves after that (either right or left). In case there’s no defined transition function for a state and input symbol, we say that the machine halts.

There are two main flavors of TM regarding when an input considered accepted: On options is that the TM has a set of final states and whenever we get to one of that states, we accept. The other option is by accepting when the TM halts.

It’s possible to show that in the general case both versions are equivalent, in the sense that both define the same class of languages, which was historically denoted by recursive enumerable languages.

An algorithm for a decision problem is a special type of TM that accepts by final state and it’s always guaranteed to halt (whether or not it accepts). The class of languages that are defined by algorithms are called recursive languages.

### Extending Turing machines

We can simulate more general TMs using our definition of TM. For example, we can model a TM with multiple tracks. Instead of having a single element, each position on the tape can contain a tuple, where each element of the such tuple represents the element on each tape with the same position.

We can even have multiple needles, by using a dedicated tape for representing the current position of each additional needle.

## Part 2: Undecidability

In this second part of the post we’ll discuss about undecidability. A problem is said undecidable if there’s no algorithm to solve it, that is, no TM that is guaranteed to halt on all its inputs.

Question: Are Turing machines powerful enough so that for any language there exists a corresponding TM that defines it?

Not really. We will next define a language that creates a paradox if a TM exists for it. Before that, we first discuss how we can encode TM’s as strings so it can be used as input to other TM’s.

Encoding Turing machines. Given a description of a Turing machine (the input, the transition functions), it’s possible to encode it as a binary string. Each possible Turing machine will have a distinct binary representation. We can then convert these strings to integer. Since two strings can be converted to the same integer because of leading zeroes (e.g. both 010 and 10 map to 2), we append a 1 before all the strings (e.g. now 1010 maps to 10 and 110 to 6).

We can now order all Turing machines by their corresponding integer representation. Note that some integers might not be encodings of an actual TM, but for simplicity, we can assume it’s a TM that accepts the empty language.

Since a TM can be seen as a string, we can say that a TM accepts itself if it accepts the corresponding encoding of itself as a binary string. We can even define a language as the set of TM encodings that don’t accept themselves:

$L_d = \{w | w \mbox{ is the encoding of a TM that doesn't accept itself} \}$

So we want to know whether there is a TM $T_d$ for $L_d$. Suppose there is. Then we ask: does $T_d$ accepts itself?

The self-referencing property is the origin Russell’s Paradox. It essentially defines a set $S$ that contains all elements (which can also be sets) that do not contain themselves. The paradox is that if $S$ contains itself, then there’s a set in $S$ that do contain itself, which is a contradiction. On the other hand, if it doesn’t contain itself, $S$ left out a set, which also contradicts the definition of $S$.

We can use the same analogy for TMs. If $T_d$ accepts itself, then it is accepting a TM that doesn’t comply with the constraints. If it does not, it left out the encoding of a TM that doesn’t accept itself.

The only assumption we made so far is that there is such TM for language $L_d$, so we can conclude this was a wrong assumption and thus $L_d$ is not a recursive enumerable language.

### The Universal Turing Machine

Question: Is there an algorithm to tell whether a Turing Machine accepts a given input $w$?

To answer this question, we’ll first define a special type of turing machine. The Universal Turing Machine (UTM) is a TM that takes as input a TM $M$ encoded as string and an input $w$ as a pair (M, $w$) and accepts it if and only if the corresponding TM accepts $w$. We denote by $L_u$ the language of the UTM.

We can design the UTM as a TM with 3 tapes (see multiple tapes TM on the Extended Turing Machines section),

* Tape 1 holds the M description
* Tape 2 represents the tape of M
* Tape 3 represents the state of M

The high-level steps we need to carry over (and for which we have to define the proper transition function, but we’re not doing here) are:

1. Verify that $M$ is a valid TM
2. Decide how many “bits” of the tape it needs to represent one symbol from $M$
3. Initialize Tape 2 with input w and Tape 3 with the initial state of $M$
4. Simulate $M$, by reading from Tape 1 the possible transitions given the state in Tape 3 and the input at Tape 2.
5. If the simulation of $M$ halts accepting the string w, then the UTM accepts $(M,w)$.

We can now answer the question in the beginning of this section by the following proposition:

Proposition: $L_u$ is recursive enumerable, but not recursive.

Proof: It’s recursive enumerable by definition, since it’s defined by a TM (in this case the UTM). So we need to show that the UTM is not guaranteed to always halt, that is,.

Suppose there’s such an algorithm. Then we have an algorithm to decide whether a string is in $L_d$. First we check if $w$ is a valid encoding of a TM. If it’s not, then we assume it represents a TM that defines an empty language, so obviously it doesn’t accept itself and thus should go into $L_d$.

Otherwise, we ask our hypothetical algorithm whether it accepts $(w, w)$. If it does, then the TM corresponding to $w$ accepts itself and should not go into $L_d$. On the other hand, if our algorithm doesn’t accept, it should go into $L_d$. We just described an algorithm that defines $L_d$, but in the previous section we proved that $L_d$ has no TM that defines it. This contradiction means that our assumption that an algorithm exists for $L_u$ is incorrect.

$\square$

This is equivalent to the halting problem, in which we want to know whether a program (a Turing machine) will halt for a given input or run forever.

### Post Correspondence Problem

Question: What are other problems undecidable problems?

We can show a problem is undecidable by reducing an already known undecidable problem to it. In particular, if we can show a particular problem can be used to simulate an universal Turing machine, we automatically prove it undecidable. We’ll now do this with a fairly simple problem to describe and that has no apparent connection to turing machines.

The post correspondence problem (PCP) can be stated as follows. Given $n$ pairs of strings of the form $(a_i, b_i)$, $1 \le i \le n$, the problem consists in finding a list of indexes $R = \{r_1, r_2, \cdots, r_k\}$ such that concatenating the pairs in this order leads to the same string, that is, $a_{r_1}a_{r_2} \cdots a_{r_k} = b_{r_1}b_{r_2} \cdots b_{r_k}$ (note that an index can be used more than once in the solution).

There’s a stricter version of this problem in which the first pair of the input must be the first pair in the solution, which we’ll refer to MPCP. This doesn’t make the problem harder because it’s possible to reduce it to the original PCP.

It’s possible to simulate a TM by solving the MPCP so that there’s a solution to the MPCP if, and only if, the TM accepts an input $w$. We provide the detailed reduction in the Appendix for the interested reader. This means that the if there exist an algorithm for PCP (and thus for MPCP), we have an algorithm for any TM, including the UTM, which is a contradiction, and thus we conclude that PCP is undecidable.

Many other problems can be proved to be undecidable by reducing the PCP to them. The list includes some decision problems regarding Context Free Grammars. For example, given a CFG, tell whether a string can be generated in more than one way (whether a CFG is ambiguous); or whether a CFG generates all the strings over the input alphabet; or whether a CFL is regular;

### Conclusion

So far this is the second course I finish on coursera (the other one I did was the Probabilistic Graphical Models). I’m very glad that classes from top tier universities are freely available on the web.

Automata theory was one of the gaps in my computer science theory base, especially Turing machines. I’ve done an introductory course on theoretical computer science, but it was mostly focused on intractable problems and cryptography.

### References

[1] Wikipedia – Jeffrey Ullman
[2] Slides from Automata Course – Coursera

### Appendix

Proposition. It’s possible to decide whether an universal turing machine accepts a string by reducing it to MPCP.

Sketch of proof. We won’t give a formal proof of the reduction, but the idea is pretty neat and after understanding it, the reduction becomes more intuitive.

First thing, let’s define a snapshot from a simulation of a TM. A snapshot is a compact description of the TM in a particular simulation step. We can represent it by placing the needle to the left of the tape position it points to. We can omit both the right and left infinite endpoints that are composed eintirely by blanks.

Thus, the first snapshot is represented by $q_0 w_1\cdots w_n$. If we move the needle to the right writing $X$ to the current position and changing the state to $p$, the snapshot becomes $x p w_2 \cdots w_n$. If instead from the first snapshot we moved to the left, writing $X$ and changing the state to $p$, we would have $p \square x w_2 \cdots w_n$, where $\square$ is the blank symbol and we need to represent it because it’s not part of the contiguous infinite blanks block anymore.

And each incomplete solution of the MPCP, we have the first part one step behind the second part. With that, we can “tie” to snapshots together by a pair, which can be use to encode the transition function.

The instance we’ll build for the MPCP will consist of the following classes of pairs:

First pair, represent the initial snapshot:

(1) $(\#, \#q_0w\#)$

Delimiter pair, to separate adjacent snapshots:

(2) $(\#, \#)$

Copy pair, to copy tape symbols that are not involved in the transition function. For each tape symbol $X$ we create the pair

(3) $(X, X)$

Transition function pairs

For a move to the right, that is, for each $\delta(q, X) \rightarrow (p, Y, R)$ we create the pair

(4) $(qX, Yp)$

For a move to the left, that is, for each $\delta(q, X) \rightarrow (p, Y, R)$ we create the pair

(5) $(ZqX, pZY)$

In case $X$ is the blank character, the right movement, that is $\delta(q, B) \rightarrow (p, Y, R)$ will become:

(6) we create $(q\#, Yp\#)$

The left movement, that is $\delta(q, B) \rightarrow (p, Y, L)$ will become:

(7) $(Zq\#, pZY\#)$

Final state pairs

For a final state $f$ and for all tape symbols $X, Y$ we add the following auxiliary pairs, from (8) to (10):

(8) $(XfY, f)$

(9) $(fY, f)$

(10) $(Xf, f)$

(11) $(f\#\#, \#)$

Simulation – Example

The solution will have to start with pair (1), because it’s the first pair and we’re solving the MPCP.

$\#$
$\#q_0 w_1 \cdots w_n$

We can now simulate a movement. For example, say we have a right move defined by $\delta(q_0, w_1) \rightarrow (p, x, R)$ by applying the corresponding pair (4) $(q_0 w_1, x p)$, which will lead us to the partial solution

$\#q_0 w_1$
$\#q_0 w_1 \cdots w_n \# x p$

for now, we can only use the pairs (3), because we need to match $w_2, w_3, \cdots w_n$, so we apply those pairs until we get

$\#q_0 w_1 \cdots w_n$
$\#q_0 w_1 \cdots w_n \# x p w_2 \cdots w_n$

Now, the only pair that can follow is (30, that is, $(\#, \#)$, which will finish the snapshot:

$\#q_0 w_1 \cdots w_n \#$
$\#q_0 w_1 \cdots w_n \# x p w_2 \cdots w_n \#$

Note that the pairs of form (3) are only used to copy the remaining part of the tape. That’s because the transition function itself is local, in the sense that it doesn’t know about the tape values that are not involved its definition.

This movement is analogous for the other transition functions.

The remaining case is when the reach a final state, in which we want to accept my making the first part of the partial solution to catch up with the second part.

Say we are in this stage:

$\cdots \#$
$\cdots \#ABfCDE\#$

where $f$ is a final stage. We first apply the copy pair (3) on A:

$\cdots \#A$
$\cdots \#ABfCDE\#A$

and then use pair (8) as $(BfC, f)$ to get

$\cdots \#ABfC$
$\cdots \#ABfCDE\#Af$

we can only copy and finish the snapshot using (3) and (4):

$\cdots \#ABfCDE\#$
$\cdots \#ABfCDE\#AfDE\#$

we need the pair (8) again, now as $(AfD, f)$ to get

$\cdots \#ABfCDE\#AfD$
$\cdots \#ABfCDE\#AfDE\#f$

again, by copying and finishing the snapshot we have,

$\cdots \#ABfCDE\#AfDE\#$
$\cdots \#ABfCDE\#AfDE\#fE\#$

now we use the pair (9) as $(fE, f)$ to obtain

$\cdots \#ABfCDE\#AfDE\#fE$
$\cdots \#ABfCDE\#AfDE\#fE\#f$

ending another snapshot

$\cdots \#ABfCDE\#AfDE\#fE\#$
$\cdots \#ABfCDE\#AfDE\#fE\#f\#$

finally we use the closing pair (11)

$\cdots \#ABfCDE\#AfDE\#fE\#f\#\#$
$\cdots \#ABfCDE\#AfDE\#fE\#f\#\#$

Note that the final closing is carried over using several auxiliary snapshots that don’t actually correspond to the actual simulation, but it’s easy to obtain the right snapshots from a solution to the MPCP.

$\square$

# Flood it! An exact approach

Flood-it is a game created by LabPixies, which was recently aquired by Google.

The game consists of a $n \times n$ board with random colors cells. Let’s call the top-left cell a seed cell and the region connected to the seed cell and with the same color as it, the flooded region. At each round, the player chooses a color for the flooded region which may flood adjacent regions, expanding the flooded region. The final objective is to flood all the board.

Figure 1. Flooding proccess [1]

In the original game, there are three different sizes of boards: 14, 21 or 28. The number of colors is always 6.

A paper from Clifford, Jalsenius, Montanaro and Sach [1], presents theoretical results regarding the general version of this problem, where the size of the board and the number of colors can be unbounded.

In this post we’ll highlight the main results of their paper and present an integer linear programming approach to solve it exactly.

### NP-Hardness

For $C \ge 3$, the game is shown to be NP-hard even if one is allowed to start flooding from an arbitrary cell. The proof given in [1] is based on a reduction from the shortest common supersequence problem (SCS for short).

### Greedy approaches

There are two greedy approaches one might try:

1. Pick the color that maximizes the number of cells covered.

2. The most frequent color in the perimeter of the current region.

In Figure 2, we have an instance where this strategies can be arbitrarily bad [1]. They use $n$ colors while the optimal solution is 3.

Figure 2: A 10×10 board for which the greedy algorithms perform badly [1]

### Approximation algorithms

Surprisingly, the following naïve strategy:

Cycle through all colors until the board is colored.

gives a solution with a value within a factor of the optimal value.

More specifically, if $L$ is the optimal number of movements and $C$ the number of available colors, then this algorithm solves the problem with no more than $C L$ movements. To see why, let $\{c_1, c_2, \cdots, c_L \}$ be the color sequence of the optimal solution. In the worst case, each cycle covers at least one color in the sequence. Thus, this algorithm is $C$ approximated.

This can be refined to $C-1$ observing that the cycle does not need to have the current color of the flooded region.

The authors improve this bound with a randomized algorithm that achieves a $\frac{2c}{3}$ expected factor.

They also give a lower bound, proving that if the number of colors is arbitrary, no polynomial time approximated algorithm with a constant factor can exist unless P=NP.

### Bounds

An upper bound for the number of movements required for solving any $n \times n$ is given by the following theorem:

Theorem 1: There exists a polynomial time algorithm for Flood-It which can flood any n x n board with C colors in at most $2n + (\sqrt{2C})n + C$ moves.

Conversely, we can set an lower bound for a $n \times n$ board:

Theorem 2: For $2 \le C \le n^2$, there exists an $n \times n$ board with (up to) $c$ colors which requires at least $(\sqrt{C - 1})n/2 - C/2$ moves to flood.

That is, we can’t expect an algorithm to perform much better than the one from Theorem 1 for arbitrary boards.

### Integer Linear Programming

Let $C$ be the number of colors and $k$ an upper bound for the optimal solution. A component is a connected region of pixels with same color, considering 4-adjacency. Two components $i$ and $j$ are adjacenct if there exists at least a pixel in $i$ adjacent to a pixel in $j$.

We denote by $N(i)$ the set of components adjacent to component $i$. Let $S$ be the set of components and $m = |S|$. Furthermore, let $c_{i}$ be the color of component $i$.

We define the binary variable $x_{ct}$ that is 1 if color $c = 1, \cdots, C$ is chosen at iteration $t = 1, \cdots, k$ or 0 otherwise. We also define the binary variable $y_{it}$ that is 1 if component $i = 1, \cdots, m$ is filled in some iteration $t$ or 0 otherwise.

For simplicity, we’ll assume that the component of the seed cell has index $i = 1$.

(1) Objective function:

$\displaystyle \min \sum_{c=1}^{C} \sum_{t=1}^{k} x_{ct}$

(2) For each component $i$,

$\displaystyle \sum_{t=1}^{k} y_{it} = 1$

(3) For each iteration $t$,

$\displaystyle \sum_{c=1}^{C} x_{ct} \le 1$

(4) For each component $i$ and iteration $t$,

$\displaystyle y_{it} \le \sum_{j \in N(i)} \sum_{t' = 1}^{t - 1} y_{jt'}$

(5) For each component $i$ and iteration $t$

$y_{it} \le x_{c_it}$

(6) For each component $i = 2, \cdots, k$,

$y_{i,0} = 0$

(7) For the component of seed cell,

$y_{1,0} = 1$

(8) For all component $i$ and iteration $t$,

$y_{it} \in \{0, 1\}$

(9) For all color $c$ and iteration $t$,

$x_{ct} \in \{0, 1\}$

Constraint (2) states that we fill a component exactly in one iteration. Constraint (3) says that at each iteration we pick at most one color. Constraint (4) allow us to fill a component only if any of its adjacent components has been already filled (which means it currently has the same color as the seed pixel) and if its color is the same as the chosen color (5) .

Finally, constraints (6)-(9) state the variables are binary and each component starts unfilled except the component of the seed cell.

### Computational Experiments

We’ve implemented this model in C++ using the COIN-OR Cbc library. The code is available, as always, on github.

The first task was obtaining “real” instances. I don’t know whether the colors of Flood it! boards are uniformly chosen. Thus, I preferred to take some print screens from the original game and do some image processing to convert it matrices with integer representing colors.

Figure 3: A 14×14 Flood it! puzzle

Unfortunately the model is too big if we use $k$ as the number of components, $m$. The number of constraints (4) is then $O(m^3)$ and for the instance in Figure 3, $m = 127$.

In order to reduce its size, we tried two tricks.

First, we changed our model a little bit to decrease the number of contraints (4). Now, $y_{it}$ may also be 1 if component $i$ was covered some iteration before $t$. Thus, we can rewrite (4) as,

(4′) For each component $i$,

$\displaystyle y_{it} \le \sum_{j \in N(i)} y_{j(t-1)}$

But now we need to allow a component to be “covered” if it was covered in a previous iteration, even if the current color does not match. Thus, (5) becomes:

(5′) For each component $i$ and iteration $t$

$y_{it} \le x_{c_it} + y_{i(t-1)}$

We also need to change (2) equality to $\ge$ inequalities. Now, note that there are $O(m^2)$ constraints (4′).

The second trick is based on the assumption that in general, the optimal number of movements is much less than the number of components. Thus, solve the model for increasing values of $k$ starting with the number of colors until we find a feasible solution.

Even with these changes, we were not able to solve the instance in Figure 3. The 5×5 instance obtained from the first rows and cols of the matrix is solved in 2 minutes using 9 colors.

0 0 1 2 2 3 4 0 0 1 0 0 0 2 4 1 3 3 3 4 3 0 4 0 2 

Solution:

1 2 0 1 3 0 3 4 2

For the 6×6 instance obtained the same way, the solver does not find the optimal solution in an hour.

### Conclusion

In this post we gave a brief summary of Clifford et al. paper and then presented a integer linear programming approach to solve instances exactly.

As our experimental results showed, even for the smallest boards (14 x 14) we’re not able to obtain optimal solutions in feasible time.

As future work, we may try devising alternative models or find additional inequalities. Another possibility is to solve this same model using a commercial solver like CPLEX.

### References

[1] Clifford R., Jalsenius M., Montanaro A. and Sach B. – The Complexity of Flood Filling Games (arxiv.org)