# Protein Design

We previously learned about the problem of predicting the folding of a protein, that is, given a chain of amino acids, find its final 3D structure. This time we’re interested in the reverse problem, that is, given a 3D structure, find some chain of amino-acid that would lead to that structure once fully folded. This problem is called Protein Design.

In this post we’ll focus on mathematical models for this problem, studying its computational complexity and discuss possible solutions.

### Mathematical Modeling

The 3D structure of a protein can be divided into two parts: the backbone and the side chains. In the model proposed by Piece and Winfree [3], we assume the backbone is rigid and that we’ll try to find the amino-acids for the side chains such that it minimizes some energy function.

This means we’re not really trying to predict the whole chain of amino acids, but a subset of those amino that will end up on the side chains.

The amino acids on the side-chain can have a specific orientation [2, 3], known as rotamer, which in turn can be represented by a single value, its dihedral angle. We can define some arbitrary order for these amino acids and label them with an index, which we call position.

At each position there are multiple rotamers possible and we need to select them to minimize the overall energy function. More formally, for each position i, let $R_i$ be the set of rotamers available and $r_i \in R_i$ the chosen rotamer.

The model assumes the cost function of the structure is pairwise decomposable, meaning that we can account for the interaction of each pair independently when calculating the total energy:

Where $E(r_i, r_j)$ is the energy cost of the interaction between positions i and j, assuming rotamers $r_i$ and $r_j$ respectivelly. The definition of E can be based on molecular dynamics such as AMBER.

In [3], the authors call this optimization problem PRODES (PROtein DESign).

### PRODES is NP-Hard

Pierce and Winfree [3] prove that PRODES is NP-Hard. We’ll provide an informal idea of the proof here.

First we need to prove that the decision version of PRODES is NP-Hard. The decision version of PRODES is, given a value K, determine whether there is a set of rotamers such that the energy cost is less or equal K. We’ll call it PRODESd. We can then prove that PRODESd is NP-complete by showing that it belongs to the NP complexity class and by reducing, in polynomial time, some known NP-complete problem to it.

We claim that this problem is in NP because given an instance for the problem we can verify in polynomial time whether it is a solution (i.e. returns true), since we just need to evaluate the cost function and verify whether it’s less than K.

We can reduce the 3-SAT problem, known to be NP-complete, to PRODESd. The idea is that we can map every instance of the 3-SAT to an instance of PRODESd and show that the result is “true” for that instance of 3-SAT if and only if the result is “true” for the mapped instance of PRODESd.

Let’s start by formalizing PRODESd as a graph problem, with an example shown in the picture below:

a) has 3 positions with their sets of rotamers. b) each rotamer becomes a vertex grouped into sets. We pick exactly one vertex per set and try to minimize the total cost of the edges associated with the selected vertices. Image source: [3]

Now, given a 3-SAT instance, we create a vertex set for each clause $C_i$ (containing a vertex for each literal), and a vertex set for each variable $x_i$ (containing vertices T and F). For each literal $x_i$ we add an edge of weight 1 to the vertex F of the set corresponding to variable $x_i$. Conversely, for each negated literal, $\bar x_i$, we add an edge of weight 1 to the vertex T. All other edges have weight 0.

For example, the instance $(x_1 + \bar x_2 + x_3) \cdot ( \bar x_1 + x_2 + x_3) \cdot (\bar x_3)$ yields the following graph where only edges of non-zero weight are shown:

Source: [3]

We claim that this 3-SAT instance is satisfiable if and only if the PRODESd is true for K=0. The idea is the following: for each vertex set corresponding to a variable we pick either T or F, which corresponds to assigning true or false to the variable. For each vertex set corresponding to a clause we pick a literal that will evaluate to true (and hence make the clause true). It’s possible to show that if 3-SAT is satisfiable, there’s a solution for PRODESd that avoids any edge with non-zero weight.

### Integer Linear Programming

Now that we know that PRODES is unlikely to have an efficient algorithm to solve it, we can attempt to obtain exact solutions using integer linear programming model. Let’s start with some definitions:

We can define our variables as:

The object function of our model becomes:

Finally, the constraints are:

Equation (1) says we should pick exactly one rotamer for each position. Constraints (2) and (3) enforce that $x_{i, j} = 1$ if and only if $r_i = r_j = 1$.

Note: the LaTeX used to generate the images above are available here.

### Conclusion

The study of protein prediction led me to protein design, which is a much simpler problem, even though from the computational complexity perspective it’s still an intractable problem.

The model we studied is very simple and makes a lot of assumptions, but it’s interesting as a theoretical computer science problem. Still I’m not sure how useful it is in practice.

### References

[1] Wikipedia – Protein design
[2] Wikipedia – Rotamer
[3] Protein Design is NP-hard – Niles A. Pierce, Erik Winfree

# Eulerian Circuits

Leonhard Euler was a Swiss mathematician in the 18th century. His paper on a problem known as the Seven Bridges of Königsberg is regarded as the first in the history in Graph Theory.

The history goes that in the city of Königsberg, in Prussia, there were seven bridges connecting different mass of lands along the Pregel river (see Figure 1). The challenge was to find a path through the city that crossed the bridges exactly once. Euler showed that no such solution existed.

Interesting unrelated fact: Today Königsberg called Kaliningrad in Russia, and Kaliningrad is actually separated from Russia geographically, lying between Lithuania and Poland.

Figure 1: Map of Königsberg and the seven bridges. Source: Wikipedia

The solution to the Seven Bridges of Königsberg problem eventually led to a branch of Mathematics known as Graph Theory. In this post we’ll be talking about the theoretical framework that can be used to solve problems like the Seven Bridges of Königsberg, which is known as Eulerian Circuits.

We’ll provide a general definition to the problem, discuss a solution and implementation, and finally present some extensions and variations to the problem.

## Definition

Let G(V, E) be a connected undirected graph, where V is the set of vertices and E the set of directed edges, and where (v, w) denotes an edge between vertices v and w. The Eulerian circuit problem consists in finding a circuit that traverses every edge of this graph exactly once or deciding no such circuit exists.

An Eulerian graph is a graph for which an Eulerian circuit exists.

## Solution

We’ll first focus on the problem of deciding whether a connected graph has an Eulerian circuit. We claim that an Eulerian circuit exists if and only if every vertex in the graph has an even number of edges.

We can see this is a necessary condition. Let v be a node with an odd number of edges. Any circuit traversing all edges will have to traverse v. Moreover, it will have to use one edge to “enter” v and one edge to “leave” v. Since this circuit can traverse each edge no more than one time, it will have to use different edges each time, meaning it needs 2 edges every time it crosses v. If there are an odd number of edges, one edge will be left unvisited.

To show this is sufficient, we can provide an algorithm that always finds an Eulerian circuit in a graph satisfying these conditions. Start from any vertex v and keep traversing edges, deleting them from the graph afterwards. We can’t get stuck on any vertex besides v, because whenever we enter an edge there must be an exit edge since every node has an even number of edges. Thus eventually we’ll come back to v, and this path form a circuit.

This circuit doesn’t necessarily cover all the edges in the graph though, nor it means that are other circuits starting from v in the remaining graph. It must be however, that some node w in the circuit we just found has another circuit starting from it. We can repeat the search for every such node and we’ll always find another sub-circuit (this is a recursive procedure, and we might find sub-sub-circuits). Note that after we remove the edges from a circuit, the resulting graph might be disconnected, but each individual component is still Eulerian.

Once we have all the circuits, we can assemble them into a single circuit by starting the circuit from v. When we encounter a node w that has a sub-circuit, we take a “detour” though that sub-circuit which will lead us back to w, and we can continue on the main circuit.

## Implementation

We’ll use the algorithm first described by Hierholzer to efficiently solve the Eulerian circuit problem, based on the proof sketched in the previous session.

The basic idea is that given a graph and a starting vertex v, we traverse edges until we find a circuit. As we’re traversing the edges, we delete them from the graph.

Once we have the circuit, we traverse it once more to look for any vertices that still have edges, which means these vertices will have sub-circuits. For each of these vertices we merge the sub-circuit into the main one. Assume the main circuit is given by a list  of vertices $(v, p_2, ... , p_k-1, w, p_k+1, ..., p_n-1, v)$ and w is a vertex with a sub-circuit. Let $(w, q_1, ..., q_m-1, w)$ be the sub-circuit starting from w. We can construct a new circuit $(v, p_2, ..., p_k-1, w, q_1, ..., q_m-1, w, p_k+1, ..., p_n-1, v)$.

Let’s look at a specific implementation using JavaScript (with Flow). The core of the algorithm implements the ideas discussed above:

The complete code is on Github.

## Analysis

We’ll now demonstrate that the algorithm described above runs in linear time of the size of the edges (i.e. O(|E|)).

Note that find_circuit() is a recursive function, but we claim that the number of times the while() loop executes across all function calls is bounded by the number of edges. The key is in the function:

graph.getNextEdgeForVertex(vertex);

graph is a convenience abstraction to an adjacency list, where for each vertex we keep a pointer to the last edge visited. Because of this,  getNextEdgeForVertex() will visit each edge of the graph at most once and we never “go back”. Since the graph object is shared across all function calls (global), we can see that the number of calls to getNextEdgeForVertex() is bounded by O(|E|), so is the number of times all while() loops execute.

Now we just need to prove that every other operation in the while loop is O(1). The only non-obvious one is:

graph.deleteEdge(edge);

This is a lazy deletion, meaning that we just set a flag in edge saying it’s deleted and it will later be taken into account by callers like graph.getNextEdgeForVertex() and graph.getDegree(). Hence, this is an O(1) operation.

For getNextEdgeForVertex(), we must skip edges that have been deleted, so we might need to iterate over a few edges before we find an undeleted one (or none if the graph is not Eulerian – in which case we terminate the algorithm). Since we’re still always processing at least one edge in every call to getNextEdgeForVertex() the argument about the total calls being bounded by O(|E|) holds.

In order for getDegree() to be an O(1) operation, we need to keep a non-lazy count of the degree of a vertex, but we can do it in O(1) when deleting an edge.

Finally, let’s analyze the second loop. The number of iterations is proportional to the length of the circuit. Since every possible circuit found (including the ones found recursively) are disjoint, the total number of times we loop over the vertices from circuits (across all function calls) is also bounded by the number of edges.

We already saw getDegree() is O(1) even with lazy deletion. The remaining operation is

path.insertAtVertex(vertex, subPath);

if we store the paths as a linked list of vertices, inserting subPath at a given node can be done in O(1) if we keep a reference from each vertex to its last (any actually) occurrence in the path.

## Directed Graphs

We can extend the definition of Eulerian graphs to directed graphs. Let G(V, A) be a strongly connected graph, where V is the set of vertices and A the set of directed edges, and where (v, w) indicate a directed edge from v to w. The Eulerian circuit problem for a directed graph consists in finding a directed circuit that traverses every edge of this graph exactly once or deciding no such circuit exists.

It’s possible to show that such a circuit exists if and only if the strongly connected directed graph has, for each vertex v, the same in-degree and out-degree. The algorithm is essentially the same.

## Counting Eulerian Circuits in directed graphs

It’s possible to count the number of different Eulerian circuits in a directed graph. According to the BEST theorem (named after de Bruijn, van Aardenne-Ehrenfest, Smith and Tutte) [3], the number of Eulerian circuits in a directed graph can be given by [4]:

$ec(G) = t_w(G) \prod_{v \in V}(deg(v) - 1)!$ (1)

Where deg(v) represents the in-degree (or out-degree) or a vertex v and t_w(G) is the number of arborescences rooted in a vertex w (simply put, an arborescence is analogous to a spanning tree for a directed graph – but we can only include edges that are directed away from the root).

It’s possible to show that t_w(G) is the same for any vertex w if G is Eulerian. We can compute t_w(G) via the Matrix-Tree theorem [2], which says t_w(G) is equal to the determinant of the Laplacian of G without vertex w. Let’s try to understand the idea behind this equation.

The mapping from an arborescence to an Eulerian path can be made by the following. Let r be the root of a possible arborescence of G. Now, let r be the reference starting point for an Eulerian path in G (note this is just for reference, since there’s no starting point in a circuit).

We say that an Eulerian path is associated with a given arborescence if for each vertex v, the last edge passing through v, say (v, v’), belongs to the arborescence. This is more clear with an example. Consider the digraph from Figure 2. Here we’ll consider the arborescences rooted in A.

Figure 2: Directed Graph

This graph has 2 possible arborescences depicted on the left in Figures 3 and 4. In Figure 3, we can see that the edge (B, D) has to be visited before (B, C) because (B, C) is in the arborescence.

Figure 3: One of the arborescences of G and a corresponding Eulerian circuit

Now, in Figure 4, because it’s (B, D) that’s in the arborescence, it has to be visited after we visit (B, C).

Figure 4: Another of the arborescence of G and a corresponding Eulerian circuit

Note that there can be more than one Eulerian path to a given arborescence. If B had more out-edges, we’d have multiple choices, since the arborescence only specifies the last edge to be taken, not the intermediate ones. More specifically, imagine B had k out-edges. Then we could traverse the first k-1 in any combination of orders, which leads to a total of (k – 1)! ways of doing so.

The same applies to all other nodes. Due to properties of Eulerian circuits, the choice of the out-edge at a given node can be seen as independent of the choice at other nodes, so the total possible Eulerian circuits corresponding to any arborescence is given by the product of the degrees from equation (1), namely:

$\prod_{v \in V}(deg(v) - 1)!$ (2)

The key property of categorizing Eulerian circuits into arborescence classes is that they’re disjoint, that is, a Eulerian circuit corresponds to exactly one arborescence. This, in conjunction with the fact that the vertices degrees in Equation (2) are from the original graph, and hence independent of a arborescence, lead us to the two independent factors in equation (1).

## Counting Eulerian Circuits in undirected graphs

Counting Eulerian circuits in undirected graphs is a much harder problem. It belongs to a complexity class known as #P-complete. This means that:

1. It belongs to the #P class, which can informally be seen as the counting version of NP problems. For example: deciding whether a given graph has an Hamiltonian circuit (path that traverses all vertices exactly once) is a problem in the NP class. Counting how many Hamiltonian circuits existing in that graph is the corresponding problem in the #P class.
2. It belongs to the #P-hard class, which means that any problem in #P can be reduced to it via a polynomial-time transformation.

Valiant proved the first condition in [5] while Brightwell and Winkler proved the second in [6] by reducing another #P-complete problem (counting Eulerian orientations) to it.

Note that a problem in the #P class is as hard as the equivalent class in NP, because we can reduce a problem in NP to #P. For example, we can decide whether a graph has an Hamiltonian circuit (NP problem) by counting the number of circuits it has (#P problem). The answer will be “yes” if it the #P version returns a number greater than 0 and “no” otherwise.

Because the problem of counting Eulerian circuits in an undirected graph being in #P, we can conclude that there’s no efficient (polynomial time) algorithm to solve it unless P = NP.

## Conclusion

In this post we covered Eulerian circuits in an informal way and provided an implementation for it in JavaScript. I spend quite some time to setup the JavaScript environment to my taste. I strongly prefer using typed JavaScript (with Flow) and using ES6 syntax. I decided to write it in JavaScript with the potential to create a step-by-step interactive tool to demonstrate how the algorithm works.

I was familiar with the concept of Eulerian circuits, but I didn’t remember the algorithms to solve it, even though I was exposed to one of them in the past. It was a good learning experience to write the code from scratch to really understand what I was doing.

This is the first time I see the #P complexity class. It’s always nice to learn about new theories when digging further on a specific topic.

## References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[2] Matrix-Tree Theorem for Directed Graphs – Margoliash, J.
[3] Circuits and trees in oriented linear graphs – Aardenne-Ehrenfest, van T., Bruijn, de N.G.
[4] Wikipedia – BEST Theorem
[5] The complexity of computing the permanent – L. G. Valiant
[6] Counting Eulerian circuits is #P-complete – Brightwell, G. and Winkler, P.

# The PageRank algorithm

In this post we’ll discuss the PageRank algorithm. The problem it tries to solve is ranking web pages based on their relevance. The general idea is that links to a page represent a good indicator to its relevance. It models the entire web as a giant directed graph, where links represent arcs, and then perform random walks following trough the links and assigns a score to each page based of the probability of landing at it.

This algorithm was developed by Sergey Brian and Larry Page during their PhD in Stanford. Besides the obvious reference to ranking of pages, PageRank also plays with the last name of Larry [1].

### Theory

The basic modeling of the web as a graph $G = (V, A)$ is simple. The set of vertices $V$ is composed by each web page (document), and an edge $(u, v) \in A$ if the page corresponding to $u$ has a link the page corresponding to $v$. Let $N = |V|$ be the number of nodes.

The rank of each node represents the probability of being at a node at a given time if we are to perform a random walk in the graph. We can start at any node with equal probability, so the rank of each node starts as $rank(i) = 1/N$.

At each node, we can change to any one the adjacent nodes with equal probability, so this graph is essentially a Markov chain. We can represent the rank of a given node $v$ at time t through the following equation:

(1) $Pr(v,t) = \sum_{(u, v) \in V} \dfrac{Pr(u, t-1)}{d(u)}$

Where $d(u)$ is the out-degree of vertex $u$.

One limitation of this simple approach is that nodes that do not have outbound links will “drain” all the probability, because once we arrive at such nodes, there’s no way out, so as time passes by, the probability of getting trapped in such nodes goes to 1.

A simple solution is to add a “teleport” probability, in which at each node we can teleport to any other node in the graph with a small probability. We control the chance of a teleport happening by the parameter $\beta$. More formally, the page rank equation is now the following:

(2) $Pr(v,t) = \beta \left(\sum_{(u, v) \in V} \dfrac{Pr(u, t-1)}{d(u)} \right) + \dfrac{(1 - \beta)}{N}$

#### Matricial form

Let $M$ be the adjacency matrix corresponding to the set of arcs $A$, that is, $M$ is a $N \times N$ matrix, and $M_{(i, j)} = 1$ if, and only if, $(i, j) \in A$. Let $M'$ be $M$ normalized its values normalized by the sum of each row. That is,

$M'_{(i, j)} = \dfrac{M_{(i, j)}}{d(i)}$

We can then rewrite (1) as:

(3) $Pr(t) = {M'}^T Pr(t - 1)$

and (2) as:

(4) $Pr(t) = \beta {M'}^T Pr(t - 1) + \dfrac{(1 - \beta)}{N} \vec{1}$

where $\vec{1}$ corresponds to a column vector with 1’s. Let’s simplify even further. Since for each iteration t, Pr is a probability distribution (that is, $\sum_{v \in V} Pr(v, t) = 1$), then we can rewrite (2) as:

(5) $Pr(v,t) = \beta \left(\sum_{(u, v) \in V} \dfrac{Pr(u, t-1)}{d(u)} \right) +$
$\dfrac{(1 - \beta)}{N} \left(\sum_{v \in V} Pr(v, t-1) \right)$

We can now define $\hat{M} = \beta {M'}^T + \dfrac{(1 - \beta) E}{N}$

where $E$ is a $N \times N$ matrix with all ones. With that in mind, we can re-write (5) simply as:

(6) $Pr(t) = \hat{M} Pr(t-1)$

In the next section we’ll show $\hat{M}$ has some interesting properties that guarantees the page rank converges to a final value. Using that information, we’ll be able to iterate until $|Pr(t) - Pr(t - 1)| < \epsilon$.

#### Properties of the transition matrix

$\hat{M}$ is stochastic. A matrix is said (row or column) stochastic if the sum of each column or row is 1. For a given column of $\hat{M}$ there are $d(i)$ entries summing to 1 and which are multiplied by $\beta$ and also $N$ entries that sum to $N$ and multiplied by $\frac{(1 - \beta)}{N}$.

$\hat{M}$ is irreducible. A matrix is reducible if it can be transformed into a upper-diagonal matrix by rows/columns permutation. In the context of Markov chains, the transition matrix is irreducible if there is a non-zero probability of transitioning (even if in more than one step) from any state to any other state [3].

Alternatively, we can show a matrix $M$ is not reducible by the following procedure [2]: Construct a directed graph $G(M)$ where $(i,j) \in A$ iff $M_{(i,j)} > 0$. Then, $M$ is irreducible iff $G(M)$ is strongly connected. Because we’re adding the teleport links, we can show $G(\hat{M})$ is a complete graph and clearly strongly connected.

$\hat{M}$ is aperiodic. If a matrix is such that $M^{p} = M$, we say it has period $p$. A matrix with maximum period 1 is said aperiodic. We can compute the period of a matrix $M$ by constructing the same directed graph as above. If $M$ is irreducible, then the period is the greatest common divisor of the length of the closest path for a given node. Since by adding the teleport links we also add self-arcs (i.e. there’s a chance of staying in the same node), there’s always a path of length 1, so the GCD is always 1, and thus is the period.

$\hat{M}$ is positive recurrent. Let $f_{ij}^{(m)}$ be the probability a node $j$ will be first visited at time $m$ if we start from the state $i$. Let $\mu_{ij}$ be the expected value of $m$ for a given $i$ and $j$.

We say a matrix is recurrent if $\sum_{m=1}^{\infty} f_{ij}^{m} = 1$ for all $i, j$, that is, we’re sure every node will be eventually visited no matter where we start. A matrix is positive-recurrent if $\mu_{ii} < \infty$, that is, the expected number of steps for a random walk to come back to the state if started is not infinite.

If a Markov chain has a finite set of states (i.e. the vertices of the graph) and its transition matrix is irreducible, then we can show this transition matrix is also positive recurrent [4].

With these properties, we're ready to throw in the Fundamental theorem of Markov Chains [5]:

For any irreducible, aperiodic, positive-recurrent Markov chain, there’s a unique stationary distribution $\pi$ such that

$\pi^{(t + 1)} = \pi^{(t)}$

This also means the final rank is an eigenvector with a corresponding eigenvalue $\lambda = 1$, since

$\lambda Pr = \hat{M} Pr$

A similar but more general result is Perron Frobenius theorem.

### Conclusion

I’ve been watching the Mining Massive Data Sets course in Coursera. It hasn’t finished yet, but I found the idea of the PageRank pretty interesting from the mathematical point of view. I didn’t do any actual coding here, but I’m planning to revisit this subject when I learn more about the Map reduce paradigm.

### References

[1] Wikipedia: Page Rank Theorem
[2] Wikipedia: Perron Frobenius Theorem
[3] Wikipedia: Irreducibility (mathematics)
[4] Mathematics Stack Exchange: Irreducible, finite Markov chains are positive recurrent
[5] The Fundamental Theorem of Markov Chains: Aaron Plavnick

# Supervised Machine Learning

Andrew Ng is a Computer Science professor at Stanford. He is also one of the co-founders of the Coursera platform. Andrew recently joined the Chinese company Baidu as a chief scientist.

His main research focus is on Machine Learning, Artificial Intelligence and Deep Learning. He teaches the Machine Learning class in Coursera, which lasts 10 weeks and is comprised of videos, multiple choice assignments and programming assingments (with Octave).

Professor Andrew is a very good lecturer and makes the content more accessible by relying more on intuition rather than rigor, but always mentioning that a given result is backed by mathematical proofs. The material is very practical and he often discusses about real world techniques and applications.

Throughout the classes I got curious about the math behind many of the results he presented, so this post is an attempt to help me with that, as well as providing a overview of the content of the course.

### Supervised Learning

Suppose we have a set of data points and each of these points encodes information about a set of features, and for each of these vectors we are also given a value or label, which represents the desired output. We call this set a training set.

We want to find a black box that encodes some structure from this data and is able to predict the output of a new vector with certain accuracy.

It’s called supervised because we require human assistance in collecting and labeling the training set. We’ll now discuss some methods for performing supervised learning, specifically: linear regression, logistic regression, neural networks and support vector machines.

### Linear regression

We are given a set $X$ of $m$ data points, where $x_i \in R^n$ and a set of corresponding values $Y$, where $y_i \in R$.

Linear regression consists in finding a hyperplane in $R^n$, which we can represent by $y = \Theta^{T} x$ (here the constant is represented by $\Theta_0$ and we assume $x_0 = 1$), such that the sum of the square distances of $y_i$ and the value of $x_i$ in that hyperplane, that is, $\Theta^{T} x_i$, for each $i=1, \cdots, m$, is minimized.

More formally, we can define a function:

$f_{X,Y}(\Theta) = \sum_{i=1}^{m} (y_i - \Theta^{T}x_i)^2$

We want to fit a hyperplane that minimizes $f$. We can do so by finding the optimal parameters $\Theta^{*}$:

$\Theta^{*} = \min_{\Theta} f_{X, Y}(\Theta)$.

This linear regression with the least square cost function is also known as linear least squares.

We can solve this problem numerically, because it’s possible to show that $f(\Theta)$ is convex. One common method is gradient descent (we used this method before for solving Lagrangian relaxation).

We can also find the optimal $\Theta^{*}$ analytically, using what is known as the normal equation:

$(X^T X) \Theta^{*} = X^T y$

This equation comes from the fact that a convex function has its minimum/maximum value when its derivative in terms of its input is 0. In our case, $\dfrac{\partial f_{X, Y}}{\partial \Theta}(\Theta^{*}) = 0$.

The assumption is that our data lies approximately in a hyperplane, but in reality it might be closer to a quadratic curve. We can then create a new feature that is a power of one of the original features, so we can represent this curve as a plane in a higher dimension.

This method also returns a real number corresponding to the estimated value of $y$, but many real world problems are classification problems, most commonly binary (yes/no). For this case linear regression doesn’t work very well. We’ll now discuss another method called logistic regression.

### Logistic regression

In logistic regression we have $y_i \in \{0, 1\}$ and the idea is that after trained, our model returns a probability that a new input belongs to the class $y=1$.

Here we still train our model to obtain $\Theta$ and project new input in the corresponding hyperplane to obtain a numerical value. But this time, we convert this to the range [0-1], by applying the sigmoid function, so it becomes

$h_{\Theta}(x^{(i)}) = \dfrac{1}{1 - e^{(\Theta^T x^{(i)})}}$

This represents the probability that given $x^{(i)}$ parametrized by $\Theta$, the output will be 1, that is, $p(y^{(i)}=1|x^{(i)}; \Theta)$. The probability of y being 0 is the complementary, that is

$p(y^{(i)}=0|x^{(i)}; \Theta) = 1 - p(y^{(i)}=1|x^{(i)}; \Theta)$

The likelihood of the parameter $\Theta$ to fit the data can be given by

(1) $\ell(\Theta) = \prod_{i=1}^{m} (h_{\Theta}(x^{(i)}))^{y^{(i)}} (1 - h_{\Theta}(x^{(i)}))^{({1 - y^{(i)}})}$

We want to find the maximum likelihood estimation (MLE), that is, the parameter $\Theta$ that minimizes (1). We can apply the log function to the $\ell(\Theta)$, because the optimal $\Theta$ is the same in both. This get us

(2) $- \left[\sum_{i=1}^{m} y^{(i)} \log (h_{\Theta}(x^{(i)})) + (1 - y^{(i)})\log(1 - h_{\Theta}(x^{(i)}))\right]$

This function doesn’t allow a closed form, so we need a numerical algorithm, such as gradient descent. This function is not necessarily convex either, so we might get stuck in local optima.

After estimating the parameter $\Theta^{*}$, we can classify a input point $x$ by:

$y = \left\{ \begin{array}{ll} 1 & \mbox{if } h_{\Theta^{*}}(x)) \ge 0.5 \\ 0 & \mbox{otherwise} \end{array} \right.$

### Neural Networks

Neural networks can be seen as a generalization of the simple form of logistic regression seen above. In there, we have one output ($y$) that is a combination of several inputs ($x_j$). We can use a diagram to visualize it. More specifically, we could have a directed graph like the following:

Single Layer Network

Here we have nodes representing the input data and the output and each edge as a corresponding parameter $\Theta$. We can partition this graph in layers. In this case, the input and the output layer. In order to generalize that, we could add intermediate layers, which can be called hidden-layers:

Multi-layer network

In the general case, we have $L$ layers. Between layers $\ell$ an $\ell+1$, we have a complete bipartite graph. Each edge $(i,j)$ connecting node $i$ at level $\ell$ to node $j$ at level $\ell+1$ is associated to a $\Theta_{ij}^{(\ell)}$. The idea at each node it the same: perform the linear combination of the nodes incident to it and then apply the sigmoid function. Note that at each layer, we have one node that is not connected to any of the previous layers. This one acts as the constant factor for the nodes in the following layer. The last node will represent the final value of the function based on the input $X$.

There’s a neat algorithm to solve this, called back-propagation algorithm.

Forward-propagation is the method of calculating the cost function for each of the nodes. The idea is the following. For each level $\ell$ we’ll use variables $a$ and $z$.

Initialize $a$:

$a^{(0)} = x$

For the following layers $\ell=1, \cdots, L$ we do:

$\begin{array}{ll} z^{(\ell)} & = (\Theta^{(\ell-1)})^{T} a^{(\ell-1)}\\ a^{(\ell)} & = g(z^{(\ell)}) \end{array}$

for the simple case of binary classification, the last layer will contain a single node and after running the above procedure, it will contain the cost function for an input $x$:

$h_{\Theta}(x) = a^{(L)}$

Then, we can define the cost function $J$ to be the average of (2) for the training set.

$J(\Theta) = - \dfrac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log (h_{\Theta}(x^{(i)})) + \right.$
$\left. (1 - y^{(i)})\log(1 - h_{\Theta}(x^{(i)}))\right]$

Back-propagation is the algorithm to find the derivatives of the cost function with respect to $\Theta$. In this procedure, we use the helper variables $\delta$, which we initialize as:

$\delta^{(L)} = a^{(L)} - y^{(i)}$

We iterate over the layers backwards $\ell = L-1, \cdots, 1$:

$\delta^{(\ell)} = (\Theta^{(\ell)})^{T} \delta^{(\ell+1)} .* g'(z^{(i)})$

$\delta$ is defined in such a way that

$\dfrac{\partial J(\Theta)}{\partial \Theta_{j,k}^{(\ell)}} = \dfrac{1}{m} \sum_{i=1}^{m} a_{k}^{(\ell)} \delta_{j}^{(\ell+1)}$

For an sketch of the derivation of these equations, see the Appendix.

### Support Vector Machine

Support Vector Machine is a method for binary classification and it does so by defining a boundary that separates points from different classes. For this model, our labels are either $y \in \{-1, 1\}$ (as opposed to $y \in \{0, 1\})$.

In the simple case, where we want to perform a linear separation, it consists in finding a hyperplane that partitions the space in regions, one containing points with y=-1 and the other y=1; also, it tries to maximize the minimum distance from any point to this hyperplane.

Finding a separating hyperplane (Wikipedia)

Instead of working with the parameters $\Theta$, we’ll have $w$ and $b$, so we can represent our hyperplane as $w^Tx^{(i)} + b$. The classification function is give by:

$h_{w,b}(x) = g(w^Tx + b)$

Where $g$ is the sign function, that is, $g(z) = 1$ if $z \ge 0$ and $g(z) = -1$ otherwise.

#### Margin

For each training example $x^{(i)}$, we can define a margin as:

$\gamma^{(i)} = y^{(i)} (w^Tx^{(i)} + b)$

Intuitively, a function margin represents the confidence that point $x^{(i)}$ belongs to the class $y^{(i)}$. Given $m$ input points, we define the margin of the training set as

(3) $\gamma = \min_{ i = 1, \cdots, m} \gamma^{(i)}$

We then want to maximize the margin, so we can write the following optimization problem:

$\max \gamma$

s.t. $y^{(i)} (w^Tx^{(i)} + b) \ge \gamma$

The constraint is to enforce (3). The problem with this model is that we can arbitrarily scale $w$ and $b$ without changing $h_{w,b}(x)$, so the optimal solution is unbounded.

The alternative is to normalize the by dividing by $||w||$, so it becomes indifferent to scaling.

#### Primal form

It’s possible to show that in order to maximize the margin, we can minimize the following linear programming with quadratic objective function:

$\min \frac{1}{2} ||w||^2$

s.t. $y^{(i)}(w^Tx^{(i)} + b) \ge 1$, for $i = 1, \cdots, m$

#### Dual form

For non-linear programming, it’s possible to make use of Lagrangian multipliers to obtain the dual model. In particular, the following model obtained using this idea, together with the primal form, satisfy the Karush–Kuhn–Tucker (KKT) conditions, so the value of the optimal solution for both cases is the same.

$\max_{\alpha} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)}y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$

s.t.
$\begin{array}{llr} \alpha_i & \ge 0 & i=1,\cdots,m \\ \sum_{i=1}^{m} \alpha_i y^{(i)} & = 0 \end{array}$

It’s possible to show that the non-zeros entries of $\alpha_i$ correspond to the input points that will define the margin. We call these points support vectors.

Moreover, $w$ can be obtained by the following equation:

$w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$

#### Non-linear SVM: Kernels

Kernel is a way to generate features from the training set. For example, we can generate $m$ features from the $m$ points in the training set by having feature $f_i$ to be the similarity of a point to it. One common similarity function is the Gaussian distance. Given points $x^{(i)}$ and $x^{(j)}$, we can compute it by:

$K_{\mbox{gaussian}}(x^{(i)}, x^{(j)}) = \exp \left(-\dfrac{||x^{(i)} - x^{(j)}||^2}{2\sigma^2}\right)$

### Conclusion

I’ve completed an Artificial Intelligence class in college and previously did the PGM (Probabilistic Graphical Models). One of the main reasons I enrolled for this course was the topic of Support Vector Machines.

Unfortunately, SVM was discussed in a high level and apart from kernels, it was pretty confusing and mixed up with logistic regression. SVM happened to be a very complicated topic and I ended up learning a lot about it with this post, but I still don’t fully understand all the details.

### References

[1] Wikipedia: Back-propagation
[2] Prof. Cosma Shalizi’s notes – Logistic Regression
[4] Prof. Andrew Ng’s notes – Support Vector Machine
[5] Wikipedia: Support Vector Machine
[6] An Idiot’s guide to Support Vector Machines

### Appendix: Derivation of the backpropagation relations

In this appendix we’ll calculate the partial derivative of $J(\Theta)$ in relation to a $\Theta_{ij}^{(\ell)}$ component to get some insight into the back-propagation algorithm. For simplicity, we’ll assume a single input $x^{(i)}$, so we can get rid of the summation.

By the chain rule we have:

$\dfrac{\partial J(\Theta)}{\partial \Theta_{j, k}^{(\ell)}} = \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} \cdot \dfrac{\partial z^{(\ell)}}{\partial \Theta_{j,k}^{(\ell)}}$

Let’s analyze each of the terms:

For the term $\dfrac{\partial z^{(l)}}{\partial \Theta_{j,k}^{(\ell)}}$, we have that $z^{(\ell)} = \Theta^{(\ell)} a^{(\ell-1)}$ or breaking down by elements:

$z^{(\ell)}_{k'} = \sum_{j=1}^{n^{(\ell-1)}} \Theta^{(\ell)}_{j,k'} a^{(\ell-1)}_j$

For $k=k'$, it’s easy to see that $\dfrac{\partial z^{(\ell)}_k}{\partial \Theta_{j,k}^{(\ell)}} = a^{(\ell-1)}_j$. For $k \ne k'$,

$\dfrac{\partial z^{(\ell)}_{k'}}{\partial \Theta_{j,k}^{(\ell)}} = 0$

For the term $\dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}}$, we have that $a^{(\ell)} = g(z^{(\ell)})$, thus, $\dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} = g'(z^{(\ell)})$

Finally, for the term $\dfrac{\partial J(\Theta)}{\partial a^{(\ell)}}$, we will assume that $J(\Theta) \approx \dfrac{1}{2} (a^{(\ell)} - y)^2$

If $\ell = L$, then $a^{(\ell)} = h_\Theta(x)$, so it’s the derivative is straightforward:

$\dfrac{\partial J(\Theta)}{\partial a^{(L)}} = (h_\Theta(x) - y)$

For the intermediate layers, we can find the derivative by writing it in terms of subsequent layers and then applying the chain rule. Inductively, suppose we know $\dfrac{\partial J(\Theta)}{\partial a^{(\ell)}}$ (base is $\ell = L$) and we want to find $\dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}}$. We can write:

(4) $\dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}} = \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot \dfrac{\partial a^{(\ell)}}{\partial a^{(\ell - 1)}}$

Given that $a^{(\ell)} = g(z^{(\ell)}) = g((\Theta^{(\ell-1)})^T a^{(\ell - 1)})$, we have:

$\dfrac{\partial a^{(\ell)}}{\partial a^{(\ell - 1)}} = \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} \cdot \dfrac{\partial z^{(\ell)}}{\partial a^{(\ell - 1)}} = (\Theta^{(\ell - 1)})^T \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}}$

Replacing back in (4):

$\dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}} = \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot (\Theta^{(\ell -1)})^T \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} = (\Theta^{(\ell -1)})^T \dfrac{\partial J(\Theta)}{\partial z^{(\ell)}}$

If we name $\delta^{(\ell)} = \dfrac{\partial J(\Theta)}{\partial z^{(\ell)}}$, we now have

(5) $\dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}} = (\Theta^{(\ell -1)})^T \delta^{(\ell)}$

“Undoing” one of the chain steps, we have

$\dfrac{\partial J(\Theta)}{\partial z^{(\ell-1)}} = \dfrac{\partial J(\Theta)}{\partial a^{(\ell-1)}} \cdot \dfrac{\partial a^{(\ell-1)}}{\partial z^{(\ell-1)}}$

Replacing (5) here,

$= (\Theta^{(\ell -1)})^T \delta^{(\ell)} \dfrac{\partial a^{(\ell-1)}}{\partial z^{(\ell-1)}}$

With that, we can get the following recurrence:

$\delta^{(\ell-1)} = (\Theta^{(\ell-1)})^T \delta^{(\ell)} g'(z^{(\ell - 1)})$

Finally, replacing the terms in the original equation give us

$\dfrac{\partial J(\Theta)}{\partial \Theta_{j, k}^{(\ell)}} = \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} \cdot \dfrac{\partial z^{(\ell)}}{\partial \Theta_{j,k}^{(\ell)}} = a_j^{(\ell)} \delta_k^{(\ell+1)}$

# Lawler and an Introduction to Matroids

Eugene Lawler was an American computer scientist, professor of UC Berkeley and was one of the founders of the field of Combinatorial Optimization.

Lawler made important contributions on branch and bound algorithms, dynamic programming, and was also the first one to observe that matroid intersection could be done in polynomial time.

He also proved that two of Karp’s 21 NP-Complete problems, The Directed Hamiltonian Cycle and 3-Dimensional Matching were NP-Complete.

This is the first post in our series of Matroids in Combinatorial Optimization context. They will be mainly based on the Matroids and the Greedy Algorithm chapter from Combinatorial Optimization – Networks and Matroids, by Eugene Lawler.

### Introduction

Hassler Whitney developed Matroids in 1935 in the context of algebraic theory and it was further applied by Jack Edmonds in the context of combinatorial optimization.

Matroids are a structure that allows us to solve problems by always taking local optimal steps and by doing that there’s the guarantee we’ll reach the global optimum. These types of algorithms are known as greedy algorithms.

First we’ll define Matroids and then give some examples of problems to modeled as matroids. Next, we’ll introduce weighted matroids and describe a generic algorithm to solve them and how such algorithm applied to matroids in graphs is actually the famous Kruskal algorithm.

### Definition

Let $E$ be a set of elements and $\cal I$ a family of subsets of $E$ (family here has the meaning of a set of elements that share some properties and it’s also clearer than using ‘set of subsets’). Let $M = (E, \cal I)$ and consider the following properties:

1) $\emptyset \in \cal I$ and if a subset $I \in \cal I$, then all subsets of $I$ belong to $\cal I$ as well.

2) If $I_p \in {\cal I}$ with $|I_p| = p$ and $I_{p+1} \in {\cal I}$ with $|I_{p+1}| = p+1$, then there is an element $e \in I_{p+1} \setminus I_{p}$ such that $I + e \in {\cal I}$. (Henceforth, by abuse of notation, when we say $I + e$ we mean $I \cup \{e\}$).

If $M$ satisfies both (1) and (2), we say that $M$ is a matroid.

Terminology. An independent set is any set $I$ belonging to the family $\cal I$. If there’s no other set containing $I$ in $\cal I$ we say it’s a maximal independent set. Note that by property (2), all maximal independent sets have the same size.

The rank of a set $E$, denoted by $r(E)$ is the largest independent subset of $E$. A minimal dependent set is called circuit.

### Special types of matroids

We can model some structures as matroids and take advantage of matroids properties to find solutions to problems involving these structures. In this section we’ll present a few examples of such modelings.

The modelling process consists in defining the set $E$, the family $\cal I$ (usually by defining the property that the subsets of $E$ must have to be in there) and then proving that $\cal I$ satisfies (1) and (2).

Matric Matroid.

A matric matroid is a matroid in the context of matrices. Given a matrix $A$ with the set of columns as $E$, if $\cal I$ is the family of sets containing only linear independent columns, then $M = (E, {\cal I})$ is a matric matroid.

Proof. It’s easy to see that any subset of a linear independent (LI) set of columns is also LI, so (1) is straightforward. For (2), we need to prove that given subsets $I_{p+1}$ and $I_{p}$ of $p+1$ and $p$ LI columns, respectively, then there must be a column $c$ from $I_{p+1} \setminus I_{p}$ such that $I_{p} + c$ is still LI. Well, if it’s not the case, it means that each column in $I_{p+1}$ can be written as linear combination of the $p$ LI columns from $I_{p}$, which contradicts the fact that $I_{p+1}$ is LI.

Graphic Matroid.

A graphic matroid is a matroid in the context of graphs. Given a graph $G = (V, E)$, if $\cal I$ is the family of arcs that do not form a cycle, then $M = (E, {\cal I})$ is a graphic matroid.

Proof. It’s possible to show that subsets of edges that are cycle free, correspond to LI columns in the incidence matrix of $G$, so in this case we can also view $M$ as a matric matroid of the incidence matrix of $G$.

Matching Matroid.

A matching matroid is a matroid in the context of matchings in graphs.

Let $G = (V, E)$ be an undirected graph and let $\cal I$ be the family of subsets of nodes $I \subseteq V$ such that there’s a matching in $G$ that covers all nodes in $I$ (we say that an node in $I$ is covered if there’s an edge in the matching incident to it, even if the other end of the edge is not in $I$). Then $M = (V, {\cal I})$ is a matching matroid.

Sketch of the Proof. It’s easy to verify that $\cal I$ satisfies (1). The proof of why it satisfies (2) consists in getting the symmetric difference between matchings covering $p+1$ and $p$ vertices respectively and showing that in that difference there will be at least one alternating path such that if we change the alternation will increase the number of matched vertices.

### Weighted Matroids

Let $M = (E, {\cal I})$ be a matroid and $w$ a weight function for elements in $E$. The problem we want to solve is to find the independent set $I \in {\cal I}$ that has maximum weight, where the weight of a set is the sum of the weight of its elements.

Let’s assume the elements on each independent set are listed in the non-increasing order of weight, so given sets $I_1$ and $I_2$, we list them as

$I_1 = \{a_1, a_2, \cdots, a_m\}$ and $I_2 = \{b_1, b_2, \cdots, b_n\}$

such that $w(a_1) \ge w(a_2) \ge \cdots w(a_m)$ and $w(b_1) \ge w(b_2) \ge \cdots w(b_n)$

We say $I_1$ is lexicographically greater than $I_2$ if for the first position their components differ, say at index $k$, $a_k > b_k$ or in case $I_2$ listing is a prefix of $I_1$ listing (that is, the same way we sort strings).

A set that is not lexicographically less than any other set is said to be lexicographically maximum. We can show such set is also a maximum independent set, because otherwise, by property (2), we can always add more elements to it, making it lexicographically greater.

The following Theorem from Rado and Edmonds states an important property of weighted matroids:

Theorem 1. Let $M = (E, {\cal I})$ be a matroid. Then

3) For any negative weight function on $E$, a lexicographically maximum set in $\cal I$ is also the set with the maximum weight.

Conversely, given $E$ and $\cal I$, if (1) and (3) are satisfied, $M = (E, {\cal I})$ is a matroid.

We say that a set $B \in {\cal I}$ is Gale optimal if all its components are not less than the corresponding components of any other set $I \in {\cal I}$ when these components are listed in non-increasing order. More formally, there exists a one-to-one mapping $h:I \rightarrow B$ such that $w(e) \le w(h(e))$ for all $e \in I$.

Note that a Gale optimal set is clearly a lexicographically maximum and thus has optimal weight.

Given that, Gale’s theorem provides another a stronger result than Theorem 1 regarding weighted matroids:

Theorem 2. Let $M = (E, {\cal I})$ be a matroid. Then

4) For any weight function of elements on $E$, there exists a Gale optimal set $B \in {\cal I}$.

Conversely, given $E$ and $\cal I$, if (1) and (4) are satisfied, $M = (E, {\cal I})$ is a matroid.

Weighted Matroid Greedy Algorithm. Property (4) allows to use a simple greedy algorithm to find the set of the largest weight in $\cal I$.

In the first step, we look for the single-element set $I = \{e_0\}$ in $\cal I$ with the largest weight. By property (2) and (4), we can show that the Gale optimal set contains $e_0$. Next, we look for the largest element $e_1$ such that $\{e_0, e_1\} \in {\cal I}$. Again, we can show that such elements are contained in the Gale optimal set. We repeat this until we get a maximum independent set, which is also the Gale optimal set.

More generically, we have the following algorithm:

Let $S$ be our current solution, which starts as the empty set. At each step, we look for the element $e$ with maximum weight not in $S$ such that $S + e$ belongs to $\cal I$.

The Maximum Spanning Tree problem and the Kruskal Algorithm.

In the maximum spanning tree problem we are given a connected, undirected graph $G = (V,E)$ with non-negative weights $w$ on $V$ and we want to find a spanning tree (subset of $E$ that forms a tree) with the lowest cost.

Recall that a tree is a connected graph without cycles. The graphic matroid represents cycle-free subsets of arcs (or a forest) and thus the maximum independent set of a connected graph is a tree. If we assign non-negative weights to arcs, we can find the optimal Gale independent set which corresponds to the maximum spanning tree.

The Kruskal algorithm is then basically an application of the weighted matroid greedy algorithm for graphic matroids: It first sorts the edges by non-increasing order of weight and then adds an edge to the current solution if it doesn’t form a cycle.

### Conclusion

In this post we learned the basics of matroids and weighted matroids. In future posts we’ll study Matroid Intersection, Matroid Partition and other types of matroids like Oriented Matroids.

### References

[1] Wikipedia – Eugene Lawler
[2] Combinatorial Optimization – Networks and Matroids, Eugene Lawler – Chapter: Matroids and the Greedy Algorithm