Eulerian Circuits

Leonhard Euler was a Swiss mathematician of the 18th century. His paper on a problem known as the Seven Bridges of Königsberg is regarded as the first in the history of graph theory.

The story goes that in the city of Königsberg, in Prussia, seven bridges connected different landmasses along the Pregel river (see Figure 1). The challenge was to find a walk through the city that crossed each bridge exactly once. Euler showed that no such walk exists.

Interesting unrelated fact: today Königsberg is called Kaliningrad and belongs to Russia, even though it is geographically separated from the rest of the country, lying between Lithuania and Poland.

Figure 1: Map of Königsberg and the seven bridges. Source: Wikipedia

The solution to the Seven Bridges of Königsberg problem eventually led to the branch of mathematics known as graph theory. In this post we’ll be talking about the theoretical framework that can be used to solve problems like the Seven Bridges of Königsberg: Eulerian circuits.

We’ll provide a general definition of the problem, discuss a solution and an implementation, and finally present some extensions and variations of the problem.

Definition

Let G(V, E) be a connected undirected graph, where V is the set of vertices and E the set of edges, and where (v, w) denotes an edge between vertices v and w. The Eulerian circuit problem consists in finding a circuit that traverses every edge of this graph exactly once, or deciding that no such circuit exists.

An Eulerian graph is a graph for which an Eulerian circuit exists.

Solution

We’ll first focus on the problem of deciding whether a connected graph has an Eulerian circuit. We claim that an Eulerian circuit exists if and only if every vertex in the graph has even degree, that is, an even number of incident edges.
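As an illustration (not part of the original post), once connectivity is established the decision part of the problem reduces to a parity check on the degrees. A minimal sketch in JavaScript, where adjacency is a hypothetical array mapping each vertex to its list of neighbors:

// Sketch: a connected graph has an Eulerian circuit iff every vertex
// has even degree. `adjacency` is a hypothetical adjacency structure.
const hasEulerianCircuit = (adjacency) =>
  adjacency.every((neighbors) => neighbors.length % 2 === 0);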

We can see this is a necessary condition. Let v be a node with an odd number of edges. Any circuit traversing all edges will have to pass through v, using one edge to “enter” v and another to “leave” it. Since the circuit traverses each edge at most once, it consumes 2 distinct edges every time it crosses v, so if v has an odd number of edges, one of them will be left unvisited.

To show this is sufficient, we can provide an algorithm that always finds an Eulerian circuit in a graph satisfying this condition. Start from any vertex v and keep traversing edges, deleting them from the graph afterwards. We can’t get stuck on any vertex besides v, because whenever we enter a vertex there must be an unused edge to leave it, since every vertex has an even number of edges. Thus eventually we’ll come back to v, and this path forms a circuit.

This circuit doesn’t necessarily cover all the edges in the graph, though, nor is there necessarily another circuit starting from v in the remaining graph. However, if any edges remain, some node w on the circuit we just found must still have unused edges, and hence another circuit starting from it. We can repeat the search for every such node, and we’ll always find another sub-circuit (this is a recursive procedure: we might find sub-sub-circuits). Note that after we remove the edges of a circuit, the resulting graph might be disconnected, but each individual component is still Eulerian.

Once we have all the circuits, we can assemble them into a single circuit starting from v. When we encounter a node w that has a sub-circuit, we take a “detour” through that sub-circuit, which leads us back to w, and then we continue on the main circuit.

Implementation

We’ll use the algorithm first described by Hierholzer to efficiently solve the Eulerian circuit problem, based on the proof sketched in the previous section.

The basic idea is that given a graph and a starting vertex v, we traverse edges until we find a circuit. As we’re traversing the edges, we delete them from the graph.

Once we have the circuit, we traverse it once more looking for vertices that still have edges; these vertices have sub-circuits, and for each of them we merge the sub-circuit into the main one. Assume the main circuit is given by a list of vertices (v, p_2, \cdots, p_{k-1}, w, p_{k+1}, \cdots, p_{n-1}, v), where w is a vertex with a sub-circuit, and let (w, q_1, \cdots, q_{m-1}, w) be the sub-circuit starting from w. We can construct the new circuit (v, p_2, \cdots, p_{k-1}, w, q_1, \cdots, q_{m-1}, w, p_{k+1}, \cdots, p_{n-1}, v).

Let’s look at a specific implementation using JavaScript (with Flow). The core of the algorithm implements the ideas discussed above:
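A minimal sketch of that core, assuming a graph abstraction that exposes getNextEdgeForVertex(), deleteEdge() and getDegree() with the semantics discussed in the Analysis section below; the edge objects with v/w endpoint fields and the array-based merging are simplifications of this sketch (the actual implementation keeps the merging O(1) with the linked-list path structure described later):

// Sketch of Hierholzer's algorithm. `graph` is assumed to expose
// getNextEdgeForVertex(), deleteEdge() and getDegree(); edges are
// assumed to carry their two endpoints as `v` and `w`.
function findCircuit(graph, startVertex) {
  // Walk from startVertex, deleting edges, until we're stuck. With
  // all degrees even, we can only get stuck back at startVertex.
  const circuit = [startVertex];
  let vertex = startVertex;
  let edge;
  while ((edge = graph.getNextEdgeForVertex(vertex)) != null) {
    graph.deleteEdge(edge);
    vertex = edge.v === vertex ? edge.w : edge.v; // cross the edge
    circuit.push(vertex);
  }
  // Any vertex on the circuit that still has edges starts a
  // sub-circuit; find it recursively and splice it in.
  const result = [];
  for (const v of circuit) {
    result.push(v);
    if (graph.getDegree(v) > 0) {
      const subCircuit = findCircuit(graph, v);
      result.push(...subCircuit.slice(1)); // skip the duplicated v
    }
  }
  return result;
}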

The complete code is on Github.

Analysis

We’ll now demonstrate that the algorithm described above runs in time linear in the number of edges, i.e. O(|E|).

Note that find_circuit() is a recursive function, but we claim that the number of times the while() loop executes across all function calls is bounded by the number of edges. The key is the function:

graph.getNextEdgeForVertex(vertex);

graph is a convenience abstraction over an adjacency list, where for each vertex we keep a pointer to the last edge visited. Because of this, getNextEdgeForVertex() visits each edge of the graph at most once and we never “go back”. Since the graph object is shared across all function calls (global), the total number of calls to getNextEdgeForVertex() is bounded by O(|E|), and so is the number of times all while() loops execute.

Now we just need to prove that every other operation in the while loop is O(1). The only non-obvious one is:

graph.deleteEdge(edge);

This is a lazy deletion, meaning that we just set a flag in edge saying it’s deleted and it will later be taken into account by callers like graph.getNextEdgeForVertex() and graph.getDegree(). Hence, this is an O(1) operation.

For getNextEdgeForVertex(), we must skip edges that have been deleted, so we might need to iterate over a few edges before we find an undeleted one (or none, if the graph is not Eulerian, in which case we terminate the algorithm). Since we still process at least one edge entry in every call to getNextEdgeForVertex(), the argument that the total work is bounded by O(|E|) holds.

In order for getDegree() to be an O(1) operation, we keep a non-lazy count of the degree of each vertex, which we can update in O(1) when deleting an edge.
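Putting these pieces together, the graph abstraction might look like the following sketch (the class and member names here are assumptions, not necessarily the ones in the actual code):

// Adjacency lists with a per-vertex cursor (edges are never
// revisited), lazy edge deletion, and eagerly maintained degrees.
class EulerianGraph {
  constructor(vertexCount) {
    this.adjacency = Array.from({ length: vertexCount }, () => []);
    this.cursor = new Array(vertexCount).fill(0);
    this.degree = new Array(vertexCount).fill(0);
  }
  addEdge(v, w) {
    const edge = { v, w, deleted: false };
    this.adjacency[v].push(edge);
    this.adjacency[w].push(edge);
    this.degree[v]++;
    this.degree[w]++;
  }
  // Advances the cursor past deleted edges. Each adjacency entry is
  // scanned at most once across all calls, giving O(|E|) in total.
  getNextEdgeForVertex(v) {
    const edges = this.adjacency[v];
    while (this.cursor[v] < edges.length && edges[this.cursor[v]].deleted) {
      this.cursor[v]++;
    }
    return this.cursor[v] < edges.length ? edges[this.cursor[v]] : null;
  }
  // Lazy deletion: O(1). Degree counters are updated eagerly so that
  // getDegree() stays O(1).
  deleteEdge(edge) {
    edge.deleted = true;
    this.degree[edge.v]--;
    this.degree[edge.w]--;
  }
  getDegree(v) {
    return this.degree[v];
  }
}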

Finally, let’s analyze the second loop. The number of iterations is proportional to the length of the circuit. Since all the circuits found (including the ones found recursively) are edge-disjoint, the total number of times we loop over the vertices of circuits (across all function calls) is also bounded by the number of edges.

We already saw getDegree() is O(1) even with lazy deletion. The remaining operation is

path.insertAtVertex(vertex, subPath);

If we store the paths as linked lists of vertices, inserting subPath at a given node can be done in O(1), as long as we keep a reference from each vertex to one (say the last) of its occurrences in the path.
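A sketch of such a path structure (again, the names are assumptions of this sketch):

// A singly linked list of vertex occurrences with head/tail pointers
// and a map from each vertex to one of its nodes.
class CircuitPath {
  // `vertices` is a circuit as a sequence, e.g. [v, ..., v].
  constructor(vertices) {
    this.head = null;
    this.tail = null;
    this.nodeForVertex = new Map();
    for (const v of vertices) {
      const node = { vertex: v, next: null };
      if (this.tail) this.tail.next = node;
      else this.head = node;
      this.tail = node;
      this.nodeForVertex.set(v, node);
    }
  }
  // Splices subPath, a circuit that starts and ends at `vertex`,
  // right after one occurrence of `vertex`: O(1) pointer surgery.
  insertAtVertex(vertex, subPath) {
    const node = this.nodeForVertex.get(vertex);
    subPath.tail.next = node.next; // trailing `vertex` reconnects
    node.next = subPath.head.next; // drop subPath's leading `vertex`
    if (subPath.tail.next === null) this.tail = subPath.tail;
    // Registering subPath's occurrences costs O(|subPath|), which
    // amortizes to O(|E|) over the whole algorithm.
    for (const [v, n] of subPath.nodeForVertex) this.nodeForVertex.set(v, n);
  }
}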

Directed Graphs

We can extend the definition of Eulerian graphs to directed graphs. Let G(V, A) be a strongly connected directed graph, where V is the set of vertices and A the set of directed edges (arcs), and where (v, w) indicates a directed edge from v to w. The Eulerian circuit problem for a directed graph consists in finding a directed circuit that traverses every edge of this graph exactly once, or deciding that no such circuit exists.

It’s possible to show that such a circuit exists if and only if the strongly connected directed graph has, for each vertex v, the same in-degree and out-degree. The algorithm is essentially the same.

Counting Eulerian Circuits in directed graphs

It’s possible to count the number of different Eulerian circuits in a directed graph. According to the BEST theorem (named after de Bruijn, van Aardenne-Ehrenfest, Smith and Tutte) [3], the number of Eulerian circuits in a directed graph can be given by [4]:

ec(G) = t_w(G) \prod_{v \in V}(deg(v) - 1)! (1)

Where deg(v) represents the in-degree (or out-degree) of a vertex v, and t_w(G) is the number of arborescences rooted at a vertex w (simply put, an arborescence is the analogue of a spanning tree for a directed graph, where we can only include edges directed away from the root).

It’s possible to show that t_w(G) is the same for every vertex w if G is Eulerian. We can compute t_w(G) via the Matrix-Tree theorem [2], which says t_w(G) is equal to the determinant of the Laplacian of G with the row and column of vertex w removed. Let’s try to understand the idea behind equation (1).

The mapping from an arborescence to an Eulerian circuit can be made as follows. Let r be the root of a possible arborescence of G. Now, let r be the reference starting point for an Eulerian circuit in G (note this is just for reference, since a circuit has no starting point).

We say that an Eulerian circuit is associated with a given arborescence if, for each vertex v, the last edge we take out of v, say (v, v’), belongs to the arborescence. This is clearer with an example. Consider the digraph from Figure 2; here we’ll consider the arborescences rooted at A.

Figure 2: Directed Graph

This graph has 2 possible arborescences, depicted on the left in Figures 3 and 4. In Figure 3, we can see that the edge (B, D) has to be visited before (B, C), because (B, C) is in the arborescence.

Figure 3: One of the arborescences of G and a corresponding Eulerian circuit

Now, in Figure 4, because it’s (B, D) that’s in the arborescence, it has to be visited after we visit (B, C).

Figure 4: Another arborescence of G and a corresponding Eulerian circuit

Note that there can be more than one Eulerian circuit for a given arborescence. If B had more out-edges, we’d have multiple choices, since the arborescence only specifies the last edge to be taken, not the intermediate ones. More specifically, imagine B had k out-edges. Then we could traverse the first k-1 of them in any order, giving (k - 1)! possibilities.

The same applies to all other nodes. Due to properties of Eulerian circuits, the choice of out-edge at a given node can be seen as independent of the choices at other nodes, so the total number of Eulerian circuits corresponding to any given arborescence is the product of factorials from equation (1), namely:

\prod_{v \in V}(deg(v) - 1)! (2)

The key property of partitioning Eulerian circuits into arborescence classes is that the classes are disjoint, that is, an Eulerian circuit corresponds to exactly one arborescence. This, in conjunction with the fact that the vertex degrees in Equation (2) come from the original graph, and hence are independent of the arborescence, leads us to the two independent factors in equation (1).
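As a sanity check of equation (1), consider a hypothetical digraph made of two directed triangles sharing a vertex a: the edges are (a, b), (b, c), (c, a), (a, d), (d, e), (e, a). Vertex a has in-degree and out-degree 2, while all other vertices have 1. There is a single arborescence rooted at a, since it must use the edges (a, b), (b, c), (a, d), (d, e); hence t_a(G) = 1 and

ec(G) = t_a(G) \prod_{v \in V}(deg(v) - 1)! = 1 \cdot 1! \cdot (0!)^4 = 1

which matches the fact that, up to rotation, the only Eulerian circuit is (a, b, c, a, d, e, a).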

Counting Eulerian Circuits in undirected graphs

Counting Eulerian circuits in undirected graphs is a much harder problem: it is #P-complete. This means that:

  1. It belongs to the #P class, which can informally be seen as the counting version of NP problems. For example: deciding whether a given graph has a Hamiltonian circuit (a circuit that traverses all vertices exactly once) is a problem in NP; counting how many Hamiltonian circuits exist in that graph is the corresponding problem in #P.
  2. It belongs to the #P-hard class, which means that any problem in #P can be reduced to it via a polynomial-time transformation.

Valiant proved the first condition in [5] while Brightwell and Winkler proved the second in [6] by reducing another #P-complete problem (counting Eulerian orientations) to it.

Note that a problem in #P is at least as hard as the corresponding problem in NP, because we can reduce the NP problem to the #P one. For example, we can decide whether a graph has a Hamiltonian circuit (the NP problem) by counting the number of such circuits it has (the #P problem): the answer is “yes” if the count is greater than 0 and “no” otherwise.

Because counting Eulerian circuits in an undirected graph is #P-hard, we can conclude that there’s no efficient (polynomial-time) algorithm to solve it unless P = NP.

Conclusion

In this post we covered Eulerian circuits in an informal way and provided an implementation in JavaScript. I spent quite some time setting up the JavaScript environment to my taste: I strongly prefer using typed JavaScript (with Flow) and ES6 syntax. I decided to write it in JavaScript with the potential to create a step-by-step interactive tool to demonstrate how the algorithm works.

I was familiar with the concept of Eulerian circuits, but I didn’t remember the algorithms to solve it, even though I was exposed to one of them in the past. It was a good learning experience to write the code from scratch to really understand what I was doing.

This was the first time I came across the #P complexity class. It’s always nice to learn about new theory when digging further into a specific topic.

References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[2] Matrix-Tree Theorem for Directed Graphs – Margoliash, J.
[3] Circuits and trees in oriented linear graphs – Aardenne-Ehrenfest, van T., Bruijn, de N.G.
[4] Wikipedia – BEST Theorem
[5] The complexity of computing the permanent – L. G. Valiant
[6] Counting Eulerian circuits is #P-complete – Brightwell, G. and Winkler, P.

The PageRank algorithm

In this post we’ll discuss the PageRank algorithm. The problem it tries to solve is ranking web pages by their relevance. The general idea is that links to a page are a good indicator of its relevance. It models the entire web as a giant directed graph, where links are arcs, performs random walks following the links, and assigns a score to each page based on the probability of landing on it.

This algorithm was developed by Sergey Brin and Larry Page during their PhDs at Stanford. Besides the obvious reference to ranking of pages, PageRank also plays with the last name of Larry [1].

Theory

The basic modeling of the web as a graph G = (V, A) is simple. The set of vertices V is composed of the web pages (documents), and an arc (u, v) \in A exists if the page corresponding to u has a link to the page corresponding to v. Let N = |V| be the number of nodes.

The rank of each node represents the probability of being at that node at a given time if we perform a random walk on the graph. We can start at any node with equal probability, so the rank of each node starts as rank(i) = 1/N.

At each node, we move to one of the adjacent nodes with equal probability, so this process is essentially a Markov chain. We can represent the rank of a given node v at time t through the following equation:

(1) Pr(v,t) = \sum_{(u, v) \in A} \dfrac{Pr(u, t-1)}{d(u)}

Where d(u) is the out-degree of vertex u.

One limitation of this simple approach is that nodes without outbound links “drain” all the probability: once we arrive at such a node, there’s no way out, so as time passes the probability of being trapped in one of these nodes goes to 1.

A simple solution is to add a “teleport” probability: at each node, with small probability we jump to a node chosen uniformly at random from the whole graph. We control the chance of a teleport happening through the parameter \beta (a teleport happens with probability 1 - \beta). More formally, the PageRank equation is now the following:

(2) Pr(v,t) = \beta \left(\sum_{(u, v) \in A} \dfrac{Pr(u, t-1)}{d(u)} \right) + \dfrac{(1 - \beta)}{N}

Matricial form

Let M be the adjacency matrix corresponding to the set of arcs A, that is, M is an N \times N matrix with M_{(i, j)} = 1 if, and only if, (i, j) \in A. Let M' be M with its values normalized by the sum of each row. That is,

M'_{(i, j)} = \dfrac{M_{(i, j)}}{d(i)}

We can then rewrite (1) as:

(3) Pr(t) = {M'}^T Pr(t - 1)

and (2) as:

(4) Pr(t) = \beta {M'}^T Pr(t - 1) + \dfrac{(1 - \beta)}{N} \vec{1}

where \vec{1} is a column vector of 1’s. Let’s simplify even further. Since for each iteration t, Pr is a probability distribution (that is, \sum_{v \in V} Pr(v, t) = 1), we can rewrite (2) as:

(5) Pr(v,t) = \beta \left(\sum_{(u, v) \in A} \dfrac{Pr(u, t-1)}{d(u)} \right) + \dfrac{(1 - \beta)}{N} \left(\sum_{u \in V} Pr(u, t-1) \right)

We can now define \hat{M} = \beta {M'}^T + \dfrac{(1 - \beta) E}{N}

where E is an N \times N matrix of all ones. With that in mind, we can rewrite (5) simply as:

(6) Pr(t) = \hat{M} Pr(t-1)

In the next section we’ll show that \hat{M} has some interesting properties that guarantee the PageRank converges to a final value. Using that information, we’ll be able to iterate until |Pr(t) - Pr(t - 1)| < \epsilon.
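For concreteness, here is an illustrative sketch of this power iteration in JavaScript (not part of the original analysis), folding in the teleport term as in equation (4) instead of materializing the dense matrix \hat{M}; it assumes every node has at least one outbound link:

// Power iteration for PageRank. `outLinks[u]` lists the nodes that
// page u links to; beta and epsilon are hypothetical defaults.
function pageRank(outLinks, beta = 0.85, epsilon = 1e-8) {
  const N = outLinks.length;
  let rank = new Array(N).fill(1 / N); // Pr(0) is uniform
  while (true) {
    // Teleport term: (1 - beta) / N for every node.
    const next = new Array(N).fill((1 - beta) / N);
    // Link-following term of equation (2).
    for (let u = 0; u < N; u++) {
      for (const v of outLinks[u]) {
        next[v] += (beta * rank[u]) / outLinks[u].length;
      }
    }
    // Iterate until |Pr(t) - Pr(t-1)| < epsilon (L1 norm).
    const diff = next.reduce((s, p, i) => s + Math.abs(p - rank[i]), 0);
    rank = next;
    if (diff < epsilon) return rank;
  }
}

// Example: page 0 links to 1, page 1 to 0 and 2, page 2 to 0.
// pageRank([[1], [0, 2], [0]]);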

Properties of the transition matrix

\hat{M} is stochastic. A matrix is said to be (row or column) stochastic if each of its rows or columns sums to 1. Each column of \hat{M} has d(i) entries coming from {M'}^T that sum to 1, multiplied by \beta, plus N entries equal to \frac{(1 - \beta)}{N}, summing to 1 - \beta; the column total is thus \beta + (1 - \beta) = 1.

\hat{M} is irreducible. A matrix is reducible if it can be transformed into a block upper-triangular matrix by row/column permutations. In the context of Markov chains, the transition matrix is irreducible if there is a non-zero probability of transitioning (even if in more than one step) from any state to any other state [3].

Alternatively, we can show a matrix M is irreducible by the following procedure [2]: construct a directed graph G(M) where (i,j) \in A iff M_{(i,j)} > 0. Then M is irreducible iff G(M) is strongly connected. Because we’re adding the teleport links, G(\hat{M}) is a complete graph (with self-arcs) and clearly strongly connected.

\hat{M} is aperiodic. The period of a state is the greatest common divisor of the lengths of all cycles passing through it in the graph G(M) constructed above; for an irreducible matrix all states have the same period, and a matrix with period 1 is said to be aperiodic. Since the teleport links include self-arcs (i.e. there’s a chance of staying in the same node), there’s always a cycle of length 1, so the GCD is 1 and \hat{M} is aperiodic.

\hat{M} is positive recurrent. Let f_{ij}^{(m)} be the probability that node j is first visited at time m if we start from state i. Let \mu_{ij} be the expected value of m for given i and j.

We say a matrix is recurrent if \sum_{m=1}^{\infty} f_{ij}^{(m)} = 1 for all i, j, that is, we’re sure every node will eventually be visited no matter where we start. A matrix is positive recurrent if \mu_{ii} < \infty, that is, the expected number of steps for a random walk to return to its starting state is finite.

If a Markov chain has a finite set of states (i.e. the vertices of the graph) and its transition matrix is irreducible, then we can show this transition matrix is also positive recurrent [4].

With these properties, we're ready to throw in the Fundamental theorem of Markov Chains [5]:

For any irreducible, aperiodic, positive-recurrent Markov chain, there’s a unique stationary distribution \pi such that

\pi = \hat{M} \pi

This also means the final rank is an eigenvector with a corresponding eigenvalue \lambda = 1, since

\lambda Pr =  \hat{M} Pr

A similar but more general result is the Perron–Frobenius theorem [2].

Conclusion

I’ve been watching the Mining Massive Data Sets course on Coursera. It hasn’t finished yet, but I found the idea of PageRank pretty interesting from a mathematical point of view. I didn’t do any actual coding here, but I’m planning to revisit this subject when I learn more about the MapReduce paradigm.

References

[1] Wikipedia: PageRank
[2] Wikipedia: Perron Frobenius Theorem
[3] Wikipedia: Irreducibility (mathematics)
[4] Mathematics Stack Exchange: Irreducible, finite Markov chains are positive recurrent
[5] The Fundamental Theorem of Markov Chains: Aaron Plavnick

Supervised Machine Learning

Andrew Ng is a Computer Science professor at Stanford and one of the co-founders of the Coursera platform. He recently joined the Chinese company Baidu as chief scientist.

His main research focus is on Machine Learning, Artificial Intelligence and Deep Learning. He teaches the Machine Learning class on Coursera, which lasts 10 weeks and is comprised of videos, multiple-choice assignments and programming assignments (in Octave).

Professor Ng is a very good lecturer and makes the content accessible by relying more on intuition than on rigor, while always mentioning that a given result is backed by mathematical proofs. The material is very practical, and he often discusses real-world techniques and applications.

Throughout the classes I got curious about the math behind many of the results he presented, so this post is an attempt to work through some of it, as well as to provide an overview of the content of the course.

Supervised Learning

Suppose we have a set of data points, each encoding information about a set of features, and suppose that for each of these vectors we are also given a value or label representing the desired output. We call this set a training set.

We want to find a black box that encodes some structure from this data and is able to predict the output of a new vector with certain accuracy.

It’s called supervised because we require human assistance in collecting and labeling the training set. We’ll now discuss some methods for performing supervised learning, specifically: linear regression, logistic regression, neural networks and support vector machines.

Linear regression

We are given a set X of m data points, where x_i \in R^n, and a set Y of corresponding values, where y_i \in R.

Linear regression consists in finding a hyperplane in R^n, which we can represent by y = \Theta^{T} x (here the constant term is \Theta_0 and we assume x_0 = 1), such that the sum of the squared differences between y_i and the value of the hyperplane at x_i, that is \Theta^{T} x_i, for each i=1, \cdots, m, is minimized.

More formally, we can define a function:

f_{X,Y}(\Theta) = \sum_{i=1}^{m} (y_i - \Theta^{T}x_i)^2

We want to fit a hyperplane that minimizes f. We can do so by finding the optimal parameters \Theta^{*}:

\Theta^{*} = \arg \min_{\Theta} f_{X, Y}(\Theta).

This linear regression with the least square cost function is also known as linear least squares.

We can solve this problem numerically, because it’s possible to show that f(\Theta) is convex. One common method is gradient descent (we used this method before for solving Lagrangian relaxation).
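For concreteness, a sketch of batch gradient descent for this cost function (illustrative only; the step size alpha and the iteration count are hypothetical knobs, and the gradient is averaged over the m examples). Each x_i is assumed to start with the constant feature x_0 = 1:

// Batch gradient descent for linear least squares.
function linearRegressionGD(X, y, alpha = 0.01, iterations = 1000) {
  const m = X.length;
  const n = X[0].length;
  let theta = new Array(n).fill(0);
  for (let it = 0; it < iterations; it++) {
    const grad = new Array(n).fill(0);
    for (let i = 0; i < m; i++) {
      // Residual of the i-th example: Theta^T x_i - y_i.
      const pred = X[i].reduce((s, xj, j) => s + theta[j] * xj, 0);
      const err = pred - y[i];
      // Gradient of the squared error, averaged over the m examples.
      for (let j = 0; j < n; j++) grad[j] += (2 / m) * err * X[i][j];
    }
    theta = theta.map((t, j) => t - alpha * grad[j]);
  }
  return theta;
}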

We can also find the optimal \Theta^{*} analytically, using what is known as the normal equation:

(X^T X) \Theta^{*} = X^T y

This equation comes from the fact that a differentiable convex function attains its minimum where its gradient is zero. In our case, \dfrac{\partial f_{X, Y}}{\partial \Theta}(\Theta^{*}) = 0.

The assumption is that our data lies approximately on a hyperplane, but in reality it might be closer to, say, a quadratic curve. We can then create a new feature that is a power of one of the original features, so that this curve can be represented as a hyperplane in a higher dimension.

This method returns a real number corresponding to the estimated value of y, but many real-world problems are classification problems, most commonly binary (yes/no), for which linear regression doesn’t work very well. We’ll now discuss another method, called logistic regression.

Logistic regression

In logistic regression we have y_i \in \{0, 1\}, and the idea is that, once trained, our model returns the probability that a new input belongs to the class y=1.

Here we still train our model to obtain \Theta and project a new input onto the corresponding hyperplane to obtain a numerical value, but this time we map this value to the range [0, 1] by applying the sigmoid function, so it becomes

h_{\Theta}(x^{(i)}) = \dfrac{1}{1 + e^{-\Theta^T x^{(i)}}}

This represents the probability that given x^{(i)} parametrized by \Theta, the output will be 1, that is, p(y^{(i)}=1|x^{(i)}; \Theta). The probability of y being 0 is the complementary, that is

p(y^{(i)}=0|x^{(i)}; \Theta) = 1 - p(y^{(i)}=1|x^{(i)}; \Theta)

The likelihood of the parameter \Theta to fit the data can be given by

(1) \ell(\Theta) = \prod_{i=1}^{m} (h_{\Theta}(x^{(i)}))^{y^{(i)}} (1 - h_{\Theta}(x^{(i)}))^{({1 - y^{(i)}})}

We want to find the maximum likelihood estimate (MLE), that is, the parameter \Theta that maximizes (1). We can equivalently work with \log \ell(\Theta), because the optimal \Theta is the same for both; negating it gives us the cost function to minimize:

(2) - \left[\sum_{i=1}^{m} y^{(i)} \log (h_{\Theta}(x^{(i)})) + (1 - y^{(i)})\log(1 - h_{\Theta}(x^{(i)}))\right]

This function doesn’t admit a closed-form solution, so we need a numerical algorithm, such as gradient descent. It’s possible to show this cost function is convex, so gradient descent converges to the global optimum.

After estimating the parameter \Theta^{*}, we can classify an input point x by:

y = \left\{ \begin{array}{ll}   1 & \mbox{if } h_{\Theta^{*}}(x) \ge 0.5 \\  0 & \mbox{otherwise}  \end{array}  \right.
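A small sketch of this decision rule (illustrative; as before, x is assumed to carry the constant feature x_0 = 1):

const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// Classify an input x given a trained theta. Note sigmoid(z) >= 0.5
// exactly when z >= 0.
function classify(theta, x) {
  const z = theta.reduce((s, t, j) => s + t * x[j], 0); // Theta^T x
  return sigmoid(z) >= 0.5 ? 1 : 0;
}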

Neural Networks

Neural networks can be seen as a generalization of the simple form of logistic regression seen above, in which we have one output (y) that is a combination of several inputs (x_j). We can use a diagram to visualize it. More specifically, we could have a directed graph like the following:

Single Layer Network

Here we have nodes representing the input data and the output, and each edge has a corresponding parameter \Theta. We can partition this graph into layers, in this case the input layer and the output layer. To generalize this, we can add intermediate layers, known as hidden layers:

Multi-layer network

In the general case, we have L layers. Between layers \ell and \ell+1 we have a complete bipartite graph, and each edge (i,j) connecting node i at level \ell to node j at level \ell+1 is associated with a parameter \Theta_{ij}^{(\ell)}. The idea at each node is the same: perform a linear combination of the values of the nodes feeding into it and then apply the sigmoid function. Note that each layer has one node that is not connected to any of the previous layers; it acts as the constant term for the nodes in the following layer. The last node represents the final value of the function for the input x.

There’s a neat algorithm for training this model, called the back-propagation algorithm.

Forward-propagation is the method for calculating the output of each of the nodes. The idea is the following: for each level \ell we’ll use variables a and z.

Initialize a:

a^{(0)} = x

For the following layers \ell=1, \cdots, L we do:

\begin{array}{ll}   z^{(\ell)} & = (\Theta^{(\ell-1)})^{T} a^{(\ell-1)}\\  a^{(\ell)} & = g(z^{(\ell)})  \end{array}

For the simple case of binary classification, the last layer contains a single node, and after running the above procedure it will contain the hypothesis value for an input x:

h_{\Theta}(x) = a^{(L)}
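A sketch of this forward pass (illustrative only; each Theta[l] is assumed to be the matrix between two consecutive layers, with each row holding the weights into one unit of the next layer, bias weight first):

const g = (z) => 1 / (1 + Math.exp(-z)); // the sigmoid function

// Forward propagation: Theta is an array of weight matrices, one per
// layer transition.
function forward(Theta, x) {
  let a = x; // a^{(0)} = x
  for (const W of Theta) {
    const input = [1, ...a]; // prepend the bias unit
    // z^{(l)} = (Theta^{(l-1)})^T a^{(l-1)}, a^{(l)} = g(z^{(l)})
    a = W.map((row) => g(row.reduce((s, wj, j) => s + wj * input[j], 0)));
  }
  return a; // a^{(L)} = h_Theta(x)
}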

Then, we can define the cost function J to be the average of (2) for the training set.

J(\Theta) = - \dfrac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log (h_{\Theta}(x^{(i)})) + (1 - y^{(i)})\log(1 - h_{\Theta}(x^{(i)}))\right]

Back-propagation is the algorithm to find the derivatives of the cost function with respect to \Theta. In this procedure, we use the helper variables \delta, which we initialize as:

\delta^{(L)} = a^{(L)} - y^{(i)}

We iterate over the layers backwards \ell = L-1, \cdots, 1:

\delta^{(\ell)} = (\Theta^{(\ell)})^{T} \delta^{(\ell+1)} .* g'(z^{(\ell)})

\delta is defined in such a way that

\dfrac{\partial J(\Theta)}{\partial \Theta_{j,k}^{(\ell)}} = \dfrac{1}{m} \sum_{i=1}^{m} a_{k}^{(\ell)} \delta_{j}^{(\ell+1)}

For a sketch of the derivation of these equations, see the Appendix.

Support Vector Machine

Support Vector Machine is a method for binary classification, and it does so by defining a boundary that separates points from different classes. For this model, our labels are y \in \{-1, 1\} (as opposed to y \in \{0, 1\}).

In the simple case, where we perform a linear separation, it consists in finding a hyperplane that partitions the space into two regions, one containing the points with y=-1 and the other the points with y=1, while maximizing the minimum distance from any point to this hyperplane.

Finding a separating hyperplane (Wikipedia)

Instead of working with the parameters \Theta, we’ll have w and b, so we can represent our hyperplane as w^Tx + b = 0. The classification function is given by:

h_{w,b}(x) = g(w^Tx + b)

Where g is the sign function, that is, g(z) = 1 if z \ge 0 and g(z) = -1 otherwise.

Margin

For each training example x^{(i)}, we can define a margin as:

\gamma^{(i)} = y^{(i)} (w^Tx^{(i)} + b)

Intuitively, the functional margin represents the confidence that point x^{(i)} belongs to the class y^{(i)}. Given m input points, we define the margin of the training set as

(3) \gamma = \min_{ i = 1, \cdots, m} \gamma^{(i)}

We then want to maximize the margin, so we can write the following optimization problem:

\max \gamma

s.t. y^{(i)} (w^Tx^{(i)} + b) \ge \gamma

The constraint enforces (3). The problem with this model is that we can scale w and b arbitrarily without changing h_{w,b}(x), so the optimal solution is unbounded.

The alternative is to normalize by dividing by ||w||, so the margin becomes indifferent to scaling.

Primal form

It’s possible to show that, in order to maximize the margin, we can solve the following program with linear constraints and a quadratic objective function:

\min \frac{1}{2} ||w||^2

s.t. y^{(i)}(w^Tx^{(i)} + b) \ge 1, for i = 1, \cdots, m

Dual form

For non-linear programs, it’s possible to make use of Lagrange multipliers to obtain the dual model. In particular, the following model, obtained using this idea, satisfies together with the primal form the Karush–Kuhn–Tucker (KKT) conditions, so the optimal values of both problems coincide.

\max_{\alpha} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)}y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle

s.t.
\begin{array}{llr}   \alpha_i & \ge 0 & i=1,\cdots,m \\  \sum_{i=1}^{m} \alpha_i y^{(i)} & = 0  \end{array}

It’s possible to show that the non-zero entries of \alpha correspond to the input points that define the margin. We call these points support vectors.

Moreover, w can be obtained by the following equation:

w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}

Non-linear SVM: Kernels

A kernel is a way to generate features from the training set. For example, we can generate m features from the m points in the training set by letting feature f_i be the similarity of a point to the i-th training point. One common similarity function is the Gaussian kernel. Given points x^{(i)} and x^{(j)}, we can compute it by:

K_{\mbox{gaussian}}(x^{(i)}, x^{(j)}) = \exp \left(-\dfrac{||x^{(i)} - x^{(j)}||^2}{2\sigma^2}\right)
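An illustrative sketch of this similarity function, where sigma is the bandwidth parameter:

// Gaussian (RBF) similarity between two feature vectors.
function gaussianKernel(xi, xj, sigma) {
  const sqDist = xi.reduce((s, v, k) => s + (v - xj[k]) ** 2, 0);
  return Math.exp(-sqDist / (2 * sigma * sigma));
}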

Conclusion

I completed an Artificial Intelligence class in college and previously took the PGM (Probabilistic Graphical Models) course. One of the main reasons I enrolled in this course was the topic of Support Vector Machines.

Unfortunately, SVMs were discussed at a high level and, apart from kernels, the presentation was pretty confusing and mixed up with logistic regression. SVMs turned out to be a very complicated topic, and I ended up learning a lot about them while writing this post, though I still don’t fully understand all the details.

References

[1] Wikipedia: Back-propagation
[2] Prof. Cosma Shalizi’s notes – Logistic Regression
[3] Prof. Andrew Ng’s notes – Support Vector Machine
[4] Wikipedia: Support Vector Machine
[5] An Idiot’s guide to Support Vector Machines

Appendix: Derivation of the backpropagation relations

In this appendix we’ll calculate the partial derivative of J(\Theta) with respect to a component \Theta_{ij}^{(\ell)} to get some insight into the back-propagation algorithm. For simplicity, we’ll assume a single input x^{(i)}, so we can get rid of the summation.

By the chain rule we have:

\dfrac{\partial J(\Theta)}{\partial \Theta_{j, k}^{(\ell)}} =   \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot  \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} \cdot  \dfrac{\partial z^{(\ell)}}{\partial \Theta_{j,k}^{(\ell)}}

Let’s analyze each of the terms:

For the term \dfrac{\partial z^{(\ell)}}{\partial \Theta_{j,k}^{(\ell)}}, we have that z^{(\ell)} = (\Theta^{(\ell)})^{T} a^{(\ell-1)}, or breaking it down by elements:

z^{(\ell)}_{k'} = \sum_{j=1}^{n^{(\ell-1)}} \Theta^{(\ell)}_{j,k'} a^{(\ell-1)}_j

For k=k', it’s easy to see that \dfrac{\partial z^{(\ell)}_k}{\partial \Theta_{j,k}^{(\ell)}} = a^{(\ell-1)}_j. For k \ne k',

\dfrac{\partial z^{(\ell)}_{k'}}{\partial \Theta_{j,k}^{(\ell)}} = 0

For the term \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}}, we have that a^{(\ell)} = g(z^{(\ell)}), thus, \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} = g'(z^{(\ell)})

Finally, for the term \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}}, we will assume that J(\Theta) \approx \dfrac{1}{2} (a^{(L)} - y)^2

If \ell = L, then a^{(\ell)} = h_\Theta(x), so the derivative is straightforward:

\dfrac{\partial J(\Theta)}{\partial a^{(L)}} = (h_\Theta(x) - y)

For the intermediate layers, we can find the derivative by writing it in terms of subsequent layers and then applying the chain rule. Inductively, suppose we know \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} (the base case is \ell = L) and we want to find \dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}}. We can write:

(4) \dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}} = \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot \dfrac{\partial a^{(\ell)}}{\partial a^{(\ell - 1)}}

Given that a^{(\ell)} = g(z^{(\ell)}) = g((\Theta^{(\ell-1)})^T a^{(\ell - 1)}), we have:

\dfrac{\partial a^{(\ell)}}{\partial a^{(\ell - 1)}} =   \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} \cdot   \dfrac{\partial z^{(\ell)}}{\partial a^{(\ell - 1)}} =     (\Theta^{(\ell - 1)})^T \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}}

Replacing back in (4):

\dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}} =     \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot     (\Theta^{(\ell -1)})^T  \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}}    =    (\Theta^{(\ell -1)})^T  \dfrac{\partial J(\Theta)}{\partial z^{(\ell)}}

If we name \delta^{(\ell)} = \dfrac{\partial J(\Theta)}{\partial z^{(\ell)}}, we now have

(5) \dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}} =     (\Theta^{(\ell -1)})^T \delta^{(\ell)}

“Undoing” one of the chain steps, we have

\dfrac{\partial J(\Theta)}{\partial z^{(\ell-1)}} =     \dfrac{\partial J(\Theta)}{\partial a^{(\ell-1)}} \cdot  \dfrac{\partial a^{(\ell-1)}}{\partial z^{(\ell-1)}}

Replacing (5) here,

=    (\Theta^{(\ell -1)})^T \delta^{(\ell)}  \dfrac{\partial a^{(\ell-1)}}{\partial z^{(\ell-1)}}

With that, we can get the following recurrence:

\delta^{(\ell-1)} = (\Theta^{(\ell-1)})^T \delta^{(\ell)} g'(z^{(\ell - 1)})

Finally, replacing the terms in the original equation gives us

\dfrac{\partial J(\Theta)}{\partial \Theta_{j, k}^{(\ell)}} =   \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot  \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} \cdot  \dfrac{\partial z^{(\ell)}}{\partial \Theta_{j,k}^{(\ell)}}     =    a_j^{(\ell)} \delta_k^{(\ell+1)}

Lawler and an Introduction to Matroids

Eugene Lawler was an American computer scientist, a professor at UC Berkeley and one of the founders of the field of Combinatorial Optimization.

Lawler made important contributions to branch and bound algorithms and dynamic programming, and was also the first to observe that matroid intersection can be done in polynomial time.

He also proved that two of Karp’s 21 NP-complete problems, Directed Hamiltonian Cycle and 3-Dimensional Matching, are NP-complete.

This is the first post in our series on matroids in the context of Combinatorial Optimization. It will be mainly based on the chapter “Matroids and the Greedy Algorithm” from Combinatorial Optimization – Networks and Matroids, by Eugene Lawler.


Introduction

Hassler Whitney developed matroids in 1935 in the context of algebraic theory, and they were later applied by Jack Edmonds in the context of combinatorial optimization.

Matroids are a structure that allows us to solve problems by always taking locally optimal steps, with the guarantee that we’ll reach the global optimum. These types of algorithms are known as greedy algorithms.

First we’ll define matroids and give some examples of problems that can be modeled as matroids. Next, we’ll introduce weighted matroids, describe a generic algorithm to solve them, and show how this algorithm, applied to matroids in graphs, is actually the famous Kruskal algorithm.

Definition

Let E be a set of elements and \cal I a family of subsets of E (family here means a set of elements that share some property; it’s also clearer than saying ‘set of subsets’). Let M = (E, \cal I) and consider the following properties:

1) \emptyset \in \cal I and if a subset I \in \cal I, then all subsets of I belong to \cal I as well.

2) If I_p \in {\cal I} with |I_p| = p and I_{p+1} \in {\cal I} with |I_{p+1}| = p+1, then there is an element e \in I_{p+1} \setminus I_{p} such that I_p + e \in {\cal I}. (Henceforth, by abuse of notation, when we say I + e we mean I \cup \{e\}.)

If M satisfies both (1) and (2), we say that M is a matroid.

Terminology. An independent set is any set I belonging to the family \cal I. If no other set in \cal I contains I, we say it’s a maximal independent set. Note that by property (2), all maximal independent sets have the same size.

The rank of a set E, denoted by r(E), is the size of the largest independent subset of E. A minimal dependent set is called a circuit.

Special types of matroids

We can model some structures as matroids and take advantage of matroids properties to find solutions to problems involving these structures. In this section we’ll present a few examples of such modelings.

The modeling process consists in defining the set E and the family \cal I (usually by defining the property that subsets of E must satisfy to belong to it), and then proving that \cal I satisfies (1) and (2).

Matric Matroid.

A matric matroid is a matroid in the context of matrices. Given a matrix A with set of columns E, if \cal I is the family of subsets of E containing only linearly independent columns, then M = (E, {\cal I}) is a matric matroid.

Proof. It’s easy to see that any subset of a linearly independent (LI) set of columns is also LI, so (1) is straightforward. For (2), we need to prove that given subsets I_{p+1} and I_{p} of p+1 and p LI columns, respectively, there must be a column c from I_{p+1} \setminus I_{p} such that I_{p} + c is still LI. If that were not the case, every column of I_{p+1} could be written as a linear combination of the p LI columns of I_{p}, so I_{p+1} would span a space of dimension at most p, contradicting the fact that its p+1 columns are LI.

Graphic Matroid.

A graphic matroid is a matroid in the context of graphs. Given a graph G = (V, E), if \cal I is the family of subsets of edges that do not form a cycle, then M = (E, {\cal I}) is a graphic matroid.

Proof. It’s possible to show that subsets of edges that are cycle-free correspond to LI columns in the incidence matrix of G, so we can also view M as a matric matroid of the incidence matrix of G.

Matching Matroid.

A matching matroid is a matroid in the context of matchings in graphs.

Let G = (V, E) be an undirected graph and let \cal I be the family of subsets of nodes I \subseteq V such that there’s a matching in G that covers all nodes in I (we say that a node in I is covered if there’s an edge in the matching incident to it, even if the other end of the edge is not in I). Then M = (V, {\cal I}) is a matching matroid.

Sketch of the proof. It’s easy to verify that \cal I satisfies (1). To see why it satisfies (2), take the symmetric difference between matchings covering p+1 and p vertices, respectively; it contains at least one alternating path such that flipping its alternation increases the number of matched vertices.

Weighted Matroids

Let M = (E, {\cal I}) be a matroid and w a weight function for elements in E. The problem we want to solve is to find the independent set I \in {\cal I} that has maximum weight, where the weight of a set is the sum of the weight of its elements.

Let’s assume the elements on each independent set are listed in the non-increasing order of weight, so given sets I_1 and I_2, we list them as

I_1 = \{a_1, a_2, \cdots, a_m\} and I_2 = \{b_1, b_2, \cdots, b_n\}

such that w(a_1) \ge w(a_2) \ge \cdots \ge w(a_m) and w(b_1) \ge w(b_2) \ge \cdots \ge w(b_n)

We say I_1 is lexicographically greater than I_2 if, at the first position where their components differ, say index k, we have w(a_k) > w(b_k), or if I_2’s listing is a proper prefix of I_1’s (that is, the same way we compare strings).

A set that is not lexicographically less than any other set is said to be lexicographically maximum. We can show such a set is also a maximal independent set, because otherwise, by property (2), we could add more elements to it, making it lexicographically greater.

The following Theorem from Rado and Edmonds states an important property of weighted matroids:

Theorem 1. Let M = (E, {\cal I}) be a matroid. Then

3) For any non-negative weight function on E, a lexicographically maximum set in \cal I is also the set with the maximum weight.

Conversely, given E and \cal I, if (1) and (3) are satisfied, M = (E, {\cal I}) is a matroid.

We say that a set B \in {\cal I} is Gale optimal if its components are not less than the corresponding components of any other set I \in {\cal I} when both are listed in non-increasing order of weight. More formally, for every I \in {\cal I} there exists a one-to-one mapping h:I \rightarrow B such that w(e) \le w(h(e)) for all e \in I.

Note that a Gale optimal set is clearly lexicographically maximum and thus has maximum weight.

Given that, Gale’s theorem provides a stronger result than Theorem 1 regarding weighted matroids:

Theorem 2. Let M = (E, {\cal I}) be a matroid. Then

4) For any weight function on the elements of E, there exists a Gale optimal set B \in {\cal I}.

Conversely, given E and \cal I, if (1) and (4) are satisfied, M = (E, {\cal I}) is a matroid.

Weighted Matroid Greedy Algorithm. Property (4) allows us to use a simple greedy algorithm to find the set with the largest weight in \cal I.

In the first step, we look for the single-element set I = \{e_0\} in \cal I with the largest weight. By properties (2) and (4), we can show that the Gale optimal set contains e_0. Next, we look for the element e_1 with the largest weight such that \{e_0, e_1\} \in {\cal I}. Again, we can show that these elements are contained in the Gale optimal set. We repeat this until we get a maximal independent set, which is also the Gale optimal set.

More generically, we have the following algorithm:

Let S be our current solution, which starts as the empty set. At each step, we look for the element e with maximum weight not in S such that S + e belongs to \cal I, and add it to S.
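A sketch of this generic procedure (illustrative; isIndependent stands for whatever independence test the particular matroid provides):

// Generic weighted-matroid greedy algorithm. `elements` is E,
// `weight` gives each element's weight, and `isIndependent` is an
// independence oracle for the family I.
function matroidGreedy(elements, weight, isIndependent) {
  const sorted = [...elements].sort((a, b) => weight(b) - weight(a));
  const S = [];
  for (const e of sorted) {
    if (isIndependent([...S, e])) S.push(e); // keep e if S + e is in I
  }
  return S;
}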

The Maximum Spanning Tree problem and the Kruskal Algorithm.

In the maximum spanning tree problem we are given a connected, undirected graph G = (V,E) with non-negative weights w on E, and we want to find a spanning tree (a subset of E that forms a tree covering all vertices) with the largest total weight.

Recall that a tree is a connected graph without cycles. The graphic matroid represents the cycle-free subsets of edges (forests), and thus a maximal independent set of a connected graph is a spanning tree. If we assign non-negative weights to the edges, we can find the Gale optimal independent set, which corresponds to the maximum spanning tree.

The Kruskal algorithm is then basically an application of the weighted matroid greedy algorithm to graphic matroids: it first sorts the edges in non-increasing order of weight and then adds an edge to the current solution if it doesn’t form a cycle.
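A sketch of Kruskal under this view (illustrative; edges are hypothetical objects {u, v, w}), using union-find as the cycle test:

// Kruskal as the matroid greedy algorithm: sort edges by
// non-increasing weight and use union-find as the cycle check.
function maximumSpanningTree(n, edges) {
  const parent = Array.from({ length: n }, (_, i) => i);
  const find = (x) => (parent[x] === x ? x : (parent[x] = find(parent[x])));
  const tree = [];
  for (const e of [...edges].sort((a, b) => b.w - a.w)) {
    const ru = find(e.u);
    const rv = find(e.v);
    if (ru !== rv) {
      parent[ru] = rv; // merge the two components
      tree.push(e); // e doesn't form a cycle, so S + e is independent
    }
  }
  return tree;
}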

Conclusion

In this post we learned the basics of matroids and weighted matroids. In future posts we’ll study Matroid Intersection, Matroid Partition and other types of matroids like Oriented Matroids.

References

[1] Wikipedia – Eugene Lawler
[2] Combinatorial Optimization – Networks and Matroids, Eugene Lawler – Chapter: Matroids and the Greedy Algorithm