De Bruijn Graphs and Sequences

De_Bruijn

I._J._Good

Nicolaas Govert de Bruijn was a Dutch mathematician, born in the Hague and taught University of Amsterdam and Technical University Eindhoven.

Irving John Good was a British mathematician who worked with Alan Turing, born to a Polish Jewish family in London. De Bruijn and Good independently developed a class of graphs known as de Bruijn graphs, which we’ll explore in this post.

Definition

A de Bruijn graph [1] is a directed graph defined over a dimension n and a set S of m symbols. The set of vertices in this graph corresponds to the m^n possible sequences of symbols with length n (symbols can be repeated).

There’s a directed edge from vertex u to v if the sequence from v can be obtained from u by removing u’s first element and then appending a symbol at the end. For example, if S = {A, B, C, D}, n = 3 and u = ABC, then there’s an edge from ABC to BC*, that is, BCA, BCB, BCC and BCD.

Properties

We can derive some basic properties for de Bruijn graphs.

1) Every vertex has exactly m incoming and m outgoing edges.

We saw from the example above that ABC had edges to any vertex BC*, where * is any of the m symbols in S. Conversely, any sequence in the form *AB can be transformed into ABC, by dropping the first symbol and appending ‘C’.

2) Every de Bruijn graph is Eulerian.

In our last post we discussed about Eulerian graphs and learned that a necessary and sufficient condition for a directed graph to have an Eulerian cycle is that all the vertices in the graph have the same in-degree and out-degree and that it’s strongly connected. The first condition is clearly satisfied given the Property 1) above.

To see that a de Bruijn graph is strongly connected, we just need to note that it’s possible to convert any sequence into another by removing the first character and replacing the last with the appropriate one in at most n steps. For example, given the string ABC, we can convert it to BDD by doing ABC -> BCB -> CBD -> BDD. Since each such step corresponds to traversing an edge in the de Bruijn graph, we can see it’s possible to go from any vertex to another, making the graph strongly connected.

3) A de Bruijn graph over the set of symbols S and dimension n is the line graph of the de Bruijn graph over set S and dimension n – 1

A line graph of a given graph G has vertices corresponding to edges in G, and there are edges between two vertices if the corresponding edges in G share a vertex. More formally, let G = (V, E) be an undirected graph. The line graph of G, denoted by L(G) has a set of vertex V’ corresponding to E.  Let u’, v’ be two vertices from V’, corresponding to edges e1 and e2 in E. There’s an edge between u’ and v’ if e1 and e2 have one vertex in common.

It’s possible to generalize this to directed graphs by changing the definition of edges slightly: let u’, v’ be two vertices from V’, corresponding to the directed edges e1 = (a, b) and e2 = (c, d) in E. Then there’s a directed edge from u’ to v’ if and only if b = c.

We can gain an intuition on Property 3 by looking at an example with set S = {0, 1} and constructing a de Bruijn graph with n = 2 from one with n = 1. In Figure 1, the vertices from n = 2 are the labeled edges of n = 1. The edges in n = 2 correspond to the directed paths of length 2 in n = 1. We highlighted in red one of such paths. In n = 1, the path is given by (0, 1) and (1, 1), which became (01, 11) in n = 2.

de-bruijn

Figure 1: Constructing a De Bruijn graph over  symbols{0, 1} and dimension n = 2 from one with dimension n = 1

4) Every de Bruijn graph is Hamiltonian

This follows from Properties 2 and 3. We claim that an Eulerian cycle in a De Bruijn graph in dimension n is a Hamiltonian path in dimension n + 1. That’s because we visit every edge exactly once and each edge corresponds to a vertex in the graph in dimension n + 1. Given two consecutive edges in the Eulerian cycle in dimension n, (u, v) and (v, w), from Property 3 we know that there’s an edge from the corresponding vertex (u, v)’ to vertex (v, w)’ in dimension n + 1.

De Bruijn Sequence

The de Bruijn sequence of dimension n on a set of symbols S, denoted B(S, n), is a cyclic sequence in which every possible sequences of length n appears as substring. The length of such sequence is |S|^n.

Since |S|^n is also the number of distinct sequences of length n, we can conclude that this sequence is the shortest possible. To see why, let B be a de Bruijn sequence. We can assign an index p to each sequence s of length n based on where it appears in B such that the substring B[p, p + n – 1] represents s. Since each of the |S|^n sequences are distinct, they cannot have the same index p. Hence, there must be at least |S|^n indexes, and thus B must be at least that long.

It’s possible to construct a de Bruijn sequence B(S, n) from the Hamiltonian path of a de Bruijn graph over S and dimension n. Two adjacent nodes in the Hamiltonian path share n-1 symbols, so if we start with a vertex v, each new vertex in the path only adds one symbol. It would have a total of n + (|S|^n – 1), but since the last n-1 symbols of the sequence overlap with the beginning when we wrap in a cycle, the cyclic sequence has length |S|^n.

Note that we can construct an Hamiltonian cycle for a de Bruijn graph in polynomial time because it’s equivalent to the Eulerian path in one dimension below. Hence we have a polynomial time algorithm to construct the de Bruijn sequence.

Applications

Cracking Locks

A de Bruijn sequence can be used to brute-force a lock without an enter key, that is, one that opens whenever the last n digits tried are correct. A naive brute force would need to try all |S|^n typing n digits every time, for a total of |S|^n. Using a de Bruijn sequence we would make use of the overlap between trials, and only need to type |S|^n digits in total.

Finding the Least Significant Bit

The other interesting application mentioned in [2] is to determine the index of the least significant bit in an unsigned int (32-bits). The code provided is given by:

Let’s understand what the code above is doing. For now, let’s assume v > 0 and we’ll handle v = 0 as a special case later.

In the code above, (v & -v) has the effect of “isolating” the least significant bit. Since v is unsigned, -v is its two’s complement, that is, we complement the digits of v (~v) and add one. Let p be the position of the least significant digit in v. The bits in positions lower than p will be 1 in ~v, and in position p it’s a 0. When incremented by 1, they’ll turn into 1 in position p and 0 in the lower positions. In the positions higher than p, v and -v will be have complementary bits. When doing a bitwise AND, the only position where both operands have 1 is p, hence it will be the number (1 << p) (or 2^p).

Then we multiply the result above by 0x077CB531U which is the de Bruijn sequence B({0, 1}, 5) in hexadecimal. In binary this is 00000111011111001011010100110001, which is a 32-bit number.  Because v & -v is a power of 2 (2^p), multiplying a number by it is the same as bit-shifting it to the left p positions. Then we shift it to the right by 27 positions, which has the effect of capturing the 5 most significant bits from the resulting multiplication. If we treat the number as a string of characters (note that most significant bits are the first characters), the left shift followed by the right shift is equivalent to selecting a “substring” from position p to p+5.

For example, if p = 13, a left shift on 00000111011111001011010100110001 would result in 10010110101001100010000000000000. Then a right shift of 27, would pick the 5 leftmost bits, 10010. If we treat 00000111011111001011010100110001 as a string, 10010 shows up as a substring 00000111011111001011010100110001 in positions [13, 17].

Since this is a de Bruijn sequence for n = 5, every substring of length 5 corresponds to a unique 5-bit number and conversely every 5-bit number is present in this sequence. Now we just need to keep a map from the 5-bit number we obtained via the bit manipulation to the actual number we wanted, which we store in MultiplyDeBruijnBitPosition. Since 10010 is 18, we’ll have an entry MultiplyDeBruijnBitPosition[18] = 13.

Finally, for the special case where v = 0, we have that v & -v is 0 and the algorithm will return 0.

Assembling DNA Fragments

In [3] Compeau and Pevzner proposed a method to assemble fragments of DNA into its original form. The problem can be modeled as the k-universal circular string problem.

Definition: Consider a list of sequences s_1, s_2, …, s_n, each of which having the same size k, having the property that s_i’ s suffix and s_i+1 ‘ s prefix overlap in k-1 positions. That is, the last k-1 characters in s_i are the same as the first k-1 characters in s_i+1. We are given the sequences in no particular order. The objective is to find a composed string S which is the result of the overlap of s_1, s_2, …, s_n in order.

This problem can be modeled as a de Bruijn graph where each sequence is associated with a vertex. If sequence s_i’s suffix and s_j’s prefix overlap in k-1 positions, we add a directed edge from vertex s_i to s_j. We then find an Hamiltonian path in the de Bruijn graph and the order in which the vertices are visited will give us the desired string.

Variants: The Super-permutation

One variant to the de Bruijn sequence problem is to, instead of finding a universal sequence containing all possible sequences of length n, find one containing all the permutations of the symbols in S. Instead of the |S|^ n sequences as input, we’ll have |S|! sequences. This is know as the Super-permutation problem.

For example, for S = {1, 2}, it want to find a sequence containing: 12 and 21. The sequence 121 is a possible solution. For S = {1, 2, 3}, we have now 123, 132, 213, 231, 312 and 321. The shortest 123121321. John Carlos Baez tweets about this problem in [4]. Finding the shortest sequence that includes all permutations is an open problem!

We know optimal solution for n up to 5. The best known lower bound for this problem is n! + (n−1)! + (n−2)! + n − 3 while the upper bound is n! + (n−1)! + (n−2)! + (n−3)! + n − 3 [5].

Conclusion

In this post I was mainly interested in learning more about de Bruijn graphs after reading about them in Bioinformatics Algorithms by Compeau and Pevzner [3]. I ended up learning about de Bruijn sequences and realized that the problem was similar to one I read about recently on John’s Twitter. It was a nice coincidence.

References

[1] Wikipedia: De Bruijn graph
[2] Wikipedia: De Bruijn sequence
[3] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[4] Twitter: John Carlos Baez
[5] Wikipedia: Superpermutation

Advertisements

US as an hexagonal map

In this post we’ll study a way to visualize maps in a hexagonal grid, in which each entity have uniform area. We’ll then model that as a mathematical problem.

One challenge in displaying data in maps is that larger area countries or states tend to get more attention than smaller ones, even when economically or population-wise, the smaller state is more relevant (e.g. New Jersey vs. Alaska). One idea is to normalize the areas of all the states by using symbols such as squares. Recently I ran into a NPR map that used hexagons and it looked very neat, so I decided to try building it in D3 and perform some analysis.

Below is the result of plotting the state populations (log scale):

US Hexmap: Population (log scale)

US Hexmap: Population (log scale)

One important property of visualizing data in maps is familiarity of the location (you can easily find specific states because you remember where they are) and also adjacency patterns can provide insights. For example, if we plot a measure as a choropleth map and see that the West coast is colored differently from the Midwest, then we gain an insight we wouldn’t have by looking at a column chart for example.

Because of this, ideally the homogeneous area maps should preserve adjacencies as much as possible. With that in mind, we can come up with a similarity score. Let X be the set of pairs of states that share a border in the actual US map. Now, let Y be the set of pairs of states that share a border in the hexagonal map (that is, two hexagons sharing a side). The similarity score is the size of their symmetric difference and we can normalize by the size of the original:

(|X - Y| + |Y - X|) / |X|

The lower the score the better. In an ideal case, the borders sets would match perfectly for a score of 0.

The size of the symmetric difference between the two sets seems like a good measure for similarity, but I’m not sure about the normalization factor. I initially picked the size of the union of X and Y, but this wouldn’t let us model this problem as a linear programming model as we’ll see next. The problem with using the size of X is that the score could theoretically be larger than 1, but it’s trivial to place the hexagons in the grid in such a way that none of them are touching and thus Y is empty, so we can assume the score is between 0 and 1.

Hexgrid coordinates convention

Hexgrid coordinates convention

The score from the NPR maps is 0.67.

An Optimization Problem

Let’s generalize and formalize the problem above as follows: given a graph G = (V,E), and another graph H = (V_H, E_H) representing our grid, find the induced subgraph of H, I = (V_I, E_I), such that there’s bijection f: V \rightarrow V_I and the size of the symmetric difference of f(E) and E_I is minimized (f(E) is an abuse of notation, but it means applying the bijection to each vertex in the set of edges E).

To make it clearer, let’s apply the definition above to the original problem. G represents the adjacency of states in the original map. V is the set of states and E is the set of pairs of states that share a border. H is the hexagonal grid. V_H is the set of all hexagons and E_H is the set of pairs of hexagons that are adjacent. We want to find a subset of the hexagons where we can place each of the states (hence the bijection from states to hexagons) and if two hexagons are in the grid, and we place two states there, we consider the states to be adjacent, hence the need for an induced graph, so the adjacency in the grid is preserved.

Is this general case an NP-hard problem? We can reduce the Graph Isomorphism problem to this one. It consists in deciding whether two graphs A and B are isomorphic. If we set G = A and H = B, then A and B are isomorphic if and only if I = H and the symmetric difference of f(E) and E_I is 0. The problem is that it’s not known whether Graph Isomorphism belongs to NP-Complete.

What if G is planar (which is the case for maps)? I haven’t given much thought about it, but I decided to come up with an integer programming model nevertheless.

An Integer Linear Programming Model

Note: the model uses the original grid analogy instead of the more general problem so that the constraints are easier to understand.

Boolean algebra as linear constraints

Before we start, we need to recall how to model logical constraints (AND, OR and EQUAL) using linear constraints. Let a and b be binary variables. We want x to be the result of logical operators applied to a and b.

For AND, we can do (x = 1 if and only if a = 1 and b = 1)

x \le a
x \le b
x \ge a + b - 1

For OR, we can do (x = 0 if and only if a = 0 and b = 0)

x \ge a
x \ge b
x \le a + b

For EQUAL, we can do (x = 1 if and only if a = b)

x \le 1 - (a - b)
x \le 1 - (b - a)
x \ge a + b - 1
x \ge -(a + b - 1)

We can introduce a notation and assume these constraints can be generated by a function. For example, if we say
x = \mbox{EQ}(a, b), we’re talking about the four constraints we defined above for modeling EQUAL. This is discussed in [2].

Constants

Let G be the set of pairs (x,y) representing the grid positions. Let P be the set of pieces p that have to be placed in the grid. Let N(x,y) be the set of pairs (x',y') that are adjacent to (x, y) in the grid.

Let A_{v1, v2} represent whether v1 and v2 are adjacent to each other in the dataset.

“Physical” constraints

Let b_{x,y,s} be a binary variable, and equals 1 if and only if state s is placed position (x, y).

1) A piece has to be placed in exactly one spot in the grid:

\sum_{(x,y) \in G} b_{x,y,p} = 1 for all p \in P

2) A spot can only be occupied by at most one state:

\sum_s b_{x,y,s} \le 1 for all (x,y) \in G

Adjacency constraints

Let a_{p1, p2, x, y} be a binary variable and equals 1 if and only if piece p1 is placed in (x, y) and adjacent to p2 in the grid.

3) a_{p1, p2, x, y} has to be 0 if p1 is not in (x,y) or p2 is not adjacent to any of (x,y) neighbors:

a_{p1, p2, x, y} = \mbox{AND}(\sum_{(x', y') \in N(x, y)} b_{x', y', p2}, b_{x,y,p})

We have that a_{p1, p2, x, y} is 1 if and only if p1 is in (x,y) and p2 is adjacent to any of (x,y) neighbors.

Finally, we can model the adjacency between two pieces p1 and p2. Let a_{p1, p2} be a binary variable and equals 1 if and only if p1 and p2 are adjacent in the grid:

a_{p1, p2} = \sum_{(x,y) in G} a_{p1, p2, x, y}

Symmetric difference constraints

Let y_{p1, p2} be a binary variable and equals to 1 if and only if a_{p1, p2} \ne A_{p1, p2}.

4) y_{p1, p2} \ge a_{p1, p2} - A_{p1, p2}
5) y_{p1, p2} \ge A_{p1, p2} - a_{p1, p2}

Objective function

The sum of all y‘s is the size of the symmetric difference:

\min \sum_{p1, p2 \in P} y_{p1, p2}.

Practical concerns

This model can be quite big. For our US map example, we have |P| = 50 and we need to estimate the size of the grid. A 50×50 grid is enough for any type of arrangement. The problem is that the number of variables a_{p1, p2, x, y} is |P|^2|G| = 50^4 which is not practical.

We can also solve the problem for individual connected components in the original graph and it’s trivial construct the optimal solution from each optimal sub-solution. This doesn’t help much in our example, since only Hawaii and Alaska are disconnected, so we have |P| = 48. The grid could also be reduced. It’s very unlikely that an optimal solution would be a straight line. In the NPR map, the grid is 8×12. Sounds like doubling these dimensions would give the optimal solution enough room, so |G| = 8*12*4 = 384.

We can also assume states are orderer and we only have variables a_{p1, p2, x, y} for p1 < p2, so the number of a_{p1, p2, x, y} is about 450K. Still too large, unfortunately.

Another important optimization we can do in our case because we're working with a grid is to define the adjacency for x and y independently and combine them afterwards.

Refined adjacency constraints

Instead of working with b_{x,y,s} we use X_{x, s}, and equals 1 if and only if state s is placed position (x, y) for any y and Y_{y, s}, which equals 1 iff state s is placed position (x, y) for any x. The physical constraints are analogous to the previous model:

6) A piece has to be placed in exactly one spot in the grid:

\sum_{x \in G} X_{x,p} = 1 for all p \in P
\sum_{y \in G} Y_{y,p} = 1 for all p \in P

7) A spot can only be occupied by at most one state:

\sum_s X_{xs} \le 1 for all x \in G
\sum_s Y_{y,s} \le 1 for all y \in G

In a hexagonal grid, if we have the piece p1 in position (x,y), it will be adjacent to another piece p2 if and only if p2 is in one of these six positions: 1: (x-1, y), 2: (x+1, y), 3: (x-1, y-1), 4: (x, y-1), 5: (x-1, y+1) or 6: (x, y+1). We can define two adjacency categories: Type I, which happens when p1.y - p2.y = 0 and |p1.x - p2.x| = 1 (cases 1 and 2); and Type II, which is when |p1.y - p2.y| = 1 and p1.x - p2.x \le 0 (cases 3, 4, 5 and 6).

Let’s define Y_{d=0, p1, p2, y} = 1 iff p1.y - p2.y = 0 for a given y. Similarly we define X_{|d|=1, p1, p2, x} = 1 iff |p1.x - p2.x| = 1, Y_{|d|=1, p1, p2, y} = 1 iff |p1.y - p2.y| = 1 and finally X_{d \ge 0, p1, p2, x} = 1 iff p1.x - p2.x \ge 0.

8) We can have the following constraints do model the variables we just defined:

Y_{d=0, p1, p2, y} = \mbox{EQ}(Y_{y, p_1}, Y_{y, p2})
X_{|d|=1, p1, p2, x} = \mbox{EQ}(X_{x, p1}, X_{x-1, p2} + X_{x+1, p2})
Y_{|d|=1, p1, p2, y} = \mbox{EQ}(Y_{y, p1}, Y_{y-1, p2} + Y_{y+1, p2})
X_{d \ge 0, p1, p2, x} = \mbox{EQ}(X_{x, p1}, X_{x, p2} + X_{x+1, p2})

9) Let Y_{d=0, p1, p2} = 1 iff p1.x - p2.y = 0 for any y. We can define analogous variables for the other cases:

Y_{d=0, p1, p2} = \sum_{y} Y_{d=0, p1, p2, y}
X_{|d|=1, p1, p2} = \sum_{x} X_{d=0, p1, p2, x}
Y_{|d|=1, p1, p2} = \sum_{y} Y_{d=0, p1, p2, y}
X_{d \ge 0, p1, p2} = \sum_{x} Y_{d \ge 0, p1, p2, x}

10) Let T'_{p1, p2} = 1 iff p1 and p2 have the Type I adjacency and T''_{p1, p2} = 1 iff p1 and p2 have Type II adjacency:

T'_{p1, p2} = \mbox{AND}(Y_{d=0, p1, p2}, X_{|d|=1, p1, p2})
T''_{p1, p2} = \mbox{AND}(Y_{|d|=1, p1, p2}, X_{d \ge 0, p1, p2})

11) Finally, we say that p1 and p2 are adjacency iff either Type I or Type II occurs:

a_{p1, p2} = \mbox{OR}(T'_{p1, p2}, T''_{p1, p2})

The model for adjacency became much more complicated but we were able to reduce the number of adjacency variables are now roughly O(|P|^2 \sqrt{|G|}). The number of non-zero entries in the right hand size of (which represents the size of the sparse matrix) is roughly 11M, dominated by the type (8) constraints. I’m still not confident this model will run, so I’ll punt on implementing it for now.

Conclusion

In this post we explored a different way to visualize the US states map. I was mainly exploring the idea of how good of an approximation this layout is and a natural question was how to model this as an optimization problem. Turns out that if we model it using graphs, the problem definition is pretty simple and happens to be a more general version of the Graph Isomorphism problem.

I struggled with coming up with an integer programming model and couldn’t find one with a manageable size, but it was a fun exercise!

World Map?

One cannot help wondering if we can display the countries in a hexagonal map. I’m planning to explore this idea in a future post. The main challenge is that the US states areas are more uniform than the countries. For example, the largest state (Alaska) is 430 times larger than the smallest (Rhode Island). While the largest country (Russia) is almost 40,000,000 bigger than the smallest (Vatican City).

Also, the layout of the US map was devised by someone from NPR and they did a manual process. I’m wondering if we can come up with a simple heuristic to place the hexagons and then perform manual adjustments.

References

[1] NPR Visuals Team – Let’s Tesselate: Hexagons For Tile Grid Maps
[2] Computer Science: Express boolean logic operations in zero-one integer linear programming (ILP)
[3] SOM – Creating hexagonal heatmaps with D3.js
[4] Github – d3/d3-hexbin

Data sources

[5] US State Borders
[6] Wikipedia – Population of US states and territories
[7] Tool to download Wikipedia tables as CSV
[8] List of US states and territories by area

Lawler and an Introduction to Matroids

lawler

Eugene Lawler was an American computer scientist, professor of UC Berkeley and was one of the founders of the field of Combinatorial Optimization.

Lawler made important contributions on branch and bound algorithms, dynamic programming, and was also the first one to observe that matroid intersection could be done in polynomial time.

He also proved that two of Karp’s 21 NP-Complete problems, The Directed Hamiltonian Cycle and 3-Dimensional Matching were NP-Complete.

This is the first post in our series of Matroids in Combinatorial Optimization context. They will be mainly based on the Matroids and the Greedy Algorithm chapter from Combinatorial Optimization – Networks and Matroids, by Eugene Lawler.


Introduction

Hassler Whitney developed Matroids in 1935 in the context of algebraic theory and it was further applied by Jack Edmonds in the context of combinatorial optimization.

Matroids are a structure that allows us to solve problems by always taking local optimal steps and by doing that there’s the guarantee we’ll reach the global optimum. These types of algorithms are known as greedy algorithms.

First we’ll define Matroids and then give some examples of problems to modeled as matroids. Next, we’ll introduce weighted matroids and describe a generic algorithm to solve them and how such algorithm applied to matroids in graphs is actually the famous Kruskal algorithm.

Definition

Let E be a set of elements and \cal I a family of subsets of E (family here has the meaning of a set of elements that share some properties and it’s also clearer than using ‘set of subsets’). Let M = (E, \cal I) and consider the following properties:

1) \emptyset \in \cal I and if a subset I \in \cal I, then all subsets of I belong to \cal I as well.

2) If I_p \in {\cal I} with |I_p| = p and I_{p+1} \in {\cal I} with |I_{p+1}| = p+1, then there is an element e \in I_{p+1} \setminus I_{p} such that I + e \in {\cal I}. (Henceforth, by abuse of notation, when we say I + e we mean I \cup \{e\}).

If M satisfies both (1) and (2), we say that M is a matroid.

Terminology. An independent set is any set I belonging to the family \cal I. If there’s no other set containing I in \cal I we say it’s a maximal independent set. Note that by property (2), all maximal independent sets have the same size.

The rank of a set E, denoted by r(E) is the largest independent subset of E. A minimal dependent set is called circuit.

Special types of matroids

We can model some structures as matroids and take advantage of matroids properties to find solutions to problems involving these structures. In this section we’ll present a few examples of such modelings.

The modelling process consists in defining the set E, the family \cal I (usually by defining the property that the subsets of E must have to be in there) and then proving that \cal I satisfies (1) and (2).

Matric Matroid.

A matric matroid is a matroid in the context of matrices. Given a matrix A with the set of columns as E, if \cal I is the family of sets containing only linear independent columns, then M = (E, {\cal I}) is a matric matroid.

Proof. It’s easy to see that any subset of a linear independent (LI) set of columns is also LI, so (1) is straightforward. For (2), we need to prove that given subsets I_{p+1} and I_{p} of p+1 and p LI columns, respectively, then there must be a column c from I_{p+1} \setminus I_{p} such that I_{p} + c is still LI. Well, if it’s not the case, it means that each column in I_{p+1} can be written as linear combination of the p LI columns from I_{p}, which contradicts the fact that I_{p+1} is LI.

Graphic Matroid.

A graphic matroid is a matroid in the context of graphs. Given a graph G = (V, E), if \cal I is the family of arcs that do not form a cycle, then M = (E, {\cal I}) is a graphic matroid.

Proof. It’s possible to show that subsets of edges that are cycle free, correspond to LI columns in the incidence matrix of G, so in this case we can also view M as a matric matroid of the incidence matrix of G.

Matching Matroid.

A matching matroid is a matroid in the context of matchings in graphs.

Let G = (V, E) be an undirected graph and let \cal I be the family of subsets of nodes I \subseteq V such that there’s a matching in G that covers all nodes in I (we say that an node in I is covered if there’s an edge in the matching incident to it, even if the other end of the edge is not in I). Then M = (V, {\cal I}) is a matching matroid.

Sketch of the Proof. It’s easy to verify that \cal I satisfies (1). The proof of why it satisfies (2) consists in getting the symmetric difference between matchings covering p+1 and p vertices respectively and showing that in that difference there will be at least one alternating path such that if we change the alternation will increase the number of matched vertices.

Weighted Matroids

Let M = (E, {\cal I}) be a matroid and w a weight function for elements in E. The problem we want to solve is to find the independent set I \in {\cal I} that has maximum weight, where the weight of a set is the sum of the weight of its elements.

Let’s assume the elements on each independent set are listed in the non-increasing order of weight, so given sets I_1 and I_2, we list them as

I_1 = \{a_1, a_2, \cdots, a_m\} and I_2 = \{b_1, b_2, \cdots, b_n\}

such that w(a_1) \ge w(a_2) \ge \cdots w(a_m) and w(b_1) \ge w(b_2) \ge \cdots w(b_n)

We say I_1 is lexicographically greater than I_2 if for the first position their components differ, say at index k, a_k > b_k or in case I_2 listing is a prefix of I_1 listing (that is, the same way we sort strings).

A set that is not lexicographically less than any other set is said to be lexicographically maximum. We can show such set is also a maximum independent set, because otherwise, by property (2), we can always add more elements to it, making it lexicographically greater.

The following Theorem from Rado and Edmonds states an important property of weighted matroids:

Theorem 1. Let M = (E, {\cal I}) be a matroid. Then

3) For any negative weight function on E, a lexicographically maximum set in \cal I is also the set with the maximum weight.

Conversely, given E and \cal I, if (1) and (3) are satisfied, M = (E, {\cal I}) is a matroid.

We say that a set B \in {\cal I} is Gale optimal if all its components are not less than the corresponding components of any other set I \in {\cal I} when these components are listed in non-increasing order. More formally, there exists a one-to-one mapping h:I \rightarrow B such that w(e) \le w(h(e)) for all e \in I.

Note that a Gale optimal set is clearly a lexicographically maximum and thus has optimal weight.

Given that, Gale’s theorem provides another a stronger result than Theorem 1 regarding weighted matroids:

Theorem 2. Let M = (E, {\cal I}) be a matroid. Then

4) For any weight function of elements on E, there exists a Gale optimal set B \in {\cal I}.

Conversely, given E and \cal I, if (1) and (4) are satisfied, M = (E, {\cal I}) is a matroid.

Weighted Matroid Greedy Algorithm. Property (4) allows to use a simple greedy algorithm to find the set of the largest weight in \cal I.

In the first step, we look for the single-element set I = \{e_0\} in \cal I with the largest weight. By property (2) and (4), we can show that the Gale optimal set contains e_0. Next, we look for the largest element e_1 such that \{e_0, e_1\} \in {\cal I}. Again, we can show that such elements are contained in the Gale optimal set. We repeat this until we get a maximum independent set, which is also the Gale optimal set.

More generically, we have the following algorithm:

Let S be our current solution, which starts as the empty set. At each step, we look for the element e with maximum weight not in S such that S + e belongs to \cal I.

The Maximum Spanning Tree problem and the Kruskal Algorithm.

In the maximum spanning tree problem we are given a connected, undirected graph G = (V,E) with non-negative weights w on V and we want to find a spanning tree (subset of E that forms a tree) with the lowest cost.

Recall that a tree is a connected graph without cycles. The graphic matroid represents cycle-free subsets of arcs (or a forest) and thus the maximum independent set of a connected graph is a tree. If we assign non-negative weights to arcs, we can find the optimal Gale independent set which corresponds to the maximum spanning tree.

The Kruskal algorithm is then basically an application of the weighted matroid greedy algorithm for graphic matroids: It first sorts the edges by non-increasing order of weight and then adds an edge to the current solution if it doesn’t form a cycle.

Conclusion

In this post we learned the basics of matroids and weighted matroids. In future posts we’ll study Matroid Intersection, Matroid Partition and other types of matroids like Oriented Matroids.

References

[1] Wikipedia – Eugene Lawler
[2] Combinatorial Optimization – Networks and Matroids, Eugene Lawler – Chapter: Matroids and the Greedy Algorithm

Network Matrices

Introduction

In this post we’re going to write about a special case of the \{0, \pm 1\} matrix, which is called network matrix. They play an important role in our study of totally unimodular matrices.

We’ll first define a network matrix and provide an example. We’ll later show some properties and then describe a polynomial-time algorithm to recognize network matrices.

Definition

Network matrices were first defined by William Tutte (1965) [1]. The definition is the following:

Let D = (V, A) be a directed graph and T = (V, A_0) be a directed tree on the vertex set V. Let M be the matrix A_0 \times A-matrix defined by a = (v, w) \in A and a' \in A_0.

M_{a',a} = \left\{ \begin{array}{ll}   +1 & \mbox{if the path } v\mbox{-}w \in T \mbox{ passes through}\\     & a \mbox{ in the same direction.}\\  -1 & \mbox{if the path } v\mbox{-}w \in T \mbox{ passes through}\\     & a \mbox{ in the opposite direction}.\\  0  & \mbox{if does not pass through } a \mbox{ at all.}  \end{array} \right.

Matrices with this structure are called network matrices.

Example. In Figure 1 we see a sample digraph G(V,A) and a given directed tree T on the vertex set V. The corresponding network matrix M represented by G and T is given on Table 1.

F

Figure 1. Directed graph G(V,A) and a directed tree T on V

For each arc (u,v) \in A, we have a column in M representing the path from u to v in T. The signs indicate the direction of the edges in the path.

Table 1. Network Matrix of G and T

Table 1. Network Matrix of G and T

Properties

Let M be a network matrix, represented by the digraph G and the directed tree T.

1) Multiplying a row of M by -1 is equivalent to inverting the direction of the corresponding edge in T.

2) Multiplying a column of M by -1 is the same as inverting the direction of the corresponding edge in G.

3) Deleting a row of M corresponding to edge (i, j) of T, is the same as shrinking vertex i and j into a single vertex.

4) Deleting a column of M is equivalent to removing the corresponding edge from G.

Combining properties (3) and (4), we have that:

5) A submatrix of a network matrix is a network matrix as well.

The following theorem is the most important property about Network Matrices, that we’ll explore in the forthcoming posts about Integer Programming Theory:

Theorem 1. Networks matrices are Totally Unimodular.

Recognizing Network Matrices

There is a polynomial-time algorithm for deciding whether a matrix M is a network matrix. Without loss of generality we restrict ourselves to \{0, \pm 1\} matrices, since all network matrices are of this type and we can easily find out whether a matrix has such property.

We now divide it in two cases. Case 1: for all columns of M, it has at most two non-zero entries; Case 2: the remaining types, that is, M has at least one column with three or more non-zero entries.

Case 1. Let’s describe the algorithm for the first case. Let M be a m \times n, \{0,\pm 1\} matrix with at most 2 non-zero entries.

If M has, for each column, at most one entry 1 and at most on entry -1, we can show it’s a network matrix by constructing a digraph G and a tree T such that their corresponding network matrix is M. We first build a direct star T with m vertices and with all edges pointing towards the center vertex v^*. We now build the digraph G with the same vertex set. For each column in M that has an entry +1 and -1 for rows (u,v^*) and (w,v^*) we add an edge (u,w). For columns having a single entry for row (u, v^*), we add the (u,v^*) to G if it’s +1 or (v^*,u) if it’s -1.

The problem is that even in the restrictive Case 1, we may have both entries with the same signal. From property (1) though, we can multiply some rows by -1 and try to reach the case above. The question is then: can we split the rows into sets R_1 and R_2, in such a way that if we multiply rows in R_2 by -1, we have at most one entry +1 and at most one entry -1 for all columns?

This question can be answered by a neat reduction to problem of deciding wether a graph is bipartite. Let G_R be a graph with vertex set corresponding to each row of M plus some artificial vertices. We add an edge (i, j) if the corresponding rows have the same signal for some column. We add edges (i, v_{ij}^*) and (j, v_{ij}^*) if the have different signs. We then try to split this graph into two partitions R_1 and R_2.

The idea is that if such a partitioning exists, vertex with different signs will be in the same partition because they share a common vertex and vertex with the same signs must be on different partitions. If we now multiply all rows in R_2 by -1, we’ll have our property.

Conversely, if no partition exists, it’s possible to show (see Example 1, from this post) that such matrix is not TU. Since by Theorem 1 all network matrices are TU, we conclude that this matrix is also not a network matrix. Summarizing we have that

Observation 1. The matrix M from the Case 1 is a network matrix if and only if its graph G_R define as above is bipartite.

Case 2. Now let’s concentrate on the more general case, where the matrix has at least one column with 3 or more non-zero entries.

For each row i of M, we define a graph G_i with vertex set \{1, \cdots, m\} \setminus \{i\}. There’s an edge between j and k if exists any column in M such that it has non-zero entries for j and k, but 0 for i. We have the following observation:

Observation 2. If M is a network matrix, there exists a row i for which G_i is disconnected.

The basic idea in understanding this observation is that since there is at least one column in M with three non-entries, there must be a path in T that connects two vertices in G with length at least 3. There is some edge i from the middle of this path that is not the first nor the last edge, henceforth denoted as j and k. If we remove this edge, we will split it into two components. It’s then easy to see that any path in the tree that has j and k needs to go through i. In turn, this means that there is no edge between the corresponding vertices in G_i.

From Observation 2, we can conclude that if a given matrix has G_i for all possible columns i, then M is not a network matrix.

So we can now suppose our candidate matrix has a disconnected G_1 (we can assume i = 1 without loss of generality. Let C_1, \cdots, C_p be the connected components of G_1.

We define

* W := as the set of column indexes for which the first row of M has non-zero entries;

* W_i := W \cap the set of column indexes for which the i-th row of M has non-zero entries;

* U_k := \bigcup \{W_i | i\in C_k\}

Now, we build a graph H with vertex set \{C_1, \cdots, C_p\}, with an edge (C_k, C_\ell) if the following conditions are met:

* \exists i \in C_k: U_\ell \not \subseteq W_i and U_\ell \cap W_i \neq \emptyset

* \exists j \in C_\ell: U_k \not \subseteq W_j and U_k \cap W_j \neq \emptyset

and let M_k be the submatrix formed by the rows of M corresponding to vertices in C_k.

We now can state the following Theorem:

Theorem 2. M is a Network Matrix if and only if: H is bipartite and M_k is a network matrix for k=1, \cdots, p.

Complexity

Theorem 3. The algorithm above to detect whether a given matrix M is a network matrix has polynomial-time complexity.

1. Check if has only entries in \{0, \pm 1\}

2. Test if has at most two non-zero entries for each column (Case 1 or Case 2)

3. Case 1: Construct the graph G_R and check whether it is bipartite.

4. Case 2: For each column i, construct graph G_i and check if it is not connected. In this case, we build the graph H and build each of the submatrices M_k.

If we assume the matrix is m \times n, (1) and (2) can be performed in linear size of the matrix, O(mn). We can construct G_R in O(mn) as well and test for bipartiteness in linear time on the number of vertices plus the edges of G_R, which are both O(n + m), so (3) is bounded by O(mn) too.

For step (4), we need to generate the graph G_i. Its edges are bounded by mn and we can decide it’s connected in linear time of its size. Doing this for all n columns we a total complexity of O(mn^2).

For the recursive complexity, we can relax our complexity for Steps 1 to 4 to be O(n^tm^t) for a fixed constant t \ge 2. By induction, we assume our total complexity for a given level of the recursion is O(n^{t+1}m^t).

The matrix M_k has m_k rows and n columns, can be solved, by induction, in O(m_k^{t+1}n^t). Summing up all the complexities we have:

O(m^tn^t) + O(m_1^{t+1}n^t) + O(m_2^{t+1}n^t) + \cdots + O(m_p^{t+1}n^t) which is O(m^{t+1}n^t)

The base is when we do steps 1 to 4 without building any submatrices, which we saw it is O(m^tn^t).

Conclusion

The main points to be taken from this post is that network matrices are totally unimodular and that they can be recognized in polynomial-time.

We provided explanations about the two observations, but we left out the proofs of the Theorems 1 and 2, which are quite long and complicated, and thus out of the scope of the post. Nevertheless, they can be found on [1].

References

[1] Theory of Linear and Integer Programming – A. Schrijver (Chapter 19)

Totally Unimodular Matrices

An unimodular matrix is a square matrix with integer entries such that its determinant is either -1, 0 or 1. A matrix is said totally unimodular (TU for short) if all its square submatrices are unimodular.

Sometime ago, we said that problems such as the minimum path, maximum flow and minimum cost max flow can be modeled using linear programming with the interesting property that the optimal solutions are always integer.

In that post, we also said that it was because the coefficient matrix of the constraints of the formulations are totally unimodular. More formally, we have the following theorem:

Theorem 1. Let A be a totally unimodular matrix and b be an integer vector. Then, the polyhedra P = \{x | Ax \le b\} has integer vertices.

In this post we’ll present some properties of TU matrices and discuss about two simple examples.

Properties

Let A be a totally unimodular matrix. Then we have the following properties:

  1. Its transpose, A^{T}, is TU.
  2. The matrix obtained by appending the identity matrix to A, that is [A, I], is TU
  3. Any submatrix of A is TU
  4. The matrix obtained by multiplying any row of A by -1 is TU
  5. The matrix obtained by duplicating any row of A is TU

Using this properties we can get some Corollaries from Theorem 1.

Since [A, -I] is TU, we have the following

Corollary 1. The polytope P = \{x | Ax \le b; x \ge 0 \} has integer vertices.

Also, since [A^T, -A^T, I, -I] is TU,

Corollary 2. The dual of P = \{c^Tx | Ax \le b; x \ge 0 \}, namely Q = \{b^Ty | A^Ty \ge c; y \ge 0\} has also integer vertices.

Examples

1. Bipartite Graphs

Let G = (V, E) be an undirected graph and M the incidence matrix of G. That is, a binary matrix where each line corresponds to a vertex v and each column to an edge e. We have M_{v,e} = 1 if v is an endpoint of e or M_{v,e} = 0 otherwise. Then, we have the following result:

Theorem 2. The incidence matrix of a graph G is totally unimodular if and only if, G is bipartite.

This result can be used to derive the König-Egerváry theorem, stating that the maximum cardinality matching and the minimum vertex cover have the same value bipartite graphs.

The maximum cardinality can be modeled as integer linear programming:

\max \sum_{e \in E} y_e

\begin{array}{llclr}    & \sum_{e = (u, v)} y_e & \le & 1 & \forall v \in V\\   & y_e & \in & \{0, 1\} & \forall e \in E  \end{array}

And its dual is the minimum vertex cover:

\min \sum_{v \in V} x_v

\begin{array}{llclr}    & x_u + x_v  & \ge & 1 & \forall (u,v) \in E\\   & x_v  & \le & \{0, 1\} & \forall v \in V  \end{array}

It’s not hard to see that if M is the incidence matrix of the graph, then the problems can be stated as

(1) \max \{1y | My \le 1; y \mbox{ binary} \} and

(2) \min \{x1 | xM \ge 1; x \mbox{ binary} \}

If the graph is bipartite, we can use Theorem 2 and the strong duality for linear programs to conclude that (1) = (2).

2. Directed Graphs

Let D = (V, A) a directed graph, and M be the incidence matrix of D. That is, a matrix where each line corresponds to a vertex and each column to an arc. For each arc e = (u, v), we have M_{u, e} = -1 and M_{v, e} = 1 and 0 otherwise. For directed graphs, we have a even stronger result:

Theorem 3. The incidence matrix of a directed graph D is totally modular.

Consider a network represented by D and with capacities represented by c : A \rightarrow \mathbb{R}_{+}. For each directed edge (ij) \in A, let x_{ij} be the flow in this edge.

If M is the incidence matrix of D, then Mx = 0 corresponds to

\begin{array}{llclr}   (3) & \sum_{(iv) \in A} x_{iv} & = & \sum_{(vi) \in A} x_{vi} & \forall v \in V\\  \end{array}

which are the flow conservation constraints. Since M is totally unimodular, then if c is integer, it’s possible to show that the polytope \{x| 0 \le x \le c; Mx = 0\} has integral vertices. This polytope also represents the constraints of the max circulation problem.

Now, we can use these observations to prove that the following LP formulation for the max-flow problem has an optimal integer solution:

\max \sum_{(si) \in A} x_{si}

Subject to:

\begin{array}{llclr}   & \sum_{(iv) \in A} x_{iv} & = & \sum_{(vi) \in A} x_{vi} & \forall v \in V \setminus \{s, t\}\\  0 \le & x_{ij} & \le & c_{ij} & \forall (ij) \in A\\  \end{array}

We can see that the constraints matrix of the above formulation is a submatrix of the max circulation problem and by Property 3, it’s also TU, which in turn means the corresponding polytope has integral vertices.

Conclusion

In this post, we introduced the concept of total unimodular matrices and presented two simple examples: the incidence matrix of a bipartite graph and the incidence matrix of a directed graph.

Here’s a cool chart of common polynomial time solvable problems organized by their generality [2].

In future posts, we’ll keep exploring this subject by studying other examples and properties of TU matrices.

References

[1] Theory of Linear and Integer Programming – A. Schrijver
[2] Marco Chiarandini – Scheduling, Timetabling and Routing DM204 (Lecture Notes)
[3] Ted Ralphs – Integer Programming IE418 (Lecture Notes)