# Sockets

In this post we’ll discuss sockets, more precisely Network Sockets. It’s based on two excellent articles from Brian “Beej Jorgensen” Hall. We’ll first cover some basic OS and Network concepts and then go over some important functions needed to create a simple socket server that can handle requests from multiple clients.

This post assumes basic familiarity with the Linux operating system and the C programming language.

## OS Basics

File Descriptors

A file descriptor is basically a numerical identifier (id) to a lookup table a given process contains. It is used to model not only files, but also things like stdin (special id = 0), stdout (id = 1) and stderr (id = 2). Sockets are also represented as file descriptors as we’ll see later.

Fork

fork() is a system call the current process can use to generate copies processes (known as children), that run the same program. One observation is that the child process gets a copy of the parent’s data.

This will be important for our example where the main process uses fork() to generate children to handle connections. The child needs to inherit some of the file descriptors from the parent.

fork() returns 0 if it’s the current process executing the code or a non-zero value corresponding to the process id (pid) of the child. A common way to use fork() is the following:

The return value can be used to distinguish between a parent and child and hence we can have them execute different code.

Signal

Signals are one of the simplest ways to communicate with a process. We just send a code that the process knows how to handle. There are OS-level signals like SIGKILL and SIGTERM and user-defined such as SIGUSR1. It’s possible to override how a process handles specific signals via sigaction() which take a structure that points to a handler:

## Network Basics

Beej describes the Layered Network Model but follows with this interesting observation:

Now, this model is so general you could probably use it as an automobile repair guide if you really wanted to. A layered model more consistent with Unix might be:

• Application Layer (telnet, ftp, etc.)
• Host-to-Host Transport Layer (TCP, UDP)
• Internet Layer (IP and routing)
• Network Access Layer (Ethernet, wi-fi, or whatever)

In this model, sockets are in the application layer since it relies on TCP or UDP on top of IP. We’ll go over these 3 things next.

IP the Internet Protocol

The guide discusses some details of the IP, including the IPv4 and IPv6 distinction and different address types. The flags AF_INET and AF_INET6 are part of the socket API and associated to these 2 types.

One interesting detail is the byte order (Little-Endian and Big-Endian): while each computer systems can represent data in different ways, the order is standardized for the internet, and is Big-Endian. This is also known as Network Byte Order.

UDP and TCP

The User Datagram Protocol is connectionless (stateless) and provides no guarantees on the order of the datagrams, their delivery or that duplicates are avoided. The messages sent via UDP are known as datagrams.

The Transmission Control Protocol relies on a connection (via a 3-way handshake), guarantees order and perform retries. The messages sent via TCP are known as data stream.

The socket types SOCK_STREAM and SOCK_DGRAM are associated to the TCP and UDP protocols respectively.

## Socket API

Before we proceed with our example, we’ll cover the C functions in sys/socket.h that correspond to the Linux socket APIs:

getaddrinfo() is a relatively high-level function that is capable of resolving an address (e.g. google.com) to an actual IP and port. An example of use is (fullcode):

The variable servinfo is of type addrinfo, which is a node of a linked list:

getaddrinfo() returns a list of such values, any that match the criteria from the input parameters. In the Beej’s client code, we’ll see iterates over that list until it finds a set of parameters that it can connect to.

socket() returns a file descriptor. It’s basically creating a register with a given ID in a table and it returns that identifier we’ll use to establish a connection later. The interface is as follows:

int socket(int domain, int type, int protocol);

• domain could be one of AF_INET and AF_INET6 (IPv4 / PIv6)
• type is one of SOCK_STREAM or SOCK_DGRAM (TCP / UDP)
• protocol could be one of PF_INET and PF_INET6

In practice AF_INET is the same as PF_INET. Beej says:

This PF_INET thing is a close relative of the AF_INET (…) they’re so closely related that they actually have the same value (…), it was thought that maybe an address family (what the “AF” in “AF_INET” stands for) might support several protocols that were referred to by their protocol family (what the “PF” in “PF_INET” stands for). That didn’t happen.

More conveniently we can also use the results of getaddrinfo() to fill these for us:

bind() binds a socket to a specific hostname and port. It is not always needed (for example in the client case). The client doesn’t usually care which port is used on its own side, so it can let the OS choose. For the server case it’s important because it defines the IP address and port the socket will listen to.

This information can be more easily provided via the struct returned from getaddrinfo(). By providing null to getaddrinfo() and AI_PASSIVE to ai_flags, we’ll have this function fill the IP in res for us:

connect() is the function a client can use to indicate the desire to establish a connection with a given server. Similarly to bind() we can use the results from getaddrinfo():

Note how we didn’t need to bind(). connect() filled the local host and a random port for us.

listen() is the function a server calls to define how many connections it can listen to at a specific socket. Beej’s article mentions a system limit of 20, but using 5 to 10 seems to work in practice.

accept() is the function that is actually connects with a specific client. So that the original socket can keep listening to other incoming connections, accept() returns a new socket descriptor which will be used to send and receive messages.

send() / recv() are used to send and receive messages via the established connection. One important aspect is that while you specify the size of the data being sent, the API does not guarantee it will send the whole data, so you need to write a loop to make sure all data is  sent/received.

## Example

The high-level sequence of calls for the API above is:

• getaddrinfo()
• socket()
• bind()
• accept()
• recv()/send()

For the client we have:

• getaddrinfo()
• socket()
• connect()
• send()/recv()

Beej provides examples for server and client codes in C: server.c and client.c.

## Conclusion

Beej has an entire guide dedicated to inter-process communication. The guide covers the basic concepts such as creating new processes via fork() and handling signals; synchronization mechanisms suck as locks and semaphores; and communication mechanisms such as pipes, message queues and sockets. I like the conversational style of his writings.

I probably wrote code using sockets a long time ago. I didn’t have time to dig deep on this subject so I didn’t feel like I learned a ton. It was a good refresher nevertheless.

## References

[1] Beej’s Guide to Unix Interprocess Communication
[2] Beej’s Guide to Network Programming

# De Bruijn Graphs and Sequences

Nicolaas Govert de Bruijn was a Dutch mathematician, born in the Hague and taught University of Amsterdam and Technical University Eindhoven.

Irving John Good was a British mathematician who worked with Alan Turing, born to a Polish Jewish family in London. De Bruijn and Good independently developed a class of graphs known as de Bruijn graphs, which we’ll explore in this post.

## Definition

A de Bruijn graph [1] is a directed graph defined over a dimension n and a set S of m symbols. The set of vertices in this graph corresponds to the m^n possible sequences of symbols with length n (symbols can be repeated).

There’s a directed edge from vertex u to v if the sequence from v can be obtained from u by removing u’s first element and then appending a symbol at the end. For example, if S = {A, B, C, D}, n = 3 and u = ABC, then there’s an edge from ABC to BC*, that is, BCA, BCB, BCC and BCD.

## Properties

We can derive some basic properties for de Bruijn graphs.

1) Every vertex has exactly m incoming and m outgoing edges.

We saw from the example above that ABC had edges to any vertex BC*, where * is any of the m symbols in S. Conversely, any sequence in the form *AB can be transformed into ABC, by dropping the first symbol and appending ‘C’.

2) Every de Bruijn graph is Eulerian.

In our last post we discussed about Eulerian graphs and learned that a necessary and sufficient condition for a directed graph to have an Eulerian cycle is that all the vertices in the graph have the same in-degree and out-degree and that it’s strongly connected. The first condition is clearly satisfied given the Property 1) above.

To see that a de Bruijn graph is strongly connected, we just need to note that it’s possible to convert any sequence into another by removing the first character and replacing the last with the appropriate one in at most n steps. For example, given the string ABC, we can convert it to BDD by doing ABC -> BCB -> CBD -> BDD. Since each such step corresponds to traversing an edge in the de Bruijn graph, we can see it’s possible to go from any vertex to another, making the graph strongly connected.

3) A de Bruijn graph over the set of symbols S and dimension n is the line graph of the de Bruijn graph over set S and dimension n – 1

A line graph of a given graph G has vertices corresponding to edges in G, and there are edges between two vertices if the corresponding edges in G share a vertex. More formally, let G = (V, E) be an undirected graph. The line graph of G, denoted by L(G) has a set of vertex V’ corresponding to E.  Let u’, v’ be two vertices from V’, corresponding to edges e1 and e2 in E. There’s an edge between u’ and v’ if e1 and e2 have one vertex in common.

It’s possible to generalize this to directed graphs by changing the definition of edges slightly: let u’, v’ be two vertices from V’, corresponding to the directed edges e1 = (a, b) and e2 = (c, d) in E. Then there’s a directed edge from u’ to v’ if and only if b = c.

We can gain an intuition on Property 3 by looking at an example with set S = {0, 1} and constructing a de Bruijn graph with n = 2 from one with n = 1. In Figure 1, the vertices from n = 2 are the labeled edges of n = 1. The edges in n = 2 correspond to the directed paths of length 2 in n = 1. We highlighted in red one of such paths. In n = 1, the path is given by (0, 1) and (1, 1), which became (01, 11) in n = 2.

Figure 1: Constructing a De Bruijn graph over  symbols{0, 1} and dimension n = 2 from one with dimension n = 1

4) Every de Bruijn graph is Hamiltonian

This follows from Properties 2 and 3. We claim that an Eulerian cycle in a De Bruijn graph in dimension n is a Hamiltonian path in dimension n + 1. That’s because we visit every edge exactly once and each edge corresponds to a vertex in the graph in dimension n + 1. Given two consecutive edges in the Eulerian cycle in dimension n, (u, v) and (v, w), from Property 3 we know that there’s an edge from the corresponding vertex (u, v)’ to vertex (v, w)’ in dimension n + 1.

## De Bruijn Sequence

The de Bruijn sequence of dimension n on a set of symbols S, denoted B(S, n), is a cyclic sequence in which every possible sequences of length n appears as substring. The length of such sequence is |S|^n.

Since |S|^n is also the number of distinct sequences of length n, we can conclude that this sequence is the shortest possible. To see why, let B be a de Bruijn sequence. We can assign an index p to each sequence s of length n based on where it appears in B such that the substring B[p, p + n – 1] represents s. Since each of the |S|^n sequences are distinct, they cannot have the same index p. Hence, there must be at least |S|^n indexes, and thus B must be at least that long.

It’s possible to construct a de Bruijn sequence B(S, n) from the Hamiltonian path of a de Bruijn graph over S and dimension n. Two adjacent nodes in the Hamiltonian path share n-1 symbols, so if we start with a vertex v, each new vertex in the path only adds one symbol. It would have a total of n + (|S|^n – 1), but since the last n-1 symbols of the sequence overlap with the beginning when we wrap in a cycle, the cyclic sequence has length |S|^n.

Note that we can construct an Hamiltonian cycle for a de Bruijn graph in polynomial time because it’s equivalent to the Eulerian path in one dimension below. Hence we have a polynomial time algorithm to construct the de Bruijn sequence.

## Applications

#### Cracking Locks

A de Bruijn sequence can be used to brute-force a lock without an enter key, that is, one that opens whenever the last n digits tried are correct. A naive brute force would need to try all |S|^n typing n digits every time, for a total of |S|^n. Using a de Bruijn sequence we would make use of the overlap between trials, and only need to type |S|^n digits in total.

#### Finding the Least Significant Bit

The other interesting application mentioned in [2] is to determine the index of the least significant bit in an unsigned int (32-bits). The code provided is given by:

Let’s understand what the code above is doing. For now, let’s assume v > 0 and we’ll handle v = 0 as a special case later.

In the code above, (v & -v) has the effect of “isolating” the least significant bit. Since v is unsigned, -v is its two’s complement, that is, we complement the digits of v (~v) and add one. Let p be the position of the least significant digit in v. The bits in positions lower than p will be 1 in ~v, and in position p it’s a 0. When incremented by 1, they’ll turn into 1 in position p and 0 in the lower positions. In the positions higher than p, v and -v will be have complementary bits. When doing a bitwise AND, the only position where both operands have 1 is p, hence it will be the number (1 << p) (or 2^p).

Then we multiply the result above by 0x077CB531U which is the de Bruijn sequence B({0, 1}, 5) in hexadecimal. In binary this is 00000111011111001011010100110001, which is a 32-bit number.  Because v & -v is a power of 2 (2^p), multiplying a number by it is the same as bit-shifting it to the left p positions. Then we shift it to the right by 27 positions, which has the effect of capturing the 5 most significant bits from the resulting multiplication. If we treat the number as a string of characters (note that most significant bits are the first characters), the left shift followed by the right shift is equivalent to selecting a “substring” from position p to p+5.

For example, if p = 13, a left shift on 00000111011111001011010100110001 would result in 10010110101001100010000000000000. Then a right shift of 27, would pick the 5 leftmost bits, 10010. If we treat 00000111011111001011010100110001 as a string, 10010 shows up as a substring 00000111011111001011010100110001 in positions [13, 17].

Since this is a de Bruijn sequence for n = 5, every substring of length 5 corresponds to a unique 5-bit number and conversely every 5-bit number is present in this sequence. Now we just need to keep a map from the 5-bit number we obtained via the bit manipulation to the actual number we wanted, which we store in MultiplyDeBruijnBitPosition. Since 10010 is 18, we’ll have an entry MultiplyDeBruijnBitPosition[18] = 13.

Finally, for the special case where v = 0, we have that v & -v is 0 and the algorithm will return 0.

#### Assembling DNA Fragments

In [3] Compeau and Pevzner proposed a method to assemble fragments of DNA into its original form. The problem can be modeled as the k-universal circular string problem.

Definition: Consider a list of sequences s_1, s_2, …, s_n, each of which having the same size k, having the property that s_i’ s suffix and s_i+1 ‘ s prefix overlap in k-1 positions. That is, the last k-1 characters in s_i are the same as the first k-1 characters in s_i+1. We are given the sequences in no particular order. The objective is to find a composed string S which is the result of the overlap of s_1, s_2, …, s_n in order.

This problem can be modeled as a de Bruijn graph where each sequence is associated with a vertex. If sequence s_i’s suffix and s_j’s prefix overlap in k-1 positions, we add a directed edge from vertex s_i to s_j. We then find an Hamiltonian path in the de Bruijn graph and the order in which the vertices are visited will give us the desired string.

## Variants: The Super-permutation

One variant to the de Bruijn sequence problem is to, instead of finding a universal sequence containing all possible sequences of length n, find one containing all the permutations of the symbols in S. Instead of the |S|^ n sequences as input, we’ll have |S|! sequences. This is know as the Super-permutation problem.

For example, for S = {1, 2}, it want to find a sequence containing: 12 and 21. The sequence 121 is a possible solution. For S = {1, 2, 3}, we have now 123, 132, 213, 231, 312 and 321. The shortest 123121321. John Carlos Baez tweets about this problem in [4]. Finding the shortest sequence that includes all permutations is an open problem!

We know optimal solution for n up to 5. The best known lower bound for this problem is n! + (n−1)! + (n−2)! + n − 3 while the upper bound is n! + (n−1)! + (n−2)! + (n−3)! + n − 3 [5].

## Conclusion

In this post I was mainly interested in learning more about de Bruijn graphs after reading about them in Bioinformatics Algorithms by Compeau and Pevzner [3]. I ended up learning about de Bruijn sequences and realized that the problem was similar to one I read about recently on John’s Twitter. It was a nice coincidence.

## References

[1] Wikipedia: De Bruijn graph
[2] Wikipedia: De Bruijn sequence
[3] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[5] Wikipedia: Superpermutation

# Flood it! An exact approach

Flood-it is a game created by LabPixies, which was recently aquired by Google.

The game consists of a $n \times n$ board with random colors cells. Let’s call the top-left cell a seed cell and the region connected to the seed cell and with the same color as it, the flooded region. At each round, the player chooses a color for the flooded region which may flood adjacent regions, expanding the flooded region. The final objective is to flood all the board.

Figure 1. Flooding proccess [1]

In the original game, there are three different sizes of boards: 14, 21 or 28. The number of colors is always 6.

A paper from Clifford, Jalsenius, Montanaro and Sach [1], presents theoretical results regarding the general version of this problem, where the size of the board and the number of colors can be unbounded.

In this post we’ll highlight the main results of their paper and present an integer linear programming approach to solve it exactly.

### NP-Hardness

For $C \ge 3$, the game is shown to be NP-hard even if one is allowed to start flooding from an arbitrary cell. The proof given in [1] is based on a reduction from the shortest common supersequence problem (SCS for short).

### Greedy approaches

There are two greedy approaches one might try:

1. Pick the color that maximizes the number of cells covered.

2. The most frequent color in the perimeter of the current region.

In Figure 2, we have an instance where this strategies can be arbitrarily bad [1]. They use $n$ colors while the optimal solution is 3.

Figure 2: A 10×10 board for which the greedy algorithms perform badly [1]

### Approximation algorithms

Surprisingly, the following naïve strategy:

Cycle through all colors until the board is colored.

gives a solution with a value within a factor of the optimal value.

More specifically, if $L$ is the optimal number of movements and $C$ the number of available colors, then this algorithm solves the problem with no more than $C L$ movements. To see why, let $\{c_1, c_2, \cdots, c_L \}$ be the color sequence of the optimal solution. In the worst case, each cycle covers at least one color in the sequence. Thus, this algorithm is $C$ approximated.

This can be refined to $C-1$ observing that the cycle does not need to have the current color of the flooded region.

The authors improve this bound with a randomized algorithm that achieves a $\frac{2c}{3}$ expected factor.

They also give a lower bound, proving that if the number of colors is arbitrary, no polynomial time approximated algorithm with a constant factor can exist unless P=NP.

### Bounds

An upper bound for the number of movements required for solving any $n \times n$ is given by the following theorem:

Theorem 1: There exists a polynomial time algorithm for Flood-It which can flood any n x n board with C colors in at most $2n + (\sqrt{2C})n + C$ moves.

Conversely, we can set an lower bound for a $n \times n$ board:

Theorem 2: For $2 \le C \le n^2$, there exists an $n \times n$ board with (up to) $c$ colors which requires at least $(\sqrt{C - 1})n/2 - C/2$ moves to flood.

That is, we can’t expect an algorithm to perform much better than the one from Theorem 1 for arbitrary boards.

### Integer Linear Programming

Let $C$ be the number of colors and $k$ an upper bound for the optimal solution. A component is a connected region of pixels with same color, considering 4-adjacency. Two components $i$ and $j$ are adjacenct if there exists at least a pixel in $i$ adjacent to a pixel in $j$.

We denote by $N(i)$ the set of components adjacent to component $i$. Let $S$ be the set of components and $m = |S|$. Furthermore, let $c_{i}$ be the color of component $i$.

We define the binary variable $x_{ct}$ that is 1 if color $c = 1, \cdots, C$ is chosen at iteration $t = 1, \cdots, k$ or 0 otherwise. We also define the binary variable $y_{it}$ that is 1 if component $i = 1, \cdots, m$ is filled in some iteration $t$ or 0 otherwise.

For simplicity, we’ll assume that the component of the seed cell has index $i = 1$.

(1) Objective function:

$\displaystyle \min \sum_{c=1}^{C} \sum_{t=1}^{k} x_{ct}$

(2) For each component $i$,

$\displaystyle \sum_{t=1}^{k} y_{it} = 1$

(3) For each iteration $t$,

$\displaystyle \sum_{c=1}^{C} x_{ct} \le 1$

(4) For each component $i$ and iteration $t$,

$\displaystyle y_{it} \le \sum_{j \in N(i)} \sum_{t' = 1}^{t - 1} y_{jt'}$

(5) For each component $i$ and iteration $t$

$y_{it} \le x_{c_it}$

(6) For each component $i = 2, \cdots, k$,

$y_{i,0} = 0$

(7) For the component of seed cell,

$y_{1,0} = 1$

(8) For all component $i$ and iteration $t$,

$y_{it} \in \{0, 1\}$

(9) For all color $c$ and iteration $t$,

$x_{ct} \in \{0, 1\}$

Constraint (2) states that we fill a component exactly in one iteration. Constraint (3) says that at each iteration we pick at most one color. Constraint (4) allow us to fill a component only if any of its adjacent components has been already filled (which means it currently has the same color as the seed pixel) and if its color is the same as the chosen color (5) .

Finally, constraints (6)-(9) state the variables are binary and each component starts unfilled except the component of the seed cell.

### Computational Experiments

We’ve implemented this model in C++ using the COIN-OR Cbc library. The code is available, as always, on github.

The first task was obtaining “real” instances. I don’t know whether the colors of Flood it! boards are uniformly chosen. Thus, I preferred to take some print screens from the original game and do some image processing to convert it matrices with integer representing colors.

Figure 3: A 14×14 Flood it! puzzle

Unfortunately the model is too big if we use $k$ as the number of components, $m$. The number of constraints (4) is then $O(m^3)$ and for the instance in Figure 3, $m = 127$.

In order to reduce its size, we tried two tricks.

First, we changed our model a little bit to decrease the number of contraints (4). Now, $y_{it}$ may also be 1 if component $i$ was covered some iteration before $t$. Thus, we can rewrite (4) as,

(4′) For each component $i$,

$\displaystyle y_{it} \le \sum_{j \in N(i)} y_{j(t-1)}$

But now we need to allow a component to be “covered” if it was covered in a previous iteration, even if the current color does not match. Thus, (5) becomes:

(5′) For each component $i$ and iteration $t$

$y_{it} \le x_{c_it} + y_{i(t-1)}$

We also need to change (2) equality to $\ge$ inequalities. Now, note that there are $O(m^2)$ constraints (4′).

The second trick is based on the assumption that in general, the optimal number of movements is much less than the number of components. Thus, solve the model for increasing values of $k$ starting with the number of colors until we find a feasible solution.

Even with these changes, we were not able to solve the instance in Figure 3. The 5×5 instance obtained from the first rows and cols of the matrix is solved in 2 minutes using 9 colors.

0 0 1 2 2 3 4 0 0 1 0 0 0 2 4 1 3 3 3 4 3 0 4 0 2 

Solution:

1 2 0 1 3 0 3 4 2

For the 6×6 instance obtained the same way, the solver does not find the optimal solution in an hour.

### Conclusion

In this post we gave a brief summary of Clifford et al. paper and then presented a integer linear programming approach to solve instances exactly.

As our experimental results showed, even for the smallest boards (14 x 14) we’re not able to obtain optimal solutions in feasible time.

As future work, we may try devising alternative models or find additional inequalities. Another possibility is to solve this same model using a commercial solver like CPLEX.

### References

[1] Clifford R., Jalsenius M., Montanaro A. and Sach B. – The Complexity of Flood Filling Games (arxiv.org)

# Functors

In Chapter 10 of Real World Haskell, the authors explain how to write a parser for a PGM file.

They start with a very direct and verbose solution. Along the chapter they perform abstractions that allows code reuse. Then, they end up with a much concise and legible solution.

Part of the abstraction makes use of functors, and it’s about them that we’ll talk briefly in this post. Since functors also exists in C++, we’ll talk about them too, even though they represent different concepts in each of these languages.

You’re probably acquainted with maps. Nevertheless, let’s see an example in Haskell:

map (+1) [1,2,3]

A map receives a function that transform an element of type a in a element in a type b and a list of elements of type a. We can see this by checking the its type on ghci:

>:t map
map :: (a -> b) -> [a] -> [b]


What if we want to apply a function to the elements of other structures beside lists?

Let’s start with the Maybe datatype already defined on Prelude:

data Maybe a = Nothing | Just a


In this case, the example map (+1) (Just 1) will not work. So, let’s write a function to apply the function only to the “internal” type of a Maybe:

maybeMap:: (a -> b) -> Maybe a -> Maybe b
maybeMap f (Just x) = Just (f x)
maybeMap _ Nothing  = Nothing


A classical example is performing a map over the elements a tree. Let’s define a datatype Tree representing a binary tree:

data Tree a =   Node (Tree a) (Tree a)
| Leaf a
deriving (Show)


To apply a function to its internal elements we can do:

treeMap :: (a -> b) -> Tree a -> Tree b
treeMap f (Leaf a)   = Leaf (f a)
treeMap f (Node l r) = Node (treeMap f l) (treeMap f r)


We can test it by doing:

> let t = Node (Leaf 1) (Node (Leaf 2) (Leaf 3))
> treeMap (+1) t
Node (Leaf 2) (Node (Leaf 3) (Leaf 4))


The Prelude allow us to implement a common interface using the typeclass Functor:

class Functor f where
fmap :: (a -> b) -> f a -> f b


Thus, we can specify an implementation for Maybe:

instance Functor Maybe where
fmap = maybeMap


… for our Tree datatype:

instance Functor Maybe where
fmap = treeMap


… and for lists:

instance Functor [] where
fmap = map


We can then use now fmap to perform mapping on lists, Maybe or Tree:

fmap (+1) [1, 2, 3] // [2, 3, 4]
fmap (+1) (Just 10) // Just 11
fmap (+1) (Leaf 0) // (Leaf 1)


Note that we can also apply this function recursively to perform mappings into inner types, for example:

fmap (fmap (+1)) [[1,2], [3]] // [[2,3], [4]]
fmap (fmap (*2)) [Just 4] // [Just 8]


The idea of functors is to perform mappings over the internal elements of datatype, but without affecting its structure.

For instance, a function that receives a (Just 1) and returns Nothing, is not preserving the Maybe structure. The same can be said about any function that changes the topology of a Tree.

So, functors are expected to satisfy two properties: identity and composability.

The identity property says that (fmap id) is equivalent to id. Since we’re not changing the internal elements neither the structure of a datatype, we are not changing the datatype at all.

The composability means that (fmap f) . (fmap g) is equivalent to fmap (f . g)

For further details, this subject is pretty well explained in [2] and [3].

### Functors in C++

In C++ we also have the concept of functors, but it has a different meaning. Here a functor is an object that acts as a function.

Basically, C++ functors are classes that define the operator(). Here’s an example:

struct Functor {
void operator() (){
cout << "Functor\n";
}
};
...
Functor f;
f();


What’s the difference between overloading the operator() and using any other method instead? For instance, one could define a method apply() and use it like this:

struct Functor {
void apply(){
cout << "Functor\n";
}
};
...
Functor f;
f.apply();


The answer to the question is that, besides the syntax sugar (i.e. calling f() instead of f.apply()), overloading the operator() allows functors and function pointers to be used alike with templates. Consider the following example,

struct Functor {
void operator() (){
cout << "Functor\n";
}
};
void f(){
cout << "Function\n";
}

template<typename F>
void callFunction(F func){
func();
}

int main(){
callFunction(f); // Function
callFunction(Functor()); // Functor
return 0;
}


Here, both f() and an instance of Functor can be passed as parameter to callFunction(). This is pretty much like the STL sort function.

The advantage of functors over regular functions is that they can have internal state, like the example below.

struct Functor {
int x;

Functor(int _x){
x = _x;
}
int operator() (int y){
return x + y;
}
};

int main (){
return 0;
}


Sure, one could achieve the same behavior with functions using global variables, but with functors we get more encapsulation.

### References

[1] Stackoverflow – C++ Functors and their uses
[2] Real World Haskell – Chapter 10. Code case study: parsing a binary data format
[3] Learn You a Haskell – Chapter 8. The Functor typeclass

# The Visitor pattern and Vtables in C++

The other day I was developing a Java code and I needed to use instanceof, but since I have read that this is a sign of bad design, I asked for help and was recommended the Visitor pattern.

In this post we’ll talk about this design pattern in Java, that is where my need began, but also in C++ because the implementation is related to vtables, which I’ve been wanting to read about.

### Visitors in Java

Let us consider the following example, where we have the classes Dog and Cat, which are specializations of Animal and have methods to emit sound, respectively bark() and meow(). We want to implement a method emitSound() which receives an instance of Animal and emits the sound according to the actual class of the instance. A first alternative using instanceof is the following:

// Listing 1
interface Animal {
}
public class Dog implements Animal {
public void bark(){
System.out.println("Woof");
}
}
public class Cat implements Animal {
public void meow(){
System.out.println("Meow");
}
}
...
public static void emitSound(Animal animal){
if(animal instanceof Dog){
Dog dog = (Dog) animal;
dog.bark();
}
else if(animal instanceof Cat){
Cat cat = (Cat) animal;
cat.meow();
}
}


Here we can quote Scott Meyers, from Effective C++ book:

Anytime you find yourself writing code of the form “if the object is of type T1, then do something, but if it’s of type T2, then do something else,” slap yourself.

To get rid of instanceof, we can add the method emitSound() to Animal interface and replace bark()/meow() for it. In our method emitSound(), we let polymorphism choose the correct implementation.

// Listing 2
interface Animal {
void emitSound();
}
public class Dog implements Animal {
@Override
public void emitSound(){
System.out.println("Woof");
}
}
public class Cat implements Animal {
@Override
public void emitSound(){
System.out.println("Meow");
}
}
...
public static void emitSound(Animal animal){
animal.emitSound();
}


Another possibility is to delegate the implementation of emitSound() to another class through the use of the Visitor pattern. This pattern considers two types of classes: an element and a visitor. The element implements a method usually named accept() while the visitor implements the method visit(). The accept() method receives a visitor as a parameter and calls the method visit() from this visitor with itself as parameter. This way, the visitor can implement the visit() method for each possible element.

We can implement like this:

// Listing 3
interface Animal {
void accept(AnimalVisitor visitor);
}
public class Dog implements Animal {
@Override
public void accept(AnimalVisitor visitor){
visitor.visit(this);
}
}
public class Cat implements Animal {
@Override
public void accept(AnimalVisitor visitor){
visitor.visit(this);
}
}
interface AnimalVisitor {
void visit(Dog dog);
void visit(Cat cat);
}
public class SoundEmissor implements AnimalVisitor {
@Override
public void visit(Cat cat){
System.out.println("Meow");
}
@Override
public void visit(Dog dog){
System.out.println("Woof");
}
}
...
public static void emitSound(Animal animal){
SoundEmissor soundEmissor = new SoundEmission();
animal.accept(soundEmissor);
}


The advantage of this approach is that SoundEmissor may contain members and methods related to the way animals emit sounds. Otherwise, these methods would ended up being implemented in Animal.

Another advantage is that the visitor implementation can be chosen at runtime and, being a dependency injection, simplifies testing.

### Visitors in C++

In C++ we don’t have instanceof, but we can use the typeid() operator instead. According to [1], if the argument to this operator is a reference or a dereferenced pointer to a polymorfic class, typeid() will return a type_info corresponding to the runtime object type.

// Listing 4
struct Animal {
virtual void foo(){};
};
struct Dog : public Animal {
void bark(){
cout << "Woof\n";
}
};
struct Cat : public Animal {
void meow(){
cout << "Meow\n";
}
};
void emitSound(Animal *animal){
if(typeid(*animal) == typeid(Dog)){
Dog *dog = dynamic_cast<Dog *>(animal);
dog->bark();
}
else if(typeid(*animal) == typeid(Cat)){
Cat *cat = dynamic_cast<Cat *>(animal);
cat->meow();
}
}


Again, we can create emitSound() in a interface and make use of polymorphism. In C++ we can implement an interface through a class containing only purely virtual function and no visible constructor like Animal in the code below:

// Listing  5
struct Animal {
virtual void emitSound() = 0;
};
struct Dog : public Animal {
void emitSound(){
cout << "Woof\n";
}
};
struct Cat : public Animal {
void emitSound(){
cout << "Meow\n";
}
};
void emitSound(Animal *animal){
animal->emitSound();
}


The Visitor pattern can be similarly implemented in the following way:

// Listing 6
struct Cat;
struct Dog;

struct AnimalVisitor {
virtual void visit(Cat *cat) = 0;
virtual void visit(Dog *dog) = 0;
};
struct Animal {
virtual void accept(AnimalVisitor *visitor) = 0;
};
struct SoundEmissor : public AnimalVisitor {
void visit(Cat *cat){
cout << "Meow\n";
}
void visit(Dog *dog){
cout << "Woof\n";
}
};
struct Dog : public Animal {
void accept(AnimalVisitor *visitor){
visitor->visit(this);
}
};
struct Cat : public Animal {
void accept(AnimalVisitor *visitor){
visitor->visit(this);
}
};
void emitSound(Animal *animal){
AnimalVisitor *visitor = new SoundEmissor();
animal->accept(visitor);
}


### Single and multiple dispatch

We just saw that the Visitor pattern relies on the use of virtual functions. So now, let’s analyse how C++ implements these kind of functions. Before that, we need to understand the concept of dynamic dispatch.

Dynamic dispatch is a technique used in cases where different classes (with a common ancestor) implement the same methods, but we can’t tell on compile time what is the type of a given instance, as in the cases above.

In C++ and Java, this type of dispatch is also known as single dispatch since it only considers the type of the object calling the method. In Lisp and other few languages, there is the multiple dispatch that also takes into account the type of the parameters.

As an example, take a look at the following code:

// Listing 7
void visitor(Dog *dog){
cout << "Woof\n";
}
void visitor(Cat *cat){
cout << "Meow!\n";
}
void emitSound(Animal *animal){
visitor(animal);
}


We’ll get a compile error since method/function matching for parameters is made at compile time. In this case we need to have an implementation for the Animal type.

Now, we’ll see how C++ implements dynamic dispatch.

### C++ Vtables

Although the C++ standard does not specify how dynamic dispatch should be implemented, compilers in general use the a structure called the virtual method table, known for other names such as vtable, which is the one we’ll use henceforth. Actually, this table is an array of pointers to virtual methods.

Each class that defines or inherits at least one virtual method has its own vtable. This table points to the virtual methods defined by the nearest ancestral (including the class itself) that are visible for that class. Besides, each instance of these classes will have an additional member, a pointer that points to the table corresponding to the class used to instantiate that object.

As an example, if we were to instantiate a Dog from Listing 5:

Animal *animal = new Cachorro();


Here, the animal's pointer points to Dog‘s and not Animal‘s. Thus, when we do

animal->emitirSom();


the corresponding table is looked up to find exactly which emitSound() to call. Note that virtual methods requires an extra indirection that may affect performance. Because of this dynamic dispatch is not performed by default unless some class defines a virtual method. On the other hand, in Java this is done by default.

Let’s see a final example,

struct A {
virtual void foo(){}
virtual void bar(){}
void baz(){}
};
struct B : public A {
void bar(){}
virtual void baz(){}
};


We can analyse the vtables from class A and B compiling the code above with gcc using the -fdump-class-hierarch flag. A text file with extension .class will be generated like this:

Vtable for A
A::_ZTV1A: 4u entries
0 (int (*)(...))0
4 (int (*)(...))(&amp; _ZTI1A)
8 (int (*)(...))A::foo
12 (int (*)(...))A::bar

Class A
size=4 align=4
base size=4 base align=4
A (0xb71f6038) 0 nearly-empty
vptr=((&amp; A::_ZTV1A) + 8u)

Vtable for B
B::_ZTV1B: 5u entries
0 (int (*)(...))0
4 (int (*)(...))(&amp; _ZTI1B)
8 (int (*)(...))A::foo
12 (int (*)(...))B::bar
16 (int (*)(...))B::baz

Class B
size=4 align=4
base size=4 base align=4
B (0xb7141618) 0 nearly-empty
vptr=((&amp; B::_ZTV1B) + 8u)
A (0xb71f61f8) 0 nearly-empty
primary-for B (0xb7141618)


Here we can see that function A::foo and A::bar appear in A‘s vtable, but A::baz don’t, since it was not declared virtual. On B‘s vtable we have A::foo, because it was not overridden in B. We also have B::bar, although it has not been declared virtual in B, this property was inherited from A::bar. Finally, B::baz appears on the vtable because it was declared virtual in B.

We can also see the value that the instances of such classes will have: vptr=((& A::_ZTV1A) + 8u) and vptr=((& B::_ZTV1B) + 8u), which is the offset to the functions’ pointers on the respective tables. A nice reference for a understanding of vtables can be seen in [2].

### Referências

[1] The typeid operator (C++ only)
[2] LearnCpp.com – The Virtual Table