Review: Working Effectively With Legacy Code

In this post we will review the book Working Effectively with Legacy Code by Michael C. Feathers.

The basic premise of the book is that we want to make a change in an existing (legacy) code base and we want to make it safely. A flowchart in the book describes its content in broad strokes: identify change points, find test points, break dependencies, write tests, then make changes and refactor.

The core of the book focuses on breaking dependencies in legacy code so that we can add tests more easily. There's a vicious cycle to break: we need to make changes to add tests, but we need tests to make changes safely. The idea is to break dependencies in safe ways, even if that temporarily leads to a worse design, until we can add tests and refactor the code.

Organization of the book

The book is divided into 3 parts totaling 25 chapters. In Part I the author describes concepts and introduces tools that are used in the rest of the book. In Part II, he describes several different situations and problems and provides solutions to them. Part III contains a single chapter describing several dependency-breaking techniques. As the author suggests, the chapters are not required to be read in order.

Selected Notes

Here I’m going to list some of the techniques the book provided that I found interesting and novel.

How to break dependencies

  • Seam model – the idea is to make behavior changes while minimizing code changes. One example is to add a new subclass in order to remove/mock dependencies for tests (see the sketch after this list).
  • Sprouting – move blocks of code to new methods and classes which can be easily tested.
  • Wrapper class – wrap code inside a new class which we have more control over and can mock. This is a variant of the seam model, but instead of relying on inheritance it uses composition. As the author notes, this is the decorator design pattern.
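To make the seam idea concrete, here is a minimal sketch in Python (with hypothetical classes; the book's own examples are in C++ and Java). The subclass overrides the method that reaches out to a hard dependency, so a test can exercise send() without touching the network:

class InvoiceSender:
    def send(self, invoice_id):
        payload = self.render(invoice_id)
        self.deliver(payload)

    def render(self, invoice_id):
        return "invoice #%d" % invoice_id

    def deliver(self, payload):
        # in production this would talk to a real mail server
        raise RuntimeError("would hit the network")


class TestableInvoiceSender(InvoiceSender):
    def __init__(self):
        self.delivered = []

    def deliver(self, payload):
        # the seam: replace the dependency without editing send()
        self.delivered.append(payload)


sender = TestableInvoiceSender()
sender.send(42)
assert sender.delivered == ["invoice #42"]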

A common pattern between them is keeping code changes to a minimum: the bigger the change, the more likely subtle behavior changes are introduced.

Where to test

  • Effect analysis – draw an informal DAG where nodes represent functions or classes and a directed edge means that changing one node affects the other. This helps us understand which code may be affected by a change and therefore must be tested.
  • Pinch points – the effect analysis might yield too many affected nodes. Pinch points are nodes that cover a large portion of the affected code but have relatively few dependencies, so they are easier to test.

Interpreted Languages

The book strongly assumes we are working with a compiled object-oriented language, namely C++, Java or C#. The author dedicates only a page or two to dynamic languages, in this case Ruby.

Quotes

The author shares several design principles throughout the chapters. Here are some quotes describing some of them.

Temporal coupling:

When you group things together just because they have to happen at the same time, the relationship between them isn’t very strong.

Refactoring:

Rename class is a powerful refactoring. It changes the way people see code and lets them notice possibilities that they might not have considered before.

Command and Query Separation:

A method should be a command or a query but not both. A command is a method that can modify the state of the object but that doesn’t return a value. A query is a method that returns a value but that does not modify the object.

Psychology of working with legacy code:

If morale is low on your team and it’s slow because of code quality, here’s something that you can try: pick the ugliest most obnoxious set of classes in the project and get them under test. When you’ve tackled the worst problem as a team you will feel in control of your situation.

Conclusion

Overall there are many interesting techniques for breaking dependencies and making code more testable. I've been using some of them in a less methodical way. I like the pragmatism of the techniques (done is better than perfect), which can be seen when good practices and designs are temporarily violated in order to write tests.

I’ve also enjoyed the code design tips interspersed throughout the chapters, but I disagree with some of the suggestions, for example converting static utils to methods in class instances. In my opinion static classes are a good forcing function to keep implementation modular and stateless. This is especially the case if there’s a good mocking framework available which can mock out static functions.

I found the tips less useful for dynamic languages and runtimes that allow modifying objects at run time, because these enable much more powerful testing frameworks that can, for example, easily mock dependencies for us. Nowadays I work mostly with Hack and JavaScript, which both provide a lot of runtime flexibility and support for mocking dependencies. The challenge I often face is having to mock too many dependencies, which is tedious.

The major negative of the book is that it contains a lot of obvious typos and typographic errors, such as duplicated listings.

Related Posts

  • Review: Code Complete 2 – This is another review of a highly practical and pragmatic book that I believe made me a better programmer. The overlap in content is minimal, so they’re a nice complement.

Sockets

In this post we’ll discuss sockets, more precisely Network Sockets. It’s based on two excellent articles from Brian “Beej Jorgensen” Hall. We’ll first cover some basic OS and Network concepts and then go over some important functions needed to create a simple socket server that can handle requests from multiple clients.

This post assumes basic familiarity with the Linux operating system and the C programming language.

OS Basics

File Descriptors

A file descriptor is basically a numerical identifier (id) into a lookup table that each process maintains. It is used to represent not only files, but also things like stdin (id = 0), stdout (id = 1) and stderr (id = 2). Sockets are also represented as file descriptors, as we'll see later.

Fork

fork() is a system call the current process can use to create copies of itself (known as children) that run the same program. One observation is that the child process gets a copy of the parent's data.

This will be important for our example where the main process uses fork() to generate children to handle connections. The child needs to inherit some of the file descriptors from the parent.

fork() returns twice: it returns 0 in the child process and, in the parent, a non-zero value corresponding to the process id (pid) of the child. A common way to use fork() is the following:
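Here is a minimal sketch in Python, whose os.fork() is a thin wrapper over the same system call:

import os

pid = os.fork()  # returns twice: 0 in the child, the child's pid in the parent
if pid == 0:
    print("child: my pid is", os.getpid())
    os._exit(0)  # child exits without running the parent's cleanup
else:
    print("parent: spawned child", pid)
    os.waitpid(pid, 0)  # reap the child to avoid leaving a zombie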

The return value can be used to distinguish between a parent and child and hence we can have them execute different code.

Signal

Signals are one of the simplest ways to communicate with a process. We just send a code that the process knows how to handle. There are OS-level signals like SIGKILL and SIGTERM and user-defined ones such as SIGUSR1. It's possible to override how a process handles specific signals via sigaction(), which takes a structure that points to a handler:
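As a sketch in Python, whose signal.signal() wraps sigaction() on most platforms:

import os
import signal

def handler(signum, frame):
    print("received signal", signum)

signal.signal(signal.SIGUSR1, handler)  # register the handler
os.kill(os.getpid(), signal.SIGUSR1)    # send ourselves the signal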

Network Basics

Beej describes the Layered Network Model but follows with this interesting observation:

Now, this model is so general you could probably use it as an automobile repair guide if you really wanted to. A layered model more consistent with Unix might be:

  • Application Layer (telnet, ftp, etc.)
  • Host-to-Host Transport Layer (TCP, UDP)
  • Internet Layer (IP and routing)
  • Network Access Layer (Ethernet, wi-fi, or whatever)

In this model, sockets are in the application layer, since they rely on TCP or UDP, which in turn run on top of IP. We'll go over these 3 layers next.

IP, the Internet Protocol

The guide discusses some details of IP, including the IPv4 vs. IPv6 distinction and different address types. The flags AF_INET and AF_INET6 are part of the socket API and are associated with these 2 types.

One interesting detail is byte order (Little-Endian vs. Big-Endian): while different computer systems can represent multi-byte data in different ways, the order is standardized for the internet as Big-Endian. This is also known as Network Byte Order.
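The socket API provides htons() ("host to network short") and friends to convert between the two. A quick sketch in Python:

import socket
import struct

port = 8080
print(hex(port))                      # 0x1f90 in host order
print(hex(socket.htons(port)))        # bytes swapped on a little-endian host
print(struct.pack("!H", port).hex())  # '!' means network (big-endian) order: 1f90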

UDP and TCP

The User Datagram Protocol is connectionless (stateless) and provides no guarantees on the order of messages, their delivery, or the absence of duplicates. The messages sent via UDP are known as datagrams.

The Transmission Control Protocol relies on a connection (established via a 3-way handshake), guarantees order and performs retries. The messages sent via TCP form a data stream.

The socket types SOCK_STREAM and SOCK_DGRAM are associated with the TCP and UDP protocols respectively.

Socket API

Before we proceed with our example, we’ll cover the C functions in sys/socket.h that correspond to the Linux socket APIs:

getaddrinfo() is a relatively high-level function that is capable of resolving an address (e.g. google.com) to an actual IP and port. An example of use (the full code is in Beej's guide):
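Python exposes the same call; a minimal sketch:

import socket

# resolve a hostname + service to a list of candidate addresses
results = socket.getaddrinfo("google.com", "http", type=socket.SOCK_STREAM)
for family, type_, proto, canonname, sockaddr in results:
    print(family, sockaddr)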

In the C version, the result comes back through a pointer servinfo of type struct addrinfo, which is a node of a linked list: besides the fields describing the address (ai_family, ai_socktype, ai_protocol, ai_addr), it holds an ai_next pointer to the next candidate.

getaddrinfo() returns a list of such values: any that match the criteria from the input parameters. Beej's client code iterates over that list until it finds a set of parameters it can connect to.

socket() returns a file descriptor. It basically creates an entry with a given id in the descriptor table and returns that identifier, which we'll use to establish a connection later. The interface is as follows:

int socket(int domain, int type, int protocol);

  • domain could be one of AF_INET and AF_INET6 (IPv4 / IPv6)
  • type is one of SOCK_STREAM or SOCK_DGRAM (TCP / UDP)
  • protocol is usually set to 0, which selects the default protocol for the given type (TCP for SOCK_STREAM, UDP for SOCK_DGRAM)

You may also see PF_INET passed as the domain in older code. In practice AF_INET is the same as PF_INET. Beej says:

This PF_INET thing is a close relative of the AF_INET (…) they’re so closely related that they actually have the same value (…), it was thought that maybe an address family (what the “AF” in “AF_INET” stands for) might support several protocols that were referred to by their protocol family (what the “PF” in “PF_INET” stands for). That didn’t happen.

More conveniently, we can use the results of getaddrinfo() to fill these in for us:
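For example, in Python:

import socket

family, type_, proto, _, sockaddr = socket.getaddrinfo(
    "google.com", "http", type=socket.SOCK_STREAM)[0]
sock = socket.socket(family, type_, proto)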

bind() binds a socket to a specific hostname and port. It is not always needed: a client usually doesn't care which port is used on its side, so it can let the OS choose. For a server it's important because it defines the IP address and port the socket will listen on.

This information can be provided more easily via the struct returned from getaddrinfo(). By passing NULL as the hostname and AI_PASSIVE in ai_flags, getaddrinfo() fills in the local IP in res for us:
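A sketch of the server-side sequence in Python (port 3490 is Beej's example port):

import socket

family, type_, proto, _, sockaddr = socket.getaddrinfo(
    None, "3490", type=socket.SOCK_STREAM, flags=socket.AI_PASSIVE)[0]
server = socket.socket(family, type_, proto)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # allow fast restarts
server.bind(sockaddr)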

connect() is the function a client can use to indicate the desire to establish a connection with a given server. Similarly to bind(), we can use the results from getaddrinfo():
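In Python:

import socket

family, type_, proto, _, sockaddr = socket.getaddrinfo(
    "google.com", "http", type=socket.SOCK_STREAM)[0]
client = socket.socket(family, type_, proto)
client.connect(sockaddr)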

Note how we didn't need to bind(): connect() let the OS fill in the local host and an ephemeral port for us.

listen() is the function a server calls to define how many pending connections can be queued on a socket. Beej's article mentions a system limit of 20, but using 5 to 10 works in practice.

accept() is the function that actually connects with a specific client. So that the original socket can keep listening for other incoming connections, accept() returns a new socket descriptor, which will be used to send and receive messages.

send() / recv() are used to send and receive messages via the established connection. One important aspect is that while you specify the size of the data being sent, the API does not guarantee the whole buffer will be transmitted in one call, so you need to write a loop to make sure all data is sent/received.
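A sketch of such a loop in Python (the standard library's sock.sendall() does essentially this):

def send_all(sock, data):
    sent = 0
    while sent < len(data):
        n = sock.send(data[sent:])  # may send fewer bytes than requested
        if n == 0:
            raise ConnectionError("socket connection broken")
        sent += n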

Example

The high-level sequence of calls for a server is:

  • getaddrinfo()
  • socket()
  • bind()
  • listen()
  • accept()
  • recv()/send()

For the client we have:

  • getaddrinfo()
  • socket()
  • connect()
  • send()/recv()

Beej provides examples for server and client codes in C: server.c and client.c.
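Putting the sequence together, here is a minimal sketch of a sequential echo server in Python (Beej's C server additionally fork()s a child per connection, as described earlier):

import socket

family, type_, proto, _, sockaddr = socket.getaddrinfo(
    None, "3490", type=socket.SOCK_STREAM, flags=socket.AI_PASSIVE)[0]

server = socket.socket(family, type_, proto)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(sockaddr)
server.listen(10)

while True:
    conn, addr = server.accept()  # new descriptor for this client
    with conn:
        data = conn.recv(1024)
        if data:
            conn.sendall(data)    # echo it back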

Conclusion

Beej has an entire guide dedicated to inter-process communication. The guide covers the basic concepts, such as creating new processes via fork() and handling signals; synchronization mechanisms such as locks and semaphores; and communication mechanisms such as pipes, message queues and sockets. I like the conversational style of his writing.

I probably wrote code using sockets a long time ago. I didn't have time to dig deep into this subject, so I didn't feel like I learned a ton; it was a good refresher nevertheless.

References

[1] Beej’s Guide to Unix Interprocess Communication
[2] Beej’s Guide to Network Programming


Observable


Observable is a web-first interactive notebook for data analysis, visualization, and exploration [1].

It was created by Mike Bostock, the creator of d3.js, which we discussed previously way back in 2014 [2]. At some point, Mike stepped down from d3.js to focus on a new library called d3.express, which he presented in 2017 during the OpenVis conference [3] while it was still in the making. d3.express was eventually renamed to Observable.

I’ve been experimenting with Observable and in this post I’ll cover some of my learnings while trying to build two visualizations.

One of the benefits of notebooks like Observable is co-locating code with its output. This makes it much easier to explain code because, in addition to markdown comments, we can modify the code and see the results in real time. This means a lot of the documentation of APIs available in Observable consists of notebooks themselves. For these reasons, this post is a series of links to notebooks with high-level comments.

Before we start, some assumptions I’m making are familiarity with imperative programming and the basics of JavaScript.

What is Observable?

Why Observable – this notebook explains the motivation behind Observable and how it works at a high level:

An Observable notebook consists of reactive cells, each defining a piece of state. Rather than track dependencies between mutable values in your head, the runtime automatically extracts dependencies via static analysis and efficiently re-evaluates cells whenever things change, like a spreadsheet

Five-Minute Introduction – provides a hands-on tour of many topics, some of which we'll cover in more detail in this post:

  • Real-time evaluation of JavaScript snippets
  • Cell references
  • Markdown
  • HTML/DOM display
  • Promises
  • Generators
  • Views
  • Importing from other notebooks

Let’s start by delving into the details of JavaScript snippets and cell references.

Observable vs. JavaScript

Observable’s not JavaScript – in this notebook Bostock explains the differences between JavaScript and Observable. Some notes I found interesting from that notebook:

  • A cell is composed of a cell name and an expression:
    • [cell name] = [expression]

It can be a simple expression like 1 + 2, or a multi-line block like:

cell_name = {
  const a = 1;
  const b = 2;
  return a + b;
}

  • The value of cell_name can be used in other cells, but by default it is read-only.
  • Each cell can only have one cell_name assignment. In other words, it can only “export” one variable. It's possible to cheat by exporting an object. Note that because curly brackets are used to denote code blocks, we need to wrap an object literal in parentheses:

obj = ({one: 1, two: 2})

  • Cells can refer to other cells in any part of the code; Observable builds a dependency DAG to figure out the right evaluation order. This also means dependencies cannot be circular. How Observable Runs explains this in more detail.
  • Constructs like async functions (await) and generators (yield) have special behavior in a cell. We'll expand on this in the Generators section below.
  • Cells can be mutated if declared with a qualifier (mutable). We'll expand on this in the Mutability section below.

Mutability

Introduction to Mutable State – cells are read-only by default, but Observable allows changing the value of a cell by declaring it mutable. When changing the value of the cell elsewhere, the mutable keyword must also be used.

It's important to note that it is the cell binding that is immutable: if the cell holds a reference to a value, say an Object, the underlying value can still be mutated. This can lead to unexpected behavior, so I created a notebook, Mutating references, to highlight this issue.

Markdown

Markdown summary – a great cheat sheet of the Markdown syntax in Observable. I find the Markdown syntax in Observable verbose, which is not ideal given how prevalent text is in notebooks:

md`Hello World`

More cumbersome still is typing inline code. Because it uses backticks, it has to be escaped:

md`Hello \`code\``

To be fair, most code will be typed as cells, so this shouldn't be too common. Observable also supports LaTeX via KaTeX, which is awesome. KaTeX on Observable contains a bunch of examples.

Generators

Generators are a JavaScript construct that allows a function to yield execution back to the caller and resume from where it stopped when it's called again. We talked about this in the Async Functions in JavaScript post [4].

Introduction to Generators explains how generators can be used in Observable. The combination of generators and delays (via promises) is a great mechanism to implement a ticker, which is the basis of animation. A very simple ticker in Observable is a cell that loops forever, yielding an incrementing counter and then awaiting a promise that resolves after a fixed delay (remember that Promises are awaited without extra syntax).

Views

Introduction to Views explains what views are and how to construct one from scratch. I like this recommendation:

If there is some value that you would like the user to control in your notebook, then usually the right way to represent that value is as a view.

Views are thus convenient ways to expose parameters via buttons and knobs such as text inputs, dropdowns and checkboxes.

I ran into a problem when using a checkbox with labels in it, like the cell below:

(screenshot: a viewof cell creating a checkbox input with labeled options)

It does not yield true/false as I wanted. This notebook, Checkbox, discusses the problem and offers a solution. This is an example where great but imperfect abstractions make it hard to understand when something doesn’t work. It’s worth learning about how viewof is implemented behind the scenes.

In one of my experiments I needed to synchronize a view and a ticker. Bostock provides a neat abstraction in his Synchronized Views notebook.

Importing from other notebooks

Introduction to Imports shows how easy it is to import cells from other notebooks.

It also shows that it’s possible to override the value of some of the cells in that notebook:

(screenshot: an import of the form import {chart} with {data} from another notebook, overriding the imported notebook's data cell)

This is neat: because the chart cell depends on data, allowing data to be overridden effectively parameterizes chart. In other words, we import chart as a function as opposed to a value (the result of chart(data)).

Versioning. When importing a notebook we always import the most recent version, which means imports could potentially break if the notebook's contract changes. The introduction above mentions the possibility of specifying the version when importing but doesn't provide an example of how to do so.

My strategy so far has been to clone a notebook if I’m really concerned about it changing, but that means one has to manually update / re-fork if they want to sync with the upstream changes.

React Hooks

I'm used to writing UI components with React, and luckily Jed Fox created a notebook which I forked into this React Notebook, which allowed me to use React Hooks to write a view.

Conclusion

I really like Observable and am looking forward to implementing more data visualizations in it. I've experimented with 2 animations so far: a simplistic Solar System model and a spinning Globe.

One natural question that crossed my mind is whether I should have written this whole post as an Observable notebook. In the end I decided to stick with an old-school static post for two reasons:

  • Consistency: Observable only supports JavaScript, so if I'm writing a post about Python I'd need to fall back to a regular post, and I wouldn't want a mix of both.
  • Durability: Observable has a lot of potential but it’s still mostly experimental and I’m not sure I can rely on it sticking around for a long time.

References


[1] Observable – Why Observable?
[2] Introduction to d3.js
[3] OpenViz Conf – 2017
[4] Async Functions in JavaScript

Aho-Corasick

Alfred Aho is a Canadian computer scientist. He is famous for the book Principles of Compiler Design, co-authored with Jeffrey Ullman. Aho also developed a number of UNIX utilities with widespread usage, including the AWK programming language (an acronym for Aho, Weinberger and Kernighan) and variants of grep.

Unfortunately, I couldn't find much information about Margaret J. Corasick, except that she worked at Bell Labs and, together with Aho, developed the Aho-Corasick algorithm, which we'll talk about in this post.

Multiple-string search

We have studied before the problem of finding a string in a text, for which the Knuth-Morris-Pratt algorithm is well known.

Aho and Corasick studied the problem of finding a set of strings in a text. More precisely, given a body of text H and a set of search terms S, we want to find all occurrences of each term of S in H.

They came up with the Aho-Corasick algorithm, which is able to solve this problem in time linear in the size of H and the size of the output (i.e. the total number of characters that need to be printed when matches are reported).

It consists of pre-computing a look-up structure from S and then scanning the text using it. Whenever it detects a match, it prints the corresponding entries. Let's first see how to construct the structure.

The look-up structure

The look-up structure is constructed in two phases, one for each of two sets of edges: goto edges and fail edges.

The goto edges are essentially the edges of a trie containing the entries in S. The trie can be seen as a state machine where each node represents a prefix of one of the entries in S and the transition function tells us which prefix to go to next given an input character. We denote by g(s, a) the node we navigate to if we're at node s and receive input a.

Given a text H = [a_1, a_2, …, a_n], we can start at the root and follow the edges labeled a_1, a_2 and so on. If we land on a node representing an entry in S, we print it.

Eventually, though, we may reach a node s such that the next character a_j doesn't have a corresponding edge g(s, a_j). We know that s is a suffix of [a_1, a_2, …, a_{j-1}], say s = [a_k, …, a_{j-1}]. We also know there's no unreported entry of S starting at index k, but there might be one starting at index k+1. We could restart the search at k+1 from the root of the trie, but can we do better?

Because S is fixed, no matter what text H we have, by the time we encounter a dead end at a node r, we know that the last characters seen are [a_k, …, a_{j-1}] = r. Now suppose [a_{k+1}, …, a_{j-1}, a_j] happens to be in the trie. Then we can continue our search from it, without having to backtrack. If k+1 doesn't work, we can try k+2, k+3, and so forth. More generally, say that we eventually find the smallest index k* > k such that s* = [a_{k*}, …, a_{j-1}, a_j] is in the trie. In other words, s* is the longest proper suffix of s = r + a_j that is in the trie. When we fail to find s = r + a_j in the trie, we can resume the search at s*. This suggests having a fail edge from s to s* in our trie, to quickly resume the search without backtracking.

Technically, s doesn't exist in the trie, so we can't add an edge there. However, we can add the failure edge at r and denote it as f(r, a_j) = s* for all labels a_j without a goto edge.

Given the goto and fail edges, we have a proper state machine which we can use to find matches in H. Let’s combine them into a single set of edges e(s, a) (that is e(s, a) = g(s, a) if it exists, otherwise f(s, a)).

We are still missing printing out the matches. Because of the shortcuts we take via the fail edges, we never explicitly visit the suffixes of a state s. For this reason, when visiting a state s, we need to print all of its suffixes that belong to S.
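As a sketch, the scanning phase could look like this in Python, assuming trie nodes with a children dict (the goto edges), a ps pointer (the fail edges, constructed below) and a report() helper that prints output(s) (sketched later in this post):

def search(root, text):
    s = root
    for a in text:
        # fail edges: jump through suffixes until a goto edge on `a` exists
        while s is not root and a not in s.children:
            s = s.ps
        s = s.children.get(a, root)  # goto edge, or stay at the root
        report(s)                    # print all entries of S ending here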

We’ll now see how to construct the fail edges and the output.

Building the fail edges

We’ll show how to construct the failure edges efficiently. We define auxiliary edges ps(s) which represents the longest proper suffix of s that is in the trie. Note that f(r, a) is ps(r + a) for when g(r, a) doesn’t exist.

The idea is that if we know ps(r) for all nodes r of depth up to l in the trie, we can compute ps(r + a), that is, for the nodes in the next level.

Consider each node r of depth l and each symbol a. Let s = r + a. We know ps(r) is in the trie, but is ps(r) + a in the trie? Not necessarily, but maybe a suffix of ps(r) is? We can try ps(ps(r)) and further suffixes until we find one that is in the trie, or we reach the empty string.

How can we make sure to process node s only after all the edges from lower levels have been processed? We can process the nodes in breadth-first order, which can be achieved using a queue. The pseudo-code could be:
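Here is a sketch in Python (node layout is an assumption: each node carries a children dict and a ps pointer):

from collections import deque

def build_fail_edges(root):
    queue = deque()
    for child in root.children.values():
        child.ps = root  # depth-1 nodes fail to the root
        queue.append(child)
    while queue:  # breadth-first order
        r = queue.popleft()
        for a, s in r.children.items():  # s = r + a
            queue.append(s)
            t = r.ps
            while t is not root and a not in t.children:
                t = t.ps  # jump through suffixes of r via ps()
            s.ps = t.children.get(a, root)
    return root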

Note how instead of searching for the largest suffix by removing one character at a time and checking if it exists in the trie, we're “jumping” to suffixes via repeated application of ps(), which is what makes the algorithm efficient, as we shall see. You might be asking whether we're missing any valid suffix when doing that; no worries, we analyze the correctness of this in Lemma 1 (Appendix).

Building the output

If we compute output(s) in breadth-first order, we can assume we know output(r) for nodes r at lower levels than s. Assuming we already computed ps(s), we have that output(s) = output(ps(s)) + {s} if s is in S, and output(s) = output(ps(s)) otherwise:
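Continuing the Python sketch, this fits inside the same breadth-first loop, right after s.ps is computed (treating output as a plain list for clarity; the shared linked-list version appears in the implementation section below, and we assume the root starts with output = [] and each node has a word field set when it spells an entry of S):

# inside the breadth-first loop above, after s.ps is set:
s.output = list(s.ps.output)  # everything that matches at ps(s)...
if s.word is not None:
    s.output.append(s.word)   # ...plus s itself, if s is in S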

How do we know that, by jumping through suffixes via ps(s), we are not missing any matches? We demonstrate that in Lemma 2 (Appendix).

Implementation in Python

I’ve implemented the pseudo-code proposed here in Python, and made it available on Github. The main implementation aspect is how we compute the output list. To avoid storing the matches multiple times, we model it as a linked list. More precisely, when computing output(s) from ps(s), if ps(s) is itself a keyword, we point output(s) to ps(s). Otherwise we point it to output(ps(s)).

When printing output(s) we just need to traverse the list:
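A sketch of that traversal in Python, where each node's output points to the nearest proper suffix that is itself a keyword (or None):

def report(s):
    node = s if s.word is not None else s.output
    while node is not None:
        print(node.word)
        node = node.output  # next keyword among the proper suffixes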

Conclusion

It was a lot of work to study Aho and Corasick's paper [1], understand the proofs and rewrite them in my own words, but this gave me a much better intuition for the algorithm.

Like KMP, I had known about this algorithm before but never took the time to fully understand it.

My initial implementation was in Rust, but I got stuck when modeling the nodes because of self-references. This led me to this StackOverflow answer, which I haven't had time to study, but it seems like a good topic for a post.

Appendix

Correctness

Lemma 1. Let s be a prefix in the look-up structure. Then ps(s) is the longest proper suffix of s that exists in the look-up structure T.

Let's prove this by induction on the size of s. The base case is nodes at level 1, where ps(s) = root (the empty string). This holds, since the only proper suffix of a single letter is the empty string.

Assuming that ps(s') is the longest proper suffix of s' that exists in the look-up structure for all |s'| < |s|, let's prove this is true for ps(s) as well. Let s = r + a. First we show that ps(s) is a proper suffix of s that is in T. We know that ps(r) is a suffix of r, and so is ps(ps(r)), since a suffix of a suffix of r is also a suffix of r, and so on. Define ps_k(r) = ps(ps_{k-1}(r)) and let ps_n(r) be the first such iterate for which g(ps_n(r), a) exists. By construction we assigned ps(s) = g(ps_n(r), a). Since ps_n(r) is a suffix of r and s = r + a, we know that ps_n(r) + a is a suffix of s. Next we show there's no proper suffix of s in the look-up structure larger than ps(s). Suppose there is an s* in T such that |s*| > |ps(s)|. Let r* be such that s* = r* + a. Note that r* is a suffix of r and it exists in T (since s* is in T and T is a prefix tree). By the induction hypothesis, r* cannot be larger than ps(r). On the other hand, r* has to be larger than ps_n(r), since |ps(s)| = |ps_n(r)| + 1, |s*| > |ps(s)| and |s*| = |r*| + 1.

The first observation is that r* != ps_k(r) for any k; otherwise the algorithm would have set ps(s) = g(r*, a) = s*. So there must be some k for which |ps_k(r)| > |r*| > |ps_{k+1}(r)|. This means that for r' = ps_k(r) there's a suffix r* such that |r*| > |ps(r')|, which contradicts the induction hypothesis that ps(r') is the largest suffix of r' in T.

Lemma 2. output(s) contains y if and only if y is in S and is a suffix of s.

Let's prove one direction: if output(s) contains y, then y is in S and is a suffix of s. This follows by construction: output(s) contains s if s is in S, plus the outputs of ps(s), which is a suffix of s; by induction those are all in S and are suffixes of ps(s), and hence of s.

Now suppose y is a keyword and a suffix of s. Let's show that there's a k for which ps_k(s) = y. If this were not the case, then by an argument similar to Lemma 1's, there would be a ps_{k-1}(s) such that y is a suffix of it and |y| > |ps_k(s)|, which violates Lemma 1, since ps(ps_{k-1}(s)) = ps_k(s) would not be the longest possible. Since ps_k(s) = y, by construction we know that output(y) will eventually be merged into output(s).

Lemma 3. At the end of each iteration of the loop in Algorithm 1, the state s is in the look-up structure T and is the longest suffix of [a_1, a_2, …, a_j] that is in T.

This is true for j = 1: we are either at state [a_1] in T (if it exists) or at the root (empty state); in both cases the state is the longest suffix of [a_1] in T. Assuming this hypothesis holds for j-1, we were at a state r before consuming a_j, and r is the longest suffix of [a_1, a_2, …, a_{j-1}] in T.

By construction we know that the end state s is still in T, since we only followed valid edges of T (gotos and fails). We need to show s is the longest suffix of [a_1, …, a_j] in T. For the sake of contradiction, assume there's an s* in T with |s*| > |s| that is a suffix of [a_1, …, a_j]. Let r* be such that s* = r* + a_j. Note that r* is in T and is a suffix of [a_1, …, a_{j-1}], so by the inductive hypothesis |r*| <= |r|. If we used a goto edge, then |s| = |r| + 1, so |s*| > |s| implies |r*| > |r|, which contradicts the inductive hypothesis. If we used a fail edge, then s = f(r, a_j). Using the same argument from Lemma 1, r* is a suffix of r and r* != ps_k(r) for any k, so it must be that for some k, x = ps_k(r) satisfies |x| > |r*| > |ps(x)|, a contradiction because r* is a proper suffix of x but longer than ps(x).

Theorem 1. The Aho-Corasick algorithm is correct.

We need to show that every substring of the text that is in S will be reported. Any substring we report ends at some index j of H, so it suffices to show that we report every suffix of [a_1, …, a_j] that is in S, for all j.

Let s be the node we are at at the end of each loop iteration in Algorithm 1. Suppose there's a suffix s* of [a_1, …, a_j] in S that is not in output(s). From Lemma 2 we know that every keyword in S that is a suffix of s is in output(s); it follows that s* is not a suffix of s. Since both are suffixes of [a_1, …, a_j], this implies |s*| > |s|, which is impossible by Lemma 3.

QED

Complexity

Theorem 2. The look-up structure can be constructed in time linear in the number of characters in S.

Let’s define size(S) = the total number of characters in S and A the set of symbols of the alphabet.

It's known that we can construct a trie from a set of strings in time linear in size(S), so the goto edges are covered.

For constructing the ps() function, the cost is determined by how many times we execute the inner loop. We visit each node once and, for each of the |A| symbols, we perform the inner loop. Since |ps(s)| < |s|, the number of times we execute the inner-most loop for a given state is bounded by |s|, so for each state we execute the inner-most loop at most |s|*|A| times. Summing over all states, we get size(S)*|A|. Assuming |A| is a small constant, we can ignore it in the overall complexity. It's worth noting that the bounds in Aho and Corasick's original paper don't depend on |A|.

The cost of constructing the output() function is O(1) per node if we use a shared linked list: output(s) can be constructed as a cell for s pointing to the list output(ps(s)). So, again, this is bounded by size(S).

For the search algorithm, let’s ignore the cost of printing the output for now. We can follow an edge in O(1) time and do so exactly once per character.

In the worst case there can be a lot of matches at each position. For example, if S = {a, aa, aaa, …, a^(n/2)} and H = a^n, then for n/2 positions we need to output all entries in S, leading to a complexity of size(S)*n.

Since these values need to be printed, there's no way to be faster than that, and this could be the dominating factor in the total complexity. In practice, however, the number of matches is much smaller.

References

[1] Efficient String Matching: An Aid to Bibliographic Search
[2] Alfred Aho – Wikipedia

Protein Design

We previously learned about the problem of predicting the folding of a protein, that is, given a chain of amino acids, finding its final 3D structure. This time we're interested in the reverse problem: given a 3D structure, find some chain of amino acids that would lead to that structure once fully folded. This problem is called Protein Design.

In this post we’ll focus on mathematical models for this problem, studying its computational complexity and discuss possible solutions.

Mathematical Modeling

The 3D structure of a protein can be divided into two parts: the backbone and the side chains. In the model proposed by Pierce and Winfree [3], we assume the backbone is rigid and we try to find amino acids for the side chains that minimize some energy function.

This means we're not really trying to predict the whole chain of amino acids, but only the subset that ends up on the side chains.

The amino acids on the side chains can have a specific orientation [2, 3], known as a rotamer, which in turn can be represented by a single value, its dihedral angle. We can define some arbitrary order for these amino acids and label them with an index, which we call a position.

At each position there are multiple possible rotamers, and we need to select them so as to minimize the overall energy function. More formally, for each position i, let R_i be the set of rotamers available and r_i \in R_i the chosen rotamer.

The model assumes the cost function of the structure is pairwise decomposable, meaning that we can account for the interaction of each pair independently when calculating the total energy:

E_{total} = \sum_{i} \sum_{j > i} E(r_i, r_j)

where E(r_i, r_j) is the energy cost of the interaction between positions i and j, assuming rotamers r_i and r_j respectively. The definition of E can be based on molecular dynamics models such as AMBER.

In [3], the authors call this optimization problem PRODES (PROtein DESign).

PRODES is NP-Hard

Pierce and Winfree [3] prove that PRODES is NP-Hard. We’ll provide an informal idea of the proof here.

First we need to define the decision version of PRODES, which we'll call PRODESd: given a value K, determine whether there is a set of rotamers such that the energy cost is less than or equal to K. We can prove that PRODESd is NP-complete by showing that it belongs to the NP complexity class and by reducing, in polynomial time, some known NP-complete problem to it.

We claim this problem is in NP because, given a candidate set of rotamers, we can verify in polynomial time whether it is a solution (i.e. whether the answer is “true”): we just need to evaluate the cost function and check that the result is at most K.

We can reduce the 3-SAT problem, known to be NP-complete, to PRODESd. The idea is that we can map every instance of 3-SAT to an instance of PRODESd and show that the result is “true” for the 3-SAT instance if and only if the result is “true” for the mapped PRODESd instance.

Let’s start by formalizing PRODESd as a graph problem, with an example shown in the picture below:

Figure (source: [3]): a) 3 positions with their sets of rotamers. b) Each rotamer becomes a vertex, grouped into sets; we pick exactly one vertex per set and try to minimize the total cost of the edges associated with the selected vertices.

Now, given a 3-SAT instance, we create a vertex set for each clause C_i (containing a vertex for each literal) and a vertex set for each variable x_i (containing vertices T and F). For each positive literal x_i we add an edge of weight 1 to the vertex F of the set corresponding to variable x_i. Conversely, for each negated literal \bar x_i, we add an edge of weight 1 to the vertex T. All other edges have weight 0.

For example, the instance (x_1 + \bar x_2 + x_3) \cdot ( \bar x_1 + x_2 + x_3) \cdot (\bar x_3) yields the following graph where only edges of non-zero weight are shown:

Figure (source: [3]): the resulting graph, with only non-zero-weight edges shown.

We claim that this 3-SAT instance is satisfiable if and only if PRODESd is true for K = 0. The idea is the following: for each vertex set corresponding to a variable we pick either T or F, which corresponds to assigning true or false to the variable. For each vertex set corresponding to a clause we pick a literal that will evaluate to true (and hence make the clause true). It's possible to show that if the 3-SAT instance is satisfiable, there's a solution for PRODESd that avoids all edges with non-zero weight.

Integer Linear Programming

Now that we know that PRODES is unlikely to have an efficient algorithm to solve it, we can attempt to obtain exact solutions using an integer linear programming model. Let's start with some definitions: let V be the set of rotamers over all positions, and for a rotamer i \in V let p(i) denote its position. Let E(i, j) be the pairwise energy of choosing rotamers i and j, with p(i) \neq p(j).

We can define our variables as r_i \in \{0, 1\}, indicating whether rotamer i is chosen, and x_{i,j} \in \{0, 1\}, indicating whether rotamers i and j are both chosen.

The objective function of our model becomes:

\min \sum_{i, j \in V : p(i) < p(j)} E(i, j) \, x_{i,j}

Finally, the constraints are:

(1) \sum_{i : p(i) = q} r_i = 1, for every position q
(2) x_{i,j} \le r_i and x_{i,j} \le r_j
(3) x_{i,j} \ge r_i + r_j - 1

Equation (1) says we should pick exactly one rotamer for each position. Constraints (2) and (3) enforce that x_{i,j} = 1 if and only if r_i = r_j = 1.
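As a sanity check, here is a toy sketch of this model in Python using the PuLP library; the positions, rotamers and energies are made up:

from itertools import combinations
import pulp

rotamers = {"1a": 0, "1b": 0, "2c": 1, "2d": 1, "3e": 2, "3f": 2}  # rotamer -> position
E = {("1a", "2c"): -2.0, ("2d", "3e"): -1.5}  # sparse pairwise energies

prob = pulp.LpProblem("prodes", pulp.LpMinimize)
r = {i: pulp.LpVariable("r_" + i, cat="Binary") for i in rotamers}
pairs = [(i, j) for i, j in combinations(sorted(rotamers), 2)
         if rotamers[i] != rotamers[j]]
x = {p: pulp.LpVariable("x_%s_%s" % p, cat="Binary") for p in pairs}

prob += pulp.lpSum(E.get(p, 0.0) * x[p] for p in pairs)  # objective
for q in set(rotamers.values()):                         # constraint (1)
    prob += pulp.lpSum(r[i] for i in rotamers if rotamers[i] == q) == 1
for i, j in pairs:
    prob += x[(i, j)] <= r[i]                            # constraint (2)
    prob += x[(i, j)] <= r[j]
    prob += x[(i, j)] >= r[i] + r[j] - 1                 # constraint (3)

prob.solve()
print([i for i in rotamers if r[i].value() == 1])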

Note: the LaTeX used to generate the formulation above is available here.

Conclusion

The study of protein folding prediction led me to protein design, which is a much simpler problem, even though from the computational complexity perspective it's still intractable.

The model we studied is very simple and makes a lot of assumptions, but it's interesting as a theoretical computer science problem. I'm still not sure how useful it is in practice.

References

[1] Wikipedia – Protein design
[2] Wikipedia – Rotamer
[3] Protein Design is NP-hard – Niles A. Pierce, Erik Winfree

Protein Folding Prediction

In this post we’ll study an important problem in bioinformatics, which consists in predicting protein folding using computational methods.

We’ll cover basic concepts like proteins, protein folding, why it is important and then discuss a few methods used to predict them.

Amino Acids and Proteins

Amino acids are molecules containing an amino group (-NH2), a carboxyl group (-COOH) and a side chain (denoted R), which is the part that varies between different amino acids [1].

Etymology. A molecule containing a carboxyl group (-COOH) is known as a carboxylic acid, so together with the amino group, these give amino acids their name [2].

An amino acid where R attaches to the first carbon (see Figure 1) is known as an α-amino acid (read as alpha amino acid).

Figure 1: alpha and beta carbons in a Carbonyl group (link)

There are around 500 known amino acids, but only 22 of them are known to be present in proteins, and all of those are α-amino acids. Since we're talking about proteins, we'll implicitly assume amino acids are the alpha ones. See Figure 2.

Figure 2: Amino acid

Amino acids can be chained, forming structures known as peptides, which in turn can be chained, forming polypeptides. Some polypeptides are called proteins, usually when they're large and have some specific function, but the distinction is not precise.

Etymology. Peptide comes from the Greek for “to digest”. According to [3], the scientists Starling and Bayliss discovered a substance called secretin, which is produced in the stomach and signals the pancreas to secrete pancreatic juice, important in the digestion process. Secretin was the first known peptide and inspired the name.

Protein Folding

To be functional, a protein needs to fold into a 3D structure. The sequence of amino acids in a protein is believed [6] to be all that is needed to determine its final shape and hence its function. This final shape is also known as the native fold.

There are 4 levels of folding, conveniently named primary, secondary, tertiary and quaternary.

  • Primary structure is the non-folded form, the “linearized” chain of amino acids.
  • Secondary structure is formed when parts of the chain bond together via hydrogen bonds, and can be categorized into alpha-helices and beta-strands.
  • Tertiary structure is when the alpha-helices and beta-sheets fold into globular structures.
  • Quaternary structure is when 2 or more peptide chains that have already been folded combine with each other.

Prediction

The prediction of protein folds usually involves determining the tertiary structure (the most common target) from the primary one. We've seen before in DNA Sequencing that reading amino acid sequences is easier when the protein is denatured into a linear chain. On the other hand, analyzing the intricate folded forms can be very hard [7], even with techniques such as X-ray crystallography.

Why is this useful?

If the protein structure is known, we can infer its function on the basis of structural similarity with other proteins we know more about. One can also predict which molecules or drugs can efficiently bind to the protein and how they will bind [7].

From Wikipedia’s Nutritious Rice for the World [8], protein prediction can also help understand the function of genes:

… prediction tools will determine the function of each protein and the role of the gene that encodes it.

Computational Methods

In this section we'll focus on the computational methods used to aid the prediction of protein folds. There are 2 main types: template-based and de novo.

Template-based methods rely on proteins with known folds, which can be used as templates for determining the fold of a target protein with unknown fold. This only works well if the target protein is similar enough to the “template” one or shares ancestry with it, in which case they are said to be homologous.

De novo methods do not rely on existing proteins as templates. They use known physico-chemical properties to determine the native structure of a protein. We can model the folding as an optimization problem where the cost function is some measure of structure stability (e.g. energy level) and steps are taken to minimize this function subject to the physico-chemical constraints. Molecules that have the same bonds but different structures are known as conformations, and they form the search space of our optimization problem.

Let’s now cover some implementations of de novo methods.

Rosetta

One implementation of a de novo method is the Rosetta method [9]. It starts off with the primary form of the protein. For each segment of length 9, it finds a set of known “folds” for that segment, where a fold is represented by torsion angles on the peptide bonds. The search consists of randomly choosing one of the segments, applying the torsions of one of its known folds to the current solution, and repeating.

Note that while this method does not rely on protein templates, it still assumes an existing database of folds for amino acid fragments. However, by working with smaller chains, the chances of finding a matching template are much higher.
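To give a flavor of the idea, here is a toy sketch in Python of fragment-assembly search. It is a gross simplification: real Rosetta scores full atomic models and uses Monte Carlo acceptance, while here the energy function is a placeholder supplied by the caller and acceptance is greedy:

import random

def fragment_assembly(n, fragments, energy, steps=10000):
    # start from an extended chain: one (phi, psi) torsion pair per residue
    torsions = [(180.0, 180.0)] * n
    for _ in range(steps):
        i = random.randrange(n - 8)                       # pick a 9-residue window
        candidate = list(torsions)
        candidate[i:i + 9] = random.choice(fragments[i])  # splice in a known fold
        if energy(candidate) < energy(torsions):          # keep improvements
            torsions = candidate
    return torsions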

AlphaFold

One major problem with de novo methods is that they're computationally expensive, since they have to analyze many candidate solutions to optimize them, a problem similar to the one chess engines face.

Google's DeepMind made the news with the breakthrough performance of its Go engine, AlphaGo. They used a similar approach to create a solver for protein folding prediction, AlphaFold, which placed first in a competition called CASP.

CASP is a competition held annually to measure the progress of predictors. The idea is to ask competitors to predict the shape of a protein whose native fold is known but not public.

Crowdsourcing

Another approach to the computational costs involved in the prediction is to distribute the work across several machines. One such example is Folding@home, which relies on volunteers lending their idle CPU time to perform chunks of work in a distributed fashion, with results sent back to the server.

An alternative to borrowing idle CPU power from volunteers' machines is to borrow processing power from their brains. Humans are good at solving puzzles, and software like Foldit relies on this premise to find suitable folds. According to Wikipedia:

A 2010 paper in the science journal Nature credited Foldit’s 57,000 players with providing useful results that matched or outperformed algorithmically computed solutions

Conclusion

Here are some questions that I had before writing this post:

  • What is protein folding? Why is predicting it important?
  • What are some of the existing computational methods to help solving this problem?

Questions that are still unanswered are the details of algorithms like the Rosetta method. I also learned about Protein Design in the process, which I'm looking into studying next. The approach used by DeepMind seems very interesting but, as of this writing, they haven't published a paper with more details.

As a final observation, while researching for this post, I ran into this Stack Exchange discussion, in which one of the comments summarizes my impressions:

This field is pretty obscure and difficult to get around in. There aren’t any easy introductions that I know of.

I found that the information available is often vague and sparse. I'd love to find a good textbook reference to learn more details. One possible explanation is that this field is relatively new and there's a lot of proprietary information, since this is a lucrative field.

References

[1] Wikipedia – Amino acid
[2] Wikipedia – Carboxylic acid
[3] The Research History of Peptide
[4] Wikipedia – Homology modeling
[5] Wikipedia – Threading (protein sequence)
[6] Wikipedia – De novo protein structure prediction
[7] Quora: Why is protein structure prediction important?
[8] Wikipedia – Nutritious Rice for the World
[9] The Rosetta method for Protein Structure Prediction
[10] AlphaFold: Using AI for scientific discovery

Content Delivery Network

In this post we’ll explore the basics of a Content Delivery Network (CDN) [1].

Outline

Here are the topics we’ll explore in this post.

  • Introduction
  • Advantages
  • Definitions
  • Caching
  • Routing

Introduction

A Content Delivery Network, in a simplistic definition, is like the cache of the internet. It serves static content to end users from a bunch of machines located closer to those users than data centers are.

Differently from data centers, which can operate relatively independently, the locations of CDNs are subject to the infrastructure of internet providers, which means a lot more resource/space sharing between companies is necessary and there is more regulation.

Companies often use CDNs via third-party providers like Akamai and Cloudflare or, as happens with large internet companies like Google, build their own solution.

Let's see next the reasons why a company would spend money on CDN services or infrastructure.

Advantages

CDNs are closer to the user. Due to smaller scale and partnerships with internet providers, CDN machines are often physically located closer to the user than a data center can be.

Why does this matter? Data has to traverse the distance between the provider and the requester. Being closer to the user means lower latency. It also means fewer hops to go through, and hence less required bandwidth [2].

Redundancy and scalability. Adding more places where data lives increases redundancy and scalability. Requests will not all go to a few places (data centers) but will first hit the more numerous and distributed CDN machines.

DDoS protection. Due to shared infrastructure, CDN companies can shield small companies from DDoS attacks by distributing traffic among their own servers, absorbing the requests from the attack.

Improved performance of TLS/SSL connections. For websites that use secure HTTP connections, it's necessary to perform 2 round trips, one to establish the connection, another to validate the TLS certificate. Since CDNs must connect to the data center to retrieve the static content, they can leverage this to keep that connection alive, so that subsequent connections from clients to data centers can skip the initial round trip [3].

Before we go into more details on how CDNs work, let’s review some terminology and concepts.

Definitions and Terminology

Network edge. The part of the network a request traverses before reaching the data center. Think of the data center as a polygonal area and the edge as its perimeter/boundary.

Internet exchange point (IXP). A physical location containing network switches where Internet Service Providers (ISPs) connect with CDNs. The term seems to be used interchangeably with PoP (Point of Presence); according to this forum, the latter is an outdated term [4]. The map below shows the locations of IXPs in the world:

(Map: locations of IXPs in the world; source: Data Center Map)

Border Gateway Protocol (BGP). BGP is used for routing traffic between different ISPs. To connect a source IP to its destination, the connection might need to rely on networks from a different provider than the original one.

CDN peering. CDN companies rely on each other to provide full world coverage, since the cost of having a single company cover the entire world is prohibitive. This is analogous to connections traversing networks from multiple ISPs.

Origin server. To disambiguate between the CDN servers and the servers where the original data is being served from, we qualify the latter as origin server.

Modes of communication. There are four main modes [5]:

  • Broadcast – one node sends a message to all nodes in the network
  • Multicast – one node sends a message to a subset of the nodes in the network (defined via subscription)
  • Unicast – one node sends a message to a specific node
  • Anycast – one node sends a message to another node, with flexibility as to which node it gets sent to

ISP Tiers. Internet providers can be categorized into 3 tiers based on their networks. According to Wikipedia [6], there's no authoritative source defining which tier a network belongs to, but the tiers have the following definitions:

  • Tier 1: can exchange traffic with other Tier 1 networks without paying for it.
  • Tier 2: can exchange some traffic for free, but pays for at least some portion of it.
  • Tier 3: pays for all its traffic.

Most ISPs are Tier 2 [7].

Caching

One of the primary purposes of a CDN is caching static content for websites [8]. The CDN machines act as a look-through cache: the CDN serves the data if cached; otherwise it requests the data from the origin server behind the scenes, caches it and then returns it, making the actual data fetching transparent to the user. We can see that the CDN acts as a proxy server, an intermediary the client can talk to instead of the origin server.
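A toy sketch of this behavior in Python (a real CDN would also honor expiration times, Vary headers, and so on):

import urllib.request

class LookThroughCache:
    def __init__(self):
        self.store = {}

    def get(self, url):
        if url in self.store:
            return self.store[url]                 # cache hit
        with urllib.request.urlopen(url) as resp:  # miss: go to the origin
            body = resp.read()
            # only keep it if the origin marked it as publicly cacheable
            if "public" in (resp.headers.get("Cache-Control") or ""):
                self.store[url] = body
            return body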

Let's consider the details of the case in which the data is not found in the cache. The CDN will issue an HTTP request to the origin server. In the response, the origin server can indicate in the HTTP header whether the file is cacheable, with the following property:

Cache-Control: public

For example, if we visit https://www.google.com/, we can inspect the network tab and look at the HTTP response header of a random JavaScript file:

(screenshot: the HTTP response headers of the file)

We can see the line cache-control: public. In fact, if we look at the origin of this file, we see it's coming from the domain http://www.gstatic.com, which is Google's own CDN.

Caching and Privacy

An important aspect of caching is making sure privacy is honored. Consider the case of a private photo that is cached by a CDN. We need to make sure that it's highly unlikely that someone without access will be able to see it.

Since CDNs store the static content but not the logic that performs access control, how can we make it privacy-safe? One option is to generate an image path that is very hard to reverse-engineer.

For example, the origin server could hash the image content using a cryptographic hash function and use that as the URL. Then, the corresponding entry in the CDN will have that name.
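For example (a hypothetical scheme, with a made-up CDN domain):

import hashlib

def cdn_url(image_bytes):
    # content-addressed name: infeasible to guess without the content itself
    digest = hashlib.sha256(image_bytes).hexdigest()
    return "https://cdn.example.com/photos/%s.jpg" % digest

print(cdn_url(b"...raw image bytes..."))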

In theory, someone with the URL can see my private photo, but for this to happen I'd need to share the image URL, which is not much different from downloading the photo and sending it to them. In practice there's an additional risk if one uses a public computer and happens to navigate to the raw image URL: the URL will remain in the browser history even after the user logs out. For reasons like these, it's best to use incognito mode in such cases.

To make it extra safe, the server could generate a new hash every so often, so that even if someone gets hold of a URL, it will soon become invalid, minimizing unintended leaks. This is similar to some strategies for 2-factor authentication we discussed previously, where a generated code leaves the system vulnerable only very temporarily.

Routing

Because CDNs sit between the client and the origin server, and due to their positioning, both physical and strategic, they can provide routing as well. According to this article from Imperva [9], this can mean improved performance through access to better infrastructure:

Much focus is given to CDN caching and FEO features, but it’s direct tier 1 network access that often provides the largest performance gains. It can revolutionize your website page load speeds and response times, especially if you’re catering to a global audience.

CDNs often comprise multiple distributed data centers, which they can leverage to distribute load and, as mentioned previously, to protect against DDoS. In this context, we covered Consistent Hashing as a way to distribute load among multiple hosts that are constantly coming in and out of the availability pool.

CDNs can also rely on advanced routing techniques such as Anycast [10], which performs routing at the network layer, to avoid bottlenecks and single points of failure on a given piece of hardware.

Conclusion

I wanted to understand CDNs beyond being “the cache of the internet”. Some key concepts were new to me, including some aspects of routing and the ISP tiers.

While writing this post, I realized I know very little about the practical aspects of the internet: how it is structured, its major players, etc. I'll keep studying these topics further.

References

[1] Cloudflare – What is a CDN?
[2] Cloudflare – CDN Performance
[3] Cloudflare – CDN SSL/TLS | CDN Security
[4] Cloudflare –  What is an Internet Exchange Point
[5] Cloudflare – What is Anycast?
[6] Wikipedia – Tier 1 network
[7] Wikipedia – Tier 2 network
[8] Imperva – CDN Caching
[9] Imperva – Route Optimization
[10] Cloudflare – Load Balancing without Load Balancers