Largest sets of subsequences in OCaml

I’ve noticed that there is a set of words in English that look very similar: tough, though, thought, through, thorough and trough. Except for thought, they have one property in common: they’re all subsequences of thorough. It made me wonder whether there are other interesting sets of words that are subsequences of a common word.

Word cloud made with Jason Davies’ tool (https://www.jasondavies.com/wordcloud/)

This post is an attempt to answer a more general question: given a list of words, what is the largest set of these words such that they’re subsequences of a given word?

A word A is a subsequence of a word B if A can be obtained by removing zero or more characters from B. For example, “ac” is a subsequence of “abc”, and so are “bc” and even “abc” itself, but not “ba” or “aa”.

A simple algorithm to determine whether a word A is a subsequence of another word B is to start with two pointers, pA and pB, at the beginning of each word. We move pB forward, and whenever pA and pB point to the same character we also move pA forward. A is a subsequence of B if and only if we reach the end of A before running past the end of B. We could then iterate over each word W and find all the words that are subsequences of W. If the size of the dictionary is n and the size of the largest word is w, this algorithm runs in O(n^2 w).
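
For concreteness, here is a minimal OCaml sketch of this two-pointer check (the function name is mine, not from the post’s repository):

(* true iff a is a subsequence of b, using the two-pointer scan above *)
let is_subsequence (a : string) (b : string) : bool =
  let la = String.length a and lb = String.length b in
  let rec go pa pb =
    if pa = la then true              (* consumed all of a *)
    else if pb = lb then false        (* ran out of b first *)
    else if a.[pa] = b.[pb] then go (pa + 1) (pb + 1)
    else go pa (pb + 1)
  in
  go 0 0

let () = assert (is_subsequence "tough" "thorough")
let () = assert (not (is_subsequence "thought" "thorough"))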

For English words, we can use entries from /usr/share/dict/words. In this case, n is around 235k (wc -l /usr/share/dict/words), so a quadratic algorithm will take a while to run (around 5e10 operations).

Another approach is to generate all subsequences of a given word W and look each one up in the dictionary. There are O(2^w) subsequences of a word of length w. If we store the dictionary in a hash table, we can do this in O(n w 2^w). In /usr/share/dict/words, the length of the largest word, w, is 24.
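
To illustrate the idea (this helper is mine, not from the original code), generating all subsequences recursively in OCaml could look like the sketch below; each generated string would then be looked up in a hash table of dictionary words:

(* All subsequences of w: for each character, either keep it or drop it.
   Produces O(2^w) strings (possibly with repetitions). *)
let rec subsequences (w : string) : string list =
  if String.length w = 0 then [""]
  else
    let first = String.make 1 w.[0] in
    let rest = subsequences (String.sub w 1 (String.length w - 1)) in
    List.map (fun s -> first ^ s) rest @ rest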

Running a calculation with the numbers (R script), the number of high-level operations is 4e10, about the same order of magnitude as the quadratic algorithm.

Distribution using ggplot2

A third approach is to use a trie. This data structure allows us to store the words from the dictionary in a space-efficient way, and we can search for all subsequences using it. The trie will have at most 2e6 characters (the sum of the characters of all words), and fewer in practice because of shared prefixes. Since any valid subsequence has to correspond to a node in the trie, the cost of searching for a given word cannot be more than the size of the trie t, so the complexity per word is O(\min(2^w, t)). A back-of-the-envelope calculation gives us 2e9, but we’re hoping that the size of the trie will be much smaller than 2e6.

Before implementing the algorithm, let’s define our trie data structure.

The Trie data structure

A trie is a tree where each node has up to |E| children, where |E| is the size of the alphabet under consideration. For this problem we’ll use lowercase ASCII only, so |E| = 26. Each node also has a flag telling whether there’s a word ending at that node.
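
Here is a minimal sketch of what such a structure could look like (the author’s actual definition is on github; I use a plain bool for the end-of-word flag, which the post calls hasEntry):

module CharMap = Map.Make (Char)

(* A node holds the hasEntry flag and its children, keyed by the
   character on the edge leading to each child. *)
type trie = Node of bool * trie CharMap.t

let empty = Node (false, CharMap.empty)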

Notice that in this implementation of the trie, the character lives on the edge, not in the node. The Map structure from the stdlib uses a balanced tree underneath, so get and set operations are O(log |E|).

Insertion is the core method of the structure. At a given node we have a string we want to insert. We look at the first character of the word: if a corresponding edge exists, we keep following down that path; if not, we first create a new node.
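
Continuing the sketch above, a hypothetical insertion could be written as follows (assuming OCaml ≥ 4.05 for CharMap.find_opt):

(* Insert a word, creating nodes along the path as needed and marking
   the final node as the end of a word. *)
let rec insert (word : string) (Node (has_entry, children)) : trie =
  if String.length word = 0 then Node (true, children)
  else
    let c = word.[0] in
    let rest = String.sub word 1 (String.length word - 1) in
    let child =
      match CharMap.find_opt c children with
      | Some node -> node
      | None -> empty
    in
    Node (has_entry, CharMap.add c (insert rest child) children)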

To decide whether a trie contains a given string, we just need to traverse the trie until we either can’t find an edge to follow or reach the final node and find that it doesn’t have the hasEntry flag set to true:
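
In the same sketch, the lookup could be:

(* Follow one edge per character; the word is present only if the final
   node has its hasEntry flag set. *)
let rec has (word : string) (Node (has_entry, children)) : bool =
  if String.length word = 0 then has_entry
  else
    match CharMap.find_opt word.[0] children with
    | None -> false
    | Some child -> has (String.sub word 1 (String.length word - 1)) child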

This and other trie methods are available on github.

The search algorithm

Given a word W, we can search for all its subsequences in a trie with the following recursive algorithm. Given a trie node and a string, we perform two searches: 1) for all the subsequences that contain the first character of the current string, we “consume” that character and follow the corresponding edge; and 2) for all the subsequences that do not contain the first character, we “consume” the character but stay at the current node. In pseudo-code:

Search(t: TrieNode, w: string):
    If a word ends at t (hasEntry is set), it's a subsequence
    of the original word. Save it.

    If w is empty, return.

    Let c be the first character of w.
    Let wrest be w with the first character removed.

    // Pick character c (only if t has a child for c)
    Search(t->child[c], wrest)

    // Do not pick character c
    Search(t, wrest)

The implementation in OCaml follows the same structure.
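
Below is a minimal sketch along those lines, reusing the trie sketch from earlier (the author’s full version is on github; the accumulator-based formulation and names here are mine):

(* Collect the dictionary words stored in the trie that are subsequences
   of w. At each step we either skip the first character of w, or consume
   it by following the matching edge. `prefix` is the word spelled by the
   path so far; duplicates are removed at the end. *)
let search_subsequences (root : trie) (w : string) : string list =
  let rec go (Node (has_entry, children) as node) w prefix acc =
    (* If a word ends at this node, it's a subsequence of the original word. *)
    let acc = if has_entry then prefix :: acc else acc in
    if String.length w = 0 then acc
    else
      let c = w.[0] in
      let rest = String.sub w 1 (String.length w - 1) in
      (* Do not pick character c *)
      let acc = go node rest prefix acc in
      (* Pick character c, if the trie has a matching edge *)
      match CharMap.find_opt c children with
      | None -> acc
      | Some child -> go child rest (prefix ^ String.make 1 c) acc
  in
  List.sort_uniq compare (go root w "" [])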

Experiments

Our experiment consists of loading the words from /usr/share/dict/words into a trie and then, for each word in the dictionary, looking for its subsequences. The full code is available on github.

The code takes 90 seconds to run on my laptop. Not too bad, but I’m still exploring ways to improve the performance. One optimization I tried is, instead of returning an explicit list of strings as in the search implementation, to return them encoded in a trie, since we can save some operations thanks to shared prefixes. I have that version on github, but unfortunately it takes 240 seconds to run and requires more memory.

Another way is to parallelize the code. The search for subsequences is independent for each word, so it’s an embarrassingly parallel case. I haven’t tried this path yet.

The constructed trie has 8e5 nodes, or ~40% of the total character count.

Subsequences of “thorough”

The question that inspired this analysis was finding all the subsequences of thorough. It turns out it has 44 subsequences in the dictionary, but most of them are boring, that is, single letters or short words that look completely unrelated to the original word. The most interesting ones are those that start with t and have at least three letters. I selected some of them here:

  • tho
  • thoo
  • thoro
  • thorough
  • thou
  • though
  • thro
  • throu
  • through
  • thug
  • tog
  • tou
  • toug
  • tough
  • trough
  • trug
  • tug

The word with the most subsequences is pseudolamellibranchiate, with 1,088! The word cloud at the beginning of the post contains the 100 words with the largest number of subsequences. I tried to find interesting words among them, but they’re basically the longest words: long words have exponentially more combinations of subsequences, and hence the chance of those existing in the dictionary is greater. I tried to come up with a penalization for the score:

1) Divide the number of subsequences by the word’s length. This is not enough; the longest words still show up on top.
2) Apply log2 to the number of subsequences and divide by the word’s length. In theory this should account for the exponential number of subsequences of a word, but it turns out to be too much of a penalization and the shortest words fare too well in this scenario.

I plotted the distribution of the number of subsequences by word length. We can see a polynomial curve, but with increasing variance:

Generated with ggplot2

In the chart above, without jitter we’d see all points with the same x-value on a single vertical line. One neat visualization trick is to add noise (jitter) so we also get a sense of density.

If we use a box plot instead, we can see a quadratic pattern more clearly by looking at the median for each length.

Generated with ggplot2

Given this result, I tried a final scoring penalization: dividing the number of subsequences by the square of the word’s length, but it’s still not enough to surface many interesting words. Among the top 25, streamlined is the most common word, and it has 208 subsequences.

One interesting fact is that the lowest scoring words are those with repeated patterns, for example: kivikivi, Mississippi, kinnikinnick, curucucu and deedeed. This is basically because we only count unique subsequences.

Conclusion

This was a fun problem to think about, and even though it didn’t yield very interesting findings, I learned more about OCaml and R. After dealing with bugs and compilation and execution errors, I like OCaml more than before and R less than before.

R has too many ways of doing the same thing and the API is too lenient. That works well for the 80% of the cases which it supports, but finding what went wrong in the other 20% is a pain. OCaml on the other hand is very strict. It doesn’t even let you add an int and a float without an explicit conversion.

I learned an interesting syntax that lets you reuse the qualifier/namespace across several operations when chaining them, for example:
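
I don’t have the original snippet handy, but the feature the post most likely refers to is OCaml’s local open syntax, Module.(expression), which applies the module qualifier to the whole parenthesized expression. A small example with the standard List module:

(* Fully qualified: *)
let total =
  List.fold_left ( + ) 0
    (List.map (fun x -> x * x)
       (List.filter (fun x -> x mod 2 = 0) [1; 2; 3; 4]))

(* With a local open, the qualifier is written once: *)
let total' =
  List.(fold_left ( + ) 0 (map (fun x -> x * x) (filter (fun x -> x mod 2 = 0) [1; 2; 3; 4])))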

I also used the Batteries library for the first time. It has a nice extension for the rather sparse String module. It lets us simply do open Batteries, but that overrides a lot of the standard modules, which can be very confusing. I was scratching my head for a long time trying to figure out why the compiler couldn’t find the union() function in the Map module, even though I seemed to have the right version, until I realized it was being overridden by Batteries. From now on, I’ll only use the specific modules, such as BatString, so it’s easy to tell which method comes from which module.

References

OCaml

  • [1] OCaml Tutorials > Map
  • [2] Strings – Batteries included
  • [3] Using batteries when compiling

R

  • [1] R Tutorial – Histogram
  • [2] Creating plots in R using ggplot2 – part 10: boxplots
  • [3] My R Cheat sheet

OpenVis Conf 2017

I attended OpenVis Conf in Boston. It’s a broad, single-track, two-day data visualization conference with an industry focus. Here are my notes on the talks.

Mike Bostock’s Keynote

Mike Bostock (the famous creator of D3.js) opened the conference by talking mainly about d3.express, a new library he’s working on. Despite the name, it has no dependency on D3 itself; rather, it looks more like Python/R notebooks in which you can write JavaScript expressions in a console-like environment, and they get re-evaluated as you change upstream assignments. It borrows a lot of ideas from reactive programming, in particular Bret Victor’s ideas (this paradigm immediately reminded me of his Ladder of Abstraction).

Another interesting feature of this library is the built-in animation loop functionality. Through a slight modification of ES6 syntax, Bostock uses generators as a simple construct to express animations. The library also includes helpers to easily build common UI input controls, such as a scroller and checkboxes.

d3.express is currently in development, not yet ready for use.

Data Sketch|es: a visualization a month

Shirley Wu and Nadieh Bremer presented the lessons learned during their (ongoing) project called Data Sketch|es, which consists of crafting one data visualization a month. A technique adopted by creative artists, this constraint poses both the challenge of coming up with original ideas and that of getting them done in a predictable amount of time.

Some of their lessons included:

  • Cross-check the data, even if it was obtained from usually reliable sources.
  • Iterate between sketches on paper and actual code prototypes: it’s hard to predict how a visualization will look, especially if it’s interactive.
  • Delight the audience: spend time polishing the work and making use of animation to keep users interested and engaged.

Visualizing data with Deck.gl

Nicolas Belmonte is the head of Data Visualization at Uber. They’ve open sourced Deck.gl, a library on top of WebGL used to display data overlaid on a map. It has good integration with React and Mapbox GL.
They showcased a really awesome application simulating wind patterns using particles based on a wind vector map.

Wind map using particles

What Story Does Your Timeline Tell?

Matthew Brehmer is a researcher at Microsoft, and he explored many unconventional ways to represent temporal data besides the regular line chart. He showed cases with radial axes (daily routines, lifetimes), spirals (cycles), and grids. He and other folks wrote a paper on this topic and they have a page with more information.

GDAL

Robert Simmon explained how to use GDAL, a library for working with geospatial data, mostly through the command line. It was too technical and too domain-specific for me; I didn’t get much from this talk, besides learning that satellite imagery requires a lot of post-processing to look presentable.

How Spatial Polygons Shape our World

Amelia McNamara discussed the representation of quantities on maps, mainly through polygons. These polygons can be arbitrary shapes on the map, but often correspond to districts, counties, and states.

She first presented a few examples, including a choropleth US map, showing how big sparse states are over-represented while densely populated small states are under-represented. Strategies like distorted maps and unit-area maps (like NPR’s hexmap, which we posted about) can be used, with the downside of losing familiarity or introducing incorrect adjacencies between states.

She discussed scenarios in which data from different aggregation levels must be combined, say, one data has county level aggregation, the other has state level aggregation.

Up-scaling is easy: for county-to-state aggregation, it might be a matter of just summing values. Down-scaling requires some external reference; for example, if we can assume the count for a state is a roughly uniform function of the population, we can break it down by county based on relative population. Side-scaling is harder, especially when neither polygon is contained in the other (containment is what made up- and down-scaling simpler).

Introducing REGL

Mikola Lysenko is the author of the REGL library (an evolution of Stack.gl), which provides a functional (stateless) API on top of WebGL, much in line with the paradigm adopted by d3.express from Bostock’s talk. He then performed a quick live demo demonstrating its expressiveness (and his proficiency) by displaying a 3D scan dataset in the browser. This was by far the most visually appealing talk, with 3D bouncing rainbow bunnies.

Untangling the hairball

John Gomez started with a problem: how to display network data in a meaningful and consumable way. He explored different strategies like limiting the number of nodes shown, clustering, and only showing edges on hover. It was interesting to learn the steps of crafting a visualization from scratch. He used mostly D3 and developed some modules of his own. Very funny talk.

Designing Visualization Tools for Learners

Catherine D’Ignazio and Rahul Bhargava are interested in tools for people who do not have an analytical background or are not data savvy. Developing tools with this mindset automatically makes them more accessible for everyone. They presented 4 traits that a good tool must have: focused, guided, inviting and expandable.

A focused tool must do one thing and do it very well (UNIX philosophy). Tableau is an example of a tool that is not focused. It’s not necessarily bad, it’s just overwhelming for a learner. They cited Timeline.js as a focused tool.

A guided tool provides clear affordances about what steps should be taken next. They showcased a tool that starts with a blank canvas to which you can add components, and suggested improving it by starting with a sample component/widget instead of a blank page.

An inviting tool is familiar. It contains examples that users can relate to their area/context (journalism, arts, etc.). It provides a safe playground for users to try things out, where they can make mistakes and easily undo them. An example of an uninviting tool is Excel’s pivot tables.

An expandable tool allows the learner to take a step further towards mastery. It’s not a rigid, static tool. It does not shy away from technical terms (avoiding them could be incorrect or misleading), but it provides signifiers such as tooltips. An example of a non-expandable tool is a (static) infographic.

Visualizing Incarceration in the US on Polygraph

Matt Daniels presented a set of rich and interactive visualizations representing incarceration data in the US. The project started by looking at a line chart of the growth of the incarcerated population in the US, where a sharp increase in the numbers during the 2000s led to an astonishing 1% of the American male population being behind bars these days.

He attempted to gain insight into what measures could be taken to reduce incarceration, breaking the data down by cause, gender and other dimensions. He described the pain of obtaining accurate and detailed data, having to settle for Florida’s data, a state with an unusually transparent criminal justice system.

It felt like much of the work was yet to be done, so the talk didn’t reach its full potential.

Amanda Cox’s Keynote

Amanda Cox discussed uncertainty in charts. It’s a challenging component because probability is not intuitive and a lot of people cannot interpret it correctly.

I missed the first half of this talk, so these notes probably don’t capture the full context.

Uncertainty represented as moving gauges

A data point walks into a bar: designing data for empathy

Lisa Rost discussed the importance of including an emotional component in the design of a visualization. Rationality is necessary for an informed decision, but emotion is also needed to convince.

She provided 5 strategies that can be used to trigger more emotional response from a user:

  • Use colors – explore intensity, an example was the use of bright red to evoke the image of blood in a violent deaths chart
  • Show what you’re talking about – instead of plain bar charts, how about a stack of icons representing people if we’re talking about mortality?
  • Show what data would mean for the user – create a visualization that will put the viewer as the protagonist. Examples include wording (using “you”), or analogies to more common ground (e.g. for an event that occurred in Syria, use an analogy to what it would look like if it was in the US)
  • Zoom into one datapoint – people relate more to individuals than to groups. Focus on one person’s story.
  • Show masses as individuals – masses are impersonal. Instead of communicating totals, use averages to make it more personal. For example: “X people died in the event”, can be rephrased as “1 person died every Y minutes”.

D3 with Canvas

Kai Chang demonstrated how to render D3 charts using Canvas. It was very code-oriented, hence hard to follow, but it was good to learn what can be done. I don’t recall him using any extra library for that.
One interesting trick he shared is how to implement onClick functionality for a polygon rendered with canvas. For SVG it’s easy, but a canvas is a bitmap, so we’d need some sort of point-inside-polygon algorithm. He suggested rendering another polygon underneath with the same shape but with a custom color. Canvas has an API to retrieve that color, so we can tell which polygon was clicked based on the color associated with it.

Pulling a Polygon out of a hat

Noah Veltman gave a very interesting talk about animated transitions between polygons. He started with a simple mapping from a square to a square, then from a triangle to a square, which requires introducing auxiliary points to the triangle so that the number of vertices of the two polygons matches. He generalized this to arbitrary (including concave) shapes and showed how much more natural his algorithm looks compared to a naive approach.
He finished with a yet-to-be-explored question: how to map between polygons with different topological properties (e.g. a polygon with a hole to a polygon without one). Very entertaining and educational.

Text Mining and Visualization, the tidy way

Julia Silge explained the basics of text mining, using unsupervised methods to obtain interesting insights from large bodies of text. She presented a demo using R and ggplot2. The main visualization type used was the histogram.

Why does Data Vis need a style guide?

Amy Cesal discussed the advantages of a consistent style for visualizations. There must be consistency across an organization, not just within a single piece of software (especially because multiple tools are often used): color, text (terminology, writing style, tone) and also layout.

Vega-lite: A grammar of interactive graphics

Kanit Wongsuphasawat, Dominik Moritz and Arvind Satyanarayan are students at UW, and have developed a simpler version of Vega, called Vega-Lite. It’s basically a system which translates a JSON configuration into Vega syntax. It relies on convention over configuration to make it very simple to create basic charts but also allows for more powerful visualizations.

A very interesting concept is the multi-chart configuration. Through operators it’s possible to “explode” a single chart into multiple ones (e.g. break down a line chart with multiple lines into multiple charts, each with its own line), and it allows combining different chart types.
The most interesting idea, in my opinion, is being able to use nested configurations to create dashboards. The binary tree hierarchy can be used to define how the widgets are laid out (through horizontal and vertical partitioning). This hierarchy can also be used to define controllers that affect a subtree. For example, if we define a scroller at a given node, it affects all the widgets within that subtree.
Voyager is a UI built on top of Vega and looks very similar to Tableau.

One of the examples shown was a Vega-Lite specification for a bar chart of the average temperature in Seattle for each month.

Data as a creative constraint

Eric Socolofsky has done some work for the Exploratorium and talked about computer art and how it can be applied to data visualization. He mentioned some components/types of digital art:

  • Generative art – computer-generated imagery
  • Randomness – for example, to represent uncertainty, as mentioned in Amanda Cox’s talk
  • Particle systems – for example, the Wind Map presented by Nicolas Belmonte
  • Recursion and repetition – fractals
  • Motion – animation
  • Color

Empowering effective visualization (color) design

Connor Gramazio proposed an algorithmically generated palette based on the CIELab color space and defined two metrics: discriminability (being able to tell different colors apart) and preferability (a subjective measure of how much a user likes the colors in the palette). He performed user studies comparing this palette to others such as ColorBrewer’s and Microsoft’s.

Overall it was very academic and technical and I got lost on the details.

Hacking your health with JavaScript

Alan McLean talked about his work at health-tracking companies and on analyzing his own health. The tone of the presentation was a bit dark, but it did raise awareness (along the lines of Rost’s empathy talk) of how visualizations can be impersonal and cold, especially when the data displayed is about the users themselves.

The role of visualization in exploratory data analysis

This was basically a quick R tutorial focusing on packages like dplyr and ggplot2. Hadley Wickham performed a live demo visualizing data about his github commits and his trips.

Conclusion

Overall the conference was very interesting, with a diverse set of topics. I like that it’s industry-driven, that is, not too academic. A lot of the talks were about ad hoc visualizations, but since I work on developing general-purpose UI data tools, “Designing Visualization Tools for Learners”, “Why does Data Vis need a style guide?” and “Vega-lite: A grammar of interactive graphics” were the most applicable to my work.

Uber has an interesting setup. One of the presenters, Nicolas Belmonte, is the head of Data Visualization there. I talked to one of his team members during lunch: they’re a horizontal team of around 25 people who embed themselves in different teams to work on specific visualization projects.

Prior to the conference I took a few days to explore the city, including nearby Cambridge. I toured around Harvard and MIT, walked the Freedom Trail, and ate a lot of seafood, but the highlight of the trip was the Boston Museum of Fine Arts.

1. Asian Art from the Museum of Fine Arts in Boston; 2. Boston Harbor
3. MIT Ray and Maria Stata Center; 4. Room at the Museum of Fine Arts.

US as an hexagonal map

In this post we’ll study a way to visualize maps on a hexagonal grid, in which each entity has uniform area. We’ll then model the layout as a mathematical problem.

One challenge in displaying data on maps is that countries or states with larger areas tend to get more attention than smaller ones, even when, economically or population-wise, the smaller state is more relevant (e.g. New Jersey vs. Alaska). One idea is to normalize the areas of all the states by using symbols such as squares. Recently I ran into an NPR map that used hexagons and it looked very neat, so I decided to try building it in D3 and perform some analysis.

Below is the result of plotting the state populations (log scale):

US Hexmap: Population (log scale)

One important property of visualizing data in maps is familiarity of the location (you can easily find specific states because you remember where they are) and also adjacency patterns can provide insights. For example, if we plot a measure as a choropleth map and see that the West coast is colored differently from the Midwest, then we gain an insight we wouldn’t have by looking at a column chart for example.

Because of this, ideally a homogeneous-area map should preserve adjacencies as much as possible. With that in mind, we can come up with a similarity score. Let X be the set of pairs of states that share a border in the actual US map. Now, let Y be the set of pairs of states that share a border in the hexagonal map (that is, two hexagons sharing a side). The similarity score is the size of their symmetric difference, normalized by the size of the original:

(|X - Y| + |Y - X|) / |X|

The lower the score the better. In an ideal case, the borders sets would match perfectly for a score of 0.

The size of the symmetric difference between the two sets seems like a good measure of similarity, but I’m not sure about the normalization factor. I initially picked the size of the union of X and Y, but that wouldn’t let us model the problem as a linear program, as we’ll see next. The problem with using the size of X is that the score could theoretically be larger than 1, but it’s trivial to place the hexagons in the grid so that none of them touch, making Y empty, so we can assume the optimal score is between 0 and 1.
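
As an aside (not from the original post), computing this score from two border sets is straightforward. A sketch in OCaml, representing each border as an order-normalized pair of state names:

module EdgeSet = Set.Make (struct
  type t = string * string
  let compare = compare
end)

(* Normalize the pair so (a, b) and (b, a) denote the same border. *)
let border a b = if a < b then (a, b) else (b, a)

(* (|X - Y| + |Y - X|) / |X|; lower is better, 0 is a perfect match. *)
let similarity_score (x : EdgeSet.t) (y : EdgeSet.t) : float =
  let diff a b = EdgeSet.cardinal (EdgeSet.diff a b) in
  float_of_int (diff x y + diff y x) /. float_of_int (EdgeSet.cardinal x)

let () =
  let x = EdgeSet.of_list [border "CA" "NV"; border "CA" "OR"] in
  let y = EdgeSet.of_list [border "CA" "NV"] in
  Printf.printf "%.2f\n" (similarity_score x y)  (* prints 0.50 *)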

Hexgrid coordinates convention

The score for the NPR map is 0.67.

An Optimization Problem

Let’s generalize and formalize the problem above as follows: given a graph G = (V,E) and another graph H = (V_H, E_H) representing our grid, find the induced subgraph of H, I = (V_I, E_I), such that there’s a bijection f: V \rightarrow V_I and the size of the symmetric difference of f(E) and E_I is minimized (f(E) is an abuse of notation meaning the bijection applied to each vertex in the set of edges E).

To make it clearer, let’s apply the definition above to the original problem. G represents the adjacency of states in the original map: V is the set of states and E is the set of pairs of states that share a border. H is the hexagonal grid: V_H is the set of all hexagons and E_H is the set of pairs of adjacent hexagons. We want to find a subset of the hexagons where we can place each of the states (hence the bijection from states to hexagons), and if two states are placed on adjacent hexagons, we consider the states adjacent, hence the need for an induced subgraph, so that adjacency in the grid is preserved.

Is the general case NP-hard? We can reduce the Graph Isomorphism problem to it. That problem consists of deciding whether two graphs A and B are isomorphic. If we set G = A and H = B, then A and B are isomorphic if and only if I = H and the symmetric difference of f(E) and E_I is 0. The catch is that it’s not known whether Graph Isomorphism is NP-complete.

What if G is planar (which is the case for maps)? I haven’t given it much thought, but I decided to come up with an integer programming model nevertheless.

An Integer Linear Programming Model

Note: the model uses the original grid analogy instead of the more general problem so that the constraints are easier to understand.

Boolean algebra as linear constraints

Before we start, we need to recall how to model logical constraints (AND, OR and EQUAL) using linear constraints. Let a and b be binary variables. We want x to be the result of logical operators applied to a and b.

For AND, we can do (x = 1 if and only if a = 1 and b = 1)

x \le a
x \le b
x \ge a + b - 1

For OR, we can do (x = 0 if and only if a = 0 and b = 0)

x \ge a
x \ge b
x \le a + b

For EQUAL, we can do (x = 1 if and only if a = b)

x \le 1 - (a - b)
x \le 1 - (b - a)
x \ge a + b - 1
x \ge -(a + b - 1)

We can introduce a notation and assume these constraints can be generated by a function. For example, if we write
x = \mbox{EQ}(a, b), we mean the four constraints defined above for modeling EQUAL. This is discussed in [2].

Constants

Let G be the set of pairs (x,y) representing the grid positions. Let P be the set of pieces p that have to be placed in the grid. Let N(x,y) be the set of pairs (x',y') that are adjacent to (x, y) in the grid.

Let A_{v1, v2} represent whether v1 and v2 are adjacent to each other in the dataset.

“Physical” constraints

Let b_{x,y,s} be a binary variable that equals 1 if and only if state s is placed at position (x, y).

1) A piece has to be placed in exactly one spot in the grid:

\sum_{(x,y) \in G} b_{x,y,p} = 1 for all p \in P

2) A spot can only be occupied by at most one state:

\sum_s b_{x,y,s} \le 1 for all (x,y) \in G

Adjacency constraints

Let a_{p1, p2, x, y} be a binary variable that equals 1 if and only if piece p1 is placed at (x, y) and is adjacent to p2 in the grid.

3) a_{p1, p2, x, y} has to be 0 if p1 is not at (x,y) or p2 is not placed at any of (x,y)'s neighbors:

a_{p1, p2, x, y} = \mbox{AND}(\sum_{(x', y') \in N(x, y)} b_{x', y', p2}, b_{x,y,p1})

With this, a_{p1, p2, x, y} is 1 if and only if p1 is at (x,y) and p2 is placed at one of (x,y)'s neighbors.

Finally, we can model the adjacency between two pieces p1 and p2. Let a_{p1, p2} be a binary variable and equals 1 if and only if p1 and p2 are adjacent in the grid:

a_{p1, p2} = \sum_{(x,y) \in G} a_{p1, p2, x, y}

Symmetric difference constraints

Let y_{p1, p2} be a binary variable that equals 1 if and only if a_{p1, p2} \ne A_{p1, p2}.

4) y_{p1, p2} \ge a_{p1, p2} - A_{p1, p2}
5) y_{p1, p2} \ge A_{p1, p2} - a_{p1, p2}

Objective function

The sum of all y‘s is the size of the symmetric difference:

\min \sum_{p1, p2 \in P} y_{p1, p2}.

Practical concerns

This model can be quite big. For our US map example, we have |P| = 50 and we need to estimate the size of the grid. A 50×50 grid is enough for any type of arrangement. The problem is that the number of variables a_{p1, p2, x, y} is |P|^2|G| = 50^4 which is not practical.

We can also solve the problem for each connected component of the original graph independently; it’s trivial to construct the optimal solution from the optimal sub-solutions. This doesn’t help much in our example, since only Hawaii and Alaska are disconnected, leaving |P| = 48. The grid can also be reduced: it’s very unlikely that an optimal solution would be a straight line. In the NPR map the grid is 8×12, and doubling these dimensions should give the optimal solution enough room, so |G| = 8*12*4 = 384.

We can also assume the states are ordered and only have variables a_{p1, p2, x, y} for p1 < p2, so the number of a_{p1, p2, x, y} variables is about 450K. Still too large, unfortunately.

Another important optimization we can do in our case because we're working with a grid is to define the adjacency for x and y independently and combine them afterwards.

Refined adjacency constraints

Instead of working with b_{x,y,s}, we use X_{x,s}, which equals 1 if and only if state s is placed at position (x, y) for some y, and Y_{y,s}, which equals 1 iff state s is placed at position (x, y) for some x. The physical constraints are analogous to the previous model:

6) A piece has to be placed in exactly one spot in the grid:

\sum_{x \in G} X_{x,p} = 1 for all p \in P
\sum_{y \in G} Y_{y,p} = 1 for all p \in P

7) A spot can only be occupied by at most one state:

\sum_s X_{x,s} \le 1 for all x \in G
\sum_s Y_{y,s} \le 1 for all y \in G

In a hexagonal grid, if we have the piece p1 in position (x,y), it will be adjacent to another piece p2 if and only if p2 is in one of these six positions: 1: (x-1, y), 2: (x+1, y), 3: (x-1, y-1), 4: (x, y-1), 5: (x-1, y+1) or 6: (x, y+1). We can define two adjacency categories: Type I, which happens when p1.y - p2.y = 0 and |p1.x - p2.x| = 1 (cases 1 and 2); and Type II, which is when |p1.y - p2.y| = 1 and p1.x - p2.x \le 0 (cases 3, 4, 5 and 6).

Let’s define Y_{d=0, p1, p2, y} = 1 iff p1.y - p2.y = 0 for a given y. Similarly we define X_{|d|=1, p1, p2, x} = 1 iff |p1.x - p2.x| = 1, Y_{|d|=1, p1, p2, y} = 1 iff |p1.y - p2.y| = 1 and finally X_{d \ge 0, p1, p2, x} = 1 iff p1.x - p2.x \ge 0.

8) We can use the following constraints to model the variables we just defined:

Y_{d=0, p1, p2, y} = \mbox{EQ}(Y_{y, p_1}, Y_{y, p2})
X_{|d|=1, p1, p2, x} = \mbox{EQ}(X_{x, p1}, X_{x-1, p2} + X_{x+1, p2})
Y_{|d|=1, p1, p2, y} = \mbox{EQ}(Y_{y, p1}, Y_{y-1, p2} + Y_{y+1, p2})
X_{d \ge 0, p1, p2, x} = \mbox{EQ}(X_{x, p1}, X_{x, p2} + X_{x+1, p2})

9) Let Y_{d=0, p1, p2} = 1 iff p1.y - p2.y = 0 for some y. We can define analogous variables for the other cases:

Y_{d=0, p1, p2} = \sum_{y} Y_{d=0, p1, p2, y}
X_{|d|=1, p1, p2} = \sum_{x} X_{|d|=1, p1, p2, x}
Y_{|d|=1, p1, p2} = \sum_{y} Y_{|d|=1, p1, p2, y}
X_{d \ge 0, p1, p2} = \sum_{x} X_{d \ge 0, p1, p2, x}

10) Let T'_{p1, p2} = 1 iff p1 and p2 have the Type I adjacency and T''_{p1, p2} = 1 iff p1 and p2 have Type II adjacency:

T'_{p1, p2} = \mbox{AND}(Y_{d=0, p1, p2}, X_{|d|=1, p1, p2})
T''_{p1, p2} = \mbox{AND}(Y_{|d|=1, p1, p2}, X_{d \ge 0, p1, p2})

11) Finally, we say that p1 and p2 are adjacent iff either a Type I or a Type II adjacency occurs:

a_{p1, p2} = \mbox{OR}(T'_{p1, p2}, T''_{p1, p2})

The model for adjacency became much more complicated, but we were able to reduce the number of adjacency variables to roughly O(|P|^2 \sqrt{|G|}). The number of non-zero entries in the constraint matrix (which represents the size of the sparse problem) is roughly 11M, dominated by the type (8) constraints. I’m still not confident this model will run, so I’ll punt on implementing it for now.

Conclusion

In this post we explored a different way to visualize the US states map. I was mainly exploring the idea of how good of an approximation this layout is and a natural question was how to model this as an optimization problem. Turns out that if we model it using graphs, the problem definition is pretty simple and happens to be a more general version of the Graph Isomorphism problem.

I struggled with coming up with an integer programming model and couldn’t find one with a manageable size, but it was a fun exercise!

World Map?

One cannot help wondering whether we can display the countries of the world on a hexagonal map. I’m planning to explore this idea in a future post. The main challenge is that the US states’ areas are more uniform than the countries’: the largest state (Alaska) is about 430 times larger than the smallest (Rhode Island), while the largest country (Russia) is almost 40,000,000 times bigger than the smallest (Vatican City).

Also, the layout of the US map was devised manually by someone at NPR. I’m wondering if we can come up with a simple heuristic to place the hexagons and then perform manual adjustments.

References

[1] NPR Visuals Team – Let’s Tesselate: Hexagons For Tile Grid Maps
[2] Computer Science: Express boolean logic operations in zero-one integer linear programming (ILP)
[3] SOM – Creating hexagonal heatmaps with D3.js
[4] Github – d3/d3-hexbin

Data sources

[5] US State Borders
[6] Wikipedia – Population of US states and territories
[7] Tool to download Wikipedia tables as CSV
[8] List of US states and territories by area

Introduction to d3.js


Mike Bostock (@mbostock) is a computer scientist currently working for the NYTimes. He was a PhD student at Stanford when, together with professor Jeff Heer and Vadim Ogievetsky, he created D3.js, a framework for creating data visualizations in JavaScript.

D3 stands for Data Driven Documents (DDD) and it simplifies the process of building visualizations on top of data, by handling most of the math and boilerplate necessary to generate visual elements.

In this post, we’ll discuss the basic aspects of D3.js, heavily based on Scott Murray’s excellent book, Interactive Data Visualization for the Web. The book is a relatively short and fun read, compiled from a series of tutorials Scott wrote in the past.

Scott’s book is written for non-programmers, and part of it also introduces web development technologies such as HTML, CSS and, of course, JavaScript. In this post we assume our readers already know them.

Setup

D3.js is an open-source library available on github. We can clone that repository and use it in our code. The source code is spread across multiple files in d3/src/, but they are compiled into d3/d3.js using a Node.js module called smash, also by @mbostock.

The basic way to embed D3 in a web page is to follow this template:

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <title>D3 Page Template</title>
        <script type="text/javascript" src="d3/d3.js"></script>
    </head>
    <body>
        <script type="text/javascript" src="my_d3_example.js"></script>
    </body>
</html>

Note the UTF-8 meta tag. It’s important because the D3 source file uses Unicode characters (like the Greek letter π). We’ll do all the work in a separate JavaScript file, say my_d3_example.js.

We can do all the testing on a local server or use jsfiddle.

DOM Manipulation

D3 handles DOM manipulation very neatly. For example, one of the first things we’ll do when writing a D3 code is to select the body of our html page:

d3.select("body");

From there we can perform other DOM operations like adding other DOM elements,

d3
  .select("body")
  .append("p")
  .text("Hello World");

The selection uses CSS3 selectors syntax, so we can also select elements by class ".myClassName" or id "#myIDName".

Multi-selection. One key component of D3 expressiveness is batching operations. This saves us from writing for loops and it makes the code more concise. Say we have an HTML body like:

<body>
  <p>Paragraph 1</p>
  <p>Paragraph 2</p>
  <p>Paragraph 3</p>
  <p>Paragraph 4</p>
</body>

We can access all paragraphs by multi-selecting all <p> tags within the body:

d3
  .select("body")
  .selectAll("p")
  .text("Hello World");

This will set the contents of every paragraph to “Hello World”. In most cases we’ll want to pass a callback instead of a constant string, to handle each entry differently. For example, we could do:

var counter = 0;
d3
  .select("body")
  .selectAll("p")
  .text(function() {
      counter += 1;
      return "New paragraph: "+counter;
  });

Observe how it relies a lot on function chaining. For this to work, it depends on the compatibility between the return type and the next method call, so it can be fragile. The API is very well crafted, though, and it usually behaves as we’d expect. Moreover, it makes the code much more legible, removing boilerplate keywords and intermediate variables.

Binding data

One of the most important operations in D3 is binding data to DOM elements. This is done via the data() method. For example, we could do:

var dataset = [3, 5, 8, 13];
d3.select("body")
    .selectAll("p")
    .data(dataset)
    .enter()
    .append("p")
    .text(function(value) { return value; });

First, we select all existing p elements, then we bind the data. The method enter() contains the rows from the data that are not in the current selection. More specifically, say selectAll() returned 2 existing p elements, that is, an array with indexes 0 and 1. Our data is an array of 4 elements, with indexes 0 to 3.

D3 will assume that indexes 0 and 1 are already there, so it won’t bind the values 3 and 5. The values 8 and 13 remain “unbound”, so it will append one p element for each of them and set its text.

It’s possible to specify the keys of the entries in data, but the default key is the array index. So let’s create keys for each of our entries:

  .data(dataset, function(value){ return "my_key"+value; })

Now we can verify all 4 paragraphs are rendered in addition to the 2 existing ones. For more details, see [2].

SVG and Attributes

Let’s create a simple random column chart. We’ll use SVG elements to render the columns. First, we create an SVG element and set its dimensions using the attr() method:

var width  = 800;
var height = 300;
var svg = d3.select("body").append("svg");
svg
    .attr("width", width)
    .attr("height", height);

Then, we can generate one rect element per entry of our data. In the code below, note how we set attributes in batch, by calling attr() with an object of attribute names and values.

var barWidth = width/dataset.length;
var padding = barWidth/10.0;

svg
    .selectAll("rect")
    .data(dataset)
    .enter()
    .append("rect")
    .attr({
        'x':    function(d, i) {return i*barWidth;},
        'y':    function(d)    {return height - d;},
        width:  barWidth - padding,
        height: function(d)    {return d;},
    })
    .text(function(d) {return d;});

Running the above with 25 random points renders a simple column chart:

Figure 1: Simple column chart

Scales and Axes

A scale is essentially a function: it maps a set of input values to output values. For example, say our data had X values ranging from 40 to 100, but our chart was 1200px wide, and we wanted to map the range [40, 100] onto [0, 1200]. The most natural way to map one continuous interval onto another is through a linear transformation. We could write such a function ourselves, but D3 makes it very easy to set up this mapping:

var scale = d3
    .scale
    .linear()
    .domain([40, 100])
    .range([0, 1200]);

In this syntax, domain is the input and range is the output.

Scales are important for axes, because axes are essentially visual representations of scales. Creating an axis from a scale is simple:

var axis = d3.svg.axis()
    .scale(scale)
    .orient("bottom"); 

// Append the axis element as an independent element on the svg
svg.append("g")
    .call(axis);

This will place an x-axis at the top of the chart. Since the x-axis is conventionally positioned at the bottom of the chart, we need to perform a y-translation of height units.

svg.append("g")
    .attr("transform", "translate(0," + (height) + ")")
    .call(axis);

This will cause the axis to not be shown because it got displaced beyond the SVG element limits. One way around that is to account for an extra height when defining the SVG height:

var width = 800;
var height = 300;
var axisHeight = 20;

var svg = d3.select("body").append("svg");
svg
    .attr("width", width)
    .attr("height", height + axisHeight);

Figure 2: Column chart with axis

Interactiveness

Another important aspect of data visualization is interactivity.

In D3 we can set event listeners on SVG elements through the on() method. It takes an event name (examples include “click”, “mouseover” and “mouseout”) and a callback. A simple example is setting an event listener on the rectangles of our column chart. Let’s color them orange on hover:

svg
    .selectAll("rect")
    .data(dataset)
    .enter()
    .append("rect")
    .attr({
       'x':    function(d, i) {return i*barWidth;},
       'y':    function(d)    {return height - d;},
       width:  barWidth - padding,
       height: function(d)    {return d;},
    })
    .on("mouseover", function (d) { d3.select(this).attr("fill", "orange") })
    .on("mouseout",  function (d) { d3.select(this).attr("fill", "black") })
    .text(function(d) {return d;});

One observation here is that this within the callback function passed to the on method, is bound to the SVG element on which we’re setting up the listener.

The result can be seen on this jsfiddle.

Layouts

Constructing a column/bar chart is relatively straightforward using regular SVG rectangles and the D3 axis helper functions. On the other hand, chart types like pie charts involve working with radians and more complicated math.

To help with this, D3 uses the concept of layouts. One of the layouts is the pie layout:

var pie = d3.layout.pie();

It is basically a function that transforms our regular data into a format suitable for rendering SVG arcs, which will represent the slices of our pie chart.

The code below binds a generic group element to each element of our dataset. It also translates our pie chart, because all values are calculated taking the center of the circle as the origin (0, 0).

var arcs = svg.selectAll("g.arc")
        .data(pie(dataset))
        .enter()
        .append("g")
        .attr("class", "arc")
        .attr("transform", "translate(" + outerRadius + ", " + outerRadius + ")");

Now we can append the actual wedge (represented by the SVG arc element), which can be easily created with the d3.svg.arc() function.

We can use the d3.scale.category10() for generating a set of up to 10 distinct colors for each slice.

var arc = 
  d3.svg.arc()
    .innerRadius(innerRadius)
    .outerRadius(outerRadius);

var color = d3.scale.category10();
arcs.append("path")
    .attr("fill", function(d, i) {
        return color(i);
    })
    .attr("stroke", "white")
    .attr("d", arc);

Some magic seems to be going on here: nowhere do we set the start and end angles of our slices. I had to dig into the source code to realize that arc doesn’t represent the actual arc, but an arc generator. When we set the attribute d, we’re actually calling the function arc(), and the data is passed to it. The startAngle and endAngle properties are set by the pie layout.

Doing some other tweaks like adding the labels leads to the following pie chart:

Figure 3: Pie chart (jsfiddle)

Geo-mapping

GeoJSON is a JSON-based format for describing geographic features. For example, for a US map, each state has its own entry in the JSON, defining a set of coordinates that, when projected, become a polygon tracing the boundary of the state.

This GeoJSON file is usually big, so it makes sense to load it separately. We can start by doing:

var MapsExample = {
    run: function() {
        d3.json(
            "data/us-states.json",
            this.handleGeoJSONLoaded.bind(this)
        );
    },

This file has no actual data, so we need to join it with some other file, for example a CSV file containing state names and some metric, like agricultural productivity (as in Chapter 12 of [1]). So after the map info is loaded, we also load the actual data:

var MapsExample = {
    ...
    handleGeoJSONLoaded: function(json) {
        this._geoJSON = json;
        d3.csv(
            "data/us-ag-productivity-2004.csv",
            this.handleUSDataLoaded.bind(this)
        );
    },

    handleUSDataLoaded: function(data) {
        this.joinWithData(data);
        this.render();
    },

And before rendering we merge the data into the geoJSON:

var MapsExample = {
    ...
    joinWithData: function(data) {
        // Index geo objects by state name
        geoByState = {};
        this._geoJSON.features.forEach(function(feature) {
            var jsonState = feature.properties.name;
            geoByState[jsonState] = feature;
        });
        // Add the value attribute to the geoJSON
        data.forEach(function(row) {
            var dataState = row.state;
            geoByState[dataState].properties.value = parseFloat(row.value);
        });
    },

Now we’re ready to generate the SVG elements:

var MapsExample = {
    ...
    render: function() {
        var path = d3.geo.path()
            .projection(projection);
        svg.selectAll("path")
            .data(this._geoJSON.features)
            .enter()
            .append("path")
            .attr("d", path)
            .style("fill", function(d) {
                var value = d.properties.value;
                return (value) ? this.color(value) : '#ccc';
            }.bind(this));

The only missing piece here is the color scale, which maps values from the data onto a discrete set of colors:

this.color = d3.scale.quantize()
  .range([
    "rgb(237,248,233)",
    "rgb(186,228,179)",
    "rgb(116,196,118)",
    "rgb(49,163,84)",
    "rgb(0,109,44)"
  ])
  .domain([
    d3.min(data, function(d) { return d.value; }),
    d3.max(data, function(d) { return d.value; })
]);

The complete code with additional data added as circles can be seen on github.

Figure 4: Choropleth + Symbol Maps

Conclusion

D3.js is a very neat library and fun to work with. I learned a lot about D3 and SVG while writing this post, and also became aware of the effort to standardize computational cartography (GeoJSON). I’m super excited to try more examples, build stuff on my own and possibly contribute to the project.

My research in grad school was related to proportional symbol maps, and I was surprised that one of the book’s examples consists of actually constructing a proportional symbol map with circles.

References

[1] Interactive Data Visualization for the Web – Scott Murray
[2] Knowledge Stockpile – Understanding selectAll, data, enter, append sequence in D3.js