US as an hexagonal map

In this post we’ll study a way to visualize maps in a hexagonal grid, in which each entity have uniform area. We’ll then model that as a mathematical problem.

One challenge in displaying data in maps is that larger area countries or states tend to get more attention than smaller ones, even when economically or population-wise, the smaller state is more relevant (e.g. New Jersey vs. Alaska). One idea is to normalize the areas of all the states by using symbols such as squares. Recently I ran into a NPR map that used hexagons and it looked very neat, so I decided to try building it in D3 and perform some analysis.

Below is the result of plotting the state populations (log scale):

US Hexmap: Population (log scale)

US Hexmap: Population (log scale)

One important property of visualizing data in maps is familiarity of the location (you can easily find specific states because you remember where they are) and also adjacency patterns can provide insights. For example, if we plot a measure as a choropleth map and see that the West coast is colored differently from the Midwest, then we gain an insight we wouldn’t have by looking at a column chart for example.

Because of this, ideally the homogeneous area maps should preserve adjacencies as much as possible. With that in mind, we can come up with a similarity score. Let X be the set of pairs of states that share a border in the actual US map. Now, let Y be the set of pairs of states that share a border in the hexagonal map (that is, two hexagons sharing a side). The similarity score is the size of their symmetric difference and we can normalize by the size of the original:

(|X - Y| + |Y - X|) / |X|

The lower the score the better. In an ideal case, the borders sets would match perfectly for a score of 0.

The size of the symmetric difference between the two sets seems like a good measure for similarity, but I’m not sure about the normalization factor. I initially picked the size of the union of X and Y, but this wouldn’t let us model this problem as a linear programming model as we’ll see next. The problem with using the size of X is that the score could theoretically be larger than 1, but it’s trivial to place the hexagons in the grid in such a way that none of them are touching and thus Y is empty, so we can assume the score is between 0 and 1.

Hexgrid coordinates convention

Hexgrid coordinates convention

The score from the NPR maps is 0.67.

An Optimization Problem

Let’s generalize and formalize the problem above as follows: given a graph G = (V,E), and another graph H = (V_H, E_H) representing our grid, find the induced subgraph of H, I = (V_I, E_I), such that there’s bijection f: V \rightarrow V_I and the size of the symmetric difference of f(E) and E_I is minimized (f(E) is an abuse of notation, but it means applying the bijection to each vertex in the set of edges E).

To make it clearer, let’s apply the definition above to the original problem. G represents the adjacency of states in the original map. V is the set of states and E is the set of pairs of states that share a border. H is the hexagonal grid. V_H is the set of all hexagons and E_H is the set of pairs of hexagons that are adjacent. We want to find a subset of the hexagons where we can place each of the states (hence the bijection from states to hexagons) and if two hexagons are in the grid, and we place two states there, we consider the states to be adjacent, hence the need for an induced graph, so the adjacency in the grid is preserved.

Is this general case an NP-hard problem? We can reduce the Graph Isomorphism problem to this one. It consists in deciding whether two graphs A and B are isomorphic. If we set G = A and H = B, then A and B are isomorphic if and only if I = H and the symmetric difference of f(E) and E_I is 0. The problem is that it’s not known whether Graph Isomorphism belongs to NP-Complete.

What if G is planar (which is the case for maps)? I haven’t given much thought about it, but I decided to come up with an integer programming model nevertheless.

An Integer Linear Programming Model

Note: the model uses the original grid analogy instead of the more general problem so that the constraints are easier to understand.

Boolean algebra as linear constraints

Before we start, we need to recall how to model logical constraints (AND, OR and EQUAL) using linear constraints. Let a and b be binary variables. We want x to be the result of logical operators applied to a and b.

For AND, we can do (x = 1 if and only if a = 1 and b = 1)

x \le a
x \le b
x \ge a + b - 1

For OR, we can do (x = 0 if and only if a = 0 and b = 0)

x \ge a
x \ge b
x \le a + b

For EQUAL, we can do (x = 1 if and only if a = b)

x \le 1 - (a - b)
x \le 1 - (b - a)
x \ge a + b - 1
x \ge -(a + b - 1)

We can introduce a notation and assume these constraints can be generated by a function. For example, if we say
x = \mbox{EQ}(a, b), we’re talking about the four constraints we defined above for modeling EQUAL. This is discussed in [2].

Constants

Let G be the set of pairs (x,y) representing the grid positions. Let P be the set of pieces p that have to be placed in the grid. Let N(x,y) be the set of pairs (x',y') that are adjacent to (x, y) in the grid.

Let A_{v1, v2} represent whether v1 and v2 are adjacent to each other in the dataset.

“Physical” constraints

Let b_{x,y,s} be a binary variable, and equals 1 if and only if state s is placed position (x, y).

1) A piece has to be placed in exactly one spot in the grid:

\sum_{(x,y) \in G} b_{x,y,p} = 1 for all p \in P

2) A spot can only be occupied by at most one state:

\sum_s b_{x,y,s} \le 1 for all (x,y) \in G

Adjacency constraints

Let a_{p1, p2, x, y} be a binary variable and equals 1 if and only if piece p1 is placed in (x, y) and adjacent to p2 in the grid.

3) a_{p1, p2, x, y} has to be 0 if p1 is not in (x,y) or p2 is not adjacent to any of (x,y) neighbors:

a_{p1, p2, x, y} = \mbox{AND}(\sum_{(x', y') \in N(x, y)} b_{x', y', p2}, b_{x,y,p})

We have that a_{p1, p2, x, y} is 1 if and only if p1 is in (x,y) and p2 is adjacent to any of (x,y) neighbors.

Finally, we can model the adjacency between two pieces p1 and p2. Let a_{p1, p2} be a binary variable and equals 1 if and only if p1 and p2 are adjacent in the grid:

a_{p1, p2} = \sum_{(x,y) in G} a_{p1, p2, x, y}

Symmetric difference constraints

Let y_{p1, p2} be a binary variable and equals to 1 if and only if a_{p1, p2} \ne A_{p1, p2}.

4) y_{p1, p2} \ge a_{p1, p2} - A_{p1, p2}
5) y_{p1, p2} \ge A_{p1, p2} - a_{p1, p2}

Objective function

The sum of all y‘s is the size of the symmetric difference:

\min \sum_{p1, p2 \in P} y_{p1, p2}.

Practical concerns

This model can be quite big. For our US map example, we have |P| = 50 and we need to estimate the size of the grid. A 50×50 grid is enough for any type of arrangement. The problem is that the number of variables a_{p1, p2, x, y} is |P|^2|G| = 50^4 which is not practical.

We can also solve the problem for individual connected components in the original graph and it’s trivial construct the optimal solution from each optimal sub-solution. This doesn’t help much in our example, since only Hawaii and Alaska are disconnected, so we have |P| = 48. The grid could also be reduced. It’s very unlikely that an optimal solution would be a straight line. In the NPR map, the grid is 8×12. Sounds like doubling these dimensions would give the optimal solution enough room, so |G| = 8*12*4 = 384.

We can also assume states are orderer and we only have variables a_{p1, p2, x, y} for p1 < p2, so the number of a_{p1, p2, x, y} is about 450K. Still too large, unfortunately.

Another important optimization we can do in our case because we're working with a grid is to define the adjacency for x and y independently and combine them afterwards.

Refined adjacency constraints

Instead of working with b_{x,y,s} we use X_{x, s}, and equals 1 if and only if state s is placed position (x, y) for any y and Y_{y, s}, which equals 1 iff state s is placed position (x, y) for any x. The physical constraints are analogous to the previous model:

6) A piece has to be placed in exactly one spot in the grid:

\sum_{x \in G} X_{x,p} = 1 for all p \in P
\sum_{y \in G} Y_{y,p} = 1 for all p \in P

7) A spot can only be occupied by at most one state:

\sum_s X_{xs} \le 1 for all x \in G
\sum_s Y_{y,s} \le 1 for all y \in G

In a hexagonal grid, if we have the piece p1 in position (x,y), it will be adjacent to another piece p2 if and only if p2 is in one of these six positions: 1: (x-1, y), 2: (x+1, y), 3: (x-1, y-1), 4: (x, y-1), 5: (x-1, y+1) or 6: (x, y+1). We can define two adjacency categories: Type I, which happens when p1.y - p2.y = 0 and |p1.x - p2.x| = 1 (cases 1 and 2); and Type II, which is when |p1.y - p2.y| = 1 and p1.x - p2.x \le 0 (cases 3, 4, 5 and 6).

Let’s define Y_{d=0, p1, p2, y} = 1 iff p1.y - p2.y = 0 for a given y. Similarly we define X_{|d|=1, p1, p2, x} = 1 iff |p1.x - p2.x| = 1, Y_{|d|=1, p1, p2, y} = 1 iff |p1.y - p2.y| = 1 and finally X_{d \ge 0, p1, p2, x} = 1 iff p1.x - p2.x \ge 0.

8) We can have the following constraints do model the variables we just defined:

Y_{d=0, p1, p2, y} = \mbox{EQ}(Y_{y, p_1}, Y_{y, p2})
X_{|d|=1, p1, p2, x} = \mbox{EQ}(X_{x, p1}, X_{x-1, p2} + X_{x+1, p2})
Y_{|d|=1, p1, p2, y} = \mbox{EQ}(Y_{y, p1}, Y_{y-1, p2} + Y_{y+1, p2})
X_{d \ge 0, p1, p2, x} = \mbox{EQ}(X_{x, p1}, X_{x, p2} + X_{x+1, p2})

9) Let Y_{d=0, p1, p2} = 1 iff p1.x - p2.y = 0 for any y. We can define analogous variables for the other cases:

Y_{d=0, p1, p2} = \sum_{y} Y_{d=0, p1, p2, y}
X_{|d|=1, p1, p2} = \sum_{x} X_{d=0, p1, p2, x}
Y_{|d|=1, p1, p2} = \sum_{y} Y_{d=0, p1, p2, y}
X_{d \ge 0, p1, p2} = \sum_{x} Y_{d \ge 0, p1, p2, x}

10) Let T'_{p1, p2} = 1 iff p1 and p2 have the Type I adjacency and T''_{p1, p2} = 1 iff p1 and p2 have Type II adjacency:

T'_{p1, p2} = \mbox{AND}(Y_{d=0, p1, p2}, X_{|d|=1, p1, p2})
T''_{p1, p2} = \mbox{AND}(Y_{|d|=1, p1, p2}, X_{d \ge 0, p1, p2})

11) Finally, we say that p1 and p2 are adjacency iff either Type I or Type II occurs:

a_{p1, p2} = \mbox{OR}(T'_{p1, p2}, T''_{p1, p2})

The model for adjacency became much more complicated but we were able to reduce the number of adjacency variables are now roughly O(|P|^2 \sqrt{|G|}). The number of non-zero entries in the right hand size of (which represents the size of the sparse matrix) is roughly 11M, dominated by the type (8) constraints. I’m still not confident this model will run, so I’ll punt on implementing it for now.

Conclusion

In this post we explored a different way to visualize the US states map. I was mainly exploring the idea of how good of an approximation this layout is and a natural question was how to model this as an optimization problem. Turns out that if we model it using graphs, the problem definition is pretty simple and happens to be a more general version of the Graph Isomorphism problem.

I struggled with coming up with an integer programming model and couldn’t find one with a manageable size, but it was a fun exercise!

World Map?

One cannot help wondering if we can display the countries in a hexagonal map. I’m planning to explore this idea in a future post. The main challenge is that the US states areas are more uniform than the countries. For example, the largest state (Alaska) is 430 times larger than the smallest (Rhode Island). While the largest country (Russia) is almost 40,000,000 bigger than the smallest (Vatican City).

Also, the layout of the US map was devised by someone from NPR and they did a manual process. I’m wondering if we can come up with a simple heuristic to place the hexagons and then perform manual adjustments.

References

[1] NPR Visuals Team – Let’s Tesselate: Hexagons For Tile Grid Maps
[2] Computer Science: Express boolean logic operations in zero-one integer linear programming (ILP)
[3] SOM – Creating hexagonal heatmaps with D3.js
[4] Github – d3/d3-hexbin

Data sources

[5] US State Borders
[6] Wikipedia – Population of US states and territories
[7] Tool to download Wikipedia tables as CSV
[8] List of US states and territories by area

Advertisements

Introduction to d3.js

mike-bostock

Mike Bostok (@mbostock) is a Computer Scientist currently working for the NYTimes. He was a PhD student at Stanford, when, together with professor Jeff Heer and Vadim Ogievetsky, they created D3.js, a framework for creating data visualizations in Javascript.

D3 stands for Data Driven Documents (DDD) and it simplifies the process of building visualizations on top of data, by handling most of the math and boilerplate necessary to generate visual elements.

In this post, we’ll discuss the basic aspects of D3.js, heavily based on the excellent book from Scott Murray, Interactive Data Visualization for the Web. The book is a relatively short and fun read, and it was compiled from a series of tutorials Scott wrote in the past.

Scott’s book is written for non-programmers and part of the book is also introducing web development technologies such as HTML, CSS and, of course, Javascript. In this post we assume our readers already know about them.

Setup

D3.js is an open-source library and it’s available on github. We can clone that repository and use them in our code. The source code is spread out into multiple files in d3/src/ but they are compiled into /d3/d3.js using a node.js module called smash, also by @mbostock.

The basic template for embedding D3 in a web page is by following this template:

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <title>D3 Page Template</title>
        <script type="text/javascript" src="d3/d3.js"></script>
    </head>
    <body>
        <script type="text/javascript" src="my_d3_example.js" />
    </body>
</html>

Note the meta tag UTF-8. It’s important because D3 source file uses unicode characters (like the Greek character π). We’ll do all the work in a separate javascript file, say my_d3_example.js.

We can do all the testing using our local host or use jsfiddle.

DOM Manipulation

D3 handles DOM manipulation very neatly. For example, one of the first things we’ll do when writing a D3 code is to select the body of our html page:

d3.select("body");

From there we can perform other DOM operations like adding other DOM elements,

d3
  .select("body")
  .append("p")
  .text("Hello World");

The selection uses CSS3 selectors syntax, so we can also select elements by class ".myClassName" or id "#myIDName".

Multi-selection. One key component of D3 expressiveness is batching operations. This saves us from writing for loops and it makes the code more concise. Say we have an HTML body like:

<body>
  <p>Paragraph 1</p>
  <p>Paragraph 2</p>
  <p>Paragraph 3</p>
  <p>Paragraph 4</p>
</body>

We can access all paragraphs by multi-selecting all <p> tags within the body:

d3
  .select("body")
  .selectAll("p")
  .text("Hello World");

This will set all of the contents of the paragraph to “Hello World”. In most of cases we’ll want to define a callback instead of passing a constant string to handle each entry differently. For example, we could do:

var counter = 0;
d3
  .select("body")
  .selectAll("p")
  .text(function() {
      counter += 1;
      return "New paragraph: "+counter;
  });

Observe how it relies a lot on function chaining. For it to work, it depends on the compatibility of the return type and the next method call, so it can be fragile. The API is very well crafted though, and it usually behaves as we’d expect. Moreover, it makes the code much more legible, removing keyword boilerplates and intermediate variables.

Binding data

One of the most import operation in D3 is binding data to DOM elements. This is done via the data() method. For example, we could do:

var dataset = [3, 5, 8, 13];
d3.select("body")
    .selectAll("p")
    .data(dataset)
    .enter()
    .append("p")
    .text(function(value) { return value; });

First, we are selecting all existing p elements, then we’re binding the data. The method enter() contains the rows from data that are not in the current selection. More specifically, say selectAll() returned 2 existing p elements, that is, an array with index 0 and 1. Our data is an array of 4 elements, with index 0 to 3.

D3 will assume that the indexes 0 and 1 are already there, so it’s not binding the values 3 and 5. We have the 8 and 13 values “unbounded”, so it will append one p element for each of these values and set the text.

It’s possible to specify the keys of the entries in data, but the default key is the array index. So let’s create keys for each of our entries:

  .data(dataset, function(value){ return "my_key"+value; })

Now we can verify all 4 paragraphs are rendered in addition to the 2 existing ones. For more details, see [2].

SVG and Attributes

Let’s create a simple random column chart. We’ll use SVG elements to render the columns. First, we can start creating and set the dimensions of a SVG element using the attr() method:

var width  = 800;
var height = 300;
var svg = d3.select("body").append("svg");
svg
    .attr("width", width)
    .attr("height", height);

Then, we can generate one rect element per entry of our data. In the code below, note how we set attributes in batch, by calling attr() with a list of attribute names and values.

var barWidth = width/dataset.length;
var padding = barWidth/10.0;

svg
    .selectAll("rect")
    .data(dataset)
    .enter()
    .append("rect")
    .attr({
        'x':    function(d, i) {return i*barWidth;},
        'y':    function(d)    {return height - d;},
        width:  barWidth - padding,
        height: function(d)    {return d;},
    })
    .text(function(d) {return d;});

Running the above with 25 random points renders a simple column chart:

Figure 1: Simple column chart

Figure 1: Simple column chart

Scales and Axis

Scale is essentially a function, that is, it maps a set of input to another output. One example would be if our data had X values ranging from 40-100, but our chart had width 1200px, and we wanted to map the range [40-100] onto [0,1200]. The most natural way to map a continuous interval onto another is through a linear transformation. We could write a function to perform that for us, but D3 makes it very easy to setup such mapping:

var scale = d3
    .scale
    .linear()
    .domain([40, 100])
    .range([0, 1200]);

In this syntax, domain is the input and range is the output.

Scales are important for axis, because axis are essentially visual representations of scales. Creating a simple axis from a scale is simple:

var axis = d3.svg.axis()
    .scale(scale)
    .orient("bottom"); 

// Append the axis element as an independent element on the svg
svg.append("g")
    .call(axis);

This will place a x-axis at the top of the chart. As we know, the x-axis is commonly positioned at the bottom of the chart, so we need to perform a y-translation of height units.

svg.append("g")
    .attr("transform", "translate(0," + (height) + ")")
    .call(axis);

This will cause the axis to not be shown because it got displaced beyond the SVG element limits. One way around that is to account for an extra height when defining the SVG height:

var width = 800;
var height = 300;
var axisHeight = 20;

var svg = d3.select("body").append("svg");
svg
    .attr("width", width)
    .attr("height", height + axisHeight);
Figure 2: Column chart with axis

Figure 2: Column chart with axis

Interactiveness

Another important aspect in data visualization is the interactiveness of the data.

In D3 we can set event listeners on SVG elements through the method on. It takes an event name (examples include “click”, “mouseover”, “mouseout”). A simple example is setting an event listener on the rectangles of our column chart. Let’s color it orange on hover:

svg
    .selectAll("rect")
    .data(dataset)
    .enter()
    .append("rect")
    .attr({
       'x':    function(d, i) {return i*barWidth;},
       'y':    function(d)    {return height - d;},
       width:  barWidth - padding,
       height: function(d)    {return d;},
    })
    .on("mouseover", function (d) { d3.select(this).attr("fill", "orange") })
    .on("mouseout",  function (d) { d3.select(this).attr("fill", "black") })
    .text(function(d) {return d;});

One observation here is that this within the callback function passed to the on method, is bound to the SVG element on which we’re setting up the listener.

The result can be seen on this jsfiddle.

Layouts

Constructing a column/bar chart is relatively straightforward using regular SVG rectangles and the D3 axis helper functions. On the other hand, chart types like pie charts for example, involves working with radians and more complicated math.

To leverage this, D3 uses the concept of layouts. One of the layouts is the pie layout:

var pie = d3.layout.pie();

It is basically a function that can transform our regular data into a suitable format for rendering SVG arcs, which will represent the slices of our piechart.

The code below creates binds a generic group element to each element of our dataset. It also translates our pie chart because all values are calculated taking the center of the circle as the origin (0, 0).

var arcs = svg.selectAll("g.arc")
        .data(pie(dataset))
        .enter()
        .append("g")
        .attr("class", "arc")
        .attr("transform", "translate(" + outerRadius + ", " + outerRadius + ")");

Now we can append the actual wedge (represented by the SVG arc element), which can be easily created with the d3.svg.arc() function.

We can use the d3.scale.category10() for generating a set of up to 10 distinct colors for each slice.

var arc = 
  d3.svg.arc()
    .innerRadius(innerRadius)
    .outerRadius(outerRadius);

var color = d3.scale.category10();
arcs.append("path")
    .attr("fill", function(d, i) {
        return color(i);
    })
    .attr("stroke", "white")
    .attr("d", arc);

Some magic seems to be going on here. Nowhere we set the start and end angles of our slice. I had to dig into the source code to realize that arc doesn’t represent the actual arc, but an arc generator. Then we set the attribute d, we’re actually calling a function arc() and the data is passed to this function. The startAngle and endAngle properties are being set by the pie layout.

Doing some other tweaks like adding the labels leads to the following pie chart:

Figure 3: Pie chart

Figure 3: Pie chart (jsfiddle)

Geo-mapping

GeoJSON is a JSON for describing maps in terms of SVG elements. For example, for a US map, each state has it’s own entry in this JSON and they define a set of coordinates that when project become a polygon defining the boundary of the state.

This GeoJSON is usually big, so it makes sense loading them from a file. We can start by doing

var MapsExample = {
    run: function() {
        d3.json(
            "data/us-states.json",
            this.handleGeoJSONLoaded.bind(this)
        );
    },

This file has no actual data, so we need to join with some other file, for example with a CSV file containing state names and some metric, like agricultural productivity (as in Chapter 12 of [1]). So after we have our map info loaded, we can also load the real data:

var MapsExample = {
    ...
    handleGeoJSONLoaded: function(json) {
        this._geoJSON = json;
        d3.csv(
            "data/us-ag-productivity-2004.csv",
            this.handleUSDataLoaded.bind(this)
        );
    },

    handleUSDataLoaded: function(data) {
        this.joinWithData(data);
        this.render();
    },

And before rendering we merge the data into the geoJSON:

var MapsExample = {
    ...
    joinWithData: function(data) {
        // Index geo objects by state name
        geoByState = {};
        this._geoJSON.features.forEach(function(feature) {
            var jsonState = feature.properties.name;
            geoByState[jsonState] = feature;
        });
        // Add the value attribute to the geoJSON
        data.forEach(function(row) {
            var dataState = row.state;
            geoByState[dataState].properties.value = parseFloat(row.value);
        });
    },

Now we’re ready to generate the SVG elements:

var MapsExample = {
    ...
    render: function() {
        var path = d3.geo.path()
            .projection(projection);
        svg.selectAll("path")
            .data(this._geoJSON.features)
            .enter()
            .append("path")
            .attr("d", path)
            .style("fill", function(d) {
                var value = d.properties.value;
                return (value) ? this.color(value) : '#ccc';
            }.bind(this));

The only missing piece here is the color, which maps values from the data into a discrete set of values:

this.color = d3.scale.quantize()
  .range([
    "rgb(237,248,233)",
    "rgb(186,228,179)",
    "rgb(116,196,118)",
    "rgb(49,163,84)",
    "rgb(0,109,44)"
  ])
  .domain([
    d3.min(data, function(d) { return d.value; }),
    d3.max(data, function(d) { return d.value; })
]);

The complete code with additional data added as circles can be seen on github.

Figure 4: Choropleth + Symbol Maps

Figure 4: Choropleth + Symbol Maps

Conclusion

D3.js is a very neat library and fun to work with. I’ve learned a lot about D3 and SVG writing this post and also became aware of the effort in standardizing computational cartography (GeoJSON). I’m super excited to try more examples, building stuff on my own and possibly contribute to the project.

My research in grad school was related to proportional symbol maps, and I was surprised that one of the examples consisted in actually constructing a proportional symbol maps with circles.

References

[1] Interactive Data Visualization for the Web – Scott Murray
[2] Knowledge Stockpile – Understanding selectAll, data, enter, append sequence in D3.js