# OpenVis Conf 2017

I attended the OpenVis Conf in Boston. It’s a broad Data Visualization single-track 2-day conference with an industry focus. Here are my notes on the talks.

#### Mike Bostock’s Keynote

Mike Bostock (the famous creator of D3.js) opened up the conference by talking mainly about d3.express, a new library he’s working on. Despite the name, it has no dependency on D3 itself, but rather, it looks more like Python/R notebooks in which you can write JavaScript expressions in a console-like environment, but that get re-evaluated as you change upstream assignments. It borrows a lot of ideas from Reactive programming, in particular Bret Victor’s ideas (this paradigm immediately reminded me of his Ladder of Abstraction).

Another interesting feature of this library is the built-in animation loop functionality. Through a slight modification of the ES6 syntax, Bostock uses generators as a simple construct to express animations. The library also include helpers to easliy build common UI input controls, such as a scroller and checkboxes.

d3.express is currently in development, not yet ready for use.

#### Data Sketch|es: a visualization a month

Shirley Wu and Nadieh Brehmer presented the lessons learned during their (ongoing) project called Data Sketch|ES, which consists in crafting a Data Visualization a month. A technique adopted by creative artists, this constraint is both the challenge of coming up with original ideas but also getting them done in a predicted amount of time.

Some of their lessons included cross-checking the data — even if obtained from usually reliable sources, iterate between sketches on paper and actual code prototypes — it’s hard to predict how a visualization will look like, especially if interactive, delight the audience — spending time polishing the work and making use of animation to keep users interested and engaged.

#### Visualizing data with Deck.gl

Nicolas Belmonte is the head of the Data Visualization at Uber. They’ve open sourced Deck.gl a library on top of WebGL, used to display data overlaid in a map. It has good integration with React and Mapbox GL.
They showcased a really awesome application simulating wind patterns using particles based on a wind vector map.

Wind map using particles

#### What Store Does Your Timeline Tell?

Matthew Bremer is a researcher at Microsoft and he explored many unconventional ways to represent temporal data besides the regular line chart. He showed some cases with radial axis (daily routine, lifetimes), spiral (cycles), grid. He and other folks wrote a paper on this topic and they have a page with more information.

#### GDAL

Robert Simmon explained how to use GDAL, a library to work with Geo Spatial data, mostly through command line. Too technical and too specific in domain for me. I didn’t get much from this talk, besides satellite imagery requiring a lot of post-processing to look presentable.

#### How Spatial Polygons Shape our World

Amelia McNamara discussed representation of quantities in maps, mainly through polygons. These polygons can be arbitrary shapes on the map, but are often represented by district areas, counties, and states.

She first presented a few examples including Choropleth US map and how big sparse states are over-represented while densely-populated small states are under-represented. Some strategies like distorted maps and unit-area maps (like the NPR’s hexmap, which we posted about) can be used with the downside of familiarity or incorrect adjacency of states.

She discussed scenarios in which data from different aggregation levels must be combined, say, one data has county level aggregation, the other has state level aggregation.

Up-scaling is easy. For the county to state aggregation, it might be a matter of just summing values. Down-scaling requires some other external reference, for example, if we can assume the count for a state is a relatively uniform function of the population, we can break it down by county based on their relative population. Side-scaling is harder, especially when one of the polygons is not contained in the other (which was the case for up and down scaling).

#### Introducing REGL

Mikola Lysenko is the author of the REGL library (which is an evolution of Stack.gl), which provides a functional (stateless) API on top of WebGL, much in line with the paradigm adopted by d3.express from Bostock’s talk. He then proceed to perform a quick live demo demonstrating the expressiveness (and his proficiency), by displaying a 3D scan dataset in a browser. This was by far the most visually appealing talk, with 3D bouncing rainbow bunnies.

#### Untangling the hairball

John Gomez started with a problem: how to display network data in a meaningful and consumable way. He explored different strategies like limiting the number of nodes to show, clustering, only show edges on hover. It was interesting to learn the steps of crafting a visualization from scratch. He used mostly D3 and developed some modules of his own. Very funny talk.

#### Designing Visualization Tools for Learners

Catherine D’Ignazio and Rahul Bhargava are interested in tools for people who do not have a analytical background or are not data savvy. Developing tools with this mindset automatically makes tools more accessible for everyone. They presented 4 traits that a good tool must have: focused, guided, inviting and expandable.

A focused tool must do one thing and do it very well (UNIX philosophy). Tableau is an example of a tool that is not focused. It’s not necessarily bad, it’s just overwhelming for a learner. They cited Timeline.js as a focused tool.

A guided tool provides clear affordances on what steps should be taken next. They showcased a tool which starts with a blank canvas where you can add components to it. They provided ways to improve that by starting with a sample component/widget instead of a blank page.

An inviting tool is familiar. It contains examples that users can related to their area/context (journalism, arts, etc). It provides a safe playground for users to try it out, being able to easily undo it and make mistakes. An example of uninviting tool is excel’s pivot table.

An expandable tool allows the learner to take a step further towards mastery. It’s not a rigid static tool. It does not shy away from technical terms because it doesn’t want to be incorrect or misleading, but it provides signifiers such as tooltips. An example of non-expandable tool is a (static) infographic.

#### Visualizing Incarceration in the US on Polygraph

Matt Daniels provided a set of rich and interactive visualizations to represent data from incarcerations in the US. The project started by looking at a line chart representing the growth of convicts in the US, where a sharp increase in the numbers during the 2000s led to an astonishing 1% of the American male population being behind the bars these days.

His attempts were to gain insights on what measures can be taken to reduce incarcerations, breaking it down by causes, gender and other dimensions. He described the pain in obtaining accurate and detailed data, having to settle for Florida’s data, a state with an unusually transparent criminal system.

It felt that much of the work was yet to be done, so the talk didn’t unleash its full potential.

#### Amanda Cox’s Keynote

Amanda Cox discussed uncertainty in charts. It’s a challenging component because probability is not intuitive and a lot of people cannot interpret it correctly.

I missed the first half of this talk, so it’s probably not capturing the full context.

Uncertainty represented as moving gauges

#### A data point walks into a bar: designing data for empathy

Lisa Rost discussed the importance of including the emotional component in the design of the visualization. Rationality is necessary for a informed decision, but emotion is also needed to convince.

She provided 5 strategies that can be used to trigger more emotional response from a user:

• Use colors – explore intensity, an example was the use of bright red to evoke the image of blood in a violent deaths chart
• Show what you’re talking about – instead of plain bar charts, how about a stack of icons representing people if we’re talking about mortality?
• Show what data would mean for the user – create a visualization that will put the viewer as the protagonist. Examples include wording (using “you”), or analogies to more common ground (e.g. for an event that occurred in Syria, use an analogy to what it would look like if it was in the US)
• Zoom into one datapoint – people related more to individuals than groups. Focus on one person’s story.
• Show masses as individuals – masses are impersonal. Instead of communicating totals, use averages to make it more personal. For example: “X people died in the event”, can be rephrased as “1 person died every Y minutes”.

#### D3 with Canvas

Kai Chang demonstrated how to render D3 charts using Canvas. It was very code-oriented, hence hard to follow, but it was good to learn what can be done. I don’t recall him using any extra library for that.
One interesting trick he shared is how to perform a onClick functionality in a polygon rendered with canvas. For SVG it’s easy, but for Canvas it’s a bitmap, so we’d need to use some sort of point-inside-polygon algorithm. He suggested rendering another polygon underneath with the same shape but with a custom color. Canvas has an API to retrieve that color, so we are able to tell which polygon the click was from, based on the color associated to that polygon.

#### Pulling a Polygon out of a hat

Noah Veltman had a very interesting talk about animated transitions between polygons. He started with a simple mapping from a square to square, then triangle to square, which require introducing auxiliary points to the triangle so the number of vertices between the polygons match. He generalized to arbitrary (including concave) shapes and compared how much more natural his algorithm looked compared to a naive approach.
He finished with a yet-to-explore question: how to map polygons with different topological properties (e.g. a polygon with a hole to a polygon without one). Very entertaining and educative.

#### Text Mining and Visualization, the tidy way

Julia Silge explained the basics of text mining using unsupervised methods to obtain interesting insights from large bodies of text. She presented a demo using R and ggplot2. The main visualization used were histograms.

#### Why does Data Vis need a style guide?

Amy Cesal discussed the advantages of a consistent style for visualizations. It must be consistency across an organization, not only software (especially because in many cases multiple tools are used). Color, text (terminology, writing style, tone) and also layout.

#### Vega-lite: A grammar of interactive graphics

Kanit Wongsuphasawat, Dominik Moritz and Arvind Satyanarayan are students at UW, and have developed a simpler version of Vega, called Vega-Lite. It’s basically a system which translates a JSON configuration into Vega syntax. It relies on convention over configuration to make it very simple to create basic charts but also allows for more powerful visualizations.

A very interesting concept is the multi-chart configuration. Through operators it’s possible to “explode” a single chart into multiple ones (e.g. break down a line chart with multiple lines into multiple charts with its own line), it allows combining different chart types.
The most interesting idea in my opinion is being able to use nested configurations to create dashboards. The binary tree hierarchy can be used to define how the widgets are laid out (through horizontal and vertical partitioning). This hierarchy can also be used to define controllers that affect a subtree. For example, if we define a scroller at a given node, it affects all the widgets within that subtree.
Voyager is a UI built on top of Vega and looks very similar to Tableau.

This is a Vega-Lite specification to create a bar chart that shows the average temperature in Seattle for each month.

#### Data as a creative constraint

Eric Socolofsky has done some work for the Exploratorium and talked about computer art and how it can be applied to data visualization. He mentioned some components/types of digital art:

• Generative art: computer generated imagery.
• Randomness – for example, to represent uncertainty as mentioned in Amanda Cox’s talk
• Particle systems – for example, the Wind Map presented by Nicolas Belmonte
• Recursion and repetition – fractals
• Motion – animation
• Color

#### Empowering effective visualization (color) design

Connor Gramazio proposes an algorithm generated palette based on the CIELab color space and defines two metrics: discriminability (being able to tell different colors apart) and preferability (subjective measure of how an user likes the colors in the palette). He performed user studies to compare this palette to other palettes such as colorbrewer and Microsoft’s.

Overall it was very academic and technical and I got lost on the details.

#### Hacking your health with JavaScript

Alan McLean, talked about his works in Health tracking companies and also analyzing his own health. The tone of the presentation was a bit dark, but it did raised awareness (in the lines of Rost’s empathy talk) of how visualizations can be impersonal and cold, especially when the data displayed is about the user themselves.

#### The role of visualization in exploratory data analysis

This was basically a quick R tutorial focusing on packages like dplyr and ggplot2. Hadley Wickham performed a live demo to represent data of his github commits and his trips.

### Conclusion

Overall the conference was very interesting, with a diverse set of topics. I like that it’s industry driven, that is, not too academic. A lot of the talks are about ad-hoc visualizations, but since I work on developing general-purpose UI data tools, “Designing Visualization Tools for Learners“, “Why does Data Vis need a style guide” and “Vega-lite: A grammar of interactive graphics” were the most applicable to my work.

Uber has an interesting setup. One of the presenters, Nicolas Belmonte, is the head of the Data Visualization at Uber. I’ve talked to one of their teammates during lunch. They’re a horizontal team of around 25 people which embed themselves in different teams to work on specific visualization projects.

Prior to the conference I took a few days to explore the city, including the nearby Cambridge. I toured around Harvard, MIT, did the Freedom Trail, ate a lot of seafood but the highlight of the trip was the Boston Museum of Fine Arts.

1. Asian Art from the Museum of Fine Arts in Boston; 2. Boston Harbor
3. MIT Ray and Maria Stata Center; 4. Room at the Museum of Fine Arts.

This is the first post under the category “Paper Reading” which will consist in summarizing my notes after reading a paper.

In this post we’ll discuss Spanner: Google’s Globally-Distributed Database. We’ll provide a brief overview of the paper’s contents and study in more details the architecture of the system and the implementation details. At the end, we provide an Appendix to cover some distributed systems and databases concepts mentioned throughout the paper.

### Paper’s Overview

Spanner is a distributed database used at Google. It provides a SQL API and several features such as reading data from a past timestamp, lock-free read-only transactions and atomic schema changes.

In Section 2, Implementation, it describes how the data is distributed in machines called spanservers, which themselves contain data structures called tablets, which are stored as B-tree-like files in a distributed file system called Colossus. It describes the data model, which allows hierarchy of tables, where the parent table’s rows are interleaved with the children tables’.

The paper puts emphasis on the TrueTime API as the linchpin that address many challenges in real-world distributed systems, especially around latency. This is described in Section 3.

Section 4 describes technical details on how to implement the distributed transactions and safe schema changes.

In Section 5, the authors provide benchmarks and use cases, in particular F1, a database built on top of Spanner and used by Google’s AdWords team, to address some scaling limitations of a sharded MySQL database.

### Architecture

#### Spanservers

Figure 1: Spanner stack [1]

As we can see in Figure 1, Spanner is organized in a set of zones, which are the unit of administrative deployment. Even though there might be more than one zone per data center, each zone correspond to a disjoint physical set of server. Within a zone, we have the zonemaster, which the paper doesn’t delve into, the location proxy, which serves as a index that directs requests to the appropriate spanserver, and the spanserver itself, which is the main unit in the system, containing the actual data.

Figure 2: Spanner architecture

A spanserver contains multiple tablets, which are a map

(key, timestamp) -> value

They are stored in Colossus, a distributed file system (successor of Google File System). A spanserver contains multiple replicas of the same data and the set of replicas form a Paxos group, which we can see in Figure 2. Reads can go directly to any replica that is sufficiently up-to-date. Writes must go through a leader, which is chosen via Paxos. The lock table depicted on top of the leader replica in Figure 2 allows concurrency control. The transaction manager is used to coordinate distributed transactions (that is, across multiple groups) and is also chooses the participant leader. If the transaction involves more than one Paxos group, a coordinator leader must be elected among the participant leaders of each group.

Within a tablet, keys are organized in directories, which is a set of contiguous keys sharing a common prefix. A directory is the unit of data, meaning that data movement happens by directories, not individual keys or rows. Directories can be moved between Paxos groups.

#### Data Model

The paper mentions that its data is in a semi-relational table. This is because it has characteristics from both relational tables (e.g. MySQL tables) and non-relational table (e.g. HBase tables). It looks like a relational table because it has rows, columns and versioned values. It looks like a key-value store table because rows must have a unique identifier, which acts a key, while the row is the value. The qualifier schematized seems to denote that these tables have well-defined schemas (including column types and primary keys).

In particular, the schema supports Protocol Buffers. This means that besides the native types existing in databases (string, int, etc.), the schema supports the strongly typed structures from Protocol Buffers.

Another feature of this database is how it allows storing related tables with their rows interleaved. This helps having data that is often queried together to be located physically close (co-located). In the example of Figure 3, it stores the Users and Albums interleaved. If we think of this as an use case for the Google Photos application, it should be a pretty common operation to fetch all albums of a given user.

Figure 3: Interleaved tables

#### TrueTime API

The main method from the true time API (TT) is:

TT.now(): [earliest, latest]

It returns the current universal time within a confidence interval. In other words, there’s a very strong guarantee that the actual time lies within the returned time range. Let $\epsilon$ be half of a given confidence interval length and $\bar\epsilon$ the average across all $\epsilon$‘s. According to their experiments, $\bar\epsilon$ varied from 1 to 7ms during the pollings.

Of course, the strength doesn’t lie on the API itself, but the small confidence interval it can guarantee. They achieve this through a combination of atomic clocks and GPSs, each of which have different uncorrelated synchronization and failure sources, which means that even if one of them fails, the other is likely working properly.

### Implementation Details

Spanner has three main types of operations: read-write transaction, read-only transaction and snapshot read at a timestamp.

The writes in a read-write transaction, or RW for short, are buffered in the client until commit, which means the reads in that transaction do not see the effects of the writes.

First, the client perform the reads, by issuing read requests to each of the leader of the (Paxos) groups that have the data. The leader acquires read locks and reads the most recent data.

Then the client starts the 2 phase commit for the writes. It chooses a coordinator group (a representative among the Paxos groups) and notify the leaders of the other groups with the identity of the coordinator plus the write requests. A non-coordinator group’s leader first acquires write locks and then chooses a prepare timestamp that is greater than all the timestamps issued by this group, to preserve monotonicity, and sends this timestamp to the coordinator’s leader.

The coordinator’s leader acquires write locks and receives all the timestamps from the other leaders and chooses a timestamp s that is greater than all of these, greater than the TT.now().latest at the time it received the commit request and greater than any timestamps issued within the coordinator group.

The coordinator’s leader might need to wait until it’s sure s is in the past, that is, until TT.now().earliest > s. After that, the coordinator sends s to all other leaders which commit their transactions with timestamp s and release their write locks.

This choice of timestamp can be shown to guarantee the external consistency property, that is, if the start of a transaction T2 happens after the commit of transaction T1, then the commit timestamp assigned to T2 has to be grater than T1’s.

If the read involves a single Paxos group, then it chooses a read timestamp as the timestamp of the latest committed write in the group (note that if there’s a write going on, it would have a lock on it). If the read involves more than one group, Spanner will choose for timestamp TT.now().latest, possibly waiting until it’s sure that timestamp is in the past.

### Conclusion

In this post we learned about some of the most recent Google’s databases, the main pioneer in large-scale distributed systems. It addresses some limitations with other solutions developed in the past: Big Table, which lacks schema and strong consistency; Megastore, which has poor write performance; and a sharded MySQL, which required the application to know about the sharding schema and resharding being a risky and lengthy process.

One thing I missed from this paper is whether Spanner can perform more advanced relational database operations such as aggregations, subqueries or joins. Usually performing these in a distributed system requires some extra component to store the intermediate values, which was not mentioned in the paper.

### References

• [1] Spanner: Google’s Globally-Distributed Database
• [2] Principles of Computer Systems – Distributed Transactions
• [4] Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
• [5] System level concurrency control for distributed database systems

### Appendix: Terminology and concepts

I was unfamiliar with several of the terminology and concepts used throughout the paper, so I had to do some extra research. Here I attempt to explain some of these topics by quoting snippets from the paper.

Spanner is a database that shards data across many set of Paxos.

Sharding is basically distributing the data into multiple machines. It’s used commonly in a database context meaning partitioning the table’s rows into multiple machines. Paxos is a protocol used to solve the consensus problem, in which a set of machines (participants) in a distributed environment must agree on a value (consensus). We discussed it in a previous post, and it guarantees that at least a quorum (subset of more than half of the participants) agree on a value. In “set of Paxos”, it seems that Paxos is an allegory to represent a set of machines using the Paxos protocol.

Spanner has two features that are difficult to implement in a distributed database: it provides externally consistent reads and writes, and globally-consistent reads across the database at a timestamp.

### Software

OS. I decided to try the Raspbian OS, which is a fork of Debian (wheezy) adapter for Raspberry. We first download it and write the image to the SD card.

We can then insert the card into the Raspberry Pi and connect a monitor/keyboard/mouse. The boot process will ask us to fill in some information and should be straightforward. The wi-fi adapter worked out of the box.

SSH. After installed, it felts very sluggish to run GUI, so I decided to do everything through SSH. The instructables has a very detailed guide for enabling it through the UI. After changing the password as instructed in the guide, I go the internal IP of the Pi using

hostname -I


I can then connect through:

ssh pi@<IP>


We can then install everything through command line.

Text editor and webserver. I installed my favorite editor and the server nginx (we could have used Apache’s HTTP server alternatively).

sudo apt-get update
sudo apt-get install emacs nginx


To test if the server is installed properly, we can run it out of the box:

sudo /etc/init.d/nginx start


If you put the Pi’s IP on the browser address bar in your laptop you should be able to see the default page served.

### Server Configuration

We can make some changes in the default nginx config to ease development.

We can edit the configuration file /etc/nginx/sites-available/default (has to be sudo). The first thing I changed is for it to read files from my home folder instead of /usr/share/nginx/www

server {
root /home/pi/www;
}


Since I’m using for private purposes, I turned HTTP authentication on. First we need to register a login entry in a .htpasswd file using htpasswd application. This can be obtained in Debian via

sudo apt-get install apache2-utils


Then we run:

sudo htpasswd -c /etc/nginx/.htpasswd <USER>


replacing with the username you’ll provide at login. When running the command above, you’ll be prompted to provide the password. The user name and an encrypted password will be saved to /etc/nginx/.htpasswd. Now we can configure nginx to use credentials from that file to perform the access check:

server {
...
auth_basic "Private Site";
auth_basic_user_file /etc/nginx/.htpasswd;
}


We can now add a custom HTML file to /home/pi/www (or whatever path you put in the nginx config), such as /home/pi/www/index.html

<html>
<title>Pi's webpage</title>
<body>
Hello world
</body>
</html>


Restart the server and reload the page, and you should get the new custom page!

sudo /etc/init.d/nginx restart


In a future post we’ll see how to work with a Node.js server, but this is as far as we’ll go in this first tutorial.

### Network Configuration

Static Internal IP. To make sure the internal IP of the Pi doesn’t keep changing you might need to configure your router. My router is a MediaLink 300N, which stores a table of MAC addresses (a unique identifier for your hardware) to internal IPs automatically so I don’t have to do anything.

Static External IP. The remaining problem is your external IP. Unless you have asked for static IP, chances are that your ISP (mine is Comcast) will change the external IP from time to time, so you don’t have much control over that.

Dynamic DNS. To solve that, first we need to get a domain (I registered a new one via Google domains). You can configure it to point to a specific IP (your external IP) which will be stored in a DNS. The problem, as we said, is that your external IP might change, so we need to update the mapping from periodically.

We don’t want to do this manually, so we can use a system like ddclient which runs a daemon on your server machine (the Pi in our case) that will periodically check the external IP and update the DNS entry with the new IP in case it has changed.

To install we can simply do

sudo apt-get install ddclient


We then need to configure it so it knows where to go to update the entry. The file lives in /etc/ddclient.conf (need to edit as sudo). The configuration will depend on what is your domain provider. For google domains it will look like:

protocol=dyndns2
use=web
ssl=yes
<DOMAIN NAME>


There’s a good tutorial for how to setup ddclient using Google domains.

To run the daemon, we can do

sudo ddclient -debug


Port Forwarding. When we access an URL like http://site.kunigami.info/, it’s implicitly assuming port 80 when routing that domain to the actual IP. We’ll need to tell our router to forward that request to a specific internal IP (otherwise how does it know whether it should go to your laptop or the pi?). Most routers offer a way to perform this mapping, which is also known as port forwarding.

For my MediaLink router it’s under Advanced Settings > Virtual Server > Port Range Forwarding.

NOTE: There’s one current issue I haven’t been able to figure out. The port forwarding seem to only work if I access it from outside of my local network, that is, through my phone network or via some VPN. It might be some issue with MediaLink.

### Conclusion

In this post we learned some details of the Internet Protocol and learned how to configure a Raspberry Pi to act as a server in a domestic network.

### References

[1] How-To Geek – How to Forward Ports on your Router.
[3] Server Fault – How does the HTTP GET method work in relation to DNS protocol?
[4] Wikipedia – Classful network
[5] Page to test if a specific port is open and being forwarded
[6] Setting up a nginx server on a Raspberry Pi

# 2016 in Review

This is a meta-post to review what happened in 2016.

This year I improved my knowledge on Web Development, learning more about HTTPS, JavaScript development, Web workers. I read a book about human-computer interaction, The Design of Everyday Things.

From my last year’s resolutions, I finished reading Code Complete. I’ve started learning about OCaml and started reading Purely Functional Data Structures by Okasaki, while implementing the data structures introduced in this book in OCaml. I’m still only 1/4 of the way in, so I’ll keep it in my 2017 goals.

I’ve only managed to try out one data visualization project, the hex map, and I’ll continue exploring this area next year.

I missed enrolling in any Coursera classes and learning about either Scala or Spark, unfortunately.

### Personal

The end of the year is a good time to look back and remember all the things I’ve done besides work and the technical blog.

#### Trips

In 2016 I was lucky to have travelled a lot. In the beginning of the year, I did a road trip around Arizona, where I visited some beautiful National Parks and monuments, and a Biosphere, which inspired a blog post!

1. Arizona: Saguaro NP; 2. Chiricahua Monument; 3. Blue Mesas at Petrified Forest NP

I’ve also been back to Brazil, which included a short trip to Caldas Novas in Goiás, where many resorts and hot springs are located.

Then I had an opportunity to work for a month in Tel Aviv, Israel. During my free time I visited Jerusalem, the Dead Sea and Masada National Park, and Eilat. I also visited the magnificent Petra, the ancient city carved on stone, in Jordan. This was the most fantastic and memorable trip of the year.

1. Yehuda Market in Jerusalem; 2. ruins in Petra, Jordan; 3. a beach in Tel Aviv, Israel

Later in the year I’ve been to Yellowstone and Grand Teton National Parks, and also stopped by some museums in Salt Lake City in the way.

1. Prismatic springs; 2. Yellowstone falls; 3. Grand Teton NP

Shorter trips included Austin TX, Seattle WA, Mammoth Lakes CA and Las Vegas NV. It was a pretty intense year in terms of travelling, including 2 new countries and 5 new US states. I’m very grateful for being able to see these places and I hope 2017 will also be plenty in exploring.

#### Books

I read some really good books in 2016. Harari’s Sapiens: A Brief History of Humankind and Dawkins’ The Selfish Gene were my favorite science books. Empires of Indus is a great book on the history of civilization around the Indus river.

Farewell to Manzanar is a touching biography of Jeanne Wakatsuki focusing on internment camps Japanese Americans were sent to, in particular Manzanar, during the Second World War. I had a chance to visit the historical site that exists there today.

I haven’t read much fiction but Steinbeck’s Of Mice and Men and Morgenstern’s The Night Circus were entertaining.

As for biographies, my favorite is Einstein‘s by Walter Isaacson. I had enjoyed his work on Steve Jobs and he didn’t disappoint here. Einstein is one of the few scientists I also admire as a person.

#### Movies

I’m not a big movie watcher, but this year I made a point of watching well known classic movies and it was rewarding. I’ve seen: A Clockwork Orange, Empire of the Sun, The Godfather, Schindler’s List, 2001: A Space Odyssey, Seven Samurai, Akira, My Neighbor Totoro and Gandhi.

The most popular post was the Introduction to the Parsec Library, with 2k visits. No post from this year got popular, all of them below 50 visits :( Overall it had 7.5k visitors.

I kept the resolution to post once a month. I missed April, but I made up for it in November, so I wrote a total of 12 posts (excluding the meta post). The blog completed 4 years with 51 posts.

And we now have a new domain, kunigami.blog :)

### Resolutions for 2017

I missed many resolutions from last year, so I’ll carry some over, which includes finishing Okasaki’s Purely Functional Data Structures and learning about Scala and Spark. I’d like to write an iPhone app next year too and continue exploring ideas around data visualization.

# Amortization Analysis

In Chapter 5 of Purely Functional Data Structures, Okasaki introduces the amortization analysis as preparation for the complexity analysis of functional data structures.

The idea of amortized analysis is to prove the complexity of an operation when the average cost per operation is less than the worst case cost of each individual operation.

In this post we’ll perform analysis of 2 data structures and show that their amortized running time is much better than their worst case. There are 2 methods presented in the book, the Banker’s Method and the Physicist Method. I found the Physicist easier to work with, so I’ll only use that. Also, I find it easy to understand concepts with an examples, so we’ll first introduce the data structure and then introduce the analysis.

### Queue with lists

An easy way to implement the queue data structure is by using two lists, one representing the beginning (left) and the other the end (right). Inserting/removing an element from the beginning of a list in OCaml is O(1), while doing it at the end is O(n). For this reason, we store the end of the queue reversed in the right list. (In imperative languages the opposite is true, insertions/removals at the end are O(1) and at the beginning O(n), so the left array that is reversed).

Queue using lists representing [1, 2, 3, 4, 5, 6]

Insertions are performed by appending an element to the beginning of the right array. Removing from the front of the queue requires removing the first element of the left array. When the left array is empty, we need to reverse the contents of the right array and move it to the left.

We work with an invariant, in which the left array is only empty if the right array is empty as well. That requires a slight change to the removal, where we perform the reversal and move of the right array when the left array has only one element and is about to become empty.

I’ve implemented this idea in Ocaml.

### Analysis – The Physicist Method

If we only look at the worst case, we will get that removing from a queue is an O(n) operation. We can improve this by looking at the amortized cost. One way to perform this analysis is via the physicist method, in which we define a potential (or cost) function $\Phi(d)$ for the data structure $d$. In the queue case, we can define it to be the size of the right list.

The amortized cost of step i, $a_i$ is defined by

$a_i = t_i + \Phi(d_i) - \Phi(d_{i-1})$

Where $t_i$ is the actual cost (in number of operations)

The intuition is that some costly operations such as reversing the list, which takes O(n) operations, can only happen after O(n) insertions happened, so the average cost per operation is O(1). Translating it to the potential analogy, each insersion operation increases the potential of the data structure by 1, so by the time a revert happens, the right list is emptied and that causes a loss of potential of n, which had been accumulated and that offsets the cost of the operation $t_i$.

We can get the accumulated amortized cost of the operations up to a specific step by summing them, and we’ll get

$A_i = T_i + \Phi(d_i) - \Phi(d_0)$

If we choose $\Phi(d)$ such that $\Phi(d_i) \ge \Phi(d_0)$, for all i, then A_i is an upper bound of $T_i$. Since we assume $d_0$ is the empty queue, $\Phi(d_0) = 0$ and we have $\Phi(d_i) \ge 0$, this property is satisfied.

One important remark is that this analysis is only applicable to non-persistent data structures, which is not the case in OCaml. We present the OCaml code as exercise, but the correct analysis will done in later chapters.

### Splay Trees

Splay Tree is a binary search tree. The insertion of an element e consists in spliting the current tree in 2 subtrees, one of all elements less or equal than e, which we’ll call small and another will all elements larger than e, which we’ll call big. We then make e the root of tree and small the left subtree and big the right subtree.

Partition

The partition operation is O(height of tree). The intuition is, when looking for the small subtree, imagine we’re analysing a node x. If x >= e, then we know all elements in the left subtree are also >= e, so are all the elements in the right subtree, so we can discard it and only look at the left subtree. The same idea applies to looking for the big subtree.

In an ideal scenario, the tree will be well balanced in such a way that the height of the tree is O(log n), and so the insertion operation. But that’s not always the case, so we introduce a balancing strategy. Whenever we follow 2 lefts in a row when searching for the big subtree, we perform a right rotation. Similarly, when following 2 rights in a row when searching for the small subtree, we perform a left rotation.

These re-balancing are enough to make partition and also insertion an O(log n) amortized time, as we shall see in the next section.

Rotation

The operation for finding the minimum element is trivial. To delete the minimum element, we also perform rotations. Since we only traverse the left paths, we only have to do a left rotation.

I’ve implemented these ideas in OCaml.

### Amortized Analysis of Splay Trees

We first define a potential function. Given a tree $t$ and a node $x \in t$, let $t(x)$ be the subtree of $t$ rooted in $x$. Then $\phi(t(x)) = \log(|t(x)| + 1)$ be a potential function where $|t(x)|$ the number of nodes of the subtree of which $x$ is root, and let the potential of a tree $t$, $\Phi(t)$, be the sum of the sub-potentials of all nodes of $t$, that is,

(1) $\Phi(t) = \sum_{x \in t} \phi(t(x))$.

The amortized equation of calling partition() is the actual cost of the partition, plus the potential of the resulting trees (small and big) minus the potential of the original tree. Let’s denote small(t) as $s$, and big(t) as $b$.

(2) $\mathcal{A}(t) = \mathcal{T}(t) + \Phi(s) + \Phi(b) - \Phi(t)$

We can prove that $\mathcal{A}(t) \le 1 + 2 \log(|t|)$ (see appendix), which means the amortized cost of the partition operation is $O(\log n)$. It implies that insertion also has an amortized cost of $O(\log n)$.

The same idea can be used to prove the amortized costs of the deletion of the minimum element. Okasaki claims that splay trees are one of the fastest implementations for heaps when persistence is not needed.

### Conclusion

I found amortized analysis very hard to grasp. My feeling is that the amortized analysis is too abstracted from the real problem, making it not very intuitive. It’s also not clear how one picks the potential function and whether there’s a choice of potential that might result in a better than the “actual” amortized cost.

### Appendix

As we can see in code, there are four cases to consider: left-left, left-right, right-left and right-right. The left-* and right-* are symmetric, so let’s focus on the left-left and left-right cases. For the left-left case, suppose we perform the following partition:

Left-left case of partition()

(3) $\mathcal{A}(t) = \mathcal{T}(t) + \Phi(s) + \Phi(b) - \Phi(t)$

Since $= \mathcal{T}$ corresponds to the number of recursive calls to partition, $\mathcal{T}(t) = 1 + \mathcal{T}(u)$, so

(4) $= 1 + \mathcal{T}(u) + \Phi(s) + \Phi(b) - \Phi(t)$

The recursive call to u yields (s’, b’), so we have the equivalence, $\mathcal{A}(u) = \mathcal{T}(u) + \Phi(s') + \Phi(b') - \Phi(u)$,

(5) $= 1 + \mathcal{A}(u) - \Phi(s') - \Phi(b') + \Phi(u) + \Phi(s) + \Phi(b) - \Phi(t)$

Since, s = s’,

(6) $= 1 + \mathcal{A}(u) - \Phi(b') + \Phi(u) + \Phi(b) - \Phi(t)$

and $\Phi(b') - \Phi(b) = \phi(b(x)) + \phi(b(y)) + \Phi(c) + \Phi(d)$, and $\Phi(t) - \Phi(u) = \phi(t(x)) + \phi(t(y)) + \Phi(c) + \Phi(d)$, we can cancel out some terms and simplify the equation above to

(7) $= 1 + \mathcal{A}(u) - \phi(b(x)) + \phi(b(y)) - \phi(t(x)) - \phi(t(y))$

by induction, $\mathcal{A}(u) \le 1 + 2 \phi(u)$,

(8) $\le 2 + 2\phi(u) + \phi(b(x)) + \phi(b(y)) - \phi(t(x)) - \phi(t(y))$

Since $u$ is a subtree of $t(y)$, $\phi(u) < \phi(t(y))$ and since the nodes in $b$ is a subset of t, $\phi(b(y)) \le \phi(t(x))$

(9) $< 2 + \phi(u) + \phi(b(x))$

It’s possible to show that, given $x, y, z$ such that $y + z \le x$, then $1 + \log y + \log z < 2 \log x$. We have that $|t| = |u| + |c| + |d| + 2$, while $|b(x)| = |c| + |d| + 1$, thus, $|u| + |b(x)| < |t|$ and by the relation above, $1 + \log (|u|) + \log (|b(x)|) + 1 < 2 \log (|t|) + 1$, $\phi(u) + \phi(b(x)) + 1 < 2 \phi(t)$, which yields

(10) $\mathcal{A}(t) < 1 + 2 \phi(t)$.

It’s very subtle, but in this last step we used the assumption of the rotation, in particular in $|b(x)| = |c| + |d| + 1$, if we hadn’t performed the rotation, $|b(x)| = |b| + |c| + |d| + 2$, and $|u| + |b(x)| < |t|$ wouldn’t hold!

For the left-right case, we can do a similar analysis, except that now we’ll apply the recursion to $c$, and the resulting partition will look different, as depicted below:

Left-right case of partition()

In this case, (8) will become

(11) $\le 2 + 2\phi(c) + \phi(b(x)) + \phi(b(y)) - \phi(t(x)) - \phi(t(y))$

We then use that $\phi(c) < \phi(t(y))$ and $\phi(t(y)) < \phi(t(x))$ this will allow us to cancel out some terms to

(12) $< 2 + \phi(b(x)) + \phi(b(y))$

Similar to what we've done before, we can show that $|b(x)| + |b(y)| = |s| + |b| = |t|$ since the partitions of $|t|$ are mutually exclusive. We can still use the result from the other case and get (10).

### References

[1] Wikipedia – Amortized Analysis