Bulls and Cows

Bulls and Cows (also known as MOO) is a 2-player game in which one player comes up with a secret and the other has to guess it. The secret consists of 4 distinct digits from 0 to 9. Player 2 has to guess these 4 digits, in the right order. After each guess from player 2, player 1 provides feedback as hints. The hints are two numbers: one telling how many digits from the secret player 2 got in the right position (bulls), and another telling how many digits of the secret they got but in the wrong position (cows).

For example, if player 1 came up with 4271, and player 2 guessed 1234, then bulls is 1 (the digit 2), and cows is 2 (1 and 4 are in the secret, but not in the positions guessed by player 2).
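The hint computation from this example can be sketched in Rust. This is a hypothetical helper, not the post's actual code; digits are represented as bytes and the secret is assumed to have distinct digits:

```rust
// Compute the (bulls, cows) hint of a guess against a secret.
// Works for secrets with distinct digits, as in the game's rules.
fn compute_score(secret: &[u8], guess: &[u8]) -> (usize, usize) {
    // Bulls: digits matching in the same position.
    let bulls = secret
        .iter()
        .zip(guess)
        .filter(|(s, g)| s == g)
        .count();
    // Digits of the guess present anywhere in the secret.
    let common = guess.iter().filter(|&g| secret.contains(g)).count();
    // Cows: common digits that are not already counted as bulls.
    (bulls, common - bulls)
}
```

With the secret 4271 and the guess 1234 this returns (1, 2), matching the example above.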

The goal of the game is for player 2 to guess the secret with the least amount of guesses.


Bull sculptures from Cyprus, dated from sometime between 2000 and 1600 BC. Photo taken at the National Archaeological Museum in Athens.

In this post we’ll present computational experiments to solve a game of Bulls and Cows optimally.

Objective function

We’ll focus on a search strategy to minimize the maximum number of guesses one has to make, for any possible secret. Alternatively, John Francis’s paper [1] proposes heuristics that minimize the expected number of guesses; these heuristics might not be optimal for the worst case, but they perform better on average.

Because the game has turns, our solution is not a single list of guesses, but rather a decision tree where at each node we branch depending on the hint we get back. The metric we are trying to optimize is then the height of such tree.

To break ties between trees of the same height, we pick the smallest tree (the one with the fewest nodes).

Brute-force search

Our search algorithm is recursive. At any given recursion level, we are given a set containing all possible numbers that could be secrets. We’ll try all combinations of guesses and hints and see which ones yield the best solution.

When we simulate a given guess and hint, we are restricting the possible numbers that can still be secret, so in the next level of recursion, the set of potential secrets will be smaller.

For example, say that the possible secrets are any 4-digit numbers with no repeated digits. We then use one of them as a guess, say, [0, 1, 2, 3]. One possible hint we could receive is (3, 0). What are the possible secrets that would cause us to receive this hint? [0, 1, 2, 4], [0, 1, 2, 5] and [7, 1, 2, 3] are a few of those. If we recurse for this guess and hint, the set of secrets in the next level will be restricted to those that would return the hint (3, 0) for [0, 1, 2, 3].

Continuing our example, is [0, 1, 2, 3] the best guess we can make at that point? After recursing for all possible hints, we’ll have a decision tree rooted on [0, 1, 2, 3]. The key idea of the search is that we can minimize the height of the final decision tree by minimizing the subtrees at each recursion level (greedy strategy). Thus, we want to find the guess with the shortest subtree.

In pseudo-code, the search looks like this:
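A sketch of that recursion (identifiers are illustrative, not the post's original snippet):

```
search(possible_secrets):
    if possible_secrets has a single element:
        return a leaf with that element
    best = none
    for each guess in possible_secrets:
        subtrees = {}
        for each (hint, secrets_matching_hint) in
                partition of possible_secrets by the hint they would return for guess:
            subtrees[hint] = search(secrets_matching_hint)
        candidate = tree rooted at guess with children subtrees
        best = whichever of {best, candidate} has smaller height,
               breaking ties by node count
    return best
```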

We start with all possible secrets (4-digit numbers with no repeated digits), and the tree returned should be optimal in minimizing the number of guesses for the worst case.

Rust implementation

I initially implemented the search in Python, but it was taking too long, even when artificially restricting the branching. I re-implemented it in Rust and saw some 20x speedups (caveat: I haven’t really tried to optimize the Python version).

The main difference between the pseudo-code and the actual implementation is that we pre-group possible_secrets by their hint for guess, which is more efficient than scanning possible_secrets for all possible hints:
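A sketch of how that pre-grouping could look in Rust. The names compute_score and group_possibilities_by_score follow the post's text, but the bucket layout is an assumption (compute_score is repeated here so the snippet is self-contained):

```rust
const N: usize = 4; // length of a guess

// Compute the (bulls, cows) hint of a guess against a secret.
fn compute_score(secret: &[u8], guess: &[u8]) -> (usize, usize) {
    let bulls = secret.iter().zip(guess).filter(|(s, g)| s == g).count();
    let common = guess.iter().filter(|&g| secret.contains(g)).count();
    (bulls, common - bulls)
}

// Partition the candidate secrets by the hint they would produce for `guess`.
// The hint (bulls, cows) is flattened to the index bulls * (N + 1) + cows so
// that all (N + 1)^2 possible hints fit in one fixed-size vector of groups.
fn group_possibilities_by_score(
    possible_secrets: &[Vec<u8>],
    guess: &[u8],
) -> Vec<Vec<Vec<u8>>> {
    let mut groups = vec![Vec::new(); (N + 1) * (N + 1)];
    for secret in possible_secrets {
        let (bulls, cows) = compute_score(secret, guess);
        groups[bulls * (N + 1) + cows].push(secret.clone());
    }
    groups
}
```

A single pass over possible_secrets fills every hint group at once, instead of rescanning the whole set once per possible hint.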

The function group_possibilities_by_score() above makes use of compute_score and it also uses a fixed-length array for performance. The number of possible hints is proportional to the square of the guess length N; in our case N = 4.

Turns out that the Rust implementation is still not efficient enough, so we’ll need further optimizations.

Optimization – classes of equivalence

What is a good first guess? It doesn’t matter! We don’t have any prior information about the secret and every valid number is equally probable. For example, whether we guess [0, 1, 2, 3] or [5, 1, 8, 7], the height of the decision tree will be the same. An intuitive way to see why this is the case is that we could relabel the digits such that [5, 1, 8, 7] maps to [0, 1, 2, 3].

Francis [1] generalizes this idea for other cases. Say that at a given point we made guesses covering the digits 0, 6, 7, 8 and 9 at least once. Now say our next guess is [0, 8, 9, 3]. Here, 3 is the only digit we haven’t tried yet, but using the relabeling argument, we can see that [0, 8, 9, 1] would yield the same decision tree if we were to swap the labels of 1 and 3. This allows us to skip guesses that belong to the same class, which reduces the branching factor.

We can generate an ID representing a given class. One way to do this is by adding one to each digit we have tried before, mapping any digit we haven’t tried to 0, and then converting the resulting number from base 11 to base 10. For example, [0, 8, 9, 1] becomes [1, 9, 10, 0]. Read as a number in base 11, in base 10 it is 2530 ((((1*11) + 9)*11 + 10)*11 + 0). If we do the same with [0, 8, 9, 3], we’ll get the same number. The code below implements this idea.
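A sketch of that computation in Rust; the function name and bitset layout are assumptions consistent with the description:

```rust
const D: u64 = 10; // number of distinct digit values

// Compute a class id for `guess`: digits already tried map to (digit + 1),
// untried digits map to 0, and the sequence is read as a base-(D + 1) number.
// `visited_bits` has bit i set iff digit i appeared in a previous guess.
fn class_id(guess: &[u8], visited_bits: u32) -> u64 {
    let mut id = 0;
    for &digit in guess {
        let mapped = if visited_bits & (1 << digit) != 0 {
            (digit as u64) + 1 // tried before: shift up by one
        } else {
            0 // never tried: all such digits collapse to the same symbol
        };
        id = id * (D + 1) + mapped;
    }
    id
}
```

With digits 0, 6, 7, 8 and 9 marked as visited, both [0, 8, 9, 1] and [0, 8, 9, 3] yield the id 2530, so the second guess can be skipped.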

In our case, D = 10 and we store the set of visited digits in a bitset, visited_bits (that is, bit i is 1 if digit i has been visited).

On the search side, we keep a set of class IDs already visited and skip a guess if its class is already in there.

With this optimization the search algorithm runs in under 2 minutes. By inspecting the height of the resulting tree we conclude that the minimum number of guesses necessary for any secret is 7.

The complete code is available on GitHub.

Visualizing

The JSON output by the search algorithm is quite big (the smallest tree with height 7 has almost 7000 nodes). A more interesting way to inspect the data is to create a Bulls and Cows solver. We feed the JSON to this application and ask the user to select the outcome based on the secret they have in mind. We are basically traversing the edges of the decision tree.


I’ve uploaded the application to my personal website and the source code is available on GitHub.

Conclusion

I learned about this game pretty recently and was curious to find good strategies to solve it. From what I’ve been reading, the heuristics that yield good solutions are very complicated for a human to perform. This leads to a question: is there any strategy that is simple to follow but still reasonably good?

One natural extension for this problem is to use larger numbers of digits (N > 4) and have each element be sourced from 0 to D – 1. An exhaustive search might be prohibitive, but maybe we can come up with heuristics with constant guarantees. What is a lower bound for the number of guesses for variants with arbitrary N and D?

I struggled to implement this code in Rust, but I found the challenge worthwhile. I learned a bit more about its specifics, especially regarding memory management and data types.

In [1], the author mentions that with optimizations the search took 45 minutes to run on their laptop (2GHz Intel Core 2 Duo). I ran mine on an Intel i7 2.2GHz and was surprised by the running time of 2 minutes. CPUs are not getting exponentially faster these days and my code runs on a single thread.

References

[1] Strategies for playing MOO or “Bulls and Cows”.
[2] GitHub – Blog Examples: Bulls and Cows


Cell biology and programming


Rosalind Franklin was an English chemist and X-ray crystallographer. She is best known for her work on the X-ray diffraction images of DNA, particularly Photo 51, taken while at King’s College London, which led to the discovery of the DNA structure by Watson and Crick.

James Watson, Francis Crick and Maurice Wilkins shared the Nobel Prize in Physiology or Medicine in 1962, four years after Franklin passed away due to ovarian cancer. According to Wikipedia [1], Watson suggested that Franklin would ideally have been awarded a Nobel Prize in Chemistry as well.


Photo 51

In this post we’ll study some basic concepts of cell biology, mostly around the DNA. We’ll start by introducing the structure of the DNA and then two of its main functions: replication and synthesis of protein. Since this is a programming blog, we’ll provide analogies (some forceful) to computer systems to help us relate to prior knowledge.

The end goal of this post is to provide a good basis for later learning bioinformatics algorithms.

Genome

The genome is the set of information necessary for creating an organism. In a computer programming analogy, we can think of the genome as the entire source code of an application.

Chromosome

Let’s recall that cells can be classified into two categories: eukaryotic (from the Greek, good + nucleus) and prokaryotic (before + nucleus). As the name suggests, eukaryotic cells have a well-defined nucleus surrounded by a membrane.

In eukaryotic cells the genome is divided across multiple chromosomes which are basically compacted DNA. They need to be compacted to fit within the nucleus. According to Science Focus [3], if stretched, the DNA chain would be 2 meters long.

Human cells usually have 23 pairs of chromosomes, 22 of which are called autosomes and are numbered based on size (autosome 1 is the largest). The remaining pair are the sex chromosomes, each of which can be of type X or Y.

Men and women share the same types of autosomes, but females have two copies of chromosome X, while males have one X and one Y.

Chromosomes are usually depicted as X-like structures and neatly arranged [4], but recent research was able to visualize the actual structure and it looks more like:

(Figure: the actual structure of a chromosome [5])

Prokaryotes (e.g. bacteria) on the other hand typically store their entire genome within a single circular DNA chain.

In our computer programming analogy, chromosomes serve as units of organization of the source code, for example the files containing the code. If your application is simple, like a bacterium, the entire source code could be stored in a single file!

We can push the analogy even further and think of the tangled DNA within the chromosome as the “minification” step that JavaScript applications apply to the original source code to reduce network payload.

DNA

The deoxyribonucleic acid, more commonly known as DNA, is a structure usually composed of two strands (chains) connected through steps to form a double helix.


Conceptual representation of the DNA: the double helix

In our analogy the DNA could represent the text of the source code.

Nucleotide

Nucleotides are the discrete units that form the base of the DNA. In DNA, a nucleotide can be one of four: Adenine, Cytosine, Guanine and Thymine.

Chemically speaking, we can divide a nucleotide into 3 parts: a sugar group, a phosphate group and a nitrogen base. The first two are common among the nucleotides, while the base differentiates them.


Guanine, one of the 4 possible nucleotides in the DNA

Any two nucleotides can be connected through their sugar and phosphate groups (the sugar group of one nucleotide links to the phosphate group of the next). Nucleotides linked this way form the backbone of a single strand of the DNA.

In addition, two nucleotides can be linked together via their nitrogen base, but in this case, there’s a restriction on which two bases can be linked together. Adenine can only be paired with Thymine, and Cytosine can only be paired with Guanine. These pairings form the “steps” in between two DNA strands.


4 nucleotides linked together through the sugar-phosphate groups or through the nitrogen bases.

Because the phosphate group of a nucleotide is linked to the sugar group of the next, we can define a direction for a chain of nucleotides. The endpoint that ends with the phosphate group is called 5′ (read as five prime), because the phosphate attaches to the 5th carbon of the sugar, while the endpoint that ends with the sugar group is called 3′ (three prime), after the sugar’s 3rd carbon. The two strands in a DNA molecule are oriented in opposite directions, as depicted in the figure above.

We can now complete the analogy of computer programming by stating that nucleotides are the characters that compose the text of the source code (DNA). In our case, the alphabet contains 4 letters: A (Adenine), C (Cytosine), G (Guanine), T (Thymine).
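As a toy illustration of this 4-letter alphabet, here is a hypothetical helper (not from the post) that computes the complementary strand using the A-T and C-G pairings:

```rust
// Compute the complementary strand of a DNA sequence:
// A pairs with T, and C pairs with G.
fn complement(strand: &str) -> String {
    strand
        .chars()
        .map(|base| match base {
            'A' => 'T',
            'T' => 'A',
            'C' => 'G',
            'G' => 'C',
            other => other, // leave unknown characters untouched
        })
        .collect()
}
```

Complementing a strand twice yields the original sequence back, mirroring how each strand fully determines the other.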

The replication

The DNA is capable of replicating itself so that it can be passed down to newly formed cells.

At a high level, the double strands of the DNA start separating, and other structures start binding new nucleotides to each strand (the templates) until both strands are completely duplicated (and form double strands again).

The separation of the strands is triggered by the enzyme helicase. It can happen at any point of the DNA and might happen in many places at the same time. One way to visualize this is opening the zipper of a jacket from the middle and pushing it open in one direction.

While the strands are being separated, enzymes called DNA polymerases start scanning each strand and making a copy strand by adding complementary nucleotides to it.

The DNA polymerase can only extend an existing chain of nucleotides, so it requires an initial fragment to start with. For that reason, at the beginning of the duplication process, a small fragment of DNA or RNA, called a primer, needs to be connected to the strand.

One important limitation the polymerase has is that it can only process a strand in one specific direction: from 3′ to 5′. But since the strands are oriented in opposite directions of each other, the replication process doesn’t happen symmetrically on both strands.

For the strand oriented 3′ to 5′ the replication is smooth, generating a continuous strand. For the reverse strand, though, it is done in pieces, because the polymerase adds nucleotides on the opposite side from where the opening is happening. These pieces are known as Okazaki fragments and are later joined together by another enzyme, the ligase.

We can see these two cases taking place in the picture below:

(Figure: continuous replication on one strand and Okazaki fragments on the other)

One interesting fact about this process is that errors do happen, and there are error-correction mechanisms in place to minimize them. The result of the replication is still not 100% accurate, especially at the endpoints of the strands: each new replica has its endpoints shorter than the original template. To compensate for this, the DNA has repeated, redundant segments of nucleotides at the endpoints, known as telomeres, but eventually these extra segments wear off to a point where they cannot be used for replication. The shortening of telomeres is associated with aging.

In our analogy to computer programming, we could imagine the replication being the distribution of copies of the source code to other people. This analogy is weak, though. If we want to be precise, we’d need to come up with a program that is capable of printing its own source as output. This was named Quine by Douglas Hofstadter in Gödel, Escher, Bach [6] and it’s an example of a self-replicating automata, also studied by the famous computer scientist John von Neumann.

Protein Production

A second function of the DNA is the production of proteins. The first step, called transcription, is very similar to replication: one of the strands serves as a template, and a new strand with complementary bases is generated. The difference is that instead of Thymine, the nucleotide Uracil is used. The resulting structure is called mRNA, short for messenger RNA.


Production of mRNA and its exit to the cytoplasm

The mRNA detaches itself from the DNA strand and exits the nucleus to the cytoplasm, where the second step, translation, begins. There, it is “interpreted” by a structure called the ribosome. Every 3 nucleotides, called a codon, translate to an amino acid, and the amino acids in turn form a protein. The mapping of all 64 possible codons is displayed below:


Mapping of codons to amino-acids. For example, GGA maps to Glycine.

The tRNA, short for transfer RNA, is a small chain of RNA that connects amino acids to their corresponding codons in the mRNA. Some special codons indicate the end of the process. The output of this translation is a peptide chain.
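The translation step can be illustrated with a sketch that walks the mRNA three nucleotides at a time, mapping each codon to an amino acid and stopping at a stop codon. Only a few of the 64 table entries are included; the full table is in the figure above:

```rust
// Translate an mRNA sequence into amino acids, stopping at a stop codon.
// Only a handful of the 64 codon table entries are shown for illustration.
fn translate(mrna: &str) -> Vec<&'static str> {
    let mut peptide = Vec::new();
    for chunk in mrna.as_bytes().chunks(3) {
        let codon = std::str::from_utf8(chunk).unwrap();
        let amino_acid = match codon {
            "AUG" => "Methionine", // also serves as the start codon
            "GGA" => "Glycine",
            "UUU" => "Phenylalanine",
            "UAA" | "UAG" | "UGA" => break, // stop codons end translation
            _ => "?", // remaining codons omitted in this sketch
        };
        peptide.push(amino_acid);
    }
    peptide
}
```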


Synthesis of a protein

The production of proteins is the actual functioning of the DNA. Linking back to the computer programming analogy, we could think of the source code (DNA) being interpreted or compiled into an actual program that can be executed. It’s incredible how biological systems evolved to define an explicit code for the end of the translation, much like how a semicolon or newline often indicates the end of an expression.

Applications of Bio-informatics

How can computer science help with the study of biological systems? Computers are good at repetitive tasks and at handling large amounts of data. Here are some applications according to Wikipedia [7].

  • DNA sequencing – consists of determining the order of nucleotides in the DNA from raw data. Computers are useful here because the raw data often come as fragments that need to be merged.
  • Protein structure prediction – determining the chain of amino acids from a sequence of nucleotides is well understood, but predicting the structure of the final protein is an open problem.
  • Reduce noise in massive datasets output by experiments.

Conclusion

From reading a lot of introductory material, I felt that there were many imprecise or ambiguous descriptions. That seems to come from the sheer complexity of biological systems and the fact that most results come from empirical evidence, which currently provides only an incomplete picture of the whole. New information still arrives at a very frequent pace, sometimes making existing models obsolete or incorrect.

I last studied Biology back in high school. I don’t intend to study it in depth, but just enough to understand how computer science can be used to solve particular problems.

The idea of modeling living organisms as organic computers is fascinating. As we discover more about the inner workings of cells, I hope we can come up with better models and be able to predict their outcome with greater accuracy.

References

[1] Rosalind Franklin – Wikipedia
[2] Photo 51 – Wikipedia
[3] How long is your DNA?
[4] How many chromosomes do people have?
[5] What a chromosome really looks like
[6] Gödel, Escher, Bach: An Eternal Golden Braid – Douglas R. Hofstadter
[7] Bioinformatics – Wikipedia

HyperLogLog in Rust

In this post we’ll study the HyperLogLog algorithm and provide an implementation in Rust.

Introduction


Philippe Flajolet was a French computer scientist at INRIA [5]. He introduced the field of Analytic Combinatorics and is known for the creation of a family of algorithms of probabilistic counting, including the HyperLogLog.

HyperLogLog is a probabilistic algorithm to determine the number of distinct elements (or cardinality) of a multi-set (a set that allows repeated values) with high accuracy using very low memory. This makes it suitable for streaming applications or distributed databases in which we do not have the luxury of keeping all distinct values in memory at any time.

The HyperLogLog

The intuition behind the HyperLogLog is the following. Imagine we have a basket full of balls, each containing a number, which can be repeated. We want to estimate how many distinct numbers we see in the basket with the limitation that we can only look at one ball at a time and we do not have paper and pen, so we have to rely on memory.


The number of balls in the basket can be very large, so it’s unrealistic to keep track of all possible values we see, and we need a proxy to estimate the number of distinct values. The trick is to find some property that only very rare numbers satisfy, and the intuition is that if such property was satisfied by any of the numbers we saw, then there’s a good chance we saw a lot of numbers (or we’re very lucky).

To start, say that the property is “multiple of 2”. That is not a rare property, so if we assume the numbers in the balls do not have any special pattern, on average we just need to look at 2 different numbers to see that property being satisfied. Now say the property is “multiple of 4”. Fewer numbers satisfy this property, so the average number of distinct values we need to look at would be 4. We can keep going with higher and higher powers of 2, and the average number of distinct values we need to look at grows accordingly.

This means that if we keep track of the largest power of 2 that divides any number we drew, we could use it as an estimate of the number of distinct values! The power-of-2 property is interesting because it doesn’t change if there are duplicated values (which is great because we only want distinct values) and it doesn’t rely on the magnitude of the values, as long as the set of values is uniformly distributed.

This estimate is not great, though, because the error is proportional to the number of distinct values found. For example, if we found that the largest power-of-2 divisor was 2048 (2^11), the actual number of distinct values could be anywhere between 2048 and 4095 (2^12 - 1). Also, we could get unlucky and have all the balls in the basket contain a single number, say 1024, so the estimate would be off by a lot.

Averaging

To reduce the chances of drawing a ball with a value that turns out to be a large power of 2 and throwing off the estimates, we divide the balls into m groups. The important property of the groups is that balls with the same values should go to the same group and the distribution should be uniform.

We can then find the largest power-of-2 divisor for each group and average them out. This would help in the case where we saw a single number 1024, because while one group would estimate it to be 1024, all the other groups, being empty, would find the largest divisor to be 1.

To assign each number to a group, we could use modular arithmetic. For example, if we had 12 groups, we could use the remainder of a number divided by 12 as an index to the group. The problem is that the elements assigned to a given group would share a property that biases the result. For example, in the group corresponding to remainder 0, all numbers are obviously multiples of 12 and hence of 4 (2^2). To make the numbers in a group unbiased, we need to discard the information used to assign them to groups. An easy way to achieve that is to represent each number in binary, use the first bits to decide which group to assign it to, and then discard those bits when computing the power of 2.

We can increase the number of groups to reduce errors, but the tradeoff is that it requires keeping more information in memory. This idea was first proposed by [2] and was called stochastic averaging.

Harmonic Mean

The HyperLogLog algorithm uses a different combination of the powers of two to obtain a better estimate than a simple average: the harmonic mean [1]. To recall, the harmonic mean consists of averaging the reciprocals of the elements and then taking the reciprocal of the result. For example, the harmonic mean of 1, 4 and 4 is

H = \frac{3}{\frac{1}{1} + \frac{1}{4} + \frac{1}{4}} = \frac{3}{1.5} = 2

The bulk of the HyperLogLog paper [1] is actually proving that this metric yields an estimate with smaller errors than a simple average.

Hashing

Not all real world input will be numbers that are distributed uniformly. For example, we might be interested in counting the number of distinct values of a string column in a database. We then need to transform these values using a hash function which maps these inputs to numbers uniformly.

Algorithm

We are now ready to outline the core of the algorithm to estimate the cardinality of a multi-set of values.

Given a stream of n elements:

  • For each element
    • Hash it
    • Use the first bits of the hash to determine which group j to assign it to
    • Discard those bits
    • Keep track of the largest power of 2 that divides some value in the group, and store the exponent in M[j]

Finally, estimate the number of distinct values based on a harmonic mean of the values in M.

The Expected Value and the Alpha Correction

The expectation of the number of distinct values is given by [1]:

E = \alpha_m m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}

E is equivalent to the harmonic mean of the values 2^{M[j]} times \alpha_m m, where \alpha_m is a constant that corrects some bias. For implementation purposes, the authors provide approximations to this constant:

\alpha_{16} = 0.673, \quad \alpha_{32} = 0.697, \quad \alpha_{64} = 0.709, \quad \alpha_m = \frac{0.7213}{1 + 1.079/m} \text{ for } m \ge 128

Small and Big Range Corrections

When the estimated number of distinct elements is relatively small compared to the number of groups m, the authors propose a correction.

Let V be the number of groups to which no element was assigned to. If V > 0, then experiments show that for E \le \frac{5}{2} m, we can change the estimate to

E^* = m \ln \left( \frac{m}{V} \right)

Conversely, if the number of distinct elements is very high, close to 2^{32}, then the probability of hash collisions is high. The correction accounts for that:

E^* = -2^{32} \ln \left( 1 - \frac{E}{2^{32}} \right)

Implementation in Rust

We’ll now present an implementation in Rust. The first part of the algorithm consists of computing M, which we name more descriptively first_non_zero_by_bucket. The following code implements the pseudo-code described above:
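A self-contained sketch of that function, assuming the values were already hashed into u32s (the post hashes strings with farmhash first; the helper usage here relies on u32::trailing_zeros rather than the post's own first_non_zero_bit_position):

```rust
// For each hash: the lowest b bits pick one of 2^b buckets and are then
// discarded; per bucket we keep the largest 1-based position of the first
// non-zero bit seen among the remaining bits.
fn first_non_zero_by_bucket(hashes: &[u32], b: u32) -> Vec<u32> {
    let num_buckets: u32 = 1 << b;
    let mut m = vec![0u32; num_buckets as usize];
    for &hash in hashes {
        // (2^b) - 1 is a mask with the lowest b bits set to 1.
        let bucket = (hash & (num_buckets - 1)) as usize;
        // Dividing by 2^b, i.e. shifting right by b, discards the bucket bits.
        let remaining = hash >> b;
        // 1-based position of the first set bit (0 when no bit is set).
        let position = if remaining == 0 {
            0
        } else {
            remaining.trailing_zeros() + 1
        };
        m[bucket] = m[bucket].max(position);
    }
    m
}
```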

We don’t need to know much Rust to read the code above. It’s worth mentioning that Rust is very strict about types, so we need to perform explicit conversions. We also use 2 bit operations: one is to obtain the least significant k bits of an integer by using a bit mask. It relies on the fact that (2^k)-1 is a number whose k lowest bits are 1, so doing a bitwise AND with any number extracts those k bits. The other trick is dividing a number by 2^k, which can be done by shifting its bits to the right via the >> operator.
 
The hash function we use here is from the package farmhash, which is a Rust implementation of Google’s Farmhash, which in turn is a variant of Murmurhash [6]. It basically takes a string and shuffles its bits in a hopefully uniform way, generating a 32-bit integer:

first_non_zero_bit_position() is given by:
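A possible implementation (a sketch, scanning from the least significant bit):

```rust
// 1-based position of the least significant set bit; 0 for the value zero.
fn first_non_zero_bit_position(mut x: u32) -> u32 {
    if x == 0 {
        return 0;
    }
    let mut position = 1;
    while x & 1 == 0 {
        x >>= 1; // divide by 2 until the lowest bit is 1
        position += 1;
    }
    position
}
```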

The formula to obtain the expected number of distinct values is given by

E = \alpha_m m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}

The code below implements this function:
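A sketch implementing this formula together with the alpha approximations from the paper (the function names raw_estimate and alpha are illustrative, not necessarily the post's):

```rust
// Alpha approximations from the HyperLogLog paper.
fn alpha(m: usize) -> f64 {
    match m {
        16 => 0.673,
        32 => 0.697,
        64 => 0.709,
        _ => 0.7213 / (1.0 + 1.079 / (m as f64)),
    }
}

// Raw estimate E = alpha_m * m^2 / sum_j 2^(-M[j]), where the input slice
// holds the per-bucket exponents M[j].
fn raw_estimate(first_non_zero_by_bucket: &[u32]) -> f64 {
    let m = first_non_zero_by_bucket.len() as f64;
    let denominator: f64 = first_non_zero_by_bucket
        .iter()
        .map(|&exp| 2f64.powi(-(exp as i32)))
        .sum();
    alpha(first_non_zero_by_bucket.len()) * m * m / denominator
}
```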

The values of alpha were discussed in The Expected Value and the Alpha Correction section above.

The correction for small and large ranges can be implemented as:
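A sketch of both corrections, following the thresholds from the paper; V (the number of empty groups) is passed in explicitly here:

```rust
// Apply the small and large range corrections to a raw estimate.
// `m` is the number of groups and `empty_buckets` is V, the number of
// groups to which no element was assigned.
fn corrected_estimate(raw: f64, m: usize, empty_buckets: usize) -> f64 {
    let m = m as f64;
    let two_pow_32 = 2f64.powi(32);
    if raw <= 2.5 * m && empty_buckets > 0 {
        // Small range: linear counting over the empty buckets, using the
        // natural logarithm (the source of the bug mentioned below).
        m * (m / empty_buckets as f64).ln()
    } else if raw > two_pow_32 / 30.0 {
        // Large range: account for hash collisions in the 32-bit hash space.
        -two_pow_32 * (1.0 - raw / two_pow_32).ln()
    } else {
        raw
    }
}
```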

The complete code is available on GitHub.

Experiments

For values of b ranging from 4 to 16, I ran the program 100 times with n = 100k, with numbers randomly selected from 1 to 100k. I then plotted the results as box plots using an R script:

(Box plots of the relative error of the estimate for each value of b)

In the chart above, the x-axis represents b, which determines the number of groups we divided the input into, and the y-axis represents the relative error compared to the actual value. We can see that as we increase the number of groups, the errors go down.

This chart was also helpful in finding a bug in my implementation: the initial plot had a sudden spike for values larger than 11, which was when the small range correction applied. After some debugging, I realized that the algorithm should use the natural logarithm, not log2. It was great to spot the bug via data visualization!

Conclusion

I’ve been wanting to study the HyperLogLog algorithm for a while, and also one of my resolutions for this year is to learn Rust. It was a productive exercise to implement it.

I’ll share my impressions of Rust, coming from a C++ background, in a future post.

References

[1] HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
[2] Probabilistic Counting Algorithms for Database Applications
[3] Thu Trang Pham – Curiosity #2: How does Prestodb implement approx_distinct?
[4] Probabilistic Data Structures for Web Analytics and Data Mining
[5] Gödel’s Lost Letter and P=NP: Philippe Flajolet 1948–2011
[6] GitHub: seiflotfy/rust-farmhash

Paper Reading – F1: A Distributed SQL Database That Scales

In this post we’ll discuss F1: A Distributed SQL Database That Scales. We’ll provide a brief overview of the paper’s contents and study the architecture of the system and its implementation details in more depth. At the end, we provide an Appendix covering some distributed systems and database concepts mentioned throughout the paper.

F1 is a database used at Google to serve data for the AdWords product. This is the system that displays embedded ads in external pages using Google’s infrastructure.

F1 is named after a term from Genetics, F1 hybrid, in analogy to the idea that it combines the best aspects of relational databases and NoSQL systems (my initial thought was that it was named after Formula One).

At a high level, F1 is a layer on top of the Spanner database, which we covered in a previous post. It adds features such as a SQL interface (worth noting that Spanner has recently evolved to support SQL natively too). Spanner handles a lot of the distributed systems difficulties under the hood, providing some guarantees, and F1 builds relational database features on top of that.

The paper mentions that AdWords migrated from a sharded MySQL solution, which was very expensive to maintain, especially when the data outgrew an instance and re-sharding was necessary. Now, this data redistribution is handled transparently by Spanner, and while F1 is theoretically slower, they were able to structure the data in such a way that it’s performant in practice (5-10ms for reads and 50-150ms for writes).

Architecture

F1 is deployed into multiple servers, across geographically distributed data centers. A client sends requests to a load balancer that then redirects to an F1 server, usually in the closest datacenter.


Architecture of F1: In display are two datacenters containing multiple machines running F1 servers, Spanner and CFS instances.

F1 servers are usually in the same Data Center as the Spanner servers that contain their data. F1 can communicate to Spanner servers outside of its datacenter directly, but Spanner only communicates with the Colossus file system (CFS) within the same datacenter because CFS is not globally replicated.

F1 servers are stateless, except when the client performs pessimistic transactions, which makes them easier to scale, since no data movement is required.

In addition to F1 servers, there is an F1 master and a pool of F1 slaves. According to the paper, the master controls which processes are in the pool, and F1 slaves execute parts of the query plan on behalf of F1 servers, but it’s not clear why they need to be a separate component.

Data Model

The table data can be organized in a hierarchical schema much like Spanner (see “Data Model” in the post about Spanner). This hierarchical schema is optional, but it’s a key piece in making the performance of F1 competitive with the previous sharded MySQL system AdWords used.

In this hierarchy, we have a root table and a child table. For each row in the root table, called root row, we have rows clustered under that row based on their matching keys.

For example, say that we have tables Customer and Campaign. The Customer table has key (customerID) and the Campaign table has key (customerID, campaignID). A possible structure for the rows would be:


Where Campaign rows with customerID=3 are clustered together under the corresponding Customer row.

The paper mentions that a given cluster of child rows fall within the same Spanner directory, which means that a transaction would query a single Spanner server, avoiding the overhead of synchronization between multiple servers (see Read-write transactions in the Spanner post).

Indexes. Tables support two types of indexes: local and global. Local index entries are stored in the same Spanner server as the rows they index, but their keys must include the key of the root table. Global indexes have no such restriction, but they can be distributed across multiple Spanner servers and are very expensive to update within a transaction.

Schema changes. Updating the schema of a table is challenging because the rows of the table are distributed, which makes schema consistency expensive to achieve, and the system cannot have downtime.

The proposed solution is to break a schema change into smaller steps, such that the change is safe as long as no two servers are more than two steps apart.

Writes

Transactions

F1 supports 3 types of transactions: snapshot transactions, pessimistic transactions and optimistic transactions. It relies on the transactions supported by Spanner, so it’s worth re-reading the Implementation Details section of the Spanner post.

Snapshot transactions are read-only transactions and map to the corresponding read-only transactions from Spanner. The pessimistic transaction maps to the read-write transactions from Spanner.

The optimistic transaction consists of multiple reads followed by a single write. F1 stores an extra column with the last modified timestamp for each row. When the client performs the reads, it stores the largest timestamp it saw for any row it received. It sends this timestamp together with the final write request. F1 then performs a pessimistic transaction to read only the timestamps of the affected rows. If any of them differ from the timestamp sent by the client, it aborts. Otherwise it performs the write.
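As a rough sketch of this validation protocol (all names and types below are made up for illustration; this is not F1’s actual interface):

```ocaml
(* Sketch of the optimistic validation step. The store maps keys to
   rows; [ts] plays the role of the last-modified timestamp column. *)
type row = { value : string; ts : int }

let store : (string, row) Hashtbl.t = Hashtbl.create 16

(* Reads happen outside any transaction; the client keeps the largest
   timestamp it has seen across all rows it read. *)
let read key =
  let r = Hashtbl.find store key in
  (r.value, r.ts)

(* At commit time, re-read only the timestamps of the affected rows and
   abort if any of them is newer than what the client observed. *)
let commit ~read_ts ~now writes =
  let stale (key, _) =
    match Hashtbl.find_opt store key with
    | Some r -> r.ts > read_ts
    | None -> false
  in
  if List.exists stale writes then Error "abort: row modified since read"
  else begin
    List.iter
      (fun (key, value) -> Hashtbl.replace store key { value; ts = now })
      writes;
    Ok ()
  end
```

In a real system the timestamp check and the writes would run inside a single pessimistic (read-write) transaction; here they are just sequential calls.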

It’s worth noting that F1 doesn’t create a transaction until the very last write request is sent. It assumes there were no writes between the reads and the final write, which is why we say it’s optimistic. This is the default transaction type, and the paper describes a few advantages it has over the pessimistic type.

Locks

F1 supports very granular locking of the tables, including row level and cell (set of columns for a given row) level.

Change history

F1 stores the change history of the data. The changes are stored as regular tables in a structure called ChangeBatch, a child of the root table. If a transaction updates multiple root rows, several ChangeBatches are created under their corresponding root tables.

This feature allows clients to subscribe to changes in F1 tables using a publish-subscribe system. Whenever a client receives a notification, it can ask for the incremental changes after a specific timestamp and apply them to the data it has stored locally.

Reads

F1 supports both NoSQL and SQL interfaces. The SQL dialect was extended to support Protocol Buffers, which are complex data types with strong types (as opposed to loosely typed structures such as JSON). This extension allows, for example, reading and updating internal fields of such structures.

Local vs distributed. F1 supports centralized execution (running on a single F1 node) or distributed. The query optimizer decides which one to execute. The paper details the process of executing the query in a distributed fashion.

Distributed Query Example

The paper describes an example of a SQL query being mapped to an execution plan. It involves two joins, filters and aggregations:

[Figure: the sample SQL query]

This SQL query is parsed and converted to a query plan whose nodes, called operators, are executed across multiple machines. A possible execution plan for the sample query is:

[Figure: execution plan for the sample query]

The arrows indicate the data flow, starting at the bottom. The first join is a lookup join (hash join). The operation reads rows from the AdClick table until it has about 100k unique lookup keys stored in a hash table. Then it performs a single (batch) lookup to the other table, AdGroupCreative. The rows from the AdClick table are kept in memory for faster lookup.

As soon as the join for these keys is completed, the results are streamed to the next stage of the pipeline.
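A minimal sketch of this batched lookup join might look as follows; lookup_batch is a hypothetical stand-in for the remote batch read against the other table:

```ocaml
(* Sketch of a batched lookup join (not F1's actual code): left-side
   rows are buffered in a hash table keyed by the join key, and once the
   batch is full, a single [lookup_batch] call resolves all keys against
   the other table at once. *)
let lookup_join ~batch_size ~lookup_batch rows =
  let out = ref [] in
  let batch = Hashtbl.create batch_size in
  let flush () =
    if Hashtbl.length batch > 0 then begin
      let keys = Hashtbl.fold (fun k _ acc -> k :: acc) batch [] in
      (* One remote request resolves the whole batch of keys. *)
      let resolved = lookup_batch keys in
      Hashtbl.iter
        (fun key left_rows ->
          match List.assoc_opt key resolved with
          | Some right ->
              List.iter (fun l -> out := (l, right) :: !out) left_rows
          | None -> ())
        batch;
      Hashtbl.reset batch
    end
  in
  List.iter
    (fun (key, row) ->
      let existing = Option.value (Hashtbl.find_opt batch key) ~default:[] in
      Hashtbl.replace batch key (row :: existing);
      if Hashtbl.length batch >= batch_size then flush ())
    rows;
  flush ();
  !out
```

Joined results are emitted batch by batch, which mirrors the streaming behavior described above.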

The second join with Creative is a distributed join. It first repartitions each row from each table based on the values of the columns listed in the USING clause. Each partition might end up in different machines for the next stage which consists of joining the columns of matching rows.
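The repartition step can be sketched like this (the partition count and hash function are illustrative): each row is routed to a partition based on a hash of its join key, so matching rows from both tables land in the same partition, where the join can proceed locally.

```ocaml
(* Route each (key, row) pair to one of [num_partitions] buckets based
   on a hash of the key. Rows sharing a key always share a bucket. *)
let repartition ~num_partitions rows =
  let buckets = Array.make num_partitions [] in
  List.iter
    (fun (key, row) ->
      let p = Hashtbl.hash key mod num_partitions in
      buckets.(p) <- (key, row) :: buckets.(p))
    rows;
  buckets
```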

Finally, the rows are repartitioned again, now by the values of the GROUP BY columns, and then aggregators apply the aggregation to the sets of rows within the same partition.

Distributed Execution Overview

More generally, the query plan created by F1 is a DAG (directed acyclic graph) where each node is an operator like the join or aggregator described above. Note that there can be multiple operators running the same operation in parallel.

The paper says:

A technique frequently used by distributed database systems is to take advantage of an explicit co-partitioning of the stored data.

It’s not very clear to me what they mean by that, especially because they don’t cite any references, but from context it suggests that the base operators (the scan and lookup join) are on the same machine as the data (co-located) and do as much of the processing upfront as possible. This helps minimize data transfer, which can become the bottleneck in a distributed computation. F1 cannot do that because Spanner abstracts the data location away from F1.

A side effect is that there’s a lot of network traffic. The authors claim that Google has network switch hardware improvements that allow servers to communicate with each other at close to full network speed.

When the in-memory hash tables grow too large, part of the data is written to disk. So while F1 doesn’t store data persistently itself, it still needs to write to disk for intermediate operations.

For efficiency, F1 doesn’t write checkpoints to disk. The downside is that the execution is not fault tolerant: a failure in any stage causes the entire query to fail. Retries are done transparently, but long queries (>1h) are bound to fail.

Other features

F1 exposes data from intermediate nodes to clients. This avoids having all the data concentrated at the last node of the query execution. The paper cites Map-Reduce jobs as example consumers of this feature.

Conclusion

In this post we learned about one of Google’s many distributed databases, F1, a loosely coupled layer on top of Spanner that provides a more familiar abstraction: a relational database.

It seems that we could draw an analogy between Google’s systems and similar open source solutions: the Colossus File System (CFS) could map to the Hadoop Distributed File System (HDFS), Spanner to Hadoop’s YARN, and F1, which provides SQL semantics on top of Spanner, to Hive, which does the same for Hadoop. It’s a very rough comparison, and maybe Spanner is more similar to Spark, but it’s interesting to see the patterns and relationships between these systems.

References

[1] F1: A Distributed SQL Database That Scales
[2] Spanner: Google’s Globally-Distributed Database

Appendix: Terminology and Concepts

I was unfamiliar with some of the terminology and concepts used throughout the paper, so I had to do some extra research. Here I attempt to explain some of these topics by quoting snippets from the paper.

We also have a lot of experience with eventual consistency systems at Google

Eventual consistency means that a set of servers might not contain the most recent updates but eventually will. For example, the client might issue a write that affects multiple machines. If the system only provides eventual consistency then a subsequent read is not guaranteed to get the data up-to-date with those writes.

The quote mentions experience with eventual consistency in a negative way, because of the extra complexities that clients have to deal with to work around this limitation.

CSS Layout

Introduction

In a previous post we studied some major components of the browser, including the rendering engine. In this post we’ll dig a bit further on the layout of the render tree and take a look at an important piece of this process: the cascading style sheets or CSS.

To recap, every DOM element is usually represented by one rectangle. The job of the rendering engine is to determine three properties of these rectangles: their size (height, width), their position (top, left), and their stacking order when they overlap. In this post we’ll see how different CSS properties can affect the resulting rectangle.

Note: we’ll use box and rectangle interchangeably in this post.

CSS

History


CSS stands for cascading style sheets. It was proposed by Håkon Wium Lie in 1994.

The CSS 1 specification was finished in 1996. CSS 2 was created in 1997 to address some issues with the previous version. CSS 3 was started in 1998 and hasn’t been concluded yet! The thing is that CSS 3 is subdivided into modules, each fairly independent of the others, which has resulted in different modules being at different stages.

The diagram below provides a good overview of the different CSS modules and their stages:


CSS Modules and levels

Wikipedia has an interesting history of the development of CSS, including the initial lack of compliance with the specification, which caused a lot of headaches for front-end developers, especially when working with early versions of Internet Explorer.

It’s interesting to take a look at the early days of CSS because they play a big role in how it looks today (due to backward compatibility). It’s useful to remember that when CSS first came around, web pages were generally pure HTML containing mostly text, so a lot of CSS was designed around concepts such as paragraphs and simple images.

In this post we’ll focus on the CSS modules that affect the layout of the DOM elements, in particular the Visual formatting model. According to the spec, these are factors that influence the layout of a box:

* Box dimensions (height and width properties)
* Type of box (display property)
* Positional schemes (position, left and top properties)
* Relationship between elements (hierarchy of DOM tree)
* External information (e.g. window size)

We’ll briefly cover some of these properties and then play with a few examples.

The display property

There are many possible values for the display property. The main ones are none, block, inline and inline-block. I’ve been using flex increasingly but that deserves a post in itself (this article is a great reference).

Display none removes the element from the layout calculation so it’s effectively invisible.

A value of block causes the element to be visually formatted as a block [5] (other values like list-item and table do too). In general, a block is a box that starts on a new line and takes the entire width of the parent.

An inline box, on the other hand, is laid out horizontally after the previous box. It also ignores explicit width and height values and any explicit vertical spacing (i.e. top/bottom margin and padding).

The main difference between inline and inline-block is that the latter does account for width and height, as well as vertical spacing [6].

The position property

There are 5 possible values for the position property: static, relative, absolute, fixed, sticky.

static is the default positioning scheme and follows the normal flow of the page.

A relatively positioned element accounts for the top and left properties, which offset it relative to its normal (static) position.

An absolutely positioned element is similar to a relative one, except that it is removed from the normal layout flow (i.e. other elements ignore its existence when being positioned) and its top and left are relative to its closest positioned ancestor in the DOM tree (or the document if there is none). A positioned element is any element with position != static.

Here is an example where we only change the inner (red) div to static, relative and absolute, respectively.

[Figure: the inner div with position static, relative and absolute]

In the first example, A ignores the top/left properties. In the third example, it “crosses” the boundary of the middle (blue) box because the blue box is not positioned.

Note that in the code above we have set overflow to auto. This is a hack to prevent margin-collapsing.

An element with position:fixed is similar to position:absolute, except that instead of having its offset relative to an ancestor with position:relative, it is usually relative to the viewport, which implies that such an element does not scroll with the page. The special case happens when one of its ancestors has transform, perspective, or filter set to something other than none, in which case it behaves much like an absolutely positioned element.

Finally, as described in [2] an element with position:sticky is treated as relatively positioned until it crosses a specified threshold, at which point it is treated as fixed until it reaches the boundary of its parent.

The float property

When an element is floated, it is taken out of the normal layout flow of the document. It is shifted to the left (assuming float:left) until it touches the edge of its containing box, or another floated element.

Because float implies the use of the block layout, it modifies the computed value of most of the display values to block.

Floats are better understood in relationship with other types of boxes. Let’s check some examples.

Float + Block

Whether the renderer accounts for a floated element when positioning a new block depends on the block formatting context. For example:

[Figure: a float alongside blocks in different formatting contexts]

The green block ignores the presence of the blue block but its child (yellow) does not. That’s because blue and yellow are in different block formatting contexts.

Float + Inline

Analogous to a block context, an inline formatting context is a set of inline elements. In such context, the boxes are laid out horizontally. Each “row” of inline elements is called a line box. In the presence of floats, the spec states the following:

In general, the left edge of a line box touches the left edge of its containing block and the right edge touches the right edge of its containing block. However, floating boxes may come between the containing block edge and the line box edge. Thus, although line boxes in the same inline formatting context generally have the same width (that of the containing block), they may vary in width if available horizontal space is reduced due to floats.

We can see an example of that here. The first three line boxes have a shorter width than the fourth one due to the presence of the floated element.

[Figure: line boxes shortened by a float]

Clear

When applied to an element, clear moves its margin edge below the margin edge of all relevant floats (depending on whether it’s clear: left, right or both). Here’s an example:

[Figure: clear example]

The z-index property

Besides determining the size and position of boxes, the layout engine needs to determine how to handle overlapping. The ordering of boxes is transitive: if a box A is under B, and B is under C, then A has to be under C.

The main attribute to control the stacking order of elements is the z-index property (in reference to the z-axis, commonly used as the “depth” dimension in 3D). But this number only applies to boxes within the same stacking context. As we’ll see now, it’s more complicated than it seems.

First, we need to define the concept of a stacking context. A stacking context encompasses a set of DOM elements that can be compared to each other. The order between stacking contexts always takes precedence over the individual order of elements within a stacking context. For example, imagine that we have 2 stacking contexts:

Stack context 1: [A -> B -> C -> D]
Stack context 2: [E -> F -> G]
Stack context 1 -> Stack context 2

The arrow (->) indicates that the element on the left is on top of the element on the right. Because A, B, C and D belong to context 1, all of them are placed over E, F and G, no matter how large the z-indexes of the elements in the second group are.

Another way to see it is that a stacking context defines an “atomic position”, meaning that elements from outside it cannot be placed in between its elements. It has to be either above or below.

In the DOM tree, if a node E satisfies some conditions, it starts a new stacking context, which means that all elements in the DOM subtree of E will be under that stacking context (we say E is the root of that stacking context). Note that a stacking context can contain other stacking contexts. A few properties that cause a stacking context to be formed are:

* position: absolute or relative and z-index other than auto
* position: fixed or sticky
* opacity less than 1.
* others

Note that z-index is ignored for elements with position: static, so a corollary is that whenever z-index takes effect, the element also creates a new stacking context.

Within a stacking context, the order of elements is defined by the spec. Here’s a simplified version, from bottom to top:

* The background and borders of the element forming the stacking context;
* Stacking contexts with negative z-indexes;
* Non-positioned block boxes;
* Non-positioned floats;
* Non-positioned inline boxes;
* Stacking contexts with positive z-indexes;

Another corollary is that a parent is always rendered below its children due to the first rule above.
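As a toy model of this ordering (the category names below are invented, mirroring the simplified list above), we can express the paint order within a single stacking context as a sort key:

```ocaml
(* Toy model of the paint order within one stacking context. Lower rank
   paints first (bottom of the stack). *)
type box =
  | Background               (* background/borders of the context's root *)
  | NegativeContext of int   (* child stacking context, z-index < 0 *)
  | Block                    (* non-positioned block box *)
  | Float                    (* non-positioned float *)
  | Inline                   (* non-positioned inline box *)
  | PositiveContext of int   (* child stacking context, z-index > 0 *)

let rank = function
  | Background -> (0, 0)
  | NegativeContext z -> (1, z)
  | Block -> (2, 0)
  | Float -> (3, 0)
  | Inline -> (4, 0)
  | PositiveContext z -> (5, z)

(* Sort boxes from bottom to top of the stack. *)
let paint_order boxes =
  List.sort (fun a b -> compare (rank a) (rank b)) boxes
```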

Here is an example with elements of each of these categories.

[Figure: stacking order example]


Here is an interesting example with stacking contexts.

[Figure: stacking contexts example]

The red box is the parent, so it has to be under the green box. The green, purple and black boxes are in the same stacking context, so they are ordered by their z-indexes (since red doesn’t start a stacking context, green belongs to the top-level one). Finally, note how pink has the highest z-index but is still placed under green, because it belongs to the stacking context of purple, which is placed under green.

References

[1] MDN Web Docs: CSS display
[2] MDN Web Docs: CSS Position
[3] MDN Web Docs: CSS Float
[4] A Complete Guide to Flexbox
[5] MDN Web Docs: Block formatting context
[6] StackOverflow: display: inline vs inline-block
[7] All about floats
[8] CSS 2.1 Specification: Visual formatting model

Conclusion

My main goal with this study was to learn more about the layout algorithm used by render engines, but it turned out to be an analysis of some CSS properties. It was an interesting exercise anyway. I work with CSS on a daily basis, but often I don’t take the time to read the spec or to understand why a given set of properties behaves in a certain way.

2017 in Review

This is a meta-post to review what happened in 2017.

I focused on finishing reading Purely Functional Data Structures, by Chris Okasaki, which was one of my goals from last year, so it’s no surprise that more than half of my posts were notes on it. In the process, I learned more about OCaml’s features. Speaking of OCaml, I used it to answer a question on subsequences.

I also completed a course on Scala, which was another of my 2017’s resolutions.

I finally took some time to play with my Raspberry Pi, developing a web server and a simple monitoring system.

Other topics included learning more about how JavaScript works and about Google’s Spanner database. I also attended a conference, the OpenVis Conf.

Personal

The end of the year is a good time to look back and remember all the things I’ve done besides work and the technical blog.

Trips

As in 2016, 2017 was very busy with trips.

I visited quite a few cities in the Eastern US this year, including Washington DC, Baltimore, Boston and Chicago. I must say they have better museums than the West Coast. My favorite ones were the Smithsonian and the Boston MFA.


Top: Lincoln Memorial in Washington DC, Downtown Boston. Bottom: Marina City Towers in Chicago, MIT in Cambridge.

I love National Parks and was glad to be able to visit a few of them. This included trips to Sequoia and Kings Canyon National Parks, here in California. In Seattle, I made a quick trip to the Olympic National Park. And more recently, Carlsbad Caverns and Big Bend (New Mexico and Texas).


Top: Trail in Kings Canyon National Park; Cape Flattery, near Olympic National Park.
Bottom: Santa Elena Canyon in Big Bend National Park; White Sands National Monument

For a long time I wished to visit Japan and I was finally able to make it. I spent 2 weeks there, exploring the main cities like Tokyo, Kyoto, Nara, Osaka, Komatsu, Hiroshima, Himeji.


Top: 1. Tokyo, 2. Temple in Nara, 3. Fushimi Inari-taisha.
Bottom: 4. Itsukushima Shrine, 5. Himeji Castle, 6. Mount Hakusan National Park

Books

I read a lot of fiction this year.

War and Peace took me a long time to get through, but it was a very rewarding read. It’s a mix of history, philosophy and fiction, set during Napoleon’s 1812 invasion of Russia. Slaughterhouse-Five is also about war (WWII), with a dark-humor take.

If on a winter’s night a traveler by Italo Calvino is sort of a meta-book, full of self-references. Pretty interesting read.

Hard-Boiled Wonderland and the End of the World, Strange Weather in Tokyo and Snow Country all have their background in Japan. I enjoyed the idea of reading books related to a country before visiting it.

I’m usually not into science-fiction but I enjoyed reading Red Mars.


For non-fiction, Survival of the Sickest was my favorite science book of the year. Homo Deus was also very interesting. It’s by the same author as Sapiens, Yuval Noah Harari. Even though it’s about the future of humanity, I liked his discussions of the present better, especially about humanism.

The Blog in 2017

Like last year, the most popular post was the Introduction to the Parsec Library, with 1.6k visits. From this year, the notes on Google’s Spanner database paper got 250 views. Overall it had 9.5k visitors.

I kept the resolution to post at least once a month. The blog completed 5 years with 66 posts.

Resolutions for 2018

For this year, I plan to learn Rust. I haven’t decided which technical book to read yet. I also want to learn mobile development and read more papers.

Generalized Tries in OCaml

In Chapter 10 of Purely Functional Data Structures, Okasaki describes a generalized trie. In this post we’ll see how tries can be seen as maps over a specific type of keys and how we can construct maps over any type of keys.

We discussed tries in OCaml in a previous post. At first glance, it looks like a special data structure but we’ll now see it’s basically a map.

The basic form of a map is a pair of (key, value). In most programming languages the key has to be a scalar (e.g. int, string) or some complex type that can be converted to one (think of hashCode() in Java).

For simplicity, let’s assume our scalar is an int and our value is of generic type T. Also, note that the syntax we present below is not valid OCaml (I find OCaml’s type syntax a bit hard to read). A map with int keys can be represented as:

map<int, T>

If we want a two dimensional key, for example (int, int), we can have a map of maps:

map<int, map<int, T>>

and if the dimensions are not fixed, for example, a list of integers, then we can define a recursive structure for our map

map<list<int>, T> =
  Node of (option<T> * map<int, map<list<int>, T>>)

If we think of strings as lists of characters, a trie is then a map where the key is a list of characters. In the type above we can simply change the key to a list of characters; for a basic trie, T is a boolean:

map<string, bool> =
  Node of (bool * map<char, map<string, bool>>)

Note that option<bool> is redundant, so we can use bool only.
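Concretely, a minimal version of this boolean trie over lists of characters might look like:

```ocaml
module CharMap = Map.Make (Char)

(* A trie is a map keyed by lists of chars: a bool marking whether the
   word ending here is present, plus a map from the next char to the
   subtrie handling the rest of the word. *)
type trie = Trie of bool * trie CharMap.t

let empty = Trie (false, CharMap.empty)

let rec insert word (Trie (is_word, children)) =
  match word with
  | [] -> Trie (true, children)
  | c :: rest ->
      let child =
        match CharMap.find_opt c children with
        | Some t -> t
        | None -> empty
      in
      Trie (is_word, CharMap.add c (insert rest child) children)

let rec mem word (Trie (is_word, children)) =
  match word with
  | [] -> is_word
  | c :: rest ->
      (match CharMap.find_opt c children with
       | Some t -> mem rest t
       | None -> false)
```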

Maps over trees

tree2

We can now generalize the key to more complex recursive types such as trees. Suppose we have the following type for a tree:

tree<int> = (int * tree<int> * tree<int>)

The outermost map is indexed by the root of the tree. The inner maps are indexed by the left and right subtrees respectively:

map<tree<int>, T> =
  map<int, map<tree<int>, map<tree<int>, T>>>

If we expand the key of the second map we get the following:

map<tree<int>, map<tree<int>, T>> =
  map<int, map<tree<int>, map<tree<int>, map<tree<int>, T>>>>

It gets pretty involved very quickly, but because we traverse these types recursively, the implementation is still simple. Let’s see how to implement this in OCaml, starting with the type of a map over an int tree.

Note that the outermost map is a regular IntMap, which uses the root element of the tree, a scalar, as key.

The search function takes a tree representing the key, and the map. The base case is when the tree is empty, i.e. when we’re just past a leaf. The recursive case consists in obtaining the maps in order: first using the root element, then the left subtree, and finally the right subtree.

Note that the recursive call is non-uniform, so we need explicit annotations, as we discussed previously.

The insertion is similar, except that when we fail to find a key in any of the maps, we need to first insert it with an empty element before recursing.

Because the Map.find() implementation throws an exception when a key doesn’t exist, we can wrap the call in a try and, if an exception occurs, insert the missing key (alternatively, we could use Map.mem()).
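Since the embedded snippets are not shown here, the following is a self-contained reconstruction along the lines described in the text, using find_opt instead of the exception trick:

```ocaml
module IntMap = Map.Make (struct type t = int let compare = compare end)

type tree = Empty | Node of int * tree * tree

(* A map keyed by trees: an optional value for the Empty key, plus an
   IntMap on the root element whose values are maps keyed by the left
   subtree, whose values are in turn maps keyed by the right subtree. *)
type 'a tree_map = TreeMap of 'a option * 'a tree_map tree_map IntMap.t

let empty_map = TreeMap (None, IntMap.empty)

(* The recursion is non-uniform: the inner calls use the map at a
   different type, hence the explicit polymorphic annotations. *)
let rec find : 'a. tree -> 'a tree_map -> 'a option =
 fun key (TreeMap (value, children)) ->
  match key with
  | Empty -> value
  | Node (x, left, right) ->
      (match IntMap.find_opt x children with
       | None -> None
       | Some m1 ->
           (match find left m1 with
            | None -> None
            | Some m2 -> find right m2))

let rec insert : 'a. tree -> 'a -> 'a tree_map -> 'a tree_map =
 fun key v (TreeMap (value, children)) ->
  match key with
  | Empty -> TreeMap (Some v, children)
  | Node (x, left, right) ->
      (* Fetch (or create) the maps for the root, left and right keys,
         update the innermost one, and rebuild the chain. *)
      let m1 =
        match IntMap.find_opt x children with
        | Some m -> m
        | None -> empty_map
      in
      let m2 =
        match find left m1 with Some m -> m | None -> empty_map
      in
      let m2' = insert right v m2 in
      let m1' = insert left m2' m1 in
      TreeMap (value, IntMap.add x m1' children)
```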

Maps over products and sums

There are two ways a structure can branch out recursively: through combination or through alternative. For a combination, refer to our definition of Node:

Node of 'a * 'a tree * 'a tree

In theory, any combination of values for 'a, 'a tree, 'a tree is a possible value for a Node. The * between the components in fact represents the Cartesian product operator. This is also known as a product.

For alternative, the type of the tree itself can be either Empty or Node:

type 'a tree = Empty | Node of (...)

In this case, the set of valid values for a tree is the sum of the values of each alternative. Hence, this is also known as a sum.

We can generalize the map with keys of any types by looking at their definition. If it’s a product, we end up with nested maps. For example, if

Tk = (Tk1 * Tk2)

then the map over Tk can be defined as

map<Tk, T> = map<Tk1, map<Tk2, T>>

In our example, this became the nested maps 'a trie trie IntMap.t.

For sums, if the type is

Tk = Tk1 | Tk2

the map over Tk would end up as a product of maps:

map<Tk, T> = (map<Tk1, T> * map<Tk2, T>)

In our example, this became the product of ('a option) and ('a trie trie IntMap.t). An option can be thought of as a map whose key type has a single value.
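As an illustration of the sum rule, consider a made-up key type with two alternatives; a map over it is a product of two maps:

```ocaml
(* A hypothetical key type that is a sum of two alternatives. *)
type key = Str of string | Num of int

module StrMap = Map.Make (String)
module IntMap = Map.Make (struct type t = int let compare = compare end)

(* Following the sum rule, a map over [key] is a product of a map over
   strings and a map over ints. *)
type 'a key_map = { strs : 'a StrMap.t; nums : 'a IntMap.t }

let empty = { strs = StrMap.empty; nums = IntMap.empty }

let add k v m =
  match k with
  | Str s -> { m with strs = StrMap.add s v m.strs }
  | Num n -> { m with nums = IntMap.add n v m.nums }

let find_opt k m =
  match k with
  | Str s -> StrMap.find_opt s m.strs
  | Num n -> IntMap.find_opt n m.nums
```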

Conclusion

In this post we saw how a trie can be modeled as a map using the same principles as the ones we use to construct matrices, that is, two-dimensional maps (nested maps). We then generalized the string keys to trees, and implemented the insert/find functions in OCaml. I found it pretty hard to reason about these structures.

We then went a step further and saw how to construct maps based on the key structure. And we learned about product and sum of types when discussing recursive types.