Log Structured Merge Trees

In this post we’ll discuss a data structure called Log Structured Merge Trees or LSM Trees for short. It provides a good alternative to structures like B+ Trees when the use case is more write-intensive.

According to [1], hardware advances are doing more for read performance than they are for writes. Thus it makes sense to select a write-optimised file structure.

B+ Trees and Append Logs

B+ Trees add structure to data in such a way that reads are efficient. They organize the data in a tree and perform regular rebalancing to keep the tree height small, so that we never need to look at too many entries to find a record.

If the B+ Tree is stored on disk, updating it requires random access, which is expensive on a spinning disk. Random access is orders of magnitude slower than sequential access on disk. Adam Jacobs [3] describes an experiment where sequential access achieves a throughput of ~50M accesses/second while random access achieves only about 300 (over 100,000x slower!). SSDs have a smaller gap: ~40M accesses/second for sequential access and about 2,000 accesses/second for random access.

The other extreme alternative for avoiding disk seeks when writing is to just append content sequentially, for example by appending rows to a log file. The problem with this is that the stored data has no structure, so searching for a record could require scanning the entire dataset in the worst case!

The LSM Tree aims to combine the best of both worlds to achieve better write throughput without sacrificing too much read performance. The overall idea is to write to a log file, and as the file gets too large, restructure the data to optimize reads. We can see it as a lazy data structure: data gets updated in batches.

First we’ll describe the original version of LSM Trees and then an improved version with better performance for real world applications and used by databases like LevelDB [4].

LSM Trees

Let’s study LSM Trees applied to the implementation of a key-value database. Writes are initially done to an in-memory structure called the memtable, where the keys are kept sorted (random access in RAM is not expensive). Once the table “fills up”, it’s persisted to disk as an immutable (read-only) file.

Figure 1: Inserting a new key into the memtable

Searching for a key consists of scanning each file; within a file we can keep an index for the keys, so we can quickly find a record. Note that a key might appear in multiple files, representing multiple updates to that key. We scan the files from the most recent first, because that one contains the last update to the key. The major cost of searching is the linear scan over the files: as our database grows, the number of files becomes too large to scan linearly.
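
As a rough illustration, here is a minimal sketch that models the memtable and the flushed files as in-memory sorted maps; a real implementation would write each flushed file to disk and keep a per-file index:

    use std::collections::BTreeMap;

    const MEMTABLE_LIMIT: usize = 1024;

    struct LsmTree {
        memtable: BTreeMap<String, String>,
        // Flushed, immutable "files"; most recent last.
        files: Vec<BTreeMap<String, String>>,
    }

    impl LsmTree {
        fn put(&mut self, key: String, value: String) {
            self.memtable.insert(key, value);
            if self.memtable.len() >= MEMTABLE_LIMIT {
                // "Flush": treat the full memtable as a new immutable file
                // (kept in memory here for simplicity).
                let full = std::mem::replace(&mut self.memtable, BTreeMap::new());
                self.files.push(full);
            }
        }

        fn get(&self, key: &str) -> Option<&String> {
            // The memtable holds the most recent writes.
            if let Some(value) = self.memtable.get(key) {
                return Some(value);
            }
            // Then scan the files from most recent to oldest: the first hit
            // is the latest update for this key.
            for file in self.files.iter().rev() {
                if let Some(value) = file.get(key) {
                    return Some(value);
                }
            }
            None
        }
    }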

Figure 2: Writing the memtable to a file

To avoid that, once the number of files grows past a given threshold, we merge every pair of files into a new file using an external merge sort to keep the keys sorted. This halves the number of files to scan, and while each file doubles in size, the search cost within a file is sublinear (O(log n) with an index), so overall the search becomes roughly twice as fast. This approach is known as tiered compaction [2].

The main disadvantage of this method is that once the files get past a certain size, the merge operation starts getting costly. Given m sorted files of size S, the merge operation is O(m S log m). While this compaction happens rather infrequently (roughly when the database doubles in size), it takes a really long time when it does happen.

Figure 3: Tiered compaction

This resembles the discussion of amortized analysis for data structures [5]. We saw that while amortized complexity may yield efficient average performance of a data structure, there are situations where we cannot afford the worst case scenario, even if it happens very rarely.

LSM with Level Compaction

An alternative approach to work around the expensive worst case is to keep the file sizes small (under 2MB) and organize them into levels. Excluding the first level, which is special, the sets of keys contained in any two files at a given level must be disjoint, that is, a given key cannot appear in more than one file at the same level. Each level can contain multiple files, but the total size of the files should stay under a limit, and each level is k times larger than the previous one. In LevelDB [4], level L has a 10^L MB size limit (that is, 10MB for level 1, 100MB for level 2, etc).

Promotion. Whenever a given level reaches its size limit, one of the files at that level is selected to be merged with the next level, or promoted. To keep the disjoint-keys property satisfied, we first identify which files in the next level share keys with the file being merged, and then merge all of these files together. Instead of outputting a single combined file like in tiered compaction, we output many files of size up to 2MB. During the merge, if we find key collisions, the key from the lower level is more recent, so we can discard the copy from the higher level.

Figure 4: Promotion from Level 0 to Level 1

Details

When merging, to detect which files contain a given key, we can use a Bloom filter for each file. Recall that a Bloom filter allows us to check whether a given key belongs to a set using little memory: if it says the key is not in the set, it is always correct; if it says the key is in the set, there is a small chance it is wrong. This lets us quickly check whether a given key belongs to a file with a low memory footprint.

The first level is special because its keys don’t need to be disjoint; when merging a file from this first level, we also include the other first-level files that contain any of its keys. This way we guarantee that the most up-to-date version of a key lives at the lowest level where the key is found.

To select which file gets merged with the next level, we use a round-robin approach: we keep track of which file was merged last and pick the next one. This ensures that every file eventually gets promoted.

When outputting files from the merge operation, we might output files smaller than 2MB if we detect that the current file would overlap with too many files (in LevelDB, 10) in the next level. This avoids having to merge too many files when this file gets promoted in the future.

Cost Analysis

Since file sizes are bounded by 2MB, merging files is a relatively cheap operation. We saw above that a file is limited to overlapping with only around 10 files at the next level, so we’ll only have to merge around 11 files, for a total of about 11MB of data, and we can easily do the merge sort in memory.

The promotion might also cascade through the next levels: once we promote a file from level t to t+1, level t+1 might overflow, requiring another promotion. This will in fact be common, because merging only moves about 2MB worth of data to the next level, so that level will likely require a promotion the next time it receives a new file from the level below (ignoring the fact that keys get overwritten during the merge). Fortunately the number of levels grows as O(log n) of the data size. For LevelDB, where level 1 is limited to 10MB, even a disk with 100TB capacity would need only about 8 levels.

Reads

The fact that each key belongs to at most one file at each level allows us to keep an index (e.g. a hash table on disk) from keys to files for each level. (This excludes the first level, but it has a small number of files, so a linear search is not expensive.)

One interesting property is that each level acts as a sort of cache for recent writes. Whenever a key gets updated, it’s inserted into a file at the lower levels, and it takes many promotions for it to be placed at a higher level. This means that searching for a key that has been recently updated requires scanning only a few levels (with smaller indexes), since it will be found at the lower levels.

References

[1] Ben Stopford – Log Structured Merge Trees
[2] Datastax – Leveled Compaction in Apache Cassandra
[3] ACM Queue – The Pathologies of Big Data
[4] LevelDB – Wiki
[5] NP-Incompleteness – Eliminating Amortization

Writing JavaScript using OCaml

In this post we’ll explore BuckleScript, a framework that enables developers to write JavaScript applications using OCaml syntax.

Note that BuckleScript is not a way to convert general OCaml code into JavaScript, but rather write JavaScript applications using OCaml’s syntax and type system. The end runtime and libraries are still JavaScript.

For example, if you want to work with dates, you wouldn’t look for the corresponding OCaml library, but instead include the JavaScript Date module in your OCaml code.

Let’s first compare BuckleScript with similar frameworks to understand better why it exists and when to use it. Then we’ll follow with several examples to get a better grasp of BuckleScript and how it translates to JavaScript.

It’s better to have some knowledge of OCaml, and we have written extensively about it, but it’s also easy enough to get the gist of the benefits from our simple examples.

Comparisons

BuckleScript vs js_of_ocaml

js_of_ocaml is another framework that connects the OCaml and JavaScript worlds.

According to [4], both projects aim to compile OCaml code to JavaScript. The differences pointed out are:

* js_of_ocaml takes the low-level bytecode from the OCaml compiler, while BuckleScript takes the high-level rawlambda representation from the OCaml compiler
* js_of_ocaml focuses more on the existing OCaml ecosystem (opam), while BuckleScript’s major goal is to target npm

A simplistic way to see the differences is that BuckleScript is for JavaScript developers to start using better language features from OCaml, while js_of_ocaml is for OCaml developers to be able to run their code in the browser.

BuckleScript vs ReasonML

ReasonML is often mentioned together with BuckleScript, which makes it a bit confusing to understand their differences at first.

Compared to OCaml, ReasonML has a friendlier syntax (for people coming from JS) and better support for JSX (inline XML tags). The difference in syntax is significant enough that we are really talking about a dialect of OCaml.

Note that BuckleScript and ReasonML are complementary. BuckleScript can compile either OCaml or ReasonML to JavaScript; ReasonML is mostly about the syntax.

BuckleScript vs TypeScript

TypeScript is a language that adds types to a JavaScript codebase, similar to Flow.

At first glance, TypeScript and BuckleScript seem to serve different purposes, but one of the main advantages of using OCaml to write JavaScript applications is type safety.

In [2], Hongbo provides a comparison between the two systems. Some of the pros and cons raised are:

TypeScript

  •  Pros:
    • Designed for JS, easy inter-operate with JS
  • Cons:
    • Compiler slow (not scalable for FB/Google scale)
    • Verbose, very limited type inference
    • Start up time slow (very hard to build traditional build tools)
    • Types are only used for tooling – soundness is not the design goal, not very reliable
    • No code optimizations

BuckleScript

  • Pros:
    • Compiles much faster, scales better
    • Sound type system, global type inference
    • Types are used in code optimization, optimizing compiler
    • Outputs to JavaScript but can also generate native backend for X86, ARM
  • Cons:
    • Learning curve is higher compared with TypeScript

The author also suggests that OCaml is a better language than JavaScript, without providing more details.

Setup

The easiest way to try out examples is through this BuckleScript playground.

To try it locally, we can follow [5]. It generates an initial bsconfig.json file, and the compilation process can be driven via npm. The process involves converting an OCaml file (.ml) to a JavaScript file (.bs.js). The latter is what gets included in your application, but both should be considered part of your codebase and hence committed.

Examples

Because most people learning BuckleScript are familiar with JavaScript and less familiar with OCaml, I’ll provide the examples in reverse: how do we implement a given JavaScript snippet in OCaml?

console.log(‘Hello World’)

How do we print to stdout using BuckleScript? The most basic program simply prints a Hello World. In plain OCaml we would use Printf.printf to print to stdout, but the usual OCaml modules are not always readily available in BuckleScript. We can use the Js.log() function for this purpose.
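
A minimal version looks something like this:

    let () = Js.log "Hello World"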

This maps to a plain console.log("Hello World") call in the generated JavaScript. Note that Js.log is a function provided by BuckleScript to make the integration with JavaScript more seamless.

Looping over an Array

We can use imperative-style code in OCaml.
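
A minimal sketch using Array.iter and Js.log:

    let () =
      let numbers = [| 10; 20; 30 |] in
      Array.iter (fun n -> Js.log n) numbers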

Note that arrays have a different syntax than in JavaScript: [| 10; 20; 30 |]. OCaml arrays map directly to JavaScript arrays in the generated code.

The interesting thing here is that we do have access to some basic OCaml libraries, for example Array.

Since we are dealing with a functional language, we might as well use more idiomatic OCaml code. If we want to operate directly on native JavaScript structures instead, we can use the wrappers from the Js module (for example Js.Array), which map to the corresponding JavaScript methods.

Looping over a List

So Array is the basic structure to represent a sequence of items in JavaScript but in OCaml it’s List. What happens if we use List instead?
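
A minimal sketch, defining sum as a partial application of List.fold_left:

    (* sum is a partial application of fold_left: it still expects the list. *)
    let sum = List.fold_left (fun acc n -> acc + n) 0

    let () = Js.log (sum [10; 20; 30])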

We simply dropped the | to use a List instead of an Array, and now we can use the standard List.fold_left instead of reduce. In the generated JavaScript, the list shows up as a nested structure.

This is very interesting! We have studied functional data structures in OCaml extensively in the past, and the key concept is that by default data structures are persistent. If we look closely at what is being passed to sum(), we see it’s a linked-list-like structure: [10, [20, [30, 0]]], the 0 being used to indicate the end.

Modules

Another common pattern in JavaScript is to group a bunch of functions inside an Object. The Object name serves as a namespace or module name that can be referenced elsewhere. A natural mapping here is OCaml modules.
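
A minimal sketch of such a module, using the MyModule name referenced below:

    module MyModule : sig
      val add : int -> int -> int
      val sub : int -> int -> int
    end = struct
      let add x y = x + y
      let sub x y = x - y
    end

    let () = Js.log (MyModule.add 1 2)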

In the generated JavaScript, the module becomes a plain exported value.

Note that the signature is only used for compilation/transpilation checks and is erased from the final JavaScript code. Another curious thing is that the functions are exported as arrays to MyModule. To me it would make more sense to export them as Object.

Currying Functions

A nice feature of functional languages is currying, which allows us to perform partial application of functions. For example, if we have a function that adds two numbers, we can derive an increment function that partially applies sum by binding the first value to 1.
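
A minimal sketch:

    let sum a b = a + b

    (* Partial application: bind the first argument of sum to 1. *)
    let increment = sum 1

    let () = Js.log (increment 41) (* 42 *)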

In the resulting JavaScript code, increment ends up as an ordinary function of one argument.

Note that we can already perform partial application of functions in vanilla JavaScript via bind() (for example, sum.bind(null, 1)), but the syntax is not as neat.

Chaining Functions

Another neat syntax from OCaml is the chaining operator. One can chain functions via the |> operator: the result of the lefthand function is passed as argument to the righthand function.

A common use for this is applying a series of functions in sequence without assigning intermediate results to variables.

For example, say we have a function to normalize a given string by converting it to lower case, trimming the leading and trailing whitespace and converting the remaining spaces to underscores. In JavaScript we would chain the methods toLowerCase(), trim() and replace() on the string.

An alternative would be to nest the calls so we don’t have to repeat the variable, but that would hurt legibility. In OCaml, we could chain these calls:
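
A sketch, assuming the Js.String wrappers (toLowerCase, trim, replaceByRe), which take the string as their last argument so it can be piped in:

    let normalize (s : string) : string =
      s
      |> Js.String.toLowerCase
      |> Js.String.trim
      |> Js.String.replaceByRe [%bs.re "/\\s+/g"] "_"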

Note the [%bs.re] tag: it is a macro that allows us to write regexes using JavaScript syntax. We can also avoid repeating the module name by opening it locally if all the functions come from the same module.

Using JavaScript libraries

One of the main selling points of BuckleScript is that you can adopt it gradually, module by module. This is possible because we can require JavaScript modules inside our OCaml code. For example, to port code that reads a file asynchronously with Node’s fs.readFile(), we can bind to the JavaScript function directly.
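
A rough sketch of such a binding; the callback type is simplified (the error parameter is left polymorphic) and the exact signature should be treated as an assumption:

    (* Sketch of a binding to fs.readFile; the error parameter is left
       polymorphic and ignored for brevity. *)
    external readFile :
      string -> string -> ('error -> string -> unit) -> unit = "readFile"
      [@@bs.module "fs"]

    let () = readFile "hello.txt" "utf8" (fun _err contents -> Js.log contents)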

Here the OCaml code is more verbose, but we provide a stronger contract by typing the function readFile(). The import is done through an external declaration annotated with the JavaScript module it comes from. Note: if the JavaScript name is the same as the OCaml name given to the external, the former can be omitted.

Objects as Maps

In JavaScript, Objects are often used either as maps (key-value pairs) or as records (a fixed set of distinct fields). In OCaml we can rely on the type system to enforce the specific use we want. In the example below, we declare a map from string to int; if we try to set a value with a different type we get a compilation error.
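
A sketch using BuckleScript’s Js.Dict wrapper (Js.Dict.empty and Js.Dict.set are the assumed helpers):

    (* A dictionary holding int values only. *)
    let ages : int Js.Dict.t = Js.Dict.empty ()

    let () =
      Js.Dict.set ages "alice" 30;
      Js.Dict.set ages "bob" 25
      (* Js.Dict.set ages "carol" "unknown" would fail to compile:
         the dictionary was declared as holding ints. *)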

Objects as Records

To represent an Object as a record, we can use an OCaml record type.
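
A sketch of such a type; the attribute names follow the prose below and their exact placement is my best recollection of the BuckleScript syntax:

    type person = {
      name : string;
      age : int;
      nickname : string [@bs.optional];
    } [@@bs.deriving abstract]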

We added the [@bs.optional] attribute to indicate that a field is optional, and the [@@bs.deriving abstract] attribute to indicate that the record should not be instantiated directly with the usual record literal syntax.

Instead, it generates a “constructor” function. In the same way, the properties of the record are not directly accessible; they have to go through auto-generated accessor functions.

The generated JavaScript code represents the record as a plain Object.

The interesting thing is that the generated JavaScript Object is mutable, but within the OCaml code, the record cannot be modified. It’s possible to mark it mutable, but the default immutability makes it easier to reason about code.

Downsides

The benefits being stated, there are two main potential drawbacks of using BuckleScript.

Mixed languages. Adopting BuckleScript will cause the codebase to have a mix of different languages, which makes it harder for new developers to ramp up. Of course this can be mitigated by converting the entire codebase to use OCaml.

Debugging. We’ll be writing code in one language, but it’s another language that ends up being executed. If a problem happens in the underlying JavaScript code, how do we figure out which OCaml code generated the faulty part?

BuckleScript tries to address this by preserving the structure of the code as much as possible, so that it’s easier to understand which parts map to what. This works well when we use the Js wrappers that resemble JavaScript code patterns, but it’s unclear how well the structure is preserved if we rely more on OCaml’s persistent data structures or functional programming patterns like currying.

One possible improvement would be to add some traceability to the generated JavaScript code such that you won’t need to look at the JavaScript code most of the time, in the same way that one doesn’t usually need to inspect Assembly code when their C++ application crashes.

Conclusion

In this post we did a comparison of BuckleScript with different frameworks and libraries to understand why and when to use it. Following that, we studied a few basic examples which one might encounter in a day-to-day JavaScript programming and how to express that in OCaml.

Through these examples, we saw the OCaml type system in use, as well as some neat syntax and immutable data structures, which can lead to more readable, succinct and reliable code.

References

[1] BuckleScript Examples
[2] BuckleScript: An OCaml to JavaScript compiler (Hacker News discussion)
[3] BuckleScript playground
[4] BuckleScript/bucklescript – Comparisons
[5] BuckleScript – New Project

Bulls and Cows

Bulls and Cows (also known as MOO) is a 2-player game in which one player comes up with a secret and the other has to guess it. The secret consists of 4 distinct digits from 0 to 9, and player 2 has to guess these 4 digits in the right order. For each guess from player 2, player 1 provides feedback as a hint. The hint is two numbers: one telling how many digits of the secret player 2 got in the right position (bulls), and another telling how many digits appear in the secret but in the wrong position (cows).

For example, if player 1 came up with 4271, and player 2 guessed 1234, then bull is 1 (digit 2), and cow is 2 (1 and 4 are in the secret, but not in the order guessed by player 2).

The goal of the game is for player 2 to guess the secret with the least amount of guesses.

Bull sculptures from Cyprus, dated sometime between 2000 and 1600 BC. Photo taken at the National Archaeological Museum in Athens.

In this post we’ll present computational experiments to solve a game of Bulls and Cows optimally.

Objective function

We’ll focus on a search strategy to minimize the maximum number of guesses one has to make, for any possible secret. Alternatively, John Francis’s paper [1] proposes heuristics that minimize the expected number of guesses, that is, for the worst case these heuristics might not be optimal, but they perform better on average.

Because the game has turns, our solution is not a single list of guesses, but rather a decision tree where at each node we branch depending on the hint we get back. The metric we are trying to optimize is then the height of such tree.

To break ties between trees of the same height, we consider the smallest tree (the one with least nodes).

Brute-force search

Our search algorithm is recursive. At any given recursion level, we are given a set containing all possible numbers that could be secrets. We’ll try all combinations of guesses and hints and see which ones yield the best solution.

When we simulate a given guess and hint, we are restricting the possible numbers that can still be secret, so in the next level of recursion, the set of potential secrets will be smaller.

For example, say that the possible secrets are any 4-digit numbers with no repeated digits. We then use one of them as a guess, say [0, 1, 2, 3]. One possible hint we could receive is (3, 0). What are the possible secrets that would cause us to receive this hint? [0, 1, 2, 4], [0, 1, 2, 5] and [7, 1, 2, 3] are a few of them. If we recurse for this guess and hint, the set of secrets at the next level will be restricted to those that would return the hint (3, 0) for [0, 1, 2, 3].

Continuing our example, is [0, 1, 2, 3] the best guess we can make at that point? After recursing for all possible hints, we’ll have a decision tree rooted on [0, 1, 2, 3]. The key idea of the search is that we can minimize the height of the final decision tree by minimizing the subtrees at each recursion level (greedy strategy). Thus, we want to find the guess with the shortest subtree.

In pseudo-code, the search looks like this:
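
A minimal Python sketch of the idea (the post mentions an initial Python implementation below); it only computes the optimal height, ignores tie-breaking and tree construction, and for brevity only considers guesses that could still be the secret:

    def compute_hint(secret, guess):
        # Digits are distinct, so set intersection counts the common digits.
        bulls = sum(1 for s, g in zip(secret, guess) if s == g)
        common = len(set(secret) & set(guess))
        return (bulls, common - bulls)

    def search(possible_secrets):
        # Height of an optimal decision tree for this set of candidates.
        if len(possible_secrets) <= 1:
            return 0
        best = float('inf')
        for guess in possible_secrets:
            # Group the remaining candidates by the hint they would produce.
            groups = {}
            for secret in possible_secrets:
                if secret != guess:
                    groups.setdefault(compute_hint(secret, guess), []).append(secret)
            # The worst case for this guess is its tallest subtree.
            best = min(best, 1 + max(search(g) for g in groups.values()))
        return best

Without further optimizations this is exponential; the candidates would be all 4-digit permutations of the digits 0 to 9.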

We start with all possible secrets (4-digit number with no repeated digits) and the tree returned should be optimal in minimizing the number of guesses for the worst case.

Rust implementation

I initially implemented the search in Python, but it was taking too long, even when artificially restricting the branching. I re-implemented it in Rust and saw some 20x speedups (caveat: I haven’t really tried to optimize the Python version).

The main difference between the pseudo-code and the actual implementation is that we pre-group possible_secrets by their hint for guess, which is more efficient than scanning possible_secrets for all possible hints:
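A sketch of this grouping; the function names follow the prose, but the actual code in the repository may differ:

    fn compute_score(secret: &[u8], guess: &[u8]) -> (usize, usize) {
        let mut bulls = 0;
        let mut common = 0;
        for i in 0..guess.len() {
            if secret[i] == guess[i] {
                bulls += 1;
            }
            if secret.contains(&guess[i]) {
                common += 1;
            }
        }
        (bulls, common - bulls)
    }

    fn group_possibilities_by_score(
        possible_secrets: &[Vec<u8>],
        guess: &[u8],
    ) -> Vec<Vec<Vec<u8>>> {
        // With N digits there are at most (N + 1) * (N + 1) distinct
        // (bull, cow) hints, so a fixed-length vector indexed by the hint
        // is enough.
        let n = guess.len();
        let mut groups = vec![Vec::new(); (n + 1) * (n + 1)];
        for secret in possible_secrets {
            let (bulls, cows) = compute_score(secret, guess);
            groups[bulls * (n + 1) + cows].push(secret.clone());
        }
        groups
    }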

The function group_possibilities_by_score() above makes use of compute_score(), and it uses a fixed-length array for performance: the number of possible hints is proportional to the square of the guess size N, which in our case is N=4.

Turns out that the Rust implementation is still not efficient enough, so we’ll need further optimizations.

Optimization – classes of equivalence

What is a good first guess? It doesn’t matter! We don’t have any prior information about the secret, and every valid number is equally probable. For example, whether we guess [0, 1, 2, 3] or [5, 1, 8, 7], the height of the decision tree will be the same. An intuitive way to see why this is the case is that we could relabel the digits such that [5, 1, 8, 7] maps to [0, 1, 2, 3].

Francis [1] generalizes this idea to other cases. Say that at a given point we have made guesses covering the digits 0, 6, 7, 8 and 9 at least once, and our next guess is [0, 8, 9, 3]. Here, 3 is the only digit we haven’t tried yet, but using the relabeling argument, we can see that [0, 8, 9, 1] would yield the same decision tree if we were to swap the labels of 1 and 3. This allows us to skip guesses that belong to the same class, which reduces the branching factor.

We can generate an ID representing a given class. One way to do this is to map each digit we have tried before to its value plus one, and any digit we haven’t tried to 0, and then interpret the result as a number in base 11. For example, [0, 8, 9, 1] becomes [1, 9, 10, 0]. Read as a base-11 number, it is 2530 in base 10 ((((1*11) + 9)*11 + 10)*11 + 0). If we do the same with [0, 8, 9, 3], we get the same number. The code below implements this idea.

In our case, D = 10 and we store the set of visited digits in a bitset, visited_bits (that is, bit i is 1 if digit i has been visited).
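
A sketch of the ID computation described above (an illustration, not necessarily the exact code from the repository):

    const D: u64 = 10;

    // visited_bits has bit i set if digit i appeared in a previous guess.
    fn class_id(guess: &[u8], visited_bits: u64) -> u64 {
        let mut id = 0u64;
        for &digit in guess {
            let mapped = if (visited_bits & (1 << digit)) != 0 {
                (digit as u64) + 1 // previously seen digits keep their identity
            } else {
                0 // unseen digits are interchangeable
            };
            id = id * (D + 1) + mapped; // read the digits as a base-(D+1) number
        }
        id
    }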

On the search side, we keep a set of class IDs already visited and skip a guess if its class is already in there.

With this optimization the search algorithm runs in under 2 minutes. By inspecting the height of the resulting tree we conclude that the minimum number of guesses necessary for any secret is 7.

The complete code is available on Github.

Visualizing

The JSON output by the search algorithm is quite big (the smallest tree with height 7 has almost 7000 nodes). A more interesting way to inspect the data is to create a Bulls and Cows solver. We feed the JSON to this application and ask the user to select the outcome based on the secret they have in mind. We are basically traversing the edges of the decision tree.

I’ve uploaded the application to my personal website and the source code is available on Github.

Conclusion

I learned about this game pretty recently and was curious to learn good strategies to solve it. From what I’ve been reading, the heuristics that yield good solutions are too complicated for a human to perform. This leads to a question: is there any strategy that is simple to follow but still reasonably good?

One natural extension for this problem is to use larger numbers of digits (N > 4) and have each element be sourced from 0 to D – 1. An exhaustive search might be prohibitive, but maybe we can come up with heuristics with constant guarantees. What is a lower bound for the number of guesses for variants with arbitrary N and D?

I struggled to implement this code in Rust, but I found the challenge worthwhile. I learned a bit more about its specifics, especially regarding memory management and data types.

In [1], the author mentions that with optimizations the search took 45 minutes to run on their laptop (2GHz Intel Core 2 Duo). I ran mine on an Intel i7 at 2.2GHz and was surprised by the running time of 2 minutes, given that CPUs are not getting exponentially faster these days and my code runs on a single thread.

References

[1] Strategies for playing MOO or “Bulls and Cows”.
[2] Github – Blog Examples: Bulls and Cows

Cell biology and programming

Rosalind Franklin was an English chemist and X-ray crystallographer. She is best known for her work on the X-ray diffraction images of DNA, particularly Photo 51, while at King’s College, London, which led to the discovery of the DNA structure by Watson and Crick.

James Watson, Francis Crick and Maurice Wilkins shared the Nobel Prize in Physiology or Medicine in 1962. According to Wikipedia [1], Watson suggested that Franklin would ideally have been awarded a Nobel Prize in Chemistry as well, but she had passed away from ovarian cancer four years earlier.

Photo 51

In this post we’ll study some basic concepts of cell biology, mostly around DNA. We’ll start by introducing the structure of the DNA and then two of its main functions: replication and the synthesis of proteins. Since this is a programming blog, we’ll provide analogies (some of them forced) to computer systems to help relate the material to prior knowledge.

The end goal of this post is to provide a good basis for later learning bioinformatics algorithms.

Genome

Genome is the set of information necessary for creating an organism. In a computer programming analogy we can think of the genome as the entire source code of an application.

Chromosome

Let’s recall that cells can be classified into two categories: eukaryotic (from the Greek for good + nucleus) and prokaryotic (before + nucleus). As the name suggests, eukaryotic cells have a well-defined nucleus surrounded by a membrane.

In eukaryotic cells the genome is divided across multiple chromosomes which are basically compacted DNA. They need to be compacted to fit within the nucleus. According to Science Focus [3], if stretched, the DNA chain would be 2 meters long.

Human cells usually have 23 pairs of chromosomes, 22 of which are called autosomes and are numbered by size (autosome 1 is the largest). The remaining pair are the sex chromosomes, each of type X or Y.

Men and women share the same types of autosomes, but females have two copies of chromosome X, and men one chromosome X and one Y.

Chromosomes are usually depicted as X-like structures neatly arranged [4], but recent research was able to visualize their actual structure, which looks a lot less tidy [5].

Prokaryotes (e.g. bacteria) on the other hand typically store their entire genome within a single circular DNA chain.

In our computer programming analogy, chromosomes serve as units of organization of the source code, for example the files containing the code. If your application is simple, like a bacterium, the entire source code could be stored in a single file!

We can push the analogy even further and think of the tangled DNA within the chromosome as the “minification” step that JavaScript applications apply to the original source code to reduce network payload.

DNA

Deoxyribonucleic acid, more commonly known as DNA, is a structure usually composed of two strands (chains) connected through “steps” to form a double helix.

Conceptual representation of the DNA: the double helix

In our analogy the DNA could represent the text of the source code.

Nucleotide

Nucleotides are the discrete units that form the basis of the DNA. In DNA, a nucleotide carries one of four bases: Adenine, Cytosine, Guanine or Thymine.

Chemically speaking, we can divide the nucleotide in 3 parts: a sugar group, a phosphate group and a nitrogen base. The first two are common among the nucleotides, while the base differentiates them.

Guanine, one of the 4 possible nucleotides in the DNA

Any two nucleotides can be connected through their sugar and phosphate groups (the sugar group of one nucleotide links to the phosphate group of the next). Nucleotides linked this way form the backbone of a single strand of the DNA.

In addition, two nucleotides can be linked together via their nitrogen base, but in this case, there’s a restriction on which two bases can be linked together. Adenine can only be paired with Thymine, and Cytosine can only be paired with Guanine. These pairings form the “steps” in between two DNA strands.

4 nucleotides linked together through the sugar-phosphate groups or through the nitrogen bases.

Because the phosphate group of one nucleotide is linked to the sugar group of the next, we can define a direction for a chain of nucleotides. The end that terminates in the phosphate group is attached to the 5th carbon of the sugar and is called 5′ (read “five prime”), while the end terminating in the sugar group is attached to the 3rd carbon and is called 3′ (“three prime”). The two strands in a DNA molecule are oriented in opposite directions, as depicted in the figure above.

We can now complete the analogy of computer programming by stating that nucleotides are the characters that compose the text of the source code (DNA). In our case, the alphabet contains 4 letters: A (Adenine), C (Cytosine), G (Guanine), T (Thymine).
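
As a toy illustration of the string analogy, here is a small Rust sketch that computes the complementary strand using the pairing rules above:

    fn complement(strand: &str) -> String {
        strand
            .chars()
            .map(|base| match base {
                'A' => 'T',
                'T' => 'A',
                'C' => 'G',
                'G' => 'C',
                other => other,
            })
            .collect()
    }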

The replication

The DNA is capable of replicating itself so that it can be passed down to newly formed cells.

At a high level, the two strands of the DNA separate, and other structures bind new nucleotides to each strand (the templates) until both strands are completely duplicated, each forming a double strand again.

The separation of the strands is triggered by the protein helicase; it can start at any point of the DNA and may happen at many places at the same time. One way to visualize this is opening a zipper jacket from the middle and pushing it open in one direction.

While the strands are being separated, a protein called DNA polymerase scans each strand and builds a copy strand by adding nucleotides to it.

The DNA polymerase can only extend an existing chain of nucleotides, so it requires an initial fragment to start with. For that reason, at the beginning of the duplication process, a small fragment of DNA or RNA, called a primer, needs to be attached to the strand.

One important limitation the polymerase has is that it can only process a strand in one specific direction: from 3′ to 5′. But since we saw that strands are oriented in opposite direction of each other, it means that the replication process doesn’t happen symmetrically on both strands.

For the strand oriented 3′ to 5′, the replication is smooth, generating a continuous strand. For the reverse strand, though, it is done in pieces, because the polymerase adds nucleotides on the opposite side of where the opening is happening. These pieces are known as Okazaki fragments and are later joined together by another protein called ligase.

One interesting fact about this process is that errors do happen, and there are error-correction mechanisms in place to minimize them. Even so, the result of the replication is not 100% accurate, especially at the endpoints of the strands: each new replica ends up with slightly shorter endpoints than the original template. To compensate for this, the DNA has redundant repeated segments of nucleotides at the endpoints, known as telomeres, but eventually these extra segments get worn down to the point where they cannot support replication. The shortening of telomeres is associated with aging.

In our analogy to computer programming, we could imagine replication as distributing copies of the source code to other people. This analogy is weak, though. If we wanted to be precise, we’d need to come up with a program that is capable of printing its own source as output. Such a program was named a quine by Douglas Hofstadter in Gödel, Escher, Bach [6], and it’s an example of a self-replicating automaton, also studied by the famous computer scientist John von Neumann.

Protein Production

A second function of the DNA is the production of proteins. The first step, called transcription, is very similar to replication: one of the strands serves as a template, and a new strand with complementary bases is generated. The difference is that instead of Thymine, the nucleotide Uracil is used. The resulting structure is called mRNA, short for messenger RNA.

Production of mRNA and its exit to the cytoplasm

The mRNA detaches itself from the DNA strand and exits the nucleus into the cytoplasm, where the second step, translation, begins. There, it is “interpreted” by a structure called the ribosome. Each group of 3 nucleotides, called a codon, translates to an amino acid, and the amino acids in turn form a protein. The mapping of all 64 possible codons is displayed below:

Mapping of codons to amino-acids. For example, GGA maps to Glycine.

The tRNA, short for transfer RNA, is a small chain of RNA that connects amino-acids to their corresponding codon in the mRNA. There are some special codons that indicate the end of the process. The output of this translation will be a peptide chain.

Synthesis of a protein

The production of proteins is the actual functioning of the DNA. Linking back to the computer programming model, we can think of the source code (DNA) being interpreted or compiled into an actual program that can be executed. It’s remarkable that biological systems evolved an explicit code for the end of translation, much like how a semi-colon or a newline often indicates the end of a statement.

Applications of Bio-informatics

How can computer science help with the study of biological systems? Computers are  good at repetitive tasks and handling large amounts of data. Here are some applications according to Wikipedia [7].

  • DNA sequencing – consists of determining the order of nucleotides in the DNA from raw data. Computers are useful here because the raw data often come as fragments that need to be merged.
  • Protein structure prediction – going from a sequence of nucleotides to the chain of amino acids is well understood, but predicting the structure of the final protein is an open problem.
  • Reduce noise in massive datasets output by experiments.

Conclusion

From reading a lot of introductory material, I felt that there were many imprecise or ambiguous descriptions. That seems to come from the sheer complexity of biological systems and from most results being based on empirical evidence, which currently provides only an incomplete picture of the whole. New information still arrives at a frequent pace, sometimes making existing models obsolete or incorrect.

I last studied Biology back in high school. I don’t intend to study it in depth, but just enough to understand how computer science can be used to solve particular problems.

The idea of modeling living organisms as organic computers is fascinating. As we discover more about the inner workings of cells, I hope we can come up with better models and be able to predict their outcome with greater accuracy.

References

[1] Rosalind Franklin – Wikipedia
[2] Photo 51 – Wikipedia
[3] How long is your DNA?
[4] How many chromosomes do people have?
[5] What a chromosome really looks like
[6] Gödel, Escher, Bach: An Eternal Golden Braid – Douglas R. Hofstadter
[7] Bioinformatics – Wikipedia

HyperLogLog in Rust

In this post we’ll study the hyper log log algorithm and provide an implementation in Rust.

Introduction

Philippe Flajolet was a French computer scientist at INRIA [5]. He introduced the field of Analytic Combinatorics and is known for the creation of a family of algorithms of probabilistic counting, including the HyperLogLog.

HyperLogLog is a probabilistic algorithm to determine the number of distinct elements (or cardinality) from a multi-set (a set that allows repeated values) with high accuracy using very low memory. It is then suitable for streaming applications or distributed databases in which we do not have the luxury of keeping all distinct values in memory at any time.

The HyperLogLog

The intuition behind the HyperLogLog is the following. Imagine we have a basket full of balls, each containing a number, which can be repeated. We want to estimate how many distinct numbers we see in the basket with the limitation that we can only look at one ball at a time and we do not have paper and pen, so we have to rely on memory.

The number of balls in the basket can be very large, so it’s unrealistic to keep track of all possible values we see, and we need a proxy to estimate the number of distinct values. The trick is to find some property that only very rare numbers satisfy, and the intuition is that if such property was satisfied by any of the numbers we saw, then there’s a good chance we saw a lot of numbers (or we’re very lucky).

To start, say that the property is “multiple of 2”. That is not a rare property, so if we assume the numbers on the balls do not follow any special pattern, on average we only need to see 2 distinct numbers before the property is satisfied. Now say the property is “multiple of 4”. Fewer numbers satisfy it, so on average we need to see 4 distinct values. We can keep going with higher and higher powers of 2, and the average number of distinct values we need to see grows accordingly.

This means that if we keep track of the largest power of 2 that divides any number we drew, we can use it as an estimate of the number of distinct values! This property is convenient because it doesn’t change when there are duplicated values (which is exactly what we want, since we only count distinct values) and it doesn’t rely on the magnitude of the values, as long as the values are uniformly distributed.

This estimate is not great though, because the error is proportional to the number of distinct values found. For example, if we find that the largest power-of-2 divisor was 2048 (2^11), the actual number of distinct values could be anywhere between 2048 and 4095 (2^12 – 1). We could also get unlucky and have all the balls in the basket contain a single number, say 1024, in which case the estimate would be off by a lot.

Averaging

To reduce the chances of drawing a ball with a value that turns out to be a large power of 2 and throwing off the estimates, we divide the balls into m groups. The important property of the groups is that balls with the same values should go to the same group and the distribution should be uniform.

We can then find the largest power-of-2 divisor for each group and average them out. This helps in the case where we saw only the single number 1024: while one group would estimate 1024, all the other groups, being empty, would find a largest divisor of just 1.

To assign each number to a group, we could use modular arithmetic. For example, with 12 groups, we could use the remainder of the number divided by 12 as the group index. The problem is that the elements assigned to a given group would then share a property that biases the result: in the group corresponding to remainder 0, all numbers are multiples of 12, and hence of 2^2. To make the numbers in a group unbiased, we need to discard the information used to assign them to groups. An easy way to achieve that is to represent each number in binary, use the first bits to decide which group to assign it to, and then discard those bits when computing the power of 2.

Increasing the number of groups reduces the error, but the tradeoff is that we have to keep more information in memory. This idea was first proposed in [2] and is called stochastic averaging.

Harmonic Mean

The HyperLogLog algorithm combines the powers of two differently to obtain a better estimate than a plain average: it uses the harmonic mean [1]. To recall, the harmonic mean consists of averaging the reciprocals of the elements and then taking the reciprocal of the result. For example, the harmonic mean of 1, 4 and 4 is

H = \frac{3}{\frac{1}{1} + \frac{1}{4} + \frac{1}{4}} = 2

The bulk of the HyperLogLog paper [1] is actually proving that this metric yields an estimate with smaller errors than a simple average.

Hashing

Not all real world input will be numbers that are distributed uniformly. For example, we might be interested in counting the number of distinct values of a string column in a database. We then need to transform these values using a hash function which maps these inputs to numbers uniformly.

Algorithm

We are now ready to outline the core of the algorithm to estimate the cardinality of a multi-set of values.

Given a stream of n elements:

  • For each element
    • Hash it
    • Use the first bits of the hash to determine to which group j to assign it to
    • Discard those bits
    • Keep track of the largest power of 2 that divides some value in the group, and store the exponent in M[j]

Finally, estimate the number of distinct values based on a harmonic mean of the values in M.

The Expected Value and the Alpha Correction

The expectation of the number of distinct values is given by [1]:

E = \alpha_m m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}

E is equivalent to the harmonic mean of the 2^{M[j]} values multiplied by \alpha_m m, where \alpha_m is a constant that corrects a multiplicative bias. For implementation purposes, the authors provide approximations to this constant:

\alpha_{16} = 0.673, \quad \alpha_{32} = 0.697, \quad \alpha_{64} = 0.709, \quad \alpha_m = \frac{0.7213}{1 + 1.079/m} \text{ for } m \ge 128

Small and Big Range Corrections

When the estimated number of distinct elements is relatively small compared to the number of groups m, the authors propose a correction.

Let V be the number of groups to which no element was assigned. If V > 0, then experiments show that for E \le \frac{5}{2} m we can change the estimate to

E^* = m \ln\left(\frac{m}{V}\right)

Conversely, if the number of distinct elements is very high, close to 2^{32}, then the probability of hash collisions becomes significant. The correction accounts for that:

E^* = -2^{32} \ln\left(1 - \frac{E}{2^{32}}\right)

Implementation in Rust

We’ll now present an implementation in Rust. The first part of the algorithm consists of computing M, which we name more descriptively first_non_zero_by_bucket. The following code implements the pseudo-code described above:
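
A sketch of what this function could look like; the farmhash::hash32 call and the exact signatures are assumptions, and first_non_zero_bit_position is shown a bit further below:

    extern crate farmhash;

    // b is the number of bits used for the bucket index, so there are
    // m = 2^b buckets.
    fn first_non_zero_by_bucket(values: &[String], b: u32) -> Vec<u32> {
        let m = 1usize << b;
        let mut buckets = vec![0u32; m];
        for value in values {
            let hash = farmhash::hash32(value.as_bytes());
            // The least significant b bits choose the bucket...
            let j = (hash & ((1u32 << b) - 1)) as usize;
            // ...and are discarded before inspecting the remaining bits.
            let w = hash >> b;
            buckets[j] = buckets[j].max(first_non_zero_bit_position(w));
        }
        buckets
    }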

We don’t need to know much Rust to read the code above. It’s worth mentioning that Rust is very strict about types, so we need to perform explicit conversions. We also use two bit operations: one obtains the least significant k bits of an integer using a bit mask, relying on the fact that (2^k)-1 is a number whose lowest k bits are 1, so a bitwise AND extracts only those bits. The other divides a number by 2^k by shifting its bits to the right via the >> operator.
 
The hash function we use here is from the farmhash package, a Rust implementation of Google’s FarmHash, which in turn is a variant of MurmurHash [6]. It takes a string and shuffles its bits in a hopefully uniform way, generating a 32-bit integer.

first_non_zero_bit_position() is given by:
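
A minimal version, using the exponent of the largest power of 2 dividing the value:

    fn first_non_zero_bit_position(v: u32) -> u32 {
        // trailing_zeros() is the exponent of the largest power of 2 that
        // divides v (it returns 32 when v is 0, which we treat as the max).
        v.trailing_zeros() + 1
    }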

The formula to obtain the expected number of distinct values is given by

E = \alpha_m m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}

The code below implements this function:
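
A sketch; compute_alpha is a hypothetical helper implementing the approximations discussed above:

    fn compute_alpha(m: usize) -> f64 {
        match m {
            16 => 0.673,
            32 => 0.697,
            64 => 0.709,
            _ => 0.7213 / (1.0 + 1.079 / (m as f64)),
        }
    }

    fn estimate(buckets: &[u32]) -> f64 {
        let m = buckets.len() as f64;
        let harmonic_sum: f64 = buckets
            .iter()
            .map(|&exponent| 2f64.powi(-(exponent as i32)))
            .sum();
        compute_alpha(buckets.len()) * m * m / harmonic_sum
    }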

The values of alpha were discussed in The Expected Value and the Alpha Correction section above.

The correction for small and large ranges can be implemented as:
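
A sketch of both corrections:

    fn corrected_estimate(e: f64, buckets: &[u32]) -> f64 {
        let m = buckets.len() as f64;
        let empty_buckets = buckets.iter().filter(|&&x| x == 0).count() as f64;
        let two_pow_32 = 2f64.powi(32);
        if e <= 2.5 * m && empty_buckets > 0.0 {
            // Small range: fall back to linear counting.
            m * (m / empty_buckets).ln()
        } else if e > two_pow_32 / 30.0 {
            // Large range: account for 32-bit hash collisions.
            -two_pow_32 * (1.0 - e / two_pow_32).ln()
        } else {
            e
        }
    }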

The complete code is available on Github.

Experiments

For values of b ranging from 4 to 16, I ran the program 100 times with n=100k, with numbers randomly selected from 1 to 100k, and plotted the results as a box plot using an R script.

In the resulting chart, the x-axis represents b, which controls how many groups we divide the input into, and the y-axis represents the relative error compared to the actual value. We can see that as we increase the number of groups, the errors go down.

This chart was also helpful in finding a bug in my implementation: the initial plot had a sudden spike for values of b larger than 11, which was when the small range correction applied. After some debugging, I realized that the algorithm should use the natural logarithm, not log2. It was great to spot the bug via data visualization!

Conclusion

I’ve been wanting to study the HyperLogLog algorithm for a while, and also one of my resolutions for this year is to learn Rust. It was a productive exercise to implement it.

I’ll share some impressions of Rust, from the perspective of someone with a C++ background, in a future post.

References

[1] HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
[2] Probabilistic Counting Algorithms for Database Applications
[3] Thu Trang Pham – Curiosity #2: How does Prestodb implement approx_distinct?
[4] Probabilistic Data Structures for Web Analytics and Data Mining
[5] Gödel’s Lost Letter and P=NP: Philippe Flajolet 1948–2011
[6] GitHub: seiflotfy/rust-farmhash

Paper Reading – F1: A Distributed SQL Database That Scales

In this post we’ll discuss F1: A Distributed SQL Database That Scales. We’ll provide a brief overview of the paper’s contents and study in more details the architecture of the system and the implementation details. At the end, we provide an Appendix to cover some distributed systems and databases concepts mentioned throughout the paper.

F1 is a database used at Google to serve data for the AdWords product, the system that displays embedded ads on external pages using Google’s infrastructure.

F1 is named after a term from Genetics, F1 hybrid, in analogy to the idea that it combines the best aspects of relational databases and NoSQL systems (my initial thought was that it was named after Formula One).

At a high level, F1 is a layer on top of the Spanner database, which we covered in a previous post. It adds features such as a SQL interface (worth noting that Spanner has recently evolved to support SQL natively too). Spanner handles a lot of the distributed systems difficulties under the hood, providing some guarantees, and F1 builds relational database features on top of that.

The paper mentions that AdWords migrated from a sharded MySQL solution, which was very expensive to maintain, especially when the data outgrew an instance and re-sharding was necessary. This data redistribution is now handled transparently by Spanner, and while F1 is theoretically slower, they were able to structure the data in such a way that it’s performant in practice (5-10ms for reads and 50-150ms for writes).

Architecture

F1 is deployed into multiple servers, across geographically distributed data centers. A client sends requests to a load balancer that then redirects to an F1 server, usually in the closest datacenter.

Architecture of F1: two datacenters, each containing multiple machines running F1 servers, Spanner and CFS instances.

F1 servers are usually in the same Data Center as the Spanner servers that contain their data. F1 can communicate to Spanner servers outside of its datacenter directly, but Spanner only communicates with the Colossus file system (CFS) within the same datacenter because CFS is not globally replicated.

F1 servers are stateless except when the client performs pessimistic transactions, which make them easier to scale, since no data movement is required.

In addition to F1 servers, there is an F1 master and a pool of F1 slaves. According to the paper, the master controls which processes are in the pool, and the F1 slaves execute parts of the query plan on behalf of F1 servers, but it’s not clear why they need to be a separate component.

Data Model

The table data can be organized in a hierarchical schema much like in Spanner (see “Data Model” in the post about Spanner). This hierarchical schema is optional, but it’s a key piece in making the performance of F1 competitive with the previous sharded MySQL system AdWords used.

In this hierarchy, we have a root table and a child table. For each row in the root table, called root row, we have rows clustered under that row based on their matching keys.

For example, say that we have tables Customer and Campaign, where Customer is keyed by customerID and Campaign by (customerID, campaignID). In the physical layout, each Customer row is followed by the Campaign rows that share its customerID; for example, the Campaign rows with customerID=3 are clustered together under the corresponding Customer row.

The paper mentions that a given cluster of child rows fall within the same Spanner directory, which means that a transaction would query a single Spanner server, avoiding the overhead of synchronization between multiple servers (see Read-write transactions in the Spanner post).

Indexes. Tables support two types of indexes: local and global. Local index entries are stored on the same Spanner server as the rows they index, but the index key must include the key of the root table. Global indexes do not have such restrictions, but they can be distributed across multiple Spanner servers and are very expensive to update within a transaction.

Schema changes. Updating the schema of a table is challenging because the rows of the table are distributed, which makes schema consistency expensive to achieve, and the system cannot have downtime.

The proposed solution is to break schema changes into smaller steps, such that as long as no two servers are more than two steps apart, the changes are safe.

Writes

Transactions

F1 supports 3 types of transactions: snapshot transactions, pessimistic transactions and optimistic transactions. It relies on the transactions supported by Spanner, so it’s worth re-reading the Implementation Details of the Spanner post.

Snapshot transactions are read-only transactions and map to the corresponding read-only transactions from Spanner. The pessimistic transaction maps to the read-write transactions from Spanner.

The optimistic transaction consists of multiple reads followed by a single write. F1 stores an extra column with the last modified timestamp for each row. When the client performs the reads, it stores the largest timestamp it saw for any row it received. It sends this timestamp together with the final write request. F1 then performs a pessimistic transaction to read only the timestamps of the affected rows. If any of them differ from the timestamp sent by the client, it aborts. Otherwise it performs the write.

It’s worth noting that F1 doesn’t create a transaction until the very last write request is sent. It assumes there were no writes between those reads and the final write, which is why we call it optimistic. This is the default transaction type, and the paper describes a few advantages it has over the pessimistic type.

Locks

F1 supports very granular locking of the tables, including row level and cell (set of columns for a given row) level.

Change history

F1 stores change history of the data. The changes are stored as regular tables in a structure called ChangeBatch, children of a root table. If a transaction updates multiple root rows, several ChangeBatches are created under their corresponding root tables.

This feature allows clients to subscribe to changes in F1 tables using a publish-subscriber system. Whenever the client receives a notification, it can ask for incremental changes after a specific timestamp and apply to the data it has stored locally.

Reads

F1 supports both NoSQL and SQL interfaces. The SQL dialect was extended to support Protocol Buffers, which are complex data types with strong types (as opposed to loosely typed structures such as JSON). This extension allows, for example, to read and update internal fields of such structures.

Local vs distributed. F1 supports centralized execution (running on a single F1 node) or distributed. The query optimizer decides which one to execute. The paper details the process of executing the query in a distributed fashion.

Distributed Query Example

The paper describes an example of a SQL query being mapped to an execution plan; the query involves two joins, filters and an aggregation.
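
The exact query from the paper is not reproduced here, but an illustrative query of the same shape (the column names are made up; only the table names come from the description below) could be:

    SELECT agcr.CampaignId, cr.Language, SUM(ac.Clicks) AS Clicks
    FROM AdClick ac
      JOIN AdGroupCreative agcr
        ON ac.AdGroupId = agcr.AdGroupId AND ac.CreativeId = agcr.CreativeId
      JOIN Creative cr USING (CustomerId, CreativeId)
    WHERE ac.Date = '2013-01-01'
    GROUP BY agcr.CampaignId, cr.Language;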

This SQL query is parsed and converted to a query plan that is executed by multiple machines called operators. A possible execution plan has scans of the base tables at the bottom feeding two join stages and a final aggregation.

Data flows from the bottom of the plan upward. The first join is a lookup join (hash join). The operation reads rows from the AdClick table until it has about 100k unique lookup keys stored in a hash table, and then performs a single (batched) lookup into the other table, AdGroupCreative. The rows from the AdClick table are kept in memory for faster lookup.

As soon as the join for these keys is completed, the results are streamed to the next stage of the pipeline.

The second join, with Creative, is a distributed join. It first repartitions each row from each table based on the values of the columns listed in the USING clause. Each partition might end up on a different machine for the next stage, which consists of joining the matching rows.

Finally, the rows are repartitioned again, now by the values of the GROUP BY columns, and aggregators apply the aggregation to the sets of rows within each partition.

Distributed Execution Overview

More generally, the query plan created by F1 is a DAG (directed acyclic graph) where each node is an operator like the join or aggregator described above. Note that there are multiple operators running the same operation in parallel.

The paper says:

A technique frequently used by distributed database systems is to take advantage of an explicit co-partitioning of the stored data.

It’s not very clear to me what they mean by that, especially because they don’t cite any references, but from context it suggests that the base operators (the scan and the lookup join) run on the same machine as the data (co-located) and do as much of the processing upfront as possible. This helps minimize data transfer, which can become the bottleneck in a distributed computation. F1 cannot do that because Spanner abstracts the data location away from F1.

A side effect is that there’s a lot of network traffic. The authors claim that Google has network switch hardware improvements which allows servers to communicate with each other close to full network speed.

When the in-memory hash tables grow too large, the operators spill part of the data to disk. So while F1 doesn’t store data in a persistent way, it still needs to write to disk for intermediate operations.

For efficiency, F1 doesn’t write checkpoints to disk. The downside is that the execution is not fault tolerant: a failure in any stage causes the entire query to fail. Retries are done transparently, but long queries (>1h) are bound to fail.

Other features

F1 exposes data from intermediate nodes to clients. This avoids having all the data concentrated at the last node of the query execution. They cite Map-Reduce jobs as an example of clients that benefit from this feature.

Conclusion

In this post we learned about one of Google’s many distributed databases, F1. It’s a loosely coupled layer on top of Spanner that provides a more familiar level of abstraction: the relational database.

It seems that we could make an analogy between Google’s systems and similar open source solutions. The Colossus File System (CFS) is the distributed filesystem that could map to the Hadoop Distributed File System (HDFS), Spanner would map to Hadoop’s YARN, and F1, which provides SQL semantics on top of Spanner, could map to Hive, which does the same for Hadoop. It’s a very rough comparison, and maybe Spanner is more similar to Spark, but it’s interesting to see the patterns and relationships between these systems.

References

[1] F1: A Distributed SQL Database That Scales
[2] Spanner: Google’s Globally-Distributed Database

Appendix: Terminology and Concepts

I was unfamiliar with some of the terminology and concepts used throughout the paper, so I had to do some extra research. Here I attempt to explain some of these topics by quoting snippets from the paper.

We also have a lot of experience with eventual consistency systems at Google

Eventual consistency means that a set of servers might not contain the most recent updates, but eventually will. For example, a client might issue a write that affects multiple machines. If the system only provides eventual consistency, then a subsequent read is not guaranteed to see the data from those writes.

The quote mentions experience with eventual consistency in a negative way, because of the extra complexities that clients have to deal with to work around this limitation.

CSS Layout

Introduction

In a previous post we studied some major components of the browser, including the rendering engine. In this post we’ll dig a bit further into the layout of the render tree and take a look at an important piece of this process: cascading style sheets, or CSS.

To recap, every DOM element is usually represented by one rectangle. The job of the rendering engine is to determine three properties of these rectangles: their size (height, width), their position (top, left) and their stacking order when they overlap. In this post we’ll see how different CSS properties can affect the resulting rectangle.

Note: we’ll use box and rectangle interchangeably in this post.

CSS

History

hakon

CSS stands for cascading style sheets. It was proposed by Håkon Wium Lie in 1994.

The CSS 1 specification was finished in 1996. CSS 2 was created in 1997 to address some issues with the previous version. CSS 3 was started in 1998 and hasn’t been concluded yet! That’s because CSS 3 is subdivided into modules, each fairly independent of the others, which resulted in different modules being at different stages.

The diagram below provides a good overview of the different CSS modules and their status:

css-modules-status

CSS Modules and levels

Wikipedia has an interesting history of the development of CSS, including the initial lack of compliance with the specification, which caused a lot of headaches for front-end developers, especially when working with early versions of Internet Explorer.

It’s interesting to take a look at the early days of CSS because they play a big role in how it looks today (due to backward compatibility). It’s useful to remember that back when CSS first came around, web pages were generally pure HTML containing mostly text, so a lot of CSS was designed around concepts such as paragraphs and simple images.

In this post we’ll focus on the CSS modules that affect the layout of the DOM elements, in particular the Visual formatting model. According to the spec, these are factors that influence the layout of a box:

* Box dimensions (height and width properties)
* Type of box (display property)
* Positional schemes (position, left and top properties)
* Relationship between elements (hierarchy of DOM tree)
* External information (e.g. window size)

We’ll briefly cover some of these properties and then play with a few examples.

The display property

There are many possible values for the display property. The main ones are none, block, inline and inline-block. I’ve been using flex increasingly, but that deserves a post of its own ([4], A Complete Guide to Flexbox, is a great reference).

A value of none removes the element from the layout calculation, so it’s effectively invisible.

A value of block causes the element to be visually formatted as a block [5] (other values like list-item and table do too). In general a block is a box that starts on a new line and takes the entire width of the parent.

An inline box, on the other hand, doesn’t start a new line: it is placed right after the previous box on the same line. It also ignores explicit width and height values and any explicit vertical spacing (i.e. top/bottom margin/padding).

The main difference between inline and inline-block is that the latter does account for width and height, and for vertical spacing [6].
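
To make the differences concrete, here is a minimal sketch (the element contents, colors and sizes are made up for illustration):

<!-- block: each box starts on a new line, even though an explicit width is set -->
<div style="display: block; width: 100px; background: lightblue;">block 1</div>
<div style="display: block; width: 100px; background: lightblue;">block 2</div>

<!-- inline: boxes flow on the same line; the width/height below are ignored -->
<span style="display: inline; width: 100px; height: 50px; background: pink;">inline 1</span>
<span style="display: inline; width: 100px; height: 50px; background: pink;">inline 2</span>

<!-- inline-block: boxes still flow on the same line, but width/height are respected -->
<span style="display: inline-block; width: 100px; height: 50px; background: lightgreen;">inline-block 1</span>
<span style="display: inline-block; width: 100px; height: 50px; background: lightgreen;">inline-block 2</span>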

The position property

There are 5 possible values for the position property: static, relative, absolute, fixed, sticky.

static is the default positioning scheme and it follows the normal positioning flow of the page.

A relatively positioned element accounts for the top and left properties. These offsets are in relation to the element’s own normal (static) position, not to the parent.

An absolutely positioned element is similar to a relatively positioned one, except that it is removed from the normal layout flow (i.e. other elements ignore its existence when being positioned) and its top and left are in relation to the closest positioned ancestor in the DOM tree (or to the document if there is none). A positioned element is any element with position other than static.

Here is an example where we only change the inner (red) div to static, relative and absolute, respectively.
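
The embedded snippet from the original example isn’t reproduced in this post, so here is a minimal sketch of what it could look like (the colors, sizes and offsets are made up); only the position value of the inner div changes between static, relative and absolute:

<!-- outer box; overflow: auto is the margin-collapsing hack mentioned below -->
<div style="overflow: auto; background: lightgray;">
  <!-- middle (blue) box: not positioned, so it is not an offset reference for absolute -->
  <div style="background: blue; padding: 20px;">
    <!-- inner (red) div: try static, relative and absolute here -->
    <div style="position: static; top: 20px; left: 20px;
                width: 50px; height: 50px; background: red;"></div>
  </div>
</div>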

Screen Shot 2018-01-13 at 4.25.01 PM

In the first example, the red div ignores the top/left properties. In the third example it “crosses” the boundary of the middle (blue) box because the blue box is not positioned.

Note that in the code above we have set overflow to auto. This is a hack to prevent margin-collapsing.

An element with position:fixed is similar to position:absolute, except that instead of being offset from a positioned ancestor, it is usually offset from the viewport. That implies that such an element does not scroll with the page. The exception is when one of its ancestors has transform, perspective, or filter set to something other than none, in which case that ancestor becomes the reference and the element behaves much like an absolutely positioned one.

Finally, as described in [2] an element with position:sticky is treated as relatively positioned until it crosses a specified threshold, at which point it is treated as fixed until it reaches the boundary of its parent.
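
As a quick illustration of position: sticky, here is a made-up sketch (the heights are arbitrary):

<div style="height: 1000px;">                  <!-- tall parent so the page scrolls -->
  <div style="height: 200px;">content above</div>
  <div style="position: sticky; top: 0; background: yellow;">
    Scrolls normally until it reaches the top of the viewport, then stays
    pinned there until its parent is scrolled out of view.
  </div>
  <div style="height: 700px;">content below</div>
</div>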

The float property

When an element is floated, it is taken out of the normal layout flow of the document. It is shifted to the left (assuming float: left) until it touches the edge of its containing box, or another floated element.

Because float implies the use of block layout, it modifies the computed value of most display values to block.

Floats are better understood in relationship with other types of boxes. Let’s check some examples.

Float + Block

Whether the renderer accounts for a floated element when positioning a new block depends on the block formatting context. For example:

Screen Shot 2018-01-12 at 10.20.07 AM

The green block ignores the presence of the blue block but its child (yellow) does not. That’s because blue and yellow are in different block formatting contexts.
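
One way to reproduce a layout like the one above is sketched below (colors and sizes are made up). I’m assuming here that overflow: hidden is what gives the yellow box its own block formatting context; the original example may use a different mechanism:

<!-- blue: a floated box, taken out of the normal flow -->
<div style="float: left; width: 100px; height: 100px; background: blue;"></div>

<!-- green: a regular block in the same block formatting context as the float;
     its box ignores the float and overlaps it -->
<div style="height: 150px; background: green; padding: 10px;">
  <!-- yellow: overflow: hidden makes it establish a new block formatting context,
       so its box is not allowed to overlap the float and is pushed to its right -->
  <div style="overflow: hidden; height: 50px; background: yellow;"></div>
</div>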

Float + Inline

Analogous to a block formatting context, an inline formatting context is a set of inline elements. In such a context, the boxes are laid out horizontally. Each “row” of inline elements is called a line box. In the presence of floats, the spec states the following:

In general, the left edge of a line box touches the left edge of its containing block and the right edge touches the right edge of its containing block. However, floating boxes may come between the containing block edge and the line box edge. Thus, although line boxes in the same inline formatting context generally have the same width (that of the containing block), they may vary in width if available horizontal space is reduced due to floats.

We can see an example of that here. In this example the first three line boxes have a shorter width than the fourth one due to the presence of the floated element.

Screen Shot 2018-01-19 at 11.20.05 PM
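
Here is a sketch of a setup that could produce this effect (the text, sizes and colors are made up); the float is tall enough to affect only the first few line boxes:

<div style="width: 300px;">
  <!-- the float overlaps the first few line boxes of the text below -->
  <div style="float: left; width: 100px; height: 60px; background: red;"></div>
  A long enough run of text that wraps into several line boxes. The line boxes
  that sit next to the float are narrower than 300px; once the text flows past
  the float's height, the line boxes take the full width of the container again.
</div>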

Clear

When applied to a floating element, clear moves the margin edge of the element below the margin edge of all relevant floats (depending on whether it’s clear left, right or both). Here’s an example:
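
A minimal sketch of such a setup (sizes and colors are made up): without clear: left the green float would sit to the right of the blue one; with it, green is pushed below blue:

<div style="float: left; width: 100px; height: 100px; background: blue;"></div>
<div style="float: left; clear: left; width: 100px; height: 100px; background: green;"></div>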

Screen Shot 2018-01-15 at 7.49.52 PM

The z-index property

Besides determining the size and position of boxes, the layout engine needs to determine how to handle overlaps. The box ordering is transitive, meaning that if a box A is under B, and B is under C, then A has to be under C.

The main attribute to control the stacking order of elements is the z-index property (in reference to the z-axis, commonly used as the “depth” dimension in 3D). But this number is only compared between boxes in the same stacking context. As we’ll see now, it’s more complicated than it seems.

First, we need to define the concept of a stacking context. A stacking context encompasses a set of DOM elements that can be compared to each other. The order between stacking contexts always takes precedence over the individual order of elements within a stacking context. For example, imagine that we have 2 stacking contexts:

Stack context 1: [A -> B -> C -> D]
Stack context 2: [E -> F -> G]
Stack context 1 -> Stack context 2

The arrow (->) represents that the element on the left is on top of the element on the right. Because A, B, C and D belong to context 1, all of them are placed over E, F and G, no matter how big the z-indexes of the elements in the second group are.

Another way to see it is that a stacking context defines an “atomic position”, meaning that elements from outside it cannot be placed in between its elements. It has to be either above or below.

In the DOM tree, if a node E satisfies certain conditions, it starts a new stacking context, which means that all elements in the DOM subtree of E will be under that stacking context (we say E is the root of the stacking context). Note that a stacking context can contain other stacking contexts. A few properties that cause a stacking context to be formed are:

* position: absolute or relative and z-index other than auto
* position: fixed or sticky
* opacity less than 1.
* others

Note that z-index has no effect on elements with position: static, so a corollary is that any effective use of z-index also creates a new stacking context.

Within a stacking context, the order of elements is defined by the spec. Here’s a simplified version, from bottom to top:

* The background and borders of the element forming the stacking context;
* Stacking contexts with negative z-indexes;
* Non-positioned block boxes;
* Non-positioned floats;
* Non-positioned inline boxes;
* Stacking contexts with positive z-indexes;

Another corollary is that a parent is always rendered below its children due to the first rule above.

Here is an example with elements of each of these categories.

Screen Shot 2018-01-18 at 8.51.44 PM


Here is an interesting example with stacking contexts.

Screen Shot 2018-01-19 at 8.47.39 PM

The red box is the parent, so it has to be under the green box. The green, purple and black boxes are in the same stacking context, so they are ordered based on their z-indexes (since red doesn’t start a stacking context, green belongs to the top-level stacking context). Finally, note how pink has the highest z-index but is still placed under green, because it belongs to the stacking context of purple, which is placed under green.
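
Here is a sketch of markup that could produce a similar stacking (the colors and z-index values are made up, and the offsets needed to make the boxes actually overlap are omitted):

<!-- red: the parent; position: static, so it does not create a stacking context
     and its background is painted below its positioned children -->
<div style="background: red; padding: 20px;">
  <!-- green, purple and black are positioned siblings in the same (root)
       stacking context, so their z-indexes are compared directly -->
  <div style="position: relative; z-index: 3; background: green;">green</div>
  <div style="position: relative; z-index: 2; background: purple;">
    <!-- purple creates a stacking context (positioned + non-auto z-index), so
         pink's huge z-index only matters inside purple: pink can never be
         painted above green -->
    <div style="position: relative; z-index: 9999; background: pink;">pink</div>
  </div>
  <div style="position: relative; z-index: 1; background: black;">black</div>
</div>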

References

[1] MDN Web Docs: CSS display
[2] MDN Web Docs: CSS Position
[3] MDN Web Docs: CSS Float
[4] A Complete Guide to Flexbox
[5] MDN Web Docs: Block formatting context
[6] StackOverflow: display: inline vs inline-block
[7] All about floats
[8] CSS 2.1 Specification: Visual formatting model

Conclusion

My main goal with this study was to learn more about the layout algorithm used by render engines, but it turned out to be an analysis of some CSS properties. It was an interesting exercise anyway. I work with CSS on a daily basis, but oftentimes I don’t take the time to read the spec or to understand why a given set of properties behaves in a certain way.