# Two-factor authentication

In this post we’ll talk about some popular security measures to protect user accounts on the web via two-factor authentication. The term refers to the requirement of two methods of authentication for logging in into a given account. The first method is mostly always a password, and the second is one of the methods we’ll describe in this post.

### Why do we need an additional form of authentication?

In an ideal world, people would have strong (long, not complex) passwords, which would never get stolen and people would never forget them. In the real world, applications have to deal with two scenarios: 1) someone else knows your password or 2) you forgot your password.

#### Scenario 1: They are not who they claim to be

If someone else knows your password, the system needs to somehow know that this person is not you.

They can then employ a secondary method of authentication to verify that you are yourself. In theory they could ask for a secondary password or ask a security question. The problem with these is that they’re exposed to the same set of vulnerability that might have compromised the original password in the first place, for example, the password is too easy to crack or there was a breach of database storing plain text passwords. In addition, since these secondary methods are to be used in very rare occasions, it’s extremely likely you’ll incur in the second problem, i.e. forget your password.

Physical devices. Nowadays, security systems can almost always rely on the fact that even if someone has your password, they do not have your physical belongings (e.g. cellphone). Some websites allow users to setup the requirement to use both a password and a secondary authentication to access the account.

#### Scenario 2: I’m who I claim to be

To address the problem of a user losing a password, some websites offers a recovery mechanism, usually by sending a secure email with a link to re-set the password or, in case of email applications like GMail, allowing the secondary authentication method as an alternative to inputing your password.

Websites such as GMail and Github also have a set of auto-generated “master passwords” that you can print and store in a safe place. After used, these passwords become invalid. This is one of the safest options, but it also requires more effort from the user (printing and making sure they can find the printed document when needed).

Scenario 1 deals with security, and Scenario 2 deals with usability (recovering passwords), and these are usually at odds with which other. Security systems have to find the balance between the two.

We’ll now cover three popular secondary authentication mechanisms: SMS (text messages), third-party app authentication and hardware authentication.

### SMS

In the SMS (Short Message Service) method, the server generates a short code and sends it to the user via a text (SMS) message which is valid for a few minutes. The user can then copy the code from the phone to the computer and send the code to the server which can then authenticate the initial request.

During this period of time, the user account is technically vulnerable to a very weak code (a 6-digit number) which is very easy to crack. However, this period is very narrow, which great limits the ability of a bad agent to take any action.

#### Vulnerabilities

The real danger of the SMS method is a bad agent being able to intercept the SMS message that is supposed to go to the user. According to this Wired article, the telecoms use a network called SS7 (Signaling System No. 7) to transport text messages. This network relies on trust to implement features such as roaming, which enables a person from New York to receive/send text messages when they’re traveling to Berlin. In this case a carrier in Berlin could request the user’s carrier back in New York to receive the text messages so it can deliver to the user.

This system has a vulnerability because a hacked carrier could be used to intercept the text messages by pretending it’s doing so on behalf of a user. The carriers might not do any checks to verify the authenticity of the request. Hence, if an attacker knows your email, phone number and has access to a hacked carrier, they could technically hack into your account.

### App Authentication

Another authentication method is to install a third-party app that can be used to generated the authentication codes. One popular option is the Google Authenticator App, which you can install on your phone (Android or iOS).

It uses the Time-based One-time Password algorithm or TOTP [2, 3]. The general idea is to perform a one-time registration between your phone and the server which consists of having both store a secret.

Whenever the client needs to authenticate itself, it uses the current timestamp and the secret to generate a hash, and from this hash it extracts a simpler code (6 characters) that the user copies and sends to the server. The server performs the same operation and if the generated code matches, it accepts the authentication.

The precision of the timestamp defines on how much time the user has to copy and send the code to the server. For example, the server can define the timestamp granularity to be 30 seconds. This also defines how long the server is vulnerable, since the code is usually short and hence easier to crack via brute force, so it cannot be too long.

### Hardware Authentication

A more recent approach to authentication is using a dedicated piece of hardware. YubiKey is an example of such device, which can be connected to the USB port. One way it can be used is part of the open authentication protocol called Universal 2nd Factor (U2F), developed by Google and Yubico (the company that manufactures YubiKey). We’ll describe this protocol next. In the discussion that follows we’ll refer to the Yubikey device generically as U2F.

The general flow consists of a enrollment phase, where the use registers the U2F in the target webpage. The webpage asks for a confirmation, which the user can do by tapping the U2F, which sends some information to the webpage, which stores it.

The other part is the signing phase. When this webpage needs to verify the user, say during login, it can ask the user to tap the U2F, which will send information that can be validated by the webpage to make sure it’s the same device that was registered in the first step.

#### Implementation details

One of the designs of this system is to be cross compatible and require no extra configuration from the user, like installing drivers. To achieve that, the communication between the U2F and the server is mediated via the browser. That means that the website calls a browser API (via JavaScript) which in turn communicates with the U2F. Henceforth when we refer to the communication between the U2F and the server, we’re implicitly assuming it’s done via the browser.

During the enrollment process, the device generates a pair of public and private keys (public-key cryptography). It sends the public key to the server which stores it together with other information. During the signing phase the server can generate a challenge (string), encrypt with the public key and send it to the U2F, which can decode it. At this point, the user is asked to tap the U2F. Once that it’s done, it sends the challenge back to the server encrypted with its private key. If the server can then decode the message, it can trust the U2F and authenticate the user.

The reason a new public and private key is generated at every enrollment is for privacy reasons (not security). This is to prevent the case of different websites that enable U2F, to share data between them and be able to track the user. For example, if the same public key was used for all enrollments, a site A and B would be able to identify the user via their public key and share this information among themselves. If site A is a online shopping, it could use this information to show targeted ads in site B.

Stateless U2F. The problem of having to generate a pair of public/private keys every time is that now the U2F has to store them somehow. Since another important part of design is for the U2F to be very accessible, the implication is that they also have to be cheap. Hence, the protocol cannot assume the device has embedded storage. The solution is to send the pair for the server to store!

This seems to defeat the whole purpose of using cryptography but this information is sent to the server encrypted, which only the U2F itself can decode. Now, in addition to the server storing the public key, it has to store this extra information which the protocol calls Key Handle [5]. During the signing phase it sends not only the encrypted challenge, but also the Key Handle.

Man-in-the-middle. One potential security hole could be a scam website that looks like the real one and acts as a man-in-the-middle. First, the user will provide the scam site with the username and password. The scam site can then forward these to the real site to trigger the secondary factor request, which will send down the Key Handle and encrypted challenge. The scam site will forward it back to the U2F. Then the U2F would encrypt the challenge, which would be sent to the scam site, which in turn would relay it to the real site, finally allowing the bad actor to login as the user.

To prevent that, the site origin can be stored in the Key Handle as well. Before deciding to send data back, the U2F can check if the origin of the server and match it against the data in the Key Handle. The site origin is hard to tamper with when using an HTTPS connection unless the real site’s certificates are compromised.

Vendor reliability. Another component of the security is the trust in the manufacturer of the device. It could have malicious intent or flawed implementation. To address that concern, the U2F should also contain an extra pair of attestation public-private pair of keys. The attestation is to prove the identity of the manufacturer/vendor. During the enrollment, the public key that is generated is encrypted with the private attestation key. The public attestation key is made available by some trusted organization for the server to consult. If it’s able to decode the generated public key, then it can trust the U2F vendor.

### Conclusion

In this post we covered 3 methods of extra protection to the online identity. We saw that SMS has serious vulnerability while third-party and hardware authentication are much safer, which is no surprise since SMS were not initially designed to serve as a secure channel for authentication. No method is 100% secure but recent authentication mechanisms go to great lengths to reduce the vulnerable surface area to a minimum.

Note how all these methods assume the possession of a physical device separate from the computer where we’re trying to log into. Physical devices are much harder to steal compared to piece of information like passwords.

### References

[1] Information Security – How does Google Authenticator work?
[2] Wikipedia – HMAC-based One-time Password algorithm
[3] Wikipedia – Time-based One-time Password algorithm
[4] Wired – Fixing the cell network flaw that lets hackers drain bank accounts
[5] Google U2F (Gnubby) Documents – Snapshot prior to joining FIDO

# DNA Sequencing

Frederick Sanger. Wikipedia

Frederick Sanger was a British biochemist. He is known for the first sequencing of a protein (1955) and a method for sequencing DNA that bears his name, the Sanger Method. Sanger won two Nobel Prizes in Chemistry [1].

In this post we’ll talk about one of the first steps of DNA analysis, DNA sequencing, which is obtaining the data from the DNA, how it’s performed (we’ll focus on the Sanger method) and some interesting computational problems associated with it.

This is the second post in the series of Biochemistry studies from a Computer Scientist perspective. Our first post is a brief discussion of basic concepts in Cell Biology.

## DNA Sequencing

DNA sequencing is the determination of the physical order of the nucleotide bases in a molecule of DNA.

The first living organism to have its genome sequenced was a bacteria, Haemophilus influenzae, whose genome is about 1.8 million base pairs.

Genome represents the whole set of  genetic information of an organism. For a bacteria, it’s singular circular chromosome, but for humans it’s the set of all 23 pairs of chromosomes. A base-pair (shortened as bp) refers to a pair of nucleotides (bases) that are bound together in the double-strands of DNA.

In the human genome, there are 3 billion of base-pairs, and it took 13 years for it to be completed.

There are two main methods of sequencing, Sanger and Next-generation [5]. We’ll talk about the Sanger in details and discuss briefly the Next-generation from a real-world use case.

## Sanger Sequencing

The Sanger method is able to determine the nucleotide sequence of small fragments (up to abound 900bps) [5] of DNA.

### Overview

The first step is cloning the fragment into multiple copies (like billions) by a process called amplification. This is essentially mimicking the DNA cloning process in an artificial setup. In very high level we have:

• Separate the double strands (denaturation)
• Add a special molecule (primer) to the extremity of each strand
• Extend the primer with nucleotides until it reaches the other extremity

Once this is complete, we end up with two double-stranded fragments. We can repeat the process to generate many copies of the original fragment (theoretically doubling at each step). This process is known as Polymerase Chain Reaction (PCR).

Once we have enough copies of the fragment, we do a similar process, but this time, we also allow extending the primer with a special nucleotide named (dideoxy nucleotide). The key difference is that once it’s added, a dideoxy nucleotide cannot be further extended, and it also contains a special marker that causes each different base to have a different color. The process is now the following:

• Separate the double strands (denaturation)
• Add a special molecule (primer) to the extremity of each strand
• Add to the primer either
• Regular nucleotide (non-terminal – can be further extended)
• Dideoxy nucleotide (terminal – cannot be further extended)

Now, instead of ending up with more clones, we’ll have fragments with a variety of lengths.

We then run theses fragments through a process called Capillary Gel Electrophoresis which roughly consists in subjecting the fragments to a electric field, which then causes fragments to move with speed proportional to their length (the smaller the fragment the faster it moves). Once a group of fragments (which have the same length) reach a sensor at end of the field, we make use of the color marker in the special nucleotide at the tip of the fragment to determine the base of that dideoxy nucleotide. This enables us to determine the sequence of the nucleotides in the original fragment!

To given an example, say that the original sequence is GATTCAGC. Because there are many copies of the fragment after amplification, it’s very likely that we’ll end up with fragments with all possible lengths, from 1 to 8. Since a fragment of length 1 moves faster than any other, it will reach the sensor first. We know that a fragment with length 1 is a base pair G-C, and C is a dideoxy nucleotide (if it was not, it would have continued extended further). Say the color marker for C is orange. Then the first color to be captured by the sensor will beorange. The second set of fragments to reach the sensor is of length 2, which is a fragment (G-C, A-T), where T is a dideoxy nucleotide. If we assume it has color red, that’s the color which will be captured next by the sensor. You can see that based on the colors captured by the sensor we can infer the nucleotide sequence in the original segment.

Fragments with different lengths and color markers. Image copied from Whole-Genome Sequencing

### Details

Let’s go in more details for these processes. For both cases we work with solutions. Finer grained manipulation of molecules is infeasible.

To separate the double strands we heat the solution up to 96ºC, in a process we call denaturation. The high temperature causes the hydrogen bonds between pairs of nucleotides to break.

In the same solution we have the primer molecules (aka oligonucleotides) which are carefully chosen to match the beginning of the DNA strand. They also bind to DNA at a higher temperature than the strands (e.g. 55ºC). This is important because we can now lower the temperature of the solution slowly, enough so that primers can bind, but not so low to the point where the original strands will join back together. This slow cooling is called annealing. These primers can be synthesized artificially.

Gap in understanding: how to choose the right primer, since we need to know at least some of sequence from the nucleotide in order to know which primer will end up binding there? One possible explanation is that we know the exact sequence where a DNA was cut if we restriction enzymes, since their binding site is known [9] and hence we have an idea of the result endpoints after the cut.

# □

We also add an enzyme, DNA polymerase, and nucleotides to the solution. The enzyme is able to extend the original primer segment by binding free nucleotides in the solution. In particular we use the enzyme of a bacteria that lives at 70ºC (Thermus Acquaticus), also know as Taq polymerase because it is functional at higher temperatures. Performing the replication at a higher temperature prevents the separated strands from gluing together again.

The dideoxy nucleotide are modified versions of the regular nucleotide by supressing the OH group. They can still be incorporated to the primer via the DNA polymerase, but they prevent other nucleotides to binding to them via the sugar-phosphate binding.

In addition these dideoxy nucleotides contain a fluorescent molecule whose color is unique for each different type of base. How are these molecules “inserted” into the nucleotides? The abstract of [6] states:

Avian myeloblastosis virus reverse transcriptase is used in a modified dideoxy DNA sequencing protocol to produce a complete set of fluorescence-tagged fragments in one reaction mixture.

which suggests it’s possible to synthesize them by using a specific virus.

Dideoxynucleotide vs deoxynucleotide. The lack of the OH group prevents a ddNTP from binding to the “next” nucleotide, effectively terminating the chain. Image copied from Whole-Genome Sequencing

## Next-Generation Sequencing (Illumina)

The Sanger method is a very slow process, making it infeasible for analyzing large amounts of DNAs such as the human genome.

Modern sequencers make use of the “Next-generation” methods, which consist in massive parallelism to speed up the process. The most advanced sequencer in the market is produced by a company called Illumina. As of the time of this writing, their top equipment, Hiseq X Ten, costs about $10 million dollars, can sequence about 18k full genomes a year and it costs about$1000 per genome [2, 3].

Illumina’s educational video [4], describes the process:

• Cut a long sequence into small fragments
• Have the segments attach to beads in a glass plate via the adapters. The beads are basically the primers.
• Amplify the segments (massive cloning)
• Extend the primer with fluorescent nucleotides
• Every time a nucleotide is added, it emits light which is captured by a sensor

After the process is over, we have the sequence of many fragments based on the colors that were emitted.

We can see that the principles resemble the Sanger method, but it uses different technologies to allow a very automated and parallel procedure.

This whole process is very vague and it’s hard to have a good understanding of it. It’s understandable given that a lot of the information is likely industry secret.

## Sequencing vs Genotyping in personal genomics

Some of the most popular personal genetic analysis companies, such as 23andMe, provide a service in which they analyze the user DNA for a small fee. It’s way cheaper than the full genome analysis provided by Illumina, but that’s because these companies don’t do DNA sequencing, but rather genotyping.

Genotyping is the process of determining which genetic variants an individual possesses. This is easier than sequencing because a lot of known diseases and traits can be traced back to specific regions and specific chromosomes.

This is the information you most likely want to know about yourself. Remember that the majority of DNA in complex organisms is not useful (introns). In humans genome, exome (the part of DNA consisting of exons) account for less than 2% of total DNA.

Sequencing technology has not yet progressed to the point where it is feasible to sequence an entire person’s genome quickly and cheaply enough to keep costs down for consumers. It took the Human Genome Project, a consortium of multiple research labs, over 10 years to sequence the whole genomes of just a few individuals.

## Sequence Assembly Problem

Current technology is unable to sequence large segments of DNA, so it has to break it down into small fragments. Once that is done, we need to reconstruct the original sequence from the data of individual fragments.

There are two scenarios:

• Mapping: matching fragments of DNA to existing sequences by some similarity criteria
• De-novo: reconstruct the DNA when there’s no prior data about that sequence.

There is a family of computational methods for the mapping case known as Sequence Alignment, which we’ll study in more details in a future post.

Gap in understanding: It’s unclear what exact information is expected for the De-novo sequencing. It is impossible to determine the original sequence by only having data about individual segments, in the same way it’s impossible to reconstruct a book by only knowing it’s paragraphs contents (but not their order), borrowing the analogy from Wikipedia [8].

One thing I can think of is that if we repeat the experiments multiple times, each time cutting the molecules at different places, it might be possible to infer the relative order of the segments if they’re unique enough.

For example, if the original segment is: GATTCAGC and we run two experiments, one yielding (GAT, TCAGC) and another (GATTC, AGC), then in this case there’s only one way to assemble (order) these sequences in a consistent way.

# □

## Conclusion

In this post we studied some aspects of DNA sequencing. I found the Sanger method fascinating. In Computer Science ideas usually translate to algorithms and code very directly, but in other science branches the mapping from ideas to implementation is a problem in itself.

In Mechanics for example, we have to work with the physical world, so when converting from a circular movement to a linear one requires some clever tricks.

This needs to be taken to another level in Molecular Biology, because we don’t have direct access to the medium like we do in a mechanical device, for example, we can’t directly manipulate a double strand of DNA fragment to separate it, but have to resort to indirect ways.

The field of Biotechnology seems to be progressing at such a pace that it was challenging to find good sources of information. I’m yet to find a resource that explains end to end the steps from the process that takes a sample of cells and outputs the DNA nucleotides to a computer, including the technical challenges in doing so. This is what this post aimed to do.

## References

[1] Wikipedia – Frederick Sanger
[2] Genohub – Illumina’s Latest Release: HiSeq 3000, 4000, NextSeq 550 and HiSeq X5
[3] Illumina Hiseq-X
[4] Youtube – Illumina Sequencing by Synthesis
[5] Khan Academy – DNA sequencing
[6] Science Magazine: A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides
[7] Towards Data Science – DNA Sequence Data Analysis – Starting off in Bioinformatics
[8] Wikipedia – Sequence assembly
[9] Dummies – How Scientists Cut DNA with Restriction Enzymes

# Log Structured Merge Trees

In this post we’ll discuss a data structure called Log Structured Merge Trees or LSM Trees for short. It provides a good alternative to structures like B+ Trees when the use case is more write-intensive.

According to [1], hardware advances are doing more for read performance than they are for writes. Thus it makes sense to select a write-optimised file structure.

### B+ Trees and Append Logs

B+ Trees add structure to data in such a way that the read operation is efficient. It organizes the data in a tree structure and performs regular rebalancing to keep the tree height small so that we never need to look up too many entries to find a record.

If the B+ Tree is stored in disk, updating it requires performing random access which is expensive for a spinning disk. Random access is order of magnitudes slower than sequential access in disk. Adam Jacobs [3] describes an experiment where sequential access achieves a throughput of ~50M access/second while a random access only 300 (100,000x slower!). SSDs have a smaller gap ~40M access/second for sequential access and 2000 access/second for random access.

The other extreme alternative to avoid disk seeks when writing is just to append content sequentially. We can do this by appending rows to a log file. The problem of this is that the stored data has no structure so searching for a record would require scanning the entire dataset in the worst case!

The LSM Tree aims to combine the best of both worlds to achieve better write throughput without sacrificing too much of read performance. The overall idea is to write to a log file but as the file gets too large, restructure the data to optimize reads. We can see it as a lazy data structure data gets updated in batches.

First we’ll describe the original version of LSM Trees and then an improved version with better performance for real world applications and used by databases like LevelDB [4].

### LSM Trees

Let’s study LSM Trees applies to the implementation of a key-value database. Writes are initially done to an in-memory structure called memtable, where the keys are kept sorted (random access of RAM is not expensive). Once the table “fills up”, it’s persisted in disk as an immutable (read-only) file.

Figure 1: Inserting new key in memtable

Searching for a key consists in scanning each file and within a file we can keep an index for the keys, so we can quickly find a record. Note that a key might appear in multiple files representing multiple updates to that key. We can scan the files by the most recent first because that would contain the last update to the key. The major cost of searching is due to the linear scanning of the files. As our database grows, the number of files will become too large to scan linearly.

Figure 2: Writing memtable to a file

To avoid that, once the number of files grow past a given number, we merge every pair of files into a new file using an external merge sort to keep the keys sorted. The linear factor of the search was cut in half, and while the file size doubled, the cost was sublinear, O(log n), so the search became twice as fast. This approach is known as tiered compaction [2].

The main disadvantage of this method is that once the files get past a certain size, the merge operation starts getting costly. Given m sorted files of size S, the merge operation would be O(m S log S). While this compaction will happen rather infrequently (roughly when the database doubles in size), it will take a really long time for that one time it happens.

Figure 3: Tiered compation

This resembles the discussion of amortized analysis for data structures [5]. We saw that while amortized complexity may yield efficient average performance of a data structure, there are situations where we cannot afford the worst case scenario, even if it happens very rarely.

### LSM with Level Compaction

An alternative approach to work around expensive worst case scenarios is to keep the file sizes small (under 2MB) and divide them into levels. Excluding the first level which is special, the set of keys each two files at a given level contain must be disjoint, that is, a given key cannot appear in more than one file at the same level. Each level can contain multiple files, but the total size of the files should be under a limit. Each level is k times larger than the previous one. In LevelDB [4], level L has a (10^L) MB size limit (that is, 10MB for level 1, 100MB for level 2, etc).

Promotion. Whenever a given level reaches its size limit, one of the files at that level is selected to be merged with the next level or promoted. To keep the property of disjoint keys satisfied, we first identify which files in the next level have duplicated keys with the file being merged and then merge all these files together. Instead of outputting a single combined file like in the tiered compaction, we output many files of size up to 2MB. During the merge, if we find collisions, the key from the lower level is more recent, so we can just discard the key from the high level.

Figure 4: Promotion from Level 0 to Level 1

#### Details

When merging, to detect which files contain a given key, we can use Bloom filters for each file. Recall that a bloom filter allows us to check whether a given key belongs to a set with low memory usage. If it says the key is not in the set, we know it’s correct, while if it says it is in the set, then there’s a chance it is wrong. So we can quickly check whether a given key belongs to a file with low memory footprint.

The first level is special because the keys don’t need to be disjoint, but when merging a file from this first level, we also include the files where that key is present. This way we guarantee that the most up-to-date key is at the lowest level it is found.

To select which file to be merged with the next level we use a round-robin approach. We keep track of which file was merged last and then pick the next one. This can be used to make sure that every file eventually gets promoted.

When outputting files from the merge operation, we might output files with less than 2MB in case we detect the current file would overlap with too many files (in LevelDB it’s 10) in the next level. This is to avoid having to merge too many files when this file gets promoted in the future.

#### Cost Analysis

Since the files sizes are bounded to 2MB, merging files is a relatively cheap operation. We saw above that we can limit the file to not contain too many duplicate keys with the files at the next level, so we’ll only have to merge around 11 files, for a total of 11MB of data, so we can easily do the merge sort in memory.

The promotion might also cascade through the next levels since once we promote a file from level t to t+1, it might overflow level t+1, which will require another promotion as well. This in fact will be common because merging only moves off 2MB worth of data to the next level, so it will require a promotion the next time it receives a new file from the level below (ignoring the fact that keys get overwritten during the merge). Fortunately the number of levels L grows O(log n) the size of the data. So for LevelDB, where the first level size limit is 100MB, even for a disk with 100TB capacity, we would still need only about 8 levels.

The fact that each key belongs to at most one file at each level allows us to keep an index (e.g. a hash table in disk) of keys to files for each level. (This of course excludes the first level, but it has a small number of files, so linear search is not expensive).

One interesting property is that each level acts as some sort of write-through cache. Whenever a key gets updated, it’s inserted at a file at lower levels. It will take many promotions for it to be placed at a higher level with other files. This means that searching for a key that has been recently updated will require scanning very few levels or smaller indexes since it will be found at lower levels.

### References

[1] ben stopford – Log Structured Merge Trees
[2] Datastax – Leveled Compaction in Apache Cassandra
[3] ACM Queue – The Pathologies of Big Data
[4] LevelDB – Wiki
[5] NP-Incompleteness – Eliminating Amortization

# Writing JavaScript using OCaml

In this post we’ll explore BuckleScript, a framework that enables developers to write JavaScript applications using OCaml syntax.

Note that BuckleScript is not a way to convert general OCaml code into JavaScript, but rather write JavaScript applications using OCaml’s syntax and type system. The end runtime and libraries are still JavaScript.

For example, if you want to work with dates, you wouldn’t look for the corresponding OCaml library, but instead include the JavaScript Date module in your OCaml code.

Let’s first compare BuckleScript with similar frameworks to understand better why it exists and when to use it. Then we’ll follow with several examples to get a better grasp of BuckleScript and how it translates to JavaScript.

It’s better to have some knowledge of OCaml, and we have written extensively about it, but it’s also easy enough to get the gist of the benefits from our simple examples.

### Comparisons

#### BuckleScript vs js_of_ocaml

js_of_ocaml is another framework that connects the OCaml and JavaScript worlds.

According to [4] both projects aim to compile OCaml code to JavaScript. The differences pointed are:

* js_of_ocaml takes low-level bytecode from OCaml compiler, BuckleScript takes the high-level rawlambda representation from OCaml compiler
* js_of_ocaml focuses more on existing OCaml ecosystem(opam) while BuckleScript’s major goal is to target npm

A simplistic way to see the differences is that BuckleScript is for JavaScript developers to start using better language features from OCaml, while js_of_ocaml is for OCaml developers to be able to run their code in the browser.

#### BuckleScript vs ReasonML

ReasonML is often mentioned together with BuckleScript, which makes it a bit confusing to understand their differences at first.

Compared do OCaml, ReasonML has a friendlier syntax (for people coming from JS) and better support for JSX (inline XML tags). The difference in syntax is significant that we are really talking about some dialect of OCaml.

Note that BuckleScript and ReasonML are complementary. BuckleScript can compile either OCaml or ReasonML to JavaScript. ReasonML in mostly about the syntax.

#### BuckleScript vs TypeScript

Typescript is a framework for adding types to a JavaScript codebase, similar to Flow.

At first glance, TypeScript and BuckleScript seem to serve different purpose, but one of the main advantages of using OCaml to write JavaScript applications is to provide type safety.

In [2], Hongbo provides a comparison between the two systems. Some of the pros and cons raised are:

TypeScript

•  Pros:
• Designed for JS, easy inter-operate with JS
• Cons:
• Compiler slow (not scalable for FB/Google scale)
• Verbose, very limited type inference
• Start up time slow (very hard to build traditional build tools)
• Types are only used for tooling – soundness is not the design goal, not very reliable
• No code optimizations

BuckleScript

• Pros:
• Compiles much faster, scales better
• Sound type system, global type inference
• Types are used in code optimization, optimizing compiler
• Outputs to JavaScript but can also generate native backend for X86, ARM
• Cons:
• Learning curve is higher compared with TypeScript

The author also suggests that OCaml is a better language than JavaScript, without providing more details.

### Setup

The easiest way to try out examples is through this BuckleScript playground.

To try it locally, we can follow [5]. It generates an initial bsconfig.json file and the compilation process can be done via npm. The process involve converting a OCaml file (.ml) to a JavaScript file (.bs.js). The latter should be included in your application, but both should be considered part of your codebase and hence committed.

### Examples

Because most people learning BuckleScript are familiar with JavaScript and less familiar with OCaml, I’ll provide the reverse examples: how to do implement a given JavaScript snippet of code in OCaml?

console.log(‘Hello World’)

How to print to stdout using BucketScript? The most basic program is simply printing a Hello World. In OCaml we would do “Print.printf” to print to stdout, but OCaml modules are not readily available for BuckleScript. We can use the Js.log() function for this purpose:

This maps to:

Note that Js log is a library provided by BuckleScript to make the integration with JavaScript more seamless.

Looping over an Array

We can use an imperative-style code in OCaml:

Note that arrays have a different syntax than in JavaScript: [| 10; 20; 30 |]. If we generate this code we get:

The interesting thing here is that we do have access to some basic OCaml libraries, for example Array.

Since we are dealing with a functional language, we might as well use some more idiomatic OCaml code. If we are to translate the code to native JavaScript structures, we have to use functions from the Js module:

Which maps to

Looping over a List

So Array is the basic structure to represent a sequence of items in JavaScript but in OCaml it’s List. What happens if we use List instead?

We simply dropped the | to use a List instead of Array and now we can use the more standard fold_left instead of reduce. This translates to:

This is very interesting! We studied functional data structures OCaml extensively in the past and the key concept is that by default data structures are persistent. If we look closely what is being passed to sum(), we see it’s a linked-list like structure: [10, [20, [30, 0]]], the 0 being used to indicate the end.

Modules

Another common pattern in JavaScript is to group a bunch of functions inside an Object. The Object name serves as namespace or a module name that can be referenced elsewhere. A natural mapping here are the OCaml modules:

which maps to:

Note that the signature is only used for compilation/transpilation checks and is erased from the final JavaScript code. Another curious thing is that the functions are exported as arrays to MyModule. To me it would make more sense to export them as Object.

Currying Functions

A nice feature from functional languages is the concept of currying, which allow us to perform partial application of functions. For example, if we have a function that adds two numbers, we can derive an increment function that partially applies sum by binding the first value to 1:

The resulting JavaScript code is:

Note that we can already perform partial application of function in vanilla JavaScript via bind(), but the syntax is not as neat:

Chaining Functions

Another neat syntax from OCaml is the chaining operator. One can chain functions via the |> operator: the result of the lefthand function is passed as argument to the righthand function.

A common use for this is when we want to apply a series of function in sequence without assigning to a variable.

For example, if we have a function to normalize a given string by converting it to lower case, trimming the leading and trailing whitespaces and also converting intermediate spaces to underscores, we could write, in JavaScript:

An alternative would be to nest the calls so we don’t have to repeat the variable, but that would hurt legibility. In OCaml, we could chain these calls:

Note the [%bs.re] tag. It is a macro that allows us to write regexes using JavaScript syntax. We can avoid repeating the module names if they are all the same:

Using JavaScript libraries

One of the main selling points of BuckleScript is that you can adopt it gradually, module by module. This is possible because we can require JavaScript modules inside our OCaml code. For example, if we were to convert the following code that reads a file asynchronously in JavaScript:

We could do:

Here, the OCaml code is more verbose but we provided a stronger contract by typing the function readFile(). The syntax for importing modules is

Note: if is the same as , the latter can be omitted.

Objects as Maps

In JavaScript Objects are often used either as maps (key-value) or records (entries of distinct fields). In OCaml we can rely on types to enforce the specific use we want via types. In the example below, we declare a map with type string to int. If we try to set a value with a different type we get a compilation error:

Objects as Records

To represent an Object as a record, we can use a OCaml record type syntax:

We added the [@bs.optional] to indicate that a field is optional. We also added the [@@bs.deriving abstract] attribute to indicate it should not be directly instantiated like

Instead, it generates a “constructor” function. In the same way, the properties of a record are not directly available. They also need to go through intermediate auto-generated accessors:

The generated JavaScript code translates to an Object:

The interesting thing is that the generated JavaScript Object is mutable, but within the OCaml code, the record cannot be modified. It’s possible to mark it mutable, but the default immutability makes it easier to reason about code.

### Downsides

The benefits being stated, there are two main potential drawbacks of using BuckleScript.

Mixed languages. Adopting BuckleScript will cause the codebase to have a mix of different languages, which makes it harder for new developers to ramp up. Of course this can be mitigated by converting the entire codebase to use OCaml.

Debugging. We’ll be writing code in on language but it’s another language that will end up being executed. If a problem happens in the underlying JavaScript code, how to figure out which OCaml code is generating the faulty code?

BuckleScript tries to solve this issue by preserving the structure of the code as much as possible so that it’s easier to understand what parts maps to what. This works well if we are using the Js wrappers that resembles the JavaScript code patterns, but it’s unclear how easy the structure is preserved if we use more of OCaml persistent data structures or functional programming patterns like currying.

One possible improvement would be to add some traceability to the generated JavaScript code such that you won’t need to look at the JavaScript code most of the time, in the same way that one doesn’t usually need to inspect Assembly code when their C++ application crashes.

### Conclusion

In this post we did a comparison of BuckleScript with different frameworks and libraries to understand why and when to use it. Following that, we studied a few basic examples which one might encounter in a day-to-day JavaScript programming and how to express that in OCaml.

Through these examples, we saw the OCaml type system in use, as well as some neat syntax and immutable data structures, which can lead to more readable, succinct and reliable code.

### References

[1] BuckleScript Examples
[2] BuckleScript: An OCaml to JavaScript compiler (Hacker News discussion)
[3] BuckleScript playground
[4] BuckleScript/bucklescript – Comparisons
[5] BuckleScript – New Project

# Bulls and Cows

Bulls and Cows (also known as MOO) is a 2-player game in which one player comes up with a secret and the other has to guess the secret. The secret consists of 4 digits from 0 to 9, where each digit is distinct. Player 2 has to guess these 4 digits, in the right order. At each guess from player 2, player 1 provides feedback as hints. The hints are two numbers: one telling how many digits from the secret player 2 got in the right order (bull), and another telling how many digits they got but in the wrong order (cow).

For example, if player 1 came up with 4271, and player 2 guessed 1234, then bull is 1 (digit 2), and cow is 2 (1 and 4 are in the secret, but not in the order guessed by player 2).

The goal of the game is for player 2 to guess the secret with the least amount of guesses.

Bulls sculptures from Cyprus. Dated from sometime between 2000 and 1600 BC. Photo take at the National Archeology Museum in Athens.

In this post we’ll present computational experiments to solve a game of Bulls and Cows optimally.

#### Objective function

We’ll focus on a search strategy to minimize the maximum number of guesses one has to make, for any possible secret. Alternatively, John Francis’s paper [1] proposes heuristics that minimize the expected number of guesses, that is, for the worst case these heuristics might not be optimal, but they perform better on average.

Because the game has turns, our solution is not a single list of guesses, but rather a decision tree where at each node we branch depending on the hint we get back. The metric we are trying to optimize is then the height of such tree.

To break ties between trees of the same height, we consider the smallest tree (the one with least nodes).

#### Brute-force search

Our search algorithm is recursive. At any given recursion level, we are given a set containing all possible numbers that could be secrets. We’ll try all combinations of guesses and hints and see which ones yield the best solution.

When we simulate a given guess and hint, we are restricting the possible numbers that can still be secret, so in the next level of recursion, the set of potential secrets will be smaller.

For example, say that the possible secrets are any 4-digit number with no repeated digits. We then use one of them as a guess, say, [0, 1, 2, 3]. One possible hint we could receive is (3, 0). What are the possible secrets that would cause us to receive this hint? [0, 1, 2, 4][0, 1, 2, 5] and [7, 1, 2, 3] are a few of those. If we recurse for this guess and hint, the set of secrets in the next level will be restricted to those that would return a hint (3, 0) for [0, 1, 2, 3].

Continuing our example, is [0, 1, 2, 3] the best guess we can make at that point? After recursing for all possible hints, we’ll have a decision tree rooted on [0, 1, 2, 3]. The key idea of the search is that we can minimize the height of the final decision tree by minimizing the subtrees at each recursion level (greedy strategy). Thus, we want to find the guess with the shortest subtree.

In pseudo-code code, the search looks like this:

We start with all possible secrets (4-digit number with no repeated digits) and the tree returned should be optimal in minimizing the number of guesses for the worst case.

#### Rust implementation

I initially implemented the search in Python, but it was taking too long, even when artificially restricting the branching. I re-implemented it in Rust and saw some 20x speedups (caveat: I haven’t really tried to optimize the Python version).

The main difference between the pseudo-code and the actual implementation is that we pre-group possible_secrets by their hint for guess, which is more efficient than scanning possible_secrets for all possible hints:

The function group_possibilities_by_score() above makes use of compute_score and it also uses a fixed-length array for performance. The set of hints is proportional to the squared size N of the guess, in our case N=4.

Turns out that the Rust implementation is still not efficient enough, so we’ll need further optimizations.

#### Optimization – classes of equivalence

What is a good first guess? It doesn’t matter! We don’t have any prior information about the secret and every valid number is equality probable. For example, if we guess [0, 1, 2, 3] or [5, 1, 8, 7], the height of the decision tree will be the same. An intuitive way to see why this is the case is that we could relabel the digits such that [5, 1, 8, 7] would map to [0, 1, 2, 3].

Francis [1] generalizes this idea for other cases. Say that at a given point we made guesses covering the digits 0, 6, 7, 8 and 9 at least once. Now say we our next guess is [0, 8, 9, 3]. In here, 3 is the only digit we haven’t tried yet, but using the re-labeling argument, we can see that [0, 8, 9, 1] would yield the same decision tree if we were to swap the labels of 1 and 3. This allow us to skip guesses that belong to the same class, which reduces the branch factor.

We can generate an ID representing a given class. A way to do this is by adding one to each digit we have tried before and  making any digit we haven’t as 0, then converting that number from base 11 to base 10. For example, [0, 8, 9, 1] becomes [1, 9, 10, 0]. If this was a number in base 11, in base 10 it is 2530 ((((1*11) + 9)*11 + 10)*11 + 0). If we do the same with  [0, 8, 9, 3], we’ll get the same number. The code below implements this idea.

In our case, D = 10 and we store the set of visited digits in a bitset, visited_bits (that is, bit i is 1 if digit i has been visited.

On the search side, we keep a set of classes ids already visited and skip a guess if its class is already in there.

With this optimization the search algorithm runs in under 2 minutes. By inspecting the height of the resulting tree we conclude that the minimum number of guesses necessary for any secret is 7.

The complete code is available on Github.

#### Visualizing

The JSON output by the search algorithm is quite big (the smallest tree with height 7 has almost 7000 nodes). A more interesting way to inspect the data is to create a Bulls and Cows solver. We feed the JSON to this application and ask the user to select the outcome based on the secret they have in mind. We are basically traversing the edges of the decision tree.

I’ve uploaded the application to my personal website and the source code is available on Github.

### Conclusion

I learned about this game pretty recently and was curious to learn of good strategies to solve this problem. From what I’ve been reading, the heuristics that yield good solutions are very complicated for a human to perform. This leads to an question: is there are any solution which is simple to follow but that is reasonably good?

One natural extension for this problem is to use larger numbers of digits (N > 4) and have each element be sourced from 0 to D – 1. An exhaustive search might be prohibitive, but maybe we can come up with heuristics with constant guarantees. What is a lower bound for the number of guesses for variants with arbitrary N and D?

I struggled to implement this code in Rust, but I found the challenge worthwhile. I learned a bit more about its specifics, especially regarding memory management and data types.

In [1], the author mentions that with optimizations the search took 45 minutes to run on their laptop (2GHz Intel Core 2 Duo). I ran mine on a Intel i7 2.2GHz and was surprised by the running time of 2 minutes. CPUs are not getting exponentially faster these days and my code runs on a single thread.

### References

[1] Strategies for playing MOO or “Bulls and Cows”.
[2] Github – Blog Examples: Bulls and Cows

# Cell biology and programming

Rosalind Franklin was an English chemist and X-ray crystallographer. She is best known for her work on the X-ray diffraction images of DNA, particularly Photo 51, while at King’s College, London, which led to the discovery of the DNA structure by Watson and Crick.

James Watson, Francis Crick and Maurice Wilkins shared the Nobel Prize in Physiology or Medicine in 1962. According to Wikipedia [1], Watson suggested that Franklin would have ideally been awarded a Nobel Prize in Chemistry four years after Franklin passed away due to ovary cancer.

Photo 51

In this post we’ll study some basic concepts of cell biology, mostly around the DNA. We’ll start by introducing the structure of the DNA and then two of its main functions: replication and synthesis of protein. Since this is a programming blog, we’ll provide analogies (some forceful) to computer systems to help us relate to prior knowledge.

The end goal of this post to provide a good basis for later learning bio-informatics algorithms.

### Genome

Genome is the set of information necessary for creating an organism. In a computer programming analogy we can think of the genome as the entire source code of an application.

#### Chromosome

Let’s recall that cells can be classified into two categories, eukaryotic (from the Greek, good + nucleus) and prokaryotic (before + nucleus). As the name suggests, eukaryotic cells have a well define nucleus surrounded by a membrane.

In eukaryotic cells the genome is divided across multiple chromosomes which are basically compacted DNA. They need to be compacted to fit within the nucleus. According to Science Focus [3], if stretched, the DNA chain would be 2 meters long.

Humans cells usually have 23 pairs of chromosomes, 22 of which are called autosomes and they are numbered based on size (autosome 1 is the largest). The remaining is a sex chromosome and can be of type either X or Y.

Men and women share the same types of autosomes, but females have two copies of chromosome X, and men one chromosome X and one Y.

Chromosomes are usually depicted as X-like structures and neatly arranged [4], but recent research was able to visualize the actual structure and it looks more like:

Prokaryotes (e.g. bacteria) on the other hand typically store their entire genome within a single circular DNA chain.

In our computer programming analogy, the chromosome serve as units of organization of the source code, for example the files containing the code. If your application is simple, like a bacteria, the entire source code could be stored in a single file!

We can push the analogy even further and think of the tangled DNA within the chromosome as the “minification” step that JavaScript applications apply to the original source code to reduce network payload.

#### DNA

The deoxyribonucleic acid, more commonly known as DNA is a structure usually composed of two strands (chains) connected through steps to form a double helix.

Conceptual representation of the DNA: the double helix

In our analogy the DNA could represent the text of the source code.

#### Nucleotide

Nucleotides are the discrete units that form the base of the DNA. In the DNA it can be one of: Adenine, Cytosine, Guanine and Thymine.

Chemically speaking, we can divide the nucleotide in 3 parts: a sugar group, a phosphate group and a nitrogen base. The first two are common among the nucleotides, while the base differentiates them.

Guanine, one of the 4 possible nucleotides in the DNA

Any two nucleotides can be connected through their sugar and phosphate groups (the sugar group of one nucleotide links to the phosphate group of the next). Nucleotides linked this way form the backbone of a single strand of the DNA.

In addition, two nucleotides can be linked together via their nitrogen base, but in this case, there’s a restriction on which two bases can be linked together. Adenine can only be paired with Thymine, and Cytosine can only be paired with Guanine. These pairings form the “steps” in between two DNA strands.

4 nucleotides linked together through the sugar-phosphate groups or through the nitrogen bases.

Because the phosphate group of a nucleotide is linked to the sugar group of the next, we can define a direction for a chain of nucleotides. The endpoint that ends with the phosphate group has 5 carbon molecules and is called 5′ (read as five prime), while the sugar group has 3 and is called 3′ (three prime). The two strands in a DNA molecule are oriented in opposite directions, as depicted in the figure above.

We can now complete the analogy of computer programming by stating that nucleotides are the characters that compose the text of the source code (DNA). In our case, the alphabet contains 4 letters: A (Adenine), C (Cytosine), G (Guanine), T (Thymine).

### The replication

The DNA is capable of replicating itself so that it can be passed down to new formed cells.

In high-level, the double strands of the DNA start separating and other structures start binding new nucleotides to each strand (templates) until both the strands are completely duplicated (and form a double strand again).

The separation of the strands is triggered by the protein helicase, and can happen at any point of the DNA and it might happen in many places at the same time. One way to visualize this is opening a zipper jacket from the middle and keep pushing it open in one direction.

While the strands are being separated, proteins called DNA polymerase starts scanning each strand and making a copy strand by adding nucleotides to it.

The DNA polymerase can only extend an existing chain of nucleotide, so it requires an initial fragment to start with. For that reason, in the beginning of the duplication process, a small fragment of DNA or RNA, called primer, needs to be connected to the strand.

One important limitation the polymerase has is that it can only process a strand in one specific direction: from 3′ to 5′. But since we saw that strands are oriented in opposite direction of each other, it means that the replication process doesn’t happen symmetrically on both strands.

For the strand oriented 3′ to 5′ the replication is smooth, generating a continuous strand. For the reverse strand though, it will be done in pieces, because the polymerase is adding nucleotides in the opposite side of where the opening is happening. These pieces are known as Okazaki fragments and later joined together by another protein called ligase.

We can see these two cases taking place in the picture below:

One interesting fact about this process is that errors do happen and there are error corrections in place to minimize them. The result of the replication is also not 100% accurate, especially at the endpoints of the strands, where each new replica formed has its endpoints shorter than the original template. To compensate for this, the DNA has repeated redundant segments of nucleotides at the endpoints, know as telomeres, but eventually this extra segments get wore off to a point they cannot be used for replication. The shortening of telomeres is associated with aging.

In our analogy to computer programming, we could imagine the replication being the distribution of copies of the source code to other people. This analogy is weak, though. If we want to be precise, we’d need to come up with a program that is capable of printing its own source as output. This was named Quine by Douglas Hofstadter in Gödel, Escher, Bach [6] and it’s an example of a self-replicating automata, also studied by the famous computer scientist John von Neumann.

### Protein Production

A second function of the DNA is the production of proteins. The first step, called transcription, is very similar to replication: One of the strands serve as template, and a new strand with complementary base is generated. The difference is that instead of Thymine, the nucleotide Uracil is used. The resulting structure is called mRNA, short for messenger RNA.

Production of mRNA and its exit to the cytoplasm

The mRNA detaches itself from the DNA strand and exits the nucleus to the cytoplasm where the second step, translation, begins. In there, it is “interpreted” by a structure called ribosome. Every 3 nucleotides, denominated codon, translates to an amino acid, which in turn form a protein. The mapping of every possible 64 combinations of codons are displayed below:

Mapping of codons to amino-acids. For example, GGA maps to Glycine.

The tRNA, short for transfer RNA, is a small chain of RNA that connects amino-acids to their corresponding codon in the mRNA. There are some special codons that indicate the end of the process. The output of this translation will be a peptide chain.

Synthesis of a protein

The production of protein is the actual functioning of the DNA. If we link to the computer programming model, we could think of the source code (DNA) being interpreted or compiled into an actual program that can be executed. It’s incredible how biological systems evolved to define an explicit code for the end of the translation, much like how a semi-color or new line often indicates the end of an expression.

### Applications of Bio-informatics

How can computer science help with the study of biological systems? Computers are  good at repetitive tasks and handling large amounts of data. Here are some applications according to Wikipedia [7].

• DNA sequencing – consists of determining the order of nucleotides in the DNA from raw data. Computers are useful here because the raw data often come as fragments that need to be merged.
• Protein structure prediction – given a sequence of nucleotides, determine the chain of amino-acids is well-understood, but predicting the structure of the final protein is an open problem.
• Reduce noise in massive datasets output by experiments.

### Conclusion

From reading a lot of introductory material, I felt that there were a lot of  imprecise or ambiguous descriptions. That seems to come from the sheer complexity of biological systems and most results coming from empirical evidence, which currently only provide an incomplete picture of the whole, but new information still comes at a very frequent pace, sometimes making existing models obsolete or incorrect.

I last studied Biology back in high school. I don’t intend to study it in depth, but just enough to understand how computer science can be used to solve particular problems.

The idea of modeling living organisms as organic computers is fascinating. As we discover more about the inner workings of cells, I hope we can come up with better models and be able to predict their outcome with greater accuracy.

### References

[1] Rosalind Franklin – Wikipedia
[2] Photo 51 – Wikipedia
[3] How long is your DNA?
[4] How many chromosomes do people have?
[5] What a chromosome really looks like
[6] Gödel, Escher, Bach: An Eternal Golden Braid – Douglas R. Hofstadter
[7] Bioinformatics – Wikipedia

In this post we’ll study the hyper log log algorithm and provide an implementation in Rust.

### Introduction

Philippe Flajolet was a French computer scientist at INRIA [5]. He introduced the field of Analytic Combinatorics and is known for the creation of a family of algorithms of probabilistic counting, including the HyperLogLog.

HyperLogLog is a probabilistic algorithm to determine the number of distinct elements (or cardinality) from a multi-set (a set that allows repeated values) with high accuracy using very low memory. It is then suitable for streaming applications or distributed databases in which we do not have the luxury of keeping all distinct values in memory at any time.

### The HyperLogLog

The intuition behind the HyperLogLog is the following. Imagine we have a basket full of balls, each containing a number, which can be repeated. We want to estimate how many distinct numbers we see in the basket with the limitation that we can only look at one ball at a time and we do not have paper and pen, so we have to rely on memory.

The number of balls in the basket can be very large, so it’s unrealistic to keep track of all possible values we see, and we need a proxy to estimate the number of distinct values. The trick is to find some property that only very rare numbers satisfy, and the intuition is that if such property was satisfied by any of the numbers we saw, then there’s a good chance we saw a lot of numbers (or we’re very lucky).

To start, say that the property is “multiple of 2”. That is not a rare property, so if we assume the numbers in the balls do not have any special pattern, on average, we just need to find 2 different numbers to see that property being satisfied. Now say the property is “multiple of 4”. Less numbers satisfy this property, so the average number of distinct values we need to look at would be 4. We can keep going on for higher and higher powers of 2, and the average number of distinct values we look at would need to be as big.

This means that if we keep track of the largest power of 2 that divides any number we drew, we could estimate that that was the number of distinct values! The power of 2 property is interesting because it doesn’t change if there are duplicated values (which is great because we only want distinct values) and it doesn’t rely on the magnitude of the values as long as the set of values are uniformly distributed.

This estimate is not great though because the error is proportional to the number of distinct values found. For example, if we found out that the largest power of 2 divider was 2048 (2^11), the actual number of distinct values could be between 2048 and 4095 (2^12 – 1). Also, we could get unlucky and have all the balls in basket be have a single number, say 1024, so the estimate would be off by a lot.

Averaging

To reduce the chances of drawing a ball with a value that turns out to be a large power of 2 and throwing off the estimates, we divide the balls into m groups. The important property of the groups is that balls with the same values should go to the same group and the distribution should be uniform.

We can then find the largest power of 2 divider for each group and average them out. This would help in the case where we saw a single number 1024, because while one group would estimate it to be 1024, all the other groups would be find that the largest divisor was 1 (empty group).

To assign each number to a group, we could use modular arithmetic. For example, if we had 12 groups, we use the remainder of a number by 12 as an index to the group. The problem is that the elements assigned to a given group have a property that will bias the result. For example, for the group corresponding to the reminder 0, all numbers in there are obviously a multiple of 12 and hence 2^2. To make the numbers in the group unbiased, we need to discard the information used to assign them to groups. An easy way to achieve that is to represent each number as binary, use the first bits to decide which group to assign it to, and then discard those bits when computing the power of 2.

We can see that we can increase the number of groups to reduce errors but the tradeoff is that it requires to keep more information in memory. This is the idea first proposed by [2] and was called stochastic averaging.

Harmonic Mean

The HyperLogLog algorithm uses a different combination of the powers of two to obtain a better estimate than using averages: the harmonic mean [1]. To recall, the harmonic mean consists in averaging the reverse of the elements and then reversing the result. For example, the harmonic mean of 1, 4 and 4 is

The bulk of the HyperLogLog paper [1] is actually proving that this metric yields an estimate with smaller errors than a simple average.

Hashing

Not all real world input will be numbers that are distributed uniformly. For example, we might be interested in counting the number of distinct values of a string column in a database. We then need to transform these values using a hash function which maps these inputs to numbers uniformly.

### Algorithm

We are now ready to outline the core of the algorithm to estimate the cardinality of a multi-set of values.

Given a stream of n elements:

• For each element
• Hash it
• Use the first bits of the hash to determine to which group j to assign it to
• Keep track of the largest power of 2 that divides some value in the group, and store the exponent in M[j]

Finally, estimate the number of distinct values based on a harmonic mean of the values in M.

The Expected Value and the Alpha Correction

The expectation of the number of distinct values is given by [1]:

E is equivalent to the harmonic mean times $\alpha_m m$, and the constant $\alpha_m$ is a constant to correct some bias. For implementation purposes, the authors provide approximations to this constant:

Small and Big Range Corrections

When the estimated number of distinct elements is relatively small compared to the number of groups m, the author propose a correction.

Let $V$ be the number of groups to which no element was assigned to. If $V > 0$, then experiments show that for $E \le \frac{5}{2} m$, we can change the estimate to

Conversely, if the number of distinct elements is very high, closer to 2^32, then the probability of hash collision are very high. The correction accounts for that:

Implementation in Rust

We’ll now present an implementation using Rust. The first part of the algorithm consists in computing M, which we name more clearly as first_non_zero_by_bucket. The following code implements the pseudo-code described above:

We don’t need to know much Rust to read the code above. It’s worth mentioning that Rust is very strict about types, so we need to perform explicit conversions. Also we use 2 bit operations: one is to obtain the least significant k bits of an integer by using a bit mask. It relies on the fact that (2^k)-1 is a number with k bits 1 and doing a bitwise AND with any number has the effect of only extracting the first k bits of that number. The other trick is to divide a number by 2^k, which can be done by shifting the bits to the right, via the >> operator.

The hash function we use here is from the package farmhash, which is a Rust implementation of Google’s Farmhash, which in turn is a variant of Murmurhash [6]. It basically takes a string and shuffles its bits in a hopefully uniform way, generating a 32-bit integer:

first_non_zero_bit_position() is given by:

The formula to obtain the expected number of distinct values is given by

The code below implements this function:

The values of alpha were discussed in the The Expected Value and the Alpha Correction above.

The correction for small and large ranges can be implemented as:

The complete code is available on Github.

### Experiments

For values of b ranging from 4 to 16, I ran the program 100 times for n=100k, with numbers randomly selected from 1 to 100k. Then I plotted the results using the box plot chart using a R script:

In the chart above, the x-axis represents the number of experiments we divided the input into, and the y-axis represents the relative error compared to the actual value. We can see that as we increase the number of experiments, the errors go down.

This chart was also helpful in finding a bug in my implementation: the initial plot had a sudden spike for values larger than 11, which was when the small range correction applied. After some debugging, I realized that algorithm should you the natural logarithm, not log2. It was great to spot the bug via data visualization!

### Conclusion

I’ve been wanting to study the HyperLogLog algorithm for a while, and also one of my resolutions for this year is to learn Rust. It was a productive exercise to implement it.

I’ll post some future impression on Rust from someone with experience in C++ in a future post.

### References

[1] HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
[2] Probabilistic Counting Algorithms for Database Applications
[3] Thu Trang Pham – Curiosity #2: How does Prestodb implement approx_distinct?
[4] Probabilistic Data Structures for Web Analytics and Data Mining
[5] Gödel’s Lost Letter and P=NP: Philippe Flajolet 1948–2011
[6] GitHub: seiflotfy/rust-farmhash