# Consistent Hashing

Daniel Lewin was an Israeli-American mathematician and entrepreneur. He was aboard the American Airlines Flight 11, which was hijacked by al-Qaeda during the September 11 attacks.

Tom Leighton is a professor (on leave) of Applied Mathematics at CSAIL @ MIT and an expert on algorithms for network applications.

Together, Lewin and Leighton founded the company Akamai, which was a pioneer in the business of content delivery networks (CDNs) and is currently one of the top players in the segment. One of the key technologies employed by the company was the use of consistent hashing, which we’ll present in this post.

### Motivation

One of the main purposes of the CDN is to be a cache for static data. Due to large amounts of data, we cannot possibly store the cache in a single machine. Instead we’ll have many servers each of which will be responsible for storing a portion of the data.

We can see this as a distributed key-value store, and we have two main operations: read and write. For the write part, we provide the data to be written and an associated key (address). For the read part, we provide the key and the system either returns the stored data or decides it doesn’t exist.

In scenarios where we cannot make any assumptions over the pattern of data and keys, we can try to distribute the entries uniformly over the set of servers. One simple way to do this is to hash the keys and get the remainder of the division by N (mod N), where N corresponds to the number of servers. Then we assign the entry (key, value) to the corresponding server.

The problem arises when the set of servers changes very frequently. This can happen in practice, for example, if servers fail and need to be put offline, or we might need to reintroduce servers after reboots or even add new servers for scaling.

Changing the value of N would cause almost complete redistribution of the keys to different servers which is very inefficient. We need to devise a way to hash the keys in a way that adding or removing servers will only require few keys from changing servers.

### Consistent Hashing

The key idea of the consistent hashing algorithm is to include the key for the server in the hash table. A possible key for the server could be its IP address.

Say that our hash function h() generates a 32-bit integer. Then, to determine to which server we will send a key k, we find the server s whose hash h(s) is the smallest that is larger than h(k). To make the process simpler, we assume the table is circular, which means that if we cannot find a server with hash larger than h(k), we wrap around and start looking from the beginning of the array.

Big blue circles are servers, orange circles are keys. Right: If we remove server S3, only entries corresponding to keys K5 and K4 need to be moved / re-assigned.

If we assume that the hash distributes the keys uniformly, including the server keys, we’ll still get a uniform distribution of keys to each server.

The advantage comes to when adding and removing servers to the list. When adding a new server sx to the system, its hash will be in between 2 server hashes, say h(s1) and h(s2) in the circle. Only the keys from h(s1) to h(sx), which belonged to s2, will change servers, to sx. Conversely, when removing a server sx, only the keys assigned to it will need to go to a different server, in this case the server that immediately follows sx.

How can we find the server associated to a given key? The naive way is to scan linearly the hashes until we find a server hash. A more efficient way is to keep the server hashes in a binary balanced search tree, so we can find the leaf with the smallest value larger that h(x) in O(log n), while adding and removing servers to the tree is also a O(log n) operation.

### Implementation in Rust

We will provide an implementation of the ideas above in Rust as an exercise. We define the interface of our structure as

Note that we’ll store the list of servers (containers) and keys (entries) in separate structures. We can store the entries in a simple hash table since we just need efficient insertion, deletion and look up. For the containers we need insertion, deletion but also finding the smallest element that is larger than a given value, which we’ll call successor. As we discussed above, we can use a binary balanced search tree which allow all these operations in O(log n), for example a Red-Black tree. I found this Rust implementation of the Red-Black tree [1].

Finally, we also include the hash function as part of the structure in case we want customize the implementation (handy for testing), but we can provide a default implementation.

To “construct” a new structure, we define a method new() in the implementation section, and use farmhash as the default implementation for the hash function [2].

The insertion and removal are already provided by the data structures, and are trivial to extend to ours. The interesting method is determining the server corresponding to a given key, namely get_container_id_for_entry().

In there we need to traverse the Red-Black tree to find the successor of our value v. The API of the Red-Black tree doesn’t have such method, only one to search for the exact key. However due to the nature of binary search trees, we can guarantee that the smallest element greater than the searched value v will be visited while searching for v.

Thus, we can modify the search algorithm to include a visitor, that is, a callback that is called whenever a node is visited during the search. In the code below we start with a reference to the root, temp, and in a loop we keep traversing the tree depending on comparison between the key and the value at the current node.

Let’s take a detour to study the Rust code a bit. First, we see the unsafe block [3]. It can be used to de-reference a raw pointer. A raw pointer is similar to a C pointer, i.e. it points to a specific memory address. When we de-reference the pointer, we have access to the value stored in that memory address. For example:

The reason we need the unsafe block in our implementation is that self.root is a raw pointer to RBTreeNode, as we can see in line 1 and 4 below:

The other part worth mentioning is the type of the visitor function. It’s defined as

It relies on several concepts from Rust, including Traits, Closures, and Trait Bounds [4, 5]. The syntax indicates that the type of visitor must be FnMut(&K), which in turns mean a closure that has a single parameter of type &K (K is the type of the key of the RB tree). There are three traits a closure can implement: Fn, FnMut and FnOnce. FnMut allows closures that can capture and mutate variables in their environment (see Capturing the Environment with Closures). We need this because our visitor will update a variable defined outside of the closure as we’ll see next.

We are now done with our detour into the Rust features realm, so we can analyze the closure we pass as visitor. It’s a simple idea: whenever we visit a node, we check if it’s greater than our searched value and if it’s smaller than the one we found so far. It’s worth noticing we define closest_key outside of the closure but mutate it inside it:

We also need to handle a corner case which is that if the hash of the value is larger than all of those of the containers, in which case we wrap around our virtual circular table and return the container with smallest hash:

The full implementation is on Github and it also contains a set of basic unit tests.

### Conclusion

The idea of a consistent hash is very clever. It relies on the fact that binary search trees can be used to search not only exact values (those stored in the nodes) but also the closest value to a given query.

In a sense, this use of binary trees is analogous to a common use of quad-trees, which is to subdivide the 2d space into regions. In our case we’re subdividing the 1d line into segments, or more precisely, we’re subdividing  a 1d circumference into segments, since our line wraps around.

I struggled quite a bit with the Rust strict typing, especially around passing lambda functions as arguments and also setting up the testing. I found the mocking capability from the Rust toolchain lacking, and decided to work with dependency injection to mock the hash function and easier to test. I did learn a ton, though!

### References

[1] GitHub: /tickbh/rbtree-rs
[2] GitHub: seiflotfy/rust-farmhash
[3] The Rust Programming Language book – Ch19: Unsafe Rust
[4] The Rust Programming Language book – Ch13: Closures: Anonymous Functions that Can Capture Their Environment
[5] The Rust Programming Language book – Ch13: Traits: Defining Shared Behavior

# Rust Memory Management

Graydon Hoare is a Software Developer who created the Rust programming language while working at Mozilla Research [1]. He has an interesting presence on the internet, for example responding to this question and on Twitter.

In this post we’ll talk about one of the key features of Rust, which I find the hardest to wrap my head around, which is its memory management.

### Motivation

In most programming languages we either have to manage memory allocation ourselves or rely on a complex garbage collector to which we have limited control and can lead to unpredictable performance bottlenecks.

Rust has a set of constraints around memory allocation that results in a deterministic automatic memory management.

We’ll now delve into more details on what these constraints are, how they enforce the necessary guarantees to enable an efficient memory management by the runtime.

### Ownership Rules

Rust has the following basic rules around ownership:

• Each value in Rust has a variable that’s called its owner
• There can only be one owner at a time
• When the owner goes out of scope, the value will be dropped

They’re very basic but have deep implications in how we think about programs. Let’s analyze some of them.

### Assignment transfers ownership

To conform the second rule, whenever we assign a variable to another, we are transferring the ownership from one variable to another. For example:

In the example above, vec1 transfer ownership of its data to vec2. This means that we can neither read nor write to vec1 anymore. It’s as if it was out of scope (unless it is assigned ownership to some other data later).

By having a single owner we don’t have to worry about keeping track of references to a given object as garbage collectors do to know when we are allowed to free the memory. For example, if we had:

Because vec2 owns the vector allocated initially and it goes out of scope after line 4, the runtime can free the memory safely, since we know vec1 cannot access that data after line 3.

Similarly, when we use a variable as argument to a function, its data is transferred to the parameter. In the example below, vec1‘s data is transferred to vec in mutate_vec(), so we cannot access it in line 9.

One way to return the ownership back to vec1 is for the function to return the argument.

### References & Borrowing

To avoid transferring data to a variable on assignment, we can use references. In the example below, vec2 “borrows” the data from vec1 but there’s no ownership transfer. We have read access to the data via vec2, which we can do by dereferencing vec2 with &vec2.

If we want write access, we need to make vec1 mutable and also obtain a mutable reference to vec1 via the &mut operator, like in the example below:

However, if we try to access the data via vec1 while vec2 has a mutable reference to it, we’ll get an error.

We can take as many (non-mutable) references as we want:

But once we try to obtain a mutable reference, we get an error:

Read and write locks. Rust enforces these constraints to prevent race condition bugs in multi-thread applications. In fact the borrowing mechanism via references is very similar to read locks and write locks.

To recall, if some data has a read lock, we can acquire as many other reads locks as we want but we cannot acquire a write lock. This way we prevent data inconsistency between multiple reads since we prevent data mutations by not allowing writes. Conversely, if some data has a write lock, we cannot acquire neither read or write locks.

We can see that the regular borrowing implements a read lock while a mutable borrowing implements a write lock.

Dangling references. One potential issue with references is that if we return a reference to some variable that went out of scope.

Luckily, the compiler will prevent this case by making a compile error. It’s now always the case that we cannot return a reference. If the reference we’re returning was sent as argument, it’s still a valid one, for example:

However, if we try to pass two parameters, we’ll run into a compile error:

To see why it’s not possible to guarantee this memory-safe, we can consider the following code calling get_largest:

Here we’re sending references for both vec1 and vec2, and returning back either of them. If vec2 happened to be larger, we’d return it to result and try to access the data after vec2 went out of scope.

However, if result is used while both vec1 and vec2 are in scope, it should be theoretically safe to allow calling get_largest:

In fact, it’s possible and we’ll see how next, but we’ll need to introduce a new terminology.

The lifetime of a variable represents the duration in which the variable is valid. In the example below, the lifetime of variable a is from line 2 to line 7. The lifetimes for b is from 4 to 5, for c is line 5 and for d is line 7. Note that a’s lifetime contains all other lifetimes and b’s lifetime contains c’s lifetime.

We can annotate a function using a syntax similar to generics to parametrize the function arguments by their lifetimes. For example, if we want to include the lifetimes of the arguments in get_largest(), we can do:

This is essentially binding the each variable to the lifetime specified by 'a. Note it’s forcing that both variables have the same lifetime and that the return type has the same lifetime as the input parameters.

Now, if we replace get_largest() with get_largest_with_lifetime(), we won’t get compiler errors. In the example below, result has the same lifetime was the common lifetimes of vec1 and vec2, which is vec2‘s lifetime. This means we’re fine using result with the inner block.

### Conclusion

Rust documentation is very detailed and well written. All the concepts presented in this post are explained in length there. In here I’m presenting them in my own words and different examples (vec instead of string).

I also tried to group topics around memory management, which is different from the documentation. In there “lifetimes” are not covered under ownership and I skipped the generics discussion.

### References

[1] Wikipedia – Rust (programming language)
[2] Rust Documentation: What is Ownership?
[3] Rust Documentation: References & Borrowing

# Bulls and Cows

Bulls and Cows (also known as MOO) is a 2-player game in which one player comes up with a secret and the other has to guess the secret. The secret consists of 4 digits from 0 to 9, where each digit is distinct. Player 2 has to guess these 4 digits, in the right order. At each guess from player 2, player 1 provides feedback as hints. The hints are two numbers: one telling how many digits from the secret player 2 got in the right order (bull), and another telling how many digits they got but in the wrong order (cow).

For example, if player 1 came up with 4271, and player 2 guessed 1234, then bull is 1 (digit 2), and cow is 2 (1 and 4 are in the secret, but not in the order guessed by player 2).

The goal of the game is for player 2 to guess the secret with the least amount of guesses.

Bulls sculptures from Cyprus. Dated from sometime between 2000 and 1600 BC. Photo take at the National Archeology Museum in Athens.

In this post we’ll present computational experiments to solve a game of Bulls and Cows optimally.

#### Objective function

We’ll focus on a search strategy to minimize the maximum number of guesses one has to make, for any possible secret. Alternatively, John Francis’s paper [1] proposes heuristics that minimize the expected number of guesses, that is, for the worst case these heuristics might not be optimal, but they perform better on average.

Because the game has turns, our solution is not a single list of guesses, but rather a decision tree where at each node we branch depending on the hint we get back. The metric we are trying to optimize is then the height of such tree.

To break ties between trees of the same height, we consider the smallest tree (the one with least nodes).

#### Brute-force search

Our search algorithm is recursive. At any given recursion level, we are given a set containing all possible numbers that could be secrets. We’ll try all combinations of guesses and hints and see which ones yield the best solution.

When we simulate a given guess and hint, we are restricting the possible numbers that can still be secret, so in the next level of recursion, the set of potential secrets will be smaller.

For example, say that the possible secrets are any 4-digit number with no repeated digits. We then use one of them as a guess, say, [0, 1, 2, 3]. One possible hint we could receive is (3, 0). What are the possible secrets that would cause us to receive this hint? [0, 1, 2, 4][0, 1, 2, 5] and [7, 1, 2, 3] are a few of those. If we recurse for this guess and hint, the set of secrets in the next level will be restricted to those that would return a hint (3, 0) for [0, 1, 2, 3].

Continuing our example, is [0, 1, 2, 3] the best guess we can make at that point? After recursing for all possible hints, we’ll have a decision tree rooted on [0, 1, 2, 3]. The key idea of the search is that we can minimize the height of the final decision tree by minimizing the subtrees at each recursion level (greedy strategy). Thus, we want to find the guess with the shortest subtree.

In pseudo-code code, the search looks like this:

We start with all possible secrets (4-digit number with no repeated digits) and the tree returned should be optimal in minimizing the number of guesses for the worst case.

#### Rust implementation

I initially implemented the search in Python, but it was taking too long, even when artificially restricting the branching. I re-implemented it in Rust and saw some 20x speedups (caveat: I haven’t really tried to optimize the Python version).

The main difference between the pseudo-code and the actual implementation is that we pre-group possible_secrets by their hint for guess, which is more efficient than scanning possible_secrets for all possible hints:

The function group_possibilities_by_score() above makes use of compute_score and it also uses a fixed-length array for performance. The set of hints is proportional to the squared size N of the guess, in our case N=4.

Turns out that the Rust implementation is still not efficient enough, so we’ll need further optimizations.

#### Optimization – classes of equivalence

What is a good first guess? It doesn’t matter! We don’t have any prior information about the secret and every valid number is equality probable. For example, if we guess [0, 1, 2, 3] or [5, 1, 8, 7], the height of the decision tree will be the same. An intuitive way to see why this is the case is that we could relabel the digits such that [5, 1, 8, 7] would map to [0, 1, 2, 3].

Francis [1] generalizes this idea for other cases. Say that at a given point we made guesses covering the digits 0, 6, 7, 8 and 9 at least once. Now say we our next guess is [0, 8, 9, 3]. In here, 3 is the only digit we haven’t tried yet, but using the re-labeling argument, we can see that [0, 8, 9, 1] would yield the same decision tree if we were to swap the labels of 1 and 3. This allow us to skip guesses that belong to the same class, which reduces the branch factor.

We can generate an ID representing a given class. A way to do this is by adding one to each digit we have tried before and  making any digit we haven’t as 0, then converting that number from base 11 to base 10. For example, [0, 8, 9, 1] becomes [1, 9, 10, 0]. If this was a number in base 11, in base 10 it is 2530 ((((1*11) + 9)*11 + 10)*11 + 0). If we do the same with  [0, 8, 9, 3], we’ll get the same number. The code below implements this idea.

In our case, D = 10 and we store the set of visited digits in a bitset, visited_bits (that is, bit i is 1 if digit i has been visited.

On the search side, we keep a set of classes ids already visited and skip a guess if its class is already in there.

With this optimization the search algorithm runs in under 2 minutes. By inspecting the height of the resulting tree we conclude that the minimum number of guesses necessary for any secret is 7.

The complete code is available on Github.

#### Visualizing

The JSON output by the search algorithm is quite big (the smallest tree with height 7 has almost 7000 nodes). A more interesting way to inspect the data is to create a Bulls and Cows solver. We feed the JSON to this application and ask the user to select the outcome based on the secret they have in mind. We are basically traversing the edges of the decision tree.

I’ve uploaded the application to my personal website and the source code is available on Github.

### Conclusion

I learned about this game pretty recently and was curious to learn of good strategies to solve this problem. From what I’ve been reading, the heuristics that yield good solutions are very complicated for a human to perform. This leads to an question: is there are any solution which is simple to follow but that is reasonably good?

One natural extension for this problem is to use larger numbers of digits (N > 4) and have each element be sourced from 0 to D – 1. An exhaustive search might be prohibitive, but maybe we can come up with heuristics with constant guarantees. What is a lower bound for the number of guesses for variants with arbitrary N and D?

I struggled to implement this code in Rust, but I found the challenge worthwhile. I learned a bit more about its specifics, especially regarding memory management and data types.

In [1], the author mentions that with optimizations the search took 45 minutes to run on their laptop (2GHz Intel Core 2 Duo). I ran mine on a Intel i7 2.2GHz and was surprised by the running time of 2 minutes. CPUs are not getting exponentially faster these days and my code runs on a single thread.

### References

[1] Strategies for playing MOO or “Bulls and Cows”.
[2] Github – Blog Examples: Bulls and Cows

In this post we’ll study the hyper log log algorithm and provide an implementation in Rust.

### Introduction

Philippe Flajolet was a French computer scientist at INRIA [5]. He introduced the field of Analytic Combinatorics and is known for the creation of a family of algorithms of probabilistic counting, including the HyperLogLog.

HyperLogLog is a probabilistic algorithm to determine the number of distinct elements (or cardinality) from a multi-set (a set that allows repeated values) with high accuracy using very low memory. It is then suitable for streaming applications or distributed databases in which we do not have the luxury of keeping all distinct values in memory at any time.

### The HyperLogLog

The intuition behind the HyperLogLog is the following. Imagine we have a basket full of balls, each containing a number, which can be repeated. We want to estimate how many distinct numbers we see in the basket with the limitation that we can only look at one ball at a time and we do not have paper and pen, so we have to rely on memory.

The number of balls in the basket can be very large, so it’s unrealistic to keep track of all possible values we see, and we need a proxy to estimate the number of distinct values. The trick is to find some property that only very rare numbers satisfy, and the intuition is that if such property was satisfied by any of the numbers we saw, then there’s a good chance we saw a lot of numbers (or we’re very lucky).

To start, say that the property is “multiple of 2”. That is not a rare property, so if we assume the numbers in the balls do not have any special pattern, on average, we just need to find 2 different numbers to see that property being satisfied. Now say the property is “multiple of 4”. Less numbers satisfy this property, so the average number of distinct values we need to look at would be 4. We can keep going on for higher and higher powers of 2, and the average number of distinct values we look at would need to be as big.

This means that if we keep track of the largest power of 2 that divides any number we drew, we could estimate that that was the number of distinct values! The power of 2 property is interesting because it doesn’t change if there are duplicated values (which is great because we only want distinct values) and it doesn’t rely on the magnitude of the values as long as the set of values are uniformly distributed.

This estimate is not great though because the error is proportional to the number of distinct values found. For example, if we found out that the largest power of 2 divider was 2048 (2^11), the actual number of distinct values could be between 2048 and 4095 (2^12 – 1). Also, we could get unlucky and have all the balls in basket be have a single number, say 1024, so the estimate would be off by a lot.

Averaging

To reduce the chances of drawing a ball with a value that turns out to be a large power of 2 and throwing off the estimates, we divide the balls into m groups. The important property of the groups is that balls with the same values should go to the same group and the distribution should be uniform.

We can then find the largest power of 2 divider for each group and average them out. This would help in the case where we saw a single number 1024, because while one group would estimate it to be 1024, all the other groups would be find that the largest divisor was 1 (empty group).

To assign each number to a group, we could use modular arithmetic. For example, if we had 12 groups, we use the remainder of a number by 12 as an index to the group. The problem is that the elements assigned to a given group have a property that will bias the result. For example, for the group corresponding to the reminder 0, all numbers in there are obviously a multiple of 12 and hence 2^2. To make the numbers in the group unbiased, we need to discard the information used to assign them to groups. An easy way to achieve that is to represent each number as binary, use the first bits to decide which group to assign it to, and then discard those bits when computing the power of 2.

We can see that we can increase the number of groups to reduce errors but the tradeoff is that it requires to keep more information in memory. This is the idea first proposed by [2] and was called stochastic averaging.

Harmonic Mean

The HyperLogLog algorithm uses a different combination of the powers of two to obtain a better estimate than using averages: the harmonic mean [1]. To recall, the harmonic mean consists in averaging the reverse of the elements and then reversing the result. For example, the harmonic mean of 1, 4 and 4 is

The bulk of the HyperLogLog paper [1] is actually proving that this metric yields an estimate with smaller errors than a simple average.

Hashing

Not all real world input will be numbers that are distributed uniformly. For example, we might be interested in counting the number of distinct values of a string column in a database. We then need to transform these values using a hash function which maps these inputs to numbers uniformly.

### Algorithm

We are now ready to outline the core of the algorithm to estimate the cardinality of a multi-set of values.

Given a stream of n elements:

• For each element
• Hash it
• Use the first bits of the hash to determine to which group j to assign it to
• Keep track of the largest power of 2 that divides some value in the group, and store the exponent in M[j]

Finally, estimate the number of distinct values based on a harmonic mean of the values in M.

The Expected Value and the Alpha Correction

The expectation of the number of distinct values is given by [1]:

E is equivalent to the harmonic mean times $\alpha_m m$, and the constant $\alpha_m$ is a constant to correct some bias. For implementation purposes, the authors provide approximations to this constant:

Small and Big Range Corrections

When the estimated number of distinct elements is relatively small compared to the number of groups m, the author propose a correction.

Let $V$ be the number of groups to which no element was assigned to. If $V > 0$, then experiments show that for $E \le \frac{5}{2} m$, we can change the estimate to

Conversely, if the number of distinct elements is very high, closer to 2^32, then the probability of hash collision are very high. The correction accounts for that:

Implementation in Rust

We’ll now present an implementation using Rust. The first part of the algorithm consists in computing M, which we name more clearly as first_non_zero_by_bucket. The following code implements the pseudo-code described above:

We don’t need to know much Rust to read the code above. It’s worth mentioning that Rust is very strict about types, so we need to perform explicit conversions. Also we use 2 bit operations: one is to obtain the least significant k bits of an integer by using a bit mask. It relies on the fact that (2^k)-1 is a number with k bits 1 and doing a bitwise AND with any number has the effect of only extracting the first k bits of that number. The other trick is to divide a number by 2^k, which can be done by shifting the bits to the right, via the >> operator.

The hash function we use here is from the package farmhash, which is a Rust implementation of Google’s Farmhash, which in turn is a variant of Murmurhash [6]. It basically takes a string and shuffles its bits in a hopefully uniform way, generating a 32-bit integer:

first_non_zero_bit_position() is given by:

The formula to obtain the expected number of distinct values is given by

The code below implements this function:

The values of alpha were discussed in the The Expected Value and the Alpha Correction above.

The correction for small and large ranges can be implemented as:

The complete code is available on Github.

### Experiments

For values of b ranging from 4 to 16, I ran the program 100 times for n=100k, with numbers randomly selected from 1 to 100k. Then I plotted the results using the box plot chart using a R script:

In the chart above, the x-axis represents the number of experiments we divided the input into, and the y-axis represents the relative error compared to the actual value. We can see that as we increase the number of experiments, the errors go down.

This chart was also helpful in finding a bug in my implementation: the initial plot had a sudden spike for values larger than 11, which was when the small range correction applied. After some debugging, I realized that algorithm should you the natural logarithm, not log2. It was great to spot the bug via data visualization!

### Conclusion

I’ve been wanting to study the HyperLogLog algorithm for a while, and also one of my resolutions for this year is to learn Rust. It was a productive exercise to implement it.

I’ll post some future impression on Rust from someone with experience in C++ in a future post.

### References

[1] HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
[2] Probabilistic Counting Algorithms for Database Applications
[3] Thu Trang Pham – Curiosity #2: How does Prestodb implement approx_distinct?
[4] Probabilistic Data Structures for Web Analytics and Data Mining
[5] Gödel’s Lost Letter and P=NP: Philippe Flajolet 1948–2011
[6] GitHub: seiflotfy/rust-farmhash