Eulerian Circuits


Leonhard Euler was a Swiss mathematician in the 18th century. His paper on a problem known as the Seven Bridges of Königsberg is regarded as the first in the history of graph theory.

The story goes that in the city of Königsberg, in Prussia, there were seven bridges connecting different land masses along the Pregel River (see Figure 1). The challenge was to find a path through the city that crossed each bridge exactly once. Euler showed that no such path existed.

Interesting unrelated fact: today Königsberg is called Kaliningrad and belongs to Russia, but it is geographically separated from the rest of the country, lying between Lithuania and Poland.


Figure 1: Map of Königsberg and the seven bridges. Source: Wikipedia

The solution to the Seven Bridges of Königsberg problem eventually led to a branch of mathematics known as graph theory. In this post we’ll talk about the theoretical framework that can be used to solve problems like the Seven Bridges of Königsberg, known as Eulerian circuits.

We’ll provide a general definition of the problem, discuss a solution and implementation, and finally present some extensions and variations of the problem.

Definition

Let G(V, E) be a connected undirected graph, where V is the set of vertices and E the set of edges, and where (v, w) denotes an edge between vertices v and w. The Eulerian circuit problem consists in finding a circuit that traverses every edge of this graph exactly once, or deciding that no such circuit exists.

An Eulerian graph is a graph for which an Eulerian circuit exists.

Solution

We’ll first focus on the problem of deciding whether a connected graph has an Eulerian circuit. We claim that an Eulerian circuit exists if and only if every vertex in the graph has an even degree, that is, an even number of incident edges.

We can see this is a necessary condition. Let v be a vertex with an odd number of incident edges. Any circuit traversing all edges has to pass through v. Moreover, on each visit it must use one edge to “enter” v and a different edge to “leave” v, since a circuit can traverse each edge at most once. Thus every visit consumes 2 of v’s edges, and if v has an odd number of edges, one of them will be left unvisited.

To show this condition is sufficient, we can provide an algorithm that always finds an Eulerian circuit in a graph satisfying it. Start from any vertex v and keep traversing edges, deleting them from the graph afterwards. We can’t get stuck on any vertex other than v, because whenever we enter a vertex there must be an unused edge through which to leave it, since every vertex has an even number of edges. Thus eventually we’ll come back to v, and this path forms a circuit.

This circuit doesn’t necessarily cover all the edges in the graph, though, nor is it guaranteed that the remaining edges form circuits starting from v. It must be the case, however, that some vertex w in the circuit we just found still has unused edges, and another circuit starts from it. We can repeat the search for every such vertex and we’ll always find another sub-circuit (this is a recursive procedure, and we might find sub-sub-circuits). Note that after we remove the edges of a circuit, the resulting graph might be disconnected, but each individual component is still Eulerian.

Once we have all the circuits, we can assemble them into a single circuit by starting from v. Whenever we encounter a vertex w that has a sub-circuit, we take a “detour” through that sub-circuit, which leads us back to w, and then we continue on the main circuit.

Implementation

We’ll use the algorithm first described by Hierholzer to efficiently solve the Eulerian circuit problem, based on the proof sketched in the previous section.

The basic idea is that given a graph and a starting vertex v, we traverse edges until we find a circuit. As we’re traversing the edges, we delete them from the graph.

Once we have the circuit, we traverse it once more to look for any vertices that still have edges, which means these vertices have sub-circuits. For each of these vertices we merge the sub-circuit into the main one. Assume the main circuit is given by a list of vertices (v, p_2, ..., p_{k-1}, w, p_{k+1}, ..., p_{n-1}, v) and w is a vertex with a sub-circuit. Let (w, q_1, ..., q_{m-1}, w) be the sub-circuit starting from w. We can construct a new circuit (v, p_2, ..., p_{k-1}, w, q_1, ..., q_{m-1}, w, p_{k+1}, ..., p_{n-1}, v).

Let’s look at a specific implementation using JavaScript (with Flow). The core of the algorithm implements the ideas discussed above.
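
Below is a minimal sketch of the idea in plain JavaScript. It reuses the names referenced in the analysis that follows (find_circuit(), getNextEdgeForVertex(), deleteEdge(), getDegree()), but it’s a simplified reconstruction rather than the code from the repository: it assumes the input graph is connected and Eulerian, and it assembles the final circuit with plain arrays instead of the linked-list path discussed in the analysis.

class EulerGraph {
  // edges: list of [v, w] pairs representing undirected edges.
  constructor(edges) {
    this.edges = edges.map(([v, w]) => ({ v, w, deleted: false }));
    this.adjacency = new Map();   // vertex -> list of incident edge indices
    this.degree = new Map();      // vertex -> current (non-lazy) degree
    this.nextEdgePos = new Map(); // vertex -> position of the last visited edge
    this.edges.forEach((edge, i) => {
      for (const u of [edge.v, edge.w]) {
        if (!this.adjacency.has(u)) {
          this.adjacency.set(u, []);
          this.degree.set(u, 0);
          this.nextEdgePos.set(u, 0);
        }
        this.adjacency.get(u).push(i);
        this.degree.set(u, this.degree.get(u) + 1);
      }
    });
  }

  getDegree(vertex) {
    return this.degree.get(vertex) || 0;
  }

  // Returns an undeleted edge incident to vertex, or null. The per-vertex
  // pointer only moves forward, so across the whole execution each adjacency
  // entry is skipped only a constant number of times.
  getNextEdgeForVertex(vertex) {
    const list = this.adjacency.get(vertex) || [];
    let pos = this.nextEdgePos.get(vertex) || 0;
    while (pos < list.length && this.edges[list[pos]].deleted) {
      pos++;
    }
    this.nextEdgePos.set(vertex, pos);
    return pos < list.length ? this.edges[list[pos]] : null;
  }

  // Lazy deletion: O(1). getNextEdgeForVertex() skips deleted edges later,
  // while the degree counters are updated eagerly.
  deleteEdge(edge) {
    edge.deleted = true;
    this.degree.set(edge.v, this.degree.get(edge.v) - 1);
    this.degree.set(edge.w, this.degree.get(edge.w) - 1);
  }
}

// Hierholzer's algorithm: returns an Eulerian circuit as a list of vertices
// starting and ending at startVertex (assuming the graph is Eulerian).
function find_circuit(graph, startVertex) {
  // Walk from startVertex until we come back to it, deleting edges as we go.
  const circuit = [startVertex];
  let vertex = startVertex;
  while (true) {
    const edge = graph.getNextEdgeForVertex(vertex);
    if (edge === null) break; // no unused edge left at this vertex
    graph.deleteEdge(edge);
    vertex = edge.v === vertex ? edge.w : edge.v;
    circuit.push(vertex);
    if (vertex === startVertex) break;
  }
  // Traverse the circuit once more, splicing in a sub-circuit at every vertex
  // that still has unused edges.
  const result = [];
  for (const v of circuit) {
    result.push(v);
    if (graph.getDegree(v) > 0) {
      const subCircuit = find_circuit(graph, v);
      // subCircuit starts and ends at v; skip its first vertex so that v is
      // not repeated, then continue on the main circuit.
      result.push(...subCircuit.slice(1));
    }
  }
  return result;
}

// Example: two triangles sharing vertex 1.
const graph = new EulerGraph([[1, 2], [2, 3], [3, 1], [1, 4], [4, 5], [5, 1]]);
console.log(find_circuit(graph, 1)); // an Eulerian circuit, e.g. [1, 4, 5, 1, 2, 3, 1]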

The complete code is on GitHub.

Analysis

We’ll now show that the algorithm described above runs in time linear in the number of edges, i.e. O(|E|).

Note that find_circuit() is a recursive function, but we claim that the number of times the while() loop executes across all function calls is bounded by the number of edges. The key is in the function:

graph.getNextEdgeForVertex(vertex);

graph is a convenience abstraction over an adjacency list, where for each vertex we keep a pointer to the last edge visited. Because of this, getNextEdgeForVertex() will visit each edge of the graph at most once and never “go back”. Since the graph object is shared across all function calls (it’s effectively global), the number of calls to getNextEdgeForVertex() is bounded by O(|E|), and so is the number of times all while() loops execute.

Now we just need to prove that every other operation in the while loop is O(1). The only non-obvious one is:

graph.deleteEdge(edge);

This is a lazy deletion, meaning that we just set a flag on the edge marking it as deleted, which is later taken into account by callers like graph.getNextEdgeForVertex() and graph.getDegree(). Hence, this is an O(1) operation.

In getNextEdgeForVertex(), we must skip edges that have been deleted, so we might need to iterate over a few edges before we find an undeleted one (or none, if the graph is not Eulerian, in which case we terminate the algorithm). Since the per-vertex pointer only moves forward, the total work across all calls to getNextEdgeForVertex() is still bounded by O(|E|).

In order for getDegree() to be an O(1) operation, we need to keep a non-lazy count of the degree of each vertex, but we can update it in O(1) when deleting an edge.

Finally, let’s analyze the second loop. The number of iterations is proportional to the length of the circuit. Since all the circuits found (including the ones found recursively) are edge-disjoint, the total number of times we loop over the vertices of circuits (across all function calls) is also bounded by the number of edges.

We already saw getDegree() is O(1) even with lazy deletion. The remaining operation is

path.insertAtVertex(vertex, subPath);

If we store the path as a linked list of vertices, inserting subPath at a given vertex can be done in O(1), as long as we keep a reference from each vertex to one of its occurrences (the last one, say) in the path.
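
As a sketch of how such a linked list might look (hypothetical names; the actual Path class in the repository may differ), insertAtVertex() below splices a sub-circuit with a constant number of pointer updates. Note that the vertices inside subPath don’t need map entries in the main path, because by the time a sub-circuit is spliced in, its vertices have no unused edges left.

class PathNode {
  constructor(vertex) {
    this.vertex = vertex;
    this.next = null;
  }
}

class Path {
  constructor(vertices) {
    this.head = null;
    this.tail = null;
    this.nodeByVertex = new Map(); // vertex -> one of its nodes in the list
    for (const v of vertices) this.append(v);
  }

  append(vertex) {
    const node = new PathNode(vertex);
    if (this.tail === null) this.head = node;
    else this.tail.next = node;
    this.tail = node;
    this.nodeByVertex.set(vertex, node);
  }

  // Splices subPath, a circuit starting and ending at vertex, right after an
  // occurrence of vertex in this path. All pointer updates are O(1).
  insertAtVertex(vertex, subPath) {
    const node = this.nodeByVertex.get(vertex);
    const first = subPath.head.next; // skip the leading vertex to avoid repeating it
    if (first === null) return;      // empty sub-circuit, nothing to splice
    subPath.tail.next = node.next;
    node.next = first;
    if (node === this.tail) this.tail = subPath.tail;
  }

  toArray() {
    const result = [];
    for (let node = this.head; node !== null; node = node.next) result.push(node.vertex);
    return result;
  }
}

const path = new Path(['v', 'a', 'w', 'b', 'v']);
path.insertAtVertex('w', new Path(['w', 'x', 'y', 'w']));
console.log(path.toArray()); // ['v', 'a', 'w', 'x', 'y', 'w', 'b', 'v']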

Directed Graphs

We can extend the definition of Eulerian graphs to directed graphs. Let G(V, A) be a strongly connected directed graph, where V is the set of vertices and A the set of directed edges, and where (v, w) denotes a directed edge from v to w. The Eulerian circuit problem for a directed graph consists in finding a directed circuit that traverses every edge of this graph exactly once, or deciding that no such circuit exists.

It’s possible to show that such a circuit exists if and only if the strongly connected directed graph has, for each vertex v, the same in-degree and out-degree. The algorithm is essentially the same.

Counting Eulerian Circuits in directed graphs

It’s possible to count the number of different Eulerian circuits in a directed graph. According to the BEST theorem (named after de Bruijn, van Aardenne-Ehrenfest, Smith and Tutte) [3], the number of Eulerian circuits in a directed graph is given by [4]:

ec(G) = t_w(G) \prod_{v \in V}(deg(v) - 1)! (1)

Where deg(v) represents the in-degree (or out-degree) of a vertex v and t_w(G) is the number of arborescences rooted at a vertex w (simply put, an arborescence is the analogue of a spanning tree for a directed graph, except we can only include edges that are directed away from the root).

It’s possible to show that t_w(G) is the same for any vertex w if G is Eulerian. We can compute t_w(G) via the Matrix-Tree theorem [2], which says t_w(G) is equal to the determinant of the Laplacian of G with the row and column corresponding to w removed. Let’s try to understand the idea behind this equation.

The mapping from an arborescence to an Eulerian circuit can be made as follows. Let r be the root of a possible arborescence of G. Now, let r be the reference starting point for an Eulerian circuit in G (note this is just for reference, since a circuit has no starting point).

We say that an Eulerian circuit is associated with a given arborescence if, for each vertex v, the last edge leaving v, say (v, v’), belongs to the arborescence. This is clearer with an example. Consider the digraph from Figure 2. Here we’ll consider the arborescences rooted at A.


Figure 2: Directed Graph

This graph has 2 possible arborescences, depicted on the left in Figures 3 and 4. In Figure 3, we can see that the edge (B, D) has to be visited before (B, C), because (B, C) is in the arborescence.


Figure 3: One of the arborescences of G and a corresponding Eulerian circuit

Now, in Figure 4, because it’s (B, D) that’s in the arborescence, it has to be visited after we visit (B, C).


Figure 4: Another arborescence of G and a corresponding Eulerian circuit

Note that there can be more than one Eulerian circuit associated with a given arborescence. If B had more out-edges, we’d have multiple choices, since the arborescence only specifies the last edge to be taken, not the intermediate ones. More specifically, imagine B had k out-edges. Then we could traverse the first k-1 of them in any order, which leads to a total of (k - 1)! ways of doing so.

The same applies to all other vertices. Due to properties of Eulerian circuits, the choice of out-edge at a given vertex can be seen as independent of the choices at other vertices, so the total number of Eulerian circuits corresponding to any given arborescence is the product of factorials from equation (1), namely:

\prod_{v \in V}(deg(v) - 1)! (2)

The key property of categorizing Eulerian circuits into arborescence classes is that the classes are disjoint, that is, an Eulerian circuit corresponds to exactly one arborescence. This, in conjunction with the fact that the vertex degrees in equation (2) are from the original graph, and hence independent of the arborescence, leads us to the two independent factors in equation (1).
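
To make equation (1) concrete, here is a sketch (in plain JavaScript, with illustrative names) that computes ec(G) for a small Eulerian digraph: it builds the Laplacian, obtains t_w(G) via the Matrix-Tree theorem as the determinant of the Laplacian with one row and column removed, and multiplies by the factorials of the degrees. It assumes the input is connected and Eulerian and uses floating-point Gaussian elimination, which is fine for small examples.

function countEulerianCircuits(vertices, edges) {
  // edges: list of [from, to] pairs.
  const n = vertices.length;
  const index = new Map(vertices.map((v, i) => [v, i]));

  // Laplacian L = D - A: out-degrees on the diagonal, minus edge multiplicities.
  const L = Array.from({ length: n }, () => new Array(n).fill(0));
  const outDeg = new Array(n).fill(0);
  for (const [from, to] of edges) {
    outDeg[index.get(from)]++;
    L[index.get(from)][index.get(to)] -= 1;
  }
  for (let i = 0; i < n; i++) L[i][i] += outDeg[i];

  // t_w(G): determinant of L with row/column 0 removed (any w gives the same
  // value when G is Eulerian). Plain Gaussian elimination with pivoting.
  const M = L.slice(1).map(row => row.slice(1));
  let det = 1;
  for (let col = 0; col < M.length; col++) {
    let pivot = col;
    for (let row = col + 1; row < M.length; row++) {
      if (Math.abs(M[row][col]) > Math.abs(M[pivot][col])) pivot = row;
    }
    if (Math.abs(M[pivot][col]) < 1e-9) return 0; // no arborescence at all
    if (pivot !== col) { [M[pivot], M[col]] = [M[col], M[pivot]]; det = -det; }
    det *= M[col][col];
    for (let row = col + 1; row < M.length; row++) {
      const factor = M[row][col] / M[col][col];
      for (let k = col; k < M.length; k++) M[row][k] -= factor * M[col][k];
    }
  }
  const t = Math.round(det);

  // prod over all vertices of (deg(v) - 1)!
  const factorial = k => (k <= 1 ? 1 : k * factorial(k - 1));
  return t * outDeg.reduce((acc, d) => acc * factorial(d - 1), 1);
}

// A directed triangle 1 -> 2 -> 3 -> 1 has a single Eulerian circuit:
console.log(countEulerianCircuits([1, 2, 3], [[1, 2], [2, 3], [3, 1]])); // 1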

Counting Eulerian Circuits in undirected graphs

Counting Eulerian circuits in undirected graphs is a much harder problem. It belongs to a complexity class known as #P-complete. This means that:

  1. It belongs to the #P class, which can informally be seen as the counting version of NP problems. For example, deciding whether a given graph has a Hamiltonian circuit (a circuit that traverses all vertices exactly once) is a problem in the NP class; counting how many Hamiltonian circuits exist in that graph is the corresponding problem in the #P class.
  2. It belongs to the #P-hard class, which means that any problem in #P can be reduced to it via a polynomial-time transformation.

Valiant proved the first condition in [5] while Brightwell and Winkler proved the second in [6] by reducing another #P-complete problem (counting Eulerian orientations) to it.

Note that a problem in the #P class is at least as hard as the corresponding problem in NP, because we can reduce the NP problem to the #P one. For example, we can decide whether a graph has a Hamiltonian circuit (the NP problem) by counting the number of such circuits it has (the #P problem). The answer will be “yes” if the #P version returns a number greater than 0 and “no” otherwise.

Because the problem of counting Eulerian circuits in an undirected graph is #P-complete, we can conclude that there’s no efficient (polynomial-time) algorithm to solve it unless P = NP.

Conclusion

In this post we covered Eulerian circuits in an informal way and provided an implementation in JavaScript. I spent quite some time setting up the JavaScript environment to my taste. I strongly prefer using typed JavaScript (with Flow) and ES6 syntax. I decided to write it in JavaScript with the potential to create a step-by-step interactive tool to demonstrate how the algorithm works.

I was familiar with the concept of Eulerian circuits, but I didn’t remember the algorithms to solve it, even though I had been exposed to one of them in the past. It was a good learning experience to write the code from scratch to really understand what I was doing.

This is the first time I’ve come across the #P complexity class. It’s always nice to learn about new theory when digging further into a specific topic.

References

[1] Bioinformatics Algorithms: An Active Learning Approach – Compeau, P. and Pevzner P.
[2] Matrix-Tree Theorem for Directed Graphs – Margoliash, J.
[3] Circuits and trees in oriented linear graphs – Aardenne-Ehrenfest, van T., Bruijn, de N.G.
[4] Wikipedia – BEST Theorem
[5] The complexity of computing the permanent – L. G. Valiant
[6] Counting Eulerian circuits is #P-complete – Brightwell, G. and Winkler, P.


Two-factor authentication

In this post we’ll talk about some popular security measures to protect user accounts on the web via two-factor authentication. The term refers to the requirement of two methods of authentication for logging into a given account. The first method is almost always a password, and the second is one of the methods we’ll describe in this post.

Why do we need an additional form of authentication?

In an ideal world, people would have strong (long, not necessarily complex) passwords, which would never get stolen and never be forgotten. In the real world, applications have to deal with two scenarios: 1) someone else knows your password or 2) you forgot your password.

Scenario 1: They are not who they claim to be

If someone else knows your password, the system needs to somehow know that this person is not you.

The system can then employ a secondary method of authentication to verify that you are yourself. In theory it could ask for a secondary password or a security question. The problem with these is that they’re exposed to the same set of vulnerabilities that might have compromised the original password in the first place, for example, a password that is too easy to crack or a breach of a database storing plain-text passwords. In addition, since these secondary methods are used only on very rare occasions, it’s very likely you’ll run into the second problem, i.e. forgetting them.

Physical devices. Nowadays, security systems can almost always rely on the fact that even if someone has your password, they do not have your physical belongings (e.g. your cellphone). Some websites let users require both a password and a secondary, device-based authentication to access the account.

Scenario 2: I’m who I claim to be

To address the problem of a user losing a password, some websites offer a recovery mechanism, usually by sending a secure email with a link to reset the password or, in the case of email applications like GMail, allowing the secondary authentication method as an alternative to entering your password.

Websites such as GMail and GitHub also provide a set of auto-generated “master passwords” that you can print and store in a safe place. Once used, these passwords become invalid. This is one of the safest options, but it also requires more effort from the user (printing the list and making sure they can find it when needed).

The ability to recover a password is a necessary usability feature. This comes at a cost, though. As with a chain, a security system is only as strong as its weakest link. If you have a way to recover the password of your online bank account via email, and there’s an alternative authentication method for your email, then your bank account is only as secure as the weakest of: your bank account password, your email password, or the secondary authentication mechanism used by your email.

Scenario 1 deals with security, and Scenario 2 deals with usability (recovering passwords), and these are usually at odds with each other. Security systems have to find the balance between the two.

We’ll now cover three popular secondary authentication mechanisms: SMS (text messages), third-party app authentication and hardware authentication.

SMS


In the SMS (Short Message Service) method, the server generates a short code that is valid for a few minutes and sends it to the user via a text (SMS) message. The user can then copy the code from the phone to the computer and send it to the server, which can then authenticate the initial request.

During this period of time, the user account is technically protected only by a very weak code (a 6-digit number), which is easy to guess by brute force. However, the window is very narrow, which greatly limits the ability of a bad actor to take advantage of it.
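
A sketch of what the server side of this flow might look like in Node.js (the function names and the in-memory store are hypothetical; a real implementation would persist the codes, rate-limit attempts and use a constant-time comparison):

const crypto = require('crypto');

const pendingCodes = new Map(); // userId -> { code, expiresAt }

// Called after the user passes the first factor: generate a 6-digit code,
// remember it with a short expiration and text it to the user.
function startSmsChallenge(userId, sendSms) {
  const code = crypto.randomInt(0, 1000000).toString().padStart(6, '0');
  pendingCodes.set(userId, { code, expiresAt: Date.now() + 5 * 60 * 1000 });
  sendSms(userId, `Your verification code is ${code}`);
}

// Called when the user submits the code they received.
function verifySmsChallenge(userId, submittedCode) {
  const entry = pendingCodes.get(userId);
  if (!entry || Date.now() > entry.expiresAt) return false; // missing or expired
  pendingCodes.delete(userId); // single use
  return entry.code === submittedCode;
}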

Vulnerabilities

The real danger of the SMS method is a bad actor being able to intercept the SMS message that is supposed to go to the user. According to this Wired article [4], telecoms use a network called SS7 (Signaling System No. 7) to transport text messages. This network relies on trust to implement features such as roaming, which enables a person from New York to send and receive text messages while traveling in Berlin. In that case a carrier in Berlin can request the user’s carrier back in New York to hand over the text messages so it can deliver them to the user.

This system has a vulnerability because a hacked carrier could be used to intercept text messages by pretending it’s doing so on behalf of a user, and carriers might not do any checks to verify the authenticity of the request. Hence, if attackers know your email and phone number and have access to a hacked carrier, they could technically hack into your account.

App Authentication

Another authentication method is to install a third-party app that can be used to generate the authentication codes. One popular option is the Google Authenticator app, which you can install on your phone (Android or iOS).

It uses the Time-based One-time Password algorithm, or TOTP [2, 3]. The general idea is to perform a one-time registration between your phone and the server, which consists of having both store a shared secret.

Whenever the client needs to authenticate itself, it uses the current timestamp and the secret to generate a hash, and from this hash it extracts a shorter code (6 digits) that the user copies and sends to the server. The server performs the same operation and, if the generated code matches, it accepts the authentication.

The granularity of the timestamp defines how much time the user has to copy and send the code to the server. For example, the server can define the timestamp granularity to be 30 seconds. This also defines for how long the server is vulnerable to guessing: since the code is short, it is easier to crack via brute force, so the window cannot be too long.
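
A minimal sketch of the TOTP computation in Node.js (HMAC-SHA1 over a 30-second counter, following the algorithm described in [2, 3]; the secret string below is a placeholder, and real systems exchange a Base32-encoded secret during enrollment and use a vetted library):

const crypto = require('crypto');

function totp(secret, timeStepSeconds = 30, digits = 6) {
  // Number of time steps since the Unix epoch, as an 8-byte big-endian counter.
  const counter = Math.floor(Date.now() / 1000 / timeStepSeconds);
  const buffer = Buffer.alloc(8);
  buffer.writeBigUInt64BE(BigInt(counter));

  // HMAC of the counter, keyed with the shared secret.
  const hmac = crypto.createHmac('sha1', secret).update(buffer).digest();

  // Dynamic truncation: the low 4 bits of the last byte give an offset; take
  // 4 bytes from there, clear the sign bit and reduce modulo 10^digits.
  const offset = hmac[hmac.length - 1] & 0x0f;
  const code = (hmac.readUInt32BE(offset) & 0x7fffffff) % 10 ** digits;
  return code.toString().padStart(digits, '0');
}

// Both the phone app and the server run the same computation with the shared
// secret; the server accepts the code if it matches (typically also checking
// the adjacent time steps to tolerate clock drift).
console.log(totp('shared-secret-from-enrollment'));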

Hardware Authentication


A more recent approach to authentication is using a dedicated piece of hardware. The YubiKey is an example of such a device, which can be connected to a USB port. One way it can be used is as part of an open authentication protocol called Universal 2nd Factor (U2F), developed by Google and Yubico (the company that manufactures the YubiKey). We’ll describe this protocol next. In the discussion that follows we’ll refer to the YubiKey device generically as the U2F device.

The general flow consists of an enrollment phase, where the user registers the U2F device with the target website. The website asks for a confirmation, which the user gives by tapping the U2F device; the device then sends some information to the website, which stores it.

The other part is the signing phase. When the website needs to verify the user, say during login, it can ask the user to tap the U2F device, which will send information that the website can validate to make sure it’s the same device that was registered in the first step.

Implementation details

One of the design goals of this system is to be cross-compatible and require no extra configuration from the user, like installing drivers. To achieve that, the communication between the U2F device and the server is mediated by the browser. That means the website calls a browser API (via JavaScript) which in turn communicates with the U2F device. Henceforth, when we refer to the communication between the U2F device and the server, we’re implicitly assuming it’s done via the browser.

During the enrollment process, the device generates a pair of public and private keys (public-key cryptography). It sends the public key to the server, which stores it together with other information. During the signing phase the server generates a challenge (a random string) and sends it to the U2F device. At this point, the user is asked to tap the U2F device. Once that is done, the device signs the challenge with its private key and sends the signature back to the server. If the server can verify the signature using the stored public key, it can trust the U2F device and authenticate the user.
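
The challenge/response at the heart of the signing phase can be sketched with Node’s crypto module (heavily simplified: the real protocol also involves the application id, the Key Handle described below, a signature counter and a user-presence flag):

const crypto = require('crypto');

// Enrollment: the device generates a key pair and hands the public key to the
// server, which stores it.
const { publicKey, privateKey } = crypto.generateKeyPairSync('ec', {
  namedCurve: 'prime256v1',
});

// Signing phase: the server sends a random challenge...
const challenge = crypto.randomBytes(32);

// ...the device signs it with its private key after the user taps it...
const signature = crypto.sign('sha256', challenge, privateKey);

// ...and the server verifies the signature with the stored public key.
console.log(crypto.verify('sha256', challenge, publicKey, signature)); // true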

The reason a new public/private key pair is generated at every enrollment is privacy (not security). This prevents different websites that enable U2F from sharing data among themselves to track the user. For example, if the same public key were used for all enrollments, sites A and B would be able to identify the user via their public key and share this information between them. If site A is an online shop, it could use this information to show targeted ads on site B.

Stateless U2F. The problem with having to generate a pair of public/private keys every time is that the U2F device now has to store them somehow. Since another important part of the design is for the U2F device to be very accessible, the implication is that it also has to be cheap. Hence, the protocol cannot assume the device has embedded storage. The solution is to send the key pair to the server to store!

This seems to defeat the whole purpose of using cryptography, but this information is sent to the server encrypted in a way that only the U2F device itself can decode. Now, in addition to storing the public key, the server has to store this extra piece of information, which the protocol calls the Key Handle [5]. During the signing phase the server sends not only the challenge, but also the Key Handle.

Man-in-the-middle. One potential security hole could be a scam website that looks like the real one and acts as a man-in-the-middle. First, the user provides the scam site with their username and password. The scam site can then forward these to the real site to trigger the second-factor request, which will send down the Key Handle and the challenge. The scam site forwards them to the U2F device, which signs the challenge; the signature is sent to the scam site, which in turn relays it to the real site, finally allowing the bad actor to log in as the user.

To prevent that, the site origin can be stored in the Key Handle as well. Before sending data back, the U2F device can check the origin of the server and match it against the data in the Key Handle. The site origin is hard to tamper with when using an HTTPS connection, unless the real site’s certificates are compromised.

Vendor reliability. Another component of the security is trust in the manufacturer of the device, which could have malicious intent or a flawed implementation. To address that concern, the U2F device also contains an extra attestation public/private key pair. The attestation serves to prove the identity of the manufacturer/vendor. During enrollment, the public key that is generated is signed with the private attestation key, and the public attestation key is made available by some trusted organization for the server to consult. If the server can verify the attestation signature on the generated public key, it can trust the U2F vendor.

Conclusion

In this post we covered 3 methods of extra protection for online identities. We saw that SMS has serious vulnerabilities, while third-party app and hardware authentication are much safer, which is no surprise since SMS was not initially designed to serve as a secure channel for authentication. No method is 100% secure, but recent authentication mechanisms go to great lengths to reduce the vulnerable surface area to a minimum.

Note how all these methods assume the possession of a physical device separate from the computer you’re trying to log into. Physical devices are much harder to steal than pieces of information like passwords.

References

[1] Information Security – How does Google Authenticator work?
[2] Wikipedia – HMAC-based One-time Password algorithm
[3] Wikipedia – Time-based One-time Password algorithm
[4] Wired – Fixing the cell network flaw that lets hackers drain bank accounts
[5] Google U2F (Gnubby) Documents – Snapshot prior to joining FIDO

Log Structured Merge Trees

In this post we’ll discuss a data structure called the Log Structured Merge Tree, or LSM Tree for short. It provides a good alternative to structures like B+ Trees when the use case is more write-intensive.

According to [1], hardware advances are doing more for read performance than they are for writes. Thus it makes sense to select a write-optimised file structure.

B+ Trees and Append Logs

B+ Trees add structure to data in such a way that the read operation is efficient. They organize the data in a tree structure and perform regular rebalancing to keep the tree height small, so that we never need to look up too many entries to find a record.

If the B+ Tree is stored on disk, updating it requires random access, which is expensive on a spinning disk. Random access is orders of magnitude slower than sequential access on disk. Adam Jacobs [3] describes an experiment where sequential access achieves a throughput of ~50M accesses/second while random access achieves only about 300 (100,000x slower!). SSDs have a smaller gap: ~40M accesses/second for sequential access and 2000 accesses/second for random access.

The other extreme alternative for avoiding disk seeks when writing is to just append content sequentially. We can do this by appending rows to a log file. The problem with this is that the stored data has no structure, so searching for a record requires scanning the entire dataset in the worst case!

The LSM Tree aims to combine the best of both worlds to achieve better write throughput without sacrificing too much read performance. The overall idea is to write to a log file, but once the file gets too large, restructure the data to optimize reads. We can see it as a lazy data structure: data gets restructured in batches.

First we’ll describe the original version of LSM Trees and then an improved version with better performance for real-world applications, used by databases like LevelDB [4].

LSM Trees

Let’s study LSM Trees applied to the implementation of a key-value database. Writes are initially done to an in-memory structure called the memtable, where the keys are kept sorted (random access in RAM is not expensive). Once the table “fills up”, it’s persisted to disk as an immutable (read-only) file.


Figure 1: Inserting new key in memtable

Searching for a key consists of scanning each file; within a file we can keep an index of the keys, so we can quickly find a record. Note that a key might appear in multiple files, representing multiple updates to that key. We scan the files from the most recent one first, because it contains the latest update to the key. The major cost of searching is the linear scan over the files: as our database grows, the number of files becomes too large to scan linearly.


Figure 2: Writing memtable to a file
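
To make the write and read paths concrete, here is a toy sketch (illustrative names; a real implementation writes the flushed tables to disk and keeps a sparse index per file):

class ToyLSM {
  constructor(memtableLimit = 4) {
    this.memtable = new Map(); // in-memory, holds the most recent writes
    this.files = [];           // immutable sorted "files", newest first
    this.memtableLimit = memtableLimit;
  }

  put(key, value) {
    this.memtable.set(key, value);
    if (this.memtable.size >= this.memtableLimit) this.flush();
  }

  // Persist the memtable as an immutable, sorted array of [key, value] pairs.
  flush() {
    const sorted = [...this.memtable.entries()].sort(([a], [b]) =>
      a < b ? -1 : a > b ? 1 : 0
    );
    this.files.unshift(sorted); // newest file first
    this.memtable = new Map();
  }

  // Read path: check the memtable, then scan files from newest to oldest;
  // the first hit is the most recent value for the key.
  get(key) {
    if (this.memtable.has(key)) return this.memtable.get(key);
    for (const file of this.files) {
      let lo = 0, hi = file.length - 1;
      while (lo <= hi) { // binary search within a sorted file
        const mid = (lo + hi) >> 1;
        if (file[mid][0] === key) return file[mid][1];
        if (file[mid][0] < key) lo = mid + 1; else hi = mid - 1;
      }
    }
    return undefined;
  }
}

const db = new ToyLSM();
['a', 'b', 'c', 'd', 'a'].forEach((k, i) => db.put(k, i));
console.log(db.get('a')); // 4, the most recent update wins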

To avoid that, once the number of files grows past a given threshold, we merge every pair of files into a new file using an external merge sort to keep the keys sorted. The linear factor of the search is cut in half, and while the file sizes double, the cost of searching within a file is sublinear, O(log n), so the search becomes roughly twice as fast. This approach is known as tiered compaction [2].
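
A sketch of the pairwise merge step, assuming each “file” is a sorted array of [key, value] pairs and the first argument is the more recent of the two:

// Merge two sorted files; on duplicate keys the newer file's value wins.
function mergeFiles(newerFile, olderFile) {
  const merged = [];
  let i = 0, j = 0;
  while (i < newerFile.length || j < olderFile.length) {
    if (j >= olderFile.length) merged.push(newerFile[i++]);
    else if (i >= newerFile.length) merged.push(olderFile[j++]);
    else if (newerFile[i][0] < olderFile[j][0]) merged.push(newerFile[i++]);
    else if (newerFile[i][0] > olderFile[j][0]) merged.push(olderFile[j++]);
    else { merged.push(newerFile[i++]); j++; } // same key: keep the newer value
  }
  return merged;
}

console.log(mergeFiles([['a', 2], ['c', 5]], [['a', 1], ['b', 3]]));
// [['a', 2], ['b', 3], ['c', 5]]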

The main disadvantage of this method is that once the files get past a certain size, the merge operation starts getting costly. Given m sorted files of size S, the merge operation is O(m S log S). While this compaction happens rather infrequently (roughly every time the database doubles in size), it takes a really long time when it does happen.


Figure 3: Tiered compaction

This resembles the discussion of amortized analysis for data structures [5]. We saw that while amortized complexity may yield efficient average performance of a data structure, there are situations where we cannot afford the worst case scenario, even if it happens very rarely.

LSM with Level Compaction

An alternative approach that works around expensive worst-case scenarios is to keep the file sizes small (under 2MB) and organize them into levels. Excluding the first level, which is special, the sets of keys of any two files at a given level must be disjoint, that is, a given key cannot appear in more than one file at the same level. Each level can contain multiple files, but the total size of the files at a level should be under a limit, and each level is k times larger than the previous one. In LevelDB [4], level L has a 10^L MB size limit (that is, 10MB for level 1, 100MB for level 2, etc.).

Promotion. Whenever a given level reaches its size limit, one of the files at that level is selected to be merged with the next level, or promoted. To keep the disjoint-keys property satisfied, we first identify which files in the next level have keys in common with the file being promoted and then merge all these files together. Instead of outputting a single combined file as in tiered compaction, we output many files of size up to 2MB. During the merge, if we find key collisions, the value from the lower level is more recent, so we can just discard the value from the higher level.


Figure 4: Promotion from Level 0 to Level 1

Details

When merging, to detect which files contain a given key, we can use a Bloom filter for each file. Recall that a Bloom filter allows us to check whether a given key belongs to a set with low memory usage. If it says the key is not in the set, we know that’s correct, while if it says the key is in the set, there’s a chance it’s wrong. So we can quickly check whether a given key might belong to a file with a low memory footprint.
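
A small Bloom filter sketch (the size and hash construction here are arbitrary choices for illustration; it derives the k hash positions from two base hashes via double hashing):

const crypto = require('crypto');

class BloomFilter {
  constructor(bits = 1024, hashes = 4) {
    this.bits = bits;
    this.hashes = hashes;
    this.bitset = new Uint8Array(Math.ceil(bits / 8));
  }

  // Two independent 32-bit values derived from an MD5 digest of the key.
  _baseHashes(key) {
    const digest = crypto.createHash('md5').update(String(key)).digest();
    return [digest.readUInt32BE(0), digest.readUInt32BE(4)];
  }

  _positions(key) {
    const [h1, h2] = this._baseHashes(key);
    const positions = [];
    for (let i = 0; i < this.hashes; i++) positions.push((h1 + i * h2) % this.bits);
    return positions;
  }

  add(key) {
    for (const p of this._positions(key)) this.bitset[p >> 3] |= 1 << (p & 7);
  }

  // false => definitely not in the set; true => probably in the set.
  mightContain(key) {
    return this._positions(key).every(p => (this.bitset[p >> 3] & (1 << (p & 7))) !== 0);
  }
}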

The first level is special because its keys don’t need to be disjoint, so when merging a file from this first level, we also include the other first-level files where its keys are present. This way we guarantee that the most up-to-date value of a key is at the lowest level where the key is found.

To select which file to merge with the next level, we use a round-robin approach: we keep track of which file was merged last and then pick the next one. This makes sure that every file eventually gets promoted.

When outputting files from the merge operation, we might output files smaller than 2MB if we detect that the current output file would overlap with too many files (in LevelDB, 10) in the next level. This avoids having to merge too many files when this file gets promoted in the future.
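
Putting the promotion rules together, a simplified sketch might look like the following (illustrative only: it represents files as non-empty sorted arrays of [key, value] pairs and uses an entry count in place of the 2MB byte limit):

const MAX_ENTRIES_PER_FILE = 4; // stands in for the 2MB limit

// Key ranges overlap if neither file ends before the other starts.
function overlaps(fileA, fileB) {
  const [minA, maxA] = [fileA[0][0], fileA[fileA.length - 1][0]];
  const [minB, maxB] = [fileB[0][0], fileB[fileB.length - 1][0]];
  return minA <= maxB && minB <= maxA;
}

// Promote `file` from level t into `nextLevel`, a list of files with pairwise
// disjoint key ranges. Returns the new contents of the next level.
function promote(file, nextLevel) {
  const untouched = nextLevel.filter(f => !overlaps(f, file));
  const toMerge = nextLevel.filter(f => overlaps(f, file));

  // Merge: entries from `file` (the lower, more recent level) win on collision.
  const merged = new Map();
  for (const source of [...toMerge, file]) {  // older files first...
    for (const [key, value] of source) merged.set(key, value); // ...newer overwrites
  }
  const sorted = [...merged.entries()].sort(([a], [b]) => (a < b ? -1 : 1));

  // Split the output into files no larger than the size limit.
  const output = [];
  for (let i = 0; i < sorted.length; i += MAX_ENTRIES_PER_FILE) {
    output.push(sorted.slice(i, i + MAX_ENTRIES_PER_FILE));
  }
  return [...untouched, ...output];
}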

Cost Analysis

Since file sizes are bounded by 2MB, merging files is a relatively cheap operation. We saw above that a file is limited to overlapping with around 10 files at the next level, so we’ll only have to merge around 11 files of up to 2MB each, and we can easily do the merge sort in memory.

The promotion might also cascade through the next levels: once we promote a file from level t to t+1, level t+1 might overflow, which will require another promotion as well. This will in fact be common, because merging only moves about 2MB worth of data to the next level, so that level will require another promotion the next time it receives a new file from the level below (ignoring the fact that keys get overwritten during merges). Fortunately the number of levels L grows as O(log n) in the size of the data. So for LevelDB, where level L has a 10^L MB limit, even for a disk with 100TB capacity we would still need only about 8 levels.

Reads

The fact that each key belongs to at most one file at each level allows us to keep an index (e.g. a hash table on disk) from keys to files for each level. (This of course excludes the first level, but it has a small number of files, so a linear search is not expensive.)

One interesting property is that each level acts as a sort of write-through cache. Whenever a key gets updated, it’s inserted into a file at the lower levels, and it takes many promotions for it to be placed at a higher level with other files. This means that searching for a key that has been recently updated will require scanning very few levels, or only the smaller indexes, since it will be found at the lower levels.

References

[1] ben stopford – Log Structured Merge Trees
[2] Datastax – Leveled Compaction in Apache Cassandra
[3] ACM Queue – The Pathologies of Big Data
[4] LevelDB – Wiki
[5] NP-Incompleteness – Eliminating Amortization