Two-factor authentication

In this post we’ll talk about some popular security measures to protect user accounts on the web via two-factor authentication. The term refers to requiring two methods of authentication to log in to a given account. The first method is almost always a password, and the second is one of the methods we’ll describe in this post.

Why do we need an additional form of authentication?

In an ideal world, people would have strong (long, not necessarily complex) passwords that never got stolen and that they never forgot. In the real world, applications have to deal with two scenarios: 1) someone else knows your password or 2) you forgot your password.

Scenario 1: They are not who they claim to be

If someone else knows your password, the system needs to somehow know that this person is not you.

They can then employ a secondary method of authentication to verify that you are who you claim to be. In theory they could ask for a secondary password or a security question. The problem with these is that they’re exposed to the same set of vulnerabilities that might have compromised the original password in the first place, for example, the password being too easy to crack or a breach of a database storing plain-text passwords. In addition, since these secondary methods are used on very rare occasions, it’s extremely likely you’ll run into the second problem, i.e. forgetting them.

Physical devices. Nowadays, security systems can almost always rely on the fact that even if someone has your password, they do not have your physical belongings (e.g. your cellphone). Some websites allow users to set up the requirement of both a password and a secondary authentication method to access the account.

Scenario 2: I’m who I claim to be

To address the problem of a user losing a password, some websites offer a recovery mechanism, usually by sending a secure email with a link to reset the password or, in the case of email applications like Gmail, by allowing the secondary authentication method as an alternative to entering your password.

Websites such as Gmail and GitHub also have a set of auto-generated “master passwords” that you can print and store in a safe place. Once used, these passwords become invalid. This is one of the safest options, but it also requires more effort from the user (printing the codes and making sure the printed document can be found when needed).

The ability to recover a password is a necessary usability feature. This comes at a cost, though: like a chain, a security system is only as strong as its weakest link. If you can recover the password of your online bank account via email, and there’s an alternative authentication method for your email, then your bank account is only as secure as the weakest of: your bank account password, your email password, or the secondary authentication mechanism used by your email.

Scenario 1 deals with security and Scenario 2 deals with usability (recovering passwords), and these are usually at odds with each other. Security systems have to find a balance between the two.

We’ll now cover three popular secondary authentication mechanisms: SMS (text messages), third-party app authentication and hardware authentication.

SMS

In the SMS (Short Message Service) method, the server generates a short code that is valid for a few minutes and sends it to the user via a text (SMS) message. The user can then copy the code from the phone to the computer and send it to the server, which can then authenticate the initial request.
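
To make this concrete, here’s a minimal sketch of what the server side of this flow might look like in Python. The send_sms helper is a stand-in for a real SMS gateway, and the 5-minute validity window is just an illustrative choice.

```python
import secrets
import time

PENDING = {}  # user_id -> (code, expiration timestamp)

def send_sms(phone_number: str, text: str) -> None:
    """Stand-in for a call to a real SMS gateway."""
    print(f"SMS to {phone_number}: {text}")

def start_sms_challenge(user_id: str, phone_number: str, ttl_seconds: int = 300) -> None:
    code = f"{secrets.randbelow(10**6):06d}"              # random 6-digit code
    PENDING[user_id] = (code, time.time() + ttl_seconds)  # valid only for a few minutes
    send_sms(phone_number, f"Your verification code is {code}")

def verify_sms_code(user_id: str, submitted_code: str) -> bool:
    code, expires_at = PENDING.get(user_id, (None, 0.0))
    if code is None or time.time() > expires_at:
        return False                                      # expired or never issued
    del PENDING[user_id]                                  # one-time use
    return secrets.compare_digest(code, submitted_code)
```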

During this window, the user account is technically protected only by a very weak code (a 6-digit number), which would be easy to crack. However, the window is very narrow, which greatly limits a bad actor’s ability to take any action.

Vulnerabilities

The real danger of the SMS method is a bad actor being able to intercept the SMS message that is supposed to go to the user. According to this Wired article [4], telecoms use a network called SS7 (Signaling System No. 7) to transport text messages. This network relies on trust to implement features such as roaming, which enables a person from New York to send and receive text messages while traveling in Berlin. In this case a carrier in Berlin can ask the user’s carrier back in New York to forward the text messages so it can deliver them to the user.

This system has a vulnerability: a hacked carrier could be used to intercept text messages by pretending it’s requesting them on behalf of a user, and carriers might not do any checks to verify the authenticity of the request. Hence, if an attacker knows your email and phone number and has access to a hacked carrier, they could technically hack into your account.

App Authentication

Another authentication method is to install a third-party app that can be used to generate the authentication codes. One popular option is the Google Authenticator app [1], which you can install on your phone (Android or iOS).

It uses the Time-based One-time Password algorithm, or TOTP [2, 3]. The general idea is to perform a one-time registration between your phone and the server, which consists of having both parties store a shared secret.

Whenever the client needs to authenticate itself, it uses the current timestamp and the secret to generate a hash, and from this hash it extracts a shorter code (typically 6 digits) that the user copies and sends to the server. The server performs the same operation and, if the generated code matches, it accepts the authentication.

The granularity of the timestamp defines how much time the user has to copy and send the code to the server. For example, the server can define the timestamp granularity to be 30 seconds. It also defines how long the server is vulnerable: since the code is short and hence easier to crack via brute force, the window cannot be too long.
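
Here is a minimal sketch of the TOTP computation in Python, following the recipe from [2, 3] (HMAC-SHA1 over a time-step counter, then dynamic truncation). The 30-second step, the 6-digit length and the secret shown are just illustrative defaults.

```python
import hashlib
import hmac
import struct
import time

def totp(secret: bytes, timestamp=None, step=30, digits=6) -> str:
    """Time-based one-time password: HMAC-SHA1 over the current time-step counter."""
    if timestamp is None:
        timestamp = time.time()
    counter = int(timestamp) // step                       # coarse timestamp (30s granularity)
    msg = struct.pack(">Q", counter)                       # 8-byte big-endian counter
    digest = hmac.new(secret, msg, hashlib.sha1).digest()  # hash of shared secret + counter
    offset = digest[-1] & 0x0F                             # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)            # short code the user can copy

# Both the phone and the server run the same computation; the codes match if their clocks agree.
shared_secret = b"stored-once-during-registration"
print(totp(shared_secret))
```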

Hardware Authentication

A more recent approach to authentication is to use a dedicated piece of hardware. The YubiKey is an example of such a device, which can be connected to a USB port. One way it can be used is as part of an open authentication protocol called Universal 2nd Factor (U2F), developed by Google and Yubico (the company that manufactures the YubiKey). We’ll describe this protocol next. In the discussion that follows we’ll refer to the YubiKey device generically as U2F.

The general flow consists of an enrollment phase, where the user registers the U2F with the target webpage. The webpage asks for a confirmation, which the user provides by tapping the U2F; the device then sends some information to the webpage, which stores it.

The other part is the signing phase. When the webpage needs to verify the user, say during login, it can ask the user to tap the U2F, which will send information the webpage can validate to make sure it’s the same device that was registered in the first step.

Implementation details

One of the design goals of this system is to be cross-compatible and to require no extra configuration from the user, like installing drivers. To achieve that, the communication between the U2F and the server is mediated by the browser: the website calls a browser API (via JavaScript) which in turn communicates with the U2F. Henceforth, when we refer to the communication between the U2F and the server, we’re implicitly assuming it’s done via the browser.

During the enrollment process, the device generates a pair of public and private keys (public-key cryptography). It sends the public key to the server, which stores it together with other information. During the signing phase the server generates a challenge (a string) and sends it to the U2F. At this point, the user is asked to tap the U2F. Once that’s done, the device signs the challenge with its private key and sends the signature back to the server. If the server can verify the signature with the stored public key, it can trust the U2F and authenticate the user.
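
As a rough sketch of this exchange (not the actual U2F wire format), here is what enrollment and signing could look like with a generic elliptic-curve key pair, assuming the third-party pyca/cryptography package:

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

# Enrollment: the device generates a key pair and hands only the public key to the server.
device_private_key = ec.generate_private_key(ec.SECP256R1())
server_stored_public_key = device_private_key.public_key()

# Signing phase: the server issues a random challenge...
challenge = os.urandom(32)

# ...the device signs it with its private key (after the user taps it)...
signature = device_private_key.sign(challenge, ec.ECDSA(hashes.SHA256()))

# ...and the server verifies the signature with the stored public key (raises if invalid).
server_stored_public_key.verify(signature, challenge, ec.ECDSA(hashes.SHA256()))
print("device authenticated")
```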

The reason a new public/private key pair is generated at every enrollment is privacy (not security). This prevents different websites that enable U2F from sharing data between them and tracking the user. For example, if the same public key were used for all enrollments, sites A and B would be able to identify the user via their public key and share this information between themselves. If site A is an online shopping site, it could use this information to show targeted ads on site B.

Stateless U2F. The problem with having to generate a public/private key pair at every enrollment is that now the U2F has to store them somehow. Since another important design goal is for the U2F to be widely accessible, it also has to be cheap. Hence, the protocol cannot assume the device has embedded storage. The solution is to send the pair to the server for storage!

This seems to defeat the whole purpose of using cryptography, but this information is sent to the server encrypted, and only the U2F itself can decode it. Now, in addition to the public key, the server has to store this extra piece of information, which the protocol calls the Key Handle [5]. During the signing phase the server sends not only the challenge, but also the Key Handle.
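
Here is a minimal sketch of how a stateless device could wrap its per-site private key into a Key Handle, again assuming the pyca/cryptography package; the layout (12-byte nonce prefix, fixed 32-byte key) is purely illustrative:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

device_master_key = AESGCM.generate_key(bit_length=128)  # burned into the device at manufacture
wrap = AESGCM(device_master_key)

def make_key_handle(site_private_key: bytes, app_origin: bytes) -> bytes:
    """Encrypt the per-site private key (32 bytes here) plus the origin; the server stores the result."""
    nonce = os.urandom(12)
    return nonce + wrap.encrypt(nonce, site_private_key + app_origin, None)

def open_key_handle(key_handle: bytes):
    """Only the device (holder of the master key) can unwrap the Key Handle."""
    nonce, ciphertext = key_handle[:12], key_handle[12:]
    plaintext = wrap.decrypt(nonce, ciphertext, None)  # raises if tampered with
    return plaintext[:32], plaintext[32:]              # (private key, origin)

handle = make_key_handle(os.urandom(32), b"https://example.com")
_, origin = open_key_handle(handle)
print(origin)  # the device can check this against the origin of the requesting site
```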

Man-in-the-middle. One potential security hole is a scam website that looks like the real one and acts as a man-in-the-middle. First, the user provides the scam site with their username and password. The scam site can then forward these to the real site to trigger the second-factor request, which sends down the Key Handle and the challenge. The scam site forwards them to the U2F. The U2F then signs the challenge, the signature is sent to the scam site, which in turn relays it to the real site, finally allowing the bad actor to log in as the user.

To prevent that, the site origin can be stored in the Key Handle as well. Before sending data back, the U2F can check the origin of the request and match it against the data in the Key Handle. The site origin is hard to tamper with when using an HTTPS connection, unless the real site’s certificates are compromised.

Vendor reliability. Another component of the security is trust in the manufacturer of the device, which could have malicious intent or a flawed implementation. To address that concern, the U2F also contains an extra pair of attestation public and private keys. The attestation proves the identity of the manufacturer/vendor. During enrollment, the newly generated public key is signed with the private attestation key. The public attestation key is made available by a trusted organization for the server to consult. If the server is able to verify that signature over the generated public key, it can trust the U2F vendor.
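
Sticking with the same pyca/cryptography primitives, a sketch of what attestation amounts to; how the vendor’s public key is distributed and the serialization format used here are simplified assumptions:

```python
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import ec

# Burned into the device by the vendor; the public half is published by a trusted party.
attestation_private_key = ec.generate_private_key(ec.SECP256R1())
attestation_public_key = attestation_private_key.public_key()

# During enrollment, the freshly generated per-site public key is signed (attested) by the vendor key.
new_site_public_key = ec.generate_private_key(ec.SECP256R1()).public_key()
new_site_key_bytes = new_site_public_key.public_bytes(
    serialization.Encoding.X962, serialization.PublicFormat.UncompressedPoint)
attestation_signature = attestation_private_key.sign(
    new_site_key_bytes, ec.ECDSA(hashes.SHA256()))

# The server checks the attestation against the vendor's published public key (raises if invalid).
attestation_public_key.verify(
    attestation_signature, new_site_key_bytes, ec.ECDSA(hashes.SHA256()))
print("vendor attested the enrollment key")
```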

Conclusion

In this post we covered three methods of extra protection for one’s online identity. We saw that SMS has serious vulnerabilities while third-party app and hardware authentication are much safer, which is no surprise since SMS was not designed to serve as a secure channel for authentication. No method is 100% secure, but recent authentication mechanisms go to great lengths to reduce the vulnerable surface area to a minimum.

Note how all these methods assume the possession of a physical device separate from the computer we’re trying to log into. Physical devices are much harder to steal than pieces of information like passwords.

References

[1] Information Security – How does Google Authenticator work?
[2] Wikipedia – HMAC-based One-time Password algorithm
[3] Wikipedia – Time-based One-time Password algorithm
[4] Wired – Fixing the cell network flaw that lets hackers drain bank accounts
[5] Google U2F (Gnubby) Documents – Snapshot prior to joining FIDO

Log Structured Merge Trees

In this post we’ll discuss a data structure called the Log Structured Merge Tree, or LSM Tree for short. It provides a good alternative to structures like B+ Trees when the use case is more write-intensive.

According to [1], hardware advances are doing more for read performance than they are for writes. Thus it makes sense to select a write-optimised file structure.

B+ Trees and Append Logs

B+ Trees add structure to data in such a way that the read operation is efficient. They organize the data in a tree and perform regular rebalancing to keep the tree height small, so that we never need to look at too many entries to find a record.

If the B+ Tree is stored on disk, updating it requires random access, which is expensive for a spinning disk: random access is orders of magnitude slower than sequential access. Adam Jacobs [3] describes an experiment where sequential access achieves a throughput of ~50M accesses/second while random access achieves only ~300 (100,000x slower!). SSDs have a smaller gap: ~40M accesses/second for sequential access and ~2,000 accesses/second for random access.

The other extreme alternative that avoids disk seeks when writing is to just append content sequentially, for example by appending rows to a log file. The problem with this is that the stored data has no structure, so searching for a record requires scanning the entire dataset in the worst case!

The LSM Tree aims to combine the best of both worlds to achieve better write throughput without sacrificing too much read performance. The overall idea is to write to a log file, but once the file gets too large, restructure the data to optimize reads. We can see it as a lazy data structure: data gets restructured in batches.

First we’ll describe the original version of LSM Trees and then an improved version with better performance for real-world applications, which is used by databases like LevelDB [4].

LSM Trees

Let’s study LSM Trees applied to the implementation of a key-value database. Writes are initially done to an in-memory structure called the memtable, where the keys are kept sorted (random access in RAM is cheap). Once the table “fills up”, it’s persisted to disk as an immutable (read-only) file.

Figure 1: Inserting new key in memtable

Searching for a key consists of scanning each file; within a file we can keep an index of the keys, so we can quickly find a record. Note that a key might appear in multiple files, representing multiple updates to that key. We scan the files from most recent to least recent, because the most recent occurrence contains the last update to the key. The major cost of searching is the linear scan over the files: as our database grows, the number of files will become too large to scan linearly.
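
A toy sketch of the write and read paths just described; real implementations keep the memtable in a balanced tree or skip list, but a plain dict sorted at flush time is enough to illustrate the idea (the flush threshold here is arbitrary):

```python
import bisect

memtable = {}          # in-memory writes
disk_files = []        # immutable "files": (sorted_keys, key -> value dict), newest last
MEMTABLE_LIMIT = 4     # illustrative "fills up" threshold

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:                   # memtable "fills up"
        sorted_keys = sorted(memtable)
        disk_files.append((sorted_keys, dict(memtable)))  # persist as an immutable sorted file
        memtable.clear()

def get(key):
    if key in memtable:                                   # freshest data first
        return memtable[key]
    for sorted_keys, data in reversed(disk_files):        # then scan files, most recent first
        i = bisect.bisect_left(sorted_keys, key)          # per-file index lookup
        if i < len(sorted_keys) and sorted_keys[i] == key:
            return data[key]
    return None

for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 10)]:
    put(k, v)
print(get("a"))  # 10 -- the most recent update wins
```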

Figure 2: Writing memtable to a file

To avoid that, once the number of files grows past a given threshold, we merge every pair of files into a new file, using an external merge sort to keep the keys sorted. The linear factor of the search is cut in half and, while each file doubles in size, the per-file search cost is sublinear, O(log n), so searches become roughly twice as fast. This approach is known as tiered compaction [2].
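
A rough sketch of one such merge step, with files modeled as key-sorted lists of (key, value, sequence) tuples, where the sequence number stands in for recency:

```python
import heapq
from itertools import groupby

def merge_runs(older, newer):
    """Merge two key-sorted runs, keeping only the most recent entry per key."""
    merged = heapq.merge(older, newer, key=lambda e: e[0])  # external merge sort step
    result = []
    for key, entries in groupby(merged, key=lambda e: e[0]):
        result.append(max(entries, key=lambda e: e[2]))     # highest sequence number wins
    return result

run_a = [("apple", "1", 1), ("cherry", "3", 2)]
run_b = [("apple", "9", 5), ("banana", "2", 4)]
print(merge_runs(run_a, run_b))  # [('apple', '9', 5), ('banana', '2', 4), ('cherry', '3', 2)]
```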

The main disadvantage of this method is that once the files get past a certain size, the merge operation starts getting costly. Given m sorted files of size S, the merge operation would be O(m S log S). While this compaction happens rather infrequently (roughly when the database doubles in size), it takes a really long time the one time it does happen.

Figure 3: Tiered compaction

This resembles the discussion of amortized analysis for data structures [5]. We saw that while amortized complexity may yield efficient average performance of a data structure, there are situations where we cannot afford the worst case scenario, even if it happens very rarely.

LSM with Level Compaction

An alternative approach that works around the expensive worst case is to keep the file sizes small (under 2MB) and divide them into levels. Excluding the first level, which is special, the sets of keys in any two files at a given level must be disjoint, that is, a given key cannot appear in more than one file at the same level. Each level can contain multiple files, but the total size of the files should stay under a limit, with each level being k times larger than the previous one. In LevelDB [4], level L has a 10^L MB size limit (that is, 10MB for level 1, 100MB for level 2, etc.).

Promotion. Whenever a given level reaches its size limit, one of its files is selected to be merged with the next level, or promoted. To keep the disjoint-keys property satisfied, we first identify which files in the next level have keys overlapping with the file being promoted and then merge all these files together. Instead of outputting a single combined file as in tiered compaction, we output many files of size up to 2MB. During the merge, if we find key collisions, the key from the lower level is more recent, so we can discard the key from the higher level.

Figure 4: Promotion from Level 0 to Level 1
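
A simplified sketch of this promotion step; files are modeled as plain dicts with a key range, and the size limit is expressed in number of keys rather than 2MB to keep the example small:

```python
def overlaps(a, b):
    """True if two files' key ranges intersect."""
    return not (a["max_key"] < b["min_key"] or a["min_key"] > b["max_key"])

def promote(file, next_level, max_keys_per_file=4):
    """Merge `file` with the overlapping files of the next level, splitting the output."""
    to_merge = [file] + [f for f in next_level if overlaps(file, f)]
    merged = {}
    for f in to_merge:                        # `file` comes first: the lower level wins collisions
        for k, v in f["data"].items():
            merged.setdefault(k, v)
    keys = sorted(merged)
    chunks = [keys[i:i + max_keys_per_file] for i in range(0, len(keys), max_keys_per_file)]
    return [{"min_key": c[0], "max_key": c[-1], "data": {k: merged[k] for k in c}}
            for c in chunks]
```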

Details

When merging, to detect which files contain a given key, we can use a Bloom filter for each file. Recall that a Bloom filter allows us to check whether a given key belongs to a set using very little memory. If it says the key is not in the set, we know that’s correct; if it says the key is in the set, there’s a chance it’s wrong. So we can quickly check whether a given key might belong to a file with a low memory footprint.
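
A toy Bloom filter sketch; the hash construction and sizes here are illustrative, not what LevelDB actually uses:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str):
        # Derive k independent bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

per_file_filter = BloomFilter()
per_file_filter.add("user:42")
print(per_file_filter.might_contain("user:42"))  # True
print(per_file_filter.might_contain("user:99"))  # False (with high probability)
```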

The first level is special because its keys don’t need to be disjoint across files, but when merging a file from this first level, we also include the other first-level files that contain any of its keys. This way we guarantee that the most up-to-date version of a key is at the lowest level where it is found.

To select which file to merge with the next level we use a round-robin approach: we keep track of which file was merged last and pick the next one. This ensures that every file eventually gets promoted.

When outputting files from the merge operation, we might output files smaller than 2MB if we detect that the current output file would overlap with too many files (in LevelDB, 10) in the next level. This avoids having to merge too many files when this file gets promoted in the future.

Cost Analysis

Since file sizes are bounded by 2MB, merging files is a relatively cheap operation. We saw above that a file is limited to overlap with at most 10 files at the next level, so we’ll only have to merge around 11 files, for a total of roughly 22MB of data, and we can easily do the merge sort in memory.

The promotion might also cascade through the next levels: once we promote a file from level t to t+1, level t+1 might overflow, requiring another promotion as well. This will in fact be common, because merging only moves about 2MB worth of data to the next level, so a level will likely require another promotion the next time it receives a new file from the level below (ignoring the fact that keys get overwritten during the merge). Fortunately the number of levels L grows as O(log n) in the size of the data. So for LevelDB, where level L has a 10^L MB limit, even for a disk with 100TB capacity we would still need only about 8 levels.
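
A quick sanity check of that estimate, assuming the 10^L MB per-level limit (the last level dominates the total capacity):

```python
def levels_needed(total_mb: int, growth: int = 10) -> int:
    """Smallest L such that level L's capacity (growth**L MB) covers the data."""
    level, capacity = 1, growth
    while capacity < total_mb:
        level += 1
        capacity *= growth
    return level

print(levels_needed(100 * 10**6))  # 100TB is roughly 10^8 MB -> 8 levels
```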

Reads

The fact that each key belongs to at most one file at each level allows us to keep an index (e.g. a hash table on disk) from keys to files for each level. (This excludes the first level, but it has a small number of files, so a linear search is not expensive.)

One interesting property is that each level acts as a sort of write-through cache. Whenever a key gets updated, it’s inserted into a file at the lower levels, and it takes many promotions for it to be placed at a higher level with other files. This means that searching for a key that has been recently updated requires scanning only a few levels, or only the smaller indexes, since the key will be found at the lower levels.

References

[1] ben stopford – Log Structured Merge Trees
[2] Datastax – Leveled Compaction in Apache Cassandra
[3] ACM Queue – The Pathologies of Big Data
[4] LevelDB – Wiki
[5] NP-Incompleteness – Eliminating Amortization