Paper Reading – F1: A Distributed SQL Database That Scales

In this post we’ll discuss F1: A Distributed SQL Database That Scales. We’ll provide a brief overview of the paper’s contents and study the system’s architecture and implementation details in more depth. At the end, we provide an appendix covering some distributed systems and database concepts mentioned throughout the paper.

F1 is a database used at Google to serve data for the AdWords product, the system that serves ads embedded in external pages using Google’s infrastructure.

F1 is named after a term from genetics, the F1 hybrid, alluding to the idea that it combines the best aspects of relational databases and NoSQL systems (my initial thought was that it was named after Formula One).

At a high level, F1 is a layer on top of the Spanner database, which we covered in a previous post. It adds features such as a SQL interface (worth noting that Spanner has since evolved to support SQL natively too). Spanner handles a lot of the distributed systems difficulties under the hood and provides certain guarantees, and F1 builds relational database features on top of that.

The paper mentions that AdWords migrated from a sharded MySQL solution, which was very expensive to maintain, especially when the data outgrew an instance and re-sharding was necessary. Now this data redistribution is handled transparently by Spanner, and while F1 has higher latency in theory, in practice the team was able to structure the data in such a way that it performs well (5-10 ms for reads and 50-150 ms for writes).

Architecture

F1 is deployed into multiple servers, across geographically distributed data centers. A client sends requests to a load balancer that then redirects to an F1 server, usually in the closest datacenter.


Architecture of F1: two datacenters, each containing multiple machines running F1 servers, Spanner and CFS instances.

F1 servers are usually in the same Data Center as the Spanner servers that contain their data. F1 can communicate to Spanner servers outside of its datacenter directly, but Spanner only communicates with the Colossus file system (CFS) within the same datacenter because CFS is not globally replicated.

F1 servers are stateless, except while a client is running a pessimistic transaction, which makes them easy to scale since no data movement is required.

In addition to F1 servers, there is an F1 master and a pool of F1 slaves. According to the paper, the master controls which processes are in the pool, and the F1 slaves execute parts of the query plan on behalf of F1 servers, but it’s not clear to me why they need to be a separate component.

Data Model

The table data can be organized in a hierarchical schema, much like in Spanner (see “Data Model” in the post about Spanner). This hierarchical schema is optional, but it’s a key piece in making F1’s performance competitive with the sharded MySQL system AdWords used previously.

In this hierarchy we have a root table and child tables. For each row in the root table, called a root row, the child rows with matching keys are clustered under it.

For example, say we have tables Customer and Campaign. The Customer table has primary key (customerID) and Campaign has (customerID, campaignID). A possible layout for the rows would be:

Figure: Customer and Campaign rows laid out hierarchically, clustered by customerID.

Here, Campaign rows with customerID=3 are clustered together under the corresponding Customer row.
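To make the clustering concrete, here is a minimal Python sketch (not F1’s actual storage format; the table contents are made up) showing that, because a child (Campaign) key starts with its root (Customer) key, simply sorting rows by key places each customer’s campaigns directly under that customer:

```python
# Toy data: Customer keyed by (customerID), Campaign keyed by (customerID, campaignID).
customers = {1: "Acme", 3: "Initech"}
campaigns = {(1, 10): "Spring", (3, 30): "Launch", (3, 31): "Holiday"}

rows = [((cid,), "Customer", name) for cid, name in customers.items()]
rows += [(key, "Campaign", name) for key, name in campaigns.items()]

# Sorting by key yields: Customer(1), Campaign(1,10), Customer(3), Campaign(3,30), Campaign(3,31)
for key, table, name in sorted(rows, key=lambda r: r[0]):
    print(key, table, name)
```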

The paper mentions that a given cluster of child rows falls within the same Spanner directory, which means a transaction touching only that cluster queries a single Spanner server, avoiding the overhead of synchronization between multiple servers (see Read-write transactions in the Spanner post).

Indexes. Tables support two types of indexes: local and global. Local index entries are stored on the same Spanner server as the rows they index, which requires the index key to include the key of the root table. Global indexes have no such restriction, but they can be distributed across many Spanner servers and are very expensive to update within a transaction.
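A small sketch of the difference, with made-up column names (assuming an index on the campaign name): prefixing local index entries with the root key keeps them in the same key range, and hence the same Spanner directory, as the rows they point to, while a global index sorts by the indexed column alone and can end up spread across servers.

```python
# campaigns: (customerID, campaignID) -> campaign name
campaigns = {(3, 30): "Launch", (3, 31): "Holiday", (7, 70): "Launch"}

# Local index: entries prefixed by the root key (customerID), co-located with the customer's rows.
local_index = sorted((cust, name, camp) for (cust, camp), name in campaigns.items())
# Global index: entries keyed by the indexed column alone, potentially spread across servers.
global_index = sorted((name, cust, camp) for (cust, camp), name in campaigns.items())

print(local_index)   # grouped by customerID
print(global_index)  # grouped by campaign name
```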

Schema changes. Updating the schema of a table is challenging because the rows of the table are distributed, which makes schema consistency expensive to achieve, and the system cannot have downtime.

The proposed solution is to break a schema change into smaller steps, designed so that the change is safe as long as no two servers are more than two steps apart.

Writes

Transactions

F1 supports 3 types of transactions: snapshot transactions, pessimistic transactions and optimistic transactions. It relies on the transactions supported by Spanner, so it’s worth re-reading the Implementation Details section of the Spanner post.

Snapshot transactions are read-only transactions and map to Spanner’s read-only transactions. Pessimistic transactions map to Spanner’s read-write transactions.

An optimistic transaction consists of multiple reads followed by a single write. F1 stores an extra column with the last-modified timestamp of each row. When the client performs the reads, it keeps track of the largest timestamp it saw on any row it received and sends this timestamp together with the final write request. F1 then performs a pessimistic transaction to read only the timestamps of the affected rows. If any of them is newer than the timestamp sent by the client, it aborts; otherwise it performs the write.

It’s worth noting that F1 doesn’t open a transaction until the final write request is sent; it assumes there were no writes between the reads and that final write, hence “optimistic”. This is the default transaction type, and the paper describes a few advantages it has over the pessimistic type.
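Here is a minimal sketch of that validate-then-write flow against a toy in-memory store (the data structures are assumptions, not F1’s implementation; in F1 the re-check runs inside a short pessimistic transaction and the timestamp lives in a hidden system column):

```python
import time

# Toy store: row key -> (value, last-modified timestamp).
store = {"row1": ("a", 100), "row2": ("b", 120)}

def optimistic_read(keys):
    """Lock-free reads; remember the largest last-modified timestamp seen."""
    values = {k: store[k][0] for k in keys}
    read_ts = max(store[k][1] for k in keys)
    return values, read_ts

def optimistic_commit(writes, read_ts):
    """Final write: re-check the affected rows' timestamps and abort on conflict."""
    if any(store[k][1] > read_ts for k in writes):
        return False  # some row changed after the reads -> abort
    now = int(time.time() * 1000)
    for k, v in writes.items():
        store[k] = (v, now)
    return True

values, ts = optimistic_read(["row1", "row2"])
print("committed" if optimistic_commit({"row1": values["row1"] + "!"}, ts) else "aborted")
```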

Locks

F1 supports fine-grained locking, including row-level and cell-level locks (a cell being a set of columns of a given row).

Change history

F1 stores the change history of its data. The changes are stored in regular tables called ChangeBatches, which are children of a root table. If a transaction updates multiple root rows, several ChangeBatches are created under the corresponding root tables.

This feature allows clients to subscribe to changes in F1 tables through a publish-subscribe system. Whenever a client receives a notification, it can ask for the incremental changes after a specific timestamp and apply them to the data it has stored locally.
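A hypothetical sketch of the consumer side (the ChangeBatch shape and the notification hook here are my own invention, not F1’s API): keep a local checkpoint timestamp and, on each notification, pull only the batches newer than the checkpoint.

```python
# Invented ChangeBatch shape: a commit timestamp plus the changed (key, value) pairs.
change_batches = [
    {"ts": 100, "changes": {"Campaign(3,30)": "Launch"}},
    {"ts": 150, "changes": {"Campaign(3,31)": "Holiday"}},
]

local_state = {}
checkpoint = 0

def on_notification():
    """Apply, in timestamp order, every batch newer than our checkpoint."""
    global checkpoint
    for batch in sorted(change_batches, key=lambda b: b["ts"]):
        if batch["ts"] > checkpoint:
            local_state.update(batch["changes"])
            checkpoint = batch["ts"]

on_notification()
print(local_state, checkpoint)
```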

Reads

F1 supports both NoSQL and SQL interfaces. The SQL dialect was extended to support Protocol Buffers, which are complex, strongly typed data structures (as opposed to loosely typed formats such as JSON). This extension allows, for example, reading and updating individual fields of such structures.

Local vs. distributed. F1 supports both centralized execution (running on a single F1 node) and distributed execution; the query optimizer decides which one to use. The paper details how a query is executed in the distributed case.

Distributed Query Example

The paper describes an example of a SQL query being mapped to an execution plan. It involves two joins, filters and aggregations:

Figure: the sample SQL query, joining the AdClick, AdGroupCreative and Creative tables, with filters and an aggregation.

This SQL query is parsed and converted into a query plan composed of operators, which are executed by multiple machines. A possible execution plan for the sample query is:

Figure: a possible distributed execution plan for the sample query.

The arrows indicate the data flow, starting at the bottom. The first join is a lookup join (a form of hash join). The operator reads rows from the AdClick table until it has about 100k unique lookup keys stored in a hash table, then performs a single batched lookup against the other table, AdGroupCreative. The buffered AdClick rows are kept in memory for fast matching.

As soon as the join for these keys is complete, the results are streamed to the next stage of the pipeline.
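A rough sketch of this lookup join (the table shapes, the join column name and the batched lookup function are illustrative assumptions, not F1’s implementation): buffer up to ~100k distinct join keys in a hash table, issue one batched read for them, and stream out the joined rows.

```python
from collections import defaultdict

BATCH_KEYS = 100_000

def lookup_join(ad_clicks, batch_lookup):
    """ad_clicks: iterable of dict rows with a 'creative_id' join key (assumed column).
    batch_lookup: function taking a set of keys and returning {key: matching row dict}."""
    buffered = defaultdict(list)  # join key -> buffered AdClick rows
    for click in ad_clicks:
        buffered[click["creative_id"]].append(click)
        if len(buffered) >= BATCH_KEYS:
            yield from _flush(buffered, batch_lookup)
            buffered = defaultdict(list)
    yield from _flush(buffered, batch_lookup)

def _flush(buffered, batch_lookup):
    matches = batch_lookup(set(buffered))          # one batched lookup for all buffered keys
    for key, clicks in buffered.items():
        if key in matches:
            for click in clicks:
                yield {**click, **matches[key]}    # joined row, streamed to the next stage
```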

The second join, with Creative, is a distributed join. It first repartitions the rows from both inputs based on the values of the columns listed in the USING clause. Each partition may end up on a different machine for the next stage, which joins the matching rows.

Finally, the rows are repartitioned again, now by the values of the GROUP BY columns, and aggregators apply the aggregation to the sets of rows within each partition.
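The repartitioning step in both cases can be sketched as hash-partitioning on the relevant columns, so that all rows with equal keys meet on the same machine (the column names and partition count below are made up):

```python
NUM_PARTITIONS = 4

def partition_of(row, columns):
    """Hash the join (or GROUP BY) columns to pick the partition that owns this row."""
    return hash(tuple(row[c] for c in columns)) % NUM_PARTITIONS

def repartition(rows, columns):
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        partitions[partition_of(row, columns)].append(row)
    return partitions  # in F1, each list would be streamed to a different machine

rows = [{"customer_id": 3, "clicks": 10},
        {"customer_id": 7, "clicks": 2},
        {"customer_id": 3, "clicks": 5}]
for p in repartition(rows, ["customer_id"]):
    print(p)  # rows with customer_id=3 land together, so a per-partition SUM(clicks) is correct
```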

Distributed Execution Overview

More generally, the query plan created by F1 is a DAG (directed acyclic graph) where each node is an operator like the join or aggregator described above. Note that multiple machines may run the same operator in parallel.

The paper says:

A technique frequently used by distributed database systems is to take advantage of an explicit co-partitioning of the stored data.

It’s not very clear to me what they mean by that, especially because they don’t cite any references, but from context it suggests that the base operators (the scan and the lookup join) run on the same machine as the data (co-located) and do as much of the processing upfront as possible. This helps minimize data transfer, which can become the bottleneck in a distributed computation. F1 cannot do that because Spanner abstracts the data location away from F1.

A side effect is that there’s a lot of network traffic. The authors claim that Google’s improvements to network switch hardware allow servers to communicate with each other at close to full network speed.

When the in-memory hash tables grow too large, operators spill part of the data to disk. So while F1 doesn’t store data persistently itself, it still needs to write to disk for intermediate operations.

For efficiency, F1 doesn’t write checkpoints to disk. The downside is that query execution is not fault tolerant: a failure in any stage causes the entire query to fail. Retries are done transparently, but long-running queries (over an hour) are effectively bound to fail.

Other features

F1 can expose data from intermediate operators to clients. This avoids having all the data funneled through the final node of the query execution. The paper cites MapReduce jobs as an example consumer of this feature.

Conclusion

In this post we learned about one of Google’s many distributed databases, F1. It’s a loosely coupled layer on top of Spanner that provides a more familiar abstraction: a relational database.

It seems we could draw an analogy between Google’s systems and similar open source solutions: the Colossus File System (CFS) could map to the Hadoop Distributed File System (HDFS), Spanner to Hadoop’s YARN, and F1, providing SQL semantics on top of Spanner, to Hive, which does the same for Hadoop. It’s a very rough comparison, and maybe Spanner is more similar to Spark, but it’s interesting to see the patterns and relationships between these systems.

References

  • [1] F1: A Distributed SQL Database That Scales
  • [2] Spanner: Google’s Globally-Distributed Database

Appendix: Terminology and Concepts

I was unfamiliar with some of the terminology and concepts used throughout the paper, so I had to do some extra research. Here I attempt to explain some of these topics, quoting snippets from the paper.

We also have a lot of experience with eventual consistency systems at Google

Eventual consistency means that a set of servers might not contain the most recent updates yet, but eventually will. For example, a client might issue a write that affects multiple machines. If the system only provides eventual consistency, then a subsequent read is not guaranteed to see the data from those writes.

The quote mentions experience with eventual consistency in a negative light, because of the extra complexity clients have to deal with to work around this limitation.


Paper Reading – Spanner: Google’s Globally-Distributed Database

This is the first post in the category “Paper Reading”, in which I summarize my notes after reading a paper.

In this post we’ll discuss Spanner: Google’s Globally-Distributed Database. We’ll provide a brief overview of the paper’s contents and study the system’s architecture and implementation details in more depth. At the end, we provide an appendix covering some distributed systems and database concepts mentioned throughout the paper.

Paper’s Overview

Spanner is a distributed database used at Google. It provides a SQL API and several features such as reading data from a past timestamp, lock-free read-only transactions and atomic schema changes.

Section 2, Implementation, describes how the data is distributed across machines called spanservers, which contain data structures called tablets, stored as B-tree-like files in a distributed file system called Colossus. It also describes the data model, which allows hierarchies of tables where the parent table’s rows are interleaved with the child tables’ rows.

The paper puts emphasis on the TrueTime API as the linchpin that addresses many of the challenges of real-world distributed systems, especially around clock uncertainty. This is described in Section 3.

Section 4 describes technical details on how to implement the distributed transactions and safe schema changes.

In Section 5, the authors provide benchmarks and use cases, in particular F1, a database built on top of Spanner and used by Google’s AdWords team, to address some scaling limitations of a sharded MySQL database.

Architecture

Spanservers

Figure 1: Spanner stack [1]

As we can see in Figure 1, Spanner is organized into a set of zones, which are the unit of administrative deployment. Even though there might be more than one zone per datacenter, each zone corresponds to a physically disjoint set of servers (a cluster). Within a zone we have the zonemaster, which the paper doesn’t delve into; the location proxies, which serve as an index directing requests to the appropriate spanserver; and the spanservers themselves, the main unit in the system, containing the actual data.

Figure 2: Spanner architecture

A spanserver contains multiple tablets, each of which is a map

(key, timestamp) -> value

Tablets are stored in Colossus, a distributed file system (the successor of the Google File System). Each tablet is replicated, and the set of replicas forms a Paxos group, which we can see in Figure 2. Reads can go directly to any replica that is sufficiently up-to-date, while writes must go through the group’s leader, which is chosen via Paxos. The lock table depicted on top of the leader replica in Figure 2 is used for concurrency control. The transaction manager supports distributed transactions (that is, transactions spanning multiple groups); the leader of each group involved acts as a participant leader, and if the transaction involves more than one Paxos group, a coordinator leader is chosen among the participant leaders.

Within a tablet, keys are organized into directories, which are sets of contiguous keys sharing a common prefix. A directory is the unit of data placement, meaning that data movement happens at the level of directories, not individual keys or rows. Directories can be moved between Paxos groups.
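A minimal sketch of the tablet abstraction, assuming a toy in-memory class (not Spanner’s on-disk format): a versioned map from (key, timestamp) to value, where a read at a timestamp returns the latest value written at or before it.

```python
import bisect

class Tablet:
    """Toy versioned map: key -> sorted list of (timestamp, value)."""
    def __init__(self):
        self.versions = {}

    def write(self, key, timestamp, value):
        bisect.insort(self.versions.setdefault(key, []), (timestamp, value))

    def read(self, key, timestamp):
        """Return the latest value with a timestamp <= the requested one."""
        latest = None
        for ts, value in self.versions.get(key, []):
            if ts <= timestamp:
                latest = value
            else:
                break
        return latest

t = Tablet()
t.write("user/1/album/2", 100, "Summer")
t.write("user/1/album/2", 200, "Summer (edited)")
print(t.read("user/1/album/2", 150))  # -> "Summer": a snapshot read at timestamp 150
```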

Data Model

The paper describes its data model as semi-relational, because tables have characteristics of both relational tables (e.g. MySQL tables) and non-relational ones (e.g. HBase tables). A table looks relational because it has rows, columns and versioned values. It looks like a key-value store because every row must have a unique identifier, which acts as the key, while the rest of the row is the value. The qualifier schematized seems to denote that these tables have well-defined schemas (including column types and primary keys).

In particular, the schema supports Protocol Buffers. This means that besides the native types existing in databases (string, int, etc.), the schema supports the strongly typed structures from Protocol Buffers.

Another feature of this database is that it allows storing related tables with their rows interleaved. This helps keep data that is often queried together physically close (co-located). In the example of Figure 3, the Users and Albums tables are stored interleaved. If we think of this as a use case for the Google Photos application, fetching all albums of a given user should be a pretty common operation.

Figure 3: Interleaved tables

TrueTime API

The main method of the TrueTime API (TT) is:

TT.now(): [earliest, latest]

It returns the current universal time within a confidence interval; in other words, there’s a very strong guarantee that the actual time lies within the returned range. Let \epsilon be half of a given interval’s length and \bar\epsilon the average across all \epsilon’s. According to their experiments, \bar\epsilon varied from 1 to 7 ms during their measurements.

Of course, the strength doesn’t lie in the API itself but in the small confidence interval it can guarantee. They achieve this through a combination of atomic clocks and GPS receivers, which have different, uncorrelated synchronization and failure modes, so even if one of them fails, the other is likely still working properly.
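A hedged sketch of the interface (the epsilon value and helper names are made up for illustration; only the interval semantics come from the paper):

```python
import time
from collections import namedtuple

TTInterval = namedtuple("TTInterval", ["earliest", "latest"])
EPSILON = 0.004  # assumed ~4 ms half-width, within the 1-7 ms range reported in the paper

def tt_now():
    """Return an interval guaranteed (by assumption here) to contain the true time."""
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def tt_after(t):
    """True only when we are certain that time t has already passed."""
    return tt_now().earliest > t
```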

Implementation Details

Spanner has three main types of operations: read-write transaction, read-only transaction and snapshot read at a timestamp.

Read-write transactions

The writes in a read-write transaction (RW for short) are buffered in the client until commit, which means the reads in that transaction do not see the effects of its own writes.

First, the client performs the reads by issuing read requests to the leader of each (Paxos) group that has the data. The leader acquires read locks and reads the most recent data.

Then the client starts the two-phase commit for the writes. It chooses a coordinator group (a representative among the Paxos groups) and notifies the leaders of the other groups, sending the identity of the coordinator along with the write requests. A non-coordinator group’s leader first acquires write locks, then chooses a prepare timestamp greater than all timestamps previously issued by its group (to preserve monotonicity) and sends this timestamp to the coordinator’s leader.

The coordinator’s leader also acquires write locks, receives the prepare timestamps from the other leaders and chooses a timestamp s that is greater than all of them, greater than the TT.now().latest at the time it received the commit request, and greater than any timestamp issued within the coordinator group.

The coordinator’s leader might need to wait until it’s sure s is in the past, that is, until TT.now().earliest > s. After that, the coordinator sends s to all other leaders which commit their transactions with timestamp s and release their write locks.

This choice of timestamp can be shown to guarantee the external consistency property: if the start of a transaction T2 happens after the commit of transaction T1, then the commit timestamp assigned to T2 has to be greater than T1’s.
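Putting the coordinator’s side together in a sketch (reusing the tt_after helper from the TrueTime sketch above; the parameter names and the tiny increment are assumptions, not Spanner’s code):

```python
import time

def coordinator_commit(prepare_timestamps, latest_at_commit_request, group_max_issued, tt_after):
    """Pick a commit timestamp s greater than every participant's prepare timestamp,
    than TT.now().latest observed when the commit request arrived, and than any
    timestamp this group has issued; then commit-wait until s is certainly in the past."""
    s = max(max(prepare_timestamps), latest_at_commit_request, group_max_issued) + 1e-6
    while not tt_after(s):   # commit wait: this is what makes external consistency hold
        time.sleep(0.001)
    return s                 # safe to apply the commit at s and release the write locks
```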

Read-only transactions

If the read involves a single Paxos group, the group’s leader chooses the read timestamp as the timestamp of the latest committed write in the group (note that if there’s a write going on, it holds a lock). If the read involves more than one group, Spanner chooses TT.now().latest as the timestamp, possibly waiting until it’s sure that timestamp is in the past.

Conclusion

In this post we learned about one of the most recent databases from Google, arguably the main pioneer in large-scale distributed systems. Spanner addresses limitations of solutions developed in the past: Bigtable, which lacks a schema and strong consistency; Megastore, which has poor write performance; and sharded MySQL, which required the application to know about the sharding schema and made resharding a risky and lengthy process.

One thing I missed from this paper is whether Spanner can perform more advanced relational operations such as aggregations, subqueries or joins. Performing these in a distributed system usually requires an extra component to store intermediate values, which was not mentioned in the paper.

References

  • [1] Spanner: Google’s Globally-Distributed Database
  • [2] Principles of Computer Systems – Distributed Transactions
  • [3] Radar – Google’s Spanner is all about time
  • [4] Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
  • [5] System level concurrency control for distributed database systems

Appendix: Terminology and concepts

I was unfamiliar with some of the terminology and concepts used throughout the paper, so I had to do some extra research. Here I attempt to explain some of these topics, quoting snippets from the paper.

Spanner is a database that shards data across many sets of Paxos.

Sharding basically means distributing the data across multiple machines. In a database context it commonly means partitioning a table’s rows across multiple machines. Paxos is a protocol used to solve the consensus problem, in which a set of machines (participants) in a distributed environment must agree on a value (consensus). We discussed it in a previous post; it guarantees that at least a quorum (a subset of more than half of the participants) agrees on a value. In “sets of Paxos”, Paxos appears to be shorthand for a group of machines running the Paxos protocol.

Spanner has two features that are difficult to implement in a distributed database: it provides externally consistent reads and writes, and globally-consistent reads across the database at a timestamp.

In the context of transactions in a concurrent environment, we need a total order of the transactions. This is to avoid problems such as the concurrent balance update [2]: imagine we have a bank account with balance B and two concurrent transactions. Transaction 1 (T1) reads the current balance and adds $1; transaction 2 (T2) reads the current balance and subtracts $1. After T1 and T2 are executed (in either order), one would expect the final balance to remain B. However, if the read from T2 happens before the write from T1 and the write from T2 happens after it, T1’s update is lost and the final balance is B-$1. A total ordering guarantees that T1 and T2 are disjoint in time.
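The lost update is easy to reproduce in a few lines of Python (a worked version of the example above, with B = 100):

```python
balance = 100            # B

t1_read = balance        # T1 reads 100
t2_read = balance        # T2 also reads 100, before T1 writes

balance = t1_read + 1    # T1 writes 101
balance = t2_read - 1    # T2 writes 99, overriding T1's update

print(balance)           # 99, instead of the expected 100
```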

To obtain external consistency, the following property must hold: if T1 commits before T2 starts, then T1 comes before T2 in the total order.

Google has datacenters all over the world. By providing globally-consistent reads at a timestamp, Spanner guarantees that, given a timestamp, it will return the same data no matter which datacenter you ask.

Running two-phase commit over Paxos mitigates the availability problems.

Two-phase commit, or 2PC, is a protocol to guarantee the atomicity of distributed transactions: either every participant commits or none does (they all roll back). A coordinator is chosen among the participants; in the first phase it must obtain an agreement (a “yes” vote) from all participants, and only then does it start the second phase, sending a message to each participant telling it whether to commit or roll back. 2PC’s classic weakness is that it blocks if the coordinator fails; in Spanner each participant (including the coordinator) is itself a Paxos-replicated group, which seems to be what mitigates the availability problem the quote refers to.
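A minimal sketch of the two phases (a toy, single-process version; real participants would persist their votes and hold locks between the phases):

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "init"

    def prepare(self):
        """Phase 1: vote yes/no and (conceptually) hold locks until the decision."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):
        """Phase 2: apply the coordinator's decision."""
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    decision = all(votes)                         # commit only if everyone voted yes
    for p in participants:
        p.finish(decision)                        # phase 2: broadcast the decision
    return decision

group = [Participant("A"), Participant("B"), Participant("C", can_commit=False)]
print(two_phase_commit(group))                    # False: one "no" vote makes everyone roll back
print([p.state for p in group])
```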

To support replication, each spanserver implements a single Paxos state machine on top of each tablet

A Paxos state machine refers to the replicated state machine approach to fault tolerance [4]. This seems complicated enough to warrant an entire post, but the basic idea is to keep several replicas, each holding a copy of the same state machine, and to use consensus (Paxos) to make them apply the same operations in the same order, which guarantees correctness and availability under failures.

Reads within read-write transactions use wound-wait to avoid deadlocks

Wound-wait is a locking mechanism used between transactions to prevent deadlocks [5]. More specifically, assume we have transactions T1 and T2 with unique associated timestamps, and T2 currently holds a lock that T1 wants. Let t(T) be the timestamp of transaction T. There are two scenarios: either t(T1) < t(T2) or t(T1) > t(T2). In the first case, T1 is older than T2 and the protocol says T2 should abort, roll back and retry later with the same timestamp; we say T1 wounds T2. In the second case, T1 is younger than T2 and is allowed to wait until the resource becomes available.

A converse approach is the wait-die mechanism, where the older requester waits and the younger one aborts (dies) and retries. The comparison of these methods is explained here. I haven’t researched enough to understand what the tradeoffs between these two approaches are and why Spanner settled on the first.
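A tiny sketch contrasting the two policies for the case where T1 requests a lock that T2 holds (the timestamps are the transactions’ unique start timestamps):

```python
def wound_wait(t1, t2):
    """T1 requests a lock held by T2. Older requester wounds the younger holder."""
    return "T2 is wounded (aborts); T1 proceeds" if t1 < t2 else "T1 waits"

def wait_die(t1, t2):
    """T1 requests a lock held by T2. Older requester waits; younger requester dies."""
    return "T1 waits" if t1 < t2 else "T1 dies (aborts) and retries later"

print(wound_wait(t1=5, t2=9))  # T1 is older -> T2 is wounded
print(wait_die(t1=9, t2=5))    # T1 is younger -> T1 dies
```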