By Google
  1. General:
    • Externally-consistent distributed transactions
    • It is a database that shards data across multiple sets of Paxos replicas globally
    • Main point of the paper is the TrueTime API
    • Clients automatically failover between replicas
    • Re-sharding and migrations are fully automated as well
    • Main purpose is managing cross-datacenter replicated data
    • Wins over BigTable for workloads that need strong consistency with wide-area replication and complex, evolving schemas
    • Spanner is a temporal multi-version database
    • Data is stored in schematised semi-relational tables
    • Data is versioned, each version is timestamped with its commit time
    • Old versions are garbage-collected
    • Spanner provides general-purpose transactions and an SQL-based query language
  2. Interesting features:
    • Replication configurations can be dynamically controlled at a fine grain by the application
    • Apps can specify constraints to control:
      • Which datacenter has which data
      • How far data is from its users (read latency)
      • How far replicas are from each other (write latency)
      • How many replicas are maintained (durability, availability and read performance)
    • Data can be dynamically and transparently moved between datacenters (resource usage balancing)
    • Spanner additionally provides two features that are difficult to implement in distributed databases:
      • Externally consistent reads and writes
      • Globally-consistent reads across the db at a timestamp
    • These enable consistent backups, consistent MapReduce executions and atomic schema updates, all at global scale and even in the presence of ongoing transactions
    • Both features are enabled by globally-meaningful commit timestamps
    • Timestamps reflect serialization order
  3. TrueTime API feature overview:
    • Exposes clock uncertainty
    • If the uncertainty is large, Spanner slows down to wait it out
  1. Intro:
    • Spanner deployment is called a universe
    • Spanner is organised as a set of zones (each zone is ~= deployment of BigTable servers)
    • Zones == locations across which data can be replicated
    • Zones are dynamic -> can be added and removed from a running system
    • Zones are a unit of physical isolation (one datacenter can have 1 or more zones)
    • Each zone has one zonemaster and between one hundred and several thousand spanservers
    • Zonemaster assigns data to spanservers, spanservers serve the data to clients
    • Per-zone location proxies are used by clients to locate spanservers assigned to serve their data
    • The universe master and placement driver are singletons
    • Universe master -- a console that shows live metrics
    • Placement driver -- handles automatic movement of data across zones (timescale of minutes)
  2. Spanservers:
    • Each spanserver is responsible for 100-1000 instances of tablets
    • Tablet:
      • Similar to BigTable abstraction for a bag of (string, timestamp)->string mappings
      • Not necessarily a single lexicographically contiguous partition of row space
      • A container that may encapsulate many partitions of the row space:
        • Can colocate many directories that are frequently accessed together
    • Spanner assigns timestamps to data rather than to keys (unlike BigTable), which makes it more of a multi-version database than a key-value store
    • Each spanserver implements a single Paxos state machine on top of each tablet
    • Each state machine stores its metadata and log in its tablet
    • Paxos writes are logged twice -> once in the tablet's log and once in the Paxos log
    • Writes must initiate the Paxos protocol at the leader
    • Reads access data directly from the tablet (at any replica that is sufficiently up to date)
    • Collective set of replicas is called a Paxos group
    • At every replica that is a leader, the spanserver implements a lock table for concurrency control (toy sketch after this sub-list):
      • The lock table contains the state of two-phase locking
      • Maps ranges of keys to lock states
      • Operations that require synchronisation acquire a lock in the lock table
      • Other operations bypass the lock table altogether
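      • A toy sketch of such a range lock table is below (illustrative only; the class and method names are made up, and real lock states are richer than a single holder id):
```python
# Toy lock table: maps key ranges to lock state for two-phase locking.
# Spanner keeps one per leader replica; this is only an illustration.
class LockTable:
    def __init__(self):
        self.locks = {}  # (start_key, end_key) -> transaction id holding the lock

    def acquire(self, start_key, end_key, txn_id):
        # Grant the lock only if no range held by another transaction overlaps
        # [start_key, end_key); otherwise the caller must wait or retry.
        for (s, e), holder in self.locks.items():
            if holder != txn_id and s < end_key and start_key < e:
                return False
        self.locks[(start_key, end_key)] = txn_id
        return True

    def release_all(self, txn_id):
        # Drop every range lock held by the finished transaction.
        self.locks = {r: h for r, h in self.locks.items() if h != txn_id}
```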
    • At every replica that is a leader, the spanserver also implements a transaction manager:
      • Supports distributed transactions
      • Used to implement the participant leader (other replicas -> participant slaves)
      • If a transaction involves only one Paxos group -> it can bypass the transaction manager
      • If it involves more than one group -> the leaders coordinate to perform a two-phase commit
    • Two-phase commit:
      • One of the participant groups is chosen as coordinator
      • The participant leader of that group becomes the coordinator leader, same with slaves
    • Directory:
      • Additional abstraction (apart from key-value mappings) that is basically a bucket
      • Set of contiguous keys that share a common prefix
      • Allows apps to control locality of their data by carefully choosing the prefix
      • Unit of data placement -> same replication configuration
      • When data is moved between Paxos groups it is moved directory by directory
      • Smallest unit whose placement can be specified by an application
    • Movedir:
      • Background task used to move directories between Paxos groups
      • Also used to add or remove replicas to/from Paxos groups
      • Not implemented as a single transaction: it registers that a move has started, copies the data in the background, and then uses a transaction to atomically move the nominal amount of data remaining and update the metadata of the two groups
    • Replication:
      • Administrators control the number and types of replicas and their geographic placement, yielding a menu of named options (tags)
      • An application controls how data is replicated by tagging each database and/or directory with a combination of those options
  3. Data Models:
    • Based on schematised semi-relational tables:
      • Chosen largely because of the popularity of Megastore's data model
    • A query language
    • General-purpose transactions
    • Workflow:
      • Application creates one or more databases in the universe
      • Each database can contain an unlimited number of schematised tables
      • Every table is required to have an ordered set of one or more primary-key columns:
        • This lets the application control data locality through its choice of keys (as in BigTable)
  4. TrueTime API:
    • Perks:
      • Represents time as TTinterval
      • TTinterval is an interval with bounded time uncertainty
      • Endpoints of TTinterval are of type TTstamp
      • Methods:
        1. now -- Returns a TTinterval that is guaranteed to contain the absolute time at which the method was invoked; the instantaneous error bound is half of the interval's width
        2. after(t) and before(t) are convenience wrappers around now: whether t has definitely passed / definitely not yet arrived
      • TrueTime guarantees that, for an invocation tt = TT.now(), tt.earliest <= absolute time of the invocation <= tt.latest
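      • A minimal sketch of this interface (the fixed epsilon and the local wall clock are stand-ins for illustration; the real uncertainty comes from the master/slave machinery described next):
```python
# Minimal sketch of a TrueTime-style interface: now() returns a TTinterval
# [earliest, latest] meant to contain absolute time; after()/before() wrap it.
import time
from dataclasses import dataclass

@dataclass
class TTinterval:
    earliest: float  # TTstamp (plain seconds here, for illustration)
    latest: float    # TTstamp

class TrueTime:
    def __init__(self, epsilon_s: float = 0.004):
        self.epsilon_s = epsilon_s  # assumed fixed error bound (stand-in)

    def now(self) -> TTinterval:
        t = time.time()
        return TTinterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, t: float) -> bool:
        return self.now().earliest > t   # t has definitely passed

    def before(self, t: float) -> bool:
        return self.now().latest < t     # t has definitely not arrived yet

TT = TrueTime()
```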
    • Implementation:
      • A set of time masters per datacenter
      • A timeslave daemon per machine
      • Most masters are equipped with GPS receivers; the remaining masters have atomic clocks (a failover independent of GPS failure modes)
      • All masters time references are regularly compared with one another
      • Each master cross-checks the rate of its reference against its own local clock and evicts itself if there is substantial divergence
      • Every slave polls a variety of masters (a mix of nearby GPS masters, farther-away GPS masters, and atomic-clock masters)
      • Slaves use Marzullo's algorithm to detect and reject liars (interval-intersection sketch below)
      • Slaves synchronise local clocks to non-liars
      • Machines whose local clocks exhibit frequency excursions larger than the worst-case bound are evicted
      • Between synchronisations, daemons advertise a slowly increasing time uncertainty based on the worst-case local clock drift
      • Slaves also account for the communication delay to the master when calculating uncertainty
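      • A generic sketch of Marzullo-style interval intersection (not Spanner's daemon code): given intervals advertised by several masters, keep the range agreed on by the largest number of sources, so that outliers ("liars") are rejected:
```python
# Generic Marzullo-style interval intersection (illustrative only).
# Each master advertises an interval (lo, hi) believed to contain true time;
# the result is the narrowest range consistent with the most sources.
def marzullo(intervals):
    events = []
    for lo, hi in intervals:
        events.append((lo, -1))  # interval opens
        events.append((hi, +1))  # interval closes
    # Sort by time; opens sort before closes at the same point, so touching
    # intervals count as overlapping.
    events.sort()
    best, count, best_range = 0, 0, None
    for i, (t, kind) in enumerate(events):
        count -= kind  # +1 on open, -1 on close
        if count > best:
            best = count
            best_range = (t, events[i + 1][0])  # holds until the next event
    return best, best_range

# Three agreeing masters and one "liar": the liar's interval is rejected.
print(marzullo([(10.1, 10.3), (10.2, 10.4), (10.15, 10.35), (12.0, 12.2)]))
# -> (3, (10.2, 10.3))
```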
    • Concurrency Control:
      • Timestamp management:
        1. Three kinds of operations: read-write transactions, read-only transactions, and snapshot reads
        2. Read-only transactions must be predeclared as not having any writes
        3. Reads in a read-only transaction are executed at a timestamp with no locks:
          1. Incoming writes are not blocked
        4. A snapshot read is a read in the past that executes without locking
        5. A client can either specify a timestamp or provide an upper bound on staleness and let Spanner pick the timestamp
        6. When a server fails clients can continue reads on a different server by repeating the timestamp and the current read position
        7. Leader Leases:
          1. Timed leases (10 seconds) to make leaders long-lived
          2. A potential leader sends requests for timed lease votes; once it receives a quorum, it knows it has the lease
          3. The leader's lease ends once it no longer holds a quorum of unexpired lease votes -> no longer a leader (toy sketch below)
          4. A leader can abdicate and release its slaves from their lease votes
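          • A toy sketch of the timed-lease bookkeeping (class and method names are made up; the 10-second constant follows the notes):
```python
# Toy sketch of Paxos leader leases: leadership holds only while a quorum of
# timed lease votes is still unexpired. Bookkeeping is heavily simplified.
import time

LEASE_SECONDS = 10

class LeaderLease:
    def __init__(self, replica_ids):
        self.replica_ids = list(replica_ids)
        self.vote_expiry = {}  # replica id -> time at which its lease vote expires

    def record_votes(self, granted_by):
        # A would-be leader requests lease votes; each grant lasts LEASE_SECONDS.
        now = time.time()
        for r in granted_by:
            self.vote_expiry[r] = now + LEASE_SECONDS

    def is_leader(self):
        # Still the leader only while a majority of lease votes is unexpired.
        now = time.time()
        live = sum(1 for exp in self.vote_expiry.values() if exp > now)
        return live > len(self.replica_ids) // 2

    def abdicate(self):
        # Release the slaves from their lease votes.
        self.vote_expiry.clear()
```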
        8. Assigning timestamps to RW transactions:
          1. Transactional reads and writes use two-phase locking; the commit timestamp can be assigned at any time while all locks are held, before any lock is released
          2. For a given transaction, Spanner uses the timestamp of the Paxos write that represents the transaction commit
          3. Paxos timestamps are assigned in a monotonically increasing order (as the leader can only assign timestamps during its lease and at any given point in time there is at most one leader in a Paxos group)
          4. Commit Wait: The coordinator leader ensures that clients cannot see any data committed by Ti until TT.after(commit_i) is true
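          • A minimal commit-wait sketch, assuming a TrueTime-like object TT with now()/after() as in the earlier interface sketch (apply_writes is a hypothetical callback):
```python
# Commit wait: the commit timestamp s is chosen no smaller than TT.now().latest
# (and above anything this leader assigned before); data only becomes visible
# once TT.after(s) holds, so every later reader's earliest bound exceeds s.
import time

def commit_with_wait(TT, s, apply_writes):
    while not TT.after(s):      # sleep out the clock uncertainty
        time.sleep(0.001)
    apply_writes(s)             # only now may clients observe data at timestamp s
    return s
```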
        9. Serving Reads at a timestamp:
          1. The monotonicity requirement allows Spanner to determine if a replica is up to date
          2. Every replica tracks a value t_safe, the maximum timestamp at which it is up to date; a read at timestamp t can be served if t <= t_safe
          3. t_safe is the min of the Paxos state machine's safe time and the transaction manager's safe time:
            1. Paxos safe time == the timestamp of the highest applied Paxos write
            2. The transaction manager's safe time is subtler: it is infinite at a replica if there are zero prepared (but not yet committed) transactions, i.e. transactions between the two phases of a two-phase commit. Such transactions are in doubt: the replica cannot know whether they will commit. If there are prepared but uncommitted transactions, the transaction manager's safe time is the minimum of their prepare timestamps
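          • A small sketch of this safe-time rule (helper names are hypothetical, timestamps are plain numbers; the paper also subtracts a small constant from the prepare timestamps, omitted here):
```python
# Safe time: t_safe = min(Paxos safe time, transaction-manager safe time).
# A replica may serve a read at timestamp t only if t <= t_safe.
import math

def t_safe(last_applied_paxos_ts, prepared_ts_list):
    # TM safe time is infinite when nothing is prepared-but-uncommitted,
    # otherwise the minimum prepare timestamp among such transactions.
    t_tm = math.inf if not prepared_ts_list else min(prepared_ts_list)
    return min(last_applied_paxos_ts, t_tm)

def can_serve_read(read_ts, last_applied_paxos_ts, prepared_ts_list):
    return read_ts <= t_safe(last_applied_paxos_ts, prepared_ts_list)
```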
        10. Assigning timestamps to read-only transactions -- an R-O transaction executes in two stages:
          1. Assign a timestamp S_read (TT.now().latest)
          2. Execute the transaction's reads as a snapshot at S_read; these can block if the relevant replica's t_safe has not advanced far enough
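          • A short sketch of the two stages, assuming a TrueTime-like TT and a replica object with hypothetical t_safe()/snapshot_read() helpers:
```python
# Read-only transaction in two stages: assign s_read = TT.now().latest, then
# read a snapshot at s_read, blocking until the replica's safe time catches up.
import time

def read_only_txn(TT, replica, keys):
    s_read = TT.now().latest                 # stage 1: pick the timestamp
    while replica.t_safe() < s_read:         # stage 2: wait until up to date
        time.sleep(0.001)
    return replica.snapshot_read(keys, s_read)
```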
      • Details: (the nitty-gritty)
        1. Read-Write Transactions:
          1. Like BigTable, writes in a transaction are buffered at the client until commit
          2. Hence, reads in the same transaction will not see the changes introduced by the writes
          3. It goes like this:
            • The client issues reads to the leader replica of the appropriate group, which acquires read locks and reads the most recent data
            • While the transaction remains open, the client sends keepalive messages to the participant leaders to prevent them from timing out the transaction
            • When all reads are completed and all writes have been buffered, the client begins a two-phase commit
            • It sends a commit message to each participant's leader with the identity of the coordinator and any buffered writes
          4. Commit (non-coordinator):
            • Acquires write locks
            • Chooses a prepare timestamp that is larger than any timestamp it has assigned before (to preserve monotonicity)
            • Logs prepare record through Paxos
            • Notifies coordinator of its prepare timestamp
          5. Commit (coordinator):
            • Acquires write locks
            • Chooses a timestamp for the entire transaction after receiving prepare timestamps from the participants (non-coordinators)
            • Commit timestamp is >= to all prepare timestamps
            • Commit timestamp > TT.now().latest at the time the coordinator received the commit message
            • Commit timestamp > any other timestamp the coordinator had previously assigned
            • Logs a commit record through Paxos (or aborts if it timed out waiting for the participants)
            • Before allowing the changes to be applied, the coordinator waits until TT.after(commit_timestamp) holds (commit wait)
            • It then sends the commit timestamp to the client and all participants, which apply the changes at that timestamp and release their locks
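          • A condensed single-process sketch of this commit path (the Participant/Coordinator classes are made up, a TrueTime-like TT is assumed, and locking plus Paxos logging are reduced to comments):
```python
# Condensed sketch of a Spanner-style two-phase commit with TrueTime timestamps.
import time

class Participant:
    def __init__(self, name):
        self.name = name
        self.last_assigned = 0.0  # highest timestamp this leader has handed out

    def prepare(self, TT):
        # Acquire write locks, pick a prepare timestamp above anything assigned
        # before (monotonicity), log a prepare record through Paxos.
        prepare_ts = max(TT.now().latest, self.last_assigned + 1e-6)
        self.last_assigned = prepare_ts
        return prepare_ts  # sent to the coordinator

class Coordinator(Participant):
    def commit(self, TT, participants):
        # Acquire write locks, then collect prepare timestamps.
        prepare_ts = [p.prepare(TT) for p in participants]
        # Commit timestamp: >= every prepare timestamp, > TT.now().latest when
        # the commit message arrived, and > anything this leader assigned before.
        s = max(prepare_ts + [TT.now().latest, self.last_assigned + 1e-6])
        self.last_assigned = s
        # Log a commit record through Paxos (or abort on timeout) ... then
        # commit wait: nobody may see the data until s is definitely in the past.
        while not TT.after(s):
            time.sleep(0.001)
        # Send s to the client and participants; each applies the buffered
        # writes at timestamp s and releases its locks.
        return s
```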
        2. Read-Only Transactions:
          1. Assigning a timestamp requires a negotiation phase between all of the Paxos groups involved in the read
          2. The notion of scope -> Set of all keys that will be read in the transaction
          3. For standalone queries the scope is automatically inferred
          4. If only one Paxos group serves the read -> the client issues the R-O transaction to that group's (Paxos) leader
          5. That leader assigns S_read and executes the read (it can usually do better than TT.now().latest)
          6. If there are no prepared-but-uncommitted transactions, S_read can simply be the timestamp of the last committed write (LastTS())
          7. If multiple groups are involved, they can either negotiate S_read based on the maximum last-committed-write timestamp across the groups, or simply read at TT.now().latest (which may have to wait for safe time to advance)
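          • A sketch of this timestamp choice (the group objects and last_committed_ts() are hypothetical stand-ins for the paper's LastTS()):
```python
# Choosing s_read for a read-only transaction. One group: its leader can use
# the timestamp of the last committed write. Several groups: negotiate the max
# of those timestamps, or just take TT.now().latest and possibly wait.
def choose_s_read(TT, groups, negotiate=False):
    if len(groups) == 1:
        return groups[0].last_committed_ts()          # often < TT.now().latest
    if negotiate:
        return max(g.last_committed_ts() for g in groups)
    return TT.now().latest                            # simple, may force a wait
```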
        3. Schema-Change Transactions:
          1. Atomic schema changes are supported across the whole database (unlike BigTable, which supports atomic schema changes only within one datacenter and blocks all operations while doing so)
          2. Implemented as a variant of a normal transaction: it is assigned a timestamp in the future, which is registered in its prepare phase
          3. Reads/writes with timestamps before the schema-change timestamp can proceed; those with later timestamps must block behind the schema change
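          • A tiny sketch of the future-timestamp idea (names and the boolean flag are illustrative):
```python
# A registered schema change at a future timestamp splits traffic: operations
# timestamped before it proceed against the old schema, later ones must wait
# until the change has been applied.
def may_proceed(op_ts, schema_change_ts, schema_change_applied):
    if op_ts < schema_change_ts:
        return True                    # old schema is still valid at op_ts
    return schema_change_applied       # later ops block behind the change
```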