By Google
  1. What is the problem that this paper is trying to solve?
    • Create a scalable, distributed storage system that can serve as a data repository for Google's other products
  2. Main idea
    • Create storage as a key-value storage
    • Data is indexed by a row key, a column key, and a timestamp
    • Moving away from a traditional RDBMS
    • More flexible as a storage layer -- control over data layout and format shifts to the end product/user
    • Advantage over an RDBMS ~> lookups by key are simple and fast
    • Each cell in BigTable can also contain multiple versions of the same data, indexed by timestamp (see the sketch below)
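
A minimal sketch of the data model as a sparse map from (row key, column key, timestamp) to an uninterpreted string, with multiple timestamped versions per cell. The class and names are my own illustration (the real map is also kept sorted by row key, which is elided here):

```python
from collections import defaultdict


class Cell:
    """One (row key, column key) cell holding multiple timestamped versions."""

    def __init__(self):
        self.versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        self.versions[timestamp] = value

    def latest(self):
        # Bigtable returns versions newest-first; here we just pick the newest.
        ts = max(self.versions)
        return ts, self.versions[ts]


# The map: row key -> column key -> Cell (sparse; absent cells cost nothing).
table = defaultdict(lambda: defaultdict(Cell))

# The Webtable example from the paper: rows are keyed by reversed URL.
table["com.cnn.www"]["contents:"].put(3, "<html>...v3...")
table["com.cnn.www"]["contents:"].put(5, "<html>...v5...")
table["com.cnn.www"]["anchor:cnnsi.com"].put(9, "CNN")

print(table["com.cnn.www"]["contents:"].latest())  # (5, '<html>...v5...')
```
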
  3. Implementation:
    • Uses Chubby (highly available distributed lock system at Google)
    • When Chubby is unavailable, so is BigTable (measurements showed that although the coupling is tight, the resulting unavailability is negligible in practice)
    • Library linked to every client:
      • Caches tablet locations
      • If the location of a tablet is unknown, it moves up the tablet location hierarchy
      • If the client's cache is empty, the location algorithm requires three network round trips, including one read from Chubby
      • If the client's cache is stale, the location algorithm can take up to six round trips, since stale entries are only discovered upon misses (see the lookup sketch after the Tablet location section)
      • Prefetches tablet location to reduce access time
    • Master server (single):
      • Responsible for:
        • Assigning tablets to tablet servers
        • Detecting the addition and expiration of tablet servers
        • Balancing tablet server load
        • Garbage collection on the GFS
        • Handles schema changes, e.g. table and column family creation
    • Tablet servers (many):
      • Can be dynamically added or removed
      • Responsible for:
        • Manages a set of tablets (typically between ten and a thousand per tablet server)
        • Handles read/write requests to the tablets that it has loaded
        • Splits tablets that have grown too large
    • Client data does not move through the master, clients communicate directly with the tablet servers
    • Clients do not rely on the master for tablet location information, hence master is lightly loaded in practice
    • A cluster stores a number of tables; each table consists of a set of tablets, and each tablet contains all data associated with a row range
    • As a table grows, it is automatically split into multiple tablets, each approx. 100-200 MB in size by default (see the sketch below)
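
A rough sketch, in my own simplified terms rather than Bigtable's code, of how a table is partitioned into tablets by contiguous row range and how a tablet that grows past the size threshold is split:

```python
from bisect import bisect_right

SPLIT_THRESHOLD = 200 * 1024 * 1024   # paper: tablets default to ~100-200 MB


class Tablet:
    """Holds all data for a contiguous row range [start_row, end_row)."""

    def __init__(self, start_row, end_row):
        self.start_row, self.end_row = start_row, end_row
        self.rows = {}            # row key -> encoded row data
        self.size = 0

    def put(self, row, data):
        self.size += len(data) - len(self.rows.get(row, ""))
        self.rows[row] = data

    def split(self):
        """Split at the median row key into two smaller tablets."""
        keys = sorted(self.rows)
        mid = keys[len(keys) // 2]
        left, right = Tablet(self.start_row, mid), Tablet(mid, self.end_row)
        for k, v in self.rows.items():
            (left if k < mid else right).put(k, v)
        return left, right


class Table:
    """A table is an ordered list of tablets covering the whole row key space."""

    def __init__(self):
        self.tablets = [Tablet("", "\uffff")]   # one tablet covers all rows at first

    def tablet_for(self, row):
        starts = [t.start_row for t in self.tablets]
        return self.tablets[bisect_right(starts, row) - 1]

    def put(self, row, data):
        t = self.tablet_for(row)
        t.put(row, data)
        if t.size > SPLIT_THRESHOLD:            # grown too large: split it
            i = self.tablets.index(t)
            self.tablets[i:i + 1] = list(t.split())


table = Table()
table.put("com.cnn.www", "x" * 1024)
print(table.tablet_for("com.cnn.www").start_row)   # ''
```
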
  1. Tablet location:
    • File stored in Chubby that contains the location of the root tablet
    • The root tablet contains the locations of all tablets of a special METADATA table
    • Each METADATA tablet contains the location of a set of user tablets
    • The root tablet is treated as the first tablet of the METADATA table, but it is never split (to ensure that the tablet location hierarchy never has more than three levels)
    • See Implementation for details
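
A sketch of the three-level location lookup with client-side caching, matching the round-trip counts noted under Implementation. chubby_read, read_row, and the METADATA row keys are invented stand-ins for the real RPCs and key encoding:

```python
def chubby_read(path):
    """Pretend read of the Chubby file holding the root tablet location."""
    return "tabletserver-7:root"                         # 1 round trip


def read_row(tablet_location, row_key):
    """Pretend read of one METADATA row at the given tablet server."""
    metadata = {
        ("tabletserver-7:root", "user_table,zzz"): "tabletserver-3:meta1",
        ("tabletserver-3:meta1", "user_table,row42"): "tabletserver-9:user5",
    }
    return metadata[(tablet_location, row_key)]          # 1 round trip each


cache = {}   # (table, row) -> tablet location, possibly stale


def locate(table, row):
    """Resolve the tablet serving (table, row); fill the cache on the way."""
    key = (table, row)
    if key in cache:
        return cache[key]                # 0 round trips on a fresh cache hit
    root = chubby_read("/ls/bigtable/root-tablet")       # level 1: Chubby file
    meta = read_row(root, f"{table},zzz")                # level 2: root tablet
    user = read_row(meta, f"{table},{row}")              # level 3: METADATA tablet
    cache[key] = user                    # 3 round trips on an empty cache
    return user


print(locate("user_table", "row42"))     # tabletserver-9:user5
```
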
  2. Tablet assignment:
    • Each tablet is assigned to one tablet server at a time
    • Master keeps track of the live tablet servers (including which tablets are unassigned)
    • When a tablet is unassigned and a tablet server with sufficient room is available, the master sends a tablet load request to that server (thus assigning the tablet to it)
    • Chubby keeps track of tablet servers:
      • When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory
      • The master monitors the directory to discover tablet servers
      • A tablet server stops serving its tablets if it loses its lock (e.g. a network partition causes it to lose its Chubby session)
      • The tablet server then attempts to reacquire its lock (as long as the file still exists)
      • If the file does not exist, the tablet server can never become operational, so it kills itself
      • Whenever a tablet server terminates (e.g. because its machine has been removed from the cluster), it attempts to release its lock so the master can reassign its tablets more quickly
    • The master detects when a tablet server is no longer serving its tablets (a sketch of this check follows this section):
      • Master asks each tablet server for the status of its lock
      • If the server reports that it has lost its lock, or if the master was unable to reach it during its last several attempts, the master tries to acquire that server's lock in Chubby
        • If the master is able to get the lock -> the server is dead:
          • Its Chubby lock file is deleted (so the server can never serve again)
          • The master moves all the tablets assigned to the now dead server into the set of unassigned tablets
      • The master kills itself if its Chubby session expires (master failure does not change the assignment of tablets to servers)
    • Master startup:
      1. Grab the unique master lock in Chubby (prevents concurrent master instantiations)
      2. Scan the servers directory in Chubby to discover live servers
      3. Communicate with every live tablet server to discover which tablets are assigned to it
      4. Add the root tablet to the set of unassigned tablets if it was not discovered in step 3
      5. Scan the METADATA table to learn the set of tablets (adding any tablet that is not already assigned to the set of unassigned tablets)
    • The master is able to track all such changes because it initiates all of them except tablet splits (see below)
    • Tablet split:
      • Initiated by the tablet server
      • The tablet server commits the split by recording the new tablet's information in the METADATA table
      • Once the split is committed, the tablet server notifies the master; if the notification is lost, the master detects the split when it later asks a tablet server to load the tablet that was split
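
A sketch of the failure-detection protocol above. FakeChubby and FakeTabletServer are invented stand-ins; only the shape of the protocol (lock status query, lock steal, file deletion, tablet reassignment) follows the paper:

```python
class FakeChubby:
    def __init__(self):
        self.locks = {}                    # file path -> current holder

    def try_acquire(self, path, who):
        if self.locks.get(path) in (None, who):
            self.locks[path] = who
            return True
        return False

    def release(self, path):
        self.locks.pop(path, None)

    def delete(self, path):
        self.locks.pop(path, None)


class FakeTabletServer:
    def __init__(self, name, alive=True):
        self.name, self.alive = name, alive

    def lock_file(self):
        return f"/ls/bigtable/servers/{self.name}"

    def report_lock_status(self, chubby):
        if not self.alive:
            raise ConnectionError("unreachable")
        return "held" if chubby.locks.get(self.lock_file()) == self.name else "lost"


def master_check(chubby, server, assignments, unassigned):
    """Probe one tablet server; reclaim its tablets if it can no longer serve."""
    try:
        if server.report_lock_status(chubby) == "held":
            return                                     # server is healthy
    except ConnectionError:
        pass                                           # unreachable: probe its lock

    # If the master can grab the server's lock, Chubby is fine and the server
    # is dead (or partitioned). Delete the file so it can never serve again.
    if chubby.try_acquire(server.lock_file(), "master"):
        chubby.delete(server.lock_file())
        unassigned.extend(assignments.pop(server.name, []))


chubby = FakeChubby()
ts = FakeTabletServer("ts1", alive=False)
chubby.try_acquire(ts.lock_file(), "ts1")              # server held its lock...
chubby.release(ts.lock_file())                         # ...then its session expired
assignments, unassigned = {"ts1": ["tablet-a", "tablet-b"]}, []
master_check(chubby, ts, assignments, unassigned)
print(unassigned)                                      # ['tablet-a', 'tablet-b']
```
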
  3. Tablet serving:
    • Persistent state of a tablet is stored on the GFS
    • Updates are committed to a commit log that stores redo records
    • Recently committed changes are stored in memory, in a sorted buffer (memtable)
    • Older updates are stored in a sequence of SSTables
    • Recovering a tablet:
      • Tablet server reads the list of SSTables comprising a tablet (from METADATA)
      • Tablet server reads a set of redo points from commit logs
      • The server then reads the indices of the SSTables into memory and applies all of the updates that have committed since the redo points
    • Write operations:
      • Tablet server checks that the request is well-formed and that the sender is authorised
      • Authorisation is performed by reading a list of permitted writers from Chubby
      • A valid mutation is written to the commit log (Small mutations are aggregated into a group commit)
      • After the write is committed, its contents are inserted into the memtable
    • Read operations:
      • Well-formed / authenticated
      • Read is executed on a merged view of the sequence of SSTables and the memtable
      • Since both the SSTables and the memtable are lexicographically sorted, the merged view can be formed efficiently (see the sketch below)
    • Incoming reads and writes can continue while tablets are split and merged
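
A toy model of the serving path: writes go to the commit log and then the memtable; reads and scans see a merged view of the memtable and the SSTables, formed efficiently because every source is sorted. These are simplified structures of my own, not Bigtable's:

```python
import heapq


class Tablet:
    """Simplified serving state: commit log + memtable + immutable SSTables."""

    def __init__(self):
        self.commit_log = []    # redo records (persisted on GFS in reality)
        self.memtable = {}      # key -> (seqno, value): recent writes
        self.sstables = []      # older writes, each a dict key -> (seqno, value)
        self.seqno = 0

    def write(self, key, value):
        self.seqno += 1
        self.commit_log.append((self.seqno, key, value))   # commit to the log first
        self.memtable[key] = (self.seqno, value)           # then apply to memtable

    def read(self, key):
        # A read sees the merged view: the newest version across all sources wins.
        found = [src[key] for src in [self.memtable, *self.sstables] if key in src]
        return max(found)[1] if found else None

    def scan(self):
        # Every source is sorted by key, so the merged view is formed lazily;
        # for each key only the entry with the highest sequence number survives.
        sources = [sorted(src.items()) for src in [self.memtable, *self.sstables]]
        last_key = best = None
        for key, versioned in heapq.merge(*sources):
            if key != last_key:
                if last_key is not None:
                    yield last_key, best[1]
                last_key, best = key, versioned
            elif versioned > best:
                best = versioned
        if last_key is not None:
            yield last_key, best[1]


t = Tablet()
t.sstables.append({"a": (1, "old-a"), "b": (2, "old-b")})   # older data on "GFS"
t.seqno = 2                  # resume numbering past what is already persisted
t.write("a", "new-a")
print(t.read("a"), t.read("b"))      # new-a old-b
print(list(t.scan()))                # [('a', 'new-a'), ('b', 'old-b')]
```
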
  4. Compactions:
    • When the memtable size reaches a threshold (writes fill it up):
      • Memtable is frozen
      • New memtable is created
      • Old (frozen) memtable is converted into an SSTable and written to GFS
      • This shrinks the memory usage of a tablet server
      • This reduces the amount of data that has to be read from the commit log during recovery (if server dies)
      • This does not block incoming reads and writes
      • The process is called minor compaction
    • Every minor compaction creates a new SSTable, so reads would eventually have to merge an arbitrary number of them; the solution is:
      • merging compaction
      • Converts the contents of a few SSTables (and the memtable) into exactly one SSTable
      • The input SSTables and the memtable can be discarded as soon as the compaction is done
    • A merging compaction that rewrites all SSTables into one is called major compaction:
      • SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data still present in older SSTables that are still live
      • Major compaction produces a single SSTable, hence, no need for deletion entries
      • Major compaction is applied regularly
      • Allows reclaiming the resources used by deleted data
      • Ensures that deleted data disappears from the system in a timely fashion
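
A toy sketch of the three compaction kinds on a dict-based "tablet" (memtable + SSTable list + commit log). The tombstone value stands in for deletion entries; the layout is my own, only the behaviour follows the notes above:

```python
TOMBSTONE = "<deleted>"     # stands in for Bigtable's deletion entries


def new_tablet():
    return {"memtable": {}, "sstables": [], "commit_log": []}


def minor_compaction(t):
    """Freeze the memtable, write it out as a new SSTable, start a fresh memtable."""
    t["sstables"].append(dict(t["memtable"]))   # frozen memtable becomes an SSTable "on GFS"
    t["memtable"] = {}                           # new memtable for incoming writes
    t["commit_log"].clear()                      # redo point advances: less to replay


def merging_compaction(t, how_many):
    """Merge a few SSTables plus the memtable into exactly one SSTable."""
    inputs = t["sstables"][:how_many] + [t["memtable"]]
    merged = {}
    for src in inputs:
        for key, versioned in src.items():
            if key not in merged or versioned > merged[key]:
                merged[key] = versioned          # newest (seqno, value) wins
    t["sstables"] = t["sstables"][how_many:] + [merged]
    t["memtable"] = {}


def major_compaction(t):
    """Merge everything into one SSTable and drop deleted data for good."""
    merging_compaction(t, len(t["sstables"]))
    t["sstables"][0] = {k: v for k, v in t["sstables"][0].items()
                        if v[1] != TOMBSTONE}


t = new_tablet()
t["memtable"] = {"a": (1, "x"), "b": (2, TOMBSTONE)}
minor_compaction(t)
t["memtable"] = {"a": (3, "y")}
major_compaction(t)
print(t["sstables"])      # [{'a': (3, 'y')}]
```
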
  5. Refinements (optimisations):
    • Locality groups:
      • Clients can group multiple column families into a locality group
      • A separate SSTable is generated for each locality group in each tablet
      • Segregating column families that are not read together allows for faster reads
      • Useful tuning parameters can be specified per locality group (e.g. declaring a locality group to be in-memory, so its SSTables are lazily loaded into tablet server memory)
    • Compression:
      • Clients can control whether the SSTables for locality groups are compressed
      • Clients can control which compression format is used
      • Compression is applied separately to each SSTable block, so small portions of an SSTable can be read without decompressing the entire file
      • Many clients use a two-pass custom compression scheme: a first pass compresses long common strings across a large window, and a second, fast pass looks for repetitions in a small window
    • Caching for read performance:
      • Two levels of caching
      • Scan Cache:
        • Higher-level
        • Caches the key-value pairs returned by the SSTable interface to the tablet server code
        • Most useful for applications that tend to read the same data repeatedly
      • Block Cache:
        • Lower-level
        • Caches SSTable blocks that were read from the GFS
        • Benefits applications that exploit spatial locality in their data (e.g. sequential reads); see the cache sketch below
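
A minimal runnable stand-in for the two cache levels, using functools.lru_cache for both; FAKE_GFS, the block layout, and the function names are invented for illustration:

```python
from functools import lru_cache

FAKE_GFS = {("sst-1", 0): {"a": "1", "b": "2"}, ("sst-1", 1): {"c": "3"}}


@lru_cache(maxsize=4096)
def read_block(sstable, block_index):
    """Block cache: SSTable blocks read from GFS (helps sequential scans)."""
    print("GFS read", sstable, block_index)
    return tuple(sorted(FAKE_GFS[(sstable, block_index)].items()))


@lru_cache(maxsize=100_000)
def read_cell(sstable, block_index, key):
    """Scan cache: key-value pairs returned by the SSTable interface
    (helps applications that re-read the same data repeatedly)."""
    return dict(read_block(sstable, block_index)).get(key)


print(read_cell("sst-1", 0, "a"))   # one GFS read
print(read_cell("sst-1", 0, "a"))   # scan-cache hit: no work at all
print(read_cell("sst-1", 0, "b"))   # scan-cache miss, block-cache hit: no GFS read
```
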
    • Bloom filters:
      • A read operation has to read from a merged view of all the SSTables that make up a tablet
      • Bloom filters created for SSTables to reduce the number of disk accesses
      • A Bloom filter answers (probabilistically) whether an SSTable might contain data for a given row/column pair; a negative answer is definite
      • Most lookups for non-existent rows/columns do not need to touch disk!
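
A minimal Bloom filter sketch showing why most lookups for non-existent rows never need to touch disk: a negative answer from the in-memory filter is definite. The sizes and hashing scheme are arbitrary choices, not what Bigtable actually uses:

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0                      # an int used as a bit set

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))


# One filter per SSTable, built over its row/column pairs when it is written.
sstable_filter = BloomFilter()
sstable_filter.add("com.cnn.www/contents:")

if sstable_filter.might_contain("missing.row/contents:"):
    pass  # only now pay for a disk read of the SSTable
else:
    print("skip this SSTable: key definitely absent")
```
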
    • Commit-log implementation:
      • Having a separate log file per tablet would mean a very large number of files being written concurrently to GFS, causing many disk seeks
      • Commit log is implemented on a per-tablet-server basis
      • Writes to different tablets are co-mingled in the same physical log file
      • This complicates recovery
      • The recovery is simplified by sorting all entries by keys <table, row, log seq #>
      • In the sorted output all entries for a tablet are contiguous and in order
      • The sort is parallelised and split among many tablet servers
      • The sorting is coordinated by the master and is initiated when the recovery is needed
      • To protect against GFS latency spikes, each tablet server uses two log-writing threads, each with its own file; only one is active at a time, and writing switches to the other if the active one performs poorly
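
A tiny illustration of the co-mingled commit log and the recovery-time sort by <table, row, log seq #>. In the real system the sort is chunked and parallelised across tablet servers; a single in-memory sort stands in here:

```python
# One physical log per tablet server: entries from many tablets interleaved.
log = [
    ("users",  "row9", 3, "put v3"),
    ("orders", "row1", 1, "put v1"),
    ("users",  "row9", 2, "put v2"),
    ("orders", "row7", 4, "delete"),
]

# Sort by <table, row name, log sequence number> so each tablet's mutations
# become contiguous and can be read with one sequential pass.
sorted_log = sorted(log, key=lambda e: (e[0], e[1], e[2]))


def mutations_for(table, rows):
    """Recover only the mutations belonging to one tablet (a set of rows)."""
    return [e for e in sorted_log if e[0] == table and e[1] in rows]


print(mutations_for("users", {"row9"}))
# [('users', 'row9', 2, 'put v2'), ('users', 'row9', 3, 'put v3')]
```
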
    • Speeding up tablet recovery (tablet migration):
      • When the master decides to move a tablet to another tablet server:
      • The tablet server does a minor compaction on a tablet
      • Stops serving
      • Another (usually very fast) minor compaction to eliminate any remaining uncompacted state that arrived while the first compaction was running
      • After this second compaction, the tablet can be loaded on another server without requiring recovery of any log entries
    • Exploiting immutability:
      • All SSTables generated are immutable
      • No synchronisation of accesses to SSTables is needed when reading, which makes concurrency control over rows efficient
      • Memtable is the only mutable structure accessed by both reads and writes
      • Each memtable row is copy on write, which allows parallel reads and writes
      • Removing deleted data == garbage collecting old SSTables
      • Immutability of SSTables means that when a tablet splits, the child tablets can share the parent's SSTables instead of copying them -- fast splits (see the sketch below)
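
A sketch of why immutability makes tablet splits cheap: child tablets keep references to the parent's immutable SSTables rather than copying any data. The classes are illustrative only:

```python
class SSTable:
    """Immutable once written; safe to share between tablets without locking."""

    def __init__(self, data):
        self._data = dict(data)       # key -> value, never mutated afterwards

    def get(self, key):
        return self._data.get(key)


class Tablet:
    def __init__(self, start_row, end_row, sstables):
        self.start_row, self.end_row = start_row, end_row
        self.sstables = sstables      # possibly shared with a sibling tablet

    def split(self, mid_row):
        # No data is copied: both children point at the same SSTable objects
        # and simply serve disjoint halves of the row range.
        return (Tablet(self.start_row, mid_row, self.sstables),
                Tablet(mid_row, self.end_row, self.sstables))


parent = Tablet("a", "z", [SSTable({"b": 1, "m": 2, "t": 3})])
left, right = parent.split("m")
print(left.sstables is right.sstables)    # True: shared, thanks to immutability
print(right.sstables[0].get("t"))         # 3
```
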