What is the problem that this paper is trying to solve?
Create a scalable, distributed storage system that can be used as a data repository
for Google's other products
Main idea
Structure the storage as a sparse, distributed, sorted key-value map
Data is indexed by a row key, a column key, and a timestamp
Moving away from an RDBMS
More flexible storage model -- shifts control over data layout to the end product/user
Advantage over an RDBMS ~> search by key is easy
Each cell in BigTable can contain multiple versions of the same data, indexed by timestamp
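A minimal sketch of this data model, using a toy in-memory class and the webtable example keys from the paper; the ToyTable class is illustrative, not BigTable's actual API:

```python
# Toy model of BigTable's data model: a map from
# (row key, column key, timestamp) -> uninterpreted byte string.
# The real map is distributed, sparse, and sorted by row key.
class ToyTable:
    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self._cells.get((row, column), {})
        if not versions:
            return None
        # Default to the newest version, matching BigTable's
        # newest-first ordering of cell versions.
        ts = max(versions) if timestamp is None else timestamp
        return versions.get(ts)

table = ToyTable()
table.put("com.cnn.www", "contents:", 3, b"<html>...")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
print(table.get("com.cnn.www", "contents:"))  # b'<html>...'
```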
Implementation:
Uses Chubby (highly available distributed lock system at Google)
If Chubby becomes unavailable for an extended period, BigTable becomes unavailable too
(measurements in the paper show that, despite the tight coupling, Chubby outages cause a negligible fraction of BigTable unavailability)
Library linked to every client:
Caches tablet locations
If the location of a tablet is unknown, it moves up the tablet
location hierarchy
If the client's cache is empty, the location algorithm requires three
network round trips, including one read from Chubby
If the client's cache is stale, the location algorithm could take up
to six round trips, since stale entries are only discovered upon misses
Prefetches tablet locations (reads metadata for more than one tablet per METADATA read) to reduce access time
Master server (single):
Responsible for:
Assigning tablets to tablet servers
Detecting the addition and expiration of tablet servers
Balancing tablet server load
Garbage collection of files in GFS
Handles schema changes (how? Sends updates to each tablet server affected?)
Tablet servers (many):
Can be dynamically added or removed
Responsible for:
Manages a set of tablets (typically between ten and a thousand)
Handles read/write requests to the tablets that it has loaded
Splits tablets that have grown too large
Client data does not move through the master, clients communicate
directly with the tablet servers
Clients do not rely on the master for tablet location information,
hence the master is lightly loaded in practice
A cluster stores a number of tables; each table consists of a set of tablets,
and each tablet contains all data associated with a row range
As a table grows, it is automatically split into multiple tablets, each
approximately 100-200 MB in size by default
Tablet location:
File stored in Chubby that contains the location of the root tablet
The root tablet contains the locations of all tablets of a special METADATA table
Each METADATA tablet contains the location of a set of user tablets
The root tablet is the first tablet in the METADATA table and is never split
(to ensure that the tablet location hierarchy has no more than three
levels)
See Implementation for details
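A sketch of the three-level lookup combined with the client library's location cache described above; the Chubby and METADATA reader callables are hypothetical stand-ins, not real APIs:

```python
# Resolving a (table, row key) to a tablet location via the three-level
# hierarchy: Chubby file -> root tablet -> METADATA tablet -> user tablet.
class LocationClient:
    def __init__(self, read_root_from_chubby, read_metadata):
        self._read_root = read_root_from_chubby  # () -> root tablet location
        self._read_meta = read_metadata          # (tablet_loc, key) -> location
        self._cache = {}                         # (table, row_key) -> location

    def locate(self, table, row_key):
        hit = self._cache.get((table, row_key))
        if hit is not None:
            return hit                                       # no round trips
        root = self._read_root()                             # round trip 1: Chubby
        meta = self._read_meta(root, ("METADATA", row_key))  # round trip 2: root tablet
        user = self._read_meta(meta, (table, row_key))       # round trip 3: METADATA tablet
        self._cache[(table, row_key)] = user                 # cached for next time
        return user
```

Stale cache entries are not modelled here; in BigTable they are only discovered when a read to the cached location misses, which is why a stale cache can cost up to six round trips.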
Tablet assignment:
Each tablet is assigned to one tablet server at a time
Master keeps track of the live tablet servers (including which tablets are unassigned)
When a tablet is unassigned and a tablet server with sufficient room is
available, the master sends a tablet-load request to that server (thus
assigning the tablet to that server)
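A minimal sketch of this assignment rule, assuming illustrative server objects with a .tablets set and a .load() RPC stub:

```python
# Assign each unassigned tablet to a live tablet server that has room.
def assign_tablets(unassigned, live_servers, max_tablets=1000):
    # unassigned: a set of tablet ids
    for tablet in list(unassigned):
        candidates = [s for s in live_servers if len(s.tablets) < max_tablets]
        if not candidates:
            break                          # no server has room right now
        # Crude load balancing: pick the least-loaded candidate.
        target = min(candidates, key=lambda s: len(s.tablets))
        target.load(tablet)                # the master's tablet-load request
        unassigned.discard(tablet)
```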
Chubby keeps track of tablet servers:
When a tablet server starts, it creates, and acquires an exclusive
lock on, a uniquely-named file in a specific Chubby directory
The master monitors the directory to discover tablet servers
A tablet server stops serving tablets if it loses its lock
(e.g. a network partition causing it to lose its Chubby session)
A tablet server then attempts to reacquire its lock (if the file still exists)
If the file does not exist, the tablet server can never become
operational, so it kills itself
Whenever a tablet server terminates (e.g. because the cluster management
system is removing its machine from the cluster), it attempts to release its lock,
allowing the master to reassign its tablets more quickly.
The master detects when a tablet server is no longer serving its tablets:
The master periodically asks each tablet server for the status of its lock
If a server reports a lost lock, or if the master was unable to reach it in
the last several attempts, the master tries to acquire that server's lock in Chubby
If the master is able to get the lock -> the server is dead:
Its Chubby lock file is deleted (so the server can never serve again)
The master moves all the tablets assigned to the now dead
server into the set of unassigned tablets
The master kills itself if its Chubby session expires
(master failure does not change the assignment of tablets to servers)
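A sketch of the detection path described above; the server and chubby objects are hypothetical stand-ins (the real protocol is driven by Chubby sessions and RPCs):

```python
# Master-side check of a single tablet server's health via its Chubby lock.
def check_tablet_server(master, server, chubby):
    # report_lock_status() is assumed to return False when the server is
    # unreachable or reports that it has lost its lock.
    if server.report_lock_status():
        return                               # server is fine, nothing to do
    if chubby.try_acquire(server.lock_file):
        # The master got the lock, so Chubby is alive and the server is not.
        chubby.delete(server.lock_file)      # the server can never serve again
        master.unassigned |= set(server.tablets)  # tablets become unassigned
        server.tablets.clear()
```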
Master startup:
Grab the unique master lock in Chubby
Scan the servers directory in Chubby to discover live servers
Communicate with every live tablet server to discover which
tablets are assigned to it
Add the root tablet to the set of unassigned tablets if it was
not discovered in the previous step
Scan the METADATA table to learn the set of tablets (adding
unknown tablets to the unassigned list)
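A sketch of these startup steps, with chubby, metadata, and connect() as illustrative stand-ins:

```python
ROOT_TABLET = "root-tablet"   # placeholder id for the METADATA root tablet

def master_startup(chubby, metadata, connect):
    chubby.acquire("master-lock")                      # 1. unique master lock
    live = chubby.list_directory("servers")            # 2. discover live servers
    assigned = set()
    for name in live:
        assigned |= set(connect(name).list_tablets())  # 3. what each one serves
    unassigned = set()
    if ROOT_TABLET not in assigned:                    # 4. root tablet first, since
        unassigned.add(ROOT_TABLET)                    #    scanning METADATA needs it
    for tablet in metadata.all_tablets():              # 5. full METADATA scan
        if tablet not in assigned:
            unassigned.add(tablet)
    return unassigned
```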
The master is able to track all changes to the set of tablets because it initiates all of them except tablet splits
Tablet split:
Initiated by the tablet server
The tablet server commits the split by recording the new tablet's information in the METADATA table
When the split is committed, the master is notified
If the notification is lost, the split is detected when the now-split tablet is next loaded (its METADATA entry covers only part of the requested tablet)
Tablet serving:
The persistent state of a tablet is stored in GFS
Updates are committed to a commit log that stores redo records
Recently committed changes are stored in memory, in a sorted buffer (memtable)
Older updates are stored in a sequence of SSTables
Recovering a tablet:
Tablet server reads the tablet's metadata from the METADATA table: the list of
SSTables comprising the tablet and a set of redo points (pointers into commit logs)
The server then reads the indices of the SSTables into memory and
applies all of the updates that have committed since the redo points
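A sketch of this recovery sequence; the metadata and GFS readers are hypothetical stand-ins:

```python
# Rebuild a tablet's in-memory state: open its SSTables and re-apply every
# commit-log record written after the redo points.
def recover_tablet(tablet_id, metadata, gfs):
    entry = metadata.lookup(tablet_id)
    sstables = [gfs.open_sstable(path) for path in entry.sstable_paths]
    memtable = {}
    for log_path, redo_point in entry.redo_points:
        for record in gfs.read_log(log_path, start=redo_point):
            memtable[record.key] = record.value   # re-apply committed update
    return sstables, memtable
```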
Write operations:
Tablet server checks that the request is well-formed and that the sender is authorised
Authorisation is performed by reading a list of permitted writers from Chubby
A valid mutation is written to the commit log (small mutations
are aggregated via group commit to improve throughput)
After the write is committed, its contents are inserted into the memtable
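A sketch of this write path; all objects (mutation, chubby, commit_log, tablet) are illustrative stand-ins:

```python
# Validate, authorise via Chubby, log (with group commit), then apply.
def handle_write(tablet, mutation, chubby, commit_log):
    if not mutation.well_formed():
        raise ValueError("malformed mutation")
    writers = chubby.read_permitted_writers(tablet.table)  # usually a cache hit
    if mutation.author not in writers:
        raise PermissionError("sender is not a permitted writer")
    commit_log.group_commit(mutation)   # small mutations are batched together
    tablet.memtable.apply(mutation)     # visible to reads once committed
```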
Read operations:
Checked for well-formedness and authorisation, as for writes
Reads are executed on a merged view of the sequence of SSTables and the memtable
Since both are sorted lexicographically, the merged view can be formed efficiently (see the merge sketch below)
Incoming reads and writes can continue while tablets are split and merged
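A sketch of forming the merged view, assuming each source is an iterable of (key, value) pairs already sorted by key and ordered oldest SSTable first, memtable last; in the real system deletion entries would also be filtered here:

```python
import heapq

def _tag(src, idx):
    # Tag entries so that, for equal keys, newer sources sort first.
    for key, value in src:
        yield (key, -idx), value

def merged_read(sources):
    streams = [_tag(src, idx) for idx, src in enumerate(sources)]
    last_key = object()
    for (key, _), value in heapq.merge(*streams):
        if key != last_key:      # first entry per key comes from the newest
            yield key, value     # source, so it shadows older values
            last_key = key
```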
Compactions:
When the memtable size reaches a threshold (writes fill it up):
Memtable is frozen
New memtable is created
Old (frozen) memtable is converted into an SSTable and written to GFS
This shrinks the memory usage of a tablet server
This reduces the amount of data that has to be read from the commit log during recovery (if the server dies)
This does not block incoming reads and writes
The process is called minor compaction
Every minor compaction creates a new SSTable; unchecked, reads would
eventually have to merge an arbitrary number of SSTables; the solution is a
merging compaction (both are sketched after this section)
Converts the contents of a few SSTables (and the memtable) into exactly one SSTable
The input SSTables and the memtable can be discarded as soon as the compaction is done
A merging compaction that rewrites all SSTables into one is called major compaction:
SSTables produced by non-major compactions can contain special
deletion entries that suppress deleted data still present in
older SSTables that are still live
A major compaction produces a single SSTable that contains no deletion entries (and no deleted data)
Major compaction is applied regularly
Allows BigTable to reclaim resources used by deleted data
Ensures that deleted data disappears from the system in a timely fashion
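A minimal sketch of the minor and merging compactions described above; SSTables here are just sorted in-memory lists, standing in for immutable files on GFS:

```python
from dataclasses import dataclass, field

@dataclass
class TabletState:
    memtable: dict = field(default_factory=dict)
    sstables: list = field(default_factory=list)  # ordered oldest first

def minor_compaction(state):
    # Called when the memtable reaches its size threshold: freeze it, start
    # a fresh memtable, and persist the frozen contents as a new SSTable.
    frozen, state.memtable = state.memtable, {}
    state.sstables.append(sorted(frozen.items()))

def merging_compaction(state, count):
    # Merge the memtable and the `count` (>= 1) newest SSTables into exactly
    # one SSTable; a major compaction is count == len(state.sstables). In the
    # real system a major compaction also drops deletion entries, which this
    # sketch does not model.
    merged = {}
    for sstable in state.sstables[-count:]:
        merged.update(sstable)           # newer entries overwrite older ones
    merged.update(state.memtable)        # the memtable holds the newest data
    state.memtable = {}
    state.sstables[-count:] = [sorted(merged.items())]
```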
Refinements (optimisations):
Locality groups:
Clients can group multiple column families into a locality group
A separate SSTable is generated for each locality group in each tablet
Segregating column families that are not read together allows for faster reads
Some tuning parameters can be specified per locality group (e.g. a locality group can be declared to be in-memory)
Compression:
Clients can control whether the SSTables for locality groups are compressed
Clients can control which compression format is used
Compression is applied separately to each SSTable block, so small parts of an SSTable can be read without decompressing the whole file
Many clients use a two-pass custom compression scheme (a first pass that finds long common strings across a large window, and a fast second pass that looks for repetitions in a small window)
Caching for read performance:
Two levels of caching
Scan Cache:
Higher-level
Caches the key-value pairs returned by the SSTable interface to the tablet server code
Most useful for applications that tend to read the same data repeatedly
Block Cache:
Lower-level
Caches SSTable blocks that were read from the GFS
Benefits applications that exploit spatial locality in the data they read (e.g. sequential reads)
Bloom filters:
A read operation has to read from all SSTables that make up a tablet's state, potentially requiring many disk accesses
Bloom filters are created for SSTables (per locality group) to reduce the number of disk accesses
A Bloom filter states whether an SSTable might contain any data for a given row/column pair (false positives are possible, false negatives are not)
Most lookups for non-existent rows/columns do not need to touch disk!
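A minimal Bloom filter sketch showing why most lookups for absent rows never touch disk; the sizes and hash construction here are illustrative, not what BigTable uses:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from independent hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Per-SSTable usage: only go to GFS when the filter says "maybe".
# if sstable_filter.might_contain(b"row/column"): read_block_from_disk(...)
```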
Commit-log implementation:
Keeping a separate log file per tablet would result in a very large number of files being written concurrently to GFS, causing many disk seeks
Commit log is implemented on a per-tablet-server basis
Writes to different tablets are co-mingled in the same physical log file
This complicates recovery
The recovery is simplified by sorting all entries by keys <table, row, log seq #>
In the sorted output all entries for a tablet are contiguous and in order
The sort is parallelised and split among many tablet servers
The sorting is coordinated by the master and is initiated when the recovery is needed
Commit logging uses two separate log-writing threads, each writing to its own file (only one active at a time), to protect against GFS latency spikes
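A sketch of the recovery-time sort that makes each tablet's log entries contiguous; the log-entry fields (.table, .row, .seq, .mutation) are assumed, not the real record format:

```python
def sort_log_chunk(entries):
    # In BigTable this sort is partitioned across many tablet servers and
    # coordinated by the master; here it is a single in-memory sort.
    return sorted(entries, key=lambda e: (e.table, e.row, e.seq))

def replay_for_tablet(sorted_entries, table, row_range):
    # Yield the mutations for one tablet (a row range of one table).
    lo, hi = row_range
    for e in sorted_entries:
        if e.table == table and lo <= e.row < hi:
            yield e.mutation          # applied in log-sequence order
```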
Speeding up tablet recovery (when a tablet is moved):
After the master decides to move the tablet to another server
The source tablet server first does a minor compaction on the tablet
Stops serving the tablet
Does another (usually very fast) minor compaction to eliminate any uncompacted
state that arrived while the first compaction was running
After the second compaction, the tablet can be loaded on another server without
any recovery of log entries
Exploiting immutability:
All SSTables generated are immutable
No need to synchronise accesses to the file system when reading from SSTables, which makes concurrency control over rows efficient
Memtable is the only mutable structure accessed by both reads and writes
Each memtable row uses copy-on-write, which allows reads and writes to proceed in parallel
Removing permanently deleted data reduces to garbage-collecting obsolete SSTables
Immutability of SSTables means that when a tablet is split, the child tablets can share the SSTables of the parent -- splits are fast