By Amazon
  1. Background:
    • Simple key-value interface
    • Highly available with a clearly-defined consistency window
    • State is stored as binary objects (blobs) identified by unique keys
    • No operations span multiple data items and there is no relational schema
    • Dynamo targets applications that need to store relatively small objects (less than 1 MB)
    • They dropped the C from ACID (for higher availability)
    • Scalability of a single instance is limited: each service runs its own Dynamo instance, and the initial design targets up to a few hundred storage hosts
    • Dynamo is assumed to be operating in a non-hostile ("safe") environment and therefore it lacks security extensions such as authentication and authorisation
    • Like everything else at Amazon, Dynamo is "eventually consistent"
    • Conflict resolution happens on reads --> "Always writable" service
  1. Architecture details:
  1. Interface:
    • Exposes two operations get and put:
      • GET(key) locates the object replicas associated with the key in the storage system and returns a single object, or a list of objects with conflicting versions, stored under that key, along with a context
      • PUT(key, context, object) operation determines where the replicas of the object should be placed depending on the key and writes replicas to disk (async)
      • context encodes system metadata about the object that is opaque to the caller and includes information such as the version of the object. This information is stored along with the object in order to verify context validity during PUT
    • Key and object are both treated as opaque arrays of bytes (so... a string?)
    • The key is MD5-hashed to generate a 128-bit identifier
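    • A minimal sketch of this interface in Python (the DynamoClient name and method stubs are illustrative, not from the paper; only the MD5 key hashing and the get/put signatures follow the notes above):
      ```python
      import hashlib

      def key_to_id(key: bytes) -> int:
          # Keys are opaque byte arrays; Dynamo MD5-hashes them into a 128-bit identifier.
          return int.from_bytes(hashlib.md5(key).digest(), "big")

      class DynamoClient:
          def get(self, key: bytes):
              # Returns (versions, context): one object, or a list of conflicting
              # versions, plus an opaque context carrying metadata such as vector clocks.
              ...

          def put(self, key: bytes, context, obj: bytes) -> None:
              # The context must come from a preceding get(), so the system knows
              # which version(s) this write supersedes; replicas are written async.
              ...
      ```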
  1. Partitioning:
    • Output space of a hash function is treated as a "ring"
    • Hash the key and walk the ring clockwise to find the first node whose position is larger than the key's hash; that node is responsible for the key and stores the object
    • To deal with non-uniform load distribution and node heterogeneity, each node is mapped to multiple positions on the ring (virtual nodes)
    • Whenever a new node is added, it is assigned to multiple positions ("tokens") on the ring
    • Advantages of virtual nodes:
      • If a node becomes unavailable, the load handled by it is evenly dispersed among other available nodes
      • Whenever a new node becomes available, it receives roughly a similar amount of load from existing nodes
      • The number of virtual nodes ("tokens") for any given node could be adjusted based on its physical capabilities, allowing for hardware heterogeneity
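    • A sketch of the ring lookup with virtual nodes (deriving tokens from "node:i" strings is my own simplification, not how Dynamo assigns tokens):
      ```python
      import bisect
      import hashlib

      def position(value: bytes) -> int:
          # The same 128-bit MD5 hash space is used for keys and for node tokens.
          return int.from_bytes(hashlib.md5(value).digest(), "big")

      class Ring:
          def __init__(self, nodes, tokens_per_node=8):
              # Each physical node is placed at several positions ("tokens") on the ring.
              self._ring = sorted(
                  (position(f"{node}:{i}".encode()), node)
                  for node in nodes
                  for i in range(tokens_per_node)
              )
              self._tokens = [token for token, _ in self._ring]

          def coordinator(self, key: bytes) -> str:
              # Walk the ring clockwise: the first token larger than hash(key) owns the key.
              idx = bisect.bisect_right(self._tokens, position(key)) % len(self._ring)
              return self._ring[idx][1]

      # Ring(["node-a", "node-b", "node-c"]).coordinator(b"cart:12345")
      ```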
  1. Replication:
    • Data is replicated on multiple hosts
    • Each key has a coordinator node in charge of its replication; the number of replicas (N) can differ between instances, as it is configured on a per-instance basis
    • The key (and its object) is also replicated at the N-1 clockwise successor nodes on the ring, so N nodes store each key in total
    • The list of nodes that are responsible for storing a particular key is called the "preference list"
    • The preference list is constructed by skipping positions on the ring so that it contains only distinct physical nodes (several virtual nodes may belong to the same host)
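    • A sketch of preference-list construction over such a ring (the (token, physical_node) list representation is an assumption for illustration):
      ```python
      import bisect

      def preference_list(ring, key_position, n):
          # ring: sorted list of (token, physical_node) pairs; key_position: hash of the key.
          # Walk clockwise and collect the next n *distinct physical* nodes, skipping
          # virtual-node positions that belong to a host already in the list.
          tokens = [token for token, _ in ring]
          start = bisect.bisect_right(tokens, key_position)
          nodes = []
          for step in range(len(ring)):
              node = ring[(start + step) % len(ring)][1]
              if node not in nodes:
                  nodes.append(node)
              if len(nodes) == n:
                  break
          return nodes
      ```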
  1. Versioning:
    • When a client updates an object without having seen its latest version, the changes are applied to the older version it holds and the divergent versions are reconciled later on
    • As an example, items added to cart are never lost, but deleted items can resurface
    • Vector clocks are used to capture causality between different versions of the data:
      • A vector clock is effectively a list of (node, counter) pairs
      • One vector clock is associated with every version of every object
      • If every counter in the first object's clock is <= the corresponding counter in the second clock, then the first version is an ancestor of the second and can be forgotten
      • Otherwise the two changes are conflicting and need to be merged
    • Whenever an object is updated, the client must specify which version it is updating (via the context included in both - return of GET and parameter for PUT)
    • The context contains the vector clock information
    • When processing the GET request, if there are multiple objects that cannot be syntactically reconciled, Dynamo will return a list of them, with corresponding version information for all of them
    • An update using this context is considered to be reconciling all conflicts
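    • A sketch of the vector-clock comparison described above (clocks as {node: counter} dicts; the node names in the example just follow the paper's Sx/Sy/Sz style):
      ```python
      def descends(a: dict, b: dict) -> bool:
          # True if clock `a` dominates clock `b`: every counter in b is <= its counter in a.
          return all(a.get(node, 0) >= counter for node, counter in b.items())

      def compare(a: dict, b: dict) -> str:
          if descends(a, b) and descends(b, a):
              return "identical"
          if descends(a, b):
              return "a supersedes b (b can be forgotten)"
          if descends(b, a):
              return "b supersedes a (a can be forgotten)"
          return "conflict: both versions are kept and returned for reconciliation"

      # Two writes that branched from a common ancestor conflict:
      # compare({"Sx": 2, "Sy": 1}, {"Sx": 2, "Sz": 1})  ->  "conflict: ..."
      ```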
  1. Depths of GET and PUT:
    • Any storage node in Dynamo is eligible to receive requests from clients for any key
    • Node selection by client:
      1. Call via a generic load balancer, which forwards the request to a node based on load information
      2. Call via partition-aware client library that routes requests directly to the handling node
    • The approaches mostly differ by the amount of coupling with Dynamo (and latency)
    • Read and write operations involve the first N healthy nodes in the preference list, skipping over those that are down or inaccessible
    • To maintain consistency, a quorum protocol is used:
      • R - min number of nodes for a successful read
      • W - min number of nodes for a successful write
      • R + W > N
      • The latency of a read/write is dictated by the slowest replica in R/W
      • Hence, R and W are usually each configured to be less than N to get better latency
    • PUT:
      • Coordinator generates the vector clock and stores the updated version locally
      • Coordinator sends the new version along with the new vector clock to the N highest-ranked nodes
      • If at least W-1 nodes respond -> write is successful
    • GET:
      • Coordinator requests all existing versions of data for that key from the N highest-ranked nodes
      • Coordinator waits for R responses before returning the result to the client
      • If there are causally unrelated (conflicting) versions, all of them are returned, each with its own version context
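    • A toy simulation of the quorum flow (the in-memory Replica class and the way the coordinator bumps its own vector-clock entry are illustrative, not the paper's API):
      ```python
      class Replica:
          # Toy in-memory replica, used only to make the quorum flow concrete.
          def __init__(self):
              self.store = {}                                  # key -> (value, vector_clock)

          def write(self, key, value, clock):
              self.store[key] = (value, clock)
              return True

          def read(self, key):
              return self.store.get(key)

      class Coordinator:
          def __init__(self, name, replicas, r, w):
              self.name, self.replicas, self.r, self.w = name, replicas, r, w

          def put(self, key, context, value):
              clock = dict(context or {})
              clock[self.name] = clock.get(self.name, 0) + 1   # new version's vector clock
              acks = 0
              for replica in self.replicas:                    # the N highest-ranked healthy nodes
                  if replica.write(key, value, clock):
                      acks += 1
                  if acks >= self.w:                           # coordinator's own write counts too
                      return True
              return False

          def get(self, key):
              responses = []
              for replica in self.replicas:
                  version = replica.read(key)
                  if version is not None:
                      responses.append(version)
                  if len(responses) >= self.r:
                      break
              return responses                                 # conflicting versions go back to the client
      ```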
  1. Failure handling:
    • "Sloppy quorum" -- all reads and writes are performed on the first N healthy nodes in the preference list
    • Whenever a node receives an update that should have been handled by another node, it also receives a "hint" in its metadata
    • A hint suggests which node was the intended recipient of the replica
    • Nodes have a separate local database for "hinted" replicas
    • If a node detects that some intended recipient has recovered, it will forward the replica to the recovered node and on success will delete it from the local "hinted" store
    • Hinted Handoff ensures that no read/write operation fails because the intended coordinator is down, or because there are some network issues
    • If W == 1, a write is accepted as long as at least one node in the system is healthy --> extreme availability
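    • A sketch of hinted handoff (the is_alive/deliver callbacks stand in for failure detection and node-to-node transfer; they are not Dynamo's actual interfaces):
      ```python
      class StorageNode:
          def __init__(self, name):
              self.name = name
              self.store = {}          # normal replicas:  key -> value
              self.hinted = {}         # hinted replicas:  key -> (value, intended_node)

          def write(self, key, value, hint=None):
              if hint is None:
                  self.store[key] = value
              else:
                  # A replica accepted on behalf of a down node is kept separately,
                  # tagged with a hint naming the intended recipient.
                  self.hinted[key] = (value, hint)

          def handoff(self, is_alive, deliver):
              # Periodically try to return hinted replicas to their intended owners.
              for key, (value, owner) in list(self.hinted.items()):
                  if is_alive(owner) and deliver(owner, key, value):
                      del self.hinted[key]     # delete only after a successful handoff
      ```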
  1. Handling permanent failures:
    • Handles the case where a node holding a hinted replica becomes unavailable before it can return the replica to its intended recipient
    • Anti-entropy (replica sync) protocol:
      • Merkle trees reduce the amount of data that needs to be transferred, allow parallel search through branches
      • E.g.: if the root hashes of two trees are equal, then the values of the leaf nodes are equal as well and the replicas require no synchronisation
      • If the root hashes are different --> some of the replicas are different
      • The protocol:
      • Each node maintains a separate Merkle tree for each key range (set of keys covered by one of its virtual nodes)
      • Two nodes exchange the root of the Merkle tree corresponding to the key ranges they share
      • They then traverse the trees to determine exactly which replicas are out of sync
      • One disadvantage is that when a new node joins, the ranges change and all trees need to be recalculated
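    • A sketch of the Merkle-tree comparison (assumes both nodes build the tree over the same sorted keys of a shared key range; hashing value bytes with MD5 is my simplification):
      ```python
      import hashlib

      def md5(data: bytes) -> bytes:
          return hashlib.md5(data).digest()

      def build_tree(items):
          # items: sorted, non-empty list of (key, value_bytes) for one key range.
          # Leaves are (hash, key, None); inner nodes are (hash, left, right).
          if len(items) == 1:
              key, value = items[0]
              return (md5(value), key, None)
          mid = len(items) // 2
          left, right = build_tree(items[:mid]), build_tree(items[mid:])
          return (md5(left[0] + right[0]), left, right)

      def out_of_sync(a, b, acc=None):
          # Descend both trees in parallel, but only into branches whose hashes differ.
          acc = [] if acc is None else acc
          if a[0] == b[0]:
              return acc                       # identical subtrees: nothing to transfer
          if a[2] is None:
              acc.append(a[1])                 # leaf: this key's replicas need syncing
              return acc
          out_of_sync(a[1], b[1], acc)
          out_of_sync(a[2], b[2], acc)
          return acc
      ```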
  1. Membership and failure detection:
    • A node outage should not immediately be interpreted as a permanent departure and trigger the partition assignment rebalancing
    • There is no automatic membership change --> Requires human intervention
    • The node that serves the membership change writes it to its local history
    • The change is propagated to all nodes with a gossip-based protocol
    • The view of membership is eventually consistent
    • Each node contacts a peer chosen at random every second and the two nodes reconcile their persistent membership change histories
    • The gossiped mappings also contain the partitions of the hash space (tokens) that each node covers, allowing a request to be routed directly to the right node
    • Avoiding logical partitions:
      • Some nodes play the role of seeds
      • Seeds are nodes that are discovered via an external mechanism and are known to all nodes in a Dynamo ring
      • Eventually all nodes will reconcile with a seed, thus avoiding logical partitions
    • Failure detection:
      • Main purpose is to avoid communication with dead nodes during GET and PUT
      • Local notion of failure is sufficient -> Node B is considered failed by node A if B does not respond to A's messages
      • Nodes will periodically retry communication with the "failed" nodes
      • Failure discovery is driven by client requests; in the absence of traffic it won't happen
      • No node is automatically marked as departed --> requires human intervention
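    • A sketch combining gossip-based reconciliation of membership histories with the purely local failure detection (the logical timestamps, the ten-second retry interval and the method names are my assumptions):
      ```python
      import random, time

      class Member:
          def __init__(self, name):
              self.name = name
              self.membership = {}                  # node -> (logical_time, "added"/"removed")
              self.failed = {}                      # node -> wall-clock time it stopped responding

          # --- gossip-based membership ---
          def reconcile(self, peer):
              # Merge the two persisted histories, keeping the latest change per node;
              # repeated rounds make every node's view eventually consistent.
              merged = dict(self.membership)
              for node, change in peer.membership.items():
                  if node not in merged or change[0] > merged[node][0]:
                      merged[node] = change
              self.membership = dict(merged)
              peer.membership = dict(merged)

          def gossip_round(self, peers):
              self.reconcile(random.choice(peers))  # one randomly chosen peer per second

          # --- local failure detection ---
          def record_timeout(self, node):
              self.failed[node] = time.time()       # driven by client-request traffic

          def record_success(self, node):
              self.failed.pop(node, None)

          def is_available(self, node, retry_after=10.0):
              failed_at = self.failed.get(node)
              # After a while, optimistically retry the "failed" node.
              return failed_at is None or time.time() - failed_at >= retry_after
      ```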
  1. Adding/Removing Storage Nodes:
    • When a new node X is added to the system, it gets assigned a number of tokens scattered across the ring, and the nodes currently in charge of those key ranges transfer them to X (the reverse happens on removal)
  1. Implementation:
    • Storage nodes have 3 components:
      1. Request Coordination:
        1. Event-driven messaging substrate (staged message processing pipeline, like in SEDA)
        2. Communications are implemented with Java non-blocking IO channels
        3. The node receiving a request creates a state machine, it contains:
          • Logic for identifying the nodes responsible for a key
          • Requesting, awaiting responses and potentially retrying
          • Processing replies and packaging the response to the client
        4. There is one state machine per individual request
        5. After a read completes, the state machine waits for a short while for "late" responses; if any of them return stale versions, the coordinator updates those nodes with the latest version. This process is called read repair (see the sketch at the end of this section)
        6. The coordinator for a write is chosen to be the node that replied fastest to the preceding read operation, which increases the chance of "read-your-writes" consistency (writes are usually preceded by a read of the same key)
      2. Membership
      3. Failure detection
    • The local persistence component sits behind an interface that allows different storage engines to be plugged in (e.g. Berkeley DB, MySQL, an in-memory buffer with a persistent backing store); the choice depends on the application's object sizes and access patterns
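    • A sketch of the read-repair step mentioned above (responses as (replica, value, vector_clock) triples; the dominance check mirrors the vector-clock rules from the Versioning section):
      ```python
      def descends(a: dict, b: dict) -> bool:
          # Vector-clock dominance: every counter in b is <= its counter in a.
          return all(a.get(node, 0) >= counter for node, counter in b.items())

      def stale_replicas(responses):
          # responses: list of (replica, value, vector_clock) gathered for one read.
          # Returns the replicas holding causally older versions; the coordinator would
          # push the freshest version to them after replying to the client.
          freshest = next((r for r in responses
                           if all(descends(r[2], other[2]) for other in responses)), None)
          if freshest is None:
              return []                      # conflicting versions: needs semantic reconciliation
          _, _, clock = freshest
          return [replica for replica, _, c in responses
                  if c != clock and descends(clock, c)]
      ```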
  1. Lessons Learned:
    • General:
      • Business logic specific reconciliation:
        1. The shopping cart example -> added items are never lost, but we can tolerate resurfacing of the deleted items.
      • Timestamp based reconciliation:
        1. Last write wins policy
        2. Useful for the client session management
      • High performance read engine:
        1. Easy to tailor the system to what the service needs by adjusting R and W accordingly (e.g. services with a very high read rate set R = 1 and W = N)
    • Balancing Performance and durability:
      • Keeping an in-memory write buffer (periodically flushed to storage by a writer thread) helped to bridge that gap
    • Ensuring Uniform Load distribution:
      • A uniform distribution of key hashes around the ring spreads the data evenly, and (assuming roughly uniform access) this keeps the load balanced across nodes