Highly available with a clearly-defined consistency window
State is stored as binary objects (blobs) identified by unique keys
No operations span multiple data items and there is no relational schema
Dynamo targets applications that need to store relatively small objects (less than 1 MB)
They dropped the C (consistency) from ACID in exchange for higher availability
Scalability is limited by design: each Dynamo instance targets up to a few hundred hosts
Dynamo is assumed to be operating in "safe space" and therefore it lacks security extensions
Like everything else at Amazon, Dynamo is "eventually consistent"
Conflict resolution happens on reads --> "Always writable" service
Architecture details:
Interface:
Exposes two operations get and put:
GET(key) locates the object replicas associated with the key in the storage system and
returns a single object (or a list of objects with conflicting versions) that are
stored under that key along with a context
PUT(key, context, object) operation determines where the replicas of the object
should be placed depending on the key and writes replicas to disk (async)
context encodes system metadata about the object that is opaque to the caller and
includes information such as the version of the object. This information is stored
along with the object in order to verify context validity during PUT
Key and object are both treated as an opaque array of bytes (so... A string?)
Key is MD5-hashed to generate a 128-bit identifier
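A minimal sketch of the key-to-ring mapping described above (the key name is illustrative; Python's hashlib stands in for Dynamo's MD5 use):

```python
import hashlib

def ring_position(key: bytes) -> int:
    """Map an opaque key to a 128-bit position on the hash ring via MD5."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")

pos = ring_position(b"cart:12345")
assert 0 <= pos < 2**128   # positions live in the 128-bit output space
```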
Partitioning:
Output space of a hash function is treated as a "ring"
Hash the key and walk the ring clockwise to the first node whose position is larger
than the key's hash; that node is responsible for the key and stores the object
To avoid heterogeneity problems, each node has multiple values on the ring (Virtual nodes)
Whenever a new node is added, it is assigned to multiple positions ("tokens") on the ring
Advantages of virtual nodes:
If a node becomes unavailable, the load handled by it is evenly dispersed among other
available nodes
Whenever a new node becomes available, it receives roughly a similar amount of load
from existing nodes
The number of virtual nodes ("tokens") for any given node could be adjusted based on
its physical capabilities, allowing for hardware heterogeneity
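The ring walk with virtual nodes can be sketched as follows (node names and tokens-per-node count are made up for illustration):

```python
import bisect
import hashlib

def h(data: bytes) -> int:
    return int.from_bytes(hashlib.md5(data).digest(), "big")

class Ring:
    """Consistent-hash ring; each physical node owns several virtual tokens."""
    def __init__(self, nodes, tokens_per_node=3):
        # each node gets tokens_per_node positions ("tokens") on the ring
        self.ring = sorted(
            (h(f"{node}:{i}".encode()), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self.positions = [pos for pos, _ in self.ring]

    def node_for(self, key: bytes) -> str:
        # walk clockwise to the first token >= hash(key), wrapping around
        i = bisect.bisect(self.positions, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["A", "B", "C"])
owner = ring.node_for(b"some-key")
```

Giving a beefier machine more tokens in the constructor is all it takes to skew load toward it, which is the heterogeneity point above.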
Replication:
Data is replicated on multiple hosts
Each node is in charge of replication and the number of replicas for different keys can
vary, as it is configured on the "per-instance" basis
The key (and the object) gets replicated at the N-1 successor nodes on the ring
The list of nodes that are responsible for storing a particular key is called the
"preference list"
The preference list is constructed to contain only distinct physical nodes
(positions belonging to virtual nodes of the same host are skipped)
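A sketch of building the preference list by walking successors and skipping repeated physical nodes (ring layout and node names are illustrative):

```python
import bisect
import hashlib

def h(data: bytes) -> int:
    return int.from_bytes(hashlib.md5(data).digest(), "big")

# virtual-node tokens: (ring position, physical node)
tokens = sorted(
    (h(f"{node}:{i}".encode()), node) for node in "ABCD" for i in range(4)
)
positions = [pos for pos, _ in tokens]

def preference_list(key: bytes, n: int):
    """Walk clockwise from hash(key), skipping tokens whose physical node
    is already listed, until N distinct nodes are collected."""
    start = bisect.bisect(positions, h(key))
    out = []
    for i in range(len(tokens)):
        node = tokens[(start + i) % len(tokens)][1]
        if node not in out:
            out.append(node)
        if len(out) == n:
            break
    return out

prefs = preference_list(b"cart:12345", 3)
```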
Versioning:
When the object is updated by a client that does not have the latest version, the
changes are applied to the older version and the two versions are reconciled later
As an example, items added to cart are never lost, but deleted items can resurface
Vector clocks are used to capture causality between different versions of the data:
A vector clock is effectively a list of (node, counter) pairs
One vector clock is associated with every version of every object
If every counter in the first object's clock is <= the corresponding counter in the
second clock, then the first version is an ancestor of the second and can be forgotten
Otherwise the two changes are conflicting and need to be merged
Whenever an object is updated, the client must specify which version it is updating (via
the context included in both - return of GET and parameter for PUT)
The context contains the vector clock information
When processing the GET request, if there are multiple objects that cannot be
syntactically reconciled, Dynamo will return a list of them, with corresponding version
information for all of them
An update using this context is considered to be reconciling all conflicts
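The ancestor/conflict test on vector clocks can be sketched as (clock contents are illustrative; "Sx"/"Sy" stand for server IDs as in the paper's example):

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` dominates `b`, i.e. `b` is an ancestor of (or equal
    to) `a` and can be forgotten."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def conflict(a: dict, b: dict) -> bool:
    """Neither clock dominates the other -> divergent versions, must merge."""
    return not descends(a, b) and not descends(b, a)

v1 = {"Sx": 1}
v2 = {"Sx": 2}            # descends from v1 -> v1 can be forgotten
v3 = {"Sx": 1, "Sy": 1}   # sibling of v2 -> conflicting, needs reconciliation
assert descends(v2, v1)
assert conflict(v2, v3)
```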
Depths of GET and PUT:
Any storage node in Dynamo is eligible to receive requests from clients for any key
Node selection by client:
Call via a load balancer, which will forward it based on the load
Call via partition-aware client library that routes requests directly to the handling
node
The approaches mostly differ by the amount of coupling with Dynamo (and latency)
Read and write operations involve the first N healthy nodes in the preference list,
skipping over those that are down or inaccessible
To maintain consistency, a quorum protocol is used:
R - min number of nodes for a successful read
W - min number of nodes for a successful write
R + W > N
The latency of a read/write is dictated by the slowest replica in R/W
Hence, R and W are usually both configured to be less than N to reduce latency
PUT:
Coordinator generates the vector clock and stores the updated version locally
Coordinator sends the new version along with the new vector clock to the N
highest-ranked nodes
If at least W-1 nodes respond -> write is successful
GET:
Coordinator requests all existing versions of data for that key from the N
highest-ranked nodes
Coordinator waits for R responses before returning the result to the client
If there are conflicting versions, a vector of all versions is returned (with
context for each)
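The quorum-driven PUT/GET flow above can be caricatured with in-memory "replicas" (everything here is a toy: real coordination is asynchronous and version selection uses vector clocks, not a plain max):

```python
N, R, W = 3, 2, 2
assert R + W > N          # quorum condition: read and write sets must overlap

replicas = [dict() for _ in range(N)]   # toy stores for the N top-ranked nodes

def put(key, value, version):
    acks = 0
    for store in replicas:        # coordinator sends to all N ranked nodes...
        store[key] = (version, value)
        acks += 1
        if acks >= W:             # ...but declares success after W acks
            return True
    return False

def get(key):
    # coordinator waits for the first R responses; freshest version wins
    responses = [s[key] for s in replicas[:R] if key in s]
    return max(responses) if responses else None

put("cart", ["book"], 1)
assert get("cart") == (1, ["book"])
```

Because R + W > N, at least one of the R replicas read must have seen the latest successful write, which is why the overlap condition matters.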
Failure handling:
"Sloppy quorum" -- all reads and writes are performed on the first N healthy nodes in the
preference list
Whenever a node receives the update that should have been handled by another node, it also
receives a "hint" in its metadata
A hint suggests which node was the intended recipient of the replica
Nodes have a separate local database for "hinted" replicas
If a node detects that some intended recipient has recovered, it will forward the replica
to the recovered node and on success will delete it from the local "hinted" store
Hinted Handoff ensures that no read/write operation fails because the intended coordinator
is down, or because there are some network issues
If W == 1, writes are accepted as long as at least one node is healthy --> extreme
availability
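A toy sketch of hinted handoff (the `stores` layout and `_hints` bucket are illustrative names, not Dynamo internals):

```python
stores = {"A": {}, "B": {}, "C": {}}   # per-node toy key/value stores

def write_with_hint(healthy, intended, key, value):
    """If the intended replica holder is down, hand the write to the first
    healthy node along with a hint naming the intended recipient."""
    if intended in healthy:
        stores[intended][key] = value
    else:
        stores[healthy[0]].setdefault("_hints", []).append((intended, key, value))

def handoff(holder, recovered):
    """When the intended node recovers, forward its hinted replicas to it
    and delete them from the holder's local hinted store."""
    kept = []
    for intended, key, value in stores[holder].get("_hints", []):
        if intended == recovered:
            stores[recovered][key] = value
        else:
            kept.append((intended, key, value))
    stores[holder]["_hints"] = kept

write_with_hint(["B", "C"], "A", "cart", "v7")   # A is down -> B keeps a hint
handoff("B", "A")                                # A recovers -> replica moves
```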
Handling permanent failures:
Covers the case where the holder of a hinted replica becomes unavailable before it
can hand the replica back to the intended node
Anti-entropy (replica sync) protocol:
Merkle trees reduce the amount of data that needs to be transferred, allow
parallel search through branches
E.g: if the root hashes of two trees are equal, then the values of the leaf nodes in
the trees are equal as well and the nodes require no synchronisation
If the root hashes are different --> some of the replicas are different
Each node maintains a separate Merkle tree for each key range (set of keys covered by
one of its virtual nodes)
Two nodes exchange the root of the Merkle tree corresponding to the key ranges they
share
They then traverse the trees to determine exactly which replicas are out of sync
One disadvantage is that when a new node joins, the ranges change and all trees need
to be recalculated
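A minimal Merkle-root sketch showing the root-comparison shortcut (MD5 and the leaf encoding are for brevity; a real tree also keeps interior hashes so peers can descend branch by branch to find the exact divergent keys):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.md5(data).digest()

def merkle_root(leaves):
    """Hash of hashes: parent = hash(left + right), folded up to one root."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:            # duplicate last hash on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
b = [b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"]
assert merkle_root(a) != merkle_root(b)   # roots differ -> replicas diverged
```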
Membership and failure detection:
A node outage should not immediately be interpreted as a permanent departure and trigger
the partition assignment rebalancing
There is no automatic membership change --> Requires human intervention
The node that serves the membership change writes it to its local history
The change is propagated to all nodes with a gossip-based protocol
The view of membership is eventually consistent
Each node contacts a peer chosen at random every second and the two nodes reconcile their
persistent membership change histories
The mappings also contain the partitions of the hash space that the node is covering
allowing for faster request routing between nodes (direct route)
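A toy sketch of one gossip reconciliation round, assuming each history entry is a per-node (timestamp, status) pair (the real protocol exchanges and reconciles full persisted change histories):

```python
def gossip_round(a_history: dict, b_history: dict):
    """Merge two membership-change histories, keeping the latest timestamped
    entry per node; after the round both peers hold the same merged view."""
    merged = dict(a_history)
    for node, (ts, status) in b_history.items():
        if node not in merged or merged[node][0] < ts:
            merged[node] = (ts, status)
    a_history.clear(); a_history.update(merged)
    b_history.clear(); b_history.update(merged)

a = {"X": (1, "joined")}
b = {"X": (2, "left"), "Y": (1, "joined")}
gossip_round(a, b)
assert a == b   # views converge -> membership is eventually consistent
```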
Avoiding logical partitions:
Some nodes play the role of seeds
Seeds are nodes that are discovered via an external mechanism and are known to all
nodes in a Dynamo ring
Eventually all nodes will reconcile with a seed, thus avoiding logical partitions
Failure detection:
Main purpose is to avoid communication with dead nodes during GET and PUT
Local notion of failure is sufficient -> Node B is considered failed by node A if B
does not respond to A's messages
Nodes will periodically retry communication with the "failed" nodes
Failure discovery is driven by client requests; in their absence, no detection
happens
No node is automatically marked as departed --> requires human intervention
Adding/Removing Storage Nodes:
When node X is added to the system, it gets assigned a number of tokens that are
scattered on the ring
Implementation:
Storage nodes have 3 components:
Request Coordination:
Event-driven messaging substrate (staged message processing pipeline, like in SEDA)
Communications are implemented with Java non-blocking IO channels
The node receiving a request creates a state machine, it contains:
Logic for identifying the nodes responsible for a key
Requesting, awaiting responses and potentially retrying
Processing replies and packaging the response to the client
There is a state machine for each individual request
After the read request completes the state machine waits for a while for "late"
responses. If any of the responses are stale, the coordinator will update
corresponding nodes. This process is called read repair
The coordinator for a write is chosen to be the node that replied fastest to the
preceding read operation (writes usually follow reads, and this increases the chance
of "read-your-writes" consistency)
Membership
Failure detection
The local persistence component sits behind a common interface, so different storage
engines (e.g. Berkeley DB, MySQL, an in-memory buffer with persistent backing) can be
plugged in; they differ in the object sizes they handle best and their storage models
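The read-repair step from the request-coordination pipeline above can be sketched as (the version tuples and node names are illustrative):

```python
def read_repair(responses: dict):
    """responses: {node: (version, value)}. Returns the freshest copy plus
    the list of stale nodes the coordinator should update after replying."""
    freshest = max(responses.values())
    stale = [node for node, resp in responses.items() if resp < freshest]
    return freshest, stale

latest, stale = read_repair({"A": (2, "new"), "B": (1, "old"), "C": (2, "new")})
assert latest == (2, "new") and stale == ["B"]
```

This keeps repair off the critical path: the client already has its answer, and the coordinator updates stragglers opportunistically.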
Lessons Learned:
General:
Business logic specific reconciliation:
The shopping cart example -> added items are never lost, but we can tolerate
resurfacing of the deleted items.
Timestamp based reconciliation:
Last write wins policy
Useful for the client session management
High performance read engine:
Easy to balance what the service is tailored for by adjusting R and W accordingly
Balancing Performance and durability:
Keeping an in-memory write buffer helped to bridge that gap
Ensuring Uniform Load distribution:
Uniform key (hash) distribution spreads the load evenly across the ring and hence
across the nodes