By Yahoo!
  1. Intro:
    • Aims to provide a simple and high performance kernel for building coordination primitives in the client
    • Provides per-client FIFO execution of requests, and linearizability for all requests that change the ZooKeeper state
    • The core idea: instead of implementing specific synchronisation primitives server-side, ZooKeeper exposes an API that lets clients define their own
    • The exposed API manipulates simple wait-free data objects organised hierarchically
    • The ZooKeeper API resembles that of a filesystem (by its signatures, Chubby without the lock, open and close methods)
    • Exposed API can implement consensus for any number of processes
    • Pipelined architecture --> per-client FIFO ordering for free
    • Guaranteeing FIFO enables clients to submit requests asynchronously (multiple outstanding operations at a time):
      • E.g. a newly elected leader has to update metadata. With the updates issued asynchronously, initialisation is of sub-second order (compared to the order of seconds when they are issued one at a time)
    • Zab --> leader-based atomic broadcast protocol (orders updates, making them linearizable)
    • Read operations are processed locally by the servers and are not totally ordered through Zab
    • Client-side caching is used for things like the id of the current leader; watches keep the cache up to date (observer pattern):
      • Better than Chubby, since Chubby blocks updates while it invalidates the caches of all clients caching the changed data
      • Chubby uses leases to bound the impact of slow/faulty clients, but ZooKeeper avoids the problem altogether by letting clients manage their own caches via watches
    • Only writes are linearizable
  2. The ZooKeeper service:
    • Overview:
      • Client -- user of the ZooKeeper service
      • Server -- process providing the ZooKeeper service
      • Znode -- in-memory data node in the ZooKeeper data tree
      • A znode is referred to with standard UNIX path notation, e.g. /A/B/C
    • Znodes:
      • Regular -- created and deleted by client explicitly
      • Ephemeral -- Created explicitly, deleted explicitly or when the creating session terminates
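      • Either type can be created with the sequential flag set. If set, the value of a monotonically increasing counter is appended to the znode's name (the counter value is never smaller than that of any sequential znode created earlier under the same parent)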
      • Unlike files, znodes are not designed for general data storage
      • Have associated metadata with timestamps and version counters (allows tracking changes and conditional updates)
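The znode types and the sequential flag can be illustrated with a minimal in-memory sketch. Everything here (`ToyTree`, `Znode`, the method names, the ten-digit zero padding) is a hypothetical stand-in chosen for illustration, not the real ZooKeeper client API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Znode:
    data: bytes = b""
    version: int = 0                        # would be bumped on every data update
    ephemeral_owner: Optional[int] = None   # session id, or None for a regular znode
    next_seq: int = 0                       # counter for sequential children of this znode


class ToyTree:
    """Hypothetical in-memory stand-in for the ZooKeeper data tree."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Znode] = {"/": Znode()}

    def create(self, path: str, data: bytes = b"", session: Optional[int] = None,
               ephemeral: bool = False, sequential: bool = False) -> str:
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        if sequential:
            # Append the parent's monotonically increasing counter to the name.
            seq = self.nodes[parent].next_seq
            self.nodes[parent].next_seq += 1
            path = f"{path}{seq:010d}"
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        owner = session if ephemeral else None
        self.nodes[path] = Znode(data, ephemeral_owner=owner)
        return path

    def close_session(self, session: int) -> None:
        # Ephemeral znodes are removed when their creating session ends.
        doomed = [p for p, n in self.nodes.items() if n.ephemeral_owner == session]
        for path in doomed:
            del self.nodes[path]


tree = ToyTree()
tree.create("/app")                                   # regular znode
first = tree.create("/app/lock-", session=1, ephemeral=True, sequential=True)
second = tree.create("/app/lock-", session=2, ephemeral=True, sequential=True)
tree.close_session(1)                                 # removes only session 1's znode
```

After `close_session(1)` the regular znode `/app` and the second ephemeral znode remain, while the first one is gone.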
    • Watches (Observer subscriptions):
      • Update clients in a timely manner and avoid polling
      • Read operations have a watch flag. When set, the server promises to notify the client when the information it has just returned (as part of the read) changes
      • Watches are one-time triggers associated with a session (unregistered once triggered or the session closes)
      • Watches indicate that a change has happened but do not carry the changed data
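The one-time-trigger behaviour of watches can be sketched with a toy store. `WatchedStore` and its methods are hypothetical names for illustration; in the real service the notification is delivered to the client's session rather than invoked as an in-process callback.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Optional

Watch = Callable[[str], None]


class WatchedStore:
    """Hypothetical toy store; not the real ZooKeeper client API."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        self.watches: DefaultDict[str, List[Watch]] = defaultdict(list)

    def get_data(self, path: str, watch: Optional[Watch] = None) -> bytes:
        if watch is not None:
            self.watches[path].append(watch)   # the watch is registered by the read
        return self.data[path]

    def set_data(self, path: str, value: bytes) -> None:
        self.data[path] = value
        # Pop before firing, so each registered watch triggers at most once.
        for callback in self.watches.pop(path, []):
            callback(path)   # only the path is delivered, not the new value


store = WatchedStore()
store.set_data("/config", b"v1")

events: List[str] = []
store.get_data("/config", watch=events.append)   # read with the watch flag set
store.set_data("/config", b"v2")                  # fires the watch
store.set_data("/config", b"v3")                  # no watch is registered any more
```

Only the first change after the read produces an event; to keep following changes the client has to read again with the watch set.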
    • Data Model:
      • Filesystem (key/value storage with hierarchical keys) with full data reads and writes
      • Hierarchical namespace is useful for distinguishing applications and setting access rights
    • Sessions:
      • Client connection initiates the session
      • Sessions have a timeout
      • Client is considered faulty if the service receives nothing from its session for longer than the timeout
      • Session ends when the client closes it explicitly, or when ZooKeeper detects that the client is faulty
      • Within a session the client observes a succession of state changes (execution of its operations)
      • Sessions enable clients to move transparently from one server to another (persist across servers)
    • Client API:
      • Allows creation, deletion and exists check on znodes
      • Get and set data stored in the znode
      • Get children of a znode
      • sync call: waits for all updates pending at the start of the operation to propagate to the server that the client is connected to
      • All methods have both synchronous and asynchronous versions
      • ZooKeeper does not use handles to access znodes (every request uses the full path)
      • Each update method takes an "expected version number" --> enables conditional updates
      • Expected version -1 --> no version check on update
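Conditional updates via the expected version number can be sketched as an optimistic read-modify-write loop. `VersionedStore`, `BadVersion` and `increment` are hypothetical names for illustration; the only behaviour taken from the notes is that an update succeeds when the expected version matches, and that -1 skips the check.

```python
from typing import Dict, Tuple


class BadVersion(Exception):
    """Hypothetical error raised when the expected version is stale."""


class VersionedStore:
    """Toy stand-in for versioned znodes; not the real ZooKeeper client API."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Tuple[bytes, int]] = {}

    def create(self, path: str, data: bytes) -> None:
        self.nodes[path] = (data, 0)

    def get_data(self, path: str) -> Tuple[bytes, int]:
        return self.nodes[path]   # (data, version)

    def set_data(self, path: str, data: bytes, expected_version: int) -> int:
        _, current = self.nodes[path]
        if expected_version != -1 and expected_version != current:
            raise BadVersion(f"{path}: expected {expected_version}, found {current}")
        self.nodes[path] = (data, current + 1)
        return current + 1


def increment(store: VersionedStore, path: str) -> int:
    """Retry the read-modify-write until no other writer got in between."""
    while True:
        raw, version = store.get_data(path)
        value = int(raw.decode()) + 1
        try:
            store.set_data(path, str(value).encode(), version)
            return value
        except BadVersion:
            continue   # someone else updated the znode; re-read and retry


store = VersionedStore()
store.create("/counter", b"0")
new_value = increment(store, "/counter")        # version goes from 0 to 1

try:
    store.set_data("/counter", b"99", 0)        # stale expected version
    stale_rejected = False
except BadVersion:
    stale_rejected = True

forced_version = store.set_data("/counter", b"0", -1)   # -1 skips the version check
```

The stale write is rejected because the znode is already at version 1, while the write with -1 goes through unconditionally.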
    • Guarantees:
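      • FIFO client order and linearizable writes (A-linearizability, i.e. asynchronous linearizability: unlike the original definition, a client may have multiple outstanding operations)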
      • Read requests are processed locally at each replica
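      • Only updates are A-linearizable
      • Example: a newly elected leader changing configuration -- it deletes the ready znode, updates the configuration znodes, then creates ready again; other processes use the configuration only while ready exists
      • Liveness - if a majority of servers are active and communicating, the service is available
      • Durability - if the service responds successfully to an update, the update persists across any number of failures as long as a quorum of servers is eventually able to recover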
    • Examples of primitives (more powerful primitives built on the API):
      • All primitives are implemented in the client
      • Configuration Management:
        1. Configuration is stored in znode A
        2. Processes start up with a full path to A
        3. Starting processes read A with the watch flag set to true
        4. If A ever changes, they are notified and read the new configuration
        5. They set the watch flag again on that read, so later changes are also noticed
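The configuration-management steps above can be sketched as follows. `ConfigStore` and `Process` are hypothetical names, and the watch is modelled as an in-process callback purely for illustration; the point is that every re-read re-arms the one-time watch.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Optional

Watch = Callable[[str], None]


class ConfigStore:
    """Hypothetical toy store with one-time watches; not the real client API."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        self.watches: DefaultDict[str, List[Watch]] = defaultdict(list)

    def get_data(self, path: str, watch: Optional[Watch] = None) -> bytes:
        if watch is not None:
            self.watches[path].append(watch)
        return self.data[path]

    def set_data(self, path: str, value: bytes) -> None:
        self.data[path] = value
        for callback in self.watches.pop(path, []):   # one-time triggers
            callback(path)


class Process:
    def __init__(self, store: ConfigStore, config_path: str) -> None:
        self.store = store
        self.config = b""
        self.reload(config_path)   # steps 2-3: initial read with the watch set

    def reload(self, path: str) -> None:
        # Steps 4-5: on notification, re-read and set the watch again.
        self.config = self.store.get_data(path, watch=self.reload)


store = ConfigStore()
store.set_data("/app/config", b"threads=4")            # step 1: config lives in one znode
workers = [Process(store, "/app/config") for _ in range(3)]
store.set_data("/app/config", b"threads=8")            # every worker is notified
store.set_data("/app/config", b"threads=16")           # and again, thanks to re-arming
```

Without the re-registration in `reload`, the workers would see the first change only and keep a stale configuration afterwards.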
      • Rendezvous:
        1. Client creates a rendezvous node A
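        2. Client passes the full path to A as a startup parameter to the master and worker processes
        3. When the master starts, it fills A with the addresses and ports it is using
        4. When workers start, they read A with the watch flag set; if A is not filled in yet, they are notified once it is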
      • Group Membership:
        1. Use the fact that ephemeral nodes allow us to see the state of the session that created them
        2. Create znode A to represent the group
        3. When a process that is a member of the group starts, it creates an ephemeral child of A (A/Ai)
        4. To ensure that names are unique (if not already), processes can use the sequential flag
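The group-membership recipe can be sketched with a toy store that tracks which session owns each child. `GroupStore`, its methods and the `member-` prefix are hypothetical names for illustration, not the real ZooKeeper client API.

```python
from typing import Dict, List


class GroupStore:
    """Hypothetical toy model of ephemeral, sequential children under a group znode."""

    def __init__(self) -> None:
        self.members: Dict[str, Dict[str, int]] = {}   # group -> {child name: owning session}
        self.counters: Dict[str, int] = {}             # group -> next sequence number

    def create_group(self, group: str) -> None:
        self.members.setdefault(group, {})
        self.counters.setdefault(group, 0)

    def join(self, group: str, session: int, prefix: str = "member-") -> str:
        # Ephemeral + sequential: the name is unique and tied to the session.
        seq = self.counters[group]
        self.counters[group] += 1
        name = f"{prefix}{seq:010d}"
        self.members[group][name] = session
        return f"{group}/{name}"

    def get_children(self, group: str) -> List[str]:
        return sorted(self.members[group])

    def close_session(self, session: int) -> None:
        # A process that fails or exits loses its session, and with it its child znode.
        for children in self.members.values():
            for name in [n for n, owner in children.items() if owner == session]:
                del children[name]


store = GroupStore()
store.create_group("/workers")
for session_id in (1, 2, 3):
    store.join("/workers", session=session_id)

before = store.get_children("/workers")
store.close_session(2)                      # process 2 dies
after = store.get_children("/workers")
```

Listing the children of the group znode gives the current membership; in this sketch the entry for session 2 disappears as soon as its session is closed.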