Apache Spark (developed at UC Berkeley)
  1. RDD - Resilient Distributed Datasets -> A distributed memory abstraction
  2. Designed to perform in-memory computations on large clusters in a fault-tolerant manner
  3. Reusing data across computations is costly in existing frameworks -> intermediate results have to be written to a distributed file system (e.g. GFS), and the I/O cost simply becomes too large for huge datasets
  4. Interface based on coarse-grained transformations (map, filter and join) that apply the same operation to many data items -> log the transformations used to build the dataset, not the actual data
  5. Hence, if a partition gets lost we have enough information about how it was derived from other partitions to recompute it entirely
  1. RDD:
    • RDD - read-only partitioned collection of records
    • RDD can be created via deterministic operations on either persistent storage or another RDD. These operations are referred to as transformations
    • Examples of transformations are map, filter and join
    • RDDs do not need to be materialised at all times; instead they carry enough information about how they were derived (their lineage) to recompute the dataset from its source (e.g. persistent storage) (see the lineage sketch after this list)
    • Program cannot reference an RDD that it cannot reconstruct after a failure
    • Users control persistence and partitioning of an RDD to optimise for operations they are going to perform on it
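A minimal sketch of the lineage idea, assuming an existing SparkContext bound to `sc` and a hypothetical HDFS path (`toDebugString` is Spark's way of printing an RDD's lineage):

```scala
// Minimal sketch, assuming a SparkContext `sc` and a hypothetical input path.
val lines  = sc.textFile("hdfs://namenode/logs.txt")  // RDD backed by stable storage
val errors = lines.filter(_.startsWith("ERROR"))      // derived via a deterministic transformation
val fields = errors.map(_.split("\t")(1))             // further derived RDD

// Nothing has been materialised yet; each RDD only records how it was
// derived from its parents. toDebugString prints that lineage chain.
println(fields.toDebugString)
```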
  2. Spark interface:
    • Each dataset is represented as an object and transformations are invoked using methods of those objects
    • Start by defining an RDD by using transformations on data in stable storage
    • These new RDDs can be used in actions, which are operations that return a value to the application or export data to a storage system (e.g. count, collect, save); see the sketch after this list
    • Spark computes RDDs lazily (the first time they are used in an action)
    • There is a persist method to indicate which RDDs will be used later (and hence must be persisted)
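A short sketch of this interface, modelled on the console log-mining example from the RDD paper (the HDFS paths are placeholders and `sc` is assumed to be an existing SparkContext):

```scala
// Sketch of transformations (lazy), persist, and actions.
val lines  = sc.textFile("hdfs://namenode/logs")     // RDD over data in stable storage
val errors = lines.filter(_.contains("ERROR"))       // transformation: nothing runs yet

errors.persist()                                     // hint: keep this RDD in memory for reuse

val total  = errors.count()                          // action: triggers the actual computation
val sample = errors.take(10)                         // action: returns records to the driver
errors.saveAsTextFile("hdfs://namenode/error-logs")  // action: exports the dataset to storage
```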
  3. Advantages of an RDD model:
    • Compared with Distributed Shared Memory (DSM)
    • RDD can only be created ("written") through coarse-grained transformations
    • RDDs are well suited to applications that perform bulk reads/writes and are more fault tolerant than DSM
    • RDDs do not need replication or checkpointing for fault tolerance, as lost partitions can be recomputed using lineage
    • RDDs are immutable -> slow (straggler) nodes can be mitigated in the same way MapReduce does it, by running backup copies of tasks (see the sketch after this list)
    • In bulk operations on RDDs the tasks can better exploit data locality
    • RDDs degrade gracefully when there is not enough memory to store them (partitions that do not fit in RAM can be kept on disk)
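To make the coarse-grained, immutable nature of the model concrete, here is a small sketch (assuming `sc`; the data is purely illustrative): an RDD is never updated in place, each "write" derives a whole new RDD from its parent:

```scala
// Sketch: coarse-grained transformations instead of fine-grained writes.
val numbers = sc.parallelize(1 to 1000000)  // illustrative example data
val doubled = numbers.map(_ * 2)            // new RDD; `numbers` itself is never mutated
// If a partition of `doubled` is lost, it is rebuilt by re-applying the
// map to the corresponding partition of `numbers` (lineage), rather than
// by restoring a checkpoint or replica.
```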
  1. To use Spark, one needs to write a driver program that connects to the cluster of workers
  2. The driver defines one or more RDDs and invokes actions on them; it also tracks their lineage (see the driver sketch below)
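A minimal driver skeleton (a sketch: the master URL, application name and input path are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver sketch: connect to a cluster of workers, define RDDs,
// and invoke actions on them.
object ExampleDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ExampleDriver").setMaster("spark://master:7077")
    val sc   = new SparkContext(conf)

    val data   = sc.textFile("hdfs://namenode/input")  // RDD over stable storage
    val tokens = data.flatMap(_.split("\\s+"))         // transformation (lazy)
    println(tokens.count())                            // action: triggers execution on the workers

    sc.stop()
  }
}
```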
  1. The main issue is tracking lineage across a wide range of transformations
  2. Spark represents each RDD as:
    • A set of partitions, which are atomic pieces of the dataset
    • A set of dependencies on parent RDDs
    • A function for computing the dataset based on its parents
    • Metadata about its partitioning scheme and data placement
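A simplified rendering of that common interface (a sketch based on the description in the RDD paper; the trait and method names here are illustrative, not Spark's actual internal signatures):

```scala
// Illustrative sketch of the per-RDD interface; names are not Spark's own.
trait Partition    // an atomic piece of the dataset
trait Dependency   // a link to a parent RDD (narrow or wide)
trait Partitioner  // hash/range partitioning scheme

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                                    // the pieces of the dataset
  def dependencies: Seq[Dependency]                                 // the parent RDDs this one is derived from
  def compute(p: Partition, parents: Seq[Iterator[_]]): Iterator[T] // function computing a partition from its parents
  def partitioner: Option[Partitioner]                              // metadata: partitioning scheme, if any
  def preferredLocations(p: Partition): Seq[String]                 // metadata: data placement hints
}
```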
  3. Inter-RDD dependencies:
    • Narrow:
      • Each partition of the parent RDD is used by at most one partition of the child RDD
      • E.g. map
      • Allow pipelined execution on just one cluster node, which can compute all parent partitions
      • Recovery after a node failure is more efficient, since only the lost parent partitions need to be recomputed and no data needs to be shuffled between nodes
    • Wide:
      • Multiple child partitions may depend on a single partition of the parent RDD
      • E.g. join
      • Require data from all parent partitions to be available and to be shuffled across nodes in a MapReduce-like fashion
      • Could require complete re-execution, since a single failed node could cause the loss of a partition from all the ancestors of an RDD
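A sketch of where the two kinds of dependencies show up in practice (assuming `sc` and hypothetical input paths):

```scala
// Narrow vs wide dependencies.
val pairs = sc.textFile("hdfs://namenode/events")
  .map(line => (line.split(",")(0), 1))  // narrow: each output partition depends on one input partition

val counts = pairs.reduceByKey(_ + _)    // wide in general: needs a shuffle, since a child partition
                                         // may depend on every partition of `pairs`

val users  = sc.textFile("hdfs://namenode/users")
  .map(line => (line.split(",")(0), line))
val joined = counts.join(users)          // wide here; a join can be narrow only when both parents
                                         // are already partitioned with the same partitioner
```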
  1. Scheduling:
    • Takes into account which partitions of an RDD are available in memory
    • On an action, the scheduler examines the lineage graph and builds a Directed Acyclic Graph (DAG) of stages to execute
    • Each stage consists of as many pipelined transformations with narrow dependencies as possible
    • The boundaries for the stages are:
      • Shuffle operations required for wide dependencies
      • Already computed partitions that can short-circuit the computation of the parent RDD
    • Scheduler then launches the computation of missing partitions until it reaches the target RDD
    • Tasks are assigned based on data locality using delay scheduling
    • For wide dependencies the intermediate records are materialised in the same way as MapReduce persists the results of intermediate Map operations
    • If a task fails it is re-run on another node as long as its stage's parents are still available
    • If some stages have become unavailable, tasks are resubmitted to recompute their missing partitions in parallel
    • All computations in Spark run in response to the driver actions (user supplied)
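A sketch of how stage boundaries fall out of a small job (assuming `sc` and placeholder paths): the narrow transformations are pipelined into one stage, the shuffle for reduceByKey starts a new one, and nothing is scheduled until the final action runs:

```scala
// How the scheduler splits a job into stages at shuffle boundaries.
val words = sc.textFile("hdfs://namenode/text") // stage 1 starts here
  .flatMap(_.split("\\s+"))                     //   narrow: pipelined into the same stage
  .filter(_.nonEmpty)                           //   narrow: same stage
  .map(word => (word, 1))                       //   narrow: same stage

val counts = words.reduceByKey(_ + _)           // wide dependency -> shuffle -> stage 2

// Only the action below triggers scheduling: the DAG of stages is built
// from the lineage graph and tasks are launched for the missing partitions.
counts.saveAsTextFile("hdfs://namenode/word-counts")
```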
  2. Interpreter Integration:
    • Spark can be run interactively from the Scala interpreter shell
    • The Scala interpreter had to be modified so that the classes generated for each line of code are serialisable and can be shipped to the workers (see the session sketch below)
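What interactive use looks like, as a rough sketch of a spark-shell session (the shell provides a ready-made SparkContext bound to `sc`; the master URL and path are placeholders):

```scala
// $ ./bin/spark-shell --master spark://master:7077
// Each line typed below is compiled, shipped to the workers and executed.
val lines  = sc.textFile("hdfs://namenode/logs")
val errors = lines.filter(_.contains("ERROR"))
errors.count()
```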
  3. Memory Management:
    • Three options for storage of persistent RDDs
    • In-memory:
      • Stored as deserialised Java objects
      • Fastest performance (JVM can access each RDD element natively)
    • In-memory:
      • Stored as serialised data
      • Higher memory efficiency for limited memory
    • On-disk:
      • For RDDs that are too large to be kept in RAM, but costly to recompute on each use
    • Eviction:
      • LRU eviction at the level of RDDs: when a new partition does not fit, a partition of the least recently accessed RDD is evicted (unless it belongs to the same RDD as the new partition)
    • Each instance of Spark on a cluster has its own memory space (no unified memory)
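These three options correspond to Spark's storage levels; a sketch (assuming `sc`; the paths are placeholders, and a given RDD can only be assigned one storage level):

```scala
import org.apache.spark.storage.StorageLevel

val fast    = sc.textFile("hdfs://namenode/hot-data")
val compact = sc.textFile("hdfs://namenode/warm-data")
val huge    = sc.textFile("hdfs://namenode/cold-data")

fast.persist(StorageLevel.MEMORY_ONLY)        // deserialised Java objects in memory (fastest access)
compact.persist(StorageLevel.MEMORY_ONLY_SER) // serialised bytes in memory (more memory-efficient)
huge.persist(StorageLevel.DISK_ONLY)          // on disk: too large for RAM, but costly to recompute
```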
  4. Support for Checkpointing:
    • Using lineage to recover every fault may be too expensive (long lineage chains)
    • If we checkpoint some RDDs into persistent storage, we can "split" those lineage chains
    • Checkpointing is a lot easier than for distributed shared memory, since RDDs are read-only
    • No consistency considerations means that the checkpoints can be written in the background
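A sketch of cutting a long lineage chain with an explicit checkpoint (setCheckpointDir and RDD.checkpoint are the relevant calls; `sc`, the paths and the iteration counts are placeholders):

```scala
// Truncating lineage in an iterative job.
sc.setCheckpointDir("hdfs://namenode/spark-checkpoints")

var ranks = sc.textFile("hdfs://namenode/links").map(line => (line, 1.0))
for (i <- 1 to 30) {
  ranks = ranks.mapValues(_ * 0.85)  // every iteration grows the lineage chain
  if (i % 10 == 0) {
    ranks.checkpoint()               // mark this RDD to be written to stable storage...
    ranks.count()                    // ...and run an action so the checkpoint is materialised
  }
}
```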