RDD - Resilient Distributed Datasets -> A distributed memory abstraction
Designed to perform in-memory computations on large clusters in a fault-tolerant manner
Reusing data across computations is costly in existing frameworks -> intermediate results have to be
written out to a replicated distributed file system (e.g. GFS/HDFS), and the I/O and replication cost
simply becomes too large for huge datasets
Interface based on coarse-grained transformations (map, filter and join) that apply the same
operation to many data items -> log the transformations used to build the dataset, not the
actual data
Hence, if a partition gets lost we have enough information on how it was derived from other
partitions (or from stable storage) to recompute it entirely
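As a rough illustration (the MiniRDD names below are made up, not Spark's actual classes): each dataset only records its parent and the function applied to it, so a lost partition can be recomputed by replaying that lineage from the source.

```scala
// Conceptual sketch only -- MiniRDD and friends are hypothetical, not Spark's implementation.
sealed trait MiniRDD[T] {
  def compute(): Seq[T]                                    // recompute the data from lineage
  def map[U](f: T => U): MiniRDD[U]       = Mapped(this, f)
  def filter(p: T => Boolean): MiniRDD[T] = Filtered(this, p)
}
final case class FromStorage[T](load: () => Seq[T]) extends MiniRDD[T] {
  def compute(): Seq[T] = load()                           // base case: re-read stable storage
}
final case class Mapped[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
  def compute(): Seq[U] = parent.compute().map(f)          // only the function is logged, not the data
}
final case class Filtered[T](parent: MiniRDD[T], p: T => Boolean) extends MiniRDD[T] {
  def compute(): Seq[T] = parent.compute().filter(p)
}
```

If the in-memory copy of such a dataset is lost, calling compute() rebuilds it from its parent; nothing but the source and the applied functions had to be stored.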
RDDs can be created via deterministic operations on either data in persistent storage or another RDD.
These operations are referred to as transformations
Examples of transformations are map, filter and join
RDDs do not need to be materialised at all times; instead, each RDD carries enough information about
how it was derived (its lineage) to compute the dataset from its source (e.g. persistent storage)
Program cannot reference an RDD that it cannot reconstruct after a failure
Users control persistence and partitioning of an RDD to optimise for operations they are
going to perform on it
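A minimal Spark (Scala) sketch of these controls; the SparkContext sc, the HDFS path and the record layout are assumptions for illustration:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Assumed: an existing SparkContext `sc`; the path and comma-separated layout are placeholders.
val visits = sc.textFile("hdfs://.../visits")
  .map(line => (line.split(',')(0), line))                // key each record by a hypothetical user id

// Control partitioning: hash-partition by key so later joins/group-bys on the same key
// do not need to reshuffle this dataset.
val byUser = visits.partitionBy(new HashPartitioner(64))

// Control persistence: keep the partitioned RDD around for reuse, spilling to disk
// if it does not fit in memory.
byUser.persist(StorageLevel.MEMORY_AND_DISK)
```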
Spark interface:
Each dataset is represented as an object and transformations are invoked using methods of
those objects
Start by defining an RDD by using transformations on data in stable storage
These new RDDs can be used in actions which are operations that return a value to the
application or export data to a storage system (e.g. count, collect, save)
Spark computes RDDs lazily (the first time they are used in an action)
There is a persist method to indicate which RDDs will be used later (and hence must be
persisted)
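Putting the interface together, a sketch along the lines of the paper's log-mining example (the path and the tab-separated field index are placeholders; sc is an existing SparkContext):

```scala
// Define an RDD from stable storage, then derive new RDDs via transformations (all lazy).
val lines  = sc.textFile("hdfs://.../logs")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()                                          // ask Spark to keep `errors` in memory once computed

// Actions trigger the computation and return results to the application.
val errorCount = errors.count()                           // first action: materialises (and caches) `errors`

// Later queries reuse the cached RDD instead of re-reading the log files.
val hdfsFields = errors.filter(_.contains("HDFS"))
  .map(line => line.split('\t')(3))                       // field index 3 is a placeholder
  .collect()
```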
Advantages of the RDD model:
Compared with Distributed Shared Memory (DSM)
RDDs can only be created ("written") through coarse-grained transformations
This restricts them to applications that do bulk writes, but makes fault tolerance more efficient than in DSM
RDDs do not need replication or checkpointing for fault tolerance, as lost partitions can be recomputed using lineage
RDDs are immutable -> slow (straggler) nodes can be mitigated in the same way MapReduce does it, by running backup copies of tasks
In bulk operations on RDDs, tasks can be scheduled based on data locality to improve performance
RDDs degrade gracefully when there is not enough memory to store them (partitions that do not fit can be stored on disk instead)