Lessons learned from papers

Exam Prep SSD

Lecturer seems sound and quite good
  1. Overview of general distributed scalable systems
    • Search engines (crawl, index and search)
    • Social Networking (response time, large amount of data)
    • Cloud Computing (availability and access to scalable resources)
    • CDNs (Scalable web hosting, file distribution media streaming)
  2. Design, data centres and cloud computing, scalable storage and querying, compute
  3. These are the papers for storage and querying:
    • "Bigtable: A Distributed Storage System for Structured Data", Seventh Symposium on Operating System Design and Implementation (OSDI), Seattle, WA, November 2006
    • "Dynamo: Amazon's Highly Available Key-Value Store", ACM Symposium on Operating Systems Principles (SOSP), Stevenson, WA, October 2007
    • "Spanner: Google's Globally-Distributed Database", Tenth Symposium on Operating System Design and Implementation (OSDI), Hollywood, CA, October 2012
  4. Papers for scalable compute:
    • "MapReduce: Simplified Data Processing on Large Clusters", Sixth Symposium on Operating System Design and Implementation (OSDI), San Francisco, CA, December 2004
    • "Resilient Distributed Datasets", 9th USENIX conference on Networked Systems Design and Implementation (NSDI), San Jose, CA, April 2012
  5. Method for reading papers:
    1. Skim the paper and get the gist
    2. Come back for a deep read
    3. Look at sample questions and find answers in the paper
  6. Heinis will deal with scalable data
  7. Prepare and work on research papers in lectures and seminars, as one of the courseworks is answering questions on a paper
  8. The exam will also have a paper-based question
  9. Resources:
    1. "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems", Martin Kleppmann, O'Reilly Media, September 2014
      • Focuses more on the data management side
      • Recommended
    2. "The Art of Scalability: Scalable Web Architecture, Processes and Organizations for the Modern Enterprise", Martin L. Abbott, Michael T. Fisher, Addison Wesley, 1st Edition, December 2009
      • A little more high-level
      • A little outdated
  10. Blogs:
    • http://highscalability.com/
    • http://www.allthingsdistributed.com/ (Werner Vogels' blog)
    • http://perspectives.mvdirona.com/ (James Hamilton's blog)
  11. Spanner is the hardest paper covered
Scalable Distributed Systems
  1. Mainframe:
    1. Single point of failure
    2. Does not scale incrementally
    3. Slow if used as a CDN
  2. Data Centres:
    1. Scale out - horizontal
  3. Types of Scalable Systems:
    1. Online and user-facing (latency of < 100 ms)
    2. Batch processing systems (> 1 hr)
      • Hadoop, Spark
      • Offline data processing
    3. Nearline systems (< 1 sec)
      • Dynamic content presented to users
      • CDN-ed content
      • Prediction, recommendations, etc.
  4. Design principles:
    • Stateless services
    • Caching
    • Partition/aggregation pattern
    • Weaker consistency
    • Efficient failure recovery
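To make the partition/aggregation pattern concrete, here is a minimal sketch. It is not taken from the lectures or the papers. The `SHARDS` data, the `search_shard` and `search` names, and the term-frequency scoring are all hypothetical. The idea shown is that an aggregator fans one query out to every partition in parallel and then merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical in-memory partitions: each shard maps document id -> text.
SHARDS = [
    {1: "bigtable stores structured data", 2: "dynamo is a key-value store"},
    {3: "spanner is a globally distributed database",
     4: "mapreduce processes data on large clusters"},
    {5: "spark builds on resilient distributed datasets",
     6: "data centres scale out"},
]


def search_shard(shard, term):
    """Leaf worker: score the documents of one partition by term frequency."""
    hits = []
    for doc_id, text in shard.items():
        score = text.split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def search(term, k=2):
    """Aggregator: query all shards in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term), SHARDS))
    all_hits = [hit for partial in partials for hit in partial]
    # Keep the k best (score, doc_id) pairs across all partitions.
    return [doc_id for _score, doc_id in heapq.nlargest(k, all_hits)]
```

In a real system each shard would be a separate machine, and the aggregator would also need timeouts so that one slow partition does not hold up the whole response.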
Missed
BigTable discussion

BigTable

Dynamo discussion

Dynamo

Spanner discussion

Spanner

MapReduce discussion

MapReduce

Spark discussion

Spark

Oh the pain. The pain. It always rains. In my soul

ZooKeeper Notes

This could be a really good exam question (C) Thomas Heinis

How can we make a data structure efficient for main-memory write/read-heavy loads? The answers are on the Cache-Sensitive Search Tree slides.
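As a reminder of the general idea, here is a simplified sketch in the spirit of a cache-sensitive search tree. It is not the exact layout from the slides. The sorted keys sit in one contiguous array, a small directory of per-node maximum keys is built on top, and child positions are computed arithmetically instead of following pointers. `NODE_KEYS`, `build_levels` and `search` are hypothetical names. In a real implementation the node size would be matched to the cache line, which Python cannot demonstrate.

```python
from bisect import bisect_left

NODE_KEYS = 4  # keys per node (assumption; in practice sized to a cache line)


def build_levels(sorted_keys, node_keys=NODE_KEYS):
    """Build the directory bottom-up.

    levels[0] is the sorted key array; each higher level stores the
    maximum key of every node in the level below it.
    """
    levels = [list(sorted_keys)]
    while len(levels[-1]) > node_keys:
        below = levels[-1]
        levels.append([
            below[min(start + node_keys, len(below)) - 1]
            for start in range(0, len(below), node_keys)
        ])
    return levels


def search(levels, key, node_keys=NODE_KEYS):
    """Return the index of key in levels[0], or -1 if it is absent."""
    node = 0  # node index within the current level; the root is node 0
    for level in reversed(levels):
        start = node * node_keys
        block = level[start:start + node_keys]
        pos = bisect_left(block, key)
        if pos == len(block):
            return -1  # key is larger than everything in this subtree
        # Entry i of this level describes node i of the level below,
        # so the child is found by arithmetic, with no stored pointer.
        node = start + pos
    return node if levels[0][node] == key else -1
```

For example, with `levels = build_levels(range(0, 100, 3))`, `search(levels, 27)` walks three levels and returns index 9.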

ACA is king

READ the DBMS book:
  • Chapter 7: Storage
  • Chapter 8: Indexes
  • Chapter 18: Transactions