Google
  1. Large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures. E.g. memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems, GFS quotas, planned and unplanned hardware maintenance.
  2. Another lesson we learned is that it is important to delay adding new features until it is clear how the new features will be used. For example, we initially planned to support general-purpose transactions in our API. Because we did not have an immediate use for them, however, we did not implement them.
  3. The most important lesson we learned is the value of simple designs. Given both the size of our system (about 100,000 lines of non-test code), as well as the fact that code evolves over time in unexpected ways, we have found that code and design clarity are of immense help in code maintenance and debugging.
Amazon

1.

Google

1.