Data Intensive App #8 | The Trouble with Distributed Systems

8.0 Introduction

  • In the last few chapters => how systems handle things going wrong
  • However, even though we have talked a lot about faults, the last few chapters have still been too optimistic
    • The reality is even darker
    • we will now turn our pessimism to the maximum and assume that anything that can go wrong will go wrong
  • In this chapter
    • we will get a taste of the problems that arise in practice
    • and understanding of the things we can and cannot rely on
    • we must understand what challenges we are up against
  • Next chapter,
    • In spite of everything going wrong, our task as engineers is to build systems that do their job (i.e., meet the guarantees that users are expecting)
    • We will look at some examples of algorithms that can provide such guarantees in a distributed system



8.1 Faults and Partial Failures

  • When you are writing a program on a single computer, it normally behaves in a fairly predictable way : either it works or it doesn’t
  • When you are writing software that runs on several computers, connected by a network, the situation is fundamentally different.
  • Partial failure : In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine
    • Nondeterministic : if you try to do anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably fail.
  • This non-determinism and possibility of partial failures is what makes distributed systems hard to work with.

8.1.1 Cloud Computing and Supercomputing

  • philosophies on how to build large-scale computing systems:
    1. Cloud Computing : commodity computers connected with an IP network
    2. HPC (high-performance computing) : supercomputers with thousands of CPUs
  • In a supercomputer : If one node fails, a common solution is to simply stop the entire cluster workload. After the faulty node is repaired, the computation is restarted from the last checkpoint. => A supercomputer is more like a single-node computer than a distributed system.
  • In contrast, for internet-service systems we must accept the possibility of partial failure and build fault-tolerance mechanisms into the software. In other words, we need to build a reliable system from unreliable components.



8.2 Unreliable Networks

  • The distributed systems we focus on in this book are shared-nothing systems: i.e., a bunch of machines connected by a network

  • The internet and most internal networks in datacenters are asynchronous packet networks.

    => one node can send a message (a packet) to another node, but the network gives no guarantees as to when it will arrive or whether it will arrive at all.

    => If you send a request and expect a response, many things could go wrong.

  • The sender can’t even tell whether the packet was delivered: the only option is for the recipient to send a response message, which may in turn be lost or delayed.

  • The usual way of handling this issue is a timeout: after some time you give up waiting and assume that the response is not going to arrive.
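
  A minimal sketch of the timeout pattern in Python (the host, port, and message format are hypothetical placeholders). Note that when the timeout fires, the sender still cannot tell whether the request was lost, the remote node is down or merely slow, or the response was lost on the way back.

    import socket

    HOST, PORT = "10.0.0.5", 9000   # hypothetical remote node
    TIMEOUT_SECONDS = 5.0

    def call_remote_node(payload: bytes):
        """Send a request and give up waiting for a reply after TIMEOUT_SECONDS."""
        try:
            with socket.create_connection((HOST, PORT), timeout=TIMEOUT_SECONDS) as sock:
                sock.settimeout(TIMEOUT_SECONDS)   # also bound the wait for the response
                sock.sendall(payload)
                return sock.recv(4096)
        except OSError:
            # Timed out or connection failed: we only know that no response arrived,
            # not whether the request was ever processed by the remote node.
            return None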

8.2.1 Network Faults in Practice

  • We have been building computer networks for decades—one might hope that by now we would have figured out how to make them reliable. => However, it seems that we have not yet succeeded.
  • network problems can be surprisingly common => one study of a medium-sized datacenter found about 12 network faults per month
  • Even if network faults are rare in your environment, the fact that faults can occur means that your software needs to be able to handle them

8.2.2 Detecting Faults

  • Many systems need to automatically detect faulty nodes
    • ex) In a distributed DB with single-leader replication, if the leader fails, one of the followers needs to be promoted to be the new leader
  • the uncertainty about the network makes it difficult to tell whether a node is working or not

8.2.3 Timeouts and Unbounded Delays

  • If a timeout is the only sure way of detecting a fault, then how long should the timeout be? : “Failure detection delay” vs “risk of premature timeouts”

    • A long timeout : a long wait until a node is declared dead
    • A short timeout : detects faults faster, but carries a higher risk of incorrectly declaring a node dead
      • When a node is declared dead, its responsibilities need to be transferred to other nodes,
      • => places additional load on other nodes and the network.
      • => If the system is already struggling with high load, declaring nodes dead prematurely can make the problem worse.
  • In a fictitious system with a maximum network delay of $d$ and a guarantee that a non-failed node always handles a request within some time $r$, $2d+r$ would be a reasonable timeout to use

  • Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks have unbounded delays (that is, they try to deliver packets as quickly as possible, but there is no upper limit on the time it may take for a packet to arrive),

  • rather than using configured constant timeouts, systems can continually measure response times and their variability (jitter), and automatically adjust timeouts according to the observed response time distribution. (Phi Accrual failure detector, used in Akka and Cassandra)
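
  The actual Phi Accrual detector computes a suspicion level from the observed response-time distribution; the sketch below only captures the underlying idea (derive the timeout from a sliding window of measured response times and their jitter) and is not the algorithm used in Akka or Cassandra.

    import statistics
    from collections import deque

    class AdaptiveTimeout:
        """Toy adaptive timeout based on recent response times (illustrative only)."""

        def __init__(self, window: int = 100, initial_timeout: float = 1.0):
            self.samples = deque(maxlen=window)     # sliding window of response times (seconds)
            self.initial_timeout = initial_timeout

        def record(self, response_time: float) -> None:
            self.samples.append(response_time)

        def current_timeout(self) -> float:
            if len(self.samples) < 2:
                return self.initial_timeout
            mean = statistics.mean(self.samples)
            jitter = statistics.stdev(self.samples)
            # Allow the observed mean plus a generous multiple of the observed jitter.
            return mean + 4 * jitter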




8.3 Unreliable Clocks

  • Clocks and time are important. Applications depend on clocks in various ways to answer questions like the following:

    1. Has this request timed out yet?

    2. What’s the 99th percentile response time of this service?

    3. How many queries per second did this service handle on average in the last five minutes?

    4. How long did the user spend on our site?

  • Each machine on the network has its own clock, which is an actual hardware device: usually a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own notion of time, which may be slightly faster or slower than on other machines.

8.3.1 Time-of-Day Clocks vs. Monotonic Clocks

  • Time-of-day clocks : do what you intuitively expect of a clock: return the current date and time according to some calendar

    • clock_gettime(CLOCK_REALTIME) on Linux, System.currentTimeMillis() in Java
    • return the number of seconds since the epoch: midnight UTC on January 1, 1970
    • usually synchronized with NTP (Network Time Protocol), which means that a time‐stamp from one machine (ideally) means the same as a timestamp on another machine.
    • if the local clock is too far ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in time => makes time-of-day clocks unsuitable for measuring elapsed time
  • Monotonic clocks : are guaranteed to always move forward

    • A monotonic clock is suitable for measuring a duration, such as a timeout or a service’s response time
    • clock_gettime(CLOCK_MONOTONIC) on Linux, System.nanoTime() in Java, and time.monotonic() in Python (see the sketch after this list)
    • By default, NTP allows the clock rate to be speeded up or slowed down by up to 0.05%, but NTP cannot cause the monotonic clock to jump forward or backward.
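
  A small sketch of the difference, using the standard-library Python wrappers for the clocks mentioned above: the time-of-day measurement can be wrong (even negative) if NTP steps the clock during the operation, while the monotonic measurement is safe for durations.

    import time

    wall_start = time.time()        # time-of-day clock (CLOCK_REALTIME): may jump if NTP resets it
    mono_start = time.monotonic()   # monotonic clock (CLOCK_MONOTONIC): only ever moves forward

    time.sleep(0.1)                 # stand-in for the operation being timed

    wall_elapsed = time.time() - wall_start        # unreliable if the clock was stepped meanwhile
    mono_elapsed = time.monotonic() - mono_start   # safe way to measure the elapsed duration

    print(f"time-of-day: {wall_elapsed:.3f}s  monotonic: {mono_elapsed:.3f}s")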

8.3.2 Clock Synchronization and Accuracy

  • Monotonic clocks don’t need synchronization, but time-of-day clocks need to be set according to an NTP server
  • Unfortunately, our methods for getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might hope
  • It is possible to achieve very good clock accuracy if you care about it sufficiently to invest significant resources. (e.g. financial institutions, trading funds)
  • Such accuracy can be achieved using GPS receivers, and careful deployment and monitoring. However, it requires significant effort and expertise, and there are plenty of ways clock synchronization can go wrong.

8.3.3 Relying on Synchronized Clocks

  • Although clocks work quite well most of the time, robust software needs to be prepared to deal with incorrect clocks
  • However, incorrect clocks easily go unnoticed. If a machine's quartz clock is defective or its NTP client is misconfigured, most things will seem to work fine, even though its clock gradually drifts further and further away from reality.
  • Timestamps for ordering events : relying on time-of-day clocks to order writes across nodes is dangerous; with clock skew, last write wins (LWW) can silently drop a causally later write (Figure 8-3)
  • Clock readings have a confidence interval
    • Thus, it doesn’t make sense to think of a clock reading as a point in time - it is more like a range of times, within a confidence interval
    • Google’s TrueTime API (Spanner) explicitly reports the confidence interval on the local clock. => returns [earliest, latest]
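
  A hypothetical sketch of the interval idea (not Google's actual TrueTime API): two events can be ordered with certainty only if their confidence intervals do not overlap.

    from dataclasses import dataclass

    @dataclass
    class ClockInterval:
        """The true time lies somewhere in [earliest, latest] (illustrative only)."""
        earliest: float
        latest: float

    def definitely_before(a: ClockInterval, b: ClockInterval) -> bool:
        # Only non-overlapping intervals allow a certain ordering of the two readings.
        return a.latest < b.earliest

    a = ClockInterval(earliest=100.001, latest=100.007)
    b = ClockInterval(earliest=100.010, latest=100.016)
    print(definitely_before(a, b))   # True: the intervals do not overlap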

8.3.4 Process Pauses

  • Let’s consider another example of dangerous clock use in a distributed system.
  • What if there is an unexpected pause in the execution of the program?
    • leader paused => another leader takes over => previous leader resumes, still believing it is the leader (see the sketch below)
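
  A self-contained toy sketch of the dangerous pattern (the Lease class and the write are made up for illustration): the lease looks valid at the moment of the check, but nothing prevents the process from being paused between the check and the write.

    import time

    class Lease:
        """Toy lease for illustration; not a real lock-service client."""
        def __init__(self, duration: float = 10.0):
            self.expiry = time.monotonic() + duration
        def is_valid(self) -> bool:
            return time.monotonic() < self.expiry

    def handle_request(lease: Lease, request: str) -> None:
        if lease.is_valid():
            # DANGER: if the thread is paused right here (GC, VM suspension, SIGSTOP,
            # slow disk I/O, ...) for longer than the lease duration, the lease may have
            # expired and another node may have taken over, yet this write still happens.
            print(f"writing {request!r} while believing we still hold the lease")

    handle_request(Lease(duration=10.0), "some update")
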
  • Is it crazy to assume that a thread might be paused for so long? Unfortunately not. There are various reasons why this could happen:
    • Garbage collector (GC) that occasionally needs to stop all running threads.
    • In virtualized environments, a virtual machine can be suspended and resumed
    • If the application performs synchronous disk access, a thread may be paused waiting for a slow disk I/O operation to complete
    • A Unix process can be paused by sending it the SIGSTOP signal, for example by pressing Ctrl-Z in a shell
  • When writing multi-threaded code on a single machine, we have fairly good tools for making it thread-safe: mutexes, semaphores, atomic counters, lock-free data structures, blocking queues, and so on. Unfortunately, these tools don’t directly translate to distributed systems
  • A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function. During the pause, the rest of the world keeps moving and may even declare the paused node dead because it’s not responding.
  • Response time guarantees : providing them requires a large amount of additional work and severely restricts the range of programming languages, libraries, and tools that can be used
  • Limiting the impact of garbage collection : The negative effects of process pauses can be mitigated without resorting to expensive real-time scheduling guarantees.

If a process is paused for reasons such as GC, it can cause problems. To address this, building the software to guarantee response times is expensive and very restrictive, so it is impractical; instead, simply monitoring memory usage can help prevent pauses caused by GC.




8.4 Knowledge, Truth, and Lies

  • In the rest of this chapter, we will further explore the notions of knowledge and truth in distributed systems, which will help us think about the kinds of assumptions we can make and the guarantees we may want to provide.

8.4.1 The Truth is Defined by the Majority

  • Imagine a network with an asymmetric fault:

    1. a node is able to receive all messages sent to it, but any outgoing messages are dropped or delayed
    2. After some timeout, the other nodes declare it dead,
    3. the semi-disconnected node is dragged to the graveyard, kicking and screaming “I’m not dead!”— but since nobody can hear its screaming, the funeral procession continues with stoic determination.
  • The moral of these stories is that a node cannot necessarily trust its own judgment of a situation. => Instead, many distributed algorithms rely on a quorum, that is, voting among the nodes
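
  Most commonly the quorum is an absolute majority of the nodes, which guarantees that two conflicting decisions cannot both gather a quorum. A trivial sketch of the counting rule:

    def has_quorum(votes_received: int, cluster_size: int) -> bool:
        """A decision stands only if a strict majority of all nodes voted for it."""
        return votes_received > cluster_size // 2

    print(has_quorum(2, 3))   # True: 2 out of 3 nodes agree
    print(has_quorum(2, 5))   # False: 2 out of 5 is not a majority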

  • The leader and the lock : Frequently, a system requires there to be only one of some thing. (one node is allowed to be the leader)

    • Implementing this in a distributed system requires care:
    • even if a node believes that it is “the chosen one” that doesn’t necessarily mean a quorum of nodes agrees!
    • A node may have formerly been the leader, but if the other nodes declared it dead in the meantime it may have been demoted and another leader may have already been elected. (Fig8-4)
  • Fencing tokens : When using a lock or lease to protect access to some resource, such as the file storage in Figure 8-4, we need to ensure that a node that is under a false belief of being “the chosen one” cannot disrupt the rest of the system. A fairly simple technique that achieves this goal is called fencing, and is illustrated in Figure 8-5.

    Record how many times the lease has been granted (a monotonically increasing token), so that the holder of an older token, which has been declared dead for whatever reason, cannot access the storage.
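
  A toy sketch of the storage-side check in Figure 8-5 (the class and method names are made up, not a real storage API): the storage service remembers the highest fencing token it has seen and rejects any write carrying an older token.

    class FencedStorage:
        """Toy storage service that enforces fencing tokens (illustrative only)."""

        def __init__(self):
            self.highest_token_seen = -1
            self.data = {}

        def write(self, token: int, key: str, value: str) -> None:
            if token < self.highest_token_seen:
                # A client whose lease has expired (older token) is trying to write;
                # reject it so a paused former lock holder cannot corrupt the data.
                raise PermissionError(f"stale fencing token {token}")
            self.highest_token_seen = token
            self.data[key] = value

    storage = FencedStorage()
    storage.write(token=34, key="file", value="from the current lock holder")
    try:
        storage.write(token=33, key="file", value="from a paused, expired holder")
    except PermissionError as err:
        print(err)   # stale fencing token 33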

8.4.2 Byzantine Faults

  • Fencing tokens can detect and block a node that is inadvertently acting in error.
  • However, if the node deliberately wanted to subvert the system’s guarantees, it could easily do so by sending messages with a fake fencing token.
  • Byzantine fault : nodes may lie
    • e.g., a node may claim to have received a particular message when in fact it didn't
  • This concern is relevant in certain specific circumstances. For example:
    • in aerospace environments, the data in a computer’s memory or CPU register could become corrupted by radiation,
    • in a system with multiple participating organizations, some participants may attempt to cheat or defraud others.
  • Weak forms of lying : it can be worth adding mechanisms to guard against weak forms of lying

8.4.3 System Model and Reality

  • Many algorithms have been designed to solve distributed systems problems; we will look at some of them in Chapter 9.

  • We must somehow formalize the kinds of faults that we expect to happen in a system. We do this by defining a system model, which is an abstraction that describes what things an algorithm may assume.

  • With regard to timing assumptions, three system models are in common use

    • Synchronous model : assumes bounded network delay, bounded process pauses, and bounded clock error
    • Partially synchronous model : behaves like a synchronous system most of the time, but it sometimes exceeds the bounds for network delay, process pauses, and clock drift => a realistic model of many systems
    • Asynchronous model : is not allowed to make any timing assumptions
  • Consider node failures.

    • Crash-stop faults : node may suddenly stop responding at any moment, and thereafter that node is gone forever
    • Crash-recovery faults : nodes may crash at any moment, and perhaps start responding again after some unknown time.
    • Byzantine (arbitrary) faults : Nodes may do absolutely anything, including trying to trick and deceive other nodes,
  • Correctness : An algorithm is correct in some system model if it always satisfies its properties in all situations that we assume may occur in that system model.

  • If we are generating fencing tokens for a lock we may require the algorithm to have the following properties:

    • Uniqueness: No two requests for a fencing token return the same value.
    • Monotonic sequence : If request x returned token tx, and request y returned token ty, and x completed before y began, then tx < ty.
    • Availability: A node that requests a fencing token and does not crash eventually receives a response.
  • Safety and liveness : uniqueness and monotonic sequence are safety properties, but availability is a liveness property
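
  As an illustration of these properties, a deliberately toy, single-node token source (not a fault-tolerant algorithm): a counter behind a mutex satisfies uniqueness and monotonic sequence, but availability only holds as long as this one node never crashes, which is exactly why Chapter 9 turns to consensus algorithms.

    import threading

    class TokenGenerator:
        """Toy single-node fencing-token source (illustrative only)."""

        def __init__(self):
            self._lock = threading.Lock()
            self._counter = 0

        def next_token(self) -> int:
            with self._lock:               # uniqueness: no two requests see the same value
                self._counter += 1
                return self._counter       # monotonic sequence: later requests get larger tokens

    gen = TokenGenerator()
    assert gen.next_token() < gen.next_token()   # monotonic sequence holds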

  • Mapping system models to the real world : when implementing an algorithm in practice, the messy facts of reality come back to bite you again, and it becomes clear that the system model is a simplified abstraction of reality.

Based on these properties, we judge whether an algorithm is correct in a given system model => hard to fully achieve in reality, but it helps uncover problems that are easy to overlook.

Summary

  • In this chapter, we have discussed a wide range of problems that can occur in distributed systems, including:

    • A packet sent over the network may be lost or arbitrarily delayed; likewise, the reply may be lost or delayed
    • A node's clock may be significantly out of sync with other nodes
    • A process may pause for a substantial amount of time at any point in its execution
  • we try to build tolerance of partial failures into software, so that the system as a whole may continue functioning even when some of its constituent parts are broken.

  • To tolerate faults, the first step is to detect them,

    • most distributed algorithms rely on timeouts to determine whether a remote node is still available.
  • Once a fault is detected, making the system tolerate it is not easy either: there is no global variable, no shared memory, no common knowledge between the nodes

  • This chapter has been all about problems, and has given us a bleak outlook. In the next chapter we will move on to solutions, and discuss some algorithms that have been designed to cope with all the problems in distributed systems.