
Ch08 - Problems with distributed systems

  • Supercomputing
    • deals with partial failure by letting it escalate into total failure
  • Cloud computing
    • Unreliable network
      • The telephone network is synchronous: establishing a call sets up a circuit, and a fixed, guaranteed amount of bandwidth is allocated for the call along the entire route between the two callers. Because there is no queueing, the delay is bounded.
      • Ethernet and IP are packet-switched protocols, which suffer from queueing and thus unbounded delays in the network. They have no concept of a circuit because they are optimized for bursty traffic. An audio or video call needs to transfer a fairly constant number of bits per second, so a circuit suits it; bursty traffic has no such fixed requirement. TCP dynamically adapts the rate of data transfer to the available network capacity.
      • Dynamic resource allocation: the downside is queueing, the upside is that it maximizes utilization of the wire.
      • Static resource partitioning: gives latency guarantees, at the cost of reduced utilization.
    • Unreliable Clocks
      • Good accuracy is possible if you invest significant resources. For example, the MiFID II draft EU regulation for financial institutions requires all high-frequency trading (HFT) funds to synchronize their clocks to within 100 microseconds of UTC. This can be achieved with GPS receivers, the Precision Time Protocol (PTP), and careful deployment and monitoring.
      • Typical NTP synchronization is fickle: its accuracy is limited by the network round-trip time, in addition to quartz drift and other error sources.
      • Clock readings have a confidence interval
        • Most systems do not expose it, but Google’s TrueTime API in Spanner does: when you ask it for the current time, you get back two values, [earliest, latest], the earliest and latest possible timestamps.
      • Synchronized clocks for global snapshots
        • Global snapshots typically require a monotonically increasing transaction ID.
        • When the database is distributed, Spanner's approach is to use the TrueTime API: if two intervals do not overlap, i.e. A[earliest] < A[latest] < B[earliest] < B[latest], then A definitely happened before B.
        • To ensure this ordering, Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction.
        • To keep the uncertainty small, Google deploys a GPS receiver or atomic clock in each datacenter.
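
The interval comparison and the commit wait can be sketched in a few lines of Python. This is only an illustration of the idea in these notes, not Spanner's actual API: `TimeInterval`, `now_interval`, `definitely_before`, and `commit_wait` are hypothetical names, and the uncertainty is passed in as a fixed number rather than measured from real clock hardware.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """A clock reading with explicit uncertainty: true time lies in [earliest, latest]."""
    earliest: float
    latest: float


def now_interval(uncertainty_s: float) -> TimeInterval:
    """Hypothetical stand-in for a TrueTime-style call (assumed fixed uncertainty)."""
    t = time.time()
    return TimeInterval(t - uncertainty_s, t + uncertainty_s)


def definitely_before(a: TimeInterval, b: TimeInterval) -> bool:
    """True only when the intervals do not overlap and a ends before b starts."""
    return a.latest < b.earliest


def commit_wait(commit: TimeInterval, uncertainty_s: float) -> None:
    """Block until every later reading is guaranteed to start after commit.latest."""
    while now_interval(uncertainty_s).earliest <= commit.latest:
        time.sleep(0.001)


if __name__ == "__main__":
    eps = 0.005  # assumed 5 ms uncertainty, chosen only for the demo
    tx = now_interval(eps)
    commit_wait(tx, eps)  # waits roughly the interval length, 2 * eps
    later = now_interval(eps)
    print(definitely_before(tx, later))
```

Overlapping intervals give no answer either way: `definitely_before` returns `False` in both directions, which is why the wait is needed before the commit becomes visible.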
      • Problem with a leader lease, where the leader's request loop is roughly:
        get request
        check lease expiration
        if not expired, handle request
        
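        • If the expiry time is a time-of-day timestamp set by another machine, the check compares it against the local clock, and the two clocks may be out of sync.
        • What happens if the process pauses between checking the lease and handling the request? Potential causes: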
          • GC pause
          • virtual machine suspended due to live migrations
          • I/O blocking
          • context-switches
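
The pause problem can be reproduced deterministically with a small simulation. This is a sketch under stated assumptions: `FakeClock`, `Lease`, and `handle_request` are hypothetical names, and the pause is simulated by advancing a fake clock by hand. In real code the local measurement would typically come from a monotonic clock such as `time.monotonic()` rather than a time-of-day clock.

```python
from dataclasses import dataclass


class FakeClock:
    """Manually advanced clock, so a process pause can be simulated exactly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


@dataclass
class Lease:
    expires_at: float  # expiry on the same clock that is used for the check

    def is_valid(self, clock: FakeClock) -> bool:
        return clock() < self.expires_at


def handle_request(lease: Lease, clock: FakeClock, pause_s: float) -> str:
    """Check the lease, then 'pause' (GC, VM suspension, I/O, context switch)."""
    if not lease.is_valid(clock):
        return "rejected: lease expired"
    clock.advance(pause_s)  # the simulated pause happens after the check passed
    # This second check only labels the outcome of the simulation. It is not a
    # fix: in real code another pause could occur right after it.
    if not lease.is_valid(clock):
        return "unsafe: handled after lease expired"
    return "ok"


if __name__ == "__main__":
    clock = FakeClock()
    lease = Lease(expires_at=10.0)
    print(handle_request(lease, clock, pause_s=1.0))   # short pause
    print(handle_request(lease, clock, pause_s=15.0))  # pause outlives the lease
    print(handle_request(lease, clock, pause_s=0.0))   # lease is now expired
```

The second call is the dangerous case: the check passed, so the node still believes it is the leader while it handles the request, even though the lease ran out during the pause.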