Ch08 - Problems with distributed systems
- Supercomputing
- deals with partial failure by letting it escalate into total failure
- Cloud computing
- Unreliable network
- telephone network is synchronous: a call establishes a circuit, and a fixed, guaranteed amount of bandwidth is allocated for the call along the entire route between the two callers, so the delay is bounded
- Ethernet and IP are packet-switched protocols, which suffer from queueing and thus unbounded delays in the network. They have no concept of a circuit because they are optimized for bursty traffic. An audio or video call needs to transfer a fairly constant number of bits per second, whereas TCP dynamically adapts the rate of data transfer to the available network capacity.
- dynamic resource allocation: downside is queueing, upside is that it maximizes utilization of the wire.
- static resource partitioning: gives latency guarantees, at the cost of reduced utilization.
- Unreliable Clocks
- good accuracy is possible if you invest significant resources, e.g. the MiFID II draft EU regulation for financial institutions requires all high-frequency trading (HFT) funds to synchronize their clocks to within 100 microseconds of UTC. This can be achieved with GPS receivers, the Precision Time Protocol (PTP), and careful deployment and monitoring.
- typical NTP synchronization is fickle: its accuracy is limited by the network round-trip time, in addition to quartz drift, etc.
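Why the round-trip time limits NTP accuracy can be seen from the standard four-timestamp offset calculation. The following is a minimal sketch, not a real NTP client; the function name and the sample timestamps (in milliseconds) are made up for illustration.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and round-trip delay from four timestamps.

    t0: client send time    (client clock)
    t1: server receive time (server clock)
    t2: server send time    (server clock)
    t3: client receive time (client clock)
    """
    delay = (t3 - t0) - (t2 - t1)         # time spent on the network, both directions
    offset = ((t1 - t0) + (t2 - t3)) / 2  # assumes both directions took equally long
    max_error = delay / 2                 # worst case: all of the delay was in one direction
    return offset, delay, max_error


# Hypothetical exchange: the server is really 100 ms ahead, the request
# took 30 ms, the server spent 1 ms, and the reply took 20 ms.
offset, delay, max_error = ntp_offset_and_delay(t0=0, t1=130, t2=131, t3=51)
print(f"estimated offset={offset} ms, delay={delay} ms, error bound=±{max_error} ms")
```

The estimate is off by 5 ms here because the two network directions were not symmetric, and the client cannot tell. The only thing it can say is that the error is at most half the round-trip delay.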
- Clock readings have a confidence interval
- mostly not exposed by clock APIs, but Google’s TrueTime API in Spanner does: when you ask it for the current time, you get back two values, [earliest, latest], the earliest and latest possible timestamp
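To make the idea of a confidence interval concrete, here is a small sketch that turns a local clock reading into an [earliest, latest] pair. It is not the TrueTime API. The function name, the error model (error at the last sync plus worst-case drift since then), and the 200 ppm default drift are assumptions for illustration.

```python
def clock_interval(local_time, seconds_since_sync, sync_error, drift_ppm=200.0):
    """Return (earliest, latest) bounds for the true time.

    Assumed model: the uncertainty is the error right after the last sync
    plus the worst-case quartz drift accumulated since that sync.
    """
    drift = seconds_since_sync * drift_ppm / 1_000_000
    uncertainty = sync_error + drift
    return local_time - uncertainty, local_time + uncertainty


# Synced 30 s ago with 1 ms error: drift adds up to 6 ms, so ±7 ms in total.
earliest, latest = clock_interval(local_time=1000.0, seconds_since_sync=30.0, sync_error=0.001)
print(f"[{earliest:.3f}, {latest:.3f}]")
```

The interval grows the longer the clock runs without resynchronizing, which is why the time since the last sync is part of the input.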
- Synchronized clocks for global snapshots
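- snapshot isolation typically requires a monotonically increasing transaction ID
- when distributed, Spanner's approach: use the TrueTime API. If A[earliest] < A[latest] < B[earliest] < B[latest] (the two intervals do not overlap), then A definitely happened before B. If the intervals overlap, the order is uncertain.
- Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction, so that any transaction that may read the data starts late enough for the intervals not to overlap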
- to keep uncertainty small, use GPS receiver or atomic clock in each datacenter
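The interval comparison and the commit wait described above can be sketched as follows. This is an illustration under stated assumptions, not Spanner's implementation: `Interval`, `read_clock`, `definitely_before` and `commit_with_wait` are hypothetical names, and the clock uncertainty is passed in as a fixed number instead of being measured.

```python
import time
from typing import NamedTuple


class Interval(NamedTuple):
    earliest: float
    latest: float


def read_clock(uncertainty, wall_clock=time.time):
    """Hypothetical stand-in for a clock API that reports its uncertainty."""
    now = wall_clock()
    return Interval(now - uncertainty, now + uncertainty)


def definitely_before(a, b):
    """True only if interval a ends before interval b starts (no overlap)."""
    return a.latest < b.earliest


def commit_with_wait(uncertainty, wall_clock=time.time, sleep=time.sleep):
    """Pick a commit timestamp, then wait until it is certainly in the past."""
    commit_ts = read_clock(uncertainty, wall_clock).latest
    # Keep waiting while the true time could still be at or before commit_ts.
    while read_clock(uncertainty, wall_clock).earliest <= commit_ts:
        sleep(max(uncertainty / 4, 1e-4))
    return commit_ts


a = Interval(10.000, 10.007)
b = Interval(10.010, 10.017)
c = Interval(10.005, 10.012)
print(definitely_before(a, b))  # intervals do not overlap
print(definitely_before(a, c))  # intervals overlap, order unknown
```

With an uncertainty of ±u, the wait in `commit_with_wait` lasts about 2u, which is the length of the confidence interval. This is why a smaller uncertainty directly translates into faster commits.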
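- problem with leader lease:
- request loop: get request, check lease expiration, and if not expired, handle request
- if the check compares an expiry time set by another machine against the local time-of-day clock, it breaks when the clocks are out of sync
- what happens if the process pauses between checking the lease and handling the request? Potential causes: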
- GC pause
- virtual machine suspended due to live migrations
- I/O blocking
- context-switches
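The lease problem can be reproduced in a few lines. The sketch below is a hypothetical helper, not code from any real system: `Lease`, `handle_if_leader` and the injected `pause` callback are made up for illustration. It measures the lease on the local monotonic clock, which avoids the out-of-sync issue, but it still cannot protect against a pause between the check and the handling of the request.

```python
import time


class Lease:
    """Leader lease measured on the local monotonic clock (hypothetical helper)."""

    def __init__(self, duration, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + duration

    def is_valid(self):
        return self._clock() < self._expires_at


def handle_if_leader(lease, request, process, pause=lambda: None):
    """Check the lease, then process the request.

    `pause` stands in for a GC pause, VM suspension, blocking I/O, etc.
    Returns (result, lease_still_valid_after_processing).
    """
    if not lease.is_valid():
        return None, False
    pause()                    # the process may be stopped here for an unbounded time
    result = process(request)  # by now the lease may already have expired
    return result, lease.is_valid()


# Simulate a 15 s pause inside a 10 s lease using a fake clock.
fake_now = [0.0]
lease = Lease(duration=10.0, clock=lambda: fake_now[0])


def long_pause():
    fake_now[0] += 15.0


result, still_valid = handle_if_leader(lease, "write x=1", str.upper, pause=long_pause)
print(result, still_valid)
```

The request is processed even though the lease expired during the pause, and the process only finds out afterwards. Checking the lease again just before processing narrows the window but does not close it, because the pause can happen at any point.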
- Unreliable network