Overview
Applications which rely on reliability and scalability of the distributed membership maintenance protocol:
- Reliable multicast
- Epidemic-style information dissemination (gossip)
- Distributed databases
- Publish-subscribe systems
- Large scale peer-to-peer systems
- Large scale cooperative gaming
- Other collaborative distributed applications
Requirements to protocol:
- Low probability of false positive detection of process fail
- Do not rely on central server
- Low network and computational load
- Low time to detect failure
- Weakly consistent
- Membership update dissemination
- Failure detection
SWIM address non-scalability problems of traditional protocols for membership maintenance by:
- Designing the failure detection and membership update dissemination components separately
- Using a non-heartbeat based strategy for failure detection. It uses random-probing based failure detector protocol.
Properties of SWIM protocol:
- Imposes a constant message load per group member;
- Detects a process failure in an expected constant time at some non-faulty process in the group;
- Provides a deterministic bound (as a function of group size) on the local time that a non-faulty process takes to detect failure of another process;
- Propagates membership updates, including information about failures, in gossip-style; the dissemination latency in the group grows slowly (logarithmically) with the number of members;
- Provides a mechanism to reduce the rate of false positives by “suspecting” a process before “declaring” it as failed within the group.
SWIM Protocol Design
SWIM protocol consists of two main components:
- Failure Detector Component
- Dissemination Component
Failure Detector Component
Quality of service properties of failure detection:
- Strong completeness
- Speed of failure detection
- Accuracy
- Network message load
SWIM uses Failure detection protocol described at [2] (see Fig. 1).
Fig. 1 SWIM failure detection: Example protocol period at Mi. This shows all the possible messages that a protocol period may initiate.
Dissemination Component
Basic implementation of Membership Dissemination component consist of:
- On detecting failed member by failure detector process multicast it to the group. Other members delete failed member.
- Joined or voluntarily leaving member multicast it in the same way. But on join need to know at least one contact member of the group.
- If the group is associated with a well known server or IP multicast address, all joins could be directed to the associated address.
- In the absence of such infrastructure, join messages could be broadcasted. Group members will reply with some probability to this broadcast request.
The basic implementation described above shows the general idea of the protocol, but in order to make protocol robust and fault tolerant following improvements may be applied:
- Do not use multicast to disseminate membership updates, but spread membership information by piggybacking it on top of failure detection protocol (ping, ping-req and ack messages). This approach results in an infection-style of membership updates dissemination, with the associated benefits of robustness to packet losses, and of low latency.
- Use suspicion mechanism. Process that unresponsive to ping marked as “suspected”. Suspected member treated as normal for ping selection, but if respond then spread alive information. Only after predefined timeout suspected member declared as faulty. This timeout effectively trades off an increase in failure detection time for a reduction in frequency of false failure detections. Suspicion mechanism is applicable to any system with the distinct Failure Detector and Membership Dissemination components.
- Basic implementation of SWIM protocol guarantee only eventual detection of the failure. Round-Robin probe target selection will guarantee Time Bounded Completeness property; the time interval between the occurrence of a failure and its detection at member Mj is no more than two times the group size in number of protocol periods. This can be solved by the following modification to the protocol. The failure detection protocol at member Mi selecting ping targets not randomly from the list of members, but in a round robin fashion. A newly joining member is inserted in the membership list at a position that is chosen uniformly at random. On completing a traversal of the entire list, Mi rearranges the membership list to a random reordering.
SWIM can be extended to a Wide-area network (WAN) or a virtual private network (VPN), by weighing ping target choices with topological information, thus reducing bandwidth usage on core network elements inside the network.
SWIM Implementations
Consul
Consul is a distributed, highly available, and datacenter-aware service discovery and configuration management system. It uses SWIM as a basis for implementing Gossip Protocol and for managing membership information.
ScaleCube
ScaleCube (disclaimer: I am one of the authors) is an open source microservices framework for a rapid development of a distributed, resilient, reactive applications that scales. It uses a variant of SWIM protocol implementation in Java and uses it as a basis for managing the cluster of connected microservices. It uses suspicion mechanism over the failure detector events and also separate membership updates dissemination component. But it introduces separate Gossip Protocol component instead of piggybacking membership updates on top of failure detector messages. It is done in order to reuse gossip component for other platform events and have more fine grained control over time intervals used for gossiping and failure detection pinging. New members to the cluster joins via the configuration provided seed members addresses (it is an analogous of well known servers described above). And also it extends SWIM protocol with the introduction of periodic SYNC messages in order to improve recovery from network partitioning and message losses.
Reference: