David Beck

Thoughts

Follow on GitHub

Scalesmall W4 Message Contents Finalized

09 Dec 2015 by David Beck on [LinkedIn] / [Feed]

This week I further clarified what I want from the group membership messages and also implemented them. I added unit tests for all functions, so I feel more or less safe to continue with the network communication part next week.

I am seriously considering to add property based tests as well. Only that I don’t have the experience in this, so this will need to wait a bit. (With so much time spent on the theory I am getting eager to see my stuff to do something useful.)

hello from Glasgow

Requirements

I decided that in scalesmall the only one who makes a decision about the group membership and its participation in the shared work is the node itself. Thus only the given node writes the shared information about its participation and everyone else are readers.

This means that I only need to make sure that:

  • whenever the node wants to notify the group about its decision, it eventually reaches all interested parties
  • when the messages eventually arrive, the recipients must be able to correctly combine them which boils down to:
    • these messages be idempotent: A + A = A, so they may be delivered multiple times
    • the messages be commutative: A + B = B + A and associative A + (B + C) = (A + B) + C, so they can be delivered and applied to the shared state in any order

Using a Last-Write-Wins strategy this seems to be super easy, if I just add a client clock to the messages.

Clocks

Digging through the literature, there seem to be a wealth of clocks like version vectors with client and server side update counters and also there are the novel ‘dot clocks’. When I read the papers about them I was tempted to use what is newest and shinier until I realized what suits me best.

On a sidenote: I find very little literature or blog posts give a good overview about when to use which and why. People seem to focus on their results and talk about that specific niche only. If you happen to be wandering in the land of these clocks I suggest to read Scalable and Accurate Causality Tracking for Eventually Consistent Stores by Paulo Sérgio Almeida, Carlos Baquero, Ricardo Gonçalves, Nuno Preguiça, and Victor Fonte.

Reading the paper I realized that my use case could be covered with a very simple solution. The nodes need to increment and pass a counter at each state change and send this over. This is very much like the VVclient approach in the paper.

Local Clock Module

GroupManager.Data.LocalClock represents the counter that is going to be attached to the information the node sends over. Only the given node is supposed to change the clock value. Everyone else are reading it.

LocalClock tests can be found here.

Item Module

If you haven’t read the previous posts in the series: I treat groups as a 32bit unsigned integer range from 0 to 0xffffffffff. The incoming requests are mapped to an integer in this range. The nodes decide what part of the range they want to serve and publish this information. Whenever they decide to serve an additional (sub)range or they want to stop serving a subrange they publish this information to the group. The GorupManager.Data.Item represents this information. It has:

  • member: identifies the node
  • op: is either :add or :rmv
  • start_range
  • end_range
  • priority

The priority is a hint that comes into the picture when multiple nodes are serving the same range. This information is opaque from the group membership stand point. Nodes can use it to decide the order in which they participate in a workload. The idea is that scalesmall should be able to support heterogenous nodes. The priority field could help in distinguishing between more or less reliable machines, different capacities and better or worst network connectivity.

Item tests can be found here.

TimedItem Module

timed_item

The GroupManager.Data.TimedItem module represents and Item at a given LocalClock. This binding allows us to pick the last write of the given nodes. Items are treated as same when these fields match:

  • member
  • start_range
  • end_range

The op and priority fields will be overwritten by the last write.

TimedItem tests can be found here.

TimedSet Module

The GroupManager.Data.TimedSet module is a list of TimedItems and only used for convenience. Its main use is to validate TimedItem members when added to the list.

TimedSet tests can be found here.

WorldClock Module

The GroupManager.Data.WorldClock module is a list of LocalClock items. It both validates the LocalClock members when added and also overwrites the older LocalClock for the given member if present. WorldClock is not used in comparisons. Its purpose is to tell nodes if they lag behind on updates from a given node. When this happens they should take action to gather the missing pieces either from the origin node itself or from the other nodes.

WorldClock tests can be found here.

Message Module

message

The GroupManager.Data.Message is the topmost module of the Data lib hierarchy. It has a WorldClock and a TimedSet. The WorldClock gets updated when new TimedItem elements are added to the Message.

Message tests can be found here.

Episodes

  1. Ideas to experiment with
  2. More ideas and a first protocol that is not in use anymore
  3. Got rid of the original protocol and looking into CRDTs
  4. My first ramblings about function guards
  5. The group membership messages
  6. Design of a mixed broadcast
  7. My ARM based testbed
  8. Experience with defstruct, defrecord and ETS
  9. GroupManager code works, beta
  10. GroupManager more information and improvements