David Beck

Wrapping Up My Elixir TCP Experiments

19 Oct 2015 by David Beck

In this post I close my series of small-message TCP experiments in Elixir. Since my last post I have further improved the small message server: it is faster, and the code has become more Elixir-ish.

If you haven’t followed the previous experiments then here are the links:

  1. The first post was my most naive experiment: passing small messages from C++ to Elixir with an immediate acknowledgement. This led to 22k messages per second.
  2. The second in the series changed the protocol by allowing delayed acknowledgement and combining ACKs together. This resulted in 100k messages per second, limited mainly by my suboptimal Elixir code.
  3. By further improving the Elixir server I got 250k messages per second.
  4. The big speedup came when I stopped passing each message to a separate Elixir process. By calculating and sending the ACKs synchronously I achieved over two million messages per second.

In this post I recap what caused performance loss in my code and what made these improvements possible. The current version is doing over 3M messages per second.

GitHub repo

Previously I inlined the code in my posts, which I believe is not very convenient if you want to experiment. The GitHub repo is available here. Feel free to clone or fork it.

I slightly renamed the files and the modules:

  1. The first experiment’s files are named RequestReply
  2. The second is in ThrottleAck
  3. The third is in HeadRest
  4. The fourth is in SyncAck
  5. The current experiment is in AsyncAck

Performance

With the latest changes I reached the 3M messages per second range:

elapsed usec=595899 avg(usec/call)=0.2979495 avg(call/msec)=3356.2735 avg(call/sec)=3356273.5
elapsed usec=608023 avg(usec/call)=0.3040115 avg(call/msec)=3289.3493 avg(call/sec)=3289349.3
elapsed usec=619510 avg(usec/call)=0.309755 avg(call/msec)=3228.3579 avg(call/sec)=3228357.9
elapsed usec=629900 avg(usec/call)=0.31495 avg(call/msec)=3175.1072 avg(call/sec)=3175107.2
elapsed usec=629148 avg(usec/call)=0.314574 avg(call/msec)=3178.9023 avg(call/sec)=3178902.3
elapsed usec=730798 avg(usec/call)=0.365399 avg(call/msec)=2736.7344 avg(call/sec)=2736734.4
elapsed usec=611692 avg(usec/call)=0.305846 avg(call/msec)=3269.6194 avg(call/sec)=3269619.4
elapsed usec=630911 avg(usec/call)=0.3154555 avg(call/msec)=3170.0192 avg(call/sec)=3170019.2
elapsed usec=613990 avg(usec/call)=0.306995 avg(call/msec)=3257.382 avg(call/sec)=3257382
elapsed usec=614098 avg(usec/call)=0.307049 avg(call/msec)=3256.8092 avg(call/sec)=3256809.2
elapsed usec=603122 avg(usec/call)=0.301561 avg(call/msec)=3316.0787 avg(call/sec)=3316078.7
elapsed usec=629041 avg(usec/call)=0.3145205 avg(call/msec)=3179.443 avg(call/sec)=3179443
elapsed usec=631746 avg(usec/call)=0.315873 avg(call/msec)=3165.8293 avg(call/sec)=3165829.3
elapsed usec=626637 avg(usec/call)=0.3133185 avg(call/msec)=3191.6405 avg(call/sec)=3191640.5
elapsed usec=639153 avg(usec/call)=0.3195765 avg(call/msec)=3129.1412 avg(call/sec)=3129141.2
elapsed usec=632023 avg(usec/call)=0.3160115 avg(call/msec)=3164.4418 avg(call/sec)=3164441.8
elapsed usec=624141 avg(usec/call)=0.3120705 avg(call/msec)=3204.4041 avg(call/sec)=3204404.1
elapsed usec=615717 avg(usec/call)=0.3078585 avg(call/msec)=3248.2455 avg(call/sec)=3248245.5
elapsed usec=626122 avg(usec/call)=0.313061 avg(call/msec)=3194.2657 avg(call/sec)=3194265.7
elapsed usec=629619 avg(usec/call)=0.3148095 avg(call/msec)=3176.5242 avg(call/sec)=3176524.2
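The figures in each line are related by simple arithmetic: dividing the elapsed time by the average time per call gives the number of calls in the run (2,000,000 for the lines above), and dividing that by the elapsed seconds gives the rate. A minimal sketch of that calculation follows; the BenchLine module is a hypothetical helper for illustration, not part of the repo.

```elixir
defmodule BenchLine do
  # Hypothetical helper (not from the original repo): recovers the call count
  # and the calls-per-second rate from one line of the benchmark output.
  def parse(line) do
    [elapsed, usec_per_call] =
      Regex.run(
        ~r/elapsed usec=(\d+) avg\(usec\/call\)=([\d.]+)/,
        line,
        capture: :all_but_first
      )

    elapsed = String.to_integer(elapsed)
    {usec_per_call, ""} = Float.parse(usec_per_call)

    # calls = total time / time per call
    calls = round(elapsed / usec_per_call)

    # rate = calls / elapsed seconds
    %{calls: calls, calls_per_sec: calls / (elapsed / 1_000_000)}
  end
end

line =
  "elapsed usec=595899 avg(usec/call)=0.2979495 " <>
    "avg(call/msec)=3356.2735 avg(call/sec)=3356273.5"

IO.inspect(BenchLine.parse(line))
```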

I am now using a separate ACK responder process again, because I figured out what caused the performance issue with it in the second and third experiments.

Using separate processes for concurrency

Using separate Elixir processes is a great way to get concurrency. At the same time it is important to understand their performance implications. Sending a message to a process is not as lightweight as I thought: it costs far more time on the sender side than I expected, and the receiver’s message queue can also be overwhelmed.

What I have read so far about Erlang and Elixir concurrency is pretty misleading. Authors seem to be too happy that they can implement a useless Fibonacci example with parallel processes. In practice one needs to be careful and measure the results.

In my third experiment I offloaded the ACK processing to a timer and a Task process. Later I realized that passing a large number of messages to a separate process is too expensive, so I moved the ACK generation into the same process that reads from the network. This gave me a 9x performance improvement.

While I was happy with the improvement, I started thinking about how to revert this and delegate the ACK processing to a separate process again, because in a new experiment series I want to make the small message server distributed. So I decided not to pass each message, but rather the original data blocks I read from the network, and to offload the parsing to the separate process too. This is what helped me reach 3M messages per second.
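The idea can be sketched roughly as follows. This is not the code from the repo: the BlockOffload module, the message tuples, and the wire layout (64-bit ID, 32-bit payload size) are all assumptions made for illustration. The point is that the reading process sends one Erlang message per network read, and the worker does the parsing and bookkeeping.

```elixir
defmodule BlockOffload do
  # Hypothetical sketch. Assumed wire layout (not stated in the post):
  # 64-bit ID, 32-bit payload size, then the payload bytes.

  # Starts the worker that owns parsing and the per-message bookkeeping.
  def start_worker, do: spawn(fn -> loop(<<>>, 0) end)

  # Called by the reader: one send per network read, not per protocol message.
  def forward(worker, block) when is_binary(block), do: send(worker, {:block, block})

  # Asks the worker how many complete messages it has parsed so far.
  def parsed_count(worker) do
    ref = make_ref()
    send(worker, {:count, self(), ref})

    receive do
      {^ref, count} -> count
    after
      1_000 -> {:error, :timeout}
    end
  end

  defp loop(leftover, count) do
    receive do
      {:block, block} ->
        # Prepend the unparsed tail of the previous block before parsing.
        {parsed, rest} = consume(leftover <> block, 0)
        loop(rest, count + parsed)

      {:count, from, ref} ->
        send(from, {ref, count})
        loop(leftover, count)
    end
  end

  # Counts the complete messages and returns the incomplete tail.
  defp consume(<<_id::64, size::32, _payload::binary-size(size), rest::binary>>, n),
    do: consume(rest, n + 1)

  defp consume(rest, n), do: {n, rest}
end

# Three messages delivered as two arbitrary blocks.
frame = fn id, payload -> <<id::64, byte_size(payload)::32, payload::binary>> end
stream = frame.(1, "a") <> frame.(2, "bc") <> frame.(3, "def")
<<block1::binary-size(20), block2::binary>> = stream

worker = BlockOffload.start_worker()
BlockOffload.forward(worker, block1)
BlockOffload.forward(worker, block2)
IO.inspect(BlockOffload.parsed_count(worker))
```

Because messages between two given processes arrive in the order they were sent, the count request is handled only after both blocks have been parsed.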

The load graph in :observer.start also looks healthier, because the new server utilizes both cores of my laptop:

[Figure: Observing single client performance]

Utilize Elixir binary pattern matching and reduce system calls

The message structure in this experiment has three parts:

  • ID
  • Payload size
  • Payload

In my first naive attempt I did two reads per message: first I read the two fixed-length fields and then, based on the payload size, I read the payload. This was very slow.

Then I decided to read as much as was available from the network in one call and parse whatever I got. This can be tricky because a message may be split between subsequent reads. Solving this in Elixir turned out to be very easy.
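A minimal sketch of this kind of parsing is shown below. The field widths (64-bit ID, 32-bit payload size) and the FrameParser name are assumptions for illustration, since the post does not state them. A single binary pattern extracts one complete message, and whatever does not match is handed back as leftover to be prepended to the next read.

```elixir
defmodule FrameParser do
  # Assumed wire layout (not stated in the post):
  # 64-bit ID, 32-bit payload size, then `size` bytes of payload.

  # Parses every complete message in `buffer`.
  # Returns {messages, leftover}, where leftover is the incomplete tail, if any.
  def parse(buffer, acc \\ [])

  def parse(<<id::64, size::32, payload::binary-size(size), rest::binary>>, acc) do
    parse(rest, [{id, payload} | acc])
  end

  # No complete message left: return what was parsed plus the unparsed bytes.
  def parse(leftover, acc) when is_binary(leftover) do
    {Enum.reverse(acc), leftover}
  end
end

# Two messages arriving in two reads, with the second message split between them.
frame = fn id, payload -> <<id::64, byte_size(payload)::32, payload::binary>> end
stream = frame.(1, "hello") <> frame.(2, "world!")
<<read1::binary-size(20), read2::binary>> = stream

{msgs1, leftover} = FrameParser.parse(read1)
{msgs2, <<>>} = FrameParser.parse(leftover <> read2)
IO.inspect({msgs1, msgs2})
```

With a real socket, the reads could come from :gen_tcp.recv(socket, 0) in passive mode, which returns whatever bytes are currently available, so one system call can deliver many small messages.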

UPDATE: Linux performance

I had only tested on my MacBook Air, which I thought was OK as long as my goal was to improve my Elixir skills by polishing this experiment. Thanks to Panagiotis PJ Papadomitsos’ comments I checked this on a spare Linux box too. It is a roughly 7-year-old box running Linux non-virtualized. I have a few other boxes at work but they are all running VMware VMs, so, as per Panagiotis’ suggestion, they may not be the best for these tests.

Here are the results (messages per second):

Settings                             RequestReply  Throttle  HeadRest  SyncAck  AsyncAck
Default Settings                     20k           30k       78k       1380k    780k
+K true                              19k           30k       78k       1380k    780k
+K false +sbwt none                  20k           30k       80k       1400k    790k
+K false +sbwt none +swt very_high   20k           30k       78k       1380k    790k
+K false +sbwt none +swt very_low    20k           30k       82k       1390k    810k

Here is the cpuinfo for this Linux machine:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Genuine Intel(R) CPU           U2300  @ 1.20GHz
stepping	: 10
microcode	: 0xa04
cpu MHz		: 1199.990
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips	: 2399.98
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Genuine Intel(R) CPU           U2300  @ 1.20GHz
stepping	: 10
microcode	: 0xa04
cpu MHz		: 1199.990
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips	: 2399.98
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

This CPU is way slower than my Mac’s so I don’t want to compare the absolute numbers. The main takeaway for me is that the performance is dominated by the ratio of network performance to CPU power. This may be obvious, but the actual numbers are very interesting. My latest AsyncAck code, which performs best on the Mac, comes second on the slow Linux box. The Erlang VM settings, on the other hand, made little difference, I guess because the CPU was too slow for these settings to actually matter.

I am starting to feel that to write performant Elixir code, one also needs to think about the hardware it is going to run on, at least for this kind of networking code. I am saying this because the only difference between the SyncAck and AsyncAck code is that I have put the ACK processing in a separate process, for which this box did not have enough CPU. So to max out this Linux box I’d need to make a software architecture decision too. This is pretty much in contrast to what I expected. My naive feeling was that well-written Elixir code would run equally well on any computer / OS, and only the relative performance of the boxes would differ.

UPDATE2: FreeBSD on EC2

I couldn’t resist doing another experiment on a c4.large EC2 instance running FreeBSD 10. Here are the numbers:

Settings                             RequestReply  Throttle  HeadRest  SyncAck  AsyncAck
Default Settings                     26k           150k      302k      2600k    2400k
+K true                              26k           155k      305k      2600k    2200k
+K false                             26k           154k      305k      2500k    2180k
+K false +sbwt none                  26k           154k      305k      2500k    2400k
+K false +sbwt none +swt very_high   26k           154k      308k      2500k    2200k
+K false +sbwt none +swt very_low    26k           155k      308k      2550k    2300k

It is interesting to see how a faster CPU and a different OS impact the numbers. Just like on Linux, my separate ACK process, which worked well on Mac OS X, hurts performance here.

uname -a: FreeBSD ip-172-30-0-199 10.2-RELEASE FreeBSD 10.2-RELEASE #0 r286666: Wed Aug 12 15:26:37 UTC 2015 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC