Monday, February 6, 2012

libcppa vs. Erlang vs. Scala Performance (Mixed Scenario)

Remark


Please note that the results of the original post are heavily outdated. The graph below illustrates newer benchmark results using Scala 2.10, Erlang 5.10.2, and libcppa 0.9. Rather than running the benchmark on a 12-core machine, we have used a 64-core machine (4 CPUs with 16 cores each). Furthermore, we have used a slightly different set of parameters: 100 rings, 50 actors each, an initial token value of 1000, and 5 repetitions. We will publish a more thorough evaluation in the near future.



Original Post


This benchmark simulates a use case with a mixture of operations. It simulates the continuous creation and termination of actors, sends a total of more than 50,000,000 messages between actors, and includes some expensive calculations to account for numerical workload. The test program creates 20 rings of 50 actors each. A token with an initial value of 10,000 is passed along each ring and decremented once per round trip. An actor receiving a token always forwards it to the next actor in the ring and finishes execution once the token's value is 0. The following pseudo code illustrates the implemented algorithm.

chain_link(Next):
  receive:
    {token, N} =>
      Next ! {token, N}
      if (N > 0) chain_link(Next)

worker(MessageCollector):
  receive:
    {calc, X} =>
      MessageCollector ! {result, prime_factorization(X)}

master(Worker, MessageCollector):
  5 times:
    Next = self
    49 times: Next = spawn(chain_link, Next)
    Next ! {token, 10000}
    Worker ! {calc, some_huge_number}
    Done = false
    while not Done:
      receive:
        {token, X} =>
          if (X > 0): Next ! {token, X-1}
          else: Done = true
  MessageCollector ! {master_done}
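
To make the pseudo code a bit more concrete, the following sketch shows how the chain_link and master roles could look in Akka (one of the benchmarked runtimes). It is a minimal, self-contained demo of a single ring, not the code from MixedCase.scala; the names Token, ChainLink, Master, and RingDemo as well as the omission of the worker and message collector are my own simplifications, and it assumes the akka-actor library is on the classpath.

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Message carrying the token value through the ring (illustrative name).
case class Token(value: Int)

// chain_link from the pseudo code: forward every token to the next actor
// in the ring and terminate after forwarding a token with value 0.
class ChainLink(next: ActorRef) extends Actor {
  def receive = {
    case Token(n) =>
      next ! Token(n)
      if (n == 0) context.stop(self)
  }
}

// master from the pseudo code, reduced to a single ring: it closes the
// ring, decrements the token on every round trip, and shuts the actor
// system down once the token reaches 0.
class Master(ringSize: Int, initialValue: Int) extends Actor {
  private val first: ActorRef =
    (1 until ringSize).foldLeft(self) { (next, _) =>
      context.actorOf(Props(classOf[ChainLink], next))
    }
  first ! Token(initialValue)

  def receive = {
    case Token(0) => context.system.terminate()
    case Token(n) => first ! Token(n - 1)
  }
}

object RingDemo extends App {
  // 49 chain links plus the master itself form a ring of 50 actors,
  // matching the parameters of the original benchmark.
  val system = ActorSystem("ring-demo")
  system.actorOf(Props(new Master(50, 10000)))
}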

Each ring consists of 49 chain_link actors and one master. The master re-creates its ring of chain_link actors five times, spawning a total of 245 actors, and the program spawns 20 master actors. Additionally, there is one worker per master and a single message collector. A total of 4941 actors (20+20+(20∗245)+1) are created, but no more than 1021 (20+20+(20∗49)+1) are running concurrently. The message collector waits until it receives 100 (20∗5) prime factorization results and a done message from each master. Prime factors are calculated to simulate numerical workload. The calculation took about two seconds on the tested hardware in our loop-based C++ implementation. Our tail-recursive Scala implementation performed at the same speed, whereas Erlang needed almost seven seconds.
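
The factorization routine itself is not part of the pseudo code above. As a rough illustration of this kind of workload, here is a tail-recursive trial-division factorization in Scala; it is only a sketch under the assumption of a simple trial-division approach, and neither the name primeFactors nor the exact algorithm is taken from the benchmark sources.

import scala.annotation.tailrec

object Factorization {
  // Tail-recursive trial division collecting the prime factors of n.
  def primeFactors(n: Long): List[Long] = {
    @tailrec
    def loop(remaining: Long, divisor: Long, acc: List[Long]): List[Long] =
      if (remaining == 1) acc.reverse
      else if (divisor * divisor > remaining) (remaining :: acc).reverse
      else if (remaining % divisor == 0) loop(remaining / divisor, divisor, divisor :: acc)
      else loop(remaining, divisor + 1, acc)
    loop(n, 2L, Nil)
  }
}

Feeding such a routine the product of two large primes keeps a worker busy for a few seconds, which is how the benchmark models numerical load alongside the message passing.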




As expected, the thread-based Scala implementation yields the worst performance, although the increase in runtime for eight and more cores is surprising. Akka is significantly faster than both standard library implementations of Scala. Erlang performs very well, given that its prime factorization is more than three times slower. The very efficient scheduling of Erlang, the only implementation under test that uses preemptive scheduling, is best at utilizing hardware concurrency. The current overhead of the libcppa scheduler prevents better results: the cost of stack allocation and context switching amounts to about 10-20% in this benchmark for up to six cores, at which point the scheduler reaches its limits.

This benchmark shows that libcppa is competitive and performs at a speed comparable to well-tested and established actor model implementations. Nevertheless, the current scheduling algorithm is not yet able to utilize more than six cores efficiently.

The benchmarks ran on a Linux virtual machine using 2 to 12 cores of the host system, which comprises two hexa-core Intel® Xeon® processors clocked at 2.27 GHz. All values are the average of five runs.

The sources can be found on GitHub (MixedCase.scala, mixed_case.erl, and mixed_case.cpp).

2 comments:

  1. The link to benchmarks leads nowhere :(

    1. Oops, sorry. The benchmarks have moved to another repository. I have updated the links.
