Monday, February 13, 2012

Actor Creation Overhead

libcppa provides two actor implementations: a context switching and an event-based implementation.

The context-switching implementation is easier to use from a user's point of view: one has to write less code, and receives can be nested. But there is a downside to this approach: each actor allocates its own stack. As an example from a current mainstream system, Mac OS X Lion defines the two constants SIGSTKSZ = 131072 and MINSIGSTKSZ = 32768 in its system headers, where SIGSTKSZ is the recommended stack size in bytes and MINSIGSTKSZ is the minimum allowed stack size in bytes. Assuming a system with 500,000 actors, one would need at least 15 GB of RAM for stack space alone, and about 61 GB if the recommended stack size were used instead. This clearly does not scale well for large systems.

The event-based implementation uses far fewer system resources, allowing developers to use hundreds of thousands of actors. Creating an event-based actor is cheap and lightweight, but you have to provide a class-based implementation. Furthermore, you cannot use receive(), since this would block the calling worker thread. The behavior-based approach is slightly different, but fairly easy to understand and use (see the Dining Philosophers example).
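
To make the arithmetic above concrete, here is a minimal standalone C++ sketch (not part of libcppa) that multiplies the platform's MINSIGSTKSZ and SIGSTKSZ by 500,000 actors; with the Lion values quoted above this yields roughly 15 GB and 61 GB, respectively.

#include <signal.h>  // SIGSTKSZ, MINSIGSTKSZ
#include <cstdio>

int main() {
    // Number of actors assumed in the back-of-the-envelope calculation above.
    const double actors = 500000.0;
    const double GiB = 1024.0 * 1024.0 * 1024.0;

    // The constants depend on the platform this sketch is compiled on; the text
    // quotes MINSIGSTKSZ = 32768 and SIGSTKSZ = 131072 for Mac OS X Lion.
    std::printf("minimum stacks:     %.1f GB\n", actors * (double) MINSIGSTKSZ / GiB);
    std::printf("recommended stacks: %.1f GB\n", actors * (double) SIGSTKSZ / GiB);
    // With the quoted values: 500,000 * 32768  bytes ~= 15.3 GB
    //                         500,000 * 131072 bytes ~= 61.0 GB
    return 0;
}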

The following benchmark measures the overhead of actor creation. It recursively creates 2^19 (524,288) actors, as the following pseudo code illustrates.
spreading_actor(Parent):
  receive:
    {spread, 0} =>
      Parent ! {result, 1}
    {spread, N} =>
      spawn(spreading_actor, self) ! {spread, N-1}
      spawn(spreading_actor, self) ! {spread, N-1}
      receive:
        {result, X1} =>
          receive:
            {result, X2} =>
              Parent ! {result, X1+X2}

main():
  spawn(spreading_actor, self) ! {spread, 19}
  receive:
    {result, Y} =>
      assert(2^19 == Y)
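
The assertion in main() can be checked without any actors: a {spread, 0} actor reports 1, and every other actor reports the sum of two {spread, N-1} results, so the value delivered to main() is 2^19 = 524,288. A tiny standalone C++ check of that recursion (purely illustrative, not taken from the benchmark sources):

#include <cassert>
#include <cstdint>
#include <cstdio>

// Mirrors the value a spreading actor reports upstream for {spread, n}:
// a leaf ({spread, 0}) reports 1; every other actor reports the sum of
// the results of the two children it spawns.
static std::uint64_t spread_result(int n) {
    return n == 0 ? 1 : spread_result(n - 1) + spread_result(n - 1);
}

int main() {
    std::uint64_t y = spread_result(19);
    assert(y == (1ull << 19));                      // same check as in the pseudo code
    std::printf("%llu\n", (unsigned long long) y);  // prints 524288
    return 0;
}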

This measurement tests how lightweight actor implementations are. We did not test the thread-mapped actor implementation of Scala, because the JVM cannot handle half a million threads, and neither can a native application.


It is not surprising that Erlang yields the best performance, as its virtual machine was built to handle actors efficiently. Furthermore, we can see the same increase in runtime caused by more hardware concurrency for the event-based libcppa implementation as in our previous benchmark. However, the context-switching (stacked) implementation clearly falls short in this scenario. Please note that this benchmark used the minimal stack size to be able to create half a million actors; by default, libcppa uses the recommended stack size! Consider using event-based actors whenever possible, especially in systems consisting of a large number of concurrently running actors.
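
For readers who have not written an event-based actor yet, the following is a rough sketch of the class-based style described above: derive from event_based_actor, override init(), and install the behavior with become(). API details change between libcppa versions, so treat this as an illustration of the idea rather than a copy-and-paste example.

#include <iostream>
#include "cppa/cppa.hpp"

using namespace cppa;

// An event-based actor: class-based, no dedicated stack, no blocking receive().
// Its behavior is installed via become(); the scheduler invokes it per message.
struct printer : event_based_actor {
    void init() {
        become(
            on(atom("print")) >> [=]() {
                std::cout << "hello from an event-based actor" << std::endl;
            },
            on(atom("quit")) >> [=]() {
                quit();
            }
        );
    }
};

int main() {
    auto p = spawn<printer>();   // cheap: no per-actor stack is allocated
    send(p, atom("print"));
    send(p, atom("quit"));
    await_all_others_done();
    return 0;
}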

The benchmarks ran on a virtual machine with Linux, using 2 to 12 cores of the host system, which consists of two hexa-core Intel® Xeon® processors clocked at 2.27 GHz. All values are the average of five runs.

The sources can be found on github (ActorCreation.scala, actor_creation.erl and actor_creation.cpp).

Monday, February 6, 2012

libcppa vs. Erlang vs. Scala Performance (Mixed Scenario)

Remark


Please note that the results of the original post are heavily outdated. The graph below illustrates some newer benchmark results using Scala 2.10, Erlang 5.10.2, and libcppa 0.9. Rather than running the benchmark on a 12-core machine, we have used a 64-core machine (4 CPUs with 16 cores each). Furthermore, we have used a slightly different set of parameters: 100 rings, 50 actors each, an initial token value of 1,000, and 5 repetitions. We will publish a more thorough evaluation in the near future.



Original Post


This benchmark simulates a use case with a mixture of operations. It simulates the continuous creation and termination of actors, sends a total of more than 50,000,000 messages between actors, and includes some expensive calculations to account for numerical work load. The test program creates 20 rings of 50 actors each. A token with an initial value of 10,000 is passed along the ring and decremented once per iteration. A client receiving a token always forwards it to the next client and finishes execution when the value of the token is 0. The following pseudo code illustrates the implemented algorithm.

chain_link(Next):
  receive:
    {token, N} =>
      Next ! {token, N}
      if (N > 0) chain_link(Next)

worker(MessageCollector):
  receive:
    {calc, X} =>
      MessageCollector ! {result, prime_factorization(X)}

master(Worker, MessageCollector):
  5 times:
    Next = self
    49 times: Next = spawn(chain_link, Next)
    Next ! {token, 10000}
    Worker ! {calc, SomeLargeNumber}
    Done = false
    while not Done:
      receive:
        {token, X} =>
          if (X > 0): Next ! {token, X-1}
          else: Done = true
  MessageCollector ! {master_done}
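
The prime_factorization call in the worker exists purely to generate CPU load; the exact input and implementation are in the benchmark sources linked at the end of this post. As a rough illustration of the kind of loop-based factorization meant here (a generic trial-division sketch, not the benchmark's code):

#include <cstdint>
#include <cstdio>
#include <vector>

// Plain trial-division factorization; illustrative only.
std::vector<std::uint64_t> prime_factorization(std::uint64_t n) {
    std::vector<std::uint64_t> factors;
    for (std::uint64_t d = 2; d * d <= n; ++d) {
        while (n % d == 0) {
            factors.push_back(d);
            n /= d;
        }
    }
    if (n > 1) factors.push_back(n);  // remaining prime factor, if any
    return factors;
}

int main() {
    // Hypothetical input; the number actually used by the benchmark is
    // defined in mixed_case.cpp and its Erlang/Scala counterparts.
    for (auto f : prime_factorization(3000000021ull)) {
        std::printf("%llu ", (unsigned long long) f);
    }
    std::printf("\n");  // prints: 3 1000000007
    return 0;
}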

Each ring consists of 49 chain_link actors and one master. The master recreates the terminated actors five times. Each master spawns a total of 245 actors and the program spawns 20 master actors. Additionally, there is one message collector and one worker per master. A total of 4921 actors (20+(20∗245)+1) are created but no more than 1021 (20+20+(20∗49)+1) are running concurrently. The message collector waits until it receives 100 (20∗5) prime factorization results and a done message from each master. Prime factors are calculated to simulate some work load. The calculation took about two seconds on the tested hardware in our loop-based C++ implementation. Our tail recursive Scala implementation performed at the same speed, whereas Erlang needed almost seven seconds.
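
The numbers above can be reproduced with simple arithmetic. Each lap of the token around a ring costs about 50 sends (the master plus 49 chain links), the token makes roughly 10,000 laps per run, each master rebuilds its ring 5 times, and there are 20 rings; multiplied out, this is where the figure of more than 50,000,000 messages comes from. The short sketch below recomputes the actor counts from the formulas given above together with this traffic estimate (bookkeeping messages such as calc, result, and master_done are ignored).

#include <cstdio>

int main() {
    const long long rings = 20, repetitions = 5, chain_links = 49, initial_token = 10000;

    // Actor counts exactly as given in the text above.
    long long total_spawned  = rings + rings * (repetitions * chain_links) + 1;  // 4921
    long long max_concurrent = rings + rings + rings * chain_links + 1;          // 1021
    long long factorizations = rings * repetitions;                              // 100

    // Token traffic: ~50 sends per lap, ~10,000 laps per run, 5 runs per ring, 20 rings.
    long long token_messages = rings * repetitions * initial_token * (chain_links + 1);

    std::printf("actors created in total:  %lld\n", total_spawned);
    std::printf("max. concurrent actors:   %lld\n", max_concurrent);
    std::printf("factorization results:    %lld\n", factorizations);
    std::printf("token messages (approx.): %lld\n", token_messages);  // 50,000,000
    return 0;
}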




As expected, the thread-based Scala implementation yields the worst performance, although the runtime increase at eight and more cores is surprising. Akka is significantly faster than both standard library implementations of Scala. Erlang performs very well, given that its prime factorization is more than three times slower. Erlang is also the only implementation under test that schedules actors preemptively, and its very efficient scheduling is best at utilizing hardware concurrency. The current overhead of the libcppa scheduler prevents better performance results. The overhead of stack allocation and context switching amounts to about 10-20% in this benchmark for up to six cores; beyond that, the scheduler is stretched to its limits.

This benchmark shows that libcppa is competitive and performs at a speed comparable to well-tested and established actor model implementations. Nevertheless, the current scheduling algorithm is not yet able to utilize more than six cores efficiently.

The benchmarks ran on a virtual machine with Linux, using 2 to 12 cores of the host system, which consists of two hexa-core Intel® Xeon® processors clocked at 2.27 GHz. All values are the average of five runs.

The sources can be found on github (MixedCase.scala, mixed_case.erl and mixed_case.cpp).