Friday, March 9, 2012

A Few Words About Theron

Theron is currently the only existing actor library for C++ besides libcppa. It implements event-based message processing without a mailbox, using registered member functions as callbacks. I wanted to include Theron in my three benchmarks, but I had some trouble getting it to run on Linux. I compiled Theron in release mode, ran its PingPong benchmark ... and got a segfault. GDB pointed me to the DefaultAllocator implementation, and I was able to run the benchmarks after replacing the memory management with plain malloc/free. Thus, the results shown here might be a few percent better with a fixed memory management. However, I was unable to get results for the Actor Creation Overhead benchmark, because Theron crashed for more than 210 actors.

Theron provides two ways of sending a message to an actor, because it distinguishes between references and addresses. The push method can only be used if one holds a reference to an actor, which is usually only the case for its creator. The send method uses addresses and is the general way to send messages.
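To illustrate the difference, here is a minimal sketch written from memory of the Theron 3.x API; header names and exact signatures may differ between releases, so treat it as an approximation rather than copy-paste material.

    // Sketch only: from memory of the Theron 3.x API; check the headers of
    // your release for the exact class names and signatures.
    #include <Theron/Framework.h>
    #include <Theron/Actor.h>
    #include <Theron/Receiver.h>

    class Pong : public Theron::Actor {
    public:
        Pong() { RegisterHandler(this, &Pong::Handle); }

    private:
        void Handle(const int& msg, const Theron::Address from) {
            Send(msg + 1, from); // reply to whoever sent the message
        }
    };

    int main() {
        Theron::Framework framework;
        Theron::Receiver receiver;

        // CreateActor returns an ActorRef; normally only the creator holds it.
        Theron::ActorRef pong(framework.CreateActor<Pong>());

        // push: requires the ActorRef itself, so usually only the creator can use it
        pong.Push(1, receiver.GetAddress());

        // send: requires only an address and therefore works from anywhere
        framework.Send(2, receiver.GetAddress(), pong.GetAddress());

        receiver.Wait(); // wait for the two replies
        receiver.Wait();
    }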



The results for the Mixed Scenario benchmark are as follows.



Theron yields good results on two and four cores. However, the more concurrency we add, the more time Theron needs in the mixed scenario. I don't know Theron's internals, but this is common behavior for mutex-based software on many-core systems and cannot be explained by the "missing" memory management.

I used Theron in version 3.03. As always, the benchmarks ran on a virtual machine with Linux, using 2 to 12 cores of a host system with two hexa-core Intel® Xeon® processors at 2.27 GHz. All values are the average of five runs.

The sources can be found on GitHub in the benchmark folder (theron_mailbox_performance.cpp and theron_mixed_case.cpp).

10 comments:

  1. Hi, I'm the author of Theron. Sorry to hear about your crash problem! There's a patch 3.03.01 which fixes an occasional crash in 64-bit builds, but it was only released three days after your post. Let me know if it doesn't fix your issue.

    Your performance results surprise me. When I get a chance, I'll try implementing your benchmark myself and see if I can reproduce them.

    Have you tried running the benchmarks included with Theron? There's a page on the Theron website which presents some results:
    http://www.theron-library.com/index.php?t=page&p=performance

    The ThreadRing and ParallelThreadRing benchmarks, included with Theron, are somewhat similar to your benchmark - but without the prime factorization calculations or the actor creation overhead. My recorded time for 50 million messages in ThreadRing is around 5 seconds. That's with 16 software threads on a machine with 8 hardware threads (4 hyperthreaded cores). It would be interesting to know what sort of results you get in your 12-core environment using that benchmark.

  2. Hi. I'm glad you're reading my blog! I was about to post the results in your forum, too. But I guess that's not necessary anymore.

    I was surprised by the results myself, since Theron performs well on two and four cores. The first benchmark does not include any overhead besides sending messages. Neither benchmark program requires libcppa, so feel free to compile, run, and review the benchmarks yourself; that's why I publish the sources as well.


    Some results for Theron's included benchmarks:

    "./ParallelThreadRing 50000000 8" in a VM using 4 cores: 80 seconds
    "./ParallelThreadRing 50000000 16" in a VM using 8 cores: 275 seconds
    "./ParallelThreadRing 50000000 24" in a VM using 12 cores: 750 seconds

    ThreadRing and CountingActor do not use more than one core ("top" always shows 100% usage on a single core). Thus, the results are identical in all setups. ThreadRing needs ~13 seconds and CountingActor ~45 seconds for 50,000,000 messages.



    Version 3.03.01 does compile & run out of the box and is slightly faster:

    "./ParallelThreadRing 50000000 8" in a VM using 4 cores: 88 seconds
    "./ParallelThreadRing 50000000 16" in a VM using 8 cores: 230 seconds
    "./ParallelThreadRing 50000000 24" in a VM using 12 cores: 620 seconds


    Btw: it would be nice if Theron provided Automake or CMake support.

  3. Thanks for taking the time to test the new version and run the benchmarks.

    I've written my own version of MixedScenario, but it is very similar to yours and I think it will behave the same. You can find it here (you'll need to define Theron::uint64_t)
    http://www.theron-library.com/MixedScenario.cpp

    I have some results for ParallelThreadRing and MixedScenario, on a 4-core X5550 machine (8 hardware threads) and a 6-core X5660 machine (12 hardware threads) -- unfortunately I don't have a 12 core machine at my disposal ;) Perhaps unsurprisingly, my results tell a different story:

    ParallelThreadRing 50M 4 cores (8 hw threads, 16 sw threads): 12.3s
    ParallelThreadRing 50M 6 cores (12 hw threads, 16 sw threads): 12.8s

    MixedScenario 50M 4 cores (8 hw threads, 16 sw threads): 27.1s
    MixedScenario 50M 6 cores (12 hw threads, 16 sw threads): 20.6s

    As I would expect, adding more cores makes little difference in ParallelThreadRing. ParallelThreadRing is all about the overhead of passing messages and synchronizing, and does no actual work.

    And also as I would expect, more cores (and a higher clock speed) does make a difference in MixedScenario. In my opinion MixedScenario is probably dominated by the factorization of the primes, and it's all about how effectively you can parallelize that. It makes sense that adding more cores makes MixedScenario faster, rather than slower, so long as the actor overheads don't dominate.

    So the question is why our results are different. There are a number of possible factors obviously, not least the fact that I'm running Visual Studio builds in Windows 7 whereas you're presumably running GCC builds on a Linux virtual machine.

    But to be honest my money would be on Boost -- I'm guessing you're using Boost threads (or something related) whereas I'm using Windows threads. I already know that Boost threads are 2x slower for me than Windows threads, so I wouldn't be surprised if there was an issue there. I wrote about it under Known Issues in the release notes:
    http://www.theron-library.com/index.php?t=notes&p=31

    When I get time, I will try a Boost threads build to see if I can confirm that's what's going on. In the meantime, thanks again!

  4. You're welcome.

    I would recommend running your benchmarks on Linux, too; the Boost implementation on Windows ultimately uses Windows threads anyway. However, have you considered switching to the new standard C++ threading library yet? It is included in the new Visual Studio release as well as in GCC on Linux. It might be worth the effort, since you wouldn't have to maintain two separate implementations.
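    For illustration, here is what the portable standard API looks like; this is just a generic sketch, not code from Theron:

        // Minimal sketch of the C++11 standard threading API (std::thread,
        // std::mutex, std::condition_variable); recent GCC releases and the
        // new Visual Studio ship it. Not taken from Theron's sources.
        #include <thread>
        #include <mutex>
        #include <condition_variable>
        #include <iostream>

        int main() {
            std::mutex mtx;
            std::condition_variable cv;
            bool ready = false;

            // Worker blocks until the main thread signals readiness.
            std::thread worker([&] {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [&] { return ready; });
                std::cout << "worker woke up" << std::endl;
            });

            {
                std::lock_guard<std::mutex> guard(mtx);
                ready = true;
            }
            cv.notify_one();
            worker.join();
        }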

  5. Yes, it would make sense for me to report Linux benchmarks -- and if nothing else I should at least be reporting the gcc/boost scores on Windows. I had ignored them because GCC isn't my lead platform and, well, because they were slower :) But by not reporting them I missed the fact that they behaved very differently: slower is one thing, but the kind of dramatic effect you've drawn my attention to is a different thing altogether.

    I could also do with buying a machine with more cores -- even at work the best I can find is a 6-core machine. And that doesn't seem to be enough to satisfactorily repro the behaviour you're seeing on 12 cores.

    In the last couple of days I've uploaded a beta version of a new 3.03.02 patch, which I hope will cure this problem. Basically I've avoided boost::recursive_mutex and used boost::mutex instead, having figured out how.

    In my own tests I've confirmed that 3.03.02 is significantly faster than 3.03.01, in gcc builds using Boost.Thread on Windows. But my results don't show the kind of dramatic slowdown you're seeing, even with 3.03.01. I don't know whether that's because I don't have enough cores to see the problem, or whether it's because Boost.Threads is implemented differently on Windows (and ultimately just uses Windows Threads, as you say). So I haven't been able to confirm that my changes fix the real problem, yet.

    3.03.01 ParallelThreadRing (4 cores): 35s
    3.03.01 ParallelThreadRing (6 cores): 35s

    3.03.02 ParallelThreadRing (4 cores): 28s
    3.03.02 ParallelThreadRing (6 cores): 26s

    (all results reflect 32-bit GCC builds with Boost.Thread on Windows)

    As for std::thread, I have support working locally but still need to release it, which is proving complicated because I want to keep supporting Boost and Windows threads for now -- not everyone is able to use the latest compiler versions, and Visual Studio support for std::thread in particular is very new. But as you say, it will eventually simplify my life, which I look forward to :)

    A final note: the comment form on this blog currently only works for me in Internet Explorer, whereas it used to work fine in Chrome. Not sure why.

  6. News flash: A mate at work has a dual-CPU machine with 8 cores (4 cores per CPU). And on that machine we're seeing the sort of nasty numbers you're seeing, in both 3.03.01 and also 3.03.02. Even though the cores themselves are faster than the other machines we've tested, and have more cache.

    3.03.01 ParallelThreadRing (8 cores, dual CPU): 156s
    3.03.02 ParallelThreadRing (8 cores, dual CPU): 144s

    So that makes me suspect the issue might only show up on dual-CPU machines?

  7. I'm glad you found a machine that reproduces the observed runtime behavior. Well, synchronization is more complex on multi-CPU machines, so it sounds plausible that locking operations are (much) slower in such an environment. During the Theron benchmarks, only one core was running at any time while all other cores were blocked. Something really backfires in Theron on multi-CPU machines, but the same behavior would likely arise on a single processor with dozens of cores.

    I don't know about your internals, but I'm using spinlocks whenever possible and try to avoid heavyweight mutexes. If you have to use a mutex, you should use fine-grained locking to minimize collisions and queuing. I wish you good luck identifying the bottleneck! :)
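    For example, a minimal spinlock can be built on std::atomic_flag; the sketch below only illustrates the idea and is not libcppa's actual implementation:

        // Minimal spinlock sketch on top of std::atomic_flag; illustration
        // only, not libcppa's actual implementation.
        #include <atomic>

        class spinlock {
        public:
            spinlock() { flag_.clear(); }

            void lock() {
                // busy-wait until the flag was previously clear
                while (flag_.test_and_set(std::memory_order_acquire)) {
                    // a production version might yield after some iterations
                }
            }

            void unlock() {
                flag_.clear(std::memory_order_release);
            }

        private:
            std::atomic_flag flag_;
        };

    Since it models the BasicLockable concept, it works with std::lock_guard<spinlock> just like a mutex, but it never puts the calling thread to sleep.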

    Btw, I'm using Chrome for this comment (on Mac).

  8. I doubt the locks alone are the problem.

    The machine here which shows the bug only has 8 hardware threads, which is 4 fewer than the 6-core (but hyperthreaded) machine I tested earlier, with 12 hardware threads. So the results for ParallelThreadRing look like this:

    8 hardware threads (1 cpu x 4 cores x 2 threads): 28s
    8 hardware threads (2 cpu x 4 cores x 1 thread): 144s
    12 hardware threads (1 cpu x 6 cores x 2 threads): 26s

    The difference in performance seems uncorrelated with the number of cores or hardware threads.

  9. Oh, and the reason why only one core's worth of work is done in ThreadRing and ParallelThreadRing is that this is how those benchmarks are designed. They are stress tests of message passing and synchronization overhead, and there is intentionally no parallelism to be exploited. So adding more hardware threads has little positive effect in either benchmark, by design.
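    Roughly, the pattern looks like the sketch below; it uses plain C++11 threads instead of actors and is only meant to show that a single token hops around the ring, so at most one participant is ever doing work:

        // Illustration of the ThreadRing idea with plain C++11 threads (not the
        // actual Theron benchmark code): one token circulates, so there is no
        // parallelism to exploit no matter how many hardware threads exist.
        #include <condition_variable>
        #include <iostream>
        #include <mutex>
        #include <thread>
        #include <vector>

        int main() {
            const int ringSize = 50;
            const int totalHops = 100000; // the real benchmark passes 50 million messages

            std::mutex mtx;
            std::condition_variable cv;
            int token = 0; // whose turn it is
            int hops = 0;  // how often the token has been passed on

            std::vector<std::thread> ring;
            for (int id = 0; id < ringSize; ++id) {
                ring.emplace_back([&, id] {
                    std::unique_lock<std::mutex> lock(mtx);
                    for (;;) {
                        cv.wait(lock, [&] { return token == id || hops >= totalHops; });
                        if (hops >= totalHops) return;
                        ++hops;                       // "process" the message
                        token = (id + 1) % ringSize;  // pass it to the neighbour
                        cv.notify_all();
                    }
                });
            }
            for (auto& t : ring) t.join();
            std::cout << hops << " hops done" << std::endl;
        }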

  10. Yes, sorry for messing that up. You've mentioned false sharing as a possible reason for the performance impact in your forum. That seems plausible, too, and it should be easy to check by adding padding to your queue elements.
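    Something along these lines, for instance; I don't know your actual queue node layout, so this is only meant to illustrate the idea:

        // Illustration only: pad each queue node to a full cache line so that
        // nodes written by different cores never share a line.
        #include <cstddef>

        struct padded_node {
            padded_node* next;
            void*        payload;
            char         pad[64 - 2 * sizeof(void*)]; // assume 64-byte cache lines
        };

        static_assert(sizeof(padded_node) % 64 == 0,
                      "node is not padded to a cache line boundary");

    If the runtime difference disappears with the padded layout, false sharing is the culprit; if not, the locking itself remains the prime suspect.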

    However, please let me know if you've solved the issue, so I can re-run the benchmarks and update the graphs.
