Monday, November 23, 2009

Parallel computing on Windows

With the advent of the multicore world, the need for parallelism has increased a lot. Until now, one way to improve application performance was to increase the CPU clock speed. Because of hardware limitations, it has become very hard for hardware vendors to keep raising the clock speed, so instead they add more and more cores to CPUs. At this moment, most laptops and desktop computers have CPUs with at least 2 cores.

A trivial example of code that can be executed on multiple cores:

c -> a and b
if (c is bigger than 0)
    foreach (i to c)
        Compute(module(i))

Let's say that we have a computer with a CPU that has 256 cores. If we can use all the cores and execute Compute(module(i)) on each core at any given moment, we will get a major performance improvement. A CPU with 256 cores may sound a bit ridiculous, but let's look at the history. In 1995 one of the fastest computers was an IBM machine with 512 CPUs that weighed a few tons and had very high power consumption. Today a high-end GPU has the same computing power as that IBM machine, and this happened in roughly a decade and a half.
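The loop above can be sketched as follows. This is a minimal illustration written in Python for brevity, not TPL code; module and compute are made-up stand-ins (the post does not define them), and c is simply passed in as a number. The point is only that each iteration depends on nothing but i, so the iterations can be handed to a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor


def module(i):
    # Hypothetical stand-in for module(i); the post does not define it.
    return abs(i)


def compute(x):
    # Hypothetical stand-in for Compute(...): depends only on its argument.
    return x * x


def step(i):
    # One loop iteration; it shares no state with the other iterations.
    return compute(module(i))


def run_sequential(c):
    # The plain loop from the pseudocode.
    return [step(i) for i in range(c)] if c > 0 else []


def run_parallel(c, workers=4):
    # Same loop, with iterations distributed over a pool of workers.
    if c <= 0:
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands iterations to the workers and returns results in input order.
        return list(pool.map(step, range(c)))


if __name__ == "__main__":
    print(run_parallel(8) == run_sequential(8))
```

One caveat about this sketch: in CPython, threads take turns executing Python bytecode, so it shows the shape of the decomposition rather than a real multicore speedup. Swapping in ProcessPoolExecutor, which offers the same map interface, is one way to spread the work over several cores.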

In the new version of the .NET Framework, Microsoft added a set of classes for multicore support called the Task Parallel Library (TPL). The TPL exposes parallel constructs such as For and ForEach loops using regular methods and delegates. Writing multithreaded applications is quite a challenging task; the most notorious problem is deadlock, when two threads each wait for the other to release a resource.
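A deadlock like the one described above typically shows up when two threads take the same two locks in opposite order. One common way to avoid it (a general technique, not something specific to the TPL) is to always acquire locks in one fixed order. Here is a small sketch, again in Python; Account, transfer and run_demo are invented names used only for illustration.

```python
import threading


class Account:
    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    if src is dst:
        # Nothing to move, and taking the same Lock twice would block forever.
        return
    # Acquire both locks in one global order (by id), whatever the direction
    # of the transfer, so two threads can never hold one lock each and wait
    # for the other.
    first, second = sorted((src, dst), key=lambda acc: acc.account_id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount


def run_demo(rounds=1000):
    a, b = Account(1, 1000), Account(2, 1000)

    def worker(src, dst):
        for _ in range(rounds):
            transfer(src, dst, 1)

    # The two threads transfer in opposite directions, which is exactly the
    # pattern that deadlocks when each thread locks "its" source first.
    threads = [
        threading.Thread(target=worker, args=(a, b)),
        threading.Thread(target=worker, args=(b, a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.balance, b.balance
```

If transfer locked src first and dst second instead, the two threads in run_demo could each grab one lock and wait on the other indefinitely.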

Prior to Windows Server 2008 R2 and Windows 7, only 64 cores were supported at the OS level, mainly because of a "hot" lock called the Dispatcher Lock. What is this lock and why is it "hot"? In order for a thread to be scheduled onto a CPU, this lock needs to be acquired. When you have a relatively small number of cores, the contention on the lock is not very high. As the number of cores grows, more and more cores try to acquire this one lock at the same time.

If the lock is not acquired, the core will spin and do nothing. The NT kernel was designed by David Cutler, and when the kernel was designed, having more than 16 cores on a CPU was more like science fiction. Arun Kishan, a kernel developer, took this issue on as a side project and managed to remove this lock, replacing it with much finer-grained synchronization primitives. As a result, Windows Server 2008 R2 and Windows 7 can scale up to 256 cores. This is very good news for everyone who creates threads.
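To make the idea of replacing one hot lock with finer-grained ones a bit more concrete, here is a toy sketch in Python. It is not how the Windows kernel does it (the post does not describe those primitives); it only shows the general pattern, often called lock striping, where data is split into several parts that each have their own lock. StripedCounters and its methods are invented for this example.

```python
import threading


class StripedCounters:
    """Counters guarded by several locks instead of one global lock."""

    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._counts = [{} for _ in range(stripes)]

    def _stripe(self, key):
        # Each key always maps to the same stripe, hence to the same lock.
        return hash(key) % len(self._locks)

    def increment(self, key):
        s = self._stripe(key)
        # Only threads touching this stripe contend here; threads working on
        # other stripes proceed without waiting.
        with self._locks[s]:
            self._counts[s][key] = self._counts[s].get(key, 0) + 1

    def get(self, key):
        s = self._stripe(key)
        with self._locks[s]:
            return self._counts[s].get(key, 0)
```

With a single stripe this degenerates into the "one hot lock" situation: every increment, whatever the key, has to wait for the same lock.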

There are some cases when having multiple threads on a machine with just one core can be a good thing. Why is that? Because in some cases a thread is doing an I/O operation that can take a long time, for example if your hard drive is fragmented (there is a lot of seeking), and all this time the CPU would otherwise sit idle. As a rule of thumb, when you have a case like my small example with Compute(module(i)), try to use ThreadPool or the TPL. ThreadPool is able to reuse threads inside the pool; the TPL is more advanced and will be part of .NET 4.0. It has concepts like work stealing, worker-thread local pools, and scheduling groups of actions.
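Both points, overlapping slow I/O and reusing pooled threads, can be illustrated with a short sketch. It is written in Python rather than with the .NET ThreadPool, and slow_read is a made-up stand-in that just sleeps to imitate a slow disk read; read_all is likewise a hypothetical helper.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def slow_read(block_id, delay=0.05):
    # Hypothetical stand-in for a disk read: while this thread sleeps
    # (as it would while blocked on I/O), other threads can run.
    time.sleep(delay)
    return block_id, threading.current_thread().name


def read_all(block_ids, workers=4):
    # The pool creates at most `workers` threads and reuses them for all reads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(slow_read, block_ids))


if __name__ == "__main__":
    start = time.perf_counter()
    results = read_all(range(12))
    elapsed = time.perf_counter() - start
    thread_names = {name for _, name in results}
    print(f"{len(results)} reads on {len(thread_names)} threads in {elapsed:.2f}s")
```

Because the waits overlap, the twelve simulated reads should finish in a fraction of the time they would take one after another, even on a single core, and they are served by no more than four threads.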


This material was prepared by Stefan Tabaranu


2 comments:

  1. Although this group is about .NET, it might be interesting to know that for native C/C++ programming there is a well-established parallelism mechanism: a library called OpenMP, which adds directives and constructs for parallelizing your code.
    On an 8-core machine I could achieve well over a 5 times speed increase on a computationally intensive LARGE code section.

    Doron Weiss
    profile at: http://il.linkedin.com/pub/doron-weiss/1/3b0/6b1

  2. Hi Doron,

    Nice to hear that. I'm not a C/C++ developer, but I'll have a look at that library. We also managed to increase performance in cases where we don't have shared state between loop iterations. Another thing: as developers, we will have to change the way we think and redesign algorithms in order to take advantage of these new machines.

    Stefan
