

Multi Threaded Programming Guide

Hey guys,

Now taking a shot at writing my second article.

I got a lot of hands-on experience while building SuperPrime™. Of course I had some preliminary idea from before (read: self-learning and college), but I had never gone all out in coding a true multi-threaded application capable of scaling efficiently.

I will give a basic idea of what multi threading is all about. Then I will explain how I proceeded through the different versions and the different algorithms and ideas. Hope you like the technical overview :)

Multi Threading : What is it?

With the recent craze for multi-core processors, multi threading as a programming technique has really gone to a new level altogether.

Earlier, whole program structures were written in sequential form: one action after the other.

As an example, consider a multi-player game:

  • Contact server to get latest update.
  • Calculate client side physics simulation.
  • Play appropriate sound actions.
  • Render appropriate graphics on screen.
  • Handle user input.

Since these actions are being executed purely sequentially, we are GENERATING bottlenecks ourselves.

If you study the sequence, the visual aspect of the game is just one step in the line-up. So if any of the other steps lags for whatever reason, the graphics will suffer. Not in quality (of course) but in quantity, i.e. frame rate. This causes what is famously known as "STUTTERING" and will result in the spawning of people like Akshay Talwar (aka Techboy.)

The Alternate Way

So instead of attacking Techboy, we should attack the root problem, i.e. the sequential nature of the actions.

However, up till now most DESKTOP users had "SINGLE core" machines, which meant that only one thread (stream) of instructions could be executed at any point in time. So even if game developers had made their games multi-thread capable, performance would have suffered, because the processors of the time were incapable of efficiently executing multiple threads. Hence all the effort of the programmers was concentrated on running the processes sequentially, in as efficient a way as possible.

But now, with the easy availability of dual-core processors, the situation is changing rapidly.

Now the processor is capable of efficiently running multiple threads (at least 2 threads now :p). So instead of being limited to running the actions sequentially, the game programmer's options expand to:

  • Start rendering the world, excluding the players and related actions.
  • Buffer the sound files in memory so that playing them becomes less of a bottle neck.
  • Asynchronously get updates from the server. So we are not held up by internet lag any more.
  • Calculate physics simulations. Of course this is a pseudo-sequential step. In fact, all of them are. But at least you will find parallel actions here. This step has to start after we get updates from the server.
  • Handle user input.
  • Render user related graphics and HUD.
  • Play sounds, etc.

Of course this list is not accurate, and it is not meant to be either. But it does give you a rough picture of what I mean by multi threaded programming. Now, instead of dealing with the nitty-gritty of multi threaded game programming, I will focus on something I can speak about with some authority.

SuperPrime™ : Through the Versions

When I first started programming SuperPrime (well, it was called PNS then), my intention was to fully engage as many cores as the user selects (I gave an option in the program for the number of threads) in testing numbers up to a fixed working set size (again selectable by the user.)

However, it was not a very successful attempt on the first try, and I will tell you why.

The two approaches I had

I had two possible solutions in front of me when I started programming SP.

  • Either split the "whether a number is prime or not" process itself into multiple threads and work your way through the working set, or
  • Split the working set into multiple parts (user selected) and then proceed with the calculation.

My first Implementation

The first way is obviously more complicated and requires a deep understanding of thread synchronization and related topics. Hence I shied away from it and decided to use the second approach. What I did was this:

Code:
Algorithm:

1- Take the input from the user
  a. Take number of threads -> N
  b. Take working set size -> W
  c. Store starting time -> Tstart
2- Split the working set into the required # of parts (each of size W / N)
3- Start N number of threads each assigned to work on the selected range.
4- Wait for all N threads to finish executing (by way of Auto Events)
5- Store ending time -> Tend
6- Calculate the total working time = Ttotal = Tend - Tstart
7- Show output to user.
8- End
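To make the algorithm concrete, here is a rough sketch of it in Python. This is my own illustration, not the actual SuperPrime code; `is_prime`, `run_static`, and the rest are hypothetical names, and in CPython the GIL means threads will not actually speed up CPU-bound work, so treat it purely as a sketch of the structure:

```python
import threading

def is_prime(n):
    """Plain trial division up to sqrt(n) -- an assumption, since the
    article never shows the actual primality test used."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_primes(lo, hi, results, idx):
    """Worker: count the primes in [lo, hi] and record the result."""
    results[idx] = sum(1 for n in range(lo, hi + 1) if is_prime(n))

def run_static(W, N):
    """Steps 2-4 above: split 1..W into N equal parts, one thread each."""
    results = [0] * N
    step = W // N
    threads = []
    for i in range(N):
        lo = 1 + i * step
        hi = (i + 1) * step if i < N - 1 else W
        t = threading.Thread(target=count_primes, args=(lo, hi, results, i))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()  # the equivalent of waiting on the Auto Events (step 4)
    return sum(results)
```

So `run_static(2000, 2)` hands 1-1000 to one thread and 1001-2000 to the other, which is exactly the kind of split that turns out to be lopsided below.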

Obviously there was a severe limitation in this approach.

Due to the polynomial nature of primality testing (the work grows with the size of the number being tested), larger numbers will take longer to be tested for primality than smaller numbers.

Since my working set was linear and went from 0 -> W (from the algorithm above), the lower ranges would complete their calculation much faster than the upper ranges. And that held even more true for large working set sizes.
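To make the imbalance concrete, here is a rough way to measure it (my own illustration, assuming a plain trial-division test): count how many candidate divisors the test tries per number, and compare a lower range against an equal-sized upper range.

```python
def trial_division_cost(n):
    """Candidate divisors tried before trial division decides about n."""
    if n < 2:
        return 0
    tries = 0
    i = 2
    while i * i <= n:
        tries += 1
        if n % i == 0:
            break  # composite found early, cheap to reject
        i += 1
    return tries

def range_cost(lo, hi):
    """Total work (divisor tries) for testing every number in [lo, hi]."""
    return sum(trial_division_cost(n) for n in range(lo, hi + 1))

# The upper half of an equal split costs noticeably more than the lower:
# range_cost(1, 1000) < range_cost(1001, 2000)
```

Primes dominate the cost (a prime p needs about sqrt(p) tries, while most composites are rejected after a couple of divisions), so the range holding the bigger primes always does more work.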

Because of this behaviour of the code, Funky noticed (and duly reported to me) that for the same working set size, selecting N (number of threads) greater than the number of actual cores available was giving him better times. This was because with large W and small N, the initial range was completing much faster, after which one of the cores was sitting unutilized while the other core struggled to complete the upper range. I was stumped at first and didn't know why this was happening. Then I thought of the reason I just described.

Hence I was left with the job of somehow ensuring that all the cores would continue to be utilized at all times, thus ensuring proper scaling.

The first Workaround

Since I realized that the upper ranges were taking longer to execute than the equal-sized lower ranges (due to the nature of the primality test: the bigger the number, the longer the time required), I decided upon one easy solution.

Instead of dividing the full working range into N equal-sized parts, I decided to do a 3:2 distribution.

So for example if W = 2000 and N = 2,

instead of doing: R1 = 1 - 1000, R2 = 1001 - 2000

I would do: R1 = 1 - 1200, R2 = 1201 - 2000

Obviously, this was a simplistic solution to the problem.... but since I only had a meagre dual-core setup with me :p I couldn't really predict the running situations on quad-core systems.

When I gave the new version to Funky to try out, he reported a slightly improved situation, but this was not a perfect solution. For even greater ranges, the large first range was completing execution earlier than the short second range, and the test was therefore regressing to a single-threaded test after that.

I discovered the biggest problem when some people with quad-core processors started running the test. I was expecting the scores to scale linearly from dual core to quad core. However, it was not happening, and I was left dumbfounded. I was wondering why this was, and I didn't have first-hand access to such machines (until some time later, when I could test the program on an 8-way Intel machine :))

However after some introspection the problem dawned upon me.

With dual-core processors, the ranges (for W = 2000 and N = 2) were getting divided as follows (with the 3:2 divide ratio):

R1 = 1 - 1200, R2 = 1201 - 2000

However, when we upped N to 4 with the same W on the same machine, we got this:

R1 = 1 - 1200, R2 = 1201 - 1680, R3 = 1681 - 1872, R4 = 1873 - 2000

While it still held true that, number for number, the upper ranges took longer to execute than the lower ranges, at this juncture the upper ranges were becoming too small to take longer overall than the progressively larger lower ranges.

Hence R4 was finishing execution very early, and a little later R3 would finish too. After that, R1 and R2 would struggle along with their larger ranges, and the test would regress to a two-thread test.
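The splitting rule that produces those ranges can be written down directly. This is my reconstruction from the numbers above (assuming each new thread takes 3/5 of whatever range remains, with the last thread taking the rest):

```python
def split_32(W, N):
    """3:2 static split: each thread takes 3/5 of the remaining range,
    and the last thread takes whatever is left."""
    ranges = []
    lo = 1
    for _ in range(N - 1):
        size = (W - lo + 1) * 3 // 5
        ranges.append((lo, lo + size - 1))
        lo += size
    ranges.append((lo, W))
    return ranges

# split_32(2000, 2) -> [(1, 1200), (1201, 2000)]
# split_32(2000, 4) -> [(1, 1200), (1201, 1680), (1681, 1872), (1873, 2000)]
```

The N = 4 output makes the problem visible at a glance: R4 is only 128 numbers wide, so it cannot possibly keep a core busy for long.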

Necessity : Mother of Invention

I had to come up with an innovative idea to handle this situation. Clearly this static load distribution was not scaling, and something had to be done about it. I quit trying to find workarounds and really sat down to come up with something good.

And I finally decided upon this approach:

Dynamic Load Balancing

Code:
Algorithm:

1- Take the input from the user
  a. Take number of threads -> N
  b. Take working set size -> W
  c. Store starting time -> Tstart
2- Spawn one worker thread.
3- Each worker thread keeps checking, every iteration, whether the number of RUNNING threads (R) is < N.
4- If R < N Then
  a. Check if we are allowed to spawn (more on this later.) ***
  b. If Yes Then
     a. Divide our REMAINING working set into 2 equal parts and spawn another worker thread to handle one of them.
     b. Continue working on our own part.
5- Wait for all the threads to finish executing (by way of Auto Events)
6- Store ending time -> Tend
7- Calculate the total working time = Ttotal = Tend - Tstart
8- Show output to user.
9- End
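Here is a minimal Python sketch of that algorithm, again my own reconstruction rather than the SuperPrime source: a shared table tracks each worker's remaining range under a lock, and the worker holding the largest remaining range splits it in two whenever fewer than N workers are running. (In CPython the GIL means this will not actually run faster; the point is the balancing logic.)

```python
import threading

def is_prime(n):
    """Trial division up to sqrt(n) (an assumed test, as before)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

class DynamicCounter:
    def __init__(self, max_threads):
        self.max_threads = max_threads  # N
        self.lock = threading.Lock()
        self.remaining = {}             # worker id -> numbers still to test
        self.threads = []
        self.next_id = 0
        self.total = 0                  # primes found so far

    def work(self, wid, lo, hi):
        count = 0
        n = lo
        while n <= hi:
            with self.lock:
                self.remaining[wid] = hi - n + 1
                # Step 4: fewer than N workers running, AND we hold the
                # largest remaining range (step 4a), AND there is enough
                # left to split in two.
                if (len(self.remaining) < self.max_threads
                        and self.remaining[wid] == max(self.remaining.values())
                        and hi - n + 1 >= 2):
                    mid = (n + hi) // 2          # step 4b: split in 2
                    child = self.next_id
                    self.next_id += 1
                    self.remaining[child] = hi - mid
                    t = threading.Thread(target=self.work,
                                         args=(child, mid + 1, hi))
                    self.threads.append(t)
                    t.start()
                    hi = mid                     # keep the lower half
                    self.remaining[wid] = hi - n + 1
            if is_prime(n):
                count += 1
            n += 1
        with self.lock:
            del self.remaining[wid]  # done: other workers may now split
            self.total += count

def run_dynamic(W, N):
    c = DynamicCounter(N)
    c.remaining[0] = W               # register the first worker (step 2)
    c.next_id = 1
    first = threading.Thread(target=c.work, args=(0, 1, W))
    c.threads.append(first)
    first.start()
    while True:                      # join until no new threads appear
        with c.lock:
            snapshot = list(c.threads)
        for t in snapshot:
            t.join()
        with c.lock:
            if len(c.threads) == len(snapshot):
                return c.total
```

Whatever order the splits happen in, every number gets tested exactly once, because a range is handed over atomically under the lock before either worker moves on.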

I will explain the above a bit.

See, now instead of statically dividing the working set W into N parts and starting off the worker threads, I incorporated the logic into the worker thread itself. So each worker thread assesses the situation (the number of currently running threads R, the max number of threads allowed N, etc.) and spawns a new worker thread by subdividing its own RANGE.

Now I will explain step 4(a) of the above algorithm a little bit.

The conditions which are checked before spawning a new thread are:

  • If running threads R < allowed max threads N
  • If THIS particular worker thread (the one checking whether to spawn) has the largest remaining working set size. This way, the thread that is handling the largest working set RANGE gets to divide its work into 2.

Example Run

This way for a working set size W = 2000 and N = 2, it works out like this:

  • The first worker thread is spawned with range = 2000.
  • It sees that the number of running threads is 1 and that it can spawn another thread.
  • It then checks whether it is the worker thread with the max RANGE. This turns out to be true, as it is the only worker thread.
  • It divides its range (i.e. 2000) into 2 and spawns another thread. So now there are 2 worker threads with range = 1000 each.

Now assume that both threads are crunching numbers; because running threads R = 2 = max threads N, no new threads get spawned.

Then the first worker completes its range (because it was dealing with the lower range.) At that point the second worker sees that the number of running threads has dropped to 1. It checks whether it can respawn. It sees that it is the only worker thread and thus has the biggest RUNNING range.

Voila, it subdivides whatever is left of its original working set RANGE and spawns another thread. Now the number of running threads goes back to 2. So we see that all the cores are always kept busy and are dynamically loaded.

This is the implementation in the current version of SuperPrime (0.4.3, not released yet; will be releasing it soon with a surprise :))

Conclusion

With the recent advances in technology (not just bleeding edge, but for mainstream users too), multi threaded programming is really taking off. So if you are in a position to decide between linear sequential execution and multi threaded execution, please do your users a favour and decide upon the latter. It will do the WORLD a lot of good.

Hope you guys liked my article. I have tried to be as fluid as possible in explaining how I implemented and tackled the algorithm that sits behind SuperPrime right now. Hope that it was a good read :)

Regards,

Karan
Team TechEnclave

ps: wrote my first article w/o any pics :) omg.... and thanks to the mods for keeping the contest open a wee bit longer. I had a very rough week and couldn't complete this article in time.

 
stormblast said:
too much free time at work?

just kidding. nice stuff but didn't have the patience to read the whole thing.

Lol... didn't have any time at work.. Hence had to complete it over a weekend :)

Thanks for the appreciation guys. But the irony is I dunno if anyone will even read all of it, let alone discuss it :(

Lets see....
 
Good article. You are lucky that even the most naive implementation of this program doesn't need much in the name of synchronization.

Dynamic load balancing - good idea.
 
Karan said:
Lol... didn't have any time at work.. Hence had to complete it over a weekend :)

Thanks for the appreciation guys. But the irony is I dunno if anyone will even read all of it, let alone discuss it :(

Lets see....

maybe you can come up with a short version with all the main points that people can click and expand if they want to get into the details , anyways nice work
 
KingKrool said:
Good article. You are lucky that even the most naive implementation of this program doesn't need much in the name of synchronization.
Dynamic load balancing - good idea.

Thanks buddy.... The next version (0.5) is gonna be Open Source. And after that I will need help from all you guys to really take this to a new level.

I hope I have your support guys. I particularly need help in streamlining the benchmarking part.

Synchronization, yea a little bit was used. Nothing hardcore :)

The good thing is, when I scaled this program to an 8-core Xeon server and ran a 1E8 test, I saw that all 8 cores were 100% utilized for the entirety of the bench run. The dynamic load balancing is working well :)
 
Too high level for me.... didn't bother reading the whole thing.

I guess KingKrool was the only guy who read the full thing :p
 
As promised guys:

SuperPrime 0.5 (download)

Version 0.5.1

30-Oct-2006

- Made SuperPrime™ available under the GPL license :)

- Renamed the title of the "Checking for Update" window.

- Tweaked the primality test a bit.

- Changed version number to 0.5.1

Version 0.5

29-Oct-2006

- Will release the source with version 0.5.1 with which I will also test the auto update feature.

- Added auto update feature :)

- Reorganized the source code so that it is more readable.

- Slight adjustment to the load balancing algorithm.

- Moved the time resolution code a bit for more accurate measurements.

- Made SuperPrime™ open source under the GPL license :)

- Changed version number to 0.5

OpenSource for the win :)

 
I assume if you are not using the sieve method, then dynamic load balancing may not be required.... The way you divided the range was the basic fallacy. If you just have threads = number of cores and make the threads work on consecutive integers, it will not lead to starvation of any of the cores. So as an example, take a 4-core system and the range 1-100. At the beginning, thread 1 finds whether 1 is a prime and thread 4 finds whether 4 is a prime. Then thread 1 moves to 5, thread 2 to 6, thread 3 to 7, thread 4 to 8, etc. The threads may not finish at exactly the same time but will be more or less sharing an equal workload. Actually, you can even implement the sieve method this way after a bit of thinking ;). Don't divide by all numbers, only divide by what are known to be primes ;). You'd need some shared memory where the results are put up by the individual threads. It will involve a bit of sync'ing but still hardly anything difficult :p.
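The interleaved assignment described above would look roughly like this (my sketch, not anyone's actual code; same GIL caveat as before, and `is_prime` is again a plain trial-division stand-in):

```python
import threading

def is_prime(n):
    """Trial division up to sqrt(n)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_strided(W, N):
    """Thread i tests i+1, i+1+N, i+1+2N, ...: consecutive integers are
    dealt round-robin across the N threads, so no thread gets only the
    'cheap' low numbers or only the 'expensive' high ones."""
    counts = [0] * N
    def worker(i):
        counts[i] = sum(1 for n in range(i + 1, W + 1, N) if is_prime(n))
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)
```

Each thread's stride covers the whole 1..W span, which is exactly why the workloads come out roughly equal without any explicit balancing.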

Edit: Forgot to mention.... keep it up... really good intro to multithreading for n00bs. Pretty lucid and nicely written.
 
I agree that he doesn't really need much load balancing, but the idea was a good one. Most people are afraid of dynamic load balancing. Secondly, the way he did his stuff avoided excessive syncing. Consider the situation where you want each prime to not be reported more than once. What then? You will need to synchronize on each task. In such a case, a somewhat larger task grain would be appropriate.
 
As always, I have read the full article (though I understood it only in bits; I never want to miss reading Karan's useful articles even though some portion went over my head; will post back with my doubts and stuff).

Awesome article, the amount of thinking that you have done here is magnificent (although I am still trying to make sense of some of the high-funda stuff).

Reps for you :) :) @ crap, will have to spread some around, reps pending :). Waiting for more informative articles from you :)).
 
Darthcoder said:
As always, I have read the full article (though I understood it only in bits; I never want to miss reading Karan's useful articles even though some portion went over my head; will post back with my doubts and stuff).

Awesome article, the amount of thinking that you have done here is magnificent (although I am still trying to make sense of some of the high-funda stuff).

Reps for you :) :) @ crap, will have to spread some around, reps pending :). Waiting for more informative articles from you :)).

Words like this give me a lot of encouragement.... to further refine my work.

Thanks buddy :)
 
Nice tutorial Karan :) It was a good read.

If I might make a suggestion: you could create a pool of N threads at the start of the run and reuse them when necessary. This way you could avoid the overhead of creating new threads as and when the need arises.

This would improve performance if a considerable number of threads are getting spawned in one run of the application.
 
psynaps3 said:
Nice tutorial Karan :) It was a good read.

If I might make a suggestion: you could create a pool of N threads at the start of the run and reuse them when necessary. This way you could avoid the overhead of creating new threads as and when the need arises.
This would improve performance if a considerable number of threads are getting spawned in one run of the application.

Hmm good idea... I will definitely use it.... :)
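For the curious, the pool idea could be sketched like this (Python purely for illustration, since SuperPrime itself is not Python, and `count_pooled` etc. are made-up names): carve W into many small chunks and let a fixed pool of N threads pull chunks as they free up, which gives thread reuse and dynamic balancing in one shot.

```python
from concurrent.futures import ThreadPoolExecutor

def is_prime(n):
    """Trial division up to sqrt(n)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def count_chunk(bounds):
    lo, hi = bounds
    return sum(1 for n in range(lo, hi + 1) if is_prime(n))

def count_pooled(W, N, chunk=100):
    """N pooled threads are reused across many small chunks; an idle
    thread simply grabs the next chunk, so no core sits unutilized."""
    chunks = [(lo, min(lo + chunk - 1, W)) for lo in range(1, W + 1, chunk)]
    with ThreadPoolExecutor(max_workers=N) as pool:
        return sum(pool.map(count_chunk, chunks))
```

The chunk size is the knob here: small chunks balance better, larger chunks mean less queueing overhead.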
 