

Multi Threaded Programming Guide

Hey guys,

I'm now taking a shot at writing my second article.

I got a lot of hands-on experience while building SuperPrime™. I mean, of course I had some preliminary idea from before (read: self-learning and college), but I had never gone all-out in coding a true multi-threaded application capable of scaling efficiently.

I will give you a basic idea of what multi-threading is all about. Then I will explain how I proceeded through the different versions, algorithms and ideas. Hope you like the technical overview :)

Multi-Threading: What is it?

With the recent craze for multi-core processors, multi-threading as a programming technique has really gone to a new level altogether.

Earlier, whole programs were written in sequential form: one action after the other.

As an example, consider a multi-player game:

  • Contact server to get latest update.
  • Calculate client side physics simulation.
  • Play appropriate sound actions.
  • Render appropriate graphics on screen.
  • Handle user input.

Since these actions are executed purely sequentially, we are GENERATING bottlenecks by ourselves.

If you study the sequence, the visual aspect of the game is just one step in the line-up. So if any of the other steps lags for whatever reason, the graphics will suffer. Not in quality (of course), but in quantity. This causes what is famously known as "STUTTERING" and results in the spawning of people like Akshay Talwar (aka Techboy.)

The Alternate Way

So instead of attacking Techboy, we should attack the root problem, i.e. the sequential nature of the actions.

However, until now most DESKTOP users had "SINGLE core" machines, which meant that only one thread (stream) of instructions could be executed at any point in time. So even if game developers had made their games multi-thread capable, performance would have suffered, because the processors of the time were incapable of efficiently executing multiple threads. Hence all the effort of the programmers was concentrated on running the processes sequentially in as efficient a way as possible.

But now, with the easy availability of dual-core processors, the situation is changing rapidly.

Now the processor is capable of efficiently running multiple threads (at least 2 threads now :p). So instead of being limited to running the actions sequentially, the game programmer's options expand to:

  • Start rendering the world, excluding the players and related actions.
  • Buffer the sound files in memory so that playing them becomes less of a bottleneck.
  • Asynchronously get updates from the server. So we are not held up by internet lag any more.
  • Calculate physics simulations. Of course this is a pseudo-sequential step. In fact, all of them are. But at least you will find parallel actions here. This step has to start after we get updates from the server.
  • Handle user input.
  • Render user related graphics and HUD.
  • Play sounds, etc.

Of course, this list is not accurate and is not meant to be. But it does give you a rough picture of what I mean by multi-threaded programming. Now, instead of dealing with the nitty-gritty of multi-threaded game programming, I will focus on something I can speak about with some authority.
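To make the idea concrete, here is a toy sketch of independent subsystems running on their own threads. This is purely illustrative, not real game code: the subsystem names and the one-shot `run_subsystem` function are made up for the example.

```python
import threading
import queue

# stand-ins for the game actions listed above
SUBSYSTEMS = ["network update", "physics", "audio", "render", "input"]

finished = queue.Queue()  # thread-safe collection of completed work

def run_subsystem(name):
    # a real game would loop here; we just record that the work ran
    finished.put(name)

threads = [threading.Thread(target=run_subsystem, args=(s,)) for s in SUBSYSTEMS]
for t in threads:
    t.start()
for t in threads:
    t.join()

# every subsystem completed without waiting in line behind the others
done = sorted(finished.get() for _ in SUBSYSTEMS)
```

The point is only that no subsystem sits in a queue behind the others: a slow network update no longer delays rendering.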

SuperPrime™: Through the Versions

When I first started programming SuperPrime (well, it was called PNS then), my intention was to fully engage as many cores as the user selects (I gave an option in the program for the number of threads) in calculating up to a fixed working set size (again selectable by the user.)

However, it was not a very successful attempt on the first try, and I will tell you why.

The two approaches I had

I had two possible solutions in front of me when I started programming SP.

  • Either split the "whether a number is prime or not" test itself across multiple threads and work your way through the working set, or
  • Split the working set into multiple parts (user selected) and then proceed with the calculation.

My first Implementation

The first approach is obviously more complicated and requires a deep understanding of thread synchronization and related topics. Hence I shied away from it and decided on the second approach. What I did was this:

Code:
Algorithm:

1- Take the input from the user
  a. Take number of threads -> N
  b. Take working set size -> W
  c. Store starting time -> Tstart
2- Split the working set into required # of parts (W / N)
3- Start N number of threads each assigned to work on the selected range.
4- Wait for all N threads to finish executing (by way of Auto Events)
5- Store ending time -> Tend
6- Calculate the total working time = Ttotal = Tend - Tstart
7- Show output to user.
8- End
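The algorithm above maps quite directly onto threads. Here is a minimal sketch in Python; note that the original program was not written in Python (it used Auto Events where this sketch uses `join`), and the trial-division `is_prime` is just my stand-in for whatever primality test SuperPrime actually uses.

```python
import threading
import time

def is_prime(n):
    # simple trial division; a stand-in for the real primality test
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def run_static(w, n_threads):
    t_start = time.perf_counter()            # step 1c
    size = w // n_threads                    # step 2: W / N per thread
    counts = [0] * n_threads

    def worker(i, lo, hi):
        counts[i] = sum(1 for x in range(lo, hi + 1) if is_prime(x))

    threads = []
    for i in range(n_threads):               # step 3: one thread per range
        lo = i * size + 1
        hi = w if i == n_threads - 1 else (i + 1) * size
        t = threading.Thread(target=worker, args=(i, lo, hi))
        threads.append(t)
        t.start()
    for t in threads:                        # step 4: wait for all threads
        t.join()
    t_total = time.perf_counter() - t_start  # steps 5-6
    return sum(counts), t_total
```

For example, `run_static(2000, 2)` gives thread 1 the range 1-1000 and thread 2 the range 1001-2000, which is exactly the split whose weakness is discussed next.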

Obviously there was a severe limitation in this approach.

Due to the polynomial nature of the primality test (applied to each number in the working set), larger numbers take longer to be tested for primality than smaller numbers.

Since my working set was linear in size and went from 0 -> W (from the algorithm above), the starting ranges would complete their calculation much faster than the later ranges. And that held even more true for large working set sizes.
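You can see this imbalance by counting how many trial divisions a naive test performs on each half of a W = 2000 working set. Trial division is my assumption here; the exact numbers depend on the test used, but the trend is the same for any test whose cost grows with the size of the number.

```python
def trial_division_ops(n):
    # how many candidate divisors we examine before deciding about n
    d, ops = 2, 0
    while d * d <= n:
        ops += 1
        if n % d == 0:
            return ops  # composite: we can stop early
        d += 1
    return ops  # survived every divisor up to sqrt(n): prime

lower_half = sum(trial_division_ops(n) for n in range(1, 1001))
upper_half = sum(trial_division_ops(n) for n in range(1001, 2001))
# upper_half comes out noticeably larger: the thread assigned the upper
# range has more work to do even though its range is the same size
```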

Funky noticed this behaviour of the code and duly reported to me that for the same working set size, selecting N (number of threads) greater than the number of actual cores available was giving him better times. This was because with a large W and a small N, the initial range was completing much faster, and after that one core sat unutilized while the other struggled to complete the upper range. I was stumped at first and didn't know why this was happening. Then I thought of the reason I just described.

Hence I was left with the job of somehow ensuring that both cores would continue to be utilized at all times, thus ensuring proper scaling.

The first Workaround

Since I realized that the upper ranges were taking longer to execute than the equal-sized lower ranges (due to the nature of the primality test: the bigger the number, the longer the time required), I decided upon one easy solution.

Instead of dividing the full working range into N equal-sized parts, I decided to do a 3:2 distribution.

So for example if W = 2000 and N = 2,

instead of doing: R1 = 1 - 1000, R2 = 1001 - 2000

I would do: R1 = 1 - 1200, R2 = 1201 - 2000

Obviously, this was a simplistic solution to the problem... But since I only had a meagre dual-core setup with me :p I couldn't really predict the running situation on quad-core systems.

When I gave the new version to Funky to try out, he reported a slightly improved situation, but this was not a perfect solution. For even greater ranges, the larger first range was completing execution earlier than the short second range, and therefore the test was regressing to a single-threaded test after that.

I discovered the biggest problem when some people with quad-core processors started running the test. I was expecting the scores to scale linearly from dual core to quad core. However, it was not happening, and I was left dumbfounded. I wondered why, and I didn't have first-hand access to such machines (until some time later, when I could test the program on an 8-way Intel machine :))

However, after some introspection, the problem dawned upon me.

With dual-core processors, the ranges (for W = 2000 and N = 2) were getting divided as follows (with the 3:2 divide ratio):

R1 = 1 - 1200, R2 = 1201 - 2000

However, when we upped N to 4 with the same W on the same machine, we got this:

R1 = 1 - 1200, R2 = 1201 - 1680, R3 = 1681 - 1872, R4 = 1873 - 2000

Though it still held true that upper ranges took longer to execute than equally sized lower ranges, at this juncture the upper ranges were becoming too small to take longer than the progressively larger lower ranges.

Hence R4 was finishing execution very early, and a little later R3 would finish too. After that, R1 and R2 would struggle along with their larger ranges, and the test would regress to a two-thread test.
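Those ranges can be reproduced by repeatedly handing 3/5 of whatever is still unassigned to the next thread, with the last thread taking the remainder. A quick sketch, assuming that is indeed how the 3:2 split generalised to more threads:

```python
def split_ranges(w, n_threads, ratio=0.6):
    # each thread (except the last) takes `ratio` of the still-unassigned
    # numbers; the last thread gets whatever is left over
    ranges, lo = [], 1
    for _ in range(n_threads - 1):
        size = int((w - lo + 1) * ratio)
        ranges.append((lo, lo + size - 1))
        lo += size
    ranges.append((lo, w))
    return ranges

# split_ranges(2000, 2) -> [(1, 1200), (1201, 2000)]
# split_ranges(2000, 4) -> [(1, 1200), (1201, 1680), (1681, 1872), (1873, 2000)]
```

Each successive range is 40% the size of the one before it, which is exactly why the top ranges become tiny as N grows.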

Necessity: Mother of Invention

I had to come up with an innovative idea to handle this situation. Clearly this static load distribution was not scaling, and something had to be done about it. I quit trying to find workarounds and really sat down to come up with something good.

And I finally decided upon this approach:

Dynamic Load Balancing

Code:
Algorithm:

1- Take the input from the user
  a. Take number of threads -> N
  b. Take working set size -> W
  c. Store starting time -> Tstart
2- Spawn one worker thread.
3- Each worker thread checks on every iteration whether the number of RUNNING threads (R) is < N.
4- If R < N Then
  a. Check if we are allowed to spawn (more on this later.) ***
  b. If yes:
     i. Divide our REMAINING working set into 2 equal parts and spawn another worker thread to handle one part.
     ii. Continue working on our own part.
5- Wait for all worker threads to finish executing (by way of Auto Events)
6- Store ending time -> Tend
7- Calculate the total working time = Ttotal = Tend - Tstart
8- Show output to user.
9- End

I will explain the above a bit.

See, now instead of statically dividing the working set W into N parts and starting off the worker threads, I incorporated the logic into the worker thread itself. So each worker thread assesses the situation (the number of currently running threads R, the max number of threads allowed N, etc.) and spawns a new worker thread by subdividing its own RANGE.

Now I will explain step 4(a) of the above algorithm a little.

The conditions which are checked before spawning a new thread are:

  • The number of running threads R < allowed max threads N.
  • THIS particular worker thread (the one checking for a new spawn) has the largest remaining working set size. This way, the thread handling the largest working set RANGE gets to divide its work into 2.
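Here is one way the whole scheme could look in code. This is my own reconstruction in Python, not SuperPrime's actual source: the `Worker`/`Pool` classes, the locking scheme and the per-iteration split check are all assumptions, and trial division again stands in for the real primality test.

```python
import threading

def is_prime(n):
    # trial division, standing in for the real primality test
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

class Worker:
    def __init__(self, pool, lo, hi):
        self.pool, self.pos, self.hi = pool, lo, hi
        self.found = 0
        self.thread = threading.Thread(target=self.run)

    def remaining(self):
        return self.hi - self.pos

    def run(self):
        while self.pos <= self.hi:
            if is_prime(self.pos):
                self.found += 1
            self.pos += 1
            self.maybe_split()           # step 3: check on every iteration
        with self.pool.lock:
            self.pool.running.remove(self)

    def maybe_split(self):
        pool = self.pool
        with pool.lock:
            if len(pool.running) >= pool.max_threads:   # step 4: R < N?
                return
            # step 4a: only the worker with the largest remaining
            # range is allowed to spawn
            if any(w is not self and w.remaining() >= self.remaining()
                   for w in pool.running):
                return
            if self.remaining() < 2:
                return
            # step 4b: hand the upper half of what is left to a new worker
            mid = self.pos + self.remaining() // 2
            child = Worker(pool, mid + 1, self.hi)
            self.hi = mid
            pool.running.append(child)
            pool.workers.append(child)
            child.thread.start()

class Pool:
    def __init__(self, w, max_threads):
        self.max_threads = max_threads
        self.lock = threading.Lock()
        first = Worker(self, 1, w)       # step 2: spawn one worker
        self.running = [first]
        self.workers = [first]
        first.thread.start()

    def result(self):
        i = 0
        while True:                      # step 5: wait for every worker,
            with self.lock:              # including ones spawned later
                if i >= len(self.workers):
                    break
                t = self.workers[i].thread
            t.join()
            i += 1
        return sum(w.found for w in self.workers)
```

For example, `Pool(2000, 2).result()` counts the primes up to 2000 while keeping two threads busy until the very end; whenever a worker finishes, the survivor with the largest leftover range splits itself in two.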

Example Run

This way for a working set size W = 2000 and N = 2, it works out like this:

  • The first worker thread is spawned with range = 2000.
  • It sees that the number of running threads is 1 and that it can spawn another thread.
  • Now it checks whether it is the worker thread with the max RANGE. This turns out to be true, as it is the only worker thread.
  • It divides its range (i.e. 2000) into 2 and spawns another thread. So now there are 2 worker threads with range = 1000 each.

Now assume that both threads are crunching numbers; because running threads R = 2 = max threads N, no more spawns happen.

Then the first worker completes its range (because it was dealing with the lower range). At that point, the second worker sees that the number of running threads has dropped to 1. It checks for a respawn. It sees that it is the only worker thread and thus has the biggest RUNNING range.

Voila! It subdivides whatever is left of its original working set RANGE and spawns another thread. Now the number of running threads goes back to 2. So we see that all the cores are always kept busy and are dynamically loaded.

This is the implementation in the current version of SuperPrime (0.4.3, not released yet; I will be releasing it soon with a surprise :))

Conclusion

With the recent advances in technology (not just at the bleeding edge, but for mainstream users too), multi-threaded programming is really taking off. So if you are in a position to decide between linear sequential execution and multi-threaded execution, please do your users a favour and decide upon the latter. It will do the WORLD a lot of good.

Hope you guys liked my article. I have tried to be as fluid as possible in explaining how I implemented and tackled the algorithm that sits behind SuperPrime right now. Hope it was a good read :)

Regards,

Karan
Team TechEnclave

ps: wrote my first article w/o any pics :) omg.... and thanks to the mods for keeping the contest open a wee bit longer. I had a very rough week and couldn't complete this article on time.

 
6pack said:
wow geeky programmers stuff! :S

i read the whole guide and its great! good work!:hap2:

Nice to see people still reading this :erm: article.

Somehow I had missed Chaos's and KingKrool's comments here.

Yeah Chaos, that idea had crossed my mind. Basically, each thread "de-queues" one number and starts testing it, and the rest of the threads do the same. I could have implemented a simple number factory for this purpose (as a Queue would have eaten too much memory for such a simple thing.)

However, it would have needed extra complexity for synchronization. It comes naturally to me to program in such a way that there is little to no necessity for synchronization.
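For reference, the "number factory" can indeed be done with a single shared counter instead of a real queue, and the lock around it is exactly the synchronization cost mentioned above. A hypothetical sketch (the `count_primes` function and its dict-based counter are my inventions, not anything from SuperPrime):

```python
import threading

def is_prime(n):
    # trial division, standing in for the real primality test
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def count_primes(w, n_threads):
    state = {"next": 1}          # the "number factory": no queue needed
    lock = threading.Lock()      # the synchronization cost per number
    found = [0] * n_threads

    def worker(i):
        while True:
            with lock:           # "de-queue" the next number to test
                n = state["next"]
                state["next"] += 1
            if n > w:
                return
            if is_prime(n):
                found[i] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(found)
```

The upside is perfect load balancing for free; the downside is that every single number costs one lock round-trip, which is the overhead the range-splitting design avoids.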

Anyway, a newer version is ready; I just haven't had time to release it.

@All: Thanks for the +ve words guys, gives encouragement.

Regards,
K
 