flat assembler
Message board for the users of flat assembler.

Index > Windows > XP scheduler is broken

revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20450
Location: In your JS exploiting you and your system
revolution 20 Feb 2008, 15:48
[rant]

The Windows XP thread scheduler is not round-robin on multicore CPUs.

I have a dual core machine running three worker threads. They all run at the same priority of 1 (the lowest), and they all started at the same time, but one thread gets an entire core to itself while the other two have to share.

This does not happen on single core machines, where the threads get 1/n of the time each.

Very frustrating!

[/rant]
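
For the record, the effect shows up with a test along these lines - a minimal C sketch of the idea only (the thread body and counter names are mine; the real test is the Scheduler.asm attached further down the thread):
Code:
#include <windows.h>
#include <stdio.h>

#define NTHREADS 3

static volatile LONG64 work[NTHREADS];  /* one counter per thread; single writer each */

static DWORD WINAPI worker(LPVOID param) {
    int id = (int)(INT_PTR)param;
    for (;;) work[id]++;                /* pure CPU burn, never blocks */
}

int main(void) {
    HANDLE h[NTHREADS];
    LONG64 total = 0;
    int i;
    for (i = 0; i < NTHREADS; i++) {
        h[i] = CreateThread(NULL, 0, worker, (LPVOID)(INT_PTR)i, 0, NULL);
        SetThreadPriority(h[i], THREAD_PRIORITY_IDLE);  /* base priority 1 */
    }
    Sleep(10000);                       /* let them fight over the cores */
    for (i = 0; i < NTHREADS; i++) total += work[i];
    for (i = 0; i < NTHREADS; i++)      /* rough share of work done */
        printf("Thread: %d, %3d%%\n", i, (int)(work[i] * 100 / total));
    return 0;
}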
Plue



Joined: 15 Dec 2005
Posts: 151
Plue 20 Feb 2008, 18:35
Well, you have three threads and two cores. What do you want Windows to do? Surely it can't add another core for you. Wink
Of course you could split one of the threads across two cores, but that would most likely cause cache thrashing and lower overall throughput. I think they simply prioritized throughput over fairness, which is perhaps a "fair" choice.

Edit:
Quote:
They all started at the same time
I thought you said you only had two cores? It's technically impossible to start three threads at the same time on two cores.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20450
Location: In your JS exploiting you and your system
revolution 20 Feb 2008, 18:53
It doesn't seem very fair to me: one thread runs for 6 hours solid on one core, while two other threads of the same priority get 3 hours each on the other core. Cache thrashing is not an issue; they are all in the same process, and the cache is not invalidated when switching threads, only when switching processes.

Yes, technically, two threads can start together and one has to wait a time slice, okay, but on a human time scale (I'm human, btw Wink) they all started together.
Plue



Joined: 15 Dec 2005
Posts: 151
Plue 20 Feb 2008, 19:03
Quote:
Cache thrashing is not an issue, they are all in the same process, the cache is not invalided when switching threads only when switching processes.
But the processor cache can only hold so much. Unless all the threads are working on the same data, they will destroy the cache for each other.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20450
Location: In your JS exploiting you and your system
revolution 20 Feb 2008, 19:40
I wish it would do this:

Time slice 0: T1 + T2
repeat
Time slice 1: T1 + T3 ;T2 cache lost
Time slice 2: T2 + T3 ;T1 cache lost
Time slice 3: T2 + T1 ;T3 cache lost
Time slice 4: T3 + T1 ;T2 cache lost
Time slice 5: T3 + T2 ;T1 cache lost
Time slice 6: T1 + T2 ;T3 cache lost
end repeat

But it does this:

Time slice 0: T1 + T2
repeat
Time slice 1: T1 + T3 ;T2 cache lost
Time slice 2: T1 + T2 ;T3 cache lost
Time slice 3: T1 + T3 ;T2 cache lost
Time slice 4: T1 + T2 ;T3 cache lost
Time slice 5: T1 + T3 ;T2 cache lost
Time slice 6: T1 + T2 ;T3 cache lost
end repeat

There is no difference in the amount of cache thrashing, but the sharing is more even with the first schedule.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20450
Location: In your JS exploiting you and your system
revolution 20 Feb 2008, 23:12
The algo seems very simple to me.

Step 0: Choke whoever is eating the most.
Step 1: Feed whoever is starving the most.

Simple. Very simple. Exceedingly simple! Did I mention it was simple? Because it is. Simple, that is.

Example.

Time Slice=0:
T1 hasn't eaten yet
T2 hasn't eaten yet
T3 hasn't eaten yet
So schedule CPU0=T1, CPU1=T2

Time Slice=1:
T1 has been feeding for one turn
T2 has been feeding for one turn
T3 hasn't eaten yet
So schedule CPU0=T3, CPU1=T2

Time Slice=2:
T1 hasn't eaten yet
T2 has been feeding for two turns
T3 has been feeding for one turn
So schedule CPU0=T3, CPU1=T1

Time Slice=3:
T1 has been feeding for one turn
T2 hasn't eaten yet
T3 has been feeding for two turns
So schedule CPU0=T2, CPU1=T1

Time Slice=4:
T1 has been feeding for two turns
T2 has been feeding for one turn
T3 hasn't eaten yet
So schedule CPU0=T2, CPU1=T3

... etcetera, etcetera, etcetera,

easy peasy, oh yeah, it is simple also.

It generalises for any number of CPUs and any number of threads. The only extra thing I haven't mentioned is that higher priority threads get first pick if they are not blocked.
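
In C the pick might look something like this - a rough sketch of the idea only, ignoring the priority rule and with arbitrary tie-breaking, obviously not real kernel code:
Code:
#define NCPU 2
#define NTHR 3

typedef struct {
    int starved;    /* time slices since the thread last ran */
    int blocked;    /* nonzero if waiting on something */
} Thread;

/* Each slice: give every CPU the runnable thread that has gone
   longest without eating, one CPU per thread. */
void schedule_slice(Thread t[NTHR], int running[NCPU]) {
    int used[NTHR] = {0};
    int cpu, i;
    for (cpu = 0; cpu < NCPU; cpu++) {
        int best = -1;
        for (i = 0; i < NTHR; i++)
            if (!t[i].blocked && !used[i] &&
                (best < 0 || t[i].starved > t[best].starved))
                best = i;               /* feed whoever is starving the most */
        running[cpu] = best;
        if (best >= 0) used[best] = 1;
    }
    for (i = 0; i < NTHR; i++)          /* whoever missed out gets hungrier */
        if (used[i]) t[i].starved = 0;
        else if (!t[i].blocked) t[i].starved++;
}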

@Microsoft: I hope you are reading and learning, it's called round-robin, check your own search engine.
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 20 Feb 2008, 23:14
Having the same amount of cache lost doesn't say much about execution speed. Sure, ideally each T# could be designed to use only ~1/3 of the cache to reduce the speed loss from thrashing. But with most applications still being single threaded, there is a greater likelihood that each T# is designed to use ~100% of the cache. In that case T1 runs as expected and T2/T3 run at a lower level of performance - the round-robin approach would have every T# running sub-optimally (possibly at much less than 50% of T1).

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20450
Location: In your JS exploiting you and your system
revolution 20 Feb 2008, 23:22
But how could XP possibly "know" that T1 needs all the cache and T2/T3 don't? Of course it can't know; it simply chose one thread at random to run twice as much as the other two. What if T2 actually needs lots of cache and T1 is doing a much simpler task? Then you get worse overall performance.

The cache thing is a red herring anyway, and is independent of the scheduling algo; I only mentioned it because Plue was concerned about it. But it shows that the worst case is no worse than the current behaviour as far as cache performance goes.

Indeed, for my task at hand the working set size is <<< 1/3 of the cache, so no thrashing will actually take place.
kandamun



Joined: 20 Jul 2005
Posts: 25
kandamun 21 Feb 2008, 07:56
revolution,
can you try to call SetThreadAffinityMask, just to make sure that all threads are allowed to run on each processor?
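
For reference, the call is straightforward; here hThread is a hypothetical handle from CreateThread, and mask 0x3 allows both cores of a dual core:
Code:
#include <windows.h>
#include <stdio.h>

/* Allow a worker to run on both cores: bit 0 = CPU0, bit 1 = CPU1. */
void allow_both_cores(HANDLE hThread) {
    DWORD_PTR previous = SetThreadAffinityMask(hThread, 0x3);
    if (previous == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
}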
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20450
Location: In your JS exploiting you and your system
revolution 21 Feb 2008, 08:03
kandamun wrote:
can you try to call SetThreadAffinityMask, just to make sure that all threads are allowed to run on each processor?
Thanks for the suggestion. I had already tried all the affinity masks; it had no effect Sad
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20450
Location: In your JS exploiting you and your system
revolution 21 Feb 2008, 15:13
I examined this further; it seems that only three threads are affected. Five threads show a small bias, but not a lot.

Results:
Code:
Single core                 Dual core
----------------------------------------
Total threads: 1            Total threads: 1
Thread: 0, 100%             Thread: 0, 100%

Total threads: 2            Total threads: 2
Thread: 0,  50%             Thread: 0,  49%
Thread: 1,  49%             Thread: 1,  50%

Total threads: 3            Total threads: 3
Thread: 0,  33%             Thread: 0,  24%
Thread: 1,  33%             Thread: 1,  49%
Thread: 2,  33%             Thread: 2,  25%

Total threads: 4            Total threads: 4
Thread: 0,  25%             Thread: 0,  25%
Thread: 1,  25%             Thread: 1,  24%
Thread: 2,  25%             Thread: 2,  25%
Thread: 3,  24%             Thread: 3,  24%

Total threads: 5            Total threads: 5
Thread: 0,  20%             Thread: 0,  19%
Thread: 1,  19%             Thread: 1,  21%
Thread: 2,  19%             Thread: 2,  19%
Thread: 3,  20%             Thread: 3,  19%
Thread: 4,  19%             Thread: 4,  19%

Total threads: 6            Total threads: 6
Thread: 0,  16%             Thread: 0,  17%
Thread: 1,  16%             Thread: 1,  16%
Thread: 2,  16%             Thread: 2,  17%
Thread: 3,  16%             Thread: 3,  16%
Thread: 4,  16%             Thread: 4,  16%
Thread: 5,  16%             Thread: 5,  16%

Total threads: 7            Total threads: 7
Thread: 0,  14%             Thread: 0,  14%
Thread: 1,  14%             Thread: 1,  14%
Thread: 2,  14%             Thread: 2,  14%
Thread: 3,  13%             Thread: 3,  14%
Thread: 4,  14%             Thread: 4,  14%
Thread: 5,  14%             Thread: 5,  14%
Thread: 6,  13%             Thread: 6,  13%

Total threads: 8            Total threads: 8
Thread: 0,  12%             Thread: 0,  12%
Thread: 1,  12%             Thread: 1,  12%
Thread: 2,  12%             Thread: 2,  12%
Thread: 3,  12%             Thread: 3,  12%
Thread: 4,  12%             Thread: 4,  12%
Thread: 5,  12%             Thread: 5,  12%
Thread: 6,  12%             Thread: 6,  12%
Thread: 7,  12%             Thread: 7,  12%
Another curiosity: it is always the second created thread (thread 1) that gets the extra runtime.


Description: Multi-thread tester
Download
Filename: Scheduler.asm
Filesize: 2.29 KB
Downloaded: 364 Time(s)
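
For anyone who prefers C to the attached assembly, a tally like the above can be produced along these lines - a sketch using GetThreadTimes, with my own names; the real Scheduler.asm may measure differently:
Code:
#include <windows.h>
#include <stdio.h>

/* Sum one thread's kernel + user time; FILETIME counts 100 ns units. */
static ULONGLONG cpu_time(HANDLE h) {
    FILETIME creation, exited, kernel, user;
    ULARGE_INTEGER k, u;
    GetThreadTimes(h, &creation, &exited, &kernel, &user);
    k.LowPart = kernel.dwLowDateTime;  k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;    u.HighPart = user.dwHighDateTime;
    return k.QuadPart + u.QuadPart;
}

/* Print each thread's share of the total CPU time consumed so far. */
void report(HANDLE h[], int n) {
    ULONGLONG t[64], total = 0;
    int i;
    for (i = 0; i < n; i++) total += (t[i] = cpu_time(h[i]));
    for (i = 0; i < n; i++)
        printf("Thread: %d, %3u%%\n", i, (unsigned)(t[i] * 100 / total));
}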



r22



Joined: 27 Dec 2004
Posts: 805
r22 21 Feb 2008, 21:27
Creating more threads than you have virtual CPUs is bad design practice.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20450
Location: In your JS exploiting you and your system
revolution 22 Feb 2008, 03:27
r22 wrote:
Creating more threads than you have virtual CPUs is bad design practice.
This has been standard practice since the dawn of computing. Remember the old university mainframes: they would run hundreds of users' programs at the same time using pre-emptive scheduling.

I think it is a little more complicated than how many CPUs one has; many other factors need to be considered to determine the optimal number of threads to use. If threads occasionally block waiting for input, the gaps can be used fruitfully by other threads, but that only works if you have more threads running than you have CPUs.
r22



Joined: 27 Dec 2004
Posts: 805
r22 22 Feb 2008, 15:08
I think we have colliding opinions on the proper use of threads in general and the difference between a process and a thread.

Raw Computation - parallel calculations (e.g. stream processing) with no blocking.
For raw computation purposes, 1 thread per CPU is the logical choice, because you'd lose throughput to context switching with any more.

Worker Threads - one thread handles input and feeds the 'worker' threads (e.g. web servers).
For worker-thread situations, X * the number of CPUs is the logical choice, because I/O blocking creates a void in CPU usage that needs to be filled to get the highest throughput.
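
A sketch of that sizing rule (X is whatever multiplier suits the workload; nothing magic about it):
Code:
#include <windows.h>

/* "X * the number of CPUs": dwNumberOfProcessors counts virtual CPUs. */
DWORD worker_count(DWORD x) {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return x * si.dwNumberOfProcessors;  /* e.g. x = 2 for an I/O-heavy server */
}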

Quote:

If threads occasionally get blocked waiting to input, then the time can be used fruitfully by other threads to fill in the gap, but this only works if you have more threads running than you have CPUs.


In your first post it seemed like you were referring to the Raw Computation thread model (hence my reply: 1 thread per CPU), which is the only model where scheduling is an issue. But your quote above talks about the Worker thread model, where scheduling is based on I/O (a worker thread only uses CPU when it's fed), so it's illogical to talk about time slicing in this respect.

Why it's illogical: in both thread models, threads will be scheduled to finish their task in the "fastest possible time", not in the "fairest for all amount of time". Scheduling them round-robin would cause unneeded context switching, which takes away from throughput.

So time slicing and scheduling inconsistencies are either a direct result of poor software design (if using the Raw Computation model) or totally irrelevant (if using the Worker model).
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20450
Location: In your JS exploiting you and your system
revolution 22 Feb 2008, 15:44
I accept your description, thanks. The only thing is where you mention context switching. For threads in the same process the context switch is not an issue; it takes about the same time as resuming a thread, since no cache or memory tables are affected. The CPU switches to Ring0 to decide the next thread, and the OS just sets the address from which to reload the CPU registers. When there is no change of process, thread switching is very smooth.

For my particular case I need all three threads to perform at the same level; it is necessary for the task (a three player game). So when one player (thread 1) was considerably ahead, I realised something was very wrong: it was unfairly getting more time to consider its plays. So all my results were ruined, hence the rant in the first post.

Unfortunately not all tasks can adjust themselves to the hardware they are running on, and in this case I could not work out how to solve it efficiently. So I ended up making another dummy thread (4 in total) that just burns time doing NOPs.
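
In C terms the workaround amounts to something like this sketch (my actual code is assembly, of course):
Code:
#include <windows.h>

/* A fourth thread that just burns cycles, so the two cores see an even
   four runnable threads instead of a lopsided three. */
static DWORD WINAPI dummy(LPVOID param) {
    (void)param;
    for (;;) { }                        /* spin forever */
}

void start_dummy(void) {
    HANDLE h = CreateThread(NULL, 0, dummy, NULL, 0, NULL);
    SetThreadPriority(h, THREAD_PRIORITY_IDLE);  /* same lowest priority */
}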
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 22 Feb 2008, 17:40
You most likely already figured this out: the solution in that case is to create a thread for each CPU and dispatch jobs to them, or to create multiple instances of a thread which handles all players.

I'm using the latter choice in one of my projects. Every thread is responsible for updating the job queue and performing jobs. No thread is idle until all the work is done - this is the way we work at the warehouse. Smile
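
Roughly, in C - a sketch only, where Job and run_job stand in for whatever the real work is:
Code:
#include <windows.h>

#define NJOBS 1000

typedef struct { int player; int move; } Job;   /* placeholder job record */

static Job jobs[NJOBS];
static volatile LONG next_job = 0;

void run_job(const Job *j);             /* assumed: performs the actual work */

/* Every pool thread pulls the next job index atomically until none are left. */
static DWORD WINAPI pool_thread(LPVOID param) {
    (void)param;
    for (;;) {
        LONG i = InterlockedIncrement(&next_job) - 1;
        if (i >= NJOBS) break;          /* queue exhausted: thread retires */
        run_job(&jobs[i]);
    }
    return 0;
}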

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20450
Location: In your JS exploiting you and your system
revolution 22 Feb 2008, 18:11
My initial thought was to break each player thread into two parts (thus making 6 threads in total), but I could see almost no way to do that. Most of the processing has a linear dependency chain, and the opportunities for parallelism were insufficient to make it feasible. Result: An improbable solution.

My second thought was your suggestion, bitRAKE: break the task into smaller pieces and distribute them. But that creates a synchronisation difficulty; I would have to ensure each small processing slice was of an equal time length (for fairness), and the algorithm does not yield to deterministic timing in a straightforward way. Result: An unlikely solution.

My third thought was to create a dummy thread and waste 1/4 of the system bandwidth. Result: A possible solution.

My fourth thought was to disassemble and reverse engineer the kernel scheduler. Reprogram it to my specification. Email the new kernel to MS. Wait a long time for them to acknowledge the problem and maybe incorporate my patch. Then run my code. Result: An impossible solution.

So, I settled on solution 3.
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 22 Feb 2008, 19:20
Does the equality really require such a fine-grained time scale? My program seems very different from yours: I keep running totals of time and skip jobs to keep them all in line. Completely asynchronous jobs make this easy, though.
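
Something like this sketch, in C (do_slice is a placeholder for one unit of a player's work and its measured cost; this is my reading of the running-totals idea, not the actual project code):
Code:
#include <windows.h>

static ULONGLONG spent[3];              /* accumulated time per player */

ULONGLONG do_slice(int player);         /* assumed: runs one slice, returns its cost */

/* Always feed the player furthest behind so the long-run totals stay level. */
void fair_step(void) {
    int hungriest = 0, i;
    for (i = 1; i < 3; i++)
        if (spent[i] < spent[hungriest]) hungriest = i;
    spent[hungriest] += do_slice(hungriest);
}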

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20450
Location: In your JS exploiting you and your system
revolution 22 Feb 2008, 20:08
More time to think means an advantage is being given. Doing a comparative analysis means I need to ensure that external effects do not adversely influence the results. A small percentage variance due to non-algorithm-related things can be tolerated, but the long-term error needs to average out to zero across all threads, or else there will be a systematic error favouring or hurting one thread compared to the others.
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 23 Feb 2008, 00:28
I've always thought absolute equality is an illusion. It doesn't take much complexity for determinism to fall out the window, and it's an ongoing struggle to get close. Maybe taking advantage of the system without having inside knowledge of it is part of the game? Laughing

I used to play Eve Online, and many groups had impressive techniques that relied on systematic errors. The only way to combat the abuse was to add massive delays to guarantee synchronization - which created additional avenues for abuse and degraded game play. It's a great game though: an easy learning curve that really sucks you in, and scalable MMORPG action with high risks.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup