flat assembler
Message board for the users of flat assembler.
Plue 20 Feb 2008, 18:35
Well, you have three threads and two cores. What do you want Windows to do? Surely it can't add another core for you.
Of course you can split up one of the threads over two cores, but that will most likely cause cache thrashing and an overall lower throughput. I think they simply prioritized throughput over fairness, which is maybe a "fair" choice.

Edit:
Quote: They all started at the same time
revolution 20 Feb 2008, 18:53
Doesn't seem very fair to me: one thread runs for 6 hours solid on one core, while two other threads of the same priority run 3 hours each on the other core. Cache thrashing is not an issue here, they are all in the same process, and the cache is not invalidated when switching threads, only when switching processes.
Yes, technically, two threads can start together and one has to wait one time slice, okay, but in the human (I'm human btw) sense of things they all started at the same time.
Plue 20 Feb 2008, 19:03
Quote: Cache thrashing is not an issue here, they are all in the same process, and the cache is not invalidated when switching threads, only when switching processes.
revolution 20 Feb 2008, 19:40
I wish it would do this:
```
Time slice 0: T1 + T2
repeat
    Time slice 1: T1 + T3 ;T2 cache lost
    Time slice 2: T2 + T3 ;T1 cache lost
    Time slice 3: T2 + T1 ;T3 cache lost
    Time slice 4: T3 + T1 ;T2 cache lost
    Time slice 5: T3 + T2 ;T1 cache lost
    Time slice 6: T1 + T2 ;T3 cache lost
end repeat
```

But it does this:

```
Time slice 0: T1 + T2
repeat
    Time slice 1: T1 + T3 ;T2 cache lost
    Time slice 2: T1 + T2 ;T3 cache lost
    Time slice 3: T1 + T3 ;T2 cache lost
    Time slice 4: T1 + T2 ;T3 cache lost
    Time slice 5: T1 + T3 ;T2 cache lost
    Time slice 6: T1 + T2 ;T3 cache lost
end repeat
```

There is no difference in how much cache thrash there is, but the sharing is more regular with the first schedule.
revolution 20 Feb 2008, 23:12
The algo seems to me to be very simple.
Step 0: Choke whoever is eating the most.
Step 1: Feed whoever is starving the most.

Simple. Very simple. Exceedingly simple! Did I mention it was simple? Because it is, simple that is.

Example:

```
Time Slice=0: T1 hasn't eaten yet
              T2 hasn't eaten yet
              T3 hasn't eaten yet
              So schedule CPU0=T1, CPU1=T2
Time Slice=1: T1 has been feeding for one turn
              T2 has been feeding for one turn
              T3 hasn't eaten yet
              So schedule CPU0=T3, CPU1=T2
Time Slice=2: T1 hasn't eaten yet
              T2 has been feeding for two turns
              T3 has been feeding for one turn
              So schedule CPU0=T3, CPU1=T1
Time Slice=3: T1 has been feeding for one turn
              T2 hasn't eaten yet
              T3 has been feeding for two turns
              So schedule CPU0=T2, CPU1=T1
Time Slice=4: T1 has been feeding for two turns
              T2 has been feeding for one turn
              T3 hasn't eaten yet
              So schedule CPU0=T2, CPU1=T3
```

...etcetera, etcetera, etcetera. Easy peasy, oh yeah, it is simple also. It generalises for any number of CPUs and any number of threads. The only extra thing I haven't mentioned is that higher priority threads get first pick if they are not blocked.

@Microsoft: I hope you are reading and learning, it's called round-robin, check your own search engine.
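A toy C simulation of the rule, as a sketch (this is only an illustration of the selection idea, not a claim about how the real NT scheduler is implemented; the thread/CPU counts and the tie-breaking are simplified assumptions):

```c
/* Toy "feed whoever is starving the most" scheduler simulation.
   Each time slice the CPUs go to the threads with the least total CPU time
   so far; ties are broken by evicting whichever thread has run longest. */
#include <stdio.h>

#define CPUS    2
#define THREADS 3
#define SLICES  12

int main(void)
{
    int total[THREADS] = {0};   /* total slices each thread has received          */
    int run[THREADS]   = {0};   /* consecutive slices the thread has been running */

    for (int slice = 0; slice < SLICES; slice++) {
        int picked[THREADS] = {0};
        int chosen[CPUS];

        /* pick the CPUS most deserving threads */
        for (int c = 0; c < CPUS; c++) {
            int best = -1;
            for (int t = 0; t < THREADS; t++) {
                if (picked[t]) continue;
                if (best < 0 || total[t] < total[best] ||
                    (total[t] == total[best] && run[t] < run[best]))
                    best = t;
            }
            picked[best] = 1;
            chosen[c] = best;
        }

        printf("Time slice %2d:", slice);
        for (int c = 0; c < CPUS; c++)
            printf(" CPU%d=T%d", c, chosen[c] + 1);
        printf("\n");

        /* update the feeding/starving counters */
        for (int t = 0; t < THREADS; t++)
            run[t] = picked[t] ? run[t] + 1 : 0;
        for (int c = 0; c < CPUS; c++)
            total[chosen[c]]++;
    }

    for (int t = 0; t < THREADS; t++)
        printf("T%d received %d of %d slices\n", t + 1, total[t], SLICES);
    return 0;
}
```

Over 12 slices this gives each of the three threads 8 slices, i.e. the even split the real scheduler fails to deliver.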
bitRAKE 20 Feb 2008, 23:14
Having the same amount of cache lost doesn't indicate much with regard to execution speed. Sure, ideally T# could be designed to use only ~1/3 of the cache to reduce the speed loss due to thrashing. With most applications still being single-threaded, there is a greater likelihood that T# is designed to use ~100% of the cache. In which case T1 runs as expected and T2/T3 run at a lower level of performance - the round-robin approach would have all T# running sub-optimally (which can be much less than 50% of T1).
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution 20 Feb 2008, 23:22
But how would XP possibly "know" that T1 needs all the cache and T2/T3 don't? Of course it can't know; it simply chose one thread, effectively at random, to run twice as much as the other two. What if T2 is the one that actually needs lots of cache and T1 is doing a much simpler task? Then you get worse overall performance.
The cache thing is a red herring anyway, and is independent of the scheduling algo. I only mentioned it because Plue was concerned about it. But it shows the worst-case scenario is no worse than the current scenario as far as cache performance goes. Indeed, for my task at hand the working set size is <<< 1/3 of the cache and no thrashing will actually take place.
kandamun 21 Feb 2008, 07:56
revolution,
can you try to call SetThreadAffinityMask, just to make sure that all threads are allowed to run on each processor?
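Something along these lines (only a sketch of the idea, assuming the process affinity mask is unrestricted; error handling kept minimal):

```c
/* Allow the calling thread to run on every processor its process may use. */
#include <windows.h>
#include <stdio.h>

void allow_all_cpus(void)
{
    DWORD_PTR process_mask, system_mask;

    if (GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask)) {
        /* SetThreadAffinityMask returns the previous mask, or 0 on failure */
        if (SetThreadAffinityMask(GetCurrentThread(), process_mask) == 0)
            printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
    }
}
```

Each worker thread would call this once at startup.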
revolution 21 Feb 2008, 08:03
kandamun wrote: can you try to call SetThreadAffinityMask, just to make sure that all threads are allowed to run on each processor?
revolution 21 Feb 2008, 15:13
I examined this further; it seems that only three threads are affected. Five threads shows a small bias, but not a lot.
Results:

```
Single core              Dual core
----------------------------------------
Total threads: 1         Total threads: 1
 Thread: 0, 100%          Thread: 0, 100%
Total threads: 2         Total threads: 2
 Thread: 0, 50%           Thread: 0, 49%
 Thread: 1, 49%           Thread: 1, 50%
Total threads: 3         Total threads: 3
 Thread: 0, 33%           Thread: 0, 24%
 Thread: 1, 33%           Thread: 1, 49%
 Thread: 2, 33%           Thread: 2, 25%
Total threads: 4         Total threads: 4
 Thread: 0, 25%           Thread: 0, 25%
 Thread: 1, 25%           Thread: 1, 24%
 Thread: 2, 25%           Thread: 2, 25%
 Thread: 3, 24%           Thread: 3, 24%
Total threads: 5         Total threads: 5
 Thread: 0, 20%           Thread: 0, 19%
 Thread: 1, 19%           Thread: 1, 21%
 Thread: 2, 19%           Thread: 2, 19%
 Thread: 3, 20%           Thread: 3, 19%
 Thread: 4, 19%           Thread: 4, 19%
Total threads: 6         Total threads: 6
 Thread: 0, 16%           Thread: 0, 17%
 Thread: 1, 16%           Thread: 1, 16%
 Thread: 2, 16%           Thread: 2, 17%
 Thread: 3, 16%           Thread: 3, 16%
 Thread: 4, 16%           Thread: 4, 16%
 Thread: 5, 16%           Thread: 5, 16%
Total threads: 7         Total threads: 7
 Thread: 0, 14%           Thread: 0, 14%
 Thread: 1, 14%           Thread: 1, 14%
 Thread: 2, 14%           Thread: 2, 14%
 Thread: 3, 13%           Thread: 3, 14%
 Thread: 4, 14%           Thread: 4, 14%
 Thread: 5, 14%           Thread: 5, 14%
 Thread: 6, 13%           Thread: 6, 13%
Total threads: 8         Total threads: 8
 Thread: 0, 12%           Thread: 0, 12%
 Thread: 1, 12%           Thread: 1, 12%
 Thread: 2, 12%           Thread: 2, 12%
 Thread: 3, 12%           Thread: 3, 12%
 Thread: 4, 12%           Thread: 4, 12%
 Thread: 5, 12%           Thread: 5, 12%
 Thread: 6, 12%           Thread: 6, 12%
 Thread: 7, 12%           Thread: 7, 12%
```
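For anyone who wants to reproduce this kind of measurement, a minimal harness along the lines of the sketch below will do (this is only illustrative, not the code that produced the numbers above; the thread count and run time are arbitrary):

```c
/* Spawn N identical busy-loop threads, let them compete for the CPUs for a
   fixed time, then report each thread's share of the total work done. */
#include <windows.h>
#include <stdio.h>

#define NUM_THREADS 3
#define RUN_MS      10000

static volatile LONG g_stop = 0;
static volatile unsigned long long g_count[NUM_THREADS];

static DWORD WINAPI worker(LPVOID param)
{
    int id = (int)(INT_PTR)param;
    while (!g_stop)
        g_count[id]++;              /* pure CPU burn, no blocking */
    return 0;
}

int main(void)
{
    HANDLE h[NUM_THREADS];
    unsigned long long sum = 0;

    for (int i = 0; i < NUM_THREADS; i++)
        h[i] = CreateThread(NULL, 0, worker, (LPVOID)(INT_PTR)i, 0, NULL);

    Sleep(RUN_MS);                  /* let the threads fight over the cores */
    g_stop = 1;
    WaitForMultipleObjects(NUM_THREADS, h, TRUE, INFINITE);

    for (int i = 0; i < NUM_THREADS; i++)
        sum += g_count[i];
    for (int i = 0; i < NUM_THREADS; i++)
        printf("Thread: %d, %.0f%%\n", i, 100.0 * g_count[i] / sum);
    return 0;
}
```

Since all threads do identical work, each counter's share of the total approximates that thread's share of CPU time.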
Last edited by revolution on 13 Sep 2009, 03:34; edited 1 time in total
r22 21 Feb 2008, 21:27
Creating more threads than you have virtual CPUs is bad design practice.
revolution 22 Feb 2008, 03:27
r22 wrote: Creating more threads than you have virtual CPUs is bad design practice.

I think it is a little more complicated than how many CPUs one has; many other factors need to be considered to determine the optimal number of threads to use. If threads occasionally get blocked waiting for input, then that time can be used fruitfully by other threads to fill in the gap, but this only works if you have more threads running than you have CPUs.
r22 22 Feb 2008, 15:08
I think we have colliding opinions on the proper use of threads in general and the difference between a process and a thread.
Raw Computation - Parallel calculations (i.e. stream processing) with no blocking. For raw computation purposes, 1 thread per CPU is the logical choice, because you'd lose throughput to context switching with any more.

Worker Threads - One thread handles input and feeds the 'worker' threads (i.e. web servers). For worker-thread situations, X * the number of CPUs is the logical choice, because I/O blocking creates a void in CPU usage that needs to be filled to get the highest throughput.

Quote: If threads occasionally get blocked waiting for input, then that time can be used fruitfully by other threads to fill in the gap, but this only works if you have more threads running than you have CPUs.

In your first post it seemed like you were referring to the Raw Computation thread model (hence my reply of 1 thread per CPU), which is the only model where scheduling is an issue. But your above quote talks about the Worker thread model, where scheduling is based on I/O (a worker thread will only use CPU when it's fed), so it's illogical to talk about time slicing in this respect.

Why it's illogical: in both thread models, threads will be scheduled to finish their task in the "fastest possible time", not in the "fairest for all amount of time". Scheduling them round-robin would cause unneeded context switching, which takes away from throughput. So time slicing and scheduling inconsistencies are a direct result of poor software design (if using the Raw Computation model) or totally irrelevant (if using the Worker model).
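To put a number on the sizing rule (only a sketch; the multiplier of 4 for I/O-bound work is an illustrative assumption, not a fixed recommendation):

```c
/* Choose a thread count based on the model in use. */
#include <windows.h>

DWORD choose_thread_count(BOOL io_bound)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);                       /* dwNumberOfProcessors = logical CPUs */

    if (io_bound)
        return si.dwNumberOfProcessors * 4;   /* worker model: oversubscribe to hide I/O waits */
    return si.dwNumberOfProcessors;           /* raw computation: one thread per CPU */
}
```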
revolution 22 Feb 2008, 15:44
I accept your description, thanks. The only thing is where you mention context switching. For threads in the same process the context switch is not an issue; it takes about the same amount of time as resuming a thread, since no cache or memory tables are affected. The CPU switches to Ring0 to decide on the next thread, and the OS just sets the address from which to reload the CPU registers. When there is no change of process, thread switching is very smooth.

For my particular case I need all three threads to perform at the same level. It is necessary for the task (a three-player game). So when one player (thread 1) was considerably ahead, I realised something was very wrong: it was unfairly getting more time to consider its plays. So all my results were ruined, hence the rant in the first post. Unfortunately not all tasks can adjust themselves to the hardware they are running on, and in this case I could not work out how to solve it efficiently. So I ended up making another dummy thread (4 in total) that just burns time doing nops.
bitRAKE 22 Feb 2008, 17:40
You most likely already figured this out: The solution in that case is to create a thread for each CPU and dispatch jobs to them; or create multiple instances of a thread which handles all players.
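Roughly this shape, as a sketch (this only illustrates the shared-job-queue idea, not code from an actual project; the job contents and counts are placeholders):

```c
/* Identical worker threads pull jobs from one shared queue until it is empty,
   so no thread sits idle while work remains. */
#include <windows.h>
#include <stdio.h>

#define NUM_JOBS    64
#define NUM_WORKERS 2                 /* e.g. one worker per CPU */

static volatile LONG g_next_job = 0;  /* index of the next unclaimed job */

static void do_job(int job)
{
    /* placeholder for the real work, e.g. evaluating one player's move */
    printf("job %d done on thread %lu\n", job, GetCurrentThreadId());
}

static DWORD WINAPI worker(LPVOID param)
{
    (void)param;
    for (;;) {
        LONG job = InterlockedIncrement(&g_next_job) - 1;  /* claim the next index */
        if (job >= NUM_JOBS)
            break;                                         /* queue drained */
        do_job((int)job);
    }
    return 0;
}

int main(void)
{
    HANDLE h[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        h[i] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(NUM_WORKERS, h, TRUE, INFINITE);
    return 0;
}
```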
I'm using the latter choice in one of my projects. Every thread is responsible for updating the job queue and performing jobs. No thread is idle until all work is done - this is the way we work at the warehouse.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution 22 Feb 2008, 18:11
My initial thought was to break each player thread into two parts (thus making 6 threads total), but I could see almost no way to do that. Most of the processing has a linear dependency chain, and the opportunities for parallelism were insufficient to make it feasible. Result: an improbable solution.

My second thought was your suggestion, bitRAKE: break the task into smaller pieces and distribute them. But that creates a synchronisation difficulty; I would have to ensure each small processing slice was of equal time length (for fairness), and the algorithm does not yield to deterministic timing in a straightforward way. Result: an unlikely solution.

My third thought was to create a dummy thread and waste 1/4 of the system bandwidth. Result: a possible solution.

My fourth thought was to disassemble and reverse engineer the kernel scheduler, reprogram it to my specification, email the new kernel to MS, wait a long time for them to acknowledge the problem and maybe incorporate my patch, then run my code. Result: an impossible solution.

So I settled on solution 3.
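Solution 3 needs nothing more than one extra compute-bound thread. A sketch (the spin body here is arbitrary; the original burns nops in assembly instead):

```c
/* Dummy thread that only burns CPU time, so three real player threads plus
   this one divide two cores evenly (4 threads over 2 cores = 50% each). */
#include <windows.h>

static DWORD WINAPI burn_time(LPVOID param)
{
    volatile unsigned long n = 0;
    (void)param;
    for (;;)
        n++;                 /* pure time-wasting */
    return 0;                /* never reached */
}

/* started from the main thread alongside the three player threads:
   CreateThread(NULL, 0, burn_time, NULL, 0, NULL);                  */
```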
bitRAKE 22 Feb 2008, 19:20
The equality requires such a fine-grained time scale? My program seems very different from yours. I keep running totals of time and skip jobs to keep them all in line. Completely asynchronous jobs make this easy, though.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution 22 Feb 2008, 20:08
More time to think means an advantage is being given. Doing a comparative analysis means that I need to ensure external effects do not adversely influence the results. A small percentage variance due to non-algorithm-related things can be tolerated, but the long-term error needs to average out to zero across all threads, else there will be a systematic error favouring or hurting one thread compared to the others.
bitRAKE 23 Feb 2008, 00:28
I've always thought absolute equality is an illusion. It doesn't take much complexity for determinism to fall out the window. It's an ongoing struggle to get close. How to take advantage of the system without having inside knowledge of it might be part of the game?

I used to play Eve Online, and many groups had impressive techniques that relied on systematic errors. The only way to combat the abuse was to add massive delays to guarantee synchronization - which created additional avenues for abuse and degraded gameplay. It's a great game though - an easy learning curve that really sucks you in, and scalable MMORPG action with high risks.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.