Message board for the users of flat assembler.
> Windows > HARDWARE MASTER DRIVER
Of course there is - I just thought NUMA was about RAM topology, not caches? I do see that the Windows NUMA APIs include cache information... which makes sense
So, nothing is gained from executing code on two cores sharing L2 cache?
- carpe noctem
|13 Jan 2010, 11:06||
    uP  uP  uP  uP  uP  uP  uP  uP
    |   |   |   |   |   |   |   |
      L2      L2      L2      L2      (6MB each)
      |       |       |       |
      -------- < QPI > --------       (L5410)
       |   |             |   |
     Memory (6GB)     Memory (6GB)
|13 Jan 2010, 11:09||
|13 Jan 2010, 12:53||
I made a driver that also works on multiprocessor computers running Windows 7.
It uses the imported kernel API KeIpiGenericCall to execute a broadcast function on every CPU at the same time.
You can choose all CPUs or a specific CPU for the execution.
There is a procedure to serialize execution across the CPUs (exclusive mode).
It also has an undo operation.
It is simple and customizable. Read the source for more details.
The pack also includes the Fasm DDK 2 (new release).
The 64-bit version is in progress....
Last edited by Pirata Derek on 24 Mar 2010, 07:15; edited 1 time in total
|23 Mar 2010, 15:58||
Sorry, I forgot 4 brackets because I re-edited the source for compatibility with the FASM board. (What does that mean?)
The ZIP package is updated and works perfectly.
|24 Mar 2010, 07:14||
First of all, thanks for the clear, good code. I have a question (I am like a baby in this context), from the comment in the source file.
That is to say, although rdtsc is "per se" (EDIT: NOT) a serializing instruction, could I (or will I be able to on x64) execute 2 rdtsc instructions at ring0 and measure the time elapsed on only one of my Yorkfield's 4 cores?
Do you have a userland template (load/unload driver) to test it?
Keep it up!
|24 Mar 2010, 13:53||
I thought the TSC was never guaranteed to be the same counter on all cores. Although some current CPUs physically implement only one counter, that is not guaranteed to be true for all systems. Reading the TSC for timing is fraught with pitfalls unless one takes the proper steps to ensure the necessary conditions are met.
|24 Mar 2010, 14:27||
I can easily program that driver for x86 multiprocessor systems:
For the x64 one you will have to wait, because I'm still practicing multiprocessor and x64 platform programming.
I preferred to replace my old single-core 32-bit computer with a 64-bit dual-core one instead of using virtual machines.
I've written a similar program, but it executes 3 RDTSC instructions:
    1st RDTSC
    ; .... instructions needed (optional) ....
    2nd RDTSC
    ; .... execute what you want ....
    ; .... or make a call and return
    ;      to target function ....
    3rd RDTSC
    ; Correction = 2nd RDTSC - 1st RDTSC
    ; Delta execution = 3rd RDTSC - 2nd RDTSC
    ; Result = Delta execution - correction
The correction is useful for slow or crappy processors that spend time executing RDTSC itself, plus all the operations needed to store the value and so on.
Theoretically the "correction" is equal to 0, but who knows?
You can make a driver that clears the Time Stamp Disable flag, so you can execute RDTSC directly in ring3 whenever you want until the PC is reset.
Interrupts and context switches can sometimes invalidate the results.
The Time Stamp Disable (TSD) flag is bit 2 (the third bit) of the CR4 register.
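The correction arithmetic from the three RDTSC readings can be sketched in C. This is only an illustration under assumptions: the names `tsc_measurement` and `corrected_cycles` are hypothetical (they are not in the driver source), and the three timestamps are passed in as plain integers instead of being read with RDTSC, so the snippet runs on any machine.

```c
#include <stdint.h>

/* Hypothetical helper (not from the driver source): applies the
   three-reading scheme to timestamps that were already captured. */
typedef struct {
    uint64_t correction; /* 2nd RDTSC - 1st RDTSC: cost of reading/storing */
    uint64_t delta;      /* 3rd RDTSC - 2nd RDTSC: raw measured interval   */
    uint64_t result;     /* delta - correction, clamped at 0               */
} tsc_measurement;

static tsc_measurement corrected_cycles(uint64_t t1, uint64_t t2, uint64_t t3)
{
    tsc_measurement m;
    m.correction = t2 - t1;
    m.delta      = t3 - t2;
    /* Clamp so a noisy reading can never wrap around to a huge value. */
    m.result     = (m.delta > m.correction) ? m.delta - m.correction : 0;
    return m;
}
```

With readings of 1000, 1030 and 1530 the correction is 30 cycles, the raw delta is 500 and the result is 470. The clamp is an extra safety choice of this sketch, not something the original scheme specifies.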
    MOV EAX,CR4
    BTR EAX,2       ; RDTSC enabled for ring3
    MOV CR4,EAX
    RET

    MOV EAX,CR4
    BTS EAX,2       ; RDTSC enabled ONLY for ring0
    MOV CR4,EAX
    RET
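For readers who think in C rather than assembly, the same bit manipulation can be sketched on a plain integer. The helper names below are hypothetical, and the code only computes the value that would be written back to CR4; actually reading or writing CR4 is still a ring0 operation and is not attempted here.

```c
#include <stdint.h>

#define CR4_TSD (1u << 2)  /* Time Stamp Disable: bit 2 of CR4 */

/* Value to write back so RDTSC works in ring3 (same effect as BTR EAX,2). */
static uint32_t cr4_allow_user_rdtsc(uint32_t cr4)
{
    return cr4 & ~CR4_TSD;
}

/* Value to write back so RDTSC works only in ring0 (same as BTS EAX,2). */
static uint32_t cr4_restrict_rdtsc(uint32_t cr4)
{
    return cr4 | CR4_TSD;
}

/* Nonzero when the given CR4 value lets ring3 code execute RDTSC. */
static int rdtsc_allowed_in_ring3(uint32_t cr4)
{
    return (cr4 & CR4_TSD) == 0;
}
```

The CR4 values used to try it out (for example 0x6FC) are arbitrary sample numbers, not values taken from a real machine.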
If you run into trouble, I can suggest another way (my preferred one) without interruptions or problems:
The driver must hook SYSENTER when loading and release it when unloading;
it also has to preserve the original pointer so that normal syscalls are redirected to the kernel.
You should use a driver similar to "Broadcast Worker" and set the TARGET_CPU value, so the driver will operate only on a specific core.
The hooked SYSENTER function (in the driver) only has to check whether a system call has the EDX (or RDX) register = 0.
If this register is 0, it is a special syscall (from the interfacing program), because the KiFastSystemCall function always stores a value in that register.
When a special syscall is received, the driver stores the usermode values (read below) and then redirects it to OUR function (in this case to read the time stamp counter twice... or any operations you want!).
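The routing decision made by the hooked handler is small enough to sketch in C. The enum and function names are hypothetical, and a real hook would be written in assembly at the SYSENTER entry point; this only models the "EDX = 0 means special syscall" rule described above.

```c
#include <stdint.h>

/* Hypothetical names, used only to illustrate the routing rule. */
typedef enum {
    ROUTE_ORIGINAL_HANDLER, /* normal syscall: jump to the preserved pointer */
    ROUTE_DRIVER_FUNCTION   /* special syscall: run OUR function            */
} syscall_route;

/* KiFastSystemCall always stores a value in EDX before SYSENTER, so a
   zero EDX can only come from the interfacing program. */
static syscall_route route_sysenter(uint32_t edx)
{
    return (edx == 0) ? ROUTE_DRIVER_FUNCTION : ROUTE_ORIGINAL_HANDLER;
}
```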
You should RAISE THE IRQL to prevent context switches.
I prefer to raise it to IPI_LEVEL so my core is isolated from the others.
OUR function can:
start counting with the first special syscall and end counting with another syscall.
You also have to consider (subtract) the time spent on the system call, raising the IRQL, the inverse functions and so on....
The same system is implemented in my new version of "NATIVE API INTERCEPTOR".
After that, the driver creates the stack for an "Artificial Interrupt" and then uses IRET to return from the SYSENTER.
    FastReturnFromDriver:
        PUSHD 23h                       ; usermode stack segment
        PUSHD [usermode_stack_pointer]  ; provided by the interfacing program
        PUSHD [usermode_flags]          ; provided by the interfacing program
        PUSHD 1Bh                       ; usermode code segment
        PUSHD [usermode_eip]            ; provided by the interfacing program
        IRETD
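The five pushes build the frame that IRETD pops when returning to a lower privilege level. As a rough picture of what ends up on the stack, here is a C sketch of that frame, lowest address first. The struct and function names are hypothetical; only the selector values 1Bh and 23h come from the routine above.

```c
#include <stdint.h>

#define USER_CS 0x1Bu  /* usermode code segment, as in the routine above  */
#define USER_SS 0x23u  /* usermode stack segment, as in the routine above */

/* Layout in memory after the five PUSHD instructions. The last value
   pushed (EIP) sits at the lowest address, where ESP points. */
typedef struct {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;
    uint32_t ss;
} iret_frame;

/* Hypothetical helper: fills the frame from the three values that the
   interfacing program provides. */
static iret_frame make_user_iret_frame(uint32_t eip, uint32_t eflags,
                                       uint32_t esp)
{
    iret_frame f;
    f.eip    = eip;
    f.cs     = USER_CS;
    f.eflags = eflags;
    f.esp    = esp;
    f.ss     = USER_SS;
    return f;
}
```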
The RDTSC values will be in general registers like EAX.
Execution on a single core gives more precision.
Remember to measure the time of all the optional functions (as separate corrections) and subtract all of them from the main delta execution time.
Result: a VERY HIGH PRECISION value.
The INTERFACING PROGRAM only has to put its variables into general registers, set EDX = 0 and then make a system call.
Don't forget to set ITS PROCESS AFFINITY to the hooked CPU core!
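Pinning the interfacing program to the hooked core comes down to an affinity mask with a single bit set. A minimal sketch in C follows; the helper name is hypothetical. On Windows the resulting mask would typically be passed to SetProcessAffinityMask or SetThreadAffinityMask, but that call is left out so the snippet stays portable.

```c
#include <stdint.h>

/* Hypothetical helper: affinity mask with only the bit for cpu_index set.
   Returns 0 when the index does not fit in a 64-bit mask. */
static uint64_t affinity_mask_for_cpu(unsigned cpu_index)
{
    if (cpu_index >= 64)
        return 0;
    return (uint64_t)1 << cpu_index;
}
```

The cpu_index passed here should be the same core number that was given to the driver as TARGET_CPU.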
    ; Calling our driver using my personal routine
    DerekFastDriverCall:
        PUSHFD
        POP EAX             ; we save the EFLAGS
        MOV EDI, ESP        ; we save the stack
        XOR EDX,EDX         ; the "special syscall"
        MOV ESI, .Return    ; return eip
        SYSENTER            ; go to driver (through the hooked SYSENTER entry)
    .Return:
        NOP
        RET
The driver will store these values so it can return to usermode.
You can see my old interface driver example on this board.
I hope all this information will be useful to you.
Wait for me and I'll post the x64 driver for you.
With the advent of multi-core/hyperthreaded CPUs, systems with multiple CPUs, and "hibernating" operating systems, the TSC cannot be relied on to provide accurate results. The issue has two components: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. In such cases, programmers can only get reliable results by locking their code to a single CPU.
What a perfect post?
|25 Mar 2010, 14:44||
Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.