flat assembler
Message board for the users of flat assembler.

HARDWARE MASTER DRIVER
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
bitRAKE wrote:
So, nothing is gained from executing code on two cores sharing L2 cache?
Of course there is :) - I just thought NUMA was about RAM topology, not caches? I do see that the Windows NUMA APIs include cache information... which makes sense :)

_________________
carpe noctem
Post 13 Jan 2010, 11:06
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17271
Location: In your JS exploiting you and your system
NUMA:
Code:
uP uP  uP uP     uP uP  uP uP
 |  |   |  |      |  |   |  |
  L2     L2        L2     L2   (6MB each)
   |      |         |      |
   -------- < QPI > --------   (L5410)
       |               |
       |               |
  Memory (6GB)     Memory (6GB)    
Post 13 Jan 2010, 11:09
bitRAKE



Joined: 21 Jul 2003
Posts: 2914
Location: [RSP+8*5]
So far, it appears only Windows includes cache topology in NUMA. Yet, it is important to understand shared cache from a performance standpoint.
Post 13 Jan 2010, 12:53
Pirata Derek



Joined: 31 Oct 2008
Posts: 259
Location: Italy
I made a driver that also works on multiprocessor computers running Windows 7.

It uses the imported kernel API KeIpiGenericCall to run a broadcast function on every CPU at the same time.
You can choose all CPUs or a specific CPU for the execution.
There is a procedure to serialize the CPU executions (exclusive mode).
It also has an undo operation.

It is simple and customizable. Read the source for more details.
The Fasm DDK 2 (new release) is also inside the pack.

The 64-bit version is in progress...


Attachment: Broadcast worker 32 bit.zip (207.94 KB)
Description: Broadcast worker driver 32 bit + Fasm DDK 2 (updated)



Last edited by Pirata Derek on 24 Mar 2010, 07:15; edited 1 time in total
Post 23 Mar 2010, 15:58
Pirata Derek



Joined: 31 Oct 2008
Posts: 259
Location: Italy
Sorry, I forgot 4 brackets because I re-edited the source for compatibility with the FASM board. (What does it mean? :P )

The ZIP package is updated and works perfectly.
Post 24 Mar 2010, 07:14
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
Hello Pirata,
first of all, thanks for the clear, good code. I have a question (I am like a baby
in this context) about this comment in the source file:
Quote:

; For example: CPUs store their values in the same memory
; address (or region), but you don't want them to overwrite
; each other's stored values.
; Until a processor gets the token, it waits for it.
; After execution, the CPU releases the token
; for another CPU.
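The token idea in the quoted comment can be tried out in user mode. What follows is only a sketch in portable C11, not the driver's actual code: the names `TOKEN_FREE`, `token_try_acquire`, `token_acquire` and `token_release` are made up for illustration, and the real driver runs at ring 0 and may implement the token differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOKEN_FREE (-1)  /* hypothetical marker: no CPU holds the token */

/* Try to take the token for `cpu`; succeeds only if nobody holds it. */
bool token_try_acquire(atomic_int *token, int cpu)
{
    int expected = TOKEN_FREE;
    return atomic_compare_exchange_strong(token, &expected, cpu);
}

/* Spin until the token is ours (the "wait for it" step in the comment). */
void token_acquire(atomic_int *token, int cpu)
{
    while (!token_try_acquire(token, cpu)) {
        /* busy-wait until the current holder releases the token */
    }
}

/* Give the token back so another CPU can take it. */
void token_release(atomic_int *token, int cpu)
{
    int holder = atomic_load(token);
    assert(holder == cpu);  /* only the current holder may release */
    (void)holder;           /* silences the unused warning under NDEBUG */
    atomic_store(token, TOKEN_FREE);
}
```

Each CPU would call `token_acquire`, write to the shared region, then call `token_release`, so the stores never overlap.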

That is to say, although rdtsc is "per se" (EDIT: NOT :D ) a serializing instruction, could I (or will I be able to, on x64) execute 2 rdtsc instructions at ring 0 and measure the time elapsed on only one of my Yorkfield's 4 cores?
Do you have a userland template (load/unload driver) to test it?

Keep it up!

Cheers :D
hopcode

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 24 Mar 2010, 13:53
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17271
Location: In your JS exploiting you and your system
I thought the TSC was never guaranteed to be the same counter on all cores. Although some current CPUs physically implement only one counter, that is not guaranteed to be true for all systems. Reading the TSC for timing is fraught with pitfalls unless one takes the proper actions to ensure various conditions are properly met.
Post 24 Mar 2010, 14:27
Pirata Derek



Joined: 31 Oct 2008
Posts: 259
Location: Italy
I can easily program that driver for x86 multiprocessor systems.
For the x64 one you will have to wait for me, because I'm still practicing multiprocessor and x64 platform programming. ;)
I preferred to replace my old single-core 32-bit computer with a 64-bit dual-core one, instead of using virtual machines.

IMPORTANT!
I've written a similar program, but it executes 3 RDTSC instructions:
Code:
1st RDTSC
; .... instructions needed (optional) ....
2nd RDTSC
; .... execute what you want ....
; .... or make a call and return 
; to target function ....
3rd RDTSC

; Correction = 2nd RDTSC - 1st RDTSC
; Delta execution = 3rd RDTSC - 2nd RDTSC
; Result = Delta execution - correction [all]    

The correction accounts for the time the processor spends executing RDTSC itself, plus any instructions needed to store the value.
In theory "correction" is equal to 0, but in practice RDTSC and the stores take some cycles, so who knows?
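The arithmetic of the three-reading scheme is easy to check on its own. This is a minimal sketch in plain C; the function name `corrected_cycles` is made up, and it assumes all three readings come from the same core. Clamping to 0 is my own choice, to stop a noisy correction from wrapping the unsigned result.

```c
#include <assert.h>
#include <stdint.h>

/* t1, t2, t3 are the three raw TSC readings, taken in that order on one core. */
uint64_t corrected_cycles(uint64_t t1, uint64_t t2, uint64_t t3)
{
    assert(t1 <= t2 && t2 <= t3);   /* readings must be monotonic */

    uint64_t correction = t2 - t1;  /* cost of RDTSC plus storing the value */
    uint64_t delta      = t3 - t2;  /* measured code plus the same overhead */

    /* Clamp at 0 so jitter in the correction cannot wrap the unsigned result. */
    return delta > correction ? delta - correction : 0;
}
```

For example, readings of 1000, 1030 and 2030 give a correction of 30, a delta of 1000 and a result of 970 cycles.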

You can make a driver that clears the Time Stamp Disable (TSD) flag, so you can execute RDTSC directly in ring 3 whenever you want, until the PC is reset.
Interrupts and context switches can sometimes invalidate the results.
The Time Stamp Disable flag is the third bit (bit 2) of the CR4 register.
Code:
 MOV EAX,CR4
 BTR EAX,2       ; RDTSC enabled for ring3
 MOV CR4,EAX
 RET

 MOV EAX,CR4
 BTS EAX,2       ; RDTSC enabled ONLY for ring0
 MOV CR4,EAX
 RET    
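The two routines above only flip one bit, so the mask logic can be checked without touching CR4. This sketch works on a plain integer copy of the register; the helper names are made up, and actually reading or writing CR4 still needs ring-0 code like the fasm above.

```c
#include <assert.h>
#include <stdint.h>

#define CR4_TSD_BIT 2u                  /* Time Stamp Disable is bit 2 of CR4 */
#define CR4_TSD     (1u << CR4_TSD_BIT)

/* Nonzero when ring 3 is barred from executing RDTSC. */
int cr4_tsd_is_set(uint32_t cr4)
{
    return (cr4 & CR4_TSD) != 0;
}

/* Same effect as BTR EAX,2: RDTSC becomes legal at any privilege level. */
uint32_t cr4_allow_user_rdtsc(uint32_t cr4)
{
    uint32_t result = cr4 & ~CR4_TSD;
    assert(!cr4_tsd_is_set(result));
    return result;
}

/* Same effect as BTS EAX,2: RDTSC faults outside ring 0. */
uint32_t cr4_restrict_rdtsc(uint32_t cr4)
{
    uint32_t result = cr4 | CR4_TSD;
    assert(cr4_tsd_is_set(result));
    return result;
}
```

All other CR4 bits pass through unchanged, which is the reason for the read-modify-write sequence in the driver code.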


If you run into trouble, I can suggest another way (my preferred one) without interruptions or problems:

The driver must hook SYSENTER when loading and release it when unloading;
it also has to preserve the original pointer so that normal syscalls are still redirected to the kernel.
You should use a driver similar to "Broadcast Worker" and set the TARGET_CPU value, so the driver will operate only on a specific core.

The hooked SYSENTER function (in the driver) only has to check whether a system call arrives with the EDX (or RDX) register = 0.
If this register is 0, it is a special syscall (from the interfacing program), because the KiFastSystemCall function always stores a value in that register.
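The EDX test amounts to a two-way dispatch. The sketch below models it in user-mode C with function pointers. All names (`route_t`, `classify_sysenter`, `dispatch_sysenter` and the two stub handlers) are made up, and the real hook would be a ring-0 assembly entry point, not a C function.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ROUTE_ORIGINAL, ROUTE_SPECIAL } route_t;  /* hypothetical names */
typedef uint32_t (*handler_t)(uint32_t edx);

/* edx is the value of EDX at SYSENTER entry; 0 marks our special call. */
route_t classify_sysenter(uint32_t edx)
{
    return edx == 0 ? ROUTE_SPECIAL : ROUTE_ORIGINAL;
}

/* Forward to the preserved original handler unless the call is special. */
uint32_t dispatch_sysenter(uint32_t edx, handler_t original, handler_t special)
{
    assert(original != 0 && special != 0);
    return classify_sysenter(edx) == ROUTE_SPECIAL ? special(edx)
                                                   : original(edx);
}

/* Stand-ins for the kernel's handler and for OUR function. */
uint32_t stub_original(uint32_t edx) { (void)edx; return 1; }
uint32_t stub_special(uint32_t edx)  { (void)edx; return 2; }
```

Any nonzero EDX, such as the value left there by an ordinary KiFastSystemCall, takes the original path untouched.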

When a special syscall is received, the driver stores the user-mode values (read below) and then redirects execution to OUR function (in this case to read the time stamp counter two times... or any operations you want!).
You should RAISE THE IRQL to prevent context switches.
I prefer to raise it to IPI_LEVEL so my core is isolated from the others.

OUR function can:
start counting with the first special syscall and end counting with another syscall.
You also have to consider (subtract) the time spent on the system call, raising the IRQL, the inverse functions and so on...
The same system is implemented in my new version of "NATIVE API INTERCEPTOR".

After that, the driver builds the stack frame for an "artificial interrupt" and then uses IRETD to return to user mode from the SYSENTER handler.
Code:
 FastReturnFromDriver:
    PUSHD 23h ; usermode stack segment
    PUSHD [usermode_stack_pointer] ; provided by the interfacing program
    PUSHD [usermode_flags] ; provided by the interfacing program
    PUSHD 1Bh ; usermode code segment
    PUSHD [usermode_eip] ; provided by the interfacing program
    IRETD    
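The five PUSHD instructions above build the frame that IRETD pops. It can help to see that frame as a struct in memory order, which is the reverse of the push order. This is only an illustrative C sketch: `iret_frame32` and `build_iret_frame` are made-up names, and the selector values are the ones used in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define USER_CS 0x1Bu   /* user-mode code selector used in the post */
#define USER_SS 0x23u   /* user-mode stack selector used in the post */

/* Fields in ascending address order, i.e. the reverse of the PUSHD order. */
typedef struct {
    uint32_t eip;     /* pushed last, popped first by IRETD */
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;
    uint32_t ss;      /* pushed first */
} iret_frame32;

iret_frame32 build_iret_frame(uint32_t eip, uint32_t eflags, uint32_t esp)
{
    iret_frame32 f = { eip, USER_CS, eflags, esp, USER_SS };

    /* Both selectors must request privilege level 3 (low two bits set). */
    assert((f.cs & 3u) == 3u && (f.ss & 3u) == 3u);
    return f;
}
```

The three variable fields are exactly the values the interfacing program hands over: return EIP, EFLAGS and the user-mode stack pointer.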

The RDTSC values will be in general registers like EAX.
Execution on a single core gives more precision.
Remember to measure the time of all the optional functions (as various corrections) and subtract all of them from the main delta execution time.
Result: a VERY HIGH PRECISION VALUE :D

The INTERFACING PROGRAM only has to put its variables into general registers, set EDX = 0 and then execute SYSENTER.
Don't forget to set ITS PROCESS AFFINITY to the hooked CPU core!
Code:
;  Calling our driver using my personal routine
DerekFastDriverCall:
    PUSHFD
    POP EAX ; EAX = saved EFLAGS
    MOV EDI, ESP ; EDI = saved user-mode stack pointer
    XOR EDX,EDX ; EDX = 0 marks the "special syscall"
    MOV ESI, .Return ; ESI = return EIP
    SYSENTER ; go to driver (the driver hooks SYSENTER, not SYSCALL)
    .Return: NOP
    RET    

The driver will store these values in order to return to user mode.
You can see my old interface driver example on this board.


I hope all this information will be useful for you.
Wait for me and I'll post the x64 driver for you.


From Wikipedia:

With the advent of multi-core/hyperthreaded CPUs, systems with multiple CPUs, and "hibernating" operating systems, the TSC cannot be relied on to provide accurate results. The issue has two components: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. In such cases, programmers can only get reliable results by locking their code to a single CPU.
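Locking code to a single CPU comes down to handing the OS a mask with exactly one bit set. Here is a small C sketch of building and checking such a mask; the helper names are made up. On Windows a mask like this would normally be passed to SetThreadAffinityMask or SetProcessAffinityMask, but that call is left out so the example stays portable.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per logical CPU: bit n set means "may run on CPU n". */
uintptr_t affinity_mask_for_core(unsigned core)
{
    assert(core < sizeof(uintptr_t) * 8);  /* the mask has one bit per CPU */
    return (uintptr_t)1 << core;
}

/* True when `mask` pins execution to exactly one CPU. */
int mask_is_single_core(uintptr_t mask)
{
    return mask != 0 && (mask & (mask - 1)) == 0;
}
```

The core index used here should match the core the driver hooked (its TARGET_CPU), otherwise the special syscall arrives on a core without the hook.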

What a perfect post?
Post 25 Mar 2010, 14:44
Copyright © 1999-2020, Tomasz Grysztar.
