flat assembler - OS Kernel Basics - Pre-Emptive Multitasking

Greg_M

Joined: 07 Jun 2025
Posts: 42

Greg_M 17 Jun 2025, 11:29

(Using "struct" as the general term from C.)

#1 Task switching is driven by a hardware interrupt, e.g. a 10 millisecond tick timer.

#2 The kernel keeps a collection (e.g. array, list, map) of Task structs, one for each thread/task.

#3 Each Task struct holds the current Stack Pointer (SP) of its task. The Task struct may also hold other task context, but most context, e.g. the CPU registers, is generally stored on the task's stack. When a task, call it Task A, is switched out in the task switch timer interrupt (see #1), the interrupt automatically places the Program Counter (PC) on the stack, and inside the interrupt handler you store the remaining context (e.g. registers) on the stack as well. Switching to a task uses its saved SP to restore that context, and the last step is the RTI instruction (IRET on x86), which loads the PC from the stack.

The secret sauce is that the PC on the stack is where Task A was running when it was interrupted. This is not your ordinary interrupt handler, because on each interrupt cycle other tasks can be switched to before the kernel logic decides that Task A should run again. And when the kernel does re-activate Task A, voila, the RTI sets the PC so that Task A continues where it was when the task switch interrupt hit. An ironic aspect (in a good sense) is that the kernel Task struct doesn't need to store the PC. It only needs to store the SP; the interrupt and the later RTI manage the PC automatically.
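
A minimal C sketch of the bookkeeping in #2 and #3, as I picture it. All names here (task_t, schedule_on_tick, MAX_TASKS) are made up for illustration, and the round-robin policy is only an assumption. The assembly stub that pushes and pops the registers and executes RTI is not shown; it would call this function with the interrupted task's SP and load the SP it returns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 4

/* Hypothetical kernel task struct: only the saved stack pointer is needed,
   because the PC and registers live on the task's own stack. */
typedef struct {
    uintptr_t saved_sp;   /* SP captured when the task was switched out */
    int       runnable;   /* non-zero if the scheduler may pick this task */
} task_t;

static task_t tasks[MAX_TASKS];
static size_t current_task;   /* index of the task that is running now */

/* Called from the (assumed) timer interrupt stub after it has pushed the
   remaining registers. Takes the interrupted task's SP and returns the SP
   the stub should load before popping registers and executing RTI. */
uintptr_t schedule_on_tick(uintptr_t interrupted_sp)
{
    assert(current_task < MAX_TASKS);
    tasks[current_task].saved_sp = interrupted_sp;

    /* Round-robin: scan forward for the next runnable task. If none is
       found, the current task simply keeps running. */
    for (size_t step = 1; step <= MAX_TASKS; step++) {
        size_t candidate = (current_task + step) % MAX_TASKS;
        if (tasks[candidate].runnable) {
            current_task = candidate;
            break;
        }
    }
    return tasks[current_task].saved_sp;
}
```

Note that the PC never appears in task_t: it travels on each task's stack, exactly as described above.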

For information specific to RTOS, refer to bzt's excellent posts about tickless and scheduling.

(Even non-RT OSes can use tickless operation to save power. For example, the 10 millisecond periodic tick mentioned in #1 can be varied or disabled during low-power modes. Windows and Linux use this general technique.)

Last edited by Greg_M on 19 Jun 2025, 01:04; edited 9 times in total

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20689
Location: In your JS exploiting you and your system

revolution 17 Jun 2025, 11:37

There is also cooperative multitasking, where each task decides for itself when to yield. In that case the hardware interrupt can be omitted.

For pre-emptive multitasking, a hardware interrupt would be the most likely way to initiate a yield.
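
To illustrate the cooperative case, here is a small C sketch in which each task is a step function that does a slice of work and then returns, the return being its voluntary yield. The names (coop_task_t, coop_run, countdown_step) are invented for this example, and real cooperative kernels usually switch stacks instead of using step functions; this is only the simplest form of the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cooperative task: step() does a slice of work and returns,
   which is its voluntary yield. It returns 0 once the task has finished. */
typedef struct {
    int (*step)(void *state);
    void *state;
    int   done;
} coop_task_t;

/* Run tasks round-robin until all are finished. No interrupt is involved:
   a task that never returns from step() would hang the whole system. */
void coop_run(coop_task_t *tasks, size_t count)
{
    size_t remaining = count;
    assert(tasks != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        tasks[i].done = 0;
    }
    while (remaining > 0) {
        for (size_t i = 0; i < count; i++) {
            if (!tasks[i].done && tasks[i].step(tasks[i].state) == 0) {
                tasks[i].done = 1;
                remaining--;
            }
        }
    }
}

/* Example task body: counts down, yielding after every decrement. */
int countdown_step(void *state)
{
    int *counter = state;
    if (*counter > 0) {
        (*counter)--;
    }
    return *counter > 0;
}
```

The comment in coop_run is the whole trade-off: without a hardware interrupt, one misbehaving task can keep the CPU forever.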

Greg_M 17 Jun 2025, 12:03

Good point. And tasks of pre-emptive kernels can also give up the CPU voluntarily, e.g. via semaphore_wait(...) or suspend(...).

The suspend() and the hardware interrupt could use a common task switch function. One approach might be for suspend() to set up the stack to mimic the interrupt stack and then use the exact same function that the interrupt uses.
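
A rough C sketch of the "mimic the interrupt stack" idea. The frame layout here (flags, then PC, then four registers) is a simplified assumption; the real layout depends on the CPU and on what your handler pushes. The names build_switch_frame, push_word and SAVED_REGS are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SAVED_REGS 4   /* illustrative count of general registers */

/* Push one word onto a descending stack and return the new SP. */
static uintptr_t *push_word(uintptr_t *sp, uintptr_t value)
{
    sp--;
    *sp = value;
    return sp;
}

/* Build the frame a timer interrupt would have left: first what the CPU
   pushes (flags, then PC in this simplified layout), then the registers
   the handler pushes. The returned SP goes into the task struct. */
uintptr_t *build_switch_frame(uintptr_t *stack_top, uintptr_t resume_pc,
                              uintptr_t flags,
                              const uintptr_t regs[SAVED_REGS])
{
    uintptr_t *sp = stack_top;
    assert(stack_top != 0);
    sp = push_word(sp, flags);
    sp = push_word(sp, resume_pc);
    for (int i = 0; i < SAVED_REGS; i++) {
        sp = push_word(sp, regs[i]);
    }
    return sp;
}
```

The same helper could also prepare the first frame of a brand-new task, so the very first switch to it goes through the ordinary restore path.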

(Note: I also updated the title to include "pre-emptive". Thanks!)

Last edited by Greg_M on 18 Jun 2025, 02:17; edited 5 times in total

macomics

Joined: 26 Jan 2021
Posts: 1169
Location: Russia

macomics 17 Jun 2025, 14:52

#3 Here you focus on the fact that something is saved on the stack that the current task uses for its work. But what happens to that saved data on the stack after the task is restarted? Ideally, task switching should not pollute the task stack. The task's own working memory (including stack memory) should remain as unaffected as possible when switching tasks. If we are talking about the i386 architecture, then the task state should generally be recorded on a separate physical page and stored separately from all other tasks.

Core i7

Joined: 14 Nov 2024
Posts: 111
Location: Socket on motherboard

Core i7 17 Jun 2025, 21:22

What OS are we talking about?
For example, Windows x64 does not use hardware task switching via the TSS segment (it is slow), and stores the entire context on the kernel stack. Moreover, threads are scheduled, not processes/tasks. Each thread has 2 stacks: user and kernel. Pointers to them are stored in the _KTHREAD structure of each thread. Structures with the K prefix are used by the kernel scheduler, and those with the E prefix by the Executive. This can be seen in the WinDbg debugger, for example, in my SMBIOS.EXE process:

Code:

0: kd> !process 0 5 smbios_v13.exe
PROCESS fffffa8004e2b810
    SessionId: 1    Cid: 0898    Peb: 7efdf000   ParentCid: 0bcc
    DirBase  : 1c26d000  ObjectTable: fffff8a00203e8f0  HandleCount: 94.
    Image    : SMBIOS_v13.EXE

    VadRoot    fffffa80059b6300  Vads 92  Clone 0  Private 358.  Modified 0. Locked 0.
    Token             fffff8a002c4ca30
    VirtualSize       89 Mb
    PageFaultCount    2204
    BasePriority      8
    Job               fffffa80056194a0

    1.  THREAD fffffa8005a81640  Cid 0898.08b8  Teb: 000000007efdb000  Win32Thread: fffff900c23377c0 WAIT
    2.  THREAD fffffa800570ab50  Cid 0898.08d8  Teb: 000000007efd8000  Win32Thread: 0000000000000000 WAIT

0: kd>

The interesting fields here are "DirBase=1c26d000" (the process's virtual page directory) and "VadRoot=xx6300" (the list of allocated regions). The 2 threads shown in the WinDbg log share the one address space of the smbios.exe process, so DirBase describes the memory of both threads 1 and 2. DirBase leads to the PTE records that map virtual addresses to physical memory frames. The allocated ranges themselves are in the VAD (Virtual Address Descriptor) list, in which the Start/End addresses are shown without the 12 least significant bits (i.e. append 000 on the right):

Code:

0: kd> !vad fffffa80059b6300
VAD             level      start      end    commit
fffffa800571b550 ( 6)         10       1f         0 Mapped       READWRITE          Pagefile-backed section
fffffa8005809420 ( 5)         20       20         1 Private      READWRITE
fffffa8005ba8590 ( 6)         30       30         1 Private      READWRITE
fffffa8005a06680 ( 4)         40       40         0 Mapped  Exe  EXECUTE_WRITECOPY  \Windows\System32\apisetschema.dll
fffffa80059efb90 ( 5)         50       8f         9 Private      READWRITE
fffffa8005b2a690 ( 6)         90       cf         7 Private      READWRITE
fffffa8005b07810 ( 3)         d0       d3         0 Mapped       READONLY           Pagefile-backed section
fffffa80056f0d10 ( 5)         e0       e0         1 Private      READWRITE
fffffa80056c97a0 ( 4)         f0       f5         0 Mapped       READONLY           Pagefile-backed section
fffffa8004a7e520 ( 6)        100      100         0 Mapped       READWRITE          Pagefile-backed section
fffffa8004dd9b40 ( 3)        200      266         0 Mapped       READONLY           \Windows\System32\locale.nls
fffffa8005943420 ( 5)        270      36f        48 Private      READWRITE
fffffa8005a35570 ( 1)        400      408         6 Mapped  Exe  EXECUTE_WRITECOPY  \SMBIOS_v13.EXE  ;<--------- 0x400000 = Image Base
fffffa8004e44b70 ( 6)        410      48f         1 Private      READWRITE
fffffa8005a8c520 ( 5)        4b0      4bf         3 Private      READWRITE
fffffa80051f3110 ( 6)        4c0      4cf         7 Private      READWRITE
fffffa8005aca230 ( 4)        4d0      657         0 Mapped       READONLY           Pagefile-backed section
fffffa80056b1a30 ( 5)       2100     213f        24 Private      READWRITE
fffffa8005a8ccc0 ( 6)      738d0    738d8         2 Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\version.dll
fffffa8005a6c990 ( 6)      74700    7474b         4 Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\dxgi.dll
fffffa80059be370 ( 3)      74ed0    7502f         5 Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\ole32.dll
fffffa80056348c0 ( 3)      770b0    771bf         3 Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\kernel32.dll
fffffa8004c4f3a0 ( 6)      771e0    7726f         3 Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\gdi32.dll
fffffa8005a66ef0 ( 4)      77570    7770e        15 Mapped  Exe  EXECUTE_WRITECOPY  \Windows\System32\ntdll.dll
fffffa8004f0fb40 ( 5)      77730    778af        10 Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\ntdll.dll
fffffa8004941570 ( 6)      7efe0    7f0df         0 Mapped       READONLY           Pagefile-backed section
fffffa8004df67d0 ( 3)      7f0e0    7ffdf         0 Private      READONLY
fffffa8005998d20 ( 4)      7ffe0    7ffef        -1 Private      READONLY
fffffa8005ac4130 ( 5)      7fff0 7fffffef        -1 Private      READONLY

Total VADs: 83  Average level: 5  Maximum depth: 7
0: kd>

Now, if my thread(1) switches to thread(1) of another process rather than to thread(2) of its own, the scheduler must save pointers to DirBase+VadRoot in the kernel stack of my thread(1); otherwise it would be impossible to restore my stopped thread(1). If threads switch within one process, there is no need to save Dir+Vad, only the register context. Here are the fields with pointers to the stacks of my thread(1):

Code:

0: kd> dt _kthread fffffa8005a81640
ntdll!_KTHREAD
   +0x000 Header           : _DISPATCHER_HEADER
.........
   +0x028 InitialStack     : 0xfffff880`0243cd70  Void
   +0x030 StackLimit       : 0xfffff880`02434000  Void
   +0x038 KernelStack      : 0xfffff880`0243c5f0  Void
   +0x134 ContextSwitches  : 0x3f5
   +0x1d8 TrapFrame        : 0xfffff880`0243cbe0 _KTRAP_FRAME
   +0x358 StateSaveArea    : 0xfffff880`0243cdc0 _XSAVE_FORMAT
.........
0: kd>

In general, when switching thread(1) to thread(2), the Windows scheduler does the following:

1. The register context of thread(1) is saved at the address in thread(1)'s KernelStack pointer, and the KernelStack value is decreased by the size of the context (see the _CONTEXT structure). If thread(2) does not belong to thread(1)'s process, then the pointers to Dir+Vad of thread(1) are also saved on the kernel stack.

2. Based on the KernelStack value of thread(2), its entire context is restored.

3. Now the KernelStack value of thread(1) is written to the KernelStack field of thread(2).

4. All other threads are switched in the same way, forming a circular chain.
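
The rule that the address space only needs attention when the next thread belongs to another process can be sketched generically in C. This is not the real Windows code or its structures: process_t, thread_t, cpu_t and switch_thread are simplified stand-ins invented here, and the CR3 register is modelled as a plain variable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the real Windows structures. */
typedef struct {
    uint64_t page_dir_base;      /* physical address loaded into CR3 */
} process_t;

typedef struct {
    process_t *owner;            /* process whose address space the thread uses */
    uint64_t   kernel_sp;        /* saved kernel stack pointer */
} thread_t;

typedef struct {
    thread_t *current;
    uint64_t  cr3;               /* models the CPU's CR3 register */
    unsigned  cr3_loads;         /* counts address-space switches */
} cpu_t;

/* Switch cpu->current to next. live_sp is the outgoing thread's kernel SP
   after its registers were pushed. Returns the SP to restore from. */
uint64_t switch_thread(cpu_t *cpu, thread_t *next, uint64_t live_sp)
{
    assert(cpu && cpu->current && next && next->owner);
    thread_t *prev = cpu->current;
    prev->kernel_sp = live_sp;

    /* Threads of one process share page tables, so CR3 stays untouched. */
    if (next->owner != prev->owner) {
        cpu->cr3 = next->owner->page_dir_base;
        cpu->cr3_loads++;
    }
    cpu->current = next;
    return next->kernel_sp;
}
```

Skipping the CR3 load within one process also avoids flushing the non-global TLB entries, which is a large part of why same-process switches are cheaper.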

bzt

Joined: 09 Nov 2018
Posts: 90

bzt 18 Jun 2025, 10:44

macomics wrote:

But what happens to that saved data on the stack after the task is restarted? Ideally, task switching should not pollute the task stack. The task's own working memory (including stack memory) should remain as unaffected as possible when switching tasks. If we are talking about the i386 architecture, then the task state should generally be recorded on a separate physical page and stored separately from all other tasks.

Obviously you must remove those from the stack before restarting the task. In general it doesn't matter where you store them; the user stack is as good a place as any (from a strictly technical point of view). Using a separate area is recommended for better security only (it eliminates the possibility of buffer overflow attacks, e.g. when the user stack is deliberately filled and there's not enough space to store the task struct).

Core i7 wrote:

For example, Windows x64 does not use hardware task switching via the TSS segment (it is slow)

That's right, not used, but not because it's slow, rather because it's non-existent. There's no hardware task switch in long mode (amd64 or x86_64 or x64) at all.

(FYI, hardware task switch was only supported in 16-bit and 32-bit protected modes, often referred to as x86 or i386. 16-bit protected mode was introduced with the 80286.)

Core i7 wrote:

Moreover, threads are scheduled, not processes/tasks.

That's right, and for the record, it's the same for Linux. There the scheduler operates on so-called task structs, indexed by a so-called "lightweight process" (aka thread) id (you can check it with "ps -L", which prints it under the LWP column).

Core i7 18 Jun 2025, 13:10

bzt wrote:

There's no hardware task switch in long mode (amd64 or x86_64 or x64) at all.

Perhaps the documentation says that in x64 the TSS task state segment is completely absent, but in the cpu main structure PCR (Processor Control Region) there is a valid pointer to it. But if I request the structure itself, it is empty, except for one RSP. It is unclear where this RSP points - its frame does not match the one used by the program threads:

Code:

0: kd> dt _kpcr @$pcr
ntdll!_KPCR
   +0x000 NtTib                : _NT_TIB
   +0x000 GdtBase              : 0xfffff800`00b9c000  _KGDTENTRY64
   +0x008 TssBase              : 0xfffff800`00b9b000  _KTSS64        ;<-------------//
   +0x010 UserRsp              : 0x1ce1b8
   +0x018 Self                 : 0xfffff800`02c4e000  _KPCR
   +0x020 CurrentPrcb          : 0xfffff800`02c4e180  _KPRCB
   +0x028 LockArray            : 0xfffff800`02c4e7f0  _KSPIN_LOCK_QUEUE
   +0x030 Used_Self            : 0x000007ff`fff9c000  Void
   +0x038 IdtBase              : 0xfffff800`00b9a000  _KIDTENTRY64
   +0x050 Irql                 : 0 ''
   +0x051 CacheAssociativity   : 0x8 ''
   +0x060 MajorVersion         : 1
   +0x062 MinorVersion         : 1
   +0x064 StallScaleFactor     : 0x9c3
   +0x0bc SecondLevelCacheSize : 0x200000
   +0x0c0 HalReserved          : [16] 0x9502d1f0
   +0x108 KdVersionBlock       : (null)
   +0x118 PcrAlign1            : [24] 0
   +0x180 Prcb                 : _KPRCB

0: kd> dt _ktss64 -v 0xfffff800`00b9b000
ntdll!_KTSS64
struct _KTSS64, 8 elements, 0x68 bytes            ;<----------- Sizeof = 0x68 bytes
   +0x000 Reserved0        : 0
   +0x004 Rsp0             : 0xfffff800`00ba4d70
   +0x00c Rsp1             : 0
   +0x014 Rsp2             : 0
   +0x01c Ist              : [8] 0
   +0x05c Reserved1        : 0
   +0x064 Reserved2        : 0
   +0x066 IoMapBase        : 0x68

0: kd> dt _kthread fffffa8003b9d940      ;<-------------------- User Thread
ntdll!_KTHREAD
   +0x000 Header           : _DISPATCHER_HEADER
   +0x018 CycleTime        : 0x2`73e69274
   +0x020 QuantumTarget    : 0x2`75d652ea
   +0x028 InitialStack     : 0xfffff880`05f15d70  Void
   +0x030 StackLimit       : 0xfffff880`05f0c000  Void
   +0x038 KernelStack      : 0xfffff880`05f15470  Void   ;<-----------------
..........

macomics 18 Jun 2025, 15:11

bzt wrote:

Obviously you must remove those from the stack before restarting the task. In general it doesn't matter where you store them; the user stack is as good a place as any (from a strictly technical point of view). Using a separate area is recommended for better security only (it eliminates the possibility of buffer overflow attacks, e.g. when the user stack is deliberately filled and there's not enough space to store the task struct).

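In that case, even trivial detection of tracing is possible, with code like this:

Code:

mov rax, [rsp-4]    ; read the qword that starts just below the stack pointer
cmp rax, [rsp-4]    ; read it again; it differs only if something wrote to this stack in between
jnz tracing         ; e.g. a trap or task switch that saved state on the task's stack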

I was just saying that it's not worth changing the address space of a task for system purposes.

Core i7 wrote:

Perhaps the documentation says that in x64 the TSS task state segment is completely absent, but in the cpu main structure PCR (Processor Control Region) there is a valid pointer to it. But if I request the structure itself, it is empty, except for one RSP. It is unclear where this RSP points - its frame does not match the one used by the program threads:

That's what I was talking about. Even in Windows, the hardware TSS is located on its own separate page.
It is not initialized because the user task has a write-copy page in its place in case of a hardware task switch. Only the system init process will have it recorded in full.

bzt 18 Jun 2025, 15:56

Core i7 wrote:

Perhaps the documentation says that in x64 the TSS task state segment is completely absent

No, the doc does not say anything like that. It says that TSS exists, it's just not task state related any more. To be precise, see Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3A, page 9-19, section 9.7 TASK MANAGEMENT IN 64-BIT MODE.

macomics wrote:

I was just saying that it's not worth changing the address space of a task for system purposes.

Yes, I agree. That's why most modern OSes split the address space in two: user space mappings (per process) and kernel space mappings (the same in every address space).

macomics wrote:

the hardware TSS is located on its own separate page

But in x64 the TSS does not store the task state at all. From the Intel Manual page 9-19:
"The TSS holds information important to 64-bit mode and that is not directly related to the task-switch mechanism."

macomics wrote:

It is not initialized because the user task has a write-copy page in its place in case of a hardware task switch.

Again, there's no per task TSS (actually, it's enough to have a single TSS per CPU core, no matter how many tasks are scheduled on that core), and there's no such thing as hardware task switch, from Intel Manual page 9-19:
"However, the task switching mechanism available in protected mode is not supported in 64-bit mode. Task management and switching must be performed by software."

Hope this helps understand the quirks of 64-bit mode. It's tricky because it inherited and repurposed the 32-bit mode structures.

macomics 18 Jun 2025, 16:10

bzt wrote:

Again, there's no per task TSS (actually, it's enough to have a single TSS per CPU core, no matter how many tasks are scheduled on that core), and there's no such thing as hardware task switch, from Intel Manual page 9-19:
"However, the task switching mechanism available in protected mode is not supported in 64-bit mode. Task management and switching must be performed by software."

But your system has not only 64-bit tasks, but also 32-bit ones. Those can trigger a write to this TSS page. Therefore, it is clean and in write-copy mode.

bzt 18 Jun 2025, 16:35

macomics wrote:

But your system has not only 64-bit tasks, but also 32-bit ones.

Nope, in long mode you only have 64-bit tasks, nothing else. If you want 32-bit tasks as well, then your kernel must implement compatibility mode (not the same as protected mode).

macomics wrote:

Those can trigger a write to this TSS page.

Nope, compatibility mode does not support hardware task switch either, see Intel Manual Vol 2A, page 3-141:
"Note that a CALL instruction can not be used to cause a task switch in compatibility mode since task switches are not supported in IA-32e mode."

Like I said, 64-bit mode is tricky, because they repurposed the structures but kept the nomenclature. In 32-bit emulation mode only the user level is the same; the kernel level is very different from protected mode.

Core i7 18 Jun 2025, 17:46

macomics wrote:

Therefore, it is clean and in write-copy mode.

The "Copy-On-Write" attribute can be set on user-mode pages, but the TSS is in the kernel. Its PTE entry has the "Global" attribute, but unfortunately the CoW bit is clear. So the purpose of this TSS64 is unclear, although the pointer seems to be valid and someone has already accessed it (see the A bit):

Code:

0: kd> dt _kpcr @$pcr TssBase
ntdll!_KPCR
   +0x008 TssBase : 0xfffff800`00b9b000  _KTSS64

0: kd> !cmkd.ptelist -v 0xfffff80000b9b000
VA=FFFFF80000B9B000
  PXE Idx=1F0  Va=FFFFF6FB7DBEDF80  Contents=00199063  Hard Pfn=00000199  Attr=---DA--KWEV
  PPE Idx=000  Va=FFFFF6FB7DBF0000  Contents=00198063  Hard Pfn=00000198  Attr=---DA--KWEV
  PDE Idx=005  Va=FFFFF6FB7E000028  Contents=0019A063  Hard Pfn=0000019A  Attr=---DA--KWEV
  PTE Idx=19B  Va=FFFFF6FC00005CD8  Contents=00B9B163  Hard Pfn=00000B9B  Attr=-G-DA--KWEV  <--+
                                                                                               |
                                Global,Dirty,Accessed,Kernel,Writable,Execute,Valid  ----------+

0: kd> dt _mmpte_hardware FFFFF6FC00005CD8
nt!_MMPTE_HARDWARE
   +0x000 Valid            : 0y1
   +0x000 Owner            : 0y0
   +0x000 WriteThrough     : 0y0
   +0x000 CacheDisable     : 0y0
   +0x000 Accessed         : 0y1
   +0x000 Dirty            : 0y1
   +0x000 LargePage        : 0y0
   +0x000 Global           : 0y1
   +0x000 CopyOnWrite      : 0y0   ;<-------- Bit CoW = False
   +0x000 Write            : 0y0
   +0x000 PageFrameNumber  : 0y000000000000000000000000101110011011 (0xb9b)
   +0x000 SoftwareWsIndex  : 0y00000000000 (0)
   +0x000 NoExecute        : 0y0
0: kd>

Core i7 18 Jun 2025, 17:56

...and a little more about the time allocated to threads.
Windows uses "preemptive multitasking": threads with higher priority can take time from regular threads. The scheduler allocates the same amount of time to all threads (whatever their priority), after which the context of the current thread is saved and control is passed to the next thread in the queue. This amount of time is known as a "quantum". On server systems 1 quantum equals 12 system timer intervals, and on client systems only 2 intervals.

The duration of one timer interval depends on the HAL, not the OS kernel. Having requested it from the HAL, the kernel stores it in its variable "KeMaximumIncrement" in units of 100 nanoseconds. On my system, WinDbg shows 156001 such units, i.e. 15.6 milliseconds, which means that threads are allocated a quantum of about 31 ms (2 intervals):

Code:

0: kd> dd  KeMaximumIncrement L1
fffff800`02d07030    00026161

0: kd> ? 0x26161
Evaluate expression: 156001 = 00000000`00026161

0: kd>
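
The conversion is simple enough to check in a few lines of C. The helper names interval_ns and quantum_ns are mine; the 100 ns unit and the 2 and 12 intervals per quantum are taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* KeMaximumIncrement is expressed in 100 ns units (per the post above). */
#define NS_PER_UNIT 100u

/* Length of one timer interval in nanoseconds. */
uint64_t interval_ns(uint32_t max_increment)
{
    return (uint64_t)max_increment * NS_PER_UNIT;
}

/* Quantum length for a given number of timer intervals per quantum
   (2 for client systems, 12 for servers, per the post above). */
uint64_t quantum_ns(uint32_t max_increment, unsigned intervals)
{
    assert(intervals > 0);
    return interval_ns(max_increment) * intervals;
}
```

With the dumped value 0x26161 (156001) this gives 15,600,100 ns per interval, 31,200,200 ns (about 31.2 ms) for a 2-interval quantum and 187,201,200 ns (about 187 ms) for a 12-interval one.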

But the point is that the Windows thread scheduler does not count timer interrupts to decide when to give the next thread its quantum. Instead, the quantum duration is measured in processor clock cycles. That is, the kernel runs the RDTSC instruction and waits for one timer interval (15.6 milliseconds) to elapse. The resulting counter value is saved in the variable "KiCyclesPerClockQuantum", after which the hardware timer is no longer needed. I got 12,994,883 CPU cycles, so the quantum for a thread is twice that, i.e. 25,989,766 cycles:

Code:

0: kd> dq KiCyclesPerClockQuantum L1
fffff800`02d0714c    00000000`00c64943

0: kd> ? 0xc64943
Evaluate expression: 12994883 = 00000000`00c64943  ;<----------

0: kd>

It turns out that this algorithm is used to give threads an exact execution time (in quanta). The problem is that context switching is a long operation, and it consumes time inside a quantum. But the kernel knows how many CPU cycles it takes to save the context, and the accounting is correct as long as this time does not depend on the timer period. For example, a 3 GHz CPU will switch context three times faster than a 1 GHz processor. As a result, threads on different processors get the same quantum.

bzt 19 Jun 2025, 09:29

Core i7 wrote:

but unfortunately the CoW bit is clear

Why "unfortunately"? CoW makes no sense, as the CPU never writes to the TSS, just reads it.

Core i7 wrote:

So the purpose of this TSS64 is unclear

Take a look at the Intel Manual I've linked, page 9-20. In 64-bit mode the TSS stores stack pointers for different privilege levels (rings).

Core i7 wrote:

someone has already accessed it

In 64-bit mode, the TSS is read whenever an interrupt switches to a higher privilege level. That is almost always the case once you have user tasks: ring 3 code is interrupted and the handler usually runs at ring 0, so the new RSP value is taken from the TSS and the interrupt stack frame is stored on that new stack. The TSS itself is not modified at all.

Or, if the IDT is configured for ISTs, then the per-interrupt-handler stack pointer is read from the TSS instead. It is the same thing; the only difference is the TSS offset from which the stack pointer is loaded into RSP. The TSS page is accessed either way, and nothing is written into the TSS.
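
For reference, here is the 64-bit TSS layout as a C struct, together with a small model of the stack selection described above. The field offsets follow the Intel manual and match the _KTSS64 dump earlier in the thread (which shows the first reserved qword as element 0 of Ist). The names tss64_t and interrupt_stack are mine, the packed attribute assumes GCC or Clang, and the function only models what the CPU does in hardware; it is not code you would run in a kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit TSS layout (0x68 bytes). The packed attribute is a GCC/Clang
   extension, needed because the 64-bit fields sit at 4-byte offsets. */
typedef struct __attribute__((packed)) {
    uint32_t reserved0;
    uint64_t rsp[3];        /* RSP0..RSP2: stacks for rings 0..2 */
    uint64_t reserved1;
    uint64_t ist[7];        /* IST1..IST7 */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base;
} tss64_t;

/* Model of which stack pointer the CPU picks when delivering an interrupt
   to ring 0. ist_index comes from the IDT gate (0 means "no IST"). */
uint64_t interrupt_stack(const tss64_t *tss, unsigned ist_index,
                         int from_user_mode, uint64_t current_rsp)
{
    uint64_t rsp;
    assert(tss != NULL && ist_index <= 7);
    if (ist_index != 0) {
        rsp = tss->ist[ist_index - 1];   /* per-handler stack */
    } else if (from_user_mode) {
        rsp = tss->rsp[0];               /* ring 3 -> ring 0 switch */
    } else {
        rsp = current_rsp;               /* already in ring 0: TSS not needed */
    }
    return rsp & ~(uint64_t)0xF;         /* frame is pushed on a 16-byte boundary */
}
```

The tss pointer is const on purpose: the model only ever reads the TSS. What a kernel typically does itself is write the next thread's kernel stack top into RSP0 on every software task switch.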

revolution 19 Jun 2025, 13:41

For CPUs with FPUs / vector units (aka a co-processor), it can be useful to disable the co-processor when switching to a new task. Then if the new task starts using the co-processor a fault is generated and the new set of registers can be switched in, then restart the task.

This is an efficiency enhancement that bypasses unnecessary saving/restoring of the co-processor registers if a task doesn't use them. It comes at the cost of a small amount of housekeeping to keep track of which task the co-processor registers currently belong to.
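
The housekeeping can be modelled in C roughly like this. It is a simulation, not kernel code: the live registers are a byte array, the "disabled" flag stands in for a trap-on-use bit such as CR0.TS on x86, and lazy_fault_handler plays the role of the fault handler (#NM on x86). All names are made up for the example; 512 bytes is the size of an FXSAVE area.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FPU_STATE_BYTES 512   /* size of an FXSAVE area on x86 */

typedef struct {
    unsigned char fpu_state[FPU_STATE_BYTES];   /* per-task save area */
} lazy_task_t;

typedef struct {
    unsigned char regs[FPU_STATE_BYTES];  /* models the live co-processor registers */
    int           disabled;               /* models a "trap on use" bit such as CR0.TS */
    lazy_task_t  *owner;                  /* task whose values are in regs, or NULL */
    lazy_task_t  *current;                /* task running on the CPU */
} lazy_cpu_t;

/* Task switch: do not touch the co-processor, just arm the trap. */
void lazy_switch_to(lazy_cpu_t *cpu, lazy_task_t *next)
{
    assert(cpu != NULL && next != NULL);
    cpu->current = next;
    cpu->disabled = (cpu->owner != next);
}

/* Fault handler, run on the first co-processor instruction after a switch:
   swap register sets only now, then let the instruction be retried. */
void lazy_fault_handler(lazy_cpu_t *cpu)
{
    assert(cpu != NULL && cpu->current != NULL);
    if (cpu->owner != NULL && cpu->owner != cpu->current) {
        memcpy(cpu->owner->fpu_state, cpu->regs, FPU_STATE_BYTES);
    }
    if (cpu->owner != cpu->current) {
        memcpy(cpu->regs, cpu->current->fpu_state, FPU_STATE_BYTES);
        cpu->owner = cpu->current;
    }
    cpu->disabled = 0;
}
```

If task A uses the co-processor, task B runs without touching it, and A is scheduled again, no register copy happens at all. That is the saving described above, and the owner pointer is the small amount of housekeeping it costs.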

Core i7 19 Jun 2025, 15:52

revolution, I think restarting the task is not much more effective than a simple dump of the fpu registers. The context structure stores the fpu state at the current moment (Control & Status words), and the thread structure even has a separate frame for the fpu in the kernel stack. But this is Windows, and for other OS your suggestion may be useful.

Code:

0: kd> dt _xsave_format
ntdll!_XSAVE_FORMAT
   +0x000 ControlWord      : Uint2B
   +0x002 StatusWord       : Uint2B
   +0x004 TagWord          : UChar
   +0x005 Reserved1        : UChar
   +0x006 ErrorOpcode      : Uint2B
   +0x008 ErrorOffset      : Uint4B
   +0x00c ErrorSelector    : Uint2B
   +0x00e Reserved2        : Uint2B
   +0x010 DataOffset       : Uint4B
   +0x014 DataSelector     : Uint2B
   +0x016 Reserved3        : Uint2B
   +0x018 MxCsr            : Uint4B
   +0x01c MxCsr_Mask       : Uint4B
   +0x020 FloatRegisters   : [8] _M128A
   +0x0a0 XmmRegisters     : [16] _M128A
   +0x1a0 Reserved4        : [96] UChar

0: kd> dt _kthread fffffa8005c0da90      ;<-------- User thread
ntdll!_KTHREAD
   +0x000 Header           : _DISPATCHER_HEADER
   +0x018 CycleTime        : 0x7`8f7f5d63
   +0x020 QuantumTarget    : 0x7`913ac789
   +0x283 QuantumReset     : 6 
   +0x134 ContextSwitches  : 0x5697

   +0x028 InitialStack     : 0xfffff880`0516fd70  Void
   +0x030 StackLimit       : 0xfffff880`05166000  Void
   +0x038 KernelStack      : 0xfffff880`0516f470  Void
   +0x278 StackBase        : 0xfffff880`05170000  Void
   +0x358 StateSaveArea    : 0xfffff880`0516fdc0  _XSAVE_FORMAT     ;<---------//
............
0: kd>

revolution 20 Jun 2025, 07:32

Core i7 wrote:

But this is Windows, and for other OS your suggestion may be useful.

When inside the "OS Construction" forum I usually assume the topic won't be about Windows, since Windows was already constructed a long time ago.

Even so, I remember reading a long time ago that Windows follows this method of delaying the save/restore of the FPU/XMM state until needed. If a process requests to read the context, then Windows is forced to update the FPU/XMM state before passing it on to the requesting process.

Do the latest Windows versions no longer do this? Did they do some benchmarks, or something, and decide to scrap it because "modern" apps make copious use of the FPU / vector units?

Perhaps, with the mandatory stack alignment requirement due to the use of MOVDQA for loading the first four values of a FASTCALL procedure, the trade-off is no longer worth it (in Windows). Other OSes might, or might not, find it worthwhile. But it is a data point for people to consider when constructing their OSes.
