flat assembler
Message board for the users of flat assembler.
revolution 17 Jun 2025, 11:37
There is cooperative multitasking, where each task decides for itself when to yield. In those cases a hardware interrupt can be omitted.
For pre-emptive multitasking, a hardware interrupt would be the most likely way to initiate a yield.
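As a rough illustration of the cooperative case, here is a minimal C sketch in which a task "yields" simply by returning to the scheduler loop. The names `task_fn`, `run_cooperative` and `countdown_step` are made up for this example; a real kernel would switch stacks rather than call functions.

```c
#include <stddef.h>

/* Hypothetical task type: each call runs one slice and returns
   1 if the task wants to run again, 0 when it has finished. */
typedef int (*task_fn)(void *state);

struct task {
    task_fn run;
    void   *state;
    int     alive;
};

/* Round-robin over the tasks until all of them have finished.
   Returns the total number of slices executed. */
static int run_cooperative(struct task *tasks, size_t count)
{
    int slices = 0;
    size_t remaining = count;

    for (size_t i = 0; i < count; i++)
        tasks[i].alive = 1;

    while (remaining > 0) {
        for (size_t i = 0; i < count; i++) {
            if (!tasks[i].alive)
                continue;
            slices++;
            /* The task returns voluntarily: this return is the yield.
               No timer interrupt is involved anywhere. */
            if (!tasks[i].run(tasks[i].state)) {
                tasks[i].alive = 0;
                remaining--;
            }
        }
    }
    return slices;
}

/* Example task: counts down by one per slice, then finishes. */
static int countdown_step(void *state)
{
    int *n = state;
    if (*n > 0)
        (*n)--;
    return *n > 0;
}
```

A task that never returns would starve every other task, which is exactly the weakness a pre-emptive timer interrupt removes.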
Greg_M 17 Jun 2025, 12:03
Good point. Tasks in pre-emptive kernels can also suspend themselves, e.g. via semaphore_wait(...) or suspend(...).
The suspend() and the hardware interrupt could use a common task switch function. One approach might be for suspend() to set up the stack to mimic the interrupt stack and then use the exact same function that the interrupt uses. (Note: I also updated the title to include "pre-emptive". Thanks!)
Last edited by Greg_M on 18 Jun 2025, 02:17; edited 5 times in total
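The shared-switch idea can be sketched in C as a simulation. Everything here is hypothetical (`struct frame`, `task_switch`, `on_timer_interrupt`, the round-robin policy); a real kernel would do the frame handling in assembly. The point is only that suspend() builds a frame shaped like the interrupt frame and then takes the same path.

```c
#include <stdint.h>

/* Hypothetical saved-register frame; a real one mirrors whatever the
   CPU and the interrupt stub push. */
struct frame {
    uint64_t rip;
    uint64_t rsp;
    uint64_t rflags;
};

enum { MAX_TASKS = 4 };

struct task {
    struct frame saved;
    int          runnable;
};

static struct task tasks[MAX_TASKS];
static int current;

/* Common path: store the outgoing frame, pick the next runnable task
   round-robin, and return the frame that should be resumed. */
static struct frame *task_switch(const struct frame *outgoing)
{
    tasks[current].saved = *outgoing;
    for (int step = 1; step <= MAX_TASKS; step++) {
        int candidate = (current + step) % MAX_TASKS;
        if (tasks[candidate].runnable) {
            current = candidate;
            break;
        }
    }
    return &tasks[current].saved;
}

/* Pre-emptive entry: the interrupt stub has already built the frame. */
static struct frame *on_timer_interrupt(const struct frame *hw_frame)
{
    return task_switch(hw_frame);
}

/* Voluntary entry: build a frame that looks like an interrupt frame,
   mark the task as not runnable, then reuse the same switch function. */
static struct frame *suspend(uint64_t resume_rip, uint64_t rsp, uint64_t rflags)
{
    struct frame fake = { resume_rip, rsp, rflags };
    tasks[current].runnable = 0;
    return task_switch(&fake);
}
```

Because both entry points hand over the same frame layout, the resume code never needs to know whether a task was pre-empted or suspended itself.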
macomics 17 Jun 2025, 14:52
#3 Here you are focusing on the fact that something is saved on the stack that the current task is working with. But what happens to that saved data on the stack after the task is restarted? Ideally, task switching should not pollute the task's stack. The task's own working memory (including stack memory) should remain as unaffected as possible when switching tasks. If we are talking about the i386 architecture, then the task state should generally be recorded on a separate physical page and stored separately from all other tasks.
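A tiny C sketch of that separation, assuming a hypothetical `struct task` in which the save area is a kernel-owned block distinct from the task's stack (in a real system it would sit on its own page, which this simulation does not model):

```c
#include <stdint.h>

enum { STACK_WORDS = 8 };

struct regs {
    uint64_t rip;
    uint64_t rsp;
    uint64_t rflags;
};

/* Hypothetical layout: the saved state lives in a kernel-owned block,
   not inside the task's own stack memory. */
struct task {
    uint64_t    stack[STACK_WORDS];  /* the task's working memory      */
    struct regs save_area;           /* kernel-owned, never on the stack */
};

/* Saving the state writes only to the save area. */
static void save_state(struct task *t, const struct regs *live)
{
    t->save_area = *live;
}

/* Restoring reads only from the save area. */
static void restore_state(const struct task *t, struct regs *live)
{
    *live = t->save_area;
}
```

With this arrangement the bytes below the task's stack pointer are the same before and after a switch, so the task cannot observe the switch through its stack.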
Core i7 17 Jun 2025, 21:22
What OS are we talking about?
For example, Windows x64 does not use hardware task switching via the TSS segment (it is slow), and stores the entire context on the kernel stack. Moreover, threads are scheduled, not processes/tasks. Each thread has 2 stacks: user and kernel. Pointers to them are stored in the _KTHREAD structure of each thread. Structures with the prefix (K) are used by the Kernel scheduler, and those with the prefix (E) by the Executive. This can be seen in the WinDbg debugger, for example, in my SMBIOS.EXE process:
Code:
0: kd> !process 0 5 smbios_v13.exe
PROCESS fffffa8004e2b810
    SessionId: 1 Cid: 0898 Peb: 7efdf000 ParentCid: 0bcc
    DirBase : 1c26d000 ObjectTable: fffff8a00203e8f0 HandleCount: 94.
    Image : SMBIOS_v13.EXE
    VadRoot fffffa80059b6300 Vads 92 Clone 0 Private 358. Modified 0. Locked 0.
    Token fffff8a002c4ca30
    VirtualSize 89 Mb
    PageFaultCount 2204
    BasePriority 8
    Job fffffa80056194a0
  1. THREAD fffffa8005a81640 Cid 0898.08b8 Teb: 000000007efdb000 Win32Thread: fffff900c23377c0 WAIT
  2. THREAD fffffa800570ab50 Cid 0898.08d8 Teb: 000000007efd8000 Win32Thread: 0000000000000000 WAIT
0: kd>

Now the interesting fields are "DirBase=1c26d000" (the process's virtual page directory) and "VadRoot=xx6300" (the list of allocated regions). The 2 threads shown in the WinDbg log share the one address space of the smbios.exe process, so DirBase describes the memory of both threads 1 and 2. DirBase stores PTE records that describe the mapping of virtual addresses to physical memory frames. This is what the VAD (Virtual Address Descriptor) list contains; its Start/End addresses are given without the 12 least significant bits (i.e. append 000 hex on the right):
Code:
0: kd> !vad fffffa80059b6300
VAD              level  start  end       commit
fffffa800571b550 ( 6)   10     1f        0   Mapped       READWRITE          Pagefile-backed section
fffffa8005809420 ( 5)   20     20        1   Private      READWRITE
fffffa8005ba8590 ( 6)   30     30        1   Private      READWRITE
fffffa8005a06680 ( 4)   40     40        0   Mapped  Exe  EXECUTE_WRITECOPY  \Windows\System32\apisetschema.dll
fffffa80059efb90 ( 5)   50     8f        9   Private      READWRITE
fffffa8005b2a690 ( 6)   90     cf        7   Private      READWRITE
fffffa8005b07810 ( 3)   d0     d3        0   Mapped       READONLY           Pagefile-backed section
fffffa80056f0d10 ( 5)   e0     e0        1   Private      READWRITE
fffffa80056c97a0 ( 4)   f0     f5        0   Mapped       READONLY           Pagefile-backed section
fffffa8004a7e520 ( 6)   100    100       0   Mapped       READWRITE          Pagefile-backed section
fffffa8004dd9b40 ( 3)   200    266       0   Mapped       READONLY           \Windows\System32\locale.nls
fffffa8005943420 ( 5)   270    36f       48  Private      READWRITE
fffffa8005a35570 ( 1)   400    408       6   Mapped  Exe  EXECUTE_WRITECOPY  \SMBIOS_v13.EXE ;<--------- 0x400000 = Image Base
fffffa8004e44b70 ( 6)   410    48f       1   Private      READWRITE
fffffa8005a8c520 ( 5)   4b0    4bf       3   Private      READWRITE
fffffa80051f3110 ( 6)   4c0    4cf       7   Private      READWRITE
fffffa8005aca230 ( 4)   4d0    657       0   Mapped       READONLY           Pagefile-backed section
fffffa80056b1a30 ( 5)   2100   213f      24  Private      READWRITE
fffffa8005a8ccc0 ( 6)   738d0  738d8     2   Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\version.dll
fffffa8005a6c990 ( 6)   74700  7474b     4   Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\dxgi.dll
fffffa80059be370 ( 3)   74ed0  7502f     5   Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\ole32.dll
fffffa80056348c0 ( 3)   770b0  771bf     3   Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\kernel32.dll
fffffa8004c4f3a0 ( 6)   771e0  7726f     3   Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\gdi32.dll
fffffa8005a66ef0 ( 4)   77570  7770e     15  Mapped  Exe  EXECUTE_WRITECOPY  \Windows\System32\ntdll.dll
fffffa8004f0fb40 ( 5)   77730  778af     10  Mapped  Exe  EXECUTE_WRITECOPY  \Windows\SysWOW64\ntdll.dll
fffffa8004941570 ( 6)   7efe0  7f0df     0   Mapped       READONLY           Pagefile-backed section
fffffa8004df67d0 ( 3)   7f0e0  7ffdf     0   Private      READONLY
fffffa8005998d20 ( 4)   7ffe0  7ffef     -1  Private      READONLY
fffffa8005ac4130 ( 5)   7fff0  7fffffef  -1  Private      READONLY
Total VADs: 83 Average level: 5 Maximum depth: 7
0: kd>

Now, if my thread(1) switches to thread(1) of another process, and not to thread(2) of its own, then the scheduler must save pointers to DirBase+VadRoot on the kernel stack of my thread(1), otherwise it will be impossible to restore my stopped thread(1). If threads switch within one process, then there is no need to save Dir+Vad, only the register context. Here are the fields with pointers to the stacks of my thread(1):
Code:
0: kd> dt _kthread fffffa8005a81640
ntdll!_KTHREAD
   +0x000 Header : _DISPATCHER_HEADER
   .........
   +0x028 InitialStack : 0xfffff880`0243cd70 Void
   +0x030 StackLimit : 0xfffff880`02434000 Void
   +0x038 KernelStack : 0xfffff880`0243c5f0 Void
   +0x134 ContextSwitches : 0x3f5
   +0x1d8 TrapFrame : 0xfffff880`0243cbe0 _KTRAP_FRAME
   +0x358 StateSaveArea : 0xfffff880`0243cdc0 _XSAVE_FORMAT
   .........
0: kd>

In general, when switching thread(1) to thread(2), the Windows scheduler does the following:
1. The register context of thread(1) is saved at the KernelStack pointer of thread(1), and the KernelStack value is reduced by the size of the context (see the _CONTEXT structure). If thread(2) does not belong to the process of thread(1), then the pointers to Dir+Vad of thread(1) are also saved on the KernelStack.
2. Based on the KernelStack value of thread(2), its entire context is restored.
3. Now the KernelStack value of thread(1) is written to the KernelStack field of thread(2).
4. All other threads are switched in the same way, forming a circular chain.
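The "reload the address space only when the process changes" part of those steps can be sketched as a C simulation. This is not Windows code: `struct cpu`, `switch_thread` and `CONTEXT_BYTES` are invented for illustration, and the sketch looks the page-directory base up through the owning process instead of saving it on the kernel stack.

```c
#include <stdint.h>

struct process {
    uint64_t dir_base;          /* page-directory base (a CR3 value) */
};

struct thread {
    struct process *owner;
    uint64_t        kernel_stack;  /* saved kernel stack pointer */
};

struct cpu {
    uint64_t rsp;               /* simulated stack pointer          */
    uint64_t cr3;               /* simulated page-directory base    */
    unsigned cr3_reloads;       /* how many times cr3 was rewritten */
};

enum { CONTEXT_BYTES = 0x100 };  /* assumed size of a saved register context */

static void switch_thread(struct cpu *cpu, struct thread *from, struct thread *to)
{
    /* 1. Reserve room for the outgoing context on its own kernel stack
          and remember where it ended up. */
    from->kernel_stack = cpu->rsp - CONTEXT_BYTES;

    /* 2. Change the address space only if the threads belong to
          different processes; same-process switches skip this. */
    if (to->owner != from->owner) {
        cpu->cr3 = to->owner->dir_base;
        cpu->cr3_reloads++;
    }

    /* 3. Resume the incoming thread by popping its saved context. */
    cpu->rsp = to->kernel_stack + CONTEXT_BYTES;
}
```

Skipping the CR3 write for same-process switches matters in practice because reloading it typically invalidates non-global TLB entries.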
bzt 18 Jun 2025, 10:44
macomics wrote:
But, what will happen to the recovered data in the stack after restarting the task? For good measure, task switching should not pollute the task stack. The nearest task's working memory (including stack memory) should remain as unaffected as possible when switching tasks. If we are talking about the i386 architecture, then the task state should generally be recorded on a separate physical page and stored separately from all other tasks.

Core i7 wrote:
For example, Windows x64 does not use hardware task switching via the TSS segment (it is slow)

(FYI, hardware task switch was only supported in 16-bit and 32-bit protected modes, often referred to as x86 or i386. 16-bit protected mode was only supported on 80286).

Core i7 wrote:
Moreover, threads are scheduled, not processes/tasks.
Core i7 18 Jun 2025, 13:10
bzt wrote:
There's no hardware task switch in long mode (amd64 or x86_64 or x64) at all.

Perhaps the documentation says that in x64 the TSS (task state segment) is completely absent, but the main per-CPU structure, the PCR (Processor Control Region), holds a valid pointer to it. If I dump the structure itself, though, it is empty except for one RSP. It is unclear where this RSP points: its frame does not match the one used by the program's threads:
Code:
0: kd> dt _kpcr @$pcr
ntdll!_KPCR
   +0x000 NtTib : _NT_TIB
   +0x000 GdtBase : 0xfffff800`00b9c000 _KGDTENTRY64
   +0x008 TssBase : 0xfffff800`00b9b000 _KTSS64 ;<-------------//
   +0x010 UserRsp : 0x1ce1b8
   +0x018 Self : 0xfffff800`02c4e000 _KPCR
   +0x020 CurrentPrcb : 0xfffff800`02c4e180 _KPRCB
   +0x028 LockArray : 0xfffff800`02c4e7f0 _KSPIN_LOCK_QUEUE
   +0x030 Used_Self : 0x000007ff`fff9c000 Void
   +0x038 IdtBase : 0xfffff800`00b9a000 _KIDTENTRY64
   +0x050 Irql : 0 ''
   +0x051 CacheAssociativity : 0x8 ''
   +0x060 MajorVersion : 1
   +0x062 MinorVersion : 1
   +0x064 StallScaleFactor : 0x9c3
   +0x0bc SecondLevelCacheSize : 0x200000
   +0x0c0 HalReserved : [16] 0x9502d1f0
   +0x108 KdVersionBlock : (null)
   +0x118 PcrAlign1 : [24] 0
   +0x180 Prcb : _KPRCB
0: kd> dt _ktss64 -v 0xfffff800`00b9b000
ntdll!_KTSS64
struct _KTSS64, 8 elements, 0x68 bytes ;<----------- Sizeof = 0x68 bytes
   +0x000 Reserved0 : 0
   +0x004 Rsp0 : 0xfffff800`00ba4d70
   +0x00c Rsp1 : 0
   +0x014 Rsp2 : 0
   +0x01c Ist : [8] 0
   +0x05c Reserved1 : 0
   +0x064 Reserved2 : 0
   +0x066 IoMapBase : 0x68
0: kd> dt _kthread fffffa8003b9d940 ;<-------------------- User Thread
ntdll!_KTHREAD
   +0x000 Header : _DISPATCHER_HEADER
   +0x018 CycleTime : 0x2`73e69274
   +0x020 QuantumTarget : 0x2`75d652ea
   +0x028 InitialStack : 0xfffff880`05f15d70 Void
   +0x030 StackLimit : 0xfffff880`05f0c000 Void
   +0x038 KernelStack : 0xfffff880`05f15470 Void ;<-----------------
   ..........
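For anyone writing their own kernel, the _KTSS64 layout in that dump can be mirrored as a packed C struct. This is a sketch: the lowercase field names and `tss_init` are my own, and `__attribute__((packed))` is a GCC/Clang extension. The static assertions check the offsets against the dump (0x68 bytes in total).

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit TSS, with fields in the same order as the _KTSS64 dump.
   Packing is required because rsp0 sits at the unaligned offset 4. */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp0;          /* stack loaded on a ring 3 -> ring 0 entry */
    uint64_t rsp1;
    uint64_t rsp2;
    uint64_t ist[8];        /* ist[0] is reserved; ist[1..7] are IST1..IST7 */
    uint64_t reserved1;
    uint16_t reserved2;
    uint16_t io_map_base;
} __attribute__((packed));

_Static_assert(sizeof(struct tss64) == 0x68, "TSS must be 0x68 bytes");
_Static_assert(offsetof(struct tss64, rsp0) == 0x04, "Rsp0 offset");
_Static_assert(offsetof(struct tss64, ist) == 0x1c, "Ist offset");
_Static_assert(offsetof(struct tss64, io_map_base) == 0x66, "IoMapBase offset");

/* Hypothetical helper: one TSS per CPU core, with only RSP0 filled in.
   Setting io_map_base to the structure size means no I/O bitmap
   follows, which matches the 0x68 value in the dump. */
static void tss_init(struct tss64 *tss, uint64_t kernel_stack_top)
{
    *tss = (struct tss64){0};
    tss->rsp0 = kernel_stack_top;
    tss->io_map_base = sizeof *tss;
}
```

Note that there is no room in this structure for general-purpose registers, which is consistent with the TSS being "empty except for one RSP".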
macomics 18 Jun 2025, 15:11
bzt wrote:
Obviously you must remove those from the stack before restarting the task. In general it doesn't matter where you store those, user stack is as good a place as any (from a strictly technical point of view), using a separate area is recommended for better security only (eliminates the possibility of buffer overflow attacks, e.g. when the user stack is deliberately filled and there's not enough space to store the task struct).

In this case, even trivial detection of tracing can be done with this code:
Code:
mov rax, [rsp-4]    ; read memory just below the stack pointer
cmp rax, [rsp-4]    ; read it again; it differs only if something wrote there in between
jnz tracing         ; e.g. a trap or interrupt frame was stored on this stack
I was just saying that it's not worth changing the address space of a task for system purposes.

Core i7 wrote:
Perhaps the documentation says that in x64 the TSS task state segment is completely absent, but in the cpu main structure PCR (Processor Control Region) there is a valid pointer to it. But if I request the structure itself, it is empty, except for one RSP. It is unclear where this RSP points - its frame does not match the one used by the program threads:

That's what I was talking about. Even in Windows, the hardware TSS is located on its own separate page. It is not initialized because the user task has a write-copy page in its place in case of a hardware task switch. Only the system init process will have it recorded in full.
bzt 18 Jun 2025, 15:56
Core i7 wrote:
Perhaps the documentation says that in x64 the TSS task state segment is completely absent

macomics wrote:
I was just saying that it's not worth changing the address space of a task for system purposes.

macomics wrote:
the hardware TSS is located on its own separate page

"The TSS holds information important to 64-bit mode and that is not directly related to the task-switch mechanism."

macomics wrote:
It is not initialized because the user task has a write-copy page in its place in case of a hardware task switch.

"However, the task switching mechanism available in protected mode is not supported in 64-bit mode. Task management and switching must be performed by software."

Hope this helps to understand the quirks of 64-bit mode. It's tricky because it inherited and repurposed the 32-bit mode structures.
macomics 18 Jun 2025, 16:10
bzt wrote:
Again, there's no per-task TSS (actually, it's enough to have a single TSS per CPU core, no matter how many tasks are scheduled on that core), and there's no such thing as hardware task switch, from Intel Manual page 9-19:
bzt 18 Jun 2025, 16:35
macomics wrote:
Only you have not only 64-bit tasks in your system, but also 32-bit ones.

macomics wrote:
That's what they can trigger writing to this TSS page.

"Note that a CALL instruction can not be used to cause a task switch in compatibility mode since task switches are not supported in IA-32e mode."

Like I said, 64-bit mode is tricky, because they have repurposed the structures but kept the nomenclature. In 32-bit emulation (compatibility) mode only the user level is the same; the kernel level is very different from protected mode.
Core i7 18 Jun 2025, 17:46
macomics wrote:
Therefore, it is clean and in write-copy mode.

The "Copy-On-Write" attribute can be set on user-mode pages, and the TSS is in the kernel. Its PTE entry has the "Global" attribute, but unfortunately the CoW bit is clear. So the purpose of this TSS64 is unclear, although the pointer seems to be valid and someone has already accessed it (see bit A):
Code:
0: kd> dt _kpcr @$pcr TssBase
ntdll!_KPCR
   +0x008 TssBase : 0xfffff800`00b9b000 _KTSS64
0: kd> !cmkd.ptelist -v 0xfffff80000b9b000
VA=FFFFF80000B9B000
PXE Idx=1F0 Va=FFFFF6FB7DBEDF80 Contents=00199063 Hard Pfn=00000199 Attr=---DA--KWEV
PPE Idx=000 Va=FFFFF6FB7DBF0000 Contents=00198063 Hard Pfn=00000198 Attr=---DA--KWEV
PDE Idx=005 Va=FFFFF6FB7E000028 Contents=0019A063 Hard Pfn=0000019A Attr=---DA--KWEV
PTE Idx=19B Va=FFFFF6FC00005CD8 Contents=00B9B163 Hard Pfn=00000B9B Attr=-G-DA--KWEV <--+
                                                                                      |
Global,Dirty,Accessed,Kernel,Writable,Execute,Valid ----------------------------------+
0: kd> dt _mmpte_hardware FFFFF6FC00005CD8
nt!_MMPTE_HARDWARE
   +0x000 Valid : 0y1
   +0x000 Owner : 0y0
   +0x000 WriteThrough : 0y0
   +0x000 CacheDisable : 0y0
   +0x000 Accessed : 0y1
   +0x000 Dirty : 0y1
   +0x000 LargePage : 0y0
   +0x000 Global : 0y1
   +0x000 CopyOnWrite : 0y0 ;<-------- Bit CoW = False
   +0x000 Write : 0y0
   +0x000 PageFrameNumber : 0y000000000000000000000000101110011011 (0xb9b)
   +0x000 SoftwareWsIndex : 0y00000000000 (0)
   +0x000 NoExecute : 0y0
0: kd>
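The same attributes can be read straight out of the raw PTE value (Contents=00B9B163) with a few masks. A small C sketch, with `struct pte_bits` and `decode_pte` as made-up names: bits 0 to 8 and the frame number follow the architectural x86-64 page-table format, while bit 9 is one of the OS-available bits, labelled CopyOnWrite only because the dump above shows it that way.

```c
#include <stdbool.h>
#include <stdint.h>

struct pte_bits {
    bool     present;   /* bit 0: Valid                      */
    bool     writable;  /* bit 1: hardware R/W               */
    bool     user;      /* bit 2: set = user, clear = kernel */
    bool     accessed;  /* bit 5: A                          */
    bool     dirty;     /* bit 6: D                          */
    bool     global;    /* bit 8: G                          */
    bool     sw_bit9;   /* bit 9: OS-available; shown as CopyOnWrite in the dump */
    uint64_t pfn;       /* bits 12..51: page frame number    */
};

static struct pte_bits decode_pte(uint64_t pte)
{
    struct pte_bits b;
    b.present  = (pte >> 0) & 1;
    b.writable = (pte >> 1) & 1;
    b.user     = (pte >> 2) & 1;
    b.accessed = (pte >> 5) & 1;
    b.dirty    = (pte >> 6) & 1;
    b.global   = (pte >> 8) & 1;
    b.sw_bit9  = (pte >> 9) & 1;
    b.pfn      = (pte >> 12) & 0xFFFFFFFFFFull;  /* 40-bit frame number */
    return b;
}
```

For 0x00B9B163 the low 12 bits are 0x163, i.e. bits 0, 1, 5, 6 and 8 set, which lines up with the "-G-DA--KWEV" attribute string.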
Core i7 18 Jun 2025, 17:56
..and a little more about the time allocated to the threads.
Windows uses "preemptive multitasking": threads with higher priority can take time from regular threads. The scheduler allocates the same amount of time to all threads (whether they have a priority or not), after which the context of the current thread is saved and control is passed to the next thread in the queue. This time is known as a "quantum". On a server system, 1 quantum is equal to 12 system timer intervals; on client systems it is only 2 intervals. The duration of one timer interval depends on the HAL, not the OS kernel. Having requested it from the HAL, the kernel stores it in its variable "KeMaximumIncrement" in units of 100 nanoseconds. On my system, WinDbg says that it is equal to 15.6 milliseconds, which means that threads are allocated a quantum of ~30 ms:
Code:
0: kd> dd KeMaximumIncrement L1
fffff800`02d07030 00026161
0: kd> ? 0x26161
Evaluate expression: 156001 = 00000000`00026161
0: kd>

But the point is that the Windows thread scheduler does not monitor timer interrupts to assign a quantum to the next thread. Instead, the quantum duration is calculated relative to the processor clock frequency. That is, the kernel runs the RDTSC instruction and waits for one timer interval = 15.6 milliseconds to end. The resulting counter value is saved in the variable "KiCyclesPerClockQuantum", after which the hardware timer is no longer needed. I got 12,994,883 CPU cycles, and accordingly the quantum for the thread will be 2 times greater, i.e. 25,989,766:
Code:
0: kd> dq KiCyclesPerClockQuantum L1
fffff800`02d0714c 00000000`00c64943
0: kd> ? 0xc64943
Evaluate expression: 12994883 = 00000000`00c64943 ;<----------
0: kd>

It turns out that this algorithm is used to give threads an exact time (in quanta) for execution. The problem is that context switching is a long operation, and it takes time inside a quantum. But the kernel knows how many CPU cycles it takes to save the context, and the accounting stays correct as long as this time does not depend on the timer period. For example, a 3 GHz CPU will switch context three times faster than a 1 GHz processor. As a result, threads on different processors get the same quantum.
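The arithmetic behind those two numbers is easy to check in C. The helper names below are invented for this post; the inputs are the raw values from the two WinDbg dumps, and the factor of 2 is the client-system "2 intervals per quantum" mentioned above.

```c
#include <stdint.h>

/* Convert a duration given in 100 ns units to whole microseconds. */
static uint64_t units_100ns_to_us(uint64_t units)
{
    return units / 10;
}

/* Quantum length in 100 ns units for a given number of timer intervals. */
static uint64_t quantum_units(uint64_t increment_100ns, unsigned intervals)
{
    return increment_100ns * intervals;
}

/* Quantum length in CPU cycles for a given number of timer intervals. */
static uint64_t quantum_cycles(uint64_t cycles_per_interval, unsigned intervals)
{
    return cycles_per_interval * intervals;
}
```

With KeMaximumIncrement = 0x26161 (156001 units) one interval is 15600 µs, two intervals are 31200 µs (the "~30 ms" above), and 2 × 0xc64943 gives 25,989,766 cycles.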
bzt 19 Jun 2025, 09:29
Core i7 wrote:
but unfortunately the CoW bit is clear

Core i7 wrote:
So the purpose of this TSS64 is unclear

Core i7 wrote:
someone has already accessed it

Or, if the IDT is configured for ISTs, then the per-interrupt-handler stack pointer is read from the TSS instead. Same thing; the only difference is the TSS offset from which the stack pointer is loaded into RSP. The TSS page is accessed either way, and nothing is written into the TSS.
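The choice between the two TSS offsets can be written down as a small C function. This is a simplified model of what the CPU does in 64-bit mode, covering only ring 3 to ring 0 transitions; `struct tss64_stacks` and `interrupt_stack` are names I made up for the sketch.

```c
#include <stdint.h>

/* Only the TSS fields that matter for stack selection. */
struct tss64_stacks {
    uint64_t rsp0;
    uint64_t ist[8];   /* ist[0] unused; ist[1..7] are IST1..IST7 */
};

/* Returns the stack pointer the CPU starts the handler on.
   ist_index   : 3-bit IST field of the IDT gate (0 = IST not used)
   from_user   : nonzero if the interrupted code ran at CPL 3
   current_rsp : stack pointer at the time of the interrupt      */
static uint64_t interrupt_stack(const struct tss64_stacks *tss,
                                unsigned ist_index,
                                int from_user,
                                uint64_t current_rsp)
{
    if (ist_index != 0)
        return tss->ist[ist_index & 7];  /* IST: switch unconditionally */
    if (from_user)
        return tss->rsp0;                /* privilege change: use RSP0  */
    return current_rsp;                  /* already in ring 0: no switch */
}
```

In every branch the TSS is only read, which fits the Accessed bit being set on the TSS page without anything being stored in the structure.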
revolution 19 Jun 2025, 13:41
For CPUs with FPUs / vector units (aka a co-processor), it can be useful to disable the co-processor when switching to a new task. Then, if the new task starts using the co-processor, a fault is generated, the new set of registers can be switched in, and the task restarted.
This is an efficiency enhancement that avoids unnecessary saving/restoring of the co-processor registers if a task doesn't use them. It comes at the cost of a small amount of housekeeping to keep track of which task the co-processor registers currently belong to.
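A simulated version of that housekeeping in C might look like the following. All names are hypothetical: `ts` stands in for the CR0.TS flag on x86, `on_device_not_available` for the #NM fault handler, and the `memcpy` calls for FXSAVE/FXRSTOR.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { FPU_BYTES = 512 };   /* size of an FXSAVE area */

struct task {
    unsigned char fpu_state[FPU_BYTES];  /* per-task saved registers */
};

struct cpu {
    unsigned char fpu_regs[FPU_BYTES];   /* simulated live registers  */
    bool          ts;                    /* simulated "FPU disabled"  */
    struct task  *fpu_owner;             /* whose registers are live  */
    struct task  *current;
    unsigned      fpu_swaps;             /* number of save/restores   */
};

/* Task switch: no FPU state is copied here. The FPU is merely disabled
   unless the live registers already belong to the incoming task. */
static void switch_to(struct cpu *cpu, struct task *next)
{
    cpu->current = next;
    cpu->ts = (cpu->fpu_owner != next);
}

/* Fault handler: runs on the first FPU use after a switch. */
static void on_device_not_available(struct cpu *cpu)
{
    if (cpu->fpu_owner != NULL)
        memcpy(cpu->fpu_owner->fpu_state, cpu->fpu_regs, FPU_BYTES);
    memcpy(cpu->fpu_regs, cpu->current->fpu_state, FPU_BYTES);
    cpu->fpu_owner = cpu->current;
    cpu->ts = false;
    cpu->fpu_swaps++;
}

/* Simulated FPU instruction: faults first if the FPU is disabled,
   then is "restarted" and writes a register. */
static void use_fpu(struct cpu *cpu, unsigned char value)
{
    if (cpu->ts)
        on_device_not_available(cpu);
    cpu->fpu_regs[0] = value;
}
```

If task A uses the FPU, task B runs without touching it, and A is scheduled again, no register copy happens at all; the swap counter only moves when a different task actually executes an FPU instruction.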
Core i7 19 Jun 2025, 15:52
revolution, I think restarting the task is not much more efficient than a simple dump of the FPU registers. The context structure stores the current FPU state (Control & Status words), and the thread structure even has a separate frame for the FPU on the kernel stack. But this is Windows, and for other OSes your suggestion may be useful.
Code:
0: kd> dt _xsave_format
ntdll!_XSAVE_FORMAT
   +0x000 ControlWord : Uint2B
   +0x002 StatusWord : Uint2B
   +0x004 TagWord : UChar
   +0x005 Reserved1 : UChar
   +0x006 ErrorOpcode : Uint2B
   +0x008 ErrorOffset : Uint4B
   +0x00c ErrorSelector : Uint2B
   +0x00e Reserved2 : Uint2B
   +0x010 DataOffset : Uint4B
   +0x014 DataSelector : Uint2B
   +0x016 Reserved3 : Uint2B
   +0x018 MxCsr : Uint4B
   +0x01c MxCsr_Mask : Uint4B
   +0x020 FloatRegisters : [8] _M128A
   +0x0a0 XmmRegisters : [16] _M128A
   +0x1a0 Reserved4 : [96] UChar
0: kd> dt _kthread fffffa8005c0da90 ;<-------- User thread
ntdll!_KTHREAD
   +0x000 Header : _DISPATCHER_HEADER
   +0x018 CycleTime : 0x7`8f7f5d63
   +0x020 QuantumTarget : 0x7`913ac789
   +0x283 QuantumReset : 6
   +0x134 ContextSwitches : 0x5697
   +0x028 InitialStack : 0xfffff880`0516fd70 Void
   +0x030 StackLimit : 0xfffff880`05166000 Void
   +0x038 KernelStack : 0xfffff880`0516f470 Void
   +0x278 StackBase : 0xfffff880`05170000 Void
   +0x358 StateSaveArea : 0xfffff880`0516fdc0 _XSAVE_FORMAT ;<---------//
   ............
0: kd>
revolution 20 Jun 2025, 07:32
Core i7 wrote:
But this is Windows, and for other OS your suggestion may be useful.

Even so, I remember reading a long time ago that Windows follows this method of delaying the save/restore of the FPU/XMM state until needed. If a process requests to read the context, then Windows is forced to update the FPU/XMM state before passing it on to the requesting process. Do the latest Windows versions no longer do this? Did they do some benchmarks, or something, and decide to scrap it because "modern" apps make copious use of the FPU / vector units? Perhaps, with the mandatory stack alignment requirement (due to the use of MOVDQA for loading the first four values of a FASTCALL procedure), the trade-off is no longer worth it (in Windows). Other OSes might, or might not, find it worthwhile. But it is a data point for people to consider when constructing their OSes.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.