flat assembler
Message board for the users of flat assembler.

Index > Main > Kernel performance question

Author
Thread Post new topic Reply to topic
DustWolf



Joined: 26 Jan 2006
Posts: 373
Location: Ljubljana, Slovenia
DustWolf 19 Jul 2006, 13:07
Hello,

I hope this is the suitable message board for this...

I have been reading trough the whitepaper on the Sythesis OS. It is a non-x86 OS that was built in Assembly, experimentaly, in the spirit of high performance. A fascinating read.

The paper mentions, if I understood it right, that most OS kernels work in a branchful manner: Every time a program calls a Kernel call, the Kernel procedure first has to decode the request, before executing it, thus making a lot of branched choices.

In other 'words', this would be a Kernel call:
Code:
mov bl,[blah]
mov ah,6Ch
int 21h    

...and this would then be a part of how the Kernel decides which instruction this is:
Code:
cmp ah,6Bh
je (somewhere)
cmp ah,6Ch
je (somewhere else)    


These conditional jumps do not execute very quickly on a Pipelined CPU, I think. I also think that even the most advanced branch prediction would be powerless to help in a situation such as this. Right?

How many Kernel calls do modern programs make? At the time when that Sythesis whitepaper was written (if the paper is accurate), programs relied on Kernel calls regularly troughout execution.

Does the design of all modern Kernels impact on the efficiency of the Pipeline?
Post 19 Jul 2006, 13:07
View user's profile Send private message Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 19 Jul 2006, 15:13
Quote:
...and this would then be a part of how the Kernel decides which instruction this is:

wrong, you can use jump table.

also why did you choose "project and ideas" section?!? It's meant is "projects and project ideas", not any ideas Smile

moderator: please, move it
Post 19 Jul 2006, 15:13
View user's profile Send private message Visit poster's website AIM Address MSN Messenger ICQ Number Reply with quote
DustWolf



Joined: 26 Jan 2006
Posts: 373
Location: Ljubljana, Slovenia
DustWolf 19 Jul 2006, 19:42
vid wrote:
Quote:
...and this would then be a part of how the Kernel decides which instruction this is:

wrong, you can use jump table.


Which doesn't change anything for the Pipeline, or does it?

I am also wondering if the modern kernels actually use a jump table or not.

vid wrote:
also why did you choose "project and ideas" section?!? It's meant is "projects and project ideas", not any ideas Smile

moderator: please, move it


It is about the Sythesis project. I am coinsidering the idea of creating a new simple x86 kernel and am wondering wether the things pointed out in the Sythesis whitepaper are still valid on modern platforms.

So it is a project idea and I'm not sure I will ever let it evolve into anything more than an idea (since I don't really have time to write a kernel). The facts are still very interesting to me tho.
Post 19 Jul 2006, 19:42
View user's profile Send private message Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 19 Jul 2006, 21:58
hmmm, for me 2-3 instructions and branch is not that much overhead for each kernel call, do you think otherwise? just performing call itself can take more time than this, most times.

PS: i think best place would be OS-Dev section Wink
Post 19 Jul 2006, 21:58
View user's profile Send private message Visit poster's website AIM Address MSN Messenger ICQ Number Reply with quote
DustWolf



Joined: 26 Jan 2006
Posts: 373
Location: Ljubljana, Slovenia
DustWolf 20 Jul 2006, 00:06
vid wrote:
hmmm, for me 2-3 instructions and branch is not that much overhead for each kernel call, do you think otherwise? just performing call itself can take more time than this, most times.


Well the Intel Pipeline (according to my information) is 20 stages long. If you take a Kernel call that uses an unpredictable branch, you break that Pipeline and lenghten the execution time for another 20 stages.

If programs use such Kernel calls from within every loop, the design of the Kernel is slowing them down up to 20-fold.

Now of course, there are different kernel calls, some have huge procedures on them, others are much shorter, but the effect of the branch is present on all of them. A Kernel call that inputs a key from the keyboard or transfers the position of the mouse happens very commonly and it's branch will brake the Pipeline every time, you're looking at a relatively intense slowdown.
Post 20 Jul 2006, 00:06
View user's profile Send private message Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 20 Jul 2006, 09:09
do you think that input happens so often. i think max rate for keyboard is about 30 per second... not worth of mentioning.

yes, you are right there could be some imporvement, but i doubt it will be noticeable. and how would YOU suggest to solve it without unpredicatable branch? relocating all system calls in code?
Post 20 Jul 2006, 09:09
View user's profile Send private message Visit poster's website AIM Address MSN Messenger ICQ Number Reply with quote
DustWolf



Joined: 26 Jan 2006
Posts: 373
Location: Ljubljana, Slovenia
DustWolf 20 Jul 2006, 15:49
vid wrote:
yes, you are right there could be some imporvement, but i doubt it will be noticeable. and how would YOU suggest to solve it without unpredicatable branch? relocating all system calls in code?


Synthesis (I think) used templates for calls instead of branched selection.

Instead of calling a Kernel call, the program would have the entire code pre-assembled durring load (and possibly by the OS durring runtime as well), integrated with system calls and runtime optimized (hence the name: "Synthesis", it sythesizes it's own code).

Supposably, it performed very well, but it was written for a dual core 50 MHz non-x86. It was very good for all kinds of signal processing, due to it's high efficiency and thus real-timeness. Making something similar for a modern platform should be fascinating. Smile
Post 20 Jul 2006, 15:49
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 20 Jul 2006, 16:18
Not sure if it's the same but remember the VxD times on Win9x where you call some int like http://win32assembly.online.fr/vxd-tut1.html

The first time it's a call to int 20h and next all calls are directly to the function.
Post 20 Jul 2006, 16:18
View user's profile Send private message Reply with quote
bubach



Joined: 17 Sep 2004
Posts: 341
Location: Trollhättan, Sweden
bubach 20 Jul 2006, 20:42
I use a table with function addresses and then do soemthing like this:

call [(function_number*4)+function_list]
Post 20 Jul 2006, 20:42
View user's profile Send private message Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 21 Jul 2006, 05:07
bubach: that's unpredictable branch too (i think)
Post 21 Jul 2006, 05:07
View user's profile Send private message Visit poster's website AIM Address MSN Messenger ICQ Number Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 21 Jul 2006, 08:50
Indirect branching (ie., jump table) is unpredictable indeed, but it's a single unpredictable branch vs. several hundred branch mispredicts in the CMP,JE case...

Quote:

If programs use such Kernel calls from within every loop, the design of the Kernel is slowing them down up to 20-fold.

You don't call kernel code in time-critical loops, just like you don't base graphics algorithms around PutPixel() calls. For user input and file I/O, you're limited by slow external factors, so the overhead of calling the kernel is more or less insignificant.
Post 21 Jul 2006, 08:50
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 21 Jul 2006, 13:18
Actually on PPro and newer processors indirect branch is predicted (when the branch address is the same address of the previous execution of course).
Post 21 Jul 2006, 13:18
View user's profile Send private message Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2465
Location: Bucharest, Romania
Borsuc 22 Jul 2006, 21:17
DustWolf wrote:
Instead of calling a Kernel call, the program would have the entire code pre-assembled durring load (and possibly by the OS durring runtime as well), integrated with system calls and runtime optimized (hence the name: "Synthesis", it sythesizes it's own code).

Nice idea Wink, personally if you write your own OS, it shouldn't be that hard. The "magic" concept here is a custom .exe format. Something like this (simplified view):

Code:
+------------------------+
|        Header          |  < fixed number of bytes
+------------------------+
|  System call entries   |  < table with pointers to the system-call instructions in the code (something like RVA)
+------------------------+
|The code (not CPU valid)|  < the code, without the system-call instructions which will be managed later by the .exe loader
+------------------------+    
And then, when it loads the .exe, it automatically inserts the system-call instructions at the location specified by the entries in the System call entries, with the "target address" being the OS's specific function. This function number accompanies the RVA addresses. Of course, the function numbers never get placed in the code -- they are just identifier for the OS to replace them with the corresponding function offset. This way you will have direct jumps to the functions in the OS, and even if the OS's version changes, so will this "loader" change it's lookup table for the function numbers (i.e they will receive a different offset in another version), but the .exe will be compatible.

Hope it's clear. The idea is that you should always use your own formats because they suit your OS's requirements the best (in fact, they can even use specific low-level stuff). Personally, in my projects I use internally only my formats -- I implement and support the other formats for compatibility with other applications (for example, import a .tga file and convert it in .custom file), but all the internal stuff, that's custom man, I'm not lazy to let others do my work. Wink

Sorry if I mislead you with my bad explanations, but I hope you get the point.

ps: I hate text formats that are compiled run-time. It's simply such a waste every time... but that's off-topic, though it explains that in text files you can insert stuff in the middle, manipulate, etc.. and since text files are binaries too (just think of ASCII as hex), then it's obvious binary can be manipulated the same way, but some people are simply to blind to see this until you show them a proof.
Post 22 Jul 2006, 21:17
View user's profile Send private message Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2465
Location: Bucharest, Romania
Borsuc 22 Jul 2006, 21:26
For a simplified view of my previous explanation, imagine that your exe loader "finds" all the system-calls, and replaces the function number with the offset, thus yielding a direct call.
Post 22 Jul 2006, 21:26
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 22 Jul 2006, 21:36
So, similar to VxDs where Windows replaces int 20h accurrencies with calls?
Post 22 Jul 2006, 21:36
View user's profile Send private message Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2465
Location: Bucharest, Romania
Borsuc 22 Jul 2006, 21:46
I had no experience with those VxDs, but I think it's the same idea, though mine is a bit faster on "startup" because it doesn't need to "search" for the system-calls -- since these are already in the System call table. And also note that my idea was to simply "remove" the ints and keep an offset to them in the file (not in memory). Then, the loader would simply "add" the respective call there before inserting the following bytes in memory. And will do so until there are no more entries in the System call table. From there on, it should simply add the rest of the code to memory, and start executing Smile

ps: when you compile the program with an assembler/compiler, you should make it write the same number of bytes in the system-calls as will be when the loader "manages" them (i.e the direct calls). This is important because of the offsets -- they need to be calculated correctly. Of course, after calculating the offsets you only need to store the offset on disk for each system call in a table and then strip them off the code (you also need to store the function number near the offset in the same table so the loader knows what system-call target address to insert).

The thing above can be done with FASM's macros, or even better, with a specialized compiler, since you might need an OS-specific compiler anyway to adapt itself to that format Smile
Post 22 Jul 2006, 21:46
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2023, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.