flat assembler
Message board for the users of flat assembler.
Some questions: [edi+ecx*3] | PUSH | RAM limit | IDT PIT
Teehee 28 Feb 2011, 11:20
Hi, me again, sorry.
1. Why can't I do mov dword [edi+ecx*3]?
2. How many push/pop instructions can I use before I should consider switching to pusha/popa?
3. My PC has 512 MB of RAM. What happens if I try to read or write memory above that? (My GDT is defined up to 4 GB.)
4. In PMode, do I really need to define an IDT and a PIT (system clock)? (Check the link, please.)
Thank you.
EDIT by DOS386: enhanced subject
_________________
Sorry if bad english.
edfed 28 Feb 2011, 13:59
Access to RAM that is not present theoretically returns nothing.
To test:
create a 4 MB page that points first to the first 4 MB
@@:
write to offset 1000h, read back
if equal, the page contains real memory; set a flag in some data structure to indicate that memory is present
increment the page to point to the next 4 MB, until it reaches 4 GB
loop @b
and you will have a map of the RAM present on your system.
But be careful: some RAM from video cards, USB controllers, etc. can be mapped at upper memory addresses (after your 512 MB of RAM). It is not RAM usable by your programs, just hardware buffers for the peripherals.
revolution 28 Feb 2011, 14:50
Teehee wrote: 2. how many push/pop can i use before i consider to change to pusha/popa?
Teehee wrote: 4. in PMode, do i really need to define a IDT and a PIT - System Clock? (check link please)
Teehee 28 Feb 2011, 15:38
Thank you guys.
revolution wrote: Completely up to you.
but, rev:
Code:
push eax
push ebx
push ecx ; 5 pushes
push edx
push esi
pop esi
pop edx
pop ecx ; 5 pops
pop ebx
pop eax
isn't slower than just
Code:
pusha
popa
?
_________________
Sorry if bad english.
revolution 28 Feb 2011, 16:21
Teehee wrote: ... isn't slower than just ...?
Short answer: It depends.
Long answer: We don't know. What CPU? What RAM timings? What mobo? What video card? What OS? What is in cache? How many times do you call it? etc. etc. etc.
Helpful answer: If you can't notice any change in your program's runtime then it doesn't matter which one you use.
Déjà vu
Teehee 28 Feb 2011, 16:35
That makes no sense to me. It seems to me that the CPU needs many clocks to process each push/pop (and the byte count increases), while a single pusha/popa does not need as many.
b1528932 28 Feb 2011, 18:52
1. Because the SIB scale field is only 2 bits, you can multiply by 1, 2, 4 or 8.
Instead of pusha/popa use sub/mov, add/mov. pusha/popa exist only on few cpus, since 80386. removed in long mode.
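One common way around the scale limit, offered here as a sketch rather than something posted in the thread, is to build ecx*3 in a scratch register with lea first. The form ecx+ecx*2 is encodable because the scale is 2. The choice of eax as scratch register and the stored value 0 are purely illustrative assumptions.

```asm
; Illustrative only: eax as scratch register and the value 0 are assumptions.
use32
        lea     eax, [ecx+ecx*2]        ; eax = ecx*3 (base + index*2 is a valid SIB form)
        mov     dword [edi+eax], 0      ; store a dword at edi + ecx*3
```

This costs one extra instruction and one register, but lea does not touch the flags, so it can be dropped in between a compare and a conditional jump.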
edfed 28 Feb 2011, 18:58
b1528932 wrote:
Not cool. It means I should mention this in the book: one more hour of work just to document the 64-bit particularities, which I cannot test because I don't have any 64-bit compatible CPU...
Teehee 28 Feb 2011, 19:07
b1528932 wrote: pusha/popa exist only on few cpus, since 80386. removed in long mode.
Is pushf ok?
_________________
Sorry if bad english.
Madis731 28 Feb 2011, 20:09
For simplicity I'll take a 45nm Intel CPU (Wolfdale) for this example:
Code:
;Source: Agner Fog's instruction_tables.pdf
            fused uops  p015  ...  p3  p4  lat  1/throughput
PUSH reg    1 1 1 3 1
PUSHF(D/Q)  17 15 1 1 7
PUSHA(D)    18 9 1 8 8
If you add up the pushes of eax, edx, ecx, ebx, esi, edi and ebp, you get 7 uops in the fused domain, but with the additional 3 clocks of latency that takes about 10 clocks of your precious CPU time. When you use pusha, you will use 17 clocks.
On a Pentium MMX it used to take 5-9 clocks, on a Pentium III 8 (or 10?) clocks. It has been around 17 clocks starting from the Pentium M.
Sometimes breaking up instructions into smaller pieces will be faster.
Teehee 28 Feb 2011, 20:11
Hmmm.. very nice.
Thanks.
revolution 01 Mar 2011, 03:05
Madis731 wrote: For simplicity I'll take a 45nm Intel CPU (Wolfdale) for this example:
The timings that Agner Fog gives are for a very specific situation where everything has been set up to try to eliminate all these other factors and focus purely on the CPU's execution capability. But the CPU's execution capability is rarely a bottleneck for anything in a modern computer. Unless you are very, very precise with your coding you will likely never be able to achieve the theoretical maximum throughput for code sequences in any normal program.
Notice how Madis731 has to specifically state which CPU the timings are for. Other CPUs will have different timings. Any code you try to optimise for one CPU in one system would have to be re-optimised for another CPU, or even for the same CPU in another system.
Madis731 01 Mar 2011, 07:32
I think that what is important here is to tell the difference between multiple caches. There's instruction cache and data cache. If you look at the pusha dump and the push eax,edx,...,ebp dump, there's a huge difference in code size. Actually esp is moved the same amount, there are as many memory reads/writes in the background, but instruction caching will make a huge difference regarding code speed.
If you're really an optimizing guru, you should try and see whether this pusha replacement makes any speed difference:
Code:
macro push_all { ; I hope this is correct, I followed Intel's Instruction Set Reference 2B
mov [esp-32],edi ;if esp-32 happens to fall at the start of cache line
mov [esp-28],esi ;then this instruction will surely cache following instructions
mov [esp-24],ebp
mov [esp-20],esp
mov [esp-16],ebx
mov [esp-12],edx
mov [esp-8],ecx
mov [esp-4],eax
sub esp,32 ; esp is only modified once
}
This is only theoretical. Actually, writing on the cache line boundary will make things worse. On newer i7 (Sandy Bridge) CPUs this sequence might take as few as 3 clocks.
b1528932 01 Mar 2011, 07:50
This code is wrong. What if you get an interrupt in the middle of moving data?
First you have to ensure you own the stack, and only then do things to it. And it's better to use bp-relative addressing. With sp you have to use a SIB byte.
push bp
mov bp, sp
sub sp, 16
mov [bp-16], di
...
mov sp, bp
pop bp
revolution 01 Mar 2011, 07:56
b1528932 wrote: what if you get an interrupt in the middle of moving data?
edfed 01 Mar 2011, 10:48
sub esp before, then, no problem with interrupts
Code:
macro push_all { ; I hope this is correct, I followed Intel's Instruction Set Reference 2B
sub esp,32 ; esp is only modified once
mov [esp+0],edi ;if esp-32 happens to fall at the start of cache line
mov [esp+4],esi ;then this instruction will surely cache following instructions
mov [esp+8],ebp
mov [esp+12],esp
mov [esp+16],ebx
mov [esp+20],edx
mov [esp+24],ecx
mov [esp+28],eax
}
revolution 01 Mar 2011, 10:53
edfed wrote: sub esp before, then, no problem with interrupts
edfed 01 Mar 2011, 11:45
Oh yeah, it's true...
Then it should do:
Code:
add dword[esp+12],32
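Putting the last few posts together, a push_all that reserves the stack space first and still stores the original ESP might look like the sketch below. It is a hypothetical variant, not the macro anyone in the thread posted: it uses lea instead of sub/add so the flags stay untouched (as they do with pusha), and it borrows eax as a scratch register once eax has been saved. Its timing is untested.

```asm
; Hypothetical variant of the push_all macro discussed above (32-bit code).
macro push_all
{
        lea     esp, [esp-32]           ; reserve 32 bytes first; lea leaves the flags alone
        mov     [esp+28], eax           ; same layout as pushad: eax at the highest address
        mov     [esp+24], ecx
        mov     [esp+20], edx
        mov     [esp+16], ebx
        lea     eax, [esp+32]           ; eax = value ESP had before the macro ran
        mov     [esp+12], eax           ; pushad stores the original ESP in this slot
        mov     [esp+8], ebp
        mov     [esp+4], esi
        mov     [esp+0], edi
        mov     eax, [esp+28]           ; restore eax, which was used as scratch
}
```

Because nothing is written below ESP, an interrupt arriving on the same stack in the middle of the sequence cannot overwrite the saved values. A matching popad restores the registers and skips the saved ESP slot.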
Madis731 01 Mar 2011, 14:57
Interrupts always restore the context and nothing will happen to your stack.
Why aren't you worried about this sequence:
Code:
push eax ; what if interrupt happens here and totally trashes ESP?
pop eax
You MUST be able to rely on your context, otherwise everything memory-related (even register-related) would be impossible on an x86 architecture.
http://en.wikipedia.org/wiki/Context_switch
This code is valid; only bad programming can interfere. I was more worried about fully emulating the pusha macroinstruction than about interrupts.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.