flat assembler
Message board for the users of flat assembler.

Some questions: [edi+ecx*3] | PUSH | RAM limit | IDT PIT

Teehee 28 Feb 2011, 11:20
Hi. Me again, sorry.

1. Why can't I do mov dword [edi+ecx*3]?
2. How many push/pop instructions can I use before I should consider switching to pusha/popa?
3. My PC has 512 MB of RAM. What happens if I try to read/write memory above that? (I ask because my GDT is defined up to 4 GB.)
4. In PMode, do I really need to define an IDT and a PIT (system clock)? (check link please)

Thank you.

EDIT by DOS386 : enhanced subject

_________________
Sorry if bad english.

JohnFound 28 Feb 2011, 12:11
1. Because there is no such addressing mode. The Intel architecture only allows a scale factor of 2, 4 or 8 (besides 1), and only for one of the registers. This addressing is intended for accessing arrays of word, dword and qword elements.
You can use a trick to multiply by 3: [ecx+2*ecx] == [3*ecx]. FASM makes such conversions automatically. In this case you can't use a second register in the address, because you have actually already used it.
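
A minimal sketch of the trick, assuming 32-bit code and that edx is free to clobber (that register choice is only for illustration). Since [ecx*3] already occupies both the base and the index slot, a further register such as edi has to be added in a separate step, for example with lea:
Code:
use32
        mov     eax, [ecx*3]            ; fasm encodes this as [ecx+ecx*2]
        mov     eax, [ecx+ecx*2]        ; the same address written out by hand
        lea     edx, [ecx+ecx*2]        ; edx = ecx*3, no memory access
        mov     eax, dword [edi+edx]    ; eax = dword at edi+ecx*3
    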

edfed 28 Feb 2011, 13:59
Access to non-present RAM theoretically returns nothing.
To test:
create a 4 MB page that initially points to the first 4 MB,
@@:
write to offset 1000h and read back; if the values are equal, the page contains real memory
set a flag in some data structure to indicate that memory is present
advance the page to point to the next 4 MB, until it reaches 4 GB
loop @b

and you will have a map of the RAM present on your system. But be careful: some memory from video cards, USB controllers, etc. can be mapped at upper memory addresses (after your 512 MB of RAM), but it is not RAM usable by your programs, just hardware buffers for the peripherals.

revolution 28 Feb 2011, 14:50
Teehee wrote:
2. How many push/pop instructions can I use before I should consider switching to pusha/popa?
Completely up to you.
Teehee wrote:
4. In PMode, do I really need to define an IDT and a PIT (system clock)? (check link please)
You are not forced to set up any hardware in pmode. But if you leave the interrupt system uninitialised then you will have a hard time doing some things.
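
As a rough sketch of the bare minimum (not a full interrupt setup): an IDT whose 32 exception vectors all point to a handler that just halts, so that a fault stops the machine instead of ending in a triple fault and a reset. The load address 100000h, the selector CODE_SEL = 08h and all label names are assumptions for illustration only; hardware IRQs and the PIT are not touched and are assumed to stay disabled with cli.
Code:
use32
org 100000h                             ; assumed load address

CODE_SEL = 08h                          ; assumed ring 0 code selector in the GDT

setup_idt:
        cli                             ; keep hardware IRQs off, only exceptions remain
        lidt    [idt_ptr]
        ret

halt_isr:                               ; catch-all exception handler
        cli
        hlt
        jmp     halt_isr

align 8
idt:
        repeat 32                       ; vectors 0..31 are the CPU exceptions
                dw halt_isr and 0FFFFh  ; handler offset, bits 0..15
                dw CODE_SEL             ; code segment selector
                db 0                    ; reserved
                db 8Eh                  ; present, DPL 0, 32-bit interrupt gate
                dw halt_isr shr 16      ; handler offset, bits 16..31
        end repeat
idt_end:

idt_ptr:
        dw idt_end - idt - 1            ; limit
        dd idt                          ; linear base (flat segments assumed)
    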

Teehee 28 Feb 2011, 15:38
Thank you guys.

revolution wrote:
Completely up to you.


but, rev:
Code:
push eax
push ebx
push ecx ; 5 pushs
push edx
push esi
pop esi
pop edx
pop ecx ; 5 pops
pop ebx
pop eax    
isn't that slower than just
Code:
pusha
popa    
?


revolution 28 Feb 2011, 16:21
Teehee wrote:
... isn't slower than just ...?
Don't know.

Short answer: It depends

Long answer: We don't know. What CPU? What RAM timings? What mobo? What video card? What OS? What is in cache? How many times do you call it? etc. etc. etc.

Helpful answer: If you can't notice any change in your program's runtime then it doesn't matter which one you use.

Déjà vu :P

Teehee 28 Feb 2011, 16:35
That makes no sense to me. It seems to me that the CPU needs many clocks to process each push/pop (and the code size in bytes increases), while a single pusha/popa should not need as many.

b1528932 28 Feb 2011, 18:52
1. Because the SIB scale field is only 2 bits. You can multiply by 1, 2, 4 or 8.


Instead of pusha/popa use sub/mov, add/mov.
pusha/popa are not universal: they exist since the 80186 (pushad/popad since the 80386) and are invalid in 64-bit mode.

edfed 28 Feb 2011, 18:58
b1528932 wrote:

pusha/popa are not universal: they exist since the 80186 (pushad/popad since the 80386) and are invalid in 64-bit mode.

Not cool. It means I should mention this in the book: one more hour of work just to document the 64-bit particularities, which I cannot test because I don't have a 64-bit-capable CPU...

Teehee 28 Feb 2011, 19:07
b1528932 wrote:
pusha/popa are not universal: they exist since the 80186 (pushad/popad since the 80386) and are invalid in 64-bit mode.
:o

Is pushf OK?


Madis731 28 Feb 2011, 20:09
For simplicity I'll take a 45nm Intel CPU (Wolfdale) for this example:
Code:
;Source: Agner Fog's instruction_tables.pdf
;          fused uops  p015  ...  p3   p4   lat  1/throughput
PUSH reg        1        -         1    1    3        1
PUSHF(D/Q)     17       15         1    1    -        7
PUSHA(D)       18        9         1    8    -        8
    

If you add up pushing of eax, edx, ecx, ebx, esi, edi, ebp then you will
get 7 uops in a fused domain, but adding the additional 3 clocks of
latency will take about 10 clocks of your precious CPU time.
When you use pusha, you will use 17 clocks.

On a Pentium MMX it used to take 5-9 clocks, on a Pentium III: 8 (or 10?) clocks. It has been around 17 clocks starting from Pentium M.

Sometimes breaking up instructions into smaller pieces will be faster.

Teehee 28 Feb 2011, 20:11
Hmmm... very nice. :)
Thanks.

revolution 01 Mar 2011, 03:05
Madis731 wrote:
For simplicity I'll take a 45nm Intel CPU (Wolfdale) for this example:
Code:
;Source: Agner Fog's instruction_tables.pdf
;          fused uops  p015  ...  p3   p4   lat  1/throughput
PUSH reg        1        -         1    1    3        1
PUSHF(D/Q)     17       15         1    1    -        7
PUSHA(D)       18        9         1    8    -        8
    

If you add up pushing of eax, edx, ecx, ebx, esi, edi, ebp then you will
get 7 uops in a fused domain, but adding the additional 3 clocks of
latency will take about 10 clocks of your precious CPU time.
When you use pusha, you will use 17 clocks.

On a Pentium MMX it used to take 5-9 clocks, on a Pentium III: 8 (or 10?) clocks. It has been around 17 clocks starting from Pentium M.

Sometimes breaking up instructions into smaller pieces will be faster.
I think it is important to also state that the above is only a very small part of what you need to take into account to get the actual timing of a code sequence. There is so much more that has not been considered. Is the code in the instruction cache yet? Is the data in the data cache yet? Is the stack aligned to any particular boundary? Would the extra pushes evict some other stale cache data to main memory? Are the execution queues already full with other previous or following uops? Are multiple cores accessing the same page of memory? Is the instruction going to force the instruction decoder to cut short the previous parallel decode? Will the extra bytes of code cause instruction cache thrashing? Will the extra dwords of store for pusha cause data cache thrashing? Are the write buffers already full with previous memory writes?

The timings that Agner Fog gives are for a very specific situation where everything has been set up to try to eliminate all these other factors and focus purely on the CPU execution capability. But the CPU execution capability is rarely a bottleneck for anything in a modern computer. Unless you are very very precise with your coding you will likely never be able to achieve the theoretical maximum throughput for code sequences in any normal program. Notice how Madis731 has to specifically state which CPU the timings are for. Other CPUs will have different timings. Any code you try to optimise for one CPU in one system would have to be re-optimised for another CPU, or even the same CPU, in another system.

Madis731 01 Mar 2011, 07:32
I think that what is important here is to tell the difference between multiple caches. There's instruction cache and data cache. If you look at the pusha dump and the push eax,edx,...,ebp dump, there's a huge difference in code size. Actually esp is moved the same amount, there are as many memory reads/writes in the background, but instruction caching will make a huge difference regarding code speed.

If you're really an optimizing guru, you should try and see if this pusha replacement makes any speed difference:
Code:
macro push_all
{
; I hope this is correct, I followed Intel's Instruction Set Reference 2B
        mov     [esp-32],edi ; if esp-32 happens to fall at the start of a cache line,
        mov     [esp-28],esi ; then this store will surely bring in the line the following stores use
        mov     [esp-24],ebp
        mov     [esp-20],esp
        mov     [esp-16],ebx
        mov     [esp-12],edx
        mov     [esp-8],ecx
        mov     [esp-4],eax
        sub     esp,32 ; esp is only modified once
}
    

This is only theoretical. Actually, writing on a cache line boundary will make things worse. On newer i7 (Sandy Bridge) CPUs this sequence might take as few as 3 clocks.

b1528932 01 Mar 2011, 07:50
This code is wrong. What if you get an interrupt in the middle of moving the data?
First you have to ensure you own the stack, and only then do things to it.
And it's better to use bp-relative addressing. With sp you have to use a SIB byte.

push bp
mov bp, sp
sub sp, 16
mov [bp-16], di
...
mov sp, bp
pop bp

revolution 01 Mar 2011, 07:56
b1528932 wrote:
What if you get an interrupt in the middle of moving the data?
It won't matter for ring3 code. Each ring has its own stack pointer so an interrupt into the ring0 kernel won't affect anything.

edfed 01 Mar 2011, 10:48
sub esp before, then, no problem with interrupts
Code:
macro push_all
{
; I hope this is correct, I followed Intel's Instruction Set Reference 2B
        sub     esp,32 ; esp is only modified once
        mov     [esp+0],edi ;if esp-32 happens to fall at the start of cache line
        mov     [esp+4],esi ;then this instruction will surely cache following instructions
        mov     [esp+8],ebp
        mov     [esp+12],esp
        mov     [esp+16],ebx
        mov     [esp+20],edx
        mov     [esp+24],ecx
        mov     [esp+28],eax
}     

revolution 01 Mar 2011, 10:53
edfed wrote:
sub esp before, then, no problem with interrupts
No, then the stored value of ESP is wrong.

edfed 01 Mar 2011, 11:45
Oh yeah, that's true...
Then one should add:
Code:
add dword[esp+12],32
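
Putting the two posts together, a sketch of the variant that reserves the stack space first and then applies the fix-up to the stored ESP. pop_all is a hypothetical counterpart that is not from the thread, and the two invocations at the end are only there to show usage. Note that, unlike pusha/popa, the sub and add used here modify the flags.
Code:
use32

macro push_all
{
        sub     esp,32                  ; reserve the frame first
        mov     [esp+0],edi
        mov     [esp+4],esi
        mov     [esp+8],ebp
        mov     [esp+12],esp            ; holds original esp minus 32 for now
        mov     [esp+16],ebx
        mov     [esp+20],edx
        mov     [esp+24],ecx
        mov     [esp+28],eax
        add     dword [esp+12],32       ; fix up: pusha stores esp as it was before the pushes
}

macro pop_all                           ; hypothetical counterpart, mirrors popa
{
        mov     edi,[esp+0]
        mov     esi,[esp+4]
        mov     ebp,[esp+8]
        ; the saved esp at [esp+12] is skipped, as popa does
        mov     ebx,[esp+16]
        mov     edx,[esp+20]
        mov     ecx,[esp+24]
        mov     eax,[esp+28]
        add     esp,32                  ; release the frame
}

        push_all
        pop_all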
    

Madis731 01 Mar 2011, 14:57
Interrupts always restore the context and nothing will happen to your stack.
Why aren't you worried about this sequence:
Code:
push eax
; what if interrupt happens here and totally trashes ESP?
pop eax
    


You MUST be able to rely on your context, otherwise everything memory-related (even register-related) would be impossible on an x86 architecture.
http://en.wikipedia.org/wiki/Context_switch

This code is valid; only bad programming can interfere :) I was more worried about fully emulating the pusha macroinstruction than about interrupts.