flat assembler
Message board for the users of flat assembler.

New Intel instructions

CandyMan



Code:
0F C7 /7                RDSEED r16
0F C7 /7                RDSEED r32
REX.W+0F C7 /7          RDSEED r64
0F 01 CA                CLAC
0F 01 CB                STAC
66 0F 38 F6 /r          ADCX r32,r/m32
REX.W+66 0F 38 F6 /r    ADCX r64,r/m64
F3 0F 38 F6 /r          ADOX r32,r/m32
REX.W+F3 0F 38 F6 /r    ADOX r64,r/m64
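
For readers not used to the notation: /7 means the reg field of the ModRM byte holds the constant 7, and /r means the ModRM byte carries both a register operand and an r/m operand. Below is a minimal Python sketch of how the bytes come out, assuming 32-bit or 64-bit mode, register-direct operands, and only the first eight registers (so no REX.R/REX.B). The helper names are made up for illustration. Note that in the actual byte stream REX.W has to sit directly before the opcode bytes, after the 66/F3 prefix, so "REX.W+" in the listing is notation, not byte order.

```python
# Hypothetical helpers: register-direct forms only, registers 0..7.
REG = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}

def modrm(reg_field, rm_field):
    # mod = 11b selects a register-direct r/m operand
    return 0xC0 | (reg_field << 3) | rm_field

def encode_rdseed(dst, bits=32):
    # 0x66 = operand-size prefix (16-bit form), 0x48 = REX.W (64-bit form)
    prefix = {16: [0x66], 32: [], 64: [0x48]}[bits]
    return bytes(prefix + [0x0F, 0xC7, modrm(7, REG[dst])])  # "/7"

def encode_adcx_adox(op, dst, src, bits=32):
    mandatory = {"adcx": 0x66, "adox": 0xF3}[op]
    rex = [0x48] if bits == 64 else []
    # "/r": reg field = destination register, r/m field = source register
    return bytes([mandatory] + rex + [0x0F, 0x38, 0xF6, modrm(REG[dst], REG[src])])

print(encode_rdseed("ax").hex(" "))                         # rdseed eax
print(encode_adcx_adox("adcx", "ax", "bx").hex(" "))        # adcx eax, ebx
print(encode_adcx_adox("adox", "ax", "bx", 64).hex(" "))    # adox rax, rbx
```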
    

_________________
smaller is better
Post 22 Sep 2012, 16:52
revolution
Post 23 Sep 2012, 02:27
Tomasz Grysztar



I think he listed all the new ones. PREFETCHW is an opcode adopted from AMD.
Post 23 Sep 2012, 16:44
LocoDelAssembly
Perhaps this should be posted in the useless instructions thread, but does anyone have an idea what ADCX and ADOX are for? If I understand them correctly, the only difference from ADC is that they affect only the CF and OF flags respectively, leaving the rest untouched (unlike ADC).
Post 23 Sep 2012, 17:09
Tomasz Grysztar



It is explained here: New Instructions Supporting Large Integer Arithmetic on Intel® Architecture Processors.

Intel wrote:
The primary advantage of these instructions over adc is that they support two independent carry chains.
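
To make the "two independent carry chains" point concrete, here is a small Python model of the flag behaviour (a sketch only, not real assembly). It assumes 64-bit operands and models just CF and OF; the function names and the dictionary used for the flags are made up for illustration. ADCX carries through CF only and ADOX through OF only, so two multi-limb additions can be interleaved without saving and restoring the carry in between.

```python
MASK = (1 << 64) - 1

def adcx(dst, src, flags):
    # Unsigned add with carry-in and carry-out in CF; OF is left alone.
    total = dst + src + flags["CF"]
    flags["CF"] = total >> 64
    return total & MASK

def adox(dst, src, flags):
    # Same operation, but the carry travels through OF; CF is left alone.
    total = dst + src + flags["OF"]
    flags["OF"] = total >> 64
    return total & MASK

def to_limbs(x, n):
    return [(x >> (64 * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def add_two_chains(a, b, c, d):
    """Compute a+b and c+d limb by limb with the two chains interleaved."""
    flags = {"CF": 0, "OF": 0}
    s1, s2 = [], []
    for x, y, u, v in zip(a, b, c, d):
        s1.append(adcx(x, y, flags))  # chain 1 uses CF
        s2.append(adox(u, v, flags))  # chain 2 uses OF
    return s1 + [flags["CF"]], s2 + [flags["OF"]]
```

With plain ADC both sums would fight over CF, so one chain would have to finish (or the flag would have to be saved) before the other could continue.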
Post 23 Sep 2012, 17:25
tthsqe



I wish we had some kind of contest here to find a fast way of doing multiplication on integers of a fixed but large size. The possibilities with mulx, adcx, and adox really do explode...

Also, which processors support adcx and adox?
Post 24 Sep 2012, 05:29
revolution
tthsqe wrote:
I wish we had some kind of contest here to find a fast way of doing multiplication on integers of a fixed size. The possibilities with mulx, adcx, and adox really do explode...
You would have to define the size beforehand.

Small numbers: Use standard schoolgirl multiplication
Medium sized numbers: Use Karatsuba/Toom-Cook
Large numbers: Use FFT

Also, different architectures/systems will give different results so any speed measurements would be mostly meaningless.
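
As a rough illustration of the first two entries in that list, here is a Python sketch of limb-by-limb schoolbook multiplication and of Karatsuba. It assumes 64-bit limbs stored least significant first; all names are made up, and the size thresholds at which one method overtakes the other are deliberately not stated because they depend on the machine. FFT multiplication is not shown.

```python
LIMB_BITS = 64
MASK = (1 << LIMB_BITS) - 1

def to_limbs(x, n):
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def schoolbook_mul(a, b):
    """Multiply two limb lists; uses len(a) * len(b) limb multiplications."""
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = out[i + j] + ai * bj + carry  # always fits in 128 bits
            out[i + j] = t & MASK
            carry = t >> LIMB_BITS
        out[i + len(b)] = carry
    return out

def karatsuba(x, y, bits):
    """Multiply non-negative ints; 'bits' only steers where to split."""
    if bits <= LIMB_BITS:
        return x * y  # stands in for a single hardware multiply
    half = bits // 2
    low_mask = (1 << half) - 1
    x1, x0 = x >> half, x & low_mask
    y1, y0 = y >> half, y & low_mask
    z0 = karatsuba(x0, y0, half)
    z2 = karatsuba(x1, y1, half)
    # The sums may be one bit wider than 'half'; Python ints absorb that.
    z1 = karatsuba(x0 + x1, y0 + y1, half) - z0 - z2
    return (z2 << (2 * half)) + (z1 << half) + z0  # 3 sub-products, not 4
```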
Post 24 Sep 2012, 05:40
tthsqe



all right - to revolution everything is relative - just fix some parameters, dude :)
in general people have different needs, but I have only a few:
1. do a 1024 bit x 1024 bit unsigned SQUARE and produce the upper 1024 bits, possibly with an error in the least significant bit.
2. fix any capable processor and/or system
note: it would be hard to "beat" a more or less optimized version of the "standard schoolgirl multiplication" at this input range
The problem is how the partial products should be added MOST efficiently.
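
For scale, a Python sketch of what the squaring involves when split into 64-bit limbs (an assumption; 1024 bits is then 16 limbs). It uses the usual symmetry trick: each cross product a[i]*a[j] with i < j appears twice in the square, so it is computed once and doubled. The function names are made up. The sketch computes the full 2048-bit product and shifts; it does not try to drop low partial products to exploit the allowed last-bit error, and it says nothing about how the additions should be scheduled.

```python
LIMB_BITS = 64
N_LIMBS = 16  # 16 * 64 = 1024 bits
MASK = (1 << LIMB_BITS) - 1

def to_limbs(x, n=N_LIMBS):
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]

def square_with_symmetry(a):
    """Return (square as int, number of 64x64 multiplications used)."""
    n = len(a)
    muls = 0
    cross = 0
    for i in range(n):
        for j in range(i + 1, n):
            cross += (a[i] * a[j]) << (LIMB_BITS * (i + j))
            muls += 1
    diag = 0
    for i in range(n):
        diag += (a[i] * a[i]) << (LIMB_BITS * 2 * i)
        muls += 1
    return 2 * cross + diag, muls

def upper_1024(square):
    # The square has up to 2048 bits; keep the top 1024.
    return square >> 1024
```

For 16 limbs this needs 120 cross products plus 16 diagonal ones, 136 multiplications in total, against 256 for a general 16 x 16 limb multiplication. The remaining work is the additions, which is where the carry-chain question comes in.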
Post 24 Sep 2012, 06:22
revolution
Which CPU brand(s)?
64-bit or 32-bit (or 16-bit)?
SSE allowed or not?
AVX allowed or not?
Which OS?
Input data is already in cache or not?
Code is already in cache or not?
Post 24 Sep 2012, 06:26
tthsqe



any fixed Intel processor with the aforementioned instructions
64-bit
SSE allowed
AVX allowed
Windows 7
Code and data already in cache
square is iterated many times over same memory locations (think mandelbrot)

The square could also be performed with others in parallel. Throughput is the issue.
Post 24 Sep 2012, 06:38
revolution
There are literally hundreds of Intel CPUs that will match your specification, with different numbers of cores and different sizes/levels of cache. Hyperthreading?

You will also need to specify which mobo, since all mobos/BIOSes have different effects upon code. On-board video can steal cycles. Other external events can steal cycles etc.

There are so many variables here. If you want the speed measurements to be meaningful then you have to be very restrictive about exactly how you measure and on what system you measure.
Post 24 Sep 2012, 06:48
tthsqe



Since practical speed measurements have such drawbacks, let's make it purely a combinatorial problem.
Only allow instructions operating on the general purpose registers
the main ones will be mul/mulx and add/adc/adcx/adox and load/store
assume that there is enough ILP that the timing of the code should be measured by (# mul/mulx instructions)*(throughput of mul) + (# add instructions)*(throughput of add) + (# adc/adcx/adox instructions)*(throughput of adc/adcx/adox)
This assumes that loads and stores are free. If such an assumption is not practical, add in some nominal latency for those
also, the number to be squared is in the cache, and its square should be written back to the next location in the cache
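
The scoring rule above is simple enough to write down. A Python sketch follows, reading "throughput" as cycles per instruction (reciprocal throughput), which is an assumption. The function name, the dictionary keys and every number in the example are placeholders invented for illustration, not measured values for any CPU.

```python
def estimated_cost(counts, cycles_per_insn, load_store_cost=0.0):
    """Weighted instruction count under the 'enough ILP' assumption.

    counts          -- e.g. {"mul": ..., "add": ..., "adc": ..., "mem": ...}
                       ("mul" covers mul/mulx, "adc" covers adc/adcx/adox)
    cycles_per_insn -- assumed cost per instruction class
    load_store_cost -- 0.0 means loads and stores are treated as free
    """
    total = 0.0
    for kind in ("mul", "add", "adc"):
        total += counts.get(kind, 0) * cycles_per_insn[kind]
    total += counts.get("mem", 0) * load_store_cost
    return total

# Placeholder numbers only: two hypothetical candidate routines scored
# against the same made-up cost table.
costs = {"mul": 1.0, "add": 0.25, "adc": 1.0}
candidate_a = {"mul": 136, "add": 10, "adc": 200, "mem": 48}
candidate_b = {"mul": 256, "add": 0, "adc": 480, "mem": 48}
print(estimated_cost(candidate_a, costs), estimated_cost(candidate_b, costs))
```

Passing a nonzero load_store_cost covers the case where free loads and stores turn out not to be a practical assumption.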
Post 24 Sep 2012, 07:08
hopcode



why so hard ?
Quote:

- 1st-C CPU:
time a bunch of instructions, say 1000, without load-store reference.

- 2nd-C CORE:
time the same as above, 4x threaded

now the funny part

- 3rd-C CACHE
- time LOAD only 1x no stride, sequential
- time LOAD only 1x using strides
- time LOAD 4x (threading it) using strides

- time STORE only 1x no stride, sequential
- time STORE only 1x using strides
- time STORE 4x (threading) using strides
this, or something like it, is my basic unavoidable recipe for all processors i know (max quad-core, 45nm)
only now does the design of your algo for that CPU begin;
not before. and consider this too, in general,
going by what i have heard about bulldozer:
if yours is a good algo, and what the evangelists say about
bulldozer is true, AMD + a good mobo should beat intel by a few lengths, or
they should time the same, because this is a benchmark situation.

if Intel beats bulldozer both at 1x and when threaded, something
doesn't follow the 3C rule for AMD, or the AMD evangelists lie.
(from what i have heard, the latter is statistically true)

Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 24 Sep 2012, 11:33

Copyright © 1999-2020, Tomasz Grysztar.