flat assembler
Message board for the users of flat assembler.
New Intel instructions
revolution 23 Sep 2012, 02:27
Tomasz Grysztar 23 Sep 2012, 16:44
I think he listed all the new ones. PREFETCHW is an opcode adopted from AMD.
LocoDelAssembly 23 Sep 2012, 17:09
Perhaps this should be posted in the useless instructions thread, but any ideas what ADCX and ADOX are for? If I understand them correctly, the only difference from ADC is that they use and affect only the CF and OF flags respectively, leaving the rest untouched (unlike ADC).
Tomasz Grysztar 23 Sep 2012, 17:25
It is explained here: New Instructions Supporting Large Integer Arithmetic on Intel® Architecture Processors.
Intel wrote: The primary advantage of these instructions over adc is that they support two independent carry chains.
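To make the "two independent carry chains" point concrete, here is a minimal Python model, not real machine code. It assumes 64-bit limbs and models only the carry behaviour: `adcx` reads and writes CF, `adox` reads and writes OF. The names `Flags`, `to_limbs`, `from_limbs` and `interleaved_add` are made up for this sketch.

```python
MASK = (1 << 64) - 1  # assumed 64-bit limbs

class Flags:
    """Only the two flags that matter for this model."""
    def __init__(self):
        self.cf = 0
        self.of = 0

def adcx(flags, dst, src):
    """Model of ADCX: dst + src + CF, carry-out goes to CF only."""
    total = dst + src + flags.cf
    flags.cf = total >> 64
    return total & MASK

def adox(flags, dst, src):
    """Model of ADOX: dst + src + OF, carry-out goes to OF only."""
    total = dst + src + flags.of
    flags.of = total >> 64
    return total & MASK

def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian limbs."""
    return [(n >> (64 * i)) & MASK for i in range(count)]

def from_limbs(limbs):
    """Rebuild an integer from little-endian limbs."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def interleaved_add(a, b, c, d, count):
    """Compute a+b and c+d limb by limb, with the two chains interleaved."""
    flags = Flags()
    ab, cd = [], []
    for x, y, z, w in zip(to_limbs(a, count), to_limbs(b, count),
                          to_limbs(c, count), to_limbs(d, count)):
        ab.append(adcx(flags, x, y))  # chain 1 carries through CF
        cd.append(adox(flags, z, w))  # chain 2 carries through OF
    ab.append(flags.cf)  # final carry of chain 1
    cd.append(flags.of)  # final carry of chain 2
    return from_limbs(ab), from_limbs(cd)
```

With plain adc both chains would share CF, so one chain would have to finish (or its carry be saved) before the other could continue. In this model the two additions can be interleaved freely without disturbing each other.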
tthsqe 24 Sep 2012, 05:29
I wish we had some kind of contest here to find a fast way of doing multiplication on integers of a fixed but large size. The possibilities with mulx, adcx, and adox really do explode...
Also, which processors support adcx and adox?
revolution 24 Sep 2012, 05:40
tthsqe wrote: I wish we had some kind of contest here to find a fast way of doing multiplication on integers of a fixed size. The possibilities with mulx, adcx, and adox really do explode...
Small numbers: Use standard schoolgirl multiplication
Medium sized numbers: Use Karatsuba/Toom-Cook
Large numbers: Use FFT
Also, different architectures/systems will give different results so any speed measurements would be mostly meaningless.
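For readers who have not met the medium-size option, a rough Karatsuba sketch in Python follows. It replaces the four half-size products of the schoolbook method with three. The function name and the `threshold_bits` cutoff are arbitrary choices for illustration, and the base case simply uses Python's built-in multiply as a stand-in for a schoolbook routine.

```python
def karatsuba(x, y, threshold_bits=64):
    """Multiply two non-negative integers with Karatsuba's three-product split."""
    # Small operands: fall back to the plain multiply (stand-in for schoolbook).
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    lo = karatsuba(x_lo, y_lo, threshold_bits)   # low halves
    hi = karatsuba(x_hi, y_hi, threshold_bits)   # high halves
    # (x_lo + x_hi)(y_lo + y_hi) - lo - hi gives both cross terms at once.
    mid = karatsuba(x_lo + x_hi, y_lo + y_hi, threshold_bits) - lo - hi
    return (hi << (2 * half)) + (mid << half) + lo
```

Where the crossover between the three methods lies depends on the machine, which is the point made above about measurements.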
tthsqe 24 Sep 2012, 06:22
all right - to revolution everything is relative - just fix some parameters, dude
in general people have different needs, but I have only a few:
1. do a 1024 bit x 1024 bit unsigned SQUARE and produce the upper 1024 bits with possibly an error in the least significant bit.
2. fix any capable processor and/or system
note: it would be hard to "beat" a more or less "optimized" version of the "standard schoolgirl multiplication" at this input range
The problem is with how the partial products should be added MOST efficiently.
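As a reference point for the squaring case, here is a small Python sketch assuming 16 limbs of 64 bits each. It uses the usual symmetry trick: each cross product a[i]*a[j] with i < j is computed once and doubled, so 136 limb multiplications are needed instead of 256. The name `square_high` is hypothetical. The sketch returns the exact upper half; it does not try the truncated variant that would skip low partial products and accept an error in the last bit, and it says nothing about how the additions should be scheduled.

```python
LIMB_BITS = 64
MASK = (1 << LIMB_BITS) - 1

def square_high(n, limbs=16):
    """Return (upper half of n*n, number of limb multiplications used)."""
    a = [(n >> (LIMB_BITS * i)) & MASK for i in range(limbs)]
    diag = 0    # sum of a[i]^2 terms
    cross = 0   # sum of a[i]*a[j] terms for i < j, each counted once
    mul_count = 0
    for i in range(limbs):
        diag += (a[i] * a[i]) << (2 * LIMB_BITS * i)
        mul_count += 1
        for j in range(i + 1, limbs):
            cross += (a[i] * a[j]) << (LIMB_BITS * (i + j))
            mul_count += 1
    full = diag + 2 * cross  # doubling restores the mirrored cross terms
    return full >> (LIMB_BITS * limbs), mul_count
```

In real code the doubling and the accumulation of the partial products are where the adc/adcx/adox choices come in; Python's big integers hide all of that here.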
revolution 24 Sep 2012, 06:26
Which CPU brand(s)?
64-bit or 32-bit (or 16-bit)?
SSE allowed or not?
AVX allowed or not?
Which OS?
Input data is already in cache or not?
Code is already in cache or not?
tthsqe 24 Sep 2012, 06:38
any fixed intel processor with the aforementioned instructions
64-bit
SSE allowed
AVX allowed
windows 7
Code and data already in cache
square is iterated many times over the same memory locations (think mandelbrot)
The square could also be performed with others in parallel. Throughput is the issue.
revolution 24 Sep 2012, 06:48
There are literally 100s of Intel CPUs that will match your specification. Different number of cores and different sizes/levels of cache. Hyperthreading?
You will also need to specify which mobo since all mobos/BIOSs have different effects upon code. On-board video can steal cycles. Other external events can steal cycles etc. There are so many variables here. If you want the speed measurements to be meaningful then you have to be very restrictive about exactly how you measure and on what system you measure.
tthsqe 24 Sep 2012, 07:08
Since practical speed measurements have such drawbacks, let's make it purely a combinatorial problem.
Only allow instructions operating on the general purpose registers.
The main ones will be mul/mulx, add/adc/adcx/adox, and load/store.
Assume that there is enough ILP that the timing of the code should be measured by
(# mul/mulx instructions)*(throughput of mul) +
(# add instructions)*(throughput of add) +
(# adc/adcx instructions)*(throughput of adc/adcx)
This assumes that loads and stores are free. If such an assumption is not practical, add in some nominal latency for those.
Also, the number to be squared is in the cache, and its square should be written back to the next location in the cache.
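The proposed measure is simple enough to write down as a few lines of Python. This is only a sketch of the formula above: "throughput" is read as reciprocal throughput (cycles per instruction), and the function name and dictionary keys are made up for illustration.

```python
def estimated_cycles(counts, cycles_per_instruction):
    """Sum of (instruction count) * (reciprocal throughput) over instruction classes."""
    return sum(counts[name] * cycles_per_instruction[name] for name in counts)

# Hypothetical placeholder numbers, not measurements of any real CPU or routine.
example_counts = {"mul": 136, "add": 16, "adc": 256}
example_cycles = {"mul": 1.0, "add": 0.25, "adc": 1.0}
example_cost = estimated_cycles(example_counts, example_cycles)
```

Two candidate routines can then be compared by plugging in their instruction counts with the same per-instruction figures. Loads and stores are left out, matching the assumption that they are free.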
hopcode 24 Sep 2012, 11:33
why so hard ?
Quote:
now begins the design of your algo for that CPU; not before. and consider this too, generally: according to what i have heard about bulldozer, if yours is a good algo, and it is true what the evangelists say about bulldozer, AMD + a good MOBO should win over intel for a few lengths, or they time the same. because this is a benchmark situation, if Intel wins over bulldozer on 1x and on threading, something doesn't follow the 3C rule for AMD, or the AMD evangelists lie. (from what i have heard, the last is statistically true)
Cheers,
_________________
⠓⠕⠏⠉⠕⠙⠑
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.