flat assembler
Message board for the users of flat assembler.

Is FASM taking advantage of Intel's Knights Corner (or MIC)?

kalambong



Joined: 08 Nov 2008
Posts: 165
kalambong 16 Jul 2012, 11:16
I do not know if this message belongs in the "Main" section; if not, please move it to a more appropriate forum.

Intel's Knights Corner, a variation of its cancelled Larrabee project, is going to be available on the market as "MIC": http://en.wikipedia.org/wiki/Intel_MIC

Reportedly it will have about 50 cores in one package, with 8 GB of GDDR5 memory.

Is FASM going to take advantage of Intel's new offering?


Last edited by kalambong on 16 Jul 2012, 23:38; edited 1 time in total
Enko



Joined: 03 Apr 2007
Posts: 676
Location: Mar del Plata
Enko 16 Jul 2012, 13:53
What is the difference between a normal CPU and this one?
I'm talking in the sense of programming.

As I see it, it doesn't need any special treatment by the program executable. The OS is the one that should handle this.
LostCoder



Joined: 07 Mar 2012
Posts: 22
LostCoder 16 Jul 2012, 17:19
kalambong, well, the Wikipedia link says it is still a prototype, so I think it is too early to talk about it.
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 16 Jul 2012, 21:47
LostCoder wrote:
kalambong, well, the Wikipedia link says it is still a prototype, so I think it is too early to talk about it.


"Prototype products, codenamed Knights Ferry were announced and released to developers in 2010. A commercial release, codenamed Knights Corner to be built on a 22nm process is scheduled to go into production in late 2012."

Knights Corner is not a prototype. It will be a commercial product.

I think Knights Corner can be a very interesting machine.
kalambong



Joined: 08 Nov 2008
Posts: 165
kalambong 16 Jul 2012, 23:33
Enko wrote:
What is the difference between a normal CPU and this one?
I'm talking in the sense of programming.

As I see it, it doesn't need any special treatment by the program executable. The OS is the one that should handle this.


Apparently there may be some new instruction sets for the Knights Corner products, although technically they do fall into the x86/x64 category.

As I've been following this closely, perhaps I should share some links:

Intel has released the software stack for the MIC products to the Linux community as open source, and is actively adding to it.

http://software.intel.com/en-us/blogs/2012/06/05/knights-corner-open-source-software-stack/

http://www.phoronix.com/scan.php?page=news_item&px=MTExOTE

Allow me to quote :

Quote:
GCC for Knights Corner is really only for building the kernel and related tools; it is not for building applications. Using GCC to build an application for Knights Corner will most often result in low performance code due its current inability to vectorize for the new Knights Corner vector instructions. Future changes to give full usage of Knights Corner vector instructions would require work on the GCC vectorizer to utilize those instructions’ masking capabilities.



What the above hints at is that there will be a lot of new features, maybe in the form of new instruction sets, that can take full advantage of the vectorization ability of Intel's MIC products.

Perhaps in this regard FASM can shine: if FASM can support the same vector features that are in the MIC family, FASM may be a good alternative to GCC for users who want to tap into the full power of MIC.
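
To make the "masking capabilities" mentioned in the quote a bit more concrete, here is a minimal scalar sketch in C of what a per-lane write mask does in a Mandelbrot-style loop. It is only an emulation under stated assumptions: the lane count of 16 (16 floats in a 512-bit register), the function name mandel_masked and the 16-bit mask variable are all illustrative choices, not anything taken from Intel's documentation or from GCC.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16  /* assumed lane count: 16 floats fill one 512-bit register */

/* Runs up to max_iter Mandelbrot steps on LANES points at once.
   Bit i of the mask stays set while lane i is still iterating. Lanes
   whose bit is clear keep their old values, which is the effect a
   hardware write mask would have on a vector instruction. */
static void mandel_masked(const float *cr, const float *ci,
                          int max_iter, int *count)
{
    float zr[LANES] = {0}, zi[LANES] = {0};
    uint16_t mask = 0xFFFF;             /* all lanes active at the start */

    assert(max_iter >= 0);
    for (int i = 0; i < LANES; i++)
        count[i] = 0;

    for (int it = 0; it < max_iter && mask != 0; it++) {
        for (int i = 0; i < LANES; i++) {
            if (!(mask & (1u << i)))
                continue;               /* lane is masked off */
            float r2 = zr[i] * zr[i];
            float i2 = zi[i] * zi[i];
            if (r2 + i2 > 4.0f) {       /* point escaped: clear its mask bit */
                mask = (uint16_t)(mask & ~(1u << i));
                continue;
            }
            zi[i] = 2.0f * zr[i] * zi[i] + ci[i];
            zr[i] = r2 - i2 + cr[i];
            count[i]++;
        }
    }
}
```

On real vector hardware the inner loop over lanes would be a handful of masked vector instructions; a compiler vectorizer has to derive the mask from the escape test by itself, which is roughly the work the quote says GCC is missing.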
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8349
Location: Kraków, Poland
Tomasz Grysztar 17 Jul 2012, 09:33
Its instruction set is a new variant of what the Larrabee instruction set was (with 512-bit vector registers). fasm never supported the latter, since it was just an exotic prototype. As for the MIC instructions - we will see, I may implement them if there is need for it, but certainly not soon.
kalambong



Joined: 08 Nov 2008
Posts: 165
kalambong 17 Jul 2012, 23:57
Tomasz Grysztar wrote:
Its instruction set is a new variant of what the Larrabee instruction set was (with 512-bit vector registers). fasm never supported the latter, since it was just an exotic prototype. As for the MIC instructions - we will see, I may implement them if there is need for it, but certainly not soon.


Thank you, Tomasz, for your reply.

I have nothing but respect for you, Sir. So whichever path you think is right for FASM with regard to Intel's MIC, I will of course respect it.

It's just that, IMHO, in this age where many chip manufacturers, such as Nvidia and AMD's ATI, have decided that they won't allow assembly language programmers to program directly to the inner hardware registers of their CPU/GPU, Intel's MIC remains a tantalizing possibility.

GCC may one day support the vector registers of Intel's MIC, but as we already know, it may take years, or even decades, for the gigantic GCC machinery to reach that stage.

And in the meantime, if people can get to the same vector registers through FASM (that is, if Intel releases sufficient info, and if you decide to implement them in FASM), programmers who want to tap into the full potential of Intel's MIC could use FASM to realize their dream.

But of course, it's all vaporware talk, for now :)
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 18 Jul 2012, 01:06
My only question is: how good is its 64x64->128-bit multiply and 64-bit adc?
I.e., could it beat a 2600K or better in deep Mandelbrot zooms?
32-bit floats only have limited use and it's hard to string them together for multiprecision...
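
For readers wondering why these two operations matter for deep zooms: multiprecision numbers are stored as arrays of 64-bit limbs, and every big addition or multiplication is built from a widening multiply and an add-with-carry chain. The following is a minimal sketch in C, not code from the Mandelbrot explorer. The names mul_64x64 and add_limbs are made up for illustration, and unsigned __int128 is a GCC/Clang extension assumed to be available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64x64 -> 128-bit multiply, the job a single x86-64 MUL does.
   unsigned __int128 is a GCC/Clang extension. */
static void mul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}

/* r = a + b over n little-endian 64-bit limbs. Returns the final carry,
   which is what an ADD followed by a chain of ADCs leaves in the
   carry flag. */
static uint64_t add_limbs(uint64_t *r, const uint64_t *a,
                          const uint64_t *b, size_t n)
{
    uint64_t carry = 0;

    assert(n == 0 || (r != NULL && a != NULL && b != NULL));
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t c1 = s < carry;    /* carry out of a[i] + carry */
        r[i] = s + b[i];
        uint64_t c2 = r[i] < s;     /* carry out of s + b[i] */
        carry = c1 | c2;            /* at most one of the two can be set */
    }
    return carry;
}
```

With 32-bit floats there is no carry flag and no exact double-width product, so each limb operation needs several extra instructions to recover the lost bits, which is the difficulty described above.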
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 19 Jul 2012, 11:53
tthsqe wrote:
My only question is: how good is its 64x64->128-bit multiply and 64-bit adc?
I.e., could it beat a 2600K or better in deep Mandelbrot zooms?
32-bit floats only have limited use and it's hard to string them together for multiprecision...

don't expect too much from it. there are no instruction-level
improvements "per se". we will see improvements, of course,
though i imagine only of the same kind as the latency
differences we observed between 45nm and 65nm.

but those cores are Pentiums glued together, plain and simple.
my investigation, started on this board more than
one year ago, led me to see it correctly a year ago,
when in a public discussion i described Larrabee as the alter ego of Transmeta,
and as Intel trying to atone
for what its engineering has been after the Pentium.
(because beyond aliasing MMX onto the FPU they couldn't
really improve the FPU, which i consider a
masterpiece of engineering!)

now,

1) the presence of backward-compatible FPU instructions on MIC
should be read as a hint of what will not change.

2) there is no reason for 300W of power nowadays, unless they
intend it for server-side applications.
then they say they want to simplify access to the cores
(to win the competition against CUDA etc.),
and this is not bad. but coders use GCC, not assembly
nor their own tools; the math library included,
because it is not open source.

3) the shared cache is difficult to program, and i seriously
doubt that coders understand or take advantage of things like those in this paper
http://rolfed.com/nehalem/nehalemPaper.pdf
and even if they understand it, can they afford the huge GCC latency
before they can apply it? in that sense GCC is not to be trusted,
because with "shared cache" and "multicore" computing, design turns out
to be a MUST, not a toolchain option, as they say.

4) a prefetching strategy doesn't give more than a 25-30% performance gain (my personal experience up to SSE 4.1).

5) i haven't heard anything from Microsoft about MIC,
and the SYSENTER/SYSEXIT instructions will be dropped from that set!
ergo:
i would not hack on that set, nor will i buy one of those machines
to test that instruction set. i would rather read reports from users/developers.
and this last fact (imho a possible marketing error by Intel)
will seal the fate of their MIC... again. didn't we learn
Itanium's lesson too?
ok, my opinion.

concretely: don't worry about it too much. but consider that
your FMA Mandelbrot experiments (though i generally like what you write)
do not run as i expect on my quad core; they do not run as smoothly as expected.
there must be another, more efficient strategy for them, i mean
for my quad core, under SSE 4.1 without VEX, or SSE 4.2.

on the other hand, back-buffering +-2,3 zoom factors would be perfectly doable on a newer MIC machine, so there is your opportunity/temptation ;)

Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑
kalambong



Joined: 08 Nov 2008
Posts: 165
kalambong 22 Jul 2012, 02:04
So what you are essentially saying is that Intel is not to be trusted, and that anyone who wants to program the new MIC chip will have to go through GCC, which will help Intel hide the juicy intricate bits from the masses, so that no one gets to code bare-metal MIC with assembly languages such as FASM?
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 22 Jul 2012, 08:49
kalambong wrote:
...you saying Intel is not to be trusted
yes, and those untrustworthy steps started with the shared cache,
not with the MIC itself, which remains little more than a concept, an acronym, because there is no commercial release at the moment, as reported by Wikipedia.

knowing some of the bottlenecks of the prefetch instructions,
i was, and still am, the first enthusiast of the shared cache, even if
only on a theoretical basis, because i never had a machine to test
or implement it on. but i do assembly; in fact the question should be
addressed to those, the majority, using C/C++ toolchains.

could someone provide concrete examples of designs (open or closed source)
that take good advantage of the shared cache using those toolchains, GCC etc.?
i would like to discuss them.
kalambong wrote:
...to program the new MIC chip will have to go through GCC
yes, because most developers who push technology forward do it on a public platform, i.e. by sharing solutions and deploying open source. ergo the proprietary software Intel suggests would give Intel no market for their technology!
kalambong wrote:
... which will help Intel to obfuscate the juicy intricacy bits from the masses
?!
kalambong wrote:
no one get to code bare metal MIC with assembly languages such as FASM?
no, i didn't say that. i mean one needs design+assembly to get the best out of that technology. i imagine GCC & C. toolchains cannot achieve it easily.
so, i agree with Tomasz about that instruction set:
only
Quote:
if there is need for it, but certainly not soon.
Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 23 Jul 2012, 05:46
hopcode,
are you suggesting improvements to the Mandelbrot explorer I posted at
http://board.flatassembler.net/topic.php?t=12722 ? The only thing I regret was the overly complicated way of reloading the SSE vectors, but I think it does reduce the drawing time.
Anyway, I was talking about computing really deep zooms (like 200+ decimal digits). You can see some of the ones I have rendered on YouTube:
http://www.youtube.com/watch?v=v-9siTf8K6c&feature=plcp
This was done with fasm to compute and color everything and ffmpeg
to compress the bitmaps into a video.
Also, in my experience multiprecision with CUDA PTX has about the same price/performance ratio as with a 2600K, so I have stuck with the CPU for now.
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 23 Jul 2012, 07:31
please consider that i am not an expert in gaming/graphics.
what follows is just my general opinion on how the CPU behaves, so i may be wrong in this case. ok
tthsqe wrote:
http://www.youtube.com/watch?v=v-9siTf8K6c&feature=plcp
This was done with fasm to compute and color everything and ffmpeg to compress the bitmaps into a video.
great! fascinating! bravo!
now,
tthsqe wrote:
the overly complicated way of reloading the SSE vectors
ok, i launched it again with SSE. it breaks at whatever depth, as soon as the black circle reaches a diameter of 1050 pixels (x 1680), i.e. inscribed in the rectangle of my monitor. that cannot be :) there is a lot of room for improvement there:
- switching cyclically from data->mem->stack is the main bottleneck.
- no prefetch strategy
- the FPU should be totally eliminated.
- too many mem-moves in the CUBIC_SSE_Reload
- all memory should be homogeneous, i.e. all stack or all virtual alloc in some way.
so at first glance, i estimate that the mem-moves in the CUBIC_SSE_Reload block take at least ~40% of the cycles. comment out the rest of the code to time that block of mov/movapd. if that is so, then there is the bottleneck. what is required is to interleave calculations with the moves (just as you did in the .loop block).
then prefetch no more than 1/2 of the required virtual-alloc'ed memory.

this should be enough.
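
The idea of overlapping memory traffic with calculation can be sketched roughly as follows. This is C rather than fasm, and it is not the CUBIC_SSE_Reload code: the function name sum_with_prefetch and the prefetch distance of 64 elements are hypothetical and would have to be tuned by measurement. __builtin_prefetch is a GCC/Clang builtin that typically compiles to one of the x86 prefetch instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in elements; tune it by measurement. */
#define PREFETCH_AHEAD 64

/* Sums n doubles while asking the cache to start loading data a fixed
   distance ahead, so the load of a later element overlaps with the
   arithmetic on the current one. */
static double sum_with_prefetch(const double *data, size_t n)
{
    double acc = 0.0;

    assert(n == 0 || data != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* GCC/Clang builtin: 0 = read access, 3 = keep in all cache levels */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        }
        acc += data[i];
    }
    return acc;
}
```

Whether this helps at all depends on the access pattern; for a simple linear walk like this one the hardware prefetcher often does the job already, so the pattern is more interesting for strided or pointer-chasing accesses.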
Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 23 Jul 2012, 11:37
tthsqe, this video is beautiful. Great job.
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 28 Jul 2012, 17:29
Interesting project: http://ispc.github.com/
Basically, it is a compiler from Intel for a C-like language with vectorization capability. It supports the SSE, AVX, AVX2 and Xeon Phi (Knights Corner) instruction sets.
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 29 Jul 2012, 00:22
randall wrote:
Interesting project http://ispc.github.com/
hi randall
thanks for the link. the good thing (but for Intel) is that it is BSD-licensed. ok.

it doesn't convince me though. after looking at the output code here
http://ispc.github.com/mandelbrot.txt
there is no real productivity gain, imho. one can read the MIC specs once, in
one day, and continue using one's own toolchain without ispc, producing even better code. besides, it makes the code more opaque than using HLL macros would.
and those obscure barrier() things and synchronizations too...

too complex, and the descriptions are too verbose. take this from
http://ispc.github.com/ispc.html#uniform-control-flow
Quote:
Uniform Control Flow
One advantage of declaring variables that are shared across the gang as uniform, when appropriate, is the reduction in storage space required. !?
i would rather not imagine what an implementation of it would look like; or simply my English is so poor that :D sorry, cannot resist
Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall 29 Jul 2012, 10:47
Hand-written code will always be better. But for most programmers who don't want to mess with assembly, I think it is a nice tool.

Of course I won't be using it; I prefer fasm.
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 31 Jul 2012, 11:55
hi everybody,
randall wrote:
...programmers who don't want to mess with assembly, I think it is a nice tool.
and precisely because they are the 99.9% out there, i would like to address the 0.1% in this thread, recalling the obvious fact that without benchmarks/numbers/facts it is difficult to get a tool adopted, imho.
well, but
i received a notification by email just yesterday and started digging again.
i must admit compilers are very open to this not-yet-on-the-market
MIC initiative. here is a list of them: http://openmp.org/wp/openmp-compilers/

now, because ScaleMP, and hence OpenMP, will both ease the transition to MIC,
please read from the first paragraphs after the image here
http://goparallel.sourceforge.net/virtualized-symmetric-multiprocessing-eases-mic-transition/

please don't neglect the fact that new multicore layouts will raise, 100% guaranteed,
some considerable-to-huge problems in the management of the shared cache and access to it.

but hey, much of this is already in Linus Torvalds' "guess" from 2009 here
http://multicorenz.wordpress.com/2009/03/26/linus-torvalds-patterson-and-different-views-or-different-worlds/

so, if we define ;) for the first time here a "micset" as
an intermediate layer wrapped into a toolchain in order to develop for MIC,
then as far as fasm is concerned, i think it is more convenient to have it as
instruction-encoding macros rather than implementing it as a new instruction set.

and like the one randall linked, i think we will see a flourishing bunch of those micsets in the near future.
it is no accident that my opinion
(which i formed myself from crude/raw asm programming of the cache)
basically matches that of Linus T. above.
in any case, before integrating one of those micsets into the
toolchain, please consider this kind of
programming as a very special/dedicated one, where
performance will be 100% dependent on the micset.

a final quote from the second document above, ScaleMP on Xeon Phi:
Quote:
Emulation, of course, will slow the algorithm down somewhat,
but our system shows the programmer which lines of code are affected so that they can rewrite them at any time to regain the lost speed.”
he told you clearly: "...rewrite..."
you have been warned.

Cheers,

_________________
⠓⠕⠏⠉⠕⠙⠑
Alphonso



Joined: 16 Jan 2007
Posts: 295
Alphonso 03 Sep 2012, 17:34
Tomasz Grysztar wrote:
Its instruction set is a new variant


Umm, that link seems invalid, anyone have a new link?

Thanks.
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode 03 Sep 2012, 18:22
Alphonso wrote:
Tomasz Grysztar wrote:
Its instruction set is a new variant

Umm, that link seems invalid, anyone have a new link?
Thanks.

...nothing can elude hopcode's control :D
i saved the RAR on my website here
https://sites.google.com/site/x64lab/327364001EN.rar
because it doesn't fit the quota on the board.
Cheers

_________________
⠓⠕⠏⠉⠕⠙⠑

Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.