So I decided to try to improve my multiprecision multiplication function, and it comes down to this:
suppose you have a bunch of 128-bit integers xn and you want to add them up to get their 192-bit sum.
Of course you can do
...
add SumLow,xnLow
adc SumHi,xnHi
adc SumCarry,0
...
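but the latency of adc is twice its reciprocal throughput, so a single dependent chain of adc instructions can only issue at half the rate the hardware could sustain. So naturally one would be led to add half of the terms separately (interleaving the instructions, of course) and combine the two partial sums at the end.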
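To make the idea concrete, here is a rough C sketch of the two-chain version. The names u128, u192, acc_add and sum_two_chains are made up for illustration, and a compiler is free to emit something other than add/adc for this, so it only shows that the arithmetic works out, not what the timing would be.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for this sketch: a 128-bit term and a 192-bit sum. */
typedef struct { uint64_t lo, hi; } u128;
typedef struct { uint64_t lo, hi, carry; } u192;

/* One add/adc/adc step: acc += x, with the carries tracked by hand. */
static void acc_add(u192 *acc, u128 x)
{
    uint64_t lo = acc->lo + x.lo;
    uint64_t c0 = lo < x.lo;        /* carry out of the low limb */
    uint64_t hi = acc->hi + x.hi;
    uint64_t c1 = hi < x.hi;        /* carry out of hi + hi */
    hi += c0;
    c1 += hi < c0;                  /* carry out of adding c0 (at most one of the two fires) */
    acc->lo = lo;
    acc->hi = hi;
    acc->carry += c1;
}

/* Sum n terms using two independent accumulators, combined at the end. */
u192 sum_two_chains(const u128 *x, size_t n)
{
    u192 a = {0, 0, 0};
    u192 b = {0, 0, 0};
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc_add(&a, x[i]);          /* chain A */
        acc_add(&b, x[i + 1]);      /* chain B, does not depend on chain A */
    }
    if (i < n)
        acc_add(&a, x[i]);          /* leftover term when n is odd */

    /* Combine: add the low 128 bits of B into A, then B's carry limb. */
    u128 b_low = { b.lo, b.hi };
    acc_add(&a, b_low);
    a.carry += b.carry;
    return a;
}
```

The combine step costs one extra add/adc/adc plus one add for the carry limbs, so the split can only pay off when there are enough terms to amortize it.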
As I am currently unable to test this, I was wondering if anyone knows whether this will lead to an improvement.