flat assembler
Message board for the users of flat assembler.

Index > Main > Null termination?

Goto page Previous  1, 2, 3, 4  Next
Author
Thread Post new topic Reply to topic
buzzkill



Joined: 15 Mar 2009
Posts: 111
Location: the nether lands
buzzkill
Quote:

If you did it twice in a row (little or nothing in between).. but still slower on the first pass, right?


No, not slower on the first pass, that would be normal speed if the cache is cold, but faster with following passes, since you have read (accessed) the entire range of memory that holds the string.

Quote:

And if the L1 cache is to valuable to store a length in it isn't it to valuable to store a string in it, anyways?


I'm not quite sure what you mean by that, as I see it, as much pertinent data you can have in your L1, the better. What do you mean by "too valuable to store <whatever>"?
Post 30 Mar 2009, 17:36
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
buzzkill wrote:
Quote:

If you did it twice in a row (little or nothing in between).. but still slower on the first pass, right?


No, not slower on the first pass, that would be normal speed if the cache is cold, but faster with following passes, since you have read (accessed) the entire range of memory that holds the string.
I didn't mean slower then not using the L1 cache, I meant slower then having the length prepended.

buzzkill wrote:
Quote:

And if the L1 cache is to valuable to store a length in it isn't it to valuable to store a string in it, anyways?


I'm not quite sure what you mean by that, as I see it, as much pertinent data you can have in your L1, the better. What do you mean by "too valuable to store <whatever>"?
Someone said something about it being bad that you have to use a register (which is part of the L1 cache right?) in length prepended strings as opposed to null terminated strings.
Post 30 Mar 2009, 17:42
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17279
Location: In your JS exploiting you and your system
revolution
Well since you are discussing L1 cache then you should also realise the importance of L1 cache line size and the read-allocate policy. If your usual string length is <= the L1 line size and you align the strings on cache line boundaries the you get the whole string in the cache, whether you want it or not, just by reading one byte of it.
Post 30 Mar 2009, 17:44
View user's profile Send private message Visit poster's website Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
revolution wrote:
Well since you are discussing L1 cache then you should also realise the importance of L1 cache line size and the read-allocate policy. If your usual string length is <= the L1 line size and you align the strings on cache line boundaries the you get the whole string in the cache, whether you want it or not, just by reading one byte of it.
Thanks. It takes more time to scan through the whole string then the first byte of it though, even in the L1 cache, doesn't it? Confused


Last edited by Azu on 30 Mar 2009, 17:47; edited 2 times in total
Post 30 Mar 2009, 17:46
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17279
Location: In your JS exploiting you and your system
revolution
Azu wrote:
... use a register (which is part of the L1 cache right?
No. Definitely not. Registers are a completely separate entity from caches.
Post 30 Mar 2009, 17:46
View user's profile Send private message Visit poster's website Reply with quote
buzzkill



Joined: 15 Mar 2009
Posts: 111
Location: the nether lands
buzzkill
Quote:
I don't know about Pascal strings.

By this I mean strings with a length byte prepended. These were the strings in the Pascal language, so I call them Pascal(-style) strings.

Quote:

I think it would make comparisons and scanning and copying faster. I just want to know if I'm missing anything (besides it taking a register).


Well, there is for instance the scas(b) instruction if you want to search for the null byte to determine string length with a C-string. As for copying, I guess with a Pascal-string you could read the length byte and then movs(b) the string itself, but whether that would be a great performance increase I don't know. For comparison, the advantage of having a length byte is only if your two strings have different length and even then you need to consider what you call equal (ie "hello" vs "hello "). Also, you will want eg the position in the strings where they differ, so you'll have to traverse them anyway.
Post 30 Mar 2009, 17:47
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
buzzkill wrote:
Quote:
I don't know about Pascal strings.

By this I mean strings with a length byte prepended. These were the strings in the Pascal language, so I call them Pascal(-style) strings.
Wouldn't that just be writing it to the start then? Why is that twice as much?

buzzkill wrote:
Quote:

I think it would make comparisons and scanning and copying faster. I just want to know if I'm missing anything (besides it taking a register).


Well, there is for instance the scas(b) instruction if you want to search for the null byte to determine string length with a C-string.
But isn't a single byte faster to read then the whole string?

buzzkill wrote:
As for copying, I guess with a Pascal-string you could read the length byte and then movs(b) the string itself, but whether that would be a great performance increase I don't know. For comparison, the advantage of having a length byte is only if your two strings have different length and even then you need to consider what you call equal (ie "hello" vs "hello "). Also, you will want eg the position in the strings where they differ, so you'll have to traverse them anyway.
So in the worst case scenario it is the same speed as null terminated?


revolution wrote:
Azu wrote:
... use a register (which is part of the L1 cache right?
No. Definitely not. Registers are a completely separate entity from caches.
Whoops! Looks like I've still got a lot to learn lol. Sorry. Embarassed
Post 30 Mar 2009, 17:50
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17279
Location: In your JS exploiting you and your system
revolution
Azu wrote:
Thanks. It takes more time to scan through the whole string then the first byte of it though, even in the L1 cache, doesn't it? Confused
Not necessarily. It depends upon your usage model. You may be able to hide scan execution with prefetches of other data happening in parallel.

If you are really determined to get peak performance out of string operations then you have to know about caches and memory bandwidths etc.. The CPU is only a small part of optimising code for best performance, you also get to get data into and out of the CPU to make it useful, and the path in/out is the memory interface.
Post 30 Mar 2009, 17:54
View user's profile Send private message Visit poster's website Reply with quote
buzzkill



Joined: 15 Mar 2009
Posts: 111
Location: the nether lands
buzzkill
revolution wrote:
Well since you are discussing L1 cache then you should also realise the importance of L1 cache line size and the read-allocate policy. If your usual string length is <= the L1 line size and you align the strings on cache line boundaries the you get the whole string in the cache, whether you want it or not, just by reading one byte of it.


That's very interesting, revolution, I didn't know that. Lining up strings to cache-line sizes is not something I have seen done before (at least not that I've noticed, I wonder if any HLL compilers take this into account). For my CPU, cache line size is 64 bytes, so I personally wouldn't align to that I think (either large holes in your data, or you'd have to rearrange other data). And whether or not typical string usage in a program would be strlen() <= 64, I think would be hard so say beforehand. Nevertheless, you've given me something to think about Smile
Post 30 Mar 2009, 17:56
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
revolution wrote:
Azu wrote:
Thanks. It takes more time to scan through the whole string then the first byte of it though, even in the L1 cache, doesn't it? Confused
Not necessarily. It depends upon your usage model. You may be able to hide scan execution with prefetches of other data happening in parallel.

If you are really determined to get peak performance out of string operations then you have to know about caches and memory bandwidths etc.. The CPU is only a small part of optimising code for best performance, you also get to get data into and out of the CPU to make it useful, and the path in/out is the memory interface.
Thank you. I still don't understand though. If the whole string can be scanned with no performance hit, can't the first byte of it? Or am I misunderstanding something?
And for strings that don't fit in L1 wouldn't the scanning still be a lot slower?
Post 30 Mar 2009, 17:57
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
buzzkill



Joined: 15 Mar 2009
Posts: 111
Location: the nether lands
buzzkill
Quote:

Wouldn't that just be writing it to the start then? Why is that twice as much?

Not twice as much: C-string: append new data to end of string. Pascal-string: append new data to end of string and update string lenght.

Quote:

But isn't a single byte faster to read then the whole string?

That depends, if we take into account what revolution said about cache lines.

Quote:

So in the worst case scenario it is the same speed as null terminated?

I didn't say that, if you really want to know for sure, first study some string-processing algorithms, then take some benchmarks...

*Edit* Eg, with length byte strings, you may have to keep the string length in a register (as mentioned above) and that can affect the efficiency of your algorithm, x86 is not know as register-starved for nothing Smile
Post 30 Mar 2009, 18:00
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
buzzkill wrote:
Quote:

Wouldn't that just be writing it to the start then? Why is that twice as much?

Not twice as much: C-string: append new data to end of string. Pascal-string: append new data to end of string and update string lenght.
Okay well what I meant was just having the length in the front.
Not putting the length in the front AND putting a null on the end.

buzzkill wrote:
Quote:

But isn't a single byte faster to read then the whole string?

That depends, if we take into account what revolution said about cache lines.
It might not be faster then.. but it wouldn't be slower either would it?

buzzkill wrote:
Quote:

So in the worst case scenario it is the same speed as null terminated?

I didn't say that, if you really want to know for sure, first study some string-processing algorithms, then take some benchmarks...
Sorry. Was just hoping to get some easy answers.
Post 30 Mar 2009, 18:03
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17279
Location: In your JS exploiting you and your system
revolution
Azu: Do you have the app ready, debugged and running now? If not then I would respectfully suggest that you are trying to optimise too early.

The basic rule is always: Get it working, then get it fast.

If you have already passed the working stage then you can start to characterise the usage model and start to give real world figures and ratios etc. Only then can you start to answer the sample Q's that I proposed some posts back there somewhere.

Also, if your app is already working then you can just code up a few variations and test to see which is fastest in your runtime situation. Remember that other people's computers may give different results. With differing cache sizes and differing CPUs etc. the results are almost certain to be different. One size does not fit all! I cannot tell you what will be faster for you, because your situation will be very specific and I cannot duplicate it here to tell what is best for performance.
Post 30 Mar 2009, 18:09
View user's profile Send private message Visit poster's website Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
revolution wrote:
Azu: Do you have the app ready, debugged and running now? If not then I would respectfully suggest that you are trying to optimise too early.

The basic rule is always: Get it working, then get it fast.

If you have already passed the working stage then you can start to characterise the usage model and start to give real world figures and ratios etc. Only then can you start to answer the sample Q's that I proposed some posts back there somewhere.

Also, if your app is already working then you can just code up a few variations and test to see which is fastest in your runtime situation. Remember that other people's computers may give different results. With differing cache sizes and differing CPUs etc. the results are almost certain to be different. One size does not fit all! I cannot tell you what will be faster for you, because your situation will be very specific and I cannot duplicate it here to tell what is best for performance.
I want to code it right to begin with instead of finishing it only to find out that I have to rewrite the entire thing because I chose a sub-optimal fundament.
Post 30 Mar 2009, 18:12
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
buzzkill



Joined: 15 Mar 2009
Posts: 111
Location: the nether lands
buzzkill
Quote:

Remember that other people's computers may give different results. With differing cache sizes and differing CPUs etc. the results are almost certain to be different. One size does not fit all!


OT: you could consider this an argument against very low-level optimizations in assembly of course Smile
Post 30 Mar 2009, 18:17
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
buzzkill wrote:
Quote:

Remember that other people's computers may give different results. With differing cache sizes and differing CPUs etc. the results are almost certain to be different. One size does not fit all!


OT: you could consider this an argument against very low-level optimizations in assembly of course Smile
Which is why I am asking if anyone knows of cases where null terminated strings would be faster (faster, not equal).
Post 30 Mar 2009, 18:19
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17279
Location: In your JS exploiting you and your system
revolution
Azu wrote:
I want to code it right to begin with instead of finishing it only to find out that I have to rewrite the entire thing because I chose a sub-optimal fundament.
There is no right or wrong way. I appreciate that you don't want to duplicate work, but in most cases the string handling functions (maybe 5 or 6 at most) is not going to be a great hardship to write in three different ways. And if the performance is so important then it should be important enough to make it worthwhile to write it in a few different ways else the payback/effort ratio is too small and you shouldn't worry about it. If it is worth doing then it should be worth doing in a few ways to see which works best in your situation.


Last edited by revolution on 30 Mar 2009, 18:24; edited 1 time in total
Post 30 Mar 2009, 18:20
View user's profile Send private message Visit poster's website Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
revolution wrote:
Azu wrote:
I want to code it right to begin with instead of finishing it only to find out that I have to rewrite the entire thing because I chose a sub-optimal fundament.
There is no eight or wrong way. I appreciate that you don't want to duplicate work, but in most cases the string handling functions (maybe 5 or 6 at most) is not going to be a great hardship to write in three different ways. And if the performance is so important then it should be important enough to make it worthwhile to write it in a few different ways else the payback/effort ratio is too small and you shouldn't worry about it. If it is worth doing then it should be worth doing in a few ways to see which works best in your situation.
I'd really rather not use functions unless necessary. Each use would be two needless branches. I could use macros, but then I wouldn't be able to optimize at a very fine grain.
Post 30 Mar 2009, 18:22
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17279
Location: In your JS exploiting you and your system
revolution
buzzkill wrote:
Quote:

Remember that other people's computers may give different results. With differing cache sizes and differing CPUs etc. the results are almost certain to be different. One size does not fit all!


OT: you could consider this an argument against very low-level optimizations in assembly of course Smile
Absolutely true. For the basic commercial PC's, hyper fine optimisations are generally wasted effort when compared to the payback. But like I have already mentioned about 20 times here it depends upon your usage model. Now I hope that is clear folks, define your usage model first.
Post 30 Mar 2009, 18:23
View user's profile Send private message Visit poster's website Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
revolution wrote:
buzzkill wrote:
Quote:

Remember that other people's computers may give different results. With differing cache sizes and differing CPUs etc. the results are almost certain to be different. One size does not fit all!


OT: you could consider this an argument against very low-level optimizations in assembly of course Smile
Absolutely true. For the basic commercial PC's, hyper fine optimisations are generally wasted effort when compared to the payback. But like I have already mentioned about 20 times here it depends upon your usage model. Now I hope that is clear folks, define your usage model first.
I did. Twice.
Post 30 Mar 2009, 18:24
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3, 4  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.

Website powered by rwasa.