flat assembler
Message board for the users of flat assembler.

Index > Main > Parsing strings

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
MattDiesel



Joined: 31 Oct 2010
Posts: 34
Location: England
MattDiesel 11 Apr 2011, 19:50
I'm in the process of writing the lexer for a new programming language that I'm writing, just to put this in context...

The problem I'm having with strings in the language is that I don't know the length of a string while I'm lexing it, and it seems stupid to move the memory constantly, but at the same time I can't read ahead to find the length as I'm reading from the file in chunks of 4096 characters and it could go over a boundary and get messy.

How expensive is memory movement and re-allocation? And how would you guys do it? My guess is an arbitrary allocation of a reasonably size, with the possibility to resize when needed, in which case what size would you recommend?

Thanks as always Smile
Mat
Post 11 Apr 2011, 19:50
View user's profile Send private message Send e-mail Visit poster's website MSN Messenger Reply with quote
typedef



Joined: 25 Jul 2010
Posts: 2909
Location: 0x77760000
typedef 11 Apr 2011, 19:51
HINT: check for nulls
Post 11 Apr 2011, 19:51
View user's profile Send private message Reply with quote
MattDiesel



Joined: 31 Oct 2010
Posts: 34
Location: England
MattDiesel 11 Apr 2011, 20:00
Is that a hint as to how to do it, or a hint in general? There shouldn't be any nulls in there.

As an example, this would be the legendary statement in a simple programming language:
Code:
print "Hello, World!"    


ouput of the lexer:
Code:
Token: Identifier ('print')
Token: String ("Hello, World!")    


In that case the string is 13 characters long. The lexer sees a " and sets the state to string, which then deals with all sorts of escape codes and characters before ending the string. But how to allocate the space which is not known in advance and could be any length?
Post 11 Apr 2011, 20:00
View user's profile Send private message Send e-mail Visit poster's website MSN Messenger Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 11 Apr 2011, 20:06
Well, unless you use some library which already does this for you, you need to do it the hard way: Estimate the size, read to the buffer, if buffer is full then grow (reallocate) it and go on... reallocation can be relatively expensive, especially on a large block, but I wouldn't worry about this unless you plan parsing over megabytes of strings.
Post 11 Apr 2011, 20:06
View user's profile Send private message Visit poster's website AIM Address MSN Messenger ICQ Number Reply with quote
typedef



Joined: 25 Jul 2010
Posts: 2909
Location: 0x77760000
typedef 11 Apr 2011, 20:07
No need to allocate. Just set a pointer to a blank sapce...on top of other variable ofcourse incase you overwrite them


Just make sure you add it after you have added other variables because you may not know how long it is. It could cause buffer overflow. So

SUB ESP,4 ; dword etc
SUB ESP,8 ; another dword etc
SUB ESP,12 ; pointer to string.
Post 11 Apr 2011, 20:07
View user's profile Send private message Reply with quote
MattDiesel



Joined: 31 Oct 2010
Posts: 34
Location: England
MattDiesel 11 Apr 2011, 20:12
I doubt it will be megabytes... But I suppose thats up to the user Razz

What is a good size? I suppose I need to find a point where allocation is still trivial but is big enough to minimise the amount of allocations overall... I'll run a few tests and see what happens.
Post 11 Apr 2011, 20:12
View user's profile Send private message Send e-mail Visit poster's website MSN Messenger Reply with quote
typedef



Joined: 25 Jul 2010
Posts: 2909
Location: 0x77760000
typedef 11 Apr 2011, 20:14
Assume the all standard size 1 KB (1024 ) LOL


Hey, Even winsock uses that as initial buffer size..
Post 11 Apr 2011, 20:14
View user's profile Send private message Reply with quote
MattDiesel



Joined: 31 Oct 2010
Posts: 34
Location: England
MattDiesel 11 Apr 2011, 20:25
Having come from a high level background... Allocating a fixed number always looks ugly to me.

Looking at what I need, I'm going to go with 256 (1024 is a bit ott for what I need), but I'm going to keep it as a constant at the top of the lexer so changing it later is trivial.

I'll also have a look at using the stack pointer like you said typedef.
Post 11 Apr 2011, 20:25
View user's profile Send private message Send e-mail Visit poster's website MSN Messenger Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 11 Apr 2011, 22:21
If I understood you correctly, you want to reach the end of the string and just then to copy it somewhere else - right?
In this case you need to support arbitrary sizes of the buffer - because the string can have arbitrary size.
If you need really optimal solution, I would suggest you to use arbitrary count of buffers instead. When the one buffer ends, just allocate new one, read the chunk of the file and continue. In the same time you have to use double pointer to point to the begin and to the end of the string: [buffer number] and [buffer offset].
When you locate the end of the string, you can allocate needed memory, copy the string and then free all buffers except the last one.
Then the algorithm loops from begin.
You can keep the pointer to the buffers in dynamic array - it will be small enough to not cost you much for re-allocations of even to hold all pointers without reallocation at all. 4096bytes array can address 1024 buffers that is (if you allocate 4k for each) 4Megabytes before you need to resize it.
Post 11 Apr 2011, 22:21
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
MattDiesel



Joined: 31 Oct 2010
Posts: 34
Location: England
MattDiesel 12 Apr 2011, 08:43
I have it all working, but that's a good idea thanks JohnFound... If the buffer size is fixed then I could almost have a singly linked list of buffers with just a pointer then the buffer, so no need for the dynamic array, just a pointer to the first buffer. I'll certainly keep that in mind when I need to deal with bigger memory blocks.

There is a slight misunderstanding in that I won't be copying the string (if I can help it). Once it's in memory as a string I'll just be using a pointer to it, which will be stored in the token and given to the parser... So by doing it without the list I save a single allocation at the end, so doing it the other way would be better for small strings (e.g. less than my arbitrary buffer size).
Post 12 Apr 2011, 08:43
View user's profile Send private message Send e-mail Visit poster's website MSN Messenger Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 12 Apr 2011, 10:23
JohnFounds idea is better performance-wise, but the string in such representation is of course harder to work with. You must decide what's better for you.

As for growing buffer, usual way is to start with size "good enough for most", and then on each growing double the size (eg. 256 -> 512 -> 1024 -> 2048, etc.).
Post 12 Apr 2011, 10:23
View user's profile Send private message Visit poster's website AIM Address MSN Messenger ICQ Number Reply with quote
MattDiesel



Joined: 31 Oct 2010
Posts: 34
Location: England
MattDiesel 12 Apr 2011, 12:16
At the moment I've gone for 256 as the initial size (good enough for 99.9% of uses) and then it keeps going up by 256... Doubling would probably be faster and easier than adding (just left shift by 1).
Post 12 Apr 2011, 12:16
View user's profile Send private message Send e-mail Visit poster's website MSN Messenger Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 12 Apr 2011, 12:53
I made some research about memory reallocation strategies. Additive grow is really very, very slow. You have to do it by multiplication.
IMHO, x1.5 is probably the best multiplier for most of my needs.
You can do it for example with:
Code:
  lea  ecx, [ecx+2*ecx+256]
  shr  ecx, 1
    


It will make the series: 0, 128, 320, 608, 1040, 1688, etc.
The additive constant serves to make the series grow faster on small sizes.
You can vary this behavior changing the constant in the brackets. It must be > 2 in order to exclude hanging if incidentally ecx=0 on the input.
Post 12 Apr 2011, 12:53
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20430
Location: In your JS exploiting you and your system
revolution 12 Apr 2011, 14:11
JohnFound wrote:
Additive grow is really very, very slow. You have to do it by multiplication.
Perhaps for a general purpose OS this is true. But if you know your data sets sizes and distributions ahead of time then there are plenty of other strategies that can perform much better. It all depends upon the needs of your task(s).
Post 12 Apr 2011, 14:11
View user's profile Send private message Visit poster's website Reply with quote
typedef



Joined: 25 Jul 2010
Posts: 2909
Location: 0x77760000
typedef 12 Apr 2011, 18:29
hmmm.... i smell buffer overflows lol.
cant you use physical files? or mapped files?
write the buffer to a file and do get file size the allocate that amount? kinda works but slow.....good security wise too
Post 12 Apr 2011, 18:29
View user's profile Send private message Reply with quote
MattDiesel



Joined: 31 Oct 2010
Posts: 34
Location: England
MattDiesel 12 Apr 2011, 21:24
Theres no chance of buffer overflows Razz Don't worry about that.

I'm going stick with what I'm doing now, which is:
1) Allocate an initial buffer that depends on type of content (256*sizeof.TCHAR for strings), and set an initial counter to -1
2) Add the characters to the buffer until the counter = size, then do a simple shl on the size and reallocate to that size.

Thanks for all the help though. Testing all the alternatives has taught me a lot.

Mat
Post 12 Apr 2011, 21:24
View user's profile Send private message Send e-mail Visit poster's website MSN Messenger Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 12 Apr 2011, 23:54
Quote:
hmmm.... i smell buffer overflows lol.

Depends on who does the coding Razz
Post 12 Apr 2011, 23:54
View user's profile Send private message Visit poster's website AIM Address MSN Messenger ICQ Number Reply with quote
typedef



Joined: 25 Jul 2010
Posts: 2909
Location: 0x77760000
typedef 13 Apr 2011, 00:57
vid wrote:
Quote:
hmmm.... i smell buffer overflows lol.

Depends on who does the coding Razz


Very Happy Smile Sad Surprised Laughing Razz Shocked Confused Cool Crying or Very sad Mad Rolling Eyes Evil or Very Mad Twisted Evil Twisted Evil Twisted Evil Twisted Evil Twisted Evil Twisted Evil Twisted Evil Evil or Very Mad Evil or Very Mad Evil or Very Mad Evil or Very Mad Evil or Very Mad Evil or Very Mad Evil or Very Mad Evil or Very Mad

I'll kill you Twisted Evil
Post 13 Apr 2011, 00:57
View user's profile Send private message Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 13 Apr 2011, 01:05
Quote:
I'll kill you

depends on who does the killing Razz
Post 13 Apr 2011, 01:05
View user's profile Send private message Visit poster's website AIM Address MSN Messenger ICQ Number Reply with quote
typedef



Joined: 25 Jul 2010
Posts: 2909
Location: 0x77760000
typedef 13 Apr 2011, 01:27
vid wrote:
Quote:
I'll kill you

depends on who does the killing Razz


I'll rape you Razz Very Happy
Post 13 Apr 2011, 01:27
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.