How C stores symbol table? (type information)

Index > Programming Language Design > How C stores symbol table? (type information)

Goto page Previous 1, 2

Author

Thread

vivik

Joined: 29 Oct 2016
Posts: 671

vivik 14 Apr 2017, 14:05

I wonder why you shared fasm at all. You could've keep it to yourself. Why let others use your work for free. (Yes I know there is a donation button and all, but nobody is forsed to donate.)

This thread is whatever now, but I don't want to create another one for every little question.

14 Apr 2017, 14:05

Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8434
Location: Kraków, Poland

Tomasz Grysztar 14 Apr 2017, 15:14

vivik wrote:

I wonder why you shared fasm at all. You could've keep it to yourself. Why let others use your work for free. (Yes I know there is a donation button and all, but nobody is forsed to donate.)

IMO fasm would have never become as versatile as it is if not for the others and their input. When I first released it I just wanted to hear some comments about my work an see if I could make something valuable out of it. And there were people interested in commenting, and their opinions and suggestions kept me working on this project for years and inventing many interesting new ideas to deal with various challenges. This is perhaps the main thing I got in exchange for sharing my code freely.

14 Apr 2017, 15:14

vivik

Joined: 29 Oct 2016
Posts: 671

vivik 22 Apr 2017, 08:07

Noticed a typo, reporting here in case you want to fix it.

https://flatassembler.net/docs.php?article=manual

>but it can be forced to us CX with the loopw mnemonic

us instead of use

Also, how about mentioning in docs that test instruction always clears Carry flag? (and probably a few others.) Seems like an important detail, probably. I asked about it before here https://board.flatassembler.net/topic.php?t=19803

22 Apr 2017, 08:07

vivik

Joined: 29 Oct 2016
Posts: 671

vivik 11 May 2017, 13:46

I probably wouldn't be able to do much of what I want to do. I guess I'll just share my ideas then, just in case.

My goal is to make a video game. But I think that recompiling everything makes the feedback cicle unnecessarily slow, so I think that the ability to quickly change some parts of code and see changes almost instantiously is necessary. Luckly, I stumbled at the description of how the microsoft did their hot patching (aka monkey patching) in microsoft dlls, by preceeding every function with 6 nop, and replacing them with absolute jmp when necessary (something like push address, ret, takes exactly 6 bytes).

The problem, this will require me to write a compiler, and quite probably a debugger and an ide. Always wanted to make a new language anyway, so why not, may as well make a new language since I'm remaking the whole infrastructure.

People usually just use something like lua for that, but I want the resulting exe to not contain any unnecessary libraries, and I think I will be too lazy to rewrite my lua code to the c or c++ one at the end.

I think this feature is available in visual studio under the name edit-and-continue. Never really used VS though, there are some very stupid reasons why I can't work with it, like not beliving that any good program can take 4 GiB to download, it forsing me to register, refusing to install the lastest version the last time I tried (registration form bugged out), having very confusing project properties interface, weird error messages, etc. I just used mingw all this time, it's terrible too, but at least I can do something with it.

What I did so far:

1) tried to understand TCC. I wasted much more time than I could or should on this, it took me much too long to finally look in the git history. The first version in the git is much simplier than the current version, it's just a single file, and the generated code's quality hasn't improved since then (the priority for TCC is speed of compilation and simplicity of compiler instead of overall resulting quality). It at least gave me some confidence in compiler building, even if I probably wouldn't really use much out of TCC's code.

2) made an empty exe file (with code data idata sections of size 10000h, to fit everything), and started rewriting the c functions in assembly using ollydbg. Opened a compiled version, and tried to rewrite it myself. I learned a lot of things about assembly during that, about fpu commands and about pe file format. So right now it can open a winapi opengl window, and display text. Can't edit text yet. It's very easy to run out of space when you do things this way, and very hard to move things around. Next step is to compile a single function, I don't want to code in assembly anymore.

11 May 2017, 13:46

vivik

Joined: 29 Oct 2016
Posts: 671

vivik 11 May 2017, 13:47

About my language.

I had a bit of time on my hands, so I tried out languages like asm, C, python, C++, D, golang, raptor, lisp. Didn't found one for me yet. Asm is a pain to use, other compiling languages put something weird in my exe (except of mingw and C, and only after some black magic with command line parameters), interpreted languages ain't the fastest thing in the world and they either require too many dependencies or just a pain to distribute, not the best thing for gamedev.

One of the lastest languages I learned was raptor. It's a forth like language, which means you write in it like "argument1 argument2 function". You put variables on stack, and functions do something with them. The problem with raptor though, there are many functions that rearrange stack, like exchange first and second element, copy first element and push it on stack, exhange first and third element. The creator of raptor first recommended to minimize the usage of those, and then implemented an "in1 in2 in3 function :> out1" syntax. This kind of makes a huge part of raptor outdated. I'd like to continue where raptor stopped, but remove some trash from it, simplify it.

I find it weird how in C the dataflow feels backwards, it goes from right to left. You got function arguments to the right, you got the function in the middle, you got "=" to the left of a function, and you get the variable that stores the result to the left. It feels backwards that dataflow is from right to left, while it feels more natural to read from left to right. Also, with the expressions like x = 5 + 7 / 3 - 2 * 4 >> 3 it's even more confusing, you have to keep the paranthesis everywhere or memorise the priority, I will get rid of that too. Those are small things, but I want to fix them.

So I'm thinking of something like this

calling a function:

Code:

in1 in2 .function -- out1 out2

deleting a variable:

Code:

var //

call a function and also delete a variable:

Code:

last-usage// function. -- out

discarding one of the outputs:

Code:

in1 in2 .function -- _ out2

basic math:

Code:

2 -- x
2 -- y
x y * -- res
res 1 + -- res

function definition:

Code:

func (in1 dword, in2 dword) (out bool) {
    in1 in2 > -- out
}

branching:

Code:

if x 0 < {
    "this happens" .print
}

Everything else I'll just copy from python, I guess.

Why deleting variables is important
Because golang really annoyed me with it. I often need to create a variable for an error, and I only need it once and never again. And in golang there are two different operands, = for just assigning an existing variable, and := for also creating a variable. I kind of have to look if it's the first use of the err variable, and then choose := or = depending on that.

It will make writing long functions easier. I think long functions are actually easier to read than lots of short ones, you don't have to jump all over the text to understand what's going on. The Code Complete book actually sort of agrees with me on that, it states somewhere that long functions ain't as bad as they have a reputation of being, there are less errors in there or something.

I could create an extra code block for temporal variables, but it's not flexible enough. With code block, I will have to specify a variable that needs to survive before the code block starts, and it feels unnatural for a variable to appear before it's first used. It's weird, just deleting variables manually feels like a more intuitive solution.

In C, at first, it was impossible to create variables midfunction. You just list all variables you will ever need at the beginning of function. I guess it was made this way because it made compilers and debuggers simplier, you have only a single symbol table set in stone for the rest of the function, and you have to increase esp only once. The ability to delete variables will make life a bit more tough for compilers and much more hard for debuggers, symbol table may change midfunction, and debugger have to keep track of which register and which stack value stands for what variable.

Why I use some-variable instead of some_variable or SomeVariable.
Too lazy to press shift one extra time for some_variable, and the SomeVariable just feels hard to read to me. I stole this idea from elisp, they name vars like that all the time. This means I can't use "-" as an actual arithmetic operation, but I can't do this anymore anyway, so why not.

the "x<y" versus "x y <" versus "x y .less"
This is the ugliest part so far. I guess I'll choose "x y <", I only need to compare once per if.

The only time when the C way looks better is when you fill the structure fields with something.

Code:

house.chair = 4
house.window = "broken"
house.color = "pink"

The interesting feature of it is that I can chain functions, quite like it happens in unix pipes. Something like this

Code:

my-array 5 .push 7 .push .sort

Why I make functions start from dot
Just to make them look different from variables. I don't want them to look exactly the same.

Will also help with autocompletion, if I'll ever go far enough to make an ide.

Not sure how will I take the address of a function. In C, I can call a function with "some_function()", and get address of a function with "some_function", and here I can't quite do that. I can do the ".some_function" is a call, and "some_function" is an address, but that's not feels very good. Maybe I should go for polymorphism or interfaces instead, like people usually do. I guess I'll steal the golang way, interfaces.

Another concern is how to organise python or golang like modules. You see, in golang, you use "some_function()" only if it's in the same module as your current function is, quite often you instead do "some_module.some_function()". I guess that's not a problem at all, this language will just use "some_module.some_function" in the same way as it would use ".some_function".

Yet another problem is what to do with structure members. They look very simularly, they use dot to get a subfield. In many languages you are allowed to put space between the name and the dot, "some_struct . some_field", this would no longer work. I guess it's not a problem unless you have very long chain of those structs, "some_struct.some_field.another_field.yet_another_field.and_again.and_again". Not sure how common is that, and what to do with it.

Accessing array, getting substring and iterating array will probably be just like in python, or maybe golang or javascript.

Last edited by vivik on 11 May 2017, 14:00; edited 1 time in total

11 May 2017, 13:47

vivik

Joined: 29 Oct 2016
Posts: 671

vivik 11 May 2017, 13:57

About how to organize memory. I'll probably go for garbage collection, it feels like a very helpful thing to have, at least as a debug feature.

I plan on having a debug and release versions, the debug one I will use to develop faster, and the release one is what users will see. Visual Studio style. Debug will have features like hotpatching and garbage collection, release is what users will get. One for development, one for everyday usage.

So, I plan to use garbage collection mostly as a debug tool, to find if and where I made a mistake. Like, run full mark and sweep once in a while, and see if there is some garbage left anywhere. I probably can store a caller of malloc function next to each allocated data chunk, to see where it was allocated.

To make a simple garbage collection, I only need to know the size of each chunk, know which part of this chunk is a pointer to something else, and have a way to traverse the whole heap. To do this, I need to store: size of each chunk, and type of each chunk. Those are two dwords.

I want to have some of my data easibly resizable, this means I should allocate more data than I currently need in advance. This means instead of just storing one dword for size, I'll need two dwords for used and allocated size. Now there are three dwords in total.

To make data travercable, I will need to put a distance to the next chunk of data, before and after the current chunk. Zero will mean it's the last one, 4 will mean the next chunk is right after this one, anything bigger will mean there is some unused data between chunks. Now there are four dwords per chunk, not including the data itself.

I figured out how to store any type in a single dwords. To do that, I sacrificed pointers (there is still a way to store a pointer, but it will be just that, pointer. Not a pointer to integer, not a pointer to string, just a pointer to something) and array to arrays. Never liked multidimensional array anyway. I may optionally make arrays and structures resizable in place, without copying everything, pretty useful.

So here is a possible way to encode types, it will probably be changed a bit.

Code:

possible types:
    reserved   ffffffff
    ubyte      0
    byte       1
    uword      2
    word       3
    udword     4
    dword      5


    float      8


    addr       10h (16)
    func       11
    ;weak       12 ;probably...

    ascii-string     20
    widechar-string  21
    utf8-string      22


    000  1/8 default types
    001  1/8 structs
    01?  1/4 array   (array of struct or array of default type)
    100  1/8 blob    ;means "no concrete structure",
    101  reserved
    11   reserved

So there is some place for signed and unsigned byte word and dword, float etc. Every type that starts on 000(in binary) is a default type.

Everything that starts on 001(in binary) is a user defined structure.

Arrays are encoded somewhat interestingly, you take any default type or struct, and then change one bit, second from the left. This is why you can't have a struct of structs, you just can't encode it.

The 100 are blobs. These are just the things that need some custom logic for a garbage collector, or maybe don't need garbage collection at all. Just to not limit myself with a standard way of storing data. Not quite sure what I will do with those.

By storing data this way, one can resize strings, structs and arrays in place, without copying everything. You can't resize quite easily structures if they are in arrays though, you'll have to copy them.

My biggest concern is if I'll be able to easily change between debug and release modes, because programming with garbage collection and without garbage collection are two pretty different ways of programming and even thinking, I'm not sure if I'll manage to do a smooth automatic transition between them. With garbage collection the size is always stored somewhere, without garbage collection program must store it somewhere manually. With garbage collection you can follow the reference and see what type it has, without garbage collection you should have strict rules about where goes what. With garbage collection polymorphism is trivial to implement, without garbage collection you need to do some compile time checks.

I never really programmed with a manual memory management before, I guess that's the problem right now. I either had garbage collection, or no memory management at all. I imagine that one time I'll need to replace and move some chunk, and it'll turn out there was not a single pointer to this chunk, but two or more, and they all are now invalid because I forgot to update them.

Things like circular references are a problem too, I don't know how to organize memory to avoid them. They are probably not a problem at all, I'm not sure.

Also, probably not important or related, but I don't know how to implement things like DOM, like in HTML. There may be the situation when one node is deleted, and there is somewhere a piece of code holding a reference to an already deleted node. This memory is already freed, but that code may not know it yet. This feels like a job for a weak reference. Not sure how to better make it though, and if garbage collector needs to store some additional info for that. I found one example of how to do it, by keeping an extra dword next to each chunk of memory, which will contain an address to a weak reference object. Not sure if I want or should do that, there should be a way to work with DOM without it, and without any garbage collection at all.

About structures, and changing them at runtime.
I want structures to behave like the lua tables, or better like javascript objects. To be able to just add a field to a structure, and for a structure to automatically extend. I'm not really sure how to do that, and if it's really a good idea. Well, the structures are resizable, and if I'll ever want to do that, it's possible, you can safely do that to a structure if you are sure that there is only one structure of that type in the whole program. Well, it feels like it's important to be able to add or remove a field from an object, so I guess it should be done then.

11 May 2017, 13:57

vivik

Joined: 29 Oct 2016
Posts: 671

vivik 11 May 2017, 13:58

What I will do now:

First, I will finish watching that letsplay.

Next, I need to compile a single function. This means:
1.2) Store text in an easily editable and parsable way (my window can display text strings, but it's far from a text editor)
1.3) Parse text into tokens
1.4) Build Abstract Syntax Tree from those tokens, whatever that means
1.5) Do a code generation from it. Problems here are:
1.5.1) register allocation. There is an interesting article on wikipedia, but except learning such a pretty words like graph colorability and liveness analysis i haven't moved much from there. It's not that useful anyway, because a pure mathematical solution will ignore things like encoding length and what commands are possible at all. I guess the best result will be by teaching machine which combinations of commands are possible, and how to choose from them. Too complex though, I shouldn't do this for now.
1.5.2) command encoding. Going to play the assembler soon too. I would need only the basic stuff, mov cmp je call ret.

After that, I will have to learn how to rearrange my functions. They are in the wrong order right now, and there are gaps between them. To do that, I need:
2.1) Store information about every function, address size name. Currently all my names are stored in the ollydbg, I hope to get rid of it and move to selfpatching entirely, olly is just so unconfortable to work with.
2.2) Choose the desired order of functions. For that:
2.2.1) See which functions call which, and which are called from where. Building call graph, or cross reference. No, it's far from a proper cross reference yet, no variables.
2.2.2) Write a gui for that. Dunno, maybe I can dump it as text and edit that, and then parse it back up into something useful.
2.3) Do the actual rearranging, with patching all calls and jumps.
2.3.1) The problem here, there is a WndProc address in one of structures. This is required by winapi. This kind of hints, that I will need a full cross reference for it. Damn it, I'm not sure. This kind of reminds me about the "how will I take the address of a function", the same problem again. I guess I'll just do a full heap traversion just to see if there is some function pointer that needs to be updated. Also, not a whole heap, but only the constants.
2.3.2) I will need to allocate a continuous block of memory for rearranged functions. Not that hard, since I will knew the total size of all necessary functions by then.

After that, I will have a text editor united with a compiler. Kind of like emacs, except fully running in native code, and easy to strip of all unnecessary features. It would be good to have a debugger, but just hotpatching a function should be enough, at least at the start. When I program in emacs or in firefox, I never really use a debugger, I just change the function as a whole and see what changed.

After that, it will be the time to finally play with winapi and opengl, make a nice offline documentation, just organize everything properly.

After that, I'd like to make a simple painting tool. I never found a tool whose source code I could understand, I guess I'll have to make one then. Often find things I miss or can't understand in existing ones.

After that, music. Just some beeps for start, then maybe learn to do samples. Always liked tracker music, maybe I'll make a simple one myself.

After that, it's finally time for a game. Maybe something 2d, simple physics. Level editor, cutscene editor, ai editor. Making ai without code hotpatching must be a pain, kind of explains why in most games every ai is so braindead. I'll need to learn how to draw too. Maybe drawing isn't that hard actually. Maybe I'll find a team, I have nothing to pay them with though.

After that I'll try to sell it, I guess. With tools like that I wouldn't need much else. Hope to get something in return for all this trouble though.

11 May 2017, 13:58

alexfru

Joined: 23 Mar 2014
Posts: 80

alexfru 11 May 2018, 06:34

vivik wrote:

How C compiler stores symbol table? This link https://www.tutorialspoint.com/compiler_design/compiler_design_symbol_table.htm
explained most of it, but I'm still not sure how they store type data. They can be quite complex, some types are actually combinations of types (array of struct, pointer to pointer to integer), some types reference user defined types (creating a struct creates a new type, which can be referenced in another type (array of structs again)). How compilers and debuggers store that?

I'm probably very late to this discussion, but here's what I did in my C compiler.
All declarations are stored nearly as-is, only linearized. Examples:

int a; // -> a int
int** a; // -> a * * int
int a[2][3]; // -> a [2] [3] int
int* (* a[5])[10]; // -> a [5] * [10] * int
int f(int a, int* b); // -> f (a int , b * int) int
int (*f)(int a, int* b); // -> f * (a int , b * int) int
typedef a b; // -> typedef b a
struct s { int a, *b }; // -> struct s { a int , b * int }

Whenever there's a reference to a structure/union, it's replaced with a "pointer" to its declaration (initially it may be a "NULL" pointer if the structure/union isn't fully declared yet; these dormant pointers will be updated when the declaration occurs).

References to typedef's are replaced with copies of the declarations from those typedefs (although, there may be a kind of pointer used as with structures/unions).

I keep all those parens, braces, brackets, commas and asterisks. Each has a dedicated token. These delimiters let me dive into structure members or function parameters or skip them. There are no explicit trees with edges (pointers) and nodes here.

This is a simple representation, which is sufficiently powerful for all kinds of simple and composite/aggregate types I need (full disclosure: const/void ignored, bit fields not implemented). Declarations are naturally variably sized.

At the moment each token in a declaration takes 5 bytes (1 for token type, 4 for accompanying data (e.g. array dimension or structure "pointer")). This is quite compact (the compiler can run in 96KB of RAM for all of the code, data and stack), but can be made even more dense since many tokens don't carry any data or don't need all 4 bytes for the data.

In most cases I can compare two declarations token by token without paying much attention to the tokens, I only need to know when to stop.

I can reuse these declarations when figuring out the type of an expression.
Say, I have:
int** a[2][3]; // -> a [2] [3] * * int
a[0]; // type: [3] * * int
a[0][0]; // type: * * int
a[0][0][0]; // type: * int
a[0][0][0][0]; // type: int
See, as I dereference things off a, I move further to the right in the original declaration of a.
There are some quirks:
a + 0; // type: * [3] * * int
There's no third * in the declaration of a. I use a negative index (into the declaration of a) to imply this extra *.

Similarly with negative indices into declarations:
int q; // -> q int
&q; // type: * int
*&q; // type: int
&*&q; // type: * int
*&*&q; // type: int
It's a bit ugly in the implementation, but works in the end.

Identifiers are obviously stored in a separate data structure and appear as a special token in a declaration. This token carries an index into that other data structure (really, array of chars). A hash table can be used to look things up by identifier, but this isn't implemented at the moment (an inefficient linear search is used instead).

Numbers representing array dimensions and enums are stored in the declarations, in-place.

Whenever a new scope begins (function or block scope), a special token is placed into the declaration data structure (effectively, it's a list/stack of variably-sized declarations). This lets me know what scope things are in, so I can check redeclarations and complete incomplete types (e.g. struct s; struct s* p; struct s { int a; }; extern int a[]; extern int a[4]; ). Again, scope management can be made more efficient if an identifier can be looked up quickly in the current scope and scopes enclosing it.

When the current scope ends, all of its declarations are removed. In my compiler code is generated as soon as possible, so, no need to keep around stuff that's not going to be used again (declarations, expressions, identifiers).

11 May 2018, 06:34

vivik

Joined: 29 Oct 2016
Posts: 671

vivik 14 May 2018, 08:31

You're not late, I still haven't started coding the compiler. I haven't finished the text editor yet. Thank you.

14 May 2018, 08:31

Goto page Previous 1, 2

< Last Thread | Next Thread >

Forum Rules:

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum