What is your MyCard?The New World OrderWho wants to be on the Advertising Team?


Computer Help & Chat

Get help with computer-related problems, or just chat about hardware, software, emulation, programming, networking, smartphones/tablets or pretty much any other computer topic.
How compilers work
Posted: Posted October 16th by Kohlrak

Since my last post demonstrated the average coder has trouble with understanding how you go from code in source language X, to compiled binary for machine Z. The first thing to understand is, in reality, we are often just doing multiple layers of the same thing: converting X to Y, then converting Y to Z with the same process. Now, rather than speaking of the actual terms taught in compiler classes, I'm going to speak in terms low level and direct enough that you should be able to accomplish this task on your own, as opposed to solving the problem by downloading libraries A, B, and C to do this for you. Why? Well, why try to understand how compilers work if you're not going to bother learning how A, B, and C work, too? Yet, this is the mistake I constantly see with the "build your own compiler" tutorials, because they use terms like "lexer," but neither explain how they work or why you even need one (so we're pretending that we're learning how compilers work, but only by breaking them down into smaller pieces which we still don't understand). Saddest part is, these are college courses being like this.

What I'm going to propose is the system I used (well, am making, since the project isn't completed yet due to laziness), which has been successful so far, and seems to model the actual ones being used. what i've learned from experience, though, is that alot of what i come up with, myself, and think is absolutely brilliant, is what everyone else is already doing, so i learned to stop being so hard on myself when i think a problem could be solved a better way, but can't think of how. I'm not going to get too in depth, so as to practically write the code here (especially going into how to check for syntax errors, as that become self-evident as you're writing), but I'm going to talk deep when i need to. Odds are, you're going to need to read this more than once to understand it, since the big picture is understood by looking at the smaller interlocking pictures. Also, for clarity, I use assembler and compiler almost interchangeably, here, but there is a distinction. When making or using them, they're mostly the same, but the difference comes in that compilers are meant to convert one coding language to another coding langauge (usually to assembly), while assemblers are meant to make a "binary output."

The biggest challenge with this, or OS dev, or any other huge project, is that it's not really a huge project, but a collection of smaller projects, that when you break down the components, it becomes easier. As a result, the task is scary (alot of people when developing the tokenizer are worried that it's a waste of time because they don't know how to get things in the correct order of operations, for example, when that's miles down the road and has an easy solution they don't know, so it is more overwelming than it is difficult). Another thing to consider is, the first time you do it will be the hardest time, so you should focus more on getting it to work, then accepting that it was more of an educational project, then rebuild new ones from scratch, using what you learned, but not recycling code. It seems scary, but once you've done it, really is incredibly easy.

First step is you want to define a language. I defined mine loosely early on, and constantly refined it while i built it. This tactic is not the most reliable method, as you may especially end up reinventing an existing programming lanaguage. The language itself will heavily influence how efficient your compiler really is, both with producing good binaries, but also at getting the code compiled in a reasonable time frame. You also want to know what the output should look like, as well. My first compiler is actually an assembler, which doesn't even have a target machine, which seems useless until you see it in action, and also realize that it thus has the potential to be used for *any* machine, since the target is not in my source, but the source passed to it at run time. This is probably the easiest method for a first timer to go with, since you don't have to learn a target machine (my purpose, really, was to have an assembler that would later target a VM i plan on making). A nice example syntax is the following (this is actually what's working in my assembler, but you don't have to be able to understand what each bit does to understand the big picture):

org 0x7c00 ;Adjusts file so all labels representing locations are based on offset from last org.
start: d1 'meow\n', 0 ;Unescaped string with escape character
babysfirstlabel: ;Points to the next byte, which happens to be the 7th.
d1 "meow\n", 0 ;Same string, only escaped for real

align 0x10, 0xcc ;Alignment makes it easier to read
日本語のlabel: d4 start, 日本語のlabel, babysfirstlabel, 0x01234566+1
dx 2, 8, 0xFF, 4, 0xa00b ;For making instructions. syntax: byte, limit bit, data, lsh, OR data
align 0x10, 0xcc
align 0x10, 0 ;This should never print

file "binfile_for_import.bin", 0, 0
align 0x10, 0xcc

a= 1
IF a % 2
d1 a
ELSE
d2 a
ENDIF


And the output (courtesy of xxd running on Termux):

00000000: 6d65 6f77 5c6e 006d 656f 770a 00cc cccc meow\n.meow.....
00000010: 007c 0000 107c 0000 077c 0000 6745 2301 .|...|...|..gE#.
00000020: fbaf cccc cccc cccc cccc cccc cccc cccc ................
00000030: 6865 6c6c 6f20 6672 6f6d 2066 696c 6521 hello from file!
00000040: 00cc cccc cccc cccc cccc cccc cccc cccc ................
00000050: 01


settingsOptions
There are 2 Replies

Hopefully the formatting doesn't get too messed up, as I don't want to leave a permanent post on pastebin for this when it's not even complete. But, without going into details on all the syntax, you can see that it could be useful when completed, even if not used for programming. The "file" directive alone, can be used to combine files as-is in certain orders which could be used for making files in specific binary formats for mods, VMs, etc. I'll cover more on the binary output, at the end, since this is where the theory present differs most strongly from the application of it.

Secondly, you'll need to have some kind of setup for doing multiple "passes." It is next to impossible to get everything going well by going one step per line, so you actually want to run steps over the entire source code separately. This was the thing that, when i finally learned and realized this, that I found that making a compiler is actually incredibly easy. This makes debugging alot easier, too, and will allow you to break everything down into manageable parts and even have some use for the end product even before it's completed. I, personally, read the file into C++ strings, then closed the file. Then after each pass, erase the "source string" then setting the pointers for "souce string" to output strings, then making the "output string" pointers point to new. Most of this was done automatically for me by the C++ STL, however this isn't very efficient. The reason I did it this way, was because my goal was so it could be run on any machine that my C++ compiler could target, including machines that my C++ compiler itself could not be compiled for. Thus, potentially, allowing my assembler to run on arduino. In the long run, as the C++ STL gets optimized on systems, my compiler will also benefit from those optimizations. However, for a more serious project, I'd recommend using a smarter memory management system, since, right now, it's RAM hungry and will choke up on a large file.

Thirdly, and some people combine this with one of the next steps (like the tokenizer is a good choice, as well), I read the code line by line and skip over the comments. It's easier to force single line comments only, but double comments can be handled as well with fairly little complexity. Odds are, if you have a pass dedicated to comments, it'll be easier to handle the multi-line comments later on. Having a separate pass for the comments is probably going to slow down the whole process more than mixing it with a future component, but the complexity costs could scare you away. I ended up putting this step into the same step where I read the file into RAM.

Next, if there's any sort of special pre-processing that you must do (outside of pre-processor directives like "#include" [notice, I haven't even implemented an include yet]), You may want to do that here. Strings, for example, can make the next step (tokenization) incredibly difficult. I made a dedicated pass to pre-processing strings so that the tokenizer wouldn't choke up on the whitespace within the strings. The problem here, is that if you have a special escape symbol for spaces, newlines, etc, you need an escape symbol for the escape symbol itself, incase it was included in the strings. This way, the person using your product doesn't have to be aware of the internal limitations of your compiler.

Next, You'll want to "tokenize." How you do this depends alot on your defined language, but the idea is that you need to be able to break individual lines (parsing) into "tokens," which is alot easier to deal with than trying to write all your logic around parsing the lines every step of the way. I actually did not write a dedicated pass to tokenizing, but, rather, the tokenizing happens with every line on every pass, which is slower, but, once again, you can worry about making it an actual pass after you manage to make something that takes your example source and produces the end result that you desire. Yes, the waterfall model is accurate, but you're not going to be making your ideal compiler first time you try, so embrace first that this is going to be your worst one. Honestly, I'm even having a hard time trying to figure out how to make this it's own step, anyway, since I think it would be more RAM intensive to do so (even moreso than using STL, since you're pretty much forced to use linked lists).

Now we get into the meat. There's often alot of preparation necessary, that we don't think about, what makes everything easier. That's what we did above. Now, we need to build from the ground up if we want to keep things manageable (note: I didn't even implement include yet, as "file" imports raw binary data). First thing I did was I made data directives, which are both usable in the source, but other things in the source can, through other directives (such as "align" in my example), also make them. These will be how you take your raw output and turn it into a file. This ended up being the last step of my compiler, but the first step I made, so that it can be practically tested as I'm building it. It's possible to build the other way, but the task seems way more overwhelming that way, so I recommend doing it like this, so you can see the thing spitting out actual output. For me, used "d#" directives, where # represented the size in bytes, while d meant "raw output." If I had "d4 1" it would output 0x00000001 (thanks to little-endian, 01 00 00 00). You'll probably be building an assembler (what we call a compiler for assembly) before a compiler for another language, so you'll end up doing something similar to me, but, for a higher level language, you'll end up making this something like "asm" and then the assembly of the target machine, which will then run through your assembler, which will then probably make either an obj or an exe (more on this later).

After that's done, you could probably build any step that you want that doesn't rely on a step that's not yet created. As such the order of the following may vary, but I will go in the order that I went. So, next I made a calculator, which then used the tokenizer (this is where you define what is and isn't valid in a variable name [designed actual variables after completing this]). I then used the shunting yard algorithm to convert infix notation (what we're used to) to reverse polish notation (which basically makes things easier to calculate). From there, getting the result is easy using a similar algorithm whose name escapes me at the moment. This allowed me to do put expressions, instead of only numbers, on my output expressions ("d4 1+3").

Next i wanted to be able to use "labels" to hold numbers so that they could be calculated on the fly. These also include "location labels," such which end with colons. I merely had a "vector" of a struc that held both the value and the name of the label, which the expression calculator would then have access to and use when it came accross a variable.

As I'm sure you noticed, this creates a problem if you want to forward reference things (something that's defined in the future). Thus, i expanded my final pass so that it could do "dry runs" (a single pass). During dry runs, it would keep track of bytes hypothetically written (does a simulated write), so the "location labels" would have appropriate values. You'd think there would be chicken-and-egg issues, here, but there isn't, because this state only fake writes bytes, and after this stage you should do the wet run with no other processing, so the actual bytes written count would stay accurate. The wet run (another pass) would modify the "bytes written" again, although, at this point, it is no longer read, since only the dry run could define the value of location labels. From here, you probably have 5 passes (read source into RAM, removing comments, parsing strings, dry run, wet run).

From there, "align," "org," and other such directives are final pass, as well, so they could be defined and tested now. "align" was a special exception to the "bytes written is ignored" rule, because the directive actually writes bytes, but needs to know how many are written before now. You could write a special variable for it during the dry run, or you could just reset the bytes written counter just before the wet run.

All the above ended up making the meat of it all. It's even harder with a high level language, because IF-statements and such need to get handled there. On the flip side, it's easier to handle them there, because runtime things are converted to assembly which, depending on the target, has it's own logical way of handling conditionals, which usually means an incredibly simple conversion process that does not require you actually know the value, since it's run time defined. However, for conditional assembly (the if statements you see in my example) and conditional preprocessor (things like "#ifndef" in C++), you need a run that just before the 2 output passes. For conditionals, i recycled the expression parser and tokenizers, and then did recusion. I actually executed the conditionals, so if a conditional came out false, i'd then look for "endif" or "elseif," but continuing to read through the lines incase i came across a nested if, because i'd need to recursively jump down, to prevent seeing a subordinate endif as the endif for the current level (and, yes, this was a major pain to debug). On a side note, i skipped implementing a "while," since I've never really seen them used in conditional assembly or other pre-processor outside of examples showing that it is actually a feature in a given assembler or compiler. They're available in most other assemblers, but you should be able to use align to fill an area with bytes or something.


Posted October 16th by Kohlrak
Kohlrak


I decided to do a separate pass, for simplicity, to process "macro" statments, which I haven't implemented yet (because it'll be more tedious than challenging). After coding both "macro" and "include," my binary file maker becomes an assembler: instructions are merely macros, where the parameters of the macro will have math and conditional assembly applied to turn it all into those statements that begin with "d4" or something to that effect. My planned syntax looks somewhat like the following (this is simplified for the sake of example here, as it'll be slightly more complex):


oprandtype register, r0=0, r1=1, r2=2, r3=3, r4=4

macro add, x, y
d1 1
if (not register x)
fatal "Cannot use a constant as a destination oprand."
endif
if (not register y)
d1 1 ;y is a constant
else
d1 0
endif
d1 x, y
endmacro

add r2, r1


which will turn into

d1 1
d1 0
d1 2
d1 1


Which presumably works on the target machine (which, in this case, doesn't actually exist, but is described for the purpose of example). The challenge of macro statements, is that you may or may not want to handle nesting, or redefinition, or a nested macro with the same name as a superior macro. My recommendation is to not allow nesting of macros, then implement nesting if you later feel it's actually necessary. Moreover, that "register" thing has to be defined for assemblers, since constants and registers are not the same thing, you'll need some way of identifying if a parameter is actually a register, a number, a memory oprand (if they exist), etc. The good news is, you shouldn't actually need to use the expression parser if the macro pass is separate from the conditional pass, so that "not register" doesn't need to be defined as an operator, but the whole thing in parenthesis can be parsed and disected on this pass and then be turned into "true" or "false" before we go down to the conditional pass. To make it easier, you might want to do some sort of prefix before the parenthesis. The sky is the limit, since you're writing the rules. Just make sure that whatever you do to escape it has it's own escape in case of intentional character usage.

That should give you enough information to get started. Now, as for the binary output seen in the real world, usually if it's for a microarchitecture, the assembler will spit out either a binary or "hex code" (a format read by a flasher to stick an actual binary bit into the ROM of a SOC [system on chip]). If it's more complex (like most assemblers), it'll instead write an object file which will then be used by a separate program called a "linker" to "resolve symbols" (there's many ways of handling this depending on how the ABI [standards for how functions of a given programming language are supposed to call external functions on a given target system] works) when pre-compiled/pre-assembled output is expected to be added to the binary output. My solution is to not build the target into the assembler itself, but to use macros in an include file be added to the source to define how to handle the output. That way, my assembler is portable enough to be used with targets that don't even exist yet.

Usually, the way it works is, you have a compiler that converts source code into assembly, which (unless you give it a special command line option to have it write assembly output) passes the output with "pipes" directly to a second program which acts as an assembler. The second program (assembler) usually writes a ".o" or ".obj" which are known as object files. This is usually done per file, and often adds so much padding to the finaly binary that it becomes an absolute monster. So each object file then gets passed to a "linker" which links the object files together then rewrites variables or instructions so that the final binary output created has everything redirected as necessary that a jmp or call from one file corresponds to the proper code in another file (which is the same file in the resulting binary).

After typing all this, i have a feeling i've failed my mission of demystifying how compilers work, but it's worth posting to see if people get it.

Also, something went wrong when posting this, which is why some of this is within the replies of the post.

Posted October 16th by Kohlrak
Kohlrak
Reply to: How compilers work

Enter your message here


Site Rules | Complaints Process | Register Complaint Facebook Page
GTX0 © 2009-2017 Xhin GameTalk © 1999-2008 lives on