My last post demonstrated that the average coder has trouble understanding how you get from code in source language X to a compiled binary for machine Z. The first thing to understand is that, in reality, we are often just doing multiple layers of the same thing: converting X to Y, then converting Y to Z with the same process. Now, rather than using the actual terms taught in compiler classes, I'm going to speak in terms low-level and direct enough that you should be able to accomplish this task on your own, as opposed to solving the problem by downloading libraries A, B, and C to do it for you. Why? Well, why try to understand how compilers work if you're not going to bother learning how A, B, and C work, too? Yet this is the mistake I constantly see in "build your own compiler" tutorials: they use terms like "lexer" but explain neither how one works nor why you even need one (so we pretend we're learning how compilers work, but only by breaking them down into smaller pieces that we still don't understand). The saddest part is that college courses are taught like this.
What I'm going to propose is the system I used (well, am still building, since the project isn't completed yet due to laziness), which has been successful so far and seems to model the ones actually in use. What I've learned from experience, though, is that a lot of what I come up with myself and think is absolutely brilliant is what everyone else is already doing, so I've learned to stop being so hard on myself when I think a problem could be solved a better way but can't think of how. I'm not going to go so far in depth that I practically write the code here (especially not into how to check for syntax errors, as that becomes self-evident as you're writing), but I'm going to go deep when I need to. Odds are you're going to need to read this more than once to understand it, since the big picture is understood by looking at the smaller interlocking pictures. Also, for clarity: I use "assembler" and "compiler" almost interchangeably here, but there is a distinction. Making or using them is mostly the same, but compilers are meant to convert one coding language to another coding language (usually assembly), while assemblers are meant to produce a "binary output."
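To make the "X to Y, then Y to Z" layering concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the one-statement source language, the three-mnemonic assembly, and the opcode numbers are all hypothetical and have nothing to do with my actual assembler. The only point is that the compiler stage emits text in another language, the assembler stage emits bytes, and the two are just chained.

```python
# Hypothetical toy languages, invented here purely for illustration.
OPCODES = {"PUSH": 0x01, "ADD": 0x02, "PRINT": 0x03}

def compile_to_asm(source):
    """Stage 1 (compiler): lines of the form 'print A + B' -> toy assembly text."""
    asm = []
    for line in source.splitlines():
        words = line.split()
        if not words:
            continue  # skip blank lines
        if len(words) != 4 or words[0] != "print" or words[2] != "+":
            raise SyntaxError(f"cannot compile: {line!r}")
        asm += [f"PUSH {int(words[1])}", f"PUSH {int(words[3])}", "ADD", "PRINT"]
    return "\n".join(asm)

def assemble(asm):
    """Stage 2 (assembler): toy assembly text -> raw bytes."""
    out = bytearray()
    for line in asm.splitlines():
        mnemonic, *operands = line.split()
        out.append(OPCODES[mnemonic])                 # one opcode byte
        out += bytes(int(op) for op in operands)      # one byte per operand (0..255)
    return bytes(out)

binary = assemble(compile_to_asm("print 2 + 3"))
print(binary.hex())  # 010201030203
```

Both stages have the same shape (read lines, recognize a pattern, emit output in the next language down), which is what I mean by doing the same thing in layers.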
The biggest challenge with this, or OS dev, or any other huge project, is that it's not really one huge project but a collection of smaller projects, and once you break it down into its components, it becomes easier. Until you do, the task is scary. For example, a lot of people developing the tokenizer worry that it's a waste of time because they don't know how to get things into the correct order of operations, when that's miles down the road and has an easy solution they just don't know yet, so it is more overwhelming than it is difficult. Another thing to consider is that the first time you do it will be the hardest, so focus on getting it to work, accept that it was more of an educational project, and then rebuild new ones from scratch, using what you learned but not recycling code. It seems scary, but once you've done it, it really is incredibly easy.
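As a rough illustration of why the tokenizer doesn't need to know anything about order of operations, here is a small Python sketch. The token kinds and the regular expression are my own assumptions for this example, not the real thing; all it does is chop text into a flat list of labeled pieces and leave every ordering decision to a later stage.

```python
import re

# Hypothetical token kinds for illustration only.
TOKEN_RE = re.compile(r"""
    (?P<number>0[xX][0-9a-fA-F]+|\d+)   # hex or decimal literal
  | (?P<name>[^\W\d]\w*)                # identifier; Unicode letters are allowed
  | (?P<op>[-+*/%(),:])                 # single-character operators and punctuation
  | (?P<space>\s+)                      # whitespace, dropped below
""", re.VERBOSE)

def tokenize(text):
    """Return a flat list of (kind, text) pairs; no precedence is applied here."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "space":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("2+3*4"))
print(tokenize("d4 start, 0x01234566+1"))
```

Note that `2+3*4` comes out as five tokens in plain left-to-right order. Whether the `*` binds tighter than the `+` is a question for whatever consumes the list, so the tokenizer is never wasted work.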
The first step is to define a language. I defined mine loosely early on and constantly refined it while I built it. This tactic is not the most reliable method, especially as you may end up reinventing an existing programming language. The language itself will heavily influence how efficient your compiler really is, both at producing good binaries and at getting the code compiled in a reasonable time frame. You also want to know what the output should look like. My first compiler is actually an assembler that doesn't even have a target machine, which seems useless until you see it in action and realize that it thus has the potential to be used for *any* machine, since the target is described not in my assembler's source but in the source passed to it at run time. This is probably the easiest method for a first-timer to go with, since you don't have to learn a target machine (my purpose, really, was to have an assembler that would later target a VM I plan on making). A nice example syntax is the following (this is actually what's working in my assembler, but you don't have to understand what each bit does to understand the big picture):
org 0x7c00 ;Adjusts file so all labels representing locations are based on offset from last org.
start: d1 'meow\n', 0 ;Unescaped string with escape character
babysfirstlabel: ;Points to the next byte, which happens to be the 7th.
d1 "meow\n", 0 ;Same string, only escaped for real
align 0x10, 0xcc ;Alignment makes it easier to read
日本語のlabel: d4 start, 日本語のlabel, babysfirstlabel, 0x01234566+1
dx 2, 8, 0xFF, 4, 0xa00b ;For making instructions. syntax: byte, limit bit, data, lsh, OR data
align 0x10, 0xcc
align 0x10, 0 ;This should never print
file "binfile_for_import.bin", 0, 0
align 0x10, 0xcc
IF a % 2
And the output (courtesy of xxd running on Termux):
00000000: 6d65 6f77 5c6e 006d 656f 770a 00cc cccc meow\n.meow.....
00000010: 007c 0000 107c 0000 077c 0000 6745 2301 .|...|...|..gE#.
00000020: fbaf cccc cccc cccc cccc cccc cccc cccc ................
00000030: 6865 6c6c 6f20 6672 6f6d 2066 696c 6521 hello from file!
00000040: 00cc cccc cccc cccc cccc cccc cccc cccc ................
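If you want to check the listing against the dump yourself, here is a minimal Python sketch that reproduces the first 0x30 bytes. It is a reading of the directives inferred from the hex dump, not the assembler's real code: the `Emitter` class and its method names are made up, `d4` is assumed to be little-endian, and `dx` is assumed to mean "mask the data to the limit bits, shift left, OR, then write that many bytes little-endian." The `file` and `IF` lines are left out, and labels are computed before use, so forward references are not handled.

```python
import struct

class Emitter:
    """Toy byte emitter; directive semantics are guesses inferred from the dump."""

    def __init__(self, org=0):
        self.org = org
        self.out = bytearray()

    def here(self):
        # Address a label would get at the current position.
        return self.org + len(self.out)

    def d1(self, *items):
        # Each item is either raw bytes (a string) or a single byte value.
        for item in items:
            if isinstance(item, bytes):
                self.out += item
            else:
                self.out.append(item & 0xFF)

    def d4(self, *values):
        # 32-bit values, assumed little-endian.
        for v in values:
            self.out += struct.pack("<I", v & 0xFFFFFFFF)

    def dx(self, nbytes, limit_bits, data, lsh, or_data):
        # Assumed meaning: mask, shift, OR, then emit nbytes little-endian.
        value = ((data & ((1 << limit_bits) - 1)) << lsh) | or_data
        value &= (1 << (8 * nbytes)) - 1
        self.out += value.to_bytes(nbytes, "little")

    def align(self, boundary, fill):
        # Pad with the fill byte until the offset is a multiple of boundary.
        while len(self.out) % boundary:
            self.out.append(fill)

e = Emitter(org=0x7C00)
start = e.here()
e.d1(b"meow\\n", 0)            # single-quoted form: backslash and 'n' kept as-is
babysfirstlabel = e.here()
e.d1(b"meow\n", 0)             # double-quoted form: a real newline byte
e.align(0x10, 0xCC)
jp_label = e.here()            # stands in for the Japanese-named label
e.d4(start, jp_label, babysfirstlabel, 0x01234566 + 1)
e.dx(2, 8, 0xFF, 4, 0xA00B)
e.align(0x10, 0xCC)
e.align(0x10, 0)               # already aligned, so nothing is emitted
for offset in range(0, len(e.out), 16):
    print(f"{offset:08x}: {e.out[offset:offset + 16].hex()}")
```

The three printed rows should line up with the first three rows of the xxd output above, which is a decent sanity check that the assumed meanings are at least consistent with what the assembler produced.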