Sunday 13 October 2024

Compiling Knowledge

 


Any software engineer who works with a compiled language will know the almost religious concept of the build. Whether you've broken the build, whether you've declared that it builds on my machine, or whether you've ever felt like you are in a life or death struggle with the compiler. The build is a process that must happen to turn your toil into something useful for users to interact with.

But what is actually happening when your code is being complied? In this post we are certainly not going to take a deep dive into compiler theory, it takes a special breed of engineer to work in that realm, but an understanding of the various processes involved can be helpful on the road to becoming a well rounded developer.

From One to Another

To start at the beginning, why is a compiler needed? Software by the time it runs on a CPU is a set of pretty simple operations knows as the instruction set. These instructions involve simple mathematical and logical operations alongside moving data between registers and areas of memory.

Whilst it is possible to program at this level using assembly language, it would be an impossibly difficult task to write software at scale. As engineers we want to be able to code at a higher level.

Compilers give us the ability to do that by acting as translators, they take the software we've written in a high level language such as C++, C#, Java etc and turn this into a program the CPU can run.

That simple description of what a compiler does belies the complexity of what it takes to achieve that outcome, so it shouldn't be too much of a surprise that implementing it takes several different process and phases.     

Phases

The first phase of compilation is Lexical Analysis, this involves reading the input source code and breaking it up into its constituent parts, usually referred to as tokens. 

The next phase is Syntax Analysis, also know as parsing. This is where the compiler ensures that the tokens representing the input source code conform to the grammer of the programming language. The output of this stage is something called the Abstract Syntax Tree (AST), this represents the structure of the code as described by a series of interconnected nodes in a tree structure representing paths through the code.

Next comes Semantic Analysis, it is at this stage that the compiler checks that the code actually makes sense and obeys the rules of the programming language including its type system. The compiler is checking that variables are declared correctly, that functions are called correctly and any other semantic errors that may exist in the source code. 

Once these analysis phases are complete the compiler can move onto Intermediate Code Generation. At this stage the compiler generates an intermediate representation of what will become the final program that is easier to translate into the machine code the CPU can run.

The compiler will then run an Optimisation stage to apply certain optimisations to the intermediate code to improve overall performance.

Finally the compiler moves onto Code Generation in order to produce the final binary, at this stage the high level language of the input source code has been converted into an executable that can be run on the target CPU.    

Front End, Middle End and Backend

The phases described above are often segregated into front end, middle end and backend. This enables a layered approach to be taken to the architecture of the compiler and allows for a certain degree of independence. This means different teams can work on these areas of the compiler as well as making it possible for different parts of compilers to be re-used and shared.

Front end usually refers to the initial analysis phases and is specific to a particular programming language. Should any of the code fail this analysis errors and warnings will be generated to indicate to the developer which lines of source code are incorrect. In this sense the front end is the most visible part of the compiler that developers will interact with.

The middle end is generally responsible for optimisation. Many compilers will have settings for how aggressive this optimisation is and depending on the target environments may also distinguish between optimising for speed or memory footprint.

The backend represents the final stage where code that can actually run on the CPU is generated.

This layering allows for example for front ends related to different programming languages to be combined with different backends that produce code for particular families of CPUs, with the intermediary representation acting as the glue to bind them together.

As we said at the start of this post understanding exactly how compilers work is a large undertaking. But having any appreciation of the basic architecture and phases will help you deal with those battles you may have when trying to build your software. Compiler messages may sometimes seem unclear or frustrating so this knowledge may save valuable time in figuring out what you need to do to keep the compiler happy.