What do C and Assembler actually compile to?
Let's take a C program.
When you run gcc, clang, or 'cl' on the C program, it will go through these stages:
- Preprocessor (#include, #ifdef, trigraph analysis, encoding translations, comment management, macros...) including lexing into preprocessor tokens and eventually resulting in flat text for input to the compiler proper.
- Lexical analysis (producing tokens and lexical errors).
- Syntactical analysis (producing a parse tree and syntactical errors).
- Semantic analysis (producing a symbol table, scoping information and scoping/typing errors). Also data-flow analysis, transforming the program logic into an "intermediate representation" that the optimizer can work with (often in SSA form). clang/LLVM uses LLVM-IR; gcc uses GIMPLE, then RTL.
- Optimization of the program logic, including constant propagation, inlining, hoisting invariants out of loops, auto-vectorization, and many other things. (Most of the code for a widely-used modern compiler is optimization passes.) Transforming through intermediate representations is just part of how some compilers work, which makes it impossible or meaningless to "disable all optimizations".
- Outputting assembly source (or another intermediate format like .NET IL bytecode).
- Assembling of the assembly into some binary object format.
- Linking of the resulting object code with whatever static libraries are needed, as well as relocating it if needed.
- Output of the final executable in ELF, PE/COFF, Mach-O, or whatever other format.
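To make the optimization stage a bit more concrete, here is a hand-written sketch of one transformation from the list above, hoisting an invariant out of a loop. The function names are made up for illustration, and the second function is only roughly what an optimizer might produce; a real compiler does this on its intermediate representation, not on C source.

```c
#include <stddef.h>

/* As the programmer wrote it: scale * 4 is recomputed on every
   iteration, although neither operand changes inside the loop. */
long sum_scaled_naive(const int *values, size_t count, int scale)
{
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        int factor = scale * 4;          /* loop-invariant expression */
        total += (long)values[i] * factor;
    }
    return total;
}

/* Hand-written equivalent with the invariant hoisted out of the loop.
   Illustration only; not the output of any particular compiler. */
long sum_scaled_hoisted(const int *values, size_t count, int scale)
{
    long total = 0;
    const int factor = scale * 4;        /* computed once, before the loop */
    for (size_t i = 0; i < count; i++) {
        total += (long)values[i] * factor;
    }
    return total;
}
```

Both functions return the same result for the same input; the only difference is how much work happens per iteration.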
In practice, some of these steps may be done at the same time, but this is the logical order. Most compilers have options to stop after any given step (e.g. preprocess or asm), including dumping internal representation between optimization passes for open-source compilers like GCC (-fdump-tree-...).
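As a small sketch of stopping after a given step, the file below can be fed to gcc or clang with the options named in its header comment. The file name stages.c and the function are hypothetical; the three options are the usual gcc/clang ones for stopping early.

```c
/* stages.c (hypothetical file name)
 *
 *   gcc -E stages.c   preprocess only; the flat text goes to stdout
 *   gcc -S stages.c   stop after the compiler proper; writes stages.s
 *   gcc -c stages.c   compile and assemble, but don't link; writes stages.o
 *
 * clang accepts the same three options.
 */
#define SQUARE(x) ((x) * (x))

/* After preprocessing, the return statement reads:
       return ((n) * (n)) + 1;
   and these comments are gone. */
int square_plus_one(int n)
{
    return SQUARE(n) + 1;
}
```

The file deliberately has no main, so it can go as far as the assembling step but would fail at the link step unless another object file supplies one.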
Note that there's a 'container' of ELF or COFF format around the actual executable binary, unless it's a DOS .com executable.
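One way to see the container is to look at the first few bytes of a file. The sketch below checks a handful of well-known magic numbers; the function name and the returned strings are my own, and it makes no attempt to be complete (a raw COFF object, for instance, starts with a machine-type field rather than a fixed signature).

```c
#include <stddef.h>
#include <string.h>

/* Guess the container format from the first bytes of a file image.
   Only a few well-known signatures are checked. */
const char *container_format(const unsigned char *bytes, size_t len)
{
    static const unsigned char elf_magic[4]     = { 0x7f, 'E', 'L', 'F' };
    /* 64-bit Mach-O magic 0xfeedfacf as stored in a little-endian file */
    static const unsigned char macho64_magic[4] = { 0xcf, 0xfa, 0xed, 0xfe };

    if (len >= 4 && memcmp(bytes, elf_magic, 4) == 0)
        return "ELF";
    if (len >= 4 && memcmp(bytes, macho64_magic, 4) == 0)
        return "Mach-O 64";
    if (len >= 2 && bytes[0] == 'M' && bytes[1] == 'Z')
        return "MZ (DOS .exe or PE)";    /* PE files begin with a DOS stub */
    return "unknown, or no container";   /* e.g. a DOS .com is raw code */
}
```

A DOS .com file falls through to the last case because it has no header at all: its first byte is already an instruction.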
You will find that a book on compilers (I recommend the Dragon book, the standard introductory book in the field) will have all the information you need and more.
As Marco commented, linking and loading is a large area and the Dragon book more or less stops at the output of the executable binary. To actually go from there to running on an operating system is a decently complex process, which Levine in Linkers and Loaders covers.
I've wiki'd this answer to let people tweak any errors/add information.
C typically compiles to assembler, just because that makes life easy for the poor compiler writer.
Assembly code always assembles (not "compiles") to relocatable object code. You can think of this as binary machine code and binary data, but with lots of decoration and metadata. The key parts are:
- Code and data appear in named "sections".
- Relocatable object files may include definitions of labels, which refer to locations within the sections.
- Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.
For example, if you compile and assemble (but don't link) this program
#include <stdio.h>
int main () { printf("Hello, world\n"); }
you are likely to wind up with a relocatable object file with
- A text section containing the machine code for main
- A label definition for main which points to the beginning of the text section
- A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"
- A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of a text section.
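The idea of a "hole" that gets filled in later can be sketched with a toy model. Everything here (the struct names, toy_link, the 4-byte absolute address) is invented for illustration and does not match any real object-file format; real call relocations on many targets are PC-relative, for example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A toy model, not a real object-file format. */
struct toy_reloc {
    const char *symbol;   /* label whose value fills the hole */
    size_t      offset;   /* where the 4-byte hole starts in the section */
};

struct toy_symbol {
    const char *name;
    uint32_t    address;  /* final address assigned by the "linker" */
};

/* Fill each 4-byte hole with the address of the label it names.
   Returns 0 on success, -1 if a label is undefined or a hole does
   not fit inside the section. */
int toy_link(unsigned char *section, size_t section_len,
             const struct toy_reloc *relocs, size_t nrelocs,
             const struct toy_symbol *symbols, size_t nsymbols)
{
    for (size_t r = 0; r < nrelocs; r++) {
        const struct toy_symbol *found = NULL;
        for (size_t s = 0; s < nsymbols; s++) {
            if (strcmp(symbols[s].name, relocs[r].symbol) == 0) {
                found = &symbols[s];
                break;
            }
        }
        if (found == NULL)
            return -1;                    /* "undefined reference" */
        if (relocs[r].offset > section_len ||
            section_len - relocs[r].offset < 4)
            return -1;                    /* hole outside the section */
        /* Store the address little-endian, whatever the host order is. */
        for (int b = 0; b < 4; b++)
            section[relocs[r].offset + b] =
                (unsigned char)(found->address >> (8 * b));
    }
    return 0;
}
```

In the hello-world case, the section would be the text section, the single relocation would name printf, and the symbol table would come from whatever library defines printf. If no one defines it, the -1 path corresponds to the familiar "undefined reference" error from a real linker.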
If you are on a Unix system a relocatable object file is generally called a .o file, as in hello.o, and you can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.
I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks, but when they've done that most of them have a pretty good handle on relocatable object code. It's not such an easy thing.