Zhivago | zloy: you are an idiot.
Zhivago | zloy: Assembly cannot be used to explain how C works.
© from #c channel on irc.freenode.net
I decided to learn C language by the following approach — disassemble the most interesting listings (from K&R of course :]) and investigate what is going on under the hood. I use emacs as editor, compiler, shell, debugger and... coffee machine :), GNU C compiler (gcc) and gdb+gud.el (emacs gdb ui) for debugging. BTW, I will appreciate it if you leave any remarks and criticisms in the comments.
Ok, let's begin.
Today I want examine the guts of a simple program from page 12 chapter 34 of K&R book which counts tabs, spaces, new lines and other symbols in the input and shows the statistics to user when interrupted by Ctrl-D. From my point of view it is one of the most interesting listing from chapter 1 from the book, not too hard and shows various things like defining variables, array and loops which want to analyze. I think it is good enough for my purpose. So, here is the source code:
I compile ths source with -ggdb3 flag to include debugging symbols into resulting binary (gcc -ggdb3 -o count count.c).
Let's see objdump output (I'm going to use intel syntax):
objdump -S -M intel -d count
Before I start investigating I have to say a few words about the stack, stack frames and related things.
The stack is a place in a memory where we can tuck away any number of double (32-bit) or quad (64-bit) words for the time being and access them by their addresses later. In fact it is a way to manage data in the memory. The amd64 Call Stack is upside down. It grows from top (the highest address) to bottom (the lowest address).
The stack frame is a chunk of allocated memory from the stack. Every time any function is called it cuts off enough memory for its arguments and locals from the stack. You see the function PUSHES its locals (Call Stack Frame) below the TOS. This region of memory is called a stack frame. I thought that function obtains its own frame in the memory in the past. But actually it doesn't obtain the memory. The stack's 8mbs are already allocated and belong to the program since the program was started. The stack frame is framed by the rsp (Stack Pointer) register at the bottom (which additionally is a TOS) and rbp which is a pointer to the highest address where stack frame ends. In other words before the function will be called we should to store the address in rsp to rbp (in assembly it looks like mov rbp, rsp) to return to the previous 'state' of the stack when function will end work. Therefore, if a function pushes more values onto the stack, it is effectively growing its frame.
The main() function from 'hello world' is called by __libc_start_main() which is called by __start() routine.
Now, take a look at the output of objdump program:
'push rbp' stores current address of rbp to the stack and than 'mov rbp,rsp' sets the address from rsp to rbp, it is the start of memory which the function can use. The third instruction 'sub rsp,0x40' (the stack grows down, remember?) offsets the stack pointer 64 bytes down. Therefore, these three instructions set up a stack frame for main() function.
In our C 'hello world' program the first thing we did is declared variables and assigned the nwhite and nother variables to 0 value. Here is the output from the objdump:
Some interesting juggling goes on here. The 'mov DWORD PTR [rbp-0xc],0x0' instruction sets the 0 value to 12 bytes offset address, we defined 'c', 'nwhite' and 'nother' as integer type — 4 bytes, so, 4 * 3 = 12. For variable 'i' compiler also allocates 4 bytes but a little bit later, look at for loop below.
Then 'nother's' variable value is set to 0 via eax (32-bit) register. In fact the record 'a = b = 0;' is compiled as 'b = 0; a = b;' that's why the value of nwhite (rbp-0xc) is stored in 32-bit register eax and then assigned to memory address rbp-0x8 of the stack frame.
To be sure that the every element of ndigit doesn't contain garbage from the memory we have to set each element to 0 value using 'for' loop. How compiler translated it to assembly you can see below:
The 'mov DWORD PTR [rbp-0x4],0x0' instruction at 0x400569 address initializes 'i' (rbp-0x4) = 0 in our loop. 0x400570 jumps the cpu to 0x400583 — it is 'cmp DWORD PTR [rbp-0x4],0x9' which verifies that value in rbp-0x4 doesn't exceed 9. jle at address 400587 means Jump if Less or Equal - in circumstance of the first itteration (because [rbp-0x4] = 0) it jumps to 'mov eax,DWORD PTR [rbp-0x4]' at 0x400572 where the value of 'i' is stored in eax register.
The 'cdqe' simple converts double word int eax register into quad word rax register with sign (bit 31).
The next instruction 'mov DWORD PTR [rbp+rax*4-0x40],0x0' contains a flavor of address displacement and I recommend to read it as:
where rbp-0x40 - the 'ndigit' array itself in the stack frame and (rax*4) is displacement inside 'ndigit' and every element of array has integer type (4 bytes). For example, if we want to set 5 to the third element of our array the displacement would be 3 * 4 and our 5 will be placed between the 12th and 16th bits. In the 'add DWORD PTR [rbp-0x4],0x1' we just increment 'i' counter by 'add DWORD PTR [rbp-0x4],0x1' and after it we compare the value of i with 9 again. If the value in i <= 9 we jump to the top of our 'for' loop. It repeats while i won't be > 9. If jle doesn't jump then the next unstruction is executed and from now on we are at the doorstep of the 'while' loop:
At the start of the 'while' loop (offset 0x4005d9) we immediately jump to the address 0x40061b offset 'jmp 40061b' where getchar() function is called. The 'man 3 getchar' describes "fgetc() reads the next character from stream and returns it as an unsigned char cast to an int, or EOF on end of file or error. getc() is quivalent to fgetc() except that it may be implemented as a macro which evaluates stream more than once. getchar() is equivalent to getc(stdin)." it returns the character read as an unsigned char cast to an int or EOF on end of file or error. The returned value is placed in eax register and then assigned to 'c' variable by 'mov DWORD PTR [rbp-0x10],eax' at 0x400620. Then this value is compared with -1 (Ctrl-D):
'cmp DWORD PTR [rbp-0x10],0xffffffff' and if 'с' isn't equal -1 we jump to 0x4005db address 'jne 4005db'. At this address we get into if condition:
where '0x4005db: cmp DWORD PTR [rbp-0x10],0x2f' with 'jle' look if the value in rbp-0x10 is less or equal 0x2f ('/' symbol before '0' in ASCII table) it jumps to the 0x4005ff offset then. The next pair 'cmp DWORD PTR [rbp-0x10],0x39' and 'jg 4005ff' verifies that the value in rbp-0x10 is not grater than 0x39 ('9' symbol in ASCII table) and if so, we jump to the address — 0x4005ff. The reason of these verifications is to determine if 'c' contains number. If so, the byte value in 'c' is stored in eax register (0x4005e7 offset). The point is that the 'c' variable contains the ASCII hex value of the digit (e.g. decimal 6 is 0x36 in hex) that's why the eax value is subtracted by 0x30 (0x4005ea offset) to obtain the 'real' value of the digit in 'c' for further manipulations. Then at address 0x4005ed offset we have to adjust the 32-bit value in eax to 64-bit value because we cannot use 32-bit register in address calculations. Take a look at the offset 0x4005f0, we fetch the value from the memory address (rbp-0x40) + (rdx*4) in eax, where rbp-0x40 - is the start of our 'ndigit' array and rdx*4 - is rdx's element offset which has 4 byte length and just increment it by '0x4005f4: add edx,0x1' instruction.
You should recall what does 'cdqe' do from the above discussion of 'for' loop, if you don't please refer to that section of the post. Eventually, the two instructions which I have to discuss in this chunk of code remain: the '4005f9: mov DWORD PTR [rbp+rax*4-0x40],edx' just stores the value from the edx register to the address displacement of the memory. As usual rbp-0x40 — location of our 'ndigit' array and rax*4 — the element of 'ndigit'. Finally, at the last 'jmp 40061b' instruction CPU jumps to the address displacement 0x40061b where the next byte is fetched by getchar() function from the sequence of input characters.
Next, at the 0x4005ff offset another if condition is located — 'else if (c == ' ' || c == '\n' || c == '\t')', where we increment 'nwhite' or 'nother' counter according to the occurrence of one or another character and of course it has a sequence of conditional jumps inside:
At this moment you should be familiar with cmp and various of conditional jumps, I hope :) The first instruction compares the value of the 'c' variable (rbp-0x10) with 0x20 (ASCII white space) and then jumps (je = Jump if Equal) to the 0x400611 address if it the received character is the white space. Further, we check if 'c' (rbp-0x10) contains 0xa (ASCII new line - \n) and jump to the same 0x400611 address in the case of new line there. The last compre/jump pair verifies the value in 'c' by rbp-0x10 address and if it isn't equal (jne - Jump if Not Equal) 0x9 (ASCII tabulation) we shift to the 0x400617 (there we get the next character from the sequence by calling getchar() function) address where the 'nother' variable is incremented: 'add DWORD PTR [rbp-0xc],0x1'. But if the tab code is present in 'c' variable the CPU jumps to the 0x400611 address where our 'nwhite' variable is incremented by 'add DWORD PTR [rbp-0x8],0x1' instruction. That's all, at this point the 'while' loop ends up.
The other last part of the code prints accumulated data which has been collected in the above code.
First of all we call printf() function to display "digits =" string to the output. The code translated into assembly looks interesting from my point of view and needs some explanation:
In the first instruction we store the address of "digits =" string to eax register. As you might know eax is 1/2 of rax register and we store the whole value in rax into rdi register. The 'mov eax,0x0' is a mark that we have no floating numbers before calling printf(). The last instruction speaks for itself and I hope there are no reasons to explain it in detail :)
Now I'd like to show another example with printf() function. Presume we have a simple C code that prints out a decimal number 23.
By the AMD64 SystemV Calling Convention the first two arguments are put in the rdi and rsi registers (so the instruction at offset address 0x4004f0 loads a pointer to the "%d\n" to rdi, and the 0x4004ee — the value of i to rsi), if you look at the memory address 0x4005b4 you can see that the 4 bytes 0x000a6425 are stored there and in my example we have to read this in little endianess notation, so according to ASCII table 0x25 - "%", 0x64 - "d", 0x0a - new line "\n" and 0x00 - is the null-char.
OK, let's return to our example from K&R book.
To print out each element of the 'digit' array we are using 'for' loop. The C code expands to assembly like:
I've already explained the 'for' loop and calling printf() function above, therefore, I won't describe every instruction in detail.
Briefly, we set 0x0 value to 'i' variable then jump to 0x400665 address where 'i' is compared with 0x9 value. Then we jump back to the 0x400644 address, store the value of 'i' in eax register, convert eax to rax (because only 64-bit register can be used in address displacement calculation in 64-bit system) with sign, store the value of the first element of 'digit' array to edx and set up registers before call printf() function, call printf(), increment 'i' and repeat everything again and again until `i' exceeds 0x8.
If you recall the rbp-0xc and rbp-0x8 is the 'nwhite' and 'nother' variables accordingly. In the last fragment of assembly we simply set up more registers (then previous explanation of the printf example code) before calling the printf() function because it has more parameters and of course call it.
The last two instructions which I haven't mentioned yet are 'leave' and 'ret'. At the beginning of this post I explained what is the stack frame and how the compiler sets up registers to allocate a region from the memory for our main() function. Linux kernel treats program as function. The leave instruction is short for 'mov rsp, rbp pop rbp'. In other words we return the rbp/rsb to what they were before we entered main(). The ret instruction pops the return address off of the stack and jumps to it. That's all.
I want to thank cottidianus
for his help and patience while I wrote this post :)
The sources which I used to create this post:Intel x86 Function-call Conventions - Assembly ViewAll About EBPIntel® 64 and IA-32 Architectures Software Developer's Manual
Combined Volumes 3A, 3B, and 3C: System Programming Guide, Parts 1 and 2