Understanding buffer overflows (I)
In this series of articles I will try to explain the basics behind buffer overflow exploits. I will try to make this as easy and accessible as possible, but I cannot start by explaining what a computer is or how it works. So the knowledge required to understand this series is as follows:
- Basic computer (Von Neumann) architecture - knowing how a computer works, what is a CPU, what is memory, what is the stack, what is an address.
- Basic assembly - knowing what machine code means and how it translates to assembly.
- Basic C (with pointers) - comfortable with C and pointers.
- Comfortable user level on Linux/UNIX - experience working in a UNIX-like environment.
You will also need the following tools:
- Your favorite 32-bit Linux distro (I will be using Ubuntu 12.04).
- GCC C compiler (I will be using 4.6.3) (sudo apt-get install build-essential). If you never used GCC don't worry, I'll cover its basic usage in this series.
- GDB debugger (I will be using 7.4-2012.04) (sudo apt-get install gdb). I will also cover GDB usage.
- NASM assembler (I will be using 2.07) (sudo apt-get install nasm). Small, fast, and with support for various syntaxes and output formats. I can't live without it.
- Execstack (I will be using 1.0) (sudo apt-get install execstack). Only needed for compiling, you don't need to worry about this one.
Here is a quick reference to Intel instructions for the curious.
The vulnerable program
First we're going to build a very simple exploitable program and use it to understand how buffer overflows work.
buffoverflow1.c
#include <stdio.h>

int main()
{
    char name[10];
    printf("Please input your name: ");
    scanf("%s", name);
    printf("Your name is %s\n", name);
    return 0;
}
As you can see, this does not do much except wait for a string and then print it out. Nothing fancy. So let's try compiling it. I will first remove the buffer overflow protections from both the OS and the C compiler. First, turn off ASLR (this needs root; don't forget to turn it on again after you've finished):
echo 0 > /proc/sys/kernel/randomize_va_space
Then compile with the stack protector disabled and stack execution enabled.
cc -fno-stack-protector -z execstack -g -o buffoverflow1 buffoverflow1.c
Let's execute this:
./buffoverflow1
Please input your name: David
Your name is David
Looks like it's working correctly. What's the problem with this program?
- The buffer for the input string (name) is too small.
- The buffer has a fixed size and lives on the stack.
- It uses scanf() with a bare %s, which does no bounds checking on the buffer.
Please input your name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Your name is aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Segmentation fault (core dumped)
Hmmm looks interesting. Let's see what GDB can show us about this.
gdb -q buffoverflow1
Reading symbols from /some/path/buffoverflow1...done.
(gdb) r
Starting program: /some/path/buffoverflow1
Please input your name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Your name is aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Program received signal SIGSEGV, Segmentation fault.
0x61616161 in ?? ()
As you can see, the program crashed because of some strange value, 0x61616161. To see what's happening, let's put a breakpoint after the second printf(), at line 9.
(gdb) b 9
Breakpoint 1 at 0x8048474: file buffoverflow1.c, line 9.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /some/path/buffoverflow1
Please input your name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Your name is aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Breakpoint 1, main () at buffoverflow1.c:9
9 return 0;
Now we stopped at the return. Let's see the assembly for the main function.
(gdb) disas main
Dump of assembler code for function main:
   0x08048434 <+0>:     push   %ebp
   0x08048435 <+1>:     mov    %esp,%ebp
   0x08048437 <+3>:     and    $0xfffffff0,%esp
   0x0804843a <+6>:     sub    $0x20,%esp
   0x0804843d <+9>:     mov    $0x8048550,%eax
   0x08048442 <+14>:    mov    %eax,(%esp)
   0x08048445 <+17>:    call   0x8048340 <printf@plt>
   0x0804844a <+22>:    mov    $0x8048569,%eax
   0x0804844f <+27>:    lea    0x16(%esp),%edx
   0x08048453 <+31>:    mov    %edx,0x4(%esp)
   0x08048457 <+35>:    mov    %eax,(%esp)
   0x0804845a <+38>:    call   0x8048370 <__isoc99_scanf@plt>
   0x0804845f <+43>:    mov    $0x804856c,%eax
   0x08048464 <+48>:    lea    0x16(%esp),%edx
   0x08048468 <+52>:    mov    %edx,0x4(%esp)
   0x0804846c <+56>:    mov    %eax,(%esp)
   0x0804846f <+59>:    call   0x8048340 <printf@plt>
=> 0x08048474 <+64>:    mov    $0x0,%eax
   0x08048479 <+69>:    leave
   0x0804847a <+70>:    ret
End of assembler dump.
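As you can see, we're about to execute the mov $0x0,%eax instruction (EAX is the register used to return values from functions). Let's step through the rest of main to see where the segmentation fault happens. Note that GDB's s command steps by source line: line 9 is this single MOV instruction, while line 10 covers both LEAVE and RET.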
(gdb) s
10 }
(gdb) disas main
Dump of assembler code for function main:
   0x08048434 <+0>:     push   %ebp
   0x08048435 <+1>:     mov    %esp,%ebp
   0x08048437 <+3>:     and    $0xfffffff0,%esp
   0x0804843a <+6>:     sub    $0x20,%esp
   0x0804843d <+9>:     mov    $0x8048550,%eax
   0x08048442 <+14>:    mov    %eax,(%esp)
   0x08048445 <+17>:    call   0x8048340 <printf@plt>
   0x0804844a <+22>:    mov    $0x8048569,%eax
   0x0804844f <+27>:    lea    0x16(%esp),%edx
   0x08048453 <+31>:    mov    %edx,0x4(%esp)
   0x08048457 <+35>:    mov    %eax,(%esp)
   0x0804845a <+38>:    call   0x8048370 <__isoc99_scanf@plt>
   0x0804845f <+43>:    mov    $0x804856c,%eax
   0x08048464 <+48>:    lea    0x16(%esp),%edx
   0x08048468 <+52>:    mov    %edx,0x4(%esp)
   0x0804846c <+56>:    mov    %eax,(%esp)
   0x0804846f <+59>:    call   0x8048340 <printf@plt>
   0x08048474 <+64>:    mov    $0x0,%eax
=> 0x08048479 <+69>:    leave
   0x0804847a <+70>:    ret
End of assembler dump.
We stepped one instruction so we're now at LEAVE instruction. Let's step once again.
(gdb) s
Warning: Cannot insert breakpoint 0.
Error accessing memory address 0x61616161: Input/output error.
0x61616161 in ?? ()
Ok it's either the LEAVE or the RET instruction that is causing the segment violation trying to access address 0x61616161. If it is trying to access memory address 0x61616161 then some register has to contain this value. Let's check:
(gdb) i r esp
esp 0xbffff120 0xbffff120
(gdb) i r ebp
ebp 0x61616161 0x61616161
Bingo!
Stack is fun!
As we saw above, EBP holds the value 0x61616161. But first, what is the EBP register? EBP is used to point to the bottom of the local stack frame. If you check main's disassembly you can see how EBP is saved onto the stack with PUSH, and then set to the value of ESP when entering the function.
0x08048434 <+0>: push %ebp
0x08048435 <+1>: mov %esp,%ebp
ESP is the stack pointer, which points to the current top of the stack. EBP thus indicates the bottom of the stack for this function, that is, where this function's part of the stack starts. Generally EBP >= ESP, since the stack grows downward in memory. This is called the function entry protocol (more commonly, the function prologue).
The LEAVE instruction does the opposite of the two instructions above: it first sets ESP = EBP and then restores EBP from the stack, leaving ESP and EBP with the values they had before entering this function. This is called the function exit protocol (more commonly, the function epilogue).
The RET instruction returns from a function previously called with the CALL instruction. CALL pushes the return address (the address where execution must continue once the function has finished) onto the top of the stack. RET just pops this return address from the top of the stack and puts it into EIP, the instruction pointer (the register that holds the address of the next instruction to be executed). By the way, what value does EIP hold?
(gdb) i r eip
eip 0x61616161 0x61616161
And bingo number 2. This is what is causing the segmentation fault. This means we somehow managed to overwrite the return address of main function. But how did this happen?
Local variables are allocated on the stack. So if you define a variable inside a function, the space to hold this variable is taken from the stack. This means the name variable in our vulnerable program lives on the stack. We can work out the address of name by taking a look at main's code.
0x08048453 <+31>: mov %edx,0x4(%esp)
0x08048457 <+35>: mov %eax,(%esp)
0x0804845a <+38>: call 0x8048370 <__isoc99_scanf@plt>
This is the call to scanf, as you can guess from the CALL instruction. The two previous MOV instructions pass the function parameters. On 32-bit x86 with the default C calling convention (cdecl), parameters are passed on the stack in reverse order. For scanf, we have 2 parameters, so the first MOV stores the second parameter, the pointer to our buffer, at ESP + 4. This is what we need. We will run the program again, put a breakpoint there, and check what value is stored at ESP + 4.
(gdb) b *0x0804845a
Breakpoint 1 at 0x804845a: file buffoverflow1.c, line 7.
(gdb) r
Starting program: /some/path/buffoverflow1
Breakpoint 1, 0x0804845a in main () at buffoverflow1.c:7
7 scanf("%s", name);
(gdb) p /x $esp + 4
$1 = 0xbffff0f4
(gdb) x /xw 0xbffff0f4
0xbffff0f4: 0xbffff106
So first I set the breakpoint as said, then run the program. Once it stops at the scanf call, I calculate ESP + 4 = 0xbffff0f4. This is the stack slot that holds scanf's second parameter, the pointer to the name variable. Reading that slot gives 0xbffff106, the address of the memory where our input is going to be stored. We can check what's in the buffer before the scanf call. Here's the dump of 20 bytes of the buffer:
(gdb) x /20xb 0xbffff106
0xbffff106: 0x00 0x00 0x89 0x84 0x04 0x08 0xf4 0xff
0xbffff10e: 0xfb 0xb7 0x80 0x84 0x04 0x08 0x00 0x00
0xbffff116: 0x00 0x00 0x00 0x00
Just garbage, meaningless numbers. Now we can execute the scanf and check this buffer again.
(gdb) s
Please input your name: Jason
8 printf("Your name is %s\n", name);
(gdb) x /20xb 0xbffff106
0xbffff106: 0x4a 0x61 0x73 0x6f 0x6e 0x00 0xf4 0xff
0xbffff10e: 0xfb 0xb7 0x80 0x84 0x04 0x08 0x00 0x00
0xbffff116: 0x00 0x00 0x00 0x00
Now it contains our input:
0x4a 0x61 0x73 0x6f 0x6e 0x00
J    a    s    o    n    \0
Now as we said before, the return address is also stored in the stack. To find out where, we just have to stop just before the RET instruction is executed and check the top of the stack.
(gdb) c
Continuing.
Your name is Jason
Breakpoint 2, 0x0804847a in main () at buffoverflow1.c:10
10 }
(gdb) disas main
Dump of assembler code for function main:
   0x08048434 <+0>:     push   %ebp
   0x08048435 <+1>:     mov    %esp,%ebp
   0x08048437 <+3>:     and    $0xfffffff0,%esp
   0x0804843a <+6>:     sub    $0x20,%esp
   0x0804843d <+9>:     mov    $0x8048550,%eax
   0x08048442 <+14>:    mov    %eax,(%esp)
   0x08048445 <+17>:    call   0x8048340 <printf@plt>
   0x0804844a <+22>:    mov    $0x8048569,%eax
   0x0804844f <+27>:    lea    0x16(%esp),%edx
   0x08048453 <+31>:    mov    %edx,0x4(%esp)
   0x08048457 <+35>:    mov    %eax,(%esp)
   0x0804845a <+38>:    call   0x8048370 <__isoc99_scanf@plt>
   0x0804845f <+43>:    mov    $0x804856c,%eax
   0x08048464 <+48>:    lea    0x16(%esp),%edx
   0x08048468 <+52>:    mov    %edx,0x4(%esp)
   0x0804846c <+56>:    mov    %eax,(%esp)
   0x0804846f <+59>:    call   0x8048340 <printf@plt>
   0x08048474 <+64>:    mov    $0x0,%eax
   0x08048479 <+69>:    leave
=> 0x0804847a <+70>:    ret
End of assembler dump.
(gdb) i r esp
esp 0xbffff11c 0xbffff11c
(gdb) x /wx 0xbffff11c
0xbffff11c: 0xb7e334d3
So the return address is stored at 0xbffff11c and its value is 0xb7e334d3. We can check which function this address belongs to:
(gdb) disas 0xb7e334d3
Dump of assembler code for function __libc_start_main:
   0xb7e333e0 <+0>:     push   %ebp
   0xb7e333e1 <+1>:     push   %edi
   0xb7e333e2 <+2>:     push   %esi
   0xb7e333e3 <+3>:     push   %ebx
   0xb7e333e4 <+4>:     call   0xb7f44ee3
   0xb7e333e9 <+9>:     add    $0x18cc0b,%ebx
   0xb7e333ef <+15>:    sub    $0x5c,%esp
[Removed for brevity]
---Type <return> to continue, or q to quit---q
Quit
Bingo, looks like this is actually the return address.
Mixing apples with pears
As we saw before, our buffer is stored at 0xbffff106 and the return address is at 0xbffff11c. These addresses are pretty close to one another, only 22 bytes apart. And given that scanf does no boundary/overflow check, we should be able to overwrite this return address if we write enough data into the stack at 0xbffff106: precisely, 22 bytes of padding plus 4 bytes that overwrite the return address. Let's check this.
(gdb) b *0x0804847a
Breakpoint 1 at 0x804847a: file buffoverflow1.c, line 10.
(gdb) r
Starting program: /some/path/buffoverflow1
Please input your name: 1234567890123456789012aaaa
Your name is 1234567890123456789012aaaa
Breakpoint 1, 0x0804847a in main () at buffoverflow1.c:10
10 }
(gdb) x /32bx 0xbffff106
0xbffff106: 0x31 0x32 0x33 0x34 0x35 0x36 0x37 0x38
0xbffff10e: 0x39 0x30 0x31 0x32 0x33 0x34 0x35 0x36
0xbffff116: 0x37 0x38 0x39 0x30 0x31 0x32 0x61 0x61
0xbffff11e: 0x61 0x61 0x00 0x00 0x00 0x00 0xb4 0xf1
The name we entered is exactly 26 characters, that is 26 bytes, as calculated above. The last four characters are a, whose ASCII code is 0x61, so the four a's are 0x61616161. As you can see in the memory dump, the value stored at 0xbffff11c is now 0x61616161. Just continuing the program will make it crash with the same segmentation fault we got earlier.
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x61616161 in ?? ()
We can try another value instead of aaaa so you can see this better. Let's try ABCD (bytes 0x41 0x42 0x43 0x44).
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /some/path/buffoverflow1
Please input your name: 1234567890123456789012ABCD
Your name is 1234567890123456789012ABCD
Breakpoint 1, 0x0804847a in main () at buffoverflow1.c:10
10 }
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x44434241 in ?? ()
(0x44434241 is what the bytes of ABCD, 0x41 0x42 0x43 0x44, give when read from memory as a little-endian 32-bit value.)
As you see, we can now freely manipulate main's return address: we can set it to any value we want just by changing the input given to the program. This is a very serious security problem, as we will see.