Simple C/Unix Memory Problem Debugging

They don't teach you how to debug programs in computer science courses. It's too practical, and you're expected to "pick it up on the job." I've shown several people at work some basic debugging of common memory problems in C, and I decided to put some examples on the web for the benefit of those of you who don't work for me.

Let's start with a very simple case, a null-pointer dereference. This straightforward problem will serve to introduce the debugger and the information available, and to show a "normal" case for comparison with the more complex cases later. Take the following program:

/* 1 */      #include <stdio.h>
/* 2 */      #include <stdlib.h>
/* 3 */      
/* 4 */      void f2(void)
/* 5 */      {
/* 6 */          int *p=0;
/* 7 */      
/* 8 */          *p=0xd1e;
/* 9 */      }
/* 10 */     
/* 11 */     void f1(void)
/* 12 */     {
/* 13 */         f2();
/* 14 */     }
/* 15 */     
/* 16 */     int main (void)
/* 17 */     {
/* 18 */         f1();
/* 19 */         printf("done\n");
/* 20 */     
/* 21 */         return 0;
/* 22 */     }

The line numbers are in comments so you can cut and paste the code and still compile it. Suppose it's called foo.c. I compile and run it like so:

$ gcc -g -Wall -o foo foo.c
$ ./foo
Segmentation fault

I'm told that Windows/MSVC will tell you that you dereferenced a null pointer, but all you get on Linux is a segfault. So, let's use the debugger to see what's going on:

$ gdb foo
(gdb) run
Starting program: /tmp/foo 

Program received signal SIGSEGV, Segmentation fault.
0x080483f0 in f2 () at foo.c:8
8           *p=0xd1e;
(gdb) print p
$1 = (int *) 0x0
(gdb) backtrace 
#0  0x080483f0 in f2 () at foo.c:8
#1  0x08048403 in f1 () at foo.c:13
#2  0x08048413 in main () at foo.c:18

When we run the program, we see that it dies with a segmentation fault on line 8. Seeing that this line assigns a value to *p, we can look at the value of p, and see that it is null.

An extremely useful feature of the debugger is to show the stack trace. This shows us, here, that main() called f1() which called f2(), with the line numbers of each call. By using the up and down commands in gdb (or clicking on the window in graphical debuggers like ddd) you can move to each function and inspect variables there. This is particularly useful in C++ programming, where actual segmentation faults tend to occur three levels deep in the STL and you need to move up quite a few calls in order to look at your own code and figure out what the problem really is.

Now let's look at a more complex case, a stack overwrite. This case is easy to understand if you've seen it before, but is extremely perplexing the first time you see it. I'm explaining it here so that when you encounter it in your own code, you'll have seen it before.

Here's our new program:

/* 1 */      #include <stdio.h>
/* 2 */      #include <stdlib.h>
/* 3 */      
/* 4 */      void f2(void)
/* 5 */      {
/* 6 */          int a[8];
/* 7 */          int i;
/* 8 */      
/* 9 */          for (i=0; i<16; i++)
/* 10 */             a[i]=0;
/* 11 */     }
/* 12 */     
/* 13 */     void f1(void)
/* 14 */     {
/* 15 */         f2();
/* 16 */     }
/* 17 */     
/* 18 */     int main (void)
/* 19 */     {
/* 20 */         f1();
/* 21 */         printf("done\n");
/* 22 */     
/* 23 */         return 0;
/* 24 */     }

When we run this one, however, the results are somewhat more perplexing:

(gdb) run
Starting program: /tmp/foo 

Program received signal SIGSEGV, Segmentation fault.
0x00000000 in ?? ()
(gdb) backtrace
#0  0x00000000 in ?? ()

There aren't any line numbers! The debugger builds its backtrace from the return addresses saved on the stack, and those are exactly what we overwrote when we ran off the end of the a[] array. The loop filled them with zeros, which is why the program ended up at address 0x00000000.

On many processor architectures, including i386, the stack grows "backwards" - when things are pushed onto the stack, the stack pointer is decreased. When we call f2() from f1(), the return address is pushed onto the stack and the stack pointer is decreased. At the beginning of f2(), the local variables are allocated space on the stack, and the stack pointer is decreased further. This means that the local variables come just before the return address on the stack, and when we write past the end of the local variables, we clobber the addresses on the stack.

It's important to note that the actual segmentation fault usually will NOT occur when we write off the end of the array; after all, we're just writing to other memory that's part of our process' stack. The segmentation fault occurs when we try to return to the address which now points to memory which does not exist. (It's possible, with care, to overwrite the address on the stack with a controlled address instead of garbage; this technique is used deliberately to exploit security holes.)

For a slight variation, reverse the order of the declarations of i and a[] in the example and try again. You should find that this time, instead of generating a segmentation fault, the program will run forever. This happens because with the variables in the other order, the loop variable i comes directly after the a[] array. Thus, a[8] is i (the exact slot depends on how your compiler lays out the stack frame). Every time i gets to 8, it overwrites itself with 0, starting back at the beginning of the array again.

Our third case involves dynamically allocated memory:

/* 1 */      #include <stdio.h>
/* 2 */      #include <stdlib.h>
/* 3 */
/* 4 */      void f2(void)
/* 5 */      {
/* 6 */          int i;
/* 7 */          int *a=malloc(sizeof(int)*8);
/* 8 */      
/* 9 */          for (i=0; i<16; i++)
/* 10 */             a[i]=0;
/* 11 */     }
/* 12 */     
/* 13 */     void f1(void)
/* 14 */     {
/* 15 */         f2();
/* 16 */     }
/* 17 */     
/* 18 */     int main (void)
/* 19 */     {
/* 20 */         f1();
/* 21 */         printf("done\n");
/* 22 */     
/* 23 */         return 0;
/* 24 */     }

Once again, we're overrunning the end of the array. If you compile and run this program, however, you will most likely find that it appears to run successfully.

Because the memory for our array is allocated with malloc(), it comes from a different area of memory (called the heap) and does not interfere with the addresses on the stack. Furthermore, there is no segmentation fault when overwriting the end of the array because this memory is still part of the process. Programs cannot get memory from the operating system in bytes - they must request memory in units of pages. A common page size for Linux on i386 is 4096 bytes. Thus, a small memory overrun is unlikely to cause a segmentation fault - just incorrect behavior. If you increase the end value in the for() statement sufficiently (try increasing it from 16 to 2000, or higher if that still runs), you will find that you do eventually get a segmentation fault.

How can you debug this? In order to find this sort of problem, it's generally necessary to use a helper library. This helper library will alter the behavior of malloc() calls, forcing each allocated block to be placed right at the end of a page, with an unallocated page after it, so that you will get a segmentation fault when you overrun it. One of the simplest and most common of these libraries is Bruce Perens' Electric Fence. Assuming you have it installed, all you need to do is recompile your program like so:

$ gcc -g -Wall -o foo foo.c -lefence

When you then run your program, you'll find that you get a segfault right at the time of memory overrun, and you can observe the stack trace and other information in your debugger as usual. You can read the libefence(3) manpage for more information about the library, such as how to make it check for overrunning the beginning of a memory block instead of the end.

That's it for now! Those three simple cases cover the majority of the memory problems you're likely to have in C or C++, and more complicated problems will often be variations on these. Now that you know what to look for, you should have a much easier time finding memory problems in your own programs.