Presented by David Shea on September 10, 2003
Table of Contents
- 1. Overview
- 2. Editing
- 2.1. The text editor
- 2.2. Navigational tools
- 3. Compiling
- 3.1. The compiler
- 3.2. make
- 4. Debugging
- 4.1. Static Analysis
- 4.2. Interactive Debuggers
- 4.2.1. gdb
- 4.2.2. ddd
- 4.2.3. Memory Debugging Tools
- 4.2.4. Electric Fence
- 4.2.5. dmalloc
- 5. Resources
1. Overview
Unix and Linux are often seen as operating systems well suited for
programmers. There are a variety of powerful tools available, and common
standards allow for programs to be moved across systems and architectures
without any major changes to the source code. Rather than using a
particular piece of software to provide an integrated development
environment in which to program, all of Unix acts as a development
environment. IDE software is rarely seen, with programmers preferring
to use a text editor and a command line.
This method of programming, where each component for editing, compiling,
and debugging the program is a separate entity, may seem confusing at
first, but there are numerous tools available to bring everything together
and make programming as pleasant a task as possible.
2. Editing
2.1. The text editor
The choice of text editor is a personal decision that can cause
pointless, heated arguments, so this article will make no attempt to
recommend any editor above any other. Just choose an editor that makes
you happy. Features that you may want to look for in more powerful editors
are syntax highlighting, easy navigation within and among files, and
automatic indentation.
2.2. Navigational tools
Navigating a project that spans several files can quickly become a
headache, so tools were created to quickly find a particular function or
definition. In general, the text editor provides an interface to these
tools, so they can be used from within the editor to move to different
files and functions.
Ctags allows for rapid navigation by scanning all of the source files
in a project and creating a tags file that contains the locations of
functions, global variables, and struct definitions. This file is then
used by the editor to jump to a particular location. Many editors
support ctags, and emacs provides a similar tool with etags.
Cscope provides for similar navigation of a C project, but aims more
at being a navigation tool that runs an editor instead of a tool
used by an editor. It supports searches based on function names, global
definitions, or arbitrary strings of text.
3. Compiling
3.1. The compiler
Even if you’ve never used the compiler on a Unix system, you’ve probably
seen huge, incomprehensible blocks of output from make as a program
is compiled. Invoking the C compiler isn’t actually that complicated.
The C compiler on most systems is called cc. On Linux, cc will likely
be a symlink to gcc, the C compiler from the GNU Compiler Collection. cc
usually doesn’t do anything on its own, but rather provides a single
interface to the preprocessor, compiler, assembler, and linker. To
compile a program, simply call cc and give it the name of your source
file.
cc test.c
This will create a file a.out that can then be run just like any other
program.
$ ./a.out
Hello, world!
$
There are a few things to note here. First, this is one of the few
times in Unix where the extension of the file matters. If the filename
does not end in ".c", the compiler will assume that it is
a pre-compiled object file, and fail with some sort of cryptic error
such as "Bad magic number". Second, the default output
filename is a.out, which stands for assembler output. This can be
changed by adding "-o" followed by the desired filename to the command
line. Lastly, unlike some other operating systems, such as DOS, the
current directory may not always be in the path searched for executable
programs. To run a program in the current directory, you either need
to add ‘.’ to the PATH environment variable, or prefix its name with ./
as shown above. Adding ‘.’ to your PATH can be dangerous, though, since
there could be malicious programs in your current directory with names
like "ls" or "cp", so just using ./ is your safest bet.
Multiple files can be compiled into a single program simply by adding
them to the command line. However, it can become tiresome to need to
recompile every file each time a change is made in only one source
file. To get around this, you can keep compiled object files for each
source file and then link all of the object files together into one
program. No one knows how to use the linker by itself, so just let the
compiler figure it out.
cc -c one.c
cc -c two.c
cc -c three.c
cc -o numbers one.o two.o three.o
Preprocessor definitions can be manipulated from the command line. For
example, you may have something like the following in your source file:
#ifdef DEBUG
printf("debugging output\n");
#endif
This print statement will only be compiled into the program if it occurs
after a #define DEBUG statement. However, rather than needing
to modify the source files every time a symbol needs to be changed, the
compiler can define symbols using the -D flag. cc -DDEBUG
would compile the program as if DEBUG were defined at the top. -U does
the opposite of -D, so cc -UDEBUG would explicitly undefine
the DEBUG symbol.
Doing anything useful in C usually requires code from a library. For
example, printf is commonly used, but most people never write their own
printf. The printf declared in <stdio.h> is supplied for you: when the
compiler links the program, it adds a reference to the printf code in
libc, the standard C library. Other libraries exist, but they must be
explicitly requested. A commonly encountered problem is to be unable to
compile a program that uses floating-point math functions, even though
the functions are defined in ANSI C and all the right headers are being
used. This is because the functions in <math.h> are implemented in libm,
the math library, instead of libc. To tell the C compiler to link to
libm, add ‘-lm’ to the command line. Libraries are searched in the order
they are given, so -lm belongs after the files that use it.
cc -o math_stuff math_stuff.c -lm
The other standard options are -E, which only runs the preprocessor;
-S, which outputs assembly instead of binaries; -g, which compiles the
program with debugging symbols; -I, which adds a
directory to the path to search for include files; -L,
which adds a directory to be searched for library files; and
-O followed by a number, which sets the optimization level.
The meaning of any particular number is dependent on the compiler. gcc
has levels 1, 2, and 3, with higher numbers theoretically outputting
faster code.
gcc also has many options beyond those defined by POSIX. Perhaps the
most useful of these is -Wall, which, despite its name, turns on not
every warning but a large set of the most commonly useful ones.
When the compiler gives you a warning about something in your code, there’s
usually a good reason for it, and it would be worthwhile to
investigate. The NetBSD project goes so far as to require that all of
NetBSD compile with gcc -Wall -Werror, which will cause any
compiler warnings to be treated as errors, and thus cause compilation with
warnings to fail.
Another common option is "-ansi", which enables stricter C89
compliance. It turns off some gcc extensions that could possibly conflict
with identifier names. For most people, this option is undesirable, since
it also turns on trigraphs, which take three otherwise normal characters
and treat them as something new and unexpected.
Trigraphs were created for people using certain international character
sets where the codes for various accented characters conflict with
those for less commonly used punctuation marks, such as square brackets or
backslashes. For most people, trigraphs are only useful as an explanation
for why "??!" was printed as "|".
3.2. make
Keeping track of which source files have been changed since the last
time they were compiled can become nearly impossible even with only a
handful of files. To solve this problem, there is make. make reads a
set of targets and dependencies, specified in a Makefile, and if the
dependencies have a newer modified time than their target, it executes a
rule to rebuild it.
Makefiles can themselves become very complicated, and compatibility
among different variants of make can be an issue. GNU make is usually
seen as having the most useful set of extensions, and is available on
many systems as ‘gmake’.
Rather than trying to cover all of make, here’s an example Makefile
to compile each source file in a project to an object file, and link all
the object files into one program.
CC = gcc
CFLAGS = -O2 -g
LDFLAGS = -lm
SRC = foo.c bar.c baz.c
OBJ = $(SRC:.c=.o)
foobarbaz: $(OBJ)
	$(CC) -o $@ $^ $(LDFLAGS)
.c.o:
	$(CC) $(CFLAGS) -c $<
When ‘make’ is typed from the command line, make will look for the
first target specified and attempt to build it. In this case the first
target is foobarbaz, which depends on $(OBJ). $(OBJ) contains three
targets: foo.o, bar.o, and baz.o. The .c.o target is a suffix rule that
tells make how to build any object file from the source file of the same
name, since specifying each of them by hand would be rather annoying.
Three automatic variables are being used here: $@, which expands to the
name of the target, $<, which expands to the first item in the
dependency list, and $^ (a GNU make extension), which expands to the
entire dependency list.
It is important to note that the indentation of the rules has to
be a tab character. If 8 spaces are used instead, bad things happen,
and none of them involve actually compiling your program.
Another quirk about Makefiles that can cause trouble is that every
command line under a target is run in its own separate subshell. So if a
Makefile has something like the following
stuff:
	cd hejaz
	dostuff
the "cd hejaz" line would be run in a different shell from the
"dostuff" line, so dostuff wouldn’t happen where you want it to happen.
This can be solved by putting everything on one line, or using ‘\’ to
continue a line across linebreaks.
stuff:
	cd hejaz; \
	dostuff
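The subshell behaviour can be observed by replacing dostuff with pwd. This is a sketch with two made-up Makefiles, written with printf so the tabs are explicit; it assumes make is installed.

```shell
#!/bin/sh
# Sketch: show that each command line in a rule gets its own shell.
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir hejaz

# Two command lines: the cd is forgotten before pwd runs.
printf 'stuff:\n\tcd hejaz\n\tpwd\n' > Makefile.split

# One command line: cd and pwd share a shell.
printf 'stuff:\n\tcd hejaz && pwd\n' > Makefile.joined

split_dir=$(make -s -f Makefile.split stuff)
joined_dir=$(make -s -f Makefile.joined stuff)

echo "separate lines ran pwd in: $split_dir"
echo "single line ran pwd in:    $joined_dir"
```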
4. Debugging
So you’ve managed to get the compiler to spit out a binary, but this
program will probably not work perfectly, and may not even run at all.
What now?
4.1. Static Analysis
Static analyzers try to find bugs in programs by examining the code
itself rather than actually compiling the program and running it. This
is similar to what compiler warnings do, and, in fact, using gcc -Wall
catches many of the errors that can be found through static examination
of the code. However, occasionally an external tool will have
something useful to say.
By far the most well known static analyzer for C code is lint.
There is no single version of lint, since each commercial Unix vendor
will provide their own version of lint, much as they each provide their
own version of cc. Also, many changes have been made to lint throughout
its history to support ANSI C and new POSIX extensions, but for the most
part modern lint implementations will output useful warnings and will have
a few things to say that the compiler didn’t find important, such as
declarations local to a file not being marked as static. However, the
major drawback of lint is that there is no free implementation, so it is
not likely to be found running on open source operating systems.
Splint appears to be the most popular open-source alternative to
lint, and it takes a somewhat different approach. Rather than simply
knowing better than you what makes good code, splint has options that
allow you to configure every detail of what should and should not produce a
warning. Unfortunately, there doesn’t seem to be any universal, useful
set of splint flags, so you’ll have to experiment and customize them
to your project and the things you want to be able to check.
4.2. Interactive Debuggers
It’s often useful to be able to examine the state of a program while
it’s running. Debuggers allow this by running a program within a
controlled environment where it can be examined, stopped, and modified.
Another useful feature of debuggers is their ability to examine
core files. If you’ve programmed in C for more than five minutes,
you’ve probably seen a message along the lines of "Segmentation fault
(core dumped)". That message is more than just an annoying reminder
that you’re not done yet; the core file it announces is a snapshot of
the program’s memory at the time it died, and it can help you find the
problem.
If you’re getting segfaults but no cores, you probably have the maximum
core file size in your user limits set to 0. Try
ulimit -c unlimited to make your programs dump core all over
the place.
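If ulimit -c unlimited is refused because the hard limit is lower, the soft limit can still be raised as far as the hard limit allows. The following sketch assumes a shell whose ulimit supports the -c, -S, and -H options, as bash and most other common shells do.

```shell
#!/bin/sh
# Sketch: inspect the core file size limits and raise the soft limit
# as far as the hard limit permits.
set -e

soft=$(ulimit -S -c)
hard=$(ulimit -H -c)
echo "core file size limit: soft=$soft hard=$hard"

# A soft limit of 0 means no core files are written at all.
ulimit -S -c "$hard"
echo "soft limit is now $(ulimit -S -c)"
```

The new limit applies to the current shell and the programs started from it, so it needs to be set again in each new session or placed in a shell startup file.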
Remember that -g compiler flag above? Using this compiles debugging
symbols into the program, which is useful in order to see what’s going on
while running it in a debugger. Also, you may want to turn off -O
optimizations, since they can create confusing changes in how the
code is run.
4.2.1. gdb
The most commonly used open source debugger is the GNU Debugger, gdb.
To run a program in gdb, just run gdb followed by the program name. gdb
can be given a core file on the command line after the program name. If
the process you want to debug is already running, give the PID after the
program name, instead. Running gdb will stop the program, if it was
running, and bring you to an interactive shell where you can issue
debugging commands. Like the command-line switches in the compiler, gdb
has a huge pile of complicated and confusing commands, but only a
handful that most people will ever use.
To get the program started, just use "run". Arguments to the program
that would have been given on the command-line can be given as
arguments to run. This will run the program to completion or until
a signal is delivered. Since this is more or less what would happen
from the command-line, this probably isn’t what you want, so you
should perhaps set some breakpoints first. Breakpoints are set with
the "break" command, either by giving it the name of a function, or an
address of the form file:line. Once the program has stopped at a
breakpoint, the
program can be continued with "continue", or the next line executed
with "step" or "next". The difference between step and next is that
if the following line contains a function call, step will enter the
function and continue debugging with it, while next will resume debugging
at the following line in the same function.
Now that you can watch the execution of a program, "print" can be used
to display data. The print command takes any C expression as its
argument, so this can be used to modify values, as well.
print i will print the contents of variable i,
while print (i=20) will set the value of i to 20 and
print 20.
Another useful command, especially when debugging from a core file,
is "backtrace", which prints out the current stack. "up" and "down"
can be used to navigate through stack frames. While in a particular
stack frame, "info locals" can be used to print out the value of all
local variables.
4.2.2. ddd
ddd, the Data Display Debugger, is actually a common frontend for
several debuggers, including gdb. It provides a graphical console to
gdb for debugging. All of the same gdb commands can be used, except
that there’s probably also a menu or a button you can click to do the
same thing. Since there are things to click, it tries to be easier
to use in setting breakpoints and selecting code. Breakpoints can be
moved by dragging the stop sign to another point in the file, or
ignored entirely by simply selecting a chunk of code and running it.
ddd claims to be about displaying data, and it does this by creating
plots and graphs of values. The values of sets of variables can be
displayed using its interface to gnuplot.
4.2.3. Memory Debugging Tools
Most bugs in C programs have to do with memory allocation, so tools
have been created to help track these errors down.
4.2.4. Electric Fence
Electric Fence is a small library written by Bruce Perens that, whenever
you screw up, causes your program to crash. It does this by overriding
the malloc function with one that ensures every memory allocation ends
just before an inaccessible page, and accessing this page will cause a
segmentation fault. This will provide a nice core file or debugger
breakpoint to pinpoint the errant instruction. Using Electric Fence
does significantly increase the amount of memory used by the program,
though, since every call to malloc will require at least two pages
of memory.
Using Electric Fence is as simple as adding the library file to the
list of files to link.
cc -o buggyprogram thingone.o thingtwo.o /usr/lib/libefence.a
buggyprogram can then be run just as it normally would have been,
or from within a debugger.
4.2.5. dmalloc
Dmalloc is another popular memory debugger, and it takes a somewhat
different approach. Instead of crashing your program when something
bad happens, it focuses more on letting the program run to completion
and printing out a log afterward. dmalloc tracks calls to malloc and
free and uses this information to detect heap corruption (invalid
address to free() or realloc()) and memory leaks. It also attempts to
detect "off-by-one" errors by writing special characters at the boundaries
of an allocated block and checking that they are still there when the
memory is freed. This can only find memory writes using an address off
by exactly one and cannot detect attempts to read invalid memory, so
dmalloc is not as powerful as Electric Fence if you only want to find
usage of bad addresses.
Using dmalloc is similar to using Electric Fence, but since there are
more configurable options, there are a few extra steps involved. First,
you probably want to enable dmalloc’s line number tracking, which
requires that every source file in your project include dmalloc.h.
Adding something like the following to the top of your C files or
default header file works quite well:
#ifdef DMALLOC
#include <dmalloc.h>
#endif
This way the dmalloc header will only be used if you compile with
-DDMALLOC. Next, you’ll want to set up your environment. This is done
using the dmalloc program, which prints out commands to be run by the
shell. You may want to wrap dmalloc in a shell function so that the
commands are executed automatically. In a POSIX compliant shell:
dmalloc() { eval "$(command dmalloc -b "$@")" ; }
This can be added to your .profile or .bashrc, or whatever your shell
executes when it starts. Some sensible defaults for dmalloc can be
enabled using dmalloc -l logfile -i 100 low. This will
output malloc statistics to logfile, have the library output heap
summaries every 100 iterations, and use the "low" set of debug
features. Other levels of checking are "runtime", for a minimal set
of features, and "medium" or "high" for more extensive checking.
Now that you have the environment set up, just link your program to
the dmalloc library.
cc -o leakyprogram hop.o pop.o /usr/lib/libdmalloc.a
dmalloc also comes with a library that can be used with threaded
programs, and another set of libraries for C++ programs.
Now that your program is linked to dmalloc, run it, let it finish,
and take a look at the output in logfile. It should contain warnings
about potential allocation errors, memory usage statistics, and a
list of all pointers that were allocated but not freed. These can
now hopefully be used to track down errors in your program’s handling
of memory.