Software development under Linux

Posted by David on Sep 10th, 2003

Overview

Unix and Linux are often seen as operating systems well suited for programmers. There are a variety of powerful tools available, and common standards allow for programs to be moved across systems and architectures without any major changes to the source code. Rather than using a particular piece of software to provide an integrated development environment in which to program, all of unix acts as a development environment. IDE software is rarely seen, with programmers preferring to use a text editor and a command line.

This method of programming, where each component for editing, compiling, and debugging the program is a separate entity, may seem confusing at first, but there are numerous tools available to bring everything together and make programming as pleasant a task as possible.

Editing

The text editor

The choice of text editor is a personal decision that can cause pointless, heated arguments, so this article will make no attempt to recommend any editor above any other. Just choose an editor that makes you happy. Features that you may want to look for in more powerful editors are syntax highlighting, easy navigation within and among files, and automatic indentation.

Navigation tools

Navigating a project that spans several files can quickly become a headache, so tools were created to quickly find a particular function or definition. In general, the text editor provides an interface to these tools, so they can be used from within the editor to move to different files and functions.

Ctags allows for rapid navigation by scanning all of the source files in a project and creating a tags file that contains the locations of functions, global variables, and struct definitions. This file is then used by the editor to jump to a particular location. Many editors support ctags, and emacs provides a similar tool with etags.

Cscope provides for similar navigation of a C project, but aims more at being a navigation tool that runs an editor instead of a tool used by an editor. It supports searches based on function names, global definitions, or arbitrary strings of text.

Compiling

The compiler

Even if you’ve never used the compiler on a unix system, you’ve probably seen huge, incomprehensible blocks of output from make as a program is compiled. Invoking the C compiler isn’t actually that complicated.

The C compiler on most systems is called cc. On Linux, cc will likely be a symlink to gcc, the C compiler from the GNU Compiler Collection. cc does little of the work on its own; rather, it provides a single interface to the preprocessor, compiler, assembler, and linker. To compile a program, simply call cc and give it the name of your source file.

cc test.c

This will create a file a.out that can then be run just like any other program.

$ ./a.out
Hello, world!
$

There are a few things to note here. First, this is one of the few times in unix where the extension of the file matters. If the filename does not end in “.c”, the compiler will assume that it is a pre-compiled object file, and fail with some sort of cryptic error such as “Bad magic number”. Second, the default output filename is a.out, which stands for assembler output. This can be changed by adding -o filename to the command line. Lastly, unlike some other operating systems, such as DOS, the current directory may not always be in the path searched for executable programs. To run a program in the current directory, you either need to add “.” to the PATH environment variable, or run it as ./program as shown above. Adding “.” to your PATH can be dangerous, though, since there could be malicious programs in your current directory with names like “ls” or “cp”, so just using ./ is your safest bet.

Multiple files can be compiled into a single program simply by adding them to the command line. However, it can become tiresome to need to recompile every file each time a change is made in only one source file. To get around this, you can keep compiled object files for each source file and then link all of the object files together into one program. No one knows how to use the linker by itself, so just let the compiler figure it out.

cc -c one.c
cc -c two.c
cc -c three.c
cc -o numbers one.o two.o three.o

Preprocessor definitions can be manipulated from the command line. For example, you may have something like the following in your source file:

#ifdef DEBUG
        printf("%d: Something's about to happen!\n",__LINE__);
#endif

This print statement will only be compiled into the program if it occurs after a #define DEBUG statement. However, rather than needing to modify the source files every time a symbol needs to be changed, the compiler can define symbols using the -D flag. cc -DDEBUG would compile the program as if DEBUG were defined at the top. -U does the opposite of -D, so cc -UDEBUG would explicitly undefine the DEBUG symbol.

Doing anything useful in C usually requires code from a library. For example, printf is commonly used, but most people never write their own printf. The printf declared in stdio.h is used, and when the compiler links the program, it adds a reference to the printf code in libc, the standard C library. Other libraries exist, but they must be explicitly requested. A commonly encountered problem is being unable to compile a program that uses floating-point math functions, even though the functions are defined in ANSI C and all the right headers are being used. This is because the functions in math.h are usually implemented in libm, the math library, instead of libc. To tell the C compiler to link to libm, add -lm to the command line. Put it after the source or object files that use it, since the linker resolves symbols from left to right and may otherwise skip the library.

cc -o math_stuff math_stuff.c -lm

The other common options are -E, which only runs the preprocessor; -S (capital S), which outputs assembly instead of binaries; -g, which compiles the program with debugging symbols; -I directory, which adds a directory to the path to search for include files; -L directory, which adds a directory to be searched for library files; and -Onumber, which tells the compiler to optimize the code. The meaning of any particular number is dependent on the compiler. gcc has levels 1, 2, and 3, with higher numbers theoretically outputting faster code.

gcc also has many options beyond those defined by POSIX. Perhaps the most useful of these is -Wall, which turns on a large set of commonly useful compiler warnings (despite the name, not every warning gcc offers; -Wextra enables more). When the compiler gives you a warning about something in your code, there’s usually a good reason for it, and it is worth investigating. The NetBSD project goes as far as to require that all of NetBSD compile with gcc -Wall -Werror, which causes any compiler warning to be treated as an error, so that compilation with warnings fails.

Another common option is -ansi, which enables stricter C89 compliance. It turns off some gcc extensions that could conflict with identifier names. For most people this option is unwelcome, since it also turns on trigraphs, a feature that takes three otherwise normal characters and treats them as something new and unexpected. Trigraphs were created for people using certain international character sets where the codes for various accented characters may conflict with those for less commonly used punctuation marks, such as square brackets or backslashes. For most people, trigraphs are only useful as an explanation for why “??!” was printed as “|”.

make

Keeping track of which source files have been changed since the last time they were compiled can become nearly impossible even with only a handful of files. To solve this problem, there is make. make reads a set of targets and dependencies, specified in a Makefile, and if the dependencies have a newer modified time than their target, it executes a rule to rebuild it.

Makefiles can themselves become very complicated, and compatibility among different variants of make can be an issue. GNU make is usually seen as having the most useful set of extensions, and is available on many systems as gmake.

Rather than trying to cover all of make, here’s an example Makefile to compile each source file in a project to an object file, and link all the object files into one program.

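CC      = gcc
CFLAGS  = -O2 -g
LDLIBS  = -lm
SRC     = foo.c bar.c baz.c
OBJ     = $(SRC:.c=.o)

# Link step: libraries go after the object files that use them.
foobarbaz: $(OBJ)
	$(CC) -o $@ $^ $(LDLIBS)

# Suffix rule: build any .o file from the matching .c file.
.c.o:
	$(CC) $(CFLAGS) -c $<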
When make is typed from the command line, make looks for the first target in the Makefile and attempts to build it. In this case the first target is foobarbaz, which depends on $(OBJ). $(OBJ) expands to three targets: foo.o, bar.o, and baz.o. The .c.o target is a suffix rule that tells make how to build any .o file from the .c file of the same name, since specifying each of them by hand would be rather annoying. Three automatic variables are being used here: $@, which expands to the name of the target; $<, which expands to the first item in the dependency list; and $^, which expands to the entire dependency list. $^ is a GNU make extension, so other versions of make may not understand it.

It is important to note that the indentation of the rules has to be a tab character. If 8 spaces are used instead, bad things happen, and none of them involve actually compiling your program.

Another quirk about Makefiles that can cause trouble is that every line of a rule’s commands is treated as a separate command and is run in its own subshell. So if a Makefile has something like the following

stuff:
	cd hejaz
	dostuff

the cd hejaz line would be run in a different shell from the dostuff line, so dostuff wouldn’t happen where you want it to happen. This can be solved by putting everything on one line, or using ‘\’ to continue a line across linebreaks. Joining the commands with && rather than ; also keeps dostuff from running in the wrong directory if the cd fails.

stuff:
	cd hejaz && \
	dostuff

Debugging

So you’ve managed to get the compiler to spit out a binary, but this program will probably not work perfectly, and may not even run at all. What now?

Static Analysis

Static analyzers try to find bugs in programs by examining the code itself rather than running the program. This is similar to what compiler warnings do, and, in fact, using gcc -Wall catches many of the errors that can be found through static examination of the code. However, occasionally an external analyzer will have something useful to say.

By far the most well known static analyzer for C code is lint. There is no single version of lint, since each commercial unix vendor provides their own version of lint, much as they each provide their own version of cc. Also, many changes have been made to lint throughout its history to support ANSI C and new POSIX extensions, but for the most part modern lint implementations will output useful warnings and will have a few things to say that the compiler didn’t find important, such as declarations local to a file not being marked as static. However, the major drawback of lint is that it is rarely packaged for Linux, so you are unlikely to find it there, although the BSDs have traditionally shipped their own implementation.

Splint appears to be the most popular open-source alternative to lint, and it takes a somewhat different approach. Rather than simply knowing better than you what makes good code, splint has options that allow you to configure every detail of what should and should not produce a warning. Unfortunately, there doesn’t seem to be any universal, useful set of splint flags, so you’ll have to experiment and customize them to your project and the things you want to be able to check.

Interactive Debuggers

It’s often useful to be able to examine the state of a program while it’s running. Debuggers allow this by running a program within a controlled environment where it can be examined, stopped, and modified.

Another useful feature of debuggers is their ability to examine core files. If you’ve programmed in C for more than five minutes, you’ve probably seen a message along the lines of “Segmentation fault (core dumped)”. That message is more than just an annoying reminder that you’re not done yet; the core file it mentions is a snapshot of the program’s memory at the time it died, and it can help you find the problem.

If you’re getting segfaults but no cores, you probably have the maximum core file size in your user limits set to 0. Try ulimit -c unlimited to make your programs dump core all over the place.

Remember that -g compiler flag above? Using this compiles debugging symbols into the program, which is useful in order to see what’s going on while running it in a debugger. Also, you may want to turn off -O optimizations, since they can create confusing changes in how the code is run.

gdb

The most commonly used open source debugger is the GNU Debugger, gdb. To run a program in gdb, just run gdb followed by the program name. gdb can be given a core file on the command line after the program name. If the process you want to debug is already running, give the PID after the program name, instead. Running gdb will stop the program, if it was running, and bring you to an interactive shell where you can issue debugging commands. Like the command-line switches in the compiler, gdb has a huge pile of complicated and confusing commands, but only a handful that most people will ever use.

To get the program started, just use run. Arguments to the program that would have been given on the command-line can be given as arguments to run. This will run the program to completion or until a signal is delivered. Since this is more or less what would happen from the command-line, this probably isn’t what you want, so you should perhaps set some breakpoints first. Breakpoints are set with the break command, either by giving it the name of a function, or a location of the form filename:linenumber. Once stopped, the program can be resumed with continue, or the next line executed with step or next. The difference between step and next is that if the following line contains a function call, step will enter the function and continue debugging within it, while next will resume debugging at the following line in the same function.

Now that you can watch the execution of a program, print can be used to display data. The print command takes any C expression as its argument, so this can be used to modify values, as well. print i will print the contents of variable i, while print (i=20) will set the value of i to 20 and print 20.

Another useful command, especially when debugging from a core file, is backtrace, which prints out the current stack. up and down can be used to navigate through stack frames. While in a particular stack frame, info locals can be used to print out the value of all local variables.

ddd

ddd, the Data Display Debugger, is actually a common frontend for several debuggers, including gdb. It provides a graphical console to gdb for debugging. All of the same gdb commands can be used, except that there’s probably also a menu or a button you can click to do the same thing. Since there are things to click, it tries to be easier to use in setting breakpoints and selecting code. Breakpoints can be moved by dragging the stop sign to another point in the file, or ignored entirely by simply selecting a chunk of code and running it.

ddd claims to be about displaying data, and it does this by creating plots and graphs of values. The values of sets of variables can be displayed using its interface to gnuplot.

Memory debugging tools

Most bugs in C programs have to do with memory allocation, so tools have been created to help track these errors down.

Electric Fence

Electric Fence is a small library written by Bruce Perens that, whenever you screw up, causes your program to crash. It does this by overriding the malloc function with one that ensures every memory allocation ends just before an inaccessible page, and accessing this page will cause a segmentation fault. This will provide a nice core file or debugger breakpoint to pinpoint the errant instruction. Using Electric Fence does significantly increase the amount of memory used by the program, though, since every call to malloc will require at least two pages of memory.

Using Electric Fence is as simple as adding the library file to the list of files to link.

cc -o buggyprogram thingone.o thingtwo.o /usr/lib/libefence.a

buggyprogram can then be run just as it normally would have been, or from within a debugger.

dmalloc

Dmalloc is another popular memory debugger, and it takes a somewhat different approach. Instead of crashing your program when something bad happens, it focuses more on letting the program run to completion and printing out a log afterward. dmalloc tracks calls to malloc and free and uses this information to detect heap corruption (invalid address to free() or realloc()) and memory leaks. It also attempts to detect off-by-one errors by writing special characters at the boundaries of an allocated block and checking that they are still there when the memory is freed. This can only find writes that land on those boundary bytes and cannot detect attempts to read invalid memory, so dmalloc is not as powerful as Electric Fence if you only want to find usage of bad addresses.

Using dmalloc is similar to using Electric Fence, but since there are more configurable options, there are a few extra steps involved. First, you probably want to enable dmalloc’s line number tracking, which requires that every source file in your project include dmalloc.h. Adding something like the following to the top of your C files or default header file works quite well:

#ifdef DMALLOC
#include <dmalloc.h>
#endif

This way the dmalloc header will only be used if you compile with -DDMALLOC. Next, you’ll want to set up your environment. This is done using the dmalloc program, which prints out commands to be run by the shell. You may want to wrap dmalloc in a shell function so that the commands are executed automatically. In a POSIX-compliant shell:

dmalloc() { eval "$(command dmalloc -b "$@")" ; }

This can be added to your .profile or .bashrc, or whatever your shell executes when it starts. Some sensible defaults for dmalloc can be enabled using dmalloc -l logfile -i 100 low. This will write malloc statistics to logfile, have the library check the heap every 100 calls, and use the “low” set of debug features. Other levels of checking are “runtime”, for a minimal set of features, and “medium” or “high” for more extensive checking.

Now that you have the environment set up, just link your program to the dmalloc library.

cc -o leakyprogram hop.o pop.o /usr/lib/libdmalloc.a

dmalloc also comes with a library that can be used with threaded programs, and another set of libraries for C++ programs.

Now that your program is linked to dmalloc, run it, let it finish, and take a look at the output in logfile. It should contain warnings about potential allocation errors, memory usage statistics, and a list of all pointers that were allocated but not freed. These can now hopefully be used to track down errors in your program’s handling of memory.

References and Further Reading