Table of Contents

1. Basics

1.1. What is BitTorrent?

Simply put, BitTorrent is a peer-to-peer (P2P) file-sharing protocol,
designed by Bram Cohen and first released in 2001.

1.2. How is it different from Kazaa, et al.?

Older P2P clients only allow downloading from one host: the host
offering the file (which therefore has the entire file). The host’s
client makes available a list of everything it has to offer, and peers
can download from it. Newer clients can download from multiple hosts if
those hosts have the exact same file (some identify files by a hash
rather than just the filename). BitTorrent, however, is much more
distributed by design. Another big difference is that BitTorrent is an
open protocol, not a company, so there is no single entity to sue.
Finally, BitTorrent offers no search mechanism like that of Kazaa or
Gnutella and family.

1.3. What are some clients for Linux?

Azureus is a nice Java client, and thus can be used on Windows and
*nix. It is what I use, and it offers many plugins for stats and more.
There’s also Bram Cohen’s own implementation from his site, as well as
BitTornado by TheShad0w, ABC, and others.

1.4. Where can I find torrents?

There are many sites on the net devoted to releasing torrents; Google
is your friend. Also, many companies, especially Linux distributions,
will offer torrents alongside the direct ISO downloads, since this
reduces their bandwidth costs. So try their sites.

2. How Does It Work?

2.1. Trackers

A tracker is a service that tells peers where to find other peers. This
is really all it is responsible for. Peers will ask the tracker for
some peers, and it will give them a list. The selection of peers can
be, and usually is, random, especially for a newly connected peer. It
can be more complicated, but it isn’t required to be, since Cohen notes
that random selection is nicely robust. Peers also check in with the
tracker periodically, mostly to report statistics such as how much they
have uploaded and downloaded.

2.2. Peers

Peers are those who are downloading and uploading pieces. A peer’s
client is responsible for choosing which pieces to download and from
where. When a peer first joins, it wants a complete piece quickly so it
can start uploading as soon as possible, so it chooses pieces at random
until its first complete piece is assembled. After that, it chooses
pieces based on rarity ("rarest first"). More common pieces are left
for later, since the likelihood of a common piece being lost from the
swarm is much lower than that of a rare piece. The protocol uses a
tit-for-tat policy, so upload and download speeds tend to match. Of
course, this can be bypassed by choking peers (restricting uploading);
in my experience, you can still get great download speeds (>= 200 KB/s)
even if your upload speed isn’t so great. At the end of the download,
when the peer needs only the last piece or so, it sends requests to all
peers for the remaining sub-pieces (a piece is usually about 0.5 MB,
while a sub-piece is about 16 KB). As sub-pieces are completed, the
client sends cancels for them so it won’t download them again, limiting
wasted bandwidth. This "endgame mode" prevents the common problem of
the download stalling on the last few pieces.

2.3. Seeds

Seeds are those who have already completed downloading the file (or are
the original publisher), and only upload. If a seed has considerably
less bandwidth than its peers, the peers will fetch different pieces of
the file from the seed to reduce redundant uploads and thus reduce the
seed’s overhead. This ensures the whole file gets into the swarm as
quickly as possible, so peers can start downloading it sooner.

3. How Can I Make a Torrent?

3.1. Get a tracker

There are many trackers already available, so try using one of those.
If you want to run your own tracker, make sure your server and
connection are reliable. Although you won’t need much bandwidth, if a
peer can’t connect, it cannot retrieve a list of related peers, and all
is futile. ByteMonsoon and mod_bt are some examples of trackers, as is
the one that ships with Cohen’s tarball (written in Python). Often, if
you’re going to publish to a popular torrent site, it will offer its
own tracker.

3.2. Make the torrent (a metafile)

Azureus provides its own torrent creation mechanism, so just use that. If you
are using another client that doesn’t offer one, try Krypt’s maketorrent.

3.3. Publish the torrent

Put your torrent on the web somewhere – your site, or a popular torrent site.

3.4. Seed

Open up your client with the torrent you created, and start seeding. This will
put the file you wish to share into the swarm of peers (once there are peers).
You could stop when you see there are multiple seeds, but please don’t.

4. Final Thoughts

BitTorrent etiquette is to leave your download open as long as
possible, even after it’s done, to keep the swarm alive. Don’t be a
bastard and close it as soon as it’s done. Some trackers will even keep
track of this and ban you.

5. Useful Links

Software Development Under Linux

1. Overview

Unix and Linux are often seen as operating systems well suited for
programmers. There are a variety of powerful tools available, and common
standards allow for programs to be moved across systems and architectures
without any major changes to the source code. Rather than using a
particular piece of software to provide an integrated development
environment in which to program, all of Unix acts as a development
environment. IDE software is rarely seen, with programmers preferring
to use a text editor and a command line.

This method of programming, where each component for editing, compiling,
and debugging the program is a separate entity, may seem confusing at
first, but there are numerous tools available to bring everything together
and make programming as pleasant a task as possible.

2. Editing

2.1. The text editor

The choice of text editor is a personal decision that can cause
pointless, heated arguments, so this article will make no attempt to
recommend any editor over another. Just choose an editor that makes
you happy. Features that you may want to look for in more powerful editors
are syntax highlighting, easy navigation within and among files, and
automatic indentation.

2.2. Navigational tools

Navigating a project that spans several files can quickly become a
headache, so tools were created to quickly find a particular function or
definition. In general, the text editor provides an interface to these
tools, so they can be used from within the editor to move to different
files and functions.

Ctags allows for rapid navigation by scanning all of the source files
in a project and creating a tags file that contains the locations of
functions, global variables, and struct definitions. This file is then
used by the editor to jump to a particular location. Many editors
support ctags, and emacs provides a similar tool with etags.

Cscope provides for similar navigation of a C project, but aims more
at being a navigation tool that runs an editor instead of a tool
used by an editor. It supports searches based on function names, global
definitions, or arbitrary strings of text.

3. Compiling

3.1. The compiler

Even if you’ve never used the compiler on a Unix system, you’ve probably
seen huge, incomprehensible blocks of output from make as a program
is compiled. Invoking the C compiler isn’t actually that complicated.

The C compiler on most systems is called cc. On Linux, cc will likely
be a symlink to gcc, the GNU C Compiler. cc usually doesn’t do anything
on its own, but rather provides a single interface to the preprocessor,
compiler, assembler, and linker. To compile a program, simply call cc
and give it the name of your source file.

cc test.c

This will create a file a.out that can then be run just like any other
program:

$ ./a.out
Hello, world!

There are a few things to note here. First, this is one of the few
times in Unix where the extension of the file matters. If the filename
does not end in ".c", the compiler will assume that it is a
pre-compiled object file, and fail with some sort of cryptic error such
as "Bad magic number". Second, the default output filename is a.out,
which stands for assembler output. This can be changed by adding
"-o <filename>" to the command line. Lastly, unlike some other
operating systems, such as DOS, the current directory may not always be
in the path searched for executable programs. To run a program in the
current directory, you either need to add ‘.’ to the PATH environment
variable, or run it as ./program, as shown above. Adding ‘.’ to your
PATH can be dangerous, though, since there could be malicious programs
in your current directory with names like "ls" or "cp", so just using
./ is your safest bet.

Multiple files can be compiled into a single program simply by adding
them to the command line. However, it can become tiresome to need to
recompile every file each time a change is made in only one source
file. To get around this, you can keep compiled object files for each
source file and then link all of the object files together into one
program. No one knows how to use the linker by itself, so just let the
compiler figure it out.

cc -c one.c
cc -c two.c
cc -c three.c
cc -o numbers one.o two.o three.o

Preprocessor definitions can be manipulated from the command line. For
example, you may have something like the following in your source file:

#ifdef DEBUG
printf("%d: Something's about to happen!\n", __LINE__);
#endif

This print statement will only be compiled into the program if it occurs
after a #define DEBUG statement. However, rather than needing to modify
the source files every time a symbol needs to be changed, the compiler
can define symbols using the -D flag:

cc -DDEBUG test.c

This would compile the program as if DEBUG were defined at the top. -U
does the opposite of -D, so cc -UDEBUG would explicitly undefine the
DEBUG symbol.

Doing anything useful in C usually requires code from a library. For
example, printf is commonly used, but most people never write their own
printf. The printf declared in <stdio.h> is used, and when the compiler
links the program, it adds a reference to the printf code in libc, the
standard C library. Other libraries exist, but they must be explicitly
requested. A commonly encountered problem is being unable to compile a
program that uses floating-point math functions, even though the
functions are defined in ANSI C and all the right headers are included.
This is because the functions in <math.h> are usually implemented in
libm, the math library, instead of libc. To tell the C compiler to link
to libm, add ‘-lm’ to the command line, after the files that use it:

cc -o math_stuff math_stuff.c -lm

The other standard options are -E, which only runs the preprocessor;
-S, which outputs assembly instead of binaries; -g, which compiles the
program with debugging symbols; -I <directory>, which adds a directory
to the path searched for include files; -L <directory>, which adds a
directory to be searched for library files; and -O<number>, which tells
the compiler to optimize the code. The meaning of any particular number
is dependent on the compiler. gcc has levels 1, 2, and 3, with higher
numbers theoretically outputting faster code.

gcc also has many options beyond those defined by POSIX. Perhaps the
most useful of these is -Wall, which turns on all compiler warnings.
When the compiler gives you a warning about something in your code, there’s
usually a good reason for it, and it would be worthwhile to
investigate. The NetBSD project goes so far as to require that all of
NetBSD compile with gcc -Wall -Werror, which will cause any
compiler warnings to be treated as errors, and thus cause compilation with
warnings to fail.

Another common option is "-ansi", which enables more strict C89
compliance. It turns off some gcc extensions that could possibly conflict
with identifier names. For most people, this option is undesirable,
since it also turns on trigraphs, which can take three otherwise normal
characters and treat them as something new and unexpected.
Trigraphs were created for people using certain international character
sets where the code for various accented characters may conflict with
those for less commonly used punctuation marks, such as square brackets or
backslashes. For most people, trigraphs are only useful as an explanation
for why "??!" was printed as "|".

3.2. make

Keeping track of which source files have been changed since the last
time they were compiled can become nearly impossible even with only a
handful of files. To solve this problem, there is make. make reads a
set of targets and dependencies, specified in a Makefile, and if the
dependencies have a newer modified time than their target, it executes a
rule to rebuild it.

Makefiles can themselves become very complicated, and compatibility
among different variants of make can be an issue. GNU make is usually
seen as having the most useful set of extensions, and is available on
many systems as ‘gmake’.

Rather than trying to cover all of make, here’s an example Makefile
to compile each source file in a project to an object file, and link all
the object files into one program.

CC = gcc
CFLAGS = -O2 -g
SRC = foo.c bar.c baz.c
OBJ = $(SRC:.c=.o)

foobarbaz: $(OBJ)
	$(CC) $(LDFLAGS) -o $@ $^

.c.o:
	$(CC) $(CFLAGS) -c $<

When ‘make’ is typed from the command line, make will look for the
first target specified and attempt to build it. In this case the first
target is foobarbaz, which depends on $(OBJ). $(OBJ) contains three
targets: foo.o, bar.o, and baz.o. The .c.o target is a special built-in
rule to make object files from source files, since specifying each of
them by hand would be rather annoying. Three built-in variables are
used here: $@, which expands to the name of the target; $<, which
expands to the first item in the dependency list; and $^, which expands
to the entire dependency list.

It is important to note that the indentation of the rules has to
be a tab character. If 8 spaces are used instead, bad things happen,
and none of them involve actually compiling your program.
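One way to be sure you really have a tab is to generate the rule with
printf, where the \t is explicit (the target name here is invented, and
GNU make is assumed to be installed):

```shell
# \t guarantees a real tab character in front of the command
printf 'hello:\n\techo hello from make\n' > Makefile.demo
make -f Makefile.demo
```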

Another quirk about Makefiles that can cause trouble is that every line
of a group of rules is considered a separate rule and is run in a
separate subshell. So if a Makefile has something like the following:

cd hejaz
dostuff

the "cd hejaz" line would be run in a different shell from the
"dostuff" line, so dostuff wouldn’t happen where you want it to happen.
This can be solved by putting everything on one line, or using ‘\’ to
continue a line across linebreaks.

cd hejaz ; \
dostuff

4. Debugging

So you’ve managed to get the compiler to spit out a binary, but this
program will probably not work perfectly, and may not even run at all.
What now?

4.1. Static Analysis

Static analyzers try to find bugs in programs by examining the code
itself rather than actually compiling the program and running it. This
is similar to what compiler warnings do, and, in fact, using gcc -Wall
catches many of the errors that can be found through static examination
of the code. However, occasionally an external analysis tool will have
something useful to say.

By far the most well known static analyzer for C code is lint.
There is no single version of lint, since each commercial Unix vendor
will provide their own version of lint, much as they each provide their
own version of cc. Also, many changes have been made to lint throughout
its history to support ANSI C and new POSIX extensions, but for the most
part modern lint implementations will output useful warnings and will have
a few things to say that the compiler didn’t find important, such as
declarations local to a file not being marked as static. However, the
major drawback of lint is that there is no free implementation, so it is
not likely to be found running on open source operating systems.

Splint appears to be the most popular open-source alternative to
lint, and it takes a somewhat different approach. Rather than simply
knowing better than you what makes good code, splint has options that
allow you to configure every detail of what should and should not produce a
warning. Unfortunately, there doesn’t seem to be any universal, useful
set of splint flags, so you’ll have to experiment and customize them
to your project and the things you want to be able to check.

4.2. Interactive Debuggers

It’s often useful to be able to examine the state of a program while
it’s running. Debuggers allow this by running a program within a
controlled environment where it can be examined, stopped, and modified.

Another useful feature of debuggers is their ability to examine
core files. If you’ve programmed in C for more than five minutes,
you’ve probably seen a message along the lines of "Segmentation fault
(core dumped)". The resulting core file is more than just an annoying
reminder that you’re not done yet; it’s a snapshot of the program’s
memory at the time it died, and it can help you find the problem.

If you’re getting segfaults but no cores, you probably have the maximum
core file size in your user limits set to 0. Try
ulimit -c unlimited to make your programs dump core all over
the place.

Remember that -g compiler flag above? Using this compiles debugging
symbols into the program, which is useful in order to see what’s going on
while running it in a debugger. Also, you may want to turn off -O
optimizations, since they can create confusing changes in how the
code is run.

4.2.1. gdb

The most commonly used open source debugger is the GNU Debugger, gdb.
To run a program in gdb, just run gdb followed by the program name. gdb
can be given a core file on the command line after the program name. If
the process you want to debug is already running, give the PID after the
program name, instead. Running gdb will stop the program, if it was
running, and bring you to an interactive shell where you can issue
debugging commands. Like the command-line switches in the compiler, gdb
has a huge pile of complicated and confusing commands, but only a
handful that most people will ever use.

To get the program started, just use "run". Arguments to the program
that would have been given on the command-line can be given as
arguments to run. This will run the program to completion or until
a signal is delivered. Since this is more or less what would happen
from the command-line, this probably isn’t what you want, so you
should perhaps set some breakpoints first. Breakpoints are set with
the "break" command, either by giving it the name of a function, or an
address of the form <filename>:<line number>. Once stopped, the
program can be continued with "continue", or the next line executed
with "step" or "next". The difference between step and next is that
if the following line contains a function call, step will enter the
function and continue debugging with it, while next will resume debugging
at the following line in the same function.

Now that you can watch the execution of a program, "print" can be used
to display data. The print command takes any C expression as its
argument, so this can be used to modify values, as well.
print i will print the contents of variable i,
while print (i=20) will set the value of i to 20 and
print 20.

Another useful command, especially when debugging from a core file,
is "backtrace", which prints out the current stack. "up" and "down"
can be used to navigate through stack frames. While in a particular
stack frame, "info locals" can be used to print out the value of all
local variables.
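Put together, a short session against a hypothetical crashing program
might look like this (the program, functions, and line numbers are all
invented for illustration):

```
$ gdb ./buggyprogram core
(gdb) backtrace
#0  parse_line (s=0x0) at parse.c:42
#1  main (argc=2, argv=0x7ffee3c0) at main.c:17
(gdb) print s
$1 = (char *) 0x0
(gdb) up
(gdb) info locals
```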

4.2.2. ddd

ddd, the Data Display Debugger, is actually a common frontend for
several debuggers, including gdb. It provides a graphical console to
gdb for debugging. All of the same gdb commands can be used, except
that there’s probably also a menu or a button you can click to do the
same thing. Since there are things to click, it tries to be easier
to use in setting breakpoints and selecting code. Breakpoints can be
moved by dragging the stop sign to another point in the file, or
ignored entirely by simply selecting a chunk of code and running it.

ddd claims to be about displaying data, and it does this by creating
plots and graphs of values. The values of sets of variables can be
displayed using its interface to gnuplot.

4.2.3. Memory Debugging Tools

Most bugs in C programs have to do with memory allocation, so tools
have been created to help track these errors down.

4.2.4. Electric Fence

Electric Fence is a small library written by Bruce Perens that, whenever
you screw up, causes your program to crash. It does this by overriding
the malloc function with one that ensures every memory allocation ends
just before an inaccessible page, and accessing this page will cause a
segmentation fault. This will provide a nice core file or debugger
breakpoint to pinpoint the errant instruction. Using Electric Fence
does significantly increase the amount of memory used by the program,
though, since every call to malloc will require at least two pages
of memory.

Using Electric Fence is as simple as adding the library file to the
list of files to link.

cc -o buggyprogram thingone.o thingtwo.o /usr/lib/libefence.a

buggyprogram can then be run just as it normally would have been,
or from within a debugger.

4.2.5. dmalloc

Dmalloc is another popular memory debugger, and it takes a somewhat
different approach. Instead of crashing your program when something
bad happens, it focuses more on letting the program run to completion
and printing out a log afterward. dmalloc tracks calls to malloc and
free and uses this information to detect heap corruption (invalid
address to free() or realloc()) and memory leaks. It also attempts to
detect "off-by-one" errors by writing special characters at the boundaries
of an allocated block and checking that they are still there when the
memory is freed. This can only find memory writes using an address off
by exactly one and cannot detect attempts to read invalid memory, so
dmalloc is not as powerful as Electric Fence if you only want to find
usage of bad addresses.

Using dmalloc is similar to using Electric Fence, but since there are
more configurable options, there are a few extra steps involved. First,
you probably want to enable dmalloc’s line number tracking, which
requires that every source file in your project include dmalloc.h.
Adding something like the following to the top of your C files or
default header file works quite well:

#ifdef DMALLOC
#include "dmalloc.h"
#endif

This way the dmalloc header will only be used if you compile with
-DDMALLOC. Next, you’ll want to set up your environment. This is done
using the dmalloc program, which prints out commands to be run by the
shell. You may want to wrap dmalloc in a shell function so that the
commands are executed automatically. In a POSIX compliant shell:

dmalloc() { eval $(command dmalloc -b $*) ; }

This can be added to your .profile or .bashrc, or whatever your shell
executes when it starts. Some sensible defaults for dmalloc can be
enabled using dmalloc -l logfile -i 100 low. This will
output malloc statistics to logfile, have the library output heap
summaries every 100 iterations, and use the "low" set of debug
features. Other levels of checking are "runtime", for a minimal set
of features, and "medium" or "high" for more extensive checking.

Now that you have the environment set up, just link your program to
the dmalloc library.

cc -o leakyprogram hop.o pop.o /usr/lib/libdmalloc.a

dmalloc also comes with a library that can be used with threaded
programs, and another set of libraries for C++ programs.

Now that your program is linked to dmalloc, run it, let it finish,
and take a look at the output in logfile. It should contain warnings
about potential allocation errors, memory usage statistics, and a
list of all pointers that were allocated but not freed. These can
now hopefully be used to track down errors in your program’s handling
of memory.

5. Resources

Compiler Frontends: ccache

1. Introduction

Ccache is a frontend to your favorite C/C++ compiler that caches object
files, so that if a build is run multiple times, needless recompilation
is avoided. It uses the C preprocessor output and the compiler flags as
part of its hash function, so any object file pulled out of the cache
is identical to what the compiler would have produced. Compiler
messages are also stored and retrieved along with the object files. To
the user or the linker, the only noticeable effect of ccache is in
speed.

The uses of this are less obvious than those of distcc, but for many,
recompilation of identical code is a common task. For example, if you have
a project that was compiled using the -O2 flag passed to gcc, and you want
to switch that to -g to debug, a recompilation is necessary. However,
when using a cache, a second recompilation for unchanged code is not
needed when changing back to -O2. Also, package build systems like RPM or
dpkg often run "make clean" as the first step before compiling
a package. Ccache prevents this make clean from throwing away useful
work.

Erik Thiele’s compilercache is an earlier caching compiler frontend, and
uses a collection of shell scripts and the md5sum utility to store object
files. Ccache is a reimplementation of compilercache in C, with some
added improvements for cache management.

Ccache was written by Andrew Tridgell, who primarily uses it to make
Samba build faster.

2. Installation

The cache will be created the first time that ccache is run, so for
most people there are no necessary steps to take before installation.
Just download, compile, and install.

gzip -cd ccache-2.2.tar.gz | tar -xvf -
cd ccache-2.2
./configure --prefix=/usr/local
make install

3. Configuration

The default settings for the cache are to limit its size to 1GB and to
allow an unlimited number of files. These settings can be changed using
ccache -F <numfiles> and ccache -M <size>. The size is in GB unless an
M or K suffix is given.

The cache will be located in ${HOME}/.ccache unless otherwise
specified. This can be changed by setting the CCACHE_DIR environment
variable. The cache can be shared among several users, but care must
be taken to keep the permissions of the cache consistent. And, of course,
you have to trust all of the users with access to the cache. Everyone
using the cache must be able to write to it, so a umask of 002 would
probably be desirable. Also, on systems that use SysV style directory
permissions (like Linux), the setgid bit needs to be set on the
cache directory to ensure that all created subdirectories are owned by the
cache group.
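A sketch of setting up such a shared cache directory (the path is
invented, and the chgrp line assumes a build group exists, so it is
left commented out here):

```shell
mkdir -p /tmp/shared-ccache
# chgrp builders /tmp/shared-ccache  # hand the cache to your build group
chmod 2770 /tmp/shared-ccache        # leading 2 = setgid on the directory
umask 002                            # group-writable files from now on
export CCACHE_DIR=/tmp/shared-ccache
ls -ld /tmp/shared-ccache            # mode shows as drwxrws---
```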

4. Compiling

Like distcc, ccache performs the C/C++ preprocessing itself; it then
tries to speed up the compiling and assembling steps, in its case by
skipping them entirely when possible. If for some reason all of your
work is done in the preprocessor, ccache will not offer much benefit.

For Makefiles and configure scripts that honor the CC variable, ccache
can be used by setting CC to "ccache gcc". For example:

CC="ccache gcc" ./configure

Similarly, CXX can be set to "ccache g++" for C++ code.

Another way to ensure that ccache is always called is to install a
symlink from the name of your compiler to ccache, placed before the
real compiler in your PATH. For example, if gcc is installed as
/usr/bin/gcc, and /usr/local/bin is before /usr/bin in the path, you
could create a symlink /usr/local/bin/gcc that points to ccache. When
ccache is called as /usr/local/bin/gcc, it will search the path for the
first executable program named gcc that is not a symlink to ccache.
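For example, a sketch using a per-user bin directory instead of
/usr/local/bin (the paths assume ccache was installed as
/usr/local/bin/ccache):

```shell
mkdir -p "$HOME/bin"
ln -sf /usr/local/bin/ccache "$HOME/bin/gcc"  # 'gcc' now resolves to ccache
export PATH="$HOME/bin:$PATH"                 # as long as $HOME/bin comes first
```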

Since ccache needs to know which arguments are options and which are
files, it assumes that any argument that is also a valid filename is
referring to that file. If you have filenames that look a lot like your
compiler flags, you may need to use the --ccache-skip flag,
which tells ccache that the argument following it is not really a file.

5. Real-world example

Kernel recompilation is a common task, so let’s see how this can be
improved with ccache. The environment:

  • Computer: AMD Athlon 900MHz, 768MB RAM
  • Kernel Version: 2.4.21
  • gcc version: 3.2.3

The same build script was used as for the distcc example, only modified
slightly for Intel architectures:

make mrproper
cp /boot/config .config
make oldconfig
make dep clean
make MAKE="make -j 5" CC="$1" bzImage
make MAKE="make -j 5" CC="$1" modules

Three runs were made: the first passing gcc as the argument to
the build script, and the second two passing ccache gcc. Here are
the results:

Trial                  real        user        sys
gcc only               12m5.012s   11m10.530s  0m38.640s
ccache, cache cleared  12m31.874s  11m23.330s  0m43.760s
ccache, second run     2m12.538s   1m41.900s   0m23.810s

6. Using ccache with distcc

Since distcc and ccache do independent things, they can work together,
with ccache calling distcc, as in CC="ccache distcc gcc". You can also
set the environment variable CCACHE_PREFIX to distcc and then use
"ccache gcc" for CC, and ccache will prefix distcc to the compiler
command. Using this combination, only uncached code will be compiled,
and the code that is compiled will be distributed across the network by
distcc.

7. Resources

Compiler Frontends: distcc

1. Introduction

Distcc is a client/server program that allows you to distribute C, C++,
Objective-C, or Objective-C++ code compiles across a network. It does not
require a central disk mounted on each host, nor does it require the same
operating systems, library versions, or sets of headers.

Since distcc is a drop-in tool for current build systems, you do not need
to do any special reconfiguration of your network to get it working.
You simply invoke distcc as your C compiler rather than
cc or gcc and it takes over from there.

Distcc is written by Martin Pool.

2. Installation

Before you can use distcc, you must install and configure it on each machine
that you want acting as a "volunteer" on your network. At the time of
writing, the current version is 2.7.1. Download the source code, compile
it, and install it:

bzip2 -dc distcc-2.7.1.tar.bz2 | tar -xvf -
cd distcc-2.7.1
./configure --prefix=/usr/local
make install

Most distributions offer precompiled distcc packages, so you may want
to go that route.

3. Configuration

3.1. Run the Server

On each host that you plan to use as a volunteer, you need to run
distccd to accept incoming distcc connections. As root, run
this command (or one similar to it):

distccd -a --daemon

Note: the server will not keep running as root; it drops privileges to
an unprivileged user, ‘distcc’ by default, so make sure that user
exists.

3.2. Set the Host List

On each host from which you plan to run compiles, you need to configure
the list of available distcc hosts. Start with localhost and then list
the remote hosts. The syntax is ip[:port][/maxjobs]. Here is an example
listing (from my laptop):

localhost/2 warp/3

This tells distcc, when run from my laptop, to spawn no more than two
jobs on localhost and no more than three jobs on warp.

You have two choices for the location of the host list. One is in the
DISTCC_HOSTS environment variable, like this:

export DISTCC_HOSTS="localhost/2"

The other option is to create a /usr/local/etc/distcc/hosts file with
one host entry per line. I use the file method, but both methods work
just fine. The file mentioned here is the system-wide one; you can also
have a per-user configuration file in ~/.distcc/hosts, with the same
format. Note that the DISTCC_HOSTS environment variable overrides
whatever you have in the hosts file.
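Creating the per-user file is straightforward (the host names and job
limits here are just the example values from above):

```shell
# One host entry per line, syntax ip[:port][/maxjobs]
mkdir -p "$HOME/.distcc"
cat > "$HOME/.distcc/hosts" <<'EOF'
localhost/2
warp/3
EOF
```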

4. Compiling

When you compile C or C++ code, there are several main tasks that are
completed. These are:

  • Generating preprocessed files from source and headers.
  • Compiling to assembly instructions.
  • Assembling to object files.
  • Linking object files and libraries.

The only tasks that are sent to the volunteers are compiling and
assembling. All preprocessing and linking is done on the main node, that
is, the node where you invoked the job.

The easiest way to invoke distcc is as an override to the CC variable that
most Makefiles and configure scripts honor. For example:

CC="distcc" ./configure --prefix=/somewhere
make -j 47

If you are compiling C++ code, you can do CXX="distcc g++".
Plain distcc defaults to the regular C compiler, which is why you do not
need to name one in the CC variable.
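To see the mechanism without distcc installed, here is a minimal sketch with
a throwaway Makefile (the file name is arbitrary); a real build substitutes
CC="distcc gcc" on the make or configure line in exactly the same way:

```shell
# A variable set on the make command line overrides the Makefile's default,
# which is exactly how the CC="distcc" trick works.
printf 'show:\n\t@echo "compiler is: $(CC)"\n' > /tmp/Makefile.demo
make -f /tmp/Makefile.demo CC="distcc gcc"   # prints: compiler is: distcc gcc
```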

Another way to invoke distcc is manually. Instead of:

gcc -O9 -fsuper-fast -march=mysystem -o proggie proggie.c

You can do:

distcc gcc -O9 -fsuper-fast -march=mysystem -o proggie proggie.c

That’s about it for invoking distcc. Not much to it.

5. Real-World Example: the Linux Kernel

Something we’re all familiar with is compiling the Linux kernel. We’re
also all familiar with the fact that it can take some time, even on speedy
machines. I did a comparison of kernel compiles, one using distcc and
one not using distcc, just to get some numbers. The controls:

  • Computer: iBook2 500MHz G3, 384MB RAM
  • Kernel Version: 2.4.21
  • gcc version: 3.2.3
  • Volunteer: 2x 1GHz G4, 512MB RAM

I did a timed run of the build without using distcc first, then I did
it with distcc. Here is the script I timed on each run:

make mrproper
cp /boot/config .config
make oldconfig
make dep clean
make MAKE="make -j 5" CC="$1" vmlinux
make MAKE="make -j 5" CC="$1" modules

For the first run, I passed gcc as the argument. For the second
run, I passed distcc as the argument. Here are the results:

Trial            real        user        sys
Without distcc   50m29.959s  47m19.550s  2m50.780s
With distcc      14m24.400s  11m13.560s  2m6.370s

So there you have it: with just two systems, I cut the compile time from
just over 50 minutes to under 15, roughly a 3.5x speedup.

6. Resources

Linux Overview

Table of Contents

1. Basics

1.1. What is Linux?

Although most people refer to Linux as an operating system, it really
isn’t. Specifically, Linux is just the core of the operating system,
called the ‘kernel’, which runs the applications and interacts with the
hardware. The Linux kernel was originally written by Linus Torvalds in
1991, but now includes code from thousands of programmers worldwide.
This kernel is then combined with many applications and utilities; together
they form the OS commonly known as Linux (or GNU/Linux).

1.2. What are ‘distributions’?

There are many applications (aka, packages) out there. Some do the same
things as others, and most are available in different versions.
Companies such as Red Hat, Mandrake, Gentoo, etc. take the kernel and
package it with their own choice of applications. The big companies also
usually modify the kernel and write custom applications (mostly
configuration utilities). These companies also spend a lot of time
writing installers, so that installing Linux isn’t as hard and scary as
it used to be.

These compilations of applications, installers, and kernels are called
distributions. There are dozens of distributions available and most are
free to download. Some can be purchased, although you usually are paying
for the packaging, printed documentation, and tech support.

Different distributions are geared for different kinds of users.
Mandrake is great for previous Windows users who are new to Linux. Red
Hat is also good for these types of users, although it suits the more
advanced users a little bit more than Mandrake. Red Hat has many
different versions, just as Microsoft has many different versions of
Windows, all to suit different users. The "enterprise" editions are
especially designed for businesses installing it on servers and such,
whereas the normal "desktop" editions are for normal (web surfing, word
processing, etc) users. Some versions are especially designed for
international users. Some distributions are meant to fit on (and totally
run from) a floppy or a CD.

The most well-known distributions are (in no particular order):
Red Hat, Mandrake, SuSE, Gentoo, Slackware, Debian, Knoppix

1.3. GPL

Linux and many of the applications that come with the ‘Linux’ operating
systems are released under the GNU General Public License (GPL). This means
you can do almost whatever you want with them – use them, copy them for a
friend, change them, etc. However, do not falsely claim that you wrote the
entire app or kernel, and any changes you distribute must be shared with
the community under the same license. This makes it possible for everyone
to contribute.

1.4. Who operates, owns, controls Linux?

Unlike Windows, which is controlled by Microsoft, Linux is not owned or
controlled by a large corporation. In fact, there is no single entity in
charge of Linux. Instead, Linux is by and for hobbyists, hackers, and
professionals worldwide.

However, Linus does own the name, "Linux," and Larry Ewing drew the
official mascot, Tux.

2. Why Linux?

  • Have you ever used Unix and liked its simplicity and power?
  • Do you enjoy [the idea of] reading source code of programs to see how
    they function?
  • Do you have a class need for a good C, C++, gcc, make environment?
  • Do you want to run a great website with just a few simple (yet
    powerful) tools (vi, make, perl, apache, etc)?
  • Do you not want to use monopolistic over-bloated buggy and expensive
    proprietary OS’s?
  • Do you like to have many different ways to do anything?
  • Do you want to be cool?
  • Do you believe you can evolve into a superior technical lifeform?
  • Do you want any kind of technological job 10+ years from now?

If ANY of these apply to you, Linux is for you.

3. Linux Basics

3.1. Kernel

As said before, the kernel is the middle-man between hardware and
software. It is the kernel that is responsible for handling memory,
processes, hard disk I/O, ethernet activity, etc. If you use a wireless
card, your kernel must be able to support the chipset of that card
(prism2, orinoco, etc).

When somebody says they are recompiling their kernel, it is usually
because they bought a new gadget, patched the kernel to protect themselves
from an exploit, or need a new feature (like iptables support).
Typical Linux kernels are 1-4 MB in size. Some people enable support for
only those things they actually have, and nothing else, to reduce the file
size (so they can fit it on a floppy). Other people enable support for
anything they think they might want in the future, to avoid having to
recompile their kernel.

However, there is a way to add support for a new feature or gadget without
recompiling your whole kernel (and thus being forced to reboot into a new
kernel): modules. Any driver can be built either into the kernel itself or
as a module. Modules are separate from the kernel, and can be loaded and
unloaded at will. All you have to do is tell the kernel config utility that
you want that driver compiled as a module (outside of the kernel), run
"make modules modules_install", and voila. Load the module ("modprobe
newmodule") and magically you have support for that new device. No reboot
required. This saves a lot of time and trouble.

3.2. Filesystem

The Linux filesystem is a lot different than that of Windows. Windows
uses drive letters for drives (hard drives, CD roms, floppies, etc) and
partitions for those drives. C: is the root for the primary (and maybe
only) hard drive or partition. D: is the root for the secondary hard
drive. So on and so forth. Linux, on the other hand, has only 1 root.
The root directory is "/". Everything resides in this root.

/ (root)
/boot /bin /home /mnt /lib /var /usr /dev /etc /proc /root /tmp

There are more dirs than those listed above, but that’s most of them.

/home is where users’ home directories are located. /home/daevux will be
where user daevux will put all their files, pictures, etc.

/mnt is where you would "mount" a drive, partition, or any other storage
device. For instance, /mnt/cdrom is the most typical place to mount your
cdrom drive. So what is "mount"? When you mount a drive, you are telling
Linux to get that drive ready for I/O. Linux determines what filesystem the
drive is formatted with (iso9660 for CDs, vfat for Windows drives, smbfs
for samba shares, etc), then interacts with it accordingly.
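Mounting itself needs root, but you can inspect what is already mounted
without it; a quick sketch:

```shell
# /proc/mounts lists device, mount point, and filesystem type for every
# currently mounted filesystem; "mount" with no arguments shows the same info.
mount | head -5
cat /proc/mounts | head -5
```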

/etc is where most system-wide configuration files are kept. For
instance, /etc/apache/conf/apache.conf is the configuration file for
apache.

/boot is where the kernel is kept, as well as any bootloader files (grub,
lilo, etc).

/lib contains the libraries needed for certain programs to be run. This
is also where kernel modules are kept.

/var is usually for log files, sometimes tmp files, email spools, etc.

/usr is where applications and user utilities are located, in addition
to A lot more stuff.

/root is the home directory for the root user.

/tmp is for temp files.

/bin (and /sbin) contains basic system utilities (bash, cd, dd, halt,
etc).

/dev and /proc are not ordinary on-disk directories. /proc exposes system
information (a lot of statistics, etc). /dev contains "pointers", per se, to
hardware devices. /dev/mouse is a pointer to your mouse. /dev/hda1 is a
pointer to the first partition on your primary hard drive. There’s also
stuff like /dev/zero, /dev/null, and /dev/random.
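A couple of these special files can be poked at safely from any shell:

```shell
echo "discard me" > /dev/null       # /dev/null swallows anything written to it...
wc -c < /dev/null                   # ...and reads back as empty: 0 bytes
head -c 4 /dev/zero | od -An -tx1   # /dev/zero: an endless stream of zero bytes
```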

3.3. Permissions/Users

Linux (like most modern OSs) is multiuser. This means that many users
can use the system, even at once.

First and foremost, there is only one root. This is the ultimate superuser
of the machine. root can do anything and everything. Keep root’s password
safe!

Then there are normal users. Users can belong to certain groups, which
determine that user’s capabilities. For instance, a user in the wheel
group is allowed to "su" (switch to root mode) on a Gentoo system (I
don’t know if this is true for all distros). Most users should be
included in the group, "users".
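You can check your own user and group membership from the shell (the output
naturally varies from system to system):

```shell
id -un   # the current username
id -Gn   # the groups that user belongs to (look for "wheel" on Gentoo)
```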

Every file and directory in Linux has attributes that determine who can
read, write, and execute it. For instance, file xyz.txt might have
permissions that look like this (using ls -l): -rwxr-xr-x. The first
character shows what type of object it is (file, dir, sym link, etc): "-"
means it is a file, "d" means it is a directory. The next 3 characters are
for the owning user, the next 3 for the group, and the last 3 for all other
users. This particular file can be read and executed by the user who owns
it, the users in its group, and everyone else; however, only the owner can
write/modify it. The "chmod" command changes those permissions (but must be
run by either the owner or root). The "chown" command changes the ownership
of the file.
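A sketch of the xyz.txt example, using a throwaway file in /tmp:

```shell
f=/tmp/xyz.txt
echo hello > "$f"
chmod 755 "$f"   # octal 755 = rwxr-xr-x (owner rwx, group r-x, others r-x)
ls -l "$f"       # the mode column now reads -rwxr-xr-x
```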

3.4. CLI/Shells

What good is the operating system if you can’t interact with it? This is
where the command line interface (CLI) comes in. This is a generic term
used to describe a text-based way of interacting with programs, as
opposed to using GUIs. It is the simplest way to interact with Linux.

Shells are various programs that provide this functionality. The most
popular shell is bash, but some others are csh, ksh, tcsh, and zsh.
They basically take your commands, process them, execute the program
you called, and display that program’s output. They do much more than
this, but you get the gist of it.

For instance, you might see something like this:

daevux@feynman daevux $

This is a bash prompt. The first word before @ is the username, the word
after the @ is the computer’s [host]name, and the 3rd word is the
current directory. Bash is popular partly because it does (by default)
tab completion and keeps a command history.
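That prompt is controlled by bash’s PS1 variable; a sketch of roughly the
same prompt (the backslash escapes are standard bash prompt codes):

```shell
# \u = username, \h = hostname, \W = basename of the current directory,
# \$ = "$" for normal users, "#" for root
PS1='\u@\h \W \$ '
```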


Other shells, such as ksh, use slightly different default prompts. Of
course, there is much more difference between the shells than just how the
prompt is formatted.

3.5. X-Windows

X-Windows is for wussies. Just kidding. X basically allows you to use
Linux in a much more friendly, graphical environment. XFree86 is what
most Linux users use; it is the free implementation of X. There are
other implementations, but XFree86 is the most mature and popular. X is
also a client/server system, meaning you can run X programs remotely.

4. Applications

4.1. Window Managers

X is no good without a window manager running on top of it. Window
managers operate between X and the graphical programs you want to run.
They do stuff like draw the borders around your programs, manage their
placement, draw titlebars, etc. Some WMs are very simple, some are
very complex.

4.1.1. KDE

More than just a WM, KDE is a desktop environment. It resembles Windows
a lot. There is a panel at the bottom, with buttons that execute programs
and expand menus. Also in this panel is a task area, which keeps track
of which programs are open/minimized. Sound familiar? This is the best
WM to use for those new to Linux. Many distros come with this by default.

KDE uses the Qt toolkit.

4.1.2. Gnome

Gnome is another WM/DE (desktop env). The same comments for KDE also apply
here. Refer to the screenshots to see the differences.

Gnome uses the GTK[2]+ toolkit.

Note: Gnome (GTK) and KDE (Qt) applications can be run in any WM, not just
the ones they were written for.

4.1.3. Fluxbox/Blackbox

This WM is for more experienced users – those that are annoyed by DEs.
Fluxbox and the WM it was based off of, Blackbox, are minimalistic to some
degree. This means that it only does what is needed. Nothing more. There is
the toolbar, which only includes iconified (minimized) programs, the time,
and the workspace name. There is also the slit, where you can put dockapps –
small graphical programs that monitor time, CPU, Memory, I/O, Network, etc.
Most people that use these WMs rely on keybindings to open programs – in
addition to a small menu. A major selling point to these window managers is
that they’re very fast and very stable. Fluxbox, unlike Blackbox, also
heavily uses tabs – so that you can group windows together.

4.1.4. Enlightenment

Enlightenment is a popular window manager, with an almost cult-like following.
It is the inspiration for many current window managers. The window manager
however hasn’t seen a new release in nearly two years, and is dying out. It is
kind of like Blackbox, but more graphics heavy (prettier) with cool effects.

4.1.5. Window Maker

Window Maker offers a lightweight alternative to GNOME and KDE. It provides
a look-and-feel that mimics the NEXTSTEP user interface, but can be themed any
way you like. Features such as the Dock and Clip make Window Maker a popular
choice among system administrators.

4.2. Office/Productivity

  • OpenOffice
  • KOffice
  • Abiword
  • GnuCash
  • Siag

4.3. Editors

  • Vi/ViM
  • Emacs
  • Pico/Nano
  • SciTE

4.4. Web Browsing

  • Mozilla
  • Konqueror
  • Lynx

4.5. Email/News

  • Evolution
  • Sylpheed
  • KMail
  • Pine
  • Mutt
  • Tin/SLRN/Pan
  • KNews

4.6. Other Internet Stuff

  • Gaim
  • Everybuddy
  • XChat
  • BitchX
  • Kismet
  • P2P stuff
  • gFTP

4.7. Multimedia

  • GQView
  • xv
  • gPhoto2
  • XMMS
  • Timidity++
  • XawTV
  • TVTime
  • MPlayer
  • Xine
  • VLC
  • gCombust
  • gRip

4.8. Games

Sorry, no games for Linux. Just kidding 🙂

  • GnuChess
  • Quake1-3
  • Halflife
  • TuxRacer
  • BZFlag
  • Dopewars
  • Foobillard
  • Unreal
  • UT2003
  • RTCW
  • Freecraft
  • Freeciv
  • Penguin-command
  • GLTron

4.9. Servers/Databases

  • MySQL
  • PostgreSQL
  • Apache
  • Postfix
  • Sendmail
  • Qmail
  • Samba
  • Cups
  • ProFTPd
  • BIND

5. Useful Links

6. Screenshots

Screenshots can be seen at

Introduction to Gentoo

Table of Contents

1. Gentoo In My Words

Gentoo is a source-based distro. The unique thing about Gentoo is that
the default package system uses source code and not pre-compiled binaries.
Gentoo lets you build everything, and I mean everything, on your system with
whatever options you would like. If you install Gentoo from stage1, you
actually bootstrap your own system by compiling your own compiler.
Although a fresh install from stage1 will take a good while depending on
your system, I feel it is a good trade-off for getting a system just the
way I like it. The portage system also makes it easy to use your own
packages, and to add ebuilds for packages that are not in the system.
Gentoo allows you to have a lot of control over your system in a nice,
manageable way.

2. Even More About Portage

The two main portage config files are /etc/make.globals and /etc/make.conf.
Included below are examples of both files.

2.1. make.globals

Here is an example make.globals:

# Copyright 2002 Daniel Robbins, Gentoo Technologies, Inc.
# System-wide defaults for the Portage system

# ***************************************************
# **** CHANGES TO make.conf *OVERRIDE* THIS FILE ****
# ***************************************************
# ** Incremental Variables Accumulate Across Files **
# **USE, CONFIG_*, and FEATURES are incremental**
# ***************************************************

# Host-type

CONFIG_PROTECT="/etc /var/qmail/control /usr/share/config /usr/kde/2/share/config /usr/kde/3/share/config"

# Options passed to make during the build process

# Fetching command (5 tries, passive ftp for firewall compatibility)
FETCHCOMMAND="/usr/bin/wget -t 5 --passive-ftp \${URI} -P \${DISTDIR}"
RESUMECOMMAND="/usr/bin/wget -c -t 5 --passive-ftp \${URI} -P \${DISTDIR}"

CFLAGS="-O2 -mcpu=i686 -pipe"

# Debug build -- if defined, binaries won't be stripped

# Default maintainer options
#FEATURES="digest sandbox noclean noauto buildpkg"
# Default user options
FEATURES="sandbox ccache"

# By default output colored text where possible, set to
# "true" to output only black&white text

# By default wait 5 secs before cleaning a package
# Set to yes automatically run "emerge clean" after each merge
# Important, as without this you may experience missing symlinks when
# downgrading libraries during a batch (world/system) update.

# Number of times 'emerge rsync' will run before giving up.

# ***************************************************
# **** CHANGES TO make.conf *OVERRIDE* THIS FILE ****
# ***************************************************
# ** Incremental Variables Accumulate Across Files **
# **USE, CONFIG_*, and FEATURES are incremental**
# ***************************************************

2.2. make.conf

Here is an example make.conf:

# Copyright 2000-2002 Daniel Robbins, Gentoo Technologies, Inc.
# Contains local system settings for Portage system
# Build-time functionality
# ========================
# The USE variable is used to enable optional build-time functionality. For
# example, quite a few packages have optional X, gtk or GNOME functionality
# that can only be enabled or disabled at compile-time. Gentoo Linux has a
# very extensive set of USE variables described in our USE variable HOWTO at
# The available list of use flags with descriptions is in your portage tree.
# Use 'less' to view them:-- less /usr/portage/profiles/use.desc --
# Example:
USE="mmx sse apm oggvorbis pcmcia pnp pda encode pam ssl cups X xv avi imap
fbcon opengl mpeg kde qt arts quicktime oss gnome gtk gtk2 evo dvd gtkhtml xmms sdl
gif jpeg png tiff mozilla spell truetype pdflib tetex java berkdb samba esd dvd
tcltk crypt ldap wavelan gphoto2 cdr directfb"

# Host Setting
# ============
# If you are using a Pentium Pro or greater processor, leave this line as-is;
# otherwise, change to i586, i486 or i386 as appropriate. All modern systems
# (even Athlons) should use "i686-pc-linux-gnu"

# Host and optimization settings
# ==============================
# For optimal performance, enable a CFLAGS setting appropriate for your CPU
# -mcpu=cpu-type means optimize code for the particular type of CPU without
# breaking compatibility with other CPUs.
# -march=cpu-type means to take full advantage of the ABI and instructions
# for the particular CPU; this will break compatibility with older CPUs (for
# example, -march=athlon-xp code will not run on a regular Athlon, and
# -march=i686 code will not run on a Pentium Classic).
# CPU types supported in gcc-3.2 and higher: athlon-xp, athlon-mp, athlon-4,
# athlon-tbird, athlon, duron, k6, k6-2, k6-3, i386, i486, i586 (Pentium), i686
# (Pentium Pro), pentium, pentium-mmx, pentiumpro, pentium2 (Celeron), pentium3,
# and pentium4. Note that Gentoo Linux 1.4 and higher include at least gcc-3.2.
# CPU types supported in gcc-2.95*: k6, i386, i486, i586 (Pentium), i686
# (Pentium Pro), pentium, pentiumpro Gentoo Linux 1.2 and below use gcc-2.95*
# Decent examples:
#CFLAGS="-mcpu=athlon-xp -O3 -pipe"
CFLAGS="-march=pentium3 -O3 -pipe"

# If you set a CFLAGS above, then this line will set your default C++ flags to
# the same settings. If you don't set CFLAGS above, then comment this line out.

# Advanced Masking
# ================
# Gentoo is using a new masking system to allow for easier stability testing
# on packages. KEYWORDS are used in ebuilds to mask and unmask packages based
# on the platform they are set for. A special form has been added that
# indicates packages and revisions that are expected to work, but have not yet
# been approved for the stable set. '~arch' is a superset of 'arch' which
# includes the unstable, in testing, packages. Users of the 'x86' architecture
# would add '~x86' to ACCEPT_KEYWORDS to enable unstable/testing packages.
# '~ppc', '~sparc', '~sparc64' are the unstable KEYWORDS for their respective
# architectures.

# Portage Directories
# ===================
# Each of these settings controls an aspect of portage's storage and file
# system usage. If you change any of these, be sure it is available when
# you try to use portage. *** DO NOT INCLUDE A TRAILING "/" ***
# PORTAGE_TMPDIR is the location portage will use for compilations and
#temporary storage of data. This can get VERY large depending upon
#the application being installed.
# PORTDIR is the location of the portage tree. This is the repository
#for all profile information as well as all ebuilds. This directory
#itself can reach 200M. WE DO NOT RECOMMEND that you change this.
# DISTDIR is where all of the source code tarballs will be placed for
#emerges. The source code is maintained here unless you delete
#it. The entire repository of tarballs for gentoo is 9G. This is
#considerably more than any user will ever download. 2-3G is
#a large DISTDIR.
# PKGDIR is the location of binary packages that you can have created
#with '--buildpkg' or '-b' while emerging a package. This can get
#upto several hundred megs, or even a few gigs.
# PORTDIR_OVERLAY is a directory where local ebuilds may be stored without
#concern that they will be deleted by rsync updates. Default is not
#defined.

# Fetching files
# ==============
# If you need to set a proxy for wget or lukemftp, add the appropriate "export
# ftp_proxy=proxy" and "export http_proxy=proxy" lines to /etc/profile if
# all users on your system should use them.
# Portage uses wget by default. Here are some settings for some alternate
# downloaders -- note that you need to merge these programs first before they
# will be available.
# Lukemftp (BSD ftp):
#FETCHCOMMAND="/usr/bin/lukemftp -s -a -o DISTDIR/FILE URI"
#RESUMECOMMAND="/usr/bin/lukemftp -s -a -R -o DISTDIR/FILE URI"
# Prozilla (turbo downloader)
FETCHCOMMAND='/usr/bin/proz --no-getch $URI -P $DISTDIR'

# Advanced Features
# =================
# MAKEOPTS provides extra options that may be passed to 'make' when a
#program is compiled. Presently the only use is for specifying
#the number of parallel makes (-j) to perform. The suggested number
#for parallel makes is CPUs+1.
# AUTOCLEAN enables portage to automatically clean out older or overlapping
#packages from the system after every successful merge. This is the
#same as running 'emerge -c' after every merge. Set with: "yes" or "no".
# FEATURES are settings that affect the functionality of portage. Most of
#these settings are for developer use, but some are available to non-
#developers as well. 'buildpkg' is an always-on setting for the emerge
#flag of the same name. It causes binary packages to be created of all
#packages that are merged.
#FEATURES="sandbox ccache buildpkg"
# RSYNC_RETRIES sets the number of times portage will attempt to retrieve
#a current portage tree before it exits with an error. This allows
#for a more successful retrieval without user intervention most times.

2.3. Portage Tree Structure

The portage tree divides different packages into categories. For
example wine is in the app-emulation category. This helps to find similar
packages and keeps the structure nice and manageable. This is very similar
to the ports system on BSD machines.

2.4. Ebuild Files

Below is the ebuild file for the libogg-1.0 package. Ebuilds are just
text files, so the portage tree stays relatively small. When you rsync
your tree with the server, it only has to download and/or delete text
files in your local portage tree to bring it in sync.
2.5. Building Your Own Ebuilds

You can easily build your own ebuilds. This is where the local portage
tree comes in real handy. You could put your ebuild files in the main
portage tree, but they would be deleted when you rsync the tree. You need
an ebuild file, a ChangeLog, and a "files" directory that holds the
digest file for the files that need to be downloaded to install your
package.

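As a sketch only: a minimal, hypothetical ebuild skeleton (the package name,
URLs, and version here are made up; real ebuilds of this era follow the same
shape, since an ebuild is essentially a bash fragment sourced by portage):

```shell
# mypackage-1.0.ebuild -- hypothetical example, not a real package
DESCRIPTION="An example package"
HOMEPAGE="http://example.org/"
SRC_URI="http://example.org/${P}.tar.gz"   # ${P} expands to mypackage-1.0
LICENSE="GPL-2"
SLOT="0"
KEYWORDS="x86"

src_compile() {
    ./configure --prefix=/usr || die "configure failed"
    emake || die "make failed"
}

src_install() {
    make DESTDIR=${D} install || die "install failed"
}
```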
2.6. Using Emerge

Let’s say I am in the mood to start managing my finances and I know
that gnucash is a good program to use. I need to see if I have it
installed. I can do a search by using the built in search of emerge.

$ emerge search gnucash
[ Results for search key : gnucash ]
[ Applications found : 1 ]

Latest version available: 1.6.8
Latest version installed: 1.6.8
Size of downloaded files: 16,264 kB

Description: A personal finance manager

As you can see above, it says I do have gnucash installed and it is
the latest version available. It also gives me a description of what
the package is. The search method of emerge only searches on the name
of the package. For example if I search for wireless, then kismet will
not show up. But I can browse the portage tree to see packages that are
in the wireless section.

2.7. Installing Packages

Now, I am interested in doing some project management and one of my
friends told me about mrproject. So what do I need to do to get it
installed? We again use emerge, in the following way:

$ emerge --pretend --update mrproject

These are the packages that I would merge, in order:

Calculating dependencies ...done!
[ebuild U ] gnome-base/libgnomecanvas- []
[ebuild N ] dev-libs/libmrproject-0.6
[ebuild N ] app-office/mrproject-0.6

So it seems that if I install mrproject, it will first update
libgnomecanvas, then install libmrproject and mrproject, in that order.
It is very important to say that this is a short-circuited operation.
If the ebuild of libgnomecanvas breaks, then it will not change the
system. Emerge builds everything in a sandbox and only merges it with
the system if everything checks out ok. This prevents half-installed
packages.

Introduction to XML

Table of Contents

1. Introduction

XML is the acronym given to the eXtensible Markup Language. Along with XML
comes a slew of other acronyms; to name a few XSL, XSLT, XSL-FO, XPath, DTD,
CSS, DOM, SOAP, etc… (yes I believe there is an O’Reilly book for each of
them). Thus when reading articles talking about XML this and that, the XML
handicapped reader quickly becomes lost in acronyms. This guide is written to
give the user a simple introduction to what XML is and how it can be used to
publish stuff online. So forget what you think you know about the acronyms
listed above; they were created to confuse you. The only way to learn
something is to do it, starting off with a simple example.

2. XML vs. HTML

Conceptually you can think of XML as HTML. They look similar, and the way
you end up writing XML documents is much like writing HTML: everything is
enclosed in tags. The reason to use XML is that you are not burdened by the
look of the document; that is, content is separated from style. This is the
main strength of XML.

There are however a few differences.

  • XML is case sensitive
  • XML tags must be properly nested and closed
  • attributes must be wrapped in quotes ("")
  • all XML must have a root tag (more on this later)

3. Example 1

Let’s say I read a lot of papers and need to publish summaries of them on
the web. I just want to write summaries and have them look nice, with the
option of being able to change the look of everything later on. This is
where XML can be used in conjunction with XSLT.

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="summaries.xsl"?>

<catalog>
  <book>
    <author>D. Cantrell</author>
    <summary>This book was no good.</summary>
  </book>
</catalog>


The first line says that this is an XML document and that it is encoded in
UTF-8. Another common encoding is ISO-8859-1. You can of course add
multiple book entries in the file.

The following line specifies the stylesheet that is to be used to give meaning
to the XML tags.

The next line defines the root tag, called catalog; this name is
arbitrary. The summary tag is where we keep the actual information
(text); everything else is XML cruft.

Here is the style sheet; it tells an XML aware reader how to render the XML as HTML.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
  <h2>My Summaries</h2>
  <table>
  <xsl:for-each select="catalog/book">
    <tr>
      <td><xsl:value-of select="author"/></td>
      <td><xsl:value-of select="summary"/></td>
    </tr>
  </xsl:for-each>
  </table>
</xsl:template>
</xsl:stylesheet>


The astute reader will notice that the XSL file is also an XML file.

So to read this you could fire up your XML-aware browser (mozilla or a
derivative of it) and open up the first piece of code. Make sure the
summaries.xsl file is located in the same directory.

Reading through the stylesheet: it matches the root element, spits out the
HTML tags, and then for each book in the catalog prints the value of the
author and summary tags, in our case "D. Cantrell" and "This book was no
good."

4. Converting XML to HTML

Not all browsers can read XML files and know how to pull in the associated
XSL file. So to keep those browsers happy we can generate the HTML file
from our source XML. There are many tools to do this, written in Java,
C++, C and so on, that work on a variety of platforms. One such tool is
xsltproc, which is part of libxslt. It is very easy to use.

xsltproc -o summaries.html summaries.xml summaries.xsl

The command will generate summaries.html from the corresponding xml and
xsl files.

5. That was very poor XSL

The previous example was a simple XML document with an oversimplified
stylesheet. The only good thing about the stylesheet is that it is easy to
read. In fact you should never write stylesheets like that. To explain why
I must digress into some of the messy acronyms (not really, just what they
do).

5.1. The XSLT Parser

To render or convert XML documents into another format they must be parsed
(duh!). To do this the parser must generate a tree, starting with the root
element and trickling down to all the elements that make up the document.
XSL then matches these elements in order to figure out what they mean and,
in our case, give meaning to the tags by printing out corresponding HTML
tags.

This has all the smelly characteristics of recursion.

To be able to nest tags within one another, such as:

<summary>This book was <bold>no</bold> good.</summary>

The XSL parser must also be able to recurse and match tags within tags,
and so on. The XSL in Example 1 does not allow for this. Instead it
performs a for loop printing out the author and summary. What if you
wanted to add a bold tag to make a word boldface? This is not possible
without completely rewriting the XSL.

<xsl:for-each select="catalog/book">
  <td><xsl:value-of select="author"/></td>
  <td><xsl:value-of select="summary"/></td>
</xsl:for-each>

As you can see, for each book in the catalog we select the author and
summary and wrap them inside table data HTML tags. Thus there is no
recursion, and we have hamstrung the XSL parser, preventing us from using
its full potential.

6. Good XSL

A better way to write XSL stylesheets is to use template matching. This
simply matches one tag and then tells the parser to continue (and match
more tags).

First a longer XML example.

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="summaries.xsl"?>

<catalog>
<!-- I am Joe Comment -->

<summary>
  <title>Route Oscillations in I-BGP with Route Reflections</title>
  <published>ACM SIGCOMM 2002, Pittsburgh, PA, August 19-23, 2002.</published>
  <author>
    <name>Anindya Basu</name>
  </author>
  <author>
    <name>Chih-Hao Luke Ong</name>
  </author>
  <papersummary>
    <p>This paper also analyzes the behavior of route
    oscillations due to anomalies in I-BGP behavior. Again the
    "correctness of IBGP" is in question. The authors define route
    oscillations as "persistent route oscillation" and "transient
    route oscillations". The former is when routers exchange
    UPDATEs without ever settling on a stable path. The latter case
    is when routers undergo route oscillations due to timing
    situations.</p>

    <p>Bla Bla ...</p>
  </papersummary>
</summary>

<!-- Another summary should go in here -->
</catalog>


As you can see, we have expanded on the previous example, providing more tags
which are hopefully self-explanatory. That is another goal of XML: to have
tags that make sense.

The XSL for this looks like this.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/"><!--root rule-->
<html><head><title>Paper Summaries</title></head>
<body bgcolor="white">
<xsl:apply-templates select="/catalog"/>
</body></html>
</xsl:template>

<xsl:template match="catalog">
<xsl:apply-templates select="summary"/>
</xsl:template>

<xsl:template match="summary"><!--processing for each record-->
<table border="0" cellpadding="0" cellspacing="2">
<xsl:apply-templates select="title"/>
<xsl:apply-templates select="published"/>
<xsl:apply-templates select="author"/>
<tr>
<td>Local Paper Copy:</td>
<xsl:apply-templates select="localcopy"/>
</tr>
<tr>
<td colspan="3">
<!-- xsl:value-of select="papersummary"/ -->
<xsl:apply-templates select="papersummary"/>
</td>
</tr>
</table>
</xsl:template>

<xsl:template match="author"><!--this is often recursed since many authors-->
<tr>
<xsl:apply-templates select="name"/>
<xsl:apply-templates select="homepage"/>
</tr>
</xsl:template>

<xsl:template match="name">
<td>Author:</td><td><xsl:value-of select="."/></td>
</xsl:template>

<xsl:template match="homepage">
<!-- print the link instead of just xsl:value-of select="."/ -->
<td><a href="{.}"><xsl:value-of select="."/></a></td>
</xsl:template>

<xsl:template match="title">
<tr bgcolor="#a8caff"><td colspan="3"><i><xsl:value-of select="."/></i></td></tr>
</xsl:template>

<xsl:template match="published">
<tr><td colspan="3">Published: <xsl:value-of select="."/></td></tr>
</xsl:template>

<xsl:template match="localcopy">
<td><a href="{.}"><xsl:value-of select="."/></a></td>
</xsl:template>

<xsl:template match="papersummary">
<xsl:apply-templates select="p"/>
</xsl:template>

<xsl:template match="p">
<p><xsl:value-of select="."/></p>
</xsl:template>

</xsl:stylesheet>


The first cluster of XSL matches the root of the document and wraps
everything in opening and closing HTML tags. Then xsl:apply-templates hands
control back to the parser so that the individual tags are matched.

What is important is that xsl:template just matches a tag. To get
the information out, one simply selects it:

<xsl:value-of select="."/>

Template matching is what allows us to have multiple authors, paragraphs, etc.
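To see why, here is a toy Python dispatcher (hypothetical, stdlib only) that mimics xsl:apply-templates: each tag maps to a handler, handlers recurse into their children, and unmatched tags produce nothing, so repeated authors are handled for free:

```python
import xml.etree.ElementTree as ET

def apply_templates(elem):
    """Mimic xsl:apply-templates: dispatch each child element to
    its matching template; unmatched tags produce no output."""
    return "".join(TEMPLATES.get(c.tag, lambda e: "")(c) for c in elem)

# Toy templates loosely mirroring the stylesheet above.
TEMPLATES = {
    "summary": lambda e: "<table>" + apply_templates(e) + "</table>",
    "title":   lambda e: "<tr><td><i>%s</i></td></tr>" % e.text,
    "author":  lambda e: "<tr>" + apply_templates(e) + "</tr>",
    "name":    lambda e: "<td>Author:</td><td>%s</td>" % e.text,
}

doc = ET.fromstring(
    "<catalog><summary><title>T</title>"
    "<author><name>Anindya Basu</name></author>"
    "<author><name>Chih-Hao Luke Ong</name></author>"
    "</summary></catalog>")
print(apply_templates(doc))
```

Both author elements flow through the same "author" template, just as both of them match the author rule in the real stylesheet.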

7. Conclusion

This tutorial has hopefully given you a good enough introduction to start
using XML. A natural place for XML to be used is of course the Web, hence XSL
was also presented. XSL is very important since it gives meaning to the XML
tags. Perhaps this should have been called an intro to XSL, but nobody who
does not already know XML would go looking for XSL, and so would not be drawn
in by the buzzword that is XML.

XML is becoming more and more pervasive: it is being used within databases and
word processors, for RPC/RMI, and much more.