Sed and Awk: Editing Streams for Fun and Profit

A common task, in shell programming and elsewhere, is to take a stream of
characters and somehow modify it or extract data from it. Two powerful tools
that UNIX offers for this purpose, both using the magic of regular expressions,
are sed and awk.

1. Regular Expressions

Regular
expressions give the user the power to match any regular language while managing
to be completely unreadable and incomprehensible. To further complicate things,
there are two types of regular expressions defined by POSIX, basic and extended,
and no two tools, or even two implementations of the same tool, seem to be able
to agree on what the difference really is. For the most part, though, sed and
awk implementations are at least compatible with the POSIX definitions, even if
additional layers and features are added on top.

A regular expression is
used to match some portion of a string. At its most basic, a regex is just a
substring. So the string "Caution! Contents may be hot!" contains
matches for the regular expression

Caution!

as well
as

may be

or even

onten

Regular expressions are case sensitive, so the string contains no matches
for

HOT

Simple, no? But not very powerful. Let’s
add a few extra characters. If a circumflex (‘^’) is the first character of a
regex, it will match the beginning of the string. Likewise, if a dollar sign
(‘$’) is the last character of a regular expression, it will match the end of
the string. So the regular expression

^Caution!

would match Caution! in the above example string, but not in the
string "Wet floor ahead, Caution!". Similarly, the expression

Caution!$

would match the Caution! in the second
string, but not in the first.
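
A quick way to check the two anchors is to count matching lines. This is only a sketch: it assumes a POSIX grep is on hand (grep is not otherwise covered here), and the variable names are made up for the example.

```shell
first='Caution! Contents may be hot!'
second='Wet floor ahead, Caution!'

# grep -c prints the number of matching lines (0 or 1 here). It exits
# non-zero when nothing matches, hence the '|| true'.
start_first=$(printf '%s\n' "$first" | grep -c '^Caution!' || true)
start_second=$(printf '%s\n' "$second" | grep -c '^Caution!' || true)
end_first=$(printf '%s\n' "$first" | grep -c 'Caution!$' || true)
end_second=$(printf '%s\n' "$second" | grep -c 'Caution!$' || true)

echo "$start_first $start_second $end_first $end_second"   # 1 0 0 1
```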

Now let’s say you want to match more
than one possibility for a character. Characters between brackets (‘[’ and
‘]’) are treated as a list of possible characters to match. So
‘[abcd]’ would match a single character, and that character may
be ‘a’, ‘b’, ‘c’, or ‘d’. Bracket expressions can also use ranges, so the
previous example is equivalent to ‘[a-d]’, though constructs
such as this are sometimes bad for internationalization. More on that
later.

Bracket expressions may also be negated using the circumflex
(‘^’) as the first character. So ‘[^abcd]’ would match a single
character that is not ‘a’, ‘b’, ‘c’, or ‘d’.

A few
exceptions are needed in order to match the characters ‘]’, ‘^’, or ‘-’ in a
bracket expression. To match a ‘]’, make it the first character, after the
circumflex if one is used. So something like ‘[]abcd]’ or
‘[^]abcd]’. To match a ‘-’, make it either the first or last
item in the list, after the circumflex. To match a ‘^’, just put it anywhere
except up front.

Another useful way to match multiple possibilities
for a character is the period (‘.’), which will match any character. So the
regular expression ‘Ca.tion’ would match both
"Caution" and "Caption".

Note that any of these
special characters can be escaped with a ‘\’ to remove their special
meaning, so the expression ‘\.’ would match a period
character. ‘\\’ matches a backslash character.

More
than one character may be matched at a time using the asterisk ‘*’. An
asterisk following a character or a bracket expression will match zero or
more instances of that character or bracket expression. For example,
let’s say you’re programming in LISP for some reason, and want to match
every possible car and cdr expression. You could do this using
‘c[ad]*r’ which will match "car", "cdr",
"caadr", "cdaar", and everything else. However, it also
matches "cr", which probably isn’t something you want. You can
avoid that using ‘c[ad][ad]*r’ which forces at least one
instance of [ad] to exist for a match, but there is a cleaner way that
we’ll look into later.
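
To see the difference between the two expressions, here is a small sketch that again assumes a POSIX grep. The word list is invented for the example, and -x makes grep require that the whole line match.

```shell
words='car
cdr
caadr
cdaar
cr
cat'

# Zero or more of [ad]: also accepts the unwanted "cr".
loose=$(printf '%s\n' "$words" | grep -x 'c[ad]*r')

# One [ad] followed by zero or more: "cr" is rejected.
strict=$(printf '%s\n' "$words" | grep -x 'c[ad][ad]*r')

echo "loose:";  echo "$loose"
echo "strict:"; echo "$strict"
```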

Regular expressions match repetitions
greedily, meaning that they will match as long a string as they possibly can.
So the regular expression ‘.*power’ applied to the string
"My power supply is not powerful enough" would match "My
power supply is not power".

If an expression is enclosed in
escaped parentheses (‘\(’ and ‘\)’), the entire enclosed expression will
be treated as a single element. So the regex ‘\(bob\)*’
would match a string of zero or more bob’s. In addition to allowing
better groupings for repetitions, the text matched within a parenthesis
group may be used later in the expression with a backslash followed by a
digit, with the
first parenthesis group (ordered by the beginning of the grouping) being
\1, the second \2, and so on through \9. So
‘\([Bb][Oo][Bb]\)\1\1’ would match "BOBBOBBOB" and
"BoBBoBBoB" but not "BOBbobbob".

A specific
number of repetitions can be specified by appending a number
enclosed in escaped curly braces (‘\{’ and ‘\}’) to an expression. So
"BOBbobbob" could be matched using
‘\([Bb][Oo][Bb]\)\{3\}’. Ranges can also be given as
‘\{start,end\}’ to match between start and end repetitions
inclusive, or ‘\{start,\}’ to match at least start
repetitions. POSIX does not specify behavior for ‘\{,end\}’,
but pretty much everyone implements it.
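
The difference between a back-reference and a repetition count is easy to miss, so here is a hedged sketch, assuming a POSIX grep, that counts how many of the three strings each expression accepts.

```shell
inputs='BOBBOBBOB
BoBBoBBoB
BOBbobbob'

# \1 must repeat the exact text the group matched, so the mixed-case
# third string is rejected.
backref=$(printf '%s\n' "$inputs" | grep -c '\([Bb][Oo][Bb]\)\1\1' || true)

# \{3\} repeats the expression itself, so each repetition may differ.
interval=$(printf '%s\n' "$inputs" | grep -c '\([Bb][Oo][Bb]\)\{3\}' || true)

echo "back-reference: $backref, interval: $interval"   # 2 and 3
```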

For Basic Regular
Expressions, that’s about it. Extended Regular Expressions treat
unescaped curly braces and parentheses as the special characters, and add
a few more special characters of their own.

If two expressions are
separated by a vertical bar (‘|’), then either expression will be matched.
So ‘(bob|jimmy)’ would match either bob or jimmy.

The
plus sign (‘+’) can be used to match one or more of an expression,
so ‘c[ad]+r’ would solve the problems of the car and cdr
example above. ‘expression+’ is equivalent to
‘expression{1,}’. A question mark (‘?’) following
an expression will match that expression zero or one times. So
‘expression?’ is equivalent to
‘expression{0,1}’.
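
The extended operators can be tried out with grep -E, which selects extended regular expressions. This is a sketch under the assumption that a POSIX grep is available; the word lists, including the ‘colou?r’ example, are invented for illustration.

```shell
# '+' requires at least one [ad], so "cr" is rejected.
plus=$(printf '%s\n' car cr cadr | grep -E -x 'c[ad]+r')

# '|' accepts either alternative inside the group.
alt=$(printf '%s\n' bob jimmy alice | grep -E -x '(bob|jimmy)')

# '?' makes the preceding 'u' optional, but does not allow two of them.
opt=$(printf '%s\n' color colour colouur | grep -E -x 'colou?r')

printf '%s\n' "$plus" "$alt" "$opt"
```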

Another nice little
feature not defined in POSIX but implemented by pretty much everyone is
that escaped angle brackets (‘\<’ and ‘\>’) can be used to match
the beginning or the end of a word. So the expression
‘Caution\>’ would match "Caution" and not
"Cautionary".

I mentioned earlier that using ranges in a
bracket expression is bad, and this is because not all character sets are
created equal, or even contiguous. So while something like
‘[A-Za-z]’ may match all letters in ASCII, it wouldn’t
match accented letters, and who knows what it might do in something
like EBCDIC. To solve this problem, character classes were created, and
given an even more horrible and confusing syntax. If something like
‘[:alpha:]’ occurs in a bracket expression, this matches any character
that would return true for isalpha() in the current locale. Note that the
brackets around the character class are additional brackets, not the
ones already around the bracket expression. So, in ASCII
‘[[:alpha:]]’ is equivalent to ‘[A-Za-z]’,
‘[[:lower:][:digit:]+=*]’ is equivalent to
‘[a-z0-9+=*]’, and so on.
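
As a rough check of the last equivalence, the sketch below pins the locale to C (so the ASCII behavior described above applies) and counts the lines made up entirely of characters from the bracket expression. It assumes a POSIX grep, and the sample lines are invented.

```shell
# Lines 1 and 3 contain only lowercase letters, digits, '+', '=' or '*';
# line 2 is uppercase and so falls outside the bracket expression.
count=$(printf '%s\n' 'abc123' 'ABC' '+=*' |
    LC_ALL=C grep -c '^[[:lower:][:digit:]+=*]*$' || true)

echo "$count"   # 2
```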

Further complications are
introduced with collating elements, but that gets more into
internationalization than I care to cover in this article.

So, now
that you’re a regularly matching fool, what next?

2. Sed, the stream editor

When given a set of rules and some
input, sed will read a line of the input, modify it according to the
provided rules, output the modified form, and repeat until the input is
gone. The most common use of sed is to replace a regular expression with
some other string, like

s/foo/bar/

which will
replace the first foo on each line of the input with bar. If you want to
replace every foo on each line, add a ‘g’ after the replacement
string.

s/foo/bar/g

The sed commands are often
provided along with the invocation of sed, as in

sed 's/foo/bar/'
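
For instance, feeding a line with three foos through both forms shows what the ‘g’ flag changes. The input line is made up for the example.

```shell
line='foo foo foo'

# Without 'g', only the first match on each line is replaced.
once=$(printf '%s\n' "$line" | sed 's/foo/bar/')

# With 'g', every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/foo/bar/g')

echo "$once"   # bar foo foo
echo "$all"    # bar bar bar
```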

Sed uses basic regular expressions, so grouping parentheses and
repetition braces must be escaped with backslashes, otherwise they
are interpreted as the literal character. For example, to use parentheses
to group elements, they must be used as ‘\(stuff\)’. GNU sed
defines an "extended regular expression" mode, enabled with -r, which
eliminates the need to
escape these characters, but at the cost of portability. GNU sed also
allows for ‘?’ and ‘+’ in regular expressions, though they must be escaped
(‘\?’ and ‘\+’) if -r is not being used.

The choice of ‘/’ as the
separating character above is arbitrary; any character other than a
backslash or a newline could be used.
Another common choice is to use ‘%’ to avoid having to escape large
numbers of ‘/’ in the expression or in the replacement text. So
‘s/regex/replace/’ is equivalent to
‘s%regex%replace%’.

Another option that can be
appended to a replacement, like ‘g’, is ‘p’, which will print the line to
stdout if a replacement was made. This should only be used if sed is
invoked with the ‘-n’ flag, which will cause sed to print nothing unless
explicitly requested with a ‘p’. POSIX does not specify whether lines
printed with ‘p’ should be printed again, so depending on the sed
implementation, some lines may be printed twice.
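
Combining -n with the ‘p’ flag gives a filter that shows only the lines it changed. A small sketch with invented input, using ‘%’ as the delimiter just to show that it works the same way:

```shell
input='alpha
foo one
beta
foo two'

# -n suppresses the automatic printing; the 'p' flag prints a line
# only when the substitution succeeded on it.
changed=$(printf '%s\n' "$input" | sed -n 's%foo%bar%p')

echo "$changed"
```

Only the two rewritten lines, "bar one" and "bar two", come out.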

2.1. Line addresses

Addresses may be specified before the command to limit the lines on
which the command will be executed, such as in
‘12s/foo/bar/’ which will replace foo with bar, but only on
the 12th line of the input.

Addresses may be a line number (‘12’),
a regular expression enclosed in slashes (‘/c[ad]*r/’) which will match
any line containing the expression, or the dollar sign (‘$’) which matches
the last line. A range may also be given as addr1,addr2. If regular
expressions are used in an address range, the first line that matches the
regular expressions will be used. If the first address in a range is a
regular expression, matches for the second address will be checked
beginning with the next line.

The choice of ‘/’ characters to
delimit regular expression addresses is not necessary, but if another
character is used, the first one must be prefixed by a backslash, since
otherwise it will be interpreted as a command. This character does not
affect the delimiting character in ‘s’ commands, so something like
‘\%c[ad]*r%s/r//’ is valid.
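
The three kinds of address can be compared side by side. This is just a sketch; the six input lines and the variable names are invented for the example.

```shell
input='foo 1
foo 2
start foo
foo 4
stop foo
foo 6'

# Line-number address: only line 2 is changed.
by_number=$(printf '%s\n' "$input" | sed '2s/foo/bar/')

# Regex range: from the first line matching "start" through the next
# line matching "stop" (lines 3 to 5 here).
by_range=$(printf '%s\n' "$input" | sed '/start/,/stop/s/foo/bar/')

# '$' addresses the last line of input.
last_only=$(printf '%s\n' "$input" | sed '$s/foo/bar/')

printf '%s\n' "$by_range"
```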

2.2. Other commands

Other useful commands are ‘d’, which deletes
the line matching the address, and ‘p’, which prints out lines matching
the address, or every line if no address is given (again, this should only
be used in conjunction with -n, since the behavior otherwise is
undefined). These three commands will make up nearly all of your usage of
sed.

The only (portable) command line options that sed accepts
besides -n are -f script-file, which reads in a script from the
given filename, and -e script, which adds the given sed command
to the script to be executed. If -f or -e are given, then a sed command
cannot be given as an operand without -e, since otherwise it will be
interpreted as a filename. If multiple -f or -e commands are given, they
are evaluated in order.

Multiple filenames may be given, and will
be concatenated in order and run through the sed program. stdin is only
used if no filenames are given.
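
As an illustration of ‘d’, ‘p’ and multiple -e options, the sketch below strips comment lines and blank lines from some invented input, then prints only the lines matching an address.

```shell
input='# comment
keep me

keep me too'

# Two -e scripts, applied in order: drop comment lines, then empty lines.
cleaned=$(printf '%s\n' "$input" | sed -e '/^#/d' -e '/^$/d')

# With -n, 'p' prints only the lines selected by the address.
printed=$(printf '%s\n' "$input" | sed -n '/too/p')

echo "$cleaned"
echo "$printed"
```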

3. How sed really works

Sed has two memory spaces, the hold space
and the pattern space. For each cycle, the pattern space is cleared, a
line of input is read into the pattern space, the program is run, and, if
the -n flag was not given, the final contents of the pattern space are
written to the output. This repeats until all input is read, or until
execution is terminated with the ‘q’ command. Nothing is ever
automatically placed in the hold space, but there are several commands to
manipulate it.
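
A classic small use of the hold space is printing the input in reverse line order. The sketch below relies on two commands not yet described here: ‘h’ copies the pattern space into the hold space, and ‘G’ appends the hold space to the pattern space after a newline.

```shell
# For every line but the first, append everything seen so far (kept in
# the hold space) after the current line, then save the result back to
# the hold space. On the last line, print the accumulated text.
reversed=$(printf '%s\n' a b c | sed -n -e '1!G' -e 'h' -e '$p')

echo "$reversed"
```

With the three input lines a, b and c, this prints c, b and a.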

The ‘s’ command, in addition to being the most
useful for actual text processing, can also be used for conditional
branches. A branch point can be defined using ‘: LABEL’, and the
command ‘t LABEL’
will branch to this label if a successful substitution has been made since
the last branch or input read. ‘b LABEL’ is the
unconditional counterpart. If no label is given to either t or b, they
will jump to the end of the script, which is useful for starting a new
cycle.
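
A loop built from a label and ‘t’ can repeat a substitution until it stops matching. This sketch, with an invented label name and input, turns each leading space into an underscore, one per pass, and leaves other spaces alone.

```shell
# Each pass replaces the first space that follows a (possibly empty)
# run of leading underscores; 't again' jumps back while that succeeds.
underscored=$(printf '%s\n' '   x y' |
    sed -e ': again' -e 's/^\(_*\) /\1_/' -e 't again')

echo "$underscored"   # ___x y
```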

Using all of this, powerful, incomprehensible programs may
be written, like the implementation of the dc calculator shipped with the
GNU sed source, or the following very short text adventure:

# Should be runnable either with or without -n
# Only commands supported are directions, since I didn't want this to get
# three miles long
#
# Trying very hard to use only BREs
#
# Look text shamelessly stolen from Infocom's ZORK

# restore state
# x exchanges hold and pattern spaces
# Each room must exchange back to read input
x
s/room0/&/
t room0
s/room1/&/
t room1
s/room2/&/
t room2
# default
b room0

# North goes to room1, south goes back to room0, southeast goes to room2
: room0
x
# i\ outputs text up to first line without trailing '\'
# '{' and '}' commands are used to create groups matched by a
# single address
# expression matches line containing word "look" optionally surrounded by
# whitespace, and nothing else
/^[[:space:]]*look[[:space:]]*$/{
i\
Maze\
You are in a maze of twisty little passages, all alike
b end
}
# Matches optional leading "go" and word "n" or "north"
# directions work by putting room name in pattern space, and if substitution
# was made, the room name is copied to the hold space and the pattern space
# cleared
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Nn]\([Oo][Rr][Tt][Hh]\)\{0,1\}[[:space:]]*$/room1/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss]\([Oo][Uu][Tt][Hh]\)\{0,1\}[[:space:]]*$/room0/
# No '|' in BREs, so need two expressions for 'se' and 'southeast'
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss][Ee][[:space:]]*$/room2/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss][Oo][Uu][Tt][Hh][Ee][Aa][Ss][Tt][[:space:]]*$/room2/
t copyend
b badend

# South goes back to room0, North goes to room2
: room1
x
# Matches any line that begins with the word "look"
/^[[:space:]]*look[[:space:]]*$/{
i\
West of House\
You are standing in an open field west of a white house, with a boarded\
front door.\
There is a small mailbox here.
b end
}
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Nn]\([Oo][Rr][Tt][Hh]\)\{0,1\}[[:space:]]*$/room2/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss]\([Oo][Uu][Tt][Hh]\)\{0,1\}[[:space:]]*$/room0/
t copyend
b badend

# East wins and quits, West goes to room0, South goes to room1
: room2
x
/^[[:space:]]*look[[:space:]]*$/{
i\
Stone Barrow\
You are standing in front of a massive barrow of stone. In the east face is a\
huge stone door which is open. You cannot see into the dark of the tomb.
b end
}
# delete input so not printed when quitting
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ee]\([Aa][Ss][Tt]\)\{0,1\}[[:space:]]*$//
t win
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ww]\([Ee][Ss][Tt]\)\{0,1\}[[:space:]]*$/room0/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss]\([Oo][Uu][Tt][Hh]\)\{0,1\}[[:space:]]*$/room1/
t copyend
b badend

: win
i\
You win!
# d starts a new cycle, so is not good for deleting pattern space and quitting
# there will be an extra newline printed out at the end if -n not used
q

: badend
# assumes all unknown commands are directions, for brevity
# strips off leading "go", prints out rest
# does nothing if there is no input
/./s/\(^[[:space:]]*go[[:space:]]*\)\{0,1\}\(.*\)/There is no exit to the \2/p
b end
: copyend
# h replaces the hold space with the contents of the pattern space
h
: end
# delete whatever is left in the pattern space so it is not printed
d

Interaction with this little script may look
something like this (the lines "look", "go southeast", "look" and "e" are
typed input; everything else is output):

$ sed -f adventure.sed
look
Maze
You are in a maze of twisty little passages, all alike
go southeast
look
Stone Barrow
You are standing in front of a massive barrow of stone. In the east face is a
huge stone door which is open. You cannot see into the dark of the tomb.
e
You win!

$

Another command that is useful to know in sed is ‘N’,
which reads another line of input and appends it to the pattern space.
This can be used to match multi-line expressions. However,
considerations must be made for lines of data read in unusual
contexts.

# replace all 'one\ntwo' with 'three'
: begin
N
s/one\ntwo/three/
t
# if line read has 'one\n', strip out second line so first can be
# output. Restore 'one\n' line from hold space and read next line
h
s/\(.*\n\).*one$/\1/
t again
b
: again
s/\n//
p
x
s/.*\n\(.*one\)$/\1/
b begin

Several other commands exist in sed, and are described in the GNU
sed info and man pages, among other places.

4. Awk, that other shell utility thing

Awk is named after its
creators, Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan.
According to the gawk (GNU awk) info page, awk programs are
"refreshingly easy to read and write" compared to programs written in
traditional procedural languages.

Awk is described as
"data-driven", in that rather than a list of commands to perform on
data, awk programs are a description of data and actions to take based
on which descriptions are matched.

Awk programs consist of a set
of rules of the form ‘PATTERN { ACTION }’, where the
pattern can be ‘BEGIN’, ‘END’, an extended regular expression enclosed
in /’s, an awk expression that is matched if it evaluates to a non-zero
value or a non-empty string, nothing at all which matches everything, or a
range given as
‘PATTERN1, PATTERN2’. As with address ranges in sed, a pattern range
may be repeated; after the end of a range is found, the
beginning may be matched again.
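
Here is a sketch that uses one rule of each kind on a few invented records, so that the order in which rules fire can be seen. The variable name and the labels in the output are made up for the example.

```shell
input='header
start
a 1
b 0
stop
tail'

report=$(printf '%s\n' "$input" | awk '
BEGIN              { print "begin" }                # before any input
/^start$/,/^stop$/ { print "range: " $0 }           # range pattern
$2                 { print "nonzero: " $0 }         # expression pattern
END                { print "end, " NR " records" }  # after all input
')

echo "$report"
```

The record "b 0" falls inside the range but fails the expression pattern, since its second field is zero, and records with no second field fail it as well.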

Most of your interaction with
awk will probably be with a small subset of its features. The most
commonly used awk command is ‘print’, usually used in conjunction with
awk’s field separation features. awk '{print $4}' would
print the fourth field of every line of input, so if you were to, for
example, run the output of ls -l through this tiny
program, awk would spit out a big list of group names.

Of
course, awk is much more than a fancy cut.

4.1. Separation of Data

Awk views input as a sequence of
records, and it views records as a collection of fields. By default,
each line of input is a record, and each portion of a record separated
by whitespace is a field. The behavior for records can be changed by
setting the RS variable. RS is a single character (or no character, in
which case records are separated by blank lines), and by default is ‘\n’.
Field separation can similarly be modified by setting the FS variable.
An initial value can be given to FS on the command line, with the -F
flag. Depending on its contents, FS can be interpreted in three
different ways. By default, FS contains a single space (‘ ’), which
means that leading and trailing whitespace is ignored, and fields are
separated by any number of spaces or tabs. If FS contains any other single
character, the behavior is more like that of cut, in that each
occurrence of the FS character will start a new field, and if two or more
FS characters are adjacent, whatever lies between them is an empty
field. awk -F : '{print $3}' /etc/passwd would print out
the UID of every user, and is equivalent to
cut -d : -f 3 /etc/passwd.
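
The first two FS modes can be compared on a couple of made-up lines in the /etc/passwd format (the accounts below are invented, so the example does not depend on the local system).

```shell
sample='root:x:0:0:root:/root:/bin/sh
david:x:1000:100::/home/david:/bin/sh'

# Single-character FS: every ':' starts a new field.
uids=$(printf '%s\n' "$sample" | awk -F : '{ print $3 }')

# Two adjacent ':' delimit an empty field (the fifth one for david).
fifth=$(printf '%s\n' "$sample" | awk -F : '{ print "[" $5 "]" }')

# Default FS: leading, trailing and repeated blanks do not add fields.
nfields=$(printf '  a   b\tc  \n' | awk '{ print NF }')

printf '%s\n' "$uids" "$fifth" "$nfields"
```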

The third mode for FS is when it contains
more than one character, in which case it is interpreted as an extended
regular expression. Field separators are then matched starting from
the left, and using the longest possible non-empty string. The fields
are whatever is left in between.
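
As a sketch of the third mode, the separator below is a comma or semicolon followed by any number of spaces. The input line is invented for the example.

```shell
# FS is longer than one character, so it is treated as an ERE:
# [,;] followed by zero or more spaces.
result=$(printf 'a, b;c,   d\n' | awk -F '[,;] *' '{ print NF; print $4 }')

echo "$result"
```

This reports four fields, the last of which is "d" with the spaces before it swallowed by the separator.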

Fields can be accessed using
the $<number> variables, and the entire record can be accessed
using $0.

4.2. Examples

Rather than
cover every detail of awk syntax, which would be rather long and
boring, I’ll just go over a few examples. If you want to learn more
about awk, the gawk man and info pages have a complete description of
what awk can do.

Suppose you have a directory listing from ls
-l, and you want to know exactly how many bytes are being used by the
files. Awk can do this, simply by taking the sum of the 5th fields of
each record (which in the case of ls, would be the file length).

Recall that ls -l output looks something like this:

total 28
drwx------  2 david users  4096 2003-05-19 22:31 directory/
-rw-------  1 david users    11 2003-05-19 22:31 file1
-rw-------  1 david users 17138 2003-05-19 22:31 files

Variables can be treated as either
strings or numbers, depending on the context, and conversions are made
automatically, so we can use $5 in a sum simply by adding to it. If
something like $4 were used in a sum instead, it would be converted to
0.

So we can take the sum of the lengths with the following:

BEGIN { total = 0 }
{ total = total + $5 } # 'total += $5' would also work
END { print total }

So just assigning
to a variable will cause it to spring into being. The BEGIN statement
could be omitted, since the value for an empty numeric variable is zero
(this can also be seen as unassigned variables being equal to the empty
string, and the empty string being converted to 0 when used as a
number). For the first record (total 28), $5 is also equal to 0, since
the fifth field of this record is empty.
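
To try this without depending on a real directory, the sample listing from above can be fed to awk directly. A minimal sketch, with the BEGIN rule left out:

```shell
listing='total 28
drwx------  2 david users  4096 2003-05-19 22:31 directory/
-rw-------  1 david users    11 2003-05-19 22:31 file1
-rw-------  1 david users 17138 2003-05-19 22:31 files'

# total starts out empty, which counts as 0 in a numeric context.
sum=$(printf '%s\n' "$listing" | awk '{ total += $5 } END { print total }')

echo "$sum"   # 4096 + 11 + 17138 = 21245
```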

Also, note that
variables are referenced only by their name, instead of "$name" as in
Bourne shell and some other scripting languages. If $total were used
instead of total, awk would take the current value of total as a
number, and then try to interpret that as a field number.

The
above program may still not be what you want, since directories are
included in the sum as well. Those can be easily eliminated through
pattern matching.

/^-/ { total += $5 }
END { print total }

This will only add the file’s length to the sum
if the record begins with ‘-‘, which would mean it is a regular
file. Similarly, a pattern of

!/^d/

would only omit directories.

Patterns
such as this can become cumbersome if only a specific field
matters, so matches may be made based only on a particular field.
Suppose you want only the files owned by the user root.

$3 ~ /^root$/ { total += $5 }
END { print total }

The ~ operator will result in true if the awk expression
on the left matches the regular expression on the right. !~ can
be used for the opposite.

And for one last example, let’s
throw in some numeric and string tests. Same situation as before,
but now we only want to consider the length of the file if it is
greater than 1024, but not if the user’s name is more than 5
characters long.

($5 > 1024) && (length($3) <= 5) { total += $5 }
END { print total }
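
The three filters from this section can be run against one listing to compare their sums. This is only a sketch: the listing extends the earlier sample with two invented entries (a file owned by root and one owned by a user with a long name) so that each filter has something to exclude.

```shell
listing='total 36
drwx------  2 david    users  4096 2003-05-19 22:31 directory/
-rw-------  1 david    users    11 2003-05-19 22:31 file1
-rw-------  1 david    users 17138 2003-05-19 22:31 files
-rw-r--r--  1 root     root   2048 2003-05-19 22:31 motd
-rw-------  1 jonathan users  8192 2003-05-19 22:31 notes'

# Regular files only.
files_only=$(printf '%s\n' "$listing" |
    awk '/^-/ { total += $5 } END { print total }')

# Entries owned by root only.
root_only=$(printf '%s\n' "$listing" |
    awk '$3 ~ /^root$/ { total += $5 } END { print total }')

# Larger than 1024 bytes and owner name at most 5 characters long.
filtered=$(printf '%s\n' "$listing" |
    awk '($5 > 1024) && (length($3) <= 5) { total += $5 } END { print total }')

echo "$files_only $root_only $filtered"   # 27389 2048 23282
```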

5. Further Reading