Sed and Awk: Editing Streams for Fun and Profit

How to use sed and awk for text processing.

This was written by David Shea and given on Wed May 21 2003.


A common task, in shell programming and elsewhere, is to take a stream of characters and somehow modify it or extract data from it. Two powerful tools that UNIX offers for this purpose, both using the magic of regular expressions, are sed and awk.

1. Regular Expressions

Regular expressions give the user the power to match any regular language while managing to be completely unreadable and incomprehensible. To further complicate things, there are two types of regular expressions defined by POSIX, basic and extended, and no two tools, or even two implementations of the same tool, seem to be able to agree on what the difference really is. For the most part, though, sed and awk implementations are at least compatible with the POSIX definitions, even if additional layers and features are added on top.

A regular expression is used to match some portion of a string. At its most basic, a regex is just a substring. So the string " Caution! Contents may be hot! " contains matches for the regular expression

Caution!

as well as

may be

or even

onten

Regular expressions are case sensitive, so the string contains no matches for

HOT

Simple, no? But not very powerful. Let's add a few extra characters. If a circumflex ('^') is the first character of a regex, it will match the beginning of the string. Likewise, if a dollar sign ('$') is the last character of a regular expression, it will match the end of the string. So the regular expression

^Caution!

would match Caution! in the above example string, but not in the string " Wet floor ahead, Caution! ". Similarly, the expression

Caution!$

would match the Caution! in the second string, but not in the first.
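The two anchors can be tried out with grep, which prints the lines that contain a match. This is only a sketch, assuming a POSIX shell and grep; the variable names are made up for the example.

```shell
# The two sample strings from the text, one per line.
lines='Caution! Contents may be hot!
Wet floor ahead, Caution!'

# ^ anchors the match to the beginning of the line.
starts=$(printf '%s\n' "$lines" | grep '^Caution!')
# $ anchors the match to the end of the line.
ends=$(printf '%s\n' "$lines" | grep 'Caution!$')

printf 'start: %s\n' "$starts"
printf 'end:   %s\n' "$ends"
```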

Now let's say you want to match more than one possibility for a character. Characters between brackets ('[' and ']') are treated as a list of possible characters to match. So ' [abcd] ' would match a single character, and that character may be 'a', 'b', 'c', or 'd'. Bracket expressions can also use ranges, so the previous example is equivalent to ' [a-d] ', though constructs such as this are sometimes bad for internationalization. More on that later.

Bracket expressions may also be negated using the circumflex ('^') as the first character. So ' [^abcd] ' would match a single character that is not 'a', 'b', 'c', or 'd'.

A few exceptions are needed in order to match the characters ']', '^', or '-' in a bracket expression. To match a ']', make it the first character, after the circumflex if one is used, as in ' []abcd] ' or ' [^]abcd] '. To match a '-', make it either the first or last item in the list, after the circumflex. To match a '^', just put it anywhere except up front.
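Here is a small sketch of those placement rules, assuming a POSIX grep (the -c flag counts matching lines). The one-character sample lines are invented for the example.

```shell
# Five one-character lines: a, e, ], -, ^
chars='a
e
]
-
^'

# [a-d] matches exactly one of a, b, c, d: only "a" here.
listed=$(printf '%s\n' "$chars" | grep -c '^[a-d]$')
# ']' first, '^' not first, '-' last: matches "]", "-" and "^".
special=$(printf '%s\n' "$chars" | grep -c '^[]^-]$')
# Negated list with ']' right after the circumflex: matches "e", "-" and "^".
negated=$(printf '%s\n' "$chars" | grep -c '^[^]a-d]$')

echo "$listed $special $negated"   # prints "1 3 3"
```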

Another useful way to match multiple possibilities for a character is the period ('.'), which will match any character. So the regular expression ' Ca.tion ' would match both " Caution " and " Caption ".

Note that any of these special characters can be escaped with a '\' to remove their special meaning, so the expression ' \. ' would match a period character. ' \\ ' matches a backslash character.

More than one character may be matched at a time using the asterisk '*'. An asterisk following a character or a bracket expression will match zero or more instances of that character or bracket expression. For example, let's say you're programming in LISP for some reason, and want to match every possible car and cdr expression. You could do this using ' c[ad]*r ' which will match " car ", " cdr ", " caadr ", " cdaar ", and everything else. However, it also matches " cr ", which probably isn't something you want. You can avoid that using ' c[ad][ad]*r ' which forces at least one instance of [ad] to exist for a match, but there is a cleaner way that we'll look into later.

Regular expressions match repetitions greedily, meaning that they will match as long a string as they possibly can. So the regular expression ' .*power ' applied to the string " My power supply is not powerful enough " would match " My power supply is not power ".
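One way to see the greedy match is to replace it with a marker and look at what is left over. This sketch borrows sed's substitute command, which is covered in the sed section; the marker 'X' is arbitrary.

```shell
s='My power supply is not powerful enough'

# .*power swallows everything up to the LAST "power" on the line,
# so only "ful enough" survives the substitution.
greedy=$(printf '%s\n' "$s" | sed 's/.*power/X/')

echo "$greedy"   # prints "Xful enough"
```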

If an expression is enclosed in escaped parentheses ('\(' and '\)'), the entire enclosed expression will be treated as a single element. So the regex ' \(bob\)* ' would match a string of zero or more bob's. In addition to allowing better groupings for repetitions, the text matched within a parenthesis group may be used later in the expression with \<digit>, with the first parenthesis group (ordered by the beginning of the grouping) being \1, the second \2, and so on through \9. So ' \([Bb][Oo][Bb]\)\1\1 ' would match " BOBBOBBOB " and " BoBBoBBoB " but not " BOBbobbob ".

A specific number of repetitions can be specified by appending a number enclosed in escaped curly braces ('\{' and '\}') to an expression. So " BOBbobbob " could be matched using ' \([Bb][Oo][Bb]\)\{3\} '. Ranges can also be given as ' \{start,end\} ' to match between start and end repetitions inclusive, or ' \{start,\} ' to match at least start repetitions. POSIX does not specify behavior for ' \{,end\} ', but pretty much everyone implements it.
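The difference between a back-reference and a repetition count shows up when both are run over the same three strings. A minimal sketch, assuming a POSIX grep with basic regular expressions:

```shell
names='BOBBOBBOB
BoBBoBBoB
BOBbobbob'

# \1 must repeat the exact text the group matched, so the mixed-case
# third line is rejected.
same=$(printf '%s\n' "$names" | grep '\([Bb][Oo][Bb]\)\1\1')

# \{3\} repeats the expression itself, so each repetition may differ
# in case and all three lines match.
any=$(printf '%s\n' "$names" | grep -c '\([Bb][Oo][Bb]\)\{3\}')

printf '%s\n' "$same"
echo "$any"
```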

For Basic Regular Expressions, that's about it. Extended Regular Expressions treat unescaped curly braces and parentheses as the special characters, and add a few more special characters of their own.

If two expressions are separated by a vertical bar ('|'), then either expression will be matched. So ' (bob|jimmy) ' would match either bob or jimmy.

The addition symbol ('+') can be used to match one or more of an expression, so ' c[ad]+r ' would solve the problems of the car and cdr example above. ' expression + ' is equivalent to ' expression {1,} '. A question mark ('?') following an expression will match that expression zero or one times. So ' expression ? ' is equivalent to ' expression {0,1} '.
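The extended operators can be tried with grep -E, which switches grep to extended regular expressions. A rough sketch; the word lists are invented for the example.

```shell
forms='car
cdr
caadr
cr'

# Basic RE with '*': the bogus "cr" matches too (4 lines).
bre=$(printf '%s\n' "$forms" | grep -c '^c[ad]*r$')
# Extended RE with '+': at least one a or d is required (3 lines).
ere=$(printf '%s\n' "$forms" | grep -E -c '^c[ad]+r$')
# Alternation: either whole name is accepted (2 lines).
alt=$(printf '%s\n' bob jimmy alice | grep -E -c '^(bob|jimmy)$')

echo "$bre $ere $alt"   # prints "4 3 2"
```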

Another nice little feature not defined in POSIX but implemented by pretty much everyone is that escaped angle brackets ('\<' and '\>') can be used to match the beginning or the end of a word. So the expression ' Caution\> ' would match " Caution " and not " Cautionary ".

I mentioned earlier that using ranges in a bracket expression is bad, and this is because not all character sets are created equal, or even contiguous. So while something like ' [A-Za-z] ' may match all letters in ASCII, it wouldn't match things like é, and who knows what it might do in something like EBCDIC. To solve this problem, character classes were created, and given an even more horrible and confusing syntax. If something like '[:alpha:]' occurs in a bracket expression, this matches any character that would return true for isalpha() in the current locale. Note that the brackets around the character class are additional brackets, not the ones already around the bracket expression. So, in ASCII ' [[:alpha:]] ' is equivalent to ' [A-Za-z] ', ' [[:lower:][:digit:]+=*] ' is equivalent to ' [a-z0-9+=*] ', and so on.
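A quick sketch of the class syntax, assuming a POSIX grep. LC_ALL=C is set here only so the result does not depend on whatever locale the example happens to run in.

```shell
# Four one-character lines: a, Z, 7, +
chars='a
Z
7
+'

# [[:alpha:]] matches the two letters.
alpha=$(printf '%s\n' "$chars" | LC_ALL=C grep -c '^[[:alpha:]]$')
# Classes can be mixed with ordinary list members: a, 7 and + match.
mixed=$(printf '%s\n' "$chars" | LC_ALL=C grep -c '^[[:lower:][:digit:]+=*]$')

echo "$alpha $mixed"   # prints "2 3"
```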

Further complications are introduced with collating elements, but that gets more into internationalization than I care to cover in this article.

So, now that you're a regularly matching fool, what next?

2. Sed, the stream editor

When given a set of rules and some input, sed will read a line of the input, modify it according to the provided rules, output the modified form, and repeat until the input is gone. The most common use of sed is to replace a regular expression with some other string, like

s/foo/bar/

which will replace the first foo on each line of the input with bar. If you want to replace every foo on each line, add a 'g' after the replacement string.

s/foo/bar/g

The sed commands are often provided along with the invocation of sed, as in

sed 's/foo/bar/'
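A short sketch of the difference the 'g' flag makes, feeding sed a single made-up line on standard input:

```shell
# Without g, only the first match on the line is replaced.
first=$(printf 'foo foo foo\n' | sed 's/foo/bar/')
# With g, every match on the line is replaced.
all=$(printf 'foo foo foo\n' | sed 's/foo/bar/g')

echo "$first"   # prints "bar foo foo"
echo "$all"     # prints "bar bar bar"
```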

Sed uses basic regular expressions, so it requires that the special characters be escaped with backslashes, otherwise they are interpreted as the literal character. For example, to use parentheses to group elements, they must be written as ' \(stuff\) '. GNU sed defines an "extended regular expression" mode, enabled with the -r flag, which eliminates the need to escape these characters, but at the cost of portability. GNU sed also allows for '?' and '+' in regular expressions, though they must be escaped ('\?' and '\+') if -r is not being used.

The choice of '/' as the separating character above is arbitrary; any character other than a backslash or a newline could be used. Another common choice is to use '%' to avoid having to escape large numbers of '/' in the expression or in the replacement text. So ' s/regex/replace/ ' is equivalent to ' s%regex%replace% '.
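Path names are the usual reason to switch delimiters. In this sketch the two paths are just sample values:

```shell
# With '/' as the delimiter every slash in the path needs a backslash.
escaped=$(printf '/usr/local/bin\n' | sed 's/\/usr\/local/\/opt/')
# With '%' as the delimiter the slashes are ordinary characters.
plain=$(printf '/usr/local/bin\n' | sed 's%/usr/local%/opt%')

echo "$escaped"   # prints "/opt/bin"
echo "$plain"     # prints "/opt/bin"
```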

Another option that can be appended to a replacement, like 'g', is 'p', which will print the line to stdout if a replacement was made. This should only be used if sed is invoked with the '-n' flag, which will cause sed to print nothing unless explicitly requested with a 'p'. POSIX does not specify whether lines printed with 'p' should be printed again, so depending on the sed implementation, some lines may be printed twice.
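Combining -n with the 'p' flag gives a grep-like filter that also edits the lines it keeps. A minimal sketch with made-up input:

```shell
# -n suppresses the automatic output; the p flag prints only the
# lines on which a replacement was actually made.
changed=$(printf 'foo\nbaz\nfoo again\n' | sed -n 's/foo/bar/p')

printf '%s\n' "$changed"
```

The line "baz" is dropped because no substitution happened on it.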

2.1. Line addresses

Addresses may be specified before the command to limit on which lines the command will be executed, such as in ' 12s/foo/bar/ ' which will replace foo with bar, but only on the 12th line of the input.

Addresses may be a line number ('12'), a regular expression enclosed in slashes ('/c[ad]*r/') which will match any line containing the expression, or the dollar sign ('$') which matches the last line. A range may also be given as addr1,addr2. If regular expressions are used in an address range, the first line that matches the regular expressions will be used. If the first address in a range is a regular expression, matches for the second address will be checked beginning with the next line.

The choice of '/' characters to delimit regular expression addresses is not necessary, but if another character is used, the first one must be prefixed by a backslash, since otherwise it will be interpreted as a command. This character does not affect the delimiting character in 's' commands, so something like ' \%c[ad]*r%s/r// ' is valid.
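The address forms described above can be compared side by side. This is a sketch over four invented input lines, not an exhaustive tour:

```shell
input='one foo
two foo
three foo
four foo'

# Line number: only line 2 is edited.
line2=$(printf '%s\n' "$input" | sed '2s/foo/bar/')
# Range between two regex addresses: lines 2 through 3.
range=$(printf '%s\n' "$input" | sed '/two/,/three/s/foo/bar/')
# '$' selects the last line.
last=$(printf '%s\n' "$input" | sed '$s/foo/bar/')
# A regex address delimited by '%' needs a leading backslash;
# the s command is free to use '%' without one.
alt=$(printf '%s\n' "$input" | sed '\%two%s%foo%bar%')

printf '%s\n' "$range"
```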

2.2. Other commands

Other useful commands are 'd', which deletes the line matching the address, and 'p', which prints out lines matching the address, or every line if no address is given (again, this should only be used in conjunction with -n, since the behavior otherwise is undefined). These three commands will make up nearly all of your usage of sed.
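As a sketch, here are 'd' and 'p' applied to the same address, using a made-up three-line input in which one line is a comment:

```shell
text='keep
# comment
keep too'

# d deletes the lines the address matches; everything else is printed.
kept=$(printf '%s\n' "$text" | sed '/^#/d')
# With -n, p prints only the lines the address matches.
only=$(printf '%s\n' "$text" | sed -n '/^#/p')

printf '%s\n' "$kept"
printf '%s\n' "$only"
```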

The only (portable) command line options that sed accepts besides -n are -f script-file , which reads in a script from the given filename, and -e script , which adds the given sed command to the script to be executed. If -f or -e are given, then a sed command cannot be given as an operand without -e, since otherwise it will be interpreted as a filename. If multiple -f or -e commands are given, they are evaluated in order.
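The same two-command script can be given either way. This sketch assumes mktemp is available to create the temporary script file; it is not part of sed, just a convenience for the example.

```shell
# Two -e options, evaluated in order.
both=$(printf 'foo baz\n' | sed -e 's/foo/bar/' -e 's/baz/qux/')

# The same commands read from a script file with -f.
script=$(mktemp)
printf 's/foo/bar/\ns/baz/qux/\n' > "$script"
fromfile=$(printf 'foo baz\n' | sed -f "$script")
rm -f "$script"

echo "$both"       # prints "bar qux"
echo "$fromfile"   # prints "bar qux"
```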

Multiple filenames may be given, and will be concatenated in order and run through the sed program. stdin is only used if no filenames are given.

3. How sed really works

Sed has two memory spaces, the hold space and the pattern space. For each cycle, the pattern space is cleared, a line of input is read into the pattern space, the program is run, and, if the -n flag was not given, the final contents of the pattern space are written to the output. This repeats until all input is read, or until execution is terminated with the 'q' command. Nothing is ever automatically placed in the hold space, but there are several commands to manipulate it.
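A tiny sketch of the two spaces at work, using the 'x' command, which exchanges the hold and pattern spaces. Each cycle swaps the new line for the one saved a cycle earlier, so the script ends up printing every line except the last.

```shell
# Cycle 1: x leaves an empty pattern space and "a" in the hold space.
# Cycle 2: x brings "a" back and stores "b"; 2,$p prints "a".
# Cycle 3: x brings "b" back and stores "c"; "b" is printed.
# "c" is still sitting in the hold space when the input runs out.
delayed=$(printf 'a\nb\nc\n' | sed -n -e 'x' -e '2,$p')

printf '%s\n' "$delayed"
```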

The 's' command, in addition to being the most useful for actual text processing, can also be used for conditional branches. A branch point can be defined using ' : LABEL ', and the command ' t LABEL ' will branch to this label if a successful substitution has been made since the last branch or input read. ' b LABEL ' is the unconditional counterpart. If no label is given to either t or b, they will jump to the end of the script, which is useful for starting a new cycle.
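A label and 't' together make a loop: keep substituting until the substitution stops succeeding. The sketch below turns each leading space into a dash, one per pass. The label name 'a' is arbitrary, and the commands are split across -e options so the label ends cleanly on any implementation.

```shell
# Pass 1: "   x" -> "-  x"; pass 2: "-- x"; pass 3: "---x".
# On the fourth pass the substitution fails, t does not branch,
# and the cycle ends with the pattern space printed as usual.
dashes=$(printf '   x\n' | sed -e ':a' -e 's/^\(-*\) /\1-/' -e 'ta')

echo "$dashes"   # prints "---x"
```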

Using all of this, powerful, incomprehensible programs may be written, like the implementation of the dc calculator shipped with the GNU sed source, or the following very short text adventure:

# Should be runnable either with or without -n
# Only commands supported are directions, since I didn't want this to get
# three miles long
#
# Trying very hard to use only BREs
#
# Look text shamelessly stolen from Infocom's ZORK

# restore state
# x exchanges hold and pattern spaces
# Each room must exchange back to read input
x
s/room0/&/
t room0
s/room1/&/
t room1
s/room2/&/
t room2
# default
b room0


# North goes to room1, south goes back to room0, southeast goes to room2
: room0
x
# i\ outputs text up to first line without trailing '\'
# '{' and '}' commands are used to create groups matched by a
# single address
# expression matches line containing word "look" optionally surrounded by
# whitespace, and nothing else
/^[[:space:]]*look[[:space:]]*$/{
i\
Maze\
You are in a maze of twisty little passages, all alike
b end
}
# Matches optional leading "go" and word "n" or "north"
# directions work by putting room name in pattern space, and if substitution
# was made, the room name is copied to the hold space and the pattern space
# cleared
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Nn]\([Oo][Rr][Tt][Hh]\)\{0,1\}[[:space:]]*$/room1/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss]\([Oo][Uu][Tt][Hh]\)\{0,1\}[[:space:]]*$/room0/
# No '|' in BREs, so need two expressions for 'se' and 'southeast'
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss][Ee][[:space:]]*$/room2/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss][Oo][Uu][Tt][Hh][Ee][Aa][Ss][Tt][[:space:]]*$/room2/
t copyend
b badend

# South goes back to room0, North goes to room2
: room1
x
# Matches a line containing only the word "look", as in room0
/^[[:space:]]*look[[:space:]]*$/{
i\
West of House\
You are standing in an open field west of a white house, with a boarded\
front door.\
There is a small mailbox here.
b end
}
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Nn]\([Oo][Rr][Tt][Hh]\)\{0,1\}[[:space:]]*$/room2/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss]\([Oo][Uu][Tt][Hh]\)\{0,1\}[[:space:]]*$/room0/
t copyend
b badend

# East wins and quits, West goes to room0, South goes to room1
: room2
x
/^[[:space:]]*look[[:space:]]*$/{
i\
Stone Barrow\
You are standing in front of a massive barrow of stone.  In the east face is a\
huge stone door which is open.  You cannot see into the dark of the tomb.
b end
}
# delete input so not printed when quitting
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ee]\([Aa][Ss][Tt]\)\{0,1\}[[:space:]]*$//
t win
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ww]\([Ee][Ss][Tt]\)\{0,1\}[[:space:]]*$/room0/
s/^[[:space:]]*\(go[[:space:]]\)\{0,1\}[[:space:]]*[Ss]\([Oo][Uu][Tt][Hh]\)\{0,1\}[[:space:]]*$/room1/
t copyend
b badend

: win
i\
You win!
# d starts a new cycle, so is not good for deleting pattern space and quitting
# there will be an extra newline printed out at the end if -n not used
q

: badend
# assumes all unknown commands are directions, for brevity
# strips off leading "go", prints out rest
# does nothing if there is no input
/./s/\(^[[:space:]]*go[[:space:]]*\)\{0,1\}\(.*\)/There is no exit to the \2/p
b end
: copyend
# h replaces the hold space with the contents of the pattern space
h
: end
# delete whatever is left in the pattern space so it is not printed
d

Interaction with this little script may look something like this: (input bold, output italic)

$ sed -f adventure.sed
look
Maze
You are in a maze of twisty little passages, all alike
go southeast
look
Stone Barrow
You are standing in front of a massive barrow of stone. In the east face is a
huge stone door which is open. You cannot see into the dark of the tomb.
e
You win!

$

Another command that is useful to know in sed is 'N', which reads another line of input and appends it to the pattern space, separated by a newline. This can be used to match multi-line expressions. However, because 'N' consumes the input two lines at a time, care must be taken when a match could begin on the second line of one pair and end on the first line of the next.

# replace all 'one\ntwo' with 'three'
: begin
N
s/one\ntwo/three/
t
# if line read has 'one\n', strip out second line so first can be
# output.  Restore 'one\n' line from hold space and read next line
h
s/\(.*\n\).*one$/\1/
t again
b
: again
s/\n//
p
x
s/.*\n\(.*one\)$/\1/
b begin
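For comparison, here is about the simplest possible use of 'N': joining each pair of lines into one. This is only a sketch. The input deliberately has an even number of lines, because implementations differ on what 'N' does when there is no next line to read.

```shell
# N appends the next line to the pattern space, separated by a
# newline; the s command then turns that newline into a space.
joined=$(printf '1\n2\n3\n4\n' | sed -e 'N' -e 's/\n/ /')

printf '%s\n' "$joined"
```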

Several other commands exist in sed, and are described in the GNU sed info and man pages, among other places.

4. Awk, that other shell utility thing

Awk is named after its creators, Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. According to the gawk (GNU awk) info page, awk programs are "refreshingly easy to read and write" compared to programs written in traditional procedural languages.

Awk is described as "data-driven", in that rather than a list of commands to perform on data, awk programs are a description of data and actions to take based on which descriptions are matched.

Awk programs consist of a set of rules of the form ' PATTERN { ACTION } ', where the pattern can be 'BEGIN', 'END', an extended regular expression enclosed in /'s, an awk expression that is matched if it evaluates to a non-zero number or a non-empty string, nothing at all which matches everything, or a range given as 'PATTERN1, PATTERN2'. As with address ranges in sed, pattern ranges may be repeated; after the end of a range is found, the beginning may be matched again.
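The pattern types can be seen together in one small program. This is a sketch over the made-up input lines 1 through 5; the messages are arbitrary.

```shell
out=$(printf '1\n2\n3\n4\n5\n' | awk '
BEGIN    { print "start" }                    # runs before any input
/3/      { print "regex matched " $0 }        # regular expression
$0 > 4   { print "expression matched " $0 }   # awk expression
/2/, /3/ { print "in range " $0 }             # range, ends included
END      { print "done" }                     # runs after all input
')

printf '%s\n' "$out"
```

Line 3 triggers two rules, and the rules fire in the order they appear in the program.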

Most of your interaction with awk will probably be with a small subset of its features. The most commonly used awk command is 'print', usually used in conjunction with awk's field separation features. awk '{print $4}' would print the fourth field of every line of input, so if you were to, for example, run the output of ls -l through this tiny program, awk would spit out a big list of group names.

Of course, awk is much more than a fancy cut.

4.1. Separation of Data

Awk views input as a sequence of records, and it views records as a collection of fields. By default, each line of input is a record, and each portion of a record separated by whitespace is a field. The behavior for records can be changed by setting the RS variable. RS is a single character, by default '\n'; if RS is empty, records are instead separated by blank lines.

Field separation can similarly be modified by setting the FS variable. An initial value can be given to FS on the command line, with the -F flag. Depending on its contents, FS can be interpreted in three different ways. By default, FS contains a single space (' '), which means that leading and trailing whitespace is ignored, and fields are separated by any number of spaces or tabs. If FS contains any other single character, the behavior is more like that of cut, in that each occurrence of the FS character will start a new field, and if two or more FS characters are adjacent, the text between them is an empty field. awk -F : '{print $3}' /etc/passwd would print out the UID of every user, and is equivalent to cut -d : -f 3 /etc/passwd .
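The first two FS modes can be compared with a sketch like this one. NF is awk's built-in count of fields in the current record. The input lines, including the passwd-style entry, are made up.

```shell
# Default FS: runs of spaces and tabs separate fields, and leading
# and trailing whitespace is ignored.
ws=$(printf '  a   b\tc  \n' | awk '{ print NF ":" $2 }')

# Single-character FS: every ':' starts a new field, so "::" encloses
# an empty second field.
single=$(printf 'a::c\n' | awk -F : '{ print NF ":" $2 ":" $3 }')

# Third field of a passwd-style line is the UID.
uid=$(printf 'root:x:0:0:root:/root:/bin/sh\n' | awk -F : '{ print $3 }')

echo "$ws"       # prints "3:b"
echo "$single"   # prints "3::c"
echo "$uid"      # prints "0"
```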

The third mode for FS is when it contains more than one character, in which case it is interpreted as an extended regular expression. Field separators are then matched starting from the left, and using the longest possible non-empty string. The fields are whatever is left in between.
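A sketch of the regular expression mode, with a separator of "a comma or semicolon followed by any number of spaces". The input line and the '|' used to show the field boundaries are invented for the example.

```shell
# Each separator match is as long as possible, so the spaces after
# a comma are swallowed rather than left at the front of a field.
fields=$(printf 'a, b;c,   d\n' |
    awk -F '[,;] *' '{ print $1 "|" $2 "|" $3 "|" $4 }')

echo "$fields"   # prints "a|b|c|d"
```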

Fields can be accessed using the $<number> variables, and the entire record can be accessed using $0.

4.2. Examples

Rather than cover every detail of awk syntax, which would be rather long and boring, I'll just go over a few examples. If you want to learn more about awk, the gawk man and info pages have a complete description of what awk can do.

Suppose you have a directory listing from ls -l, and you want to know exactly how many bytes are being used by the files. Awk can do this, simply by taking the sum of the 5th fields of each record (which in the case of ls, would be the file length).

Recall that ls -l output looks something like this:

total 28
drwx------    2 david    users        4096 2003-05-19 22:31 directory/
-rw-------    1 david    users          11 2003-05-19 22:31 file1
-rw-------    1 david    users       17138 2003-05-19 22:31 files

Variables can be treated as either strings or numbers, depending on the context, and conversions are made automatically, so we can use $5 in a sum simply by adding to it. If something like $4 were used in a sum instead, it would be converted to 0.

So we can take the sum of the lengths with the following:

BEGIN { total = 0 }
{ total = total + $5 } # 'total += $5' would also work
END { print total }

So just assigning to a variable will cause it to spring into being. The BEGIN statement could be omitted, since the value for an empty numeric variable is zero (this can also be seen as unassigned variables being equal to the empty string, and the empty string being converted to 0 when used as a number). For the first record (total 28), $5 is also equal to 0, since the fifth field of this record is empty.

Also, note that variables are referenced only by their name, instead of "$name" as in Bourne shell and some other scripting languages. If $total were used instead of total, awk would take the current value of total as a number, and then try to interpret that as a field number.
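The difference between a variable and a field reference is easy to see with a sketch like the following, where 'n' is an arbitrary variable name:

```shell
# n holds the number 2; $n is therefore the second field.
out=$(printf 'x y z\n' | awk '{ n = 2; print n; print $n }')

printf '%s\n' "$out"
```

The first line of output is "2" and the second is "y".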

The above program may still not be what you want, since directories are included in the sum as well. Those can be easily eliminated through pattern matching.

/^-/ { total += $5 }
END { print total }

This will only add the file's length to the sum if the record begins with '-', which would mean it is a regular file. Similarly, a pattern of

! /^d/

would only omit directories.
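Running these programs over the sample listing shows the difference. In this sketch the listing is stored in a shell variable instead of being piped from a real ls -l, so the numbers are reproducible.

```shell
listing='total 28
drwx------    2 david    users        4096 2003-05-19 22:31 directory/
-rw-------    1 david    users          11 2003-05-19 22:31 file1
-rw-------    1 david    users       17138 2003-05-19 22:31 files'

# Every record: 4096 + 11 + 17138 (the "total 28" line adds 0).
everything=$(printf '%s\n' "$listing" | awk '{ total += $5 } END { print total }')
# Regular files only: 11 + 17138.
files=$(printf '%s\n' "$listing" | awk '/^-/ { total += $5 } END { print total }')
# Everything that is not a directory; same sum for this listing.
nodirs=$(printf '%s\n' "$listing" | awk '!/^d/ { total += $5 } END { print total }')

echo "$everything $files $nodirs"   # prints "21245 17149 17149"
```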

Patterns such as this can become cumbersome if only a specific field matters, so matches may be made based only on a particular field. Suppose you want only the files owned by the user root.

$3 ~ /^root$/ { total += $5 }
END { print total }

The ~ operator will result in true if the awk expression on the left matches the regular expression on the right. !~ can be used for the opposite.
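Here is a sketch of both operators on a made-up listing with three different owners. Note that the anchors keep the owner "rooter" from being counted as root.

```shell
listing='-rw------- 1 root   users 100 2003-05-19 22:31 a
-rw------- 1 david  users  11 2003-05-19 22:31 b
-rw------- 1 root   users  23 2003-05-19 22:31 c
-rw------- 1 rooter users   5 2003-05-19 22:31 d'

# Owner is exactly "root": 100 + 23.
mine=$(printf '%s\n' "$listing" | awk '$3 ~ /^root$/ { total += $5 } END { print total }')
# Owner is anything else: 11 + 5.
others=$(printf '%s\n' "$listing" | awk '$3 !~ /^root$/ { total += $5 } END { print total }')

echo "$mine $others"   # prints "123 16"
```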

And for one last example, let's throw in some numeric and string tests. Same situation as before, but now we only want to consider the length of the file if it is greater than 1024, but not if the user's name is more than 5 characters long.

($5 > 1024) && (length($3) <= 5) { total += $5 }
END { print total }
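A sketch of that last test on a made-up listing: a file counts only if it is larger than 1024 bytes and its owner's name is at most 5 characters long.

```shell
listing='-rw-r--r-- 1 root     users  2048 2003-05-19 22:31 big
-rw-r--r-- 1 david    users 17138 2003-05-19 22:31 files
-rw-r--r-- 1 david    users    11 2003-05-19 22:31 file1
-rw-r--r-- 1 postgres users  4096 2003-05-19 22:31 db'

# file1 is too small and "postgres" is too long, leaving 2048 + 17138.
total=$(printf '%s\n' "$listing" |
    awk '($5 > 1024) && (length($3) <= 5) { total += $5 } END { print total }')

echo "$total"   # prints "19186"
```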