Why Functional Programming Matters

These notes are based on a paper of the same title by John Hughes which was published in The Computer Journal vol.32, in April 1989. Look in the Science Library if you wish to view the original source (but don't ask the librarian to let you borrow the journal or she might bite you...)

Key Points

Denying the pleasures of life?

Functional languages appear to have all the attractions of the life of a medieval monk, denying the pleasures of life in the hope it will lead to virtue: no assignment, no side-effects, no explicit flow of control.

The key to understanding the importance of functional programming is to focus on what functional programming adds, rather than what it takes away. The ``pleasures'' denied by functional programming turn out to be necessary in achieving far greater satisfaction.

How to be modular

Modular software is generally accepted to be the key to successful software. Recent languages such as Modula-2 and Ada include features to help improve modularity. However, there is a very important point that is often missed. When writing a modular program to solve a problem, one first divides the problem into sub-problems, then solves the sub-problems and combines the solutions. The ways in which the original problem can be divided up depend directly on the ways in which solutions can be ``glued'' together. Therefore, providing new kinds of ``glue'' provides new opportunities for modularisation.

Higher-order functions and lazy evaluation are two very important kinds of glue.

Gluing functions together

Higher-order functions are a new kind of glue that enables simple functions to be glued together to make more complex ones.

You may have already seen a simple recursive function for summing the elements of a list:

sum [] = 0
sum (x:xs) = x + sum xs
This recursive pattern is conventionally called foldr (Hughes calls it reduce), and sum can be expressed as
sum = foldr (+) 0
The definition of foldr can be derived by parameterising the definition of sum:
(foldr (op) z) [] = z
(foldr (op) z) (x:xs) = x `op` ((foldr (op) z) xs)
Extra brackets were placed around (foldr (op) z) to make it clear that this expression replaces sum in the definition. Conventionally, these brackets are omitted, and so ((foldr (op) z) xs) is written (foldr (op) z xs). When a function of three arguments like foldr is applied to only two arguments, the result is taken to be a function in the one remaining argument. This is called partial application or currying.
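For example, the partially applied function (foldr (+) 0) becomes a complete computation once it is given a list; the evaluation of sum [1,2,3] unfolds as
sum [1,2,3]
  = foldr (+) 0 (1 : (2 : (3 : [])))
  = 1 + foldr (+) 0 (2 : (3 : []))
  = 1 + (2 + foldr (+) 0 (3 : []))
  = 1 + (2 + (3 + foldr (+) 0 []))
  = 1 + (2 + (3 + 0))
  = 6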

Having modularised sum, we can start reusing the parts. For example,

product = foldr (*) 1
any = foldr (||) False
all = foldr (&&) True
These are all functions on lists that follow the same pattern of recursion as sum. Product returns the product of all the numbers in the list; any tests for the presence of the value True in the list; and all tests whether every value in the list is True (that is, for the absence of False).

One way to understand (foldr (op) z) is as a function that replaces all occurrences of (:) in a list by (op), and the occurrence of [] with z. Taking the list [1,2,3] as an example, since this list means

1 : ( 2 : ( 3 : [] ) )
then (foldr (+) 0) converts it into
1 + ( 2 + ( 3 + 0 ) )
and (foldr (*) 1) converts it into
1 * ( 2 * ( 3 * 1 ) )
Clearly, (foldr (:) []) just copies the list. If we replace the [] at the end of the list with another list, we have append (which is called (++) in Haskell):
a ++ b = foldr (:) b a
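For example, [1,2] ++ [3,4] is foldr (:) [3,4] [1,2], which replaces the [] at the end of [1,2] with [3,4]:
1 : ( 2 : [3,4] )
which is the list [1,2,3,4].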
We can count the number of elements in a list using the function length as defined by
length = foldr add_1 0
  where add_1 el len = len + 1
This works because add_1 ignores each list element and simply adds one to the running count, so the 0 is incremented once for every (:) in the list.
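For example, on a three-element list the elements are ignored and only counted:
length [7,8,9]
  = add_1 7 (add_1 8 (add_1 9 0))
  = add_1 7 (add_1 8 1)
  = add_1 7 2
  = 3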

Suppose we want to apply a function to each element of a list, and return the list of results. For example, map (+ 1) [1,2,3] should return the list [2,3,4]. We can even define this function using foldr:

map f = foldr f_and_cons []
  where f_and_cons a l = f a : l
Map is another generally useful function.

If we represent a matrix as a list of lists, then sum . map sum will return the total of all the elements. The map sum part sums each row into a list of row totals, and these row totals are then summed by the outer sum to give a grand total.
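For example, with a small matrix:
(sum . map sum) [[1,2],[3,4]]
  = sum (map sum [[1,2],[3,4]])
  = sum [3,7]
  = 10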

All-in-all, these examples show that modularising a simple function like sum can provide a powerful higher-order function that can be reused in many ways.

foldr is pervasive amongst list operations because it is directly matched to the form of the list data type. The list data type has two constructors: cons (:) and nil ([]); the foldr function has two arguments, one for each of these constructors.

This observation leads us to consider relatives of foldr that apply to other data types. For example, consider the data type of ordered labeled trees, defined by

data Tree a = Tree a [ Tree a ]
This definition says that a Tree is built from a label of some arbitrary type a, together with a list of subtrees whose labels have the same type. For example, the tree
         1
        / \
       2   3
           |
           4
would be represented as the term
Tree 1 [ Tree 2 [], Tree 3 [ Tree 4 [] ] ]
We will call the higher-order ``fold'' function for trees foldtree. Since trees are built using three constructors (Tree, (:) and []), foldtree will need three arguments. Also, since trees involve values of two different kinds--the trees themselves and the lists of subtrees--we will implement foldtree using two functions, one for each. Here goes:
foldtree op1 op2 z (Tree label ls) = label `op1` foldtree' ls
  where
    foldtree' [] = z
    foldtree' (tree:ls) = foldtree op1 op2 z tree `op2` foldtree' ls
A function to sum the labels in a tree of numbers is foldtree (+) (+) 0. Written out in terms of its constructors, the tree used earlier is
Tree 1 ( Tree 2 [] : ( Tree 3 ( Tree 4 [] : [] ) ) : [] )
Calling foldtree (+) (+) 0 replaces each Tree and (:) with (+), and each [] with 0, giving the expression
1 + ( ( 2 + 0 ) + ( ( 3 + ( ( 4 + 0 ) + 0 ) ) + 0 ) )
which of course evaluates to 10.

The labels in the tree can be extracted into a list with the function foldtree (:) (++) []. Applying this to our tree gives:

1 : ( ( 2 : [] ) ++ ( ( 3 : ( ( 4 : [] ) ++ [] ) ) ++ [] ) )
Finally, we can define a function analogous to map which applies a function to each label in the tree:
maptree f = foldtree f_and_cons (:) []
  where f_and_cons label subtree = Tree (f label) subtree
All this can be achieved because functional languages allow functions which are indivisible in conventional programming languages to be expressed as a combination of parts--a general higher-order function and some particular specialising functions. Once defined, such higher-order functions allow many operations to be programmed very easily. Whenever a new data type is defined, higher-order functions should be written for processing it. This makes manipulation of the data type easy, and also localises knowledge about details of the representation.
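For example, the number of labels in a tree is yet another specialisation of foldtree (the names sizetree and count_label below are ours, not standard prelude functions):
sizetree = foldtree count_label (+) 0
  where count_label label n = 1 + n   -- ignore the label, add one per node
Applied to the example tree above, sizetree returns 4, and maptree (+ 1) returns Tree 2 [ Tree 3 [], Tree 4 [ Tree 5 [] ] ].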

Gluing programs together

The other kind of glue that functional languages provide enables whole programs to be glued together. Recall that a complete functional program is just a function from its input to its output. If f and g are two such programs, then (g . f), the composition of g and f, is a program which, when applied to its input, computes g (f input).

The program f computes its output which is used as input to the program g. This might be implemented conventionally by storing the output from f in a temporary data structure. The problem with this is that the temporary data structure might occupy so much memory that it is impractical to glue the programs together in this way. Functional languages provide a solution to this problem. The two programs f and g are run together in strict synchronisation. The program f is only started once g tries to read some input, and only runs long enough to deliver the output g is trying to read. Then f is suspended and g is run until it tries to read another input. As an added bonus, if g terminates without reading all of f's output, then f is aborted. The function f can even be a non-terminating program, producing an infinite amount of output, since it will always be terminated forcibly once g is finished. This allows termination to be separated from loop bodies -- a powerful modularisation.

Since this method of evaluation runs f as little as possible, it is called ``lazy evaluation''. It makes it practical to modularise a program as a generator which constructs a large number of possible answers, and a selector which chooses the appropriate one.
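As a small illustration of this generator/selector pattern, using only standard prelude functions, the first power of two greater than 1000 can be computed by
head (filter (> 1000) (iterate (* 2) 1))
Here iterate (* 2) 1 is a generator describing the infinite list of powers of two, and head (filter (> 1000) ...) is the selector; lazy evaluation ensures that only the first eleven elements of the list are ever computed.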

Newton-Raphson square roots

If you wish to run the programs described here, load the file /users/studs/ugrad/G360/numerical.lgs into Gofer. The code has not been tested on MacGofer, and I would not be surprised if it does not work (this is true for all the Gofer code I write; you are going to need to use Unix for the Logic Programming part of the course anyway, so why not start now!).
Consider the Newton-Raphson algorithm for finding square roots. The algorithm starts with an initial approximation, a0, and computes better and better approximations using the rule
a' = (a + N/a)/2
If the approximations converge to a limit a, then a = (a + N/a)/2, so 2a = a + N/a, which gives a^2 = N and hence a = sqrt(N).

Square root programs take a tolerance, eps, and stop when two successive approximations differ by less than eps.

The algorithm is usually programmed more or less like this:

C   N IS CALLED ZN HERE SO THAT IT HAS
C   THE RIGHT TYPE
    X = A0
    Y = A0+2.*EPS
C   THE VALUE OF Y DOESN'T MATTER SO
C   LONG AS ABS(X-Y).GT.EPS
100 IF (ABS(X-Y).LE.EPS) GOTO 200
    Y = X
    X = (X+ZN/X)/2.
    GOTO 100
200 CONTINUE
C   THE SQUARE ROOT OF ZN IS NOW IN X
This program is indivisible in conventional languages. We will express it in a more modular form using lazy evaluation, and then show some other uses to which the parts may be put.

Since Newton-Raphson computes a sequence of approximations, it is natural to represent this using a list. Each approximation is derived from the previous one by the function

next n x = (x + n/x)/2
so next n is the function mapping one approximation onto the next. The sequence of approximations is
[ a0, next n a0, next n (next n a0), ... ]
The standard prelude defines a function to compute this:
iterate f a = a : iterate f (f a)
The list of approximations is computed by
iterate (next n) a0
iterate is an example of a function with an infinite output; this does not matter, because no more approximations will be computed than the rest of the program requires. The infinity is only potential; all it means is that iterate itself places no limit on the number of values produced.
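For instance (take is a standard prelude function that returns the first k elements of a list), asking for just four approximations to the square root of 2 with
take 4 (iterate (next 2) 1)
computes only four elements of the potentially infinite list, giving approximately [1.0, 1.5, 1.4167, 1.4142].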

The remainder of the square root finder is a function within that takes a tolerance and a list of approximations and looks down the list for two successive approximations that differ by no more than the given tolerance. It can be defined as

within eps (a:b:rest)
    | abs(a-b) <= eps = b
    | otherwise = within eps (b:rest)
Putting the parts together, we have
sqrt a0 eps n = within eps (iterate (next n) a0)
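For example (assuming this definition is used in place of the standard prelude's own sqrt), the square root of 2 can be computed by
sqrt 1 0.0001 2
which returns roughly 1.41421.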
Now that we have the parts of a square root finder, we can try combining them in different ways. One modification is to wait for the ratio between two successive approximations to approach one, rather than for the difference to approach zero. This is more appropriate for very small numbers (where the difference is small to start with) and for very large numbers (where rounding errors can exceed the tolerance). It is only necessary to write a replacement for within:
relative eps (a:b:rest)
   | abs(a/b-1)<=eps = b
   | otherwise = relative eps (b:rest)
A new version of sqrt can be defined as
relsqrt a0 eps n = relative eps (iterate (next n) a0)

Numerical differentiation

The result of differentiating a function at a point is the slope of the function's graph at that point. This can be estimated by evaluating the function at the given point and at another point nearby, and computing the slope of a straight line between the two points. This assumes that, if the points are close enough together, the graph of the function will not curve much in between. This gives the definition
easydiff f x h = ( f (x+h) - f x ) / h
In order to get a good approximation, the value of h should be small. Unfortunately, if h is too small then the values of f (x+h) and f x are very close together, so the rounding errors in the subtraction may swamp the result. How can the best value of h be chosen? One solution is to compute a succession of approximations with smaller and smaller values of h, starting with a reasonably large one. Such a sequence should converge to the value of the derivative, but will eventually become hopelessly inaccurate due to rounding error. If within eps is used to select the first approximation that is accurate enough, then the risk of rounding error affecting the result is minimised. We need a function to compute the sequence:
differentiate h0 f x = map (easydiff f x) (iterate halve h0)
  where halve x = x / 2
Here h0 is the initial value of h, and successive values are obtained by successive halving. We can now compute the derivative at any point with
within eps (differentiate h0 f x)
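For example, the slope of the squaring function at 3 can be estimated by
within 0.0001 (differentiate 1 (\x -> x * x) 3)
which should return a value very close to the true derivative, 6.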
This is not a wonderful solution because it converges fairly slowly. A little mathematics can help. The elements of the sequence can be expressed as ``the right answer + an error term involving h'', and it can be shown theoretically that the error term is roughly proportional to a power of h, so that it gets smaller as h gets smaller. Let the right answer be A and the error term be B*h^n. Since each approximation is computed using a value of h that is twice that used for the next one, any two successive approximations can be expressed as
a  = A + B * 2^n * h^n
a' = A + B * h^n
Multiplying the second equation by 2^n and subtracting the first eliminates the error term, and we conclude
A = (a' * 2^n - a) / (2^n - 1)
Of course, since the error term is only roughly a power of h, this conclusion is also approximate, but it is a much better approximation. This improvement can be applied to all successive pairs of approximations, using the function
elimerror n (a:b:rest)
   = (b * 2^n - a)/(2^n - 1) : elimerror n (b:rest)
Eliminating error terms from a sequence of approximations yields another sequence which converges much more rapidly. A more efficient derivative can be computed with
within eps (elimerror 1 (differentiate h0 f x))
elimerror only works on sequences of approximations which are computed using a parameter h which is halved between each successive approximation. However, if it is applied to such a sequence its result is also such a sequence! This means that a sequence of approximations can be improved more than once. A different error term is eliminated each time, and the resulting sequences converge faster and faster. So, one could compute a derivative very efficiently using
within eps (elimerror 3
             (elimerror 2
               (elimerror 1 (differentiate h0 f x))))
In numerical analysis terms, this is a fourth order method, and gives an accurate result very quickly. One could even define
elimall n s = s : elimall (n+1) (elimerror n s)
which eliminates more and more error terms to give a sequence of sequences of approximations, each one converging faster than the last. Now,
super s = map head (elimall 1 s)
selects the first approximation from each of those sequences, to form a very rapidly converging sequence indeed. This is really a very sophisticated algorithm--it uses a better and better numerical method as more and more approximations are computed. One could compute derivatives very efficiently indeed with the program
within eps (super (differentiate h0 f x))
This is probably a case of using a sledge-hammer to crack a nut, but the point is that even an algorithm as sophisticated as super can be expressed easily when modularised using lazy evaluation.
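As a concrete (and admittedly sledge-hammer) example, the same derivative as before could be computed by
within 0.0001 (super (differentiate 1 (\x -> x * x) 3))
which should again return a value very close to 6, after examining only a handful of approximations.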