- Introduction
- Streams in Scheme
- Functional Programming
- Representing Strings as Streams
- Input Ports as Strings
- An Example: A web server
- Unix Pipes
- Using iterators to build streams
- Performance considerations
- Conclusions
This is a draft. I expect to clean up some things soon. – Alejandro Forero Cuervo.
Introduction
One of the distinguishing characteristics of functional programming, along with the abscense of side-effects and the presence of higher-order functions, is support for lazy evaluation. Semantics supporting lazy evaluation have been present in the Scheme programming language by means of the delay and force constructs. Recently, a programming interface with important improvements has been designed and implemented, in SRFI-40, for creating and using lazy lists, commonly refered to as streams.
This article proposes a special programming style for text-manipulation. In this style strings are represented as streams (rather than, as usually, vectors of characters using the native string type) and functions are programmed in functional style, with no side effects.
The article begins with a quick look at the history of streams and lazy evaluation in the Scheme programming language, explaining the importance of referential transparency and describing the benefits and potential problems of representing strings using streams (such as, for example, being unable to obtain the character at a certain position in constant time).
Some code is provided for converting ports into strings (represented as streams of characters) and viceversa. Using continuations and custom ports, we are able to integrate this programming style with code that isn't based in streams.
The style of programming proposed is compared with the long tradition among Unix administrators of making sophisticated commands by joining together simple commands through pipes. This comparation identifies a few important similarities between both programming styles.
A few performance considerations are quickly analyzed. However, most of the work in this area is left for future articles (or an update to this one).
Finally, pointers to a concrete free-software implementation, based on Chicken Scheme, are provided, along with some examples of programs where I have already used this style successfully.
Streams in Scheme
Semantics for supporting lazy evaluation have been available in Scheme for a long time. A (delay expr) form captures the expresion expr and returns a promise to evaluate it at some point in the future by passing it to the force procedure.
One simplistic way to implement delay would be to transform any (delay expr) form into (lambda () expr); force would then simply call the function passed to it.
However, this wouldn't work, as the semantics for delay and force specify that the expresion in a delay form should be evaluated only the first time the promise is forced. Subsequent forces of the promise should use the value returned the first time. Also, though less importantly, this would make promises indistinguishable from procedures.
R5RS provides an implementation of delay and force as a macro and a procedure respectively that is correct with regards to the caching property. It is worth noticing that all that a language requires to support some form of lazy evaluation is the presence of closures (having a good system of macros, while helpful, is not necessary).
For many years, Scheme programmers have been able to create potentially infinite lists by means of delay and force. For example, a list with all the integers could be represented as:
(define integers
(let loop ((current 0))
(cons current (delay (loop (+ current 1))))))To take the car of the list, one uses the car procedure, as with any other list. However, to obtain the cdr, one must apply the composition of force and cdr. Checking for termination of the list is done as with any other list, using the null? procedure.
SRFI-40 describes multiple problems with this representation for lazy lists. For example, there is an off-by-one error: when n values from the list are used, n+1 values will be computed. A solid justification for solving this problem, along with concrete examples showing that it is worse than it might seem at first look, is included in SRFI-40.
A different representation for lazy-lists is provided in SRFI-40. The term stream in this article refers to objects of that particular type. SRFI-40 introduces a new set of symbols for working with them, namely stream-cons, stream-car, stream-cdr, stream-null?, stream-null, stream-delay and a few others; force and delay are not useful for working with streams (though, as mentioned, a stream-delay form is provided).
With this functionality, the list of integers could be defined as:
(define integers
(let loop ((current 0))
(stream-cons current (loop (+ current 1))))))The stream-cons primitive is a macro that delays the evaluation of both of its arguments until they are actually needed.
Lazy lists evaluate the expressions that provide their contents (the expresions inside a stream-cons or stream-delay form) only as they are actually required (by calls to stream-car, stream-cdr or stream-null?). This is why they can represent infinite lists.
Even if the lists are long but finite, it is possible to traverse them without keeping them entirely in memory. Scheme implementations will be able to recycle memory used to store portions of them as soon as they are no longer required. Consider, for example, the following function which returns the sum of all the elements in a (finite) stream:
(define (lazy-list-sum stream)
(let loop ((total 0) (str stream))
(if (null? str)
total
(loop (+ total (stream-car str)) (force (stream-cdr str))))))If you apply it to a very long finite stream, it will never require to keep more than a single element from the stream in memory at any given time. Had we used a list instead of a stream (and car and cdr instead of stream-car and stream-cdr), it would have been created entirely before the iteration could start.
A correct implementation for SRFI-40, based on the ideas in SRFI-45, allows us to give recursive definitions for iterative algorithms that manipulate streams, such as the one above, and executes them in constant space.
Functional Programming
What skates are to ice, functional programming is to Lisp. Together the two allow you to travel more gracefully, with less effort. – Paul Graham.
Scheme is not a purely functional language, as it includes among its primitives procedures and forms with side-effects such as display or set!. It encourages, however, a style of programming free of side-effects, as long as certain forms are not used.
Generally speaking, we consider a given function referentially transparent if it always produces the same output given the same input. For example, + is a referentially transparent function while a random number generator, as usually defined, is not (unless its seed is one of its inputs). Formally, for a function to be referentially transparent it must be possible to substitute any of its subexpressions for its actual value without altering its meaning.
Many Scheme programmers use the terms functional style or functional programming to refer to this highly recommended style of programming, in which procedures have no side-effects and work by building and returning new objects.
Referentially transparent functions can be easily tested: it suffices to load their definition (and the definitions of functions they depend on, recursively) and invocate them. Other functions require the creation of a "simulation environment", with a "state" analogous to the one found during normal execution. Also, functions written in this functional style can be reasoned about independent from each other. If a set of functions modifies the environment that they depend on, different interactions will produce different results, some of which may trigger errors while others do not.
The style of programming proposed in this article leans strongly towards the practice of writting referentially transparent functions for manipulation of strings. Rather than modify strings directly (through string-set! and similar functions), functions should construct new strings and return them. For example, an hypotetical base64-decode function, doing what its name advertises, would receive an entire string to decode and return a new string with its contents.
This would be impractical (requiring obscene amounts of memory) if the entire contents of the string have to be kept in memory during the whole execution of the functions, but the representation we propose does not pose that inconvenience.
Representing Strings as Streams
Scheme includes a particular type, the aptly-named string type, for the purpose of representing strings. From the point of view of a Scheme programmer, they are not vectors: #(#\a #\b #\c) and "abc" are different objects; applying a vector function to a string is an error. However, internally, strings are represented using vectors by most Scheme implementations.
Streams of characters can serve as an alternate representation for strings. The R5RS functions string->list and list->string suggest the idea of representing them as lists of characters. Although sometimes useful for quickly writing small transformations (such as, in the absense of a string-map! function, (list->string (map char-downcase (filter char-alphabetic? (string->list str))))), I am not aware of signficant Scheme programs using this representation throughout.
Representing strings as stream has one important advantage: it allows a great deal of string-manipulation algorithms to be written in functional-style and executed without requiring the entire input strings to be kept in memory entirely at any time. Functions are written to produce an output stream such that the input streams are only built into memory partially, as they are required to produce each individual character in the output stream. As soon as portions are no longer required, the implementation can reuse their memory.
Consider this example:
(define (stream-map func str)
(stream-delay
(if (stream-null? str)
stream-null
(stream-cons (func (stream-car str))
(stream-map func (stream-cdr str))))))(define (stream-downcase str) (stream-map char-downcase str))
If the stream-downcase function is applied to a stream read from a port, an implementation will never need to keep more than a single byte from the input stream in memory at any given time.
With the usual representation for strings, we could create a version that modifies the input string (no referential transparency here), or we could create one that creates a new string and stores its results there. In both cases we would need to keep the entire contents of the string in memory.
Another option for working with the usual string representation would be writing some code that breaks the input down into chunks and processes each chunk separatedly. Using streams of characters is a very practical way to do it (here each chunk consists of one character); while there are others, they tend to be clumsy.
R5RS defines the following functions for the string type:
- number->string, string->number, symbol->string, string->list, etc. - Convert from strings to other types and viceversa.
- (string? obj) - Evaluates if obj is a string.
- (make-string k [char]) - Builds a string of k char characters.
- (string char ...) - Builds a string with the characters specified.
- (string-length string) - Returns the number of characters in a string.
- (string-ref string n) - Returns the n-th character from a string.
- (string-set! string k char) - Sets the k-th character from a string.
- string=?, string-ci=?, string<?, string<=?, ... - Useful comparison procedures.
- (substring string start end) - Returns the substring from str at the position specified.
- (string-append string ...) - Appends the strings given.
- (string-copy string) - Returns a copy of a string.
- (string-fill! string char) - Stores char at every position in the given string.
Lets focus in a purely functional subset of Scheme: we will, after all, write our functions in functional style. string-set! and string-fill! are thus removed and string-copy is no longer useful.
string-length and string-ref, which required a constant number of steps to execute (complexity O(N)=1), now will have to perform a number of steps linearly proportional to the size of their inputs (O(N)=N). The complexity of (substring string start end) increases as well: now start aditional steps have to be performed to skip the prefix of the string. The algorithmic complexity of the remaining functions remains the same.
This is a good argument for representing strings as vectors rather than lists or streams of characters. Indeed, it is probably the reason why most implementations for Scheme and many other languages (such as Java or C) represent strings as vectors.
However, (1) being able to manipulate strings without keeping them in memory, in a relatively transparent manner, and (2) encouraging referential transparency for string-manipulation programs, by making them much more efficient than they would otherwise be, are very big wins, making a stronger case for using streams to represent strings.
It might seem that representing strings as streams would make the code harder to understand; however, this does not happen in practice as long as a good library for processing streams, with stream-based replacements for common functions for manipulating strings, is available. Although no SRFI specifing one such programming interface exists, there's already one such library (Chicken's stream-ext) in the public domain.
This document does not abide for replacing the internal representation for strings currently used by Scheme implementations; it only proposes that programmers start to use the new stream type in tasks for which they have traditionally used strings. That is, the string type should still be available, unmodified, to programmers. Internally representing strings as streams would probably be too-big a change to justify; many would consider the resulting language not real Scheme, a Scheme-derivative at most.
Although it has rarely been used by Scheme programmers, this representation for strings is not a new or revolutionary idea. It has been used for some time in many programming languages such as Haskell, on which lists are streams and strings are lists (thus streams) of characters. Haskell does go one step further than we propose, by representing strings as streams internally.
Input Ports as Strings
Scheme treats strings and input ports as disjoint types. Indeed, certain operations that apply to strings don't have direct equivalents for ports: how would you, for example, define the length of a given input port? You could define it as the number of characters that can be read from the port before the EOF object is reached, right?
Educated programmers would notice that this opens up the posibility of infinite lengths. Wise programmers would also notice that defining and using one such port-length function would have a disastrous side-effect: it would read and throw away the entire contents of the port.
Another option would be attempting to represent ports as strings, storing them in memory entirely before they are passed back to the program. This would, however, make it impossible to create programs that interact through ports, as the whole input of the interaction would be required before it can actually start.
On the other hand, when strings are represented as streams of characters, we can trivially convert a port into a string:
(define (port->stream port)
(stream-delay
(let ((result (read-char port)))
(if (eof-object? result)
stream-null
(stream-cons result (port->stream port))))))Characters will be read from port just as the streams returned by port->stream is used (by the way, a fancier version of port->stream could close port when EOF is reached). If we create a write-stream function to write the contents of a stream to a port, we can readily evaluate expressions such as (write-stream (stream-map char-downcase (port->stream in-port)) out-port), which perform string manipulation in functional style (the only side effects here would be the calls to port->stream and to write-stream, which can not be avoided) using constant memory.
When representing strings as streams, the port->stream and write-stream functions turn out very useful. The first allows us to work on inputs from files or network connections and the former actually stores the results of our program or flushes them through the network.
An Example: A web server
It is now time for one example, to show that the style of programming produced is not limited to simplistic algorithms and does provide practical advantages.
A web server could be built by defining an http-connection-attend function, receiving as its only input a stream created by port->stream from an input port bound to a network connection. The function would return the stream of characters that should be sent back by the server to the client.
As characters are read from the stream it returns, http-connection-attend would itself read characters from the input stream, which would cause bytes to be read from the network connection. Certainly, http-connection-attend would need to keep portions of the input (such as the HTTP headers, the URI, etc.) in memory during some portions of its execution, but they could be organized into appropriate structures and discarded as soon as they are no longer needed.
The http-connection-attend function would be almost referentially transparent: its only side effects would come from opening the files (or executing the routines required for producing the content to be sent to the browser) and, perhaps, adding an entry to an access log.
This design would make it very easy to test our server, as we need simply call http-connection-attend passing it strings (represented as streams) directly and checking that it returns the appropriate results.
Unix Pipes
Those who do not understand Unix history are condemned to reinvent it, poorly. – Henry Spencer
It has long been a tradition, followed by many Unix shell programmers, to create complex programs by mere composition of small and simpler ones. A chain of programs that pass information from one to the next is thus created, where the output of a program serves as the input to the next. The whole composition of commands is grouped and treated as a single command.
This form of programming is very popular among system administrators, which have been doing it for years, as it allows them to create sophisticated programs very fast. Consider, for example, the following command:
grep svnwiki </var/log/apache2/error.cgi.log | \ tail -1000 | \ sed 's/.*client \([.0-9]*\).*/\1/' | \ sort -u
This command is the composition (by means of the pipe "|" operation) of four simpler commands. The first command, grep, receives the contents of the /var/log/apache2/error.cgi.log file, and outputs all its lines having the word svnwiki anywhere; the second command, tail, receives it and outputs its last 1000 lines; sed performs a replacement based on a regular expression (in this case it removes everything other than the IP address that an HTTP request comes from); and, finally, the sort command is used to removes duplicate lines (IP addresses). The result, as you can see, lists the IP addresses for the last 1000 failed requests including the string svnwiki.
Each of the commands can be thought of as a referentially transparent function producing a string from another. The actual mechanism by which these sequences of piped commands are executed is interesting.
After the shell has parsed the command, it creates a separate operating system process for each sub-command. The processes are connected by means of pipes, so that whenever one process writes bytes to its standard output, they are received by the next in the chain on its standard input.
Each pipe has an associated buffer where the kernel stores bytes written to it which have not been received. You can think of the buffers as bytes traveling through the pipe. Whenever a process attempts to write to a pipe a certain number of bytes, its execution is interrupted until the pipe (its buffer) has room to hold them. When, on the other end of the pipe, a process attempts to read, it is interrupted until characters are actually available (or the process writing to it has terminated).
Initially, only the first process in the above chain will be able to run; all others will wait for input. The first process will eventually start to write bytes to its output and thus the kernel will, at some point, interrupt it and run the second process. The second process will eventually write bytes on its output as well and thus the kernel will be able to run the third process. This situation continues, with the kernel alternating which processes are allowed to run, until, eventually, no more input is available for the first process in the chain. At this moment the first process terminates and this situation is signaled to the second. Eventually the second process will terminate and so on, until the last process in the chain terminates and the whole operation is complete.
There are many similarities between this style of programming and the one suggested in this article. First, each command in the chain acts in functional style: it works by creating its output and passing it to the next command (in Scheme this is done by returning it). Each command is executed by a separate process, entirely independant of the others (except, of course, for the fact that each process has as its input the output of another) so there is no "shared" variable state. This makes it easy to ensure the correction of each component (a command), which makes the composite commands reliable.
Secondly, there is no need to keep the entire string in memory. Note that a sequence of piped commands can be applied to a very big input and, assuming that none of the commands in the chain keeps big portions of their input in memory, the memory required to perform the entire operation will be small and will not depend on the size of the data.
It is also worth noticing that if a given process terminates without reading its entire input, the operating system will usually just terminate all previous processes in the chain.
Assuming an appropriate set of Scheme functions for working with streams, we could program the same shell command in Scheme as follows:
(stream-sort-unique
(regexp-replace-lines "/.*client \\([.0-9]*\\).*/" "\1"
(take-last-lines 1000
(grep "svnwiki" input-stream))))Here port->stream and write-stream might come in handy to create the input-stream stream as well as to write the results to a port.
There are, however, some differences between shell programming and the proposed style for string manipulation. Probably the most important is that we, Scheme programmers, are not limited to streams of characters. The example above would probably be more elegantly coded as follows:
(stream-delete-duplicates
(stream-map
(let ((regexp (regexp "client ([.0-9]*)")))
(lambda (line)
(cadr (string-search regexp (stream->string line)))))
(stream-last-n 1000
(stream-grep "svnwiki"
(stream-map stream->string
(stream-lines input-stream))))))As you can see, this command is more verbose in Scheme than it was when programmed using shell pipes. However, it is only made from simple functions such as stream-map and string-search and should be much easier to understand. One, for example, does not need to be familiar with sed or the parameters of tail and sort.
Being able to pass values encoded with types other than strings has other advantages; we, for example, won't have to represent numbers as strings.
Using iterators to build streams
We can use continuations to create streams using iterators. This will frequently allow us to use older code that does not use streams as if it did. We can then use iterators without having to collect their entire results in memory at any time. Consider the following function:
(define (iterator->stream proc)
(stream-delay
(call-with-current-continuation
(lambda (return)
(proc
(lambda (obj)
(call-with-current-continuation
(lambda (next)
(return
(stream-cons obj
(stream-delay
(call-with-current-continuation
(lambda (new)
(set! return new)
(next #t)))))))))
(lambda () (return stream-null)))
(return stream-null)))))As you can see, it calls proc passing it two functions as its arguments. The first function should be called to collect individual items in the enumeration and the second will stop the iteration altogether. A stream with the objects in the enumeration is returned.
Continuations are used to allow access to the returned stream to control the execution of proc, respecting the usual lazy-evaluation semantics. For example, if the stream is immediately discarded, proc will not even start its execution.
As an example, we can provide a trivial implementation of list->stream using the for-each list iterator:
(define (list->stream list)
(iterator->stream
(lambda (write close) (for-each write list))))Iterators are very commonly found in algorithms for text-manipulation: they are present wherever output ports are used. Combining the above function with custom ports we are able to to easily integrate streams with older code that writes results to ports.
For example, the format function, modeled after Common Lisp's, receives an output port on which it writes its results. If #f is specified instead of a port, format will create a new string with its results and return it. It would be desirable to use format to create and return a stream instead, evaluating only as much as it needs to and never keeping its entire results in memory. This turns out easy to do (assuming a relatively intelligent implementation of format). Consider the following code:
(define (with-output-to-stream proc)
(iterator->stream
(lambda (write close)
(with-output-to-port
(make-output-port
(lambda (string)
(let loop ((i 0))
(when (< i (string-length string))
(write (string-ref string i))
(loop (+ i 1)))))
close)
proc))))All with-output-stream does is create a custom port such that as strings get written, each of their characters is collected (using the function provided by iterator->stream). We can now call format as follows:
(with-output-to-stream
(lambda ()
(format #t "Result: ~{~A~^, ~}~%" '(1 2 3))As you can see, this will allow you to use format as if had been written to return its results as a stream: its code will only be evaluated as you obtain elements from the stream (and if you only use a portion, it won't finish its evaluation). Also, assuming that your implementation of format does not build its result entirely in memory before writing it to its port, it will not need to reserve big memory areas to hold its results.
Suppose a programmer creates a copy-upcase file that reads information from a port and writes it in upper case to another and a copy-even function that on each iteration copies a character from the input to the output and skips a character, until the whole input has been read:
(define (copy-upcase in out)
(let ((char (read-char in)))
(unless (eof-object? char)
(write-char (char-upcase char) out)
(copy-upcase in out))))(define (copy-even in out)
(let ((char (read-char in)))
(unless (eof-object? char)
(write-char char out)
(read-char in)
(copy-even in out))))Both can be used to perform simple transformations on files but what happens when the programmer wants to apply them together to a file? Either she applies one first, saving the results to a temporal file, and then applies the other; or she creates a pipe and forks a separate process.
Another, perhaps simpler, alternative is turning them into stream-based versions. We define and use a port-api->stream-api to convert them, which works on functions that receive an input and an output port and copy information performing some transformation:
(define (file-api->stream-api f)
(lambda (stream)
(with-output-to-stream
(lambda ()
(with-input-from-stream stream
(lambda ()
(f (current-input-port) (current-output-port))))))))(define stream-upcase (file-api->stream-api copy-upcase)) (define stream-even (file-api->stream-api copy-even))
Now we can use stream-upcase and stream-alphabetic as we would any other function that manipulates a stream. For instance, we could combine them with a call to stream-filter:
(stream-upcase
(stream-filter (lambda (x) (char<? x #\g))
(stream-even
(port->stream input-port))))In this case stream-upcase and stream-even are so simple that a programmer would be better of rewriting them specifically to work for streams instead of basing them on copy-upcase and copy-alphabetic. However, in more complex cases, the ones that real-life is made off, making stream-based versions of port-based functions by means of with-output-to-stream and with-input-from-stream can be a powerful mechanism to use code written without knowledge of streams.
A note of caution: the functions called inside iterator->stream (and thus with-output-to-stream) must be reentrant or you might run into problems. Unfortunately, this might not be the case with format in your Scheme implementation.
Defining with-input-from-stream using custom ports is left as an exercise to the reader.
Performance considerations
Benjamin Israeli once said There are three kinds of lies: lies, damn lies, and statistics. Unfortunately, I don't have any statistics on the performance of stream-based string manipulation programs.
If the data being worked on fits entirely in memory, I would expect the streams-based approach to run comparably to a normal approach (representing strings using the normal type). Many string operations would probably be able to process information slightly faster using the normal (vector-based) approach, but they would probably be slower to create their output, being forced to resize big buffers.
If, however, the data does not fit in memory, I would expect the streams-based approach to be much faster. In this case, using a vector-based approach would be very slow because of frequent page faults.
Assuming a rather long chain of nested invocations to stream-manipulating functions, there are a few interesting ideas that could increase performance. Probably the most interesting would be using different threads or processes running in different processors to perform each of the different steps that make up the entire program.
Conclusions
In order for this style of programming to be useful in practice, a good library for manipulating streams, much richer than the minimalistic one defined in SRFI-40 is required. I wrote one for the Chicken Scheme implementation, which should be very easy to port to others. It is in the public domain.
I have also written some other libraries and programs that represent streams as strings with great success. Most notably, stream-cgi, an extension for using Common Gateway Interface; stream-html, a minimalistic HTML generator; svnwiki, a currently unreleased program with a web interface for creating and browsing a wiki using Subversion to store its contents; and mailfolder2html, a program generating HTML files with the contents of a maildir directory, useful for generating archives of mailing list (see an example).
As you see, the style presented can be used under different modern Scheme implementations and can provide interesting advantages over using the traditional string representation for programs having complex logic for text manipulation.
Last update: 2007-06-20 (Rev 11731)