art with code

2008-10-03

I/O in programming languages: basic piping

This post is a part of a series where your intrepid host looks at I/O in different programming languages in search of understanding and interesting abstractions.

Part 1: open and read
Part 2: writing
Part 3: basic piping -- you are here
Part 4: piping to processes
Part 5: structs and serialization

We'll be using Haskell, OCaml, Bash and C today. (Spoiler: Bash wins.)

Piping is the process of reading data from an input stream, processing the read data, and writing the result to an output stream. In other words, a map over streams, as this Haskell example handily demonstrates:

import System.IO
import Char

main = do
  input <- hGetContents stdin
  let output = map Char.toUpper input
  hPutStr stdout output

Try running that and typing a couple of lines of text into the program. It reads a character from stdin, uppercases it, and writes the uppercased character to stdout. Though, as stdin and stdout are line-buffered, it appears to process a line at a time.
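For comparison, here's the same character-at-a-time pipe as a C function. This is my own sketch, not code from the languages compared below; the name char_pipe is made up. The one subtlety is that fgetc returns an int so that EOF stays distinguishable from real characters:

```c
#include <ctype.h>
#include <stdio.h>

/* Read characters from src one at a time, transform each with f,
   and write the result to dst. Stops cleanly at end of input. */
void char_pipe(int (*f)(int), FILE *src, FILE *dst)
{
    int c;  /* int, not char, so EOF stays distinguishable */
    while ((c = fgetc(src)) != EOF)
        fputc(f(c), dst);
}
```

Called as char_pipe(toupper, stdin, stdout) this mirrors the Haskell program. And the buffering behaves the same way: stdout is typically line-buffered on a terminal, so output shows up a line at a time unless you disable buffering with setvbuf(stdout, NULL, _IONBF, 0).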

The interesting part about the above Haskell program is that it runs in constant space, thanks to the lazy lists. With an eager list, you would need an explicit processing loop, like in this OCaml version:

let optE f x = try Some (f x) with _ -> None
let optEx ex f x = try Some (f x) with e when e = ex -> None
let optEOF = optEx End_of_file

let rec process f ic oc =
  match optEOF input_char ic with
  | None -> ()
  | Some c ->
    begin
      match optE (fun c -> output_char oc c; flush oc) (f c) with
      | None -> ()
      | Some () -> process f ic oc
    end

let () = process Char.uppercase stdin stdout

Simple, no? Not to be outdone, Haskell has a function named interact that pipes stdin to stdout:

import Char
main = interact (map Char.toUpper)

I suppose the true winner here is Bash, as it is built around piping:

tr '[:lower:]' '[:upper:]'


Writing file-to-file maps using the above methods is quite trivial. Starting with Bash:

tr '[:lower:]' '[:upper:]' <src >dst

The Haskell version isn't quite as succinct but it's no beast either:

import Char

main = mapFiles (map Char.toUpper) "src" "dst"

mapFiles f src dst = writeFile dst =<< mapFile f src
mapFile f filename = return . f =<< readFile filename

Using the above definition of process and withFiles from prelude.ml, the OCaml translation is straightforward:

let process_files f = withFiles (process f)

let () = process_files Char.uppercase "src" "dst"


We can apply the same strategy to a C version: first define a function to transform streams, then wrap it inside fopen-fclose pairs.

#include <stdio.h>
#include <stdlib.h>

void process(char (*f)(char), FILE *src, FILE *dst)
{
  /* fgetc returns EOF as an int, so read into an int and test
     before writing; testing feof before the read would write a
     bogus character at the end of the stream */
  int c;
  while ((c = fgetc(src)) != EOF)
    fputc(f((char)c), dst);
}

void process_files(char (*f)(char), const char *src, const char *dst)
{
  FILE *sf = fopen(src, "r");
  FILE *df = fopen(dst, "w");
  if (sf == NULL || df == NULL) exit(1);
  process(f, sf, df);
  fclose(df);
  fclose(sf);
}

char uppercase(char c)
{
  char upper = 'A' - 'a';
  return (c >= 'a' && c <= 'z') ? (c + upper) : c;
}

int main(int argc, char *argv[])
{
  process_files(uppercase, "src", "dst");
  return 0;
}


So, piping characters, no big problem. How about something a bit more structured, like lines? Well, in Haskell it isn't a problem:

main = mapFiles (unlines . map reverse . lines) "src" "dst"

And Bash exports all lexing to external programs:

rev <src >dst

Then to the problematic cases. The process that we defined above operates on a per-character basis. What we want is something that operates on a per-line basis. We could write a process_lines that uses input_line instead of input_char, but that's a bit repetitive.
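For the record, the direct per-token approach looks like this in C. A sketch of my own with hypothetical names, using fgets with a fixed-size buffer as the line reader:

```c
#include <stdio.h>
#include <string.h>

/* Read src line by line, transform each line in place with f,
   and write the result to dst. Lines longer than the buffer are
   simply processed in chunks. */
void process_lines(void (*f)(char *), FILE *src, FILE *dst)
{
    char line[4096];
    while (fgets(line, sizeof line, src) != NULL) {
        f(line);
        fputs(line, dst);
    }
}

/* example transform: uppercase a line in place */
void upcase_line(char *s)
{
    for (; *s != '\0'; s++)
        if (*s >= 'a' && *s <= 'z')
            *s += 'A' - 'a';
}
```

Every new token type means writing another such loop, which is exactly the repetition that factoring out the tokenizer avoids.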

Taking a step back, we could also add in a tokenizer that reads a value from the input channel and calls the processing function with it, writing the output to the output channel. Something like this:

open Prelude

let rec untilEOF f x =
  match optEOF f x with
  | None -> ()
  | Some () -> untilEOF f x

let tokenized_pipe read write f ic oc =
  untilEOF (fun ic -> write oc (f (read ic))) ic

let output_line oc s = output_string oc s; output_char oc '\n'

let process_lines = tokenized_pipe input_line output_line

let () = withFiles (process_lines srev) "src" "dst"

Or extend the Stream module to remap a character stream into a line stream:

open Prelude

type 'a accum = Acc of 'a | Done of 'a

let optMap f x = match x with None -> None | Some y -> Some (f y)

let string_of_char c = String.make 1 c
let (^.) s c = s ^ string_of_char c

module Stream =
struct
  include Stream

  let nextOpt st =
    match peek st with
    | None -> None
    | x -> (junk st; x)

  let rec foldlBreak f init st =
    match nextOpt st with
    | None -> init
    | Some x ->
      begin match f init x with
      | Acc a -> foldlBreak f a st
      | Done b -> b
      end

  let foldl f init st = foldlBreak (fun s i -> Acc (f s i)) init st

  let map f st = from (fun _ -> optMap f (nextOpt st))

  let remap f init st =
    from (fun _ ->
      match peek st with
      | None -> None
      | Some _ -> Some (foldlBreak f init st))

  let join s st = from (fun i -> if i mod 2 = 0 then nextOpt st else Some s)

end

let lines =
  Stream.remap (fun s c -> if c = '\n' then Done s else Acc (s ^. c)) ""

let () =
  withFiles (fun ic oc ->
    let s = Stream.of_channel ic in
    let rev_lines = Stream.map srev (lines s) in
    Stream.iter (output_string oc) (Stream.join "\n" rev_lines)
  ) "src" "dst"

Here's a Haskell version of remap:

data Accum a = Acc a | Done (a, a)

remap t acc [] = [acc]
remap t acc (x:xs) =
  case t acc x of
    Acc a -> remap t a xs
    Done (d, a) -> d : remap t a xs

line_lexer acc c =
  case c of
    '\n' -> Done (acc, "")
    _ -> Acc (acc ++ [c])

my_lines s = remap line_lexer "" s

main = do
  c <- readFile "src"
  writeFile "dst" (concatMap (\l -> reverse l ++ "\n") $ my_lines c)

And a less crack-rock version of my_lines:

my_lines "" = []
my_lines s = line : my_lines (chomp '\n' rest)
  where (line, rest) = break ('\n' ==) s

chomp v (x:xs) | v == x = xs
chomp _ xs = xs


...sorry, got a bit derailed there. You can actually do a quite OCaml-like tokenized process in C as well, though it is a bit more involved due to manual memory management and C's "arrays are pointers to memory."

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* first some utils, feel free to skip ahead */

typedef struct {
  size_t length;
  char *ptr;
} string;

string create_string(size_t length)
{
  string s;
  s.length = length;
  s.ptr = (char*)malloc(length);
  if (s.ptr == NULL) exit(1);
  return s;
}

string reverse_string(string s)
{
  size_t i, nidx, len = s.length;
  char tmp;
  /* leave a trailing newline in place so reversed lines
     stay line-terminated */
  if (len > 0 && s.ptr[len-1] == '\n') len--;
  for (i = 0; i < len / 2; i++) {
    tmp = s.ptr[i];
    nidx = len - 1 - i;
    s.ptr[i] = s.ptr[nidx];
    s.ptr[nidx] = tmp;
  }
  return s;
}

void destroy_string(string s)
{
  s.length = 0;
  if (s.ptr != NULL) free(s.ptr);
  s.ptr = NULL;
}


/* the gist of the code */

void tokenize_pipe
(
  string (*read)(FILE*), int (*write)(string, FILE*), string (*f)(string),
  FILE *src, FILE *dst
)
{
  while (!feof(src))
    if (!write(f(read(src)), dst)) break;
}

void tokenize_pipe_files
(
  string (*read)(FILE*), int (*write)(string, FILE*), string (*f)(string),
  const char *src, const char *dst
)
{
  FILE *sf = fopen(src, "r");
  FILE *df = fopen(dst, "w");
  if (sf == NULL || df == NULL) exit(1);
  tokenize_pipe(read, write, f, sf, df);
  fclose(df);
  fclose(sf);
}

/* reads a line from f */
string alloc_and_read(FILE *f)
{
  string s = create_string(256);
  if (fgets(s.ptr, s.length, f) == NULL)
    s.length = 0;
  else
    s.length = strlen(s.ptr);
  return s;
}

/* writes s to f and destroys s */
int write_and_free(string s, FILE *f)
{
  fwrite(s.ptr, 1, s.length, f);
  destroy_string(s);
  return 1;
}

int main(int argc, char *argv[])
{
  tokenize_pipe_files(alloc_and_read, write_and_free, reverse_string, "src", "dst");
  return 0;
}


I think I'll leave it there for basic piping. The next subjects will probably be: talking with external processes, structured I/O and mmap, event loop IO and sockets.

Drop examples in the comments if you want to add better ways to do things / more languages here. Pointers to interesting ways to do the above things are also most welcome.

Until then, have fun!

3 comments:

Pádraig Brady said...

That was interesting, thanks.

I benchmarked various ways of reading arbitrarily long lines, which may be of interest?

Reading lines in various languages

Anonymous said...

tr is not bash

Ilmari Heikkinen said...

@ Pádraig Brady: Thank you for the link, interesting how large the difference between fgets and getline is. And how cgrep's regex is faster than fgets.

The perl -ne print is also nice and succinct; doing the reverse line piping example would be perl -ne 'print scalar reverse' <src >dst, though it offloads the piping to the shell.

@ anonymous: Yes, you are correct. But it's in my /usr/bin, so it is a part of the shell "standard library." That is a bit of a cop-out of course, since with that reasoning one could say that any language with an interpreter is a part of the shell stdlib.

So maybe I should change "bash" to "shell". But that's again a bit dubious, since you could have a shell that doesn't have piping.

Hence my take is more about getting the job done and demonstrating how piping works in bash. Writing an uppercasing function in bash could be done, but the common way is to use tr.
