Part 1: open and read
Part 2: writing
Part 3: basic piping -- you are here
Part 4: piping to processes
Part 5: structs and serialization
We'll be using Haskell, OCaml, Bash and C today. (Spoiler: Bash wins.)
Piping is the process of reading data from an input stream, processing the read data, and writing the result to an output stream. I.e. a map over streams, which this Haskell example handily demonstrates:
import System.IO
import Data.Char (toUpper)

main = do
  input <- hGetContents stdin
  let output = map toUpper input
  hPutStr stdout output
Try running that and writing a couple lines of text to the program. It reads a character from stdin, uppercases it, and writes the uppercased character to stdout. Though, as stdin and stdout are line-buffered, it seems like it only does a line at a time.
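The buffering behaviour described above can be made explicit in C. What follows is only a rough sketch: `upcase_stream` and `set_unbuffered` are hypothetical helper names, not something from the programs in this post. The first maps `toupper` over a stream one character at a time, the second asks stdio to stop buffering an output stream via `setvbuf`.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: copy `in` to `out`, uppercasing each byte.
   Returns the number of bytes written. */
long upcase_stream(FILE *in, FILE *out)
{
  long n = 0;
  int c;
  assert(in != NULL && out != NULL);
  while ((c = fgetc(in)) != EOF) {
    if (fputc(toupper(c), out) == EOF) break;
    n++;
  }
  return n;
}

/* Hypothetical helper: turn off stdio buffering on `out`, so every byte is
   handed to the OS as soon as it is written. Has to be called before any
   other I/O on the stream. Returns 0 on success. */
int set_unbuffered(FILE *out)
{
  return setvbuf(out, NULL, _IONBF, 0);
}
```

Calling `set_unbuffered(stdout)` followed by `upcase_stream(stdin, stdout)` from `main` gives the same filter as the Haskell program. On a terminal you will still see a line at a time, since the terminal itself normally hands input over only after you press enter.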
The interesting part about the above Haskell program is that it runs in constant space, thanks to the lazy lists. With an eager list, you would need an explicit processing loop, like in this OCaml version:
let optE f x = try Some (f x) with _ -> None
let optEx ex f x = try Some (f x) with e when e = ex -> None
let optEOF = optEx End_of_file

let rec process f ic oc =
  match optEOF input_char ic with
  | None -> ()
  | Some c ->
    begin
      match optE (fun c -> output_char oc c; flush oc) (f c) with
      | None -> ()
      | Some () -> process f ic oc
    end

(* Char.uppercase is gone from recent OCaml releases; uppercase_ascii replaces it *)
let () = process Char.uppercase_ascii stdin stdout
Simple, no? Not to be outdone, Haskell has a function named interact that pipes stdin to stdout:
import Data.Char (toUpper)

main = interact (map toUpper)
I suppose the true winner here is Bash, as it is built around piping:
tr '[:lower:]' '[:upper:]'
Writing filename-based file -> file maps using the above methods is quite trivial. Starting with Bash:
tr '[:lower:]' '[:upper:]' <src >dst
The Haskell version isn't quite as succinct but it's no beast either:
import Data.Char (toUpper)

main = mapFiles (map toUpper) "src" "dst"

mapFiles f src dst = writeFile dst =<< mapFile f src
mapFile f filename = return . f =<< readFile filename
Using the above definition of process and withFiles from prelude.ml, the OCaml translation is straightforward:
let process_files f = withFiles (process f)
let () = process_files Char.uppercase_ascii "src" "dst"
We can apply the same strategy to a C version: first define a function to transform streams, then wrap it inside fopen-fclose pairs.
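#include <stdio.h>

void process(char (*f)(char), FILE *src, FILE *dst)
{
  int c;
  /* test the result of fgetc instead of feof: feof only turns true after
     a read has already failed, which would write one stray byte at the end */
  while ((c = fgetc(src)) != EOF)
    fputc(f((char)c), dst);
}

void process_files(char (*f)(char), const char *src, const char *dst)
{
  FILE *sf = fopen(src, "r");
  FILE *df = fopen(dst, "w");
  /* only run the pipe if both files could be opened */
  if (sf != NULL && df != NULL)
    process(f, sf, df);
  if (df != NULL) fclose(df);
  if (sf != NULL) fclose(sf);
}

char uppercase(char c) {
  char upper = 'A' - 'a';
  return (c >= 'a' && c <= 'z') ? (c + upper) : c;
}

int main(int argc, char *argv[])
{
  process_files(uppercase, "src", "dst");
  return 0;
}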
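The per-character loop leans on stdio to do the buffering behind the scenes. If you'd rather do the buffering yourself, the same map can be run over fixed-size blocks, which still uses constant space. Here is a rough sketch of that idea; `process_blocks` and `BLOCK_SIZE` are made-up names for illustration, and 4096 is an arbitrary size, not a tuned one.

```c
#include <assert.h>
#include <stdio.h>

#define BLOCK_SIZE 4096  /* arbitrary choice for this sketch */

/* Hypothetical variant of process: read up to BLOCK_SIZE bytes at a time,
   apply f to each byte in place, and write the block out.
   Returns 0 on success, -1 on a read or write error. */
int process_blocks(char (*f)(char), FILE *src, FILE *dst)
{
  char buf[BLOCK_SIZE];
  size_t n, i;
  assert(f != NULL && src != NULL && dst != NULL);
  while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
    for (i = 0; i < n; i++)
      buf[i] = f(buf[i]);
    if (fwrite(buf, 1, n, dst) != n)
      return -1;
  }
  return ferror(src) ? -1 : 0;
}

/* Same ASCII-only mapping as the uppercase function above. */
char uppercase(char c)
{
  return (c >= 'a' && c <= 'z') ? (char)(c - 'a' + 'A') : c;
}
```

It takes the same function pointer and FILE arguments as process, so it could be swapped into process_files. Whether it is actually any faster is something I haven't measured.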
So, piping characters, no big problem. How about something a bit more structured, like lines? Well, in Haskell it isn't a problem:
main = mapFiles (unlines . map reverse . lines) "src" "dst"
And Bash exports all lexing to external programs:
rev <src >dst
Then to the problematic cases. The process that we defined above operates on a per-character basis. What we want is something that operates on a per-line basis. We could write a process_lines that uses input_line instead of input_char, but that's a bit repetitive. Taking a step back, we could also add in a tokenizer that reads a value from the input channel and calls the processing function with it, writing the output to the output channel. Something like this:
open Prelude

let rec untilEOF f x =
  match optEOF f x with
  | None -> ()
  | Some () -> untilEOF f x

let tokenized_pipe read write f ic oc =
  untilEOF (fun ic -> write oc (f (read ic))) ic

let output_line oc s = output_string oc s; output_char oc '\n'
let process_lines = tokenized_pipe input_line output_line
let () = withFiles (process_lines srev) "src" "dst"
Or extend the Stream module to remap a character stream into a line stream:
open Prelude

type 'a accum = Acc of 'a | Done of 'a

let optMap f x = match x with None -> None | Some y -> Some (f y)
let string_of_char c = String.make 1 c
let (^.) s c = s ^ string_of_char c

module Stream =
struct
  include Stream

  let nextOpt st =
    match peek st with
    | None -> None
    | x -> (junk st; x)

  let rec foldlBreak f init st =
    match nextOpt st with
    | None -> init
    | Some x ->
      begin match f init x with
      | Acc a -> foldlBreak f a st
      | Done b -> b
      end

  let rec foldl f init st = foldlBreak (fun s i -> Acc (f s i)) init st
  let map f st = from (fun _ -> optMap f (nextOpt st))

  let remap f init st =
    from (fun _ ->
      match peek st with
      | None -> None
      | Some x -> Some (foldlBreak f init st))

  let join s st = from (fun i -> if i mod 2 = 0 then nextOpt st else Some s)
end

let lines =
  Stream.remap (fun s c -> if c = '\n' then Done s else Acc (s ^. c)) ""

let () =
  withFiles (fun ic oc ->
    let s = Stream.of_channel ic in
    let rev_lines = Stream.map srev (lines s) in
    Stream.iter (output_string oc) (Stream.join "\n" rev_lines)
  ) "src" "dst"
Here's a Haskell version of remap:
data Accum a = Acc a | Done (a, a)

remap t acc [] = [acc]
remap t acc (x:xs) =
  case t acc x of
    Acc a -> remap t a xs
    Done (d, a) -> (d : remap t a xs)

line_lexer acc c =
  case c of
    '\n' -> Done (acc, "")
    _ -> Acc (acc ++ [c])

my_lines s = remap line_lexer "" s

main = do
  c <- readFile "src"
  writeFile "dst" (concatMap (\l -> reverse l ++ "\n") $ my_lines c)
And a less crack-rock version of my_lines:
my_lines "" = []
my_lines s =
  (line : my_lines (chomp '\n' rest))
  where
    (line, rest) = break ('\n' ==) s
    chomp v (x:xs) | v == x = xs
    chomp _ xs = xs
...sorry, got a bit derailed there. You can actually do a quite OCaml-like tokenized process in C as well, though it is a bit more involved due to manual memory management and "arrays are pointers to memory."
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* first some utils, feel free to skip ahead */
typedef struct {
  size_t length;
  char *ptr;
} string;

string create_string(size_t length) {
  string s;
  s.length = length;
  s.ptr = (char*)malloc(length);
  if (s.ptr == NULL) exit(1);
  return s;
}

/* reverses s in place, leaving a trailing newline where it is */
string reverse_string(string s)
{
  size_t i, nidx, len = s.length;
  char tmp;
  if (len > 0 && s.ptr[len-1] == '\n') len--;
  for (i=0; i < len / 2; i++) {
    tmp = s.ptr[i];
    nidx = len-1-i;
    s.ptr[i] = s.ptr[nidx];
    s.ptr[nidx] = tmp;
  }
  return s;
}

/* s is passed by value, so only the buffer can be released here */
void destroy_string(string s) {
  if (s.ptr != NULL) free(s.ptr);
}
/* the gist of the code */
void tokenize_pipe
(
  string (*read)(FILE*), int (*write)(string, FILE*), string (*f)(string),
  FILE *src, FILE *dst
)
{
  /* also stop on a read error, or the loop would never end */
  while (!feof(src) && !ferror(src))
    if (!write(f(read(src)), dst)) break;
}

void tokenize_pipe_files
(
  string (*read)(FILE*), int (*write)(string, FILE*), string (*f)(string),
  const char *src, const char *dst
)
{
  FILE *sf = fopen(src, "r");
  FILE *df = fopen(dst, "w");
  /* only run the pipe if both files could be opened */
  if (sf != NULL && df != NULL)
    tokenize_pipe(read, write, f, sf, df);
  if (df != NULL) fclose(df);
  if (sf != NULL) fclose(sf);
}
/* reads a string of at most 255 characters from f; longer lines get split */
string alloc_and_read(FILE *f)
{
  string s = create_string(256);
  if (fgets(s.ptr, (int)s.length, f) == NULL)
    s.length = 0;
  else
    s.length = strlen(s.ptr);
  return s;
}

/* writes s to f and destroys s; returns 0 if the write fails */
int write_and_free(string s, FILE *f)
{
  size_t written = fwrite(s.ptr, 1, s.length, f);
  int ok = (written == s.length);
  destroy_string(s);
  return ok;
}

int main(int argc, char *argv[])
{
  tokenize_pipe_files(alloc_and_read, write_and_free, reverse_string, "src", "dst");
  return 0;
}
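One limitation of alloc_and_read is its fixed 256-byte buffer: a longer line comes back in several pieces, and each piece gets reversed on its own. A reader that grows its buffer avoids that. The following is a rough sketch only; `read_long_line` is a hypothetical name, and it sticks to plain `fgetc` and `realloc` rather than anything platform-specific.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read one line of any length from f, including the
   trailing '\n' if there is one. Stores the byte count in *len and returns
   a malloc'd, NUL-terminated buffer that the caller must free. Returns NULL
   at end of file or if memory runs out. */
char *read_long_line(FILE *f, size_t *len)
{
  size_t cap = 64, n = 0;
  char *buf, *tmp;
  int c;
  assert(f != NULL && len != NULL);
  buf = malloc(cap);
  if (buf == NULL) return NULL;
  while ((c = fgetc(f)) != EOF) {
    if (n + 1 >= cap) {              /* keep room for the terminating NUL */
      cap *= 2;
      tmp = realloc(buf, cap);
      if (tmp == NULL) { free(buf); return NULL; }
      buf = tmp;
    }
    buf[n++] = (char)c;
    if (c == '\n') break;
  }
  if (n == 0) { free(buf); return NULL; }  /* nothing was read: end of file */
  buf[n] = '\0';
  *len = n;
  return buf;
}
```

To plug it into tokenize_pipe you would wrap the result in the string struct from above, setting ptr and length from the returned buffer and byte count.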
I think I'll leave it there for basic piping. The next subjects will probably be: talking with external processes, structured I/O and mmap, event loop IO and sockets.
Drop examples in the comments if you want to add better ways to do things / more languages here. Pointers to interesting ways to do the above things are also most welcome.
Bis dann, have fun!
3 comments:
That was interesting, thanks.
I benchmarked various ways of reading arbitrarily long lines, which may be of interest?
Reading lines in various languages
tr is not bash
@ Pádraig Brady: Thank you for the link, interesting how large the difference between fgets and getline is. And how cgrep's regex is faster than fgets.
The perl -ne print is also nice and succinct, doing the reverse line piping example would be perl -ne 'print scalar reverse' <src >dst, though it offloads the piping to the shell.
@ anonymous: Yes, you are correct. But it's in my /usr/bin, so it is a part of the shell "standard library." That is a bit of a copout of course, since with that reasoning one could say that any language with an interpreter is a part of the shell stdlib.
So maybe I should change "bash" to "shell". But that's again a bit dubious, since you could have a shell that doesn't have piping.
Hence my take is more about getting the job done and demonstrating how piping works in bash. Writing an uppercasing function in bash could be done, but the common way is to use tr.