Translate Haskell into English Manually
Have you ever tried to learn Haskell (haskell.org) and hit a brick wall? Have you tried to read the main tutorial, "A Gentle Introduction to Haskell" (www.haskell.org/tutorial), and found it to be about as gentle as a Pan Galactic Gargle Blaster (en.wikipedia.org/wiki/Pan_Galactic_Gargle_Blaster)? Did you have to learn about monads before you could even write your first non-trivial Haskell program? Have you noticed that unless you already know Haskell, it's even less readable than Shakespeare? Have you searched for an example of a nontrivial Haskell program only to find you can't understand it?
In Expert C Programming: Deep C Secrets (which, by the way, is a must read for anyone who ever uses the letter C), Peter van der Linden teaches the reader how to write a small program that translates C-type declarations into English. It's a great little exercise. It's beautiful because it works without relying on powerful tools like Lex and Yacc, yet it's still small enough to be written in a couple hours. I've read stories about people writing Pascal compilers in Haskell, and there's even a version of Perl 6 written in Haskell (www.pugscode.org).
In my last article (www.linuxjournal.com/article/8850), I gave readers an introduction to monads and why they should care about them as well as Haskell in general. (Note: several readers of my last article wondered why I was interested in Haskell in contrast to other excellent languages such as OCaml, Dylan or Common Lisp. Currently, I'm fascinated by Haskell because it is a purely functional programming language, and because it has lazy evaluation. As strange as it might sound, I'm fascinated by Haskell because it's harder for me to wrap my head around than those other languages.) If you still don't believe that Haskell is worthy of attention, read what Tim Sweeney, the founder of Epic, the video game company that created Unreal, said about it at beust.com/weblog/archives/000375.html. Sweeney feels that Haskell may be the gateway to "The Next Mainstream Programming Language". Because video game developers are often on the cutting edge of our field simply out of necessity, his comments are noteworthy.
In this two-part series, starting with Peter van der Linden's exercise, my hope is to give readers a glimpse of the Zen of Haskell, without requiring that they already be Haskell converts. In this first part, I start by reviewing the C version, and I explain why translating imperative code into Haskell isn't an easy thing to do. I finish by explaining the concept of a parse pipeline.
Let's start by reviewing the C version:
/* Translate C type declarations into English. This code was taken from "Expert C Programming: Deep C Secrets", p. 84. [sic: weird whitespace inconsistencies] Compiling: gcc cdecl.c -o cdecl Usage: echo -n "int *p;" | ./cdecl */ #include <stdio.h> #include <string.h> #include <ctype.h> #include <stdlib.h> #define MAXTOKENS 100 #define MAXTOKENLEN 64 enum type_tag { IDENTIFIER, QUALIFIER, TYPE }; struct token { char type; char string[MAXTOKENLEN]; }; int top=-1; struct token stack[MAXTOKENS]; struct token this; #define pop stack[top--] #define push(s) stack[++top]=s enum type_tag classify_string(void) /* figure out the identifier type */ { char *s = this.string; if (!strcmp(s,"const")) { strcpy(s,"read-only"); return QUALIFIER; } if (!strcmp(s,"volatile")) return QUALIFIER; if (!strcmp(s,"void")) return TYPE; if (!strcmp(s,"char")) return TYPE; if (!strcmp(s,"signed")) return TYPE; if (!strcmp(s,"unsigned")) return TYPE; if (!strcmp(s,"short")) return TYPE; if (!strcmp(s,"int")) return TYPE; if (!strcmp(s,"long")) return TYPE; if (!strcmp(s,"float")) return TYPE; if (!strcmp(s,"double")) return TYPE; if (!strcmp(s,"struct")) return TYPE; if (!strcmp(s,"union")) return TYPE; if (!strcmp(s,"enum")) return TYPE; return IDENTIFIER; } void gettoken(void) /* read next token into "this" */ { char *p = this.string; /* read past any spaces */ while ((*p = getchar()) == ' ' ) ; if (isalnum(*p)) { /* it starts with A-Z,0-9 read in identifier */ while ( isalnum(*++p=getchar()) ); ungetc(*p,stdin); *p = '\0'; this.type=classify_string(); return; } if (*p=='*') { strcpy(this.string,"pointer to"); this.type = '*'; return; } this.string[1] = '\0'; this.type = *p; return; } /* The piece of code that understandeth all parsing. */ read_to_first_identifier() { gettoken(); while (this.type!=IDENTIFIER) { push(this); gettoken(); } printf("%s is ", this.string); gettoken(); } deal_with_arrays() { while (this.type=='[') { printf("array "); gettoken(); /* an number or ']' */ if (isdigit(this.string[0])) { printf("0..%d ",atoi(this.string)-1); gettoken(); /* read the ']' */ } gettoken(); /* read next past the ']' */ printf("of "); } } deal_with_function_args() { while (this.type!=')') { gettoken(); } gettoken(); printf("function returning "); } deal_with_pointers() { while ( stack[top].type== '*' ) { printf("%s ", pop.string ); } } deal_with_declarator() { /* deal with possible array/function following the identifier */ switch (this.type) { case '[': deal_with_arrays(); break; case '(': deal_with_function_args(); } deal_with_pointers(); /* process tokens that we stacked while reading to identifier */ while (top >= 0) { if (stack[top].type == '(' ) { pop; gettoken(); /* read past ')' */ deal_with_declarator(); } else { printf("%s ",pop.string); } } } main() { /* put tokens on stack until we reach identifier */ read_to_first_identifier(); deal_with_declarator(); printf("\n"); return 0; }
Notice the stack of tokens and what the various functions do. By and large, I'll follow this closely in the Haskell version, so that the two versions can be compared. Again, notice that Lex and Yacc aren't used. Haskell has no shortage of powerful parsing libraries, but I've decided not to use any of them. After all, if I used a powerful parsing library in the Haskell version, trying to compare the C version to the Haskell version would be like trying to compare apples to peach cobbler.
Last of all, notice that the C version uses global variables for the token stack and so forth. It also prints output wherever convenient. This makes it easy to code, but it doesn't make it easy to translate to Haskell. This is the biggest difference between the C version and the Haskell version. After all, in a purely functional programming language like Haskell, a function can only take inputs and produce outputs. It can't update global state (there are no global variables), and it can't just print a string whenever it is convenient. However, staying purely functional has benefits that are hard to match in an imperative language. For instance, Haskell is lazy and non-strict. It evaluates expressions in whatever order it feels is necessary and only as necessary. For instance, it does not evaluate the arguments to a function until they're actually used in the function--you can pass 1/0 to a function, and as long as the function never uses that parameter, there will be no error. This style of programming can be hard to wrap your head around when you're coming from an imperative and/or object-oriented background. (The tutorial that really helped me finally understand "the Haskell way" was the "Haskell Tutorial for C Programmers" (www.haskell.org/~pairwise/intro/intro.html.)
I'm going to work my way toward the complete program via a series of smaller programs. Start by installing Hugs (www.haskell.org/hugs), which is a friendly, easy-to-use interpreter for Haskell.
Let's start with the obligatory "Hello, World!":
main = do putStrLn "Hello, World!"
The advanced Haskell programmer may note that the do is not strictly required here. At this point, I'm going to "hand waive" and say that the do is syntactically unnecessary (I'm not referring to style) in the same way that the braces are unnecessary in the following C snippet:
if (n > 5) { n -= 3; }
To run the program, simply execute the following at the command line:
$ runhugs hello_world.hs Hello, World!
Translating the source into English, it's simply:
The value of main is to do: Write "Hello, World!\n"
I like to think of Haskell programs as a bunch of mathematical equations pretending to act like a programming language. After all, the following makes sense to a mathematician, but similar code most definitely won't work in most other languages:
a = 1 main = do print c c = a + b b = 2
No matter which order those four lines are arranged, it still works. In most languages, "a = 1" is read "set a equal to 1", but in Haskell, it's read as "a is defined to be 1". The order of the definitions don't really matter most of the time.
By the way, print c is the same as putStrLn (show c). show c converts c into a string; In Java, you'd write c.toString(). This illustrates three points:
You don't use parentheses for function calls in Haskell (although you may sometimes have to wrap a particular argument in parentheses).
In general, what would be object.method() in Java is written method object in Haskell.
Every call to print implicitly calls show. Because Haskell is oriented toward functions, the idiom is to pass an object as the first argument to a function that acts like a method.
Before I get too carried away, let me show you one of the things I find most fascinating about Haskell's type system. (Haskell's type system and this feature in particular are shared with other languages in the ML language family.) Haskell's compiler is smart enough to let you write a swap function that takes two values of any type and return a tuple of the two values reversed.
swap x y = (y, x) main = do print (swap 3 4)
This is possible because the types really don't matter here. However, it's also smart enough to infer types if you don't explicitly state them and even complain about type errors at compile time that you would think could be caught only at runtime. For instance, the following is acceptable:
-- an end-of-line comment add1 x = x + 1 -- x must support "+ 1" callAdd1 x = add1 x -- callAdd1 looks "harmless" main = do print (callAdd1 1) -- pass a number to callAdd1
However, Haskell will complain about the following at compile time:
add1 x = x + 1 callAdd1 x = add1 x main = do print (callAdd1 'a') -- pass a Char to callAdd1 $ runhugs add1.hs ERROR "add1.hs":3 - Unresolved top-level overloading *** Binding : main *** Outstanding context : Num Char
It's complaining that I passed a Char to something that requires a Num. Num is what's called a type class, which is similar to an interface in Java. There are many different types that "implement" the Num "interface", but Char is not one of them.
Now, Pascal programmers might start feeling smug at this point until I remind them that nowhere in the code did I mention x is an int or a Num or anything like that. Similarly, fans of scripting languages with duck typing (en.wikipedia.org/wiki/Duck_typing) might also start feeling smug at this point until I remind them that this error was caught at compile time. (Although Hugs is an interpreter, there is a traditional, native compiler for Haskell called the Glasgow Haskell Compiler (www.haskell.org/ghc), and it could indeed detect this type error at compile time.)
To remain purely functional, Haskell makes a strong distinction between functions that do I/O, which is considered a side effect, and those that don't, which is considered purely functional. In Java, all non-runtime exceptions that are raised in a method must be declared in the signature. (Whether or not checked exceptions are a good thing in Java is another subject.) Similarly, in Haskell, all functions that do I/O must make use of the IO monad, which impacts the signature of the function. Hence, the question is, as a programmer, where do you draw the line between functions that can do I/O and functions that live in the purely functional world?
For most applications, actual I/O can be constrained to a very tiny portion of the program. For instance, in cdecl, I/O can be limited to the main function. The main function can read the C type declaration from the user and later write the English output to the user. The rest of the program can be devoted to the logic necessary to convert C type declarations into English.
If you generalize this pattern, you might say that a small part of the program reads and writes the "state of the world" (for instance, reading from STDIN, writing to STDOUT, manipulating files, talking to network sockets and so on), whereas the rest of the program can be composed of functions that are purely functional, that is, they take data and return data--that's it.
Comparing the situation to Star Trek, the purely functional part of the program is Captain Kirk--you give him a question, and he'll give you a decision. The part of the program that does I/O is the ensign sitting at the controls, and for completeness, Haskell is the Enterprise!
If the "translator" at the heart of the Haskell program simply outputs whatever it is given as input, you arrive at a stripped-down version of the UNIX program cat (or rather, the DOS program type), which is a nice place to start:
main = do s <- getContents putStr s
Note that getContents is a function that returns a string that reads from standard input as needed. Here, the program is taking all of standard input and printing it back out. You would think that would be hideously inefficient, but thanks to Haskell's lazy nature, Haskell doesn't have to be finished reading input before it can start writing output.
Now, ideally, it'd be nice to do something in the middle. cat is interesting only for so long. Ideally, it'd be nice to change the state of the world. Consider:
makeCool s = s ++ " is cool!" main = do s <- getContents putStrLn (makeCool s) $ echo -n "Haskell" | runhugs makeCool.hs Haskell is cool!
makeCool is a function that takes an input string (which happens to be all of standard input when it is called in main), translates it (by appending " is cool!"), and then returns it (where it happens to get printed to STDOUT). This is a suitable beginning for a "C to English" compiler. Now, if only the makeCool function were a lot more intelligent!
Keeping an eye on the C version, let's start by translating the token struct and token stack into Haskell. As I mentioned earlier, there are no global variables in Haskell, so I'll have to use a new data type to represent the state of the parser. I'll call it ParseContext. This ParseContext can be passed to functions explicitly. Later, I'll show how the State monad can be used to pass the ParseContext to functions implicitly.
Let's start with:
data TokenType = Identifier | Qualifier | Type | Symbol Char deriving (Show, Eq)
The data keyword creates a new data type in Haskell. Hence, TokenType is a new type. The data keyword replaces C's enum, struct and union in one shot. Here, it's acting as a cross between an enum and a union. A value of the type TokenType might be an Identifier, a Qualifier, a Type or a Symbol that wraps a specific Char (for instance, Symbol '+').
deriving Show tells the compiler that it can automatically figure out a suitable implementation for the show and read functions (conceptually, toString() and fromString()). deriving Eq tells the compiler that it can figure out a suitable implementation for ==. For instance, it just makes sense that Symbol '+' == Symbol '+' should be True, but Identifier == Qualifier should be False.
Continuing:
data Token = Token { tokenType :: TokenType, tokenValue :: String } deriving Show
Token is a new type. A value of that type is constructed using the constructor named Token. (Note that sometimes the type and the constructor have the same name, but sometimes they don't--look at TokenType above; It has four different constructors.) A Token has a member tokenType of type TokenType and a member tokenValue of type String. The :: should be read aloud as "has type" or "of type". Notice that this data declaration is very much like a struct in C.
Continuing:
data ParseContext = ParseContext { input :: String, -- The input that has not been parsed yet. output :: String, -- The output generated so far. currTok :: Token, -- The current token, if defined. stack :: [Token] -- A stack of tokens we haven't dealt with yet. } deriving Show
It should start looking familiar now. Notice that stack is a list of Tokens.
Now, let's merge the makeCool.hs program from earlier with all of these data declarations to create a overly complicated version of the makeCool.hs program:
data TokenType = Identifier | Qualifier | Type | Symbol Char deriving (Show, Eq) data Token = Token { tokenType :: TokenType, tokenValue :: String } deriving Show data ParseContext = ParseContext { input :: String, -- The input that has not been parsed yet. output :: String, -- The output generated so far. currTok :: Token, -- The current token, if defined. stack :: [Token] -- A stack of tokens we haven't dealt with yet. } deriving Show makeCool :: ParseContext -> ParseContext makeCool ParseContext {input = s} = ParseContext {input = "", output = s ++ " is cool!"} main = do s <- getContents let ctx = ParseContext {input = s, output = ""} in putStrLn $ output $ makeCool ctx
Let's start with makeCool :: ParseContext -> ParseContext. This says makeCool is a function that takes a ParseContext as an argument and returns a ParseContext. Declaring the signature for a function is usually optional in Haskell, so I could have left this line out. However, it often aids readability. Furthermore, it ensures that you and the compiler agree about what's going on.
ParseContext {input = s} can be read as "I take a 'ParseContext' as input, and I'm going to call the value of the 'input' field 's'." The ParseContext that it returns has "" for the "input" field, but 's ++ " is cool!"' for the "output" field, which is really the core of the earlier makeCool function.
The main function has changed by the addition of a let statement. A let statement is a way to create a sub-equation that's usable by the lines nested within the "in" clause. Here a new ParseContext is being created named ctx. The "input" is "s", and "output" is "".
The next line, putStrLn $ output $ makeCool ctx, may cause the reader to have Perl flashbacks, but here, $ is actually used for readability. (That's a joke. Perl programmers should avoid the urge to send me hate mail at this point!)
The following lines are all equivalent:
putStrLn (output (makeCool ctx)) -- Normal function calls. (putStrLn . output . makeCool) ctx -- Function composition: think "f of g". putStrLn $ output $ makeCool ctx -- A right to left pipeline.
I think it's helpful to read the last line backwards: "Pass 'ctx' to the 'makeCool' function. Then pass the result to the 'output' function. Then pass the result to the 'putStrLn' function."
Last of all, note that the makeCool function returns a ParseContext and the output function is how to get the output field from the ParseContext. Remember, what would be ctx.output in Java is written output ctx in Haskell.
The $ leads me to my next point. If you look at putStrLn $ output $ makeCool ctx, it's sort of a pipeline that runs from right to left. If you look at the makeCool function, it's a function that takes a ParseContext and returns a ParseContext. Whether or not the $ construct is used isn't all that important, but it leads me to the point that the main structure of the program can simply be:
A main function that reads input from the outside world and writes output to the outside world.
A ton of functions that transform ParseContext objects, all tied together into a pipeline.
Hence, if the C version is written as:
a(); b(); c();
Where a, b and c each might modify some global state and print output to the user, in Haskell, one might write:
c $ b $ a ctx
Where each function takes a ParseContext and returns a new one.