An Introduction to the Parsec Library

Introduction

The process of compiling or interpreting programs requires, as one of its steps, parsing the source code and structuring them. More specifically, we have a step to convert a string into a set of tokens, called lexical analysis, which is carried out by a lexer or tokenizer.

After that, we structure those tokens in a such a way that encodes meaning of these tokens, such as abstract syntax trees (AST). This step is called parsing and is out by a parser.

The parsec library

Parse combinators are high-order functions that combine smaller combinators (parsers) to build more complex ones. This resembles the idea of context free grammars (CFG), which we talked about in a previous post, where we have productions of form

S \rightarrow A | B

Here, S can be formed by either from A or B, which in turn, can be other productions or terminals.

The parsec library is an implementation of a parser combinator in Haskell. We talked about combinators in Haskell previously (in portuguese).

Parsec vs. Bison/Yacc/Antlr

All Bison, Yacc and Antlr are not actual parsers, but rather parsers generators. They take a grammar file and generate parsers for the languages that can be described by those grammars.

Parsec, on the other hand, is a parser that you write yourself.

In this post we’ll go through several basic concepts of the parsec library, using as the main reference, the book Real World Haskell, Chapter 16.

The source code for all the examples to follow can be found on the blog’s github repository.

Basic combinators

The basic idea of a parser is that it takes an input (for example, a string), it consumes the characters of this string until it’s done with what it’s supposed to parse and then pass the remaining (unparsed) string along, which might be used as input to subsequent parsers.

One of the simples parsers is the one that only consumes a single specific character. There’s a parser named char at Data.Char library, so let’s write one that parses the letter a:

import Data.Char (char)
charAParser = (char 'a')

To test our parser with an input, we use the parse function form Text.Parsec

import Text.Parsec
-- parsedInput = parse someParser "source name" "some input"

This function takes as input a parser, a source name and the actual input. The source name parameter is not important for us now, so we can pass the empty string. Let’s write a simple wrapper to avoid boilerplate:

test p = parse p ""

We can now test our parser with same sample inputs:

> test charAParser "a"
Right 'a'
> test charAParser "ab"
Right 'a'
> test charAParser "ba"
Left (line 1, column 1):
unexpected "b"
expecting "a"

It extracts the first character of the input string if it’s the 'a' character, otherwise it throws an error. If we want to match any char, there’s also the function anyChar. Running it with the same examples:

> test anyChar "a"
Right 'a'
> test anyChar "ab"
Right 'a'
> test anyChar "ba"
Right 'b'

Note that it doesn’t fail for strings starting with 'b'. So far our parsers only match one character, so for example, the string "ab", it only returns the first character.

We can use a parser for the string too. There’s the string combinator but let’s develop our own and show how we can combine combinators to form new ones. There’s the many combinator that applies the combinator passed as argument until it fails.

Thus, we can write a string parser as many anyChar:

stringParser:: Parsec String st String
stringParser = many anyChar

Now let’s try it with the string "ab":

> test stringParser "ab"
Right "ab"

More useful than matching all characters is matching all except some, so we know when to stop parsing. For that, we can use noneOf instead of anyChar. It takes a list of characters as parameter and matches any character that is not on that list.

Let’s now write a wordParser, which keeps parsing all characters until it finds an whitespace:

wordParser:: Parsec String st String
wordParser = many $ noneOf [' ']

Let’s try it on the most classical string example:

> test wordParser "hello world"
Right "hello"

Note that our parsers are throwing away all the unparsed strings. How can we parse the remaining, unparsed string?

The two flavors of the Parsec library: Monads and Applicatives Functors

We’ve talked about Functors and Monads before, but not about Applicatives functors.

Intuitively, they are a structure in between Functors and Monads, that is, they’re more complex and general than Functors but less than Monads. We can also make the analogy of wrappers that we did for monads.

Originally, the Parsec library was written with Monads in mind, but Applicative functors were introduced after that and using them to write parsers usually leads to more clear syntax. So, in this post, we’ll use the applicative flavor to write our parsers.

Here, we’ll only provide an overview of some of the main applicative operators. For further details, the book Learn You a Haskell for Great Good has a nice introduction to Applicatives.

Operators Cheat Sheet. We can use the Maybe typeclass to illustrate the main applicative operators, since it implements an applicative functor.

() Unwrap the contents of both sides, combine them and wrap again

> Just (+3)  Just 9
Just 12

(*>) Unwrap the contents of both sides, but discard the result on the left

> Just (+3) *> Just 9
Just 9
> Just 7 *> Just 8
Just 8

(<*) Unwrap the contents of both sides, but discard the result on the right.

> Just 7 <* Just 8
Just 7

() Unwrap the contents of the right, combine the left and right arguments and return

> (+3)  Just 9
Just 12

(<$) Unwrap the contents of the right, but only wrap the one to the left

> 3 <$ Just 9
Just 3

This analogy of wrappers applied to parsers is not as natural though. In this case, we can think of unwrapping as executing the parser, by consuming the input and wrapping the result as getting the parsed token. The unparsed string is always carried over from parser to parser.

Hopefully with the next examples this will become clearer:

Parsing the second word

If we are to get the token from the second parser instead of the first, we need to execute both parsers but ignore the result of the first. Thus, we can use the operator (*>) to obtain something like

wordParser *> wordParser

This won’t quite work because the first parser doesn’t consume the whitespace, so the second parser will stop before consuming anything. We can fix that by consuming the whitespace:

wordParser *> (char ' ') *> wordParser

So let’s write:

secondWordParser:: Parsec String st String
secondWordParser = wordParser *> (char ' ')  *> wordParser

and now we can test:

> test secondWordParser "ab cd"
> cd

Parsing two words

We can also return both tokens if we use the operator () and then combine them into a list:

twoWordsParser:: Parsec String st [String]
twoWordsParser = listfy  wordParser  ((char ' ') *> wordParser)
                   where listfy a b = [a, b]

Parsing multiple words

Generalizing, we can parse multiple words with the aid of the many combinator:

wordsParser:: Parsec String st [String]
wordsParser = (:)  wordParser  many ((char ' ') *> wordParser)

We could actually write this using the sepBy1 parser, which parses a list of tokens separated by a separator and requires the list to have at least one element:

wordsParserAlt:: Parsec String st [String]
wordsParserAlt = sepBy1 (char ' ')

Simple CSV parser

With what we’ve seen so far, we can write a very basic CSV parser in 4 lines of code.

csvParser:: Parsec String st [[String]]
csvParser = lineParser `endBy` newline <* eof
              where lineParser = cellParser `sepBy` (char ',')
                    cellParser = many $ noneOf ",\n"

Note that it doesn't handle some corner cases like escaped commas within cells. For a full example, refer to either [1] or [2].

Choosing combinators

Recall that in Context Free Grammars, we can have production rules of the type:

S \rightarrow A | B

which means that S can be generated either from A or B. In Parsec, we can express this option using the () operator. Let’s write a simple parser that parses either the "cat" or "dog" strings:

dogCatParser:: Parsec String st String
dogCatParser = (string "dog")  (string "cat")

Testing on some inputs:

> test dogCatParser "dog"
Right "dog"
> test dogCatParser "cat"
Right "cat"
> test dogCatParser "elephant"
Left (line 1, column 1):
unexpected "e"
expecting "cat" or "dog"

Let’s write another example with different animal names:

camelCatParser:: Parsec String st String
camelCatParser = (string "camel")  (string "cat")

and try again with the input "cat":

> test camelCatParser "cat"
Left (line 1, column 1):
unexpected "t"
expecting "camel"

The parser failed because the strings have common prefix. It started matching the camel parser, but it also consumed the "ca" characters and then it failed to match the cat parser.

The try combinator

To avoid this, problem, there’s the try combinator, which will make a parser to not consume its input if it fails to match:

camelCatTryParser:: Parsec String st String
camelCatTryParser = try (string "camel")  (string "cat")

which works as expected:

> test camelCatTryParser "cat"
Right "cat"

We can see that it’s straightforward to convert a standard context free grammar into a haskell program using parsec.

Simple Expression Parser

So far we our parsers have only returned strings and list of strings. We can use data types to structure our parsed data in a way that is easier to evaluate later.

For our example, we’ll build a very simple parser for expressions that only contain + and - binary operators, terminals are all integers and all binaries are surrounded by parenthesis so we don’t have to handle precedence. Examples of valid expressions are "12", "(1+2)", "((3+4)-5)", whereas "1+2" is invalid (no parenthesis).

The first thing we want to do is to define our data types. Our number type, TNumber, is just an alias to Int. Our operator type, TOperator can be one of addition (TAdd) or subtraction (TSubtract). Finally, the expression is either binary (TNode) or a number (TTerminal).

type TNumber = Int

data TOperator = TAdd
               | TSubtract
                 deriving (Eq, Ord, Show)

data TExpression = TNode (TExpression) TOperator (TExpression)
                 | TTerminal TNumber
                   deriving (Show)

From what we’ve seen so far, it’s not very complicated to write parsers for TNumber and TOperator:

numberParser:: Parsec String st TNumber
numberParser = read  (many $ oneOf "0123456789")

operatorParser:: Parsec String st TOperator
operatorParser = chooseOp  (oneOf "+-")
                   where chooseOp '+' = TAdd
                         chooseOp '-' = TSubtract

For the expression we have two choices. Either we parse another expression enclosed in parenthesis or we parse a terminal. In the first case, we call the binaryExpressionParser which looks for the left expression, the operator and then the right expression.

expressionParser:: Parsec String st TExpression
expressionParser = (between (char '(') (char ')') binaryExpressionParser) 
                   (TTerminal  numberParser)

binaryExpressionParser:: Parsec String st TExpression
binaryExpressionParser = TNode  expressionParser  operatorParser  expressionParser

And that’s it! We can now run an example with a valid expression:

> test expressionParser "(123+(324-456))"
Right (JNode (JTerminal 123) JAdd (JNode (JTerminal 324) JSubtract (JTerminal 456)))

The advantage of having this AST is that it’s now very simple to evaluate:

evaluate:: TExpression -> TNumber
evaluate (TNode exp1 TAdd exp2)      = (evaluate exp1) + (evaluate exp2)
evaluate (TNode exp1 TSubtract exp2) = (evaluate exp1) - (evaluate exp2)
evaluate (TTerminal v)               = v

And the final test:

> let Right x = expressionParser "(123+(324-456))"
> evaluate x
-9

It works! We implemented a simple parser and interpreter for a very limited arithmetic expression. There are much better tools to do expression parsing (see [5] for a tutorial), but it’s out of the scope of this post.

Conclusion

We’ve learned the basics of the Parsec library and built some non-trivial parsers gluing together basic parsers using combinators. We even started scratching the parsing of programming languages by writing a parser for arithmetic expressions.

The Parsec applications presented in the Real World Haskell book are great. I felt that the content was a bit hard to follow, but writing helped me get a better understanding of the subject.

References

[1] Real World Haskell – Chapter 16
[2] A gentle introduction to Parsec
[3] StackOverflow – Parsec vs Yacc/Bison/Antlr: Why and when to use Parsec?
[4] Wikipedia – Parser Combinator
[5] Haskell Wiki – Parsing expressions and statements

Advertisements

The Visitor pattern and Vtables in C++

The other day I was developing a Java code and I needed to use instanceof, but since I have read that this is a sign of bad design, I asked for help and was recommended the Visitor pattern.

In this post we’ll talk about this design pattern in Java, that is where my need began, but also in C++ because the implementation is related to vtables, which I’ve been wanting to read about.

Visitors in Java

Let us consider the following example, where we have the classes Dog and Cat, which are specializations of Animal and have methods to emit sound, respectively bark() and meow(). We want to implement a method emitSound() which receives an instance of Animal and emits the sound according to the actual class of the instance. A first alternative using instanceof is the following:

// Listing 1
interface Animal {    
}
public class Dog implements Animal {
	public void bark(){
		System.out.println("Woof");
	}
}
public class Cat implements Animal {
	public void meow(){
		System.out.println("Meow");
	}
}
...
public static void emitSound(Animal animal){
    if(animal instanceof Dog){
        Dog dog = (Dog) animal;
        dog.bark();
    }
    else if(animal instanceof Cat){
        Cat cat = (Cat) animal;
        cat.meow();
    }
}

Here we can quote Scott Meyers, from Effective C++ book:

Anytime you find yourself writing code of the form “if the object is of type T1, then do something, but if it’s of type T2, then do something else,” slap yourself.

To get rid of instanceof, we can add the method emitSound() to Animal interface and replace bark()/meow() for it. In our method emitSound(), we let polymorphism choose the correct implementation.

// Listing 2
interface Animal {
    void emitSound();
}
public class Dog implements Animal {
    @Override
    public void emitSound(){
        System.out.println("Woof");
    }
}
public class Cat implements Animal {
    @Override
    public void emitSound(){
        System.out.println("Meow");
    }
}
...
public static void emitSound(Animal animal){
    animal.emitSound();
}

Another possibility is to delegate the implementation of emitSound() to another class through the use of the Visitor pattern. This pattern considers two types of classes: an element and a visitor. The element implements a method usually named accept() while the visitor implements the method visit(). The accept() method receives a visitor as a parameter and calls the method visit() from this visitor with itself as parameter. This way, the visitor can implement the visit() method for each possible element.

We can implement like this:

// Listing 3
interface Animal {
    void accept(AnimalVisitor visitor);
}
public class Dog implements Animal {
    @Override
    public void accept(AnimalVisitor visitor){
        visitor.visit(this);
    }
}
public class Cat implements Animal {
    @Override
    public void accept(AnimalVisitor visitor){
        visitor.visit(this);
    }
}
interface AnimalVisitor {
    void visit(Dog dog);
    void visit(Cat cat);
}
public class SoundEmissor implements AnimalVisitor {
    @Override
    public void visit(Cat cat){
        System.out.println("Meow");
    }
    @Override
    public void visit(Dog dog){
        System.out.println("Woof");
    }	
}
...
public static void emitSound(Animal animal){
    SoundEmissor soundEmissor = new SoundEmission();
    animal.accept(soundEmissor);
}

The advantage of this approach is that SoundEmissor may contain members and methods related to the way animals emit sounds. Otherwise, these methods would ended up being implemented in Animal.

Another advantage is that the visitor implementation can be chosen at runtime and, being a dependency injection, simplifies testing.

Visitors in C++

In C++ we don’t have instanceof, but we can use the typeid() operator instead. According to [1], if the argument to this operator is a reference or a dereferenced pointer to a polymorfic class, typeid() will return a type_info corresponding to the runtime object type.

// Listing 4
struct Animal {
    virtual void foo(){};
};
struct Dog : public Animal {
    void bark(){
        cout << "Woof\n";
    } 
};
struct Cat : public Animal {
    void meow(){
        cout << "Meow\n";
    }
};
void emitSound(Animal *animal){
    if(typeid(*animal) == typeid(Dog)){
        Dog *dog = dynamic_cast<Dog *>(animal);
        dog->bark();
    }
    else if(typeid(*animal) == typeid(Cat)){
        Cat *cat = dynamic_cast<Cat *>(animal);
        cat->meow();
    }
}

Again, we can create emitSound() in a interface and make use of polymorphism. In C++ we can implement an interface through a class containing only purely virtual function and no visible constructor like Animal in the code below:

// Listing  5
struct Animal {
    virtual void emitSound() = 0;
};
struct Dog : public Animal {
    void emitSound(){
        cout << "Woof\n";
    }
};
struct Cat : public Animal {
    void emitSound(){
        cout << "Meow\n";
    }
};
void emitSound(Animal *animal){
    animal->emitSound();
}

The Visitor pattern can be similarly implemented in the following way:

// Listing 6
struct Cat;
struct Dog;

struct AnimalVisitor {
    virtual void visit(Cat *cat) = 0;
    virtual void visit(Dog *dog) = 0;
};
struct Animal {
    virtual void accept(AnimalVisitor *visitor) = 0;
};
struct SoundEmissor : public AnimalVisitor {
    void visit(Cat *cat){
        cout << "Meow\n";
    }
    void visit(Dog *dog){
        cout << "Woof\n";
    }    
};
struct Dog : public Animal {
    void accept(AnimalVisitor *visitor){
        visitor->visit(this);
    }
};
struct Cat : public Animal {
    void accept(AnimalVisitor *visitor){
        visitor->visit(this);
    }
};
void emitSound(Animal *animal){
    AnimalVisitor *visitor = new SoundEmissor();
    animal->accept(visitor);
}

Single and multiple dispatch

We just saw that the Visitor pattern relies on the use of virtual functions. So now, let’s analyse how C++ implements these kind of functions. Before that, we need to understand the concept of dynamic dispatch.

Dynamic dispatch is a technique used in cases where different classes (with a common ancestor) implement the same methods, but we can’t tell on compile time what is the type of a given instance, as in the cases above.

In C++ and Java, this type of dispatch is also known as single dispatch since it only considers the type of the object calling the method. In Lisp and other few languages, there is the multiple dispatch that also takes into account the type of the parameters.

As an example, take a look at the following code:

// Listing 7
void visitor(Dog *dog){
    cout << "Woof\n";
}
void visitor(Cat *cat){
    cout << "Meow!\n";
}
void emitSound(Animal *animal){
    visitor(animal);
}

We’ll get a compile error since method/function matching for parameters is made at compile time. In this case we need to have an implementation for the Animal type.

Now, we’ll see how C++ implements dynamic dispatch.

C++ Vtables

Although the C++ standard does not specify how dynamic dispatch should be implemented, compilers in general use the a structure called the virtual method table, known for other names such as vtable, which is the one we’ll use henceforth. Actually, this table is an array of pointers to virtual methods.

Each class that defines or inherits at least one virtual method has its own vtable. This table points to the virtual methods defined by the nearest ancestral (including the class itself) that are visible for that class. Besides, each instance of these classes will have an additional member, a pointer that points to the table corresponding to the class used to instantiate that object.

As an example, if we were to instantiate a Dog from Listing 5:

Animal *animal = new Cachorro();

Here, the animal's pointer points to Dog‘s and not Animal‘s. Thus, when we do

animal->emitirSom();

the corresponding table is looked up to find exactly which emitSound() to call. Note that virtual methods requires an extra indirection that may affect performance. Because of this dynamic dispatch is not performed by default unless some class defines a virtual method. On the other hand, in Java this is done by default.

Let’s see a final example,

struct A {
    virtual void foo(){}
    virtual void bar(){}
    void baz(){}
};
struct B : public A {
    void bar(){}
    virtual void baz(){}
};

We can analyse the vtables from class A and B compiling the code above with gcc using the -fdump-class-hierarch flag. A text file with extension .class will be generated like this:

Vtable for A
A::_ZTV1A: 4u entries
0 (int (*)(...))0
4 (int (*)(...))(&amp; _ZTI1A)
8 (int (*)(...))A::foo
12 (int (*)(...))A::bar

Class A
size=4 align=4
base size=4 base align=4
A (0xb71f6038) 0 nearly-empty
vptr=((&amp; A::_ZTV1A) + 8u)

Vtable for B
B::_ZTV1B: 5u entries
0 (int (*)(...))0
4 (int (*)(...))(&amp; _ZTI1B)
8 (int (*)(...))A::foo
12 (int (*)(...))B::bar
16 (int (*)(...))B::baz

Class B
size=4 align=4
base size=4 base align=4
B (0xb7141618) 0 nearly-empty
vptr=((&amp; B::_ZTV1B) + 8u)
A (0xb71f61f8) 0 nearly-empty
primary-for B (0xb7141618)

Here we can see that function A::foo and A::bar appear in A‘s vtable, but A::baz don’t, since it was not declared virtual. On B‘s vtable we have A::foo, because it was not overridden in B. We also have B::bar, although it has not been declared virtual in B, this property was inherited from A::bar. Finally, B::baz appears on the vtable because it was declared virtual in B.

We can also see the value that the instances of such classes will have: vptr=((& A::_ZTV1A) + 8u) and vptr=((& B::_ZTV1B) + 8u), which is the offset to the functions’ pointers on the respective tables. A nice reference for a understanding of vtables can be seen in [2].

Referências

[1] The typeid operator (C++ only)
[2] LearnCpp.com – The Virtual Table