A Lazy Sequence

Start parsing in Clojure: fnparse

fnparse is a parser combinator library for Clojure. I’ve covered the basic mechanics of parser combinators previously. fnparse is built on the exact same primitives (backtracking state monads) but takes the concept much farther. This article introduces the basics of how to get going with fnparse.

All the examples in this post should work with Clojure 1.2.0, fnparse 2.2.7 and a repl.

Please note that while many of these examples mirror functionality present in regular expressions, this is by no means the extent of parser combinators. It merely represents a common ground that many people are familiar with.

Running a parser

Our first step toward building interesting parsers is to take some text (in this case a string) and run it through a simple parser. fnparse provides a function ‘rule-match’ that takes four arguments: the first is the parser to run, and the last is the initial state; the middle two are error-handling functions, which we are going to ignore at this stage. The state, for our purposes, is a map that contains the remainder of the input stream. Here is a simple example:

user=> (use 'name.choi.joshua.fnparse)
nil
user=> (rule-match (lit "a") prn prn {:remainder ["a"]})
"a"

Our parser in this case is (lit "a"); this produces a literal parser, i.e. a parser that consumes the literal string "a". It will match exactly one string containing exactly one lower-case a, nothing more and nothing less.

At the other end of the arguments we have our initial state: a map with the single key :remainder, whose value is a vector of strings; in this case a single "a" [1].

Because our parser is able to completely consume the input, rule-match returns the result of the parser: "a". If it is unable to parse the input, it returns nil:

user=> (rule-match (lit "a") prn prn {:remainder ["b"]})
{:remainder ["b"]}
nil

As you can see, nil has been returned. Additionally, one of our prn functions (the error handler) was called, because our parser (lit "a") failed to parse the input "b".

The last thing to note is that fnparse doesn’t care about the type of the tokens in our input sequence. We are parsing a sequence of string tokens here, but it could just as easily be a sequence of keywords:

user=> (rule-match (lit :a) prn prn {:remainder [:a]})
:a

Or it could be a string of characters:

user=> (rule-match (lit \a) prn prn {:remainder "a"})
\a

fnparse is designed to be run on a stream of tokens produced by a lexer. For the rest of the tutorial I will be ignoring this and working with strings of characters for simplicity.

Before we continue I’m going to wrap rule-match to make it easier to use from the repl for our purposes. I would not recommend using this function in a real program:

user=> (defn run-p
         [parser input]
         (let [result (rule-match parser
                                  #(prn "fail:" %&)        ;; called when the rule fails outright
                                  #(prn "incomplete:" %&)  ;; called when input is left unconsumed
                                  {:remainder input})]
           (cond (nil? result)    nil                  ;; no parse
                 (vector? result) (apply str result)   ;; join a vector of tokens back into a string
                 :else            (str result))))      ;; single token
#'user/run-p

Rules

In fnparse nomenclature these small parsers are known as rules. Some rules are composed of sub-rules, while others are atomic. It is the ability to treat rules as first-class Clojure code, and to write functions that produce more sophisticated rules out of atomic ones, that makes parser combinators so expressive.

Now that we know how to run a rule we are going to examine some primitive rules. We will start by introducing the operations that you might be familiar with from regular expressions.

Alternation

Alternation is choosing between two or more rules. This is equivalent to | in regular expressions. In fnparse this is the alt parser. It takes multiple rules as arguments and returns a new rule. This new rule will try to satisfy each sub-rule in turn. alt stops trying sub-rules when one accepts the input. Here is a rule that is equivalent to the regexp #"^(a|b)$":

user=> (run-p (alt (lit \a) (lit \b)) "a")
"a"
user=> (run-p (alt (lit \a) (lit \b)) "b")
"b"
user=> (run-p (alt (lit \a) (lit \b)) "c")
"fail:" ({:remainder "c"})
nil

Optional tokens

opt tries to apply a sub-rule, but if the sub-rule fails, the opt rule itself still succeeds without consuming any input. This is equivalent to ? in regular expressions, e.g. #"^a?$". Note in the last example below that because opt succeeds without consuming the "b", the parse is reported as incomplete rather than as a failure:

user=> (run-p (opt (lit \a)) "a")
"a"
user=> (run-p (opt (lit \a)) "")
nil
user=> (run-p (opt (lit \a)) "b")
"incomplete:" ({:remainder "b"} {:remainder "b"})
nil

Repetition

Repetition lets us repeatedly apply a parser to the input until it no longer accepts the tokens. fnparse provides two such parsers: rep+ and rep*. rep+ matches one or more repetitions of its sub-rule; rep* matches zero or more, and so will always succeed. These correspond to the + and * operators in regular expressions respectively.

First up, let’s examine rep+ with a rule similar to #"^a+$":

user=> (run-p (rep+ (lit \a)) "a")
"a"
user=> (run-p (rep+ (lit \a)) "aa")
"aa"
user=> (run-p (rep+ (lit \a)) "aaa")
"aaa"
user=> (run-p (rep+ (lit \a)) "aaab")
"incomplete:" ({:remainder "aaab"} {:remainder (\b)})
nil
user=> (run-p (rep+ (lit \a)) "")
"fail:" ({:remainder ""})
nil
user=> (run-p (rep+ (lit \a)) "b")
"fail:" ({:remainder "b"})
nil

You can see here that we have two outright failures (when there are no \a’s in the sequence) and one incomplete parse. Let’s contrast this with rep* and a rule equivalent to #"^a*$":

user=> (run-p (rep* (lit \a)) "a")
"a"
user=> (run-p (rep* (lit \a)) "aa")
"aa"
user=> (run-p (rep* (lit \a)) "aaa")
"aaa"
user=> (run-p (rep* (lit \a)) "aaab")
"incomplete:" ({:remainder "aaab"} {:remainder (\b)})
nil
user=> (run-p (rep* (lit \a)) "")
nil
user=> (run-p (rep* (lit \a)) "b")
"incomplete:" ({:remainder "b"} {:remainder "b"})
nil

The first four cases are the same. The interesting differences occur with the inputs "" and "b". Note that "" is now successfully parsed, and that "b" now results in an incomplete parse rather than an outright failure, because rep* succeeds without consuming anything.

Anything

The last major operation from regular expressions is . or ‘anything’. fnparse provides an equivalent to this with the anything rule. Let’s look at a couple of examples equivalent to #"^.$" and #"^.+$":

user=> (run-p anything "a")
"a"
user=> (run-p anything "b")
"b"
user=> (run-p anything "")
nil  
user=> (run-p anything "ab")
"incomplete:" ({:remainder "ab"} {:remainder (\b)})
nil

Note here that anything differs slightly from . because it happily consumes the end of input without error!

user=> (run-p (rep+ anything) "ab")
"ab"
user=> (run-p (rep+ anything) "a")
"a"
user=> (run-p (rep+ anything) "")
""
user=> (run-p (rep+ anything) "abc")
"abc"

Creating rules

The next thing to learn is how to combine these primitive rules to create more sophisticated ones. To start with, let’s create a character-class rule that mirrors the […] syntax. We will create two functions, and we’ll also introduce a new rule combinator: term.

term creates a parser that checks a single token against a validator predicate; if the predicate returns a truthy value for the token, the rule succeeds and consumes it. We are going to use term to create a little DSL for describing character classes. Our character classes need literal characters, ranges (which we will denote with maps), and special keywords for particular common character classes.
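
To see term on its own before we build on it, here is a quick illustration of my own (not from the original post); assuming term behaves as described above, a predicate such as Character/isDigit gives a single-digit rule:

user=> (run-p (term #(Character/isDigit %)) "7")
"7"
user=> (run-p (term #(Character/isDigit %)) "x")
"fail:" ({:remainder "x"})
nil

With term covered, here are the two functions that make up the character-class DSL: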

user=> (use 'clojure.set)
nil
user=> (defn- rule-to-set
          "Take a rule and create the set of characters it represents."
          [rule]
          (cond (map? rule)    (set (mapcat (fn [[lower upper]]
                                               (map char (range (int lower)
                                                                (inc (int upper)))))
                                       rule))
                (char? rule)   #{rule}
                (string? rule) (set rule)
                (set? rule)    rule
                :else (rule-to-set ({:digit {\0 \9}
                                     :word  {\a \z \A \Z \0 \9}
                                     :space #{\newline \tab \space}}
                                     rule #{}))))
#'user/rule-to-set

user=> (defn char-class
          "This is an example parser that mirrors the character class 
           notation from regular expressions. It is also heavily inspired by 
           Christophe Grand's (https://github.com/cgrand/) regexp, 
           moustache, and enlive libraries."
          [ & rules ]
          (let [negated  (= (first rules) :not)
                rules    (if negated (rest rules) rules)
                ruleset  (apply union (map rule-to-set rules))
                validate (if negated (complement contains?) contains?)]
            (term #(validate ruleset %))))
#'user/char-class

This is a big block of code. Here it is in action:

user=> (run-p (char-class \a) "a")               ;; basic literals
"a"
user=> (run-p (char-class \a) "c")
"fail:" ({:remainder "c"})
nil
user=> (run-p (char-class \a \c) "c")
"c"
user=> (run-p (char-class \a \c) "b")
"fail:" ({:remainder "b"})
nil
user=> (run-p (char-class {\a \c}) "b")          ;; introduce a range
"b"
user=> (run-p (char-class "abc") "b")
"b"
user=> (run-p (char-class :digit) "a")           ;; common class
"fail:" ({:remainder "a"})
nil
user=> (run-p (char-class :digit) "1")
"1"
user=> (run-p (char-class :digit) "9")
"9"
user=> (run-p (char-class :digit "+-") "+")      ;; two rules
"+"
user=> (run-p (char-class :digit "+-") "1")
"1"
user=> (run-p (char-class :not :digit "+-") "1") ;; negation 
"fail:" ({:remainder "1"})
nil
user=> (run-p (char-class :not :digit "+-") "a")
"a"

This demonstrates an expressive feature of parser combinators as a model: it is very reasonable to create custom rules that follow specific features of the domain.
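
Because char-class just returns an ordinary rule, it composes with the combinators we have already seen. As a quick sketch of my own (not from the original session), here is a rule for a run of digits, similar to #"^\d+$":

user=> (run-p (rep+ (char-class :digit)) "2011")
"2011"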

Nested expressions

It’s now time to leave the regular expression examples and move on to a case where regexps are not suitable: nested expressions. The common example here is bracket counting. Given a string of \( and \) characters, we will write a parser that only parses expressions where the brackets are balanced:

user=> (def brackets (rep* (conc (lit \() 
                                 brackets 
                                 (lit \)))))
#'user/brackets

user=> (defn run-p2 
         [parser input]
         (rule-match parser 
                     #(prn "fail:" %&) 
                     #(prn "incomplete:" %&) 
                     {:remainder input}))
#'user/run-p2

user=> (run-p2 brackets "")
nil
user=> (run-p2 brackets "()")
[(\( nil \))]
user=> (run-p2 brackets "()(")
"incomplete:" ({:remainder "()("} {:remainder (\()})
nil
user=> (run-p2 brackets "()()")
[(\( nil \)) (\( nil \))]
user=> (run-p2 brackets "(())()")
[(\( [(\( nil \))] \)) (\( nil \))]
user=> (run-p2 brackets "(()()")
"incomplete:" ({:remainder "(()()"} {:remainder "(()()"})
nil

This clearly demonstrates a feature of parser combinators: recursive definitions. You can see how brackets has been defined in terms of itself. A brackets expression consists of zero or more pairs of brackets. Each pair may contain more brackets.

This also introduces the conc operator, which concatenates rules together: each sub-rule must match, in sequence.

user=> (run-p2 (conc (rep+ (lit \a)) (lit \b)) "aaab")
([\a \a \a] \b)
user=> (run-p2 (conc (rep+ (lit \a)) (lit \b)) "b")
"fail:" ({:remainder "b"})
nil

run-p2 is a simplified parser runner; it’s easier to understand these parsers by examining the parse trees they generate than by trying to stringify them.
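
conc also composes with the rules we defined earlier. As a rough sketch of my own (not from the original post), a signed-integer rule along the lines of #"^[+-]?\d+$" should behave something like this:

user=> (run-p2 (conc (opt (char-class "+-")) (rep+ (char-class :digit))) "-42")
(\- [\4 \2])
user=> (run-p2 (conc (opt (char-class "+-")) (rep+ (char-class :digit))) "42")
(nil [\4 \2])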

Multi-brackets

Next we are going to extend the bracket counter to handle three types of brackets: (), [], {}. To do so I am going to introduce the complex combinator, which shows how to capture parse state and use it to make decisions later in the parsing process:

user=> (def opening (alt (lit \() (lit \{) (lit \[)))
#'user/opening
user=> (def closing (comp lit {\( \) \{ \} \[ \]}))
#'user/closing

user=> (def multi-bracket 
         (rep* (complex [open opening 
                         body multi-bracket 
                         close (closing open)] 
           [open body close])))
#'user/multi-bracket

user=> (run-p2 multi-bracket "()")
[[\( nil \)]]
user=> (run-p2 multi-bracket "({})")
[[\( [[\{ nil \}]] \)]]
user=> (run-p2 multi-bracket "({})[]")
[[\( [[\{ nil \}]] \)] [\[ nil \]]]
user=> (run-p2 multi-bracket "({)})[]")
"incomplete:" ({:remainder "({)})[]"} {:remainder "({)})[]"})
nil
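
complex is not limited to threading state between sub-rules; it can also transform a rule’s output. As a small sketch of my own (not from the original post; digit-value is just an illustrative name), and assuming complex works as used above:

user=> (def digit-value (complex [d (char-class :digit)]
                                 (Character/getNumericValue d)))
#'user/digit-value
user=> (run-p2 digit-value "7")
7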

Where to now?

This article should have introduced enough of the basics to get you started with fnparse. I suggest that you examine the operators provided in the core module, and try to understand the sample JSON parser.

In addition to understanding the core rules and combinators, you may wish to examine the error handling facilities (we have merely used them to print logs) and the state management facilities.

You may also find tutorials on Haskell’s Parsec such as the Using Parsec chapter of Real World Haskell useful.

  1. Note that instead of wrapping your input stream in a map with a :remainder key, you can rebind *remainder-accessor* to some other function. This is particularly useful if your token stream comes from some other library; you won’t need to write a translation function.
  2. Note that if you want to apply regular expressions to tokens, the library provides built-in rules to make that trivial.
7 February 2011