Start parsing in Clojure: fnparse
fnparse
is a parser combinator library for Clojure. I’ve covered the basic mechanics of parser combinators previously. fnparse
is built on the exact same primitives (backtracking state monads) but takes the concept much farther. This article introduces the basics of how to get going with fnparse
.
All the examples in this post should work with Clojure 1.2.0, fnparse 2.2.7 and a repl.
Please note that while many of these examples mirror functionality present in regular expressions, this is by no means the extent of parser combinators. It merely represents a common ground that many people are familiar with.
Running a parser
Our first step toward building interesting parsers is to take some text (in this case a string), and run it through a simple parser. fnparse
provides a function ‘rule-match
’ that takes a number of arguments. The first is the parser to run, and the last is the initial state. The middle two provide error handling functions, which we are going to ignore at this stage. The state for our purposes is a map that contains the remainder of the input stream. Here is a simple example:
user=> (use 'name.choi.joshua.fnparse)
nil
user=> (rule-match (lit "a") prn prn {:remainder ["a"]})
"a"
Our parser in this case is (lit "a")
; this produces a literal parser e.g. a parser that consumes the literal string "a"
. It will match exactly one string containing exactly one lower case a
, nothing more or less.
At the other end of the arguments we have our initial state: a map with the single key :remainder
which is a vector of strings. In this case a single "a"
1.
Because our parser is able to completely consume the input, it returns the result of the parser: "a"
. If it were not able to parse it returns nil
:
user=> (rule-match (lit "a") prn prn {:remainder ["b"]})
{:remainder ["b"]}
nil
As you can see, nil
has been returned. Additionally one of our prn
functions was called. This was because our parser (lit "a")
failed to parse our input "b"
.
The last thing to examine is that fnparse
doesn’t care about the type of tokens in our input sequence. We are parsing a list of string tokens here, but it could be a list of keywords e.g.:
user=> (rule-match (lit :a) prn prn {:remainder [:a]})
:a
Or it could be a string of characters:
user=> (rule-match (lit \a) prn prn {:remainder "a"})
\a
fnparse
is designed to be run on a stream of tokens produced by a lexer. For the rest of the tutorial I will be ignoring this and working with strings of characters for simplicity.
Before we continue I’m going to wrap up rule-match
to make it easier to use from the repl for our purposes. I would not recommend using this function in a real program:
user=> (defn run-p
[parser input]
(let [result (rule-match parser
#(prn "fail:" %&)
#(prn "incomplete:" %&)
{:remainder input})]
(cond (nil? result) nil
(vector? result) (apply str result)
:else (str result))))
#'user/run-p
Rules
In fnparse
nomenclature these small parsers are known as a rule. Some rules are composed of sub-rules, while others are atomic. It is the ability to treat rules as first-class Clojure code and create functions that produce more sophisticated rules out of atomic rules that makes parser combinators so expressive.
Now that we know how to run a rule we are going to examine some primitive rules. We will start by introducing the operations that you might be familiar with from regular expressions.
Alternation
Alternation is choosing between two or more rules. This is equivalent to |
in regular expressions. In fnparse
this is the alt
parser. It takes multiple rules as arguments and returns a new rule. This new rule will try to satisfy each sub-rule in turn. alt
stops trying sub-rules when one accepts the input. Here is a rule that is equivalent to the regexp #"^(a|b)$"
:
user=> (run-p (alt (lit \a) (lit \b)) "a")
"a"
user=> (run-p (alt (lit \a) (lit \b)) "b")
"b"
user=> (run-p (alt (lit \a) (lit \b)) "c")
"fail:" ({:remainder "c"})
nil
Optional tokens
opt
simply tries to apply a rule but if the subrule fails, the opt
rule does not. This is equivalent to ?
in regular expressions, eg #"^a?$"
:
user=> (run-p (opt (lit \a)) "a")
"a"
user=> (run-p (opt (lit \a)) "")
nil
user=> (run-p (opt (lit \a)) "b")
"incomplete:" ({:remainder "b"} {:remainder "b"})
nil
Repetition
Repetition lets us repeatedly apply a parser to the input until it no longer accepts the tokens. fnparse
provides two such parsers: rep+
and rep*
. rep+
is one or more of the sub rule, rep*
is zero or more. rep*
will always succeed. These operators match the +
and *
operations in regular expressions respectively.
First up, lets examine rep+ with a rule similar to #"^a+$"
:
user=> (run-p (rep+ (lit \a)) "a")
"a"
user=> (run-p (rep+ (lit \a)) "aa")
"aa"
user=> (run-p (rep+ (lit \a)) "aaa")
"aaa"
user=> (run-p (rep+ (lit \a)) "aaab")
"incomplete:" ({:remainder "aaab"} {:remainder (\b)})
nil
user=> (run-p (rep+ (lit \a)) "")
"fail:" ({:remainder ""})
nil
user=> (run-p (rep+ (lit \a)) "b")
"fail:" ({:remainder "b"})
nil
You can see here that we have 2 outright failures (when there are no \a
’s in the sequence) and 1 incomplete. Lets contrast this with rep*
and a rule equivalent to #"^a*$"
:
user=> (run-p (rep* (lit \a)) "a")
"a"
user=> (run-p (rep* (lit \a)) "aa")
"aa"
user=> (run-p (rep* (lit \a)) "aaa")
"aaa"
user=> (run-p (rep* (lit \a)) "aaab")
"incomplete:" ({:remainder "aaab"} {:remainder (\b)})
nil
user=> (run-p (rep* (lit \a)) "")
nil
user=> (run-p (rep* (lit \a)) "b")
"incomplete:" ({:remainder "b"} {:remainder "b"})
nil
The first 4 cases are the same. The interesting differences occur with the inputs ""
and "b"
. Note that ""
is now successfully parsed.
Anything
The last major operation from regular expressions is .
or ‘anything’. fnparse
provides an equivalent to this with the anything
rule. Lets look at a couple of examples equivalent to #"^.$"
and #"^.+$"
:
user=> (run-p anything "a")
"a"
user=> (run-p anything "b")
"b"
user=> (run-p anything "")
nil
user=> (run-p anything "ab")
"incomplete:" ({:remainder "ab"} {:remainder (\b)})
nil
Note here that anything
differs slightly from .
because it happily consumes the end of input without error!
user=> (run-p (rep+ anything) "ab")
"ab"
user=> (run-p (rep+ anything) "a")
"a"
user=> (run-p (rep+ anything) "")
""
user=> (run-p (rep+ anything) "abc")
"abc"
Creating rules
The next thing to learn is how to combine these primitive rules to create more sophisticated rules. To start with lets create a character class rule that mirrors the […]
syntax. There are two functions we will create, and we’ll also introduce a new rule combinator: term
.
term
creates a parser that checks a single token against a validator predicate. If the predicate returns true then the rule succeeds. In this case we are going to create a little DSL for describing character classes. Our char-classes need literal characters, ranges which we will denote with maps, and special keywords for particular common character classes:
user=> (use 'clojure.set)
nil
user=> (defn- rule-to-set
"Take a rule and create the set of characters it represents."
[rule]
(cond (map? rule) (set (mapcat (fn [[lower upper]]
(map char (range (int lower)
(inc (int upper)))))
rule))
(char? rule) #{rule}
(string? rule) (set rule)
(set? rule) rule
:else (rule-to-set ({:digit {\0 \9}
:word {\a \z \A \Z \0 \9}
:space #{\newline \tab \space}}
rule #{}))))
#'user/rule-to-set
user=> (defn char-class
"This is an example parser that mirrors the character class
notation from regular expressions. It is also heavily inspired by
Christophe Grand's (https://github.com/cgrand/) regexp,
moustache, and enlive librarys."
[ & rules ]
(let [negated (= (first rules) :not)
rules (if negated (rest rules) rules)
ruleset (apply union (map rule-to-set rules))
validate (if negated (complement contains?) contains?)]
(term #(validate ruleset %))))
#'user/char-class
This is a big block of code. Here it is in action:
user=> (run-p (char-class \a) "a") ;; basic literals
"a"
user=> (run-p (char-class \a) "c")
"fail:" ({:remainder "c"})
nil
user=> (run-p (char-class \a \c) "c")
"c"
user=> (run-p (char-class \a \c) "b")
"fail:" ({:remainder "b"})
nil
user=> (run-p (char-class {\a \c}) "b") ;; introduce a range
"b"
user=> (run-p (char-class "abc") "b")
"b"
user=> (run-p (char-class :digit) "a") ;; common class
"fail:" ({:remainder "a"})
nil
user=> (run-p (char-class :digit) "1")
"1"
user=> (run-p (char-class :digit) "9")
"9"
user=> (run-p (char-class :digit "+-") "+") ;; two rules
"+"
user=> (run-p (char-class :digit "+-") "1")
"1"
user=> (run-p (char-class :not :digit "+-") "1") ;; negation
"fail:" ({:remainder "1"})
nil
user=> (run-p (char-class :not :digit "+-") "a")
"a"
This demonstrates an expressive feature of parser combinators as a model: it is very reasonable to create custom rules that follow specific features of the domain.
Nested expressions
Its now time to leave the regular expressions examples and move onto a case where regexps are not suitable: nested expressions. The common example here is bracket counting. Given a string of \(
and \)
characters, we will write a parser that will only parse expressions where the brackets are balanced:
user=> (def brackets (rep* (conc (lit \()
brackets
(lit \)))))
#'user/brackets
user=> (defn run-p2
[parser input]
(rule-match parser
#(prn "fail:" %&)
#(prn "incomplete:" %&)
{:remainder input}))
#'user/run-p2
user=> (run-p2 brackets "")
nil
user=> (run-p2 brackets "()")
[(\( nil \))]
user=> (run-p2 brackets "()(")
"incomplete:" ({:remainder "()("} {:remainder (\()})
nil
user=> (run-p2 brackets "()()")
[(\( nil \)) (\( nil \))]
user=> (run-p2 brackets "(())()")
[(\( [(\( nil \))] \)) (\( nil \))]
user=> (run-p2 brackets "(()()")
"incomplete:" ({:remainder "(()()"} {:remainder "(()()"})
nil
This clearly demonstrates a feature of parser combinators: recursive definitions. You can see how brackets
has been defined in terms of itself. A brackets expression consists of zero or more pairs of brackets. Each pair may contain more brackets.
This also introduces the conc
operator, which concatenates simple rules together.
user=> (run-p2 (conc (rep+ (lit \a)) (lit \b)) "aaab")
([\a \a \a] \b)
user=> (run-p2 (conc (rep+ (lit \a)) (lit \b)) "b")
"fail:" ({:remainder "b"})
nil
run-p2
is a simplified parser runner; its easier to look at these parsers by examining the parse trees they generate than by trying to stringify them.
Multi-brackets
Next we are going to extend the bracket counter to count three types of brackets: (), [], {}. To do so I am going to introduce the complex
combinator to show how to capture parse state and use it to make decisions later in parsing process:
user=> (def opening (alt (lit \() (lit \{) (lit \[)))
#'user/opening
user=> (def closing (comp lit {\( \) \{ \} \[ \]}))
#'user/closing
user=> (def multi-bracket
(rep* (complex [open opening
body multi-bracket
close (closing open)]
[open body close])))
#'user/multi-bracket
user=> (run-p2 multi-bracket "()")
[[\( nil \)]]
user=> (run-p2 multi-bracket "({})")
[[\( [[\{ nil \}]] \)]]
user=> (run-p2 multi-bracket "({})[]")
[[\( [[\{ nil \}]] \)] [\[ nil \]]]
user=> (run-p2 multi-bracket "({)})[]")
"incomplete:" ({:remainder "({)})[]"} {:remainder "({)})[]"})
nil
Where to now?
This article should have introduced enough of the basics of using fnparse
. I suggest that you examine the operators provided in the core module, and try to understand the sample JSON parser.
In addition to understanding the core rules and combinators, you may wish to examine the error handling facilities (we have merely used them to print logs) and the state management facilities.
You may also find tutorials on Haskell’s Parsec such as the Using Parsec chapter of Real World Haskell useful.
- Note that instead of wrapping your input stream in a map with a
:remainder
key, you can rebind*remainder-accessor*
to some other function. This is particularly useful if your token stream comes from some other library; you wont need to write a translation function. - Note that if you are wanting to actually apply regular expressions to tokens, the library provides built in rules to make that trivial.