Parsers generating parsers

The following is a slightly more advanced use of fnparse; It extends the char-class example provided in my previous post to support taking the text of a regular expression character class and generating a rule from it.

First thing to examine is a couple of new operations that I didn’t introduce in the previous post. lit-alt-seq is approximately equivalent to (fn [s] (apply alt (map lit s))). It could be used to replace the (alt (lit \() (lit \{) (lit \[)) expression in multi-bracket for example.

semantics, the other new operator, is more significant. semantics takes a rule and a transformation. If the rule succeeds, then its results are passed to the transformation function. If the rule fails, the transformation function is not called at all. A brief example:

user=> (run-p2 (semantics (lit-alt-seq "abc") int) "a")
97
user=> (run-p2 (semantics (lit-alt-seq "abc") int) "1")
"fail:" ({:remainder "a"})
nil

This is important because it reduces the need to depend on the complex operator and allows us to reuse a lot more behaviour. Note that there is a constant-semantics variant that is equivalent to (fn [r v] (semantics r (constantly v))).

Here is the code:

; define some private rules for char-class-text
(let [escaped      (semantics (conc (lit \\) anything)
                              second)
      predefined   (semantics (conc (lit \/) (lit-alt-seq "swd"))
                              (comp {\s :space \w :word \d :digit} second))
      single       (alt escaped
                        predefined
                        anything)
      range-rule   (semantics (conc single (lit \-) single)
                              (fn [[l _ u]] {l u}))
      all-chars    (rep* (alt range-rule single))
      char-class-p (complex [negated? (opt (lit \^))
                             hyphen?  (opt (lit \-))
                             chars    all-chars]
                            (concat (if negated? [:not] [])
                                    (if hyphen?  [\-]   [])
                                    chars))]
  (defn char-class-text 
    "This rule takes the textual body of a regular expression character class
     and creates a rule that matches the same characters."
    [text]
    (apply char-class (rule-match char-class-p 
                                  (constantly nil) 
                                  (constantly nil)
                                  {:remainder text}))))

Lets have a look at the char-class-text in action:

user=> (run-p (char-class-text "a-z") "a")
"a"
user=> (run-p (char-class-text "a-z") "A")
"fail:" ({:remainder "A"})
nil
user=> (run-p (char-class-text "/w") "A")
"A"
user=> (run-p (char-class-text "^a-z") "A")
"A"
user=> (run-p (char-class-text "^a-z") "a")
"fail:" ({:remainder "a"})
nil
user=> (run-p (char-class-text "abc") "a")
"a"
user=> (run-p (char-class-text "\\/") "a")
"fail:" ({:remainder "a"})
nil
user=> (run-p (char-class-text "\\/") "/")
"/"

So, why go to all this effort to produce something we already had? Simple; if you are writing a parser for a formally specified format it is likely to include a formal grammar such as an EBNF. These grammars often include complex rules for things such as character classes. Instead of manually translating these large rules, it is often easier to produce a simple parser like this that generates a new rule. In addition to reducing the likelihood of error, if future revisions of the spec change the rule, its very easy to update.