Skip to content

Kiuatan

Kiuatan ("horse" or "pony" in Chinook Jargon) is a library for building and running parsers in the Pony programming language.

  • Kiuatan uses Parsing Expression Grammar semantics, which means:
    • Choices are ordered, i.e. the parser will always try to parse alternatives in the order they are declared.
    • Sequences are greedy, i.e. the parser will not backtrack from the end of a sequence.
    • You can use positive and negative lookahead that does not advance the match position to constrain greedy sequences.
    • Parsers do not backtrack from successful choices.
  • Kiuatan parsers are "packrat" parsers; they memoize intermediate results, resulting in linear-time parsing.
  • Parsers use Mederios et al's algorithm to handle unlimited left-recursion.

Obtaining Kiuatan

Corral

The easiest way to incorporate Kiuatan into your Pony project is to use Pony Corral. Once you have it installed, cd to your project's directory and type:

corral add github chalcolith/kiuatan

This will add the library to your project. You can then build your project with something like:

corral fetch
corral run -- ponyc .

Git

You can clone and build Kiuatan directly from GitHub (you must have Corral in your PATH):

git clone https://github.com/chalcolith/kiuatan.git
cd kiuatan
make && make test

To use Kiuatan in a project without Corral you will need to add kiuatan/kiuatan to your PONYPATH environment variable.

Examples

The Kiuatan repo contains a canonical calulator example of how to use Kiuatan: calc.

Concepts

Kiuatan grammars can match over source sequences of any type that is

S: (Any #read & Equatable[S])

The most common source type will be U8 for parsing UTF-8 text (note that you will need to handle converting UTF-8 into normalized Unicode yourself if necessary).

Named Rules

A NamedRule encapsulates and names a grammatical rule that encodes a PEG rule. To create a rule, you provide a name, a body, and an optional action. For example, the following rule will match either one two three or one deux three.

let rule =
  recover val
    let ws = NamedRule[U8]("WhiteSpace", Star[U8](Single[U8](" \t"), 1))
    NamedRule[U8]("OneTwoThree",
      Conj[U8](
        [ Literal[U8]("one")
          ws
          Disj[U8]([ Literal[U8]("two"); Literal[U8]("deux") ])
          ws
          Literal[U8]("three")
        ]))
  end

You can build the body of a rule from the following combinator classes:

  • Single: matches a single source item. The constructor takes a set of possibilities to match. If you provide an empty list, this rule will match any single item.
  • Literal: matches a string of items.
  • Conj: matches a sequence of child rules.
  • Disj: matches one of a number of alternative child rules, in order. If one of the alternatives matches, but an outer rule fails later, the parser will not backtrack to another alternative.
  • Error: will trigger an error with the given message.
  • Look: will attempt to match its child rule, but will not advance the match position.
  • Neg: will succeed if its child rule does not match, and will not advance the match position.
  • Star: will match a number of repetitions of its child rule. You can specify a minimum or maximum number of times to match.
  • Bind: will bind the result of its child rule to an existing variable. See the calc example for an example of how to use Bind.
  • Condition: will succeed only if its child matches and the given condition returns true.

Recursion

In order to allow recursive rules, you can create a rule with no body and set its body later using the set_body() method:

// Add <- Add Op Num | Num
// Op <- [+-]
// Num <- [0-9]+
let rule: NamedRule[U8] val =
  recover val
    let add = NamedRule[U8]("Add", None)
    let num = NamedRule[U8]("Num", Star[U8](Single[U8]("0123456789"), 1))
    let op = NamedRule[U8]("Op", Single[U8]("+-"))
    let body = Disj[U8]([Conj[U8]([add; op; num]); num])
    add.set_body(body)
    add
  end

Note that Kiuatan can handle both direct and indirect left-recursion.

Source

A Source is a sequence of sequences of your source type. Internally this is represented as a linked list of sequences, called "segments". The idea behind this is that you can swap out individual segments of text that your Parser actor knows about, while maintaining the parse memo for the other segments. This allows a text editor, for example, to handle localized changes without re-parsing the whole source file.

Parser

A Parser actor knows about the source you are parsing, and holds a memo of parsing results across parse attempts.

In order to attempt to parse a particular sequence (or sequence of segments) of items, create a Parser actor, giving it an initial source, and then call its parse behaviour, passing a rule to use and a callback for when the parse either succeeds or fails:

  let segment = "one two three"
  let parser = Parser[U8]([segment])
  parser.parse(rule, {(result: Result[U8]) =>
    match result
    | let success: Success[U8] =>
      Debug.out("succeeded!")
    | let failure: Failure[U8] =>
      Debug.out("failed")
    end
  })

The generic parameters for the Parser actor type (and other types in Kiuatan) are as follows:

  • S: this is the "source type"; i.e. your Sources will be sequences of values of this type.
  • D: this is a "data" type. You can pass a value of this type to the Parser.parse() behaviour, and this value will be passed to your semantic Actions.
  • V: this is the "result value" type. For each successful parse result, zero or more result values (from child results) will be passed to your semantic Action, if present. If a rule has no action, child result values will be combined and passed to the next highest action.

Updating Source

You can update a parser's source by calling its remove_segment and insert_segment behaviours. The next time you initiate a parse, the parser's source will have been updated.

Results

If a parse succeeds, the result will be of type Success, which represents the concrete parse tree. You can get details about the location of the match and results from child rules.

If a parse fails, the result will be of type Failure, which contains some information about the location in the source where the failure occurred, and possibly an error message.

Actions and Result Values

If you wish, you can build a more abstract parse tree using semantic Actions that you pass to rules. These actions should return a "result value" of your desired type. The result values from child rule successes are passed to the action.

Actions also receive a Bindings map; you can access bound result values from child parses using this; you must return the map with the result of the action.

Public Types