SphericalKat Blog Resume

A gentle introduction to parser-combinators

Hang around programmers long enough, and you’ll often hear of parsers or parsing. Let’s find out what it is and what we use it for.

Parsers are programs that can analyze and process input, typically text, according to a set of rules. They are commonly used for tasks such as validating user input, interpreting programming languages, or extracting data from structured text.

Naturally, parsing is the act of running a parser over input. For example, a JSON parser will accept text and output an object.

With that out of the way, let’s design a parser.

Note: All code will be written in Rust for the purposes of this article.

Parser Combinators

Parser combinators are a way of building parsers by combining smaller parsers into larger ones. Each parser combinator takes one or more parsers as input and returns a new parser as output. By chaining together parser combinators, we can create more complex parsers that can handle a wider range of inputs. This approach is often used in functional programming languages, such as Haskell, where parsers are represented as functions that can be composed together.

You might ask; but why use Parser Combinators? Surely you can write a parser by hand instead of chaining multiple parsers together?

Why use Parser Combinators?

Parser combinators offer several advantages over hand-written parsers.

They are often easier to read and understand. By breaking down the parser into smaller, composable parts, it becomes easier to reason about each part of the parsing process.

Parser combinators are often more modular and reusable than hand-written parsers. Once a set of parser combinators has been defined, it can be used to parse many different inputs.

Parser combinators often result in shorter, more concise code. This is because they allow complex parsing logic to be expressed in a more declarative style, rather than a procedural style.

It is also difficult and time consuming to write parsers with good error recovery mechanisms. It requires understanding the intricacies of the recursive descent algorithm, and then implement recovery strategies on top of them.

If you’re writing a programming language, there will almost certainly be changes in the syntax and grammar in the process, which can cause time-consuming and painful refactoring. Parser combinators solve both error recovery and refactoring by providing an easy to use API that allows for rapid prototyping.

Parser combinators are an excellent choice for domain-specific languages that lack an existing parser. With a reliable and fault-tolerant parser combinator library, what would have taken several days to accomplish can be achieved in as little as half an hour.

Abstract Syntax Trees

Abstract Syntax Trees (ASTs) are a way of representing the structure of code or data in a program. They are used by compilers and interpreters to transform source code into executable code, or to analyze data structures.

An AST is a tree-like data structure that represents the syntactic structure of the code or data, without including information about the specific characters used to write the code. Instead, it captures the essential structure of the code, including the relationships between different parts of the code and the order in which they appear. ASTs are often used in conjunction with parser combinators to build parsers for programming languages or other structured data formats.

An example AST for a parsed JSON file is shown here:

Some(
    Object(
        {
            "acres": Str(
                "home",
            ),
            "leaving": Object(
                {
                    "front": Str(
                        "college",
                    ),
                    "origin": Num(
                        981339097.0,
                    ),
                    "cowboy": Num(
                        -355139449.0,
                    ),
                    "fed": Num(
                        -283765067.9149623,
                    ),
                    "tail": Array(
                        [
                            Num(
                                -2063823378.8597813,
                            ),
                            Bool(
                                true,
                            ),
                            Null,
                            Num(
                                -153646.6402,
                            ),
                            Str(
                                "board",
                            ),
                        ],
                    ),
                    "although": Num(
                        -794127593.3922591,
                    ),
                },
            ),
            "activity": Str(
                "value",
            ),
            "noise": Bool(
                false,
            ),
            "office": Num(
                -342325541.1937506,
            ),
        },
    ),
)

Our first parser

We’re going to write a simple CLI tool to parse JSON and generate an AST (Abstract Syntax Tree) called bourne. We will write it in Rust, and use the chumsky parser-combinator library.

Note that our parser will not handle the full JSON specification, such as escape characters or exponential numbers, but it should handle most common JSON, such as nested arrays, objects, numbers, strings, booleans and null values.

Creating the project

Create a simple rust binary project using cargo:

cargo new bourne
cd bourne/

Add the chumsky and ariadne libraries to Cargo.toml

We’ll be using ariadne for effective error recovery and reporting.

[package]
name = "bourne"
version = "0.1.0"
edition = "2021"

[dependencies]
ariadne = "0.3.0"
chumsky = {git = "https://github.com/zesterer/chumsky", rev = "3b1ab31"}

Read a JSON file from stdin

To parse some JSON, we first have to read it into memory. For this purpose, we’ll take a file path through stdin and read it.

use std::{env, fs};

fn main() {
	let src = fs::read_to_string(env::args().nth(1).expect("Expected file argument"))
        .expect("Failed to read file");
}

Define the AST model

We will define a recursive AST based on the JSON value type as a Rust enum.

use std::{collections::HashMap, env, fs};

#[derive(Clone, Debug)]
enum Json {
    Invalid, // invalid node
    Null, // null node
    Bool(bool), // boolean node with its value
    Str(String), // string node with value
    Num(f64), // number node with value
    Array(Vec), // array node with values
    Object(HashMap), // nested object with child key value pairs
}

fn main() ...

Build a parser function

Finally, let’s build a parser() function that will return a recursive parser that can take in string input and return the built AST and any errors that might be encountered.

use ariadne::{Color, Label, Report, ReportKind, Source};
use chumsky::prelude::*;

#[derive(Clone, Debug)]
enum Json {...

fn parser<'a>() -> impl Parser<'a, &'a str, Json, extra::Err>> {
	recursive(|value| {
	})
}

fn main() {
	...
	let (json, errs) = parser().parse(src.trim()).into_output_errors();
    println!("{:#?}", json);
    errs.into_iter().for_each(|e| {
        Report::build(ReportKind::Error, (), e.span().start)
            .with_message(e.to_string())
            .with_label(
                Label::new(e.span().into_range())
                    .with_message(e.reason().to_string())
                    .with_color(Color::Red),
            )
            .finish()
            .print(Source::from(&src))
            .unwrap()
    });
}

Let’s break down all the changes here.

  1. First, we defined a function called parser that returns an implementation of the Parser trait defined by chumsky. This trait takes in a few type parameters, which are described below:

    1. Parsers are parameterised over the lifetime of their inputs. Because we don't yet know what input our parser will be used to parse, we declare a generic lifetime, 'a, to allow the parser to work with whatever input lifetime it needs to work with.
    2. The first type parameter (i.e: ignoring the lifetime parameter) of the [Parser] trait is the input type. Inputs must implement the [Input] trait. Examples of inputs include strings, slices, arrays, [Stream]s, and much more. For now we are specifying that this parser can only operate upon string slices: but it is also possible to introduce the input type as a generic type parameter like I: Input<'src> instead if you want your parser to be generic across more than just string slices.
    3. The second type parameter of the [Parser] trait is the output type. This is the type of the value that the parser will eventually give you, assuming that parsing was successful. Here, we use the output type of [Json], i.e: the enum we just defined to represent our AST.
  2. Next, we returned the value returned by the recursive function, which takes in a callback where we will actually construct the parser. There are a few things worth paying attention to here:

    1. recursive allows us to define a parser recursively in terms of itself by giving us a copy of it within the closure's scope.
    2. You might ask, why do we need a recursive parser? It’s because we want to be able to handle nested JSON nodes, such as arrays and objects.
  3. Finally, we call Parser::parse() on the parser returned by the parser() function we just defined. This takes in the input and returns the output and errors encountered during parsing. We then print the returned output and errors.

    Note that since we’re using a Rich error, we’ll get a lot of helpful information about what went wrong for free from chumsky!

Building the parser

Let’s get down to the meat and potatoes, actually building the parser! Parser combinators allow us to declaratively build parsers and combine them to build large parsers.

And that’s it! Our parser is ready.

Create a file called sample.json with some test JSON data, and run the parser like below:

cargo run -- sample.json

You should be able to see the output AST similar to this:

Some(
    Object(
        {
            "batters": Object(
                {
                    "batter": Array(
                        [
                            Object(
                                {
                                    "type": Str(
                                        "Regular",
                                    ),
                                    "id": Str(
                                        "1001",
                                    ),
                                },
                            ),
                            Object(
                                {
                                    "id": Str(
                                        "1002",
                                    ),
                                    "type": Str(
                                        "Chocolate",
                                    ),
                                },
                            ),
                        ],
                    ),
                },
            ),
            "type": Str(
                "donut",
            ),
            "id": Str(
                "0001",
            ),
            "name": Str(
                "Cake",
            ),
            "ppu": Num(
                55.0,
            ),
        },
    ),
)

Conclusion

I hope this article has demystified parser combinators and given you a taste of how powerful they can be. If you’re interested in learning more, I encourage you to check out chumsky, the parser combinator library used in this article, as well as other parser combinator libraries like nom and combine.

That’s all for now, happy learning!