Recursive Descent Parsers in Scala 2: Build Parser Using FastParse Parser Combinators

In the last post, we touched upon all the aspects of building a suitable context-free grammar for a recursive descent parser while building one ourselves for a language of simple arithmetic expressions. In this post we will use that grammar to build a parser library in Scala to validate simple arithmetic expressions. To achieve the same, we will use a library called FastParse that allows us to write recursive descent parsers easily and efficiently.

What are Parser Combinators anyway?

A parser combinator is essentially a higher-order function that accepts other simpler parsers as input and return a parser for more complex rules as its output. For example, if we had a parser helloParser for the string "hello" and a parser worldParser for the string "world", we could essentially build a parser for "helloworld" strings as a function that accepts these parsers as arguments, like so

helloworldParser(helloParser, worldParser)

This is called as Combinatory Parsing technique and it is pretty much analogous to how production rules come together to form a context-free grammar. Infact, the focus of this blog post is going to be to showcase how we can literally one-to-one translate production rules from any grammar into independent parsers using FastParse. For basic understanding on writing parsers using FastParse library, I would recommend Easy Parsing with Parser Combinators
blog post from the author of this library and its API documentation.

Let's first include the following dependency in our SBT project:
"com.lihaoyi" %% "fastparse" % "2.2.2".

To start somewhere, let's create a simple parser for our "helloworld" example.

Here, we start by importing the fastparse library. Following which we define parsers (with the P(...) function) that parses strings, "hello" and "world". We can then combine these parsers using the ~ operator to build a parser that parses string "helloworld" and finally capture the results using ! operator. Calling parse method on fastparse then parses our "helloworld" string.

Translating Context-Free Grammar

E -> TE'
E' -> +TE' | -TE' | ε
T -> FT'
T' -> *FT' | /FT' | ε
F -> Num | (E)

We can now try building parser in FastParse for the grammar of our language of simple arithmetic expressions.

Here, we have created parsers for each production rules from the aforementioned context-free grammar where methods expr, term and factor represents E, T and F nonterminals from the grammar, respectively.

Sequence Operator

Each terminal and nonterminal in the grammar is essentially represented as a parser in FastParse. In order to form strings of terminals and nonterminals, we can simply combine parser using the ~ operator as seen in the "helloworld" example above. We can simplify the code above using this operator like so:

I personally, however, prefer the literal translation because it allows me to easily make changes to the parser as the grammar evolves.

Repeat Operator

Notice that the .rep operator in expr1 and term1 translates to ε in E' and T'. The .rep operator creates a parser that attempts to parse the given parser zero or more times. If you want to parse something a given number of times, you can use .rep(min = 2, max = 4) or the shorter .rep(1) for one or more times, as in the case of num parser.

Cut Operator

The ~/ is called "Cut" operator in FastParse. It is a marker in a recursive descent parser that refrains you from backtracking beyond that point. This improves the quality of error-reporting since the parser won't backtrack to try every other parser method and report error from some other parse method while it should have failed right at the current parser method. We won't be discussing it here in detail as Li Haoyi has already explained this really well in this conference which you should definetly check out.

Capture Operator

Our parser implementation doesn't return any useful value as such. The reason being that we don't really intent to create a parse tree. Our objective here is to solely validate any given input string. This can be seen as P[Unit] return types in every parser definition. However, if for any reason we need to return a value from parser, we could do so with the help of the .! operator.

Here, .! captures the result as a string which can then be converted to an integer using the map operator, as shown above.

CharIn Utility

Beside basic APIs, FastParse also provides common utilities. In our case, we have used CharIn utility to define expr, term and num parsers. It essentially creates a parser that parses literal strings matching regex-style character ranges.

Whitespace Handling

In order to handle whitespace and other non-significant characters with FastParse, we can make use of certain imports from FastParse library out of the box, such as, NoWhitespace._ to disallow any whitespaces, SingleLineWhitespace._ for skipping spaces and tabs on a single line, MultiLineWhitespace._ for skipping newlines, spaces and tabs, etc. Note that these imports effect the ~ and .rep operators.

You can also create custom whitespace handlers, for which you can refer the FastParse API documentation.

End Parser

We have defined our expr parser to end with the End parser so as to force it to consume the entire input string before calling it a success. By default, a parser does not need to consume the entire input. It can succeed early by consuming only a portion of the input.

Conclusion

We have seen how we can use FastParse library to translate content-free grammars for recursive descent parser fairly easily. This post, however, doesn't cover all the other FastParser APIs that you may need to write complex parsers. You may need to refer the API documentation for such cases.