Recursive Descent Parsers in Scala 2: Build Parser Using FastParse Parser Combinators
In the last post, we touched upon all the aspects of building a suitable context-free grammar for a recursive descent parser while building one ourselves for a language of simple arithmetic expressions. In this post we will use that grammar to build a parser library in Scala to validate simple arithmetic expressions. To achieve the same, we will use a library called FastParse that allows us to write recursive descent parsers easily and efficiently.
What are Parser Combinators anyway?
A parser combinator is essentially a higher-order function that accepts other simpler parsers as input and return a parser for more complex rules as its output. For example, if we had a parser helloParser
for the string "hello" and a parser worldParser
for the string "world", we could essentially build a parser for "helloworld" strings as a function that accepts these parsers as arguments, like so
helloworldParser(helloParser, worldParser)
This is called as Combinatory Parsing technique and it is pretty much analogous to how production rules come together to form a context-free grammar. Infact, the focus of this blog post is going to be to showcase how we can literally one-to-one translate production rules from any grammar into independent parsers using FastParse. For basic understanding on writing parsers using FastParse library, I would recommend Easy Parsing with Parser Combinators
blog post from the author of this library and its API documentation.
Let's first include the following dependency in our SBT project:
"com.lihaoyi" %% "fastparse" % "2.2.2"
.
To start somewhere, let's create a simple parser for our "helloworld" example.
Here, we start by importing the fastparse library. Following which we define parsers (with the P(...) function) that parses strings, "hello" and "world". We can then combine these parsers using the ~
operator to build a parser that parses string "helloworld" and finally capture the results using !
operator. Calling parse
method on fastparse
then parses our "helloworld" string.
Translating Context-Free Grammar
E -> TE'
E' -> +TE' | -TE' | ε
T -> FT'
T' -> *FT' | /FT' | ε
F -> Num | (E)
We can now try building parser in FastParse for the grammar of our language of simple arithmetic expressions.
Here, we have created parsers for each production rules from the aforementioned context-free grammar where methods expr
, term
and factor
represents E
, T
and F
nonterminals from the grammar, respectively.
Sequence Operator
Each terminal and nonterminal in the grammar is essentially represented as a parser in FastParse. In order to form strings of terminals and nonterminals, we can simply combine parser using the ~
operator as seen in the "helloworld" example above. We can simplify the code above using this operator like so:
I personally, however, prefer the literal translation because it allows me to easily make changes to the parser as the grammar evolves.
Repeat Operator
Notice that the .rep
operator in expr1
and term1
translates to ε
in E'
and T'
. The .rep
operator creates a parser that attempts to parse the given parser zero or more times. If you want to parse something a given number of times, you can use .rep(min = 2, max = 4) or the shorter .rep(1) for one or more times, as in the case of num
parser.
Cut Operator
The ~/
is called "Cut" operator in FastParse. It is a marker in a recursive descent parser that refrains you from backtracking beyond that point. This improves the quality of error-reporting since the parser won't backtrack to try every other parser method and report error from some other parse method while it should have failed right at the current parser method. We won't be discussing it here in detail as Li Haoyi has already explained this really well in this conference which you should definetly check out.
Capture Operator
Our parser implementation doesn't return any useful value as such. The reason being that we don't really intent to create a parse tree. Our objective here is to solely validate any given input string. This can be seen as P[Unit]
return types in every parser definition. However, if for any reason we need to return a value from parser, we could do so with the help of the .!
operator.
Here, .!
captures the result as a string which can then be converted to an integer using the map
operator, as shown above.
CharIn Utility
Beside basic APIs, FastParse also provides common utilities. In our case, we have used CharIn
utility to define expr
, term
and num
parsers. It essentially creates a parser that parses literal strings matching regex-style character ranges.
Whitespace Handling
In order to handle whitespace and other non-significant characters with FastParse, we can make use of certain imports from FastParse library out of the box, such as, NoWhitespace._
to disallow any whitespaces, SingleLineWhitespace._
for skipping spaces and tabs on a single line, MultiLineWhitespace._
for skipping newlines, spaces and tabs, etc. Note that these imports effect the ~
and .rep
operators.
You can also create custom whitespace handlers, for which you can refer the FastParse API documentation.
End Parser
We have defined our expr
parser to end with the End
parser so as to force it to consume the entire input string before calling it a success. By default, a parser does not need to consume the entire input. It can succeed early by consuming only a portion of the input.
Conclusion
We have seen how we can use FastParse library to translate content-free grammars for recursive descent parser fairly easily. This post, however, doesn't cover all the other FastParser APIs that you may need to write complex parsers. You may need to refer the API documentation for such cases.