FMT - Education

author:	Frank Groeneveld
title:	Benchmarking and Optimisation of Engage!-based XML Parsers
keywords:	parsing, generative techniques, .NET Core
topics:	Algorithms and Data Structures , Case studies and Applications , Software Technology
committee:	Vadim Zaytsev
started:	November 2020
end:	January 2021

Description

There are dozens of various parsing techniques in existence, but all of them can be cleanly classified into two disjunct categories: tree-based and event-based ones. Most parsing techniques are tree-based, they use some restrictions about the input (specified in a form of a grammar, a regular expression, or any other kind of constraint) to either refuse the input as incorrect, or to build a consistent data structure from it (traditionally, a tree, hence the name, but in general some kind of object). Event-based techniques, also called reactive parsing techniques, work differently: they process the input as a stream, not as a fixed length string, and react in some way to the events (such as tokens) of the input. The easiest example is XML parsing, for which there is DOM, which is tree-based: you run it once, it must consume the entire input, and you get the entire tree back; and there is SAX, to which you keep addressing queries like "opening of the next tag

". Usually it is assumed that tree-based parsing techniques are easier to understand, and faster for small pieces of data; event-based techniques are much better for huge files — looking for a particular line in a multi-gigabyte file is blazingly fast when you do not have to store its entire contents in memory.

Last year Engage! was written, as the first attempt to formalise the process of event-based parsing, up to the point of providing a language to write event-based specifications (like people can write tree-based specifications in ANTLR, SDF, Rascal, bison, yacc, etc). It was published with a small case study that showed that the generated event-based parser had comparable or better performance to a generated tree-based PEG parser.

This project will serve as a continuation of that work. In particular:

Engage! has a tokeniser that consumes the entire input before producing a string of tokens — it can be easily rewritten to work on streams, and then tested differentially with the old implementation, demonstrating exactly the same behaviour with better performance for larger inputs.
Engage! is the only event-based parser generator, but there are many popular manually written event-based parsers. By reimplementing the features of one of them — say, SAX, differences between generator-produced and handmade parsers can be investigated in detail. Will the handmade parser consistently outperform the generated one? Will there be a class of tasks where it will be otherwise? Will there be a class of tasks where the difference will be especially drastic? Can we optimise the generator to lessen the gap?
Doing a substantial case study may demand, or at least provide ideas for, extensions of the original language of Engage!

Engage! is written in C# and also generates C# code. C# is a general purpose language similar to Java (but newer and slightly nicer), it works on all platforms on top of .NET Core, and has great IDE support in the form of Rider or Visual Studio.