
Parsek is a parser library for Kotlin consisting of a tokenizer (Lexer and Scanner) and a configurable expression parser. It supports data formats such as JSON and CSV as well as custom languages, offering configurable expression parsing and unlimited dynamic lookahead.
Tokenization is the process of splitting the input into a stream of tokens that is consumed by a parser. In Parsek, this task is split between two classes, Lexer and Scanner.
The lexer (source, kdoc) is basically an iterator for a stream of tokens that is generated by splitting the input using regular expressions.
Regular expressions are mapped to token types using a function which typically just returns a fixed token type inline. The function can be used to implement a second layer of mapping, but this should be fairly uncommon. Input mapped to null (typically whitespace) will not be reported.
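The regex-to-token-type mapping described above can be sketched as follows. This is a standalone, illustrative sketch of the idea, not Parsek's actual API; all names here are invented for the example.

```kotlin
// Illustrative sketch of regex-based tokenization -- the names below are
// invented for this example and are not Parsek's actual API.
enum class TokenType { NUMBER, IDENTIFIER, SYMBOL }

data class Token(val type: TokenType, val text: String, val position: Int)

// Each rule maps a regular expression to a function producing a token type.
// Returning null (here: for whitespace) suppresses the token.
val rules: List<Pair<Regex, (String) -> TokenType?>> = listOf(
    Regex("\\s+") to { _ -> null },                  // whitespace: not reported
    Regex("[0-9]+") to { _ -> TokenType.NUMBER },
    Regex("[A-Za-z_][A-Za-z0-9_]*") to { _ -> TokenType.IDENTIFIER },
    Regex("[+\\-*/()]") to { _ -> TokenType.SYMBOL },
)

fun tokenize(input: String): List<Token> {
    val tokens = mutableListOf<Token>()
    var pos = 0
    while (pos < input.length) {
        var matched = false
        for ((regex, toType) in rules) {
            val match = regex.matchAt(input, pos) ?: continue
            toType(match.value)?.let { tokens.add(Token(it, match.value, pos)) }
            pos = match.range.last + 1
            matched = true
            break
        }
        check(matched) { "Unrecognized input at position $pos" }
    }
    return tokens
}
```

Tokenizing `"foo + 42"` with these rules yields an IDENTIFIER, a SYMBOL, and a NUMBER token; the two whitespace runs are matched but suppressed because their mapping function returns null.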
The lexer is usually not used directly; instead, it's handed in to the Scanner, which in turn is used by the parser.
The reason for the Lexer/Scanner split is to separate "raw" parsing from providing a nice and convenient API. The small API surface of the Lexer allows us to easily install additional processing between the Lexer and Scanner, for instance for context-sensitive newline filtering.
Typically, the Lexer is constructed directly inline where the Scanner is constructed.
The token class (source, kdoc) stores the token type (typically a user-defined enum), the token text and the token position. Token instances are generated by the Lexer.
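As a rough sketch, a token carrying these three pieces of information could look like the following; the field names and the generic type parameter are assumed for illustration, not Parsek's exact declaration.

```kotlin
// Illustrative sketch of a token -- names assumed for this example,
// not Parsek's exact declaration.
enum class MyTokenType { NUMBER, IDENTIFIER, SYMBOL }  // user-defined token types

data class Token<T>(
    val type: T,        // typically a constant of a user-defined enum
    val text: String,   // the matched input text
    val line: Int,      // position information, e.g. for error reporting
    val column: Int,
)

val token = Token(MyTokenType.NUMBER, "42", line = 1, column = 5)
```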
The RegularExpressions object (source, kdoc) contains a set of useful regular expressions for source code and data format tokenization.
The Scanner class (source, kdoc) provides a simple API for convenient access to the token stream generated by the Lexer.
The scanner provides a notion of a "current" token that can be inspected multiple times, as opposed to iterator.next(), where the current token is "gone" after the call. This makes it easy to hand the scanner with the current token down in a recursive descent parser until it is consumed and processed by the corresponding handler.
It provides unlimited dynamic lookahead.
It provides a tryConsume() convenience method that checks the current token against a given token text and, if it matches, consumes the token and returns true.
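The three ideas above (a re-inspectable current token, dynamic lookahead, and tryConsume) can be sketched with a simplified standalone model; this is not Parsek's actual Scanner API, just an illustration of the pattern over a list of token strings.

```kotlin
// Simplified standalone model of the "current token" / lookahead / tryConsume
// pattern -- illustrative only, not Parsek's actual Scanner API.
class SimpleScanner(private val tokens: List<String>) {
    private var pos = 0

    // The current token can be inspected any number of times without consuming it.
    val current: String get() = if (pos < tokens.size) tokens[pos] else "<eof>"

    // Unlimited dynamic lookahead: peek n tokens ahead without consuming.
    fun lookAhead(n: Int): String =
        if (pos + n < tokens.size) tokens[pos + n] else "<eof>"

    // Consume the current token only if it matches [text]; report success.
    fun tryConsume(text: String): Boolean =
        (current == text).also { if (it) pos++ }
}
```

A caller can branch on tryConsume() without separate peek/advance steps, e.g. `if (scanner.tryConsume("(")) { ... }`.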
Typical use cases that only need a scanner and no expression parser are data formats such as JSON or CSV.
For a simple example, please refer to the JSON parser example.
The configurable expression parser (source, kdoc) operates on a tokenizer, is stateless, and should be shared / reused.
A simple example that evaluates mathematical expressions directly (as opposed to building an explicit parse tree) can be found in the tests.
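Direct evaluation without an explicit parse tree can be sketched as follows, using precedence climbing over a token list. This is a self-contained illustration of the technique, not Parsek's ExpressionParser API; it handles only integers and the four basic operators, without whitespace.

```kotlin
// Standalone sketch of direct expression evaluation (no explicit parse tree)
// via precedence climbing -- illustrative only, not Parsek's actual API.
class Evaluator(expr: String) {
    // Minimal tokenizer: integers and single-character operators/parens.
    private val tokens = Regex("[0-9]+|[+*/()-]").findAll(expr).map { it.value }.toList()
    private var pos = 0
    private fun peek() = if (pos < tokens.size) tokens[pos] else ""
    private fun next() = tokens[pos++]

    private val precedence = mapOf("+" to 1, "-" to 1, "*" to 2, "/" to 2)

    fun evaluate(): Double = parseExpression(0)

    // Consume operators at or above [minPrecedence], recursing with a higher
    // minimum for the right operand so that * and / bind tighter than + and -.
    private fun parseExpression(minPrecedence: Int): Double {
        var left = parsePrimary()
        while (precedence[peek()]?.let { it >= minPrecedence } == true) {
            val op = next()
            val right = parseExpression(precedence.getValue(op) + 1)
            left = when (op) {
                "+" -> left + right
                "-" -> left - right
                "*" -> left * right
                else -> left / right
            }
        }
        return left
    }

    private fun parsePrimary(): Double =
        if (peek() == "(") { next(); parseExpression(0).also { next() /* ")" */ } }
        else next().toDouble()
}
```

For example, `Evaluator("1+2*3").evaluate()` yields 7.0, while `Evaluator("(1+2)*3").evaluate()` yields 9.0.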
A complete PL/0 parser is included in the examples module to illustrate how to use the expression parser and tokenizer for a simple but computationally complete language: Parser.kt, Pl0Test.kt
A parser for mathematical expressions: ExpressionParser.kt, ExpressionsTest.kt
A simple example for using the scanner and expression parser to implement a simple indentation-based programming language: mython, MythonTest.kt
A BASIC interpreter using Parsek: https://github.com/stefanhaustein/basik