Skip to content

Latest commit

 

History

History
119 lines (89 loc) · 5.24 KB

README.adoc

File metadata and controls

119 lines (89 loc) · 5.24 KB

AsciiDoc Parsing Lab

This project is a laboratory for researching and developing a formal grammar (i.e., grammar formalism) for the AsciiDoc Language. The source code in this repository is highly experimental and incomplete. It should not be considered a standalone AsciiDoc processor implementation. Rather, it’s meant to serve as a reference for proofs. The grammars and grammar-related helpers formulated in this repository will be contributed to the AsciiDoc Language Specification and TCK.

What’s this about?

The development of the AsciiDoc Language Specification is now in full swing. The center piece of this specification is the normative definition of the AsciiDoc Language.

Up until now, the AsciiDoc syntax rules have only been informally described through pre-spec implementation code (Asciidoctor) and user-oriented documentation. Developing a specification for the AsciiDoc Language necessitates formalizing the syntax into a grammar.

A formal grammar describes the sequences of characters (i.e., markup) that are valid according to the syntax using a set of rules. Establishing a formal grammar is a major step forward for the AsciiDoc Language and its specification. It will help root out well-known inconsistencies, ambiguities, and idiosyncrasies. However, bridging this gap while retaining reasonable compatibility is a major challenge of the specification that requires substantial and open-ended research, hence the need for this project.

This lab is primarily focused on exploring a PEG grammar for AsciiDoc.

How do I run it?

This repository is structured as a Node.js project with a Mocha test suite. Thus, in order to work with it, you first need to have Node.js installed.

The best way to install Node.js is to use nvm (Node Version Manager).

$ nvm install 20

Once Node.js 20 is installed, switch to it:

$ nvm use

Next, install the dependencies of the project using npm:

$ npm i

At the core of this repository is a collection of parsers for AsciiDoc. The parsers are generated from grammar files located in the grammar folder. The grammars, which end in .pegjs, are written for peggy. peggy is the parser generator used by this project to the generate the parsers.

The code in this repository is intended to be run by way of the test suite. But in order to run the tests, you first need to use the npm script gen to generate the parsers.

$ npm run gen

Now you can run the tests using the npm test script:

$ npm t

Most of the tests are data-driven. These tests are located in the test/tests folder. Each test consists of at least an input file (ending in -input.adoc) and an output file (ending in -output.json). The input file (i.e., the test file) is an AsciiDoc file. The output file is the expected ASG that should be produced from it by a compliance AsciiDoc processor. Some tests also have a configuration file that ends in -config.json).

These data-driven tests are a blueprint of the tests that will be included in the AsciiDoc TCK. Once the tests are contributed to the AsciiDoc TCK, this test suite will be updated to use the AsciiDoc TCK directly.

Unmapped syntax

Below is a rough list of the AsciiDoc syntax that has not yet been mapped, or not mapped fully, in the grammar. There’s likely more syntax and edge cases to cover, so this is only what we know to be outstanding.

  • open block

  • quote/verse block

  • pass/stem block

  • thematic and page breaks

  • intrinsic document attributes

  • attribute entry with multiline value

  • attribute entry with value enclosed in pass macro

  • revision info line

  • tables

  • ifeval directive

  • variable-length delimiter lines and block nesting in preprocessor

  • line and block comments (likely in preprocessor)

  • inline replacements

  • backslash escaping for blocks

  • more thorough backslash escaping for inlines (need to consider all ASCII symbols)

  • lines, tags, and leveloffset attributes on include directive

  • optional option on include directive

  • automatic ID generation for headings

  • anchor on dlist term

  • anchor on list item

  • inline double and single quotes

  • inline image macro

  • inline footnote macro

  • inline UI macros (btn, kbd, menu)

  • inline xref macro

  • inline index term macro and shorthands

  • block video and audio macros

  • block toc macro

  • callouts in verbatim blocks

  • use Unicode Alpha property when matching letters in block grammar

  • …​

In general, the line preprocessor is woefully incomplete. We’ve been focusing on sorting out the block grammar first. Then we’ll go back and map it correctly in the line preprocessor (or fold the two grammars into one).

Copyright © 2023-present Dan Allen and Sarah White (OpenDevise Inc.) and the individual contributors to this project.

Use of this software is granted under the terms of the Eclipse Public License v 2.0 (EPL-2.0) License.

Trademarks

AsciiDoc® and AsciiDoc Language™ are trademarks of the Eclipse Foundation, Inc.