Core PDML Specification

Version: 1.0.1
Published: 2021-12-03
License: CC BY-ND 4.0
Website: https://pdml-lang.dev/
Author: Christian Neumanns

Introduction

The Practical Data and Markup Language (PDML) is a text format to store data.

A distinction is made between Core PDML and PDML Extensions. Core PDML is the minimum needed to store data. Extensions are optional features to make PDML more practical.

This document is the official specification for Core PDML.

Document Structure

A PDML document is a tree of nodes.

The syntax for a node is defined as follows (in EBNF):

"[" name ( separator ? child_node + ) ? "]"

Node

A node is enclosed by a pair of square brackets: [...]. A node starts with [ and ends with ].

Each document has exactly one root node.

Name

Each node has a name.

A node name must match the regex [a-zA-Z_][a-zA-Z0-9_\.-]*. This means that a name starts with a letter or an underscore (_), optionally followed by any number of letters, digits, underscores (_), hyphens (-), or dots (.).

Here are some examples of valid node names:

color
Index_12
_ID_12.5-a

A node name does not need to be unique. Different nodes in a tree can have the same name.
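The name rule can be checked mechanically. The following is a minimal Python sketch, not part of the specification: the function name is_valid_node_name is hypothetical, and only the regular expression is taken from the text above.

```python
import re

# Pattern copied from the specification text. fullmatch() requires the
# whole string to match, so no explicit ^ and $ anchors are needed.
NAME_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_\.-]*")


def is_valid_node_name(name: str) -> bool:
    """Return True if `name` satisfies the Core PDML node name rule."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # The first three candidates are the examples given in the specification.
    for candidate in ["color", "Index_12", "_ID_12.5-a", "12abc", "-x", "a b", ""]:
        print(repr(candidate), is_valid_node_name(candidate))
```

Names that start with a digit or a hyphen, contain a space, or are empty are rejected by this check.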

Separator

The separator separates the node's name from its content.

The separator is a single whitespace character. The following whitespace characters are allowed:

Name              C-style syntax   Unicode
Space             " "              U+0020
Tab               "\t"             U+0009
Unix new line     "\n"             U+000A
Windows new line  "\r\n"           U+000D U+000A

The separator is required if the first child node is text. Example:

[color green]

The separator is optional if the first child node is a node. Hence this code:

[b [i huge]]

... can also be written as:

[b[i huge]]

Child Node

A node can optionally have any number of child nodes.

A child node can be text (a sequence of Unicode characters) or another node (with optional child nodes too).

Examples:

  • Node with one text child:

    [color light green]

    The node's name is color. The node's single child node is the text light green.

  • Node with child node:

    [config [color light green]]

    The node config has one child node. The child node's name is color, its text is light green.

  • Tree of nodes:

    [config
        [color light green]
        [size
            [width 200]
            [height 100]
        ]
    ]

  • Node containing a mixture of text and nodes (markup code):

    [p We can write words in [i italic], [b bold], or [b[i bold and italic]].]
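To illustrate how the structure described so far fits together, here is a rough recursive-descent parser sketch in Python. It is an illustration under stated assumptions, not a reference implementation: the names Node, PdmlError, and parse are hypothetical, Windows new lines are normalized to "\n" before parsing, and a node such as [a ] (separator but no child) is accepted as an empty node.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_\.-]*")


@dataclass
class Node:
    name: str
    children: list[str | Node] = field(default_factory=list)


class PdmlError(ValueError):
    """Raised when the input is not a well-formed document (hypothetical name)."""


def parse(source: str) -> Node:
    """Parse one root node, optionally surrounded by whitespace."""
    text = source.replace("\r\n", "\n")  # assumption: normalize new lines up front
    pos = _skip_whitespace(text, 0)
    root, pos = _parse_node(text, pos)
    pos = _skip_whitespace(text, pos)
    if pos != len(text):
        raise PdmlError(f"unexpected character at offset {pos}")
    return root


def _skip_whitespace(text: str, pos: int) -> int:
    while pos < len(text) and text[pos] in " \t\r\n":
        pos += 1
    return pos


def _parse_node(text: str, pos: int) -> tuple[Node, int]:
    if pos >= len(text) or text[pos] != "[":
        raise PdmlError(f"expected '[' at offset {pos}")
    match = NAME.match(text, pos + 1)
    if match is None:
        raise PdmlError(f"expected a node name at offset {pos + 1}")
    node = Node(match.group())
    pos = match.end()
    # The separator is exactly one whitespace character; any further
    # whitespace belongs to the first text child.
    had_separator = pos < len(text) and text[pos] in " \t\n"
    if had_separator:
        pos += 1
    while pos < len(text) and text[pos] != "]":
        if text[pos] == "[":
            child, pos = _parse_node(text, pos)
        else:
            if not node.children and not had_separator:
                raise PdmlError(f"separator required before text at offset {pos}")
            child, pos = _parse_text(text, pos)
        node.children.append(child)
    if pos >= len(text):
        raise PdmlError(f"missing ']' for node '{node.name}'")
    return node, pos + 1  # skip the closing bracket


def _parse_text(text: str, pos: int) -> tuple[str, int]:
    chars = []
    while pos < len(text) and text[pos] not in "[]":
        if text[pos] == "\\":  # escape sequences: \[  \]  \\
            pos += 1
            if pos >= len(text) or text[pos] not in "[]\\":
                raise PdmlError(f"invalid escape sequence at offset {pos}")
        chars.append(text[pos])
        pos += 1
    return "".join(chars), pos


if __name__ == "__main__":
    print(parse("[config [color light green]]"))
    print(parse("[b[i huge]]"))
```

Because whitespace is preserved, parsing the indented tree example with this sketch yields whitespace-only text children between the nested nodes; an application would apply its own whitespace policy afterwards.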

Empty Node

If a node has no child nodes, it is called an empty node.

Example:

[new_line]

Escape Characters

As seen already, [ and ] are used as node delimiters. Therefore these two characters must be escaped when they are used in text nodes.

A backslash (\) is used as the escape character (as in C-like programming languages). Therefore the backslash must itself be escaped too.

The final rule is simple: Characters [, ], and \ must be preceded by \ when they are used in text nodes, as shown in the following table:

Character  Escape sequence
[          \[
]          \]
\          \\

Example:

Suppose node foo contains the text: Characters [, ], and \ must be escaped.

This would be written as:

[foo Characters \[, \], and \\ must be escaped.]
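The escaping rule can be sketched as a pair of Python helpers. The function names escape_text and unescape_text are hypothetical and chosen for illustration only; the three escape sequences are the ones listed in the table above.

```python
def escape_text(text: str) -> str:
    """Escape plain text so it can be written into a text node."""
    # The backslash is handled first; otherwise the backslashes added
    # for [ and ] would themselves be escaped a second time.
    return text.replace("\\", "\\\\").replace("[", "\\[").replace("]", "\\]")


def unescape_text(escaped: str) -> str:
    """Reverse escape_text(). Unescaped brackets are not checked in this sketch."""
    result = []
    chars = iter(escaped)
    for ch in chars:
        if ch == "\\":
            ch = next(chars, None)  # the character being escaped
            if ch is None or ch not in "[]\\":
                raise ValueError("invalid escape sequence")
        result.append(ch)
    return "".join(result)


if __name__ == "__main__":
    plain = "Characters [, ], and \\ must be escaped."
    print("[foo " + escape_text(plain) + "]")
```

Applied to the example text above, escape_text produces the escaped form shown in the foo node, and unescape_text restores the original text.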

Whitespace

The following whitespace characters before or after the root node are ignored:

Name             C-style syntax   Unicode
Space            ' '              U+0020
Tab              '\t'             U+0009
Carriage return  '\r'             U+000D
Line feed        '\n'             U+000A

Other characters before or after the root node are illegal.

Within a PDML document, there are no whitespace handling rules defined in Core PDML. Whitespace is preserved when a PDML document is parsed.

Consider the following PDML snippet:

[a  foo   [b]
    2 [c] [d]
]

In this example, node a contains 7 child nodes:

  • text {space}foo{space}{space}{space}

  • empty node b

  • text {new line}{space}{space}{space}{space}2{space}

  • empty node c

  • text {space}

  • empty node d

  • text {new line}

Applications reading PDML documents (or customized PDML parsers) are free to implement any appropriate whitespace handling rules, such as:

  • skip whitespace nodes

  • trim leading and/or trailing whitespace in text nodes

  • replace whitespace sequences with a single space (similar to HTML)
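The three policies listed above could be applied as a post-processing step on a node's child list. The Python sketch below assumes that text children are plain strings and uses a hypothetical EmptyNode placeholder for node children; none of these names are defined by Core PDML.

```python
import re
from typing import NamedTuple

WHITESPACE = " \t\r\n"


class EmptyNode(NamedTuple):
    """Hypothetical stand-in for a parsed child node; it is passed through unchanged."""
    name: str


def skip_whitespace_nodes(children: list) -> list:
    """Drop text children that consist of whitespace only."""
    return [c for c in children
            if not (isinstance(c, str) and c.strip(WHITESPACE) == "")]


def trim_text_nodes(children: list) -> list:
    """Trim leading and trailing whitespace in every text child."""
    return [c.strip(WHITESPACE) if isinstance(c, str) else c for c in children]


def collapse_whitespace(children: list) -> list:
    """Replace each whitespace sequence with a single space (similar to HTML)."""
    return [re.sub(r"[ \t\r\n]+", " ", c) if isinstance(c, str) else c
            for c in children]


if __name__ == "__main__":
    # The 7 children of node a from the snippet above.
    children = [" foo   ", EmptyNode("b"), "\n    2 ", EmptyNode("c"),
                " ", EmptyNode("d"), "\n"]
    print(skip_whitespace_nodes(children))
    print(trim_text_nodes(children))
    print(collapse_whitespace(children))
```

Each function returns a new list, so an application can combine the policies in whichever order suits it, for example trimming first and then skipping the text children that became empty.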

New Lines

New lines are defined differently in Unix/Linux and Windows. Unix uses a single line feed ("\n"). Windows uses a carriage return, followed by a line feed ("\r\n").

The following rules are applied in PDML:

  • Reading Rule

    When a PDML document is read, Unix and Windows new lines are both supported, whether the application runs on Unix or Windows, even if a single document uses a mixture of Unix/Windows new lines.

    For example, a parser treats "\r\n" as one new line, exactly as it treats "\n".

  • Writing Rule

    When a PDML document is written, the operating system's canonical new line is used.

    For example, a writer running on Unix writes "\n". On Windows it writes "\r\n".
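One possible way to follow both rules is to normalize new lines when reading and to convert them to the platform convention when writing. The Python sketch below is an assumption about how an implementation might do this; the function names are hypothetical, and the new_line parameter exists only so the behavior can be tried for either platform.

```python
import os


def normalize_new_lines(text: str) -> str:
    """Reading rule: treat Windows new lines like Unix new lines."""
    return text.replace("\r\n", "\n")


def to_platform_new_lines(text: str, new_line: str = os.linesep) -> str:
    """Writing rule: emit the operating system's canonical new line."""
    # Normalizing first avoids turning an existing "\r\n" into "\r\r\n".
    return normalize_new_lines(text).replace("\n", new_line)


if __name__ == "__main__":
    mixed = "[config\r\n    [color green]\n]"
    print(repr(normalize_new_lines(mixed)))
    print(repr(to_platform_new_lines(mixed, "\r\n")))
```

In Python, opening a file in text mode with the default newline setting performs a similar translation automatically, although it also converts a lone "\r" on input, which Core PDML does not list as a new line.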

Encoding

PDML documents are encoded in UTF-8.

Grammar

The grammar is defined in separate documents, in two variations: an EBNF grammar and railroad diagrams.

Note

This document is the only official specification for Core PDML.

The EBNF grammar and the railroad diagrams are just auxiliary assets to help readers better contextualize the specification.

Examples

More examples of PDML code can be found in PDML Examples.

License

This specification is licensed under CC BY-ND 4.0.

Permission is granted to create verbatim translations of this specification into other human languages.

Versioning

This specification uses Semantic Versioning.

Website

PDML's website is https://pdml-lang.dev/.

Markup Code

This document is written in PML which uses the PDML syntax.

The markup code is available on GitHub.

Pull requests are welcome.