Introduction to PDML

Author: Christian Neumanns

Published: 2021-11-16

Introduction

The Practical Data and Markup Language (PDML) is a text format to store data and markup code.

PDML's design goals are:

  • human-friendly (easy to read and write for people)

  • suitable for:

    • data and markup code

    • small and big, complex data structures

  • a basic syntax that is succinct and simple, and therefore easy to parse/deserialize and serialize

  • unique, powerful extensions

A distinction is made between basic PDML and extensions. Basic PDML is the absolute minimum needed to store data. Extensions are optional features to make PDML more practical.

This document mainly covers basic PDML. However, chapter Extensions contains an overview of extensions, and a link to more information.

Basic Examples

This chapter shows some basic, simple examples to get an idea of what can be done with PDML.

Text Node

Suppose a config file in which parameter color has the value green.

In JSON we would use the following syntax:

"color": "green"

In XML we could use an attribute:

color = "green"

... or an element:

<color>green</color>

In PDML the syntax is:

[color green]

The above code is called a node in PDML.

As can be seen, a node is delimited by [] - a pair of square brackets. A node starts with [, and ends with ].

A node has a name and an optional value. In our example, the name is color and the value is the text green.

A space character is used to separate the name from the value.

Text values can contain spaces, new lines and Unicode characters:

[names
    Tim
    Tom
    Tam
    😃
]

Child Node

Besides text, a node's value can also be another node:

[config [color green]]

The value of node config is another node with name color and value green.

For better readability we can also write:

[config
    [color green]
]

Tree

The node's content can be a list of nodes, and each child node can itself have any number of child nodes:

[config
    [color green]
    [size
        [width 200]
        [height 100]
    ]
]
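
The tree structure above maps naturally onto a small recursive-descent parser. The following Python sketch is not the reference implementation (which is written in Java), and it makes simplifying assumptions that the PDML Specification may decide differently: a name ends at the first whitespace character or bracket, a single whitespace character separates the name from the value, whitespace-only text between child nodes is dropped, and only the escapes \[, \], and \\ are recognized. The names parse_pdml and _parse_node and the (name, children) tuple representation are hypothetical.

```python
def parse_pdml(source):
    """Parse one basic PDML node and return it as (name, children).

    Each child is either text (str) or a nested (name, children) tuple.
    Illustrative sketch only; see the lead-in for the assumptions made.
    """
    node, pos = _parse_node(source, 0)
    if source[pos:].strip():
        raise ValueError(f"unexpected text after root node at position {pos}")
    return node


def _parse_node(src, pos):
    if pos >= len(src) or src[pos] != "[":
        raise ValueError(f"expected '[' at position {pos}")
    pos += 1

    # Assumption: the name runs up to the first whitespace or bracket.
    start = pos
    while pos < len(src) and not src[pos].isspace() and src[pos] not in "[]":
        pos += 1
    name = src[start:pos]
    if not name:
        raise ValueError(f"missing node name at position {start}")

    # Assumption: one whitespace character separates name and value.
    if pos < len(src) and src[pos].isspace():
        pos += 1

    children, text = [], []

    def flush_text():
        chunk = "".join(text)
        text.clear()
        if chunk.strip():  # assumption: drop whitespace-only text
            children.append(chunk)

    while pos < len(src):
        ch = src[pos]
        if ch == "\\":
            # Only the three mandatory escapes are handled here.
            if pos + 1 >= len(src) or src[pos + 1] not in "[]\\":
                raise ValueError(f"invalid escape at position {pos}")
            text.append(src[pos + 1])
            pos += 2
        elif ch == "[":
            flush_text()
            child, pos = _parse_node(src, pos)
            children.append(child)
        elif ch == "]":
            flush_text()
            return (name, children), pos + 1
        else:
            text.append(ch)
            pos += 1
    raise ValueError(f"node '{name}' is not closed")
```

Under these assumptions, parse_pdml("[config [color green]]") returns ('config', [('color', ['green'])]), and an empty node such as [color] yields ('color', []).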

Hence, PDML can be used to store simple or complex tree data that can be structured or unstructured.

Mixed Child Nodes

A node's content can be a mixture of any number of text and child nodes. This makes PDML convenient to store markup code.

Suppose we want to render:

Life is better if we are kind.

In HTML we would write:

<div>Life is <i>better</i> if we are <b>kind</b>.</div>

In PDML this is written as:

[div Life is [i better] if we are [b kind].]

Empty Node

A node can be empty. It has a name, but no content:

[color]

In JSON this would be written as:

"color": null

In XML:

<color></color>

or simply:

<color />

There is not much more to say about PDML's basic syntax.

For a formal and complete definition please refer to the PDML Specification.

Versatility

Despite PDML's utmost simplicity, it can be used to store different kinds of data, such as:

  • configuration files

  • database tables

  • markup code

  • unstructured, heterogeneous, or polymorphic data

Examples are shown in the article PDML Examples.

The PDML syntax is used in the Practical Markup Language (PML), the precursor of PDML (as explained later). For a real-world example of a PDML document you can have a look at the markup code of the PDML specification which is written in PML and uses the PDML syntax.

PDML can be converted to XML, and XML to PDML. Hence, XML technology (which is well supported in many programming languages) can be used with PDML documents. For example you can read a PDML document into an XML DOM and:

  • validate the document with XML Schema

  • query the document with XML Query

  • change the document (add, remove, and modify nodes) and write a modified version back to XML or PDML

  • transform the document with XSLT

Examples of how to do this in Java are shown in the article Open-Source Parser For PracticalXML (pXML).
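
The idea of moving a PDML tree into an XML DOM can also be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration, not the converter described in the article mentioned above. It assumes the PDML document has already been parsed into hypothetical (name, children) tuples, where each child is either a string or another tuple, and that every node name is a valid XML name. The function name to_xml_element is made up for this example.

```python
import xml.etree.ElementTree as ET


def to_xml_element(node):
    """Convert a (name, children) tuple into an ElementTree element."""
    name, children = node
    element = ET.Element(name)
    last_child = None
    for child in children:
        if isinstance(child, str):
            # ElementTree stores mixed content as .text (before the first
            # child element) and .tail (after a child element).
            if last_child is None:
                element.text = (element.text or "") + child
            else:
                last_child.tail = (last_child.tail or "") + child
        else:
            last_child = to_xml_element(child)
            element.append(last_child)
    return element


# Hypothetical parse result of: [div Life is [i better] if we are [b kind].]
div = ("div", ["Life is ", ("i", ["better"]), " if we are ", ("b", ["kind"]), "."])
print(ET.tostring(to_xml_element(div), encoding="unicode"))

# Once the data is in an XML tree, XML tooling applies, e.g. a path query.
config = ("config", [("color", ["green"]),
                     ("size", [("width", ["200"]), ("height", ["100"])])])
print(to_xml_element(config).findtext("size/width"))
```

The first print statement reproduces the HTML line from the chapter Mixed Child Nodes, and the second one prints 200.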

PDML vs XML/JSON/YAML

For a thorough explanation of the rationale behind PDML please read Suggestion For a Better XML/HTML Syntax.

That article compares code examples written in XML, JSON, and YAML and demonstrates that PDML is:

  • less verbose than XML and JSON, but slightly more verbose than YAML

  • suitable for markup code, unlike JSON and YAML

  • suitable for big, complex data structures, unlike YAML

Moreover PDML has a number of unique, practical extensions not found in XML, JSON, or YAML (see next chapter).

Basic PDML (without extensions) is much easier to parse than XML, JSON, or YAML.

Extensions

As seen already, PDML's basic syntax is very simple and succinct, which makes it easy to read and write for humans and machines. Despite its simplicity, basic PDML can store both data and markup code, in small or big documents.

However, this utmost simplicity can cause inconveniences, especially when big documents are read and written by humans. Therefore a PDML implementation can optionally provide pluggable extensions to make it more practical.

The following chapters provide a brief, non-exhaustive overview of some useful extensions. They are a subset of the extensions already implemented in the reference implementation, which is written in Java.

Comments

A comment starts with [- and ends with -]. Comments can be inserted anywhere. They can be nested to any level. Text within comments is ignored.

Example:

This is [- good -] awesome.
[- TODO: explain why -]

[- another comment
    [- nested comment -]
-]
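
Because comments nest, a regular expression alone cannot remove them; a depth counter is needed. The following Python sketch shows one way to do this. The function name strip_comments is hypothetical, and the sketch assumes that [- and -] are never escaped and never appear inside raw text, which a real implementation would have to handle.

```python
def strip_comments(source):
    """Remove [- ... -] comments, including nested ones, from the text."""
    result = []
    depth = 0  # current comment nesting level
    pos = 0
    while pos < len(source):
        pair = source[pos:pos + 2]
        if pair == "[-":
            depth += 1
            pos += 2
        elif pair == "-]" and depth > 0:
            depth -= 1
            pos += 2
        else:
            if depth == 0:  # keep only text outside of comments
                result.append(source[pos])
            pos += 1
    if depth != 0:
        raise ValueError("unclosed comment")
    return "".join(result)
```

Note that the surrounding whitespace is kept: strip_comments("This is [- good -] awesome.") returns the text with two spaces between "is" and "awesome".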

Attributes

PDML attributes are conceptually similar to XML attributes. They are typically used to add metadata to nodes.

For example, the following HTML code uses attributes to identify and style node div:

<div id="my_div" class="my_class">content</div>

In PDML this would be written as follows:

[div (id=my_div class=my_class) content]
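
As a rough illustration of how such an attribute list could be read and turned back into the HTML form, here is a small Python sketch. It handles only the unquoted name=value pairs used in the example above; whether and how PDML supports quoted values or values containing spaces is not covered here. The function names parse_attributes and html_start_tag are hypothetical.

```python
from html import escape


def parse_attributes(text):
    """Parse '(name=value name=value ...)' into a dict (unquoted values only)."""
    inner = text.strip()
    if not (inner.startswith("(") and inner.endswith(")")):
        raise ValueError("attributes must be enclosed in parentheses")
    attributes = {}
    for pair in inner[1:-1].split():
        name, separator, value = pair.partition("=")
        if not separator or not name:
            raise ValueError(f"malformed attribute: {pair!r}")
        attributes[name] = value
    return attributes


def html_start_tag(tag, attributes):
    """Render an HTML start tag with escaped, double-quoted attribute values."""
    rendered = "".join(
        f' {name}="{escape(value)}"' for name, value in attributes.items()
    )
    return f"<{tag}{rendered}>"


attributes = parse_attributes("(id=my_div class=my_class)")
print(html_start_tag("div", attributes))
```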

Character Escape Sequences

Besides the mandatory character escape sequences (\[, \], and \\), the following whitespace and Unicode escape sequences can be used:

Code         Description
\t           TAB character
\r           carriage return
\n           line feed
\uhhhh       Unicode escape (4 hex digits / 16 bits)
\Uhhhhhhhh   Unicode escape (8 hex digits / 32 bits)

For example, this text:

line 1\nline 2 \u0041 \U0001F600

... is parsed as:

line 1
line 2 A 😀
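
The decoding step can be sketched in a few lines of Python. This is an illustration under assumptions, not the reference implementation: the function name unescape is hypothetical, and the sketch treats every escape sequence not listed above as an error, which the PDML Specification may or may not do.

```python
import re

# Single-character escapes: the three mandatory ones plus whitespace escapes.
_SIMPLE = {"[": "[", "]": "]", "\\": "\\", "t": "\t", "r": "\r", "n": "\n"}

# A backslash followed by a Unicode escape, any single character, or the end
# of the text (the latter is reported as an error below).
_ESCAPE = re.compile(r"\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.|\Z)", re.DOTALL)


def unescape(text):
    """Replace PDML character escape sequences by the characters they denote."""
    def replace(match):
        body = match.group(1)
        if not body:
            raise ValueError("dangling backslash at end of text")
        if body[0] in "uU" and len(body) > 1:
            return chr(int(body[1:], 16))  # hex digits -> code point
        if body in _SIMPLE:
            return _SIMPLE[body]
        raise ValueError(f"unknown escape sequence: \\{body}")

    return _ESCAPE.sub(replace, text)


print(unescape(r"line 1\nline 2 \u0041 \U0001F600"))
```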

Parameters

Parameters are used to define recurring text snippets and data structures. This helps to eliminate code duplication and makes PDML documents more maintainable.

A parameter is declared once with a !set node, and its value can then be inserted any number of times with a !get node.

Here is an example of PML markup code that stores the company's website URL into parameter company_URL, and then inserts the URL in subsequent text:

[doc [title Company Overview]
    [!set company_URL=https://www.my_company.org]
    ...
    Our website: [!get company_URL]
    ...
    Click [link url=[!get company_URL]/contacts/index.html text=here] to see a list of contacts.
]

Note: The ! character that precedes the name in nodes set and get denotes a so-called extension node and distinguishes it from normal data nodes. A PDML implementation can provide any number of extension nodes, and support pluggable, customized extensions to cover specific needs.
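
To illustrate the mechanism, here is a deliberately simplified Python sketch that expands !set and !get nodes by plain text substitution. It is not how the reference implementation works; it assumes that a parameter value contains no ] character and no nested nodes, and that a parameter must be set before it is used. The function name expand_parameters is hypothetical.

```python
import re

# Matches either [!set name=value] or [!get name].
_PARAMETER_NODE = re.compile(r"\[!set\s+(\w+)=([^\]]*)\]|\[!get\s+(\w+)\]")


def expand_parameters(source):
    """Remove !set nodes and replace !get nodes by the stored values."""
    parameters = {}

    def replace(match):
        if match.group(1) is not None:  # a !set node: remember the value
            parameters[match.group(1)] = match.group(2).strip()
            return ""
        name = match.group(3)  # a !get node: insert the value
        if name not in parameters:
            raise KeyError(f"parameter '{name}' is used before it is set")
        return parameters[name]

    # re.sub calls replace() for each match from left to right, so a !set
    # node takes effect for all !get nodes that follow it.
    return _PARAMETER_NODE.sub(replace, source)


text = (
    "[!set company_URL=https://www.my_company.org]"
    "Click [link url=[!get company_URL]/contacts/index.html text=here]"
)
print(expand_parameters(text))
```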

Document Splitting

When a PDML document exceeds a certain size, it often makes sense to split it up into different files. For example:

  • each table in a database document is stored in a separate file

  • each chapter in a long article or book is stored in a separate file

Document splitting is done with the !ins-file extension node. Here is an example of markup code that uses a different file for each chapter in an article:

File main.pml:

[doc [title Long Article]
    [!ins-file path=chapters/introduction.pml]
    [!ins-file path=chapters/body.pml]
    [!ins-file path=chapters/conclusion.pml]
]

File chapters/introduction.pml:

[ch [title Introduction]
    text text text
]

File chapters/body.pml:

[ch [title Body]
    text text text
]

File chapters/conclusion.pml:

[ch [title Conclusion]
    text text text
]

Sub-documents can themselves be split further, to any level.

!ins-file nodes are also useful if different documents share common parts, such as a common header/footer used in all articles of a blog.
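
The recursive nature of document splitting can be sketched as follows in Python. The sketch works on plain text, takes a caller-supplied read_file function instead of touching the file system, and looks up each path exactly as written; how the reference implementation resolves relative paths is not covered here. It also rejects circular insertions, which is an assumption about sensible behavior rather than a documented rule. The function name expand_ins_file is hypothetical.

```python
import re

_INS_FILE = re.compile(r"\[!ins-file\s+path=([^\]\s]+)\s*\]")


def expand_ins_file(source, read_file, _active=()):
    """Replace every !ins-file node by the (recursively expanded) file content.

    read_file: function that takes a path and returns the file's text.
    _active:   paths currently being inserted, used to detect cycles.
    """
    def replace(match):
        path = match.group(1)
        if path in _active:
            raise ValueError(f"circular insertion of '{path}'")
        return expand_ins_file(read_file(path), read_file, _active + (path,))

    return _INS_FILE.sub(replace, source)


# In-memory stand-in for files on disk, to keep the example self-contained.
files = {
    "main.pml": "[doc [title Long Article]\n"
                "    [!ins-file path=chapters/introduction.pml]\n"
                "]",
    "chapters/introduction.pml": "[ch [title Introduction]\n"
                                 "    text text text\n"
                                 "]",
}
print(expand_ins_file(files["main.pml"], files.__getitem__))
```

To read real files instead, a function such as lambda path: open(path, encoding="utf-8").read() could be passed as read_file.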

Types (work in progress)

Types are used to validate the content of nodes, and to define how a node is parsed.

For example, node birthdate could be configured to be of type date, which means that the content of node birthdate must be text that represents a valid date in the past, such as:

[birthdate 1879-03-14]
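
Since types are still work in progress, the following Python sketch only illustrates the kind of check such a date type might perform; it says nothing about how types are actually declared or implemented in PDML. It assumes the date is written in ISO form (YYYY-MM-DD). The function name parse_past_date and its today parameter are hypothetical.

```python
from datetime import date


def parse_past_date(text, today=None):
    """Return the date denoted by text, or raise ValueError.

    The text must be a valid calendar date that lies before `today`
    (which defaults to the current date).
    """
    value = date.fromisoformat(text.strip())  # ValueError if not a valid date
    reference = today if today is not None else date.today()
    if value >= reference:
        raise ValueError(f"{value} is not in the past")
    return value


print(parse_past_date("1879-03-14"))
```

A value such as 1879-02-30 is rejected because it is not a valid calendar date.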

Let's look at a real-world use-case of a PDML type in PML. Some PML nodes are designed to contain small or large pieces of raw text. For instance, PML has a node named code to display highlighted source code. Suppose we want to show the following source code in a PML document:

repeat 3 times
    write_line ( "[Hello]" )
.

If we used only basic PDML syntax, and a code node that is itself indented (because it's contained in a parent node), we would need to write:

    [code repeat 3 times
    write_line ( "\[Hello\]" )
.]

This is not very readable. Moreover the characters [ and ] in the source code must be escaped ("\[Hello\]").

A dedicated PDML type associated with node code removes these inconveniences and allows us to write:

    [code
        """
        repeat 3 times
            write_line ( "[Hello]" )
        .
        """
    ]

Note that:

  • The text content of node code is defined between the two """ lines

  • The indent of the first """ defines the indent to be removed in the subsequent source code lines.

  • Characters [ and ] in the source code don't need to be escaped anymore.
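
The indent rule described in the list above can be sketched in Python as follows. This is only an illustration of the rule, not the parsing code used by PML. It assumes the fences are lines that contain nothing but """ and optional whitespace, and that a line with less indent than the opening fence simply loses all of its leading whitespace. The function name extract_raw_text is hypothetical.

```python
def extract_raw_text(block):
    """Return the text between the first and last \"\"\" line, without indent."""
    lines = block.splitlines()
    fence_rows = [i for i, line in enumerate(lines) if line.strip() == '"""']
    if len(fence_rows) < 2:
        raise ValueError('expected an opening and a closing """ line')
    first, last = fence_rows[0], fence_rows[-1]

    # The indent of the opening fence is removed from every content line.
    indent = len(lines[first]) - len(lines[first].lstrip())
    body = []
    for line in lines[first + 1:last]:
        if line[:indent].strip() == "":
            body.append(line[indent:])
        else:
            body.append(line.lstrip())  # assumption: under-indented line
    return "\n".join(body)


block = (
    '    [code\n'
    '        """\n'
    '        repeat 3 times\n'
    '            write_line ( "[Hello]" )\n'
    '        .\n'
    '        """\n'
    '    ]'
)
print(extract_raw_text(block))
```

The brackets in "[Hello]" pass through unchanged because the text between the fences is never interpreted as PDML.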

BTW, if we want to highlight source code, we can use attribute lang of PML's code node:

    [code (lang=Java)
        """
        for (int i=1; i <= 3; i++) {
            System.out.println ( "[Hello]" );
        }
        """
    ]

Finally, PML supports an alternative syntax (without the """ fences):

    [code (lang=Java)
        for (int i=1; i <= 3; i++) {
            System.out.println ( "[Hello]" );
        }
    code]

This syntax variant is also realized through a PDML type.

A PDML implementation can provide a standard set of frequently used types (string, number, boolean, date, time, etc.). To maximize flexibility and customization for different domains, additional types can be added programmatically or by configuration data that can be included in the PDML document, or provided in an external (possibly shared) PDML document.

History of PDML

In 2018 I created the Practical Markup Language (PML) to solve problems I encountered with existing markup languages (Markdown, Asciidoctor, HTML, Docbook, etc.). In March 2019 I published We Need a New Document Markup Language - Here is Why to illustrate the existing problems, and to show how they are solved in PML.

Besides being suitable for markup code, the PML syntax could also be used to store data. In March 2021 I therefore published Suggestion For a Better XML/HTML Syntax (also published on CodeProject). The new syntax was called practicalXML (pXML), because it was more succinct than XML, but conceptually similar to it. Moreover, pXML could be converted to XML, and vice versa. Everything was published and documented at the (now obsolete) pXML website.

In October 2021 pXML was renamed to PDML. The reason was that pXML needed a lot of improvements (extensions) to make it suitable for PML (e.g. parameterized text, document splitting, raw text sections, etc.). In the end, pXML was more than just an alternative syntax for XML. It had pluggable and configurable types and extensions, as well as other features not available in XML. Thus the name was changed from practicalXML (pXML) to Practical Data and Markup Language (PDML), and everything was published on a new website.

In a nutshell: PDML originated in PML, and was temporarily called pXML.