An Invisible XML modularity proposal

Norm Tovey-Walsh

Invisible XML (iXML) is a powerful tool for parsing implicit structures to reveal explicit structures. But at the moment, every iXML grammar is an island, entire of itself: all of the rules must be contained in a single file.

Reuse is a well established software engineering practice. As grammars proliferate, some mechanism for reuse is desirable. If grammars exist for parsing ISBNs or URIs or ISO 8601 dates and times, it would be useful to be able to reuse those grammars in our own grammars, without copying all of the rules. Having multiple copies of the same rules is a recipe for introducing errors.

In a complex system that supports reuse, it’s also valuable to be able to distinguish the parts of a system that are the “public interface”, which users can confidently expect to function consistently, even if the grammar is updated, from the parts that are implementation details, which the implementor may choose to refactor or change without affecting the behavior of the public interface.

An iXML grammar defines a set of rules. In this proposal for grammar reuse, the distinction between public interface and implementation details will, in practice, often boil down to the set of rule names (and semantics) that are expected to be unchanging. A grammar for ISO 8601 date times might, for example, declare that date and dateTime are what the grammar is designed to parse. It must contain other rules for parsing years, months, days, delimiters, etc., but those are implementation details. Authors of sharable grammars can, according to this proposal, specify that only date and dateTime are “safe” to import, that they will continue to function even if other aspects of the grammar change.

It was Michael Sperberg-McQueen who initially proposed that grammar modularity might be territory already mapped out by RELAX NG. RELAX NG is a grammar-based validation technology defined in terms of a set of rules (which RELAX NG calls patterns) operating over the names of, and relationships between, XML elements, attributes, and content.

RELAX NG has reuse mechanisms that have proven to be valuable and remarkably effective in practice. A thorough understanding of RELAX NG is not required to understand this proposal, it suffices to say that at a high level, one RELAX NG grammar can include another and, in doing so, replace or extend patterns.

If we take the capabilities of RELAX NG as a good starting point, we can formulate an initial set of requirements.

  1. A grammar author must be able to include rules from one or more other grammars.
  2. A grammar author must be able to define a public interface which specifies the nonterminals they expect to share.
  3. It should be convenient to include all of the rules that are in the public interface and no others.
  4. It should be possible to include any rule from a grammar. Ultimately, reuse should be in the hands of the author of the including grammar.
  5. When combining grammars, simply including a nonterminal from another grammar does not change its definition.
  6. It must, however, also be possible to redefine any nonterminal in the included grammar.
  7. It may happen that two different grammars define a nonterminal with the same name and that an author wishes to include both of them. It must be possible to disambiguate their names. (It must be possible to include the “digit” nonterminal from bodyparts.ixml and the “digit” nonterminal from numbers.ixml into the same including grammar.)

Some terminology:

defined

A nonterminal is defined by a single, specific rule.

included

If a grammar a.ixml includes rules from another grammar, b.ixml, we say that b.ixml is the included grammar.

including

If a grammar a.ixml includes rules from another grammar, b.ixml, we say that a.ixml is the including grammar.

visible

A nonterminal is visible in a grammar if it there is a rule definining it in that grammar or if a rule defining it has been included from another grammar.

Including nonterminals from one grammar allows them to be visible in a second grammar distinct from that which defines them.

This can be visualized as a graph. Suppose that we have a grammar for a sentence, sentence.ixml:

sentence = "The", animal, verb, pphrase, punct .
  animal = -" ", ("cat" | "bat" | "rat") .
    verb = -" ", ("sat" | "spat") .
 pphrase = prep, " the mat" .
    prep = -" ", ("in" | "on" | "at") .
   punct = "." | "!" .
Figure 1The sentence.ixml grammar

This grammar will parse sentences like “The cat sat on the mat.” (It’s a toy grammar that deals with spaces between words in an entirely impractical manner.)

Ignoring for the moment that there’s no syntax for including the sentence.ixml grammar in another grammar, we could imagine writing a book.ixml grammar for parsing books, where the sentence nonterminal is included from the sentence.ixml grammar.

   book = chapter++(-#a, -#a), -#a* .
chapter = number, -". ", para++-#a .
   para = sentence++" " .
@number = [N]+ .
Figure 2The book.ixml grammar

This would parse “books” like this:

1. The cat sat on the mat.
The rat spat at the mat.
 
2. The bat sat in the mat.

This can be visualized as shown in Figure 3. The suppressed space terminals have been left out of the diagram for simplicity.

Figure 3Visualization of the book grammar including the sentence grammar

The dashed line symbolizes that the sentence nonterminal in the including grammar (book.ixml) is defined in the included grammar (sentence.ixml).

1Building blocks

Several new features are added to the iXML grammar to support this proposal for modularity.

1.1A share statement

The share statement establishes the public interface of a grammar.

+share foo, bar, baz

This declaration asserts that foo, bar, and baz are the nonterminals in the public interface. It is an error if any nonterminal listed as shared is not defined by a rule in the grammar that shares it.

If a grammar does not have an explicit share statement, all of the nonterminals it defines are implicitly shared.

1.2An include statement

In the simplest case, an alternate grammar is included.

+include "sentence.ixml"

This has the effect of exposing all of the rules in the public interface of sentence.ixml into the including grammar.

It is an error to have a rule in the including grammar that defines a nonterminal with the same name as any included nonterminal. (If your grammar defines a nonterminal named “animal”, it cannot simultaneously include a rule named “animal” from some other grammar.)

1.2.1Explicit includes

The include statement can enumerate the nonterminals that are being included:

+include animal, verb from "sentence.ixml"

This includes the rules for animal and verb, irrespective of the included grammar’s public interface.

It is an error if any included nonterminal is not defined by a rule in the included grammar.

1.2.2Renaming included nonterminals

An explicit include can also rename the included rule.

+include animal from "sentence.ixml"
+include animal as muppet from "muppets.ixml"

This has the effect of making the animal from sentence.ixml available in the including grammar under the name animal and the animal from muppets.ixml available under the name muppet.

1.3Redefinitions

The include statement can also include rules that redefine nonterminals from the included grammar. For example, we might redefine some nonterminals from the sentence.ixml grammar:

+include sentence from "sentence.ixml" (
  animal |= -" ", "gnat" .
 pphrase = prep, object .
  object = -" ", "the", (animal | thing) .
)
 
-S = sentence.
thing = -" ", ("mat" | "hat" | "vat") .

A redefinition using “=” replaces the original definition. A redefinition using “|=” extends the original definition with a new alternative.

This example extends animal to include “gnat” and redefines prepositional phrase (pphrase) to allow different objects. This modified grammar will parse sentences like “The gnat sat in the rat!”

The nonterminals used in a redefinition can be defined in the redefinitions, in the included grammar, or in the including grammar, subject to the following constraints:

  • A rule in the redefinitions must redefine a nonterminal from the included grammar, or the nonterminal it defines must be used in another rule in the redefinitions. It is an error if a nonterminal in the redefinitions is unreachable.

    This grammar is not valid:

    +include sentence from "sentence.ixml" (
      animal |= -" ", "gnat" .
      vehicle = [L+].
    )
    animal-list = animal++" ".

    It defines an unreachable nonterminal, vehicle, in the redefinitions.

  • If a nonterminal used in a redefinition is also defined in the redefinitions, it is an error to define it in the including grammar.

    This grammar is not valid:

    +include sentence from "sentence.ixml" (
      animal |= -" ", fly .
      fly = "gnat" .
    )
    animal-list = (animal|fly)++" ".
    fly = "mosquito".

    The nonterminal fly is defined in both the redefinitions and the including grammar.

  • If a nonterminal used in a redefinition is defined in the including grammar, it is an error if it is also defined in the included grammar.

    This grammar is not valid:

    +include sentence from "sentence.ixml" (
      sentence = "One", animal, verb, pphrase, punct .
    )
     
    para = sentence++" ".
    verb = -" ", ("sat" | "begat") .

    The nonterminal verb is used in a redefinition, and it is defined in both the included grammar and the including grammar. This makes the definition used in the redefinition for sentence ambiguous.

2Examples

In all of the examples, sentence.ixml is the grammar from Figure 1.

2.1Simple include

Include sentence from sentence.ixml, and use it to construct a paragraph:

+include "sentence.ixml"
 
para = sentence ++ -" ", #a* .

Matches: “The cat sat on the mat. The rat spat on the mat. ”:

<para>
   <sentence>The
      <animal>cat</animal>
      <verb>sat</verb>
      <pphrase>
         <prep>on</prep> the mat</pphrase>
      <punct>.</punct>
   </sentence>
   <sentence>The
      <animal>rat</animal>
      <verb>spat</verb>
      <pphrase>
         <prep>on</prep> the mat</pphrase>
      <punct>.</punct>
   </sentence>
</para>

2.2Selective include

Include only the nonterminal animal:

+include animal from "sentence.ixml"
 
para = sentence ++ -" ", #a* .

This is not a valid grammar, as the nonterminal sentence is undefined.

+include animal from "sentence.ixml"
 
animals = animal ++ -",", #a* .

Matches: “ cat, bat, rat, rat ”:

<animals>
   <animal>cat</animal>
   <animal>bat</animal>
   <animal>rat</animal>
   <animal>rat</animal>
</animals>

2.3Overriding rules

Here’s a grammar that changes what animal matches:

+include "sentence.ixml" (
  animal = -" ", ("gnat" | "cat" | "bat" | "rat") .
)
 
para = sentence ++ -" ", #a* .

Matches: “The gnat sat in the mat. ”:

<para>
   <sentence>The
      <animal>gnat</animal>
      <verb>sat</verb>
      <pphrase>
         <prep>in</prep> the mat</pphrase>
      <punct>.</punct>
   </sentence>
</para>

Alternatively, since we only want to add gnat as an animal, we could do:

+include "sentence.ixml" (
  animal |= -" ", "gnat" .
)
 
para = sentence ++ -" ", #a* .

to achieve the same effect.

It’s possible to reference rules in the including grammar within redefinitions, provided that there are no rules for the defining the same nonterminal in the included grammar.

+include sentence from "sentence.ixml" (
 pphrase = prep, object .
)
 
-S = sentence, #a* .
object = -" ", "the hat" | -" ", "the bat" .

Matches: “The bat sat on the bat. ”:

<sentence>The
   <animal>bat</animal>
   <verb>sat</verb>
   <pphrase>
      <prep>on</prep>
      <object>the bat</object>
   </pphrase>
   <punct>.</punct>
</sentence>

This grammar does not match “The bat sat on the mat. ”.

A third alternative is to put the object rule in the redefinitions:

+include sentence from "sentence.ixml" (
 pphrase = prep, object .
 object = " the hat" | " the bat" .
)
 
-S = sentence, #a* .

This matches the same inputs, but the nonterminal object cannot be used in the including grammar because it isn’t visible.

2.4Renaming rules

Suppose you want to include a nonterminal that already has the same name as an existing nonterminal in your grammar. You can rename the included nonterminal:

+include verb as at-verb from "sentence.ixml"
 
verbs = (verb| at-verb>verb ) ++ -",", #a* .
verb = -" ", "begat" .

Matches: “ spat, begat, sat ”:

<verbs>
   <verb>spat</verb>
   <verb>begat</verb>
   <verb>sat</verb>
</verbs>

3Template grammars

It may be useful to have grammars that are intentionally incomplete. A grammar for tables, for example, might provide nonterminals for table, row, and cell, but leave unspecified the nonterminal for cell content.

It’s an error to have an iXML grammar with an undefined nonterminal. It would be possible to provide a dummy rule and then redefine it, but that invites errors if the grammar is included without redefining it. Instead, we allow a “template grammar” to enumerate the definitions that it requires:

+require cell-content
+share table
 
table = row+ .
row   = (-'|', cell)+, -'|', nl .
cell  = cell-content .
-nl   = -#a .

It’s an error if a grammar provides a definition for a nonterminal that it claims to require. It’s an error if there are any undefined nonterminals that are not explicitly called out as required. A template grammar can’t be used as a standalone grammar because it is incomplete. (We might want a specific error code for that.)

When a template grammar is included, the including grammar must provide a definition for each of the required nonterminals. The following grammar identifies tables containing words:

+include nl, table as word-table from "table.ixml" (
  -cell-content = -" "*, ([L]+) ++ (" "+), -" "* .
)
 
page = (para | word-table) ++ nl, nl? .
para = [L | " "]+, nl .

For example:

This is a paragraph
 
So is this
 
|x|o|x|
|o|o|x|
|x|o|x|
 
Another paragraph

Parses as:

 1 |<page>
   |   <para>This is a paragraph</para>
   |   <para>So is this</para>
   |   <word-table>
 5 |      <row>
   |         <cell>x</cell>
   |         <cell>o</cell>
   |         <cell>x</cell>
   |      </row>
10 |      <row>
   |         <cell>o</cell>
   |         <cell>o</cell>
   |         <cell>x</cell>
   |      </row>
15 |      <row>
   |         <cell>x</cell>
   |         <cell>o</cell>
   |         <cell>x</cell>
   |      </row>
20 |   </word-table>
   |   <para>Another paragraph</para>
   |</page>

An alternate version identifies tables of numbers:

+include nl, table as number-table from "table.ixml" (
  -cell-content = -" "*, ["0"-"9"]+, -" "* .
)
 
page = (para | number-table) ++ nl, nl? .
para = [L | " "]+, nl .

And, of course, many other options are possible.

4Open questions

4.1Should we allow multiple includes?

There’s a use case for including the same template grammar more than once:

+include table as word-table from "table.ixml" (
  cell-content = [L]+ .
)
+include table as number-table from "table.ixml" (
  cell-content = ["0"-"9"]+ .
)
 
page = (para | word-table | number-table) ++ nl, nl? .
para = [L | " "]+, nl .

The user, of course, would have to be careful that the set of included nonterminals was distinct.

ATests

In the tests that follow, the “numbers.ixml” grammar is:

1 |+share number, decimal, hexadecimal
  | 
  |S = number .
  |number>numb = decimal | hexadecimal .
5 |@decimal>dec = digit+ .
  |@hexadecimal>hex = hexdigit+ .
  |-digit = ["0"-"9"] .
  |-hexdigit = digit | [ "a"-"f" ] .

The “table.ixml” grammar is:

1 |+require cell-content
  |+share table
  | 
  |table = row+ .
5 |row   = (-'|', cell)+, -'|', nl .
  |cell  = cell-content .
  |-nl   = -#a .

1Test t001

Input

1a”.

Grammar
  |+include "numbers.ixml" { or +include number from ... }
  | 
  |S = number .
Result
  |<S><numb hex='1a'/></S>

2Test t002

Input

1a”.

Grammar
  |+include number as decimal from "numbers.ixml"
  | 
  |S = decimal .
Result
  |<S>
  |   <decimal hex='1a'/>
  |</S>

3Test t003

Input

1a”.

Grammar
  |+include hexadecimal as base16 from "numbers.ixml"
  | 
  |S = base16 .
Result
  |<S base16='1a'/>

4Test t004

Input

1a”.

Grammar
  |+include hexadecimal as base16>hex from "numbers.ixml"
  | 
  |S = base16 .
Result
  |<S hex='1a'/>

5Test t005

Input

1a”.

Grammar
  |+include hexadecimal as ^base16 from "numbers.ixml"
  | 
  |S = base16 .
Result
  |<S>
  |   <base16>1a</base16>
  |</S>

6Test t006

Input

1a”.

Grammar
  |+include hexadecimal as ^base16>hexi from "numbers.ixml"
  | 
  |S = base16 .
Result
  |<S>
  |   <hexi>1a</hexi>
  |</S>

7Test t007

Input

1A”.

Grammar
1 |+include number from "numbers.ixml" (
  | 
  |)
  | 
5 |S = number .
Result
 1 |<fail xmlns:ixml='http://invisiblexml.org/NS' ixml:state='failed'>
   |   <line>1</line>
   |   <column>1</column>
   |   <pos>2</pos>
 5 |   <unexpected>A</unexpected>
   |   <permitted>['0'-'9'], ['a'-'f']</permitted>
   |   <could-be-next>
   |      <in rule='digit'>
   |         <tokens>['0'-'9']</tokens>
10 |      </in>
   |      <in rule='hexdigit'>
   |         <tokens>['a'-'f']</tokens>
   |      </in>
   |   </could-be-next>
15 |</fail>

8Test t008

Input

1A”.

Grammar
1 |+include number from "numbers.ixml" (
  |  hexdigit = digit | ["A"-"F"] .
  |)
  | 
5 |S = number .
Result
  |<S>
  |   <numb hex='1A'/>
  |</S>

9Test t009

Input

1A”.

Grammar
1 |+include number from "numbers.ixml" (
  |  hexdigit |= ["A"-"F"] .
  |)
  | 
5 |S = number .
Result
  |<S>
  |   <numb hex='1A'/>
  |</S>

10Test t010

Grammar
1 |+include number from "numbers.ixml" (
  |  otherdigit = ["A"-"F"] .
  |)
  | 
5 |S = number .
Result

Error: Unused rule: otherdigit

11Test t011

Input

1A”.

Grammar
1 |+include number from "numbers.ixml" (
  |  hexdigit |= capitalHex .
  |)
  | 
5 |S = number .
  |capitalHex = ["A"-"F"] .
Result
  |<S>
  |   <numb hex='1A'/>
  |</S>

12Test t012

Input

1A”.

Grammar
1 |+include number from "numbers.ixml" (
  |  hexdigit |= capitalHex .
  |  capitalHex = ["A"-"F"] .
  |)
5 | 
  |S = number .
Result
  |<S>
  |   <numb hex='1A'/>
  |</S>

13Test t013

Grammar
1 |+include number from "numbers.ixml" (
  |  hexdigit |= capitalHex .
  |  capitalHex = ["A"-"F"] .
  |)
5 | 
  |S = number .
  |capitalHex = ["A"-"F"] .
Result

Error: capitalHex in override and including grammar

14Test t014

Grammar
1 |+include number from "numbers.ixml" (
  |  number = hexadecimal .
  |)
  | 
5 |S = number .
  |hexadecimal = ["0"-"9" | "a"-"f" | "A"-"F" ]+ .
Result

Error: hexadecimal used in override, defined in both including and included grammars

15Test t015

Input
1 |This is a paragraph
  | 
  ||x|o|x|
  ||o|o|x|
5 ||x|o|x|
Grammar
1 |+include nl, table as word-table from "table.ixml" (
  |  -cell-content = -" "*, ([L]+) ++ (" "+), -" "* .
  |)
  | 
5 |page = (para | word-table) ++ nl, nl? .
  |para = [L | " "]+, nl .
Result
 1 |<page>
   |   <para>This is a paragraph</para>
   |   <word-table>
   |      <row>
 5 |         <cell>x</cell>
   |         <cell>o</cell>
   |         <cell>x</cell>
   |      </row>
   |      <row>
10 |         <cell>o</cell>
   |         <cell>o</cell>
   |         <cell>x</cell>
   |      </row>
   |      <row>
15 |         <cell>x</cell>
   |         <cell>o</cell>
   |         <cell>x</cell>
   |      </row>
   |   </word-table>
20 |</page>