An Invisible XML modularity proposal
Invisible XML (iXML) is a powerful tool for parsing implicit structures to reveal explicit structures. But at the moment, every iXML grammar is an island, entire of itself: all of the rules must be contained in a single file.
Reuse is a well established software engineering practice. As grammars proliferate, some mechanism for reuse is desirable. If grammars exist for parsing ISBNs or URIs or ISO 8601 dates and times, it would be useful to be able to reuse those grammars in our own grammars, without copying all of the rules. Having multiple copies of the same rules is a recipe for introducing errors.
In a complex system that supports reuse, it’s also valuable to be able to distinguish the parts of a system that are the “public interface”, which users can confidently expect to function consistently, even if the grammar is updated, from the parts that are implementation details, which the implementor may choose to refactor or change without affecting the behavior of the public interface.
An iXML grammar defines a set of rules. In this proposal for grammar
reuse, the distinction between public interface and implementation details will,
in practice, often boil down to the set of rule names (and semantics) that are
expected to be unchanging. A grammar for ISO 8601 date times might, for example,
declare that date and dateTime are what the grammar is
designed to parse. It must contain other rules for parsing years, months, days,
delimiters, etc., but those are implementation details. Authors of sharable
grammars can, according to this proposal, specify that only date
and dateTime are “safe” to import, that they will continue to
function even if other aspects of the grammar change.
It was Michael Sperberg-McQueen who initially proposed that grammar modularity might be territory already mapped out by RELAX NG. RELAX NG is a grammar-based validation technology defined in terms of a set of rules (which RELAX NG calls patterns) operating over the names of, and relationships between, XML elements, attributes, and content.
RELAX NG has reuse mechanisms that have proven to be valuable and remarkably effective in practice. A thorough understanding of RELAX NG is not required to understand this proposal, it suffices to say that at a high level, one RELAX NG grammar can include another and, in doing so, replace or extend patterns.
If we take the capabilities of RELAX NG as a good starting point, we can formulate an initial set of requirements.
- A grammar author must be able to include rules from one or more other grammars.
- A grammar author must be able to define a public interface which specifies the nonterminals they expect to share.
- It should be convenient to include all of the rules that are in the public interface and no others.
- It should be possible to include any rule from a grammar. Ultimately, reuse should be in the hands of the author of the including grammar.
- When combining grammars, simply including a nonterminal from another grammar does not change its definition.
- It must, however, also be possible to redefine any nonterminal in the included grammar.
- It may happen that two different grammars define a nonterminal
with the same name and that an author wishes to include both of them. It must be
possible to disambiguate their names. (It must be possible to include the
“digit” nonterminal from
bodyparts.ixmland the “digit” nonterminal fromnumbers.ixmlinto the same including grammar.)
Some terminology:
- defined
A nonterminal is defined by a single, specific rule.
- included
If a grammar
a.ixmlincludes rules from another grammar,b.ixml, we say thatb.ixmlis the included grammar.- including
If a grammar
a.ixmlincludes rules from another grammar,b.ixml, we say thata.ixmlis the including grammar.- visible
A nonterminal is visible in a grammar if it there is a rule definining it in that grammar or if a rule defining it has been included from another grammar.
Including nonterminals from one grammar allows them to be visible in a second grammar distinct from that which defines them.
This can be visualized as a graph. Suppose that we have a grammar for
a sentence, sentence.ixml:
sentence = "The", animal, verb, pphrase, punct .animal = -" ", ("cat" | "bat" | "rat") .verb = -" ", ("sat" | "spat") .pphrase = prep, " the mat" .prep = -" ", ("in" | "on" | "at") .punct = "." | "!" .
This grammar will parse sentences like “The cat sat on the mat.” (It’s a toy grammar that deals with spaces between words in an entirely impractical manner.)
Ignoring for the moment that there’s no syntax for including the
sentence.ixml
grammar in another grammar, we could imagine writing a
book.ixml grammar for parsing books, where the
sentence nonterminal is included from the sentence.ixml grammar.
book = chapter++(-#a, -#a), -#a* .chapter = number, -". ", para++-#a .para = sentence++" " .@number = [N]+ .
This would parse “books” like this:
1. The cat sat on the mat.The rat spat at the mat.2. The bat sat in the mat.
This can be visualized as shown in Figure 3. The suppressed space terminals have been left out of the diagram for simplicity.
The dashed line symbolizes that the sentence
nonterminal in the including grammar (book.ixml) is
defined in the
included grammar (sentence.ixml).
1. Building blocks
Several new features are added to the iXML grammar to support this proposal for modularity.
1.2. An include statement
In the simplest case, an alternate grammar is included.
+include "sentence.ixml"This has the effect of exposing all of the rules in the public interface
of sentence.ixml into the including grammar.
It is an error to have a rule in the including grammar that defines a nonterminal with the same name as any included nonterminal. (If your grammar defines a nonterminal named “animal”, it cannot simultaneously include a rule named “animal” from some other grammar.)
1.2.1. Explicit includes
The include statement can enumerate the nonterminals that are being included:
+include animal, verb from "sentence.ixml"This includes the rules for animal and verb,
irrespective of the included grammar’s public interface.
It is an error if any included nonterminal is not defined by a rule in the included grammar.
1.2.2. Renaming included nonterminals
An explicit include can also rename the included rule.
+include animal from "sentence.ixml"+include animal as muppet from "muppets.ixml"
This has the effect of making the animal from
sentence.ixml available in the including grammar under the
name animal and the animal from
muppets.ixml available under the name
muppet.
1.3. Redefinitions
The include statement can also include rules that redefine nonterminals from the included grammar. For example, we might redefine some nonterminals from the sentence.ixml grammar:
+include sentence from "sentence.ixml" (animal |= -" ", "gnat" .pphrase = prep, object .object = -" ", "the", (animal | thing) .)-S = sentence.thing = -" ", ("mat" | "hat" | "vat") .
A redefinition using “=” replaces the original definition. A redefinition using “|=” extends the original definition with a new alternative.
This example extends animal to include “gnat” and redefines
prepositional phrase (pphrase) to allow different objects. This modified grammar will
parse sentences like “The gnat sat in the rat!”
The nonterminals used in a redefinition can be defined in the redefinitions, in the included grammar, or in the including grammar, subject to the following constraints:
A rule in the redefinitions must redefine a nonterminal from the included grammar, or the nonterminal it defines must be used in another rule in the redefinitions. It is an error if a nonterminal in the redefinitions is unreachable.
This grammar is not valid:
+include sentence from "sentence.ixml" (animal |= -" ", "gnat" .vehicle = [L+].)animal-list = animal++" ".It defines an unreachable nonterminal,
vehicle, in the redefinitions.If a nonterminal used in a redefinition is also defined in the redefinitions, it is an error to define it in the including grammar.
This grammar is not valid:
+include sentence from "sentence.ixml" (animal |= -" ", fly .fly = "gnat" .)animal-list = (animal|fly)++" ".fly = "mosquito".The nonterminal
flyis defined in both the redefinitions and the including grammar.If a nonterminal used in a redefinition is defined in the including grammar, it is an error if it is also defined in the included grammar.
This grammar is not valid:
+include sentence from "sentence.ixml" (sentence = "One", animal, verb, pphrase, punct .)para = sentence++" ".verb = -" ", ("sat" | "begat") .The nonterminal
verbis used in a redefinition, and it is defined in both the included grammar and the including grammar. This makes the definition used in the redefinition forsentenceambiguous.
2. Examples
In all of the examples, sentence.ixml is the
grammar from Figure 1.
2.1. Simple include
Include sentence from
sentence.ixml, and use it to construct a
paragraph:
+include "sentence.ixml"para = sentence ++ -" ", #a* .
Matches: “The cat sat on the mat. The rat spat on the mat. ”:
<para>
<sentence>The
<animal>cat</animal>
<verb>sat</verb>
<pphrase>
<prep>on</prep> the mat</pphrase>
<punct>.</punct>
</sentence>
<sentence>The
<animal>rat</animal>
<verb>spat</verb>
<pphrase>
<prep>on</prep> the mat</pphrase>
<punct>.</punct>
</sentence>
</para>2.2. Selective include
Include only the nonterminal animal:
+include animal from "sentence.ixml"para = sentence ++ -" ", #a* .
This is not a valid grammar, as the nonterminal sentence is undefined.
+include animal from "sentence.ixml"animals = animal ++ -",", #a* .
Matches: “ cat, bat, rat, rat ”:
<animals>
<animal>cat</animal>
<animal>bat</animal>
<animal>rat</animal>
<animal>rat</animal>
</animals>2.3. Overriding rules
Here’s a grammar that changes what animal matches:
+include "sentence.ixml" (animal = -" ", ("gnat" | "cat" | "bat" | "rat") .)para = sentence ++ -" ", #a* .
Matches: “The gnat sat in the mat. ”:
<para>
<sentence>The
<animal>gnat</animal>
<verb>sat</verb>
<pphrase>
<prep>in</prep> the mat</pphrase>
<punct>.</punct>
</sentence>
</para>Alternatively, since we only want to add gnat as an animal, we could do:
+include "sentence.ixml" (animal |= -" ", "gnat" .)para = sentence ++ -" ", #a* .
to achieve the same effect.
It’s possible to reference rules in the including grammar within redefinitions, provided that there are no rules for the defining the same nonterminal in the included grammar.
+include sentence from "sentence.ixml" (pphrase = prep, object .)-S = sentence, #a* .object = -" ", "the hat" | -" ", "the bat" .
Matches: “The bat sat on the bat. ”:
<sentence>The
<animal>bat</animal>
<verb>sat</verb>
<pphrase>
<prep>on</prep>
<object>the bat</object>
</pphrase>
<punct>.</punct>
</sentence>This grammar does not match “The bat sat on the mat. ”.
A third alternative is to put the object rule in the redefinitions:
+include sentence from "sentence.ixml" (pphrase = prep, object .object = " the hat" | " the bat" .)-S = sentence, #a* .
This matches the same inputs, but the nonterminal object cannot
be used in the including grammar because it isn’t
visible.
2.4. Renaming rules
Suppose you want to include a nonterminal that already has the same name as an existing nonterminal in your grammar. You can rename the included nonterminal:
+include verb as at-verb from "sentence.ixml"verbs = (verb| at-verb>verb ) ++ -",", #a* .verb = -" ", "begat" .
Matches: “ spat, begat, sat ”:
<verbs>
<verb>spat</verb>
<verb>begat</verb>
<verb>sat</verb>
</verbs>3. Template grammars
It may be useful to have grammars that are intentionally incomplete. A grammar for tables, for example, might provide nonterminals for table, row, and cell, but leave unspecified the nonterminal for cell content.
It’s an error to have an iXML grammar with an undefined nonterminal. It would be possible to provide a dummy rule and then redefine it, but that invites errors if the grammar is included without redefining it. Instead, we allow a “template grammar” to enumerate the definitions that it requires:
+require cell-content+share tabletable = row+ .row = (-'|', cell)+, -'|', nl .cell = cell-content .-nl = -#a .
It’s an error if a grammar provides a definition for a nonterminal that it claims to require. It’s an error if there are any undefined nonterminals that are not explicitly called out as required. A template grammar can’t be used as a standalone grammar because it is incomplete. (We might want a specific error code for that.)
When a template grammar is included, the including grammar must provide a definition for each of the required nonterminals. The following grammar identifies tables containing words:
+include nl, table as word-table from "table.ixml" (-cell-content = -" "*, ([L]+) ++ (" "+), -" "* .)page = (para | word-table) ++ nl, nl? .para = [L | " "]+, nl .
For example:
This is a paragraphSo is this|x|o|x||o|o|x||x|o|x|Another paragraph
Parses as:
1 |<page>|<para>This is a paragraph</para>|<para>So is this</para>|<word-table>5 |<row>|<cell>x</cell>|<cell>o</cell>|<cell>x</cell>|</row>10 |<row>|<cell>o</cell>|<cell>o</cell>|<cell>x</cell>|</row>15 |<row>|<cell>x</cell>|<cell>o</cell>|<cell>x</cell>|</row>20 |</word-table>|<para>Another paragraph</para>|</page>
An alternate version identifies tables of numbers:
+include nl, table as number-table from "table.ixml" (-cell-content = -" "*, ["0"-"9"]+, -" "* .)page = (para | number-table) ++ nl, nl? .para = [L | " "]+, nl .
And, of course, many other options are possible.
4. Open questions
4.1. Should we allow multiple includes?
There’s a use case for including the same template grammar more than once:
+include table as word-table from "table.ixml" (cell-content = [L]+ .)+include table as number-table from "table.ixml" (cell-content = ["0"-"9"]+ .)page = (para | word-table | number-table) ++ nl, nl? .para = [L | " "]+, nl .
The user, of course, would have to be careful that the set of included nonterminals was distinct.
A. Tests
In the tests that follow, the “numbers.ixml” grammar is:
1 |+share number, decimal, hexadecimal||S = number .|number>numb = decimal | hexadecimal .5 |@decimal>dec = digit+ .|@hexadecimal>hex = hexdigit+ .|-digit = ["0"-"9"] .|-hexdigit = digit | [ "a"-"f" ] .
The “table.ixml” grammar is:
1 |+require cell-content|+share table||table = row+ .5 |row = (-'|', cell)+, -'|', nl .|cell = cell-content .|-nl = -#a .
1. Test t001
- Input
“
1a”.- Grammar
|
+include "numbers.ixml" { or +include number from ... }||S = number .- Result
|<S><numb hex='1a'/></S>
2. Test t002
- Input
“
1a”.- Grammar
|
+include number as decimal from "numbers.ixml"||S = decimal .- Result
|
<S>|<decimal hex='1a'/>|</S>
3. Test t003
- Input
“
1a”.- Grammar
|
+include hexadecimal as base16 from "numbers.ixml"||S = base16 .- Result
|<S base16='1a'/>
4. Test t004
- Input
“
1a”.- Grammar
|
+include hexadecimal as base16>hex from "numbers.ixml"||S = base16 .- Result
|<S hex='1a'/>
5. Test t005
- Input
“
1a”.- Grammar
|
+include hexadecimal as ^base16 from "numbers.ixml"||S = base16 .- Result
|
<S>|<base16>1a</base16>|</S>
6. Test t006
- Input
“
1a”.- Grammar
|
+include hexadecimal as ^base16>hexi from "numbers.ixml"||S = base16 .- Result
|
<S>|<hexi>1a</hexi>|</S>
7. Test t007
- Input
“
1A”.- Grammar
1 |
+include number from "numbers.ixml" (||)|5 |S = number .- Result
1 |
<fail xmlns:ixml='http://invisiblexml.org/NS' ixml:state='failed'>|<line>1</line>|<column>1</column>|<pos>2</pos>5 |<unexpected>A</unexpected>|<permitted>['0'-'9'], ['a'-'f']</permitted>|<could-be-next>|<in rule='digit'>|<tokens>['0'-'9']</tokens>10 |</in>|<in rule='hexdigit'>|<tokens>['a'-'f']</tokens>|</in>|</could-be-next>15 |</fail>
8. Test t008
- Input
“
1A”.- Grammar
1 |
+include number from "numbers.ixml" (|hexdigit = digit | ["A"-"F"] .|)|5 |S = number .- Result
|
<S>|<numb hex='1A'/>|</S>
9. Test t009
- Input
“
1A”.- Grammar
1 |
+include number from "numbers.ixml" (|hexdigit |= ["A"-"F"] .|)|5 |S = number .- Result
|
<S>|<numb hex='1A'/>|</S>
10. Test t010
- Grammar
1 |
+include number from "numbers.ixml" (|otherdigit = ["A"-"F"] .|)|5 |S = number .- Result
Error: Unused rule: otherdigit
11. Test t011
- Input
“
1A”.- Grammar
1 |
+include number from "numbers.ixml" (|hexdigit |= capitalHex .|)|5 |S = number .|capitalHex = ["A"-"F"] .- Result
|
<S>|<numb hex='1A'/>|</S>
12. Test t012
- Input
“
1A”.- Grammar
1 |
+include number from "numbers.ixml" (|hexdigit |= capitalHex .|capitalHex = ["A"-"F"] .|)5 ||S = number .- Result
|
<S>|<numb hex='1A'/>|</S>
13. Test t013
- Grammar
1 |
+include number from "numbers.ixml" (|hexdigit |= capitalHex .|capitalHex = ["A"-"F"] .|)5 ||S = number .|capitalHex = ["A"-"F"] .- Result
Error: capitalHex in override and including grammar
14. Test t014
- Grammar
1 |
+include number from "numbers.ixml" (|number = hexadecimal .|)|5 |S = number .|hexadecimal = ["0"-"9" | "a"-"f" | "A"-"F" ]+ .- Result
Error: hexadecimal used in override, defined in both including and included grammars
15. Test t015
- Input
1 |
This is a paragraph|||x|o|x|||o|o|x|5 ||x|o|x|- Grammar
1 |
+include nl, table as word-table from "table.ixml" (|-cell-content = -" "*, ([L]+) ++ (" "+), -" "* .|)|5 |page = (para | word-table) ++ nl, nl? .|para = [L | " "]+, nl .- Result
1 |
<page>|<para>This is a paragraph</para>|<word-table>|<row>5 |<cell>x</cell>|<cell>o</cell>|<cell>x</cell>|</row>|<row>10 |<cell>o</cell>|<cell>o</cell>|<cell>x</cell>|</row>|<row>15 |<cell>x</cell>|<cell>o</cell>|<cell>x</cell>|</row>|</word-table>20 |</page>