DTD Customizer

Yann Dirson

ydirson@altern.org

September 2001

Rationale for this work

The goal of this project is to make it easy to customize a document-class schema (SGML or XML DTD, W3C or TREX or RELAX XML schema, Thot S schema description, etc.) by specifying, at a high level of abstraction, which changes to make (removing an element or adding a new one, restricting the content model of an element, etc.), and to have an automated tool carry out all the necessary mechanical steps, which would be tedious and error-prone if done by hand.

This is motivated by the following considerations:

- Popular document classes like DocBook, designed to address a large number of needs, are necessarily too vast for many of those needs. Despite this, they cannot be universal, and sometimes they lack a couple of features. Hence the (not new) idea of customizing a well-known document class for a particular application.

- The process of customizing a DTD contains, in addition to the design stage, a costly mechanical sub-process: ensuring that the whole customization consistently implements a change (like the removal of an element, which must be propagated to all elements that reference it). This process requires a good knowledge of the internal structure of the base DTD, as well as good tools to help acquire this knowledge and to get a complete view of the result of the customization (e.g. dtd* from PerlSGML).

- The needs that led to a customization, or at least the understanding of those needs, will probably evolve with time, leading to multiple revisions of a customization layer. Doing this maintenance by hand has at least the same knowledge requirements as the initial customization, to which we now have to add knowledge of the initial customization itself. The cost grows even more with a tree or cascade of customization layers.

- The base DTD for a customization may itself have to evolve with time. In many cases it will be useful to update a customized DTD to the latest revision of the base DTD. Again, this is costly if done manually.

The idea for this software came to me while designing a DocBook customization, when I realized that ideally I would need not a single customization layer but a whole hierarchy of customization layers, which would be very difficult to maintain without such a tool.

This tool will most probably be written using the PerlSGML library for handling DTDs.

Types of customization

START of block to rewrite

Although some "customization" could be applied to any DTD by simply having the tool write out a new DTD file, this will not be supported, partly because it would make it hard to check the correctness of the tool's output, but mostly because I do not consider such DTDs modular enough. The type of customization discussed here is the one described in "DocBook: The Definitive Guide", in the section on writing DocBook customization layers. The features of a DTD that this tool will use are:

- parameter entities used to define elements' content, as those are redefinable;

- conditionally defined DTD chunks (using parameter entities to include/ignore them), as this allows elements themselves to be removed or completely redefined.

END of block to rewrite

Most customizations of a document class like DocBook either select a subset, or add domain-specific elements or attributes, or combine the two. In the latter case, this tool should make it trivial to clearly separate the restricting part of the customization from the extending part. It would even be trivial to maintain several extensions based on a single subset of DocBook (or TEI, or whatever), like the following:

        DocBook
           |
           |
           v
         Subset
          / \
         /   \
        /     \
       v       v
    DTD-1     DTD-2

Following these considerations, this tool will initially focus on operations that either strictly restrict or strictly extend a DTD. Unfortunately, not all such operations are easy to define formally, and in a number of cases it will be easier for the customization layer designer to specify a restricted or extended content model, and have the tool just ensure that the specification is indeed a subset or a superset of the original.

Other operations will be investigated and implemented later, should the need arise. Element renaming, being defined and implemented by really simple means, will surely be an exception.

Both for restriction and for extension, a declaration should be available to the customization layer author to declare a layer as a global pure restriction or pure extension. This will be used to forbid invalid constructs, and to issue warnings for easily misused constructs.

Global modular architecture

The core of the tool will take abstract representations of the document class as input and produce them as output. Those data structures will be defined by a generic schema-handling library, like the (still to come) redesign of the Perl5 PerlSGML modules.

The input to the tool core will be produced by a schema parser. Hopefully the above-mentioned redesign of PerlSGML will handle this part of the work, making it possible to plug in parsers for new schema description notations.

The output of the tool core will have to be transcribed into a usable syntax (which may differ from the input syntax). This will probably be integrated into PerlSGML as well, in a generic way similar to the input layer.

Output of differential customization layers, like those manually written for DocBook, may be supported using additional output modules. However, as this style of customization probably only exists to simplify the manual work that this tool hopes to render useless, it is not clear to me at this point why it should be supported, and I do not plan to put any work into this myself.

Core tool design

DTD restriction

Element content restriction

Content restriction is a wide area, and the general case may be difficult to formalize in a usable way, so before focusing on a couple of restrictions that will be easy to formalize, we'll see how the tool can act as a simple checker.

Checking a content model

Content models are mostly defined using regexp-like syntaxes, so we can use finite-state machine (FSM) theory and tools. A content model is a regular expression over element names, so the language it describes is regular (not merely context-free) and can be modelled by an FSM. Thus we basically need to check that all words accepted by the FSM for the new content model are also accepted by the FSM for the original one.

A naive (and proof-of-concept) algorithm would be to compute the FSM for the intersection of the two languages, and check that its canonical form is the same as that of the supposed restriction. I hope we can find a one-pass algorithm that would save computing time and be more elegant, but the priority is to get something that works.
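As an illustration only, here is a minimal Python sketch of such a check. It does not build the intersection FSM and compare canonical forms as described above; instead it walks both models in lockstep using Antimirov partial derivatives, which amounts to an on-the-fly subset construction. The tuple encoding of content models and all function names (`seq`, `alt`, `is_restriction`, ...) are assumptions made for this sketch, and the SGML `&` connector and inclusion/exclusion exceptions are not handled.

```python
from collections import deque

EPS = ("eps",)  # the empty word

def sym(name): return ("sym", name)
def star(r): return ("star", r)           # r*
def opt(r): return ("alt", EPS, r)        # r?
def plus(r): return ("seq", r, star(r))   # r+

def seq(first, *rest):
    """Sequence (the ',' connector), folded into binary nodes."""
    return first if not rest else ("seq", first, seq(*rest))

def alt(first, *rest):
    """Alternation (the '|' connector), folded into binary nodes."""
    return first if not rest else ("alt", first, alt(*rest))

def nullable(r):
    """True if the model accepts the empty word."""
    kind = r[0]
    if kind in ("eps", "star"):
        return True
    if kind == "sym":
        return False
    if kind == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def symbols(r):
    """Set of element names used in the model."""
    if r[0] == "sym":
        return {r[1]}
    return set().union(*(symbols(child) for child in r[1:]))

def step(r, x):
    """Partial derivatives: the models that may remain after reading x."""
    kind = r[0]
    if kind == "eps":
        return set()
    if kind == "sym":
        return {EPS} if r[1] == x else set()
    if kind == "alt":
        return step(r[1], x) | step(r[2], x)
    if kind == "star":
        return {("seq", d, r) for d in step(r[1], x)}
    out = {("seq", d, r[2]) for d in step(r[1], x)}  # seq
    if nullable(r[1]):
        out |= step(r[2], x)
    return out

def is_restriction(new, orig):
    """True if every word accepted by `new` is also accepted by `orig`."""
    alphabet = symbols(new) | symbols(orig)
    start = (frozenset([new]), frozenset([orig]))
    seen, todo = {start}, deque([start])
    while todo:
        a, b = todo.popleft()
        if any(map(nullable, a)) and not any(map(nullable, b)):
            return False  # a word accepted by `new` is rejected by `orig`
        for x in alphabet:
            a2 = frozenset(d for r in a for d in step(r, x))
            if not a2:
                continue  # `new` accepts nothing along this path
            b2 = frozenset(d for r in b for d in step(r, x))
            if (a2, b2) not in seen:
                seen.add((a2, b2))
                todo.append((a2, b2))
    return True

# Hypothetical models: (title, (para|note)*) restricted to (title, para+)
original = seq(sym("title"), star(alt(sym("para"), sym("note"))))
restricted = seq(sym("title"), plus(sym("para")))
```

With these definitions, `is_restriction(restricted, original)` holds while `is_restriction(original, restricted)` does not. The search terminates because a model has only finitely many partial derivatives.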

Some kinds of content restriction

Note: the syntax used for examples is only here as a design helper. A formal syntax will be defined later, possibly as XML data.

An apparently quite simple type of content restriction is the removal of a single occurrence of an element in a content model. Even that is somewhat difficult to express if we want it to survive future revisions of the parent DTD, the main problem being how to address a single occurrence of an element type (or of a parameter entity) in a content model (including inside parameter entities) when several such occurrences can be found.

Simpler to specify is the removal of all occurrences of an element (including pseudo-elements like #PCDATA) from a content model:

Removing all occurrences of an element in a content model

    REMOVE WHICH: all (ELEMENTS|ENTITIES): regexp FROM ELEMENTS: regexp

Even simpler to specify is the pure destruction of an element or entity, and hence its removal from all content models and entities. This is not just a special case of the former, because it also requires undefining the element or entity.

Zapping an element

    ZAP (ELEMENTS|ENTITIES): regexp

However, the very existence of content exceptions makes it difficult to handle changes to parameter entities safely, as those entities can then be used with both additive and subtractive semantics. Their usage in the base DTD (and in the current customization layer) should be checked to be sure of the semantics of the following clause, which would be a pure extension when applied to an entity used for a content exclusion:

    REMOVE WHICH: all (ELEMENTS|ENTITIES): regexp FROM ENTITIES: regexp

For element definitions where content exceptions are not parametrized, or whose parametrization should be broken (more on this later), explicit manipulation may be necessary:

    ADD EXCLUSION:
    REMOVE INCLUSION:

Often an entity is used in the definition of several content models, and we only want to restrict some of those contents. Then the parametrization has to be rewritten using a new entity with a "smaller content". These entities can be kept linked (the old one being formally an extension of the new one) or not; this decision will impact further customization layers.

    DESYNC ENTITY: name IN (ELEMENTS|ENTITIES): regexp \
        TO: RESTRICTION NAMED: name LINKED: (YES|NO)

To which a restriction can then be applied:

    REMOVE WHICH: all (ELEMENTS|ENTITIES): regexp FROM ENTITIES: regexp

Note that such a two-step mechanism may have an impact on other design issues. But maybe not, as the following syntax attempts to demonstrate:

    ...
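To make the REMOVE (all occurrences) and ZAP clauses more concrete, here is a small Python sketch that treats them as tree operations. The data layout is an assumption made for this example: a content model is either a name or a `(connector, children, occurrence)` tuple, and a DTD is a plain dictionary from element names to content models. Parameter entities and exceptions are left out.

```python
import re

def remove_all(model, pattern):
    """Drop every name matching `pattern`; return None if nothing is left."""
    if isinstance(model, str):
        return None if re.fullmatch(pattern, model) else model
    connector, children, occurrence = model
    kept = [c for c in (remove_all(child, pattern) for child in children)
            if c is not None]
    if not kept:
        return None
    if len(kept) == 1 and occurrence == "":
        return kept[0]  # a group of one without indicator is just its member
    return (connector, kept, occurrence)

def zap(dtd, pattern):
    """ZAP: undefine the matching elements, then remove all references to them."""
    result = {}
    for name, model in dtd.items():
        if re.fullmatch(pattern, name):
            continue  # the declaration itself disappears
        # None marks a content model emptied by the removal; what to do with
        # such an element is a design decision this sketch leaves open.
        result[name] = remove_all(model, pattern)
    return result

# Hypothetical base DTD
dtd = {
    "section": (",", ["title", ("|", ["para", "note", "warning"], "*")], ""),
    "note": ("|", ["para", "warning"], "+"),
    "warning": ("|", ["para"], "+"),
    "para": ("|", ["#PCDATA"], "*"),
}
zapped = zap(dtd, "warning")
```

Note that removing a member of an alternation is a pure restriction, whereas removing a required member of a sequence yields a model that accepts documents the original rejected, so the result of such an operation would still need to go through a content model check.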

Attribute restriction

Note: these customizations depend on the base DTD using parameter entities for attribute definitions.

This is not unlike element restriction, but much simpler, as attributes are not referenced in places as complex as content models, and can occur only once within one element definition.

Remove an #IMPLIED or defaulted attribute:

    REMOVE: attribute FROM: element-regexp

Make an #IMPLIED or defaulted attribute required:

    REQUIRE: attribute IN: element-regexp

On the other hand, attributes also have some sort of "content model" which, even if simpler than an element's content model, can also be subject to customization.

Allow fewer tokens:

    REMOVE: token FROM: attribute-regexp IN: element-regexp

Change CDATA to tokens. This may require some additional care, and SGML attribute minimization should surely be turned off in this case if we want a document using the customization layer to be parsable as an instance of the base DTD.

    TOKENIZE: attribute-regexp FROM: element-regexp AS: token-model
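The first three attribute operations can be sketched as follows. This is only an illustration under assumed data structures: attribute lists are a dictionary of dictionaries, an enumerated type is a list of tokens, and the function names are invented for the example. The checks reflect the restriction semantics described above: an attribute that is #REQUIRED in the base DTD cannot be removed, and a token that serves as default value cannot be dropped.

```python
import copy
import re

def _targets(attlists, attribute_regexp, element_regexp):
    """Yield (attribute dict, attribute name) for every matching declaration."""
    for element, attrs in attlists.items():
        if re.fullmatch(element_regexp, element):
            for name in list(attrs):
                if re.fullmatch(attribute_regexp, name):
                    yield attrs, name

def remove_attribute(attlists, attribute, element_regexp):
    """REMOVE: attribute FROM: element-regexp"""
    result = copy.deepcopy(attlists)
    for attrs, name in _targets(result, re.escape(attribute), element_regexp):
        if attrs[name]["default"] == "#REQUIRED":
            raise ValueError(f"{name} is #REQUIRED in the base DTD")
        del attrs[name]
    return result

def require_attribute(attlists, attribute, element_regexp):
    """REQUIRE: attribute IN: element-regexp"""
    result = copy.deepcopy(attlists)
    for attrs, name in _targets(result, re.escape(attribute), element_regexp):
        attrs[name]["default"] = "#REQUIRED"
    return result

def remove_token(attlists, token, attribute_regexp, element_regexp):
    """REMOVE: token FROM: attribute-regexp IN: element-regexp"""
    result = copy.deepcopy(attlists)
    for attrs, name in _targets(result, attribute_regexp, element_regexp):
        decl = attrs[name]
        if isinstance(decl["type"], list) and token in decl["type"]:
            if decl["default"] == token:
                raise ValueError(f"{token} is the default value of {name}")
            decl["type"].remove(token)
    return result

# Hypothetical attribute lists
attlists = {
    "note": {
        "role": {"type": "CDATA", "default": "#IMPLIED"},
        "class": {"type": ["tip", "warning", "caution"], "default": "tip"},
    },
    "para": {"role": {"type": "CDATA", "default": "#IMPLIED"}},
}
```

Each function returns a modified copy and leaves its input untouched, which makes it easy to keep the base DTD and each customization layer side by side.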

DTD extension

Adding an element

Well, just add its definition, and make sure it conforms to the modularity standards of the base DTD. It will still need to be used somewhere, though.

Element content extension

Like restrictions, extensions of a content model may be quite complex to specify, and "REDEFINE" clauses may be the best way to go in many cases.

Still, if a DTD is properly parametrized, the place you want to add an element or entity to may itself lie within what I'll call a "consistent entity", that is, a part of a content model that is either a pure sequence of |'d elements or a pure sequence of &'d elements. Entities to be included in such a way, and those on the same level in the target consistent entity, should be checked to make sure they don't break the consistency of the target entity.

    ADD CONSISTENT: (element|entity) TO: (element|entity)

Appending or prepending an element to a content model is also easy:

    ADD (APPEND|PREPEND): element TO: element
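A possible shape for the "consistent entity" check is sketched below in Python. Everything here is an assumption for illustration: entities are stored as a dictionary of replacement texts, the `NAME` pattern is a rough approximation of SGML names and parameter entity references, nested references inside the target are not expanded, and a target made of a single name is arbitrarily extended with `|`.

```python
import re

# Rough approximation of an element name, #PCDATA, or a %entity; reference
NAME = r"[#%]?[A-Za-z][\w.-]*;?"

def connector_of(text):
    """Return '|' or '&' for a pure list of names, '' for one name, else None."""
    parts = re.split(r"\s*([|&])\s*", text.strip())
    names, connectors = parts[0::2], set(parts[1::2])
    if len(connectors) > 1 or not all(re.fullmatch(NAME, n) for n in names):
        return None
    return connectors.pop() if connectors else ""

def add_consistent(entities, addition, target):
    """ADD CONSISTENT: addition TO: target, returning an updated copy."""
    target_conn = connector_of(entities[target])
    added_conn = connector_of(entities.get(addition, addition))
    if target_conn is None or added_conn is None:
        raise ValueError("not a consistent entity")
    if target_conn and added_conn and target_conn != added_conn:
        raise ValueError("connector mismatch would break consistency")
    connector = target_conn or added_conn or "|"  # '|' default is an assumption
    reference = f"%{addition};" if addition in entities else addition
    updated = dict(entities)
    updated[target] = f"{entities[target]}{connector}{reference}"
    return updated

# Hypothetical parameter entities
entities = {
    "para.class": "para|simpara",
    "admon.class": "note|warning",
    "fields": "name & email",
    "body": "title, para",
}
```

Here adding `admon.class` to `para.class` is accepted and inserts a reference to the entity, while adding `fields` (an &'d list) to `para.class`, or adding anything to `body` (a ','-sequence), is rejected.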

Symmetric constructs from DTD restriction

    REMOVE EXCLUSION:
    ADD INCLUSION:

Other operations

Element redefinition

Sometimes arbitrary changes must be made to a content model that do not fall under "pure restriction", "pure extension" or other categories, or that are too hard to describe as such, and the content model must be completely rewritten.

    REDEFINE ELEMENT: element AS: content-model
    REDEFINE ENTITY: entity AS: cdata

Element renaming

If element names are parametrized, this is trivial. Otherwise it involves removing the original element, creating a new one with the same content model, and changing the definition of all elements and entities that referenced the original.

    RENAME ELEMENT: element TO: new-name
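For the non-parametrized case, the three steps can be sketched in Python as below. The representation is assumed for the example: element content models and entity replacement texts are kept as plain strings in two dictionaries, and references are found with a regular expression rather than a real parser. The lookarounds keep the substitution from touching longer names (such as `formalpara`) or entity references (such as `%para.class;`).

```python
import re

def rename_element(elements, entities, old, new):
    """RENAME ELEMENT: old TO: new, returning updated copies of both tables."""
    if old not in elements:
        raise KeyError(old)
    if new in elements:
        raise ValueError(f"{new} is already defined")
    # Match `old` only as a whole name, not inside another name or reference
    reference = re.compile(rf"(?<![\w.%#-]){re.escape(old)}(?![\w.;-])")

    def fix(text):
        return reference.sub(lambda match: new, text)

    # The declaration moves to the new name and keeps its content model;
    # every content model and entity text that referenced it is rewritten.
    new_elements = {(new if name == old else name): fix(model)
                    for name, model in elements.items()}
    new_entities = {name: fix(text) for name, text in entities.items()}
    return new_elements, new_entities

# Hypothetical declarations
elements = {
    "section": "(title, (para|%admon.class;)*)",
    "para": "(#PCDATA|emphasis)*",
    "formalpara": "(title, para)",
}
entities = {
    "admon.class": "note|warning",
    "para.class": "para|formalpara",
}
renamed_elements, renamed_entities = rename_element(elements, entities, "para", "p")
```

A real implementation would work on the parsed schema rather than on text, which would make the lookaround tricks unnecessary.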

Element or entity forking

This is the replacement of some of the occurrences of an element by a modified version of the original. It is not unlike element renaming, but somewhat more complex to express.