head 1.7; access; symbols; locks; strict; comment @# @; 1.7 date 2001.09.27.21.46.09; author dwitch; state Exp; branches; next 1.6; 1.6 date 2001.09.25.20.49.23; author dwitch; state Exp; branches; next 1.5; 1.5 date 2001.09.25.20.15.09; author dwitch; state Exp; branches; next 1.4; 1.4 date 2001.09.19.21.06.44; author dwitch; state Exp; branches; next 1.3; 1.3 date 2001.09.19.20.40.27; author dwitch; state Exp; branches; next 1.2; 1.2 date 2001.07.25.23.33.48; author dwitch; state Exp; branches; next 1.1; 1.1 date 2001.07.25.22.47.00; author dwitch; state Exp; branches; next ; desc @@ 1.7 log @- all literallayout's to monospace @ text @
DTD Customizer Yann Dirson
ydirson@@altern.org
September 2001
Rationale for this work
The goal of this project is to make it easy to customize a document-class schema (SGML or XML DTD, W3C, TREX or RELAX XML schema, Thot S schema description, etc.) by specifying, at a high level of abstraction, which changes to make (removing an element or adding a new one, restricting the content model of an element, etc.), and to have an automated tool carry out all the mechanical steps that would be tedious and error-prone if done by hand. This is motivated by the following considerations:
- Popular document classes like DocBook, designed to address a large number of needs, are necessarily too vast for many of those needs. Even so, they cannot be universal, and they sometimes lack a couple of features. Hence the (not new) idea of customizing a well-known document class for a particular application.
- The process of customizing a DTD contains, in addition to the design stage, a costly mechanical sub-process: ensuring that the whole customization consistently implements a change (like the removal of an element, which must be propagated to all elements that reference it). This process requires a good knowledge of the internal structure of the base DTD, as well as good tools that help to acquire this knowledge and to get a complete view of the result of the customization (e.g. dtd* from PerlSGML).
- The needs that led to a customization, or at least the understanding of those needs, will probably evolve with time, leading to multiple revisions of a customization layer. Doing this maintenance by hand has at least the same knowledge requirements as the initial customization, to which we now have to add knowledge of the initial customization itself. The burden grows even more if you have a tree or cascade of customization layers.
- The base DTD for a customization may itself have to evolve with time. In many cases it will be useful to update a customized DTD to the latest revision of the base DTD. Again, this is costly if done manually.
The idea for this software came to me while designing a DocBook customization, when I realized that ideally I would need not a single customization layer but a whole hierarchy of customization layers, which would be very difficult to maintain without such a tool. This tool will most probably be written using the PerlSGML library for handling DTDs.
Types of customization
START of block to rewrite
Although some form of "customization" could be applied to any DTD by simply having the tool write out a new DTD file, this will not be supported: partly because it would make the output of the tool hard to check for correctness, but mostly because I do not consider such DTDs modular enough. The type of customization we talk about here is the one described in "DocBook: The Definitive Guide", in the section describing how to write DocBook customization layers. The features of a DTD that this tool will use are:
- parameter entities used to define the content of elements, as those are redefinable
- conditionally defined DTD chunks (using parameter entities to include/ignore them), as they allow elements themselves to be removed or completely redefined
END of block to rewrite
Most customizations of a document class like DocBook either select a subset, or add domain-specific elements or attributes, or combine the two. In the latter case, using this tool it should be trivial to clearly separate the restricting part of the customization from the extending part. It would even be trivial to maintain several extensions based on a single subset of DocBook (or TEI, or whatever), like the following:
        DocBook
           |
           |
           v
        Subset
         /  \
        /    \
       /      \
      v        v
    DTD-1    DTD-2
Following these considerations, this tool will initially focus on operations that either strictly restrict or strictly extend a DTD. Unfortunately, not all such operations are easy to define formally, and in a number of cases it will be easier for the customization layer designer to specify a restricted or extended content model, and have the tool just ensure that the specification is indeed a subset or a superset. Other operations will be investigated and implemented later, should the need arise. Element renaming, being defined and implemented by really simple means, will surely be an exception.
For both restriction and extension, a declaration should be available to the customization layer author to declare a global pure restriction or extension. This will be used to forbid invalid constructs and to issue warnings for easily misused constructs.
Global modular architecture
The core of the tool will take abstract representations of the document class as input and produce them as output. Those data structures will be defined by a generic schema-handling library, like the (still to come) redesign of the Perl5 PerlSGML modules. The input to the tool core will be produced by a schema parser. Hopefully the above-mentioned redesign of PerlSGML will handle this part of the work, allowing parsers for new schema-description notations to be plugged in. The output of the tool core will have to be transcribed into a usable syntax (which may differ from the input syntax). This will probably be integrated into PerlSGML as well, in a generic way similar to the input layer. Output of differential customization layers, like those manually written for DocBook, may be supported using additional output modules. However, as this way of customizing probably only exists to simplify the manual work this tool hopes to make unnecessary, it is not clear to me at this point why it should be supported, and I don't plan to put any work into this myself.
Core tool design
DTD restriction
Element content restriction
Content restriction is a wide area, and the general case may be difficult to formalize in a usable way, so before focusing on a couple of restrictions that will be easy to formalize, we'll see how the tool can act as a simple checker.
Checking a content model
Content models are mostly defined using regexp-like syntaxes, so we can rely on finite-state machine (FSM) theory and tools. A content model is a regular expression over element names, so the grammar it describes is regular (not merely context-free) and can be modelled by an FSM. We therefore basically need to check that every word accepted by the FSM for the new content model is also accepted by the FSM for the original one. A naive (proof-of-concept) algorithm would be to compute the FSM for the intersection of the two grammars and check that its canonical form is the same as that of the supposed restriction. I hope we can find a one-pass algorithm that would save computing time and be more elegant, but the priority is to get something working.
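As a rough illustration of the check described above, here is a minimal Python sketch. It does not build and intersect explicit FSMs as proposed; it swaps in Brzozowski derivatives, which explore the same pairs of states lazily. The tuple encoding of content models and every name used here (elem, seq, alt, star, opt, is_restriction) are assumptions made for this example and are not part of PerlSGML. The SGML "&" connector and content exceptions are not handled.

```python
EMPTY = ("empty",)   # accepts nothing
EPS = ("eps",)       # accepts only the empty sequence of elements

def elem(name):
    return ("elem", name)

def seq(a, b):
    """Sequence connector (',' in a DTD), with trivial simplifications."""
    if a == EMPTY or b == EMPTY:
        return EMPTY
    if a == EPS:
        return b
    if b == EPS:
        return a
    return ("seq", a, b)

def alt(a, b):
    """Alternative connector ('|'), kept as a flat set so equal models compare equal."""
    items = set()
    for x in (a, b):
        if x == EMPTY:
            continue
        if x[0] == "alt":
            items |= x[1]
        else:
            items.add(x)
    if not items:
        return EMPTY
    if len(items) == 1:
        return next(iter(items))
    return ("alt", frozenset(items))

def star(a):
    if a in (EMPTY, EPS):
        return EPS
    return a if a[0] == "star" else ("star", a)

def opt(a):
    return alt(a, EPS)

def nullable(m):
    """True if the model accepts the empty sequence."""
    tag = m[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "elem"):
        return False
    if tag == "seq":
        return nullable(m[1]) and nullable(m[2])
    return any(nullable(x) for x in m[1])   # alt

def deriv(m, name):
    """Model accepting what may follow once element `name` has been read."""
    tag = m[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "elem":
        return EPS if m[1] == name else EMPTY
    if tag == "star":
        return seq(deriv(m[1], name), m)
    if tag == "seq":
        d = seq(deriv(m[1], name), m[2])
        return alt(d, deriv(m[2], name)) if nullable(m[1]) else d
    result = EMPTY
    for x in m[1]:                           # alt
        result = alt(result, deriv(x, name))
    return result

def names(m):
    """Set of element names mentioned in the model."""
    tag = m[0]
    if tag == "elem":
        return {m[1]}
    if tag in ("empty", "eps"):
        return set()
    if tag == "star":
        return names(m[1])
    if tag == "seq":
        return names(m[1]) | names(m[2])
    return set().union(*(names(x) for x in m[1]))

def is_restriction(new, old):
    """True if every element sequence accepted by `new` is accepted by `old`."""
    alphabet = names(new) | names(old)
    seen, todo = set(), [(new, old)]
    while todo:
        pair = todo.pop()
        if pair in seen or pair[0] == EMPTY:
            continue
        seen.add(pair)
        n, o = pair
        if nullable(n) and not nullable(o):
            return False
        todo.extend((deriv(n, a), deriv(o, a)) for a in alphabet)
    return True
```

With old = seq(elem("title"), star(alt(elem("para"), elem("note")))) and new = seq(elem("title"), star(elem("para"))), calling is_restriction(new, old) reports a valid restriction, while swapping the arguments does not.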
Some kinds of content restriction
Note: the syntax used for examples is only here as a design helper. A formal syntax will be defined later, possibly as XML data.
An apparently quite simple type of content restriction is the removal of one occurrence of an element from a content model. Even that is somewhat difficult to express if we want it to survive future revisions of the parent DTD, the main problem being how to address a single occurrence of an element type (or of a parameter entity) in a content model (including inside parameter entities) when several such occurrences can be found. Simpler to specify is the removal of all occurrences of an element (including pseudo-elements like #PCDATA) from a content model:
Removing all occurrences of an element in a content model
    REMOVE WHICH: all (ELEMENTS|ENTITIES): regexp FROM ELEMENTS: regexp
Even simpler to specify is the pure destruction of an element or entity, and hence its removal from all content models and entities. This is not just a special case of the former, because it also requires undefining the element or entity.
Zapping an element
    ZAP (ELEMENTS|ENTITIES): regexp
However, the very existence of content exceptions makes it difficult to handle changes to parameter entities safely, as those entities can then be used with both additive and subtractive semantics. Their usage in the base DTD (and in the current customization layer) should be checked to be sure of the semantics of the following clause, which would be a pure extension when applied to an entity used for a content exclusion:
    REMOVE WHICH: all (ELEMENTS|ENTITIES): regexp FROM ENTITIES: regexp
For element definitions where content exceptions are not parametrized, or whose parametrization should be broken (more on this later), explicit manipulation may be necessary:
    ADD EXCLUSION:
    REMOVE INCLUSION:
Often an entity is used in the definition of several content models, and we only want to restrict some of those contents. The parametrization then has to be rewritten using a new entity with a "smaller content". These entities can be kept linked (the old one formally being an extension of the new one) or not; this decision will affect further customization layers.
    DESYNC ENTITY: name IN (ELEMENTS|ENTITIES): regexp \
      TO: RESTRICTION NAMED: name LINKED: (YES|NO)
To which a restriction can then be applied:
    REMOVE WHICH: all (ELEMENTS|ENTITIES): regexp FROM ENTITIES: regexp
Note that such a two-step mechanism may have an impact on other design issues. But maybe not, as the following syntax attempts to demonstrate:
    ...
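To make the "remove all occurrences" clause more concrete, here is a small Python sketch of what the core might do for a single content model. The nested-tuple representation (an operator name followed by its children, with element names as plain strings) and the function name remove_element are assumptions made for illustration only. Note that dropping a mandatory member of a sequence does not yield a restriction, so the result would still have to go through the content-model checker.

```python
def remove_element(model, name):
    """Return `model` with every occurrence of element `name` removed.

    `model` is either an element name (str) or a tuple (op, child, ...),
    where op is "seq", "alt", "opt", "star" or "plus" (hypothetical encoding).
    Returns None when nothing is left of the model.
    """
    if isinstance(model, str):
        return None if model == name else model
    op, *children = model
    if op in ("seq", "alt"):
        kept = [c for c in (remove_element(ch, name) for ch in children)
                if c is not None]
        if not kept:
            return None
        if len(kept) == 1:
            return kept[0]          # a group of one needs no connector
        return (op, *kept)
    # Occurrence indicators ("opt", "star", "plus") wrap exactly one child.
    inner = remove_element(children[0], name)
    return None if inner is None else (op, inner)


# (title, (para | note | #PCDATA)*) without "note"
model = ("seq", "title", ("star", ("alt", "para", "note", "#PCDATA")))
restricted = remove_element(model, "note")
```

A ZAP clause could then be sketched as applying remove_element to every content model and entity, followed by deleting the definition itself.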
Attribute restriction
Note: these customizations depend on the base DTD using parameter entities for attribute definitions.
This is not unlike element restriction, but much simpler, as attributes are not referenced in places as complex as content models, and can occur only once within one element definition.
- remove an #implied or defaulted attribute
    REMOVE: attribute FROM: element-regexp
- make required an #implied or defaulted attribute
    REQUIRE: attribute IN: element-regexp
On the other hand, attributes also have some sort of "content models" which, even if simpler than an element's content model, can also be subject to customization.
- allow fewer tokens
    REMOVE: token FROM: attribute-regexp IN: element-regexp
- change CDATA to tokens
  This may require some additional care, and SGML attribute minimization should surely be turned off in this case if we want a document using the customization layer to be parsable as an instance of the base DTD.
    TOKENIZE: attribute-regexp FROM: element-regexp AS: token-model
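The attribute clauses listed above can be sketched in Python as follows. The dictionary layout for an attribute list (name mapped to a declared type and a default) and the three function names are hypothetical, chosen only to show which checks keep each operation a pure restriction.

```python
def remove_attribute(attlist, name):
    """REMOVE: only #IMPLIED or defaulted attributes may be dropped."""
    if attlist[name]["default"] == "#REQUIRED":
        raise ValueError(f"{name} is #REQUIRED in the base DTD")
    return {k: v for k, v in attlist.items() if k != name}


def require_attribute(attlist, name):
    """REQUIRE: turn an #IMPLIED or defaulted attribute into a #REQUIRED one."""
    updated = dict(attlist)
    updated[name] = {**attlist[name], "default": "#REQUIRED"}
    return updated


def remove_token(attlist, name, token):
    """REMOVE token: shrink an enumerated attribute's list of allowed tokens."""
    decl = attlist[name]
    if not isinstance(decl["type"], tuple):
        raise ValueError(f"{name} is not an enumerated attribute")
    if token not in decl["type"]:
        raise ValueError(f"{token} is not an allowed token of {name}")
    if decl["default"] == token:
        raise ValueError(f"{token} is the default value of {name}")
    tokens = tuple(t for t in decl["type"] if t != token)
    if not tokens:
        raise ValueError(f"no token would be left for {name}")
    updated = dict(attlist)
    updated[name] = {**decl, "type": tokens}
    return updated


# Example attribute list for some element (made-up declarations).
attlist = {
    "role": {"type": "CDATA", "default": "#IMPLIED"},
    "align": {"type": ("left", "center", "right"), "default": "left"},
    "id": {"type": "ID", "default": "#REQUIRED"},
}
```

Each function returns a new attribute list and leaves the base one untouched, which fits the idea of layering customizations on top of an unchanged base DTD.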
DTD extension
adding an element
Well, just add its definition, and make sure it conforms to the modularity standards of the base DTD. It will still need to be used somewhere, though.
Element content extension
As with restrictions, extensions of a content model may be quite complex to specify, and "REDEFINE" clauses may be the best way to go in many cases. Still, if a DTD is properly parametrized, it may be that the place you want to add an element or entity to is itself within what I'll call a "consistent entity", that is, a part of a content model that is either a pure sequence of |'d elements or a pure sequence of &'d elements. Entities to be included in such a way, and those on the same level in the target consistent entity, should be checked to be sure they don't break the consistency of the target entity.
    ADD CONSISTENT: (element|entity) TO: (element|entity)
Appending or prepending an element to a content model is also easy:
    ADD (APPEND|PREPEND): element TO: element
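The consistency check for "ADD CONSISTENT" could look roughly like the Python sketch below. The encoding is an assumption made for this example: each parameter entity maps to a tuple whose first item is the connector and whose remaining items are element names, "%name;" references, or nested groups; the function names connector and add_consistent are likewise hypothetical.

```python
def connector(entities, name, _seen=()):
    """Return "|" or "&" if entity `name` is a consistent entity, else None.

    An entity is taken to be consistent when it is a flat group using a
    single "|" or "&" connector whose members are element names or references
    to consistent entities using the same connector.
    """
    conn, *members = entities[name]
    if conn not in ("|", "&") or name in _seen:
        return None                      # wrong connector, or circular reference
    for member in members:
        if isinstance(member, tuple):    # nested group: not a pure sequence
            return None
        if member.startswith("%"):
            if connector(entities, member[1:-1], _seen + (name,)) != conn:
                return None
    return conn


def add_consistent(entities, member, target):
    """ADD CONSISTENT: append `member` to `target`, keeping it consistent."""
    conn = connector(entities, target)
    if conn is None:
        raise ValueError(f"{target} is not a consistent entity")
    updated = dict(entities)
    updated[target] = entities[target] + (member,)
    if connector(updated, target) != conn:
        raise ValueError(f"adding {member} would break the consistency of {target}")
    return updated


# Made-up entities: two "|" groups and one "," sequence.
entities = {
    "list.class": ("|", "itemizedlist", "orderedlist"),
    "para.class": ("|", "para", "%list.class;"),
    "mixed": (",", "title", "para"),
}
```

Here add_consistent(entities, "note", "para.class") succeeds, while targeting "mixed" or adding "%mixed;" to "para.class" raises a ValueError.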
symmetric constructs from DTD restriction
    REMOVE EXCLUSION:
    ADD INCLUSION:
Other operations
element redefinition
Sometimes arbitrary changes must be made to a content model that do not fall under "pure restriction", "pure extension" or other categories, or that are too hard to describe as such, and the content model must be completely rewritten.
    REDEFINE ELEMENT: element AS: content-model
    REDEFINE ENTITY: entity AS: cdata
element renaming
If element names are parametrized, this is trivial. Otherwise it involves removing the original element, creating a new one with the same content model, and changing the definition of all elements and entities that referenced the original.
    RENAME ELEMENT: element TO: new-name
element or entity forking
This is the replacement of some of the occurrences of an element by a modified version of the original. It is not unlike element renaming, but somewhat more complex to express.
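For the non-parametrized case of element renaming described above, the three steps (remove the original, create a new element with the same content model, update every reference) can be sketched in Python as below. The dictionary-based DTD representation and the function name rename_element are assumptions made for this illustration, not an existing PerlSGML interface.

```python
def rename_element(dtd, old, new):
    """RENAME ELEMENT: return a copy of `dtd` where element `old` is called `new`.

    `dtd` is a dict with two keys, "elements" and "entities", each mapping a
    name to a content model. A content model is an element name (str) or a
    tuple (op, child, ...). This layout is hypothetical.
    """
    if old not in dtd["elements"]:
        raise KeyError(f"no such element: {old}")
    if new in dtd["elements"]:
        raise ValueError(f"element {new} already exists")

    def fix(model):
        # Replace every reference to `old` inside a content model.
        if isinstance(model, str):
            return new if model == old else model
        op, *children = model
        return (op, *(fix(child) for child in children))

    return {
        "elements": {(new if name == old else name): fix(model)
                     for name, model in dtd["elements"].items()},
        "entities": {name: fix(model)
                     for name, model in dtd["entities"].items()},
    }


# Made-up miniature DTD.
dtd = {
    "elements": {
        "section": ("seq", "title", ("star", "para")),
        "para": ("star", ("alt", "#PCDATA", "emphasis")),
        "title": ("star", "#PCDATA"),
        "emphasis": ("star", "#PCDATA"),
    },
    "entities": {"block.class": ("alt", "para", "section")},
}
renamed = rename_element(dtd, "para", "p")
```

Forking would need the same traversal, but with a way to select which occurrences are rewritten, which is exactly the addressing problem noted for single-occurrence removal.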
@ 1.6 log @- tagged as DocBook 4.1 SGML @ text @d239 1 a239 1 d261 1 a261 1 d286 1 a286 1 d307 1 a307 1 d332 1 a332 1 d339 1 a339 1 d347 1 a347 1 d390 1 a390 1 d401 1 a401 1 d415 1 a415 1 d450 1 a450 1 d460 1 a460 1 d471 1 a471 1 d493 1 a493 1 d511 1 a511 1 @ 1.5 log @- start to investigate an XML syntax @ text @d1 1 a1 2 DTD Customizer ============== d3 116 a118 77 Rationale for this work ======================= The goal of this project is to allow one to easily customize a document-class schema (SGML or XML DTD, W3C or TREX or RELAX XML schema, Thot S schema description, etc.), by specifying at a high level of abstraction what changes to do (like: removing an element or adding a new one, restricting the content model of an element, etc.), and to have an automated tool that will do all the necessary mechanic steps that would be fastidious and error-prone if done by hand. This is because of the following considerations: - popular document-class like DocBook, designed to address a large number of needs, are necessary too vast for a number of those needs. Also, despite this they can't be universal, and sometimes they miss a couple of features. Hence the (not new) idea of customizing a well-known document-class for a particular application. - the process of customizating a DTD, in addition to the design stage, contains a costly mechanical sub-process, namely ensuring that the whole customization consistently implements a change (like the removal of an element, that must be repercuted on all elements that reference it). This process requires a good knowledge of the internal structure of the base DTD, as well as good tools which can help to get this knowledge and to get a complete view of the result of the customization (eg. dtd* from PerlSGML). - the needs that led to a customization, or at least the understanding of those needs, will probably evolve with time, leading to multiple revisions of a customization layer. 
Doing this maintainance by hand has at least the same knowledge requirements than the initial customization, to which we now have to add the knowledge of the initial customization. It grows even more if you have a tree/cascade of customization layers. - the base DTD for a customization may itself have to evolve with time. In many cases it will be useful to update a customized DTD to the latest revision of the base DTD. Again, this is costly if done manually. The idea for this software came to me while designing a DocBook customization, realizing that ideally I would need not one single customization layer, but really a hierarchy of customization layers, that would be very difficult to maintain without such a tool. This tool most probably be written using the PerlSGML library for handling DTDs. Types of customization ====================== Although some "customization" could be applied to all DTDs, by just having the tool to rewrite a new DTD file, this won't be supported, partly because it would make it hard to check the correctness of the output of the tool, but mostly because I do not consider such DTDs as modular enough. The type of customization we talk about here is the one described in "DocBook - the definitive guide", in the section describing how to write DocBook customization layers. The features of a DTD that this tool will use are: - parameter entities used to define elements' content, as those are redefinable - conditionally defined DTD chunks (using parameter entities to include/ignore those), as it allows elements themselves to be removed or completely redefined Most customizations of a document-class like DocBook are either selecting a subset, or adding domain-specific elements or attributes, or a combination of those. In the latter case, using this tool it should trivial to clearly separate the restricting part of the customization from the extending part. 
It would even be trivial to maintain several extentions that would be based on a single subset of DocBook (or TEI, or whatever), like the following: d120 1 d131 1 d133 108 d242 1 a242 95 Following these considerations, this tool will initially focus on operations that either strictly restrict, or strictly extend a DTD. Unfortunately, all such operations are not easy to define formally, and in a number of cases it will be easier for the customization layer designer to specify a restricted or extented content model, and have the tool just ensure that the specification is either a subset or a superset. Other operations will be investigated and implemented later, should the need arise. Element renaming, being defined and implemented be really simple means, will surely be an exception. Both for restriction and extension, a declaration should be available to the customization layer author to declare a global pure restriction or extension. This will be used to forbid invalid constructs, and issue warning for easily mis-used constructs. Global modular architecture =========================== The core of the tool will take as input and output abstract representations of the document-class. Those data structures will be defined by a generic schema-handling library, like the (still to come) redesign of the Perl5 PerlSGML modules. The input to this tool core will be produced by a schema parser. Hopefully the above-mentionned redesign of PerlSGML will handle this part of the work, allowing to plug parsers for new notations for schema descriptions. The output of the tool core will have to be transcribed into a usable syntax (which may be other than the input syntax). This will probably be integrated into PerlSGML as well, in a generic way similar to the input layer. Output of differential customization layers, like those manually written for DocBook, may be supported using additional output modules. 
However, as this way of customization probably only exists to simplify the manual work this tool hopes to render useless, it's not clear to me at this point why it should be supported, and I don't plan to put any work on this myself. ================ CORE TOOL DESIGN ================ DTD restriction =============== Element content restriction --------------------------- Content restriction is a wide area, and the general case may be difficult to formalize in a usable way, so before focussing on a couple of restrictions that will be easy to formalise, we'll see how the tool can act as a simple checker. Checking a content model ------------------------ Content models are mostly defined using regexp-like syntaxes, so we can use the finite-state machines (FSM) theory and tools. A content-model is a (context-free ? should reread the theory...) grammar that can be modelled by a FSM. Thus we basically need to check that all words accepted by the FSM for the new content-model are accepted by the FSM for the original one. A naive (and proof-of-concept) algorithm would be to compute the FSM for the intersection of those grammars, and check that its canonical form is the same as the supposed-restriction's. I hope we can find a one-pass algorithm that would save computing time and be more elegant, but well, the priority is to get something to work. Some kinds of content restriction --------------------------------- Note: the syntax used for examples is only here as a design helper. A formal syntax will be defined later, possibly as XML data. An apparently quite simple type of content restriction is probably the removal an occurence of an element in a content model. Even that is somewhat difficult to express if we want it to survive to future revisions of the parent DTD, the main problem being to address a single occurence of an element type (or of a parameter entity) in a content model (including inside parameter entities), when several such occurences can be found. 
Simpler to specify is the removal of all occurences of an element (including pseudo-elements like #PCDATA) from a content model: REMOVE WHICH: all (ELEMENTS|ENTITIES): FROM ELEMENTS: d249 14 d264 1 a264 7 Even simpler to specify is the pure destruction of an element or entity, and hence its removal from all content models and entities. This is not just a specific case of the former, because it requires to undefine the element or entity. ZAP (ELEMENTS|ENTITIES): d271 17 d289 1 a289 10 However, the very existence of content exceptions make it difficult to handle changes to parameter entities in a secure way, as those entities can then be used both with additive and substractive semantics. Their usage in the base DTD (and in the current customization layer) should be checked to be sure of the semantics of the following clause, which would be a pure extension when applied to an entity used for a content exclusion: REMOVE WHICH: all (ELEMENTS|ENTITIES): FROM ENTITIES: d296 12 a307 5 For element definitions where content exceptions are not parametrized, or whose parametrization should be broken (more on this later), explicit manipulation may be necessary: d311 1 d318 28 d347 1 a347 18 Often an entity is used in the definition of several content models, and we only want to restrict some of those contents. Then the parametrization has to rewritten using a new entity with a "smaller content". These entities can be kept linked (the old one being an formally extension of the new one), or not - such decision will impact further customisation layers. DESYNC ENTITY: IN (ELEMENTS|ENTITIES): \ TO: RESTRICTION NAMED: LINKED: (YES|NO) To which a restriction can then be applied: REMOVE WHICH: all (ELEMENTS|ENTITIES): FROM ENTITIES: Note that such a two-step mechanism may have an impact on other design issues. 
But maybe not, as the following syntax attempts to demonstrate: d363 109 a471 66 Attribute restriction --------------------- Note: these customizations depend on the base DTD using parameter entities for attributes definition. This is not unlike element restriction, but much simpler, as attributes are not referenced in such complex places like content models, and can occur only once within one element definition. - remove an #implied or defaulted attribute REMOVE: FROM: - make required an #implied or defaulted attribute REQUIRE: IN: OTOH, attributes also have some sort of "content models" which, even if simpler that an element's content model, can also be subject to customization. - allow less tokens: REMOVE: FROM: IN: - change CDATA to tokens. This may require some additional care, and SGML attribute minimization should surely be turned off in this case, if we want a doc using the customization layer to be parsable as an instance of the base DTD. TOKENIZE: FROM: AS: DTD extention ============= adding an element ----------------- Well, just add its definition, and make sure it conforms with the modularity standards of the base DTD. It will still need to be used somewhere, though. Element content extention ------------------------- As well as restrictions, extentions of a content model may be quite complex to specify, and "REDEFINE" clauses may be the best way to go in many cases. Still, if a DTD is properly parametrized, it may be that the place you want to add an element/entity to is itself within what I'll name a "consistent entity", that is a part of a content model that is either a pure sequence of |'d elements or a pure sequence of &'d elements. Entities to be included in such a way, and those on the same level in the target consistent entity, should be safe-checked to be sure they don't break the consistency of the target entity. 
ADD CONSISTENT: <(element|entity)> TO: <(element|entity)> Appending or prepending an element to a content model is also easy: ADD (APPEND|PREPEND): TO: symetric constructs from DTD restriction ---------------------------------------- d474 60 a533 29 Other operations ================ - element redefinition Sometimes arbitrary changes to a content model must be done, that do not fall under "pure restriction" or "pure extension" or other categories, or that are too hard to describe as such, and the content model must be completely rewritten. REDEFINE ELEMENT: AS: REDEFINE ENTITY: AS: - element renaming If element names are parametrized, this is trvial. Otherwise it involves removing the original element, creating a new one with the same content model, and changing the definition of all elements and entities that referenced the original. RENAME ELEMENT: TO: - element or entity forking This is a replacement of some of the occurences of an element by a modification of the original. It is not unlike element renaming, but somewhat more complex to express. @ 1.4 log @- content-model checking - update of content-model restriction @ text @d189 7 d203 7 d220 7 d234 7 d250 24 @ 1.3 log @- Rework of the global architecture @ text @d150 33 a182 11 difficult to formalize in a usable way, so we'll focus at first on a couple of restrictions. It should be noted that the possibility of restricting the content-model is highly dependant on the DTD. A quite simple type of content restriction is probably the removal an occurence of an element in a content model. Even that is somewhat difficult to express if we want it to survive to future revisions of the parent DTD, the main problem being to address a single occurence of an element type (or of a parameter entity) in a content model (including inside parameter entities), when several such occurences can be found. 
d197 6 a202 6 handle changes to parameter entities in a secure way, as those entities can then be used both with additive and substractive semantics. Their usage in the base DTD (and in the current customization layer) should be checked to be sure of the semantics of the following clause, which would be an part of a pure extension when applied to an entity used for a content exclusion: d207 1 d213 9 @ 1.2 log @more on modularity standards @ text @d4 10 a13 6 The goal of this project is to allow one to easily customize a SGML or XML DTD, by specifying at a high level of abstraction what changes to do (like: removing an element or adding a new one, restricting the content model of an element, etc.), and to have a mechanic tool that will do all the necessary steps that would be fastidious and error-prone if done by hand. d17 27 a43 18 - the process of customizating a DTD contains a costly mechanical sub-process, namely ensuring that the whole customization is consistently implements a change (like the removal of an element, that must be repercuted on all elements that reference it). This process requires good tools (eg. dtd* from PerlSGML), and a good knowledge of the internal structure of the base DTD. - the needs that led to a customization, or at least the understanding of those needs, will probably evolve with time, leading to multiple revisions of a customization layer. Doing this maintainance by hand has at least the same knowledge requirements than the initial customization, to which we now have to add the knowledge of the initial customization. - the base DTD for a customization may itself have to evolve with time. In many cases it will be useful to update a customized DTD to the latest revision of the base DTD. Again, this is costly if done manually. d50 1 a50 4 This tool will be most useful with a DTD that is designed for easy customization, like DocBook. 
It will most probably be written using the PerlSGML library for d57 1 d72 1 d74 7 a80 7 Most customizations of a DTD like DocBook are either selecting a subset, or adding domain-specific elements, or a combination of those. In the latter case, using this tool it should trivial to clearly separate the restricting part of the customization from the extending part. It would even be trivial to maintain several extentions that would be based on a single subset of DocBook (or TEI, or whatever), like the following: d96 6 d107 2 a108 2 to the customization layer author to declare a pure restriction or a pure extension. This will be used to forbid invalid constructs, and d112 2 a113 2 Modularity standards ==================== d115 26 a140 8 Modularity standards of the base DTD to be used may need to be formally described. This may need insight into DTDs other than DocBook, which I do not have right now. Help needed here. I should probably have a look at TEI. My raw idea for now is to use flags to tell whether a feature is present or not, and/or to parse the DTD to detect some features (like whether an element is in its own incluse/ignore section). @ 1.1 log @Initial revision @ text @d93 4 d101 4 @