head 1.7;
access;
symbols;
locks; strict;
comment @# @;
1.7
date 2001.09.27.21.46.09; author dwitch; state Exp;
branches;
next 1.6;
1.6
date 2001.09.25.20.49.23; author dwitch; state Exp;
branches;
next 1.5;
1.5
date 2001.09.25.20.15.09; author dwitch; state Exp;
branches;
next 1.4;
1.4
date 2001.09.19.21.06.44; author dwitch; state Exp;
branches;
next 1.3;
1.3
date 2001.09.19.20.40.27; author dwitch; state Exp;
branches;
next 1.2;
1.2
date 2001.07.25.23.33.48; author dwitch; state Exp;
branches;
next 1.1;
1.1
date 2001.07.25.22.47.00; author dwitch; state Exp;
branches;
next ;
desc
@@
1.7
log
@- all literallayout's to monospace
@
text
@
DTD Customizer
Yann
Dirson
ydirson@@altern.org
September 2001
Rationale for this work
The goal of this project is to allow one to easily customize
a document-class schema (SGML or XML DTD, W3C or TREX or RELAX XML
schema, Thot S schema description, etc.), by specifying at a high
level of abstraction what changes to make (like: removing an
element or adding a new one, restricting the content model of an
element, etc.), and to have an automated tool that will do all the
necessary mechanical steps that would be tedious and error-prone
if done by hand.
This is because of the following considerations:
- popular document classes like DocBook, designed to address
a large number of needs, are necessarily too vast for many
of those needs. Yet they cannot be universal, and sometimes
they lack a couple of features. Hence the (not new) idea of
customizing a well-known document class for a particular
application.
- the process of customizing a DTD, in addition to the
design stage, contains a costly mechanical sub-process, namely
ensuring that the whole customization consistently implements
a change (like the removal of an element, which must be
propagated to all elements that reference it). This process
requires a good knowledge of the internal structure of the
base DTD, as well as good tools to help acquire this
knowledge and to get a complete view of the result of the
customization (e.g. dtd* from PerlSGML).
- the needs that led to a customization, or at least the
understanding of those needs, will probably evolve with time,
leading to multiple revisions of a customization layer. Doing
this maintenance by hand requires at least as much knowledge
as the initial customization did, plus knowledge of the
initial customization itself. The cost grows even more if you
have a tree or cascade of customization layers.
- the base DTD for a customization may itself have to
evolve with time. In many cases it will be useful to update a
customized DTD to the latest revision of the base DTD. Again,
this is costly if done manually.
The idea for this software came to me while designing a
DocBook customization, when I realized that ideally I would need
not one single customization layer but a hierarchy of
customization layers, which would be very difficult to maintain
without such a tool.
This tool will most probably be written using the PerlSGML
library for handling DTDs.
Types of customization
START of block to rewrite
Although some "customization" could be applied to any DTD,
simply by having the tool write out a new DTD file, this won't be
supported, partly because it would make the correctness of the
tool's output hard to check, but mostly because I do not consider
DTDs that leave no other option to be modular enough. The type of
customization we talk about here is the one described in "DocBook:
The Definitive Guide", in the section describing how to write
DocBook customization layers.
The features of a DTD that this tool will rely on are:
- parameter entities used to define elements' content,
as those can be redefined
- conditionally defined DTD chunks (using parameter
entities to include or ignore them), as they allow elements
themselves to be removed or completely redefined
END of block to rewrite
Most customizations of a document class like DocBook either
select a subset, add domain-specific elements or attributes, or
combine the two. In the latter case, this tool should make it
trivial to clearly separate the restricting part of the
customization from the extending part. It would even be trivial
to maintain several extensions based on a single subset of
DocBook (or TEI, or whatever), like the following:
        DocBook
           |
           |
           v
         Subset
          /  \
         /    \
        /      \
       v        v
     DTD-1    DTD-2
Following these considerations, this tool will initially
focus on operations that either strictly restrict or strictly
extend a DTD. Unfortunately, not all such operations are easy to
define formally, and in a number of cases it will be easier for
the customization layer designer to specify a restricted or
extended content model, and have the tool just verify that the
specification is indeed a subset or a superset of the original.
Other operations will be investigated and implemented later,
should the need arise. Element renaming, being defined and
implemented by really simple means, will surely be an
exception.
Both for restriction and extension, a declaration should be
available to the customization layer author to declare a layer
globally as a pure restriction or a pure extension. This will be
used to forbid invalid constructs, and to issue warnings for
easily misused constructs.
Global modular architecture
The core of the tool will take as input and output abstract
representations of the document-class. Those data structures will
be defined by a generic schema-handling library, like the (still
to come) redesign of the Perl5 PerlSGML modules.
The input to this tool core will be produced by a schema
parser. Hopefully the above-mentioned redesign of PerlSGML will
handle this part of the work, allowing parsers for new schema
notations to be plugged in.
The output of the tool core will have to be transcribed into
a usable syntax (which may be other than the input syntax). This
will probably be integrated into PerlSGML as well, in a generic
way similar to the input layer.
Output of differential customization layers, like those
manually written for DocBook, may be supported using additional
output modules. However, as this way of customizing probably
only exists to simplify the manual work this tool hopes to make
unnecessary, it's not clear to me at this point why it should be
supported, and I don't plan to put any work into this myself.
Core tool design
DTD restriction
Element content restriction
Content restriction is a wide area, and the general case
may be difficult to formalize in a usable way, so before
focusing on a couple of restrictions that will be easy to
formalize, we'll see how the tool can act as a simple
checker.
Checking a content model
Content models are mostly defined using regexp-like
syntaxes, so we can use finite-state machine (FSM) theory
and tools. A content model is a regular expression over
element names, so the language it describes is regular and
can be recognized by an FSM.
Thus we basically need to check that all words
accepted by the FSM for the new content model are accepted
by the FSM for the original one. A naive (and
proof-of-concept) algorithm would be to compute the FSM for
the intersection of the two languages, and check that its
canonical (minimal deterministic) form is the same as the
supposed restriction's. I hope we can find a one-pass
algorithm that would save computing time and be more
elegant, but well, the priority is to get something to
work.
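The inclusion test can be sketched without building the two
automata explicitly. The following Python sketch is only an
illustration, not the tool's design: it assumes content models
have already been parsed into nested tuples, it covers only
sequence, alternation, optional and repetition (no & connector,
no inclusion or exclusion exceptions), and it swaps in Brzozowski
derivatives for the explicit FSM construction. Every function
name here is made up for the example.

```python
EMPTY, EPS = ("empty",), ("eps",)   # no word at all / the empty word


def sym(name):
    return ("sym", name)


def seq(a, b):
    if EMPTY in (a, b):
        return EMPTY
    if a == EPS:
        return b
    if b == EPS:
        return a
    return ("seq", a, b)


def alt(*items):
    # Flatten, deduplicate and sort so that equal languages written
    # differently collapse to one tuple (needed for termination).
    flat = set()
    for item in items:
        if item[0] == "alt":
            flat.update(item[1])
        elif item != EMPTY:
            flat.add(item)
    if not flat:
        return EMPTY
    if len(flat) == 1:
        return next(iter(flat))
    return ("alt", tuple(sorted(flat, key=repr)))


def star(a):
    if a in (EMPTY, EPS):
        return EPS
    return a if a[0] == "star" else ("star", a)


def opt(a):
    return alt(a, EPS)


def plus(a):
    return seq(a, star(a))


def nullable(r):
    """True if the model accepts empty content."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return any(nullable(x) for x in r[1])       # alt


def deriv(r, s):
    """Model for what may follow after reading element s."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == s else EMPTY
    if tag == "seq":
        first = seq(deriv(r[1], s), r[2])
        return alt(first, deriv(r[2], s)) if nullable(r[1]) else first
    if tag == "alt":
        return alt(*(deriv(x, s) for x in r[1]))
    return seq(deriv(r[1], s), r)               # star


def symbols(r):
    tag = r[0]
    if tag == "sym":
        return {r[1]}
    if tag == "seq":
        return symbols(r[1]) | symbols(r[2])
    if tag == "alt":
        return set().union(*(symbols(x) for x in r[1]))
    return symbols(r[1]) if tag == "star" else set()


def is_restriction(new, orig):
    """True if every word accepted by `new` is accepted by `orig`."""
    alphabet = symbols(new) | symbols(orig)
    seen, todo = set(), [(new, orig)]
    while todo:
        pair = todo.pop()
        if pair in seen:
            continue
        seen.add(pair)
        n, o = pair
        if nullable(n) and not nullable(o):
            return False            # a word of `new` that `orig` rejects
        if n == EMPTY:
            continue                # `new` accepts nothing further here
        todo.extend((deriv(n, s), deriv(o, s)) for s in alphabet)
    return True


# (title, (para | list)+) restricted to (title, para+)
orig = seq(sym("title"), plus(alt(sym("para"), sym("list"))))
new = seq(sym("title"), plus(sym("para")))
print(is_restriction(new, orig))    # True
print(is_restriction(orig, new))    # False
```

The search walks both models in lockstep and stops at the first
word the supposed restriction accepts but the original rejects,
which is close in spirit to the one-pass algorithm hoped for
above.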
Some kinds of content restriction
Note: the syntax used for examples is only here as a
design helper. A formal syntax will be defined later,
possibly as XML data.
An apparently quite simple type of content restriction
is the removal of one occurrence of an element from a
content model. Even that is somewhat difficult to express
if we want it to survive future revisions of the parent
DTD, the main problem being how to address a single
occurrence of an element type (or of a parameter entity) in
a content model (including inside parameter entities) when
several such occurrences can be found.
Simpler to specify is the removal of all occurrences of
an element (including pseudo-elements like #PCDATA) from a
content model:
Removing all occurrences of an element in a content
model
REMOVE WHICH: all (ELEMENTS|ENTITIES): regexp FROM ELEMENTS: regexp
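One possible reading of this clause, sketched in Python under
stated assumptions: content models are nested tuples (the shape
and the name remove_element are hypothetical), and "removing" an
element means replacing each of its occurrences by the empty
language and simplifying, so that the result stays a pure
restriction. A model that collapses to EMPTY tells us the element
was required there, which the tool would presumably have to
report rather than silently accept.

```python
EMPTY, EPS = ("empty",), ("eps",)   # unsatisfiable / empty content


def remove_element(model, name):
    """Return `model` with every occurrence of `name` made impossible."""
    tag = model[0]
    if tag == "sym":
        return EMPTY if model[1] == name else model
    if tag == "seq":
        parts = [remove_element(p, name) for p in model[1]]
        if EMPTY in parts:              # a required part can never match
            return EMPTY
        parts = [p for p in parts if p != EPS]
        if not parts:
            return EPS
        return parts[0] if len(parts) == 1 else ("seq", tuple(parts))
    if tag == "alt":
        parts = [remove_element(p, name) for p in model[1]]
        parts = [p for p in parts if p != EMPTY]    # drop dead branches
        if not parts:
            return EMPTY
        return parts[0] if len(parts) == 1 else ("alt", tuple(parts))
    if tag in ("opt", "star"):          # x? and x* still allow nothing
        inner = remove_element(model[1], name)
        return EPS if inner in (EMPTY, EPS) else (tag, inner)
    if tag == "plus":                   # x+ needs at least one x
        inner = remove_element(model[1], name)
        return inner if inner in (EMPTY, EPS) else ("plus", inner)
    return model                        # EMPTY or EPS


# (title, (para | note)*)
model = ("seq", (("sym", "title"),
                 ("star", ("alt", (("sym", "para"), ("sym", "note"))))))
print(remove_element(model, "note"))
# ('seq', (('sym', 'title'), ('star', ('sym', 'para'))))
print(remove_element(model, "title") == EMPTY)   # True: title was required
```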
Even simpler to specify is the outright destruction of an
element or entity, and hence its removal from all content
models and entities. This is not just a special case of
the former, because it also requires undefining the element
or entity.
Zapping an element
ZAP (ELEMENTS|ENTITIES): regexp
However, the very existence of content exceptions makes
it difficult to handle changes to parameter entities
safely, as those entities can then be used with both
additive and subtractive semantics. Their usage in the
base DTD (and in the current customization layer) should be
checked to be sure of the semantics of the following clause,
which would be a pure extension when applied to an entity
used for a content exclusion:
REMOVE WHICH: all (ELEMENTS|ENTITIES): regexp FROM ENTITIES: regexp
For element definitions where content exceptions are
not parametrized, or whose parametrization should be broken
(more on this later), explicit manipulation may be
necessary:
ADD EXCLUSION:
REMOVE INCLUSION:
Often an entity is used in the definition of several
content models, and we only want to restrict some of those
contents. Then the parametrization has to be rewritten using
a new entity with a "smaller content". These entities can be
kept linked (the old one being formally an extension of the
new one), or not; this decision will impact further
customization layers.
DESYNC ENTITY: name IN (ELEMENTS|ENTITIES): regexp \
TO: RESTRICTION NAMED: name LINKED: (YES|NO)
To which a restriction can then be applied:
REMOVE WHICH: all (ELEMENTS|ENTITIES): regexp FROM ENTITIES: regexp
Note that such a two-step mechanism may have an
impact on other design issues. But maybe not, as the
following syntax attempts to demonstrate:
>
>
>
...
>
>
>
]]>
Attribute restriction
Note: these customizations depend on the base DTD using
parameter entities for attribute definitions.
This is not unlike element restriction, but much
simpler, as attributes are not referenced in places as
complex as content models, and can occur only once within one
element definition.
- remove an #IMPLIED or defaulted attribute
REMOVE: attribute FROM: element-regexp
- make an #IMPLIED or defaulted attribute required
REQUIRE: attribute IN: element-regexp
On the other hand, attributes also have some sort of "content
models" which, even if simpler than an element's content model,
can also be subject to customization.
- allow fewer tokens
REMOVE: token FROM: attribute-regexp IN: element-regexp
- change CDATA to tokens
This may require
some additional care, and SGML attribute minimization should
surely be turned off in this case, if we want a document using
the customization layer to be parsable as an instance of the
base DTD.
TOKENIZE: attribute-regexp FROM: element-regexp AS: token-model
DTD extension
adding an element
Well, just add its definition, and make sure it conforms
with the modularity standards of the base DTD. It will still
need to be used somewhere, though.
Element content extension
Like restrictions, extensions of a content model
may be quite complex to specify, and "REDEFINE" clauses may be
the best way to go in many cases. Still, if a DTD is properly
parametrized, it may be that the place you want to add an
element/entity to is itself within what I'll call a
"consistent entity", that is, a part of a content model that
is either a pure sequence of |'d elements or a pure sequence
of &'d elements. Entities to be included in such a way, and
those on the same level in the target consistent entity,
should be checked to make sure they don't break the
consistency of the target entity.
ADD CONSISTENT: (element|entity) TO: (element|entity)
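The consistency check described above could look roughly like the
following Python sketch. It is an assumption-laden illustration:
entities are given as a plain dictionary from name to replacement
text, the entity names and contents in the example are
hypothetical, and connector_of and consistent_connector are names
invented here. An entity counts as consistent when it is a flat
list of names or entity references joined by a single connector,
and every entity it references is consistent with that same
connector.

```python
import re

# An element name, #PCDATA, or a parameter entity reference.
ITEM = r"(?:%[A-Za-z][\w.-]*;|#PCDATA|[A-Za-z][\w.-]*)"


def connector_of(text):
    """'|' or '&' for a flat group, 'either' for one item, else None."""
    if re.fullmatch(rf"\s*{ITEM}\s*", text):
        return "either"
    for conn in ("|", "&"):
        flat = rf"\s*{ITEM}(?:\s*{re.escape(conn)}\s*{ITEM})+\s*"
        if re.fullmatch(flat, text):
            return conn
    return None             # groups, ',' or occurrence indicators


def consistent_connector(name, entities, _seen=()):
    """Connector shared by entity `name` and all entities it references."""
    if name in _seen or name not in entities:
        return None         # reference cycle or undefined entity
    found = connector_of(entities[name])
    for ref in re.findall(r"%([A-Za-z][\w.-]*);", entities[name]):
        if found is None:
            break
        sub = consistent_connector(ref, entities, _seen + (name,))
        if sub is None or ("either" not in (sub, found) and sub != found):
            found = None    # nested entity breaks the consistency
        elif found == "either":
            found = sub
    return found


entities = {
    "list.class": "itemizedlist | orderedlist",
    "para.class": "para | simpara | %list.class;",
    "mixed": "para | (note, warning)",
    "bad": "title & %list.class;",
}
print(consistent_connector("para.class", entities))   # |
print(consistent_connector("mixed", entities))        # None
print(consistent_connector("bad", entities))          # None
```

With such a check, ADD CONSISTENT would be accepted only when both
the target and the added entity report the same connector (or
"either").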
Appending or prepending an element to a content model is
also easy:
ADD (APPEND|PREPEND): element TO: element
symmetric constructs from DTD restriction
REMOVE EXCLUSION:
ADD INCLUSION:
Other operations
element redefinition
Sometimes arbitrary changes must be made to a content model
that do not fall under "pure restriction", "pure extension"
or other categories, or that are too hard to describe as
such, and the content model must be completely rewritten.
REDEFINE ELEMENT: element AS: content-model
REDEFINE ENTITY: entity AS: cdata
element renaming
If element names are parametrized, this is trivial.
Otherwise it involves removing the original element, creating
a new one with the same content model, and changing the
definition of all elements and entities that referenced the
original.
RENAME ELEMENT: element TO: new-name
element or entity forking
This is a replacement of some of the occurrences of an
element by a modified version of the original. It is not
unlike element renaming, but somewhat more complex to express.
@
1.6
log
@- tagged as DocBook 4.1 SGML
@
text
@d239 1
a239 1
d261 1
a261 1
d286 1
a286 1
d307 1
a307 1
d332 1
a332 1
d339 1
a339 1
d347 1
a347 1
d390 1
a390 1
d401 1
a401 1
d415 1
a415 1
d450 1
a450 1
d460 1
a460 1
d471 1
a471 1
d493 1
a493 1
d511 1
a511 1
@
1.5
log
@- start to investigate an XML syntax
@
text
@d1 1
a1 2
DTD Customizer
==============
d3 116
a118 77
Rationale for this work
=======================
The goal of this project is to allow one to easily customize a
document-class schema (SGML or XML DTD, W3C or TREX or RELAX XML
schema, Thot S schema description, etc.), by specifying at a high
level of abstraction what changes to do (like: removing an element or
adding a new one, restricting the content model of an element, etc.),
and to have an automated tool that will do all the necessary mechanic
steps that would be fastidious and error-prone if done by hand.
This is because of the following considerations:
- popular document-class like DocBook, designed to address a large
number of needs, are necessary too vast for a number of those needs.
Also, despite this they can't be universal, and sometimes they miss a
couple of features. Hence the (not new) idea of customizing a
well-known document-class for a particular application.
- the process of customizating a DTD, in addition to the design stage,
contains a costly mechanical sub-process, namely ensuring that the
whole customization consistently implements a change (like the removal
of an element, that must be repercuted on all elements that reference
it). This process requires a good knowledge of the internal structure
of the base DTD, as well as good tools which can help to get this
knowledge and to get a complete view of the result of the
customization (eg. dtd* from PerlSGML).
- the needs that led to a customization, or at least the understanding
of those needs, will probably evolve with time, leading to multiple
revisions of a customization layer. Doing this maintainance by hand
has at least the same knowledge requirements than the initial
customization, to which we now have to add the knowledge of the
initial customization. It grows even more if you have a tree/cascade
of customization layers.
- the base DTD for a customization may itself have to evolve with
time. In many cases it will be useful to update a customized DTD to
the latest revision of the base DTD. Again, this is costly if done
manually.
The idea for this software came to me while designing a DocBook
customization, realizing that ideally I would need not one single
customization layer, but really a hierarchy of customization layers,
that would be very difficult to maintain without such a tool.
This tool most probably be written using the PerlSGML library for
handling DTDs.
Types of customization
======================
Although some "customization" could be applied to all DTDs, by just
having the tool to rewrite a new DTD file, this won't be supported,
partly because it would make it hard to check the correctness of the
output of the tool, but mostly because I do not consider such DTDs as
modular enough. The type of customization we talk about here is the
one described in "DocBook - the definitive guide", in the section
describing how to write DocBook customization layers.
The features of a DTD that this tool will use are:
- parameter entities used to define elements' content, as those are
redefinable
- conditionally defined DTD chunks (using parameter entities to
include/ignore those), as it allows elements themselves to be removed
or completely redefined
Most customizations of a document-class like DocBook are either
selecting a subset, or adding domain-specific elements or attributes,
or a combination of those. In the latter case, using this tool it
should trivial to clearly separate the restricting part of the
customization from the extending part. It would even be trivial to
maintain several extentions that would be based on a single subset of
DocBook (or TEI, or whatever), like the following:
d120 1
d131 1
d133 108
d242 1
a242 95
Following these considerations, this tool will initially focus on
operations that either strictly restrict, or strictly extend a DTD.
Unfortunately, all such operations are not easy to define formally,
and in a number of cases it will be easier for the customization layer
designer to specify a restricted or extented content model, and have
the tool just ensure that the specification is either a subset or a
superset.
Other operations will be investigated and implemented later, should
the need arise. Element renaming, being defined and implemented be
really simple means, will surely be an exception.
Both for restriction and extension, a declaration should be available
to the customization layer author to declare a global pure restriction
or extension. This will be used to forbid invalid constructs, and
issue warning for easily mis-used constructs.
Global modular architecture
===========================
The core of the tool will take as input and output abstract
representations of the document-class. Those data structures will be
defined by a generic schema-handling library, like the (still to come)
redesign of the Perl5 PerlSGML modules.
The input to this tool core will be produced by a schema parser.
Hopefully the above-mentionned redesign of PerlSGML will handle this
part of the work, allowing to plug parsers for new notations for
schema descriptions.
The output of the tool core will have to be transcribed into a usable
syntax (which may be other than the input syntax). This will probably
be integrated into PerlSGML as well, in a generic way similar to the
input layer.
Output of differential customization layers, like those manually
written for DocBook, may be supported using additional output modules.
However, as this way of customization probably only exists to simplify
the manual work this tool hopes to render useless, it's not clear to
me at this point why it should be supported, and I don't plan to put
any work on this myself.
================
CORE TOOL DESIGN
================
DTD restriction
===============
Element content restriction
---------------------------
Content restriction is a wide area, and the general case may be
difficult to formalize in a usable way, so before focussing on a
couple of restrictions that will be easy to formalise, we'll see how
the tool can act as a simple checker.
Checking a content model
------------------------
Content models are mostly defined using regexp-like syntaxes, so we
can use the finite-state machines (FSM) theory and tools. A
content-model is a (context-free ? should reread the theory...)
grammar that can be modelled by a FSM.
Thus we basically need to check that all words accepted by the FSM for
the new content-model are accepted by the FSM for the original one. A
naive (and proof-of-concept) algorithm would be to compute the FSM for
the intersection of those grammars, and check that its canonical form
is the same as the supposed-restriction's. I hope we can find a
one-pass algorithm that would save computing time and be more elegant,
but well, the priority is to get something to work.
Some kinds of content restriction
---------------------------------
Note: the syntax used for examples is only here as a design helper. A
formal syntax will be defined later, possibly as XML data.
An apparently quite simple type of content restriction is probably the
removal an occurence of an element in a content model. Even that is
somewhat difficult to express if we want it to survive to future
revisions of the parent DTD, the main problem being to address a
single occurence of an element type (or of a parameter entity) in a
content model (including inside parameter entities), when several such
occurences can be found.
Simpler to specify is the removal of all occurences of an element
(including pseudo-elements like #PCDATA) from a content model:
REMOVE WHICH: all (ELEMENTS|ENTITIES): FROM ELEMENTS:
d249 14
d264 1
a264 7
Even simpler to specify is the pure destruction of an element or
entity, and hence its removal from all content models and entities.
This is not just a specific case of the former, because it requires to
undefine the element or entity.
ZAP (ELEMENTS|ENTITIES):
d271 17
d289 1
a289 10
However, the very existence of content exceptions make it difficult to
handle changes to parameter entities in a secure way, as those
entities can then be used both with additive and substractive
semantics. Their usage in the base DTD (and in the current
customization layer) should be checked to be sure of the semantics of
the following clause, which would be a pure extension when applied to
an entity used for a content exclusion:
REMOVE WHICH: all (ELEMENTS|ENTITIES): FROM ENTITIES:
d296 12
a307 5
For element definitions where content exceptions are not parametrized,
or whose parametrization should be broken (more on this later),
explicit manipulation may be necessary:
d311 1
d318 28
d347 1
a347 18
Often an entity is used in the definition of several content models,
and we only want to restrict some of those contents. Then the
parametrization has to rewritten using a new entity with a "smaller
content". These entities can be kept linked (the old one being an
formally extension of the new one), or not - such decision will impact
further customisation layers.
DESYNC ENTITY: IN (ELEMENTS|ENTITIES): \
TO: RESTRICTION NAMED: LINKED: (YES|NO)
To which a restriction can then be applied:
REMOVE WHICH: all (ELEMENTS|ENTITIES): FROM ENTITIES:
Note that such a two-step mechanism may have an impact on other design
issues. But maybe not, as the following syntax attempts to
demonstrate:
d363 109
a471 66
Attribute restriction
---------------------
Note: these customizations depend on the base DTD using parameter
entities for attributes definition.
This is not unlike element restriction, but much simpler, as
attributes are not referenced in such complex places like content
models, and can occur only once within one element definition.
- remove an #implied or defaulted attribute
REMOVE: FROM:
- make required an #implied or defaulted attribute
REQUIRE: IN:
OTOH, attributes also have some sort of "content models" which, even
if simpler that an element's content model, can also be subject to
customization.
- allow less tokens:
REMOVE: FROM: IN:
- change CDATA to tokens. This may require some additional care, and
SGML attribute minimization should surely be turned off in this case,
if we want a doc using the customization layer to be parsable as an
instance of the base DTD.
TOKENIZE: FROM: AS:
DTD extention
=============
adding an element
-----------------
Well, just add its definition, and make sure it conforms with the
modularity standards of the base DTD. It will still need to be used
somewhere, though.
Element content extention
-------------------------
As well as restrictions, extentions of a content model may be quite
complex to specify, and "REDEFINE" clauses may be the best way to go
in many cases. Still, if a DTD is properly parametrized, it may be
that the place you want to add an element/entity to is itself within
what I'll name a "consistent entity", that is a part of a content
model that is either a pure sequence of |'d elements or a pure
sequence of &'d elements. Entities to be included in such a way, and
those on the same level in the target consistent entity, should be
safe-checked to be sure they don't break the consistency of the target
entity.
ADD CONSISTENT: <(element|entity)> TO: <(element|entity)>
Appending or prepending an element to a content model is also easy:
ADD (APPEND|PREPEND): TO:
symetric constructs from DTD restriction
----------------------------------------
d474 60
a533 29
Other operations
================
- element redefinition
Sometimes arbitrary changes to a content model must be done, that do
not fall under "pure restriction" or "pure extension" or other
categories, or that are too hard to describe as such, and the content
model must be completely rewritten.
REDEFINE ELEMENT: AS:
REDEFINE ENTITY: AS:
- element renaming
If element names are parametrized, this is trvial. Otherwise it
involves removing the original element, creating a new one with the
same content model, and changing the definition of all elements and
entities that referenced the original.
RENAME ELEMENT: TO:
- element or entity forking
This is a replacement of some of the occurences of an element by a
modification of the original. It is not unlike element renaming, but
somewhat more complex to express.
@
1.4
log
@- content-model checking
- update of content-model restriction
@
text
@d189 7
d203 7
d220 7
d234 7
d250 24
@
1.3
log
@- Rework of the global architecture
@
text
@d150 33
a182 11
difficult to formalize in a usable way, so we'll focus at first on a
couple of restrictions. It should be noted that the possibility of
restricting the content-model is highly dependant on the DTD.
A quite simple type of content restriction is probably the removal an
occurence of an element in a content model. Even that is somewhat
difficult to express if we want it to survive to future revisions of
the parent DTD, the main problem being to address a single occurence
of an element type (or of a parameter entity) in a content model
(including inside parameter entities), when several such occurences
can be found.
d197 6
a202 6
handle changes to parameter entities in a secure way, as those entities
can then be used both with additive and substractive semantics. Their
usage in the base DTD (and in the current customization layer) should
be checked to be sure of the semantics of the following clause, which
would be an part of a pure extension when applied to an entity used
for a content exclusion:
d207 1
d213 9
@
1.2
log
@more on modularity standards
@
text
@d4 10
a13 6
The goal of this project is to allow one to easily customize a SGML or
XML DTD, by specifying at a high level of abstraction what changes to
do (like: removing an element or adding a new one, restricting the
content model of an element, etc.), and to have a mechanic tool that
will do all the necessary steps that would be fastidious and
error-prone if done by hand.
d17 27
a43 18
- the process of customizating a DTD contains a costly mechanical
sub-process, namely ensuring that the whole customization is
consistently implements a change (like the removal of an element,
that must be repercuted on all elements that reference it). This
process requires good tools (eg. dtd* from PerlSGML), and a good
knowledge of the internal structure of the base DTD.
- the needs that led to a customization, or at least the
understanding of those needs, will probably evolve with time, leading
to multiple revisions of a customization layer. Doing this
maintainance by hand has at least the same knowledge requirements
than the initial customization, to which we now have to add the
knowledge of the initial customization.
- the base DTD for a customization may itself have to evolve with
time. In many cases it will be useful to update a customized DTD to
the latest revision of the base DTD. Again, this is costly if done
manually.
d50 1
a50 4
This tool will be most useful with a DTD that is designed for easy
customization, like DocBook.
It will most probably be written using the PerlSGML library for
d57 1
d72 1
d74 7
a80 7
Most customizations of a DTD like DocBook are either selecting a
subset, or adding domain-specific elements, or a combination of those.
In the latter case, using this tool it should trivial to clearly
separate the restricting part of the customization from the extending
part. It would even be trivial to maintain several extentions that
would be based on a single subset of DocBook (or TEI, or whatever),
like the following:
d96 6
d107 2
a108 2
to the customization layer author to declare a pure restriction or a
pure extension. This will be used to forbid invalid constructs, and
d112 2
a113 2
Modularity standards
====================
d115 26
a140 8
Modularity standards of the base DTD to be used may need to be
formally described. This may need insight into DTDs other than
DocBook, which I do not have right now. Help needed here. I should
probably have a look at TEI.
My raw idea for now is to use flags to tell whether a feature is
present or not, and/or to parse the DTD to detect some features (like
whether an element is in its own incluse/ignore section).
@
1.1
log
@Initial revision
@
text
@d93 4
d101 4
@