James Clark
2007-04-04 08:50:02 UTC
This is a write-up of an idea I've been thinking about for quite a
while. I did an off-the-cuff presentation of this to Sanjiva when he
was in Bangkok a few days ago. This message is an attempt to
communicate this to everybody else. There's rather a lot of discursive,
motivating material at the beginning: the meat of the message is towards
the end. This is because I am planning to use this message as the basis of
my first blog entry (I've been thinking about starting a blog for some
time, but in trying to write the first blog entry, I am beginning to
understand why novelists have such a hard time writing the first
sentence of a novel), unless of course you all tell me the idea is
useless and/or incomprehensible. So please don't be shy about
expressing your opinions on this idea. I want to make sure my first
blog entry is worth reading.
I see the real pain-point for distributed computing at the moment as not
the messaging framework but the handling of the payload. A successful
distributed computing platform needs
- a payload format
- a way to express a contract that a payload must meet
- a way to process a payload that may conform to one or more contracts
that is
- suitable for average, relatively low-skill programmers
- allows for loose coupling (version evolution, extensibility,
suitability for a wide variety of implementation technologies)
For the payload format, XML has to be the mainstay, not because it's
technically wonderful, but because of the extraordinary breadth of
adoption that it has succeeded in achieving. This is where the JSON (or
YAML) folks are really missing the point by proudly pointing to the
technical advantages of their format: any damn fool could produce a
better data format than XML.
We also have to live in a world where XSD is currently dominant as the
wire-format for the contract (thank you, W3C, Microsoft and IBM).
But I think it's fairly obvious that current XML/XSD databinding
technologies have major weaknesses when considered as a solution to the
problem of payload processing for a distributed computing platform. The
two basic databinding techniques I see today are:
- Generating XSD from an implementation in a statically typed language
which includes optional annotations; this provides a great developer
experience, but from a coupling perspective doesn't seem much of an
improvement beyond CORBA or DCOM. The other problem is that it's tough
to do this in a dynamically typed language (absent sophisticated type
inference or mandatory annotations).
- Generating programming language stubs from an XSD which includes
optional annotations. This is problematic from the developer experience
point of view: there's a mismatch between XML's fundamental structures,
attributes and elements, which are optimized for imposing structure on
text, and the terms in which developers naturally think of data
structures. Beyond this inherent problem, it's hard to author schemas
using XSD and even harder to author schemas that have the right
loose-coupling properties. And the tooling often introduces additional
coupling problems.
This pain is experienced most sharply at the moment in the SOAP world,
because the big commercial players have made a serious investment in
trying to produce tools that work for the average developer. But I
believe the REST world has basically the same problem: it's not really
feeling the pain at the moment because REST solutions are mostly created
by relatively elite developers who are comfortable dealing with XML
directly.
The REST world also takes a less XML-centric view of the world, but for
non-XML payload formats (JSON, or property-value pairs) their only
solution to the contract problem is a MIME type, which I think is
totally insufficient as a contract mechanism for enterprise-quality
distributed computing. For example, it's not enough to say "accessing
this URI will give you JSON"; there needs to be a description of the
structure of the JSON, and that description needs to be machine
readable.
Some people propose solving the XML-processing problem by adopting an
XML-centric processing model, for which the leading technologies are
XQuery and XSLT2. The fundamental problem here is the XQuery/XPath data
model. I'm not criticizing the WGs' efforts: they've done about as good
a job as could be done given the constraints they were working under.
But there is no way it can overcome the constraint that a data model
based around XML and XSD is just not a very good data model for
general-purpose computing. The structures of XML (attributes, elements
and text) are those of SGML and these come from the world of markup.
Considered as general purpose data structures, they suck pretty badly.
There's a fundamental lack of composability. Why do we need both
elements and attributes? Why can't attributes contain elements? Why is
the type of thing that can occur as the content of an element not the
same as the type of thing that can occur as a document? Why do we still
have cruft like processing instructions and DTDs? XSD makes a (misguided
in my view) attempt to add an OO/programming language veneer on top. But
it can't solve the basic problems, and, in my view, this veneer ends up
making things worse not better.
I think there's some real progress being made in the programming
language world. In particular I would single out Microsoft's LINQ work.
My doubts on this are with its emphasis on static typing. While I think
static typing is invaluable within a single, controlled system, I
think for a distributed system the costs in terms of tight coupling
often outweigh the benefits. I believe this is less the case if the
typing is structural rather than named. But although LINQ (or at least
newer versions of C#) has introduced some welcome structural typing
features, named typing is still thoroughly dominant.
In the Java world, there's been a depressing lack of innovation at the
language level from Sun; outside of Sun, I would single out Scala from
EPFL (which can run on a JVM). This adds some nice functional features
which are smoothly integrated with Java-ish OO features. XML is
fundamentally not OO: XML is all about separating data from processing,
whereas OO is all about combining data and processing. Functional
programming is a much better fit for XML: the problem is making it
usable by the average programmer, for whom the functional programming
mindset is very foreign.
This brings me to the main point I want to make. There seems to me to
be another approach for improving things in this area, which I haven't
seen being proposed (maybe I just haven't looked in the right places).
The basic idea is to have a schema language that operates at a different
semantic level. In the following description I'll call this
yet-to-be-designed language TEDI (Type Expressions for Data Interchange,
pronounced "Teddy").
If you look at the major scripting languages today, I think it's
striking that at a very high level, their data structures are pretty
similar and are composed from:
- arrays
- maps
- scalars/primitives or whatever you want to call them
This goes for Perl, Python, Ruby, Javascript, AWK. (PHP's array
datastructure is a little idiosyncratic.) The SOAP data model is also
not dissimilar.
When you drill down into the details, there are of course a lot of
differences:
- some languages have fixed-length tuples as well as variable-length
arrays
- most languages distinguish between a struct that has a fixed set of
identifiers as keys and a map that can have an unlimited set of keys
(though there are often restrictions on the types of keys, for example,
to prohibit mutable types)
- there's a wide variety of primitives: almost all languages have
strings (though they differ in whether they are mutable) and numbers;
beyond that, many languages have booleans, a null value, some sort of
date-time support
TEDI would be defined in terms of a generic data model that makes a
tasteful restricted choice from these programming languages' data
structures: not limiting the choice to the lowest common denominator,
but leaving out frills and focusing on the basics and on things that can be
naturally mapped into each language. At least initially, I think I
would restrict TEDI to trees rather than handle general graphs. Although
graphs are important, I think the success of JSON shows that trees are
good enough as a programmer-friendly data interchange mechanism.
I would envisage both an XML and a non-XML syntax for TEDI. The non-XML
syntax might have a JSON flavour. For example, a schema might look like
this:
{ url: String, width: Integer?, height: Integer?, title: String? }
This would specify a struct with 4 keys: the value of the "url" key is a
string; the value of the "width" key is an integer or null. You can thus
think of the schema as being a type expression for a generic scripting
language data structure.
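To illustrate what "type expression for a data structure" could mean in
practice, here is a minimal Python sketch that checks a dict against the
example schema. Nothing here is defined by TEDI (which does not exist
yet): the in-memory schema form, the name conforms, and the choice to
treat a missing key the same as null are all assumptions made for
illustration.

```python
# Hypothetical in-memory form of the example schema: each key maps to
# (expected Python type, whether the value may be null or absent).
PICTURE_SCHEMA = {
    "url": (str, False),
    "width": (int, True),
    "height": (int, True),
    "title": (str, True),
}


def conforms(value, schema):
    """Return True if `value` is a dict matching the struct schema."""
    if not isinstance(value, dict):
        return False
    # A struct has a fixed set of keys: reject anything not in the schema.
    if any(key not in schema for key in value):
        return False
    for key, (expected_type, optional) in schema.items():
        field = value.get(key)
        if field is None:
            # Assumption: an absent key and an explicit None both mean null.
            if not optional:
                return False
        elif isinstance(field, bool) or not isinstance(field, expected_type):
            # bool is a subclass of int in Python, so exclude it explicitly.
            return False
    return True


picture = {"url": "http://www.example.com/pic.jpg", "title": "A fine picture"}
print(conforms(picture, PICTURE_SCHEMA))
```

The point of the sketch is only that the schema reads as a description of
an ordinary scripting-language value, with no reference to elements or
attributes.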
The key design goal for TEDI would be to make it easy and
natural for a scripting-language programmer to work with.
There's one other big piece that's needed to make TEDI work:
annotations. Each component of a TEDI schema can have multiple,
independent annotations, which may be inline or externally attached in
some way. Each annotation has a prefix that identifies a binding. A
TEDI binding specification has to be developed for each programming
language and each serialization that will be used with TEDI.
The most important TEDI binding specification would be the one for XML.
This specifies, for a combination of
- a TEDI schema,
- XML binding annotations for the TEDI schema, and
- an instance of the generic TEDI data model conforming to the schema
which XML infosets are considered correct representations of the
instance, and also identifies one of these infosets as the canonical
representation. The XML binding annotations should always be optional:
there should be a default XML serialization of any TEDI instance.
For example, an instance of the example schema above might get
serialized as
<root>
<url>http://www.example.com/pic.jpg</url>
<title>A fine picture</title>
</root>
But with an annotation
@xml.element(name="picture")
{ url: String, width: Integer?, height: Integer?, title: String? }
it might get serialized as
<picture>
<url>http://www.example.com/pic.jpg</url>
<title>A fine picture</title>
</picture>
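The two serializations above could be produced by something like the
following Python sketch, using the standard xml.etree.ElementTree
module. The function name to_xml, the element_name parameter (standing
in for the effect of the @xml.element annotation), and the rule that
null fields are omitted are assumptions for illustration, not part of
any TEDI XML binding.

```python
import xml.etree.ElementTree as ET


def to_xml(instance, element_name="root"):
    """Serialize a flat struct (dict) as child elements of one root element."""
    root = ET.Element(element_name)
    for key, value in instance.items():
        if value is None:
            continue  # assumption: null fields are simply omitted
        child = ET.SubElement(root, key)
        child.text = str(value)
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


picture = {
    "url": "http://www.example.com/pic.jpg",
    "width": None,
    "height": None,
    "title": "A fine picture",
}

default_xml = to_xml(picture)                            # no annotation
annotated_xml = to_xml(picture, element_name="picture")  # with annotation
print(default_xml)
print(annotated_xml)
```

The output is written on a single line rather than indented as in the
examples above; whitespace handling is one of the details a real binding
specification would have to pin down.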
Let's try and make this more concrete by imagining what it would look
like for a particular scripting language, say Python. First of all
people in the Python community would need to get together to create a
TEDI binding for Python. This would work in an analogous way to the XML
binding. It would specify, for a combination of
- a TEDI schema,
- Python binding annotations for the TEDI schema, and
- an instance of the generic TEDI data model conforming to the schema
which Python data structures are considered representations of the
instance, and also identify one of these data structures as the
canonical representation.
The API would be very simple. You would have a TEDI module that
provided functions to create schema objects in various ways. The
simplest way would be to create it from a string containing the non-XML
representation of the TEDI schema complete with any inline annotations.
Any XML and Python annotations would be used; annotations from other
bindings would be ignored. The schema object would provide two
fundamental operations:
- loadXML: this takes XML and returns a Python structure, throwing an
exception if the XML is not valid according to the TEDI schema
- saveXML: this takes a Python structure and returns/outputs XML,
throwing an exception if the Python structure is not valid according to
the schema
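A rough Python sketch of a schema object with these two operations
follows. It is a toy, not a TEDI implementation: the class name
StructSchema, the way fields are passed in (rather than parsed from the
non-XML schema syntax), the use of ValueError, and the snake_case method
names load_xml and save_xml (following Python convention instead of
loadXML and saveXML) are all assumptions. It handles only a flat struct
of String and Integer fields.

```python
import xml.etree.ElementTree as ET


class StructSchema:
    """Toy schema object for a flat struct; not a real TEDI implementation."""

    def __init__(self, fields, element_name="root"):
        # fields: key -> (Python type, optional flag)
        self.fields = fields
        self.element_name = element_name

    def load_xml(self, text):
        """Parse XML into a dict, raising ValueError if it is not valid."""
        root = ET.fromstring(text)  # malformed XML raises ET.ParseError
        if root.tag != self.element_name:
            raise ValueError(f"unexpected root element {root.tag!r}")
        result = {key: None for key in self.fields}
        for child in root:
            if child.tag not in self.fields:
                raise ValueError(f"unexpected element {child.tag!r}")
            if result[child.tag] is not None:
                raise ValueError(f"duplicate element {child.tag!r}")
            py_type, _ = self.fields[child.tag]
            # int("abc") raises ValueError, which doubles as the type check.
            result[child.tag] = py_type(child.text or "")
        for key, (_, optional) in self.fields.items():
            if result[key] is None and not optional:
                raise ValueError(f"missing required element {key!r}")
        return result

    def save_xml(self, value):
        """Serialize a dict to XML, raising ValueError if it is not valid."""
        if not isinstance(value, dict):
            raise ValueError("expected a dict")
        unknown = set(value) - set(self.fields)
        if unknown:
            raise ValueError(f"unexpected keys {sorted(unknown)!r}")
        root = ET.Element(self.element_name)
        for key, (py_type, optional) in self.fields.items():
            field = value.get(key)
            if field is None:
                if not optional:
                    raise ValueError(f"missing required key {key!r}")
                continue
            if not isinstance(field, py_type):
                raise ValueError(f"wrong type for key {key!r}")
            ET.SubElement(root, key).text = str(field)
        return ET.tostring(root, encoding="unicode")


schema = StructSchema(
    {"url": (str, False), "width": (int, True),
     "height": (int, True), "title": (str, True)},
    element_name="picture",
)
data = schema.load_xml(
    "<picture><url>http://www.example.com/pic.jpg</url>"
    "<width>640</width></picture>"
)
print(data["width"] + 1)  # width arrives as an int, not a string
```

Validation and data binding happen in the same step here: the caller
either gets a plain dict with the right types or an exception.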
XML is not the only possible serialization. The JSON community could
develop a JSON binding. If you implemented that, then your API would
have loadJSON and saveJSON methods as well.
One complication that must be handled in order to make this
industrial-strength is streaming. A good first step would be to be able to
handle the pattern where the document element contains zero or more
header elements, and then a possibly very large number of entry
elements, each of which is not large; the streaming solution you want in
this case is for the API to deliver the entries as an iterator rather
than an array.
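As a sketch of the iterator idea, the following Python generator uses
xml.etree.ElementTree.iterparse to hand back one entry at a time. The
function name iter_entries, the default element name "entry", and the
representation of each entry as a dict of child element names to text
are assumptions for illustration; header elements are simply skipped
here.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, entry_tag="entry"):
    """Yield each entry (a direct child of the document element) as a dict."""
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        # On an "end" event, depth 2 means a direct child of the document
        # element has just been completely parsed.
        if depth == 2 and elem.tag == entry_tag:
            yield {child.tag: child.text for child in elem}
            # Release the entry's children and text once the caller has
            # moved on; the emptied element itself stays in the tree.
            elem.clear()
        depth -= 1


doc = (
    "<feed><header>nightly export</header>"
    "<entry><id>1</id></entry>"
    "<entry><id>2</id></entry></feed>"
)
for entry in iter_entries(io.BytesIO(doc.encode("utf-8"))):
    print(entry["id"])
```

The caller writes an ordinary for loop, and at no point does the whole
list of entries need to be held in memory as a Python array.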
Another challenge in designing the TEDI XML binding is handling
extensibility. I think the key here is for one of the TEDI *primitives*
to be an XmlElement (or maybe XmlContent). (This might also be useful
in dealing with XML mixed content.) With different TEDI schemas you
should be able to get quite different representations out of the same
XML document. For a SOAP message, you might have a very generic TEDI
schema that represents it as an array of headers and a payload (all
being XmlElements); or you might have a TEDI schema for a specific type
of message that represented the payload as a particular kind of
structure.
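The two views of a SOAP message might look roughly like this in Python.
This is only a sketch: the function names generic_view and picture_view
are hypothetical, the raw xml.etree.ElementTree Element stands in for
the proposed XmlElement primitive, and the message is assumed to be a
SOAP 1.1 envelope whose body holds a single flat payload element.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope


def generic_view(text):
    """Generic schema: an array of headers plus a payload, all raw elements."""
    envelope = ET.fromstring(text)
    header = envelope.find(f"{{{SOAP_NS}}}Header")
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    return {
        "headers": list(header) if header is not None else [],
        "payload": body[0] if body is not None and len(body) else None,
    }


def picture_view(text):
    """Message-specific schema: the payload becomes an ordinary struct.

    Assumes the body is present and its first child is a flat element.
    """
    payload = generic_view(text)["payload"]
    return {child.tag: child.text for child in payload}


message = (
    f'<s:Envelope xmlns:s="{SOAP_NS}">'
    "<s:Header><trace>abc</trace></s:Header>"
    "<s:Body><picture><url>http://www.example.com/pic.jpg</url></picture>"
    "</s:Body></s:Envelope>"
)

print(generic_view(message)["payload"].tag)
print(picture_view(message)["url"])
```

The same document yields either opaque XML fragments or a plain struct,
depending only on which schema is applied to it.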
This shows how you could fit TEDI into a world where XML is the dominant
wire format, but still leverage other more suitable wire formats when
appropriate.
But how do you interop with a world that uses XSD as the wire format for
contracts? The minimum is to create a tool that can take a TEDI schema
with XML annotations and generate an XSD. There'll be limits because of
the limited power of XSD (and these will need to be taken into
consideration in designing the TEDI XML binding): some of the
constraints of the TEDI schema might not be captured by the XSD. But
that's a normal situation: there are often complex constraints on an XML
document being interchanged that cannot be expressed in XSD.
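A minimal sketch of such a generator, for the flat struct used in the
earlier example, might look like this in Python. The function name
struct_to_xsd, the tuple form of the field list, and the mapping of
String and Integer to xs:string and xs:integer are assumptions. Using
xs:sequence also assumes the XML binding fixes the order of keys, which
is itself an example of the XSD saying something slightly different from
the TEDI schema.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Assumed mapping from TEDI primitive names to XSD built-in types.
XSD_TYPES = {"String": "xs:string", "Integer": "xs:integer"}


def struct_to_xsd(element_name, fields):
    """Build an XSD for a flat struct.

    fields: list of (key, TEDI type name, optional flag) in document order.
    """
    # Make the serializer use the "xs" prefix, so that the QNames in the
    # type attributes below resolve against the declared namespace.
    ET.register_namespace("xs", XS)
    schema = ET.Element(f"{{{XS}}}schema")
    element = ET.SubElement(schema, f"{{{XS}}}element", name=element_name)
    complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
    for key, type_name, optional in fields:
        attrs = {"name": key, "type": XSD_TYPES[type_name]}
        if optional:
            attrs["minOccurs"] = "0"  # a nullable field may be left out
        ET.SubElement(sequence, f"{{{XS}}}element", attrs)
    return ET.tostring(schema, encoding="unicode")


xsd = struct_to_xsd("picture", [
    ("url", "String", False),
    ("width", "Integer", True),
    ("height", "Integer", True),
    ("title", "String", True),
])
print(xsd)
```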
A more difficult task is to take an XSD and generate a TEDI schema together
with XML binding annotations. This would be one of the main things that
would drive adding complexity to the TEDI XML binding annotations. I
expect that the work of the XML Schema Patterns for Databinding WG would
be valuable input on what was really needed.
In the future, there's still hope that the wire-format for the contract
need not always be XSD: WSDL 2.0 makes a significant effort not to
restrict itself to XSD; so you could potentially publish a WSDL with
both the XSD and the TEDI for a web service.
The closest thing I've seen to TEDI is Paul Prescod's XBind language
(http://www.prescod.net/xml/xbind/), but it has a rather different
philosophy in that it separates validation from data binding, whereas
TEDI integrates them. Another difference is that Paul has written some
code, whereas TEDI is completely vaporware at this point.
The first step in implementing TEDI would be to pick a scripting
language (probably Ruby or Python), and do the implementation in and for
that language. Eventually it would be desirable to have a
high-performance modular C engine, that could be integrated into each
scripting language that is implemented in C, so that serialization and
deserialization performance via TEDI would be more competitive with the
language's native facilities (it would be interesting to see how big a
hit TEDI would be). Similarly you would want a Java implementation to
integrate with dynamic languages that are implemented in Java (Rhino,
Groovy, JRuby).
James