GlueX Technical Note
Request for Comments
HDDM - GlueX Data Model
draft 1.1
Richard Jones
September 12, 2003
(supersedes draft 1.0, September 12, 2003)
Abstract
The GlueX experiment has adopted a policy that all shared data should
be described in xml. A set of tools has been developed that enables
programmers to respect this policy without the overhead of building
and parsing xml documents in user code. The data are described in
terms of a data model structured like an xml document. The general
rules for building such model documents are called the HDDM
specification. Once a user has described the data model in xml, tools
are provided to translate the model into c structures, xml schema,
and a general-purpose i/o library.
Background
The workflow model for GlueX offline analysis is shown in Fig. 1.
It is important even at an early stage in the experiment to consider
carefully the critical elements of the software framework in which these
components operate. One critical aspect of the framework is the way
the data are represented at intermediate stages in the pipeline. The
collaboration made an early decision to describe all such data in xml.
This choice brings with it the advantages of a host of automated tools
and libraries in the public domain for parsing and searching xml documents.
However it also demands careful design so that the additional software
layers introduced do not impose unintended restrictions on the client
programs.
A search on the web recently turned up a number of interesting projects
in which other groups are attempting to do the same or similar things.
Two projects stand out as most interesting from the point of view of
GlueX, the BinX project [1]
supported by the University of Edinburgh and the HDF project
[2] supported by NCSA, Urbana.
BinX is still in its infancy. None of the tools or libraries described
in the reference document appear to be available for downloading at the
time of this writing. However, from the description given, some
observations can be made. The aim is to provide a general metadata format
for describing just about any binary data file. The authors start from
the premise that storing and retrieving data files in xml format does not
make sense, and proceed to the conclusion that one never needs to actually
express them in xml. Thus their goal appears to be to use the xml as a
sort of documentation facility that is separate from the actual process
of handling and interpreting the data. This approach makes the most sense
when one is dealing with a set of i/o libraries and file formats that are
already in place, and wants to introduce a metadata description after the
fact. In GlueX we have the possibility of starting out using the xml in
the design process, and having the other pieces automatically created from
the xml. BinX adds the metadata as a new layer on top of, and independent
from, the layer in which programs actually process data, and as such it
looks more like a retrofit tool than an integrated model component.
However it is still new and seems to have some resources behind it, so
GlueX developers should follow it with interest.
The Hierarchical Data Format (HDF), by contrast, is much older and seems to
already have established a base of scientific users. The design appears
to be oriented around the need to archive large multi-dimensioned arrays,
such as images. Its origins date back to the mid-1990's, before the
advent of xml, a fact reflected in the frequent references to storing
and recovering data from fortran programs, but it is being actively
developed, and in version 5 (HDF5) new functionality has been added
to describe data in xml. There are also java tools in the distribution,
which probably reflects the fact that support for scientific programming
in java is on the rise. The primary concern regarding the potential
usefulness of HDF to
GlueX is the complexity of the data model and the program interface.
This package is designed to do much more than is needed by GlueX. For
example, the ability to delete arbitrary objects from existing data files,
or add new ones, complicates the i/o library considerably over what is
needed for HEP data streams. The number of operations that are required
to set up a new data file is considerable, compared with the simple
opens and writes to which HEP programmers are accustomed. From the point
of view of design, multi-dimensioned arrays are not the most appropriate
structure for expressing much of the content of a GlueX event. While it
could probably be made to work for GlueX, it appears that HDF would be
overkill for this problem. Thus there is still good reason to carry on
the earlier development of a xml data model for GlueX.
Fig. 1: The conceptual data model for GlueX begins with a physics
event, coming either from the detector or a Monte Carlo program,
which builds up internal structure as it flows through the
analysis pipeline. The data model specifies the elements of information
that are contained in a view of the event at each stage and the
relationships between them. The implementation provides standard
methods for creating, storing and accessing the information.
General Notes
- At each stage (lower-case items in diagram) in the pipeline one has
a unique view of the event.
- To each of these is associated a unique data model that expresses
the event in that view.
- GlueX policy is to use xml to describe all of our shared data, which
means any data that might be passed as input to a program or produced
as output. This does not mean that all data records are represented
as plain-text files, but that to each data file or i/o port is attached
some metadata that a tool can use to automatically express all of its
contents in the form of a plain-text xml document.
- This policy is interpreted to mean that to each data file or i/o port
of a program is associated a xml schema that defines the data structure
that the program expects or produces. The schemas should be either
bundled with the distribution of the program, or published on the web
and indicated by links in the Readme file.
- Any xml document should be accepted as input to a program if it is
valid according to the schema associated to that input port.
- In practice, this last requirement adds significant overhead to the
task of writing a simple analysis program, because it must be capable of
parsing general xml documents as input. In addition to this overhead
imposed on the program code itself, the author must also produce schemas
for each input or output port or file accessed by the program.
- The purpose of the Hall D Data Model (HDDM) is to simplify the
programmer's task by providing automatic ways to generate schemas
and providing i/o libraries to standardize input and output of data
in the form of xml-described files.
- HDDM consists of a specification supported by a
set of tools to assist its implementation. The
specification is a set of rules that a programmer must obey in
constructing schemas in order for them to be supported by the tools.
The tools include an automatic schema generator and an i/o
library builder for c and c++.
- The HDDM specification was designed to enable the construction of an
efficient i/o library. It was assumed in the design that users could
not afford a general xml-parsing cycle every time an event is read in
or written out by a program. It was also assumed that serializing data
in plain-text xml is too expensive in terms of bandwidth and storage.
Using the HDDM tools, users can efficiently pass data between programs
in a serialized binary stream, and convert to/from plain-text xml
representations using a translation tool when desired.
- Programmers are not obligated to use HDDM tools to work in the GlueX
software framework. If they provide their own schema for each file or
i/o port used by the program and accept any xml that respects their
input schema then they are within the agreed framework.
- The HDDM tools are presently implemented in c and c++, so programmers
wishing to work in java have more work to do. However, they will find
it easy to interface to other programs that do use the HDDM libraries
because they provide for the correct reading and writing of valid xml
files and the automatic generation of schemas that describe them.
- The hddm-c tool automatically constructs a set of c-structures
based on xml metadata that can be used to express the data in memory.
It also builds an i/o library that can be called by user code to
serialize/deserialize these structures (a sketch of a typical calling
sequence is given at the end of this list).
- The serialized data format supported by hddm-c consists of a
xml header in plain text that describes the structure and contents of
the file, followed by a byte string that is a reasonably compact
serialization of the structured data in binary form. Such hddm
files are inherently self-describing. The overhead of including the
metadata in the stream with the data is negligible.
- The hddm-xml tool extracts the xml metadata from a hddm file
header and expresses the data stored in the file in the form of a
plain-text xml document.
- The hddm-schema tool extracts the xml metadata from a hddm file
header and generates a schema that describes the structure of the data
in the file. The schema produced by hddm-schema will always
validate the document produced by hddm-xml when both act on the
same hddm file. More significantly, the schema can be used to check
the validity of other xml data that originate from a different source.
- The xml-hddm tool reads an xml document and examines its schema
for compliance with the HDDM specification. If successful, it parses
the xml file and converts it into hddm format.
- The schema-hddm tool reads a schema and checks it for compliance
with the HDDM specification. If successful, it parses it into the form
of a hddm file consisting of the header only and no data. Such a
data-less hddm file is also called a "template"
(see below).
- Note that the hddm-xml, hddm-schema, and hddm-c tools
can act on any hddm data file written by any program, even if the code
that produced the data is no longer available. This is because
sufficient metadata is provided in the schema header to completely
reconstruct the file's contents in xml, or instantiate it in c-structures.
- A tool called xml-xml has been included in the tool set as a
simple means to validate an arbitrary xml document against a dtd or
schema, and reformat it with indentation to make it easier to read.
- Tools called stdhep2hddm and hddm2stdhep provide
conversion between the hddm data stream and the STDHEP format used by
HDFast. This is an example where a user program achieves xml i/o
by employing translators, in this case a two-stage pipeline.
- In spite of the array of tools described above, the programmer still
must do the work of describing the structure and contents of the data
expected or produced by his program. He may do this in one of two
ways: either he constructs an original schema describing his data, or
he creates an original xml template of his data and then generates the
schema using hddm-schema.
- Since schemas are rather verbose and repetitive, the suggested method
is to create a template first, use hddm-schema to transform it
into a basic HDDM schema, and then add facets to the schema to enrich the
minimal set of metadata generated from the template. This method has
the advantage that one starts off with a basic schema that is known to
conform to the rules for HDDM schemas (see below)
so it is relatively simple thereafter to stay within the specification.
- As a shortcut to creating schemas, it is not necessary to do anything
more than just create the template. The basic schema that is generated
automatically from the template contains sufficient information to
validate most data, so a programmer can get by without ever learning
how to write or modify schemas.
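To make the notes above more concrete, the sketch below shows how a user
program might loop over events through an i/o library generated by hddm-c.
All of the identifiers used here (hddm_s.h, s_iostream_t, s_HDDM_t,
open_s_HDDM, read_s_HDDM, flush_s_HDDM, close_s_HDDM) are assumptions for
a data model of class "s", not a literal copy of the generated interface;
the actual names are fixed by the template from which the library is built.

/* Sketch of an event-reading loop built on an i/o library generated by
 * hddm-c.  Every identifier below is an illustrative assumption for a
 * data model of class "s"; the real names come from the user's template. */
#include <stdio.h>
#include "hddm_s.h"                 /* header written by hddm-c */

int main(int argc, char *argv[])
{
   s_iostream_t *fin;
   s_HDDM_t *event;
   int count = 0;

   if (argc < 2) {
      fprintf(stderr, "usage: %s file.hddm\n", argv[0]);
      return 1;
   }
   fin = open_s_HDDM(argv[1]);      /* reads and checks the xml header */
   if (fin == 0) {
      fprintf(stderr, "unable to open %s\n", argv[1]);
      return 1;
   }
   while ((event = read_s_HDDM(fin)) != 0) {
      /* the c-structures mirror the xml hierarchy of the model; any
       * branch not present in this particular file comes back empty */
      ++count;
      flush_s_HDDM(event, 0);       /* assumed call to release the event */
   }
   close_s_HDDM(fin);
   printf("read %d events\n", count);
   return 0;
}

Writing follows the same pattern in reverse: the program fills in the
c-structures for one event and hands them to the corresponding output
call in the generated library.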
Rules for constructing HDDM templates
- A hddm template is nothing more than a plain-text xml file that mimics
the structure of the xml that the program expects on input or produces
on output. In some ways it is like sample data that the programmer
might provide to a user to demonstrate how to use it, although the
comparison is not perfect.
- The top element in the template must be <HDDM> and have
three required attributes: class, version, and xmlns.
The value of the latter must be xmlns="http://www.gluex.org/hddm".
The values of the class and version attributes are user-defined. They
serve to identify a group of schemas that share a basic set of tags.
See below for more details on classes.
- The names of elements below the root <HDDM> element are
user-defined, but they must be constructed according to the following
rules.
- All values in hddm files are expressed as attributes of elements.
Any text that appears between tags in the template is treated as
a comment and ignored.
- An element may have two kinds of information attached to it: child elements
which appear as new tags enclosed between the open and close tags of
the parent element, and attributes which appear as key="value"
items inside the open tag.
- All quantities in the data model are carried by named attributes of
elements. The rest of the document exists to express the meaning of
the data and the relationships between them.
- All elements in the model document either hold attributes, contain other
elements, or both. Empty elements are meaningless, and are not allowed.
- One way a template is not like sample data is that it does not
contain actual numerical or symbolic values for the fields in the
structure. In the place of actual values, the types of the fields
are given. For example, instead of showing energy="12.5" as
might be shown for sample data, the template would show in this
position energy="float" or energy="double".
- The complete list of allowed types supported by hddm is "int", "long",
"float", "double", "boolean", "string", "anyURI", and "Particle_t". The
Particle_t type is a value from an enumerated list of capitalized
names of physical particles. The int type is a 32-bit signed integer,
and long is a 64-bit signed integer. The other cases are obvious.
- Attributes in the template whose values do not belong to this list are
assumed to be constants. Constants are sometimes useful for annotating
the xml record. They must have the same value for all instances of the
element throughout the template.
- Any given element may appear more than once throughout the template
hierarchy. Wherever it appears, it must appear with identical
attributes and with child elements of the same order and type.
- Another difference between a template and sample data is that the
template never shows a given element more than once in a given context,
even if the given tag would normally be repeated more than once for
an actual sample. One obvious example of this is a physics event,
which is represented only once in the template, but repeated multiple
times in a file.
- By default, it is assumed that an element appearing in the template
must appear in that position exactly once. If the element is allowed
to appear more than once or not at all then additional attributes
should be inserted in the element of the form minOccurs="N1"
and maxOccurs="N2", where N1 can be zero or any positive
integer and N2 can be any integer no smaller than N1, or
set to the string "unbounded". Each defaults to 1.
- Arrays of simple types are represented by a sequence of elements,
each carrying an attribute containing a single value from the array.
This is more verbose than allowing users to include arrays as a simple
space separated string of values, but the chosen method is more apt
for expressing parallelism between related arrays of data.
- An element may be used more than once in the model, but it may never
appear as a descendent of itself. Such recursion is complicated to
handle and it is hard to think of a situation where it is necessary.
- Examples of valid hddm templates are given in the examples section
below.
- Because templates contain new tags that are invented by the programmer,
it is not possible to write a standard template schema against which a
programmer can check his new xml file for use as a template. Instead of
using schema validation, the programmer can use the hddm-schema
tool to check a xml file for correctness as a hddm template. Any errors
that occur in the hddm-schema transformation indicate problems in the
xml file that must be fixed before it can be used as a template.
Rules for constructing HDDM schemas
- HDDM schemas must be valid xml schemas, belonging to the namespace
http://www.w3.org/2001/XMLSchema. Not every valid schema is a valid
HDDM schema, however, because xml allows for several different ways to
express a given data structure.
- GlueX programmers are not obligated to write schemas that conform to
the HDDM specification, but if they do, they have the help of the HDDM
tools for efficient file storage and i/o.
- In the following specification, a prefix xs: is applied to the
names of elements, attributes or datatypes that belong to the official
schema namespace "http://www.w3.org/2001/XMLSchema", whose meaning is
defined by the xml schema standard. The extensions introduced for the
specific needs of GlueX are assigned to a private namespace called
"http://www.gluex.org/hddm" that is denoted by the prefix hddm:.
- The top element defined by the schema must be <hddm:HDDM> and have
three required attributes: class, version, and xmlns.
The value of the latter must be xmlns="http://www.gluex.org/hddm".
The class and version attributes are of type xs:string and are
user-defined. They serve to identify a group of schemas that share a
basic set of tags. See below for more details.
- The names of elements below the root <hddm:HDDM> element are
user-defined, but they must be constructed according to the following
rules.
- An element may have two kinds of content, child elements and attributes,
and hence must be declared with xs:complexType. Elements represent the
grouping together of related pieces of data in a hierarchy of nodes.
The actual numerical or symbolic values of individual variables appear
as the values of attributes. Examples are shown
below.
- All quantities in the data model are carried by named attributes of
elements. The rest of the document exists to express the meaning of
the data and the relationships between them.
- All elements in the model document either hold attributes, contain other
elements, or both. Empty nodes are meaningless, and are not allowed.
- Text content between open and close tags is allowed in documents
(mixed content, declared in the schema with mixed="true") but it is
treated as a comment and stripped on translation. Basic HDDM schemas
do not declare mixed content.
- The datatype of an attribute is restricted to a subset of basic types
to simplify the task of translation. Currently the list is
xs:int, xs:long, xs:float, xs:double,
xs:boolean, xs:string, xs:anyURI and
hddm:Particle_t. User types that are derived from the above
by xs:restriction may also be defined and used in a HDDM schema.
- Attributes must always be either "required" or "fixed". Default
attributes, i.e. those that are sometimes present inside their host and
sometimes not, are not allowed. This allows a single element to be
treated as a fixed-length binary object on serialization, which has
advantages for efficient i/o.
- A datum that is sometimes absent can be expressed in the model by
assigning it as an attribute to its own host element and putting the
host element into its parent with minOccurs="0".
- Fixed attributes (with use="fixed") may be attached to
user-defined elements. They may be of any valid schema datatype, not
just those listed above, and may be used as comments to qualify the
information contained in the element. Because they have the same
value for every instance of the element, they do not take up space in
the binary stream, but they are included explicitly in the output
produced by the hddm-xml translator.
- All elements must be globally defined in the schema, i.e. declared at
the top level of the xs:schema element. Child elements are
included in the definition of their parents through a ref=tagname
reference. Local definitions of elements inside other elements are not
allowed. This guarantees that a given element has the same meaning and
contents wherever it appears in the hierarchy.
- Arrays of simple types are represented by a sequence of elements,
each carrying an attribute containing a single value from the array.
This is more verbose than allowing a simple list type like is defined
by xs:list, but the chosen method is more apt for expressing
parallelism between related arrays of data, such as frequently occurs
in descriptions of physical events. Forbidding the use of simple
xs:list datatypes should encourage programmers to choose the
better model, although of course they could just mimic the habitual use
of lists by filling the data tree with long strings of monads!
- Elements are included inside their parent elements within a
xs:sequence schema declaration. Each member of the sequence
must be a reference to another element with a top-level definition.
- A given element may occur only once in a given sequence, but may
have minOccurs and maxOccurs attributes to indicate
possible absence or repetition of the element.
- The sequence is the only content organizer allowed by HDDM.
More complex organizers are supported by schema standards, such as
all and choice, but their use would complicate the i/o
interfaces that have to handle them and they add little by way
of flexibility to the model the way it is currently defined.
- An element may be used more than once in the model, but it may never
appear as a descendent of itself. Such recursion is complicated to
handle and it is hard to think of a situation where it is necessary.
- A user can check whether a given schema conforms to the HDDM rules
by transforming it into a hddm template
document. Any errors that occur during the transformation generate
a message indicating where the specification has been violated.
Class relationships between HDDM schemas
- Two HDDM schemas belong to the same class if all tags that are
defined in both have the same set of attributes in both.
- This is a fairly weak condition. It is possible that all data files
used in GlueX will belong to the same class, but it is not required.
- If two HDDM schemas belong to the same class then it is possible to
form a union schema that will validate documents of either type by
taking the xml union of the two schema documents and changing any
sequence elements in one and not in the other to minOccurs="0".
- The translation tools xml-hddm and hddm-xml will work
with any HDDM class.
- Any program built using the i/o library created with hddm-c is
dependent on the class of the schema used during the build. Any files
it writes through this interface will be built on this schema; however,
it is able to read any file of the same class without recompilation.
- A new schema may be derived from an existing HDDM schema by taking the
existing one and adding new elements to the structure. In this case
the version attribute of the HDDM tag should be incremented, while
leaving the class attribute unchanged.
- A program that was built using the hddm-c tool for its i/o
interface can read from any hddm file of the same class as
the original schema used during the build. If the content of the file
is a superset of the original schema then nothing has changed. If
some elements of the original schema are missing in the file then the
i/o still works transparently, but the c-structures corresponding to
the missing elements will be empty, i.e. zeroed out.
- The c/c++ i/o library rejects an attempt to read from a hddm file that
has a schema of a different class from the one for which it was built.
- No mandatory rules are enforced on the version attribute of the
hddm file, but it is available to programs and may be used to select
certain actions based on the "vintage" of the data.
- Programs that need simultaneous access to multiple classes of hddm
files can be built with more than one i/o library. The structures and
i/o interface are defined in separate header files hddm_X.h and
implementation files hddm_X.c, where X is the class letter.
Implementation Notes
- There is a complementarity between xml schemas and the xml templates
that express the metadata in hddm files. Depending on the level of
detail desired, schemas may become arbitrarily sophisticated and
complex. On the other hand, only a small subset of that information
is needed to support the functions of the hddm tool set. Templates
allow that information to be distilled in a compact form that is both
human-readable and valid xml.
- In the present implementation, the text layout of the template
(including the whitespace between the tags) is used by the hddm tools
to simplify the encoding and decoding. There is exactly one tag per
line and two leading spaces per level of indent. This means that hddm
file headers should not be edited by hand. This convention may change
in future implementations.
- The XDR [3]
library is used to encode the binary values in the hddm
stream. This means that hddm files are machine-independent, and
can be read and written on any machine without any dependence on
whether the machine is little-endian or big-endian. XDR is the network
encoding standard library developed for Sun's rpc and nfs services.
For more information, search for RFC 1014 on the web or do "man xdr"
under linux (a small illustration of the encoding is given at the end
of this list).
- The binary file format is expected to change. The point is not to fix
on some absolute binary format at this early stage. The only
design constraint was that the data model be specified in xml and
that the data be readily converted into plain-text xml, preferably
without needing to look up auxiliary files or to load the libraries
that wrote it.
- The design of the i/o library has been optimized for flexibility:
the user can request only the part of the model that is of interest.
The entire model does not even have to be present in the file, in which
case only the parts of the tree that are present in the file are loaded
into memory, and the rest of the requested structure is zeroed out.
- The only constraint between the model used in the program and that
of the hddm stream is that there be no collisions, that is, tags
found in both but with different attributes.
- Two data models with colliding definitions can be used in one program
but they have to have different class Ids. Two streams with
different class Ids cannot feed into each other. In any case the
xml viewing tool hddm-xml can read a hddm stream of any class.
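To illustrate the machine-independence that XDR provides, the short
program below encodes a pair of values with the standard Sun rpc xdr
routines. This is not code from the hddm library itself, only a minimal
demonstration of the encoding layer that it relies on; the file name is
arbitrary.

/* Minimal demonstration of XDR encoding (not hddm library code).
 * Values written this way have the same byte-level representation
 * regardless of whether the machine is little-endian or big-endian. */
#include <stdio.h>
#include <rpc/xdr.h>

int main(void)
{
   FILE *fp = fopen("sample.xdr", "wb");
   XDR xdrs;
   int count = 3;
   float dE = 1.25;

   if (fp == 0)
      return 1;
   xdrstdio_create(&xdrs, fp, XDR_ENCODE); /* attach an encoder to the file */
   xdr_int(&xdrs, &count);                 /* 32-bit integer, big-endian */
   xdr_float(&xdrs, &dE);                  /* IEEE-754 single precision */
   xdr_destroy(&xdrs);
   fclose(fp);
   return 0;
}

Decoding on another machine uses the same sequence of calls with
XDR_DECODE in place of XDR_ENCODE.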
Examples
- A simple model of an event fragment describing hits in a
time-of-flight wall. It allows for multiple hits per detector
in a single event, with t and dE information for each hit.
The hits are ordered by side (right: end=0, left: end=1) and then by
horizontal slab. The minOccurs and maxOccurs attributes allow those
tags to appear any number of times, or not at all, in the given context
(a sketch of c-structures corresponding to this template is given at the
end of this section).
<forwardTOF>
  <slab y="float" minOccurs="0" maxOccurs="unbounded">
    <side end="int" minOccurs="0" maxOccurs="unbounded">
      <hit t="float" dE="float" maxOccurs="unbounded" />
    </side>
  </slab>
</forwardTOF>
- A model of the output from an event generator.
An example of actual output from genr8
converted to xml using hddm-xml.
<?xml version="1.0" encoding="UTF-8"?>
<HDDM class="s" version="1.0" xmlns="http://www.gluex.org/hddm">
  <physicsEvent eventNo="int" runNo="int">
    <reaction type="int" weight="float" maxOccurs="unbounded">
      <beam type="Particle_t">
        <momentum px="float" py="float" pz="float" E="float" />
        <properties charge="int" mass="float" />
      </beam>
      <target type="Particle_t">
        <momentum px="float" py="float" pz="float" E="float" />
        <properties charge="int" mass="float" />
      </target>
      <vertex maxOccurs="unbounded">
        <product type="Particle_t" decayVertex="int" maxOccurs="unbounded">
          <momentum px="float" py="float" pz="float" E="float" />
          <properties charge="int" mass="float" />
        </product>
        <origin vx="float" vy="float" vz="float" t="float" />
      </vertex>
    </reaction>
  </physicsEvent>
</HDDM>
- A more complex example follows, showing a hits tree for the full
detector.
<?xml version="1.0" encoding="UTF-8"?>
<HDDM class="s" version="1.0" xmlns="http://www.gluex.org/hddm">
  <physicsEvent eventNo="int" runNo="int">
    <hitView version="1.0">
      <barrelDC>
        <cathodeCyl radius="float" minOccurs="0" maxOccurs="unbounded">
          <strip sector="int" z="float" minOccurs="0" maxOccurs="unbounded">
            <hit t="float" dE="float" maxOccurs="unbounded" />
          </strip>
        </cathodeCyl>
        <ring radius="float" minOccurs="0" maxOccurs="unbounded">
          <straw phim="float" minOccurs="0" maxOccurs="unbounded">
            <hit t="float" dE="float" minOccurs="0" maxOccurs="unbounded" />
            <point z="float" dEdx="float" phi="float"
                   dradius="float" maxOccurs="unbounded" />
          </straw>
        </ring>
      </barrelDC>
      <forwardDC>
        <package pack="int" minOccurs="0" maxOccurs="unbounded">
          <chamber module="int" minOccurs="0" maxOccurs="unbounded">
            <cathodePlane layer="int" u="float" minOccurs="0" maxOccurs="unbounded">
              <hit t="float" dE="float" minOccurs="0" maxOccurs="unbounded"/>
              <cross v="float" z="float" tau="float" maxOccurs="unbounded" />
            </cathodePlane>
          </chamber>
        </package>
      </forwardDC>
      <startCntr>
        <sector sector="float" minOccurs="0" maxOccurs="unbounded">
          <hit t="float" dE="float" maxOccurs="unbounded" />
        </sector>
      </startCntr>
      <barrelCal>
        <module sector="float" minOccurs="0" maxOccurs="unbounded">
          <flash t="float" pe="float" maxOccurs="unbounded" />
        </module>
      </barrelCal>
      <Cerenkov>
        <module sector="float" minOccurs="0" maxOccurs="unbounded">
          <flash t="float" pe="float" maxOccurs="unbounded" />
        </module>
      </Cerenkov>
      <forwardTOF>
        <slab y="float" minOccurs="0" maxOccurs="unbounded">
          <side end="int" minOccurs="0" maxOccurs="unbounded">
            <hit t="float" dE="float" maxOccurs="unbounded" />
          </side>
        </slab>
      </forwardTOF>
      <forwardEMcal>
        <row row="int" minOccurs="0" maxOccurs="unbounded">
          <column col="int" minOccurs="0" maxOccurs="unbounded">
            <flash t="float" pe="float" maxOccurs="unbounded" />
          </column>
        </row>
      </forwardEMcal>
    </hitView>
  </physicsEvent>
</HDDM>
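As a final illustration, the following fragment sketches the kind of
c-structures that the hddm-c tool might generate from the forwardTOF
template in the first example above. The type names and exact layout are
assumptions made for this note rather than the literal output of the tool;
the essential point is that a repeated element (maxOccurs greater than 1)
becomes a counted array, while the typed attributes become ordinary
structure members.

/* Illustrative c-structures for the forwardTOF template above.
 * Names and layout are assumptions, not the actual hddm-c output. */
typedef struct {
   float t;                   /* the t="float" attribute of <hit> */
   float dE;                  /* the dE="float" attribute of <hit> */
} s_Hit_t;

typedef struct {
   unsigned int mult;         /* number of <hit> elements present */
   s_Hit_t in[1];             /* allocated with mult entries */
} s_Hits_t;

typedef struct {
   int end;                   /* the end="int" attribute of <side> */
   s_Hits_t *hits;
} s_Side_t;

typedef struct {
   unsigned int mult;
   s_Side_t in[1];
} s_Sides_t;

typedef struct {
   float y;                   /* the y="float" attribute of <slab> */
   s_Sides_t *sides;
} s_Slab_t;

typedef struct {
   unsigned int mult;
   s_Slab_t in[1];
} s_Slabs_t;

typedef struct {
   s_Slabs_t *slabs;
} s_ForwardTOF_t;

A user program would then loop over slabs->mult, sides->mult, and
hits->mult to visit every hit in the event, with absent branches showing
up as empty structures as described in the notes above.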
References
[1]
Representing Scientific Data on the Grid with BinX, Binary XML Description
Language, M. Westhead and M. Bull, University of Edinburgh, January 2003.
[2]
HDF5 User's Guide,
working draft, July 2003.
[3]
RFC 1832, XDR: External Data Representation Standard, September 1995.
This material is based upon work supported by the National Science Foundation under Grant No. 0901016.