
XMML version of GDF2


Related pages

The ASEG GDF2 specification is attached here: ASEG-GDF2-REV3_8_Publn.pdf

XML dump from Intrepid

sample GDF2 data in XML, supplied by DesFitzgerald: see the attached AmWRtemp.xml and AmWRtemp1.xml

Analysis of the model

GDF supports a (set of) tabulations, where the records in a table have a common structure. It is described as "point-located data", and is usually used to encode "raw" observational data, before processing into grids, etc. However, notwithstanding the "point-located" moniker, within the GDF encoding the position information is not actually highlighted in any way, and merely appears in arbitrary columns within the table. In fact the rows of the table are not explicitly indexed to position, time, or anything else - it is simply a sequence of records.

There is currently no feature-type for un-indexed records in GML/XMML. The "coverage" model comes closest, in which a set of (homogeneously typed) records is mapped to a set of elements comprising a spatio-temporal domain - see ISO 19123 and XmmlCoverage.

Should there be a sequence-of-records feature type in XMML?

The GML/XMML Coverage is encoded as an explicit functional map, where the domain set (the independent or controlled variable) appears separate from the range values (the dependent variables). Thus, the strawmen examples shown here use the XMML Coverage encoding, extracting either the
  • positions
  • sampling times
as the domain of the coverage, and recording the rest of the table as a sequence of records (some band-oriented and tabular variants are also possible). The assumption is that either position or time is in fact the controlled variable here, and that the other values are observations dependent on it and thus appear as the content of the range element.
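As a rough sketch of this domain/range split (element and attribute names here are simplified assumptions following the coverage pattern described above, not normative XMML syntax - the real examples are in the linked repository files):

```xml
<!-- Illustrative sketch only: names approximate the XMML/GML coverage
     pattern; positions form the domain, records form the range. -->
<PointSetCoverage>
  <domain>
    <!-- the controlled variable: one position per record -->
    <PointSet srsName="EPSG:4326">
      <pos>-31.95 115.86</pos>
      <pos>-31.94 115.87</pos>
    </PointSet>
  </domain>
  <range>
    <!-- the dependent variables: one record per domain position -->
    <tupleList>
      57321.4 102.3 -12.7,
      57322.1 101.8 -12.9
    </tupleList>
  </range>
</PointSetCoverage>
```

A TimeComplexCoverage would differ only in the domain, listing time-instants in place of positions.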

Is "fiducial" actually the controlled parameter?? What is this?

Strictly speaking, the independent variable is the line number/fiducial combination - that's unambiguous. Or else the line number/shotpoint combination. Or else the station number (for point observations, which are often random or semi-random.) -- AlanReid - 24 Aug 2004

XMML strawmen

Time Series version

The sample dataset encoded as an XMML TimeComplexCoverage (See XmmlCoverage for more detail):
  • XmmlSchemaRepository:trunk/Examples/GDF/AmWR_r_t.xml - explicit version, observations recorded as a list of records - This reflects a raw "data-collection" view, where the observations are made on clock-ticks, and the positional data is part of the observation

Point-locations version

The sample dataset encoded as an XMML PointSetCoverage (See XmmlCoverage for more detail):
  • XmmlSchemaRepository:trunk/Examples/GDF/AmWR_r.xml - explicit version, observations recorded as a list of records - This provides the data organised as a spatial coverage, ready for spatial processing such as gridding or interpolation

Three variant encodings of the spatial coverage version are also provided for comparative purposes:
  • XmmlSchemaRepository:trunk/Examples/GDF/AmWR_t.xml - explicit version, observations recorded in a single table
  • XmmlSchemaRepository:trunk/Examples/GDF/AmWR_b.xml - explicit version, observations recorded in a series of lists in band-sequential order
  • XmmlSchemaRepository:trunk/Examples/GDF/AmWR_f.xml - observations in an external file

Record structure / DFN file

Within the coverage documents, the rangeParameters element carries a link to the definition of the record structure. This is given in

Within these documents the ComponentTupleDefinition contains a set of member/AxisDefinition elements (see RecordSchema for more detail, but note that the actual syntax here is currently under revision). Following the GmlProperty pattern, the axis definitions could also be embedded directly within the document that contains the data. When stored in a separate document and included by reference, as shown here, this is most similar to the approach used for the GDF2 Data Definition (DFN) file.
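A hypothetical fragment, assuming the member/AxisDefinition names mentioned above (the actual RecordSchema syntax is under revision, and the field names and units are invented for illustration), showing how such a definition plays the role of a GDF2 DFN file:

```xml
<!-- Hypothetical sketch: axis names and units invented for illustration;
     the structure follows the description above, not a frozen schema. -->
<ComponentTupleDefinition id="recordType">
  <member>
    <AxisDefinition id="tmi">
      <name>Total magnetic intensity</name>
      <uom>nT</uom>
    </AxisDefinition>
  </member>
  <member>
    <AxisDefinition id="radalt">
      <name>Radar altimeter height</name>
      <uom>m</uom>
    </AxisDefinition>
  </member>
</ComponentTupleDefinition>
```

Each axis definition carries its own id, so range values in the data document can be bound to it by reference, much as a GDF2 DAT file is interpreted through its companion DFN file.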


Note: these examples are intended to show how data equivalent to GDF-2 can be represented using one of XMML's generic formats - the PointSetCoverage and TimeComplexCoverage. This shows the patterns used in XMML to represent this kind of data. It is not intended to be final, but rather to serve as a stepping-off point for developing an XMML encoding or practice for point-located geophysics data.

Other models and encodings may provide additional insight. For example the OPeNDAP model for sequences appears to match the GDF-2 model quite closely. The TimeComplexCoverage version shown here (XmmlSchemaRepository:trunk/Examples/GDF/AmWR_r_t.xml) is very similar to this model, but imposes the additional rigour of an "index time" for each record in the sequence.

The design decisions from this point are
  1. whether to persist with the GML/XMML "coverage" approach
  2. if so, what degree of specialization is required to make this useful in the geophysics community
  3. if not, then is the "sequence" approach more useful, or something else?

In the generic encoding shown in the examples, the data is "soft-typed" - the semantics (meaning) of the data is included as local values of "metadata" within the document.

Various levels of strong-typing are possible, for example in which XML tags might be "gravity" or "radiometrics" etc. Strong typing makes the data file more immediately readable. But strong typing has some drawbacks, primarily in terms of maintenance: a schema-file must be created (or generated) that describes all the tag-names used in the data instance. If this schema is intended to act as the "community" schema, then it must be defined in advance and must be exhaustive in respect of all the properties, units, etc likely to be measured. On the other hand, if the schema is generated "at-run-time" in response to the structure of the input, then it is not a community schema, and may not support interoperability.
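For example, a strongly-typed record might look like the following sketch (tag names invented for illustration); each property name becomes an element that the community schema must declare in advance:

```xml
<!-- Hypothetical strong-typed record: property names are baked into the
     tags, so the schema must enumerate every observable in advance. -->
<observation>
  <gravity uom="mGal">9781234.5</gravity>
  <totalMagneticIntensity uom="nT">57321.4</totalMagneticIntensity>
</observation>
```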

In situations like observational geophysics, where the details of the observed properties vary from survey to survey, then it is more common to use a "soft-typing" approach. The labels, units, etc associated with each observed property are described in an XML instance - either the same document or in a separate file - rather than a schema. These terms may or may not be drawn from a controlled vocabulary, which may or may not be extensible. Of course this is only moving the issue to another position in the data modelling and processing chain, but is often a better optimisation.
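The soft-typed equivalent of the same record might look like this sketch (names again invented for illustration); the semantics move into instance-level attributes, so one generic schema serves every survey:

```xml
<!-- Hypothetical soft-typed record: the generic <value> element is fixed
     by the schema, while labels and units travel as instance data. -->
<record>
  <value property="gravity" uom="mGal">9781234.5</value>
  <value property="totalMagneticIntensity" uom="nT">57321.4</value>
</record>
```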

The discussion below has been partly overtaken by revisions made to the strawmen above

Your rework is certainly wordy. Breaking the XY coords out is new for us... but OK, we could do this view of things. It is not easy to see how to push part of this to binary, as everything is so intermingled. The traditional header/data relationship is lost. So this is fine for a slower interchange/archive format, but perhaps not a candidate for a persistent optimized data representation.

-- DesFitzgerald - 13 July 2004

I don't think it is as tricky as that.

  1. wordy - kinda - but the wordy piece is really only the Tuple/Axis definitions (see RecordSchema, but this is currently under revision) - equivalent to the old DFN file. Furthermore, this overhead only occurs once per dataset, and the data stream itself is packaged pretty tight. But I am willing to listen to suggestions regarding the specific syntax of this section - it is not tightly coupled into the other parts of the data anyway - the only real need is for each axis definition to carry its own ID.
  2. breaking out the locations - I have followed the GML coverage pattern quite strictly here. I'm not sure I buy your header/data comment, since what is effectively the independent (controlled) variable - position, for the purposes of analysis anyway - is kept separate from the dependent (measured) variables. Note that it would be possible (though probably not wise) to duplicate the positional information in the table. For example, you might move either the lat-lon or the projected version of the position there, instead of hiding it in what is effectively a "comment".
  3. Linking to external (e.g. binary) files is easy - see the example XmmlSchemaRepository:trunk/Examples/GDF/AmWR_f.xml
  4. Not sure that anyone should use this for persistence. It is a transfer format. I tend to characterise it as a "canonical" format for transfer, but not smart for storage except perhaps for archive.
  5. is GDF-2 primarily for "raw" data? In which case, is the "intermingling" you mention a pretty fundamental characteristic - following from the fact that the positional information is just another measurement at this stage, rather than a "controlled" variable?

-- SimonCox - 13 Jul 2004/11 Aug 2004

I'm certainly interested to see this. I've read through it without grasping all the finer points, because I'm not up with the jargon, but there are a few fundamentals that are worth keeping in mind, that I fear are in danger of getting lost.

  1. I don't buy the idea that position is the "independent variable". It's another observable, with its own uncertainties and grief in the field. I guess I'd accept that time was the independent variable if you needed to name one explicitly. And I (and I expect lots of other people) would be quite unhappy to see position in any way submerged.
  2. The difference between grids and point or line data is fundamental. The latter two are observations (at least in principle). A grid is an interpolated product, designed for compact storage and fast computation, but it's a significant loss of information. I (and again lots of other people) would be unhappy to see the distinction blurred.
  3. There's a good reason that GDF has the data in one file and the header in another. You get to look at the header without opening (or maybe even having to load) a 5 GB data file (say). That matters to people at the coal face.
  4. Is anyone keeping the ASEG Standards committee up to date on this stuff?

Apart from that, go to it.

-- AlanReid - 16 Jul 2004

A few responses -

  • the "independent variable" issue. I am aware of the fact that pretty much all of the quantities are the results of observation, including position. I guess it is usually the case that all the observations are made on clock-ticks, so in terms of the observation process, time is the basic independent variable. Two responses to the issue, however:
    1. The XMML/GML coverage is merely an encoding of a mapping from a spatio-temporal domain to a set of property values called the "range". The examples that I have generated so far required no changes to the generic coverage.xsd schema. While in the examples I used the 2-D geographic location for the domain, it is almost equally straightforward to specify a 1-D temporal "point-set" (i.e. a time-complex whose members are all time-instants) as the domain of the coverage, in which case the geospatial data could indeed be accommodated as values of the range, as you suggest.
    2. however, GDF-2 is explicitly designed for (as the title says) "point located data". The examples that I generated were merely intended to reflect that. My understanding is that the "point located-ness" matches the general use that is made of this data, i.e. usually either examined as spatial profiles, or interpolated onto a spatial grid. "Coverage" data can generally be understood as "data ready for processing" - see InformationViews. This requires that one property is designated to be the controlled variable, with the other axes the dependent variable(s).
  • Yes - the difference between observed and interpolated data is important. However, please note that the "coverage" representation on the GDF2 page on the TWiki uses the point-set version for the domain, in which the spatial positions are explicitly and independently described. This is not a grid. The points in the example are exactly the same points in the same sequence as the sample data provided by Des. There is another form for the domain called "rectified grid" which is an implicit geometry typically used for conventional grid locations.

But I do not agree that it is the observed vs interpolated distinction that is "fundamental". Rather, for all data sets that are "estimates" of the true value, an account of the procedure(s) by which they were obtained needs to be available, somehow. That includes datasets that are generated by instrumental observation, by human observation, by computation, by simulation, etc. I have argued elsewhere that only when the property values are asserted (e.g. legal title) are they known to be completely free from error. Even the simplest and most direct measurements still involve an algorithm and uncertainties, such as calibrations. The uncertainty may or may not be important in the context within which the data are being used. But while it is important to be clear how far down the processing chain we are with any particular representation, "fundamental" is a heavy word, and it is not justified as a distinction between line and gridded data.

XMML is intended to provide a transfer format, typically a "snapshot" of whatever the provider has. There is no implied assertion of any particular merit in the dataset. The value is whatever has been agreed between the provider and receiver.

  • XMML coverage does already support out-of-band representation of the range - see XmmlSchemaRepository:trunk/Examples/GDF/AmWR_f.xml. But note that the general principle is that the fully-XML version of the data is taken to be the "canonical" representation. The requirement that I was working with was to do an XML version of GDF. However, if what you want is rapid access to the "metadata", then we may specify a distinct "feature type" for this. It should look pretty much like XmmlSchemaRepository:trunk/Examples/GDF/AmWR_f.xml, maybe with the "File" link pointing to the full XML version such as AmWR_r.xml... :-)
  • DavidPratt had encouraged me to look at GDF-2 with the goal of generating a "standard" XML version. These experiments are an attempt to get that started. As I noted above, what I showed in the examples uses the "generic" coverage encoding without alterations, making particular selections from the choices provided in the generic model. My expectation is that after some iteration and critique of these, I will generate a specialised profile of the XMML/GML coverage that is tailored for the GDF use case, or possibly a variant of this. As far as possible the profile will follow the generic patterns, but wherever necessary we may throw in some more specific capabilities.

-- SimonCox - 16 Jul 2004

ASEG (and SEG) maintains two relevant exchange standards. Grid eXchange Format (GXF) for grids and GDF for point and line observations. Line observations are really just sets of points anyway. As a data structure, a grid is a raster (boolean, integer or real) and the whole GIS/Remote Sensing community has an interest in an XML schema for raster data. I understand such schemas are under development. Intrepid itself recognises that by using the ER Mapper raster data type for its grids. So you don't actually need to include the grid case in your schema. Lots of other people are anyway. And I expect you know that. In fact, I'd be interested to hear if there's a standard raster schema been agreed yet.

OK, so you can always regard grids as regularly spaced data points, but there is a distinction that matters to people like us. In our little world a grid is inevitably a regularization of irregularly spaced data. Sometimes very irregular and often very anisotropic. Some of the grid points may be quite reliable, because they're near an observation. Others may be hopeless. But all too often you can't tell because you don't have the sampling pattern or the companion uncertainty grid. This is a real problem in practice. And you'll find that people will pay a lot more for the underlying point data than they will for a grid made from the point data. And data owners will often release grids but retain the underlying point data. So it's undoubtedly built into the culture. Is there a good reason? I think so.

I can usually make some kind of half-reliable estimate of the uncertainty in a point value. And I could (in principle) do the same for any given grid point at the time of gridding, so you are right in principle. But of course not many people do (or can do) the grid uncertainty calculation, and even fewer bother to retain the grid uncertainty matrix. I've calculated one once in my life. Most people would in fact be baffled if you asked for it. We just accept that grids are a lot less reliable unless you have the original point locations at least.

I was forced to think about this when someone teaching geophysics asked me what the typical uncertainty was in an airborne mag grid point value. After quite a bit of thinking, I replied that it depended on so many things and varied so widely that a "typical" value was meaningless. But I could say that most modern airborne mag observations were probably good to better than 1 nT. And I've recently been interpreting a gravity data grid where I know the data density varies wildly, but I don't know most of the point distribution. It's like fighting with both hands tied behind your back.

In summary, I'm saying three things:

  1. Most (or maybe all) practitioners make a clear distinction between point and gridded (i.e. raster) data and would look for schemas which respect that.
  2. The difference between point and grid data in the geophysical world is that we have a handle on uncertainties for point data and we do not for grid data. We instinctively recognise that, so we cling to the distinction.
  3. There is a suitable XML raster schema in development by the Remote Sensing community. You don't need to include grid data in your point schema. Do you have any news of progress with that?

-- AlanReid - 29 Jul 2004

The examples linked above are for point-located data. As noted above, and implied by the page title, they correspond with GDF-2, and not with GXF. Looking back through this discussion I think it is AlanReid who introduced "grid" into the discussion. While there is a GML/XMML encoding for gridded data, and it does follow the same general "coverage" pattern, it is not under examination here. So there should be no confusion.

However, the rather firm way that the concern is being expressed does indicate that something else is going on. The way that AlanReid is describing it, it appears that "practitioners" are inferring some metadata (concerning attribute data quality) from the data representation (point-located or line, vs gridded/interpolated). That worries me - the data format should not, in itself, carry this kind of semantic implicitly. Explicitly, fine, but a data format is pretty much value-neutral in itself. GML/XMML is about formatting data for transfer, with tags included that indicate semantics as documented in the schema and other accompanying documentation. Whether the data provider chooses to use the format for rubbish or gold is up to them and the agreement they have with the data consumer. But information about data quality should not be inferred from the transfer format - that is very much not the "XML" way, which is about explicit tagging. For example, some data is acquired using techniques in which the sample points absolutely are on a regular grid (e.g. imagery). So the grid-ness or otherwise does not automatically determine data quality in all application domains, even within geophysics.

Data quality may be handled explicitly in XML formats, as dataset or even item level "metadata". The sample data from DFA did not include such metadata, so it is not seen in the XML examples.

Concerning the question regarding standards for gridded data: we are circling this one at present. As I note above, GML does support gridded data. But we know that out there in the real world, there are a number of existing standards, mostly binary. netCDF, geoTIFF, HDF-EOS, ECW are a few. Figuring out how to manage these in a GML/XMML environment is going to be very important.

-- SimonCox - 02 Aug 2004

As to your point about current practice being poor (that is, inferring metadata from format), you may have a point. But it is nevertheless a very deeply embedded current practice. Even if you succeed in re-educating us all (which would probably take a decade, if not a generation), there's still the legacy data. For Australia, that's a deep pile. For the world, somewhat deeper. You are going to have to consider "backward compatibility" to some degree. Just take a look at GA's GADDS (http://www.ga.gov.au/gadds). That's an awesome piece of work to dismiss as "poor practice".

By all means point to where we need to be, but we are where we are, and GA's GADDS is "current best practice". You'll see a similar thing by Geol Survey of Canada. The rest of the world is well behind either.

-- AlanReid - 24 Aug 2004

Topic attachments
  • ASEG-GDF2-REV3_8_Publn.pdf - 914.6 K - 29 May 2003 - SimonCox - GDF-2 - r3.8
  • AmWRtemp.xml - 7.6 K - 30 Jun 2004 - UnknownUser - First sample ASEG GDF2 data in xml
  • AmWRtemp1.xml - 7.6 K - 30 Jun 2004 - UnknownUser - ASEG GDF2 xml with CDATA
Topic revision: r19 - 15 Oct 2010, UnknownUser

Current license: All material on this collaboration platform is licensed under a Creative Commons Attribution 3.0 Australia Licence (CC BY 3.0).