
Textual Data

Current definitions in schema file:

* XmmlSchemaRepository:trunk/GeoSciML/gsmlBase.xsd

Text description, analysis, background, explanation etc. is an important part of scientific information exchange. It needs to be included in a mark-up language for geoscientific information exchange.

BGS has developed the BGS Textbase system to hold various kinds of text, from geological memoirs to notes made by geologists while field mapping.

The OGC has a discussion paper on Location Organizer Folders, which are a generic way of gathering together some narrative text with GML Features and other kinds of data to which the text refers.

There is an overlap between these two systems which suggests a suitable way forward for handling text in a geoscientific mark-up language (and more general GML application languages). They both have some characteristics which are specific to the way in which they have been implemented so far and which might be dropped for a standard mark-up language.

From looking at these two systems and previous comments by SimonCox we can consider how a "Narrative Text Feature" might look. (I'm not sure of the best name to call this.) This is "a bit of a stretch conceptually", but these features could have a kind of identity (e.g. corresponding to a particular report or publication) with an associated name and usually a location corresponding to the area they are describing. The textContent property could contain arbitrary mark-up like XHTML or DocBook or a custom format. Rather than allow any mixed content I have just allowed one "any" element, which makes it possible to validate the content if an application is aware of an appropriate schema for the element's namespace. The feature would have an unbounded number of subject properties which might be simple strings with keywords or gml:CodeType values with an associated codeSpace. The feature would also have an unbounded number of relatedFeatures properties pointing to features to which the text refers.
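As a rough illustration of the shape described above, the following Python sketch builds one such feature with the standard library. The property names (subject, textContent, relatedFeatures) come from the description; the feature namespace URI, the use of gml:name for the name, the example values and the helper function are assumptions made only for this example, and the real definitions are those in gsmlBase.xsd.

```python
import xml.etree.ElementTree as ET

# Real namespace URIs for GML, XLink and XHTML; the "NT" namespace is a
# made-up placeholder for wherever NarrativeText would actually be defined.
GML = "http://www.opengis.net/gml"
XLINK = "http://www.w3.org/1999/xlink"
XHTML = "http://www.w3.org/1999/xhtml"
NT = "http://example.org/narrative-text"  # hypothetical


def build_narrative_text(name, subjects, body_markup, related_hrefs):
    """Build a NarrativeText element with the properties described above."""
    feature = ET.Element(f"{{{NT}}}NarrativeText")
    ET.SubElement(feature, f"{{{GML}}}name").text = name
    # Unbounded subject properties: plain keywords or coded terms.
    for term, code_space in subjects:
        subject = ET.SubElement(feature, f"{{{NT}}}subject")
        subject.text = term
        if code_space is not None:
            subject.set("codeSpace", code_space)
    # textContent holds exactly one element from some other namespace.
    text_content = ET.SubElement(feature, f"{{{NT}}}textContent")
    text_content.append(ET.fromstring(body_markup))
    # Unbounded relatedFeatures, each a simple XLink to another feature.
    for href in related_hrefs:
        related = ET.SubElement(feature, f"{{{NT}}}relatedFeatures")
        related.set(f"{{{XLINK}}}href", href)
    return feature


feature = build_narrative_text(
    name="Example memoir section",
    subjects=[("faulting", None), ("Mt Read Volcanics", "urn:example:lexicon")],
    body_markup=f'<div xmlns="{XHTML}"><p>Some narrative text.</p></div>',
    related_hrefs=["#borehole-1"],
)
xml_text = ET.tostring(feature, encoding="unicode")
```

Keeping the marked-up text under a single child element means a processor can hand that one subtree to whatever validator matches its namespace.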

NarrativeText.png

The best way to indicate links to particular bits of text within the textContent is not obvious to me. The BGS Textbase system has <indexterm> elements inside the text which are easy to convert to links when presenting on the web but it puts the storage of metadata in two different places which is untidy and not really in keeping with GML design. I am thinking about making the relatedFeatures property an extended XLink which points both to the feature and possibly to a bit of text inside the textContent using XPointer either with character counts (a bit of a pain to maintain if the text gets edited) or using the ids of arbitrary mark-up elements inside the textContent (e.g. <a> elements in XHTML). For the moment, however, I'm keeping it simple and not implementing pointers within the text.
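To make the id-based option concrete, here is a small Python sketch that resolves a pointer by looking up the id of a mark-up element inside the text. The XHTML fragment, the id value and the find_by_id helper are all invented for illustration; this is not an XPointer implementation, only the lookup idea.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"


def find_by_id(text_content_root, target_id):
    """Return the first element carrying the given id attribute, or None."""
    for element in text_content_root.iter():
        if element.get("id") == target_id:
            return element
    return None


original = ET.fromstring(
    f'<div xmlns="{XHTML}"><p>The core from '
    f'<a id="bh1">the borehole</a> shows faulting.</p></div>'
)
edited = ET.fromstring(
    f'<div xmlns="{XHTML}"><p>As reported earlier, the core from '
    f'<a id="bh1">the borehole</a> shows heavy faulting.</p></div>'
)

# The id resolves to the same phrase before and after the text is edited.
anchor_before = find_by_id(original, "bh1").text
anchor_after = find_by_id(edited, "bh1").text
```

The pointer survives edits to the surrounding text, at the cost of requiring the author to place an element with an id at every spot that might be referenced.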

The textContent property uses a type MarkUpOrRefType which can be re-used in any feature which needs a text property including mark up.

The BGS Textbase DTD, DocBook and other text schemas (but not really XHTML) have their own ways of representing a hierarchical document structure. That structure, however, is not accessible to a generic GML processing application, and tying an XMML/GeoSciML schema to a particular text schema could well be too restrictive. I think representing a hierarchy of text sections is untidy in terms of GML features and not particularly necessary, so any continuous piece of text will just be contained within a single NarrativeText feature. Any hierarchy will only be apparent to an application that understands whatever mark-up has been put inside the textContent.

There is a simple example instance at XmmlSchemaRepository:trunk/Examples/GeoSciML/textEx1.xml

-- MarcusSen - 16 Mar 2004

BGS Textbase

The BGS Textbase has the following properties

  1. It manages hierarchically structured text from the smallest chunks (usually paragraphs) to enclosing chunks like sub-sections, sections, chapters and whole reports/memoirs.
  2. It can associate with each chunk items of discovery metadata which may be subject terms taken from a general geo-scientific thesaurus or terms taken from a number of more specialist dictionaries like a lexicon of lithostratigraphical units.
  3. It can associate with each chunk identifiers for a number of feature types like boreholes and faults. (The identifiers are currently the primary keys of associated BGS corporate databases.)
  4. The associations may be made with a chunk at any level of the hierarchy so an association made with a chapter implicitly applies also to its enclosed paragraphs and sections. They can also be made in-line with a few words of the text of a paragraph etc.
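The inheritance rule in point 4 can be sketched in a few lines of Python. The Chunk class, the (scheme, value) pairs and the example identifiers are hypothetical and only model the idea; they do not reflect how the Textbase actually stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One node of the text hierarchy (report, chapter, section, paragraph)."""
    title: str
    terms: set = field(default_factory=set)       # associations made at this level
    children: list = field(default_factory=list)  # enclosed chunks


def effective_terms(root):
    """Map each chunk title to its own terms plus those of all its ancestors."""
    result = {}

    def walk(chunk, inherited):
        combined = inherited | chunk.terms
        result[chunk.title] = combined
        for child in chunk.children:
            walk(child, combined)

    walk(root, set())
    return result


para1 = Chunk("para 1", {("borehole", "BH-001")})
para2 = Chunk("para 2")
chapter = Chunk("chapter 1", {("lexicon", "Dundas Group")}, [para1, para2])
terms = effective_terms(chapter)
```

Here "para 2" carries no associations of its own but still picks up the term attached to its chapter. The sketch assumes chunk titles are unique.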

The Textbase application uses the associations in two ways:
  1. to find text that concerns a particular subject, describes a particular unit, concerns a particular borehole etc.,
  2. to create links when the text is presented on the web which allow the reader to jump from e.g. a mention of a borehole to the corporate database entry for the borehole or a log viewer for the borehole etc.

The subject terms and identified feature links are conceptually different kinds of association but they are currently handled using exactly the same mark-up syntax. The associations between text and subject terms and, to a lesser extent, identified features, are made manually by an author who judges what is appropriate. We are still in a learning process about making these judgements. From the point of view of searching we don't want to associate a chunk of text with, for example, every lithostratigraphical unit it mentions in passing; we want a search for a particular unit to bring back text that says something potentially interesting about that unit. For linking we probably do want to be able to create links from any mention of, for example, a borehole to any other corporate data we may hold on the borehole. It is also possible to create links from mentions of a subject term to a dictionary defining the term, but this can result in impractically large amounts of mark-up.

-- MarcusSen - 08 Oct 2003

Location Organizer Folder

The OpenGIS "Geospatial Fusion Services Testbed", and Pilot, looked at this from the perspective of converting arbitrary multi-media resources into references within a "Location Organizer Folder" (LOF). Geoparser services converted text into something like the "TextFeature" proposed, and allowed in-line or by-reference storage of the original and marked-up objects. This is a pretty flexible scheme that provides for consistent metadata patterns and for traversable relationships between such resources to be stored. FYI, MapBroker (underlying WebMap Composer) has a fully functional LOFManager package so that we could easily build appropriate UI components and a "LOF store" - actually just a transactional WFS able to store complex XML. The LOF documents are discussion papers, and I'm sure the OGC would be delighted to see them put to some real industrial use and hence progress to RFC status.

-- RobAtkinson - 29 May 2003

There are two places for text in LOF. Overall a LOF is supposed to bring together Features and other data that are relevant to, for example, some event or story. The "Causal Narrative" inside a LOF is where authors can write text describing the event or whatever and include pointers in the text to the other Features or Information Elements (like video clips etc.) which are also stored in the LOF. (The content of these features may be stored directly or there may be remote references to external content.) The other place for text is as "GeoText" in Information Elements alongside other media types like sound, video etc. which may be referred to by the Causal Narrative. For example, the text of a newspaper article or email message might be contained here. This text may be parsed by an automated Geoparser to try to extract references to geographical features like place names etc. from which geo-spatial features with geographical properties can be created and stored in the LOF.

The Causal Narrative text is just plain text with one kind of possible mark-up element: XLinks pointing to one of the other Features or Information Elements contained in the LOF. The GeoText in Information Elements can have arbitrary mark-up (XHTML etc.) which is not validated by the LOF Schema. Links from geo-spatial features (or other information elements) that the Geoparser has identified inside the text to the bits of text where they are mentioned are maintained by XLinks outside the text content. These use XPointer syntax with first and last character counts to point to the relevant bit of text inside the GeoText contents.
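The character-count style of pointer can be illustrated with plain string slicing in Python. The sample sentence, the extract_range helper and the choice of 1-based inclusive counts are assumptions for the sake of the example; the actual counting convention is whatever the LOF discussion paper specifies.

```python
def extract_range(text, first, last):
    """Return characters first..last of the text (1-based, inclusive; assumed)."""
    return text[first - 1:last]


geotext = "Flooding was reported near Burns Peak on Tuesday."

# Counts a geoparser might have recorded for the place name.
first = geotext.index("Burns Peak") + 1
last = first + len("Burns Peak") - 1
before_edit = extract_range(geotext, first, last)

# After even a small edit the stored counts point at the wrong characters.
edited = "Severe flooding was reported near Burns Peak on Tuesday."
after_edit = extract_range(edited, first, last)
```

This keeps all link information outside the text, but every stored range has to be recomputed whenever the text changes.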

Footnotes

Old notes on BGS Textbase DTD

(Deleted longer description -- MarcusSen - 08 Oct 2003)

The Textbase DTD is more or less a simplified subset of DocBook with the addition of an <indexterm> element which has two attributes, 'scheme' and 'value'. The scheme attribute indicates which subject dictionary (general thesaurus, lithostratigraphic lexicon etc.) or which feature type (borehole, fault etc.) the value refers to. The value attribute gives the subject term or the feature identifier.
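A minimal Python sketch of how an application might read these elements and turn them into links follows. The <indexterm> element and its scheme and value attributes are as described above; the <para> wrapper, the assumption that <indexterm> wraps the words it refers to, the example values and the URL templates are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from scheme names to link templates.
LINK_TEMPLATES = {
    "borehole": "https://example.org/boreholes/{value}",
    "lexicon": "https://example.org/lexicon/{value}",
}


def index_terms(chunk):
    """Collect (scheme, value) pairs from every indexterm inside a chunk."""
    return [(t.get("scheme"), t.get("value")) for t in chunk.iter("indexterm")]


def link_for(scheme, value):
    """Return a link for a term, or None if the scheme has no template."""
    template = LINK_TEMPLATES.get(scheme)
    return template.format(value=value) if template else None


para = ET.fromstring(
    '<para>The <indexterm scheme="borehole" value="BH-001">deep borehole'
    '</indexterm> penetrated the <indexterm scheme="lexicon" '
    'value="DUNDAS">Dundas Group</indexterm>.</para>'
)
pairs = index_terms(para)
links = [link_for(scheme, value) for scheme, value in pairs]
```

The same (scheme, value) pairs serve both uses of the associations: they can be indexed for searching and expanded into links when the text is presented.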

DocBook

There is an experimental XML Schema version at http://www.oasis-open.org/docbook/xmlschema/. I'm not sure how close this is supposed to be to an official release or even if there is a definite plan for an official release. The DTD for DocBook has a very flexible but also very complicated system of using parameter entities to enable modularisation and customisation. From a brief examination, the XML Schema version seems essentially to reproduce this structure by using Schema group and attributeGroup definitions. This doesn't really take advantage of the mechanisms available in XML Schema. Contrary to (my) expectations the XML Schema version is actually more readable (or at least no less readable) than the DTD version, possibly because the DTD has stretched the parameter entity mechanism so far in order to achieve flexibility at the expense of readability.

-- MarcusSen - 20 Jan 2003

Mixed content/Structured vs. semi-structured data

DTD technology might have advantages over XML Schema for semi-structured information, such as text. DTD is more like the "production rules" method used to describe grammars. OTOH XSD is (semi-)object oriented, and better for fully structured data. In particular, "mixed content" is not a natural fit, and sometimes causes issues in validation. It is much much easier to stick with simple text nodes in XSD, and within GML we never used mixed="true". So before we go willy-nilly converting everything to XSD, I wonder if there might be a way to continue to use DTD for text, and merge this with XSD for structured data. Have you tried any experiments in this area?

-- SimonCox - 16 Jan 2003

I haven't experimented with this. It is possible to put both DTD and XML Schema declarations in an instance document so that it can be validated both by a DTD aware processor and an XML Schema aware processor. I noticed that the XML Schema version of MathML suggests putting character entity declarations (e.g. for − and other mathematical symbols) in an internal DTD subset as well as the XML Schema declaration because XML Schema can't provide the character entities. I suppose "ANY" elements could be used both in the text DTD and a GeoSciML Schema at places where elements from the other can be inserted. Probably a short program (in Java, VB or whatever) would have to be written to call the appropriate validators for the appropriate parts of an instance document. Is this the kind of thing you were thinking of or do you know of any other ways of combining the two?
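The "short program" idea might look roughly like the following Python sketch, which finds the places where one vocabulary is embedded in another and hands each embedded subtree to whichever validator is registered for its namespace. The namespaces, element names and the stand-in validator are invented for illustration; real DTD or XML Schema validation would need an external validating parser, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET


def namespace_of(element):
    """Return the namespace URI of an element, or '' if it has none."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def foreign_islands(root):
    """Yield subtrees whose root namespace differs from their parent's."""
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent:
            if namespace_of(child) != namespace_of(parent):
                yield child  # the whole island goes to one validator
            else:
                stack.append(child)


def validate_parts(root, validators):
    """Run the validator registered for each island's namespace, if any."""
    results = []
    for island in foreign_islands(root):
        check = validators.get(namespace_of(island))
        results.append((island.tag, check(island) if check else None))
    return results


doc = ET.fromstring(
    '<d:Feature xmlns:d="urn:example:data" xmlns:t="urn:example:text">'
    "<d:description><t:para>Some text</t:para></d:description>"
    "</d:Feature>"
)
# Stand-in validator: a real one would call a DTD or schema processor.
validators = {"urn:example:text": lambda element: element.tag.endswith("para")}
results = validate_parts(doc, validators)
```

The sketch does not descend into an island once it has been found, so data embedded inside text embedded inside data would need a further pass.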

Maybe it would be useful to think of some use cases, (i) with text fragments included in data documents and (ii) with data included in text reports, in order to make clear what the combined "schema" should achieve. Really there are two parts to the current text DTD: the structured text part, which can easily be inserted at appropriate container elements in an XML schema and might have alternatives like XHTML or full DocBook substituted for it, and the <indexterm> part, which might fit better in adapted GML metadata elements?

-- MarcusSen - 20 Jan 2003

DTD -> XSD

I just tried using XML Spy to convert a few pieces of GML into DTD. In the first place I am very pleased to see how "clean" the result is. But it also rings a few alarm bells:
  • namespaces have vanished entirely - this makes mix & match vocabularies impossible
  • cardinality constraints are generalised - for example, in the XSD the content model of Envelope specifies exactly 2 pos elements, but in the DTD this has become "+" (i.e. 1 or more). I guess it should have been (pos,pos) so this could be a weakness in XML Spy.
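The difference between the two content models can be checked with a small Python sketch that treats a DTD content model as a regular expression over child element names. The regex encoding and the matches_model helper are an illustration only, not how a DTD processor works internally.

```python
import re


def matches_model(model, children):
    """Check a sequence of child names against a regex standing in for a DTD model."""
    return re.fullmatch(model, " ".join(children) + " ") is not None


EXACTLY_TWO = r"(pos ){2}"  # stands in for the DTD model (pos,pos)
ONE_OR_MORE = r"(pos )+"    # stands in for the DTD model (pos)+

two = ["pos", "pos"]
three = ["pos", "pos", "pos"]

loose_accepts_three = matches_model(ONE_OR_MORE, three)
strict_accepts_three = matches_model(EXACTLY_TWO, three)
strict_accepts_two = matches_model(EXACTLY_TWO, two)
```

An Envelope with three pos children passes the generalised model but fails the exact one, which is the constraint lost in the conversion.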

-- SimonCox - 16 Jan 2003

Well, it's not completely impossible. It can be bodged by using fixed namespace prefixes in the DTD (with, for convenience, a parameter entity to define it at the start and fixed attributes giving the namespace declarations). This can break down if the thing goes through a Schema aware processor which alters the namespace prefix and it is rather awkward. The W3C has made examples of XHTML+MathML+SVG DTD profiles using this kind of mechanism to get around the namespace problem. (I'm not sure if it is only intended as an interim measure until Schema becomes more widespread or if this kind of use of DTDs is expected to last long term.)

-- MarcusSen - 20 Jan 2003

Statutory reports - text

Some notes from emails:

The preferred AESIS format for submitting data in digital form could probably be extended to be XMML. For company report data each report (record, object) was submitted in a format including the following:

SR
TI First Annual and Final Technical Report for the period November 1998-99 -
EL21/98
AV Open file - Tas. Available from MRT
TN EL21/1998
AU Parfrey, O.
AU Simpson, K.L.
CO Pasminco Australia Limited (Exploration)
RN 00_4414
CL 1 v., 14pp, 3fig, 1 table, 11appx
PY 1999
ST Dundas Group
ST Mt Read Volcanics
SU Rosebery Fault
LO Burns Peak
LO Mt Kershaw
AB Some soil geochemistry and an IP survey have been undertaken during the
last period.  The results have not indicated significant exploration
targets.
10 80144
25 SK55-3 
ER

The tags define the items and I know that we (MRT) didn't use all the possibilities. There were a number of problems with submitting data like this, as there still seemed to be a lot of manual work at the AESIS end and, where items were defined in authority tables via link tables, it was not possible to have corrections to entries in the authority tables flow through to AESIS.
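As a starting point for any conversion to XMML, a record in this tagged layout can be read with a short Python sketch like the one below. The parsing rules are inferred from the single example above and are assumptions: a field starts with a two-character tag followed by a space, a line without such a tag continues the previous field, tags may repeat, and SR and ER on their own mark the start and end of a record. The sketch does not interpret what the tags mean.

```python
import re
from collections import defaultdict

# Two upper-case letters or digits, then either end of line or a space and a value.
TAG_LINE = re.compile(r"^([A-Z0-9]{2})(?: (.*))?$")


def parse_record(text):
    """Parse one tagged record into a dict mapping each tag to a list of values."""
    fields = defaultdict(list)
    last_tag = None
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.group(1), match.group(2)
            if tag in ("SR", "ER") and value is None:
                last_tag = None  # record delimiters carry no data
                continue
            fields[tag].append(value or "")
            last_tag = tag
        elif last_tag is not None:
            # Wrapped line: append to the most recent value.
            fields[last_tag][-1] += " " + line.strip()
    return dict(fields)


record = """SR
TI First Annual and Final Technical Report for the period November 1998-99 -
EL21/98
AU Parfrey, O.
AU Simpson, K.L.
PY 1999
ER"""
parsed = parse_record(record)
```

A wrapped line that happened to begin with two capitals and a space would be misread as a new field, so a real loader would need the full list of valid AESIS tags.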

-- BobRichardson - 26 Apr 2002
Topic attachments

* NarrativeText.png (7.3 K, 17 Mar 2004 - 17:32, MarcusSen)
* cbv33.dtd (16.9 K, 16 Apr 2003 - 14:16, SimonCox): GeoSciML DTD - without MathML
* cbv33_el_local.xsd (28.1 K, 16 Apr 2003 - 14:17, SimonCox): Auto-converted by XML Spy
* cbv35.dtd (27.9 K, 16 Apr 2003 - 14:17, SimonCox): GeoSciML DTD - with tables and MathML

Topic revision: r15 - 15 Oct 2010, UnknownUser
 

Current license: All material on this collaboration platform is licensed under a Creative Commons Attribution 3.0 Australia Licence (CC BY 3.0).