Identifying web resources



Identifiers and the World Wide Web

Resource identification plays a key role in The Architecture of the WWW (AW3), since it enables linking. The standard form of identifier in the web context is the Uniform Resource Identifier (URI), which generalizes a growing number of specific schemes, including the well-known Uniform Resource Locator (URL) (also known as "http URI") and Uniform Resource Name (URN) schemes.

Those responsible for creating URIs will normally use an algorithm or rule to ensure uniqueness. This may be based on using some of the descriptors for the resource as fields in its identifier. This can make the URI 'memorable' and help users to guess identifiers they haven't been explicitly told, which is definitely a Good Thing.

However, this should not be pushed too hard: a strong recommendation in AW3 is that users treat all URIs as "opaque", since "the Web is designed so that agents communicate resource information state through representations, not identifiers". The "users" in this recommendation are, of course, URI readers (but see also here for a discussion of encoding some metadata in URIs -- SteveRichard - 17 Jun 2009).

Purpose of identifiers

In the web context, identifiers play two roles:
  1. to name or denote a resource
  2. to provide a resolvable locator for a web resource

One of the strengths of the web infrastructure has been that the URL (or 'http URI') often combines these functions. Since URLs piggy-back on DNS for resolution, they are a good choice for a resource identifier in most situations.

Identifiers vs queries

In terms of their semantics, URIs vary between
  • fully opaque identifier
    • e.g. utilizing a catalog-number or pseudo-random number generator or clock-ticks or similar
    • this denotes a persistent resource "by name"
    • "identity comparison" is the only logical operation supported
  • hierarchical identifier or path
    • steps in the path are an (ordered) series of values for fields
    • the path may reflect the location of a static resource in a hierarchical store
  • parameterized query
    • e.g. a service call, which may be encoded for the http GET method as a URL with a ? in the middle
    • this implicitly denotes either
      • a (potentially empty) set of resources that satisfy the query
      • a new resource constructed on-the-fly, perhaps a subset of or extract from a larger resource
    • since it includes visible parameters, a consumer may attempt (scheme-dependent) operations in addition to identity comparison

Published URI schemes fall into all camps, and everywhere in between - e.g. the urn:uuid scheme rolls an opaque UUID into the context of a URI; while the OGC URN scheme "def" branch includes support for highly parameterized CRS designators, newly built (from more primitive components).
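The three styles can be told apart with nothing more than the standard library. The following is a minimal sketch, not a definitive classification: the example.org addresses and the fixed UUID value are hypothetical, chosen only for illustration.

```python
import uuid
from urllib.parse import urlsplit, parse_qs

# Hypothetical identifiers, one per style described above.
opaque = uuid.UUID("12345678-1234-5678-1234-567812345678").urn
path_style = "http://example.org/boreholes/2009/site-7"
query_style = "http://example.org/features?type=borehole&year=2009"

# Opaque: the urn:uuid form carries no fields a consumer should interpret,
# so identity comparison is the only meaningful operation.
print(opaque)

# Hierarchical: the path steps are an ordered series of field values.
steps = [step for step in urlsplit(path_style).path.split("/") if step]
print(steps)

# Parameterized query: the parameters after the "?" are visible to a consumer.
params = parse_qs(urlsplit(query_style).query)
print(params)
```

Only the last two forms expose anything a consumer could operate on, and whether it is wise to do so depends on the scheme.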

URI resolution

In order to obtain a resource representation denoted by an identifier, a URI must be resolved.

Http URIs have the advantage of a widely deployed generic resolver apparatus for the host part of the identifier (i.e. DNS). Resolution of the remainder is delegated to the http server corresponding to the host.
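As a rough illustration of this division of labour, the sketch below splits an http URI into the part handed to DNS and the part handed to the http server. The function name and the example.org address are assumptions made for this example, and no network access takes place.

```python
from urllib.parse import urlsplit

def split_for_resolution(uri):
    """Return (host, remainder): the host is resolved by DNS,
    the remainder is interpreted by the http server at that host."""
    parts = urlsplit(uri)
    remainder = parts.path
    if parts.query:
        remainder += "?" + parts.query
    return parts.hostname, remainder

# Hypothetical identifier, used only to show the split.
host, remainder = split_for_resolution(
    "http://www.example.org/def/unit/metre?format=xml"
)
print(host)
print(remainder)
```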

The URL syntax also allows a fragment to be identified within a resource, indicated by the string appearing after the # symbol. The fragment may itself be either an identifier (barename) or may be parameterized (e.g. XPointer syntax). However, it is important to remember that the fragment is not passed in the http request - rather, it is the responsibility of the client to process the fragment identifier to obtain a sub-resource from within a potentially much larger resource, all of which must be transferred prior to the client-side operation (e.g. see http://www.w3.org/TR/swbp-vocab-pub/ for a discussion of 'hash- vs. slash-namespaces' in the context of RDF resources).
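The client-side handling of the fragment can be sketched with the standard library function urldefrag, which separates the part of the URI that is sent to the server from the part that stays with the client. The example.org address is hypothetical.

```python
from urllib.parse import urldefrag

# Hypothetical "hash-namespace" style identifier.
uri = "http://example.org/vocab/rocks#granite"

# Only request_target is sent in the http request; the fragment is kept
# by the client and applied to the representation after it is transferred.
request_target, fragment = urldefrag(uri)
print(request_target)
print(fragment)
```

The whole of the resource at request_target must be transferred before the client can locate the item named by the fragment inside it.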

A perceived disadvantage of http URIs relates to lack of persistence in the face of (likely) server configuration changes. The use of alternative URI schemes has been discussed at length, and it is clear that http URIs can act as persistent identifiers if suitable attention is paid to their governance, such as a carefully defined 'scheme'. They can also remain 'on the web' if that care is extended to the deployment of the http servers associated with the scheme. In fact, the underlying issue is to recognize that there is a URI lifecycle.


The URN scheme was originally designed primarily to overcome the persistence issue. However, it thereby raises the question of resolution. The only universal URN "resolver" is the IANA registry of URN namespaces, which merely identifies a document describing each scheme. This is not necessarily a bad thing, since URNs may be used to identify resources that are not web-accessible, including physical resources, and also "concepts" that may have multiple alternative realizations. Nevertheless, for maximum information transparency some kind of web-resource corresponding to, or proxying for, the identified resource will usually be available. This may be just a document containing a description or definition. The latter is exemplified by URNs used to identify units-of-measure or observed-properties. While identity comparison can serve some purposes, examination of a full description of the resource will be needed for many others.

In TermResolutionMechanisms RobAtkinson started a discussion on "best practice" including the use of URIs for concepts. In GmlIdentifiers SimonCox provides an overview of the various slots in GML documents that can be used to hold the object identifier, and how to build URLs from these that can be used in cross-referencing.


URN or URL

URL weaknesses

URLs have some weaknesses, described below, which mean they are not always optimal:
  • persistence
    • it may be useful for an identifier to stay the same even if a resource moves - e.g. concept definitions
    • it may be useful for one identifier to apply to a resource with multiple representations, each potentially with a different URL
  • semantics & governance
    • URLs are "owned" by the domain owner
      1. it may not be easy or possible, or may at least be discouraged, for a user from another organization to mint new ones in the same hierarchy
      2. other users may not have confidence in the stability of the URL and resource
    • URL structure is typically based on a file-system path on the server, which may provide the wrong kind of information
    • even if a path is meaningful, it is only a mono-hierarchical classification
  • resolvability
    • URLs are sometimes used for purposes where they are not required to be resolvable (e.g. XML namespace names, offline resources); this causes confusion

Other URI schemes

For these and other reasons, the concept of URI includes other identification schemes, which have different strengths applicable to different use-cases. Some are based on other standard communication protocols (mailto:, ftp://, callto:), while others are not even intended to be immediately resolvable (DOI, URN).

XRI, from OASIS, is a scheme that is both abstract (persistent, reassignable) and resolvable (FOSS software is available from OpenXRI). This is an interesting development that deals with some of the technical issues with URN. However, the issue of governance must still be resolved.

Furthermore, as Henry Thompson and David Orchard point out in this note from the W3C Technical Architecture Group, URLs (which they call "http URIs") can manage all of the requirements that have led to the development of alternative identifier schemes such as URN and XRI. For example, the "faceted" nature of most URN schemes is mirrored by the "key=value" pairs pattern after the "?" in an extended URL.
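To make the mirroring concrete, the sketch below rewrites the colon-separated fields of a URN as key=value pairs in an http URI. Everything here is an assumption for illustration: the "x-example" namespace, the facet names, the base address and the function name are hypothetical and do not come from any published scheme.

```python
from urllib.parse import urlencode

# Hypothetical facet names for the fields that follow the namespace.
FIELD_NAMES = ("kind", "authority", "code")

def urn_to_http_uri(urn, base="http://example.org/def"):
    """Rewrite urn:<namespace>:<kind>:<authority>:<code> as an http URI
    whose query string carries the same facets as key=value pairs."""
    prefix, namespace, *fields = urn.split(":")
    if prefix.lower() != "urn" or len(fields) != len(FIELD_NAMES):
        raise ValueError(f"unexpected URN layout: {urn!r}")
    facets = {"namespace": namespace, **dict(zip(FIELD_NAMES, fields))}
    return base + "?" + urlencode(facets)

http_uri = urn_to_http_uri("urn:x-example:crs:EPSG:4326")
print(http_uri)
```

The same facets are present in both forms; what changes is that the http form can be handed to an ordinary web client, provided someone governs the server behind it.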

However ...

Unresolvable identifiers are still useful!

This may appear somewhat bizarre, since the web is built on connectivity. But the utility depends on what the identifier is for. Some applications require a URI that plays only role 1 - i.e. denotation is more important than location.

Stable identifiers may be used very effectively to proxy for an actual resource, in order to support reasoning, in particular the most simple form "is A the same as B". If two links use the same identifier, then it is not necessary to resolve them for us to know they are referring to the same thing! Yes, it must be possible, eventually, to obtain a description of the thing. But this may even be done offline, with the identifier still supporting useful operations.
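A minimal sketch of this kind of reasoning follows. The records and the "x-example" URNs are hypothetical, and comparison is assumed to be plain string equality; nothing is resolved or retrieved.

```python
# Hypothetical records that reference a unit-of-measure by identifier.
records = [
    {"site": "A", "unit": "urn:x-example:unit:metre"},
    {"site": "B", "unit": "urn:x-example:unit:metre"},
    {"site": "C", "unit": "urn:x-example:unit:foot"},
]

def same_thing(uri_a, uri_b):
    """Identity comparison: string equality, with no network access."""
    return uri_a == uri_b

# Group sites by the identifier they reference, without resolving anything.
sites_by_unit = {}
for record in records:
    sites_by_unit.setdefault(record["unit"], []).append(record["site"])

print(same_thing(records[0]["unit"], records[1]["unit"]))
print(sites_by_unit)
```

Sites A and B can be recognized as using the same unit even though no description of that unit has been fetched.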

In what contexts might this (stable denotation but not necessarily location) be the case?

It particularly applies when the resource acts as a label or classifier, which denotes and thus implies a descriptor. A necessary (though not sufficient) requirement for an interoperable classifier is that it is from a well-governed scheme. A requirement for a usable classifier in the web context is that the denotation is a URI.

It is a bonus if the URI itself supports some reasoning without resolving and retrieving an actual resource.

The OGC URN scheme is a useful unresolvable scheme

The OGC URN scheme, particularly the "def" branch, is intended to support this use-case. The concepts defined by URNs encoded in conformance with this scheme are "resolved" through reference to the document that describes the scheme.

URN structure does not supply semantics to the reader

The structure within a URN scheme is provided first and foremost as a rule for URN providers to ensure that a newly minted URN is indeed unique. The "structure" in a URN provides a check on the provider that can ensure certain governance pre-requisites are met when creating identifiers.

It is usually not intended to encourage a URN consumer to algorithmically process a URN - users are encouraged to treat all URIs as "opaque" since "the Web is designed so that agents communicate resource information state through representations, not identifiers". However, RFC 3406 allows a namespace registration to declare its structure in either of two ways:
  • the structure is opaque (no exposition)
  • a regular expression for parsing the identifier into components, including naming authorities

As far as the user is concerned, a URN scheme provides (a) an authority for the identifier (the first field after "urn:"), which in turn manages (b) the "rule" to enable a resolver to retrieve a resource. However, the URN scheme definition is not a rule that allows the URN to be deconstructed to become the resource!
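Where a namespace does publish a parsing rule, a consumer might apply it along the following lines. This is a sketch only: the regular expression follows the general urn:NID:NSS layout, the function name is hypothetical, and the sample identifier in the style of the OGC "def" branch is included as an illustration rather than taken from the scheme document.

```python
import re

# "urn:", then a namespace identifier (NID) of 1 to 32 letters, digits or
# hyphens, then the namespace-specific string (NSS).
URN_PATTERN = re.compile(
    r"^urn:([A-Za-z0-9][A-Za-z0-9-]{0,31}):(.+)$", re.IGNORECASE
)

def parse_urn(urn):
    """Split a URN into (authority, namespace-specific string)."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a URN: {urn!r}")
    nid, nss = match.groups()
    return nid.lower(), nss

# Illustrative identifier; any further splitting is namespace-dependent.
nid, nss = parse_urn("urn:ogc:def:crs:EPSG::4326")
fields = nss.split(":")
print(nid, fields)
```

Parsing yields the authority and some fields, but it does not yield the resource: a description must still be obtained from whatever the authority provides.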

Also see The use of Metadata in URIs.

Summary

  1. It is smart to use URIs to identify things, even those that you don't want to put inline
  2. There are several URI schemes
    • URLs are locators - DNS + http server acts as the resolution service
      • http identifiers have the advantage that many clients are preconfigured to bind to the "http resolver service" (DNS). Note, however, that DNS itself only resolves the host, and the http server at the host does the rest. Adding new identifiers involves configuration of the URL resolver (a.k.a. http server).
    • other URI schemes, such as URN, XRI, DOI, handle, have their own resolution mechanisms (which may not be web-based)
    • URN functionality can be mimicked in URLs (see discussion in http://www.w3.org/2001/tag/doc/URNsAndRegistries-50 for example).
  3. URIs should be treated as opaque - see http://www.w3.org/TR/webarch/#uri-opacity . However
    • This applies to the user, not to the creator - there must still be a rule that allows URIs to be assigned in some unique way. There is no prohibition on structure or codes, and this may assist in the governance process for URI assignment.
    • The recommendation is a "should", not a "must"
    • RFC 3406 anticipates cases where a URN is parsed into components, such as naming authorities
  4. The official URN resolution system is a document - http://www.ietf.org/rfc/rfc3406.txt
    • this directs the user to the register of URN namespaces http://www.iana.org/assignments/urn-namespaces, which carries refs to the document that defines each one. I.e. the canonical URN resolver does exist, but it is manual (!)
    • Within a community it is OK to specify a special automatic resolution service - for example see http://www.cgi-iugs.org/uri .
    • Identifier uniqueness can only be ensured by an identifier registry, which is the key element underlying a resolver.
  5. URNs are a good solution for "reference material", which doesn't change much
    • the requirement to have a resolver implies a registry, which in turn implies a registration process
    • a non-trivial registration process tends to discourage frequent change (or at least expose disorderly processes).
    • most pointers to "reference material" are not intended to trigger a get operation. Rather, they will trigger a comparison operation ("Do I know this identifier?", "Is this identifier the same as that one?").
  6. A corollary is that URNs are not normally a good solution for ephemeral resources, or "data instances" with weak or non-public governance arrangements
    • though there may be an exception if the identifier is defined by a well-documented algorithm

If you've got this far, you'd probably better also read Dan Connolly's Untangle URIs, URLs, and URNs.

 
Topic revision: r20 - 15 Oct 2010, UnknownUser
 

Current license: All material on this collaboration platform is licensed under a Creative Commons Attribution 3.0 Australia Licence (CC BY 3.0).