An ESIP inter-operability system should be database-centric. Object-relational
database management systems (ORDBMSs) have been developed which
are wholly suited to the task of managing earth science data, given an
appropriate schema (catalog definition), and are superior to competing
alternatives. Given the right architecture and a modest amount of companion
tools, they may be the foundation for a system which not only performs
inter-operability services for researchers, but also significantly aids
in their work.
Richard Troy, October 15, 1998
To understand and appreciate the importance and scope of this work, consideration must be given to the evaluation criteria under which the entire Earth System Information Partner (ESIP) Federation will be scrutinized. The very first bullet item for this evaluation requires that the ESIPs work together to permit research to be browsed automatically and queried by remote clients as if each ESIP site were part of a larger whole [1, ESIP Federation CFP].
Given this perspective on success, the whole problem of ESIP inter-operability
is an exercise in data management. It behooves us to pay attention to work
previously performed on this problem so that we may benefit from years of
existing research in the field [2 & 3,
Sequoia 2000, BigSur, et al.]. As intimated in the bullet cited above,
a solution must behave as a cohesive system, yet we know that in practice
each member will necessarily find unique solutions to their Earth science
computing problems. Therefore any effective and successful interoperability
system must manage the common elements of each Partner's work, permit easy
embellishment for unique research issues, and enable any existing work
to be "encapsulated" so that it continues to function.
The database community is well aware of the issues of interoperability
between multiple solutions to common problems; most often "gateways" are
created which perform schema mapping, i.e. translation services. Gateways
are not a particularly good solution, as has been learned over the last
decades, if only because, for N variants, a minimum of 2(N-1) translations
are required. The N(N-1) pairwise translations are avoided only by using a single variation,
to and from which all others translate. The reality is that schemas change
over time, which further compounds the problem. "Application" and "middle-ware"
solutions often borrow the gateway philosophy, sharing its troubles, and
such "solutions" are incomplete because they do not address what data is
to be stored, nor how it can be retrieved, and fail to define commonality.
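The translation counts above can be checked with a little arithmetic. The following is a minimal sketch, assuming one translator is counted per direction between two schema variants; the function names are illustrative only.

```python
def hub_translators(n: int) -> int:
    """Translators needed when every variant maps to and from one chosen hub schema."""
    if n < 1:
        raise ValueError("need at least one schema variant")
    return 2 * (n - 1)


def pairwise_translators(n: int) -> int:
    """Translators needed when every variant maps directly to every other variant."""
    if n < 1:
        raise ValueError("need at least one schema variant")
    return n * (n - 1)


# Compare the two strategies for a few federation sizes.
for n in (2, 5, 10):
    print(n, hub_translators(n), pairwise_translators(n))
```

Even the cheaper hub arrangement grows with every new variant, and each schema change forces at least two translators to be revisited.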
A single system must be available in which all of the meta-data - that
data which describes the various data-sets - is accessible. This "single
system" may itself be distributed through any of a number of means, but
alternatives in which the joining of partner data-sets into a logically
singular collection is left to client applications must be rejected as
unworkable for reasons cited above and because this leaves too much undefined
for the authors of such code. It should be realized that what may appear
to be alternative solutions share a single fundamental requirement: that all meta-data
be available to all browsing software. It would be effective if all the
meta-data were made available at a single site through which all
searches are initiated, but this is a sufficient condition, not a
necessary one. Distributed databases offer a competent solution
and may be indistinguishable from a non-distributed one.
In addition to the core set of features common to all ESIP collections,
the great variety of unique features offered by each Partner must not be
over-looked. Therefore browsing and application software variations are
both necessary and desirable. We presume there exist applications for almost
all kinds of extant scientific data sets which ESIP partners presently
use. These may continue to be useful in browsing ESIP collections by performing
a single schema translation and perhaps modest embellishment to address
new possibilities [4, Dial, et al.]. It is perhaps
even possible to use ORDBMS-provided schema mapping tools ("views," etc.)
to permit continued use of some applications without any changes at all.
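As a rough illustration of schema mapping through a view, the sketch below uses Python's bundled sqlite3 module purely as a stand-in for an ORDBMS. The table `dataset`, the view `legacy_granules`, and every column name are hypothetical; they are not drawn from any Partner's system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical common catalog table shared by all Partners.
    CREATE TABLE dataset (
        dataset_id INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        start_date TEXT,
        end_date   TEXT
    );
    INSERT INTO dataset VALUES
        (1, 'Example data set', '1997-01-01', '1997-12-31');

    -- View presenting the column names a hypothetical legacy application expects.
    CREATE VIEW legacy_granules AS
        SELECT dataset_id AS gran_id,
               title      AS gran_name,
               start_date AS t_begin,
               end_date   AS t_end
        FROM dataset;
""")

# The legacy application keeps issuing its old query, unchanged.
rows = conn.execute(
    "SELECT gran_id, gran_name, t_begin FROM legacy_granules"
).fetchall()
print(rows)
```

The translation lives in the database rather than in each client, so a change to the common schema is absorbed by redefining the view.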
New technologies for browsing and "data mining" will no doubt be developed.
It is therefore necessary and sufficient for our interoperability efforts
to focus on providing appropriate data abstractions and storage infrastructure
for this meta-data to be consistently available to all client applications
(in a single repository, be it distributed or singular), and perhaps provide
a single reasonably competent general purpose browser to illuminate the
way ahead.
One great risk of any inter-operability system is the quality of the
meta-data collected. There is potential for the meta-data collection process
to become a burden to researchers, who typically have limited resources.
Scientific defensibility, while near and dear to researchers' hearts and
minds, is often an over-looked aspect of meta-data storage because it requires
the collection and maintenance of the processing history of each item.
This lineage is especially critical in collaborative situations, yet often
in existing systems, this meta-data is largely maintained in the brains
of researchers and staff. This problem does not arise so long as the meta-data
include a repository not only for descriptions of data "objects", but also
for the processes which act upon them. Doing so presents an opportunity
for automation of some or all aspects of data-processing, from simple "read
files in this directory and update the meta-data" to complete automation
[3, BigSur]. Automation permits the scientist to
focus on research, relieves the need for people to maintain the inter-operability
publication effort, and may actually aid the performance of research. In this
way, the inter-operability effort may enhance the researcher's work instead
of becoming a distraction from it.
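A minimal sketch of recording processing history alongside data objects follows. It assumes each processing step has at most one input, and it uses sqlite3 only as a convenient stand-in; the tables `data_object` and `process_run` and the helper names are hypothetical and are not taken from BigSur.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_object (
        object_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE process_run (
        run_id       INTEGER PRIMARY KEY,
        process_name TEXT NOT NULL,
        input_id     INTEGER REFERENCES data_object(object_id),
        output_id    INTEGER NOT NULL REFERENCES data_object(object_id)
    );
""")


def record_step(process_name, input_id, output_name):
    """Store a derived object together with the process run that produced it."""
    cur = conn.execute("INSERT INTO data_object (name) VALUES (?)", (output_name,))
    output_id = cur.lastrowid
    conn.execute(
        "INSERT INTO process_run (process_name, input_id, output_id) VALUES (?, ?, ?)",
        (process_name, input_id, output_id),
    )
    return output_id


def lineage(object_id):
    """Walk back from an object to its source; returns (process, output) pairs, newest first."""
    steps = []
    while object_id is not None:
        row = conn.execute(
            "SELECT p.process_name, o.name, p.input_id "
            "FROM process_run p JOIN data_object o ON o.object_id = p.output_id "
            "WHERE p.output_id = ?",
            (object_id,),
        ).fetchone()
        if row is None:
            break
        steps.append((row[0], row[1]))
        object_id = row[2]
    return steps


raw = record_step("ingest", None, "raw_swath")
calibrated = record_step("calibrate", raw, "calibrated_swath")
gridded = record_step("regrid", calibrated, "gridded_product")
history = lineage(gridded)
print(history)
```

Because each step is written as a side effect of running the process, the lineage accumulates without a separate cataloguing chore for the researcher.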
If the meta-data becomes stale, parts of the collection remain unavailable.
Addressing the practical aspect of storage and access to meta-data,
database research since 1980 [5, Ingres, Stonebraker]
has provided relational database technology which industry has developed
extensively (Ingres Corp, Oracle, Sybase, Informix, et al.). Further research
has guided the merging of "Object Oriented" concepts with relational ones
[6, Postgres, Stonebraker] and again, the "Object-Relational
DataBase Management Systems" (ORDBMS) industry has further refined these
concepts and developed a host of tools with great success (Illustra Corp,
et al.). There is presently a plethora of rapid deployment tool-sets available
for programming with database access in all of the modern, popular languages
such as Java (Symantec, et al.). Therefore we need not look further than
the database and its schema to be successful in implementing interoperability.
Though helpful to beginners, an Application Programming Interface
(API) cannot be considered an interoperability standard; a requirement
to use an API may become burdensome, stifling innovation in data-mining
technologies. Modern tools already provide significant help to the programmer.
What sophisticated access is required is thankfully provided by commercial
vendors in the RDBMS and ORDBMS marketplace. An API should only be considered
an aid to programming, and help in establishing convention, and should
not be viewed as a requisite. It should also be noted that unlike potential
non-database-centric solutions, using an ORDBMS provides everything necessary
for all types of users, and poses little burden for very "light-weight"
Earth science systems. The sophistication of the inter-operability system
may or may not be used, as is appropriate for the researcher's needs.
It should be noted that "data-access" is aided by DBMS-centrism: the
location of the data is known and can be requested. We presume that most
data shall exist in files. Database entries may include sufficient meta-data
to make the location of the data clear, as URLs presently provide for the
internet. Various schemes have been worked out, notably URLs, but also
the Kahn-Wilensky Handle [7, Profs. Kahn &
Wilensky], and the DLOBH construct [3, BigSur] which
provides for a named object to be resident inside or outside of a database.
A simple "file server" may permit such handles (naming schemes) to function
in a distributed way. This technology is now well-defined.
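The idea of a named object that may reside inside or outside the database can be sketched as follows. This is an illustration only, not the Kahn-Wilensky or BigSur design: the class `ObjectHandle`, its fields, the catalog entries, and the example.org URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectHandle:
    """Hypothetical named object: bytes held in the database, or a pointer to an external file."""
    name: str
    inline_bytes: Optional[bytes] = None
    url: Optional[str] = None

    def __post_init__(self):
        # Exactly one of the two storage forms must be supplied.
        if (self.inline_bytes is None) == (self.url is None):
            raise ValueError("exactly one of inline_bytes or url must be set")

    def location(self) -> str:
        """Report where the object's data lives."""
        return "database" if self.inline_bytes is not None else self.url


# A toy catalog keyed by handle name; entries are placeholders.
catalog = {
    h.name: h
    for h in (
        ObjectHandle("small_summary", inline_bytes=b"placeholder bytes"),
        ObjectHandle("large_archive", url="ftp://data.example.org/archive/file.dat"),
    )
}


def resolve(name: str) -> str:
    """Look up a handle by name and return where its data can be fetched."""
    return catalog[name].location()


print(resolve("small_summary"))
print(resolve("large_archive"))
```

Client code asks for an object by name and need not know in advance whether the bytes sit in a table or on a file server.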
Turning to the specifics of a suitable database schema, the commercial marketplace has again taken research and performed further development and refinement, and now provides a "commercial, off the shelf" solution. The following meta-data (table) descriptions are taken from the Berkeley Earth Science Tools (BEST) commercialization of BigSur research [8, Berkeley Earth Science Tools]. (The BEST system is now in use at UC Berkeley and the Langley Research Center, and is being implemented at several other sites.)
Table Summary
Due to the size and scope of this document, it is infeasible to cover the intimate details of each table or to present an exhaustive analysis here. The necessary and sufficient abstractions have already been implemented in a practical design, and are presented here as an example and food for thought. An appropriate way to evaluate the schema is to inquire how certain questions might be answered from the data contained therein. Should any question not be answerable, the schema is found lacking. Should every question be answerable, it is sufficient. Analysis of the schema so far illuminates no wants; surely it can be extended and embellished, but within the scope of an ESIP interoperability schema, it is sufficient.
Identification
Keyword_Instance
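The evaluation approach described above, asking whether a given question can be answered from the catalog, might be exercised as in the sketch below. Only the table names Identification and Keyword_Instance come from the summary; every column, every row, and the function name are invented for illustration and should not be read as the actual BEST schema. sqlite3 again stands in for an ORDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table names follow the summary above; every column is a placeholder.
    CREATE TABLE Identification (
        object_id INTEGER PRIMARY KEY,
        title     TEXT NOT NULL
    );
    CREATE TABLE Keyword_Instance (
        object_id INTEGER NOT NULL REFERENCES Identification(object_id),
        keyword   TEXT NOT NULL
    );
    INSERT INTO Identification VALUES
        (1, 'Example data set A'),
        (2, 'Example data set B');
    INSERT INTO Keyword_Instance VALUES
        (1, 'ocean'),
        (1, 'temperature'),
        (2, 'land cover');
""")


def titles_for_keyword(keyword):
    """Question under test: which data sets are tagged with this keyword?"""
    rows = conn.execute(
        "SELECT i.title FROM Identification i "
        "JOIN Keyword_Instance k ON k.object_id = i.object_id "
        "WHERE k.keyword = ? ORDER BY i.title",
        (keyword,),
    )
    return [title for (title,) in rows]


print(titles_for_keyword("ocean"))
```

A question that cannot be phrased as a query of this kind points to a gap in the schema rather than in the client software.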
Note that BEST offers several companion schemas which are available to address additional needs. For example, a Multi-Dimension Array feature is available which comes with a schema to handle large scientific objects which contain arrays of data.
MDimensionArray
Any new system introduced for ESIP inter-operability will necessarily involve changes, to some extent, in the way Partners perform their research. The vision described above offers the least intrusion into present behaviours, requires the least addition of new resources, has the most flexibility and offers benefits not available from any other architecture. Catalogs -- database schemas -- are the critical core of any such endeavour and the outline above should serve well to reach success.
http://s2k-ftp.CS.Berkeley.EDU:8000/postgres/index.html,
and as found in: http://s2k-ftp.CS.Berkeley.EDU:8000/ingres/
http://s2k-ftp.CS.Berkeley.EDU:8000/postgres/index.html,
and in: http://www.postgresql.org/index.html