Earth Science Catalogs for ESIPs

Abstract

An ESIP inter-operability system should be database-centric. Object-oriented relational database systems (ORDBMSs) have been developed that are well suited to the task of managing earth science data, given an appropriate schema (catalog definition), and are superior to competing alternatives. Given the right architecture and a modest number of companion tools, they may be the foundation for a system which not only performs inter-operability services for researchers, but also significantly aids their work.

Richard Troy, October 15, 1998

To understand and appreciate the importance and scope of this work, consideration must be given to the evaluation criteria under which the entire Earth System Information Partner (ESIP) Federation will be scrutinized. The very first bullet item for this evaluation requires that the ESIPs work together to permit research to be browsed automatically and queried by remote clients as if each ESIP site were part of a larger whole [1, ESIP Federation CFP].

Given this perspective on success, the whole problem of ESIP inter-operability is an exercise in data management. It behooves us to pay attention to work previously performed on this problem so that we may benefit from years of existing research in the field [2 & 3, Sequoia 2000, BigSur, et al.]. As intimated in the bullet cited above, a solution must behave as a cohesive system, yet we know that in practice each member will necessarily find unique solutions to their Earth science computing problems. Therefore any effective and successful interoperability system must manage the common elements of each Partner's work, permit easy embellishment for unique research issues, and enable any existing work to be "encapsulated" so that it continues to function.

The database community is well aware of the issues of interoperability between multiple solutions to common problems; most often "gateways" are created which perform schema mapping, i.e. translation services. Gateways are not a particularly good solution, as has been learned over the last decades, if only because, for N variants, a minimum of 2(N-1) translations are required. That minimum is reached only by designating a single variant to and from which all others translate; translating directly between every pair of variants would require N(N-1) translations. The reality is that schemas change over time, which further compounds the problem. "Application" and "middle-ware" solutions often borrow the gateway philosophy and share its troubles. Such "solutions" are incomplete because they do not address what data is to be stored or how it can be retrieved, and they fail to define commonality.
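The arithmetic behind the gateway argument can be checked with a short sketch. It counts one translator per direction between two schemas; the function names are illustrative only and do not come from any existing tool.

```python
def pairwise_translators(n: int) -> int:
    """Translators needed when every schema maps directly to every other one."""
    # Each of the n schemas needs a translator to each of the other n - 1.
    return n * (n - 1)


def hub_translators(n: int) -> int:
    """Translators needed when one schema is the hub for all the others."""
    # Each of the n - 1 non-hub schemas needs one translator to the hub
    # and one translator back from it.
    return 2 * (n - 1)


for n in (2, 3, 5, 10):
    print(f"N={n}: pairwise={pairwise_translators(n)}, hub={hub_translators(n)}")
```

The two counts agree for two schemas and diverge quickly after that, and every schema change forces a rewrite of each translator that touches the changed schema.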

A single system must be available in which all of the meta-data (the data which describes the various data-sets) is accessible. This "single system" may itself be distributed through any of a number of means, but alternatives in which the joining of partner data-sets into a logically singular collection is left to client applications must be rejected as unworkable, both for the reasons cited above and because this leaves too much undefined for the authors of such code. It should be realized that what may appear to be alternative solutions share a single fundamental requirement: that all meta-data be available to all browsing software. It would be effective if all the meta-data were made available at a single site through which all searches are initiated, but this is a sufficient condition, not a necessary one. Distributed databases offer a competent solution and may be indistinguishable from a non-distributed one.
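As a minimal sketch of a distributed catalog that clients see as one collection, the example below uses Python's standard sqlite3 module as a stand-in for a distributed ORDBMS. The partner names, the column names, and the federation_catalog view are all assumptions made for illustration; only the table name Identification appears in this paper.

```python
import sqlite3

# Two attached in-memory databases stand in for two partner sites.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS partner_a")
conn.execute("ATTACH DATABASE ':memory:' AS partner_b")

# Each partner keeps the same (hypothetical) catalog table.
for schema in ("partner_a", "partner_b"):
    conn.execute(
        f"CREATE TABLE {schema}.Identification ("
        "object_id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
    )
conn.execute("INSERT INTO partner_a.Identification VALUES (1, 'Coastal ozone profiles')")
conn.execute("INSERT INTO partner_b.Identification VALUES (1, 'Snowpack depth survey')")

# The joining of partner catalogs is defined once, in the database,
# instead of being left to each client application.
conn.execute("""
    CREATE TEMP VIEW federation_catalog AS
        SELECT 'partner_a' AS partner, object_id, title FROM partner_a.Identification
        UNION ALL
        SELECT 'partner_b' AS partner, object_id, title FROM partner_b.Identification
""")

catalog = conn.execute(
    "SELECT partner, title FROM federation_catalog ORDER BY partner"
).fetchall()
print(catalog)
```

A browsing client queries federation_catalog without knowing how many sites contribute to it, which is the sense in which a distributed catalog can be indistinguishable from a single one.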

In addition to the core set of features common to all ESIP collections, the great variety of unique features offered by each Partner must not be overlooked. Therefore browsing and application software variations are both necessary and desirable. We presume there exist applications for almost all kinds of extant scientific data sets which ESIP partners presently use. These may continue to be useful in browsing ESIP collections by performing a single schema translation and perhaps modest embellishment to address new possibilities [4, Dial, et al.]. It is perhaps even possible to use ORDBMS-provided schema mapping tools ("views," etc.) to permit continued use of some applications without any changes at all. New technologies for browsing and "data mining" will no doubt be developed. It is therefore necessary and sufficient for our interoperability efforts to focus on providing appropriate data abstractions and storage infrastructure so that this meta-data is consistently available to all client applications (in a single repository, be it distributed or singular), and perhaps to provide a single reasonably competent general purpose browser to illuminate the way ahead.
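The idea of using a view as a schema mapping can be sketched as follows. SQLite (through Python's sqlite3 module) stands in for an ORDBMS here, and the column names, the legacy_datasets view, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical common catalog table; the column names are assumptions.
    CREATE TABLE Identification (
        object_id   INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        object_type TEXT NOT NULL
    );
    INSERT INTO Identification VALUES (1, 'Sea surface temperature, 1997', 'grid');
    INSERT INTO Identification VALUES (2, 'Station rainfall log', 'timeseries');

    -- The view presents the catalog under the table and column names
    -- that an existing (hypothetical) application already expects.
    CREATE VIEW legacy_datasets AS
        SELECT object_id AS dataset_no, title AS dataset_name
        FROM Identification
        WHERE object_type = 'grid';
""")

# The legacy application's query runs unchanged against the view.
legacy_rows = conn.execute(
    "SELECT dataset_no, dataset_name FROM legacy_datasets ORDER BY dataset_no"
).fetchall()
print(legacy_rows)
```

The mapping lives in the database, so the older application needs no code changes as long as the view can express what it expects to read.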

One great risk of any inter-operability system is the quality of the meta-data collected. There is potential for the meta-data collection process to become a burden to researchers, who typically have limited resources. Scientific defensibility, while near and dear to researchers' hearts and minds, is an often overlooked aspect of meta-data storage because it requires the collection and maintenance of the processing history of each item. This lineage is especially critical in collaborative situations, yet in existing systems this meta-data is often maintained largely in the brains of researchers and staff. This problem does not arise so long as the meta-data include a repository not only for descriptions of data "objects", but also for the processes which act upon them. Doing so presents an opportunity for automation of some or all aspects of data-processing, from simple "read files in this directory and update the meta-data" to complete automation [3, BigSur]. Automation permits the scientist to focus on research, relieving the need for people to maintain the inter-operability publication effort, and may actually aid the performance of research. In this way, the inter-operability effort may enhance the researcher's work instead of becoming a distraction from it.
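A small sketch shows how lineage falls out of storing process runs next to the objects they produce. The table layout, the helper functions, and the file names are assumptions for illustration and are not the BigSur or BEST design; each run is limited to a single input to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (object_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE process_runs (
        run_id       INTEGER PRIMARY KEY,
        process_name TEXT NOT NULL,
        parameters   TEXT NOT NULL,
        input_id     INTEGER REFERENCES objects(object_id),
        output_id    INTEGER NOT NULL REFERENCES objects(object_id)
    );
""")


def add_object(name):
    """Catalog a data object and return its identifier."""
    return conn.execute("INSERT INTO objects (name) VALUES (?)", (name,)).lastrowid


def record_run(process_name, parameters, input_id, output_name):
    """Catalog the output object together with the run that produced it."""
    output_id = add_object(output_name)
    conn.execute(
        "INSERT INTO process_runs (process_name, parameters, input_id, output_id) "
        "VALUES (?, ?, ?, ?)",
        (process_name, parameters, input_id, output_id),
    )
    return output_id


def lineage(object_id):
    """Walk back from an object to its raw source; steps are listed newest first."""
    steps = []
    while True:
        row = conn.execute(
            "SELECT process_name, parameters, input_id FROM process_runs "
            "WHERE output_id = ?",
            (object_id,),
        ).fetchone()
        if row is None:  # no recorded run produced this object: it is a raw source
            return steps
        steps.append((row[0], row[1]))
        object_id = row[2]


raw = add_object("raw_scan.dat")
calibrated = record_run("calibrate", "gain=1.02", raw, "calibrated.dat")
gridded = record_run("regrid", "cell_km=5", calibrated, "gridded.dat")
print(lineage(gridded))
```

Because the run is recorded at the moment the output is cataloged, the processing history accumulates as a side effect of doing the work and does not depend on anyone's memory.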

If the meta-data becomes stale, parts of the collection remain unavailable.

Addressing the practical aspect of storage and access to meta-data, database research since 1980 [5, Ingres, Stonebraker] has provided relational database technology which industry has developed extensively (Ingres Corp, Oracle, Sybase, Informix, et al.). Further research has guided the merging of "Object Oriented" concepts with relational ones [6, Postgres, Stonebraker] and again, the "Object-Relational DataBase Management Systems" (ORDBMS) industry has further refined these concepts and developed a host of tools with great success (Illustra Corp, et al.). There is presently a plethora of rapid deployment tool-sets available for programming with database access in all of the modern, popular languages such as Java (Symantec, et al.). Therefore we need not look further than the database and its schema to be successful in implementing interoperability. Though helpful to beginners, an Applications Programming Interface (API) cannot be considered an interoperability standard; a requirement to use an API may become burdensome, stifling innovation in data-mining technologies. Modern tools already provide significant help to the programmer. What sophisticated access is required is thankfully provided by commercial vendors in the RDBMS and ORDBMS marketplace. An API should only be considered an aid to programming and a help in establishing convention, and should not be viewed as a requisite. It should also be noted that, unlike potential non-database-centric solutions, using an ORDBMS provides everything necessary for all types of users, and poses little burden for very "light-weight" Earth science systems. The sophistication of the inter-operability system may or may not be used, as is appropriate for the researcher's needs.

It should be noted that "data-access" is aided by DBMS-centrism; the location of the data is known and can be requested. We presume that most data shall exist in files. Database entries may include sufficient meta-data to make the location of the data clear, as URLs presently provide for the internet. Various schemes have been worked out, notably URLs, but also the Kahn-Wilensky Handle [7, Profs. Kahn & Wilensky], and the DLOBH construct [3, BigSur], which provides for a named object to be resident inside or outside of a database. A simple "file server" may permit such handles (naming schemes) to function in a distributed way. This technology is now well-defined.
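The notion of a catalog entry that either holds an object or points to it can be sketched as below. This is not the DLOBH construct or the Kahn-Wilensky Handle; it is a simplified illustration using a URL as the external reference, and the column names, the locate function, and the example host are hypothetical.

```python
import sqlite3
from urllib.parse import urlparse

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Objects (
        object_id INTEGER PRIMARY KEY,
        content   BLOB,   -- set when the object is stored inside the database
        location  TEXT,   -- set when the object is stored outside it, as a URL
        CHECK ((content IS NULL) <> (location IS NULL))  -- exactly one is set
    )
""")
conn.execute("INSERT INTO Objects VALUES (1, ?, NULL)", (b"small inline payload",))
conn.execute(
    "INSERT INTO Objects VALUES (2, NULL, ?)",
    ("ftp://data.example.org/sst/1997.hdf",),
)


def locate(object_id):
    """Return ('inline', bytes) or ('remote', host, path) for a cataloged object."""
    row = conn.execute(
        "SELECT content, location FROM Objects WHERE object_id = ?", (object_id,)
    ).fetchone()
    if row is None:
        raise KeyError(object_id)
    content, location = row
    if content is not None:
        return ("inline", bytes(content))
    parts = urlparse(location)
    return ("remote", parts.netloc, parts.path)


print(locate(1))
print(locate(2))
```

A client asks the catalog where an object is and receives either the bytes themselves or enough information to fetch them from a file server.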

Turning to the specifics of a suitable database schema, the commercial marketplace has again taken research and performed further development and refinement, and now provides a "commercial, off the shelf" solution. The following meta-data (table) descriptions are taken from the Berkeley Earth Science Tools (BEST) commercialization of BigSur research [8, Berkeley Earth Science Tools]. (The BEST system is now in use at UC Berkeley and the Langley Research Center, and is being implemented at several sites elsewhere.)

Table Summary

Due to the size and scope of this document, it is unfeasible to cover the intimate details of each table, or perform exhaustive analysis. The necessary and sufficient abstractions have already been implemented in a practical design, and are presented here as an example and food for thought. An appropriate way to evaluate the schema is to inquire how certain questions might be answered from the data contained therein. Should any questions not be answerable, the schema is found lacking. Should this not be the case, it is sufficient. Exhaustive analysis of the schema illuminates no wants; surely it can be extended and embellished, but within the scope of an ESIP interoperability schema, it is sufficient.

Identification
This is the center of the schema; all objects have an entry in this table.

ObjectAssociation
This table associates objects into structures, such as hierarchies, if desired.

ObjectType
This table defines object types. Type names are completely at the user's discretion.

Objects
This table either contains the object in question or a reference to it.

Process
This table, along with the Parameter table, describes what is required to run a process.

ProcessAssociation
This table associates processes so that a work-flow is established.

Process_Queue
This table contains instances of process runs before, during or after runs, as desired.

Parameter
This table describes each and every parameter, and does not contain parameter values.

ParameterValue
This table contains real values, for either pending process runs or defaults.

ParameterSet
This table is used to describe named parameter sets for set-default purposes.

Lineage_Paramater
This table contains historical parameter values for lineage purposes.

Systems
Contains run-time information about distributed system resources that can run processes.

Citation
Contains information "citing" documents that may have more information.

CitationTemplate
Contains citation information for use by those who use data products in their work.

ContactInformation
Describes individuals responsible for various objects, processes or systems.

Security_Code
Provides a flexible, site specific security scheme, permitting or restricting access.

Graphic
Or 'Identification_Graphic', directs browsing applications to "thumbnail" images.

Metadata_Ref_Info
Or 'Metadata_Reference_Information', is required by the FGDC Metadata Standard. Reference it...

Keyword_Instance
Identifies special words associated with objects.

Thesaurus
Provides more information about entries in the Keyword_Instance table.
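The evaluation approach suggested above, asking whether the schema can answer a given question, can be tried on a toy version of a few of these tables. The table names follow the summary, but every column, every row, and the find_objects function are assumptions made for illustration; the actual BEST table definitions are not given in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ObjectType (type_id INTEGER PRIMARY KEY, type_name TEXT NOT NULL);
    CREATE TABLE ContactInformation (contact_id INTEGER PRIMARY KEY, person TEXT NOT NULL);
    CREATE TABLE Identification (
        object_id  INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        type_id    INTEGER NOT NULL REFERENCES ObjectType(type_id),
        contact_id INTEGER NOT NULL REFERENCES ContactInformation(contact_id)
    );
    CREATE TABLE Keyword_Instance (
        object_id INTEGER NOT NULL REFERENCES Identification(object_id),
        keyword   TEXT NOT NULL
    );

    INSERT INTO ObjectType VALUES (1, 'grid'), (2, 'timeseries');
    INSERT INTO ContactInformation VALUES (1, 'A. Researcher'), (2, 'B. Analyst');
    INSERT INTO Identification VALUES
        (10, 'Sea surface temperature, 1997', 1, 1),
        (11, 'Buoy temperature log', 2, 2),
        (12, 'Chlorophyll concentration, 1997', 1, 2);
    INSERT INTO Keyword_Instance VALUES
        (10, 'temperature'), (11, 'temperature'), (12, 'chlorophyll');
""")


def find_objects(type_name, keyword):
    """Question: which objects of this type carry this keyword, and who is the contact?"""
    return conn.execute(
        """
        SELECT i.title, c.person
        FROM Identification AS i
        JOIN ObjectType AS t ON t.type_id = i.type_id
        JOIN ContactInformation AS c ON c.contact_id = i.contact_id
        JOIN Keyword_Instance AS k ON k.object_id = i.object_id
        WHERE t.type_name = ? AND k.keyword = ?
        ORDER BY i.title
        """,
        (type_name, keyword),
    ).fetchall()


print(find_objects("grid", "temperature"))
```

A question that cannot be written as a query over the tables points to a gap in the schema, which is the test of sufficiency proposed above.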

Note that BEST offers several companion schemas which are available to address additional needs. For example, a Multi-Dimension Array feature is available which comes with a schema to handle large scientific objects which contain arrays of data.

MDimensionArray
Contains descriptions of multiple dimension arrays.

MDimensionAxis
Contains definitions of axes for MDAs.

There are also other miscellaneous embellishments to the schema, such as a "reference" schema to handle back-ground map presentations, and to provide a repository for information useful to earth science, but not, strictly speaking, a part of an earth science data-set. (Examples: political boundaries, locations of waterways, etc.)

Conclusions

Any new system introduced for ESIP inter-operability will necessarily involve changes, to some extent, in the way Partners perform their research. The vision described above offers the least intrusion into present behaviours, requires the least addition of new resources, has the most flexibility and offers benefits not available from any other architecture. Catalogs -- database schemas -- are the critical core of any such endeavour and the outline above should serve well to reach success.

Citations

[1] Earth Science Information Partnership Grant, Call For Participation. As quoted by Martha Maiden.

[2] Sequoia 2000 research project at UC Berkeley, http://s2k-ftp.CS.Berkeley.EDU:8000/sequoia/abouts2k.html

[3] The "BigSur" research project at UC Berkeley, http://s2k-ftp.CS.Berkeley.EDU:8000/nasa_e2e/

[4] Dial was presented at the August ’98 EOS-DIS Technology Transfer Workshop, presented by Suresh. It was our mutual observation that Dial could be modified fairly easily to use BigSur with great advantage.

[5] The Ingres Relational Database System, Professor Michael Stonebraker, as described in http://s2k-ftp.CS.Berkeley.EDU:8000/postgres/index.html, and as found in: http://s2k-ftp.CS.Berkeley.EDU:8000/ingres/

[6] The Postgres Object-Relational Database System, Professor Michael Stonebraker, as described in http://s2k-ftp.CS.Berkeley.EDU:8000/postgres/index.html, and in: http://www.postgresql.org/index.html

[7] The Kahn-Wilensky Handle, as described in A Framework for Distributed Digital Object Services, http://WWW.CNRI.Reston.VA.US/home/cstr/arch/k-w.html

[8] Berkeley Earth Science Tools, inc., http://sciencetools.com/