Big Sur: Base Schema

Section Coordinator:
Jean Anderson, jean_anderson@postgres.berkeley.edu
Last Updated:
$Date: 1995/05/23 20:01:05 $

Introduction

The first layer of the Big Sur database infrastructure, the base schema, establishes a framework to support the database management and processing needs of Sequoia applications.

Our goal is to establish reusable building blocks composed of types and tables that are useful to many applications. This is like a code library that supplies functionality shared by many applications. The library includes functions all applications need; however, it probably does not include every function a single application needs.

Likewise, the Big Sur base schema might not include all building blocks an application needs. We will work with and encourage researchers to extend the base building blocks as needed for a specific application purpose. We will watch for application-specific building blocks that satisfy a more general need and will incorporate them into the base schema.

This section describes the base building blocks in detail, organized as follows:

o Design Approach

We explain why a typical schema design approach does not scale for Sequoia and describe an alternative based on emerging geospatial data standards.

o Base Schema

We describe the basic SQL building blocks, implemented in the Illustra object-relational database, that support earth science data. Researchers use these building blocks to create a database for a particular application.

Design Approach

Big Sur schema development is drawing from geospatial standards to:
o Accelerate schema development.
Many person-years have already been invested in defining data components that Sequoia can use today.

o Avoid tuning the schema to any single application.
Integrating new applications should not require changes to the base schema; rather, the base schema should be extended to support new applications.

o Support interoperability.
Most of the geospatial standards have been designed with interoperability in mind; many of them already provide tools.

First we describe a typical schema design approach and why it does not scale for Sequoia data. Next we describe a standards-based approach, focusing on the Spatial Archive and Interchange Format (SAIF) [SAIF94] and the Content Standards for Digital Geospatial Metadata (FGDC) [FGDC94] standards. Finally we mention a few lessons learned so far and where we expect our approach to fall short.


Typical Design Approach

Schema design requires defining a framework that, at minimum, specifies:

o Object and attribute names

o Data types

o Valid domains of values

The goal is to produce an application-independent schema. For example, software that displays a satellite image should not require a different underlying schema than software that transforms objects to a different map projection.

However, establishing a base framework is labor-intensive, even for a single data domain or application.

First of all, the schema designers must thoroughly understand the data domain and processing requirements. References such as [FLEM89] are used by many database organizations and establish a schema design strategy that encourages a sensible, data-driven approach independent of any specific application. However, since it is easier to understand a known application than the underlying data domain, the design process often results in application-specific schemas that must be altered to add new applications. A large body of literature describes schema evolution, transformation, and migration strategies; see, for example, [ROSE94] and [HURS94].

Secondly, even with a well-understood data domain, specifying objects takes far longer than anyone expects. Given a longitude, is the database attribute named longitude, Longitude, lon, or long? Is the type a single- or double-precision real, or perhaps decimal degrees, minutes, and seconds? Does the valid domain run from -180 to +180, or from 0 to 360? [BRIG94] cites a study showing that the likelihood two people will choose the same name for an object ranges from 7% to 18%. Viewed another way, disagreement is the expected norm, and consensus is time-consuming to achieve.
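For illustration only (these declarations are hypothetical and not part of the Big Sur schema), two equally defensible declarations of the same concept fail to interoperate:

create table station_a (longitude double precision); -- decimal degrees, -180 to +180
create table station_b (lon real);                   -- decimal degrees, 0 to 360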

This approach does not scale for Sequoia. Sequoia has scarce labor resources, a high number of applications, and--most importantly--many data domains to integrate. The Catch-22 is that the researchers who understand each data domain do not understand schema design, and the schema designers do not understand the science domains.

A different schema design approach is needed, one that does not require scientists to become schema designers and does not require schema designers to become fully conversant in each science domain.


Standards-Based Approach

Sequoia data are as diverse as, if not more diverse than, the data addressed by most standards efforts. Furthermore, base standards for the Sequoia domains have already been at least partially established, many times over. Together, [JOHN92] and [CASS93] identify over 50 standards related to GIS mapping and remote sensing technologies alone. The time such efforts take is staggering.

Sequoia does not have person-years of resources available to develop a standard that encompasses all Sequoia data domains. We must leverage the person-years others have already spent defining data elements we can use now, and we should work with the standards organizations to extend the standards as needed. At the very least, checkpointing schema development against the relevant standards will help ensure that the base schema remains domain- and application-independent.

Current Sequoia schema development is based on a hybrid of the FGDC metadata and SAIF standards. Our goal is to use FGDC at the high level, especially for data set information, and to fill in from SAIF for more complex constructs. Where the two standards overlap, we will provide views that present data from either perspective.
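For example (a minimal sketch with hypothetical table and attribute names, not the actual Big Sur definitions), an FGDC-style view over a SAIF-style table might look like this:

-- Present SAIF-managed metadata under FGDC element names.
create view fgdc_identification as
select title, abstract, purpose
from saif_geographic_dataset;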

Both standards are briefly introduced next.


Spatial Archive and Interchange Format (SAIF)

SAIF was accepted as a draft standard by the Canadian General Standards Board in 1991 [SAIF94]. It is an extensible, object-oriented data modeling and archival/transfer standard that targets both geospatial and non-geospatial data. [ANDE94] describes a raster prototype based on SAIF and implemented in the Illustra object-relational database.

The advantages of SAIF include:

The disadvantages of SAIF include:


Content Standards for Digital Geospatial Metadata (FGDC)

The Federal Geographic Data Committee (FGDC) began specifying a digital geospatial metadata standard in 1992, resulting in a final draft published in 1994 [FGDC94]. It adopts a simpler data model, organized as a hierarchy of the data elements and compound elements needed to document a geospatial data set.

The FGDC metadata standard is particularly useful for high-level metadata and appears to be gaining popularity in the geospatial user community.

The advantages of the FGDC metadata standard include:

The disadvantages of FGDC include:


Lessons Learned

The lessons we have learned so far include:

We have also noticed a few weaknesses in our approach:

Base Schema

The Big Sur base schema is made up of base types and SQL template create statements, which researchers use to create a database schema for a given application. We will work with researchers to extend the building blocks as needed to meet application requirements.

This section describes:


Design Guidelines and Software Requirements

First we mention the software requirements, since they drive some of the guidelines. The Big Sur base schema requires:

We recommend that researchers follow these guidelines as they extend the Big Sur base schema:

SQL Tables

Big Sur tables may be used directly, or they may be inherited to support a user-defined schema. The Applications section shows how these building blocks are applied and extended by the Sequoia remote sensing and climate model applications.
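For example (a sketch only: avhrr_grid and its band attribute are hypothetical, and the inheritance clause assumes Illustra's table inheritance syntax), an application might specialize a base table as follows:

-- Inherit the base Grid table, adding an application-specific attribute.
create table avhrr_grid (band integer) under Grid;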

We have grouped the tables into the following categories:

Each implemented category is described in this section and contains the following minimum information:

o Entity Relationship Diagram (ERD)
Entity relationship diagrams are useful for creating visual road maps of a database schema. Clicking on an ERD will bring up a description of the notation.

o SQL Code
Clicking on any SQL Code reference will bring up source code.

o Reference
Clicking on the reference will take you to the document containing detailed information about the attributes, if such a document exists.

1. Data Set Metadata

Data set metadata describe the data sets that are available at a particular site as well as detailed information about each data set. The Big Sur implementation is based on the FGDC metadata standard. Big Sur Version 1 implements Identification Information and Metadata Reference Information, the two required compound data types. Detailed information for this part of the schema includes:
o ERD: Data Set Metadata

o SQL Code

o FGDC Metadata Standard
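As a hedged sketch of how the compound types map onto tables (attribute names are abbreviated here for illustration; the actual definitions are in the SQL code referenced above), an Identification Information table draws its attributes from the standard's compound elements:

-- Illustrative only: a few FGDC Identification Information elements.
create table identification_information (
    title              text, -- from Citation
    abstract           text, -- from Description
    purpose            text, -- from Description
    access_constraints text
);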

2. Lineage

Lineage describes the processing history, algorithms, and parameters that were used to process the data. Detailed information for this part of the schema includes:
o ERD: Lineage

o SQL Code
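A minimal sketch of the idea (the names here are hypothetical; the actual definitions are in the SQL code referenced above): each processing step records the algorithm applied, the parameters used, and when it ran.

-- Illustrative only: one row per processing step applied to a data set.
create table lineage_step (
    step_id    integer,
    algorithm  text, -- name of the algorithm applied
    parameters text, -- parameter settings used for this step
    run_date   date  -- when the step was executed
);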

3. Grid

Gridded data in the Big Sur schema are multi-dimensional and georeferenced. An initial grid design is in place for remote sensing and climate modeling data. Georeferencing is partially specified, but projection handling still needs to be added. Detailed information for this part of the schema includes:
o ERD: Grid

o SQL Code

o SAIF Standard
Additional explanation follows.

SAIF 3.2 changes the SAIF 3.1 specification in the following ways:

We are additionally modifying the SAIF specification as follows:

4. Big Sur Object

BigSurObject contains the data granule. Detailed information for this part of the schema includes:
o ERD: BigSur Object

o SQL Code


Base Types

This section describes the type building blocks in the Big Sur schema. So far, three new base types are implemented or planned: dlobh, mdarray, and the SAIF enumerations. A brief introduction to each type is given below; complete information is forthcoming.

1. DLOBH

The Distributed Large Object Handle (dlobh) provides location-transparent access to Illustra large_object and external_file objects. Key features include the ability to mix Illustra large object types in the same attribute in a table and to access objects on a remote Illustra server.

The following functions provide input/output support for msql and C:

FileToDLOBH
Inserts a new dlobh into the database from a file on disk. It takes two arguments:

o The name of the source file on disk

o A storage flag: 'I' to store the object as an internal large_object, 'X' to store it as an external_file

DLOBHToFile
Copies a dlobh from the database into a file on disk. It takes two arguments:

o The dlobh to copy

o The base name for the output file; a suffix is appended to make each output file name unique

For example, the following SQL statements create a table and insert the C standard I/O header, first as an internal large_object, then as an external file.

$ msql test
* create table test_dlobh (foo dlobh);
* insert into test_dlobh values (FileToDLOBH('/usr/include/stdio.h','I'));
* insert into test_dlobh values (FileToDLOBH('/usr/include/stdio.h','X'));

Next, we copy both files back out onto disk and check for the files that were created in /tmp:

* select DLOBHToFile(foo, '/tmp/dlobh') from test_dlobh;
* \g

$ ls -l /tmp/dlobh.*
-rw-r--r--  1 miadmin  10  12420 Nov 21 11:25 /tmp/dlobh.2001
-rw-r--r--  1 miadmin  10  12420 Nov 21 11:25 /tmp/dlobh.stdio.h

Selecting a dlobh returns the handle, which uses the Internet Uniform Resource Locator (URL) convention:

$ msql test -w 0
* select foo from test_dlobh;
-----------------------------------------------------------
|foo                                                        |
-----------------------------------------------------------
|dlobh://arcadia.CS.Berkeley.EDU/default/test/pbrown//I/I0105826192499|
|dlobh://arcadia.CS.Berkeley.EDU/default/test/pbrown//X:728911045:0//usr/include/stdio.h|
-----------------------------------------------------------

The URL fields for the dlobh are as follows:

dlobh://HOSTNAME:PORT/SERVER/DATABASE/USER/PASSWORD/TYPE/LOHANDLE
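Reading the first handle above against this template (a worked example; fields absent from the handle are simply empty):

HOSTNAME = arcadia.CS.Berkeley.EDU (default PORT)
SERVER   = default
DATABASE = test
USER     = pbrown
PASSWORD = (empty)
TYPE     = I (internal large_object)
LOHANDLE = I0105826192499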

2. MDARRAY


The default Illustra array type supports arrays of arrays. The Big Sur mdarray type supports actual multi-dimensional arrays, storing the array as an external file. It is similar to the array type implemented in POSTGRES Version 4.2. A description may be found in Sequoia technical report s2k-93-32, available via anonymous ftp from s2k-ftp.cs.berkeley.edu:/pub/sequoia/tech-reports.

The following creates an mdarray and inserts a three-dimensional array of data:

create table int_marray3 (a marrayof(integer));
insert into int_marray3 values ('(1:2, 0:3, 2:7), int2.4.6');

The following updates individual elements in the array, then a range:

update int_marray3 set a[1,1,2] = 112;
update int_marray3 set a[1,1,3] = 113;
update int_marray3 set a[1,2,2] = 122;
update int_marray3 set a[1,2,3] = 123;
update int_marray3 set a[2:2,1:2,2:3] = '(2,2,1), int2.2.1';

The following selects the entire array, then a specific element, and finally a two-dimensional hyperslab:

select * from int_marray3;
select a[1,1,2] from int_marray3;
select a[1:1, 1:2, 2:3] from int_marray3;

3. SAIF Enumerations

A SAIF enumeration is a list of named values. For example, the UnitMeaning enumeration identifies the real-world units stored in another variable. Its domain of values is large and includes metres, miles, degrees, radians, years, and seconds.

A subset of the SAIF enumerations are implemented in the Big Sur schema for those portions of the schema that use SAIF. Supporting functions include:

SaifEnumHelp
With no argument, outputs a list of SAIF enumerations that have been implemented. With the name of an enumeration as the argument, returns the domain of values for that enumeration.
For example, the following SQL statements list the available enumerations, output the values for UnitMeaning, then create a table with a UnitMeaning type and insert into it:

return SaifEnumHelp();
return SaifEnumHelp('UnitMeaning');

create table enum_test (sample UnitMeaning);
insert into enum_test values ('degrees');

As mentioned above in the SAIF implementation notes, Big Sur Version 1 has implemented enumerations as text strings. UnitMeaning is a prime example of an enumeration domain where we will need to remain flexible.

