Database Classes

Section 5.3 of Programming with Data mentions a number of different ways to define an S database, including user-defined database classes. Not much more is said about these in the book.

This page fills in the details. The idea is quite powerful. The S evaluation manager treats user-defined databases essentially like any other kind of database until it needs to take an explicit action, such as reading, writing, or removing an object. At that point it calls the corresponding generic function: dbread, dbwrite, or dbremove in these cases. The designer of the database class provides methods for these functions, all of which get the database object as their first argument.

An important part of the philosophy is that the evaluation manager is responsible, in most cases, for determining that we really need to read, write, or remove the object. The database methods do not worry about issues such as committing assignments or whether other databases hold other versions of an object with the same name.

Although a moderate amount of programming is needed to add a new kind of database class, the results can usually be reused for other, similar applications. This page outlines what is needed and gives an example using the S symbolic dump format. The example is mildly practical, in that it provides a portable database that could be used in a network file system over diverse hardware. Its main virtue, though, is to work out the essential ideas in a context that doesn't require discussing any other software, such as database management systems.

A database class is any class that extends the virtual class database. Once the required methods for dbread, etc. are defined or inherited, you can attach an object of the class and use the resulting database like any other.

Notice that this is different from attaching a list-like object as a database. In that case, all the data reside in the single object. For database classes, on the other hand, the database object usually identifies a directory, database, URL, or other external resource that will be used to store and retrieve data. The key role of the database object's class is to tell the evaluator which methods to use for low-level access to the database.
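
The division of labor described above can be modeled in a few lines. The sketch below is in Python rather than S, because the exact S signatures of dbread, dbwrite, and dbremove are not given on this page. The names Database, MemoryDatabase, and Evaluator are hypothetical stand-ins, not part of S. The point is only that the evaluator decides when an action is needed, while the database class decides how to carry it out.

```python
from abc import ABC, abstractmethod


class Database(ABC):
    """Hypothetical stand-in for the virtual class "database"."""

    @abstractmethod
    def db_read(self, name): ...

    @abstractmethod
    def db_write(self, name, value): ...

    @abstractmethod
    def db_remove(self, name): ...


class MemoryDatabase(Database):
    """Toy database class; a real one would talk to an external resource."""

    def __init__(self):
        self.store = {}

    def db_read(self, name):
        return self.store[name]

    def db_write(self, name, value):
        self.store[name] = value

    def db_remove(self, name):
        del self.store[name]


class Evaluator:
    """Decides when a database action is needed; the database decides how."""

    def __init__(self, database):
        self.database = database
        self.pending = {}           # assignments not yet committed

    def assign(self, name, value):
        self.pending[name] = value  # no database call yet

    def commit(self):
        # Only now are the write methods of the database class called.
        for name, value in self.pending.items():
            self.database.db_write(name, value)
        self.pending.clear()

    def get(self, name):
        if name in self.pending:
            return self.pending[name]
        return self.database.db_read(name)
```

Notice that MemoryDatabase knows nothing about pending assignments; that bookkeeping stays entirely in the evaluator model.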

An Example

Let's do a very simple example. In this case we are going to use a directory in the file system as our database. But instead of relying on the standard S binary representation for objects, we will read and write files using the text-only "symbolic dump format" (pp 237-241). As for all user-defined database classes, the new class extends the virtual class "database". Other than that, the representation only contains the path for the directory.
    setClass("dumpDataBase", representation("database", path = "character"))

We will store objects in separate files in the directory; as a heuristic we'll append ".Sdata" to the object name to get the file name. This is consistent with the data.dump function, and also helps avoid confusion if the user creates some other files in the same directory.

As with most class definitions, the next step is to write a generating function for objects from the class. The example generator function takes the path name of a directory as its argument. If the directory doesn't exist, the function optionally creates it. If the directory exists or is created, the path stored in the object is the full (absolute) path rather than a relative one, so that future use of the database object is unambiguous.
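
The generator logic just described can be sketched as follows. This is a Python model, not the S generator function itself. DumpDatabase and dump_database are hypothetical names, and raising an error when the directory is missing and creation was not requested is an assumption; the text does not say what the S function does in that case.

```python
import os


class DumpDatabase:
    """Hypothetical stand-in for the "dumpDataBase" class: it only holds a path."""

    def __init__(self, path):
        self.path = path


def dump_database(path, create=False):
    """Return a DumpDatabase for `path`, optionally creating the directory."""
    if not os.path.isdir(path):
        if not create:
            # Assumption: refuse rather than return an unusable object.
            raise FileNotFoundError("no such directory: " + path)
        os.makedirs(path)
    # Store the absolute path so that later use does not depend on the
    # working directory at the time the object was created.
    return DumpDatabase(os.path.abspath(path))
```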

The next step is to write methods for the database operations. Four generic functions are critical, including the dbread, dbwrite, and dbremove functions described above, and each needs a corresponding method for the new class.

All the examples in this case are trivial one-line methods, but looking at them will help you understand what the methods can assume and what they need to do. In addition to these four methods, several others may optionally be defined.

That's it, apart from a one-line utility that returns the path corresponding to an object name. Even this simplistic example has some utility. Notice that the contents of the database are now portable; we could attach and use such a database on a networked file system, regardless of the computer on which S was running.
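
To give a feel for how small such methods can be, here is a Python model of a directory-backed database. It is a sketch under several assumptions. The four operations shown (read, write, remove, and listing the object names) are a guess at which generics are meant. JSON text stands in for the S symbolic dump format. All class and method names are hypothetical.

```python
import json
import os

SUFFIX = ".Sdata"


class DumpDatabase:
    """Hypothetical model of a database that keeps one text file per object."""

    def __init__(self, path):
        self.path = os.path.abspath(path)

    def file_for(self, name):
        # The one-line utility: object name -> file path.
        return os.path.join(self.path, name + SUFFIX)

    def db_read(self, name):
        with open(self.file_for(name), encoding="utf-8") as f:
            return json.load(f)

    def db_write(self, name, value):
        with open(self.file_for(name), "w", encoding="utf-8") as f:
            json.dump(value, f)

    def db_remove(self, name):
        os.remove(self.file_for(name))

    def db_objects(self):
        # Only files carrying the suffix count as objects, so other files
        # the user keeps in the same directory are ignored.
        return sorted(f[:-len(SUFFIX)] for f in os.listdir(self.path)
                      if f.endswith(SUFFIX))
```

Because every object is stored as plain text, the directory can be read by the same code on any machine that can see the file system, which is the portability argument made above.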

This is a very compact implementation, but lacking in generality. One problem is that only certain object names are valid file names. A somewhat fancier alternative definition of the database class allows arbitrary object names at the cost of keeping a local table to map object names into file names.
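
One way to picture such a table is the Python sketch below. It is not the alternative class definition itself. NameTable, the table file name NAMES.json, and the generated file names are all hypothetical, and removing objects is deliberately left out.

```python
import json
import os


class NameTable:
    """Maps arbitrary object names to safe, generated file names."""

    def __init__(self, directory, table_file="NAMES.json"):
        self.path = os.path.join(directory, table_file)
        self.names = {}
        if os.path.exists(self.path):
            # Reload the table written by an earlier session.
            with open(self.path, encoding="utf-8") as f:
                self.names = json.load(f)

    def file_for(self, name):
        if name not in self.names:
            # Generated names never depend on the characters in `name`,
            # so any object name is acceptable.
            self.names[name] = "obj%06d.Sdata" % (len(self.names) + 1)
            with open(self.path, "w", encoding="utf-8") as f:
                json.dump(self.names, f)
        return self.names[name]
```

The cost mentioned above is visible here: the table is one more file that must be kept consistent with the contents of the directory.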

Databases without Directories

Normally, the S evaluation manager maintains an internal table, or directory, of the names of all the objects currently on a database. As S tasks alter the database, this directory is updated to reflect assignments and removals of objects. For user-defined database classes, the internal directory is initialized when the database is attached, via a call from the evaluation manager to the function dbobjects. This should return a vector of the current object names. There is only this one call each time the database is attached. From then on, the evaluator believes in its own internal table.

Occasionally, you may not want S to have such an internal table. In some applications, it may be expensive or even impossible to list all the objects in the database. At other times, you may not want to count on the internal view of the database. For example, if other processes are expected to update the database during the session, you are not guaranteed a current view of it: S will not know about objects that have been added or removed by other software. On the other hand, not having an internal directory is a serious problem if a database is to be used on the regular S search path. Efficiency suffers greatly if each attempt to search for an S function, for example, has to run a complex computation to test whether the corresponding object exists on a particular database. There is also a danger of getting trapped in an infinite loop if the function being sought is needed to evaluate one of the database methods themselves.

The approach taken is to allow user-defined database classes without internal directories, but not on the search list. Only databases attached with purpose="data" are allowed to dispense with directories. These databases are accessed only when an explicit where= argument is supplied. See page 226 of Programming with Data for this use of attach.

You indicate that a database should be attached without a directory by having the dbobjects method return the NULL object. If databases from your class never should have directories, just make sure that the default dbobjects method is used; i.e., don't have a method for your class or for any class that your class extends.

If the database has no directory, then there must be a method for the function dbexists; reasonably so, since the evaluator and the user need to be able to test whether an object exists on the database, rather than trying to read the object and getting an error.
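
The rules in this section can be summarized in a small Python model. Everything here is hypothetical: AttachedDatabase and ToyDB are invented names, and "search" as the default purpose is an assumption. Only purpose="data", the NULL return from dbobjects (modeled as None), and the need for dbexists come from the text.

```python
class AttachedDatabase:
    """Hypothetical model of what the evaluator keeps for an attached database."""

    def __init__(self, database, purpose="search"):
        self.database = database
        names = database.db_objects()        # called once, at attach time
        if names is None:
            if purpose != "data":
                raise ValueError("no directory: attach with purpose='data'")
            if not hasattr(database, "db_exists"):
                raise TypeError("no directory: a db_exists method is required")
            self.table = None
        else:
            self.table = set(names)          # the internal table

    def exists(self, name):
        if self.table is None:
            return self.database.db_exists(name)  # ask the database each time
        return name in self.table                 # trust the internal table


class ToyDB:
    """Toy database whose contents may change behind the evaluator's back."""

    def __init__(self, with_directory):
        self.store = {}
        self.with_directory = with_directory

    def db_objects(self):
        return list(self.store) if self.with_directory else None

    def db_exists(self, name):
        return name in self.store
```

With a directory, an object added to store after attaching is invisible to exists, which models the stale view described above. Without one, every exists call goes back to the database.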

Let's extend our previous example to make directories optional. We'll call this class "dumpFilesDB"; it extends the previous class to add a flag for having a directory.

setClass("dumpFilesDB", representation("dumpDataBase", directory = "logical"))
The generator function is the same, with an added argument for whether to create a directory. All the methods are inherited from "dumpDataBase", except for dbobjects, which returns NULL if the directory flag in the object is FALSE.
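
The inheritance pattern is easy to see in a Python model. Again, this is a sketch with hypothetical names rather than the S code: the subclass adds one flag and overrides only the method that lists objects.

```python
import os

SUFFIX = ".Sdata"


class DumpDatabase:
    """Hypothetical model of "dumpDataBase"; only the listing method is shown."""

    def __init__(self, path):
        self.path = os.path.abspath(path)

    def db_objects(self):
        return sorted(f[:-len(SUFFIX)] for f in os.listdir(self.path)
                      if f.endswith(SUFFIX))


class DumpFilesDB(DumpDatabase):
    """Adds a flag saying whether the evaluator should keep a directory."""

    def __init__(self, path, directory=True):
        super().__init__(path)
        self.directory = directory

    def db_objects(self):
        if not self.directory:
            return None              # models returning NULL: no directory
        return super().db_objects()  # otherwise behave like the parent class
```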

Actions on Attaching and Detaching

In some cases, you may need to take some special actions when a database is attached or detached. Normally, all the evaluation manager does on attaching the database is form the internal table of object names from the result of the call to dbobjects; on detaching, all the internal storage for the database is released.

If the contents of the database have been modified, the evaluator will look for a method for the function dbdetach. This function carries over from an earlier version of S, where it was used in the case of attaching an S list or other object. Its name is really a misnomer; it should be called dbsave, since its intent is to save a modified object or other database when that database is detached (see page 68 of the book Statistical Models in S). For back-compatibility, it will continue to be called dbdetach. It takes two arguments: the object originally attached and the position on the search list in which the database appeared. You can write a method for this function for your database class. Remember that the evaluator will call it only if some objects in the database have been modified. If you want to force it to be called, you can cheat by setting the status: database.status(where) = "modified". (Notice that the argument is the position or some other way of identifying the attached database, as discussed below.)
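
The calling rule for dbdetach can be modeled as below. This is a Python sketch with hypothetical names (Attached, detach, and the db_detach hook argument), not the evaluator's actual logic. The status strings are the ones that appear in S output elsewhere on this page.

```python
class Attached:
    """Hypothetical model of an attached database and its status."""

    def __init__(self, original, position):
        self.original = original
        self.position = position
        self.contents = dict(original)  # work on a copy of the attached object
        self.status = "readwrite"

    def assign(self, name, value):
        self.contents[name] = value
        self.status = "modified"


def detach(attached, db_detach):
    """Call the save hook only if the database was modified."""
    if attached.status == "modified":
        # Two arguments, as described above: the object originally
        # attached and its position on the search list.
        db_detach(attached.original, attached.position)
    attached.status = "unattached"
```

Setting the status to "modified" by hand, without any assignment, is enough to make detach call the hook; that is the "cheat" described above.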

Attached Objects vs. Database Objects

Page 224 of Programming with Data, and other places in the book, recommend using the object returned by the attach function, of class "attached", to identify the attached database. For S chapters, the alternative is to use the path name of the directory containing the chapter; there, using the attached object is a good idea but often not crucial.

When we come to attached objects, however, and even more to user-defined database classes, the distinction is crucial. For one thing, modifications of the attached database have no effect on the original object (page 228). But the distinction can be still more important, and subtle, in other situations.

Let's take the example of trying to change the status of a database. Suppose we attach an object, say myDB, and then want to mark the database as modified (to guarantee that dbdetach is called). Here's the right way:

  dbAttached = attach(myDB)
  database.status(dbAttached) = "modified"
You might think that the second line could use myDB as an argument instead of dbAttached, but the meaning is subtly different. Given an arbitrary object, the evaluation manager goes through some heuristic code that tries to match the object to the position, name, or some other way of identifying the attached database. This process is very definitely not guaranteed to work for arbitrary objects; in fact, you probably wouldn't want it to in many cases. The following session, which uses myDB where dbAttached is needed, emphasizes the point:
> dbAttached = attach(myDB)
> database.status(myDB) = "modified"
> database.status(myDB)
[1] "unattached"
> database.status(dbAttached)
[1] "readwrite"
(The evaluator should probably have warned in the assignment case; in fact what it did was to attach the object temporarily, do the requested modification, and then detach it at the end of the task.)

In any case, the essential point is always to use the attached object in these calls. If you didn't save it, you can get it back by calling database.attached with, for example, the current position in the search list as the argument: database.attached(2).


John Chambers<jmc@research.bell-labs.com>
Last modified: Thu Oct 8 18:48:21 EDT 1998