Section 5.3 of Programming with Data mentions a number of different ways to define an S database, including user-defined database classes. Not much more is said about these in the book.
This page describes them. The idea is quite powerful.
The S evaluation manager treats user-defined databases essentially
just like other types of databases, until it reaches the point at
which it needs to take an explicit action: reading, writing, or
removing an object, for example.
At this point it calls a corresponding generic function: dbread, dbwrite, and dbremove in these cases.
The designer of the database class provides methods for these
functions, all of which get the database object as their first
argument.
An important part of the philosophy is that the evaluation manager is
responsible for determining in most cases that we really need
to read, write, or remove the corresponding object.
The database methods do not worry about issues such as committing
assignments or whether there are multiple versions of the same named
object on other databases.
Although a moderate amount of
programming is needed to add a new kind of database class, the results
can usually be reused for other, similar applications.
This page outlines what is needed, and gives an example using the S
symbolic dump format.
The example has some mild practicality, in providing a portable
database that could be used in a network file system over diverse
hardware.
Its main virtue, though, is to work out the essential ideas in a
context that doesn't require discussing any other software, such as
database management systems.
A database class is any class that extends the virtual class database.
Once the required methods for dbread, etc. are defined or inherited, you can attach an
object of the class and use the resulting database like any other.
Notice that this is different from attaching a list-like object as a database. In that case, all the data reside in the single object. For database classes, on the other hand, the database object usually identifies a directory, database, URL or other external resource that will be used to store and retrieve data. The database object's class has the key role of telling the evaluator to use corresponding methods to deal with the low-level access to the database.
The class for our example extends "database".
Other than that, the representation only contains the path for the directory.
setClass("dumpDataBase", representation(path = "character"))
We will store objects in separate files in the directory; as a
heuristic we'll append ".Sdata" to the object name to get
the file name. This is consistent with the data.dump function, and also
helps avoid confusion if the user creates some other files in the same directory.
As with most class definitions, the next step is to write a generating function for objects from the class. The example generator function takes the path name of a directory as argument. If the directory doesn't exist, the function optionally creates it. If the directory exists or is created, the path stored in the object is a global, not a local path, so that future use of the database object is unambiguous.
The next step is to write some methods for the database operations. The four critical generic functions and the corresponding methods are:
dbobjects, which returns the character vector of current object names for the database;
dbread, which reads an object of a given name from the database;
dbwrite, which writes out an object corresponding to a given name; and
dbremove, which removes the object of a given name.
Two further generic functions may also need methods:
dbexists, which says whether a particular object name exists in the database (this is never called except when the database has no directory); and
dbdetach, which is called when the database is detached, and contains any actions necessary at that time.
That's it, apart from a one-line utility that returns the path corresponding to an object name. Even this simplistic example has some utility. Notice that the contents of the database are now portable; we could attach and use such a database on a networked file system, regardless of the computer on which S was running.
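The one-line utility mentioned above might look something like the following. This is only a sketch: the function name dumpFilePath, the "/" separator (a Unix-style file system), and the directory /tmp/Sdump are assumptions made for illustration; only the class definition and the ".Sdata" suffix come from the text.

```r
# Class definition as given in the text.
setClass("dumpDataBase", representation(path = "character"))

# Hypothetical utility (the name is an assumption): build the file path
# for an object name by appending ".Sdata" inside the database directory.
# Assumes "/" is the path separator.
dumpFilePath = function(database, name)
    paste(database@path, "/", name, ".Sdata", sep = "")

# Illustrative use with a made-up directory.
db = new("dumpDataBase", path = "/tmp/Sdump")
dumpFilePath(db, "mydata")
```

The methods for dbread, dbwrite, and dbremove would each call such a utility to find the file belonging to the object name they are given.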
This is a very compact implementation, but lacking in generality.
One problem is that only certain object names are valid file names.
A somewhat fancier alternative definition of the database class allows arbitrary object names at
the cost of keeping a local table to map object names into file names.
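One way such a table could work is sketched below. It is not the alternative definition itself: the helper name addObjectName, the named-vector representation, and the generated names obj1.Sdata, obj2.Sdata, and so on are all assumptions made for illustration.

```r
# Hypothetical helper, not part of the original example.
# nameTable is a named character vector: the names are S object names,
# the values are the file names that hold those objects.
addObjectName = function(nameTable, name) {
    if(is.na(match(name, names(nameTable)))) {
        # find the first generated file name not already in use
        i = 1
        while(!is.na(match(paste("obj", i, ".Sdata", sep = ""), nameTable)))
            i = i + 1
        nameTable[name] = paste("obj", i, ".Sdata", sep = "")
    }
    nameTable
}

# Object names that would not be safe as file names still get safe files.
nameTable = character(0)
nameTable = addObjectName(nameTable, "my object")
nameTable = addObjectName(nameTable, "x/y")
nameTable = addObjectName(nameTable, "my object")  # already known: unchanged
```

The database methods would look up the file name in the table instead of deriving it from the object name, and the table itself would have to be saved somewhere in the directory so that it survives between sessions.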
Databases without Directories
Normally, the S evaluation manager maintains an internal table of the
names of all the objects currently on a database.
As S tasks alter the database, this directory is updated to reflect
assignments and removals of objects.
For user-defined database classes, the internal directory is
initialized when the database is attached, via a call from the
evaluation manager to the function dbobjects.
This should return a vector of the current object names.
There is only this one call each time the database is attached. From
then on, the evaluator believes in its own internal table.
Occasionally, you may not want S to have such an internal table. In some applications, it may be expensive or even impossible to define all the objects in the database. At other times, you may not want to count on the internal view of the database. For example, if other processes are expected to be updating the directory during the session, you will not be guaranteed a current version of the database: S will not know about objects that have been added or removed by other software. On the other hand, not having an internal directory is a serious problem if a database is to be used on the regular S search path. Efficiency suffers greatly if each attempt to search for an S function, for example, has to run a complex computation to test whether the corresponding object exists on a particular database. There is also a danger of getting trapped in an infinite loop, if the function is needed to evaluate one of the database methods themselves.
The approach taken is to allow user-defined database classes without
internal directories, but not on the search list.
Only databases attached with purpose="data" are allowed
to dispense with directories.
These databases are accessed only when an explicit where= argument is supplied.
See page 226 of Programming with Data for this use of attach.
You indicate that a database should be attached without a directory by
having the dbobjects method return the NULL object.
If databases from your class never should have directories,
just make sure that the default dbobjects method is used;
i.e., don't have a method for your class or for any class that your
class extends.
If the database has no directory, then there must be a method for the
function dbexists; reasonably so, since the evaluator and
the user need to be able to test whether an object exists on the
database, rather than trying to read the object and getting an error.
Let's extend our previous example to make directories optional.
We'll call this class "dumpFilesDB"; it extends the
previous class to add a flag for having a directory.
setClass("dumpFilesDB", representation("dumpDataBase", directory = "logical"))
The generator function is the same, with an added argument for whether
to create a directory.
All the methods are inherited from "dumpDataBase", except
for dbobjects, which returns NULL if the
directory flag in the object is FALSE.
The internal directory is created when the database is attached, from the
result of dbobjects; on detaching, all the internal storage for
the database is released.
If the contents of the database have been modified, the evaluator will
look for a method for the function dbdetach.
This function carries over from an earlier version of S, where it was
used in the case of attaching an S list or other object.
Its name is really a misnomer; it should be called
dbsave, since its intent is to save a modified object or
other database when that database is detached (see page 68 of the book
Statistical Models in S).
For back-compatibility, it will continue to be called dbdetach.
It takes two arguments, the object originally attached and the
position on the search list in which the database appeared.
If you want to, you can write a method for this function for your
database class.
Remember that it will only be called by the evaluator if some
objects have been modified in the database.
If you want to force it to be called, you can cheat by setting the
status: database.status(where) = "modified". (Notice that
the argument is the position or some other way of identifying the
attached database, as discussed below.)
Use the object returned by the attach function, of class "attached", to
identify the attached database.
With S chapters, this is in distinction to using the path
name of the directory containing the chapter.
For these it's a good idea but often not crucial.
When we come to attached objects, however, and even more to user-defined database classes, the distinction is crucial. For one thing, modifications of the attached database have no effect on the original object (page 228). But the distinction can be still more important, and subtle, in other situations.
Let's take the example of trying to change the status of a database.
Suppose we attach an object, say myDB, and then want to mark the database as modified (to
guarantee that dbdetach is called).
Here's the right way:
dbAttached = attach(myDB)
database.status(dbAttached) = "modified"
You might think that the second line could use myDB as an
argument instead of dbAttached, but the meaning is
subtly different.
The evaluation manager goes through some heuristic code that tries to
match the object to the position, name, or some other way of
identifying the attached database.
This process is very definitely not guaranteed to work for arbitrary
objects; in fact, you probably wouldn't want it to in many cases.
In the example above, the point is emphasized by:
> dbAttached = attach(myDB)
> database.status(myDB) = "modified"
> database.status(myDB)
[1] "unattached"
> database.status(dbAttached)
[1] "readwrite"
(The evaluator should probably have warned in the assignment case; in
fact what it did was to attach the object temporarily, do the
requested modification, and then detach it at the end of the task.)
In any case, the essential point is always to use the attached object
in these calls.
If you didn't save it, you can get it back by calling
database.attached with, for example, the current position
in the search list as the argument: database.attached(2).