Anyone who has ever tried to get a database to import a lifelist, to generate a regional checklist from a database of trip lists, or to write a program that creates a database from a published regional checklist understands that the lack of any standard for bird lists is a major obstacle to fully exploiting bird data. In this article I propose using the Standard Generalized Markup Language (SGML) to define Document Type Definitions (DTDs) for bird checklists and trip lists. The resulting DTDs are the Standardized Checklist Format (SCLF) and the Standardized Trip List Format (STLF). I believe this approach solves many of the problems of data storage and retrieval for birding data.
By checklist I mean a list of the birds of a specific area, preferably with some description of their habitat, their abundance, and so forth. By trip list I mean a list of birds observed in a particular region of space and time. From this perspective a lifelist is just a trip list where the "trip" is one's whole life, with the additional peculiar property that all observations of a given species except the first have been eliminated.
Ideally, checklists and trip lists would feed off of each other, both of them being tied together in a vast repository of birding data. Trip lists would be created using customized checklists generated by querying the repository: one ought to be able to say "I'm going to southern Utah at the end of March; tell me all the birds I might see" and produce a tailor-made list. By the same token, filled-in trip lists are the raw materials for the repository; a checklist is just the summation of large numbers of trip lists.
The SGML language was designed to solve many of the problems in setting up repositories such as the one I have just described. It is a metalanguage in which one can describe document types in a portable, semantically meaningful way. Instead of describing the way a document is formatted, an SGML document type describes its underlying structure. Instead of specifying, for example, that a section of text is in 18-point Helvetica bold, the DTD will describe it as a "title"; each display device can then decide for itself how best to display a title.
It is this ability to describe content in a portable way that has made SGML such a powerful tool in setting up textual databases, and usage of the language is rapidly expanding. It is used, for example, to describe the Hypertext Markup Language (HTML) that is used for documents in the World-Wide Web.
One point I should stress is that although SGML documents are reasonably easy to read and edit using word processors, they are not intended to be the format in which documents are presented to users, nor even necessarily the format in which documents are created. Instead SGML documents are processed by applications which present appropriate interfaces to users. Thus when I describe standardized bird list formats, I am only talking about their "internal" representation. They can be printed and otherwise displayed to users in any way desired, and eventually could be created using tools with fancy graphical interfaces.
SGML is a rich and complex language, but for our purposes we need only two key concepts: tags and entities. A tag is simply a bracket placed on a region of text describing its content. The tag which marks the beginning of the region consists of the tag name surrounded by angle brackets, while the closing tag adds a slash before the tag name. For example,
<place>Fort Morgan AL</place>
says that the text "Fort Morgan AL" is a place. A document type definition is first and foremost a description of what the different kinds of text in a given document are (i.e. what its tags are) and the order in which they can appear.
The second concept we shall need is that of the entity. For our purposes entities are just a way of making textual substitutions within a document. For example, the entity declaration
<!entity TR "c-c-">
says that "TR" is an abbreviation for "c-c-". Entities are used by enclosing their names in the admittedly somewhat peculiar &; notation. Thus wherever &TR; appears in our sample document, it will be expanded to "c-c-" when the document is processed.
I think the use of entities will increase the acceptability of SCLF and STLF, because it allows them to subsume existing checklist formats gracefully. They do not prejudge such questions as whether the code for "accidental" should be "A" or "X" or "Ac": they are flexible enough to let each checklist author define his own codes within the standard.
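The substitution performed by entities can be mimicked in a few lines of Python. This is only a sketch of the idea, not an SGML parser: the function name expand_entities and the dictionary of declarations are illustrative assumptions, and only simple &name; references are handled.

```python
import re

# Hypothetical table of declarations, as if read from <!entity TR "c-c-">.
ENTITIES = {"TR": "c-c-"}

def expand_entities(text, entities):
    """Replace each &name; reference with its declared replacement text.

    References to undeclared entities are left untouched, so that nothing
    is silently lost.
    """
    def substitute(match):
        name = match.group(1)
        return entities.get(name, match.group(0))

    return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);", substitute, text)

print(expand_entities("Great Egret &TR;", ENTITIES))
```

Entity names are looked up exactly as written, since "TR" and "tr" may well be intended as different codes in a checklist.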
Let's see how SGML can be used to describe the structure of a trip list. For a given trip we'll want to record the observers and perhaps some information about the occasion for the trip. Then we'll want a list of the birds, and for each species we'll want its Latin or English name (or both), where and when it was seen, and in what numbers. The following sample from an STLF document is a straightforward rendition of this in SGML:
<triplist>
<observers>Chris Mundie, Sam Weissberg
<occasion> Spring Break
<bird>
<English> Great Egret
<Latin> Casmerodius albus
<Place> Erie NWR
<Time> 950304 06:45
<bird> ...
</triplist>
Note that I have exploited the markup minimization features of SGML to eliminate many of the unnecessary closing tags.
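To show that such a minimized trip list is still easy to process, here is a rough Python sketch that reads the sample. It is not a validating SGML parser and consults no DTD; it simply assumes that a field ends where the next tag begins and that each <bird> tag starts a new entry. The function name parse_triplist is hypothetical, the elided second bird is left out of the sample, and each field is assumed to occur at most once per bird.

```python
import re

SAMPLE = """<triplist>
<observers>Chris Mundie, Sam Weissberg
<occasion> Spring Break
<bird>
<English> Great Egret
<Latin> Casmerodius albus
<Place> Erie NWR
<Time> 950304 06:45
</triplist>"""

def parse_triplist(text):
    """Return (header, birds) for a minimized STLF fragment.

    header is a dict of trip-level fields; birds is a list of dicts,
    one per <bird> entry.
    """
    header, birds, current = {}, [], None
    # With a capture group, re.split yields text and tag names alternately:
    # [text, tag, text, tag, text, ...]
    parts = re.split(r"<(/?[A-Za-z]+)>", text)
    for tag, content in zip(parts[1::2], parts[2::2]):
        tag = tag.lower()  # treat tag names as case-insensitive
        value = content.strip()
        if tag == "bird":
            current = {}
            birds.append(current)
        elif tag == "triplist" or tag.startswith("/"):
            continue  # container and closing tags carry no data here
        elif current is None:
            header[tag] = value
        else:
            current[tag] = value
    return header, birds

header, birds = parse_triplist(SAMPLE)
print(header["observers"])
print(birds[0]["english"], "at", birds[0]["place"])
```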
Now one of the thorniest problems in maintaining bird lists is the ephemerality of taxonomy. What I record today as a Three-toed Ant Thrush may tomorrow be known as a Rufous-toed Grasshopper Robin; how is the complex mapping from one taxonomic name space to another to be managed?
SGML provides a mechanism which solves this problem nicely. What is permanent in taxonomy is the act of naming itself: the name plus the name space from which it came. Trip lists are research projects, and as such ought to be footnoted. SGML lets us annotate each name with the taxonomic system that supports it, by means of an attribute giving the "authority" for the name. A complete entry for the Great Egret might then be:
<english auth="Clemens, Howard"> Great Egret
<latin auth=Clemens> Casmerodius albus
<latin auth=Howard> Egretta alba egretta
No matter how much lumping and splitting goes on in the future, nothing can alter the fact that this bird was, on this occasion, judged to be what Clemens calls a "Casmerodius albus" and what Howard calls an "Egretta alba egretta". By preserving this information in the trip list itself, we effectively insulate it from nomenclatural changes.
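As a rough illustration of how a program might use these authorities, the following Python sketch indexes the names in one entry by language and authority. The function name names_by_authority and the regular expression are assumptions made for this example. The multi-authority value is written in quotes in the string below, though the pattern accepts the unquoted form as well.

```python
import re

ENTRY = ('<english auth="Clemens, Howard"> Great Egret '
         '<latin auth=Clemens> Casmerodius albus '
         '<latin auth=Howard> Egretta alba egretta')

def names_by_authority(entry):
    """Map (language, authority) to the name used for one bird entry."""
    names = {}
    # Tag name, then the auth attribute (quoted or not), then the tagged text.
    pattern = r'<(\w+)\s+auth="?([^">]*)"?>\s*([^<]*)'
    for language, authorities, name in re.findall(pattern, entry):
        for authority in authorities.split(","):
            names[(language.lower(), authority.strip())] = name.strip()
    return names

names = names_by_authority(ENTRY)
print(names[("latin", "Howard")])
```

A later change of taxonomy then amounts to choosing a different authority at lookup time; the recorded observation itself is never rewritten.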
This is perhaps an appropriate moment to pause and compare using an SGML-based document for trip lists to the more commonly used alternative, namely, a database. At first the two approaches may seem quite similar. Both the SGML-based document and the database are machine processible, both permit efficient data retrieval, and the tagged text corresponds to the fields of a database.
The difference is that the SGML document is much more flexible and extensible than a traditional database; after all, it was designed to describe the vast variety of natural-language documents. Fields can be omitted or replicated as needed, and they need not appear in any fixed order. New tags, and new attributes, can be added without perturbing existing documents. SGML documents are more similar in this respect to the "feature structures" used in linguistics than they are to databases, and this makes them ideal for describing incomplete data.
To take an example: before I switched to STLF, I kept my lifelist in a tab-delimited text file so that it could be imported into databases. This meant that I could have only one Latin and one English name, with the result that the lifelist was, as we have seen, very fragile in the face of taxonomic change. I could of course have had two name fields, or switched to a relational database, but only by altering the database declarations and converting from the old format to the new one. With STLF, it's just not an issue.
Because checklists are summations over many trip lists, the design of SCLF is considerably more complex than that of STLF, but the underlying principles are the same. The following excerpt from a Bahamas checklist gives the flavor of SCLF:
<Area> Bahamas
<Subarea> Ab (Abaco)
<Subarea> An (Andros)
<Subarea> Bi (Bimini)
<Subarea> CS (Cay Sol)
<Subarea> GB (Grand Bahama)
...
<!Entity n "Ab,GB,An,El,Bi,NP">
<!Entity s "GI,Ma">
<!Entity w "An,Bi,Ab,CS,GB,NP">
Calonectris diomedea <a>GB <h>P <f>Ac
Puffinus gravis <h>P <f>Ac
Puffinus griseus <h>P <f>Ac
Puffinus lherminieri <h>P
Oceanites oceanicus <aX>GB,Ab <f>C
Oceanodroma leucorhoa <q>CCCC
...
</Area>
The SCLF uses the tags <Area> and </Area> to delimit checklists for particular areas. The tagged text is the name of the area; comments are enclosed in parentheses.
Each area may contain subareas marked by the <Subarea> tag. This permits, in effect, multiple checklists to be combined. In theory, all the checklists for the whole world could be merged in this way, but practically it would be cumbersome.
We see again the power that comes from the use of entity definitions. In the example above, entities such as n, s, and w are defined to be sets of subareas, so that the northern part of an area can be referred to as simply "&n;". In practice I expect that this mechanism will be used to handle a wide variety of notations: in the context of a given checklist, specialized notations can be created that expand to standardized ones. For example, if a given checklist uses "v" to mean "irregular winter visitor", a "v" entity can be defined which expands to "<q>---u <l>i".
An empirical examination of existing checklists shows that five types of information are typically included: habitat; seasonal variation; status; abundance; locality (in both space and time); and subregion status. The essential idea behind SCLF is that these five types of information should be identified by means of tags, so that both the human and the computer can process them unambiguously. For the sake of succinctness, single-character tag names are used.
It is important to distinguish the tags used to mark up these five kinds of data from the codes used for the data itself. In its current form SCLF defines the tags only; the code definitions are still being researched. (I believe there was a recent proposal in Birding that could be appropriated for some of the codes.) One of the advantages of having the tags is that the name space is expanded, greatly reducing the chance of collisions. For example, "S" can safely be used for both "Swamp" and "Summer"; "w" can be both "western" and "winter", and so forth.
Here is how the five information types are encoded in SCLF.
Subregion codes may optionally be prefixed by one of the nine geographic modifiers n, s, e, w, c, ne, nw, se, sw. In addition, of course, entity references can be used, so that given the entity definitions in the example above, "<a>&n;" is equivalent to "<a>Ab,GB,An,El,Bi,NP". Finally, the "<aX>" tag is like the <a> tag except that it inverts the subregion list: <aX>GB,Ab would mean the bird appears everywhere in the area except Grand Bahama and Abaco.
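The behavior of the <a> and <aX> tags can be sketched in Python as follows. The function name resolve_subareas is hypothetical, the subarea list contains only the five subareas shown in the Bahamas excerpt (the rest are elided there), and the geographic modifiers are ignored to keep the example short.

```python
# Subareas and entity definitions copied from the Bahamas excerpt.
SUBAREAS = ["Ab", "An", "Bi", "CS", "GB"]
ENTITIES = {
    "n": "Ab,GB,An,El,Bi,NP",
    "s": "GI,Ma",
    "w": "An,Bi,Ab,CS,GB,NP",
}

def resolve_subareas(tag, codes, all_subareas, entities):
    """Turn the text of an <a> or <aX> tag into a list of subarea codes."""
    listed = []
    for code in codes.split(","):
        code = code.strip()
        if code.startswith("&") and code.endswith(";"):
            # Entity reference: splice in the subareas it stands for.
            listed.extend(entities[code[1:-1]].split(","))
        else:
            listed.append(code)
    if tag == "aX":
        # Inverted list: every known subarea except those named.
        return [area for area in all_subareas if area not in listed]
    return listed

print(resolve_subareas("a", "&n;", SUBAREAS, ENTITIES))
print(resolve_subareas("aX", "GB,Ab", SUBAREAS, ENTITIES))
```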
Another checklist facility that is sometimes seen is parameterized seasonal variation with positional notation for subregions. Although not widespread, this technique can be very compact and so has been included in the form of the <as> tag, where "as" stands for "array of subareas". For example, if an area has five subareas A1, A2, A3, A4, and A5, the notation <q>?-?- <as>crx-a would be equivalent to <a>A1 <q>c-c-, <a>A2 <q>r-r-, <a>A3 <q>x-x-, <a>A4 <q>----, <a>A5 <q>a-a-.
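The expansion of the <as> notation is mechanical, as the following Python sketch suggests. The function name expand_array is an assumption of this example. Each character of the <as> string is substituted for every "?" in the <q> template, one subarea at a time.

```python
def expand_array(template, codes, subareas):
    """Expand a <q> template and an <as> code string into per-subarea values.

    Returns a list of (subarea, seasonal code) pairs in subarea order.
    """
    if len(codes) != len(subareas):
        raise ValueError("need exactly one code per subarea")
    return [(area, template.replace("?", code))
            for area, code in zip(subareas, codes)]

for area, seasons in expand_array("?-?-", "crx-a", ["A1", "A2", "A3", "A4", "A5"]):
    print("<a>" + area, "<q>" + seasons)
```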
The SCLF defines one additional tag, the <Family> tag. It may optionally be used to group bird data into families. This is useful for organizational purposes, and also permits better abbreviation schemes.
Indeed, the choice of naming and abbreviation schemes is a major problem when entering checklist data. Unfortunately, there are no Kripkean "rigid designators" for birds (see Crnchng th Brds). The full scientific names come fairly close, but require a lot of typing. After some research, I devised a solution which works reasonably well in practice. If the birds in the checklist are organized by families, it is possible to use the first two letters of the genus along with the first four letters of the species as an abbreviation for the complete Latin name. It is true that there are some collisions, but the system still beats typing the full name. It may be possible to do better with dedicated bird-name entry software, but I haven't liked the systems I have seen.
It goes without saying that this abbreviation scheme is strictly for the purposes of data entry. Once the abbreviation has been entered, it can be looked up and replaced with the full name. In fact, this convention should not be regarded as part of the SCLF itself.
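A minimal Python sketch of the abbreviation scheme might look like this. The function names abbreviate and build_lookup are hypothetical, and folding the abbreviation to lower case is my assumption, since the scheme as described does not specify case. The names are taken from the Bahamas excerpt.

```python
from collections import defaultdict

def abbreviate(latin_name):
    """First two letters of the genus plus first four of the species."""
    genus, species = latin_name.split()[:2]
    return (genus[:2] + species[:4]).lower()

def build_lookup(names):
    """Map each abbreviation to the full names that share it.

    A list with more than one entry signals a collision that the
    data-entry tool would have to resolve some other way.
    """
    lookup = defaultdict(list)
    for name in names:
        lookup[abbreviate(name)].append(name)
    return lookup

FAMILY = [
    "Calonectris diomedea",
    "Puffinus gravis",
    "Puffinus griseus",
    "Puffinus lherminieri",
]
lookup = build_lookup(FAMILY)
print(lookup["pugris"])
```

Building one lookup table per family, rather than one for the whole checklist, is what keeps the number of collisions low.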
It is time to start collecting information on bird distribution in a permanent, reusable fashion, and to begin to accumulate the tools that would simplify bird data manipulation. STLF and SCLF are, I believe, a step in the right direction. It has proven remarkably easy to construct a tool to take SCLF checklists and use them to drive a mathematical modeling program which predicts the number of lifers a given birder would get at a given site (see The Mathematical Theory of Listing). With a new generation of SGML tools coming of age, using these formats will become easier and easier.
Mundie, David A. Standardized Bird Lists / David A. Mundie Pittsburgh, PA : Polymath Systems 1995 1. Ornithology I. Techniques II. Standards 598.01 dc-20 [MARC]