iPhylo: April 2014

Roderic D. M. Page

Monday, April 21, 2014

Is collecting specimens necessary?

Some interesting threads in TAXACOM today (yes, really). The following article has appeared in Science:

Minteer, B. A., Collins, J. P., Love, K. E., & Puschendorf, R. (2014, April 18). Avoiding (Re)extinction. Science. American Association for the Advancement of Science (AAAS). doi:10.1126/science.1250953 (paywall)

The authors argue that "The availability of adequate alternative methods of documentation, including high-resolution photography, audio recording, and nonlethal sampling, provide an opportunity to revisit and reconsider field collection practices and policies."

This has brought a swift response from Kevin Winkler (Re)affirming the specimen gold standard who argues that physical specimens are vital for much of the science that museum collections support.

At the same time, David Schindel has posted on Minimum standards for e-voucher documentation, that is, DNA samples where no physical voucher exists (e.g., because the organism is a member of an endangered species, or still alive).

Wednesday, April 16, 2014

Breaking the Biodiversity Heritage Library

The Biodiversity Heritage Library (BHL) has recently introduced a feature that I strongly dislike. The post describing this feature (Inspiring discovery through free access to biodiversity knowledge... states:

Now BHL is expanding the data model for its portal to be able to accommodate references to content in other well-known repositories. This is highly beneficial to end users as it allows them to search for articles, alongside books and journals, within a single search interface instead of having to search each of these siloes separately.

What this means is that, whereas in the past a search in BHL would only turn up content actually in BHL, now that search may return results from other sources. What's not to like? Well, for me this breaks the fundamental BHL experience that I've come to rely on, namely:

If I find something in BHL I can read it there and then

With the new feature, the search results may include links to other sources. Sometimes these are useful, but sometimes they are anything but. Once you start including external links in your search results, you have limited control over what those links point to. For example, if I search BHL for the journal Revista Chilena de Historia Natural I get two hits. Cool! So I click on one hit and I can read a fairly limited set of scanned volumes in BHL, if I click on the other hit I'm taken to a page at the Digital Library of the Real Jardín Botánico of Madrid. This is a great resource, but the experience is a little jarring. Worse, for this journal the Real Jardín Botánico doesn't actually have any content, instead the "View Book" link takes me to SciElo in Chile, where I can see a list of recent volumes of this journal.

In this case, BHL is basically a link farm that doesn't give me direct access to content, but instead sends me on a series of hops around the Internet until I find something (and I could have gotten there more quickly via Google).

What is wrong with this?

There are two reasons I dislike what BHL have done. The first is that it breaks the experience of search then read within a consistent user interface. Now I am presented with different reading experiences, or, indeed, no reading at all, just links to where I might find something to read.

More subtlety, it undermines a nice feature of BHL, namely searching by taxonomic names. The content BHL has scanned has also been indexed by taxonomic name, so often I find what I'm looking for not by using bibliographic details (journal name, volume, etc.), which are often a bit messy, but by searching on a name. External content has not been indexed by name, so it can't be found in this way. Whereas before, if I search by name I would be reasonably confident that if BHL had something on that name I could find it (barring OCR errors), now BHL may well have what I looking for (in an external source) but can't show me that because it hasn't been indexed.

From my perpsective, the things I've come to rely on have been broken by this new feature (and I haven't even begun to talk about how this breaks things I rely on to harvest BHL for article metadata, which I then put into BioStor, which in turn gets fed back into BHL).

What should BHL have done?

To be clear, I'm not arguing against BHL being "able to accommodate references to content in other well-known repositories". Indeed, I'd wish they'd go further and incorporate content from BHL-Europe, whose portal is, frankly, a mess. Rather, my argument is that they should not have done this within the existing BHL portal. Doing so dilutes the fundamental experience of that portal ("if I find it I can read it").

Here's what I would do instead:

Keep the current BHL portal as it was, with only content actually scanned and indexed by BHL.
Create a new site that indexes all relevant content (e.g., BHL, BHL-Europe, and other repositories.
Model this new portal on something like CrossRef's wonderful metadata search. That is, throw all the metadata into a NoSQL database, add a decent search engine, and provide users with a simple, fast tool.
The portal should clearly distinguish hits that are to BHL content (e.g. by showing thumbnails) and hits that are to external links (and please filter links to links!).
Add taxonomic names to the index (you have these for BHL content, adding them for external content is pretty easy).
Even more useful, start indexing full text content, maybe starting at articles ("parts"). At the moment Google is doing a better job of indexing BHL content (indirectly via indexing Internet Archive) than BHL does.

Creating a new tool would also give BHL the freedom to explore some new approaches without annoying users like me who have come to rely on the currently portal working in a certain way. Otherwise BHL risks "feature creep", however well motivated.

Thursday, April 10, 2014

User interface to edit a point location

Following on from earlier posts on annotating biodiversity data (Rethinking annotating biodiversity data and More on annotating biodiversity data: beyond sticky notes and wikis) I've started playing with user interfaces for editing data.

For example, here's a simple interface to edit the location of a specimen or observation (inspired by the iNaturalist observation editor). You can play with this below or on on bl.ocks.org, and the source code is on GitHub https://gist.github.com/rdmpage/9951904.

Tuesday, April 08, 2014

The Experimenter’s Museum: GenBank, Natural History, and the Moral Economies of Biomedicine

An undergraduate student (Aime Rankin) doing a project with me on citation and impact of museum collections came across a paper I hadn't seen before:

Strasser, B. J. (2011, March). The Experimenter’s Museum: GenBank, Natural History, and the Moral Economies of Biomedicine. Isis. University of Chicago Press. doi:10.1086/658657

Unfortunately the paper is behind a paywall, but here's the abstract (you can also get a PDF here):

Today, the production of knowledge in the experimental life sciences relies crucially on the use of biological data collections, such as DNA sequence databases. These collections, in both their creation and their current use, are embedded in the experimentalist tradition. At the same time, however, they exemplify the natural historical tradition, based on collecting and comparing natural facts. This essay focuses on the issues attending the establishment in 1982 of GenBank, the largest and most frequently accessed collection of experimental knowledge in the world. The debates leading to its creation—about the collection and distribution of data, the attribution of credit and authorship, and the proprietary nature of knowledge—illuminate the different moral economies at work in the life sciences in the late twentieth century. They offer perspective on the recent rise of public access publishing and data sharing in science. More broadly, this essay challenges the big picture according to which the rise of experimentalism led to the decline of natural history in the twentieth century. It argues that both traditions have been articulated into a new way of producing knowledge that has become a key practice in science at the beginning of the twenty-first century.

It's well worth a read. It argues that sequence databases such as Genbank are essentially the equivalent of the great natural history museums of the 19th Century. There are several ironies here. One is that some early advocates of molecular biology cast it as a modern, experimental science as opposed to mere natural history. However, once the amount of molecular data became too great for individuals to easily manage, and once it became clear that many of the questions being asked required a comparative approach, the need for a centralised database of sequences (the "experimenter's museum" of the title of the paper) became increasingly urgent. Another irony is that the clash between molecular and morphological taxonomy overlooks these striking similarities in history (collecting ever increasing amounts of data eventually requiring centralisation).

Bruno Strasser's article also discusses the politics behind setting up GenBank, including the inevitable challenge of securing funding, and the concerns of many individual scientists about the loss of control over their data. A final irony is that, having gone through this process once with the formation of the big museums in the 19th century, we are going through it again with the wrangling over aggregating the digitised versions of the content of those museums.

Update: See also

Strasser, B. J. (2008, October 24). GENETICS: GenBank--Natural History in the 21st Century? Science. American Association for the Advancement of Science (AAAS). doi:10.1126/science.1163399

(via Guanyang Zhang).

Friday, April 04, 2014

More on annotating biodiversity data: beyond sticky notes and wikis

Following on from the previous post Rethinking annotating biodiversity data, here are some more thoughts on annotating biodiversity data.

Annotations as sticky notes

I get the sense that most people think of annotations as "sticky notes" that someone puts on data. In other words, the data is owned by somebody, and anyone who isn't the owner gets to make comments, which the owner is free to use or ignore as they see fit. With this model, the focus is on how the owner deals with the annotations, and how they manage the fact that their data may have changed since the annotations were made.

This model has limitations. For a start, it privileges the "owner", and puts annotators at their mercy. For example, I posted an issue regarding a record in the Museum of Comparative Zoology Herpetology database (see https://github.com/mcz-vertnet/mcz-subset-for-vertnet/issues/1). VertNet has adopted GitHub to manage annotations of collection data, which is nice, but it only works if there's someone at the other end ready to engage with people like me who are making annotations. I suspect this is mostly not going to be the case, so why would I bother annotating the data? Yes, I know that VertNet has only just set this up, but that's missing the point. Supporting this model requires customer support, and who has the resources for that? If I don't get the sense that someone is going to deal with my annotation, why bother?

So, the issues here are that the owner gets all the rights, the annotators have none, and in practice the owners might not be in a position to make use of the annotations anyway.

Wikis

OK, if the owner/annotator model doesn't seem attractive, what about wikis? Let's put the data on a wiki and let folks edit it, that'll work, right? There's a lot to be said in favour of wikis, but there's a disadvantage to the basic wiki model. On a wiki, there is one page for an item, and everyone gets to edit that same page. The hope is that a consensus will emerge, but if it doesn't then you get edit wars (e.g., When taxonomists wage war in Wikipedia). If you've made an edit, or put your data on a wiki, anyone can overwrite it. Sure, you can roll back to an earlier version, but so can anyone else.

Wikis bring tools for community editing, but overturn ownership completely, so the data owner, or indeed any individual annotator has no control over what happens to their contributions. Why would an expert contribute if someone else can undo all their hard work?

Social data

So, if sticky notes and wikis aren't the solution, what is? I've been looking at Fluidinfo lately. There's an interview here, and a book here. The company has gone quiet lately (apparently focussing on enterprise customers), but what matters here is the underlying idea, namely "social data".

Fluidinfo's model is that it is a database of objects (representing things or concepts), and anyone can add data to those objects (they are "openly writable"). The key is that every tag is linked to the user, and by default you can only add, edit, or delete your own tags. This means that if a data provider adds, say a bibliographic reference to the database, I can edit it by adding tags, but I can't edit the data provider's tags. To make this a bit more concrete, suppose we have a record for the article with the DOI 10.1163/187631293X00262. We can represent the metadata from CrossRef like this:

{
	"_id": "10.1163/187631293X00262",
	"crossref/doi" : "10.1163/187631293X00262",
	"crossref/title" : "A taxonomic review of the pondskater...",
	"crossref/journal" : "Insect Systematics & Evolution",
	"crossref/issn" : [ "1399-560X", "1876-312X"]
}

Note the use of the namespace "crossref" in the tags. This is data that, notionally, CrossRef "owns" and can edit, and nobody else. Now, as I've discussed earlier (Orwellian metadata: making journals disappear) some publishers have an annoying habit of retrospectively renaming journals. This article was published in Entomologica Scandinavica, which has since been renamed Insect Systematics & Evolution, and CrossRef gives the latter as the journal name for this article. But most citations to the article will use the old journal name. Under the social data model, I can add this information (in bold):

{
	"_id": "10.1163/187631293X00262",
	"crossref/doi" : "10.1163/187631293X00262",
	"crossref/title" : "A taxonomic review of the pondskater...",
	"crossref/journal" : "Insect Systematics & Evolution",
	"crossref/issn" : ["1399-560X", "1876-312X"],
	"rdmpage/journal" : "Entomologica Scandinavica","rdmpage/issn" : ["0013-8711" ]
}

My tags have the namespace "rdmpage", so they are "mine". I haven't overwritten the "crossref" tags. Somebody else could add their own tags, and of course, CrossRef could update their tags if they wish. We can all edit this object, we don't need permission to do so, and we can rest assured that our own edits won't be overwritten by somebody else.

This model can be quite liberating. If you are a data provider/owner, you don't have to worry about people trampling over your data, because you (and any users of your data) can simply ignore tags not in your namespace ("ignore those rdmpage' tags, that Rod Page chap is clearly a nutter"). Annotators are freed from their reliance on data providers doing anything with the annotations they created. I don't care whether CrossRef decides to revert the journal name Insect Systematics & Evolution to Entomologica Scandinavica for earlier article (or not), I can just use the "rdmpage/journal" (if it exists) to get what I think is the appropriate journal name. My annotations are immediately usable. Because everyone gets to edit in their own namespace, we don't need to form a consensus, so we don't need the version control feature of wikis to enable roll backs, there are no more edit wars (almost).

Implementation

A key feature of the Fluidinfo social data model is that the data is stored in a single, globally accessible place. Hence we need a global annotation store. Fluidinfo itself doesn't seem to have a publicly accessible database, I guess in part because managing one is a major undertaking (think Freebase). Despite Nicholas Tollervey's post (FluidDB is not CouchDB (and FluidDB's secret sauce)), I think CouchDB is exactly the way I'd want to implement this (it's here, it works, and it scales). The "secret sauce" is essentially application logic (every key has a namespace corresponding to a given user).

The more I think about this model the more I like it. It could greatly simplify the task of annotating biodiversity data, and avoid what I fear are going to be the twin dead ends of sticky note annotation and wikis.