Wednesday, December 29, 2010

The Plant List: nice data, shame it's not open

The Plant List has been released today, complete with glowing press releases. The list includes some 1,040,426 names. I eagerly looked for the Download button, but none is to be found. You can download individual search results (say, at family level), but not the whole data set.

OK, so that makes getting the complete data set a little tedious (there are 620 plant families in the data set), but we can still do it without too much hassle (in fact, I've grabbed the complete data set while writing this blog post). Then I see that the data is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license. Creative Commons is good, right? In this case, not so much. The CC BY-NC-ND license includes the clause:
You may not alter, transform, or build upon this work.
So, you can look but not touch. You can't take this data (properly attributed, of course) and build your own list, for example with references linked to DOIs, or to the Biodiversity Heritage Library (which is, of course, exactly what I plan to do). That's a derivative work, and the creators of the Plant List don't want you to do that. Despite this, the Plant List want us to use the data:
Use of the content (such as the classification, synonymised species checklist, and scientific names) for publications and databases by individuals and organizations for not-for-profit usage is encouraged, on condition that full and precise credit is given to The Plant List and the conditions of the Creative Commons Licence are observed.
Great, but you've pretty much killed that by using BY-NC-ND. Then there's this:
If you wish to use the content on a public portal or webpage you are required to contact The Plant List editors at to request written permission and to ensure that credits are properly made.
Really? The whole point of Creative Commons is that the permissions are explicit in the license. So, actually I don't need your permission to use the data on a public portal, CC BY-NC-ND gives me permission (but with the crippling limitation that I can't make a derivative work).

So, instead of writing a post congratulating the Royal Botanic Gardens, Kew and Missouri Botanical Garden (MOBOT) for releasing this data, I'm left spluttering in disbelief that they would hamstring its use through such a poor choice of license. Kew and MOBOT could have made the Plant List available as open data using one of the licenses listed on the Open Definition web site, such as putting the data in the public domain (for example, by using a Creative Commons CC0 license). Instead, they've chosen a restrictive license which makes the data closed, effectively killing the possibility for people to build upon the effort they've put into creating the list. Why do biodiversity data providers seem determined to cling to data for dear life, rather than open it up and let people realise its potential?

Thursday, December 23, 2010


Some quick notes on OCR. Revisiting my DjVu viewer experiments, it really struck me how "dirty" the OCR text is. It's readable, but if we were to display the OCR text rather than the page images, it would be a little off-putting. For example, in the paper A new fat little frog (Leptodactylidae: Eleutherodactylus) from lofty Andean grasslands of southern Ecuador there are 15 different variations of the frog genus name Eleutherodactylus:

  • Eleutherodactylus
  • Eleutheroclactylus
  • Eleuthewdactyliis
  • Eleiitherodactylus
  • Eleuthewdactylus
  • Eleuthewdactylus
  • Eleutherodactyliis
  • Eleutherockictylus
  • Eleutlierodactylus
  • Eleuthewdactyhts
  • Eleiithewdactylus
  • Eleutherodactyhis
  • Eleiithemdactylus
  • Eleuthemdactylus
  • Eleuthewdactyhis

Of course, this is a recognised problem. Wei et al. Name Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL) (hdl:2142/14919) found that 35% of names in BHL OCR contained at least one wrong character. They compared the performance of two taxonomic name finding tools on BHL OCR (uBio's taxonFinder and FAT), neither of which did terribly well. Wei et al. found that different page types can influence the success of these algorithms, and suggested that automatically classifying pages into different categories would improve performance.

Personally, it seems to me that this is not the way forward. It's pretty obvious looking at the versions of "Eleutherodactylus" above that there are recognisable patterns in the OCR errors (e.g., "u" becoming "ii", "ro" becoming "w", etc.). After reading Peter Norvig's elegant little essay How to Write a Spelling Corrector, I suspect the way to improve the finding of taxonomic names is to build a "spelling corrector" for names. Central to this would be building a probabilistic model of the different OCR errors (such as "u" → "ii"), and using that model to create a set of candidate taxonomic names the OCR string might actually be (the equivalent of Google's "did you mean", which is the subject of Norvig's essay). I had hoped to avoid doing this by using an existing tool, such as Tony Rees' TAXAMATCH, but it's a website not a service, and it is just too slow.
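To make the idea concrete, here's a minimal sketch (not working code from this project) of a Norvig-style corrector where candidates are generated by reversing known OCR confusions. The confusion probabilities and the list of known names are invented for illustration:

```python
# Sketch of a "did you mean" for taxonomic names, where candidate edits are
# weighted by OCR confusion probabilities rather than keyboard typos.
# The confusion table and the name list below are toy examples.

# P(ocr_string | true_string) for a few common OCR confusions (made up)
CONFUSIONS = {
    ("u", "ii"): 0.02,   # "u" misread as "ii"
    ("ro", "w"): 0.01,   # "ro" misread as "w"
    ("li", "h"): 0.005,  # "li" misread as "h"
    ("h", "li"): 0.005,  # "h" misread as "li"
}

KNOWN_NAMES = {"Eleutherodactylus", "Leptodactylus"}

def candidates(ocr_word):
    """Return (name, score) pairs: known names reachable from ocr_word
    by reversing a single OCR confusion, best score first."""
    results = {}
    if ocr_word in KNOWN_NAMES:
        results[ocr_word] = 1.0
    for (truth, error), p in CONFUSIONS.items():
        start = 0
        while (i := ocr_word.find(error, start)) != -1:
            fixed = ocr_word[:i] + truth + ocr_word[i + len(error):]
            if fixed in KNOWN_NAMES:
                results[fixed] = max(results.get(fixed, 0.0), p)
            start = i + 1
    return sorted(results.items(), key=lambda kv: -kv[1])

print(candidates("Eleutherodactyliis"))  # "ii" reversed to "u"
print(candidates("Eleuthewdactylus"))    # "w" reversed to "ro"
```

A real version would allow multiple edits per word and estimate the probabilities from data, but even this single-edit model recovers several of the variants listed above.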

I've started doing some background reading on the topic of spelling correction and OCR, and I've created a group on Mendeley called OCR - Optical Character Recognition to bring these papers together. I'm also fussing with some simple code to find misspellings of a given taxonomic name in BHL text, use the Needleman–Wunsch sequence alignment algorithm to align those misspellings to the correct name, and then extract the various OCR errors, building a matrix of the probabilities of the various transformations of the original text into OCR text.
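For what it's worth, the alignment step can be sketched with a textbook Needleman–Wunsch implementation. The scoring scheme here (match +1, mismatch and gap both -1) is a placeholder, not necessarily what I'll end up using:

```python
# Toy Needleman-Wunsch alignment between the correct name and an OCR string,
# used to tally the character-level differences that feed the error matrix.

def align(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # fill the score matrix
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        F[i][0] = i * gap
    for j in range(m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i-1][j-1] + s, F[i-1][j] + gap, F[i][j-1] + gap)
    # traceback, preferring diagonal moves
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           F[i][j] == F[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j-1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def substitutions(correct, ocr):
    """Aligned character pairs that differ (gaps shown as '-')."""
    x, y = align(correct, ocr)
    return [(c, o) for c, o in zip(x, y) if c != o]

print(substitutions("Eleutherodactylus", "Eleuthemdactylus"))
```

Accumulating these pairs over many (name, misspelling) alignments gives the raw counts for the transformation probability matrix.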

One use for this spelling correction would be in an interactive BHL viewer. In addition to showing the taxonomic names that uBio's taxonFinder has located in the text, we could flag strings that could be misspelt taxonomic names (such as "Eleutherockictylus") and provide an easy way for the user to either accept or reject that name. If we are going to invite people to help clean up BHL text, it would be nice to provide hints as to what the correct answer might be.

Monday, December 20, 2010

BioStor one year on: has it been a success?

One year ago I released BioStor, which scratched my itch regarding finding articles in the Biodiversity Heritage Library. This anniversary seems to be a good time to think about where next with this project, but also to ask whether it's been successful. Of course, this rather hinges on what I mean by "success." I've certainly found BioStor to be useful, both the experience of developing it, and actually using it. But it's time to be a little more hard-headed and look at some stats. So I'm going to share the Google Analytics stats for BioStor. Below is the report for Dec 20, 2009 to Dec 19, 2010, as a PDF.


BioStor had 63,824 visits over the year, and 197,076 pageviews. After an initial flurry of visits on its launch the number of visitors dropped off, then slowly grew. Numbers dipped during the middle of the year, then started to climb again.

In order to discover whether these numbers are a little or a lot, it would be helpful to compare them with data from other biodiversity sites. Unfortunately, nobody seems to be making this information readily available. There is a slide in a BHL presentation that shows BHL having had more than 1 million visits since January 2008, and in March 2010 it was receiving around 3000 visits per day, which is an order of magnitude greater than the traffic BioStor is currently getting. For another comparison, I looked at Scratchpads, which currently comprise 193 sites. In November 2007 Scratchpads had 43,379 pageviews altogether; in November 2010 BioStor had 17,484 pageviews. For the period May-October 2009 Scratchpads had 74,109 visitors; for the equivalent period in 2010 BioStor had 28,110. So, BioStor is getting about a third of the traffic of the entire Scratchpads project.

Bounce rate

One of the more interesting charts is "Bounce rate", defined by Google as

Bounce rate is the percentage of single-page visits or visits in which the person left your site from the entrance (landing) page.
The bounce rate for BioStor is pretty constant at around 65%, except for two periods in March and June, when it plummeted to around 20%. This corresponds to when I set up a Wikisource installation for BioStor so that the OCR text from BHL could be corrected. Mark Holder ran a student project that used the BioStor wiki, so I'm assuming that the drop in bounce rate reflects Mark's students spending time on the wiki. BHL OCR text would benefit from cleaning, but I'm not sure Wikisource is the way to do it as it feels a little clunky. Ideally I'd like to build upon the interactive DjVu experiments to develop a user-friendly way to edit the underlying OCR text.

Is it just my itch?
Every good work of software starts by scratching a developer's personal itch - Eric S. Raymond, The Cathedral and the Bazaar

Looking at traffic by city, Glasgow (where I'm based) is the single largest source of traffic. This is hardly surprising, given that I wrote BioStor to solve a problem I was interested in, and the bulk of its content has been added by me using various scripts. This raises the possibility that BioStor has an active user community of *cough* one. However, looking at traffic by country, the UK is prominent (due to traffic primarily from Glasgow and London), but more visits come from the US. It seems I didn't end up making this site just for me.

Google search
Another measure of success is Google search rankings, which I've used elsewhere to compare the impact of Wikipedia and EOL pages. As a quick experiment I Googled the top ten journals in BioStor and recorded where in the search results BioStor appeared. For all but the Biological Bulletin, BioStor appeared in the top ten (i.e., on the first page of results):

Journal                                                      Google rank of BioStor page
Biological Bulletin                                          12
Bulletin of Zoological Nomenclature                           6
Proceedings of the Entomological Society, Washington          6
Proc. Linn. Soc. New South Wales                              3
Annals of the Missouri Botanical Garden                       3
Tijdschr. Ent.                                                2
Transactions of The Royal Entomological Society of London     6
Ann. Mag. nat. Hist.                                          3
Notes from the Leyden Museum                                  5
Proceedings of the United States National Museum              4

This suggests that BioStor's content is at least findable.

Where next?
The sense I'm getting from these stats is that BioStor is being used, and it seems to be a reasonably successful, small-scale project. It would be nice to play with the Google Analytics output a bit more, and also explore usage patterns more closely. For example, I invested some effort in adding the ability to create PDFs for BioStor articles, but I've no stats on how many PDFs have been downloaded. Metadata in BioStor is editable, and edits are logged, but I've not explored the extent to which the content is being edited. If a serious effort is going to be made to clean up BHL content using crowdsourcing, I'll need to think of ways to engage users. The wiki experiments were a step in this direction, but I suspect that building a network around this task might prove difficult. Perhaps a better way is to build the network elsewhere, then try to engage it with this task (OCR correction). This was one reason behind my adopting Mendeley's OAuth API to provide a sign in facility for BioStor (see Mendeley connect). Again, I've no stats on the extent to which this feature of BioStor has been used. Time to give some serious thought to what else I can learn about how BioStor is being used.

Wednesday, December 15, 2010

TreeBASE, again

My views on TreeBASE are pretty well known. Lately I've been thinking a lot about how to "fix" TreeBASE, or indeed, move beyond it. I've made a couple of baby steps in this direction.

The first step is that I've created a group for TreeBASE papers on Mendeley. I've uploaded all the studies in TreeBASE as of December 13 (2010). Having these in Mendeley makes it easier to tidy up the bibliographic metadata, add missing identifiers (such as DOIs and PubMed ids), and correct citations to non-existent papers (which can occur if, at the time the authors uploaded their data, they planned to submit their paper to one journal, but it ended up being accepted by another). If you've a Mendeley account, feel free to join the group. If you've contributed to TreeBASE, you should find your papers already there.

The second step is playing with CouchDB (this year's new hotness), exploring ways to build a database of phylogenies that has nothing much to do with either a relational database or a triple store. CouchDB is a document store, and I'm playing with taking NeXML files from TreeBASE, converting them to something vaguely usable (i.e., JSON), and adding them to CouchDB. For fun, I'm using my NCBI to Wikipedia mapping to get images for taxa, so if TreeBASE has mapped a taxon to the NCBI taxonomy, and that taxon has a page in Wikipedia with an image, we get an image for that taxon. The reason for this is I'd really like a phylogeny database that was visually interesting. To give you some examples, here are trees from TreeBASE (displayed using SVG), together with thumbnails of images from Wikipedia:
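To illustrate the idea (the field names here are invented for illustration, not the actual schema), a single document might bundle the tree and the Wikipedia thumbnails together like this:

```python
import json

# Sketch of the kind of self-contained document stored in CouchDB: the tree
# (shown as Newick for brevity), the taxa with their NCBI taxonomy ids, and
# thumbnails resolved via the NCBI-to-Wikipedia mapping. All values are
# illustrative placeholders.

doc = {
    "_id": "treebase-study-S1234",   # hypothetical study identifier
    "source": "TreeBASE",
    "tree": "((Homo_sapiens,Pan_troglodytes),Gorilla_gorilla);",
    "taxa": [
        {"label": "Homo_sapiens", "ncbi": 9606,
         "wikipedia": "Homo sapiens",
         "thumbnail": "http://example.org/homo.jpg"},
        {"label": "Pan_troglodytes", "ncbi": 9598,
         "wikipedia": "Pan troglodytes",
         "thumbnail": "http://example.org/pan.jpg"},
    ],
}

# Everything needed to draw the tree plus its thumbnails travels together,
# so rendering a page is a single GET of one document.
payload = json.dumps(doc)
print(payload[:60], "...")

# Pushing it to CouchDB is then just an HTTP PUT, e.g.
#   curl -X PUT http://127.0.0.1:5984/phylo/treebase-study-S1234 -d @doc.json
```

Because the document is denormalised, there are no joins to do at display time, which is what makes the viewer trivial to build.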





Everything (tree and images) is stored within a single document in CouchDB, making the display pretty trivial to construct. Obviously this isn't a proper interface, and there's things I'd need to do, such as order the images in such a way that they matched the placement of the taxa on the tree, but at a glance you can see what the tree is about. We could then envisage making the images clickable so you could find out more about that taxon (e.g., text from Wikipedia, lists of other trees in the database, etc.).

We could expand this further by extracting geographical information (say, from the sequences included in the study) and making a map, or eventually a phylogeny in Google Earth (see David Kidd's recent "Geophylogenies and the Map of Life" manifesto, doi:10.1093/sysbio/syq043).

One of the big things missing from databases like TreeBASE is a sense of "fun", or serendipity. It's hard to find stuff, hard to discover new things, make new connections, or put things in context. And that's tragic. Try a Google image search for treebase+phylogeny:


Call me crazy, but I looked at that and thought "Wow! This phylogeny stuff is cool!" Wouldn't it be great if that's the reaction people had when they looked at a database of evolutionary trees?

Monday, December 13, 2010

How do I know if an article is Open Access?

One of my pet projects is to build a "Universal Article Reader" for the iPad (or similar mobile device), so that a reader can seamlessly move between articles from different publishers, follow up citations, and get more information on entities mentioned in those articles (e.g., species, molecules, localities, etc.). I've made various toys towards this, the latest being a HTML5 clone of Nature's iPhone app.

One impediment to this is knowing whether an article is Open Access, and if so, what representations are available (i.e., PDF, HTML, XML). Ideally, the "Universal Article Reader" would be able to look at the web page for an article, determine whether it can extract and redisplay the text (i.e., is the article Open Access) and if so, can it, for example, grab the article in XML and reformat it.

Some journals are entirely Open Access, so for these journals the first problem (is it Open Access?) is trivial, but a large number of journals have a mixed publishing model: some articles are Open Access, some aren't. One thing publishers could do that would be helpful is to specify the access status of an article in a consistent manner. Here's a quick survey of how things stand at the moment.

Journal                                           How access is indicated
PLoS One                                          Embedded RDF, e.g. <license rdf:resource="" />
Nature Communications                             <meta name="access" content="Yes" /> for open, <meta name="access" content="No" /> for closed
Systematic Biology                                <meta name="citation_access" content="all" /> for open; this tag missing if closed
BioOne                                            Nothing for the article; Open Access icon next to open access articles in table of contents
BMC Evolutionary Biology                          <meta name="dc.rights" content="" />
Philosophical Transactions of the Royal Society   <meta name="citation_access" content="all" /> for open access
Microbial Ecology                                 No metadata (links and images in HTML)
Human Genomics and Proteomics                     <meta name="dc.rights" content="" />

A bit of a mess. Some publishers embed this information in <meta> tags (which is good), some (such as PLoS) embed RDF (good, if a little more hassle), and some leave us in the dark, or give visual clues such as logos (which mean nothing to a computer). In some ways this parallels the variety of ways journals have implemented RSS feeds, which has led to some explicit Recommendations on RSS Feeds for Scholarly Publishers. Perhaps the time is right to develop equivalent recommendations for article metadata, so that apps to read the scientific literature can correctly determine whether they can display an article or not.
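As a sketch of what such a reader has to cope with today, here's a toy sniffer covering just two of the <meta> conventions in the table above (the access tag used by Nature Communications, and the citation_access tag used by Systematic Biology and Philosophical Transactions); every other publisher would need its own rule:

```python
from html.parser import HTMLParser

# Minimal illustration of per-publisher access sniffing. Only the two <meta>
# conventions named above are handled; None means "can't tell".

class AccessSniffer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_access = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name, content = a.get("name", ""), a.get("content", "")
        if name == "access":                # Nature Communications style
            self.open_access = (content == "Yes")
        elif name == "citation_access":     # Sys Biol / Phil Trans style
            self.open_access = (content == "all")

def is_open_access(html):
    sniffer = AccessSniffer()
    sniffer.feed(html)
    return sniffer.open_access

print(is_open_access('<head><meta name="access" content="Yes" /></head>'))
print(is_open_access('<head><title>no metadata at all</title></head>'))
```

The point is not that this is hard to write, but that it shouldn't be necessary: a single agreed convention would make the whole class of code above disappear.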

Thursday, December 09, 2010

Viewing scientific articles on the iPad: cloning the iPhone app using jQuery Mobile

Over the last few months I've been exploring different ways to view scientific articles on the iPad, summarised here. I've also made a few prototypes, either from scratch (such as my response to the PLoS iPad app) or using Sencha Touch (see Touching citations on the iPad).

Today, it's time for something a little different. The Sencha Touch framework I used earlier is huge and wasn't easy to get my head around. I was resigning myself to trying to get to grips with it when jQuery Mobile came along. Still in alpha, jQuery Mobile is very simple and elegant, and writing an app is basically a case of writing HTML (with a little Javascript here and there if needed). It has a few rough edges, but it's possible to create something usable very quickly. And, it's actually fun.

So, to learn a bit more about how to use it, I decided to see if I could write a "clone" of Nature's iPhone app (which I reviewed earlier). Nature's app is in many ways the most interesting iOS app for articles because it doesn't treat the article as a monolithic PDF, but rather uses the ePub format. As a result, you can view figures, tables, and references separately.

The clone
You can see the clone here.


I've tried to mimic the basic functionality of the app in terms of transitions between pages, display of figures, references, etc. In making this clone I've focussed on just the article display.

A web app is going to lack the speed and functionality of a native app, but is probably a lot faster to develop. It also works on a wider range of platforms. jQuery Mobile is committed to supporting a wide range of platforms, so this clone should work on platforms other than the iPad.

Nature's app has a lot of additional functionality apart from just displaying articles, such as listing the latest articles from journals, managing a user's bookmarks, and enabling the user to buy subscriptions. Some of this functionality would be pretty easy to add to this clone, for example by consuming RSS feeds to get article lists. With a little effort one could have a simple, Web-based app to browse Nature content across a range of mobile devices.

Technical stuff

Nature's app uses the ePub format, but Nature's web site doesn't provide an option to download articles in ePub format. However, if you use a HTTP debugging proxy (such as Charles Proxy) when using Nature's app you can see the URLs needed to fetch the ePub file.

I grabbed a couple of ePub files for articles in Nature communications and unzipped them (.epub files are zip files). The iPad app is a single HTML file that uses some Ajax calls to populate the different views. One Ajax call takes the index.html that has the article text and replaces the internal and external links with calls to Javascript functions. An article's references, figure captions, and tables are stored in separate XML files, so I have some simple PHP scripts that read the XML and extract the relevant bits. Internal links (such as to figures and references) are handled by jQuery Mobile. External links are displayed within an iFrame.
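Since .epub files are just zip archives, getting at the parts needs nothing exotic. A toy example (the internal file names here are made up; a real ePub's layout is described by its META-INF/container.xml):

```python
import io
import zipfile

# An ePub is a zip archive, so listing and extracting its members is all
# that's needed to reach the article HTML and the separate XML files for
# references, figures, and tables.

def list_epub(data):
    """Return the member file names of an ePub supplied as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        return z.namelist()

# Build a tiny stand-in "ePub" in memory to demonstrate:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("mimetype", "application/epub+zip")
    z.writestr("OEBPS/index.html", "<html><body>article text</body></html>")
    z.writestr("OEBPS/refs.xml", "<references/>")

print(list_epub(buf.getvalue()))
```

The PHP scripts mentioned above do essentially this, then parse the individual XML members to extract the bits each view needs.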

There are some intellectual property issues to address. Nature isn't an Open Access journal, but some articles in Nature Communications are (under the Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License), so I've used two of these as examples. When it displays an article, Nature's app uses Droid fonts for the article heading. These fonts are supplied as an SVG file contained within the ePub file. Droid fonts are available under an Apache License as TrueType fonts as part of the Android SDK. I couldn't find SVG versions of the fonts in the Android SDK, so I use the TrueType fonts (see Jeffrey Zeldman's Web type news: iPhone and iPad now support TrueType font embedding. This is huge.). Oh, and I "borrowed" some of the CSS from the style.css file that comes with each ePub file.

Wednesday, December 08, 2010

First thoughts on CiteBank and BHL-Europe

This week saw the release of two tools from the Biodiversity Heritage Library, CiteBank and the BHL-Europe portal. Both have actually been quietly around for a while, but were only publicly announced last week.

In developing a new tool there are several questions to ask. Does something already exist that meets my needs? If it doesn't exist, can I build it using an existing framework, or do I need to start from scratch? As a developer it's awfully tempting sometimes to build something from scratch (I'm certainly guilty of this). Sometimes a more sensible approach is to build on something that already exists, particularly if what you are building upon is well supported. This is one of the attractions of Drupal, which underlies CiteBank and Scratchpads. In my own work I've used Semantic Mediawiki to support editable, versioned databases, rather than roll my own. Perhaps the more difficult question for a developer is whether they need to build anything at all. What if there are tools already out there that, if not exactly what you want, are close enough (or most likely will be by the time you finish your own tool)?

CiteBank is an open access platform to aggregate citations for biodiversity publications and deliver access to biodiversity related articles. CiteBank aggregates links to content from digital libraries, publishers, and other bibliographic systems in order to provide a single point of access to the world’s biodiversity literature, including content created by its community of users. CiteBank is a project of the Biodiversity Heritage Library (BHL).

I have two reactions to CiteBank. Firstly, Drupal's bibliographic tools really suck, and secondly, why do we need this? As I've argued earlier (see Mendeley, BHL, and the "Bibliography of Life"), I can't see the rationale for having CiteBank separate from an existing bibliographic database such as Mendeley or Zotero. These tools are more mature, better supported, and address user needs beyond simply building lists of papers (e.g., citing papers when writing manuscripts).

For me, one of BHL's goals should be integrating the literature they have scanned into mainstream scientific literature, which means finding articles, assigning DOIs, and becoming in effect a digital publishing platform (like BioOne or JSTOR). Getting to this point will require managing and cleaning metadata for many thousands of articles and books. It seems to me that you want to gather this metadata from as many sources as possible, and expose it to as many eyes (and algorithms) as possible to help tidy it up. I think this is a clear case of it being better to use an existing tool (such as Mendeley), rather than build a new one. If a good fraction of the world's taxonomists shared their personal bibliographies on Mendeley we'd pretty much have the world's taxonomic literature in one place, without really trying.

It's early days for BHL-Europe, and they've taken the "let's use an existing framework" approach, basing the BHL-Europe portal on DISMARC, the latter being an EU-funded project to "encourage and support the interoperability of music related data".

BHL-Europe is the kind of web site only its developers could love. It's spectacularly ugly, and a classic example of what digital libraries came up with while Google was quietly eating their lunch. Here's the web site showing search results for "Zonosaurus":


Yuck! Why do these things have to be so ugly? DISMARC was designed to store metadata about digital objects, specifically music. Look at commercial music interfaces such as iTunes and Spotify, or even academic projects such as mSpace.

To be useful BHL-Europe really needs to provide an interface that reflects what its users care about, for example taxonomic names, classification, and geography. It can't treat scientific literature as a bunch of lifeless metadata objects (but then again, DISMARC managed to do this for music).

Where next?
CiteBank and BHL-Europe seem further additions to the worthy but ultimately deeply unsatisfying attempts to improve access to biodiversity literature. To date our field has failed to get to grips with aggregating metadata (outside of the library setting), creating social networks around that aggregation, and providing intuitive interfaces that enable users to search and browse productively. These are big challenges. I'd like to see the resources that we have put to better use, rather than being used to build tools where suitable alternatives already exist (CiteBank), or used to shoe horn data into generic tools that are unspeakably ugly (BHL-Europe portal) and not fit for purpose. Let's not reinvent the wheel, and let's not try and convince ourselves that squares make perfectly good wheels.

Thursday, December 02, 2010

Linking taxonomic databases to the primary literature: BHL and the Australian Faunal Directory

Continuing my hobby horse of linking taxonomic databases to digitised literature, I've been working for the last couple of weeks on linking names in the Australian Faunal Directory (AFD) to articles in the Biodiversity Heritage Library (BHL). AFD is a list of all animals known to occur in Australia, and it provides much of the data for the recently released Atlas of Living Australia. The data is available as a series of CSV files, and these contain quite detailed bibliographic references. My initial interest was in using these to populate BioStor with articles, but it seemed worthwhile to try and link the names and articles together. The Atlas of Living Australia links to BHL, but only via a name search showing BHL items that have a name string. This wastes valuable information. AFD has citations to individual books and articles that relate to the taxonomy of Australian animals — we should treat that as first class data.

So, I cobbled together the CSV files, some scripts to extract references, ran them through the BioStor and bioGUID OpenURL resolvers, and dumped the whole thing in a CouchDB database. You can see the results at Australian Faunal Directory on CouchDB.
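The lookup step amounts to turning each parsed reference into an OpenURL request. A sketch (the resolver URL and the reference are illustrative; the key-value vocabulary follows OpenURL 0.1 conventions):

```python
from urllib.parse import urlencode

# Turn a parsed reference into an OpenURL query string. The resolver
# address below is illustrative, and the reference is made up.

def openurl_query(resolver, ref):
    params = {
        "genre": "article",
        "title": ref["journal"],   # journal title
        "volume": ref["volume"],
        "spage": ref["spage"],     # starting page
        "date": ref["year"],
    }
    return resolver + "?" + urlencode(params)

ref = {
    "journal": "Records of the Australian Museum",  # example only
    "volume": "30",
    "spage": "123",
    "year": "1976",
}
print(openurl_query("http://biostor.org/openurl", ref))
```

Each response that resolves to a BioStor article gives the link between the AFD name record and the digitised page images.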


The site is modelled on my earlier experiment with putting the Catalogue of Life on CouchDB. It's still rather crude, and there's a lot of stuff I need to work on, but it should illustrate the basic idea. You can browse the taxonomic hierarchy, view alternative names for each taxon, and see a list of publications related to those names. If a publication has been found in BioStor then the site displays a thumbnail of the first page, and if you click on the reference you see a simple article viewer I wrote in Javascript.


For PDFs I'm experimenting with using Google's PDF viewer (the inspiration for the viewer above):


How it was made
Although in principle linking AFD to BHL via BioStor was fairly straightforward, there are lots of little wrinkles, such as errors in bibliographic metadata, and failure to parse some reference strings. To help address this I created a public group on Mendeley where all the references I've extracted are stored. This makes it easy to correct errors, add identifiers such as DOIs and ISSNs, and upload PDFs. For each article a reference to the original record in AFD is maintained by storing the AFD identifier (a UUID) as a keyword.

The taxonomy and the mapping to literature is stored in a CouchDB database, which makes a lot of things (such as uploading new versions of documents) a breeze.

It's about the links
The underlying motivation is that we are awash in biodiversity data and digitisation projects, but these are rarely linked together. And it's more than just linking, it's bringing the data together so that we can compute over it. That's when things will start to get interesting.

Wednesday, November 10, 2010

Mendeley mangles my references: phantom documents and the problem of duplicate references

One issue I'm running into with Mendeley is that it can create spurious documents, mangling my references in the process. This appears to be due to some over-zealous attempts to de-duplicate documents. Duplication is the number one problem faced by Mendeley, and has been discussed in some detail by Duncan Hull in his post How many unique papers are there in Mendeley?. Duncan focussed on the case where the same article may appear multiple times in Mendeley's database, which will inflate estimates of how many distinct references the database contains. It also has implications for metrics derived from Mendeley, such as those displayed by ReaderMeter.

In this post I discuss the reverse problem, combining two or more distinct references into one. I've been uploading large collections of references based on harvesting metadata for journal articles. Although the metadata isn't perfect, it's usually pretty good, and in many cases linked to Open Access content in BioStor. References that I upload appear in public groups listed on my profile, such as the group Proceedings of the Entomological Society of Washington.

Reverse engineering Mendeley
In the absence of a good description by Mendeley of how their tools work, we have to try and figure it out ourselves. If you click on a reference that has been recently added to Mendeley, you get a URL made up of three parts: 584201 is the group id, 3708087012 is the "remoteId" of the document (this is what it's called in the SQLite database that underlies the desktop client), and the rest of the URL is the article title, minus stop words.

After a while (perhaps a day or so) Mendeley gets around to trying to merge the references I've added with those it already knows about, and the URLs lose the group id and remoteId. Let's call the resulting document the "canonical document" (this document also has a UUID, which is what the Mendeley API uses to retrieve the document). Once the document gets one of these URLs Mendeley will also display how many people are "reading" it, and whether anyone has tagged it.

But that's not my paper!
The problem is that sometimes (and more often than I'd like) the canonical document bears little relation to the document I uploaded. For example, here is a paper that I uploaded to the group Proceedings of the Entomological Society of Washington:

Review of the genus Saemundssonia Timmermann (Phthiraptera: Philopteridae) from the Alcidae (Aves: Charadriiformes), including a new species and new host records by Roger D Price, Ricardo L Palma, Dale H Clayton, Proceedings of the Entomological Society of Washington, 105(4):915-924 (2003).

You can see the actual paper in BioStor. To see the paper in the Mendeley group, browse it using the tag Phthiraptera:


Note the 2, indicating that two people (including myself) have this paper in their library. But the URL for this paper resolves to a document that is not the paper I added!

What Mendeley displays for this URL is this:

Not only is this not the paper I added, there is no such paper! There is a paper entitled "A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar", but that is by Harry Brailovsky, not Clayton and Price (you can see this paper in BioStor). The BioStor link for the phantom paper displayed by Mendeley is for a third paper, "A review of ground beetle species (Coleoptera: Carabidae) of Minnesota, United States: new records and range extensions". Below are the details for my original paper, for the "canonical paper" created by Mendeley, and for the two papers that share some of the bibliographic details of this non-existent paper.

Original paper
  Title: Review of the genus Saemundssonia Timmermann (Phthiraptera: Philopteridae) from the Alcidae (Aves: Charadriiformes), including a new species and new host records
  Author(s): Roger D Price, Ricardo L Palma, Dale H Clayton

Mendeley canonical paper
  Title: A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar
  Author(s): DH Clayton, RD Price

Brailovsky's paper
  Title: A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar
  Author(s): Harry Brailovsky

Ground beetle paper
  Title: A review of ground beetle species (Coleoptera: Carabidae) of Minnesota, United States: new records and range extensions

As you can see it's a bit of a mess. Now, finding and merging duplicates is a hard problem (see doi:10.1145/1141753.1141817 for some background), but I'm struggling to see why these documents were considered to be duplicates.
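To make the mismatch concrete, here is a naive duplicate test based on token overlap (Jaccard similarity) of titles. This is not Mendeley's (unpublished) algorithm; it just illustrates that these two titles share almost nothing beyond common words:

```javascript
// Naive duplicate test: Jaccard similarity of normalized title tokens.
// Purely illustrative; Mendeley's actual de-duplication method is unpublished.
function tokens(s) {
  return new Set(s.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/).filter(Boolean));
}

function jaccard(a, b) {
  const A = tokens(a), B = tokens(b);
  let shared = 0;
  for (const t of A) if (B.has(t)) shared++;
  return shared / (A.size + B.size - shared);
}

const mine = "Review of the genus Saemundssonia Timmermann (Phthiraptera: Philopteridae) from the Alcidae (Aves: Charadriiformes), including a new species and new host records";
const canonical = "A new genus and a new species of Daladerini (Hemiptera: Heteroptera: Coreidae) from Madagascar";
// jaccard(mine, canonical) is well below any plausible merge threshold
```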

What I'd like to see
I'm a big fan of Mendeley, so I'd like to see this problem fixed. What I'd really like to see is the following:
  1. Mendeley publish a description of how their de-duplication algorithms work

  2. Mendeley describe the series of steps a document goes through as they process it (if nothing else, so that users can make sense of the multiple URLs a document may get over its lifetime in Mendeley).

  3. For each canonical reference, Mendeley show the set of documents that have been merged to create that canonical reference, and display some measure of their confidence that the match is genuine.

  4. Mendeley enables users to provide feedback on a canonical document (e.g., a button by each document in the set that enables the user to say "yes this is a match" or "no, this isn't a match").

Perhaps it would be useful if Mendeley (or the community) assembled a test collection of documents containing duplicates, together with the set of canonical documents this collection actually contains, and used this to evaluate alternative algorithms for finding duplicates. Let's make this a "challenge" with prizes! In many ways I'd be much more impressed by a duplication challenge than the DataTEL challenge, especially as it seems clear that Mendeley readership data is too sparse to generate useful recommendations (see Mendeley Data vs. Netflix Data).
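Scoring candidate algorithms against such a gold standard is straightforward; here is a minimal sketch, with duplicate pairs represented as "idA|idB" strings (the data representation is my own invention):

```javascript
// Sketch: scoring a de-duplication algorithm against a gold-standard collection.
// predicted and gold are arrays of "idA|idB" pair strings.
function score(predicted, gold) {
  const P = new Set(predicted), G = new Set(gold);
  let tp = 0;                       // true positives: predicted pairs that are real duplicates
  for (const p of P) if (G.has(p)) tp++;
  return {
    precision: P.size ? tp / P.size : 0,
    recall: G.size ? tp / G.size : 0
  };
}
// e.g. score(["1|2", "3|4"], ["1|2", "5|6"]) gives precision 0.5 and recall 0.5
```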

Monday, November 01, 2010

CouchDB and Lucene

Quick notes to self on fulltext search and CouchDB. Note that links to CouchDB are local to my machine(s), and won't work unless you are me, or have a copy of the same database running on your machine. couchdb-lucene adds fulltext indexing to CouchDB. After a few false starts I now have this working. The documentation is a little misleading: you don't need to clone the github repository, nor use Maven to build couchdb-lucene (at least, I didn't). Instead I grabbed couchdb-lucene-0.5.6, unpacked it, and used it as is.

To configure CouchDB I ended up editing the configuration using Futon (there's a link "Add a new section" down the bottom of the Configuration page), then I restarted CouchDB. The things to add are:

[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds

fti=/path/to/python /path/to/couchdb-lucene/tools/

_fti = {couch_httpd_external, handle_external_req, <<"fti">>}

To start couchdb-lucene, just cd couchdb-lucene-0.5.6 and bin/run.

Then it's a case of adding a fulltext index. In Futon I start adding a regular design document, then edit the Javascript. For example, here is a simple index on document titles:

{
  "_id": "_design/lucene",
  "_rev": "2-96b333dfc77866a13c0de7f856d27b6c",
  "language": "javascript",
  "fulltext": {
    "by_title": {
      "index": "function(doc) { var ret = new Document(); ret.add(doc.title); return ret; }"
    }
  }
}
Once the indexing has been completed, you can search the CouchDB database using a URL like this: http://localhost:5984/col2010ref/_fti/_design/lucene/by_title?q=frog+new+species.

Lots more to do here, but with spatial queries and now fulltext search, it's time to start building something...

Monday, October 25, 2010

Are names really the key to the big new biology?

David ("Paddy") Patterson, Jerry Cooper, Paul Kirk, Rich Pyle, and David Remsen have published an article in TREE entitled "Names are key to the big new biology" (doi:10.1016/j.tree.2010.09.004). The abstract states:

Those who seek answers to big, broad questions about biology, especially questions emphasizing the organism (taxonomy, evolution and ecology), will soon benefit from an emerging names-based infrastructure. It will draw on the almost universal association of organism names with biological information to index and interconnect information distributed across the Internet. The result will be a virtual data commons, expanding as further data are shared, allowing biology to become more of a ‘big science’. Informatics devices will exploit this ‘big new biology’, revitalizing comparative biology with a broad perspective to reveal previously inaccessible trends and discontinuities, so helping us to reveal unfamiliar biological truths. Here, we review the first components of this freely available, participatory and semantic Global Names Architecture.
Do we need names?

Reading this (full disclosure, I was a reviewer) I can't help wondering whether the assumption that names are key needs to be challenged. Roger Hyam has argued that we should be calling time on biological nomenclature, and I wonder whether for a generation of biologists brought up on DNA barcodes and GPS, taxonomy and names will seem horribly quaint. For a start, sequences and GPS coordinates are computable: we can stick them in computers and do useful things with them. DNA barcodes can be used to infer identity, evolutionary relationships, and dates of divergence. Taken in aggregate we can infer ecological relationships (such as diet, e.g., doi:10.1371/journal.pone.0000831), biogeographic history, gene flow, etc. While barcodes can tell us something about an organism, names don't. Even if we have the taxonomic description we can't do much with it; extracting information from taxonomic descriptions is hard.

Furthermore, formal taxonomic names don't seem terribly necessary in order to do a lot of science. Patterson et al. note that taxa may have "surrogate" names:

Surrogates include provisional names and specimen, culture or strain numbers which refer to a taxon. 'SAR-11' ('SAR' refers to the Sargasso Sea) was a surrogate name given in 1990 to an important member of the marine plankton. Only a decade later did it become known as Pelagibacter ubique.

The name Pelagibacter ubique was published in 2002 (doi:10.1038/nature00917), although as a Candidatus name (doi:10.1099/00207713-45-1-186), not a name conforming to the International Code of Nomenclature of Bacteria. I doubt the lack of a name that follows this code is hindering the study of this organism, and researchers seem happy to continue to use 'SAR11'.

So, I think that as we go forward we are going to find nomenclature struggling to establish its relevance in the age of digital biology.

If we do need them, how do we manage them?
If we grant Patterson et al. their premise that names matter (and for a lot of the legacy literature they will), then how do we manage them? In many ways the "Names are key to the big new biology" paper is really a pitch for the Global Names Architecture or GNA (and its components GNI, GNITE, and GNUB). So, we're off into alphabet soup again (sigh). The more I think about this the more I want something very simple.

All I want here is a database of name strings and tools to find them in documents. In other words, uBio.

"Documents" here are broadly defined to include articles, books, DNA sequences, specimens, etc. I want a database of [name,document] pairs (BHL has a huge one), and a database of documents.


Realistically, given the number and type of documents there will be several "document" databases, such as GenBank and GBIF. For citations Mendeley looks very promising. If we had every taxonomic publication in Mendeley, tagged with scientific names, then we'd have the bibliography of life. Taxonomic nomenclators would be essentially out of business, given that their function is to store the first publication of a name. Given a complete bibliography we just create a timeline of usage for a name and note the earliest [name,document] pair:

There are a few wrinkles to deal with. Firstly, names may have synonyms, lexical variants, etc. (the Patterson et al. paper has a nice example of this). Leaving aside lexical variants, what we want is a "view" of the [name,document] pairs that says this subset refer to the same thing (the "taxon concept").


We can obsess over details in individual cases, but at web scale only two examples spring to mind. The first is the Catalogue of Life, the second is NCBI. The Catalogue of Life lists sets of names and references that it regards as referring to the same thing, although it does unspeakable things to many of the references. In the case of NCBI the "concepts" would be the sets of DNA sequences and associated publications linked to the same taxonomy id. Whatever you think of the NCBI taxonomy, it is at least computable, in the sense that you could take a taxon and generate a list of publications "about" that taxon.
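For example, NCBI's EUtils can link a taxonomy id to publications. A sketch of building such a query (I am assuming the dbfrom=taxonomy, db=pubmed link is available in ELink; treat the exact parameters as an assumption):

```javascript
// Sketch: building an NCBI EUtils ELink query that asks for publications
// linked to a taxonomy id (e.g. 9606 = Homo sapiens).
function taxonPublicationsUrl(taxId) {
  return "" +
         "?dbfrom=taxonomy&db=pubmed&id=" + encodeURIComponent(taxId);
}
```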

So, we have names, [name,document] pairs, and sets of [name,document] pairs. Simples.

Thursday, October 21, 2010

Mendeley, BHL, and the "Bibliography of Life"

One of my hobby horses is the disservice taxonomic databases do their users by not linking to the original scientific literature. Typically, taxonomic databases either don't cite the primary literature, or regurgitate citations as cryptic text strings, leaving the user to try and find the item being referred to. With the growing number of publishers that are digitising legacy literature and issuing DOIs, together with the Biodiversity Heritage Library's (BHL) enormous archive, there's really no excuse for this.

Taxonomic databases often cite references in abbreviated forms, or refer to individual pages, rather than citable units such as articles (see my Nomenclators + digitised literature = fail post for details). One way to translate these into links to articles would be to have a tool that could find a page within an article, or could match an abbreviated citation to a full one. This task would be relatively straightforward if we had the "bibliography of life," a freely accessible bibliography of every taxonomic paper ever published. Sadly, we don't...yet.

Bibliography of life

Mendeley is rapidly building a very large bibliography (although exactly how large is a matter of some dispute, see Duncan Hull's How many unique papers are there in Mendeley?), and I'm starting to explore using it as a way to store bibliographic details on a large scale. For example, an increasing number of smaller museum or society journals are putting lists of all their published articles on the web. Typically these are HTML pages rather than bibliographic data, but with a bit of scraping we can convert them to something useful, such as RIS format, and import them into Mendeley. I've started to do this, creating Mendeley groups for individual journals, e.g.:

These lists aren't necessarily complete nor error-free, but they contain the metadata for several thousand articles. If individual societies and museums made their list of publications freely available we would make significant progress towards building a bibliography of life. And with the social networking features of Mendeley, we could have groups of collaborators clean up any errors in the metadata.
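The scraping itself varies from site to site, but emitting RIS is simple string formatting. Here is a minimal sketch, with an assumed metadata shape (the field names on the input object are my own invention; the RIS tags follow the RIS specification):

```javascript
// Sketch: turning scraped article metadata into a minimal RIS record
// suitable for import into Mendeley. Input shape is assumed.
function toRIS(article) {
  const lines = ["TY  - JOUR"];                       // journal article
  for (const au of article.authors) lines.push("AU  - " + au);
  lines.push("TI  - " + article.title);
  lines.push("JO  - " + article.journal);
  lines.push("VL  - " + article.volume);
  lines.push("SP  - " + article.startPage);
  lines.push("EP  - " + article.endPage);
  lines.push("PY  - " + article.year);
  lines.push("ER  - ");                               // end of record
  return lines.join("\n");
}
```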

Of course, this isn't the only way to do this. I suspect I'm rather atypical in building Mendeley groups containing articles from only one journal, as opposed to groups based on specific topics, and of course we could also tackle the problem by creating groups with a taxonomic focus (such as all taxonomic papers on amphibians). Furthermore, if and when more taxonomists join Mendeley and share their personal bibliographies, we will get a lot more relevant articles "for free." This is Mendeley's real strength in my opinion: it provides rich tools for users to do what they most want to do (manage their PDFs and cite them when they write papers), but off the back of that Mendeley can support larger tasks (in the same way that Flickr's ability to store geotagged photos has led to some very interesting visualisations of aggregated data).

For some of the journals I've added to Mendeley I just have bibliographic data; the actual content isn't freely available online, and in some cases isn't even digitised. But for some journals the content exists in BHL, it's "just" a matter of finding it. This is where my BioStor project comes in. For example, BHL has scanned most of the journal Spixiana. While BHL recognises individual volumes, it has no notion of articles. To find these I scraped the tables of contents on the Spixiana web site and ran them through BioStor's OpenURL resolver. If you visit the BioStor page for the journal you will see that most of the articles have been identified in BHL, although there are a few holes that will need to be filled.

These articles are listed in a Mendeley group for Spixiana, with the articles linked to BioStor wherever possible.

CiteBank and on not reinventing the wheel
If we were to use Mendeley as the primary platform for aggregating taxonomic publications, then I see this as the best way to implement "CiteBank". BHL have created CiteBank as an "an open access repository for biodiversity publications" using Drupal. Whatever one thinks of Drupal, bibliographic management is not an area where it shines. I think the taxonomic community should take a good look at Mendeley and ask themselves whether this is the platform around which they could build the bibliography of life.

Friday, October 08, 2010

Towards an interactive DjVu file viewer for the BHL

The bulk of the Biodiversity Heritage Library's content is available as DjVu files, which package together scanned page images and OCR text. Websites such as BHL or my own BioStor display page images, but there's no way to interact with the page content itself. Because it's just a bitmap image there's no obvious way to do simple things such as select and copy some text, click on some text and correct the OCR, or highlight some text as a taxonomic name or bibliographic citation. This is frustrating, and greatly limits what we can do with BHL's content.

In March I wrote a short post DjVu XML to HTML showing how to pull out and display the text boxes for a DjVu file. I've put this example, together with links to the XSLT file I use to do the transformation online at Display text boxes in a DjVu page. Here's an example, where each box (a DIV element) corresponds to a fragment of text extracted by OCR software.


The next step is to make this interactive. Inspired by Google's Javascript-based PDF viewer (see How does the Google Docs PDF viewer work?), I've revisited this problem. One thing the Google PDF viewer does nicely is enable you to select a block of text from a PDF page, in the same way that you can in a native PDF viewer such as Adobe Acrobat or Mac OS X Preview. It's quite a trick, because Google is displaying a bitmap image of the PDF page. So, can we do something similar for DjVu?

The thing I'd like to do is what is shown below: drag a "rubber band" on the page and select all the text that falls within that rectangle:


This boils down to knowing for each text box whether it is inside or outside the selection rectangle:



We could try and solve this by brute force, that is, query each text box on the page to see whether it overlaps with the selection or not, but we can make use of a data structure called an R-tree to speed things up. I stumbled across Jon-Carlos Rivera's R-Tree Library for Javascript, and so was inspired to try and implement DjVu text selection in a web browser using this technique.

The basic approach is as follows:

  1. Extract text boxes from DjVu XML file and lay these over the scanned page image.

  2. Add each text box to a R-tree index, together with the "id" attribute of the corresponding DIV on the web page, and the OCR text string from that text box.

  3. Track mouse events on the page: when the user clicks with the mouse we create a selection rectangle ("rubber band"), and as the mouse moves we query the R-tree to discover which text boxes have any portion of their extent within the selection rectangle.

  4. Text boxes in the selection have their background colour set to a semi-transparent shade of blue, so that the user can see the extent of the selected text. Boxes outside the selection are hidden.

  5. When the user releases the mouse we get the list of text boxes from the R-tree, concatenate the text corresponding to each box, and finally display the resulting selection to the user.
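Steps 2-5 boil down to a rectangle-intersection query over the text boxes. Here is a sketch with a brute-force scan standing in for the R-tree (the logic is the same; the R-tree just makes the search faster on pages with many boxes):

```javascript
// Sketch: select the OCR text whose boxes intersect the rubber-band rectangle.
// Boxes are {x, y, w, h, text}; a real implementation would query an R-tree
// here instead of scanning every box.
function intersects(a, b) {
  return a.x < b.x + b.w && b.x < a.x + a.w &&
         a.y < b.y + b.h && b.y < a.y + a.h;
}

function selectText(boxes, rubberBand) {
  return boxes
    .filter(box => intersects(box, rubberBand))
    .map(box => box.text)
    .join(" ");
}

const boxes = [
  { x: 0,  y: 0,  w: 50, h: 10, text: "Saemundssonia" },
  { x: 60, y: 0,  w: 30, h: 10, text: "lives" },
  { x: 0,  y: 40, w: 50, h: 10, text: "elsewhere" }
];
// selecting the top strip of the page picks up the first two boxes only
```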

Copying text

So far so good, but what can we do with the selected text? One obvious thing would be to copy and paste it (for example, we could select a species distribution and paste it into a text editor). Since all we've done is highlight some DIVs on a web page, how can we get the browser to realise that it has some text it can copy to the clipboard? After browsing Stack Overflow I came across this question, which gives us some clues. It's a bit of a hack, but behind the page image I've hidden a TEXTAREA element, and when the user has selected some text I populate the TEXTAREA with the corresponding text, then set the browser's selection range to that text. As a consequence, the browser's Copy command (⌘C on a Mac) will copy the text to the clipboard.


You can view the demo here. It only works in Safari and Chrome; I've not had a chance to address cross-browser compatibility. It also works on the iPad, which seems a natural device to support interactive editing and annotation of BHL text, but you need to click on the button "On iPad click here to select text" before selecting text. This is an ugly hack, so I need to give a bit more thought to how to support the iPad touch screen, while still enabling users to pan and zoom the page image.

Next steps

This is all very crude, but I think it shows what can be done. There are some obvious next steps:

  • Enable selected text to be edited so that we can correct the underlying OCR text.

  • Add tools that operate on the selected text, such as checking whether it is a taxonomic name, or, if it is a bibliographic citation, attempting to parse it and locate it online (such as David Shorthouse's reference parser).

  • Select parts of the page image itself, so that we could extract a figure or map.

  • Add "post it note" style annotations.

  • Add services that store the edits and annotations, and display annotations made by others.

Lots to do. I foresee a lot of Javascript hacking over the coming weeks.

Tuesday, October 05, 2010

Scripting Life

Not really a blog post, more a note to self. If I ever did get around to writing a book again, I think Scripting Life would be a great title.

PLoS Biodiversity Hub launches


The PLoS Biodiversity Hub has launched today. There's a PLoS blog post explaining the background to the project, as well as a summary on the Hub itself:

The vision behind the creation of PLoS Hubs is to show how open-access literature can be reused and reorganized, filtered, and assessed to enable the exchange of research, opinion, and data between community members.

PLoS Hubs: Biodiversity provides two main functions to connect researchers with relevant content. First, open-access articles on the broad theme of biodiversity are selected and imported into the Hub. In time, the content will also be enhanced so that the articles are connected with data, and we will provide features to make the articles easier for people to use. These two functions - aggregation and adding value - build on the concept of open access, which removes all the barriers to access and reuse of journal article content.

Readers of iPhylo may recall my account of one of the meetings involved in setting up this hub, in which I began to despair about the lack of readiness of biodiversity informatics to provide much of the information needed for projects such as hubs. Despite this (or perhaps because of it), I've become a member of the steering committee for the Biodiversity Hub. There's clearly a lot of interest in repurposing the content found in scientific articles, and I think we're going to see an increasing number of similar projects from the major players in science publishing, Open Access or otherwise. One of the challenges is going to be moving beyond the obvious things (such as making taxon names clickable) to enable new ways of reading, navigating, and querying the literature, and exploring ways to track the use that is made of the information in these articles. Biodiversity studies are ideally placed to explore this, as the subject is data rich and much of that data, such as specimens and DNA sequences, persists over time and hence gets reused (data citation gets very boring if the data is used just once). We also have obvious ways to enrich navigation, such as spatially and taxonomically.

For now the PLoS Biodiversity Hub is very pretty, but it's more a statement of intent than a real demonstration of what can be done. Let's hope our field gets its act together and seizes the opportunity that initiatives like the Hub represent. Publishers are desperate to differentiate themselves from their competitors by providing added value as part of the publication process, and hubs provide a real use case for all the data that biodiversity projects have accumulated over the last couple of decades.

Friday, October 01, 2010

Replicating and forking data in 2010: Catalogue of Life and CouchDB

Time (just) for a Friday folly. A couple of days ago the latest edition of the Catalogue of Life (CoL) arrived in my mailbox in the form of a DVD and booklet:

While in some ways it's wonderful that the Catalogue of Life provides a complete data dump of its contents, this strikes me as a rather old-fashioned way to distribute it. So I began to wonder how this could be done differently, and started to think of CouchDB. In particular, I began to think of being able to upload the data to a service (such as Cloudant) where the data could be stored and replicated at scale. Then I began to think about forking the data. The Catalogue of Life has some good things going for it (some 1.25 million species, and around 2 million names), and is widely used as the backbone of sites such as EOL and GBIF, but parts of it are broken. Literature citations are often incomplete or mangled, and in places it is horribly out of date.

Rather than wait for the Catalogue of Life to fix this, what if we could share the data, annotate it, correct mistakes, and add links? In particular, what if we linked the literature to records in the Biodiversity Heritage Library, so that we can finally start to connect names to the primary literature (imagine clicking on a name and being able to see the original species description). We could have something akin to GitHub, but instead of downloading and forking code, we download and fork data. CouchDB makes replicating data pretty straightforward.

So, I've started to upload some Catalogue of Life records to a CouchDB instance at Cloudant, and write a simple web site to display these records. For example, you can see the record for at

The e9fda47629c1102b9a4a00304854f820 in this URL is the UUID of the record in CouchDB, which is also the UUID embedded in the (non-functional) CoL LSIDs. This ensures the records have a unique identifier, but also one that is related to the original record. You can search for names, or browse the immediate hierarchy around a name. I hope to add more records over time as I explore this further — at the moment I've added a few lizards, wasps, and conifers while I explore how to convert the CoL records into a sensible JSON object to upload to CouchDB.

The next step is to think about this as a way to distribute data (want a copy of CoL, just point your CouchDB at the Cloudant URL and replicate it), and to think about how to build upon the basic records, editing and improving them, then thinking about how to get that information into a future version of the Catalogue.

Friday, September 24, 2010

Mendeley Connect

When I first launched BioStor (an article finding tool built on top of the Biodiversity Heritage Library) I wanted people to be able to edit metadata and add references, but also to minimise the chances that junk would get added. As a quick and dirty deterrent I used reCAPTCHA, so anybody adding a reference or editing the metadata had to pass a CAPTCHA before their edits were accepted.

While reCAPTCHA does the trick, it can be tedious for somebody editing a lot of articles to have to pass a CAPTCHA every time they edit an article. Ed Baker of the International Commission on Zoological Nomenclature (ICZN) has a project to identify all the articles in the Bulletin of Zoological Nomenclature, and has been gently bugging me to add a login feature to BioStor. I played for a while with OpenID, but it occurred to me that Mendeley might be a more sensible strategy. Mendeley's API supports OAuth, a protocol that lets you grant one application access to another without giving away any passwords. It's used by Twitter and Facebook, among others. Indeed, a growing number of sites on the web are using Twitter and/or Facebook services to let users log in, rather than writing their own code to support login, usernames, passwords, etc.

In the case of BioStor, I've added a link to sign in via Mendeley. If you click on it you get taken to a page like this:

If you're happy for BioStor to connect to Mendeley, you click on Accept and BioStor won't bug you to fill in a CAPTCHA. Once Mendeley's API matures it would be nice to add features such as the ability to add a reference in BioStor straight to your Mendeley library (this is doable now, but the Mendeley API loses some key metadata, such as page numbers).

But, thinking more broadly, Mendeley has an opportunity here to provide services similar to Facebook Connect. For example, instead of simply having buttons on web pages to bookmark papers, we could have buttons indicating how many people had added a paper to their library, and whether any of those people were in your contacts. We could extend this further and create something like Facebook's Open Graph Protocol, which supports the "Like" button. Or perhaps we could have an app that integrates with Facebook and harvests your "Likes" that are papers.

Food for thought. Meantime, I hope users like Ed will find BioStor less tedious to use now that they can log in via Mendeley.

Wednesday, September 22, 2010


@mikeal a little tedious. you can take OSM and then convert it to SHP and then ...

The tweet above inspired me to take a quick look at GeoCouch, a version of CouchDB that supports spatial queries. This is something I need if I'm going to start playing seriously with CouchDB. So it was off to Installing and working with GeoCouch, grabbing a copy of Homebrew (yet another package manager for Mac OS X), in the hope of installing GeoCouch. Things went fairly smoothly, although it took what seemed like an age to build everything. But I now have GeoCouch running. Previously I'd been running CouchDB using CouchDBX, which launches vanilla CouchDB. However, if you launch CouchDBX after starting GeoCouch from the command line, CouchDBX talks to GeoCouch.

I then grabbed shp2geocouch to try some shape files (I grabbed some shape files from the IUCN to play with). If you're on a Mac grab GISLook to get Quick Look previews of these files. Since I'm new to ruby there were a couple of gotchas, such as lacking some prerequisites (httparty and couchrest, both installed by typing gem install <name of package>), and there was the small matter of needing to add ~/.gem/ruby/1.8/bin to my path so I could find shp2geocouch (spot the ruby neophyte). The shape file didn't get processed completely, but at least I managed to get some data into GeoCouch.

So far I've been playing with the examples at, and things seem to work. At least, the basic bounding box queries work. I'm tempted to play with this some more (and get my head around GeoJSON), perhaps trying to recreate the functionality of my Elsevier Challenge entry, for which I wrote a custom key-value database that was awfully clunky.
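For reference, a GeoCouch bounding-box query is just a GET on a _spatial index. A sketch of building the URL (the path layout follows the GeoCouch examples as I understand them, so treat it as an assumption):

```javascript
// Sketch: URL for a GeoCouch bounding-box query against a _spatial index.
// bbox is [west, south, east, north] in decimal degrees.
function bboxQueryUrl(db, ddoc, index, [west, south, east, north]) {
  return `http://localhost:5984/${db}/_design/${ddoc}/_spatial/${index}` +
         `?bbox=${west},${south},${east},${north}`;
}
```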

Finding scientific articles in a large digital archive: BioStor and the Biodiversity Heritage Library

Yesterday I uploaded a manuscript to Nature Precedings that describes the inner workings of BioStor. The title is "Finding scientific articles in a large digital archive: BioStor and the Biodiversity Heritage Library", and you can grab it here: hdl:10101/npre.2010.4928.1.

Manuscripts describing databases are usually pretty turgid affairs, and this one is no exception, despite my attempts to spice it up with the tale of Leviathan, oops, Livyatan (see doi:10.1038/nature09381 and Wikipedia). Plus, I can't escape the thought that BioStor would have been a lot more fun to write if I'd used a key-value database like CouchDB. I fear this is often the way of things: by the time it comes to writing something up, you realise that if you could start over you'd do it rather differently.

Monday, September 13, 2010

BHL and the iPad

@elyw I'd leave bookmarking to 3rd party, e.g. Mendeley. #bhlib specific issues incl. displaying DjVu files, and highlighting taxon names

Quick mock-up of a possible BHL iPad app (made using OmniGraffle), showing a paper from BioStor. The idea is to display a scanned page at a time, with the taxonomic names on the page being clickable (for example, the user might get a list of other BHL content for a name). To enable quick navigation, all the pages in the document being viewed are displayed in a scrollable gallery below the main page.


Key to making this happen is being able to display DjVu files in a sensible way, maybe building on DjVu XML to HTML. Because BHL content is scanned, it makes sense to treat content as pages. We could extract OCR text and display that as a continuous block of text, but the OCR is sometimes pretty poor, and we'd also have to parse the text and interpret its structure (e.g., this is the title, these are section headings, etc.), and that's going to be hard work.

Friday, September 10, 2010

Touching citations on the iPad

Quick demo of the mockup I alluded to in the previous post. Here's a screen shot of the article "PhyloExplorer: a web server to validate, explore and query phylogenetic trees" (doi:10.1186/1471-2148-9-108) as displayed as a web-app on the iPad. You can view this at (you don't need an iPad, although it does work rather better on one).

I've taken the XML for the article, and redisplayed it as HTML, with (most) of the citations highlighted in blue. If you touch one (or click on it if you're using a desktop browser) then you'll see a popover with some basic bibliographic details. For some papers which are Open Access I've extracted thumbnails of the figures, such as for "PhyloFinder: an intelligent search engine for phylogenetic tree databases" (doi:10.1186/1471-2148-8-90), shown above (and in more detail below):

The idea is to give the reader a sense of what the paper is about, beyond what can be gleaned from just the title and authors. The idea was inspired by the BioText search engine from Marti Hearst's group, as well as Elsevier's "graphical abstract" noted by Alex Wild (@Myrmecos).

Here's a quick screencast showing it "live":

The next step is to enable the reader to then go and read this paper within the iPad web-app (doh!), which is fairly trivial to do, but it's Friday and I'm already late...

CouchDB, Mendeley, and what I really want in an iPad article viewer

Playing with @couchdb, starting to think of the Mendeley API as a read/write JSON store, and having a reader app built on that...

It's slowly dawning on me that many of the ingredients for a different way to browse scientific articles may already be in place. After my first crude efforts at what an iPad reader might look like I've started afresh with a new attempt, based on the Sencha Touch framework. The goal here isn't to make a polished app, but rather to get a sense of what could be done.

The first goal is to be able to browse the literature as if it was a connected series of documents (which is what, of course, it is). This requires taking the full text of an article, extracting the citations, and making them links to further documents (also with their citations extracted, and so on). Leaving aside the obvious problem that this approach is limited to open access articles, an app that does this is going to have to store a lot of bibliographic data as the reader browses the literature (otherwise we're going to have to do all the processing on the fly, and that's not going to be fast enough). So, we need some storage.
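The citation-extraction step can be surprisingly simple if all you want is DOIs. Here's a hedged sketch: the regex is a common approximation of DOI syntax, not a complete grammar, and the reference strings are just the two papers mentioned above.

```python
import re

# Approximate DOI pattern: prefix "10." + registrant code + "/" + suffix.
# Good enough for crawling; a production matcher needs more care.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

def extract_dois(text):
    """Pull candidate DOIs out of an article's reference list, so each can
    become a link to a further document in the browsing graph."""
    return DOI_PATTERN.findall(text)

refs = """
1. Ranwez et al. PhyloExplorer... doi:10.1186/1471-2148-9-108
2. Chen et al. PhyloFinder... doi:10.1186/1471-2148-8-90
"""
print(extract_dois(refs))
```

Each extracted DOI is then a key we can use to fetch metadata, and a node to crawl next.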

One option is to build a MySQL database to hold articles, books, etc. Doable (I've done more of these than I care to remember), but things get messy pretty quickly, especially as you add functionality (tags, fulltext, figures, etc.).

Another option is to use RDF and a triple store. I've played with linked data quite a bit lately (see previous "Friday follies" here and here), and I thought that a triple store would be a great way to support an article browser (especially as we add additional kinds of data, such as sequences, specimens, phylogenies, etc.). But linked data is a mess. For the things I care about there are either no canonical identifiers, or too many, and rarely does the primary data provider serve linked-data-compliant URLs (e.g., NCBI), hence we end up with a plethora of wrappers around these sources. Then there's the issue of what vocabularies to use (once again, there are either none, or too many). As a query language SPARQL isn't great, and don't even get me started on the issue of editing data. OK, so I get the whole idea of linked data, it's just that the overhead of getting anything done seems too high. You've got to get a lot of ducks to line up.

So, I started playing with CouchDB, in a fairly idle way. I'd had a look before, but didn't really get my head around the very different way of querying a database that CouchDB requires. Despite this learning curve, CouchDB has some great features. It stores documents in JSON, which makes it trivial to add data as objects (instead of mucking around with breaking them up into tables for SQL, or atomising them into triples for RDF), it supports versioning right out of the box (vital because metadata is often wrong and needs to be tidied up), and you talk to it using HTTP, which means no middleware to get in the way. You just point your browser (or curl, or whatever HTTP tool you have) and send GET, POST, PUT, or DELETE commands. And now it's in the cloud.
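To show just how thin the interface is, here's a sketch of the HTTP requests CouchDB expects, built as plain (method, URL, body) tuples you could hand to any HTTP client. The URL shapes and the _rev rule are CouchDB's actual document API; the "articles" database, the document contents, and the local server address are made up for illustration.

```python
import json

COUCH = "http://localhost:5984"  # assumed local CouchDB instance

def create_doc(db, docid, doc):
    # PUT with no _rev creates the document; CouchDB replies with its first revision.
    return ("PUT", "%s/%s/%s" % (COUCH, db, docid), json.dumps(doc))

def update_doc(db, docid, doc, rev):
    # Updates must quote the current revision, or CouchDB returns 409 Conflict.
    return ("PUT", "%s/%s/%s" % (COUCH, db, docid),
            json.dumps(dict(doc, _rev=rev)))

def get_doc(db, docid):
    return ("GET", "%s/%s/%s" % (COUCH, db, docid), None)

# A DOI (URL-escaped) makes a natural document id for an article.
method, url, body = create_doc("articles", "10.1186%2F1471-2148-9-108",
                               {"title": "PhyloExplorer", "year": 2009})
print(method, url)
```

That's the whole "middleware": four HTTP verbs and some JSON, which is exactly why a metadata-crawling reader app could sit directly on top of it.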

In some ways ending up with CouchDB (or something similar) seems inevitable. The one "semantic web" tool that I've made most use of is Semantic MediaWiki, which powers the NCBI to Wikipedia mapping I created in June. Semantic MediaWiki has its uses, but occasionally it has driven me to distraction. But, when you get down to it, Semantic MediaWiki is really just a versioned document store (where the documents are typically key-value pairs), over which have been laid a pretty limited query language and some RDF export features. Put like this, most of the huge MediaWiki engine underlying Semantic MediaWiki isn't needed, so why not cut to the chase and use a purpose-built versioned document store? Enter CouchDB.

Browsing and Mendeley
So, what I have in mind is a browser that crawls a document, extracting citations, and enabling the reader to explore those. Eventually it will also extract all the other chocolatey goodness in an article (sequences, specimens, taxonomic names, etc.), but for now I'm focussing on articles and citations. A browser would need to store article metadata (say, each time it encounters an article for the first time), as well as update existing metadata (by adding missing DOIs, PubMed ids, citations, etc.), so what easier way than as JSON in a document store such as CouchDB? This is what I'm exploring at the moment, but let's take a step back for a second.
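The update behaviour I want is really just an "upsert": first sighting of an article stores it, later sightings fill in missing fields without clobbering what's already there. Here's a minimal in-memory stand-in for that logic (a plain dict playing the role of the document store; the PubMed id is a made-up placeholder, the DOI is PhyloExplorer's from above):

```python
# In-memory stand-in for the browser's document store.
store = {}

def upsert_article(key, metadata):
    """Store article metadata on first sight; on later sightings, fill in
    fields that are missing or empty (DOIs, PubMed ids, citation lists)
    without overwriting values we already have."""
    doc = store.setdefault(key, {})
    for field, value in metadata.items():
        if not doc.get(field):
            doc[field] = value
    return doc

# First encounter: title known, DOI not yet resolved.
upsert_article("phyloexplorer", {"title": "PhyloExplorer", "doi": None})
# Later encounter supplies the DOI and a (hypothetical) PubMed id.
doc = upsert_article("phyloexplorer", {"doi": "10.1186/1471-2148-9-108",
                                       "pmid": "12345678"})  # pmid is illustrative
print(doc["doi"])
```

Swap the dict for CouchDB (GET the document, merge, PUT it back with its _rev) and you have the core of the crawler's storage layer.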

The Mendeley API, as poorly developed as it is, could be treated as essentially a wrapper around a JSON document store (the API stores and returns JSON), and speaks HTTP. So, we could imagine a browser that crawls the Mendeley database, adding papers that aren't in Mendeley as it goes. The act of browsing and reading would actively contribute to the database. Of course, we could spin this around, and argue that a crawler + CouchDB could pretty effectively create a clone of Mendeley's database (albeit without the social networking features that come with having a large user community).

This is another reason why the current crop of iPad article viewers, Mendeley's included, are so disappointing. There's the potential to completely change the way we interact with the scientific literature (instead of passively consuming PDFs), and Mendeley is ideally positioned to support this. Yes, I realise that for the vast majority of people being able to manage their PDFs and format bibliographies in MS Word are the killer features, but, seriously, is that all we aspire to?