Friday, November 10, 2017

Exploring images in the Biodiversity Literature Repository

A post on the Plazi blog, Expanded access to images in the Biodiversity Literature Repository, has prompted me to write up a little toy I created earlier this week.

The Biodiversity Literature Repository (BLR) is a repository of taxonomic papers hosted by Zenodo. Where possible Plazi have extracted individual images and added those to the BLR, even if the article itself is not open access. The justification for being able to do this is presented here: DOI:10.1101/087015. I'm not entirely convinced by their argument (see Copyright and the Use of Images as Biodiversity Data), but rather than rehash that argument I decided it would be much more fun to get a sense of what is in the BLR. I built a tool to scrape data from Zenodo and store it in CouchDB, put a simple search engine on top (using the search functionality in Cloudant) to search within the figure captions, and wrote some code to use a cloud-based image server to generate thumbnails for the images in Zenodo (some of which are quite big). The tool is hosted on Heroku; you can try it out here: https://zenodo-blr-interface.herokuapp.com/.
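If you want to poke at the same data yourself, here is a minimal sketch of querying Zenodo's REST API for BLR records. I'm assuming that the BLR community identifier on Zenodo is "biosyslit" and that the response follows Zenodo's usual hits/files structure; the details may differ:

```typescript
// Sketch: search Biodiversity Literature Repository records via the Zenodo REST API.
// Assumes the BLR community slug is "biosyslit"; the exact response shape may vary.
async function searchBLR(query: string, size = 20): Promise<void> {
  const params = new URLSearchParams({
    communities: "biosyslit", // the BLR community on Zenodo (assumed slug)
    q: query,                 // free-text search, e.g. a museum code such as "NHMUK"
    size: String(size),
  });
  const response = await fetch(`https://zenodo.org/api/records/?${params}`);
  const data = await response.json();
  for (const hit of data.hits?.hits ?? []) {
    console.log(hit.doi, hit.metadata?.title);
    // Any image files should be listed under the record's files; a separate
    // image server could be pointed at these to generate thumbnails.
    for (const file of hit.files ?? []) {
      console.log("  ", file.key ?? file.filename, file.links?.self);
    }
  }
}

searchBLR("NHMUK").catch(console.error);
```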

[Screenshot of the search interface]

This is not going to win any design awards; I'm simply trying to get a feel for what imagery BLR has. My initial reaction was "wow!". There's a rich range of images, including phylogenies, type specimens, habitats, and more. Searching by museum codes, e.g. NHMUK, is a quick way to discover images of specimens from various collections.

[Screenshot of search results]

Based on this experiment there are at least two things I think would be fun to do.

Adding more images

BLR already has a lot of images, but the biodiversity literature is huge, and there's a wealth of imagery elsewhere, including journals not in BLR, and of course the Biodiversity Heritage Library (BHL). Extracting images from articles in BHL would potentially add a vast number of additional images.

Machine learning

Machine learning is hot right now, and anyone using iNaturalist is probably aware of their use of computer vision to suggest identifications for images you upload. It would be fascinating to apply machine learning to images in the BLR. Even basic things, such as determining whether an image is a photograph or a drawing, how many specimens it includes, what the orientation of the specimen is, which part of the organism is shown, or whether the image is a map (and, if so, of which country), would be useful. There's huge scope here for doing something interesting with these images.
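As a very rough illustration of how one might start experimenting, here's a sketch that runs an off-the-shelf ImageNet classifier (MobileNet via TensorFlow.js) over a downloaded figure. This is only a crude first pass at tagging, not the photo-versus-drawing or specimen-counting classifiers described above, which would need their own training data:

```typescript
import { readFileSync } from "fs";
import * as tf from "@tensorflow/tfjs-node";
import * as mobilenet from "@tensorflow-models/mobilenet";

// Classify a single downloaded BLR figure with a pretrained MobileNet model.
// Generic ImageNet labels are only a starting point for tagging these images.
async function classifyFigure(path: string): Promise<void> {
  const image = tf.node.decodeImage(readFileSync(path), 3) as tf.Tensor3D;
  const model = await mobilenet.load();
  const predictions = await model.classify(image);
  console.log(path, predictions); // e.g. [{ className, probability }, ...]
  image.dispose();
}

classifyFigure("figure.jpg").catch(console.error);
```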

The toy I created is very basic, and merely scratches the surface of what could be done (Plazi have also created their own tool, see http://github.com/punkish/zenodeo). But spending a few minutes browsing the images is well worthwhile, and if nothing else is a reminder of both how diverse life is, and how active taxonomists are in trying to discover and describe that diversity.

Friday, October 06, 2017

Notes on finding georeferenced sequences in GenBank

Notes on how many georeferenced DNA sequences there are in GenBank, and how many could potentially be georeferenced. For reference, GenBank's sequence divisions are:

BCT	Bacterial sequences
PRI	Primate sequences
ROD	Rodent sequences
MAM	Other mammalian sequences
VRT	Other vertebrate sequences
INV	Invertebrate sequences
PLN	Plant and Fungal sequences
VRL	Viral sequences
PHG	Phage sequences
RNA	Structural RNA sequences
SYN	Synthetic and chimeric sequences
UNA	Unannotated sequences

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
?db=nucleotide (search the nucleotide database)
&term=ddbj embl genbank with limits[filt] (start of the search term)
NOT transcriptome[All Fields] (ignore transcriptome data)
NOT mRNA[filt] (ignore mRNA data)
NOT TSA[All Fields] (ignore TSA, i.e. transcriptome shotgun assemblies)
NOT scaffold[All Fields] (ignore scaffolds)
AND src lat lon[prop] (include only records that have the source feature "lat_lon")
AND 2010/01/01:2010/12/31[pdat] (restrict to this publication date range)
AND gbdiv_pri[PROP] (restrict the search to the PRI division, i.e. primates)
AND srcdb_genbank[PROP] (needed if we query by division, see NBK49540)
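As a sketch of one way to run this query programmatically (assuming the E-utilities JSON output format; the search term simply follows the notes above), something like the following should return the count of georeferenced sequences for a given division and year:

```typescript
// Sketch: count nucleotide sequences with a lat_lon source feature for a given
// GenBank division and publication year, via NCBI E-utilities (esearch).
async function countGeoreferenced(division: string, year: number): Promise<number> {
  const term = [
    "ddbj embl genbank with limits[filt]",
    "NOT transcriptome[All Fields]",          // ignore transcriptome data
    "NOT mRNA[filt]",                         // ignore mRNA data
    "NOT TSA[All Fields]",                    // ignore TSA
    "NOT scaffold[All Fields]",               // ignore scaffolds
    "AND src lat lon[prop]",                  // records with the "lat_lon" source feature
    `AND ${year}/01/01:${year}/12/31[pdat]`,  // publication date range
    `AND gbdiv_${division}[PROP]`,            // e.g. gbdiv_pri for primates
    "AND srcdb_genbank[PROP]",                // needed when querying by division
  ].join(" ");

  const params = new URLSearchParams({ db: "nucleotide", term, rettype: "count", retmode: "json" });
  const url = `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?${params}`;
  const data = await (await fetch(url)).json();
  return Number(data.esearchresult.count);
}

countGeoreferenced("pri", 2010).then(count => console.log(count)).catch(console.error);
```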

Numbers of nucleotide sequences in GenBank that have latitudes and longitudes, for each year.

DatePRIRODMAMVRTINVPLN
2010/01/01412725529551926927174
2011/01/013711204816017657784947968
2012/01/01658034214216968406027314
2013/01/01297349761107647041123435
2014/01/011529044761145986807614018
2015/01/0117452719831784336353835501
2016/01/0158261512631489875789322813
2017/01/0193817581017107127506628180

Numbers of nucleotide sequences in GenBank that don't have latitudes and longitudes, but do have the country field and hence could be georeferenced, for each year.

DatePRIRODMAMVRTINVPLN
2010/01/01666026545534326666257756692
2011/01/01399832666210337177401598664
2012/01/015377559072835533286945103379
2013/01/011092848058013663736971995817
2014/01/019727349267515991377816135372
2015/01/0189226774139646057885867167337
2016/01/0164303384108606223895711145111
2017/01/0111474352049124115991219109747

Wednesday, October 04, 2017

TDWG 2017: thoughts on day 3

Day three of TDWG 2017 highlighted some of the key obstacles facing biodiversity informatics.

After a fun series of "wild ideas" (nobody will easily forget David Bloom's "Kill your Darwin Core darlings") we had a wonderful keynote by Javier de la Torre (@jatorre) entitled "Everything happens somewhere, multiple times". Javier is CEO and founder of Carto, which provides tools for amazing geographic visualisations. Javier provided some pithy observations on standards, particularly the fate of official versus unofficial "community" standards (the community standards tend to be simpler, easier to use, and hence win out), and the potentially stifling effects standards can have on innovation, especially if conforming to standards becomes the goal rather than merely a feature.

The session Using Big Data Techniques to Cross Dataset Boundaries - Integration and Analysis of Multiple Datasets demonstrated the great range of things people want to do with data, but made little progress on integration. It still strikes me as bizarre that we haven't made much progress on minting and reusing identifiers for the same entities that we keep referring to. Channeling Steve Ballmer:

Identifiers, identifiers, identifiers, identifiers

It's also striking to compare Javier de la Torre's work with Carto where there is a clear customer-driven focus (we need these tools to deliver this to users so that they can do what they want to do) versus the much less focussed approach of our community. Many of the things we aspire to won't happen until we identify some clear benefits for actual users. There's a tendency to build stuff for our own purposes (e.g., pretty much everything I do) or build stuff that we think people might/should want, but very little building stuff that people actually need.

TDWG also has something of an institutional memory problem. Franck Michel gave an elegant talk entitled A Reference Thesaurus for Biodiversity on the Web of Linked Data which discussed how the Muséum national d'Histoire naturelle's taxonomic database could be modelled in RDF (see for example http://taxref.mnhn.fr/lod/taxon/60878/10.0). There's a more detailed description of this work here:

[Embedded PDF with a more detailed description of this work]

What struck me was how similar this was to the now deprecated TDWG LSID vocabulary, still used by most of the major taxonomic name databases (the nomenclators). This is an instance where TDWG had a nice, workable solution that lapsed into oblivion, only to be subsequently reinvented. This isn't to take anything away from Franck's work, which has a thorough discussion of the issues, and has a nice way to handle the difference between asserting that two taxa are the same (owl:equivalentClass) and asserting that a taxon/name hybrid (which is what many databases serve up, because they don't distinguish between names and taxa) and a taxon might be the same (by linking them via the name they both share).

The fate of the RDF served by the nomenclators for the last decade illustrates a point I keep returning to (see also EOL Traitbank JSON-LD is broken). We tend to generate data and standards because it's the right thing to do, rather than because there's actually a demonstrable need for that data and those standards.

Bitcoin, biodiversity, and micropayments for open data

I gave a "wild ideas" talk at TDWG17 suggesting that the biodiversity community use Bitcoin to make micropayments to use data.

The argument runs like this:

  1. We like open data because it's free and it makes it easy to innovate, but we struggle to (a) fund it and (b) demonstrate its value (hence pleas for credit/attribution, and begging for funding).
  2. The alternative of closed data, such as paying a subscription to access a database limits access and hence use and innovation, but generates an income to support the database, and the value of the database is easy to measure (it's how much money it generates).
  3. What if we have a "third model" where we pay small amounts of money to access data (micropayments)?

Micropayments as a way to pay creators is an old idea (it was part of Ted Nelson's Xanadu vision). Now that we have cryptocurrencies such as Bitcoin, micropayments are feasible. So we could imagine something like this:

  1. Access to raw datasets is free (you get what you pay for)
  2. Access to cleaned data comes at a cost (you are paying someone else to do the hard, tedious work of making the data usable)
  3. Micropayments are made using Bitcoin
  4. To help generate funds any spare computational capacity in the biodiversity community is used to mine Bitcoins

After the talk Dmitry Mozzherin sent me a link to Steem, and then this article about Steemit appeared in my Twitter stream.

Clearly this is an idea that has been bubbling around for a while. I think there is scope for thinking about ways to combine a degree of openness (we don't want to cripple access and innovation) with a way to fund that openness (nobody seems interested in giving us money to be open).

Tuesday, October 03, 2017

TDWG 2017: thoughts on day 1

Some random notes on the first day of TDWG 2017. First off, great organisation with the first usable conference calendar app that I've seen (https://tdwg2017.sched.com).

I gave the day's keynote address in the morning (slides below).

It was something of a stream-of-consciousness brain dump, and I tried to cover a lot of (maybe too much) stuff. Among the topics I covered were Holly Bik's appeal for better links between genomic and taxonomic data, my iSpecies tool, some snarky comments on the Semantic Web (and an assertion that GenBank's success was due more to network effects than to journals requiring authors to submit sequences there), a brief discussion of Wikidata (including using d3sparql to display classifications, see here), and the use of Hexastore to query data from BBC Wildlife. I also talked about Ted Nelson, Xanadu, using hypothes.is to annotate scientific papers (see Aggregating annotations on the scientific literature: a followup on the ReCon16 hackday), social factors in building knowledge graphs (touching on ORCID and some of the work by Nico Franz discussed here), and ended with some cautionary comments on the potential misuse of metrics based on knowledge graphs (using "league tables" of cited specimens, see GBIF specimens in BioStor: who are the top ten museums with citable specimens?).

TDWG is a great opportunity to find out what is going on in biodiversity informatics, and also to get a sense of where the problems are. For example, sitting through the Financial Models for Sustaining Biodiversity Informatics Products session you couldn't help being struck by (a) the number of different projects all essentially managing specimen data, and (b) the struggle they all face to obtain funding. If this was a commercial market there would be some pretty drastic consolidation happening. It also highlights the difficulty of providing services to a community that doesn't have much money.

I was also struck by Andrew Bentley's talk Interoperability, Attribution, and Value in the Web of Natural History Museum Data. In a series of slides Andrew outlined what he felt collections needed from aggregators, researchers, and publishers.

Chatting to Andrew at the evening event at the Canadian Museum of Nature, I think there's a lot of potential for developing tools to provide collections with data on the use and impact of their collections. Text mining the biodiversity literature on a massive scale to extract (a) mentions of collections (e.g., their institutional acronyms) and (b) citations of specimens could generate metrics that would be helpful to collections. There's a great opportunity here for BHL to generate immediate value for natural history collections (many of which are also contributors to BHL).
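As a very crude illustration of the kind of text mining involved, here is a toy extractor that pulls institution acronym plus catalogue number patterns out of text. The acronym list and the regular expression are purely illustrative; real specimen citations are far messier than this:

```typescript
// Toy sketch: extract candidate specimen citations (institution acronym followed
// by a catalogue number) from a block of text. Illustrative only.
const acronyms = ["NHMUK", "AMNH", "MNHN", "USNM"]; // example institution codes
const pattern = new RegExp(`\\b(${acronyms.join("|")})[\\s.:-]*(\\d[\\d./-]*)`, "g");

function extractSpecimenCodes(text: string): Array<{ institution: string; code: string }> {
  const found: Array<{ institution: string; code: string }> = [];
  for (const match of text.matchAll(pattern)) {
    found.push({ institution: match[1], code: match[2] });
  }
  return found;
}

// e.g. [{ institution: "NHMUK", code: "2014.123" }, { institution: "AMNH", code: "56789" }]
console.log(extractSpecimenCodes("Holotype NHMUK 2014.123, paratype AMNH 56789"));
```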

Also had a chance to talk to Jorrit Poelen who works on Global Biotic Interactions (GloBI). He made some interesting comparisons between Hexastores (which I'd touched on in my keynote) and Linked Data Fragments.

The final session I attended was Towards robust interoperability in multi-omic approaches to biodiversity monitoring. The overwhelming impression was that there is a huge amount of genomic data, much of which does not easily fit into the classic, Linnean view of the world that characterises, say, GBIF. For most of the sequences we don't know what they are, and that might not be the most interesting question anyway (more interesting might be "what do they do?"). The extent to which these data can be shoehorned into GBIF is not clear to me, although doing so may result in some healthy rethinking of the scope of GBIF itself.

Monday, September 18, 2017

Guest post: Our taxonomy is not your taxonomy

The following is a guest post by Bob Mesibov.

Do you know the party game "Telephone", also known as "Chinese Whispers"? The first player whispers a message in the ear of the next player, who passes the message in the same way to a third player, and so on. When the last player has heard the whispered message, the starting and finishing versions of the message are spoken out loud. The two versions are rarely the same. Information is usually lost, added or modified as the message is passed from player to player, and the changes are often pretty funny.

I recently compared ca 100 000 beetle records as they appear in the Museums Victoria (NMV) database and in Darwin Core downloads from the Atlas of Living Australia (ALA) and the Global Biodiversity Information Facility (GBIF). NMV has its records aggregated by ALA, and ALA passes its records to GBIF. The "Telephone" effect in the NMV to ALA to GBIF comparison was large and not particularly funny.

Many of the data changes occur in beetle names. ALA checks the NMV-supplied names against a look-up table called the National Species List, which in this case derives from the Australian Faunal Directory (AFD). If no match is found, ALA generalises the record to the next higher supplied taxon, which it also checks against the AFD. ALA also replaces supplied names if they are synonyms of an accepted name in the AFD.

GBIF does the same in turn with the names it gets from ALA. I'm not 100% sure what GBIF uses as a beetle look-up table or tables, but in many other cases their GBIF Backbone Taxonomy mirrors the Catalogue of Life.

To give you some idea of the magnitude of the changes, of ca 85000 NMV records supplied with a genus+species combination, about one in five finished up in GBIF with a different combination. The "taxonRank" changes are summarised in the overview below, and note that replacement ALA and GBIF taxon names at the same rank are often different:

[Summary of the "taxonRank" changes]

Of the species that escaped generalisation to a higher taxon, there are 42 names with genus triples: three different genus names for the same taxon in NMV, ALA and GBIF.

Just one example: a paratype of the staphylinid Schaufussia mona Wilson, 1926 is held in NMV. The record is listed under Rytus howittii (King, 1866) in the ALA Darwin Core download, because AFD lists Schaufussia mona as a junior subjective synonym of Tyrus howitti King, 1866, and Tyrus howittii in AFD is in turn listed as a synonym of Rytus howittii (King, 1866). The record appears in GBIF under Tyraphus howitti (King, 1865), with Rytus howittii (King, 1866) listed as a synonym. In AFD, Rytus howittii is in the tribe Tyrini, while Tyraphus howitti is a different species in the tribe Pselaphini.

ALA gives "typeStatus" as "paratype" for this record, but the specimen is not a paratype of Rytus howittii. In the GBIF download, the "typeStatus" field is blank for all records. I understand this may change in future. If it does, I hope the specimen doesn't become a paratype of Tyraphus howitti through copying from ALA.

There are lots of "Telephone" changes in non-taxonomic fields as well, including some geographical howlers. ALA says that a Kakadu National Park record is from Zambia and another Northern Territory record is from Mozambique, because ALA trusts the incorrect longitude provided by NMV more than it does the NMV-supplied locality text. GBIF blanks this locality text field, leaving the GBIF user with two African records for Australian specimens and no internal contradictions.

ALA trusts latitude/longitude to the extent of changing the "stateProvince" field for localities near Australian State borders, if a low-precision latitude/longitude places the occurrence a short distance away in an adjoining State.

Manglings are particularly numerous in the "recordedBy" field, where name strings are reformatted, not always successfully. Complex NMV strings suffer worst, e.g. "C Oke; Charles John Gabriel" in NMV becomes "Oke, C.|null" in ALA, and "Ms Deb Malseed - Winda-Mara Aboriginal Corporation WMAC; Ms Simone Sailor - Winda-Mara Aboriginal Corporation WMAC" is reformatted in ALA as "null|null|null|null".

Most of the "Telephone" effect in the NMV-ALA-GBIF comparison appears in the NMV-ALA stage. I contacted ALA by email and posted some of the issues on the ALA GitHub site; I haven't had a response and the issues are still open. I also contacted Tim Robertson at GBIF, who tells me that GBIF is working on the ALA-GBIF stage.

Can you get data as originally supplied by NMV to ALA, through ALA? Well, that's easy enough record-by-record on the ALA website, but not so easy (or not possible) for a multi-record download. Same with GBIF, but in this case the "original" data are the ALA versions.

Monday, August 28, 2017

Let’s rise up to unite taxonomy and technology

Holly Bik (@hollybik) has an opinion piece in PLoS Biology entitled "Let’s rise up to unite taxonomy and technology" https://doi.org/10.1371/journal.pbio.2002231 (thanks to @sjurdur for bringing this to my attention).

[Figure from Bik's article (journal.pbio.2002231.g001)]

It's a passionate plea for integrating taxonomic knowledge and "omics" data. In her article Bik includes a mockup of the kind of tool she'd like to see (based in part on Phinch), and writes:

Step 2: Clicking on a specific data point (e.g., an OTU) will pull up any online information associated with that species ID or taxonomic group, such as Wikipedia entries, photos, DNA sequences, peer-reviewed articles, and geolocated species observations displayed on a map.

This sort of plea has been made many times, and reminds me very much of PLoS's own efforts when they wanted to build a "Biodiversity Hub" and biodiversity informatics basically failed them. The hub itself later closed down. There's clearly a need for a simple way to summarise what we know about a species, but we've yet to really tackle this (on the face of it) fairly simple task.

Quickly summarising the available information about a species was the motivation behind my little tool iSpecies, which I recently reworked to use DBpedia, GBIF, CrossRef, EOL, TreeBASE and OpenTreeofLife as sources. For the nematode featured in Bik's figure (Desmoscolex) there's not a great deal of easily available information (see http://ispecies.org/?q=Desmoscolex). We can get a little more from other sources not queried by iSpecies, such as BioNames, which aggregates the primary taxonomic literature, see http://bionames.org/search/Desmoscolex.

Part of the problem is that taxonomy is fundamentally a "long tail" field, both in terms of the subject matter (a few very well known species, then millions of poorly known species) and our knowledge of those species (a large, scattered taxonomic literature, much of it not yet digitised, although progress is being made). Furthermore, the names of species (and our conception of them) can change, adding an additional challenge.

But I think we can do a lot better. Simple web-based tools like iSpecies can assemble reasonable information from multiple sources (and in multiple languages) on the fly. It would be nice to expand those sources (the more primary sources the better). The current iSpecies tool searches on species name. This works well if the sources being queried mention that name (e.g., in the title of a paper that has a DOI and is indexed by CrossRef). Given that many of the "omics" datasets Bik works with are likely to have dark taxa, what we'll also need is the ability to search using, say, NCBI taxon ids, and retrieve literature linked to sequences for those taxa.

It would also be useful to package those up in a simple API that other tools could consume. For example, if I wanted to improve the utility of iSpecies, one approach would be to package up the results in a JSON object. Perhaps even use JSON-LD (with global identifiers for taxa, documents, etc.) to make it possible for consumers to easily integrate that data with their own.
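To make the JSON-LD idea a little more concrete, here is a rough sketch of what such an object might look like. The vocabulary is borrowed loosely from schema.org, and every identifier below is a placeholder rather than a real value:

```typescript
// Illustrative only: one possible JSON-LD packaging of iSpecies-style results.
// The vocabulary and all identifiers below are placeholders, not an agreed schema.
const result = {
  "@context": "https://schema.org",
  "@type": "Taxon",
  name: "Desmoscolex",
  // Global identifiers for the taxon (hypothetical values)
  sameAs: [
    "https://www.wikidata.org/entity/Q0000000", // placeholder Wikidata item
    "https://www.gbif.org/species/0000000",     // placeholder GBIF taxon key
  ],
  // Literature and other resources returned by the various sources
  subjectOf: [
    { "@type": "ScholarlyArticle", "@id": "https://doi.org/10.xxxx/xxxxx" }, // placeholder DOI
  ],
};

console.log(JSON.stringify(result, null, 2));
```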

Taxonomy could be on the brink of another golden age—if we play our cards right. As it is reinvented and reborn in the 21st century, taxonomy needs to retain its traditional organismal-focused approaches while simultaneously building bridges with phylogenetics, ecology, genomics, and the computational sciences.

Taxonomy is, of course, doing just this, albeit not nearly fast enough. There are some pretty serious obstacles, some of them cultural, but some of them due to the nature of the problem. Taxonomic knowledge is massively decentralised, mostly non-digital, and many of the key sources and aggregations are behind paywalls. There is also a fairly large "technical debt" to deal with. Ian Mulvany was recently interviewed by PLoS and he emphasised that because academic publishers had been online from early on they were pioneers, but at the same time this left them with a legacy of older technologies and approaches that can sometimes get in the way of new ideas. I think taxonomy suffers from some of the same problems. Because taxonomy has long been involved with computers, sometimes we ended up betting on the "wrong" solutions. For example, at one time XML was the new hotness, and people invested a lot of effort in developing XML schemas, and then ontologies and RDF vocabularies. Meanwhile much of the web has moved to simple data formats such as JSON, many specialist vocabularies are gathering dust as schema.org takes off, and projects like Wikidata force us to rethink the need for topic-specific databases.

But these are technical details. For me the key point of "Let’s rise up to unite taxonomy and technology" is that it's a symptom of the continued failure of biodiversity informatics to actually address the needs of its users. People keep asking for fairly simple things, and we keep ignoring them (or explaining why it's MUCH harder than people think, which is another way of ignoring them).

Sunday, August 20, 2017

Notes on displaying big trees using Google Maps/Leaflet

Notes to self on web map-style tree viewers. The basic idea is to use Google Maps or Leaflet to display a tree. Hence we need to compute tiles. One approach is to use a database that supports spatial queries to store the x,y coordinates of the tree. When we draw a tile we compute the coordinates of that tile, based on position and zoom level, do a spatial query to extract all lines that intersect with the rectangle for that tile, and draw those.

A nice example of this is Lifemap (see also De Vienne, D. M. (2016). Lifemap: Exploring the Entire Tree of Life. PLOS Biology, 14(12), e2001624. doi:10.1371/journal.pbio.2001624).

It occurs to me that for trees that aren't too big we could do this without an external database. For example, what if we used a Javascript implementation of an R-tree, such as imbcmdth/RTree or its fork leaflet-extras/RTree? We could compute the coordinates of the nodes in the tree in "geographic" space, store the bounding box for each line/arc in an R-tree, then query that R-tree for lines that intersect with the bounding box of the relevant tile. We could use a clipping algorithm to only draw the bits of the lines that cross the tile itself.
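Here's a rough sketch of that idea, using rbush as a stand-in for the R-tree libraries mentioned above, and assuming the tree has been laid out in a 256 × 256 coordinate space at zoom level 0 (the usual web map convention):

```typescript
import RBush from "rbush";

// Sketch: index tree edges in an R-tree and pull out the ones that intersect a tile.
interface Edge {
  minX: number; minY: number; maxX: number; maxY: number; // bounding box of the line
  x1: number; y1: number; x2: number; y2: number;         // the edge itself
}

const index = new RBush<Edge>();

function addEdge(x1: number, y1: number, x2: number, y2: number): void {
  index.insert({
    minX: Math.min(x1, x2), minY: Math.min(y1, y2),
    maxX: Math.max(x1, x2), maxY: Math.max(y1, y2),
    x1, y1, x2, y2,
  });
}

// Edges intersecting the tile at (tileX, tileY) for a given zoom level, assuming
// the tree is laid out in a 256-unit "world" at zoom 0 that doubles at each zoom.
function edgesForTile(tileX: number, tileY: number, zoom: number): Edge[] {
  const span = 256 / Math.pow(2, zoom); // width/height of one tile in layout coordinates
  return index.search({
    minX: tileX * span, minY: tileY * span,
    maxX: (tileX + 1) * span, maxY: (tileY + 1) * span,
  });
}
```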

Web maps, at least in my experience, make trips to a tile server to fetch each tile; here we would instead want to call a routine within our web page, because all the data would already be loaded into that page. So we'd need to modify the tile-creating code.
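Under the same assumptions, here is a sketch of what that in-page tile routine might look like as a Leaflet GridLayer, reusing the hypothetical edgesForTile() helper from the previous sketch:

```typescript
import * as L from "leaflet";

// Sketch: a Leaflet GridLayer that draws tree edges onto 256x256 canvas tiles in
// the page itself, rather than fetching tile images from a server. Assumes the
// edgesForTile() helper and the 256-unit layout space from the previous sketch.
const TreeLayer = L.GridLayer.extend({
  createTile(coords: L.Coords): HTMLElement {
    const tile = document.createElement("canvas");
    tile.width = 256;
    tile.height = 256;

    const ctx = tile.getContext("2d")!;
    const span = 256 / Math.pow(2, coords.z); // width of one tile in layout coordinates
    const toPixelX = (x: number) => ((x - coords.x * span) / span) * 256;
    const toPixelY = (y: number) => ((y - coords.y * span) / span) * 256;

    for (const edge of edgesForTile(coords.x, coords.y, coords.z)) {
      ctx.beginPath();
      ctx.moveTo(toPixelX(edge.x1), toPixelY(edge.y1));
      ctx.lineTo(toPixelX(edge.x2), toPixelY(edge.y2));
      ctx.stroke();
    }
    return tile;
  },
});

// const layer = new TreeLayer();
// layer.addTo(map); // attach to an existing Leaflet map
```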

The ultimate goal would be to have a single page web app that accepts a Newick-style tree and converts into a browsable, zoomable visualisation.