Thursday, March 06, 2014

IDCC14 notes, day 2: keynote Atul Butte

Part 2 in a series of notes on IDCC 2014, the 9th International Digital Curation Conference, held San Francisco, 24-27 feb.

Day two kicked off with a fantastic keynote by Atul Butte, Associate Professor in Medicine and Pediatrics, Stanford University School of Medicine: Translating a trillion points of data into therapies,diagnostics and new insights into disease [PDF] [Video on Youtube]. This one was well worth a separate blogpost. 

Butte starts his presentation with some great examples of how the availability of a wealth of open data has already radically changed bio/medial research. Over one million datasets are now openly available in the GeneChip standardized format. A search for breast cancer samples in NCBI Geo datasets database gives 40k results, more than the best lab will ever have in their stores. And PubChem has more samples than all pharma companies combined, completely open.

The availability of this data is leading to new developments. Butte cites a recent study that by combining datasets revealed ‘overfitting’, where everybody does an experiment in exactly the same way leading to reproducable results that are irrelevant to the real world.

But this is tame compared to the change in the whole science ecosystem with the advent of online marketplaces. Butte goes on to show a number of thriving ecommerce sites - “add to shopping cart!” - where samples can be bought for competitive prices. Conversant Bio is a marketplace for discarded samples from hospitals, with identifiers stripped off. Hospitals have limited freezer space, have biopsy samples that can be sold, and presto. What about the ethics? "Ethics is a regional thing. They can get away with a lot if stuff in Boston we can't do in Palo Alto." Now any lab can buy research samples for a low price and develop new blood marker tests. This way recently a test was developed for preeclampsia, the disease now best known from Downton Abbey.

Marketplaces also have sprung up for services, such as This is a clearinghouse for medical research services, including animal tests. Thousands of companies provide these worldwide. Butte stresses that it's is not just a race to the bottom and to China, but that this also creates opportunity for new specialised research niches, such as a lab specializing in mouse coloscopies. Makes it possible to do real double blind tests by just buying more tests from different vendors (with different certifications, just to spread). This makes it especially interesting to investigate other effects of tested and approved drugs. Which is a good thing, because the old way of research on new drugs is not sustainable when patents run out (the “pharma patent cliff of 2018”). 

This new science ecosystem is built on top of the availability of open data sets, but there are questions to be solved for the sustainability. Butte sees two players here, funders and repositories themselves.
Incentives for sharing are lacking. Altmetrics are just beginning, and funders need to kick in. Secondary use grants are an interesting new development. Clinical trials may be the next big thing. The most expensive experiments in the world, costing $200 mln each. 50% fails and not even a paper is written about them... Butte expects funders to start requiring publications on negative trails and publishing of the raw data.
The international repositories are at the moment mostly government funded and this may run out. Butte thinks that mirroring and distributing is the future. He also stresses that repositories need to bring the cost down - outsourcing! - and real show use cases, that will inspire people. The repositories that will win are the ones that yield the best research.

Sunday, March 02, 2014

IDCC14 notes, day 1: 4c project workshop

Part 1 in a series of notes on IDCC 2014, the 9th International Digital Curation Conference, held San Francisco, 24-27 feb.

In stark contrast with the 'European' 2013 edition, held last year in my hometown Amsterdam, at this IDCC over 80% of the attendees were from the US. That’s what you get with a west coast location, and unfortunately it was not made up by more delegates from Asia and down under. However as the conference progressed it became clear that despite the vast differences in organisation and culture, we’re all running into the same problems.

IDCC 2014 Day 1: pre-conference workshop on 4C project is an EU financed project to make a inventory of the costs of curation (archiving, stewardship, digital permanence etc.) With a 2 year project span it’s relatively short. The main results will be a series reports, a framework for analysis (based on risk management) and the ‘curation cost exchange’, a website where (anonimized) budgets can be compared.
The project held a one-day pre-conference workshop “4C - are we on the right track?” at which a roadmap and some intermediate results were presented, mixed with more interactive sessions for feedback from the participants. It didn’t always work (the schedule was tight) but still it was a day full of interesting discussions.
Neil Grindley noted that since the start of the project the goal has shifted from “just helping people calculate the cost” to a wider context. Beyond the actual cost (model) of curation: also the context, risks management, benefits and ROI. ROI is especially important for influencing decision makers, given the limited resources.

d3-1 - evaluation of cost models and needs gaps analysis draft

Cost models

Cost models are difficult to compare and hard to do. Top topics of interest: risks/trustworthiness, sustainability and data protection issues. Some organizations are unable or unwilling to share budgets. Special praise was given to the Dutch Royal Library (KB) for being a very open organisation for disclosing their business costs.
The exponential drop of storage costs has stopped. The rate has fallen from 30-40% to at most 12%. It is impossible to calculate costs for indefinite storage. This lead to a remark from the audience: "we're just as guilty as the researchers really, our horizon is the finish of the project.” We have to use different time scales - you have to have some short time benefits, but also keep the long term in scope.
However, costs are much more than storage. Rule of a thumb: 1/2 ingest, 1/3 storage, 1/6 access. Preservation and access are not necessarily linked. Example is the LOC twitter archive which they keep on tape. Once (if) legal issues currently prohibiting opening this archive are resolved, access might be possible via amazon’s 'open data sets' where you pay for acces by using EC2. The economics work because amazon keeps it on non-presistent media and provides access, and LOC keeps it on persistent media but no access.

Other misc notes

A detailed mockup of the cost exchange website was demoed and if all the functionality can be realized, this may be a very useful resource.

The workshop included a primer on professional risk management, based on ISO 31000 standard. “Just read this standard, it's not very boring!”. Originally from engineering, risk management is now considered mature for other fields as well. 

German Nestor project, really clear definitions on what a repository is, a useful resource comparable to the JISC reports:

Open Planets Foundation - great tools.

CDL DataShare is online - a really nice, clean interface.