Saturday, May 30, 2009
OR09: Repository workflows: LoC's KISS approach to workflow
OR09: Repository workflows: ICE-TheOREM, semantic infra for theses
OR09: Social marketing and success factors of IR’s.
Wednesday, May 27, 2009
OR09: Keynote by John Wilbanks
Locks and Gears: Digital Repositories and the Digital Commons - John Wilbanks, Vice President of Science, Creative Commons
Great presentation - in content as well as in format. Worth looking at the slides [slideshare - of a similar presentation two weeks earlier]. [Which was good, because it was awkwardly scheduled at the end of the afternoon (great with a fresh jetlag), straight after the previous panel session without so much as a toilet break.]
The unfortunately familiar story of journals on the internet, scholars' rights eroding, which causes interlocking problems that prevent the network effect.
Choice quotes:
“20 years ago, we would have rather believed there be a worldwide web of free research knowledge, than Wikipedia.”
"The great irony is that the web was designed for scientific data, and now it works really well for porn and shoes."
The CC licenses are a way of making it happen with journals. However, for data even CC-BY is making it hard to do useful integration of different datasets. Survey of 1000 bio databases: >250 different licenses! Opposite law of open source software: the most conservative license wins.
Example of what can happen if data is set free: Proteomecommons.org: BitTorrent for genomes. Thanks to CC Zero.
What can we do?
Solve locally, share globally.
Use standards. And don’t fork them.
Lead by example.
Q: opinion on Wolfram Alpha? Or Google Squared?
A: pretty cool, doubts about scaling. It may be this or something else, rather open source than ‘magic technology’. But it’s a sign that the web is about to crack.
“The only thing that’s proven to scale is distributed networks.”
(my comment - with an estimated 500.000 servers, that is precisely what Google is...)
OR09: Panel session - Insights from Leaders of Open Source Repository Organizations
A panel with the big three open source players (DSpace’s Michelle Kimpton and Fedora Commons’ Sandy Payette, freshly merged into DuraSpace, ePrints’ Les Carr) and Lee Dirks from Microsoft. Zentity (no, not Zentity - 1.0 was officially announced at this conference) brings up lots of good questions. Unfortunately the panel never got to an interesting exchange of ideas.
I’ll concentrate on Microsoft, as they were the elephant in the room. Warning: opinions ahead.
Microsoft is walking a thin line; their stance has been very defensive. Dirks started out quipping that “We wanted to announce Microsoft merging with ePrints, we got together yesterday, but we couldn’t agree on who was going to take over who.”
He went on stressing that this is Microsoft Research and they're not required to make a profit. Putting on a philanthropist guise, he went on that their goal is to offer an open source repository solution to organizations that already have campus licenses. “How can we help you use software that you already paid for but maybe don’t use?”. They claim they don't want to pull people away from open source solutions.
The most interesting parts were what he was *not* saying. Which open source does MS not want to pull us away from - Java? MySQL? Eclipse? Or did he only mean open source repository packages?
Yeah right… getting Visual Studio, IIS, SQL Server and, most dangerous of all, SharePoint a foot in the door.
An audience question that nailed the central issue: "The question will be lock-in. commitment in other parts of the lifecycle are therefore more important. Zentity hooks you up everywhere in the MS stack."
Dirks responded with "Everything we’ve done, is built on open API’s, be it Sharepoint or Office or whatever. You could reconstruct it all yourself."
Well, with all respect to the Mono and Wine efforts, I wouldn't call Sharepoint and Office API's you could easily replace. The data will still be in a black box. Especially if you want to make any use of the collaboration facilities. Having open API's on the outside is fine and dandy, but one thing we've learned so far with repositories is that it is hard to create an exchange (metadata) format that is neither too limited nor so complicated it hinders adoption.
On an audience question about his stance on data preservation, Dirks initially replied that ODF would solve this, including provenance metadata. No mention of the controversy around this file format - what use is an xml format that cannot be understood? - or of filetypes outside the Office Universe.
When this debate stranded, Sandy Payette turned the mood around by mentioning that MS has contributed much to interoperability issues. It is indeed good to keep in mind that MS is not just big and bad - they aren't. A company that employs Accordionguy can't be all that bad. The trouble is, you have to stay aware and awake, for they aren't all that good, either. Imagine an Office-style lock-in for collaboratories.
Tuesday, May 26, 2009
OR09: NSF Datanet-curating scientific data
The first non-split plenary (why a large part of the first two days consisted of 'split plenaries' baffled me, and I was not the only one).
Two speakers, two approaches. First John Kunze from UCDL, focussing on the micro level with a strategy of keeping it simple. "Imagining the non-repository", "avoid the deadly embrace" of tight standards: decouple by design, lower the barrier of entry.
One of the ways to accomplish this is by staying lo-tech: instead of full-blown database systems, use a plain file system and naming conventions: pairtree. I really like this approach. I've worked in large digitization projects with third parties delivering content on hard disks. They balk at databases and complicated metadata schemes, but this might just be doable for them. Good stuff.
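To make the pairtree idea concrete, here is a minimal sketch in Python of how an identifier could be turned into a directory path. This is my own illustration, not code from the talk: the function names are made up, and the character-cleaning rules are written from my reading of the Pairtree spec, so treat them as an assumption and check the spec before relying on them.

```python
from pathlib import PurePosixPath

# Assumption (from my reading of the Pairtree spec): these characters are
# hex-encoded as ^hh, as is anything outside visible ASCII (0x21-0x7e).
_HEX_ENCODE = set('"*+,<=>?\\^|')

# Assumption: after hex-encoding, these three characters are swapped for
# file-system-friendly ones.
_SWAP = {"/": "=", ":": "+", ".": ","}


def clean_id(identifier: str) -> str:
    """Make an identifier safe to use as a sequence of directory names."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(_SWAP.get(ch, ch))
    return "".join(out)


def pairtree_path(identifier: str, root: str = "pairtree_root") -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_id(identifier)
    parts = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return str(PurePosixPath(root, *parts))


print(pairtree_path("ark:/13030/xt12t3"))
```

The point is that a third party only needs a file system and this one naming rule: no database, no metadata schema, and the object's location can be worked out from its identifier alone.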
CDL has a whole set of curation microsystems, as they call it. I'm going to keep an eye out for this.
The second talk, by Sayeed Choudhury (Johns Hopkins), focussed on the macro level of data conservancy. This was more abstract, and he started out with the admission that "we don’t have the answers, there are unsolved unknowns - otherwise we wouldn’t have gotten that NSF grant".
Interesting: one of the partner institutions (not funded by NSF) is Zoom Intelligence – a venture capital firm, interested in creating software services on research data. First VC's bought into ILS, now they pop up here... we must be doing something right!
Otherwise, the talk was mostly abstract and longer term strategy.
Monday, May 25, 2009
OR09: Institutional Repositories: Contributing to Institutional Knowledge Management and the Global Research Commons
Institutional Repositories: Contributing to Institutional Knowledge Management and the Global Research Commons - Wendy White (University of Southampton)
Insightful, passionate kick-ass presentation, with some excellent diagrams in the slides (alas I found no link yet), especially one that puts the repository in the middle of the scientific workflow. The message was clear: tough times ahead for repositories – we have to be an active part of the flow, otherwise we may not survive.
Current improvements (see slides: linking into HR instead of LDAP to follow history of deployment, lightbox for presentation of nontext material) are strategy-driven, which is a step forward from tech-driven, but still piecemeal.
Predicts grants for large scale collaboration processes could be tipping point for changing lone researcher paradigm.
(in my opinion, this may well be true for some fields, even in the humanities, but not for all. Interesting that for instance The Fascinator Desktop aims to serve those ‘loners’).
Stressed that open access is not just idealism, it can also be a benefit in highly competitive fields – she cited a research group that got a contract because the company contacted them after seeing what their researchers were doing.
“build on success stories: symbols and mythology”.
“Repository managers have fingers in lots of pies, we are in a very good position to take on the key bridging role.”
It will however require a culture change, also in the management sphere. In the Q&A she noted that Southampton is lucky to have been through that process already.
All in all, a good strategic longer term overview, and quite urgent.
Sunday, May 24, 2009
OR09: PEI's Drupal strategy for VRE and repositories
Small province in Canada, middle of nowhere, pop 140k, only uni on the island. UPEI is doing some very good stuff, and made some radical choices. They fundamentally transformed the library from traditional staff to techies. Number of staff didn’t change (25), but the number of techs increased from 1 to 5, plus a pool of freelancers.
VRE's using Drupal
Strong push for VRE’s, using Drupal as platform. Low entry barrier: any researcher can request one! All customisations are non-specific as a rule, so all users benefit in the end. If researcher brings additional funding, contract devs are hired to speed up the process.
Some clients have developed rich Drupal plugins themselves (depends on a willing postgrad :-)
Currently 50+ VRE’s. Example of a globe-spanning VRE: Advancing Interdisciplinary Research in Singing
But the same environment is also used for local history projects with social elements (“tag this image”).
Why go open source? It improves code and documentation quality through the embarrassment factor: “Going opensource is like running through the hotel at night naked – you want to be at least presentable”.
Repository: Drupal+Fedora=Islandora
PEI developed Islandora as frontend for Fedora repository. However, from the users POV it is completely hidden: they log in to the VRE, this silently handles depositing in the rep.
Both Drupal and Fedora are ‘strong systems’ with a lot of capabilities. However by definition all data and metadata go in Fedora, to separate data from application layer and make migration possible. This needs to be strongly enforced as some things are easier in Drupal.
Very neat integration between data objects in repository and VRE: researchers can search specifically within the objects, as in “search for data sets in which field X has value between 7 and 8”. Done by mapping the data to an xml format, then mapping xml fields to search params. For fields where xml data formats are available and commonly used this is a real boon (example of marine biology).
BTW, only fifteen minutes into the talk did I connect the acronym PEI with the name Prince Edward Island. Either PEI is so famous in the repository world that it needn't be explained at all, or it was mentioned so briefly that it slipped by me...
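A rough sketch of how such a "field X between 7 and 8" search could work once the data is in xml. This is not Islandora's actual implementation, which I haven't seen: the element names, the `FIELD_MAP` table and the function names are all hypothetical, and a real system would use a search index rather than parsing every object per query.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a search parameter to the xml element holding it.
FIELD_MAP = {
    "ph": "./measurement/ph",
    "depth": "./measurement/depth",
}


def field_values(xml_text: str, param: str) -> list[float]:
    """Collect all numeric values of one mapped field in one data set."""
    root = ET.fromstring(xml_text)
    values = []
    for element in root.findall(FIELD_MAP[param]):
        try:
            values.append(float(element.text))
        except (TypeError, ValueError):
            continue  # skip empty or non-numeric entries
    return values


def search_range(datasets: dict[str, str], param: str,
                 low: float, high: float) -> list[str]:
    """Return names of data sets with at least one value in [low, high]."""
    return [
        name
        for name, xml_text in datasets.items()
        if any(low <= value <= high for value in field_values(xml_text, param))
    ]


# Made-up example data, purely to show the call.
datasets = {
    "station_a": "<dataset><measurement><ph>7.4</ph></measurement></dataset>",
    "station_b": "<dataset><measurement><ph>8.6</ph></measurement></dataset>",
}
print(search_range(datasets, "ph", 7, 8))
```

The two mapping steps from the talk show up as two pieces here: the data is expressed as xml elements, and `FIELD_MAP` ties each search parameter to one of those elements.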
OR09: Purdue's investigation on Research Data Repositories
OR09 day 1, session 2a: Michael Witt (Purdue University) "Eliciting Faculty Requirements for Research Data Repositories"
Preliminary results of an investigation into what researchers want regarding data (repositories). Some good stuff. Hope the slides will be published soon - or the report for that matter.
See Sean's weblog for the ten primary questions, good for self-evaluation also. Mark Leggott then quickly added an additional 11th question to his slideshow - how much is in your wallet...
Method: interviews and followup survey with twenty scientists, transcribed (using NVivo). “It was like drinking from a firehose.” For each, a “data curation profile” was created, with example data & description. Will be interesting when it comes out.
OR09: on subject based repositories
Open Repositories 2009, day one, session 1b.
Phew! OR09 is over, and my jetlag almost. An intense conference that was certainly worth it, the content was generally interesting and well-presented. I'll be posting my conference notes here the coming few days.
The first session on Monday morning consisted of two talks on two subject-based repositories. The planned third one, on a Japanese one, was cancelled - unfortunately, as I know very little of what’s happening there regarding OA.
First came Julie Ann Kelly (University of Minnesota) on AgEcon, a repository for Agricultural Economics, a field with a strong working paper tradition. It was set up in the gopher days (not so surprising, as the critter originated in Minnesota).
Interesting was the reason: in this field, working papers are citable, but the reference format was a mess.
Even more interesting: because of this, it also became the de facto place for depositing appendices to articles - datasets! The repository accepts them and they have the same citing format. There is a lesson here... solve a real problem, and content will come.
Usage statistics: only 53% of downloads come from people, 43.6% is googlebot (the rest other spiders). 66% of visitors come through Google straight to the results, no longer through the frontend. Another 19% come from other search engines, which leaves 14% coming through the front.
Further notes:
Why is life easier in a subject repository?
- Focussed topic makes metadata easier, common vocabularies exist etc.
- Recruitment (of other institutions) is easier (specialists in one profession tend to meet frequently, recruiting can piggyback on conferences etc).
And why is it harder?
- organising the community is hard work - 170 institutions with each between 1 and 300 submitters creates a lot of traffic on quality issues. They frequently hire students for the correcting.
Minnesota is consolidating its repositories from 5-6 different systems to Islandora. AgEcon will be one of them.
They want to use this Drupal based system also to add social networking, akin to Ethicshare. Ethicshare is interesting: a social citation manager (a la Citeulike/Bibsonomy) plus repository plus social network plus calendar and then some more, for a specific field of study, in this case ethics research. Commoditisation coming?
The second subject repository was on Economists Online, presented by Vanessa Proudman of Tilburg University. Interesting to see this is in many ways the opposite approach. EO is a big European project that works top-down, tries to get the big players aboard first as incentive for the others, and emphasizes quality above all. Whereas AE was a grassroots bottom-up model, that empowered small institutions.
It's a work in progress, only mockups shown. These look slick, with a well thought-out UI. Interesting: with every object in the result list, statistics will be shown inline (ajax), and can be downloaded in multiple formats.
Small pilot with 10 datasets per participating institution, DDI format, Dataverse as preferred solution. Provenance of datasets is very complicated: there are many contributors to the data life cycle, dataset owners, sources, providers, all must be accredited.
Like AE, EO stresses that subject-based repositories have different characteristics. They will organize a dedicated conference on subject repositories in January 2010 in London, as they note that the subject rarely comes up at general repository conferences.
Interest in attending: mail subjectrep_parts@lists.uvt.nl