Saturday, May 30, 2009

OR09: Repository workflows: LoC's KISS approach to workflow

Open Repositories 2009, day 2, session 6b.

Leslie Johnston (Library of Congress)

My summary:

A practical approach to dealing with data from varying sources: keep it as simple as possible, but not simpler.
The ingest tools look very useful for any type of digitization project, especially when working with an external party (such as a specialized scanning company).
The inventory tool may be even more useful, as lifecycle events are generally not well covered by traditional systems, be it a CMS or an ILS.

Background

LoC acts as a durable-storage deposit target for widely varying projects and institutions. Data transfers for archiving range from a USB stick in the mail to 2 TB transferred straight over the network. The answer to dealing with this: simple protocols, developed together with the UC digital library (see also John Kunze).

Combined, this is not yet a full repository, but it covers many aspects of ingest and archive functionality. The rest will come. Aim: provide persistent access at the file level.

Simple file format: BagIt

Submitters are asked to describe their files in BagIt format.

BagIt is a standard for packaging files; METS files will fit in there, too. However, BagIt was created because we needed something much, much, much simpler. It's not as detailed: the description is a manifest, and it may omit relationships, individual descriptions, etc. It is very lightweight (actually too light: we've started creating further profiles for certain types of content).

LoC will support BagIt in parallel with MODS & METS.
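To give a feel for how lightweight it is: a bag is just a directory with a couple of small tag files next to the payload. The tag file names below come from the BagIt spec; the payload files are my own invented example:

```
mybag/
  bagit.txt           declares the BagIt version and tag-file encoding
  manifest-md5.txt    one "<checksum> <path>" line per payload file
  data/               the payload, exactly as submitted
    thesis.pdf
    images/page001.tiff
```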

Simple tools

Simple tools for ingest:
- parallel receiver (handles network transfers over rsync, FTP, HTTP, HTTPS)
- validator (checks file formats)
- verifyit (verifies file checksums)
These tools are supplied as a Java library, a Java desktop application, and the LocDrop webapp (a prototype for SWORD ingest).
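As an illustration of what the verify step involves (my own sketch, not LoC's verifyit tool): checking a bag boils down to recomputing each payload file's checksum against the manifest.

```python
import hashlib
from pathlib import Path

def verify_bag(bag_dir):
    """Recompute each payload file's MD5 and compare it to the manifest.

    My own sketch of the verify step, not LoC's verifyit tool; real
    bags may also carry sha1/sha256 manifests and tag manifests.
    """
    bag = Path(bag_dir)
    failures = []
    for line in (bag / "manifest-md5.txt").read_text().splitlines():
        expected, relpath = line.split(maxsplit=1)
        actual = hashlib.md5((bag / relpath).read_bytes()).hexdigest()
        if actual != expected:
            failures.append(relpath)
    return failures  # an empty list means the bag verified

print(verify_bag("mybag") or "bag OK")
```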

Integration between transfer and inventory is very important: trying to retrieve the correct information later is very hard.

After receiving, the inventory tool records lifecycle events.
Why a standardized tool? Because 80% of the workflow overlaps between projects.
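To illustrate what recording a lifecycle event can amount to, a minimal record might look like this (my own sketch; all field names are invented, loosely modeled on PREMIS-style event metadata):

```python
# My own sketch of a lifecycle event record; the field names are
# invented, loosely modeled on PREMIS-style event metadata.
event = {
    "package_id": "transfer-2009-0542",  # hypothetical identifier
    "event_type": "checksum-verified",   # or: received, validated, copied...
    "timestamp": "2009-05-30T14:02:11Z",
    "agent": "verifyit",
    "outcome": "success",
}
```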


All tools are available as open source [sourceforge]. What's currently missing will be added soon.

OR09: Repository workflows: ICE-TheOREM, semantic infra for theses

Open Repositories 2009, day 2, session 6b.

Summary: great concept, convincing demonstration. Excellent stuff.

Part of the ICE project, a JISC-funded experiment with ORE.
[paper] (seems stuck behind login?)

Importance of ORE: “ORE is a really important protocol – it has been missing for the web for most of its life so far.” (DH: Amen!)

Motivations for TheOREM: check ORE – is it applicable and useful? What are the different ways of using it? How do SWORD and ORE combine?
Practically: improving thesis visibility, with embargoes as an enabler.

Interesting: in the whole repository system, the management of embargoes is separated from the repository by design. A separate system serves resource maps for the unembargoed items, and the IR polls these regularly. Interesting: this reflects the real-world political issues, and makes it easier to introduce quite radical changes.

Demonstrator (built on the Fascinator) with one thesis, which references a data object: a molecule description in Chemical Markup Language (actual data).
Simple authoring environment in OpenOffice Writer (Word is also supported), using a stylesheet + convention based approach. When uploaded, the doc is taken apart into atomistic XML objects in Fedora. The chemical element is a separate object with a relation to the doc, versioning, etc.

Embargo metadata is written as text in the doc (on the title page; the date is noted using a convention, KISS approach), and a style (p-meta-date-embargo) is applied. The thesis is ingested again - and voila, the embargoed part of the thesis is now hidden.
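To make the convention concrete: an ODF document is a zip archive with the text in content.xml, so pulling out the embargo date can be as simple as looking for paragraphs carrying that style. This is my own sketch, not the project's code; Writer often wraps named styles in automatic styles, which a robust implementation would resolve first.

```python
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def find_embargo_dates(odt_path):
    """Return the text of every paragraph styled p-meta-date-embargo.

    My own sketch; a robust version would also resolve automatic
    styles derived from the named style.
    """
    with zipfile.ZipFile(odt_path) as odt:
        root = ET.fromstring(odt.read("content.xml"))
    return [
        "".join(p.itertext())
        for p in root.iter(f"{{{TEXT_NS}}}p")
        if p.get(f"{{{TEXT_NS}}}style-name") == "p-meta-date-embargo"
    ]
```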

This simple system also allows the dialogue between student and tutor - remarks on the text - to be embedded in the document itself (and hidden from the outside by default). It looks deceptively like Word's own comments, which I imagine will ease the uptake.

Sidenote: policy in this project is that only the submitter can ever change embargo data. So it is recommended to use OpenID rather than institutional logins, as PhD graduates tend to move on, and then nobody can change it anymore.

Q (from Les Carr): supervisors won't like to have their interaction with students complicated by tech. What is their benefit?
A: automatic backup is a big benefit, as is the workflow (i.e. the comments in the document text). We *know* students appreciate it. Supervisors may not like it, but everyone else will, and then they'll have to.

(Note DH: this is of course in the sciences; it will be an interesting challenge to get the humanities to adhere to stylesheet and microformatting conventions.)

Q: can this workflow also generate the ‘authentic and blessed copy’ of the final thesis?
A: Not in project scope, we still produce the PDF for that. In theory this might be a more authentic copy, but they might scream at the sight of this tech.

OR09: Social marketing and success factors of IRs

Open Repositories 2009, day 2, session 5b. 

Social marketing and success factors of IRs: two thorough but not very exciting sessions. Though the lack of excitement is maybe also because the message is quite sobering: we already know what needs to be done, but it is very hard to change the (institutional) processes involved.

(Where 'social marketing' doesn't stand for web 2.0 goodness, but for marketing with the aim of changing social behaviour, using the tools of commercial marketing.)

Generally, face-to-face contact works best - at the faculty scale, or in a smaller institution like UPEI.

One observation that stuck with me is that the very word 'repository' is passive, where we want to emphasize exposure. This is precisely our problem as a whole in making the repository an active part at the center of the academic research workflow, instead of a passive end point.

Finally, the list of good examples started out with Cream of Science! We tend to take it for granted here in the Netherlands, and focus on where we're stuck; it's good to be reminded how well that has worked and still does.

Interim news from the uMich MIRACLE project (Making Institutional Repositories A Collaborative Learning Environment).
Not very exciting yet; that might change when they've accumulated more data (it's a work in progress: five case studies of larger US institutions, widely varying in policy, age, and technology).

Focus on “outcome instead of output”.
Focus on external measurements of success, instead of internal ones (i.e. number of objects, etc.). Harder to enumerate, but it gets more honest results.

Wednesday, May 27, 2009

OR09: Keynote by John Wilbanks

Open Repositories 2009, day 1, keynote.

Locks and Gears: Digital Repositories and the Digital Commons - John Wilbanks, Vice President of Science, Creative Commons

Great presentation - in content as well as in format. Worth looking at the slides [slideshare - of a similar presentation two weeks earlier]. [Which was good, because the talk was awkwardly scheduled at the end of the afternoon - great with a fresh jetlag, straight after the previous panel session without so much as a toilet break.]

The unfortunately familiar story of journals on the internet, scholars' rights eroding, which causes interlocking problems that prevent the network effect.

Choice quotes:
“20 years ago, we would sooner have believed there would be a worldwide web of free research knowledge than Wikipedia.”
"The great irony is that the web was designed for scientific data, and now it works really well for porn and shoes."

The CC licenses are a way of making that happen for journals. For data, however, even CC-BY makes it hard to do useful integration of different datasets. A survey of 1000 bio databases found >250 different licenses! The opposite of the law of open source software: here the most conservative license wins.

Example of what can happen if data is set free: Proteomecommons.org, BitTorrent for genomes. Thanks to CC Zero.

What can we do?
Solve locally, share globally.
Use standards. And don’t fork them.
Lead by example.


Q: opinion on Wolfram Alpha? Or Google Squared?
A: pretty cool, but doubts about scaling. It may be this or something else, preferably open source rather than 'magic technology'. But it's a sign that the web is about to crack.
“The only thing that’s proven to scale is distributed networks.”

(My comment: with an estimated 500,000 servers, that is precisely what Google is...)

OR09: Panel session - Insights from Leaders of Open Source Repository Organizations

Open repositories 2009, day 1, session 4.

A panel with the big three open source players (DSpace's Michelle Kimpton and Fedora Commons' Sandy Payette, freshly merged into DuraSpace, and ePrints' Les Carr) and Lee Dirks from Microsoft. Zentity (1.0 was officially announced at this conference) brings up lots of good questions. Unfortunately it didn't lead to an interesting exchange of ideas.

I’ll concentrate on Microsoft, as they were the elephant in the room. Warning: opinions ahead.

Microsoft is walking a thin line; their stance has been very defensive. Dirks started out quipping that "We wanted to announce Microsoft merging with ePrints - we got together yesterday, but we couldn't agree on who was going to take over whom."

He went on to stress that this is Microsoft Research, and that they're not required to make a profit. Putting on a philanthropist guise, he explained that their goal is to offer an open source repository solution to organizations that already have campus licenses: "How can we help you use software that you already paid for but maybe don't use?" They claim they don't want to pull people away from open source solutions.

The most interesting part was what he was *not* saying. Which open source does MS not want to pull us away from - Java? MySQL? Eclipse? Or did he only mean open source repository packages?
Yeah right… getting Visual Studio, IIS, SQL Server and, most dangerous of all, Sharepoint a foot in the door.

An audience question nailed the central issue: "The question will be lock-in. Commitment in other parts of the lifecycle is therefore more important. Zentity hooks you up everywhere in the MS stack."
Dirks responded with "Everything we've done is built on open APIs, be it Sharepoint or Office or whatever. You could reconstruct it all yourself."

Well, with all respect to the Mono and Wine efforts, I wouldn't call Sharepoint and Office APIs you could easily replace. The data will still be in a black box, especially if you want to make any use of the collaboration facilities. Having open APIs on the outside is fine and dandy, but one thing we've learned so far with repositories is that it is hard to create an exchange (metadata) format that is neither too limited nor so complicated that it hinders adoption.

To an audience question about his stance on data preservation, Dirks initially replied that ODF would solve this, including provenance metadata. No mention of the controversy around this file format - what use is an XML format that cannot be understood? - or of file types outside the Office universe.

When this debate stalled, Sandy Payette turned the mood around by mentioning that MS has contributed much to interoperability issues. It is indeed good to keep in mind that MS is not just big and bad - they aren't. A company that employs Accordionguy can't be all that bad. The trouble is, you have to stay aware and awake, for they aren't all that good, either. Imagine an Office-style lock-in for collaboratories.

Tuesday, May 26, 2009

OR09: NSF DataNet - curating scientific data

Open Repositories 2009, Day 1, session 3. NSF DataNet - curating scientific data, John Kunze and Sayeed Choudhury.

The first non-split plenary (why a large part of the first two days consisted of 'split plenaries' baffled me, and I was not the only one).

Two speakers, two approaches. First John Kunze from UCDL, focussing on the micro level with a strategy of keeping it simple: "imagining the non-repository", "avoid the deadly embrace" of tight standards, decouple by design, lower the barrier to entry.

One of the ways to accomplish this is by staying lo-tech: instead of full-blown database systems, use a plain file system and naming conventions: pairtree. I really like this approach. I've worked in large digitization projects with third parties delivering content on hard disks. They balk at databases and complicated metadata schemes, but this might just be doable for them. Good stuff.
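The idea behind pairtree is easy to show: an object identifier maps deterministically to a directory path by splitting it into two-character pieces. My sketch below skips the spec's character-encoding step, which percent-encodes characters that are unsafe in file names:

```python
def pairtree_path(identifier):
    """Map an identifier to a pairtree directory path by splitting it
    into successive two-character directories (the last may be short).

    Sketch only: the real spec first cleans/encodes the identifier.
    """
    return "/".join(identifier[i:i + 2] for i in range(0, len(identifier), 2))

print(pairtree_path("ark13030xt12"))  # -> ar/k1/30/30/xt/12
```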

CDL has a whole set of 'curation microsystems', as they call them. I'm going to keep an eye out for these.

The second talk, by Sayeed Choudhury (Johns Hopkins), focussed on the macro level of data conservancy. This was more abstract, and he started out with the admission that "we don't have the answers, there are unsolved unknowns - otherwise we wouldn't have gotten that NSF grant".

Interesting: one of the partner institutions (not funded by NSF) is Zoom Intelligence, a venture capital firm interested in building software services on research data. First VCs bought into ILSs, now they pop up here... we must be doing something right!

Otherwise, the talk was mostly abstract and longer term strategy.

Monday, May 25, 2009

OR09: Institutional Repositories: Contributing to Institutional Knowledge Management and the Global Research Commons

Day 1, session 2b.

Institutional Repositories: Contributing to Institutional Knowledge Management and the Global Research Commons - Wendy White (University of Southampton)

Insightful, passionate, kick-ass presentation, with some excellent diagrams in the slides (alas, I have found no link yet), especially one that puts the repository in the middle of the scientific workflow. The message was clear: tough times ahead for repositories – we have to be an active part of the flow, otherwise we may not survive.

Current improvements (see slides: linking into HR instead of LDAP to follow deployment history, a lightbox for presenting non-text material) are strategy-driven, which is a step forward from tech-driven, but still piecemeal.

She predicts that grants for large-scale collaboration processes could be the tipping point for changing the lone-researcher paradigm.

(In my opinion, this may well be true for some fields, even in the humanities, but not for all. Interesting that, for instance, the Fascinator Desktop aims to serve those 'loners'.)

She stressed that open access is not just idealism; it can also pay off in highly competitive fields – she cites a research group that got a contract because the company contacted them after seeing what their researchers were doing.

“build on success stories: symbols and mythology”.
“Repository managers have fingers in lots of pies, we are in a very good position to take on the key bridging role.”
It will, however, require a culture change, also in the management sphere. In the Q&A she noted that Southampton is lucky to have been through that process already.

All in all, a good strategic longer term overview, and quite urgent.

Sunday, May 24, 2009

OR09: PEI's Drupal strategy for VRE and repositories

OR09, day 1, session 2a. Research 2.0: Evolving Support for the Research Landscape by Mark Leggott (University of PEI) - [slides here]  - [blog here]

A small province in Canada, middle of nowhere, pop. 140k, the only university on the island. UPEI is doing some very good stuff and has made some radical choices. They fundamentally transformed the library from traditional staff to techies. The number of staff didn't change (25), but the number of techs increased from 1 to 5, plus a pool of freelancers.

VREs using Drupal

Strong push for VREs, using Drupal as the platform. Low entry barrier: any researcher can request one! All customisations are non-specific as a rule, so all users benefit in the end. If a researcher brings additional funding, contract devs are hired to speed up the process.

Some clients have developed rich Drupal plugins themselves (depends on a willing postgrad :-)

Currently 50+ VREs. Example of a globe-spanning VRE: Advancing Interdisciplinary Research in Singing.

But the same environment is also used for local history projects with social elements (“tag this image”).

Why go open source? It improves code and documentation quality through the embarrassment factor: "Going open source is like running through the hotel at night naked – you want to be at least presentable".

Repository: Drupal+Fedora=Islandora

PEI developed Islandora as a frontend for the Fedora repository. From the user's POV, however, it is completely hidden: they log in to the VRE, which silently handles depositing into the repository.

Both Drupal and Fedora are 'strong systems' with a lot of capabilities. However, by definition all data and metadata go into Fedora, to separate the data from the application layer and make migration possible. This needs to be strongly enforced, as some things are easier in Drupal.

Very neat integration between data objects in the repository and the VRE: researchers can search specifically within the objects, as in "search for data sets in which field X has a value between 7 and 8". This is done by mapping the data to an XML format, then mapping XML fields to search parameters. For fields where XML data formats are available and commonly used, this is a real boon (example: marine biology).
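The talk didn't name the search backend; if it is Solr (common in Fedora stacks via GSearch), such a range search would boil down to a query like the one below. The endpoint and field name are my own invented example:

```python
import requests

# Hypothetical Solr endpoint and field name; in practice these would
# come from the XML-to-index mapping described in the talk.
params = {"q": "salinity_f:[7 TO 8]", "wt": "json"}
resp = requests.get("http://localhost:8080/solr/select", params=params)
print(resp.json()["response"]["numFound"], "matching data sets")
```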

Great stuff altogether. The small size may give them an advantage: they operate like a startup, listen to their users, pool resources effectively, and are not afraid to make radical choices.

BTW, fifteen minutes into the talk I finally connected the acronym PEI with the name Prince Edward Island. PEI must be so famous in the repository world that it either needn't be explained at all, or it was mentioned so briefly that it slipped me by...

OR09: Purdue's investigation on Research Data Repositories

OR09, day 1, session 2a: Michael Witt (Purdue University), "Eliciting Faculty Requirements for Research Data Repositories".

Preliminary results of an investigation into what researchers want regarding data (repositories). Some good stuff. I hope the slides will be published soon - or the report, for that matter.

See Sean's weblog for the ten primary questions, good for self-evaluation too. Mark Leggott then quickly added an 11th question to his slideshow: how much is in your wallet...

Method: interviews and a followup survey with twenty scientists, transcribed (using NVivo). "It was like drinking from a firehose." For each, a "data curation profile" was created, with example data & description. Will be interesting when it comes out.

OR09: on subject based repositories

Open Repositories 2009, day one, session 1b.

Phew! OR09 is over, and my jetlag almost gone. An intense conference that was certainly worth it; the content was generally interesting and well presented. I'll be posting my conference notes here over the coming few days.

The first session on Monday morning consisted of two talks on two subject based repositories. The planned third one, on a Japanese repository, was cancelled - unfortunate, as I know very little of what's happening there regarding OA.

First came Julie Ann Kelly (University of Minnesota) on AgEcon, a repository for Agricultural Economics, a field with a strong working paper tradition. It was set up in the gopher days (not so surprising, as the critter originated in Minnesota).

Interesting was the reason: in this field, working papers are citable, but the reference format was a mess.

Even more interesting: because of this, it also became the de facto place for depositing appendices to articles - datasets! The repository accepts them, and they get the same citation format. There is a lesson here... solve a real problem, and content will come.

Usage statistics: only 53% of downloads come from people; 43.6% is Googlebot (the rest, other spiders). 66% of visitors come through Google straight to the results, not through the frontend anymore. Another 19% come from other search engines, which leaves 14% coming through the front door.

Further notes:

Why is life easier in a subject repository?

  • A focussed topic makes metadata easier, common vocabularies exist, etc.
  • Recruitment (of other institutions) is easier: specialists in one profession tend to meet frequently, so recruiting can piggyback on conferences etc.

And why is it harder?

  • Organising the community is hard work: 170 institutions, each with between 1 and 300 submitters, create a lot of traffic on quality issues. They frequently hire students for the corrections.

Minnesota is consolidating its repositories from 5-6 different systems to Islandora. AgEcon will be one of them.

They want to use this Drupal-based system also to add social networking, akin to EthicShare. EthicShare is interesting: a social citation manager (à la CiteULike/Bibsonomy) plus repository plus social network plus calendar and then some, for a specific field of study, in this case ethics research. Commoditisation coming?

The second subject repository talk was on Economists Online, presented by Vanessa Proudman of Tilburg University. Interesting to see that this is in many ways the opposite approach. EO is a big European project that works top-down, tries to get the big players aboard first as an incentive for the others, and emphasizes quality above all. AgEcon, by contrast, was a grassroots bottom-up model that empowered small institutions.

It's a work in progress; only mockups were shown. These look slick, with a well thought-out UI. Interesting: with every object in the result list, statistics will be shown inline (ajax), and can be downloaded in multiple formats.

Small pilot with 10 datasets per participating institution, in DDI format, with Dataverse as the preferred solution. Provenance of datasets is very complicated: there are many contributors in the data life cycle - dataset owners, sources, providers - and all must be credited.

Like AgEcon, EO stresses that subject-based repositories have different characteristics. They will organize a dedicated conference on subject repositories in January 2010 in London, as they note that the subject rarely comes up at general repository conferences.

Interested in attending? Mail subjectrep_parts@lists.uvt.nl