Library spring: 2009

Monday, June 15, 2009

OR 09: three more neat Fedora implementations

Open Repositories 2009, Day 4

Three more notable sessions on implementing Fedora. Hopefully, the penultimate post before a final round-up. What a frantic infodump this conference was...

Enhanced Content Models for Fedora - Asger Blekinge-Rasmussen (State and University Library Denmark)

A hardcore technical talk, though impressive in the elegance of the two points shown: bringing the OO model to Fedora object creation, and a DB-style ‘view’ that makes it easy to create search and browse UIs.

The first is created as an extension of Fedora 3’s standard Content Models, yet backward-compatible, which is a feat. Notable extras: it declares allowed relations (in OWL Lite) and schemas for XML datastreams. It includes a validator service (which is planned as a disseminator, too). Open source [sourceforge].

Fedora objects can be manipulated at quite a high level using the API, but population needs to be done at a much lower level. Thus most systems roll their own. Our solution: templates, with data objects created as instances of CMs, not unlike OO programming. This makes default values very easy. No need for hand-coded FOXML anymore, hallelujah! Create, discover and clone templates using the template web service.
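To make the template idea concrete for myself, here is a minimal Python sketch. It is my own illustration, not the Danish code: `ContentModel`, `Template` and `instantiate` are made-up names, and the real thing is a web service on top of Fedora rather than in-memory objects. It only shows how cloning a pre-filled template gives you default values for free, with a check against what the content model requires.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ContentModel:
    """Hypothetical stand-in for an enhanced content model."""
    pid: str
    required_datastreams: tuple


@dataclass
class Template:
    """A pre-filled object that new data objects are cloned from."""
    model: ContentModel
    datastreams: dict = field(default_factory=dict)

    def instantiate(self, pid, **overrides):
        # Work on a deep copy so the template's defaults stay untouched.
        streams = copy.deepcopy(self.datastreams)
        streams.update(overrides)
        missing = [d for d in self.model.required_datastreams if d not in streams]
        if missing:
            raise ValueError(f"missing datastreams: {missing}")
        return {"pid": pid, "model": self.model.pid, "datastreams": streams}


article_cm = ContentModel("cm:article", ("DC", "RIGHTS"))
article_template = Template(article_cm, {"DC": {"language": "da"}, "RIGHTS": "open"})
new_object = article_template.instantiate(
    "demo:1", DC={"language": "da", "title": "Test"}
)
```

The RIGHTS datastream is never mentioned when creating `demo:1`; it simply comes along from the template.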

Then there are repository views, which bundle atomic objects into logical records. A search engine record might be made up of a bundle of Fedora objects.
Views are defined by annotated relations; different view angles create different logical records.
‘view = none’: the object is omitted from results (useful for small particles you don’t want showing up in queries, for instance separate slides).
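A rough Python sketch of how I understood the view mechanism, under my own assumptions: objects carry a set of angles for which they are an entry point (an empty set plays the role of ‘view = none’), and relations are annotated with the angles that should follow them. The data layout and function names are invented for illustration.

```python
# Hypothetical repository: each object says for which view angles it is an
# entry point, and each relation lists the angles that follow it.
OBJECTS = {
    "demo:talk": {
        "entry_for": {"search"},
        "relations": [("demo:slide1", {"search"}), ("demo:slide2", {"search"})],
    },
    "demo:slide1": {"entry_for": set(), "relations": []},  # 'view = none'
    "demo:slide2": {"entry_for": set(), "relations": []},  # 'view = none'
}


def logical_record(pid, angle, objects=OBJECTS):
    """Collect pid plus everything reachable over relations annotated with angle."""
    seen, stack = [], [pid]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        for target, angles in objects[current]["relations"]:
            if angle in angles:
                stack.append(target)
    return seen


def records(angle, objects=OBJECTS):
    """One logical record per entry object; non-entry objects never show up alone."""
    return {
        pid: logical_record(pid, angle, objects)
        for pid, obj in objects.items()
        if angle in obj["entry_for"]
    }


search_records = records("search")
```

The slides end up inside the talk's record but never appear as results of their own.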

These simple API additions make it easy to create elaborate yet simple GUIs. That includes the first one I’ve seen that comes close to a workable interface for relationship management - not quite full drag’n’drop, but getting there.

Beyond the Tutorial: Complex Content Models in Fedora 3 - Peter Gorman, Scott Prater (University of Wisconsin Digital Collections Center)
[presentation]

Summary: A hands-on walk through of the Wisconsin DIY approach. Also, an excellent example of what a well-done Prezi presentation can look like: literally zooming in on details then zooming out on the global context was really helpful to see the forest for the trees.

The outset: migrating more than 1 million complex, heterogeneous digital objects into Fedora. Use abstract, atomistic CMs that gracefully absorb new kinds and new combinations of content. Philosophy: 'fit the model to the content, not the content to the model'.
(Not in production yet, prototype app; keep an eye out for 'Uni Wisconsin digital collections')

Prater starts out with the note that it’s humbling to see that the Hydra and eSciDoc people have been working on the same problem. However, IMHO there’s no reason for embarrassment, as their basic solution is very elegant.

Using MODS for the top-level datastream (similar approach to Hydra). STRUCT datastream: a valid METS document, tying objects to a hierarchy. Important point: CMs don’t define structure, that’s for STRUCT and RELS-EXT.

Every object starts with a FirstClassObject, which points to 0-n child objects of arbitrary types. If there are zero, it’s a citation. To deal with sibling relationships (i.e. two pages in a specific order), an umbrella element is put on top with a METS resource map. This allows full METS functionality. Linking is done using simple STRUCT and RELS-EXT. The advantage over doing everything in RELS-EXT: RELS-EXT alone cannot express sequencing.

Now, to tie this ‘object soup’ together in an app (common problem for lots of objects, to turn the soup into a tree), the solution is simple: always use one monolithic disseminator, viewMETS(). This takes PID for FirstClassObject, returns valid METS doc containing object and all its (grand)children.

This is brilliant: a one-stop API to get the full object tree from a given PID, hiding the complexity of the umbrella object and the METS description involved.
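The 'soup into a tree' step can be sketched in a few lines of Python. This is only my illustration of the idea, not Wisconsin's disseminator: the `CHILDREN` table is a made-up stand-in for what STRUCT and RELS-EXT express, and the result is a nested dict rather than a METS document.

```python
# Hypothetical repository: PID -> ordered list of child PIDs.
CHILDREN = {
    "uw:book": ["uw:page1", "uw:page2"],
    "uw:page1": ["uw:img1"],
    "uw:page2": [],
    "uw:img1": [],
}


def view_tree(pid, children=CHILDREN):
    """Return the object and all its (grand)children as a nested dict."""
    return {
        "pid": pid,
        "children": [view_tree(child, children) for child in children.get(pid, [])],
    }


def flatten(tree):
    """List PIDs depth-first, preserving sibling order."""
    pids = [tree["pid"]]
    for child in tree["children"]:
        pids.extend(flatten(child))
    return pids


tree = view_tree("uw:book")
```

The caller only ever needs the top-level PID; sibling order survives because the child lists are ordered.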

The only part they’re not yet satisfied with is how to relate related items across FirstClassObjects, and how to handle relations between two top-level logical objects (e.g. journal and article) that are sometimes parent/child, sometimes not.

To which Asger chimed in that his ‘angle view’, demonstrated in the talk before, would be a possible solution for this. I saw them discussing later... I love it when a plan comes together.

When Ruby Met Fedora - Matt Zumwalt (Media Shelf)

A live demonstration of ActiveFedora which made my fingers itch to start coding straight away - until I remembered Ruby’s Unicode issues, rats.

The philosophy behind: use Fedora for long-lived content, but be able to quickly create short-timed services and apps.

ActiveFedora can be used without Rails, or even without Ruby (you can call it from the shell). However, Ruby’s OO model maps very well onto Fedora. The key difference with, say, Java or C++: you don’t know what kind of object you’ll get back from a call.

The demo shows the standard rails environment, except the Model directory. There, calls to ActiveRecord are replaced with calls to ActiveFedora. AF exposes Fedora objects with multiple properties. Qualified DC is built-in, but the has_properties function allows for easy extension.

An interesting advantage of this approach is that the methods as used by developers use the same jargon as the metadata users are used to. “they communicate much better when a method’s called dc.subject.”

There’s quite a bit to do ATM. They’ve received funding to hire a student to finally write real documentation. Other extensions: built-in SOLR integration, more generators for standard situations, basic CM integration. Interesting is the approach to integrating MODS: use the existing, mature java libraries, which is easy when using JRuby as interpreter.

Thursday, June 11, 2009

OR 09: eScidoc's infrastructure

eSciDoc Infrastructure: a Fedora-based e-Research Framework - Frank Schwichtenberg, Matthias Razum (FIZ Karlsruhe)

I had not expected this presentation to be as good as it was - it was a real eye-opener for me. It dealt solely and bravely with the underlying structure of eSciDoc, not the solutions built on top of it (such as PubMan). So, delving into the technical nitty-gritty.

So far, to me eSciDoc has been an interesting promise that seemed to take forever to materialize into non-vaporware. DANS wanted to use it as the basis for the Fedora-based incarnation of their data repository EASY, a plan they had to abandon when their deadline was looming near and the eScidoc API's were still not frozen. Apart from that, the infrastructure seemed also needlessly complex - why was another content model layer necessary on top of Fedora's own?

The idea behind the eSciDoc approach is to be user-centric, and in the case of the infrastructure the user is the programmer. What would she like to see, instead of Fedora's plain datastreams?
Tentative answer: an application-oriented object view.

eScidoc takes a full atomistic approach to content modelling: an Item is mapped to a fedora object (without assumption about the metadata profile - keeping it flexible). Then, Item has Component. An Item in practice consists of two fedora objects, with a ‘hasComponent’ relation between.

Objects can be in arbitrary hierarchies, except for the top hierarchies, which are reserved for ‘context’ and can be used for institutional hierarchies (a common approach, I can live with that). All relationships are expressed as structmaps.

So far so good, but now the really neat part.

Consequences of the atomistic content model for versioning: a change can occur in any of the underlying fedora objects of a compound object, with consequences for both.
The eScidoc API's store the Object lifecycle automatically. And when one Component changes or is added, the Item object also changes version, but not the other Components.
(The presentation slides are really instructive on this, worth checking out when they're online.)
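To check my own understanding of that versioning rule, a toy Python model. The class and method names are mine, not the eSciDoc API; it only shows that changing one Component bumps that Component and the Item, leaves the siblings alone, and logs the event.

```python
class Item:
    """Toy compound object: an Item plus independently versioned Components."""

    def __init__(self, components):
        self.version = 1
        self.components = {
            name: {"version": 1, "content": content}
            for name, content in components.items()
        }
        self.log = [("create", 1)]

    def update_component(self, name, content):
        # A new component starts at version 1, an existing one is bumped.
        component = self.components.setdefault(name, {"version": 0, "content": None})
        component["version"] += 1
        component["content"] = content
        # Any component change yields a new Item version; siblings are untouched.
        self.version += 1
        self.log.append((f"update:{name}", self.version))


item = Item({"pdf": b"v1", "dataset": b"raw"})
item.update_component("pdf", b"v2")
```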

This also delivers truly persistent IDs (multiple types supported: DOI, handle, etc.), separate from Fedora’s PIDs, which are not really persistent. And every version has one - both of the compound and of the separate Item objects. All changes (update/release/submit events etc.) are logged as events in a version log; if I remember correctly, this log can be used for rollback, i.e. it is a full transaction log.

This is the reason that the security model has to be in the eSciDoc layer, not Fedora's (though the same XACML policies & structures are used). This is eSciDoc's answer to the question common to many Fedora projects: how to extend Fedora's limited security? It might be best to take the whole security layer out of Fedora.

IMHO this is very exciting. This is about the last thing that a project would want to roll itself - it is incredibly complex to get working correctly and durably - and here it is, backed by a body of academic research - it is a German project after all. For me, this puts eSciDoc firmly on the shortlist of frameworks.

Wednesday, June 10, 2009

OR 09: blogosphere links

Nearly three weeks afterwards, it's time to round up the OR 09 posts... Unfortunately, library life got in the way. Meanwhile, why not read the opinions of these honoured colleagues, who are undoubtedly better informed:

loomware.typepad.com/ (Mark Leggott)

Open Repositories 2009 - Peter Sefton's trip report (ptsefton.com)
Open Repositories 2009 – Peter Sefton's further thoughts (caulcairss.wordpress.com)

Leslie Carr (repositoryman.blogspot.com)

John Robertson (Strathclyde)

http://repositoryblog.com/archives/18

http://www.weblogs.uhi.ac.uk/sm00sm/2009/05/

http://jhulibrariestravel.blogspot.com/2009/05/open-repostories-2009.html (Elliot Metsger, Johns Hopkins)

Finally, another bunch'o'links:
http://repositorynews.wordpress.com/2009/05/28/open-repositories-2009/

Friday, June 05, 2009

OR09: Four approaches to implementing Fedora

Open repositories 2009, day three, afternoon.

So far, the conference had not been disappointing, but now it got really interesting. The sessions I followed in the afternoon each highlighted a specific approach to the problem that IMHO has been standing in the way of wider Fedora acceptance: middleware.

What these four have in common is that they all leverage an existing OSS product and adapt it to use Fedora as datastore.

1. Facilitating Wiki/Repository Communication with Metadata - Laura M. Bartolo

Summary: interesting approach, a traditional Fez spiced up with MediaWiki. With minimal coding, a relatively seamless integration.
For this to work, contributors need to know MediaWiki markup, and to really integrate, must learn the Fez-specific search markup. Also, I'm not sure how well this can be scaled up to true compound objects, given Fez's limitations.

Notes:
Goal: dissemination of research resources. Specific sites for specific science fields, e.g. soft matter wiki, materials failure case studies.
MatDL: has a repository (Fedora+Fez), wants to open up two-way communication. Example: Soft matter expert community, set up with MediaWiki. "Mediawiki hugely lowers the barrier for participating": familiarity gives a low learning curve.

The question: how to integrate the repository with the wiki in both directions.

Thinking from a user-centric approach. Accommodate the user; support complex objects (more useful for research & teaching), thus describe their parts as individual objects.

Components:
- Wiki2Fedora
Batch run. Finds wiki upload files, converts the referencing wiki pages to DC metadata for ingest in the repository (the wiki has comment, rights and author sections -> very doable). Manual post-processing (Fez Admin review area function).
- Search results plug-in for wiki: displays repository results in wiki search. Adds to MediaWiki markup, to enable writing standard Fez queries in the content.
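The wiki-to-DC conversion step is simple enough to sketch. What follows is my guess at the shape of it in Python, not the Wiki2Fedora code: the `page` dict and the section-to-element mapping are assumptions, only the two namespace URIs are the standard Dublin Core and OAI ones.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Assumed mapping from wiki file-page sections to DC elements.
SECTION_TO_DC = {
    "title": "title",
    "author": "creator",
    "comment": "description",
    "rights": "rights",
}


def wiki_to_dc(page):
    """Map the sections of a (hypothetical) wiki file page onto DC elements."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for section, element in SECTION_TO_DC.items():
        if page.get(section):  # skip sections that are absent or empty
            ET.SubElement(root, f"{{{DC_NS}}}{element}").text = page[section]
    return root


page = {"title": "Fracture.png", "author": "J. Doe",
        "comment": "Crack in a weld", "rights": "CC-BY"}
record = wiki_to_dc(page)
xml_text = ET.tostring(record, encoding="unicode")
```

Anything the mapping cannot fill stays out of the record, which is where the manual review step would come in.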

Sites: Repository - Wiki

2. Fedora and Django for an image repository: a new front-end - Peter Herndon (Memorial Sloan-Kettering Cancer Center)

Summary: using Django as a CMS, internally developed adapters to Fedora 3.1.

My gut feeling: A specific use case, images only, so rather limited in scope. Despite choosing the 'hook up with mainstream package' strategy, effectively still a NIH-based rolling their own. That makes the issues even more instructive.

Notes:
Adapting a CMS that expects SQL underneath is challenging - the plugin needs to be a full object-to-relational database mapper.
Also, Fedora 'delete' caused 'undesired results', 'inactive' should be used.
Further, some more unexpected oddities: they had to write their own LDAP plugin to make it work; Django has tagging, but again a plugin was needed to limit this to controlled vocabularies. Performance was not a problem.
Interesting: repository for images only, so EXIF and the like can be used - tags added using Adobe Bridge! The tested, successful strategy: make use of what is already familiar.
In the Q&A the question came up: why use Fedora in this case anyway? Indeed the only reason would be preservation, otherwise it would have saved a lot of trouble to use Django Blobstore.

The django-fedora plugins are available at bitbucket.org.

3. Islandora: a Drupal/Fedora Repository System - Mark A Leggott (University of PEI)

Summary:
Islandora looks *very* promising. I noted before (UPEI's Drupal VRE strategy) that UPEI is a place to watch - they are making radical choices with impressive outcomes.

Notes:
UPEI's culture is opensource friendly. They use Moodle and Evergreen (apparently, they were the first Evergreen site in production).

Rationale: opensourcing an in-house system reinforces good behaviour: full documentation, quality code.

As noted before, UPEI's repositories are hidden behind VRE (see [link]). VRE's are geared towards the researchers. Example of approach: the first thing people do when they set up a VRE is create a webpage. That's what a project needs, and so it's used as a hook to reel people in, they're up and running within a few hours.

The VRE is Drupal; Fedora is for data assets, metadata, policies.
Base Islandora consists of three plugins: Drupal-Fedora connection plugin, xacml filter, rule engine for searches.

This 'rule engine' is indeed very cool.
In a later private conversation with Mark Leggott, he clarified that Islandora indeed uses an atomistic complex object model for research data; the rule engine declares how these can be searched from within Drupal. Example, a dataset consisting of a number of measuring points, each with a set of instruments, atomistically in Fedora; can be queried as 'all the results from specific measure point', 'all the result from instrument x', 'instrument x in specific period' etc.
We haven't reached Nirvana yet, to make the deconstructing of the data objects possible, they have to adhere to specific format (xml). But it's impressive nevertheless.

Other Drupal plugins add functionality for specific data. Impressive example: the Drupal FCK editor used as TEI editor; after editing, it automatically adds a version to the datastream. Very cool and 'Just Works' (cheery tweet).

Marine Natural Products Lab: best example of the setup for VRE which includes extensive repository (searchable within the critter xml).

Previous versions used drupal 5/fedora 2, not maintained; currently drupal 6/fedora 3.1

Q: did you replace the drupal storage layer, or do you sync?
A: sometimes it’s saved in the Drupal layer, when it doesn’t need to go into Fedora (temporary data, while we build the content model). The Drupal filesystem is a potential bottleneck with large datablobs.

Q: are you bound to content models?
A: standard fedora cm’s, you can build them yourself or change the delivered one. The models are exposed, you can see how it works. We first installed Fez to see how Fedora worked.

4. Project Hydra: Designing & Building a Reusable Framework for Multipurpose, Multifunction, Multi-institutional Repository-Powered Solutions - Tom Cramer (Stanford University), Richard Green (University of Hull), Bess Sadler (University of Virginia) et al.

Summary:
I'm even more excited about Hydra than about Islandora. Different approach: create "A lego set of services". In other words, a toolkit for the common parts of applications.
It all looks really good. Two gotchas though. Firstly, it is still a work in progress. Can we afford to wait? Secondly, there are issues with the Unicode support of Ruby on Rails.

For more info: D-Lib.

Notes:
Modelled after the current 12+ use cases of repositories in use at partner institutions, both institutional and personal.
It needs generic templates - which sometimes may do the job - otherwise it won’t come off the ground.
Hydra will have common content models and datastream names. But ultimately they want Hydra to be able to cope with almost anything. A MODS datastream will always have to be there, but not necessarily as primary (so it can be done via a disseminator).

Four multifunctional sections:

Deposit
manage (edit objects, set access)
search & browse
deliver
plus plumbing: authent, author, complex workflow

Using Rails with ActiveFedora. Turns out Rails lives up to its reputation: they are way ahead of their initial roadmap, now expect full production app by fall.

Specs 3/4 ready, coding 1,5/4.
Demo: http://hydra-dev.stanford.edu/etds

Presentation builds on top of the Blacklight OPAC. Virginia already has a beta version of their catalogue up using Blacklight.

Monday, June 01, 2009

OR09: On the new DuraSpace Foundation, and Fedora in particular

Open Repositories 2009, day 3, morning: three sessions on Fedora.

The morning started with a joint presentation by Sandy Payette (Fedora Commons) and Michele Kimpton (DSpace Foundation), focussing on strategy and organisation; after caffeine break, Fedora+DSpace tech overview by Brad McLean; finally, developers' open house.

I'll cover it in one blog post (this or09 series is getting a bit long in the tooth, isn't it?). For the actual info on DuraSpace and all, see the DuraSpace website. The tech issues were covered more in depth in further sessions.

The merger is by now almost old news, though the incorporation still lies in the future: Fedora Commons and the DSpace user group will become DuraSpace. The 'cloud' product, which originally had the same name, is renamed DuraCloud.

Not the easiest of presentations, as there is a good deal of scepticism around the merger, and not just on the twitter #or09 channel. Payette and Kimpton handled it very professionally, dare I say gracefully. Both standing on the floor, in front of the audience, talking in turns (did I imagine it, or did I really hear them taking over a sentence, in Huey & Dewey style?), while an assistant standing behind the laptop was going back and forth through the slides in perfect timing.

All in all, they pulled it off to come across as a seamless team. That bodes well.

Also welcome was the frankness in the Q&A (as well as later in the developers' open house). After noting some difficulties in finding the right strategy for open source development: "we do not aim to mold DSpace's opensource structure to the Fedora core committer, on the contrary".

"We have to ask ourselves: are we really community driven in the Fedora project? We've been closed in the past, we're opening up." Fedora has started using a new tracker, actually modelled on DSpace's model; "please use it, our tracker is our new inbox."

On the state of Fedora - many and diverse new users.

Escidoc is now deployable.

WGBH OpenVault - including annotated video

Forced Migration Online

Jewish Women Archive - runs in EC2, first of a new wave of smaller archives now coming online using limited resources.

Notably missing on a slide listing 'major contributors': Mediashelf, Sun, and Microsoft Research: VTLS. Possibly a sponsoring issue? It was more than a bit odd, given their standing in the past.

Q: "How do you see the future of DSpace vs. Fedora - do they compete?"

A: "Fedora’s architecture is great, but we also need ‘service bundles’. CMS style on top for instance. The architecture will stay open for any kind of app on top. DSpace is going the other direction. Opportunity is to make sure we're not doing identical things with different frameworks."

It is *so* easy to read this as 'the products will meet in the middle', but this was carefully avoided. However, in the tech talk later it was mentioned that Fedora-DSpace replication back and forth experiments are actively worked on.

I think I'm not alone in thinking that the products will merge eventually. It will take some time, but they will.

Q: (cites another software company merger, IIRC Oracle and Peoplesoft) – merger brings great unrest in communities, which one is going to die? Are F&D moving together? Technical and cultural changes for both communities? etc.

A: Payette: any kind of software eventually becomes obsolete. We are determined not to let that happen, and for that it needs to be modular and organic. Side by side, cause they both do things well. When overlap starts to happen, that may change, but by the module.

Peter Sefton chimed in: very positive. Right decision at the right time. Focus on cloud computing is essential, feels that this is what we’re moving towards, and our current monolithic repositories need to adapt to that.

Some DSpace 1.x upcoming features: statistics, embargo, batch editing. I don't know that much about DSpace, and it shows: I was surprised that these weren't covered yet. Esp. batch editing and embargo, pretty basic features. I know too little of DSpace to judge the announced 2.0 features, apart from the DuraCloud integration using Akubra.

Fedora 3.2 highlights:

SWORD API 1.3. Of course. Nice though
New web admin client. Not all of the features are implemented, so the Java client hasn't been deprecated - it will be in future. This is a big deal, as the client is also useful for metadata editing staff.
Akubra: stores files by ID; pluggable, stackable, multiplexing (i.e. working on multiple storage environments that look like one big one to the API). Experimental, meaning included but not turned on by default.
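What 'multiplexing' and 'stackable' mean here is easiest to show in code. A small Python sketch of the idea only: Akubra itself is a Java library and this is not its API; `MemoryStore`, `MultiplexingStore` and the routing function are names I made up.

```python
class MemoryStore:
    """Simplest possible blob store: bytes kept in a dict, keyed by ID."""

    def __init__(self):
        self.blobs = {}

    def put(self, blob_id, data):
        self.blobs[blob_id] = data

    def get(self, blob_id):
        return self.blobs[blob_id]

    def has(self, blob_id):
        return blob_id in self.blobs


class MultiplexingStore:
    """Presents several backing stores as one; route() picks the store for new blobs."""

    def __init__(self, stores, route):
        self.stores = stores
        self.route = route

    def put(self, blob_id, data):
        self.stores[self.route(blob_id)].put(blob_id, data)

    def get(self, blob_id):
        for store in self.stores:
            if store.has(blob_id):
                return store.get(blob_id)
        raise KeyError(blob_id)

    def has(self, blob_id):
        return any(store.has(blob_id) for store in self.stores)


fast, bulk = MemoryStore(), MemoryStore()
store = MultiplexingStore(
    [fast, bulk],
    route=lambda blob_id: 1 if blob_id.startswith("video:") else 0,
)
store.put("video:42", b"large file")
store.put("text:7", b"hello")
```

Because the multiplexer has the same three methods as a plain store, one can be used as a backing store of another, which is the 'stackable' part.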

Finally, the Fedora developer open house was like getting the pulse of the developer community. Summary: there are pains, communication has been problematic, with a gap between the committers and the community. My impression is that it is finally being talked about, and the core developers in the panel admitting that a change is needed. A constructive and open approach.

Saturday, May 30, 2009

OR09: Repository workflows: LoC's KISS approach to workflow

Open Repositories 2009, day 2, session 6b.

Transfer and Inventory Components of Developing Repository Services

Leslie Johnston (Library of Congress)

My summary:

A practical approach to dealing with data from varying sources, keep it as simple as possible, but not simpler.

The ingest tools look very useful for any type of digitization project, especially when working with an external party (such as a specialized scanning company).

The inventory tool may be even more useful, as lifecycle events are generally not well covered by traditional systems, be it CMS or ILS.

Background

LoC acts as durable storage deposit target for widely varying projects and institutions. Data transfers for archiving range from a USB stick in the mail to 2 TB transferred straight over the network. The answer to dealing with this: simple protocols, developed together with uc digilib (see also John Kunze).

Combined, this is not yet a full repository, but it covers many aspects of ingest and archive functionality. The rest will come. Aim: provide persistent access at file level.

Simple file format: BagIt

The submitter is asked to describe the files in BagIt format.

BagIt is a standard for packaging files; METS files will fit in there, too. However, BagIt was created because we needed something much, much, much simpler. It’s not as detailed; the description is a manifest, and it may omit relationships, individual descriptions, etc. It is very lightweight (actually too light: we’ve started creating further profiles for certain types of content).

LoC will support Bagit similarly and simultaneously to MODS & METS.
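Just how simple a bag is shows best in code. A minimal Python sketch using only the standard library, not the LoC tools: a `data/` directory with the payload, a `bagit.txt` declaration and a checksum manifest. The layout follows the BagIt specification as later published in RFC 8493 (hence version 1.0 and SHA-256, which postdate this talk); the function name and file names are mine.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, files):
    """Write payload files under data/ plus the declaration and a SHA-256 manifest."""
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(files.items()):
        (bag / "data" / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # One manifest line per payload file: checksum, whitespace, relative path.
        manifest_lines.append(f"{digest}  data/{name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


bag = make_bag(tempfile.mkdtemp(), {"scan001.tif": b"fake image bytes"})
```

No relationships, no per-file descriptions: just the files and a way to check they arrived intact.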

Simple tools

Simple tools for ingest:

- parallel receiver (can handle network transactions over rsync, ftp, http, https)

- validator (checks file format)

- verifyit (checksums files)

These tools are supplied as a Java library, a Java desktop application, and the LocDrop webapp (prototype for SWORD ingest).
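The checksum step is the heart of it, so here is what such a check boils down to, sketched in Python. This is not the verifyit tool, just my illustration under the assumption of a `manifest-sha256.txt` with one "checksum path" pair per line; the throwaway bag at the bottom only exists so the check can be run as-is.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_bag(bag_dir, algorithm="sha256"):
    """Return the payload paths that are missing or whose checksum differs from the manifest."""
    bag = Path(bag_dir)
    failures = []
    manifest = bag / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected:
            failures.append(rel_path)
    return failures


# Tiny throwaway bag for demonstration.
bag = Path(tempfile.mkdtemp())
(bag / "data").mkdir()
(bag / "data" / "a.txt").write_bytes(b"hello")
(bag / "manifest-sha256.txt").write_text(
    hashlib.sha256(b"hello").hexdigest() + "  data/a.txt\n", encoding="utf-8"
)

intact = verify_bag(bag)                      # nothing to report
(bag / "data" / "a.txt").write_bytes(b"tampered")
damaged = verify_bag(bag)                     # the changed file is flagged
```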

Integration between transfer and inventory is very important: trying to retrieve the correct information later is very hard.

After receiving, inventory tool records lifecycle events.

Why a standardized tool: 80% of workflow overlap between projects.

All tools are available as open source [sourceforge]. What's currently missing will be added soon.

OR09: Repository workflows: ICE-TheOREM, semantic infra for theses

Open Repositories 2009, day 2, session 6b.

ICE-TheOREM - End to End Semantically Aware eResearch Infrastructure for Theses

Jim Downing (University of Cambridge), Peter Sefton (University of Southern Queensland)

Summary: great concept, convincing demonstration. Excellent stuff.

Part of the ICE project, a JISC-funded experiment with ORE.

[paper] (seems stuck behind login?)

Importance of ORE: “ORE is a really important protocol – it has been missing for the web for most of its life so far.” (DH: Amen!)

Motivations for TheOREM: check ORE – is it applicable and useful? What are different ways of using? How do SWORD and ORE combine?

Practically: improving thesis visibility, with embargoes as enabler.

Interesting: in the whole repository system, the management of embargoes is separated from the repository by design. A special system serves resource maps for the unembargoed items, and the IR polls these regularly. This reflects the real-world political issues, and makes it easier to bring about quite radical changes.
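The division of labour between the embargo system and the repository can be sketched as follows. This is a Python toy under my own assumptions (in-memory list instead of ORE resource maps, made-up identifiers and dates), meant only to show that the repository never sees an item before its embargo lapses.

```python
from datetime import date

# Hypothetical state held by the embargo system.
THESES = [
    {"id": "thesis:1", "embargo_until": date(2009, 1, 1)},
    {"id": "thesis:2", "embargo_until": date(2011, 6, 30)},
    {"id": "thesis:3", "embargo_until": None},
]


def released(theses, today):
    """What the embargo system exposes: only items without a running embargo."""
    return [
        t["id"]
        for t in theses
        if t["embargo_until"] is None or t["embargo_until"] <= today
    ]


def poll(already_harvested, theses, today):
    """One polling round by the repository: pick up anything newly released."""
    return [pid for pid in released(theses, today) if pid not in already_harvested]


first = poll(set(), THESES, date(2009, 5, 19))
later = poll(set(first), THESES, date(2011, 7, 1))
```

The repository needs no embargo logic of its own; it just polls and ingests whatever is offered.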

Demonstrator (with the Fascinator) with one thesis, with reference to data object: molecule description in chemical markup language (actual data).

Simple authoring environment in OpenOffice Writer (Word is also supported), with a stylesheet + convention based approach. When uploaded, the doc is taken apart into atomistic XML objects in Fedora. The chemical element is a separate object with a relation to the doc, versioning etc.

Embargo metadata is written as text in the doc (on the title page; the date is noted using a convention, KISS approach), and a style (p-meta-date-embargo) is applied. The thesis is ingested again - and voila, the part of the thesis with the embargo is now hidden.
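The style-as-metadata trick is easy to imitate. A Python sketch under loud assumptions: the document is reduced to (style, text) pairs, the date convention is taken to be ISO (the talk did not spell it out in my notes), and the embargo applies to the whole document rather than to a part. Only the style name p-meta-date-embargo comes from the demo.

```python
from datetime import date

EMBARGO_STYLE = "p-meta-date-embargo"


def embargo_date(paragraphs):
    """Return the date in the first paragraph carrying the embargo style, else None."""
    for style, text in paragraphs:
        if style == EMBARGO_STYLE:
            # Assumed convention: an ISO date such as 2010-01-01.
            return date.fromisoformat(text.strip())
    return None


def is_hidden(paragraphs, today):
    """A document stays hidden until its embargo date has been reached."""
    until = embargo_date(paragraphs)
    return until is not None and today < until


doc = [
    ("Title", "On Molecules"),
    (EMBARGO_STYLE, "2010-01-01"),
    ("p", "Chapter 1 ..."),
]
```

The author only has to type a date and pick a style; everything else happens at ingest.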

This simple system also allows dialogue between student and tutor - remarks on the text - to be embedded in the document itself (and hidden from the outside by default). It looks deceptively like Word's own comments, which I imagine will ease the uptake.

Sidenote: policy in this project is that only submitter can ever change embargo data. So it is recommended to use openID rather than institutional logins, as PhD graduates tend to move on, and then nobody can change it anymore.

Q (from Les Carr): supervisors won’t like to have their interaction with students complicated by tech. What is their benefit?

A: automatic backing up is a big benefit, also of the workflow (ie. the comments in the document text). We *know* students appreciate it. Supers may not like it but everyone else will, and then they’ll have to.

(note DH: this is of course in the sciences, it will be an interesting challenge to get the humanities to adhere to stylesheet and microformatting conventions)

Q: can this workflow also generate the ‘authentic and blessed copy’ of the final thesis?

A: Not in project scope, we still produce the pdf for that. In theory this might be a more authentic copy, but they might scream at the sight of this tech.

OR09: Social marketing and success factors of IR’s.

Open Repositories 2009, day 2, session 5b.

Social marketing and success factors of IR’s: two thorough but not very exciting sessions. Though the lack of excitement is maybe also because the message is quite sobering: we already know what needs to be done, but it is very hard to change the (institutional) processes involved.

Social marketing approach to IR, a Canadian perspective.

(where social marketing doesn’t stand for web2.0 goodness, but for marketing with the aim of changing social behaviour, using the tools of commercial marketing).

Generally, face to face contact works best - on faculty scale, or in smaller institution like UPEI.

One observation that stuck with me is that the mere word repository is passive, where we want to emphasize exposure. This is precisely our problem as a whole in moving the repository into an active part at the center of the academic research workflow, instead of a passive end point.

Finally, the list of good examples started out with Cream of Science! We tend to take it for granted here in the Netherlands, and focus on where we're stuck; it’s good to be reminded how well that has worked and still does.

Secrets of success - identifying success factors in IR's.

Interim news from uMich Miracle project (Making Institutional Repositories A Collaborative Learning Environment).

Not very exciting yet, might change when they’ve accumulated more data (it’s a work in progress, five case studies of larger US institutions, widely varying in policy, age, technology).

Focus on “outcome instead of output”.

Focus on external measurements of success, instead of internal (ie number of objects etc). Harder to enumerate, less easy, but gets more honest results.

Wednesday, May 27, 2009

OR09: Keynote by John Wilbanks

Open Repositories 2009, day 1, keynote.

Locks and Gears: Digital Repositories and the Digital Commons - John Wilbanks, Vice President of Science, Creative Commons

Great presentation - in content as well as in format. Worth looking at the slides [slideshare - of a similar presentation two weeks earlier]. [Which was good, because it was awkwardly scheduled at the end of the afternoon (great with fresh jetlag), straight after the previous panel session without so much as a toilet break.]

The unfortunately familiar story of journals on the internet, scholars' rights eroding, which causes interlocking problems that prevent the network effect.

Choice quotes:
“20 years ago, we would have rather believed there be a worldwide web of free research knowledge, than Wikipedia.”
"The great irony is that the web was designed for scientific data, and now it works really well for porn and shoes."

The CC licenses are a way of making it happen with journals. However, for data even CC-BY is making it hard to do useful integration of different datasets. Survey of 1000 bio databases: >250 different licenses! Opposite law of open source software: the most conservative license wins.

Example of what can happen if data is set free: Proteomecommons.org: BitTorrent for genomes. Thanks to CC Zero.

What can we do?
Solve locally, share globally.
Use standards. And don’t fork them.
Lead by example.

Q: opinion on Wolfram Alpha? Or Google Squared?
A: pretty cool, doubts about scaling. It may be this or something else, rather open source than ‘magic technology’. But it’s a sign that the web is about to crack.
“The only thing that’s proven to scale is distributed networks.”

(my comment - with an estimated 500.000 servers, that is precisely what Google is...)

OR09: Panel session - Insights from Leaders of Open Source Repository Organizations

Open repositories 2009, day 1, session 4.

A panel with the big three open source players (DSpace’s Michele Kimpton and Fedora Commons’ Sandy Payette, freshly merged into DuraSpace, and ePrints’ Les Carr) and Lee Dirks from Microsoft. Zentity (no, not Zentity - 1.0 was officially announced at this conference) brings up lots of good questions. Unfortunately it didn’t get to an interesting exchange of ideas.

I’ll concentrate on Microsoft, as they were the elephant in the room. Warning: opinions ahead.

Microsoft is walking a thin line, their stance has been very defensive. Dirks started out quipping that “We wanted to announce Microsoft merging with ePrints, we got together yesterday, but we couldn’t agree on who was going to take over who.”

He went on to stress that this is Microsoft Research and they're not required to make a profit. Putting on a philanthropist guise, he continued that their goal is to offer an open source repository solution to organizations that already have campus licenses. “How can we help you use software that you already paid for but maybe don’t use?”. They claim they don't want to pull people away from open source solutions.

The most interesting parts were what he was *not* saying. Which open source does MS not want to pull us away from - Java? MySQL? Eclipse? Or did he only mean open source repository packages?
Yeah right… this is about getting Visual Studio, IIS, SQL Server and, most dangerous of all, SharePoint a foot in the door.

An audience question that nailed the central issue: "The question will be lock-in. commitment in other parts of the lifecycle are therefore more important. Zentity hooks you up everywhere in the MS stack."
Dirks responded with "Everything we’ve done, is built on open API’s, be it Sharepoint or Office or whatever. You could reconstruct it all yourself."

Well, with all respect to the Mono and Wine efforts, I wouldn't call Sharepoint and Office API's you could easily replace. The data will still be in a black box. Especially if you want to make any use of the collaboration facilities. Having open API's on the outside is fine and dandy, but one thing we've learned so far with repositories is that it is hard to create an exchange (metadata) format that is neither too limited nor so complicated it hinders adoption.

On an audience question about his stance on data preservation, Dirks initially replied that ODF would solve this, including provenance metadata. No mention of the controversy around this file format - what use is an xml format that cannot be understood? - or of filetypes outside the Office Universe.

When this debate stalled, Sandy Payette turned the mood around by mentioning that MS has contributed much to interoperability issues. It is indeed good to keep in mind that MS is not just big and bad - they aren't. A company that employs Accordionguy can't be all that bad. The trouble is, you have to stay aware and awake, for they aren't all that good, either. Imagine an Office-style lock-in for collaboratories.

Tuesday, May 26, 2009

OR09: NSF Datanet-curating scientific data

Open Repositories 2009, Day 1, session 3. NSF Datanet-curating scientific data, John Kunze and Sayeed Choudhury.

The first non-split plenary (why a large part of the first two days consisted of 'split plenaries' baffled me, and I was not the only one).

Two speakers, two approaches. First John Kunze from UCDL, focussing on the micro level with a strategy of keeping it simple. "Imagining the non-repository", "avoid the deadly embrace" of tight standards: decouple by design, lower the barrier of entry.

One of the ways to accomplish this is by staying lo-tech: instead of fullblown database systems, use a plain file system and naming conventions: pairtree. I really like this approach. I've worked in large digitization projects with third parties delivering content on hard disks. They balk at databases and complicated metadata schemes, but this might just be doable for them. Good stuff.
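To give an idea of how lo-tech this is, here is a rough sketch of the pairtree naming convention as I understand it: the identifier is made filesystem-safe and then chopped into two-character directory names. This is my own simplified illustration, not CDL's code. The function name `pairtree_path` is made up, and the step that hex-encodes unusual characters is left out.

```python
def pairtree_path(identifier: str) -> str:
    """Map an identifier to a pairtree-style directory path (simplified).

    Assumption: the identifier holds only plain visible ASCII, so the
    hex-encoding step for special characters is skipped in this sketch.
    """
    # Swap the characters that are awkward in file names: / : .
    cleaned = identifier.translate(str.maketrans({"/": "=", ":": "+", ".": ","}))
    # Split into chunks of two characters; the last chunk may hold just one.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs)


print(pairtree_path("ark:/13030/xt12t3"))
```

The object's files then live in a folder at the end of that path, so anyone with a file browser can find them without a database.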

CDL has a whole set of curation microsystems, as they call it. I'm going to keep an eye out for this.

The second talk, by Sayeed Choudhury (Johns Hopkins), focussed on the macro level of data conservancy. This was more abstract, and he started out with the admission that "we don’t have the answers, there are unsolved unknowns - otherwise we wouldn’t have gotten that NSF grant".

Interesting: one of the partner institutions (not funded by NSF) is Zoom Intelligence – a venture capital firm, interested in creating software services on research data. First VC's bought into ILS, now they pop up here... we must be doing something right!

Otherwise, the talk was mostly abstract and longer term strategy.

Monday, May 25, 2009

OR09: Institutional Repositories: Contributing to Institutional Knowledge Management and the Global Research Commons

Day 1, session 2b.

Institutional Repositories: Contributing to Institutional Knowledge Management and the Global Research Commons - Wendy White (University of Southampton)

Insightful, passionate kick-ass presentation, with some excellent diagrams in the slides (alas I found no link yet), especially one that puts the repository in the middle of the scientific workflow. The message was clear: tough times ahead for repositories – we have to be an active part of the flow, otherwise we may not survive.

Current improvements (see slides: linking into HR instead of LDAP to follow history of deployment, lightbox for presentation of nontext material) are strategy-driven, which is a step forward from tech-driven, but still piecemeal.

Predicts grants for large scale collaboration processes could be tipping point for changing lone researcher paradigm.

(in my opinion, this may well be true for some fields, even in the humanities, but not for all. Interesting that for instance The Fascinator Desktop aims to serve those ‘loners’).

Stresses that open access is not just idealism, it can also benefit in highly competitive fields – cites a research group that got a contract because the company contacted them after they could see what their researchers were doing.

“build on success stories: symbols and mythology”.
“Repository managers have fingers in lots of pies, we are in a very good position to take on the key bridging role.”
It will however require a culture change, also in the management sphere. In the Q&A she noted that Southampton is lucky to have been through that process already.

All in all, a good strategic longer term overview, and quite urgent.

Sunday, May 24, 2009

OR09: PEI's Drupal strategy for VRE and repositories

OR09, day 1, session 2a. Research 2.0: Evolving Support for the Research Landscape by Mark Leggott (University of PEI) - [slides here] - [blog here]

Small province in Canada, middle of nowhere, pop 140k, only uni on the island. UPEI is doing some very good stuff, made some radical choices. They fundamentally transformed the library from traditional staff to techies. Number of staff didn’t change (25), but the number of techs increased from 1 to 5, plus a pool of freelancers.

VRE's using Drupal

Strong push for VRE’s, using Drupal as platform. Low entry barrier: any researcher can request one! All customisations are non-specific as a rule, so all users benefit in the end. If researcher brings additional funding, contract devs are hired to speed up the process.

Some clients have developed rich Drupal plugins themselves (depends on a willing postgrad :-)

Currently 50+ VRE’s. Example of a globe-spanning VRE: Advancing Interdisciplinary Research in Singing

But the same environment is also used for local history projects with social elements (“tag this image”).

Why go open source? Improves code and documentation quality by embarrassment factor: “Going opensource is like running through the hotel at night naked – you want to be at least presentable”.

Repository: Drupal+Fedora=Islandora

PEI developed Islandora as frontend for the Fedora repository. However, from the user's POV it is completely hidden: they log in to the VRE, which silently handles depositing in the repository.

Both Drupal and Fedora are ‘strong systems’ with a lot of capabilities. However by definition all data and metadata go in Fedora, to separate data from application layer and make migration possible. This needs to be strongly enforced as some things are easier in Drupal.

Very neat integration between data objects in repository and VRE: Researchers can search specifically within the objects, as in “search for data sets in which field X has value between 7 and 8”. Done by mapping the data to an xml format, then mapping xml fields to search params. For fields where xml data formats are available and commonly used this is a real boon (example of marine biology).
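To make the idea concrete, here is a toy sketch of that kind of search: datasets stored as xml, and a query that maps a field name plus a range onto the xml elements. Everything in it is my own assumption for illustration (the `<dataset>` format, the `salinity` and `depth` fields, the `demo:` identifiers and the `find_in_range` helper); it is not how Islandora actually implements it.

```python
import xml.etree.ElementTree as ET

# Hypothetical data objects, keyed by a made-up identifier.
DATASETS = {
    "demo:1": "<dataset><salinity>7.4</salinity><depth>12</depth></dataset>",
    "demo:2": "<dataset><salinity>9.1</salinity><depth>30</depth></dataset>",
    "demo:3": "<dataset><salinity>7.9</salinity></dataset>",
}


def find_in_range(datasets, field, low, high):
    """Return the ids of datasets where `field` has a value in [low, high]."""
    hits = []
    for pid, xml_text in datasets.items():
        root = ET.fromstring(xml_text)
        for node in root.iter(field):
            try:
                value = float(node.text)
            except (TypeError, ValueError):
                continue  # skip empty or non-numeric fields
            if low <= value <= high:
                hits.append(pid)
                break  # one matching value is enough for this dataset
    return hits


print(find_in_range(DATASETS, "salinity", 7, 8))
```

A real setup would push this into the search index rather than parse every object on each query, but the mapping from xml field to search parameter is the same idea.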

Great stuff altogether. The small size may give them an advantage, they operate like a startup, listen to their users, pool resources effectively and are not afraid to make radical choices.

BTW fifteen minutes into the talk I connected the acronym PEI with the name Prince Edward Island. Either PEI is so famous in the repository world that it needn't be explained at all, or it was mentioned so briefly that it slipped by me...

OR09: Purdue's investigation on Research Data Repositories

OR09 day 1, session 2a: Michael Witt (Purdue University) "Eliciting Faculty Requirements for Research Data Repositories"

Preliminary results of an investigation into what researchers want regarding data (repositories). Some good stuff. Hope the slides will be published soon - or the report for that matter.

See Sean's weblog for the ten primary questions, good for self-evaluation also. Mark Leggott then quickly added an additional 11th question to his slideshow - how much is in your wallet...

Method: interviews and followup survey with twenty scientists, transcribed (using Nvivo). “It was like drinking from a firehose.” For each, a “data curation profile” was created, with example data & description. Will be interesting when it comes out.

OR09: on subject based repositories

Open Repositories 2009, day one, session 1b.

Phew! OR09 is over, and my jetlag almost. An intense conference that was certainly worth it, the content was generally interesting and well-presented. I'll be posting my conference notes here the coming few days.

The first session on Monday morning had two talks on two subject based repositories. The planned third one, on a Japanese one, was cancelled - unfortunately, as I know very little of what’s happening there regarding OA.

First came Julie Ann Kelly (University of Minnesota) on AgEcon, a repository for Agricultural Economics, a field with a strong working paper tradition. It was set up in the gopher days (not so surprising, as the critter originated in Minnesota).

Interesting was the reason: in this field, working papers are citable, but the reference format was a mess.

Even more interesting: because of this, it also became the de facto place for depositing appendices to articles - datasets! The repository accepts them and they have the same citing format. There is a lesson here... solve a real problem, and content will come.

Usage statistics: only 53% of downloads come from people, 43.6% is Googlebot (rest other spiders). 66% of visitors come through Google straight to results, not through the frontend anymore. Then 19% come from some other search engines: leaves 14% coming through the front.

Further notes:

Why is life easier in a subject repository?

Focussed topic makes metadata easier, common vocabularies exist etc.
Recruitment (of other institutions) is easier (specialists in one profession tend to meet frequently, recruiting can piggyback on conferences etc).

And why is it harder?

Organising the community is hard work - 170 institutions with each between 1 and 300 submitters creates a lot of traffic on quality issues. They frequently hire students for the correcting.

Minnesota is consolidating its repositories from 5-6 different systems to Islandora. AgEcon will be one of them.

They want to use this Drupal based system also to add social networking, akin to Ethicshare. Ethicshare is interesting: a social citation manager (a la Citeulike/Bibsonomy) plus repository plus social network plus calendar and then some more, for a specific field of study, in this case ethics research. Commoditisation coming?

The second subject repository was on Economists Online, presented by Vanessa Proudman of Tilburg University. Interesting to see this is in many ways the opposite approach. EO is a big European project that works top-down, tries to get the big players aboard first as incentive for the others, and emphasizes quality above all. Whereas AE was a grassroots bottom-up model, that empowered small institutions.

It's a work in progress, only mockups shown. These look slick, with a well thought-out UI. Interesting: with every object in the result list, statistics will be shown inline (ajax), and can be downloaded in multiple formats.

Small pilot with 10 datasets per participating institution, DDI format, Dataverse as preferred solution. Provenance of datasets is very complicated: there are many contributors to the data life cycle, dataset owners, sources, providers, and all must be credited.

Like AE, EO stresses that subject-based repositories have different characteristics. They will organize a dedicated conference on subject repositories in January 2010 in London, as they note that the subject rarely comes up at general repository conferences.

Interested in attending? Mail subjectrep_parts@lists.uvt.nl
