Wednesday, June 29, 2011

OR11: New in EPrints 3.3: large scale research data, and the Bazaar.

As I mentioned in the overview, I was very impressed by what's happening in the EPrints community. The new features of the upcoming 3.3 release are impressive, as they seem to strike the right balance between pragmatism and innovation. Thanks to an outstandingly enthusiastic and open developer community, they're giving DSpace (and to a lesser extent DuraSpace) a run for their money.

Red Bull could've been the motto of the EPrints community

Support for research data repositories

The new large-scale research data support is also a hallmark of pragmatic simplicity. EPrints avoids getting very explicit about subject data classification and control, taking a generic approach that can be extended.

Research data can come in two container datatypes, ‘Dataset’ and ‘Experiment’. A Dataset is a standalone, one-off collection of data. The metadata reflects the collection. The object can contain one or more documents, and must also have a read-me file attached, which serves as a human-oriented manifest: machine-oriented complex metadata is possible, but requiring it would deter actual use.

The other datatype is Experiment. This describes a structural process that may result in many datasets. The metadata reflects process and supports the Open Provenance Model.

Where the standard metadata doesn’t suffice, one of the data streams belonging to the object can be an XML file. If I understood correctly, XPath expressions can then be used for querying and browsing. Effectively this removes the shackles of the standard metadata definitions and creates flexibility similar to Fedora. It's very similar to what we're trying to do in the FLUOR project with a Sakai plugin that acts as a GUI for a data repository in Fedora. Combining user-friendliness with configurable, flexible metadata schemes is a tough one to pull off; I'll certainly keep an eye on the way EPrints accomplishes this.
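
To make the idea of querying an attached XML data stream a bit more concrete, here is a minimal sketch in Python. It is not EPrints code: the element names, the sample document and the helper function are all invented for illustration, and it only uses the limited XPath subset from Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata file attached to a research data object.
# The element and attribute names are made up for this example.
SAMPLE_XML = """
<experiment>
  <instrument name="spectrometer-A"/>
  <sample id="s1"><temperature unit="K">293</temperature></sample>
  <sample id="s2"><temperature unit="K">310</temperature></sample>
</experiment>
""".strip()


def find_text(xml_text, path):
    """Return the text of every element matching an XPath-style path."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(path)]


# Browse: all temperatures recorded in the document.
temperatures = find_text(SAMPLE_XML, "./sample/temperature")

# Query: the temperature of one specific sample, selected by attribute.
s2_temperature = find_text(SAMPLE_XML, "./sample[@id='s2']/temperature")
```

The point is that nothing about the repository's fixed metadata schema needs to change: the path expression alone decides which part of the custom XML is searched or listed.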

The Bazaar

The EPrints Bazaar is a plug-in management system and/or an ‘App Store’ for EPrints, inspired by WordPress. For an administrator it's fully GUI driven, versatile and pretty fool-proof. For developers it looks pretty easy to develop for (I had no trouble following the example with my rusty coding skills).

The primary design goal was that the repository including API must always stay up. They’re clever bastards: they based the plug-in structure on the Debian package mechanism, including the tests for dependencies and conflicts, which makes it very stable. Internally, they’ve run it for six months without a single interruption. Now that’s eating your own dog food!
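
The dependency and conflict tests mentioned above can be sketched roughly as follows. This is only an illustration of the general idea behind Debian-style package checks, not the actual Bazaar or dpkg implementation; the `Package` class, the `can_install` function and the plug-in names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    """A hypothetical plug-in description with Debian-like relations."""
    name: str
    depends: frozenset = frozenset()
    conflicts: frozenset = frozenset()


def can_install(candidate, installed):
    """Check a candidate against installed packages.

    Returns (ok, reasons): ok is True only when every dependency is
    present and no conflict exists in either direction.
    """
    installed_names = {package.name for package in installed}
    reasons = []
    for name in sorted(candidate.depends - installed_names):
        reasons.append(f"missing dependency: {name}")
    for name in sorted(candidate.conflicts & installed_names):
        reasons.append(f"conflicts with installed package: {name}")
    for package in installed:
        if candidate.name in package.conflicts:
            reasons.append(f"installed package {package.name} conflicts with {candidate.name}")
    return (not reasons, reasons)


installed = [Package("core"), Package("export_a", depends=frozenset({"core"}))]
candidate = Package("export_b", depends=frozenset({"core"}), conflicts=frozenset({"export_a"}))
ok, reasons = can_install(candidate, installed)
```

Refusing the installation up front, before anything is touched, is what keeps the running repository in a known good state.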

Country road
Off the beaten track

EPrints as a CRIS

The third major new functionality of 3.3 is CERIF import & export. Primarily this is meant to link EPrints repositories automatically to CRIS systems. But for smaller institutions that need to deliver reports in CERIF format and don’t have a CRIS yet, using EPrints itself may suffice, as pretty much all the necessary metadata is in there. The big question is whether the import/export allows a full lossless roundtrip. As I joined this session halfway (after an enthusiastic tweet prompted me to change rooms), I might've missed that.

This sounds very appealing to me. Unfortunately, the situation in the Netherlands is very different, as a CRIS has been mandatory for decades for the Dutch universities. Right now we’re in the middle of a European tender for a new, nationwide system, and the only thing I can say is that it’s not without problems. How I’d love to experiment with this instead in my institution, but alas, that won't be possible politically.

The EPrints attitude

As Les Carr couldn’t make it stateside, he presented it from the UK. The way this was set up was typical for the can-do attitude of the EPrints developers: Skyping in to a laptop which was put before a mike, and whenever the next slide was needed Les would cheerily call out ‘next slide please!’, after which the stateside companion theatrically reached out for the spacebar of the other laptop, connected to the beamer. Avoid neat technology for technology’s sake and keep it simple and effective.

Wednesday, June 22, 2011

OR11: opening plenary

See also: OR11 overview

The opening session by Jim Jagielski, President of the Apache Software Foundation, focussed on how to make an open source development project viable, whether it produces code or concepts. As El Reg reports today, doing open source is hard. The ASF has unique experience in running open projects (see also is apache open by rule). Much nodding in agreement all around, as what he said made good sense, though it is hard to put into practice. Some choice advice:

Communication is all-important. Despite all the new media that come and go, the mailing list still is king. Any communication that happens elsewhere - wikis, IRC, blogs, twitter, FB, etc - needs to be (re)posted to the list before it officially exists and can be considered. A mailing list is an asynchronous communication channel that participants control themselves: they can read or skip it at the time of their choice, not the time mandated by the medium. A searchable archive of the list is a must.

Software development needs a meritocracy. Merit is built up over time. It’s important that merit never expires, as many open source committers are volunteers who need to be able to take time off when life gets in the way (babies, job change, etc).

You need at least three active committers. Why three? So they can take a vote without getting stuck. You also need ‘enough eyeballs’ to go over a patch or proposal. A vote at ASF needs minimally three positive votes and no negatives.
To create a community, you also need a ‘shepherd’, someone who is knowledgeable yet approachable by newbies. It’s vital to keep a community open, so as not to let the talent pool become too small. To stay attractive, you need to find out what ‘itch’ your audience wants to scratch.
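
The voting rule mentioned above (at least three positive votes and no negatives) is simple enough to write down. The sketch below is only my reading of the rule as it was presented in the talk; the function name and the +1/0/-1 encoding are assumptions for illustration, and real ASF procedures may differ per type of vote.

```python
def vote_passes(votes, min_positive=3):
    """Return True when a vote passes under the rule as described.

    votes is a list of +1 (in favour), 0 (abstain) or -1 (against).
    A vote passes with at least min_positive +1 votes and no -1 votes.
    """
    positives = sum(1 for vote in votes if vote == 1)
    negatives = sum(1 for vote in votes if vote == -1)
    return positives >= min_positive and negatives == 0


three_in_favour = vote_passes([1, 1, 1])
one_against = vote_passes([1, 1, 1, 1, -1])
too_few = vote_passes([1, 1, 0])
```

This also shows why three active committers is the practical minimum: with fewer, no proposal can ever collect enough positive votes.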

The more 'idealistic' software licenses (GPL and all) are "a boon firstmost to lawyers", because the terms ‘share alike’ and ‘commercial use’ are not (yet) clear in a legal context. Choosing an idealistic license can limit the size of the community for projects where companies play a major role. A commenter added that this mirrors the problems of the Creative Commons licenses. In a way, the Apache license mirrors CCzero, which CC created to tackle those.

Tuesday, June 21, 2011

Open Repositories 2011 overview

Open Repositories was great this year. Good atmosphere, lots of interesting news, good fun. It's hard to make a selection from 49k of notes (in raw utf8 txt!). This post is a general overview, more details (and specific topics) will follow later.

Bright lights, bit state!
Texas State History Museum
My key points:

1. Focus on building healthy open source communities

The keynote by Jim Jagielski, President of the Apache Software Foundation, set the tone for much of what was to come. An interesting talk on how to create viable open source projects from a real expert. The points raised in this talk came back often in panel discussions, audience questions and presentations later.
More details here.

2. The Fedora frameworks are growing up

Both Hydra and Islandora now have a growing installed base, commercial support available, and a thriving ecosystem. They've had to learn the lessons on open source building the hard way, but they have their act together. Fez and Muradora were only mentioned in the context of migrating away.
Also, several Fedora projects that don't use Hydra still use the Hydra Content Model. If this trend of standardizing on a small number of de facto standard CMs continues, that would greatly ease mixing and moving between Fedora middleware layers.

3. EPrints’ pragmatic approach: surprisingly effective and versatile

Out of curiosity I attended several EPrints sessions, and I was pleasantly surprised, if not stunned, by what was shown. Especially the support for research data repositories looks to strike the right balance between supporting complex data and metadata types and keeping it simple and very usable out of the box. And also the Bazaar, which tops WordPress in ease of maintenance and installation, but on a solid engineering base that's inspired by Debian's package manager. Very impressive!
More details here.
Texans take 'em by the horns!

Misc. notes
See part #3: Misc notes

Elsewhere on the web

OR11 Conference program, presentations.
Richard Davis, ULCC: #1 overview, #2 the Developers Challenge, #3: eprints vs. dspace.
Disruptive Library Technology Jester day 1, day 2, day 3.
Leslie Johnson - a good round-up with focus on practical solutions.
#or11 Tweet archive on twapperkeeper

Photosets: bigD, keitabando, yours truly, all Flickr images tagged with or11, Adrian Stevenson (warning: FB!).

Other observations

Unlike OR09, the audience was not very international. Italians and Belgians were relatively overrepresented with three and six respectively. I spotted just one German, one Swede and one Swiss, and I was the lone Dutchman. The UK was the exception, though many were presenters of JISC funded projects, which usually have budget allocated for knowledge dissemination.

As OR alternates between Europe and the US, the ratio of participants tends to be weighted toward the 'native continent' anyway. But the recession seems to be hitting travel budgets hard in Europe now.
As there were interesting presentations from Japan, Hong Kong and New Zealand, the rumour floating around that OR12 might be in Asia sounded attractive. I'd be very curious to hear more about what's going on there in repositories and open access. The location of OR12 should be announced within a month, let's see.

[updated June 27th, added more links to other writeups; updated June 28, added Hydra CM uptake]

Monday, June 20, 2011

Catching up on old news, I came across an interesting presentation at CNI this spring on the Data Management Plans initiative. Abstract, recording of the presentation on youtube, slides.

DMP online is a great starting point (and one of the inspirations for CARDS) and this looks like the right group of partners to extend it into a truly generic resource. What's also notable about the presentation is the sensible reasons outlined for collaboration between this quite large group of prestigious institutions. All in all, something to keep an eye on.