Saturday, July 02, 2011

OR11: Misc notes

The state of Texas
I like going to conferences alone: it’s much easier to meet new people from all over the world than when you’re with a group, since groups tend to cling together. With a multi-track conference like OR11, however, the downside is that there’s so much to miss, especially since I like to check out sessions from fields I’m not familiar with. At OR11, I wanted to take the pulse of DSpace and EPrints, and not just faithfully stick with the Fedora talks.

In this entry, I focus on bits and bobs I found noteworthy, rather than give a complete description. I skip over sessions that were excellent but have already been widely covered elsewhere (for instance at library jester), such as Clifford Lynch's closing plenary.

“Sheer Curation” of Experiments: data, process and provenance, Mark Hedges 
slides [pdf]

"Sheer curation" is meant to be lightweight, with curation quietly integrated in the normal workflow. The scientific process is complex, with many intermediate steps that are discarded. The deposit-at-the-end approach misses these. The goal of this JISC project is to capture provenance and experimental structure. It follows up on Scarp (2007-2009).

I really liked the pragmatic approach (I've written this sentence often - I really like pragmatism!). As the researchers tend to work on a single machine and rely heavily on the file system hierarchy, the project team wrote a program that runs as a background process on the scientist’s computer. Quite a lot of metadata can be captured from log files, headers and filenames. It also helps that much work on metadata and vocabulary has already been done in the field, in the form of limited practices and standards.
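To make the idea of harvesting metadata from filenames and folder structure concrete, here is a minimal sketch. It is not the project's actual program: the naming convention (instrument, sample and date separated by underscores) and the function name are assumptions made purely for illustration.

```python
import re
from pathlib import Path

# Hypothetical naming convention: <instrument>_<sample>_<YYYYMMDD>.<ext>
FILENAME_PATTERN = re.compile(
    r"(?P<instrument>[A-Za-z0-9]+)_(?P<sample>[A-Za-z0-9]+)_(?P<date>\d{8})$"
)


def metadata_from_path(path):
    """Derive a small metadata record from a file's name and parent folder."""
    p = Path(path)
    record = {
        "experiment": p.parent.name,     # assume the folder name labels the experiment
        "format": p.suffix.lstrip("."),  # file extension without the dot
    }
    match = FILENAME_PATTERN.match(p.stem)
    if match:
        # Only add the extra fields when the name follows the assumed convention.
        record.update(match.groupdict())
    return record


print(metadata_from_path("runs/exp42/nmr_s17_20110608.dat"))
```

Files that do not follow the assumed convention still get the folder and format fields, so nothing is rejected; that seems in keeping with the lightweight approach described in the talk.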

Being pragmatic also means discarding nice-to-haves such as persistent identifiers. That would require the researchers to standardise beyond their own computer and that’s asking too much.

The final lesson learned sounded familiar: it took more, much more time than anticipated to find out what it is the researchers really want.


SWORD2: looks promising and useful, and actually rather simple. Keeping the S was a design constraint. Hey, otherwise we’d end up with Word, and one is more than enough!

Version 2 will do full Create/Read/Update/Delete (CRUD), though a service can always be configured to deny certain actions. It’s modelled on Google’s Gdata and makes elegant use of Resource Maps and dedicated action URLs.

CottageLabs, one of the design partners, made an introduction video to Sword v2 demonstrating how it works.

It looks really useful and indeed still easy (as per Einstein's famous quip, as simple as possible but not simpler). If you’re a techie, dive in. If you’re not, just add Sword compliance to your project requirements!

Ethicshare & Humbox, two sessions on community building

Two examples of successful subject-oriented communities that feature a repository, each with some good ideas to nick.

Ethicshare is a community repository that aggregates social features for bioethics:

  • one of the project partners is a computer scientist who studies social communities. Because of this mutual interest (for the programmer it’s more than just a job) they have had the resources to fine tune the site.
  • the field has a strong professional society that they closely work with.
  • glitches at the beginning were a strong deterrent to success - so yes, release early and often, but not with crippling bugs!
  • the most popular feature is a folder for gathering links, and many people choose to make them public (it’s private by default).
  • before offering it to the whole site, new features are tried out on a small, active group of around 30 testers.
  • for the next grant phase they needed more users quickly, so they bought ads. $300 of Facebook ads yielded 500 clickthroughs, $2000 of Google ads 5000. This (likely) contributed to the number of unique visitors rising from 4k to 20k per month. Tentative conclusion: these ads cost relatively little and are effective for such a specialized subject; the targeting is really quite good.
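As a quick back-of-the-envelope check on the ad figures above, the cost per clickthrough can be worked out as follows. The spend and click numbers come from the talk; the function and variable names are mine, and a clickthrough is of course not the same thing as a new regular visitor.

```python
def cost_per_click(spend_usd, clicks):
    """Average cost in dollars of a single clickthrough."""
    return spend_usd / clicks


facebook = cost_per_click(300, 500)     # Facebook campaign
google = cost_per_click(2000, 5000)     # Google campaign
blended = cost_per_click(300 + 2000, 500 + 5000)  # both campaigns combined

print(f"Facebook: ${facebook:.2f} per click")
print(f"Google:   ${google:.2f} per click")
print(f"Blended:  ${blended:.2f} per click")
```

So both channels come out at well under a dollar per click, with Google somewhat cheaper per click on these numbers.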

Lessons from the UK based Humbox project:

  • approach: analyse what scientists were doing already in real life, in paper and file cabinets, mimic it and extend it.
  • "the repository is not about research papers, it is about the people who write them": the profile page is the heart, putting the user at the centre. Like Facebook’s, it has two distinct views: an outside version about you (to show off), and an internal version for you (with your interests). This reminds me of the success of the original, pre-yahoo delicious, which also cleverly put self-interest first with the social sharing as a side-effect.
  • Find a need that's not covered by existing systems: Humbox fills a need to share stuff, not just with students - for that the LCMS is the natural place to go to - but with colleagues, since the course-centric nature of LCMS’s tends to lock colleagues out.
  • Most feedback came from community workshops. Participants often became local evangelists.
  • Comments were often corrections. 60% of the authors changed a resource after a comment - and the 40% of comments that did not lead to a correction also include positive ones, so the attitude towards criticism was quite positive.
  • over 50% of users modified or augmented material from others, sometimes reuploading it to the site.
  • Humbox only takes Creative Commons licenses, with an educational side-effect: some users indicated they also started looking in other places (such as flickr) for cc material as a result.

The Learning Registry: “Social Networking for Metadata”
slides [google docs]

I just want to mention this for the sheer scope and size of this initiative. It’s [expletive] ambitious.

The aim is to gather all social networking metadata! To limit the scope, they won’t do normalising, or offer search or a query api; that's all left to the users of the gathered dataset. But by all, they mean everything on the net: data, metadata and paradata (by which I understand they mean the relationships with other data).

Agreements are in the works with major partners (see last slide). The big elephant in the room was Facebook (no surprise, sigh) which wasn’t mentioned at all. (as I'm writing this, Google+ has just been announced, there is some hope after all of the slightly creepy evil eventually triumphing over the even more evil).

They call their approach a do-ocracy. Very agile design principles. Real-time everything in the open: all code and specs are written directly in Google Docs (table of contents, a google spreadsheet). NoSQL master-master storage system, well thought-out architecture, production will run on ec2. Everything will be open, except data harvested from commercial partners.

Something to keep an eye on.


MODS is the new DC. In recent projects, MODS seems to have replaced Dublin Core as the baseline standard for metadata exchange. Interesting development.

Wednesday, June 29, 2011

OR11: New in EPrints 3.3: large scale research data, and the Bazaar.

As I mentioned in the overview, I was very impressed by what's happening in the EPrints community. The new features of the upcoming 3.3 are impressive as they seem to strike the right balance between pragmatism and innovation. Thanks to an outstandingly enthusiastic and open developer community, they're giving DSpace (and to a lesser extent Duraspace) a run for their money.

Red Bull
 could've been the motto of the Eprints community

Support for research data repositories

The new large-scale research data support is also a hallmark of pragmatic simplicity. EPrints avoids getting very explicit about subject data classification and control, taking a generic approach that can be extended.

Research data can come in two container datatypes, ‘Dataset’ and ‘Experiment’. A Dataset is a standalone, one-off collection of data. The metadata reflects the collection. The object can contain one or more documents, and must also have a read-me file attached. This is a human-oriented manifest: although complex machine-oriented metadata is possible, requiring it would deter actual use.

The other datatype is Experiment. This describes a structural process that may result in many datasets. The metadata reflects process and supports the Open Provenance Model.

Where the standard metadata don’t suffice, one of the data streams belonging to the object can be an xml file. If I understood correctly, xpath expressions can then be used for querying and browsing. Effectively this removes the shackles of the standard metadata definitions and creates flexibility similar to Fedora. It's very similar to what we're trying to do in the FLUOR project with a SAKAI plugin that acts as a GUI for a data repository in Fedora. Combining user-friendliness with configurable, flexible metadata schemes is a tough one to pull off, so I'll certainly keep an eye on the way EPrints accomplishes this.
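To illustrate what querying an attached xml file with xpath expressions could look like, here is a small sketch. It is not EPrints code: the element names (experiment, run, instrument, temperature) are invented for the example, and it uses the limited XPath subset of Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom metadata, stored as one of the object's data streams.
SAMPLE_XML = """
<experiment>
  <run id="1"><instrument>spectrometer</instrument><temperature unit="K">293</temperature></run>
  <run id="2"><instrument>microscope</instrument><temperature unit="K">77</temperature></run>
  <run id="3"><instrument>spectrometer</instrument><temperature unit="K">310</temperature></run>
</experiment>
""".strip()


def run_ids_for_instrument(xml_text, instrument):
    """Return the ids of all runs recorded with the given instrument."""
    root = ET.fromstring(xml_text)
    # The predicate [instrument='...'] selects <run> elements that have a
    # child <instrument> whose text equals the requested value.
    runs = root.findall(f"./run[instrument='{instrument}']")
    return [run.get("id") for run in runs]


print(run_ids_for_instrument(SAMPLE_XML, "spectrometer"))
```

The point is that the repository does not need to know the schema in advance; a configurable set of such expressions would be enough to drive browse and search views.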

The Bazaar

The EPrints Bazaar is a plug-in management system and/or an ‘App Store’ for EPrints, inspired by Wordpress. For an administrator it's fully GUI driven, versatile and pretty fool-proof. For developers it looks pretty easy to develop for (I had no trouble following the example with my rusty coding skills).

The primary design goal was that the repository including API must always stay up. They’re clever bastards: they based the plug-in structure on the Debian package mechanism, including the tests for dependencies and conflicts, which makes it very stable. Internally, they’ve run it for six months without a single interruption. Now that’s eating your own dog food!
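The kind of check that a Debian-style package mechanism performs before installing something can be sketched roughly like this. It is only an illustration of the idea of testing dependencies and conflicts up front, not the Bazaar's implementation; the plug-in names and the catalogue format are hypothetical.

```python
# Hypothetical plug-in catalogue: each entry may declare dependencies and conflicts.
CATALOGUE = {
    "core_export": {},
    "cerif_export": {"depends": ["core_export"]},
    "legacy_export": {"conflicts": ["cerif_export"]},
}


def can_install(candidate, installed, catalogue):
    """Return (ok, problems) for installing `candidate` next to `installed`."""
    problems = []
    spec = catalogue[candidate]

    # Every declared dependency must already be installed.
    for dep in spec.get("depends", []):
        if dep not in installed:
            problems.append(f"missing dependency: {dep}")

    # A conflict may be declared by either side, so check both directions.
    for other in sorted(installed):
        declared_here = other in spec.get("conflicts", [])
        declared_there = candidate in catalogue.get(other, {}).get("conflicts", [])
        if declared_here or declared_there:
            problems.append(f"conflicts with: {other}")

    return (not problems, problems)


print(can_install("cerif_export", {"core_export"}, CATALOGUE))
print(can_install("cerif_export", {"core_export", "legacy_export"}, CATALOGUE))
```

Because the check runs before anything is changed, a plug-in that fails it never touches the running repository, which fits the design goal of always staying up.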

Country road
Off the beaten track

EPrints as a CRIS

The third major new functionality of 3.3 is CERIF import & export. Primarily this is meant to link EPrints repositories automatically to CRIS systems. But for smaller institutions that must deliver reports in CERIF format and don’t have a CRIS yet, using EPrints itself may suffice, as pretty much all the necessary metadata is in there. The big question is whether the import/export allows a full lossless round trip. As I joined this session halfway (after an enthusiastic tweet prompted me to change rooms), I might've missed that.

This sounds very appealing to me. Unfortunately, the situation in the Netherlands is very different, as a CRIS has been mandatory for decades for the Dutch universities. Right now we’re in the middle of a European tender for a new, nationwide system, and the only thing I can say is that it’s not without problems. How I’d love to experiment with this instead in my institution, but alas, that won't be possible politically.

The EPrints attitude

As Les Carr couldn’t make it stateside, he presented it from the UK. The way this was set up was typical of the can-do attitude of the EPrints developers: Skyping in to a laptop which was put in front of a mike, and whenever the next slide was needed Les would cheerily call out ‘next slide please!’, after which the stateside companion theatrically reached out for the spacebar of the other laptop, connected to the beamer. Avoid neat technology for technology’s sake and keep it simple and effective.

Wednesday, June 22, 2011

OR11: opening plenary

See also: OR11 overview

The opening session by Jim Jagielski, President of the Apache Software Foundation, focussed on how to make an open source development project viable, whether it produces code or concepts. As El Reg reports today, doing open source is hard. The ASF has unique experience in running open projects (see also is apache open by rule). Much nodding in agreement all around, as what he said made good sense, but is hard to put into practice. Some choice advice:

Communication is all-important. Despite all the new media that come and go, the mailing list is still king. Any communication that happens elsewhere - wikis, IRC, blogs, twitter, FB, etc - needs to be (re)posted to the list before it officially exists and can be considered. A mailing list is an asynchronous communication channel that participants control themselves: they can read or skip it at a time of their choosing, not the time mandated by the medium. A searchable archive of the list is a must.

Software development needs a meritocracy. Merit is built up over time. It’s important that merit never expires, as many open source committers are volunteers who need to be able to take time off when life gets in the way (babies, job change, etc).

You need at least three active committers. Why three? So they can take a vote without getting stuck. You also need ‘enough eyeballs’ to go over a patch or proposal. A vote at ASF needs minimally three positive votes and no negatives.
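The voting rule as I noted it down (at least three positive votes and no negatives) fits in a few lines. This is a sketch of the rule as stated in the talk, not of the ASF's full voting procedures; encoding votes as +1, 0 and -1 is my own convention here.

```python
def vote_passes(votes):
    """Apply the rule: at least three positive votes and not a single negative.

    `votes` is a list of +1 (in favour), 0 (abstain) or -1 (against).
    """
    positives = sum(1 for v in votes if v > 0)
    negatives = sum(1 for v in votes if v < 0)
    return positives >= 3 and negatives == 0


print(vote_passes([1, 1, 1]))      # three in favour, nobody against
print(vote_passes([1, 1, 1, -1]))  # a single negative vote blocks it
print(vote_passes([1, 1, 0]))      # too few positive votes
```

It also shows why three active committers is the bare minimum: with fewer, no vote could ever pass under this rule.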
To create a community, you also need a ‘shepherd’, someone who is knowledgeable yet approachable by newbies. It’s vital to keep a community open, so as not to let the talent pool become too small. To stay attractive, you need to find out what ‘itch’ your audience wants to scratch.

The more 'idealistic' software licenses (GPL and all) are "a boon first and foremost to lawyers", because the terms ‘share alike’ and ‘commercial use’ are not (yet) clear in a legal context. Choosing an idealistic license can limit the size of the community for projects where companies play a major role. A commenter added that this mirrors the problems of the Creative Commons licenses. In a way, the apache license mirrors CCzero, which CC created to tackle those.

Tuesday, June 21, 2011

Open Repositories 2011 overview

Open Repositories was great this year. Good atmosphere, lots of interesting news, good fun. It's hard to make a selection from 49k of notes (in raw utf8 txt!). This post is a general overview, more details (and specific topics) will follow later.

Bright lights, big state!
Texas State History Museum

My key points:

1. Focus on building healthy open source communities

The keynote by Jim Jagielski, President of the Apache Software Foundation, set the tone for much of what was to come. An interesting talk on how to create viable open source projects from a real expert. The points raised in this talk came back often in panel discussions, audience questions and presentations later.
More details here.

2. The Fedora frameworks are growing up

Both Hydra and Islandora now have a growing installed base, commercial support available, and a thriving ecosystem. They've had to learn the lessons on open source building the hard way, but they have their act together. Fez and Muradora were only mentioned in the context of migrating away.
Also, several Fedora projects that don't use Hydra still use the Hydra Content Model. If this trend of standardizing on a small number of de facto standard CMs continues, that would greatly ease mixing and moving between Fedora middleware layers.

3. Eprints’ pragmatic approach: surprisingly effective and versatile

Out of curiosity I attended several EPrints sessions, and I was pleasantly surprised, if not stunned, by what was shown. In particular, the support for research data repositories seems to strike the right balance between supporting complex data and metadata types and keeping it simple and very usable out of the box. The same goes for the Bazaar, which tops Wordpress in ease of maintenance and installation, but on a solid engineering base that's inspired by Debian's package manager. Very impressive!
More details here.
Texans take 'em by the horns!

Misc. notes
See part #3: Misc notes

Elsewhere on the web

OR11 Conference program, presentations.
Richard Davis, ULCC: #1 overview, #2 the Developers Challenge, #3: eprints vs. dspace.
Disruptive Library Technology Jester day 1, day 2, day 3.
Leslie Johnson - a good round-up with focus on practical solutions.
#or11 Tweet archive on twapperkeeper

Photosets: bigD, keitabando, yours truly, all Flickr images tagged with or11, Adrian Stevenson (warning: FB!).

Other observations

Unlike OR09, the audience was not very international. Italians and Belgians were relatively overrepresented with three and six respectively. I spotted just one German, one Swede and one Swiss, and I was the lone Dutchman. The UK was the exception, though many were presenters of JISC funded projects, which usually have budget allocated for knowledge dissemination.

As OR alternates between Europe and the US, the ratio of participants tends to be weighted towards the 'native continent' anyway. But the recession seems to be hitting travel budgets hard in Europe now.
As there were interesting presentations from Japan, Hong Kong and New Zealand, the rumour floating around that OR12 might be in Asia sounded attractive: I'd be very curious to hear more about what's going on there in repositories and open access. The location of OR12 should be announced within a month, let's see.

[updated June 27th, added more links to other writeups; updated June 28, added Hydra CM uptake]

Monday, June 20, 2011

Catching up on old news, I came across an interesting presentation at CNI this spring on the Data Management Plans initiative. Abstract, recording of the presentation on youtube, slides.

DMP online is a great starting point (and one of the inspirations for CARDS) and this looks like the right group of partners to extend it into a truly generic resource. What's also notable about the presentation is the sensible reasons outlined for collaboration between this quite large group of prestigious institutions. All in all, something to keep an eye on.