Saturday, July 02, 2011

OR11: Misc notes

The state of Texas
I like going to conferences alone: it’s much easier to meet new people from all over the world than when you’re with a group, since groups tend to cling together. With a multi-track conference like OR11, however, the downside is that there’s so much to miss, especially since I like to check out sessions from fields I’m not familiar with. At OR11, I wanted to take the pulse of DSpace and Eprints, and not just faithfully stick with the Fedora talks.

In this entry, I focus on bits and bobs I found noteworthy, rather than give a complete description. I skip over sessions that were excellent but have already been widely covered elsewhere (for instance at library jester), such as Clifford Lynch’s closing plenary.

“Sheer Curation” of Experiments: data, process and provenance, Mark Hedges 
slides [pdf]

"Sheer curation" is meant to be lightweight, with curation quietly integrated into the normal workflow. The scientific process is complex, with many intermediate steps that are discarded, and the deposit-at-the-end approach misses these. The goal of this JISC project is to capture provenance and experimental structure. It follows up on Scarp (2007-2009).

I really liked the pragmatic approach (I've written this sentence often - I really like pragmatism!). As the researchers tend to work on a single machine and make heavy use of the file system hierarchy, the project wrote a program that runs as a background process on the scientist’s computer. Quite a lot of metadata can be captured from log files, headers and filenames. Notably, it also helps that much work on metadata and vocabulary has already been done in the field in the form of limited practices and standards.
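
To make the idea concrete, here is a minimal sketch of harvesting metadata from the file system alone. It is not the project's actual program: the function names, the record fields and the assumption that folder names carry meaning (say project/run/file) are all mine, for illustration only.

```python
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path, root):
    """Derive a metadata record from a file's location and file system attributes.

    Assumption (hypothetical): the folders between `root` and the file
    encode something meaningful, such as project and run names.
    """
    path, root = Path(path), Path(root)
    relative = path.relative_to(root)
    stat = path.stat()
    return {
        "hierarchy": list(relative.parts[:-1]),  # folder names, outermost first
        "name": path.stem,                       # filename without extension
        "format": path.suffix.lstrip(".").lower(),
        "size_bytes": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
    }


def scan(root):
    """Walk the tree once and describe every regular file below `root`."""
    root = Path(root)
    return [describe_file(p, root) for p in sorted(root.rglob("*")) if p.is_file()]
```

A real background process would presumably rerun something like `scan` periodically, or watch for file changes, and would also parse log files and instrument headers, which this sketch leaves out.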

Being pragmatic also means discarding nice-to-haves such as persistent identifiers. That would require the researchers to standardise beyond their own computer and that’s asking too much.

The final lesson learned sounded familiar: it took more, much more time than anticipated to find out what it is the researchers really want.


SWORD2: looks promising and useful, and actually rather simple. Keeping the S was a design constraint. Hey, otherwise we’d end up with Word, and one is more than enough!

Version 2 will do full Create/Read/Update/Delete (CRUD), though a service can always be configured to deny certain actions. It’s modelled on Google’s Gdata and makes elegant use of Resource Maps and dedicated action URLs.
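
As a rough illustration of CRUD with configurable denial, here is a small sketch. It assumes the usual REST-style pairing of CRUD actions with HTTP methods; the `Action` and `DepositService` names are hypothetical and are not taken from the Sword v2 specification or any Sword library.

```python
from enum import Enum


class Action(Enum):
    """CRUD actions paired with the HTTP method conventionally used for each."""
    CREATE = "POST"
    READ = "GET"
    UPDATE = "PUT"
    DELETE = "DELETE"


class DepositService:
    """Hypothetical service that supports full CRUD unless configured otherwise."""

    def __init__(self, denied=()):
        # Actions this particular installation refuses to perform.
        self.denied = frozenset(denied)

    def allows(self, action):
        """True if the service is configured to accept this action."""
        return action not in self.denied

    def allowed_methods(self):
        """HTTP methods a client could expect to succeed, in sorted order."""
        return sorted(a.value for a in Action if self.allows(a))
```

For example, `DepositService(denied={Action.DELETE})` would still accept deposits and updates but refuse deletions. How a real server signals the refusal to the client is up to the specification and is not covered by this sketch.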

CottageLabs, one of the design partners, made an introduction video to Sword v2 demonstrating how it works.

It looks really useful and indeed still easy (as per Einstein's famous quip, as simple as possible but not simpler). If you’re a techie, dive in. If you’re not, just add Sword compliance to your project requirements!

Ethicshare & Humbox, two sessions on community building

Two examples of successful subject-oriented communities that feature a repository, each with some good ideas to nick.

Ethicshare is a community repository that aggregates social features for bioethics:

  • one of the project partners is a computer scientist who studies social communities. Because of this mutual interest (for the programmer it’s more than just a job) they have had the resources to fine tune the site.
  • the field has a strong professional society that they closely work with.
  • glitches at the beginning were a strong deterrent to success - so yes, release early and often, but not with crippling bugs!
  • the most popular feature is a folder for gathering links, and many people choose to make them public (it’s private by default).
  • before offering it to the whole site, new features are tried out on a small, active group of around 30 testers.
  • for the next grant phase they needed more users quickly, so they bought ads. $300 of Facebook ads yielded 500 clickthroughs, and $2000 of Google ads yielded 5000. This (likely) contributed to the number of unique visitors rising from 4k to 20k per month. Tentative conclusion: these ads cost relatively little and are effective for such a specialized subject; the targeting is really quite good.
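
The ad figures above can be turned into a cost per clickthrough with a few lines of arithmetic. The numbers are the ones reported in the session; the variable and function names are mine.

```python
# Spend (USD) and clickthroughs as reported in the talk.
campaigns = {
    "Facebook": {"spend": 300, "clicks": 500},
    "Google": {"spend": 2000, "clicks": 5000},
}


def cost_per_click(spend, clicks):
    """Average spend per clickthrough."""
    return spend / clicks


for name, c in campaigns.items():
    print(f"{name}: ${cost_per_click(c['spend'], c['clicks']):.2f} per click")

total_spend = sum(c["spend"] for c in campaigns.values())
total_clicks = sum(c["clicks"] for c in campaigns.values())
print(f"Combined: ${cost_per_click(total_spend, total_clicks):.2f} per click")
```

That works out to about $0.60 per click on Facebook and $0.40 on Google. Note that a clickthrough is not the same as a unique visitor, so these figures say nothing definite about what each new regular user cost.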

Lessons from the UK-based Humbox project:

  • approach: analyse what scientists were doing already in real life, in paper and file cabinets, mimic it and extend it.
  • "the repository is not about research papers, it is about the people who write them": the profile page is the heart, putting the user at the centre. Like Facebook’s, it has two distinct views: an outside version about you (to show off), and an internal version for you (with your interests). This reminds me of the success of the original, pre-Yahoo Delicious, which also cleverly put self-interest first, with the social sharing as a side-effect.
  • Find a need that's not covered by existing systems: Humbox fills a need to share stuff, not just with students - for that the LCMS is the natural place to go to - but with colleagues, since the course-centric nature of LCMS’s tends to lock colleagues out.
  • Most feedback came from community workshops. Participants often became local evangelists.
  • Comments often were corrections. 60% of the authors changed a resource after a comment - and the 40% of comments not leading to a correction also include positive ones, so the attitude towards criticism was quite positive.
  • over 50% of users modified or augmented material from others, sometimes reuploading it to the site.
  • Humbox only takes Creative Commons licenses, with an educational side-effect: some users indicated they also started looking in other places (such as flickr) for cc material as a result.

The Learning Registry: “Social Networking for Metadata”
slides [google docs]

I just want to mention this for the sheer scope and size of this initiative. It’s [expletive] ambitious.

The aim is to gather all social networking metadata! To limit the scope, they won’t do normalising, or offer search or a query API; that's all left to the users of the gathered dataset. But by all, they mean everything on the net: data, metadata and paradata (by which I understand they mean the relationships with other data).

Agreements are in the works with major partners (see last slide). The big elephant in the room was Facebook (no surprise, sigh) which wasn’t mentioned at all. (as I'm writing this, Google+ has just been announced, there is some hope after all of the slightly creepy evil eventually triumphing over the even more evil).

They call their approach a do-ocracy. Very agile design principles. Real-time everything in the open: all code and specs are written directly in Google Docs (table of contents, a google spreadsheet). NoSQL master-master storage system, well thought-out architecture, production will run on ec2. Everything will be open, except data harvested from commercial partners.

Something to keep an eye on.


MODS is the new DC. In recent projects, MODS seems to have replaced Dublin Core as the baseline standard for metadata exchange. Interesting development.
