Thursday, June 11, 2009

OR 09: eScidoc's infrastructure

eSciDoc Infrastructure: a Fedora-based e-Research Framework - Frank Schwichtenberg, Matthias Razum (FIZ Karlsruhe)

I had not expected this presentation to be as good as it was - it was a real eye-opener for me. It dealt solely and bravely on the underlying structure of eScidoc, not the solutions built on top of them (such as PubMan). So, delving into the technical nitty gritty.

So far, to me eSciDoc has been an interesting promise that seemed to take forever to materialize into non-vaporware. DANS wanted to use it as the basis for the Fedora-based incarnation of their data repository EASY, a plan they had to abandon when their deadline was looming near and the eScidoc API's were still not frozen. Apart from that, the infrastructure seemed also needlessly complex - why was another content model layer necessary on top of Fedora's own?

The idea behind the eScidoc approach is to take a user-centric approach, which in case of the infra, that's the programmer. What would she like to see, instead of Fedora's plain datastreams?
Tentative answer: an application-oriented object view.

eScidoc takes a full atomistic approach to content modelling: an Item is mapped to a fedora object (without assumption about the metadata profile - keeping it flexible). Then, Item has Component. An Item in practice consists of two fedora objects, with a ‘hasComponent’ relation between.

Object can be in arbitrary hierarchies: except the top hierarchies which are reserved for ‘context’, which can be used for institutional hierarchies (a common approach, I can live with that). All relationships are expressed as structmaps.

So far so good, but now the really neat part.

Consequences of the atomistic content model for versioning: a change can occur in any of the underlying fedora objects of a compound object, with consequences for both.
The eScidoc API's store the Object lifecycle automatically. And when one Component changes or is added, the Item object also changes version, but not the other Components.
(the presentations slides are really instructive on this, worth checking out when they're online).

This also delivers truly persistent ID’s (multiple types supported: DOI, handle, etc), separate from fedora’s PID’s which are not really persistent. And every version has one - both of the compound and the separate Item objects. All changes (update/release/submit events etc.) are logged in version log has events, if I remember correctly this log can be used for rollback ie it is a full transaction log.

This is the reason that the security model has to be in the escidoc layer, not fedora's (though the same policies & structures xacml are used). This is eScidoc's answer to the question common to many fedora projects: how to extend fedora's limited security? It might be best to take the whole security layer out of Fedora.


IMHO this is very exciting. This is about the last thing that a project would need to roll yourself - it is incredibly complex to get working correct and durable - and here it is, backed by a body of academic research - it is a German project after all. For me, this puts eScidoc firmly on the shortlist of frameworks.

No comments: