Monday, June 15, 2009

OR 09: three more neat Fedora implementations

Open Repositories 2009, Day 4

Three more notable sessions on implementing Fedora. Hopefully, the penultimate post before a final round-up. What a frantic infodump this conference was...


Enhanced Content Models for Fedora - Asger Blekinge-Rasmussen (State and University Library Denmark)

A hardcore technical talk, though impressive in the elegance of the two points shown: bringing the OO model to Fedora object creation, and a DB-style ‘view’ that makes it easy to create searching and browsing UIs.

The first is created as an extension of Fedora 3’s standard Content Models, yet backward-compatible, which is a feat. Notable extras: it declares allowed relations (in OWL Lite) and schemas for xml datastreams. Includes a validator service (which is planned as a disseminator, too). Open source [sourceforge].

Fedora objects can be manipulated at quite a high level using the API, but population needs to be done at a much lower level. Thus most systems roll their own. Our solution: templates, data objects created as instances of CM’s, not unlike OO programming. Makes default values very easy. No need for handcoded foxml anymore, hallelujah! Create, discover, clone templates using the template web service.
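
To make the template idea a bit more concrete, here is a minimal sketch in plain Python. It only illustrates the concept of creating data objects as instances with default values; it is not the actual template web service. The `Template` class, its `instantiate` method, the PIDs and the datastream names are all made up for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Template:
    """Hypothetical stand-in for a content-model template (not the real API)."""
    content_model: str
    defaults: dict = field(default_factory=dict)

    def instantiate(self, pid, **overrides):
        # A new data object starts as a deep copy of the template's defaults,
        # so editing one object never leaks into the template or its siblings.
        datastreams = copy.deepcopy(self.defaults)
        datastreams.update(overrides)
        return {"pid": pid, "content_model": self.content_model,
                "datastreams": datastreams}


# Illustrative names only.
slide_template = Template(
    content_model="cm:Slide",
    defaults={"DC": {"language": "da", "rights": "unknown"}, "NOTES": ""},
)

slide = slide_template.instantiate("demo:1", NOTES="first slide")
```

The deep copy is the design choice that matters here: defaults are inherited, but every object owns its own datastreams afterwards.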

Then there are repository views, which bundle atomic objects into logical records. Search engine record might be made up of bundle of Fedora objects.
Defined by annotated relations; view angles to create different logical records.
‘view = none’: then omitted from results (useful for small particles you don’t want to have show up in queries, for instance separate slides).
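
A rough sketch of how such a view could bundle objects, assuming a toy in-memory repository. The PIDs, relation names, the `view` key and the `view_angles` table are inventions for this example, not the annotation syntax shown in the talk.

```python
# Toy repository: PID -> object. All PIDs, relation names and the "view"
# key are invented for this sketch.
objects = {
    "demo:lecture": {"view": "entry",
                     "relations": {"hasSlide": ["demo:s1", "demo:s2"],
                                   "citedBy": ["demo:paper"]}},
    "demo:s1": {"view": "none", "relations": {}},
    "demo:s2": {"view": "none", "relations": {}},
    "demo:paper": {"view": "entry", "relations": {}},
}

# A view angle names the relations that pull objects into the same record.
view_angles = {"search": {"hasSlide"}}


def logical_record(pid, angle):
    """Collect pid plus everything reachable over the angle's relations."""
    followed = view_angles[angle]
    record, todo = [], [pid]
    while todo:
        current = todo.pop(0)
        if current in record:
            continue
        record.append(current)
        for relation, targets in objects[current]["relations"].items():
            if relation in followed:
                todo.extend(targets)
    return record


def search_results(angle):
    """One logical record per entry object; 'none' objects never lead a record."""
    return {pid: logical_record(pid, angle)
            for pid, obj in objects.items() if obj["view"] != "none"}
```

In this toy setup the slides only ever show up inside the lecture's record, which is the effect ‘view = none’ is after.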

These simple API additions make it easy to create elaborate yet simple GUI’s. That includes the first one I’ve seen that comes close to a workable interface for relationship management - not quite full drag’n’drop, but getting there.


Beyond the Tutorial: Complex Content Models in Fedora 3 - Peter Gorman, Scott Prater (University of Wisconsin Digital Collections Center)
[presentation]

Summary: A hands-on walk through of the Wisconsin DIY approach. Also, an excellent example of what a well-done Prezi presentation can look like: literally zooming in on details then zooming out on the global context was really helpful to see the forest for the trees.

The outset: migrating >1million complex, heterogeneous digital objects into Fedora. Use abstract CM’s, atomistic, gracefully absorb new kinds and new combinations of content. Philosophy: 'fit the model to the content, not the content to the model'.
(Not in production yet, prototype app; keep an eye out for 'Uni Wisconsin digital collections')

Prater starts out with the note that it’s humbling to see that the Hydra and escidoc people have been working on the same problem. However IMHO there’s no reason for embarrassment, as their basic solution is very elegant.

Using MODS for toplevel datastream (similar approach to Hydra). STRUCT datastream: a valid METS document, tying objects to hierarchy. Important point: CM’s don’t define structure, that’s for STRUCT and RELS-EXT.

Every object starts with a FirstClassObject, which points to 0-n child objects of arbitrary types. If zero, it’s a citation. To deal with sibling relationships (ie 2 pages in a specific order), an umbrella element is put on top with a METS resource map. This allows full METS functionality. Linking is done using simple STRUCT and RELS-EXT. The advantage over doing everything in RELS-EXT: RELS-EXT alone cannot express sequencing.

Now, to tie this ‘object soup’ together in an app (common problem for lots of objects, to turn the soup into a tree), the solution is simple: always use one monolithic disseminator, viewMETS(). This takes PID for FirstClassObject, returns valid METS doc containing object and all its (grand)children.

This is brilliant: a one-stop API to get the full object tree from a given PID, hiding the complexity of the umbrella object and the METS description involved.
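
The idea of one call that returns the whole tree can be sketched in a few lines of Python. Mind the assumptions: this is a toy in-memory store, `view_tree` and `flatten` are hypothetical names, and the result is a nested dict rather than the valid METS document that the real viewMETS() returns.

```python
# Toy object store. "children" plays the role of the ordered STRUCT map;
# the PIDs and field names are invented, and the result is a plain dict,
# not a METS document.
store = {
    "demo:book": {"type": "FirstClassObject", "children": ["demo:p1", "demo:p2"]},
    "demo:p1": {"type": "Page", "children": ["demo:img1"]},
    "demo:p2": {"type": "Page", "children": []},
    "demo:img1": {"type": "Image", "children": []},
    "demo:cite": {"type": "FirstClassObject", "children": []},
}


def view_tree(pid, _path=()):
    """Return the object and all its (grand)children as one nested structure."""
    if pid in _path:
        raise ValueError(f"cycle detected at {pid}")
    obj = store[pid]
    # Children keep the order in which the structure map lists them.
    return {"pid": pid, "type": obj["type"],
            "children": [view_tree(child, _path + (pid,))
                         for child in obj["children"]]}


def flatten(tree):
    """Reading order of the whole tree, parent first."""
    pids = [tree["pid"]]
    for child in tree["children"]:
        pids.extend(flatten(child))
    return pids
```

A caller only needs the top-level PID; the ordering of pages comes along for free because the children list is ordered.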

The only part they’re not yet satisfied with is how to relate items between FirstClassObjects, and how to handle relations between two top-level logical objects (ie journal and article) that are sometimes parent/child, sometimes not.

To which Asger chimed in that his ‘angle view’, demonstrated in the talk before, would be a possible solution for this. I saw them discussing later... I love it when a plan comes together.


When Ruby Met Fedora - Matt Zumwalt (Media Shelf)

A live demonstration of ActiveFedora which made my fingers itch to start coding straight away - until I remembered Ruby’s Unicode issues, rats.

The philosophy behind it: use Fedora for long-lived content, but be able to quickly create short-lived services and apps.

ActiveFedora can be used without Rails, or even without Ruby (you can call it from the shell). However, Ruby’s OO model maps very well onto Fedora. The key difference with, say, java or C++: you don’t know what kind of object you’ll get back from a call.

The demo shows the standard rails environment, except the Model directory. There, calls to ActiveRecord are replaced with calls to ActiveFedora. AF exposes Fedora objects with multiple properties. Qualified DC is built-in, but the has_properties function allows for easy extension.

An interesting advantage of this approach is that the methods as used by developers use the same jargon as the metadata users are used to. “they communicate much better when a method’s called dc.subject.”

There’s quite a bit to do ATM. They’ve received funding to hire a student to finally write real documentation. Other extensions: built-in SOLR integration, more generators for standard situations, basic CM integration. Interesting is the approach to integrating MODS: use the existing, mature java libraries, which is easy when using JRuby as interpreter.

Thursday, June 11, 2009

OR 09: eScidoc's infrastructure

eSciDoc Infrastructure: a Fedora-based e-Research Framework - Frank Schwichtenberg, Matthias Razum (FIZ Karlsruhe)

I had not expected this presentation to be as good as it was - it was a real eye-opener for me. It dealt solely and bravely with the underlying structure of eScidoc, not the solutions built on top of it (such as PubMan). So, delving into the technical nitty gritty.

So far, to me eSciDoc has been an interesting promise that seemed to take forever to materialize into non-vaporware. DANS wanted to use it as the basis for the Fedora-based incarnation of their data repository EASY, a plan they had to abandon when their deadline was looming near and the eScidoc API's were still not frozen. Apart from that, the infrastructure seemed also needlessly complex - why was another content model layer necessary on top of Fedora's own?

The idea behind eScidoc is to take a user-centric approach; in the case of the infrastructure, the user is the programmer. What would she like to see, instead of Fedora's plain datastreams?
Tentative answer: an application-oriented object view.

eScidoc takes a full atomistic approach to content modelling: an Item is mapped to a fedora object (without assumption about the metadata profile - keeping it flexible). Then, Item has Component. An Item in practice consists of two fedora objects, with a ‘hasComponent’ relation between.

Objects can be in arbitrary hierarchies, except the top levels, which are reserved for ‘context’ and can be used for institutional hierarchies (a common approach, I can live with that). All relationships are expressed as structmaps.

So far so good, but now the really neat part.

Consequences of the atomistic content model for versioning: a change can occur in any of the underlying fedora objects of a compound object, with consequences for both.
The eScidoc API's store the Object lifecycle automatically. And when one Component changes or is added, the Item object also changes version, but not the other Components.
(the presentation slides are really instructive on this, worth checking out when they're online).
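
Until the slides are up, a small Python sketch of the versioning rule as I understood it: a change to one Component bumps that Component and the Item, and leaves the other Components alone. The `Item` class, its method and the log format are hypothetical; this is not the eScidoc API.

```python
class Item:
    """Toy compound object: one item plus named components, each versioned."""

    def __init__(self, components):
        self.version = 1
        self.components = {name: {"version": 1, "content": content}
                           for name, content in components.items()}
        self.log = [("create", 1)]

    def update_component(self, name, content):
        if name in self.components:
            component = self.components[name]
            component["version"] += 1
            component["content"] = content
        else:
            # Adding a component also counts as a change to the item.
            self.components[name] = {"version": 1, "content": content}
        # The item gets a new version; untouched components keep theirs.
        self.version += 1
        self.log.append(("update:" + name, self.version))
```

The log here is just a list of events with the item version they produced, a much simplified stand-in for a real version history.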

This also delivers truly persistent ID’s (multiple types supported: DOI, handle, etc), separate from fedora’s PID’s which are not really persistent. And every version has one - both of the compound and the separate Item objects. All changes (update/release/submit events etc.) are logged as events in a version log; if I remember correctly, this log can be used for rollback, ie it is a full transaction log.

This is the reason that the security model has to be in the escidoc layer, not fedora's (though the same xacml policies & structures are used). This is eScidoc's answer to the question common to many fedora projects: how to extend fedora's limited security? It might be best to take the whole security layer out of Fedora.


IMHO this is very exciting. This is about the last thing a project would want to roll itself - it is incredibly complex to get working correctly and durably - and here it is, backed by a body of academic research - it is a German project after all. For me, this puts eScidoc firmly on the shortlist of frameworks.

Wednesday, June 10, 2009

OR 09: blogosphere links

Nearly three weeks afterwards, it's time to round up the OR 09 posts... Unfortunately, library life got in the way. Meanwhile, why not read the opinions of these honoured colleagues, who are undoubtedly better informed:

loomware.typepad.com/ (Mark Leggott)

Open Repositories 2009 - Peter Sefton's trip report (ptsefton.com)
Open Repositories 2009 – Peter Sefton's further thoughts (caulcairss.wordpress.com)

Leslie Carr (repositoryman.blogspot.com)

John Robertson (Strathclyde)

http://repositoryblog.com/archives/18

http://www.weblogs.uhi.ac.uk/sm00sm/2009/05/

http://jhulibrariestravel.blogspot.com/2009/05/open-repostories-2009.html (Elliot Metsger, Johns Hopkins)

Finally, another bunch'o'links:
http://repositorynews.wordpress.com/2009/05/28/open-repositories-2009/


Friday, June 05, 2009

OR09: Four approaches to implementing Fedora

Open repositories 2009, day three, afternoon.

So far, the conference had not been disappointing, but now it got really interesting. The sessions I followed in the afternoon each highlighted a specific approach to the problem that IMHO has been standing in the way of wider Fedora acceptance: middleware.

What these four have in common is that they all leverage an existing OSS product and adapt it to use Fedora as datastore.


1. Facilitating Wiki/Repository Communication with Metadata - Laura M. Bartolo

Summary: interesting approach, a traditional Fez spiced up with Mediawiki. With minimal coding, a relatively seamless integration.
For this to work, contributors need to know MediaWiki markup, and to really integrate, must learn the fez-specific search markup. Also, I'm not sure how well this can be scaled up to true compound objects, given Fez' limitations.

Notes:
Goal: dissemination of research resources. Specific sites for specific science fields, ie soft matter wiki, materials failure case studies.
MatDL: has a repository (Fedora+Fez), wants to open up two-way communication. Example: Soft matter expert community, set up with MediaWiki. "Mediawiki hugely lowers the barrier for participating": familiarity gives a low learning curve.

The question: how to integrate the repository with the wiki two-way.

Thinking from a user-centric approach. Accommodate the user; support complex objects (more useful for research & teaching), thus describe their parts as individual objects.

Components:
- Wiki2Fedora
Batch run. Finds wiki upload file, converts referencing wiki pages to DC metadata for ingest in rep. (wiki has comment, rights, author sections -> very doable) Manual post-processing (Fez Admin review area function)
- Search results plug-in for wiki: display repository results in wiki search. Adds to mediawiki markup, to enable writing standard fez queries in the content.
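
To picture the Wiki2Fedora conversion step, here is a minimal Python sketch that turns wiki sections into a flat Dublin Core record. It is a guess at the general shape, not their code: the `wiki_to_dc` function, the `SECTION_TO_DC` mapping and the sample page layout are all assumptions, the talk only mentioned that the wiki pages have comment, rights and author sections.

```python
import re

# Assumed mapping from wiki section headings to Dublin Core elements;
# the talk only mentioned that comment, rights and author sections exist.
SECTION_TO_DC = {"comment": "description", "rights": "rights", "author": "creator"}

HEADING = re.compile(r"^==\s*(.+?)\s*==\s*$")


def wiki_to_dc(title, wikitext):
    """Turn '== Section ==' blocks of a wiki page into a flat DC record."""
    record = {"title": title}
    current = None
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            # Unknown headings map to None, so their text is skipped.
            current = SECTION_TO_DC.get(match.group(1).lower())
            continue
        if current and line.strip():
            # Join multi-line sections with a single space.
            previous = record.get(current)
            record[current] = f"{previous} {line.strip()}" if previous else line.strip()
    return record
```

Anything such a mapping cannot place would be left for the manual post-processing step.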

Sites: Repository - Wiki


2. Fedora and Django for an image repository: a new front-end - Peter Herndon (Memorial Sloan-Kettering Cancer Center)


Summary: using Django as a CMS, internally developed adapters to Fedora 3.1.

My gut feeling: A specific use case, images only, so rather limited in scope. Despite choosing the 'hook up with mainstream package' strategy, effectively still a NIH-based rolling their own. That makes the issues even more instructive.

Notes:
Adapting a CMS that expects SQL underneath is challenging - the plugin needs to be a full object-to-relational database mapper.
Also, Fedora 'delete' caused 'undesired results', 'inactive' should be used.
Further, some more unexpected oddities: had to write their own LDAP plugin to make it work, django has tagging but again plugin was needed to limit this to controlled vocabularies. Performance was not a problem.
Interesting: repository for images only, so exif and the like can be used - tags added using Adobe Bridge! The tested, successful strategy: make use of what is already familiar.
In the Q&A the question came up: why use Fedora in this case anyway? Indeed the only reason would be preservation, otherwise it would have saved a lot of trouble to use Django Blobstore.

The django-fedora plugins are available at bitbucket.org.



3. Islandora: a Drupal/Fedora Repository System - Mark A Leggott (University of PEI)

Summary:
Islandora looks *very* promising. I noted before (UPEI's Drupal VRE strategy) that UPEI is a place to watch - they are making radical choices with impressive outcomes.

Notes:
UPEI's culture is opensource friendly. They use Moodle and Evergreen (apparently, they were the first Evergreen site in production).

Rationale: opensourcing an in-house system reinforces good behaviour: full documentation, quality code.

As noted before, UPEI's repositories are hidden behind VRE (see [link]). VRE's are geared towards the researchers. Example of approach: the first thing people do when they set up a VRE is create a webpage. That's what a project needs, and so it's used as a hook to reel people in, they're up and running within a few hours.

The VRE is Drupal; Fedora is for data assets, metadata, policies.
Base Islandora consists of three plugins: Drupal-Fedora connection plugin, xacml filter, rule engine for searches.

This 'rule engine' is indeed very cool.
In a later private conversation with Mark Leggott, he clarified that Islandora indeed uses an atomistic complex object model for research data; the rule engine declares how these can be searched from within Drupal. Example: a dataset consisting of a number of measuring points, each with a set of instruments, stored atomistically in Fedora, can be queried as 'all the results from a specific measuring point', 'all the results from instrument x', 'instrument x in a specific period' etc.
We haven't reached Nirvana yet: to make deconstructing the data objects possible, they have to adhere to a specific format (xml). But it's impressive nevertheless.
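
The kind of querying this enables can be sketched in Python over a toy list of measurements. This says nothing about how the Islandora rule engine is actually built; the field names, the sample values and the `query` function are invented for the example.

```python
from datetime import date

# Each measurement is its own small object, as in an atomistic model.
# Field names and values are invented for this sketch.
measurements = [
    {"point": "A", "instrument": "x", "day": date(2009, 3, 1), "value": 4.2},
    {"point": "A", "instrument": "y", "day": date(2009, 3, 1), "value": 7.9},
    {"point": "B", "instrument": "x", "day": date(2009, 4, 12), "value": 3.8},
    {"point": "B", "instrument": "x", "day": date(2009, 5, 20), "value": 4.0},
]


def query(point=None, instrument=None, start=None, end=None):
    """Filter by any combination of criteria; None means 'do not care'."""
    hits = []
    for m in measurements:
        if point is not None and m["point"] != point:
            continue
        if instrument is not None and m["instrument"] != instrument:
            continue
        if start is not None and m["day"] < start:
            continue
        if end is not None and m["day"] > end:
            continue
        hits.append(m["value"])
    return hits
```

Because every measurement is a separate object, the same data answers 'by point', 'by instrument' and 'by period' questions without any restructuring.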


Other Drupal plugins add functionality for specific data. Impressive example: Drupal FCK editor used as TEI editor; after editing, it automatically adds a version to the datastream. Very cool and 'Just Works' (cheery tweet).

Marine Natural Products Lab: best example of the setup for VRE which includes extensive repository (searchable within the critter xml).

Previous versions used drupal 5/fedora 2, not maintained; currently drupal 6/fedora 3.1

Q: did you replace the drupal storage layer, or do you sync?
A: sometimes it’s saved in the drupal layer, when it doesn’t need to go into fedora (temporary data, while we build the content model). Drupal filesystem is a potential bottleneck with large datablobs.

Q: are you bound to content models?
A: standard fedora cm’s, you can build them yourself or change the delivered one. The models are exposed, you can see how it works. We first installed Fez to see how Fedora worked.


4. Project Hydra: Designing & Building a Reusable Framework for Multipurpose, Multifunction, Multi-institutional Repository-Powered Solutions - Tom Cramer (Stanford University), Richard Green (University of Hull), Bess Sadler (University of Virginia) et al.



Summary:
I'm even more excited about Hydra than about Islandora. Different approach: create "A lego set of services". In other words, a toolkit for the common parts of applications.
It all looks really good. Two gotchas though. Firstly, it is still a work in progress. Can we afford to wait? Secondly, there are issues with the Unicode support of Ruby on Rails.

For more info: D-Lib.

Notes:
Modelled after the current 12+ use cases of repositories in use at partner institutions, both institutional and personal.
It needs generic templates - which sometimes may do the job - otherwise it won’t come off the ground.
Hydra will have common content models and datastream names. But ultimately they want Hydra to be able to cope with almost anything. A MODS datastream will always have to be there, but not necessarily as primary (so can be done via disseminator).

Four multifunctional sections:
  • Deposit
  • manage (edit objects, set access)
  • search & browse
  • deliver
  • plus plumbing: authent, author, complex workflow
Using Rails with ActiveFedora. Turns out Rails lives up to its reputation: they are way ahead of their initial roadmap, now expect full production app by fall.

Specs 3/4 ready, coding 1,5/4.
Demo: http://hydra-dev.stanford.edu/etds

Presentation builds on top of blacklight OPAC. Virginia already has a beta version of their catalogue up using blacklight.

Monday, June 01, 2009

OR09: On the new DuraSpace Foundation, and Fedora in particular

Open Repositories 2009, day 3, morning: three sessions on Fedora.

The morning started with a joint presentation by Sandy Payette (Fedora Commons) and Michele Kimpton (DSpace Foundation), focussing on strategy and organisation; after caffeine break, Fedora+DSpace tech overview by Brad McLean; finally, developers' open house.

I'll cover it in one blog post (this or09 series is getting a bit long in the tooth, isn't it?). For the actual info on DuraSpace and all, see the DuraSpace website. The tech issues were covered more in depth in further sessions.

The merger is by now almost old news, though the incorporation still lies in the future: Fedora Commons and the DSpace user Group will become DuraSpace. The 'cloud' product, which originally had the same name, is renamed DuraCloud.

Not the easiest of presentations, as there is a good deal of scepticism around the merger, and not just on the twitter #or09 channel. Payette and Kimpton handled it very professionally, dare I say gracefully. Both standing on the floor, in front of the audience, talking in turns (did I imagine it, or did I really hear them taking over a sentence, in Huey & Dewey style?), while an assistant standing behind the laptop was going back and forth through the slides in perfect timing.


All in all, they pulled it off to come across as a seamless team. That bodes well.

Also welcome was the frankness in the Q&A (as well as later in the developers open house). After noting some difficulties in finding the right strategy for open source development: "we do not aim to mold DSpace's opensource structure to the Fedora core committer, on the contrary".

"We have to ask ourself: are we really community driven in the Fedora project? We've been closed in the past, we're opening up." Fedora has started using a new tracker, actually modelled on DSpace's model; "please use it, our tracker is our new inbox."

On the state of Fedora - many and diverse new users.

Escidoc is now deployable.

WGBH OpenVault - including annotated video

Forced Migration Online 

Jewish Women Archive - runs in EC2, first of a new wave of smaller archives now coming online using limited resources.

Notably missing on a slide listing 'major contributors' (Mediashelf, Sun, and Microsoft Research): VTLS. Possibly a sponsoring issue? It was more than a bit odd, given their standing in the past.


Q: "How do you see the future of DSpace vs. fedora - do they compete?"

A: "Fedora’s architecture is great, but we also need ‘service bundles’. CMS style on top for instance. The architecture will stay open for any kind of app on top. DSpace is going the other direction. Opportunity is to make sure we're not doing identical things with different frameworks."

It is *so* easy to read this as 'the products will meet in the middle', but this was carefully avoided. However, in the tech talk later it was mentioned that Fedora-DSpace replication back and forth experiments are actively worked on.

I think I'm not alone in thinking that the products will merge eventually. It will take some time, but they will.

Q: (cites another software company merger, IIRC Oracle and Peoplesoft) – merger brings great unrest in communities, which one is going to die? Are F&D moving together? Technical and cultural changes for both communities? etc.

A: Payette: any kind of software eventually becomes obsolete. We are determined not to let that happen, and for that it needs to be modular and organic. Side by side, cause they both do things well. When overlap starts to happen, that may change, but by the module.

Peter Sefton chimed in: very positive. Right decision at the right time. Focus on cloud computing is essential, feels that this is what we’re moving towards, and our current monolithic repositories need to adapt to that.

 

Some DSpace 1.x upcoming features: statistics, embargo, batch editing. I don't know that much about DSpace, and it shows: I was surprised that these weren't covered yet. Esp. batch editing and embargo, pretty basic features. I know too little of DSpace to judge the announced 2.0 features, apart from the DuraCloud integration using Akubra.

Fedora 3.2 highlights:

  • SWORD API 1.3. Of course. Nice though
  • new web admin client. Not all of the features are implemented, so the java client hasn't been deprecated - it will be in future. This is a big deal, as the client is also useful for metadata editing staff.
  • akubra: store files by ID, pluggable, stackable, multiplexing (ie on multiple storage environments that to the API look as one big one). Experimental, meaning included but not turned on by default.
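
The 'multiplexing' idea is easy to show in a toy Python sketch: several backends that look like one store to the caller. This is not the Akubra API (which is Java); the `MemoryStore` and `MultiplexingStore` classes and the checksum-based routing are assumptions made purely for illustration.

```python
import zlib


class MemoryStore:
    """Simplest possible blob store: bytes kept in a dict, keyed by ID."""

    def __init__(self):
        self.blobs = {}

    def put(self, blob_id, data):
        self.blobs[blob_id] = data

    def get(self, blob_id):
        return self.blobs[blob_id]


class MultiplexingStore:
    """Looks like one store to callers, spreads blobs over several backends."""

    def __init__(self, backends):
        self.backends = backends

    def _pick(self, blob_id):
        # A stable checksum, so the same ID always maps to the same backend.
        index = zlib.crc32(blob_id.encode("utf-8")) % len(self.backends)
        return self.backends[index]

    def put(self, blob_id, data):
        self._pick(blob_id).put(blob_id, data)

    def get(self, blob_id):
        return self._pick(blob_id).get(blob_id)
```

Since the multiplexer exposes the same put/get interface as a plain store, one multiplexer can itself be a backend of another, which is roughly what 'stackable' means here.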


Finally, the Fedora developer open house was like taking the pulse of the developer community. Summary: there are pains, communication has been problematic, with a gap between the committers and the community. My impression is that it is finally being talked about, and the core developers on the panel admit that a change is needed. A constructive and open approach.