Paul Wheatley has posted a self-proclaimed rant on his blog on the Open Planets Foundation web site. After reading the UDFR documentation, he says:
“I see nothing describing what concrete digital preservation problems this system will help with solving. I hope I’ve just missed it. But I’m really worried that it doesn’t exist.”
He’s kind enough to say, referring to one of my recent posts:
“Chris Rusbridge just got a bit closer to articulating some problems/aims in a blog post related to the Archive Team’s appeal to crowd source the “formats problem”. It’s interesting to see that Chris’s list is mainly about tools that do handy things to certain formats. This seems helpful and a bit more practical, although I would say that although Chris title’s the list “…what is the file format problem that we need to solve?”, most of the entries still sound more like solutions than problems! We as a community really are bad at articulating our challenges and requirements, and just can’t wait to dive into the solution. My worry of course is that we then create an amazing technical solution to a problem we don’t have.”
Remind you of anything? The answer to the ultimate question of life, the universe and everything (in Douglas Adams’s “The Hitchhiker’s Guide to the Galaxy”) turned out to be 42. But no-one could work out what the question was. As Paul says, we fall for this time and time again.
(BTW I didn’t know until today that many integers have their own Wikipedia pages! The one on 42 is quite extensive, but not primarily because of Douglas Adams.)
OK so we have to try harder. Being a very old techie, I started programming in procedural languages like FORTRAN and low level assemblers; Pascal was about as advanced as I got. The first time I heard about Object-oriented computing, it was explained to me as follows: an object comprises a data structure and a set of methods. You can only process the data structure via the methods; the object binds the two and requires both. I don’t find that definition or explanation in the (few) object-oriented text books I’ve looked at, but it makes sense to me.
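That definition can be sketched in a few lines of Python (my own illustrative example, not drawn from any preservation tool): the data structure lives inside the object, and you can only get at it through the object’s methods.

```python
class Counter:
    """An object: a data structure bound to the methods that process it."""

    def __init__(self):
        self._count = 0  # the data structure, reachable only via methods

    def increment(self):
        """A method that updates the data structure."""
        self._count += 1

    def value(self):
        """A method that accesses the data structure."""
        return self._count

c = Counter()
c.increment()
c.increment()
print(c.value())  # 2
```

The point is the binding: without `increment` and `value`, the raw `_count` field is just an opaque number, much as a preserved file without its software is just a stream of bytes.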
How is it relevant to digital preservation and the file format problem? It’s precisely and directly relevant. We have focused nearly all our preservation effort onto the data structures rather than the objects, and we’ve forgotten about the methods. Well, maybe we haven’t forgotten; we wring our hands a bit and some folk mumble about emulation as if that might fix it. But the object is holistic; you can’t access an object in a meaningful way without BOTH the data structure and the methods.
Now I don’t mean that every data structure in an archive needs to be accompanied by the original methods (software) that created or processed it. That wouldn’t make sense. We have to allow that a reasonably functional variant of an object can be formed from the data structure and a functionally similar (but actually different) implementation of the methods. We can also often allow a reduced set of methods; so a user or re-user of an object may not need the methods to create or update the data structure, only the methods to access or compute from it.
Given that, can I have a better go at characterising the file format problem? I’ll have a try. I’m going to continue with the object-oriented metaphor from above, even though I realise this will put some folk off. I also fear that I’m BOUND to get my language mixed up a bit, so I may need help, but I hope it makes some sense.
a) Given an arbitrary existing data structure X (known or suspected to be part of an object as defined above) that we need to process in some way, the first problem is to identify the set of methods needed to process X. This generally means finding a suitable program that will run on our computing environment. For a large number of data structures (files), proceeding on an individual data structure basis, this is going to be a long, hard job.
b) In the more common case where X is an instance of a class of data structures of a common format, we reduce the problem in (a) to three new problems: first (b1) we have to classify significant data structures into known types (ie file format types), then (b2) we need to find ways to identify the class of data structure X, then (b3) we need to identify the sets of methods available to process data structures of that class (ie files of that type). This is clearly much harder if there are only a few data structure instances for each type, but when there are billions of instances of a relatively small number of file format types, we hope to recover our major up-front investment. In this context, 10,000 file format types might count as a relatively small number!
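As a concrete illustration of (b2), many identification tools work from “magic numbers”, the distinctive leading bytes of a file. Here’s a minimal Python sketch; the signatures are real published ones, but the table and function names are my own invention, not any tool’s actual interface:

```python
# Map from leading-byte signature ("magic number") to a format class.
# These particular signatures are well known; a real identification
# tool would carry thousands of them, plus version distinctions.
MAGIC_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP container",
    b"GIF89a": "GIF image",
}

def identify(data: bytes) -> str:
    """Return the format class of a data structure, or 'unknown'."""
    for magic, fmt in MAGIC_SIGNATURES.items():
        if data.startswith(magic):
            return fmt
    return "unknown"

print(identify(b"%PDF-1.7 rest of file..."))  # PDF document
print(identify(b"no signature here"))         # unknown
```

The up-front investment is in building and maintaining the signature table; once it exists, classifying billions of instances is cheap.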
c) Having identified the methods for X, we need to be able to actually run the methods on the data structure. This can be arbitrarily easy or hard depending on the dissonance between the computing environment we want to use and the computing environment the methods were constructed for. This might be resolved by finding alternate implementations of the methods for the object. We also need to know how to run the methods to achieve the result we’re looking for (documentation!).
So far this is a problem framed in the present: for this data structure or file, can I find and run a method, or program, to do what I need with it. But archivists have decided they want to solve another problem.
d) Archivists want to ensure that future generations will be able to identify the classes of data structures like X, and find and use the methods to process them, for an arbitrarily long time into the future (the OAIS definition of “long term” seems, somewhat circularly, to be long enough for this to be a problem).
This is, to my mind, definitely a “nice to have” feature, but I’m not totally convinced that it is a problem that it is ESSENTIAL to solve in all cases. It is certainly not applied as rigorously in the analogue world. Museums didn’t decide to discard clay tablets or the Rosetta Stone even though they couldn’t read them. They don’t keep a Norse dictionary next to the Sagas. They rely on scholars to come equipped with suitable knowledge to be able to approach, to access and to process the physical objects. It’s perfectly plausible to imagine scholars of the future being trained with the arcane skills required to handle ancient data structures. And in my view, if you solve this problem for the here and now, there’s a reasonably good chance that it will still work tomorrow, and when tomorrow comes, the next day. It’s recursion all the way forwards! (Until it isn’t of course; but NOTHING is going to guarantee survival through the next great computing discontinuity!)
Anyway, that’s how I’m trying to frame the problem:
b1) classify known data structure types (and their variants) via some scheme that makes sense
b2) work out ways to identify the class of data structure for a particular instance
b3) identify the methods needed to process classes of data structures, and find them
c1) work out how to run those methods on different computing environments, or
c2) identify different implementations of the methods that will run on our chosen environment, and if possible
d) work out how to ensure this whole structure will continue to work indefinitely.
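To make (b3) and (c2) a little more concrete, one can imagine the pooled knowledge as a lookup from format class to the methods available for it. This is a toy Python sketch with invented entries; it is not the schema of any real registry or wiki:

```python
# A toy "registry" linking format classes to the methods (programs)
# known to process them. All entries are illustrative placeholders.
REGISTRY = {
    "PNG image": {
        "view": ["any modern web browser"],
        "convert": ["ImageMagick"],
    },
    "PDF document": {
        "view": ["any PDF reader"],
        "extract_text": ["pdftotext"],
    },
}

def methods_for(format_class: str, action: str) -> list:
    """Look up the known methods for an action on a format class."""
    return REGISTRY.get(format_class, {}).get(action, [])

print(methods_for("PDF document", "extract_text"))  # ['pdftotext']
print(methods_for("PDF document", "edit"))          # []
```

Note that the lookup returns a (possibly reduced) set of methods per action: as argued above, a re-user may only need the methods to access or compute from a data structure, not those to create or update it.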
To make this all work, it is certainly clear we need
e) some sort of information structures to pool our societal knowledge that contributes to the solution of these problems. Digital preservationists have tended to call these “registries”, while Jason thinks of them as a wiki. Registries are more exclusive, which reduces the ability of society at large to contribute but might improve accuracy. Wikis are more inclusive, so they can capture more societal input, possibly at the expense of accuracy. Whoops, I’m back in solution space again, sorry!
By the way, I deliberately didn’t add “validate the data structure”. While validating the data structure might be nice, and might help us find appropriate methods, what do we do if it fails validation? Throw the data structure away? Or do the best we can? Trust Postel’s Law: validate when you create or update, be generous when you access.
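A minimal sketch of that Postel-style policy in Python (my own illustration, using JSON as a stand-in format): be strict when writing a record, forgiving when reading one back.

```python
import json

def save_record(record: dict) -> str:
    """Create/update: be conservative, validate strictly before writing."""
    if "id" not in record:
        raise ValueError("record must have an 'id' field")
    return json.dumps(record)

def load_record(text: str) -> dict:
    """Access: be liberal; salvage what we can rather than discard it."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Don't throw the data structure away: keep the raw bytes
        # and flag the failure for a human (or a smarter tool) later.
        return {"raw": text, "parse_error": True}

print(load_record('{"id": 1}'))   # {'id': 1}
print(load_record("not json"))    # {'raw': 'not json', 'parse_error': True}
```

The strict path keeps bad data out of the archive at creation time; the lenient path ensures that whatever did get in can still be accessed, however imperfectly.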