The solution is… 42! What was the problem?

4 Jul

Paul Wheatley has posted a self-proclaimed rant on his blog on the Open Planets Foundation web site. Writing about UDFR after reading its documentation, he says:

“I see nothing describing what concrete digital preservation problems this system will help with solving. I hope I’ve just missed it. But I’m really worried that it doesn’t exist.”

He’s kind enough to say, referring to one of my recent posts:

“Chris Rusbridge just got a bit closer to articulating some problems/aims in a blog post related to the Archive Team’s appeal to crowd source the “formats problem”. It’s interesting to see that Chris’s list is mainly about tools that do handy things to certain formats. This seems helpful and a bit more practical, although I would say that although Chris title’s the list “…what is the file format problem that we need to solve?”, most of the entries still sound more like solutions than problems! We as a community really are bad at articulating our challenges and requirements, and just can’t wait to dive into the solution. My worry of course is that we then create an amazing technical solution to a problem we don’t have.”

Remind you of anything? The answer to the ultimate question of life, the universe and everything (in Douglas Adams’ “The Hitchhiker’s Guide to the Galaxy”) turned out to be 42. But no-one could work out what the question was. As Paul says, we fall for this time and time again.

(BTW I didn’t know until today that many integers have their own Wikipedia pages! The one on 42 is quite extensive, but not primarily because of Douglas Adams.)

OK so we have to try harder. Being a very old techie, I started programming in procedural languages like FORTRAN and low-level assemblers; Pascal was about as advanced as I got. The first time I heard about object-oriented computing, it was explained to me as follows: an object comprises a data structure and a set of methods. You can only process the data structure via the methods; the object binds the two and requires both. I don’t find that definition or explanation in the (few) object-oriented textbooks I’ve looked at, but it makes sense to me.
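
Just to make that concrete, here is a minimal sketch in Python (the class and the names are mine, purely illustrative): the data structure lives inside the object, and the only sensible way to get at it is through the methods.

    class SpreadsheetCell:
        """An 'object' in the sense above: a data structure plus the
        methods needed to make sense of it."""

        def __init__(self, raw_bytes):
            self._raw = raw_bytes                # the data structure

        def value(self):                         # a method: interpret the data
            return float(self._raw.decode("ascii"))

        def render(self):                        # another method: present it
            return "{:.2f}".format(self.value())

    cell = SpreadsheetCell(b"3.14159")
    print(cell.render())                         # prints 3.14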

How is it relevant to digital preservation and the file format problem? It’s precisely and directly relevant. We have focused nearly all our preservation effort onto the data structures rather than the objects, and we’ve forgotten about the methods. Well, maybe we haven’t forgotten; we wring our hands a bit and some folk mumble about emulation as if that might fix it. But the object is holistic; you can’t access an object in a meaningful way without BOTH the data structure and the methods.

Now I don’t mean that every data structure in an archive needs to be accompanied by the original methods (software) that created or processed it. That wouldn’t make sense. We have to allow that a reasonably functional variant of an object can be formed from the data structure and a functionally similar (but actually different) implementation of the methods. We can also often allow a reduced set of methods; so a user or re-user of an object may not need the methods to create or update the data structure, only the methods to access or compute from it.

Given that, can I have a better go at characterising the file format problem? I’ll have a try. I’m going to continue with the object-oriented metaphor from above, even though I realise this will put some folk off. I also fear that I’m BOUND to get my language mixed up a bit, so I may need help, but I hope it makes some sense.

a) Given an arbitrary existing data structure X (known or suspected to be part of an object as defined above) that we need to process in some way, the first problem is to identify the set of methods needed to process X. This generally means finding a suitable program that will run on our computing environment. For a large number of data structures (files), proceeding on an individual data structure basis, this is going to be a long, hard job.

b) In the more common case where X is an instance of a class of data structures of a common format, we reduce the problem in (a) to three new problems: first (b1) we have to classify significant data structures into known types (ie file format types), then (b2) we need to find ways to identify the class of data structure X, then (b3) we need to identify the sets of methods available to process data structures of that class (ie files of that type). This is clearly much harder if there are only a few data structure instances for each type, but when there are billions of instances of a relatively small number of file format types, we hope to recover our major up-front investment. In this context, 10,000 file format types might count as a relatively small number!
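
To make (b2) concrete, here is a toy sketch of how identification is often done in practice: match “magic” byte signatures at the start of the file. The signature table here is tiny and purely illustrative; real tools (DROID, file(1) and friends) carry far richer signature sets.

    # Toy format identification by magic-byte signature (illustrative only).
    SIGNATURES = {
        b"%PDF-": "application/pdf",
        b"\x89PNG\r\n\x1a\n": "image/png",
        b"PK\x03\x04": "application/zip",        # also OOXML, ODF, EPUB...
    }

    def identify(path):
        with open(path, "rb") as f:
            head = f.read(16)
        for magic, media_type in SIGNATURES.items():
            if head.startswith(magic):
                return media_type
        return None   # unknown: fall back to extension, deeper sniffing, a human...

    print(identify("some-file-we-found.bin"))    # assuming such a file exists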

c) Having identified the methods for X, we need to be able to actually run the methods on the data structure. This can be arbitrarily easy or hard depending on the dissonance between the computing environment we want to use and the computing environment the methods were constructed for. This might be resolved by finding alternative implementations of the methods for the object. We also need to know how to run the methods to achieve the result we’re looking for (documentation!).
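
As a sketch of (c): once the class of X is known, you still have to find an implementation of the methods that actually runs in your environment, and know how to invoke it. Something like the following, where the tool names are just examples of substitutable implementations, not a recommendation:

    import shutil, subprocess

    # Candidate implementations of a "convert to PDF" method for one format
    # class; which of them we can actually run depends on our environment.
    CANDIDATE_TOOLS = {
        "application/msword": [
            ["libreoffice", "--headless", "--convert-to", "pdf"],
            ["abiword", "--to=pdf"],
        ],
    }

    def run_method(media_type, path):
        for cmd in CANDIDATE_TOOLS.get(media_type, []):
            if shutil.which(cmd[0]):             # is this implementation available here?
                return subprocess.run(cmd + [path], check=True)
        raise RuntimeError("no usable implementation of the methods for " + media_type)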

So far this is a problem framed in the present: for this data structure or file, can I find and run a method, or program, to do what I need with it? But archivists have decided they want to solve another problem.

d) Archivists want to ensure that future generations will be able to identify the classes of data structures like X, and find and use the methods to process them, for an arbitrarily long time into the future (the somewhat circular OAIS definition seems to be that the long term is long enough for this to be a problem).

This is, to my mind, definitely a “nice to have” feature, but I’m not totally convinced it is ESSENTIAL to solve in all cases. It is certainly not applied as rigorously in the analogue world. Museums didn’t decide to discard clay tablets or the Rosetta Stone even though they couldn’t read them. They don’t keep a Norse dictionary next to the Sagas. They rely on scholars to come equipped with suitable knowledge to be able to approach, to access and to process the physical objects. It’s perfectly plausible to imagine scholars of the future being trained with the arcane skills required to handle ancient data structures. And in my view, if you solve this problem for the here and now, there’s a reasonably good chance that it will still work tomorrow, and when tomorrow comes, the next day. It’s recursion all the way forwards! (Until it isn’t, of course; but NOTHING is going to guarantee survival through the next great computing discontinuity!)

Anyway, that’s how I’m trying to frame the problem:

b1) classify known data structure types (and their variants) via some scheme that makes sense

b2) work out ways to identify the class of data structure for a particular instance

b3) identify the methods needed to process classes of data structures, and find them

c1) work out how to run those methods on different computing environments, or

c2) identify different implementations of the methods that will run on our chosen environment, and if possible

d) work out how to ensure this whole structure will continue to work indefinitely.

To make this all work, it is certainly clear we need

e) some sort of information structures to pool our societal knowledge that contributes to the solution of these problems. Digital preservationists have tended to call these “registries”, while Jason thinks of them as a wiki. Registries are more exclusive, which reduces the ability of society at large to contribute, but might improve accuracy. Wikis are more inclusive, so can capture more societal input, possibly at the expense of accuracy. Whoops, I’m back in solution space again, sorry!
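
Since I’m there anyway: whatever we call the thing, the knowledge it pools per format class might be as simple as this (an entirely made-up entry, just to show the shape of the record):

    # A hypothetical registry/wiki entry tying together (b1), (b2) and (b3).
    format_record = {
        "id": "fmt/example-001",                              # invented identifier
        "name": "Example word-processor format, v2",
        "signatures": [{"offset": 0, "magic": "45584432"}],   # for (b2); hex for 'EXD2'
        "extensions": [".exd"],
        "methods": [                                          # for (b3)
            {"tool": "exampleconv", "action": "convert-to-pdf"},
            {"tool": "LibreOffice", "action": "open/render"},
        ],
        "documentation": ["https://example.org/spec/v2"],
    }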

By the way, I deliberately didn’t add “validate the data structure”. While validating the data structure might be nice, and might help us find appropriate methods, what do we do if it fails validation? Throw the data structure away? Or do the best we can? Trust Postel’s Law! Validate when you create or update, be generous when you access.
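
In code terms, that asymmetry might look something like this (a sketch of the principle, not a recipe):

    import json

    def save_record(record, path):
        # Be strict when we create or update: refuse to write anything malformed.
        text = json.dumps(record, allow_nan=False, ensure_ascii=False)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

    def load_record(path):
        # Be generous when we access: salvage what we can rather than throw it away.
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            raw = f.read()
        try:
            return json.loads(raw)
        except ValueError:
            return {"_unparsed": raw}   # keep the content; let a human or a better tool try later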


11 Responses to “The solution is… 42! What was the problem?”

  1. Andy Jackson (@anjacks0n) 4 July, 2012 at 22:45 #

    I think this is pretty much the issue. We are always preserving a process, a performance – i.e. both the data and its interpretation. Put more simply, we are in the business of preserving software, whether we like it or not. In some cases, where there are well established formats that are strongly consistent across implementations, we can pretend this is not the case. This allows us to treat preserving the data and preserving the software as separate processes linked only by one or more format identifiers. However, I fear this kind of social norm is the exception rather than the rule.

    I think there is one more necessary refinement, however, if we are to capture how things are rather than how we might wish them to be. We must acknowledge that the in-memory representation of a digital object (the state) is fundamentally NOT identical to the stored artefact (the bitstream). The software decides which aspects of the performance should persist, and how. Thus, ‘Save As…’ is the zeroth preservation action, which migrates data from a live process to a series of bytes which can be used to reconstruct that process.

    I believe that, if we are really going to pin down the relationship between bytestreams, the software that reads/writes them, and the performances they produce, we must understand not only the relationship between the ‘methods’ and the ‘state’, but also the differences between that live state and the persisted state.

    To put that into terms closer to those in your post, a file format is precisely that which defines persistent data structures and how to interpret them (i.e. not just the data structures). This means that most of the activities you identify (a, b, c etc.) are ongoing in the industry, and that we should perhaps focus on observing and capturing how the broader community is solving the issue, rather than trying to solve it ourselves.

    I hope that makes some sense. I keep meaning to write my position up properly, but can’t find the time. Maybe all these discussions will finally push me into doing it!

    • Andrew Wilson (@ancwil) 5 July, 2012 at 00:17 #

      Chris, Andy, this is a great discussion. I’m interested in your reference to ‘performance’ Andy. The ‘performance’ model articulated by the National Archives of Australia (in 2002) is explicitly not about preserving the process (ie. in the NAA view the hardware and software used to create the performance) but about finding ways to preserve the performance over time. You seem to have something different in mind and I wondered if you could expand on your comment “We are always preserving a process, a performance – i.e. both the data and its interpretation”. Is it necessary to preserve the ‘process’ in order to replicate the performance? If over time we are able to use different hardware and software to interpret the persistent data structure and replicate all the essential parts of the performance, do we need to preserve the original software?

    • Andy Jackson (@anjacks0n) 5 July, 2012 at 10:43 #

      I’m enjoying this conversation, but I think late night ranting is a dangerous game, and I’m not sure how clear I was by the end of that comment! I agree, Andrew, that the data, the process and the performance are distinct, and I did mean to use the word ‘performance’ in that way (as per NAA 2002). And yes, I agree that if preserving the performance is your goal, then you should be free to shift the data and the process as required in order to reconstruct the performance (and therefore that any ‘significant properties’ you use to validate this shift must be couched in the language of the performance, rather than the format or process).

      Therefore, while I agree that we don’t *need* to preserve the original software, I would argue that in general, preserving the performance is actually equivalent to preserving the software, because the ‘meaning’ of the data is only defined unambiguously by the software. Therefore, literally preserving the software may be the most effective way of preserving the performance. There are notable exceptions, i.e. well established formats where the ‘semantics’ of the data are sufficiently clear that this is not necessary, but I fear they represent a minority of file formats.

      Having said that, for those few very well defined and standardised formats, I think the role of the format specification is to define the data format and its interpretation, in order to ensure performances are consistent. i.e. a format specification is a significant property scheme.

    • Euan Cochrane 7 July, 2012 at 04:01 #

      Hi all,

      I just wanted to add a small follow up to Andy’s last comment. I suspect (but can’t confirm due to lack of data on any of the options) that preserving the original software and using that to maintain the performance will not only be more effective at preserving the performance but will also end up being more efficient and generally easier and cheaper than the alternatives.

      Average people understand software and software interfaces and most applications are designed to be able to be installed and used by people with little training (or come with documentation enabling that). The ability to install and use old software is a large component of what I believe would be needed for an average Digital Preservation Practitioner to implement a workflow that used original software to preserve performances. Such a person wouldn’t need to be a technical expert, and wouldn’t need a large amount of training, and so, importantly, they wouldn’t need to be paid a high salary. So that is one of the reasons it seems to me to be a more efficient option than alternatives that require expensive technically proficient/expert, highly trained staff.

      There are also the economies of scale involved with using original software for preserving performances and a myriad of other negative reasons why it seems cheaper/more efficient (reasons why the alternatives are worse).

      More generally I’m pleased and excited that at least some of us seem to be coming to an agreement on the goal of digital preservation: preserving performances!

    • Andy Jackson (@anjacks0n) 7 July, 2012 at 13:52 #

      I would prefer to say that the goal of digital preservation is simply re-use. In general, this means preserving the performance, but in my opinion, that language implies that it is all about enabling some future user to re-experience individual resources. There are other important use cases that focus on re-using items /en masse/, such as indexing for discovery and feature extraction for data mining. These are related, of course (the performance must be understood for the feature extraction to be accurate) but not the same and thus require different infrastructure.

    • Euan Cochrane 7 July, 2012 at 14:01 #

      Completely agree Andy.

      The language issues are a pain.

  2. Andrew Wilson (@ancwil) 10 July, 2012 at 07:29 #

    Andy, going back to your post of 5 July (sorry I’m tardy!) I’m not sure I agree with your jump from preserving the performance to “I would argue that in general, preserving the performance is actually equivalent to preserving the software”. If you accept for the moment the performance model, the basis for the concept is that it is not necessary to be able to reproduce, at some unknown future time, the original performance. Hence the need to understand the essence or ‘significant properties’ of the performance, as these are what need to be carried forward over time, not the original “look and feel”. From this flows the idea that the original software does not need to be kept. I think this is probably the most contentious bit of the performance view (?). Since I generally (with some reservations) accept the performance model, I’m much less wedded to the idea that original software has to be kept. I wouldn’t deny that there is a place for emulation in digital preservation strategies, but I don’t agree that this is what we have to do. Perhaps as an archivist, I’m willing to live with less faithfulness to the original ‘look and feel’ as long as we can establish and document the authenticity trail from the moment of ingest (or even before).

    • Euan Cochrane 10 July, 2012 at 07:48 #

      Hi Andrew (et al),

      To me the issues come in the difficulty of identifying automatically what the significant properties of any object are, and in particular in automatically identifying when aspects of the object that often get lumped together with “look and feel” are actually integral to conveying the meaning of the information the original object was meant to convey to the end-user. If we can’t automatically identify those properties then we will never be able to do the type of preservation that relies on significant property extraction and comparison in a cost-effective way on a large scale, as we will have to manually identify and check for these properties, which (I assume, due to lack of evidence) would be cost-prohibitive. We will still be able to do it on a small scale, but that is not very useful. It is for that reason that it seems to be more cost-effective to rely on preserving the whole original performance, because then you can know that you have preserved the significant properties without even having to identify them (so long as you are comfortable with the fidelity of an emulated version of the original environment).

      So as I see it, until we can develop tools that comprehensively and effectively automatically identify significant properties (which we may someday be able to do using machine learning and related techniques) then we have to take the alternatives (such as emulation) seriously as they seem to be more practical right now.

    • Andy Jackson (@anjacks0n) 10 July, 2012 at 23:24 #

      Well, this cuts to the heart of the matter that I’ve had so much trouble trying to express. Well, here goes…

      So, the premise is that we shall judge the reproduction of some performance, whether emulated or via migration, by using some formal scheme to capture and compare the ‘significant properties’ of the two performances. I have three issues with this. Firstly, I’m not sure we should be the people who get to decide what is ‘significant’. Secondly, I believe this approach is both extremely difficult (as Euan points out) and practically unnecessary – i.e., that when we do want to evaluate a reproduction we don’t need to create a special new ‘preservation language’ to make it work. However, I’ll leave those for another time, because my third problem goes much deeper.

      I’ll start to pick this apart using a really simple example – ASCII. To render ASCII, we must implement a process: if the byte value is 0x61, plot a glyph that looks like ‘a’ and move to the next spot. Capturing the contextual information this depends on, the mapping table and the glyphs, is not sufficient to reproduce this. The rendering is fundamentally a process, a projection, a computation, not static information. The process can be written down, but this is just migrating it to another language, and if you use prose it will need to be re-implemented in order to interpret and make use of it. We always end up implementing or porting software – the documentation of the rendering is just helping us get it right.
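
      Even for that trivial case, the “specification” is really a little program. A toy sketch (Python, purely illustrative, not taken from any real renderer):

          def render_ascii(data):
              # The mapping table (byte -> glyph) is data, but applying it in
              # order, advancing one position per byte, is irreducibly a process.
              glyphs = {i: chr(i) for i in range(0x20, 0x7f)}   # printable ASCII only
              return "".join(glyphs.get(b, "?") for b in data)

          print(render_ascii(b"byte 0x61 renders as: a"))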

      Unfortunately, in general, the properties of a rendering process cannot be captured by a simple document composed of prose and declarative data. The only unambiguous language one can use to describe any process a performance may contain is a Turing-complete language. Any Turing-complete language will suffice, but a language less powerful than that (e.g. the kind of simple declarative data structures we often prefer, or something like boolean logic) will not be able to capture everything we may want to preserve. Thus, if preserving general processes means we are required to preserve one or more Turing-complete formal languages, we are preserving software, by definition.

    • Andy Jackson (@anjacks0n) 11 July, 2012 at 11:29 #

      BTW, I’m not saying that you cannot migrate your data down to a model which is simpler, e.g. discarding ‘look and feel’. I am, however, saying that this is simply equivalent to being dependent on *less* software, rather than on *no* software. I believe this is formally equivalent to a normalisation strategy, and I think it will turn out to be a perfectly valid thing to do. The difficult part is how to be sure about what you are throwing away, which is why keeping a copy of the original software is a necessary failover system for so many cases.

  3. Andy Jackson (@anjacks0n) 11 July, 2012 at 11:25 #

    By the way, I really must point out that all of this abstract cogitation is about building information systems that exploit format information, and we don’t actually need to build that FIRST!

    As I blogged recently (http://www.openplanetsfoundation.org/blogs/2012-07-06-biodiversity-and-registry-ecosystem), collecting the information we need and making it safe is a technically simple task we can get on with right now. We collect the links and resources, and use a wiki to describe why they are useful. We have been doing this for some time, and could do so even if PRONOM/UDFR/etc. didn’t exist. In fact, all of my comments are really a statement that I do not believe we are going to get those detailed models right until we have collected a lot more information and a lot more of the individuals involved understand formats and software a lot more deeply than we do now.

    So let’s get started!

Comments always welcome, will be treated as CC-BY
