Call to arms: solve the file format problem

3 Jul

Readers may well know that if there’s one thing I really care about, it’s digital preservation. Actually no, delete that. I don’t care about digital preservation AT ALL. If there’s one thing I really care about, it’s trying to ensure that people in the future can read and use important digital objects from their (and our own) past. That’s why I’m really pumped up about Jason Scott’s call that we make November 2012 “SOLVE THE FILE FORMAT PROBLEM MONTH”. It’s a great post, go read it…

You’re back? Good. Remember, this is a problem that we “digital preservation experts” (and I think I can still just about include myself in that category) have been waffling on about for years. It is a real problem, although I have claimed it is not as bad as we used to think. David Rosenthal has called format obsolescence “the Prostate Cancer of Preservation”, not because of any sudden swift deadliness, but because it is widespread and in most cases has not very severe effects (my half of the human race is much more likely to die with prostate cancer than because of it). And he points out that many of the proposed approaches to format obsolescence risk causing more damage than benign neglect would have done. You read David’s post? Great to see you back again…

There have been various approaches to aspects of this problem. One approach is to try to gather authoritative information about the various file formats in registries; PRONOM is one example of this approach; the proposed Global Digital Format Registry (GDFR) is another. Another (linked) approach is to provide tools to help identify the file type of a particular file found “in the wild” as it were, based on various clues around and within the file. DROID is one example. Still another approach is to try to ensure that files of a particular format are well-formed by validating them; JHOVE is such a tool.
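
To make “clues around and within the file” a little more concrete, here is a minimal sketch in Python. It is not how DROID is implemented; it only illustrates the general idea of checking a file’s leading bytes against a few widely documented signatures, then falling back to the file name extension as a weaker clue. The names `SIGNATURES`, `EXTENSION_HINTS` and `identify` are my own inventions for illustration, and the table is tiny and hand-picked.

```python
from typing import Optional

# A tiny, hand-picked table of leading-byte signatures ("magic numbers").
# Illustrative only: a real registry holds far more entries and richer patterns.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP container"),
]

# Hypothetical extension hints: a much weaker clue "around" the file.
EXTENSION_HINTS = {
    ".txt": "plain text (guess from extension)",
    ".csv": "comma-separated values (guess from extension)",
}


def identify(data: bytes, filename: str = "") -> Optional[str]:
    """Guess a format label from leading bytes, then from the extension."""
    # Clue within the file: does it start with a known signature?
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    # Clue around the file: fall back to the extension, if there is one.
    dot = filename.rfind(".")
    if dot != -1:
        return EXTENSION_HINTS.get(filename[dot:].lower())
    return None
```

Calling `identify(open(path, "rb").read(64), path)` on a file would return a label or `None`. The point of the sketch is how quickly the simple approach runs out: many formats have no fixed leading signature, and an extension can be wrong, which is why the information gathered in registries matters.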

One problem with these approaches has been the demand for an authoritative view. They are the results of collaborations of insiders. They ignore the vast amount of information held outside the insiders. Was it Bill Joy of Sun who said “most of the smartest people work for somebody else”? Well, most of the information about file formats is known by people other than the insiders and experts. Gamers, hobbyists and the back-bedroom crew fascinated by old stuff know far more about many more formats than do the insiders and experts.

That’s what is so refreshing about Jason Scott’s call to arms. He doesn’t care about insiders and experts. I have no idea what Jason Scott thinks about OAIS or the subtleties of representation information, but I’d be willing to bet that it’s way more “expletive deleted” than my own jaundiced view. Get smart people to work together to do stuff: that’s a positive and valuable attitude.

But… what is the file format problem that we need to solve? It could be

  • lists of file formats with useful information about them
  • file format specifications (and variations)
  • tools to identify file formats, given files (cf DROID etc)
  • tools to validate file formats (cf JHOVE)
  • tools to migrate file formats from obsolescent to more modern forms (also known as “Save As…”)
  • tools to emulate older environments to allow obsolescent file formats to be handled
  • tools to process obsolescent file formats in current environments (related to “migration on demand”)
  • all of the above… gathering all available information about file formats and tools for handling them together.

… or other things I haven’t thought of!

I don’t know quite what Jason has in mind but if he gets even part of that done it’s likely to be something useful. My guess is he’ll be looking for maximum impact rather than maximum polish or maximum “correctness”. I’d like to join in!

7 Responses to “Call to arms: solve the file format problem”

  1. Andrew Wilson (@ancwil) 3 July, 2012 at 23:16 #

    Yes, great but Jason seems to be labouring under the very mistaken belief that no-one has done anything about this in the last few decades. All of us who work in digital preservation, whatever community we’re from, have been working on this problem and we’ve all contributed little bits of work to some sort of developing solution. I think there’s also an inherent assumption in Jason’s post that solving the problem means it will go away forever. That ain’t going to happen no matter how much we might wish it to. Unless we somehow come up with a high-level approach that can be implemented forever on whatever formats come into existence in the next thousand years. Somehow I don’t think we will get there for quite a while yet. What we have now, and will continue to have, it seems to me, are interim solutions. And I’m not sure we can or should want for much more than that – not because I’m a pessimist but because I think we need to keep on working on this issue for as long as humans use computing technology. Otherwise we stagnate.

    • Andy Jackson (@anjacks0n) 5 July, 2012 at 10:05 #

      I’m sorry I mis-represented your position in my comment below – I didn’t mean to, and I know you know this isn’t solved. I agree that this should be an ongoing responsibility for a number of organisations, and we will need more than a file format reference stack in order to solve the problems we face in digital preservation. I do, however, believe that we need the file format reference stack in order to get the other stuff right.

  2. Andy Jackson (@anjacks0n) 4 July, 2012 at 09:20 #

    If Jason is “labouring under the very mistaken belief that no-one has done anything about this in the last few decades”, I think he can be forgiven for it. We’ve done a lot of modelling, written a lot of papers, built a lot of registries. But, apart from PRONOM, which doesn’t really cover the same territory as Jason wants to address, what have we got? How full are our registries? How usable and discoverable are they? How far have we publicised them? If we’ve ‘solved’ this problem, why doesn’t he know about it?

    I think if you read Jason’s full post, you’ll realise he knows this is not a one off. http://ascii.textfiles.com/archives/3645 “Think what giving a month every year will do for a problem like this.”

  3. Jason Scott (@textfiles) 4 July, 2012 at 14:33 #

    Loved both your postings, Chris. You’re a thinker!

    I love phrases like “laboring under the misbelief” because it is the kind of mealy-mouthed decanter-sniffing phrase of the “go away, plebe” that I’ve dealt with for a bunch of stuff I do.

    I think you can be reasonably assured that if I presented a keynote at the Joint Conference of Digital Libraries, and also presented at two Personal Digital Archiving conferences, that at some point in those conferences I hung a bit with people dedicated to the issues of file formats, or for whom this has been a problem, or who informed me of various aspects of issues. And besides telling me the problem, they would also have made me aware of what steps had already been made towards the problem.

    Heck, as the guy who runs textfiles.com, I have been building collections like this for decades: http://www.textfiles.com/programming/FORMATS/ – I know fully well there’s plenty of work.

    I have a Wiki page with initial sketches here: http://www.archiveteam.org/index.php?title=Just_Solve_the_Problem_2012
    One of the first lines I put in about the context project is this:

    “This is not a “sprung from the forehead of Zeus” attempt to completely re-boot the process of enumerating the many formats out there. Much work has been done and there is much to share.”

    No, I am not a new registry in “competition” for “mindshare” on the issue. I am a chaos agent, like we’ve been with Archive Team (different than archive.org, by the way), turning the theoretical and the progressive into the real. When Archive Team started, people sniffed how we were using WGET instead of some properly standards compliant web archive format. Within a short time, WE CHANGED WGET TO SUPPORT WARC. And I can assure you, our ability to download the picplz photo sharing site in 36 hours, using an open-standards/open-source tool like the Archive Team Warrior VM was something very few other organizations could turn around and do.

    So, what I’m going to do is frame this issue in such a way that I will loose an army of folks on the problem, cut away from politics and variant focus. I hope to make it a thousand. The resulting wiki and files from that wiki, that directory, will be completely open. It will, of course, pull information from all known registries, since that level of work has been done, but it will aggressively track down experts and folks weighing in on all layers of the “file format” issue, from developers to archivists, and put it in one place.

    After 30 solid days of this, we’ll have…. something. That thing will be, I bet, very, very large. It will go in many directions. It will absolutely be inferior in some ways to some registries and it will absolutely trounce others. But most importantly, we’ll have it in a form and way that others can use, move with, or absorb back into the other registries.

    The goal is to enumerate every file format. Every step we take pushes it that way. At the end of the journey, maybe we won’t have every file format. Or maybe we will. Let’s find out.

    • Andrew Wilson (@ancwil) 4 July, 2012 at 23:58 #

      OK. Nothing I said was critical of what Jason is calling for, nor do I think it won’t come up with something worthwhile. Despite Andy, I never suggested at any point in my comment that the problem has been solved. I know as well as you that it hasn’t been. I think it’s a fantastic idea to address ONE of the file format problems in the way Jason is suggesting. But I don’t think that development of a file format registry however comprehensive will SOLVE the problem of file formats. There will still remain all the problems Chris has raised in his post above. A registry of the sort Jason is imagining will be a great and important step on the journey but it won’t be THE solution.

Trackbacks/Pingbacks

  1. Response to the “call to arms” post « Unsustainable Ideas - 4 July, 2012

    [...] I wrote in excitement at Jason Scott’s call to arms to make November 2012 the month to “solve the file format [...]

  2. The solution is… 42! What was the problem? « Unsustainable Ideas - 4 July, 2012

    [...] kind enough to say, referring to one of my recent posts “Chris Rusbridge just got a bit closer to articulating some problems/aims in a blog post related to t…. It’s interesting to see that Chris’s list is mainly about tools that do handy things [...]

Comments always welcome, will be treated as CC-BY