Yesterday I wrote in excitement at Jason Scott’s call to arms to make November 2012 the month to “solve the file format problem”. While excited, I’m not quite clear yet what the problem is, so I suggested some possibilities. I got a couple of comments, which is nice, as what I write here usually disappears into the great void that is the interwebs. The second is an excellent response to the first, but I wanted to make my own response.
Andrew Wilson commented:
“Yes, great but Jason seems to be labouring under the very mistaken belief that no-one has done anything about this in the last few decades. All of us who work in digital preservation, whatever community we’re from, have been working on this problem and we’ve all contributed little bits of work to some sort of developing solution. I think there’s also an inherent assumption in Jason’s post that solving the problem means it will go away forever. That ain’t going to happen no matter how much we might wish it to. Unless we somehow come up with a high-level approach that can be implemented forever on whatever formats come into existence in the next thousand years. Somehow I don’t think we will get there for quite a while yet. What we have now, and will continue to have, it seems to me, are interim solutions. And I’m not sure we can or should want for much more than that – not because I’m a pessimist but because I think we need to keep on working on this issue for as long as humans use computing technology. Otherwise we stagnate.”
Two parts to this, really: is Jason ignoring the good work already done, and will a one-time hit solve the problem?
On the first part, there are a few good signs. First, Jason works for (if not at) the Internet Archive, and there are plenty of folk there who know a lot about what the digital preservation and archives professions have done in this area. Second, the planning has started on a wiki, and around a dozen services relevant to file formats are already listed there (as is my blog post from yesterday, which was a bit of a surprise). Of course, being a wiki, your mileage may vary depending on when you look at it.
But have we really done so well? Yes, quite a bit of work has been done. Later yesterday there was the announcement of the UDFR service, successor to GDFR and supposedly a munging together with PRONOM. For the purposes of this post, I have deliberately not looked at UDFR; it will take more thought and analysis than I have time for right now. Before UDFR, the shining light in this area was PRONOM. I wrote a few posts about PRONOM last year, including “What would success look like?”.
Let me make it quite clear: I have great admiration for the PRONOM team and for TNA in supporting them. For me, part of the problem with PRONOM is resources. The PRONOM team can never get big enough to find and validate all the information that PRONOM deserves to hold. Indeed they almost admit as much by offering a facility for ordinary citizens to make suggestions. So I did, and the results up to November last year are reported in that post. I’ve just been and checked, and the last update of the entry I commented on is dated April 2012, but it still contains almost no data. This is the MS Excel 2007 .xlsx format, and we have only an outline record. Almost a year later, and a citizen’s attempt to help PRONOM has still not been acted on. Of course, not every suggestion would be appropriate, but I reckon linking the description of MS Excel’s xlsx format to the appropriate standards would be useful, don’t you?
And even if they did manage to accept more information, would it help? I have a problem with some early Mac PowerPoint 4 files on this computer; my latest version of PowerPoint (2004) won’t open them. Let’s find out what PRONOM can tell us about PowerPoint 4. (By the way, the search capability on PRONOM just sucks. Really. Start from the front and try to find PowerPoint 4 and see where you have to go!) What does the entry (x-fmt/88) say?
“This is an outline record only, and requires further details, research or authentication to provide information that will enable users to further understand the format and to assess digital preservation risks associated with it if appropriate. If you are able to help by supplying any additional information concerning this entry, please return to the main PRONOM page and select ‘Add an Entry’.”
“Developed by: None.”
(That’s pretty much all it says about the latest versions too, by the way.) I also have some problems with mind map files. Neither piece of mind mapping software is mentioned in PRONOM, and nor are their formats.
If software or file formats are mentioned, all we really have is some sparse, basic information about them, with maybe a signature of some sort to aid in identification. Even if the entries existed, they would be no help to me in reading my PowerPoint 4 or Mind Manager files. To read them, I need some sort of tool.
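To make the gap between identifying and reading concrete, here is a minimal sketch of what a signature buys you. It is not PRONOM's method or data: the `SIGNATURES` table and the `identify` and `identify_file` helpers are my own hypothetical names, and the table holds just three widely documented magic numbers rather than anything like a full registry.

```python
# Hypothetical, illustrative signature table: a handful of well-known
# magic numbers, nowhere near the coverage of a real format registry.
SIGNATURES = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
    (b"PK\x03\x04", "ZIP container (used by .xlsx, among many others)"),
    (b"%PDF-", "PDF"),
]


def identify(data: bytes) -> str:
    """Return a label for the first signature matching the leading bytes."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first few bytes of a file and try to identify it."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

Even when this works, all it hands back is a label. It can tell me a file is a ZIP container, but not how to render the slides or the mind map inside it, which is exactly the part I need.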
Archivists are by nature careful and precise. They want things to be right. They don’t want to promulgate wrong information, or muddy a provenance trail. How many archives acted when Geocities was going down? Or the many other services that are dying even now? How many archives helped with the Deathwatch? Maybe some, but mostly those actions were from the rogue archivists. Stuff protocol, let’s get in there and do what we can: that’s the Archive Team attitude. Bold and brash and maybe wrong, but some people have access to some of their stuff who would have lost it, while archives mostly looked the other way.
Yes, I’m sceptical that you can “solve the file format problem” at all, let alone, as Andrew suggests, in one month. But I reckon by mobilising a massive citizens’ effort and playing fast and loose with the rules, a huge amount of progress may be possible. For instance, suppose we found a way to mobilise the Open Source community to bring together some of the Open Source efforts to handle obsolescent file formats? It only needs a few clever people to add a filter for PowerPoint 4 to the OpenOffice family (see a previous post of mine on the Digital Curation Blog), and most of my problems with that format will be fixed, quite probably for my lifetime. Hell, with a bit of support from Microsoft (who don’t make any money out of PowerPoint 4, I’m sure!) I’d guess half a dozen skilled folk could get it done in the month.
Yes, if you managed to update PRONOM/UDFR to list 95% of all known file types, within a year that would probably have dropped to 80% as new file types emerge (although I suspect fewer types are emerging than did in the 1980s and 1990s). So you do need continuing efforts. But you would have had to build a better citizen participation model, and probably a better governance model, and those would have lasting value in themselves. There are many ways in which an effort like this can have continuing value.
I’ll leave the last word to Andrew Jackson, who responded to Andrew Wilson’s comment:
‘If Jason is “labouring under the very mistaken belief that no-one has done anything about this in the last few decades”, I think he can be forgiven for it. We’ve done a lot of modelling, written a lot of papers, built a lot of registries. But, apart from PRONOM, which doesn’t really cover the same territory as Jason wants to address, what have we got? How full are our registries? How usable and discoverable are they? How far have we publicised them? If we’ve ‘solved’ this problem, why doesn’t he know about it?
‘I think if you read Jason’s full post, you’ll realise he knows this is not a one off. http://ascii.textfiles.com/archives/3645 “Think what giving a month every year will do for a problem like this.”’