Legacy document formats

27 Sep

As the Jason Scott November File Format month of action comes closer [update: original post here and wiki page here], and also as I wrestle with trying to access some 50 or so PowerPoint 4.0 files from the late 1990s, I am reminded of a post I wrote on the Digital Curation Blog in March 2008 (now rehosted on the DCC web site). I thought it might be useful to re-post an edited version here.

‘On the O’Reilly XML blog, which I [used to] read with interest (particularly in relation to the shenanigans over OOXML and ODF standardisation), Rick Jelliffe writes An Open Letter to Microsoft, IBM/Lotus, Corel and others on Lodging Old File Formats with ISO. He points out that

“Corporations who were market leaders in the 1980s and 1990s for PC applications have a responsibility to make sure that documentation on their old formats are not lost. Especially for document formats before 1990, the benefits of the format as some kind of IP-embodying revenue generator will have lapsed now in 2008. However the responsibility for archiving remains.

“So I call on companies in this situation, in particular Microsoft, IBM/Lotus, Corel, Computer Associates, Fujitsu, Philips, as well as the current owners of past names such as Wang, and so on, to submit your legacy binary format documentation for documents (particularly home and office documents) and media, to ISO/IEC JTC1 for acceptance as Technical Specifications.[...] Handing over the documentation to ISO care can shift the responsibility for archiving and making available old documentation from individual companies, provide good public relations, and allow old projects to be tidied up and closed.”

[Some further paragraphs I didn’t quote then:
“For nations where the 17 year patent time applies, there seems little reason why formats from 1990 and before could not be quickly submitted and dealt with in this way. However, given the enormous benefits that openness brings in increasing the size of the pie, I suggest that even recent formats, for example formats before 2001, should also be submitted to ISO as Technical Specifications in this way with some appropriate RAND-z IP covenant or license.

Examples of these formats that spring to mind include:

  • All Microsoft Office binary and text and media formats, including RTF and Visio
  • All IBM/Lotus binary and text and media formats, including Visicalc
  • All Corel formats, including WordPerfect

Furthermore, I call on archiving and regulatory bodies to investigate encouraging and supporting this kind of activity. As well as office document formats, there are substantial legacy collections of financial and engineering documents which would also benefit from the same treatment. It should go without saying, but the Macintosh, Amiga, OS/2, and applications on the many different versions of UNIX may also have hosted popular applications whose documentation may be in danger of being lost unless it is lodged with a suitable formal international technical library, such as ISO/IEC.

The ISO/IEC Technical Specification is a good, low-fuss medium for making sure that older formats do not disappear, and without requiring costly rewrites or changes.”]

This is in principle a Good Idea. However, ISO documents are not Open Access; the specifications Rick refers to would benefit greatly from being Open. They would form vitally important parts of our effort to preserve digital documents. Instead of being deposited in ISO, they should be regarded as part of Representation Information for those file types, and deposited in a variety (more than one, for safety’s sake) of services such as PRONOM at The National Archives in the UK, the proposed Harvard/Mellon Global Digital Format Registry, the Library of Congress Digital Preservation activity or the DCC’s own Registry/Repository of Representation Information.’

It might be worth pointing out that Microsoft has become one of the Good Guys here (more or less), making its specifications widely available (although not the PowerPoint 4.0 and earlier specs I’m interested in).

Some commentary on my proposed destinations… at the moment, PRONOM apparently is focused on hosting signatures rather than any other file format documentation, and is very oriented towards formats of direct relevance to the UK Government. GDFR got folded into the more recent UDFR activity, which appears to have been a development project only; I see no sustainability plan. The LoC activity continues and would be a good place. The DCC RRoRI activity got absorbed into the EU CASPAR project; I don’t know whether it is still active under the Alliance for Permanent Access.

… And now I would add, perhaps the Internet Archive would be a useful Open, neutral place…

Comments always welcome, will be treated as CC-BY
