More on PRONOM

17 Aug

I was quite unprepared for the success of my post on my experiences with PRONOM, but I was very pleased with the comments that arrived. The National Archives people who run PRONOM were generous in giving a balanced answer to a rather unbalanced post, and there were many other useful comments in replies, in other blog posts and on Twitter.

Some of these comments led me to look more closely at the National Archives Labs pages on the Linked Data project, and that led me to look at the PRONOM vocabulary. I was a little concerned about some of the definitions, e.g. for File format:

“A file format is an encoded digital object, which may be a file, or a bit stream embedded within a file, and which may be processed or rendered in human or machine readable form. It is an arbitrary method of storing digital content in a file, allowing its later retrieval or interchange with other people and very specific combinations of hardware and software. There are many different formats for different kinds of digital content and often formats can have multiple versions. File formats can be software-independent, or can be developed in conjunction with specific software products.”

OK, maybe I’m being picky here. But a vocabulary needs to use pretty accurate language (though perhaps not too finicky, see below). Surely a file is an encoded digital object. A file format is… an encoding of a digital object? A method or schema for encoding a digital object?

[Aside: does a vocabulary in the Linked Data world need to be correct in its definitions, or is it a matter of being correct in the relationships of the parts, and sufficiently sensible in its definitions that actual usage is more-or-less right? I.e. is the vocabulary really defined by formal definition or by usage? Dublin Core folk might have insights beyond mine…]

In a tweet, Andy Jackson said “I’m tending to avoid the word format! e.g. JHOVE [says] File [conformsTo] Spec or Adobe [says] Reader [conformsTo] PDFSpec”. I must admit I wasn’t very happy with this; it’s pretty clear that something can still be a file of type X even if it does not quite conform to the specification. For most of us, a PDF file can still be a PDF file even if it is slightly broken in relation to either the spec or the standard, and even if it is not created by an Adobe product. I believe there are well known cases where files created by a software company’s products fail to match their own specifications (sometimes these are bugs, sometimes these are “features”!). Remember Postel’s law: “Be conservative in what you send; be liberal in what you accept”. That’s critical for digital preservation…

Is there an alternative available? A few years ago Steve Abrams, then at Harvard and working on the Global Digital Format Registry (GDFR, now developing into UDFR), came to visit us at the DCC at Edinburgh. There was a certain amount of tension amongst colleagues in the DCC about the relationship of file format registries such as PRONOM and GDFR to “full OAIS” representation registries such as the DCC’s own RRORI. Partly to address this, we got Steve to give a talk on “Format typing for the preservation of datasets and databases”. If you are interested, you can find the presentation on the DCC web site, as well as a recording of his talk (which doesn’t work on my Mac; YMMV).

Steve’s informal definition of a file format is “a serialized encoding of an abstract information model”. He went on to hint at a taxonomy of ontological classes, abstract families, concrete formats, and relationships. He suggested there were:

“Four conceptual entities

-AIM Abstract information model

-CIS Coded information set (semantic)

-SIS Structural information set (syntactic)

-SBS Serialized byte stream”

There were also 3 encodings between these conceptual entities. A format then was a triple from those 3 encodings.
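To make the "triple of encodings" idea concrete, here is a rough Python sketch. It is my own illustration, not anything from Steve's slides: I am assuming the three encodings link adjacent entities in the order listed (AIM to CIS, CIS to SIS, SIS to SBS), and the names `Layer`, `Encoding`, `Format` and the example labels are all made up for the purpose.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """The four conceptual entities quoted above."""
    AIM = "Abstract information model"
    CIS = "Coded information set (semantic)"
    SIS = "Structural information set (syntactic)"
    SBS = "Serialized byte stream"


@dataclass(frozen=True)
class Encoding:
    """A mapping from one conceptual entity to another (hypothetical type)."""
    source: Layer
    target: Layer
    label: str  # free-text description, illustrative only


# Assumption: the three encodings join adjacent layers in the listed order.
EXPECTED_STEPS = [
    (Layer.AIM, Layer.CIS),
    (Layer.CIS, Layer.SIS),
    (Layer.SIS, Layer.SBS),
]


@dataclass(frozen=True)
class Format:
    """A format modelled as a triple of encodings."""
    semantic: Encoding
    syntactic: Encoding
    serialization: Encoding

    def __post_init__(self):
        steps = [(e.source, e.target)
                 for e in (self.semantic, self.syntactic, self.serialization)]
        if steps != EXPECTED_STEPS:
            raise ValueError("encodings must run AIM -> CIS -> SIS -> SBS")


# A made-up example: the labels are placeholders, not registry entries.
example = Format(
    semantic=Encoding(Layer.AIM, Layer.CIS, "table of records as typed values"),
    syntactic=Encoding(Layer.CIS, Layer.SIS, "rows of delimited fields"),
    serialization=Encoding(Layer.SIS, Layer.SBS, "UTF-8 bytes"),
)
```

Even this toy version needs three separate descriptions per format, which gives a feel for how much information such a registry would have to collect.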

Well, that looks pretty rigorous. Is it any more helpful? At first I thought that clarity was really important, so maybe going down that route was valuable. But then I realised: nearly all the entries of the simple PRONOM database are pretty much empty. How would we ever hope to fill those empty details if we look for even more complicated information on file formats, in forms never envisaged by those who wrote the specifications or programs concerned? We need to Keep It Simple, Steve. So sorry, Steve, I think the conceptual entities and their various encodings are a step too far.

But if I’m not happy with the existing PRONOM definition, I must at least propose a better one. It doesn’t have to be completely precise (and probably never can be), but it shouldn’t be misleading. I think I would slightly simplify Steve’s informal definition. A really good definition would allow us to sense the difference between a minor variation on a file format, and a different version of the file format; however I suspect that’s beyond me. Anyway my attempt is:

“A file format is an encoding of a file type. A file may (or may not) be a container containing zero or more files of various formats. File formats may be defined by a specification, or by a reference software system. Many file formats exist in forms with minor variations, and many also in more than one version. Typing of file formats should be interpreted generously rather than strictly, but sufficiently precisely to distinguish versions where such distinctions have significant preservation consequences.”

Kind of leaves open what a file type is, but hey, I have to leave some problems for the readers (;-).
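For the programmers among you, my attempted definition could be modelled roughly as below. This is only a sketch under my own assumptions: the class and function names (`FileFormat`, `File`, `same_format`, `formats_in`) are hypothetical, the format names are just examples, and "generous" typing is reduced to ignoring the version unless the caller says it matters.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class FileFormat:
    name: str
    version: Optional[str] = None
    # Either "specification" or "reference software", per the definition.
    defined_by: str = "specification"


@dataclass
class File:
    fmt: FileFormat
    # A container holds zero or more files of various formats.
    members: List["File"] = field(default_factory=list)


def same_format(a: FileFormat, b: FileFormat,
                version_matters: bool = False) -> bool:
    """Generous match by default; strict on version only when asked."""
    if a.name != b.name:
        return False
    if version_matters:
        return a.version == b.version
    return True


def formats_in(f: File) -> Iterator[FileFormat]:
    """Yield the format of a file, then of everything it contains."""
    yield f.fmt
    for member in f.members:
        yield from formats_in(member)


# Illustrative data only.
pdf_old = FileFormat("PDF", "1.4")
pdf_new = FileFormat("PDF", "1.7")
archive = File(FileFormat("ZIP"),
               [File(pdf_old), File(FileFormat("PNG"))])
```

Here `same_format(pdf_old, pdf_new)` treats the two as the same format, while passing `version_matters=True` separates them, which is the kind of switch I have in mind for versions with significant preservation consequences.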

[This is my first attempt at writing a post with MarsEdit, in response to my truly awful experiences producing the previous short post with WordPress’ own editor. Let’s see how it goes…]


Comments always welcome, will be treated as CC-BY
