Comments on the revised PRONOM Vocabulary Specification

9 Nov

It’s interesting that The National Archives is developing a Linked Data version of PRONOM. I spent some time back in August poking around at various bits of PRONOM, including a quick comment or two on the draft PRONOM Vocabulary Specification current at that stage (dated 25 May 2011). Since then it’s been revised; the latest version is dated 26 October 2011, and has some good improvements.

I don’t really know much about the technical aspects of namespaces and vocabularies, so maybe I’m not the right person to be making comments. However, the PRONOM guys seemed to appreciate the comments I made earlier, so it’s worth having a go. And there are folk who fully understand the technical nuances, who will comment on those.

One serious issue for me (explored a little already on Twitter) is the balance between precision/formalism and comprehensibility. Almost any human concept has a slipperiness that can lead to endless arguments. Computer systems work better with precise, formally defined terms, but people don’t cope well with these (nerds and geeks excepted). I’ve just read @pinpoint on the social graph, so I’m primed for the futility of precision in human endeavours!

As an example, the earlier definition of a file format was just plain wrong:

“A file format is an encoded digital object, which may be a file, or a bit stream embedded within a file, and which may be processed or rendered in human or machine readable form…”

I remembered a presentation given by Steve Abrams about the (then) GDFR project (the PowerPoint is linked from http://www.dcc.ac.uk/events/talks-seminars/database-seminar-format-typing-preservation-data-sets-and-databases). Steve defined a format formally (slides 17 and 18):

“Four conceptual entities

AIM    Abstract information model
CIS    Coded information set (semantic)
SIS    Structural information set (syntactic)
SBS    Serialized byte stream

“Three encodings

FEM    Format encoding model     FEM : AIM → CIS
FEF    Format encoding form      FEF : CIS → SIS
FES    Format encoding scheme    FES : SIS → SBS

A format is a triple, F = (FCS, FEF, FES)”

Well, maybe. Note, Steve and others may have moved on from this in the later UDFR work; I don’t have later information. But for my money, this is way over the top. It may be accurate, but given over 5000 graphical file formats alone (thanks @kevingashley), we’d never manage to gather all that information. In the balance between human-comprehensible and machine-processable, in this case I think the former has to win.

So from my point of view, the current PRONOM definition of file format is a pretty reasonable compromise (FWIW some of the words look close to some suggestions I made earlier):

“A file format is an encoding of a file type that can be rendered or interpreted in a consistent, expected and meaningful way, through the intervention of a particular piece of software or hardware which has been designed to handle that format. A file may (or may not) be a container containing zero or more files of various formats. File formats may be defined by a specification, or by a reference software system. Many file formats exist in forms with minor variations, and many also in more than one version. Typing of file formats should be interpreted generously rather than strictly, but sufficiently precisely to distinguish versions where such distinctions have significant interpretive consequences.”

File format is defined as an rdfs:Class; I’m assuming this means there can be an arbitrary number of instances of that class. The three other classes defined are Compression Type, Character Encoding and Sotware Package [sic]. I don’t have major concerns about Compression Type, although it did spark off a question in my mind:

Is the decoded version of a file encoded with lossless compression always identical to the original object pre-encoding?

At first glance, of course it is identical. But I think when we use terms like lossy and lossless, we refer to the information content rather than the precise bit sequences. So on decoding, might a different choice be made on some bit sequences? Just asking!
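For what it’s worth, the byte-level case is easy to check. The sketch below uses Python’s standard zlib module purely as an illustration (PRONOM does not prescribe it, and the sample data is made up). It compresses the same bytes at two different levels and confirms that both decode to exactly the original bytes. The compressed streams themselves are allowed to differ; it is only the round trip that is guaranteed.

```python
import zlib

# Made-up sample data, repeated so that it compresses well.
original = b"PRONOM vocabulary " * 1000

# The same input compressed with two different effort levels.
# The two compressed streams need not be identical to each other.
fast = zlib.compress(original, 1)
best = zlib.compress(original, 9)

# Each stream decodes back to exactly the bytes we started with.
roundtrip_fast = zlib.decompress(fast)
roundtrip_best = zlib.decompress(best)

print(roundtrip_fast == original, roundtrip_best == original)
```

That only covers a byte-stream codec, though. Where “lossless” describes a format conversion (re-saving an image, say), the decoded content can be the same while the bytes in the file differ, which is probably closer to the question being asked here.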

The Character Encoding definition looks reasonable to me. However, I’ve been burned enough times with the complexities of character encodings not even to attempt to go there (other than to note that I’m not sure how reliably one can name a character encoding)!

[UPDATE: after first posting this, I was wondering about a couple of other character-related issues. One is bit-length. We tend to think exclusively in terms of multiples of 8-bit bytes these days, but this is about preservation of past objects, and many of these had 6- or 7-bit characters. However, these formats and their associated digital objects are extremely rare, and can perhaps be best dealt with in comments rather than a specific vocabulary element.

OTOH, the distinction between text and binary DOES seem to me an important primary property for a file format that the current vocabulary draft does not capture; binary is at best a secondary characteristic in the current draft. END UPDATE]

I do really love the term Sotware Package (indeed it reminds me of a novel I read once called The Sotweed Factor!), but sadly we must assume they mean Software. The definition looks OK at first glance:

“Individual programs or a suite of programs that are executed by the computer to accomplish a single task. Software packages exist to perform a wide variety of functions from operating system basic scripting to web development. Software packages require a specific combination of hardware and operating system in order to function.”

My mind buzzes with various complications like apps, browser extensions, plugins, OS commands etc, but I suspect in the context of describing the mechanism for the interpretation of files, this isn’t too bad. I do have a slight concern, however, in the light of my earlier experiences trying to find information on the current PRONOM site. There doesn’t seem to be a clear information model underlying that site covering the relationships between things like the file format and the software. In particular, the human-readable names of the file formats are often software package names. So for example, if you search on PRONOM for “xlsx”, you will get to the record with PUID fmt/214, with the Name “Microsoft Excel for Windows”. It’s confusing. There needs to be an information model that links the abstraction represented by the particular version of the spreadsheet format with zero or more software packages (some of which might be described as definitive) and zero or more specifications, some of which may be public, some open, some private, some proprietary etc.
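To make that concrete, here is a minimal sketch of the kind of model I have in mind, written as plain Python dataclasses. Every class and field name here is my own invention for illustration and none of it comes from the PRONOM vocabulary; the only values taken from PRONOM are the PUID fmt/214 and the name “Microsoft Excel for Windows” mentioned above. The format label and the specification are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Specification:
    # Hypothetical fields: a title plus flags for how available the spec is.
    title: str
    is_public: bool = False
    is_open: bool = False


@dataclass
class SoftwarePackage:
    # 'definitive' marks a package treated as the reference implementation.
    name: str
    definitive: bool = False


@dataclass
class FileFormat:
    # The format is the abstraction; software and specs hang off it.
    puid: str
    name: str
    software: List[SoftwarePackage] = field(default_factory=list)
    specifications: List[Specification] = field(default_factory=list)


# Illustrative record: the format gets its own label, and the software
# package name is recorded separately instead of doubling as the format name.
xlsx = FileFormat(
    puid="fmt/214",
    name="Spreadsheet format behind .xlsx (placeholder label)",
    software=[SoftwarePackage("Microsoft Excel for Windows", definitive=True)],
    specifications=[Specification("Placeholder specification", is_public=True)],
)

definitive_packages = [s.name for s in xlsx.software if s.definitive]
print(xlsx.puid, xlsx.name, definitive_packages)
```

The point is only that the format name and the software name live in different places, and that both lists can legitimately be empty.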

Moving on to properties, the Internet Media Type (or MIME Type) could perhaps have a comment that some file formats can be labelled with more than one MIME Type, and vice versa. Likewise for File Extension.
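A rough sketch of what many-to-many means in practice, in Python. XML really is labelled with both application/xml and text/xml; the “Format A” entries and the application/x-example type are hypothetical, invented just to show the reverse direction.

```python
from collections import defaultdict

# (format label, MIME type) pairs. The XML rows are real; the
# "Format A" rows and "application/x-example" are hypothetical.
pairs = [
    ("XML 1.0", "application/xml"),
    ("XML 1.0", "text/xml"),
    ("Format A v1 (hypothetical)", "application/x-example"),
    ("Format A v2 (hypothetical)", "application/x-example"),
]

# Index the same pairs in both directions.
mime_by_format = defaultdict(set)
formats_by_mime = defaultdict(set)
for fmt, mime in pairs:
    mime_by_format[fmt].add(mime)
    formats_by_mime[mime].add(fmt)

# One format with two MIME types, and one MIME type covering two formats.
print(sorted(mime_by_format["XML 1.0"]))
print(sorted(formats_by_mime["application/x-example"]))
```

The same shape would apply to file extensions, so neither property can be treated as a unique key for a format.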

I’m not at all sure what Media Format is doing here! Since most (possibly not all) file types are independent of media, I’m not sure it’s worth a separate property. But if it exists unused maybe it is no great burden.

I’m sure Version isn’t quite right. The property is defined as:

“The specific version number or letter of the compression technique, file format or character encoding. It is the number or letter used to distinguish this version from previous and subse-quent [sic] versions, and usually follows the naming convention established by the manufacturer.”

Pardon me, but that’s a version identifier, not a version. We should refer back to the file format definition here; the definition needs to distinguish minor variants of the file format (not requiring a different version), as well as distinct versions. The latter should be (but are not always) identified by a separate version identifier.

The definitions of database and dataset are not ones that I would agree with. Some databases are not binary in nature; the term dataset is often used to include databases as well as other classes of file. I’m not sure this collection of file format genres is ever going to get widespread agreement; I’d certainly like to see an attempt at a Venn diagram that covered the complete information space of file format genres! It looks like “Un-structured text” is the only available label for other types the PRONOM people didn’t think of (i.e. “other”). But maybe it is useful as a human-comprehensible property (rather than being strictly accurate).

Big grumble on version control: the Latest Version is described as n/a, and the Previous Version as n/a! Come on guys, that is just not good enough. Label your versions, link your versions, it’s important (at least there’s a date to distinguish them).

Now don’t get me wrong here. This seems to me important and 95% on the right track. My comments above could be completely rubbish, but are offered in the hope they might help. I’m hoping to get to the forthcoming PRONOM/DROID workshop (not including a link as I’m not sure how open it is), but I’m not yet sure I can.

I’ve some further thoughts on the ability of others to contribute to this effort (the PRONOM resource rather than the vocabulary definition) in terms of the data content, but this is long enough already so those will have to wait for another day.

Comments always welcome, will be treated as CC-BY