So what was that twitter rant against PRONOM all about?

5 Aug

A couple of days ago, I let fly a small rant on twitter about The National Archives’ PRONOM service:

“We tend to believe in PRONOM. But is it useless? Try a search for Wordstar or WordPro. Nothing”

“Even searching PRONOM for Excel doesn’t point me to specs, latest versions, or even that massive ISO standard AFAIK.”

“Last time I tried, PRONOM wasn’t really “open for suggestions” (or they weren’t acted on). Is it now?”

“Wouldn’t it be GREAT if we could crowd-source specs (RepInfo) into PRONOM and see it added (maybe with an un-verified health warning)?”

“Or is Wikipedia a fundamentally more useful tool for getting/saving info on file formats than PRONOM (or even UDFR?)?”

So what was that all about?

I have an idea (well, an old idea that I’m still keen on) that I’ll blog about a bit later. But in preparation for blogging about it, I thought I’d like to find some information about once-important, now-obsolete file formats. Since PRONOM is the world’s best service for information about file formats for preservationists, it seemed the obvious place to start. I was pretty fed up to find so little there.
PRONOM describes itself thus:
“The online registry of technical information. PRONOM is a resource for anyone requiring impartial and definitive information about the file formats, software products and other technical components required to support long-term access to electronic records and other digital objects of cultural, historical or business value.”
Now, the file formats I was initially looking for are not in common use today, so (from their point of view) perhaps it’s reasonable that they should not feature. However, as you can see from the tweets, I tried more recent formats. First I tried Excel. If you do a simple search for “Excel” you get 10 entries, the second of which is PowerPoint (XP)! The first 3 entries say “Description in preparation” (those 3 include Excel(XP) and Excel(2003)); the remaining 7 say “No description available”.

Ok, perhaps we’re not really interested in the software, but in the file format. I’m happy to report that things are a little better here. Once I worked out how to search for a file format (click on the File Format tab from the Simple Search page), I tried .xlsx for the latest version of Excel. Contrary to my earlier opinion, there is a result, pointing to fmt/214, software Microsoft Excel for Windows 2007. The latter seems to be the only clickable link, so here goes… This gives us an entry for that software product. The description reads:
“This is an outline record only, and requires further details, research or authentication to provide information that will enable users to further understand the format and to assess digital preservation risks associated with it if appropriate. If you are able to help by supplying any additional information concerning this entry, please return to the main PRONOM page and select ‘Add an Entry’.”

Bear in mind here that this format is supposedly described by the OOXML specification, standardised by ISO. But this is not mentioned (maybe I should have a go at adding an entry… done, pointing to the not-open ISO 29500 version and the “technically aligned” ECMA version; they’ll get back to me in 10 working days, reference TNA110004376). In fact the only helpful piece of information anywhere is an external signature of .xlsx (back where we started).

If you repeat this process with .xls, you do get a bit more information, specifically including detailed signature information on at least some variants, and a link to specifications (created by OpenOffice.org).
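
To make the distinction between the two kinds of signature concrete, here is a minimal sketch in Python. It is not how PRONOM or DROID actually work internally. It only shows the general idea that an external signature is a guess from the file extension, while an internal signature is a match against bytes inside the file. The `identify` function and the `example/...` labels are hypothetical names made up for illustration; only fmt/214 comes from the PRONOM entry mentioned above.

```python
from pathlib import Path

# Illustrative tables only; a real registry holds far more detail.
# fmt/214 is the identifier reported above for .xlsx. The "example/..."
# labels are hypothetical placeholders, NOT real PUIDs.
EXTERNAL_SIGNATURES = {
    ".xlsx": "fmt/214",
    ".xls": "example/xls",
}

# Internal signatures: byte sequences expected at the very start of a file.
INTERNAL_SIGNATURES = [
    # ZIP local file header; an .xlsx file is a ZIP package.
    (b"PK\x03\x04", "example/zip-container"),
    # OLE2 compound file header, the container used by legacy .xls files.
    (bytes.fromhex("D0CF11E0A1B11AE1"), "example/ole2-container"),
]


def identify(name: str, head: bytes) -> dict:
    """Return what the extension suggests and what the leading bytes suggest."""
    extension = Path(name).suffix.lower()
    by_extension = EXTERNAL_SIGNATURES.get(extension)
    by_content = next(
        (label for magic, label in INTERNAL_SIGNATURES if head.startswith(magic)),
        None,
    )
    return {"external": by_extension, "internal": by_content}


if __name__ == "__main__":
    # The first few bytes of a file are enough for this simple check.
    print(identify("Budget.XLSX", b"PK\x03\x04" + b"\x00" * 16))
```

The two answers can disagree (a renamed file keeps its bytes but changes its extension), which is one reason an extension alone is such thin information for a preservationist.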

I did notice in my exploration however that dBase II was mentioned. But that seems to be as far as that goes. There’s essentially no information about it, other than listing the owner as dBase Inc, and giving a web-site (dbase.com) whose domain name seems to have lapsed.

I was originally shocked to find NOTHING about two file formats that are likely to emerge as personal collections start being archived. But I ended up being totally confused about what PRONOM was really expected to achieve.

By the way, there are a couple of hints of future development, one a link to the National Archives Labs site on making PRONOM available as Linked Open Data, and one via the Information Resources page mentioning UDFR. I also spotted very late on a link to latest PRONOM changes (see http://www.nationalarchives.gov.uk/aboutapps/pronom/release-notes.xml) that does indicate that new entries are being added, with Georgia Tech Research Institute being prominent in doing so. Most of the new entries still only say “Outline entry added” though. I remain bemused.

Going back to my original rant, there’s quite a lot of interesting information about WordStar on Wikipedia (see http://en.wikipedia.org/wiki/WordStar), but so far I have not tracked down a specification.

10 Responses to “So what was that twitter rant against PRONOM all about?”

  1. Ross Spencer 9 August, 2011 at 10:58 #

    Hi Chris,

    Thanks for the tweets, and for opening up a discussion. It is great to see PRONOM being discussed.

    We appreciate PRONOM does not provide a comprehensive record about each format that we maintain PUIDs for. The main reason for this is how our research and efforts are focussed. In concert with DROID, PRONOM helps provide the community with the tools to identify what they have. Our first focus therefore is ensuring that we have a comprehensive collection of signatures. Our second focus is in providing unique identifiers for formats so that users like yourself and other institutions can talk about those formats and gather more information about what you have in your repository.

    While we’d like to have more detailed information about any format in question, the resource required to populate the database with anything useful is quite significant. This is one of the driving factors behind the linked data project and we’d like to harness the quality information we already have available. Resources like Wikipedia or its linked data counterpart dbpedia.org can provide more detailed information and links. PRONOM can only serve to loosely emulate such resources but can certainly help to point users at this information and other various resources more easily through linking. It is our hope that other institutions will choose to publish their own data about formats relevant to them using our vocabulary and we can either choose to link to that information as well, or, via a well defined provenance model, consume the data into our own model and attribute the original publishers more effectively than we can now.

    We can understand why there may be a community perception about our receptiveness to community input as there was a significant time period up until about 18 months ago where we didn’t have enough time or resource to be able to respond or work as pro-actively in seeking information as we would have liked to. I think the linked data project helps to address much of the concern in the community about PRONOM being ‘open for suggestion’, however, I think it is also important to point out we’re now in a position where we’re much more able to communicate with the users of the service and gratefully accept contributions to records. While we ask the community to appreciate we can only act on this with limited resource we do our best to get data into PRONOM as quickly as possible. You have seen our news feeds recently about our successful collaboration with William Underwood at Georgia Tech Research Institute in America to increase the number of signatures in PRONOM (http://www.nationalarchives.gov.uk/news/519.htm). If another institution can offer similarly well defined and structured information we can include this in our database with ease. One of the ways the community can contribute to PRONOM is via the newly re-structured online submission form: http://www.nationalarchives.gov.uk/contact/contactform.asp?id=13 The more information provided initially, the easier it is for our team to deal with.

    In terms of finding WordStar in PRONOM I think what you might have highlighted is a small bug in the free-text search capabilities of the interface. You can find WordStar under the following X-PUIDS (205, 206, 236, 237, 260, 261, 262, 370), WordPro under x-fmt/340. You will see however that these are only outline entries so do require a bit of help to fill them in. We can take a look at the search function in PRONOM but it is something we’re also able to address with the new development.

    Please do contact us via the PRONOM@nationalarchives.gov.uk inbox if you want to talk to us further and keep an eye on the labs site (http://labs.nationalarchives.gov.uk/wordpress/index.php/2011/01/linked-data-and-pronom/) for more information regarding the Linked Data Pronom project and how that might benefit the community more and more as it progresses.

    Kind Regards,

    Ross Spencer
    Digital Preservation Researcher
    The National Archives

    • Chris Rusbridge 9 August, 2011 at 11:22 #

      Thanks very much for a great response, Ross. Perhaps the interface bug(s) were the cause of some of my problems. I was certainly convinced early on that there was no entry for .docx, .xlsx etc, but then when writing the post I found them by another route.

      At the moment I don’t understand how the linked data element can help with data quality issues, but I’m happy to suspend disbelief to some extent, for now. I was pleased about your story on the cooperation with Georgia Tech (which I hadn’t actually seen before), but I’d certainly like to see that extended to more than signatures. Since writing the blog post I have used your data submission form; I’m not sure it’s quite right yet, as I found it a little tricky to get the information I wanted to fit the form. However, that’s from memory; maybe I’ll have another go and record my reactions a bit more closely.

      I’m personally convinced that the community as a whole has a great deal of information that would help fill out those outline records. To get them to come forward they’ll need to believe the information will be used. Do you have any target times for validating suggested information? Do you have any way of indicating that information has been suggested but has not yet been validated?

  2. Jenny Mitcham 9 August, 2011 at 12:20 #

    Hi Chris – Interesting discussion and you make some valid points! I guess like most tools and services, PRONOM and DROID are still growing and developing and perhaps they will never be ‘complete’ as technology keeps moving and changing. However, I just have a couple of things to say in their defence.

    At the Archaeology Data Service, we have recently started using DROID and PRONOM in earnest to collect and record file level metadata about the files in our archive. The reason we chose to go down this route was that although it wasn’t a complete solution (we have a lot of weird and wonderful archaeological file types that do not yet feature in PRONOM), we felt that there was a good opportunity for us to feed into the development of DROID and help to make it more useful – to us and hopefully to others too.

    Over the course of the last 6 months I have had a lot of contact with the DROID and PRONOM developers sending them strange sample files and DROID identification problems to grapple with (they are probably sick of me!) in the hope that they will be able to incorporate new file types into PRONOM/DROID. In response to this they have added some new file signatures and enhanced existing ones. Although they are busy people and do not have large amounts of time and resources to devote to this, they are always friendly and helpful and encourage me to specify which file types are high priority ones to get sorted out so that they can look at these first.

    I think that DROID and PRONOM have great potential and I agree with you Chris that the digital archiving community could be contributing more in order to make them better. If more people fed into it (and I guess if TNA had more resources allocated to deal with this feedback) then I think we could make this even more useful to all.

    All the best,
    Jen

    • Chris Rusbridge 9 August, 2011 at 13:07 #

      Thanks Jenny. I think input into PRONOM from folks like you at the ADS and others is essential for PRONOM to work. TNA will have their work cut out with more mainstream file types, but the digital preservation world as a whole (which places such reliance on the idea of PRONOM and UDFR as a successor) needs more specialist file formats. There is a plethora of important science data formats, for instance, that TNA is very unlikely to know about or understand (CIF, CDF, etc etc). I do hope ADS can continue to add as much as possible in the way of information about the file types you come across. Thanks again for the comment.

  3. edsu 9 August, 2011 at 13:30 #

    Ross said:

    Resources like Wikipedia or its linked data counterpart dbpedia.org can provide more detailed information and links. PRONOM can only serve to loosely emulate such resources but can certainly help to point users at this information and other various resources more easily through linking.

    That actually sounds like a great project, to link up preservation formats in PRONOM with articles on Wikipedia, where possible, and vice-versa. This way you could potentially use the fuller human readable descriptions for file formats from Wikipedia in PRONOM, while still retaining the more rigorous notion of identity that DROID and PRONOM share in their PUIDs. Also, people could discover PRONOM while they are reading about a file format in Wikipedia. It sounds like a great partnership idea. I imagine it’s something that the local GLAM-Wiki folks might potentially work with you on, or even me perhaps, if you are interested.

  4. billroberts 10 August, 2011 at 12:18 #

    Chris – great that you raise this issue as I think it’s important. Ross makes a couple of essential points:

    1) there is a value in just having an identifier and a signature as it enables different people to know they are talking about the same thing (within the limits of signature reliability and specificity anyway).

    2) having more info on formats would be better, but it takes a lot of effort to produce it.

    PRONOM is more or less the only show in town at the moment in collecting and presenting representation information and it’s unreasonable to expect any one institution to be able to do everything, or even to have the time to review and publish contributions from others.

    As you mentioned, the UDFR project (see https://bitbucket.org/udfr/main/wiki/Home for latest news) is also working on this problem. They are developing a multi-user platform that should make it easy for people to publish info about file formats and associated software. Like the latest developments in PRONOM, they are planning to use a Linked Data approach to publishing the info and are in touch with the PRONOM folks and others regarding compatibility of metadata models, identifiers systems etc.

    I’ve been doing some work in this area too over the last 9 months or so, for the National Archives of the Netherlands and for the Open Planets Foundation. There are a few articles and links to papers on the OPF blog (http://www.openplanetsfoundation.org). My main focus has been to look at what needs to be in place to enable a broader ecosystem of representation information sharing and use.

    This has started from two main principles – that we need to encourage and enable as many people as possible to publish representation info in an interoperable way, to increase the coverage of information available; and that we need to be able to separate out factual information about formats and software from institutional preferences or policies on formats, rendering software, migration software etc.

    So we’re in the process of working up some guidelines on how to share representation information, hoping to get a good balance of low barriers to entry and high interoperability (not necessarily an easy compromise). And also looking at how a user of representation information would want to gather and apply that information.

    I’ve been talking to a couple of groups that have some useful information on formats and software and I’m hoping to be able to use this as a pilot or demo of the process – as a learning experience with a side effect of increasing the pool of widely available file format info.

  5. Mark Conrad 12 August, 2011 at 16:05 #

    Hi Chris,

    The work that GTRI is contributing to PRONOM is being done in collaboration with the National Archives and Records Administration (US) – specifically the Applied Research Division of the Office of Information Services (http://blogs.archives.gov/online-public-access/?p=3737). We have a number of students who are searching the internet for file format specifications, example files, viewer/players, and metadata extractors. Finding authoritative information is no easy task.

    We share this information with the folks at GTRI who develop internal signatures to accurately identify files in a particular format. This information is then sent to the National Archives UK folks for inclusion in PRONOM and DROID.

    NARA is very interested in improving our ability to automatically identify file formats. Some estimates are that NARA will have to manage somewhere in the range of 10 trillion digital objects in thousands of formats over the next decade. We already have hundreds of millions of files to manage. Our collaboration with GTRI and the National Archives UK allows us to make progress toward that goal.

    In terms of your questions about what file formats are and are not registered in PRONOM and DROID, I have posted a list of file formats extracted from the Digital Record Object Identification (DROID) tool’s file format signature file v51. It includes the names of the formats, the version number of the formats, and the PRONOM Unique Identifier (PUID). You can find it here:

    http://www.slideshare.net/NARACAST/file-formats-in-droid-signature-filev51.

    Hope this is helpful.

    Mark Conrad
    Applied Research Division
    Office of Information Services
    National Archives and Records Administration
    Rocket Center, West Virginia
    United States of America

    • Chris Rusbridge 12 August, 2011 at 16:17 #

      Thanks Mark, that is very interesting and helpful. I’ve got the message that PRONOM is currently mainly focused on identification of file formats, and hence the signatures, rather than on the other information. But I certainly hope that where you do find information on format specs, these could be added too. There are far too many outline records for PRONOM to be able to make its claim for being THE technical registry (my emphasis).

      You mentioned that finding authoritative information is not an easy task, and I’ll second this as I’ve done some of those searches myself. I hope that even if you come across information that looks like it might be right but not definitely authoritative (compare for example OpenOffice.org’s versions of MS format specs) that it be made available, perhaps with a caveat. Sometimes a little is better than no information. Otherwise we are basically in the cryptanalysis game!

  6. Chris Rusbridge 12 August, 2011 at 16:19 #

    Thanks to everyone who has replied here. With all this information and some comments on twitter and elsewhere, I now have further food for thought. If I can turn that into something coherent I’ll post it here later. But, thanks again.

Trackbacks/Pingbacks

  1. Comments on the revised PRONOM Vocabulary Specification « Unsustainable Ideas - 9 November, 2011

    […] National Archives is developing a Linked Data version of PRONOM. I spent some time back in August poking around at various bits of PRONOM, including a quick comment or two on the draft PRONOM Vocabulary […]

Comments always welcome, will be treated as CC-BY
