Is the PDF format appropriate for preserving documents with long perspective?

19 Mar

Paul Wheatley drew attention to this question on Stack Exchange yesterday:

“PDF is almost a de facto standard when it comes to exchanging documents. One of the best things is that always, on each machine, the page numbers stay the same, so it can be easily cited in academic publications etc.

But de facto standard is also opening PDFs with Acrobat Reader. So the single company is making it all functioning fluently.

However, thinking in longer perspective, say 50 years, is it a good idea to store documents as PDFs? Is the PDF format documented good enough to ensure that after 50 years it will be relatively easy to write software that will read such documents, taking into account that PDF may be then completely deprecated and no longer supported?”

I tried to respond, but fell foul of Stack Exchange’s login/password rules, which mean I’ve created a password I can’t remember. And I was grumpy because our boiler isn’t working AFTER it’s just been serviced (yesterday, too), so I was (and am) cold. Anyway, I’ve tried answering on SE before and had trouble, and I thought I needed a bit more space to respond. My short answer was going to be:

“There are many, many PDF readers available, implemented independently of Adobe. There are so many documents around in PDF, accessed so frequently, that the software is under constant development, and there is NO realistic probability that PDF will be unreadable in 50 years, unless there is a complete catastrophe (in which case, PDF is the least of your worries). This is not to say that all PDF documents will render exactly as now.”

Let’s backtrack. Conscious preservation of artefacts of any kind is about managing risk. So to answer the question about whether a particular preservation tactic (in this case using PDF as an encoding format for information) is appropriate for a 50-year preservation timescale, you MUST think about risks.

Frankly, most of the risks for any arbitrary document (a container for an intellectual creation) have little to do with the format. Risks independent of format include:

  • whether the intellectual creation is captured at all in document form,
  • whether the document itself survives long enough and is regarded as valuable enough to enter any system that intends to preserve it,
  • whether such a system itself can be sustained over 50 years (the economic risks here being high),
  • not to mention whether in 50 years we will still have anything like current computer and internet systems, or electricity, or even any kind of civilisation!

So, if we are thinking about the risks to a document based on its format, we are only thinking about a small part of the total risk picture. What might format-based risks be?

  • whether the format is closed and proprietary
  • whether the format is “standardised”
  • whether the format is aggressively protected by IP laws, e.g. copyright, trademark, patents etc
  • whether the format requires (or allows) DRM
  • whether the format requires (or allows) inclusion of other formats
  • the complexity of the format
  • whether the development of the format generally allows backwards compatibility
  • whether the format is widely used
  • whether tools to access the format are closed and licensed
  • whether tools to access the format are linked to particular computer systems environments
  • whether various independent tools exist
  • how good independent tools are at creating, processing or rendering the format

and no doubt others. By the way, the impact of each of these risks differs. You have to think about them for each case.

So let’s see how PDF does… no, hang on. There are several families within PDF. There’s the “bog-standard” PDF. There’s PDF/A up to v2. There’s PDF/A v3. There are a couple of other variants including one for technical engineering documents. Let’s just think about “bog-standard” PDF: Adobe PDF 1.7, technically equivalent to ISO standard ISO 32000-1:2008:

  • The format was proprietary but open; it is now open
  • it is the subject of an ISO standard, out of the control of Adobe (this might have its own risks, including the lack of openness of ISO standards, and the future development of the standard)
  • it allows, but does not require, DRM
  • it allows, but does not require, the inclusion of other formats
  • PDF is very complex and allows the creation of documents in many different ways, not all of which are useful for all future purposes (for example, the characters in a text can be in completely arbitrary order, placed by location on the page rather than textual sequence)
  • PDF has generally had pretty good backwards compatibility
  • the format is extremely widely used, with many billions of documents worldwide, and no sign of usage dropping (so there will be continuing operational pressure for PDF to remain accessible)
  • many PDF creating and reading tools are available from multiple independent tool creators; some tools are open source (so you are not likely to have to write such tools)
  • PDF tools exist on almost all computer systems in wide use today
  • some independent PDF tools have problems with some aspects of PDF documents, so rendering may not be completely accurate (it’s also possible that some Adobe tools will have problems with PDFs created by independent tools). Your mileage may vary.
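
Two of the points above (which PDF version a file claims to be, and whether it might be encrypted) can be roughly checked without any PDF library at all. The following is only a minimal sketch under my own assumptions: the function names are made up for illustration, the version is taken from the `%PDF-x.y` header alone (the document catalog can override it), and searching the raw bytes for `/Encrypt` is a crude hint, not a proper parse of the trailer.

```python
import re

# A PDF file is expected to start with a header such as "%PDF-1.7".
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def inspect_pdf_bytes(data: bytes) -> dict:
    """Return the version declared in the header and a crude encryption hint.

    Hypothetical helper for illustration; it does not parse the PDF structure.
    """
    # Some files carry a little junk before the header, so search the first 1024 bytes.
    match = HEADER_RE.search(data[:1024])
    version = None
    if match:
        version = f"{match.group(1).decode()}.{match.group(2).decode()}"
    # Encrypted PDFs reference an /Encrypt dictionary from the trailer.
    # A raw byte search can give false positives, so treat it as a hint only.
    maybe_encrypted = b"/Encrypt" in data
    return {"version": version, "maybe_encrypted": maybe_encrypted}


def inspect_pdf_file(path: str) -> dict:
    """Read a file from disk and inspect its bytes."""
    with open(path, "rb") as handle:
        return inspect_pdf_bytes(handle.read())


# A tiny hand-made fragment, not a complete or valid PDF document.
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\n%%EOF\n"
)
print(inspect_pdf_bytes(sample))
```

A file flagged as possibly encrypted is worth a closer look with a real PDF tool before it goes into any preservation system; a clean result here proves nothing about how well the file will render.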

So the net effect of all of that, it seems to me, is that provided you steer clear of a few of the obvious hurdles (particularly DRM), it is reasonable to assume that PDF is perfectly fine for preserving most documents for 50 years or so.

What do you think?

  1. bryan 29 April, 2013 at 12:35

    Quite right! Sure there are faults and risks with PDF, but we must not let the perfect be the enemy of the good. There are much much worse formats out there …

