The DOI has no clothes, and Publishers have taken them away!

19 Feb

So what’s the Digital Object Identifier for, really? I thought it was a permanent identifier so that we could link from one article to the articles it references in a pretty seamless fashion. OK, not totally seamlessly, since a DOI is not a URI, but all we have to do is stick http://dx.doi.org/ on the front of a DOI, and we’re there. So we should end up with an almost seamless worldwide web of knowledge (not Web of Knowledge™, that’s someone’s proprietary business product).
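For anyone who wants to see how little is involved, here is a minimal sketch in Python of that "stick the resolver on the front" step. The function name `doi_to_url` is my own invention for illustration, and the handling of a leading "doi:" label is an assumption about what reference managers typically emit.

```python
from urllib.parse import quote

# The resolver prefix mentioned in the post.
RESOLVER = "http://dx.doi.org/"

def doi_to_url(doi: str) -> str:
    """Turn a bare DOI such as '10.1038/nmat3566' into a resolvable URL.

    Hypothetical helper for illustration only.
    """
    doi = doi.strip()
    # Accept the 'doi:' label that often precedes a DOI in a formatted reference.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Percent-encode characters that are unsafe in a URL path, but keep '/'.
    return RESOLVER + quote(doi, safe="/")

print(doi_to_url("doi:10.1038/nmat3566"))
```

That really is all it takes for the simple cases, which is rather the point.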

Obviously the Publishers must play a large part in making this happen. They support the DOI system through their membership of Crossref, and supplying the metadata to make it work. And sometimes they remember that when they transfer a journal from one publisher or location to another, they can fix the resulting mess simply by changing the redirect inside the DOI system. (And sometimes they forget, but that’s another story.)

And of course, these big, toll-access, subscription-based Publishers trumpet all the Added Value that their publishing processes put onto the articles that we write and give to them (and referee for them, and persuade our libraries to buy for them, and…). So obviously that Added Value will extend to ensuring that all references have DOIs where available? A pretty simple thing to add in the copy-editing stage, I would have thought.

Except that they don’t. They display few if any DOIs in their reference lists of “their” articles. In fact my limited, non-scientific evidence-collecting suggests to me that they probably do the opposite to Adding Value: remove DOIs from manuscripts submitted to them. OK, I have no direct evidence of the removal claim, but I reckon there is pretty good circumstantial evidence.

I don’t have a substantial base of articles to work from (not being affiliated with a big library any more), but I’ve had a scan at the reference section of several recent articles from a selection of publishers. What do I see?

Take for example this editorial in Nature Materials:

Nature. (2013). Beware the impact factor. Nature materials, 12(2), 89. doi:10.1038/nmat3566

Yes, there’s a DOI in the reference I used. Mendeley picked that DOI up automatically from the paper. If I use that paper in a reference, the DOI will be included by Mendeley. This presumably also happens with EndNote and other reference managers. (Here’s me inserting a citation for (Shotton, Portwin, Klyne, & Miles, 2009) from EndNote… yes, there it is, down the bottom with a big fat DOI in it.) (This is part of my circumstantial evidence for Value Reduction by Publishers! We give them DOIs, they take them away.)

Anyway, looking at that Nature editorial, there are no DOIs in the reference list. Reference 7 is:

7. Campanario, J. M. J. Am. Soc. Inf. Sci. Technol. 62, 230–235 (2011).

I tried copy/pasting that into Google. I get two results, neither of which appears to be a JASIST article. OK let’s try this one, in a completely different field, from an Elsevier journal:

McCabe, M. J., Snyder, C. M., & Fagin, A. (2013). Open Access versus Traditional Journal Pricing: Using a Simple “Platform Market” Model to Understand Which Will Win (and Which Should). The Journal of Academic Librarianship, 39(1), 11–19. doi:10.1016/j.acalib.2012.11.035

Again, none of the referenced articles have DOIs included in the reference list. Here’s a recent reference:

Jeon, D. -S.,&Rochet, J. -C. (2010). The pricing of academic journals: A two-sided market perspective. American Economic Journal: Microeconomics, 2, 222–255.

Maybe that article (and all of the others) doesn’t have a DOI? Same trick with Google, we don’t get there straight away, we get to another search, for articles with the word “perspective” in that journal… which does get us to the right place. And yes, the article does have a DOI (10.1257/mic.2.2.222). Let’s try this article; surely Nucleic Acids Research is one of the good guys?

Fernández-Suárez, X. M., & Galperin, M. Y. (2013). The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic acids research, 41(D1), D1–7. doi:10.1093/nar/gks1297

No DOIs in the reference list. Here’s an odd one, from Nature again:

Piwowar, H. A. (2013). Value all research products. Nature, 493(7431), 159. doi:10.1038/493159a

Here they include no DOIs for actual articles, but there are URL-DOIs for Figshare! The first two references are:

1. Priem, J., Costello, K. & Dzuba, T. Figshare http://dx.doi.org/10.6084/m9.figshare.104629 (2012).

2. Fausto, S. et al. PLoS ONE 7, e50109 (2012).

Do the latest OA publishers do any better? Sadly, IJDC appears not to show DOIs in references. I couldn’t see any in references in the most recent PLoS ONE article I looked at (Grieneisen and Zhang, 2012). Nor Carroll (2011) in PLoS Biology. But yes, definitely some DOIs in references in Lister, Datta et al (2010) in PLoS Computational Biology.

What about the newest kid on the block? You know, the cheap publisher who’s going to lead to the downfall of the scholarly world as we know it? Yes! The wonderful article by Taylor and Wedel (2013) in PeerJ has references liberally bestowed with DOIs!

When I tweeted my outrage about this situation, someone suggested it’s just the publishers simply following the style guides. WTF?

Publishers! You want us to believe you are adding value to our articles? Then use the Digital Object Identifier system. Keep the DOIs we give you, and add the DOIs we don’t!

PS At one stage in preparing for this post I tried copying reference lists from PDFs and pasting them into Word. You should try it some time. It’s an absolute disaster, in many cases! Which is NOT the fault of PDF, it is the fault of the system used to create the PDF… ie the Publisher’s system. Added Value again?

PPS: here’s that reference inserted by EndNote:

Shotton, D., Portwin, K., Klyne, G., & Miles, A. (2009). Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article. PLoS Comput Biol, 5(4), e1000361. http://dx.doi.org/10.1371%2Fjournal.pcbi.1000361
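Note that the EndNote reference above carries its DOI percent-encoded (the "%2F" is really a "/"). As a rough sketch of the kind of scan I did by eye, here is some Python that looks for a DOI in a reference string. The regular expression is a deliberately loose approximation of DOI syntax, and `find_doi` is a made-up name, so treat both as assumptions rather than anything official.

```python
import re
from urllib.parse import unquote

# Loose pattern: "10.", a 4-9 digit registrant prefix, "/", then a run of non-space characters.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

def find_doi(reference: str):
    """Return the first DOI-like string in a reference, or None (hypothetical helper)."""
    # Decode percent-escapes first, so '10.1371%2Fjournal...' becomes '10.1371/journal...'.
    match = DOI_PATTERN.search(unquote(reference))
    if match is None:
        return None
    # Trim trailing punctuation that usually belongs to the sentence, not the DOI.
    # (A simplification: a DOI can legitimately end in ')' and would be clipped here.)
    return match.group(0).rstrip(".,;)")

references = [
    "Shotton, D., Portwin, K., Klyne, G., & Miles, A. (2009). Adventures in Semantic Publishing: "
    "Exemplar Semantic Enhancements of a Research Article. PLoS Comput Biol, 5(4), e1000361. "
    "http://dx.doi.org/10.1371%2Fjournal.pcbi.1000361",
    "Campanario, J. M. J. Am. Soc. Inf. Sci. Technol. 62, 230–235 (2011).",
    "Priem, J., Costello, K. & Dzuba, T. Figshare http://dx.doi.org/10.6084/m9.figshare.104629 (2012).",
]

with_doi = [ref for ref in references if find_doi(ref) is not None]
print(f"{len(with_doi)} of {len(references)} references carry a DOI")
```

Run over a real reference list, a count like this would turn my circumstantial evidence into something a bit more systematic.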

EDIT: As the comments below suggest, my post is generally true insofar as PDF versions of articles are concerned, although even there some publishers (eg BioMedCentral) do incorporate a hidden clickable link behind the reference (in BMC’s case to PubMed rather than the DOI). Several publishers have MUCH better behaviours in their HTML versions (with both explicitly visible DOIs and clickable versions of references). Sadly, HTML has no agreed container format, and is next to useless for storing articles for later reference, so it is most likely that the articles you store and use on your computer will be the sort of stunted PDFs I describe here. I still claim: this is not good enough.

Changes to publisher agreements #PDFTRIBUTE

14 Feb

This is a bit late, as I couldn’t find the relevant agreement at the time just after Aaron Swartz’s death when the #PDFTRIBUTE movement started. But I was taken by that proposal, that in memory of Aaron Swartz we should all try to liberate documents. The easiest and safest way to do that is to liberate our own documents. One way of achieving that is to deposit them in institutional repositories, or other places like Figshare. The safest way is to liberate documents you are publishing now or in the future.

Of course, many publishers don’t want you to do that, and they try to make you sign documents with various names like “Consent to Publish” that transfer your copyright to them. Sometimes they allow you to retain certain rights, sometimes including the ability to deposit a copy. But sometimes the document they ask you to sign doesn’t allow that.

My first suggestion is: read those documents carefully! They essentially take away all the rights to your work. My second suggestion: if you sign one of these documents, keep a copy. You may need to know later what you have signed!

I made a resolution many years ago only to publish in Open Access publications. This was easier for me than for many academics, as my job did not depend on publication in the same way. However, a few years ago I was asked to contribute a chapter to a book that was being compiled as a Festschrift for an ex-boss, a person I much admired. So I agreed.

The publisher sent me a Consent to Publish form, via email. It was a 2 page PDF with some moderately dense legalese on it. There were terms that I didn’t like such as “The copyright to the Contribution identified above is transferred to the Publisher…” and “The copyright transfer covers the exclusive right to reproduce and distribute the Contribution, including…” lots of stuff. Not good. So I had a chat with another ex-colleague [EDIT: Charles Oppenheim] who is a bit of a copyright expert (but not a lawyer, and I hasten to add, neither am I). Between us we came up with a few changes. I tried to edit the PDF without success, as it was locked. So in the end I printed it out, struck out clauses I didn’t like, wrote in by hand clauses I did like, initialled them, signed the amended document and sent it off. No response from publisher, contribution accepted, then after the book was published I uploaded the chapter to the local IR. I kept a copy of the signed document.

So what did I change? The first clause quoted above was changed to “The exclusive right to publish the Contribution identified above in book form is granted to the Publisher…”. The second clause above was changed to “The right granted covers the exclusive right to reproduce and distribute the Contribution in book form, including…” (the rest of that sentence continued unchanged “… reprints, translation, photographic reproductions, microform, electronic form (offline, online), or any other reproductions of similar nature…”).

Further down there was a sentence “The Author retains the right to publish the Contribution in any collection consisting solely of the Author’s own works without charge and subject only to notifying the Publisher in writing  prior to such publication of the intent to do so and to ensuring that the publication by the Publisher is properly credited and that the relevant copyright notice is repeated verbatim.” I suspect this clause was redundant as I had retained far wider rights than that (all rights except those transferred), but since I specifically wanted to put it into the IR with my other works, I changed “… Author’s own works without charge…” to “… Author’s own works or a collection of works of the Author’s institution without charge…”.

I also added at the end “The Author asserts his moral right to be identified as the author of the Contribution”.

I’ve no idea if this is useful to anyone, but I offer it in case it might be. As noted, I am not a lawyer and this is not legal advice. Your mileage may vary, as they used to say in internet mailing lists when I was young, and your licence terms almost certainly will vary. But there’s no harm in trying to get the best deal you can, and amending and signing the proposed terms is one way. I reckon it’s better than asking. Send it in; they probably won’t even read it themselves. But keep a copy in case!

More back from Microsoft

28 Nov

Following my posting of the initial response to my Open Letter, Jim Thatcher wrote back to me:

“Thanks Chris. My team will be working on this issue to try to come up with a more concrete path forward. For now, if any of your readers have specific needs I would encourage them to work with third-party vendors (as you did) to convert the obsolete documents. Archival organizations with long-term structural needs that can’t be addressed by a one-shot conversion project can contact me directly at jim.thatcher(at)Microsoft.com.

Please keep me apprised of the community’s suggestions for crowd-sourcing those document formats.

Regards,
Jim Thatcher
Principal Program Manager
Office Standards and Interoperability”

I’m currently struggling to find any person or organisation willing to take some leadership in that response; this is moving beyond what I can achieve in a back bedroom! So BL, TNA, LoC, NARA, NLA, NLNZ, OCLC, JISC, Knowledge Exchange, EC, even CNI and DPC (and others), I’m looking at you!

Response to the Open Letter on obsolete Microsoft file formats

26 Nov

You may remember the Open Letter I sent to Tony Hey of Microsoft and published a few weeks back (https://unsustainableideas.wordpress.com/2012/10/22/open-letter-ms-obsolete-formats/). Well, I’m pleased to say that Tony has responded. I’ve included the full text of his response below.

“Chris

I have a reply from Jim Thatcher in the Office team:

1)      We do not currently have specifications for these older file formats.

2)      It is likely that those employees who had significant knowledge of these formats are no longer with Microsoft.

3)      We can look into creating new licensing options including virtual machine images of older operating systems and old Office software images licensed for the sole purpose of rendering and/or converting legacy files.

4)      One approach we could consider is for Microsoft to participate in a “crowd source” project working with archivists to create a public spec of these old file formats.

I think it would be sensible for you to talk directly with Jim – and Natasa in the UK – to see if there are some creative options that Microsoft could pursue with the archivist community.

OK?”

Now it’s worth pointing out that this is a response from a couple of individuals to my Open Letter, not a formal commitment by Microsoft. But it’s a good start and we need to work to make more good come from it.

The first two points are more or less as expected, and as suggested by several commenters (some picked out in my blog post of selected comments https://unsustainableideas.wordpress.com/2012/10/31/comments-on-open-letter-ms/).

The later points are very welcome, and I would be very happy to see them taken forward.

Is there any appropriate group to work with Microsoft on appropriate licence terms for older software to render or migrate legacy files?

Is there an appropriate group to coordinate crowd-sourcing interaction with Microsoft? I can see at least 4 approaches:

a) Create a set of sample files from all obsolete versions available under CC0, eg the CURATEcamp 24 hour worldwide file id hackathon (#fileidhack) public format corpus.

b) Work on a complete set of format identifying signatures, so that an unknown file can be properly identified.

c) Work as suggested on the specs of some of these older formats. Based on a quick look at an old MS Word file, some of these older formats are not that complicated.

d) Work to include these formats in Open Source Office suites, so we can migrate files into the future at no cost to Microsoft.

All of these would need a little leadership so that Microsoft didn’t get bogged down with interaction costs. Some could take place without funding and with little more than leadership (as in File Formats November or #fileidhack, for example). Some might need more resources and a bit of funding.
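To make approach (b) a little more concrete, here is a minimal Python sketch of signature-based identification: read the first few bytes of a file and compare them against known "magic numbers". The table is a tiny illustrative subset that I have chosen myself, and the function names are hypothetical; a real signature set would need to be far larger and would have to look inside containers as well.

```python
# A few well-known leading byte signatures. This is an illustrative subset,
# not a complete or authoritative signature registry.
SIGNATURES = [
    (b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", "OLE2 compound file"),
    (b"PK\x03\x04", "ZIP container"),
    (b"%PDF-", "PDF"),
    (b"{\\rtf", "Rich Text Format"),
]

def identify(header: bytes) -> str:
    """Return a label for the first signature that matches the start of header."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"

def identify_file(path: str) -> str:
    """Read the first 16 bytes of a file and identify them (hypothetical helper)."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))

print(identify(b"%PDF-1.4\n"))
```

A match on a container signature only says what kind of wrapper a file uses, not which application or version wrote it, which is exactly why a fuller set of signatures would be worth the effort.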

I think Microsoft has returned service. What next, folks?

Comments from some Responses to “Open letter to Microsoft on specs for obsolete file formats”

31 Oct

I have been rather overwhelmed by the wonderful response to my Open Letter. There are many excellent comments among the 100 or so. Most comments are simple words of support for the idea, and are very much appreciated for that, but a small number had other messages that perhaps deserve to be a bit more accessible. I’m quoting a few of them here. Where more than one person made similar remarks, I’ve rather arbitrarily selected among them. Thank you to everyone for your support, encouragement and ideas.

I should also say right up front that Tony Hey has responded briefly that he is checking, so I remain optimistic that something might just come from this.

Lee Dirks

Howard Besser’s early comment was typical of many comments in support of the idea. He also paid homage to Lee Dirks. I had mentioned Lee in an early draft of my letter, and I truly believe he would have supported this idea internally as strongly as he could. In the end I took the mention out as diluting my main point, but I’m delighted to see it brought out here by Howard:

“Chris, we all know that this would be an important step that would facilitate digital preservationists in doing their jobs. And it is particularly timely, given that our community has recently lost our major advocate within Microsoft– Lee Dirks.”

Widespread problem

Libor Coufal of the National Library of Australia makes it clear that my particular problem with old PowerPoints is potentially widespread:

“Like many other memory institutions, National Library of Australia has a load of files in legacy MS formats. Just a quick search in our small testing sample of files returned several PowerPoint 4 files which can’t be open with the current PowerPoint version. Any initiative which would help to solve this problem is very welcome.”

Other advantages… and patents

Gary McGath (who writes a blog related to File Formats) pointed out other advantages for Microsoft:

“I hadn’t realized Microsoft has documented as much as it has till I started looking around its Open Specifications pages. Expanding their scope as you suggest would certainly be beneficial, and I’d like to mention another benefit to Microsoft: It would let other people do their work for them. Microsoft has little interest in spending money to support formats from the nineties, but if other people take up the slack in open source they will add to the long-term value of the formats, and thus give people more confidence in the long-term viability of their current formats.”

Gary also raises another concern, on patent implications. I believe this might well be covered by the terms of Microsoft’s Open Specification Promise, but it’s worth mentioning here:

“Microsoft alludes to patent licensing without getting into specifics. It would give even more confidence if we could be confident that open-source implementations wouldn’t have the threat of patent lawsuits hanging over their heads.”

Emulation

Euan Cochrane takes me to task for the tone of a remark about emulation:

“This is a great idea and I would be very impressed if Microsoft were to release their standards documentation.

“I am, however, a little concerned about your statement that:
‘At present there appear to be only two routes for migration: one relies on technology preservation (or emulation) in the form of systems that can (and are licensed to) run a sufficiently early version of MS PowerPoint, and the second is via this small company, Zamzar. Neither of these solutions can be relied on for the long term.’
(emphasis added [by Euan])

“Emulation can be viable over the long term and therefore migration by emulation can be viable over the long term (e.g. this). Furthermore I fail to see why you believe that migration is likely to be any more viable. I can give countless examples of the use of emulation right now (such as for mobile phone software development), and examples of the use of emulation going back decades.

“Nevertheless, to be able to have a viable long-term emulation solution we will need access to the software of yesteryear. As such I would love to see your open letter extended to include a request for access to old Microsoft software. It would not have to be without cost and perhaps could include a custom license for use only by memory institutions and/or with other restrictions.”

My Open Letter didn’t make this point clearly, but licensing issues were at the back of my mind when I made that point about emulation. Others were concerned too. Mark Jordan wrote:

“If emulation is the only practical option for accessing these files, then Microsoft should relax the licenses on its older products to allow installation and use specifically for digital preservation purposes. These older products would not compete with current ones, and preservationists would have relief from the single biggest non-technical problem with emulation — software licenses.”

Meanwhile gwhatjeff (I don’t have a more complete name) had a practical suggestion that cuts across the emulation and hardware preservation approaches:

“Hi Mark – I agree that Microsoft could do a lot to relax licensing standards, but there are plenty of old Office licenses and media available for ~$10 per. I’m going to purchase some older versions (Off95/Win95 or Off2000,Win2k) to support getting this set up in emulation. Even without some sort of hosted emulation, I believe most digital archive organizations could set up a legacy OS and file conversion PC for < $200. The harder part would be to get the various hardware components set up properly, particularly networking equipment or floppy drives.”

No specs

MetalSamurai (again I don’t have a more complete name) was one of those concerned (as am I) that there are actually no specifications to release:

“You can bet that in this case the code *is* the documentation*. So you’re asking MS to release the source for old versions of Office. I can’t even keep a straight face thinking that.

“[* Really. Even now the MS Office document format is “what the software decides it is”. That’s why the Mac version never quite displayed documents the same way as the Windows version. There was no proper documentation. And MS will have no interest in writing it now. MS have however promised that the next version of Office will finally conform to the ISO standard they bought a few years ago. I’ll believe it when I see it.]”

This was backed up by Jeff Meyer, who wrote:

“Hi – as a former Microsoftie who worked in Office marketing in the late 90′s, I fear MetalSamurai is correct & that this is neither simple nor straightforward, as Jerome has described. The letter’s assertion that the number of people who might be familiar with these formats at Microsoft is declining is an understatement. It might be declining from 2 to 1 or 1 to none at this point. […] I highly doubt there’s an old spec doc sitting around in their files. If there is, it’s full of errors.”

I’ve extracted a couple of sentences here that are off this particular point, but include them later. You can see Jeff’s comment un-edited in the comment stream.

Other solutions?

There were several suggestions for other solutions or approaches. Henk Koning wrote:

“I would like to suggest that you consider less far going alternatives. It might very well be impossible for Microsoft to do what you ask, for instance because there is no authoritative file spec, or specs have been changed in an inorderly fashion, or there are anomalies in the specs that would not be understood by the world nowadays (and ridiculed). How to make this a safe journey for Microsoft?

“What I can think of:
– Microsoft starts giving support to specific migration problems
– specs are given piecemeal wise, related to specific migration problems, on request, and after signing a confidentiality agreement
– Microsoft opens a migration service with limited responsibility (but an estimate of the success of the migration)
– ?”

I’m not in favour of confidentiality agreements in the long-term preservation arena; openness seems to me the key. Nevertheless Henk’s suggestion might allow competitive commercial alternatives to become available.

Jeff Meyer also wrote:

“Euan’s answer suggests the only reliable – and already available – method for this preservation, which is emulation. In fact, it would probably be easier for Microsoft to set up a hosted instance of an older version of Windows & Office just for converting old files than it would be for them to reverse-engineer their own standard. […] Not trying to be a spoilsport, just trying to suggest a solution that will get what you want – access to your files – reliably and quickly.”

(The sentence removed here is already quoted above.)

Jeff also added, in a later comment (remember, he’s an ex-Microsoftian!):

“Chris – I’m interested in figuring out a way to help, but am confused by many of the comments on this thread (maybe people aren’t reading the other comments?). Is the goal of having the specification for the sake of having the specification itself, or is it to reveal the content of potentially unreadable files? If the former, that may be searching for the nonexistent. If the latter, then you do not need an explicit specification document in order to do that. You just need working software that implements the specification.”

Libor Coufal of the National Library of Australia responded to this:

“Jeff: Yes, the ultimate goal is (not only) to reveal the content, but more importantly to save it in a newer, working version. If you can have access to software which can do it, then you’re saved. But what if you don’t? How much trouble (and expense) you’ve got to go into to get you there? And is this a long-term viable solution? I guess, having the specifications available would give everyone a greater confidence that such a solution can be developed, not only now but also anytime in the future. Having said that, I perfectly understand that it may not be viable neither, but if it is, it would definitely be very appreciated (as you can see from the comments).”

Other suppliers too

A number of comments highlighted that we should take this initiative to other suppliers as well as Microsoft. Apple was mentioned a few times, as in this comment by Ben Fino-Radin:

“Full endorsement. Now, who is going to write the open letter to convince Apple to update their ‘old software’ page?”

Kara Van Massen supported this, and points out another powerful argument:

“Very much in support of this. I hope Microsoft sets an example that other companies (ahem, Apple) might follow. Not only is this important for cultural heritage and memory institutions, but perhaps even more so for corporate assets in legacy formats that have business or legal reasons for preservation.”

Rescue mission?

William Anderson has a good point on the need for rescue missions. To some extent, “File format November” might form part of that, but in reality many more tightly focused efforts will be needed:

“Microsoft can help set a standard of practice for others to emulate. However, as has been pointed out […], specifications may be missing, and knowledge already lost. If this is the case, then perhaps these formats need to be nominated for a rescue mission. It’s clear that the content they encode is at risk of permanent loss.”

Thanks to everyone who has commented, and as noted all comments are accessible (unless they’ve inadvertently been killed by the spam filter, in which case, my apologies).

Open letter to Microsoft on specs for obsolete file formats

22 Oct

The main text of this blog post is a letter I have sent to Tony Hey of Microsoft, asking him to use his influence to get specifications for older obsolete file formats published on Microsoft’s Open Specifications Page. If you support this, please leave me a comment below endorsing the letter (note, the spam filter may delete or refer for moderation any comments containing URLs).

Dear Tony,

Open Letter on specifications for obsolete file formats

I am writing to you, as the most senior person I know in Microsoft, to ask you to use your influence to ensure Microsoft adds specifications for older Office (and other) file formats to the Microsoft Open Specifications page. I have put this Open Letter on my blog (https://unsustainableideas.wordpress.com/), and (if you agree) would like to put any reply from you on that blog as well. I will also solicit further support for this letter on that blog, in the form of comments of endorsement.

Microsoft’s Open Specifications page and the accompanying Open Specification Promise were both very welcome developments, for which Microsoft is rightly applauded. However, the Specifications only go back to Office 97-2003 formats. I have some MS Word and MS Excel documents from earlier versions of Office that seem to open well in more modern software, so perhaps their file formats are compatible, at least to some extent. However, PowerPoint 4.0 files do not open at all in modern MS Office applications, and the file format is understood to be very different.

I have been attempting to convert some 50 or so PowerPoint 4.0 files to more modern formats (to migrate them, in digital preservation parlance), and have documented the process in a series of posts on my blog. The post at https://unsustainableideas.wordpress.com/2012/10/02/powerpoint-4-0-story-so-far/ sums up the exercise, and there is one further post about a small company that has succeeded in converting some files for me. At present there appear to be only two routes for migration: one relies on technology preservation (or emulation) in the form of systems that can (and are licensed to) run a sufficiently early version of MS PowerPoint, and the second is via this small company, Zamzar. Neither of these solutions can be relied on for the long term.

While my main focus in this letter is on older formats within the basic Office set, the specifications for related software such as Microsoft Works and early versions of Microsoft Access would also be helpful for preservation purposes.

You might ask: why should Microsoft put effort today into making these specifications available? I believe Microsoft’s software tools are not merely temporary mechanisms for profit in the marketplace, but (by dint of their flexibility and success) tools that the wider world has used to create billions of cultural artefacts that may be of lasting value. By declining to help make these obsolete file formats accessible, Microsoft is locking up this cultural content, and will eventually throw away the key.

Andrew Jackson of the British Library (who helped me with my initial attempts to convert my PowerPoint 4.0 files) has studied the population of older file formats in a dataset of 2.5B web resources from the UK Web Archive. He found that PowerPoint 4.0 has been persisting on the UK web until fairly recently. For ALL PowerPoint files with identifiable versions created from 1996 to 2010, PowerPoint 4.0 and PowerPoint 95 represent around 2.5%, and for PowerPoints created up to 2002 the proportion of the older formats was 27%. We can be confident that many, many more such resources will exist in private file systems.

Why should Microsoft act now? First, because the number of people within Microsoft who understand these formats must be declining. Second, the specifications themselves (to the extent that they exist as simple documents) must also be at risk of loss through accident or some grand tidy-up process that discounts older material as irrelevant. Third, because many of the early adopters who used these products in the 1990s are, like me, coming up to or past retirement. I believe there will be an increasing swell of documents from some of these people flowing into archives for preservation over the next several years. Many of these will be documents from people of much greater cultural and scientific importance than me, but who have less time and/or ability to pursue possible solutions to an obsolescence problem. Fourth, I think this is consistent with the direction you have helped Microsoft to take since joining them.

I’m also motivated by another factor: Jason Scott’s call for action to “Solve the File Format Problem” scheduled for this November (original post here and wiki page here http://www.archiveteam.org/index.php?title=Just_Solve_the_Problem_2012).  Jason is a member of the Archive Team of “rogue archivists”, who attempts to save disappearing web sites, and is seeking a crowd-sourced solution to the lack of information on obsolete file formats. It would be wonderful if Microsoft could add to that information by making these specifications available in November.

What would this cost Microsoft? On the face of it, simply the staff effort to gather the relevant specifications and make them available. Of course, the documents may not exist as well-written specifications, in which case I would urge Microsoft to make as much information available as possible, allowing others to make sense of them against the “ground truth” of existing files. It would be wonderful if Microsoft could make available a migration tool, but this would obviously be a larger effort with longer term implications. Indeed, in the long run it might be more cost effective to support an open migration tool.

The benefit to Microsoft in doing this would be in enhancing its reputation as a responsible company that understands and acts on the implications of its past work.

Possible outcomes could include input filters for open software such as OpenOffice or LibreOffice, input converters for SlideShare and others, and possible Microsoft or commercial 3rd party migration tools.

The societal benefits of this would include better preservation of a subset of cultural artefacts and a better understanding of the content of presentations in early days, which may document discoveries or encapsulate persuasion arguments for significant change programs. Ultimately, this is about a richer cultural heritage. My own presentations in PowerPoint 4.0 date from the time when I was Director of the JISC Electronic Libraries Programme, and document how we sought to persuade the community to go forward with that campaign, and some of the adjustments that were made to it.

I have found a previous Open Letter on a similar subject, from Rick Jelliffe on the O’Reilly XML blog, at http://www.oreillynet.com/xml/blog/2008/03/an_open_letter_to_microsoft_ib.html.

I would really appreciate your views on whether this might be possible.

Yours, Chris

The PowerPoint 4.0 adventure: what did I learn?

15 Oct

Part of the point of my attempt to access my old PowerPoint 4 files (see here and here for the latest state) was to see what I could learn from it, with half an eye on the Jason Scott November month of action on file formats (see also planning here). So, what did I learn that might be of more general interest than the specific case?

I guess the first thing is: the Internet is your friend! I knew that, and so do you, but I continue to be amazed at the extent to which people will go out of their way to help you if you ask (not always, but often). Of course, this is partly due to the particular set of people who know (of) me…

Faced with a set of files that you know little about, and that don’t “automatically” open (based on defaults in your file system and operating system), the first thing you will need is some mechanism to identify the file format. This may apply even if the file system does appear to know about the file format. The files in this case were a mixture:

  • a) some had a .ppt file type and opened correctly
  • b) some had a .ppt file type but PowerPoint 2004 refused to open them
  • c) some had no file type extension (files migrated from Macintosh System 7 OS without the resource fork).

We can ignore group (a). Group (b) files were of two types, it turned out: some were PowerPoint 4.0 files, but some were Word 6.0 files that had wrongly had a .ppt extension added (by me, usually due to some clue in the file name like “slides”, forgetting that prior to using PPT 4 I made slides in Word and printed them to transparencies). Group (c) files were a mixture of PPT 4 and Word 6 files (with an odd Macintosh Write file thrown in for good measure).

In order to work out what to do, you really do have to know what you have got. So, the first thing you need is a pointer to a set of tools that can identify file formats. Those tools may require a set of signatures to help them. But if you are an amateur like me, you probably don’t want to run a professional-level tool; you probably need a simple procedure that will help. In my case, this was opening each file with Word 2004 using the “recover any text” option, and looking for some characteristic content after the main bulk of the slide content. Here there was some text characteristic of PowerPoint like:

“dRClick to edit Master text styles

Second Level

Third Level

Fourth Level

Fifth Level”

In addition there was some information that I recognised as relating to the original Mac directory structure for the files, and a few instances of the text “PowerPoint 4.0”. Note, in this case it is clearly important to know the particular version of the file format that you have; PowerPoint is NOT sufficient!
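For anyone who would rather not open 50-odd files by hand, the same check could be roughed out in a few lines of Python. This is only a sketch of the eyeball test described above, not a real format identification tool: it searches the raw bytes of each file for the two marker strings I saw in the recovered text. The function names and the result labels are my own invention, and it assumes the markers are stored as plain single-byte text, which was true of what Word showed me but may not hold for every file.

```python
from pathlib import Path

# Marker strings taken from the recovered text described above.
# Assumption: they appear in the file as plain single-byte (ASCII) text.
VERSION_MARKER = b"PowerPoint 4.0"
MASTER_MARKER = b"Click to edit Master text styles"


def guess_format(data: bytes) -> str:
    """Make a rough guess at the format from marker strings in the raw bytes."""
    if VERSION_MARKER in data:
        # The version string is what matters; "PowerPoint" alone is not enough.
        return "PowerPoint 4.0 (probable)"
    if MASTER_MARKER in data:
        return "PowerPoint (version unknown)"
    return "unknown"


def scan_folder(folder: str) -> dict:
    """Map each file name in a folder to a guessed format (hypothetical helper)."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            results[path.name] = guess_format(path.read_bytes())
    return results
```

Calling scan_folder on a directory of old files gives a quick triage list. Anything reported as "unknown" still needs a look by hand, since in my case those turned out to be mostly Word 6.0 files.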

[As an aside, David Rosenthal has this to say on file format identification tools:

“Several people responded to my criticism of format identification tools. Matt said:
‘I do agree that identification of textual formats is increasingly important, and further efforts are probably needed in this area.’
“I don’t agree and have said so in the past. As regards Web formats, to the extent to which format identification tools agree with the code in the browsers they don’t tell us anything useful, and to the extent to which they disagree with the code in the browsers they are simply wrong. Applying these tools as part of a Web preservation pipeline is at best a waste of resources and at worst actively harmful.”

Now it’s always worth reading David’s text carefully, and clearly he’s referring to objects that form part of displayed web pages. However, these are not the only kinds of files on the web; many make other files available via the web, and for some of these in my opinion David is wrong.]

The beauty of using Word 2004 for this job was that the files that were really Word files mis-identified as “.ppt” opened flawlessly in Word and were clearly different!

Once you know what you’ve got, you probably want information on the risks (to you) for content in that format. What problems are you likely to have “rendering” the files (ie causing them to display their content as they should)? What problems are likely if you try to migrate the files (ie open and “save as” some more modern format)? Is older software available to you that could open the files? Are older computer systems available to you? You could sum these up by asking: what is the degree of obsolescence of the files? Finally, you need some hints as to the action window that you have available. I could have converted these files some years ago via a colleague who had software that would open them, but her machine has been updated and the newer version has lost this option (thanks for nothing, Microsoft).

It’s worth noting that a lot of the “official” advice on obsolescence that you might find is useless. Various sites will classify formats as obsolete that are still perfectly easy to open and migrate from. Indeed, I suspect that there’s no really helpful way to classify obsolescence (I tried and failed). And it changes… before this exercise I would have classified PPT 4 as pretty high on any obsolescence scale, but now we have a simple and free migration option (or low cost, if you expect to have a lot of these).

So now you know what formats your files are in, and you have some idea of the risks to those files. Perhaps it’s time to take action! Now you’ll need to know:

  • d) What software is available to render, and preferably to migrate (save as) the files? And which of these options is free or cheap enough for you?
  • e) What services are available to render, and if possible to migrate the files to a newer format? And which of these options is free or cheap enough for you?
  • f) What older technology routes do you or your contacts have access to, that might help you to render and/or migrate the files? This might involve getting access to software licences that are no longer commonly available.
  • g) What older environments could you or your contacts emulate? Again, this might need access to software licences that are no longer commonly available.

For cases (f) and (g), PowerPoint 4.0 licences would not be appropriate, as they would probably not give me a useful “save as” option (although they might help me to triage the content by being able to view the slides as I intended them, and there might be an indirect route such as “Print to PDF”). I’d need a licence that was newer than PowerPoint 4.0 but older than PowerPoint 2004 (which no longer supports the old file format).

In this case, the answer to (d- software that could render or migrate) was: none. I could not find software available to me that could even render the files reasonably well.

The answer to (e- services that could render or migrate) was initially: none. But Zamzar came good when I asked them and sent them some examples. In a later case, I was having problems migrating some newer PPT files that had embedded objects (graphs from Excel), and Zamzar managed to convert these as well, and suggested they might add this to their standard option, too.

The answer to (f- older technology) for me was: none. But through my contact network I did find someone who had access to an appropriate Mac with the intermediate software. This was OK for proof of concept, but would not have been suitable for converting all my 50-odd files. It’s possible that I could have bought a licence for that software and run it on my Mac, but I didn’t want to spend that much on an option that still might not work. I suppose I could have grabbed a version off a torrent somewhere, but I do try to stay legal!

The answer to (g- emulation options) for me was: none that were feasible. There were emulation options suggested, but they still needed the intermediate software (see above).

To summarise: faced with older files that you cannot open, I think you need the following information, in roughly this order:

  1. information to help identify the formats at an appropriate level of precision,
  2. information on risks to your content, once the format has been identified, and
  3. information on routes that will allow you to render (not least for triage purposes) and possibly migrate the files to a more modern format.

I realise there are use cases where it is essential that the file be presented in its original format, but these use cases are of little interest to me. I want to read Sir Walter Scott’s works in a modern edition, not the original, but I’m not a Scott scholar! Likewise, I’m interested in my older content for its use to me, not to study how it looked in the old days.