Open postcode? That’ll be a “no” then!

14 Mar

A month or so ago I got an email from the Open Rights Group, asking me to write to a minister supporting the idea of retaining the Postcode database as Royal Mail is privatised, and making it Open. The suggested text was as follows:

“Dear [Minister of State for Business and Enterprise]
“We live in an age where location services underpin a great chunk of the economy, public service delivery and reach intimate aspects of our lives through the rise of smartphones and in-car GPS. Every trip from A to B starts and ends in a postcode.
“In this context, a national database of addresses is both a critical national asset and a natural monopoly, which should not be commercially exploited by a single entity. Instead, the Postcode Address File should be made available for free reuse as part of our national public infrastructure. The postcode is now an essential part of daily life for many purposes. Open availability would create re-use and mashup opportunities with an economic value far in excess of what can be realised from a restrictive licence.
“I am writing to you as the minister responsible to ask for a public commitment to:
“1) Keep the Postcode Address File (PAF) under public ownership in the event of the Royal Mail being privatised.
“2) Release the PAF as part of a free and open National Address Dataset.”

A few days ago I got a response. I think it must be from a person, as the writer managed to mis-spell my name (not likely to endear him or her to me!)

“Dear Mr Rushbridge,

“Thank you for your email of 6 February to the Minister for Business and Enterprise, Michael Fallon MP, regarding the Postcode Address File (PAF).

“I trust you will understand that the Minister receives large amounts of correspondence every day and regretfully is unable to reply to each one personally.  I have been asked to reply.

“The Government’s primary objective in relation to Royal Mail is to secure a sustainable universal postal service.  The postcode was developed by Royal Mail in order to aid delivery of the post and is integral to Royal Mail’s nationwide operations.  However, we recognise that postcode data has now become an important component of many other applications, for example sat-navs.

“In light of PAF’s importance to other users, there is legislation in place to ensure that PAF must be made available to anyone who wishes to use it on terms that are reasonable.  This allows Royal Mail to charge an appropriate fee whilst also ensuring that other users have access to the data.  The requirement is set out in the Postal Services Act 2000 (as amended by the Postal Services Act 2011) and will apply regardless of who owns Royal Mail.  It is this regulatory regime, and not ownership of Royal Mail, that will ensure that PAF continues to be made available on reasonable terms.  Furthermore, Ofcom, the independent Regulator, has the power to direct Royal Mail as to what ‘reasonable’ terms are.  Ofcom are currently consulting on the issue of PAF regulation and more information can be found on their website at: http://www.ofcom.org.uk.

“On the question of a National Address Register, the UK already has one of the most comprehensive addressing data-sets in the world in the form of the National Address Gazetteer (NAG).  The NAG brings together addressing and location data from Ordnance Survey, Local Authorities and Royal Mail; the Government is committed to its continuation as the UK’s definitive addressing register.

“The Government is similarly committed to ensuring that the NAG is used to its full benefit by both public and private sector users, and keeps pricing and licensing arrangements under review with the data owners.  Alongside our commitment to the NAG, the Government is continuing to consider the feasibility of a national address register.

“I trust you will find this information helpful in explaining the position on this subject.

“Yours sincerely,

“BIS MINISTERIAL CORRESPONDENCE UNIT”

So, that’ll be a “No” then. But wait! Maybe there’s a free/open option? No such luck! From Royal Mail’s website, it looks like £4,000 for unlimited use of the entire PAF (for a year?), or £1 per 100 clicks. You can’t build an open mashup on that basis. Plus there’s a bunch of licences to work out and sign.
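To put those two prices in perspective, here's a quick back-of-envelope calculation (assuming, as the website seems to suggest, that the £4,000 covers a year and the per-click rate simply scales):

```python
# Rough break-even between Royal Mail's two quoted PAF pricing options
# (figures as read from their website; treat them as assumptions).
ANNUAL_LICENCE_GBP = 4_000       # unlimited use of the full PAF, per year(?)
PRICE_PER_CLICK_GBP = 1 / 100    # £1 per 100 lookups

break_even_lookups = ANNUAL_LICENCE_GBP / PRICE_PER_CLICK_GBP
print(f"Pay-per-click is cheaper below {break_even_lookups:,.0f} lookups per year")
```

So any mashup doing more than about 400,000 lookups a year is into four-figure licence territory before it has served a single advert.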

What about the wonderful National Address Gazetteer? It’s a bit hard to find out, as there seem to be multiple suppliers, mainly private sector. Ordnance Survey offers AddressBase via their GeoPlace partnership, which appears [pdf] to cost £129,950 per year plus £0.008 per address for the first 5 million addresses! So that’s not exactly an Open alternative, either!

Now I’m all for Royal Mail being sustainable. But overall, I wonder how much better off the whole economy would be with an Open PAF than with a closed PAF?


Some research data management terminology

22 Feb

Terminology in this area is confusing, and is used differently in different projects. For the purposes of a report I’m writing, unless otherwise specified, we will use terminology in the following way:

  • Data management is the handling and care of data (in our case research data) throughout its lifecycle. Data management thus will potentially involve several different actors.
  • Data management plans refer to formal or informal documents describing the processes and technologies to be deployed in data management, usually for a research project.
  • Data deposit refers to placing the data in a safe location, normally distinct from the environment of first use, where it has greater chance of persisting, and can be accessed for re-use (sometimes under conditions). Often referred to as data archiving.
  • Data re-use refers to use made of existing data either by its creators, or by others. If re-use is by the data creators, the implication is that the purpose or context has changed.
  • Data sharing is the process of making data available for re-use by others, either by data deposit, or on a peer to peer basis.
  • Data sharing plans refer to the processes and technologies to be used by the project to support data sharing.

Some JISCMRD projects made a finer distinction between data re-use and data re-purposing. I couldn’t quite get that. So I’m balancing on the edge of an upturned Occam’s Razor and choosing the simpler option!

Does this make sense? Comments welcomed!

How to plan your research data management (planning is not writing the plan!)

21 Feb

David duChemin, a Humanitarian Photographer from Vancouver, wrote a blog post (duC13) at the start of 2013 (in the “New Year Resolution” season) entitled “Planning is just guessing. But with more pie charts and stuff”. He writes:

“Planning is good. Don’t get me wrong. It serves us well when we need a starting point and a string of what ifs.  I’m great at planning. Notebooks full of lists and drawings and little check-boxes, and the only thing worse than planning too much is not planning at all. It’s foolish not to do your due-diligence and think things through. Here’s the point it’s taken me 4 paragraphs to get to: you can only plan for what you’ll do, not for what life will do to you.”

OK he doesn’t really think planning is just guessing; in the post he’s stressing the need for flexibility, but also pointing out that planning (however flawed) is better than not planning.

That blog post is part of what inspired me to write this. Another part is a piece of work that I’m doing that seems to have gone on forever. It seems like a good idea to put this up and see what comments I get that might be helpful.

Planning to manage the data for your research project is not the same thing as filling in a Checklist, or running DMP Online. The planning is about the thinking processes, not about answering the questions. The short summary of what follows is that planning your research data management is really an integral part of planning your research project.

So when planning your research data management, what must you do?

First, find out what data relevant to your planned research exists. You traditionally have to do a literature search; just make sure you do a data search as well. You need to ensure you’re aware of all relevant data resources that you and your colleagues have locally, and data resources that exist elsewhere. Some of these will be tangentially referenced in the literature you’ve reviewed. So the next step is to work out how you can get access to this data and use it if appropriate. It doesn’t have to be open; you can write to authors and data creators requesting permission (offering a citation in return). Several key journals have policies requiring data to be made available, if you need to back up your request.

The next step, clearly, is to determine what data you need to create: what experiments to run, what models, what interviews, what sources to transcribe. This is the exciting bit, the research you want to do. But it should be informed by what exists.

Now before planning how you are actually going to manage this data, you need to understand the policies and rules under which you must operate, and (perhaps even more important) the services and support that are available to you. Hidden in the policies and rules will be requirements for your data management (data security, privacy, backup, continued availability, etc). Hidden in the services and support will be some that will be very useful to you, and will save you time and diverted resources (institutional backup services, institutional data repositories, etc). As suggested above, these services and support could come from your group, your institution, your discipline, your scientific society, or your invisible college of colleagues around the world.

So now you can plan to manage your data. You may need to address many issues:

  • Identification, provenance and version control: how to connect associated datasets with the experimental events and sources from which they derived, and the conditions and circumstances associated.
  • Storage: how and where to store the data, so that you and your colleagues (who may be in other institutions and/or other countries with different data protection regimes) can work on it conveniently but securely. Issues like data size, rate of data creation, rate of data update may all be relevant here. Data backup! Encryption for sensitive data taken off-site. Access control. Annotation. Documentation.
  • Processing: how will you analyse and process your data, and how will you store the results. Back to provenance and version control!
  • Sharing: How to make data available to others, and under what conditions. Where will you deposit it? With what associated information to make it usable? Depends on the data of course, and issues such as data sensitivity. May also depend on data size etc. Which data to share? Which data to report?

That’s not everything but it’s the core. When you’ve done the basic planning at this sort of level, you can get down to writing the Plan! At this point the specific requirements of research funder and institution will come into play, and tools like DCC DMP Online will be useful. They may even remind you of key issues you had forgotten or ignored, or local services you (still) didn’t know about.

At this point you don’t know whether your research will be funded, so there is a limit to the amount of effort you should put into this. NERC wants a very much simplified one-page outline data management plan; it may be more sensible to have a 2 or 3-page plan covering the stuff above, and condense down (or up) as required by your funder.

But you’re still only at the first stage of your research data management planning! If you are lucky enough to get your project funded, there will be a project initiation phase, when you gather the resources (budget, staff, equipment, space). Effectively you’re going to build the systems and establish the protocols that will deliver your research project. At this point you should refine your plan, and add detail to some elements you were able to leave rather vague before. Now you’re moving from good intentions to practical realities. And given that life does throw unexpected events at you (staff leaving, IT systems failing, new regulations coming in), you may need to do this re-planning more than once. Keep them all! They are Records that could be useful to you in the future. In a near-worst case, they could form part of your defence against accusations of research malpractice!

My point is, this isn’t so much good research data management planning, as good planning for your research.


duC13 duChemin, D. (2013). Planning Is Just Guessing. But With More Pie Charts and Stuff. Vancouver, BC. Retrieved from http://davidduchemin.com/2013/01/planning-and-guessing/

The DOI has no clothes, and Publishers have taken them away!

19 Feb

So what’s the Digital Object Identifier for, really? I thought it was a permanent identifier so that we could link from one article to the articles it references in a pretty seamless fashion. OK, not totally seamlessly, since a DOI is not a URI, but all we have to do is stick http://dx.doi.org/ on the front of a DOI, and we’re there. So we should end up with an almost seamless worldwide web of knowledge (not Web of Knowledge™, that’s someone’s proprietary business product).
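In code terms, the “stick http://dx.doi.org/ on the front” step really is about as simple as linking gets. A minimal sketch (percent-encoding is only needed for the odd DOI that contains awkward characters like parentheses or angle brackets; the slash is fine as-is):

```python
from urllib.parse import quote

def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a resolvable link via the dx.doi.org proxy."""
    # Escape any characters that are unsafe in a URL path; '/' stays literal.
    return "http://dx.doi.org/" + quote(doi, safe="/")

print(doi_to_url("10.1038/nmat3566"))
# http://dx.doi.org/10.1038/nmat3566
```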

Obviously the Publishers must play a large part in making this happen. They support the DOI system through their membership of Crossref, and supplying the metadata to make it work. And sometimes they remember that when they transfer a journal from one publisher or location to another, they can fix the resulting mess simply by changing the redirect inside the DOI system. (And sometimes they forget, but that’s another story.)

And of course, these big, toll-access, subscription-based Publishers trumpet all the Added Value that their publishing processes put onto the articles that we write and give to them (and referee for them, and persuade our libraries to buy for them, and…). So obviously that Added Value will extend to ensuring that all references have DOIs where available? A pretty simple thing to add in the copy-editing stage, I would have thought.

Except that they don’t. They display few if any DOIs in their reference lists of “their” articles. In fact my limited, non-scientific evidence-collecting suggests to me that they probably do the opposite to Adding Value: remove DOIs from manuscripts submitted to them. OK, I have no direct evidence of the removal claim, but I reckon there is pretty good circumstantial evidence.

I don’t have a substantial base of articles to work from (not being affiliated with a big library any more), but I’ve had a scan through the reference sections of several recent articles from a selection of publishers. What do I see?

Take for example this editorial in Nature Materials:

Nature. (2013). Beware the impact factor. Nature materials, 12(2), 89. doi:10.1038/nmat3566

Yes, there’s a DOI in the reference I used. Mendeley picked that DOI up automatically from the paper. If I use that paper in a reference, the DOI will be included by Mendeley. This presumably also happens with EndNote and other reference managers. (Here’s me inserting a citation for (Shotton, Portwin, Klyne, & Miles, 2009) from EndNote… yes, there it is, down the bottom with a big fat DOI in it.) (This is part of my circumstantial evidence for Value Reduction by Publishers! We give them DOIs, they take them away.)

Anyway, looking at that Nature editorial, there are no DOIs in the reference list. Reference 7 is:

7. Campanario, J. M. J. Am. Soc. Inf. Sci. Technol. 62, 230–235 (2011).

I tried copy/pasting that into Google. I get two results, neither of which appears to be a JASIST article. OK let’s try this one, in a completely different field, from an Elsevier journal:

McCabe, M. J., Snyder, C. M., & Fagin, A. (2013). Open Access versus Traditional Journal Pricing: Using a Simple “Platform Market” Model to Understand Which Will Win (and Which Should). The Journal of Academic Librarianship, 39(1), 11–19. doi:10.1016/j.acalib.2012.11.035

Again, none of the referenced articles have DOIs included in the reference list. Here’s a recent reference:

Jeon, D.-S., & Rochet, J.-C. (2010). The pricing of academic journals: A two-sided market perspective. American Economic Journal: Microeconomics, 2, 222–255.

Maybe that article (and all of the others) doesn’t have a DOI? Same trick with Google, we don’t get there straight away, we get to another search, for articles with the word “perspective” in that journal… which does get us to the right place. And yes, the article does have a DOI (10.1257/mic.2.2.222). Let’s try this article; surely Nucleic Acids Research is one of the good guys?

Fernández-Suárez, X. M., & Galperin, M. Y. (2013). The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic acids research, 41(D1), D1–7. doi:10.1093/nar/gks1297

No DOIs in the reference list. Here’s an odd one, from Nature again:

Piwowar, H. A. (2013). Value all research products. Nature, 493(7431), 159. doi:10.1038/493159a

Here they include no DOIs for actual articles, but there are URL-DOIs for Figshare! The first two references are:

1. Priem, J., Costello, K. & Dzuba, T. Figshare http://dx.doi.org/10.6084/m9.figshare.104629 (2012).

2. Fausto, S. et al. PLoS ONE 7, e50109 (2012).

Do the latest OA publishers do any better? Sadly, IJDC appears not to show DOIs in references. I couldn’t see any in references in the most recent PLoS ONE article I looked at (Grieneisen and Zhang, 2012). Nor Carroll (2011) in PLoS Biology. But yes, definitely some DOIs in references in Lister, Datta et al (2010) in PLoS Computational Biology.

What about the newest kid on the block? You know, the cheap publisher who’s going to lead to the downfall of the scholarly world as we know it? Yes! The wonderful article by Taylor and Wedel (2013) in PeerJ has references liberally bestowed with DOIs!

When I tweeted my outrage about this situation, someone suggested it’s just the publishers simply following the style guides. WTF?

Publishers! You want us to believe you are adding value to our articles? Then use the Digital Object Identifier system. Keep the DOIs we give you, and add the DOIs we don’t!

PS At one stage in preparing for this post I tried copying reference lists from PDFs and pasting them into Word. You should try it some time. It’s an absolute disaster, in many cases! Which is NOT the fault of PDF, it is the fault of the system used to create the PDF… ie the Publisher’s system. Added Value again?

PPS: here’s that reference inserted by EndNote:

Shotton, D., Portwin, K., Klyne, G., & Miles, A. (2009). Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article. PLoS Comput Biol, 5(4), e1000361. http://dx.doi.org/10.1371%2Fjournal.pcbi.1000361

EDIT: As the comments below suggest, my post is generally true insofar as PDF versions of articles are concerned, although even there some publishers (eg BioMedCentral) do incorporate a hidden clickable link behind the reference (in BMC’s case to PubMed rather than the DOI). Several publishers have MUCH better behaviours in their HTML versions, with both explicitly visible DOIs and clickable versions of references. Sadly, HTML has no agreed container format, and is next to useless for storing articles for later reference, so it is most likely that the articles you store and use on your computer will be the sort of stunted PDFs I describe here. I still claim: this is not good enough.

Changes to publisher agreements #PDFTRIBUTE

14 Feb

This is a bit late, as I couldn’t find the relevant agreement at the time just after Aaron Swartz’s death when the #PDFTRIBUTE movement started. But I was taken by that proposal, that in memory of Aaron Swartz we should all try to liberate documents. The easiest and safest way to do that is to liberate our own documents. One way of achieving that is to deposit them in institutional repositories, or other places like Figshare. Safest of all is to liberate the documents you are publishing now or in the future.

Of course, many publishers don’t want you to do that, and they try to make you sign documents with various names like “Consent to Publish” that transfer your copyright to them. Sometimes they allow you to retain certain rights, sometimes including the ability to deposit a copy. But sometimes the document they ask you to sign doesn’t allow that.

My first suggestion is: read those documents carefully! They essentially take away all the rights to your work. My second suggestion: if you sign one of these documents, keep a copy. You may need to know later what you have signed!

I made a resolution many years ago only to publish in Open Access publications. This was easier for me than for many academics, as my job did not depend on publication in the same way. However, a few years ago I was asked to contribute a chapter to a book that was being compiled as a Festschrift for an ex-boss, a person I much admired. So I agreed.

The publisher sent me a Consent to Publish form, via email. It was a 2 page PDF with some moderately dense legalese on it. There were terms that I didn’t like such as “The copyright to the Contribution identified above is transferred to the Publisher…” and “The copyright transfer covers the exclusive right to reproduce and distribute the Contribution, including…” lots of stuff. Not good. So I had a chat with another ex-colleague [EDIT: Charles Oppenheim] who is a bit of a copyright expert (but not a lawyer, and I hasten to add, neither am I). Between us we came up with a few changes. I tried to edit the PDF without success, as it was locked. So in the end I printed it out, struck out clauses I didn’t like, wrote in by hand clauses I did like, initialled them, signed the amended document and sent it off. No response from publisher, contribution accepted, then after the book was published I uploaded the chapter to the local IR. I kept a copy of the signed document.

So what did I change? The first clause quoted above was changed to “The exclusive right to publish the Contribution identified above in book form is granted to the Publisher…”. The second clause above was changed to “The right granted covers the exclusive right to reproduce and distribute the Contribution in book form, including…” (the rest of that sentence continued unchanged “… reprints, translation, photographic reproductions, microform, electronic form (offline, online), or any other reproductions of similar nature…”).

Further down there was a sentence “The Author retains the right to publish the Contribution in any collection consisting solely of the Author’s own works without charge and subject only to notifying the Publisher in writing  prior to such publication of the intent to do so and to ensuring that the publication by the Publisher is properly credited and that the relevant copyright notice is repeated verbatim.” I suspect this clause was redundant as I had retained far wider rights than that (all rights except those transferred), but since I specifically wanted to put it into the IR with my other works, I changed “… Author’s own works without charge…” to “… Author’s own works or a collection of works of the Author’s institution without charge…”.

I also added at the end “The Author asserts his moral right to be identified as the author of the Contribution”.

I’ve no idea if this is useful to anyone, but I offer it in case it might be. As noted, I am not a lawyer and this is not legal advice. Your mileage may vary, as they used to say in internet mailing lists when I was young, and your licence terms almost certainly will vary. But there’s no harm in trying to get the best deal you can, and amending and signing the proposed terms is one way. I reckon it’s better than asking. Send it in; they probably won’t even read it themselves. But keep a copy in case!

More back from Microsoft

28 Nov

Following my posting of the initial response to my Open Letter, Jim Thatcher wrote back to me:

“Thanks Chris. My team will be working on this issue to try to come up with a more concrete path forward. For now, if any of your readers have specific needs I would encourage them to work with third-party vendors (as you did) to convert the obsolete documents. Archival organizations with long-term structural needs that can’t be addressed by a one-shot conversion project can contact me directly at jim.thatcher(at)Microsoft.com.

Please keep me apprised of the community’s suggestions for crowd-sourcing those document formats.

Regards,
Jim Thatcher
Principal Program Manager
Office Standards and Interoperability”

I’m currently struggling to find any person or organisation willing to take some leadership in that response; this is moving beyond what I can achieve in a back bedroom! So BL, TNA, LoC, NARA, NLA, NLNZ, OCLC, JISC, Knowledge Exchange, EC, even CNI and DPC (and others), I’m looking at you!

Response to the Open Letter on obsolete Microsoft file formats

26 Nov

You may remember the Open Letter I sent to Tony Hey of Microsoft and published a few weeks back (https://unsustainableideas.wordpress.com/2012/10/22/open-letter-ms-obsolete-formats/). Well I’m pleased to say that Tony has responded. I’ve included the full text of his response below.

“Chris

I have a reply from Jim Thatcher in the Office team:

1)      We do not currently have specifications for these older file formats.

2)      It is likely that those employees who had significant knowledge of these formats are no longer with Microsoft.

3)      We can look into creating new licensing options including virtual machine images of older operating systems and old Office software images licensed for the sole purpose of rendering and/or converting legacy files.

4)      One approach we could consider is for Microsoft to participate in a “crowd source” project working with archivists to create a public spec of these old file formats.

I think it would be sensible for you to talk directly with Jim – and Natasa in the UK – to see if there are some creative options that Microsoft could pursue with the archivist community.

OK?”

Now it’s worth pointing out that this is a response from a couple of individuals to my Open Letter, not a formal commitment by Microsoft. But it’s a good start and we need to work to make more good come from it.

The first two points are more or less as expected, and as suggested by several commenters (some picked out in my blog post of selected comments https://unsustainableideas.wordpress.com/2012/10/31/comments-on-open-letter-ms/).

The later points are very welcome, and I would be very happy to see them taken forward.

Is there any appropriate group to work with Microsoft on appropriate licence terms for older software to render or migrate legacy files?

Is there an appropriate group to coordinate crowd-sourcing interaction with Microsoft? I can see at least 4 approaches:

a) Create a set of sample files from all obsolete versions, available under CC0, eg the CURATEcamp 24-hour worldwide file ID hackathon (#fileidhack) public format corpus.

b) Work on a complete set of format identifying signatures, so that an unknown file can be properly identified.

c) Work as suggested on the specs of some of these older formats. Based on a quick look at an old MS Word file, some of these older formats are not that complicated.

d) Work to include these formats in Open Source Office suites, so we can migrate files into the future at no cost to Microsoft.

All of these would need a little leadership so that Microsoft didn’t get bogged down with interaction costs. Some could take place without funding and with little more than leadership (as in File Formats November or #fileidhack, for example). Some might need more resources and a bit of funding.

I think Microsoft has returned service. What next, folks?