My digital preservation utopia

6th December 2011

At Build last month, Jeremy Keith gave a presentation about preserving our websites, documents, and personal timelines. He talked about avoiding data loss, and shared his fears for the future. I really soaked up his thoughts, even though I had to get up and speak directly after him.

Anyhow, it’s an important topic that I’ve often lazily considered. I share a lot about myself on the web, and even archive details about my family history in certain places. I want to feel that none of this is pointless, and that it’ll have some legacy. If I eventually have offspring, I’d like them to have a record of everything, and hope they’d add to it, pass it down the generations, and keep this personal history intact.

So, long story short. When out and about with web folks, a few pints down the line, I often share my idea for some sort of centralised super data archive we’d all use to preserve our data for generations to come.

Now, I’m somewhat naive when it comes to complex storage systems, encryption methods, security and all that jazz, so go easy on me. Also, this is probably more worthy of a tweet than a post, but 140 characters isn’t enough. I’ve written this at 100mph without much care, so I apologise for the tone and anticipated errors.

My utopian unachievable idea

In my utopian misguided mind, I imagine the following possibly flawed scenario:

Over the next few years, all the services we love (Twitter, Flickr, Foursquare, etc.) make sure they follow Cameron’s Orbital Content model, allowing us to easily export all our data in a raw format such as XML, with accompanying folders of raw assets such as photos. We could then take that data to any other site as services come and go, or inevitably get bought by Facebook. This first point might actually happen.

Now the dream. The government (or some other organisation we can supposedly trust) builds a massive Act-Of-God-proof data centre in remote Northumberland or somewhere like that.

Each year, perhaps on a set date or National Export Day or whatever, we each download our raw data from all our services, back up our own sites, photos, important documents and stuff, and update something I imagine we’d call our Super Zips. It’d be like doing an annual tax return, but for ourselves.
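As a rough sketch of that annual bundling step, assuming the exports from each service have already been downloaded into one folder: the function below zips everything up and adds a checksum manifest so a future reader can tell whether any file has rotted. The names `build_super_zip` and `MANIFEST.json` are made up for illustration and are not part of any real tool.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def build_super_zip(export_dir: Path, zip_path: Path) -> dict:
    """Bundle every file under export_dir into one zip with a checksum manifest.

    Assumption: zip_path lives outside export_dir, so the archive never
    tries to include itself.
    """
    manifest = {}
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(export_dir.rglob("*")):
            if not path.is_file():
                continue
            relative = path.relative_to(export_dir).as_posix()
            # A SHA-256 digest per file lets later generations spot corruption.
            manifest[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
            archive.write(path, arcname=relative)
        # The manifest travels inside the archive (hypothetical file name).
        archive.writestr("MANIFEST.json", json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Calling `build_super_zip(Path("exports"), Path("super-zip-2011.zip"))` would give one file per year to hand over, with the manifest doubling as a simple table of contents.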

We’d then use some magic tool to encrypt our Super Zip folders if we’re security conscious. We upload these to the government’s (or whoever’s) Act-Of-God-proof data centre in the background, Backblaze-style.

For this to be useful, our raw data might need to be converted or refactored every decade or so, should we fall out with XML, JSON, HTML or some other unexpected language madness. This refactoring would be a decision we made each year prior to submitting our Super Zips.
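To make the refactoring idea a little more concrete, here is a minimal sketch of one such conversion, assuming the export is well-formed XML and the format of the day happens to be JSON. The dictionary layout (`tag`, `attributes`, `text`, `children`) is an arbitrary choice for illustration, not any kind of standard.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element: ET.Element) -> dict:
    """Turn one XML element into a plain dict, recursively.

    Simplification: text that trails a child element (its "tail") is ignored.
    """
    return {
        "tag": element.tag,
        "attributes": dict(element.attrib),
        "text": (element.text or "").strip(),
        "children": [element_to_dict(child) for child in element],
    }


def xml_to_json(xml_text: str) -> str:
    """Parse an XML string and return the same tree as formatted JSON."""
    root = ET.fromstring(xml_text)
    return json.dumps(element_to_dict(root), indent=2, ensure_ascii=False)
```

A real migration would need to be lossless and checked against the original, which is exactly the kind of decision that would have to be made before each year's submission.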

Each of us has two key masters that we choose from our families or friends and assign each year when we send our data. These key masters might change if people die or we divorce them or whatever, but essentially these people need to sync up to sign a release for our Super Zips should we die, go missing, get abducted by aliens, or go work at Facebook.

The government (or whoever holds our data) release the Super Zip to the key masters so they have a complete (or at least, no older than 12 months) copy of everything we wanted to pass down the line. Our key masters know our secret code so they can decrypt our data.
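One way the "two key masters must sync up" rule could be enforced is to give each of them only half of the secret code. The sketch below is a variation on the idea above rather than a description of it: it uses a simple XOR split, where either share alone is just random noise and both are needed to rebuild the secret. The function names are hypothetical.

```python
import secrets


def split_secret(secret: bytes) -> tuple[bytes, bytes]:
    """Split a secret into two shares; both are required to recover it."""
    # The first share is pure randomness of the same length as the secret.
    share_a = secrets.token_bytes(len(secret))
    # The second share is the secret XORed with the first share.
    share_b = bytes(x ^ y for x, y in zip(secret, share_a))
    return share_a, share_b


def combine_shares(share_a: bytes, share_b: bytes) -> bytes:
    """XOR the two shares back together to recover the original secret."""
    if len(share_a) != len(share_b):
        raise ValueError("shares must be the same length")
    return bytes(x ^ y for x, y in zip(share_a, share_b))
```

The obvious catch is that if one key master loses their share, the secret is gone for good, so swapping key masters would mean splitting the secret afresh.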

Now, another dream. Right now we view our photos, our checkins, our articles and so on in a certain way, in certain frameworks, designed in certain ways. Over time, tastes will change, platforms and operating systems will come and go. So, hopefully our Super Zips of raw data will be plugged in or uploaded and be interpreted by the sites or tools of the day and display our articles, photos, and other stuff in a manner that future generations will appreciate.

Perhaps the XML of today would power some sort of augmented reality Minority Report headfuck in 2081, or be plugged directly into our Great Great Great Great Great Great Great Grandchildren’s brains and turned into an interactive maze or something. Like all of the above suggestions, I have absolutely no idea what I’m talking about.

So, there you are. I expect you are laughing at me.

I know this is flawed

I understand that the government is not a fine custodian and that they’d probably close it down in 50 years, or they’d spend billions on the Information Technology and end up with it being run off WordPress or something. I know that JPG, PNG and other formats might not survive the millennium, which is why I suggest magic conversion tools at relevant periods in the future.

Most of all, I feel better for getting that out, but do appreciate that it’s probably ridiculous. I write it in hope that in decades or centuries people will look back and I’ll seem like some sort of Nostradamus for the digital age and be posthumously offered a knighthood which I hope my family would refuse on principle (the Empire and all that).

Then again, chances are that I’ll forget to renew my domain name in a few years or get hacked, lose my site and all my articles, and like my entire online history, this sooth that I say will be lost forever, just like all the stuff you put online today.


# Remy Sharp responded on 6th December 2011 with...

I’ve given this some thought myself, also down the pub - probably up in arms about how we can save the sorry state we’ll soon be in.

Anyway, here was my idea, as crazy as it might sound:

An Arduino box sits on your home network that’s able to reach the web and has a big fuckoff sized hard drive (being Arduino means you can load it as you please). You configure it to monitor specific services - Flickr, Twitter, BookFace, etc.

The box quietly each day (or week or however often it can) pulls the latest data from your favourite sites.
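A minimal sketch of that daily pull, written in Python rather than anything Arduino-specific. It assumes each service is wrapped in a small function that returns JSON-friendly data; those wrappers are entirely hypothetical here, as no real service API is shown.

```python
import json
from datetime import date
from pathlib import Path
from typing import Callable, Dict, List


def snapshot_services(
    fetchers: Dict[str, Callable[[], object]],
    archive_dir: Path,
    today: date,
) -> List[Path]:
    """Call each fetcher and save its result as <service>/<date>.json."""
    written = []
    for name, fetch in sorted(fetchers.items()):
        try:
            payload = fetch()
        except Exception as error:
            # One broken service should not stop the others being saved.
            print(f"skipping {name}: {error}")
            continue
        target = archive_dir / name / f"{today.isoformat()}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Run once a day from a scheduler, this would leave a dated trail per service, so yesterday's copy is never overwritten by today's.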

The result: a copy of your online presence sitting at home, waiting for your house to burn down - i.e. yeah, it would be nice if there was some kind of bolt-on service with the Arduino backup box that synced up to the cloud. I guess that’s what they call business.

Anyway - it means we hack the systems like Twitter that are locking away our data already (try getting the complete list of tweets from your first day on Twitter - yeah, ain’t happening - that data don’t belong to you no more) - and it also means we’re not reliant on anti-act-of-god, government-type-jibbly companies. I.e. we take matters into our own hands.

There are likely holes in that idea. If there aren’t - come find me, I’m sure I could help build it! :)

# Matthew Pennell responded on 6th December 2011 with...

I’m part-way through building what is essentially the first part of this idea - a lifestream application that not only aggregates and displays all your data from the various services to which you share, but also takes a snapshot of the underlying data and stores it locally so that your digital progress can be recreated and displayed regardless of the status of the original service. I’ve got a bit stuck with controlling the cronjob frequency though, and hadn’t considered the need to download and store images and other media locally as well…
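One possible way round the cronjob frequency problem, offered as a sketch and not as how the application described above actually works: run a single job often, and let it decide which services are due based on a per-service interval. The service names and intervals in the example are invented.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def services_due(
    intervals: Dict[str, timedelta],
    last_run: Dict[str, datetime],
    now: datetime,
) -> List[str]:
    """Return the services whose interval has elapsed since their last snapshot."""
    due = []
    for name, interval in intervals.items():
        previous = last_run.get(name)
        # A service that has never been snapshotted is always due.
        if previous is None or now - previous >= interval:
            due.append(name)
    return sorted(due)
```

With this shape, a busy service could be checked hourly and a quiet one daily without needing a separate cron entry for each.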

# Kris Coverdale responded on 7th December 2011 with...

I think this is a fascinating topic, especially for meaningful artefacts such as photos, essays, etc. I do find myself wondering at our obsession with storing all of our online actions though, e.g. having access to every tweet we’ve ever made.

By the same token we ought to be as data obsessive about our offline life too - should we record every interesting pub conversation we have, for instance? I had a great chat with Bob 10 years ago about the dot com bubble bursting and what next - if only I could access it, but it’s lost forever except in my and Bob’s imperfect recollections.

The problem with storing everything is also that of volume - how do you find the real good stuff from the vast swathes of experiences we have, tens of thousands of tweets, thousands of forum posts, photos, products bought and sold, etc.  Either we need great search, take the time to curate (or have AI intelligent enough to curate for us, or use crowd sourcing e.g. most favourited / retweeted are more valuable), or we need to just start to let go a bit more…
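The crowd-sourced option is the easiest of these to picture in code. A minimal sketch, assuming each post is a dictionary carrying `favourites` and `retweets` counts (field names invented for illustration, not taken from any service's export format):

```python
from typing import Dict, List


def top_posts(posts: List[Dict], keep: int = 3) -> List[Dict]:
    """Return the `keep` posts with the highest favourites + retweets total."""

    def score(post: Dict) -> int:
        # Missing counts are treated as zero rather than as an error.
        return post.get("favourites", 0) + post.get("retweets", 0)

    return sorted(posts, key=score, reverse=True)[:keep]
```

Popularity is only a proxy for value, of course, so a filter like this would narrow the pile rather than replace curating by hand.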

Still forming an opinion, but I think for the vast majority of small talk passed over the internet I’m happy to just let it go, and try to curate out the stuff that I really want to keep (great photos for instance) and relentlessly try to defend / backup those to the exclusion of all else.


A good starting point from a practical point of view is to read this FAQ on digital wills.

# Colin Burn-Murdoch responded on 12th December 2011 with...

See ThinkUp - it does the job of archiving your content on Facebook & Twitter (and Google+)... with a few more plugins it might do the job.

Personally I’d rather have it hosted on a commercial account that I control than a government-sponsored one. It’s reasonably safe to assume I’ll have time to move the content if necessary (and a further backup of the database doesn’t go amiss).

