Digital preservation

Created: 2015-01-01

Contents

For now, this page is somewhat of a backup of my Quora blog on the topic, which is no longer maintained. (I’m trying to integrate content from here into more fitting pages.)

Thoughts on storing information in a useful, easily accessible way: archiving, backups, single-source publishing, source code management, redundancy (local, cloud), link rot, etc. Essentially, how can we best store thoughts so that later (a day? a month? years from now?) we can easily find them again?

Requirements for good data archiving solutions

Some ideas for now:

Static page generation

The best Jekyll documentation seems to be the video available from algonquindesign/jekyll. Okay, so I’ve finally figured out how to use Jekyll in conjunction with Github. My test website is up at Hello world; see riceissa/abc for the source. The trickiest part was modifying all of Jekyll’s default internal links (e.g. automatically generated links to blog posts) to use {{site.baseurl}}, so that they work both locally and through Github Pages.
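For reference, the baseurl fix looks something like this in a Liquid template (a minimal sketch; the exact files and variable names depend on the theme):

```liquid
<!-- Prefix internal links with site.baseurl so they resolve both
     locally and under a Github Pages project path -->
<a href="{{ site.baseurl }}{{ post.url }}">{{ post.title }}</a>
<link rel="stylesheet" href="{{ site.baseurl }}/css/main.css">
```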

I’m still trying to figure out how to typeset math properly through Jekyll. At the moment I must use two backslashes, since one gets eaten up by Markdown. (Pandoc can prevent this, so either switch to Hakyll or use a plugin for Jekyll? Does Github Pages even allow Jekyll plugins?)
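Concretely, the double-backslash workaround looks something like this (assuming MathJax with `\( … \)` delimiters; the details depend on the Markdown processor):

```markdown
To get \( e^{i\pi} = -1 \) on the rendered page, the Markdown source
has to say \\( e^{i\pi} = -1 \\), because the Markdown processor
consumes one level of backslashes before MathJax ever sees the text.
```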


I’m still trying to find that speech about link rot and URL shorteners.

EDIT: I found the speech: The Splendiferous Story of Archive Team. The speaker is Jason Scott (Sadofsky), who seems to have a lot of stuff, e.g. T E X T F I L E S. He actually seems to be part of Archive Team, which makes him even cooler. (How did I find the speech again after two days of trying? [Recording this so that I can get better at finding old things.] Through this article: This Group Is Saving The Web From Itself (And Rescuing Your Stuff), where I got his name. My search terms on Google to find that article were “saving the internet archive”; after I recognized the name, I just Googled “jason scott” and found his Wikipedia page and website, whose design I remembered from reading the speech before.)

The relevant bit on URL shorteners is:

URL shorteners may be one of the worst ideas, one of the most backward ideas, to come out of the last five years. In very recent times, per-site shorteners, where a website registers a smaller version of its hostname and provides a single small link for a more complicated piece of content within it.. those are fine. But these general-purpose URL shorteners, with their shady or fragile setups and utter dependence upon them, well. If we lose TinyURL or bit.ly, millions of weblogs, essays, and non-archived tweets lose their meaning. Instantly.

Terminally Incoherent

from Luke.

The message is clear.

Also, an example of backing up dotfiles: maciakl/.dotfiles. I remember reading a post somewhere (by Luke) about how to set this up nicely, but I can’t seem to locate it at the moment.
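The usual dotfiles setup, as I understand it, is roughly the following sketch: keep the real files in a git repository and symlink them into the home directory. (File names here are hypothetical examples; a scratch directory stands in for `$HOME` so the demo is self-contained.)

```shell
#!/bin/sh
# Sketch of a dotfiles backup: the real files live in a git repository,
# and the home directory only holds symlinks pointing into it.
set -e

HOMEDIR="$(mktemp -d)"          # scratch directory standing in for $HOME
DOTFILES="$HOMEDIR/.dotfiles"

mkdir -p "$DOTFILES"
git init -q "$DOTFILES"

for f in .vimrc .bashrc; do
  echo "# settings for $f" > "$DOTFILES/$f"   # real file, tracked in the repo
  ln -sf "$DOTFILES/$f" "$HOMEDIR/$f"         # home directory gets a symlink
done

git -C "$DOTFILES" add .
git -C "$DOTFILES" -c user.email=me@example.com -c user.name=me \
  commit -qm "track dotfiles"
```

From here, pushing the repository to Github (as maciakl/.dotfiles does) gives an off-site copy of every configuration change.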

Gwern on archiving

See Archiving URLs; see also http://www.gwern.net/About#long-content. I can’t believe I had forgotten about this! Also, linked from the first page is http://www.gwern.net/docs/2011-muflax-backup.pdf (from muflax).

URL shortening

Where is the speech that someone at the Internet Archive made about how URL shortening sites cause a lot of link rot?

EDIT: Okay, I found this at least: http://www.archiveteam.org/index.php?title=URLTeam, which links to Why URL shorteners are bad. Not bad.

EDIT 2: Okay, finally found it. See More on link rot and the Internet Archive (EDIT: found the speech!!).

What do people do about archiving information that they don’t own?

For example, a journal article that you didn’t write. This sort of thing could be very useful for people to have access to. Storing such files locally wouldn’t be a problem, and putting them up online shouldn’t cause too much trouble either; websites like http://archive.org archive whatever they can find anyway.

Actually, how does the Internet Archive deal with copyright issues?

Git/GitHub

See What are some unconventional and unique uses of Git? and also the article linked in the description of the question (http://www.wired.com/wiredenterprise/2012/02/github-revisited/).

Git in particular is excellent in terms of having a local mirror as well as one online (e.g. on Github/Bitbucket), along with all of the changes that have been made. This makes the data redundant and safe. It doesn’t seem bad in terms of sharing either, since everything will be in plaintext. Posting on Quora might be a problem, though, since Quora doesn’t use Markdown or the like.
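The redundancy setup above can be sketched as follows, with a local bare repository standing in for the remote (Github/Bitbucket) mirror; all paths are hypothetical:

```shell
#!/bin/sh
# Local working copy + mirror: after the push, the notes and their full
# history exist in two independent places.
set -e

WORK="$(mktemp -d)"

git init -q --bare "$WORK/mirror.git"   # stands in for the remote mirror

git init -q "$WORK/notes"
cd "$WORK/notes"
echo "plaintext notes, with full history" > notes.md
git add notes.md
git -c user.email=me@example.com -c user.name=me commit -qm "first note"

# Push to the mirror; every later commit is one `git push` from being
# redundantly stored.
git remote add origin "$WORK/mirror.git"
git push -q origin HEAD
```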

I suppose the other big problem with using source control is the question of storing binary data. Use a different place to store those (e.g. put all plaintext on Github but upload photos to Wordpress and link to them)? Or deal with binaries in git as well, and make many small projects so that the binaries don’t slow git down?
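A sketch of the first option (binaries kept out of the text repository and copied to a separate store; the paths and the “media store” are hypothetical stand-ins, and tools like git-annex are another route entirely):

```shell
#!/bin/sh
# Keep the repository plaintext-only: ignore binary formats in git and
# copy them to a separate media store instead.
set -e

WORK="$(mktemp -d)"
REPO="$WORK/notes"
MEDIA="$WORK/media-store"       # stands in for Wordpress or similar

mkdir -p "$REPO" "$MEDIA"
git init -q "$REPO"
cd "$REPO"

# Binary formats never enter the repository's history.
printf '*.jpg\n*.png\n*.pdf\n' > .gitignore

echo "see photo at media-store/cat.jpg" > post.md
: > cat.jpg                     # stand-in binary file

git add .                       # cat.jpg is skipped because it is ignored
git -c user.email=me@example.com -c user.name=me commit -qm "plaintext only"

# The binary goes to the separate store and is linked from the text.
cp cat.jpg "$MEDIA/"
```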

Random questions (not Quora-quality yet)

Some of these have probably already been asked, so.

External mirroring

The way I see it, external page savers like the Internet Archive and WebCite shouldn’t be fully trusted; rather, they’re a convenient way to avoid copyright violations (since they’re hosting the files, not you), and they provide a time buffer for you to make local copies.
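The “make local copies” step can be as simple as the following sketch: fetch the page into a dated archive directory. (The function name and directory layout are my own inventions; `wget --mirror` is a heavier alternative that also grabs page requisites.)

```shell
#!/bin/sh
# archive_page URL OUTPUT-NAME
# Saves the page at URL into archive/<today's date>/OUTPUT-NAME and
# prints the path of the saved copy.
archive_page() {
  url="$1"
  name="$2"
  dir="archive/$(date +%Y-%m-%d)"
  mkdir -p "$dir"
  curl -sSL "$url" -o "$dir/$name"
  echo "$dir/$name"
}

# Usage (hypothetical URL):
# archive_page "http://example.com/essay.html" essay.html
```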

I don’t think I should be surprised at all by this, but I was quite impressed with Perma.cc’s contingency plan, which seems like an obvious improvement over standard web services that simply shut down and give no notice.1


  1. In the case of Quora, private blogs were disabled and deleted almost immediately after the announcement (though archives were emailed to owners); Google Reader gave under three months to back up data.


Tags: digital preservation.

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.