Digital preservation
For now, this will be something of a backup of my Quora blog on the topic, which is no longer maintained. (I’m trying to move content from here into more fitting pages.)
Thoughts on storing information in a useful, easily accessible way: archiving, backups, single-source publishing, source code management, redundancy (local, cloud), link rot, etc. Essentially, how can we best store thoughts so that later (a day? a month? years?) we can easily find them again?
Requirements for good data archiving solutions
Some ideas for now:
- Pandoc-flavored markdown
- LaTeX, then use LaTeX2WP or LaTeX–HTML converters (the benefit of this is that printing is covered, as long as your LaTeX skills are decent)
- raw HTML (no)
- using different formats for different content may be possible, e.g. using markdown for most content, but switching to LaTeX for math-heavy content.
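As a rough illustration of the single-source idea in the list above, here is a sketch that builds Pandoc command lines turning one Markdown file into both HTML and LaTeX. It assumes the `pandoc` executable is installed and on the PATH; the file name `notes.md` and the helper names are hypothetical, not part of any existing setup.

```python
import subprocess
from pathlib import Path

# Pandoc writer name -> file suffix for each output we want from one source.
TARGETS = {"html": ".html", "latex": ".tex"}


def pandoc_commands(source: str) -> list[list[str]]:
    """Build one pandoc invocation per target format for a Markdown source."""
    src = Path(source)
    return [
        ["pandoc", "--standalone", "--from", "markdown", "--to", fmt,
         "--output", str(src.with_suffix(suffix)), str(src)]
        for fmt, suffix in TARGETS.items()
    ]


def build_all(source: str) -> None:
    """Run every command; raises CalledProcessError if pandoc fails."""
    for command in pandoc_commands(source):
        subprocess.run(command, check=True)
```

Calling `build_all("notes.md")` would write `notes.html` and `notes.tex` next to the source, so the Markdown file stays the only thing that needs editing.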
Static page generation
- Static Site Generators —complete(?) list
- Static site generators
- Jekyll • Simple, blog-aware, static sites
- Hakyll - Home
- Switching from Jekyll to Hakyll
- The Switch to Hakyll
The best Jekyll documentation seems to be the video available from here: algonquindesign/jekyll. Okay, so I’ve finally figured out how to use Jekyll in conjunction with GitHub. My test website is up on Hello world. See riceissa/abc for the source. The trickiest part was modifying all of Jekyll’s default internal links (e.g. automatically generated links to blog posts) to use {{site.baseurl}}, so that the site worked both locally and through GitHub Pages.
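To make the base URL issue concrete, here is a small sketch of what prefixing every internal link with the base URL amounts to. This is not Jekyll code; `with_baseurl` is a hypothetical helper, and the value `/abc` is an assumption based on the repository name, since a project site is typically served under a subpath while a local preview may be served from the root.

```python
def with_baseurl(baseurl: str, path: str) -> str:
    """Join a site base URL and a site-relative path with exactly one slash."""
    return baseurl.rstrip("/") + "/" + path.lstrip("/")


# Local preview with an empty base URL: the link stays root-relative.
local_link = with_baseurl("", "/2013/01/01/hello.html")

# Hosted under a subpath: the same link gains the prefix.
hosted_link = with_baseurl("/abc", "/2013/01/01/hello.html")
```

A link hard-coded as `/2013/01/01/hello.html` only works in the first case, which is why every generated link needs the prefix.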
I’m still trying to figure out how to do math properly through Jekyll, because at the moment I must use two backslashes, since one gets eaten up by Markdown. (Pandoc can prevent this, so either switch to Hakyll or use a plugin for Jekyll? Does GitHub even allow plugins for Jekyll?)
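One possible stopgap is to keep single backslashes in the source and double them in a preprocessing step. The sketch below does this only inside `$$...$$` spans. The delimiter choice and the function name are assumptions, and doubling every backslash is only correct if the Markdown converter really consumes one backslash from each pair, which varies between converters.

```python
import re

# Non-greedy match of display-math spans, allowing line breaks inside them.
MATH_SPAN = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def double_backslashes_in_math(text: str) -> str:
    """Double each backslash inside $$...$$ spans; leave other text alone."""
    def fix(match: re.Match) -> str:
        body = match.group(1).replace("\\", "\\\\")
        return "$$" + body + "$$"
    return MATH_SPAN.sub(fix, text)
```

The function is not idempotent, so it should run once on the original source rather than on already-processed files.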
More on link rot and the Internet Archive (EDIT: found the speech!!)
Ones with a star (*) are more useful.
- Why URL shorteners suck
- URL shorteners suck less, thanks to the Internet Archive and 301Works
- URLTeam - Archiveteam
- URLTE.AM
- Learn From History: Why The Internet Archive Is So Important
- Brewster’s trillions: Internet Archive strives to keep web history alive
- on url shorteners
- 301works-faq : Free Download & Streaming : Internet Archive
I’m still trying to find that speech about link rot and URL shorteners.
EDIT: I found the speech: The Splendiferous Story of Archive Team. The guy’s name is Jason Scott Sadofsky, and he seems to have a lot of stuff, e.g. T E X T F I L E S. He actually seems to be part of the Archiveteam, which makes him even cooler. (How did I find the speech again after two days of trying? I’m noting this so that I can get better at finding old things. It was through this article: This Group Is Saving The Web From Itself (And Rescuing Your Stuff), where I got his name. My search terms on Google to find that article were “saving the internet archive”; after I recognized the name, I just Googled “jason scott” and found his Wikipedia page and website, whose design I remembered from when I read the speech before.)
The relevant bit on URL shorteners is:
URL shorteners may be one of the worst ideas, one of the most backward ideas, to come out of the last five years. In very recent times, per-site shorteners, where a website registers a smaller version of its hostname and provides a single small link for a more complicated piece of content within it.. those are fine. But these general-purpose URL shorteners, with their shady or fragile setups and utter dependence upon them, well. If we lose TinyURL or bit.ly, millions of weblogs, essays, and non-archived tweets lose their meaning. Instantly.
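One way to reduce this dependence in your own writing is to replace shortened links with their targets before archiving a text. The sketch below assumes the shortener answers with ordinary HTTP redirects; the host list is taken from the two services named in the quote, and the function names are hypothetical. The resolver is passed in as a parameter so that the rewriting step can be tried without network access.

```python
import re
import urllib.request
from typing import Callable

# Shortener hosts to look for (the two named in the quote above).
SHORTENER_HOSTS = ("tinyurl.com", "bit.ly")
SHORT_URL = re.compile(
    r"https?://(?:%s)/[\w-]+" % "|".join(re.escape(h) for h in SHORTENER_HOSTS)
)


def resolve_online(short_url: str) -> str:
    """Follow redirects and return the final URL (needs network access)."""
    with urllib.request.urlopen(short_url, timeout=10) as response:
        return response.geturl()


def expand_short_links(text: str,
                       resolve: Callable[[str], str] = resolve_online) -> str:
    """Replace each short link in text with the URL it resolves to."""
    def replace(match: re.Match) -> str:
        short = match.group(0)
        try:
            return resolve(short)
        except OSError:
            # Network failure or dead shortener: keep the original link.
            return short
    return SHORT_URL.sub(replace, text)
```

Running this over a draft while the shortener is still alive keeps the real destination in the archived copy.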
Terminally Incoherent
from Luke.
- Personal Backups
- Putting your Vim files under version control
- You Need Backups
- Backing up your work: Common Sense 101
- Maybe the Backup Problem Will Resolve Itself In Time
- Backup is not just for geeks
The message is clear.
Also, an example of backing up dotfiles: maciakl/.dotfiles. I remember reading a post (by Luke) about how to set this up nicely, but I can’t seem to locate it at the moment.
Gwern on archiving
Archiving URLs; see also http://www.gwern.net/About#long-content. I can’t believe I had forgotten about this! Also, linked on the first page is this: http://www.gwern.net/docs/2011-muflax-backup.pdf (from muflax).
URL shortening
Where is the speech that someone at the Internet Archive (Digital Library of Free Books, Movies, Music & Wayback Machine) made about how URL shortening sites cause a lot of link rot?
EDIT: Okay, I found this at least: http://www.archiveteam.org/index.php?title=URLTeam, which links to Why URL shorteners are bad. Not bad.
EDIT2: Okay, I finally found it. See More on link rot and the Internet Archive (EDIT: found the speech!!).
What do people do about archiving information that they don’t own?
For example, a journal article that you didn’t write. This sort of thing could be very useful for people to have access to. Storing them locally wouldn’t be a problem, and putting them up online shouldn’t cause too much trouble either. Websites like http://archive.org always just archive whatever they can find anyway.
Actually, how does the Internet Archive (Digital Library of Free Books, Movies, Music & Wayback Machine) deal with copyright issues?
Git/GitHub
See What are some unconventional and unique uses of Git? and also the article linked in the description of the question (http://www.wired.com/wiredenterprise/2012/02/github-revisited/).
Git in particular is excellent in terms of having a local mirror as well as one online (e.g. by using GitHub or Bitbucket), along with the full history of changes that have been made. This makes the data redundant and safe. It doesn’t seem to be bad in terms of sharing either, since everything will be in plaintext. Posting on Quora might be a problem though, since it doesn’t use Markdown or the like.
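The redundancy point can be pushed a bit further by sending the same history to more than one remote. The sketch below only builds and runs standard `git push` commands. It assumes `git` is installed and that the remotes already exist in the repository; the remote names `github` and `bitbucket` are hypothetical.

```python
import subprocess


def push_commands(remotes: list[str]) -> list[list[str]]:
    """Build the commands that push all branches and all tags to each remote."""
    commands = []
    for remote in remotes:
        commands.append(["git", "push", "--all", remote])
        commands.append(["git", "push", "--tags", remote])
    return commands


def push_everywhere(repo_dir: str, remotes: list[str]) -> None:
    """Run the pushes inside repo_dir; stops at the first failing command."""
    for command in push_commands(remotes):
        subprocess.run(command, cwd=repo_dir, check=True)
```

For example, `push_everywhere(".", ["github", "bitbucket"])` would leave three copies of the history: the local one and one on each host.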
I suppose the other big problem with using source control is the question of storing binary data. Use a different place to store those (e.g. putting all plaintext on GitHub but uploading photos to WordPress, and linking them)? Or deal with binaries in Git as well, and make many small projects so that they don’t slow down Git?
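If binaries are to live somewhere other than the repository, the first step is telling the two kinds of file apart. Here is a minimal sketch using a common heuristic: a file is treated as binary if a NUL byte appears near its start. The sample size and function names are my own choices, and the heuristic can misclassify some files (UTF-16 text, for instance).

```python
from pathlib import Path


def looks_binary(path: Path, sample_size: int = 8000) -> bool:
    """Heuristic: a NUL byte in the first sample_size bytes means binary."""
    with open(path, "rb") as handle:
        return b"\0" in handle.read(sample_size)


def split_text_and_binary(root: str) -> tuple[list[Path], list[Path]]:
    """Walk root and return (text_files, binary_files), each sorted by path."""
    text_files, binary_files = [], []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            (binary_files if looks_binary(path) else text_files).append(path)
    return text_files, binary_files
```

The text list could then go into version control while the binary list is uploaded elsewhere and linked.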
Random questions (not Quora-quality yet)
Some of these have probably already been asked, so.
- Is single-sourcing worth one’s time for most things? For some things?
- How important is grammar/capitalization for recording most thoughts? (cf. AKC’s post [on WordPress?] about grammar not being important)
- Is Evernote reliable?
- How do I easily create multiple mirrors of my data, both locally and online?
- How do I reliably back up online data?
- How do I keep a revision list of online data? Version control it somehow? How?
- Why is Quora’s search feature unreliable at finding the information I want? E.g. it can’t find comments.
- Why does Quora make everything so hard to find? E.g. there are no chronological lists of all posts of a user, no place to find all the posts I’ve upvoted, and so on.
- How do I make sharing easier between Quora and other sites?
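On the local half of the mirroring question above, one simple approach is to copy each file into several directories and verify every copy by checksum. This is only a sketch under my own assumptions: the function names are hypothetical, and the mirror directories could be anything mounted on the machine, such as an external drive or a folder synced to an online service.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror_file(source, mirror_dirs) -> list[Path]:
    """Copy source into every mirror directory and verify each copy."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in mirror_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if sha256_of(target) != expected:
            raise OSError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

Re-running the checksum comparison later is a cheap way to notice a mirror that has silently gone bad.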
External mirroring
The way I see it, external page savers like the Internet Archive and WebCite shouldn’t be trusted; rather, they’re a convenient way to avoid copyright violations (since they’re hosting the files, not you); and they provide a time buffer for you to get local copies.
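To actually use that time buffer, a local copy has to be fetched at some point. The sketch below asks the Wayback Machine's availability endpoint for the closest snapshot of a URL and downloads it. The endpoint and the JSON field names reflect my understanding of that API and should be checked against its documentation; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str) -> str:
    """Build the query URL asking whether a page has an archived snapshot."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": url})


def closest_snapshot(response_text: str) -> Optional[str]:
    """Extract the closest snapshot URL from the JSON reply, if there is one."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def download_snapshot(url: str, destination: str) -> bool:
    """Save the closest snapshot locally; returns False if none exists."""
    with urllib.request.urlopen(availability_query(url), timeout=30) as reply:
        snapshot = closest_snapshot(reply.read().decode("utf-8"))
    if snapshot is None:
        return False
    urllib.request.urlretrieve(snapshot, destination)
    return True
```

Only `download_snapshot` touches the network; the other two functions can be tried offline with a saved reply.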
I don’t think I should be surprised at all by this, but I was quite impressed with Perma.cc’s contingency plan, which seems like an obvious improvement over standard web services that simply shut down and give no notice.[1]
[1] In the case of Quora, private blogs were almost immediately disabled and deleted after the announcement (though archives were emailed out to owners); Google Reader gave under three months to back up data.
The content on this page is in the public domain.