Archiving websites on Unix-like systems

Note: this page was originally written for the wikiHow article “How to Archive Websites on Unix Like Systems”. The writing of this page was sponsored by Vipul Naik.

Link rot (the phenomenon where old websites and pages disappear, making links useless) is a huge problem on the internet.1 For readers, it is disappointing when old websites no longer exist. For writers whose writing on the web is extensively hyperlinked, there is the problem of not being able to verify information when links die. Whatever the motivation, it is important to be able to archive information on the internet.

Steps

  1. Understand your motivations for archiving the site. It is important to know what you are trying to do in archiving the site. There are several reasons one might want to archive a website, and these influence your strategy:

    • You may just want an archive of the site to exist, without necessarily caring about who maintains it. If this is the case, refer to the step “Check if there are recent archives of the website.”
    • You may want to maintain a personal copy (i.e. have the archive locally). This might be the case if you can’t trust others to keep archives online (e.g. if the site content is copyrighted or illegal, then it may be subject to takedowns, in which case it is safer to have local copies).
    • You may want only the snapshot of the site at a certain point in time. This might be the case if you are writing something and want your citations to be durable; you can just take a snapshot of the site at the time when you used the site as a resource.
    • You may want to continually make snapshots of the site. This might be the case if you have an unseemly obsession with the site’s content or its owners.
    • You may want only parts of the site (e.g. the pages you cite).

    Note that the above points aren’t mutually exclusive (although some are contradictory, like wanting parts of a site versus the whole site); in other words, you may want someone to have an archive, you yourself to keep an archive, and for continual snapshots to be made, all at the same time.

  2. Consider contacting the site owners to obtain dumps. The site owners may oblige, for instance by giving you their backups of the site. Keep in mind, however, that this reveals your intentions to them, so they may react unfavorably (e.g. by password-protecting parts of the site).

  3. Check if there are recent archives of the website. It is important to remember that doing work isn’t inherently valuable; believing so would be to commit the make-work bias.2 Archiving large websites can often take a lot of time and effort. If someone else has already done the bulk of the work for you, then there is no need to feel compelled to repeat the task all over again.

    If someone else has made a recent backup of the site, then you might be done (if you just wanted someone else to have a copy) or close to being done (if you just wanted one local snapshot; your task reduces to that of mirroring their backup, which is likely substantially easier than the original task).

    There are several places one should check before proceeding with one’s own archiving solution (which is presented in the rest of the steps):
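
    One place that can be checked from the command line is the Internet Archive’s Wayback Machine, whose availability endpoint returns JSON describing the closest snapshot of a URL. The following is a minimal sketch, assuming curl is installed; the helper names wayback_query_url and wayback_check are made up for illustration, and example.com is a placeholder.

```shell
# Build the Wayback Machine availability query for a page URL.
# (Hypothetical helper name; the URL is passed through unencoded, which is
# fine for simple URLs but not for ones containing "&" or spaces.)
wayback_query_url() {
    printf 'https://archive.org/wayback/available?url=%s\n' "$1"
}

# Fetch the JSON answer. An empty "archived_snapshots" object in the
# response means no snapshot was found for that URL.
wayback_check() {
    curl --silent "$(wayback_query_url "$1")"
}

# Example (performs a network request):
#   wayback_check 'example.com/sitemap.html'
```

    If the response lists a recent snapshot, the rest of the steps may be unnecessary for that page.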

  4. Check to see if there are any specialized tools to download the site. There are sometimes special tools for downloading particular content. For instance, if you want to download many YouTube videos, there is a tool called youtube-dl. Many sites also have an API (application programming interface) that allows you to easily access content programmatically. This is the case with, for example, reddit, which has a well-documented API (though also note that there is already a fairly comprehensive archive of reddit, which was itself created using reddit’s API).

  5. Figure out if the site has a limit on how many requests you can make in a certain amount of time. If no one else has recently archived the site and if there are no specialized tools to archive the site, then you are on your own. It is good practice to figure out if the site you are targeting has any special rules on how much or how quickly you can download content. Many sites have clearly-defined rules, which are frequently listed under their terms of service/use. Even if you don’t personally care whether the site wants to limit your access, this is still good to keep in mind, since they may have automatic IP banning (and other measures) in place to prevent people from overloading their servers or “stealing” their content.

    See the step “Use multiple computers or IP addresses to speed up archival” for ways to get around this limit.
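
    Besides the terms of service, some sites state a preferred delay between requests in their robots.txt file via the non-standard Crawl-delay directive. The following sketch extracts that value so it can be passed to a downloader’s wait option; the function name crawl_delays is made up for illustration, and many sites will not list the directive at all.

```shell
# Print the Crawl-delay values (in seconds) found in robots.txt text on stdin.
# The directive name is matched case-insensitively; nothing is printed when
# the file has no Crawl-delay line.
crawl_delays() {
    awk -F: 'tolower($1) ~ /^[ \t]*crawl-delay[ \t]*$/ {
        gsub(/[ \t\r]/, "", $2)   # strip whitespace around the number
        print $2
    }'
}

# Example (performs a network request; the URL is a placeholder):
#   curl --silent http://example.com/robots.txt | crawl_delays
```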

  6. Try running a wget command on the site. Wget is a program that can download content from the internet, and is included by default on many Unix-like systems. Sometimes, a single wget command can download all or most of the site.3 If this is the case, the task becomes trivial. However, it is still important to start out the crawl on a page that is likely to link to many other pages, since wget begins at the starting page (called the “seed”) and recursively follows internal links. Locating a sitemap page is ideal, if the site has one.

    Here is a sample command with many useful flags turned on:

    wget --mirror --page-requisites --adjust-extension --convert-links \
        --wait=1 --random-wait -e robots=off \
        http://example.com/sitemap.html

    Be sure to change http://example.com/sitemap.html as well as any flags to suit your needs. Here are the meanings of the flags, although you should also consult the official wget documentation (or info wget) as well as its manual (type man wget):

    • --mirror turns on several options favorable to mirroring the site, including recursively following links and timestamping.
    • --page-requisites makes wget download images, stylesheets, and other files that help to produce an appearance faithful to the original.
    • --adjust-extension adds the suffix .html to HTML files to make them easier to browse locally.
    • --convert-links alters linking in downloaded files so they are suitable for local browsing.
    • --wait=1 forces wget to wait for 1 second between requests (however, see the next option as well).
    • --random-wait varies the wait for each request between 0.5 and 1.5 times the constant from --wait so that the download pattern is less suspicious; this plausibly helps prevent being IP banned.
    • --no-clobber (not used above) preserves local copies of a file instead of overwriting them or downloading new copies. This is something to consider if the download was stopped in the middle, but wget refuses to combine it with the timestamping that --mirror turns on, so it is left out of the command above.
    • -e robots=off turns off obeying the robots.txt file of a site. Some people consider it respectful to follow the rules in robots.txt, but sometimes whole sites are excluded by robots.txt (if the owners put Disallow: /). The Archiveteam, for one, considers robots.txt “a suicide note” and ignores it.

    Consider also using the following options:

    • --user-agent='Mozilla/5.0 Firefox/40.0' will set your user agent to appear as if you are using Firefox to access the site. This is useful since some sites disallow crawler-like user agents or else display different pages depending on the user agent.
    • --restrict-file-names=nocontrol will prevent wget from touching special characters in the URL. This often is useful if you are downloading from a site with filenames containing many Unicode characters.4

    cURL is a similar utility to wget that has similar features. On Mac OS X and many BSD systems, cURL is the default downloader instead of wget. You may also be interested in finding other programs that are intended for mirroring sites, such as HTTrack.

  7. Examine the website to see if there are any patterns or structure to it. If a site cannot be mirrored easily using wget (or similar utility), then it is time for another strategy. Many sites, such as web forums, have explicit patterns in the URL, such as numbered threads.

    To take an example, we can look at AutoAdmit, a notorious discussion board. One of the threads on the site has the URL http://autoadmit.com/thread.php?thread_id=2993725&mc=12&forum_id=2. However, we might notice that the only crucial number here is the thread_id; indeed, navigating to just http://autoadmit.com/thread.php?thread_id=2993725 shows an essentially identical page. We might now try to deduce that the general AutoAdmit URL structure is http://autoadmit.com/thread.php?thread_id=N, where N is some number; trying several URLs that are instances of this general pattern confirms this. After this, it is a matter of looping through all the threads and downloading them; in bash:

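    # Loop over every thread ID up to the highest one observed.
    for ((i = 1; i <= 2993725; i++))
    do
        # -U sets the user agent (wget's -A is an accept list, not a user agent).
        wget -U 'Mozilla/5.0 Firefox/40.0' \
            "http://autoadmit.com/thread.php?thread_id=$i"
        sleep 0.3
    done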

    Of course, the URL structure of AutoAdmit is one of the simplest. Other sites, like WordPress blogs, may require e.g. looping through the archive pages after calculating how many pages of posts are in each month (though WordPress blogs also tend to be quite amenable to a naïve crawl). There is no general way to go about this, so you might be on your own; searching around on Google can often be very helpful.
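
    For a pattern like the thread_id one, a variation is to write the URLs to a file first and hand the file to wget, which makes it easier to resume or divide the work. This is a sketch under the assumption that the thread_id pattern described in this step holds; the function name thread_urls and the file name urls.txt are made up for illustration.

```shell
# Print one thread URL per line for thread IDs from $1 to $2 (inclusive).
thread_urls() {
    i=$1
    while [ "$i" -le "$2" ]; do
        printf 'http://autoadmit.com/thread.php?thread_id=%s\n' "$i"
        i=$((i + 1))
    done
}

# Example (performs network requests): write a small batch of URLs to a
# file, then let wget work through the list with a pause between requests.
#   thread_urls 1 100 > urls.txt
#   wget --input-file=urls.txt --wait=1 --random-wait \
#       --user-agent='Mozilla/5.0 Firefox/40.0'
```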

  8. If the site has a lot of JavaScript, consider more advanced or tedious techniques for downloading the site. In recent times, the web has shifted increasingly toward the heavy use of JavaScript and other interactive elements (sometimes called DOM scripting). While this shift has allowed the web to “rival native applications without a hefty initial download, without an install process, and do so across devices old and new”5, it is still often the subject of ridicule6 and criticism7. In terms of archiving websites, JavaScript and interactive elements are almost always bad news. There are several strategies for archiving a JavaScript-heavy site.

    One approach is to try to avoid JavaScript entirely. Many sites make available RSS or Atom feeds that readers can use to keep up with new content. For archivers, web feeds are useful because they are static XML, which makes downloading and processing easy. This is especially useful if you want to continue making snapshots of the site: just follow the relevant feeds and you will automatically have what you want (if you’re lucky; many sites only show previews on feeds, or the authors may edit the content afterwards, etc., which complicates matters). A caveat here is that most web feeds only contain the newest several posts or articles, so retroactively archiving a site may not be possible.
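
    As an illustration of how little processing a feed needs, the following sketch pulls the article URLs out of an RSS feed so they can be downloaded individually. It is a line-oriented approximation rather than a real XML parser, and the function name rss_links and the feed URL are made up for illustration.

```shell
# Print the contents of every <link>...</link> element in RSS text on stdin.
# Assumes each <link> element sits on a single line. Atom feeds are not
# handled, since they put the URL in an href attribute instead.
rss_links() {
    sed -n 's:.*<link>\(.*\)</link>.*:\1:p'
}

# Example (performs network requests; the feed URL is a placeholder):
#   curl --silent http://example.com/feed.xml | rss_links > urls.txt
#   wget --input-file=urls.txt --wait=1 --page-requisites
```

    Note that the feed’s own channel-level link is printed along with the item links.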

    Other things to look into:

    • In-browser JavaScript to e.g. automatically click on elements
    • PhantomJS and other headless browsers
    • Websites frequently change their UI so keeping up can be hard

  9. If a site is password-protected or otherwise requires authentication, export cookies. Sites that require authentication are tricky to access through tools like wget. However, it is possible to export cookies from graphical browsers like Firefox, then use the cookies to access the sites with wget. Firefox has the plugin Export Cookies, which works well.8
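
    Once the cookies are exported, wget can read them with its --load-cookies option, which expects the Netscape cookies.txt format. A minimal sketch follows; the function name fetch_with_cookies and the file name cookies.txt are made up for illustration.

```shell
# Download a page using cookies exported from a browser.
# $1: cookie file in the Netscape cookies.txt format; $2: URL to fetch.
fetch_with_cookies() {
    if [ ! -r "$1" ]; then
        echo "cannot read cookie file: $1" >&2
        return 1
    fi
    wget --load-cookies "$1" "$2"
}

# Example (performs a network request; the URL is a placeholder):
#   fetch_with_cookies cookies.txt http://example.com/members/index.html
```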

  10. Use multiple computers or IP addresses to speed up archival. If a website can relatively quickly detect crawlers and ban them, one option (besides slowing down) is to connect to the site using multiple IP addresses. Here are two ways to do this. One simple way is to change your physical location, for instance by visiting several cafés or libraries; at each new location, you will have a fresh IP address with which to download from the site. Another is to obtain access to multiple computers, for instance by purchasing access to virtual private servers. An advantage of the latter option is that multiple IP addresses can download simultaneously, which speeds up the download rather than merely allowing it to continue. However, the latter option often involves spending money, although this can be as little as $15 per year. It is also of questionable legality.

    It is also possible to use the anonymity network Tor to continually alter your IP by changing exit nodes. However, it is highly discouraged to abuse Tor in this way.9
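
    When several computers share the work described in this step, the list of URLs has to be divided among them. The following sketch deals a URL list out into a fixed number of files in round-robin order; the function name split_urls and the chunk-N.txt file names are made up for illustration.

```shell
# Split the URL list on stdin into $1 files named chunk-0.txt, chunk-1.txt,
# and so on, in the current directory, dealing lines out round-robin.
split_urls() {
    awk -v n="$1" '{
        f = "chunk-" ((NR - 1) % n) ".txt"
        print > f
    }'
}

# Example: divide urls.txt among three machines.
#   split_urls 3 < urls.txt
```

    Each machine can then be given one chunk file as its wget --input-file.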

  11. Automate as much as possible. If you expect to continue making archives of the site or if the target site is enormous, then it becomes very important to automate as much of the process as possible.
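
    One way to automate repeated snapshots is to wrap the download in a small script that writes each run into a dated directory, and to schedule that script with cron. The following is a sketch only; the function names, the paths, and the weekly schedule are assumptions made for illustration.

```shell
# Print a dated directory name under base directory $1 (YYYY-MM-DD).
snapshot_dir() {
    printf '%s/%s\n' "$1" "$(date +%Y-%m-%d)"
}

# Mirror the site at URL $2 into a fresh dated directory under $1.
take_snapshot() {
    dir=$(snapshot_dir "$1")
    mkdir -p "$dir" &&
        wget --mirror --page-requisites --wait=1 --random-wait \
            --directory-prefix="$dir" "$2"
}

# A crontab line such as the following (the path is a placeholder) would run
# a script that calls take_snapshot at 03:00 every Monday:
#   0 3 * * 1 /home/user/bin/snapshot.sh
```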

  12. Consider releasing your archives. There are several reasons you might want to release your archives to the public. For one, you have now gone through a time-consuming process of archiving the site; if others also want local copies, then it’s likely they would be gratified to find that someone else has done the work for them. In addition, “lots of copies keep stuff safe”, so allowing others to mirror your archives ensures that it is safer against threats like disk failure. See for instance the Internet Archive’s upload page for more.
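
    Before uploading, it helps to pack the archive into a single file and publish a checksum so that people who mirror it can verify their copy. A minimal sketch, assuming GNU tar and sha256sum are available; the function name package_archive is made up for illustration.

```shell
# Pack directory $1 into $1.tar.gz and write a SHA-256 checksum beside it.
# Pass the directory name without a trailing slash.
package_archive() {
    tar -czf "$1.tar.gz" "$1" &&
        sha256sum "$1.tar.gz" > "$1.tar.gz.sha256"
}

# Example:
#   package_archive example.com
# Anyone with both files can later verify the download with:
#   sha256sum -c example.com.tar.gz.sha256
```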

Warnings


  1. See for instance “Archiving URLs” by gwern, § Link rot.↩︎

  2. See Wikipedia’s “The Myth of the Rational Voter”, § Make-work bias.↩︎

  3. See also “Downloading an Entire Web Site with wget” and “How can I download an entire website?”, which suggest similar flags for wget.↩︎

  4. Brouwer, Andries E. “wget failure to handle UTF-8 filenames”.↩︎

  5. Archibald, Jake. “Progressive enhancement is still important”. 2013-07-03.↩︎

  6. Zawinski, Jamie. “design”. 2001.↩︎

  7. cat -v. “Everyone has JavaScript, right?” 2015-04-24.↩︎

  8. gwern. “Archiving URLs”, § Local caching. 2015-08-12.↩︎

  9. Note, however, that if you want to crawl sites that require Tor (such as certain black-market sites), then the legality of network abuse is probably not a problem, although the legality of the content might be. For the curious, a highly detailed guide to using Tor to archive sites is available at gwern’s “Black-market archives”, § How to crawl markets.↩︎

  10. Manu Chandra Prasad. Answer to “What are some Web crawler tips to avoid crawler traps?” 2014-08-22.↩︎

  11. Stack Overflow. “Interview question: Honeypots and web crawlers”. 2011.↩︎

  12. See for instance a graph of total transfer size from November 2010 to September 2015.↩︎

  13. Website Optimization. “Average Web Page Breaks 1600K”. 2014-07-18.↩︎
