Thursday, October 3, 2013

Backing up the Sagemath Cloud

The terms of usage of the Sagemath Cloud say "This free service is not guaranteed to have any uptime or backups." That said, I do actually care a huge amount about backing up the data stored there, and ensuring that you don't lose your work.

Bup

I spent a lot of time building a snapshot system for user projects on top of bup. Bup is a highly efficient de-duplicating compressed backup system built on top of git; unlike other approaches, you can store arbitrary data, huge files, etc.

I looked at many open source options for making efficient de-duplicated distributed snapshots, and I think bup is overall the best, especially because the source code is readable. Right now https://cloud.sagemath.com makes several thousand bup snapshots every day, and it has already saved people many, many hours of potentially lost work (after they accidentally deleted or corrupted files).

You can access these snapshots by clicking on the camera icon on the right side of the file listing page.
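
For readers who have not used bup, here is a minimal sketch of what taking one snapshot can look like from Python. It is not the code that runs on cloud.sagemath.com; the function names, the paths and the branch name are made up for illustration, and the repository is assumed to have been created already with `bup init`. Only the standard `bup index` and `bup save -n` commands and the `BUP_DIR` environment variable are used.

```python
import os
import subprocess


def bup_snapshot_commands(project_path, branch):
    """Return the two bup invocations that snapshot project_path.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    return [
        # Refresh bup's index of the files under project_path.
        ["bup", "index", project_path],
        # Save the indexed files as a new commit on the named branch.
        ["bup", "save", "-n", branch, project_path],
    ]


def take_snapshot(bup_dir, project_path, branch="master"):
    """Run the snapshot commands against the repository in bup_dir.

    Assumes bup is installed and bup_dir was initialized with `bup init`.
    """
    env = dict(os.environ, BUP_DIR=bup_dir)  # BUP_DIR selects the repository
    for cmd in bup_snapshot_commands(project_path, branch):
        subprocess.run(cmd, env=env, check=True)  # stop on the first failure
```

Calling `take_snapshot("/bup/repo-0001", "/projects/example")` would index and save that (hypothetical) project directory.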


Some lessons learned when implementing the snapshot system

  • Avoid creating a large number of branches/commits -- creating an almost-empty repo, but with, say, 500 branches, even with very little in them, makes things painfully slow, e.g., due to an enormous number of separate calls to git. When users interactively get directory listings, it should take at most about 1 second to get a listing, or they will be annoyed. I made some possibly-hackish optimizations -- mainly caching -- to offset this issue; they are here in case anyone is interested: https://github.com/williamstein/bup (I think they are too hackish to be included in bup, but anybody is welcome to them.)

  • Run a regular test of how long it takes to access the file listing in the latest commit, and if it gets above a threshold, create a new bup repo. So in fact the bup backup daemons really manage a sequence of bup repos. There are a bunch of these daemons running on different computers, and it was critical to implement locking, since in my experience bad things happen if you try to back up an account using two different bups at the same time. Right now, typically a bup repo will have about 2000 commits before I switch to another one.

  • I wrote code that saves information about the current state whenever a commit starts, so that everything can be rolled back if an error occurs, due to files moving, network issues, the snapshot being massive due to a nefarious user, power loss, etc. This was critical to avoid the bup repo getting corrupted, and hence broken.

  • In the end, I stopped using branches, due to complexity and inefficiency, and just make all the commits in the same branch. I keep track of what is what in a separate database. Also, when making a snapshot, I record the changed files (as output by the command mentioned above) in the database with the commit, since this information can be really useful, and is impossible to get out of my backups, due to using a single branch, the bup archives being on multiple computers, and also there being multiple bup archives on each computer. NOTE: I've been recording this information for cloud.sagemath for months; it is not yet exposed in the user interface, but will be soon.
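
The repo-rotation and locking lessons above can be sketched roughly as follows. This is an illustration under stated assumptions, not the daemon code: `exclusive_lock`, `pick_repo` and `list_latest` are hypothetical names, the 1-second default simply reuses the listing-latency target mentioned earlier, and `fcntl.flock` only protects against concurrent processes on one machine (daemons on different computers would need a shared lock, for example in a database).

```python
import fcntl
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold a non-blocking exclusive advisory lock on lock_path.

    Raises BlockingIOError if another holder already has the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


def seconds_to_run(fn):
    """Return how long fn() takes, in seconds."""
    start = time.monotonic()
    fn()
    return time.monotonic() - start


def pick_repo(repo_paths, list_latest, new_repo_path, threshold_seconds=1.0):
    """Choose the repo for the next snapshot.

    repo_paths is ordered oldest first. list_latest(path) is a hypothetical
    callable that lists the files in the latest commit of that repo. If the
    listing of the newest repo is too slow (or no repo exists yet), the
    caller should start a fresh repo at new_repo_path.
    """
    if repo_paths:
        current = repo_paths[-1]
        if seconds_to_run(lambda: list_latest(current)) <= threshold_seconds:
            return current
    return new_repo_path
```

A daemon would wrap each backup of a project in `with exclusive_lock(...)`, skip the project if `BlockingIOError` is raised, and call `pick_repo` before saving.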

Availability

The snapshots are distributed around the Sagemath Cloud cluster, so the failure of a single machine doesn't mean that backups become unavailable. I also have scripts that automatically rsync all of the snapshot repositories to machines in other locations, and keep offsite copies as well. It is thus unlikely that any file you create in cloud.sagemath could just get lost. For better or worse, it is also impossible to permanently delete anything. Given the target audience of mathematicians and math students, and the terms of usage, I hope this is reasonable.
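
As a rough idea of what such a mirroring script might look like (the actual scripts are not shown here, and the host name and paths below are made up), one can build a plain `rsync -a` command per repository. This sketch assumes the mirror should only ever gain data, so it deliberately does not pass `--delete`.

```python
import subprocess


def rsync_command(repo_dir, remote_target):
    """Return an rsync invocation that copies repo_dir to remote_target.

    Hypothetical helper. The trailing slash on the source makes rsync copy
    the directory's contents rather than nesting the directory itself.
    """
    source = repo_dir.rstrip("/") + "/"
    # -a (archive) recurses and preserves permissions, times and symlinks.
    # No --delete: files already on the mirror are never removed.
    return ["rsync", "-a", source, remote_target]


def mirror(repo_dirs, remote_prefix):
    """Copy each repo to remote_prefix/<repo name>/, stopping on failure."""
    for repo_dir in repo_dirs:
        name = repo_dir.rstrip("/").split("/")[-1]
        subprocess.run(rsync_command(repo_dir, f"{remote_prefix}/{name}/"), check=True)
```

For example, `mirror(["/bup/repo-0001"], "backup.example.org:/mirror")` would copy that repository to the (hypothetical) offsite host.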