Failure Recovery

While we always hope for and expect success, we also need to plan for failure. Disk drives wear out and machines crash; this is a normal part of working with computers. So we must plan ahead for what we will do if (nay, when) a disk fails, so that we can quickly recover and get back to work.

Each section on this page describes a different "component" of the whole project, be it data, hardware, or software. Please include a brief description of what the component is or does, and then describe how we would recover from a failure. It might be that data can be recovered from a tape backup, while software could more easily just be rebuilt if we retained the configuration information used to build it the first time.

Please describe procedures which are currently in place. If there is a better way to do it in the future you can note that too, but we need to know what is currently available.

Software

Cosmics

  • Description: Cosmics analysis software is....
  • Recovery Plan:

LIGO

Bluestone (TLA)

  • Description: Bluestone, the LIGO Analysis Tool (aka TLA), consists of PHP code run by an Apache server, along with ROOT scripts and Bourne shell scripts to run an analysis using ROOT.
  • Recovery Plan: This is all software which can be restored or rebuilt:
    1. Rebuild both the Apache server and PHP from the scripts in Spy Hill CVS
    2. Check out TLA code from Spy Hill CVS
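
As a rough illustration only (not a verified procedure), a checkout from the Spy Hill repository would look something like the sketch below. The module name "tla" is an assumption and should be checked against the repository contents.

    # Sketch only: check the TLA code back out of the Spy Hill CVS repository.
    # The module name "tla" is an assumption; list the repository modules
    # first if unsure.
    CVSROOT=:pserver:anonymous@spy-hill.net:/usr/local/cvsroot/i2u2
    cvs -d $CVSROOT login
    cvs -d $CVSROOT checkout tla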

Spy Hill CVS

  • Description: CVS repository at :pserver:anonymous@spy-hill.net:/usr/local/cvsroot/i2u2 contains the code Eric has been developing for the LIGO analysis tool, discussion forums, glossary wiki, QNFellow's library, etc.
  • Recovery Plan: The CVS repository is mirrored nightly via rsync to a partition on a different machine, so any lost or munged files can be recovered from the mirror. If the disk holding the main repository fails, the mirror can be used instead. The repository is also backed up once a month to CD, so one can pull out the most recent CD and restore the files.
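
For reference, a nightly mirror of this kind boils down to a single rsync run from cron on the repository host, roughly as sketched below. The mirror host name and destination path are placeholders, not the actual configuration.

    # Sketch only: mirror the CVS repository to another machine with rsync.
    # "mirrorhost" and the destination path are placeholders.
    rsync -a --delete /usr/local/cvsroot/i2u2/ mirrorhost:/backup/cvsroot/i2u2/
    # A typical crontab entry to run this nightly at 3am would be:
    # 0 3 * * * rsync -a --delete /usr/local/cvsroot/i2u2/ mirrorhost:/backup/cvsroot/i2u2/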

Data

Cosmics

  • Description: cosmics data is ...
  • Recovery Plan:

LIGO

LIGO ELabs RDS at Argonne

  • Description: Frame files from the LIGO ELabs Reduced Data Set (RDS), currently kept at /disks1/myers/data/ligo/frames. Only data from the most recent 90 days are currently retained.
  • Recovery Plan: The partition data1.i2u2.org:/disks1 is backed up to tape by the MCS systems staff, so these files could be recovered from those tapes. But since the files are also mirrored from tekoa.ligo-wa.caltech.edu:/data/ligo/ligo/frames via rsync, it would probably be easier to recover the data across the network.
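
Recovering over the network would amount to re-running the mirror in the usual direction, roughly as sketched below. The source path is taken from the description above; it and the rsync options should be verified against the current mirror job before use.

    # Sketch only: pull the RDS frame files back from tekoa to data1.
    # Verify the source path against the current mirror configuration first.
    rsync -av tekoa.ligo-wa.caltech.edu:/data/ligo/ligo/frames/ \
          /disks1/myers/data/ligo/frames/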

LIGO ELabs RDS at Hanford

  • Description: Frame files for the LIGO ELabs Reduced Data Set (RDS) are generated on tekoa.ligo-wa.caltech.edu via a cron job which runs every hour and collects the new data from both the frame builder and DMT. These are stored on that machine in /data/ligo/frames/.
  • Recovery Plan: If the /data partition on tekoa is lost it can be regenerated by running the script which builds the RDS frame files (bin/update-rds.php) with a stop GPS time which goes back far enough that the entire RDS would be regenerated. The earliest time available is probably 6 October 2003 (GPS 749494800). This would take about 5 days, but the newest regenerated data could be used immediately.
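
A full regeneration run would look roughly like the sketch below. The working directory and the way the stop GPS time is passed to the script are assumptions; check the script itself before running it.

    # Sketch only: regenerate the full RDS back to the earliest available data
    # (GPS 749494800 = 6 October 2003).  The argument form is an assumption;
    # check bin/update-rds.php for how it actually takes its stop time.
    php bin/update-rds.php 749494800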

LIGO Archive and DMT data at Hanford

  • Description: The ELabs RDS is derived from two sources, both at the LIGO Hanford Observatory: 1) raw frame files from the Frame Builder, mounted on tekoa at /archive/frames/trend/minute-trend/LHO/ from ldas:/archive/frames, and 2) DMT frames, mounted on tekoa at /dmt/New_Seis_Blrms/ from ldas:/fb0_frames/trend/dmt.
  • Recovery Plan: If something happens to these datasets it will be a disaster for LIGO. We can likely rely on the folks at LHO and Caltech to recover them for us, probably from tapes written to the Tier 1 tape archive at Caltech. Don't expect getting ELabs back online to be a high priority. Existing RDS frame files already on tekoa and at Argonne and Spy-Hill should still be available for student use; we just won't be able to generate new data until LIGO recovers.

CMS

STAR

Hardware

What will we do if/when one of our server nodes fails? Or if all of them fail, or are unavailable due to problems at the hosting site?

Server Nodes

  • Description: server software is deployed on a cluster of machines www10 through www17 physically hosted by MCS Systems at Argonne National Laboratory.
  • The server nodes are not backed up in any way by MCS. The fact that home directories are on the /sandbox partition is a stark reminder of this. If you want work on one of these nodes to be preserved, you must copy it somewhere else yourself, for example by checking it in to CVS on a repository hosted elsewhere, by copying it to a machine which is backed up, or by some other means.
  • Eric uses scripts he wrote called dpush and dsync which perform an rsync push or push/pull to another machine. He has set $DSYNCHOST to login.mcs.anl.gov, so when he is in a directory and gives the dpush command, a complete copy of that directory is pushed to the same directory (relative to $HOME) on that machine. Existing files on that machine are overwritten if they are older than the one being pushed. The dcopy script does a pull instead of a push, and dsync does both push and pull, resulting in both directories being synchronized with the newest version of any file replacing any older version. Talk to Eric if you want these scripts and/or help using them; a rough sketch of the idea appears at the end of this section.

  • An idea which has been suggested but not fully discussed is to put a copy of the cosmics software (whatever is most stable, not necessarily the newest) and a subset of data on a laptop or a desktop machine somewhere other than Argonne, so that it can act as a substitute server in the event that the Argonne site is down (as it will be for the entire day of 22 September 2007).

  • A production version of the LIGO analysis tool is running on tekoa.ligo-wa.caltech.edu, which is the primary data source for I2U2. If the Argonne server is down this can serve as a backup server, at least using the 'guest' demo account.
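
For reference, the core of what dpush does amounts to something like the sketch below. This is an illustration of the idea only, not Eric's actual script; ask him for the real versions.

    # Sketch only: push the current directory to the same path (relative to
    # $HOME) on $DSYNCHOST, overwriting remote files only if they are older
    # than the local copies (rsync -u).
    DSYNCHOST=${DSYNCHOST:-login.mcs.anl.gov}
    REL=${PWD#$HOME/}
    rsync -au "$PWD/" "$DSYNCHOST:$REL/"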

Data Nodes

  • Description: Data for the e-Labs, originally just cosmics but now also LIGO, and others in the future, is saved on two machines physically hosted by MCS Systems at ANL along with the server nodes.
    • data0.i2u2.org has 2x300GB drives (possibly RAID 1?) with a single 217G data partition which is exported to the servers, which mount it as /nfs.
    • data1.i2u2.org has 6x156GB disks (possibly RAID 0+1?), with two 135G partitions, which are mounted on the servers as /disks0 and /disks1.
  • Recovery Plan:
  1. If a single drive fails in a RAID 1 array, the data can be copied from the mirror.
  2. If there is a catastrophic failure of all disks or the controller, then the data can be recovered from tapes. The filesystems on data0 and data1 are backed up to tape by MCS, and they are also being mirrored by rsync to storage at the CI (details?).

Network

  • If the network for the cluster is down, but the MCS login server is still up, then it would be possible to turn off the load-balancing and replace the main page at http://www.i2u2.org with a message saying that the site is down, and perhaps with status information. That would be better than no response at all. The details of how and when to do this should be fleshed out further; a rough sketch of a stand-in page appears at the end of this section.

  • If the network at ANL is down, including the MCS login service which hosts http://www.i2u2.org, then we could have a pre-arranged web site where people could look for information.
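
One simple way to handle the first scenario above (cluster down but login server up), sketched below as an assumption rather than an agreed procedure, is to stage a static "site down" page on the login server once the load-balancing is turned off. The document root path is a placeholder.

    # Sketch only: put up a static "site down" page on the MCS login server
    # that hosts http://www.i2u2.org.  The document root path is a placeholder.
    printf '%s\n' \
      '<html><body>' \
      '<h1>I2U2 is temporarily unavailable</h1>' \
      '<p>The Argonne cluster is down.  Please check back later.</p>' \
      '</body></html>' > /var/www/i2u2/index.html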
