LIGO Storage Requirements
LIGO data are stored in "frame" files, with names like
, in a directory hierarchy of the form
The name of the file encodes what it contains. The initial "H" is for Hanford ("L" for Livingston, "G" for GEO600), while the "M" indicates a minute trend ("T" for second trends). The 9-digit number is the GPS time at the beginning of the frame (it becomes 10 digits in 2011), and the number after the hyphen is the length of the frame in seconds (3600 seconds = 1 hour).
We started with minute-trend data, since they are available both in raw PEM form (trended) and as BLRMS (Band-Limited RMS) from the DMT, so we can do more with them from the start. For production, the data are made available to the cluster at
. The past 90 days of data, which is about, are stored on the Argonne cluster in
, and Eric also keeps a copy on Spy Hill for his own testing. In both cases we continually add new files as they become available and discard those older than 90 days. The past 3.5 years (all that is now available to us) comes to about 18GB. If the minute-trend RDS grows at about 4GB/year, then over the next 5 years we would need roughly 40GB to store everything, which presents no real problem.
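The growth projection above can be checked with a quick back-of-the-envelope calculation. The 18GB starting point and the 4GB/year growth rate are the figures quoted in the text; the function name is ours, for illustration only:

```python
# Back-of-the-envelope check of the minute-trend storage projection.
# Figures (18 GB now, ~4 GB/year growth) are taken from the text above.

def minute_trend_projection(current_gb=18.0, growth_gb_per_year=4.0, years_ahead=5):
    """Projected total minute-trend RDS size after `years_ahead` more years."""
    return current_gb + growth_gb_per_year * years_ahead

print(minute_trend_projection())  # 38.0 -> so 40 GB is a comfortable allocation
```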
As of Sept 2008, both the test/development and the production copies of the RDS contain the full data set. In other words, we are not trimming the test/dev version on the ANL cluster. We can trim it later if storage becomes an issue.
A full collection of second trends would be roughly 60 times larger than the minute trends, though this is an over-estimate, because the minute trends include the DMT BLRMS channels and the second trends do not. One second-trend file, containing 60 seconds of data, is presently 204124 bytes long. That adds up to almost exactly 100GB per year, so storing the entire collection of second trends for the past 3 years plus another 5 years would take about 800GB. This is likely an over-estimate.
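The unit arithmetic behind the 100GB/year figure is worth making explicit. The only input is the 204124-byte file size quoted above; the rest is straight conversion:

```python
# Second-trend volume per year, from the 204124-byte/60-second file size
# quoted in the text.

BYTES_PER_FILE = 204_124                 # one second-trend file = 60 s of data
MINUTES_PER_YEAR = 365.25 * 24 * 60

bytes_per_year = BYTES_PER_FILE * MINUTES_PER_YEAR
gb_per_year = bytes_per_year / 1024**3
print(f"{gb_per_year:.0f} GB/year")      # prints "100 GB/year"

# 3 years back plus 5 more years = 8 years of second trends:
print(f"{8 * gb_per_year:.0f} GB total") # prints "800 GB total"
```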
We might be able to store all second trends on the ANL cluster, but we are not sure we would want to, provided that we can generate the frame files needed for a given analysis on demand; most of the data would just sit unused. Instead, we run a data-preparation task to generate frame files for a given time interval at the beginning of an analysis. There are two ways to do this, so we need to determine which is best (or perhaps facilitate both). One of the two methods is now working, with the automatic data-prep task launched by Bluestone when you first select your time segment. It may be better to make the prep step a separate analysis/transformation, or to support both.
Raw data frames
Raw data for the seismic channels are sampled at the full rate of 256Hz, so a rough size estimate is 256 times as large as the second trends. That's 25600GB, or about 26TB per year. But this is not a very accurate estimate: the weather channels have a much lower sampling rate, while the magnetometer channels are sampled at 2048Hz.
As of Sept 2008 we can refine this estimate, because we've been able to generate frame files of raw data for the ELabs RDS. Each frame file has 75 channels, covers 16 seconds, and contains 3421504 bytes. That adds up to a lower estimate of 6,285 GB/year = 6.14 TB/year.
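The refined figure follows directly from the measured frame size. Using the 3421504-byte, 16-second frame quoted above:

```python
# Raw-frame volume per year, from the measured 16-second frame file size
# (3421504 bytes, 75 channels) quoted in the text.

BYTES_PER_FRAME = 3_421_504              # one 16-second raw frame file
FRAME_SECONDS = 16
SECONDS_PER_YEAR = 365.25 * 24 * 3600

bytes_per_year = BYTES_PER_FRAME * SECONDS_PER_YEAR / FRAME_SECONDS
print(f"{bytes_per_year / 1024**3:,.0f} GB/year")  # prints "6,285 GB/year"
print(f"{bytes_per_year / 1024**4:.2f} TB/year")   # prints "6.14 TB/year"
```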
Either way, this won't fit in the 300GB available on
, or even in the 2.0TB storage array on
(which is already 75% full anyway). But again, we don't need to generate the whole RDS beforehand. We will treat these like the second trends and generate frames for a given time interval as requested by students. One complication is that the data collected for the various LIGO runs (S5, S4, A4, E12) are stored in separate sub-directories, and if we are to preserve that structure (which seems like a good idea -- it's metadata) then the scripts need to know the directories and GPS times of the runs. Eric has a flexible scheme worked out for this, but it has not yet been coded.
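One way the run-aware lookup could work is a simple table of run intervals. This is a sketch only -- Eric's actual scheme is not yet coded, and the GPS boundaries below are placeholders for illustration, not the real run times:

```python
# Sketch: map a GPS time to the LIGO run sub-directory it belongs in.
# The run table here uses PLACEHOLDER GPS boundaries, not real values.

RUNS = [
    # (run name, start GPS, end GPS) -- placeholder intervals
    ("E12", 100000000, 200000000),
    ("A4",  200000000, 300000000),
    ("S4",  300000000, 400000000),
    ("S5",  400000000, 999999999),
]

def run_for_gps(gps):
    """Return the name of the run whose interval contains `gps`, or None."""
    for name, start, end in RUNS:
        if start <= gps < end:
            return name
    return None

print(run_for_gps(350000000))  # prints "S4"
```

A frame-generation script could then build the output path from the run name, preserving the per-run directory structure.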
- Ten-minute trends and 1-hour trends would be useful for the kinds of exercises we are likely to start students on. LIGO does not generate these, so Eric will have to write a program to do so. Expect ten-minute trends to be 1/10 the size of the minute trends, and hour trends to be 1/60 (not accounting for frame-file header overhead). So generate these once and keep them on disk. They present no storage challenge, only the effort to write the code.
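The core of such a program is just block averaging. A minimal sketch of the idea, assuming a mean-value trend channel (real trend frames also carry min/max/rms channels and header metadata, which this toy function ignores):

```python
# Toy sketch: build ten-minute (or hour) trend values from minute-trend
# samples by averaging consecutive blocks.  Real frame generation would
# also handle min/max/rms channels and frame headers.

def decimate(samples, factor):
    """Average consecutive blocks of `factor` samples; trailing partial
    block is dropped."""
    n = len(samples) // factor
    return [sum(samples[i*factor:(i+1)*factor]) / factor for i in range(n)]

minute_means = [1.0] * 60           # one hour of (fake) minute-trend means
print(decimate(minute_means, 10))   # six ten-minute means
print(decimate(minute_means, 60))   # one hour-trend mean
```

The factor-of-10 (or 60) reduction in sample count is exactly why the resulting files are 1/10 (or 1/60) the size, up to header overhead.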
-- Main.EricMyers - 13 Jun 2007
-- Main.EricMyers - 16 May 2008
-- Main.EricMyers - 10 Oct 2008