When an analysis task is started it will usually need a set of parameters which control what the task is to do. For example, for a LIGO analysis the task will need to know the starting and ending GPS times and the data channel(s) to use, along with the sampling type (raw or trended data?). When there are only a few parameters to pass to the task these can be specified on a command line or passed as function arguments. But when there are more than a few such parameters they need to be written to a file, and that file is then passed to the task.
A method of passing parameters is also needed within a workflow. If the flow is made up of several components (modules, elements) which receive output from one module or pass it on to another, then one again may need to pass parameters along with the data. When the number of parameters is not small then passing the parameters via a file is the most effective way to do this.
One can easily imagine several different ways to structure parameter files. Rather than choosing one to the exclusion of all other choices, I propose that we allow the use of several different parameter file formats. If we plan this right then we can add flexibility without adding undue complexity.
Regardless of which particular file format is used, all parameter file schemes should follow these guidelines:
- Distinguishable first line: it should be possible for a program to identify the general parameter file format scheme by reading only the first line of the file. (This is like Unix "magic numbers", like shell scripts starting with
#!/bin/sh or PostScript files starting with
- Format versioning: each file should include a format version number, to allow us to make changes to the format as we go along, but to still be able to detect and read files written with an earlier format. This information may or may not be a part of the first line of the file.
- Positional Format: one common way to format parameter files is to have each item on each line represent a parameter, based on its position. The disadvantages of this are that you cannot easily change the format, it's hard for humans to read, and it's hard to alter parameters by hand if you need to do so. A positional format is also very specific to a particular application.
- Key/Value/Type: An alternative to a positional format is to make the file a list of parameter items, which consist of a "key" or "name" for the given parameter, along with the "value" to be associated with the parameter. Some schemes may also include the "type" for the parameter, to allow us to cross check inputs with outputs. However, a parameter type is not a requirement for all schemes. Instead, the value may be tested to see that it matches the proper type at run time, or cast into the proper type (eg.
- Comments: Ideally a parameter file scheme will also allow "comment" lines, which can be included in the file to improve readability, but which are ignored by the program or routine which reads and processes the file as input.
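As a rough illustration of the first two guidelines, here is a minimal Python sketch which identifies the scheme and format version from the first line alone. The header convention `#!param <scheme> <major>.<minor>` and the function name `identify_format` are assumptions made up for this example; nothing above fixes what the first line should actually look like.

```python
import re

# Hypothetical header convention (an assumption, not a decided format):
#   #!param <scheme> <major>.<minor>
HEADER_RE = re.compile(
    r"^#!param\s+(?P<scheme>[A-Za-z]+)\s+(?P<major>\d+)\.(?P<minor>\d+)\s*$"
)


def identify_format(first_line):
    """Return (scheme, (major, minor)) from a header line, or None if absent."""
    match = HEADER_RE.match(first_line.rstrip("\r\n"))
    if match is None:
        return None
    version = (int(match.group("major")), int(match.group("minor")))
    return match.group("scheme").upper(), version


if __name__ == "__main__":
    print(identify_format("#!param SKV 1.0\n"))
    print(identify_format("GPS_START=815000000\n"))
```

A reader could call this on the first line and then hand the rest of the file to whichever parser matches the scheme, refusing files whose major version it does not understand.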
Here are some of the various schemes which we may use or at least consider:
Simple Key:Value (SKV) parameter files
The simplest format for a parameter file is to write the name (or "key") and value to the file, one pair per line (some scheme is needed for continuation to the next line if the value is too long). The value could be separated from the key by a special separator character, such as ":" or "=", or it could simply be separated by a space (provided that parameter names are not allowed to have spaces).
An example of this is provided by the system configuration parameters on a Linux system, found in
on Red Hat systems or in
on Debian systems (such as www13). These files are "sourced" by Bourne shell scripts, so they simply contain shell variable settings, with the variable name (usually in capital letters, but not always), an equals sign, and then a value. Because they are shell scripts they can also contain comments; anything after the # symbol is ignored up to the end of the line (unless it's in a quoted string). For example, the parameters used when starting the Apache web server on www13 are in
, which contains
# 0 = start on boot; 1 = don't start on boot
The advantage of the SKV scheme is that it is simple to implement, and as demonstrated by this example could even be implemented by "sourcing" code fragments. Eric has used this same scheme in the LIGO Analysis Tool at the end of an analysis when you are given a form which lets you customize the plot (change pen color, change plot and axes labels). The interface code writes a small ROOT script with the new parameter values, and then runs that through as part of a sequence of ROOT scripts which load the previous plot script, apply the changes, and then save the new plot as a script.
The main problem with the SKV scheme is that it cannot be easily used to encode the properties of data "objects", or to encode parameters which are naturally arranged into a hierarchy.
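To show how little code the SKV scheme needs when the file is parsed rather than "sourced", here is a minimal Python sketch. The function name `parse_skv`, the parameter names, and the channel name in the sample are all hypothetical. The sketch deliberately skips continuation lines and trailing comments, which a real reader would have to handle.

```python
def parse_skv(text, separator="="):
    """Parse 'KEY=value' lines into a dict of strings.

    Blank lines and lines starting with '#' are skipped. A matching pair of
    outer quotes around the value is removed. Trailing comments and
    continuation lines are NOT handled in this sketch.
    """
    params = {}
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(separator)
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"line {number}: expected KEY{separator}value, got {raw!r}")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        params[key] = value
    return params


# Hypothetical parameter file contents, for illustration only.
SAMPLE = """\
# Analysis parameters
GPS_START=815000000
GPS_END=815003600
CHANNEL="H0:PEM-EXAMPLE_CHANNEL"
SAMPLING=trend
"""

if __name__ == "__main__":
    print(parse_skv(SAMPLE))
```

All values come back as strings, so any type checking or casting is left to the caller at run time, as described in the guidelines.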
Grouped Key:Value (GKV) parameters
A small variation on the Key:Value format is to allow parameters to be grouped in the parameter file by subsystem or some other grouping principle. One common way of doing this is to begin a block of parameters with the group name enclosed in brackets. There are several common examples of this. On Windows, files like
use this scheme to apply parameters for specific subsystems (though I don't have any examples handy).
The configuration file for PHP also uses this kind of segmentation. On www13 you can browse the configuration for PHP as run under the web server, in the file
. You will see sections such as
; Whether or not to define the various syslog variables (e.g. $LOG_PID,
; $LOG_CRON, etc.). Turning it off is a good idea performance-wise. In
; runtime, you can define these variables by calling define_syslog_variables()
define_syslog_variables = Off
;java.class.path = .\php_java.jar
;java.home = c:\jdk
;java.library = c:\jdk\jre\bin\hotspot\jvm.dll
;java.library.path = .\
sql.safe_mode = Off
As you can also see, the comment character in PHP configuration files is the semi-colon.
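Files with bracketed group names and ";" comments can be read with the `configparser` module from the Python standard library, so a GKV reader might need very little new code. This is only a sketch: the group names, parameter names, and channel name below are hypothetical, and `load_grouped` is a helper invented for the example.

```python
import configparser

# Hypothetical grouped parameter file, for illustration only.
SAMPLE = """\
; Analysis parameters, grouped by purpose
[time]
gps_start = 815000000
gps_end = 815003600

[data]
channel = H0:PEM-EXAMPLE_CHANNEL
sampling = trend
"""


def load_grouped(text):
    """Return {group: {key: value}} with every value kept as a string."""
    # interpolation=None keeps '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {group: dict(parser[group]) for group in parser.sections()}


if __name__ == "__main__":
    for group, params in load_grouped(SAMPLE).items():
        print(group, params)
```

Note that `configparser` lowercases key names by default, which may or may not be what we want for parameter names.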
Eric has used a variation of this with column position in the LIGO Analysis Tool to specify which data channels are available at different user levels (with 1=beginner, 2=intermediate, 3=advanced). Each line in the file has a channel name, whitespace, and then the sampling rate. At various points in the file the user_level
is set to the level required to view or use the channels which follow. So the list of minute-trend channels in the file
# This file contains the channel information for the LIGO I2U2
# Reduced Data Set for Minute Trended data. This information will
# eventually move from a flat file to a database. Meanwhile, the
# columns are:
# Full_Name sample_rate
# Only the channels for a given user level (or lower) are loaded.
and so on... Beginners only see the EARTHQUAKE channel, while advanced users can also see the RAIN and WIND channels. In this case the Key=value is in the [brackets] and the rest of the file has a specialized format based on column position.
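The mixed format described here (a bracketed setting followed by column-position lines) can be read with a short loop. The Python sketch below assumes the bracketed lines look like `[user_level=1]`; that exact syntax, the channel names, and the sample rates are guesses made for illustration, not copied from the real channel file.

```python
import re

# Assumed syntax for the bracketed line; the real file may differ.
LEVEL_RE = re.compile(r"^\[\s*user_level\s*=\s*(\d+)\s*\]$")


def load_channels(text, max_level):
    """Return {channel_name: sample_rate} for channels at or below max_level."""
    channels = {}
    current_level = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = LEVEL_RE.match(line)
        if match:
            current_level = int(match.group(1))
            continue
        if current_level is None:
            raise ValueError(f"channel listed before any user_level: {raw!r}")
        name, rate = line.split()  # positional columns: Full_Name sample_rate
        if current_level <= max_level:
            channels[name] = float(rate)
    return channels


# Hypothetical file contents, for illustration only.
SAMPLE = """\
# Full_Name sample_rate
[user_level=1]
EXAMPLE:EARTHQUAKE 1
[user_level=3]
EXAMPLE:RAIN 1
EXAMPLE:WIND 1
"""

if __name__ == "__main__":
    print(load_channels(SAMPLE, max_level=1))
    print(load_channels(SAMPLE, max_level=3))
```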
Hierarchical Key:Value (HKey) schemes
- Examples: Microsoft Windows Registry... or NeXT/Darwin NetInfo database (but both are actually binary formats!)
- Example: X11 Resource Files (see `man xrdb`)...
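One way to get a hierarchy out of a plain text file, loosely in the style of X11 resource files, is to let dots in the key name describe the path. The Python sketch below assumes `path.to.key: value` lines with "!" as the comment character; it ignores the "*" wildcard bindings that real X resources allow, and all of the names in the sample are hypothetical.

```python
def parse_hierarchical(text):
    """Turn 'a.b.c: value' lines into nested dicts: {'a': {'b': {'c': 'value'}}}."""
    tree = {}
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("!"):
            continue
        path, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            raise ValueError(f"line {number}: expected 'path: value', got {raw!r}")
        parts = [part.strip() for part in path.split(".")]
        node = tree
        for part in parts[:-1]:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError(f"line {number}: {part!r} is already a plain value")
        node[parts[-1]] = value.strip()
    return tree


# Hypothetical hierarchical parameter file, for illustration only.
SAMPLE = """\
! Analysis parameters
analysis.time.gps_start: 815000000
analysis.time.gps_end: 815003600
analysis.data.channel: H0:PEM-EXAMPLE_CHANNEL
analysis.plot.pen.color: red
"""

if __name__ == "__main__":
    print(parse_hierarchical(SAMPLE))
```

Because only the first colon on a line separates the path from the value, values which themselves contain colons (such as channel names) pass through unchanged.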
I hope to avoid XML for a while. It's more complex, harder for humans to read, and does not have a nice comment mechanism.
But I should mention here that LIGO has a file format called the LIGO Lightweight format (LIGO_LW) which is based on XML. This would likely be a good way to pass data (not parameters) between modules in a workflow. And if the data are in XML, and the parameters and metadata are also in XML, then they might even be bundled together into a larger XML assembly. This needs further thought and discussion.
Based on RFC 2445... easier than XML to read, but can also encapsulate object structure...
If we want to encapsulate objects this may be easier to do compared to using XML. It is certainly easier to read.
PHP Session Serialization
PHP stores variables during a session in the global variable $_SESSION. At the end of a page hit the contents of this variable are serialized and put into a file in a common area for session data. At the next page hit this file is read in and the values are restored into the $_SESSION array. (I believe we can even use this for load balancing; if the common area is shared via NFS then two servers could both serve pages from the same PHP session, regardless of which one got the page hit.)
One interesting result is that if we know the serialization format for $_SESSION then we could write code for ROOT which reads and decodes that file and extracts values from it. One could then easily extract the value of any $_SESSION variable into a ROOT script. All you would need to know is the path to where the session data are stored and the session ID.
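To get a feel for how hard the decoding would be, here is a minimal sketch in Python (not ROOT). It assumes PHP's default session handler, which as far as I understand it writes each variable as `name|` followed by the output of `serialize()`. Only null, boolean, integer, float, string, and array values are handled; objects and references are not. The sample bytes are made up for illustration.

```python
def _read_until(data, pos, terminator):
    """Return (bytes up to terminator, position just past the terminator)."""
    end = data.index(terminator, pos)
    return data[pos:end], end + 1


def _decode_value(data, pos):
    """Decode one serialized value starting at pos; return (value, next_pos)."""
    tag = data[pos:pos + 1]
    if tag == b"N":                                  # N;
        return None, pos + 2
    if tag in (b"b", b"i", b"d"):                    # b:1;  i:42;  d:1.5;
        raw, pos = _read_until(data, pos + 2, b";")
        text = raw.decode("ascii")
        if tag == b"b":
            return text == "1", pos
        return (int(text) if tag == b"i" else float(text)), pos
    if tag == b"s":                                  # s:3:"red";
        raw_len, pos = _read_until(data, pos + 2, b":")
        start = pos + 1                              # skip the opening quote
        end = start + int(raw_len)                   # length counts bytes
        return data[start:end].decode("utf-8"), end + 2  # skip '";'
    if tag == b"a":                                  # a:1:{key value ...}
        raw_count, pos = _read_until(data, pos + 2, b":")
        pos += 1                                     # skip '{'
        result = {}
        for _ in range(int(raw_count)):
            key, pos = _decode_value(data, pos)
            value, pos = _decode_value(data, pos)
            result[key] = value
        return result, pos + 1                       # skip '}'
    raise ValueError(f"unsupported type tag {tag!r} at offset {pos}")


def decode_session(data):
    """Decode 'name|value' pairs from raw session file bytes into a dict."""
    session = {}
    pos = 0
    while pos < len(data):
        raw_name, pos = _read_until(data, pos, b"|")
        value, pos = _decode_value(data, pos)
        session[raw_name.decode("utf-8")] = value
    return session


# Made-up session contents, for illustration only.
SAMPLE = (
    b'gps_start|i:815000000;'
    b'channel|s:22:"H0:PEM-EXAMPLE_CHANNEL";'
    b'plot|a:2:{s:9:"pen_color";s:3:"red";s:5:"width";d:1.5;}'
)

if __name__ == "__main__":
    print(decode_session(SAMPLE))
```

The same logic should carry over to a ROOT script without much trouble, since it is just a single pass over the bytes of the file.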
JSP Session Serialization
Tomcat's JSP must also store session data in an intermediate file between page hits, so the same idea as above for PHP might be applied to JSP.
YAML (rhymes with camel) is a human-readable data serialization format that takes concepts from languages such as XML, C, Python, Perl, as well as the format for electronic mail as specified by RFC 2822. YAML syntax was designed to be easily mapped to data types common to most high-level languages: list, hash, and scalar. Its familiar indented outline and lean appearance makes it especially suited for tasks where humans are likely to view or edit data structures, such as configuration files, dumping during debugging, and document headers...