# Distributing calculations on a cluster

## *pyspi* distribute scripts

All scripts necessary for distributing *pyspi* jobs across a PBS-type cluster are available in the [***pyspi-distribute***](https://github.com/olivercliff/pyspi-distribute) GitHub repository. Each job contains one Calculator object associated with one multivariate time series (MTS).

The scripts allow the user to specify a directory containing the MTS files, with each MTS sample stored in a separate binary NumPy ([.npy](https://numpy.org/doc/stable/reference/generated/numpy.save.html)) file. Within this directory, the user also needs to include a YAML configuration file that specifies the relative location of each `.npy` file (and, optionally, the `name`, `dim_order`, and any relevant `labels`). An R script, `create_yaml_for_samples.R`, is provided to automatically populate this configuration file.

## Using *pyspi* distribute

The following instructions provide a guide to getting started with distributing calculations on a PBS-type cluster. For a more in-depth example of a typical workflow using *pyspi-distribute*, see the walkthrough tutorial [here](https://time-series-features.gitbook.io/pyspi/installing-and-using-pyspi/usage/walkthrough-tutorials/distributing-calculations).

## 1. Initialising the environment

Follow the [*pyspi* installation guide](https://time-series-features.gitbook.io/pyspi/installing-and-using-pyspi/installation) to install and set up *pyspi* on your cluster. We recommend installing into a **new conda environment** to avoid dependency clashes.

## 2. Organising and formatting your data

Organise all of your multivariate time series into a user-specified data directory. If your data is not already in the correct format to be read into *pyspi*, follow the NumPy guide on saving data as `.npy` files [here](https://numpy.org/doc/stable/reference/generated/numpy.save.html).

Ensure each multivariate time-series sample is stored in a **separate** `.npy` file.
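For example, two synthetic MTS samples can each be saved to their own `.npy` file with `numpy.save`. The directory, file names, and array shapes below are purely illustrative, and the `(time points × processes)` orientation is an assumption; match your array orientation to the `dim_order` you declare in the next step.

```python
import os

import numpy as np

# Create the data directory if it does not already exist
os.makedirs('database', exist_ok=True)

# Two illustrative synthetic MTS samples. Here each array is stored as
# (time points x processes); check your own data's orientation before saving.
sample1 = np.random.randn(1000, 5)  # 1000 time points, 5 processes
sample2 = np.random.randn(1000, 5)

# Save each sample to its own binary NumPy file
np.save('database/sample1.npy', sample1)
np.save('database/sample2.npy', sample2)
```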

## 3. Configuring the `.yaml` file

Either manually create a `sample.yaml` file or automatically populate it using `create_yaml_for_samples.R`. Ensure that the `.yaml` file is located in the user-specified data directory from the previous step, along with your MTS samples.

An example of a typical `sample.yaml` file for two MTS samples `sample1.npy` and `sample2.npy` is provided below:

```yaml
- {file: ./database/sample1.npy, name: sample1, dim_order: sp, labels: [synthetic,noise] }
- {file: ./database/sample2.npy, name: sample2, dim_order: sp, labels: [synthetic,noise] }
```

Here, the optional parameters `name` (str), `dim_order` (str), and `labels` (list of str) are specified alongside each sample's file path.

<details>

<summary>Automatically generate a sample YAML file (optional)</summary>

If you have many samples, you may wish to automatically populate your sample YAML configuration file. We have provided the R script `create_yaml_for_samples.R` to accomplish this. This script can be run on the command line with the following arguments:

* `--data_dir`: \[*Required*] Data directory in which all samples' MTS NumPy files are stored (e.g. `database/`)
* `--sample_metadata`: \[*Optional*] Path to CSV file containing sample metadata. If supplied, the identifying variable for each sample MTS must be `sampleID`.
* `--label_vars`: \[*Optional*] Variable(s) from the metadata file to include as labels in the YAML file.
* `--overwrite`: \[*Optional flag*] If included, `sample.yaml` will be overwritten if it already exists.

</details>
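If R is unavailable, the configuration file can also be generated with a few lines of Python. The sketch below is illustrative only (it is not part of *pyspi-distribute*), and it assumes every `.npy` file in the data directory shares the same `dim_order` and labels; it simply mirrors the entry format shown in the example above.

```python
import os

def write_sample_yaml(data_dir, dim_order='sp', labels=None, out_name='sample.yaml'):
    """Write a minimal sample.yaml listing every .npy file in data_dir.

    Illustrative sketch only; not part of the pyspi-distribute repository.
    """
    labels = labels or []
    lines = []
    for fname in sorted(os.listdir(data_dir)):
        if not fname.endswith('.npy'):
            continue
        name = os.path.splitext(fname)[0]
        # One flow-style YAML entry per sample, matching the example format
        lines.append(
            f"- {{file: ./{data_dir}/{fname}, name: {name}, "
            f"dim_order: {dim_order}, labels: [{','.join(labels)}] }}"
        )
    with open(os.path.join(data_dir, out_name), 'w') as f:
        f.write('\n'.join(lines) + '\n')
    return lines
```

For example, `write_sample_yaml('database', labels=['synthetic', 'noise'])` would write a `database/sample.yaml` equivalent to the two-sample example above.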

## 4. Submit the *pyspi* jobs to the PBS cluster

**Activate** the conda environment in which *pyspi* is installed (e.g., `conda activate pyspi`). Then, **submit** the jobs from the command line using **`distribute_jobs.py`** (see below for usage).

The `distribute_jobs.py` script works by taking in a data directory (where your MTS samples and `sample.yaml` are located), iterating over each MTS NumPy sample, and submitting a separate PBS job for each sample. The PBS file is automatically generated by the script and submitted via `qsub`.
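For illustration, a per-sample PBS script of the kind described above might look like the sketch below. This is not the actual output of `distribute_jobs.py` (its resource requests and the way it invokes the compute script will differ), and `build_pbs_script` is a hypothetical helper, not part of the repository.

```python
def build_pbs_script(sample_name, walltime_hrs=24, conda_env='pyspi'):
    """Sketch of a per-sample PBS submission script (illustrative only).

    The resource requests and compute-script invocation here are assumptions,
    not the exact contents generated by distribute_jobs.py.
    """
    return f"""#!/bin/bash
#PBS -N pyspi_{sample_name}
#PBS -l walltime={walltime_hrs}:00:00
#PBS -l select=1:ncpus=1:mem=8GB

cd $PBS_O_WORKDIR
conda activate {conda_env}
python pyspi_compute.py {sample_name}
"""

# Each generated script is then submitted with qsub, one job per MTS sample
print(build_pbs_script('sample1'))
```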

The `distribute_jobs.py` script includes several command-line options for user configuration, all of which are optional:

<table><thead><tr><th width="274">Option</th><th>Description</th></tr></thead><tbody><tr><td><code>--data_dir</code></td><td>Data directory in which all samples' MTS NumPy files are stored. If no path is supplied, the default is <code>./database/</code>, relative to the directory in which <code>distribute_jobs.py</code> is located.</td></tr><tr><td><code>--compute_file</code></td><td>The file path to the Python script that actually runs <em>pyspi</em>. The default is <a href="https://github.com/olivercliff/pyspi-distribute/blob/main/pyspi_compute.py">pyspi_compute.py</a> in the directory where <code>distribute_jobs.py</code> is located.</td></tr><tr><td><code>--sample_yaml</code></td><td>The file path to the sample YAML configuration file. The default is <code>./database/sample.yaml</code>.</td></tr><tr><td><code>--pyspi_config</code></td><td>If desired, the file path to a user-generated YAML configuration file specifying a subset of SPIs to compute.</td></tr><tr><td><code>--walltime_hrs</code></td><td>Maximum wall-time allowed for a given job, in hours. The default is <code>24</code>.</td></tr><tr><td><code>--overwrite_pkl</code></td><td>If included, existing <em>pyspi</em> results for a given sample will be overwritten.</td></tr><tr><td><code>--pbs-notify</code></td><td>When PBS should email the user: <code>a</code> = abort, <code>b</code> = begin, <code>e</code> = end. The default is none.</td></tr><tr><td><code>--email</code></td><td>Email address to which PBS notifications are sent.</td></tr></tbody></table>

## 5. Accessing results

The results will be stored in the user-specified data directory under the same name as the NumPy files. For example, if you have the file `database/sample1.npy` in your YAML file, then there will be a new folder called `database/sample1` with a `calc.pkl` file inside that contains the [Calculator](https://time-series-features.gitbook.io/pyspi/information-about-pyspi/api-reference/pyspi.calculator.calculator) object.

To access the results, you must load the Calculator object with [dill](https://pypi.org/project/dill/):

```python
import dill

# e.g., the Calculator pickle for the sample1 example above
with open('database/sample1/calc.pkl', 'rb') as f:
    calc = dill.load(f)
```

Then you can view the contents as per the standard *pyspi* documentation, e.g.,

```python
calc.table
calc.table['cov_EmpiricalCovariance']
```
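When many samples have been processed, you may want to collect the SPI results across all of them. Below is a minimal sketch, assuming the per-sample folder layout described above and that each `calc.pkl` holds a Calculator exposing a `.table` attribute; `collect_results` is an illustrative helper, not part of *pyspi*.

```python
import glob
import os

import dill

def collect_results(data_dir='database'):
    """Load each sample's Calculator and return its SPI table, keyed by sample name.

    Illustrative helper; assumes the data_dir/<sample>/calc.pkl layout.
    """
    results = {}
    for pkl in sorted(glob.glob(os.path.join(data_dir, '*', 'calc.pkl'))):
        sample = os.path.basename(os.path.dirname(pkl))  # folder name = sample name
        with open(pkl, 'rb') as f:
            calc = dill.load(f)
        results[sample] = calc.table  # full SPI table for this sample
    return results

tables = collect_results()
```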

***
