Distributing calculations on a cluster
A guide to get started with distributing pyspi jobs across a PBS-type cluster.
pyspi distribute scripts
All scripts necessary for distributing pyspi jobs across a PBS-type cluster are readily accessible in the pyspi-distribute GitHub repository. Each job will contain one calculator object that is associated with one multivariate time-series (MTS).
The scripts allow the user to specify a directory containing the MTS files, with each MTS sample being stored in a separate binary NumPy (.npy) file. Within this directory, the user needs to also include a YAML configuration file which specifies the relative location of each .npy file (and, optionally, the name, dim_order, and any relevant labels). An R script to automatically populate this configuration file is provided for the user: create_yaml_for_samples.R.
Using pyspi distribute
The following instructions provide a guide to getting started with distributing calculations on a PBS-type cluster. For a more in-depth example of a typical workflow using pyspi-distribute see the walkthrough tutorial here.
1. Initialising the environment
Follow the pyspi installation guide to install and set up pyspi on your cluster. We recommend installing into a new conda environment to avoid dependency clashes.
2. Organising and formatting your data
Organise all of your multivariate time series into a user-specified data directory. If your data is not already in the correct format to be read into pyspi, follow the NumPy guide on saving your data as a .npy file here.
Ensure each multivariate time-series sample is stored in a separate .npy file.
3. Configuring the .yaml file
Either manually create a sample.yaml file or automatically populate it using create_yaml_for_samples.R. Ensure that the .yaml file is located in the recently created user-specified data directory, along with your MTS samples.
In a typical sample.yaml file for two MTS samples, sample1.npy and sample2.npy, each entry specifies the file path of the sample alongside the optional parameters name (str), dim_order (str), and labels (list of str).
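As a rough illustration, a configuration file of this shape could also be generated from Python. The sketch below is not the provided create_yaml_for_samples.R script: the function names, the file key, the example dim_order value "ps", and the label used in the test are assumptions made for illustration. Each entry is written as a JSON object, which is also a valid YAML flow mapping, so no YAML library is required.

```python
import json
from pathlib import Path


def build_sample_yaml(data_dir, dim_order="ps", labels=()):
    """Return YAML text with one entry per .npy file found in data_dir."""
    data_dir = Path(data_dir)
    lines = []
    for npy_file in sorted(data_dir.glob("*.npy")):
        entry = {
            # "file" is an assumed key name; the path is written relative to
            # the parent of the data directory, e.g. database/sample1.npy
            "file": f"{data_dir.name}/{npy_file.name}",
            "name": npy_file.stem,
            "dim_order": dim_order,  # assumed example value
            "labels": list(labels),
        }
        # A JSON object is also a valid YAML flow mapping.
        lines.append("- " + json.dumps(entry))
    return "\n".join(lines) + "\n"


def write_sample_yaml(data_dir, **kwargs):
    """Write sample.yaml into data_dir, next to the MTS files."""
    out_path = Path(data_dir) / "sample.yaml"
    out_path.write_text(build_sample_yaml(data_dir, **kwargs))
    return out_path
```

Calling write_sample_yaml("database") would then place sample.yaml alongside the .npy files, with one line per sample.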
4. Submit the pyspi jobs to the PBS cluster
Activate the conda environment in which pyspi is installed using: conda activate pyspi. Then, submit the jobs from the command line using distribute_jobs.py (see below for usage).
The script distribute_jobs.py works by taking in a data directory (where your MTS samples and sample.yaml are located), iterating over each MTS NumPy sample, and submitting a separate PBS job for each sample. The PBS file is automatically generated by this script and submitted via qsub.
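The workflow just described (one PBS job per sample, generated and submitted automatically) can be pictured with the following sketch. It is not the code of distribute_jobs.py: the function names, the job-script layout, and the compute command are hypothetical, and only the standard PBS directives and qsub itself are real. By default it performs a dry run and simply returns the script text, so it can be tried without a scheduler.

```python
import subprocess
from pathlib import Path


def make_pbs_script(sample_file, walltime_hrs=24, compute_cmd="python compute.py"):
    """Build the text of a PBS job script for one MTS sample (hypothetical layout)."""
    sample_file = Path(sample_file)
    return "\n".join([
        "#!/bin/bash",
        f"#PBS -N {sample_file.stem}",                       # job name
        f"#PBS -l walltime={int(walltime_hrs):02d}:00:00",   # maximum wall-time
        "cd $PBS_O_WORKDIR",                                 # directory qsub was called from
        f"{compute_cmd} {sample_file.as_posix()}",           # placeholder compute command
        "",
    ])


def distribute(data_dir, walltime_hrs=24, dry_run=True):
    """Create one job script per .npy sample and optionally submit each with qsub."""
    scripts = {}
    for sample_file in sorted(Path(data_dir).glob("*.npy")):
        script = make_pbs_script(sample_file, walltime_hrs)
        scripts[sample_file.name] = script
        if not dry_run:
            pbs_path = sample_file.with_suffix(".pbs")
            pbs_path.write_text(script)
            subprocess.run(["qsub", str(pbs_path)], check=True)
    return scripts
```

Keeping script generation separate from submission makes it easy to inspect what would be sent to the scheduler before any job is queued.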
The distribute_jobs.py script includes several command-line options for user configuration, all of which are optional:
--data_dir: Data directory in which all samples' MTS NumPy files are stored. If no path is supplied, the default is ./database/ from the directory in which distribute_jobs.py is located.
--compute_file
--sample_yaml: The file path to the sample YAML configuration file. The default is ./database/sample.yaml.
--pyspi_config: If desired, the file path to a user-generated YAML configuration file specifying a subset of SPIs to compute.
--walltime_hrs: Maximum wall-time allowed for a given job, in hours. The default is 24.
--overwrite_pkl: Including this flag means that existing pyspi results for a given sample will be overwritten.
--pbs-notify: When PBS should email the user; a=abort, b=begin, e=end. The default is none.
--email: Email address to which PBS should send notifications.
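To make the options concrete, the sketch below assembles an example command line from a subset of the flags listed above. The helper name build_command, the use of python to launch the script, and the specific values (a 12-hour wall-time, the ./database/ directory) are illustrative assumptions; only the flag names come from the list. It uses shlex.join, which requires Python 3.8 or later.

```python
import shlex


def build_command(data_dir=None, sample_yaml=None, walltime_hrs=None, overwrite_pkl=False):
    """Return the argument list for a distribute_jobs.py call; unset options are omitted."""
    cmd = ["python", "distribute_jobs.py"]
    if data_dir is not None:
        cmd += ["--data_dir", data_dir]
    if sample_yaml is not None:
        cmd += ["--sample_yaml", sample_yaml]
    if walltime_hrs is not None:
        cmd += ["--walltime_hrs", str(walltime_hrs)]
    if overwrite_pkl:
        cmd.append("--overwrite_pkl")  # boolean flag: present or absent
    return cmd


cmd = build_command(
    data_dir="./database/",
    sample_yaml="./database/sample.yaml",
    walltime_hrs=12,
    overwrite_pkl=True,
)
print(shlex.join(cmd))
```

Because every option is optional, calling build_command() with no arguments yields the bare invocation, which relies entirely on the defaults.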
5. Accessing results
The results will be stored in the user-specified data directory under the same name as the NumPy files. For example, if you have the file database/sample1.npy in your YAML file, then there will be a new folder called database/sample1 with a calc.pkl file inside that contains the Calculator object.
To access the results, load the calculator with dill. You can then view the contents as per the standard pyspi documentation.
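A minimal sketch of these two steps follows, assuming dill and pyspi are installed in the active environment and that results follow the folder layout described above. The helper names results_path and load_calculator are hypothetical; calc.table is the results table exposed by the pyspi Calculator.

```python
from pathlib import Path


def results_path(sample_file):
    """Map a sample such as database/sample1.npy to database/sample1/calc.pkl."""
    return Path(sample_file).with_suffix("") / "calc.pkl"


def load_calculator(sample_file):
    """Load the pickled Calculator object saved for the given sample."""
    import dill  # imported here so the path helper works without dill installed

    with open(results_path(sample_file), "rb") as f:
        return dill.load(f)


# Example usage (requires a finished job for this sample):
# calc = load_calculator("database/sample1.npy")
# print(calc.table)  # table of the computed SPIs
```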