Distributing Calculations
Example workflow for distributing pyspi calculations on a PBS cluster.
Let's begin by importing the necessary libraries for this simple workflow example:
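The exact imports depend on your script; a minimal sketch for the steps below (assuming `numpy` for the data and PyYAML for the configuration file) might be:

```python
import os            # creating the data directory
import numpy as np   # generating and saving the MTS data as .npy files
import yaml          # writing the dataset configuration file (requires PyYAML)
```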
You will also need to download the following scripts for distributing pyspi jobs across a PBS-type cluster:
Create a new directory for your workflow (e.g., `pyspi-distribute/`) and copy over the three scripts from above. Also create a folder for your multivariate time series data; here we will call this folder `example_data`. Your file structure should look like the following:
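As an illustration (the three script names below follow the pyspi-distribute repository; treat them as assumptions if your copy differs):

```
pyspi-distribute/
├── distribute_jobs.py
├── pyspi_compute.py
├── template.pbs
└── example_data/
```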
1. Preparing the MTS data
Save each multivariate time series (MTS) to its own numpy binary file (.npy). We'll generate three example datasets that are each stored as a separate entry in a dictionary.
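For instance, the three datasets could be generated and saved as follows (the sizes, file names, and processes-by-time layout are illustrative assumptions):

```python
import os
import numpy as np

rng = np.random.default_rng(0)

# Three example MTS, each with 5 processes and 100 time points.
datasets = {f"Dataset_{i}": rng.standard_normal((5, 100)) for i in range(3)}

# Save each MTS to its own .npy binary file in the data folder.
os.makedirs("example_data", exist_ok=True)
for name, mts in datasets.items():
    np.save(os.path.join("example_data", f"{name}.npy"), mts)
```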
Note: By default, each process is z-scored along the time domain, so there is no need to normalise the data beforehand. If you would like to disable this functionality, you can do so by setting `normalise=False` in line 108 of `distribute_jobs.py`.
2. Configuring the YAML file
Now that we have saved our MTS data to separate `.npy` files, we need to provide a configuration file which specifies the relative location of each `.npy` file (and, optionally, the `name`, `dim_order`, and any relevant `labels`). See here for more details on how to create a YAML file for distributing pyspi calculations. For this example, we will use the following code to generate the YAML file called `sample.yaml`:
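A sketch of such a generator is below. The exact schema expected by `distribute_jobs.py` may differ; the field names (`file`, `name`, `dim_order`, `labels`) and the `ps` (processes-by-time) dimension order here are assumptions:

```python
import os
import yaml  # requires PyYAML

# One entry per MTS file, with its relative path and optional metadata.
config = [
    {
        "file": f"example_data/Dataset_{i}.npy",
        "name": f"Dataset_{i}",
        "dim_order": "ps",            # processes x time points (assumed)
        "labels": [f"Dataset_{i}"],   # arbitrary metadata tags
    }
    for i in range(3)
]

os.makedirs("example_data", exist_ok=True)
with open("example_data/sample.yaml", "w") as f:
    yaml.dump(config, f, sort_keys=False)
```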
This will produce the following YAML file:
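As an illustrative sketch (the field names and layout are assumptions; check the pyspi-distribute documentation for the exact schema your version expects), the file would take roughly this shape:

```yaml
- file: example_data/Dataset_0.npy
  name: Dataset_0
  dim_order: ps
  labels:
  - Dataset_0
- file: example_data/Dataset_1.npy
  name: Dataset_1
  dim_order: ps
  labels:
  - Dataset_1
- file: example_data/Dataset_2.npy
  name: Dataset_2
  dim_order: ps
  labels:
  - Dataset_2
```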
Note that here we set the MTS `name` to `Dataset_{X}` as well as the `labels`, but you can use the `labels` argument to include any metadata about a given MTS that you wish.
Your file structure prior to submitting the jobs to the PBS cluster should look like the following:
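For instance (script names assumed from the pyspi-distribute repository):

```
pyspi-distribute/
├── distribute_jobs.py
├── pyspi_compute.py
├── template.pbs
└── example_data/
    ├── Dataset_0.npy
    ├── Dataset_1.npy
    ├── Dataset_2.npy
    └── sample.yaml
```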
3. Submitting jobs to the PBS cluster
Now that we've saved our MTS datasets to `.npy` files and generated the configuration YAML file, we're ready to submit PBS jobs through `pyspi-distribute`. Use the following shell script template as a guide:
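A hedged sketch of such a submission command is below; every flag name here is an assumption (run `python distribute_jobs.py --help` to see the options your version of pyspi-distribute actually accepts):

```shell
python distribute_jobs.py \
    --data_dir example_data/ \
    --compute_file pyspi_compute.py \
    --template_pbs_file template.pbs \
    --sample_yaml example_data/sample.yaml \
    --calc_file_name CALC_FILE_NAME_HERE \
    --email YOUR_EMAIL_HERE \
    --conda_env YOUR_CONDA_ENV_HERE \
    --queue YOUR_PBS_QUEUE_HERE \
    --walltime_hrs WALLTIME_HOURS_HERE \
    --cpu CPUS_REQUEST_HERE \
    --mem MEMORY_REQUEST_GBS_HERE \
    --table_only
```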
Please replace the following placeholders with your actual values before running this command:
- `CALC_FILE_NAME_HERE`: Name of your calculation file (e.g., `calc.pkl`).
- `YOUR_EMAIL_HERE`: Your email address to receive notifications (e.g., `example@example.com`).
- `YOUR_CONDA_ENV_HERE`: Name of your Conda environment (e.g., `my_conda_env`).
- `YOUR_PBS_QUEUE_HERE`: Name of your PBS queue (e.g., `batch` or `short`).
- `WALLTIME_HOURS_HERE`: Number of hours requested for walltime (e.g., `4`).
- `CPUS_REQUEST_HERE`: Number of CPUs requested (e.g., `8`).
- `MEMORY_REQUEST_GBS_HERE`: Amount of memory requested in GB (e.g., `32`).
It is recommended that you do a trial run with a small example dataset to get a sense of the time and memory requirements for your full dataset before submitting all the jobs. We also include the optional `--table_only` flag so that only the SPI results table is saved, as opposed to the entire calculator object; you can omit this flag if you wish to save the whole object.
4. Accessing results
Once an individual PBS job is completed, you will find a new folder in your data directory (`example_data/`) with the corresponding sample name (e.g., `Dataset_0`) that contains job output information as well as the saved pyspi computation result in a `.pkl` file. Since we set the `--table_only` flag, we can read in the pickle file to get the SPI results table:
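For example (a sketch: the output path and calc file name below are assumptions, and we first write a stand-in table so the snippet runs end-to-end; in practice the completed PBS job creates this file):

```python
import os
import pandas as pd

results_path = "example_data/Dataset_0/calc.pkl"  # assumed calc file name

# Stand-in for a completed job's output; in practice the PBS job writes this.
os.makedirs("example_data/Dataset_0", exist_ok=True)
pd.DataFrame(
    {"proc-0": [0.0, 0.9], "proc-1": [0.9, 0.0]},
    index=["proc-0", "proc-1"],
).to_pickle(results_path)

# With --table_only, the pickle holds the SPI results table (a pandas DataFrame).
spi_table = pd.read_pickle(results_path)
print(spi_table.shape)
```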