Distributing calculations on a cluster

A guide to getting started with distributing pyspi jobs across a PBS-type cluster.


pyspi distribute scripts

All scripts necessary for distributing pyspi jobs across a PBS-type cluster are available in the pyspi-distribute GitHub repository. Each job will contain one calculator object that is associated with one multivariate time series (MTS).

The scripts allow the user to specify a directory containing the MTS files, with each MTS sample stored in a separate binary NumPy (.npy) file. Within this directory, the user must also include a YAML configuration file which specifies the relative location of each .npy file (and, optionally, the name, dim_order, and any relevant labels). An R script, create_yaml_for_samples.R, is provided to populate this configuration file automatically.
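
For example, a data directory holding the two samples used in the configuration example below would contain:

database/
├── sample1.npy
├── sample2.npy
└── sample.yaml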

Using pyspi distribute

The following instructions provide a guide to getting started with distributing calculations on a PBS-type cluster. For a more in-depth example of a typical workflow using pyspi-distribute, see the walkthrough tutorial.

1. Initialising the environment

Follow the pyspi installation guide to install and set up pyspi on your cluster. We recommend installing into a new conda environment to ensure there are no clashes in dependencies.

2. Organising and formatting your data

Organise all of your multivariate time series into a user-specified data directory. If your data is not already in the correct format to be read into pyspi, follow the NumPy guide on saving your data as a .npy file.

Ensure each multivariate time-series sample is stored in a separate .npy file.
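
For instance, here is a minimal sketch of saving one sample from Python (the array shape and file name are illustrative; with dim_order "sp", time points index the rows and processes the columns):

import numpy as np

# Illustrative MTS: 1000 time points (s) recorded from 5 processes (p),
# stored with time as rows to match dim_order "sp".
mts = np.random.randn(1000, 5)

# Each MTS sample is saved to its own .npy file in the data directory.
np.save('./database/sample1.npy', mts)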

3. Configuring the .yaml file

Either manually create a sample.yaml file or automatically populate it using create_yaml_for_samples.R. Ensure that the .yaml file is located in the user-specified data directory, along with your MTS samples.

An example of a typical sample.yaml file for two MTS samples sample1.npy and sample2.npy is provided below:

- {file: ./database/sample1.npy, name: sample1, dim_order: sp, labels: [synthetic,noise] }
- {file: ./database/sample2.npy, name: sample2, dim_order: sp, labels: [synthetic,noise] }

Here, the optional parameters name (str), dim_order (str), and labels (list of str) are specified alongside the file path of each sample.
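
To make these fields concrete, here is a hedged sketch of what one such entry corresponds to when computed locally (assuming the Data and Calculator interfaces from the API reference; exact keyword names may differ):

import numpy as np

from pyspi.calculator import Calculator
from pyspi.data import Data

# Wrap one sample in a Data object, mirroring its YAML entry:
# dim_order "sp" marks rows as time points (s) and columns as processes (p).
data = Data(np.load('./database/sample1.npy'), dim_order='sp', name='sample1')

# Each distributed job computes one Calculator for one such sample.
calc = Calculator(dataset=data)
calc.compute()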

Automatically generate a sample YAML file (optional)

If you have many samples, you may wish to automatically populate your sample YAML configuration file. We have provided the R script create_yaml_for_samples.R to accomplish this. This script can be run on the command line with the following arguments:

  • --data_dir: [Required] Data directory in which all samples' MTS NumPy files are stored (e.g. database/)

  • --sample_metadata: [Optional] Path to CSV file containing sample metadata. If supplied, the identifying variable for each sample MTS must be sampleID.

  • --label_vars: [Optional] Variable(s) from the metadata to include as labels in the YAML file.

  • --overwrite: [Optional flag] If included, sample.yaml will be overwritten if it already exists.
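
If you prefer to stay in Python rather than R, a short sketch along these lines writes the same flow-style entries (the dim_order and labels values are placeholders to adapt to your data; this script is illustrative and not part of pyspi-distribute):

import glob
import os

data_dir = './database'

# Write one flow-style YAML entry per .npy sample found in the data directory.
with open(os.path.join(data_dir, 'sample.yaml'), 'w') as f:
    for path in sorted(glob.glob(os.path.join(data_dir, '*.npy'))):
        name = os.path.splitext(os.path.basename(path))[0]
        f.write(f"- {{file: {data_dir}/{os.path.basename(path)}, name: {name}, "
                f"dim_order: sp, labels: [synthetic,noise]}}\n")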

4. Submit the pyspi jobs to the PBS cluster

Activate the conda environment in which pyspi is installed using conda activate pyspi. Then, submit the jobs from the command line using distribute_jobs.py (see below for usage).

The distribute_jobs.py script works by taking in a data directory (where your MTS samples and sample.yaml are located), iterating over each MTS NumPy sample, and submitting a separate PBS job for each sample. The PBS script for each job is automatically generated and submitted via qsub.

The distribute_jobs.py script includes several command-line options for user configuration, all of which are optional:

  • --data_dir: Data directory in which all samples' MTS NumPy files are stored. If no path is supplied, the default is ./database/ relative to the directory in which distribute_jobs.py is located.

  • --compute_file: The file path to the Python script that actually runs pyspi (pyspi_compute.py). The default is in the directory where distribute_jobs.py is located.

  • --sample_yaml: The file path to the sample YAML configuration file. The default is ./database/sample.yaml.

  • --pyspi_config: If desired, the file path to a user-generated YAML configuration file specifying a subset of SPIs to compute.

  • --walltime_hrs: Maximum wall-time allowed for a given job, in hours. The default is 24.

  • --overwrite_pkl: [Flag] If included, existing pyspi results for a given sample will be overwritten.

  • --pbs-notify: When PBS should email the user: a = abort, b = begin, e = end. The default is none.

  • --email: Email address to which PBS should send notifications.

5. Accessing results

The results will be stored in the user-specified data directory under the same name as the NumPy files. For example, if you have the file database/sample1.npy in your YAML file, then there will be a new folder called database/sample1 containing a calc.pkl file that holds the Calculator object.

In order to access the results, you must load the calculator with dill:

import dill

with open('calc.pkl', 'rb') as f:
    calc = dill.load(f)

Then you can view the contents as per the standard pyspi documentation, e.g.,

calc.table
calc.table['cov_EmpiricalCovariance']
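
Since each sample produces its own calc.pkl, it is often convenient to collect results across all samples. Here is a minimal sketch, assuming the database/<sample>/calc.pkl layout described above:

import glob
import os

import dill

# Load each sample's Calculator and keep its table of SPI results,
# keyed by the sample (folder) name.
tables = {}
for pkl in sorted(glob.glob('./database/*/calc.pkl')):
    sample = os.path.basename(os.path.dirname(pkl))
    with open(pkl, 'rb') as f:
        calc = dill.load(f)
    tables[sample] = calc.table

# e.g., the empirical covariance SPI matrix for sample1
print(tables['sample1']['cov_EmpiricalCovariance'])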
