FAQ
These FAQs aim to cover the basic questions new users might have when using pyspi.
How many SPIs should I measure for my dataset?
When starting out, we recommend that users work with a smaller subset of available SPIs first, so they get a sense of computation times and working with the output in a lower-dimensional space. Users have the option to pass in a customised configuration .yaml file as described in the
Alternatively, we provide two pre-defined subsets of SPIs that can serve as good starting points: sonnet and fast. The sonnet subset includes 14 SPIs selected to represent the 14 modules identified through hierarchical clustering in the . To retain as many SPIs as possible while minimising computation time, we also offer a fast option that omits the most computationally expensive SPIs. Either SPI subset can be toggled by setting the corresponding flag in the () function call as follows:
What pre-processing steps are applied to my data?
There are two pre-processing steps that can be applied to your raw multivariate time series (MTS) dataset before computing SPIs:
(1) Detrend: Detrend each time series in the dataset individually along the time dimension using the with default settings. If enabled, detrending is always applied to the dataset before z-score normalisation.
(2) Z-score normalise: Normalise each time series in the dataset individually along the time dimension using the.
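Conceptually, these two steps amount to the following NumPy/SciPy sketch (for illustration only; this is not pyspi's internal implementation):

```python
import numpy as np
from scipy.signal import detrend
from scipy.stats import zscore

# Toy MTS: 3 processes (rows) x 200 time points (columns), with a linear trend
rng = np.random.default_rng(0)
t = np.arange(200)
data = rng.standard_normal((3, 200)) + 0.05 * t

# (1) Remove the linear trend from each time series along the time axis
data_dt = detrend(data, axis=1)

# (2) Z-score each (detrended) time series along the time axis
data_z = zscore(data_dt, axis=1)

print(data_z.mean(axis=1))  # approximately 0 for every process
print(data_z.std(axis=1))   # approximately 1 for every process
```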
By default, when instantiating a Calculator object with your dataset, pyspi will normalise each time series (each representing a process in the MTS dataset) individually along the time axis.
If you would like to specify which pre-processing steps to include or exclude, you can pass the corresponding flags for each operation when instantiating a Calculator. Here are some examples of how you can skip either or both operations:
After successfully instantiating a Calculator object, a summary of the pre-processing steps will be displayed for verification before computing SPIs. Here is an example of the output when explicitly setting the detrending step to False:
How long does pyspi take to run?
This depends on the size of your multivariate time series (MTS) data – both the number of processes and the number of time points (observations). In general, we recommend that users first run pyspi on a small representative sample of their dataset to assess time and computing requirements, and scale up accordingly. The amount of time also depends on the SPI set you are using – whether it is the full set of all SPIs or a reduced subset (like sonnet or fast, described above).
To give users a sense of how long pyspi takes to run, we ran a series of experiments on a high-performance computing cluster with 2 cores, 2 MPI processes, and 40 GB of memory. We ran pyspi on simulated NumPy arrays with either a fixed number of processes (2) or a fixed number of time points (100) to see how timing scales with the array size. Here are the results:
We note that computation times for the sonnet and fast subsets are roughly equivalent, while the full set of SPIs requires increasingly large amounts of time to compute as the time series length grows. With an increasing number of processes (right), the computation time for the full set of SPIs increases with a slope consistent with that of the sonnet and fast subsets.
Here are the timing values for each condition, which can help users estimate the computation time requirements for their dataset:
How can I contribute to pyspi?
Contributions play a vital role in the continual development and enhancement of pyspi, a project built and enriched through community collaboration. By participating in this project, you are contributing to the broader community and helping shape the future of this package.
Do I need to normalise my dataset before applying pyspi?
If you do not wish for pyspi to z-score your data, or you would like more control over how your data is pre-processed, you can pass the normalise=False flag to the Calculator when instantiating.
Can I distribute pyspi calculations across a cluster?
How can I cite pyspi in my work?
Can I run pyspi on my operating system?
pyspi is designed with cross-platform compatibility in mind and can be run on various operating systems, ensuring that a wide range of users have access to pyspi and all of its features. Specifically, pyspi currently supports:
macOS (Python >= 3.9)
Windows (Python >= 3.8)
Linux (Python >= 3.8)
Are there examples showcasing a complete pipeline using pyspi?
How can I save my results from pyspi?
Saving a calculator
Loading a calculator
Code is not the only way to contribute to pyspi. Reviewing pull requests, answering questions to help others with troubleshooting, organising and teaching tutorials, and improving documentation are all invaluable contributions to the project. For further details on how you can contribute to the project, as well as general guidelines for our contributors, please refer to .
When passing your dataset into the Calculator object, pyspi will automatically z-score (normalise) along the time axis by default (see the Data object documentation). This means that you can supply raw values to the Calculator object without having to normalise the dataset as a pre-processing step.
If you have access to a portable batch system (PBS) cluster and are processing MTS with many processes (or are analysing many MTS), then you may find the repository helpful. Each job contains one calculator object that is associated with one MTS. To get started with running pyspi jobs on a PBS-type cluster, follow our guide located .
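For orientation only, a single-job PBS submission script might look like the sketch below; the resource requests and the `run_pyspi.py` script name are placeholders for illustration, not part of pyspi-distribute:

```shell
#!/bin/bash
#PBS -N pyspi_job
#PBS -l ncpus=2
#PBS -l mem=40GB
#PBS -l walltime=04:00:00

cd "$PBS_O_WORKDIR"

# Run one Calculator on one MTS (placeholder script name)
python run_pyspi.py
```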
If you used pyspi in your work, it would be greatly appreciated if you . Feel free to star our repository if you find our package useful, as this also helps to increase awareness in the time-series analysis community.
In all cases, ensure that you have the required version of Python installed, as pyspi is a Python-based package. We actively monitor and work on compatibility issues that may arise with new updates to these operating systems. Users are encouraged to report any compatibility issues they encounter on our , helping us improve pyspi for all users.
Yes, we currently provide two notebooks with examples of complete pipelines using pyspi. These notebooks are available in the section. If you want to share a notebook with additional pipelines or specific use cases, please feel free to contact us.
Once you have computed the SPIs for your dataset, the results will be stored in the calculator object. We recommend saving the calculator object to file using the dill package. To get started, you will need to install dill: pip install dill.
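A minimal save/load round trip with dill might look like the following sketch; here `calc` is a plain dictionary standing in for your computed Calculator object (any picklable Python object works the same way):

```python
import dill

# 'calc' would be your computed Calculator; a dict stands in here
calc = {"example_spi": 0.42}

# Save the calculator to disk
with open("calc.pkl", "wb") as f:
    dill.dump(calc, f)

# Later: load it back and carry on with your analysis
with open("calc.pkl", "rb") as f:
    calc_loaded = dill.load(f)

print(calc_loaded == calc)  # True
```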