Adding new SPIs to pyspi
A short tutorial on how you can integrate new SPIs into pyspi.
Overview
pyspi is designed with modularity in mind, making it easy for users to add new SPIs to the existing framework. In this tutorial, we'll walk you through the process of incorporating a new SPI called the Gromov-Wasserstein distance (GWTau), from scratch. This implementation is based on the recent paper by Kravtsova et al.
By following these guidelines, you'll be able to create new SPIs that seamlessly integrate with pyspi and contribute to a robust library of SPIs for the community to use.
Let's get started with adding the GWTau SPI to pyspi!
1. Writing the SPI function 👨💻
By definition, SPIs are pairwise measures - that is, they take in a pair of time series (as arrays) and compute a summary statistic that captures some aspect of the relationship between them (as a scalar-valued output). With this in mind, each SPI corresponds to a function with the following general form:

`s_ij = f(x_i, x_j)`,

where `x_i` and `x_j` are the two time series (assumed to be the same length) on which the SPI is computed, and `s_ij` is a real, scalar-valued output. For a dataset of p processes (time series), pyspi will apply your SPI function to all pairs of time series, yielding a p × p matrix of pairwise interactions (MPI).
Developing your SPI function in isolation
Whether you choose to implement the SPI algorithm from scratch or adapt an existing function from another library, the end goal is to create a function that can be generically applied to pairs of time series. In either case, a good starting point is to develop or adapt the algorithm in isolation first. This approach allows you to focus on the core logic of the SPI without worrying about the broader framework. By working on the SPI independently, you can ensure that the function is correct, efficient, and well-tested before incorporating it into pyspi.
An excellent way to develop your SPI function in isolation is by using an interactive notebook, such as Jupyter Notebook or Google Colab. Interactive notebooks provide several benefits for a first pass:
Iterative Development: Notebooks allow you to write and test code in small chunks, making it easier to iterate and refine your implementation.
Visualisation: You can easily visualise your results and intermediate steps, helping you gain insights into the behavior of your SPI.
Debugging: Interactive notebooks make it convenient to debug your code by examining variables and testing different scenarios.
Documentation: Notebooks support markdown cells, enabling you to document your code and thought process alongside the implementation.
In the example below, we port an existing implementation of GWTau from MATLAB to Python. Our final pairwise function, `gwtau()`, relies on two helper functions: `wass_sorted()` and `vec_geo_dist()`.
First, we import all necessary libraries to compute our SPI. In this implementation, we only require NumPy.
Next, we define our two helper functions based on the implementations in MATLAB.
Now, we define the pairwise function, `gwtau()`, which takes in any two time series, `xi` and `xj`, and returns a scalar output, `sij`.
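To make this concrete, here is a sketch of what the three functions could look like in Python. This is an illustration consistent with the description above (geodesic distances along the time-series curve, then a Wasserstein distance between the sorted distance profiles), not necessarily a line-for-line port of the original MATLAB code:

```python
import numpy as np

def vec_geo_dist(x):
    # Geodesic (cumulative arc-length) distance from the first point of the
    # curve (t, x_t) to every other point, assuming unit time steps.
    steps = np.sqrt(1.0 + np.diff(np.asarray(x, dtype=float)) ** 2)
    return np.concatenate(([0.0], np.cumsum(steps)))

def wass_sorted(x1, x2):
    # 1-Wasserstein distance between two equal-length samples:
    # sort both and average the absolute differences.
    x1 = np.sort(np.asarray(x1, dtype=float))
    x2 = np.sort(np.asarray(x2, dtype=float))
    return float(np.mean(np.abs(x1 - x2)))

def gwtau(xi, xj):
    # Pairwise function: compare the geodesic distance profiles
    # of the two time series. Returns a scalar, sij.
    return wass_sorted(vec_geo_dist(xi), vec_geo_dist(xj))
```

Note that `gwtau()` is symmetric in its arguments and returns zero when both inputs are identical, as we would expect of a distance measure.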
Additional Considerations ❗
If you intend to adapt an SPI from an existing library in Python which is not already included in the requirements, consider the following interactions:
Interactions with existing dependencies: the specific versions pinned by pyspi's dependencies may not be compatible with those required by the new library, leading to version clashes and other issues.
Incompatibilities with different Python versions and operating systems: some dependencies may not be compatible with all operating systems (Windows, MacOS, Linux). We strive to provide support to all users, so keep this in mind if you intend to create a pull request for your SPI.
To minimise potential dependency clashes or other related issues, we recommend users prioritise existing dependencies (see the requirements) or standard Python packages before turning to additional libraries and/or external dependencies.
2. Validating the algorithm ✅
When incorporating a new SPI into pyspi, it's crucial to validate the correctness of your implementation. Validation ensures that the SPI function produces the expected results and behaves consistently with the original proposed algorithm.
Theoretical ground truth
In ideal circumstances, there may be a theoretical ground truth against which you can compare your results. If such a ground truth exists, use it to verify that your implementation produces the expected outputs for a range of input scenarios.
Empirical verification
If there is no theoretical ground truth available, you can empirically verify your results using a standard benchmarking dataset, preferably one provided in the original research paper.
Some general tips:
Validate edge cases: Test your implementation on edge cases to ensure it handles these situations appropriately and produces reasonable outputs. This may include testing on empty arrays, arrays with missing values, arrays containing NaNs/Infs, and arrays of mismatching lengths.
Assess output plausibility: Examine the outputs of your SPI implementation. Are they plausible based on your understanding of the measure and what it computes? Consider whether the outputs fall within an expected range of values and if they exhibit the expected behaviour or relationships. Also, consider the type of the output: do we expect a floating point value or an integer to be returned? For example, we expect a distance measure to return strictly positive, real-valued floating-point outputs, and should be insensitive to the ordering of the two time series (distance measures are symmetric) provided to the function.
Consider numerical precision: Due to differences in floating-point representations and rounding behaviour across languages and hardware, you may encounter discrepancies in the computed values. When comparing results, allow for a small tolerance (e.g., checking that the absolute difference is within a specified threshold) or use appropriate numerical comparison techniques rather than expecting exact equality.
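These plausibility and precision checks are easy to script. Below is a minimal sketch using a hypothetical stand-in SPI, `my_spi()` (mean absolute difference, a simple distance measure), which we check for non-negativity, symmetry, and tolerance-based equality:

```python
import numpy as np

def my_spi(x, y):
    # Hypothetical stand-in for your SPI: mean absolute difference
    # between two equal-length time series (a simple distance).
    return float(np.mean(np.abs(np.asarray(x) - np.asarray(y))))

rng = np.random.default_rng(0)
x, y = rng.standard_normal(100), rng.standard_normal(100)

s_xy, s_yx = my_spi(x, y), my_spi(y, x)
assert s_xy >= 0.0                         # distances are non-negative
assert np.isclose(s_xy, s_yx, atol=1e-12)  # distances are symmetric

# When comparing against a reference implementation (e.g., the original
# MATLAB output), allow a small tolerance rather than exact equality.
assert np.isclose(s_xy, s_xy + 1e-9, atol=1e-6)
```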
Verify reproducibility across runs
Stochastic SPIs
If your SPI implementation includes stochastic elements, it's essential to ensure that the results are reproducible when using the same random seed. Test your implementation with a fixed random seed and verify that it consistently produces the same (within some tolerance) output across multiple runs.
Deterministic SPIs
Even if your SPI is purely deterministic, it's crucial to verify that the algorithm is stable and produces consistent results across multiple runs. When dealing with floating-point operations, be mindful of how errors due to limited numerical precision can propagate through your algorithm or become magnified by certain operations e.g., subtracting nearly equal quantities, or dividing by numbers close to zero.
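The fixed-seed check for stochastic SPIs can be scripted as follows. Here, `noisy_spi()` is a hypothetical stochastic stand-in (a correlation estimated on a random subsample of time points), used only to illustrate the pattern:

```python
import numpy as np

def noisy_spi(x, y, seed=None):
    # Hypothetical stochastic SPI: correlation on a random half-subsample
    # of time points. The seed controls the subsample selection.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=len(x) // 2, replace=False)
    return float(np.corrcoef(x[idx], y[idx])[0, 1])

rng = np.random.default_rng(123)
x, y = rng.standard_normal(200), rng.standard_normal(200)

# With the same seed, repeated runs should agree (within tolerance).
a = noisy_spi(x, y, seed=7)
b = noisy_spi(x, y, seed=7)
assert np.isclose(a, b)
```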
In the example below, we validate our adapted Python implementation of GWTau against the original implementation in MATLAB, using the dataset provided in the original paper.
For a visual comparison, we can plot a heatmap of the outputs for each pair of time series in the benchmarking dataset, which consists of 29 processes, each with 40 observations, yielding a 29 × 29 matrix of pairwise interactions. Note that since GWTau is an undirected statistic, the resulting MPI is symmetric about the diagonal - that is, M(i, j) = M(j, i).
3. Optimising the algorithm 🚄
Once you have a working and correct first pass of your SPI function, it can be beneficial to optimise its performance before integrating it into pyspi. Optimisation is advantageous for several reasons:
Scalability: As the size of the MTS dataset grows, the number of pairwise computations increases quadratically. As such, it's important that the SPI function can handle large-scale computations without consuming excessive resources and time.
Computational efficiency: pyspi is designed to work with large datasets containing hundreds or even thousands of time series. An unoptimised SPI function can lead to long computation times, making it impractical for users to analyse their data efficiently.
Low-Hanging Fruits 🍎
While the specific details of optimisations will depend on the nature of your SPI, below we outline several low-hanging fruits that should be considered when implementing any algorithm - these optimisations can lead to drastic reductions in computational time and resources and are relatively low effort to implement:
Vectorisation: Use NumPy's vectorised operations wherever possible to perform computations on entire arrays instead of looping over each element.
Caching: If your SPI involves repeated operations on the same data, consider caching intermediate results to avoid redundant calculations.
Efficient algorithms: Review your algorithm and look for opportunities to improve its efficiency. This might involve using more efficient data structures or leveraging existing sub-routines.
Optimised libraries: It may be possible that aspects of your algorithm already have existing implementations in other Python libraries. While we recommend users minimise the number of additional dependencies (to avoid clashes and other compatibility issues), you may be able to use the library's source code to inform the types of optimisations to make.
Profiling and Benchmarking 🗃️
Before optimising, be sure to run your original validated SPI function on a representative dataset and record the execution time (e.g., using the standard `time` module). This will serve as a baseline for comparison. Advanced users can consider libraries such as `cProfile` to identify bottlenecks in the SPI function code.
Each time you optimise your code, re-run your function on the same dataset and compare the execution times with the baseline. Repeat iteratively, focusing on one optimisation at a time, until you are satisfied with the performance improvements.
In the example below, we showcase optimisations made to both helper functions in the GWTau implementation, focusing on several of the 'low-hanging fruits' outlined above.
First, we import the standard `time` module in Python for basic benchmarking. The original `vec_geo_dist()` function involves a `for` loop; running it on two test time series, each of length 1000 observations, and taking the mean execution time across 10,000 iterations gives us a baseline. Leveraging NumPy's vectorised operations, we can then rewrite `vec_geo_dist()` and repeat the same benchmarking test on the same data.
In this simple example, we achieve around two orders of magnitude speed-up by using NumPy vectorised operations. For hundreds or even thousands of SPI computations, these savings can accumulate to result in large and meaningful differences in execution time!
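The benchmarking workflow described above can be sketched as follows. Both versions of `vec_geo_dist()` are illustrative reconstructions (a loop-based version and a vectorised one), and we use fewer repeats than the 10,000 iterations mentioned above to keep the example quick:

```python
import time
import numpy as np

def vec_geo_dist_loop(x):
    # Naive version: accumulate the arc length with an explicit for loop
    d = np.zeros(len(x))
    for i in range(1, len(x)):
        d[i] = d[i - 1] + np.sqrt(1.0 + (x[i] - x[i - 1]) ** 2)
    return d

def vec_geo_dist_vec(x):
    # Vectorised version: the same computation with NumPy array operations
    steps = np.sqrt(1.0 + np.diff(x) ** 2)
    return np.concatenate(([0.0], np.cumsum(steps)))

x = np.random.default_rng(0).standard_normal(1000)

# Sanity check: both versions must agree before comparing speed
assert np.allclose(vec_geo_dist_loop(x), vec_geo_dist_vec(x))

for fn in (vec_geo_dist_loop, vec_geo_dist_vec):
    start = time.perf_counter()
    for _ in range(100):
        fn(x)
    print(fn.__name__, (time.perf_counter() - start) / 100)
```

The equivalence check before timing is important: an optimisation that changes the output is not an optimisation.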
Using the two optimised versions of the helper functions from the above example, the following plot showcases the performance improvements over the naive implementation of GWTau on a pair of time series of varying lengths (across 5000 repeated calculations per time series length):
4. Incorporating your SPI into the pyspi framework 🌏
Once you have verified that your optimised SPI consistently produces the expected results, the next step is to incorporate the function into the pyspi library.
In short, pyspi primarily relies on decorators and base classes to provide a structure for handling different types of statistical measures in the library. This allows for consistent input parsing and computation of univariate, bivariate, and multivariate statistics. The base classes also enable the categorisation of measures based on their properties (directed/undirected, signed/unsigned).
SPIs inherit from base classes that encode whether the measure is directed or undirected and signed or unsigned, while function decorators handle the parsing of univariate, bivariate, or multivariate input data.
In our example, given that the GWTau SPI is a pairwise distance measure, it is undirected (symmetric), unsigned, and bivariate.
When writing a class for an SPI in pyspi, ensure it includes the following attributes as standard:
`name`: the name of the SPI.
`identifier`: a unique shorthand name for the SPI - this is what is returned in the final results table and is used to index the MPI corresponding to the SPI.
`labels`: descriptive keywords corresponding to the SPI.
Below, we write a class for our optimised implementation of GWTau that inherits from the appropriate base classes. Given that our SPI is a distance-based measure, we add it to the distance module (`pyspi/statistics/distance.py`).
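Here is a sketch of how such a class might look. The base classes and decorator are shown as stand-ins: in real pyspi code they would be imported from pyspi's base module (check `pyspi/base.py` for the exact names and signatures), and the `data` object would be pyspi's internal data container rather than a plain array:

```python
import numpy as np

# Stand-ins for pyspi's base classes and input-parsing decorator; in the
# actual library these come from pyspi's base module, not defined here.
class Undirected: ...
class Unsigned: ...

def parse_bivariate(func):
    # Placeholder for pyspi's bivariate input-parsing decorator
    return func

class GromovWasserstainTau(Undirected, Unsigned):
    name = "Gromov-Wasserstain Tau"        # name of the SPI
    identifier = "gwtau"                   # shorthand used in the results table
    labels = ["distance", "unsigned", "undirected", "nonlinear"]

    @parse_bivariate
    def bivariate(self, data, i=None, j=None):
        # Here `data` is assumed to be array-like with shape (p, T);
        # pyspi's real data container differs, but the logic is the same.
        z = np.asarray(data, dtype=float)
        di = np.concatenate(([0.0], np.cumsum(np.sqrt(1.0 + np.diff(z[i]) ** 2))))
        dj = np.concatenate(([0.0], np.cumsum(np.sqrt(1.0 + np.diff(z[j]) ** 2))))
        return float(np.mean(np.abs(np.sort(di) - np.sort(dj))))
```

The class attributes (`name`, `identifier`, `labels`) follow the standard listed above, and the inheritance from `Undirected` and `Unsigned` records the measure's properties.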
Note that in this example, our SPI does not rely on any additional parameters/keyword arguments. As a result, we obtain a single SPI. Depending on your particular SPI, there may be several hyperparameters or unique configurations.
Below, we show an example of an existing SPI, `hsic`, which accepts a parameter, `biased`, as input. As a result, there are two unique implementations of the same base SPI: one corresponding to `biased=True` and another to `biased=False`.
We thus obtain the following two SPIs:
hsic
hsic_biased
For more examples of SPIs that rely on several parameters and their corresponding implementations, refer to the source code for pyspi.
5. Updating the configuration files ♻️
To make the new SPI accessible to the `Calculator`, we will need to add it to the relevant configuration YAML file(s). If you intend to make a pull request for your SPI, ensure that the default config file (`config.yaml`) includes your SPI.
In pyspi, all configuration files follow a specific nested format:
module: This is where the SPI class is located. For example, the GWTau SPI is located in `pyspi/statistics/distance.py`, so we import from the `.statistics.distance` module. pyspi modules are formed from conceptual groupings, e.g., `distance`, `infotheory`, `basic`, `causal`, `misc`, `spectral`, and `wavelet`, so consider which would be most appropriate for your SPI.
base SPI name: The base SPI name is the name of the SPI class. Ensure this matches the name of the class defined in the relevant module, e.g., `GromovWasserstainTau`.
labels: Keywords pertaining to the SPI. These keywords can be used to filter SPIs with the `filter_spis()` function, so ensure that the labels accurately describe the pairwise measure.
dependencies: External dependencies required to compute the SPI. These are dependencies which cannot be installed directly using `pip` and will need to be installed and managed separately. Examples include `octave` and `java`.
configs: SPI-specific settings and configurations. These are additional parameters and keyword arguments that are passed directly to the SPI function's constructor, so they will need to be specified with the correct keys and values. For example, the `hsic` SPI function constructor takes the argument `biased`, which can take a value of `True` or `False`.
Here is an example of the configuration for the GWTau SPI. Since there are no external dependencies or additional function arguments, we pass an empty list as the values for the `dependencies` and `configs` keys, respectively.
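A hypothetical entry following the nested format described above might look like the following; check the existing entries in `config.yaml` for the exact schema, as the labels and layout here are illustrative:

```yaml
# Hypothetical config.yaml entry for the GWTau SPI
.statistics.distance:
  GromovWasserstainTau:
    labels:
      - distance
      - unsigned
      - undirected
      - nonlinear
    dependencies: []
    configs: []
```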
For further examples of SPIs with configs and external dependencies, see the source code.
Note that pyspi also includes several pre-defined subsets. If you would like to include your new SPI in an existing subset, make sure to add the configuration to the relevant YAML files (e.g., `fast_config.yaml` for the fast subset).
And that's it. You should now be able to compute your SPI using the pyspi Calculator.
When comparing results in pyspi to your benchmarking outputs, keep in mind that pyspi normalises time series before computing SPIs, as standard.
6. [OPTIONAL] Create a pull request 🖐️
We strive to provide the most comprehensive toolkit of SPIs to our users. If you would like your SPI to be incorporated into the pyspi library, you can create a pull request. Generally, the only requirement is that the SPI is accompanied by a published research paper, so be sure to include a reference in your pull request. Here is a general checklist we encourage users to follow before submitting a pull request:
Pull Request Checklist