Performing calculations
Once an hctsa dataset has been initialized (specifying the details of a time-series dataset and the operations to include, using TS_Init), all results entries in the resulting HCTSA.mat are set to NaN, corresponding to results that have not yet been computed.
Calculations are performed using the function TS_Compute, which stores results back into the matrices in HCTSA.mat. This function can be run without inputs to compute all missing values in the default hctsa file, HCTSA.mat.

Run in this way, TS_Compute will begin evaluating operations on the time series in HCTSA.mat for which elements in TS_DataMat are NaN (i.e., computations that have not been run previously). Results are stored back in the matrices of HCTSA.mat: TS_DataMat (the output of each operation on each time series), TS_CalcTime (the calculation time for each operation on each time series), and TS_Quality (labels indicating errors or special-valued outputs).
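For example, the simplest usage is a single call with no arguments (as described above):

```matlab
% Compute all missing (NaN) entries in the default HCTSA.mat:
TS_Compute
```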
Custom settings for running TS_Compute
TS_Compute accepts a set of optional inputs that customize its behavior:

(1) Computing features in parallel across available cores, using Matlab's Parallel Computing Toolbox. This can be achieved by setting the first input to true:
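For example:

```matlab
% Distribute feature computation across available cores:
TS_Compute(true);
```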
(2) Computing across a custom range of time-series IDs (ts_id) and operation IDs (op_id). This can be achieved by setting the second and third inputs:
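For example (the ID ranges here are illustrative assumptions):

```matlab
% Compute features (in parallel) only for time series with IDs 1-100
% and operations with IDs 1-500:
TS_Compute(true,1:100,1:500);
```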
(3) Specifying what types of values should be computed:
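A sketch of this usage is below; the option names ('missing', the default, computes only uncomputed entries; 'error' retries computations that previously threw errors) follow hctsa's conventions and should be checked against the TS_Compute help text:

```matlab
% Recompute entries that previously returned errors; empty inputs ([])
% are assumed to leave the ID ranges at their defaults:
TS_Compute(true,[],[],'error');
```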
(4) Specifying a custom .mat file to operate on (HCTSA.mat is the default):
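For example (HCTSA_subset.mat is a hypothetical file name):

```matlab
% Operate on a custom file instead of the default HCTSA.mat:
TS_Compute(true,[],[],'missing','HCTSA_subset.mat');
```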
(5) Suppressing command-line output. All computations are displayed to the screen by default (which can be overwhelming, but is useful for error checking). This output can be suppressed by setting the final (sixth) input to false:
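For example (earlier inputs shown at their default values):

```matlab
% Run quietly, without printing each computation to the command line:
TS_Compute(true,[],[],'missing','HCTSA.mat',false);
```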
Computation approaches for full datasets
Computing features for full time-series datasets can be time consuming, especially for large datasets of long time series. An appropriate strategy therefore depends on the time-series length, the number of time series in the dataset, and the computational resources available. When multiple cores are available, it is always recommended to use the parallel setting (i.e., TS_Compute(true)).
Computation time scaling
The first thing to consider is how the time taken to compute the 7749 features of hctsa v0.93 scales with the length of the time series in your dataset (see the plot below). The figure compares results using a single core (e.g., TS_Compute(false)) to those using a 16-core machine with parallelization enabled (e.g., TS_Compute(true)).

Times may vary on individual machines, but the plot above can be used to estimate the computation time per time series, and thus help decide on an appropriate computation strategy for a given dataset.
Note that if computation times are too long for the computational resources at hand, one can always choose a reduced set of features, rather than the full set of >7000, to get a preliminary understanding of the dataset. One such reduced feature set is in INP_ops_reduced.txt. We plan to release additional reduced feature sets, determined according to different criteria, in the future.
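An initialization with the reduced set might look like the following sketch; apart from INP_ops_reduced.txt, the input-file names (and the three-file TS_Init call pattern) are assumptions for illustration:

```matlab
% Initialize an hctsa dataset using the reduced operation set
% (INP_ts.mat and INP_mops.txt are hypothetical input files):
TS_Init('INP_ts.mat','INP_mops.txt','INP_ops_reduced.txt');
```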
On a single machine
If only a single machine is available for computation, there are a couple of options:
For small datasets, when it is feasible to run all computations in a single go, it is easiest to run computations within Matlab in a single call of TS_Compute.

For larger datasets that may run for a long time on a single machine, one may wish to use something like the provided sample_runscript_matlab script, in which TS_Compute commands are run in a loop over time series, computing small sections of the dataset at a time (and saving the results back to file, e.g., HCTSA.mat), eventually covering the full dataset iteratively; a sketch of this pattern is given below.
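The following is a minimal sketch of such a loop; the dataset size and block size are illustrative assumptions (see the provided sample_runscript_matlab for the full version):

```matlab
% Work through HCTSA.mat in small blocks of time series, so results
% are saved back to file as each block completes:
numTimeSeries = 1000; % (assumed) number of time series in HCTSA.mat
blockSize = 25;       % (assumed) number of time series per block
for i = 1:blockSize:numTimeSeries
    tsIDs = i:min(i+blockSize-1,numTimeSeries);
    % Compute missing values for this block of ts_ids only:
    TS_Compute(true,tsIDs,[],'missing','HCTSA.mat',false);
end
```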
On a distributed compute cluster using Matlab
Code for running distributed hctsa computations on a cluster (using PBS or Slurm schedulers) is here. The strategy is as follows: with a distributed computing setup, a local Matlab file (HCTSA.mat) can be split into smaller pieces using TS_Subset, which outputs a new data file for a particular subset of your data. For example, TS_Subset('raw',1:100) will generate a new file, HCTSA_subset.mat, containing just the time series with IDs from 1 to 100. Features for the time series in each such subset can then be computed on the distributed setup, with a different compute node working on each subset (by queuing batch jobs that each process a given subset of time series). After all subsets have been computed, the results are recombined into a single HCTSA.mat file using TS_Combine commands.
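A sketch of the full split-compute-combine cycle is below; the subset file names, and the assumption that TS_Combine takes the files to merge as its leading arguments, should be checked against the hctsa documentation:

```matlab
% (1) Split the dataset: write a new file containing ts_ids 1-100:
TS_Subset('raw',1:100); % generates HCTSA_subset.mat

% (2) On a compute node, fill in the features for that subset:
TS_Compute(true,[],[],'missing','HCTSA_subset.mat');

% (3) After all subsets are computed, recombine them
% (hypothetical file names; assumed TS_Combine call pattern):
TS_Combine('HCTSA_subset_1.mat','HCTSA_subset_2.mat');
```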
Using mySQL to facilitate distributed computing
For datasets that grow over time, large-scale distributed feature computation can be better served by a linked mySQL database, since new time series can easily be added to the database. In this case, computation proceeds similarly to above: shell scripts on a distributed cluster computing environment distribute jobs across cores, with all individual jobs writing to a centralized mySQL server. A set of Matlab code that generates an appropriately formatted mySQL database, and interfaces with it to facilitate hctsa feature computation, is included with the software package and is described in detail here.