
Inspecting errors

When applying thousands of time-series analysis methods to diverse datasets, many operations can give results that are not real numbers. Some time series may be inappropriate for a given operation (such as fitting a positive-only distribution to data that are not positive, or measuring stationarity across 2000 datapoints in a time series that is shorter than 2000 samples). Other times, an optimization routine may fail, or some unexpected error may be thrown.

Some errors are not problems with the code, but represent issues with applying particular sets of code to particular time series, such as when a Matlab fitting function reaches the maximum number of iterations and returns an error. Other errors are genuine problems with the code that need to be corrected. Both cases are labeled as errors in our framework.

It can be good practice to visualize where special values and errors are occurring after a computation to see where things might be going wrong, using TS_InspectQuality. This can be run in four modes:

  1. TS_InspectQuality('summary'); [default] Summarizes the proportion of special-valued outputs in each operation as a bar plot, ordered by the proportion of special-valued outputs.

  2. TS_InspectQuality('master'); Plots which types of special-valued outputs were encountered for each master operation.

  3. TS_InspectQuality('full'); Plots the full data matrix (all time series as rows and all operations as columns), and shows where each possible special-valued output can occur (including 'error', 'NaN', 'Inf', '-Inf', 'complex', 'empty', or a 'link error').

  4. TS_InspectQuality('reduced'); As 'full', but includes only columns where special values occurred.
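A typical post-computation check might run the summary and reduced views back-to-back, using the modes listed above:

```matlab
% Summarize the proportion of special-valued outputs per operation:
TS_InspectQuality('summary');

% Then inspect only the columns (operations) where special values occurred:
TS_InspectQuality('reduced');
```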

For example, running TS_InspectQuality('summary') loads data from HCTSA.mat and produces a summary figure, which can be zoomed in on and explored to understand which features are producing problematic outputs.

Errors with compiled code

Note that errors returned from Matlab files do not halt the progress of the computation (they are handled using try-catch statements), but errors in compiled mex functions (or external command-line packages like TISEAN) can produce a fault that crashes Matlab or the system. We have performed some basic testing on all mex functions, but for some unusual time series, such faults may still occur. These situations must be dealt with either by identifying and fixing the problem in the original source code and recompiling, or by removing the problematic code.

Troubleshooting errors

Once TS_InspectQuality has identified operations that produce special-valued outputs (and listed their IDs), it can be useful to test examples by re-running pieces of code with the problematic data. The function TS_WhichProblemTS can be used to retrieve time series from an hctsa dataset that caused a problem for a given operation.

Usage is as follows:

    % Find time series that failed for the operation with ID = 684.
    [ts_ind, dataCell, codeEval] = TS_WhichProblemTS(684);

This provides the list of time series IDs (ts_ind), their time-series data vectors (dataCell), and the code to evaluate the given operation (in this case, the master operation code corresponding to the operation with ID 684).

You can then pick an example time series (e.g., the first problem time series: x = dataCell{1}; x_z = zscore(x)), and copy and paste the code in codeEval into the command line to evaluate it for this time series. This method allows easy debugging and inspection of examples of time-series data that caused problems for particular operations flagged through the TS_InspectQuality process.
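Continuing from TS_WhichProblemTS, a minimal debugging session might look like the following sketch (eval is used here as a stand-in for pasting codeEval at the command line):

```matlab
% Retrieve the time series that caused problems for operation ID 684:
[ts_ind, dataCell, codeEval] = TS_WhichProblemTS(684);

% Pick the first problem time series and z-score it (many operations take x_z):
x = dataCell{1};
x_z = zscore(x);

% Re-evaluate the master operation code on this time series to reproduce the error:
eval(codeEval);
```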

Working with hctsa files

When running hctsa analyses, often you want to take subsets of time series (to look in more detail at a subset of your data) or subsets of operations (to explore the behavior of different feature subsets), or combine multiple subsets of data together (e.g., as additional data arrive).

The hctsa package contains a range of functions for these types of tasks; they work directly with hctsa .mat files and are described below. Note that these types of tasks are easier to manage when hctsa data are stored in a mySQL database.

Retrieving time series (or operations) of interest by matching on assigned keywords using TS_GetIDs

Many time-series classification problems involve filtering subsets of time series based on keyword matching, where keywords are specified in the input file provided when initializing a dataset.

Most filtering functions (such as those listed in this section) require you to specify a range of IDs of TimeSeries or Operations. Recall that each TimeSeries and Operation is assigned a unique ID (the ID field in the corresponding metadata table). To quickly get the IDs of time series that match a given keyword, the following function can be used:

    TimeSeriesIDs = TS_GetIDs(theKeyword,'HCTSA_N.mat');

Or the IDs of operations tagged with the 'entropy' keyword:

    OperationIDs = TS_GetIDs('entropy','norm','ops');

These IDs can then be used in the functions below (e.g., to clear data, or extract a subset of data).

Note that to get a quick impression of the unique time-series keywords present in a dataset, use the function TS_WhatKeywords, which gives a text summary of the unique keywords in an hctsa dataset.

Clearing or removing data from an hctsa dataset using TS_LocalClearRemove

Sometimes you may want to remove a time series from an hctsa dataset, for example because its data was not properly processed. Or an operation may have produced errors because of a missing toolbox reference, or you may have altered the code for an operation and want to clear the stored results from previous calculations.

For example, you may often want to remove from your operation library operations that are dependent on the location of the data (e.g., its mean: 'locdep'), that only operate on positive-only time series ('posOnly'), that require the TISEAN package ('tisean'), or that are stochastic (i.e., they give different results when repeated, 'stochastic').

The function TS_LocalClearRemove achieves these tasks when working directly with .mat files (NB: if using a mySQL database, SQL_ClearRemove should be used instead).

TS_LocalClearRemove loads in an hctsa .mat data file, clears or removes the specified time series or operations, and then writes the result back to the file.

Example 1: Clear all computed data from time series with IDs 1:5 from HCTSA.mat (specifying 'raw'):

    TS_LocalClearRemove('ts',1:5,0,'raw');

Example 2: Remove all operations with the keyword 'tisean' (that depend on the TISEAN package) from HCTSA.mat:

    TS_LocalClearRemove('ops',TS_GetIDs('tisean','raw','ops'),1,'raw');

Example 3: Remove all operations that require positive-only data (the 'posOnly' keyword) from HCTSA.mat:

    TS_LocalClearRemove('ops',TS_GetIDs('posOnly','raw','ops'),1,'raw');

Example 4: Remove all operations that are location dependent (the 'locdep' keyword) from HCTSA.mat:

    TS_LocalClearRemove('ops',TS_GetIDs('locdep','raw','ops'),1,'raw');

See the documentation in the function file for additional details about the inputs to TS_LocalClearRemove.

Extracting a subset from an hctsa dataset using TS_Subset

Sometimes it's useful to retrieve a subset of an hctsa dataset: when analyzing just a particular class of time series, for example, or investigating a balanced subset of data for time-series classification, or comparing the behavior of a reduced subset of features. This can be done with TS_Subset, which takes in an hctsa dataset and generates the desired subset, which can be saved to a new .mat file.

Example 1: Import data from 'HCTSA_N.mat', then save a new dataset containing only time series with IDs in the range 1–100, and all operations, to 'HCTSA_N_subset.mat' (see documentation for all inputs):

    TS_Subset('norm',1:100,[],1,'HCTSA_N_subset.mat')

Note that the subset in this case will have been normalized using the full dataset of all time series, and just this subset (with IDs up to 100) is now being analyzed. Depending on the normalization method used, different results would be obtained if the subsetting were performed prior to normalization.

Example 2: From HCTSA.mat ('raw'), save a subset of that dataset to 'HCTSA_healthy.mat' containing only time series tagged with the 'healthy' keyword:

    TS_Subset('raw',TS_GetIDs('healthy','raw'),[],1,'HCTSA_healthy.mat')

Combining multiple hctsa datasets using TS_Combine

When analyzing a growing dataset, new data sometimes needs to be combined with computations on existing data. Alternatively, when computing a large dataset, you may wish to compute sections of it separately and later combine each section into a full dataset.

To combine hctsa data files, you can use the TS_Combine function.

Example: combine hctsa datasets stored in the files HCTSA_healthy.mat and HCTSA_disease.mat into a new combined file, HCTSA_combined.mat:

    TS_Combine('HCTSA_healthy.mat','HCTSA_disease.mat',false,false,'HCTSA_combined.mat')

The third input, compare_tsids, controls the behavior of the function in combining time series. By setting this to 1, TS_Combine assumes that the TimeSeries IDs are comparable between the datasets (most common when using a mySQL database to store hctsa data), and thus filters out duplicates so that the resulting hctsa dataset contains a unique set of time series. By setting this to 0 (default), the output will contain a union of the time series present in each of the two hctsa datasets. In the case that duplicate TimeSeries IDs exist in the combination file, a new index will be generated in the combined file (IDs assigned to time series are re-assigned as unique integers using TS_ReIndex).

In combining operations, this function works differently when data have been stored in a unified mySQL database, in which case operation IDs can be compared meaningfully and combined as an intersection. However, when hctsa datasets have been generated using TS_Init, the function will check that the same set of operations have been used in both files.
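Putting these pieces together, an end-to-end sketch might filter two classes by keyword, subset each, and combine them into one analysis file (the 'healthy' and 'disease' keywords and file names are hypothetical):

```matlab
% Extract the 'healthy' and 'disease' classes into separate files:
TS_Subset('raw',TS_GetIDs('healthy','raw'),[],1,'HCTSA_healthy.mat');
TS_Subset('raw',TS_GetIDs('disease','raw'),[],1,'HCTSA_disease.mat');

% Recombine into a single file; the third input (false) takes the union of
% the time series present in the two files:
TS_Combine('HCTSA_healthy.mat','HCTSA_disease.mat',false,false,'HCTSA_combined.mat');
```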

Input files

Formatted input files are used to set up a custom dataset of time-series data, pieces of Matlab code to run (master operations), and associated outputs from that code (operations). By default, you can simply specify a custom time-series dataset, and the default operation library will be used. In this section we describe how to initiate an hctsa analysis, including how to format the input files used in the hctsa framework.

Working with a default feature set using TS_Init

To work with a default feature set (hctsa or catch22), you just need to specify information about the time series to analyze, via an input file (e.g., INP_ts.mat or INP_ts.txt). Details of how to format this input file are described below. A test input file, 'INP_test_ts.mat', is provided with the repository, so you can set up feature extraction for it using the hctsa feature set as:

    TS_Init('INP_test_ts.mat');

or

    TS_Init('INP_test_ts.mat','hctsa');

And for catch22 as:

    TS_Init('INP_test_ts.mat','catch22');

The full hctsa feature set involves significant computation time, so it is a recommended first step to test out your analysis pipeline using a smaller, faster feature set like catch22 (but note that it is insensitive to the mean and standard deviation; to include them, use catch24). This feature set is provided as a submodule within hctsa, and it is very fast to compute using compiled C code (the features are compiled on initial install of hctsa by running mexAll from the Toolboxes/catch22 directory of hctsa).

TS_Init produces a Matlab file, HCTSA.mat, containing all of the structures required to understand the set of time series, operations, and the results of their computation (explained here). Through this initialization process, each time series, master operation, and operation is assigned a unique ID.

Using custom feature sets

You can specify a custom feature set of your own making by specifying:

  1. The code to run (INP_mops.txt); and

  2. The features to extract from that code (INP_ops.txt).

Details of how to format these input files are described below. The syntax for using a custom feature set is:

    TS_Init('INP_ts.mat',{'INP_mops.txt','INP_ops.txt'});

Time Series Input Files

When formatting a time series input file, two formats are available:

  • .mat file input, which is suited to data that are already stored as variables in Matlab; or

  • .txt file input, which is better suited to cases where each time series is already stored as an individual text file.

Input file format 1 (.mat file)

When using a .mat file input, the .mat file should contain three variables:

  • timeSeriesData: either an N x 1 cell (for N time series), where each element contains a vector of time-series values, or an N x M matrix, where each row specifies the values of a time series (all of length M).

  • labels: an N x 1 cell of unique strings containing a named label for each time series.

  • keywords: an N x 1 cell of strings, where each element contains a comma-delimited set of keywords (one set for each time series), containing no whitespace.

An example involving two time series is below. In this example, we add two time series (showing only the first two values of each), which are labeled according to .dat files from a hypothetical EEG experiment and assigned keywords (which are separated by commas and contain no whitespace). In this case, both are assigned the keywords 'subject1' and 'eeg'; additionally, the first time series is assigned 'trial1', and the second 'trial2' (these labels can be used later to retrieve individual time series). Note that the labels do not need to specify filenames, but can be any useful label for a given time series.

    timeSeriesData = {[1.45,2.87,...],[8.53,-1.244,...]}; % (a cell of vectors)
    labels = {'EEGExperiment_sub1_trial1.dat','EEGExperiment_sub1_trial2.dat'}; % data labels for each time series
    keywords = {'subject1,trial1,eeg','subject1,trial2,eeg'}; % comma-delimited keywords for each time series

    % Save these variables out to INP_test.mat:
    save('INP_test.mat','timeSeriesData','labels','keywords');

    % Initialize a new hctsa analysis using these data and the default feature library:
    TS_Init('INP_test.mat','hctsa');

Input file format 2 (text file)

When using a text file input, the input file specifies filenames of time-series data files, which Matlab will then attempt to load (using dlmread). Data files should thus be accessible in the Matlab path. Each time-series text file should have a single real number on each row, specifying the ordered values that make up the time series. Once imported, the time-series data are stored in the database; the original time-series data files are then no longer required and can be removed from the Matlab path.

The input text file should be formatted as rows, with each row specifying two whitespace-separated entries: (i) the file name of a time-series data file and (ii) comma-delimited keywords.

For example, consider the following input file, containing three lines (one for each time series to be added to the database):

    gaussianwhitenoise_001.dat     noise,gaussian
    gaussianwhitenoise_002.dat     noise,gaussian
    sinusoid_001.dat               periodic,sine

Using this input file, a new analysis will contain three time series: gaussianwhitenoise_001.dat and gaussianwhitenoise_002.dat will be assigned the keywords 'noise' and 'gaussian', and the data in sinusoid_001.dat will be assigned the keywords 'periodic' and 'sine'. Note that keywords should be separated only by commas (no whitespace).

Adding master operations

In our system, a master operation refers to a piece of Matlab code and a set of input parameters.

Valid outputs from a master operation are:

  1. A single real number;

  2. A structure containing real numbers;

  3. NaN, to indicate that the input time series is not appropriate for this code.

The (potentially many) outputs from a master operation can thus be mapped to individual operations (or features): single real numbers summarizing a time series that make up individual columns of the resulting data matrix.

Two example lines from the input file, INP_mops.txt (in the Database directory of the repository), are as follows:

    CO_tc3(x_z,1)     CO_tc3_xz_1
    ST_length(x)    ST_length

Each line in the input file specifies two pieces of information, separated by whitespace: 1. A piece of code and its input parameters. 2. A unique label for that master operation (that can be referenced by individual operations).

We use the convention that x refers to the input time series and x_z refers to a z-scored transformation of the input time series (i.e., (x − μ_x)/σ_x). In the example above, the first line adds an entry for running the code CO_tc3 using a z-scored time series as input (x_z), with 1 as the second input, under the label CO_tc3_xz_1; the second line adds an entry for running the code ST_length on the untransformed time series, x, under the label ST_length.

When the time comes to perform computations on data using the methods in the database, Matlab needs path access to each of the master operation functions specified in the database. For the above example, Matlab will attempt to run both CO_tc3(x_z,1) and ST_length(x), and thus the functions CO_tc3.m and ST_length.m must be in the Matlab path. Recall that the script startup.m, which should be run at the start of each session using hctsa, handles the addition of paths required for the default code library.

Modifying operations (features)

The input file, e.g., INP_ops.txt (in the Database directory of the repository), should contain a row for every operation, using labels that correspond to master operations. An example excerpt from such a file is below:

    CO_tc3_xz_1.raw     CO_tc3_1_raw      correlation,nonlinear
    CO_tc3_xz_1.abs     CO_tc3_1_abs      correlation,nonlinear
    CO_tc3_xz_1.num     CO_tc3_1_num      correlation,nonlinear
    CO_tc3_xz_1.absnum  CO_tc3_1_absnum   correlation,nonlinear
    CO_tc3_xz_1.denom   CO_tc3_1_denom    correlation,nonlinear
    ST_length           length            raw,lengthDependent

The first column references a corresponding master label and, in the case of master operations that produce a structure, the particular field of the structure to reference (after the full stop); the second column denotes the label for the operation; and the final column is a set of comma-delimited keywords (that must not include whitespace). Whitespace is used to separate the three entries on each line of the input file. In this example, the master operation labeled CO_tc3_xz_1 outputs a structure, with fields that are referenced by the first five operations listed here, and the ST_length master operation outputs a single number (the length of the time series), which is referenced by the operation named 'length'. The two keywords 'correlation' and 'nonlinear' are added to the CO_tc3_1 operations, while the keywords 'raw' and 'lengthDependent' are added to the operation named 'length'. These keywords can be used to organize and filter the set of operations used for a given analysis task.

Running hctsa computations

An hctsa analysis requires setting up a library of time series, master operations, and operations, and generating an HCTSA.mat file (using TS_Init), as described here. Once this is set up, computations are run using TS_Compute.

These steps, as well as information on how to inspect the results of an hctsa analysis and work with HCTSA*.mat files, are provided in this chapter.

Performing calculations

Once an hctsa dataset has been initialized (specifying details of a time-series dataset and operations to include, using TS_Init), all results entries in the resulting HCTSA.mat are set to NaN, corresponding to results that are as yet uncomputed.

Calculations are performed using the function TS_Compute, which stores results back into the matrices in HCTSA.mat. This function can be run without inputs to compute all missing values in the default hctsa file, HCTSA.mat:

    % Compute all missing values in HCTSA.mat:
    TS_Compute();

TS_Compute will begin evaluating operations on time series in HCTSA.mat for which elements in TS_DataMat are NaN (i.e., computations that have not been run previously). Results are stored back in the matrices of HCTSA.mat: TS_DataMat (output of each operation on each time series), TS_CalcTime (calculation time for each operation on each time series), and TS_Quality (labels indicating errors or special-valued outputs).

Custom settings for running TS_Compute

(1) Computing features in parallel across available cores using Matlab's Parallel Processing Toolbox, by setting the first input to true:

    % Compute all missing values in HCTSA.mat using parallel processing:
    TS_Compute(true);

(2) Computing across a custom range of time-series IDs (ts_id) and operation IDs (op_id), by setting the second and third inputs:

    % Compute missing values in HCTSA.mat for ts_ids from 1:10 and op_ids from 1:1000
    TS_Compute(false,1:10,1:1000);

(3) Specifying what types of values should be computed:

    % Compute all values that have never been computed before (default)
    TS_Compute(false,[],[],'missing');
    % Compute all values that have never been calculated OR that previously returned an error:
    TS_Compute(false,[],[],'error');

(4) Specifying a custom .mat file to operate on (HCTSA.mat is the default):

    % Compute all missing values in my_HCTSA_file.mat:
    TS_Compute(false,[],[],'missing','my_HCTSA_file.mat');

(5) Suppressing command-line output. All computations are displayed to screen by default (which can be overwhelming but is useful for error checking). This can be suppressed by setting the final (sixth) input to false:

    % Compute all missing values in HCTSA.mat, suppressing output to screen:
    TS_Compute(false,[],[],'missing','',false);

Computation approaches for full datasets

Computing features for full time-series datasets can be time consuming, especially for large datasets of long time series. An appropriate strategy therefore depends on the time-series length, the number of time series in the dataset, and the computational resources available. When multiple cores are available, it is always recommended to use the parallel setting (i.e., TS_Compute(true)).

Computation time scaling

The first thing to consider is how the time taken to compute the 7749 features of hctsa v0.93 scales with the length of time series in your dataset (see plot below). The figure compares results using a single core (e.g., TS_Compute(false)) to results using a 16-core machine with parallelization enabled (e.g., TS_Compute(true)).

Times may vary across individual machines, but the above plot can be used to estimate the computation time per time series, and thus help decide on an appropriate computation strategy for a given dataset.

Note that if computation times are too long for the computational resources at hand, one can always choose a reduced set of features, rather than the full set of >7000, to get a preliminary understanding of the dataset. One such reduced set of features is in INP_ops_reduced.txt. We plan to release additional reduced feature sets, determined according to different criteria, in future.

On a single machine

If only a single machine is available for computation, there are a couple of options:

  1. For small datasets, when it is feasible to run all computations in a single go, it is easiest to run computations within Matlab in a single call of TS_Compute.

  2. For larger datasets that may run for a long time on a single machine, one may wish to use something like the provided sample_runscript_matlab script, where TS_Compute commands are run in a loop over time series, computing small sections of the dataset at a time (and then saving the results to file, e.g., HCTSA.mat), eventually covering the full dataset iteratively.

On a distributed compute cluster using Matlab

Code for running distributed hctsa computations on a cluster (using pbs or slurm schedulers) is here. The strategy is as follows: with a distributed computing setup, a local Matlab file (HCTSA.mat) can be split into smaller pieces using TS_Subset, which outputs a new data file for a particular subset of your data; e.g., TS_Subset('raw',1:100) will generate a new file, HCTSA_subset.mat, that contains just the time series with IDs from 1 to 100. Computing features for the time series in each such subset can then be run on the distributed setup, with a different compute node computing a different subset (by queuing batch jobs that each work on a given subset of time series). After all subsets have been computed, the results are recombined into a single HCTSA.mat file using TS_Combine commands.

Using mySQL to facilitate distributed computing

Distributing feature computations on a large-scale distributed computing setup can be better suited to a linked mySQL database, especially for datasets that grow with time, as new time series can easily be added to the database. In this case, computation proceeds similarly to the above, with shell scripts on a distributed cluster computing environment used to distribute jobs across cores, and all individual jobs writing to a centralized mySQL server. A set of Matlab code that generates an appropriately formatted mySQL database and interfaces with it to facilitate hctsa feature computation is included with the software package, and is described in detail here.
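The split-compute-recombine strategy described in this section can be sketched in a few lines (file names and ID ranges are hypothetical; in practice, each TS_Compute call would be submitted as a separate cluster job):

```matlab
% 1) Split the full dataset into subsets by time-series IDs:
TS_Subset('raw',1:100,[],1,'HCTSA_part1.mat');
TS_Subset('raw',101:200,[],1,'HCTSA_part2.mat');

% 2) Compute each subset (in practice, on different compute nodes):
TS_Compute(true,[],[],'missing','HCTSA_part1.mat');
TS_Compute(true,[],[],'missing','HCTSA_part2.mat');

% 3) Recombine the computed subsets into a single file; the third input (true)
%    treats the TimeSeries IDs as comparable across the two files:
TS_Combine('HCTSA_part1.mat','HCTSA_part2.mat',true,false,'HCTSA.mat');
```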