Below are some UMAP projections of UEA/UCR time-series classification datasets.
In each dataset, each time series is represented in the high-dimensional feature space of hctsa features. In each plot, we show the data projected into a two-dimensional UMAP space. Each point in the space is a time series, and we have coloured each point according to its assigned label. In many cases, the unsupervised UMAP projection of the hctsa feature space captures the labeled structure of the data.
Only a selection of the most interesting-looking projections (according to our eyes) are shown here. Oh, did you want to have a play with all the data? It's available on figshare.
Thanks to Carl Lubba for doing all the computation shown here!
Thinking about running an hctsa analysis? Read this first.
The typical data analysis pipeline starts with inspecting and understanding the data, processing it in accordance with the questions of interest (and to be consistent with the assumptions of the analysis methods that will be applied), and then formulating and testing appropriate analysis methods. A typical hctsa pipeline inverts this process: many analysis methods are first applied, and then their results are interpreted.
Good practice involves thinking carefully about this full hctsa pipeline, including the type of questions and interpretations that are sought from it, and thus how the data are to be prepared, and how the results can be interpreted accurately.
hctsa automates some parts of a time-series analysis pipeline (such as guiding the selection of informative statistical properties in a time-series classification problem), but it does not replace expertise. Careful thought, with understanding of the problem and domain-specific issues are essential in designing how hctsa can be used for a given application, and in properly interpreting its results.
Some general advice before embarking on an hctsa analysis:
What are the data?
For long, streaming data: how long is each time series? Does this capture the timescale of the problem you care about?
For continuously evolving processes: At what rate are the data sampled? Does this capture the patterns of interest? E.g., if the sampling rate is too high, this can lead to trivially autocorrelated time series such that time-series methods in hctsa will find it challenging to resolve the patterns of interest.
Are the data appropriately processed (detrended, artifacts removed, …)? This requires careful thought, often with domain expertise, about what problem is being solved. E.g., many time-series analysis algorithms will be dominated by underlying trends if these are not removed. In general, properties of the data that are likely to affect many time-series analysis algorithms (like underlying trends, artefactual outliers, etc.) should be removed if they are not informative of the differences you care about distinguishing. See the section below for more information.
What do they look like? Addressing the questions above requires you to look at each of the time series to get a sense of the dynamics you're interested in characterizing using time-series features.
What problem are you trying to solve? In designing an analysis using hctsa, you will need to think about the sample size you have, the effect sizes you expect, the statistical power you will have, etc. E.g., if you only have 5 examples of each of two classes, you will not have the statistical power to pick out individual features from a library of 7000, and simple (unregularized) classifiers will be likely to overfit.
Trial run with a reduced set. Once you’ve devised a pipeline, it's best to run through it in hctsa but using a reduced feature set first (e.g., the catch22 set), which runs quickly on a laptop and gives you a sense for the process. Once you're satisfied with the analysis pipeline, you can always scale up to the full hctsa library of >7000 features.
hctsa contains thousands of time-series analysis methods, many of which make strong assumptions about the data, such as that they are (weak-sense) stationary. Although hctsa has been applied meaningfully to short, non-stationary patterns (as commonly studied in time-series data-mining applications), it is better suited to longer streams of data, from which more subtle temporal statistics can be sampled.
hctsa is no substitute for thoughtful, domain-motivated data preparation and processing. For example, hctsa cannot know what is 'signal' and what is 'noise' for the question you're asking of your data. The analyst should prepare their data in a way that makes sense for the questions of relevance to them, including the possibility of de-trending/filtering the data, applying noise-removal methods, etc.
The following should be considered:
Standardizing. If your time-series data are measured on scales that differ across different recordings, then the data should be appropriately standardized.
Detrending. If your data contain low-order trends that are not meaningful (e.g., sensor drift) or not a signal of relevance (your question is based more around the structure of deviations from a low-frequency trend), then this should be removed.
There are often many routes to solving a given data analysis challenge. For example, in a time-series classification problem, the two classes may be perfectly distinguished based on their lag-1 autocorrelation, and also on their Lyapunov exponent spectrum, and also on hundreds of other properties. In general, one should avoid interpreting the most complex features (like Lyapunov exponents) as being uniquely useful for a problem when they reproduce the behavior of much simpler features, which provide a more interpretable and parsimonious explanation of the relevant patterns in the dataset. For other problems, time-series analysis methods (that are sensitive to the time-ordering of the data samples) may not provide any benefit at all over properties of the data distribution (e.g., the variance), or more trivial differences in time-series length across classes.
In general, complex explanations of patterns in a dataset can only be justified when simpler explanations have been ruled out. For example, do not write a paper about a complex feature (e.g., a power-law fit to a visibility-graph degree distribution) when your data can be just as well (or better) distinguished by their variance.
hctsa has many functions to check this type of behavior: from inspecting the pairwise correlation structure of high-performing features (in TS_TopFeatures) to basic checks on different keyword-labeled classes of features (in TS_CompareFeatureSets).
Welcome to the hctsa documentation!
This documentation contains resources and information about highly comparative time-series analysis using hctsa, and also outlines the steps required to set up and implement highly comparative time-series analysis using the hctsa package, as described in our papers:
B.D. Fulcher and N.S. Jones. hctsa: A computational framework for automated time-series phenotyping using massive feature extraction. Cell Systems 5, 527 (2017).
B.D. Fulcher, M.A. Little, N.S. Jones. Highly comparative time-series analysis: the empirical structure of time series and their methods. J. Roy. Soc. Interface 10, 20130048 (2013).
Some advice for getting started with an hctsa analysis
Sometimes it's easiest to get going quickly when you can follow along with some example code. Here are some repositories that allow you to do just that. There are a range of open datasets with pre-computed hctsa features, as well as some examples of hctsa workflows. You may also check the page of publications using hctsa for example publications that have used it—some of these contain associated open code.
If you have data to share and host, let me know and I'll add it to this list.
Resources related to HCTSA including talks/demos, workflow examples and related packages.
A collection of good resources for time-series analysis (including in other programming languages like python and R) is listed here. See also larger collections of time-series resources by Lukasz Mentel and, focused on python, by Max Christ.
There is a list of python-based packages for time-series analysis here that is worth taking a look at, in addition to those packages highlighted below:
This page lists scientific research publications that have used hctsa.
Articles are labeled as follows:
📗 = Journal publication.
📙 = Preprint.
= Link to GitHub code repository available.
If you have used hctsa in your published work, or we have missed any publications, feel free to reach out and we'll add it to this growing list!
See the following publications for details of how the highly comparative approach to time-series analysis has developed since our initial publication in 2013.
We have used hctsa to:
as well as:
Connect structural brain connectivity to fMRI dynamics (mouse).
Connect structural brain connectivity to fMRI dynamics (human).
Distinguish time-series patterns for data-mining applications.
Classify babies with low blood pH from fetal heart rate time series.
Here are some highlights:
In addition to:
Predict age from resting-state MEG from individual brain regions.
Estimate brain age in children from EEG.
Extract gradients from fMRI hctsa time-series features to understand the relationship between schizophrenia and nicotine dependence.
Classify endogenous (preictal), interictal, and seizure-like (ictal) activity from local field potentials (LFPs) from layers II/III of the primary somatosensory cortex of young mice (using feature selection methods from an initial pool of hctsa features).
Distinguish motor-evoked potentials corresponding to multiple sclerosis.
Here are some highlights:
in addition to:
Identify sepsis in very low birth weight (<1.5 kg) infants from heart-rate signals, characterized by reduced variability and transient decelerations.
Identify novel heart-rate variability metrics, including RobustSD
, to create a parsimonious model for cerebral palsy prediction in preterm neonatal intensive care unit patients.
Predict post-cardiac-arrest outcomes.
Detect falls from wearable sensor data.
Detect falls from wearable sensor data.
Select features for fetal heart rate analysis using genetic algorithms.
Here are some highlights:
in addition to:
Diagnose a spacecraft propulsion system utilizing data provided by the Prognostics and Health Management (PHM) society, as part of the Asia-Pacific PHM conference’s data challenge, 2023.
Identify faults in a large-scale industrial process.
Frequently asked questions pertaining to the use of hctsa.
A full list of Matlab code files, organized loosely into broad categories, with brief descriptions
The full default set of over 7700 features in hctsa is produced by running all of the code files below, many of which produce a large number of outputs (e.g., some functions fit a time-series model and then output statistics including the parameters of the best-fitting model, measures of the model's goodness of fit, the optimal model order, and autocorrelation statistics on the residuals).
In our default feature set, each function is run with multiple input parameters, with each parameter set yielding characteristic outputs. For example, the inputs to CO_AutoCorr determine the method by which the autocorrelation is computed, as well as the time lag at which it is calculated (e.g., lag 1, lag 2, lag 3);
WL_dwtcoeff
has inputs that set the mother wavelet to use and level of wavelet decomposition; and
FC_LocalSimple
has inputs that determine the time-series forecasting method to use and the size of the training window.
The set of code files below and their input parameters that define the default hctsa feature set are in the INP_mops.txt
file of the hctsa repository.
Algorithms for summarizing properties of the distribution of values in a time series (independent of their ordered sequence through time).
Code summarizing basic properties of how values of a time series are correlated through time.
Entropy and complexity measures for time series based on information theory
Fitting time-series models and doing simple forecasting on time series.
Quantifying how properties of a time series change over time.
Nonlinear time-series analysis methods, including embedding dimensions and fluctuation analysis.
Properties of the time-series power spectrum, wavelet spectrum, and other periodicity measures.
Properties of a discrete symbolization of a time series.
Simple time-series properties derived mostly from the heart rate variability (HRV) literature.
Basic statistics of a time series, including measures of trend.
Other properties, like extreme values, visibility graphs, physics-based simulations, and dependence on pre-processings applied to a time series.
The hctsa package can be used completely within Matlab, allowing users to analyse time-series datasets quickly and easily. Here we will focus on this Matlab-based use of the software, but note that, for larger datasets requiring distributed computing set-ups, or datasets that may grow with time, hctsa can also be linked to a mySQL database, as described elsewhere in this documentation.
The simplest way to get the hctsa package up and running is to run the install
script, which adds the required paths to dependent time-series packages (toolboxes), and compiles mex binaries to work on your system architecture. Once this one-off installation step is complete, you're ready to go! (NB: to include additional functions from the TISEAN nonlinear time-series analysis package, you'll also need to compile the TISEAN binaries separately, as described below.)
After installation, future use of the package can begin by opening Matlab, navigating to the hctsa package, and then loading the paths required by the hctsa package by running the startup
script.
Denoising. If your data contain high-frequency measurement noise, the analyst should consider removing it, for example using a smoothing filter (e.g., a moving average), wavelet denoising, or a phase-space reconstruction.
Downsampling. Features assume that the data are sampled at a rate that properly resolves the temporal patterns of interest. If your data are over-sampled, then many features will be sensitive to this dominating autocorrelation structure, and will be less sensitive to interesting patterns in the data. In this case, you can consider downsampling your data, for which there are many heuristics.
Accompanying web resource that provides a self-organising database of time-series data. It allows users to upload, explore, and compare thousands of different types of time-series data.
pyspi is a comprehensive python library for computing hundreds of statistics of pairwise interactions (SPIs) directly from multivariate time series (MTS) data.
This reduced set of 22 features, determined through a combination of classification performance (across 93 problems) and mutual redundancy (as explained in this 📗 paper), is available here as an efficiently coded C implementation.
A reduced set of 16 features (trained on mouse fMRI data), coded in C with wrappers for use in python, R, etc.
Provides code for setting an hctsa feature-extraction job running on a computing cluster without leaving your local machine (with associated Julia helper code).
Provides MATLAB code for distributing hctsa calculations across nodes of a computing cluster (using pbs or slurm schedulers).
A repository for testing new time-series analysis algorithms for redundancy (and assessing whether to include them in hctsa).
A work in progress to recode many hctsa time-series analysis features into native python code.
This excellent repository allows users to run hctsa software from within python.
Some preliminary python code for analyzing the results of hctsa calculations.
Native python time-series code to extract hundreds of time-series features, with in-built feature filtering. See the paper.
Kats
is a toolkit for analysing time-series data in python, and includes basic feature extraction functionality.
theft
(Tools for Handling Extraction of Features from Time series) allows users to compute various open time-series features sets and contains associated tools for visualising and analysing the results.
TSFEL
, 'Time Series Feature Extraction Library', is a python package with implementations of 60 simple time-series features (with unit tests).
A flexible and efficient toolkit for time-series processing & feature extraction that makes minimal assumptions about sequential data. Includes support for a range of feature sets, including tsfresh, catch22, TSFEL, and Kats.
Includes a wide range of useful signal processing tools (like power spectral densities, detrending and bandpass filtering, and empirical mode decomposition). It also includes estimation of complexity parameters (many entropies and correlation dimensions, see part of readme here) as well as detrended fluctuation analysis.
Makes available existing collections of time-series data for analysis.
Includes implementations of a range of time-series features.
Provides a collection of tools for the analysis of tidy time-series data represented as a tsibble.
A unified framework for machine learning with time series.
An open-source library of efficient algorithms to analyse time series using GPU and CPU.
A python-based nonlinear time-series analysis and complex systems code package. See this publication.
A python package for performing machine learning using time-series data, including loading datasets from the UEA/UCR repository.
A python package that can extract features from multivariate time series.
Code file | Description |
| Burstiness statistic of a time series. |
| Fits a distribution to data. |
| Custom skewness measures. |
| Statistics of a kernel-smoothed distribution of the data. |
| Maximum likelihood distribution fit to data. |
| The highlowmu statistic. |
| Mode of the histogram. |
| A given measure of location of a data vector. |
| The maximum and minimum values of the input data vector |
| A moment of the distribution of the input time series. |
| How statistics depend on distributional outliers. |
| How distributional statistics depend on distributional outliers. |
| Proportion of values in a time-series vector. |
| Quantiles of the distribution of values in the time series data vector. |
| How time-series properties change as points are removed. |
| Fits of parametric distributions or simple time-series models. |
| Spread of the input time series. |
| Mean of the outlier-trimmed time series. |
| Distributional asymmetry. |
| The proportion of the time series that are unique values. |
| Proportion of data points within p standard deviations of the mean. |
| Coefficient of variation. |
| Distance from the mean at which a given proportion of data are more distant. |
| Distributional entropy. |
| Hypothesis test for distributional fits to a data vector. |
Code file | Description |
| Changes in the automutual information with the addition of noise. |
| Compute the autocorrelation of an input time series. |
| How the autocorrelation function changes with the time lag. |
| Statistics of the time series in a 2-dimensional embedding space. |
| Angle autocorrelation in a 2-dimensional embedding space. |
| Point density statistics in a 2-d embedding space. |
| Analyzes distances in a 2-d embedding space of a time series. |
| Shape-based statistics in a 2-d embedding space. |
| First time the autocorrelation function crosses a threshold. |
| Time of first minimum in a given correlation function. |
| A custom nonlinear autocorrelation of a time series. |
| Analysis of line-of-sight angles between time-series data points. |
| Statistics on datapoints inside geometric shapes across the time series. |
| The 1/e correlation length. |
| The first zero-crossing of the generalized self-correlation function. |
| The generalized linear self-correlation function of a time series. |
| Normalized nonlinear autocorrelation function, |
| Normalized nonlinear autocorrelation, |
| Computes James Theiler's crinkle statistic. |
| Computes Theiler's Q statistic. |
| Time reversal asymmetry statistic. |
| Principal Components analysis of a time series in an embedding space. |
Automutual information: |
| Automutual information (Rudy Moddemeijer implementation). |
| Variability in first minimum of automutual information. |
| The automutual information of the distribution using histograms. |
| Statistics on automutual information function for a time series. |
Code file | Description |
| Approximate Entropy of a time series. |
| Simple complexity measure of a time series. |
| Lempel-Ziv complexity of a n-bit encoding of a time series. |
| Approximate Shannon entropy of a time series. |
| Permutation Entropy of a time series. |
| Entropy of a time series using Rudy Moddemeijer's code. |
| How time-series properties change with increasing randomization. |
| Sample Entropy of a time series. |
| Multiscale entropy of a time series. |
| Recurrence period density entropy (RPDE). |
| Entropy of time series using wavelets. |
Code file | Description |
| Compares a range of ARMA models fitted to a time series. |
| Fits an AR model of a given order, p. |
| Compares model fits of various orders to a time series. |
| Robustness of test-set goodness of fit. |
| Exponential smoothing time-series prediction model. |
| Robustness of model parameters across different segments of a time series. |
| Comparison of GARCH time-series models. |
| GARCH time-series modeling. |
| Gaussian Process time-series modeling for local prediction. |
| Gaussian Process time-series model for local prediction. |
| Gaussian Process time-series model parameters and goodness of fit. |
| Change in goodness of fit across different state space models. |
| State space time-series model fitting. |
| Statistics of a fitted AR model to a time series. |
| Statistics on a fitted ARMA model. |
| Hidden Markov Model (HMM) fitting to a time series. |
| Fits a Hidden Markov Model to sequential data. |
| Goodness of model predictions across prediction lengths. |
| Simple local time-series forecasting. |
| How simple local forecasting depends on window length. |
| How surprised you would be by the next data point given recent memory. |
| Investigates whether AR model fit improves with different preprocessings. |
Code file | Description |
| Mean and variance in local time-series subsegments. |
| How stationarity estimates depend on the number of time-series subsegments. |
| The KPSS stationarity test. |
| Compares the distribution in consecutive time-series segments. |
| Compares local statistics to global statistics of a time series. |
| Phillips-Peron unit root test. |
| How the time-series range changes across time. |
| Sliding window measures of stationarity. |
| Bootstrap-based stationarity measure. |
| Simple mean-stationarity metric, |
| Standard deviation of the nth derivative of the time series. |
| How the output of |
| Cross-forecast errors of zeroth-order time-series models. |
| Variance ratio test for random walk. |
Step detection: |
| Analysis of discrete steps in a time series. |
| Dependence of step detection on regularization parameter. |
| Variance change points in a time series. |
Code file | Description |
| Correlation dimension of a time series. |
| Delay Vector Variance method for real and complex signals. |
| False nearest neighbors of a time series. |
| Normalized drop-one-out constant interpolation nonlinear prediction error. |
| Information dimension. |
|
|
| False nearest neighbors of a time series. |
| Fractal dimension spectrum, |
| Correlation sum scaling by Grassberger-Proccacia algorithm. |
| Largest Lyapunov exponent of a time series. |
| Poincare section analysis of a time series. |
| Analysis of the histogram of return times. |
| Taken's estimator for correlation dimension. |
|
|
| Box counting, information, and correlation dimension of a time series. |
| Analyzes the false-nearest neighbors statistic. |
| Analyzes test statistics obtained from surrogate time series. |
| Surrogate time-series analysis. |
| Optimal delay time using the method of Parlitz and Wichard. |
| Local density estimates in the time-delay embedding space. |
| Nonlinearity measure derived from the nonlinear average magnitude difference function. |
Fluctuation analysis: |
| Physionet implementation of multiscale multifractal analysis |
| Matlab wrapper for Max Little's |
| Implements fluctuation analysis by a variety of methods. |
Code file | Description |
| Statistics of the power spectrum of a time series. |
| A simple test of seasonality. |
| Periodicity extraction measure of Wang et al. |
| Detail coefficients of a wavelet decomposition. |
| Wavelet decomposition of the time series. |
| Continuous wavelet transform of a time series. |
| Discrete wavelet transform coefficients. |
| Parameters of fractional Gaussian noise/Brownian motion in a time series. |
| Frequency components in a periodic time series. |
Code file | Description |
| Statistics on a binary symbolization of the time series. |
| Characterizes stretches of 0/1 in time-series binarization. |
| Motifs in a coarse-graining of a time series to a 3-letter alphabet. |
| Local motifs in a binary symbolization of the time series. |
| Transition probabilities between different time-series states. |
| How transition probabilities change with alphabet size. |
Code file | Description |
| Classic heart rate variability (HRV) statistics. |
|
|
| The |
| Heart rate variability (HRV) measures of a time series. |
Code file | Description |
| Quantifies various measures of trend in a time series. |
| Goodness of a polynomial fit to a time series. |
| Length of an input data vector. |
| How local maximums and minimums vary across the time series. |
| Correlations between simple statistics in local windows of a time series. |
| Basic statistics about an input time series. |
Code file | Description |
| Moving threshold model for extreme events in a time series. |
| Statistical hypothesis test applied to a time series. |
| Visibility graph analysis of a time series. |
| Couples the values of the time series to a dynamical system. |
| Simulates a hypothetical walker moving through the time domain. |
| Compare how time-series properties change after pre-processing. |
| How time-series properties change in response to iterative pre-processing. |
An hctsa analysis requires setting up a library of time series, master operations, and operations, and generates a HCTSA.mat
file (using TS_Init
), as described here. Once this is set up, computations are run using TS_Compute
.
These steps, as well as information on how to inspect the results of an hctsa analysis and working with HCTSA*.mat files, are provided in this chapter.
At its core, hctsa analysis involves computing a library of time-series analysis features (which we call operations) on a time-series dataset.
The basic sequence of a Matlab-based hctsa analysis is to: 1. Initialize a HCTSA.mat
file, which contains all of the information about the set of time series and operations in your analysis, as well as the results of applying all operations to all time series, using TS_Init
. 2. Compute these operations on your time-series data using TS_Compute
. The results are structured in the local HCTSA.mat
file containing matrices (that store the results of the computations) and the tables (that store information about the time-series data and operations), as described here.
3. After the computation is complete, a range of processing, analysis, and plotting functions are provided to understand and interpret the results.
As a quick check of your operation library, you can compute the full default code library on a time-series data vector (a column vector of real numbers) as follows:
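For example, using TS_CalculateFeatureVector on a random data vector (a sketch; the output variable name is arbitrary):

```matlab
% Compute the full default feature library on a single (random) time series:
featureVector = TS_CalculateFeatureVector(randn(500,1));
```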
Suppose you have a time-series dataset to analyze. You first generate a formatted INP_ts.mat
input file containing your time series data and associated name and keyword labels, as described here. You then initialize an hctsa calculation using the default library of features:
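For example (assuming your input file is named INP_ts.mat, as above):

```matlab
% Initialize an hctsa analysis using the default (full) feature library:
TS_Init('INP_ts.mat');
```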
This generates a local file, HCTSA.mat
containing the associated metadata for your time series, as well as information about the full time-series feature library (Operations
) and the set of functions and code to call to evaluate them (MasterOperations
), as described here.
Next you want to evaluate the code on all of the time series in your dataset. For this you can simply run:
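With default settings:

```matlab
% Compute all features on all time series stored in HCTSA.mat:
TS_Compute();
```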
As described here, or, for larger datasets, using a script to regularly save back to the local file (cf. sample_runscript_matlab
).
Having run your calculations, you may then want to label your data using the keywords you provided in the case that you have labeled groups of time series:
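For example, to detect group labels automatically from unique keywords (one common usage, a sketch):

```matlab
% Assign group labels from keywords in the un-normalized ('raw') data:
TS_LabelGroups('raw',{});
```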
and then normalize and filter the data using the default sigmoidal transformation:
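For example, with default settings:

```matlab
% Filter and normalize the data matrix, producing HCTSA_N.mat:
TS_Normalize();
```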
A range of visualization scripts are then available to analyze the results, such as plotting the reordered data matrix:
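For example:

```matlab
% Plot the (clustered, where available) data matrix:
TS_PlotDataMatrix();
```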
To inspect a low-dimensional representation of the data:
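For example, with default settings:

```matlab
% Plot a low-dimensional projection of the time series in feature space:
TS_PlotLowDim();
```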
Or to determine which features are best at classifying the labeled groups of time series in your dataset:
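For example, with default settings:

```matlab
% Rank individual features by how well they separate the labeled groups:
TS_TopFeatures();
```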
Each of these functions can be run with a range of input settings.
Some external code packages require compiled binary code to be used. Compilation of the mex code is handled by compile_mex
as part of the install
script, but the TISEAN package binaries need to be compiled separately in the command line.
Many of the operations (especially external code packages) rely on mex functions (pieces of code written in C or fortran), that need to be compiled to run natively on a given system architecture. To ensure that as many operations as possible run successfully on your data, you should compile these mex functions for your system. This requires working compilers (e.g., gcc, g++) to be installed on your system, which can be configured using mex -setup
(cf. doc mex
for more information).
Once mex is set up, the mex functions used in the time-series code repository can be compiled by navigating to the Toolboxes directory and then running compile_mex
.
Some operations rely on the TISEAN nonlinear time-series analysis package, which Matlab accesses via the terminal using system
commands, so the TISEAN binaries cannot be installed from within Matlab, but instead must be installed from the command line. If you are running Linux or Mac, we will assume that you are familiar with the command line, while those running Windows will require an alternate method to install TISEAN, as explained below.
In the command line (not within Matlab), navigate to the Toolboxes/Tisean_3.0.1 directory of the repository, then run the following chain of commands:
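A typical chain looks like the following (a sketch of the standard TISEAN build process; exact steps may vary on your system):

```
./configure
make clean
make
make install
```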
This should install the TISEAN binaries in your ~/bin/ directory (you can instead install into a system-wide directory, /usr/bin, for example, by running ./configure --prefix=/usr
). Additional information about the TISEAN installation process is provided on the TISEAN website.
If installation was successful then you should be able to access the newly-compiled binaries from the commandline, e.g., typing the command which poincare
should return the path to the TISEAN function poincare
. Otherwise, you should check that the install directory is in your system path, e.g., by adding the following:
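For example (assuming the binaries were installed to ~/bin):

```
export PATH="$HOME/bin:$PATH"
```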
to your ~/.bash_profile (and running source ~/.bash_profile
to update).
The path where TISEAN is installed will also have to be in Matlab’s environment path, which is added by startup.m
, assuming that the binaries are stored in ~/bin. The startup.m
code also adds the DYLD_LIBRARY_PATH, which is also required for TISEAN to function properly.
If you choose to use a custom location for the TISEAN binaries, that is not in the default Matlab system path (getenv('PATH')
in Matlab), then you will have to add this path manually. You can test that Matlab can see the TISEAN binaries by typing, for example, the following into Matlab:
!which poincare
If Matlab’s system paths are set up correctly, this command should return the path to your compiled TISEAN binary, poincare
.
If you are running Matlab from Windows, you will need a mechanism for Matlab to call system
commands and find compiled TISEAN binaries. There are two options:
Install Cygwin on your machine. Cygwin provides a Linux distribution-like environment on Windows. Use this environment to compile and install TISEAN (as per the instructions above for Linux or Mac), which will require it to have C and fortran compilers installed. Matlab will then also need to be launched from Cygwin, using the command: matlab &
. This instance of Matlab should then be able to call system
commands through cygwin, including the ability to access the TISEAN binaries.
Sacrifice operations that rely on TISEAN. In total, TISEAN-based operations account for approximately 300 operations in the operation library. Although they provide important, well-tested implementations of nonlinear time-series analysis methods, it's not the end of the world if you decide it's too much trouble to install and are ok to miss out on these methods (see below on how to explicitly remove them from a computed library).
If you decide not to use functions from the TISEAN package, you should initialize your dataset with the TISEAN functions removed. You could do this by removing them from your INP_ops.txt
file when initializing your dataset, or you could remove them from your initialized hctsa dataset by filtering on the 'tisean'
keyword.
For example, to filter a local Matlab hctsa file (e.g., HCTSA.mat
), you can use the following: TS_LocalClearRemove('raw','ops',TS_GetIDs('tisean','raw','ops'),true);
, which will remove all operations with the 'tisean' keyword from the hctsa dataset in HCTSA.mat
.
[If you are using a mySQL database to store the results of your hctsa calculations, TISEAN functions can be removed from the database as follows: SQL_ClearRemove('ops',SQL_GetIDs('ops',0,'tisean',{}),true)
].
Once an hctsa dataset has been initialized (specifying details of a time-series dataset and operations to include using TS_Init
), all results entries in the resulting HCTSA.mat
are set to NaN
, corresponding to results that are as yet uncomputed.
Calculations are performed using the function TS_Compute
, which stores results back into the matrices in HCTSA.mat
. This function can be run without inputs to compute all missing values in the default hctsa file, HCTSA.mat
:
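That is, in its simplest form:

```matlab
% Compute all missing (NaN) entries in the default file, HCTSA.mat:
TS_Compute();
```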
TS_Compute
will begin evaluating operations on time series in HCTSA.mat
for which elements in TS_DataMat
are NaN
(i.e., computations that have not been run previously). Results are stored back in the matrices of HCTSA.mat
: TS_DataMat
(output of each operation on each time series), TS_CalcTime
(calculation time for each operation on each time series), and TS_Quality
(labels indicating errors or special-valued outputs).
TS_Compute has several options; example calls for each are sketched after this list:
(1) Computing features in parallel across available cores using Matlab's Parallel Processing Toolbox. This can be achieved by setting the first input to true:
(2) Computing across a custom range of time-series IDs (ts_id
) and operation IDs (op_id
). This can be achieved by setting the second and third inputs:
(3) Specifying what types of values should be computed:
(4) Specifying a custom .mat file to operate on (HCTSA.mat
is the default):
(5) Suppress commandline output. All computations are displayed to screen by default (which can be overwhelming but is useful for error checking). This functionality can be suppressed by setting the final (6th) input to false
:
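Example calls corresponding to options (1)–(5) above (a sketch only: the input ordering shown follows the conventions described in the text, so check help TS_Compute in your version; the ID ranges and custom filename are hypothetical):

```matlab
TS_Compute(true);                                     % (1) compute in parallel
TS_Compute(true,1:100,1:500);                         % (2) restrict to given ts_id and op_id ranges
TS_Compute(true,[],[],'missing');                     % (3) compute only missing values
TS_Compute(true,[],[],'missing','HCTSA_custom.mat');  % (4) operate on a custom .mat file
TS_Compute(true,[],[],'missing','HCTSA.mat',false);   % (5) suppress commandline output
```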
Computing features for full time-series datasets can be time consuming, especially for large datasets of long time series. An appropriate strategy therefore depends on the time-series length, the number of time series in the dataset, and the computational resources available. When multiple cores are available, it is always recommended to use the parallel setting (i.e., as TS_Compute(true)
).
The first thing to think about is how the time taken to compute 7749 features of v0.93 of hctsa scales with the length of time series in your dataset (see plot below). The figure compares results using a single core (e.g., TS_Compute(false)
) to results using a 16-core machine, with parallelization enabled (e.g., TS_Compute(true)
).
Times may vary across individual machines, but the above plot can be used to estimate the computation time per time series, and thus help decide on an appropriate computation strategy for a given dataset.
Note that if computation times are too long for the computational resources at hand, one can always choose a reduced set of features, rather than the full set of >7000, to get a preliminary understanding of the dataset. One such reduced set of features is in INP_ops_reduced.txt
. We plan to provide additional reduced feature sets, determined according to different criteria, in the future.
If only a single machine is available for computation, there are a couple of options:
For small datasets, when it is feasible to run all computations in a single go, it is easiest to run computations within Matlab in a single call of TS_Compute
.
For larger datasets that may run for a long time on a single machine, one may wish to use something like the provided sample_runscript_matlab
script, where TS_Compute
commands are run in a loop over time series, computing small sections of the dataset at a time (and then saving the results to file, e.g., HCTSA.mat
), eventually covering the full dataset iteratively.
Code for running distributed hctsa computations on a cluster (using pbs or slurm schedulers) is here. The strategy is as follows: with a distributed computing setup, a local Matlab file (HCTSA.mat
) can be split into smaller pieces using TS_Subset
, which outputs a new data file for a particular subset of your data, e.g., TS_Subset('raw',1:100)
will generate a new file, HCTSA_subset.mat
that contains just time series with IDs from 1 to 100. Computing features for time series in each such subset can then be run on a distributed computing setup. For example, with a different compute node computing a different subset (by queuing batch jobs that each work on a given subset of time series). After all subsets have been computed, the results are recombined into a single HCTSA.mat
file using TS_Combine
commands.
Distributing feature computations on a large-scale distributed computing setup can be better suited to a linked mySQL database, especially for datasets that grow with time, as new time series can be easily added to the database. In this case, computation proceeds similarly to above, where shell scripts on a distributed cluster computing environment can be used to distribute jobs across cores, with all individual jobs writing to a centralized mySQL server. A set of Matlab code that generates an appropriately formatted mySQL database and interfaces with the database to facilitate hctsa feature computation is included with the software package, and is described in detail here.
Formatted input files are used to set up a custom dataset of time-series data, pieces of Matlab code to run (master operations), and associated outputs from that code (operations). By default, you can simply specify a custom time-series dataset and the default operation library will be used. In this section we describe how to initiate an hctsa analysis, including how to format the input files used in the hctsa framework.
TS_Init
To work with a default feature set, hctsa or catch22, you just need to specify information about the time-series to analyze, specified by an input file (e.g., INP_ts.mat
or INP_ts.txt
). Details of how to format this input file are described below. A test input file, 'INP_test_ts.mat'
, is provided with the repository, so you can set up feature extraction for it using either the full hctsa feature set or the reduced catch22 feature set, as sketched below.
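A minimal sketch (the 'catch22' feature-set argument reflects how reduced sets are typically specified to TS_Init; check the interface of your installed version):

```matlab
% Full hctsa feature library (the default):
TS_Init('INP_test_ts.mat');

% Reduced catch22 feature set:
TS_Init('INP_test_ts.mat','catch22');
```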
The full hctsa feature set involves significant computation time, so it is a recommended first step to test out your analysis pipeline using a smaller, faster feature set like catch22 (but note that catch22 is insensitive to the mean and standard deviation; to include them, use catch24). This feature set is provided as a submodule within hctsa, and it is very fast to compute using compiled C code (the features are compiled on the initial install of hctsa, by running mexAll from the Toolboxes/catch22 directory of hctsa).
TS_Init
produces a Matlab file, HCTSA.mat
, containing all of the structures required to understand the set of time series, operations, and the results of their computation (explained here). Through this initialization process, each time series will be assigned a unique ID, as will each master operation, and each operation.
You can specify a custom feature set of your own making by specifying
The code to run (INP_mops.txt
); and
The features to extract from that code (INP_ops.txt
).
Details of how to format these input files are described below. The syntax for using a custom feature-set is as:
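A sketch of the call, assuming your custom input files are named as below:

```matlab
% Initialize with custom master-operation and operation input files:
TS_Init('INP_ts.mat','INP_mops.txt','INP_ops.txt');
```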
When formatting a time series input file, two formats are available:
.mat
file input, which is suited to data that are already stored as variables in Matlab; or
.txt
file input, which is better suited to cases where each time series is already stored as an individual text file.
When using a .mat file input, the .mat file should contain three variables:
timeSeriesData
: either a N x 1 cell (for N time series), where each element contains a vector of time-series values, or a N x M matrix, where each row specifies the values of a time series (all of length M).
labels
: a N x 1 cell of unique strings containing a named label for each time series.
keywords
: a N x 1 cell of strings, where each element contains a comma-delimited set of keywords (one for each time series), containing no whitespace.
An example involving two time series is below. In this example, we add two time series (showing only the first two values of each), which are labeled according to .dat
files from a hypothetical EEG experiment, and assigned keywords (which are separated by commas and no whitespace). In this case, both are assigned keywords 'subject1' and 'eeg' and, additionally, the first time series is assigned 'trial1', and the second 'trial2' (these labels can be used later to retrieve individual time series). Note that the labels do not need to specify filenames, but can be any useful label for a given time series.
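A sketch of how these variables might be set up for this hypothetical example (the data values shown are truncated placeholders and the file/label names are illustrative):

```matlab
% Two time series from a hypothetical EEG experiment (only two values shown each):
timeSeriesData = {[1.45; 2.87]; [5.21; 0.93]};  % N x 1 cell of column vectors
labels = {'eeg_subject1_trial1.dat'; 'eeg_subject1_trial2.dat'};  % unique names
keywords = {'subject1,eeg,trial1'; 'subject1,eeg,trial2'};  % comma-delimited, no whitespace
save('INP_ts.mat','timeSeriesData','labels','keywords');
```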
When using a text file input, the input file now specifies filenames of time series data files, which Matlab will then attempt to load (using dlmread
). Data files should thus be accessible in the Matlab path. Each time-series text file should have a single real number on each row, specifying the ordered values that make up the time series. Once imported, the time-series data is stored in the database; thus the original time-series data files are no longer required, and can be removed from the Matlab path.
The input text file should be formatted as rows with each row specifying two whitespace separated entries: (i) the file name of a time-series data file and (ii) comma-delimited keywords.
For example, consider the following input file, containing three lines (one for each time series to be added to the database):
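Consistent with the format described above and the time series listed below, the file would look something like this (a whitespace-separated filename and comma-delimited keywords on each row):

```
gaussianwhitenoise_001.dat   noise,gaussian
gaussianwhitenoise_002.dat   noise,gaussian
sinusoid_001.dat             periodic,sine
```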
Using this input file, a new analysis will contain 3 time series: gaussianwhitenoise_001.dat
and gaussianwhitenoise_002.dat
will be assigned the keywords noise
and gaussian
, and the data in sinusoid_001.dat
will be assigned keywords ‘periodic’ and ‘sine’. Note that keywords should be separated only by commas (and no whitespace).
In our system, a master operation refers to a piece of Matlab code and a set of input parameters.
Valid outputs from a master operation are: 1. A single real number, 2. A structure containing real numbers, 3. NaN
to indicate that the input time series is not appropriate for this code.
The (potentially many) outputs from a master operation can thus be mapped to individual operations (or features), which are single real numbers summarizing a time series that make up individual columns of the resulting data matrix.
Two example lines from the input file, INP_mops.txt
(in the Database
directory of the repository), are as follows:
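Reconstructed from the description below, the two lines would look something like:

```
CO_tc3(x_z,1)    CO_tc3_xz_1
ST_length(x)     ST_length
```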
Each line in the input file specifies two pieces of information, separated by whitespace: 1. A piece of code and its input parameters. 2. A unique label for that master operation (that can be referenced by individual operations).
When the time comes to perform computations on data using the methods in the database, Matlab needs to have path access to each of the master operations functions specified in the database. For the above example, Matlab will attempt to run both CO_tc3(x_z,1)
and ST_length(x)
, and thus the functions CO_tc3.m
and ST_length.m
must be in the Matlab path. Recall that the script startup.m
, which should be run at the start of each session using hctsa, handles the addition of paths required for the default code library.
The input file, e.g., INP_ops.txt
(in the Database
directory of the repository) should contain a row for every operation, and use labels that correspond to master operations. An example excerpt from such a file is below:
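A sketch of such an excerpt, consistent with the description below (the structure field names and operation names here are illustrative assumptions):

```
CO_tc3_xz_1.raw     CO_tc3_1_raw     correlation,nonlinear
CO_tc3_xz_1.abs     CO_tc3_1_abs     correlation,nonlinear
CO_tc3_xz_1.num     CO_tc3_1_num     correlation,nonlinear
CO_tc3_xz_1.absnum  CO_tc3_1_absnum  correlation,nonlinear
CO_tc3_xz_1.denom   CO_tc3_1_denom   correlation,nonlinear
ST_length           length           raw,lengthDependent
```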
The first column references a corresponding master label and, in the case of master operations that produce structure, the particular field of the structure to reference (after the fullstop), the second column denotes the label for the operation, and the final column is a set of comma-delimited keywords (that must not include whitespace). Whitespace is used to separate the three entries on each line of the input file. In this example, the master operation labeled CO_tc3_xz_1
, outputs a structure with fields that are referenced by the first five operations listed here, and the ST_length
master operation outputs a single number (the length of the time series), which is referenced by the operation named 'length' here. The two keywords 'correlation' and 'nonlinear' are added to the CO_tc3_1
operations, while the keywords raw
and lengthDependent
are added to the operation called length
. These keywords can be used to organize and filter the set of operations used for a given analysis task.
The hctsa framework consists of three basic objects containing relevant metadata:
Master Operations specify pieces of code (Matlab functions) and their inputs to be computed. Taking in a single time series, master operations can generate a large number of outputs as a Matlab structure, each of which can be identified with a single operation (or 'feature').
Operations (or 'features') are a single number summarizing some measure of structure in a time series. In hctsa, each operation links to an output from a piece of evaluated code (a master operation).
Time series are univariate and uniformly sampled time-ordered measurements.
These three different objects are summarized below:
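An illustrative summary, drawing on the example described next:

Object | Example |
Master operation | CO_AutoCorr(x,1:5,'TimeDomain'): compute autocorrelations at lags 1–5 |
Operation ('feature') | AC_1: the autocorrelation at lag 1 |
Time series | A univariate, uniformly sampled, time-ordered recording |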
In the example above, a master operation specifies the code to run, CO_AutoCorr(x,1:5,'TimeDomain')
, which outputs the autocorrelation of the input time series (x
) at lags 1, 2, ..., 5. Each operation (or 'feature') is a single number that draws on this set of outputs, for example, the autocorrelation at lag 1, which is named AC_1
.
In the hctsa framework, master operations, operations, and time series are stored as tables that contain all of their associated keywords and metadata (and actual time-series data in the case of time series).
For a given hctsa analysis, the user must specify a set of code to evaluate (master operations), their associated individual outputs to measure (operations), and a set of time series to evaluate the features on (time series).
We provide a default library of over 7700 operations (derived from approximately 1000 unique master operations). This can be customized, and additional pieces of code can also be added to the repository.
Having specified a set of master operations, operations, and time series, the results of computing these functions on the time-series data are stored in three matrices:
TS_DataMat is an n x m data matrix containing the results of applying m operations to the n time series.
TS_Quality is an n x m matrix containing quality labels for each operation output (coding different outputs such as errors or NaNs). Quality labels are described in the section below.
TS_CalcTime is an n x m matrix containing calculation times for each operation output. Note that the calculation time stored is for the corresponding master operation.
Each HCTSA*.mat
file includes the tables described above: for TimeSeries (corresponding to the rows of the TS_ matrices), Operations (corresponding to columns of the TS_ matrices), and MasterOperations, corresponding to the code evaluated to compute the operations. In addition, the results are stored as above: TS_DataMat, TS_Quality, and TS_CalcTime.
Quality labels are used to indicate when operations take non-real values, or when fatal errors were encountered. Quality labels are stored in the Quality column of the Results table in the mySQL database, and in local Matlab files as the TS_Quality matrix.
When the quality label is nonzero, this indicates that a special-valued output occurred. In this case, the output value of the operation is set to zero, as a convention, and the quality label codes the special output value:
When applying thousands of time-series analysis methods to diverse datasets, many operations can give results that are not real numbers. Some time series may be inappropriate for a given operation (such as fitting a positive-only distribution to data that is not positive), or measuring stationarity across 2000 datapoints in time series that are shorter than 2000 samples. Other times, an optimization routine may fail, or some unknown error may be called.
Some errors are not problems with the code, but represent issues with applying particular sets of code to particular time series, such as when a Matlab fitting function reaches the maximum number of iterations and returns an error. Other errors are genuine problems with the code that need to be corrected. Both cases are labeled as errors in our framework.
It can be good practice to visualize where special values and errors are occurring after a computation to see where things might be going wrong, using TS_InspectQuality
. This can be run in four modes:
TS_InspectQuality('summary');
[default] Summarizes the proportion of special-valued outputs in each operation as a bar plot, ordered by the proportion of special-valued outputs.
TS_InspectQuality('master');
Plots which types of special-valued outputs were encountered for each master operation.
TS_InspectQuality('full');
Plots the full data matrix (all time series as rows and all operations as columns), and shows where each possible special-valued output can occur (including 'error', 'NaN', 'Inf', '-Inf', 'complex', 'empty', or a 'link error').
TS_InspectQuality('reduced');
As 'full'
, but includes only columns where special values occurred.
For example, running TS_InspectQuality('summary')
loads in data from HCTSA.mat and produces a summary plot that can be zoomed in on and explored to understand which features are producing problematic outputs.
Note that errors returned from Matlab files do not halt the progress of the computation (using try-catch
statements), but errors with compiled mex functions (or external command-line packages like TISEAN) can produce a fault that crashes Matlab or the system. We have performed some basic testing on all mex functions, but for some unusual time series, such faults may still occur. These situations must be dealt with by either identifying and fixing the problem in the original source code and recompiling, or by removing the problem code.
When getting information on operations that produce special-valued outputs (getting IDs listed from TS_InspectQuality
), it can be useful to then test examples by re-running pieces of code with the problematic data. The function TS_WhichProblemTS
can be used to retrieve time series from an hctsa dataset that caused a problem for a given operation.
Usage is as follows:
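A sketch of the call (the operation ID 684 matches the example described below, and the output names follow the text):

```matlab
% Retrieve time series that caused problems for the operation with ID 684:
[ts_ind, dataCell, codeEval] = TS_WhichProblemTS(684);
```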
This provides the list of time series IDs (ts_ind
), their time-series data vectors (dataCell
), and the code to evaluate the given operation (in this case, the master operation code corresponding to the operation with ID 684).
You can then pick an example time series (e.g., the first problem time series: x = dataCell{1}; x_z = zscore(x)
), and then copy and paste the code in codeEval into the command line to evaluate the code for this time series. This method allows easy debugging and inspection of examples of time-series data that caused problems for particular operations flagged through the TS_InspectQuality
process.
The clustered data matrix (if clustering has been performed, otherwise the non-clustered data matrix is used) can be visualized by running TS_PlotDataMatrix
.
This will produce a colored visualization of the data matrix such as that shown below.
When data is grouped according to a set of distinct keywords and stored as group metadata (using the TS_LabelGroups function
), these can also be visualized using TS_PlotDataMatrix('colorGroups',true)
.
Running TS_PlotDataMatrix('norm')
plots the data contained in the local file HCTSA_N.mat
, yielding a plot in which black rectangles label missing values, and other values are shown from low (blue) to high (red) after normalization using the scaled outlier-robust sigmoidal transformation. Due to the size of the matrix, operations are not labeled individually.
Examples of time-series segments are shown to the left of the plot, and when the middle plot is zoomed, the time-series annotations remain matched to the data matrix.
It can be useful to display the matrix with the order of time series and operations preserved, but the relationships between rows and columns can be difficult to visualize when ordered randomly.
Much prettier, hey?! By reordering rows and columns, this representation reveals correlated patterns of outputs across different types of operations, and similar sets of properties between different types of time series.
In this example, we consider a set of 20 periodic and 20 noisy periodic signals. We assigned the time series in HCTSA.mat
to groups (using TS_LabelGroups('raw',{'periodic','noisy'})
), then normalized the data matrix (TS_Normalize
), and then clustered it (TS_Cluster
). So now we have a clustered data matrix containing thousands of summaries of each time series, as well as pre-assigned group information as to which time series are periodic and which are noisy. When the time series have been assigned to groups, this can be accessed by switching on the 'colorGroups'
setting:
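For example, using the calling pattern shown earlier for TS_PlotDataMatrix:

```matlab
TS_PlotDataMatrix('colorGroups',false);  % default coloring
TS_PlotDataMatrix('colorGroups',true);   % color by assigned group labels
```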
producing the following two plots:
When group information is not used (the left plot), the data is visualized in the default blue/yellow/red color scheme, but when the assigned groups are colored (right plot), we see that the clustered dataset separates perfectly into the periodic (green) and noisy (blue) time series, and we can visualize the features that contribute to the separation.
hctsa allows you to map time-series data into a unified feature space, a representation that is well suited to then applying machine learning methods to perform classification where the goal is to predict a class label assigned to each time series from features of its dynamics.
Class labels can be assigned to time series in an hctsa dataset using the function TS_LabelGroups
, which populates the Group
column of the TimeSeries
table as a categorical labeling of the data (the class labels the classifier will attempt to predict). These group labels are used by a range of analysis functions, including TS_PlotLowDim
, TS_TopFeatures
, and TS_Classify
.
The machinery of TS_LabelGroups
assumes that the class labels you are interested in are contained in the Keywords
column of the TimeSeries
table (set up from your original input file when TS_Init
was run).
If you want to label according to a specific set of keywords, you can do this by specifying the keywords that define the unique groups. The example below assigns labels to two groups of time series in the HCTSA.mat
(specifying the shorthand 'raw'
for this default, un-normalized data), corresponding to those labeled as 'parkinsons'
and those labeled as 'healthy'
:
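For example, following the calling pattern used elsewhere in this documentation:

```matlab
% Label the un-normalized ('raw') data into two keyword-defined groups:
TS_LabelGroups('raw',{'parkinsons','healthy'});
```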
Note that any time series not labeled as either 'parkinsons'
or 'healthy'
are left unlabeled.
If every time series has a single keyword that uniquely labels it, TS_LabelGroups
can automatically detect this by running TS_LabelGroups('raw',{});
.
More complex labeling (e.g., using custom combinations of keywords) is not implemented in TS_LabelGroups, but can be achieved by writing a script that performs logical operations on calls to TS_LabelGroups and saves the result back to the TimeSeries.Group column. An example is combining two labels (male/female and day/night) on fly movement data.
By default, the group labels are saved back to the data file (in this example, HCTSA.mat
).
Group labels can be reassigned at any time by re-running the TS_LabelGroups
function, and can be cleared by running, e.g., TS_LabelGroups('raw','clear')
.
Assigned labels are used by the analysis, plotting, and classification functions of hctsa.
Note: If you assign a labeling to a given dataset (e.g., HCTSA.mat
), then this labeling will remain with the normalized data (e.g., after running TS_Normalize
).
Once a set of operations has been computed on a time-series dataset, the results are stored in a local HCTSA.mat file. These results can be used to perform a wide variety of highly comparative analyses, such as those outlined below.
The type of analysis employed should then be motivated by the specific time-series analysis task at hand. Setting up the problem, guiding the methodology, and interpreting the results requires strong scientific input that should draw on domain knowledge, including the questions asked of the data, experience performing data analysis, and statistical considerations.
The first main component of an hctsa analysis involves filtering and normalizing the data using TS_Normalize, described below, which produces a file called HCTSA_N.mat. Information about the similarity of pairs of time series and operations can be computed using TS_Cluster, also described below, which stores this information in HCTSA_N.mat. The suite of plotting and analysis tools provided with hctsa work with this normalized data, stored in HCTSA_N.mat, by default.
Visualizing structure in the data matrix using TS_PlotDataMatrix.
Visualizing the time-series traces using TS_PlotTimeSeries.
Visualizing low-dimensional structure in the data using TS_PlotLowDim.
Exploring similar matches to a target time series using TS_SimSearch.
Visualizing the behavior of a given operation across the time-series dataset using TS_FeatureSummary.
For time-series classification tasks, groups of time series can be labeled using the TS_LabelGroups function described above; this group label information is stored in the local HCTSA*.mat file, and used by default in the various plotting and analysis functions provided. Additional analysis functions are provided for basic time-series classification tasks:
Explore the classification performance of the full library of features using TS_Classify.
Determine the features that (individually) best distinguish between the labeled groups using TS_TopFeatures.
For the purposes of visualizing the data matrix, it is often desirable to have the rows and columns reordered to put similar rows adjacent to one another, and similarly to place similar columns adjacent to one another. This reordering can be done using hierarchical linkage clustering, by the function TS_Cluster
:
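A minimal call, run with default settings on the normalized data, is:
% Compute linkage clustering and reorder rows/columns of HCTSA_N.mat by similarity:
TS_Cluster();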
This function reads in the data from HCTSA_N.mat, and stores the re-ordering of rows and columns back into HCTSA_N.mat in the ts_clust and op_clust structures (and, if the size is manageable, also the pairwise distance information). Visualization functions (such as TS_PlotDataMatrix and TS_PlotTimeSeries) can then take advantage of this information, using the general input label 'cl'.
The first step in analyzing a dataset involves processing the data matrix, which can be done using TS_Normalize
. This involves filtering out operations or time series that produced many errors or special-valued outputs, and then normalizing of the output of all operations, which is typically done in-sample, according to an outlier-robust sigmoidal transform (although other normalizing transformations can be selected). Both of these tasks are performed using the function TS_Normalize
. The TS_Normalize
function writes the new, filtered, normalized matrix to a local file called HCTSA_N.mat
. This contains normalized, and trimmed versions of the information in HCTSA.mat
.
Example usage is as follows:
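For example (using the inputs described below):
TS_Normalize('mixedSigmoid',[0.8,1.0]);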
The first input controls the normalization method, in this case a scaled, outlier-robust sigmoidal transformation, specified with 'mixedSigmoid'
. The second input controls the filtering of time series and operations based on minimum thresholds for good values in the corresponding rows (corresponding to time series; filtered first) and columns (corresponding to operations; filtered second) of the data matrix.
In the example above, time series (rows of the data matrix) with more than 20% special values (specifying 0.8) are first filtered out, and then operations (columns of the data matrix) containing any special values (specifying 1.0) are removed. Columns with approximately constant values are also filtered out. After filtering the data matrix, the outlier-robust 'mixedSigmoid' transformation is applied to all remaining operations (columns). The filtered, normalized matrix is saved to the file HCTSA_N.mat.
Details of the normalization applied are saved to the HCTSA_N.mat file as normalizationInfo, a structure that contains the normalization function, the filtering options used, and the corresponding TS_Normalize code that can be used to re-run the normalization.
It makes sense to weight each operation equally for the purposes of dimensionality reduction, and thus normalize all operations to the same range using a transformation like ‘scaledRobustSigmoid’, ‘scaledSigmoid’, or ‘mixedSigmoid’. For the case of calculating mutual information distances between operations, however, one would rather not distort the distributions and perform no normalization, using ‘raw’ or a linear transformation like ‘zscore’, for example. The full list of implemented normalization transformations is listed in the function BF_NormalizeMatrix.
Note that the 'scaledRobustSigmoid' transformation does not tolerate distributions with an interquartile range of zero, which will be filtered out; 'mixedSigmoid' will treat these distributions in terms of their standard deviation (rather than interquartile range).
Filtering parameters depend on the application. Some applications can allow the filtering thresholds to be relaxed; for example, setting [0.7,0.9] removes time series with less than 70% good values, and then removes operations with less than 90% good values. Some applications can tolerate some special-valued outputs from operations (like some clustering methods, where distances are simply calculated using those operations that did not produce special-valued outputs for each pair of objects), but others cannot (like Principal Components Analysis); the filtering parameters should be specified accordingly.
Analysis can be performed on the data contained in HCTSA_N.mat
in the knowledge that different settings for filtering and normalizing the results can be applied at any time by simply rerunning TS_Normalize
, which will overwrite the existing HCTSA_N.mat
with the results of the new normalization and filtration settings.
TS_FilterData
It is often useful to check whether the feature-based classification results of a given analysis is driven by 'trivial' types of features that do not depend on the dynamical properties of the data, e.g., features sensitive to time-series length, location (e.g., mean), or spread (e.g., variance). Because these features are labeled as 'lengthdep'
, 'locdep'
, and 'spreaddep'
, you can easily filter these out to check the robustness of your analysis.
An example:
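A minimal sketch is below; it assumes that TS_GetIDs returns the IDs of operations matching a keyword (and, as a second output, the IDs of those that don't), and that TS_FilterData takes the data source, time-series IDs to keep, operation IDs to keep, and an output file name (check each function's help text; the output filename is a hypothetical choice):
% Find operations tagged as length-dependent in the normalized data:
[lengthDepIDs,notLengthDepIDs] = TS_GetIDs('lengthdep','norm','ops');
% Save a filtered version of the data without these features:
TS_FilterData('norm',[],notLengthDepIDs,'HCTSA_lengthFilt.mat');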
You could use the same template to filter 'locdep'
or 'spreaddep'
features (or any other combination of keyword labels). You can then go ahead with analyzing the filtered HCTSA dataset as above, except using your new filename, HCTSA_locFilt
.
When running hctsa analyses, often you want to take subsets of time series (to look in more detail at a subset of your data) or subsets of operations (to explore the behavior of different feature subsets), or combine multiple subsets of data together (e.g., as additional data arrive).
The hctsa package contains a range of functions for these types of tasks, which work directly with hctsa .mat files and are described below. Note that these types of tasks are easier to manage when hctsa data are stored in a mySQL database.
TS_GetIDs
Many time-series classification problems involve filtering subsets of time series based on keyword matching, where keywords are specified in the input file provided when initializing a dataset.
Most filtering functions (such as those listed in this section) require you to specify a range of IDs of TimeSeries or Operations. Recall that each TimeSeries and Operation is assigned a unique ID (stored as the ID field in the corresponding metadata table). To quickly get the IDs of time series that match a given keyword, the following function can be used:
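For example (where 'noisy' is an example keyword):
tsIDs = TS_GetIDs('noisy','norm'); % IDs of time series tagged 'noisy' in HCTSA_N.mat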
Or the IDs of operations tagged with the 'entropy' keyword:
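A sketch of the same call for operations (the 'ops' specifier is an assumption; see the function's help):
opIDs = TS_GetIDs('entropy','norm','ops');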
These IDs can then be used in the functions below (e.g., to clear data, or extract a subset of data).
Note that to get a quick impression of the unique time-series keywords present in a dataset, use the function TS_WhatKeywords
, which gives a text summary of the unique keywords in an hctsa dataset.
TS_LocalClearRemove
Sometimes you may want to remove a time series from an hctsa dataset because the data was not properly processed, for example. Or one operation may have produced errors because of a missing toolbox reference, or you may have altered the code for an operation, and want to clear the stored results from previous calculations.
For example, often you want to remove from your operation library operations that are dependent on the location of the data (e.g., its mean: 'locdep'
), that only operate on positive-only time series ('posOnly'
), that require the TISEAN package ('tisean'
), or that are stochastic (i.e., they give different results when repeated, 'stochastic'
).
The function TS_LocalClearRemove
achieves these tasks when working directly with .mat files (NB: if using a mySQL database, the corresponding database function should be used instead).
TS_LocalClearRemove
loads an hctsa .mat data file, clears or removes the specified time series or operations, and then writes the result back to the file.
Example 1: Clear all computed data from time series with IDs 1:5 from HCTSA.mat
(specifying 'raw'
):
Example 2: Remove all operations with the keyword 'tisean' (that depend on the TISEAN package) from HCTSA.mat:
Example 3: Remove all operations that require positive-only data (the 'posOnly' keyword) from HCTSA.mat:
Example 4: Remove all operations that are location dependent (the 'locdep'
keyword) from HCTSA.mat
:
See the documentation in the function file for additional details about the inputs to TS_LocalClearRemove
.
TS_Subset
Sometimes it's useful to retrieve a subset of an hctsa dataset, when analyzing just a particular class of time series, for example, or investigating a balanced subset of data for time-series classification, or to compare the behavior of a reduced subset of features. This can be done with TS_Subset
, which takes in an hctsa dataset and generates the desired subset, which can be saved to a new .mat file.
Example 1: Import data from 'HCTSA_N.mat', then save a new dataset containing only time series with IDs in the range 1--100, and all operations, to 'HCTSA_N_subset.mat' (see documentation for all inputs).
Note that the subset in this case will have been normalized using the full dataset of all time series, and only this subset (with IDs up to 100) is now being analyzed. Depending on the normalization method used, different results would be obtained if the subsetting were performed prior to normalization.
Example 2: From HCTSA.mat
('raw'
), save a subset of that dataset to 'HCTSA_healthy.mat' containing only time series tagged with the 'healthy' keyword:
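A minimal sketch (the input order of TS_Subset, i.e., data source, time-series IDs, operation IDs, save flag, output file name, is an assumption; check the function's help):
% IDs of time series tagged 'healthy' in HCTSA.mat:
healthyIDs = TS_GetIDs('healthy','raw');
% Save a new dataset containing only these time series:
TS_Subset('raw',healthyIDs,[],true,'HCTSA_healthy.mat');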
TS_Combine
When analyzing a growing dataset, sometimes new data needs to be combined with computations on existing data. Alternatively, when computing a large dataset, sometimes you may wish to compute sections of it separately, and may later want to combine each section into a full dataset.
To combine hctsa data files, you can use the TS_Combine
function.
Example: combine hctsa datasets stored in the files HCTSA_healthy.mat
and HCTSA_disease.mat
into a new combined file, HCTSA_combined.mat
:
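A minimal sketch (the third input, compare_tsids, is described below; an output file name, e.g., 'HCTSA_combined.mat', may also need to be specified, so check the function's help):
TS_Combine('HCTSA_healthy.mat','HCTSA_disease.mat',false);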
We use the convention that x refers to the input time series and x_z refers to a z-scored transformation of the input time series (i.e., x_z = (x - mean(x))/std(x)). In the example above, the first line thus adds an entry in the database for running the code CO_tc3 using a z-scored time series as input (x_z), with 1 as the second input, under the label CO_tc3_xz_1, and the second line adds an entry for running the code ST_length using the untransformed time series, x, under the label length.
The third input, compare_tsids, controls the behavior of the function in combining time series. By setting this to 1, TS_Combine assumes that the TimeSeries IDs are comparable between the datasets (most common when using a mySQL database), and thus filters out duplicates so that the resulting hctsa dataset contains a unique set of time series. By setting this to 0 (default), the output will contain a union of the time series present in each of the two hctsa datasets. In the case that duplicate TimeSeries IDs exist in the combination file, a new index will be generated in the combined file (where IDs assigned to time series are re-assigned as unique integers using TS_ReIndex).
In combining operations, this function works differently when data have been stored in a unified mySQL database, in which case operation IDs can be compared meaningfully and combined as an intersection. However, when hctsa datasets have been generated using TS_Init, the function will check that the same set of operations has been used in both files.
|  | Master Operation | Operation | Time Series |
| --- | --- | --- | --- |
| Summary: | Code and inputs to execute | Single feature | Univariate data |
| Example: | CO_AutoCorr(x,1:5,'TimeDomain') | AC_1 | [1.2, 33.7, -0.1, ...] |
| Quality label | Description |
| --- | --- |
| 0 | No problems with calculation. Output was a real number. |
| 1 | A fatal error was encountered. |
| 2 | Output of the code was NaN. |
| 3 | Output of the code was Inf. |
| 4 | Output of the code was -Inf. |
| 5 | Output had a non-zero imaginary component. |
| 6 | Output was empty (e.g., []). |
| 7 | Field specified for this operation did not exist in the master operation output structure. |
In a time-series classification task, you are often interested in not only determining differences between the labeled classes, but also interpreting them in the context of your application. For example, an entropy estimate may give low values to the RR interval series measured from a healthy population, but higher values to those measured from patients with congestive heart failure. When performing a highly comparative time-series analysis, we have computed a large number of time-series features and can use these results to arrive at these types of interpretable conclusions automatically.
A simple way to determine which individual features are useful for distinguishing the labeled classes of your dataset is to compare each feature individually in terms of its ability to separate the labeled classes of time series. This can be achieved using the TS_TopFeatures
function.
In this example, we consider 100 EEG time series from seizure, and 100 EEG time series from eyes open (from the publicly-available Bonn University dataset). After running TS_LabelGroups
(as simply TS_LabelGroups()
, which automatically detects the two keywords assigned to the two classes of signals), we run simply:
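% Run with default settings (a bare call; optional inputs control the test
% statistic, the plots produced, and whether nulls are computed):
TS_TopFeatures();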
By default this function will compare the in-sample linear classification performance of all operations in separating the labeled classes of the dataset (individually), produce plots summarizing the performance across the full library (compared to an empirical null distribution), characterize the top-performing features, and show their dependencies on the dataset. Inputs to the function control how these tasks are performed (including the statistic used to measure individual performance, what plots to produce, and whether to produce nulls).
First, we get an output to screen, showing the mean linear classification accuracy across all operations, and a list of the operations with top performance (ordered by their test statistic, with their ID shown in square brackets, and keywords in parentheses):
Here we see that measures of entropy are dominating this top list, including measures of the time-series distribution.
An accompanying plot summarizes the performance of the full library on the dataset, compared to a set of nulls (generated by shuffling the class labels randomly):
Here we can see that the default feature library (blue: 7275 features remaining after filtering and normalization) performs much better than the randomized null features (red).
Next we get a set of plots showing the class probability distributions for the top features. For example:
This allows us to interpret the values of features in terms of the dataset. After some inspection of the listed operations, we find that various measures of EEG Sample Entropy are higher for healthy individuals with eyes open (blue) than for seizure signals (purple).
Finally, we can visualize how the top operations depend on each other, to try to deduce the main independent behaviors in this set of 25 top features for the given classification task:
In this plot, the magnitude of the linear correlation, R, between all pairs of top operations is shown using color (as the distance 1-|R|), and a visualization of a clustering into 5 groups is shown (with green boxes corresponding to clusters of operations based on the dendrogram shown to the right of the plot). Blue boxes indicate a representative operation of each cluster (closest to the cluster centre).
Here we see that most pairs of operations are quite dependent (at |R| > ~0.8
), and the main clusters are those measuring distribution entropy (EN_DistributionEntropy_
), Lempel-Ziv complexity (EN_MS_LZcomplexity_
), Sample Entropy (EN_SampEn_
), and a one-step-ahead surprise metric (FC_Surprise_
).
This function produces all of the basic visualizations required to achieve a basic understanding of the differences between the labeled classes of the dataset, and should be considered a first step in a more detailed analysis. For example, in this case we may investigate the top-performing features in more detail, to understand in detail what they are measuring, and how. This process is described in the Interpreting features section.
While the global structure of a time-series dataset can be investigated by plotting the data matrix (TS_PlotDataMatrix
) or a low-dimensional representation of it (TS_PlotLowDim
), sometimes it can be more interesting to retrieve and visualize relationships between a set of nearest neighbors to a particular time series of interest.
The hctsa framework provides a way to easily compute distances between pairs of time series, e.g., as a Euclidean distance between their normalized feature vectors. This allows very different time series (in terms of their origin, their method of recording and measurement, and their number of samples) to be compared straightforwardly according to their properties, measured by the algorithms in our hctsa library.
For this, we use the TS_SimSearch
function, specifying the id of the time series of interest (i.e., the ID
field of the TimeSeries
structure) with the first input, and the number of neighbors with the 'numNeighbors' input specifier (default: 20). By default, data are loaded from HCTSA_N.mat, but a custom source can be specified using the 'whatDataFile' input specifier (e.g., TS_SimSearch('whatDataFile','HCTSA_custom.mat')).
After specifying the target and how many neighbors to retrieve, TS_SimSearch
outputs the list of neighbors and their distances to screen, and the function also provides a range of plotting options to visualize the neighbors. The plots to produce are specified as a cell using the 'whatPlots' input.
To investigate the pairwise relationships between all neighbors retrieved, you specify the 'matrix'
option of the TS_SimSearch
function. An example output using a publicly-available EEG dataset, retrieving 14 neighbors from the time series with ID = 1
, as TS_SimSearch(1,'whatPlots',{'matrix'},'numNeighbors',14)
, is shown below:
The specified target time series (ID = 1
) is shown as a white star, and all 14 neighbors are shown, as labeled on the left of the plot with their respective IDs, and a 100-sample subset of their time traces.
Pairwise distances are computed between all pairs of time series (as a Euclidean distance between their feature vectors), and plotted using color, from low (red = more similar pairs of time series) to high (blue = more different pairs of time series).
Because this dataset contains 3 classes that were previously labeled (using TS_LabelGroups
as: TS_LabelGroups({'seizure','eyesOpen','eyesClosed'})
), the function shows these class assignments using color labels to the left of the plot (purple, green, and orange in this case).
In this case we see that the purple and green classes are relatively similar under this distance metric (eyes open and eyes closed), whereas the orange time series (seizure) are distinguished.
Another way to visualize the similarity (under our feature-based distance metric) of all pairs of neighbors is using a network visualization. This is specified as:
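For example (mirroring the call above, but requesting the 'network' plot type):
TS_SimSearch(1,'whatPlots',{'network'},'numNeighbors',14)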
which produces something like the following:
The strongest links are visualized as blue lines (by default, the top 40% of strongest links are plotted, cf. the legend showing 0.9, 0.8, 0.7, and 0.6 for the top 10%, 20%, 30%, and 40% of links, respectively).
The target is distinguished (as purple in this case), and the other classes of time series are shown using color, with names and time-series segments annotated. Again, you can see that the EEG time series during seizure (blue) are distinguished from eyes open (red) and eyes closed (green).
The scatter setting visualizes the relationship between the target and each of 12 time series with the most similar properties to the target. Each subplot is a scatter of the (normalized) outputs of each feature for the specified target (x-axis) and the match (y-axis). An example is shown below.
Multiple output plots can be produced simultaneously by specifying many types of plots as follows:
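For example (a sketch combining the three plot types described in this section):
TS_SimSearch(1,'whatPlots',{'matrix','network','scatter'})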
This produces a plot of each type.
Note that pairwise distances can be pre-computed and saved in the HCTSA*.mat
file using TS_PairwiseDist
for custom distance metrics (which is done by default in TS_Cluster
for datasets containing fewer than 1000 objects). TS_SimSearch
checks for this information in the specified input data (containing the ts_clust
or op_clust
structure), and uses it to retrieve neighbors. If distances have not previously been computed, distances from the target are computed as Euclidean distances (time series) or absolute correlation distances (operations) between feature vectors within TS_SimSearch.
The hctsa package provides a simple means of plotting time series: the TS_PlotTimeSeries
function.
For example, to plot a set of time series that have not been assigned groups, we can run the following:
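A minimal sketch is below; the optional inputs shown (the number of time series to plot, an empty placeholder, and the maximum number of samples to display) are assumptions, so check the function's help text:
TS_PlotTimeSeries('norm',10,[],400);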
For our assorted set of time series, this produces the following:
Showing the first 400 samples of 10 selected time series, equally-spaced through the TimeSeries IDs in HCTSA_N.mat
.
Many more custom plotting options are available by passing an options structure to TS_PlotTimeSeries
, including the 'plotFreeForm'
option which allows very many time series to be shown in a single plot (without the usual axis borders):
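For example (a sketch; the input order, with the plot-options structure as the final input, is an assumption):
TS_PlotTimeSeries('norm',40,[],300,struct('plotFreeForm',true));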
producing an overview picture of the first 300 samples of 40 time series (spaced through the rows of the data matrix).
When the time series have been assigned groups (using TS_LabelGroups
, here), this information is automatically incorporated into TS_PlotTimeSeries
, which then plots a given number of each time series group, and colors them accordingly:
In this case the two labeled groups of time series are recognized by the function: red (noisy), blue (no noise), and then 5 time series in each group are plotted, showing the first 500 samples of each time series.
Sometimes it's useful to be able to investigate the behavior of an individual operation (or feature) across a time-series dataset. What are the distribution of outputs, and what types of time series receive low values, and what receive high values?
These types of simple questions for specific features of interest can be investigated using the TS_FeatureSummary
function. The function takes in an operation ID as its input (and can also take inputs specifying a custom data source, or custom annotation parameters), and produces a distribution of outputs from that operation across the dataset, with the ability to then annotate time series onto that plot.
For example, the following:
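% Plot the distribution of outputs of the feature with ID 4310 across the dataset
% (additional inputs can specify a custom data source and annotation parameters):
TS_FeatureSummary(4310);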
Produces the following plot (where 6 points on the distribution have been clicked on to annotate them with short time-series segments):
You can visually see that time series with more autocorrelated patterns through time receive higher values from this operation. Because groups have not been assigned to this dataset, the time series are colored at random.
Running TS_FeatureSummary in violin plot mode provides another representation of the same result:
This plots the distribution of feature 4310 from HCTSA.mat
as a violin plot, with ten 500-point time series subsegments annotated at different points through the distribution, shown to the right of the plot:
When time series groups have been labeled (using TS_LabelGroups
as: TS_LabelGroups({'seizure','eyesOpen','eyesClosed'},'raw');
), TS_FeatureSummary
will plot the distribution for each class separately, as well as an overall distribution. Annotated points can then be added to each class-specific distributions. In the example shown below, we can see that the 'noisy' class (red) has low values for this feature (CO_tc3_2_denom
), whereas the 'periodic' class mostly has high values.
TS_SingleFeature
provides a simpler way of seeing the class distributions without annotations, as either kernel-smoothed distributions, as in TS_FeatureSummary
, or as violin plots.
See below for example implementations:
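A minimal sketch of the two variants is below (the input order, i.e., data source, feature ID, violin flag, is an assumption, and opID is a placeholder for the ID of the feature of interest):
opID = 4310; % hypothetical example feature ID (e.g., retrieved using TS_GetIDs)
TS_SingleFeature('raw',opID,false); % kernel-smoothed class distributions
TS_SingleFeature('raw',opID,true);  % violin-plot version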
The first shows the distributions with a classification bar underneath (indicating where a linear classifier would classify different parts of the space as noisy or periodic):
The second shows the distributions as violin plots, with means annotated and a classification bar to the left:
Note that the accuracy quoted in the title (an indication of the 10-fold cross-validated balanced accuracy of a linear classifier in this space) is computed from a single 10-fold split and is therefore stochastic; as shown above, repeating the calculation can yield slightly different results. For a more rigorous analysis than this simple indication, the procedure should be repeated many more times to give a converged estimate of the balanced classification accuracy.
When performing a time-series classification task, a basic first exploration of the data is to investigate how accurately a classifier can learn a mapping from time-series features to labels assigned to time series in your dataset.
The first step is to assign group labels to time series in your dataset using TS_LabelGroups
.
Depending on the classifier, you typically want to first normalize the features to put them all on a similar scale (using TS_Normalize
).
Depending on the question asked of the data, you should also consider whether certain types of features should be removed. For example, you may wish to exclude length-dependent features (if differences in time-series length vary between classes but are an uninteresting artefact of the measurement). This can be done using TS_Subset
(and functions like TS_CompareFeatureSets
described below allow you to test the sensitivity of these results).
TS_Classify
TS_Classify uses all of the features in a given hctsa data matrix to classify the assigned class labels.
You can set classification settings, from the number of folds to use in cross-validation to the type of classifier, as the cfnParams
structure. For the labeling defined in a given TimeSeries
table, you can set defaults for this using cfnParams = GiveMeDefaultClassificationParams('norm')
(takes TimeSeries
labeling from HCTSA_N.mat
). This automatically sets an appropriate number of folds (for cross-validation), and includes settings for taking into account class imbalance in classifier training and evaluation. It is best to alter the values inside this function to suit your needs, such that these settings can be applied consistently.
First let's run a simple classification of the groups labeled in HCTSA_N.mat
, using default classification settings:
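% Classify the labeled groups in HCTSA_N.mat using default settings:
TS_Classify('norm');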
In large feature spaces like in hctsa, simpler classifiers (like 'svm_linear'
) tend to generalize well, but you can play with the settings in cfnParams
to get a sense for how the performance varies.
As well as the classification results, the function also produces a confusion matrix, which is especially useful for evaluating where classification errors are occurring. Here's an example for a five-class problem:
In datasets containing fewer time series, it is more likely to obtain high classification accuracies by chance. You may therefore wonder how confident you can be with your classification accuracy. For example if you get a two-class classification accuracy of 60%, you might wonder what the probability is of obtaining such an accuracy by chance?
You can set numNulls
in TS_Classify
to iterate over the classification settings defined in cfnParams
except using shuffled class labels. This builds up a null distribution from which you can estimate a p-value to infer the significance of the classification accuracy obtained with the true data labeling provided.
You can also choose to run across multiple cores by switching on doParallel
:
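A sketch is below; the argument pattern (data source, classification parameters, number of nulls, then the 'doParallel' flag) is an assumption, so check the function's help:
cfnParams = GiveMeDefaultClassificationParams('norm');
numNulls = 100; % number of label-shuffled repeats used to build the null distribution
TS_Classify('norm',cfnParams,numNulls,'doParallel',true);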
This gives you a p-value estimate (both via a direct permutation test, and by assuming a Gaussian null distribution), and plots the null distribution with the true result annotated:
TS_CompareFeatureSets
You might wonder whether the classification results are driven by simple types of features that aren't related to time-series dynamics at all (such as the mean of the data, or time-series length).
These can be filtered out from the initial computation (e.g., when performing TS_Init
), or subsequently (e.g., using TS_Subset
), but you can test the effect such features are having on your dataset using TS_CompareFeatureSets
. Here's an example output:
Here we see that length-dependent features are contributing to accurate classification (above-50% accuracy for this two-class balanced problem). We can take some relief from the fact that excluding these features ('notLengthDependent'
) does not significantly alter the classification accuracy, so these features are not single-handedly driving the classification results. Nevertheless, assuming that differences in recording length are not an interesting class difference (but rather a measurement artefact that could bias the classification results), it would be advisable to remove these features for peace of mind.
Depending on the task, the full complexity of the time-series analysis literature may be needed to different degrees for strong classification results. You can quickly assess how accurately a smaller number of reduced components (e.g., leading Principal Components) can classify your dataset using TS_Classify_LowDim
:
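% A sketch: classify using cumulative numbers of leading PCs of the normalized data
% (additional inputs, e.g., the number of PCs to consider, are assumptions; see the help):
TS_Classify_LowDim('norm');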
The classification accuracy is shown for all features (green, dashed), and as a function of the number of leading PCs included in the classifier (black circles). Note that this is cumulative: '5 PCs' means classification in the five-dimensional space of the five leading PCs.
Here we find that we can get decent classification accuracy with just four PCs (and perhaps even more complex classifiers will give even better results in the lower-dimensional spaces).
You can quickly interpret the types of features loading strongly onto each PC from the information shown to screen; for example, on this dataset the output demonstrates that long-lag autocorrelations are the features most strongly correlated to PC5.
The software also provides a basic means of visualizing low-dimensional representations of the data, using PCA as TS_PlotLowDim
.
This can be done for a time-series dataset as follows:
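A minimal sketch (the 'pca' specifier for the dimensionality-reduction method is an assumption):
TS_PlotLowDim('norm','pca');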
This uses the normalized data (specifying 'norm'
), plotting time series in the reduced, two-dimensional principal components space of operations (the leading two principal components of the data matrix).
By default, the user will be prompted to select 10 points on the plot to annotate with their corresponding time series, which are annotated as the first 300 points of that time series (and their names by default).
After selecting 10 points, we have the following:
The proportion of variance explained by each principal component is provided in parentheses in the axis label.
Annotation properties can be altered in some detail by specifying properties as the annotateParams input variable, which yields, for example:
If groups of time series have been specified (using TS_LabelGroups
), then these are automatically recognized by TS_PlotLowDim
, which will then distinguish the labeled groups in the resulting 2-dimensional annotated time-series plot.
Consider the sample dataset containing 20 periodic signals with additive noise (given the keyword noisy in the database), and 20 purely periodic signals (given the keyword periodic in the database). After retrieving and normalizing the data, we store the two groups in the metadata for the normalized dataset HCTSA_N.mat:
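For example (following the labeling syntax shown earlier, applied to the normalized data):
TS_LabelGroups('norm',{'noisy','periodic'});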
Now when we plot the dataset in TS_PlotLowDim
, it will automatically distinguish the groups in the plot and attempt to classify the difference in the reduced principal components space.
Running the following:
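% Plot the labeled, normalized dataset in a two-dimensional principal components space
% (the 'pca' specifier is an assumption):
TS_PlotLowDim('norm','pca');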
The function then directs you to select 6 points to annotate time series to, producing the following:
Notice how the two labeled groups have been distinguished as red and blue points, and a linear classification boundary has been added (with in-sample misclassification rate annotated to the title and to each individual principal component). If marginal distributions are plotted (setting showDistribution = true
above), they are labeled according to the same colors.
When running large-scale hctsa computations, it can be useful to set up a mySQL database for time series, operations, and the computation results, and have many Matlab instances (running on different nodes of a compute cluster, for example) communicate directly with the database.
The hctsa software comes with this (optional) functionality, allowing a more powerful, distributed way to compute and store results of a large-scale computation.
This chapter outlines the steps involved in setting up, and running hctsa computations using a linked mySQL database.
The hctsa package requires some preliminary set up to work with a mySQL database, described here:
Installation of mySQL, either locally, or on an accessible server.
Setting up Matlab with a mySQL java connector (done by running the install_jconnector
script in the Database directory, and then restarting Matlab).
After the database is set up, and the packages required by hctsa are installed (by running the install
script), linking to a mySQL database can be done by running the install_database
script, which:
Sets up Matlab to be able to communicate with the mySQL server and creates a new database to store Matlab calculations in, described here.
This section contains additional details about each of these steps.
Note that the above steps are one-off installation steps; once the software is installed and compiled, a typical workflow will simply involve opening Matlab, running the startup
script (which adds all paths required for the hctsa software), and then working within Matlab from any desired directory.
Once installed using our default library of operations, the typical next step is to add a dataset of time series to the database using the SQL_Add
command. Custom master operations and operations can also be added, if required.
After installing the software and importing a time-series dataset to a mySQL database, the process by which data is retrieved from the database to local Matlab files (using SQL_Retrieve
), feature sets computed within Matlab (using TS_Compute
), and computed data stored back in the database (SQL_store
) is described in detail here.
After the computation is complete for a time-series dataset, a range of processing, analysis, and plotting functions are also provided with the software, as described here.
We assume that the user has access to and appropriate read/write privileges for a local or network mySQL server database. Instructions on how to install and set up a mySQL database on a variety of operating systems can be found here.
Before the structure of the database can be created, Matlab must be set up to be able to talk to the mySQL server, which requires installing a mySQL java connector. The steps required to achieve this are performed by the script install_jconnector
, which should be run from the main hctsa directory. If this script runs successfully and a mySQL server has been installed (either on your local machine or on an external server, see above), you are then ready to run the install
script.
The following outlines the actions performed by the install_jconnector
script (including instructions on how to perform the steps manually):
It is necessary to relocate the J connector from the Database directory of this code repository (which is also freely available here): the file mysql-connector-java-5.1.35-bin.jar
(for version 5.1.35). Instructions are here and are summarized below, and described in the Matlab documentation. This .jar file must be added to a static path where it can always be found by Matlab. A good candidate directory is the java/jarext/ subdirectory of the Matlab root directory (to determine the Matlab root directory, simply type matlabroot
in an open Matlab command window).
For Matlab to see this file, you need to add a reference to it in the javaclasspath.txt file (an alternative is to modify the classpath.txt file directly, but this may not be supported by newer versions of Matlab). This file can be found (or if it does not exist, should be created) in Matlab’s preferences directory (to determine this location, type prefdir
in a command window, or navigate to it within Matlab using cd(prefdir)
).
This javaclasspath.txt file must contain a text reference to the location of the java connector on the disk. In the case recommended above, where it has been added to the java/jarext directory, we would add the following to the javaclasspath.txt file:
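For example, assuming the connector was placed in the java/jarext subdirectory of the Matlab root directory:
$matlabroot/java/jarext/mysql-connector-java-5.1.35-bin.jar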
ensuring that the version number (5.1.35) matches your version of the J connector (if you are using a more recent version, for example).
Note that javaclasspath.txt can also be in Matlab’s startup directory (for example, to modify this just for an individual user).
After restarting Matlab, Matlab should then have the ability to communicate with mySQL servers (we will check whether this works below).
The main tasks involved in installing the Matlab/mySQL interface are achieved by the install.m
script, which runs the user through the steps below.
If you have not already done so, creating a mySQL database to use with Matlab can be done using the SQL_create_db
function. This requires that mySQL is installed on an accessible server, or on the local machine (i.e., using localhost
). If the database has already been set up, then you do not need to use the SQL_create_db
function, but you must then create a text file, sql_settings.conf, in the Database directory of the repository. This file contains four comma-delimited entries corresponding to the server name, database name, username, and password, as per the following:
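hostName,databaseName,username,password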
The settings listed here are those used to connect to the mySQL server. Remember that your password is sitting here in this document in unencrypted plain text, so do not use a secure or important password.
To check that Matlab can connect to external servers using the mySQL J-connector, using correct host name, username, and password settings, we introduce the Matlab routines SQL_OpenDatabase
and SQL_CloseDatabase
. An example usage is as follows:
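% Open a connection to the database specified in sql_settings.conf, then close it:
dbc = SQL_OpenDatabase;
% ... (query or write to the database here) ...
SQL_CloseDatabase(dbc);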
For this to work, the sql_settings.conf file must be set up properly. This file specifies (in unencrypted plain text!) the login details for your mySQL database in the form hostName,databaseName,username,password
.
An example sql_settings.conf file:
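For example (with hypothetical values):
localhost,myDatabaseName,myUsername,myPassword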
Once you have configured your sql_settings.conf file, and you can run dbc = SQL_OpenDatabase;
and SQL_CloseDatabase(dbc)
without errors, then you can smile to yourself and you should at this point be happy because Matlab can communicate successfully with your mySQL server! You should also be excited because you are now ready to set up the database structure!
Note that if your database is not set up on your local machine (i.e., localhost
), then Matlab can communicate with a mySQL server through an ssh tunnel, which requires some additional setup (described below).
Note also that the SQL_OpenDatabase
function uses Matlab's Database Toolbox if a license is available, but otherwise will use java commands; both are supported and should give identical operational behavior.
To start writing a new dataset to a new database, or start retrieving data from a different database, you will need to change the database that Matlab is configured to connect to. This can be done using the SQL_ChangeDatabase
script (which walks you through the steps and writes over the existing sql_settings.conf file), or by altering the sql_settings.conf file directly.
Note that one can swap between multiple databases easily by commenting out lines of the sql_settings.conf file (adding %
to the start of a line to comment it out).
In some cases, the mySQL server you wish to connect to requires an ssh tunnel. One solution is to use port forwarding from your local machine to the server. The port forward can be set up in the terminal using a command like:
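For example (with hypothetical username and server address):
ssh -L 1234:localhost:3306 myUsername@myServer.edu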
This command connects port 1234 on your local computer to port 3306 (the default mySQL port) on the server. Now, telling Matlab to connect to localhost
through port 1234 will connect it, through the established ssh tunnel, to the server. This can be achieved by specifying the server as localhost
and the port number as 1234 in the sql_settings.conf file (or during the install
process), which can be specified as the (optional) fifth entry, i.e.,:
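hostName,databaseName,username,password,1234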
As described above, computation involves three main steps:
Retrieve a set of time series and operations from the Results table of the database to a local Matlab file, HCTSA.mat (using SQL_Retrieve
).
Compute the operations on the retrieved time series in Matlab and store the results locally (using TS_Compute
).
Write the results back to the Results table of the database (using SQL_store
).
It is usually the most efficient practice to retrieve a small number of time series at each iteration of the SQL_Retrieve
–TS_Compute
–SQL_store
loop, and distribute this computation across multiple machines if possible. An example runscript is given in the code that accompanies this document, as sample_runscript_sql
, which retrieves a single time series at a time, computes it, and then writes the results back to the database in a loop. This can be viewed as a template for runscripts that one may wish to use when performing time-series calculations across the database.
This workflow is well suited to distributed computing for large datasets, whereby each node can iterate over a small set of time series, with all the results being written back to a central location (the mySQL database).
By designating different sections of the database to cycle through, this procedure can also be used to (manually) distribute the computation across different machines. Retrieving a large section of the database at once can be problematic because it requires large disk reads and writes, uses a lot of memory, and if problems occur in the reading or writing to/from files, one may have to abandon a large number of existing computations.
Although much of the time-series analysis literature has focused on developing methods for quantifying complex temporal structure in long time-series recordings, many time series that are analyzed in practice are relatively short. hctsa has been successfully applied to time-series classification problems in the data mining literature, including datasets of time series as short as 60 samples. However, time-series data are sometimes even shorter, including yearly economic data across perhaps six years, or biological data measured at say 10 points across a lifespan. Although many features in hctsa will not give a meaningful output when applied to a short time series, hctsa includes methods for filtering such features (cf. TS_Normalize
), after which the remaining features can be used for analysis.
The number of features with a meaningful output, from time series as short as 5 samples, up to those with as many as 500 samples, is shown below (where the maximum set of 7749 is shown as a dashed horizontal line):
In each case, over 3000 features can be computed, although one must be careful when representing a time series of just 5 samples with thousands of features, the vast majority of which will be highly intercorrelated.
To demonstrate the feasibility of running hctsa analysis on datasets of short time series, we applied hctsa to gene expression data in the cerebellar brain region, r1A, across seven developmental time points (from the Allen Institute's Developing Mouse Brain Atlas), for a subset of 50 genes. After filtering and normalizing (TS_Normalize
), then clustering (TS_Cluster
), we plotted the clustered time-series data matrix (TS_PlotDataMatrix('cl')
):
Inspecting the time series plots to the left of the colored matrix, we can see that genes with similar temporal expression profiles are clustered together based on their 2829-long feature vector representations. Thus, these feature-based representations provide a meaningful representation of these short time series. Further, while these 2829-long feature vectors are shorter than those that can be computed from longer time series, they still constitute a highly comprehensive representation that can be used as the starting point to obtain interpretable understanding in addressing specific domain questions.
The mySQL database is structured into four main components:
A list of all the filenames and other metadata of time series (the TimeSeries table),
A list of all the names and other metadata of pieces of time-series analysis operations (the Operations table),
A list of all the pieces of code that must be evaluated to give each operation a value, which is necessary, for example, when one piece of code produces multiple outputs (the MasterOperations table), and
A list of the results of applying operations to time series in the database (the Results table).
Additional tables are related to indexing and managing efficient keyword labeling, etc.
Time series and operations have their own tables containing the metadata associated with each time series and each operation. The results of applying each operation to each time series are stored in the Results table, which has a row for every combination of time series and operation; calculation times and the quality of outputs are also recorded there (for cases where the output of the operation was not a real number, or where some error occurred in the computation). Note that while the data for each time series is stored in the database, the executable time-series analysis code files are not, so all code files must be in Matlab’s path (all required paths can be added by running startup.m
).
Another handy (but dangerous) function to know about is SQL_reset
, which will delete all data in the mySQL database, create all the new tables, and then fill the database with all the time-series analysis operations. The TimeSeries, Operations, and MasterOperations tables can be generated by running SQL_CreateAllTables
, with master operations and operations added to the database using SQL_Add
commands (described here).
You now have the basic table structure set up in the database and have done the first bits of mySQL manipulation through the Matlab interface.
It is very useful to be able to inspect the database directly, using a graphical interface. A very good example for Mac is the excellent, free application Sequel Pro (a screenshot is shown below, showing the first 40 rows of the Operations table of our default operations library). Applications similar to Sequel Pro exist for Windows and Linux platforms. Applications that allow the database to be inspected are extremely useful; however, they should not be used to manipulate the database directly. Instead, Matlab scripts should be used to interact with the database to ensure that the proper relationships between the different tables are maintained (including the indexing of keyword labels).
One of the key goals of highly comparative time-series analysis, is to allow unbiased methodological comparison between the vast literature of time-series analysis tools developed for different applications. By representing features in terms of their outputs across a time-series dataset, the context of a given feature can be assessed by searching the database for features with similar behavior. The search can be done using a diverse range of real and model-generated data, or using a more specific dataset if this is more appropriate for a given application (e.g., looking just at EEG signals). Just like similar time series to a target can be retrieved and visualized, similar features to a given target feature can also be retrieved using TS_SimSearch
.
This chapter will give instructions on how you can compare a new time-series analysis feature to our library of over 7000 time-series features using hctsa. We assume that the reader has installed hctsa, which will be required to work with files and compute features.
The first step is defining the set of features to compare to (here we use the default hctsa library), and the set of time-series data that behavior is going to be assessed on. If you have just developed a new algorithm for time-series analysis and want to see how it performs across a range of interdisciplinary time-series data, then you may want to use a diverse set of time series sampled from across science. This can be easily achieved using our set of 1000 time series, a random selection of 25 such time series are plotted below (only the first 250 samples are plotted to aid visualization):
Pre-computed results for a recent version of hctsa can be downloaded from figshare as HCTSA_Empirical1000.mat
.
Alternatively, features can be recomputed using the input file for the time-series dataset provided in the same figshare data repository. This ensures implementation consistency on your local compute architecture; i.e., using TS_Init('INP_Empirical1000.mat'); to initialize, followed by compute commands using TS_Compute.
However, if you only ever analyze a particular type of data (e.g., rainfall), then perhaps you're more interested in which methods perform similarly on rainfall data. For this case, you can produce your own data context for custom data using properly structured input files as explained here.
We use the (hypothetical) example of a hot new feature, :boom: hot_feature1
:boom:, recently published in Science (and not yet in the hctsa library), and attempt to determine whether it is completely new, or whether there are existing features that exhibit similar performance to it. Think first about the data context (described above), which allows you to understand the behavior of thousands of features on a diverse dataset with which to compare the behavior of our new feature, hot_feature1
. This example uses the Empirical1000
data context downloaded as HCTSA_Empirical1000.mat
from figshare.
Getting the feature values for the new feature, hot_feature1
, could be done directly (using TS_CalculateFeatureVector
), but in order to maintain the HCTSA structure, we instead produce a new HCTSA.mat
file containing just hot_feature
and the same time series. For example, to compare to the HCTSA_Empirical1000.mat
file hosted on figshare, you should use the same version of hctsa to enable a valid comparison to the same set of features.
We first generate an input file, INP_hot_master.txt, containing the function call that takes in a time series, x:
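A sketch of a single line of this file is below (code call first, then a label; the label name is a hypothetical choice):
MyHotFeature(x)     MyHotFeature_x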
Any additional arguments to the function MyHotFeature.m
should be specified here. MyHotFeature.m
must also be in a form that outputs a structure (or a single real number, as explained here).
The interesting field in the structure output produced by MyHotFeature(x)
is hotFeature1
, which needs to be specified in another input text file, INP_hot_features.txt
, for example, as:
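A sketch of the corresponding line (assumed format: masterLabel.fieldName, feature name, comma-separated keywords, where the master label matches the one defined in INP_hot_master.txt):
MyHotFeature_x.hotFeature1     hot_feature1     hot,science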
where we have given this feature two keywords: hot
and science
.
So now we are able to initiate a new hctsa calculation, specifying custom code calls (master) and features to extract from the code call (features), as:
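A sketch (the order of the three input files, i.e., time series, master operations, features, is an assumption, and an output file name, here HCTSA_hot.mat, may also need to be specified; check TS_Init's help):
TS_Init('INP_Empirical1000.mat','INP_hot_master.txt','INP_hot_features.txt');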
This generates a new file, HCTSA_hot.mat
, containing information about the 1000 time series, and the new hot feature, hot_feature1
, which can then be computed as:
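A sketch (the argument pattern is an assumption; see TS_Compute's help for the exact inputs):
% Compute all missing values in the custom file HCTSA_hot.mat:
TS_Compute(false,[],[],'missing','HCTSA_hot.mat');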
So now we have both a context for the behavior of a library of >7000 features on 1000 diverse time series, and the behavior of our hot new feature. It is time to combine them and look for inter-relationships!
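A sketch of the combination step (the merged output is referred to below as HCTSA_merged.mat; an output file name may need to be specified explicitly, so check TS_Combine's help):
TS_Combine('HCTSA_Empirical1000.mat','HCTSA_hot.mat',false);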
Now that we have all of the data in the same HCTSA file, we can compare the behavior of the new feature to the existing library of features. This can be done manually by the researcher, or by using standard hctsa functions; the most relevant is TS_SimSearch
. We can find the ID assigned to our new hot_feature
in the merged HCTSA file as:
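For example, using the 'hot' keyword assigned above (passing a custom file name as the data source is an assumption):
hotID = TS_GetIDs('hot','HCTSA_merged.mat','ops')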
which tells us that the ID of my_hot_feature
in HCTSA_merged.mat
is 7703. Then we can use TS_SimSearch
to explore the relationship of our hot new feature to other features in the hctsa library (in terms of linear, Pearson, correlations):
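A sketch (the 'tsOrOps' specifier for searching over operations is an assumption; the other options follow the TS_SimSearch usage shown earlier):
TS_SimSearch(7703,'tsOrOps','ops','whatDataFile','HCTSA_merged.mat','whatPlots',{'matrix'})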
We find that our feature is reproducing the behavior of the first zero of the autocorrelation function (the first match: first_zero_ac
; see Interpreting Features for more info on how to interpret matching features):
The pairwise distance matrix (where distances are 1-|r|, for Pearson correlation coefficients, r) produced by TS_SimSearch provides another visualization of the context of this hot new feature (in this case there are so many highly correlated features that the matrix doesn't reveal much subtle structure):
In this case, the hot new feature wasn't so hot: it was highly (linearly) correlated to many existing features (including the simple zero-crossing of the autocorrelation function, first_zero_ac
), even across a highly diverse time-series dataset. However, if you have more luck and come up with a hot new feature that shows distinctive (and useful) performance, then it can be incorporated in the default set of features used by hctsa by adding the necessary master and feature definitions (i.e., the text in INP_hot_master.txt
and the text in INP_hot_features.txt
) to the library files (INP_mops.txt
and INP_ops.txt
in the Database directory of hctsa), as explained here. You might even celebrate your success by sharing your new feature with the community, by sending a Pull Request to the hctsa github repository!! :satisfied:
If using our set of 1000 time series, then this is easy because all the data have already been computed in HCTSA_Empirical1000.mat on figshare :relaxed:
For example, say we want to find neighbors to the fastdfa
algorithm from Max Little's website. This algorithm is already implemented in hctsa in the code SC_fastdfa.m
as the feature SC_fastdfa_exponent
. We can find the ID of this feature by finding the matching row in the Operations table (ID=750
):
and then find similar features using TS_SimSearch
, e.g., as:
Yielding:
We see that other features in the library indeed have strong relationships to SC_fastdfa_exponent
, including some unexpected relationships with the stationarity estimate, StatAvl25
.
Combining the network visualization with scatter plots produces the figures in our original paper on the empirical structure of time series and their methods (cf. Sec. 2.4 of the supplementary text), see below:
Specific pairwise relationships can be probed in more detail (visualizing the types of time series that drive any relationship) using TS_Plot2d.
A typical data analysis procedure might first involve carefully inspecting your data, then thinking about the patterns you might want to capture, then trying to devise an algorithm to capture them (testing the algorithm carefully, perhaps on simulated data, to understand its behavior, its dependence on noise level and time-series length, and its sensitivity to low-order trends or other types of non-stationarity). But in hctsa we do something somewhat dangerous: apply thousands of methods without even needing to first look at the data.
A conventional data analysis pipeline starts with the hard work of thinking about the dynamics and carefully selecting time-series analysis methods with deep knowledge of how they work and the assumptions that make their interpretation valid. This hard work is not avoided in hctsa, although if you're lucky, hctsa may save you a bunch of work—if it identifies high-performing features, you can focus your energy on interpreting just these. But this hard work comes at the end of an analysis, and is difficult and time-consuming to do well—often involving looking deeply into the algorithms and theory underlying these methods, combined with careful inspection of the data to ensure you are properly interpreting them.
Interpretation is the most substantial challenge: it cannot be automated and requires a lot of careful thought, so this page provides some basic advice. While it is often challenging to work out what a given feature is capturing in your dataset, it is also the most interesting part of an analysis, as it points you towards interpretable scientific theory and algorithms, and makes you think carefully about the structure in your time series.
Sometimes you will run TS_TopFeatures and find no features that significantly distinguish time series recorded from the labeled groups in your dataset. In this case, there may not be a signal, the signal may be too small given your sample size (and the potentially large number of candidate time-series features, if using the full hctsa set), or the right statistical estimators may not be present in hctsa.
But other times you will obtain a long list of statistically significant features (after multiple-hypothesis correction, e.g., from TS_TopFeatures) whose values significantly distinguish the groups you care about (say, individuals with some disease diagnosis compared to healthy controls). In cases like this, the next step is to obtain some understanding of what's happening.
Consider a long and kinda gnarly list that is quite hard to make sense of. This is a typical situation, because feature names in hctsa are long and typically hard to interpret directly. Like, perhaps:
If there are hundreds or thousands of statistically significant features, some of which have much higher accuracy than others, you may first want to inspect only a subset of features, e.g., the 50 or 100 top-performing features (with the most significant differences). If you can make sense of these first, you will usually have a pretty good basis for working out what's going on with the features with lower performance (if indeed you want to go beyond interpreting the best-performing features).
TS_TopFeatures is helpful for showing us how features with different, long, and complicated names might cluster into groups of similarly behaving features that are measuring similar properties of the data (cf. the previous section). This is essential for interpreting groups of features that capture the same underlying conceptual property of the dynamics: it helps you work out what's going on, and avoids over-interpreting any specific feature in isolation. For example, sometimes features with very complicated-sounding names (like nonlinear embedding dimension estimates) exhibit very similar behavior to very simple features (like lag-1 autocorrelation or distributional outlier measures), indicating that the nonlinear embedding dimension estimates are likely sensitive to simpler properties of the time series (e.g., outliers in the data). Interpreting the nonlinear embedding dimension in isolation would be a mistake in such a case.
Inspecting clustered feature–feature correlation plots is crucial for identifying groups of features with similar behavior on the dataset; these groups of similar (highly inter-correlated) features can then be inspected and interpreted together. This should be the first step in interpreting a set of significant features: which groups of features behave similarly, and what common concepts are they measuring?
An example is here, where Bailey et al. (2023) identified three important groups of features, measuring stationarity of self-correlation properties, stationarity of distributional properties, and global distributional properties, and then went on to interpret each in turn:
When such a group of high-performing features capturing a common time-series property has been identified, how can we start to interpret and understand what each individual feature is measuring?
Some features in the group may be easy to interpret directly. For example, in the list above, rms is straightforward to interpret: it is simply the root mean square of the distribution of time-series values. Others have clues in the name (e.g., features starting with WL_coeffs have to do with measuring wavelet coefficients, features starting with EN_mse correspond to the multiscale entropy ('mse'), and features starting with FC_LocalSimple_mean are related to time-series forecasting using local means of the time series).
Below we outline a procedure for how a user can go from a time-series feature selected by hctsa towards a deeper understanding of the type of algorithm that feature is derived from, how that algorithm performs across the dataset, and thus how it can provide interpretable information about your specific time-series dataset.
The simplest way of getting a quick idea of what sort of property a feature might be measuring is from its keywords, which often label individual features by the class of time-series analysis method from which they were derived. Keywords have not been applied exhaustively to features (this is an ongoing challenge), but when present they can be useful for giving a sense of what's going on. In the list above, we see keywords listed in parentheses, such as:
'forecasting' (methods related to predicting future values of a time series),
'entropy' (methods related to predictability and information content in a time series),
'wavelet' (features derived from wavelet transforms of the time series),
'locationDependent' (location dependent: features that change under mean shifts of a time series),
'spreadDependent' (spread dependent: features that change under rescaling about their mean), and
'lengthDependent' (length dependent: features that are highly sensitive to the length of a time series).
To find more specific, detailed information about a feature, beyond just a broad categorical label of the literature from which it was derived, the next step is to find and inspect the code file that generates the feature of interest. For example, say we were interested in the top-performing feature in the list above:
We know from the keyword that this feature has something to do with forecasting, and the name provides clues about the details (e.g., FC_ stands for forecasting, and the function FC_LocalSimple is the one that produces this feature, which, as the name suggests, performs simple local time-series prediction). We can use the feature ID (3016) provided in square brackets to get information from the Operations metadata table:
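For example, assuming table-format metadata in the local HCTSA file:

```matlab
% Sketch: metadata row for feature ID 3016 in the Operations table
load('HCTSA.mat','Operations');
Operations(Operations.ID==3016,:)
```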
Inspecting the text before the dot in the CodeString field (FC_LocalSimple_mean3) tells us the name that hctsa uses to describe the Matlab function and the unique set of inputs that produces this feature, whereas the text following the dot (taures) tells us the field of the output structure produced by the Matlab function that was run.
We can use the MasterID to get more information about the code that was run, from the MasterOperations metadata table:
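For example (a sketch, again assuming table-format metadata):

```matlab
% Sketch: find the master operation that generated feature ID 3016
load('HCTSA.mat','Operations','MasterOperations');
masterID = Operations.MasterID(Operations.ID==3016);
MasterOperations(MasterOperations.ID==masterID,:)
```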
This tells us that the code used to produce our feature was FC_LocalSimple(y,'mean',3). We can get information about this function at the command line by running a help command:
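That is:

```matlab
help FC_LocalSimple
```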
We can also inspect the code in the function FC_LocalSimple directly for more information. Like all code files for computing time-series features, FC_LocalSimple.m is located in the Operations directory of the hctsa repository.
Inspecting the code file, we see that running FC_LocalSimple(y,'mean',3) does forecasting using local estimates of the time-series mean (since the second input to FC_LocalSimple, forecastMeth, is set to 'mean'), using the previous three time-series values to make each prediction (since the third input, trainLength, is set to 3).
To understand which specific output quantity from this code came up as being highly informative in our TS_TopFeatures analysis, we need to look for the output labeled taures in the output structure produced by FC_LocalSimple. We find the following relevant lines of code in FC_LocalSimple.m:
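The exact lines are best read from the file itself; they look roughly like the following sketch (paraphrased, not quoted verbatim), consistent with the description that follows:

```matlab
% Illustrative sketch (not a verbatim quote) of the relevant outputs in FC_LocalSimple.m,
% where res holds the residuals of the rolling local-mean predictions:
out.ac1 = CO_AutoCorr(res,1);         % lag-1 autocorrelation of the residuals
out.ac2 = CO_AutoCorr(res,2);         % lag-2 autocorrelation of the residuals
out.taures = CO_FirstZero(res,'ac');  % first zero of the residuals' autocorrelation function
```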
This shows us that, after doing the local mean prediction, FC_LocalSimple then outputs some features of whether there is any residual autocorrelation structure in the residuals of the rolling predictions (the outputs labeled ac1, ac2, and our output of interest, taures).
The code shows that this taures output computes the CO_FirstZero of the residuals, which measures the first zero of their autocorrelation function (cf. help CO_FirstZero). When the local mean prediction still leaves a lot of autocorrelation structure in the residuals, our feature, FC_LocalSimple_mean3_taures, will thus take a high value.
Once we've seen the code that was used to produce a feature, and started to think about how such an algorithm might be measuring useful structure in our time series, we can then check our intuition by inspecting its performance on our dataset (as described in Investigating specific operations).
For example, we can run the following:
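A sketch of such a call, assuming the feature's ID in the local (normalized) data file is 3016; 'norm' refers to the normalized file HCTSA_N.mat, and optional inputs vary by version (see help TS_FeatureSummary):

```matlab
% Sketch: visualize the distribution of feature ID 3016 across the labeled groups
TS_FeatureSummary(3016,'norm');
```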
which produces a plot like that shown below. We have run this on a dataset containing noisy sine waves, labeled 'noisy' (red) and periodic signals without noise, labeled 'periodic' (blue):
In this plot, we see how this feature orders time series (with the distribution of values shown on the left, and split between the two groups: 'noisy', and 'periodic'). Our intuition from the code, that time series with longer correlation timescales will have highly autocorrelated residuals after a local mean prediction, appears to hold visually on this dataset.
In general, the mechanism provided by TS_FeatureSummary for visualizing how a given feature orders time series, including across labeled groups, can be very useful for feature interpretation.
hctsa contains a large number of features, many of which can be expected to be highly inter-correlated on a given time-series dataset. It is thus crucial to explore how a given feature relates to other features in the library, e.g., using the correlation matrix produced by TS_TopFeatures (cf. Finding informative features), or by searching for features with similar behavior on the dataset to a given feature of interest (cf. Finding nearest neighbors).
In a specific domain context, you may need to decide on the trade-off between more complicated features that may have slightly higher in-sample performance on a given task, and simpler, more interpretable features that may help guide domain understanding. The procedures outlined above are typically the first step to understanding a time-series analysis algorithm, and its relationship to alternatives that have been developed across science.
Sometimes you might wish to remove a problematic set of time series from the database (ones that were input incorrectly, for example), including their metadata and all records of computed data. Other times you might find a problem with the implementation of one of the operations; in this case, you would like to retain that operation in the database but flush all of the data currently computed for it (so that new values can be recomputed). Both types of task (clearing and removing time series or operations) can be achieved with the function SQL_ClearRemove.
This function takes an input specifying whether to clear (delete any calculated data for the given time series or operations) or remove (completely delete the given time series or operations from the database).
Example usages are given below:
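A sketch of possible calls (the exact argument order should be checked with help SQL_ClearRemove):

```matlab
% Sketch only (check `help SQL_ClearRemove` for the exact argument order):
SQL_ClearRemove('ts',1:5,true);       % completely remove time series with ts_ids 1:5
SQL_ClearRemove('ops',op_ids,false);  % clear computed data for the operations in op_ids, keeping them in the database
```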
After setting up a database with a library of time-series features, the next task is to add a dataset of time series to the database. It is up to the user whether to keep all time-series data in a single database, or to have a different database for each dataset.
Time series are added using the same function used to add master operations and operations to the database, SQL_Add, which imports time-series data (stored in time-series data files) and associated keyword metadata (assigned to each time series) to the database. The time-series data files to import, and the keywords to assign to each time series, are specified in either: (i) an appropriately formatted Matlab (.mat) file, or (ii) a structured input text file, as explained below.
Note that, when using the .mat file input method, time-series data are stored in the database to six significant figures, whereas when using the .txt file input method, time-series data values are stored exactly as written in the input text file of each time series.
Time series can be indexed by assigning keywords to them (which are stored in the TimeSeriesKeywords table and associated index table, TsKeywordsRelate of the database). Assigning keywords to time series makes it easier to retrieve a set of time series with a given set of keywords for analysis, and to group time series annotated with different keywords for classification tasks.
Every time series added to the mySQL database is assigned a unique integer identifier, ts_id, which can be used to retrieve specific time series from the database.
SQL_Add syntax
Adding a set of time series to the database requires an appropriately formatted input file, INP_ts.txt for example; the appropriate code is:
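(This mirrors the syntax of the test-file examples below.)

```matlab
SQL_Add('ts','INP_ts.txt');
```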
We provide an example input file in the Database directory as INP_test_ts.txt, which can be added to the database, following the syntax above, using SQL_Add('ts','INP_test_ts.txt'), as well as a sample .mat file input as INP_test_ts.mat, which can be added as SQL_Add('ts','INP_test_ts.mat').
Retrieving data from the Results table of the database is typically done for one of two purposes:
To calculate as-yet uncalculated entries to be stored back into the database, and
To analyze already-computed data stored in the database in Matlab.
The function SQL_Retrieve serves both of these purposes, using different inputs. Here we describe its use for populating uncalculated (NULL) entries of the Results table of the database in Matlab.
For calculating missing entries in the database, SQL_Retrieve can be run as follows:
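In general form:

```matlab
% Retrieve only the entries that are uncalculated (NULL) in the database,
% for the specified ts_ids and op_ids:
SQL_Retrieve(ts_ids, op_ids, 'null');
```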
The third input, 'null', retrieves ts_ids and op_ids from the sets provided that contain (as-yet) uncalculated (i.e., NULL) elements in the database; these can then be calculated and stored back in the database. An example usage is given below:
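A sketch consistent with the description that follows:

```matlab
% Retrieve NULL (uncalculated) entries for time series with ts_ids 1-5,
% across all operations in the database:
SQL_Retrieve(1:5, 'all', 'null');
```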
Running this code will retrieve null (uncalculated) data from the database for time series with ts_ids between 1 and 5 (inclusive) and all operations in the database, keeping only the rows and columns of the resulting time series x operations matrix that contain NULLs.
When calculations are complete and one wishes to analyze all of the data stored in the database (not just NULL entries requiring computation), the third input should be set to 'all' to retrieve all entries in the Results table of the database, as described below.
SQL_Retrieve writes to a local Matlab file, HCTSA.mat, that contains the data retrieved from the database.
After retrieving data from the mySQL database, missing entries (NULL in the database, and NaN in the local Matlab file) can be computed using TS_Compute, and stored back to the database using SQL_store. These functions are described below.
TS_Compute
Values retrieved using SQL_Retrieve (to the local HCTSA.mat file) that have not previously been calculated are evaluated using TS_Compute. These results can then be inspected directly (if needed), or simply written back to the database using SQL_store, as described below.
SQL_store
Once calculations have been performed using Matlab on local files, the results must be written back to the database. This task is performed by SQL_store, which reads the data in HCTSA.mat, checks that the metadata still matches the database, and then updates the Output, Quality, and CalculationTime columns of the Results table in the mySQL database. This can be done by simply running:
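For example (a sketch; see help SQL_store for any optional inputs):

```matlab
% Sketch: write the locally computed results in HCTSA.mat back to the database
SQL_store();
```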
Depending on database latencies, this can be a relatively slow process (up to 20–25 s per time series), updating each row of the Results table individually using mySQL UPDATE statements. However, this overhead buys the ability to distribute the computation across multiple compute nodes and to index and systematically retrieve the stored data; keeping results only in local Matlab files can be extremely inefficient, and can indeed be untenable for large datasets.
When linking Matlab to a mySQL database, metadata associated with time series, operations, and master operations, as well as the results of computations, are all stored in an indexed database. Adding master operations, operations, and time series to the database can be achieved using the SQL_Add function, as described below.
The following table summarizes the terminology used for each type of object in hctsa land:
SQL_Add
SQL_Add has two key inputs that specify:
Whether to import a set of time series (specify 'ts'), a set of operations (specify 'ops'), or a set of master operations (specify 'mops'), and
The name of the input file that contains information about the time series, master operations, or operations to be imported.
In this section, we describe how to use SQL_Add to add master operations, operations, and time series to the database.
Users wishing to run the default hctsa code library on their own time-series dataset need only add time series to the database, as the full operation library is added by default by the install.m script. Users wishing to add additional features, using custom time-series code or different types of inputs to existing code files, can either edit the default INP_ops.txt and INP_mops.txt files provided with the repository, or create new input files for their custom analysis methods (as explained below for master operations and operations).
REMINDER: Manually editing the database, including adding or deleting rows, is very dangerous, as it can create inconsistencies and errors in the database structure. Adding time series and operations to the database should only be done using SQL_Add, which sets up the Results table of the database and ensures that the indexing relationships in the database are properly maintained.
By default, the install script populates the database with the default library of highly comparative time-series analysis code. The formatted input file specifying these pieces of code and their input parameters is INP_mops.txt in the Database directory of the repository. This step can be reproduced using the following command:
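(The same form as the other SQL_Add calls in this chapter.)

```matlab
SQL_Add('mops','INP_mops.txt');
```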
Once added, each master operation is assigned a unique integer, mop_id, that can be used to identify it. For example, when adding individual operations, the mop_id is used to map each individual operation to a corresponding master operation.
New functions, and the input parameters with which to execute them, can be added to the database using SQL_Add in the same way as described above: either add lines corresponding to the new code to the current INP_mops.txt file, or generate a new input file and run SQL_Add on it. Once they are in the database, the software will run the new pieces of code. Note that SQL_Add checks for repeats that already exist in the database, so duplicate entries cannot be added with SQL_Add.
Corresponding operations (or features) will then need to be added separately, to link to the structured outputs of master operations.
Every operation added to the database will be given a unique integer identifier, op_id, which provides a common way of retrieving specific operations from the database.
Note that after (hopefully successfully) adding the operations to the database, the SQL_Add function indexes the operation keywords in an OperationKeywords table, which assigns a unique identifier to each keyword, and in a linking table that allows operations with a given keyword to be retrieved efficiently.
New code added to the database should be checked for the following: 1. The output is a real number or structure (and uses an output of NaN to assign all outputs to NaN). 2. The function is accessible in the Matlab path. 3. Output(s) from the function have matching operations (or features), which also need to be added to the database.
Operations can be added to the mySQL database using an appropriately formatted input file, such as INP_ops.txt, as follows:
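(Mirroring the SQL_Add syntax used above.)

```matlab
SQL_Add('ops','INP_ops.txt');
```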
Adding a set of time series to the database requires an appropriately formatted input file, in either of the formats described above (a .mat file or a structured text file).
|  | Master Operation | Operation | Time Series |
| --- | --- | --- | --- |
| Database identifier | mop_id | op_id | ts_id |
| Input to SQL_Add | 'mops' | 'ops' | 'ts' |
In this section we describe how keywords and other types of metadata stored in the database can be manipulated, and how the results of whole sets of operations and time series can be cleared or completely deleted from the database. These tasks are implemented as Matlab functions that ensure key database structures are maintained; performing such tasks by acting directly on the database, by contrast, can cause inconsistencies and should be avoided.
Running out of Java heap space throws the error java.lang.OutOfMemoryError, and usually happens when trying to retrieve a large set of time series/operations from the database. Matlab needs to keep the whole retrieval in memory, and has a hard limit on this. The Java heap size can be increased in the Matlab preferences, under General → Java Heap Memory.
The first step of any analysis is to retrieve a relevant portion of data from the mySQL database to local Matlab files for analysis. This is done using the SQL_Retrieve function described above, except that we use the 'all' input to retrieve all data, rather than the 'null' input used to retrieve just missing data (requiring calculation).
Example usage is as follows:
SQL_Retrieve(ts_ids, op_ids,'all');
for vectors ts_ids and op_ids, specifying the ts_ids and op_ids to be retrieved from the database.
Sets of ts_ids and op_ids to retrieve can be selected by inspecting the database, or by retrieving relevant sets of keywords using the SQL_GetIDs function. Running the code in this way, using the 'all' tag, ensures that the full range of ts_ids and op_ids specified is retrieved from the database and stored in the local file, HCTSA.mat, which can then form the basis of subsequent analysis.
The database structure provides much flexibility in storing and indexing the large datasets that can be analyzed using the hctsa approach; however, the process of retrieving and storing large amounts of data from a database can take a considerable amount of time, depending on database latencies.
Note that missing, or NULL, entries in the database are converted to NaN entries in the local Matlab matrices.
Extract EEG markers of cognitive decline.
Tutorial session on feature-based time-series analysis tools at Computational Neuroscience Academy 2023, Krakow, Poland.
Slides for the full talk about analyzing neural dynamics using hctsa at CNS 2020 (July 2020).
Talk about the importance of comparison for time-series analysis to QMNET (August 2020).
Seizure EEG: an overview tutorial on applying hctsa to a 5-class EEG dataset (including the use of the reduced feature set, catch22, within the hctsa framework), with a recording of the tutorial (the final ~hour is a hands-on demo using this dataset).