Below are some UMAP projections of UEA/UCR time-series classification datasets.
In each dataset, each time series is represented in the high-dimensional feature space of hctsa features. In each plot, we show the data projected into a two-dimensional UMAP space. Each point in the space is a time series, and we have coloured each point according to its assigned label. In many cases, the unsupervised UMAP projection of the hctsa feature space captures the labeled structure of the data.
Only a selection of the most interesting-looking projections (according to our eyes) are shown here. Oh, did you want to have a play with all the data? It's available on figshare.
Thanks to Carl Lubba for doing all the computation shown here!
Thinking about running an hctsa analysis? Read this first.
The typical data analysis pipeline starts with inspecting and understanding the data, processing it in accordance with the questions of interest (and to be consistent with the assumptions of the analysis methods that will be applied), and then formulating and testing appropriate analysis methods. A typical hctsa pipeline inverts this process: many analysis methods are first applied, and then their results are interpreted.
Good practice involves thinking carefully about this full hctsa pipeline, including the type of questions and interpretations that are sought from it, and thus how the data are to be prepared, and how the results can be interpreted accurately.
hctsa automates some parts of a time-series analysis pipeline (such as guiding the selection of informative statistical properties in a time-series classification problem), but it does not replace expertise. Careful thought, with understanding of the problem and domain-specific issues are essential in designing how hctsa can be used for a given application, and in properly interpreting its results.
Some general advice before embarking on an hctsa analysis:
What are the data?
For long, streaming data: how long is each time series? Does this capture the timescale of the problem you care about?
For continuously evolving processes: At what rate are the data sampled? Does this capture the patterns of interest? E.g., if the sampling rate is too high, this can lead to trivially autocorrelated time series such that time-series methods in hctsa will find it challenging to resolve the patterns of interest.
Are the data appropriately processed (detrended, artifacts removed, …)? This requires careful thought, often with domain expertise, about what problem is being solved. E.g., many time-series analysis algorithms will be dominated by underlying trends if these are not removed. In general, properties of the data that are likely to affect many time-series analysis algorithms (like underlying trends, artefactual outliers, etc.) should be removed if they are not informative of the differences you care about distinguishing. See the section below for more information.
What do they look like? Addressing the questions above requires you to look at each of the time series to get a sense of the dynamics you're interested in characterizing using time-series features.
What problem are you trying to solve? In designing an analysis using hctsa, you will need to think about the sample size you have, the effect sizes you expect, the statistical power you will have, etc. E.g., if you only have 5 examples of each of two classes, you will not have the statistical power to pick out individual features from a library of 7000, and simple (unregularized) classifiers will be likely to overfit.
Trial run with a reduced set. Once you’ve devised a pipeline, it's best to run through it in hctsa but using a reduced feature set first (e.g., the catch22 set), which runs quickly on a laptop and gives you a sense for the process. Once you're satisfied with the analysis pipeline, you can always scale up to the full hctsa library of >7000 features.
hctsa contains thousands of time-series analysis methods, many of which make strong assumptions about the data, such as that they are (weak-sense) stationary. Although hctsa has been applied meaningfully to short, non-stationary patterns (as commonly studied in time-series data-mining applications), it is better suited to longer streams of data, from which more subtle temporal statistics can be sampled.
hctsa is no substitute for thoughtful, domain-motivated data preparation and processing. For example, hctsa cannot know what is 'signal' and what is 'noise' for the question you're asking of your data. The analyst should prepare their data in a way that makes sense for the questions of relevance to them, including the possibility of de-trending/filtering the data, applying noise-removal methods, etc.
The following should be considered:
Standardizing. If your time-series data are measured on scales that differ across different recordings, then the data should be appropriately standardized.
Detrending. If your data contain low-order trends that are not meaningful (e.g., sensor drift) or not a signal of relevance (your question is based more around the structure of deviations from a low-frequency trend), then this should be removed.
There are often many routes to solving a given data analysis challenge. For example, in a time-series classification problem, the two classes may be perfectly distinguished based on their lag-1 autocorrelation, and also on their Lyapunov exponent spectrum, and also on hundreds of other properties. In general, one should avoid interpreting the most complex features (like Lyapunov exponents) as being uniquely useful for a problem when they reproduce the behavior of much simpler features, which provide a more interpretable and parsimonious explanation of the relevant patterns in the dataset. For other problems, time-series analysis methods (that are sensitive to the time-ordering of the data samples) may not provide any benefit at all over properties of the data distribution (e.g., the variance), or more trivial differences in time-series length across classes.
In general, complex explanations of patterns in a dataset can only be justified when simpler explanations have been ruled out. For example, do not write a paper about a complex feature (e.g., a power-law fit to a visibility-graph degree distribution) when your data can be just as well (or better) distinguished by their variance.
hctsa has many functions to check this type of behavior: from inspecting the pairwise correlation structure of high-performing features (in TS_TopFeatures) to basic checks on different keyword-labeled classes of features (in TS_CompareFeatureSets).
Welcome to the hctsa documentation!
This documentation contains resources and information about highly comparative time-series analysis using hctsa, and also outlines the steps required to set up and implement highly comparative time-series analysis using the hctsa package, as described in our papers:
B.D. Fulcher and N.S. Jones. hctsa: A computational framework for automated time-series phenotyping using massive feature extraction. Cell Systems 5, 527 (2017).
B.D. Fulcher, M.A. Little, N.S. Jones. Highly comparative time-series analysis: the empirical structure of time series and their methods. J. Roy. Soc. Interface 10, 20130048 (2013).
Some advice for getting started with an hctsa analysis
Sometimes it's easiest to get going quickly when you can follow along with some example code. Here are some repositories that allow you to do just that. There are a range of open datasets with pre-computed hctsa features, as well as some examples of hctsa workflows. You may also check the page of publications using hctsa for example publications that have used it—some of these contain associated open code.
If you have data to share and host, let me know and I'll add it to this list.
Resources related to HCTSA including talks/demos, workflow examples and related packages.
A collection of good resources for time-series analysis (including in other programming languages like python and R) is listed here. See also larger collections of time-series resources by Lukasz Mentel and, focused on python, by Max Christ.
There is a list of python-based packages for time-series analysis here that is worth taking a look at, in addition to those packages highlighted below:
This page lists scientific research publications that have used hctsa.
Articles are labeled as follows:
📗 = Journal publication.
📙 = Preprint.
= Link to GitHub code repository available.
If you have used hctsa in your published work, or we have missed any publications, feel free to reach out and we'll add it to this growing list!
See the following publications for details of how the highly comparative approach to time-series analysis has developed since our initial publication in 2013.
We have used hctsa to:
as well as:
Connect structural brain connectivity to fMRI dynamics (mouse).
Connect structural brain connectivity to fMRI dynamics (human).
Distinguish time-series patterns for data-mining applications.
Classify babies with low blood pH from fetal heart rate time series.
Here are some highlights:
In addition to:
Predict age from resting-state MEG from individual brain regions.
Estimate brain age in children from EEG.
Extract gradients from fMRI hctsa time-series features to understand the relationship between schizophrenia and nicotine dependence.
Classify endogenous (preictal), interictal, and seizure-like (ictal) activity from local field potentials (LFPs) from layers II/III of the primary somatosensory cortex of young mice (using feature selection methods from an initial pool of hctsa features).
Distinguish motor-evoked potentials corresponding to multiple sclerosis.
Here are some highlights:
in addition to:
Identify sepsis in very low birth weight (<1.5 kg) infants from heart-rate signals, characterized by reduced variability and transient decelerations.
Identify novel heart-rate variability metrics, including RobustSD
, to create a parsimonious model for cerebral palsy prediction in preterm neonatal intensive care unit patients.
Predict post-cardiac-arrest outcomes.
Detect falls from wearable sensor data.
Detect falls from wearable sensor data.
Select features for fetal heart rate analysis using genetic algorithms.
Here are some highlights:
in addition to:
Diagnose a spacecraft propulsion system utilizing data provided by the Prognostics and Health Management (PHM) society, as part of the Asia-Pacific PHM conference’s data challenge, 2023.
Identify faults in a large-scale industrial process.
Frequently asked questions pertaining to the use of hctsa.
A full list of Matlab code files, organized loosely into broad categories, with brief descriptions
The full default set of over 7700 features in hctsa is produced by running all of the code files below, many of which produce a large number of outputs (e.g., some functions fit a time-series model and then output statistics including the parameters of the best-fitting model, measures of the model's goodness of fit, the optimal model order, and autocorrelation statistics on the residuals).
In our default feature set, each function is run with multiple input parameters, with each parameter set yielding characteristic outputs. For example, the inputs to CO_AutoCorr determine the method by which the autocorrelation is computed, as well as the time lag at which it is calculated (e.g., lag 1, lag 2, lag 3);
WL_dwtcoeff
has inputs that set the mother wavelet to use and level of wavelet decomposition; and
FC_LocalSimple
has inputs that determine the time-series forecasting method to use and the size of the training window.
The set of code files below and their input parameters that define the default hctsa feature set are in the INP_mops.txt
file of the hctsa repository.
Algorithms for summarizing properties of the distribution of values in a time series (independent of their ordered sequence through time).
Code summarizing basic properties of how values of a time series are correlated through time.
Entropy and complexity measures for time series based on information theory
Fitting time-series models and doing simple forecasting on time series.
Quantifying how properties of a time series change over time.
Nonlinear time-series analysis methods, including embedding dimensions and fluctuation analysis.
Properties of the time-series power spectrum, wavelet spectrum, and other periodicity measures.
Properties of a discrete symbolization of a time series.
Simple time-series properties derived mostly from the heart rate variability (HRV) literature.
Basic statistics of a time series, including measures of trend.
Other properties, like extreme values, visibility graphs, physics-based simulations, and dependence on pre-processings applied to a time series.
The hctsa package can be used completely within Matlab, allowing users to analyse time-series datasets quickly and easily. Here we will focus on this Matlab-based use of the software, but note that, for larger datasets requiring distributed computing set-ups, or datasets that may grow with time, hctsa can also be linked to a mySQL database, as described elsewhere in this documentation.
The simplest way to get the hctsa package up and running is to run the install
script, which adds the required paths to dependent time-series packages (toolboxes), and compiles mex binaries to work on your system architecture. Once this one-off installation step is complete, you're ready to go! (NB: to include additional functions from the TISEAN nonlinear time-series analysis package, you'll also need to compile the TISEAN binaries separately, as described below.)
After installation, future use of the package can begin by opening Matlab, navigating to the hctsa package, and then loading the paths required by the hctsa package by running the startup
script.
Denoising. If your data contain high-frequency measurement noise, the analyst should consider removing it, for example using a smoothing filter (e.g., a moving average), wavelet denoising, or a phase-space reconstruction.
Downsampling. Features assume that the data are sampled at a rate that properly resolves the temporal patterns of interest. If your data are over-sampled, then many features will be sensitive to this dominating autocorrelation structure, and will be less sensitive to interesting patterns in the data. In this case, you can consider downsampling your data, for which there are many heuristics.
Accompanying web resource that provides a self-organising database of time-series data. It allows users to upload, explore, and compare thousands of different types of time-series data.
pyspi is a comprehensive python library for computing hundreds of statistics of pairwise interactions (SPIs) directly from multivariate time series (MTS) data.
This reduced set of 22 features, determined through a combination of classification performance (across 93 problems) and mutual redundancy (as explained in this 📗 paper), is available here as an efficiently coded C implementation.
A reduced set of 16 features (trained on mouse fMRI data), coded in C with wrappers for use in python, R, etc.
Provides code for setting an hctsa feature-extraction job running on a computing cluster without leaving your local machine (with associated Julia helper code).
Provides MATLAB code for distributing hctsa calculations across nodes of a computing cluster (using pbs or slurm schedulers).
A repository for testing new time-series analysis algorithms for redundancy (and assessing whether to include them in hctsa).
A work in progress to recode many hctsa time-series analysis features into native python code.
This excellent repository allows users to run hctsa software from within python.
Some preliminary python code for analyzing the results of hctsa calculations.
Native python time-series code to extract hundreds of time-series features, with in-built feature filtering. See the paper.
Kats
is a toolkit for analysing time-series data in python, and includes basic feature extraction functionality.
theft
(Tools for Handling Extraction of Features from Time series) allows users to compute various open time-series features sets and contains associated tools for visualising and analysing the results.
TSFEL
, 'Time Series Feature Extraction Library', is a python package with implementations of 60 simple time-series features (with unit tests).
A flexible and efficient toolkit for time-series processing & feature extraction that makes minimal assumptions about sequential data. Includes support for a range of feature sets, including tsfresh, catch22, TSFEL, and Kats.
Includes a wide range of useful signal processing tools (like power spectral densities, detrending and bandpass filtering, and empirical mode decomposition). It also includes estimation of complexity parameters (many entropies and correlation dimensions, see part of readme here) as well as detrended fluctuation analysis.
Makes available existing collections of time-series data for analysis.
Includes implementations of a range of time-series features.
Provides a collection of tools for the analysis of tidy time-series data represented as a tsibble.
A unified framework for machine learning with time series.
An open-source library of efficient algorithms to analyse time series using GPU and CPU.
A python-based nonlinear time-series analysis and complex systems code package. See this publication.
A python package for performing machine learning using time-series data, including loading datasets from the UEA/UCR repository.
A python package that can extract features from multivariate time series.
Code file | Description |
| Burstiness statistic of a time series. |
| Fits a distribution to data. |
| Custom skewness measures. |
| Statistics of a kernel-smoothed distribution of the data. |
| Maximum likelihood distribution fit to data. |
| The highlowmu statistic. |
| Mode of the histogram. |
| A given measure of location of a data vector. |
| The maximum and minimum values of the input data vector |
| A moment of the distribution of the input time series. |
| How statistics depend on distributional outliers. |
| How distributional statistics depend on distributional outliers. |
| Proportion of values in a time-series vector. |
| Quantiles of the distribution of values in the time series data vector. |
| How time-series properties change as points are removed. |
| Fits of parametric distributions or simple time-series models. |
| Spread of the input time series. |
| Mean of the outlier-trimmed time series. |
| Distributional asymmetry. |
| The proportion of the time series that are unique values. |
| Proportion of data points within p standard deviations of the mean. |
| Coefficient of variation. |
| Distance from the mean at which a given proportion of data are more distant. |
| Distributional entropy. |
| Hypothesis test for distributional fits to a data vector. |
Code file | Description |
| Changes in the automutual information with the addition of noise. |
| Compute the autocorrelation of an input time series. |
| How the autocorrelation function changes with the time lag. |
| Statistics of the time series in a 2-dimensional embedding space. |
| Angle autocorrelation in a 2-dimensional embedding space. |
| Point density statistics in a 2-d embedding space. |
| Analyzes distances in a 2-d embedding space of a time series. |
| Shape-based statistics in a 2-d embedding space. |
| First time the autocorrelation function crosses a threshold. |
| Time of first minimum in a given correlation function. |
| A custom nonlinear autocorrelation of a time series. |
| Analysis of line-of-sight angles between time-series data points. |
| Statistics on datapoints inside geometric shapes across the time series. |
| The 1/e correlation length. |
| The first zero-crossing of the generalized self-correlation function. |
| The generalized linear self-correlation function of a time series. |
| Normalized nonlinear autocorrelation function, |
| Normalized nonlinear autocorrelation, |
| Computes James Theiler's crinkle statistic. |
| Computes Theiler's Q statistic. |
| Time reversal asymmetry statistic. |
| Principal Components analysis of a time series in an embedding space. |
Automutual information: |
| Automutual information (Rudy Moddemeijer implementation). |
| Variability in first minimum of automutual information. |
| The automutual information of the distribution using histograms. |
| Statistics on automutual information function for a time series. |
Code file | Description |
| Approximate Entropy of a time series. |
| Simple complexity measure of a time series. |
| Lempel-Ziv complexity of a n-bit encoding of a time series. |
| Approximate Shannon entropy of a time series. |
| Permutation Entropy of a time series. |
| Entropy of a time series using Rudy Moddemeijer's code. |
| How time-series properties change with increasing randomization. |
| Sample Entropy of a time series. |
| Multiscale entropy of a time series. |
| Recurrence period density entropy (RPDE). |
| Entropy of time series using wavelets. |
Code file | Description |
| Compares a range of ARMA models fitted to a time series. |
| Fits an AR model of a given order, p. |
| Compares model fits of various orders to a time series. |
| Robustness of test-set goodness of fit. |
| Exponential smoothing time-series prediction model. |
| Robustness of model parameters across different segments of a time series. |
| Comparison of GARCH time-series models. |
| GARCH time-series modeling. |
| Gaussian Process time-series modeling for local prediction. |
| Gaussian Process time-series model for local prediction. |
| Gaussian Process time-series model parameters and goodness of fit. |
| Change in goodness of fit across different state space models. |
| State space time-series model fitting. |
| Statistics of a fitted AR model to a time series. |
| Statistics on a fitted ARMA model. |
| Hidden Markov Model (HMM) fitting to a time series. |
| Fits a Hidden Markov Model to sequential data. |
| Goodness of model predictions across prediction lengths. |
| Simple local time-series forecasting. |
| How simple local forecasting depends on window length. |
| How surprised you would be by the next data point given recent memory. |
| Investigates whether AR model fit improves with different preprocessings. |
Code file | Description |
| Mean and variance in local time-series subsegments. |
| How stationarity estimates depend on the number of time-series subsegments. |
| The KPSS stationarity test. |
| Compares the distribution in consecutive time-series segments. |
| Compares local statistics to global statistics of a time series. |
| Phillips-Peron unit root test. |
| How the time-series range changes across time. |
| Sliding window measures of stationarity. |
| Bootstrap-based stationarity measure. |
| Simple mean-stationarity metric, |
| Standard deviation of the nth derivative of the time series. |
| How the output of |
| Cross-forecast errors of zeroth-order time-series models. |
| Variance ratio test for random walk. |
Step detection: |
| Analysis of discrete steps in a time series. |
| Dependence of step detection on regularization parameter. |
| Variance change points in a time series. |
Code file | Description |
| Correlation dimension of a time series. |
| Delay Vector Variance method for real and complex signals. |
| False nearest neighbors of a time series. |
| Normalized drop-one-out constant interpolation nonlinear prediction error. |
| Information dimension. |
|
|
| False nearest neighbors of a time series. |
| Fractal dimension spectrum, |
| Correlation sum scaling by Grassberger-Proccacia algorithm. |
| Largest Lyapunov exponent of a time series. |
| Poincare section analysis of a time series. |
| Analysis of the histogram of return times. |
| Taken's estimator for correlation dimension. |
|
|
| Box counting, information, and correlation dimension of a time series. |
| Analyzes the false-nearest neighbors statistic. |
| Analyzes test statistics obtained from surrogate time series. |
| Surrogate time-series analysis. |
| Optimal delay time using the method of Parlitz and Wichard. |
| Local density estimates in the time-delay embedding space. |
| Nonlinearity measure derived from the nonlinear average magnitude difference function. |
Fluctuation analysis: |
| Physionet implementation of multiscale multifractal analysis |
| Matlab wrapper for Max Little's |
| Implements fluctuation analysis by a variety of methods. |
Code file | Description |
| Statistics of the power spectrum of a time series. |
| A simple test of seasonality. |
| Periodicity extraction measure of Wang et al. |
| Detail coefficients of a wavelet decomposition. |
| Wavelet decomposition of the time series. |
| Continuous wavelet transform of a time series. |
| Discrete wavelet transform coefficients. |
| Parameters of fractional Gaussian noise/Brownian motion in a time series. |
| Frequency components in a periodic time series. |
Code file | Description |
| Statistics on a binary symbolization of the time series. |
| Characterizes stretches of 0/1 in time-series binarization. |
| Motifs in a coarse-graining of a time series to a 3-letter alphabet. |
| Local motifs in a binary symbolization of the time series. |
| Transition probabilities between different time-series states. |
| How transition probabilities change with alphabet size. |
Code file | Description |
| Classic heart rate variability (HRV) statistics. |
|
|
| The |
| Heart rate variability (HRV) measures of a time series. |
Code file | Description |
| Quantifies various measures of trend in a time series. |
| Goodness of a polynomial fit to a time series. |
| Length of an input data vector. |
| How local maximums and minimums vary across the time series. |
| Correlations between simple statistics in local windows of a time series. |
| Basic statistics about an input time series. |
Code file | Description |
| Moving threshold model for extreme events in a time series. |
| Statistical hypothesis test applied to a time series. |
| Visibility graph analysis of a time series. |
| Couples the values of the time series to a dynamical system. |
| Simulates a hypothetical walker moving through the time domain. |
| Compare how time-series properties change after pre-processing. |
| How time-series properties change in response to iterative pre-processing. |
An hctsa analysis requires setting up a library of time series, master operations, and operations, and generates a HCTSA.mat
file (using TS_Init
), as described here. Once this is set up, computations are run using TS_Compute
.
These steps, as well as information on how to inspect the results of an hctsa analysis and working with HCTSA*.mat files, are provided in this chapter.
At its core, hctsa analysis involves computing a library of time-series analysis features (which we call operations) on a time-series dataset.
The basic sequence of a Matlab-based hctsa analysis is to: 1. Initialize a HCTSA.mat
file, which contains all of the information about the set of time series and operations in your analysis, as well as the results of applying all operations to all time series, using TS_Init
. 2. Compute these operations on your time-series data using TS_Compute
. The results are structured in the local HCTSA.mat
file containing matrices (that store the results of the computations) and the tables (that store information about the time-series data and operations), as described here.
3. After the computation is complete, a range of processing, analysis, and plotting functions are provided to understand and interpret the results.
As a quick check of your operation library, you can compute the full default code library on a time-series data vector (a column vector of real numbers) as follows:
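For example, using TS_CalculateFeatureVector on a random data vector (a sketch; the output variable name is arbitrary):

```matlab
% Compute the full default feature library on a single (random) time series:
featureVector = TS_CalculateFeatureVector(randn(500,1));
```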
Suppose you have a time-series dataset to analyze. You first generate a formatted INP_ts.mat
input file containing your time series data and associated name and keyword labels, as described here. You then initialize an hctsa calculation using the default library of features:
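For example (assuming your input file is named INP_ts.mat, as above):

```matlab
% Initialize an hctsa analysis using the default (full) feature library:
TS_Init('INP_ts.mat');
```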
This generates a local file, HCTSA.mat
containing the associated metadata for your time series, as well as information about the full time-series feature library (Operations
) and the set of functions and code to call to evaluate them (MasterOperations
), as described here.
Next you want to evaluate the code on all of the time series in your dataset. For this you can simply run:
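With default settings:

```matlab
% Compute all features on all time series stored in HCTSA.mat:
TS_Compute();
```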
As described here, or, for larger datasets, using a script to regularly save back to the local file (cf. sample_runscript_matlab
).
Having run your calculations, you may then want to label your data using the keywords you provided in the case that you have labeled groups of time series:
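For example, to detect group labels automatically from unique keywords (one common usage, a sketch):

```matlab
% Assign group labels from keywords in the un-normalized ('raw') data:
TS_LabelGroups('raw',{});
```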
and then normalize and filter the data using the default sigmoidal transformation:
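For example, with default settings:

```matlab
% Filter and normalize the data matrix, producing HCTSA_N.mat:
TS_Normalize();
```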
A range of visualization scripts are then available to analyze the results, such as plotting the reordered data matrix:
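For example:

```matlab
% Plot the (clustered, where available) data matrix:
TS_PlotDataMatrix();
```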
To inspect a low-dimensional representation of the data:
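For example, with default settings:

```matlab
% Plot a low-dimensional projection of the time series in feature space:
TS_PlotLowDim();
```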
Or to determine which features are best at classifying the labeled groups of time series in your dataset:
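For example, with default settings:

```matlab
% Rank individual features by how well they separate the labeled groups:
TS_TopFeatures();
```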
Each of these functions can be run with a range of input settings.
Some external code packages require compiled binary code to be used. Compilation of the mex code is handled by compile_mex
as part of the install
script, but the TISEAN package binaries need to be compiled separately in the command line.
Many of the operations (especially external code packages) rely on mex functions (pieces of code written in C or fortran), that need to be compiled to run natively on a given system architecture. To ensure that as many operations as possible run successfully on your data, you should compile these mex functions for your system. This requires working compilers (e.g., gcc, g++) to be installed on your system, which can be configured using mex -setup
(cf. doc mex
for more information).
Once mex is set up, the mex functions used in the time-series code repository can be compiled by navigating to the Toolboxes directory and then running compile_mex
.
Some operations rely on the TISEAN nonlinear time-series analysis package, which Matlab accesses via the terminal using system
commands, so the TISEAN binaries cannot be installed from within Matlab, but instead must be installed from the command line. If you are running Linux or Mac, we will assume that you are familiar with the command line, while those running Windows will require an alternate method to install TISEAN, as explained below.
In the command line (not within Matlab), navigate to the Toolboxes/Tisean_3.0.1 directory of the repository, then run the following chain of commands:
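A typical chain looks like the following (a sketch of the standard TISEAN build process; exact steps may vary on your system):

```
./configure
make clean
make
make install
```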
This should install the TISEAN binaries in your ~/bin/ directory (you can instead install into a system-wide directory, /usr/bin, for example, by running ./configure --prefix=/usr
). Additional information about the TISEAN installation process is provided on the TISEAN website.
If installation was successful then you should be able to access the newly-compiled binaries from the commandline, e.g., typing the command which poincare
should return the path to the TISEAN function poincare
. Otherwise, you should check that the install directory is in your system path, e.g., by adding the following:
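For example (assuming the binaries were installed to ~/bin):

```
export PATH="$HOME/bin:$PATH"
```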
to your ~/.bash_profile (and running source ~/.bash_profile
to update).
The path where TISEAN is installed will also have to be in Matlab’s environment path, which is added by startup.m
, assuming that the binaries are stored in ~/bin. The startup.m
code also adds the DYLD_LIBRARY_PATH, which is also required for TISEAN to function properly.
If you choose to use a custom location for the TISEAN binaries, that is not in the default Matlab system path (getenv('PATH')
in Matlab), then you will have to add this path manually. You can test that Matlab can see the TISEAN binaries by typing, for example, the following into Matlab:
!which poincare
If Matlab’s system paths are set up correctly, this command should return the path to your compiled TISEAN binary, poincare
.
If you are running Matlab from Windows, you will need a mechanism for Matlab to call system
commands and find compiled TISEAN binaries. There are two options:
Install Cygwin on your machine. Cygwin provides a Linux distribution-like environment on Windows. Use this environment to compile and install TISEAN (as per the instructions above for Linux or Mac), which will require it to have C and fortran compilers installed. Matlab will then also need to be launched from Cygwin, using the command: matlab &
. This instance of Matlab should then be able to call system
commands through cygwin, including the ability to access the TISEAN binaries.
Sacrifice operations that rely on TISEAN. In total, TISEAN-based operations account for approximately 300 operations in the operation library. Although they provide important, well-tested implementations of nonlinear time-series analysis methods, it's not the end of the world if you decide it's too much trouble to install and are ok to miss out on these methods (see below on how to explicitly remove them from a computed library).
If you decide not to use functions from the TISEAN package, you should initialize your dataset with the TISEAN functions removed. You could do this by removing them from your INP_ops.txt
file when initializing your dataset, or you could remove them from your initialized hctsa dataset by filtering on the 'tisean'
keyword.
For example, to filter a local Matlab hctsa file (e.g., HCTSA.mat
), you can use the following: TS_LocalClearRemove('raw','ops',TS_GetIDs('tisean','raw','ops'),true);
, which will remove all operations with the 'tisean' keyword from the hctsa dataset in HCTSA.mat
.
[If you are using a mySQL database to store the results of your hctsa calculations, TISEAN functions can be removed from the database as follows: SQL_ClearRemove('ops',SQL_GetIDs('ops',0,'tisean',{}),true)
].
Once an hctsa dataset has been initialized (specifying details of a time-series dataset and operations to include using TS_Init
), all results entries in the resulting HCTSA.mat
are set to NaN
, corresponding to results that are as yet uncomputed.
Calculations are performed using the function TS_Compute
, which stores results back into the matrices in HCTSA.mat
. This function can be run without inputs to compute all missing values in the default hctsa file, HCTSA.mat
:
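That is, in its simplest form:

```matlab
% Compute all missing (NaN) entries in the default file, HCTSA.mat:
TS_Compute();
```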
TS_Compute
will begin evaluating operations on time series in HCTSA.mat
for which elements in TS_DataMat
are NaN
(i.e., computations that have not been run previously). Results are stored back in the matrices of HCTSA.mat
: TS_DataMat
(output of each operation on each time series), TS_CalcTime
(calculation time for each operation on each time series), and TS_Quality
(labels indicating errors or special-valued outputs).
TS_Compute has several options; example calls for each are sketched after this list:
(1) Computing features in parallel across available cores using Matlab's Parallel Processing Toolbox. This can be achieved by setting the first input to true:
(2) Computing across a custom range of time-series IDs (ts_id
) and operation IDs (op_id
). This can be achieved by setting the second and third inputs:
(3) Specifying what types of values should be computed:
(4) Specifying a custom .mat file to operate on (HCTSA.mat
is the default):
(5) Suppress commandline output. All computations are displayed to screen by default (which can be overwhelming but is useful for error checking). This functionality can be suppressed by setting the final (6th) input to false
:
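Example calls corresponding to options (1)–(5) above (a sketch only: the input ordering shown follows the conventions described in the text, so check help TS_Compute in your version; the ID ranges and custom filename are hypothetical):

```matlab
TS_Compute(true);                                     % (1) compute in parallel
TS_Compute(true,1:100,1:500);                         % (2) restrict to given ts_id and op_id ranges
TS_Compute(true,[],[],'missing');                     % (3) compute only missing values
TS_Compute(true,[],[],'missing','HCTSA_custom.mat');  % (4) operate on a custom .mat file
TS_Compute(true,[],[],'missing','HCTSA.mat',false);   % (5) suppress commandline output
```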
Computing features for full time-series datasets can be time consuming, especially for large datasets of long time series. An appropriate strategy therefore depends on the time-series length, the number of time series in the dataset, and the computational resources available. When multiple cores are available, it is always recommended to use the parallel setting (i.e., as TS_Compute(true)
).
The first thing to think about is how the time taken to compute 7749 features of v0.93 of hctsa scales with the length of time series in your dataset (see plot below). The figure compares results using a single core (e.g., TS_Compute(false)
) to results using a 16-core machine, with parallelization enabled (e.g., TS_Compute(true)
).
Times may vary across individual machines, but the above plot can be used to estimate the computation time per time series, and thus help decide on an appropriate computation strategy for a given dataset.
Note that if computation times are too long for the computational resources at hand, one can always choose a reduced set of features, rather than the full set of >7000, to get a preliminary understanding of the dataset. One such reduced set of features is in INP_ops_reduced.txt
. We plan to provide additional reduced feature sets, determined according to different criteria, in the future.
If only a single machine is available for computation, there are a couple of options:
For small datasets, when it is feasible to run all computations in a single go, it is easiest to run computations within Matlab in a single call of TS_Compute
.
For larger datasets that may run for a long time on a single machine, one may wish to use something like the provided sample_runscript_matlab
script, where TS_Compute
commands are run in a loop over time series, computing small sections of the dataset at a time (and then saving the results to file, e.g., HCTSA.mat
), eventually covering the full dataset iteratively.
Code for running distributed hctsa computations on a cluster (using pbs or slurm schedulers) is here. The strategy is as follows: with a distributed computing setup, a local Matlab file (HCTSA.mat
) can be split into smaller pieces using TS_Subset
, which outputs a new data file for a particular subset of your data, e.g., TS_Subset('raw',1:100)
will generate a new file, HCTSA_subset.mat
that contains just time series with IDs from 1 to 100. Computing features for time series in each such subset can then be run on a distributed computing setup. For example, with a different compute node computing a different subset (by queuing batch jobs that each work on a given subset of time series). After all subsets have been computed, the results are recombined into a single HCTSA.mat
file using TS_Combine
commands.
Distributing feature computations on a large-scale distributed computing setup can be better suited to a linked mySQL database, especially for datasets that grow with time, as new time series can be easily added to the database. In this case, computation proceeds similarly to above, where shell scripts on a distributed cluster computing environment can be used to distribute jobs across cores, with all individual jobs writing to a centralized mySQL server. A set of Matlab code that generates an appropriately formatted mySQL database and interfaces with the database to facilitate hctsa feature computation is included with the software package, and is described in detail here.
Formatted input files are used to set up a custom dataset of time-series data, pieces of Matlab code to run (master operations), and associated outputs from that code (operations). By default, you can simply specify a custom time-series dataset and the default operation library will be used. In this section we describe how to initiate an hctsa analysis, including how to format the input files used in the hctsa framework.
TS_Init
To work with a default feature set, hctsa or catch22, you just need to specify information about the time-series to analyze, specified by an input file (e.g., INP_ts.mat
or INP_ts.txt
). Details of how to format this input file are described below. A test input file, 'INP_test_ts.mat'
, is provided with the repository, so you can set up feature extraction for it using either the full hctsa feature set or the reduced catch22 feature set, as sketched below.
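A minimal sketch (the 'catch22' feature-set argument reflects how reduced sets are typically specified to TS_Init; check the interface of your installed version):

```matlab
% Full hctsa feature library (the default):
TS_Init('INP_test_ts.mat');

% Reduced catch22 feature set:
TS_Init('INP_test_ts.mat','catch22');
```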
The full hctsa feature set involves significant computation time, so it is a recommended first step to test out your analysis pipeline using a smaller, faster feature set like catch22 (but note that catch22 is insensitive to the mean and standard deviation; to include them, use catch24). This feature set is provided as a submodule within hctsa, and it is very fast to compute using compiled C code (the features are compiled on the initial install of hctsa, by running mexAll from the Toolboxes/catch22 directory of hctsa).
TS_Init
produces a Matlab file, HCTSA.mat
, containing all of the structures required to understand the set of time series, operations, and the results of their computation (explained here). Through this initialization process, each time series will be assigned a unique ID, as will each master operation, and each operation.
You can specify a custom feature set of your own making by specifying
The code to run (INP_mops.txt
); and
The features to extract from that code (INP_ops.txt
).
Details of how to format these input files are described below. The syntax for using a custom feature-set is as:
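A sketch of the call, assuming your custom input files are named as below:

```matlab
% Initialize with custom master-operation and operation input files:
TS_Init('INP_ts.mat','INP_mops.txt','INP_ops.txt');
```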
When formatting a time series input file, two formats are available:
.mat
file input, which is suited to data that are already stored as variables in Matlab; or
.txt
file input, which is better suited to cases where each time series is already stored as an individual text file.
When using a .mat file input, the .mat file should contain three variables:
timeSeriesData
: either a N x 1 cell (for N time series), where each element contains a vector of time-series values, or a N x M matrix, where each row specifies the values of a time series (all of length M).
labels
: a N x 1 cell of unique strings containing a named label for each time series.
keywords
: a N x 1 cell of strings, where each element contains a comma-delimited set of keywords (one for each time series), containing no whitespace.
An example involving two time series is below. In this example, we add two time series (showing only the first two values of each), which are labeled according to .dat
files from a hypothetical EEG experiment, and assigned keywords (which are separated by commas and no whitespace). In this case, both are assigned keywords 'subject1' and 'eeg' and, additionally, the first time series is assigned 'trial1', and the second 'trial2' (these labels can be used later to retrieve individual time series). Note that the labels do not need to specify filenames, but can be any useful label for a given time series.
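A sketch of how these variables might be set up for this hypothetical example (the data values shown are truncated placeholders and the file/label names are illustrative):

```matlab
% Two time series from a hypothetical EEG experiment (only two values shown each):
timeSeriesData = {[1.45; 2.87]; [5.21; 0.93]};  % N x 1 cell of column vectors
labels = {'eeg_subject1_trial1.dat'; 'eeg_subject1_trial2.dat'};  % unique names
keywords = {'subject1,eeg,trial1'; 'subject1,eeg,trial2'};  % comma-delimited, no whitespace
save('INP_ts.mat','timeSeriesData','labels','keywords');
```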
When using a text file input, the input file now specifies filenames of time series data files, which Matlab will then attempt to load (using dlmread
). Data files should thus be accessible in the Matlab path. Each time-series text file should have a single real number on each row, specifying the ordered values that make up the time series. Once imported, the time-series data is stored in the database; thus the original time-series data files are no longer required, and can be removed from the Matlab path.
The input text file should be formatted as rows with each row specifying two whitespace separated entries: (i) the file name of a time-series data file and (ii) comma-delimited keywords.
For example, consider the following input file, containing three lines (one for each time series to be added to the database):
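Consistent with the format described above and the time series listed below, the file would look something like this (a whitespace-separated filename and comma-delimited keywords on each row):

```
gaussianwhitenoise_001.dat   noise,gaussian
gaussianwhitenoise_002.dat   noise,gaussian
sinusoid_001.dat             periodic,sine
```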
Using this input file, a new analysis will contain 3 time series: gaussianwhitenoise_001.dat
and gaussianwhitenoise_002.dat
will be assigned the keywords noise
and gaussian
, and the data in sinusoid_001.dat
will be assigned keywords ‘periodic’ and ‘sine’. Note that keywords should be separated only by commas (and no whitespace).
In our system, a master operation refers to a piece of Matlab code and a set of input parameters.
Valid outputs from a master operation are: 1. A single real number, 2. A structure containing real numbers, 3. NaN
to indicate that the input time series is not appropriate for this code.
The (potentially many) outputs from a master operation can thus be mapped to individual operations (or features), which are single real numbers summarizing a time series that make up individual columns of the resulting data matrix.
Two example lines from the input file, INP_mops.txt
(in the Database
directory of the repository), are as follows:
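Reconstructed from the description below, the two lines would look something like:

```
CO_tc3(x_z,1)    CO_tc3_xz_1
ST_length(x)     ST_length
```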
Each line in the input file specifies two pieces of information, separated by whitespace: 1. A piece of code and its input parameters. 2. A unique label for that master operation (that can be referenced by individual operations).
When the time comes to perform computations on data using the methods in the database, Matlab needs to have path access to each of the master operations functions specified in the database. For the above example, Matlab will attempt to run both CO_tc3(x_z,1)
and ST_length(x)
, and thus the functions CO_tc3.m
and ST_length.m
must be in the Matlab path. Recall that the script startup.m
, which should be run at the start of each session using hctsa, handles the addition of paths required for the default code library.
The input file, e.g., INP_ops.txt
(in the Database
directory of the repository) should contain a row for every operation, and use labels that correspond to master operations. An example excerpt from such a file is below:
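A sketch of such an excerpt, consistent with the description below (the structure field names and operation names here are illustrative assumptions):

```
CO_tc3_xz_1.raw     CO_tc3_1_raw     correlation,nonlinear
CO_tc3_xz_1.abs     CO_tc3_1_abs     correlation,nonlinear
CO_tc3_xz_1.num     CO_tc3_1_num     correlation,nonlinear
CO_tc3_xz_1.absnum  CO_tc3_1_absnum  correlation,nonlinear
CO_tc3_xz_1.denom   CO_tc3_1_denom   correlation,nonlinear
ST_length           length           raw,lengthDependent
```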
The first column references a corresponding master label and, in the case of master operations that produce structure, the particular field of the structure to reference (after the fullstop), the second column denotes the label for the operation, and the final column is a set of comma-delimited keywords (that must not include whitespace). Whitespace is used to separate the three entries on each line of the input file. In this example, the master operation labeled CO_tc3_xz_1
, outputs a structure with fields that are referenced by the first five operations listed here, and the ST_length
master operation outputs a single number (the length of the time series), which is referenced by the operation named 'length' here. The two keywords 'correlation' and 'nonlinear' are added to the CO_tc3_1
operations, while the keywords raw
and lengthDependent
are added to the operation called length
. These keywords can be used to organize and filter the set of operations used for a given analysis task.
The hctsa framework consists of three basic objects containing relevant metadata:
Master Operations specify pieces of code (Matlab functions) and their inputs to be computed. Taking in a single time series, master operations can generate a large number of outputs as a Matlab structure, each of which can be identified with a single operation (or 'feature').
Operations (or 'features') are a single number summarizing some measure of structure in a time series. In hctsa, each operation links to an output from a piece of evaluated code (a master operation).
Time series are univariate and uniformly sampled time-ordered measurements.
These three different objects are summarized below:
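An illustrative summary, drawing on the example described next:

Object | Example |
Master operation | CO_AutoCorr(x,1:5,'TimeDomain'): compute autocorrelations at lags 1–5 |
Operation ('feature') | AC_1: the autocorrelation at lag 1 |
Time series | A univariate, uniformly sampled, time-ordered recording |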
In the example above, a master operation specifies the code to run, CO_AutoCorr(x,1:5,'TimeDomain')
, which outputs the autocorrelation of the input time series (x
) at lags 1, 2, ..., 5. Each operation (or 'feature') is a single number that draws on this set of outputs, for example, the autocorrelation at lag 1, which is named AC_1
.
In the hctsa framework, master operations, operations, and time series are stored as tables that contain all of their associated keywords and metadata (and actual time-series data in the case of time series).
For a given hctsa analysis, the user must specify a set of code to evaluate (master operations), their associated individual outputs to measure (operations), and a set of time series to evaluate the features on (time series).
We provide a default library of over 7700 operations (derived from approximately 1000 unique master operations). This can be customized, and additional pieces of code can also be added to the repository.
Having specified a set of master operations, operations, and time series, the results of computing these functions on the time-series data are stored in three matrices:
TS_DataMat is an n x m data matrix containing the results of applying m operations to the n time series.
TS_Quality is an n x m matrix containing quality labels for each operation output (coding different outputs such as errors or NaNs). Quality labels are described in the section below.
TS_CalcTime is an n x m matrix containing calculation times for each operation output. Note that the calculation time stored is for the corresponding master operation.
Each HCTSA*.mat
file includes the tables described above: for TimeSeries (corresponding to the rows of the TS_ matrices), Operations (corresponding to columns of the TS_ matrices), and MasterOperations, corresponding to the code evaluated to compute the operations. In addition, the results are stored as above: TS_DataMat, TS_Quality, and TS_CalcTime.
Quality labels are used to indicate when operations take non-real values, or when fatal errors were encountered. Quality labels are stored in the Quality column of the Results table in the mySQL database, and in local Matlab files as the TS_Quality matrix.
When the quality label is nonzero, this indicates that a special-valued output occurred. In this case, the output value of the operation is set to zero, as a convention, and the quality label codes the special output value:
When applying thousands of time-series analysis methods to diverse datasets, many operations can give results that are not real numbers. Some time series may be inappropriate for a given operation (such as fitting a positive-only distribution to data that is not positive), or measuring stationarity across 2000 datapoints in time series that are shorter than 2000 samples. Other times, an optimization routine may fail, or some unknown error may be called.
Some errors are not problems with the code, but represent issues with applying particular sets of code to particular time series, such as when a Matlab fitting function reaches the maximum number of iterations and returns an error. Other errors are genuine problems with the code that need to be corrected. Both cases are labeled as errors in our framework.
It can be good practice to visualize where special values and errors are occurring after a computation to see where things might be going wrong, using TS_InspectQuality
. This can be run in four modes:
TS_InspectQuality('summary');
[default] Summarizes the proportion of special-valued outputs in each operation as a bar plot, ordered by the proportion of special-valued outputs.
TS_InspectQuality('master');
Plots which types of special-valued outputs were encountered for each master operation.
TS_InspectQuality('full');
Plots the full data matrix (all time series as rows and all operations as columns), and shows where each possible special-valued output can occur (including 'error', 'NaN', 'Inf', '-Inf', 'complex', 'empty', or a 'link error').
TS_InspectQuality('reduced');
As 'full'
, but includes only columns where special values occurred.
For example, running TS_InspectQuality('summary')
loads in data from HCTSA.mat and produces a summary plot that can be zoomed in on and explored to understand which features are producing problematic outputs.
Note that errors returned from Matlab files do not halt the progress of the computation (using try-catch
statements), but errors with compiled mex functions (or external command-line packages like TISEAN) can produce a fault that crashes Matlab or the system. We have performed some basic testing on all mex functions, but for some unusual time series, such faults may still occur. These situations must be dealt with by either identifying and fixing the problem in the original source code and recompiling, or by removing the problem code.
When getting information on operations that produce special-valued outputs (getting IDs listed from TS_InspectQuality
), it can be useful to then test examples by re-running pieces of code with the problematic data. The function TS_WhichProblemTS
can be used to retrieve time series from an hctsa dataset that caused a problem for a given operation.
Usage is as follows:
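A sketch of the call (the operation ID 684 matches the example described below, and the output names follow the text):

```matlab
% Retrieve time series that caused problems for the operation with ID 684:
[ts_ind, dataCell, codeEval] = TS_WhichProblemTS(684);
```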
This provides the list of time series IDs (ts_ind
), their time-series data vectors (dataCell
), and the code to evaluate the given operation (in this case, the master operation code corresponding to the operation with ID 684).
You can then pick an example time series (e.g., the first problem time series: x = dataCell{1}; x_z = zscore(x)
), and then copy and paste the code in codeEval into the command line to evaluate the code for this time series. This method allows easy debugging and inspection of examples of time-series data that caused problems for particular operations flagged through the TS_InspectQuality
process.
The clustered data matrix (if clustering has been performed, otherwise the non-clustered data matrix is used) can be visualized by running TS_PlotDataMatrix
.
This will produce a colored visualization of the data matrix such as that shown below.
When data is grouped according to a set of distinct keywords and stored as group metadata (using the TS_LabelGroups function
), these can also be visualized using TS_PlotDataMatrix('colorGroups',true)
.
Running TS_PlotDataMatrix('norm')
plots the data contained in the local file HCTSA_N.mat
, yielding a plot in which black rectangles label missing values, and other values are shown from low (blue) to high (red) after normalization using the scaled outlier-robust sigmoidal transformation. Due to the size of the matrix, operations are not labeled individually.
Examples of time-series segments are shown to the left of the plot, and when the middle plot is zoomed, the time-series annotations remain matched to the data matrix.
It can be useful to display the matrix with the order of time series and operations preserved, but the relationships between rows and columns can be difficult to visualize when ordered randomly.
Much prettier, hey?! By reordering rows and columns, this representation reveals correlated patterns of outputs across different types of operations, and similar sets of properties between different types of time series.
In this example, we consider a set of 20 periodic and 20 noisy periodic signals. We assigned the time series in HCTSA.mat
to groups (using TS_LabelGroups('raw',{'periodic','noisy'})
), then normalized the data matrix (TS_Normalize
), and then clustered it (TS_Cluster
). So now we have a clustered data matrix containing thousands of summaries of each time series, as well as pre-assigned group information as to which time series are periodic and which are noisy. When the time series have been assigned to groups, this can be accessed by switching on the 'colorGroups'
setting:
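For example, using the calling pattern shown earlier for TS_PlotDataMatrix:

```matlab
TS_PlotDataMatrix('colorGroups',false);  % default coloring
TS_PlotDataMatrix('colorGroups',true);   % color by assigned group labels
```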
producing the following two plots:
When group information is not used (the left plot), the data is visualized in the default blue/yellow/red color scheme, but when the assigned groups are colored (right plot), we see that the clustered dataset separates perfectly into the periodic (green) and noisy (blue) time series, and we can visualize the features that contribute to the separation.
hctsa allows you to map time-series data into a unified feature space, a representation that is well suited to then applying machine learning methods to perform classification where the goal is to predict a class label assigned to each time series from features of its dynamics.
Class labels can be assigned to time series in an hctsa dataset using the function TS_LabelGroups
, which populates the Group
column of the TimeSeries
table as a categorical labeling of the data (the class labels the classifier will attempt to predict). These group labels are used by a range of analysis functions, including TS_PlotLowDim
, TS_TopFeatures
, and TS_Classify
.
The machinery of TS_LabelGroups
assumes that the class labels you are interested in are contained in the Keywords
column of the TimeSeries
table (set up from your original input file when TS_Init
was run).
If you want to label according to a specific set of keywords, you can do this by specifying the keywords that define the unique groups. The example below assigns labels to two groups of time series in the HCTSA.mat
(specifying the shorthand 'raw'
for this default, un-normalized data), corresponding to those labeled as 'parkinsons'
and those labeled as 'healthy'
:
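For example, following the calling pattern used elsewhere in this documentation:

```matlab
% Label the un-normalized ('raw') data into two keyword-defined groups:
TS_LabelGroups('raw',{'parkinsons','healthy'});
```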
Note that any time series not labeled as either 'parkinsons'
or 'healthy'
are left unlabeled.
If every time series has a single keyword that uniquely labels it, TS_LabelGroups
can automatically detect this by running TS_LabelGroups('raw',{});
.
More complex labeling (e.g., using custom combinations of keywords) is not implemented in TS_LabelGroups, but can be achieved by writing a script that performs logical operations on calls to TS_LabelGroups and saves the result back to the TimeSeries.Group column. An example is combining two labels (male/female and day/night) on fly movement data.
By default, the group labels are saved back to the data file (in this example, HCTSA.mat
).
Group labels can be reassigned at any time by re-running the TS_LabelGroups
function, and can be cleared by running, e.g., TS_LabelGroups('raw','clear')
.
Assigned labels are used by the analysis, plotting, and classification functions of hctsa.
Note: If you assign a labeling to a given dataset (e.g., HCTSA.mat
), then this labeling will remain with the normalized data (e.g., after running TS_Normalize
).
Once a set of operations has been computed on a time-series dataset, the results are stored in a local HCTSA.mat file. These results can be used to perform a wide variety of highly comparative analyses, such as those outlined below.
The type of analysis employed should then be motivated by the specific time-series analysis task at hand. Setting up the problem, guiding the methodology, and interpreting the results requires strong scientific input that should draw on domain knowledge, including the questions asked of the data, experience performing data analysis, and statistical considerations.
The first main component of an hctsa analysis involves filtering and normalizing the data using TS_Normalize, described below, which produces a file called HCTSA_N.mat. Information about the similarity of pairs of time series and operations can be computed using TS_Cluster, also described below, which stores this information in HCTSA_N.mat. The suite of plotting and analysis tools provided with hctsa work with this normalized data, stored in HCTSA_N.mat, by default.
Visualizing structure in the data matrix using TS_PlotDataMatrix.
Visualizing the time-series traces using TS_PlotTimeSeries.
Visualizing low-dimensional structure in the data using TS_PlotLowDim.
Exploring similar matches to a target time series using TS_SimSearch.
Visualizing the behavior of a given operation across the time-series dataset using TS_FeatureSummary.
For time-series classification tasks, groups of time series can be labeled using the TS_LabelGroups function described above; this group label information is stored in the local HCTSA*.mat file, and used by default in the various plotting and analysis functions provided. Additional analysis functions are provided for basic time-series classification tasks:
Explore the classification performance of the full library of features using TS_Classify.
Determine the features that (individually) best distinguish between the labeled groups using TS_TopFeatures.
For the purposes of visualizing the data matrix, it is often desirable to have the rows and columns reordered to put similar rows adjacent to one another, and similarly to place similar columns adjacent to one another. This reordering can be done using hierarchical linkage clustering, by the function TS_Cluster
:
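A minimal call, run with default settings on the normalized data, is:
% Compute linkage clustering and reorder rows/columns of HCTSA_N.mat by similarity:
TS_Cluster();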
This function reads in the data from HCTSA_N.mat, and stores the re-ordering of rows and columns back into HCTSA_N.mat in the ts_clust and op_clust structures (and, if the size is manageable, also the pairwise distance information). Visualization functions (such as TS_PlotDataMatrix and TS_PlotTimeSeries) can then take advantage of this information, using the general input label 'cl'.
The first step in analyzing a dataset involves processing the data matrix, which can be done using TS_Normalize
. This involves filtering out operations or time series that produced many errors or special-valued outputs, and then normalizing of the output of all operations, which is typically done in-sample, according to an outlier-robust sigmoidal transform (although other normalizing transformations can be selected). Both of these tasks are performed using the function TS_Normalize
. The TS_Normalize
function writes the new, filtered, normalized matrix to a local file called HCTSA_N.mat
. This contains normalized, and trimmed versions of the information in HCTSA.mat
.
Example usage is as follows:
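For example (using the inputs described below):
TS_Normalize('mixedSigmoid',[0.8,1.0]);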
The first input controls the normalization method, in this case a scaled, outlier-robust sigmoidal transformation, specified with 'mixedSigmoid'
. The second input controls the filtering of time series and operations based on minimum thresholds for good values in the corresponding rows (corresponding to time series; filtered first) and columns (corresponding to operations; filtered second) of the data matrix.
In the example above, time series (rows of the data matrix) with more than 20% special values (specifying 0.8) are first filtered out, and then operations (columns of the data matrix) containing any special values (specifying 1.0) are removed. Columns with approximately constant values are also filtered out. After filtering the data matrix, the outlier-robust 'mixedSigmoid' transformation is applied to all remaining operations (columns). The filtered, normalized matrix is saved to the file HCTSA_N.mat.
Details of the normalization applied are saved to the HCTSA_N.mat file as normalizationInfo, a structure that contains the normalization function, the filtering options used, and the corresponding TS_Normalize code that can be used to re-run the normalization.
It makes sense to weight each operation equally for the purposes of dimensionality reduction, and thus normalize all operations to the same range using a transformation like ‘scaledRobustSigmoid’, ‘scaledSigmoid’, or ‘mixedSigmoid’. For the case of calculating mutual information distances between operations, however, one would rather not distort the distributions and perform no normalization, using ‘raw’ or a linear transformation like ‘zscore’, for example. The full list of implemented normalization transformations is listed in the function BF_NormalizeMatrix.
Note that the 'scaledRobustSigmoid' transformation does not tolerate distributions with an interquartile range of zero, which will be filtered out; 'mixedSigmoid' will treat these distributions in terms of their standard deviation (rather than interquartile range).
Filtering parameters depend on the application. Some applications can allow the filtering thresholds to be relaxed; for example, setting [0.7,0.9] removes time series with less than 70% good values, and then removes operations with less than 90% good values. Some applications can tolerate some special-valued outputs from operations (like some clustering methods, where distances are simply calculated using those operations that did not produce special-valued outputs for each pair of objects), but others cannot (like Principal Components Analysis); the filtering parameters should be specified accordingly.
Analysis can be performed on the data contained in HCTSA_N.mat
in the knowledge that different settings for filtering and normalizing the results can be applied at any time by simply rerunning TS_Normalize
, which will overwrite the existing HCTSA_N.mat
with the results of the new normalization and filtration settings.
TS_FilterData
It is often useful to check whether the feature-based classification results of a given analysis is driven by 'trivial' types of features that do not depend on the dynamical properties of the data, e.g., features sensitive to time-series length, location (e.g., mean), or spread (e.g., variance). Because these features are labeled as 'lengthdep'
, 'locdep'
, and 'spreaddep'
, you can easily filter these out to check the robustness of your analysis.
An example:
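A minimal sketch is below; it assumes that TS_GetIDs returns the IDs of operations matching a keyword (and, as a second output, the IDs of those that don't), and that TS_FilterData takes the data source, time-series IDs to keep, operation IDs to keep, and an output file name (check each function's help text; the output filename is a hypothetical choice):
% Find operations tagged as length-dependent in the normalized data:
[lengthDepIDs,notLengthDepIDs] = TS_GetIDs('lengthdep','norm','ops');
% Save a filtered version of the data without these features:
TS_FilterData('norm',[],notLengthDepIDs,'HCTSA_lengthFilt.mat');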
You could use the same template to filter 'locdep'
or 'spreaddep'
features (or any other combination of keyword labels). You can then go ahead with analyzing the filtered HCTSA dataset as above, except using your new filename, HCTSA_locFilt
.
When running hctsa analyses, often you want to take subsets of time series (to look in more detail at a subset of your data) or subsets of operations (to explore the behavior of different feature subsets), or combine multiple subsets of data together (e.g., as additional data arrive).
The hctsa package contains a range of functions for these types of tasks, which work directly with hctsa .mat files and are described below. Note that these types of tasks are easier to manage when hctsa data are stored in a mySQL database.
TS_GetIDs
Many time-series classification problems involve filtering subsets of time series based on keyword matching, where keywords are specified in the input file provided when initializing a dataset.
Most filtering functions (such as those listed in this section) require you to specify a range of IDs of TimeSeries or Operations. Recall that each TimeSeries and Operation is assigned a unique ID (stored as the ID field in the corresponding metadata table). To quickly get the IDs of time series that match a given keyword, the following function can be used:
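For example (where 'noisy' is an example keyword):
tsIDs = TS_GetIDs('noisy','norm'); % IDs of time series tagged 'noisy' in HCTSA_N.mat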
Or the IDs of operations tagged with the 'entropy' keyword:
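A sketch of the same call for operations (the 'ops' specifier is an assumption; see the function's help):
opIDs = TS_GetIDs('entropy','norm','ops');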
These IDs can then be used in the functions below (e.g., to clear data, or extract a subset of data).
Note that to get a quick impression of the unique time-series keywords present in a dataset, use the function TS_WhatKeywords
, which gives a text summary of the unique keywords in an hctsa dataset.
TS_LocalClearRemove
Sometimes you may want to remove a time series from an hctsa dataset because the data was not properly processed, for example. Or one operation may have produced errors because of a missing toolbox reference, or you may have altered the code for an operation, and want to clear the stored results from previous calculations.
For example, often you want to remove from your operation library operations that are dependent on the location of the data (e.g., its mean: 'locdep'
), that only operate on positive-only time series ('posOnly'
), that require the TISEAN package ('tisean'
), or that are stochastic (i.e., they give different results when repeated, 'stochastic'
).
The function TS_LocalClearRemove
achieves these tasks when working directly with .mat files (NB: if using a mySQL database, the corresponding database function should be used instead).
TS_LocalClearRemove
loads an hctsa .mat data file, clears or removes the specified time series or operations, and then writes the result back to the file.
Example 1: Clear all computed data from time series with IDs 1:5 from HCTSA.mat
(specifying 'raw'
):
Example 2: Remove all operations with the keyword 'tisean' (that depend on the TISEAN package) from HCTSA.mat:
Example 3: Remove all operations that require positive-only data (the 'posOnly' keyword) from HCTSA.mat:
Example 4: Remove all operations that are location dependent (the 'locdep'
keyword) from HCTSA.mat
:
See the documentation in the function file for additional details about the inputs to TS_LocalClearRemove
.
TS_Subset
Sometimes it's useful to retrieve a subset of an hctsa dataset, when analyzing just a particular class of time series, for example, or investigating a balanced subset of data for time-series classification, or to compare the behavior of a reduced subset of features. This can be done with TS_Subset
, which takes in an hctsa dataset and generates the desired subset, which can be saved to a new .mat file.
Example 1: Import data from 'HCTSA_N.mat', then save a new dataset containing only time series with IDs in the range 1--100, and all operations, to 'HCTSA_N_subset.mat' (see documentation for all inputs).
Note that the subset in this case will have been normalized using the full dataset of all time series, and only this subset (with IDs up to 100) is now being analyzed. Depending on the normalization method used, different results would be obtained if the subsetting were performed prior to normalization.
Example 2: From HCTSA.mat
('raw'
), save a subset of that dataset to 'HCTSA_healthy.mat' containing only time series tagged with the 'healthy' keyword:
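A minimal sketch (the input order of TS_Subset, i.e., data source, time-series IDs, operation IDs, save flag, output file name, is an assumption; check the function's help):
% IDs of time series tagged 'healthy' in HCTSA.mat:
healthyIDs = TS_GetIDs('healthy','raw');
% Save a new dataset containing only these time series:
TS_Subset('raw',healthyIDs,[],true,'HCTSA_healthy.mat');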
TS_Combine
When analyzing a growing dataset, sometimes new data needs to be combined with computations on existing data. Alternatively, when computing a large dataset, sometimes you may wish to compute sections of it separately, and may later want to combine each section into a full dataset.
To combine hctsa data files, you can use the TS_Combine
function.
Example: combine hctsa datasets stored in the files HCTSA_healthy.mat
and HCTSA_disease.mat
into a new combined file, HCTSA_combined.mat
:
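A minimal sketch (the third input, compare_tsids, is described below; an output file name, e.g., 'HCTSA_combined.mat', may also need to be specified, so check the function's help):
TS_Combine('HCTSA_healthy.mat','HCTSA_disease.mat',false);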
We use the convention that x refers to the input time series and x_z refers to a z-scored transformation of the input time series (i.e., x_z = (x - mean(x))/std(x)). In the example above, the first line thus adds an entry in the database for running the code CO_tc3 using a z-scored time series as input (x_z), with 1 as the second input, under the label CO_tc3_xz_1, and the second line adds an entry for running the code ST_length using the untransformed time series, x, under the label length.
The third input, compare_tsids, controls the behavior of the function in combining time series. By setting this to 1, TS_Combine assumes that the TimeSeries IDs are comparable between the datasets (most common when using a mySQL database), and thus filters out duplicates so that the resulting hctsa dataset contains a unique set of time series. By setting this to 0 (default), the output will contain a union of the time series present in each of the two hctsa datasets. In the case that duplicate TimeSeries IDs exist in the combination file, a new index will be generated in the combined file (where IDs assigned to time series are re-assigned as unique integers using TS_ReIndex).
In combining operations, this function works differently when data have been stored in a unified mySQL database, in which case operation IDs can be compared meaningfully and combined as an intersection. However, when hctsa datasets have been generated using TS_Init, the function will check that the same set of operations has been used in both files.
|  | Master Operation | Operation | Time Series |
| --- | --- | --- | --- |
| Summary: | Code and inputs to execute | Single feature | Univariate data |
| Example: | CO_AutoCorr(x,1:5,'TimeDomain') | AC_1 | [1.2, 33.7, -0.1, ...] |
| Quality label | Description |
| --- | --- |
| 0 | No problems with calculation. Output was a real number. |
| 1 | A fatal error was encountered. |
| 2 | Output of the code was NaN. |
| 3 | Output of the code was Inf. |
| 4 | Output of the code was -Inf. |
| 5 | Output had a non-zero imaginary component. |
| 6 | Output was empty (e.g., []). |
| 7 | Field specified for this operation did not exist in the master operation output structure. |
In a time-series classification task, you are often interested in not only determining differences between the labeled classes, but also interpreting them in the context of your application. For example, an entropy estimate may give low values to the RR interval series measured from a healthy population, but higher values to those measured from patients with congestive heart failure. When performing a highly comparative time-series analysis, we have computed a large number of time-series features and can use these results to arrive at these types of interpretable conclusions automatically.
A simple way to determine which individual features are useful for distinguishing the labeled classes of your dataset is to compare each feature individually in terms of its ability to separate the labeled classes of time series. This can be achieved using the TS_TopFeatures
function.
In this example, we consider 100 EEG time series from seizure, and 100 EEG time series from eyes open (from the publicly-available Bonn University dataset). After running TS_LabelGroups
(as simply TS_LabelGroups()
, which automatically detects the two keywords assigned to the two classes of signals), we run simply:
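% Run with default settings (a bare call; optional inputs control the test
% statistic, the plots produced, and whether nulls are computed):
TS_TopFeatures();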
By default this function will compare the in-sample linear classification performance of all operations in separating the labeled classes of the dataset (individually), produce plots summarizing the performance across the full library (compared to an empirical null distribution), characterize the top-performing features, and show their dependencies on the dataset. Inputs to the function control how these tasks are performed (including the statistic used to measure individual performance, what plots to produce, and whether to produce nulls).
First, we get an output to screen, showing the mean linear classification accuracy across all operations, and a list of the operations with top performance (ordered by their test statistic, with their ID shown in square brackets, and keywords in parentheses):
Here we see that measures of entropy are dominating this top list, including measures of the time-series distribution.
An accompanying plot summarizes the performance of the full library on the dataset, compared to a set of nulls (generated by shuffling the class labels randomly):
Here we can see that the default feature library (blue: 7275 features remaining after filtering and normalization) performs much better than the randomized null features (red).
Next we get a set of plots showing the class probability distributions for the top features. For example:
This allows us to interpret the values of features in terms of the dataset. After some inspection of the listed operations, we find that various measures of EEG Sample Entropy are higher for healthy individuals with eyes open (blue) than for seizure signals (purple).
Finally, we can visualize how the top operations depend on each other, to try to deduce the main independent behaviors in this set of 25 top features for the given classification task:
In this plot, the magnitude of the linear correlation, R, between all pairs of top operations is shown using color (as the distance 1-|R|), and a visualization of a clustering into 5 groups is shown (with green boxes corresponding to clusters of operations based on the dendrogram shown to the right of the plot). Blue boxes indicate a representative operation of each cluster (closest to the cluster centre).
Here we see that most pairs of operations are quite dependent (at |R| > ~0.8
), and the main clusters are those measuring distribution entropy (EN_DistributionEntropy_
), Lempel-Ziv complexity (EN_MS_LZcomplexity_
), Sample Entropy (EN_SampEn_
), and a one-step-ahead surprise metric (FC_Surprise_
).
This function produces all of the basic visualizations required to achieve a basic understanding of the differences between the labeled classes of the dataset, and should be considered a first step in a more detailed analysis. For example, in this case we may investigate the top-performing features in more detail, to understand in detail what they are measuring, and how. This process is described in the Interpreting features section.
While the global structure of a time-series dataset can be investigated by plotting the data matrix (TS_PlotDataMatrix
) or a low-dimensional representation of it (TS_PlotLowDim
), sometimes it can be more interesting to retrieve and visualize relationships between a set of nearest neighbors to a particular time series of interest.
The hctsa framework provides a way to easily compute distances between pairs of time series, e.g., as a Euclidean distance between their normalized feature vectors. This allows very different time series (in terms of their origin, their method of recording and measurement, and their number of samples) to be compared straightforwardly according to their properties, measured by the algorithms in our hctsa library.
For this, we use the TS_SimSearch
function, specifying the id of the time series of interest (i.e., the ID
field of the TimeSeries
structure) with the first input, and the number of neighbors with the 'numNeighbors' input specifier (default: 20). By default, data are loaded from HCTSA_N.mat, but a custom source can be specified using the 'whatDataFile' input specifier (e.g., TS_SimSearch('whatDataFile','HCTSA_custom.mat')).
After specifying the target and how many neighbors to retrieve, TS_SimSearch
outputs the list of neighbors and their distances to screen, and the function also provides a range of plotting options to visualize the neighbors. The plots to produce are specified as a cell using the 'whatPlots' input.
To investigate the pairwise relationships between all neighbors retrieved, you specify the 'matrix'
option of the TS_SimSearch
function. An example output using a publicly-available EEG dataset, retrieving 14 neighbors from the time series with ID = 1
, as TS_SimSearch(1,'whatPlots',{'matrix'},'numNeighbors',14)
, is shown below:
The specified target time series (ID = 1
) is shown as a white star, and all 14 neighbors are shown, as labeled on the left of the plot with their respective IDs, and a 100-sample subset of their time traces.
Pairwise distances are computed between all pairs of time series (as a Euclidean distance between their feature vectors), and plotted using color, from low (red = more similar pairs of time series) to high (blue = more different pairs of time series).
Because this dataset contains 3 classes that were previously labeled (using TS_LabelGroups
as: TS_LabelGroups({'seizure','eyesOpen','eyesClosed'})
), the function shows these class assignments using color labels to the left of the plot (purple, green, and orange in this case).
In this case we see that the purple and green classes are relatively similar under this distance metric (eyes open and eyes closed), whereas the orange time series (seizure) are distinguished.
Another way to visualize the similarity (under our feature-based distance metric) of all pairs of neighbors is using a network visualization. This is specified as:
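For example (mirroring the call above, but requesting the 'network' plot type):
TS_SimSearch(1,'whatPlots',{'network'},'numNeighbors',14)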
which produces something like the following:
The strongest links are visualized as blue lines (by default, the top 40% of strongest links are plotted, cf. the legend showing 0.9, 0.8, 0.7, and 0.6 for the top 10%, 20%, 30%, and 40% of links, respectively).
The target is distinguished (as purple in this case), and the other classes of time series are shown using color, with names and time-series segments annotated. Again, you can see that the EEG time series during seizure (blue) are distinguished from eyes open (red) and eyes closed (green).
The scatter setting visualizes the relationship between the target and each of 12 time series with the most similar properties to the target. Each subplot is a scatter of the (normalized) outputs of each feature for the specified target (x-axis) and the match (y-axis). An example is shown below.
Multiple output plots can be produced simultaneously by specifying many types of plots as follows:
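For example (a sketch combining the three plot types described in this section):
TS_SimSearch(1,'whatPlots',{'matrix','network','scatter'})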
This produces a plot of each type.
Note that pairwise distances can be pre-computed and saved in the HCTSA*.mat
file using TS_PairwiseDist
for custom distance metrics (which is done by default in TS_Cluster
for datasets containing fewer than 1000 objects). TS_SimSearch
checks for this information in the specified input data (containing the ts_clust
or op_clust
structure), and uses it to retrieve neighbors. If distances have not previously been computed, distances from the target are computed as Euclidean distances (time series) or absolute correlation distances (operations) between feature vectors within TS_SimSearch.
The hctsa package provides a simple means of plotting time series: the TS_PlotTimeSeries
function.
For example, to plot a set of time series that have not been assigned groups, we can run the following:
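A minimal sketch is below; the optional inputs shown (the number of time series to plot, an empty placeholder, and the maximum number of samples to display) are assumptions, so check the function's help text:
TS_PlotTimeSeries('norm',10,[],400);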
For our assorted set of time series, this produces the following:
Showing the first 400 samples of 10 selected time series, equally-spaced through the TimeSeries IDs in HCTSA_N.mat
.
Many more custom plotting options are available by passing an options structure to TS_PlotTimeSeries
, including the 'plotFreeForm'
option which allows very many time series to be shown in a single plot (without the usual axis borders):
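For example (a sketch; the input order, with the plot-options structure as the final input, is an assumption):
TS_PlotTimeSeries('norm',40,[],300,struct('plotFreeForm',true));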
producing an overview picture of the first 300 samples of 40 time series (spaced through the rows of the data matrix).
When the time series have been assigned groups (using TS_LabelGroups
, here), this information is automatically incorporated into TS_PlotTimeSeries
, which then plots a given number of each time series group, and colors them accordingly:
In this case the two labeled groups of time series are recognized by the function: red (noisy), blue (no noise), and then 5 time series in each group are plotted, showing the first 500 samples of each time series.
Sometimes it's useful to be able to investigate the behavior of an individual operation (or feature) across a time-series dataset. What are the distribution of outputs, and what types of time series receive low values, and what receive high values?
These types of simple questions for specific features of interest can be investigated using the TS_FeatureSummary
function. The function takes in an operation ID as its input (and can also take inputs specifying a custom data source, or custom annotation parameters), and produces a distribution of outputs from that operation across the dataset, with the ability to then annotate time series onto that plot.
For example, the following:
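% Plot the distribution of outputs of the feature with ID 4310 across the dataset
% (additional inputs can specify a custom data source and annotation parameters):
TS_FeatureSummary(4310);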
Produces the following plot (where 6 points on the distribution have been clicked on to annotate them with short time-series segments):
You can visually see that time series with more autocorrelated patterns through time receive higher values from this operation. Because groups have not been assigned to this dataset, the time series are colored at random.
Running TS_FeatureSummary in violin plot mode provides another representation of the same result:
This plots the distribution of feature 4310 from HCTSA.mat
as a violin plot, with ten 500-point time series subsegments annotated at different points through the distribution, shown to the right of the plot:
When time series groups have been labeled (using TS_LabelGroups
as: TS_LabelGroups({'seizure','eyesOpen','eyesClosed'},'raw');
), TS_FeatureSummary
will plot the distribution for each class separately, as well as an overall distribution. Annotated points can then be added to each class-specific distributions. In the example shown below, we can see that the 'noisy' class (red) has low values for this feature (CO_tc3_2_denom
), whereas the 'periodic' class mostly has high values.
TS_SingleFeature
provides a simpler way of seeing the class distributions without annotations, as either kernel-smoothed distributions, as in TS_FeatureSummary
, or as violin plots.
See below for example implementations:
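A minimal sketch of the two variants is below (the input order, i.e., data source, feature ID, violin flag, is an assumption, and opID is a placeholder for the ID of the feature of interest):
opID = 4310; % hypothetical example feature ID (e.g., retrieved using TS_GetIDs)
TS_SingleFeature('raw',opID,false); % kernel-smoothed class distributions
TS_SingleFeature('raw',opID,true);  % violin-plot version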
The first shows the distributions with a classification bar underneath (indicating where a linear classifier would classify different parts of the space as noisy or periodic):
The second shows the distributions as violin plots, with means annotated and a classification bar to the left:
Note that the accuracy quoted in the title (an indication of the 10-fold cross-validated balanced accuracy of a linear classifier in this space) is computed from a single 10-fold split and is therefore stochastic; as shown above, repeating the calculation can yield slightly different results. For a more rigorous analysis than this simple indication, the procedure should be repeated many more times to give a converged estimate of the balanced classification accuracy.
When performing a time-series classification task, a basic first exploration of the data is to investigate how accurately a classifier can learn a mapping from time-series features to labels assigned to time series in your dataset.
The first step is to assign group labels to time series in your dataset using TS_LabelGroups
.
Depending on the classifier, you typically want to first normalize the features to put them all on a similar scale (using TS_Normalize
).
Depending on the question asked of the data, you should also consider whether certain types of features should be removed. For example, you may wish to exclude length-dependent features (if differences in time-series length vary between classes but are an uninteresting artefact of the measurement). This can be done using TS_Subset
(and functions like TS_CompareFeatureSets
described below allow you to test the sensitivity of these results).
TS_Classify
TS_Classify uses all of the features in a given hctsa data matrix to classify the assigned class labels.
You can set classification settings, from the number of folds to use in cross-validation to the type of classifier, as the cfnParams
structure. For the labeling defined in a given TimeSeries
table, you can set defaults for this using cfnParams = GiveMeDefaultClassificationParams('norm')
(takes TimeSeries
labeling from HCTSA_N.mat
). This automatically sets an appropriate number of folds (for cross-validation), and includes settings for taking into account class imbalance in classifier training and evaluation. It is best to alter the values inside this function to suit your needs, such that these settings can be applied consistently.
First let's run a simple classification of the groups labeled in HCTSA_N.mat
, using default classification settings:
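% Classify the labeled groups in HCTSA_N.mat using default settings:
TS_Classify('norm');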
In large feature spaces like in hctsa, simpler classifiers (like 'svm_linear'
) tend to generalize well, but you can play with the settings in cfnParams
to get a sense for how the performance varies.
As well as the classification results, the function also produces a confusion matrix, which is especially useful for evaluating where classification errors are occurring. Here's an example for a five-class problem:
In datasets containing fewer time series, it is more likely to obtain high classification accuracies by chance. You may therefore wonder how confident you can be with your classification accuracy. For example if you get a two-class classification accuracy of 60%, you might wonder what the probability is of obtaining such an accuracy by chance?
You can set numNulls
in TS_Classify
to iterate over the classification settings defined in cfnParams
except using shuffled class labels. This builds up a null distribution from which you can estimate a p-value to infer the significance of the classification accuracy obtained with the true data labeling provided.
You can also choose to run across multiple cores by switching on doParallel
:
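A sketch is below; the argument pattern (data source, classification parameters, number of nulls, then the 'doParallel' flag) is an assumption, so check the function's help:
cfnParams = GiveMeDefaultClassificationParams('norm');
numNulls = 100; % number of label-shuffled repeats used to build the null distribution
TS_Classify('norm',cfnParams,numNulls,'doParallel',true);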
This gives you a p-value estimate (both via a direct permutation test, and by assuming a Gaussian null distribution), and plots the null distribution with the true result annotated:
TS_CompareFeatureSets
You might wonder whether the classification results are driven by simple types of features that aren't related to time-series dynamics at all (such as the mean of the data, or time-series length).
These can be filtered out from the initial computation (e.g., when performing TS_Init
), or subsequently (e.g., using TS_Subset
), but you can test the effect such features are having on your dataset using TS_CompareFeatureSets
. Here's an example output:
Here we see that length-dependent features are contributing to accurate classification (above-50% accuracy for this two-class balanced problem). We can take some relief from the fact that excluding these features ('notLengthDependent'
) does not significantly alter the classification accuracy, so these features are not single-handedly driving the classification results. Nevertheless, assuming that differences in recording length are not an interesting class difference (but rather a measurement artefact that could bias the classification results), it would be advisable to remove these features for peace of mind.
Depending on the task, the full complexity of the time-series analysis literature may be needed to different degrees for strong classification results. You can quickly assess how accurately a smaller number of reduced components (e.g., leading Principal Components) can classify your dataset using TS_Classify_LowDim
:
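% A sketch: classify using cumulative numbers of leading PCs of the normalized data
% (additional inputs, e.g., the number of PCs to consider, are assumptions; see the help):
TS_Classify_LowDim('norm');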
The classification accuracy is shown for all features (green, dashed), and as a function of the number of leading PCs included in the classifier (black circles). Note that this is cumulative: '5 PCs' means classification in the five-dimensional space of the five leading PCs.
Here we find that we can get decent classification accuracy with just four PCs (and perhaps even more complex classifiers will give even better results in the lower-dimensional spaces).
You can quickly interpret the types of features loading strongly onto each PC from the information shown to screen; for example, on this dataset the output demonstrates that long-lag autocorrelations are the features most strongly correlated to PC5.
The software also provides a basic means of visualizing low-dimensional representations of the data, using PCA as TS_PlotLowDim
.
This can be done for a time-series dataset as follows:
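A minimal sketch (the 'pca' specifier for the dimensionality-reduction method is an assumption):
TS_PlotLowDim('norm','pca');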
This uses the normalized data (specifying 'norm'
), plotting time series in the reduced, two-dimensional principal components space of operations (the leading two principal components of the data matrix).
By default, the user will be prompted to select 10 points on the plot to annotate with their corresponding time series, which are annotated as the first 300 points of that time series (and their names by default).
After selecting 10 points, we have the following:
The proportion of variance explained by each principal component is provided in parentheses in the axis label.
Annotation properties can be altered in some detail by specifying properties as the annotateParams input variable, which yields, for example:
If groups of time series have been specified (using TS_LabelGroups
), then these are automatically recognized by TS_PlotLowDim
, which will then distinguish the labeled groups in the resulting 2-dimensional annotated time-series plot.
Consider the sample dataset containing 20 periodic signals with additive noise (given the keyword noisy in the database), and 20 purely periodic signals (given the keyword periodic in the database). After retrieving and normalizing the data, we store the two groups in the metadata for the normalized dataset HCTSA_N.mat:
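For example (following the labeling syntax shown earlier, applied to the normalized data):
TS_LabelGroups('norm',{'noisy','periodic'});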
Now when we plot the dataset in TS_PlotLowDim
, it will automatically distinguish the groups in the plot and attempt to classify the difference in the reduced principal components space.
Running the following:
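% Plot the labeled, normalized dataset in a two-dimensional principal components space
% (the 'pca' specifier is an assumption):
TS_PlotLowDim('norm','pca');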
The function then directs you to select 6 points to annotate time series to, producing the following:
Notice how the two labeled groups have been distinguished as red and blue points, and a linear classification boundary has been added (with in-sample misclassification rate annotated to the title and to each individual principal component). If marginal distributions are plotted (setting showDistribution = true
above), they are labeled according to the same colors.
When running large-scale hctsa computations, it can be useful to set up a mySQL database for time series, operations, and the computation results, and have many Matlab instances (running on different nodes of a compute cluster, for example) communicate directly with the database.
The hctsa software comes with this (optional) functionality, allowing a more powerful, distributed way to compute and store results of a large-scale computation.
This chapter outlines the steps involved in setting up, and running hctsa computations using a linked mySQL database.
The hctsa package requires some preliminary set up to work with a mySQL database, described here:
Installation of mySQL, either locally, or on an accessible server.
Setting up Matlab with a mySQL java connector (done by running the install_jconnector
script in the Database directory, and then restarting Matlab).
After the database is set up, and the packages required by hctsa are installed (by running the install
script), linking to a mySQL database can be done by running the install_database
script, which:
Sets up Matlab to be able to communicate with the mySQL server and creates a new database to store Matlab calculations in, described here.
This section contains additional details about each of these steps.
Note that the above steps are one-off installation steps; once the software is installed and compiled, a typical workflow will simply involve opening Matlab, running the startup
script (which adds all paths required for the hctsa software), and then working within Matlab from any desired directory.
Once installed using our default library of operations, the typical next step is to add a dataset of time series to the database using the SQL_Add
command. Custom master operations and operations can also be added, if required.
After installing the software and importing a time-series dataset to a mySQL database, the process by which data is retrieved from the database to local Matlab files (using SQL_Retrieve
), feature sets computed within Matlab (using TS_Compute
), and computed data stored back in the database (SQL_store
) is described in detail here.
After the computation is complete for a time-series dataset, a range of processing, analysis, and plotting functions are also provided with the software, as described here.
We assume that the user has access to and appropriate read/write privileges for a local or network mySQL server database. Instructions on how to install and set up a mySQL database on a variety of operating systems can be found here.
Before the structure of the database can be created, Matlab must be set up to be able to talk to the mySQL server, which requires installing a mySQL java connector. The steps required to achieve this are performed by the script install_jconnector
, which should be run from the main hctsa directory. If this script runs successfully and a mySQL server has been installed (either on your local machine or on an external server, see above), you are then ready to run the install
script.
The following outlines the actions performed by the install_jconnector
script (including instructions on how to perform the steps manually):
It is necessary to relocate the J connector from the Database directory of this code repository (which is also freely available here): the file mysql-connector-java-5.1.35-bin.jar
(for version 5.1.35). Instructions are here and are summarized below, and described in the Matlab documentation. This .jar file must be added to a static path where it can always be found by Matlab. A good candidate directory is the java/jarext/ subdirectory of the Matlab root directory (to determine the Matlab root directory, simply type matlabroot
in an open Matlab command window).
For Matlab to see this file, you need to add a reference to it in the javaclasspath.txt file (an alternative is to modify the classpath.txt file directly, but this may not be supported by newer versions of Matlab). This file can be found (or if it does not exist, should be created) in Matlab’s preferences directory (to determine this location, type prefdir
in a command window, or navigate to it within Matlab using cd(prefdir)
).
This javaclasspath.txt file must contain a text reference to the location of the java connector on the disk. In the case recommended above, where it has been added to the java/jarext directory, we would add the following to the javaclasspath.txt file:
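For example, assuming the connector was placed in the java/jarext subdirectory of the Matlab root directory:
$matlabroot/java/jarext/mysql-connector-java-5.1.35-bin.jar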
ensuring that the version number (5.1.35) matches your version of the J connector (if you are using a more recent version, for example).
Note that javaclasspath.txt can also be in Matlab’s startup directory (for example, to modify this just for an individual user).
After restarting Matlab, Matlab should then have the ability to communicate with mySQL servers (we will check whether this works below).
The main tasks involved in installing the Matlab/mySQL interface are achieved by the install.m
script, which runs the user through the steps below.
If you have not already done so, creating a mySQL database to use with Matlab can be done using the SQL_create_db
function. This requires that mySQL is installed on an accessible server, or on the local machine (i.e., using localhost
). If the database has already been set up, then you do not need to use the SQL_create_db
function, but you must then create a text file, sql_settings.conf, in the Database directory of the repository. This file contains four comma-delimited entries corresponding to the server name, database name, username, and password, as per the following:
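hostName,databaseName,username,password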
The settings listed here are those used to connect to the mySQL server. Remember that your password is sitting here in this document in unencrypted plain text, so do not use a secure or important password.
To check that Matlab can connect to external servers using the mySQL J-connector, using correct host name, username, and password settings, we introduce the Matlab routines SQL_OpenDatabase
and SQL_CloseDatabase
. An example usage is as follows:
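% Open a connection to the database specified in sql_settings.conf, then close it:
dbc = SQL_OpenDatabase;
% ... (query or write to the database here) ...
SQL_CloseDatabase(dbc);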
For this to work, the sql_settings.conf file must be set up properly. This file specifies (in unencrypted plain text!) the login details for your mySQL database in the form hostName,databaseName,username,password
.
An example sql_settings.conf file:
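For example (with hypothetical values):
localhost,myDatabaseName,myUsername,myPassword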
Once you have configured your sql_settings.conf file, and you can run dbc = SQL_OpenDatabase;
and SQL_CloseDatabase(dbc)
without errors, then you can smile to yourself and you should at this point be happy because Matlab can communicate successfully with your mySQL server! You should also be excited because you are now ready to set up the database structure!
Note that if your database is not set up on your local machine (i.e., localhost
), then Matlab can communicate with a mySQL server through an ssh tunnel, which requires some additional setup (described below).
Note also that the SQL_OpenDatabase
function uses Matlab's Database Toolbox if a license is available, but otherwise will use java commands; both are supported and should give identical operational behavior.
To start writing a new dataset to a new database, or start retrieving data from a different database, you will need to change the database that Matlab is configured to connect to. This can be done using the SQL_ChangeDatabase
script (which walks you through the steps and writes over the existing sql_settings.conf file), or by altering the sql_settings.conf file directly.
Note that one can swap between multiple databases easily by commenting out lines of the sql_settings.conf file (adding %
to the start of a line to comment it out).
In some cases, the mySQL server you wish to connect to requires an ssh tunnel. One solution is to use port forwarding from your local machine to the server. The port forward can be set up in the terminal using a command like:
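For example (with hypothetical username and server address):
ssh -L 1234:localhost:3306 myUsername@myServer.edu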
This command connects port 1234 on your local computer to port 3306 (the default mySQL port) on the server. Now, telling Matlab to connect to localhost
through port 1234 will connect it, through the established ssh tunnel, to the server. This can be achieved by specifying the server as localhost
and the port number as 1234 in the sql_settings.conf file (or during the install
process), which can be specified as the (optional) fifth entry, i.e.,:
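hostName,databaseName,username,password,1234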
As described above, computation involves three main steps:
Retrieve a set of time series and operations from the Results table of the database to a local Matlab file, HCTSA.mat (using SQL_Retrieve
).
Compute the operations on the retrieved time series in Matlab and store the results locally (using TS_Compute
).
Write the results back to the Results table of the database (using SQL_store
).
It is usually the most efficient practice to retrieve a small number of time series at each iteration of the SQL_Retrieve
–TS_Compute
–SQL_store
loop, and distribute this computation across multiple machines if possible. An example runscript is given in the code that accompanies this document, as sample_runscript_sql
, which retrieves a single time series at a time, computes it, and then writes the results back to the database in a loop. This can be viewed as a template for runscripts that one may wish to use when performing time-series calculations across the database.
This workflow is well suited to distributed computing for large datasets, whereby each node can iterate over a small set of time series, with all the results being written back to a central location (the mySQL database).
By designating different sections of the database to cycle through, this procedure can also be used to (manually) distribute the computation across different machines. Retrieving a large section of the database at once can be problematic because it requires large disk reads and writes, uses a lot of memory, and if problems occur in the reading or writing to/from files, one may have to abandon a large number of existing computations.
Although much of the time-series analysis literature has focused on developing methods for quantifying complex temporal structure in long time-series recordings, many time series that are analyzed in practice are relatively short. hctsa has been successfully applied to time-series classification problems in the data mining literature, including datasets of time series as short as 60 samples. However, time-series data are sometimes even shorter, including yearly economic data across perhaps six years, or biological data measured at say 10 points across a lifespan. Although many features in hctsa will not give a meaningful output when applied to a short time series, hctsa includes methods for filtering such features (cf. TS_Normalize
), after which the remaining features can be used for analysis.
The number of features with a meaningful output, from time series as short as 5 samples, up to those with as many as 500 samples, is shown below (where the maximum set of 7749 is shown as a dashed horizontal line):
In each case, over 3000 features can be computed, although one must be careful when representing a time series of just 5 samples with thousands of features, the vast majority of which will be highly intercorrelated.
To demonstrate the feasibility of running hctsa analysis on datasets of short time series, we applied hctsa to gene expression data in the cerebellar brain region, r1A, across seven developmental time points (from the Allen Institute's Developing Mouse Brain Atlas), for a subset of 50 genes. After filtering and normalizing (TS_Normalize
), then clustering (TS_Cluster
), we plotted the clustered time-series data matrix (TS_PlotDataMatrix('cl')
):
Inspecting the time series plots to the left of the colored matrix, we can see that genes with similar temporal expression profiles are clustered together based on their 2829-long feature vector representations. Thus, these feature-based representations provide a meaningful representation of these short time series. Further, while these 2829-long feature vectors are shorter than those that can be computed from longer time series, they still constitute a highly comprehensive representation that can be used as the starting point to obtain interpretable understanding in addressing specific domain questions.
The mySQL database is structured into four main components:
A list of all the filenames and other metadata of time series (the TimeSeries table),
A list of all the names and other metadata of pieces of time-series analysis operations (the Operations table),
A list of all the pieces of code that must be evaluated to give each operation a value, which is necessary, for example, when one piece of code produces multiple outputs (the MasterOperations table), and
A list of the results of applying operations to time series in the database (the Results table).
Additional tables are related to indexing and managing efficient keyword labeling, etc.
Time series and operations have their own tables containing the metadata associated with each time series and each operation. The results of applying each operation to each time series are stored in the Results table, which has a row for every combination of time series and operation; calculation times and the quality of outputs are also recorded there (for cases where the output of the operation was not a real number, or where some error occurred in the computation). Note that while the data for each time series is stored in the database, the executable time-series analysis code files are not, so all code files must be in Matlab’s path (all required paths can be added by running startup.m
).
Another handy (but dangerous) function to know about is SQL_reset
, which will delete all data in the mySQL database, create all the new tables, and then fill the database with all the time-series analysis operations. The TimeSeries, Operations, and MasterOperations tables can be generated by running SQL_CreateAllTables
, with master operations and operations added to the database using SQL_Add
commands (described here).
You now have the basic table structure set up in the database and have done the first bits of mySQL manipulation through the Matlab interface.
It is very useful to be able to inspect the database directly, using a graphical interface. A very good example for Mac is the excellent, free application Sequel Pro (a screenshot is shown below, showing the first 40 rows of the Operations table of our default operations library). Applications similar to Sequel Pro exist for Windows and Linux platforms. Applications that allow the database to be inspected are extremely useful; however, they should not be used to manipulate the database directly. Instead, Matlab scripts should be used to interact with the database to ensure that the proper relationships between the different tables are maintained (including the indexing of keyword labels).
One of the key goals of highly comparative time-series analysis, is to allow unbiased methodological comparison between the vast literature of time-series analysis tools developed for different applications. By representing features in terms of their outputs across a time-series dataset, the context of a given feature can be assessed by searching the database for features with similar behavior. The search can be done using a diverse range of real and model-generated data, or using a more specific dataset if this is more appropriate for a given application (e.g., looking just at EEG signals). Just like similar time series to a target can be retrieved and visualized, similar features to a given target feature can also be retrieved using TS_SimSearch
.
This chapter will give instructions on how you can compare a new time-series analysis feature to our library of over 7000 time-series features using hctsa. We assume that the reader has installed hctsa, which will be required to work with files and compute features.
The first step is defining the set of features to compare to (here we use the default hctsa library), and the set of time-series data that behavior is going to be assessed on. If you have just developed a new algorithm for time-series analysis and want to see how it performs across a range of interdisciplinary time-series data, then you may want to use a diverse set of time series sampled from across science. This can be easily achieved using our set of 1000 time series, a random selection of 25 such time series are plotted below (only the first 250 samples are plotted to aid visualization):
Pre-computed results for a recent version of hctsa can be downloaded from figshare as HCTSA_Empirical1000.mat
.
Alternatively, features can be recomputed using the input file for the time-series dataset provided in the same figshare data repository. This ensures implementation consistency on your local compute architecture; i.e., using TS_Init('INP_Empirical1000.mat'); to initialize, followed by compute commands using TS_Compute.
However, if you only ever analyze a particular type of data (e.g., rainfall), then perhaps you're more interested in which methods perform similarly on rainfall data. For this case, you can produce your own data context for custom data using properly structured input files as explained here.
We use the (hypothetical) example of a hot new feature, :boom: hot_feature1
:boom:, recently published in Science (and not yet in the hctsa library), and attempt to determine whether it is completely new, or whether there are existing features that exhibit similar performance to it. Think first about the data context (described above), which allows you to understand the behavior of thousands of features on a diverse dataset with which to compare the behavior of our new feature, hot_feature1
. This example uses the Empirical1000
data context downloaded as HCTSA_Empirical1000.mat
from figshare.
Getting the feature values for the new feature, hot_feature1
, could be done directly (using TS_CalculateFeatureVector
), but in order to maintain the HCTSA structure, we instead produce a new HCTSA.mat
file containing just hot_feature
and the same time series. For example, to compare to the HCTSA_Empirical1000.mat
file hosted on figshare, you should use the same version of hctsa to enable a valid comparison to the same set of features.
We first generate an input file, INP_hot_master.txt, containing the function call that takes in a time series, x:
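A sketch of a single line of this file is below (code call first, then a label; the label name is a hypothetical choice):
MyHotFeature(x)     MyHotFeature_x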
Any additional arguments to the function MyHotFeature.m
should be specified here. MyHotFeature.m
must also be in a form that outputs a structure (or a single real number, as explained here).
The interesting field in the structure output produced by MyHotFeature(x)
is hotFeature1
, which needs to be specified in another input text file, INP_hot_features.txt
, for example, as:
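A sketch of the corresponding line (assumed format: masterLabel.fieldName, feature name, comma-separated keywords, where the master label matches the one defined in INP_hot_master.txt):
MyHotFeature_x.hotFeature1     hot_feature1     hot,science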
where we have given this feature two keywords: hot
and science
.
So now we are able to initiate a new hctsa calculation, specifying custom code calls (master) and features to extract from the code call (features), as:
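A sketch (the order of the three input files, i.e., time series, master operations, features, is an assumption, and an output file name, here HCTSA_hot.mat, may also need to be specified; check TS_Init's help):
TS_Init('INP_Empirical1000.mat','INP_hot_master.txt','INP_hot_features.txt');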
This generates a new file, HCTSA_hot.mat
, containing information about the 1000 time series, and the new hot feature, hot_feature1
, which can then be computed as:
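A sketch (the argument pattern is an assumption; see TS_Compute's help for the exact inputs):
% Compute all missing values in the custom file HCTSA_hot.mat:
TS_Compute(false,[],[],'missing','HCTSA_hot.mat');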
So now we have both a context for the behavior of a library of >7000 features on 1000 diverse time series, and the behavior of our hot new feature. It is time to combine them and look for inter-relationships!
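A sketch of the combination step (the merged output is referred to below as HCTSA_merged.mat; an output file name may need to be specified explicitly, so check TS_Combine's help):
TS_Combine('HCTSA_Empirical1000.mat','HCTSA_hot.mat',false);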
Now that we have all of the data in the same HCTSA file, we can compare the behavior of the new feature to the existing library of features. This can be done manually by the researcher, or by using standard hctsa functions; the most relevant is TS_SimSearch
. We can find the ID assigned to our new hot_feature
in the merged HCTSA file as:
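For example, using the 'hot' keyword assigned above (passing a custom file name as the data source is an assumption):
hotID = TS_GetIDs('hot','HCTSA_merged.mat','ops')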
which tells us that the ID of my_hot_feature
in HCTSA_merged.mat
is 7703. Then we can use TS_SimSearch
to explore the relationship of our hot new feature to other features in the hctsa library (in terms of linear, Pearson, correlations):
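A sketch (the 'tsOrOps' specifier for searching over operations is an assumption; the other options follow the TS_SimSearch usage shown earlier):
TS_SimSearch(7703,'tsOrOps','ops','whatDataFile','HCTSA_merged.mat','whatPlots',{'matrix'})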
We find that our feature is reproducing the behavior of the first zero of the autocorrelation function (the first match: first_zero_ac
; see Interpreting Features for more info on how to interpret matching features):
The pairwise distance matrix (where distances are 1-|r|, for Pearson correlation coefficients, r) produced by TS_SimSearch provides another visualization of the context of this hot new feature (in this case there are so many highly correlated features that the matrix doesn't reveal much subtle structure):
In this case, the hot new feature wasn't so hot: it was highly (linearly) correlated to many existing features (including the simple zero-crossing of the autocorrelation function, first_zero_ac
), even across a highly diverse time-series dataset. However, if you have more luck and come up with a hot new feature that shows distinctive (and useful) performance, then it can be incorporated in the default set of features used by hctsa by adding the necessary master and feature definitions (i.e., the text in INP_hot_master.txt
and the text in INP_hot_features.txt
) to the library files (INP_mops.txt
and INP_ops.txt
in the Database directory of hctsa), as explained here. You might even celebrate your success by sharing your new feature with the community, by sending a Pull Request to the hctsa github repository!! :satisfied:
If using our set of 1000 time series, then this is easy because all the data have already been computed in HCTSA_Empirical1000.mat on figshare :relaxed:
For example, say we want to find neighbors to the fastdfa
algorithm from Max Little's website. This algorithm is already implemented in hctsa in the code SC_fastdfa.m
as the feature SC_fastdfa_exponent
. We can find the ID of this feature by finding the matching row in the Operations table (ID=750
):
and then find similar features using TS_SimSearch
, e.g., as:
Yielding:
We see that other features in the library indeed have strong relationships to SC_fastdfa_exponent
, including some unexpected relationships with the stationarity estimate, StatAvl25
.
Combining the network visualization with scatter plots produces the figures in our original paper on the empirical structure of time series and their methods (cf. Sec. 2.4 of the supplementary text), see below:
Specific pairwise relationships can be probed in more detail (visualizing the types of time series that drive any relationship) using TS_Plot2d.
A typical data analysis procedure might first involve carefully inspecting your data, then thinking about the patterns you might want to capture, then trying to devise an algorithm to capture them (testing the algorithm carefully, perhaps on simulated data, to understand its behavior, its dependence on noise level and time-series length, and its sensitivity to low-order trends or other types of non-stationarity). But in hctsa we do something somewhat dangerous: apply thousands of methods without even needing to first look at the data.
A conventional data analysis pipeline starts with the hard work of thinking about the dynamics and carefully selecting time-series analysis methods with deep knowledge of how they work and the assumptions that make their interpretation valid. This hard work is not avoided in hctsa, although if you're lucky, hctsa may save you a bunch of work—if it identifies high-performing features, you can focus your energy on interpreting just these. But this hard work comes at the end of an analysis, and is difficult and time-consuming to do well—often involving looking deeply into the algorithms and theory underlying these methods, combined with careful inspection of the data to ensure you are properly interpreting them.
Interpretation is the most substantial challenge: it cannot be automated and requires a lot of careful thought, so this page provides some basic advice. While it is often challenging to work out what a given feature is capturing in your dataset, it is also the most interesting part of an analysis, as it points you towards interpretable scientific theory and algorithms, and makes you think carefully about the structure in your time series.
Sometimes you will run TS_TopFeatures and find no features that significantly distinguish time series recorded from the labeled groups in your dataset. In this case, there may not be a signal, the signal may be too small given your sample size (and the potentially large number of candidate time-series features, if using the full hctsa set), or the right statistical estimators may not be present in hctsa.
But other times you will obtain a long list of statistically significant features (after multiple-hypothesis correction, e.g., from TS_TopFeatures) whose values significantly distinguish the groups you care about (say, individuals with some disease diagnosis compared to healthy controls). In cases like this, the next step is to obtain some understanding of what's happening.
Consider a long and kinda gnarly list that is quite hard to make sense of. This is a typical situation, because feature names in hctsa are long and typically hard to interpret directly. Like, perhaps:
If there are hundreds or thousands of statistically significant features, some of which have much higher accuracy than others, you may first want to inspect only a subset of features, e.g., the 50 or 100 top-performing features (with the most significant differences). If you can make sense of these first, you will usually have a pretty good basis for working out what's going on with the features with lower performance (if indeed you want to go beyond interpreting the best-performing features).
TS_TopFeatures is helpful for showing us how features with different, long, and complicated names might cluster into groups of similarly behaving features that are measuring similar properties of the data (cf. the previous section). This is essential for interpreting groups of features that capture the same underlying conceptual property of the dynamics: it helps you work out what's going on, and avoids over-interpreting any specific feature in isolation. For example, sometimes features with very complicated-sounding names (like nonlinear embedding dimension estimates) exhibit very similar behavior to very simple features (like lag-1 autocorrelation or distributional outlier measures), indicating that the nonlinear embedding dimension estimates are likely sensitive to simpler properties of the time series (e.g., outliers in the data). Interpreting the nonlinear embedding dimension in isolation would be a mistake in such a case.
Inspecting clustered feature–feature correlation plots is crucial for identifying groups of features with similar behavior on the dataset; these groups of similar (highly inter-correlated) features can then be inspected and interpreted together. This should be the first step in interpreting a set of significant features: which groups of features behave similarly, and what common concepts are they measuring?
An example is here, where Bailey et al. (2023) identified three important groups of features, measuring stationarity of self-correlation properties, stationarity of distributional properties, and global distributional properties, and then went on to interpret each in turn:
When such a group of high-performing features capturing a common time-series property has been identified, how can we start to interpret and understand what each individual feature is measuring?
Some features in the group may be easy to interpret directly. For example, in the list above, rms is straightforward to interpret: it is simply the root mean square of the distribution of time-series values. Others have clues in the name (e.g., features starting with WL_coeffs have to do with measuring wavelet coefficients, features starting with EN_mse correspond to the multiscale entropy ('mse'), and features starting with FC_LocalSimple_mean are related to time-series forecasting using local means of the time series).
Below we outline a procedure for how a user can go from a time-series feature selected by hctsa towards a deeper understanding of the type of algorithm that feature is derived from, how that algorithm performs across the dataset, and thus how it can provide interpretable information about your specific time-series dataset.
The simplest way of getting a quick idea of what sort of property a feature might be measuring is from its keywords, which often label individual features by the class of time-series analysis method from which they were derived. Keywords have not been applied exhaustively to features (this is an ongoing challenge), but when present they can be useful for giving a sense of what's going on. In the list above, we see keywords listed in parentheses, such as:
'forecasting' (methods related to predicting future values of a time series),
'entropy' (methods related to predictability and information content in a time series),
'wavelet' (features derived from wavelet transforms of the time series),
'locationDependent' (location dependent: features that change under mean shifts of a time series),
'spreadDependent' (spread dependent: features that change under rescaling about their mean), and
'lengthDependent' (length dependent: features that are highly sensitive to the length of a time series).
To find more specific, detailed information about a feature, beyond just a broad categorical label of the literature from which it was derived, the next step is to find and inspect the code file that generates the feature of interest. For example, say we were interested in the top-performing feature in the list above:
We know from the keyword that this feature has something to do with forecasting, and the name provides clues about the details (e.g., FC_ stands for forecasting, and the function FC_LocalSimple is the one that produces this feature, which, as the name suggests, performs simple local time-series prediction). We can use the feature ID (3016) provided in square brackets to get information from the Operations metadata table:
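For example, assuming table-format metadata in the local HCTSA file:

```matlab
% Sketch: metadata row for feature ID 3016 in the Operations table
load('HCTSA.mat','Operations');
Operations(Operations.ID==3016,:)
```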
Inspecting the text before the dot in the CodeString field (FC_LocalSimple_mean3) tells us the name that hctsa uses to describe the Matlab function and the unique set of inputs that produces this feature, whereas the text following the dot (taures) tells us the field of the output structure produced by the Matlab function that was run.
We can use the MasterID to get more information about the code that was run, from the MasterOperations metadata table:
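For example (a sketch, again assuming table-format metadata):

```matlab
% Sketch: find the master operation that generated feature ID 3016
load('HCTSA.mat','Operations','MasterOperations');
masterID = Operations.MasterID(Operations.ID==3016);
MasterOperations(MasterOperations.ID==masterID,:)
```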
This tells us that the code used to produce our feature was FC_LocalSimple(y,'mean',3). We can get information about this function at the command line by running a help command:
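That is:

```matlab
help FC_LocalSimple
```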
We can also inspect the code in the function FC_LocalSimple directly for more information. Like all code files for computing time-series features, FC_LocalSimple.m is located in the Operations directory of the hctsa repository.
Inspecting the code file, we see that running FC_LocalSimple(y,'mean',3) does forecasting using local estimates of the time-series mean (since the second input to FC_LocalSimple, forecastMeth, is set to 'mean'), using the previous three time-series values to make each prediction (since the third input, trainLength, is set to 3).
To understand which specific output quantity from this code came up as being highly informative in our TS_TopFeatures analysis, we need to look for the output labeled taures in the output structure produced by FC_LocalSimple. We find the following relevant lines of code in FC_LocalSimple.m:
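The exact lines are best read from the file itself; they look roughly like the following sketch (paraphrased, not quoted verbatim), consistent with the description that follows:

```matlab
% Illustrative sketch (not a verbatim quote) of the relevant outputs in FC_LocalSimple.m,
% where res holds the residuals of the rolling local-mean predictions:
out.ac1 = CO_AutoCorr(res,1);         % lag-1 autocorrelation of the residuals
out.ac2 = CO_AutoCorr(res,2);         % lag-2 autocorrelation of the residuals
out.taures = CO_FirstZero(res,'ac');  % first zero of the residuals' autocorrelation function
```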
This shows us that, after doing the local mean prediction, FC_LocalSimple then outputs some features of whether there is any residual autocorrelation structure in the residuals of the rolling predictions (the outputs labeled ac1, ac2, and our output of interest, taures).
The code shows that this taures output computes the CO_FirstZero of the residuals, which measures the first zero of their autocorrelation function (cf. help CO_FirstZero). When the local mean prediction still leaves a lot of autocorrelation structure in the residuals, our feature, FC_LocalSimple_mean3_taures, will thus take a high value.
Once we've seen the code that was used to produce a feature, and started to think about how such an algorithm might be measuring useful structure in our time series, we can then check our intuition by inspecting its performance on our dataset (as described in Investigating specific operations).
For example, we can run the following:
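A sketch of such a call, assuming the feature's ID in the local (normalized) data file is 3016; 'norm' refers to the normalized file HCTSA_N.mat, and optional inputs vary by version (see help TS_FeatureSummary):

```matlab
% Sketch: visualize the distribution of feature ID 3016 across the labeled groups
TS_FeatureSummary(3016,'norm');
```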
which produces a plot like that shown below. We have run this on a dataset containing noisy sine waves, labeled 'noisy' (red) and periodic signals without noise, labeled 'periodic' (blue):
In this plot, we see how this feature orders time series (with the distribution of values shown on the left, and split between the two groups: 'noisy', and 'periodic'). Our intuition from the code, that time series with longer correlation timescales will have highly autocorrelated residuals after a local mean prediction, appears to hold visually on this dataset.
In general, the mechanism provided by TS_FeatureSummary for visualizing how a given feature orders time series, including across labeled groups, can be very useful for feature interpretation.
hctsa contains a large number of features, many of which can be expected to be highly inter-correlated on a given time-series dataset. It is thus crucial to explore how a given feature relates to other features in the library, e.g., using the correlation matrix produced by TS_TopFeatures (cf. Finding informative features), or by searching for features with similar behavior on the dataset to a given feature of interest (cf. Finding nearest neighbors).
In a specific domain context, you may need to decide on the trade-off between more complicated features that may have slightly higher in-sample performance on a given task, and simpler, more interpretable features that may help guide domain understanding. The procedures outlined above are typically the first step to understanding a time-series analysis algorithm, and its relationship to alternatives that have been developed across science.
Sometimes you might wish to remove a problematic set of time series from the database (ones that were input incorrectly, for example), including their metadata and all records of computed data. Other times you might find a problem with the implementation of one of the operations; in this case, you would like to retain that operation in the database but flush all of the data currently computed for it (so that new values can be recomputed). Both types of task (clearing and removing time series or operations) can be achieved with the function SQL_ClearRemove.
This function takes an input specifying whether to clear (delete any calculated data for the given time series or operations) or remove (completely delete the given time series or operations from the database).
Example usages are given below:
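A sketch of possible calls (the exact argument order should be checked with help SQL_ClearRemove):

```matlab
% Sketch only (check `help SQL_ClearRemove` for the exact argument order):
SQL_ClearRemove('ts',1:5,true);       % completely remove time series with ts_ids 1:5
SQL_ClearRemove('ops',op_ids,false);  % clear computed data for the operations in op_ids, keeping them in the database
```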
After setting up a database with a library of time-series features, the next task is to add a dataset of time series to the database. It is up to the user whether to keep all time-series data in a single database, or to have a different database for each dataset.
Time series are added using the same function used to add master operations and operations to the database, SQL_Add, which imports time-series data (stored in time-series data files) and associated keyword metadata (assigned to each time series) to the database. The time-series data files to import, and the keywords to assign to each time series, are specified in either: (i) an appropriately formatted Matlab (.mat) file, or (ii) a structured input text file, as explained below.
Note that, when using the .mat file input method, time-series data are stored in the database to six significant figures, whereas when using the .txt file input method, time-series data values are stored exactly as written in the input text file of each time series.
Time series can be indexed by assigning keywords to them (which are stored in the TimeSeriesKeywords table and associated index table, TsKeywordsRelate of the database). Assigning keywords to time series makes it easier to retrieve a set of time series with a given set of keywords for analysis, and to group time series annotated with different keywords for classification tasks.
Every time series added to the mySQL database is assigned a unique integer identifier, ts_id, which can be used to retrieve specific time series from the database.
SQL_Add syntax
Adding a set of time series to the database requires an appropriately formatted input file, INP_ts.txt for example; the appropriate code is:
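(This mirrors the syntax of the test-file examples below.)

```matlab
SQL_Add('ts','INP_ts.txt');
```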
We provide an example input file in the Database directory as INP_test_ts.txt, which can be added to the database, following the syntax above, using SQL_Add('ts','INP_test_ts.txt'), as well as a sample .mat file input as INP_test_ts.mat, which can be added as SQL_Add('ts','INP_test_ts.mat').
Retrieving data from the Results table of the database is typically done for one of two purposes:
To calculate as-yet uncalculated entries to be stored back into the database, and
To analyze already-computed data stored in the database in Matlab.
The function SQL_Retrieve serves both of these purposes, using different inputs. Here we describe its use for populating uncalculated (NULL) entries of the Results table of the database in Matlab.
For calculating missing entries in the database, SQL_Retrieve can be run as follows:
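In general form:

```matlab
% Retrieve only the entries that are uncalculated (NULL) in the database,
% for the specified ts_ids and op_ids:
SQL_Retrieve(ts_ids, op_ids, 'null');
```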
The third input, 'null', retrieves ts_ids and op_ids from the sets provided that contain (as-yet) uncalculated (i.e., NULL) elements in the database; these can then be calculated and stored back in the database. An example usage is given below:
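A sketch consistent with the description that follows:

```matlab
% Retrieve NULL (uncalculated) entries for time series with ts_ids 1-5,
% across all operations in the database:
SQL_Retrieve(1:5, 'all', 'null');
```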
Running this code will retrieve null (uncalculated) data from the database for time series with ts_ids between 1 and 5 (inclusive) and all operations in the database, keeping only the rows and columns of the resulting time series x operations matrix that contain NULLs.
When calculations are complete and one wishes to analyze all of the data stored in the database (not just NULL entries requiring computation), the third input should be set to 'all' to retrieve all entries in the Results table of the database, as described below.
SQL_Retrieve writes to a local Matlab file, HCTSA.mat, that contains the data retrieved from the database.
After retrieving data from the mySQL database, missing entries (NULL in the database, and NaN in the local Matlab file) can be computed using TS_Compute, and stored back to the database using SQL_store. These functions are described below.
TS_Compute
Values retrieved using SQL_Retrieve (to the local HCTSA.mat file) that have not previously been calculated are evaluated using TS_Compute. These results can then be inspected directly (if needed), or simply written back to the database using SQL_store, as described below.
SQL_store
Once calculations have been performed using Matlab on local files, the results must be written back to the database. This task is performed by SQL_store, which reads the data in HCTSA.mat, checks that the metadata still matches the database, and then updates the Output, Quality, and CalculationTime columns of the Results table in the mySQL database. This can be done by simply running:
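For example (a sketch; see help SQL_store for any optional inputs):

```matlab
% Sketch: write the locally computed results in HCTSA.mat back to the database
SQL_store();
```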
Depending on database latencies, this can be a relatively slow process (up to 20–25 s per time series), updating each row of the Results table individually using mySQL UPDATE statements. However, this overhead buys the ability to distribute the computation across multiple compute nodes and to index and systematically retrieve the stored data; keeping results only in local Matlab files can be extremely inefficient, and can indeed be untenable for large datasets.
When linking Matlab to a mySQL database, metadata associated with time series, operations, and master operations, as well as the results of computations, are all stored in an indexed database. Adding master operations, operations, and time series to the database can be achieved using the SQL_Add function, as described below.
The following table summarizes the terminology used for each type of object in hctsa land:
SQL_Add
SQL_Add has two key inputs that specify:
Whether to import a set of time series (specify 'ts'), a set of operations (specify 'ops'), or a set of master operations (specify 'mops'), and
The name of the input file that contains information about the time series, master operations, or operations to be imported.
In this section, we describe how to use SQL_Add to add master operations, operations, and time series to the database.
Users wishing to run the default hctsa code library on their own time-series dataset need only add time series to the database, as the full operation library is added by default by the install.m script. Users wishing to add additional features, using custom time-series code or different types of inputs to existing code files, can either edit the default INP_ops.txt and INP_mops.txt files provided with the repository, or create new input files for their custom analysis methods (as explained below for master operations and operations).
REMINDER: Manually editing the database, including adding or deleting rows, is very dangerous, as it can create inconsistencies and errors in the database structure. Adding time series and operations to the database should only be done using SQL_Add, which sets up the Results table of the database and ensures that the indexing relationships in the database are properly maintained.
By default, the install script populates the database with the default library of highly comparative time-series analysis code. The formatted input file specifying these pieces of code and their input parameters is INP_mops.txt in the Database directory of the repository. This step can be reproduced using the following command:
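(The same form as the other SQL_Add calls in this chapter.)

```matlab
SQL_Add('mops','INP_mops.txt');
```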
Once added, each master operation is assigned a unique integer, mop_id, that can be used to identify it. For example, when adding individual operations, the mop_id is used to map each individual operation to a corresponding master operation.
New functions, and the input parameters with which to execute them, can be added to the database using SQL_Add in the same way as described above: either add lines corresponding to the new code to the current INP_mops.txt file, or generate a new input file and run SQL_Add on it. Once they are in the database, the software will run the new pieces of code. Note that SQL_Add checks for repeats that already exist in the database, so duplicate entries cannot be added with SQL_Add.
Corresponding operations (or features) will then need to be added separately, to link to the structured outputs of master operations.
Every operation added to the database will be given a unique integer identifier, op_id, which provides a common way of retrieving specific operations from the database.
Note that after (hopefully successfully) adding the operations to the database, the SQL_Add function indexes the operation keywords in an OperationKeywords table, which assigns a unique identifier to each keyword, and in a linking table that allows operations with a given keyword to be retrieved efficiently.
New code added to the database should be checked for the following: 1. The output is a real number or structure (and uses an output of NaN to assign all outputs to NaN). 2. The function is accessible in the Matlab path. 3. Output(s) from the function have matching operations (or features), which also need to be added to the database.
Operations can be added to the mySQL database using an appropriately formatted input file, such as INP_ops.txt, as follows:
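(Mirroring the SQL_Add syntax used above.)

```matlab
SQL_Add('ops','INP_ops.txt');
```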
Adding a set of time series to the database requires an appropriately formatted input file, in either of the formats described above (a .mat file or a structured text file).
|  | Master Operation | Operation | Time Series |
| --- | --- | --- | --- |
| Database identifier | mop_id | op_id | ts_id |
| Input to SQL_Add | 'mops' | 'ops' | 'ts' |
In this section we describe how keywords and other types of metadata stored in the database can be manipulated, and how the results of whole sets of operations and time series can be cleared or completely deleted from the database. These tasks are implemented as Matlab functions that ensure key database structures are maintained; performing such tasks by acting directly on the database, by contrast, can cause inconsistencies and should be avoided.
Running out of Java heap space throws the error java.lang.OutOfMemoryError, and usually happens when trying to retrieve a large set of time series/operations from the database. Matlab needs to keep the whole retrieval in memory, and has a hard limit on this. The Java heap size can be increased in the Matlab preferences, under General → Java Heap Memory.
The first step of any analysis is to retrieve a relevant portion of data from the mySQL database to local Matlab files for analysis. This is done using the SQL_Retrieve function described above, except that we use the 'all' input to retrieve all data, rather than the 'null' input used to retrieve just missing data (requiring calculation).
Example usage is as follows:
SQL_Retrieve(ts_ids, op_ids,'all');
for vectors ts_ids and op_ids, specifying the ts_ids and op_ids to be retrieved from the database.
Sets of ts_ids and op_ids to retrieve can be selected by inspecting the database, or by retrieving relevant sets of keywords using the SQL_GetIDs function. Running the code in this way, using the 'all' tag, ensures that the full range of ts_ids and op_ids specified is retrieved from the database and stored in the local file, HCTSA.mat, which can then form the basis of subsequent analysis.
The database structure provides much flexibility in storing and indexing the large datasets that can be analyzed using the hctsa approach; however, the process of retrieving and storing large amounts of data from a database can take a considerable amount of time, depending on database latencies.
Note that missing, or NULL, entries in the database are converted to NaN entries in the local Matlab matrices.
Extract EEG markers of cognitive decline.
Tutorial session on feature-based time-series analysis tools at Computational Neuroscience Academy 2023, Krakow, Poland.
Slides for the full talk about analyzing neural dynamics using hctsa at CNS 2020 (July 2020).
Talk about the importance of comparison for time-series analysis to QMNET (August 2020).
Seizure EEG: an overview tutorial on applying hctsa to a 5-class EEG dataset (including the use of the reduced feature set, catch22, within the hctsa framework), with a recording of the tutorial (the final ~hour is a hands-on demo using this dataset).