
Highly comparative time-series analysis with hctsa


Cycling through computations using runscripts

Computation involves three main steps:

  1. Retrieve a set of time series and operations from the Results table of the database to a local Matlab file, HCTSA.mat (using SQL_Retrieve).

  2. Compute the operations on the retrieved time series in Matlab and store the results locally (using TS_Compute).

  3. Write the results back to the Results table of the database (using SQL_store).

It is usually the most efficient practice to retrieve a small number of time series at each iteration of the SQL_Retrieve–TS_Compute–SQL_store loop, and distribute this computation across multiple machines if possible. An example runscript is given in the code that accompanies this document, as sample_runscript_sql, which retrieves a single time series at a time, computes it, and then writes the results back to the database in a loop. This can be viewed as a template for runscripts that one may wish to use when performing time-series calculations across the database.
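A minimal sketch of such a loop (the ID ranges here are hypothetical; cf. sample_runscript_sql in the repository for the full version):

opIDs = 1:7000; % operation IDs to compute (hypothetical range)
for tsID = 1:100 % loop over time-series IDs, one at a time (hypothetical range)
    SQL_Retrieve(tsID,opIDs,'null'); % retrieve uncomputed entries into HCTSA.mat
    TS_Compute;                      % evaluate the retrieved operations locally
    SQL_store;                       % write the results back to the Results table
end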

This workflow is well suited to distributed computing for large datasets, whereby each node can iterate over a small set of time series, with all the results being written back to a central location (the mySQL database).

By designating different sections of the database to cycle through, this procedure can also be used to (manually) distribute the computation across different machines. Retrieving a large section of the database at once can be problematic because it requires large disk reads and writes, uses a lot of memory, and if problems occur in the reading or writing to/from files, one may have to abandon a large number of existing computations.

Retrieving data from the database

The first step of any analysis is to retrieve a relevant portion of data from the mySQL database to local Matlab files for analysis. This is done using the SQL_Retrieve function described earlier, except we use the 'all' input to retrieve all data, rather than the 'null' input used to retrieve just missing data (requiring calculation).

Example usage is as follows:

SQL_Retrieve(ts_ids, op_ids,'all');

where ts_ids and op_ids are vectors specifying the IDs of the time series and operations to be retrieved from the database.

Sets of ts_ids and op_ids to retrieve can be selected by inspecting the database, or by retrieving relevant sets of keywords using the SQL_GetIDs function. Running the code in this way, using the ‘all’ tag, ensures that the full range of ts_ids and op_ids specified are retrieved from the database and stored in the local file, HCTSA.mat, which can then form the basis of subsequent analysis.

The database structure provides much flexibility in storing and indexing the large datasets that can be analyzed using the hctsa approach; however, the process of retrieving and storing large amounts of data from a database can take a considerable amount of time, depending on database latencies.

Note that missing, or NULL, entries in the database are converted to NaN entries in the local Matlab matrices.

Getting started

Some advice for getting started with an hctsa analysis

🖥️ Talks and live demos

💾 hctsa datasets, and example workflows

Sometimes it's easiest to quickly get going when you can follow along with some example code. Here are some repositories that allow you to do just this. There are a range of open datasets with pre-computed hctsa features, as well as some examples of hctsa workflows. You may also check the list of publications that have used hctsa; some of these contain associated open code.

If you have data to share and host, let me know and I'll add it to this list.

UMAP Projections

Below are some UMAP projections of UEA/UCR time-series classification datasets.

In each dataset, each time series is represented in the high-dimensional feature space of hctsa features. In each plot, we show the data projected into a two-dimensional space. Each point in the space is a time series, and we have coloured each point according to its assigned label. In many cases, the unsupervised UMAP projection of the hctsa feature space captures the labeled structure of the data.
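The projections span a range of dataset categories: 🫀 ECG data, 🧬 other biological data, ⚝ shapes, 🏞️ images, 👋 UWave gestures, 🖥️ simulated data, 📡 sensor data, and 💡 spectrographs.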

Only a selection of the most interesting-looking projections (according to our eyes) are shown here. Oh, did you want to have a play with all the data? It's available on figshare.

Thanks to Carl Lubba for doing all the computation shown here!


FAQ

Frequently asked questions pertaining to the use of hctsa.

Select a drop-down for more information:

How can I use machine-learning algorithms other than a linear SVM with TS_Classify?

You can specify the classification algorithm in the cfnParams structure; you can see the options in GiveMeCfn. Because there is already considerable complexity in the embedding in a high-dimensional feature space, we have tried to keep the classifiers simple (to avoid overfitting, and for interpretability). You can also use OutputToCSV to export the data and use it in other environments (like python); if you take this route, feel free to share your python workflow.
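A minimal sketch of swapping the classifier (the 'knn' option name and the whatClassifier field are assumptions; check GiveMeCfn for the valid options):

cfnParams = GiveMeDefaultClassificationParams('norm'); % defaults for the labeling in HCTSA_N.mat
cfnParams.whatClassifier = 'knn'; % e.g., swap the default linear SVM for k-nearest neighbors
TS_Classify('norm',cfnParams);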

Overview of an hctsa analysis

At its core, hctsa analysis involves computing a library of time-series analysis features (which we call operations) on a time-series dataset.

The basic sequence of a Matlab-based hctsa analysis is to:

  1. Initialize a HCTSA.mat file, which contains all of the information about the set of time series and operations in your analysis, as well as the results of applying all operations to all time series, using TS_Init.

  2. Compute these operations on your time-series data using TS_Compute. The results are structured in the local HCTSA.mat file, containing matrices (that store the results of the computations) and tables (that store information about the time-series data and operations), as described here.

Structure of the hctsa framework

Overview

The hctsa framework consists of three basic objects containing relevant metadata:

  1. Master Operations specify pieces of code (Matlab functions) and their inputs to be computed. Taking in a single time series, master operations can generate a large number of outputs as a Matlab structure, each of which can be identified with a single operation (or 'feature').

Performing calculations

Once an hctsa dataset has been initialized (specifying details of a time-series dataset and operations to include, using TS_Init), all results entries in the resulting HCTSA.mat are set to NaN, corresponding to results that are as yet uncomputed.

Calculations are performed using the function TS_Compute, which stores results back into the matrices in HCTSA.mat. This function can be run without inputs to compute all missing values in the default hctsa file, HCTSA.mat:
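TS_Compute;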

TS_Compute will begin evaluating operations on time series in HCTSA.mat.

Analyzing and visualizing results

Once a set of operations has been computed on a time-series dataset, the results are stored in a local HCTSA.mat file. The results can be used to perform a wide variety of highly comparative analyses, such as those outlined in our paper.

The type of analysis employed should then be motivated by the specific time-series analysis task at hand. Setting up the problem, guiding the methodology, and interpreting the results requires strong scientific input that should draw on domain knowledge, including the questions asked of the data, experience performing data analysis, and statistical considerations.

The first main component of an hctsa analysis involves filtering and normalizing the data using TS_Normalize (see Filtering and normalizing), which produces a file called HCTSA_N.mat. Information about the similarity of pairs of time series and operations can be computed using TS_Cluster (see Clustering rows and columns), which stores this information in HCTSA_N.mat. The suite of plotting and analysis tools that we provide with hctsa work with this normalized data, stored in HCTSA_N.mat, by default.

Clustering rows and columns

For the purposes of visualizing the data matrix, it is often desirable to have the rows and columns reordered to put similar rows adjacent to one another, and similarly to place similar columns adjacent to one another. This reordering can be done using hierarchical linkage clustering, by the function TS_Cluster:
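distanceMetricRow = 'euclidean'; % time-series feature distance
linkageMethodRow = 'average'; % linkage method
distanceMetricCol = 'corr_fast'; % a (poor) approximation of correlations with NaNs
linkageMethodCol = 'average'; % linkage method

TS_Cluster(distanceMetricRow, linkageMethodRow, distanceMetricCol, linkageMethodCol);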

This function reads in the data from HCTSA_N.mat, and stores the re-ordering of rows and columns back into HCTSA_N.mat in ts_clust and op_clust (and, if the size is manageable, also the pairwise distance information). Visualization functions (such as TS_PlotDataMatrix and TS_SimSearch) can then take advantage of this information, using the general input label 'cl'.

Filtering and normalizing

The first step in analyzing a dataset involves processing the data matrix, which can be done using TS_Normalize. This involves filtering out operations or time series that produced many errors or special-valued outputs, and then normalizing the output of all operations, which is typically done in-sample, according to an outlier-robust sigmoidal transform (although other normalizing transformations can be selected). TS_Normalize writes the new, filtered, normalized matrix to a local file called HCTSA_N.mat, which contains normalized and trimmed versions of the information in HCTSA.mat.

Example usage is as follows:
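TS_Normalize('mixedSigmoid',[0.8,1.0]);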

The first input controls the normalization method, in this case a scaled, outlier-robust sigmoidal transformation, specified with 'mixedSigmoid'. The second input controls the filtering of time series and operations based on minimum thresholds for good values in the corresponding rows (corresponding to time series; filtered first) and columns (corresponding to operations; filtered second) of the data matrix.

Input files

Formatted input files are used to set up a custom dataset of time-series data, pieces of Matlab code to run (master operations), and associated outputs from that code (operations). By default, you can simply specify a custom time-series dataset and the default operation library will be used. In this section we describe how to initiate an hctsa analysis, including how to format the input files used in the hctsa framework.

Working with a default feature set using TS_Init

To work with a default feature set, hctsa or catch22, you just need to specify information about the time series to analyze, given as an input file (e.g., INP_ts.mat or INP_ts.txt).

Exploring classification accuracy

When performing a time-series classification task, a basic first exploration of the data is to investigate how accurately a classifier can learn a mapping from time-series features to labels assigned to time series in your dataset.

The first step is to assign group labels to time series in your dataset using TS_LabelGroups.

Depending on the classifier, you typically want to first normalize the features to put them all on a similar scale (using TS_Normalize).

Depending on the question asked of the data, you should also consider whether certain types of features should be removed. For example, you may wish to exclude length-dependent features (if differences in time-series length vary between classes but are an uninteresting artefact of the measurement). This can be done using TS_Subset (and functions like TS_CompareFeatureSets described below allow you to test the sensitivity of these results).

Related Time-Series Resources

Resources related to HCTSA including talks/demos, workflow examples and related packages.

A collection of good resources for time-series analysis (including in other programming languages like python and R) is listed here. See also larger collections of time-series resources, including several focused on python.

Time-Series Data

Publications using hctsa

This page lists scientific research publications that have used hctsa.

Articles are labeled as follows:

  • 📗 = Journal publication.

  • 📙 = Preprint.

  • 💻 = Link to GitHub code repository available.

If you have used hctsa in your published work, or we have missed any publications, feel free to reach out by email.

Introduction

Welcome to the hctsa documentation!

This documentation contains resources and information about highly comparative time-series analysis using hctsa, and also outlines the steps required to set up and implement highly comparative time-series analysis using the hctsa package, as described in our papers:

  1. B.D. Fulcher and N.S. Jones. hctsa: A computational framework for automated time-series phenotyping using massive feature extraction. Cell Systems 5, 527 (2017).

  2. B.D. Fulcher, M.A. Little, N.S. Jones. Highly comparative time-series analysis: the empirical structure of time series and their methods. J. Roy. Soc. Interface 10, 20130048 (2013).

List of included code files

A full list of Matlab code files, organized loosely into broad categories, with brief descriptions

Introduction

The full default set of over 7700 features in hctsa is produced by running all of the code files below, many of which produce a large number of outputs (e.g., some functions fit a time-series model and then output statistics including the parameters of the best-fitting model, measures of the model's goodness of fit, the optimal model order, and autocorrelation statistics on the residuals).

In our default feature set, each function is run with multiple input parameters, with each parameter set yielding its own characteristic outputs.

General advice and common pitfalls

Thinking about running an hctsa analysis? Read this first.

The typical data analysis pipeline starts with inspecting and understanding the data, processing it in accordance with the questions of interest (and to be consistent with the assumptions of the analysis methods that will be applied), and then formulating and testing appropriate analysis methods. A typical hctsa pipeline inverts this process: many analysis methods are first applied, and then their results are interpreted.

Good practice involves thinking carefully about this full hctsa pipeline, including the type of questions and interpretations that are sought from it, and thus how the data are to be prepared, and how the results can be interpreted accurately.

Preparing for an analysis

hctsa automates some parts of a time-series analysis pipeline (such as guiding the selection of informative statistical properties in a time-series classification problem), but it does not replace expertise. Careful thought, informed by an understanding of the problem and of domain-specific issues, is essential in designing how an analysis is set up and how its results are interpreted.

Finding nearest neighbors

While the global structure of a time-series dataset can be investigated by plotting the data matrix (TS_PlotDataMatrix) or a low-dimensional representation of it (TS_PlotLowDim), sometimes it can be more interesting to retrieve and visualize relationships between a set of nearest neighbors to a particular time series of interest.

The hctsa framework provides a way to easily compute distances between pairs of time series, e.g., as a Euclidean distance between their normalized feature vectors. This allows very different time series (in terms of their origin, their method of recording and measurement, and their number of samples) to be compared straightforwardly according to their properties, measured by the algorithms in our hctsa library.

For this, we use the TS_SimSearch function, specifying the ID of the time series of interest (i.e., the ID field of the TimeSeries structure) as the first input, and the number of neighbors with the 'numNeighbors' input specifier (default: 20). By default, data are loaded from HCTSA_N.mat, but a custom source can be specified using the 'whatDataFile' input.
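A minimal sketch (the target ID here is hypothetical):

TS_SimSearch(30,'numNeighbors',10,'whatDataFile','HCTSA_N.mat'); % 10 nearest neighbors of the time series with ID 30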

Installing and setting up

The hctsa package can be used completely within Matlab, allowing users to analyse time-series datasets quickly and easily. Here we will focus on this Matlab-based use of the software, but note that, for larger datasets requiring distributed computing set-ups, or datasets that may grow with time, hctsa can also be linked to a mySQL database, as described in a dedicated chapter.

Installing the hctsa package

The simplest way to get the hctsa package up and running is to run the install script, which adds the required paths to dependent time-series packages (toolboxes), and compiles mex binaries to work on your system architecture. Once this one-off installation step is complete, you're ready to go! (NB: to include additional functions from the TISEAN nonlinear time-series analysis package, you'll also need to compile the TISEAN routines.)

After installation, future use of the package can begin by opening Matlab, navigating to the hctsa package, and then loading the paths required by the hctsa package by running the startup script.
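A minimal sketch of this one-off setup and subsequent usage:

% One-off installation, run from the root directory of the hctsa repository:
install
% At the start of each subsequent Matlab session:
startup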

Running hctsa computations

An hctsa analysis requires setting a library of time series, master operations, and operations, and generates a HCTSA.mat file (using TS_Init), as described here. Once this is set up, computations are run using TS_Compute.

These steps, as well as information on how to inspect the results of an hctsa analysis and work with HCTSA*.mat files, are provided in this chapter.

Compiling binaries

Some external code packages require compiled binary code to be used. Compilation of the mex code is handled by compile_mex as part of the install script, but the TISEAN package binaries need to be compiled separately in the command line.

Compiling mex code

Many of the operations (especially external code packages) rely on mex functions (pieces of code written in C or Fortran) that need to be compiled to run natively on a given system architecture. To ensure that as many operations as possible run successfully on your data, you should compile these mex functions for your system. This requires working compilers (e.g., gcc, g++) to be installed on your system, which can be configured using mex -setup (cf. doc mex for more information).

Working with hctsa files

When running hctsa analyses, often you want to take subsets of time series (to look in more detail at a subset of your data) or subsets of operations (to explore the behavior of different feature subsets), or combine multiple subsets of data together (e.g., as additional data arrive).

The hctsa package contains a range of functions for these types of tasks, which work directly with hctsa .mat files and are described below. Note that these types of tasks are easier to manage when hctsa data are stored in a mySQL database.

Retrieving time series (or operations) of interest by matching on assigned keywords using TS_GetIDs

Inspecting errors

When applying thousands of time-series analysis methods to diverse datasets, many operations can give results that are not real numbers. Some time series may be inappropriate for a given operation (such as fitting a positive-only distribution to data that are not positive, or measuring stationarity across 2000 datapoints in time series that are shorter than 2000 samples). Other times, an optimization routine may fail, or some unknown error may be thrown.

Some errors are not problems with the code, but represent issues with applying particular sets of code to particular time series, such as when a Matlab fitting function reaches the maximum number of iterations and returns an error. Other errors are genuine problems with the code that need to be corrected. Both cases are labeled as errors in our framework.

It can be good practice to visualize where special values and errors are occurring after a computation to see where things might be going wrong, using TS_InspectQuality. This can be run in four modes (the mode descriptions here summarize the function's header documentation):

  1. TS_InspectQuality('summary'); [default] summarizes the quality of outputs across all operations.

  2. TS_InspectQuality('master'); summarizes the quality of outputs across all master operations.

  3. TS_InspectQuality('full'); plots the full data matrix, labeling each type of special-valued or error output.

  4. TS_InspectQuality('reduced'); as 'full', but restricted to operations that produced special-valued outputs.

Plotting the time series

The hctsa package provides a simple means of plotting time series: the TS_PlotTimeSeries function.

Basic plotting

For example, to plot a set of time series that have not been assigned groups, we can run the following:
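A minimal sketch (assuming the first input selects the data source, here the un-normalized data in HCTSA.mat):

TS_PlotTimeSeries('raw'); % plot a sample of time series from HCTSA.mat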

For our assorted set of time series, this produces the following:

Assigning group labels to data

hctsa allows you to map time-series data into a unified feature space, a representation that is well suited to then applying machine learning methods to perform classification where the goal is to predict a class label assigned to each time series from features of its dynamics.

Class labels can be assigned to time series in an hctsa dataset using the function TS_LabelGroups, which populates the Group column of the TimeSeries table as a categorical labeling of the data (the class labels the classifier will attempt to predict). These group labels are used by a range of analysis functions, including TS_PlotLowDim, TS_TopFeatures, and TS_Classify.

The machinery of TS_LabelGroups

here
TS_Compute
inspect the results of an hctsa analysis
working with HCTSA*.mat files
10
, 20130048 (2013).
hctsa package
hctsa: A computational framework for automated time-series phenotyping using massive feature extraction
Highly comparative time-series analysis: the empirical structure of time series and their methods
mex
binaries to work on your system architecture. Once this one-off installation step is complete, you're ready to go! (NB: to include additional functions from the TISEAN nonlinear time-series analysis package, you'll also need to
).

After installation, future use of the package can begin by opening Matlab, navigating to the hctsa package, and then loading the paths required by the hctsa package by running the startup script.

a dedicated chapter
compile TISEAN routines
distanceMetricRow = 'euclidean'; % time-series feature distance
linkageMethodRow = 'average'; % linkage method
distanceMetricCol = 'corr_fast'; % a (poor) approximation of correlations with NaNs
linkageMethodCol = 'average'; % linkage method

TS_Cluster(distanceMetricRow, linkageMethodRow, distanceMetricCol, linkageMethodCol);
TS_PlotDataMatrix
TS_SimSearch

After the computation is complete, a range of processing, analysis, and plotting functions are provided to understand and interpret the results.

Example 1: Compute a feature vector for a time series

As a quick check of your operation library, you can compute the full default code library on a time-series data vector (a column vector of real numbers) as follows:
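x = randn(500,1); % A random time-series
featVector = TS_CalculateFeatureVector(x,false); % compute the default feature vector for x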

Example 2: Analyze a time-series dataset

Suppose you have a time-series dataset to analyze. You first generate a formatted INP_ts.mat input file containing your time-series data and associated name and keyword labels, as described here. You then initialize an hctsa calculation using the default library of features:
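TS_Init('INP_ts.mat');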

This generates a local file, HCTSA.mat containing the associated metadata for your time series, as well as information about the full time-series feature library (Operations) and the set of functions and code to call to evaluate them (MasterOperations), as described here.

Next you want to evaluate the code on all of the time series in your dataset. For this you can simply run:
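TS_Compute;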

This can be run as described here or, for larger datasets, using a script that regularly saves back to the local file (cf. sample_runscript_matlab).

Having run your calculations, you may then want to label your data using the keywords you provided in the case that you have labeled groups of time series:
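TS_LabelGroups;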

and then normalize and filter the data using the default sigmoidal transformation:
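TS_Normalize;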

A range of visualization scripts are then available to analyze the results, such as plotting the reordered data matrix:
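TS_Cluster; % compute a reordering of data and features
TS_PlotDataMatrix; % plot the data matrix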

To inspect a low-dimensional representation of the data:
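TS_PlotLowDim;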

Or to determine which features are best at classifying the labeled groups of time series in your dataset:
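TS_TopFeatures;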

Each of these functions can be run with a range of input settings.


In the example above, time series (rows of the data matrix) with more than 20% special values (specifying 0.8) are first filtered out, and then operations (columns of the data matrix) containing any special values (specifying 1.0) are removed. Columns with approximately constant values are also filtered out. After filtering the data matrix, the outlier-robust 'mixedSigmoid' transformation is applied to all remaining operations (columns). The filtered, normalized matrix is saved to the file HCTSA_N.mat.

Details of the normalization applied are saved to the HCTSA_N.mat file as normalizationInfo, a structure that contains the normalization function, the filtering options used, and the corresponding TS_Normalize code that can be used to re-run the normalization.

Setting the normalizing transformation

It makes sense to weight each operation equally for the purposes of dimensionality reduction, and thus to normalize all operations to the same range using a transformation like 'scaledRobustSigmoid', 'scaledSigmoid', or 'mixedSigmoid'. For the case of calculating mutual information distances between operations, however, one would rather not distort the distributions, and would perform no normalization (using 'raw') or a linear transformation (like 'zscore'), for example. The full list of implemented normalization transformations is listed in the function BF_NormalizeMatrix.

Note that the 'scaledRobustSigmoid' transformation does not tolerate distributions with an interquartile range of zero, which will be filtered out; 'mixedSigmoid' will treat these distributions in terms of their standard deviation (rather than interquartile range).

Setting the filtering parameters

Filtering parameters depend on the application. Some applications can allow the filtering thresholds to be relaxed. For example, setting [0.7,0.9] removes time series with less than 70% good values, and then removes operations with less than 90% good values. Some applications can tolerate some special-valued outputs from operations (like some clustering methods, where distances are simply calculated using those operations that did not produce special-valued outputs for each pair of objects), but others cannot (like Principal Components Analysis); the filtering parameters should be specified accordingly.

Analysis can be performed on the data contained in HCTSA_N.mat in the knowledge that different settings for filtering and normalizing the results can be applied at any time by simply rerunning TS_Normalize, which will overwrite the existing HCTSA_N.mat with the results of the new normalization and filtration settings.

Filtering features using TS_FilterData

It is often useful to check whether the feature-based classification results of a given analysis are driven by 'trivial' types of features that do not depend on the dynamical properties of the data, e.g., features sensitive to time-series length, location (e.g., mean), or spread (e.g., variance). Because these features are labeled with the keywords 'lengthdep', 'locdep', and 'spreaddep', you can easily filter them out to check the robustness of your analysis.

An example:
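% Get the IDs of length-dependent features from the `HCTSA.mat` file:
[ID_lengthDep,ID_notlengthDep] = TS_GetIDs('lengthdep','raw','ops');

% Generate a new file without these features, called 'HCTSA_locFilt':
TS_FilterData('raw',[],ID_notlengthDep,'HCTSA_locFilt.mat');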

You could use the same template to filter 'locdep' or 'spreaddep' features (or any other combination of keyword labels). You can then go ahead with analyzing the filtered HCTSA dataset as above, except using your new filename, HCTSA_locFilt.


Once mex is set up, the mex functions used in the time-series code repository can be compiled by navigating to the Toolboxes directory and then running compile_mex.

Compiling the TISEAN binaries

Some operations rely on the TISEAN nonlinear time-series analysis package, which Matlab accesses via the terminal using system commands, so the TISEAN binaries cannot be installed from within Matlab, but instead must be installed from the command line. If you are running Linux or Mac, we will assume that you are familiar with the command line, while those running Windows will require an alternate method to install TISEAN, as explained below.

Installing TISEAN on Linux or Mac

In the command line (not within Matlab), navigate to the Toolboxes/Tisean_3.0.1 directory of the repository, then run the following chain of commands:
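$ ./configure
$ make clean
$ make
$ make install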

This should install the TISEAN binaries in your ~/bin/ directory (you can instead install into a system-wide directory, /usr/bin, for example, by running ./configure --prefix=/usr). Additional information about the TISEAN installation process is provided on the TISEAN website.

If installation was successful then you should be able to access the newly-compiled binaries from the commandline, e.g., typing the command which poincare should return the path to the TISEAN function poincare. Otherwise, you should check that the install directory is in your system path, e.g., by adding the following:
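export PATH=$PATH:$HOME/bin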

to your ~/.bash_profile (and running source ~/.bash_profile to update).

The path where TISEAN is installed will also have to be in Matlab’s environment path, which is added by startup.m, assuming that the binaries are stored in ~/bin. The startup.m code also adds the DYLD_LIBRARY_PATH, which is also required for TISEAN to function properly.

If you choose to use a custom location for the TISEAN binaries, that is not in the default Matlab system path (getenv('PATH') in Matlab), then you will have to add this path manually. You can test that Matlab can see the TISEAN binaries by typing, for example, the following into Matlab:

!which poincare

If Matlab’s system paths are set up correctly, this command should return the path to your compiled TISEAN binary, poincare.

Installing TISEAN on Windows

If you are running Matlab from Windows, you will need a mechanism for Matlab to call system commands and find compiled TISEAN binaries. There are two options:

  1. Install Cygwin on your machine. Cygwin provides a Linux distribution-like environment on Windows. Use this environment to compile and install TISEAN (as per the instructions above for Linux or Mac), which will require it to have C and fortran compilers installed. Matlab will then also need to be launched from Cygwin, using the command: matlab &. This instance of Matlab should then be able to call system commands through cygwin, including the ability to access the TISEAN binaries.

  2. Sacrifice operations that rely on TISEAN. In total, TISEAN-based operations account for approximately 300 operations in the operation library. Although they provide important, well-tested implementations of nonlinear time-series analysis methods, it's not the end of the world if you decide it's too much trouble to install and are ok to miss out on these methods (see below on how to explicitly remove them from a computed library).

Ignoring TISEAN functions

If you decide not to use functions from the TISEAN package, you should initialize your dataset with the TISEAN functions removed. You could do this by removing them from your INP_ops.txt file when initializing your dataset, or you could remove them from your initialized hctsa dataset by filtering on the 'tisean' keyword.

For example, to filter a local Matlab hctsa file (e.g., HCTSA.mat), you can use the following: TS_LocalClearRemove('raw','ops',TS_GetIDs('tisean','raw','ops'),true);, which will remove all operations with the 'tisean' keyword from the hctsa dataset in HCTSA.mat.

[If you are using a mySQL database to store the results of your hctsa calculations, TISEAN functions can be removed from the database as follows: SQL_ClearRemove('ops',SQL_GetIDs('ops',0,'tisean',{}),true)].

Many time-series classification problems involve filtering subsets of time series based on keyword matching, where keywords are specified in the input file provided when initializing a dataset.

Most filtering functions (such as those listed in this section) require you to specify a range of IDs of TimeSeries or Operations. Recall that each TimeSeries and Operation is assigned a unique ID (stored as the ID field in the corresponding metadata table). To quickly get the IDs of time series that match a given keyword, the following function can be used:
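TimeSeriesIDs = TS_GetIDs(theKeyword,'HCTSA_N.mat');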

Or the IDs of operations tagged with the 'entropy' keyword:
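OperationIDs = TS_GetIDs('entropy','norm','ops');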

These IDs can then be used in the functions below (e.g., to clear data, or extract a subset of data).

Note that to get a quick impression of the unique time-series keywords present in a dataset, use the function TS_WhatKeywords, which gives a text summary of the unique keywords in an hctsa dataset.

Clearing or removing data from an hctsa dataset using TS_LocalClearRemove

Sometimes you may want to remove a time series from an hctsa dataset because the data was not properly processed, for example. Or one operation may have produced errors because of a missing toolbox reference, or you may have altered the code for an operation, and want to clear the stored results from previous calculations.

For example, often you want to remove from your operation library operations that are dependent on the location of the data (e.g., its mean: 'locdep'), that only operate on positive-only time series ('posOnly'), that require the TISEAN package ('tisean'), or that are stochastic (i.e., they give different results when repeated, 'stochastic').

The function TS_LocalClearRemove achieves these tasks when working directly with .mat files (NB: if using a mySQL database, SQL_ClearRemove should be used instead). It loads an hctsa .mat data file, clears or removes the specified time series or operations, and then writes the result back to the file.

Example 1: Clear all computed data from time series with IDs 1:5 from HCTSA.mat (specifying 'raw'):
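TS_LocalClearRemove('ts',1:5,0,'raw');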

Example 2: Remove all operations with the keyword 'tisean' (that depend on the TISEAN package) from HCTSA.mat:
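TS_LocalClearRemove('ops',TS_GetIDs('tisean','raw','ops'),1,'raw');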

Example 3: Remove all operations that require positive-only data (the 'posOnly' keyword) from HCTSA.mat:
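TS_LocalClearRemove('ops',TS_GetIDs('posOnly','raw','ops'),1,'raw');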

Example 4: Remove all operations that are location dependent (the 'locdep' keyword) from HCTSA.mat:
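TS_LocalClearRemove('ops',TS_GetIDs('locdep','raw','ops'),1,'raw');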

See the documentation in the function file for additional details about the inputs to TS_LocalClearRemove.

Extracting a subset from an hctsa dataset using TS_Subset

Sometimes it's useful to retrieve a subset of an hctsa dataset: when analyzing just a particular class of time series, for example, or investigating a balanced subset of data for time-series classification, or comparing the behavior of a reduced subset of features. This can be done with TS_Subset, which takes in an hctsa dataset and generates the desired subset, which can be saved to a new .mat file.

Example 1: Import data from 'HCTSA_N.mat', then save a new dataset containing only time series with IDs in the range 1--100, and all operations, to 'HCTSA_N_subset.mat' (see documentation for all inputs).
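TS_Subset('norm',1:100,[],1,'HCTSA_N_subset.mat')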

Note that the subset in this case will have been normalized using the full dataset of all time series, even though only this subset (with IDs up to 100) is now being analyzed. Depending on the normalization method used, different results would be obtained if the subsetting were performed prior to normalization.

Example 2: From HCTSA.mat ('raw'), save a subset of that dataset to 'HCTSA_healthy.mat' containing only time series tagged with the 'healthy' keyword:
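TS_Subset('raw',TS_GetIDs('healthy','raw'),[],1,'HCTSA_healthy.mat')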

Combining multiple hctsa datasets using TS_Combine

When analyzing a growing dataset, sometimes new data needs to be combined with computations on existing data. Alternatively, when computing a large dataset, sometimes you may wish to compute sections of it separately, and may later want to combine each section into a full dataset.

To combine hctsa data files, you can use the TS_Combine function.

Example: combine hctsa datasets stored in the files HCTSA_healthy.mat and HCTSA_disease.mat into a new combined file, HCTSA_combined.mat:
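TS_Combine('HCTSA_healthy.mat','HCTSA_disease.mat',false,false,'HCTSA_combined.mat')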

The third input, compare_tsids, controls the behavior of the function in combining time series. By setting this to 1, TS_Combine assumes that the TimeSeries IDs are comparable between the datasets (e.g., most common when using a mySQL database to store hctsa data), and thus filters out duplicates so that the resulting hctsa dataset contains a unique set of time series. By setting this to 0 (default), the output will contain a union of time series present in each of the two hctsa datasets. In the case that duplicate TimeSeries IDs exist in the combination file, a new index will be generated in the combined file (where IDs assigned to time series are re-assigned as unique integers using TS_ReIndex).

In combining operations, this function works differently when data have been stored in a unified mySQL database, in which case operation IDs can be compared meaningfully and combined as an intersection. However, when hctsa datasets have been generated using TS_Init, the function will check that the same set of operations have been used in both files.

TS_LabelGroups assumes that the class labels you are interested in are contained in the Keywords column of the TimeSeries table (set up from your original input file when TS_Init was run).

Labeling

Manual labeling

If you want to label according to a specific set of keywords, you can do this by specifying the keywords that define the unique groups. The example below assigns labels to two groups of time series in the HCTSA.mat (specifying the shorthand 'raw' for this default, un-normalized data), corresponding to those labeled as 'parkinsons' and those labeled as 'healthy':
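TS_LabelGroups('raw',{'parkinsons','healthy'});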

Note that any time series not labeled as either 'parkinsons' or 'healthy' are left unlabeled.

Automatic labeling

If every time series has a single keyword that uniquely labels it, TS_LabelGroups can automatically detect this by running TS_LabelGroups('raw',{});.

Complex labeling

More complex labeling (e.g., using custom combinations of keywords) is not implemented in TS_LabelGroups, but can be achieved by writing a script that does logical operations on calls to TS_LabelGroups and saves back to the TimeSeries.Group column. Here is an example of combining two labels (male/female and day/night) on fly movement data.

Working with labels

By default, the group labels are saved back to the data file (in this example, HCTSA.mat).

Group labels can be reassigned at any time by re-running the TS_LabelGroups function, and can be cleared by running, e.g., TS_LabelGroups('raw','clear').

Assigned labels are used by the analysis, plotting, and classification functions of hctsa.

Note: If you assign a labeling to a given dataset (e.g., HCTSA.mat), then this labeling will remain with the normalized data (e.g., after running TS_Normalize).


Can I perform a multivariate time series analysis using hctsa?

hctsa is designed to extract thousands of features from a single univariate time series; pyspi implements hundreds of pairwise dependence measures for multivariate time series.

You could incorporate both (i) univariate features of the components of the system; and (ii) features summarizing the pairwise dependence structure of the distributed system.

For example, some points to consider:

  • You could compute univariate features of each component of your system and then concatenate them to combine features of all individual time series:

    • To do this you may consider using a reduced set of features (like catch22) to avoid a massive dimensionality explosion.

    • You may alternatively use hctsa and then do feature selection or dimensionality reduction to tailor a reduced feature set to your application.

    • You may also consider adding some pairwise dependence measures to summarize the multivariate structure (cf. pyspi).

  • You could compute univariate features of extracted dominant components of your multivariate system, e.g., using PCA.

How can I export the extracted features?

Use OutputToCSV. This gives you .csv files corresponding to a given hctsa calculation that you can analyze however you please.

Are there any plans to move away from Matlab (which is proprietary and thus not accessible to users perhaps outside universities) to perhaps Python/R/Julia?

Most users are within universities with a Matlab license, so I haven't come across issues with this (and the main analysis code is licensed non-commercial anyway). But there are a couple of solutions if this comes up:

  • Use an alternative (e.g., native R or native python) feature-extraction tool, such as those listed here. These have far fewer features than hctsa but can get you some of the way there.

  • Use Matlab temporarily to derive a reduced set of useful features for your problem, and then implement them natively (or find non-Matlab implementations). This is the pipeline that produced the reduced catch22 feature set, and we have additional reduced sets in development. Our goal is to code reduced feature sets in C so they can be efficiently used in any programming language.

  • Also note that, although it doesn't get around the license issue, you can run hctsa from python via a Matlab-to-Python bridge.



  • Operations (or 'features') are single real numbers summarizing some measure of structure in a time series. In hctsa, each operation links to an output from a piece of evaluated code (a master operation).

  • Time series are univariate and uniformly sampled, time-ordered measurements.

These three different objects are summarized below:

|  | Master Operation | Operation | Time Series |
| --- | --- | --- | --- |
| Summary | Code and inputs to execute | Single feature | Univariate data |
| Example | CO_AutoCorr(x,1:5,'TimeDomain') | AC_1 | [1.2, 33.7, -0.1, ...] |

    In the example above, a master operation specifies the code to run, CO_AutoCorr(x,1:5,'TimeDomain'), which outputs the autocorrelation of the input time series (x) at lags 1, 2, ..., 5. Each operation (or 'feature') is a single number that draws on this set of outputs; for example, the autocorrelation at lag 1, which is named AC_1.

    In the hctsa framework, master operations, operations, and time series are stored as tables that contain all of their associated keywords and metadata (and actual time-series data in the case of time series).

    For a given hctsa analysis, the user must specify a set of code to evaluate (master operations), their associated individual outputs to measure (operations), and a set of time series to evaluate the features on (time series).

    We provide a default library of over 7700 operations (derived from approximately 1000 unique master operations). This can be customized, and additional pieces of code can also be added to the repository.

    The results of an hctsa analysis

    Having specified a set of master operations, operations, and time series, the results of computing these functions on the time-series data are stored in three matrices:

    • TS_DataMat is an n x m data matrix containing the results of applying m operations to the n time series.

    • TS_Quality is an n x m matrix containing quality labels for each operation output (coding different outputs such as errors or NaNs). Quality labels are described in the section below.

    • TS_CalcTime is an n x m matrix containing calculation times for each operation output. Note that the calculation time stored is for the corresponding master operation.

    HCTSA .mat files

    Each HCTSA*.mat file includes the tables described above: for TimeSeries (corresponding to the rows of the TS_ matrices), Operations (corresponding to columns of the TS_ matrices), and MasterOperations, corresponding to the code evaluated to compute the operations. In addition, the results are stored as above: TS_DataMat, TS_Quality, and TS_CalcTime.

    Quality labels

    Quality labels are used to indicate when operations take non-real values, or when fatal errors were encountered. Quality labels are stored in the Quality column of the Results table in the mySQL database, and in local Matlab files as the TS_Quality matrix.

    When the quality label is nonzero, this indicates that a special-valued output occurred. In this case, the output value of the operation is set to zero, as a convention, and the quality label codes the special output value:

| Quality label | Description |
| --- | --- |
| 0 | No problems with calculation. Output was a real number. |
| 1 | A fatal error was encountered. |
| 2 | Output of the code was NaN. |
| 3 | Output of the code was Inf. |
| 4 | Output of the code was -Inf. |

    When run with no inputs, TS_Compute computes values for which elements in TS_DataMat are NaN (i.e., computations that have not been run previously). Results are stored back in the matrices of HCTSA.mat: TS_DataMat (the output of each operation on each time series), TS_CalcTime (the calculation time for each operation on each time series), and TS_Quality (labels indicating errors or special-valued outputs).

    Custom settings for running TS_Compute

    (1) Computing features in parallel across available cores using Matlab's Parallel Processing Toolbox, by setting the first input to true.

    (2) Computing across a custom range of time-series IDs (ts_id) and operation IDs (op_id), by setting the second and third inputs.

    (3) Specifying what types of values should be computed.

    (4) Specifying a custom .mat file to operate on (HCTSA.mat is the default).

    (5) Suppressing commandline output. All computations are displayed to screen by default (which can be overwhelming but is useful for error checking); this can be suppressed by setting the final (6th) input to false. These usage patterns are sketched together below.
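A sketch of these call patterns (the exact input ordering for (3) and (4) is an assumption; see the header of TS_Compute for the definitive syntax):

% (1) Compute in parallel across available cores:
TS_Compute(true);

% (2) Compute a custom range of ts_ids and op_ids:
TS_Compute(false,1:10,1:500);

% (3) Specify what types of values to compute (e.g., only missing ones):
TS_Compute(false,[],[],'missing');

% (4) Operate on a custom .mat file:
TS_Compute(false,[],[],'missing','HCTSA_myFile.mat');

% (5) Suppress commandline output:
TS_Compute(false,[],[],'missing','HCTSA.mat',false);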

    Computation approaches for full datasets

    Computing features for full time-series datasets can be time consuming, especially for large datasets of long time series. An appropriate strategy therefore depends on the time-series length, the number of time series in the dataset, and the computational resources available. When multiple cores are available, it is always recommended to use the parallel setting (i.e., as TS_Compute(true)).

    Computation time scaling

    The first thing to think about is how the time taken to compute the 7749 features of hctsa v0.93 scales with the length of time series in your dataset (see the plot below). The figure compares results using a single core (e.g., TS_Compute(false)) to results using a 16-core machine with parallelization enabled (e.g., TS_Compute(true)).

    Times may vary on individual machines, but the above plot can be used to estimate the computation time per time series, and thus help decide on an appropriate computation strategy for a given dataset.

    Note that if computation times are too long for the computational resources at hand, one can always choose a reduced set of features, rather than the full set of >7000, to get a preliminary understanding of the dataset. One such reduced set of features is in INP_ops_reduced.txt. We plan to provide additional reduced feature sets, determined according to different criteria, in the future.

    On a single machine

    If only a single machine is available for computation, there are a couple of options:

    1. For small datasets, when it is feasible to run all computations in a single go, it is easiest to run computations within Matlab in a single call of TS_Compute.

    2. For larger datasets that may run for a long time on a single machine, one may wish to use something like the provided sample_runscript_matlab script, where TS_Compute commands are run in a loop over time series, computing small sections of the dataset at a time (and saving the results back to file, e.g., HCTSA.mat), eventually covering the full dataset iteratively.

    On a distributed compute cluster using Matlab

    Code for running distributed hctsa computations on a cluster (using pbs or slurm schedulers) is here. The strategy is as follows: with a distributed computing setup, a local Matlab file (HCTSA.mat) can be split into smaller pieces using TS_Subset, which outputs a new data file for a particular subset of your data, e.g., TS_Subset('raw',1:100) will generate a new file, HCTSA_subset.mat that contains just time series with IDs from 1 to 100. Computing features for time series in each such subset can then be run on a distributed computing setup. For example, with a different compute node computing a different subset (by queuing batch jobs that each work on a given subset of time series). After all subsets have been computed, the results are recombined into a single HCTSA.mat file using TS_Combine commands.

    Using mySQL to facilitate distributed computing

    Distributing feature computations on a large-scale distributed computing setup can be better suited to a linked mySQL database, especially for datasets that grow with time, as new time series can be easily added to the database. In this case, computation proceeds similarly to above, where shell scripts on a distributed cluster computing environment can be used to distribute jobs across cores, with all individual jobs writing to a centralized mySQL server. A set of Matlab code that generates an appropriately formatted mySQL database and interfaces with the database to facilitate hctsa feature computation is included with the software package, and is described in detail here.

    Plotting and analysis functions:
    • Visualizing structure in the data matrix using TS_PlotDataMatrix.

    • Visualizing the time-series traces using TS_PlotTimeSeries.

    • Visualizing low-dimensional structure in the data using TS_PlotLowDim.

    • Exploring similar matches to a target time series using TS_SimSearch.

    • Visualizing the behavior of a given operation across the time-series dataset using TS_FeatureSummary.

    Tools for classification tasks:

    For time-series classification tasks, groups of time series can be labeled using the TS_LabelGroups function described here; this group label information is stored in the local HCTSA*.mat file, and used by default in the various plotting and analysis functions provided. Additional analysis functions are provided for basic time-series classification tasks:

    • Explore the classification performance of the full library of features using TS_Classify

    • Determine the features that (individually) best distinguish between the labeled groups using TS_TopFeatures

    Details of how to format this input file are described below. A test input file, 'INP_test_ts.mat', is provided with the repository, so you can set up feature extraction for it using the hctsa feature set, or the catch22 feature set, as in the sketch below.
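% A sketch, assuming the feature set can be named as TS_Init's second input:
TS_Init('INP_test_ts.mat');           % default: the full hctsa feature set
TS_Init('INP_test_ts.mat','catch22'); % the reduced catch22 feature set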

    The full hctsa feature set involves significant computation time, so it is a recommended first step to test out your analysis pipeline using a smaller, faster feature set like catch22 (but note that catch22 is insensitive to the mean and standard deviation; to include them, use catch24). This feature set is provided as a submodule within hctsa, and it is very fast to compute using compiled C code (the features are compiled on the initial install of hctsa, by running mexAll from the Toolboxes/catch22 directory of hctsa).

    TS_Init produces a Matlab file, HCTSA.mat, containing all of the structures required to understand the set of time series, operations, and the results of their computation (explained here). Through this initialization process, each time series will be assigned a unique ID, as will each master operation, and each operation.

    Using custom feature sets

    You can specify a custom feature set of your own making by specifying

    1. The code to run (INP_mops.txt); and

    2. The features to extract from that code (INP_ops.txt).

    Details of how to format these input files are described below. The syntax for using a custom feature set is sketched below (assuming the custom input files are passed as additional inputs to TS_Init):
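% A sketch (the input ordering here is an assumption):
TS_Init('INP_ts.mat','INP_mops.txt','INP_ops.txt');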

    Time Series Input Files

    When formatting a time series input file, two formats are available:

    • .mat file input, which is suited to data that are already stored as variables in Matlab; or

    • .txt file input, which is better suited to when each time series is already stored as an individual text file.

    Input file format 1 (.mat file)

    When using a .mat file input, the .mat file should contain three variables:

    • timeSeriesData: either an N x 1 cell (for N time series), where each element contains a vector of time-series values, or an N x M matrix, where each row specifies the values of a time series (all of length M).

    • labels: a N x 1 cell of unique strings containing a named label for each time series.

    • keywords: a N x 1 cell of strings, where each element contains a comma-delimited set of keywords (one for each time series), containing no whitespace.

    An example involving two time series is below. In this example, we add two time series (showing only the first two values of each), which are labeled according to .dat files from a hypothetical EEG experiment, and assigned keywords (which are separated by commas and contain no whitespace). In this case, both are assigned the keywords 'subject1' and 'eeg' and, additionally, the first time series is assigned 'trial1', and the second 'trial2' (these labels can be used later to retrieve individual time series). Note that the labels do not need to specify filenames, but can be any useful label for a given time series.
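A sketch of building such an input file (the values and filenames here are hypothetical):

timeSeriesData = {[1.45, 2.87]; [8.53, -1.91]}; % N x 1 cell of vectors (first two values only)
labels = {'eeg_subject1_trial1.dat'; 'eeg_subject1_trial2.dat'}; % unique names for each time series
keywords = {'subject1,eeg,trial1'; 'subject1,eeg,trial2'}; % comma-delimited, no whitespace
save('INP_ts.mat','timeSeriesData','labels','keywords');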

    Input file format 2 (text file)

    When using a text file input, the input file now specifies filenames of time series data files, which Matlab will then attempt to load (using dlmread). Data files should thus be accessible in the Matlab path. Each time-series text file should have a single real number on each row, specifying the ordered values that make up the time series. Once imported, the time-series data is stored in the database; thus the original time-series data files are no longer required, and can be removed from the Matlab path.

    The input text file should be formatted as rows with each row specifying two whitespace separated entries: (i) the file name of a time-series data file and (ii) comma-delimited keywords.

    For example, consider the following input file, containing three lines (one for each time series to be added to the database):
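gaussianwhitenoise_001.dat   noise,gaussian
gaussianwhitenoise_002.dat   noise,gaussian
sinusoid_001.dat             periodic,sine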

    Using this input file, a new analysis will contain three time series: gaussianwhitenoise_001.dat and gaussianwhitenoise_002.dat will be assigned the keywords 'noise' and 'gaussian', and sinusoid_001.dat will be assigned the keywords 'periodic' and 'sine'. Note that keywords should be separated only by commas (with no whitespace).

    Adding master operations

    In our system, a master operation refers to a piece of Matlab code and a set of input parameters.

    Valid outputs from a master operation are: (1) a single real number; (2) a structure containing real numbers; or (3) NaN, to indicate that the input time series is not appropriate for this code.

    The (potentially many) outputs from a master operation can thus be mapped to individual operations (or features), which are single real numbers summarizing a time series that make up individual columns of the resulting data matrix.

    Two example lines from the input file, INP_mops.txt (in the Database directory of the repository), are as follows:
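CO_tc3(x_z,1)    CO_tc3_xz_1
ST_length(x)     length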

    Each line in the input file specifies two pieces of information, separated by whitespace: 1. A piece of code and its input parameters. 2. A unique label for that master operation (that can be referenced by individual operations).

    We use the convention that x refers to the input time series and x_z refers to a z-scored transformation of the input time series, $(x - \mu_x)/\sigma_x$. In the example above, the first line adds an entry for running the code CO_tc3 using a z-scored time series as input (x_z), with 1 as the second input, under the label CO_tc3_xz_1; the second line adds an entry for running the code ST_length using the untransformed time series, x, under the label length.

    When the time comes to perform computations on data using the methods in the database, Matlab needs to have path access to each of the master operations functions specified in the database. For the above example, Matlab will attempt to run both CO_tc3(x_z,1) and ST_length(x), and thus the functions CO_tc3.m and ST_length.m must be in the Matlab path. Recall that the script startup.m, which should be run at the start of each session using hctsa, handles the addition of paths required for the default code library.

    Modifying operations (features)

    The input file, e.g., INP_ops.txt (in the Database directory of the repository) should contain a row for every operation, and use labels that correspond to master operations. An example excerpt from such a file is below:
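A sketch of such an excerpt (the structure field names after the full stop are illustrative):

    CO_tc3_xz_1.raw      CO_tc3_1_raw      correlation,nonlinear
    CO_tc3_xz_1.abs      CO_tc3_1_abs      correlation,nonlinear
    CO_tc3_xz_1.num      CO_tc3_1_num      correlation,nonlinear
    CO_tc3_xz_1.absnum   CO_tc3_1_absnum   correlation,nonlinear
    CO_tc3_xz_1.denom    CO_tc3_1_denom    correlation,nonlinear
    ST_length            length            raw,lengthDependent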

The first column references the corresponding master label and, for master operations that output a structure, the particular field of the structure to reference (after the full stop); the second column denotes the label for the operation; and the final column is a set of comma-delimited keywords (that must not include whitespace). Whitespace is used to separate the three entries on each line of the input file. In this example, the master operation labeled CO_tc3_xz_1 outputs a structure, with fields that are referenced by the first five operations listed here, and the ST_length master operation outputs a single number (the length of the time series), which is referenced by the operation named 'length'. The two keywords 'correlation' and 'nonlinear' are added to the CO_tc3_1 operations, while the keywords 'raw' and 'lengthDependent' are added to the operation named 'length'. These keywords can be used to organize and filter the set of operations used for a given analysis task.

    Classifying labeled groups (TS_Classify)

    TS_Classify uses all of the features in a given hctsa data matrix to classify assigned class labels.

    Setting properties of the classification

You can set classification settings, from the number of cross-validation folds to the type of classifier, via the cfnParams structure. For the labeling defined in a given TimeSeries table, you can set defaults using cfnParams = GiveMeDefaultClassificationParams('norm') (which takes the TimeSeries labeling from HCTSA_N.mat). This automatically sets an appropriate number of cross-validation folds, and includes settings for taking class imbalance into account in classifier training and evaluation. It is best to alter the values inside this function to suit your needs, so that these settings can be applied consistently.
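For example (a sketch; the field names shown are assumptions about the cfnParams structure):

    cfnParams = GiveMeDefaultClassificationParams('norm'); % defaults from HCTSA_N.mat labeling
    cfnParams.whatClassifier = 'svm_linear'; % e.g., switch to a linear SVM
    cfnParams.numFolds = 10;                 % e.g., use 10-fold cross-validation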

    Computing classification accuracy

    First let's run a simple classification of the groups labeled in HCTSA_N.mat, using default classification settings:
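    % A minimal call (a sketch, assuming labeled groups and normalized data in HCTSA_N.mat):
    TS_Classify('norm')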

In large feature spaces like hctsa's, simpler classifiers (like 'svm_linear') tend to generalize well, but you can play with the settings in cfnParams to get a sense of how the performance varies.

    As well as the classification results, the function also produces a confusion matrix, which is especially useful for evaluating where classification errors are occurring. Here's an example for a five-class problem:

    Assessing significance as a permutation test relative to a shuffled ensemble

In datasets containing fewer time series, it is more likely that high classification accuracies will be obtained by chance. You may therefore wonder how much confidence to place in your classification accuracy: for example, if you get a two-class classification accuracy of 60%, what is the probability of obtaining such an accuracy by chance?

You can set numNulls in TS_Classify to repeat the classification, with the settings defined in cfnParams, except using shuffled class labels. This builds up a null distribution from which you can estimate a p-value, to infer the significance of the classification accuracy obtained with the true data labeling.

    You can also choose to run across multiple cores by switching on doParallel:
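    % A sketch (exact argument order assumed): compute a null distribution from 100
    % label-shuffled repeats, running across multiple cores:
    numNulls = 100;
    TS_Classify('norm',cfnParams,numNulls,'doParallel',true)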

    This gives you a p-value estimate (both via a direct permutation test, and by assuming a Gaussian null distribution), and plots the null distribution with the true result annotated:

    Comparing feature sets

    Specific feature sets (TS_CompareFeatureSets)

    You might wonder whether the classification results are driven by simple types of features that aren't related to time-series dynamics at all (such as the mean of the data, or time-series length).

These can be filtered out from the initial computation (e.g., when performing TS_Init), or subsequently (e.g., using TS_Subset), but you can test the effect such features are having on your dataset using TS_CompareFeatureSets, via a call like the sketch below (the specific feature-set names are assumptions):
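    % Compare classification performance across keyword-defined feature subsets (a sketch):
    TS_CompareFeatureSets('norm',{'all','notLengthDependent','notLocationDependent'})

Here's an example output: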

Here we see that length-dependent features are contributing to accurate classification (above-50% accuracy for this two-class balanced problem). We can take some comfort from the fact that excluding these features ('notLengthDependent') does not significantly alter the classification accuracy, so these features are not single-handedly driving the classification results. Nevertheless, assuming that differences in recording length are not a difference we want driving our classification results, it would be advisable to remove them for peace of mind.

    Comparing to lower-dimensional feature spaces

Depending on the task, the full complexity of the time-series analysis literature may or may not be necessary for strong classification performance. You can quickly assess how accurately a smaller number of reduced components (e.g., leading principal components) can classify your dataset using TS_Classify_LowDim:
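    % A sketch (the third argument, setting how many leading PCs to assess, is an assumption):
    TS_Classify_LowDim('norm',cfnParams,10)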

    The classification accuracy is shown for all features (green, dashed), and as a function of the number of leading PCs included in the classifier (black circles). Note that this is cumulative: '5 PCs' means classification in the five-dimensional space of the five leading PCs.

Here we find that we can get decent classification accuracy with just four PCs (and perhaps more complex classifiers would give even better results in these lower-dimensional spaces).

    You can quickly interpret the type of features loading strongly onto each PC from the information shown to screen. For example:

This demonstrates that, on this dataset, long-lag autocorrelations are the features most strongly correlated with PC5.

    Feature sets derived from hctsa


    Feature-based time-series packages developed by others

    There is a list of python-based packages for time-series analysis here that is worth taking a look at, in addition to those packages highlighted below:

    Other packages and resources



CompEngine, an accompanying web resource, provides a self-organizing database of time-series data. It allows users to upload, explore, and compare thousands of different types of time-series data.


    pyspi is a comprehensive python library for computing hundreds of statistics of pairwise interactions (SPIs) directly from multivariate time series (MTS) data.

If you have developed a relevant package, let us know and we'll add it to this growing list!

    Our Research 📕

    Methods Papers

See the following publications for details of how the highly comparative approach to time-series analysis has developed since our initial publication in 2013.


    Applications Papers

    We have used hctsa to:

    as well as:

• Distinguish wake from anesthetized flies.

  • 📗 Leung et al. PLoS Biology (2025).

• Connect structural brain connectivity to fMRI dynamics (mouse).

  • 📗 Chaos (2017).

• Connect structural brain connectivity to fMRI dynamics (human).

• Distinguish time-series patterns for data-mining applications.

• Classify babies with low blood pH from fetal heart rate time series.


    Others' Research 📕

    🧬 Biology


    🧫 Cellular Neuroscience


    🧠 Neuroimaging

    Here are some highlights:

    In addition to:

• Detect associations between brain region dynamics and traits like cognitive ability and substance use.

  • 📗 Tian et al., Nature Human Behavior (2025).

• Predict age from resting-state MEG from individual brain regions.

  • 📗 Stier et al.

• Estimate brain age in children from EEG.

• Extract gradients from fMRI hctsa time-series features to understand the relationship between schizophrenia and nicotine dependence.

• Classify endogenous (preictal), interictal, and seizure-like (ictal) activity from local field potentials (LFPs) from layers II/III of the primary somatosensory cortex of young mice (using feature-selection methods from an initial pool of hctsa features).

• Distinguish motor-evoked potentials corresponding to multiple sclerosis.


    🔬 Medicine—General

    Here are some highlights:

    in addition to:

• Predict hospital length of stay from vital signs.

  • Juez–Garcia et al. Continuous vital sign monitoring for predicting hospital length of stay: a feasibility study in chronic obstructive pulmonary disease and chronic heart failure patients (2025).

• Differentiate essential tremor (ET) and tremor-dominant Parkinson's disease (PD).

• Predict MS disability progression using time-series features of evoked-potential signals.

• Identify sepsis in very low birth weight (<1.5 kg) infants from heart rate signals, identifying heart rate characteristics of reduced variability and transient decelerations.

• Identify novel heart-rate variability metrics, including RobustSD, to create a parsimonious model for cerebral palsy prediction in preterm neonatal intensive care unit patients.

• Predict post-cardiac-arrest outcomes.

• Detect falls from wearable sensor data.

• Select features for fetal heart rate analysis using genetic algorithms.


    🦠 Medicine—Pathology


    🏗 Engineering

    Here are some highlights:

    in addition to:

    • Diagnose a spacecraft propulsion system utilizing data provided by the Prognostics and Health Management (PHM) society, as part of the Asia-Pacific PHM conference’s data challenge, 2023.

      • 📗 Proceedings of the Asia Pacific Conference of the PHM Society (2023).

    • Identify faults in a large-scale industrial process.



    ⛰️ Geoscience



• CO_AutoCorr has inputs that determine the method by which autocorrelation is computed, as well as the time lag at which autocorrelation is calculated (e.g., lag 1, lag 2, lag 3, etc.);

  • WL_dwtcoeff has inputs that set the mother wavelet to use and level of wavelet decomposition; and

  • FC_LocalSimple has inputs that determine the time-series forecasting method to use and the size of the training window.

The set of code files below, and the input parameters that define the default hctsa feature set, are listed in the INP_mops.txt file of the hctsa repository.

    Distribution

    Algorithms for summarizing properties of the distribution of values in a time series (independent of their ordered sequence through time).

    Code file            Description
    DN_Burstiness        Burstiness statistic of a time series.
    DN_CompareKSFit      Fits a distribution to data.
    DN_CustomSkewness    Custom skewness measures.
    DN_FitKernelSmooth   Statistics of a kernel-smoothed distribution of the data.
    DN_Fit_mle           Maximum likelihood distribution fit to data.

    Correlation

    Code summarizing basic properties of how values of a time series are correlated through time.

    Code file            Description
    CO_AddNoise          Changes in the automutual information with the addition of noise.
    CO_AutoCorr          Compute the autocorrelation of an input time series.
    CO_AutoCorrShape     How the autocorrelation function changes with the time lag.
    CO_Embed2            Statistics of the time series in a 2-dimensional embedding space.
    CO_Embed2_AngleTau   Angle autocorrelation in a 2-dimensional embedding space.

    Information Theory

Entropy and complexity measures for time series based on information theory.

    Code file            Description
    EN_ApEn              Approximate Entropy of a time series.
    EN_CID               Simple complexity measure of a time series.
    EN_MS_LZcomplexity   Lempel-Ziv complexity of an n-bit encoding of a time series.
    EN_MS_shannon        Approximate Shannon entropy of a time series.
    EN_PermEn            Permutation Entropy of a time series.

    Time-series model fitting and forecasting

    Fitting time-series models and doing simple forecasting on time series.

    Code file            Description
    MF_ARMA_orders       Compares a range of ARMA models fitted to a time series.
    MF_AR_arcov          Fits an AR model of a given order, p.
    MF_CompareAR         Compares model fits of various orders to a time series.
    MF_CompareTestSets   Robustness of test-set goodness of fit.
    MF_ExpSmoothing      Exponential smoothing time-series prediction model.

    Stationarity and step detection

    Quantifying how properties of a time series change over time.

    Code file              Description
    SY_DriftingMean        Mean and variance in local time-series subsegments.
    SY_DynWin              How stationarity estimates depend on the number of time-series subsegments.
    SY_KPSStest            The KPSS stationarity test.
    SY_LocalDistributions  Compares the distribution in consecutive time-series segments.
    SY_LocalGlobal         Compares local statistics to global statistics of a time series.

    Nonlinear time-series analysis and fractal scaling

    Nonlinear time-series analysis methods, including embedding dimensions and fluctuation analysis.

    Code file       Description
    NL_BoxCorrDim   Correlation dimension of a time series.
    NL_DVV          Delay Vector Variance method for real and complex signals.
    NL_MS_fnn       False nearest neighbors of a time series.
    NL_MS_nlpe      Normalized drop-one-out constant interpolation nonlinear prediction error.
    NL_TISEAN_c1    Information dimension.

    Fourier and wavelet transforms, periodicity measures

    Properties of the time-series power spectrum, wavelet spectrum, and other periodicity measures.

    Code file            Description
    SP_Summaries         Statistics of the power spectrum of a time series.
    DT_IsSeasonal        A simple test of seasonality.
    PD_PeriodicityWang   Periodicity extraction measure of Wang et al.
    WL_DetailCoeffs      Detail coefficients of a wavelet decomposition.
    WL_coeffs            Wavelet decomposition of the time series.

    Symbolic transformations

    Properties of a discrete symbolization of a time series.

    Code file             Description
    SB_BinaryStats        Statistics on a binary symbolization of the time series.
    SB_BinaryStretch      Characterizes stretches of 0/1 in time-series binarization.
    SB_MotifThree         Motifs in a coarse-graining of a time series to a 3-letter alphabet.
    SB_MotifTwo           Local motifs in a binary symbolization of the time series.
    SB_TransitionMatrix   Transition probabilities between different time-series states.

    Statistics from biomedical signal processing

    Simple time-series properties derived mostly from the heart rate variability (HRV) literature.

    Code file        Description
    MD_hrv_classic   Classic heart rate variability (HRV) statistics.
    MD_pNN           pNNx measures of heart rate variability.
    MD_polvar        The POLVARd measure of a time series.
    MD_rawHRVmeas    Heart rate variability (HRV) measures of a time series.

    Basic statistics, trend

    Basic statistics of a time series, including measures of trend.

    Code file          Description
    SY_Trend           Quantifies various measures of trend in a time series.
    ST_FitPolynomial   Goodness of a polynomial fit to a time series.
    ST_Length          Length of an input data vector.
    ST_LocalExtrema    How local maximums and minimums vary across the time series.
    ST_MomentCorr      Correlations between simple statistics in local windows of a time series.

    Others

    Other properties, like extreme values, visibility graphs, physics-based simulations, and dependence on pre-processings applied to a time series.

    Code file            Description
    EX_MovingThreshold   Moving threshold model for extreme events in a time series.
    HT_HypothesisTest    Statistical hypothesis test applied to a time series.
    NW_VisibilityGraph   Visibility graph analysis of a time series.
    PH_ForcePotential    Couples the values of the time series to a dynamical system.
    PH_Walker            Simulates a hypothetical walker moving through the time domain.

This section provides advice to keep in mind when deciding how hctsa can be used for a given application, and in properly interpreting its results.

    Some general advice before embarking on an hctsa analysis:

    1. What are the data?

      • For long, streaming data: how long is each time series? Does this capture the timescale of the problem you care about?

      • For continuously evolving processes: At what rate are the data sampled? Does this capture the patterns of interest? E.g., if the sampling rate is too high, this can lead to trivially autocorrelated time series such that time-series methods in hctsa will find it challenging to resolve the patterns of interest.

• Are the data appropriately processed (detrended, artifacts removed, …)? This requires careful thought, often with domain expertise, about what problem is being solved. E.g., many time-series analysis algorithms will be dominated by underlying trends if these are not removed. In general, properties of the data that are likely to affect many time-series analysis algorithms (like underlying trends or artefactual outliers) should be removed if they are not informative of the differences you care about distinguishing. See the section below for more information.

      • What do they look like? Addressing the questions above requires you to look at each of the time series to get a sense of the dynamics you're interested in characterizing using time-series features.

2. What problem are you trying to solve? In designing an analysis using hctsa, you will need to think about the sample size you have, what effect sizes are expected, what statistical power you will have, etc. E.g., if you only have 5 examples of each of two classes, you will not have the statistical power to pick out individual features from a library of >7000, and simple (unregularized) classifiers will be likely to overfit.

3. Trial run with a reduced set. Once you've devised a pipeline, it's best to first run through it in hctsa using a reduced feature set (e.g., the catch22 set), which runs quickly on a laptop and gives you a sense of the process. Once you're satisfied with the analysis pipeline, you can always scale up to the full hctsa library of >7000 features.

    Data processing

hctsa contains thousands of time-series analysis methods, many of which make strong assumptions about the data, such as that it is (weak-sense) stationary. Although hctsa has been applied meaningfully to short, non-stationary patterns (as commonly studied in time-series data-mining applications), it is better suited to longer streams of data, from which more subtle temporal statistics can be sampled.

hctsa is not a substitute for thoughtful, domain-motivated data preparation and processing. For example, hctsa cannot know what is 'signal' and what is 'noise' for the question you're asking of your data. The analyst should prepare their data in a way that makes sense for the questions of relevance to them, including the possibility of de-trending/filtering the data, applying noise-removal methods, etc.

    The following should be considered:

    • Standardizing. If your time-series data are measured on scales that differ across different recordings, then the data should be appropriately standardized.

    • Detrending. If your data contain low-order trends that are not meaningful (e.g., sensor drift) or not a signal of relevance (your question is based more around the structure of deviations from a low-frequency trend), then this should be removed.

    • Denoising. If your data contain high-frequency measurement noise, the analyst should consider removing it. For example, using a filter (e.g., moving average), wavelet denoising, or using a phase-space reconstruction (e.g., cf. Schreiber's method).

    • Downsampling. Features assume that the data are sampled at a rate that properly resolves the temporal patterns of interest. If your data are over-sampled, then many features will be sensitive to this dominating autocorrelation structure, and will be less sensitive to interesting patterns in the data. In this case, you can consider downsampling your data, for which there are many heuristics (e.g., cf. ).
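For example, a minimal Matlab sketch of these preprocessing steps for a time-series vector x (the parameter choices are illustrative only; downsample requires the Signal Processing Toolbox):

    x = detrend(x);       % remove a linear trend
    x = movmean(x,5);     % 5-point moving-average smoothing (simple denoising)
    x = downsample(x,2);  % keep every 2nd sample (if over-sampled)
    x = zscore(x);        % standardize to zero mean, unit variance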

    Interpreting Features

    Checking for simpler explanations

There are often many routes to solving a given data-analysis challenge. For example, in a time-series classification problem, the two classes may be perfectly distinguished based on their lag-1 autocorrelation, and also on their Lyapunov exponent spectrum, and also on hundreds of other properties. In general, one should avoid interpreting the most complex features (like Lyapunov exponents) as being uniquely useful for a problem when they reproduce the behavior of much simpler features, which provide a more parsimonious and interpretable account of the relevant patterns in the dataset. For other problems, time-series analysis methods (that are sensitive to the time-ordering of the data samples) may not provide any benefit at all over properties of the data distribution (e.g., the variance), or over more trivial differences in time-series length across classes.

In general, complex explanations of patterns in a dataset are only justified when simpler explanations have been ruled out. E.g., do not write a paper about a complex feature (e.g., a power-law fit to a visibility-graph degree distribution) when your data can be just as well (or better) distinguished by their variance.

    hctsa has many functions to check this type of behavior: from inspecting the pairwise correlation structure of high-performing features (in TS_TopFeatures) to basic checks on different keyword-labeled classes of features (in TS_CompareFeatureSets).

A custom data source can be specified using the 'whatDataFile' input specifier (e.g., TS_SimSearch('whatDataFile','HCTSA_custom.mat')).

    After specifying the target and how many neighbors to retrieve, TS_SimSearch outputs the list of neighbors and their distances to screen, and the function also provides a range of plotting options to visualize the neighbors. The plots to produce are specified as a cell using the 'whatPlots' input.

    Pairwise similarity matrix of neighbors

    To investigate the pairwise relationships between all neighbors retrieved, you specify the 'matrix' option of the TS_SimSearch function. An example output using a publicly-available EEG dataset, retrieving 14 neighbors from the time series with ID = 1, as TS_SimSearch(1,'whatPlots',{'matrix'},'numNeighbors',14), is shown below:

    The specified target time series (ID = 1) is shown as a white star, and all 14 neighbors are shown, as labeled on the left of the plot with their respective IDs, and a 100-sample subset of their time traces.

    Pairwise distances are computed between all pairs of time series (as a Euclidean distance between their feature vectors), and plotted using color, from low (red = more similar pairs of time series) to high (blue = more different pairs of time series).

    Because this dataset contains 3 classes that were previously labeled (using TS_LabelGroups as: TS_LabelGroups({'seizure','eyesOpen','eyesClosed'})), the function shows these class assignments using color labels to the left of the plot (purple, green, and orange in this case).

    In this case we see that the purple and green classes are relatively similar under this distance metric (eyes open and eyes closed), whereas the orange time series (seizure) are distinguished.

    Network of neighbors

    Another way to visualize the similarity (under our feature-based distance metric) of all pairs of neighbors is using a network visualization. This is specified as:
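    % A sketch, mirroring the 'matrix' example above:
    TS_SimSearch(1,'whatPlots',{'network'},'numNeighbors',14)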

    which produces something like the following:

    The strongest links are visualized as blue lines (by default, the top 40% of strongest links are plotted, cf. the legend showing 0.9, 0.8, 0.7, and 0.6 for the top 10%, 20%, 30%, and 40% of links, respectively).

    The target is distinguished (as purple in this case), and the other classes of time series are shown using color, with names and time-series segments annotated. Again, you can see that the EEG time series during seizure (blue) are distinguished from eyes open (red) and eyes closed (green).

Scatter plots of neighbors

The 'scatter' setting visualizes the relationship between the target and each of the 12 time series with properties most similar to the target. Each subplot is a scatter of the (normalized) outputs of each feature for the specified target (x-axis) and the match (y-axis). An example is shown below.

    Other details

    Multiple output plots can be produced simultaneously by specifying many types of plots as follows:
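    % For example, produce all three plot types at once:
    TS_SimSearch(1,'whatPlots',{'matrix','network','scatter'})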

    This produces a plot of each type.

Note that pairwise distances can be pre-computed and saved in the HCTSA*.mat file using TS_PairwiseDist for custom distance metrics (this is done by default in TS_Cluster for datasets containing fewer than 1000 objects). TS_SimSearch checks for this information in the specified input data (containing the ts_clust or op_clust structure), and uses it to retrieve neighbors. If distances have not previously been computed, distances from the target are computed within TS_SimSearch, as Euclidean distances (time series) or absolute correlation distances (operations) between feature vectors.

The TS_InspectQuality function summarizes where special-valued outputs occurred in a computation. It can be run in four modes:

  • TS_InspectQuality('summary'); [default] Summarizes the proportion of special-valued outputs in each operation as a bar plot, ordered by the proportion of special-valued outputs.
  • TS_InspectQuality('master'); Plots which types of special-valued outputs were encountered for each master operation.

  • TS_InspectQuality('full'); Plots the full data matrix (all time series as rows and all operations as columns), and shows where each possible special-valued output can occur (including 'error', 'NaN', 'Inf', '-Inf', 'complex', 'empty', or a 'link error').

  • TS_InspectQuality('reduced'); As 'full', but includes only columns where special values occurred.

For example, running TS_InspectQuality('summary') loads in data from HCTSA.mat and produces the following plot, which can be zoomed in on and explored to understand which features are producing problematic outputs:


    Errors with compiled code

    Note that errors returned from Matlab files do not halt the progress of the computation (using try-catch statements), but errors with compiled mex functions (or external command-line packages like TISEAN) can produce a fault that crashes Matlab or the system. We have performed some basic testing on all mex functions, but for some unusual time series, such faults may still occur. These situations must be dealt with by either identifying and fixing the problem in the original source code and recompiling, or by removing the problem code.

    Troubleshooting errors

    When getting information on operations that produce special-valued outputs (getting IDs listed from TS_InspectQuality), it can be useful to then test examples by re-running pieces of code with the problematic data. The function TS_WhichProblemTS can be used to retrieve time series from an hctsa dataset that caused a problem for a given operation.

    Usage is as follows:
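    % Retrieve time series that caused problems for the operation with ID 684
    % (output variable names follow the description below):
    [ts_ind,dataCell,codeEval] = TS_WhichProblemTS(684);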

    This provides the list of time series IDs (ts_ind), their time-series data vectors (dataCell), and the code to evaluate the given operation (in this case, the master operation code corresponding to the operation with ID 684).

    You can then pick an example time series (e.g., the first problem time series: x = dataCell{1}; x_z = zscore(x)), and then copy and paste the code in codeEval into the command line to evaluate the code for this time series. This method allows easy debugging and inspection of examples of time-series data that caused problems for particular operations flagged through the TS_InspectQuality process.

    Showing the first 400 samples of 10 selected time series, equally-spaced through the TimeSeries IDs in HCTSA_N.mat.

    Freeform plotting

    Many more custom plotting options are available by passing an options structure to TS_PlotTimeSeries, including the 'plotFreeForm' option which allows very many time series to be shown in a single plot (without the usual axis borders):
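    % A sketch (passing the options structure as a fifth argument is an assumption):
    plotOptions = struct('plotFreeForm',true);
    TS_PlotTimeSeries('norm',40,'all',300,plotOptions);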

    producing an overview picture of the first 300 samples of 40 time series (spaced through the rows of the data matrix).

    Dealing with groups of time series

When the time series have been assigned groups (using TS_LabelGroups, here), this information is automatically incorporated into TS_PlotTimeSeries, which then plots a given number of time series from each group, colored accordingly:

In this case the two labeled groups of time series are recognized by the function: red (noisy) and blue (no noise); 5 time series from each group are plotted, showing the first 500 samples of each.

        whatData = 'norm'; % Get data from HCTSA_N.mat
        plotWhatTimeSeries = 'all'; % plot examples from all time series
        plotHowMany = 10; % how many to plot
        maxLength = 400; % maximum number of samples to plot for each time series
        TS_PlotTimeSeries(whatData,plotHowMany,plotWhatTimeSeries,maxLength);

    Youtube live stream of a tutorial session on feature-based time-series analysis tools at Computational Neuroscience Academy 2023, Krakow, Poland.

YouTube video walking through the tutorial on classifying EEG data; slides for the full talk about analyzing neural dynamics using hctsa at CNS 2020 are also available. (July 2020).

    Talk about the importance of comparison for time-series analysis to QMNET. Slides, YouTube. (August 2020).

Seizure EEG. An overview tutorial on applying hctsa to a 5-class EEG dataset is in the hctsaTutorial_BonnEEG repository (https://github.com/benfulcher/hctsaTutorial_BonnEEG), including the use of the reduced feature set, catch22, within the hctsa framework. There is also a recording of the tutorial (the final ~hour is a hands-on demo using this dataset) on YouTube (https://www.youtube.com/watch?v=c1adfGjof8s).


Worms in a dish (paper and associated repository). You can try your hand at classifying different strains of the nematode worm C. elegans based on time series of their movement speed; the repository links to pre-computed HCTSA.mat files.


Flies in a tube (paper and associated repository). This repository allows you to skip the process of running an hctsa calculation (you can download pre-computed results) and get straight to following key analyses, including classifying day/night or male/female.


A collection of 1000 diverse empirical time series to test algorithms on.


    Finding informative features

    In a time-series classification task, you are often interested in not only determining differences between the labeled classes, but also interpreting them in the context of your application. For example, an entropy estimate may give low values to the RR interval series measured from a healthy population, but higher values to those measured from patients with congestive heart failure. When performing a highly comparative time-series analysis, we have computed a large number of time-series features and can use these results to arrive at these types of interpretable conclusions automatically.

    A simple way to determine which individual features are useful for distinguishing the labeled classes of your dataset is to compare each feature individually in terms of its ability to separate the labeled classes of time series. This can be achieved using the TS_TopFeatures function.

    Example: Classification of seizure from EEG

In this example, we consider 100 EEG time series from seizure and 100 EEG time series from eyes open (from the publicly-available Bonn EEG dataset). After labeling the two groups (by running simply TS_LabelGroups(), which automatically detects the two keywords assigned to the two classes of signals), we run simply:
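    % Compare the ability of each individual feature to separate the labeled classes:
    TS_TopFeatures()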

    By default this function will compare the in-sample linear classification performance of all operations in separating the labeled classes of the dataset (individually), produce plots summarizing the performance across the full library (compared to an empirical null distribution), characterize the top-performing features, and show their dependencies on the dataset. Inputs to the function control how these tasks are performed (including the statistic used to measure individual performance, what plots to produce, and whether to produce nulls).

    First, we get an output to screen, showing the mean linear classification accuracy across all operations, and a list of the operations with top performance (ordered by their test statistic, with their ID shown in square brackets, and keywords in parentheses):

    Here we see that measures of entropy are dominating this top list, including measures of the time-series distribution.

    An accompanying plot summarizes the performance of the full library on the dataset, compared to a set of nulls (generated by shuffling the class labels randomly):

Here we can see that the default feature library (blue: 7275 features remaining after filtering and normalizing) performs much better than the randomized null features (red).

    Next we get a set of plots showing the class probability distributions for the top features. For example:

    This allows us to interpret the values of features in terms of the dataset. After some inspection of the listed operations, we find that various measures of EEG Sample Entropy are higher for healthy individuals with eyes open (blue) than for seizure signals (purple).

    Finally, we can visualize how the top operations depend on each other, to try to deduce the main independent behaviors in this set of 25 top features for the given classification task:

    In this plot, the magnitude of the linear correlation, R (as 1-|R|) between all pairs of top operations is shown using color, and a visualization of a clustering into 5 groups is shown (with green boxes corresponding to clusters of operations based on the dendrogram shown to the right of the plot). Blue boxes indicate a representative operation of each cluster (closest to the cluster centre).

    Here we see that most pairs of operations are quite dependent (at |R| > ~0.8), and the main clusters are those measuring distribution entropy (EN_DistributionEntropy_), Lempel-Ziv complexity (EN_MS_LZcomplexity_), Sample Entropy (EN_SampEn_), and a one-step-ahead surprise metric (FC_Surprise_).

This function produces all of the visualizations required to achieve a first understanding of the differences between the labeled classes of the dataset, and should be considered a first step in a more detailed analysis. For example, in this case we may investigate the top-performing features in more detail, to understand what they are measuring, and how. This process is described in the Interpreting features section.

    Visualizing the data matrix

    The clustered data matrix (if clustering has been performed, otherwise the non-clustered data matrix is used) can be visualized by running TS_PlotDataMatrix.

    This will produce a colored visualization of the data matrix such as that shown below.

    When data is grouped according to a set of distinct keywords and stored as group metadata (using the TS_LabelGroups function), these can also be visualized using TS_PlotDataMatrix('colorGroups',true).

    Visualizing the normalized data matrix

    Running TS_PlotDataMatrix('norm') plots the data contained in the local file HCTSA_N.mat, yielding:

    where black rectangles label missing values, and other values are shown from low (blue) to high (red) after normalization using the scaled outlier-robust sigmoidal transformation. Due to the size of the matrix, operations are not labeled individually.

    Examples of time series segments are shown to the left of the plot, and when the middle plot is zoomed, the time-series annotations remain matched to the data matrix:

    Visualizing the clustered data matrix

The data matrix can be displayed with the stored order of time series and operations preserved, but relationships between rows and columns are difficult to discern when they are not ordered by similarity.

If you have run TS_Cluster, it will save clustering information back to the HCTSA file; this information is loaded if it exists and used to reorder rows and columns by similarity. Here's an example of running TS_PlotDataMatrix on data that has this clustering info:

    Much prettier, hey?! By reordering rows and columns, this representation reveals correlated patterns of outputs across different types of operations, and similar sets of properties between different types of time series.

    Incorporating group information

    In this example, we consider a set of 20 periodic and 20 noisy periodic signals. We assigned the time series in HCTSA.mat to groups (using TS_LabelGroups('raw',{'periodic','noisy'})), then normalized the data matrix (TS_Normalize), and then clustered it (TS_Cluster). So now we have a clustered data matrix containing thousands of summaries of each time series, as well as pre-assigned group information as to which time series are periodic and which are noisy. When the time series have been assigned to groups, this can be accessed by switching on the 'colorGroups' setting:
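    % A sketch (mirroring the 'colorGroups' usage shown earlier):
    TS_PlotDataMatrix('norm','colorGroups',false); % default coloring
    TS_PlotDataMatrix('norm','colorGroups',true);  % color by the labeled groups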

    producing the following two plots:

    When group information is not used (the left plot), the data is visualized in the default blue/yellow/red color scheme, but when the assigned groups are colored (right plot), we see that the clustered dataset separates perfectly into the periodic (green) and noisy (blue) time series, and we can visualize the features that contribute to the separation.

    Interpreting features

A typical data analysis procedure might first involve carefully inspecting your data, then thinking about the patterns you might want to capture, then trying to devise an algorithm to capture them (perhaps testing the algorithm carefully on simulated data to probe its behavior, its dependence on noise level and time-series length, and its sensitivity to low-order trends or other types of non-stationarity). But in hctsa we do something somewhat dangerous: apply thousands of methods without needing to first look at the data 😲

A conventional data analysis pipeline starts with the hard work of thinking about the dynamics and carefully selecting time-series analysis methods, with deep knowledge of how they work and the assumptions that make their interpretation valid. This hard work is not avoided in hctsa, although if you're lucky, hctsa may save you a bunch of effort: if it identifies high-performing features, you can focus your energy on interpreting just those. But this hard work comes at the end of an analysis, and it is difficult and time-consuming to do well, often involving looking deeply into the algorithms and theory underlying these methods, combined with careful inspection of the data to ensure you are interpreting them properly.

Being the most substantial challenge, one that cannot be automated and requires a lot of careful thought, this page provides some basic advice. While it's often challenging to work out what a given feature is capturing in your dataset, it is the most interesting part of an analysis, as it points you to interpretable scientific theory/algorithms and makes you think carefully about the structure in your time series.

    Finding features to interpret

    1: Identify significant features

    Sometimes you will run TS_TopFeatures and find no features that significantly distinguish time series recorded from the labeled groups in your dataset. In this case, there may not be a signal, the signal may be too small given your sample size (and the perhaps large number of candidate time-series features, if using the hctsa set), or the right statistical estimators may not be present in hctsa.

But other times you will obtain a long list of statistically significant features (after multiple-hypothesis correction), e.g., from TS_TopFeatures, with values that significantly distinguish the groups you care about (e.g., individuals with some disease diagnosis compared to healthy controls). In cases like this, the next step is to obtain some understanding of what's happening.

Consider a long and kinda gnarly list that is quite hard to make sense of. This is a typical situation, because feature names in hctsa are long and typically hard to interpret directly. Like, perhaps:

    If there are hundreds or thousands of statistically significant features, some of which have much higher accuracy than others, you may first want to inspect only a subset of features, e.g., the 50 or 100 top-performing features (with the most significant differences). If you can make sense of these first, you will usually have a pretty good basis for working out what's going on with the features with lower performance (if indeed you want to go beyond interpreting the best-performing features).

    2: Find groups of similar features

TS_TopFeatures is helpful for showing us how features with different, long, and complicated names might cluster into groups of similarly behaving features that are measuring similar properties of the data (cf. the ). This is essential for interpreting groups of features that are capturing the same underlying conceptual property of the dynamics. It helps you work out what's going on, and avoids over-interpreting any specific feature in isolation. For example, sometimes features with very complicated-sounding names (like nonlinear embedding dimension estimates) exhibit very similar behavior to very simple features (like lag-1 autocorrelation or distributional outlier measures), indicating that the nonlinear embedding dimension estimates are likely sensitive to simpler properties of the time series (e.g., sensitive to outliers in the data). Interpreting the nonlinear embedding dimension in isolation would be a mistake in such a case.

    Plotting and inspecting clustered feature–feature correlation plots are crucial for identifying groups of features with similar behavior on the dataset. Then we can inspect and interpret these groups of similar (highly inter-correlated) features together as a group. This should be the first step in interpreting a set of significant features: which groups of features behave similarly, and what common concepts are they measuring?

An example is shown below, where we identified three important groups of features (measuring stationarity of self-correlation properties, stationarity of distributional properties, and global distributional properties) and then went on to interpret each in turn:

    Interpreting individual features

    When such a group of high-performing features capturing a common time-series property has been identified, how can we start to interpret and understand what each individual feature is measuring?

    Some features in the group may be easy to interpret directly. For example, in the list above rms is straightforward to interpret directly: it is simply the root-mean-square of the distribution of time-series values. Others have clues in the name (e.g., features starting with WL_coeffs are to do with measuring wavelet coefficients, features starting with EN_mse correspond to measuring the multiscale entropy ('mse'), and features starting with FC_LocalSimple_mean are related to time-series forecasting using local means of the time series).

    Below we outline a procedure for how a user can go from a time-series feature selected by hctsa towards a deeper understanding of the type of algorithm that feature is derived from, how that algorithm performs across the dataset, and thus how it can provide interpretable information about your specific time-series dataset.

    Inspecting keywords

The simplest way of getting a quick idea of what sort of property a feature might be measuring is from its keywords, which often label individual features by the class of time-series analysis method from which they were derived. Keywords have not been applied exhaustively to features (this is an ongoing challenge), but when they are present they can be useful in giving a sense of what's going on. In the list above, we see keywords listed in parentheses, such as:

    1. 'forecasting' (methods related to predicting future values of a time series),

    2. 'entropy' (methods related to predictability and information content in a time series),

    3. 'wavelet' (features derived from wavelet transforms of the time series),

    Inspecting code

To find more specific, detailed information about a feature, beyond just a broad categorical label of the literature from which it was derived, the next step is to find and inspect the code file that generates the feature of interest. For example, say we were interested in the top-performing feature in the list above:

    We know from the keyword that this feature has something to do with forecasting, and the name provides clues about the details (e.g., FC_ stands for forecasting, the function FC_LocalSimple is the one that produces this feature, which, as the name suggests, performs simple local time-series prediction). We can use the feature ID (3016) provided in square brackets to get information from the Operations metadata table:
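    % A sketch, assuming the Operations table is loaded from an hctsa data file:
    load('HCTSA_N.mat','Operations');
    Operations(Operations.ID==3016,:)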

    Inspecting the text before the dot, ., in the CodeString field (FC_LocalSimple_mean3) tells us the name that hctsa uses to describe the Matlab function and its unique set of inputs that produces this feature. Whereas the text following the dot, ., in the CodeString field (taures), tells us the field of the output structure produced by the Matlab function that was run.

    We can use the MasterID to get more information about the code that was run using the MasterOperations metadata table:
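    % A sketch: look up the master operation via the MasterID of our feature:
    load('HCTSA_N.mat','MasterOperations');
    theOp = Operations(Operations.ID==3016,:);
    MasterOperations(MasterOperations.ID==theOp.MasterID,:)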

    This tells us that the code used to produce our feature was FC_LocalSimple(y,'mean',3). We can get information about this function in the command line by running a help command:
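    help FC_LocalSimple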

    We can also inspect the code in the function FC_LocalSimple directly for more information. Like all code files for computing time-series features, FC_LocalSimple.m is located in the Operations directory of the hctsa repository.

    Inspecting the code file, we see that running FC_LocalSimple(y,'mean',3) does forecasting using local estimates of the time-series mean (since the second input to FC_LocalSimple, forecastMeth is set to 'mean'), using the previous three time-series values to make the prediction (since the third input to FC_LocalSimple, trainLength is set to 3).

    To understand what the specific output quantity from this code is that came up as being highly informative in our TS_TopFeatures analysis, we need to look for the output labeled taures of the output structure produced by FC_LocalSimple. We discover the following relevant lines of code in FC_LocalSimple.m:
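    % A paraphrased sketch of the relevant lines (variable names assumed):
    out.ac1 = CO_AutoCorr(res,1);        % lag-1 autocorrelation of the residuals
    out.ac2 = CO_AutoCorr(res,2);        % lag-2 autocorrelation of the residuals
    out.taures = CO_FirstZero(res,'ac'); % first zero of the residual autocorrelation function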

    This shows us that, after doing the local mean prediction, FC_LocalSimple then outputs some features on whether there is any residual autocorrelation structure in the residuals of the rolling predictions (the outputs labeled ac1, ac2, and our output of interest: taures).

The code shows that this taures output computes the CO_FirstZero of the residuals, which measures the first zero of the autocorrelation function (cf. help CO_FirstZero). When the local mean prediction still leaves a lot of autocorrelation structure in the residuals, our feature, FC_LocalSimple_mean3_taures, will thus take a high value.

    Visualizing outputs

    Once we've seen the code that was used to produce a feature, and started to think about how such an algorithm might be measuring useful structure in our time series, we can then check our intuition by inspecting its performance on our dataset (as described in ).

    For example, we can run the following:
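    % A sketch: summarize the distribution of outputs of feature 3016 across the dataset:
    TS_FeatureSummary(3016)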

    which produces a plot like that shown below. We have run this on a dataset containing noisy sine waves, labeled 'noisy' (red) and periodic signals without noise, labeled 'periodic' (blue):

    In this plot, we see how this feature orders time series (with the distribution of values shown on the left, and split between the two groups: 'noisy', and 'periodic'). Our intuition from the code, that time series with longer correlation timescales will have highly autocorrelated residuals after a local mean prediction, appears to hold visually on this dataset.

    In general, the mechanism provided by TS_FeatureSummary to visualize how a given feature orders time series, including across labeled groups, can be very useful for feature interpretation.

    Summary

    hctsa contains a large number of features, many of which can be expected to be highly inter-correlated on a given time-series dataset. It is thus crucial to explore how a given feature relates to other features in the library, e.g., using the correlation matrix produced by TS_TopFeatures (cf. ), or by searching for features with similar behavior on the dataset to a given feature of interest (cf. ).

    In a specific domain context, you may need to decide on the trade-off between more complicated features that may have slightly higher in-sample performance on a given task, and simpler, more interpretable features that may help guide domain understanding. The procedures outlined above are typically the first step to understanding a time-series analysis algorithm, and its relationship to alternatives that have been developed across science.

    Investigating specific operations

Sometimes it's useful to investigate the behavior of an individual operation (or feature) across a time-series dataset. What is the distribution of its outputs, what types of time series receive low values, and what types receive high values?

    These types of simple questions for specific features of interest can be investigated using the TS_FeatureSummary function. The function takes in an operation ID as its input (and can also take inputs specifying a custom data source, or custom annotation parameters), and produces a distribution of outputs from that operation across the dataset, with the ability to then annotate time series onto that plot.

    For example, the following:

    TS_FeatureSummary(100)

produces the following plot (where six points on the distribution have been clicked on to annotate them with short time-series segments):

    You can visually see that time series with more autocorrelated patterns through time receive higher values from this operation. Because groups have not been assigned to this dataset, the time series are colored at random.

    Running TS_FeatureSummary in violin plot mode provides another representation of the same result:
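    % A sketch (argument order and the annotation field name are assumptions):
    annotateParams = struct('maxL',500); % annotate 500-sample segments
    TS_FeatureSummary(4310,'raw',true,annotateParams);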

    This plots the distribution of feature 4310 from HCTSA.mat as a violin plot, with ten 500-point time series subsegments annotated at different points through the distribution, shown to the right of the plot:

    Plotting for labeled groups of time series

When time-series groups have been labeled (using TS_LabelGroups, as: TS_LabelGroups({'seizure','eyesOpen','eyesClosed'},'raw');), TS_FeatureSummary will plot the distribution for each class separately, as well as an overall distribution. Annotated points can then be added to each class-specific distribution. In the example shown below, we can see that the 'noisy' class (red) has low values for this feature (CO_tc3_2_denom), whereas the 'periodic' class mostly has high values.

    Simpler distributions

    TS_SingleFeature provides a simpler way of seeing the class distributions without annotations, as either kernel-smoothed distributions, as in TS_FeatureSummary, or as violin plots. See below for example implementations:
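    % Sketches (argument order assumed):
    TS_SingleFeature('raw',4310,false); % kernel-smoothed distributions (first plot below)
    TS_SingleFeature('raw',4310,true);  % violin-plot version (second plot below)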

Shows the distributions with a classification bar underneath (indicating where a linear classifier would classify different parts of the space as either noisy or periodic):

Shows the distributions as a violin plot, with means annotated and a classification bar to the left:

Note that the accuracy quoted in the title (an indication of the 10-fold cross-validated balanced accuracy of a linear classifier in this space) is computed from a single 10-fold split and is therefore stochastic; as shown above, repeating it can yield slightly different results. For a more rigorous analysis than this simple indication, the procedure should be repeated many times to obtain a converged estimate of the balanced classification accuracy.

    Working with a mySQL database

    When running large-scale hctsa computations, it can be useful to set up a mySQL database for time series, operations, and the computation results, and have many Matlab instances (running on different nodes of a compute cluster, for example) communicate directly with the database.

    The hctsa software comes with this (optional) functionality, allowing a more powerful, distributed way to compute and store results of a large-scale computation.

    This chapter outlines the steps involved in setting up, and running hctsa computations using a linked mySQL database.

    Installing the hctsa code package to work with a mySQL database

The hctsa package requires some preliminary set-up to work with a mySQL database, described below:

    1. Installation of mySQL, either locally, or on an accessible server.

    2. Setting up Matlab with a mySQL java connector (done by running the install_jconnector script in the Database directory, and then restarting Matlab).

    After the database is set up, and the packages required by hctsa are installed (by running the install script), linking to a mySQL database can be done by running the install_database script, which:

1. Sets up Matlab to be able to communicate with the mySQL server and creates a new database to store Matlab calculations in, as described below.

2. Populates the database with our default library of master operations and operations, as described below. (NB: in this terminology, a 'master operation' is a set of input arguments to an analysis function, and an 'operation' is a single time-series feature.)

    This section contains additional details about each of these steps.

    Note that the above steps are one-off installation steps; once the software is installed and compiled, a typical workflow will simply involve opening Matlab, running the startup script (which adds all paths required for the hctsa software), and then working within Matlab from any desired directory.

    Adding a time-series dataset

Once installed using our default library of operations, the typical next step is to add a dataset of time series to the database using the SQL_Add command. Custom master operations and operations can also be added, if required.

    Computation, processing, and analysis

After installing the software and importing a time-series dataset to a mySQL database, the process by which data is retrieved from the database to local Matlab files (using SQL_Retrieve), feature sets computed within Matlab (using TS_Compute), and computed data stored back in the database (SQL_store) is described in detail in this chapter.

After the computation is complete for a time-series dataset, a range of processing, analysis, and plotting functions are also provided with the software, as described earlier in this documentation.

    Comparing to existing features

One of the key goals of highly comparative time-series analysis is to allow unbiased methodological comparison between the vast literature of time-series analysis tools developed for different applications. By representing features in terms of their outputs across a time-series dataset, the context of a given feature can be assessed by searching the database for features with similar behavior. The search can be done using a diverse range of real and model-generated data, or using a more specific dataset if this is more appropriate for a given application (e.g., looking just at EEG signals). Just like searching for similar time series, similar features to a given target feature can be retrieved using TS_SimSearch.

This chapter will give instructions on how you can compare a new time-series analysis feature to our library of over 7000 time-series features using hctsa. We assume that the reader has installed the hctsa package, which will be required to work with files and compute features.

    Setting up the mySQL database

We assume that the user has access to, and appropriate read/write privileges for, a local or network mySQL server database. Instructions on how to install and set up a mySQL database on a variety of operating systems can be found online.

    Setting Matlab up to talk to a mySQL server using the java connector

Before the structure of the database can be created, Matlab must be set up to be able to talk to the mySQL server, which requires installing a mySQL java connector. The steps required to achieve this are performed by the script install_jconnector, which should be run from the main hctsa directory. If this script runs successfully and a mySQL server has been installed (either on your local machine or on an external server, see above), you are then ready to run the install script.

    Populating the database with time series and operations

    When linking Matlab to a mySQL database, metadata associated with time series, operations, and master operations, as well as the results of computations are all stored in an indexed database. Adding master operations, operations, and time series to the database can be achieved using the SQL_Add function, as described below.
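For example, each type of object is added with a call of the following form (each call is described in detail in the sections below):

    % Add master operations, operations, or time series from formatted input files:
    SQL_Add('mops','INP_mops.txt');
    SQL_Add('ops','INP_ops.txt');
    SQL_Add('ts','INP_ts.txt');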

    The following table summarizes the terminology used for each type of object in hctsa land:
|  | Master Operation | Operation | Time Series |
| --- | --- | --- | --- |
| Database identifier | mop_id | op_id | ts_id |
| Input to SQL_Add | 'mops' | 'ops' | 'ts' |

    Retrieving from the database

    Retrieving data from the Results table of the database is typically done for one of two purposes:

    1. To calculate as-yet uncalculated entries to be stored back into the database, and

    2. To analyze already-computed data stored in the database in Matlab.

The function SQL_Retrieve performs both of these functions, using different inputs. Here we describe the use of the SQL_Retrieve function for the purposes of populating uncalculated (NULL) entries in the Results table of the database in Matlab.

    Adding time series

After setting up a database with a library of time-series features, the next task is to add a dataset of time series to the database. It is up to the personal preference of the user whether to keep all time-series data in a single database, or to have a different database for each dataset.

    Time series are added using the same function used to add master operations and operations to the database, SQL_Add, which imports time series data (stored in time-series data files) and associated keyword metadata (assigned to each time series) to the database. The time-series data files to import, and the keywords to assign to each time series are specified in either: (i) an appropriately formatted matlab (.mat) file, or (ii) a structured input text file, as explained below.

    Note that, when using the .mat file input method, time-series data is stored in the database to six significant figures. However, when using the .txt file input method, time-series data values are stored as written in the input text file of each time series.

Time series can be indexed by assigning keywords to them (which are stored in the TimeSeriesKeywords table and associated index table, TsKeywordsRelate, of the database).

    The database structure

    The mySQL database is structured into four main components:

1. A list of all the filenames and other metadata of time series (the TimeSeries table),

2. A list of all the names and other metadata of time-series analysis operations (the Operations table),

3. A list of all the pieces of code that must be evaluated to give each operation a value, which is necessary, for example, when one piece of code produces multiple outputs (the MasterOperations table), and

4. A list of the results of applying operations to time series (the Results table).

Additional tables are related to indexing and managing efficient keyword labeling.
    Computing operations and writing back to the database

    After retrieving data from the mySQL database, missing entries (NULL in the database, and NaN in the local Matlab file) can be computed using TS_Compute, and stored back to the database using SQL_store. These functions are described below.

    Performing calculations using TS_Compute

Values retrieved using SQL_Retrieve (to the local HCTSA.mat file) that have not previously been calculated are evaluated using TS_Compute, as illustrated below. These results can then be inspected directly (if needed), or simply written back to the database using SQL_store, as described below.

        % Compute all missing values in HCTSA.mat:
        TS_Compute();
        % Compute all missing values in HCTSA.mat using parallel processing:
        TS_Compute(true);
        % Compute missing values in HCTSA.mat for ts_ids from 1:10 and op_ids from 1:1000
        TS_Compute(false,1:10,1:1000);
        % Compute all values that have never been computed before (default)
        TS_Compute(false,[],[],'missing');
    % Compute all values that have never previously been calculated OR have previously been computed but returned an error:
        TS_Compute(false,[],[],'error');
        % Compute all missing values in my_HCTSA_file.mat:
        TS_Compute(false,[],[],'missing','my_HCTSA_file.mat');
        % Compute all missing values in HCTSA.mat, suppressing output to screen:
        TS_Compute(false,[],[],'missing','',false);
    % Initialize a new hctsa dataset (default: the full hctsa feature library):
    TS_Init('INP_test_ts.mat');
    % Equivalent, specifying the feature set explicitly:
    TS_Init('INP_test_ts.mat','hctsa');
    % Initialize using the reduced catch22 feature set:
    TS_Init('INP_test_ts.mat','catch22');
    % Initialize using custom master-operation and operation input files:
    TS_Init('INP_ts.mat',{'INP_mops.txt','INP_ops.txt'});
    timeSeriesData = {[1.45,2.87,...],[8.53,-1.244,...]}; % (a cell of vectors)
    labels = {'EEGExperiment_sub1_trial1.dat','EEGExperiment_sub1_trial2.dat'}; % data labels for each time series
    keywords = {'subject1,trial1,eeg','subject1,trial2,eeg'}; % comma-delimited keywords for each time series
    
    % Save these variables out to INP_test.mat:
    save('INP_test.mat','timeSeriesData','labels','keywords');
    
    % Initialize a new hctsa analysis using these data and the default feature library:
    TS_Init('INP_test.mat','hctsa');
    gaussianwhitenoise_001.dat     noise,gaussian
    gaussianwhitenoise_002.dat     noise,gaussian
    sinusoid_001.dat               periodic,sine
    CO_tc3(x_z,1)     CO_tc3_xz_1
    ST_length(x)    ST_length
        CO_tc3_xz_1.raw     CO_tc3_1_raw      correlation,nonlinear
        CO_tc3_xz_1.abs     CO_tc3_1_abs      correlation,nonlinear
        CO_tc3_xz_1.num     CO_tc3_1_num      correlation,nonlinear
        CO_tc3_xz_1.absnum  CO_tc3_1_absnum   correlation,nonlinear
        CO_tc3_xz_1.denom   CO_tc3_1_denom    correlation,nonlinear
        ST_length           length            raw,lengthDependent
    TS_Classify('norm')
    numNulls = 100;
    dataFile = 'HCTSA_N.mat';
    cfnParams = GiveMeDefaultClassificationParams(dataFile);
    TS_Classify(dataFile,cfnParams,numNulls,'doParallel',true)
    ---Top feature loadings for PC5---:
    (0.048, r = 0.65) [116] AC_24 (correlation)
    (0.048, r = 0.64) [115] AC_23 (correlation)
    (0.047, r = 0.64) [117] AC_25 (correlation)
    (0.046, r = 0.62) [114] AC_22 (correlation)
    (0.045, r = 0.61) [118] AC_26 (correlation)
        TS_SimSearch('whatPlots',{'matrix'});
        TS_SimSearch(1,'whatPlots',{'network'});
        TS_SimSearch(1,'whatPlots',{'scatter'});
        TS_SimSearch(1,'whatPlots',{'matrix','network','scatter'})
    % Find time series that failed for the operation with ID = 684.
    [ts_ind, dataCell, codeEval] = TS_WhichProblemTS(684);
        % Plot as a freeform plot without labeling time series:
        plotOptions = struct('plotFreeForm',true,'displayTitles',false);
        TS_PlotTimeSeries('norm',40,'all',300,plotOptions);
        numPerGroup = 5; % plot this many examples of each group of time series
        plotHow = 'grouped'; % plot examples of each assigned group of time series
        TS_PlotTimeSeries('norm',numPerGroup,plotHow,500);

| Quality code | Description |
| --- | --- |
| 5 | Output had a non-zero imaginary component |
| 6 | Output was empty (e.g., []) |
| 7 | Field specified for this operation did not exist in the master operation output structure |

DN_HighLowMu: The highlowmu statistic.
DN_HistogramMode: Mode of the histogram.
DN_Mean: A given measure of location of a data vector.
DN_MinMax: The maximum and minimum values of the input data vector.
DN_Moments: A moment of the distribution of the input time series.
DN_OutlierInclude: How statistics depend on distributional outliers.
DN_OutlierTest: How distributional statistics depend on distributional outliers.
DN_ProportionValues: Proportion of values in a time-series vector.
DN_Quantile: Quantiles of the distribution of values in the time series data vector.
DN_RemovePoints: How time-series properties change as points are removed.
DN_SimpleFit: Fits of parametric distributions or simple time-series models.
DN_Spread: Spread of the input time series.
DN_TrimmedMean: Mean of the outlier-trimmed time series.
DN_HistogramAsymmetry: Distributional asymmetry.
DN_Unique: The proportion of the time series that are unique values.
DN_Withinp: Proportion of data points within p standard deviations of the mean.
DN_cv: Coefficient of variation.
DN_pleft: Distance from the mean at which a given proportion of data are more distant.
EN_DistributionEntropy: Distributional entropy.
HT_DistributionTest: Hypothesis test for distributional fits to a data vector.
CO_Embed2_Basic: Point density statistics in a 2-d embedding space.
CO_Embed2_Dist: Analyzes distances in a 2-d embedding space of a time series.
CO_Embed2_Shapes: Shape-based statistics in a 2-d embedding space.
CO_FirstCrossing: First time the autocorrelation function crosses a threshold.
CO_FirstMin: Time of first minimum in a given correlation function.
CO_NonlinearAutocorr: A custom nonlinear autocorrelation of a time series.
CO_StickAngles: Analysis of line-of-sight angles between time-series data points.
CO_TranslateShape: Statistics on datapoints inside geometric shapes across the time series.
CO_f1ecac: The 1/e correlation length.
CO_fzcglscf: The first zero-crossing of the generalized self-correlation function.
CO_glscf: The generalized linear self-correlation function of a time series.
CO_tc3: Normalized nonlinear autocorrelation function, tc3.
CO_trev: Normalized nonlinear autocorrelation, trev function of a time series.
DK_crinkle: Computes James Theiler's crinkle statistic.
DK_theilerQ: Computes Theiler's Q statistic.
DK_timerev: Time reversal asymmetry statistic.
NL_embed_PCA: Principal Components analysis of a time series in an embedding space.

Automutual information:

CO_RM_AMInformation: Automutual information (Rudy Moddemeijer implementation).
CO_CompareMinAMI: Variability in first minimum of automutual information.
CO_HistogramAMI: The automutual information of the distribution using histograms.
IN_AutoMutualInfoStats: Statistics on automutual information function for a time series.
EN_RM_entropy: Entropy of a time series using Rudy Moddemeijer's code.
EN_Randomize: How time-series properties change with increasing randomization.
EN_SampEn: Sample Entropy of a time series.
EN_mse: Multiscale entropy of a time series.
EN_rpde: Recurrence period density entropy (RPDE).
EN_wentropy: Entropy of time series using wavelets.
MF_FitSubsegments: Robustness of model parameters across different segments of a time series.
MF_GARCHcompare: Comparison of GARCH time-series models.
MF_GARCHfit: GARCH time-series modeling.
MF_GP_FitAcross: Gaussian Process time-series modeling for local prediction.
MF_GP_LocalPrediction: Gaussian Process time-series model for local prediction.
MF_GP_hyperparameters: Gaussian Process time-series model parameters and goodness of fit.
MF_StateSpaceCompOrder: Change in goodness of fit across different state space models.
MF_StateSpace_n4sid: State space time-series model fitting.
MF_arfit: Statistics of a fitted AR model to a time series.
MF_armax: Statistics on a fitted ARMA model.
MF_hmm_CompareNStates: Hidden Markov Model (HMM) fitting to a time series.
MF_hmm_fit: Fits a Hidden Markov Model to sequential data.
MF_steps_ahead: Goodness of model predictions across prediction lengths.
FC_LocalSimple: Simple local time-series forecasting.
FC_LoopLocalSimple: How simple local forecasting depends on window length.
FC_Surprise: How surprised you would be at the next data point given recent memory.
PP_ModelFit: Investigates whether AR model fit improves with different preprocessings.
SY_PPtest: Phillips-Perron unit root test.
SY_RangeEvolve: How the time-series range changes across time.
SY_SlidingWindow: Sliding window measures of stationarity.
SY_SpreadRandomLocal: Bootstrap-based stationarity measure.
SY_StatAv: Simple mean-stationarity metric, StatAv.
SY_StdNthDer: Standard deviation of the nth derivative of the time series.
SY_StdNthDerChange: How the output of SY_StdNthDer changes with order parameter.
SY_TISEAN_nstat_z: Cross-forecast errors of zeroth-order time-series models.
SY_VarRatioTest: Variance ratio test for random walk.

Step detection:

CP_ML_StepDetect: Analysis of discrete steps in a time series.
CP_l1pwc_sweep_lambda: Dependence of step detection on regularization parameter.
CP_wavelet_varchg: Variance change points in a time series.
NL_TISEAN_d2: d2 routine from the TISEAN package.
NL_TISEAN_fnn: False nearest neighbors of a time series.
NL_TSTL_FractalDimensions: Fractal dimension spectrum, D(q), of a time series.
NL_TSTL_GPCorrSum: Correlation sum scaling by the Grassberger-Procaccia algorithm.
NL_TSTL_LargestLyap: Largest Lyapunov exponent of a time series.
NL_TSTL_PoincareSection: Poincare section analysis of a time series.
NL_TSTL_ReturnTime: Analysis of the histogram of return times.
NL_TSTL_TakensEstimator: Takens' estimator for correlation dimension.
NL_TSTL_acp: acp function in TSTOOL.
NL_TSTL_dimensions: Box counting, information, and correlation dimension of a time series.
NL_crptool_fnn: Analyzes the false-nearest neighbors statistic.
SD_SurrogateTest: Analyzes test statistics obtained from surrogate time series.
SD_TSTL_surrogates: Surrogate time-series analysis.
TSTL_delaytime: Optimal delay time using the method of Parlitz and Wichard.
TSTL_localdensity: Local density estimates in the time-delay embedding space.
NL_nsamdf: Nonlinearity measure derived from the nonlinear average magnitude difference function.

Fluctuation analysis:

SC_MMA: Physionet implementation of multiscale multifractal analysis.
SC_fastdfa: Matlab wrapper for Max Little's ML_fastdfa code.
SC_FluctAnal: Implements fluctuation analysis by a variety of methods.
WL_cwt: Continuous wavelet transform of a time series.
WL_dwtcoeff: Discrete wavelet transform coefficients.
WL_fBM: Parameters of fractional Gaussian noise/Brownian motion in a time series.
WL_scal2frq: Frequency components in a periodic time series.
SB_TransitionpAlphabet: How transition probabilities change with alphabet size.
ST_SimpleStats: Basic statistics about an input time series.
PP_Compare: Compare how time-series properties change after pre-processing.
PP_Iterate: How time-series properties change in response to iterative pre-processing.


    catch22

This reduced set of 22 features, determined through a combination of classification performance (across 93 problems) and mutual redundancy (as explained in this 📗 paper), is available as an efficiently coded C implementation.

    catchamouse16

    A reduced set of 16 features (trained on mouse fMRI data), coded in C with wrappers for use in python, R, etc.

    autohctsa

    Provides code for setting a hctsa feature-extraction job running on a computing cluster without leaving your local machine (with associated Julia helper code).

    distributedhctsa

    Provides MATLAB code for distributing hctsa calculations across nodes of a computing cluster (using pbs or slurm schedulers).

    Candidate Feature Lab

A repository for testing new time-series analysis algorithms for redundancy (and assessing whether to include them in hctsa).

    hctsa-py

    A work in progress to recode many hctsa time-series analysis features into native python code.

    pyopy

    This excellent repository allows users to run hctsa software from within python.

    hctsaAnalysisPython

    Some preliminary python code for analyzing the results of hctsa calculations.

    tsfresh

    Native python time-series code to extract hundreds of time-series features, with in-built feature filtering. See the paper.

    Kats

    Kats is a toolkit for analysing time-series data in python, and includes basic feature extraction functionality.

    theft

    theft (Tools for Handling Extraction of Features from Time series) allows users to compute various open time-series features sets and contains associated tools for visualising and analysing the results.

    TSFEL

    TSFEL, 'Time Series Feature Extraction Library', is a python package with implementations of 60 simple time-series features (with unit tests).

    tsflex

A flexible and efficient toolkit for time-series processing & feature extraction that makes minimal assumptions about sequential data. Includes support for a range of feature sets, including tsfresh, catch22, TSFEL, and Kats.

    NeuroKit

Includes a wide range of useful signal processing tools (like power spectral densities, detrending and bandpass filtering, and empirical mode decomposition). It also includes estimation of complexity parameters (many entropies and correlation dimensions; see the readme) as well as detrended fluctuation analysis.

    tscompdata

    Makes available existing collections of time-series data for analysis.

    tsfeatures

    Includes implementations of a range of time-series features.

    feasts

    Provides a collection of tools for the analysis of tidy time-series data represented as a tsibble.

    sktime

    A unified framework for machine learning with time series.

    Khiva

    An open-source library of efficient algorithms to analyse time series using GPU and CPU.

    pyunicorn

    A python-based nonlinear time-series analysis and complex systems code package. See this publication.

    tslearn

A python package for performing machine learning using time-series data, including loading datasets from the UCR/UEA repository.

    TSFuse

    A python package that can extract features from multivariate time series.


Reduced the hctsa feature library to a set of 22 efficiently coded features: catch22.

    📗 Data Mining and Knowledge Discovery 33, 1821 (2019).

    💻 catch22 Code.

Developed a software package for highly comparative time-series analysis, hctsa (includes applications to high-throughput phenotyping of C. elegans and Drosophila movement time series).

    📗 Cell Systems 5, 527 (2017).

    💻 Code (fly).

    💻 Code (worm).

    Introduced the feature-based time-series analysis methodology.

    📗 Feature Engineering for Machine Learning and Data Analytics, CRC Press (2018).

    📙 Preprint.

    Showed that the behaviour of thousands of time-series methods on thousands of different time series can be used to organise the interdisciplinary time-series analysis literature.

    📗 J. Roy. Soc. Interface (2013).

    Find dynamical signatures of psychiatric disorders from resting-state fMRI data.

    📗 PLoS Comp. Biol. (2024).

    Predict individual response to rTMS depression treatment from EEG data.

    📙 medRxiv (2023).

    Distinguish meditators from non-meditators from 30s of resting-state EEG data.

    📗 Neural Networks (2023).

    Identify neurophysiological signatures of cortical micro-architecture.

    📗 Nature Comms. (2023).

    Classify stars from NASA's Kepler Mission.

    📗 Monthly Notices of the Royal Astronomical Society (2022).

    Determine how striatal neuromodulation affects brain dynamics in thalamus and cortex.

    📗 eLife (2023).

    Uncover the dynamical structure of sleep EEG.

    📗 Sleep Medicine (2022).

    Show how gradients of variation in time-series properties of BOLD dynamics vary with physiological variation and structural connectivity in the human neocortex.

    📗 eLife (2020).

    Distinguish targeted perturbations to mouse fMRI dynamics.

    📗 Cerebral Cortex (2020).

    💻 Code.

    Extract acoustic features from social vocal accommodation in adult marmoset monkeys.

    📙 bioRxiv (2023).

    Track Drosophila in real time for high-throughput behavioural phenotyping.

    📗 eLife (2023).

    Detect anger from photoplethysmography (PPG) sensors.

    📗 Journal of NeuroEngineering and Rehabilitation (2023).

    Identify and distinguish marmoset vocalisations from audio, using Adaboost feature selection from hctsa features.

    📗 J. Roy. Soc. Interface (2023).

    Discriminate zebra finch songs in different social contexts.

    📗 PLoS Computational Biology (2021).

    Distinguish electromagnetic field exposure from zebrafish locomotion time series.

    📗 Sensors (2020).

    Confirm the role of µORs in VTA and NAc in acute fentanyl-induced behaviour (positive reinforcement).

    📗 Chaudun et al., Nature (2024).

    Assess stress-induced changes in astrocyte calcium dynamics.

    📗 Nature Comms. (2020).

    Assess the stress controllability of neurons from their activity time series.

    📗 Nature Neuroscience (2020).

    Understand changes in fMRI brain dynamics in patients with epilepsy.

    📗 Communications Biology (2024).

    Extract EEG markers of cognitive decline.

    📗 npj Aging (2024).

    Detect EEG markers of seizure disorders.

    📗 Brain Communications (2023).

    Capture a distinctive fingerprint of an individual's resting-state fMRI data.

    📙 ResearchSquare (2023).

    Identify methamphetamine users from EEG time series.

    📙 ResearchSquare (2023).

    Compute temporal profile similarity for individual fingerprinting from human fMRI data.

    📗 Network Neuroscience (2023).

    Characterise subnetworks of the frontoparietal control network from fMRI recordings.

    📙 bioRxiv (2023).

    Find time-series properties of motor-evoked potentials that predict multiple sclerosis progression after two years.

    📗 BMC Neurology (2020).

    Detect mild cognitive impairment using single-channel EEG to measure speech-evoked brain responses.

    📗 IEEE Transactions on Neural Systems and Rehabilitation Engineering (2019).

    Differentiate tremor disorders using massive feature extraction, outperforming the best traditional tremor statistic.

    📙 MedRxiv (2024).

    Identify physiological features predictive of respiratory outcomes in extremely pre-term infants from bedside monitor data.

    📙 MedRxiv (2024)

    Discover signatures of fatal neonatal illness from vital signs.

    📗 npj Digital Medicine (2022).

    Detect falls in elderly people from accelerometer data.

    📗 IEEE International Conference on Information and Communication Technology for Sustainable Development (2021).

Predict post-cardiac arrest outcomes at discharge from physiological time series recorded on the first day of intensive care.

    📗 Anaesthesia Critical Care & Pain Medicine (2021).

    Detect falls of elderly people using wearable sensors.

    📗 IEEE Access (2021).

    Demonstrate that the suppression of essential tremor is due to a disruption of oscillations in the olivocerebellar loop.

    📗 Nature Comms. (2021).

    Classify heartbeats measured using single-lead ECG.

    📗 IEEE 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) (2019).

    Assess muscles for clinical rehabilitation.

    📗 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (2017).

    Screen for COVID-19 using digital holographic microscopy.

    📗 Biomedical Optics Express (2022).

    Detect COVID-19 from red blood cells using digital holographic microscopy.

    📗 Optics Express (2022).

    Identify the biogeographic heterogeneity of mucus, lumen, and feces.

    📗 PNAS (2021).

    Detect keyhole porosity formation during laser irradiation of Ti-6Al-4V substrates.

    📗 Additive Manufacturing (2023).

    Identify keyhole pores in a laser powder-bed fusion process using acoustic and inline pyrometry time series.

    📗 Journal of Materials Processing Technology (2022).

    Detect false data injection attacks into smart meters.

    📗 IEEE Access (2021).

    Predict pending loss of power stability from generator response signals.

    📗 IEEE Access (2021).

    Detect seeded bearing faults on a wind turbine subjected to non-stationary wind speed.

    📗 Proceedings of the Seventeenth International Conference on Condition Monitoring and Asset Management (2021).

    Recognise hand gestures.

    📗 PLoS ONE (2020).

    Distinguish energy use behaviours from smart meter data.

    📗 Energy and Buildings (2019).

    Non-intrusively monitor load for appliance detection and electrical power saving in buildings.

    📗 Energy and Buildings (2019).

    Evaluate asphalt irregularity from smartphone sensors.

    📗 International Symposium on Intelligent Data Analysis (2018).

Detect earthquakes from seismic recordings.

    📗 Geophysical Prospecting (2023).

    Find temporal patterns for reconstructing surface soil moisture time series.

    📗 Journal of Hydrology (2023).

    Predict earthquakes (in the following month) from seismic indicators in Bangladesh.

    📗 IEEE Access (2021).

    Detect earthquakes in Groningen, The Netherlands.

    📗 82nd EAGE Annual Conference & Exhibition Workshop Programme (2020).


    Error handling and maintenance

In this section we describe how keywords and other types of metadata stored in the database can be manipulated, and how results of whole sets of operations and time series can be cleared or completely deleted from the database. These tasks are implemented as Matlab functions that ensure key database structures are maintained; performing such tasks by acting directly on the database can instead cause inconsistencies, and should be avoided.

    Java heap space

Running out of java heap space throws the error java.lang.OutOfMemoryError, and usually happens when trying to retrieve a large set of time series/operations from the database. Matlab needs to keep the whole retrieval in memory, and has a hard limit on this. The java heap size can be increased in the Matlab preferences, under General > Java Heap Memory.
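To see the current limit from within Matlab, you can query the standard java runtime directly (a quick sanity check; maxMemory is standard java API, accessible from Matlab):

    % Query the current java heap limit (in bytes) from within Matlab:
    maxHeap = java.lang.Runtime.getRuntime.maxMemory;
    fprintf('Java heap limit: %.1f MB\n',double(maxHeap)/1e6);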


    The following outlines the actions performed by the install_jconnector script (including instructions on how to perform the steps manually):

It is necessary to relocate the java connector from the Database directory of this code repository (the file mysql-connector-java-5.1.35-bin.jar, for version 5.1.35, which is also freely available online). The steps are summarized below, and are described in the Matlab documentation. This .jar file must be added to a static path where it can always be found by Matlab. A good candidate directory is the java/jarext/ subdirectory of the Matlab root directory (to determine the Matlab root directory, simply type matlabroot in an open Matlab command window).

    For Matlab to see this file, you need to add a reference to it in the javaclasspath.txt file (an alternative is to modify the classpath.txt file directly, but this may not be supported by newer versions of Matlab). This file can be found (or if it does not exist, should be created) in Matlab’s preferences directory (to determine this location, type prefdir in a command window, or navigate to it within Matlab using cd(prefdir)).

    This javaclasspath.txt file must contain a text reference to the location of the java connector on the disk. In the case recommended above, where it has been added to the java/jarext directory, we would add the following to the javaclasspath.txt file:
    $matlabroot/java/jarext/mysql-connector-java-5.1.35-bin.jar

    ensuring that the version number (5.1.35) matches your version of the J connector (if you are using a more recent version, for example).

    Note that javaclasspath.txt can also be in Matlab’s startup directory (for example, to modify this just for an individual user).

    After restarting Matlab, Matlab should then have the ability to communicate with mySQL servers (we will check whether this works below).

    Installing the Matlab/mySQL system

    The main tasks involved in installing the Matlab/mySQL interface are achieved by the install.m script, which runs the user through the steps below.

    Creating the mySQL database

If you have not already done so, creating a mySQL database to use with Matlab can be done using the SQL_create_db function. This requires that mySQL is installed on an accessible server, or on the local machine (i.e., using localhost). If the database has already been set up, then you do not need to use the SQL_create_db function, but you must then create a text file, sql_settings.conf, in the Database directory of the repository. This file contains four comma-delimited entries corresponding to the server name, database name, username, and password, as per the following:
    hostname,databasename,username,password

The settings listed here are those used to connect to the mySQL server. Remember that your password is sitting in this file in unencrypted plain text, so do not use a secure or important password.

    To check that Matlab can connect to external servers using the mySQL J-connector, using correct host name, username, and password settings, we introduce the Matlab routines SQL_OpenDatabase and SQL_CloseDatabase. An example usage is as follows:
    % Open a connection to the default mySQL database as dbc:
    % Connection details are stored in sql_settings.conf
    dbc = SQL_OpenDatabase;

    % <Do things with the database connection, dbc>

    % Close the connection, dbc:
    SQL_CloseDatabase(dbc);

    For this to work, the sql_settings.conf file must be set up properly. This file specifies (in unencrypted plain text!) the login details for your mySQL database in the form hostName,databaseName,username,password.

    An example sql_settings.conf file:
    localhost,myTestDatabase,benfulcher,myInsecurePassword

Once you have configured your sql_settings.conf file, and you can run dbc = SQL_OpenDatabase; and SQL_CloseDatabase(dbc) without errors, you can smile to yourself: Matlab can communicate successfully with your mySQL server, and you are ready to set up the database structure!

    Note that if your database is not set up on your local machine (i.e., localhost), then Matlab can communicate with a mySQL server through an ssh tunnel, which requires some additional setup (described below).

    Note also that the SQL_OpenDatabase function uses Matlab's Database Toolbox if a license is available, but otherwise will use java commands; both are supported and should give identical operational behavior.

    Changing between different databases

    To start writing a new dataset to a new database, or start retrieving data from a different database, you will need to change the database that Matlab is configured to connect to. This can be done using the SQL_ChangeDatabase script (which walks you through the steps and writes over the existing sql_settings.conf file), or by altering the sql_settings.conf file directly.

    Note that one can swap between multiple databases easily by commenting out lines of the sql_settings.conf file (adding % to the start of a line to comment it out).
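For example, a sql_settings.conf file with a second, commented-out database entry might look like this (with hypothetical database names):

    %localhost,myFirstDatabase,benfulcher,myInsecurePassword
    localhost,mySecondDatabase,benfulcher,myInsecurePassword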

    Setting up an ssh tunnel to a mySQL server

    In some cases, the mySQL server you wish to connect to requires an ssh tunnel. One solution is to use port forwarding from your local machine to the server. The port forward can be set up in the terminal using a command like:
    ssh -L 1234:localhost:3306 myUsername@myServer.edu

    This command connects port 1234 on your local computer to port 3306 (the default mySQL port) on the server. Now, telling Matlab to connect to localhost through port 1234 will connect it, through the established ssh tunnel, to the server. This can be achieved by specifying the server as localhost and the port number as 1234 in the sql_settings.conf file (or during the install process), which can be specified as the (optional) fifth entry, i.e.,:
    hostname,databasename,username,password,1234


    Using SQL_Add

    SQL_Add has two key inputs that specify:

    1. Whether to import a set of time series (specify ‘ts’), a set of operations (specify ‘ops’), or a set of master operations (specify ‘mops’),

    2. The name of the input file that contains appropriately-formatted information about the time series, master operations, or operations to be imported.

    In this section, we describe how to use SQL_Add to add master operations, operations, and time series to the database.

Users wishing to run the default hctsa code library on their own time-series dataset will only need to add time series to the database, as the full operation library is added by default by the install.m script. Users wishing to add additional features, using custom time-series code or different types of inputs to existing code files, can either edit the default INP_ops.txt and INP_mops.txt files provided with the repository, or create new input files for their custom analysis methods (as explained below for operations and master operations).

    REMINDER: Manually editing the database, including adding or deleting rows, is very dangerous, as it can create inconsistencies and errors in the database structure. Adding time series and operations to the database should only be done using SQL_Add which sets up the Results table of the database and ensures that the indexing relationships in the database are properly maintained.

    Example: Adding our library of master operations to the database

    By default, the install script populates the database with the default library of highly comparative time-series analysis code. The formatted input file specifying these pieces of code and their input parameters is INP_mops.txt in the Database directory of the repository. This step can be reproduced using the following command:
    SQL_Add('mops','INP_mops.txt');

    Once added, each master operation is assigned a unique integer, mop_id, that can be used to identify it. For example, when adding individual operations, the mop_id is used to map each individual operation to a corresponding master operation.

    Adding new pieces of executable code to the database

New functions and their input parameters can be added to the database using SQL_Add in the same way as described above: for example, by adding lines corresponding to the new code to the current INP_mops.txt file, or by generating a new input file and running SQL_Add on it. Once in the database, the software will then run the new pieces of code. Note that SQL_Add checks for repeats that already exist in the database, so that duplicate database entries cannot be added with SQL_Add.

New code added to the database should be checked for the following:

1. The output is a real number or a structure (and uses an output of NaN to assign all outputs to NaN).

2. The function is accessible in the Matlab path.

3. Outputs from the function have matching operations (or features), which also need to be added to the database.

Corresponding operations (or features) will then need to be added separately, to link to the structured outputs of master operations, as sketched below.
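As a sketch (using a hypothetical function, MyNewFeature, that returns a structure with fields mean and std), the master operation could be specified by a line in INP_mops.txt:

    MyNewFeature(x,2)     MyNewFeature_2

with matching operations specified by lines in INP_ops.txt, linking each output field to a feature name and keywords:

    MyNewFeature_2.mean    MyNewFeature_2_mean    exampleKeyword
    MyNewFeature_2.std     MyNewFeature_2_std     exampleKeyword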

    Example: Adding our library of operations to the database

    Operations can be added to the mySQL database using an appropriately-formatted input file, such as INP_ops.txt, as follows:
    SQL_Add('ops','INP_ops.txt');

    Every operation added to the database will be given a unique integer identifier, op_id, which provides a common way of retrieving specific operations from the database.

    Note that after (hopefully successfully) adding the operations to the database, the SQL_Add function indexes the operation keywords to an OperationKeywords table that produces a unique identifier for each keyword, and another linking table that allows operations with each keyword to be retrieved efficiently.

    Adding time series to the database

    After setting up a database with a library of time-series features, the next task is to add a dataset of time series to the database. It is up to the user whether to keep all time-series data in a single database, or have a different database for each dataset.

    Time series are added using the same function used to add master operations and operations to the database, SQL_Add, which imports time series data (stored in time-series data files) and associated keyword metadata (assigned to each time series) to the database.

    Time series can be indexed by assigning keywords to them (which are stored in the TimeSeriesKeywords table and associated index table, TsKeywordsRelate of the database).

Every time series added to the mySQL database is assigned a unique integer identifier, ts_id, which can be used to retrieve specific time series from the database.

Adding a set of time series to the database requires an appropriately formatted input file, and can be done in either of the following ways:
    % Add time series (embedded in a .mat file):
    SQL_Add('ts','INP_ts.mat');

    % Add time series (stored in data files) using an input text file:
    SQL_Add('ts','INP_ts.txt');

    We provide an example input file in the Database directory as INP_test_ts.txt, which can be added to the database, following the syntax above, using SQL_Add('ts','INP_test_ts.txt'), as well as a sample .mat file input as INP_test_ts.mat, which can be added as SQL_Add('ts','INP_test_ts.mat').


    For calculating missing entries in the database, SQL_Retrieve can be run as follows:
    SQL_Retrieve(ts_ids, op_ids, 'null');

    The third input, 'null', retrieves ts_ids and op_ids from the sets provided that contain (as-yet) uncalculated (i.e., NULL) elements in the database; these can then be calculated and stored back in the database. An example usage is given below:
    SQL_Retrieve(1:5, 'all', 'null');

    Running this code will retrieve null (uncalculated) data from the database for time series with ts_ids between 1 and 5 (inclusive) and all operations in the database, keeping only the rows and columns of the resulting time series x operations matrix that contain NULLs.

    When calculations are complete and one wishes to analyze all of the data stored in the database (not just NULL entries requiring computation), the third input should be set to ‘all’ to retrieve all entries in the Results table of the database, as described later.

    SQL_Retrieve writes to a local Matlab file, HCTSA.mat, that contains the data retrieved from the database.

Time series can be indexed by assigning keywords to them (which are stored in the TimeSeriesKeywords table and associated index table, TsKeywordsRelate, of the database). Assigning keywords to time series makes it easier to retrieve a set of time series with a given set of keywords for analysis, and to group time series annotated with different keywords for classification tasks.

Every time series added to the mySQL database is assigned a unique integer identifier, ts_id, which can be used to retrieve specific time series from the database.

    SQL_Add syntax

Adding a set of time series to the database requires an appropriately formatted input file; for an input file INP_ts.txt, for example, the appropriate code is:

    We provide an example input file in the Database directory as INP_test_ts.txt, which can be added to the database, following the syntax above, using SQL_Add('ts','INP_test_ts.txt'), as well as a sample .mat file input as INP_test_ts.mat, which can be added as SQL_Add('ts','INP_test_ts.mat').

    % Add time series (embedded in a .mat file):
    SQL_Add('ts','INP_ts.mat');
    
    % Add time series (stored in data files) using an input text file:
    SQL_Add('ts','INP_ts.txt');

    Writing calculations back to the database using SQL_store

    Once calculations have been performed using Matlab on local files, the results must be written back to the database. This task is performed by SQL_store, which reads the data in HCTSA.mat, checks that the metadata still matches the database, and then begins updating the Output, Quality, and CalculationTime columns of the Results table in the mySQL database. This can be done by simply running:
    SQL_store;

Depending on database latencies, this can be a relatively slow process, up to 20-25 s per time series, updating each row in the Results table individually using mySQL UPDATE statements. However, this overhead buys the ability to distribute the computation across multiple compute nodes, and to index and retrieve the stored data systematically; keeping results only in local Matlab files can be extremely inefficient, and can be untenable for large datasets.

  • 'locationDependent': features that change under mean shifts of a time series,

  • 'spreadDependent': features that change under rescaling about their mean, and

  • 'lengthDependent': features that are highly sensitive to the length of a time series.

    Setting a data context

The first step is defining the set of features to compare against (here we use the default hctsa library), and the set of time-series data on which behavior is going to be assessed. If you have just developed a new algorithm for time-series analysis and want to see how it performs across a range of interdisciplinary time-series data, then you may want to use a diverse set of time series sampled from across science. This can be easily achieved using our set of 1000 time series; a random selection of 25 such time series is plotted below (only the first 250 samples of each are shown, to aid visualization):

    Pre-computed results for a recent version of hctsa can be downloaded from figshare as HCTSA_Empirical1000.mat.

Alternatively, features can be recomputed using the input file for the time-series dataset, provided in the same figshare data repository; this ensures implementation consistency on your local compute architecture (i.e., using TS_Init('INP_Empirical1000.mat') to initialize, followed by compute commands involving TS_Compute).
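For example (a minimal sketch, assuming INP_Empirical1000.mat has been downloaded into the working directory):

    % Initialize a new hctsa dataset from the figshare input file:
    TS_Init('INP_Empirical1000.mat');
    % Compute all features (with parallel processing enabled):
    TS_Compute(true);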

    However, if you only ever analyze a particular type of data (e.g., rainfall), then perhaps you're more interested in which methods perform similarly on rainfall data. For this case, you can produce your own data context for custom data using properly structured input files as explained here.

    EXAMPLE 1: Determining the relationship between a new feature and existing features

    We use the (hypothetical) example of a hot new feature, :boom: hot_feature1 :boom:, recently published in Science (and not yet in the hctsa library), and attempt to determine whether it is completely new, or whether there are existing features that exhibit similar performance to it. Think first about the data context (described above), which allows you to understand the behavior of thousands of features on a diverse dataset with which to compare the behavior of our new feature, hot_feature1. This example uses the Empirical1000 data context downloaded as HCTSA_Empirical1000.mat from figshare.

    1. Computing the new features

Getting the feature values for the new feature, hot_feature1, could be done directly (using TS_CalculateFeatureVector), but in order to maintain the HCTSA structure, we instead produce a new HCTSA.mat file containing just hot_feature1 and the same time series. For example, to compare to the HCTSA_Empirical1000.mat file hosted on figshare, you should use the same version of hctsa, to enable a valid comparison to the same set of features.

    We first generate an input file, INP_hot_master.txt containing the function call, that takes in a time series, x:
    MyHotFeature(x)     hot_master

    Any additional arguments to the function MyHotFeature.m should be specified here. MyHotFeature.m must also be in a form that outputs a structure (or a single real number, as explained here).

    The interesting field in the structure output produced by MyHotFeature(x) is hotFeature1, which needs to be specified in another input text file, INP_hot_features.txt, for example, as:
    hot_master.hotFeature1     hot_feature1      hot,science

    where we have given this feature two keywords: hot and science.

    So now we are able to initiate a new hctsa calculation, specifying custom code calls (master) and features to extract from the code call (features), as:
    TS_Init('INP_1000ts.mat','INP_hot_master.txt','INP_hot_features.txt',true,'HCTSA_hot.mat');

    This generates a new file, HCTSA_hot.mat, containing information about the 1000 time series, and the new hot feature, hot_feature1, which can then be computed as:
    TS_Compute(false,[],[],'missing','HCTSA_hot.mat');

    2. Combining

So now we have both a context for the behavior of a library of >7000 features on 1000 diverse time series, and the behavior of our hot new feature. It is time to combine them and look for inter-relationships!
    TS_Combine('HCTSA_Empirical1000.mat','HCTSA_hot.mat',true,true,'HCTSA_merged.mat');

    3. Comparing

    Now that we have all of the data in the same HCTSA file, we can compare the behavior of the new feature to the existing library of features. This can be done manually by the researcher, or by using standard hctsa functions; the most relevant is TS_SimSearch. We can find the ID assigned to our new hot_feature in the merged HCTSA file as:
    load('HCTSA_merged.mat','Operations');
    Operations(strcmp(Operations.Name,'my_hot_feature'),:)

    which tells us that the ID of my_hot_feature in HCTSA_merged.mat is 7703. Then we can use TS_SimSearch to explore the relationship of our hot new feature to other features in the hctsa library (in terms of linear, Pearson, correlations):
    TS_SimSearch(7703,'tsOrOps','ops','whatData','HCTSA_merged.mat','whatPlots',{'scatter','matrix'})

    We find that our feature is reproducing the behavior of the first zero of the autocorrelation function (the first match: first_zero_ac; see Interpreting Features for more info on how to interpret matching features):

The pairwise distance matrix (distances are 1 - |r|, for Pearson correlation coefficients, r) produced by TS_SimSearch provides another visualization of the context of this hot new feature (in this case there are so many highly correlated features that the matrix doesn't reveal much subtle structure):
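For reference, this distance matrix can be computed directly from the data matrix (a sketch, loading the data via TS_LoadData as in the example further below):

    % Load the data matrix (time series x operations):
    [TS_DataMat,TimeSeries,Operations] = TS_LoadData('HCTSA_merged.mat');
    % Pairwise feature distances, 1 - |r|, for Pearson correlations, r, between columns:
    D = 1 - abs(corr(TS_DataMat,'rows','pairwise'));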

    4. Interpreting

    In this case, the hot new feature wasn't so hot: it was highly (linearly) correlated to many existing features (including the simple zero-crossing of the autocorrelation function, first_zero_ac), even across a highly diverse time-series dataset. However, if you have more luck and come up with a hot new feature that shows distinctive (and useful) performance, then it can be incorporated in the default set of features used by hctsa by adding the necessary master and feature definitions (i.e., the text in INP_hot_master.txt and the text in INP_hot_features.txt) to the library files (INP_mops.txt and INP_ops.txt in the Database directory of hctsa), as explained here. You might even celebrate your success by sharing your new feature with the community, by sending a Pull Request to the hctsa github repository!! :satisfied:

    EXAMPLE 2: Determining the relationship between an existing hctsa feature and the rest of the library.

    If using a set of 1000 time series, then this is easy because all the data is already computed in HCTSA_Empirical1000.mat on figshare :relaxed:

    For example, say we want to find neighbors to the fastdfa algorithm from Max Little's website. This algorithm is already implemented in hctsa in the code SC_fastdfa.m as the feature SC_fastdfa_exponent. We can find the ID of this feature by finding the matching row in the Operations table (ID=750):
    Operations(strcmp(Operations.Name,'SC_fastdfa_exponent'),:)

    and then find similar features using TS_SimSearch, e.g., as:
    TS_SimSearch(750,'tsOrOps','ops','whatData','HCTSA_Empirical1000.mat','whatPlots',{'scatter','matrix','network'})

    Yielding:

    We see that other features in the library indeed have strong relationships to SC_fastdfa_exponent, including some unexpected relationships with the stationarity estimate, StatAvl25.

    Combining the network visualization with scatter plots produces the figures in our original paper on the empirical structure of time series and their methods (cf. Sec. 2.4 of the supplementary text), see below:

    Specific pairwise relationships can be probed in more detail (visualizing the types of time series that drive any relationship) using TS_Plot2d, e.g., as:
    theFeatureIDs = [750,544]; % IDs for the two features of interest
    [TS_DataMat,TimeSeries,Operations] = TS_LoadData('HCTSA_Empirical1000.mat'); % load data
    featureData = TS_DataMat(:,theFeatureIDs); % take the subset
    operationNames = Operations.Name(theFeatureIDs); % names of the two features
    annotateParams = struct('n',6); % annotate six time series with the cursor


Time series and operations have their own tables that contain the metadata associated with each piece of data and each operation. The results of applying each operation to each time series are stored in the Results table, which has a row for every combination of time series and operation, where we also record calculation times and the quality of outputs (for cases where the output of the operation was not a real number, or when some error occurred in the computation). Note that while the data for each time series is stored on the database, the executable time-series analysis code files are not, such that all code files must be in Matlab's path (all required paths can be added by running startup.m).

    Another handy (but dangerous) function to know about is SQL_reset, which will delete all data in the mySQL database, create all the new tables, and then fill the database with all the time-series analysis operations. The TimeSeries, Operations, and MasterOperations tables can be generated by running SQL_CreateAllTables, with master operations and operations added to the database using SQL_Add commands (described here).

You now have the basic table structure set up in the database, and have done the first bits of mySQL manipulation through the Matlab interface.

It is very useful to be able to inspect the database directly, using a graphical interface. A very good example for Mac is the excellent, free application, Sequel Pro (a screenshot is shown below, showing the first 40 rows of the Operations table of our default operations library). Applications similar to Sequel Pro exist for Windows and Linux platforms. Applications that allow the database to be inspected are extremely useful; however, they should not be used to manipulate the database directly. Instead, Matlab scripts should be used to interact with the database, to ensure that the proper relationships between the different tables are maintained (including the indexing of keyword labels).

[Screenshot: Sequel Pro for Mac]

    Low dimensional representation

The software also provides a basic means of visualizing low-dimensional representations of the data, using PCA, via the function TS_PlotLowDim.

    This can be done for a time-series dataset as follows:
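A minimal call (a sketch, assuming the 'pca' option as the two-dimensional projection method):

    % Plot the normalized dataset in the space of its two leading principal components:
    TS_PlotLowDim('norm','pca');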

    This uses the normalized data (specifying 'norm'), plotting time series in the reduced, two-dimensional principal components space of operations (the leading two principal components of the data matrix).

By default, the user will be prompted to select 10 points on the plot to annotate with their corresponding time series; each annotation shows the first 300 points of that time series (and, by default, its name).

    After selecting 10 points, we have the following:

    The proportion of variance explained by each principal component is provided in parentheses in the axis label.

    Working with short time series

Although much of the time-series analysis literature has focused on developing methods for quantifying complex temporal structure in long time-series recordings, many time series analyzed in practice are relatively short. hctsa has been successfully applied to time-series classification problems in the data mining literature, including datasets of time series as short as 60 samples (e.g., the Bonn University dataset). However, time-series data are sometimes even shorter, including yearly economic data across perhaps six years, or biological data measured at, say, 10 points across a lifespan. Although many features in hctsa will not give a meaningful output when applied to a short time series, hctsa includes methods for filtering such features, after which the remaining features can be used for analysis.

    The number of features with a meaningful output, from time series as short as 5 samples, up to those with as many as 500 samples, is shown below (where the maximum set of 7749 is shown as a dashed horizontal line):

    In each case, over 3000 features can be computed. Note that one must be careful when representing a 5-dimensional object with thousands of features, the vast majority of which will be highly intercorrelated.
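Such filtering can be performed at the normalization step. A sketch (assuming TS_Normalize's second input sets the minimum proportion of good values required to keep each time series and each operation):

    % Filter and then normalize (using 'mixedSigmoid'), keeping time series with
    % at least 70% good values, and operations with 100% good values:
    TS_Normalize('mixedSigmoid',[0.70,1]);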

    Clearing or removing data

    Sometimes you might wish to remove a problematic set of time series from the database (that might have been input incorrectly, for example), including their metadata, and all records of computed data. Other times you might find a problem with the implementation of one of the operations. In this case, you would like to retain that operation in the database, but flush all of the data currently computed for it (so that you can recompute new values). Both of these types of tasks (both removing and clearing time series or operations) can be achieved with the function SQL_ClearRemove.

    This function takes in information about whether to clear (clear any calculated data) or remove (completely delete the given time series or operations from the database).

    Example usages are given below:
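A sketch of example calls (assuming the inputs specify: 'ts' or 'ops', a vector of IDs, and whether to remove (true) rather than clear (false)):

    % Clear computed results for time series with IDs 1:5 (metadata is retained):
    SQL_ClearRemove('ts',1:5,false);
    % Completely remove operations with IDs 10:20 from the database:
    SQL_ClearRemove('ops',10:20,true);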

    % Clear time series with ts_ids in the range 10-15
    SQL_ClearRemove('ts',10:15,0);
    
    % Remove time series with ts_ids in the range 10-15
    SQL_ClearRemove('ts',10:15,1);
    
    % Clear operations with op_ids in the range 100-200
    SQL_ClearRemove('ops',100:200,0);
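
    Once an operation's values have been cleared, new values can be computed and written back using the standard SQL_Retrieve–TS_Compute–SQL_store loop. A minimal sketch (the id ranges here are hypothetical, and default arguments are assumed for SQL_store):

    % Retrieve the entries requiring calculation for the cleared operations:
    SQL_Retrieve(10:15,100:200,'null');

    % Compute the missing values locally:
    TS_Compute(false,[],[],'missing');

    % Write the new results back to the database:
    SQL_store();
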
    Customizing annotations

    Annotation properties can be customized in some detail by specifying properties in the annotateParams input structure, for example:

    % Set whether to plot marginal distributions:
    showDistributions = false;
    % Set up the annotateParams structure:
    annotateParams = struct;
    annotateParams.n = 8; % annotate 8 time series
    annotateParams.maxL = 150; % annotate the first 150 samples of each time series
    annotateParams.userInput = false; % points allocated randomly rather than selected by the user
    annotateParams.textAnnotation = false; % don't display names of annotated time series

    % Generate a plot using these settings:
    TS_PlotLowDim('norm','pca',showDistributions,'',annotateParams);

    which yields:

    [figure: annotated plot]

    Plotting grouped time series

    If groups of time series have been specified (using TS_LabelGroups), they are automatically recognized by TS_PlotLowDim, which distinguishes the labeled groups in the resulting two-dimensional annotated time-series plot.

    Consider the sample dataset containing 20 periodic signals with additive noise (given the keyword noisy in the database), and 20 purely periodic signals (given the keyword periodic in the database). After retrieving and normalizing the data, we store the two groups in the metadata for the normalized dataset HCTSA_N.mat:

    TS_LabelGroups('norm',{'noisy','periodic'},'ts');

    We found:
    noisy -- 20 matches
    periodic -- 20 matches
    Saving group labels and information back to HCTSA_N.mat... Saved.

    Now when we plot the dataset using TS_PlotLowDim, it automatically distinguishes the two groups in the plot and attempts to classify the difference between them in the reduced principal components space.

    Running the following:

    annotateParams = struct('n',6); % annotate 6 time series
    showDistributions = true; % plot marginal distributions
    TS_PlotLowDim('norm','pca',showDistributions,'',annotateParams);

    The function then directs you to select 6 points to annotate with their corresponding time series.

    Notice how the two labeled groups are distinguished as red and blue points, and a linear classification boundary has been added (with the in-sample misclassification rate annotated in the title and on each individual principal component). If marginal distributions are plotted (setting showDistributions = true, as above), they are labeled using the same colors.
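
    The in-sample linear classification can also be reproduced manually for further analysis. Here is a minimal sketch using a linear discriminant (fitcdiscr; not necessarily the classifier used internally by TS_PlotLowDim), assuming the group labels assigned by TS_LabelGroups are stored in the Group column of the TimeSeries table:

    % Load the normalized data and group labels (the Group column is
    % assumed to hold the group assigned to each time series):
    [TS_DataMat,TimeSeries] = TS_LoadData('HCTSA_N.mat');
    groupLabels = TimeSeries.Group;

    % Two-dimensional PCA projection of the data matrix:
    [~,pcScores] = pca(TS_DataMat);

    % Fit a linear discriminant in the two-component space and report
    % the in-sample (resubstitution) misclassification rate:
    mdl = fitcdiscr(pcScores(:,1:2),groupLabels);
    fprintf('In-sample misclassification: %.1f%%\n',100*resubLoss(mdl));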
