The first step in analyzing a dataset involves processing the data matrix, which can be done using TS_Normalize. This involves filtering out operations or time series that produced many errors or special-valued outputs, and then normalizing the output of all operations, which is typically done in-sample, according to an outlier-robust sigmoidal transform (although other normalizing transformations can be selected). The TS_Normalize function writes the new, filtered, normalized matrix to a local file called HCTSA_N.mat, which contains normalized and trimmed versions of the information in HCTSA.mat.
Example usage is as follows:
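A minimal call consistent with the parameter description below:

```matlab
% Filter, then normalize, the data in HCTSA.mat, writing HCTSA_N.mat:
TS_Normalize('mixedSigmoid',[0.8,1.0]);
```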
The first input controls the normalization method, in this case a scaled, outlier-robust sigmoidal transformation, specified with 'mixedSigmoid'. The second input controls the filtering of time series and operations based on minimum thresholds for good values in the corresponding rows (corresponding to time series; filtered first) and columns (corresponding to operations; filtered second) of the data matrix.
In the example above, time series (rows of the data matrix) with more than 20% special values (specifying 0.8) are first filtered out, and then operations (columns of the data matrix) containing any special values (specifying 1.0) are removed. Columns with approximately constant values are also filtered out. After filtering the data matrix, the outlier-robust 'mixedSigmoid' transformation is applied to all remaining operations (columns). The filtered, normalized matrix is saved to the file HCTSA_N.mat.
Details about the normalization are saved to the HCTSA_N.mat file as normalizationInfo, a structure that contains the normalization function, the filtering options used, and the corresponding TS_Normalize code that can be used to re-run the normalization.
It makes sense to weight each operation equally for the purposes of dimensionality reduction, and thus to normalize all operations to the same range using a transformation like 'scaledRobustSigmoid', 'scaledSigmoid', or 'mixedSigmoid'. For the case of calculating mutual information distances between operations, however, one would rather not distort the distributions, and instead perform no normalization (using 'raw') or a linear transformation (like 'zscore'). The full list of implemented normalization transformations is given in the function BF_NormalizeMatrix.
Note that the 'scaledRobustSigmoid' transformation does not tolerate distributions with an interquartile range of zero, which will be filtered out; 'mixedSigmoid' will treat these distributions in terms of their standard deviation (rather than interquartile range).
Filtering parameters depend on the application. Some applications can allow the filtering thresholds to be relaxed. For example, setting [0.7,0.9] removes time series with less than 70% good values, and then removes operations with less than 90% good values. Some applications can tolerate some special-valued outputs from operations (like some clustering methods, where distances are simply calculated using those operations that did not produce special-valued outputs for each pair of objects), but others cannot (like Principal Components Analysis); the filtering parameters should be specified accordingly.
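Such a relaxed filtering would then be specified with the same calling pattern as above:

```matlab
% Keep time series with at least 70% good values, then operations
% with at least 90% good values, before normalizing:
TS_Normalize('mixedSigmoid',[0.7,0.9]);
```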
Analysis can be performed on the data contained in HCTSA_N.mat in the knowledge that different settings for filtering and normalizing the results can be applied at any time by simply rerunning TS_Normalize, which will overwrite the existing HCTSA_N.mat with the results of the new normalization and filtering settings.
TS_FilterData
It is often useful to check whether the feature-based classification results of a given analysis are driven by 'trivial' types of features that do not depend on the dynamical properties of the data, e.g., features sensitive to time-series length, location (e.g., mean), or spread (e.g., variance). Because these features are labeled with the keywords 'lengthdep', 'locdep', and 'spreaddep', you can easily filter them out to check the robustness of your analysis.
An example:
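A sketch of filtering out length-dependent features; the TS_GetIDs output order, the TS_FilterData argument order, and the output filename are assumptions here, so check each function's help text:

```matlab
% Find operations tagged 'lengthdep', keep everything else, and save
% the filtered dataset to a new file (hypothetical name):
[ID_lengthdep,ID_notLengthdep] = TS_GetIDs('lengthdep','raw','ops');
TS_FilterData('raw',[],ID_notLengthdep,'HCTSA_lengthFilt');
```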
You could use the same template to filter 'locdep' or 'spreaddep' features (or any other combination of keyword labels). You can then go ahead with analyzing the filtered HCTSA dataset as above, except using your new filename, HCTSA_locFilt.
For the purposes of visualizing the data matrix, it is often desirable to reorder the rows and columns to place similar rows adjacent to one another, and likewise for columns. This reordering can be done using hierarchical linkage clustering, via the function TS_Cluster:
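In its simplest form, this is just:

```matlab
% Cluster rows and columns of the normalized data in HCTSA_N.mat:
TS_Cluster();
```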
This function reads in the data from HCTSA_N.mat, and stores the reordering of rows and columns back into HCTSA_N.mat in the ts_clust and op_clust structures (including, if the size is manageable, the pairwise distance information). Visualization functions (such as TS_PlotDataMatrix and TS_SimSearch) can then take advantage of this information, using the general input label 'cl'.
Once a set of operations have been computed on a time-series dataset, the results are stored in a local HCTSA.mat file. The result can be used to perform a wide variety of highly comparative analyses, such as those outlined in our paper.
The type of analysis employed should then be motivated by the specific time-series analysis task at hand. Setting up the problem, guiding the methodology, and interpreting the results requires strong scientific input that should draw on domain knowledge, including the questions asked of the data, experience performing data analysis, and statistical considerations.
The first main component of an hctsa analysis involves filtering and normalizing the data using TS_Normalize, described here, which produces a file called HCTSA_N.mat. Information about the similarity of pairs of time series and operations can be computed using TS_Cluster, described here, which stores this information in HCTSA_N.mat. The suite of plotting and analysis tools that we provide with hctsa work with this normalized data, stored in HCTSA_N.mat, by default.
Visualizing structure in the data matrix using TS_PlotDataMatrix.
Visualizing the time-series traces using TS_PlotTimeSeries.
Visualizing low-dimensional structure in the data using TS_PlotLowDim.
Exploring similar matches to a target time series using TS_SimSearch.
Visualizing the behavior of a given operation across the time-series dataset using TS_FeatureSummary.
For time-series classification tasks, groups of time series can be labeled using the TS_LabelGroups function described here; this group label information is stored in the local HCTSA*.mat file and used by default in the various plotting and analysis functions provided. Additional analysis functions are provided for basic time-series classification tasks:
Explore the classification performance of the full library of features using TS_Classify
Determine the features that (individually) best distinguish between the labeled groups using TS_TopFeatures
hctsa allows you to map time-series data into a unified feature space, a representation that is well suited to then applying machine learning methods to perform classification where the goal is to predict a class label assigned to each time series from features of its dynamics.
Class labels can be assigned to time series in an hctsa dataset using the function TS_LabelGroups, which populates the Group column of the TimeSeries table as a categorical labeling of the data (the class labels the classifier will attempt to predict). These group labels are used by a range of analysis functions, including TS_PlotLowDim, TS_TopFeatures, and TS_Classify.
The machinery of TS_LabelGroups assumes that the class labels you are interested in are contained in the Keywords column of the TimeSeries table (set up from your original input file when TS_Init was run).
If you want to label according to a specific set of keywords, you can do this by specifying the keywords that define the unique groups. The example below assigns labels to two groups of time series in HCTSA.mat (specifying the shorthand 'raw' for this default, un-normalized data), corresponding to those labeled as 'parkinsons' and those labeled as 'healthy':
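Following the calling convention described above:

```matlab
% Label time series carrying the 'parkinsons' or 'healthy' keywords:
TS_LabelGroups('raw',{'parkinsons','healthy'});
```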
Note that any time series not labeled as either 'parkinsons' or 'healthy' are left unlabeled.
If every time series has a single keyword that uniquely labels it, TS_LabelGroups can automatically detect this by running TS_LabelGroups('raw',{}).
More complex labeling (e.g., using custom combinations of keywords) is not implemented in TS_LabelGroups, but can be achieved by writing a script that performs logical operations on calls to TS_LabelGroups and saves the result back to the TimeSeries.Group column. One example is combining two labels (male/female and day/night) on fly movement data.
By default, the group labels are saved back to the data file (in this example, HCTSA.mat).
Group labels can be reassigned at any time by re-running the TS_LabelGroups function, and can be cleared by running, e.g., TS_LabelGroups('raw','clear').
Assigned labels are used by the analysis, plotting, and classification functions of hctsa.
Note: if you assign a labeling to a given dataset (e.g., HCTSA.mat), then this labeling will remain with the normalized data (e.g., after running TS_Normalize).
The hctsa package provides a simple means of plotting time series: the TS_PlotTimeSeries function.
For example, to plot a set of time series that have not been assigned groups, we can run the following:
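For example, using the normalized data:

```matlab
% Plot a selection of time series from HCTSA_N.mat:
TS_PlotTimeSeries('norm');
```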
For our assorted set of time series, this produces the following:
Showing the first 400 samples of 10 selected time series, equally spaced through the TimeSeries IDs in HCTSA_N.mat.
Many more custom plotting options are available by passing an options structure to TS_PlotTimeSeries, including the 'plotFreeForm' option, which allows very many time series to be shown in a single plot (without the usual axis borders):
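A sketch of such a call; the argument order (number of series, then subset, then maximum length, then options) is an assumption here, so check the function's help text:

```matlab
% Free-form plot of 40 time series, first 300 samples each:
plotOptions = struct('plotFreeForm',true,'displayTitles',false);
TS_PlotTimeSeries('norm',40,[],300,plotOptions);
```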
producing an overview picture of the first 300 samples of 40 time series (spaced through the rows of the data matrix).
When the time series have been assigned groups (using TS_LabelGroups, here), this information is automatically incorporated into TS_PlotTimeSeries, which then plots a given number of time series from each group and colors them accordingly:
In this case the two labeled groups of time series are recognized and colored by the function (red: noisy; blue: no noise), and 5 time series from each group are plotted, showing the first 500 samples of each.
The clustered data matrix (if clustering has been performed; otherwise the non-clustered data matrix is used) can be visualized by running TS_PlotDataMatrix.
This will produce a colored visualization of the data matrix such as that shown below.
When data has been grouped according to a set of distinct keywords and stored as group metadata (using the TS_LabelGroups function), these groups can also be visualized using TS_PlotDataMatrix('colorGroups',true).
Running TS_PlotDataMatrix('norm') plots the data contained in the local file HCTSA_N.mat, yielding:
where black rectangles label missing values, and other values are shown from low (blue) to high (red) after normalization using the scaled outlier-robust sigmoidal transformation. Due to the size of the matrix, operations are not labeled individually.
Examples of time series segments are shown to the left of the plot, and when the middle plot is zoomed, the time-series annotations remain matched to the data matrix:
It can be useful to display the matrix with the order of time series and operations preserved, but the relationships between rows and columns can be difficult to visualize when ordered randomly.
If you have run TS_Cluster, clustering information will have been saved back to the HCTSA file; when this information exists, it is loaded and used to reorder rows and columns by similarity. Here's an example of running TS_PlotDataMatrix on data that has this clustering information:
Much prettier, hey?! By reordering rows and columns, this representation reveals correlated patterns of outputs across different types of operations, and similar sets of properties between different types of time series.
In this example, we consider a set of 20 periodic and 20 noisy periodic signals. We assigned the time series in HCTSA.mat to groups (using TS_LabelGroups('raw',{'periodic','noisy'})), then normalized the data matrix (TS_Normalize), and then clustered it (TS_Cluster). So now we have a clustered data matrix containing thousands of summaries of each time series, as well as pre-assigned group information as to which time series are periodic and which are noisy. When the time series have been assigned to groups, this can be accessed by switching on the 'colorGroups' setting:
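Following the usage shown earlier:

```matlab
% Color the rows of the clustered data matrix by their group labels:
TS_PlotDataMatrix('colorGroups',true);
```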
producing the following two plots:
When group information is not used (the left plot), the data is visualized in the default blue/yellow/red color scheme, but when the assigned groups are colored (right plot), we see that the clustered dataset separates perfectly into the periodic (green) and noisy (blue) time series, and we can visualize the features that contribute to the separation.
While the global structure of a time-series dataset can be investigated by plotting the data matrix or a low-dimensional representation of it, sometimes it can be more interesting to retrieve and visualize relationships between a set of nearest neighbors to a particular time series of interest.
The hctsa framework provides a way to easily compute distances between pairs of time series, e.g., as a Euclidean distance between their normalized feature vectors. This allows very different time series (in terms of their origin, their method of recording and measurement, and their number of samples) to be compared straightforwardly according to their properties, measured by the algorithms in our hctsa library.
For this, we use the TS_SimSearch function, specifying the ID of the time series of interest (i.e., the ID field of the TimeSeries table) as the first input, and the number of neighbors with the 'numNeighbors' input specifier (default: 20). By default, data is loaded from HCTSA_N.mat, but a custom source can be specified using the 'whatDataFile' input specifier (e.g., TS_SimSearch('whatDataFile','HCTSA_custom.mat')).
After specifying the target and how many neighbors to retrieve, TS_SimSearch prints the list of neighbors and their distances to screen, and also provides a range of plotting options to visualize the neighbors. The plots to produce are specified as a cell using the 'whatPlots' input.
To investigate the pairwise relationships between all neighbors retrieved, you specify the 'matrix' option of the TS_SimSearch function. An example output using a publicly available EEG dataset, retrieving 14 neighbors from the time series with ID = 1, as TS_SimSearch(1,'whatPlots',{'matrix'},'numNeighbors',14), is shown below:
The specified target time series (ID = 1) is shown as a white star, and all 14 neighbors are shown, labeled on the left of the plot with their respective IDs and a 100-sample subset of their time traces.
Pairwise distances are computed between all pairs of time series (as a Euclidean distance between their feature vectors), and plotted using color, from low (red = more similar pairs of time series) to high (blue = more different pairs of time series).
In this case we see that the purple and green classes are relatively similar under this distance metric (eyes open and eyes closed), whereas the orange time series (seizure) are distinguished.
Another way to visualize the similarity (under our feature-based distance metric) of all pairs of neighbors is using a network visualization. This is specified as:
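Following the input pattern shown above for the 'matrix' plot:

```matlab
% Retrieve neighbors of time series ID 1 and plot them as a network:
TS_SimSearch(1,'whatPlots',{'network'});
```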
which produces something like the following:
The strongest links are visualized as blue lines (by default, the top 40% of strongest links are plotted, cf. the legend showing 0.9, 0.8, 0.7, and 0.6 for the top 10%, 20%, 30%, and 40% of links, respectively).
The target is distinguished (as purple in this case), and the other classes of time series are shown using color, with names and time-series segments annotated. Again, you can see that the EEG time series during seizure (blue) are distinguished from eyes open (red) and eyes closed (green).
The scatter setting visualizes the relationship between the target and each of 12 time series with the most similar properties to the target. Each subplot is a scatter of the (normalized) outputs of each feature for the specified target (x-axis) and the match (y-axis). An example is shown below.
Multiple output plots can be produced simultaneously by specifying many types of plots as follows:
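For example:

```matlab
% Produce the matrix, network, and scatter visualizations together:
TS_SimSearch(1,'whatPlots',{'matrix','network','scatter'});
```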
This produces a plot of each type.
Note that pairwise distances can be pre-computed and saved in the HCTSA*.mat file using TS_PairwiseDist for custom distance metrics (this is done by default in TS_Cluster for datasets containing fewer than 1000 objects). TS_SimSearch checks for this information in the specified input data (in the ts_clust or op_clust structure) and uses it to retrieve neighbors. If distances have not previously been computed, distances from the target are computed as Euclidean distances (time series) or absolute correlation distances (operations) between feature vectors within TS_SimSearch.
The software also provides a basic means of visualizing low-dimensional representations of the data, using PCA, as TS_PlotLowDim.
This can be done for a time-series dataset as follows:
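For example:

```matlab
% Two-dimensional PCA projection of the normalized data matrix:
TS_PlotLowDim('norm');
```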
This uses the normalized data (specifying 'norm'), plotting time series in the reduced, two-dimensional principal-components space of operations (the leading two principal components of the data matrix).
By default, the user will be prompted to select 10 points on the plot to annotate with their corresponding time series, which are annotated as the first 300 points of that time series (and their names by default).
After selecting 10 points, we have the following:
The proportion of variance explained by each principal component is provided in parentheses in the axis label.
Annotation properties can be altered in some detail by specifying properties in the annotateParams input variable, for example:
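A sketch; the structure field names and the argument order here are assumptions, so check TS_PlotLowDim's help text for the supported annotation fields:

```matlab
% Annotate 8 points, without prompting for interactive input (field
% names hypothetical):
annotateParams = struct('n',8,'userInput',false);
TS_PlotLowDim('norm','pca',false,annotateParams);
```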
which yields:
If groups of time series have been specified (using TS_LabelGroups), then these are automatically recognized by TS_PlotLowDim, which will distinguish the labeled groups in the resulting two-dimensional annotated time-series plot.
Consider the sample dataset containing 20 periodic signals with additive noise (given the keyword noisy in the database), and 20 purely periodic signals (given the keyword periodic in the database). After retrieving and normalizing the data, we store the two groups in the metadata for the normalized dataset HCTSA_N.mat:
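Following the labeling convention described earlier, applied here to the normalized data:

```matlab
% Store 'periodic'/'noisy' group labels in HCTSA_N.mat:
TS_LabelGroups('norm',{'periodic','noisy'});
```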
Now when we plot the dataset with TS_PlotLowDim, it will automatically distinguish the groups in the plot and attempt to classify the difference in the reduced principal-components space.
Running the following:
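For example (the second input selecting PCA is an assumption; check the function's help text):

```matlab
% Plot the labeled, normalized dataset in its leading two PCs:
TS_PlotLowDim('norm','pca');
```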
The function then prompts you to select 6 points to annotate with their corresponding time series, producing the following:
Notice how the two labeled groups have been distinguished as red and blue points, and a linear classification boundary has been added (with the in-sample misclassification rate annotated in the title and on each individual principal component). If marginal distributions are plotted (setting showDistribution = true above), they are labeled according to the same colors.
Because this dataset contains 3 classes that were previously labeled (using TS_LabelGroups({'seizure','eyesOpen','eyesClosed'})), the function shows these class assignments using color labels to the left of the plot (purple, green, and orange in this case).
In a time-series classification task, you are often interested in not only determining differences between the labeled classes, but also interpreting them in the context of your application. For example, an entropy estimate may give low values to the RR interval series measured from a healthy population, but higher values to those measured from patients with congestive heart failure. When performing a highly comparative time-series analysis, we have computed a large number of time-series features and can use these results to arrive at these types of interpretable conclusions automatically.
A simple way to determine which individual features are useful for distinguishing the labeled classes of your dataset is to compare each feature individually in terms of its ability to separate the labeled classes of time series. This can be achieved using the TS_TopFeatures function.
In this example, we consider 100 EEG time series from seizure, and 100 EEG time series from eyes open (from the publicly available Bonn University dataset). After running TS_LabelGroups (as simply TS_LabelGroups(), which automatically detects the two keywords assigned to the two classes of signals), we run simply:
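That is:

```matlab
% Compare all features' ability to separate the labeled classes:
TS_TopFeatures();
```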
By default this function will compare the in-sample linear classification performance of all operations in separating the labeled classes of the dataset (individually), produce plots summarizing the performance across the full library (compared to an empirical null distribution), characterize the top-performing features, and show their dependencies on the dataset. Inputs to the function control how these tasks are performed (including the statistic used to measure individual performance, what plots to produce, and whether to produce nulls).
First, we get an output to screen, showing the mean linear classification accuracy across all operations, and a list of the operations with top performance (ordered by their test statistic, with their ID shown in square brackets, and keywords in parentheses):
Here we see that measures of entropy are dominating this top list, including measures of the time-series distribution.
An accompanying plot summarizes the performance of the full library on the dataset, compared to a set of nulls (generated by shuffling the class labels randomly):
Here we can see that the default feature library (blue: 7275 features remaining after filtering and normalizing) performs much better than the randomized features (red: null).
Next we get a set of plots showing the class probability distributions for the top features. For example:
This allows us to interpret the values of features in terms of the dataset. After some inspection of the listed operations, we find that various measures of EEG Sample Entropy are higher for healthy individuals with eyes open (blue) than for seizure signals (purple).
Finally, we can visualize how the top operations depend on each other, to try to deduce the main independent behaviors in this set of 25 top features for the given classification task:
In this plot, the magnitude of the linear correlation, R (as 1 - |R|), between all pairs of top operations is shown using color, and a visualization of a clustering into 5 groups is shown (green boxes correspond to clusters of operations, based on the dendrogram shown to the right of the plot). Blue boxes indicate a representative operation of each cluster (the operation closest to the cluster centre).
Here we see that most pairs of operations are quite dependent (at |R| > ~0.8), and the main clusters are those measuring distribution entropy (EN_DistributionEntropy_), Lempel-Ziv complexity (EN_MS_LZcomplexity_), Sample Entropy (EN_SampEn_), and a one-step-ahead surprise metric (FC_Surprise_).
This function produces all of the visualizations required to achieve a basic understanding of the differences between the labeled classes of the dataset, and should be considered a first step in a more detailed analysis. For example, in this case we may investigate the top-performing features further, to understand what they are measuring and how. This process is described in the Interpreting features section.
Sometimes it's useful to be able to investigate the behavior of an individual operation (or feature) across a time-series dataset. What is the distribution of its outputs, what types of time series receive low values, and what types receive high values?
These types of simple questions for specific features of interest can be investigated using the TS_FeatureSummary function. The function takes an operation ID as its input (and can also take inputs specifying a custom data source or custom annotation parameters), and produces a distribution of outputs from that operation across the dataset, with the ability to then annotate time series onto that plot.
For example, the following:
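A sketch, using the operation ID discussed nearby (4310) and the 'raw' data source; check the function's help text for its full set of inputs:

```matlab
% Distribution of feature 4310's outputs across the dataset in
% HCTSA.mat, with interactive time-series annotation:
TS_FeatureSummary(4310,'raw');
```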
Produces the following plot (where 6 points on the distribution have been clicked on to annotate them with short time-series segments):
You can visually see that time series with more autocorrelated patterns through time receive higher values from this operation. Because groups have not been assigned to this dataset, the time series are colored at random.
Running TS_FeatureSummary in violin plot mode provides another representation of the same result:
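A sketch, assuming a third input toggles the violin-plot mode (check the function's help text):

```matlab
% Violin-plot version of the feature-4310 summary:
TS_FeatureSummary(4310,'raw',true);
```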
This plots the distribution of feature 4310 from HCTSA.mat as a violin plot, with ten 500-point time-series subsegments annotated at different points through the distribution, shown to the right of the plot:
When time-series groups have been labeled (using TS_LabelGroups, as TS_LabelGroups({'seizure','eyesOpen','eyesClosed'},'raw');), TS_FeatureSummary will plot the distribution for each class separately, as well as an overall distribution. Annotated points can then be added to each class-specific distribution. In the example shown below, we can see that the 'noisy' class (red) has low values for this feature (CO_tc3_2_denom), whereas the 'periodic' class mostly has high values.
TS_SingleFeature provides a simpler way of seeing the class distributions without annotations, either as kernel-smoothed distributions (as in TS_FeatureSummary) or as violin plots.
See below for example implementations:
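A sketch, assuming a signature of the form TS_SingleFeature(whatData,featureID,makeViolin); check the function's help text:

```matlab
featureID = 4310; % example feature ID (hypothetical choice)
TS_SingleFeature('norm',featureID,false); % kernel-smoothed distributions
TS_SingleFeature('norm',featureID,true);  % violin plots
```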
Shows the distributions with a classification bar underneath (indicating where a linear classifier would classify different parts of the space as either noisy or periodic):
Shows the distributions shown as a violin plot, with means annotated and classification bar to the left:
Note that the accuracy quoted in the title (an indication of the 10-fold cross-validated balanced accuracy of a linear classifier in the space) is computed from a single 10-fold split and is therefore stochastic; as shown above, repeating it can yield slightly different results. For a more rigorous analysis than this simple indication, the procedure should be repeated many times to give a converged estimate of the balanced classification accuracy.
A typical data analysis procedure might first involve carefully inspecting your data, then thinking about the patterns you might want to capture, then trying to devise an algorithm to capture them (perhaps testing the algorithm carefully on simulated data to probe its behavior and its dependence on noise level, time-series length, sensitivity to low-order trends, or other types of non-stationarity). But in hctsa we do something somewhat dangerous—apply thousands of methods without even needing to look at the data first.
A conventional data analysis pipeline starts with the hard work of thinking about the dynamics and carefully selecting time-series analysis methods with deep knowledge of how they work and the assumptions that make their interpretation valid. This hard work is not avoided in hctsa, although if you're lucky, hctsa may save you a bunch of work—if it identifies high-performing features, you can focus your energy on interpreting just these. But this hard work comes at the end of an analysis, and is difficult and time-consuming to do well—often involving looking deeply into the algorithms and theory underlying these methods, combined with careful inspection of the data to ensure you are properly interpreting them.
Interpretation is the most substantial challenge—one that cannot be automated and requires a lot of careful thought—so this page provides some basic advice. While it's often challenging to work out what a given feature is capturing in your dataset, it is the most interesting part of an analysis, as it points you to interpretable scientific theory/algorithms and makes you think carefully about the structure in your time series.
Sometimes you will run TS_TopFeatures and find no features that significantly distinguish time series recorded from the labeled groups in your dataset. In this case, there may not be a signal, the signal may be too small given your sample size (and the perhaps large number of candidate time-series features, if using the hctsa set), or the right statistical estimators may not be present in hctsa.
But other times you will obtain a long list of features that remain statistically significant after multiple-hypothesis correction (e.g., from TS_TopFeatures), with values that significantly distinguish the groups you care about (e.g., individuals with some disease diagnosis compared to healthy controls). In cases like this, the next step is to obtain some understanding of what's happening.
Consider a long and kinda gnarly list that is quite hard to make sense of. This is a typical situation, because feature names in hctsa are long and typically hard to interpret directly. Like, perhaps:
If there are hundreds or thousands of statistically significant features, some of which have much higher accuracy than others, you may first want to inspect only a subset of features, e.g., the 50 or 100 top-performing features (with the most significant differences). If you can make sense of these first, you will usually have a pretty good basis for working out what's going on with the features with lower performance (if indeed you want to go beyond interpreting the best-performing features).
Plotting and inspecting clustered feature–feature correlation plots are crucial for identifying groups of features with similar behavior on the dataset. Then we can inspect and interpret these groups of similar (highly inter-correlated) features together as a group. This should be the first step in interpreting a set of significant features: which groups of features behave similarly, and what common concepts are they measuring?
When such a group of high-performing features capturing a common time-series property has been identified, how can we start to interpret and understand what each individual feature is measuring?
Some features in the group may be easy to interpret directly. For example, in the list above, rms is straightforward to interpret directly: it is simply the root mean square of the distribution of time-series values. Others have clues in the name (e.g., features starting with WL_coeffs have to do with measuring wavelet coefficients, features starting with EN_mse correspond to measuring the multiscale entropy ('mse'), and features starting with FC_LocalSimple_mean are related to time-series forecasting using local means of the time series).
Below we outline a procedure for how a user can go from a time-series feature selected by hctsa towards a deeper understanding of the type of algorithm that feature is derived from, how that algorithm performs across the dataset, and thus how it can provide interpretable information about your specific time-series dataset.
The simplest way of getting a quick idea of what sort of property a feature might be measuring is from its keywords, that often label individual features by the class of time-series analysis method from which they were derived. Keywords have not been applied exhaustively to features (this is an ongoing challenge), but when they are there they can be useful at giving a sense of what's going on. In the list above, we see keywords listed in parentheses, such as:
'forecasting' (methods related to predicting future values of a time series),
'entropy' (methods related to predictability and information content in a time series),
'wavelet' (features derived from wavelet transforms of the time series),
'locationDependent' (location dependent: features that change under mean shifts of a time series),
'spreadDependent' (spread dependent: features that change under rescaling about their mean), and
'lengthDependent' (length dependent: features that are highly sensitive to the length of a time series).
To find more detailed information about a feature, beyond a broad categorical label of the literature from which it was derived, the next step is to find and inspect the code file that generates the feature of interest. For example, say we were interested in the top-performing feature in the list above:
We know from the keyword that this feature has something to do with forecasting, and the name provides clues about the details (e.g., `FC_` stands for forecasting; the function `FC_LocalSimple` produces this feature and, as the name suggests, performs simple local time-series prediction). We can use the feature ID (3016) provided in square brackets to get information from the `Operations` metadata table:
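For example, a sketch of such a lookup (assuming a computed `HCTSA.mat` file in the working directory; `TS_LoadData` returns the data matrix and metadata tables):

```matlab
% Load metadata tables from the local HCTSA.mat file
[~,TimeSeries,Operations] = TS_LoadData('raw');

% Inspect the metadata row for the feature with ID 3016
Operations(Operations.ID==3016,:)
```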
Inspecting the text before the dot in the `CodeString` field (`FC_LocalSimple_mean3`) tells us the name that hctsa uses to describe the Matlab function and its unique set of inputs that produces this feature, whereas the text following the dot (`taures`) tells us the field of the output structure produced by the Matlab function that was run.
We can use the `MasterID` to get more information about the code that was run, via the `MasterOperations` metadata table:
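A sketch of such a lookup (assuming a computed `HCTSA.mat` file containing these tables):

```matlab
% Load the relevant metadata tables directly from the HCTSA file
load('HCTSA.mat','Operations','MasterOperations');

% Find the master operation that generates feature ID 3016
masterID = Operations.MasterID(Operations.ID==3016);
MasterOperations(MasterOperations.ID==masterID,:)
```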
This tells us that the code used to produce our feature was `FC_LocalSimple(y,'mean',3)`. We can get information about this function at the command line by running a `help` command:
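That is:

```matlab
% Display the documentation in the header of FC_LocalSimple.m
help FC_LocalSimple
```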
We can also inspect the code in the function `FC_LocalSimple` directly for more information. Like all code files for computing time-series features, `FC_LocalSimple.m` is located in the Operations directory of the hctsa repository.
Inspecting the code file, we see that running `FC_LocalSimple(y,'mean',3)` performs forecasting using local estimates of the time-series mean (since the second input to `FC_LocalSimple`, `forecastMeth`, is set to `'mean'`), using the previous three time-series values to make each prediction (since the third input, `trainLength`, is set to `3`).
To understand which specific output of this code came up as highly informative in our `TS_TopFeatures` analysis, we need to look for the field labeled `taures` in the output structure produced by `FC_LocalSimple`. We find the following relevant lines of code in `FC_LocalSimple.m`:
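The relevant outputs are computed along these lines (a paraphrased sketch, not a verbatim quote from the source file; `res` holds the rolling-prediction residuals):

```matlab
% Autocorrelation properties of the rolling-prediction residuals, res:
out.ac1 = CO_AutoCorr(res,1);         % lag-1 autocorrelation of residuals
out.ac2 = CO_AutoCorr(res,2);         % lag-2 autocorrelation of residuals
out.taures = CO_FirstZero(res,'ac');  % first zero of the residual ACF
```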
This shows us that, after doing the local mean prediction, `FC_LocalSimple` outputs some features quantifying whether there is any remaining autocorrelation structure in the residuals of the rolling predictions (the outputs labeled `ac1`, `ac2`, and our output of interest, `taures`).
The code shows that the `taures` output computes `CO_FirstZero` of the residuals, which measures the first zero of their autocorrelation function (cf. `help CO_FirstZero`). When the local mean prediction leaves substantial autocorrelation structure in the residuals, our feature, `FC_LocalSimple_mean3_taures`, will thus take a high value.
For example, we can run the following:
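For example (the data source and annotation flag here are assumptions; see `help TS_FeatureSummary` for the exact argument conventions):

```matlab
% Visualize how feature ID 3016 distributes across the labeled groups,
% annotating example time series along the feature's range
TS_FeatureSummary(3016,'raw',true);
```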
which produces a plot like that shown below. We have run this on a dataset containing noisy sine waves, labeled 'noisy' (red) and periodic signals without noise, labeled 'periodic' (blue):
In this plot, we see how this feature orders time series (with the distribution of values shown on the left, and split between the two groups: 'noisy', and 'periodic'). Our intuition from the code, that time series with longer correlation timescales will have highly autocorrelated residuals after a local mean prediction, appears to hold visually on this dataset.
In general, the mechanism provided by `TS_FeatureSummary` to visualize how a given feature orders time series, including across labeled groups, can be very useful for feature interpretation.
In a specific domain context, you may need to decide on the trade-off between more complicated features that may have slightly higher in-sample performance on a given task, and simpler, more interpretable features that may help guide domain understanding. The procedures outlined above are typically the first step to understanding a time-series analysis algorithm, and its relationship to alternatives that have been developed across science.
When performing a time-series classification task, a basic first exploration of the data is to investigate how accurately a classifier can learn a mapping from time-series features to labels assigned to time series in your dataset.
The first step is to assign group labels to the time series in your dataset (using `TS_LabelGroups`).
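A sketch of labeling with `TS_LabelGroups` (the group keywords here are placeholders for your own class labels):

```matlab
% Assign group labels to time series by matching their keywords
% (here, a hypothetical two-class labeling)
TS_LabelGroups('raw',{'noisy','periodic'});
```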
Depending on the classifier, you typically want to first normalize the features to put them all on a similar scale (using `TS_Normalize`).
Depending on the question asked of the data, you should also consider whether certain types of features should be removed. For example, you may wish to exclude length-dependent features (if time-series length varies between classes but is an uninteresting artefact of the measurement). This can be done using `TS_Subset` (and functions like `TS_CompareFeatureSets`, described below, allow you to test the sensitivity of your results to such choices).
`TS_Classify` uses all of the features in a given hctsa data matrix to classify the assigned class labels.
You can specify classification settings, from the number of cross-validation folds to the type of classifier, via the `cfnParams` structure. For the labeling defined in a given `TimeSeries` table, you can set defaults using `cfnParams = GiveMeDefaultClassificationParams('norm')` (which takes the `TimeSeries` labeling from `HCTSA_N.mat`). This automatically sets an appropriate number of cross-validation folds, and includes settings that take class imbalance into account in classifier training and evaluation. It is best to alter the values inside this function to suit your needs, so that these settings can be applied consistently.
First, let's run a simple classification of the groups labeled in `HCTSA_N.mat`, using default classification settings:
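A minimal sketch (assuming groups were labeled with `TS_LabelGroups` beforehand):

```matlab
% Take default classification settings for the normalized data
cfnParams = GiveMeDefaultClassificationParams('norm');

% Classify the labeled groups using all features
TS_Classify('norm',cfnParams);
```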
In large feature spaces like hctsa's, simpler classifiers (like `'svm_linear'`) tend to generalize well, but you can play with the settings in `cfnParams` to get a sense of how the performance varies.
As well as the classification results, the function also produces a confusion matrix, which is especially useful for evaluating where classification errors are occurring. Here's an example for a five-class problem:
In datasets containing fewer time series, high classification accuracies are more likely to arise by chance, so you may wonder how confident you can be in your result. For example, if you obtain a two-class classification accuracy of 60%, what is the probability of achieving such an accuracy by chance?
You can set `numNulls` in `TS_Classify` to repeat the classification, with the settings defined in `cfnParams`, using shuffled class labels. This builds up a null distribution from which you can estimate a p-value for the significance of the classification accuracy obtained with the true data labeling.
You can also choose to run across multiple cores by switching on `doParallel`:
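A sketch of such a call (the number of nulls is an example value; the option name follows `help TS_Classify`):

```matlab
% Build a null distribution from 100 label-shuffled repeats, in parallel
cfnParams = GiveMeDefaultClassificationParams('norm');
numNulls = 100;
TS_Classify('norm',cfnParams,numNulls,'doParallel',true);
```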
This gives you a p-value estimate (both via a direct permutation test, and by assuming a Gaussian null distribution), and plots the null distribution with the true result annotated:
You might wonder whether the classification results are driven by simple types of features that are unrelated to time-series dynamics (such as the mean of the data, or the time-series length).
These can be filtered out during the initial computation (e.g., when performing `TS_Init`), or subsequently (e.g., using `TS_Subset`), but you can also test the effect such features are having on your dataset using `TS_CompareFeatureSets`. Here's an example output:
Here we see that length-dependent features are contributing to accurate classification (above 50% accuracy for this two-class balanced problem). We can take some comfort from the fact that excluding these features (`'notLengthDependent'`) does not substantially alter the classification accuracy, so these features are not single-handedly driving the classification results. Nevertheless, assuming that differences in recording length are an uninteresting artefact that we do not want biasing our classification results, it would be advisable to remove them for peace of mind.
Depending on the task, the full complexity of the time-series analysis literature may not be necessary for strong classification results. You can quickly assess how accurately a smaller number of reduced components (e.g., principal components) can classify your dataset using `TS_Classify_LowDim`:
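For example (the number of components is an example setting; argument conventions per `help TS_Classify_LowDim`):

```matlab
% Classify in the reduced space of the leading principal components
cfnParams = GiveMeDefaultClassificationParams('norm');
numPCs = 10; % assess performance using up to 10 leading PCs
TS_Classify_LowDim('norm',cfnParams,numPCs);
```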
The classification accuracy is shown for all features (green, dashed), and as a function of the number of leading PCs included in the classifier (black circles). Note that this is cumulative: '5 PCs' means classification in the five-dimensional space of the five leading PCs.
Here we find that we can achieve decent classification accuracy with just four PCs (and more complex classifiers may give even better results in these lower-dimensional spaces).
You can quickly interpret the type of features loading strongly onto each PC from the information shown to screen. For example:
This output demonstrates that, on this dataset, long-lag autocorrelation features are the most strongly correlated to PC5.
`TS_TopFeatures` is helpful for showing how features with different, long, and complicated names might cluster into groups of similarly behaving features that measure similar properties of the data (cf. the clustered feature–feature correlation matrix described above). This is essential for interpreting groups of features that capture the same underlying conceptual property of the dynamics: it helps you work out what's going on, and avoids over-interpreting any specific feature in isolation. For example, sometimes features with very complicated-sounding names (like nonlinear embedding dimension estimates) exhibit very similar behavior to very simple features (like lag-1 autocorrelation or distributional outlier measures), indicating that the nonlinear embedding dimension estimates are likely sensitive to simpler properties of the time series (e.g., outliers in the data). Interpreting the nonlinear embedding dimension in isolation would be a mistake in such a case.
An example is shown here, where three important groups of features were identified (measuring stationarity of self-correlation properties, stationarity of distributional properties, and global distributional properties), and then interpreted in turn:
Once we've seen the code that was used to produce a feature, and started to think about how such an algorithm might be measuring useful structure in our time series, we can check our intuition by inspecting its performance on our dataset (as described above).
hctsa contains a large number of features, many of which can be expected to be highly inter-correlated on a given time-series dataset. It is thus crucial to explore how a given feature relates to other features in the library, e.g., using the correlation matrix produced by `TS_TopFeatures`, or by searching for features with similar behavior on the dataset to a given feature of interest using `TS_SimSearch`.
Although much of the time-series analysis literature has focused on quantifying complex temporal structure in long time-series recordings, many time series that are analyzed in practice are relatively short. hctsa has been successfully applied to time-series classification problems in the data mining literature, including datasets of time series as short as 60 samples. However, time-series data are sometimes even shorter, such as yearly economic data across perhaps six years, or biological data measured at, say, 10 points across a lifespan. Although many features in hctsa will not give a meaningful output when applied to a short time series, hctsa includes methods for filtering such features (cf. `TS_Normalize`), after which the remaining features can be used for analysis.
The number of features with a meaningful output, from time series as short as 5 samples, up to those with as many as 500 samples, is shown below (where the maximum set of 7749 is shown as a dashed horizontal line):
In each case, over 3000 features can be computed. Note, however, that care is needed when representing a time series of just 5 samples with thousands of features, the vast majority of which will be highly intercorrelated.
To demonstrate the feasibility of running hctsa analysis on datasets of short time series, we applied hctsa to gene expression data in the cerebellar brain region, r1A, across seven developmental time points (from the Allen Institute's Developing Mouse Brain Atlas), for a subset of 50 genes. After filtering and normalizing (`TS_Normalize`), then clustering (`TS_Cluster`), we plotted the clustered time-series data matrix (`TS_PlotDataMatrix('cl')`):
Inspecting the time-series plots to the left of the colored matrix, we can see that genes with similar temporal expression profiles are clustered together based on their 2829-long feature-vector representations; these feature-based representations thus capture meaningful structure in these short time series. Further, while these 2829-long feature vectors are shorter than those that can be computed from longer time series, they still constitute a highly comprehensive representation that can serve as the starting point for interpretable answers to specific domain questions.
One of the key goals of highly comparative time-series analysis is to allow unbiased methodological comparison across the vast literature of time-series analysis tools developed for different applications. By representing features in terms of their outputs across a time-series dataset, the context of a given feature can be assessed by searching the database for features with similar behavior. The search can be done using a diverse range of real and model-generated data, or using a more specific dataset if that is more appropriate for a given application (e.g., looking just at EEG signals). Just as similar time series to a target time series can be retrieved and visualized, similar features to a given target feature can be retrieved using `TS_SimSearch`.
This chapter will give instructions on how you can compare a new time-series analysis feature to our library of over 7000 time-series features using hctsa. We assume that the reader has installed hctsa, which will be required to work with files and compute features.
The first step is to define the set of features to compare against (here we use the default hctsa library) and the set of time-series data on which behavior will be assessed. If you have just developed a new algorithm for time-series analysis and want to see how it performs across a range of interdisciplinary time-series data, you may want to use a diverse set of time series sampled from across science. This can easily be achieved using our set of 1000 time series; a random selection of 25 of them is plotted below (only the first 250 samples of each are shown, to aid visualization):
Pre-computed results for a recent version of hctsa can be downloaded from figshare as `HCTSA_Empirical1000.mat`.
Alternatively, features can be recomputed using the input file for the time-series dataset provided in the same figshare data repository. This ensures implementational consistency on your local compute architecture; i.e., using `TS_Init('INP_Empirical1000.mat');` to initialize, followed by compute commands involving `TS_Compute`.
However, if you only ever analyze a particular type of data (e.g., rainfall), then perhaps you're more interested in which methods perform similarly on rainfall data. For this case, you can produce your own data context for custom data using properly structured input files as explained here.
We use the (hypothetical) example of a hot new feature, :boom: `hot_feature1` :boom:, recently published in Science (and not yet in the hctsa library), and attempt to determine whether it is completely new, or whether existing features exhibit similar performance. Think first about the data context (described above), which allows you to understand the behavior of thousands of features on a diverse dataset against which to compare the behavior of the new feature, `hot_feature1`. This example uses the `Empirical1000` data context, downloaded as `HCTSA_Empirical1000.mat` from figshare.
Computing the feature values for the new feature, `hot_feature1`, could be done directly (using `TS_CalculateFeatureVector`), but in order to maintain the HCTSA structure, we instead produce a new `HCTSA.mat` file containing just `hot_feature1` and the same time series. Note that, to compare against the `HCTSA_Empirical1000.mat` file hosted on figshare, you should use the same version of hctsa, to enable a valid comparison to the same set of features.
We first generate an input file, `INP_hot_master.txt`, containing the function call, which takes in a time series, `x`:
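Such a file contains a single line pairing the code call with a label (the format follows that of `INP_mops.txt`; the label `MyHotFeature_x` is an assumed example):

```
MyHotFeature(x) MyHotFeature_x
```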
Any additional arguments to the function `MyHotFeature.m` should be specified here. `MyHotFeature.m` must also be in a form that outputs a structure (or a single real number), as explained here.
The interesting field of the structure output produced by `MyHotFeature(x)` is `hotFeature1`, which needs to be specified in another input text file, `INP_hot_features.txt`, for example as:
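A sketch of such a line, pairing `masterLabel.outputField` with a feature name and comma-separated keywords (format as in `INP_ops.txt`; the master label `MyHotFeature_x` is an assumed example):

```
MyHotFeature_x.hotFeature1 hot_feature1 hot,science
```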
where we have given this feature two keywords: `hot` and `science`.
So now we are able to initialize a new hctsa calculation, specifying the custom code calls (master operations) and the features to extract from each code call (operations), as:
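A sketch of the initialization call (argument order assumed; an output file name such as `HCTSA_hot.mat` can also be supplied, per `help TS_Init`):

```matlab
% Initialize a new HCTSA file from the time-series input file,
% using the custom master and feature definitions
TS_Init('INP_Empirical1000.mat','INP_hot_master.txt','INP_hot_features.txt');
```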
This generates a new file, `HCTSA_hot.mat`, containing information about the 1000 time series and the new hot feature, `hot_feature1`, which can then be computed as:
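A sketch of the compute call (argument conventions per `help TS_Compute`):

```matlab
% Compute all missing feature values in the custom file
% (serial computation; ranges empty to include all IDs)
TS_Compute(false,[],[],'missing','HCTSA_hot.mat');
```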
So now we have both a context for the behavior of a library of >7000 features on 1000 diverse time series, and the behavior of our hot new feature. It is time to combine them and look for inter-relationships!
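The two files can be combined with `TS_Merge` (a sketch; the default output naming convention is assumed, cf. `help TS_Merge`):

```matlab
% Merge the downloaded feature context with the new feature's file
% into a single HCTSA file spanning both sets of operations
TS_Merge('HCTSA_Empirical1000.mat','HCTSA_hot.mat');
```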
Now that we have all of the data in the same HCTSA file, we can compare the behavior of the new feature to the existing library of features. This can be done manually by the researcher, or by using standard hctsa functions; the most relevant is `TS_SimSearch`. We can find the ID assigned to our new `hot_feature1` in the merged HCTSA file as:
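A sketch of such a lookup (using the `Name` column of the `Operations` table):

```matlab
% Find the operation ID of the new feature in the merged file
[~,~,Operations] = TS_LoadData('HCTSA_merged.mat');
Operations.ID(strcmp(Operations.Name,'hot_feature1'))
```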
which tells us that the ID of `hot_feature1` in `HCTSA_merged.mat` is 7703. We can then use `TS_SimSearch` to explore the relationship of our hot new feature to other features in the hctsa library (in terms of linear, Pearson, correlations):
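A sketch of the call (option names per `help TS_SimSearch`):

```matlab
% Retrieve and visualize features correlated to operation ID 7703
TS_SimSearch(7703,'tsOrOps','ops','whatDataFile','HCTSA_merged.mat');
```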
We find that our feature reproduces the behavior of the first zero of the autocorrelation function (the first match: `first_zero_ac`; see Interpreting Features for more information on how to interpret matching features):
In this case, the hot new feature wasn't so hot: it was highly (linearly) correlated to many existing features (including the simple first zero of the autocorrelation function, `first_zero_ac`), even across a highly diverse time-series dataset. However, if you have more luck and come up with a hot new feature that shows distinctive (and useful) performance, then it can be incorporated into the default set of features used by hctsa by adding the necessary master and feature definitions (i.e., the text in `INP_hot_master.txt` and the text in `INP_hot_features.txt`) to the library files (`INP_mops.txt` and `INP_ops.txt` in the Database directory of hctsa), as explained here. You might even celebrate your success by sharing your new feature with the community, by sending a Pull Request to the hctsa GitHub repository!! :satisfied:
If using our set of 1000 time series, this is easy, because all the data has already been computed and is available in `HCTSA_Empirical1000.mat` on figshare :relaxed:
For example, say we want to find neighbors to the `fastdfa` algorithm from Max Little's website. This algorithm is already implemented in hctsa in the code `SC_fastdfa.m`, as the feature `SC_fastdfa_exponent`. We can find the ID of this feature by finding the matching row of the `Operations` table (`ID = 750`):
and then find similar features using `TS_SimSearch`, e.g., as:
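For example (plot options per `help TS_SimSearch`):

```matlab
% Explore the network of features most similar to operation ID 750
TS_SimSearch(750,'tsOrOps','ops','whatPlots',{'network'});
```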
Yielding:
We see that other features in the library indeed have strong relationships to `SC_fastdfa_exponent`, including some unexpected relationships with the stationarity estimate `StatAvl25`.
Combining the network visualization with scatter plots produces the figures in our original paper on the empirical structure of time series and their methods (cf. Sec. 2.4 of the supplementary text), see below:
Specific pairwise relationships can be probed in more detail (visualizing the types of time series that drive any relationship) using `TS_Plot2d`, e.g., as:
The pairwise distance matrix (distances are 1 − |r|, for Pearson correlation coefficients, r) produced by `TS_SimSearch` provides another visualization of a feature's context (in this case there are so many highly correlated features that the matrix doesn't reveal much subtle structure):