When running large-scale hctsa computations, it can be useful to set up a mySQL database for time series, operations, and the computation results, and have many Matlab instances (running on different nodes of a compute cluster, for example) communicate directly with the database.
The hctsa software comes with this (optional) functionality, allowing a more powerful, distributed way to compute and store results of a large-scale computation.
This chapter outlines the steps involved in setting up, and running hctsa computations using a linked mySQL database.
The hctsa package requires some preliminary setup to work with a mySQL database, described here:
Installation of mySQL, either locally, or on an accessible server.
Setting up Matlab with a mySQL java connector (done by running the install_jconnector
script in the Database directory, and then restarting Matlab).
After the database is set up, and the packages required by hctsa are installed (by running the install
script), linking to a mySQL database can be done by running the install_database
script, which:
Sets up Matlab to be able to communicate with the mySQL server and creates a new database to store Matlab calculations in, described here.
This section contains additional details about each of these steps.
Note that the above steps are one-off installation steps; once the software is installed and compiled, a typical workflow will simply involve opening Matlab, running the startup
script (which adds all paths required for the hctsa software), and then working within Matlab from any desired directory.
Once installed using our default library of operations, the typical next step is to add a dataset of time series to the database using the SQL_Add
command. Custom master operations and operations can also be added, if required.
After installing the software and importing a time-series dataset to a mySQL database, the process by which data is retrieved from the database to local Matlab files (using SQL_Retrieve
), feature sets computed within Matlab (using TS_Compute
), and computed data stored back in the database (SQL_store
) is described in detail here.
After the computation is complete for a time-series dataset, a range of processing, analysis, and plotting functions are also provided with the software, as described here.
After setting up a database with a library of time-series features, the next task is to add a dataset of time series to the database. It is up to personal preference of the user whether to keep all time-series data in a single database, or have a different database for each dataset.
Time series are added using the same function used to add master operations and operations to the database, SQL_Add
, which imports time series data (stored in time-series data files) and associated keyword metadata (assigned to each time series) to the database. The time-series data files to import, and the keywords to assign to each time series are specified in either: (i) an appropriately formatted matlab (.mat
) file, or (ii) a structured input text file, as explained below.
Note that, when using the .mat
file input method, time-series data is stored in the database to six significant figures. However, when using the .txt
file input method, time-series data values are stored as written in the input text file of each time series.
Time series can be indexed by assigning keywords to them (which are stored in the TimeSeriesKeywords table and associated index table, TsKeywordsRelate of the database). Assigning keywords to time series makes it easier to retrieve a set of time series with a given set of keywords for analysis, and to group time series annotated with different keywords for classification tasks.
When added to the mySQL database, every time series added to the database is assigned a unique integer identifier, ts_id, which can be used to retrieve specific time series from the database.
SQL_Add syntax
Adding a set of time series to the database requires an appropriately formatted input file; for an input file INP_ts.txt, for example, the appropriate code is:
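Following the syntax used for the test files below (the filename INP_ts.txt is illustrative), the call would be:

```matlab
% Add the time series listed in INP_ts.txt (and their keywords) to the database:
SQL_Add('ts','INP_ts.txt');
```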
We provide an example input file in the Database directory as INP_test_ts.txt, which can be added to the database, following the syntax above, using SQL_Add('ts','INP_test_ts.txt')
, as well as a sample .mat file input as INP_test_ts.mat, which can be added as SQL_Add('ts','INP_test_ts.mat')
.
We assume that the user has access to and appropriate read/write privileges for a local or network mySQL server database. Instructions on how to install and set up a mySQL database on a variety of operating systems can be found here.
Before the structure of the database can be created, Matlab must be set up to be able to talk to the mySQL server, which requires installing a mySQL java connector. The steps required to achieve this are performed by the script install_jconnector
, which should be run from the main hctsa directory. If this script runs successfully and a mySQL server has been installed (either on your local machine or on an external server, see above), you are then ready to run the install
script.
The following outlines the actions performed by the install_jconnector
script (including instructions on how to perform the steps manually):
It is necessary to relocate the java connector, the file mysql-connector-java-5.1.35-bin.jar (for version 5.1.35), from the Database directory of this code repository (it is also freely available here). Instructions are here and are summarized below, and described in the Matlab documentation. This .jar file must be added to a static path where it can always be found by Matlab. A good candidate directory is the java/jarext/ subdirectory of the Matlab root directory (to determine the Matlab root directory, simply type matlabroot
in an open Matlab command window).
For Matlab to see this file, you need to add a reference to it in the javaclasspath.txt file (an alternative is to modify the classpath.txt file directly, but this may not be supported by newer versions of Matlab). This file can be found (or if it does not exist, should be created) in Matlab’s preferences directory (to determine this location, type prefdir
in a command window, or navigate to it within Matlab using cd(prefdir)
).
This javaclasspath.txt file must contain a text reference to the location of the java connector on the disk. In the case recommended above, where it has been added to the java/jarext directory, we would add the following to the javaclasspath.txt file:
ensuring that the version number (5.1.35) matches your version of the J connector (if you are using a more recent version, for example).
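Concretely, if the connector was copied into the java/jarext subdirectory of the Matlab root directory as recommended above, the line added to javaclasspath.txt would be (using the $matlabroot token, which recent Matlab versions expand to the Matlab root directory; otherwise write the full path):

```
$matlabroot/java/jarext/mysql-connector-java-5.1.35-bin.jar
```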
Note that javaclasspath.txt can also be in Matlab’s startup directory (for example, to modify this just for an individual user).
After restarting Matlab, Matlab should then have the ability to communicate with mySQL servers (we will check whether this works below).
The main tasks involved in installing the Matlab/mySQL interface are achieved by the install.m
script, which runs the user through the steps below.
If you have not already done so, creating a mySQL database to use with Matlab can be done using the SQL_create_db
function. This requires that mySQL is installed on an accessible server, or on the local machine (i.e., using localhost
). If the database has already been set up, then you do not need to use the SQL_create_db
function, but you must then create a text file, sql_settings.conf
, in the Database directory of the repository. This file contains four comma-delimited entries corresponding to the server name, database name, username, and password, as per the following:
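For example (these values are placeholders; substitute your own server details):

```
localhost,hctsa_db,myUsername,myPassword
```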
The settings listed here are those used to connect to the mySQL server. Remember that your password is sitting here in this document in unencrypted plain text, so do not use a secure or important password.
To check that Matlab can connect to external servers using the mySQL J-connector, using correct host name, username, and password settings, we introduce the Matlab routines SQL_OpenDatabase
and SQL_CloseDatabase
. An example usage is as follows:
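A minimal sketch of opening and closing a connection (assuming sql_settings.conf has been configured as described below):

```matlab
% Open a connection to the database specified in sql_settings.conf:
dbc = SQL_OpenDatabase;
% ... (interact with the database) ...
% Close the connection when done:
SQL_CloseDatabase(dbc);
```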
For this to work, the sql_settings.conf file must be set up properly. This file specifies (in unencrypted plain text!) the login details for your mySQL database in the form hostName,databaseName,username,password
.
An example sql_settings.conf file:
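For example (placeholder values):

```
localhost,myDatabaseName,myUsername,myPassword
```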
Once you have configured your sql_settings.conf file, and you can run dbc = SQL_OpenDatabase;
and SQL_CloseDatabase(dbc)
without errors, then you can smile to yourself: Matlab can communicate successfully with your mySQL server, and you are now ready to set up the database structure!
Note that if your database is not set up on your local machine (i.e., localhost
), then Matlab can communicate with a mySQL server through an ssh tunnel, which requires some additional setup (described below).
Note also that the SQL_OpenDatabase
function uses Matlab's Database Toolbox if a license is available, but otherwise will use java commands; both are supported and should give identical operational behavior.
To start writing a new dataset to a new database, or start retrieving data from a different database, you will need to change the database that Matlab is configured to connect to. This can be done using the SQL_ChangeDatabase
script (which walks you through the steps and writes over the existing sql_settings.conf file), or by altering the sql_settings.conf file directly.
Note that one can swap between multiple databases easily by commenting out lines of the sql_settings.conf file (adding %
to the start of a line to comment it out).
In some cases, the mySQL server you wish to connect to requires an ssh tunnel. One solution is to use port forwarding from your local machine to the server. The port forward can be set up in the terminal using a command like:
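For example (the username and server address here are placeholders):

```shell
# Forward local port 1234 to port 3306 (the default mySQL port) on the server:
ssh -L 1234:localhost:3306 myUsername@myServer.example.com
```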
This command connects port 1234 on your local computer to port 3306 (the default mySQL port) on the server. Now, telling Matlab to connect to localhost
through port 1234 will connect it, through the established ssh tunnel, to the server. This can be achieved by specifying the server as localhost
and the port number as 1234 in the sql_settings.conf file (or during the install
process), which can be specified as the (optional) fifth entry, i.e.,:
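For example, to connect through the tunnel established above (placeholder values; the fifth entry is the port number):

```
localhost,myDatabaseName,myUsername,myPassword,1234
```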
Retrieving data from the Results table of the database is typically done for one of two purposes:
To calculate as-yet uncalculated entries to be stored back into the database, and
To analyze already-computed data stored in the database in Matlab.
The function SQL_Retrieve
performs both of these functions, using different inputs. Here we describe the use of the SQL_Retrieve
function for the purposes of populating uncalculated (NULL) entries in the Results table of the database in Matlab.
For calculating missing entries in the database, SQL_Retrieve
can be run as follows:
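That is, for vectors ts_ids and op_ids of time-series and operation identifiers:

```matlab
% Retrieve combinations of the given ts_ids and op_ids that contain
% as-yet uncalculated (NULL) entries in the Results table:
SQL_Retrieve(ts_ids, op_ids, 'null');
```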
The third input, 'null'
, retrieves ts_ids and op_ids from the sets provided that contain (as-yet) uncalculated (i.e., NULL) elements in the database; these can then be calculated and stored back in the database. An example usage is given below:
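For example (assuming, as for the other SQL_Retrieve calls in this chapter, that 'all' as the second input selects all operations):

```matlab
% Retrieve NULL entries for time series with ts_ids 1-5, across all operations:
SQL_Retrieve(1:5, 'all', 'null');
```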
Running this code will retrieve null (uncalculated) data from the database for time series with ts_ids between 1 and 5 (inclusive) and all operations in the database, keeping only the rows and columns of the resulting time series x operations matrix that contain NULLs.
When calculations are complete and one wishes to analyze all of the data stored in the database (not just NULL entries requiring computation), the third input should be set to ‘all’ to retrieve all entries in the Results table of the database, as described later.
SQL_Retrieve
writes to a local Matlab file, HCTSA.mat, that contains the data retrieved from the database.
In this section we describe how keywords and other types of metadata stored in the database can be manipulated, and how the results of whole sets of operations and time series can be cleared or completely deleted from the database. These tasks are implemented as Matlab functions, which ensure that key database structures are maintained; performing such tasks by acting directly on the database can cause inconsistencies and should be avoided.
Running out of java heap space throws the error java.lang.OutOfMemoryError
, and usually happens when trying to retrieve a large set of time series/operations from the database. Matlab needs to keep the whole retrieval in memory, and has a hard limit on this. The java heap size can be increased in the Matlab preferences, under General Java Heap Memory.
As described above, computation involves three main steps:
Retrieve a set of time series and operations from (the Results table) of the database to a local Matlab file, HCTSA.mat (using SQL_Retrieve
).
Compute the operations on the retrieved time series in Matlab and store the results locally (using TS_Compute
).
Write the results back to the Results table of the database (using SQL_store
).
It is usually the most efficient practice to retrieve a small number of time series at each iteration of the SQL_Retrieve
–TS_Compute
–SQL_store
loop, and distribute this computation across multiple machines if possible. An example runscript is given in the code that accompanies this document, as sample_runscript_sql
, which retrieves a single time series at a time, computes it, and then writes the results back to the database in a loop. This can be viewed as a template for runscripts that one may wish to use when performing time-series calculations across the database.
This workflow is well suited to distributed computing for large datasets, whereby each node can iterate over a small set of time series, with all the results being written back to a central location (the mySQL database).
By designating different sections of the database to cycle through, this procedure can also be used to (manually) distribute the computation across different machines. Retrieving a large section of the database at once can be problematic because it requires large disk reads and writes, uses a lot of memory, and if problems occur in the reading or writing to/from files, one may have to abandon a large number of existing computations.
The mySQL database is structured into four main components:
A list of the filenames and other metadata of time series (the TimeSeries table),
A list of all the names and other metadata of pieces of time-series analysis operations (the Operations table),
A list of all the pieces of code that must be evaluated to give each operation a value, which is necessary, for example, when one piece of code produces multiple outputs (the MasterOperations table), and
A list of the results of applying operations to time series in the database (the Results table).
Additional tables are related to indexing and managing efficient keyword labeling, etc.
Time series and operations have their own tables containing the metadata associated with each piece of data and each operation. The results of applying each operation to each time series are stored in the Results table, which has a row for every combination of time series and operation, and also records calculation times and the quality of outputs (for cases where the output of the operation was not a real number, or where some error occurred in the computation). Note that while the data for each time series is stored in the database, the executable time-series analysis code files are not, such that all code files must be in Matlab's path (all required paths can be added by running startup.m
).
Another handy (but dangerous) function to know about is SQL_reset
, which will delete all data in the mySQL database, create all the new tables, and then fill the database with all the time-series analysis operations. The TimeSeries, Operations, and MasterOperations tables can be generated by running SQL_CreateAllTables
, with master operations and operations added to the database using SQL_Add
commands (described here).
You now have the basic table structure set up in the database and have done your first bits of mySQL manipulation through the Matlab interface.
It is very useful to be able to inspect the database directly, using a graphical interface. A very good example for Mac is the excellent, free application Sequel Pro (a screenshot is shown below, showing the first 40 rows of the Operations table of our default operations library). Similar applications exist for Windows and Linux platforms. Applications that allow the database to be inspected are extremely useful; however, they should not be used to manipulate the database directly. Instead, Matlab scripts should be used to interact with the database, to ensure that the proper relationships between the different tables are maintained (including the indexing of keyword labels).
When linking Matlab to a mySQL database, metadata associated with time series, operations, and master operations, as well as the results of computations are all stored in an indexed database. Adding master operations, operations, and time series to the database can be achieved using the SQL_Add
function, as described below.
The following table summarizes the terminology used for each type of object in hctsa land:
SQL_Add
SQL_Add
has two key inputs that specify:
Whether to import a set of time series (specify ‘ts’
), a set of operations (specify ‘ops’
), or a set of master operations (specify ‘mops’
),
The name of the input file that contains appropriately-formatted information about the time series, master operations, or operations to be imported.
In this section, we describe how to use SQL_Add
to add master operations, operations, and time series to the database.
Users wishing to run the default hctsa code library on their own time-series dataset need only add time series to the database, as the full operation library is added by default by the install.m script. Users wishing to add additional features, using custom time-series code or different types of inputs to existing code files, can either edit the default INP_ops.txt and INP_mops.txt files provided with the repository, or create new input files for their custom analysis methods (as explained for operations and master operations).
REMINDER: Manually editing the database, including adding or deleting rows, is very dangerous, as it can create inconsistencies and errors in the database structure. Adding time series and operations to the database should only be done using SQL_Add
which sets up the Results table of the database and ensures that the indexing relationships in the database are properly maintained.
By default, the install
script populates the database with the default library of highly comparative time-series analysis code. The formatted input file specifying these pieces of code and their input parameters is INP_mops.txt in the Database directory of the repository. This step can be reproduced using the following command:
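That is:

```matlab
% Add the default library of master operations to the database:
SQL_Add('mops','INP_mops.txt');
```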
Once added, each master operation is assigned a unique integer, mop_id, that can be used to identify it. For example, when adding individual operations, the mop_id is used to map each individual operation to a corresponding master operation.
New functions and their input parameters to execute can be added to the database using SQL_Add
in the same way as described above. For example, lines corresponding to the new code can be added to the current INP_mops.txt file, or by generating a new input file and running SQL_Add
on the new input file. Once in the database, the software will then run the new pieces of code. Note that SQL_Add
checks for repeats that already exist in the database, so that duplicate database entries cannot be added with SQL_Add
.
New code added to the database should be checked for the following:
1. The output is a real number or a structure (and uses an output of NaN to assign all outputs to a NaN).
2. The function is accessible in the Matlab path.
3. Output(s) from the function have matching operations (or features), which also need to be added to the database.
Corresponding operations (or features) will then need to be added separately, to link to the structured outputs of master operations.
Operations can be added to the mySQL database using an appropriately-formatted input file, such as INP_ops.txt
, as follows:
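For example:

```matlab
% Add the operations defined in INP_ops.txt to the database:
SQL_Add('ops','INP_ops.txt');
```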
Every operation added to the database will be given a unique integer identifier, op_id, which provides a common way of retrieving specific operations from the database.
Note that after (hopefully successfully) adding the operations to the database, the SQL_Add
function indexes the operation keywords to an OperationKeywords table that produces a unique identifier for each keyword, and another linking table that allows operations with each keyword to be retrieved efficiently.
After setting up a database with a library of time-series features, the next task is to add a dataset of time series to the database. It is up to the user whether to keep all time-series data in a single database, or have a different database for each dataset.
Time series are added using the same function used to add master operations and operations to the database, SQL_Add
, which imports time series data (stored in time-series data files) and associated keyword metadata (assigned to each time series) to the database.
Time series can be indexed by assigning keywords to them (which are stored in the TimeSeriesKeywords table and associated index table, TsKeywordsRelate of the database).
When added to the mySQL database, every time series added to the database is assigned a unique integer identifier, ts_id, which can be used to retrieve specific time series from the database.
Adding a set of time series to the database requires an appropriately formatted input file, in either of the formats described above:
We provide an example input file in the Database directory as INP_test_ts.txt, which can be added to the database, following the syntax above, using SQL_Add('ts','INP_test_ts.txt')
, as well as a sample .mat file input as INP_test_ts.mat, which can be added as SQL_Add('ts','INP_test_ts.mat')
.
The first step of any analysis is to retrieve a relevant portion of data from the mySQL database to local Matlab files for analysis. This is done using the SQL_Retrieve
function described above, except we use the 'all'
input to retrieve all data, rather than the 'null'
input used to retrieve just missing data (requiring calculation).
Example usage is as follows:
```matlab
SQL_Retrieve(ts_ids, op_ids, 'all');
```

for vectors ts_ids and op_ids, specifying the ts_ids and op_ids to be retrieved from the database.
Sets of ts_ids and op_ids to retrieve can be selected by inspecting the database, or by retrieving relevant sets of keywords using the SQL_GetIDs
function. Running the code in this way, using the ‘all’ tag, ensures that the full range of ts_ids and op_ids specified are retrieved from the database and stored in the local file, HCTSA.mat, which can then form the basis of subsequent analysis.
The database structure provides much flexibility in storing and indexing the large datasets that can be analyzed using the hctsa approach, however the process of retrieving and storing large amounts of data from a database can take a considerable amount of time, depending on database latencies.
Note that missing, or NULL, entries in the database are converted to NaN entries in the local Matlab matrices.
Sometimes you might wish to remove a problematic set of time series from the database (that might have been input incorrectly, for example), including their metadata, and all records of computed data. Other times you might find a problem with the implementation of one of the operations. In this case, you would like to retain that operation in the database, but flush all of the data currently computed for it (so that you can recompute new values). Both of these types of tasks (both removing and clearing time series or operations) can be achieved with the function SQL_ClearRemove
.
This function takes in information about whether to clear (clear any calculated data) or remove (completely delete the given time series or operations from the database).
Example usages are given below:
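A sketch, assuming the input convention SQL_ClearRemove(tsOrOps, ids, doRemove) (check the function's help text for the exact signature in your version):

```matlab
% Clear computed data for time series with ts_ids 1-5 (metadata is retained):
SQL_ClearRemove('ts', 1:5, false);

% Completely remove operations with op_ids 100-105 from the database:
SQL_ClearRemove('ops', 100:105, true);
```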
After retrieving data from the mySQL database, missing entries (NULL in the database, and NaN in the local Matlab file) can be computed using TS_Compute
, and stored back to the database using SQL_store
. These functions are described below.
TS_Compute
Values retrieved using SQL_Retrieve
(to the local HCTSA.mat
file) that have not previously been calculated are evaluated using TS_Compute
, as described above. These results can then be inspected directly (if needed), or simply written back to the database using SQL_store
, as described below.
SQL_store
Once calculations have been performed using Matlab on local files, the results must be written back to the database. This task is performed by SQL_store
, which reads the data in HCTSA.mat
, checks that the metadata still matches the database, and then begins updating the Output, Quality, and CalculationTime columns of the Results table in the mySQL database. This can be done by simply running:
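That is, from a Matlab session in the directory containing the computed HCTSA.mat file:

```matlab
% Write computed results in HCTSA.mat back to the Results table of the database:
SQL_store;
```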
Depending on database latencies, this can be a relatively slow process (up to 20-25 s per time series), as each row of the Results table is updated individually using mySQL UPDATE statements. However, this cost buys the ability to distribute the computation across multiple compute nodes, and to index and retrieve stored data systematically. Keeping results in local Matlab files can be extremely inefficient, and can indeed be untenable for large datasets.
| | Master Operation | Operation | Time Series |
| --- | --- | --- | --- |
| Database identifier: | mop_id | op_id | ts_id |
| Input to SQL_Add: | 'mops' | 'ops' | 'ts' |