[1]:
import sys
import pandas as pd

sys.path.insert(1, '../')

This notebook demonstrates how to import the bundled datasets and inspect basic information about them. To begin with, we can list all the available datasets:

[2]:
from tsad.base.datasets import list_of_datasets

list_of_datasets()
[2]:
{'Combines state monitoring': 'load_combines()',
 'SKAB (skoltech anomaly benchmark) teaser': 'load_skab_teaser()',
 'SKAB (skoltech anomaly benchmark)': 'load_skab()',
 'NASA Turbofan Jet Engine Data Set': 'load_turbofan_jet_engine()',
 'TEP (Tennessee Eastman process)': 'load_tep()',
 'Pressurized Water Reactor (PWR) Dataset for Fault Detection': 'load_pwr_anomalies()',
 'NPP Power Transformer RUL': 'load_transformer_rul()'}

In this dictionary, the keys are the dataset names and the values are the loader function calls (written as strings) that return the corresponding dataset. Let’s try them out.

Docstrings contain links to the detailed dataset description, if one exists.
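Because the values are strings such as 'load_skab()' rather than callables, looking up a loader by dataset name takes one small extra step. The sketch below illustrates one way to do it. `resolve_loader` is a hypothetical helper (not part of tsad), and a stand-in namespace replaces the `tsad.base.datasets` module so that the snippet runs on its own.

```python
from types import SimpleNamespace


def resolve_loader(module, catalog, dataset_name):
    """Return the loader callable registered for `dataset_name`.

    `catalog` is a dict shaped like the output of list_of_datasets():
    dataset name -> loader call written as a string, e.g. 'load_skab()'.
    """
    # Drop the trailing "()" to get the bare function name.
    loader_name = catalog[dataset_name].split("(")[0]
    return getattr(module, loader_name)


# Stand-in for the real module (assumption: the real loaders live in
# tsad.base.datasets, as the imports in this notebook suggest).
fake_module = SimpleNamespace(load_skab=lambda: "skab dataset placeholder")
catalog = {"SKAB (skoltech anomaly benchmark)": "load_skab()"}

loader = resolve_loader(fake_module, catalog, "SKAB (skoltech anomaly benchmark)")
result = loader()
print(result)
```

With the library installed, passing the imported `tsad.base.datasets` module and the dictionary returned by `list_of_datasets()` should work the same way, assuming every listed loader is exposed under the name shown in the dictionary.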

Combines state monitoring dataset

Importing

[3]:
from tsad.base.datasets import load_combines

dataset = load_combines()

Dataset info

[4]:
print(f"Dataset's name: {dataset.name}\n")
print(f"Dataset's description: {dataset.description}\n")
print(f"Task to solve with dataset: {dataset.task}\n")
print(f"Dataset's features: {dataset.feature_names}\n")
print(f"Dataset's target: {dataset.target_names}")
Dataset's name: Combines state monitoring

Dataset's description:

Task to solve with dataset:

Dataset's features: ['Anker', 'Cut', 'Go', 'Uncert']

Dataset's target: None
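The same five attributes (`name`, `description`, `task`, `feature_names`, `target_names`) are printed for every dataset in this notebook, so the repeated cell could be wrapped in a small helper. The following is only a sketch: `dataset_summary` is a hypothetical function, and a `SimpleNamespace` filled with the values shown above stands in for the object returned by `load_combines()`.

```python
from types import SimpleNamespace


def dataset_summary(dataset):
    """Build the five-field summary that the info cells print."""
    fields = [
        ("Dataset's name", dataset.name),
        ("Dataset's description", dataset.description),
        ("Task to solve with dataset", dataset.task),
        ("Dataset's features", dataset.feature_names),
        ("Dataset's target", dataset.target_names),
    ]
    # One "label: value" entry per field, separated by blank lines.
    return "\n\n".join(f"{label}: {value}" for label, value in fields)


# Stand-in object; a real call would pass the loaded dataset instead.
combines_like = SimpleNamespace(
    name="Combines state monitoring",
    description="",
    task="",
    feature_names=["Anker", "Cut", "Go", "Uncert"],
    target_names=None,
)

summary = dataset_summary(combines_like)
print(summary)
```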

Dataset

[5]:
dataset.frame.head(2)
[5]:
Описание                 Anker  Cut   Go  Uncert
Время
2023-04-21 13:32:48.228    0.0  NaN  NaN     NaN
2023-04-21 13:32:48.230    NaN  NaN  0.0     NaN

The axis labels Описание and Время are Russian for "Description" and "Time".

SKAB (skoltech anomaly benchmark) teaser

Importing

[6]:
from tsad.base.datasets import load_skab_teaser

dataset = load_skab_teaser()

Dataset info

[7]:
print(f"Dataset's name: {dataset.name}\n")
print(f"Dataset's description: {dataset.description}\n")
print(f"Task to solve with dataset: {dataset.task}\n")
print(f"Dataset's features: {dataset.feature_names}\n")
print(f"Dataset's target: {dataset.target_names}")
Dataset's name: SKAB (skoltech anomaly benchmark) teaser

Dataset's description: Dataset for process monitoring (changepoint detection) benchmarking. It is just a short version (teaser) of SKAB

Task to solve with dataset: Process monitoring (changepoint detection)

Dataset's features: ['Accelerometer1RMS', 'Accelerometer2RMS', 'Current', 'Pressure', 'Temperature', 'Thermocouple', 'Voltage', 'Volume Flow RateRMS']

Dataset's target: None

Dataset

The dataset comes with separate markup (labels). The data itself:

[8]:
dataset.frame[0].head(2)
[8]:
id                   Accelerometer1RMS  Accelerometer2RMS   Current   Pressure  Temperature  Thermocouple  Voltage  Volume Flow RateRMS
datetime
2019-07-08 17:02:14            0.04283           0.080612  0.000749  -0.273216      28.2099       23.4457  252.743              37.0242
2019-07-08 17:02:32            0.04333           0.084116  0.000849   0.054711      28.3486       23.4492  240.488              37.0000

Labels:

[9]:
dataset.frame[1]
[9]:
[('2019-07-08 18:39:22', '2019-07-08 18:42:32'),
 ('2019-07-08 18:44:36', '2019-07-08 18:46:51'),
 ('2019-07-08 19:06:57', '2019-07-08 19:11:31'),
 ('2019-07-08 19:14:40', '2019-07-08 19:21:16')]
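The labels are (start, end) timestamp pairs, while most detectors are evaluated against a per-timestamp label. A minimal sketch of that conversion follows. `intervals_to_labels` is a hypothetical helper, and the one-minute index is synthetic (in practice it would be the index of `dataset.frame[0]`).

```python
import pandas as pd


def intervals_to_labels(index, intervals):
    """Mark every timestamp that falls inside any (start, end) interval with 1."""
    labels = pd.Series(0, index=index, name="anomaly")
    for start, end in intervals:
        # Label-based slicing on a sorted DatetimeIndex includes both bounds.
        labels.loc[pd.Timestamp(start):pd.Timestamp(end)] = 1
    return labels


# Synthetic one-minute grid covering the first two intervals shown above.
index = pd.date_range("2019-07-08 18:38:00", "2019-07-08 18:48:00", freq="min")
intervals = [
    ("2019-07-08 18:39:22", "2019-07-08 18:42:32"),
    ("2019-07-08 18:44:36", "2019-07-08 18:46:51"),
]

labels = intervals_to_labels(index, intervals)
print(labels)
```

The slicing relies on the index being sorted in time; an unsorted index would need `sort_index()` first.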

SKAB (skoltech anomaly benchmark)

Importing

[10]:
from tsad.base.datasets import load_skab

dataset = load_skab()

Dataset info

[11]:
print(f"Dataset's name: {dataset.name}\n")
print(f"Dataset's description: {dataset.description}\n")
print(f"Task to solve with dataset: {dataset.task}\n")
print(f"Dataset's features: {dataset.feature_names}\n")
print(f"Dataset's target: {dataset.target_names}")
Dataset's name: SKAB (skoltech anomaly benchmark)

Dataset's description: Dataset for process monitoring (changepoint detection) benchmarking

Task to solve with dataset: Process monitoring (changepoint detection)

Dataset's features: ['Accelerometer1RMS', 'Accelerometer2RMS', 'Current', 'Pressure', 'Temperature', 'Thermocouple', 'Voltage', 'Volume Flow RateRMS']

Dataset's target: ['anomaly', 'changepoint']

Dataset

[12]:
dataset.frame.head(2)
[12]:
                                Accelerometer1RMS  Accelerometer2RMS  Current  Pressure  Temperature  Thermocouple  Voltage  Volume Flow RateRMS  anomaly  changepoint
experiment datetime
valve1/6   2020-03-09 12:14:36           0.027429           0.040353  0.77031  0.382638      71.2129       25.0827  219.789              32.0000      0.0          0.0
           2020-03-09 12:14:37           0.027269           0.040226  1.09696  0.710565      71.4284       25.0863  233.117              32.0104      0.0          0.0
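The frame is indexed by two levels, `experiment` and `datetime`, so a single experiment can be pulled out by its first-level key. The sketch below rebuilds a tiny frame with the same index layout to show this. The `other/0` experiment and its values are made up for illustration, and only two of the columns are reproduced.

```python
import pandas as pd

# Toy frame mimicking the (experiment, datetime) MultiIndex shown above.
index = pd.MultiIndex.from_tuples(
    [
        ("valve1/6", pd.Timestamp("2020-03-09 12:14:36")),
        ("valve1/6", pd.Timestamp("2020-03-09 12:14:37")),
        ("other/0", pd.Timestamp("2020-03-09 13:00:00")),  # hypothetical experiment
    ],
    names=["experiment", "datetime"],
)
frame = pd.DataFrame(
    {"Current": [0.77031, 1.09696, 0.5], "anomaly": [0.0, 0.0, 1.0]},
    index=index,
)

# Select one experiment; the result is indexed by datetime only.
one_experiment = frame.xs("valve1/6", level="experiment")

# Share of anomalous points per experiment.
anomaly_share = frame.groupby(level="experiment")["anomaly"].mean()

print(one_experiment)
print(anomaly_share)
```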

NASA Turbofan Jet Engine Data Set

Importing

[13]:
from tsad.base.datasets import load_turbofan_jet_engine

dataset = load_turbofan_jet_engine()

Dataset info

[14]:
print(f"Dataset's name: {dataset.name}\n")
print(f"Dataset's description: {dataset.description}\n")
print(f"Task to solve with dataset: {dataset.task}\n")
print(f"Dataset's features: {dataset.feature_names}\n")
print(f"Dataset's target: {dataset.target_names}")
Dataset's name: NASA Turbofan Jet Engine Data Set

Dataset's description: Dataset includes Run-to-Failure simulated data from turbo fan jet engines. In this dataset the goal is to predict the remaining useful life (RUL) of each engine in the test dataset. RUL is equivalent of number of flights remained for the engine after the last datapoint in the test dataset.
    - In train dataset there are 100 engines. The last cycle for each engine represents the cycle when failure had happened.
    - In test dataset there are 100 engines as well. But this time, failure cycle was not provided.

Task to solve with dataset: Remaining useful life prediction

Dataset's features: ['id', 'cycle', 'setting1', 'setting2', 'setting3', 's1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']

Dataset's target: ['ttf']

Dataset

The dataset is split into separate X_train, X_test and y_test frames.

X_train:

[15]:
dataset.frame[0].head(2)
[15]:
   id  cycle  setting1  setting2  setting3      s1      s2       s3       s4     s5  ...     s12      s13      s14     s15   s16  s17   s18    s19    s20      s21
0   1      1   -0.0007   -0.0004     100.0  518.67  641.82  1589.70  1400.60  14.62  ...  521.66  2388.02  8138.62  8.4195  0.03  392  2388  100.0  39.06  23.4190
1   1      2    0.0019   -0.0003     100.0  518.67  642.15  1591.82  1403.14  14.62  ...  522.28  2388.07  8131.49  8.4318  0.03  392  2388  100.0  39.00  23.4236

2 rows × 26 columns

X_test:

[16]:
dataset.frame[1].head(2)
[16]:
   id  cycle  setting1  setting2  setting3      s1      s2       s3       s4     s5  ...     s12      s13      s14     s15   s16  s17   s18    s19    s20      s21
0   1      1    0.0023    0.0003     100.0  518.67  643.02  1585.29  1398.21  14.62  ...  521.72  2388.03  8125.55  8.4052  0.03  392  2388  100.0  38.86  23.3735
1   1      2   -0.0027   -0.0003     100.0  518.67  641.71  1588.45  1395.42  14.62  ...  522.16  2388.06  8139.62  8.3803  0.03  393  2388  100.0  39.02  23.3916

2 rows × 26 columns

y_test:

[17]:
dataset.frame[2].head(2)
[17]:
   ttf
0  112
1   98
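No y_train frame is provided, but the description states that the last cycle of each training engine is its failure cycle. One common way to derive a training target from that (an assumption here, not necessarily what tsad does) is to count the cycles remaining until each engine's last recorded cycle. The column name `rul` and the toy data are illustrative.

```python
import pandas as pd


def add_train_rul(x_train):
    """Append a 'rul' column: cycles left until the engine's last recorded cycle."""
    last_cycle = x_train.groupby("id")["cycle"].transform("max")
    return x_train.assign(rul=last_cycle - x_train["cycle"])


# Toy stand-in for dataset.frame[0]: engine 1 runs 3 cycles, engine 2 runs 2.
toy_train = pd.DataFrame({"id": [1, 1, 1, 2, 2], "cycle": [1, 2, 3, 1, 2]})

with_rul = add_train_rul(toy_train)
print(with_rul)
```

Each engine's final row gets a value of 0, matching the statement that the last training cycle is the failure cycle.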

TEP (Tennessee Eastman process)

Importing

[18]:
from tsad.base.datasets import load_tep

dataset = load_tep()

Dataset info

[19]:
print(f"Dataset's name: {dataset.name}\n")
print(f"Dataset's description: {dataset.description}\n")
print(f"Task to solve with dataset: {dataset.task}\n")
print(f"Dataset's features: {dataset.feature_names}\n")
print(f"Dataset's target: {dataset.target_names}")
Dataset's name: TEP (Tennessee Eastman process)

Dataset's description: Each training data file contains 480 rows and 52 columns and each testing data file contains 960 rows and 52 columns.  An observation vector at a particular time instant is given by x=[XMEAS(1), XMEAS(2), ..., XMEAS(41), XMV(1), ..., XMV(11)]^T where XMEAS(n) is the n-th measured variable and XMV(n) is the n-th manipulated variable.

Task to solve with dataset: Outlier detection

Dataset's features: ['XMEAS(1)', 'XMEAS(2)', 'XMEAS(3)', 'XMEAS(4)', 'XMEAS(5)', 'XMEAS(6)', 'XMEAS(7)', 'XMEAS(8)', 'XMEAS(9)', 'XMEAS(10)', 'XMEAS(11)', 'XMEAS(12)', 'XMEAS(13)', 'XMEAS(14)', 'XMEAS(15)', 'XMEAS(16)', 'XMEAS(17)', 'XMEAS(18)', 'XMEAS(19)', 'XMEAS(20)', 'XMEAS(21)', 'XMEAS(22)', 'XMEAS(23)', 'XMEAS(24)', 'XMEAS(25)', 'XMEAS(26)', 'XMEAS(27)', 'XMEAS(28)', 'XMEAS(29)', 'XMEAS(30)', 'XMEAS(31)', 'XMEAS(32)', 'XMEAS(33)', 'XMEAS(34)', 'XMEAS(35)', 'XMEAS(36)', 'XMEAS(37)', 'XMEAS(38)', 'XMEAS(39)', 'XMEAS(40)', 'XMEAS(41)', 'XMV(1)', 'XMV(2)', 'XMV(3)', 'XMV(4)', 'XMV(5)', 'XMV(6)', 'XMV(7)', 'XMV(8)', 'XMV(9)', 'XMV(10)', 'XMV(11)']

Dataset's target: None

Dataset

[20]:
dataset.frame.head(2)
[20]:
                  XMEAS(1)  XMEAS(2)  XMEAS(3)  XMEAS(4)  XMEAS(5)  XMEAS(6)  XMEAS(7)  XMEAS(8)  XMEAS(9)  XMEAS(10)  ...  XMV(2)  XMV(3)  XMV(4)  XMV(5)  XMV(6)  XMV(7)  XMV(8)  XMV(9)  XMV(10)  XMV(11)
experiment index
1          0       0.25025    3657.2    4520.1    9.3965    26.715    42.191    2704.5    74.593    120.42    0.33701  ...  53.850  24.670  61.839  22.101  40.078  33.041  48.969  47.459   41.841   18.049
           1       0.25135    3662.1    4532.3    9.4020    26.644    42.812    2704.9    75.044    120.39    0.33723  ...  53.705  24.562  61.348  22.264  40.050  39.154  49.870  47.403   41.188   18.008

2 rows × 52 columns
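The 52 columns follow the layout given in the description: 41 measured variables followed by 11 manipulated ones. The sketch below regenerates those names and splits a frame into the two groups. The zero-filled toy frame only stands in for `dataset.frame`.

```python
import pandas as pd

# Column names as listed in the feature list: XMEAS(1..41), then XMV(1..11).
xmeas_cols = [f"XMEAS({i})" for i in range(1, 42)]
xmv_cols = [f"XMV({i})" for i in range(1, 12)]
feature_names = xmeas_cols + xmv_cols

# Zero-filled stand-in with the same columns as the real frame.
toy_frame = pd.DataFrame(0.0, index=range(3), columns=feature_names)

measured = toy_frame[xmeas_cols]      # process measurements
manipulated = toy_frame[xmv_cols]     # manipulated (control) variables

print(measured.shape, manipulated.shape)
```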

Pressurized Water Reactor (PWR) Dataset for Fault Detection

Importing

[21]:
from tsad.base.datasets import load_pwr_anomalies

dataset = load_pwr_anomalies()

Dataset info

[22]:
print(f"Dataset's name: {dataset.name}\n")
print(f"Dataset's description: {dataset.description}\n")
print(f"Task to solve with dataset: {dataset.task}\n")
print(f"Dataset's features: {dataset.feature_names}\n")
print(f"Dataset's target: {dataset.target_names}")
Dataset's name: Pressurized Water Reactor (PWR) Dataset for Fault Detection

Dataset's description: Our collected dataset is benchmark data in case of reactor abnormalities detection with labels. There are 267 readings from 14 sensors of three categories: a temperature sensor, pressure sensor, and vibration sensor (including ionization chamber, accelerometer, and relative displacement sensors). This particular dataset can be utilized in the case of unsupervised abnormality detection.

Task to solve with dataset: Anomaly detection

Dataset's features: ['Temperature', 'Pressure', 'Flow1', 'Flow2', 'VRR12', 'VRR22', 'VRR23', 'VRR33', 'VRS01', 'VRS03', 'VRS21', 'VRS31', 'VRS02', 'VRI01', 'VRI02', 'VRI03']

Dataset's target: None

Dataset

[23]:
dataset.frame.head(2)
[23]:
          Temperature  Pressure        Flow1        Flow2      VRR12     VRR22     VRR23     VRR33     VRS01     VRS03     VRS21     VRS31     VRS02     VRI01     VRI02     VRI03
Readings
1          248.852987  9.689813  4462.130014   13302.9265  19.060938  0.059119  0.050589  0.111864  0.033951  0.047812  0.232627  0.253775  0.400726  1.763223  0.003031  0.004995
2          269.315740  1.279532  4480.252595  13784.45225  19.062128  0.059089  0.048788  0.111340  0.034060  0.052611  0.233342  0.315067  0.128517  1.769272  0.003164  0.004999

NPP Power Transformer RUL

Importing

[3]:
from tsad.base.datasets import load_transformer_rul

dataset = load_transformer_rul()

Dataset info

[4]:
print(f"Dataset's name: {dataset.name}\n")
print(f"Dataset's description: {dataset.description}\n")
print(f"Task to solve with dataset: {dataset.task}\n")
print(f"Dataset's features: {dataset.feature_names}\n")
print(f"Dataset's target: {dataset.target_names}")
Dataset's name: NPP Power Transformer RUL

Dataset's description: Dataset for Determining the Remaining Useful Life of Transformers. It is necessary to create a mathematical model that will determine RUL by the final 420 points. The period between time points is 12 hours.

Task to solve with dataset: Remaining useful life prediction

Dataset's features: ['H2', 'CO', 'C2H4', 'C2H2']

Dataset's target: ['predicted']

Dataset

The dataset consists of four separate files holding the X_train, X_test, y_train and y_test sets.

X_train:

[5]:
dataset.frame[0].head(2)
[5]:
                                  H2        CO      C2H4      C2H2
id              time point
2_trans_497.csv 0           0.001202  0.029565  0.001069  0.000251
                1           0.001202  0.029563  0.001068  0.000251

X_test:

[6]:
dataset.frame[1].head(2)
[6]:
                                   H2        CO      C2H4      C2H2
id               time point
2_trans_1853.csv 0           0.001664  0.026699  0.003253  0.000104
                 1           0.001664  0.026705  0.003253  0.000104

y_train:

[7]:
dataset.frame[2].head(2)
[7]:
                 predicted
id
2_trans_497.csv        550
2_trans_483.csv       1093

y_test:

[8]:
dataset.frame[3].head(2)
[8]:
                  predicted
id
2_trans_1853.csv        693
2_trans_1106.csv       1093
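The feature frames are indexed by (`id`, `time point`) and the targets by `id` alone, and the description asks for a model based on the final 420 points of each transformer. Reading that as "use the last 420 rows per `id`" (an interpretation, not something the source spells out), the selection and the alignment of targets could be sketched as follows. `last_points` and the toy frames are hypothetical.

```python
import pandas as pd


def last_points(x, n=420):
    """Keep the final `n` rows of every transformer, grouped by the 'id' level."""
    return x.groupby(level="id").tail(n)


# Toy stand-ins for dataset.frame[0] (features) and dataset.frame[2] (targets).
index = pd.MultiIndex.from_tuples(
    [("a.csv", 0), ("a.csv", 1), ("a.csv", 2), ("b.csv", 0), ("b.csv", 1)],
    names=["id", "time point"],
)
x_toy = pd.DataFrame({"H2": [0.1, 0.2, 0.3, 0.4, 0.5]}, index=index)
y_toy = pd.DataFrame(
    {"predicted": [550, 1093]}, index=pd.Index(["a.csv", "b.csv"], name="id")
)

tail = last_points(x_toy, n=2)

# Repeat each transformer's target on its rows via the shared 'id' level.
ids = tail.index.get_level_values("id")
tail_with_target = tail.assign(predicted=y_toy["predicted"].reindex(ids).to_numpy())

# With 12 hours between points, 420 points span 210 days.
window = 420 * pd.Timedelta(hours=12)

print(tail_with_target)
print(window)
```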