[1]:
# %load_ext autoreload
# %autoreload 2
[2]:
import pandas as pd
import numpy as np
import scipy.stats as ss
[3]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
warnings.simplefilter('ignore', np.RankWarning)

import logging, sys
logging.disable(sys.maxsize)

Load Data

[4]:
try:
    from tsad.base.datasets import load_skab
except ImportError:
    import sys
    sys.path.append('../')
    from tsad.base.datasets import load_skab
[5]:
dataset = load_skab()
df = dataset.frame
df = df.reset_index(level=[0])
df = df[df['experiment']=='valve1/6']
df = df.drop(columns='experiment')
df.shape
[5]:
(1154, 10)
[6]:
#TODO use task in pipeline to resample dataframe
df = df.resample('1s').mean().ffill()
df.shape
[6]:
(1200, 10)
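
The row count grows from 1154 to 1200 because resampling to a regular 1-second grid creates one row for every second in the covered span, and forward-filling carries the last observed values into seconds that had no readings. A minimal sketch on toy data (the timestamps and the `value` column are made up for illustration):

```python
import pandas as pd

# Toy series with irregular timestamps: two readings fall within second 0,
# and there are no readings at seconds 2 and 3.
idx = pd.to_datetime([
    "2020-01-01 00:00:00.0",
    "2020-01-01 00:00:00.5",
    "2020-01-01 00:00:01.0",
    "2020-01-01 00:00:04.0",
])
toy = pd.DataFrame({"value": [1.0, 3.0, 5.0, 9.0]}, index=idx)

# mean() aggregates readings that share a 1-second bin;
# ffill() fills empty bins with the last available value.
regular = toy.resample("1s").mean().ffill()
print(regular)
```

The four irregular readings become five regularly spaced rows, one per second.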
[7]:
features = dataset.feature_names
target = dataset.target_names[0]

Feature Generation

Create a FeatureGenerationTask Instance

The FeatureGenerationTask generates features from a time series DataFrame according to a user-defined or default configuration.

[8]:
try:
    from tsad.tasks.feature_generation import FeatureGenerationTask
except ImportError:
    import sys
    sys.path.append('../')
    from tsad.tasks.feature_generation import FeatureGenerationTask

[9]:
feature_generation_task = FeatureGenerationTask(config=None, features=features)
  • config (optional): Configuration for feature generation, provided as a list of dictionaries. If not provided, default configurations will be used.

  • features (optional): A list of features to consider. If not specified, all available columns in the DataFrame will be used.

Default configuration

Default feature generation functions:

By default, the task uses the EfficientFCParameters settings from tsfresh for feature generation. These settings cover a broad set of feature calculators while leaving out those marked as computationally expensive.

Default windows:

The default window sizes for feature generation are determined based on the index frequency of the input DataFrame (freq_df). The following window sizes are used:

  • Window 1: 4 times the frequency of the DataFrame (4 * freq_df)

  • Window 2: 10 times the frequency of the DataFrame (10 * freq_df)

These window sizes are selected to capture a range of temporal patterns in the time series data.
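
How tsad derives freq_df internally is not shown in this notebook. The sketch below assumes it corresponds to the most common spacing between consecutive index timestamps; `default_windows` is a hypothetical helper written only to illustrate the 4x and 10x rule.

```python
import pandas as pd

def default_windows(df, multipliers=(4, 10)):
    """Hypothetical helper: derive window sizes from the index spacing."""
    # Assumption: freq_df is the most common gap between consecutive timestamps.
    freq_df = df.index.to_series().diff().mode().iloc[0]
    return [m * freq_df for m in multipliers]

idx = pd.date_range("2020-01-01", periods=20, freq="1s")
toy = pd.DataFrame({"value": range(20)}, index=idx)

windows = default_windows(toy)
print(windows)
```

Under this assumption, a DataFrame with a 1-second index, like the one used in this notebook, would get 4-second and 10-second windows.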

Fitting the Task

Now that you have initialized the task, it’s time to fit it to your input DataFrame. The fit method will perform feature generation based on your configuration. Here’s how to use it:

[10]:
%%time
# Fit the FeatureGenerationTask to your DataFrame
df_generated, generation_result = feature_generation_task.fit(df)
df_generated.shape
CPU times: user 1.79 s, sys: 482 ms, total: 2.28 s
Wall time: 17.2 s
[10]:
(1200, 4342)
  • df should be your input DataFrame containing the data you want to generate features from.

  • df_generated will be a new DataFrame containing the original columns plus the generated features.

  • generation_result will hold information about the generated features.

Making Predictions

To apply the same feature generation to a DataFrame later (for example, to new data), pass it to the predict method together with the generation_result returned by fit:

[11]:
%%time
df_predicted, _ = feature_generation_task.predict(df, generation_result)
df_predicted.shape
CPU times: user 1.78 s, sys: 485 ms, total: 2.27 s
Wall time: 16.4 s
[11]:
(1200, 4342)

Custom configuration

When performing feature generation, you have the flexibility to define a custom configuration tailored to your specific needs. This custom configuration allows you to select a set of feature extraction functions, specify the series (columns) to which these functions will be applied, and define the windows for calculating these features.

Custom Configuration Example

If you need to customize the feature generation process, you can provide your own configuration. The config parameter allows you to define a list of dictionaries, each specifying a set of features to generate.

[12]:
import scipy.stats as ss

from tsflex.features import FuncWrapper
from tsflex.features.utils import make_robust
[13]:
def slope(x): return (x[-1] - x[0]) / x[0] if x[0] else 0
def abs_diff_mean(x): return np.mean(np.abs(x[1:] - x[:-1])) if len(x) > 1 else 0
def diff_std(x): return np.std(x[1:] - x[:-1]) if len(x) > 1 else 0

funcs = [make_robust(f) for f in [np.min, np.max, np.std, np.mean, slope, ss.skew, abs_diff_mean, diff_std, sum, len,]]

custom_config = [
    {"functions": funcs,
     'series_names': ['Pressure', 'Temperature'],
     "windows": ["10s", "60s"],
    }
]
  • functions: This is a list of feature extraction functions that will be applied to the selected series. These functions are defined in the funcs list, which includes functions like minimum, maximum, standard deviation, mean, slope, skewness, and more. You can customize this list to include the specific functions that are relevant to your analysis.

  • series_names: This is a list of column names in your DataFrame to which the feature extraction functions will be applied. In this example, the functions will be applied to the Pressure and Temperature series. You can modify this list to include the names of the series you want to analyze.

  • windows: This is a list of window sizes for feature calculation. In this example, two window sizes are specified: “10s” (10 seconds) and “60s” (60 seconds). These window sizes determine how the time series data will be segmented for feature extraction. Adjust these window sizes based on your analysis requirements.
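
To make the interplay of functions, series_names, and windows concrete, the sketch below applies two of the custom functions over a 10-second window on a toy Pressure series. It uses pandas time-based rolling purely as an illustration; the windowing performed inside FeatureGenerationTask may align and stride its windows differently, and the output column names are made up.

```python
import numpy as np
import pandas as pd

def slope(x):
    # Relative change between the first and last value of the window.
    return (x[-1] - x[0]) / x[0] if x[0] else 0

def abs_diff_mean(x):
    # Mean absolute difference between consecutive values.
    return np.mean(np.abs(x[1:] - x[:-1])) if len(x) > 1 else 0

idx = pd.date_range("2020-01-01", periods=30, freq="1s")
pressure = pd.Series(np.linspace(1.0, 4.0, 30), index=idx, name="Pressure")

# raw=True passes each window to the function as a NumPy array.
rolling = pressure.rolling("10s")
toy_features = pd.DataFrame({
    "slope_10s": rolling.apply(slope, raw=True),
    "abs_diff_mean_10s": rolling.apply(abs_diff_mean, raw=True),
})
print(toy_features.tail(3))
```

Each function yields one output column per window size, which is why the number of generated columns grows with the product of functions, series, and windows.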

Feature Extraction Functions

Feature extraction functions compute statistical, temporal, spectral, and other characteristics of time series data. In your feature generation task, you can use feature extraction functions from libraries such as tsfresh, tsfel, numpy, and scipy, as well as your own custom functions.

Feature Extraction Categories:

  1. Statistical Features: These features capture statistical properties of the time series data. Common statistical features include mean, median, standard deviation, skewness, kurtosis, variance, and more. Example: np.mean, tsfresh.feature_extraction.feature_calculators.median, etc.

  2. Temporal Features: Temporal features describe patterns over time within the time series. Examples include autocorrelation, mean absolute difference, mean difference, distance, absolute energy, and more. Example: tsfel.feature_extraction.features.autocorr, tsfel.feature_extraction.features.mean_abs_diff, etc.

  3. Spectral Features: Spectral features provide insights into the frequency domain characteristics of the time series. These features include wavelet entropy, spectral entropy, power spectral density, and more. Example: tsfel.feature_extraction.features.wavelet_entropy, tsfel.feature_extraction.features.spectral_entropy, etc.

  4. Custom Functions: You can define custom feature extraction functions tailored to your specific analysis requirements. These functions can capture domain-specific insights or unique patterns in the data. Example: Custom functions like slope(x), abs_diff_mean(x), and diff_std(x) defined in code.

  5. External Libraries: You can leverage external libraries like tsfresh and tsfel for a wide range of pre-defined feature extraction functions. These libraries offer functions for calculating advanced features such as entropy, time-domain, and frequency-domain features. Example: tsfel.feature_extraction.features.entropy, tsfel.feature_extraction.features.abs_energy
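
As a rough illustration of the first three categories, the sketch below computes one or two features of each kind on a toy signal. These are hand-written NumPy versions for intuition only, not the tsfel or tsfresh implementations, and the signal itself is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
# Toy signal: a sine wave with a period of 20 samples plus a little noise.
x = np.sin(2 * np.pi * t / 20) + 0.1 * rng.standard_normal(t.size)

# Statistical: summary of the value distribution.
mean, std = x.mean(), x.std()

# Temporal: how values relate to their neighbours in time.
mean_abs_diff = np.mean(np.abs(np.diff(x)))
lag = 20
autocorr_lag20 = np.corrcoef(x[:-lag], x[lag:])[0, 1]

# Spectral: where the signal's power sits in the frequency domain.
power = np.abs(np.fft.rfft(x - x.mean())) ** 2
dominant_period = t.size / np.argmax(power)

print(f"std={std:.2f}, autocorr(lag 20)={autocorr_lag20:.2f}, "
      f"dominant period={dominant_period:.0f} samples")
```

The statistical features ignore ordering entirely, while the temporal and spectral ones both pick up the 20-sample periodicity from different angles.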

[14]:
from tsfel.feature_extraction.features import (
    # Some temporal features
    autocorr, mean_abs_diff, mean_diff, distance, zero_cross,
    abs_energy, pk_pk_distance, entropy, neighbourhood_peaks,
    # Some statistical features
    interq_range, kurtosis, skewness, calc_max, calc_median,
    median_abs_deviation, rms,
    # Some spectral features
    #  -> Almost all are "advanced" features
    wavelet_entropy
)

tsfel_funcs = [
    # Temporal
    autocorr, mean_abs_diff, mean_diff, distance,
    abs_energy, pk_pk_distance, neighbourhood_peaks,
    # FuncWrapper(entropy, prob="kde", output_names="entropy_kde"),
    # FuncWrapper(entropy, prob="gauss", output_names="entropy_gauss"),
    # Statistical
    interq_range, kurtosis, skewness, calc_max, calc_median,
    median_abs_deviation, rms,
    # Spectral
    wavelet_entropy,
]

# tsfresh
from tsfresh.feature_extraction.feature_calculators import (
    cid_ce,
    variance_larger_than_standard_deviation,
)

tsfresh_funcs=[
        variance_larger_than_standard_deviation,
        FuncWrapper(cid_ce, normalize=True),
    ]

Choosing Feature Extraction Functions

When choosing feature extraction functions for your analysis, consider the following factors:

  • Relevance: Select functions that are relevant to your analysis goals. For instance, if you’re interested in detecting periodicity, consider using autocorrelation or spectral features.

  • Computational Efficiency: Consider the computational cost of the functions, especially when dealing with large datasets. Some functions may be computationally expensive.

  • Domain Knowledge: Leverage your domain knowledge to identify features that have interpretability and meaning in your specific domain.

  • Customization: Don’t hesitate to define custom functions if the standard functions do not capture the patterns you’re interested in.
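
One simple way to weigh computational cost is to time each candidate function on a single representative window before adding it to a configuration. The sketch below does this with the standard-library timeit module; the candidate list and the 60-sample window are arbitrary choices for illustration.

```python
import timeit
import numpy as np

def abs_diff_mean(x):
    return np.mean(np.abs(np.diff(x))) if len(x) > 1 else 0

candidates = {"mean": np.mean, "std": np.std, "abs_diff_mean": abs_diff_mean}

# One representative window, e.g. 60 samples at a 1-second frequency.
window = np.random.default_rng(0).standard_normal(60)

# Average seconds per call over 1000 calls for each candidate function.
cost = {
    name: timeit.timeit(lambda f=func: f(window), number=1000) / 1000
    for name, func in candidates.items()
}
for name, seconds in sorted(cost.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1e6:.1f} µs per call")
```

The per-call cost is multiplied by the number of windows, series, and window sizes, so small differences add up on long recordings.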

Applying Custom Configuration

To apply the custom configuration for feature generation, you can use the FeatureGenerationTask class. Here’s an example of how to use it:

[15]:
%%time
# Define your custom configuration
custom_config = [
    {"functions": funcs,
     'series_names': ['Pressure', 'Temperature'],
     "windows": ["10s", "60s"],
    },
    {"functions": tsfel_funcs,
     'series_names': ['Pressure', 'Temperature', 'Thermocouple', 'Voltage'],
     "windows": ["30s", "60s"],
    },
    {"functions": tsfresh_funcs,
     'series_names': ['Pressure', 'Temperature'],
     "windows": ["20s", "60s"],
    },
]

custom_feature_generation_task = FeatureGenerationTask(config=custom_config, features=features)
CPU times: user 7 µs, sys: 2 µs, total: 9 µs
Wall time: 10 µs
[16]:
%%time
df_generated, generation_result = custom_feature_generation_task.fit(df)
df_generated.shape
CPU times: user 93.4 ms, sys: 54.8 ms, total: 148 ms
Wall time: 1.15 s
[16]:
(1200, 178)
[17]:
%%time
df_predicted, _ = custom_feature_generation_task.predict(df, generation_result)
df_predicted.shape
CPU times: user 85.7 ms, sys: 47.7 ms, total: 133 ms
Wall time: 1.1 s
[17]:
(1200, 178)

Feature Selection

Create a FeatureSelectionTask Instance

[18]:
try:
    from tsad.tasks.feature_selection import FeatureSelectionTask
except ImportError:
    import sys
    sys.path.append('../')
    from tsad.tasks.feature_selection import FeatureSelectionTask

The FeatureSelectionTask class is part of the tsad framework and is used for feature selection. Here’s an overview of its main attributes:

  • target: The target feature name you want to predict.

  • n_features_to_select: Number of features to select (a fraction or an integer).

  • feature_selection_method: Method for feature selection. Options include ‘univariate’, ‘tsfresh’, ‘sequential’, or ‘frommodel’.

  • feature_selection_estimator: Estimator used for feature selection (e.g., ‘regressor’ or ‘classifier’).

  • remove_constant_features: Whether to remove constant features.

When creating a FeatureSelectionTask instance, you can customize several parameters to tailor the feature selection process to your specific needs. Here’s a detailed explanation of these parameters:

  • target (str):

    • Required: Specify the name of your target feature. This is the feature you want to predict using your machine learning model.

  • n_features_to_select (float | int | None):

    • Optional: Number of features to select.

    • If you provide an integer value, it will select that exact number of features.

    • If you provide a float value (e.g., 0.2), it will select a fraction of features based on the total number of available features.

    • Setting it to None (default) will not perform any feature selection based on the number of features.

  • feature_selection_method (str | None):

    • Optional: Method for feature selection.

    • Options include:

      • univariate: Perform univariate feature selection based on statistical tests.

      • tsfresh: Utilize the tsfresh library for automated time series feature selection.

      • sequential: Sequential feature selection using an estimator (e.g., RandomForest) for classification or regression.

      • frommodel: Select features using an estimator (e.g., RandomForest) for classification or regression.

    • If set to None (default), the frommodel method is used.

  • feature_selection_estimator (str | None):

    • Optional: Feature selection estimator.

    • If you choose ‘sequential’ or ‘frommodel’ as the feature selection method, you need to specify the estimator.

    • Options depend on your specific use case (e.g., ‘classifier’ or ‘regressor’ for classification or regression tasks).

    • If set to None (default), ‘regressor’ is used as the estimator.

  • remove_constant_features (bool):

    • Optional: Whether to remove constant features from the dataset.

    • Constant features have the same value for all samples and usually don’t provide valuable information.

    • Setting it to True (default) will remove constant features, and False will keep them in the dataset.
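
The sketch below illustrates two of these parameters on toy data: how an integer, a fraction, or None could be turned into a feature count, and what removing constant features means. Both helpers are hypothetical and are not taken from tsad; in particular, the rounding rule for fractions is an assumption.

```python
import pandas as pd

def resolve_n_features(n_features_to_select, n_available):
    """Hypothetical helper: turn an int, a fraction, or None into a count."""
    if n_features_to_select is None:
        return n_available  # no limit on the number of features
    if isinstance(n_features_to_select, float):
        # Assumption: fractions are rounded down, keeping at least one feature.
        return max(1, int(n_features_to_select * n_available))
    return min(n_features_to_select, n_available)

def drop_constant_columns(df):
    """Drop columns that hold a single unique value."""
    return df.loc[:, df.nunique(dropna=False) > 1]

toy = pd.DataFrame({"a": [1, 2, 3], "b": [5, 5, 5], "c": [0.1, 0.2, 0.1]})
kept = drop_constant_columns(toy)

print(list(kept.columns))            # ['a', 'c']
print(resolve_n_features(0.25, 40))  # 10
print(resolve_n_features(10, 40))    # 10
print(resolve_n_features(None, 40))  # 40
```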

Let’s create an instance of FeatureSelectionTask and specify the configuration for feature selection.

[19]:
feature_selection_task = FeatureSelectionTask(
    target=target,  # Specify your target feature name
    n_features_to_select=0.2,  # Number of features to select (you can use an integer or fraction)
    feature_selection_method='univariate',  # Choose your feature selection method
    feature_selection_estimator='classifier',  # Choose your estimator (for classification)
    remove_constant_features=True  # Remove constant features
)

Fit and Select Features

Next, we’ll fit the FeatureSelectionTask to our dataset and perform feature selection. This step will return a DataFrame with the selected features and a result object for further analysis.

[20]:
print(df_generated.shape)
df_selected, result = feature_selection_task.fit(df_generated)
print(df_selected.shape)
(1200, 178)
(1200, 33)

Now that we have our selected features in the df_selected DataFrame, we can proceed with model training and evaluation.

[21]:
feature_selection_task.predict(df_generated, result)[0].shape
[21]:
(1200, 33)
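
For intuition about the ‘univariate’ method, the sketch below scores each feature on its own against the target and keeps the top-scoring ones. The statistical test that tsad applies is not shown in this notebook, so absolute Pearson correlation stands in for it here; the data and column names are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)  # toy binary anomaly label
toy = pd.DataFrame({
    "informative": y + 0.1 * rng.standard_normal(n),
    "noise_1": rng.standard_normal(n),
    "noise_2": rng.standard_normal(n),
    "noise_3": rng.standard_normal(n),
})

# Score each feature independently: absolute correlation with the target.
scores = toy.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))

# Keep the k highest-scoring features.
k = 2
selected = scores.sort_values(ascending=False).index[:k].tolist()
print(scores.round(2).to_dict())
print(selected)
```

Because every feature is scored in isolation, this kind of selection is fast but cannot detect features that are only useful in combination.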

Make Pipeline

[22]:
try:
    from tsad.base.pipeline import Pipeline
except ImportError:
    import sys
    sys.path.append('../')
    from tsad.base.pipeline import Pipeline

The Pipeline class lets you define and execute multiple data processing tasks sequentially. The pipeline defined here consists of two tasks: feature generation and feature selection.

[23]:
pipeline = Pipeline([
    FeatureGenerationTask(features=features, config=None),
    FeatureSelectionTask(target=target,
                         remove_constant_features=True,
                         feature_selection_method='univariate',
                         feature_selection_estimator='classifier'
                        ),
                    ]
)

Feature Generation Task

The first task in the pipeline is the FeatureGenerationTask, which generates new features from the input data. The features parameter lists the input columns to generate features from. Here config is set to None, so the default feature generation configuration is used.

Feature Selection Task

The second task in the pipeline is the FeatureSelectionTask. This task is focused on selecting a subset of relevant features from the ones generated in the previous step.

Fitting the Pipeline

After defining the pipeline, you can fit it to your dataset using the fit method:

[24]:
%%time
df_fit = pipeline.fit(df)
df_fit.shape
CPU times: user 2.39 s, sys: 518 ms, total: 2.91 s
Wall time: 17 s
[24]:
(1200, 637)

Here, df is the input DataFrame. When you call pipeline.fit(df), it performs the following steps:

  1. The FeatureGenerationTask generates new features from the input DataFrame df.

  2. The FeatureSelectionTask selects a subset of features based on the specified criteria, including removing constant features and using a classification-based estimator.

  3. The resulting DataFrame with the selected features is stored in the variable df_fit.

Predicting with the Pipeline

Once you have fitted your combined pipeline to your dataset using the fit method, you can also use it to make predictions. Here’s how you can predict using the pipeline:

[25]:
%%time
df_predict = pipeline.predict(df)
df_predict.shape
CPU times: user 323 ms, sys: 188 ms, total: 511 ms
Wall time: 3.47 s
[25]:
(1200, 637)

The predict method applies the fitted pipeline to a dataset (here the same df that was used for fitting; in practice, typically new data) and returns the transformed DataFrame, reusing the feature generation and selection steps learned during fit.
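
The fit and predict flow can be sketched with a small stand-in. `MiniPipeline` and the two toy tasks are hypothetical and are not tsad's implementation; they only mirror the convention seen in this notebook, where a task's fit returns a DataFrame plus a result object and predict reuses that result.

```python
import pandas as pd

class AddRollingMean:
    """Toy task: adds a rolling-mean column for each input column."""
    def fit(self, df):
        result = {"columns": list(df.columns)}
        return self.predict(df, result)

    def predict(self, df, result):
        out = df.copy()
        for col in result["columns"]:
            out[f"{col}_mean3"] = df[col].rolling(3, min_periods=1).mean()
        return out, result

class DropConstant:
    """Toy task: remembers which columns were non-constant at fit time."""
    def fit(self, df):
        result = {"keep": [c for c in df.columns if df[c].nunique() > 1]}
        return self.predict(df, result)

    def predict(self, df, result):
        return df[result["keep"]], result

class MiniPipeline:
    """Hypothetical stand-in for a sequential pipeline, not tsad's Pipeline."""
    def __init__(self, tasks):
        self.tasks = tasks
        self.results = []

    def fit(self, df):
        self.results = []
        for task in self.tasks:
            df, result = task.fit(df)  # each task consumes the previous output
            self.results.append(result)
        return df

    def predict(self, df):
        for task, result in zip(self.tasks, self.results):
            df, _ = task.predict(df, result)  # reuse what was learned in fit
        return df

toy = pd.DataFrame({"a": [1.0, 2.0, 4.0, 8.0], "b": [3.0, 3.0, 3.0, 3.0]})
pipe = MiniPipeline([AddRollingMean(), DropConstant()])
fitted = pipe.fit(toy)
predicted = pipe.predict(toy)
print(list(fitted.columns))
```

Storing one result per task is what lets predict reproduce exactly the columns chosen during fit, even when the new data would have led to a different selection.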
