Mastering Pipelines: Integrating Feature Engineering into Your Predictive Models

Predictive modeling in the realm of data science involves not just algorithms, but a meticulous understanding of data and the art of feature engineering. In this guide, we will explore the intricate world of enhancing predictive modeling with Scikit-Learn pipelines. Whether you’re a student venturing into the world of machine learning or a professional seeking to bolster your skills, this guide will equip you with essential knowledge.

Understanding the Power of Feature Engineering

At the heart of predictive modeling lies feature engineering. It’s not merely about selecting data; it’s about transforming raw information into meaningful insights. Imagine tackling a spam email classification task. By converting raw text data into features like word frequency or the presence of specific keywords, machine learning algorithms gain the ability to distinguish between spam and non-spam emails accurately. This process is the essence of feature engineering. While large deep learning models can usually learn these patterns on their own, more lightweight models often need some help.

Examples of Feature Engineering

  • Text Data: Extract features such as word length, existence of certain keywords, word frequencies, etc.
  • Datetime Data: Extract year, month, day, and hour from timestamps to recognize temporal patterns. Additionally, you can categorize dates by day of the week, weekend vs. weekday, morning vs. afternoon, holidays, among others (see the short sketch after this list).
  • Geographical Data: Integrate external demographic data and merge geographical and datetime data to reveal correlations, like weather impact on incidents.
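
For datetime data in particular, the extraction can be done directly with pandas accessors. Below is a minimal sketch; the DataFrame and its ‘timestamp’ column are made up for illustration:

import pandas as pd

# Toy data: one weekday and one weekend timestamp
df = pd.DataFrame({'timestamp': pd.to_datetime(['2023-01-02 09:15', '2023-01-07 14:30'])})

# Basic numeric components
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['hour'] = df['timestamp'].dt.hour

# Categorical groupings: day of week and weekend vs. weekday
df['day_of_week'] = df['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df['is_weekend'] = df['day_of_week'] >= 5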

Integrating Feature Engineering into Scikit-Learn Pipelines

Pipelines in Scikit-Learn are not just convenient; they are indispensable. They provide a systematic approach to data processing, ensuring uniformity and guarding against data leakage.

Data leakage can significantly impact the reliability of your model. Pipelines act as shields, guaranteeing that preprocessing steps, including feature engineering, are applied consistently during both training and testing. For instance, when computing aggregate statistics, pipelines ensure calculations are based solely on the training data, enhancing the model’s credibility.

Cross-validation is a potent technique for evaluating a model’s performance. When preprocessing steps are integrated into a pipeline, they are re-fit on the training portion of each cross-validation fold. This prevents information from the validation fold from leaking into the preprocessing, so the resulting scores are a more honest estimate of performance on unseen data.
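
As a minimal illustration with made-up data, wrapping a scaler and model in a Pipeline and passing it to cross_val_score means the scaler is re-fit on only the training portion of each fold:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)
y = np.random.rand(100)

pipe = Pipeline([
    ('scaler', StandardScaler()),  # fit on each fold's training split only
    ('model', Ridge()),
])

# cross_val_score re-fits the entire pipeline per fold,
# so the scaler never sees that fold's validation data
scores = cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_absolute_error')
print(scores.mean())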

Practical Application: Predicting IT Service Ticket Durations

Let’s delve into a real-world project to grasp the practical implications of these techniques. Our challenge was predicting the duration between a ticket update (‘sys_update_at’) and its closing time (‘closed_at’) in an IT service setting. Each unique incident ticket is denoted by an identifier, ‘number’, and each row with the same ‘number’ represents one update to that ticket. Some columns hold information that is only uncovered through investigation partway through the ticket’s lifespan; however, in our dataset, this information is propagated to all rows of that incident ticket. Since we do not know at which stage these columns should have first become available, we exclude them.

Exploratory Data Analysis

The following are the findings of our brief exploratory data analysis:

“New” and “Active” ticket states have more than 30,000 rows each. Note that this counts update instances with that state, not just unique ticket numbers; tickets can cycle between states throughout their life cycles.

While very few in number, tickets in the “Awaiting Evidence” and “Awaiting Vendor” states have a very wide range of durations before they were finally closed. This is expected, as these states are usually out of the hands of the service team. It is noteworthy that once a ticket is “Resolved”, it is closed in about 5 days, with variation too small to be visible. This implies some rule-based closing.

Ticket numbers with outlier durations were removed. After this, the “Awaiting Vendor” state is less varied, and the “Awaiting Evidence” state has been removed entirely.

When splitting by ticket priority, it can be seen that higher-priority tickets get closed faster, but not by much. This might imply that higher-priority tickets are also more complex, so everything roughly evens out to a median of about 9 days.

We also exclude tickets that never reached the ‘Resolved’ state. For the final dataset, rows in the ‘Closed’ state were removed as well, as there is no longer a need to predict at that stage. A sketch of this filtering is shown below. Our final dataset contains about 19,200 unique incident tickets, down from the original 20,700.
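
A minimal sketch of this filtering, assuming the raw data lives in a DataFrame df; ‘number’ comes from the dataset, while ‘incident_state’ is an assumed name for the state column:

# 'incident_state' is an assumed column name for the ticket state
resolved = df.loc[df['incident_state'] == 'Resolved', 'number'].unique()

# Keep only tickets that reached the 'Resolved' state at some point
df = df[df['number'].isin(resolved)]

# Drop 'Closed' rows: there is nothing left to predict at that stage
df = df[df['incident_state'] != 'Closed']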

Building the Predictive Model

Constructing a predictive model involves a series of systematic steps, and Scikit-Learn pipelines simplify this process.

  • Datetime Feature Extraction: Dissected date and time variables for nuanced temporal insights.
  • Target Encoding: Enhanced the model’s understanding of categorical data.
  • Handling Missing Data: Imputed missing values using the ‘most_frequent’ strategy.
  • Feature Selection: A variance threshold removed near-constant features, keeping the model focused on informative ones.
  • Model Selection: XGBoost Regressor, a robust gradient boosting algorithm, was employed for its adaptability and efficiency.

Transformer Implementation

The code below shows how the feature-engineering steps were implemented in Scikit-Learn. For the target encoding, we encode categories by the mean of the target variable for that category. Note that we only compute the means during the “fit” step; those values are saved, and when we “transform” the test data, we simply look up the pre-calculated values from the training set.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer that splits datetime columns into numeric components
class SplitDateTimeTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn during fit
        return self

    def transform(self, X):
        X_transformed = X.copy()

        # Expand each datetime column into its numeric components
        cols = X_transformed.select_dtypes(include=['datetime64']).columns.to_list()
        for col in cols:
            for component in ['year', 'month', 'day', 'hour', 'minute', 'second']:
                new_col_name = f'{col}_{component}'
                X_transformed[new_col_name] = getattr(X_transformed[col].dt, component).astype(float)
        # Drop the original datetime columns
        X_transformed.drop(columns=cols, inplace=True)

        return X_transformed
# Custom transformer for target encoding
class TargetEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Learn the mean of the target for each category,
        # using the training data only
        cols = X.columns.to_list()
        temp_df = pd.concat([X, y], axis=1)
        category_means_ = []
        for col in cols:
            category_means_.append(temp_df.groupby(col)[y.name].mean().to_dict())
        self.category_means_ = category_means_
        return self

    def transform(self, X):
        # Replace each category with the mean learned during fit;
        # categories unseen during fit become NaN and are handled
        # by the downstream imputer
        X_encoded = X.copy()
        cols = X_encoded.columns.to_list()
        for i, col in enumerate(cols):
            X_encoded[col] = X[col].map(self.category_means_[i])
        return X_encoded
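
To tie these together, the transformers can be composed with the imputation, feature-selection, and model steps listed earlier. The sketch below is one plausible assembly, not the exact production pipeline: the categorical column names (‘category’, ‘priority’) are assumptions, and we assume they are the only non-numeric columns besides ‘number’, which is dropped since the regressor does not use it.

from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

# Target-encode the categorical columns, drop the ticket identifier,
# and pass everything else (the numeric columns) through untouched
encode = ColumnTransformer(
    [
        ('drop_id', 'drop', ['number']),
        ('target_enc', TargetEncoder(), ['category', 'priority']),  # assumed names
    ],
    remainder='passthrough',
)

pipe = Pipeline([
    ('datetime', SplitDateTimeTransformer()),
    ('encode', encode),
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('select', VarianceThreshold()),  # drops zero-variance features by default
    ('model', XGBRegressor()),
])

# Given training data X_train (DataFrame) and y_train (named Series):
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)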

Evaluating Model Performance

To split the data into training and testing sets, we sorted it by creation date and ticket number, so older tickets are used to predict newer ones. No rows with the same ‘number’ are split between the two sets. A baseline “naive” model is constructed by randomly drawing values from a Poisson distribution with the same mean as the training data’s target column; this mimics the distribution of values from the training set. We then measure its performance against the actual values using the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
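
A sketch of this baseline, assuming y_train and y_test hold the duration in days from the split described above:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(42)

# Draw "predictions" from a Poisson distribution whose mean
# matches the training target
baseline_preds = rng.poisson(lam=y_train.mean(), size=len(y_test))

rmse = mean_squared_error(y_test, baseline_preds) ** 0.5
mae = mean_absolute_error(y_test, baseline_preds)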

Baseline Model:

RMSE: 5.33 days

MAE: 4.31 days

For the pipeline, we used random search to try out various combinations of hyperparameters and find the ones that yield the best results. We also implemented a Group K-Fold: while the ‘number’ column itself is not used by the XGBoost Regressor, it is used to ensure that rows from the same ticket ‘number’ are not separated during cross-validation.
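
The search might be set up roughly as follows, reusing the pipeline sketched earlier. The parameter grid here is a placeholder, not the tuned values from this project, and the groups are taken from the ticket ‘number’ column:

from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# Placeholder search space; the actual tuned ranges are not shown here
param_dist = {
    'model__n_estimators': [100, 300, 500],
    'model__max_depth': [3, 5, 7],
    'model__learning_rate': [0.01, 0.05, 0.1],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_dist,
    n_iter=20,
    cv=GroupKFold(n_splits=5),
    scoring='neg_root_mean_squared_error',
)

# groups keeps all rows of the same ticket 'number' in the same fold
search.fit(X_train, y_train, groups=X_train['number'])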

Fine-Tuned Model:

RMSE: 3.63 days

MAE: 2.38 days

Using the trained pipeline cuts the average error by about half. If we compute the MAE per state, it goes even lower.
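
Continuing the sketch above, the per-state MAE can be computed with a simple groupby; ‘incident_state’ is again an assumed column name:

# Absolute error per row, then averaged within each ticket state
errors = (y_test - search.predict(X_test)).abs()
print(errors.groupby(X_test['incident_state']).mean())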

Conclusion

In this post, we have discussed the power of feature engineering and the significance of pipelines, and witnessed their application in a real-world scenario. Mastery of predictive modeling is not only about choosing models and writing code; it is also about learning how to make the most of our data. While a non-exhaustive list of ideas was presented above, much of this comes down to domain knowledge, experience, and experimentation.
