Here's the context:
I'm working with a dataset containing mixed feature types (numerical and categorical). My task is binary prediction of startup success based on a target variable defined earlier. My ML pipeline for my HistGradientBoostingClassifier applies several preprocessing steps: log transformation, square root transformation, winsorization (at two levels for different variables), polynomial feature creation, sine transformation, standard scaling (with a separate scaler for each transformed feature group), and target encoding for categorical features. I'm using a TimeSeriesSplit cross-validation strategy with GridSearchCV for logistic regression hyperparameter tuning. The reason I have several preprocessors is that I need to combine preprocessing steps that create new columns with the subsequent scaling of those columns.
The Problem:
The pipeline raises a warning that it encounters NaNs (but doesn't fail) when I run log_grid.fit(X_train, y_train). However, if I apply the preprocessing steps individually to X_train and y_train before feeding the result into the pipeline, everything works as expected (a dataset with 17,000 observations and 0 missing values). These are my preprocessors:
```python
log_transformer = FunctionTransformer(np.log1p, validate=False)
sqrt_transformer = FunctionTransformer(np.sqrt, validate=False)
winsorizer_low = FunctionTransformer(winsorizer_selfmade,
                                     kw_args={'limits': [0.01, 0.01]}, validate=False)
winsorizer_strong = FunctionTransformer(winsorizer_selfmade,
                                        kw_args={'limits': [0.05, 0.05]}, validate=False)
poly_transformer = PolynomialFeatures(degree=2)
sin_transformer = FunctionTransformer(np.sin, validate=False)
normal_scaler = StandardScaler()

scaler_log = StandardScaler()            # scaler for log-transformed features
scaler_sqrt = StandardScaler()           # scaler for sqrt-transformed features
scaler_winsor_low = StandardScaler()     # scaler for winsorized (low) features
scaler_winsor_strong = StandardScaler()  # scaler for winsorized (strong) features
scaler_poly = StandardScaler()           # scaler for polynomial features

preprocessor_target = ColumnTransformer(
    transformers=[
        ('target', TargetEncoder(handle_unknown='ignore'), dummy_cols),
        ('log', log_transformer, log_transformer_cols),
        ('sqrt', sqrt_transformer, sqrt_transformer_cols),
        ('winsor_low', winsorizer_low, low_winsor_cols),
        ('winsor_strong', winsorizer_strong, strong_winsor_cols),
        ('poly', poly_transformer, poly_cols),
        ('sin', sin_transformer, sin_transformer_cols),
        ('normal_scale', normal_scaler, normal_scale_cols)
    ],
    remainder='passthrough'  # pass unspecified columns through untransformed
)
preprocessor_target.set_output(transform='pandas')

preprocessor_scaling = ColumnTransformer(
    transformers=[
        ('scale_log', scaler_log, log_transformer_cols_scaler),
        ('scale_sqrt', scaler_sqrt, sqrt_transformer_cols_scaler),
        ('scale_winsor_low', scaler_winsor_low, low_winsor_cols_scaler),
        ('scale_winsor_strong', scaler_winsor_strong, strong_winsor_cols_scaler),
        ('scale_poly', scaler_poly, poly_features_cols_scaler)
    ],
    remainder='passthrough'  # pass unspecified columns through untransformed
)
preprocessor_scaling.set_output(transform='pandas')

variancer = VarianceThreshold(0.0001)
```

Then I combine them in my pipeline (in this case logistic regression for simplicity):
```python
log_pipe = Pipeline(steps=[
    ('preprocessor_target', preprocessor_target),
    ('preprocessor_scaling', preprocessor_scaling),
    ('zero_variance', variancer),
    ('logreg', LogisticRegression(penalty='l1', solver='liblinear'))
])
hyperparameters = {'logreg__C': np.logspace(-4, 4, 10)}
```

I then call the function as follows:

```python
log_grid = GridSearchCV(log_pipe, hyperparameters, cv=tscv, n_jobs=-1, verbose=1)  # scoring='roc_auc'
log_grid.fit(X_train, y_train)

print('Best Hyperparameters:', log_grid.best_params_)
print('Best Cross-validation Score:', log_grid.best_score_)
print('Test Set Score:', log_grid.score(X_test, y_test))

# classification report
y_pred = log_grid.predict(X_test)
print(classification_report(y_test, y_pred))
```

Here is the warning message I receive:
```
/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py:778: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_scorer.py", line 444, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/pipeline.py", line 722, in score
    return self.steps[-1][1].score(Xt, y, **score_params)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 668, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 419, in predict
    scores = self.decision_function(X)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 400, in decision_function
    X = self._validate_data(X, accept_sparse="csr", reset=False)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 565, in _validate_data
    X = check_array(X, input_name="X", **check_params)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 921, in check_array
    _assert_all_finite(...
ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively.
```

What I have tried so far is to assess the preprocessors individually and sequentially:
Separate Preprocessing:
I applied preprocessor_target.fit_transform(X_train, y_train) to fit the target encoder (using y_train) and transform the training features. Similarly, I used preprocessor_scaling.fit_transform(preprocessed_data, y_train) (where preprocessed_data is the output of the previous step) to perform the scaling, and lastly applied the zero-variance selector.
When I inspect the data coming out of these three sequential transformations, the output is clean (no missing values). But when the same steps run inside log_pipe, I get the warning above.
Thank you for your help! If nothing works out, I will need to abandon the preprocessing steps within the pipeline and apply them "manually". That would be all right, since I have static data, but I would still appreciate a better understanding of this issue and of pipelines generally.