Here's the context:
I'm working with a dataset containing mixed feature types (numerical and categorical). My task is binary prediction of startup success based on a target variable defined earlier. My ML pipeline for my HistGradientBoostingClassifier applies several preprocessing steps: log transformation, square root transformation, winsorization (at two levels for different variables), polynomial feature creation, sine transformation, standard scaling (with a separate scaler for each transformed feature group), and target encoding for categorical features. I'm using a TimeSeriesSplit cross-validation strategy with GridSearchCV for logistic regression hyperparameter tuning. The reason I have several preprocessors is that I need to combine preprocessing steps that create new columns with the subsequent scaling of those columns.
The Problem:
The pipeline raises a warning that it encounters NaNs (but doesn't fail) when I run log_grid.fit(X_train, y_train). However, if I apply the preprocessing steps individually to X_train and y_train before feeding the result into the pipeline, everything works as expected (a dataset with 17,000 observations and 0 missing values). These are my preprocessors:
```python
log_transformer = FunctionTransformer(np.log1p, validate=False)
sqrt_transformer = FunctionTransformer(np.sqrt, validate=False)
winsorizer_low = FunctionTransformer(winsorizer_selfmade,
                                     kw_args={'limits': [0.01, 0.01]}, validate=False)
winsorizer_strong = FunctionTransformer(winsorizer_selfmade,
                                        kw_args={'limits': [0.05, 0.05]}, validate=False)
poly_transformer = PolynomialFeatures(degree=2)
sin_transformer = FunctionTransformer(np.sin, validate=False)
normal_scaler = StandardScaler()

scaler_log = StandardScaler()            # scaler for log-transformed features
scaler_sqrt = StandardScaler()           # scaler for sqrt-transformed features
scaler_winsor_low = StandardScaler()     # scaler for winsorized (low) features
scaler_winsor_strong = StandardScaler()  # scaler for winsorized (strong) features
scaler_poly = StandardScaler()           # scaler for polynomial features

preprocessor_target = ColumnTransformer(
    transformers=[
        ('target', TargetEncoder(handle_unknown='ignore'), dummy_cols),
        ('log', log_transformer, log_transformer_cols),
        ('sqrt', sqrt_transformer, sqrt_transformer_cols),
        ('winsor_low', winsorizer_low, low_winsor_cols),
        ('winsor_strong', winsorizer_strong, strong_winsor_cols),
        ('poly', poly_transformer, poly_cols),
        ('sin', sin_transformer, sin_transformer_cols),
        ('normal_scale', normal_scaler, normal_scale_cols)
    ],
    remainder='passthrough'  # pass unspecified columns through untransformed
)
preprocessor_target.set_output(transform='pandas')

preprocessor_scaling = ColumnTransformer(
    transformers=[
        ('scale_log', scaler_log, log_transformer_cols_scaler),
        ('scale_sqrt', scaler_sqrt, sqrt_transformer_cols_scaler),
        ('scale_winsor_low', scaler_winsor_low, low_winsor_cols_scaler),
        ('scale_winsor_strong', scaler_winsor_strong, strong_winsor_cols_scaler),
        ('scale_poly', scaler_poly, poly_features_cols_scaler)
    ],
    remainder='passthrough'  # pass unspecified columns through untransformed
)
preprocessor_scaling.set_output(transform='pandas')

variancer = VarianceThreshold(0.0001)
```

Then I combine them in my pipeline (in this case logistic regression for simplicity):
```python
log_pipe = Pipeline(steps=[
    ('preprocessor_target', preprocessor_target),
    ('preprocessor_scaling', preprocessor_scaling),
    ('zero_variance', variancer),
    ('logreg', LogisticRegression(penalty='l1', solver='liblinear'))
])
hyperparameters = {'logreg__C': np.logspace(-4, 4, 10)}
```

I then call the function as follows:

```python
log_grid = GridSearchCV(log_pipe, hyperparameters, cv=tscv, n_jobs=-1, verbose=1)  # scoring='roc_auc'
log_grid.fit(X_train, y_train)

print('Best Hyperparameters:', log_grid.best_params_)
print('Best Cross-validation Score:', log_grid.best_score_)
print('Test Set Score:', log_grid.score(X_test, y_test))

# classification report
y_pred = log_grid.predict(X_test)
print(classification_report(y_test, y_pred))
```

Here is the warning message I receive:
```
/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py:778: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_scorer.py", line 444, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/pipeline.py", line 722, in score
    return self.steps[-1][1].score(Xt, y, **score_params)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 668, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 419, in predict
    scores = self.decision_function(X)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 400, in decision_function
    X = self._validate_data(X, accept_sparse="csr", reset=False)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 565, in _validate_data
    X = check_array(X, input_name="X", **check_params)
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 921, in check_array
    _assert_all_finite(...
ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively.
```

What I have tried so far is to assess the preprocessors individually and sequentially:
Separate Preprocessing:
I applied preprocessor_target.fit_transform(X_train, y_train) to fit the target encoder (using y_train) and transform the training features. Similarly, I used preprocessor_scaling.fit_transform(preprocessed_data, y_train) (where preprocessed_data is the output of the previous step) to perform the scaling, and lastly applied the zero-variance selector.
When I inspect the data coming out of these three sequential transformations, the output is clean (no missing values). But when the same steps run inside log_pipe, I get the warning above.
Thank you for your help! If nothing works out, I will need to abandon the preprocessing steps within the pipeline and apply them "manually". That would be all right, since I have static data, but I would still appreciate a better understanding of this issue and of pipelines generally.