I was using this insurance dataset on Kaggle to try to build a simple regressor that predicts the final two columns ['coverage_level','charges'], using the other 10 columns as features for the model.
I was aware that the 10 feature columns are a mix of numeric and categorical types, so I first transformed the categorical ones using LabelEncoder:
```
from sklearn.preprocessing import LabelEncoder

df2 = df.copy()

# gender
le = LabelEncoder()
le.fit(df2.gender.drop_duplicates())
df2.gender = le.transform(df2.gender)
```

... and so forth for the remaining categorical columns such as 'smoker', 'region', etc.
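In full, the repeated pattern for the other columns can be written as a loop, roughly like this (a sketch; the column list is my assumption of which features are categorical in this dataset):

```
from sklearn.preprocessing import LabelEncoder

# Assumed set of remaining categorical columns in this dataset
categorical_cols = ["smoker", "region", "medical_history",
                    "family_medical_history", "exercise_frequency", "occupation"]

for col in categorical_cols:
    le = LabelEncoder()
    le.fit(df2[col].drop_duplicates())
    df2[col] = le.transform(df2[col])
```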
Then I applied a MinMaxScaler to the transformed dataframe:
```
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

inputs = df2[["age", "gender", "bmi", "children", "smoker", "region",
              "medical_history", "family_medical_history",
              "exercise_frequency", "occupation"]]
targets = df2[["coverage_level", "charges"]]

scaler = MinMaxScaler()
scaledInputs = np.array(scaler.fit_transform(inputs))

X_train, X_test, y_train, y_test = train_test_split(
    scaledInputs, targets, test_size=0.20, random_state=42)
```
Finally, there is the training and testing part:
```
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf_model = RandomForestRegressor(n_estimators=10, random_state=42)

# Fit on the training sets
rf_model.fit(X_train, y_train)

rf_outputs = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_outputs)
rf_score = rf_model.score(X_test, y_test)
```
However, the performance is very low: the score is about 0.27 and the MSE is nearly 2,615,601.
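Since rf_model.score on a multi-output regressor returns the R² averaged uniformly over both targets, a per-target breakdown can be computed like this (a sketch reusing y_test and rf_outputs from above):

```
from sklearn.metrics import mean_squared_error, r2_score

# One value per target column: [coverage_level, charges]
per_target_r2 = r2_score(y_test, rf_outputs, multioutput="raw_values")
per_target_mse = mean_squared_error(y_test, rf_outputs, multioutput="raw_values")
print("R2 per target :", per_target_r2)
print("MSE per target:", per_target_mse)
```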
I tried some fixes. First, instead of scaling only the inputs, I scaled the two target columns ['coverage_level','charges'] as well before feeding them in, but it did not help at all. Second, I switched from label encoding to one-hot encoding, but still saw no gain. In rough outline, the two attempts looked like the sketch below.
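A sketch of both attempts (the column lists are assumptions based on the feature list above, and pd.get_dummies stands in for whatever one-hot implementation was used):

```
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Fix 1 (sketch): scale the two targets as well, before the train/test split
target_scaler = MinMaxScaler()
scaled_targets = target_scaler.fit_transform(df2[["coverage_level", "charges"]])

# Fix 2 (sketch): one-hot encode the categorical columns of the original df
# instead of label-encoding them; the column list is assumed
df_onehot = pd.get_dummies(df, columns=["gender", "smoker", "region",
                                        "medical_history", "family_medical_history",
                                        "exercise_frequency", "occupation"])
```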
How can I investigate this problem?