Summing the leaf values of XGBRegressor trees does not match the prediction

It was my understanding that the final prediction of an XGBoost model (in this particular case an XGBRegressor) is obtained by summing the values of the predicted leaves [1] [2]. Yet I cannot reproduce the prediction by summing those values. Here is an MRE:

import json
from collections import deque

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import xgboost as xgb


def leafs_vector(tree):
    """Returns a vector of nodes for each tree, only leafs are different of 0"""
    stack = deque([tree])
    while stack:
        node = stack.popleft()
        if "leaf" in node:
            yield node["leaf"]
        else:
            yield 0
            for child in node["children"]:
                stack.append(child)


# Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the XGBoost regressor model
xg_reg = xgb.XGBRegressor(objective='reg:squarederror',
                          max_depth=5,
                          n_estimators=10)

# Train the model
xg_reg.fit(X_train, y_train)

# Compute the original predictions
y_pred = xg_reg.predict(X_test)

# get the index of each predicted leaf
predicted_leafs_indices = xg_reg.get_booster().predict(xgb.DMatrix(X_test), pred_leaf=True).astype(np.int32)

# get the trees
trees = xg_reg.get_booster().get_dump(dump_format="json")
trees = [json.loads(tree) for tree in trees]

# get a vector of nodes (ordered by node id)
leafs = [list(leafs_vector(tree)) for tree in trees]

l_pred = []
for pli in predicted_leafs_indices:
    l_pred.append(sum(li[p] for li, p in zip(leafs, pli)))

assert np.allclose(np.array(l_pred), y_test, atol=0.5)  # fails
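The MRE above assumes that the breadth-first position of a node equals its node id. As a minimal sketch of an alternative lookup, the leaf values can be keyed by the "nodeid" field that each node carries in the JSON dump, so the lookup does not depend on traversal order. The helper name leaf_values_by_nodeid and the small example tree are hypothetical, and the tree is hand-written in the shape of a JSON dump rather than taken from a trained model.

```python
def leaf_values_by_nodeid(tree):
    """Map nodeid -> leaf value for one tree parsed from a JSON dump.

    Hypothetical helper: it relies only on the "nodeid", "leaf" and
    "children" keys that appear in the dumped tree.
    """
    values = {}
    stack = [tree]
    while stack:
        node = stack.pop()
        if "leaf" in node:
            # Leaf node: remember its value under its own node id.
            values[node["nodeid"]] = node["leaf"]
        else:
            # Split node: keep walking down into its children.
            stack.extend(node["children"])
    return values


# Hand-written tree with illustrative ids and values (not real model output).
example_tree = {
    "nodeid": 0,
    "children": [
        {
            "nodeid": 1,
            "children": [
                {"nodeid": 3, "leaf": 1.5},
                {"nodeid": 4, "leaf": -0.5},
            ],
        },
        {"nodeid": 2, "leaf": 2.0},
    ],
}

leaf_map = leaf_values_by_nodeid(example_tree)
```

With such a map, an entry of the pred_leaf output can be used directly as a dictionary key for the corresponding tree.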

I also tried adding the default value of base_score (0.5, as written here) to the total sum, but that did not work either.

l_pred = []
for pli in predicted_leafs_indices:
    l_pred.append(sum(li[p] for li, p in zip(leafs, pli)) + 0.5)
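For reference, the sum being attempted can be written as a small function, sketched here under two assumptions: each tree is represented as a plain dict from node id to leaf value, and base_score is passed in explicitly instead of being hard-coded. The value 0.5 may not be the one the trained model actually uses, since depending on the XGBoost version the default can be estimated from the training data. The function name predict_from_leaves and the numbers are hypothetical.

```python
def predict_from_leaves(leaf_maps, leaf_ids, base_score):
    """Sum one leaf value per tree and add the base score.

    leaf_maps: one dict per tree, mapping node id -> leaf value.
    leaf_ids:  the predicted leaf node id for each tree, for one sample.
    base_score: the model's global bias, supplied by the caller (assumption).
    """
    return base_score + sum(m[i] for m, i in zip(leaf_maps, leaf_ids))


# Two toy trees with illustrative leaf values (not real model output).
leaf_maps = [
    {1: 0.25, 2: -0.5},
    {1: 1.0, 2: 2.0},
]

# The sample lands in leaf 2 of the first tree and leaf 1 of the second.
pred = predict_from_leaves(leaf_maps, [2, 1], base_score=0.5)
```

This mirrors the additive form for an objective such as reg:squarederror, where no link function is applied to the summed margin.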
