It was my understanding that the final prediction of an XGBoost model (in this particular case an XGBRegressor) is obtained by summing the values of the predicted leaves [1] [2]. Yet I can't reproduce the prediction by summing those values. Here is an MRE:

    import json
    from collections import deque

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import train_test_split
    import xgboost as xgb


    def leafs_vector(tree):
        """Returns a vector of nodes for each tree, only leafs are different of 0"""
        stack = deque([tree])
        while stack:
            node = stack.popleft()
            if "leaf" in node:
                yield node["leaf"]
            else:
                yield 0
                for child in node["children"]:
                    stack.append(child)


    # Load the diabetes dataset
    diabetes = load_diabetes()
    X, y = diabetes.data, diabetes.target

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Define the XGBoost regressor model
    xg_reg = xgb.XGBRegressor(objective='reg:squarederror', max_depth=5, n_estimators=10)

    # Train the model
    xg_reg.fit(X_train, y_train)

    # Compute the original predictions
    y_pred = xg_reg.predict(X_test)

    # get the index of each predicted leaf
    predicted_leafs_indices = xg_reg.get_booster().predict(xgb.DMatrix(X_test), pred_leaf=True).astype(np.int32)

    # get the trees
    trees = xg_reg.get_booster().get_dump(dump_format="json")
    trees = [json.loads(tree) for tree in trees]

    # get a vector of nodes (ordered by node id)
    leafs = [list(leafs_vector(tree)) for tree in trees]

    l_pred = []
    for pli in predicted_leafs_indices:
        l_pred.append(sum(li[p] for li, p in zip(leafs, pli)))

    assert np.allclose(np.array(l_pred), y_test, atol=0.5)  # fails

I also tried adding the default value (0.5) of the base_score (as written here) to the total sum, but that didn't work either.

    l_pred = []
    for pli in predicted_leafs_indices:
        l_pred.append(sum(li[p] for li, p in zip(leafs, pli)) + 0.5)
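
For reference, the additive assumption behind the attempts above (prediction = base_score + sum of one leaf value per tree) can be written down on a hand-made toy example that does not need XGBoost at all. This is only a sketch under stated assumptions: the nested dicts imitate the "nodeid" / "leaf" / "children" keys of the JSON dump, the helper names `leaf_values_by_id` and `additive_prediction` are hypothetical, and the toy trees and leaf ids are invented for illustration. Leaf values are looked up by node id rather than by list position, so the result does not depend on traversal order or on the ids being contiguous.

```python
def leaf_values_by_id(tree):
    """Map nodeid -> leaf value for one tree given as nested dicts.

    Assumes every node carries a "nodeid" key, leaves carry a "leaf" key,
    and internal nodes carry a "children" list.
    """
    values = {}
    stack = [tree]
    while stack:
        node = stack.pop()
        if "leaf" in node:
            values[node["nodeid"]] = node["leaf"]
        else:
            stack.extend(node["children"])
    return values


def additive_prediction(trees, leaf_ids, base_score):
    """Return base_score plus the value of the selected leaf in each tree."""
    lookups = [leaf_values_by_id(tree) for tree in trees]
    return base_score + sum(lookup[i] for lookup, i in zip(lookups, leaf_ids))


# Hypothetical toy ensemble (not taken from a real model). The ids in the
# first tree are deliberately non-contiguous to show that the lookup does
# not rely on position.
toy_trees = [
    {"nodeid": 0, "children": [
        {"nodeid": 1, "leaf": 0.25},
        {"nodeid": 2, "children": [
            {"nodeid": 5, "leaf": -0.5},
            {"nodeid": 6, "leaf": 1.0},
        ]},
    ]},
    {"nodeid": 0, "leaf": 0.125},
]

# One sample that lands in leaf 6 of the first tree and leaf 0 of the second.
prediction = additive_prediction(toy_trees, leaf_ids=[6, 0], base_score=0.5)
print(prediction)  # 0.5 + 1.0 + 0.125 = 1.625
```

If this additive picture is right, a check on the real model would compare the reconstructed values against y_pred (the model's own output) rather than y_test, and would use whatever base_score the trained booster actually holds. Whether that value equals 0.5 in a given XGBoost version is an assumption worth verifying.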