Here's my basic code for two-feature classification of the well-known Iris dataset:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source

iris = load_iris()
iris_limited = iris.data[:, [2, 3]]  # This gets only petal length & width.

# I'm using the max depth as a way to avoid overfitting
# and simplify the tree since I'm using it for educational purposes
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(iris_limited, iris.target)

visualization_raw = export_graphviz(
    clf,
    out_file=None,
    special_characters=True,
    feature_names=["length", "width"],
    class_names=iris.target_names,
    node_ids=True,
)

visualization_source = Source(visualization_raw)
visualization_png_bytes = visualization_source.pipe(format='png')

with open('my_file.png', 'wb') as f:
    f.write(visualization_png_bytes)
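In case the rendered image isn't visible, the same fitted tree can also be dumped as plain text with sklearn's export_text. This is just a small convenience sketch I'm adding for reference, not part of my original script:

from sklearn.tree import export_text

# Plain-text view of the same fitted tree: split thresholds and leaf classes.
print(export_text(clf, feature_names=["length", "width"]))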
When I checked the visualization of my tree, I found this:
This is a fairly normal tree at first glance, but I noticed something odd about it. Node #6 has 46 samples in total, only one of which is versicolor, so the node is labelled virginica. That seems like a perfectly reasonable place to stop. However, for some reason I can't work out, the algorithm decides to split it further into nodes #7 and #8. The odd thing is that the single versicolor sample still ends up misclassified, since both child nodes get the class virginica anyway. Why is it doing this? Does it look only at the Gini decrease, without checking whether the split changes any prediction at all? That seems like odd behaviour to me, and I can't find it documented anywhere.
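To double-check what I was seeing in the picture, I also dumped the per-node Gini and class distribution straight from clf.tree_ (a quick inspection sketch reusing clf and iris from the code above; the attribute names are from the scikit-learn docs):

import numpy as np

# Per-node Gini impurity, sample count, and predicted class of the fitted tree.
# (Depending on the scikit-learn version, tree_.value holds raw class counts or
# per-class fractions, but argmax gives the predicted class either way.)
tree = clf.tree_
for node_id in range(tree.node_count):
    dist = tree.value[node_id][0]                     # class distribution at this node
    is_leaf = tree.children_left[node_id] == -1       # leaves have no children
    print(f"node #{node_id}: gini={tree.impurity[node_id]:.4f}, "
          f"n_samples={tree.n_node_samples[node_id]}, "
          f"predicted={iris.target_names[np.argmax(dist)]}, "
          f"leaf={is_leaf}")

This confirms that nodes #7 and #8 both predict virginica, even though the split did reduce the Gini impurity slightly.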
Is it possible to disable this behaviour, or is it in fact correct?
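In case it helps, the closest knobs I've found so far are min_impurity_decrease and ccp_alpha, but I'm not sure either is the intended way to suppress splits like this one. The threshold values below are arbitrary, just what I experimented with:

# Require a minimum weighted impurity decrease before a node may be split.
clf_min_dec = DecisionTreeClassifier(criterion="gini", max_depth=3,
                                     min_impurity_decrease=0.01,
                                     random_state=42)
clf_min_dec.fit(iris_limited, iris.target)

# Or prune the fitted tree afterwards with minimal cost-complexity pruning.
clf_pruned = DecisionTreeClassifier(criterion="gini", max_depth=3,
                                    ccp_alpha=0.01, random_state=42)
clf_pruned.fit(iris_limited, iris.target)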