Channel: Active questions tagged python - Stack Overflow

Why is my decision tree creating a split that doesn't actually divide the samples?

Here's my basic code for two-feature classification of the well-known Iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source

iris = load_iris()
iris_limited = iris.data[:, [2, 3]]  # This gets only petal length & width.

# I'm using the max depth as a way to avoid overfitting
# and simplify the tree since I'm using it for educational purposes
clf = DecisionTreeClassifier(criterion="gini",
                             max_depth=3,
                             random_state=42)
clf.fit(iris_limited, iris.target)

visualization_raw = export_graphviz(clf,
                                    out_file=None,
                                    special_characters=True,
                                    feature_names=["length", "width"],
                                    class_names=iris.target_names,
                                    node_ids=True)

visualization_source = Source(visualization_raw)
visualization_png_bytes = visualization_source.pipe(format='png')
with open('my_file.png', 'wb') as f:
    f.write(visualization_png_bytes)

When I checked the visualization of my tree, I found this:

[Image: Tree visualization result]

At first glance this is a fairly normal tree, but I noticed something odd about it. Node #6 has 46 samples in total, only one of which is versicolor, so the node is labelled virginica. That seems like a reasonable place to stop. However, for a reason I can't work out, the algorithm splits it further into nodes #7 and #8. The odd part is that the one remaining versicolor sample is still misclassified, because both child nodes end up labelled virginica anyway.

Why is it doing this? Does it look only at the Gini decrease, without checking whether the split changes any prediction? That seems like odd behaviour to me, and I can't find it documented anywhere.

Is it possible to disable this, or is it in fact correct behaviour?
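
The Gini comparison described above can be checked directly on the fitted tree. What follows is a minimal sketch, not part of the original code: it assumes scikit-learn's `tree_` attribute with its `children_left`, `children_right`, `impurity`, `n_node_samples` and `value` arrays, and the helper names `predicted_class` and `same_label_splits` are made up for illustration. It lists every split whose two children are leaves with the same predicted class, together with how much the weighted Gini impurity dropped across that split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, [2, 3]]  # petal length and width, as in the question
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X, iris.target)

tree = clf.tree_
LEAF = -1  # scikit-learn stores -1 as the child index of a leaf


def predicted_class(node_id):
    # The label of a node is the class with the largest entry in tree.value;
    # argmax gives the same answer whether the entries are counts or fractions.
    return int(np.argmax(tree.value[node_id][0]))


def same_label_splits():
    """Return (node, left, right, gini_decrease) for every split whose two
    children are leaves that predict the same class (hypothetical helper)."""
    found = []
    for node in range(tree.node_count):
        left = tree.children_left[node]
        right = tree.children_right[node]
        if left == LEAF:
            continue  # this node is itself a leaf
        both_leaves = (tree.children_left[left] == LEAF
                       and tree.children_left[right] == LEAF)
        if both_leaves and predicted_class(left) == predicted_class(right):
            # Weighted average impurity of the two children.
            children_impurity = (
                tree.n_node_samples[left] * tree.impurity[left]
                + tree.n_node_samples[right] * tree.impurity[right]
            ) / tree.n_node_samples[node]
            decrease = tree.impurity[node] - children_impurity
            found.append((int(node), int(left), int(right), float(decrease)))
    return found


for node, left, right, decrease in same_label_splits():
    label = iris.target_names[predicted_class(left)]
    print(f"node #{node} -> #{left}, #{right}: both '{label}', "
          f"Gini decrease {decrease:.4f}")
```

If the reported decrease is positive, the split did lower impurity even though neither child changed its label, which is consistent with a splitter that scores candidate splits by impurity alone.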

