Now we will train a decision tree classifier. The goal here is not to build the optimal decision tree, but rather to illustrate practically how to make one. Let's briefly discuss the 'under the hood' mechanics of how decision tree models actually make their decisions. It is important to understand this principle well, as it is the foundation for many other machine learning models.
Gini Impurity
We know that decision trees start at the top of the tree, or root node. The root node is determined by the feature whose split yields the lowest 'Gini impurity' (information gain is another metric you can use, but we will use Gini here).
Gini impurity is a number between 0 and 0.5 (for a binary outcome such as readmission), and it gives the likelihood that a randomly chosen sample would be misclassified if it were assigned a random class label according to the class distribution at that node. For example, if we introduced a new patient into the dataset and tried to use a decision tree node with a Gini impurity of 0.5 to predict readmission, we would essentially be flipping a coin.

The goal of the decision tree during training is thus to choose splits that minimize the Gini impurity. At prediction time, a sample starts at the root and traverses down the tree, following the decision at each node. Once it reaches a leaf node (the final output), the tree returns the majority class of the training samples that ended up in that particular leaf.
The process of deciding where to split a node involves calculating the weighted Gini impurity of the child nodes for every potential split of the data, and the split that results in the largest decrease in impurity is chosen.
In a nutshell, the model iteratively evaluates candidate splits on every feature at each node in the tree, using the Gini impurity metric as the deciding factor. Here is a great video that discusses how Gini impurity is calculated.
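To make the calculation concrete, here is a minimal sketch of how the Gini impurity of a node and of a candidate split could be computed. The labels and the split shown are made-up toy values for illustration, not data from the readmission dataset.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1 - np.sum(proportions ** 2)

def weighted_gini(left_labels, right_labels):
    """Impurity of a candidate split: size-weighted average of the two child nodes."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_impurity(left_labels) + \
           (len(right_labels) / n) * gini_impurity(right_labels)

# A perfectly mixed binary node has the maximum impurity of 0.5
print(gini_impurity([0, 1, 0, 1]))          # 0.5

# A pure node has an impurity of 0
print(gini_impurity([1, 1, 1, 1]))          # 0.0

# A candidate split is scored by the weighted impurity of its children;
# the split with the largest decrease from the parent impurity wins.
print(weighted_gini([0, 0, 0, 1], [1, 1]))  # 0.25 (parent impurity was 0.5)
```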
Decision Tree Hyperparameters
There are three main hyperparameters that you will want to tune for decision trees. You can do this manually or via automated methods, which we will discuss in the random forests section, but it is important to at least have an idea of what these hyperparameters mean (a short code sketch showing how they are set follows the list):
Max Depth
: This parameter sets the maximum depth of the decision tree, controlling how deep the tree can go. A deeper tree can capture more complex patterns but risks overfitting by capturing noise. Conversely, a shallow tree may be too simplistic, failing to capture important patterns in the data. Balancing this parameter helps in managing the trade-off between underfitting and overfitting.

Min Samples Split
: This parameter specifies the minimum number of samples required to split an internal node. A higher value prevents the tree from making splits based on small sample sizes, which can capture noise rather than meaningful patterns. This helps ensure that splits are made only when there is sufficient data, contributing to better decisions.

Min Samples Leaf
: This parameter sets the minimum number of samples required to be at a leaf node. By enforcing a minimum number of samples per leaf, you reduce the risk of leaves making decisions based on a very small subset of the data, which can be unreliable. This helps in smoothing the model and making it less sensitive to noise.
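As a quick illustration, here is how these three hyperparameters might be passed to scikit-learn's DecisionTreeClassifier. The specific values are arbitrary examples for the sketch, not tuned recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

# Example values only -- in practice these should be tuned (e.g., with a grid search)
tree = DecisionTreeClassifier(
    max_depth=5,           # cap how deep the tree can grow
    min_samples_split=20,  # require at least 20 samples to split an internal node
    min_samples_leaf=10,   # require at least 10 samples in each leaf
    random_state=42,       # make results reproducible
)
```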

OK! Let's now do it. As always, notice that the code to do these things is easy; it really is the conceptual understanding that is important. First, let's make a Decision Tree with a max_depth of 3. We will also fit the Decision Tree with the training data and make some predictions using the test set.
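A minimal sketch of that workflow is below. It assumes the train/test split created earlier in the tutorial is available as X_train, y_train, and X_test; those variable names (and the random_state) are placeholders I am assuming, not necessarily the exact ones used here.

```python
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train, y_train, and X_test were produced by an earlier train/test split;
# the names are placeholders for whatever your split created.
dt = DecisionTreeClassifier(max_depth=3, random_state=42)

# Fit the tree on the training data
dt.fit(X_train, y_train)

# Predict readmission for the held-out test set
y_pred = dt.predict(X_test)
```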