# Decision Trees

An introduction to decision trees — a machine learning algorithm used as the foundation for more complex ensemble techniques.

**Decision Tree** is a **supervised learning** algorithm used for predictive modeling on both **classification and regression** problems (hence the name **CART** — Classification And Regression Trees — for one of its best-known variants). It generates a tree of simple ‘if-then’ rules, where each internal node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome (a categorical or continuous value).

It is used as a foundation for many classical machine learning algorithms like Random Forest, Bagging, and Boosted Decision Trees.

Geometrically, the ‘**if-then**’ rules are axis-parallel cuts: lines in 2D, hyperplanes in higher-dimensional space. These cuts partition the feature space into rectangles (hyperrectangles in higher dimensions), so the decision boundary of a decision tree is piecewise axis-parallel and non-linear overall.
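Those rules can be sketched as ordinary nested conditionals. The features, thresholds, and class labels below are invented purely for illustration:

```python
# A hand-written decision tree for 2D points (x, y).
# Each condition is an axis-parallel cut; together the cuts carve
# the plane into rectangles, with one class label per region.
def classify(x, y):
    if x <= 3.0:          # vertical cut at x = 3
        if y <= 2.0:      # horizontal cut at y = 2
            return "A"    # rectangle: x <= 3, y <= 2
        return "B"        # rectangle: x <= 3, y > 2
    return "C"            # half-plane: x > 3

classify(1, 1)   # -> "A"
classify(10, 0)  # -> "C"
```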

**Key Terms**

The diagram below shows the basic structure of a decision tree and how it produces a prediction for a new input.

Every tree has a **root node**, through which the inputs are passed. The root node is divided into a set of **decision nodes**, where records are routed based on conditions over their attribute values. The process of dividing a single node into multiple nodes is called **splitting**. A node that does not split into further nodes is called a **leaf node**, or **terminal node**. A subsection of a decision tree is called a **branch** or **sub-tree** (e.g. the boxed region in the image below).

**Example of a Decision Tree**

There is also a concept that is the opposite of splitting: if there are decision rules that can be eliminated, we cut them from the tree. This process is known as **pruning** and is useful for reducing the complexity of the model.

Now that we have a clear idea of what a basic decision tree looks like, let’s dive into how the splitting is done, and how we can construct a decision tree ourselves.

**Assumptions while creating a Decision Tree**

- In the beginning, the whole training set is considered as the root.
- Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to building the model.
- Records are distributed recursively on the basis of attribute values.
- The order in which attributes are placed as the root or internal nodes of the tree is decided using a statistical measure (such as information gain or the Gini index).

In the process of attaining homogeneity, we can keep splitting the data on variables until, at some point, we have nodes with zero entropy (all records in a node belong to the same class). In practice this may never be achieved, and it is worth asking whether pursuing it justifies the time and effort.
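Entropy here is the Shannon entropy of the class distribution in a node; it is zero exactly when the node is pure. A minimal sketch:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

entropy(["yes"] * 4)        # -> 0.0  (pure node: zero entropy)
entropy(["yes", "no"] * 2)  # -> 1.0  (maximally mixed binary node)
```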

As the tree grows larger, the rules become more complicated and less useful, because the model starts overfitting: instead of learning the pattern in the data, it memorizes the training records, and then performs poorly on test data. Some amount of misclassification is usually acceptable, so the question becomes where to stop growing the tree (how much misclassification are we willing to accept?).

a) One method is **Early stopping**.

Fix a threshold for the information gain value (if IG < threshold, stop growing the tree), or pre-define the depth of the tree (for example, allow only 4 levels of splits).
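That rule can be sketched as a simple gain check; the threshold, depth limit, and labels below are illustrative, not prescribed by any particular library:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting parent into left/right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

def should_split(parent, left, right, threshold=0.1, depth=0, max_depth=4):
    """Early stopping: grow only while the gain clears the threshold
    and the pre-defined depth has not been reached."""
    return depth < max_depth and information_gain(parent, left, right) >= threshold

parent = ["a", "a", "b", "b"]
should_split(parent, ["a", "a"], ["b", "b"])  # perfect split, gain 1.0 -> True
should_split(parent, ["a", "b"], ["a", "b"])  # zero gain -> False
```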

b) Another method is called **Pruning**.

Let the tree grow fully, then cut it back at the desired point using what is known as cost-complexity pruning.
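Cost-complexity pruning scores a subtree as its misclassification rate plus a penalty of alpha per leaf, and collapses the subtree whenever a single leaf scores no worse. A toy comparison, with made-up counts and an arbitrary alpha:

```python
def cost_complexity(misclassified, n_total, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |leaves|: error rate plus a leaf penalty."""
    return misclassified / n_total + alpha * n_leaves

# A subtree with 3 leaves that misclassifies 2 of 100 records,
# versus collapsing it to a single leaf that misclassifies 5.
subtree = cost_complexity(2, 100, 3, alpha=0.02)  # 0.02 + 0.06 = 0.08
leaf    = cost_complexity(5, 100, 1, alpha=0.02)  # 0.05 + 0.02 = 0.07
# leaf < subtree, so for this alpha the subtree gets pruned away.
```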

**How do Decision Trees work?**

In the case of **Classification**:

While splitting the data based on variables:

1. If the variable is categorical, then it makes a binary split (such as one class vs. the rest) and checks which binary split has the least confusion (impurity).

2. If the variable is numeric, then the data is arranged in ascending order of values and splits are tested at positions where there is a switch in the target variable. The entropy/Gini index is computed at all such splits, and the split with the least confusion is chosen.
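A minimal sketch of that numeric-split search, using the Gini index as the measure of confusion (function names and data are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Sort by value, test a cut at every switch in the target variable,
    and keep the cut with the lowest weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))  # (threshold, weighted impurity)
    for i in range(1, n):
        if pairs[i - 1][1] == pairs[i][1]:
            continue  # no switch in the target -> skip this position
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

best_numeric_split([1, 2, 3, 10, 11], ["a", "a", "a", "b", "b"])  # -> (6.5, 0.0)
```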

In the case of **Regression**:

1. If the target variable is a continuous value, then variance is measured instead of entropy; the independent variable whose split minimizes the variance becomes the root node, and further splitting is performed in a similar fashion.

2. Once we have enough homogeneity at the leaf nodes, we use the mean/median of the records in a leaf as the prediction for any record that falls into that node.
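The same search works for regression, with variance replacing entropy/Gini and the leaf predicting the mean of its records. A minimal illustrative sketch:

```python
def variance(ys):
    """Population variance of the target values in a node."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_regression_split(values, targets):
    """Pick the cut that minimizes the weighted variance of the children."""
    pairs = sorted(zip(values, targets))
    n = len(pairs)
    best = (None, float("inf"))  # (threshold, weighted variance)
    for i in range(1, n):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        score = (len(left) * variance(left) + len(right) * variance(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

def leaf_prediction(targets):
    """Once a node is homogeneous enough, predict the mean of its records."""
    return sum(targets) / len(targets)

best_regression_split([1, 2, 10, 11], [5, 5, 20, 20])  # -> (6.0, 0.0)
```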

**Applications of Decision Trees**

1. Healthcare

Decision trees can help determine whether a patient is suffering from a disease based on demographic attributes such as age, weight, and sex. They also assist in assessing the effect of a medicine based on its composition/ingredients, period of manufacture, etc.

2. Banking

Decision trees help assess whether a person is eligible for a loan based on their financial status, salary, number of family members, etc. They also assist in detecting credit card fraud, loan defaults, and the like, helping banks take corrective action.

3. Education

Decision Trees can be used by colleges and universities to shortlist students based on merit scores, attendance, overall score, extracurricular activities, etc.

**Decision trees** handle **multi-class** classification problems naturally, in contrast to algorithms such as **Logistic Regression**, which must be extended (for example with one-vs-rest schemes) to go beyond two classes: a tree simply evaluates every candidate split at each node and keeps splitting until the nodes are sufficiently homogeneous, whatever the number of classes.
