# Decision tree

In decision theory (for example, risk management), a decision tree is a graph of decisions and their possible consequences (including resource costs and risks), used to create a plan to reach a goal. Decision trees are constructed to help with decision making.

In machine learning, a decision tree is a predictive model; that is, a mapping from observations about an item to conclusions about the item's target value. Each inner node corresponds to a variable; an arc to a child represents a possible value of that variable. A leaf represents the predicted value of the target variable given the values of the variables on the path from the root.

The machine learning technique for inducing a decision tree from data is called decision tree learning, or (colloquially) decision trees.

Decision tree learning is also a common method used in data mining. Here, a decision tree describes a tree structure in which leaves represent classifications and branches represent conjunctions of features that lead to those classifications [1]. A decision tree can be learned by splitting the source set into subsets based on an attribute-value test [1]. This process is repeated on each derived subset recursively. The recursion completes when splitting is no longer feasible, or when a single classification can be applied to every element of a derived subset.
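The recursive splitting just described can be sketched in a few lines of Python. This is a minimal, hypothetical `build_tree` helper, not any particular library's API; for simplicity it takes features in the order given, whereas a real learner would choose each split with a measure such as Gini impurity or entropy (see the Formulas section).

```python
from collections import Counter

def build_tree(rows, features, target):
    """Recursively split `rows` (a list of dicts) on attribute-value tests.

    Returns a nested dict {feature: {value: subtree}}, or a class label
    (a leaf) once recursion stops.
    """
    labels = [r[target] for r in rows]
    # Recursion completes when one classification applies to every row...
    if len(set(labels)) == 1:
        return labels[0]
    # ...or when splitting is no longer feasible (no features left):
    # fall back to the majority class.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    feat, rest = features[0], features[1:]   # naive in-order feature choice
    tree = {feat: {}}
    for value in set(r[feat] for r in rows):
        subset = [r for r in rows if r[feat] == value]
        tree[feat][value] = build_tree(subset, rest, target)
    return tree
```

For example, two rows that differ only in whether it is windy yield a one-level tree splitting on `windy`.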

Decision trees are also a descriptive means for calculating conditional probabilities.

A decision tree can also be described as a combination of mathematical and computing techniques that aids the description, categorisation and generalisation of a given set of data. Data comes in records of the form:

(x, y) = (x1, x2, x3…, xk, y)

The dependent variable, y, is the variable that we are trying to understand, classify or generalise. The other variables, x1, x2, x3 and so on, are the variables that help us with that task.


## Types

A decision tree goes by two other names, depending on the type of target variable:

Regression tree, if we are trying to predict a numerical variable, such as the price of a house or a patient's length of stay in a hospital.

Classification tree, if y is a categorical variable, such as sex (male or female) or the result of a game (win or lose).

## Practical example

We will use an example to explain decision trees:

Our friend David is the manager of a famous golf club. Sadly, he is having some trouble with customer attendance. On some days everyone wants to play golf and the club's staff cannot cope with the demand; on other days, for no apparent reason, no one plays golf and the club has too many idle employees.

David's objective is to optimise staff availability by predicting when people will play golf, using the METOFFICE week forecast. To accomplish that, he needs to understand why people decide to play and whether there is any pattern behind it.

So, for two weeks, he recorded:

- The outlook: whether it was sunny, overcast or raining.
- The temperature, in degrees Fahrenheit.
- The relative humidity, in percent.
- Whether it was windy or not.

And, of course, whether people attended the golf club on that day. He ended up with a dataset of 14 rows and 5 columns.

[Image: golf_dataset.png — the 14-row golf dataset]

So, let's look at this table for ten seconds and try to find an explanation for playing or not playing. It is not at all obvious, because our brains are not good at spotting relations in this kind of representation.

A decision tree model is then proposed to solve David’s problem.

[Image: decision_tree_model.png — the decision tree model]

A decision tree is a model of the data that encodes the distribution of the class label (again, the y) in terms of the predictor attributes.

It is a directed acyclic graph in the form of a tree. The top node represents all the data. The classification tree algorithm finds that the best way to explain the dependent variable, play, is to use the variable outlook. Using the categories of the variable outlook, three different groups are found:

the group that plays golf when it is sunny, the group that plays when it is overcast and, surprisingly, a group showing that some people do play golf even when it is raining!

Our first conclusion: if the outlook is overcast, people always play golf, and there are some fanatical people who play golf even in the rain.

Then we divide the sunny group in two. We see that customers do not like to play golf if the humidity is higher than seventy percent.

Finally, we divide the rain category in two and find that customers will not play golf if it is windy.

And here is the short solution to the problem given by the classification tree software: David, dismiss most of the staff on days that are sunny and humid, or rainy and windy, because almost no one is going to play golf then. On the other days, when a lot of people will play golf, hire some temporary staff to help with the job.
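The fitted tree above can be hand-coded as nested tests. This is a hypothetical `will_play` function written for illustration; the categories and the seventy-percent humidity threshold follow the example, not any particular software's output.

```python
def will_play(outlook, humidity, windy):
    """Predict attendance from the tree in David's example.

    outlook: "sunny", "overcast" or "rain"; humidity in percent; windy: bool.
    """
    if outlook == "sunny":
        return humidity <= 70    # sunny but humid -> people stay home
    if outlook == "overcast":
        return True              # overcast -> people always play
    return not windy             # rain -> only the fanatics, if not windy
```

Each path from the root to a leaf becomes one branch of the nested `if`, which is why a learned tree is so easy to read back as rules.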

The conclusion is that the decision tree helped us turn a complex data representation into a much simpler (parsimonious) structure.

## Formulas

### Gini impurity

Used by the CART (Classification and Regression Trees) algorithm. It is based on the squared probabilities of membership for each target category in the node. It reaches its minimum (zero) when all cases in the node fall into a single target category.

Suppose y takes values in {1, 2, ..., m}, and let f(i, j) be the frequency of value j in node i; that is, f(i, j) is the proportion of records assigned to node i for which y = j.

$$I_{G}(i) = 1 - \sum_{j=1}^{m} f(i,j)^{2}$$
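As a quick sketch, the impurity of a node can be computed directly from its list of class labels (a hypothetical `gini` helper, not a library function):

```python
def gini(labels):
    """Gini impurity: 1 - sum_j f_j**2, where f_j is the proportion
    of records in the node with class j."""
    n = len(labels)
    freqs = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(f * f for f in freqs)
```

A pure node (all one class) gives 0; a 50/50 two-class node gives 0.5, the maximum for two classes.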

### Entropy

Used by the ID3, C4.5 and C5.0 tree generation algorithms. This measure is based on the concept of entropy used in information theory.

$$I_{E}(i) = - \sum_{j=1}^{m} f(i,j) \log f(i,j)$$
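The entropy of a node can be sketched the same way (a hypothetical `entropy` helper). The formula above leaves the logarithm base unspecified; base 2, giving a result in bits, is the usual choice in the ID3/C4.5 family and is assumed here.

```python
import math

def entropy(labels):
    """Entropy: -sum_j f_j * log2(f_j), where f_j is the proportion
    of records in the node with class j."""
    n = len(labels)
    freqs = [labels.count(c) / n for c in set(labels)]
    return -sum(f * math.log2(f) for f in freqs)
```

A pure node again gives 0; a 50/50 two-class node gives exactly 1 bit.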
