Machine Learning - Decision Tree

Decision Tree

In this chapter we will show you how to make a "Decision Tree". A Decision Tree is a flow chart that can help you make decisions based on previous experience.


In the example, a person will try to decide if he/she should go to a comedy show or not.

Luckily our example person has registered every time there was a comedy show in town, noting some information about the comedian, and also whether he/she went or not.

Age  Experience  Rank  Nationality  Go
36   10          9     UK           NO
42   12          4     USA          NO
23   4           6     N            NO
52   4           4     USA          NO
43   21          8     USA          YES
44   14          5     UK           NO
66   3           7     N            YES
35   14          9     UK           YES
52   13          7     N            YES
35   5           9     N            YES
24   3           5     USA          NO
18   3           7     UK           YES
45   9           9     UK           YES

Now, based on this data set, Python can create a decision tree that can be used to decide if any new shows are worth attending.

How Does it Work?

First, read the dataset with pandas:

Example

Read and print the data set:

import pandas
df = pandas.read_csv("data.csv")
print(df)

To make a decision tree, all data has to be numerical. We have to convert the non-numerical columns 'Nationality' and 'Go' into numerical values.

Pandas has a map() method that takes a dictionary with information on how to convert the values.

{'UK': 0, 'USA': 1, 'N': 2}

means convert the values 'UK' to 0, 'USA' to 1, and 'N' to 2.

Example

Change string values into numerical values:

d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
print(df)
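Since data.csv itself is not shown here, the same map() conversion can be tried on a minimal inline frame (the few values below are assumed for illustration, not taken from the data set):

```python
import pandas

# A tiny inline frame standing in for data.csv (values assumed for illustration)
df = pandas.DataFrame({
    'Nationality': ['UK', 'USA', 'N'],
    'Go': ['YES', 'NO', 'YES']
})

# map() looks each value up in the dictionary;
# any value missing from the dictionary would become NaN
df['Nationality'] = df['Nationality'].map({'UK': 0, 'USA': 1, 'N': 2})
df['Go'] = df['Go'].map({'YES': 1, 'NO': 0})

print(df)
```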

Then we have to separate the feature columns from the target column. The feature columns are the columns that we try to predict from, and the target column is the column with the values we try to predict.

Example

X is the feature columns, y is the target column:

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']
print(X)
print(y)

Now we can create the actual decision tree and fit it with our data. Start by importing the modules we need:

Example

Create and display a Decision Tree:

import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

df = pandas.read_csv("data.csv")

d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']

dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

tree.plot_tree(dtree, feature_names=features)
plt.show()
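The pipeline can also be run end to end without data.csv by building the data set inline; the fitted tree can then classify a new comedian with predict(). The 40-year-old below is an invented example, not a row from the data set:

```python
import pandas
from sklearn.tree import DecisionTreeClassifier

# The comedy-show data set built inline, so no data.csv is needed
df = pandas.DataFrame({
    'Age':         [36, 42, 23, 52, 43, 44, 66, 35, 52, 35, 24, 18, 45],
    'Experience':  [10, 12,  4,  4, 21, 14,  3, 14, 13,  5,  3,  3,  9],
    'Rank':        [ 9,  4,  6,  4,  8,  5,  7,  9,  7,  9,  5,  7,  9],
    'Nationality': ['UK', 'USA', 'N', 'USA', 'USA', 'UK', 'N',
                    'UK', 'N', 'N', 'USA', 'UK', 'UK'],
    'Go':          ['NO', 'NO', 'NO', 'NO', 'YES', 'NO', 'YES',
                    'YES', 'YES', 'YES', 'NO', 'YES', 'YES']
})

df['Nationality'] = df['Nationality'].map({'UK': 0, 'USA': 1, 'N': 2})
df['Go'] = df['Go'].map({'YES': 1, 'NO': 0})

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']

dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

# Should a hypothetical 40-year-old American comedian with 10 years of
# experience and a rank of 7 be worth seeing? 1 means go, 0 means don't go.
sample = pandas.DataFrame([[40, 10, 7, 1]], columns=features)
print(dtree.predict(sample))
```

Note that the answer can differ between runs: a fully grown tree has many equally good splits to choose from, so two trained trees may disagree on new samples.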

Result Explained

The decision tree uses your earlier decisions to calculate the odds of you wanting to go see a comedian or not. Let us read the different aspects of the decision tree:

Rank

Rank <= 6.5 means that every comedian with a rank of 6.5 or lower will follow the True arrow (to the left), and the rest will follow the False arrow (to the right).

gini = 0.497 refers to the quality of the split, and is always a number between 0.0 and 0.5, where 0.0 would mean all of the samples got the same result, and 0.5 would mean that the split is done exactly in the middle.

samples = 13 means that there are 13 comedians left at this point in the decision, which is all of them since this is the first step.

value = [6, 7] means that of these 13 comedians, 6 will get a "NO", and 7 will get a "GO".
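The gini figure can be checked by hand: for a node with two classes, Gini impurity is 1 - (x/n)^2 - (y/n)^2, where n is samples and x, y are the two counts in value:

```python
# Gini impurity of the root node: samples = 13, value = [6, 7]
n = 13
no, yes = 6, 7
gini = 1 - (no / n) ** 2 - (yes / n) ** 2
print(round(gini, 3))  # 0.497
```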
