In this chapter we will show you how to make a "Decision Tree". A Decision Tree is a flow chart that can help you make decisions based on previous experience.
In the example, a person will try to decide if he/she should go to a comedy show or not. Luckily our example person has registered every time there was a comedy show in town, along with some information about the comedian, and whether he/she went or not.
Age  Experience  Rank  Nationality  Go
36   10          9     UK           NO
42   12          4     USA          NO
23   4           6     N            NO
52   4           4     USA          NO
43   21          8     USA          YES
44   14          5     UK           NO
66   3           7     N            YES
35   14          9     UK           YES
52   13          7     N            YES
35   5           9     N            YES
24   3           5     USA          NO
18   3           7     UK           YES
45   9           9     UK           YES
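If you do not have the file data.csv at hand, the same data set can be built directly in code. This is a minimal sketch for following along; the tutorial itself reads the data from data.csv:

```python
import pandas

# The comedy-show data set from the table above, built as a
# DataFrame instead of being read from data.csv.
df = pandas.DataFrame({
    'Age':         [36, 42, 23, 52, 43, 44, 66, 35, 52, 35, 24, 18, 45],
    'Experience':  [10, 12,  4,  4, 21, 14,  3, 14, 13,  5,  3,  3,  9],
    'Rank':        [ 9,  4,  6,  4,  8,  5,  7,  9,  7,  9,  5,  7,  9],
    'Nationality': ['UK', 'USA', 'N', 'USA', 'USA', 'UK', 'N',
                    'UK', 'N', 'N', 'USA', 'UK', 'UK'],
    'Go':          ['NO', 'NO', 'NO', 'NO', 'YES', 'NO', 'YES',
                    'YES', 'YES', 'YES', 'NO', 'YES', 'YES'],
})

print(df.shape)  # (13, 5) - 13 registered shows, 5 columns
```

Everything that follows works the same whichever way the DataFrame is created.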
Now, based on this data set, Python can create a decision tree that can be used to decide if any new shows are worth attending.
First, read the data set with pandas.

Read and print the data set:

import pandas

df = pandas.read_csv("data.csv")
print(df)

To make a decision tree, all data has to be numerical. We have to convert the non-numerical columns 'Nationality' and 'Go' into numerical values.
Pandas has a map() method that takes a dictionary with information on how to convert the values.

{'UK': 0, 'USA': 1, 'N': 2}

means convert the value 'UK' to 0, 'USA' to 1, and 'N' to 2.

Change string values into numerical values:
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
print(df)

Then we have to separate the feature columns from the target column. The feature columns are the columns that we try to predict from, and the target column is the column with the values we try to predict.
X is the feature columns, y is the target column:

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']

print(X)
print(y)

Now we can create the actual decision tree and fit it with our data. Start by importing the modules we need:
Create and display a Decision Tree:

import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

df = pandas.read_csv("data.csv")

d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']

dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

tree.plot_tree(dtree, feature_names=features)
plt.show()

The decision tree uses your earlier decisions to calculate the odds of you wanting to go see a comedian or not. Let us read the different aspects of the decision tree:
Rank <= 6.5 means that every comedian with a rank of 6.5 or lower will follow the True arrow (to the left), and the rest will follow the False arrow (to the right).

gini = 0.497 refers to the quality of the split, and is always a number between 0.0 and 0.5, where 0.0 would mean all of the samples got the same result, and 0.5 would mean that the split is done exactly in the middle.

samples = 13 means that there are 13 comedians left at this point in the decision, which is all of them since this is the first step.

value = [6, 7] means that of these 13 comedians, 6 will get a "NO", and 7 will get a "GO".
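The gini value of the root node can be checked by hand. For a node with two possible answers, the Gini impurity is 1 - (x/n)² - (y/n)², where n is the number of samples and x and y are the counts of each answer. A quick sketch of that calculation for the root node:

```python
# Gini impurity of the root node: 13 samples, 6 "NO" and 7 "GO".
n = 13
no, go = 6, 7
gini = 1 - (no / n) ** 2 - (go / n) ** 2
print(round(gini, 3))  # 0.497
```

This matches the gini = 0.497 shown in the root node of the plotted tree.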