# Decision Trees in Machine Learning

A tree has many analogies in real life, and it turns out that tree structures have influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name suggests, it uses a tree-like model of decisions. Though it is a commonly used tool in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning, which will be the main focus of this article.

### How can an algorithm be represented as a tree?

For this, let's consider a very basic example that uses the Titanic dataset to predict whether a passenger will survive or not. The model uses 3 features/attributes/columns from the dataset, namely sex, age and sibsp (number of siblings or spouses aboard).
A decision tree is drawn upside down, with its root at the top. In a typical diagram of such a tree, bold black text represents a condition (internal node), based on which the tree splits into branches (edges). The end of a branch that doesn't split any further is the decision (leaf); in this case, whether the passenger died or survived, often shown as red and green text respectively.
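The splitting logic of such a tree can be sketched as nested conditions. The split order and thresholds below are made up for illustration only; a real tree learns them from the data:

```python
def predict_survival(sex: str, age: float, sibsp: int) -> str:
    """Hand-written sketch of a Titanic-style decision tree.

    Thresholds here are illustrative, not learned from the actual data.
    """
    if sex == "male":
        if age > 9.5:
            return "died"
        # young boys: outcome depends on number of siblings/spouses aboard
        return "died" if sibsp > 2 else "survived"
    return "survived"

print(predict_survival("female", 30, 0))  # survived
print(predict_survival("male", 40, 0))    # died
```

Each `if` corresponds to an internal node, and each `return` to a leaf.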

In this post we will attempt to predict whether a student passes or fails the Portuguese course, based on the attributes in the dataset available here: https://archive.ics.uci.edu/ml/datasets/Student+Performance

# Data Description

### Data Set Information:

This dataset covers student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the paper source for more details).

Attribute Information:

### Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

``````
1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2 sex - student's sex (binary: 'F' - female or 'M' - male)
3 age - student's age (numeric: from 15 to 22)
4 address - student's home address type (binary: 'U' - urban or 'R' - rural)
5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
``````

### These grades are related to the course subject (Math or Portuguese):

``````
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
``````

# Analysis and Visualization of Data

### Load dataset (student Portuguese scores)

``````
import pandas as pd

# the Student Performance CSV files are semicolon-separated
d = pd.read_csv('student-por.csv', sep=';')
print(d.head())
print(len(d))
``````

#### Output

``````
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob ...  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher ...
1     GP   F   17       U     GT3       T     1     1  at_home     other ...
2     GP   F   15       U     LE3       T     1     1  at_home     other ...
3     GP   F   15       U     GT3       T     4     2   health  services ...
4     GP   F   16       U     GT3       T     3     3    other     other ...

   famrel  freetime  goout  Dalc  Walc  health  absences  G1  G2  G3
0       4         3      4     1     1       3         4   0  11  11
1       5         3      3     1     1       3         2   9  11  11
2       4         3      2     2     3       3         6  12  13  12
3       3         2      2     1     1       5         0  14  14  14
4       4         3      2     1     2       5         0  11  13  13

[5 rows x 33 columns]
649
``````

### Generate a binary label (pass/fail) based on G1+G2+G3 (period grades, each 0-20 points); the passing threshold is sum >= 35

We create another column, `pass`, which is the column we will predict later: a binary value (`1` for pass, `0` for fail) derived from the grade columns `G1`, `G2` and `G3`, which we then drop.

``````
d['pass'] = d.apply(lambda row: 1 if (row['G1'] + row['G2'] + row['G3']) >= 35 else 0, axis=1)
d = d.drop(['G1', 'G2', 'G3'], axis=1)
``````
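The same label can be built without `apply`, using a vectorized comparison; equivalent, just faster on larger frames. A sketch on a tiny stand-in frame (the real `d` has 649 rows):

```python
import pandas as pd

# small stand-in frame with the same grade columns
demo = pd.DataFrame({'G1': [0, 12, 14], 'G2': [11, 13, 14], 'G3': [11, 12, 14]})
# vectorized pass/fail label: 1 when the three grades sum to at least 35
demo['pass'] = ((demo['G1'] + demo['G2'] + demo['G3']) >= 35).astype(int)
print(demo['pass'].tolist())  # [0, 1, 1]
```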

#### Output

``````
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob ...   \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher ...
1     GP   F   17       U     GT3       T     1     1  at_home     other ...
2     GP   F   15       U     LE3       T     1     1  at_home     other ...
3     GP   F   15       U     GT3       T     4     2   health  services ...
4     GP   F   16       U     GT3       T     3     3    other     other ...

internet romantic  famrel  freetime  goout Dalc Walc health absences pass
0       no       no       4         3      4    1    1      3        4    0
1      yes       no       5         3      3    1    1      3        2    0
2      yes       no       4         3      2    2    3      3        6    1
3      yes      yes       3         2      2    1    1      5        0    1
4       no       no       4         3      2    1    2      5        0    1
``````

We notice that a new column `pass` with binary values `0` and `1` has been created, and that the columns `G1`, `G2` and `G3` have been dropped.

### Use one-hot encoding on categorical columns

Many columns are categorical (string-valued). In order to apply decision tree learning we have to convert these columns to binary values. To achieve this, we create a dummy variable for each value of each categorical column.

``````
d = pd.get_dummies(d, columns=['sex', 'school', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic'])
``````
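To see what `get_dummies` actually does, here is a minimal sketch on a single made-up column; each distinct value becomes its own 0/1 column:

```python
import pandas as pd

# tiny stand-in for one categorical column from the dataset
tiny = pd.DataFrame({'Mjob': ['teacher', 'health', 'teacher']})
# one indicator column per category; cast to int for 0/1 display
dummies = pd.get_dummies(tiny, columns=['Mjob']).astype(int)
print(dummies)
```

The single `Mjob` column is replaced by `Mjob_health` and `Mjob_teacher`, with exactly one `1` per row.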

#### Output

``````age  Medu  Fedu  traveltime  studytime  failures  famrel  freetime  goout  \
0   18     4     4           2          2         0       4         3      4
1   17     1     1           1          2         0       5         3      3
2   15     1     1           1          2         0       4         3      2
3   15     4     2           1          3         0       3         2      2
4   16     3     3           1          2         0       4         3      2

Dalc      ...       activities_no  activities_yes  nursery_no  nursery_yes  \
0     1      ...                   1               0           0            1
1     1      ...                   1               0           1            0
2     2      ...                   1               0           0            1
3     1      ...                   0               1           0            1
4     1      ...                   1               0           0            1

higher_no  higher_yes  internet_no  internet_yes  romantic_no  romantic_yes
0          0           1            1             0            1             0
1          0           1            0             1            1             0
2          0           1            0             1            1             0
3          0           1            0             1            0             1
4          0           1            1             0            1             0

[5 rows x 57 columns]
``````

Notice that the number of columns has increased to 57.

### Shuffle rows

To randomize the order of the data, we shuffle the rows:

``````
d = d.sample(frac=1)
``````
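If you want the shuffle (and everything downstream of it) to be reproducible, you can pass a fixed `random_state` and reset the index; a sketch on a stand-in frame:

```python
import pandas as pd

demo = pd.DataFrame({'x': range(5)})
# fixed seed makes the shuffle repeatable; reset_index tidies the row labels
shuffled = demo.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled['x'].tolist())  # same 5 values, in a new but repeatable order
```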

### Split training and testing data

To get an unbiased, accurate evaluation of the model, we split the data into training data and test data from the start of the analysis.
Here we split the data manually:

``````
d_train = d[:500]
d_test = d[500:]

d_train_att = d_train.drop(['pass'], axis=1)
d_train_pass = d_train['pass']

d_test_att = d_test.drop(['pass'], axis=1)
d_test_pass = d_test['pass']

d_att = d.drop(['pass'], axis=1)
d_pass = d['pass']
``````
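As an alternative to slicing by hand, scikit-learn's `train_test_split` does the same job and shuffles for you. A sketch on synthetic stand-in data (in the article this would be `d_att` and `d_pass`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in frame with a binary 'pass' column
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.randint(0, 2, size=(649, 4)),
                    columns=['f1', 'f2', 'f3', 'pass'])

X = demo.drop(['pass'], axis=1)
y = demo['pass']
# test_size=149 mirrors the manual 500/149 split above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=149,
                                                    random_state=1)
print(len(X_train), len(X_test))  # 500 149
```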

### Number of passing students in whole dataset

Let's check the number of passing students and their percentage of all students.

``````
import numpy as np
print("Students passing: %d out of %d (%.2f%%)" % (np.sum(d_pass), len(d_pass), 100*float(np.sum(d_pass))/ len(d_pass)))
``````

#### Output

``````
Students passing: 328 out of 649 (50.54%)
``````

### Fit a decision tree

Let's fit the model to our training data.

``````
from sklearn import tree
t = tree.DecisionTreeClassifier(criterion="entropy", max_depth=5)
t = t.fit(d_train_att, d_train_pass)
``````
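`criterion="entropy"` means each split is chosen to maximize information gain, where the entropy of a node whose positive fraction is p is H(p) = -p·log2(p) - (1-p)·log2(1-p). A quick sketch, checking that a roughly 50/50 node like our dataset is maximally impure:

```python
import math

def entropy(p: float) -> float:
    """Shannon entropy (in bits) of a binary node with positive fraction p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(0.5054))  # our 50.54% pass rate: very close to the maximum of 1 bit
print(entropy(0.9))     # a skewed node is much less impure
```

A split is good when the weighted entropy of the child nodes is much lower than the parent's.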

### Visualize tree

We use the `graphviz` module to visualize how our model decides whether a student passes or fails.

``````
import graphviz

dot_data = tree.export_graphviz(t, out_file=None, label="all", impurity=False,
                                proportion=True, feature_names=list(d_train_att),
                                class_names=["fail", "pass"], filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph.render('student_perf_tree.gv', view=True)
``````

### Save tree

Let's save the tree visualization to a file and print the accuracy score.

``````
tree.export_graphviz(t, out_file="student-performance.dot", label="all",
                     impurity=False, proportion=True, feature_names=list(d_train_att),
                     class_names=["fail", "pass"], filled=True, rounded=True)
score = t.score(d_test_att, d_test_pass)
print(score)
``````

#### Output

``````
0.718120805369
``````

To make sure our evaluation of the model is not due to chance, we use cross-validation: we split the data into 5 portions and rotate the evaluation over one held-out portion at a time.

``````
from sklearn.model_selection import cross_val_score
scores = cross_val_score(t, d_att, d_pass, cv=5)
``````
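`cross_val_score` returns one accuracy score per fold. A self-contained sketch on synthetic stand-in data (the article uses `d_att` and `d_pass` instead):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data with the same number of rows as the student set
X, y = make_classification(n_samples=649, n_features=20, random_state=0)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=5)
scores = cross_val_score(clf, X, y, cv=5)  # one accuracy per fold
print(len(scores))  # 5
```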

### Show average of score and +/- two standard deviations away (covering 95% of scores)

Let's check the accuracy again:

``````
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
``````

#### Output

``````
Accuracy: 0.67 (+/- 0.04)
``````

Our accuracy is 0.67 with a standard deviation of 0.04. Not bad!

Let's check whether we can tweak a parameter to make the model perform better.
`max_depth` is one parameter we can control, so let's train the model with `max_depth` values from 1 to 19.

``````
for max_depth in range(1, 20):
    t = tree.DecisionTreeClassifier(criterion="entropy", max_depth=max_depth)
    scores = cross_val_score(t, d_att, d_pass, cv=5)
    print("Max depth: %d, Accuracy: %0.2f (+/- %0.2f)" % (max_depth, scores.mean(), scores.std() * 2))
``````

#### Output

``````
Max depth: 1, Accuracy: 0.64 (+/- 0.05)
Max depth: 2, Accuracy: 0.69 (+/- 0.04)
Max depth: 3, Accuracy: 0.69 (+/- 0.06)
Max depth: 4, Accuracy: 0.69 (+/- 0.06)
Max depth: 5, Accuracy: 0.68 (+/- 0.07)
Max depth: 6, Accuracy: 0.68 (+/- 0.07)
Max depth: 7, Accuracy: 0.66 (+/- 0.06)
Max depth: 8, Accuracy: 0.66 (+/- 0.06)
Max depth: 9, Accuracy: 0.65 (+/- 0.07)
Max depth: 10, Accuracy: 0.65 (+/- 0.05)
Max depth: 11, Accuracy: 0.65 (+/- 0.06)
Max depth: 12, Accuracy: 0.65 (+/- 0.06)
Max depth: 13, Accuracy: 0.64 (+/- 0.06)
Max depth: 14, Accuracy: 0.63 (+/- 0.06)
Max depth: 15, Accuracy: 0.64 (+/- 0.06)
Max depth: 16, Accuracy: 0.65 (+/- 0.07)
Max depth: 17, Accuracy: 0.64 (+/- 0.09)
Max depth: 18, Accuracy: 0.63 (+/- 0.07)
Max depth: 19, Accuracy: 0.63 (+/- 0.07)

We find that the accuracy is best at a max_depth of 2, with a mean accuracy of 0.69 and a standard deviation of 0.04 (the smaller the standard deviation, the more stable the model).
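This kind of sweep can also be automated with scikit-learn's `GridSearchCV`, which cross-validates every candidate `max_depth` and keeps the best one. A sketch on synthetic stand-in data (in the article, `d_att` and `d_pass` would be used):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data; replace with the real feature/label frames
X, y = make_classification(n_samples=649, n_features=20, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(criterion="entropy"),
                      param_grid={"max_depth": list(range(1, 20))},
                      cv=5)
search.fit(X, y)
print(search.best_params_)  # the depth with the best mean CV accuracy
```

After fitting, `search.best_estimator_` is a tree already refit on all the data with the winning depth.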

To make our evaluation of the model's accuracy easier to inspect, we build an array of (max_depth, mean accuracy, two standard deviations):

``````
depth_acc = np.empty((19, 3), float)
i = 0
for max_depth in range(1, 20):
    t = tree.DecisionTreeClassifier(criterion="entropy", max_depth=max_depth)
    scores = cross_val_score(t, d_att, d_pass, cv=5)
    depth_acc[i, 0] = max_depth
    depth_acc[i, 1] = scores.mean()
    depth_acc[i, 2] = scores.std() * 2
    i += 1

print(depth_acc)
``````

#### Output

``````
[[  1.           0.62086241   0.06324119]
[  2.           0.6872555    0.0297607 ]
[  3.           0.68877011   0.03533171]
[  4.           0.68262765   0.03117233]
[  5.           0.67331394   0.0354643 ]
[  6.           0.6717753    0.06470699]
[  7.           0.65788162   0.05747407]
[  8.           0.64557338   0.0599654 ]
[  9.           0.64401143   0.05901169]
[ 10.           0.65028381   0.0615136 ]
[ 11.           0.65796492   0.06716715]
[ 12.           0.64711257   0.06821761]
[ 13.           0.66874644   0.05938595]
[ 14.           0.65631967   0.0805524 ]
[ 15.           0.64557393   0.08538056]
[ 16.           0.63633069   0.05864936]
[ 17.           0.65014216   0.08418839]
[ 18.           0.6332534    0.06560505]
[ 19.           0.65018931   0.06713562]]
``````

To visualize the result even better, we plot the scores and their error bars:

``````
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.errorbar(depth_acc[:,0], depth_acc[:,1], yerr=depth_acc[:,2])
plt.show()
``````

# Conclusion

We built our model using a decision tree. A decision tree asks a question about an attribute at each internal node, follows the `yes` or `no` branch, and arrives at the final decision at a leaf.
Hope you find it useful. Any questions or suggestions are welcome.
Source: Viblo
