In this article, I describe how to code the CART algorithm in Python. I think building your own algorithm is a good exercise to improve both your coding skills and your understanding of trees. It is not the fastest implementation, but it is enough to understand CART and object-oriented programming.
CART algorithm
In 1984, Breiman et al. first described the CART algorithm, short for Classification and Regression Trees. It can be used for regression and for classification of a binary target.
It is a recursive algorithm: at each iteration, it finds the split of the data that best improves the prediction of the target values.
Tree definition
A tree is composed of nodes. Each node can have one or two children, or be a leaf node, i.e. a node with no children.
Each Node object is stored in a list:
Tree = [node1, node2, node3, node4, ...]
Tree structure
To store the links and paths between nodes, I use a breadth-first search layout.
If node(x) has a left child, the index of the child node is 2*x + 1.
If node(x) has a right child, the index of the child node is 2*x + 2.
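For instance, a minimal sketch of these index rules (the helper names left_child and right_child are mine, not from the repository):

def left_child(x):
    # Index of the left child of node x in the breadth-first layout
    return 2 * x + 1

def right_child(x):
    # Index of the right child of node x in the breadth-first layout
    return 2 * x + 2

# Example: node 4 has children at positions 9 and 10
print(left_child(4), right_child(4))  # 9 10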
Each node is a Python object:
class Node:
    def __init__(self, t, L, R, D, S, V, M, X):
        self.t = t  # Index of the node
        self.L = L  # Index of the left child
        self.R = R  # Index of the right child
        self.D = D  # Depth of the node
        self.S = S  # Value of the split
        self.V = V  # Feature of the split
        self.M = M  # Subset array
        self.X = X  # Execution flag
Splitting criterion: Gini impurity gain
$$i(n) = 1 - \sum_j p_j^2$$
where
$$p_j(t) = \frac{n_j(t)}{n(t)}$$
is the relative proportion of category j in node t.
For instance, if a data set contains only one target class, its Gini index is zero: it is a pure data set.
The split value A and the split feature induce two subsets, left and right:
$$\Delta i(n) = i(n) - p_{left}\, i(n_{left}) - p_{right}\, i(n_{right})$$
For each feature and each split value A, we compute the Gini impurity gain and choose the partition that maximizes it.
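As an illustration, here is a minimal sketch of these formulas in Python (the names gini_impurity and gini_gain are mine, assuming integer class labels in a numpy array):

import numpy as np

def gini_impurity(y):
    # i(n) = 1 - sum_j p_j^2, over the class proportions in y
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def gini_gain(y, mask):
    # Delta i(n) = i(n) - p_left * i(n_left) - p_right * i(n_right),
    # where the boolean mask selects the left subset
    p_left = mask.mean()
    return (gini_impurity(y)
            - p_left * gini_impurity(y[mask])
            - (1 - p_left) * gini_impurity(y[~mask]))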
Example of a split computation
In this example, we consider a single continuous feature x.
y | 0 1 1 1 0 1 0 1 1 0 0 1 1 0 1 0 0 1 0 0
x | 0 0 0 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1 0 0
The only candidate split value is the midpoint between the two distinct values of x: (0 + 1)/2 = 0.5.
Counts for feature x at the 0.5 split value:

Target y class | x < 0.5 | x > 0.5
0              | 10      | 0
1              | 3       | 7
$$i(n) = 1 - \left( \frac{10}{20} \right)^2 - \left( \frac{10}{20} \right)^2 = 0.5$$
$$i(left) = 1 - \left( \frac{10}{13} \right)^2 - \left( \frac{3}{13} \right)^2 = 0.3550296$$
$$i(right) = 1 - \left( \frac{0}{7} \right)^2 - \left( \frac{7}{7} \right)^2 = 0$$
$$\Delta i(n) = 0.5 - \left( \frac{13}{20} \right) 0.3550296 - \left( \frac{7}{20} \right) 0 = 0.2692308$$
If we had more than one feature, we would compute all the splits for each feature and choose the one with the best Gini impurity gain.
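As a check of the numbers above, here is a small sketch (the variable names are mine; it only verifies the arithmetic of this example):

import numpy as np

y = np.array([0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0])
x = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0])

def gini(labels):
    # Gini impurity of a vector of integer class labels
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

mask = x < 0.5  # left subset for the 0.5 split
gain = gini(y) - mask.mean() * gini(y[mask]) - (~mask).mean() * gini(y[~mask])
print(gini(y), gini(y[mask]), gini(y[~mask]), gain)
# 0.5 0.3550295... 0.0 0.2692307...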
Code of the CART algorithm
GitHub link: Tree_Cart_clean.py
Function Build(Tr):
Tr: a node of the tree.
Ex: Tree[0] is the first node of the Tree list (the list of all nodes).
Uses the result of the Ginisplit function to decide whether to create children for the node.
----------------------------------------------------------------------------------------------------
Function Split(v, a):
v: feature index, a: data array.
Ex: Split(2, a): feature 2 of array a.
Computes the maximum Gini gain for one feature; it evaluates every possible split.
----------------------------------------------------------------------------------------------------
Function Ginisplit(matrix):
matrix: array of features.
Aggregates the best split of each feature and finds the best overall partitioning.
----------------------------------------------------------------------------------------------------
Function TT(depth): main program.
depth: the depth of the tree, counted from the root node; if you want a maximum depth of 3, you must pass 2.
Builds the Tree list and defines the nodes.
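As an illustration of how these pieces fit together, here is a simplified sketch of the driver loop with the execution flag (this is not the exact code from Tree_Cart_clean.py; Build is the function described above):

def TT(max_depth):
    # Keep scanning the Tree list until every node up to max_depth is built;
    # Build may append children, so the list grows during the loop.
    changed = True
    while changed:
        changed = False
        for node in list(Tree):
            if node.X == "X" and node.D <= max_depth:
                Build(node)       # may append children to Tree
                node.X = "done"   # execution flag: never build this node twice
                changed = True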
Comparison between my CART algorithm and sklearn's CART: same data, same depth, same criterion
import numpy as np

np.random.seed(42)
# 100 samples with 4 integer features in [0, 10)
X = np.random.randint(10, size=(100, 4))
# Binary target
Y = np.random.randint(2, size=100)
# Column 0 is the target, the remaining columns are the features
a = np.column_stack((Y, X))
Tree from Sklearn
from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion='gini', max_depth=3)
clf = clf.fit(X, Y)
tree.export_graphviz(clf, out_file='tree.dot')
Tree from my CART algorithm
# Initialisation of the Tree list
Tree = []
# Initialisation of Node[0]; the string arguments are placeholders filled in during the build
Tree.append(Node(0, "L", "R", 0, "S", "V", a, "X"))
# Pass 2 for a depth of 3
TT(2)
# Print the Tree
for index, node in enumerate(Tree):
    print(index, node.t, "*", node.L, node.R, "*", "Depth:", node.D, node.S, len(node.M))
Index | t  | left | right | depth | split value | shape
0     | 0  | 1    | 2     | 0     | 7.5         | 100
1     | 1  | 3    | 4     | 1     | 0.5         | 78
2     | 2  | 5    | 6     | 1     | 3.5         | 22
3     | 3  | *    | *     | 2     | *           | 7
4     | 4  | 9    | 10    | 2     | 2.5         | 71
5     | 9  | *    | *     | 3     | *           | 24
6     | 10 | *    | *     | 3     | *           | 47
7     | 5  | 11   | 12    | 2     | 8.5         | 9
8     | 6  | 13   | 14    | 2     | 8.5         | 13
11    | 11 | *    | *     | 3     | *           | 4
12    | 12 | *    | *     | 3     | *           | 5
13    | 13 | *    | *     | 3     | *           | 12
14    | 14 | *    | *     | 3     | *           | 1
If we compare the tree generated by sklearn's CART with the tree generated by my CART algorithm, they produce the same results.
To do:
- Improve the recursion loop in the TT function to remove the execution flag. This flag prevents the algorithm from creating children for a node that has already been built.
- Create a predict function (a possible sketch follows below).
- Replace the subset array stored in each object with split rules, to save memory.
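For the predict function, one possible sketch (hypothetical, not in the repository; it assumes leaves carry "*" for their children as in the table above, that node.V indexes a column of X, and that column 0 of the subset array M is the target):

def predict(sample):
    # Walk from the root, following the breadth-first node numbers stored in L and R
    node = Tree[0]
    while node.L != "*":  # assumption: "*" marks a leaf, as in the printed table
        t = node.L if sample[node.V] < node.S else node.R
        node = next(n for n in Tree if n.t == t)
    # Majority class of the training subset stored in the leaf
    return int(node.M[:, 0].mean() >= 0.5)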