In this article, I describe how to code the CART algorithm in Python. I think building your own algorithm is a good exercise to improve both your coding skills and your understanding of trees. It is not the fastest implementation, but it is enough to understand CART and object-oriented programming.
CART algorithm
In 1984, Breiman et al. first described the CART algorithm, short for Classification and Regression Trees. It can be used for regression and for classification of a binary target.
It is a recursive algorithm: at each iteration, it finds the split of the data that best improves the prediction of the target values.
Tree definition
A tree is composed of nodes. Each node can have one or two children, or be a leaf node, i.e. a node with no children.
Each Node object is stored in a list:
Tree = [node1, node2, node3, node4, ...]
Tree structure
To store the links and paths between nodes, I use a breadth-first search layout.
If node(x) has a left child, the index of the child node is 2*x + 1.
If node(x) has a right child, the index of the child node is 2*x + 2.
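For instance, a minimal sketch of these index rules (the helper names left_child and right_child are mine, not from the repository):

def left_child(x):
    # Index of the left child of node x in the breadth-first layout
    return 2 * x + 1

def right_child(x):
    # Index of the right child of node x in the breadth-first layout
    return 2 * x + 2

# Example: node 4 has children at positions 9 and 10
print(left_child(4), right_child(4))  # 9 10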
Each node is a Python object:
class Node:
    def __init__(self, t, L, R, D, S, V, M, X):
        self.t = t  # Index of the node
        self.L = L  # Index of the left child
        self.R = R  # Index of the right child
        self.D = D  # Depth of the node
        self.S = S  # Value of the split
        self.V = V  # Feature of the split
        self.M = M  # Subset array
        self.X = X  # Execution flag
Splitting criterion: Gini impurity gain
$$i(n) = 1 - \sum_j p_j^2$$
where
$$p_j(t) = \frac{n_j(t)}{n(t)}$$
is the relative proportion of category j in node t.
For instance, if a data set contains only one target class, its Gini index is zero: it is a pure data set.
The split value A and the split feature induce two subsets, left and right:
$$\Delta i(n) = i(n) - p_{left}\, i(n_{left}) - p_{right}\, i(n_{right})$$
For each feature and each split value A, we compute the Gini impurity gain and choose the partition that maximizes it.
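As an illustration, here is a minimal sketch of these formulas in Python (the names gini_impurity and gini_gain are mine, assuming integer class labels in a numpy array):

import numpy as np

def gini_impurity(y):
    # i(n) = 1 - sum_j p_j^2, over the class proportions in y
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def gini_gain(y, mask):
    # Delta i(n) = i(n) - p_left * i(n_left) - p_right * i(n_right),
    # where the boolean mask selects the left subset
    p_left = mask.mean()
    return (gini_impurity(y)
            - p_left * gini_impurity(y[mask])
            - (1 - p_left) * gini_impurity(y[~mask]))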
Example of a split computation
In this example, we consider a single continuous feature x.
y | 0 1 1 1 0 1 0 1 1 0 0 1 1 0 1 0 0 1 0 0
x | 0 0 0 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1 0 0
The only candidate split value is the midpoint between the two distinct values of x: (0 + 1)/2 = 0.5.
Counts for feature x at the 0.5 split value:

Target y class | x < 0.5 | x > 0.5
0              | 10      | 0
1              | 3       | 7
$$i(n) = 1 - \left( \frac{10}{20} \right)^2 - \left( \frac{10}{20} \right)^2 = 0.5$$
$$i(left) = 1 - \left( \frac{10}{13} \right)^2 - \left( \frac{3}{13} \right)^2 = 0.3550296$$
$$i(right) = 1 - \left( \frac{0}{7} \right)^2 - \left( \frac{7}{7} \right)^2 = 0$$
$$\Delta i(n) = 0.5 - \left( \frac{13}{20} \right) 0.3550296 - \left( \frac{7}{20} \right) 0 = 0.2692308$$
If we had more than one feature, we would compute all the splits for each feature and choose the one with the best Gini impurity gain.
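As a check of the numbers above, here is a small sketch (the variable names are mine; it only verifies the arithmetic of this example):

import numpy as np

y = np.array([0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0])
x = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0])

def gini(labels):
    # Gini impurity of a vector of integer class labels
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

mask = x < 0.5  # left subset for the 0.5 split
gain = gini(y) - mask.mean() * gini(y[mask]) - (~mask).mean() * gini(y[~mask])
print(gini(y), gini(y[mask]), gini(y[~mask]), gain)
# 0.5 0.3550295... 0.0 0.2692307...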
Code of the CART algorithm
GitHub link: Tree_Cart_clean.py
Function Build(Tr):
Tr: a node of the tree.
Ex: Tree[0] is the first node of the Tree list (the list of all nodes).
Uses the result of the Ginisplit function to decide whether to create children for the node.
----------------------------------------------------------------------------------------------------
Function Split(v, a):
v: feature index, a: data array.
Ex: Split(2, a): feature 2 of array a.
Computes the maximum Gini gain for one feature; it evaluates every possible split.
----------------------------------------------------------------------------------------------------
Function Ginisplit(matrix):
matrix: array of features.
Aggregates the best split of each feature and finds the best overall partitioning.
----------------------------------------------------------------------------------------------------
Function TT(depth): main program.
depth: the depth of the tree, counted from the root node; if you want a maximum depth of 3, you must pass 2.
Builds the Tree list and defines the nodes.
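As an illustration of how these pieces fit together, here is a simplified sketch of the driver loop with the execution flag (this is not the exact code from Tree_Cart_clean.py; Build is the function described above):

def TT(max_depth):
    # Keep scanning the Tree list until every node up to max_depth is built;
    # Build may append children, so the list grows during the loop.
    changed = True
    while changed:
        changed = False
        for node in list(Tree):
            if node.X == "X" and node.D <= max_depth:
                Build(node)       # may append children to Tree
                node.X = "done"   # execution flag: never build this node twice
                changed = True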
Comparison between my CART algorithm and sklearn's CART: same data, same depth, same criterion
import numpy as np

np.random.seed(42)
# 100 samples with 4 integer features in [0, 10)
X = np.random.randint(10, size=(100, 4))
# Binary target
Y = np.random.randint(2, size=100)
# Column 0 is the target, the remaining columns are the features
a = np.column_stack((Y, X))
Tree from Sklearn
from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion='gini', max_depth=3)
clf = clf.fit(X, Y)
tree.export_graphviz(clf, out_file='tree.dot')
Tree from my CART algorithm
# Initialisation of the Tree list
Tree = []
# Initialisation of Node[0]; the string arguments are placeholders filled in during the build
Tree.append(Node(0, "L", "R", 0, "S", "V", a, "X"))
# Pass 2 for a depth of 3
TT(2)
# Print the Tree
for index, node in enumerate(Tree):
    print(index, node.t, "*", node.L, node.R, "*", "Depth:", node.D, node.S, len(node.M))
Index | t  | left | right | depth | split value | shape
0     | 0  | 1    | 2     | 0     | 7.5         | 100
1     | 1  | 3    | 4     | 1     | 0.5         | 78
2     | 2  | 5    | 6     | 1     | 3.5         | 22
3     | 3  | *    | *     | 2     | *           | 7
4     | 4  | 9    | 10    | 2     | 2.5         | 71
5     | 9  | *    | *     | 3     | *           | 24
6     | 10 | *    | *     | 3     | *           | 47
7     | 5  | 11   | 12    | 2     | 8.5         | 9
8     | 6  | 13   | 14    | 2     | 8.5         | 13
11    | 11 | *    | *     | 3     | *           | 4
12    | 12 | *    | *     | 3     | *           | 5
13    | 13 | *    | *     | 3     | *           | 12
14    | 14 | *    | *     | 3     | *           | 1
If we compare the tree generated by sklearn's CART with the tree generated by my CART algorithm, they produce the same results.
To do:
- Improve the recursion loop in the TT function to remove the execution flag. This flag prevents the algorithm from creating children for a node that has already been built.
- Create a predict function (a possible sketch follows below).
- Replace the subset array stored in each object with split rules, to save memory.
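For the predict function, one possible sketch (hypothetical, not in the repository; it assumes leaves carry "*" for their children as in the table above, that node.V indexes a column of X, and that column 0 of the subset array M is the target):

def predict(sample):
    # Walk from the root, following the breadth-first node numbers stored in L and R
    node = Tree[0]
    while node.L != "*":  # assumption: "*" marks a leaf, as in the printed table
        t = node.L if sample[node.V] < node.S else node.R
        node = next(n for n in Tree if n.t == t)
    # Majority class of the training subset stored in the leaf
    return int(node.M[:, 0].mean() >= 0.5)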