Decision tree
Decision Tree classification implementation for SciREX.
This module provides a Decision Tree implementation built on scikit-learn, with automatic parameter tuning via grid search. The implementation focuses on both accuracy and interpretability.
Mathematical Background
Decision Trees recursively partition the feature space using:
- Splitting Criteria:
  - Gini Impurity: 1 - ∑ᵢ pᵢ²
  - Entropy: -∑ᵢ pᵢ log(pᵢ), where pᵢ is the proportion of class i in the node
- Information Gain: IG(parent, children) = I(parent) - ∑ⱼ (nⱼ/n) I(childⱼ), where I is the impurity measure (Gini or Entropy)
- Tree Pruning (Cost-Complexity): Rα(T) = R(T) + α|T|, where R(T) is the tree error, |T| is the tree size, and α is the complexity parameter
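The two impurity measures above can be checked by hand. As an illustrative example (not from the SciREX source), consider a node with class proportions 0.8 and 0.2:

```python
import math

# Illustrative node: 8 samples of class 0 and 2 samples of class 1.
p = [0.8, 0.2]

# Gini impurity: 1 - sum(p_i^2)
gini = 1 - sum(pi**2 for pi in p)

# Entropy: -sum(p_i * log2(p_i))
entropy = -sum(pi * math.log2(pi) for pi in p)

print(round(gini, 2))      # 0.32
print(round(entropy, 3))   # 0.722
```

A pure node (all samples in one class) gives 0 under both measures; both are maximized when classes are evenly mixed.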
Key Features
- Automatic parameter optimization
- Multiple splitting criteria
- Built-in tree visualization
- Pruning capabilities
- Feature importance estimation
References
[1] Breiman, L., et al. (1984). Classification and Regression Trees.
[2] Quinlan, J. R. (1986). Induction of Decision Trees.
[3] Hastie, T., et al. (2009). The Elements of Statistical Learning, Ch. 9.
DecisionTreeClassifier
Bases: Classification
Decision Tree with automatic parameter tuning.
This implementation includes automatic selection of optimal parameters using grid search with cross-validation. It balances model complexity with performance through pruning and parameter optimization.
Attributes:

Name | Type | Description
---|---|---
cv | int | Number of cross-validation folds
best_params | Optional[Dict[str, Any]] | Best parameters found by grid search
model | Optional[Any] | Fitted DecisionTreeClassifier instance
Example
```python
classifier = DecisionTreeClassifier(cv=5)
X_train = np.array([[1, 2], [2, 3], [3, 4]])
y_train = np.array([0, 0, 1])
classifier.fit(X_train, y_train)
print(classifier.best_params)
```
Source code in scirex/core/ml/supervised/classification/decision_tree.py
__init__(cv=5, **kwargs)
Initialize Decision Tree classifier.
Parameters:

Name | Type | Description | Default
---|---|---|---
cv | int | Number of cross-validation folds. Defaults to 5. | 5
**kwargs | Any | Additional keyword arguments passed to parent class. | {}
Notes
The classifier uses GridSearchCV for parameter optimization, searching over different tree depths, splitting criteria, and minimum sample thresholds.
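The actual grid is defined in decision_tree.py; as a hedged sketch only, a grid matching the dimensions named above (depths, criteria, sample thresholds) could look like this with scikit-learn's GridSearchCV:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier as SkDT

# Hypothetical parameter grid -- the concrete values used by SciREX
# may differ; these are plausible defaults for illustration only.
param_grid = {
    "criterion": ["gini", "entropy"],       # splitting criteria
    "max_depth": [3, 5, 10, None],          # tree depths
    "min_samples_split": [2, 5, 10],        # minimum samples to split
    "min_samples_leaf": [1, 2, 4],          # minimum samples per leaf
}

# cv=5 mirrors the classifier's default number of folds.
search = GridSearchCV(SkDT(random_state=42), param_grid, cv=5)
```

After calling `search.fit(X, y)`, the winning combination is available as `search.best_params_`, which is what `best_params` on the classifier exposes.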
Source code in scirex/core/ml/supervised/classification/decision_tree.py
evaluate(X_test, y_test)
Evaluate model performance on test data.
Parameters:

Name | Type | Description | Default
---|---|---|---
X_test | ndarray | Test features of shape (n_samples, n_features) | required
y_test | ndarray | True labels of shape (n_samples,) | required

Returns:

Type | Description
---|---
Dict[str, float] | Dictionary containing evaluation metrics: accuracy (overall classification accuracy), precision, recall, and f1_score (all micro-averaged)
Source code in scirex/core/ml/supervised/classification/decision_tree.py
fit(X, y)
Fit Decision Tree model with parameter tuning.
Performs grid search over tree parameters to find optimal model configuration using cross-validation.
Parameters:

Name | Type | Description | Default
---|---|---|---
X | ndarray | Training feature matrix of shape (n_samples, n_features) | required
y | ndarray | Training labels of shape (n_samples,) | required
Notes
The grid search optimizes over:
- Splitting criterion (gini vs. entropy)
- Maximum tree depth
- Minimum samples for splitting
- Minimum samples per leaf
- Maximum features considered per split
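An end-to-end sketch of what fit() does internally, written against scikit-learn directly (the SciREX class wraps this behind its fit/predict interface; grid values here are illustrative, not the library's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier as SkDT

# Synthetic data standing in for a user's training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small illustrative grid over criterion and depth.
grid = {"criterion": ["gini", "entropy"], "max_depth": [3, 5, None]}
search = GridSearchCV(SkDT(random_state=0), grid, cv=5).fit(X_train, y_train)

# The tuned tree used for subsequent prediction.
model = search.best_estimator_
print(search.best_params_)
```

In the SciREX wrapper, `search.best_params_` becomes the `best_params` attribute and `best_estimator_` backs `model`.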
Source code in scirex/core/ml/supervised/classification/decision_tree.py
get_feature_importance()
Get feature importance scores.
Returns:

Type | Description
---|---
Dict[str, float] | Dictionary mapping feature indices to importance scores

Raises:

Type | Description
---|---
ValueError | If the model hasn't been fitted yet
Notes
Feature importance is computed based on the decrease in impurity (Gini or entropy) brought by each feature across all tree splits.
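Under the hood, scikit-learn exposes these impurity-based scores as `feature_importances_`; the sketch below shows how they might be mapped to a dictionary (the key naming is an assumption for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier as SkDT

# Toy data where the label depends only on feature 0.
X = np.array([[0, 1], [0, 2], [1, 1], [1, 2], [0, 3], [1, 3]])
y = np.array([0, 0, 1, 1, 0, 1])

tree = SkDT(random_state=0).fit(X, y)

# Map feature indices to their normalized impurity-decrease scores.
importance = {f"feature_{i}": s for i, s in enumerate(tree.feature_importances_)}
print(importance)  # all importance concentrates on feature_0
```

Because a single split on feature 0 already yields pure leaves, that feature receives the full normalized importance of 1.0 and the unused feature receives 0.0.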
Source code in scirex/core/ml/supervised/classification/decision_tree.py
get_model_params()
Get parameters of the fitted model.
Returns:

Type | Description
---|---
Dict[str, Any] | Dictionary containing: model_type (type of classifier), best_params (best parameters found by grid search), and cv (number of cross-validation folds used)
Source code in scirex/core/ml/supervised/classification/decision_tree.py
predict(X)
Predict class labels for samples in X.
Parameters:

Name | Type | Description | Default
---|---|---|---
X | ndarray | Test samples of shape (n_samples, n_features) | required

Returns:

Type | Description
---|---
ndarray | Array of predicted class labels

Raises:

Type | Description
---|---
ValueError | If the model hasn't been fitted yet
Source code in scirex/core/ml/supervised/classification/decision_tree.py
predict_proba(X)
Predict class probabilities for samples in X.
Parameters:

Name | Type | Description | Default
---|---|---|---
X | ndarray | Test samples of shape (n_samples, n_features) | required

Returns:

Type | Description
---|---
ndarray | Array of shape (n_samples, n_classes) with class probabilities

Raises:

Type | Description
---|---
ValueError | If the model hasn't been fitted yet
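For a decision tree, these probabilities come from the class frequencies in the leaf each sample lands in. A minimal sketch using scikit-learn directly (toy data, illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier as SkDT

# One feature, two classes; a depth-1 tree splits cleanly between them.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
tree = SkDT(max_depth=1, random_state=0).fit(X, y)

# Each row is one sample's distribution over the two classes.
proba = tree.predict_proba(np.array([[1.5], [3.5]]))
print(proba.shape)        # (2, 2)
print(proba.sum(axis=1))  # each row sums to 1.0
```

With pure leaves the rows are one-hot; in deeper trees with mixed leaves the rows are genuine fractional frequencies.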