My RandomForest & Stacking

10 minute read

Machine Learning

과제 1: 랜덤 포레스트 모델 직접 구현

(핵심) 랜덤 포레스트 모델 설명

랜덤 포레스트 모델에 대해 먼저 설명하겠습니다. 랜덤 포레스트 모델은 여러개의 결정 트리 모델을 이용하여 배깅 앙상블 학습을 통해 훈련되어지는 모델입니다. 즉, 각각의 결정트리의 예측값을 보고 그 중 가장 많은 빈도의 값을 최종 예측값으로 정하는 모델입니다.

다시 말해 어떤 샘플에 대한 랜덤 포레스트 모델의 예측값은 각각의 훈련된 결정트리들이 그 샘플에 대해 예측한 예측값들을 이용해서 정합니다.

각각 분류기에서 예측한 예측값 중 최빈값을 랜덤 포레스트 모델의 예측값으로 합니다.

기본 설정

# 파이썬 ≥3.5 필수 (파이썬 3.7 추천)
import sys
assert sys.version_info >= (3, 5) 

# 사이킷런 ≥0.20 필수
import sklearn
assert sklearn.__version__ >= "0.20"

# 공통 모듈 임포트
import numpy as np
import os

# 노트북 실행 결과를 동일하게 유지하기 위해
np.random.seed(42)

1단계 : 결정 트리 모델 훈련

일단 사이킷런에서 지원하는 moons dataset을 가져오겠습니다 ( make_moons(n_samples=10000, noise=0.4)로 가져옵니다.)

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

(출력의 일정함을 위해 random_state=42 옵션도 추가하였습니다.)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

훈련 세트와 테스트 세트를 분류하였습니다. 훈련세트 80% 테스트세트 20% 입니다.

DecisionTreeClassifier 클래스는 결정트리 알고리즘을 활용한 분류 모델을 지원한다.

from sklearn.tree import DecisionTreeClassifier    # 사이킷런에서 지원하는 결정트리 분류기 입니다.

from sklearn.model_selection import GridSearchCV

params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4]}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)

Fitting 3 folds for each of 294 candidates, totalling 882 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 882 out of 882 | elapsed:    9.0s finished





GridSearchCV(cv=3, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=42,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                            13, 14, 15, 16, 17, 18, 19, 20, 21,
                                            22, 23, 24, 25, 26, 27, 28, 29, 30,
                                            31, ...],
                         'min_samples_split': [2, 3, 4]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=1)

교차검증을 이용한 그리드 탐색을 이용하여 DecisionTreeClassifier 모델의 최적의 하이퍼파라미터 값을 찾습니다.

max_leaf_nodes : 최종 분류되는 잎(리프) 노드의 갯수를 의미합니다.
min_samples_split : 노드를 분할하기 위해 필요한 최소 샘플 수

여기서는 max_leaf_nodes 에 큰 중점을 두고 그리드 탐색을 하여 최적의 파라미터 값을 구하도록 합니다.

grid_search_cv.best_estimator_

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=17,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

그렇게 찾은 최적의 하이퍼파라미터 값을 찾아보니

max_leaf_nodes = 17
min_samples_split = 2 를 확인할 수 있었습니다.

GridSearchCV은 자연스럽게 훈련세트에 대해 찾은 최적의 하이퍼파라미터로 훈련을 시킵니다. (기본값으로 refit=True 로 설정되기 때문입니다.) 따라서 바로 훈련된 결정트리 모델에 대하여 테스트 세트에 대한 성능 평가를 해보겠습니다.

from sklearn.metrics import accuracy_score

y_pred = grid_search_cv.predict(X_test)
accuracy_score(y_test, y_pred)

0.8695

성능이 약 86.9%로 확인되었습니다.

2단계 : 랜덤 포레스트 구현

이번에는 위에서 구한 결정트리를 이용하여 랜덤 포레스트 모델을 구현하여 보겠습니다.

먼저 데이타 세트의 부분집합을 1000개 만듭니다. 각각의 부분집합은 무작위로 100개의 훈련 샘플들을 선택하여 가지고 있습니다. (중복 허용)

여기서 사이킷런의 ShuffleSplit을 이용하여 진행하였습니다.

from sklearn.model_selection import ShuffleSplit

n_trees = 1000
n_instances = 100

mini_sets = []

rs = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances, random_state=42)
for mini_train_index, mini_test_index in rs.split(X_train):
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    mini_sets.append((X_mini_train, y_mini_train))

각각의 만들어진 훈련 부분 집합 샘플세트들을 위에서 찾은 최적의 하이퍼파라미터의 결정트리 모델에 훈련시킵니다. 그리고 그 1000개의 훈련된 결정트리 모델을 테스트 세트로 성능을 평가합니다.

from sklearn.base import clone

forest = [clone(grid_search_cv.best_estimator_) for _ in range(n_trees)]

accuracy_scores = []

for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
    tree.fit(X_mini_train, y_mini_train)
    
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

np.mean(accuracy_scores)

0.8054499999999999

1단계에서는 전체 훈련세트를 이용하였지만, 여기서는 각각의 모델이 더 작은 훈련세트를 이용하였기 때문에 성능이 1단계에서 찾은 모델에 비해 안 좋아 보입니다. ( 약 80%를 보입니다. - 이 수치는 1000개의 결정트리 모델의 성능 수치에 대한 평균입니다.)

자 이제 랜덤 포레스트의 핵심적인 부분을 구현하겠습니다.

1000개의 훈련된 분류기가 테스트 세트의 각각의 샘플에 대해 예측값을 예측합니다.
그리고 최종적으로 각각의 테스트 샘플에 대한 1000개의 예측값 중 최빈값(여기서는 클래스를 2개로 분류하므로 과반수)을 최종적으로 샘플에 대한 랜덤 포레스트 모델의 예측값으로 합니다.

Y_pred = np.empty([n_trees, len(X_test)], dtype=np.uint8)

for tree_index, tree in enumerate(forest):
    Y_pred[tree_index] = tree.predict(X_test)

SciPy의 mode()를 이용해서 과반수 값을 최종 예측값으로 합니다.

from scipy.stats import mode

y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

자 이제 랜덤 포레스트 모델의 테스트 세트에 대한 성능을 평가하여 보겠습니다.

accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))

0.872

약 87.2%로 약 0.3% 가량 정확도 성능의 향상이 보입니다. 즉, 앙상블 학습인 랜덤 포레스트 훈련을 통해 단일의 결정 트리 모델의 성능보다 더 좋아졌다는 것이 확인이 되었습니다.

3단계 : 사이킷런의 모델과 성능 비교

이번에는 사이킷 런에서 제공하는 RandomForestClassifier를 이용하여 훈련시키고 성능을 비교해 보겠습니다.

2단계와 똑같은 조건을 만족 시켜주기 위해

n_estimatiors=1000 ( 결정트리 모델의 갯수 1000개 )
max_leaf_nodes = 17 ( 최소 리프 노드의 갯수 17개 )
min_samples_split=2 ( 기본값 )

로 설정해주었습니다.

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=1000, max_leaf_nodes=17, random_state=42)
rnd_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=17, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

훈련이 완료되었고 성능을 확인해보겠습니다.

rnd_clf.score(X_test, y_test)

0.8715

약 87.1% 2단계의 모델과 비슷한 성능을 보입니다. 구현이 적절히 잘 이루어진거 같습니다.

과제 2: 스태킹 모델 직접 구현

(핵심) 스태킹 모델에 대한 설명

스태킹 모델은 배깅 방식과 부스팅 방식을 함께 사용하는 방식입니다. 1층에서 모델들을 한번 거치고 그것을 통해 얻어진 예측값으로 이루어진 샘플을 다시 한 번 더 최종 모델을 통해 예측하는 방식인데 훈련방법에 대해 자세히 살펴보면

배깅 방식 : 1층에서 각각의 모델을 훈련 시키고 그 훈련된 모델을 이용하여 각각의 예측값을 생성합니다.

부스팅 방식 : 1층의 훈련된 모델들의 예측값을 특성으로 하는 훈련 샘플을 만들어 블렌더 모델을 훈련시킵니다. ( 2개 층의 모델들을 순차적으로 훈련합니다.즉, 부스팅 방식입니다. )

최종적으로 스태킹 모델이 훈련이 되고, 새로운 샘플에 대해 예측할 때는 1층 모델들에 주입되어 각각의 예측값을 만들고 이것을 특성으로 하는 샘플로 취급하여 2층의 블렌더 모델에 넣어 최종 예측값을 만들어냅니다.

1단계 : 투표식 분류기 훈련(1층)

먼저 MNIST 데이타를 다운로드 받습니다. 그리고 타깃을 정합니다.

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(np.uint8)

훈련세트, 검증세트, 테스트세트로 분류합니다. 이때 사이킷런의 train_test_split 을 이용합니다.

훈련세트 : 50,000 개
검증세트 : 10,000 개
테스트세트 : 10,000 개

from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)

그리고 각각의 훈련세트에 대해 각각의 Random Forest classifier, Extra-Trees classifier, SVM 분류기, MLPClassifier를 훈련시킵니다.

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier


random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=42)
mlp_clf = MLPClassifier(random_state=42)


estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
Training the ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=42, verbose=0,
                     warm_start=False)
Training the LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=100,
          multi_class='ovr', penalty='l2', random_state=42, tol=20, verbose=0)
Training the MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=42, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

각각의 분류기에 대해 성능을 평가해보겠습니다.

[estimator.score(X_val, y_val) for estimator in estimators]

[0.9692, 0.9715, 0.8662, 0.9639]

3번째 분류기 즉 LinearSVC 분류기가 비교적 성능이 안 좋아보이지만 투표식 분류기에 도움이 될 것이기 때문에 사용하겠습니다.

다음으로 이들을 함께 합치는 앙상블 모델을 만들어보겠습니다.

앙상블 모델은 소프트 혹은 하드 보팅 분류기를 사용
이 앙상블 모델은 검증세트에 대한 성능이 각각의 분류기보다 더 좋습니다.

from sklearn.ensemble import VotingClassifier

named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
]                                                     # 앙상블 모델에 포함될 분류기 목록


voting_clf = VotingClassifier(named_estimators)      # 사이킷런에서 지원하는 투표식 분류기 VotiongClassifier

voting_clf.fit(X_train, y_train)                    # 훈련

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     n_estimators=100,
                                                     n_jobs...
                                            epsilon=1e-08,
                                            hidden_layer_sizes=(100,),
                                            learning_rate='constant',
                                            learning_rate_init=0.001,
                                            max_fun=15000, max_iter=200,
                                            momentum=0.9, n_iter_no_change=10,
                                            nesterovs_momentum=True,
                                            power_t=0.5, random_state=42,
                                            shuffle=True, solver='adam',
                                            tol=0.0001, validation_fraction=0.1,
                                            verbose=False, warm_start=False))],
                 flatten_transform=True, n_jobs=None, voting='hard',
                 weights=None)

성능을 측정해보겠습니다.

voting_clf.score(X_val, y_val)

0.9702

아까 개별 분류기의 성능과 비교해 보겠습니다.

[estimator.score(X_val, y_val) for estimator in estimators]

[0.9692, 0.9715, 0.8662, 0.9639]

확실히 좋아졌음이 확인되었습니다.

이번에는 아까 SVM 모델의 성능이 좋지 않다고 판단하여 빼고 훈련시켰다면 어떤 결과가 만들어질 지 확인해 보겠습니다.

다음과 같이 set_params()를 이용하여 None을 주어 SVM모델을 빼보겠습니다.

voting_clf.set_params(svm_clf=None)

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     n_estimators=100,
                                                     n_jobs...
                                            epsilon=1e-08,
                                            hidden_layer_sizes=(100,),
                                            learning_rate='constant',
                                            learning_rate_init=0.001,
                                            max_fun=15000, max_iter=200,
                                            momentum=0.9, n_iter_no_change=10,
                                            nesterovs_momentum=True,
                                            power_t=0.5, random_state=42,
                                            shuffle=True, solver='adam',
                                            tol=0.0001, validation_fraction=0.1,
                                            verbose=False, warm_start=False))],
                 flatten_transform=True, n_jobs=None, voting='hard',
                 weights=None)

투표식 분류기에 사용하는 분류기들을 확인해보겠습니다.

voting_clf.estimators

[('random_forest_clf',
  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                         criterion='gini', max_depth=None, max_features='auto',
                         max_leaf_nodes=None, max_samples=None,
                         min_impurity_decrease=0.0, min_impurity_split=None,
                         min_samples_leaf=1, min_samples_split=2,
                         min_weight_fraction_leaf=0.0, n_estimators=100,
                         n_jobs=None, oob_score=False, random_state=42, verbose=0,
                         warm_start=False)),
 ('extra_trees_clf',
  ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)),
 ('svm_clf', None),
 ('mlp_clf',
  MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
                beta_2=0.999, early_stopping=False, epsilon=1e-08,
                hidden_layer_sizes=(100,), learning_rate='constant',
                learning_rate_init=0.001, max_fun=15000, max_iter=200,
                momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
                power_t=0.5, random_state=42, shuffle=True, solver='adam',
                tol=0.0001, validation_fraction=0.1, verbose=False,
                warm_start=False))]

하지만 다시 확인해보니 설정이 빠져있게 되지 않았네요

voting_clf.estimators_

[RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=None, oob_score=False, random_state=42, verbose=0,
                        warm_start=False),
 ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                      criterion='gini', max_depth=None, max_features='auto',
                      max_leaf_nodes=None, max_samples=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False),
 LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
           intercept_scaling=1, loss='squared_hinge', max_iter=100,
           multi_class='ovr', penalty='l2', random_state=42, tol=20, verbose=0),
 MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
               beta_2=0.999, early_stopping=False, epsilon=1e-08,
               hidden_layer_sizes=(100,), learning_rate='constant',
               learning_rate_init=0.001, max_fun=15000, max_iter=200,
               momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
               power_t=0.5, random_state=42, shuffle=True, solver='adam',
               tol=0.0001, validation_fraction=0.1, verbose=False,
               warm_start=False)]

우리는 다시 VotingClassifier를 맞춰줄수도 있지만, 그냥 훈련된 분류기 리스트에서 SVM 분류기를 삭제 해 주도록 할게요

del voting_clf.estimators_[2]

다시 한번 성능 평가를 해볼게요.

voting_clf.score(X_val, y_val)

0.9736

오히려 성능이 조금더 좋아졌습니다.

지금까지는 하드 투표식 분류기를 사용하였습니다.

이번에는 soft 방식을 이용해보겠습니다.

다시 훈련할 필요없이 훈련된 분류기의 voting을 soft로 바꿔주면 됩니다.

voting_clf.voting = "soft"

성능 평가를 해보겠습니다.

voting_clf.score(X_val, y_val)

0.97

성능이 떨어졌네요. 이번 경우에는 hard방식이 더 좋은 분류기 성능을 나타내고 있습니다.

이번에는 테스트 세트에 대한 성능을 비교해보겠습니다.

voting_clf.voting = "hard"
voting_clf.score(X_test, y_test)

0.9704

투표식 분류기에 대한 성능입니다.

[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

[0.9645, 0.9691, 0.9604]

각각의 분류기에 대한 성능입니다. 앙상블 학습을 통한 투표식 분류기가 각각의 분류기보다 조금이지만 더 좋은 성능을 나타내고 있습니다.

2단계 : 스태킹 앙상블 학습 구현

스태킹 앙상블 모델은 배깅학습 위에 또다른 블랜드 모델을 올리는 거라고 할 수 있습니다. 즉, 새로운 샘플이 들어오면 여러 개의 훈련된 분류기를 이용하여 얻은 각각의 예측값에 대해 최빈값이나 확률의 평균을 이용하여 최종 예측값을 정하는 것이 아니라 그 예측값들을 이용해서 다시 다른 훈련 된 모델의 예측을 최종값으로 하는 모델입니다.

1단계에서 훈련세트에 의해 훈련된 각각의 분류기를 이용하여 검증세트에 대한 각각의 예측값을 만듭니다.

이것들을 새로운 훈련 세트로 만듭니다. 즉, 검증 세트의 각 샘플에 대한 각각의 분류기들의 예측값들을 다시 하나의 샘플로 하는 훈련 세트를 만듭니다.

각각의 새로운 훈련세트는 이미지에 대한 각각의 분류기가 예측하는 예측값의 집합을 포함하는 벡터이며, 타깃은 원래 샘플의 이미지에 대한 클래스값입니다.

자 이제 새로운 훈련 세트에 대한 훈련도 해보겠습니다.

X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)

각각의 분류기에 대해 검증세트에 대한 예측값을 만듭니다. 이것이 새로운 훈련세트가 됩니다.

X_val_predictions

array([[5., 5., 5., 5.],
       [8., 8., 8., 8.],
       [2., 2., 2., 2.],
       ...,
       [7., 7., 7., 7.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]], dtype=float32)

새로 만들어진 훈련세트를 이용해 랜덤포레스트 모델을 훈련시킵니다.

즉 최종적으로 예측하는 블랜더 모델은 랜덤포레스트 모델로 사용하는 모델입니다.

rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

블랜더 모델인 랜덤포레스트 모델을 oob를 이용한 성능 평가를 해보았습니다.

rnd_forest_blender.oob_score_

0.968

약 97%로 꽤 괜찮은 성능을 보이고 있습니다.

만약 필요하다면 더 좋은 성능의 모델을 이용하기 위해 교차 검증을 통한 그리드 탐색을 이용할 수도 있고, 또는 블랜더 모델로 다른 모델을 선택해 볼 수도 있습니다.

하지만 괜찮은 성능을 보이고 있기에 그냥 랜덤포레스트 모델로 선택하겠습니다.

자 이제 각각의 분류기와 블랜더 모델까지 훈련이 완료 되었습니다. 그래서 전체적으로 앙상블 학습인 스태킹 모델을 형성하였습니다. 이제 테스트세트에 대한 성능 평가를 해보겠습니다.

일단 각각의 테스트 세트에 대한 각각의 분류기의 이미지에 대한 예측값을 만듭니다.

X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

이것을 다시 블렌더 모델에 주입해서 최종 예측값을 만듭니다.

y_pred = rnd_forest_blender.predict(X_test_predictions)

이제 성능을 평가해보겠습니다.

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.9655

1단계에서 한 투표식 분류기와 성능을 비교해보았을 때 오히려 더 안 좋아졌습니다. 심지어 각각의 분류기보다도 항상 좋은 것도 아니었습니다.

3단계 : 사이킷런의 모델과 성능 비교

사이킷 런에서 지원하는 StackingClassifier 분류기 모델을 이용해서 훈련 보겠습니다.

2단계에서 직접 구현한 모델과의 비교를 위해 각각의 분류기와 블렌더 모델에 사용되는 모델도 동일한 것으로 하였습니다. 그리고 같은 데이터를 이용하였습니다.

그래서 다음과 같이 모델을 구현하였습니다.

from sklearn.ensemble import StackingClassifier

estimators = [("random_forest_clf", random_forest_clf),
              ("extra_trees_clf", extra_trees_clf),
              ("mlp_clf", mlp_clf)]                             # 각각의 분류기 목록
                           

clf = StackingClassifier(estimators=estimators, 
                         final_estimator=rnd_forest_blender)    # 각각의 분류기와 블렌더 모델을 설정한 스태킹 모델

그리고 훈련세트를 이용하여 모델을 훈련시키고 바로 테스트 세트를 이용하여 성능을 확인하였습니다. (15분 이상 많은 시간이 걸림)

clf.fit(X_train, y_train).score(X_test, y_test)

0.9753

약 97%로 성능이 2단계에서 직접 구현한 스태킹 모델보다 더 좋은 것을 확인 할 수 있었습니다. 이 차이는 훈련할 때 데이터를 이용하는 방식의 차이에 있다고 생각됩니다.

2단계에서는 블렌더 모델을 훈련시킬때 따로 빼두었던 검증세트를 이용하여 훈련시켰지만, 3단계의 사이킷런의 스태킹 모델에서는 애초에 훈련 세트를 2개로 분리 해두고 그 중 하나는 각각의 분류기 훈련에, 나머지 하나는 블렌더 모델 훈련에 사용했기 때문에 차이가 발생한 것 같습니다.

Share on

Twitter Facebook LinkedIn

Kim Jin Sang, Student

My RandomForest & Stacking

Machine Learning

과제 1: 랜덤 포레스트 모델 직접 구현

1단계 : 결정 트리 모델 훈련

2단계 : 랜덤 포레스트 구현

3단계 : 사이킷런의 모델과 성능 비교

과제 2: 스태킹 모델 직접 구현

1단계 : 투표식 분류기 훈련(1층)

2단계 : 스태킹 앙상블 학습 구현

3단계 : 사이킷런의 모델과 성능 비교

Share on

You may also enjoy

Python Algorithm Chapter9

Python Algorithm Chapter8

Python Algorithm Chapter4

Python Algorithm Chapter5