1. Loading the data

Download the data from GitHub.

import os
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/Jin-Sang/titanic1/main/"
TITANIC_PATH = os.path.join("datasets", "titanic")
TITANIC_TRAIN_URL = DOWNLOAD_ROOT + "train.csv"
TITANIC_TEST_URL = DOWNLOAD_ROOT + "test.csv"

Define a function that downloads the training set and the test set and saves them as separate files.

def download_data():
    # Create the target directory if it does not exist yet
    os.makedirs(TITANIC_PATH, exist_ok=True)

    train_path = os.path.join(TITANIC_PATH, "train.csv")
    urllib.request.urlretrieve(TITANIC_TRAIN_URL, train_path)

    test_path = os.path.join(TITANIC_PATH, "test.csv")
    urllib.request.urlretrieve(TITANIC_TEST_URL, test_path)

Download and save the data.

download_data()
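
Re-running this cell downloads both files again every time. A minimal sketch of a helper that skips files already on disk; `fetch_if_missing` is a hypothetical name, not part of the original notebook, and the demo uses a local `file://` URL in place of the GitHub URL so that it runs without network access:

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def fetch_if_missing(url, dest_path):
    """Download url to dest_path unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    (Hypothetical helper, not part of the original notebook.)
    """
    if os.path.isfile(dest_path):
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, dest_path)
    return True


# Demo with a local file:// URL standing in for the GitHub URL.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_text("PassengerId,Survived\n1,0\n")
    dest = os.path.join(tmp, "datasets", "titanic", "train.csv")

    first = fetch_if_missing(source.as_uri(), dest)   # downloads
    second = fetch_if_missing(source.as_uri(), dest)  # skipped
    copied = Path(dest).read_text()
```

The trade-off is that a stale or partially downloaded file is never refreshed, so delete the file to force a new download.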

Define a function that loads a CSV file into a pandas DataFrame.

import pandas as pd

def load_titanic_data(filename, titanic_path=TITANIC_PATH):

    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)

Load the training set and the test set as DataFrames.

train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")

Check the training set.

train_data

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

Check the test set.
The test set contains no labels: the goal is to submit our predictions to the website, which scores them for us.

test_data

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...
413	1305	3	Spector, Mr. Woolf	male	NaN	0	0	A.5. 3236	8.0500	NaN	S
414	1306	1	Oliva y Ocana, Dona. Fermina	female	39.0	0	0	PC 17758	108.9000	C105	C
415	1307	3	Saether, Mr. Simon Sivertsen	male	38.5	0	0	SOTON/O.Q. 3101262	7.2500	NaN	S
416	1308	3	Ware, Mr. Frederick	male	NaN	0	0	359309	8.0500	NaN	S
417	1309	3	Peter, Master. Michael J	male	NaN	1	1	2668	22.3583	NaN	C

418 rows × 11 columns

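The attributes have the following meanings:

Survived: the target. 0 means the passenger did not survive, 1 means the passenger survived.
Pclass: passenger class (1st, 2nd, or 3rd).
Name, Sex, Age: self-explanatory.
SibSp: number of siblings and spouses aboard with the passenger.
Parch: number of parents and children aboard with the passenger.
Ticket: ticket ID.
Fare: price paid for the ticket (in pounds).
Cabin: cabin number.
Embarked: port where the passenger boarded. C (Cherbourg), Q (Queenstown), S (Southampton).

Let's check the training data for missing values.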

train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

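Some values of the Age, Cabin, and Embarked attributes are null (they have fewer than 891 non-null entries). Cabin in particular is 77% null, so we ignore it for now and use the rest. Age is about 20% null, so we need to decide how to handle it; replacing the nulls with the median age looks reasonable.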
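
These percentages follow directly from the non-null counts reported by info(). As a quick arithmetic check (on the DataFrame itself, `train_data.isna().mean()` should give the same fractions):

```python
# Non-null counts copied from the train_data.info() output above.
n_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Fraction of missing values per column.
missing_fraction = {col: 1 - count / n_rows for col, count in non_null.items()}

for col, frac in missing_fraction.items():
    print(f"{col}: {frac:.1%} missing")
# Age: 19.9% missing
# Cabin: 77.1% missing
# Embarked: 0.2% missing
```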

The Name and Ticket attributes do hold values, but converting them into numbers that a machine learning model can use is somewhat tricky, so we ignore these two attributes for now.

Next, let's look at the summary statistics.

train_data.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Only 38% of the passengers survived. :( That is close enough to 40% that accuracy should be a reasonable metric for evaluating the model.
The mean Fare was 32.20 pounds, which does not look that expensive (although it was probably a considerable sum at the time).
The mean Age was just under 30.

Let's check that the labels are indeed only 0 and 1.

train_data["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64
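
When judging accuracy it helps to know what a trivial model would score. A small sketch using the counts above; the "always predict the majority class" baseline is an addition for illustration, not something the original notebook computes:

```python
# Class counts copied from the value_counts() output above.
counts = {0: 549, 1: 342}
total = sum(counts.values())

survival_rate = counts[1] / total
# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(counts.values()) / total

print(f"survival rate:     {survival_rate:.3f}")      # 0.384
print(f"majority baseline: {baseline_accuracy:.3f}")  # 0.616
```

Any model worth keeping should score clearly above this baseline of roughly 62%.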

Now let's look at the categorical attributes.

train_data["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

train_data["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

train_data["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

The Embarked attribute tells us where the passenger boarded: C=Cherbourg, Q=Queenstown, S=Southampton.

2. Pipelines for feature preprocessing

To process each column differently, we use pipelines together with a custom DataFrameSelector class.

2.1 DataFrameSelector: a class that selects specific columns

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

2.2 Pipeline for numerical attributes

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
        ("select_numeric", DataFrameSelector(["Age", "SibSp", "Parch", "Fare"])),
        ("imputer", SimpleImputer(strategy="median")),
    ])

num_pipeline.fit_transform(train_data)

array([[22.    ,  1.    ,  0.    ,  7.25  ],
       [38.    ,  1.    ,  0.    , 71.2833],
       [26.    ,  0.    ,  0.    ,  7.925 ],
       ...,
       [28.    ,  1.    ,  2.    , 23.45  ],
       [26.    ,  0.    ,  0.    , 30.    ],
       [32.    ,  0.    ,  0.    ,  7.75  ]])

This is the result of running the numerical attributes through the pipeline.

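2.3 Pipeline for categorical attributes

The string categorical attributes are imputed here with a separate custom imputer class. (Recent scikit-learn versions can also handle string columns with SimpleImputer(strategy="most_frequent").)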

# Inspired by stackoverflow.com/questions/25239958
class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.most_frequent_)

In other words, it fills missing values with the most frequent value of each column.
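
The same idea can be seen with plain pandas on a small example. This is a sketch on made-up toy data, not the notebook's class: it computes and fills in one step rather than learning the values in fit and reusing them in transform, and ties may be broken differently than with value_counts():

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), with one missing value per column.
df = pd.DataFrame({
    "Sex": ["male", "female", "male", np.nan],
    "Embarked": ["S", "S", np.nan, "C"],
})

# mode() returns the most frequent value(s) per column; row 0 holds the top one.
most_frequent = df.mode().iloc[0]
filled = df.fillna(most_frequent)
print(filled)
```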

from sklearn.preprocessing import OneHotEncoder

Pipeline for the categorical attributes:

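cat_pipeline = Pipeline([
        ("select_cat", DataFrameSelector(["Pclass", "Sex", "Embarked"])),
        ("imputer", MostFrequentImputer()),
        # Note: scikit-learn 1.2 renamed this parameter to sparse_output
        # (the old name was removed in 1.4).
        ("cat_encoder", OneHotEncoder(sparse=False)),
    ])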

cat_pipeline.fit_transform(train_data)

array([[0., 0., 1., ..., 0., 0., 1.],
       [1., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 1.],
       ...,
       [0., 0., 1., ..., 0., 0., 1.],
       [1., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 1., 0.]])

This is the result of running the string categorical attributes through the pipeline.

2.4 Completing the preprocessing pipeline

Now we join the numerical and categorical pipelines.

from sklearn.pipeline import FeatureUnion
preprocess_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

The data preprocessing pipeline for training machine learning models is now complete.

X_train = preprocess_pipeline.fit_transform(train_data)
X_train

array([[22.,  1.,  0., ...,  0.,  0.,  1.],
       [38.,  1.,  0., ...,  1.,  0.,  0.],
       [26.,  0.,  0., ...,  0.,  0.,  1.],
       ...,
       [28.,  1.,  2., ...,  0.,  0.,  1.],
       [26.,  0.,  0., ...,  1.,  0.,  0.],
       [32.,  0.,  0., ...,  0.,  1.,  0.]])
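
Recent scikit-learn versions also provide ColumnTransformer, which applies different transformers to different columns without a custom selector class. A hedged sketch on toy data (the values are made up, and this is an alternative for comparison, not the notebook's approach):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up for illustration) with missing values in both kinds of column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Categorical columns: fill with the most frequent value, then one-hot encode.
cat_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder()),
])

preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
        ("cat", cat_steps, ["Sex", "Embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)  # 2 numeric columns + 2 Sex categories + 2 Embarked categories
```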

We also grab the labels.

y_train = train_data["Survived"]

3. Training an SVC model

3.1 SVC classifier

Train an SVC model.

from sklearn.svm import SVC

svm_clf = SVC(gamma="auto")
svm_clf.fit(X_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

We could now make predictions on the test set and submit them to the website for scoring.

X_test = preprocess_pipeline.transform(test_data)
y_pred = svm_clf.predict(X_test)

But to aim for a good score, let's first evaluate the model ourselves.

3.2 Model evaluation

Cross-validation

from sklearn.model_selection import cross_val_score

svm_scores = cross_val_score(svm_clf, X_train, y_train, cv=10)
svm_scores.mean()

0.7329588014981274

That is about 73%. Let's try training a model that scores higher.
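
To make the mechanics of 10-fold cross-validation concrete, here is a minimal sketch of how the 891 training rows could be split into folds. `k_fold_indices` is a hypothetical helper written for illustration; it does plain, unshuffled splitting, whereas cross_val_score with an integer cv and a classifier uses stratified folds, so the actual fold contents will differ:

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for plain k-fold splitting."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=891, n_folds=10))
fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)
```

Each row is used for validation exactly once and for training in the other nine rounds; the reported score is the mean of the ten validation accuracies.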

4. RandomForestClassifier model

We train a RandomForestClassifier and evaluate it with cross-validation.

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_scores = cross_val_score(forest_clf, X_train, y_train, cv=10)
forest_scores.mean()

0.8126466916354558

Accuracy improves to about 81%.

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

plt.figure(figsize=(8, 4))
plt.plot([1]*10, svm_scores, ".")
plt.plot([2]*10, forest_scores, ".")
plt.boxplot([svm_scores, forest_scores], labels=("SVM","Random Forest"))
plt.ylabel("Accuracy", fontsize=14)
plt.show()

[Figure: box plot of the 10-fold cross-validation accuracy scores for SVM and Random Forest]

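Rather than looking only at the mean accuracy over the 10 folds, we can plot all 10 scores of each model as a box-and-whisker plot, which shows the first and third quartiles. The Random Forest scores are clustered more tightly inside the box between the first and third quartiles than the SVM scores, and Random Forest has fewer outliers (points that fall outside the whiskers). In other words, Random Forest looks like the better-performing model.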
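
For reference, the box and whiskers can be computed by hand. A sketch using hypothetical fold scores (made up for illustration, not the notebook's results) and the 1.5 × IQR whisker rule that Matplotlib uses by default:

```python
import statistics

# Hypothetical fold scores, made up for illustration (not the notebook's results).
scores = [0.70, 0.72, 0.73, 0.74, 0.75, 0.75, 0.76, 0.77, 0.78, 0.95]

# method="inclusive" interpolates linearly between sorted data points.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Matplotlib's default whiskers reach to the most extreme points
# within 1.5 * IQR of the box; anything beyond is drawn as an outlier.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_limit or s > upper_limit]
print(outliers)  # [0.95]
```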

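5. Improving performance

5.1 Number of companions

Instead of using the numbers of parents, children, and siblings separately, let's try transforming them into a single feature: the number of companions.

Adding the companion feature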

train_data["RelativesOnboard"] = train_data["SibSp"] + train_data["Parch"]
train_data[["RelativesOnboard", "Survived"]].groupby(['RelativesOnboard']).mean()

	Survived
RelativesOnboard
0	0.303538
1	0.552795
2	0.578431
3	0.724138
4	0.200000
5	0.136364
6	0.333333
7	0.000000
10	0.000000

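We combined parents/children and siblings/spouses into a single count of companions, and checked the survival rate for each value.

Displaying the training set below confirms that the RelativesOnboard column has been added.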

train_data

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	RelativesOnboard
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	1
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S	1
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S	0
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S	0
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S	3
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C	0
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q	0

891 rows × 13 columns

train_data[train_data["Survived"] == 1]["RelativesOnboard"].value_counts()

0    163
1     89
2     59
3     21
6      4
4      3
5      3
Name: RelativesOnboard, dtype: int64

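Passengers who traveled alone account for the largest number of survivors, so this feature seems worth trying.

Modifying the pipeline

We modify the numerical pipeline so that it selects the companion-count feature instead of the separate SibSp and Parch features.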

num_pipeline1 = Pipeline([
        ("select_numeric", DataFrameSelector(["Age", "Fare", "RelativesOnboard"])),
        ("imputer", SimpleImputer(strategy="median")),
    ])

The full preprocessing pipeline is updated accordingly.

preprocess_pipeline1 = FeatureUnion(transformer_list=[
        ("num_pipeline1", num_pipeline1),
        ("cat_pipeline", cat_pipeline),
    ])

We run the data through the pipeline to preprocess it.

X_train1 = preprocess_pipeline1.fit_transform(train_data)
X_train1

array([[22.    ,  7.25  ,  1.    , ...,  0.    ,  0.    ,  1.    ],
       [38.    , 71.2833,  1.    , ...,  1.    ,  0.    ,  0.    ],
       [26.    ,  7.925 ,  0.    , ...,  0.    ,  0.    ,  1.    ],
       ...,
       [28.    , 23.45  ,  3.    , ...,  0.    ,  0.    ,  1.    ],
       [26.    , 30.    ,  0.    , ...,  1.    ,  0.    ,  0.    ],
       [32.    ,  7.75  ,  0.    , ...,  0.    ,  1.    ,  0.    ]])

Training and evaluation

We train a random forest model and cross-validate it.

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_scores = cross_val_score(forest_clf, X_train1, y_train, cv=10)
forest_scores.mean()

0.8025717852684146

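At about 80%, performance actually got slightly worse.

Passengers traveling alone made up the largest number of survivors in absolute terms, but their survival rate was only about 30%, which is not high. So the relationship between survival and the number of companions is probably not as strong as the raw counts suggested.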
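
The gap between survivor counts and survival rates can be made explicit with the figures from the two tables above. A small sketch; the pairing of counts to groups assumes the four largest counts belong to 0, 1, 2, and 3 relatives in that order, which the arithmetic below is consistent with:

```python
# Survivor counts and survival rates taken from the two tables above,
# keyed by RelativesOnboard (assumed pairing, see the note above).
survivors = {0: 163, 1: 89, 2: 59, 3: 21}
survival_rate = {0: 0.303538, 1: 0.552795, 2: 0.578431, 3: 0.724138}

# Group size implied by the two tables: survivors / rate.
group_size = {k: round(survivors[k] / survival_rate[k]) for k in survivors}

for k in survivors:
    print(f"{k} relatives: {survivors[k]:3d} of {group_size[k]:3d} survived "
          f"({survival_rate[k]:.0%})")
```

Solo travelers have the most survivors only because the group is by far the largest; their survival rate is the lowest of these four groups.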

5.2 Age bucketing

Next, on top of this, let's try replacing the exact age with an age-range feature.

Adding the age bucket feature

train_data["AgeBucket"] = train_data["Age"] // 15 * 15
train_data[["AgeBucket", "Survived"]].groupby(['AgeBucket']).mean()

	Survived
AgeBucket
0.0	0.576923
15.0	0.362745
30.0	0.423256
45.0	0.404494
60.0	0.240000
75.0	1.000000

We floor-divide the age by 15 and multiply the quotient by 15, which groups the ages into 15-year buckets.

We also checked the survival rate of each bucket.
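
A quick illustration of the bucketing arithmetic on a few sample ages (chosen for illustration). Note that a missing age stays missing, which is why the imputer is still needed afterwards:

```python
import math

ages = [0.42, 14.0, 15.0, 22.0, 38.0, 80.0, float("nan")]

# Floor division drops the remainder, so age // 15 * 15 is the lower
# edge of the 15-year bucket the age falls into.
buckets = [age // 15 * 15 for age in ages]
print(buckets)

# NaN propagates through the arithmetic instead of landing in a bucket.
missing_stays_missing = math.isnan(buckets[-1])
```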

Displaying the training set below confirms that the bucketed age has been added.

train_data

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	RelativesOnboard	AgeBucket
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	1	15.0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	1	30.0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	0	15.0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S	1	30.0
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S	0	30.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S	0	15.0
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S	0	15.0
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S	3	NaN
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C	0	15.0
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q	0	30.0

891 rows × 14 columns

train_data[train_data["Survived"] == 1]["AgeBucket"].value_counts()

15.0    111
30.0     91
0.0      45
45.0     36
60.0      6
75.0      1
Name: AgeBucket, dtype: int64

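Looking at the number of survivors by age, the 15 bucket (ages 15 to 29) has the most, followed by the others in the order shown. These counts do not track the survival rates above very closely, but we proceed anyway.

Modifying the pipeline

The numerical pipeline must now select the newly added AgeBucket feature instead of Age.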

num_pipeline2 = Pipeline([
        ("select_numeric", DataFrameSelector(["AgeBucket", "Fare", "RelativesOnboard"])),
        ("imputer", SimpleImputer(strategy="median")),
    ])

The full preprocessing pipeline changes accordingly.

preprocess_pipeline2 = FeatureUnion(transformer_list=[
        ("num_pipeline2", num_pipeline2),
        ("cat_pipeline", cat_pipeline),
    ])

We run the data through the pipeline to preprocess it.

X_train2 = preprocess_pipeline2.fit_transform(train_data)
X_train2

array([[15.    ,  7.25  ,  1.    , ...,  0.    ,  0.    ,  1.    ],
       [30.    , 71.2833,  1.    , ...,  1.    ,  0.    ,  0.    ],
       [15.    ,  7.925 ,  0.    , ...,  0.    ,  0.    ,  1.    ],
       ...,
       [15.    , 23.45  ,  3.    , ...,  0.    ,  0.    ,  1.    ],
       [15.    , 30.    ,  0.    , ...,  1.    ,  0.    ,  0.    ],
       [30.    ,  7.75  ,  0.    , ...,  0.    ,  1.    ,  0.    ]])

Training and evaluation

We train a random forest model and cross-validate it.

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_scores = cross_val_score(forest_clf, X_train2, y_train, cv=10)
forest_scores.mean()

0.812621722846442

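Accuracy is back up to about 81%, an improvement over the 80% of the previous step, although it is essentially the same as the 0.8126 of the original random forest.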

Predicting on the test set

Add the RelativesOnboard and AgeBucket features introduced above to the test set.

test_data["RelativesOnboard"] = test_data["SibSp"] + test_data["Parch"]
test_data["AgeBucket"] = test_data["Age"] // 15 * 15

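Run the test set through the updated preprocessing pipeline. The pipeline was already fitted on the training set, so we only call transform here; calling fit_transform would refit the imputers and the encoder on the test data.

X_test = preprocess_pipeline2.transform(test_data)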
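
A fitted preprocessing step stores what it learned from the training data (for a median imputer, the median), and transform reuses it. A minimal sketch with toy numbers, made up for illustration, showing how the result changes if the imputer is refit on the test data instead:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-column data, made up for illustration.
X_train_toy = np.array([[10.0], [20.0], [30.0], [np.nan]])
X_test_toy = np.array([[100.0], [np.nan], [300.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train_toy)  # learns the training median (20.0)

# transform reuses the training median for the missing test value.
test_filled = imputer.transform(X_test_toy)

# fit_transform on the test data would instead learn a new median from it (200.0).
test_refit = SimpleImputer(strategy="median").fit_transform(X_test_toy)

print(test_filled[1, 0], test_refit[1, 0])
```

The model was trained on features prepared with the training statistics, so the test features should be prepared with those same statistics.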

Train the model on the full training set and predict on the test set.

forest_clf.fit(X_train2,  y_train)
submit = forest_clf.predict(X_test)
submit

array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0])

The submission must be a CSV file, so we convert the predictions.

First, convert the NumPy array to a DataFrame.

submit_DataFrame = pd.DataFrame(submit)
submit_DataFrame

	0
0	0
1	0
2	0
3	0
4	1
...	...
413	0
414	1
415	0
416	0
417	0

418 rows × 1 columns

To match the submission format, create a new DataFrame that starts with the PassengerId column.

submit_DataFrame1 = pd.DataFrame(test_data["PassengerId"])
submit_DataFrame1

	PassengerId
0	892
1	893
2	894
3	895
4	896
...	...
413	1305
414	1306
415	1307
416	1308
417	1309

418 rows × 1 columns

Add the predictions for the test set.

submit_DataFrame1["Survived"] = submit_DataFrame[0]
submit_DataFrame1 

	PassengerId	Survived
0	892	0
1	893	0
2	894	0
3	895	0
4	896	1
...	...	...
413	1305	0
414	1306	1
415	1307	0
416	1308	0
417	1309	0

418 rows × 2 columns

To match the format, set PassengerId as the index.

submit_DataFrame1.set_index('PassengerId', inplace=True)
submit_DataFrame1

	Survived
PassengerId
892	0
893	0
894	0
895	0
896	1
...	...
1305	0
1306	1
1307	0
1308	0
1309	0

418 rows × 1 columns

Finally, write the DataFrame to a CSV file.

submit_DataFrame1.to_csv('submit1.csv',sep=',', na_rep='NaN')
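
The same two-column file can also be built in a single step. A sketch with toy stand-ins (made up) for `test_data["PassengerId"]` and the prediction array, offered as a shorter alternative rather than the notebook's method:

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made up) for test_data["PassengerId"] and the predictions.
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = np.array([0, 0, 1])

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions,
})

# With no path, to_csv returns the CSV text; index=False omits the row index.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name such as 'submit1.csv' as the first argument writes the same content to disk.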