diff --git a/5. ML_tutorial/.ipynb_checkpoints/ML_tutorial-checkpoint.ipynb b/5. ML_tutorial/.ipynb_checkpoints/ML_tutorial-checkpoint.ipynb
new file mode 100644
index 0000000..4be1216
--- /dev/null
+++ b/5. ML_tutorial/.ipynb_checkpoints/ML_tutorial-checkpoint.ipynb
@@ -0,0 +1,1847 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
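+ "# The ML package\n",
+ "MLlib operates only on RDDs, whereas ML operates on DataFrames.\n",
+ "- Getting to know transformers, estimators, and pipelines\n",
+ "- Predicting infant survival with models from the ML package\n",
+ "- Evaluating model performance\n",
+ "- Hyperparameter tuning\n",
+ "- Using other machine learning models in the package"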
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Overview of the ML package\n",
+ "- At the top level, the package exposes three abstract classes: Transformer, Estimator, and Pipeline."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
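+ "### Transformers\n",
+ "A transformer is a class that transforms data, typically by appending a new column to the DataFrame.\n",
+ "- Every class derived from the abstract Transformer class must implement the transform() method.\n",
+ "- transform() takes the DataFrame to be transformed as its first parameter.\n",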
+ "---\n",
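+ "**Commonly used classes**\n",
+ "> - **Binarizer** : converts a continuous variable into a binary one using a threshold\n",
+ "> - **Bucketizer** : splits a continuous variable into a number of ranges (binning) based on a given list of thresholds\n",
+ "> - **ChiSqSelector** : selects a given number of categorical features using a chi-square test. It works in two steps: fit() computes the test statistics and returns a ChiSqSelectorModel, whose transform() keeps the selected features\n",
+ "> - **CountVectorizer** : vectorizes tokenized text into term counts. It also works in two steps: fit() returns a CountVectorizerModel\n",
+ "> - **DCT (Discrete Cosine Transform)** : takes a real-valued vector and returns a vector of the same length expressed as a sum of cosine functions oscillating at different frequencies. Useful for extracting underlying frequencies in the data or for data compression\n",
+ "> - **ElementwiseProduct** : returns the element-wise product of the input vector and the vector passed as the scalingVec parameter\n",
+ "> - **HashingTF** : takes a list of tokenized text and returns a vector of counts, using the hashing trick\n",
+ "> - **IDF** : computes the inverse document frequency for a given list of documents. The documents must already be represented as vectors, for example by HashingTF or CountVectorizer\n",
+ "> - **StringIndexer** : maps every distinct string value in a column to a numeric index\n",
+ "> - **IndexToString** : the complement of StringIndexer; uses the encoding from a StringIndexerModel to map indices back to the original string values\n",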
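+ "> - **MaxAbsScaler** : rescales the data to the range [-1, 1]\n",
+ "> - **MinMaxScaler** : rescales the data to the range [0, 1]\n",
+ "> - **StandardScaler** : standardizes a feature to unit standard deviation and, optionally, zero mean\n",
+ "> - **NGram** : takes tokenized text and returns its n-grams (sequences of n consecutive tokens)\n",
+ "> - **Normalizer** : scales each vector to unit p-norm (for example L1 or L2; the default is L2)\n",
+ "> - **OneHotEncoder** : encodes a categorical column as a column of binary vectors\n",
+ "> - **PCA** : reduces the dimensionality of the data using principal component analysis\n",
+ "> - **PolynomialExpansion** : performs a polynomial expansion of a vector\n",
+ "> - **QuantileDiscretizer** : similar to Bucketizer, but instead of the splits parameter it takes numBuckets and derives the split points from the data\n",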
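+ "> - **RegexTokenizer** : a string tokenizer that splits text using a regular expression\n",
+ "> - **RFormula** : builds the feature and label columns from an R-style formula\n",
+ "> - **SQLTransformer** : like RFormula, but uses SQL syntax instead of R syntax\n",
+ "> - **StopWordsRemover** : removes stop words from tokenized text\n",
+ "> - **Tokenizer** : converts a string to lowercase and splits it on whitespace\n",
+ "> - **VectorAssembler** : combines multiple numeric (or vector) columns into a single vector column\n",
+ "> - **VectorIndexer** : indexes the categorical columns within a vector. It works column by column: it selects the distinct values in each column, sorts them, and returns the index of each value from the resulting map instead of the original value\n",
+ "> - **VectorSlicer** : works on a feature vector, dense or sparse; given a list of indices, it extracts the corresponding values from the vector\n",
+ "> - **Word2Vec** : takes a sentence (a sequence of words) as input and transforms it into a map of the form {string, vector}"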
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### VectorAssembler example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:17.688632Z",
+ "start_time": "2018-03-06T13:30:14.646913Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(a=12, b=10, c=3), Row(a=1, b=4, c=2)]"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = spark.createDataFrame(\n",
+ " [(12, 10, 3), (1, 4, 2)],\n",
+ " ['a', 'b', 'c']\n",
+ ")\n",
+ "df.take(2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:19.129832Z",
+ "start_time": "2018-03-06T13:30:18.464188Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(features=DenseVector([12.0, 10.0, 3.0])),\n",
+ " Row(features=DenseVector([1.0, 4.0, 2.0]))]"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pyspark.ml.feature as ft\n",
+ "ft.VectorAssembler(inputCols = ['a','b','c'], outputCol='features').transform(df).select('features').collect()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
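+ "## Estimators\n",
+ "An estimator is an abstract class for statistical models that make predictions or classify observations.\n",
+ "To derive from the abstract Estimator class, a new model must implement the fit() method, which fits the model to the data in a DataFrame using default or user-supplied parameters.\n",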
+ "\n",
+ "### Classification models\n",
+ "- LogisticRegression\n",
+ "- DecisionTreeClassifier\n",
+ "- GBTClassifier\n",
+ "- RandomForestClassifier\n",
+ "- NaiveBayes\n",
+ "- MultilayerPerceptronClassifier\n",
+ "- OneVsRest\n",
+ "\n",
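+ "### Regression models\n",
+ "- AFTSurvivalRegression : an Accelerated Failure Time (AFT) survival regression model\n",
+ "- DecisionTreeRegressor\n",
+ "- GBTRegressor\n",
+ "- GeneralizedLinearRegression : supports error distributions other than the Gaussian, such as gamma and Poisson\n",
+ "- IsotonicRegression : fits a monotonic function to the data, so no linearity assumption is needed\n",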
+ "- LinearRegression\n",
+ "- RandomForestRegressor\n",
+ "\n",
+ "### Clustering models\n",
+ "- BisectingKMeans\n",
+ "- KMeans\n",
+ "- GaussianMixture"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
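+ "## Pipelines\n",
+ "A pipeline chains several distinct stages into one workflow, and it may consist of transformers only. When fit() is called on a Pipeline object, all stages are executed in the order given in the stages parameter, which is a list of Transformer and Estimator objects. The pipeline's fit() calls transform() on each transformer and fit() on each estimator.\n",
+ "\n",
+ "Normally the output of one stage becomes the input of the next. Every class derived from the abstract Transformer or Estimator class must implement getOutputCol(), which returns the value of the outputCol parameter specified when the object was created."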
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Predicting infant survival"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:22.330648Z",
+ "start_time": "2018-03-06T13:30:22.202322Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.sql.types as typ\n",
+ "\n",
+ "labels = [\n",
+ " ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()),\n",
+ " ('BIRTH_PLACE', typ.StringType()),\n",
+ " ('MOTHER_AGE_YEARS', typ.IntegerType()),\n",
+ " ('FATHER_COMBINED_AGE', typ.IntegerType()),\n",
+ " ('CIG_BEFORE', typ.IntegerType()),\n",
+ " ('CIG_1_TRI', typ.IntegerType()),\n",
+ " ('CIG_2_TRI', typ.IntegerType()),\n",
+ " ('CIG_3_TRI', typ.IntegerType()),\n",
+ " ('MOTHER_HEIGHT_IN', typ.IntegerType()),\n",
+ " ('MOTHER_PRE_WEIGHT', typ.IntegerType()),\n",
+ " ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),\n",
+ " ('MOTHER_WEIGHT_GAIN', typ.IntegerType()),\n",
+ " ('DIABETES_PRE', typ.IntegerType()),\n",
+ " ('DIABETES_GEST', typ.IntegerType()),\n",
+ " ('HYP_TENS_PRE', typ.IntegerType()),\n",
+ " ('HYP_TENS_GEST', typ.IntegerType()),\n",
+ " ('PREV_BIRTH_PRETERM', typ.IntegerType())\n",
+ "]\n",
+ "\n",
+ "schema = typ.StructType([\n",
+ " typ.StructField(e[0], e[1], False) for e in labels\n",
+ "])\n",
+ "\n",
+ "births = spark.read.csv('births_transformed.csv.gz', \n",
+ " header=True, \n",
+ " schema=schema)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:23.113549Z",
+ "start_time": "2018-03-06T13:30:23.108778Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "root\n",
+ " |-- INFANT_ALIVE_AT_REPORT: integer (nullable = true)\n",
+ " |-- BIRTH_PLACE: string (nullable = true)\n",
+ " |-- MOTHER_AGE_YEARS: integer (nullable = true)\n",
+ " |-- FATHER_COMBINED_AGE: integer (nullable = true)\n",
+ " |-- CIG_BEFORE: integer (nullable = true)\n",
+ " |-- CIG_1_TRI: integer (nullable = true)\n",
+ " |-- CIG_2_TRI: integer (nullable = true)\n",
+ " |-- CIG_3_TRI: integer (nullable = true)\n",
+ " |-- MOTHER_HEIGHT_IN: integer (nullable = true)\n",
+ " |-- MOTHER_PRE_WEIGHT: integer (nullable = true)\n",
+ " |-- MOTHER_DELIVERY_WEIGHT: integer (nullable = true)\n",
+ " |-- MOTHER_WEIGHT_GAIN: integer (nullable = true)\n",
+ " |-- DIABETES_PRE: integer (nullable = true)\n",
+ " |-- DIABETES_GEST: integer (nullable = true)\n",
+ " |-- HYP_TENS_PRE: integer (nullable = true)\n",
+ " |-- HYP_TENS_GEST: integer (nullable = true)\n",
+ " |-- PREV_BIRTH_PRETERM: integer (nullable = true)\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "births.printSchema()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:24.103923Z",
+ "start_time": "2018-03-06T13:30:23.770186Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+----------------------+-----------+----------------+-------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+\n",
+ "|INFANT_ALIVE_AT_REPORT|BIRTH_PLACE|MOTHER_AGE_YEARS|FATHER_COMBINED_AGE|CIG_BEFORE|CIG_1_TRI|CIG_2_TRI|CIG_3_TRI|MOTHER_HEIGHT_IN|MOTHER_PRE_WEIGHT|MOTHER_DELIVERY_WEIGHT|MOTHER_WEIGHT_GAIN|DIABETES_PRE|DIABETES_GEST|HYP_TENS_PRE|HYP_TENS_GEST|PREV_BIRTH_PRETERM|\n",
+ "+----------------------+-----------+----------------+-------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+\n",
+ "| 0| 1| 29| 99| 0| 0| 0| 0| 99| 999| 999| 99| 0| 0| 0| 0| 0|\n",
+ "| 0| 1| 22| 29| 0| 0| 0| 0| 65| 180| 198| 18| 0| 0| 0| 0| 0|\n",
+ "| 0| 1| 38| 40| 0| 0| 0| 0| 63| 155| 167| 12| 0| 0| 0| 0| 0|\n",
+ "+----------------------+-----------+----------------+-------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+\n",
+ "only showing top 3 rows\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "births.show(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
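+ "### Creating transformers\n",
+ "Statistical models can only operate on numeric data, so the BIRTH_PLACE column, which is read as a string, must first be cast to a numeric type."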
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:24.937242Z",
+ "start_time": "2018-03-06T13:30:24.912525Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "root\n",
+ " |-- INFANT_ALIVE_AT_REPORT: integer (nullable = true)\n",
+ " |-- BIRTH_PLACE: string (nullable = true)\n",
+ " |-- MOTHER_AGE_YEARS: integer (nullable = true)\n",
+ " |-- FATHER_COMBINED_AGE: integer (nullable = true)\n",
+ " |-- CIG_BEFORE: integer (nullable = true)\n",
+ " |-- CIG_1_TRI: integer (nullable = true)\n",
+ " |-- CIG_2_TRI: integer (nullable = true)\n",
+ " |-- CIG_3_TRI: integer (nullable = true)\n",
+ " |-- MOTHER_HEIGHT_IN: integer (nullable = true)\n",
+ " |-- MOTHER_PRE_WEIGHT: integer (nullable = true)\n",
+ " |-- MOTHER_DELIVERY_WEIGHT: integer (nullable = true)\n",
+ " |-- MOTHER_WEIGHT_GAIN: integer (nullable = true)\n",
+ " |-- DIABETES_PRE: integer (nullable = true)\n",
+ " |-- DIABETES_GEST: integer (nullable = true)\n",
+ " |-- HYP_TENS_PRE: integer (nullable = true)\n",
+ " |-- HYP_TENS_GEST: integer (nullable = true)\n",
+ " |-- PREV_BIRTH_PRETERM: integer (nullable = true)\n",
+ " |-- BIRTH_PLACE_INT: integer (nullable = true)\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import pyspark.ml.feature as ft\n",
+ "\n",
+ "# Cast BIRTH_PLACE from string to integer\n",
+ "births = births.withColumn('BIRTH_PLACE_INT', births['BIRTH_PLACE'].cast(typ.IntegerType()))\n",
+ "births.printSchema()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:25.437313Z",
+ "start_time": "2018-03-06T13:30:25.425803Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'BIRTH_PLACE_VEC'"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Create the transformer\n",
+ "encoder = ft.OneHotEncoder(inputCol='BIRTH_PLACE_INT', outputCol='BIRTH_PLACE_VEC')\n",
+ "encoder.getOutputCol()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:25.917886Z",
+ "start_time": "2018-03-06T13:30:25.907536Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "VectorAssembler_4761ab090823489bcc96"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "featureCreator = ft.VectorAssembler(\n",
+ " inputCols = [col[0] for col in labels[2:]] + [encoder.getOutputCol()],\n",
+ " outputCol = 'features'\n",
+ ")\n",
+ "featureCreator"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
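+ "The inputCols parameter passed to VectorAssembler is the list of all columns that are combined to form outputCol. The list takes the encoded column's name from encoder.getOutputCol() instead of hard-coding it, so if the encoder's output column is renamed, re-running this cell picks up the new name without editing inputCols by hand."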
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Creating an estimator\n",
+ "We use a logistic regression model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:27.276459Z",
+ "start_time": "2018-03-06T13:30:27.265549Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.ml.classification as cl"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:28.046627Z",
+ "start_time": "2018-03-06T13:30:28.015353Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "logistic = cl.LogisticRegression(maxIter=10, regParam=0.01, labelCol='INFANT_ALIVE_AT_REPORT')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Creating a pipeline"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:28.940556Z",
+ "start_time": "2018-03-06T13:30:28.937265Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from pyspark.ml import Pipeline\n",
+ "\n",
+ "pipeline = Pipeline(stages=[\n",
+ " encoder,\n",
+ " featureCreator,\n",
+ " logistic\n",
+ "])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Fitting the model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:33.264531Z",
+ "start_time": "2018-03-06T13:30:29.707794Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "births_train, births_test = births.randomSplit([0.7, 0.3], seed=666)\n",
+ "\n",
+ "model = pipeline.fit(births_train)\n",
+ "test_model = model.transform(births_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
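+ "The births_train dataset is first passed to the encoder. The DataFrame produced by the encoder is passed to featureCreator, which creates the 'features' column. Finally, the output of that stage is passed to the logistic regression, which fits the final model.\n",
+ "fit() returns a PipelineModel object that can be used for prediction. Predictions are produced by passing the test dataset created earlier to its transform() method."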
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:34.687530Z",
+ "start_time": "2018-03-06T13:30:34.190260Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=13, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=66, MOTHER_PRE_WEIGHT=133, MOTHER_DELIVERY_WEIGHT=135, MOTHER_WEIGHT_GAIN=2, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVector(24, {0: 13.0, 1: 99.0, 6: 66.0, 7: 133.0, 8: 135.0, 9: 2.0, 16: 1.0}), rawPrediction=DenseVector([1.0545, -1.0545]), probability=DenseVector([0.7416, 0.2584]), prediction=0.0)]"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "test_model.take(1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Evaluating model performance\n",
+ "- test_model.take()\n",
+ "> The evaluator reads the DenseVector stored in the probability column"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:35.584856Z",
+ "start_time": "2018-03-06T13:30:35.572678Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.ml.evaluation as ev\n",
+ "\n",
+ "evaluator = ev.BinaryClassificationEvaluator(\n",
+ " rawPredictionCol = 'probability',\n",
+ " labelCol = 'INFANT_ALIVE_AT_REPORT'\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:38.617925Z",
+ "start_time": "2018-03-06T13:30:36.124386Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.7405439747919526\n",
+ "0.7152348988715325\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(evaluator.evaluate(test_model, {evaluator.metricName : 'areaUnderROC'}))\n",
+ "print(evaluator.evaluate(test_model, {evaluator.metricName : 'areaUnderPR'}))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Saving the model\n",
+ "Save the **pipeline definition**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:40.075596Z",
+ "start_time": "2018-03-06T13:30:39.758183Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "pipelinePath = './infant_oneHotEncoder_Logistic_Pipeline'\n",
+ "pipeline.write().overwrite().save(pipelinePath)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:40.751333Z",
+ "start_time": "2018-03-06T13:30:40.639762Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Spark_ML.ipynb\t\t infant_oneHotEncoder_Logistic_Pipeline\r\n",
+ "births_transformed.csv.gz infant_oneHotEncoder_Logistic_PipelineModel\r\n",
+ "derby.log\t\t metastore_db\r\n"
+ ]
+ }
+ ],
+ "source": [
+ "!ls"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:43.560122Z",
+ "start_time": "2018-03-06T13:30:41.341121Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=13, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=66, MOTHER_PRE_WEIGHT=133, MOTHER_DELIVERY_WEIGHT=135, MOTHER_WEIGHT_GAIN=2, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVector(24, {0: 13.0, 1: 99.0, 6: 66.0, 7: 133.0, 8: 135.0, 9: 2.0, 16: 1.0}), rawPrediction=DenseVector([1.0545, -1.0545]), probability=DenseVector([0.7416, 0.2584]), prediction=0.0)]"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "loadedPipeline = Pipeline.load(pipelinePath)\n",
+ "loadedPipeline.fit(births_train).transform(births_test).take(1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:45.733936Z",
+ "start_time": "2018-03-06T13:30:44.217113Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from pyspark.ml import PipelineModel\n",
+ "\n",
+ "modelPath = './infant_oneHotEncoder_Logistic_PipelineModel'\n",
+ "model.write().overwrite().save(modelPath)\n",
+ "\n",
+ "loadedPipeModel = PipelineModel.load(modelPath)\n",
+ "test_loadedModel = loadedPipeModel.transform(births_test)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Hyperparameter optimization\n",
+ "We use grid search, built with a ParamGridBuilder object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:46.533063Z",
+ "start_time": "2018-03-06T13:30:46.528995Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.ml.tuning as tune"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:47.058555Z",
+ "start_time": "2018-03-06T13:30:47.043831Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "logistic = cl.LogisticRegression(\n",
+ " labelCol = 'INFANT_ALIVE_AT_REPORT'\n",
+ ")\n",
+ "\n",
+ "grid = tune.ParamGridBuilder().addGrid(logistic.maxIter, [2, 10, 50]).addGrid(logistic.regParam, [0.01, 0.1, 0.3]).build()\n",
+ "\n",
+ "evaluator = ev.BinaryClassificationEvaluator(rawPredictionCol='probability', labelCol='INFANT_ALIVE_AT_REPORT')\n",
+ "\n",
+ "cv = tune.CrossValidator(estimator=logistic, estimatorParamMaps=grid, evaluator=evaluator)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:26.114944Z",
+ "start_time": "2018-03-06T13:30:47.815001Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "pipeline= Pipeline(stages=[encoder, featureCreator])\n",
+ "\n",
+ "data_transformer = pipeline.fit(births_train)\n",
+ "\n",
+ "# Build the pipeline as before, but use it only to transform the data\n",
+ "# The difference is that the CrossValidator (cv) defined above fits the model\n",
+ "# Run the cross-validation\n",
+ "cv_model = cv.fit(dataset=data_transformer.transform(births_train))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:29.747993Z",
+ "start_time": "2018-03-06T13:31:28.589118Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.7404526641072416\n",
+ "0.7157767684747429\n"
+ ]
+ }
+ ],
+ "source": [
+ "data_test = data_transformer \\\n",
+ " .transform(births_test)\n",
+ "results = cv_model.transform(data_test)\n",
+ "\n",
+ "print(evaluator.evaluate(results, \n",
+ " {evaluator.metricName: 'areaUnderROC'}))\n",
+ "print(evaluator.evaluate(results, \n",
+ " {evaluator.metricName: 'areaUnderPR'}))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": true
+ },
+ "source": [
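+ "Compared with the earlier model, the grid-searched model performs almost identically: areaUnderROC is marginally lower (0.74045 vs. 0.74054) and areaUnderPR is marginally higher (0.71578 vs. 0.71523).\n",
+ "\n",
+ "Let's find the set of hyperparameters that gives the best performance."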
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:30.750863Z",
+ "start_time": "2018-03-06T13:31:30.742933Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "([{'maxIter': 50}, {'regParam': 0.01}], 0.738652833807851)"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "results = [\n",
+ " (\n",
+ " [\n",
+ " {key.name : paramValue}\n",
+ " for key, paramValue in zip(params.keys(), params.values())\n",
+ " ], metric\n",
+ " )\n",
+ " for params, metric\n",
+ " in zip(\n",
+ " cv_model.getEstimatorParamMaps(),\n",
+ " cv_model.avgMetrics\n",
+ " )\n",
+ "]\n",
+ "\n",
+ "sorted(results, key=lambda el: el[1], reverse=True)[0]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Train/validation split\n",
+ "To select the best model, TrainValidationSplit splits the input dataset into two subsets: training and validation.\n",
+ "\n",
+ "We also use ChiSqSelector to keep only the top features."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:32.252416Z",
+ "start_time": "2018-03-06T13:31:32.240831Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "selector = ft.ChiSqSelector(numTopFeatures=5, \n",
+ " featuresCol=featureCreator.getOutputCol(), \n",
+ " outputCol='selectedFeatures', \n",
+ " labelCol='INFANT_ALIVE_AT_REPORT'\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "numTopFeatures specifies the number of features to return. The selector is defined after featureCreator so that featureCreator.getOutputCol() can be called."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:35.009781Z",
+ "start_time": "2018-03-06T13:31:33.256148Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "logistic = cl.LogisticRegression(labelCol='INFANT_ALIVE_AT_REPORT', featuresCol='selectedFeatures')\n",
+ "pipeline = Pipeline(stages=[encoder, featureCreator, selector])\n",
+ "data_transformer = pipeline.fit(births_train)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A TrainValidationSplit object is created in the same way as a CrossValidator."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:36.200307Z",
+ "start_time": "2018-03-06T13:31:36.197153Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "tvs = tune.TrainValidationSplit(estimator=logistic, \n",
+ " estimatorParamMaps=grid, # the grid defined above\n",
+ " evaluator=evaluator # the evaluator defined together with the grid\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:53.480989Z",
+ "start_time": "2018-03-06T13:31:36.857454Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.7294296314442145\n",
+ "0.7037759446410553\n"
+ ]
+ }
+ ],
+ "source": [
+ "# data_transformer is the fitted pipeline (encoder, featureCreator, selector)\n",
+ "tvs_model = tvs.fit(data_transformer.transform(births_train))\n",
+ "data_test = data_transformer.transform(births_test)\n",
+ "results = tvs_model.transform(data_test)\n",
+ "\n",
+ "print(evaluator.evaluate(results, {evaluator.metricName: 'areaUnderROC'}))\n",
+ "print(evaluator.evaluate(results, {evaluator.metricName: 'areaUnderPR'}))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The model that uses fewer features performs somewhat worse (areaUnderROC of 0.729 vs. 0.740)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Other features of PySpark ML"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:55.805598Z",
+ "start_time": "2018-03-06T13:31:55.787103Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "text_data = spark.createDataFrame([\n",
+ " ['''Machine learning can be applied to a wide variety \n",
+ " of data types, such as vectors, text, images, and \n",
+ " structured data. This API adopts the DataFrame from \n",
+ " Spark SQL in order to support a variety of data types.'''],\n",
+ " ['''DataFrame supports many basic and structured types; \n",
+ " see the Spark SQL datatype reference for a list of \n",
+ " supported types. In addition to the types listed in \n",
+ " the Spark SQL guide, DataFrame can use ML Vector types.'''],\n",
+ " ['''A DataFrame can be created either implicitly or \n",
+ " explicitly from a regular RDD. See the code examples \n",
+ " below and the Spark SQL programming guide for examples.'''],\n",
+ " ['''Columns in a DataFrame are named. The code examples \n",
+ " below use names such as \"text,\" \"features,\" and \"label.\"''']\n",
+ "], ['input'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:56.655512Z",
+ "start_time": "2018-03-06T13:31:56.573300Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+--------------------+\n",
+ "| input|\n",
+ "+--------------------+\n",
+ "|Machine learning ...|\n",
+ "|DataFrame support...|\n",
+ "|A DataFrame can b...|\n",
+ "|Columns in a Data...|\n",
+ "+--------------------+\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "text_data.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Create a DataFrame with a single column. We want to split the sentences in each row into words. RegexTokenizer is used so that a custom split pattern can be specified."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:57.572582Z",
+ "start_time": "2018-03-06T13:31:57.564329Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "tokenizer = ft.RegexTokenizer(inputCol='input', outputCol='input_arr', pattern='\\s+|[,.\\\"]')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:58.352732Z",
+ "start_time": "2018-03-06T13:31:58.253102Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(input_arr=['machine', 'learning', 'can', 'be', 'applied', 'to', 'a', 'wide', 'variety', 'of', 'data', 'types', 'such', 'as', 'vectors', 'text', 'images', 'and', 'structured', 'data', 'this', 'api', 'adopts', 'the', 'dataframe', 'from', 'spark', 'sql', 'in', 'order', 'to', 'support', 'a', 'variety', 'of', 'data', 'types'])]"
+ ]
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tok = tokenizer.transform(text_data).select('input_arr')\n",
+ "tok.take(1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, remove the stop words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:59.609604Z",
+ "start_time": "2018-03-06T13:31:59.411908Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(input_stop=['machine', 'learning', 'applied', 'wide', 'variety', 'data', 'types', 'vectors', 'text', 'images', 'structured', 'data', 'api', 'adopts', 'dataframe', 'spark', 'sql', 'order', 'support', 'variety', 'data', 'types'])]"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "stopwords = ft.StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol='input_stop')\n",
+ "stopwords.transform(tok).select('input_stop').take(1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now set up the NGram transformer and a pipeline."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:00.522282Z",
+ "start_time": "2018-03-06T13:32:00.513151Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "ngram = ft.NGram(n=2, inputCol=stopwords.getOutputCol(), outputCol='NGrams')\n",
+ "pipeline = Pipeline(stages=[tokenizer, stopwords, ngram])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:01.255262Z",
+ "start_time": "2018-03-06T13:32:01.117080Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(NGrams=['machine learning', 'learning applied', 'applied wide', 'wide variety', 'variety data', 'data types', 'types vectors', 'vectors text', 'text images', 'images structured', 'structured data', 'data api', 'api adopts', 'adopts dataframe', 'dataframe spark', 'spark sql', 'sql order', 'order support', 'support variety', 'variety data', 'data types'])]"
+ ]
+ },
+ "execution_count": 35,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "data_ngram = pipeline.fit(text_data).transform(text_data)\n",
+ "data_ngram.select('NGrams').take(1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
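+ "### Discretizing continuous variables\n",
+ "We often deal with continuous features that are highly non-linear and hard to fit in a model with only one coefficient.\n",
+ "In such a situation it is hard to explain the relationship between the feature and the target with a single coefficient. Sometimes it is very useful to band the values into discrete buckets instead.\n",
+ "\n",
+ "Let's create some example data."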
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:02.266943Z",
+ "start_time": "2018-03-06T13:32:02.181314Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+------------------+\n",
+ "| continuous_var|\n",
+ "+------------------+\n",
+ "| 20.1234|\n",
+ "|20.132344452369832|\n",
+ "|20.159087064491775|\n",
+ "+------------------+\n",
+ "only showing top 3 rows\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "x = np.arange(0, 100)\n",
+ "x = x / 100.0 * np.pi * 4\n",
+ "y = x * np.sin(x / 1.764) + 20.1234\n",
+ "\n",
+ "schema = typ.StructType([\n",
+ " typ.StructField('continuous_var', typ.DoubleType(), False)\n",
+ "])\n",
+ "\n",
+ "data = spark.createDataFrame([[float(e), ] for e in y], schema=schema)\n",
+ "data.show(3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:04.204286Z",
+ "start_time": "2018-03-06T13:32:02.968489Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(discretized=0.0, avg(continuous_var)=12.314360733007913),\n",
+ " Row(discretized=1.0, avg(continuous_var)=16.046244793347473),\n",
+ " Row(discretized=2.0, avg(continuous_var)=20.250799478352594),\n",
+ " Row(discretized=3.0, avg(continuous_var)=22.040988218437327),\n",
+ " Row(discretized=4.0, avg(continuous_var)=24.264824657002862)]"
+ ]
+ },
+ "execution_count": 37,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "discretizer = ft.QuantileDiscretizer(numBuckets=5, inputCol='continuous_var', outputCol='discretized')\n",
+ "\n",
+ "data_discretized = discretizer.fit(data).transform(data)\n",
+ "data_discretized.groupby('discretized').mean('continuous_var').sort('discretized').collect()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Standardizing continuous variables"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:04.870707Z",
+ "start_time": "2018-03-06T13:32:04.861143Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "vectorizer = ft.VectorAssembler(inputCols=['continuous_var'], outputCol='continuous_vec')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:05.752991Z",
+ "start_time": "2018-03-06T13:32:05.538350Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+------------------+--------------------+--------------------+\n",
+ "| continuous_var| continuous_vec| normalized|\n",
+ "+------------------+--------------------+--------------------+\n",
+ "| 20.1234| [20.1234]|[0.23429139554502...|\n",
+ "|20.132344452369832|[20.132344452369832]|[0.23630959828688...|\n",
+ "|20.159087064491775|[20.159087064491775]| [0.242343731051792]|\n",
+ "+------------------+--------------------+--------------------+\n",
+ "only showing top 3 rows\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "normalizer = ft.StandardScaler(inputCol=vectorizer.getOutputCol(), outputCol='normalized',withMean=True, withStd=True)\n",
+ "\n",
+ "pipeline = Pipeline(stages=[vectorizer, normalizer])\n",
+ "data_standardized = pipeline.fit(data).transform(data)\n",
+ "\n",
+ "data_standardized.show(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Classification models"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:06.631664Z",
+ "start_time": "2018-03-06T13:32:06.614433Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.sql.functions as func\n",
+ "\n",
+ "births = births.withColumn('INFANT_ALIVE_AT_REPORT', func.col('INFANT_ALIVE_AT_REPORT').cast(typ.DoubleType()))\n",
+ "births_train, births_test = births.randomSplit([0.7, 0.3], seed=666)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:11.067953Z",
+ "start_time": "2018-03-06T13:32:07.361151Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "classifier = cl.RandomForestClassifier(\n",
+ "    numTrees=5,\n",
+ "    maxDepth=5,\n",
+ "    labelCol='INFANT_ALIVE_AT_REPORT'\n",
+ ")\n",
+ "\n",
+ "# Baseline approach without hyperparameter tuning\n",
+ "pipeline = Pipeline(stages=[encoder, featureCreator, classifier])\n",
+ "\n",
+ "model = pipeline.fit(births_train)\n",
+ "test = model.transform(births_test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:12.652890Z",
+ "start_time": "2018-03-06T13:32:11.758322Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.7625231306933616\n",
+ "0.7474287997552782\n"
+ ]
+ }
+ ],
+ "source": [
+ "evaluator = ev.BinaryClassificationEvaluator(\n",
+ " labelCol='INFANT_ALIVE_AT_REPORT')\n",
+ "print(evaluator.evaluate(test, \n",
+ " {evaluator.metricName: \"areaUnderROC\"}))\n",
+ "print(evaluator.evaluate(test, \n",
+ " {evaluator.metricName: \"areaUnderPR\"}))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:16.316611Z",
+ "start_time": "2018-03-06T13:32:13.405779Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.7582781726635287\n",
+ "0.7787580540118526\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Same pipeline, with a single decision tree as the final stage\n",
+ "classifier = cl.DecisionTreeClassifier(maxDepth=5, labelCol='INFANT_ALIVE_AT_REPORT')\n",
+ "pipeline = Pipeline(stages=[encoder, featureCreator, classifier])\n",
+ "\n",
+ "model = pipeline.fit(births_train)\n",
+ "test = model.transform(births_test)\n",
+ "\n",
+ "evaluator = ev.BinaryClassificationEvaluator(labelCol='INFANT_ALIVE_AT_REPORT')\n",
+ "print(evaluator.evaluate(test, \n",
+ " {evaluator.metricName: \"areaUnderROC\"}))\n",
+ "print(evaluator.evaluate(test, \n",
+ " {evaluator.metricName: \"areaUnderPR\"}))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Clustering"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:18.325669Z",
+ "start_time": "2018-03-06T13:32:17.127629Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.ml.clustering as clus\n",
+ "\n",
+ "kmeans = clus.KMeans(k = 5, featuresCol = 'features')\n",
+ "pipeline = Pipeline(stages=[encoder, featureCreator, kmeans])\n",
+ "model = pipeline.fit(births_train)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:20.227778Z",
+ "start_time": "2018-03-06T13:32:18.987601Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(prediction=1, avg(MOTHER_HEIGHT_IN)=83.91154791154791, count(1)=407),\n",
+ " Row(prediction=3, avg(MOTHER_HEIGHT_IN)=66.64658634538152, count(1)=249),\n",
+ " Row(prediction=4, avg(MOTHER_HEIGHT_IN)=64.31597357170618, count(1)=10292),\n",
+ " Row(prediction=2, avg(MOTHER_HEIGHT_IN)=67.69473684210526, count(1)=475),\n",
+ " Row(prediction=0, avg(MOTHER_HEIGHT_IN)=64.43472584856397, count(1)=2298)]"
+ ]
+ },
+ "execution_count": 45,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "test = model.transform(births_test)\n",
+ "\n",
+ "test.groupBy('prediction').agg({\n",
+ " '*': 'count',\n",
+ " 'MOTHER_HEIGHT_IN' : 'avg'\n",
+ "}).collect()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The results show that the average MOTHER_HEIGHT_IN in cluster 1 (about 83.9) differs markedly from the other clusters (roughly 64 to 68)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
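+ "### Topic Mining\n",
+ "Clustering models are not limited to numeric data. In NLP, tasks such as topic extraction use clustering to find documents that share a theme.\n",
+ "The example data consists of six documents: three describe national parks and the other three cover technology."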
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:22.029951Z",
+ "start_time": "2018-03-06T13:32:21.990328Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "text_data = spark.createDataFrame([\n",
+ " ['''To make a computer do anything, you have to write a \n",
+ " computer program. To write a computer program, you have \n",
+ " to tell the computer, step by step, exactly what you want \n",
+ " it to do. The computer then \"executes\" the program, \n",
+ " following each step mechanically, to accomplish the end \n",
+ " goal. When you are telling the computer what to do, you \n",
+ " also get to choose how it's going to do it. That's where \n",
+ " computer algorithms come in. The algorithm is the basic \n",
+ " technique used to get the job done. Let's follow an \n",
+ " example to help get an understanding of the algorithm \n",
+ " concept.'''],\n",
+ " ['''Laptop computers use batteries to run while not \n",
+ " connected to mains. When we overcharge or overheat \n",
+ " lithium ion batteries, the materials inside start to \n",
+ " break down and produce bubbles of oxygen, carbon dioxide, \n",
+ " and other gases. Pressure builds up, and the hot battery \n",
+ " swells from a rectangle into a pillow shape. Sometimes \n",
+ " the phone involved will operate afterwards. Other times \n",
+ " it will die. And occasionally—kapow! To see what's \n",
+ " happening inside the battery when it swells, the CLS team \n",
+ " used an x-ray technology called computed tomography.'''],\n",
+ " ['''This technology describes a technique where touch \n",
+ " sensors can be placed around any side of a device \n",
+ " allowing for new input sources. The patent also notes \n",
+ " that physical buttons (such as the volume controls) could \n",
+ " be replaced by these embedded touch sensors. In essence \n",
+ " Apple could drop the current buttons and move towards \n",
+ " touch-enabled areas on the device for the existing UI. It \n",
+ " could also open up areas for new UI paradigms, such as \n",
+ " using the back of the smartphone for quick scrolling or \n",
+ " page turning.'''],\n",
+ " ['''The National Park Service is a proud protector of \n",
+ " America’s lands. Preserving our land not only safeguards \n",
+ " the natural environment, but it also protects the \n",
+ " stories, cultures, and histories of our ancestors. As we \n",
+ " face the increasingly dire consequences of climate \n",
+ " change, it is imperative that we continue to expand \n",
+ " America’s protected lands under the oversight of the \n",
+ " National Park Service. Doing so combats climate change \n",
+ " and allows all American’s to visit, explore, and learn \n",
+ " from these treasured places for generations to come. It \n",
+ " is critical that President Obama acts swiftly to preserve \n",
+ " land that is at risk of external threats before the end \n",
+ " of his term as it has become blatantly clear that the \n",
+ " next administration will not hold the same value for our \n",
+ " environment over the next four years.'''],\n",
+ " ['''The National Park Foundation, the official charitable \n",
+ " partner of the National Park Service, enriches America’s \n",
+ " national parks and programs through the support of \n",
+ " private citizens, park lovers, stewards of nature, \n",
+ " history enthusiasts, and wilderness adventurers. \n",
+ " Chartered by Congress in 1967, the Foundation grew out of \n",
+ " a legacy of park protection that began over a century \n",
+ " ago, when ordinary citizens took action to establish and \n",
+ " protect our national parks. Today, the National Park \n",
+ " Foundation carries on the tradition of early park \n",
+ " advocates, big thinkers, doers and dreamers—from John \n",
+ " Muir and Ansel Adams to President Theodore Roosevelt.'''],\n",
+ " ['''Australia has over 500 national parks. Over 28 \n",
+ " million hectares of land is designated as national \n",
+ " parkland, accounting for almost four per cent of \n",
+ " Australia's land areas. In addition, a further six per \n",
+ " cent of Australia is protected and includes state \n",
+ " forests, nature parks and conservation reserves.National \n",
+ " parks are usually large areas of land that are protected \n",
+ " because they have unspoilt landscapes and a diverse \n",
+ " number of native plants and animals. This means that \n",
+ " commercial activities such as farming are prohibited and \n",
+ " human activity is strictly monitored.''']\n",
+ "], ['documents'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:22.908791Z",
+ "start_time": "2018-03-06T13:32:22.871362Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "tokenizer = ft.RegexTokenizer(\n",
+ " inputCol = 'documents',\n",
+ " outputCol = 'input_arr',\n",
+ " pattern = '\\s+|[,.\\\"]')\n",
+ "\n",
+ "stopwords = ft.StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol='input_stop')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next in the pipeline is CountVectorizer. CountVectorizer counts the words in each document and returns a count vector. The length of the vector equals the number of distinct words across all documents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:24.107521Z",
+ "start_time": "2018-03-06T13:32:23.845135Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(input_indexed=SparseVector(257, {2: 7.0, 6: 1.0, 7: 3.0, 8: 3.0, 10: 3.0, 24: 1.0, 29: 2.0, 31: 1.0, 33: 1.0, 37: 2.0, 39: 1.0, 46: 1.0, 58: 1.0, 59: 1.0, 61: 1.0, 64: 1.0, 70: 1.0, 72: 1.0, 81: 1.0, 96: 1.0, 128: 1.0, 132: 1.0, 133: 1.0, 134: 1.0, 135: 1.0, 142: 1.0, 164: 1.0, 169: 1.0, 189: 1.0, 212: 1.0, 225: 1.0, 247: 1.0, 254: 1.0})),\n",
+ " Row(input_indexed=SparseVector(257, {14: 1.0, 16: 2.0, 23: 2.0, 25: 2.0, 31: 1.0, 42: 2.0, 49: 1.0, 51: 1.0, 55: 1.0, 56: 1.0, 67: 1.0, 73: 1.0, 76: 1.0, 77: 1.0, 84: 1.0, 87: 1.0, 97: 1.0, 105: 1.0, 113: 1.0, 114: 1.0, 116: 1.0, 117: 1.0, 125: 1.0, 139: 1.0, 141: 1.0, 143: 1.0, 151: 1.0, 152: 1.0, 153: 1.0, 154: 1.0, 157: 1.0, 166: 1.0, 171: 1.0, 174: 1.0, 181: 1.0, 185: 1.0, 187: 1.0, 194: 1.0, 195: 1.0, 199: 1.0, 202: 1.0, 204: 1.0, 209: 1.0, 213: 1.0, 234: 1.0, 236: 1.0, 246: 1.0}))]"
+ ]
+ },
+ "execution_count": 48,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "stringIndexer = ft.CountVectorizer(\n",
+ " inputCol =stopwords.getOutputCol(),\n",
+ " outputCol = 'input_indexed')\n",
+ "\n",
+ "tokenized = stopwords.transform(tokenizer.transform(text_data))\n",
+ "stringIndexer.fit(tokenized).transform(tokenized).select('input_indexed').take(2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The output shows a vocabulary of 257 words, and each document is now represented as a vector of word counts. We can now predict topics, using an LDA model.\n",
+ "- k specifies how many topics to look for\n",
+ "- optimizer: either online or em"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:25.106298Z",
+ "start_time": "2018-03-06T13:32:25.093063Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "clustering = clus.LDA(k=2, optimizer='online', featuresCol=stringIndexer.getOutputCol())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:25.688409Z",
+ "start_time": "2018-03-06T13:32:25.685222Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "pipeline = Pipeline(stages=[tokenizer, stopwords, stringIndexer, clustering])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Inspect the resulting topic assignments."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:29.390061Z",
+ "start_time": "2018-03-06T13:32:26.468223Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(topicDistribution=DenseVector([0.053, 0.947])),\n",
+ " Row(topicDistribution=DenseVector([0.9776, 0.0224])),\n",
+ " Row(topicDistribution=DenseVector([0.0147, 0.9853])),\n",
+ " Row(topicDistribution=DenseVector([0.9753, 0.0247])),\n",
+ " Row(topicDistribution=DenseVector([0.9876, 0.0124])),\n",
+ " Row(topicDistribution=DenseVector([0.8183, 0.1817]))]"
+ ]
+ },
+ "execution_count": 51,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "topics = pipeline.fit(text_data).transform(text_data)\n",
+ "topics.select('topicDistribution').collect() # topicDistribution is the column added by the fitted LDA model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Regression models"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:30.399664Z",
+ "start_time": "2018-03-06T13:32:30.395721Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "features = ['MOTHER_AGE_YEARS','MOTHER_HEIGHT_IN',\n",
+ " 'MOTHER_PRE_WEIGHT','DIABETES_PRE',\n",
+ " 'DIABETES_GEST','HYP_TENS_PRE', \n",
+ " 'HYP_TENS_GEST', 'PREV_BIRTH_PRETERM',\n",
+ " 'CIG_BEFORE','CIG_1_TRI', 'CIG_2_TRI', \n",
+ " 'CIG_3_TRI'\n",
+ " ]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:30.963396Z",
+ "start_time": "2018-03-06T13:32:30.950748Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "featuresCreator = ft.VectorAssembler(inputCols=[col for col in features[1:]], outputCol='features')\n",
+ "selector = ft.ChiSqSelector(numTopFeatures=6, outputCol='selectedFeatures', labelCol='MOTHER_WEIGHT_GAIN')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 54,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:31.410976Z",
+ "start_time": "2018-03-06T13:32:31.395876Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.ml.regression as reg\n",
+ "\n",
+ "regressor = reg.GBTRegressor(maxIter=15, maxDepth = 3, labelCol = 'MOTHER_WEIGHT_GAIN')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:50.223814Z",
+ "start_time": "2018-03-06T13:32:43.894970Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "pipeline = Pipeline(stages=[\n",
+ " featuresCreator, \n",
+ " selector,\n",
+ " regressor])\n",
+ "\n",
+ "weightGain = pipeline.fit(births_train)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:33:50.694341Z",
+ "start_time": "2018-03-06T13:33:50.151492Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.48862170400240335\n"
+ ]
+ }
+ ],
+ "source": [
+ "evaluator = ev.RegressionEvaluator(\n",
+ " predictionCol='prediction',\n",
+ " labelCol = 'MOTHER_WEIGHT_GAIN'\n",
+ ")\n",
+ "\n",
+ "print(evaluator.evaluate(weightGain.transform(births_test), {evaluator.metricName:'r2'}))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
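+ "### Summary\n",
+ "This chapter looked at how to use PySpark ML, the main machine learning library of PySpark. It explained what transformers and estimators are, and used pipelines, another concept introduced by the ML library. Along the way it tried out feature extraction methods and several of the library's models."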
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python (python3_0901)",
+ "language": "python",
+ "name": "python3_0901"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.3"
+ },
+ "latex_envs": {
+ "LaTeX_envs_menu_present": true,
+ "autoclose": false,
+ "autocomplete": true,
+ "bibliofile": "biblio.bib",
+ "cite_by": "apalike",
+ "current_citInitial": 1,
+ "eqLabelWithNumbers": true,
+ "eqNumInitial": 1,
+ "hotkeys": {
+ "equation": "Ctrl-E",
+ "itemize": "Ctrl-I"
+ },
+ "labels_anchors": false,
+ "latex_user_defs": false,
+ "report_style_numbering": false,
+ "user_envs_cfg": false
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/5. ML_tutorial/ML_tutorial.ipynb b/5. ML_tutorial/ML_tutorial.ipynb
new file mode 100644
index 0000000..4be1216
--- /dev/null
+++ b/5. ML_tutorial/ML_tutorial.ipynb
@@ -0,0 +1,1847 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
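+ "# The ML package\n",
+ "MLlib operates only on RDDs, whereas ML operates on DataFrames.\n",
+ "- Setting up transformers, estimators, and pipelines\n",
+ "- Predicting infant survival with models from the ML package\n",
+ "- Evaluating model performance\n",
+ "- Hyperparameter tuning\n",
+ "- Using other machine learning models from the package"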
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Overview of the ML package\n",
+ "- At the top level, the package exposes three abstract classes: Transformer, Estimator, and Pipeline"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
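+ "### Transformers\n",
+ "A transformer is a class that transforms data, usually by appending a new column to the DataFrame.\n",
+ "- Every transformer derived from the abstract Transformer class must implement a transform() method\n",
+ "- transform() takes the DataFrame to be transformed as its first parameter\n",
+ "---\n",
+ "**Commonly used classes**\n",
+ "> - **Binarizer**: converts a continuous variable into a binary one, given a threshold\n",
+ "> - **Bucketizer**: splits a continuous variable into a number of ranges based on a given list of thresholds (binning)\n",
+ "> - **ChiSqSelector**: selects a number of features from categorical variables using a chi-square test; works in two steps (fit(), transform()), and fit() returns a ChiSqSelectorModel object\n",
+ "> - **CountVectorizer**: counts the words in tokenized text; fit() returns a CountVectorizerModel object\n",
+ "> - **DCT (Discrete Cosine Transform)**: takes a vector of real values and returns a vector of the same length, expressed as a sum of cosine functions oscillating at different frequencies. Useful for extracting the underlying frequencies in a dataset or for compressing data\n",
+ "> - **ElementwiseProduct**: returns the element-wise product of the input vector and the scalingVec parameter\n",
+ "> - **HashingTF**: a transformer that takes a list of tokenized text and returns a count vector\n",
+ "> - **IDF**: computes the inverse document frequency for a given list of documents. The documents must already be represented as vectors, for example with HashingTF or CountVectorizer\n",
+ "> - **StringIndexer**: given all the string values in a column, produces the corresponding column of indices\n",
+ "> - **IndexToString**: the inverse of StringIndexer; uses the encoding from a StringIndexerModel object to map string indices back to the original values\n",
+ "> - **MaxAbsScaler**: rescales the data to the range from -1 to 1\n",
+ "> - **MinMaxScaler**: rescales the data to the range from 0 to 1\n",
+ "> - **StandardScaler**: standardizes a feature to unit standard deviation and, optionally, zero mean\n",
+ "> - **NGram**: takes tokenized text and returns n-grams, i.e. sequences of n consecutive words\n",
+ "> - **Normalizer**: rescales each vector to unit norm using the p-norm (p=1 gives L1, and the default p=2 gives L2)\n",
+ "> - **OneHotEncoder**: encodes a categorical variable as a column of binary vectors\n",
+ "> - **PCA**: reduces the dimensionality of the data using principal component analysis\n",
+ "> - **PolynomialExpansion**: performs a polynomial expansion of a vector\n",
+ "> - **QuantileDiscretizer**: similar to Bucketizer, but instead of the splits parameter you pass the numBuckets parameter\n",
+ "> - **RegexTokenizer**: a tokenizer that splits strings using a regular expression\n",
+ "> - **RFormula**: lets you specify the feature vector and label with R-style formula syntax\n",
+ "> - **SQLTransformer**: like RFormula, but uses SQL syntax instead of R\n",
+ "> - **StopWordsRemover**: removes stop words from tokenized text\n",
+ "> - **Tokenizer**: a tokenizer that converts a string to lowercase and splits it on whitespace\n",
+ "> - **VectorAssembler**: a transformer that merges multiple numeric columns into a single vector column\n",
+ "> - **VectorIndexer**: indexes the categorical columns inside a vector. It works column by column, selecting the distinct values in each column, sorting them, and returning the index of each value from the resulting map instead of the original value\n",
+ "> - **VectorSlicer**: works on a feature vector, whether dense or sparse. Given a list of indices, it extracts the corresponding values from the feature vector\n",
+ "> - **Word2Vec**: takes a sentence (string) as input and transforms it into a map of {string, vector} form"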
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### VectorAssembler example"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:17.688632Z",
+ "start_time": "2018-03-06T13:30:14.646913Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(a=12, b=10, c=3), Row(a=1, b=4, c=2)]"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = spark.createDataFrame(\n",
+ " [(12, 10, 3), (1, 4, 2)],\n",
+ " ['a', 'b', 'c']\n",
+ ")\n",
+ "df.take(2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:19.129832Z",
+ "start_time": "2018-03-06T13:30:18.464188Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(features=DenseVector([12.0, 10.0, 3.0])),\n",
+ " Row(features=DenseVector([1.0, 4.0, 2.0]))]"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pyspark.ml.feature as ft\n",
+ "ft.VectorAssembler(inputCols = ['a','b','c'], outputCol='features').transform(df).select('features').collect()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Estimators\n",
+ "An estimator is an abstract class used to make predictions about, or classify, the observations in the data.\n",
+ "Any class derived from the abstract Estimator class must implement fit(), which trains a model on the data in a DataFrame using either default or user-supplied parameters.\n",
+ "\n",
+ "### Classification models\n",
+ "- LogisticRegression\n",
+ "- DecisionTreeClassifier\n",
+ "- GBTClassifier\n",
+ "- RandomForestClassifier\n",
+ "- NaiveBayes\n",
+ "- MultilayerPerceptronClassifier\n",
+ "- OneVsRest\n",
+ "\n",
+ "### Regression models\n",
+ "- AFTSurvivalRegression: Accelerated Failure Time (AFT) model, a parametric survival regression for censored data\n",
+ "- DecisionTreeRegressor\n",
+ "- GBTRegressor\n",
+ "- GeneralizedLinearRegression: allows error distributions other than the Gaussian, such as gamma or Poisson\n",
+ "- IsotonicRegression: fits a free-form monotonic function, so no linearity assumption is needed\n",
+ "- LinearRegression\n",
+ "- RandomForestRegressor\n",
+ "\n",
+ "### Clustering models\n",
+ "- BisectingKMeans\n",
+ "- KMeans\n",
+ "- GaussianMixture"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
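+ "## Pipelines\n",
+ "A pipeline links several distinct stages into one end-to-end process, and it may consist of transformers only. When fit() is called on a Pipeline object, the stages run in the order given in the stages parameter, which is a list of Transformer and Estimator objects. Pipeline.fit() calls transform() on each transformer and fit() on each estimator.\n",
+ "\n",
+ "Normally the output of one stage becomes the input of the next. Every class derived from the abstract Transformer or Estimator class must implement getOutputCol(), which returns the value of the outputCol parameter specified when the object was created."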
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Predicting infant survival"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:22.330648Z",
+ "start_time": "2018-03-06T13:30:22.202322Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.sql.types as typ\n",
+ "\n",
+ "labels = [\n",
+ " ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()),\n",
+ " ('BIRTH_PLACE', typ.StringType()),\n",
+ " ('MOTHER_AGE_YEARS', typ.IntegerType()),\n",
+ " ('FATHER_COMBINED_AGE', typ.IntegerType()),\n",
+ " ('CIG_BEFORE', typ.IntegerType()),\n",
+ " ('CIG_1_TRI', typ.IntegerType()),\n",
+ " ('CIG_2_TRI', typ.IntegerType()),\n",
+ " ('CIG_3_TRI', typ.IntegerType()),\n",
+ " ('MOTHER_HEIGHT_IN', typ.IntegerType()),\n",
+ " ('MOTHER_PRE_WEIGHT', typ.IntegerType()),\n",
+ " ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),\n",
+ " ('MOTHER_WEIGHT_GAIN', typ.IntegerType()),\n",
+ " ('DIABETES_PRE', typ.IntegerType()),\n",
+ " ('DIABETES_GEST', typ.IntegerType()),\n",
+ " ('HYP_TENS_PRE', typ.IntegerType()),\n",
+ " ('HYP_TENS_GEST', typ.IntegerType()),\n",
+ " ('PREV_BIRTH_PRETERM', typ.IntegerType())\n",
+ "]\n",
+ "\n",
+ "schema = typ.StructType([\n",
+ " typ.StructField(e[0], e[1], False) for e in labels\n",
+ "])\n",
+ "\n",
+ "births = spark.read.csv('births_transformed.csv.gz', \n",
+ " header=True, \n",
+ " schema=schema)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:23.113549Z",
+ "start_time": "2018-03-06T13:30:23.108778Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "root\n",
+ " |-- INFANT_ALIVE_AT_REPORT: integer (nullable = true)\n",
+ " |-- BIRTH_PLACE: string (nullable = true)\n",
+ " |-- MOTHER_AGE_YEARS: integer (nullable = true)\n",
+ " |-- FATHER_COMBINED_AGE: integer (nullable = true)\n",
+ " |-- CIG_BEFORE: integer (nullable = true)\n",
+ " |-- CIG_1_TRI: integer (nullable = true)\n",
+ " |-- CIG_2_TRI: integer (nullable = true)\n",
+ " |-- CIG_3_TRI: integer (nullable = true)\n",
+ " |-- MOTHER_HEIGHT_IN: integer (nullable = true)\n",
+ " |-- MOTHER_PRE_WEIGHT: integer (nullable = true)\n",
+ " |-- MOTHER_DELIVERY_WEIGHT: integer (nullable = true)\n",
+ " |-- MOTHER_WEIGHT_GAIN: integer (nullable = true)\n",
+ " |-- DIABETES_PRE: integer (nullable = true)\n",
+ " |-- DIABETES_GEST: integer (nullable = true)\n",
+ " |-- HYP_TENS_PRE: integer (nullable = true)\n",
+ " |-- HYP_TENS_GEST: integer (nullable = true)\n",
+ " |-- PREV_BIRTH_PRETERM: integer (nullable = true)\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "births.printSchema()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:24.103923Z",
+ "start_time": "2018-03-06T13:30:23.770186Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+----------------------+-----------+----------------+-------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+\n",
+ "|INFANT_ALIVE_AT_REPORT|BIRTH_PLACE|MOTHER_AGE_YEARS|FATHER_COMBINED_AGE|CIG_BEFORE|CIG_1_TRI|CIG_2_TRI|CIG_3_TRI|MOTHER_HEIGHT_IN|MOTHER_PRE_WEIGHT|MOTHER_DELIVERY_WEIGHT|MOTHER_WEIGHT_GAIN|DIABETES_PRE|DIABETES_GEST|HYP_TENS_PRE|HYP_TENS_GEST|PREV_BIRTH_PRETERM|\n",
+ "+----------------------+-----------+----------------+-------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+\n",
+ "| 0| 1| 29| 99| 0| 0| 0| 0| 99| 999| 999| 99| 0| 0| 0| 0| 0|\n",
+ "| 0| 1| 22| 29| 0| 0| 0| 0| 65| 180| 198| 18| 0| 0| 0| 0| 0|\n",
+ "| 0| 1| 38| 40| 0| 0| 0| 0| 63| 155| 167| 12| 0| 0| 0| 0| 0|\n",
+ "+----------------------+-----------+----------------+-------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+\n",
+ "only showing top 3 rows\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "births.show(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Creating transformers\n",
+ "Statistical models operate only on numeric data, so the column types have to be converted first."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:24.937242Z",
+ "start_time": "2018-03-06T13:30:24.912525Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "root\n",
+ " |-- INFANT_ALIVE_AT_REPORT: integer (nullable = true)\n",
+ " |-- BIRTH_PLACE: string (nullable = true)\n",
+ " |-- MOTHER_AGE_YEARS: integer (nullable = true)\n",
+ " |-- FATHER_COMBINED_AGE: integer (nullable = true)\n",
+ " |-- CIG_BEFORE: integer (nullable = true)\n",
+ " |-- CIG_1_TRI: integer (nullable = true)\n",
+ " |-- CIG_2_TRI: integer (nullable = true)\n",
+ " |-- CIG_3_TRI: integer (nullable = true)\n",
+ " |-- MOTHER_HEIGHT_IN: integer (nullable = true)\n",
+ " |-- MOTHER_PRE_WEIGHT: integer (nullable = true)\n",
+ " |-- MOTHER_DELIVERY_WEIGHT: integer (nullable = true)\n",
+ " |-- MOTHER_WEIGHT_GAIN: integer (nullable = true)\n",
+ " |-- DIABETES_PRE: integer (nullable = true)\n",
+ " |-- DIABETES_GEST: integer (nullable = true)\n",
+ " |-- HYP_TENS_PRE: integer (nullable = true)\n",
+ " |-- HYP_TENS_GEST: integer (nullable = true)\n",
+ " |-- PREV_BIRTH_PRETERM: integer (nullable = true)\n",
+ " |-- BIRTH_PLACE_INT: integer (nullable = true)\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import pyspark.ml.feature as ft\n",
+ "\n",
+ "# Cast BIRTH_PLACE from string to integer\n",
+ "births = births.withColumn('BIRTH_PLACE_INT', births['BIRTH_PLACE'].cast(typ.IntegerType()))\n",
+ "births.printSchema()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:25.437313Z",
+ "start_time": "2018-03-06T13:30:25.425803Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'BIRTH_PLACE_VEC'"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Create the transformer\n",
+ "encoder = ft.OneHotEncoder(inputCol='BIRTH_PLACE_INT', outputCol='BIRTH_PLACE_VEC')\n",
+ "encoder.getOutputCol()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:25.917886Z",
+ "start_time": "2018-03-06T13:30:25.907536Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "VectorAssembler_4761ab090823489bcc96"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "featureCreator = ft.VectorAssembler(\n",
+ " inputCols = [col[0] for col in labels[2:]] + [encoder.getOutputCol()],\n",
+ " outputCol = 'features'\n",
+ ")\n",
+ "featureCreator"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
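+ "The inputCols parameter passed to the VectorAssembler object is a list of all the columns that are combined to form outputCol. Because the encoder's output column is referenced through getOutputCol(), inputCols does not have to be edited by hand if the output column name of the encoder object changes later."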
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Creating an estimator\n",
+    "We use a logistic regression model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:27.276459Z",
+ "start_time": "2018-03-06T13:30:27.265549Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.ml.classification as cl"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:28.046627Z",
+ "start_time": "2018-03-06T13:30:28.015353Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "logistic = cl.LogisticRegression(maxIter=10, regParam=0.01, labelCol='INFANT_ALIVE_AT_REPORT')"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Creating a pipeline"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:28.940556Z",
+ "start_time": "2018-03-06T13:30:28.937265Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from pyspark.ml import Pipeline\n",
+ "\n",
+ "pipeline = Pipeline(stages=[\n",
+ " encoder,\n",
+ " featureCreator,\n",
+ " logistic\n",
+ "])"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Fitting the model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:33.264531Z",
+ "start_time": "2018-03-06T13:30:29.707794Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "births_train, births_test = births.randomSplit([0.7, 0.3], seed=666)\n",
+ "\n",
+ "model = pipeline.fit(births_train)\n",
+ "test_model = model.transform(births_test)"
+ ]
+ },
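+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The births_train dataset is first passed to the encoder. The DataFrame produced by the encoder stage is passed to featureCreator, which creates the 'features' column. Finally, the output of that stage is passed to the logistic regression estimator, which fits the final model.\n",
+    "fit() returns a PipelineModel object that can be used for prediction. Predictions are obtained by passing the test dataset created earlier to its transform() method."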
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:34.687530Z",
+ "start_time": "2018-03-06T13:30:34.190260Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=13, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=66, MOTHER_PRE_WEIGHT=133, MOTHER_DELIVERY_WEIGHT=135, MOTHER_WEIGHT_GAIN=2, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVector(24, {0: 13.0, 1: 99.0, 6: 66.0, 7: 133.0, 8: 135.0, 9: 2.0, 16: 1.0}), rawPrediction=DenseVector([1.0545, -1.0545]), probability=DenseVector([0.7416, 0.2584]), prediction=0.0)]"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "test_model.take(1)"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Evaluating model performance\n",
+    "- test_model.take()\n",
+    "> The evaluator reads the DenseVector in the probability column"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:35.584856Z",
+ "start_time": "2018-03-06T13:30:35.572678Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.ml.evaluation as ev\n",
+ "\n",
+ "evaluator = ev.BinaryClassificationEvaluator(\n",
+ " rawPredictionCol = 'probability',\n",
+ " labelCol = 'INFANT_ALIVE_AT_REPORT'\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:38.617925Z",
+ "start_time": "2018-03-06T13:30:36.124386Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.7405439747919526\n",
+ "0.7152348988715325\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(evaluator.evaluate(test_model, {evaluator.metricName : 'areaUnderROC'}))\n",
+ "print(evaluator.evaluate(test_model, {evaluator.metricName : 'areaUnderPR'}))"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Saving the model\n",
+    "Save the **pipeline definition**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:40.075596Z",
+ "start_time": "2018-03-06T13:30:39.758183Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "pipelinePath = './infant_oneHotEncoder_Logistic_Pipeline'\n",
+ "pipeline.write().overwrite().save(pipelinePath)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:40.751333Z",
+ "start_time": "2018-03-06T13:30:40.639762Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Spark_ML.ipynb\t\t infant_oneHotEncoder_Logistic_Pipeline\r\n",
+ "births_transformed.csv.gz infant_oneHotEncoder_Logistic_PipelineModel\r\n",
+ "derby.log\t\t metastore_db\r\n"
+ ]
+ }
+ ],
+ "source": [
+ "!ls"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:43.560122Z",
+ "start_time": "2018-03-06T13:30:41.341121Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=13, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=66, MOTHER_PRE_WEIGHT=133, MOTHER_DELIVERY_WEIGHT=135, MOTHER_WEIGHT_GAIN=2, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVector(24, {0: 13.0, 1: 99.0, 6: 66.0, 7: 133.0, 8: 135.0, 9: 2.0, 16: 1.0}), rawPrediction=DenseVector([1.0545, -1.0545]), probability=DenseVector([0.7416, 0.2584]), prediction=0.0)]"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "loadedPipeline = Pipeline.load(pipelinePath)\n",
+ "loadedPipeline.fit(births_train).transform(births_test).take(1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:45.733936Z",
+ "start_time": "2018-03-06T13:30:44.217113Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from pyspark.ml import PipelineModel\n",
+ "\n",
+ "modelPath = './infant_oneHotEncoder_Logistic_PipelineModel'\n",
+ "model.write().overwrite().save(modelPath)\n",
+ "\n",
+ "loadedPipeModel = PipelineModel.load(modelPath)\n",
+ "test_loadedModel = loadedPipeModel.transform(births_test)"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Hyperparameter optimization\n",
+    "We use grid search, built with a ParamGridBuilder object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:46.533063Z",
+ "start_time": "2018-03-06T13:30:46.528995Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.ml.tuning as tune"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:30:47.058555Z",
+ "start_time": "2018-03-06T13:30:47.043831Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "logistic = cl.LogisticRegression(\n",
+ " labelCol = 'INFANT_ALIVE_AT_REPORT'\n",
+ ")\n",
+ "\n",
+ "grid = tune.ParamGridBuilder().addGrid(logistic.maxIter, [2, 10, 50]).addGrid(logistic.regParam, [0.01, 0.1, 0.3]).build()\n",
+ "\n",
+ "evaluator = ev.BinaryClassificationEvaluator(rawPredictionCol='probability', labelCol='INFANT_ALIVE_AT_REPORT')\n",
+ "\n",
+ "cv = tune.CrossValidator(estimator=logistic, estimatorParamMaps=grid, evaluator=evaluator)"
+ ]
+ },
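+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a rough sketch of what the grid above contains, the plain-Python snippet below enumerates every combination of the two candidate lists. It does not use PySpark, and the helper name build_grid is hypothetical; only the candidate values are taken from the addGrid calls.\n",
+    "\n",
+    "```python\n",
+    "from itertools import product\n",
+    "\n",
+    "\n",
+    "def build_grid(candidates):\n",
+    "    # One dict per combination of the candidate values (Cartesian product)\n",
+    "    names = list(candidates)\n",
+    "    combos = product(*(candidates[name] for name in names))\n",
+    "    return [dict(zip(names, combo)) for combo in combos]\n",
+    "\n",
+    "\n",
+    "grid_sketch = build_grid({'maxIter': [2, 10, 50], 'regParam': [0.01, 0.1, 0.3]})\n",
+    "print(len(grid_sketch))  # 3 x 3 = 9 combinations\n",
+    "print(grid_sketch[0])\n",
+    "```\n",
+    "\n",
+    "Every one of these settings has to be fitted and scored, so the cost of the search grows with the product of the list lengths."
+   ]
+  },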
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:26.114944Z",
+ "start_time": "2018-03-06T13:30:47.815001Z"
+ }
+   },
+   "outputs": [],
+   "source": [
+    "pipeline = Pipeline(stages=[encoder, featureCreator])\n",
+    "\n",
+    "data_transformer = pipeline.fit(births_train)\n",
+    "\n",
+    "# The transformer pipeline is set up the same way as before\n",
+    "# The difference is that the estimator is fitted through the cv object defined earlier\n",
+    "# Run cross-validation on the transformed training data\n",
+ "cv_model = cv.fit(dataset=data_transformer.transform(births_train))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:29.747993Z",
+ "start_time": "2018-03-06T13:31:28.589118Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.7404526641072416\n",
+ "0.7157767684747429\n"
+ ]
+ }
+   ],
+   "source": [
+    "data_test = data_transformer \\\n",
+    "    .transform(births_test)\n",
+    "results = cv_model.transform(data_test)\n",
+ "\n",
+ "print(evaluator.evaluate(results, \n",
+ " {evaluator.metricName: 'areaUnderROC'}))\n",
+ "print(evaluator.evaluate(results, \n",
+ " {evaluator.metricName: 'areaUnderPR'}))"
+ ]
+ },
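+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true
+   },
+   "source": [
+    "The grid search result is essentially the same as the earlier model: areaUnderROC is about 0.7405 in both cases, and areaUnderPR rises only from 0.7152 to 0.7158.\n",
+    "\n",
+    "Let's find the set of hyperparameters that gave the best performance."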
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:30.750863Z",
+ "start_time": "2018-03-06T13:31:30.742933Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "([{'maxIter': 50}, {'regParam': 0.01}], 0.738652833807851)"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "results = [\n",
+ " (\n",
+ " [\n",
+ " {key.name : paramValue}\n",
+ " for key, paramValue in zip(params.keys(), params.values())\n",
+ " ], metric\n",
+ " )\n",
+ " for params, metric\n",
+ " in zip(\n",
+ " cv_model.getEstimatorParamMaps(),\n",
+ " cv_model.avgMetrics\n",
+ " )\n",
+ "]\n",
+ "\n",
+ "sorted(results, key=lambda el: el[1], reverse=True)[0]"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Train/validation split\n",
+    "To select the best model, TrainValidationSplit splits the input dataset into two parts, a training set and a validation set.\n",
+    "\n",
+    "We also use ChiSqSelector to keep only the most informative features."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:32.252416Z",
+ "start_time": "2018-03-06T13:31:32.240831Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "selector = ft.ChiSqSelector(numTopFeatures=5, \n",
+ " featuresCol=featureCreator.getOutputCol(), \n",
+ " outputCol='selectedFeatures', \n",
+ " labelCol='INFANT_ALIVE_AT_REPORT'\n",
+ ")"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "numTopFeatures specifies the number of features to return. The selector is defined after featureCreator so that it can call featureCreator.getOutputCol()."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:35.009781Z",
+ "start_time": "2018-03-06T13:31:33.256148Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "logistic = cl.LogisticRegression(labelCol='INFANT_ALIVE_AT_REPORT', featuresCol='selectedFeatures')\n",
+ "pipeline = Pipeline(stages=[encoder, featureCreator, selector])\n",
+ "data_transformer = pipeline.fit(births_train)"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A TrainValidationSplit object is created in the same way as a CrossValidator."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:36.200307Z",
+ "start_time": "2018-03-06T13:31:36.197153Z"
+ }
+   },
+   "outputs": [],
+   "source": [
+    "tvs = tune.TrainValidationSplit(estimator=logistic, \n",
+    "                                estimatorParamMaps=grid, # the parameter grid defined earlier\n",
+    "                                evaluator=evaluator # the evaluator defined alongside the grid\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:53.480989Z",
+ "start_time": "2018-03-06T13:31:36.857454Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.7294296314442145\n",
+ "0.7037759446410553\n"
+ ]
+ }
+   ],
+   "source": [
+    "# data_transformer is the fitted pipeline (encoder, featureCreator, selector)\n",
+    "tvs_model = tvs.fit(data_transformer.transform(births_train))\n",
+    "data_test = data_transformer.transform(births_test)\n",
+    "results = tvs_model.transform(data_test)\n",
+ "\n",
+ "print(evaluator.evaluate(results, {evaluator.metricName: 'areaUnderROC'}))\n",
+ "print(evaluator.evaluate(results, {evaluator.metricName: 'areaUnderPR'}))"
+ ]
+ },
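+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The model that uses fewer features performs somewhat worse: areaUnderROC is 0.7294 versus 0.7405, and areaUnderPR is 0.7038 versus 0.7158 for the full feature set."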
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Other features of PySpark ML"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:55.805598Z",
+ "start_time": "2018-03-06T13:31:55.787103Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "text_data = spark.createDataFrame([\n",
+ " ['''Machine learning can be applied to a wide variety \n",
+ " of data types, such as vectors, text, images, and \n",
+ " structured data. This API adopts the DataFrame from \n",
+ " Spark SQL in order to support a variety of data types.'''],\n",
+ " ['''DataFrame supports many basic and structured types; \n",
+ " see the Spark SQL datatype reference for a list of \n",
+ " supported types. In addition to the types listed in \n",
+ " the Spark SQL guide, DataFrame can use ML Vector types.'''],\n",
+ " ['''A DataFrame can be created either implicitly or \n",
+ " explicitly from a regular RDD. See the code examples \n",
+ " below and the Spark SQL programming guide for examples.'''],\n",
+ " ['''Columns in a DataFrame are named. The code examples \n",
+ " below use names such as \"text,\" \"features,\" and \"label.\"''']\n",
+ "], ['input'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:56.655512Z",
+ "start_time": "2018-03-06T13:31:56.573300Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+--------------------+\n",
+ "| input|\n",
+ "+--------------------+\n",
+ "|Machine learning ...|\n",
+ "|DataFrame support...|\n",
+ "|A DataFrame can b...|\n",
+ "|Columns in a Data...|\n",
+ "+--------------------+\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "text_data.show()"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This creates a DataFrame with a single column. We want to split the sentences in each row into words, and we use RegexTokenizer so that the split pattern can be specified."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:57.572582Z",
+ "start_time": "2018-03-06T13:31:57.564329Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "tokenizer = ft.RegexTokenizer(inputCol='input', outputCol='input_arr', pattern='\\s+|[,.\\\"]')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:58.352732Z",
+ "start_time": "2018-03-06T13:31:58.253102Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(input_arr=['machine', 'learning', 'can', 'be', 'applied', 'to', 'a', 'wide', 'variety', 'of', 'data', 'types', 'such', 'as', 'vectors', 'text', 'images', 'and', 'structured', 'data', 'this', 'api', 'adopts', 'the', 'dataframe', 'from', 'spark', 'sql', 'in', 'order', 'to', 'support', 'a', 'variety', 'of', 'data', 'types'])]"
+ ]
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tok = tokenizer.transform(text_data).select('input_arr')\n",
+ "tok.take(1)"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, remove the stop words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:31:59.609604Z",
+ "start_time": "2018-03-06T13:31:59.411908Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(input_stop=['machine', 'learning', 'applied', 'wide', 'variety', 'data', 'types', 'vectors', 'text', 'images', 'structured', 'data', 'api', 'adopts', 'dataframe', 'spark', 'sql', 'order', 'support', 'variety', 'data', 'types'])]"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "stopwords = ft.StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol='input_stop')\n",
+ "stopwords.transform(tok).select('input_stop').take(1)"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now set up an NGram transformer and a pipeline."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:00.522282Z",
+ "start_time": "2018-03-06T13:32:00.513151Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "ngram = ft.NGram(n=2, inputCol=stopwords.getOutputCol(), outputCol='NGrams')\n",
+ "pipeline = Pipeline(stages=[tokenizer, stopwords, ngram])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:01.255262Z",
+ "start_time": "2018-03-06T13:32:01.117080Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(NGrams=['machine learning', 'learning applied', 'applied wide', 'wide variety', 'variety data', 'data types', 'types vectors', 'vectors text', 'text images', 'images structured', 'structured data', 'data api', 'api adopts', 'adopts dataframe', 'dataframe spark', 'spark sql', 'sql order', 'order support', 'support variety', 'variety data', 'data types'])]"
+ ]
+ },
+ "execution_count": 35,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "data_ngram = pipeline.fit(text_data).transform(text_data)\n",
+ "data_ngram.select('NGrams').take(1)"
+ ]
+ },
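+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For intuition, the bigrams above can be reproduced in plain Python by joining each token with its successor. This is only a sketch of the idea, not how NGram is implemented, and the helper name ngrams is hypothetical.\n",
+    "\n",
+    "```python\n",
+    "def ngrams(tokens, n=2):\n",
+    "    # Join every run of n consecutive tokens with a single space\n",
+    "    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]\n",
+    "\n",
+    "\n",
+    "sample = ['machine', 'learning', 'applied', 'wide']\n",
+    "print(ngrams(sample))\n",
+    "```\n",
+    "\n",
+    "A token list shorter than n yields an empty list, because there is no complete run of n tokens to join."
+   ]
+  },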
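+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Discretizing continuous variables\n",
+    "Some continuous variables are highly non-linear and hard to fit in a model with a single coefficient. \n",
+    "In that situation it is difficult to explain the relationship between the feature and the target with just one coefficient. Sometimes it is very useful to band the values into discrete buckets.\n",
+    "\n",
+    "Let's create some example data."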
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:02.266943Z",
+ "start_time": "2018-03-06T13:32:02.181314Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+------------------+\n",
+ "| continuous_var|\n",
+ "+------------------+\n",
+ "| 20.1234|\n",
+ "|20.132344452369832|\n",
+ "|20.159087064491775|\n",
+ "+------------------+\n",
+ "only showing top 3 rows\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "x = np.arange(0, 100)\n",
+ "x = x / 100.0 * np.pi * 4\n",
+ "y = x * np.sin(x / 1.764) + 20.1234\n",
+ "\n",
+ "schema = typ.StructType([\n",
+ " typ.StructField('continuous_var', typ.DoubleType(), False)\n",
+ "])\n",
+ "\n",
+ "data = spark.createDataFrame([[float(e), ] for e in y], schema=schema)\n",
+ "data.show(3)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:04.204286Z",
+ "start_time": "2018-03-06T13:32:02.968489Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(discretized=0.0, avg(continuous_var)=12.314360733007913),\n",
+ " Row(discretized=1.0, avg(continuous_var)=16.046244793347473),\n",
+ " Row(discretized=2.0, avg(continuous_var)=20.250799478352594),\n",
+ " Row(discretized=3.0, avg(continuous_var)=22.040988218437327),\n",
+ " Row(discretized=4.0, avg(continuous_var)=24.264824657002862)]"
+ ]
+ },
+ "execution_count": 37,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "discretizer = ft.QuantileDiscretizer(numBuckets=5, inputCol='continuous_var', outputCol='discretized')\n",
+ "\n",
+ "data_discretized = discretizer.fit(data).transform(data)\n",
+ "data_discretized.groupby('discretized').mean('continuous_var').sort('discretized').collect()"
+ ]
+ },
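+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The idea behind quantile bucketing can be sketched in plain Python: compute the cut points that split the values into equally populated groups, then look up the bucket of each value. This sketch assumes exact quantiles from the standard statistics module (Python 3.8+), whereas QuantileDiscretizer works with approximate quantiles; the helper name discretize is hypothetical.\n",
+    "\n",
+    "```python\n",
+    "from bisect import bisect_right\n",
+    "from statistics import quantiles\n",
+    "\n",
+    "\n",
+    "def discretize(values, num_buckets):\n",
+    "    # num_buckets - 1 cut points that split the values into equally sized groups\n",
+    "    cuts = quantiles(values, n=num_buckets)\n",
+    "    # The bucket index is the number of cut points at or below the value\n",
+    "    return [bisect_right(cuts, v) for v in values]\n",
+    "\n",
+    "\n",
+    "buckets = discretize(list(range(1, 11)), 5)\n",
+    "print(buckets)\n",
+    "```\n",
+    "\n",
+    "With ten evenly spaced values and five buckets, each bucket receives two values."
+   ]
+  },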
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Standardizing continuous variables"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:04.870707Z",
+ "start_time": "2018-03-06T13:32:04.861143Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "vectorizer = ft.VectorAssembler(inputCols=['continuous_var'], outputCol='continuous_vec')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:05.752991Z",
+ "start_time": "2018-03-06T13:32:05.538350Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+------------------+--------------------+--------------------+\n",
+ "| continuous_var| continuous_vec| normalized|\n",
+ "+------------------+--------------------+--------------------+\n",
+ "| 20.1234| [20.1234]|[0.23429139554502...|\n",
+ "|20.132344452369832|[20.132344452369832]|[0.23630959828688...|\n",
+ "|20.159087064491775|[20.159087064491775]| [0.242343731051792]|\n",
+ "+------------------+--------------------+--------------------+\n",
+ "only showing top 3 rows\n",
+ "\n"
+ ]
+ }
+   ],
+   "source": [
+    "normalizer = ft.StandardScaler(inputCol=vectorizer.getOutputCol(), outputCol='normalized', withMean=True, withStd=True)\n",
+ "\n",
+ "pipeline = Pipeline(stages=[vectorizer, normalizer])\n",
+ "data_standardized = pipeline.fit(data).transform(data)\n",
+ "\n",
+ "data_standardized.show(3)"
+ ]
+ },
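+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The arithmetic behind withMean=True and withStd=True can be sketched without Spark: subtract the mean, then divide by the standard deviation. The sketch assumes the sample standard deviation (n - 1 in the denominator); the helper name standardize is hypothetical.\n",
+    "\n",
+    "```python\n",
+    "from statistics import mean, stdev\n",
+    "\n",
+    "\n",
+    "def standardize(values):\n",
+    "    mu = mean(values)\n",
+    "    sigma = stdev(values)  # sample standard deviation (assumption, see above)\n",
+    "    return [(v - mu) / sigma for v in values]\n",
+    "\n",
+    "\n",
+    "z = standardize([1.0, 2.0, 3.0])\n",
+    "print(z)\n",
+    "```\n",
+    "\n",
+    "After this transformation the values have mean 0 and unit standard deviation, which puts features measured on different scales on a comparable footing."
+   ]
+  },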
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Classification models"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:06.631664Z",
+ "start_time": "2018-03-06T13:32:06.614433Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.sql.functions as func\n",
+ "\n",
+ "births = births.withColumn('INFANT_ALIVE_AT_REPORT', func.col('INFANT_ALIVE_AT_REPORT').cast(typ.DoubleType()))\n",
+ "births_train, births_test = births.randomSplit([0.7, 0.3], seed=666)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:11.067953Z",
+ "start_time": "2018-03-06T13:32:07.361151Z"
+ }
+   },
+   "outputs": [],
+   "source": [
+    "classifier = cl.RandomForestClassifier(\n",
+    "    numTrees=5,\n",
+    "    maxDepth=5,\n",
+    "    labelCol='INFANT_ALIVE_AT_REPORT'\n",
+    ")\n",
+    "\n",
+    "# Plain approach, without hyperparameter tuning\n",
+    "pipeline = Pipeline(stages=[encoder, featureCreator, classifier])\n",
+ "\n",
+ "model = pipeline.fit(births_train)\n",
+ "test = model.transform(births_test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:12.652890Z",
+ "start_time": "2018-03-06T13:32:11.758322Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.7625231306933616\n",
+ "0.7474287997552782\n"
+ ]
+ }
+ ],
+ "source": [
+ "evaluator = ev.BinaryClassificationEvaluator(\n",
+ " labelCol='INFANT_ALIVE_AT_REPORT')\n",
+ "print(evaluator.evaluate(test, \n",
+ " {evaluator.metricName: \"areaUnderROC\"}))\n",
+ "print(evaluator.evaluate(test, \n",
+ " {evaluator.metricName: \"areaUnderPR\"}))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:16.316611Z",
+ "start_time": "2018-03-06T13:32:13.405779Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.7582781726635287\n",
+ "0.7787580540118526\n"
+ ]
+ }
+   ],
+   "source": [
+    "# Same approach, with a decision tree as the final pipeline stage\n",
+ "classifier = cl.DecisionTreeClassifier(maxDepth=5, labelCol='INFANT_ALIVE_AT_REPORT')\n",
+ "pipeline = Pipeline(stages=[encoder, featureCreator, classifier])\n",
+ "\n",
+ "model = pipeline.fit(births_train)\n",
+ "test = model.transform(births_test)\n",
+ "\n",
+ "evaluator = ev.BinaryClassificationEvaluator(labelCol='INFANT_ALIVE_AT_REPORT')\n",
+ "print(evaluator.evaluate(test, \n",
+ " {evaluator.metricName: \"areaUnderROC\"}))\n",
+ "print(evaluator.evaluate(test, \n",
+ " {evaluator.metricName: \"areaUnderPR\"}))"
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Clustering"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:18.325669Z",
+ "start_time": "2018-03-06T13:32:17.127629Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.ml.clustering as clus\n",
+ "\n",
+ "kmeans = clus.KMeans(k = 5, featuresCol = 'features')\n",
+ "pipeline = Pipeline(stages=[encoder, featureCreator, kmeans])\n",
+ "model = pipeline.fit(births_train)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:20.227778Z",
+ "start_time": "2018-03-06T13:32:18.987601Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(prediction=1, avg(MOTHER_HEIGHT_IN)=83.91154791154791, count(1)=407),\n",
+ " Row(prediction=3, avg(MOTHER_HEIGHT_IN)=66.64658634538152, count(1)=249),\n",
+ " Row(prediction=4, avg(MOTHER_HEIGHT_IN)=64.31597357170618, count(1)=10292),\n",
+ " Row(prediction=2, avg(MOTHER_HEIGHT_IN)=67.69473684210526, count(1)=475),\n",
+ " Row(prediction=0, avg(MOTHER_HEIGHT_IN)=64.43472584856397, count(1)=2298)]"
+ ]
+ },
+ "execution_count": 45,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "test = model.transform(births_test)\n",
+ "\n",
+ "test.groupBy('prediction').agg({\n",
+ " '*': 'count',\n",
+ " 'MOTHER_HEIGHT_IN' : 'avg'\n",
+ "}).collect()"
+ ]
+ },
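+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The results show that MOTHER_HEIGHT_IN is markedly different in cluster 1 (an average of about 83.9, compared with roughly 64 to 68 in the other clusters)."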
+ ]
+ },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Topic Mining\n",
+    "Clustering models are not limited to numeric data. In NLP, tasks such as topic extraction rely on clustering to find documents that share a theme. \n",
+    "The data below has six documents: three are about technology and the other three are about national parks."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:22.029951Z",
+ "start_time": "2018-03-06T13:32:21.990328Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "text_data = spark.createDataFrame([\n",
+ " ['''To make a computer do anything, you have to write a \n",
+ " computer program. To write a computer program, you have \n",
+ " to tell the computer, step by step, exactly what you want \n",
+ " it to do. The computer then \"executes\" the program, \n",
+ " following each step mechanically, to accomplish the end \n",
+ " goal. When you are telling the computer what to do, you \n",
+ " also get to choose how it's going to do it. That's where \n",
+ " computer algorithms come in. The algorithm is the basic \n",
+ " technique used to get the job done. Let's follow an \n",
+ " example to help get an understanding of the algorithm \n",
+ " concept.'''],\n",
+ " ['''Laptop computers use batteries to run while not \n",
+ " connected to mains. When we overcharge or overheat \n",
+ " lithium ion batteries, the materials inside start to \n",
+ " break down and produce bubbles of oxygen, carbon dioxide, \n",
+ " and other gases. Pressure builds up, and the hot battery \n",
+ " swells from a rectangle into a pillow shape. Sometimes \n",
+ " the phone involved will operate afterwards. Other times \n",
+ " it will die. And occasionally—kapow! To see what's \n",
+ " happening inside the battery when it swells, the CLS team \n",
+ " used an x-ray technology called computed tomography.'''],\n",
+ " ['''This technology describes a technique where touch \n",
+ " sensors can be placed around any side of a device \n",
+ " allowing for new input sources. The patent also notes \n",
+ " that physical buttons (such as the volume controls) could \n",
+ " be replaced by these embedded touch sensors. In essence \n",
+ " Apple could drop the current buttons and move towards \n",
+ " touch-enabled areas on the device for the existing UI. It \n",
+ " could also open up areas for new UI paradigms, such as \n",
+ " using the back of the smartphone for quick scrolling or \n",
+ " page turning.'''],\n",
+ " ['''The National Park Service is a proud protector of \n",
+ " America’s lands. Preserving our land not only safeguards \n",
+ " the natural environment, but it also protects the \n",
+ " stories, cultures, and histories of our ancestors. As we \n",
+ " face the increasingly dire consequences of climate \n",
+ " change, it is imperative that we continue to expand \n",
+ " America’s protected lands under the oversight of the \n",
+ " National Park Service. Doing so combats climate change \n",
+ " and allows all American’s to visit, explore, and learn \n",
+ " from these treasured places for generations to come. It \n",
+ " is critical that President Obama acts swiftly to preserve \n",
+ " land that is at risk of external threats before the end \n",
+ " of his term as it has become blatantly clear that the \n",
+ " next administration will not hold the same value for our \n",
+ " environment over the next four years.'''],\n",
+ " ['''The National Park Foundation, the official charitable \n",
+ " partner of the National Park Service, enriches America’s \n",
+ " national parks and programs through the support of \n",
+ " private citizens, park lovers, stewards of nature, \n",
+ " history enthusiasts, and wilderness adventurers. \n",
+ " Chartered by Congress in 1967, the Foundation grew out of \n",
+ " a legacy of park protection that began over a century \n",
+ " ago, when ordinary citizens took action to establish and \n",
+ " protect our national parks. Today, the National Park \n",
+ " Foundation carries on the tradition of early park \n",
+ " advocates, big thinkers, doers and dreamers—from John \n",
+ " Muir and Ansel Adams to President Theodore Roosevelt.'''],\n",
+ " ['''Australia has over 500 national parks. Over 28 \n",
+ " million hectares of land is designated as national \n",
+ " parkland, accounting for almost four per cent of \n",
+ " Australia's land areas. In addition, a further six per \n",
+ " cent of Australia is protected and includes state \n",
+ " forests, nature parks and conservation reserves.National \n",
+ " parks are usually large areas of land that are protected \n",
+ " because they have unspoilt landscapes and a diverse \n",
+ " number of native plants and animals. This means that \n",
+ " commercial activities such as farming are prohibited and \n",
+ " human activity is strictly monitored.''']\n",
+ "], ['documents'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:22.908791Z",
+ "start_time": "2018-03-06T13:32:22.871362Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "tokenizer = ft.RegexTokenizer(\n",
+ " inputCol = 'documents',\n",
+ " outputCol = 'input_arr',\n",
+ " pattern = '\\s+|[,.\\\"]')\n",
+ "\n",
+ "stopwords = ft.StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol='intput_stop')"
+ ]
+ },
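+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, CountVectorizer goes into the pipeline. CountVectorizer counts the words in each document and returns a vector of counts. The length of the vector equals the number of distinct words across all documents."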
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:24.107521Z",
+ "start_time": "2018-03-06T13:32:23.845135Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(input_indexed=SparseVector(257, {2: 7.0, 6: 1.0, 7: 3.0, 8: 3.0, 10: 3.0, 24: 1.0, 29: 2.0, 31: 1.0, 33: 1.0, 37: 2.0, 39: 1.0, 46: 1.0, 58: 1.0, 59: 1.0, 61: 1.0, 64: 1.0, 70: 1.0, 72: 1.0, 81: 1.0, 96: 1.0, 128: 1.0, 132: 1.0, 133: 1.0, 134: 1.0, 135: 1.0, 142: 1.0, 164: 1.0, 169: 1.0, 189: 1.0, 212: 1.0, 225: 1.0, 247: 1.0, 254: 1.0})),\n",
+ " Row(input_indexed=SparseVector(257, {14: 1.0, 16: 2.0, 23: 2.0, 25: 2.0, 31: 1.0, 42: 2.0, 49: 1.0, 51: 1.0, 55: 1.0, 56: 1.0, 67: 1.0, 73: 1.0, 76: 1.0, 77: 1.0, 84: 1.0, 87: 1.0, 97: 1.0, 105: 1.0, 113: 1.0, 114: 1.0, 116: 1.0, 117: 1.0, 125: 1.0, 139: 1.0, 141: 1.0, 143: 1.0, 151: 1.0, 152: 1.0, 153: 1.0, 154: 1.0, 157: 1.0, 166: 1.0, 171: 1.0, 174: 1.0, 181: 1.0, 185: 1.0, 187: 1.0, 194: 1.0, 195: 1.0, 199: 1.0, 202: 1.0, 204: 1.0, 209: 1.0, 213: 1.0, 234: 1.0, 236: 1.0, 246: 1.0}))]"
+ ]
+ },
+ "execution_count": 48,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "stringIndexer = ft.CountVectorizer(\n",
+ " inputCol =stopwords.getOutputCol(),\n",
+ " outputCol = 'input_indexed')\n",
+ "\n",
+ "tokenized = stopwords.transform(tokenizer.transform(text_data))\n",
+ "stringIndexer.fit(tokenized).transform(tokenized).select('input_indexed').take(2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
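+ "The output shows a vocabulary of 257 distinct words, and each document is now represented as a vector of word counts. With this representation we can predict topics, here using an LDA model.\n",
+ "- k: the number of topics to extract\n",
+ "- optimizer: either online or em"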
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:25.106298Z",
+ "start_time": "2018-03-06T13:32:25.093063Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "clustering = clus.LDA(k=2, optimizer='online', featuresCol=stringIndexer.getOutputCol())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:25.688409Z",
+ "start_time": "2018-03-06T13:32:25.685222Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "pipeline = Pipeline(stages=[tokenizer, stopwords, stringIndexer, clustering])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Fit the pipeline and inspect the topic distribution assigned to each document."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:29.390061Z",
+ "start_time": "2018-03-06T13:32:26.468223Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[Row(topicDistribution=DenseVector([0.053, 0.947])),\n",
+ " Row(topicDistribution=DenseVector([0.9776, 0.0224])),\n",
+ " Row(topicDistribution=DenseVector([0.0147, 0.9853])),\n",
+ " Row(topicDistribution=DenseVector([0.9753, 0.0247])),\n",
+ " Row(topicDistribution=DenseVector([0.9876, 0.0124])),\n",
+ " Row(topicDistribution=DenseVector([0.8183, 0.1817]))]"
+ ]
+ },
+ "execution_count": 51,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "topics = pipeline.fit(text_data).transform(text_data)\n",
+ "topics.select('topicDistribution').collect()  # topicDistribution is the column added by the fitted LDA model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Regression models"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:30.399664Z",
+ "start_time": "2018-03-06T13:32:30.395721Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "features = ['MOTHER_AGE_YEARS','MOTHER_HEIGHT_IN',\n",
+ " 'MOTHER_PRE_WEIGHT','DIABETES_PRE',\n",
+ " 'DIABETES_GEST','HYP_TENS_PRE', \n",
+ " 'HYP_TENS_GEST', 'PREV_BIRTH_PRETERM',\n",
+ " 'CIG_BEFORE','CIG_1_TRI', 'CIG_2_TRI', \n",
+ " 'CIG_3_TRI'\n",
+ " ]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:30.963396Z",
+ "start_time": "2018-03-06T13:32:30.950748Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "featuresCreator = ft.VectorAssembler(inputCols=[col for col in features[1:]], outputCol='features')\n",
+ "selector = ft.ChiSqSelector(numTopFeatures=6, outputCol='selectedFeatures', labelCol='MOTHER_WEIGHT_GAIN')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 54,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:31.410976Z",
+ "start_time": "2018-03-06T13:32:31.395876Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pyspark.ml.regression as reg\n",
+ "\n",
+ "regressor = reg.GBTRegressor(maxIter=15, maxDepth = 3, labelCol = 'MOTHER_WEIGHT_GAIN')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:32:50.223814Z",
+ "start_time": "2018-03-06T13:32:43.894970Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "pipeline = Pipeline(stages=[\n",
+ " featuresCreator, \n",
+ " selector,\n",
+ " regressor])\n",
+ "\n",
+ "weightGain = pipeline.fit(births_train)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-06T13:33:50.694341Z",
+ "start_time": "2018-03-06T13:33:50.151492Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.48862170400240335\n"
+ ]
+ }
+ ],
+ "source": [
+ "evaluator = ev.RegressionEvaluator(\n",
+ " predictionCol='prediction',\n",
+ " labelCol = 'MOTHER_WEIGHT_GAIN'\n",
+ ")\n",
+ "\n",
+ "print(evaluator.evaluate(weightGain.transform(births_test), {evaluator.metricName:'r2'}))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
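+ "### Summary\n",
+ "This chapter showed how to use PySpark ML, the main machine learning library of PySpark. It explained what transformers and estimators are, and used pipelines, another concept introduced by the ML library. Along the way it also tried out feature extraction and several of the library's models."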
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python (python3_0901)",
+ "language": "python",
+ "name": "python3_0901"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.3"
+ },
+ "latex_envs": {
+ "LaTeX_envs_menu_present": true,
+ "autoclose": false,
+ "autocomplete": true,
+ "bibliofile": "biblio.bib",
+ "cite_by": "apalike",
+ "current_citInitial": 1,
+ "eqLabelWithNumbers": true,
+ "eqNumInitial": 1,
+ "hotkeys": {
+ "equation": "Ctrl-E",
+ "itemize": "Ctrl-I"
+ },
+ "labels_anchors": false,
+ "latex_user_defs": false,
+ "report_style_numbering": false,
+ "user_envs_cfg": false
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/5. ML_tutorial/births_transformed.csv.gz b/5. ML_tutorial/births_transformed.csv.gz
new file mode 100644
index 0000000..a22a695
Binary files /dev/null and b/5. ML_tutorial/births_transformed.csv.gz differ
diff --git a/5. ML_tutorial/derby.log b/5. ML_tutorial/derby.log
new file mode 100644
index 0000000..4a417a6
--- /dev/null
+++ b/5. ML_tutorial/derby.log
@@ -0,0 +1,13 @@
+----------------------------------------------------------------
+Thu Mar 08 16:10:01 KST 2018:
+Booting Derby version The Apache Software Foundation - Apache Derby - 10.12.1.1 - (1704137): instance a816c00e-0162-0471-fa52-0000087f0338
+on database directory /home/paulkim/workspace/Spark/LearningPySpark/Chapter6/metastore_db with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@58a10067
+Loaded from file:/home/paulkim/spark-2.2.0-bin-hadoop2.7/jars/derby-10.12.1.1.jar
+java.vendor=Oracle Corporation
+java.runtime.version=1.8.0_151-b12
+user.dir=/home/paulkim/workspace/Spark/LearningPySpark/Chapter6
+os.name=Linux
+os.arch=amd64
+os.version=4.13.0-36-generic
+derby.system.home=null
+Database Class Loader started - derby.database.classpath=''
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/metadata/._SUCCESS.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/metadata/._SUCCESS.crc
new file mode 100644
index 0000000..3b7b044
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/metadata/._SUCCESS.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/metadata/.part-00000.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/metadata/.part-00000.crc
new file mode 100644
index 0000000..29fbbd5
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/metadata/.part-00000.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/metadata/_SUCCESS b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/metadata/_SUCCESS
new file mode 100644
index 0000000..e69de29
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/metadata/part-00000 b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/metadata/part-00000
new file mode 100644
index 0000000..11e5587
--- /dev/null
+++ b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/metadata/part-00000
@@ -0,0 +1 @@
+{"class":"org.apache.spark.ml.Pipeline","timestamp":1520343039788,"sparkVersion":"2.2.0","uid":"Pipeline_4c17aa5cbf875f5b1da8","paramMap":{"stageUids":["OneHotEncoder_42ea8648036159e3b0c4","VectorAssembler_4761ab090823489bcc96","LogisticRegression_40e48beb73886d7d18bb"]}}
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/._SUCCESS.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/._SUCCESS.crc
new file mode 100644
index 0000000..3b7b044
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/._SUCCESS.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/.part-00000.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/.part-00000.crc
new file mode 100644
index 0000000..5a12481
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/.part-00000.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/_SUCCESS b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/_SUCCESS
new file mode 100644
index 0000000..e69de29
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/part-00000 b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/part-00000
new file mode 100644
index 0000000..231016c
--- /dev/null
+++ b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/part-00000
@@ -0,0 +1 @@
+{"class":"org.apache.spark.ml.feature.OneHotEncoder","timestamp":1520343039910,"sparkVersion":"2.2.0","uid":"OneHotEncoder_42ea8648036159e3b0c4","paramMap":{"inputCol":"BIRTH_PLACE_INT","outputCol":"BIRTH_PLACE_VEC","dropLast":true}}
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/._SUCCESS.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/._SUCCESS.crc
new file mode 100644
index 0000000..3b7b044
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/._SUCCESS.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/.part-00000.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/.part-00000.crc
new file mode 100644
index 0000000..496b55a
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/.part-00000.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/_SUCCESS b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/_SUCCESS
new file mode 100644
index 0000000..e69de29
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/part-00000 b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/part-00000
new file mode 100644
index 0000000..57d0816
--- /dev/null
+++ b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/part-00000
@@ -0,0 +1 @@
+{"class":"org.apache.spark.ml.feature.VectorAssembler","timestamp":1520343039964,"sparkVersion":"2.2.0","uid":"VectorAssembler_4761ab090823489bcc96","paramMap":{"outputCol":"features","inputCols":["MOTHER_AGE_YEARS","FATHER_COMBINED_AGE","CIG_BEFORE","CIG_1_TRI","CIG_2_TRI","CIG_3_TRI","MOTHER_HEIGHT_IN","MOTHER_PRE_WEIGHT","MOTHER_DELIVERY_WEIGHT","MOTHER_WEIGHT_GAIN","DIABETES_PRE","DIABETES_GEST","HYP_TENS_PRE","HYP_TENS_GEST","PREV_BIRTH_PRETERM","BIRTH_PLACE_VEC"]}}
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/._SUCCESS.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/._SUCCESS.crc
new file mode 100644
index 0000000..3b7b044
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/._SUCCESS.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/.part-00000.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/.part-00000.crc
new file mode 100644
index 0000000..06d2987
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/.part-00000.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/_SUCCESS b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/_SUCCESS
new file mode 100644
index 0000000..e69de29
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/part-00000 b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/part-00000
new file mode 100644
index 0000000..6285beb
--- /dev/null
+++ b/5. ML_tutorial/infant_oneHotEncoder_Logistic_Pipeline/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/part-00000
@@ -0,0 +1 @@
+{"class":"org.apache.spark.ml.classification.LogisticRegression","timestamp":1520343040021,"sparkVersion":"2.2.0","uid":"LogisticRegression_40e48beb73886d7d18bb","paramMap":{"fitIntercept":true,"labelCol":"INFANT_ALIVE_AT_REPORT","maxIter":10,"tol":1.0E-6,"regParam":0.01,"threshold":0.5,"predictionCol":"prediction","standardization":true,"probabilityCol":"probability","featuresCol":"features","family":"auto","rawPredictionCol":"rawPrediction","elasticNetParam":0.0,"aggregationDepth":2}}
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/metadata/._SUCCESS.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/metadata/._SUCCESS.crc
new file mode 100644
index 0000000..3b7b044
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/metadata/._SUCCESS.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/metadata/.part-00000.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/metadata/.part-00000.crc
new file mode 100644
index 0000000..3cf9b2d
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/metadata/.part-00000.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/metadata/_SUCCESS b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/metadata/_SUCCESS
new file mode 100644
index 0000000..e69de29
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/metadata/part-00000 b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/metadata/part-00000
new file mode 100644
index 0000000..a6003f0
--- /dev/null
+++ b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/metadata/part-00000
@@ -0,0 +1 @@
+{"class":"org.apache.spark.ml.PipelineModel","timestamp":1520343044233,"sparkVersion":"2.2.0","uid":"PipelineModel_460a81f5e204ad3ac72a","paramMap":{"stageUids":["OneHotEncoder_42ea8648036159e3b0c4","VectorAssembler_4761ab090823489bcc96","LogisticRegression_40e48beb73886d7d18bb"]}}
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/._SUCCESS.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/._SUCCESS.crc
new file mode 100644
index 0000000..3b7b044
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/._SUCCESS.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/.part-00000.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/.part-00000.crc
new file mode 100644
index 0000000..290d6b8
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/.part-00000.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/_SUCCESS b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/_SUCCESS
new file mode 100644
index 0000000..e69de29
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/part-00000 b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/part-00000
new file mode 100644
index 0000000..e23aee1
--- /dev/null
+++ b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/0_OneHotEncoder_42ea8648036159e3b0c4/metadata/part-00000
@@ -0,0 +1 @@
+{"class":"org.apache.spark.ml.feature.OneHotEncoder","timestamp":1520343044281,"sparkVersion":"2.2.0","uid":"OneHotEncoder_42ea8648036159e3b0c4","paramMap":{"inputCol":"BIRTH_PLACE_INT","outputCol":"BIRTH_PLACE_VEC","dropLast":true}}
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/._SUCCESS.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/._SUCCESS.crc
new file mode 100644
index 0000000..3b7b044
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/._SUCCESS.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/.part-00000.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/.part-00000.crc
new file mode 100644
index 0000000..baa1a23
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/.part-00000.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/_SUCCESS b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/_SUCCESS
new file mode 100644
index 0000000..e69de29
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/part-00000 b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/part-00000
new file mode 100644
index 0000000..77d093e
--- /dev/null
+++ b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/1_VectorAssembler_4761ab090823489bcc96/metadata/part-00000
@@ -0,0 +1 @@
+{"class":"org.apache.spark.ml.feature.VectorAssembler","timestamp":1520343044321,"sparkVersion":"2.2.0","uid":"VectorAssembler_4761ab090823489bcc96","paramMap":{"outputCol":"features","inputCols":["MOTHER_AGE_YEARS","FATHER_COMBINED_AGE","CIG_BEFORE","CIG_1_TRI","CIG_2_TRI","CIG_3_TRI","MOTHER_HEIGHT_IN","MOTHER_PRE_WEIGHT","MOTHER_DELIVERY_WEIGHT","MOTHER_WEIGHT_GAIN","DIABETES_PRE","DIABETES_GEST","HYP_TENS_PRE","HYP_TENS_GEST","PREV_BIRTH_PRETERM","BIRTH_PLACE_VEC"]}}
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/data/._SUCCESS.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/data/._SUCCESS.crc
new file mode 100644
index 0000000..3b7b044
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/data/._SUCCESS.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/data/.part-00000-b598f85a-4c7b-4c8f-b3f2-989a28ac6ef3-c000.snappy.parquet.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/data/.part-00000-b598f85a-4c7b-4c8f-b3f2-989a28ac6ef3-c000.snappy.parquet.crc
new file mode 100644
index 0000000..aafef67
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/data/.part-00000-b598f85a-4c7b-4c8f-b3f2-989a28ac6ef3-c000.snappy.parquet.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/data/_SUCCESS b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/data/_SUCCESS
new file mode 100644
index 0000000..e69de29
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/data/part-00000-b598f85a-4c7b-4c8f-b3f2-989a28ac6ef3-c000.snappy.parquet b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/data/part-00000-b598f85a-4c7b-4c8f-b3f2-989a28ac6ef3-c000.snappy.parquet
new file mode 100644
index 0000000..4035bc5
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/data/part-00000-b598f85a-4c7b-4c8f-b3f2-989a28ac6ef3-c000.snappy.parquet differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/._SUCCESS.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/._SUCCESS.crc
new file mode 100644
index 0000000..3b7b044
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/._SUCCESS.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/.part-00000.crc b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/.part-00000.crc
new file mode 100644
index 0000000..a3d5e42
Binary files /dev/null and b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/.part-00000.crc differ
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/_SUCCESS b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/_SUCCESS
new file mode 100644
index 0000000..e69de29
diff --git a/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/part-00000 b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/part-00000
new file mode 100644
index 0000000..751381c
--- /dev/null
+++ b/5. ML_tutorial/infant_oneHotEncoder_Logistic_PipelineModel/stages/2_LogisticRegression_40e48beb73886d7d18bb/metadata/part-00000
@@ -0,0 +1 @@
+{"class":"org.apache.spark.ml.classification.LogisticRegressionModel","timestamp":1520343044363,"sparkVersion":"2.2.0","uid":"LogisticRegression_40e48beb73886d7d18bb","paramMap":{"fitIntercept":true,"labelCol":"INFANT_ALIVE_AT_REPORT","tol":1.0E-6,"maxIter":10,"regParam":0.01,"threshold":0.5,"predictionCol":"prediction","standardization":true,"probabilityCol":"probability","featuresCol":"features","family":"auto","rawPredictionCol":"rawPrediction","elasticNetParam":0.0,"aggregationDepth":2}}
diff --git a/5. ML_tutorial/metastore_db/README_DO_NOT_TOUCH_FILES.txt b/5. ML_tutorial/metastore_db/README_DO_NOT_TOUCH_FILES.txt
new file mode 100644
index 0000000..a4bc145
--- /dev/null
+++ b/5. ML_tutorial/metastore_db/README_DO_NOT_TOUCH_FILES.txt
@@ -0,0 +1,9 @@
+
+# *************************************************************************
+# *** DO NOT TOUCH FILES IN THIS DIRECTORY! ***
+# *** FILES IN THIS DIRECTORY AND SUBDIRECTORIES CONSTITUTE A DERBY ***
+# *** DATABASE, WHICH INCLUDES THE DATA (USER AND SYSTEM) AND THE ***
+# *** FILES NECESSARY FOR DATABASE RECOVERY. ***
+# *** EDITING, ADDING, OR DELETING ANY OF THESE FILES MAY CAUSE DATA ***
+# *** CORRUPTION AND LEAVE THE DATABASE IN A NON-RECOVERABLE STATE. ***
+# *************************************************************************
\ No newline at end of file
diff --git a/5. ML_tutorial/metastore_db/db.lck b/5. ML_tutorial/metastore_db/db.lck
new file mode 100644
index 0000000..495126f
Binary files /dev/null and b/5. ML_tutorial/metastore_db/db.lck differ
diff --git a/5. ML_tutorial/metastore_db/dbex.lck b/5. ML_tutorial/metastore_db/dbex.lck
new file mode 100644
index 0000000..720d64f
Binary files /dev/null and b/5. ML_tutorial/metastore_db/dbex.lck differ
diff --git a/5. ML_tutorial/metastore_db/log/README_DO_NOT_TOUCH_FILES.txt b/5. ML_tutorial/metastore_db/log/README_DO_NOT_TOUCH_FILES.txt
new file mode 100644
index 0000000..56df292
--- /dev/null
+++ b/5. ML_tutorial/metastore_db/log/README_DO_NOT_TOUCH_FILES.txt
@@ -0,0 +1,8 @@
+
+# *************************************************************************
+# *** DO NOT TOUCH FILES IN THIS DIRECTORY! ***
+# *** FILES IN THIS DIRECTORY ARE USED BY THE DERBY DATABASE RECOVERY ***
+# *** SYSTEM. EDITING, ADDING, OR DELETING FILES IN THIS DIRECTORY ***
+# *** WILL CAUSE THE DERBY RECOVERY SYSTEM TO FAIL, LEADING TO ***
+# *** NON-RECOVERABLE CORRUPT DATABASES. ***
+# *************************************************************************
\ No newline at end of file
diff --git a/5. ML_tutorial/metastore_db/log/log.ctrl b/5. ML_tutorial/metastore_db/log/log.ctrl
new file mode 100644
index 0000000..a11daac
Binary files /dev/null and b/5. ML_tutorial/metastore_db/log/log.ctrl differ
diff --git a/5. ML_tutorial/metastore_db/log/log1.dat b/5. ML_tutorial/metastore_db/log/log1.dat
new file mode 100644
index 0000000..23fe889
Binary files /dev/null and b/5. ML_tutorial/metastore_db/log/log1.dat differ
diff --git a/5. ML_tutorial/metastore_db/log/logmirror.ctrl b/5. ML_tutorial/metastore_db/log/logmirror.ctrl
new file mode 100644
index 0000000..a11daac
Binary files /dev/null and b/5. ML_tutorial/metastore_db/log/logmirror.ctrl differ
diff --git a/5. ML_tutorial/metastore_db/seg0/README_DO_NOT_TOUCH_FILES.txt b/5. ML_tutorial/metastore_db/seg0/README_DO_NOT_TOUCH_FILES.txt
new file mode 100644
index 0000000..2bdad06
--- /dev/null
+++ b/5. ML_tutorial/metastore_db/seg0/README_DO_NOT_TOUCH_FILES.txt
@@ -0,0 +1,8 @@
+
+# *************************************************************************
+# *** DO NOT TOUCH FILES IN THIS DIRECTORY! ***
+# *** FILES IN THIS DIRECTORY ARE USED BY THE DERBY DATABASE TO STORE ***
+# *** USER AND SYSTEM DATA. EDITING, ADDING, OR DELETING FILES IN THIS ***
+# *** DIRECTORY WILL CORRUPT THE ASSOCIATED DERBY DATABASE AND MAKE ***
+# *** IT NON-RECOVERABLE. ***
+# *************************************************************************
\ No newline at end of file
diff --git a/5. ML_tutorial/metastore_db/seg0/c10.dat b/5. ML_tutorial/metastore_db/seg0/c10.dat
new file mode 100644
index 0000000..33180ba
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c10.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c101.dat b/5. ML_tutorial/metastore_db/seg0/c101.dat
new file mode 100644
index 0000000..9a3d779
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c101.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c111.dat b/5. ML_tutorial/metastore_db/seg0/c111.dat
new file mode 100644
index 0000000..25389f9
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c111.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c121.dat b/5. ML_tutorial/metastore_db/seg0/c121.dat
new file mode 100644
index 0000000..8cd6d1c
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c121.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c130.dat b/5. ML_tutorial/metastore_db/seg0/c130.dat
new file mode 100644
index 0000000..4d975cb
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c130.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c141.dat b/5. ML_tutorial/metastore_db/seg0/c141.dat
new file mode 100644
index 0000000..f957889
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c141.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c150.dat b/5. ML_tutorial/metastore_db/seg0/c150.dat
new file mode 100644
index 0000000..afa16c3
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c150.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c161.dat b/5. ML_tutorial/metastore_db/seg0/c161.dat
new file mode 100644
index 0000000..f7a2498
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c161.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c171.dat b/5. ML_tutorial/metastore_db/seg0/c171.dat
new file mode 100644
index 0000000..dd6f52d
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c171.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c180.dat b/5. ML_tutorial/metastore_db/seg0/c180.dat
new file mode 100644
index 0000000..f137f50
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c180.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c191.dat b/5. ML_tutorial/metastore_db/seg0/c191.dat
new file mode 100644
index 0000000..590d9d0
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c191.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c1a1.dat b/5. ML_tutorial/metastore_db/seg0/c1a1.dat
new file mode 100644
index 0000000..1a32a7c
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c1a1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c1b1.dat b/5. ML_tutorial/metastore_db/seg0/c1b1.dat
new file mode 100644
index 0000000..2292fcd
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c1b1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c1c0.dat b/5. ML_tutorial/metastore_db/seg0/c1c0.dat
new file mode 100644
index 0000000..c5b91e2
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c1c0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c1d1.dat b/5. ML_tutorial/metastore_db/seg0/c1d1.dat
new file mode 100644
index 0000000..451f02f
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c1d1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c1e0.dat b/5. ML_tutorial/metastore_db/seg0/c1e0.dat
new file mode 100644
index 0000000..efa6854
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c1e0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c1f1.dat b/5. ML_tutorial/metastore_db/seg0/c1f1.dat
new file mode 100644
index 0000000..2074097
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c1f1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c20.dat b/5. ML_tutorial/metastore_db/seg0/c20.dat
new file mode 100644
index 0000000..a804094
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c20.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c200.dat b/5. ML_tutorial/metastore_db/seg0/c200.dat
new file mode 100644
index 0000000..97e5061
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c200.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c211.dat b/5. ML_tutorial/metastore_db/seg0/c211.dat
new file mode 100644
index 0000000..448c3cb
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c211.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c221.dat b/5. ML_tutorial/metastore_db/seg0/c221.dat
new file mode 100644
index 0000000..cc54bd1
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c221.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c230.dat b/5. ML_tutorial/metastore_db/seg0/c230.dat
new file mode 100644
index 0000000..3999677
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c230.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c241.dat b/5. ML_tutorial/metastore_db/seg0/c241.dat
new file mode 100644
index 0000000..3b25557
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c241.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c251.dat b/5. ML_tutorial/metastore_db/seg0/c251.dat
new file mode 100644
index 0000000..add1776
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c251.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c260.dat b/5. ML_tutorial/metastore_db/seg0/c260.dat
new file mode 100644
index 0000000..25f81fd
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c260.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c271.dat b/5. ML_tutorial/metastore_db/seg0/c271.dat
new file mode 100644
index 0000000..51cde57
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c271.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c281.dat b/5. ML_tutorial/metastore_db/seg0/c281.dat
new file mode 100644
index 0000000..cfed875
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c281.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c290.dat b/5. ML_tutorial/metastore_db/seg0/c290.dat
new file mode 100644
index 0000000..a85589e
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c290.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c2a1.dat b/5. ML_tutorial/metastore_db/seg0/c2a1.dat
new file mode 100644
index 0000000..8e2ed6a
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c2a1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c2b1.dat b/5. ML_tutorial/metastore_db/seg0/c2b1.dat
new file mode 100644
index 0000000..2a29692
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c2b1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c2c1.dat b/5. ML_tutorial/metastore_db/seg0/c2c1.dat
new file mode 100644
index 0000000..5511575
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c2c1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c2d0.dat b/5. ML_tutorial/metastore_db/seg0/c2d0.dat
new file mode 100644
index 0000000..4adc6e4
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c2d0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c2e1.dat b/5. ML_tutorial/metastore_db/seg0/c2e1.dat
new file mode 100644
index 0000000..b37b9b2
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c2e1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c2f0.dat b/5. ML_tutorial/metastore_db/seg0/c2f0.dat
new file mode 100644
index 0000000..d854b4b
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c2f0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c300.dat b/5. ML_tutorial/metastore_db/seg0/c300.dat
new file mode 100644
index 0000000..2053e01
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c300.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c31.dat b/5. ML_tutorial/metastore_db/seg0/c31.dat
new file mode 100644
index 0000000..3b48fbf
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c31.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c311.dat b/5. ML_tutorial/metastore_db/seg0/c311.dat
new file mode 100644
index 0000000..f60c260
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c311.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c321.dat b/5. ML_tutorial/metastore_db/seg0/c321.dat
new file mode 100644
index 0000000..a9d7453
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c321.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c331.dat b/5. ML_tutorial/metastore_db/seg0/c331.dat
new file mode 100644
index 0000000..85ee72b
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c331.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c340.dat b/5. ML_tutorial/metastore_db/seg0/c340.dat
new file mode 100644
index 0000000..d99b11a
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c340.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c351.dat b/5. ML_tutorial/metastore_db/seg0/c351.dat
new file mode 100644
index 0000000..f822f4c
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c351.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c361.dat b/5. ML_tutorial/metastore_db/seg0/c361.dat
new file mode 100644
index 0000000..b5c8f25
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c361.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c371.dat b/5. ML_tutorial/metastore_db/seg0/c371.dat
new file mode 100644
index 0000000..ad11f01
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c371.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c380.dat b/5. ML_tutorial/metastore_db/seg0/c380.dat
new file mode 100644
index 0000000..75cdabe
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c380.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c391.dat b/5. ML_tutorial/metastore_db/seg0/c391.dat
new file mode 100644
index 0000000..0fcb63a
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c391.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c3a1.dat b/5. ML_tutorial/metastore_db/seg0/c3a1.dat
new file mode 100644
index 0000000..92d5d21
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c3a1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c3b1.dat b/5. ML_tutorial/metastore_db/seg0/c3b1.dat
new file mode 100644
index 0000000..8dd1210
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c3b1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c3c0.dat b/5. ML_tutorial/metastore_db/seg0/c3c0.dat
new file mode 100644
index 0000000..4d061cf
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c3c0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c3d1.dat b/5. ML_tutorial/metastore_db/seg0/c3d1.dat
new file mode 100644
index 0000000..45c9fa2
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c3d1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c3e1.dat b/5. ML_tutorial/metastore_db/seg0/c3e1.dat
new file mode 100644
index 0000000..48f53e6
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c3e1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c3f1.dat b/5. ML_tutorial/metastore_db/seg0/c3f1.dat
new file mode 100644
index 0000000..08acdce
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c3f1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c400.dat b/5. ML_tutorial/metastore_db/seg0/c400.dat
new file mode 100644
index 0000000..1e8976f
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c400.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c41.dat b/5. ML_tutorial/metastore_db/seg0/c41.dat
new file mode 100644
index 0000000..0b54c1c
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c41.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c411.dat b/5. ML_tutorial/metastore_db/seg0/c411.dat
new file mode 100644
index 0000000..8aba2fb
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c411.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c421.dat b/5. ML_tutorial/metastore_db/seg0/c421.dat
new file mode 100644
index 0000000..65775ee
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c421.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c430.dat b/5. ML_tutorial/metastore_db/seg0/c430.dat
new file mode 100644
index 0000000..55c948d
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c430.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c441.dat b/5. ML_tutorial/metastore_db/seg0/c441.dat
new file mode 100644
index 0000000..3948b2a
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c441.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c451.dat b/5. ML_tutorial/metastore_db/seg0/c451.dat
new file mode 100644
index 0000000..fe1ab73
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c451.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c461.dat b/5. ML_tutorial/metastore_db/seg0/c461.dat
new file mode 100644
index 0000000..e6d9854
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c461.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c470.dat b/5. ML_tutorial/metastore_db/seg0/c470.dat
new file mode 100644
index 0000000..c9f2eb1
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c470.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c481.dat b/5. ML_tutorial/metastore_db/seg0/c481.dat
new file mode 100644
index 0000000..397b291
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c481.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c490.dat b/5. ML_tutorial/metastore_db/seg0/c490.dat
new file mode 100644
index 0000000..6d51219
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c490.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c4a1.dat b/5. ML_tutorial/metastore_db/seg0/c4a1.dat
new file mode 100644
index 0000000..7319e62
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c4a1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c4b0.dat b/5. ML_tutorial/metastore_db/seg0/c4b0.dat
new file mode 100644
index 0000000..a411e4f
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c4b0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c4c1.dat b/5. ML_tutorial/metastore_db/seg0/c4c1.dat
new file mode 100644
index 0000000..1d9bff1
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c4c1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c4d1.dat b/5. ML_tutorial/metastore_db/seg0/c4d1.dat
new file mode 100644
index 0000000..d87b641
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c4d1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c4e1.dat b/5. ML_tutorial/metastore_db/seg0/c4e1.dat
new file mode 100644
index 0000000..0d71fdb
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c4e1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c4f0.dat b/5. ML_tutorial/metastore_db/seg0/c4f0.dat
new file mode 100644
index 0000000..d3eeb10
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c4f0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c501.dat b/5. ML_tutorial/metastore_db/seg0/c501.dat
new file mode 100644
index 0000000..19429ff
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c501.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c51.dat b/5. ML_tutorial/metastore_db/seg0/c51.dat
new file mode 100644
index 0000000..abf2f69
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c51.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c510.dat b/5. ML_tutorial/metastore_db/seg0/c510.dat
new file mode 100644
index 0000000..db55bb1
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c510.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c521.dat b/5. ML_tutorial/metastore_db/seg0/c521.dat
new file mode 100644
index 0000000..12ec83f
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c521.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c530.dat b/5. ML_tutorial/metastore_db/seg0/c530.dat
new file mode 100644
index 0000000..1d981b7
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c530.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c541.dat b/5. ML_tutorial/metastore_db/seg0/c541.dat
new file mode 100644
index 0000000..baf7210
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c541.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c550.dat b/5. ML_tutorial/metastore_db/seg0/c550.dat
new file mode 100644
index 0000000..cc7c2ce
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c550.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c561.dat b/5. ML_tutorial/metastore_db/seg0/c561.dat
new file mode 100644
index 0000000..338f98e
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c561.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c570.dat b/5. ML_tutorial/metastore_db/seg0/c570.dat
new file mode 100644
index 0000000..69b6415
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c570.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c581.dat b/5. ML_tutorial/metastore_db/seg0/c581.dat
new file mode 100644
index 0000000..29d0eb5
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c581.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c590.dat b/5. ML_tutorial/metastore_db/seg0/c590.dat
new file mode 100644
index 0000000..390ea07
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c590.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c5a1.dat b/5. ML_tutorial/metastore_db/seg0/c5a1.dat
new file mode 100644
index 0000000..2a19eed
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c5a1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c5b0.dat b/5. ML_tutorial/metastore_db/seg0/c5b0.dat
new file mode 100644
index 0000000..277f0c8
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c5b0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c5c1.dat b/5. ML_tutorial/metastore_db/seg0/c5c1.dat
new file mode 100644
index 0000000..c55b726
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c5c1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c5d0.dat b/5. ML_tutorial/metastore_db/seg0/c5d0.dat
new file mode 100644
index 0000000..fae4cb8
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c5d0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c5e1.dat b/5. ML_tutorial/metastore_db/seg0/c5e1.dat
new file mode 100644
index 0000000..6db40f6
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c5e1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c5f0.dat b/5. ML_tutorial/metastore_db/seg0/c5f0.dat
new file mode 100644
index 0000000..573ea42
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c5f0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c60.dat b/5. ML_tutorial/metastore_db/seg0/c60.dat
new file mode 100644
index 0000000..6067d3f
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c60.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c601.dat b/5. ML_tutorial/metastore_db/seg0/c601.dat
new file mode 100644
index 0000000..1123daf
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c601.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c610.dat b/5. ML_tutorial/metastore_db/seg0/c610.dat
new file mode 100644
index 0000000..6222554
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c610.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c621.dat b/5. ML_tutorial/metastore_db/seg0/c621.dat
new file mode 100644
index 0000000..3dfb417
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c621.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c630.dat b/5. ML_tutorial/metastore_db/seg0/c630.dat
new file mode 100644
index 0000000..d34b5f2
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c630.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c641.dat b/5. ML_tutorial/metastore_db/seg0/c641.dat
new file mode 100644
index 0000000..6393877
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c641.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c650.dat b/5. ML_tutorial/metastore_db/seg0/c650.dat
new file mode 100644
index 0000000..4a41833
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c650.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c661.dat b/5. ML_tutorial/metastore_db/seg0/c661.dat
new file mode 100644
index 0000000..88fae0a
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c661.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c670.dat b/5. ML_tutorial/metastore_db/seg0/c670.dat
new file mode 100644
index 0000000..92a59a3
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c670.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c681.dat b/5. ML_tutorial/metastore_db/seg0/c681.dat
new file mode 100644
index 0000000..e5df84b
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c681.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c690.dat b/5. ML_tutorial/metastore_db/seg0/c690.dat
new file mode 100644
index 0000000..885fb07
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c690.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c6a1.dat b/5. ML_tutorial/metastore_db/seg0/c6a1.dat
new file mode 100644
index 0000000..91811c5
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c6a1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c6b0.dat b/5. ML_tutorial/metastore_db/seg0/c6b0.dat
new file mode 100644
index 0000000..e88f34c
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c6b0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c6c1.dat b/5. ML_tutorial/metastore_db/seg0/c6c1.dat
new file mode 100644
index 0000000..586e530
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c6c1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c6d0.dat b/5. ML_tutorial/metastore_db/seg0/c6d0.dat
new file mode 100644
index 0000000..48676b8
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c6d0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c6e1.dat b/5. ML_tutorial/metastore_db/seg0/c6e1.dat
new file mode 100644
index 0000000..efe6db9
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c6e1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c6f0.dat b/5. ML_tutorial/metastore_db/seg0/c6f0.dat
new file mode 100644
index 0000000..98aecba
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c6f0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c701.dat b/5. ML_tutorial/metastore_db/seg0/c701.dat
new file mode 100644
index 0000000..ecbc088
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c701.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c71.dat b/5. ML_tutorial/metastore_db/seg0/c71.dat
new file mode 100644
index 0000000..cee1f69
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c71.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c711.dat b/5. ML_tutorial/metastore_db/seg0/c711.dat
new file mode 100644
index 0000000..cc99972
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c711.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c721.dat b/5. ML_tutorial/metastore_db/seg0/c721.dat
new file mode 100644
index 0000000..256361d
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c721.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c731.dat b/5. ML_tutorial/metastore_db/seg0/c731.dat
new file mode 100644
index 0000000..3ae8d82
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c731.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c741.dat b/5. ML_tutorial/metastore_db/seg0/c741.dat
new file mode 100644
index 0000000..c55242c
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c741.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c751.dat b/5. ML_tutorial/metastore_db/seg0/c751.dat
new file mode 100644
index 0000000..4a2797d
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c751.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c761.dat b/5. ML_tutorial/metastore_db/seg0/c761.dat
new file mode 100644
index 0000000..b871d2a
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c761.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c771.dat b/5. ML_tutorial/metastore_db/seg0/c771.dat
new file mode 100644
index 0000000..a9a996f
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c771.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c781.dat b/5. ML_tutorial/metastore_db/seg0/c781.dat
new file mode 100644
index 0000000..1a45028
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c781.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c791.dat b/5. ML_tutorial/metastore_db/seg0/c791.dat
new file mode 100644
index 0000000..d392dcc
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c791.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c7a1.dat b/5. ML_tutorial/metastore_db/seg0/c7a1.dat
new file mode 100644
index 0000000..d36839e
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c7a1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c7b1.dat b/5. ML_tutorial/metastore_db/seg0/c7b1.dat
new file mode 100644
index 0000000..85b529d
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c7b1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c7c1.dat b/5. ML_tutorial/metastore_db/seg0/c7c1.dat
new file mode 100644
index 0000000..1d393d5
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c7c1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c7d1.dat b/5. ML_tutorial/metastore_db/seg0/c7d1.dat
new file mode 100644
index 0000000..bee91c5
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c7d1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c7e1.dat b/5. ML_tutorial/metastore_db/seg0/c7e1.dat
new file mode 100644
index 0000000..7bd5c24
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c7e1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c7f1.dat b/5. ML_tutorial/metastore_db/seg0/c7f1.dat
new file mode 100644
index 0000000..28803f6
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c7f1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c801.dat b/5. ML_tutorial/metastore_db/seg0/c801.dat
new file mode 100644
index 0000000..ab99b96
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c801.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c81.dat b/5. ML_tutorial/metastore_db/seg0/c81.dat
new file mode 100644
index 0000000..8b5f7e2
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c81.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c811.dat b/5. ML_tutorial/metastore_db/seg0/c811.dat
new file mode 100644
index 0000000..9ef3d11
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c811.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c821.dat b/5. ML_tutorial/metastore_db/seg0/c821.dat
new file mode 100644
index 0000000..5d5d6bf
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c821.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c831.dat b/5. ML_tutorial/metastore_db/seg0/c831.dat
new file mode 100644
index 0000000..88f3166
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c831.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c840.dat b/5. ML_tutorial/metastore_db/seg0/c840.dat
new file mode 100644
index 0000000..9042097
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c840.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c851.dat b/5. ML_tutorial/metastore_db/seg0/c851.dat
new file mode 100644
index 0000000..274c69d
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c851.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c860.dat b/5. ML_tutorial/metastore_db/seg0/c860.dat
new file mode 100644
index 0000000..b089628
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c860.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c871.dat b/5. ML_tutorial/metastore_db/seg0/c871.dat
new file mode 100644
index 0000000..2e5b378
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c871.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c880.dat b/5. ML_tutorial/metastore_db/seg0/c880.dat
new file mode 100644
index 0000000..5bf36e0
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c880.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c891.dat b/5. ML_tutorial/metastore_db/seg0/c891.dat
new file mode 100644
index 0000000..526a30a
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c891.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c8a0.dat b/5. ML_tutorial/metastore_db/seg0/c8a0.dat
new file mode 100644
index 0000000..313419e
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c8a0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c8b1.dat b/5. ML_tutorial/metastore_db/seg0/c8b1.dat
new file mode 100644
index 0000000..0bbd603
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c8b1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c8c1.dat b/5. ML_tutorial/metastore_db/seg0/c8c1.dat
new file mode 100644
index 0000000..59a500d
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c8c1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c8d1.dat b/5. ML_tutorial/metastore_db/seg0/c8d1.dat
new file mode 100644
index 0000000..98c04ec
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c8d1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c8e1.dat b/5. ML_tutorial/metastore_db/seg0/c8e1.dat
new file mode 100644
index 0000000..36cf238
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c8e1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c8f1.dat b/5. ML_tutorial/metastore_db/seg0/c8f1.dat
new file mode 100644
index 0000000..71f5d2b
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c8f1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c90.dat b/5. ML_tutorial/metastore_db/seg0/c90.dat
new file mode 100644
index 0000000..c925c7b
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c90.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c901.dat b/5. ML_tutorial/metastore_db/seg0/c901.dat
new file mode 100644
index 0000000..679c965
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c901.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c911.dat b/5. ML_tutorial/metastore_db/seg0/c911.dat
new file mode 100644
index 0000000..8b90e9e
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c911.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c920.dat b/5. ML_tutorial/metastore_db/seg0/c920.dat
new file mode 100644
index 0000000..d698933
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c920.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c931.dat b/5. ML_tutorial/metastore_db/seg0/c931.dat
new file mode 100644
index 0000000..b310909
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c931.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c940.dat b/5. ML_tutorial/metastore_db/seg0/c940.dat
new file mode 100644
index 0000000..60bdc95
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c940.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c951.dat b/5. ML_tutorial/metastore_db/seg0/c951.dat
new file mode 100644
index 0000000..590d449
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c951.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c960.dat b/5. ML_tutorial/metastore_db/seg0/c960.dat
new file mode 100644
index 0000000..5f9236b
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c960.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c971.dat b/5. ML_tutorial/metastore_db/seg0/c971.dat
new file mode 100644
index 0000000..624bdcd
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c971.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c981.dat b/5. ML_tutorial/metastore_db/seg0/c981.dat
new file mode 100644
index 0000000..2b878b6
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c981.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c990.dat b/5. ML_tutorial/metastore_db/seg0/c990.dat
new file mode 100644
index 0000000..bc8becf
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c990.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c9a1.dat b/5. ML_tutorial/metastore_db/seg0/c9a1.dat
new file mode 100644
index 0000000..7741cec
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c9a1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c9b1.dat b/5. ML_tutorial/metastore_db/seg0/c9b1.dat
new file mode 100644
index 0000000..b2be31c
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c9b1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c9c0.dat b/5. ML_tutorial/metastore_db/seg0/c9c0.dat
new file mode 100644
index 0000000..1f0b028
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c9c0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c9d1.dat b/5. ML_tutorial/metastore_db/seg0/c9d1.dat
new file mode 100644
index 0000000..088bde0
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c9d1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c9e0.dat b/5. ML_tutorial/metastore_db/seg0/c9e0.dat
new file mode 100644
index 0000000..0d4e9b5
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c9e0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/c9f1.dat b/5. ML_tutorial/metastore_db/seg0/c9f1.dat
new file mode 100644
index 0000000..419b628
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/c9f1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/ca01.dat b/5. ML_tutorial/metastore_db/seg0/ca01.dat
new file mode 100644
index 0000000..3f8f81c
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/ca01.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/ca1.dat b/5. ML_tutorial/metastore_db/seg0/ca1.dat
new file mode 100644
index 0000000..0851ea7
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/ca1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/ca11.dat b/5. ML_tutorial/metastore_db/seg0/ca11.dat
new file mode 100644
index 0000000..f403aab
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/ca11.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/ca21.dat b/5. ML_tutorial/metastore_db/seg0/ca21.dat
new file mode 100644
index 0000000..ca5bad7
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/ca21.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/cb1.dat b/5. ML_tutorial/metastore_db/seg0/cb1.dat
new file mode 100644
index 0000000..630d408
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/cb1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/cc0.dat b/5. ML_tutorial/metastore_db/seg0/cc0.dat
new file mode 100644
index 0000000..2268720
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/cc0.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/cd1.dat b/5. ML_tutorial/metastore_db/seg0/cd1.dat
new file mode 100644
index 0000000..d919a1b
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/cd1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/ce1.dat b/5. ML_tutorial/metastore_db/seg0/ce1.dat
new file mode 100644
index 0000000..299e0c4
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/ce1.dat differ
diff --git a/5. ML_tutorial/metastore_db/seg0/cf0.dat b/5. ML_tutorial/metastore_db/seg0/cf0.dat
new file mode 100644
index 0000000..fedbb7c
Binary files /dev/null and b/5. ML_tutorial/metastore_db/seg0/cf0.dat differ
diff --git a/5. ML_tutorial/metastore_db/service.properties b/5. ML_tutorial/metastore_db/service.properties
new file mode 100644
index 0000000..407c3ac
--- /dev/null
+++ b/5. ML_tutorial/metastore_db/service.properties
@@ -0,0 +1,23 @@
+#/home/paulkim/workspace/Spark/LearningPySpark/Chapter6/metastore_db
+# ********************************************************************
+# *** Please do NOT edit this file. ***
+# *** CHANGING THE CONTENT OF THIS FILE MAY CAUSE DATA CORRUPTION. ***
+# ********************************************************************
+#Mon Mar 05 13:26:39 KST 2018
+SysschemasIndex2Identifier=225
+SyscolumnsIdentifier=144
+SysconglomeratesIndex1Identifier=49
+SysconglomeratesIdentifier=32
+SyscolumnsIndex2Identifier=177
+SysschemasIndex1Identifier=209
+SysconglomeratesIndex3Identifier=81
+SystablesIndex2Identifier=129
+SyscolumnsIndex1Identifier=161
+derby.serviceProtocol=org.apache.derby.database.Database
+SysschemasIdentifier=192
+derby.storage.propertiesId=16
+SysconglomeratesIndex2Identifier=65
+derby.serviceLocale=ko_KR
+SystablesIdentifier=96
+SystablesIndex1Identifier=113
+#--- last line, don't put anything after this line ---