BigData/Spark

Spark + H2O(Pysparkling)

Jonghee Jeon 2020. 5. 9. 17:21

Spark에서 H2O를 사용하는 방법

 

Spark는 Regression, Clustering, Classification 등 ML관련 라이브러리를 제공하지만 Deeplearning 등의 라이브러리는 제공되고 있지 않아 TensorflowOnSpark, BigDL, H2O, SparkDL과 같은 타 패키지와 함께 사용하여야 합니다.

 

본 포스팅에서는 H2O를 사용하는 방법을 설명합니다.


H2O란?

 오픈소스 머신러닝 플랫폼으로 다양한 머신러닝 모델과 딥러닝 등을 포함해 AutoML과 같은 최신 알고리즘도 제공하고 있다. 기존의 대형 분석 인프라와 Hadoop, Spark와 같은 분산 클러스터 환경과 S3, Azure 같은 클라우드 환경에서도 동작한다. 그리고 가장 대중적으로 사용되는 통계 프로그램인 R과 연결하여 사용 할 수도 있다.

 최근 GPU 환경도 지원하여 딥러닝 모델 개발에 더욱 많은 지원을 하게 되었다.

 

https://www.h2o.ai/

 

Home - Open Source Leader in AI and ML

H2O.ai is the creator of H2O the leading open source machine learning and artificial intelligence platform trusted by data scientists across 14K enterprises globally. Our vision is to democratize intelligence for everyone with our award winning “AI to do

www.h2o.ai

Apache Spark를 지원하는 H2O 패키지는 Sparkling Water라는 이름으로 지원하고 있다.

https://www.h2o.ai/products/h2o-sparkling-water/

 

H2O Sparkling Water - Open Source Leader in AI and ML

Open Source Leader in AI and ML - H2O Sparkling Water - Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark.

www.h2o.ai

https://www.h2o.ai/products/h2o-sparkling-water/

Sparkling Water는 Spark 클러스터의 Worker Node executer상에 올라와 함께 동작하게 된다. Driver Node에서는 H2O Context를 제공하여 분산 환경에서 다양한 머신러닝 알고리즘과 딥러닝, AutoML을 사용 할 수 있게 된다.

 

Spakrling Water를 사용하기 위해서는 먼저 Spark가 설치되어 있어야 하며, 본 포스팅에서는 H2O 설치와 사용방법만 설명한다.

 

 

H2O 설치 및 실행

1. 기존 Spark가 설치되어 있으며, SPARK_HOME이 설정되어 있다는 전제하에 설치를 수행한다.

2. 아래의 Sparkling Water 다운로드 페이지에서 파일은 다운로드 하거나 wget으로 다운로드 합니다.

 - [root@spark tmp]# wget www.s3.amazonaws.com/h2o-release/sparkling-water/spark-2.4/3.30.0.2-1-2.4/sparkling-water-3.30.0.2-1-2.4.zip

[root@spark ~]# cd /tmp
[root@spark tmp]# unzip sparkling-water-3.28.0.1-1-2.4.zip
[root@spark tmp]# mkdir -p /opt/sparkling/3.28
[root@spark tmp]# mv sparkling-water-3.28.0.1-1-2.4 /opt/sparkling/3.28
[root@spark tmp]# ln -s /opt/sparkling/3.28 /opt/sparkling/current
[root@spark tmp]# chown -R carbig:carbig /opt/sparkling/
[root@spark tmp]# su - carbig
(base) [carbig@spark ~]$ vim .bash_profile

# Jupyter Notebook에서 Pysparkling 실행
# export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=192.168.56.101 --port=8888" pyspark
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=192.168.56.101 --port=8888" pysparkling

### H2O 3.28 #############################
export H2O_HOME=/opt/sparkling/current
export PATH=$PATH:$H2O_HOME/bin
#########################################

(base) [carbig@spark ~]$ source .bash_profile
(base) [carbig@spark ~]$ pip install tabulate
(base) [carbig@spark ~]$ pysparkling

PySparkling 사용

 

 

 

In [10]:
from pysparkling import *
import h2o
In [11]:
hc = H2OContext.getOrCreate(spark)
 
Connecting to H2O server at http://192.168.56.101:54321 ... successful.
 
H2O cluster uptime: 10 secs
H2O cluster timezone: Asia/Seoul
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.1
H2O cluster version age: 1 month and 20 days
H2O cluster name: sparkling-water-carbig_local-1580950997532
H2O cluster total nodes: 1
H2O cluster free memory: 794 Mb
H2O cluster total cores: 2
H2O cluster allowed cores: 2
H2O cluster status: accepting new members, healthy
H2O connection url: http://192.168.56.101:54321
H2O connection proxy: None
H2O internal security: False
H2O API Extensions: XGBoost, Algos, Amazon S3, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.7.4 final
 
Sparkling Water Context:
 * Sparkling Water Version: 3.28.0.1-1-2.4
 * H2O name: sparkling-water-carbig_local-1580950997532
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (driver,192.168.56.101,54321)
  ------------------------

  Open H2O Flow in browser: http://192.168.56.101:54321 (CMD + click in Mac OSX)

    
In [1]:
housingDF = spark.read.csv("data/housing.data", inferSchema=True, header=True)
In [3]:
housingDF.printSchema()
 
root
 |--  0.00632: double (nullable = true)
 |--   18.00: double (nullable = true)
 |--    2.310: double (nullable = true)
 |--   0: double (nullable = true)
 |--   0.5380: double (nullable = true)
 |--   6.5750: double (nullable = true)
 |--   65.20: double (nullable = true)
 |--   4.0900: double (nullable = true)
 |--    1: double (nullable = true)
 |--   296.0: double (nullable = true)
 |--   15.30: double (nullable = true)
 |--  396.90: double (nullable = true)
 |--    4.98: double (nullable = true)
 |--   24.00: double (nullable = true)

In [5]:
housingDF.show(5)
 
+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
| 0.00632|  18.00|   2.310|  0|  0.5380|  6.5750|  65.20|  4.0900|   1|  296.0|  15.30| 396.90|   4.98|  24.00|
+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
| 0.02731|    0.0|    7.07|0.0|   0.469|   6.421|   78.9|  4.9671| 2.0|  242.0|   17.8|  396.9|   9.14|   21.6|
| 0.02729|    0.0|    7.07|0.0|   0.469|   7.185|   61.1|  4.9671| 2.0|  242.0|   17.8| 392.83|   4.03|   34.7|
| 0.03237|    0.0|    2.18|0.0|   0.458|   6.998|   45.8|  6.0622| 3.0|  222.0|   18.7| 394.63|   2.94|   33.4|
| 0.06905|    0.0|    2.18|0.0|   0.458|   7.147|   54.2|  6.0622| 3.0|  222.0|   18.7|  396.9|   5.33|   36.2|
| 0.02985|    0.0|    2.18|0.0|   0.458|    6.43|   58.7|  6.0622| 3.0|  222.0|   18.7| 394.12|   5.21|   28.7|
+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
only showing top 5 rows

In [6]:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, DoubleType
In [7]:
housingSchema = StructType([
    StructField("crim", DoubleType(), True),
    StructField("zn", DoubleType(), True),
    StructField("indus", DoubleType(), True),
    StructField("chas", DoubleType(), True),
    StructField("nox", DoubleType(), True),
    StructField("rm", DoubleType(), True),
    StructField("age", DoubleType(), True),
    StructField("dis", DoubleType(), True),
    StructField("rad", DoubleType(), True),
    StructField("tax", DoubleType(), True),
    StructField("ptratio", DoubleType(), True),
    StructField("b", DoubleType(), True),
    StructField("lstat", DoubleType(), True),
    StructField("medv", DoubleType(), True),
])
In [8]:
housingDF = spark.read.csv("data/housing.data", inferSchema=True, schema=housingSchema)
In [9]:
housingDF.show(5)
 
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|  tax|ptratio|     b|lstat|medv|
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
|0.00632|18.0| 2.31| 0.0|0.538|6.575|65.2|  4.09|1.0|296.0|   15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07| 0.0|0.469|6.421|78.9|4.9671|2.0|242.0|   17.8| 396.9| 9.14|21.6|
|0.02729| 0.0| 7.07| 0.0|0.469|7.185|61.1|4.9671|2.0|242.0|   17.8|392.83| 4.03|34.7|
|0.03237| 0.0| 2.18| 0.0|0.458|6.998|45.8|6.0622|3.0|222.0|   18.7|394.63| 2.94|33.4|
|0.06905| 0.0| 2.18| 0.0|0.458|7.147|54.2|6.0622|3.0|222.0|   18.7| 396.9| 5.33|36.2|
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
only showing top 5 rows

In [12]:
housingDFh2o = hc.as_h2o_frame(housingDF, "housing")
In [14]:
splitDF = housingDFh2o.split_frame([0.75, 0.24])
In [15]:
train = splitDF[0]
In [16]:
train.frame_id = "housing_train"
In [17]:
test = splitDF[1]
test.frame_id = "housing_test"
In [18]:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
In [23]:
m = H2ODeepLearningEstimator(hidden=[500, 500, 500, 200, 500, 100], epochs=200, \
                            activation='rectifierwithdropout', \
                            hidden_dropout_ratios=[0.2, 0.2, 0.3, 0.3, 0.2, 0.1])
In [24]:
m.train(x=train.names[:-1], y=train.names[13], training_frame=train, \
       validation_frame=test)
 
deeplearning Model Build progress: |██████████████████████████████████████| 100%
In [25]:
m.show()
 
Model Details
=============
H2ODeepLearningEstimator :  Deep Learning
Model Key:  DeepLearning_model_python_1580951906836_2


Status of Neuron Layers: predicting medv, regression, gaussian distribution, Quadratic loss, 758,901 weights/biases, 8.7 MB, 76,600 training samples, mini-batch size 1
 
    layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight weight_rms mean_bias bias_rms
0   1 13 Input 0                  
1   2 500 RectifierDropout 40 0 0 0.0314611 0.024127 0 0.000921845 0.106943 0.445156 0.0468657
2   3 500 RectifierDropout 20 0 0 0.0842535 0.0618545 0 -0.0144979 0.0532604 0.953914 0.0260939
3   4 500 RectifierDropout 40 0 0 0.0909126 0.0673292 0 -0.00719517 0.0492706 0.953027 0.0540903
4   5 200 RectifierDropout 30 0 0 0.0424173 0.0298493 0 -0.00827373 0.0554473 0.974926 0.0182223
5   6 500 RectifierDropout 20 0 0 0.0597374 0.0423007 0 -0.00713246 0.0554903 0.964627 0.0517056
6   7 100 RectifierDropout 10 0 0 0.0274848 0.0529421 0 -0.00603756 0.0583177 0.997556 0.0143366
7   8 1 Linear   0 0 0.000374757 0.0002951 0 -0.0066581 0.115997 -0.0365526 1.09713e-154
 

ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 5.260548402792165
RMSE: 2.2935885426100655
MAE: 1.6893793644383934
RMSLE: 0.12128720117058393
Mean Residual Deviance: 5.260548402792165

ModelMetricsRegression: deeplearning
** Reported on validation data. **

MSE: 12.554458548266163
RMSE: 3.5432271375493505
MAE: 2.704582976284349
RMSLE: 0.16411775566026215
Mean Residual Deviance: 12.554458548266163

Scoring History: 
 
    timestamp duration training_speed epochs iterations samples training_rmse training_deviance training_mae training_r2 validation_rmse validation_deviance validation_mae validation_r2
0   2020-02-06 10:52:31 0.000 sec None 0.0 0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1   2020-02-06 10:52:39 8.138 sec 495 obs/sec 10.0 1 3830.0 4.559921 20.792879 3.063984 0.730718 6.853939 46.976474 4.171207 0.544797
2   2020-02-06 10:52:45 13.752 sec 581 obs/sec 20.0 2 7660.0 3.882560 15.074271 2.556876 0.804778 6.090655 37.096082 3.703239 0.640539
3   2020-02-06 10:52:54 22.788 sec 695 obs/sec 40.0 4 15320.0 3.462748 11.990626 2.324721 0.844713 5.514536 30.410111 3.257460 0.705326
4   2020-02-06 10:53:01 30.607 sec 774 obs/sec 60.0 6 22980.0 3.246251 10.538147 2.285182 0.863524 5.208311 27.126506 3.160191 0.737144
5   2020-02-06 10:53:09 37.949 sec 831 obs/sec 80.0 8 30640.0 2.698576 7.282314 1.995358 0.905689 4.361370 19.021545 3.056471 0.815681
6   2020-02-06 10:53:15 44.636 sec 883 obs/sec 100.0 10 38300.0 2.559404 6.550546 1.837415 0.915166 4.629112 21.428675 2.959663 0.792356
7   2020-02-06 10:53:22 51.509 sec 918 obs/sec 120.0 12 45960.0 2.593123 6.724287 1.871866 0.912916 4.725163 22.327166 2.834982 0.783649
8   2020-02-06 10:53:28 57.647 sec 958 obs/sec 140.0 14 53620.0 2.643691 6.989101 1.947474 0.909486 4.604811 21.204288 2.853517 0.794530
9   2020-02-06 10:53:34 1 min 3.741 sec 990 obs/sec 160.0 16 61280.0 2.487583 6.188068 1.788557 0.919860 4.323711 18.694475 2.786261 0.818850
10   2020-02-06 10:53:40 1 min 9.534 sec 1021 obs/sec 180.0 18 68940.0 2.385202 5.689189 1.750422 0.926321 4.147532 17.202023 2.709004 0.833312
11   2020-02-06 10:53:46 1 min 15.288 sec 1048 obs/sec 200.0 20 76600.0 2.293589 5.260548 1.689379 0.931872 3.543227 12.554459 2.704583 0.878347
 
Variable Importances: 
 
  variable relative_importance scaled_importance percentage
0 lstat 1.000000 1.000000 0.131614
1 rm 0.987815 0.987815 0.130010
2 ptratio 0.618191 0.618191 0.081362
3 crim 0.607228 0.607228 0.079920
4 age 0.598154 0.598154 0.078725
5 rad 0.558201 0.558201 0.073467
6 nox 0.534574 0.534574 0.070357
7 tax 0.529267 0.529267 0.069659
8 b 0.529082 0.529082 0.069634
9 dis 0.450081 0.450081 0.059237
10 indus 0.447967 0.447967 0.058959
11 chas 0.404590 0.404590 0.053250
12 zn 0.332843 0.332843 0.043807
In [ ]: