BigData/Spark
Spark + H2O(Pysparkling)
Jonghee Jeon
2020. 5. 9. 17:21
How to use H2O from Spark

Spark's MLlib provides regression, clustering, classification, and similar algorithms, but it offers no deep learning, so for that you need to pair Spark with another package such as TensorFlowOnSpark, BigDL, H2O, or SparkDL.

This post explains how to use H2O.
What is H2O?

H2O is an open-source machine learning platform that provides a wide range of machine learning models and deep learning, as well as newer capabilities such as AutoML. It runs on existing large-scale analytics infrastructure, on distributed clusters such as Hadoop and Spark, and in cloud environments such as S3 and Azure. It can also be driven from R, the most widely used statistics environment. GPU support has recently been added as well, making it even more useful for deep learning work.
https://www.h2o.ai/
The H2O package for Apache Spark is distributed under the name Sparkling Water.
https://www.h2o.ai/products/h2o-sparkling-water/
Sparkling Water runs alongside Spark, inside the executors on the worker nodes. On the driver node it exposes an H2OContext, through which the distributed machine learning algorithms, deep learning, and AutoML become available.

Sparkling Water requires an existing Spark installation; this post covers only installing and using H2O.
Installing and running H2O

1. The steps below assume Spark is already installed and SPARK_HOME is set.

2. Download the archive from the Sparkling Water download page, or fetch it with wget.
[root@spark ~]# cd /tmp
[root@spark tmp]# wget http://h2o-release.s3.amazonaws.com/sparkling-water/spark-2.4/3.28.0.1-1-2.4/sparkling-water-3.28.0.1-1-2.4.zip
[root@spark tmp]# unzip sparkling-water-3.28.0.1-1-2.4.zip
[root@spark tmp]# mkdir -p /opt/sparkling/3.28
[root@spark tmp]# mv sparkling-water-3.28.0.1-1-2.4 /opt/sparkling/3.28
[root@spark tmp]# ln -s /opt/sparkling/3.28 /opt/sparkling/current
[root@spark tmp]# chown -R carbig:carbig /opt/sparkling/
[root@spark tmp]# su - carbig
(base) [carbig@spark ~]$ vim .bash_profile
# Run Pysparkling from Jupyter Notebook
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=192.168.56.101 --port=8888"
### H2O 3.28 #############################
export H2O_HOME=/opt/sparkling/current
export PATH=$PATH:$H2O_HOME/bin
#########################################
(base) [carbig@spark ~]$ source .bash_profile
(base) [carbig@spark ~]$ pip install tabulate
(base) [carbig@spark ~]$ pysparkling
Using PySparkling
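The post does not show the call that produced the session output below. In Sparkling Water 3.28, the H2O cluster is started from a PySparkling shell roughly like this (a minimal sketch, assuming a SparkSession named `spark` already exists, as it does inside `pysparkling`):

```python
from pysparkling import H2OContext

# Start H2O inside the Spark executors and attach the Python client;
# this call prints the cluster-status summary that follows.
hc = H2OContext.getOrCreate(spark)
```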
Connecting to H2O server at http://192.168.56.101:54321 ... successful.
H2O cluster uptime:         10 secs
H2O cluster timezone:       Asia/Seoul
H2O data parsing timezone:  UTC
H2O cluster version:        3.28.0.1
H2O cluster version age:    1 month and 20 days
H2O cluster name:           sparkling-water-carbig_local-1580950997532
H2O cluster total nodes:    1
H2O cluster free memory:    794 Mb
H2O cluster total cores:    2
H2O cluster allowed cores:  2
H2O cluster status:         accepting new members, healthy
H2O connection url:         http://192.168.56.101:54321
H2O connection proxy:       None
H2O internal security:      False
H2O API Extensions:         XGBoost, Algos, Amazon S3, AutoML, Core V3, TargetEncoder, Core V4
Python version:             3.7.4 final
Sparkling Water Context:
* Sparkling Water Version: 3.28.0.1-1-2.4
* H2O name: sparkling-water-carbig_local-1580950997532
* cluster size: 1
* list of used nodes:
(executorId, host, port)
------------------------
(driver,192.168.56.101,54321)
------------------------
Open H2O Flow in browser: http://192.168.56.101:54321 (CMD + click in Mac OSX)
root
|-- 0.00632: double (nullable = true)
|-- 18.00: double (nullable = true)
|-- 2.310: double (nullable = true)
|-- 0: double (nullable = true)
|-- 0.5380: double (nullable = true)
|-- 6.5750: double (nullable = true)
|-- 65.20: double (nullable = true)
|-- 4.0900: double (nullable = true)
|-- 1: double (nullable = true)
|-- 296.0: double (nullable = true)
|-- 15.30: double (nullable = true)
|-- 396.90: double (nullable = true)
|-- 4.98: double (nullable = true)
|-- 24.00: double (nullable = true)
+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
| 0.00632| 18.00| 2.310| 0| 0.5380| 6.5750| 65.20| 4.0900| 1| 296.0| 15.30| 396.90| 4.98| 24.00|
+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
| 0.02731| 0.0| 7.07|0.0| 0.469| 6.421| 78.9| 4.9671| 2.0| 242.0| 17.8| 396.9| 9.14| 21.6|
| 0.02729| 0.0| 7.07|0.0| 0.469| 7.185| 61.1| 4.9671| 2.0| 242.0| 17.8| 392.83| 4.03| 34.7|
| 0.03237| 0.0| 2.18|0.0| 0.458| 6.998| 45.8| 6.0622| 3.0| 222.0| 18.7| 394.63| 2.94| 33.4|
| 0.06905| 0.0| 2.18|0.0| 0.458| 7.147| 54.2| 6.0622| 3.0| 222.0| 18.7| 396.9| 5.33| 36.2|
| 0.02985| 0.0| 2.18|0.0| 0.458| 6.43| 58.7| 6.0622| 3.0| 222.0| 18.7| 394.12| 5.21| 28.7|
+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
only showing top 5 rows
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
| crim| zn|indus|chas| nox| rm| age| dis|rad| tax|ptratio| b|lstat|medv|
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
|0.00632|18.0| 2.31| 0.0|0.538|6.575|65.2| 4.09|1.0|296.0| 15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07| 0.0|0.469|6.421|78.9|4.9671|2.0|242.0| 17.8| 396.9| 9.14|21.6|
|0.02729| 0.0| 7.07| 0.0|0.469|7.185|61.1|4.9671|2.0|242.0| 17.8|392.83| 4.03|34.7|
|0.03237| 0.0| 2.18| 0.0|0.458|6.998|45.8|6.0622|3.0|222.0| 18.7|394.63| 2.94|33.4|
|0.06905| 0.0| 2.18| 0.0|0.458|7.147|54.2|6.0622|3.0|222.0| 18.7| 396.9| 5.33|36.2|
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
only showing top 5 rows
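The training call itself is not shown in the post. Judging from the model summary that follows (six hidden rectifier layers with per-layer dropout, Gaussian regression on medv, 200 epochs), it was presumably something close to this sketch; the file path, split ratio, and seed are assumptions, not taken from the original:

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()  # or reuse the cluster already started by H2OContext

# Hypothetical location of the renamed Boston-housing data
hf = h2o.import_file("/tmp/housing.csv")
train, valid = hf.split_frame(ratios=[0.8], seed=42)

model = H2ODeepLearningEstimator(
    activation="RectifierWithDropout",
    hidden=[500, 500, 500, 200, 500, 100],                 # matches the layer summary below
    hidden_dropout_ratios=[0.4, 0.2, 0.4, 0.3, 0.2, 0.1],  # shown below as 40/20/40/30/20/10 (%)
    epochs=200,
)
model.train(x=hf.columns[:-1], y="medv",
            training_frame=train, validation_frame=valid)
print(model)
```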
deeplearning Model Build progress: |██████████████████████████████████████| 100%
Model Details
=============
H2ODeepLearningEstimator : Deep Learning
Model Key: DeepLearning_model_python_1580951906836_2
Status of Neuron Layers: predicting medv, regression, gaussian distribution, Quadratic loss, 758,901 weights/biases, 8.7 MB, 76,600 training samples, mini-batch size 1
   layer  units  type              dropout  l1  l2  mean_rate    rate_rms   momentum  mean_weight  weight_rms  mean_bias   bias_rms
0  1      13     Input             0
1  2      500    RectifierDropout  40       0   0   0.0314611    0.024127   0         0.000921845  0.106943    0.445156    0.0468657
2  3      500    RectifierDropout  20       0   0   0.0842535    0.0618545  0         -0.0144979   0.0532604   0.953914    0.0260939
3  4      500    RectifierDropout  40       0   0   0.0909126    0.0673292  0         -0.00719517  0.0492706   0.953027    0.0540903
4  5      200    RectifierDropout  30       0   0   0.0424173    0.0298493  0         -0.00827373  0.0554473   0.974926    0.0182223
5  6      500    RectifierDropout  20       0   0   0.0597374    0.0423007  0         -0.00713246  0.0554903   0.964627    0.0517056
6  7      100    RectifierDropout  10       0   0   0.0274848    0.0529421  0         -0.00603756  0.0583177   0.997556    0.0143366
7  8      1      Linear                     0   0   0.000374757  0.0002951  0         -0.0066581   0.115997    -0.0365526  1.09713e-154
ModelMetricsRegression: deeplearning
** Reported on train data. **
MSE: 5.260548402792165
RMSE: 2.2935885426100655
MAE: 1.6893793644383934
RMSLE: 0.12128720117058393
Mean Residual Deviance: 5.260548402792165
ModelMetricsRegression: deeplearning
** Reported on validation data. **
MSE: 12.554458548266163
RMSE: 3.5432271375493505
MAE: 2.704582976284349
RMSLE: 0.16411775566026215
Mean Residual Deviance: 12.554458548266163
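A quick consistency check on these metrics: RMSE is simply the square root of MSE, and for H2O's Gaussian distribution the mean residual deviance equals the MSE, which is why those two lines repeat the same value:

```python
import math

# (MSE, RMSE) pairs exactly as reported above
reported = {
    "train": (5.260548402792165, 2.2935885426100655),
    "valid": (12.554458548266163, 3.5432271375493505),
}

for split, (mse, rmse) in reported.items():
    assert math.isclose(math.sqrt(mse), rmse)  # RMSE == sqrt(MSE)
```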
Scoring History:
    timestamp            duration          training_speed  epochs  iterations  samples  training_rmse  training_deviance  training_mae  training_r2  validation_rmse  validation_deviance  validation_mae  validation_r2
0   2020-02-06 10:52:31  0.000 sec         None            0.0     0           0.0      NaN            NaN                NaN           NaN          NaN              NaN                  NaN             NaN
1   2020-02-06 10:52:39  8.138 sec         495 obs/sec     10.0    1           3830.0   4.559921       20.792879          3.063984      0.730718     6.853939         46.976474            4.171207        0.544797
2   2020-02-06 10:52:45  13.752 sec        581 obs/sec     20.0    2           7660.0   3.882560       15.074271          2.556876      0.804778     6.090655         37.096082            3.703239        0.640539
3   2020-02-06 10:52:54  22.788 sec        695 obs/sec     40.0    4           15320.0  3.462748       11.990626          2.324721      0.844713     5.514536         30.410111            3.257460        0.705326
4   2020-02-06 10:53:01  30.607 sec        774 obs/sec     60.0    6           22980.0  3.246251       10.538147          2.285182      0.863524     5.208311         27.126506            3.160191        0.737144
5   2020-02-06 10:53:09  37.949 sec        831 obs/sec     80.0    8           30640.0  2.698576       7.282314           1.995358      0.905689     4.361370         19.021545            3.056471        0.815681
6   2020-02-06 10:53:15  44.636 sec        883 obs/sec     100.0   10          38300.0  2.559404       6.550546           1.837415      0.915166     4.629112         21.428675            2.959663        0.792356
7   2020-02-06 10:53:22  51.509 sec        918 obs/sec     120.0   12          45960.0  2.593123       6.724287           1.871866      0.912916     4.725163         22.327166            2.834982        0.783649
8   2020-02-06 10:53:28  57.647 sec        958 obs/sec     140.0   14          53620.0  2.643691       6.989101           1.947474      0.909486     4.604811         21.204288            2.853517        0.794530
9   2020-02-06 10:53:34  1 min  3.741 sec  990 obs/sec     160.0   16          61280.0  2.487583       6.188068           1.788557      0.919860     4.323711         18.694475            2.786261        0.818850
10  2020-02-06 10:53:40  1 min  9.534 sec  1021 obs/sec    180.0   18          68940.0  2.385202       5.689189           1.750422      0.926321     4.147532         17.202023            2.709004        0.833312
11  2020-02-06 10:53:46  1 min 15.288 sec  1048 obs/sec    200.0   20          76600.0  2.293589       5.260548           1.689379      0.931872     3.543227         12.554459            2.704583        0.878347
Variable Importances:
    variable  relative_importance  scaled_importance  percentage
0   lstat     1.000000             1.000000           0.131614
1   rm        0.987815             0.987815           0.130010
2   ptratio   0.618191             0.618191           0.081362
3   crim      0.607228             0.607228           0.079920
4   age       0.598154             0.598154           0.078725
5   rad       0.558201             0.558201           0.073467
6   nox       0.534574             0.534574           0.070357
7   tax       0.529267             0.529267           0.069659
8   b         0.529082             0.529082           0.069634
9   dis       0.450081             0.450081           0.059237
10  indus     0.447967             0.447967           0.058959
11  chas      0.404590             0.404590           0.053250
12  zn        0.332843             0.332843           0.043807
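The three importance columns are redundant views of one vector: scaled_importance is relative_importance divided by its maximum (1.0 for lstat, so the two columns coincide here), and percentage divides each scaled value by the column sum, so the percentages add up to 1. Checking with the reported values:

```python
# relative_importance column, top (lstat) to bottom (zn)
rel = [1.000000, 0.987815, 0.618191, 0.607228, 0.598154, 0.558201, 0.534574,
       0.529267, 0.529082, 0.450081, 0.447967, 0.404590, 0.332843]

scaled = [r / max(rel) for r in rel]     # identical to rel, since the max is 1.0
pct = [s / sum(scaled) for s in scaled]  # the percentage column

assert abs(pct[0] - 0.131614) < 1e-5     # lstat's reported share
assert abs(sum(pct) - 1.0) < 1e-12       # percentages sum to 1
```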