from pysparkling import *
import h2o

hc = H2OContext.getOrCreate(spark)

Connecting to H2O server at http://192.168.56.101:54321 ... successful.

Sparkling Water Context:
 * Sparkling Water Version: 3.28.0.1-1-2.4
 * H2O name: sparkling-water-carbig_local-1580950997532
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (driver,192.168.56.101,54321)
  ------------------------

  Open H2O Flow in browser: http://192.168.56.101:54321 (CMD + click in Mac OSX)

housingDF = spark.read.csv("data/housing.data", inferSchema=True, header=True)

housingDF.printSchema()

root
 |--  0.00632: double (nullable = true)
 |--   18.00: double (nullable = true)
 |--    2.310: double (nullable = true)
 |--   0: double (nullable = true)
 |--   0.5380: double (nullable = true)
 |--   6.5750: double (nullable = true)
 |--   65.20: double (nullable = true)
 |--   4.0900: double (nullable = true)
 |--    1: double (nullable = true)
 |--   296.0: double (nullable = true)
 |--   15.30: double (nullable = true)
 |--  396.90: double (nullable = true)
 |--    4.98: double (nullable = true)
 |--   24.00: double (nullable = true)

housingDF.show(5)

+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
| 0.00632|  18.00|   2.310|  0|  0.5380|  6.5750|  65.20|  4.0900|   1|  296.0|  15.30| 396.90|   4.98|  24.00|
+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
| 0.02731|    0.0|    7.07|0.0|   0.469|   6.421|   78.9|  4.9671| 2.0|  242.0|   17.8|  396.9|   9.14|   21.6|
| 0.02729|    0.0|    7.07|0.0|   0.469|   7.185|   61.1|  4.9671| 2.0|  242.0|   17.8| 392.83|   4.03|   34.7|
| 0.03237|    0.0|    2.18|0.0|   0.458|   6.998|   45.8|  6.0622| 3.0|  222.0|   18.7| 394.63|   2.94|   33.4|
| 0.06905|    0.0|    2.18|0.0|   0.458|   7.147|   54.2|  6.0622| 3.0|  222.0|   18.7|  396.9|   5.33|   36.2|
| 0.02985|    0.0|    2.18|0.0|   0.458|    6.43|   58.7|  6.0622| 3.0|  222.0|   18.7| 394.12|   5.21|   28.7|
+--------+-------+--------+---+--------+--------+-------+--------+----+-------+-------+-------+-------+-------+
only showing top 5 rows

from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, DoubleType

housingSchema = StructType([
    StructField("crim", DoubleType(), True),
    StructField("zn", DoubleType(), True),
    StructField("indus", DoubleType(), True),
    StructField("chas", DoubleType(), True),
    StructField("nox", DoubleType(), True),
    StructField("rm", DoubleType(), True),
    StructField("age", DoubleType(), True),
    StructField("dis", DoubleType(), True),
    StructField("rad", DoubleType(), True),
    StructField("tax", DoubleType(), True),
    StructField("ptratio", DoubleType(), True),
    StructField("b", DoubleType(), True),
    StructField("lstat", DoubleType(), True),
    StructField("medv", DoubleType(), True),
])

housingDF = spark.read.csv("data/housing.data", inferSchema=True, schema=housingSchema)

housingDF.show(5)

+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|  tax|ptratio|     b|lstat|medv|
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
|0.00632|18.0| 2.31| 0.0|0.538|6.575|65.2|  4.09|1.0|296.0|   15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07| 0.0|0.469|6.421|78.9|4.9671|2.0|242.0|   17.8| 396.9| 9.14|21.6|
|0.02729| 0.0| 7.07| 0.0|0.469|7.185|61.1|4.9671|2.0|242.0|   17.8|392.83| 4.03|34.7|
|0.03237| 0.0| 2.18| 0.0|0.458|6.998|45.8|6.0622|3.0|222.0|   18.7|394.63| 2.94|33.4|
|0.06905| 0.0| 2.18| 0.0|0.458|7.147|54.2|6.0622|3.0|222.0|   18.7| 396.9| 5.33|36.2|
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+----+
only showing top 5 rows

housingDFh2o = hc.as_h2o_frame(housingDF, "housing")

splitDF = housingDFh2o.split_frame([0.75, 0.24])

train = splitDF[0]

train.frame_id = "housing_train"

test = splitDF[1]
test.frame_id = "housing_test"

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

m = H2ODeepLearningEstimator(hidden=[500, 500, 500, 200, 500, 100], epochs=200, \
                            activation='rectifierwithdropout', \
                            hidden_dropout_ratios=[0.2, 0.2, 0.3, 0.3, 0.2, 0.1])

m.train(x=train.names[:-1], y=train.names[13], training_frame=train, \
       validation_frame=test)

deeplearning Model Build progress: |██████████████████████████████████████| 100%

m.show()

Model Details
=============
H2ODeepLearningEstimator :  Deep Learning
Model Key:  DeepLearning_model_python_1580951906836_2


Status of Neuron Layers: predicting medv, regression, gaussian distribution, Quadratic loss, 758,901 weights/biases, 8.7 MB, 76,600 training samples, mini-batch size 1


ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 5.260548402792165
RMSE: 2.2935885426100655
MAE: 1.6893793644383934
RMSLE: 0.12128720117058393
Mean Residual Deviance: 5.260548402792165

ModelMetricsRegression: deeplearning
** Reported on validation data. **

MSE: 12.554458548266163
RMSE: 3.5432271375493505
MAE: 2.704582976284349
RMSLE: 0.16411775566026215
Mean Residual Deviance: 12.554458548266163

Scoring History:

Variable Importances:

H2O cluster uptime:	10 secs
H2O cluster timezone:	Asia/Seoul
H2O data parsing timezone:	UTC
H2O cluster version:	3.28.0.1
H2O cluster version age:	1 month and 20 days
H2O cluster name:	sparkling-water-carbig_local-1580950997532
H2O cluster total nodes:	1
H2O cluster free memory:	794 Mb
H2O cluster total cores:	2
H2O cluster allowed cores:	2
H2O cluster status:	accepting new members, healthy
H2O connection url:	http://192.168.56.101:54321
H2O connection proxy:	None
H2O internal security:	False
H2O API Extensions:	XGBoost, Algos, Amazon S3, AutoML, Core V3, TargetEncoder, Core V4
Python version:	3.7.4 final

	layer	units	type	dropout	l1	l2	mean_rate	rate_rms	momentum	mean_weight	weight_rms	mean_bias	bias_rms
0	1	13	Input	0
1	2	500	RectifierDropout	40	0	0	0.0314611	0.024127	0	0.000921845	0.106943	0.445156	0.0468657
2	3	500	RectifierDropout	20	0	0	0.0842535	0.0618545	0	-0.0144979	0.0532604	0.953914	0.0260939
3	4	500	RectifierDropout	40	0	0	0.0909126	0.0673292	0	-0.00719517	0.0492706	0.953027	0.0540903
4	5	200	RectifierDropout	30	0	0	0.0424173	0.0298493	0	-0.00827373	0.0554473	0.974926	0.0182223
5	6	500	RectifierDropout	20	0	0	0.0597374	0.0423007	0	-0.00713246	0.0554903	0.964627	0.0517056
6	7	100	RectifierDropout	10	0	0	0.0274848	0.0529421	0	-0.00603756	0.0583177	0.997556	0.0143366
7	8	1	Linear		0	0	0.000374757	0.0002951	0	-0.0066581	0.115997	-0.0365526	1.09713e-154

	timestamp	duration	training_speed	epochs	iterations	samples	training_rmse	training_deviance	training_mae	training_r2	validation_rmse	validation_deviance	validation_mae	validation_r2
0	2020-02-06 10:52:31	0.000 sec	None	0.0	0	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2020-02-06 10:52:39	8.138 sec	495 obs/sec	10.0	1	3830.0	4.559921	20.792879	3.063984	0.730718	6.853939	46.976474	4.171207	0.544797
2	2020-02-06 10:52:45	13.752 sec	581 obs/sec	20.0	2	7660.0	3.882560	15.074271	2.556876	0.804778	6.090655	37.096082	3.703239	0.640539
3	2020-02-06 10:52:54	22.788 sec	695 obs/sec	40.0	4	15320.0	3.462748	11.990626	2.324721	0.844713	5.514536	30.410111	3.257460	0.705326
4	2020-02-06 10:53:01	30.607 sec	774 obs/sec	60.0	6	22980.0	3.246251	10.538147	2.285182	0.863524	5.208311	27.126506	3.160191	0.737144
5	2020-02-06 10:53:09	37.949 sec	831 obs/sec	80.0	8	30640.0	2.698576	7.282314	1.995358	0.905689	4.361370	19.021545	3.056471	0.815681
6	2020-02-06 10:53:15	44.636 sec	883 obs/sec	100.0	10	38300.0	2.559404	6.550546	1.837415	0.915166	4.629112	21.428675	2.959663	0.792356
7	2020-02-06 10:53:22	51.509 sec	918 obs/sec	120.0	12	45960.0	2.593123	6.724287	1.871866	0.912916	4.725163	22.327166	2.834982	0.783649
8	2020-02-06 10:53:28	57.647 sec	958 obs/sec	140.0	14	53620.0	2.643691	6.989101	1.947474	0.909486	4.604811	21.204288	2.853517	0.794530
9	2020-02-06 10:53:34	1 min 3.741 sec	990 obs/sec	160.0	16	61280.0	2.487583	6.188068	1.788557	0.919860	4.323711	18.694475	2.786261	0.818850
10	2020-02-06 10:53:40	1 min 9.534 sec	1021 obs/sec	180.0	18	68940.0	2.385202	5.689189	1.750422	0.926321	4.147532	17.202023	2.709004	0.833312
11	2020-02-06 10:53:46	1 min 15.288 sec	1048 obs/sec	200.0	20	76600.0	2.293589	5.260548	1.689379	0.931872	3.543227	12.554459	2.704583	0.878347

	variable	relative_importance	scaled_importance	percentage
0	lstat	1.000000	1.000000	0.131614
1	rm	0.987815	0.987815	0.130010
2	ptratio	0.618191	0.618191	0.081362
3	crim	0.607228	0.607228	0.079920
4	age	0.598154	0.598154	0.078725
5	rad	0.558201	0.558201	0.073467
6	nox	0.534574	0.534574	0.070357
7	tax	0.529267	0.529267	0.069659
8	b	0.529082	0.529082	0.069634
9	dis	0.450081	0.450081	0.059237
10	indus	0.447967	0.447967	0.058959
11	chas	0.404590	0.404590	0.053250
12	zn	0.332843	0.332843	0.043807

티스토리

Spark + H2O(Pysparkling)

Spark + H2O(Pysparkling)