
Spark ML 02 (Pyspark)

Jonghee Jeon 2020. 4. 26. 15:59

Spark ML Regression

 

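An example of running linear regression on weather data with Spark ML.


What is linear regression?

In statistics, linear regression is a regression analysis technique that models the linear relationship between a dependent variable y and one or more independent (explanatory) variables X. With a single explanatory variable it is called simple linear regression; with two or more explanatory variables it is called multiple linear regression.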

https://ko.wikipedia.org/wiki/%EC%84%A0%ED%98%95_%ED%9A%8C%EA%B7%80
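
As a rough illustration of the definition above, the simple (one-variable) case can be fitted with the closed-form least-squares formulas. The function name `fit_simple_linear_regression` and the toy data are made up for this sketch; it is not how Spark ML's `LinearRegression` is implemented.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (hypothetical) lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```

Multiple linear regression generalizes this to several explanatory variables, which is the case handled by the Spark ML code in the rest of this post.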

 


 

 

In [15]:
weatherDF = spark.read.csv("data/OBS_ASOS_DD_20200120112650.csv", inferSchema=True,header=True)
In [16]:
weatherDF.count()
Out[16]:
310594
In [17]:
weatherDF.show(5)
 
+---+-------+-------------------+----+-----+----+----+------+------+-----------+-----------+-------------+-------+--------+---------+
|loc|locName|               date| avg|  min| max|rain|maxWin|avgWin|avgHumidity|sumSunshine|sumInsolation|sumSnow|avgCloud|avgGround|
+---+-------+-------------------+----+-----+----+----+------+------+-----------+-----------+-------------+-------+--------+---------+
| 90|   속초|2010-12-21 00:00:00| 6.4|  3.8|10.1|null|   6.4|   2.9|       50.4|        2.9|         null|   null|     8.4|      2.8|
| 90|   속초|2010-12-22 00:00:00| 5.8|  2.5| 8.7|null|   3.8|   2.3|       59.8|        5.2|         null|   null|     3.9|      3.6|
| 90|   속초|2010-12-23 00:00:00| 3.9| -3.1| 8.6|null|   8.6|   3.8|       25.0|        7.0|         null|   null|     0.8|      0.5|
| 90|   속초|2010-12-24 00:00:00|-6.8| -9.6|-3.1|null|   7.2|   3.6|       17.9|        8.4|         null|   null|     0.0|     -4.3|
| 90|   속초|2010-12-25 00:00:00|-7.3|-10.7|-4.4|null|  11.7|   6.7|       14.9|        8.6|         null|   null|     0.0|     -6.4|
+---+-------+-------------------+----+-----+----+----+------+------+-----------+-----------+-------------+-------+--------+---------+
only showing top 5 rows

In [18]:
weatherDF.summary().show()
 
+-------+------------------+-------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
|summary|               loc|locName|               avg|               min|               max|              rain|            maxWin|            avgWin|       avgHumidity|       sumSunshine|     sumInsolation|           sumSnow|          avgCloud|         avgGround|
+-------+------------------+-------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
|  count|            310594| 310594|            310309|            310574|            310568|            109933|            310475|            310316|            310079|            310129|            130512|              4143|            141332|            310294|
|   mean|197.28717876069723|   null|12.916850623088695|  8.24918183750098|18.313607969913246| 9.913482757679343| 4.916743699170655|2.0878543162453718| 68.31187536080654| 6.206939692837397|13.644861545298792|3.3998551774076926|  5.18359677921491|15.053247242937303|
| stddev|63.999888489505366|   null| 9.862753377470318|10.433370572379587| 9.930215143856923|20.664915984484093|2.2154859778909906|1.4109632439327655|16.088124509956586|3.9309558122972947| 7.173553714351939| 5.803263786261778|3.0681217215420493|10.976030148493335|
|    min|                90|   강릉|             -19.4|             -27.7|             -15.2|               0.0|               0.0|               0.0|               8.5|               0.0|               0.0|               0.0|               0.0|             -14.9|
|    25%|               137|   null|               4.5|              -0.4|               9.9|               0.3|               3.4|               1.2|              57.5|               2.6|              8.12|               0.4|               2.6|               4.8|
|    50%|               201|   null|              13.9|               8.6|              19.8|               2.0|               4.5|               1.7|              70.0|               7.1|             12.88|               1.4|               5.3|              15.7|
|    75%|               258|   null|              21.4|              17.4|              26.6|              10.0|               5.9|               2.6|              80.3|               9.3|             19.07|               4.0|               7.8|              24.5|
|    max|               295| 흑산도|              34.1|              30.9|              41.0|             449.5|              49.0|              21.8|             100.0|              50.0|             42.86|              78.2|              10.0|              43.8|
+-------+------------------+-------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+

In [19]:
weatherDF.printSchema()
 
root
 |-- loc: integer (nullable = true)
 |-- locName: string (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- avg: double (nullable = true)
 |-- min: double (nullable = true)
 |-- max: double (nullable = true)
 |-- rain: double (nullable = true)
 |-- maxWin: double (nullable = true)
 |-- avgWin: double (nullable = true)
 |-- avgHumidity: double (nullable = true)
 |-- sumSunshine: double (nullable = true)
 |-- sumInsolation: double (nullable = true)
 |-- sumSnow: double (nullable = true)
 |-- avgCloud: double (nullable = true)
 |-- avgGround: double (nullable = true)

In [20]:
weatherDF = weatherDF.drop("locName")
In [21]:
weatherDF.printSchema()
 
root
 |-- loc: integer (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- avg: double (nullable = true)
 |-- min: double (nullable = true)
 |-- max: double (nullable = true)
 |-- rain: double (nullable = true)
 |-- maxWin: double (nullable = true)
 |-- avgWin: double (nullable = true)
 |-- avgHumidity: double (nullable = true)
 |-- sumSunshine: double (nullable = true)
 |-- sumInsolation: double (nullable = true)
 |-- sumSnow: double (nullable = true)
 |-- avgCloud: double (nullable = true)
 |-- avgGround: double (nullable = true)

In [22]:
train, test = weatherDF.randomSplit([0.7, 0.3])
In [23]:
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
In [24]:
weatherRF = RFormula().setFormula("avg ~.").setFeaturesCol("features") \
                      .setLabelCol("label").setHandleInvalid("skip")
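
With `setHandleInvalid("skip")`, rows that cannot be turned into a feature vector are dropped instead of raising an error. This appears to include rows with null numeric values, which is consistent with the prediction output later in this post, where every displayed row has all columns populated. Because the formula `avg ~ .` uses every column, the sparsest column caps how many rows can survive. The sketch below works that out from the `count` row of the `weatherDF.summary()` output above; the variable names are made up for illustration.

```python
# Non-null counts copied from the `count` row of weatherDF.summary().
total_rows = 310594
non_null_counts = {
    "avg": 310309,
    "min": 310574,
    "max": 310568,
    "rain": 109933,
    "maxWin": 310475,
    "avgWin": 310316,
    "avgHumidity": 310079,
    "sumSunshine": 310129,
    "sumInsolation": 130512,
    "sumSnow": 4143,
    "avgCloud": 141332,
    "avgGround": 310294,
}

# Fraction of rows where each column is null.
missing_fraction = {col: 1 - n / total_rows for col, n in non_null_counts.items()}

# A row is complete only if every column is non-null, so the sparsest
# column gives an upper bound on the number of complete rows.
sparsest_col = min(non_null_counts, key=non_null_counts.get)
max_complete_rows = non_null_counts[sparsest_col]

for col, frac in sorted(missing_fraction.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:>14}: {frac:6.1%} missing")
print(f"at most {max_complete_rows} complete rows (limited by {sparsest_col})")
```

If training on only the snow-day rows is not the intent, one option is to leave sparse columns such as `sumSnow` out of the formula or to fill them before fitting.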
In [25]:
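weatherRFModel = weatherRF.fit(train)  # fit the formula once, on the training set only
trainRF = weatherRFModel.transform(train)
testRF = weatherRFModel.transform(test)  # reuse the fitted transformer on the test set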
In [26]:
lr = LinearRegression()
lr_model = lr.fit(trainRF)
In [28]:
testFit = lr_model.transform(testRF)
In [29]:
testFit.show()
 
+---+-------------------+----+-----+----+----+------+------+-----------+-----------+-------------+-------+--------+---------+--------------------+-----+--------------------+
|loc|               date| avg|  min| max|rain|maxWin|avgWin|avgHumidity|sumSunshine|sumInsolation|sumSnow|avgCloud|avgGround|            features|label|          prediction|
+---+-------------------+----+-----+----+----+------+------+-----------+-----------+-------------+-------+--------+---------+--------------------+-----+--------------------+
| 93|2016-11-26 00:00:00|-1.6| -6.3| 0.8| 1.1|   1.9|   0.6|       83.0|        0.0|         2.53|    0.5|     8.1|     -0.5|[93.0,-6.3,0.8,1....| -1.6|  -2.496908458646051|
| 93|2016-12-08 00:00:00| 0.2| -6.3| 5.1| 1.8|   3.4|   0.5|       82.3|        2.7|         6.14|    0.5|     7.8|      0.5|[93.0,-6.3,5.1,1....|  0.2| -0.5542894685412494|
| 93|2017-01-21 00:00:00|-6.2|-13.0|-0.7| 1.1|   4.2|   1.1|       79.6|        6.7|          8.8|    1.3|     4.9|     -3.4|[93.0,-13.0,-0.7,...| -6.2|  -6.546222364120698|
| 93|2017-02-23 00:00:00| 0.8| -2.8| 4.4| 0.0|   4.1|   1.7|       56.1|        6.8|        12.17|    0.0|     4.4|      2.0|[93.0,-2.8,4.4,0....|  0.8|  0.4463458570826158|
| 93|2017-12-24 00:00:00| 0.9| -0.7| 2.4|10.4|   3.3|   0.7|       87.4|        0.0|          0.9|    1.8|     8.9|      0.2|[93.0,-0.7,2.4,10...|  0.9|  0.6885975508917763|
| 93|2018-01-08 00:00:00| 0.6| -2.1| 3.8| 1.1|   5.0|   1.4|       61.6|        0.8|         4.29|    2.0|     9.0|     -0.4|[93.0,-2.1,3.8,1....|  0.6|  0.7107292458997987|
| 93|2018-01-22 00:00:00|-0.9| -3.8| 3.2| 3.7|   3.5|   0.7|       72.6|        1.1|         4.29|    5.0|     9.5|     -0.7|[93.0,-3.8,3.2,3....| -0.9|-0.34255288616567536|
| 93|2018-02-23 00:00:00| 1.6| -3.4| 6.4| 4.4|   3.7|   0.9|       85.5|        4.8|         11.4|    7.6|     8.5|      1.9|[93.0,-3.4,6.4,4....|  1.6|  1.2598331080643832|
| 93|2018-11-24 00:00:00| 0.2| -1.7| 1.3| 4.4|   2.8|   0.5|       85.1|        0.0|         1.14|    8.5|     9.3|      0.9|[93.0,-1.7,1.3,4....|  0.2|-0.23102571771918803|
| 93|2019-02-19 00:00:00| 0.2| -1.3| 1.7| 5.7|   1.7|   0.2|       88.0|        0.0|         2.86|    4.0|     9.4|     -0.2|[93.0,-1.3,1.7,5....|  0.2| 0.11007416263727621|
|100|2011-01-03 00:00:00|-7.8|-16.7|-2.8| 2.9|   4.1|   1.5|       83.0|        1.3|         6.67|    6.6|     7.9|     -2.0|[100.0,-16.7,-2.8...| -7.8|  -8.866293153141047|
|100|2011-01-14 00:00:00|-8.8|-15.8|-4.2| 0.3|  10.2|   5.6|       71.3|        6.3|         9.59|    0.4|     4.6|     -6.9|[100.0,-15.8,-4.2...| -8.8|  -9.537748097941778|
|100|2011-02-14 00:00:00|-8.0|-11.7|-4.3| 4.7|   4.5|   1.5|       85.0|        0.0|         3.06|    9.0|     8.8|     -2.0|[100.0,-11.7,-4.3...| -8.0|  -7.334369975934852|
|100|2011-02-28 00:00:00|-3.0| -3.8|-1.9| 1.4|   6.5|   2.5|       91.8|        0.0|         5.76|    2.2|    10.0|      0.0|[100.0,-3.8,-1.9,...| -3.0| -2.5884578762426624|
|100|2011-03-22 00:00:00|-1.5| -4.7| 2.6| 2.3|   8.0|   3.8|       52.4|        4.8|        15.27|    6.4|     5.0|      0.0|[100.0,-4.7,2.6,2...| -1.5| -1.2683076241902866|
|100|2011-03-25 00:00:00|-5.2|-12.7|-1.4| 4.8|   7.9|   3.2|       76.9|        4.5|        12.65|    8.5|     6.5|      0.0|[100.0,-12.7,-1.4...| -5.2|  -6.562127722481371|
|100|2011-03-28 00:00:00|-0.2| -4.2| 4.0| 0.9|   6.4|   2.4|       79.0|        1.9|         9.63|    1.3|     8.4|      1.5|[100.0,-4.2,4.0,0...| -0.2|-0.07702383069997987|
|100|2011-12-02 00:00:00|-0.3| -1.9| 0.9| 3.0|   3.6|   1.9|       91.5|        0.0|         3.74|    3.4|    10.0|      0.6|[100.0,-1.9,0.9,3...| -0.3|-0.46808178532145883|
|100|2011-12-09 00:00:00|-6.3|-11.0|-2.7| 3.8|   4.7|   1.7|       85.9|        0.0|         1.27|   10.5|     8.5|      0.1|[100.0,-11.0,-2.7...| -6.3|  -6.218081992323221|
|100|2012-01-01 00:00:00|-6.2|-10.4|-3.4| 0.3|   6.7|   4.5|       73.1|        5.9|         8.52|    0.4|     7.4|     -4.0|[100.0,-10.4,-3.4...| -6.2|  -6.621153351217983|
+---+-------------------+----+-----+----+----+------+------+-----------+-----------+-------------+-------+--------+---------+--------------------+-----+--------------------+
only showing top 20 rows
