Spark DataFrame¶

Ways to load data into a DataFrame:

spark.read.csv()
spark.read.json()
spark.read.format("csv")
spark.read.format("json")

Storage locations (URI prefixes):

file://
hdfs://
hbase://
s3://

stock = spark.read.csv("data/appl_stock.csv", inferSchema=True, header=True)

stock.printSchema()

root
 |-- Date: timestamp (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Adj Close: double (nullable = true)

stock.columns

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

stock.describe().show()

+-------+------------------+------------------+------------------+-----------------+-------------------+------------------+
|summary|              Open|              High|               Low|            Close|             Volume|         Adj Close|
+-------+------------------+------------------+------------------+-----------------+-------------------+------------------+
|  count|              1762|              1762|              1762|             1762|               1762|              1762|
|   mean| 313.0763111589103| 315.9112880164581| 309.8282405079457|312.9270656379113|9.422577587968218E7| 75.00174115607275|
| stddev|185.29946803981522|186.89817686485767|183.38391664371008|185.1471036170943|6.020518776592709E7| 28.57492972179906|
|    min|              90.0|         90.699997|         89.470001|        90.279999|           11475900|         24.881912|
|    max|        702.409988|        705.070023|        699.569977|       702.100021|          470249500|127.96609099999999|
+-------+------------------+------------------+------------------+-----------------+-------------------+------------------+

stock.summary().show()

+-------+------------------+------------------+------------------+------------------+-------------------+------------------+
|summary|              Open|              High|               Low|             Close|             Volume|         Adj Close|
+-------+------------------+------------------+------------------+------------------+-------------------+------------------+
|  count|              1762|              1762|              1762|              1762|               1762|              1762|
|   mean| 313.0763111589103| 315.9112880164581| 309.8282405079457| 312.9270656379113|9.422577587968218E7| 75.00174115607275|
| stddev|185.29946803981522|186.89817686485767|183.38391664371008| 185.1471036170943|6.020518776592709E7| 28.57492972179906|
|    min|              90.0|         90.699997|         89.470001|         90.279999|           11475900|         24.881912|
|    25%|        115.199997|        116.349998|             114.0|        115.190002|           49161400|         50.260037|
|    50%|        317.990002|320.18001200000003|        316.340004|318.21000699999996|           80500000| 72.95419100000001|
|    75%|470.94001799999995|478.55001799999997|468.05001799999997|472.69001799999995|          121095800|        100.228673|
|    max|        702.409988|        705.070023|        699.569977|        702.100021|          470249500|127.96609099999999|
+-------+------------------+------------------+------------------+------------------+-------------------+------------------+

pandas DataFrame -> Spark DataFrame¶

import pandas as pd

pdf = pd.DataFrame({
    'x': [[1,2,3], [4,5,6]],
    'y': [['a','b','c'], ['d','e','f']]
})

pdf

sdf = spark.createDataFrame(pdf)

sdf.show()

+---------+---------+
|        x|        y|
+---------+---------+
|[1, 2, 3]|[a, b, c]|
|[4, 5, 6]|[d, e, f]|
+---------+---------+

RDD -> DataFrame¶

from pyspark.sql import Row

rdd = spark.sparkContext.parallelize([
    Row(x=[1,2,3], y=['a','b','c']),
    Row(x=[4,5,6], y=['d','e','f'])
])

rdf = spark.createDataFrame(rdd)

rdf.show()

+---------+---------+
|        x|        y|
+---------+---------+
|[1, 2, 3]|[a, b, c]|
|[4, 5, 6]|[d, e, f]|
+---------+---------+

JSON -> DataFrame¶

jdf = spark.read.json("data/people.json")

jdf.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

Setting a DataFrame Schema¶

from pyspark.sql.types import StructField, IntegerType, StringType, StructType

# Each StructField is (column name, data type, nullable)
data_schema = [StructField('age', IntegerType(), True),
               StructField('name', StringType(), True)]

struct_schema = StructType(fields=data_schema)

jdf = spark.read.json("data/people.json", schema=struct_schema)

jdf.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

jdf.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

Output of the pdf cell in the pandas section:

	x	y
0	[1, 2, 3]	[a, b, c]
1	[4, 5, 6]	[d, e, f]