Spark SQL (PySpark)
Jonghee Jeon
2020. 4. 15. 18:28
- Before Spark, Hive was the de facto standard for SQL on Hadoop
- A DataFrame registered with createOrReplaceTempView can be queried with SQL
- Global TempView
• Declared so the view is available across all SparkSessions in the application
• A view made with createOrReplaceTempView is visible only in the current SparkSession
Spark SQL
- Use a Spark DataFrame like a database table
In [1]:
import pandas as pd
Creating a pandas DataFrame
In [5]:
pandf = pd.read_csv("data/Uber-Jan-Feb-FOIL.csv", header=0)
In [6]:
pandf.head()
Creating a Spark DataFrame
In [7]:
uberDF = spark.read.csv("data/Uber-Jan-Feb-FOIL.csv", inferSchema=True, header=True)
# Equivalent long form:
# spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("data/Uber-Jan-Feb-FOIL.csv")
In [8]:
uberDF.show()
In [9]:
uberDF.createOrReplaceTempView("uber")
Spark SQL SELECT
In [13]:
spark_select = spark.sql("select * from uber limit 10")
spark_select.show()
SELECT specific columns with LIMIT
In [16]:
spark.sql("select date, dispatching_base_number from uber limit 10").show()
SELECT DISTINCT
In [18]:
spark.sql("select distinct dispatching_base_number from uber").show()
WHERE
In [19]:
spark.sql("select count(*) from uber where trips > 2000").show()
DISTINCT, SUM, GROUP BY, ORDER BY
In [21]:
spark.sql("""select distinct dispatching_base_number,
sum(trips) tripsum
from uber
group by dispatching_base_number
order by tripsum desc""").show()
In [22]:
spark.sql("""select distinct date,
sum(trips) tripsum
from uber
group by date
order by tripsum desc limit 10""").show()
BETWEEN
In [25]:
spark.sql("select * from uber where trips between 1000 and 2000 limit 10").show()