
Spark SQL (PySpark)

Jonghee Jeon 2020. 4. 15. 18:28

 - Before Spark, Hive was the de facto standard for SQL on Hadoop

 - A DataFrame can be registered as a temporary view with createOrReplaceTempView and then queried with SQL

 - Global TempView

   • Declared so a view is available across all Spark sessions in the application (see the sketch after this list)

   • A view created with createOrReplaceTempView is visible only in the current SparkSession
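
A minimal sketch of the difference, not from the original post; it assumes an existing DataFrame df:

# Session-scoped: visible only in the current SparkSession
df.createOrReplaceTempView("my_view")
spark.sql("select * from my_view").show()

# Application-scoped: registered under the global_temp database
df.createGlobalTempView("my_global_view")
spark.sql("select * from global_temp.my_global_view").show()

# A new session in the same application can still see the global view
spark.newSession().sql("select * from global_temp.my_global_view").show()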

 

 

Spark SQL 01

Spark SQL

  • A Spark DataFrame can be queried like a database table
In [1]:
import pandas as pd

Creating a pandas DataFrame

In [5]:
pandf = pd.read_csv("data/Uber-Jan-Feb-FOIL.csv", header=0)
In [6]:
pandf.head()
Out[6]:
  dispatching_base_number      date  active_vehicles  trips
0                  B02512  1/1/2015              190   1132
1                  B02765  1/1/2015              225   1765
2                  B02764  1/1/2015             3427  29421
3                  B02682  1/1/2015              945   7679
4                  B02617  1/1/2015             1228   9537

Creating a Spark DataFrame with the Spark session

In [7]:
uberDF = spark.read.csv("data/Uber-Jan-Feb-FOIL.csv", inferSchema=True, header=True)

# Equivalent form using the generic reader API:
spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("data/Uber-Jan-Feb-FOIL.csv")
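
To verify what inferSchema produced, the schema can be printed; a quick check that is not shown in the original post:

uberDF.printSchema()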

In [8]:
uberDF.show()
+-----------------------+--------+---------------+-----+
|dispatching_base_number|    date|active_vehicles|trips|
+-----------------------+--------+---------------+-----+
|                 B02512|1/1/2015|            190| 1132|
|                 B02765|1/1/2015|            225| 1765|
|                 B02764|1/1/2015|           3427|29421|
|                 B02682|1/1/2015|            945| 7679|
|                 B02617|1/1/2015|           1228| 9537|
|                 B02598|1/1/2015|            870| 6903|
|                 B02598|1/2/2015|            785| 4768|
|                 B02617|1/2/2015|           1137| 7065|
|                 B02512|1/2/2015|            175|  875|
|                 B02682|1/2/2015|            890| 5506|
|                 B02765|1/2/2015|            196| 1001|
|                 B02764|1/2/2015|           3147|19974|
|                 B02765|1/3/2015|            201| 1526|
|                 B02617|1/3/2015|           1188|10664|
|                 B02598|1/3/2015|            818| 7432|
|                 B02682|1/3/2015|            915| 8010|
|                 B02512|1/3/2015|            173| 1088|
|                 B02764|1/3/2015|           3215|29729|
|                 B02512|1/4/2015|            147|  791|
|                 B02682|1/4/2015|            812| 5621|
+-----------------------+--------+---------------+-----+
only showing top 20 rows

In [9]:
uberDF.createOrReplaceTempView("uber")
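
The registered view lives in the session catalog, where it can be listed and dropped; a sketch, not part of the original notebook:

spark.catalog.listTables()          # includes the temporary view 'uber'
spark.catalog.dropTempView("uber")  # removes the view when it is no longer needed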

Spark SQL SELECT

In [13]:
spark.sql("select * from uber limit 10").show()
+-----------------------+--------+---------------+-----+
|dispatching_base_number|    date|active_vehicles|trips|
+-----------------------+--------+---------------+-----+
|                 B02512|1/1/2015|            190| 1132|
|                 B02765|1/1/2015|            225| 1765|
|                 B02764|1/1/2015|           3427|29421|
|                 B02682|1/1/2015|            945| 7679|
|                 B02617|1/1/2015|           1228| 9537|
|                 B02598|1/1/2015|            870| 6903|
|                 B02598|1/2/2015|            785| 4768|
|                 B02617|1/2/2015|           1137| 7065|
|                 B02512|1/2/2015|            175|  875|
|                 B02682|1/2/2015|            890| 5506|
+-----------------------+--------+---------------+-----+
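
Every SQL statement here has a DataFrame API equivalent; for example, the query above could also be written as follows (a sketch, not from the original post):

uberDF.limit(10).show()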

SELECT specific columns with LIMIT

In [16]:
spark.sql("select date, dispatching_base_number from uber limit 10").show()
+--------+-----------------------+
|    date|dispatching_base_number|
+--------+-----------------------+
|1/1/2015|                 B02512|
|1/1/2015|                 B02765|
|1/1/2015|                 B02764|
|1/1/2015|                 B02682|
|1/1/2015|                 B02617|
|1/1/2015|                 B02598|
|1/2/2015|                 B02598|
|1/2/2015|                 B02617|
|1/2/2015|                 B02512|
|1/2/2015|                 B02682|
+--------+-----------------------+

SELECT DISTINCT

In [18]:
spark.sql("select distinct dispatching_base_number from uber").show()
+-----------------------+
|dispatching_base_number|
+-----------------------+
|                 B02512|
|                 B02598|
|                 B02682|
|                 B02765|
|                 B02617|
|                 B02764|
+-----------------------+

WHERE

In [19]:
spark.sql("select count(*) from uber where trips > 2000").show()
+--------+
|count(1)|
+--------+
|     284|
+--------+
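
The same count via the DataFrame API, as a sketch assuming the uberDF column reference:

uberDF.filter(uberDF.trips > 2000).count()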

distinct, sum, group by, order by

In [21]:
spark.sql("""select distinct dispatching_base_number, 
                             sum(trips) tripsum
                             from uber
                             group by dispatching_base_number
                             order by tripsum desc""").show()
+-----------------------+-------+
|dispatching_base_number|tripsum|
+-----------------------+-------+
|                 B02764|1914449|
|                 B02617| 725025|
|                 B02682| 662509|
|                 B02598| 540791|
|                 B02765| 193670|
|                 B02512|  93786|
+-----------------------+-------+

In [22]:
spark.sql("""select distinct date, 
                             sum(trips) tripsum
                             from uber
                             group by date
                             order by tripsum desc limit 10""").show()
+---------+-------+
|     date|tripsum|
+---------+-------+
|2/20/2015| 100915|
|2/14/2015| 100345|
|2/21/2015|  98380|
|2/13/2015|  98024|
|1/31/2015|  92257|
|2/15/2015|  89401|
|2/27/2015|  88806|
|2/19/2015|  88757|
|2/28/2015|  88181|
| 2/6/2015|  85940|
+---------+-------+
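
The grouped aggregations can likewise be expressed with the DataFrame API; a sketch, not from the original post:

from pyspark.sql import functions as F

(uberDF.groupBy("dispatching_base_number")
       .agg(F.sum("trips").alias("tripsum"))
       .orderBy(F.desc("tripsum"))
       .show())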

between

In [25]:
spark.sql("select * from uber where trips between 1000 and  2000 limit 10").show()
+-----------------------+--------+---------------+-----+
|dispatching_base_number|    date|active_vehicles|trips|
+-----------------------+--------+---------------+-----+
|                 B02512|1/1/2015|            190| 1132|
|                 B02765|1/1/2015|            225| 1765|
|                 B02765|1/2/2015|            196| 1001|
|                 B02765|1/3/2015|            201| 1526|
|                 B02512|1/3/2015|            173| 1088|
|                 B02765|1/5/2015|            227| 1133|
|                 B02765|1/6/2015|            234| 1376|
|                 B02512|1/6/2015|            218| 1314|
|                 B02765|1/7/2015|            248| 1704|
|                 B02512|1/7/2015|            217| 1446|
+-----------------------+--------+---------------+-----+
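
A DataFrame API version of the same range filter; a sketch using Column.between, which is inclusive like SQL BETWEEN:

uberDF.filter(uberDF.trips.between(1000, 2000)).limit(10).show()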

