Spark DataFrame 03 (PySpark)
Jonghee Jeon
2020. 4. 11. 16:18
Creating the Titanic DataFrames¶
In [1]:
import pandas as pd
In [15]:
# Two small Titanic-style tables that share the PassengerId key.
data1 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Name': {0: 'Owen', 1: 'Florence', 2: 'Laina', 3: 'Lily', 4: 'William'},
         'sex': {0: 'male', 1: 'female', 2: 'female', 3: 'female', 4: 'male'},
         'Survived': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}
         }
data2 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Age': {0: 22, 1: 38, 2: 33, 3: 35, 4: 35},
         'Fare': {0: 7.3, 1: 71.3, 2: 7.9, 3: 53.1, 4: 8.0},
         'Pclass': {0: 3, 1: 1, 2: 3, 3: 1, 4: 3}
         }
df1_pd = pd.DataFrame(data1, columns=data1.keys())
df2_pd = pd.DataFrame(data2, columns=data2.keys())
In [16]:
# `spark` is assumed to be the SparkSession that the PySpark shell or notebook kernel provides.
df1 = spark.createDataFrame(df1_pd)
df2 = spark.createDataFrame(df2_pd)
In [5]:
df1.printSchema()
In [6]:
df1.show()
count - Number of rows¶
In [7]:
df1.count()
Out[7]:
5
In [8]:
df2.count()
Out[8]:
5
select - Selecting columns¶
In [9]:
cols = ["PassengerId", "Name"]
df1.select(cols).show()
filter - Filtering by condition¶
In [12]:
df1.filter(df1.sex == 'female').show()
withColumn - Creating a column¶
In [17]:
df2.withColumn('AgeFare', df2.Age*df2.Fare).show()
Summary aggregation: groupBy, avg, agg¶
In [20]:
gdf2 = df2.groupBy('Pclass')
In [21]:
avg_Cols = ['Age', 'Fare']
gdf2.avg(*avg_Cols).show()
sort - Sorting¶
In [23]:
df2.sort('Fare', ascending=False).show()
Combining DataFrames: join¶
In [26]:
df1.join(df2, ['PassengerId']).show()
In [27]:
df1.createOrReplaceTempView('df1_tmp')
df2.createOrReplaceTempView('df2_tmp')
In [28]:
query = """
select *
from df1_tmp a
join df2_tmp b
on a.PassengerId = b.PassengerId
"""
dfj = spark.sql(query)
In [29]:
dfj.show()
In [31]:
df1.union(df1).show()
In [32]:
df1.explain()
In [33]:
dfj.explain()