Hee'World
Spark DataFrame 03 (PySpark)
Creating the Titanic DataFrame
import pandas as pd

data1 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Name': {0: 'Owen', 1: 'Florence', 2: 'Laina', 3: 'Lily', 4: 'William'},
         'sex': {0: 'male', 1: 'female', 2: 'female', 3: 'female', 4: 'male'},
         'Survived': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}}

data2 = {'PassengerId': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
         'Age': {0: 22, 1: 38, 2: 33, 3: 35, 4: 35},
         'Fare': {0: 7.3, 1: 71.3, 2: 7.9, 3: 53.1, 4: 8.0},
         'Pclass': {0: 3, 1: 1, 2: 3, 3: 1, 4: 3}}

df1_pd = pd.DataFrame(data1, columns=data1.keys())
df2_pd = pd.DataFrame(data2, columns=data2.keys())
# spark is the SparkSession (created automatically in the pyspark shell/Jupyter,
# or explicitly via SparkSession.builder.getOrCreate())
df1 = spark.createDataFrame(df1_pd)
df2 = spark.createDataFrame(df2_pd)
df1.printSchema()   # inspect the schema Spark inferred from the pandas DataFrame
df1.show()          # preview the rows
count - number of rows
df1.count()   # 5
df2.count()   # 5
select - choosing columns
cols = ["PassengerId", "Name"]
df1.select(cols).show()
filter - conditions
df1.filter(df1.sex == 'female').show()
withColumn - creating a column
df2.withColumn('AgeFare', df2.Age * df2.Fare).show()
Aggregation - groupBy, avg, agg
gdf2 = df2.groupBy('Pclass')

avg_cols = ['Age', 'Fare']
gdf2.avg(*avg_cols).show()
sort - ordering rows
df2.sort('Fare', ascending=False).show()
join - combining DataFrames
df1.join(df2, ['PassengerId']).show()   # inner join on PassengerId by default
df1.createOrReplaceTempView('df1_tmp')
df2.createOrReplaceTempView('df2_tmp')
query = """
select *
from df1_tmp a
join df2_tmp b
on a.PassengerId = b.PassengerId
"""
dfj = spark.sql(query)
dfj.show()
df1.union(df1).show()   # stacks rows by column position; duplicates are kept
df1.explain()   # print the physical execution plan
dfj.explain()   # the join shows up as extra exchange/join stages in the plan