Pandas API on Apache Spark

Hee'World

Jonghee Jeon 2020. 2. 23. 14:40

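pandas is close to the de facto standard package for data processing in Python.
The Koalas project, led by Databricks, implements the pandas API on top of Apache Spark.
It lets you write pandas syntax while still taking advantage of Spark's performance.
It is currently in beta.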

Koalas GitHub page - https://github.com/databricks/koalas

Koalas Spark + AI Summit 2019 - https://databricks.com/session_eu19/koalas-pandas-on-apache-spark

Installing the Koalas package

- Installing with conda is the recommended default

> conda install koalas -c conda-forge

Using Koalas

# Import packages
import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})

# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']

# Do some operations in place:
df['x2'] = df.x * df.x
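
For comparison, here is a minimal sketch of the same three steps run on plain pandas, without Spark. It assumes only that pandas is installed. Because Koalas aims to mirror the pandas API, the Koalas version above should end up with the same columns and values, but this sketch does not exercise Koalas itself, so treat that as an expectation rather than a guarantee.

```python
import pandas as pd

# Same input data as the Koalas example above
pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})

# Rename the columns by assigning a new list of labels
pdf.columns = ['x', 'y', 'z1']

# Add a derived column computed element-wise from an existing one
pdf['x2'] = pdf.x * pdf.x

print(list(pdf.columns))    # ['x', 'y', 'z1', 'x2']
print(pdf['x2'].tolist())   # [0, 1, 4]
```

Running the pandas version first on a small sample can be a convenient way to check the logic before moving the same code to a Koalas DataFrame.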

# Convert a Koalas DataFrame to pandas and back
>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(10)
>>> pdf = kdf.to_pandas()
>>> pdf.values
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

>>> ks.from_pandas(pdf)
   id
0   0
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
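
The nested array printed by pdf.values in the session above is a two-dimensional NumPy array with one row per record and one column per DataFrame column. The sketch below reproduces that shape with plain pandas; the DataFrame built here is a stand-in (an assumption) for the result of kdf.to_pandas(), so it runs without Spark. Keep in mind that to_pandas() collects all of the data into the driver's memory, so it is best suited to small results.

```python
import pandas as pd

# Stand-in for the pandas DataFrame returned by kdf.to_pandas() above
pdf = pd.DataFrame({'id': range(10)})

# DataFrame.values is 2-D: (rows, columns)
values = pdf.values
print(values.shape)   # (10, 1)

# Selecting a single column first yields a flat 1-D array instead
flat = pdf['id'].values
print(flat.shape)     # (10,)
```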

# Convert a Koalas DataFrame to a Spark DataFrame and filter it
>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(10)
>>> sdf = kdf.to_spark().filter("id > 5")
>>> sdf.show()
+---+
| id|
+---+
|  6|
|  7|
|  8|
|  9|
+---+
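
The Spark filter("id > 5") call above has a direct pandas-style counterpart: a boolean mask. The following is a minimal sketch on plain pandas, again using a stand-in DataFrame so that it runs without Spark. Koalas is designed to accept the same indexing expression, but only the pandas behavior is shown here.

```python
import pandas as pd

# Stand-in for ks.range(10): a single 'id' column holding 0..9
pdf = pd.DataFrame({'id': range(10)})

# Boolean-mask filtering keeps the rows where the condition is True
filtered = pdf[pdf.id > 5]

# query() expresses the same condition as a string, similar to the Spark filter
queried = pdf.query("id > 5")

print(filtered['id'].tolist())   # [6, 7, 8, 9]
```

Staying with pandas-style expressions keeps the code portable between pandas and Koalas, while to_spark() remains available when a Spark-specific operation is needed.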