BigData/Spark

Pandas API on Apache Spark

Jonghee Jeon 2020. 2. 23. 14:40

 


  •   pandas is the de facto standard package for data manipulation in Python
  •   Koalas, a project led by Databricks, implements the pandas API on top of Apache Spark
  •   You can keep writing pandas-style code while getting Spark's distributed performance
  •   Currently in beta
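Because Koalas mirrors the pandas API, existing pandas code carries over with essentially only the import changed. A minimal sketch in plain pandas (the Koalas version would use `import databricks.koalas as ks` and `ks.DataFrame` the same way, with execution distributed on Spark):

```python
# Plain pandas shown here; Koalas exposes the same DataFrame/groupby API,
# so the identical code runs on Spark after swapping the import.
import pandas as pd

df = pd.DataFrame({'group': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})
result = df.groupby('group')['value'].sum()
print(result)
```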

 

Koalas GitHub page - https://github.com/databricks/koalas

 


Koalas Spark + AI Summit 2019 - https://databricks.com/session_eu19/koalas-pandas-on-apache-spark

 


 

  •   How to install the Koalas package

      - Installation via conda is recommended by the project

      > conda install koalas -c conda-forge
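The package was also published to PyPI, so if conda is not available, pip works as well (this assumes a working PySpark environment is already set up):

```shell
pip install koalas
```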

  •   How to use Koalas
# Import packages
import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})

# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']

# Do some operations in place:
df['x2'] = df.x * df.x
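The snippet above keeps pandas semantics, so running the same operations against a plain pandas DataFrame is a quick way to sanity-check logic locally before scaling it out on Spark:

```python
# Same operations as above, run in plain pandas -- Koalas preserves
# these semantics, so the results should match the Koalas version.
import pandas as pd

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})
pdf.columns = ['x', 'y', 'z1']      # rename the columns
pdf['x2'] = pdf.x * pdf.x           # add a derived column in place
print(pdf)
```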
>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(10)
>>> pdf = kdf.to_pandas()
>>> pdf.values
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
>>> ks.from_pandas(pdf)
   id
0   0
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(10)
>>> sdf = kdf.to_spark().filter("id > 5")
>>> sdf.show()
+---+
| id|
+---+
|  6|
|  7|
|  8|
|  9|
+---+
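The same `id > 5` filter can also be expressed without dropping into Spark SQL, using the boolean-mask style of the pandas API that Koalas mirrors (plain pandas shown here as a local sketch):

```python
# Boolean-mask filtering in plain pandas; the Koalas DataFrame supports
# the same expression, executed on Spark.
import pandas as pd

df = pd.DataFrame({'id': range(10)})
filtered = df[df.id > 5]
print(filtered)
```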
  •   Koalas 10-minute tutorial notebook

10min.ipynb (0.08 MB, attached)


This post draws on material from the Koalas GitHub repository.

https://github.com/databricks/koalas/