To read parquet files from multiple paths that are not parent/child directories of each other, collect the paths in a list with [ ] and unpack them with * when reading, as shown below.

# Pass any number of paths in one call by unpacking the list with *
paths = ['foo', 'bar']
df = spark.read.parquet(*paths)
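
For a quick end-to-end check, here is a minimal self-contained sketch; the SparkSession setup, the app name, and the /tmp paths are illustrative assumptions, not part of the original snippet.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('multi-path-read').getOrCreate()

# Write two small sample datasets to hypothetical local paths
spark.range(3).write.mode('overwrite').parquet('/tmp/foo')
spark.range(3).write.mode('overwrite').parquet('/tmp/bar')

# Reading both paths in a single call unions the rows
paths = ['/tmp/foo', '/tmp/bar']
df = spark.read.parquet(*paths)
print(df.count())  # 6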

Alternatively, when the paths are partition directories under a common root, set the basePath option so Spark still discovers the partition columns:

# basePath marks the root of the partitioned dataset, so the partition
# columns are still inferred from the directory names despite the globs
basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)
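
Assuming the bucket layout above, the payoff of basePath is that partition_value1 and partition_value2 come back as ordinary columns; a sketch of what to expect (the inferred types depend on the actual values and Spark's partition type inference):

# The partition columns are recovered from the directory names
df.printSchema()  # schema includes partition_value1 and partition_value2
df.select('partition_value1', 'partition_value2').distinct().show()

# Without option('basePath', ...), each glob-matched directory is treated
# as its own root and the partition columns are typically not inferred.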

For details, see this Stack Overflow answer to "Reading parquet files from multiple directories in Pyspark":

https://stackoverflow.com/a/43881623
