To read parquet files from multiple paths that are not parent/child directories of each other, collect the paths in a list with [ ] and unpack them with * when reading, as shown below.

# Pass any number of paths in one call by unpacking the list with *
paths = ['foo', 'bar']
df = spark.read.parquet(*paths)
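
For a quick end-to-end check, here is a minimal self-contained sketch; the SparkSession setup, the app name, and the /tmp paths are illustrative assumptions, not part of the original snippet.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('multi-path-read').getOrCreate()

# Write two small sample datasets to hypothetical local paths
spark.range(3).write.mode('overwrite').parquet('/tmp/foo')
spark.range(3).write.mode('overwrite').parquet('/tmp/bar')

# Reading both paths in a single call unions the rows
paths = ['/tmp/foo', '/tmp/bar']
df = spark.read.parquet(*paths)
print(df.count())  # 6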

Alternatively, when the paths are partition directories under a common root, set the basePath option so Spark still discovers the partition columns:

# basePath marks the root of the partitioned dataset, so the partition
# columns are still inferred from the directory names despite the globs
basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)
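
Assuming the bucket layout above, the payoff of basePath is that partition_value1 and partition_value2 come back as ordinary columns; a sketch of what to expect (the inferred types depend on the actual values and Spark's partition type inference):

# The partition columns are recovered from the directory names
df.printSchema()  # schema includes partition_value1 and partition_value2
df.select('partition_value1', 'partition_value2').distinct().show()

# Without option('basePath', ...), each glob-matched directory is treated
# as its own root and the partition columns are typically not inferred.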

For details, see this Stack Overflow answer to "Reading parquet files from multiple directories in Pyspark":

https://stackoverflow.com/a/43881623
