I have a dataset on HDFS that is partitioned on 3 columns: ym, ymd, and eventName. When I read the dataset from the root path, Spark creates jobs to list all leaf files in the directory. The problem is that this does not happen when I read another dataset that is partitioned on only 2 columns, ym and ymd.
```
hdfs://HDFS_CLUSTER/
  ym=202404/
    ...
  ym=202405/
    ...
    ymd=20240501/
      eventName=event_name_0/
        file_01.parquet
        file_02.parquet
        ...
```

My code looks like this:
```python
# two ways to read the dataset; Spark creates a 'listing leaf files' job in both cases
spark.read.parquet("hdfs://HDFS_CLUSTER/dataset")
spark.read.parquet("hdfs://HDFS_CLUSTER/dataset/ym=202405/ymd=20240501")
```

And the Spark jobs are created as below:
I have tried reading with a pre-defined schema and with the option mergeSchema: true, but the result is the same. My attempt:
```python
schema = ...  # schema definition
spark.read.schema(schema).option('mergeSchema', 'true').parquet("hdfs://HDFS_CLUSTER/dataset")
```

The 'listing leaf files' job takes about 7s to finish; my dataset contains data going back to 2021. Then reading the dataset with spark.read.parquet("hdfs://HDFS_CLUSTER/dataset") never finishes (I got an exception related to running out of Java heap space).
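To make the cost of that listing concrete, here is a small stand-alone sketch (plain Python, with local directories standing in for HDFS; the counts and names are made up for illustration) showing that a recursive listing from the root must touch every leaf file, while a listing scoped to one ym=/ymd= prefix touches only a few:

```python
import os
import tempfile

# Build a miniature Hive-style partition layout mirroring the one above:
#   <root>/ym=YYYYMM/ymd=YYYYMMDD/eventName=.../file_01.parquet
root = tempfile.mkdtemp()
for ym in ("202404", "202405"):
    for day in ("01", "02"):
        for event in ("event_name_0", "event_name_1"):
            d = os.path.join(root, f"ym={ym}", f"ymd={ym}{day}", f"eventName={event}")
            os.makedirs(d)
            open(os.path.join(d, "file_01.parquet"), "w").close()

# A full 'listing leaf files' pass has to visit every file under the root ...
all_leaves = [os.path.join(dp, f) for dp, _, fs in os.walk(root) for f in fs]

# ... whereas a listing scoped to one ym=/ymd= prefix touches far fewer files.
scoped = os.path.join(root, "ym=202405", "ymd=20240501")
scoped_leaves = [os.path.join(dp, f) for dp, _, fs in os.walk(scoped) for f in fs]

print(len(all_leaves), len(scoped_leaves))  # 8 leaves total vs 2 in the scoped prefix
```

With years of data and an eventName level multiplying the directory count, the real listing is orders of magnitude larger than this toy layout.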
My Spark version is 3.2.4.
I found a question on Stack Overflow with the same problem from 2016, but it is not very helpful in my case: Spark lists all leaf node even in partitioned data
