I have a dataset on HDFS that is partitioned on 3 columns: ym, ymd, and eventName. When I read the dataset from the root path, Spark creates jobs to list all leaf files in the directory. The problem is that this does not happen when I read another dataset that is partitioned on only 2 columns, ym and ymd.
```
hdfs://HDFS_CLUSTER/
  ym=202404/
    ...
  ym=202405/
    ...
    ymd=20240501/
      eventName=event_name_0/
        file_01.parquet
        file_02.parquet
        ...
```

My code looks like this:
```python
# two ways to read the dataset; Spark creates a 'listing leaf files' job in both cases
spark.read.parquet("hdfs://HDFS_CLUSTER/dataset")
spark.read.parquet("hdfs://HDFS_CLUSTER/dataset/ym=202405/ymd=20240501")
```

And the Spark jobs are created as below:
I have tried reading with a pre-defined schema and with the option mergeSchema: true, but the result is the same. My attempt:
```python
schema = ...  # schema definition
spark.read.schema(schema).option('mergeSchema', 'true').parquet("hdfs://HDFS_CLUSTER/dataset")
```

The 'listing leaf files' job takes about 7s to finish; my dataset contains data going back to 2021. Then reading the dataset with spark.read.parquet("hdfs://HDFS_CLUSTER/dataset") never finishes (I got an exception related to running out of Java heap space).
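To make the cost of that listing concrete, here is a small stand-alone sketch (plain Python, with local directories standing in for HDFS; the counts and names are made up for illustration) showing that a recursive listing from the root must touch every leaf file, while a listing scoped to one ym=/ymd= prefix touches only a few:

```python
import os
import tempfile

# Build a miniature Hive-style partition layout mirroring the one above:
#   <root>/ym=YYYYMM/ymd=YYYYMMDD/eventName=.../file_01.parquet
root = tempfile.mkdtemp()
for ym in ("202404", "202405"):
    for day in ("01", "02"):
        for event in ("event_name_0", "event_name_1"):
            d = os.path.join(root, f"ym={ym}", f"ymd={ym}{day}", f"eventName={event}")
            os.makedirs(d)
            open(os.path.join(d, "file_01.parquet"), "w").close()

# A full 'listing leaf files' pass has to visit every file under the root ...
all_leaves = [os.path.join(dp, f) for dp, _, fs in os.walk(root) for f in fs]

# ... whereas a listing scoped to one ym=/ymd= prefix touches far fewer files.
scoped = os.path.join(root, "ym=202405", "ymd=20240501")
scoped_leaves = [os.path.join(dp, f) for dp, _, fs in os.walk(scoped) for f in fs]

print(len(all_leaves), len(scoped_leaves))  # 8 leaves total vs 2 in the scoped prefix
```

With years of data and an eventName level multiplying the directory count, the real listing is orders of magnitude larger than this toy layout.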
My Spark version is 3.2.4.
I found a question on Stack Overflow with the same problem from 2016, but it is not very helpful in my case: Spark lists all leaf node even in partitioned data
