
Spark executes "Listing leaf files and directories" while reading a partitioned dataset on HDFS


I have a dataset on HDFS partitioned on 3 columns: ym, ymd, and eventName. When I read the dataset using the root path, Spark creates jobs to list all leaf files in the directory. The problem is that this does not happen when I read another dataset which is partitioned on only 2 columns, ym and ymd.

hdfs://HDFS_CLUSTER/
    ym=202404/
        ...
    ym=202405/
        ...
        ymd=20240501/
            eventName=event_name_0/
                file_01.parquet
                file_02.parquet
                ...

My code looks like this:

# two ways to read the dataset; Spark creates a 'listing leaf files' job in both cases
spark.read.parquet("hdfs://HDFS_CLUSTER/dataset")
spark.read.parquet("hdfs://HDFS_CLUSTER/dataset/ym=202405/ymd=20240501")

And the Spark jobs are created as below:

[screenshot: Spark Jobs]

I have tried reading with a pre-defined schema and with the option mergeSchema set to true, but the result is the same. My attempt:

schema = ...  # schema definition
spark.read.schema(schema).option('mergeSchema', 'true').parquet("hdfs://HDFS_CLUSTER/dataset")

Each 'listing leaf files' job takes about 7 s to finish, and my dataset contains data going back to 2021. Reading the whole dataset with spark.read.parquet("hdfs://HDFS_CLUSTER/dataset") then never finishes (I got an exception about running out of Java heap space).
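For context (my sketch, not part of the original post): the usual mitigations are to read only the partitions you need while keeping the partition columns via the `basePath` option, and to let Spark distribute partition discovery across the cluster. The session name `spark` and the threshold value below are assumptions, not the asker's settings:

```python
# Sketch of common mitigations; assumes an existing SparkSession named `spark`.

# 1) Read a single partition directly, but keep the partition columns
#    (ym, ymd, eventName) in the schema by pointing basePath at the root.
df = (
    spark.read
    .option("basePath", "hdfs://HDFS_CLUSTER/dataset")
    .parquet("hdfs://HDFS_CLUSTER/dataset/ym=202405/ymd=20240501")
)

# 2) Lower the threshold above which Spark runs file listing as a
#    distributed job instead of on the driver (default is 32 paths).
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
```

This avoids the driver holding the file-status list for every leaf directory since 2021, which is a common cause of the heap-space exception described above.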

My Spark version is 3.2.4.
I found a question on Stack Overflow from 2016 with the same problem, but it isn't much help in my case: Spark lists all leaf node even in partitioned data


Viewing all articles
Browse latest Browse all 23247

Trending Articles